How to make a multi-group dotplot for single-cell RNAseq data

Jan 16, 2024 11 min read bioinformatics, bioconductor, single-cell

Dotplots are very popular for visualizing single-cell RNAseq data. In essence, the dot size represents the percentage of cells that are positive for that gene; the color intensity represents the average gene expression of that gene in a cell type.

It is easy to plot one using Seurat::dotplot or Sccustomize::clustered_dotplot. However, when you have multiple groups/conditions in your data and you want to visualize it by groups, it is not that straightforward.

However, if you understand that the underlying data are just the percentages of positive cells and the average expression values and the dotplot is essentially just a heatmap, you can write R code to visualize the data in anyway you want.

Let’s use this data from https://satijalab.org/seurat/articles/integration_introduction

The object contains data from human PBMC from two conditions, interferon-stimulated and control cells (stored in the stim column in the object metadata). We will aim to integrate the two conditions together, so that we can jointly identify cell subpopulations across datasets, and then explore how each group differs across conditions

library(Seurat)
library(SeuratData)
library(patchwork)
library(harmony)
library(dplyr)
# install dataset
InstallData("ifnb")

# load dataset
ifnb <- LoadData("ifnb")
ifnb<- UpdateSeuratObject(ifnb)

ifnb

#> An object of class Seurat 
#> 14053 features across 13999 samples within 1 assay 
#> Active assay: RNA (14053 features, 0 variable features)
#>  2 layers present: counts, data

Two conditions: control and stimulated.

table(ifnb$stim)

#> 
#> CTRL STIM 
#> 6548 7451

routine processing

ifnb<- ifnb %>%
  NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
  FindVariableFeatures( selection.method = "vst", nfeatures = 2000) %>%
  ScaleData() %>%
  RunPCA() %>%
  RunHarmony(group.by.vars = "stim", dims.use = 1:30) %>%
  RunUMAP(reduction = "harmony", dims = 1:30) %>%
  FindNeighbors(reduction = "harmony", dims = 1:30) %>% 
  FindClusters(resolution = 0.6)

#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 13999
#> Number of edges: 519190
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8855
#> Number of communities: 14
#> Elapsed time: 2 seconds

Let’s take a look at the UMAP:

# Visualization
DimPlot(ifnb, reduction = "umap", group.by = c("stim", "seurat_annotations"))

Let’s make a dotplot using Seurat

Idents(ifnb)<- ifnb$seurat_annotations
# NEEDS TO BE FIXED AND SET ORDER CORRECTLY
Idents(ifnb) <- factor(Idents(ifnb), 
                       levels = c("pDC", "Eryth", "Mk", "DC", 
                                                "CD14 Mono", "CD16 Mono","B Activated", 
                                                "B", "CD8 T", "NK", "T activated", 
                                                "CD4 Naive T", "CD4 Memory T"))

markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", 
                     "NKG7", "CCL5", "CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", 
                     "FCGR3A", "VMO1", "CCL2", "S100A9", "HLA-DQA1", "GPR183", 
                     "PPBP", "GNG11", "HBA2", "HBB", "TSPAN13", "IL3RA", "IGJ", 
                     "PRSS57")

DotPlot(ifnb, features = markers.to.plot, cols = c("blue", "red"), 
        dot.scale = 8, split.by = "stim") +
        RotatedAxis()

We can also use scCusstomize to cluster the dotplots, but you can not split by condition.

scCustomize::Clustered_DotPlot(ifnb, features = markers.to.plot,
                               plot_km_elbow = FALSE)

You can rotate the x and y axis:

scCustomize::Clustered_DotPlot(ifnb, features = markers.to.plot,
                               plot_km_elbow = FALSE, flip = TRUE)

What if I want to split the dotplot by the condition?

Customized multi-group dotplot

We need to get the percentage of positive cells and average expression by group.

For a single gene, put the groups into multiple rows, and each column is a cell type.

# group1 is the cell type/cluster annotation 
# group2 is any condition you want to further group, in this case, the stim 

GetMatrixFromSeuratByGroupSingle<- function(obj, feature, group1, group2){
  if (length(feature) != 1){
          stop("please only provide only one gene name")
  }
  exp_mat<- obj@assays$RNA@data[feature, ,drop=FALSE]
  count_mat<- obj@assays$RNA@counts[feature,,drop=FALSE ]
  
  meta<- obj@meta.data %>%
  tibble::rownames_to_column(var = "cell")
        
  # get the average expression matrix
  exp_df<- as.matrix(exp_mat) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column(var="gene") %>%
    tidyr::pivot_longer(!gene, names_to = "cell", values_to = "expression") %>%
    left_join(meta) %>%
    group_by(gene,{{group1}}, {{group2}}) %>%
    summarise(average_expression = mean(expression)) %>%
    tidyr::pivot_wider(names_from = {{group1}}, 
                       values_from= average_expression) 
  
  exp_mat<- exp_df[, -c(1,2)] %>% as.matrix()
  rownames(exp_mat)<- exp_df %>% pull({{group2}})
  
  # get the percentage positive cell matrix
  count_df<- as.matrix(count_mat) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column(var="gene") %>%
    tidyr::pivot_longer(!gene, names_to = "cell", values_to = "count") %>%
    left_join(meta) %>%
    group_by(gene, {{group1}}, {{group2}}) %>%
    summarise(percentage = mean(count >0)) %>%
    tidyr::pivot_wider(names_from = {{group1}}, 
                       values_from= percentage) 

  percent_mat<- count_df[, -c(1,2)] %>% as.matrix()
  rownames(percent_mat)<- count_df %>% pull({{group2}})
  
  if (!identical(dim(exp_mat), dim(percent_mat))) {
    stop("the dimension of the two matrice should be the same!")
  }
  
  if(! all.equal(colnames(exp_mat), colnames(percent_mat))) {
    stop("column names of the two matrice should be the same!")
  }
  
  if(! all.equal(rownames(exp_mat), rownames(percent_mat))) {
    stop("column names of the two matrice should be the same!")
  }
  return(list(exp_mat = exp_mat, percent_mat = percent_mat))
}

Let’s get the matrices for one gene

mat<- GetMatrixFromSeuratByGroupSingle(obj= ifnb, 
                                 feature = "ISG15", 
                                 seurat_annotations,
                                 stim)

# take a look at the matrices
mat$exp_mat

#>      CD14 Mono CD4 Naive T CD4 Memory T CD16 Mono         B     CD8 T
#> CTRL 0.7261719   0.3808203    0.4733928  1.117586 0.5495651 0.7284956
#> STIM 6.0702937   3.8087751    4.0137093  5.712665 4.3502870 4.0260494
#>      T activated        NK        DC B Activated        Mk       pDC     Eryth
#> CTRL   0.8074327 0.7889602 0.6561191   0.3800281 0.5941432 0.4085024 0.2929348
#> STIM   4.1129324 4.5385409 5.7027258   4.0121762 4.0114610 4.5247511 3.9331893

mat$percent_mat

#>      CD14 Mono CD4 Naive T CD4 Memory T CD16 Mono         B     CD8 T
#> CTRL  0.323702   0.1666667    0.2072177  0.495069 0.2334152 0.3011364
#> STIM  1.000000   0.9927916    0.9900332  1.000000 0.9982487 0.9891775
#>      T activated        NK        DC B Activated        Mk       pDC    Eryth
#> CTRL   0.3233333 0.3154362 0.3527132   0.1891892 0.2608696 0.2156863 0.173913
#> STIM   0.9759760 0.9875389 1.0000000   1.0000000 0.9752066 1.0000000 1.000000

Two matrices:

the average expression for each cell type per condition
the percentage of ISG15 positive cells for each cell type per condition

Now, Let’s visualize it using ComplexHeatmap. See my previous post too.

Always explore the data range before you decide how to map the data values to colors.

library(ComplexHeatmap)

quantile(mat$exp_mat, c(0.1, 0.5, 0.8, 0.9))

#>       10%       50%       80%       90% 
#> 0.3946614 2.4631808 4.3502870 5.1206334

col_fun<-  circlize::colorRamp2(c(0, 2, 5), c("#FDE725FF", "#238A8DFF", "#440154FF"))

0 will be mapped to #FDE725FF, 2 will be mapped to #238A8DFF and 5 will be mapped to #440154FF. The value in-between will be linearlly interpolated to get corresponding colors

Use the layer_fun to decide the size of the dots. Within the grid.circle, we specify the radius r= sqrt(pindex(mat$percent_mat, i, j)) of the circle to be the square root of the percentage, so the area size of the dots correspond to the percentage.

layer_fun = function(j, i, x, y, w, h, fill){
    grid.rect(x = x, y = y, width = w, height = h, 
              gp = gpar(col = "gray", fill = NA))
    grid.circle(x=x,y=y,r= sqrt(pindex(mat$percent_mat, i, j)) * unit(4, "mm"),
                gp = gpar(fill = col_fun(pindex(mat$exp_mat, i, j)), col = NA))}
  
hp<- Heatmap(mat$exp_mat,
             heatmap_legend_param=list(title= "expression"),
             column_title = "ISG15", 
             col=col_fun,
             rect_gp = gpar(type = "none"),
             layer_fun = layer_fun,
             row_names_gp = gpar(fontsize = 8),
             border = "black",
             cluster_rows = FALSE, 
             cluster_columns = FALSE,
             row_names_side  = "left")

make the legend for the dot size. Make sure the size is the same as the in the layer_fun which is unit(4, "mm"). You can change the size as you want.

lgd_list = list(
    Legend( labels = c(0, 10, 25, 50, 75), title = "percentage",
            graphics = list(
              function(x, y, w, h) grid.circle(x = x, y = y, r = 0  * unit(4, "mm"),
                                               gp = gpar(fill = "black")),
              function(x, y, w, h) grid.circle(x = x, y = y, r = sqrt(0.1)  * unit(4, "mm"),
                                               gp = gpar(fill = "black")),
              function(x, y, w, h) grid.circle(x = x, y = y, r = sqrt(0.25) * unit(4, "mm"),
                                               gp = gpar(fill = "black")),
              function(x, y, w, h) grid.circle(x = x, y = y, r = sqrt(0.5) * unit(4, "mm"),
                                               gp = gpar(fill = "black")),
              function(x, y, w, h) grid.circle(x = x, y = y, r = sqrt(0.75) * unit(4, "mm"),
                                               gp = gpar(fill = "black"))),
            row_gap = unit(2.5, "mm")
            ))

draw(hp, annotation_legend_list = lgd_list, ht_gap = unit(1, "cm"),
     annotation_legend_side = "right")

plot Multiple genes across multiple groups/conditions

Let’s change the function a little. The major difference is this line: tidyr::pivot_wider(names_from = c({{group1}}, {{group2}}), values_from= average_expression, names_sep="|")

So the data will be wrangled as: rows are genes, columns are cell_type|condition

GetMatrixFromSeuratByGroupMulti<- function(obj, features, group1, group2){
  exp_mat<- obj@assays$RNA@data[features, ,drop=FALSE]
  count_mat<- obj@assays$RNA@counts[features,,drop=FALSE ]
  
  meta<- obj@meta.data %>%
  tibble::rownames_to_column(var = "cell")
  
  # get the average expression matrix        
  exp_df<- as.matrix(exp_mat) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column(var="gene") %>%
    tidyr::pivot_longer(!gene, names_to = "cell", values_to = "expression") %>%
    left_join(meta) %>%
    group_by(gene,{{group1}}, {{group2}}) %>%
    summarise(average_expression = mean(expression)) %>%
    # the trick is to make the data wider in columns: cell_type|group
    tidyr::pivot_wider(names_from = c({{group1}}, {{group2}}), 
                       values_from= average_expression,
                       names_sep="|") 
  
  # convert to a matrix
  exp_mat<- exp_df[, -1] %>% as.matrix()
  rownames(exp_mat)<- exp_df$gene
  
  # get percentage of positive cells matrix
  count_df<- as.matrix(count_mat) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column(var="gene") %>%
    tidyr::pivot_longer(!gene, names_to = "cell", values_to = "count") %>%
    left_join(meta) %>%
    group_by(gene, {{group1}}, {{group2}}) %>%
    summarise(percentage = mean(count >0)) %>%
    tidyr::pivot_wider(names_from = c({{group1}}, {{group2}}), 
                       values_from= percentage,
                       names_sep="|")

  percent_mat<- count_df[, -1] %>% as.matrix()
  rownames(percent_mat)<- count_df$gene
  
  if (!identical(dim(exp_mat), dim(percent_mat))) {
    stop("the dimension of the two matrice should be the same!")
  }
  
  if(! all.equal(colnames(exp_mat), colnames(percent_mat))) {
    stop("column names of the two matrice should be the same!")
  }
  
  if(! all.equal(rownames(exp_mat), rownames(percent_mat))) {
    stop("column names of the two matrice should be the same!")
  }
  return(list(exp_mat = exp_mat, percent_mat = percent_mat))
}

Let’s take a look at the matrices:

mat2<- GetMatrixFromSeuratByGroupMulti(obj= ifnb, 
                                       features = c("ISG15", "CXCL10", "LYZ", "CXCL9"),
                                       seurat_annotations, stim)

mat2$exp_mat

#>        CD14 Mono|CTRL CD14 Mono|STIM CD4 Naive T|CTRL CD4 Naive T|STIM
#> CXCL10    0.087345121      5.0736492      0.002201396       0.61035184
#> CXCL9     0.003388209      0.1206664      0.000000000       0.00000000
#> ISG15     0.726171852      6.0702937      0.380820344       3.80877512
#> LYZ       2.414988107      2.6774910      0.041199264       0.04494943
#>        CD4 Memory T|CTRL CD4 Memory T|STIM CD16 Mono|CTRL CD16 Mono|STIM
#> CXCL10       0.004476775       0.668992851      0.7766469     5.00934744
#> CXCL9        0.000000000       0.002439572      0.0000000     0.02062243
#> ISG15        0.473392848       4.013709315      1.1175865     5.71266540
#> LYZ          0.033556360       0.064249244      1.3894338     0.98728252
#>            B|CTRL     B|STIM CD8 T|CTRL  CD8 T|STIM T activated|CTRL
#> CXCL10 0.02465307 2.40598430 0.00000000 0.509013005       0.00657299
#> CXCL9  0.00000000 0.01577367 0.00000000 0.004820181       0.00000000
#> ISG15  0.54956508 4.35028697 0.72849556 4.026049357       0.80743267
#> LYZ    0.04108924 0.04744420 0.03553156 0.041455107       0.04261386
#>        T activated|STIM    NK|CTRL   NK|STIM     DC|CTRL   DC|STIM
#> CXCL10       1.16903785 0.00000000 0.4221097 0.028490195 4.7406904
#> CXCL9        0.00000000 0.00000000 0.0000000 0.006332865 0.4458563
#> ISG15        4.11293242 0.78896021 4.5385409 0.656119064 5.7027258
#> LYZ          0.01990898 0.01638341 0.0857032 3.773146881 3.2401631
#>        B Activated|CTRL B Activated|STIM    Mk|CTRL    Mk|STIM   pDC|CTRL
#> CXCL10       0.01261649       2.26825925 0.07612572 1.31048801 0.04369413
#> CXCL9        0.00000000       0.02436157 0.00000000 0.01622667 0.00000000
#> ISG15        0.38002811       4.01217616 0.59414320 4.01146096 0.40850237
#> LYZ          0.06561868       0.07417401 0.72371186 0.32235497 0.24321177
#>          pDC|STIM Eryth|CTRL Eryth|STIM
#> CXCL10 2.25108405  0.0000000  1.6256344
#> CXCL9  0.06612518  0.0000000  0.0000000
#> ISG15  4.52475112  0.2929348  3.9331893
#> LYZ    0.02124968  0.6171084  0.3471586

mat2$percent_mat

#>        CD14 Mono|CTRL CD14 Mono|STIM CD4 Naive T|CTRL CD4 Naive T|STIM
#> CXCL10    0.035214447     0.98369818      0.001022495       0.25557012
#> CXCL9     0.001805869     0.07033069      0.000000000       0.00000000
#> ISG15     0.323702032     1.00000000      0.166666667       0.99279161
#> LYZ       0.817155756     0.88868188      0.019427403       0.02162516
#>        CD4 Memory T|CTRL CD4 Memory T|STIM CD16 Mono|CTRL CD16 Mono|STIM
#> CXCL10       0.002328289        0.27574751      0.2761341     0.98510242
#> CXCL9        0.000000000        0.00110742      0.0000000     0.01489758
#> ISG15        0.207217695        0.99003322      0.4950690     1.00000000
#> LYZ          0.016298021        0.02990033      0.6114398     0.47858473
#>            B|CTRL      B|STIM CD8 T|CTRL  CD8 T|STIM T activated|CTRL
#> CXCL10 0.00982801 0.651488616 0.00000000 0.225108225      0.003333333
#> CXCL9  0.00000000 0.007005254 0.00000000 0.002164502      0.000000000
#> ISG15  0.23341523 0.998248687 0.30113636 0.989177489      0.323333333
#> LYZ    0.01965602 0.022767075 0.01704545 0.019480519      0.020000000
#>        T activated|STIM     NK|CTRL    NK|STIM     DC|CTRL   DC|STIM
#> CXCL10      0.396396396 0.000000000 0.17445483 0.011627907 0.9766355
#> CXCL9       0.000000000 0.000000000 0.00000000 0.003875969 0.2570093
#> ISG15       0.975975976 0.315436242 0.98753894 0.352713178 1.0000000
#> LYZ         0.009009009 0.006711409 0.03738318 0.968992248 0.9626168
#>        B Activated|CTRL B Activated|STIM    Mk|CTRL     Mk|STIM   pDC|CTRL
#> CXCL10      0.005405405       0.60591133 0.02608696 0.371900826 0.01960784
#> CXCL9       0.000000000       0.01477833 0.00000000 0.008264463 0.00000000
#> ISG15       0.189189189       1.00000000 0.26086957 0.975206612 0.21568627
#> LYZ         0.037837838       0.03940887 0.26956522 0.123966942 0.11764706
#>          pDC|STIM Eryth|CTRL Eryth|STIM
#> CXCL10 0.70370370  0.0000000     0.4375
#> CXCL9  0.03703704  0.0000000     0.0000
#> ISG15  1.00000000  0.1739130     1.0000
#> LYZ    0.01234568  0.2608696     0.1875

Once you have the matrices, you can plot the dotplot as you want.

# change the layer function, because you need mat2$percent_mat for the percentages
layer_fun2 = function(j, i, x, y, w, h, fill){
    grid.rect(x = x, y = y, width = w, height = h, 
              gp = gpar(col = "gray", fill = NA))
    grid.circle(x=x,y=y,
                r= sqrt(pindex(mat2$percent_mat, i, j)) * unit(4, "mm"),
                gp = gpar(fill = col_fun(pindex(mat2$exp_mat, i, j)), col = NA))}

hp2<- Heatmap(mat2$exp_mat,
             heatmap_legend_param=list(title= "expression"),
             column_title = "genes", 
             col=col_fun,
             rect_gp = gpar(type = "none"),
             layer_fun = layer_fun2,
             row_names_gp = gpar(fontsize = 8),
             border = "black",
             cluster_rows = FALSE, 
             cluster_columns = FALSE,
             row_names_side  = "left")


draw(hp2, annotation_legend_list = lgd_list, ht_gap = unit(1, "cm"),
     annotation_legend_side = "right")

If you do want to cluster the columns and rows

hp3<- Heatmap(mat2$exp_mat,
             heatmap_legend_param=list(title= "expression"),
             column_title = "genes", 
             col=col_fun,
             rect_gp = gpar(type = "none"),
             layer_fun = layer_fun2,
             row_names_gp = gpar(fontsize = 8),
             border = "black",
             cluster_rows = TRUE, 
             cluster_columns = TRUE,
             row_names_side  = "left")


draw(hp3, annotation_legend_list = lgd_list, ht_gap = unit(1, "cm"),
     annotation_legend_side = "right")

With the matrices available, you can do subsetting of the matrices, arrange the rows and columns as you want and even add annotation bars. As you see, once you master the basics, you can plot anything as you want without relying on pre-built packages.

We can wrap the above code into a function for easier usage, but I will leave it as a homework for the readers.

Of course, you do not want to re-invent the wheels, use a well-maintained package when possible, build your own wheel when necessary.

How to make a multi-group dotplot for single-cell RNAseq data

Customized multi-group dotplot

plot Multiple genes across multiple groups/conditions

Further reading

Related