To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.
I recently was interviewed by the SEQanswers forum on single-cell RNAseq analysis.
In your opinion, what is the most challenging aspect of single-cell analysis? Every single-cell dataset is unique in terms of data quality and QC has to be carried out in a dataset specific manner. Cell annotation is still one of the most challenging steps.
I want to curate a public scRNAseq dataset from this paper Single-cell analyses reveal key immune cell subsets associated with response to PD-L1 blockade in triple-negative breast cancer
ffq I first tried ffq, but it gave me errors.
ffq fetches metadata information from the following databases:
GEO: Gene Expression Omnibus, SRA: Sequence Read Archive, EMBL-EBI: European Molecular BIology Laboratory’s European BIoinformatics Institute, DDBJ: DNA Data Bank of Japan, NIH Biosample: Biological source materials used in experimental assays, ENCODE: The Encyclopedia of DNA Elements.
I am interested in learning more on matrix factorization and its application in scRNAseq data. I want to shout out to this paper: Enter the Matrix: Factorization Uncovers Knowledge from Omics by Elana J. Fertig group.
A matrix is decomposed to two matrices: the amplitude matrix and the pattern matrix. You can then do all sorts of things with the decomposed matrices. Single cell matrix is no special, one can use the matrix factorization techniques to derive interesting biological insights.
Dotplot is a nice way to visualize scRNAseq expression data across clusters. It gives information (by color) for the average expression level across cells within the cluster and the percentage (by size of the dot) of the cells express that gene within the cluster.
Seurat has a nice function for that. However, it can not do the clustering for the rows and columns. David McGaughey has written a blog post using ggplot2 and ggtree from Guangchuang Yu.
When it comes to make a heatmap, ComplexHeatmap by Zuguang Gu is my favorite. Check it out! You will be amazed on how flexible it is and the documentation is in top niche.
For Single-cell RNAseq, Seurat provides a DoHeatmap function using ggplot2. There are two limitations:
when your genes are not in the top variable gene list, the scale.data will not have that gene and DoHeatmap will drop those genes.
It is very common to see in the scRNAseq papers that the authors compare cell type abundance across groups (e.g., treatment vs control, responder vs non-responder).
Let’s create some dummy data.
library(tidyverse) set.seed(23) # we have 6 treatment samples and 6 control samples, 3 clusters A,B,C # but in the treatment samples, cluster C is absent (0 cells) in sample7 sample_id<- c(paste0("sample", 1:6, "_control", rep(c("_A","_B","_C"),each = 6)), paste0("sample", 8:12, "_treatment", rep(c("_A","_B", "_C"), each = 5))) sample_id<- c(sample_id, "sample7_treatment_A", "sample7_treatment_B") cell_id<- paste0("cell", 1:20000) cell_df<- tibble::tibble(sample_id = sample(sample_id, size = length(cell_id), replace = TRUE), cell_id = cell_id) %>% tidyr::separate(sample_id, into = c("sample_id", "group", "clusterid"), sep= "") cell_num<- cell_df %>% group_by(group, cluster_id, sample_id)%>% summarize(n=n()) cell_num ## # A tibble: 35 x 4 ## # Groups: group, cluster_id [6] ## group cluster_id sample_id n ## <chr> <chr> <chr> <int> ## 1 control A sample1 551 ## 2 control A sample2 546 ## 3 control A sample3 544 ## 4 control A sample4 585 ## 5 control A sample5 588 ## 6 control A sample6 542 ## 7 control B sample1 550 ## 8 control B sample2 562 ## 9 control B sample3 574 ## 10 control B sample4 563 ## # … with 25 more rows total_cells<- cell_df %>% group_by(sample_id) %>% summarise(total = n()) total_cells ## # A tibble: 12 x 2 ## sample_id total ## <chr> <int> ## 1 sample1 1673 ## 2 sample10 1713 ## 3 sample11 1691 ## 4 sample12 1696 ## 5 sample2 1633 ## 6 sample3 1700 ## 7 sample4 1711 ## 8 sample5 1768 ## 9 sample6 1727 ## 10 sample7 1225 ## 11 sample8 1720 ## 12 sample9 1743 join the two dataframe to get percentage of cells per cluster per sample