Bioinformatics

Be careful when left_join tables with duplicated rows

This is going to be a really short blog post. I recently found that if I join two tables with one of the tables having duplicated rows, the final joined table also contains the duplicated rows. It could be the expected behavior for others but I want to make a note here for myself. library(tidyverse) df1<- tibble(key = c("A", "B", "C", "D", "E"), value = 1:5) df1 ## # A tibble: 5 x 2 ## key value ## <chr> <int> ## 1 A 1 ## 2 B 2 ## 3 C 3 ## 4 D 4 ## 5 E 5 dataframe 2 has two identical rows for B.

Matrix Factorization for single-cell RNAseq data

I am interested in learning more on matrix factorization and its application in scRNAseq data. I want to shout out to this paper: Enter the Matrix: Factorization Uncovers Knowledge from Omics by Elana J. Fertig group. A matrix is decomposed to two matrices: the amplitude matrix and the pattern matrix. You can then do all sorts of things with the decomposed matrices. Single cell matrix is no special, one can use the matrix factorization techniques to derive interesting biological insights.

How to test if two distributions are different

I asked this question on Twitter: what test to test if two distributions are different? I am aware of KS test. When n is large (which is common in genomic studies), the p-value is always significant. better to test against an effect size? how to do it in this context? In genomics studies, it is very common to have large N (e.g., the number of introns, promoters in the genome, number of cells in the single-cell studies).

clustered dotplot for single-cell RNAseq

Dotplot is a nice way to visualize scRNAseq expression data across clusters. It gives information (by color) for the average expression level across cells within the cluster and the percentage (by size of the dot) of the cells express that gene within the cluster. Seurat has a nice function for that. However, it can not do the clustering for the rows and columns. David McGaughey has written a blog post using ggplot2 and ggtree from Guangchuang Yu.

Review 2020

It is the end of the year again and it is a good time to review my 2020. I wrote one for 2019. It has been a tough year for many of us. It is the same for me. I switched my job from Harvard FAS informatics to Department of Data Science at Dana-Farber Cancer Institute during the shutdown because of COVID19 at the end of March and has been working from home since then.

Enhancement of scRNAseq heatmap using complexheatmap

When it comes to make a heatmap, ComplexHeatmap by Zuguang Gu is my favorite. Check it out! You will be amazed on how flexible it is and the documentation is in top niche. For Single-cell RNAseq, Seurat provides a DoHeatmap function using ggplot2. There are two limitations: when your genes are not in the top variable gene list, the scale.data will not have that gene and DoHeatmap will drop those genes.

dplyr::count misses factor levels: a case in comparing scRNAseq cell type abundance

It is very common to see in the scRNAseq papers that the authors compare cell type abundance across groups (e.g., treatment vs control, responder vs non-responder). Let’s create some dummy data. library(tidyverse) set.seed(23) # we have 6 treatment samples and 6 control samples, 3 clusters A,B,C # but in the treatment samples, cluster C is absent (0 cells) in sample7 sample_id<- c(paste0("sample", 1:6, "_control", rep(c("_A","_B","_C"),each = 6)), paste0("sample", 8:12, "_treatment", rep(c("_A","_B", "_C"), each = 5))) sample_id<- c(sample_id, "sample7_treatment_A", "sample7_treatment_B") cell_id<- paste0("cell", 1:20000) cell_df<- tibble::tibble(sample_id = sample(sample_id, size = length(cell_id), replace = TRUE), cell_id = cell_id) %>% tidyr::separate(sample_id, into = c("sample_id", "group", "clusterid"), sep= "") cell_num<- cell_df %>% group_by(group, cluster_id, sample_id)%>% summarize(n=n()) cell_num ## # A tibble: 35 x 4 ## # Groups: group, cluster_id [6] ## group cluster_id sample_id n ## <chr> <chr> <chr> <int> ## 1 control A sample1 551 ## 2 control A sample2 546 ## 3 control A sample3 544 ## 4 control A sample4 585 ## 5 control A sample5 588 ## 6 control A sample6 542 ## 7 control B sample1 550 ## 8 control B sample2 562 ## 9 control B sample3 574 ## 10 control B sample4 563 ## # … with 25 more rows total_cells<- cell_df %>% group_by(sample_id) %>% summarise(total = n()) total_cells ## # A tibble: 12 x 2 ## sample_id total ## <chr> <int> ## 1 sample1 1673 ## 2 sample10 1713 ## 3 sample11 1691 ## 4 sample12 1696 ## 5 sample2 1633 ## 6 sample3 1700 ## 7 sample4 1711 ## 8 sample5 1768 ## 9 sample6 1727 ## 10 sample7 1225 ## 11 sample8 1720 ## 12 sample9 1743 join the two dataframe to get percentage of cells per cluster per sample

setting up Travis CI for github repos

I wanted to set up Travis CI long time ago. Finally, I have a chance to do so. There are already many posts on how to do it. I write it down here for my own future reference. Thanks Tao Liu for the instructions. I have been learning a lot from him in collabrating with the MAESTRO project. read the official documentation https://docs.travis-ci.com/user/tutorial/ Jean Fan’s blog post https://jef.works/blog/2019/02/17/automate-testing-of-your-R-package/ R specific https://docs.

build your own singularity image

I was using the tidyverse rocker image on HPC by singularity pull. see my previous post. Everything was OK until I encountered problems installing jpeg and Cairo R packages. Later, I also had an error installing scRepertoire dependency gsl. It turns out I have to install debian packages inside the container: $ apt update && apt install -y –no-install-recommends libjpeg62-turbo-dev zlib1g-dev libpng-dev \ && apt install -y –no-install-recommends libx11-dev libcairo2-dev libxt-dev \ && apt install -y libgsl-dev However, singularity file system is read-only.

customize FeaturePlot in Seurat for multi-condition comparisons using patchwork

Seurat is great for scRNAseq analysis and it provides many easy-to-use ggplot2 wrappers for visualization. However, this brings the cost of flexibility. For example, In FeaturePlot, one can specify multiple genes and also split.by to further split to multiple the conditions in the meta.data. If split.by is not NULL, the ncol is ignored so you can not arrange the grid. This is best to understand with an example. library(dplyr) library(Seurat) library(patchwork) library(ggplot2) # Load the PBMC dataset pbmc.