Fine tune the best clustering resolution for scRNAseq data: trying out callback

Context and Problem: In scRNA-seq, each cell is sequenced individually, allowing gene expression to be analyzed at the single-cell level. This provides a wealth of information about cellular identities and states. However, the high dimensionality of the data (thousands of genes) and its technical noise make it challenging to cluster the cells accurately. Over-clustering is one such challenge: cells that are biologically similar are split into distinct clusters.

Do you really understand log2Fold change in single-cell RNAseq data?

In single-cell RNA-seq analysis, there is a step to find the marker genes for each cluster. The output from Seurat FindAllMarkers has a column called avg_log2FC. It is the log2 fold change in gene expression between cluster x and all other clusters. How is that calculated? In this tweet thread, Lior Pachter said that there was a discrepancy in the logFC values between Seurat and Scanpy: Actually, both Scanpy and Seurat calculate it wrong.
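To see why two tools can report different fold changes for the same gene, here is a minimal base R sketch. The counts are made up, and the two options below are generic illustrations of "average then back-transform" versus "back-transform then average"; they are assumptions for illustration, not the exact formulas used by Seurat or Scanpy.

```r
# Toy expression values for one gene in two groups of cells (made-up numbers).
cluster_x <- c(0, 0, 1, 2, 10)
others    <- c(0, 1, 1, 1, 2)

# log1p-transform the values, as is common after normalization.
lx <- log1p(cluster_x)
lo <- log1p(others)

# Option A: average on the log scale, back-transform, then take the log2 ratio.
fc_mean_of_logs <- log2(expm1(mean(lx)) / expm1(mean(lo)))

# Option B: back-transform each cell, average, add a pseudocount of 1,
# then take the log2 ratio.
fc_log_of_means <- log2((mean(expm1(lx)) + 1) / (mean(expm1(lo)) + 1))

# The two numbers differ even though the input data are identical.
print(c(mean_of_logs = fc_mean_of_logs, log_of_means = fc_log_of_means))
```

The order of the averaging and the log transform, together with the choice of pseudocount, changes the result, so it is worth checking which definition a tool uses before comparing fold changes across packages.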

Part 3 Centered log ratio (CLR) normalization for CITE-seq protein count data

Following my last blog post, download the CITE-seq protein and RNA count data here.

    library(Seurat)
    library(ggplot2)
    library(dplyr)
    pbmc <- readRDS("~/blog_data/CITEseq/pbmc1k_adt.rds")

How to normalize protein ADT count data? Seurat uses the centered log ratio (CLR) to normalize protein count data. From the Seurat GitHub page:

    clr_function <- function(x) {
      return(log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x)))))
    }

log1p(x) computes log(1+x) accurately also for |x| << 1.
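As a quick check of what clr_function does, the sketch below applies it to a made-up vector of ADT counts (the values in adt are hypothetical) and reproduces the result step by step. It also shows why log1p is preferred over log(1 + x) for very small x.

```r
# CLR function as quoted above.
clr_function <- function(x) {
  return(log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x)))))
}

# Hypothetical ADT counts (made-up values for illustration).
adt <- c(0, 5, 20, 100, 1000)
clr <- clr_function(adt)

# The same calculation in two explicit steps:
# 1. a geometric-mean-like denominator: sum log1p over the non-zero values,
#    divide by the full vector length, then exponentiate
denominator <- exp(sum(log1p(adt[adt > 0])) / length(adt))
# 2. log(1 + count / denominator)
manual <- log(1 + adt / denominator)

print(all.equal(clr, manual))

# log1p keeps precision where log(1 + x) loses it.
tiny <- 1e-20
print(log1p(tiny))
print(log(1 + tiny))
```

Note that zeros stay at zero after the transformation, and that the denominator divides by the full vector length, including the zero entries.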

scRNAseq clustering significance test: an unsolvable problem?

Introduction: In scRNA-seq data analysis, one of the most crucial and demanding tasks is determining the optimal resolution and cluster number. Striking the right balance between over-clustering and under-clustering is difficult, and it directly affects the identification of distinct cell populations and the biological insights drawn from them. Clustering algorithms have many parameters to tune, and they can generate more clusters if, e.g., you increase the resolution parameter. However, whether the newly generated clusters are meaningful is an open question.

Reuse the single cell data! How to create a seurat object from GEO datasets

Download the data

    cd data/GSE116256
    wget
    tar xvf GSE116256_RAW.tar
    rm GSE116256_RAW.tar

What you get depends on how the authors uploaded their data. Some authors may just upload the merged count matrix file, which is the easiest situation. In this dataset, each sample has its own matrix (*dem.txt.gz), features and barcodes. In total, there are 43 samples. Each sample also has an associated metadata file (*anno.txt.gz). You can inspect the files on the command line:

use random forest and boost trees to find marker genes in scRNAseq data

This post is part of a series on marker gene identification using machine learning methods. Read the previous posts: logistic regression and partial least squares regression. This post explores tree-based methods: random forest and boosted trees (gradient boosted trees/XGBoost). I highly recommend going through for related sections by Josh Starmer. Note that all the tree-based methods can be used for both classification and regression.

customize FeaturePlot in Seurat for multi-condition comparisons using patchwork

Seurat is great for scRNAseq analysis and it provides many easy-to-use ggplot2 wrappers for visualization. However, this comes at the cost of flexibility. For example, in FeaturePlot, one can specify multiple genes and also split the plots further by condition. When the plots are split this way, ncol is ignored, so you cannot arrange the grid. This is best understood with an example.

    library(dplyr)
    library(Seurat)
    library(patchwork)
    library(ggplot2)
    # Load the PBMC dataset
    pbmc.