Part 3 Centered log ratio (CLR) normalization for CITE-seq protein count data

Following my last blog post, download the CITE-seq protein and RNA count data at here. library(Seurat) library(ggplot2) library(dplyr) pbmc<- readRDS("~/blog_data/CITEseq/pbmc1k_adt.rds") How to normalize protein ADT count data? Seurat uses the centered log ratio (CLR) to normalize protein count data. In the Seurat github page: clr_function <- function(x) { return(log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x))))) } log1p(x) computes log(1+x) accurately also for |x| << 1.

scRNAseq clustering significance test: an unsolvable problem?

Introductioon In scRNA-seq data analysis, one of the most crucial and demanding tasks is determining the optimal resolution and cluster number. Achieving an appropriate balance between over-clustering and under-clustering is often intricate, as it directly impacts the identification of distinct cell populations and biological insights. The clustering algorithms have many parameters to tune and it can generate more clusters if e.g., you increase the resolution parameter. However, whether the newly generated clusters are meaningful or not is a question.

Reuse the single cell data! How to create a seurat object from GEO datasets

Download the data cd data/GSE116256 wget tar xvf GSE116256_RAW.tar rm GSE116256_RAW.tar Depending on how the authors upload their data. Some authors may just upload the merged count matrix file. This is the easiest situation. In this dataset, each sample has a separate set of matrix (*dem.txt.gz), features and barcodes. Total, there are 43 samples. For each sample, it has an associated metadata file (*anno.txt.gz) too. You can inspect the files in command line:

use random forest and boost trees to find marker genes in scRNAseq data

This is a blog post for a series of posts on marker gene identification using machine learning methods. Read the previous posts: logistic regression and partial least square regression. This blog post will explore the tree based method: random forest and boost trees (gradient boost tree/XGboost). I highly recommend going through for related sections by Josh Starmer. Note, all the tree based methods can be used to do both classification and regression.

customize FeaturePlot in Seurat for multi-condition comparisons using patchwork

Seurat is great for scRNAseq analysis and it provides many easy-to-use ggplot2 wrappers for visualization. However, this brings the cost of flexibility. For example, In FeaturePlot, one can specify multiple genes and also to further split to multiple the conditions in the If is not NULL, the ncol is ignored so you can not arrange the grid. This is best to understand with an example. library(dplyr) library(Seurat) library(patchwork) library(ggplot2) # Load the PBMC dataset pbmc.