Bioinformatics

How to do gene correlation for single-cell RNAseq data (part 2) using meta-cell

In my last blog post, I showed that pearson gene correlation for single-cell data has flaws because of the sparsity of the count matrix. One way to get around it is to use the so called meta-cell. One can use KNN to find the K nearest neighbors and collapse them into a meta-cell. You can implement it from scratch, but one should not re-invent the wheel. For example, you can use metacells.

How to do gene correlation for single-cell RNAseq data (part 1)

Load libraries library(dplyr) library(Seurat) library(patchwork) library(ggplot2) library(ComplexHeatmap) library(SeuratData) set.seed(1234) prepare the data data("pbmc3k") pbmc3k #> An object of class Seurat #> 13714 features across 2700 samples within 1 assay #> Active assay: RNA (13714 features, 0 variable features) ## routine processing pbmc3k<- pbmc3k %>% NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>% FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>% ScaleData() %>% RunPCA(verbose = FALSE) %>% FindNeighbors(dims = 1:10, verbose = FALSE) %>% FindClusters(resolution = 0.

transpose single-cell cell x gene dataframe to gene x cell

Single cell matrix is often represented as gene x cell in R/Seurat, but it is represented as cell x gene in python/scanpy. Let’s use a real example to show how to transpose between the two formats. The GEO accession page is at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154763 Download the data We can use command line to download the count matrix at ftp: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE154nnn/GSE154763/suppl/ wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE154nnn/GSE154763/suppl/GSE154763_ESCA_normalized_expression.csv.gz -O ~/blog_data/GSE154763_ESCA_normalized_expression.csv.gz # decompress the file gunzip GSE154763_ESCA_normalized_expression.csv.gz # this GEO matrix is cell x gene # take a look by https://www.

Review 2022

Every year, I write a review for the past 12 months. Find my review for 2021. Some top wins: Anna was born in September! It is a lot of work with three kids, but they are cute! I am grateful for our CSO Thomas Tan, who has confidence in me to lead the team. We built an amazing computational biology team at Immunitas. I am so fortunate to have Matt, Tim and Michelle in the group.

How to run dockerized Rstudio server on google cloud

Create a google VM Follow the process using the console https://cloud.google.com/compute/docs/instances/create-start-instance#console_1 Install docker Follow https://docs.docker.com/engine/install/debian/ Note this example for the debian build. If you created your VM using ubuntu as the boot disk, you should follow the ubuntu section https://docs.docker.com/engine/install/ubuntu/. In the GCP VM: sudo apt-get update sudo apt-get install \ ca-certificates \ curl \ gnupg \ lsb-release sudo mkdir -p /etc/apt/keyrings curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg –dearmor -o /etc/apt/keyrings/docker.

Are PDL1 RNA and protein levels correlated in cancer cell lines?

Are protein and RNA levels correlated? This is a big question. see replies to this tweet at https://twitter.com/slavov_n/status/1561844133496512512. In general, RNA and protein abundances should be correlated but there are exceptions of course. Biology is complicated/weird! One of my favorite examples is Hypoxia-inducible factor 1-alpha, HIF-1α. The protein is efficiently degraded in most tissues most of the time unless stabilized by hypoxia.

use random forest and boost trees to find marker genes in scRNAseq data

This is a blog post for a series of posts on marker gene identification using machine learning methods. Read the previous posts: logistic regression and partial least square regression. This blog post will explore the tree based method: random forest and boost trees (gradient boost tree/XGboost). I highly recommend going through https://app.learney.me/maps/StatQuest for related sections by Josh Starmer. Note, all the tree based methods can be used to do both classification and regression.

Partial least square regression for marker gene identification in scRNAseq data

This is an extension of my last blog post marker gene selection using logistic regression and regularization for scRNAseq. Let’s use the same PBMC single-cell RNAseq data as an example. Load libraries library(Seurat) library(tidyverse) library(tidymodels) library(scCustomize) # for plotting library(patchwork) Preprocess the data

Load the PBMC dataset pbmc.data <- Read10X(data.dir = "~/blog_data/filtered_gene_bc_matrices/hg19/") # Initialize the Seurat object with the raw (non-normalized data). pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.

My odyssey of obtaining scRNAseq metadata

I want to curate a public scRNAseq dataset from this paper Single-cell analyses reveal key immune cell subsets associated with response to PD-L1 blockade in triple-negative breast cancer ffq I first tried ffq, but it gave me errors. ffq fetches metadata information from the following databases: GEO: Gene Expression Omnibus, SRA: Sequence Read Archive, EMBL-EBI: European Molecular BIology Laboratory’s European BIoinformatics Institute, DDBJ: DNA Data Bank of Japan, NIH Biosample: Biological source materials used in experimental assays, ENCODE: The Encyclopedia of DNA Elements.

marker gene selection using logistic regression and regularization for scRNAseq

why this blog post? I saw a biorxiv paper titled A comparison of marker gene selection methods for single-cell RNA sequencing data Our results highlight the efficacy of simple methods, especially the Wilcoxon rank-sum test, Student’s t-test and logistic regression I am interested in using logistic regression to find marker genes and want to try fitting the model in the tidymodel ecosystem and using different regularization methods.