Bioinformatics

Hidden skills beyond programming for computational biology

There are some hidden gems beyond the typical programming skills that have been instrumental in my journey. These are the often-overlooked yet crucial practices that have empowered me to tackle challenges and make sense of data in meaningful ways. Firstly, let’s talk about patience. It’s not as glamorous as diving straight into analysis, but taking the time for thorough quality control is invaluable. Before you get carried away, understand the experimental design.

How to make a multi-group dotplot for single-cell RNAseq data

Dotplots are very popular for visualizing single-cell RNAseq data. In essence, the dot size represents the percentage of cells that are positive for a gene; the color intensity represents the average expression of that gene in a cell type. It is easy to plot one using Seurat::DotPlot or scCustomize::Clustered_DotPlot. However, when you have multiple groups/conditions in your data and you want to visualize them by group, it is not that straightforward.
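As a rough illustration of the two quantities a dotplot encodes, here is a minimal base R sketch that computes them per cell type and per group. The toy data frame, its column names (cell_type, group, expr) and the numbers are all made up for this example, and this is a simplification: Seurat's own DotPlot may compute and scale the average expression differently.

```r
# Hypothetical toy data: 12 cells, one gene, made-up metadata and values.
cells <- data.frame(
  cell_type = rep(c("B", "T"), each = 6),
  group     = rep(c("control", "treated"), times = 6),
  expr      = c(0, 2, 1, 0, 3, 4, 0, 0, 5, 1, 0, 2)
)

# Dot size: percentage of cells with non-zero expression in each
# cell_type x group combination.
pct <- aggregate(expr ~ cell_type + group, data = cells,
                 FUN = function(v) 100 * mean(v > 0))

# Dot color: average expression in the same combination.
avg <- aggregate(expr ~ cell_type + group, data = cells, FUN = mean)

# One row per cell_type x group, with columns expr_pct and expr_avg.
dot_data <- merge(pct, avg, by = c("cell_type", "group"),
                  suffixes = c("_pct", "_avg"))
print(dot_data)
```

A table shaped like dot_data (one row per gene, cell type and group) is the kind of input a plotting layer would then map to dot size and color.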

Great Things Take Time: How Decades of Effort Led to My Dream Career

Everyone is unique. Only you can tell your own story, and I realized that no matter how many times I have told mine, I have to tell it again, again, again, and again. Because no matter how many times I tell it, there is always someone who hears my story for the first time. I hope it can inspire more people every time I tell it. Rewind 37 years to 1986.

Part 4 CITE-seq normalization using empty droplets with the dsb package

In this post, we are going to try a CITE-seq normalization method called dsb, published in “Normalizing and denoising protein expression data from droplet-based single cell profiling”. The paper describes two major components of protein expression noise in droplet-based single cell experiments: (1) protein-specific noise originating from ambient, unbound antibody encapsulated in droplets that can be accurately estimated via the level of “ambient” ADT counts in empty droplets, and (2) droplet/cell-specific noise revealed via the shared variance component associated with isotype antibody controls and background protein counts in each cell.

Part 3 Centered log ratio (CLR) normalization for CITE-seq protein count data

Following my last blog post, download the CITE-seq protein and RNA count data from here.

    library(Seurat)
    library(ggplot2)
    library(dplyr)
    pbmc <- readRDS("~/blog_data/CITEseq/pbmc1k_adt.rds")

How to normalize protein ADT count data? Seurat uses the centered log ratio (CLR) to normalize protein count data. On the Seurat GitHub page:

https://github.com/satijalab/seurat/blob/fc4a4f5203227832477a576bfe01bc6efeb23f51/R/preprocessing.R#L1768-L1769

    clr_function <- function(x) {
      return(log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x)))))
    }

log1p(x) computes log(1+x) accurately also for |x| << 1.
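To see what the clr_function quoted above does, here is a small base R sketch that applies it to a made-up vector of counts. The adt_counts values are hypothetical, not taken from the pbmc data, and the intermediate denominator variable is only there to make the formula easier to read.

```r
# Same formula as the clr_function quoted above, written compactly.
clr_function <- function(x) {
  log1p(x / exp(sum(log1p(x[x > 0]), na.rm = TRUE) / length(x)))
}

# Hypothetical ADT counts (made-up numbers for illustration only).
adt_counts <- c(0, 3, 10, 50, 200)

# The denominator sums log1p over the positive counts only, but divides
# by the full length of the vector, zeros included.
denominator <- exp(sum(log1p(adt_counts[adt_counts > 0])) / length(adt_counts))

# Each count is divided by that denominator, then passed through log1p.
clr_values <- clr_function(adt_counts)
print(round(clr_values, 3))
```

Because log1p(0) is 0, zero counts stay at zero after the transformation, and larger counts map to larger values.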

Part 2 CITE-seq downstream analysis: From Alevin output to Seurat visualization

In my last post, I showed you how to get the protein and RNA counts from a CITE-seq experiment using Simpleaf. Now that we have the raw count matrices, we are ready to explore them within R. To follow the tutorial, you can download the associated data from here.

Load the data

    suppressPackageStartupMessages({
      library(fishpond)
      library(ggplot2)
      library(dplyr)
      library(SingleCellExperiment)
      library(Seurat)
      library(DropletUtils)
    })

    # set the seed
    set.seed(123)

    #gex_q <- loadFry('~/blog_data/CITEseq/alevin_rna/af_quant')
    #fb_q <- loadFry( '~/blog_data/CITEseq/alevin_adt/af_quant')

    # I saved the above objs first to rds files, now just read them back
    fb_q<- readRDS("~/blog_data/CITEseq/fb_q.

Part 1 How to use Salmon/Alevin to preprocess CITE-seq data

Introduction

A state-of-the-art method called CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) allows surface protein levels and RNA expression to be measured simultaneously in individual cells. After cells are labeled with oligo-conjugated antibodies, CITE-seq uses traditional single-cell RNA-sequencing to read out both transcriptomic and proteomic information from the same cell. This overcomes the drawbacks of techniques that measure only proteins or only RNA. CITE-seq reveals coordinated control of gene and protein activity, offering a potent multidimensional perspective of cell states.

My take on Data Challenges in Immuno-oncology, the Role of the Cloud, and Growing a Computational Biology Team

The original link: https://connect.corrdyn.com/blog/ming-tang-on-data-challenges-in-immuno-oncology-the-role-of-the-cloud-and-growing-a-computational-biology-team

Guest Profile

Tommy Tang’s career began when he pursued his Ph.D. in genetics and genomics at the University of Florida. Initially trained in molecular biology in the wet lab, he was driven to explore computational biology after encountering the limitations of traditional analysis methods. Through self-study, Tommy developed skills that enabled him to analyze complex genomic data sets. Following his Ph.D., Tommy joined MD Anderson Cancer Center and later moved to Harvard and the Dana-Farber Cancer Institute, where he worked on single-cell RNA sequencing.

How to use random forest as a clustering method

If you ask me: what’s your favorite machine learning algorithm? I would answer logistic regression (with regularization: Lasso, Ridge and Elastic Net) followed by random forest. In fact, that is the order in which we try those methods. Deep learning can perform well on tabular data given a complicated architecture, while random forest and boosted-tree methods usually work well out of the box. Regression and random forest are more interpretable too.

My 4 steps to learn deep learning for genomics

Step 1, get a high-level understanding. Watch StatQuest by Josh Starmer and the 3Blue1Brown deep learning playlist.

Step 2, code it out! If you are into Python, watch “The spelled-out intro to neural networks and backpropagation: building micrograd”. I still code in R most of the time, so I walk through the R code in the Deep Learning with R book.