Hidden skills beyond programming for computational biology

There are some hidden gems beyond the typical programming skills that have been instrumental in my journey. These are the often-overlooked yet crucial practices that have empowered me to tackle challenges and make sense of data in meaningful ways. Firstly, let’s talk about patience. It’s not as glamorous as diving straight into analysis, but taking the time for thorough quality control is invaluable. Before you get carried away, understand the experimental design.

Great Things Take Time: How Decades of Effort Led to My Dream Career

Everyone is unique. Only you can tell your own story, and I have realized that no matter how many times I have told mine, I have to tell it again and again, because there is always someone hearing it for the first time. I hope it inspires more people every time I tell it. Rewind 37 years to 1986.

Review 2023

At the end of every year, I write a review of the past year. It is a great time to reflect on the losses and wins and to plan for the new year. I cannot believe 2024 is right around the corner. My review of 2022 can be found here. Goals reached for 2023: these are in the same order as the 2023 goals in last year’s review.

Part 4 CITE-seq normalization using empty droplets with the dsb package

In this post, we are going to try a CITE-seq normalization method called dsb, published in "Normalizing and denoising protein expression data from droplet-based single cell profiling". The paper describes two major components of protein expression noise in droplet-based single cell experiments: (1) protein-specific noise originating from ambient, unbound antibody encapsulated in droplets, which can be accurately estimated via the level of “ambient” ADT counts in empty droplets, and (2) droplet/cell-specific noise revealed via the shared variance component associated with isotype antibody controls and background protein counts in each cell.
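To make the first noise component above more concrete, here is a minimal base R sketch of the idea of estimating ambient signal from empty droplets and rescaling the cell-containing droplets against it. This is not the dsb package's code: the count matrices are simulated, the protein names are hypothetical, and the pseudocount of 10 is an assumption chosen for illustration. The second, per-cell denoising component is not covered here.

```r
set.seed(1)

# Hypothetical raw ADT counts: rows are proteins, columns are droplets
proteins <- c("CD3", "CD19", "IgG1_isotype")
cells <- matrix(rpois(3 * 6, lambda = 200), nrow = 3,
                dimnames = list(proteins, paste0("cell", 1:6)))
empty <- matrix(rpois(3 * 50, lambda = 20), nrow = 3,
                dimnames = list(proteins, paste0("empty", 1:50)))

# Assumed pseudocount, used only to avoid log(0) in this sketch
pseudocount <- 10

log_cells <- log(cells + pseudocount)
log_empty <- log(empty + pseudocount)

# Per-protein ambient level and spread, estimated from the empty droplets
ambient_mean <- rowMeans(log_empty)
ambient_sd <- apply(log_empty, 1, sd)

# Express each cell's protein level in standard deviations above ambient.
# The vectors have one entry per row, so R recycles them down each column.
ambient_corrected <- (log_cells - ambient_mean) / ambient_sd

round(ambient_corrected, 2)
```

With this rescaling, a value near 0 means a protein level close to what is seen in empty droplets, while larger values indicate signal above the ambient background.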

Part 3 Centered log ratio (CLR) normalization for CITE-seq protein count data

Following my last blog post, download the CITE-seq protein and RNA count data from here.

    library(Seurat)
    library(ggplot2)
    library(dplyr)
    pbmc <- readRDS("~/blog_data/CITEseq/pbmc1k_adt.rds")

How to normalize protein ADT count data? Seurat uses the centered log ratio (CLR) to normalize protein count data. In the Seurat GitHub repository:

    clr_function <- function(x) {
      return(log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x)))))
    }

log1p(x) computes log(1+x) accurately also for |x| << 1.
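To see what the CLR function quoted above computes, here is a small base R sketch on a made-up vector of ADT counts (the numbers are hypothetical and do not come from the pbmc data). It writes the denominator out step by step: zeros are dropped from the sum but still counted in the length, which gives the same result as the mean of log1p over all entries because log1p(0) is 0.

```r
# CLR function as quoted above from Seurat
clr_function <- function(x) {
  return(log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x)))))
}

# Hypothetical ADT counts (made-up numbers for illustration)
adt_counts <- c(0, 5, 20, 100, 1000)

# Denominator: exp of the average log1p count, similar to a geometric mean
denom <- exp(sum(log1p(adt_counts[adt_counts > 0])) / length(adt_counts))

# Divide each count by the denominator, then take log1p again
clr_manual <- log1p(adt_counts / denom)
clr_values <- clr_function(adt_counts)

# The step-by-step version should match the one-liner
all.equal(clr_values, clr_manual)
```

A zero count stays at zero after the transformation, and larger counts are compressed onto a log scale relative to the typical count level in the vector.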

Part 2 CITE-seq downstream analysis: From Alevin output to Seurat visualization

In my last post, I showed you how to get the protein and RNA counts from a CITE-seq experiment using Simpleaf. Now that we have the raw count matrices, we are ready to explore them within R. To follow the tutorial, you can download the associated data from here.

Load the data:

    suppressPackageStartupMessages({
      library(fishpond)
      library(ggplot2)
      library(dplyr)
      library(SingleCellExperiment)
      library(Seurat)
      library(DropletUtils)
    })

    # set the seed
    set.seed(123)

    #gex_q <- loadFry('~/blog_data/CITEseq/alevin_rna/af_quant')
    #fb_q <- loadFry( '~/blog_data/CITEseq/alevin_adt/af_quant')

    # I saved the above objs first to rds files, now just read them back
    fb_q<- readRDS("~/blog_data/CITEseq/fb_q.

Part 1 How to use Salmon/Alevin to preprocess CITE-seq data

Introduction: A state-of-the-art method called CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) allows surface protein levels and RNA expression to be measured simultaneously in individual cells. After cells are labeled with oligo-conjugated antibodies, CITE-seq uses standard single-cell RNA sequencing to read out both transcriptomic and proteomic information from the same cell. This overcomes the drawbacks of techniques that measure proteins or RNA separately. CITE-seq reveals coordinated control of gene and protein activity, offering a potent multidimensional perspective of cell states.

My take on Data Challenges in Immuno-oncology, the Role of the Cloud, and Growing a Computational Biology Team

The original link.

Guest Profile: Tommy Tang’s career began when he pursued his Ph.D. in genetics and genomics at the University of Florida. Initially trained in molecular biology in the wet lab, he was driven to explore computational biology after encountering the limitations of traditional analysis methods. Through self-study, Tommy developed skills that enabled him to analyze complex genomic data sets. Following his Ph.D., Tommy joined MD Anderson Cancer Center and later moved to Harvard and the Dana-Farber Cancer Institute, where he worked on single-cell RNA sequencing.

How to use random forest as a clustering method

If you ask me: what’s your favorite machine learning algorithm? I would answer logistic regression (with regularization: Lasso, Ridge and Elastic Net) followed by random forest. In fact, that’s the order in which we try those methods. Deep learning can perform well on tabular data with a complicated architecture, while random forest or boosted-tree-based methods usually work well out of the box. Regression and random forest are more interpretable too.

How to convert raw counts to TPM for TCGA data and make a heatmap across cancer types

The Cancer Genome Atlas (TCGA) project is probably one of the most well-known large-scale cancer sequencing projects. It sequenced ~10,000 treatment-naive tumors across 33 cancer types. Different data types are available, including whole-exome, whole-genome, copy-number (SNP array), bulk RNAseq, protein expression (Reverse-Phase Protein Array), and DNA methylation. TCGA is a very successful large sequencing project. I highly recommend learning from how it was organized.