Genomics

Has AI Changed the Course of Drug Development? Three Years Later

Almost three years ago I wrote Has AI Changed the Course of Drug Development?. The post was cautiously skeptical. AlphaFold was the headline. BenevolentAI had just flopped a Phase 2a in atopic dermatitis. Verge Genomics’ VRG50635 was about to enter the clinic. I argued that data quality was the real bottleneck, not algorithms. Time to revisit that. I now lead epigenomics bioinformatics at AstraZeneca and I use Claude Code daily.

Reviving BETA for Python 3: Integrating ChIP-seq and RNA-seq to Predict TF Targets

I started to learn bioinformatics because I needed to analyze public ChIP-seq data in 2012. That’s how I got to know Shirley Liu’s lab at Dana-Farber Cancer Institute. And God knows that I would join her group in 2020 for a staff scientist position to lead the CIDC bioinformatic project. I witnessed the development of many groundbreaking computational tools for genomics in Shirley’s lab. One tool that I found particularly elegant was BETA (Binding and Expression Target Analysis), developed by Su Wang and published in Nature Protocols in 2013.

How to make a triangle correlation heatmap with p-values labeled

In this blog post, I am going to show you how to make a correlation heatmap with p-values and significant values labeled in the heatmap body. Let’s use the PBMC single cell data as an example. You may want to read my previous blog post How to do gene correlation for single-cell RNAseq data. Load libraries library(dplyr) library(Seurat) library(patchwork) library(ggplot2) library(ComplexHeatmap) library(SeuratData) library(hdWGCNA) library(WGCNA) set.seed(1234) prepare the data data("pbmc3k") pbmc3k #> An object of class Seurat #> 13714 features across 2700 samples within 1 assay #> Active assay: RNA (13714 features, 0 variable features) ## routine processing pbmc3k<- pbmc3k %>% NormalizeData(normalization.

Run Rstudio server with singularity on HPC

Background Please read the following before go ahead: what is docker? what is Rocker? what is singularity? from Harvard Research computing website: Odyssey has singularity installed. Why Singularity? There are some important differences between Docker and Singularity: Docker and Singularity have their own container formats. Docker containers may be imported to run via Singularity. Docker containers need root privileges for full functionality which is not suitable for a shared HPC environment.

Calculate scATACseq TSS enrichment score

TSS enrichment score serves as an important quality control metric for ATACseq data. I want to write a script for single cell ATACseq data. From the Encode page: Transcription Start Site (TSS) Enrichment Score - The TSS enrichment calculation is a signal to noise calculation. The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1000 bp in either direction (for a total of 2000bp).

understand 10x scRNAseq and scATAC fastqs

single cell RNAseq Please read the following posts by Dave Tang. When I google, I always find his posts on top of the pages. Thanks for sharing your knowledge. https://davetang.org/muse/2018/06/06/10x-single-cell-bam-files/ https://davetang.org/muse/2018/08/09/getting-started-with-cell-ranger/ From the 10x manual: The final Single Cell 3’ Libraries contain the P5 and P7 primers used in Illumina bridge amplification PCR. The 10x Barcode and Read 1 (primer site for sequencing read 1) is added to the molecules during the GEMRT incubation.

How to make a transcript to gene mapping file

I need a transcript to gene mapping file for Salmon. I am aware of annotation bioconductor packages that can do this job. However, I was working on a species which does not have the annotation in a package format (I am going to use Drosphila as an example for this blog post). I had to go and got the gtf file and made such a file from scratch. Please read the specifications of those two file formats.

Understanding p value, multiple comparisons, FDR and q value

UPDATE 01/29/2019. Read this awesome paper Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. This was an old post I wrote 3 years ago after I took HarvardX: PH525.3x Advanced Statistics for the Life Sciences on edx taught by Rafael Irizarry. It is still one of the best courses to get you started using R for genomics. I am very thankful to have those high quality classes available to me when I started to learn.

PCA in action

PCA in practice. Principal Component Analysis(PCA) is a very important skill for dimention reduction to analyze high-dimentional data. High-dimentional data are data with features (p) a lot more than observations (n). This types of data are very commonly generated from high-throuput sequencing experiments. For example, an RNA-seq or microarry experiment measures expression of tens of thousands of genes for only 8 samples (4 controls and 4 treatments). Let’s use a microarray data for demonstration.

Three gotchas when using R for Genomic data analysis

During my daily work with R for genomic data analysis, I encountered several instances that R gives me some (bad) surprises. 1. The devil 1 and 0 coordinate system read detail here https://github.com/crazyhottommy/DNA-seq-analysis#tips-and-lessons-learned-during-my-dna-seq-data-analysis-journey some files such as bed file is 0 based. Two genomic regions: chr1 0 1000 chr1 1001 2000 when you import that bed file into R using rtracklayer::import(), it will become chr1 1 1000 chr1 1002 2000 The function convert it to 1 based internally (R is 1 based unlike python).