Bioinformatics

PCA analysis on TCGA bulk RNAseq data

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics. What is PCA? Principal Component Analysis (PCA) is a mathematical technique used to reduce the dimensionality of large datasets while preserving the most important patterns in the data. It transforms the original high-dimensional data into a smaller set of new variables called principal components (PCs), which capture most of the variation in the data.
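As a minimal sketch of the idea above, here is PCA on a small simulated matrix using base R's prcomp(). The matrix, its dimensions, and the variable names are made up for illustration; they are not the TCGA data from the post. Note that prcomp() expects samples in rows, so a genes-by-samples expression matrix would need to be transposed first.

```r
# Toy expression matrix: 6 samples (rows) x 4 genes (columns).
# The values are simulated, not real RNAseq data.
set.seed(1)
expr <- matrix(rnorm(24), nrow = 6, ncol = 4,
               dimnames = list(paste0("sample", 1:6), paste0("gene", 1:4)))

# Center and scale each gene, then compute the principal components
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of the total variance captured by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of each sample on the first two PCs
pc_scores <- pca$x[, 1:2]
```

Plotting pc_scores gives the familiar PC1 versus PC2 scatter plot, and var_explained shows how much of the variation each axis retains.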

Biotech Data Strategy: Building a Scalable Foundation for Startups

In a biotech startup, an early data strategy is key to ensuring that public and private data remain useful and valuable. As AI hype reaches new heights, I want to emphasize that a data strategy must precede any AI strategy. Data is the oil of the AI engine. Unfortunately, real-world data are usually messy and not AI-ready. Without a robust data strategy, you are building an AI system on a shaky foundation.

Review 2024

As 2024 wraps up, it’s the perfect time to reflect and prepare for the new year. I wrote the review for 2023 here. Goals reached: ✅ I lost 12 lb in 6 weeks! ✅ Supported the clinical trial to identify biomarkers for potential responders to our drug at Immunitas. Helped with indication selection for the second and third programs. I moved to AstraZeneca in August. I really appreciate my experience at Immunitas and learned a lot.

I regret not doing so

My regret is not learning linear algebra well during college. I barely passed the exam for it (and calculus, it was a nightmare :) ). To be fair, it was not taught well and it sounded too boring. I did not know what the application of matrix multiplication was, not until… many years later, I started to learn bioinformatics.

How CCA alignment and cell label transfer work in Seurat

Understand CCA. Following my last blog post on PCA projection and cell label transfer, we are going to talk about CCA. In single-cell RNA-seq data integration using Canonical Correlation Analysis (CCA), we typically align two matrices representing different datasets, where both datasets have the same set of genes but different numbers of cells.

How PCA projection and cell label transfer work in Seurat

Understand the example datasets. We will use the PBMC3k and PBMC10k data. We will project the PBMC3k data onto the PBMC10k data and get the labels.

    library(Seurat)
    library(Matrix)
    library(irlba) # For PCA
    library(RcppAnnoy) # For fast nearest neighbor search
    library(dplyr)
    # Assuming the PBMC datasets (3k and 10k) are already normalized
    # and represented as sparse matrices
    # devtools::install_github('satijalab/seurat-data')
    library(SeuratData)
    #AvailableData()
    #InstallData("pbmc3k")
    pbmc3k<-UpdateSeuratObject(pbmc3k)
    pbmc3k@meta.

You need to master it if you deal with genomics data

Motivation: What’s the most common problem you need to solve when dealing with genomics data? For me, it is Genomic Intervals! Genomics data are usually represented linearly: chromosome name, start, and end. We use these to define a region in the genome (a peak from ChIP-seq data), the location of a gene, a DNA methylation site (a single point), a mutation call (a single point), a duplicated region in cancer, etc.
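To make the interval representation above concrete, here is a minimal base R sketch of the most basic interval question: do two regions overlap? The helper name intervals_overlap() is hypothetical, and the sketch assumes 0-based, half-open coordinates [start, end) as used in BED files. It is an illustration only, not the approach the post goes on to describe.

```r
# Hypothetical helper: TRUE when two intervals overlap.
# Assumes 0-based, half-open coordinates [start, end), as in BED files.
intervals_overlap <- function(chr1, start1, end1, chr2, start2, end2) {
  # Same chromosome, and each interval starts before the other one ends
  chr1 == chr2 & start1 < end2 & start2 < end1
}

# A ChIP-seq peak and a gene that share bases 150 to 199
peak_hits_gene <- intervals_overlap("chr1", 100, 200, "chr1", 150, 250)

# Adjacent intervals share no base in half-open coordinates
adjacent_hit <- intervals_overlap("chr1", 100, 200, "chr1", 200, 300)
```

With half-open coordinates the comparison uses strict inequalities, so two intervals that merely touch at a boundary are not counted as overlapping.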

A docker image to keep this site alive

I have been writing blog posts for over 10 years. I was using Blogspot, and in 2018 I switched to blogdown, and I love it. My blogdown website divingintogeneticsandgenomics.com was using Hugo v0.42 and blogdown v1.0. It has been many years and now I have a MacBook Pro with an M3 chip. I could not install the old versions of the R packages to serve the site.

The Most Common Mistake In Bioinformatics, one-off error

In my last blog post, I talked about some common bioinformatics mistakes. Today, we are going to talk about THE MOST common bioinformatics mistake people make, and I think it deserves its own post. Even some experienced programmers get it wrong, and the mistake is prevalent in many bioinformatics software packages: the off-by-one mistake!

The Most Common Stupid Mistakes In Bioinformatics

This post is inspired by this popular thread on https://www.biostars.org/. Common mistakes in general. Off-by-One Errors: Mistakes occur when switching between different indexing systems. For example, BED files are 0-based (with half-open intervals) while GFF/GTF files are 1-based (with closed intervals), leading to potential misinterpretations of genomic coordinates. This is one of the most common mistakes!
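The BED versus GFF/GTF difference described above can be sketched in a few lines of base R. The helper names bed_to_gff() and gff_to_bed() are hypothetical and exist only for this illustration. The sketch assumes the standard conventions: BED intervals are 0-based and half-open, GFF/GTF intervals are 1-based and closed.

```r
# Hypothetical helpers for moving between the two coordinate systems.
# BED:     0-based, half-open [start, end)
# GFF/GTF: 1-based, closed    [start, end]
bed_to_gff <- function(start, end) {
  # Only the start shifts; the end value is numerically the same
  list(start = start + 1, end = end)
}

gff_to_bed <- function(start, end) {
  list(start = start - 1, end = end)
}

# The first 100 bases of a chromosome
bed <- list(start = 0, end = 100)
gff <- bed_to_gff(bed$start, bed$end)

# The interval length is the same in both systems
bed_length <- bed$end - bed$start
gff_length <- gff$end - gff$start + 1
```

Forgetting the start shift, or applying it to the end as well, moves every feature by one base, which is exactly the kind of silent error the excerpt warns about.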