Education
As 2024 wraps up, it’s the perfect time to reflect and prepare for the new year. I wrote the review for 2023 here.
Goals reached
✅ I lost 12 lb in 6 weeks!
✅ Supported the clinical trial to identify biomarkers for potential responders to our drug at Immunitas. Helped with indication selection for the second and third programs. I moved to AstraZeneca in August. I really appreciate my experience at Immunitas, where I learned a lot.
To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.
My regret is not learning linear algebra well during college.
I barely passed the exam for it (and calculus, it was a nightmare :) ).
To be fair...
It was not taught well and it sounded too boring. I did not know what the application of matrix multiplication was, not until…
Many years later, I started to learn bioinformatics.
Understand CCA
Following my last blog post on PCA projection and cell label transfer, we are going to talk about CCA.
In single-cell RNA-seq data integration using Canonical Correlation Analysis (CCA), we typically align two matrices representing different datasets, where both datasets have the same set of genes but different numbers of cells.
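To make the matrix shapes concrete, here is a minimal sketch in base R. It uses random toy matrices (not real PBMC data) and a simplified SVD-based version of CCA; the variable names and the choice of 5 components are assumptions for illustration, and this is not Seurat's actual implementation.

```r
# Toy data: both matrices share the same 50 genes (rows)
# but have different numbers of cells (columns).
set.seed(42)
n_genes <- 50
X <- matrix(rnorm(n_genes * 30), nrow = n_genes)  # genes x 30 cells
Y <- matrix(rnorm(n_genes * 40), nrow = n_genes)  # genes x 40 cells

# Standardize each cell (column) to mean 0 and sd 1 across genes.
X_std <- scale(X)
Y_std <- scale(Y)

# Cell-by-cell cross-product: t(X_std) %*% Y_std, a 30 x 40 matrix.
# The shared gene dimension is what makes this product possible.
cross <- crossprod(X_std, Y_std)

# SVD of the cross-product; keep the first 5 canonical vectors (assumed number).
sv <- svd(cross, nu = 5, nv = 5)
cc_X <- sv$u  # 30 cells x 5 components for dataset X
cc_Y <- sv$v  # 40 cells x 5 components for dataset Y

dim(cc_X)
dim(cc_Y)
```

Each dataset ends up with one row per cell in the same low-dimensional space, which is what allows cells from the two datasets to be compared.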
Understand the example datasets
We will use the PBMC3k and PBMC10k data. We will project the PBMC3k data onto the PBMC10k data and get the labels.
library(Seurat)
library(Matrix)
library(irlba) # For PCA
library(RcppAnnoy) # For fast nearest neighbor search
library(dplyr)
# Assuming the PBMC datasets (3k and 10k) are already normalized
# and represented as sparse matrices
# devtools::install_github('satijalab/seurat-data')
library(SeuratData)
#AvailableData()
#InstallData("pbmc3k")
pbmc3k <- UpdateSeuratObject(pbmc3k)
pbmc3k@meta.
Motivation
What’s the most common problem you need to solve when dealing with genomics data?
For me, it is Genomic Intervals!
Genomics data are usually represented linearly: chromosome name, start, and end.
We use these intervals to define a region in the genome (a peak from ChIP-seq data), the location of a gene, a DNA methylation site (a single point), a mutation call (a single point), a duplicated region in cancer, and so on.
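As a small illustration of the chromosome/start/end representation, here is a sketch in base R that finds which peaks overlap which genes. The coordinates and column names are made up for the example, and it assumes 1-based closed intervals; in practice a dedicated package such as GenomicRanges is the usual tool for this.

```r
# Hypothetical intervals: each row is chromosome, start, end.
peaks <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100, 500, 100),
  end   = c(200, 600, 300)
)
genes <- data.frame(
  chr   = c("chr1", "chr2"),
  start = c(150, 400),
  end   = c(550, 800)
)

# Pair up every peak and gene on the same chromosome.
pairs <- merge(peaks, genes, by = "chr", suffixes = c(".peak", ".gene"))

# Two closed intervals overlap when each one starts before the other ends.
hits <- pairs[pairs$start.peak <= pairs$end.gene &
              pairs$start.gene <= pairs$end.peak, ]

hits
```

Here both chr1 peaks overlap the chr1 gene, while the chr2 peak ends before the chr2 gene starts, so it is dropped.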