Data

PCA analysis on TCGA bulk RNAseq data continued

In my last blog post, I showed you how to download TCGA RNAseq count data, run PCA, and make a heatmap. Interestingly, some of the LUSC samples mix with the LUAD samples and vice versa. In this post, we will continue to use PCA for more exploratory data analysis (EDA).
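
For context, a minimal sketch of that workflow in R, assuming tcga_counts is a hypothetical genes-by-samples raw count matrix (not necessarily the post's actual object name):

library(tidyverse)

# normalize to log2 counts-per-million to stabilize variance before PCA
cpm <- t(t(tcga_counts) / colSums(tcga_counts)) * 1e6
log_cpm <- log2(cpm + 1)

# run PCA on the samples, using the 1000 most variable genes
top_genes <- order(apply(log_cpm, 1, var), decreasing = TRUE)[1:1000]
pca <- prcomp(t(log_cpm[top_genes, ]), center = TRUE, scale. = TRUE)

# plot PC1 vs PC2; points can then be colored by cancer type (LUAD vs LUSC)
pca$x %>%
  as.data.frame() %>%
  rownames_to_column("sample") %>%
  ggplot(aes(PC1, PC2)) +
  geom_point()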

PCA analysis on scATACseq data

In my last post, I showed you how to use PCA for bulk RNAseq data. Today, let’s see how we can use it for scATACseq data. Download the example dataset from 10x Genomics: https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_pbmc_5k_v1 The dataset contains 5k peripheral blood mononuclear cells (PBMCs) from a healthy donor (v1.0). Download the atac_pbmc_5k_v1_filtered_peak_bc_matrix.tar.gz file and unzip it.
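
A minimal sketch of loading the matrix in R, assuming the unzipped folder follows the standard 10x layout (matrix.mtx, peaks.bed, barcodes.tsv):

library(Matrix)

# peaks x cells sparse matrix
mat <- readMM("filtered_peak_bc_matrix/matrix.mtx")
peaks <- read.table("filtered_peak_bc_matrix/peaks.bed")
barcodes <- readLines("filtered_peak_bc_matrix/barcodes.tsv")
rownames(mat) <- paste(peaks$V1, peaks$V2, peaks$V3, sep = "_")
colnames(mat) <- barcodes

# scATACseq counts are near-binary, so binarize before dimension reduction
mat@x[mat@x > 0] <- 1

PCA (or the closely related TF-IDF + SVD commonly used for scATACseq) can then be run on this binarized matrix.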

PCA analysis on TCGA bulk RNAseq data

What is PCA? Principal Component Analysis (PCA) is a mathematical technique used to reduce the dimensionality of large datasets while preserving the most important patterns in the data. It transforms the original high-dimensional data into a smaller set of new variables called principal components (PCs), which capture the most variation in the data.
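
As a toy illustration (not from the post itself), prcomp() in base R does all of this in one call:

# PCA on the four iris measurements; center and scale so each
# variable contributes equally
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)        # proportion of variance explained by each PC
head(pca$x[, 1:2])  # coordinates of each sample on PC1 and PC2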

Biotech Data Strategy: Building a Scalable Foundation for Startups

In a biotech startup, an early data strategy is key to ensuring that public and private data remain useful and valuable. As AI hype reaches new heights, I want to emphasize that a data strategy must precede any AI strategy. Data is the oil of the AI engine. Unfortunately, real-world data are usually messy and not AI-ready. Without a robust data strategy, you are building an AI system on a shaky foundation.

Review 2024

As 2024 wraps up, it’s the perfect time to reflect and prepare for the new year. I wrote the review for 2023 here.

Goals reached:
✅ I lost 12 lb in 6 weeks!
✅ Supported the clinical trial to identify biomarkers for potential responders to our drug at Immunitas, and helped with indication selection for the second and third programs.

I moved to AstraZeneca in August. I really appreciate my experience at Immunitas and learned a lot.

I regret not doing so

My regret is not learning linear algebra well during college. I barely passed the exam for it (and for calculus; it was a nightmare :) ). To be fair, it was not taught well and it sounded too boring. I did not know what matrix multiplication was good for until, many years later, I started to learn bioinformatics.

Be careful when left_join tables with duplicated rows

This is going to be a really short blog post. I recently found that if I join two tables and one of them has duplicated rows, the joined table also contains the duplicated rows. It may be the expected behavior for others, but I want to make a note here for myself.

library(tidyverse)

df1 <- tibble(key = c("A", "B", "C", "D", "E"),
              value = 1:5)
df1
## # A tibble: 5 x 2
##   key   value
##   <chr> <int>
## 1 A         1
## 2 B         2
## 3 C         3
## 4 D         4
## 5 E         5

Data frame 2 has two identical rows for B.
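
To make the excerpt self-contained, here is a sketch of the second table and the join; df2’s column name value2 is an assumption, not necessarily the post’s:

# df2 has two identical rows for key "B" (value2 is a hypothetical column)
df2 <- tibble(key = c("A", "B", "B", "C"),
              value2 = c(10, 20, 20, 30))

# left_join keeps every match, so the duplicated "B" row in df2
# shows up twice in the result
left_join(df1, df2, by = "key")

Calling distinct() on df2 before the join removes the exact duplicates if that is not the behavior you want.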

Backup automatically with cron

Data backup is an essential step in the data analysis life cycle, as shown in the figure below taken from DataONE. There are so many important things you may want to back up: your raw/processed data, your code, and your dot configuration files. While for every project I git version control my scripts (not the data) and push them to GitHub or GitLab as a backup, big files cannot be hosted on GitHub or GitLab.
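
A sketch of what such an automated backup could look like; the rsync source/destination paths and the 2 a.m. schedule are illustrative assumptions, not the post’s actual setup:

# open the crontab with: crontab -e
# entry format: minute hour day-of-month month day-of-week command
# this example syncs a project folder to a backup drive every day at 2 a.m.
0 2 * * * rsync -av --delete /home/me/projects/ /mnt/backup/projects/

rsync only copies what changed, so repeated runs stay fast; the --delete flag mirrors removals too, so drop it if you want the backup to keep deleted files.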