Data

Biotech Data Strategy: Building a Scalable Foundation for Startups

In a biotech startup, an early data strategy is key to ensuring that public and private data remain useful and valuable. As AI hype reaches new heights, I want to emphasize that a data strategy must precede any AI strategy. Data is the oil of the AI engine. Unfortunately, real-world data are usually messy and not AI-ready. Without a robust data strategy, you are building an AI system on a shaky foundation.

Review 2024

As 2024 wraps up, it’s the perfect time to reflect and prepare for the new year. I wrote the review for 2023 here.

Goals reached
✅ I lost 12 lb in 6 weeks!
✅ Supported the clinical trial to identify biomarkers for potential responders to our drug at Immunitas. Helped with indication selection for the second and third programs. I moved to AstraZeneca in August. I really appreciated my experience at Immunitas and learned a lot there.

I regret not doing so

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics. My regret is not learning linear algebra well during college. I barely passed the exam for it (and for calculus, which was a nightmare :) ). To be fair, it was not taught well and it sounded too boring. I did not know what the application of matrix multiplication was, not until… Many years later, I started to learn bioinformatics.

Be careful when left_join tables with duplicated rows

This is going to be a really short blog post. I recently found that if I join two tables and one of them has duplicated rows, the final joined table also contains the duplicated rows. This may be the expected behavior for others, but I want to make a note of it here for myself.

    library(tidyverse)
    df1 <- tibble(key = c("A", "B", "C", "D", "E"), value = 1:5)
    df1
    ## # A tibble: 5 x 2
    ##   key   value
    ##   <chr> <int>
    ## 1 A         1
    ## 2 B         2
    ## 3 C         3
    ## 4 D         4
    ## 5 E         5

Data frame 2 has two identical rows for B.
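The behavior described above can be reproduced with a small sketch. The second table is not shown in this excerpt, so `df2_dup` and its `score` column below are hypothetical stand-ins: they only assume a table in which the row for B appears twice. The last step, dropping exact duplicates with `distinct()` before joining, is one possible way to avoid the extra row and is not necessarily what the original post does.

```r
library(dplyr)

df1 <- tibble(key = c("A", "B", "C", "D", "E"), value = 1:5)

# Hypothetical second table (assumed contents): the row for "B" is duplicated
df2_dup <- tibble(key = c("A", "B", "B", "C"), score = c(10, 20, 20, 30))

# Every matching row in df2_dup is kept, so "B" shows up twice in the result
joined <- left_join(df1, df2_dup, by = "key")
nrow(joined)        # one more row than df1

# Removing exact duplicate rows first gives one row per key again
joined_dedup <- left_join(df1, distinct(df2_dup), by = "key")
nrow(joined_dedup)  # same number of rows as df1
```

Comparing `nrow()` before and after a join is a cheap sanity check for catching this kind of silent row inflation.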

Backup automatically with cron

Data backup is an essential step in the data analysis life cycle, as shown in the picture below taken from DataOne. There are so many important things you may want to back up: your raw/processed data, your code, and your dot configuration files. For every project, I use git to version control my scripts (not the data) and push them to GitHub or GitLab as a backup, but big files cannot be hosted on GitHub or GitLab.