Data
In a biotech startup, an early data strategy is key to ensuring that public and private data remain useful and valuable. As AI hype reaches new heights, I want to emphasize that a data strategy must precede any AI strategy.
Data is the oil of the AI engine. Unfortunately, real-world data are usually messy and not AI-ready. Without a robust data strategy, you are building an AI system on a shaky foundation.
As 2024 wraps up, it’s the perfect time to reflect and prepare for the new year. I wrote my review of 2023 here.
Goals reached
✅ I lost 12 lb in 6 weeks!
✅ Supported the clinical trial to identify biomarkers for potential responders to our drug at Immunitas, and helped with indication selection for the second and third programs. I moved to AstraZeneca in August; I really appreciate my experience at Immunitas and learned a lot.
To make sure you don’t miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.
My regret is not learning linear algebra well in college.
I barely passed the exam for it (and calculus; it was a nightmare :) ).
To be fair, it was not taught well, and it sounded too boring. I did not know what the application of matrix multiplication was until, many years later, I started to learn bioinformatics.
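For instance (a made-up toy example, not from any course or paper): computing per-group mean expression from a genes-by-samples count matrix is a single matrix multiplication in base R.

```r
# A tiny genes-by-samples count matrix (toy numbers)
counts <- matrix(c(10, 20, 30, 40,
                    5, 15, 25, 35),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("gene1", "gene2"),
                                 c("s1", "s2", "s3", "s4")))

# Samples s1/s2 are controls, s3/s4 are treated; the averaging
# matrix has one column per group, with weights 1/(group size)
design <- matrix(c(0.5, 0.5, 0,   0,
                   0,   0,   0.5, 0.5),
                 nrow = 4,
                 dimnames = list(c("s1", "s2", "s3", "s4"),
                                 c("control", "treated")))

# (genes x samples) %*% (samples x groups) = genes x groups
counts %*% design
##       control treated
## gene1      15      35
## gene2      10      30
```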
This is going to be a really short blog post. I recently found that if I join two tables and one of them has duplicated rows, the joined table contains the duplicated rows too. This may be the expected behavior for others, but I want to make a note here for myself.
```r
library(tidyverse)

df1 <- tibble(key = c("A", "B", "C", "D", "E"),
              value = 1:5)
df1
## # A tibble: 5 x 2
##   key   value
##   <chr> <int>
## 1 A         1
## 2 B         2
## 3 C         3
## 4 D         4
## 5 E         5
```

Dataframe 2 has two identical rows for B.
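The excerpt cuts off here; below is a minimal sketch of what df2 and the join might look like. The column name `value2` and the numbers are my invention; only "two identical rows for B" comes from the post.

```r
# Hypothetical df2 with two identical rows for key "B"
df2 <- tibble(key = c("A", "B", "B", "C"),
              value2 = c(10, 20, 20, 30))

# left_join() keeps every matching row in df2, so "B" appears
# twice in the result: the duplication carries through the join
left_join(df1, df2, by = "key")
## # A tibble: 6 x 3
##   key   value value2
##   <chr> <int>  <dbl>
## 1 A         1     10
## 2 B         2     20
## 3 B         2     20
## 4 C         3     30
## 5 D         4     NA
## 6 E         5     NA
```

If the duplication is unwanted, running `distinct(df2)` before the join removes the repeated rows.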
Data backup is an essential step in the data analysis life cycle, as shown in the picture below, taken from DataONE.
There are many important things you may want to back up: your raw and processed data, your code, and your dot configuration files. For every project, I version control my scripts (not the data) with git and push them to GitHub or GitLab as a backup, but big files cannot be hosted on GitHub or GitLab.
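As a sketch, a minimal .gitignore for such a project might look like this; the directory names and file extensions are just my assumptions about a typical bioinformatics layout, not from the original post.

```
# Keep scripts in git; exclude large data and results (hypothetical layout)
data/
results/
*.bam
*.fastq.gz
```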