Data
One of my regrets is not learning linear algebra well in college.
I barely passed the exam for it (and for calculus, which was a nightmare :) ).
To be fair...
It was not taught well, and it sounded too boring. I did not know what matrix multiplication was used for, not until…
Many years later, I started to learn bioinformatics.
This is going to be a really short blog post. I recently found that if I join two tables and one of them has duplicated rows, the joined table also contains the duplicated rows. This may be the expected behavior for others, but I want to make a note here for myself.
```r
library(tidyverse)

df1 <- tibble(key = c("A", "B", "C", "D", "E"), value = 1:5)
df1
## # A tibble: 5 x 2
##   key   value
##   <chr> <int>
## 1 A         1
## 2 B         2
## 3 C         3
## 4 D         4
## 5 E         5
```

Dataframe 2 has two identical rows for B.
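To make the behavior concrete, here is a minimal sketch. Note that `df2` and its `group` column are hypothetical, made up for illustration only; they are not the second table from the original post. The sketch loads only dplyr (which re-exports `tibble()`) and assumes a left join on `key`. It shows the duplicated row carrying through the join, and one possible way to avoid it by dropping exact duplicates first with `distinct()`.

```r
library(dplyr)

df1 <- tibble(key = c("A", "B", "C", "D", "E"), value = 1:5)

# Hypothetical second table (an assumption for illustration):
# the row for key "B" appears twice and is identical both times.
df2 <- tibble(
  key   = c("A", "B", "B", "C"),
  group = c("x", "y", "y", "z")
)

# Each matching row in df2 produces a row in the result,
# so key "B" shows up twice in the joined table.
joined <- left_join(df1, df2, by = "key")
nrow(joined)   # 6 rows, one more than df1

# Dropping exact duplicate rows before joining keeps one row per key.
deduped <- left_join(df1, distinct(df2), by = "key")
nrow(deduped)  # 5 rows, same as df1
```

Whether to deduplicate before joining depends on the data: `distinct()` only removes rows that are identical in every column, so rows that share a key but differ elsewhere would still multiply.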
Data backup is an essential step in the data analysis life cycle, as shown in the picture below, taken from DataOne.
There are many important things you may want to back up: your raw and processed data, your code, and your dot configuration files. For every project, I use git to version control my scripts (not the data) and push them to GitHub or GitLab as a backup. Big files, however, cannot be hosted on GitHub or GitLab.