To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.
R or Python for Bioinformatics? Watch the video here:
If you need to pick Python or R for bioinformatics, which one should you choose? This is a decades-old question from many beginners.
This is my story.
I started learning Unix commands 12 years ago (see an example of how powerful Unix commands can be).
The other day, I saw this tweet:
Machine learning and bioinformatics tutorials these days pic.twitter.com/0FhWWG09TB
(Ramon Massoni Badosa, @rmassonix, May 15, 2024)
Many bioinformatics tutorials are like that. I am not saying the tutorial is bad, but beginners need something more basic first before they can follow it.
In R, S3 and S4 objects are related to object-oriented programming (OOP), which allows you to create custom data structures with associated behaviors and methods. Let me explain them using simple language and metaphors, along with practical examples.
S3 Objects
Imagine you have a collection of toys, like cars, dolls, and action figures. Each toy has its own set of properties (color, size, material) and behaviors (move, make sounds, etc.
I am asked this question very often: “Tommy, what p-value cutoff should I use to determine the differentially expressed genes, and what log2 fold change cutoff should I use?”
For single-cell RNAseq quality control, what’s the cutoff for mitochondrial content?
My answer is always: it depends. I often joke that determining a cutoff is 90% of the work a bioinformatician does.
Why is that?
Biology is more than just statistics.
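To make the "it depends" point concrete, here is a minimal sketch of how the list of differentially expressed genes changes with the cutoffs. The gene names, values, and both sets of thresholds are made up for illustration; they are not recommendations.

```python
# Hypothetical differential expression results; names and numbers are invented.
results = [
    {"gene": "GeneA", "log2FoldChange": 2.5, "padj": 0.001},
    {"gene": "GeneB", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "GeneC", "log2FoldChange": -1.8, "padj": 0.03},
    {"gene": "GeneD", "log2FoldChange": 3.0, "padj": 0.20},
]


def significant_genes(rows, padj_cutoff, lfc_cutoff):
    """Return genes passing both the adjusted p-value and |log2FC| cutoffs."""
    return [
        r["gene"]
        for r in rows
        if r["padj"] < padj_cutoff and abs(r["log2FoldChange"]) >= lfc_cutoff
    ]


# Two arbitrary choices of cutoffs, picked only to show the effect.
strict = significant_genes(results, padj_cutoff=0.01, lfc_cutoff=1.0)
lenient = significant_genes(results, padj_cutoff=0.05, lfc_cutoff=0.0)

print("strict: ", strict)
print("lenient:", lenient)
```

The same table gives a different gene list under each choice, so the cutoff has to be justified by the biology and the downstream use, not by the statistics alone.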
Context and Problem
In scRNA-seq, each cell is sequenced individually, allowing for the analysis of gene expression at the single-cell level. This provides a wealth of information about cellular identities and states. However, the high dimensionality of the data (thousands of genes) and its technical noise make it challenging to cluster the cells accurately. Over-clustering is one such challenge, where cells that are biologically similar are split into distinct clusters.
In the last blog post, I showed you how to use salmon to get counts from fastq files downloaded from GEO. In this post, I am going to show you how to read the .sf salmon quantification files into R, how to get the tx2gene.txt file, and how to run DESeq2 for differential gene expression analysis. Let’s dive in!
library(tximport)
library(dplyr)
library(ggplot2)
files<- list.
Install fastq-dl
To easily download fastq files from GEO or ENA, use fastq-dl.
Assuming you already have conda installed, do the following:
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n fastq_download -c conda-forge -c bioconda fastq-dl
conda activate fastq_download
Tip: use mamba if conda is too slow for you. They are all big snakes!!
We will use bulk RNAseq data from this GEO accession ID: https://www.
In single-cell RNAseq analysis, there is a step to find the marker genes for each cluster. The output from Seurat FindAllMarkers has a column called avg_log2FC. It is the gene expression log2 fold change between cluster x and all other clusters.
How is that calculated? In a tweet thread, Lior Pachter pointed out a discrepancy in the logFC values between Seurat and Scanpy. Actually, both Scanpy and Seurat calculate it wrong.
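As a rough illustration of why such discrepancies can arise, the sketch below computes a log2 fold change in two ways from the same log-normalized values: averaging the log values first, or undoing the log, averaging, and then taking the log of the ratio. The data, the function names, and the pseudocount of 1 are all assumptions made for this example; this is not a reproduction of what Seurat or Scanpy actually do.

```python
import math

# Hypothetical log1p-normalized expression of one gene (invented numbers).
cluster_x = [0.0, 0.0, 3.0, 3.0]    # cells in cluster x
other_cells = [1.0, 1.0, 1.0, 1.0]  # all remaining cells


def mean(values):
    return sum(values) / len(values)


def lfc_mean_of_logs(a, b):
    """Difference of mean log1p values, converted from natural log to log2."""
    return (mean(a) - mean(b)) / math.log(2)


def lfc_log_of_means(a, b, pseudocount=1.0):
    """Undo log1p, average on the original scale, then log2 of the ratio."""
    mean_a = mean([math.expm1(v) for v in a])
    mean_b = mean([math.expm1(v) for v in b])
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


print("mean of logs:", round(lfc_mean_of_logs(cluster_x, other_cells), 3))
print("log of means:", round(lfc_log_of_means(cluster_x, other_cells), 3))
```

The two numbers differ for the same gene and the same cells, because the order of averaging and taking the log matters, and so does where the pseudocount is added.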
There are some hidden gems beyond the typical programming skills that have been instrumental in my journey. These are the often-overlooked yet crucial practices that have empowered me to tackle challenges and make sense of data in meaningful ways.
Firstly, let’s talk about patience. It’s not as glamorous as diving straight into analysis, but taking the time for thorough quality control is invaluable. Before you get carried away, understand the experimental design.
Everyone is unique. Only you can tell your own story, and I have realized that no matter how many times I have told mine, I have to tell it again, again, again, and again. Because no matter how many times I tell it, there is always someone who hears my story for the first time. I hope it can inspire more people every time I tell it.
Rewind 37 years, to 1986.