Bioinformatics is not (just) statistics

Mar 27, 2024 4 min read bioinformatics

I was asked this question very often: “Tommy, what’s the p-value cutoff should I use to determine the differentially expressed genes; what log2 Fold change cutoff should I use too?”

For single-cell RNAseq quality control, what’s the cutoff for mitochondrial content?

My answer is always: it depends. I was joking: determining a cutoff is 90% of the work a bioinformatician does.

Why is that?

Biology is more than just statistics. Several examples:

when you have gazillions of data points, the p-value will be inherently small.

This is especially true for single-cell RNAseq marker gene identification. We may have thousands of cells in each cluster and the p-value will be inherently small. Moreover, we are double-dipping as we cluster first and then test differences in the clusters identified which causes the p-values to be even smaller. You probably see many marker genes with p values of 10^-10. We may want to focus on the effect size (the magnitude of changes). e.g. log2 Fold change.

This applies to any dataset with large sample size.

Reminder: You will get small p-values when your the number of data points is large https://t.co/wqx6JxRtGH
— Ming "Tommy" Tang (@tangming2005) February 5, 2022

and when you calcuate correlations too:

Question: if you have tens of thousands of data points with a correlation of 0.2 and a p-value 10^-11. Is it meaningful to show that? you always get a tiny p-value when you have a lot of data points. pic.twitter.com/lVQKOJGnfW
— Ming "Tommy" Tang (@tangming2005) September 28, 2020

On the other hand, what cutoffs should one use for the log2Fold change? Traditionally, people uses log2Fold of 1 which is 2 fold change. Again, I would argue this is quite subjective and your dataset may have very few number of genes that pass that cutoff and you want to relax the cutoff. The whole idea of using those cutoffs is to narrow down the gene list so one can pick one of them, do experiments to validate them and build a story for a paper.

Is 50% of increase of 50% of decrease of gene expression important or not? It depends on the genes. For example, X chromosome inactivation escaping can cause genes to increase 50% of some gene product in females (XX in female vs XY in male). It causes diseases.

Beta-thalassemia is an autosomal recessive disorder caused by mutations in the HBB (beta-globin) gene. This gene encodes the beta-globin subunit of hemoglobin, the oxygen-carrying protein in red blood cells. Mutations in HBB lead to reduced or absent production of functional beta-globin chains, resulting in ineffective erythropoiesis and hemolytic anemia. A 50% decrease in HBB gene expression can lead to a deficiency of beta-globin chains, causing the clinical manifestations of beta-thalassemia, including anemia, splenomegaly, and skeletal abnormalities.

Mitochondrial and ribosomal genes content are quality control metrics for single-cell RNAseq data.

As more mitochondrial gene expression indicates dying cells. What’s the cutoff one should use? Some use 5%, some use 10%, others may use 20%. It is all heuristic.

Indeed, there are tools such as miQC which uses the data to determine a cutoff. However, some tissues or cells types express high levels of mitochondrial and ribosomal genes because of their biology.

metabolically active tissues (e.g., muscle, kidney) have higher mitochondrial transcript content

For example, naive poised T cells are known to have higher ribosomal content, as are malignant cells.

So next time, if you ask me again, I will answer: it depends :)

references

bioinformatics

Bioinformatics is not (just) statistics

references

Related