The Most Common Mistake In Bioinformatics, one-off error

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

In my last blog post, I talked about some common bioinformatics mistakes.

Today, we are going to talk about THE MOST common bioinformatics mistake people make. And I think it deserves a separate post about it. Even some experienced programmers get it wrong and the mistake prevails in many bioinformatics software:

The one-off mistake!!

What is that?

Genomics files such as bed, vcf, gtf etc come with a coordinate system. To describe a region a position in a genome, you specify something like chr3, 4-7. This means the region is on chromosome 3 and from base 4 to 7. But there are more nuances here..

There are the 0-based and 1-based coordinates systems in genomics:

The 0-based starts with 0 and the 1-based begins with 1. Why it matters?

Different file formats (THE bioiFORMATics problem!!) use different systems:

So if you are not careful, you will make one-off mistakes.

I highly recommend you to read this old post on biostars: https://www.biostars.org/p/84686/

To make things worse :) you know R is 1-based and Python is 0-based!!

If you use different languages to analyze the data, you must also be careful.

some files such as bed file is 0 based. Two genomic regions:

chr1 0 1000

chr1 1000 2000

when you import that bed file into R using rtracklayer::import(), it will become

chr1 1 1000

chr1 1001 2000

The function converts it to 1 based internally (R is 1 based unlike Python).

When you read the bed file with read.table and use GenomicRanges::makeGRangesFromDataFrame() to convert it to a GRanges object, do not forget to add 1 to the start before doing it.

Similarly, when you write a GRanges object to disk using rtracklayer::export, you do not need to worry, R will convert it back to 0 based in the file. However, if you make a dataframe out of the GRanges object, and write that dataframe to file, remember to do start -1 before doing it.

I hope you learned something new!

Further reading: UCSC coordinate counting system https://genome-blog.gi.ucsc.edu/blog/2016/12/12/the-ucsc-genome-browser-coordinate-counting-systems/

Happy Learning!

Tommy

Related

Next
Previous
comments powered by Disqus