The Most Common Stupid Mistakes In Bioinformatics

Aug 13, 2024 8 min read unix, bioinformatics

This post is inspired by a popular thread on https://www.biostars.org/.

Common mistakes in general

Off-by-One Errors:
- Mistakes occur when switching between different indexing systems. For example, BED files are 0-based while GFF/GTF files are 1-based, leading to potential misinterpretations of genomic coordinates.

This is one of the most common mistakes! I highly recommend reading this tutorial: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems.
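
To make the difference concrete, here is a minimal sketch that converts one BED interval to 1-based, closed coordinates as used by GFF/GTF. The interval itself (chr1, 99, 200) is made up for illustration.

```shell
# Hypothetical BED record: 0-based start, end not included in the interval.
bed_line=$(printf 'chr1\t99\t200')

# 1-based, fully closed coordinates (as in GFF/GTF): start + 1, end unchanged.
one_based=$(printf '%s\n' "$bed_line" | awk 'BEGIN { OFS = "\t" } { print $1, $2 + 1, $3 }')

# The feature length is the same either way:
# end - start in BED, end - start + 1 in GFF/GTF.
bed_length=$(printf '%s\n' "$bed_line" | awk '{ print $3 - $2 }')

printf '%s\n' "$one_based"
echo "length: $bed_length"
```

Forgetting the + 1 (or adding it twice) shifts every feature by one base, which is easy to miss until an intersection returns slightly wrong results.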

Switching Between Programming Languages:
- Indexing errors happen when a developer switches between languages with different base indexes. Python and most modern languages use 0-based indexing, whereas R and Lua are 1-based.
See an example here: Three gotchas when using R for Genomic data analysis
Incorrect Chromosome Sorting:
- Assuming alphabetical order instead of natural sort leads to chr10 being listed before chr2. Consider implementing natural sorting to avoid this issue.
If you do want natural sorting, see https://gist.github.com/crazyhottommy/e778ceb39cebfa20739a
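
As a quick illustration of the two orderings, here is a small sketch that assumes GNU coreutils, whose sort has a -V (version sort) option. The chromosome names are made up, and this is not the approach used in the gist above.

```shell
# Made-up chromosome names in arbitrary order.
chroms=$(printf 'chr10\nchr2\nchr1\nchrX\n')

# Plain lexical sort compares character by character, so chr10 sorts before chr2.
lexical=$(printf '%s\n' "$chroms" | LC_ALL=C sort)

# Version sort (-V, a GNU coreutils extension) compares runs of digits as numbers.
natural=$(printf '%s\n' "$chroms" | LC_ALL=C sort -V)

echo "lexical: $(echo $lexical)"
echo "natural: $(echo $natural)"
```

For a BED file, the same idea would look like sort -k1,1V -k2,2n file.bed, again assuming GNU sort.
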
Regex Errors:
- Errors arise from constructing regular expressions incorrectly, leading to failure in pattern matching, which can result in missed or incorrect data extraction.
Incorrect File Parsing:
- Complex file formats like BLAST output or GenBank records require precise parsing rules. Errors occur when the format specification is misunderstood or files are parsed incorrectly. Do not reinvent the wheel! I have seen people write their own FASTQ parsers. Use a well-tested library.
Strand Orientation and Sequence Reversal:
- Not accounting for the strand direction can result in incorrect data interpretation, such as failing to reverse complement sequences when required.
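
For a sense of what "reverse complement" means in practice, here is a toy sketch using rev and tr. The function name revcomp and the input sequence are made up, and rev is assumed to be available (it ships with util-linux). For real analyses a well-tested library is the safer choice.

```shell
# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
# Other characters pass through untouched, which is right for N
# but not for the other IUPAC ambiguity codes.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

minus_strand=$(revcomp "ATGCCN")
echo "$minus_strand"
```
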
Loop and File End Errors:
- When looping through files, especially if the last line lacks an end-of-line character, logic errors can cause the final record to be skipped.
Operating System Line Breaks:
- Line break conventions vary across operating systems: Windows ends lines with CRLF (\r\n), while Unix-like systems use LF (\n). Failing to handle these differences can cause issues when reading or writing files across platforms. dos2unix is your friend. I have been bitten by Windows line endings many times!
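
If dos2unix is not installed, the check and the fix can be sketched with grep and tr. The file names below are made up, and tr -d '\r' removes every carriage return, which is only roughly what dos2unix does for plain CRLF text.

```shell
# Build a small file with Windows (CRLF) line endings in a temporary directory.
workdir=$(mktemp -d)
printf 'sample1\tcontrol\r\nsample2\ttreated\r\n' > "$workdir/samples_dos.tsv"

# Count the lines that contain a carriage return.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$workdir/samples_dos.tsv")

# Strip the carriage returns into a new file.
tr -d '\r' < "$workdir/samples_dos.tsv" > "$workdir/samples_unix.tsv"
remaining=$(grep -c "$cr" "$workdir/samples_unix.tsv" || true)

echo "before: $crlf_lines lines with CR, after: $remaining"
rm -r "$workdir"
```
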
Selecting Incorrect Genomic Assemblies:
- Mistakenly using the wrong assembly, annotation, or release version can lead to inaccurate analysis results. For example, double-check whether the human genome build is hg19 or hg38. If you aligned your FASTQ reads to hg19 and then view them against hg38 in the UCSC Genome Browser or IGV, you will be left wondering why the coverage does not line up with the exons!
Managing Multiple File Versions:
- Using outdated or incorrect file versions without clear version tracking may lead to inconsistent analysis outcomes. This is one of the core problems of reproducible computing. Always version control your files (perhaps with Git LFS for large files)!
Handling Nested Genome Annotations:
- Complex annotations, such as nested genes, need careful handling to avoid missing or double-counting features. Some different genes may have overlapping exons or introns.
Data Randomization and Statistical Tests:
- Not properly randomizing data or misusing statistical tests can lead to biased results and incorrect conclusions.
Poor Documentation Practices:
- Failing to fully document methods and procedures makes it difficult to review and correct errors, and hinders reproducibility and collaboration.

Some command line mistake examples

Here are some of the common mistakes when using command line tools for bioinformatics tasks.

Using `rm *` in the wrong directory

Mistake: Running rm * without checking the directory.

What you meant to do: Delete files in a specific subdirectory.
Actual Command: rm * (in the wrong directory).
Correction: Navigate to the correct directory first:
```
cd target_directory
rm *
```
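
One weak spot in the two-line version: if cd fails (for example because of a typo), rm * still runs in whatever directory you are in. Here is a minimal sketch of two safer habits, run in a throwaway directory. All names are made up, including the deliberately misspelled target_directroy.

```shell
# Demo in a throwaway directory so nothing real is at risk.
sandbox=$(mktemp -d)
mkdir "$sandbox/target_directory"
touch "$sandbox/keep_me.txt" "$sandbox/target_directory/old1.tmp" "$sandbox/target_directory/old2.tmp"

# Typo in the directory name: cd fails, and because of && the rm never runs.
(
    cd "$sandbox" || exit 1
    cd target_directroy 2>/dev/null && rm -- * || echo "cd failed, nothing deleted"
)
after_typo=$(ls "$sandbox")

# Alternative: give rm the full path, so the current directory does not matter.
rm -- "$sandbox"/target_directory/*
left_in_target=$(ls -A "$sandbox/target_directory")

echo "after typo: $(echo $after_typo)"
rm -r "$sandbox"
```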

Mistaking `>` for `>>`

Mistake: Using > instead of >> to append to a file. > will overwrite the file.

What you meant to do: Append to a file.
Actual Command: command > file
Correction: Use >> for appending:
```
command >> file
```

Misspelling file extensions

Mistake: Incorrect file extension.

What you meant to do: Delete .fastq files.
Actual Command: rm *.fasq
Correction: Verify the extension:
```
rm *.fastq
```

Path misconfiguration

Mistake: Executing a command in a misconfigured environment.

What you meant to do: Use a tool installed in a different path.
Actual Command: myfancytool
Correction: Update your $PATH variable or use absolute path:
```
/usr/local/bin/myfancytool
```

Incorrect use of file wildcard

Mistake: A wildcard pattern that matches more than intended. *txt matches any name ending in txt, with or without the dot.

What you meant to do: Delete only .txt files.
Actual Command: rm *txt
Correction: Correct the wildcard pattern:
```
rm *.txt
```

Removing FASTA files with an unintentional space

Mistake: Accidental space.

What you meant to do: Remove all FASTA files.
Actual Command: rm -rf * .fasta (this removes all files!)
Correction: Make sure there is no space between * and .fasta:
```
rm -rf *.fasta
```
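
A habit that catches this kind of slip is to look at what the pattern expands to before handing it to rm. A minimal sketch in a throwaway directory, with made-up file names:

```shell
sandbox=$(mktemp -d)
touch "$sandbox/a.fasta" "$sandbox/b.fasta" "$sandbox/notes.txt"

# What rm would receive with the stray space: every file, plus a literal ".fasta".
with_space=$(cd "$sandbox" && echo * .fasta)

# What rm would receive without the space: only the FASTA files.
without_space=$(cd "$sandbox" && echo *.fasta)

echo "with space:    $with_space"
echo "without space: $without_space"
rm -r "$sandbox"
```

Replacing rm with echo (or ls) for a dry run costs a few seconds and shows exactly which names are about to be deleted.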

Forgetting `-r` with `rm`

Mistake: Forgetting recursive flag for directories.

What you meant to do: Delete a directory.
Actual Command: rm directory
Correction: Use -r for directories:
```
rm -r directory
```

Not escaping special characters

Mistake: Forgetting to escape special characters.

What you meant to do: Search for * in files.
Actual Command: grep * file
Correction: Escape the character:
```
grep \* file
```

Overwriting important files

Mistake: Overwriting important data files.

What you meant to do: Save output to a temporary file.
Actual Command: command > important_file
Correction: Use a temporary filename:
```
command > tempfile
```
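
The shell can also guard against this for you. Here is a minimal sketch of the noclobber option, which makes > refuse to overwrite an existing file. The file name and contents are made up.

```shell
workdir=$(mktemp -d)
echo "precious results" > "$workdir/important_file"

# With noclobber on, > refuses to overwrite an existing file.
set -o noclobber
if ! echo "oops" > "$workdir/important_file"; then
    echo "refused to overwrite important_file"
fi

# >> still appends as usual; >| is the explicit "yes, overwrite" form.
echo "appended line" >> "$workdir/important_file"
set +o noclobber

contents=$(cat "$workdir/important_file")
printf '%s\n' "$contents"
rm -r "$workdir"
```

Putting set -o noclobber in your ~/.bashrc turns this on for every interactive session.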

Using `cat` for large files

Mistake: Using cat for very large files.

What you meant to do: Preview content of a large file.
Actual Command: cat largefile
Correction: Use less or head/tail:
```
less largefile
```

Tip: I usually use less -S largefile so that long lines are not wrapped.

Incorrect `find` syntax

Mistake: Incorrect parameters with find.

What you meant to do: Find .txt files.
Actual Command: find . -name *txt
Correction: Use quotes properly:
```
find . -name "*.txt"
```
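
Why the quotes matter: without them, the shell expands the pattern against the current directory before find ever sees it. A small sketch with made-up files shows the difference.

```shell
sandbox=$(mktemp -d)
mkdir "$sandbox/sub"
touch "$sandbox/a.txt" "$sandbox/sub/b.txt"

# Unquoted: the shell expands *txt to a.txt first, so find only looks for that one name.
unquoted=$(cd "$sandbox" && find . -name *txt | sort)

# Quoted: find receives the pattern itself and applies it in every directory.
quoted=$(cd "$sandbox" && find . -name "*.txt" | sort)

echo "unquoted: $(echo $unquoted)"
echo "quoted:   $(echo $quoted)"
rm -r "$sandbox"
```

With two or more matching files in the current directory, the unquoted form typically fails with an error instead, because find receives extra arguments it cannot interpret.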

Misunderstanding `chmod`

Mistake: Setting overly broad file permissions. 777 lets every user read, write, and execute the file.

What you meant to do: Make a file executable.
Actual Command: chmod 777 file
Correction: Use appropriate permissions:
```
chmod +x file
```
If you only want the owner to have execute permission:
```
chmod u+x file
```

Each digit stands for one class of users: user (owner), group, and other.

chmod 754 myfile means the user has read, write, and execute permission; members of the same group have read and execute permission but no write permission; everyone else has only read permission.

4 stands for “read”,
2 stands for “write”,
1 stands for “execute”, and
0 stands for “no permission.”

So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).

The numbers can be hard to remember, so you can use letters instead: u, g, and o stand for “user”, “group”, and “other”; r, w, and x stand for “read”, “write”, and “execute”, respectively.

For example:

chmod u+x myfile
chmod g+r myfile
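
To check that the octal and symbolic forms do what the arithmetic above says, here is a small sketch on a scratch file. It assumes GNU coreutils, where stat -c '%a' prints the mode in octal (other systems use a different stat syntax).

```shell
workdir=$(mktemp -d)
touch "$workdir/myfile"

# Octal form: 7 = 4+2+1 (rwx) for user, 5 = 4+1 (r-x) for group, 4 (r--) for other.
chmod 754 "$workdir/myfile"
octal_mode=$(stat -c '%a' "$workdir/myfile")   # stat -c is the GNU coreutils form

# Symbolic form: start from read/write for the owner only, then add execute for the owner.
chmod 600 "$workdir/myfile"
chmod u+x "$workdir/myfile"
symbolic_mode=$(stat -c '%a' "$workdir/myfile")

echo "after chmod 754: $octal_mode"
echo "after chmod 600 and u+x: $symbolic_mode"
rm -r "$workdir"
```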

grep “>” without quotes

Mistake: Not quoting the > sign.

What you meant to do: Search for “>” in a FASTA file.
Actual Command: grep > some.fasta (the shell treats > as a redirection and overwrites some.fasta with nothing)
Correction: Quote the > sign:
```
grep '>' some.fasta
```
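
This one deserves a demonstration because the damage is silent. The sketch below uses a made-up four-line FASTA file in a temporary directory and shows both the quoted and the unquoted form.

```shell
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/some.fasta"

# Quoted: '>' is passed to grep as the pattern; -c counts the matching lines.
n_headers=$(grep -c '>' "$workdir/some.fasta")

# Unquoted: the shell treats > as a redirection, empties the file,
# and grep runs with no pattern at all (its usage error is discarded here).
grep > "$workdir/some.fasta" 2>/dev/null || true
if [ -s "$workdir/some.fasta" ]; then file_state="intact"; else file_state="empty"; fi

echo "headers counted: $n_headers"
echo "file after unquoted grep: $file_state"
rm -r "$workdir"
```

Anchoring the pattern as '^>' restricts the match to lines that start with >, which is what a FASTA header is.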

Forgetting about hidden files

Mistake: Not considering hidden files when deleting.

What you meant to do: Delete all files in a directory.
Actual Command: rm *
Correction: Include hidden files:
```
rm * .*
```

Hidden files start with a dot (.). Depending on the shell, .* can also match . and .., which rm refuses to remove.
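
An alternative that avoids the .* pattern altogether is to let find do the deleting. A minimal sketch in a throwaway directory with made-up file names; -mindepth 1 keeps the directory itself, and -delete is supported by both GNU and BSD find.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/visible.txt" "$sandbox/.hidden_config"

# By default * skips names that start with a dot.
visible_only=$(cd "$sandbox" && echo *)

# ls -A lists dotfiles too, but leaves out . and ..
everything=$(ls -A "$sandbox")

# Delete everything inside the directory, hidden or not, but keep the directory.
find "$sandbox" -mindepth 1 -delete
left=$(ls -A "$sandbox")

echo "matched by *: $visible_only"
echo "actually there: $(echo $everything)"
rmdir "$sandbox"
```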

Incorrect argument order in `tar`

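Mistake: Wrong flags or flag order in tar.

What you meant to do: Extract a gzip-compressed tarball.
Actual Command: tar -xvf file.tar.gz -C directory
Correction: Add -z for gzip and keep f as the last letter of the flag group, so the archive name follows it directly (recent GNU tar detects the compression on its own, but older or other tar implementations may need -z):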
```
tar -xvzf file.tar.gz -C directory
```

I have to google the flags every time for different compressed formats. Use the function below instead:

#!/bin/bash
# function Extract for common file formats

function extract {
 if [ -z "$1" ]; then
    # display usage if no parameters given
    echo "Usage: extract <path/file_name>.<zip|rar|bz2|gz|tar|tbz2|tgz|Z|7z|xz|lzma|exe|tar.bz2|tar.gz|tar.xz>"
 else
    if [ -f "$1" ] ; then
        NAME=${1%.*}
        #mkdir $NAME && cd $NAME
        case "$1" in
          *.tar.bz2)   tar xvjf ./"$1"    ;;
          *.tar.gz)    tar xvzf ./"$1"    ;;
          *.tar.xz)    tar xvJf ./"$1"    ;;
          *.lzma)      unlzma ./"$1"      ;;
          *.bz2)       bunzip2 ./"$1"     ;;
          *.rar)       unrar x -ad ./"$1" ;;
          *.gz)        gunzip ./"$1"      ;;
          *.tar)       tar xvf ./"$1"     ;;
          *.tbz2)      tar xvjf ./"$1"    ;;
          *.tgz)       tar xvzf ./"$1"    ;;
          *.zip)       unzip ./"$1"       ;;
          *.Z)         uncompress ./"$1"  ;;
          *.7z)        7z x ./"$1"        ;;
          *.xz)        unxz ./"$1"        ;;
          *.exe)       cabextract ./"$1"  ;;
          *)           echo "extract: '$1' - unknown archive method" ;;
        esac
    else
        echo "'$1' - file does not exist"
    fi
fi
}

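Add it to your ~/.bashrc (or save it as a file and source it from there), and you can then extract any of these formats without remembering the syntax. To use it as a standalone script in /usr/local/bin instead, add a final line that calls extract "$@" and make the file executable with chmod u+x extract.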

Misuse of `cut` without delimiter

Mistake: Using cut without specifying delimiter.

What you meant to do: Extract a column from a CSV.
Actual Command: cut -f2 file.csv
Correction: Specify the delimiter:
```
cut -d, -f2 file.csv
```
By default, cut uses tab as the delimiter.
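
What makes this mistake easy to miss is that cut does not complain. A line that contains no delimiter is printed whole. A tiny sketch with a made-up CSV line:

```shell
csv_line='geneA,12.5,up'

# Default delimiter is tab; the line has no tab, so cut prints it unchanged.
default_cut=$(printf '%s\n' "$csv_line" | cut -f2)

# With -d, the second comma-separated field is returned.
comma_cut=$(printf '%s\n' "$csv_line" | cut -d, -f2)

echo "cut -f2:     $default_cut"
echo "cut -d, -f2: $comma_cut"
```

Note that cut splits on every comma, so it is only safe for simple CSV files without quoted fields that contain commas.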

Overwriting `.bashrc`

Mistake: Using > to update .bashrc.

What you meant to do: Append to .bashrc.
Actual Command: echo "export PATH=$PATH:/new/path" > ~/.bashrc

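Correction: Use >> for appending, and single quotes so that $PATH is written literally instead of being expanded to its current value:

echo 'export PATH="$PATH:/new/path"' >> ~/.bashrc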
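
Quoting matters here as well as the redirection operator. The sketch below writes to a scratch file instead of the real ~/.bashrc and shows what each quoting style actually puts in the file.

```shell
# Write to a scratch file instead of the real ~/.bashrc.
scratch_rc=$(mktemp)

# Double quotes: $PATH is expanded now, so its current value is frozen into the file.
echo "export PATH=$PATH:/new/path" >> "$scratch_rc"

# Single quotes: the literal text $PATH is written and expanded each time the file is sourced.
echo 'export PATH="$PATH:/new/path"' >> "$scratch_rc"

frozen_line=$(sed -n '1p' "$scratch_rc")
literal_line=$(sed -n '2p' "$scratch_rc")
cat "$scratch_rc"
rm "$scratch_rc"
```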

Misinterpreting `awk` syntax

Mistake: Incorrect awk syntax.

What you meant to do: Print the second column of a file.
Actual Command: awk {print $2} file
Correction: Put the awk program in single quotes so the shell does not interpret the braces or $2:
```
awk '{print $2}' file
```

Forgetting `-p` with `mkdir`

Mistake: Not using -p with mkdir.

What you meant to do: Create nested directories.
Actual Command: mkdir /path/to/new/directory
Correction: Use -p to create intermediate directories:
```
mkdir -p /path/to/new/directory
```

If the intermediate folders (to, new) do not exist, plain mkdir will fail with an error. Use mkdir -p instead.

Incorrect use of `|` (pipe)

Mistake: Misplaced pipe operator.

What you meant to do: Chain commands with a pipe.
Actual Command: command1 | command2 | > outputfile
Correction: Remove redundant |:
```
command1 | command2 > outputfile
```
Fun fact: |> is the built-in pipe in R.

What are your mistakes? Leave a comment!
