Obtain metadata for public datasets in GEO

There are so many public datasets out there waiting for us to mine! That is both a blessing and a curse for a computational biologist!

Metadata, the data that describe your data (e.g., whether a sample came from a responder or a non-responder to the treatment), are critical for interpreting an analysis. Without metadata, your data are useless.

People usually go to GEO or ENA to download public data. I asked on Twitter how to get the metadata that goes with it, and I will show you the approaches suggested by all the awesome tweeps. Thanks!

Use SRA run selector

Go to https://www.ncbi.nlm.nih.gov/Traces/study/ and type in the accession number.

Click Metadata under the Download column, and a SraRunTable.txt file will be downloaded.
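Once you have the file, a few lines of Python are enough to see which runs belong to which group. This is only a minimal sketch: it assumes the table is comma-separated with a header row and a Run column, the helper name group_runs_by is mine, and the sample rows (including the treatment_response column) are made up for illustration. Check the header of your own file first, since the columns differ from study to study.

```python
import csv
import io
from collections import defaultdict


def group_runs_by(table_text, column, run_column="Run"):
    """Map each value found in `column` to the run accessions that carry it."""
    # csv.DictReader copes with quoted fields that contain commas.
    reader = csv.DictReader(io.StringIO(table_text))
    groups = defaultdict(list)
    for row in reader:
        groups[row[column]].append(row[run_column])
    return dict(groups)


# Made-up rows for illustration only; a real table has many more columns.
sample = (
    "Run,Assay Type,treatment_response\n"
    "SRR0000001,RNA-Seq,responder\n"
    "SRR0000002,RNA-Seq,non-responder\n"
    "SRR0000003,RNA-Seq,responder\n"
)

print(group_runs_by(sample, "treatment_response"))

# With a downloaded file (path and column name are assumptions):
# with open("SraRunTable.txt", newline="") as handle:
#     print(group_runs_by(handle.read(), "Assay Type"))
```

If your copy of the table turns out to be tab-delimited, passing delimiter="\t" to csv.DictReader should be the only change needed.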

Use GEOquery or GEOmetadb

If you want to stay within R, take a look at the Bioconductor packages GEOmetadb and GEOquery.


A Nextflow pipeline: nf-core/fetchngs

Command line tools

Check out ffq and pysradb.

pip install ffq
ffq -t GSE GSE176021

pip install pysradb
pysradb metadata SRP000002 --detailed
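pysradb can also be used from inside Python, which is handy if you want the metadata as a pandas DataFrame rather than as text on the terminal. The sketch below wraps the library's SRAweb class in a small helper; the helper name fetch_study_metadata is my own, and the call needs internet access, so treat it as a starting point rather than a tested recipe.

```python
def fetch_study_metadata(accession, detailed=True):
    """Return run-level metadata for an SRA study as a pandas DataFrame."""
    # Imported inside the function so this file can be loaded
    # even on a machine where pysradb is not installed yet.
    from pysradb.sraweb import SRAweb

    db = SRAweb()  # queries NCBI over the network
    return db.sra_metadata(accession, detailed=detailed)


# Example call (needs internet access and `pip install pysradb`):
# df = fetch_study_metadata("SRP000002")
# df.to_csv("SRP000002_metadata.tsv", sep="\t", index=False)
```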

Other resources

