Obtain metadata for public datasets in GEO

Dec 1, 2021 2 min read bioinformatics, R

There are so many public datasets out there waiting for us to mine! It is both a blessing and a curse for a computational biologist!

Metadata, the data that describe the data (e.g., whether a patient is a responder or non-responder to a treatment), are critical for interpreting an analysis. Without metadata, your data are useless.

People usually go to GEO or ENA to download public data. I asked this question on Twitter, and here I will show you how to get the metadata, as suggested by all the awesome tweeps. Thanks!

how to download GEO metadata again? I remember there is a way to click and download a table with GSM ids and other associated metadata.
— Ming (Tommy) Tang (@tangming2005) November 29, 2021

Use SRA run selector

Go to https://www.ncbi.nlm.nih.gov/Traces/study/ and type in the accession number.

Click Metadata under the Download column, and a SraRunTable.txt file will be downloaded.
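Once the run table is on disk, it can be read like any other delimited file. Here is a minimal Python sketch, assuming the table is comma-separated with a header row. The column names (`Run`, `BioSample`, `Sample Name`, `response`) and all the rows are made up for illustration, so check the header of your own file first.

```python
import csv
import io
from collections import Counter

# Made-up rows in the comma-separated layout of a run table; the column
# names and values are illustrative, not taken from a real study.
SAMPLE_RUN_TABLE = """\
Run,BioSample,Sample Name,response
SRR0000001,SAMN00000001,GSM0000001,responder
SRR0000002,SAMN00000002,GSM0000002,non-responder
SRR0000003,SAMN00000003,GSM0000003,responder
"""


def read_run_table(handle):
    """Return one dict per run, keyed by the header names."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count how many runs carry each value of a metadata column."""
    return Counter(row[column] for row in rows)


rows = read_run_table(io.StringIO(SAMPLE_RUN_TABLE))

# Map each run accession to its sample name (hypothetical columns).
run_to_sample = {row["Run"]: row["Sample Name"] for row in rows}

# Tally runs per value of the hypothetical "response" column.
response_counts = count_by(rows, "response")

print(run_to_sample)
print(response_counts)
```

For a real download, swap the in-memory string for the file, e.g. `with open("SraRunTable.txt", newline="") as fh: rows = read_run_table(fh)`.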

Use GEOquery or GEOmetadb

If you want to stay within R, take a look at the Bioconductor packages GEOmetadb and GEOquery.
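To get a feel for what these packages hand back, it helps to know that a GEO series matrix file stores its sample metadata in tab-separated header lines that start with `!Sample_`. The Python sketch below parses such lines with the standard library only. It is not how GEOquery works internally, just an illustration of the layout, and the excerpt it parses (accessions, titles, characteristics) is made up.

```python
import csv
import io

# Made-up excerpt in the layout of a GEO series matrix header: tab-separated
# lines starting with "!Sample_", one quoted value per sample.
SAMPLE_MATRIX = (
    '!Series_title\t"Hypothetical study"\n'
    '!Sample_title\t"tumor 1"\t"tumor 2"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_characteristics_ch1\t"response: responder"\t"response: non-responder"\n'
    '!Sample_characteristics_ch1\t"tissue: skin"\t"tissue: skin"\n'
)


def parse_sample_metadata(handle):
    """Collect the !Sample_ lines into one dict per sample (GSM id -> fields)."""
    fields = []  # (field name, per-sample values), kept in file order
    for record in csv.reader(handle, delimiter="\t"):
        if record and record[0].startswith("!Sample_"):
            fields.append((record[0][len("!Sample_"):], record[1:]))

    # The geo_accession line provides the GSM id for each column.
    accessions = next(values for name, values in fields if name == "geo_accession")
    samples = {gsm: {} for gsm in accessions}
    for name, values in fields:
        for gsm, value in zip(accessions, values):
            if name.startswith("characteristics") and ": " in value:
                # "key: value" characteristics become their own fields.
                key, _, val = value.partition(": ")
                samples[gsm][key] = val
            else:
                samples[gsm][name] = value
    return samples


metadata = parse_sample_metadata(io.StringIO(SAMPLE_MATRIX))
for gsm, info in metadata.items():
    print(gsm, info)
```

Splitting the `key: value` characteristics into their own fields gives one column per attribute, which is usually the shape you want for a sample sheet.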

Nextflow

A Nextflow pipeline: nf-core/fetchngs

Command line tool

Check out ffq and pysradb.

pip install ffq
ffq -t GSE GSE176021


pip install pysradb
pysradb metadata SRP000002 --detailed

Other resources

SraExplorer
recount3: summaries and queries for large-scale RNA-seq expression and splicing. The paper was recently published: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02533-6
Digital Expression Explorer 2 from Mark Ziemann.
Other databases I curated: https://github.com/crazyhottommy/RNA-seq-analysis#databases

Bioconductor GEO R