There are so many public datasets out there waiting for us to mine! It is both a blessing and a curse for a computational biologist.
Metadata, the data describing the data (e.g., whether a patient is a responder or non-responder to the treatment), are critical for interpreting the analysis. Without metadata, your data are useless.
People usually go to GEO or ENA to download public data. I asked this question on Twitter, and I will show you how to get the metadata as suggested by all the awesome tweeps. Thanks!
how to download GEO metadata again? I remember there is a way to click and download a table with GSM ids and other associated metadata.
— Ming (Tommy) Tang (@tangming2005) November 29, 2021
Use SRA run selector
Go to https://www.ncbi.nlm.nih.gov/Traces/study/ and type in the accession number.
Click Metadata below the Download column, and a SraRunTable.txt file will be downloaded.
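Once the table is downloaded, a few lines of Python are enough to start exploring it. The sketch below assumes the export is comma-separated with a header row and a `Run` column holding the run accessions, so check your own file first. The function names `load_run_table` and `runs_by` are made up for illustration.

```python
import csv
from collections import defaultdict


def load_run_table(path):
    """Read a Run Selector export into a list of dicts, one per run.

    Assumes a comma-separated file with a header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def runs_by(rows, column, run_column="Run"):
    """Group run accessions by the value of one metadata column.

    Rows that lack the column are collected under an empty string.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(column, "")].append(row[run_column])
    return dict(groups)
```

For example, `runs_by(load_run_table("SraRunTable.txt"), "source_name")` would return a dictionary mapping each `source_name` value to its run accessions. The column name `source_name` is only a placeholder here, since the available columns differ from study to study.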
Use GEOquery or GEOmetadb
If you want to stay within R, take a look at GEOmetadb and GEOquery.
Nextflow
A Nextflow pipeline: nf-core/fetchngs
Command line tools
pip install ffq
ffq -t GSE GSE176021
pip install pysradb
pysradb metadata SRP000002 --detailed
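pysradb can also be called from Python, which is handy if you want the metadata as a pandas DataFrame instead of terminal output. Here is a minimal sketch using its `SRAweb` class. The wrapper name `fetch_study_metadata` and its `db` argument are my own additions for illustration; the real call needs network access and `pysradb` installed.

```python
def fetch_study_metadata(accession, db=None):
    """Return detailed run-level metadata for a study accession.

    `db` is an optional, already constructed client (hypothetical
    convenience for reuse); by default a pysradb SRAweb client is made.
    """
    if db is None:
        # Imported lazily so this module still loads without pysradb.
        from pysradb.sraweb import SRAweb

        db = SRAweb()
    # detailed=True corresponds to the --detailed flag of the CLI.
    return db.sra_metadata(accession, detailed=True)
```

Calling `fetch_study_metadata("SRP000002")` should give roughly the same table as the command above, and the resulting DataFrame can be saved with `to_csv` like any other.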
Other resources
- SraExplorer
- recount3: summaries and queries for large-scale RNA-seq expression and splicing. The paper was recently published: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02533-6
- Digital Expression Explorer 2 from Mark Ziemann.
- Other databases I curated: https://github.com/crazyhottommy/RNA-seq-analysis#databases