Do not repeat yourself: List column to do RNAseq differential expression analysis

Last updated on Sep 25, 2025 21 min read tutorial

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

In this blog post, I am going to show you how to use list column and purrr::map(), a powerful toolkit in your belt to avoid repetition in your bioinformatics analysis.

To demonstrate the usage, We will use RNAseq differential expression analysis and pathway enrichment analysis as an example.

Let’s use a real example https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE197576

GSE197576 is a human RNA-seq dataset profiling transcriptional changes in SW480 colorectal cancer cells after knockout of ITPR3 or RELB compared to controls, under both normoxic and hypoxic conditions. The experiment includes duplicate samples for each condition and provides raw gene count matrices, enabling exploration of gene expression responses to both gene knockout and oxygen levels in this cell line

How to download the files from GEO ftp https://www.ncbi.nlm.nih.gov/geo/info/download.html

If use command line wget or curl, download the files here.

Alternatively, use GEOquery https://bioconductor.org/packages/release/bioc/html/GEOquery.html

cd blog_data/
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE197nnn/GSE197576/suppl/GSE197576_raw_gene_counts_matrix.tsv.gz

# if wget is not installed in your computer, use 
curl -O https://ftp.ncbi.nlm.nih.gov/geo/series/GSE197nnn/GSE197576/suppl/GSE197576_raw_gene_counts_matrix.tsv.gz

Standard DESeq2 workflow for a single comparison

read the data into R and make a DESeq2 object

follow the tutorial http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

library(dplyr)
library(readr)
library(here)
# BiocManager::install("DESeq2")
library(DESeq2)
library(purrr)  # for map functions
library(tibble) # for tibble functions
library(stringr)
library(ggplot2)

raw_counts<- read_tsv("~/blog_data/GSE197576_raw_gene_counts_matrix.tsv.gz")
head(raw_counts)

#> # A tibble: 6 × 13
#>   gene        `01_SW_sgCTRL_Norm` `02_SW_sgCTRL_Norm` `03_SW_sgITPR3_1_Norm`
#>   <chr>                     <dbl>               <dbl>                  <dbl>
#> 1 DDX11L1                       0                   0                      0
#> 2 WASH7P                       18                  11                     28
#> 3 MIR6859-1                     5                   1                      6
#> 4 MIR1302-2HG                   0                   0                      0
#> 5 MIR1302-2                     0                   0                      0
#> 6 FAM138A                       0                   0                      0
#> # ℹ 9 more variables: `04_SW_sgITPR3_1_Norm` <dbl>,
#> #   `07_SW_sgRELB_3_Norm` <dbl>, `08_SW_sgRELB_3_Norm` <dbl>,
#> #   `11_SW_sgCTRL_Hyp` <dbl>, `12_SW_sgCTRL_Hyp` <dbl>,
#> #   `13_SW_sgITPR3_1_Hyp` <dbl>, `14_SW_sgITPR3_1_Hyp` <dbl>,
#> #   `17_SW_sgRELB_3_Hyp` <dbl>, `18_SW_sgRELB_3_Hyp` <dbl>

colnames(raw_counts)

#>  [1] "gene"                 "01_SW_sgCTRL_Norm"    "02_SW_sgCTRL_Norm"   
#>  [4] "03_SW_sgITPR3_1_Norm" "04_SW_sgITPR3_1_Norm" "07_SW_sgRELB_3_Norm" 
#>  [7] "08_SW_sgRELB_3_Norm"  "11_SW_sgCTRL_Hyp"     "12_SW_sgCTRL_Hyp"    
#> [10] "13_SW_sgITPR3_1_Hyp"  "14_SW_sgITPR3_1_Hyp"  "17_SW_sgRELB_3_Hyp"  
#> [13] "18_SW_sgRELB_3_Hyp"

There are 12 samples with 6 conditions and each condition has a replicates.

control guide RNA: sgCTRL
ITPR3 knockout: sgITPR3
RELB konckout: sgRELB
low oxygen condition: hypoxia
normal oxygen condition: normoxia

convert the dataframe to a matrix:

raw_counts_mat<- raw_counts[, -1] %>% as.matrix()

rownames(raw_counts_mat)<- raw_counts$gene

head(raw_counts_mat)

#>             01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 03_SW_sgITPR3_1_Norm
#> DDX11L1                     0                 0                    0
#> WASH7P                     18                11                   28
#> MIR6859-1                   5                 1                    6
#> MIR1302-2HG                 0                 0                    0
#> MIR1302-2                   0                 0                    0
#> FAM138A                     0                 0                    0
#>             04_SW_sgITPR3_1_Norm 07_SW_sgRELB_3_Norm 08_SW_sgRELB_3_Norm
#> DDX11L1                        0                   0                   0
#> WASH7P                        20                  18                  25
#> MIR6859-1                      5                   3                  13
#> MIR1302-2HG                    0                   0                   0
#> MIR1302-2                      0                   0                   0
#> FAM138A                        0                   0                   0
#>             11_SW_sgCTRL_Hyp 12_SW_sgCTRL_Hyp 13_SW_sgITPR3_1_Hyp
#> DDX11L1                    0                0                   0
#> WASH7P                    23               45                  44
#> MIR6859-1                  8               12                   0
#> MIR1302-2HG                1                0                   0
#> MIR1302-2                  0                0                   0
#> FAM138A                    0                0                   0
#>             14_SW_sgITPR3_1_Hyp 17_SW_sgRELB_3_Hyp 18_SW_sgRELB_3_Hyp
#> DDX11L1                       0                  0                  0
#> WASH7P                       27                 30                 25
#> MIR6859-1                     8                  8                  5
#> MIR1302-2HG                   0                  0                  0
#> MIR1302-2                     0                  0                  0
#> FAM138A                       0                  0                  0

Differential expression analysis: hypoxia vs normoxia

We will do a single comparison first to compare hypoxia vs normoxia in sgCTRL.

subset the matrix and make a sample sheet:

# column 1,2 are sgCTRL normoxia column 7,8 are sgCTRL hypoxia

raw_counts_mat_sub<- raw_counts_mat[, c(1,2,7,8)]

head(raw_counts_mat_sub)

#>             01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
#> DDX11L1                     0                 0                0
#> WASH7P                     18                11               23
#> MIR6859-1                   5                 1                8
#> MIR1302-2HG                 0                 0                1
#> MIR1302-2                   0                 0                0
#> FAM138A                     0                 0                0
#>             12_SW_sgCTRL_Hyp
#> DDX11L1                    0
#> WASH7P                    45
#> MIR6859-1                 12
#> MIR1302-2HG                0
#> MIR1302-2                  0
#> FAM138A                    0

coldata<- data.frame(condition = c("normoxia", "normoxia", "hypoxia", "hypoxia"))

rownames(coldata)<- colnames(raw_counts_mat_sub)

coldata

#>                   condition
#> 01_SW_sgCTRL_Norm  normoxia
#> 02_SW_sgCTRL_Norm  normoxia
#> 11_SW_sgCTRL_Hyp    hypoxia
#> 12_SW_sgCTRL_Hyp    hypoxia

Make a DEseq2 object

all(rownames(coldata) == colnames(raw_counts_mat_sub))

#> [1] TRUE

keep<- rowSums(raw_counts_mat_sub > 10) >= 2 # 10 counts in at least 2 samples 

raw_counts_mat_sub<- raw_counts_mat_sub[keep, ]

# follow standard DESeq2 workflow
dds <- DESeqDataSetFromMatrix(countData = raw_counts_mat_sub,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))

res

#> log2 fold change (MLE): condition hypoxia vs normoxia 
#> Wald test p-value: condition hypoxia vs normoxia 
#> DataFrame with 14773 rows and 6 columns
#>                baseMean log2FoldChange     lfcSE      stat      pvalue
#>               <numeric>      <numeric> <numeric> <numeric>   <numeric>
#> WASH7P         23.21968       0.980058  0.563773   1.73839   0.0821421
#> LOC100996442   10.36881       1.187227  0.819025   1.44956   0.1471811
#> LOC729737       8.91192       2.142716  0.940163   2.27909   0.0226617
#> WASH9P         55.19471       0.526744  0.346707   1.51928   0.1286928
#> LOC100288069   76.26492       0.490303  0.297131   1.65012   0.0989180
#> ...                 ...            ...       ...       ...         ...
#> ND4          119814.453      0.0419782 0.0453806  0.925025 3.54953e-01
#> ND5           44220.020      0.5544584 0.0941469  5.889289 3.87860e-09
#> ND6           11378.733      0.9634273 0.1104717  8.721031 2.75680e-18
#> CYTB          55770.859      0.0480192 0.0530959  0.904386 3.65791e-01
#> TRNP            397.987      2.0433030 0.1474365 13.858872 1.12427e-43
#>                     padj
#>                <numeric>
#> WASH7P          0.136916
#> LOC100996442    0.224132
#> LOC729737       0.044295
#> WASH9P          0.200546
#> LOC100288069    0.160443
#> ...                  ...
#> ND4          4.60905e-01
#> ND5          2.06333e-08
#> ND6          2.97924e-17
#> CYTB         4.71909e-01
#> TRNP         3.92645e-42

significant_genes<- res %>%
  as.data.frame() %>%
  dplyr::filter(padj <=0.01, abs(log2FoldChange) >= 2) %>% 
  rownames()


significant_genes

#>   [1] "HES2"         "ERRFI1"       "NPPB"         "CTRC"         "ESPNP"       
#>   [6] "PADI2"        "CDA"          "LOC105376863" "CSMD2"        "GJB4"        
#>  [11] "COL8A2"       "POU3F1"       "EDN2"         "FAM183A"      "TMEM125"     
#>  [16] "BEST4"        "BTBD19"       "LOC107984952" "LDLRAD1"      "LOC105378751"
#>  [21] "KANK4"        "IL12RB2"      "LOC107985054" "CCN1"         "F3"          
#>  [26] "CNN3"         "S1PR1"        "OVGP1"        "KCND3"        "ITGA10"      
#>  [31] "ADAMTSL4"     "LOC105371441" "TCHH"         "MUC1"         "SEMA4A"      
#>  [36] "SH2D2A"       "DDR2"         "CACNA1E"      "LOC107985232" "KIF21B"      
#>  [41] "CACNA1S"      "TNNT2"        "ELF3"         "PCAT6"        "PPFIA4"      
#>  [46] "MYBPH"        "CHI3L1"       "KISS1"        "CD34"         "LAMB3"       
#>  [51] "TGFB2"        "CAPN8"        "ALKAL2"       "TRIB2"        "LRATD1"      
#>  [56] "MATN3"        "KCNK3"        "LBH"          "XDH"          "DYSF"        
#>  [61] "LOC107985917" "C2orf92"      "CREG2"        "LOC105373989" "INHBB"       
#>  [66] "CYP27C1"      "MYO7B"        "LIMS2"        "GPR17"        "PLA2R1"      
#>  [71] "NABP1-OT1"    "CAVIN2"       "MARCHF4"      "IGFBP5"       "PTPRN"       
#>  [76] "INHA"         "COL4A3"       "ALPP"         "ALPG"         "EFHD1"       
#>  [81] "ACKR3"        "KLHL30"       "LOC101927509" "LRRN1"        "BHLHE40"     
#>  [86] "ATP2B2"       "TMEM40"       "COLQ"         "THRB"         "LTF"         
#>  [91] "ITIH3"        "CACNA1D"      "TMEM45A"      "MYH15"        "ZBED2"       
#>  [96] "PLCXD2"       "MGLL"         "LOC107986127" "SLCO2A1"      "LOC105374118"
#> [101] "BCL6"         "PLAAT1"       "NRROS"        "LIMCH1"       "CXCL8"       
#> [106] "EREG"         "CCNG2"        "ABCG2"        "FAM13A"       "DDIT4L"      
#> [111] "ETNPPL"       "NDNF"         "DDX60L"       "UBE2QL1"      "SEMA5A"      
#> [116] "DNAH5"        "CCL28"        "LOC105379051" "LUCAT1"       "ARRDC3"      
#> [121] "LOX"          "ADAMTS19-AS1" "ADAMTS19"     "CSF2"         "KLHL3"       
#> [126] "PSD2"         "PCDH1"        "FGF1"         "ARHGEF37"     "PDGFRB"      
#> [131] "STC2"         "N4BP3"        "PPP1R3G"      "NRN1"         "F13A1"       
#> [136] "NEDD9"        "EDN1"         "LOC105375050" "FOXP4-AS1"    "VEGFA"       
#> [141] "PKHD1"        "PRSS35"       "PRDM1"        "CD24"         "TMEM200A"    
#> [146] "CCN2"         "SGK1"         "TNFAIP3"      "ZC3H12D"      "SYTL3"       
#> [151] "LOC442497"    "LOC116435278" "LOC112267991" "SLC29A4"      "ITGB8-AS1"   
#> [156] "ITGB8"        "INMT"         "AQP1"         "INHBA"        "IGFBP1"      
#> [161] "IGFBP3"       "LOC84214"     "GTF2IRD2B"    "CCL26"        "UPK3B"       
#> [166] "ABCB1"        "CDK14"        "CYP3A5"       "CYP3A7"       "NYAP1"       
#> [171] "ACHE"         "SERPINE1"     "CAV2"         "CAV1"         "CPA4"        
#> [176] "PLXNA4"       "LOC105375519" "AKR1B1"       "TCAF2"        "ZNF467"      
#> [181] "LOC101927914" "LOC105375610" "PPP1R3B"      "MTUS1-DT"     "LOXL2"       
#> [186] "STC1"         "SDAD1P1"      "BNIP3L"       "PNMA2"        "TEX15"       
#> [191] "DKK4"         "SNAI2"        "XKR4"         "CALB1"        "SLA"         
#> [196] "CCN4"         "NDRG1"        "LOC105375792" "ADGRB1"       "GLIS3"       
#> [201] "LURAP1L"      "MIR31HG"      "MOB3B"        "ARID3C"       "CA9"         
#> [206] "LOC105376041" "CNTNAP3"      "MAMDC2"       "ANXA1"        "DAPK1"       
#> [211] "LINC02937"    "LOC100129316" "ABCA1"        "KLF4"         "LPAR1"       
#> [216] "PTGS1"        "COL5A1"       "CLIC3"        "LOC105376363" "AKR1C3"      
#> [221] "PFKFB3"       "CELF2"        "LOC283028"    "ANXA8L1"      "ANXA8"       
#> [226] "HKDC1"        "UNC5B-AS1"    "DDIT4"        "SYNPO2L"      "CYP26A1"     
#> [231] "ANKRD2"       "WNT8B"        "COL17A1"      "DUSP5-DT"     "EMX2OS"      
#> [236] "EMX2"         "PLPP4"        "LOC107984128" "ADAM12"       "MIR210HG"    
#> [241] "IFITM10"      "ADM"          "CAND1.11"     "DEPDC7"       "LMO2"        
#> [246] "CHRM4"        "ZP1"          "LOC107984338" "LINC02736"    "LOC105369349"
#> [251] "CHKA-DT"      "LOC101928473" "ARHGEF17-AS1" "ARHGEF17"     "UCP3"        
#> [256] "SYTL2"        "BIRC3"        "CRYAB"        "LOC107984387" "LOC105369498"
#> [261] "DRD2"         "TAGLN"        "IL10RA"       "USP2-AS1"     "POU2F3"      
#> [266] "MIR100HG"     "NTM"          "IQSEC3"       "SLC6A12"      "SLC6A13"     
#> [271] "SLC2A14"      "LINC02449"    "LINC01252"    "GPRC5A"       "ARHGDIB"     
#> [276] "SLCO1B3"      "TSPAN11"      "PRICKLE1"     "AMIGO2"       "PCED1B"      
#> [281] "ENDOU"        "DDN"          "PRPH"         "KRT7"         "KRT75"       
#> [286] "KRT6A"        "KRT4"         "CALCOCO1"     "HOXC5"        "ITGA5"       
#> [291] "MMP19"        "ARHGAP9"      "LINC02454"    "BTG1"         "LOC105369958"
#> [296] "TCP11L2"      "TRPV4"        "HSPB8"        "CFAP251"      "LOC105370092"
#> [301] "EGLN3"        "EGLN3-AS1"    "GNG2"         "HIF1A-AS3"    "LOC107984016"
#> [306] "DIO3OS"       "CRIP2"        "ATP10A"       "PLCB2"        "SPTBN5"      
#> [311] "FAM81A"       "NOX5"         "HCN4"         "LOC102723750" "PSTPIP1"     
#> [316] "LINC01583"    "LOC105370924" "ISG20"        "CARMAL"       "IRAIN"       
#> [321] "LOC107984790" "TPSG1"        "TPSB2"        "TPSAB1"       "GREP1"       
#> [326] "MIR193BHG"    "LINC02167"    "MMP2"         "CAPNS2"       "CDH11"       
#> [331] "TPPP3"        "HAS3"         "MAF"          "ATP2C2-AS1"   "LOC105371393"
#> [336] "LOC107984862" "INCA1"        "PIK3R6"       "PIK3R5"       "CDRT1"       
#> [341] "NOS2"         "CORO6"        "CCL2"         "DUSP14"       "LOC105371934"
#> [346] "KRT13"        "KRT15"        "GAST"         "GNGT2"        "PICART1"     
#> [351] "COL1A1"       "NOG"          "YPEL2"        "CACNG4"       "SDK2"        
#> [356] "FOXJ1"        "FSCN2"        "CD7"          "LINC02582"    "LINC00908"   
#> [361] "SHC2"         "KISS1R"       "IZUMO4"       "S1PR4"        "SMIM24"      
#> [366] "EBI3"         "C3"           "FBN3"         "ANGPTL4"      "COL5A3"      
#> [371] "ZNF878"       "UCA1"         "CPAMD8"       "KCNN1"        "IQCN"        
#> [376] "LOC105372338" "LOC102724908" "LOC105372371" "LGI4"         "UPK1A"       
#> [381] "LGALS14"      "CEACAM6"      "PSG8"         "PSG1"         "PSG11"       
#> [386] "PSG2"         "LOC107985349" "PSG5"         "PSG4"         "PSG9"        
#> [391] "LYPD5"        "ZNF404"       "ZNF221"       "NKPD1"        "IGFL4"       
#> [396] "IGFL2"        "IGFL2-AS1"    "IGFL1"        "EHD2"         "CABP5"       
#> [401] "ZNF114"       "NTF4"         "LOC105372441" "KLK3"         "KLK10"       
#> [406] "SIGLEC6"      "VSTM1"        "SYT5"         "PTPRH"        "COX6B2"      
#> [411] "IL11"         "FER1L4"       "SPAG4"        "KIAA1755"     "WFDC10B"     
#> [416] "CYP24A1"      "LOC105369299" "LOC105372789" "CLDN14-AS1"   "LINC02940"   
#> [421] "LOC105372803" "LINC02943"    "LINC00323"    "ABCG1"        "TFF2"        
#> [426] "ITGB2"        "ITGB2-AS1"    "LOC112268278" "LOC105372838" "CCDC188"     
#> [431] "GAL3ST1"      "PIK3IP1"      "TIMP3"        "GRPR"         "MAGEB17"     
#> [436] "CACNA1F"      "IL2RG"        "NHSL2"        "RTL5"         "CAPN6"       
#> [441] "KLHL13"       "LMNTD2_1"     "MIR210HG_1"   "TRNP"

pathway analysis

follow the clusterprofiler tutorial

over-representation test

library(clusterProfiler)
library(org.Hs.eg.db)
#convert gene symbol to Entrez ID for 

significant_genes_map<- clusterProfiler::bitr(geneID = significant_genes,
                      fromType="SYMBOL", toType="ENTREZID",
                      OrgDb="org.Hs.eg.db")

head(significant_genes_map)

#>   SYMBOL ENTREZID
#> 1   HES2    54626
#> 2 ERRFI1    54206
#> 3   NPPB     4879
#> 4   CTRC    11330
#> 5  ESPNP   284729
#> 6  PADI2    11240

## background genes are genes that are detected in the RNAseq experiment 
background_genes<- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "gene") %>%
  pull(gene)

background_genes_map<- bitr(geneID = background_genes, 
                            fromType="SYMBOL", 
                            toType="ENTREZID",
                      OrgDb="org.Hs.eg.db")

GO term enrichment

Gene Ontology(GO) defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:

MF: Molecular Function molecular activities of gene products

CC: Cellular Component where gene products are active

BP: Biological Process pathways and larger processes made up of the activities of multiple gene products

GO terms are organized in a directed acyclic graph, where edges between terms represent parent-child relationship.

ego <- enrichGO(gene          = significant_genes_map$ENTREZID,
                universe      = background_genes_map$ENTREZID,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)
head(ego)

#>                    ID                                  Description GeneRatio
#> GO:0001525 GO:0001525                                 angiogenesis    43/319
#> GO:0048514 GO:0048514                   blood vessel morphogenesis    47/319
#> GO:0007186 GO:0007186 G protein-coupled receptor signaling pathway    36/319
#> GO:0045765 GO:0045765                   regulation of angiogenesis    25/319
#> GO:1901342 GO:1901342        regulation of vasculature development    25/319
#> GO:0044703 GO:0044703          multi-organism reproductive process    20/319
#>              BgRatio RichFactor FoldEnrichment    zScore       pvalue
#> GO:0001525 372/11616  0.1155914       4.209121 10.571109 7.201188e-16
#> GO:0048514 443/11616  0.1060948       3.863314 10.325459 8.254001e-16
#> GO:0007186 375/11616  0.0960000       3.495724  8.255305 4.849715e-11
#> GO:0045765 188/11616  0.1329787       4.842260  8.924911 5.551690e-11
#> GO:1901342 192/11616  0.1302083       4.741379  8.784087 8.823659e-11
#> GO:0044703 127/11616  0.1574803       5.734456  9.014760 2.318484e-10
#>                p.adjust       qvalue
#> GO:0001525 1.388736e-12 1.115593e-12
#> GO:0048514 1.388736e-12 1.115593e-12
#> GO:0007186 4.670359e-08 3.751774e-08
#> GO:0045765 4.670359e-08 3.751774e-08
#> GO:1901342 5.938322e-08 4.770348e-08
#> GO:0044703 1.300283e-07 1.044538e-07
#>                                                                                                                                                                                                                                                                                 geneID
#> GO:0001525                   NPPB/COL8A2/EDN2/CCN1/F3/S1PR1/SEMA4A/SH2D2A/CHI3L1/CD34/TGFB2/COL4A3/ACKR3/CXCL8/EREG/NDNF/SEMA5A/FGF1/PDGFRB/EDN1/VEGFA/CCN2/TNFAIP3/ITGB8/AQP1/SERPINE1/CAV1/LOXL2/ADGRB1/ANXA1/KLF4/ADAM12/ADM/ITGA5/MMP19/BTG1/NOX5/MMP2/PIK3R6/CCL2/C3/ANGPTL4/KLK3
#> GO:0048514 NPPB/COL8A2/EDN2/CCN1/F3/S1PR1/SEMA4A/SH2D2A/CHI3L1/CD34/TGFB2/XDH/COL4A3/ACKR3/CXCL8/EREG/NDNF/SEMA5A/LOX/FGF1/PDGFRB/EDN1/VEGFA/PRDM1/CCN2/TNFAIP3/ITGB8/AQP1/SERPINE1/CAV1/LOXL2/ADGRB1/ANXA1/KLF4/ADAM12/ADM/ITGA5/MMP19/BTG1/NOX5/MMP2/PIK3R6/CCL2/NOG/C3/ANGPTL4/KLK3
#> GO:0007186                                                           NPPB/PADI2/S1PR1/KISS1/GPR17/ACKR3/CACNA1D/MGLL/CXCL8/EREG/ARRDC3/PDGFRB/EDN1/CCL26/CAV1/ADGRB1/ANXA1/ABCA1/LPAR1/AKR1C3/ADM/CHRM4/DRD2/GPRC5A/GNG2/PLCB2/PIK3R6/PIK3R5/CCL2/GAST/GNGT2/KISS1R/S1PR4/C3/TFF2/GRPR
#> GO:0045765                                                                                                                        NPPB/F3/SEMA4A/CHI3L1/CD34/TGFB2/COL4A3/CXCL8/SEMA5A/FGF1/VEGFA/TNFAIP3/ITGB8/AQP1/SERPINE1/ADGRB1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4/KLK3
#> GO:1901342                                                                                                                        NPPB/F3/SEMA4A/CHI3L1/CD34/TGFB2/COL4A3/CXCL8/SEMA5A/FGF1/VEGFA/TNFAIP3/ITGB8/AQP1/SERPINE1/ADGRB1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4/KLK3
#> GO:0044703                                                                                                                                                            ERRFI1/OVGP1/IGFBP5/STC2/EDN1/VEGFA/PRDM1/STC1/ADM/ARHGDIB/ENDOU/ITGA5/MMP2/TPPP3/PSG1/PSG11/PSG2/PSG5/PSG4/PSG9
#>            Count
#> GO:0001525    43
#> GO:0048514    47
#> GO:0007186    36
#> GO:0045765    25
#> GO:1901342    25
#> GO:0044703    20

library(enrichplot)
barplot(ego, showCategory=10)

dotplot(ego)

We see angiogenesis and response to oxygen levels pathways are enriched which makes sense!

MsigDB Hallmark gene sets

H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures

# install.packages("msigdbr")
library(msigdbr)

m_df <- msigdbr(species = "human")
head(m_df)

#> # A tibble: 6 × 20
#>   gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
#>   <chr>       <chr>     <chr>        <chr>          <chr>        <chr>          
#> 1 ABCC4       10257     ENSG0000012… ABCC4          10257        ENSG00000125257
#> 2 ABRAXAS2    23172     ENSG0000016… ABRAXAS2       23172        ENSG00000165660
#> 3 ACTN4       81        ENSG0000013… ACTN4          81           ENSG00000130402
#> 4 ACVR1       90        ENSG0000011… ACVR1          90           ENSG00000115170
#> 5 ADAM9       8754      ENSG0000016… ADAM9          8754         ENSG00000168615
#> 6 ADAMTS5     11096     ENSG0000015… ADAMTS5        11096        ENSG00000154736
#> # ℹ 14 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
#> #   db_target_species <chr>

# Note, the new version of msigdbr changed the arguments: category vs collection
# welcome to Bioinformatics!..

m_t2g <- msigdbr(species = "human", collection = "H") %>% 
  dplyr::select(gs_name, ncbi_gene)


head(m_t2g)

#> # A tibble: 6 × 2
#>   gs_name               ncbi_gene
#>   <chr>                 <chr>    
#> 1 HALLMARK_ADIPOGENESIS 19       
#> 2 HALLMARK_ADIPOGENESIS 11194    
#> 3 HALLMARK_ADIPOGENESIS 10449    
#> 4 HALLMARK_ADIPOGENESIS 33       
#> 5 HALLMARK_ADIPOGENESIS 34       
#> 6 HALLMARK_ADIPOGENESIS 35

em <- enricher(significant_genes_map$ENTREZID, TERM2GENE=m_t2g, 
               universe = background_genes_map$ENTREZID )
head(em)

#>                                                                                    ID
#> HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
#> HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
#> HALLMARK_KRAS_SIGNALING_DN                                 HALLMARK_KRAS_SIGNALING_DN
#> HALLMARK_KRAS_SIGNALING_UP                                 HALLMARK_KRAS_SIGNALING_UP
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> HALLMARK_INFLAMMATORY_RESPONSE                         HALLMARK_INFLAMMATORY_RESPONSE
#>                                                                           Description
#> HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
#> HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
#> HALLMARK_KRAS_SIGNALING_DN                                 HALLMARK_KRAS_SIGNALING_DN
#> HALLMARK_KRAS_SIGNALING_UP                                 HALLMARK_KRAS_SIGNALING_UP
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> HALLMARK_INFLAMMATORY_RESPONSE                         HALLMARK_INFLAMMATORY_RESPONSE
#>                                            GeneRatio  BgRatio RichFactor
#> HALLMARK_HYPOXIA                              29/136 170/3393  0.1705882
#> HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION    24/136 142/3393  0.1690141
#> HALLMARK_KRAS_SIGNALING_DN                    15/136 102/3393  0.1470588
#> HALLMARK_KRAS_SIGNALING_UP                    17/136 133/3393  0.1278195
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB              19/136 165/3393  0.1151515
#> HALLMARK_INFLAMMATORY_RESPONSE                15/136 119/3393  0.1260504
#>                                            FoldEnrichment   zScore       pvalue
#> HALLMARK_HYPOXIA                                 4.255926 8.899329 6.744891e-12
#> HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION       4.216653 8.000707 7.107222e-10
#> HALLMARK_KRAS_SIGNALING_DN                       3.668901 5.591871 9.013894e-06
#> HALLMARK_KRAS_SIGNALING_UP                       3.188910 5.261788 1.527983e-05
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB                 2.872861 5.039289 2.148691e-05
#> HALLMARK_INFLAMMATORY_RESPONSE                   3.144773 4.866356 5.948204e-05
#>                                                p.adjust       qvalue
#> HALLMARK_HYPOXIA                           2.900303e-10 2.200965e-10
#> HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 1.528053e-08 1.159599e-08
#> HALLMARK_KRAS_SIGNALING_DN                 1.291992e-04 9.804587e-05
#> HALLMARK_KRAS_SIGNALING_UP                 1.642582e-04 1.246512e-04
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB           1.847874e-04 1.402304e-04
#> HALLMARK_INFLAMMATORY_RESPONSE             4.262880e-04 3.234988e-04
#>                                                                                                                                                                                       geneID
#> HALLMARK_HYPOXIA                           54206/1907/3491/2152/8497/3623/57007/8553/55076/901/4015/8614/7422/1490/7128/3484/3486/5054/857/6781/665/10397/1289/5209/54541/133/694/3669/51129
#> HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION                         1296/3491/4148/3576/4015/5159/7422/1490/7128/3624/3486/5054/4017/6591/1289/8038/6876/50863/3678/4313/1009/1277/50509/7078
#> HALLMARK_KRAS_SIGNALING_DN                                                                                  1907/7042/80303/7068/3699/1906/6446/105375355/56154/3851/3860/3866/3780/7032/778
#> HALLMARK_KRAS_SIGNALING_UP                                                                             7139/28951/2069/1437/2162/639/7128/3624/3486/5243/9314/80201/330/3587/51129/3689/3561
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB                                                                  3491/2152/3914/57007/8553/604/1437/1906/7422/6446/7128/3624/5054/19/9314/5209/330/694/6347
#> HALLMARK_INFLAMMATORY_RESPONSE                                                                                     2152/3576/2069/1906/3696/3624/5054/19/1902/133/3587/3678/23533/6347/10148
#>                                            Count
#> HALLMARK_HYPOXIA                              29
#> HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION    24
#> HALLMARK_KRAS_SIGNALING_DN                    15
#> HALLMARK_KRAS_SIGNALING_UP                    17
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB              19
#> HALLMARK_INFLAMMATORY_RESPONSE                15

dotplot(em)

Hypoxia and EMT are on the top which makes biological sense too.

Gene set enrichment analysis

## you need all the genes and pre-rank them by p-value
## rank all the genes by signed fold change * -log10pvalue.

res_df<- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "gene")


res_df<- res_df %>% 
  mutate(signed_rank_stats = sign(log2FoldChange) * -log10(pvalue)) %>%
  left_join(background_genes_map, by= c("gene" = "SYMBOL")) %>%
  arrange(desc(signed_rank_stats))

gene_list<- res_df$signed_rank_stats
names(gene_list)<- res_df$ENTREZID

# em2 <- GSEA(gene_list, TERM2GENE=m_t2g)

The above will error out:

using ‘fgsea’ for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections… GSEA analysis… Error in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : Not all stats values are finite numbers

It is because some p-values are 0, and log10 will make them infinite.

head(res_df)

#>      gene  baseMean log2FoldChange      lfcSE     stat pvalue padj
#> 1  ERRFI1  4301.159       2.760227 0.06818372 40.48220      0    0
#> 2   LAMB3  3151.210       2.637983 0.06606053 39.93282      0    0
#> 3 BHLHE40  6333.176       2.389235 0.05264755 45.38170      0    0
#> 4  ARRDC3  2202.457       4.140424 0.09291231 44.56270      0    0
#> 5    STC2  9740.083       2.790358 0.05518147 50.56693      0    0
#> 6  IGFBP3 22598.174       5.370911 0.05519163 97.31386      0    0
#>   signed_rank_stats ENTREZID
#> 1               Inf    54206
#> 2               Inf     3914
#> 3               Inf     8553
#> 4               Inf    57561
#> 5               Inf     8614
#> 6               Inf     3486

change the inf to big numbers

res_df<- res_df %>%
  mutate(negative_log10pvalue = -log10(pvalue)) %>%
  mutate(negative_log10pvalue = ifelse(is.infinite(negative_log10pvalue), 1000, negative_log10pvalue)) %>%
  mutate(signed_rank_stats = sign(log2FoldChange) * negative_log10pvalue)

gene_list<- res_df$signed_rank_stats
names(gene_list)<- res_df$ENTREZID


em2 <- GSEA(gene_list, TERM2GENE=m_t2g)
head(em2)

#>                                                                    ID
#> HALLMARK_E2F_TARGETS                             HALLMARK_E2F_TARGETS
#> HALLMARK_G2M_CHECKPOINT                       HALLMARK_G2M_CHECKPOINT
#> HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION
#> HALLMARK_HYPOXIA                                     HALLMARK_HYPOXIA
#> HALLMARK_MYC_TARGETS_V1                       HALLMARK_MYC_TARGETS_V1
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB     HALLMARK_TNFA_SIGNALING_VIA_NFKB
#>                                                           Description setSize
#> HALLMARK_E2F_TARGETS                             HALLMARK_E2F_TARGETS     196
#> HALLMARK_G2M_CHECKPOINT                       HALLMARK_G2M_CHECKPOINT     194
#> HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION     190
#> HALLMARK_HYPOXIA                                     HALLMARK_HYPOXIA     170
#> HALLMARK_MYC_TARGETS_V1                       HALLMARK_MYC_TARGETS_V1     189
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB     HALLMARK_TNFA_SIGNALING_VIA_NFKB     165
#>                                    enrichmentScore       NES       pvalue
#> HALLMARK_E2F_TARGETS                    -0.7992625 -2.241756 1.000000e-10
#> HALLMARK_G2M_CHECKPOINT                 -0.7740591 -2.164090 1.000000e-10
#> HALLMARK_OXIDATIVE_PHOSPHORYLATION      -0.7497171 -2.082826 1.000000e-10
#> HALLMARK_HYPOXIA                         0.9399727  2.076040 1.000000e-10
#> HALLMARK_MYC_TARGETS_V1                 -0.7050760 -1.958278 1.920402e-08
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB         0.8803318  1.934970 1.240718e-07
#>                                        p.adjust       qvalue rank
#> HALLMARK_E2F_TARGETS               1.250000e-09 6.315789e-10 1752
#> HALLMARK_G2M_CHECKPOINT            1.250000e-09 6.315789e-10 1554
#> HALLMARK_OXIDATIVE_PHOSPHORYLATION 1.250000e-09 6.315789e-10 1904
#> HALLMARK_HYPOXIA                   1.250000e-09 6.315789e-10  653
#> HALLMARK_MYC_TARGETS_V1            1.920402e-07 9.703084e-08 1126
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB   1.033931e-06 5.224075e-07 1080
#>                                                      leading_edge
#> HALLMARK_E2F_TARGETS               tags=60%, list=12%, signal=54%
#> HALLMARK_G2M_CHECKPOINT            tags=51%, list=11%, signal=46%
#> HALLMARK_OXIDATIVE_PHOSPHORYLATION tags=50%, list=13%, signal=44%
#> HALLMARK_HYPOXIA                    tags=50%, list=4%, signal=48%
#> HALLMARK_MYC_TARGETS_V1             tags=66%, list=8%, signal=62%
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB    tags=41%, list=7%, signal=39%
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           core_enrichment
#> HALLMARK_E2F_TARGETS                                           27338/3978/6241/9972/7398/10635/55635/25842/79019/1633/3148/4605/10549/10248/9133/9125/6839/9833/10535/10733/3184/253714/3930/5558/10212/51155/10714/10615/29089/2146/4288/9319/84844/23594/5982/8726/4999/3161/24137/3015/84296/9787/672/10797/5889/9126/23649/983/3835/57122/7514/1965/3838/29028/9212/8914/11340/9700/3609/5424/4085/7153/701/204/4001/7112/6790/4830/9918/55646/6628/4678/3070/5631/10051/6427/29127/1111/332/1854/5591/11051/8243/2547/79075/54962/5983/991/4436/23165/10492/4171/1503/6749/5347/7374/9238/993/10527/1019/1164/4174/3159/5902/6426/1786/79077/5111/4172/4175/4176/10528/5901/4173/5425/1434/5036/9221
#> HALLMARK_G2M_CHECKPOINT                                                                                                                                                          4605/10403/3980/9133/6839/10592/10733/3184/6713/3832/1810/3930/1874/5558/10212/51155/3619/1869/8454/8438/2146/4288/23594/6599/3161/24137/22809/5933/9585/3015/7027/3151/890/23649/699/10036/983/262/3835/22974/7514/3838/5001/990/7150/7290/55872/10926/9212/9184/9700/3609/10419/51776/4085/8318/7153/4001/7112/6790/10432/4678/7272/10051/6427/29127/1111/332/10146/7371/8243/4953/9156/7090/6421/3837/3192/991/10492/51659/4171/10772/5347/993/1019/1164/6632/4174/3159/6426/8140/4172/4175/3312/85236/4691/1736/9221
#> HALLMARK_OXIDATIVE_PHOSPHORYLATION                                                                                                                                                                       6389/522/4726/10975/4702/4720/4717/4694/7384/1743/4835/23530/509/291/6832/9551/65003/10939/1340/4729/4723/513/4552/1351/374291/5018/4712/2746/3420/50/7417/2108/10440/7386/5250/9927/1355/1374/8604/3032/64981/34/10989/29796/3028/4706/56993/8802/1892/5447/10884/29088/7388/51318/26517/4697/9377/3313/2395/1738/4704/92609/533/64960/1349/1737/54704/6390/23203/517/4191/515/5435/2271/7416/518/6392/56945/3419/54205/9131/3417/38/516/11331/498/10935/10128/1537/1431/5160/2806/506/292/3945
#> HALLMARK_HYPOXIA                                                                                                                                                                                                                                   54206/8553/8614/3486/5054/665/10397/54541/6515/5230/133/2355/857/3491/6513/8497/5209/7045/8974/6385/1316/7052/3725/7422/205/55818/694/4627/3099/901/5155/5163/7167/2026/2632/2113/55076/1907/1490/302/5033/10370/55139/6095/23645/8660/1649/2152/284119/10957/3098/23764/4783/2023/5214/5211/5236/30001/3309/7852/1026/4601/112464/467/26118/5329/51129/6533/2597/26136/3669/54800/4015/23327/5066/7128/2131/7436/6781/26355/2997/3623/23036/3516/7538
#> HALLMARK_MYC_TARGETS_V1            6599/6637/4999/65005/5250/10921/3015/8662/8664/3066/7027/3608/6193/890/3178/4706/6128/5093/2058/7514/1965/55651/2739/7298/3838/9045/8886/5984/328/9184/10054/1982/9377/4085/6428/8318/5478/6633/7965/1478/1964/26354/51690/4830/6627/9136/23435/6194/6427/9188/3376/1854/10146/10856/6634/10971/4953/2079/2547/7416/56910/51491/10291/3837/1207/5496/8669/23196/3192/991/6732/1665/10728/23450/10492/10155/4171/1503/6187/3336/3181/11331/3615/1019/6418/10935/8761/2107/6632/7555/3735/1537/4686/4174/5902/26156/6059/6426/3183/7284/6723/5111/4175/10576/4176/790/6741/10528/5901/5245/2806/4173/10574/6950/26135/7203/708/5425/10575/3326/22948/5036/4869/9221/3329
#> HALLMARK_TNFA_SIGNALING_VIA_NFKB                                                                                                                                                                                                                                                                                                                   3914/8553/5054/6515/2355/3491/5209/6385/1316/3725/7422/694/9314/10318/22822/4616/23645/8660/2152/10957/4082/10769/56937/1827/6446/23764/4783/329/7050/3383/23135/9120/19/10725/1026/467/5329/2114/330/8744/960/8061/7128/604/4084/2354/4791/7538/10938/6303/1847/25976/3976/23258/4088/1906/3572/5606/4792/3280/3624/1437/4973/687/1839/9516/1846/3371

# em2@result %>% View()

The NES (normalized enrichment score) tells you the pathway is up or down-regulated.

For example, Hypoxia has a positive NES, so it is up-regulated. G2M checkpoint has a negative NES, so it is down-regulated.

visualization

p1<- gseaplot(em2, geneSetID = "HALLMARK_G2M_CHECKPOINT", 
              by = "runningScore", title = "HALLMARK_G2M_CHECKPOINT")

p2 <- gseaplot(em2, geneSetID = "HALLMARK_HYPOXIA", 
               by = "runningScore", title = "HALLMARK_HYPOXIA")

p1/p2

We rank the full gene list from the most up-regulated to the most down-regulated. Most of the hypoxia genes (the black bars in the lower panel) are in the front of the gene list while most of the G2M checkpoint genes (the black bars in the upper panel) are in the end of the gene list.

Do DESeq2 for multiple comparisons using list column and purrr::map()

We have gone through a routine RNAseq analysis for a single comparison: hypoxia vs normoxia. The original datasets have multiple different conditions. You can copy and paste the code and change the columns you want to subset. Or depending the purpose of your experiment, you may use DESeq2 for more complicated design.

This tutorial is to demonstrate how you can avoid repeating yourself. so let’s say you want to do the following comparisons:

hypoxia vs normoxia under sgCTRL
sgRELB vs sgCTRL under normoxia
sgITPR3 vs sgCTRL under normoxia
sgRELB vs sgCTRL under hypoxia
sgITPR3 vs sgCTRL under hypoxia

The first thing come to mind is to write a function:

design_table<- colnames(raw_counts_mat) %>%
        tibble::enframe(name = "sample_number", value = "sample_name") %>%
        mutate(condition = stringr::str_replace(sample_name, "[0-9]+_SW_", "")) %>%
        mutate(condition = stringr::str_replace(condition, "_[0-3]", ""))

design_table<- as.data.frame(design_table)

rownames(design_table)<- design_table$sample_name

design_table

#>                      sample_number          sample_name    condition
#> 01_SW_sgCTRL_Norm                1    01_SW_sgCTRL_Norm  sgCTRL_Norm
#> 02_SW_sgCTRL_Norm                2    02_SW_sgCTRL_Norm  sgCTRL_Norm
#> 03_SW_sgITPR3_1_Norm             3 03_SW_sgITPR3_1_Norm sgITPR3_Norm
#> 04_SW_sgITPR3_1_Norm             4 04_SW_sgITPR3_1_Norm sgITPR3_Norm
#> 07_SW_sgRELB_3_Norm              5  07_SW_sgRELB_3_Norm  sgRELB_Norm
#> 08_SW_sgRELB_3_Norm              6  08_SW_sgRELB_3_Norm  sgRELB_Norm
#> 11_SW_sgCTRL_Hyp                 7     11_SW_sgCTRL_Hyp   sgCTRL_Hyp
#> 12_SW_sgCTRL_Hyp                 8     12_SW_sgCTRL_Hyp   sgCTRL_Hyp
#> 13_SW_sgITPR3_1_Hyp              9  13_SW_sgITPR3_1_Hyp  sgITPR3_Hyp
#> 14_SW_sgITPR3_1_Hyp             10  14_SW_sgITPR3_1_Hyp  sgITPR3_Hyp
#> 17_SW_sgRELB_3_Hyp              11   17_SW_sgRELB_3_Hyp   sgRELB_Hyp
#> 18_SW_sgRELB_3_Hyp              12   18_SW_sgRELB_3_Hyp   sgRELB_Hyp

A function to do 2 condition DESeq2 comparison

run_deseq2<- function(counts_mat, design_table, contrast_name){
        design_table<- design_table %>%
                dplyr::filter(condition %in% contrast_name)
        counts_mat<- counts_mat[, rownames(design_table)]
        keep<- rowSums(counts_mat > 10) >= 2 
        counts_mat<- counts_mat[keep, ]
        
        dds<- DESeqDataSetFromMatrix(countData = counts_mat, 
                                     colData = design_table, 
                                     design= ~ condition)
        dds<- DESeq(dds)
        res<- results(dds, contrast = contrast_name)
        
        return(res)
}


# test the function 
hypoxia_vs_normoxia_res<- run_deseq2(counts_mat = raw_counts_mat,
                                     design_table = design_table,
                                     contrast_name = c("condition", "sgCTRL_Hyp", "sgCTRL_Norm"))

hypoxia_vs_normoxia_res

#> log2 fold change (MLE): condition sgCTRL_Hyp vs sgCTRL_Norm 
#> Wald test p-value: condition sgCTRL_Hyp vs sgCTRL_Norm 
#> DataFrame with 14773 rows and 6 columns
#>                baseMean log2FoldChange     lfcSE      stat      pvalue
#>               <numeric>      <numeric> <numeric> <numeric>   <numeric>
#> WASH7P         23.21968       0.980058  0.563773   1.73839   0.0821421
#> LOC100996442   10.36881       1.187227  0.819025   1.44956   0.1471811
#> LOC729737       8.91192       2.142716  0.940163   2.27909   0.0226617
#> WASH9P         55.19471       0.526744  0.346707   1.51928   0.1286928
#> LOC100288069   76.26492       0.490303  0.297131   1.65012   0.0989180
#> ...                 ...            ...       ...       ...         ...
#> ND4          119814.453      0.0419782 0.0453806  0.925025 3.54953e-01
#> ND5           44220.020      0.5544584 0.0941469  5.889289 3.87860e-09
#> ND6           11378.733      0.9634273 0.1104717  8.721031 2.75680e-18
#> CYTB          55770.859      0.0480192 0.0530959  0.904386 3.65791e-01
#> TRNP            397.987      2.0433030 0.1474365 13.858872 1.12427e-43
#>                     padj
#>                <numeric>
#> WASH7P          0.136916
#> LOC100996442    0.224132
#> LOC729737       0.044295
#> WASH9P          0.200546
#> LOC100288069    0.160443
#> ...                  ...
#> ND4          4.60905e-01
#> ND5          2.06333e-08
#> ND6          2.97924e-17
#> CYTB         4.71909e-01
#> TRNP         3.92645e-42

It is the same as we calculated in the very beginning

res

#> log2 fold change (MLE): condition hypoxia vs normoxia 
#> Wald test p-value: condition hypoxia vs normoxia 
#> DataFrame with 14773 rows and 6 columns
#>                baseMean log2FoldChange     lfcSE      stat      pvalue
#>               <numeric>      <numeric> <numeric> <numeric>   <numeric>
#> WASH7P         23.21968       0.980058  0.563773   1.73839   0.0821421
#> LOC100996442   10.36881       1.187227  0.819025   1.44956   0.1471811
#> LOC729737       8.91192       2.142716  0.940163   2.27909   0.0226617
#> WASH9P         55.19471       0.526744  0.346707   1.51928   0.1286928
#> LOC100288069   76.26492       0.490303  0.297131   1.65012   0.0989180
#> ...                 ...            ...       ...       ...         ...
#> ND4          119814.453      0.0419782 0.0453806  0.925025 3.54953e-01
#> ND5           44220.020      0.5544584 0.0941469  5.889289 3.87860e-09
#> ND6           11378.733      0.9634273 0.1104717  8.721031 2.75680e-18
#> CYTB          55770.859      0.0480192 0.0530959  0.904386 3.65791e-01
#> TRNP            397.987      2.0433030 0.1474365 13.858872 1.12427e-43
#>                     padj
#>                <numeric>
#> WASH7P          0.136916
#> LOC100996442    0.224132
#> LOC729737       0.044295
#> WASH9P          0.200546
#> LOC100288069    0.160443
#> ...                  ...
#> ND4          4.60905e-01
#> ND5          2.06333e-08
#> ND6          2.97924e-17
#> CYTB         4.71909e-01
#> TRNP         3.92645e-42

Now that we have a function, we can call that function with a for loop, but I want to show you a better way.

keep everything in a tibble

make a tibble which can hold a list column. It is super-powerful. Read the best tutorial on purrr.

res_df<- tibble::tibble(contrast_nominator = c("sgCTRL_Hyp", "sgRELB_Norm", "sgITPR3_Norm","sgRELB_Hyp", "sgITPR3_Hyp"),
                contrast_demoniator =c("sgCTRL_Norm", "sgCTRL_Norm", "sgCTRL_Norm", "sgCTRL_Hyp", "sgCTRL_Hyp"))

res_df

#> # A tibble: 5 × 2
#>   contrast_nominator contrast_demoniator
#>   <chr>              <chr>              
#> 1 sgCTRL_Hyp         sgCTRL_Norm        
#> 2 sgRELB_Norm        sgCTRL_Norm        
#> 3 sgITPR3_Norm       sgCTRL_Norm        
#> 4 sgRELB_Hyp         sgCTRL_Hyp         
#> 5 sgITPR3_Hyp        sgCTRL_Hyp

res_df<- res_df %>%
        dplyr::mutate(res = map2(contrast_nominator, contrast_demoniator,
                          ~ run_deseq2(counts_mat = raw_counts_mat,
                                       design_table = design_table,
                                       contrast = c("condition", .x, .y))))
## access individual DESeq2 results
res_df$res[[1]]

#> log2 fold change (MLE): condition sgCTRL_Hyp vs sgCTRL_Norm 
#> Wald test p-value: condition sgCTRL_Hyp vs sgCTRL_Norm 
#> DataFrame with 14773 rows and 6 columns
#>                baseMean log2FoldChange     lfcSE      stat      pvalue
#>               <numeric>      <numeric> <numeric> <numeric>   <numeric>
#> WASH7P         23.21968       0.980058  0.563773   1.73839   0.0821421
#> LOC100996442   10.36881       1.187227  0.819025   1.44956   0.1471811
#> LOC729737       8.91192       2.142716  0.940163   2.27909   0.0226617
#> WASH9P         55.19471       0.526744  0.346707   1.51928   0.1286928
#> LOC100288069   76.26492       0.490303  0.297131   1.65012   0.0989180
#> ...                 ...            ...       ...       ...         ...
#> ND4          119814.453      0.0419782 0.0453806  0.925025 3.54953e-01
#> ND5           44220.020      0.5544584 0.0941469  5.889289 3.87860e-09
#> ND6           11378.733      0.9634273 0.1104717  8.721031 2.75680e-18
#> CYTB          55770.859      0.0480192 0.0530959  0.904386 3.65791e-01
#> TRNP            397.987      2.0433030 0.1474365 13.858872 1.12427e-43
#>                     padj
#>                <numeric>
#> WASH7P          0.136916
#> LOC100996442    0.224132
#> LOC729737       0.044295
#> WASH9P          0.200546
#> LOC100288069    0.160443
#> ...                  ...
#> ND4          4.60905e-01
#> ND5          2.06333e-08
#> ND6          2.97924e-17
#> CYTB         4.71909e-01
#> TRNP         3.92645e-42

#res_df$res[[2]]

This is pretty cool!

volcano plots for all comparisons

First, need to convert the DESeqResults object to a dataframe

res_df2<- res_df %>%
        mutate(res2 = map(res, as.data.frame)) %>%
        dplyr::select(-res) 

res_df2

#> # A tibble: 5 × 3
#>   contrast_nominator contrast_demoniator res2             
#>   <chr>              <chr>               <list>           
#> 1 sgCTRL_Hyp         sgCTRL_Norm         <df [14,773 × 6]>
#> 2 sgRELB_Norm        sgCTRL_Norm         <df [14,731 × 6]>
#> 3 sgITPR3_Norm       sgCTRL_Norm         <df [14,696 × 6]>
#> 4 sgRELB_Hyp         sgCTRL_Hyp          <df [15,487 × 6]>
#> 5 sgITPR3_Hyp        sgCTRL_Hyp          <df [15,188 × 6]>

Now, unnest it to a full dataframe:

res_df2 %>%
        tidyr::unnest(cols = res2) %>%
        head()

#> # A tibble: 6 × 8
#>   contrast_nominator contrast_demoniator baseMean log2FoldChange lfcSE  stat
#>   <chr>              <chr>                  <dbl>          <dbl> <dbl> <dbl>
#> 1 sgCTRL_Hyp         sgCTRL_Norm            23.2           0.980 0.564  1.74
#> 2 sgCTRL_Hyp         sgCTRL_Norm            10.4           1.19  0.819  1.45
#> 3 sgCTRL_Hyp         sgCTRL_Norm             8.91          2.14  0.940  2.28
#> 4 sgCTRL_Hyp         sgCTRL_Norm            55.2           0.527 0.347  1.52
#> 5 sgCTRL_Hyp         sgCTRL_Norm            76.3           0.490 0.297  1.65
#> 6 sgCTRL_Hyp         sgCTRL_Norm            13.9           1.24  0.704  1.76
#> # ℹ 2 more variables: pvalue <dbl>, padj <dbl>

The nice thing is that this is a tidy dataframe and it is easy to make any figures.

res_df2 %>%
        tidyr::unnest(cols = res2) %>%
        mutate(comparison = paste0(contrast_nominator, "_vs_", contrast_demoniator)) %>%
        ggplot(aes(x=log2FoldChange, y = -log10(pvalue))) +
        geom_point() +
        facet_wrap(~comparison)

How powerful is this?!

Now let’s find the differentially expressed genes for each comparison

filter_sig_genes<- function(res){
        sig_genes<- res %>% as.data.frame() %>%
                dplyr::filter(padj < 0.05, abs(log2FoldChange) > 2) %>%
                rownames()
        
        return(sig_genes)
        
}

filter_sig_genes(hypoxia_vs_normoxia_res) %>%
        head()

#> [1] "LOC729737" "PERM1"     "HES2"      "ERRFI1"    "LINC01714" "NPPB"

res_df<- res_df %>%
        mutate(sig_genes = map(res, filter_sig_genes))

res_df

#> # A tibble: 5 × 4
#>   contrast_nominator contrast_demoniator res            sig_genes  
#>   <chr>              <chr>               <list>         <list>     
#> 1 sgCTRL_Hyp         sgCTRL_Norm         <DESqRslt[,6]> <chr [502]>
#> 2 sgRELB_Norm        sgCTRL_Norm         <DESqRslt[,6]> <chr [192]>
#> 3 sgITPR3_Norm       sgCTRL_Norm         <DESqRslt[,6]> <chr [166]>
#> 4 sgRELB_Hyp         sgCTRL_Hyp          <DESqRslt[,6]> <chr [275]>
#> 5 sgITPR3_Hyp        sgCTRL_Hyp          <DESqRslt[,6]> <chr [101]>

Now, the sig_genes is a list column contains a list of characters (genes)

Let’s keep going to do a gene set over-representation analysis for each comparison

# let's use the gene symbol 
m_t2g <- msigdbr(species = "human", collection = "H") %>% 
  dplyr::select(gs_name, db_gene_symbol)


head(m_t2g)

#> # A tibble: 6 × 2
#>   gs_name               db_gene_symbol
#>   <chr>                 <chr>         
#> 1 HALLMARK_ADIPOGENESIS ABCA1         
#> 2 HALLMARK_ADIPOGENESIS ABCB8         
#> 3 HALLMARK_ADIPOGENESIS ACAA2         
#> 4 HALLMARK_ADIPOGENESIS ACADL         
#> 5 HALLMARK_ADIPOGENESIS ACADM         
#> 6 HALLMARK_ADIPOGENESIS ACADS

em <- enricher(significant_genes_map$ENTREZID, TERM2GENE=m_t2g, 
               universe = background_genes_map$ENTREZID )


res_df<- res_df %>%
        mutate(hallmark_res = map(sig_genes, ~ enricher(.x, TERM2GENE=m_t2g, 
                                                        universe = background_genes_map$SYMBOL)))

res_df

#> # A tibble: 5 × 5
#>   contrast_nominator contrast_demoniator res            sig_genes  
#>   <chr>              <chr>               <list>         <list>     
#> 1 sgCTRL_Hyp         sgCTRL_Norm         <DESqRslt[,6]> <chr [502]>
#> 2 sgRELB_Norm        sgCTRL_Norm         <DESqRslt[,6]> <chr [192]>
#> 3 sgITPR3_Norm       sgCTRL_Norm         <DESqRslt[,6]> <chr [166]>
#> 4 sgRELB_Hyp         sgCTRL_Hyp          <DESqRslt[,6]> <chr [275]>
#> 5 sgITPR3_Hyp        sgCTRL_Hyp          <DESqRslt[,6]> <chr [101]>
#> # ℹ 1 more variable: hallmark_res <list>

Now, you see the pattern, you can keep add the results into a new column and because tibble can contain a list column and a list can contain any objects (even a ggplot2!).

res_df<- res_df %>%
        mutate(dotplots = map(hallmark_res, dotplot))

res_df

#> # A tibble: 5 × 6
#>   contrast_nominator contrast_demoniator res            sig_genes  
#>   <chr>              <chr>               <list>         <list>     
#> 1 sgCTRL_Hyp         sgCTRL_Norm         <DESqRslt[,6]> <chr [502]>
#> 2 sgRELB_Norm        sgCTRL_Norm         <DESqRslt[,6]> <chr [192]>
#> 3 sgITPR3_Norm       sgCTRL_Norm         <DESqRslt[,6]> <chr [166]>
#> 4 sgRELB_Hyp         sgCTRL_Hyp          <DESqRslt[,6]> <chr [275]>
#> 5 sgITPR3_Hyp        sgCTRL_Hyp          <DESqRslt[,6]> <chr [101]>
#> # ℹ 2 more variables: hallmark_res <list>, dotplots <list>

You can take a look at the plots:

res_df$dotplots[[1]]

# this is an empty one because no terms were enriched
# res_df$dotplots[[2]]

res_df$dotplots[[3]]

res_df$dotplots[[4]]

Finally, you can save those plots to files by:

pwalk(list(a = res_df$contrast_nominator, b = res_df$contrast_demoniator, c = res_df$dotplots),
     safely(function(a, b,c) ggsave(filename = paste0("~/blog_data/", a, "_vs_",b, ".pdf"), 
                                    plot = c, width = 6, height = 6)))

using safely can avoid that empty figure error.

Hopefully, you learned that how powerful tibble and purrr are.

start using them!

Happy Learning!

Tommy aka crazyhottommy

bioinformatics RNAseq DEseq2 R list column