How to convert raw counts to TPM for TCGA data and make a heatmap across cancer types

Sign up for my newsletter to not miss a post like this https://divingintogeneticsandgenomics.ck.page/newsletter

The Cancer Genome Atlas (TCGA) project is probably one of the most well-known large-scale cancer sequencing project. It sequenced ~10,000 treatment-naive tumors across 33 cancer types. Different data including whole-exome, whole-genome, copy-number (SNP array), bulk RNAseq, protein expression (Reverse-Phase Protein Array), DNA methylation are available.

TCGA is a very successful large sequencing project. I highly recommend learning from the organization of it. Read Collaborative Genomics Projects: A Comprehensive Guide.

In this blog post, I am going to show you how to download the raw RNA-seq counts for 33 cancer types, convert them to TPM (transcript per million) and plot them in a heatmap and boxplot.

Let’s dive in!

Use recount3

recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project

#BiocManager::install("recount3")
library(recount3)
library(purrr)
library(dplyr)
library(ggplot2)
human_projects <- available_projects()

tcga_info = subset(
    human_projects,
    file_source == "tcga" & project_type == "data_sources"
)

head(tcga_info)
#>      project organism file_source      project_home project_type n_samples
#> 8710     ACC    human        tcga data_sources/tcga data_sources        79
#> 8711    BLCA    human        tcga data_sources/tcga data_sources       433
#> 8712    BRCA    human        tcga data_sources/tcga data_sources      1256
#> 8713    CESC    human        tcga data_sources/tcga data_sources       309
#> 8714    CHOL    human        tcga data_sources/tcga data_sources        45
#> 8715    COAD    human        tcga data_sources/tcga data_sources       546
proj_info <- map(seq(nrow(tcga_info)), ~tcga_info[.x, ])

## create the RangedSummarizedExperiment. the create_rse function works on 
## one row a time 

rse_tcga <- map(proj_info, ~create_rse(.x))

The first time you use recount3, it will ask:

/Users/tommytang/Library/Caches/org.R-project.R/R/recount3 does not exist, create directory? (yes/no): yes

rse_tcga is a list of rse object, let’s take a look at one of them

rse_tcga[[1]]
#> class: RangedSummarizedExperiment 
#> dim: 63856 79 
#> metadata(8): time_created recount3_version ... annotation recount3_url
#> assays(1): raw_counts
#> rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
#>   ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
#> rowData names(10): source type ... havana_gene tag
#> colnames(79): 43e715bf-28d9-4b5e-b762-8cd1b69a430e
#>   1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872 ...
#>   a08b85ea-d1e7-4b77-8dec-36294305b9f7
#>   aa2d53e5-d389-4332-9dd5-a736052e48f8
#> colData names(937): rail_id external_id ... recount_seq_qc.errq
#>   BigWigURL
## get raw counts from one rse object
rse_tcga[[1]]@assays@data$raw_counts[1:5, 1:5]
#>                   43e715bf-28d9-4b5e-b762-8cd1b69a430e
#> ENSG00000278704.1                                    0
#> ENSG00000277400.1                                    0
#> ENSG00000274847.1                                    0
#> ENSG00000277428.1                                    0
#> ENSG00000276256.1                                    0
#>                   1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872
#> ENSG00000278704.1                                    0
#> ENSG00000277400.1                                    0
#> ENSG00000274847.1                                    0
#> ENSG00000277428.1                                    0
#> ENSG00000276256.1                                    0
#>                   93b382e4-9c9a-43f5-bd3b-502cc260b886
#> ENSG00000278704.1                                    0
#> ENSG00000277400.1                                    0
#> ENSG00000274847.1                                    0
#> ENSG00000277428.1                                    0
#> ENSG00000276256.1                                    0
#>                   1f39dadd-3655-474e-ba4c-a5bd32c97a8b
#> ENSG00000278704.1                                    0
#> ENSG00000277400.1                                    0
#> ENSG00000274847.1                                    0
#> ENSG00000277428.1                                    0
#> ENSG00000276256.1                                    0
#>                   8c8c09b9-ec83-45ec-bc4c-0ba92de60acb
#> ENSG00000278704.1                                    0
#> ENSG00000277400.1                                    0
#> ENSG00000274847.1                                    0
#> ENSG00000277428.1                                    0
#> ENSG00000276256.1                                    0
## get the gene information
rse_tcga[[1]]@rowRanges
#> GRanges object with 63856 ranges and 10 metadata columns:
#>                              seqnames            ranges strand |   source
#>                                 <Rle>         <IRanges>  <Rle> | <factor>
#>          ENSG00000278704.1 GL000009.2       56140-58376      - |  ENSEMBL
#>          ENSG00000277400.1 GL000194.1      53590-115018      - |  ENSEMBL
#>          ENSG00000274847.1 GL000194.1      53594-115055      - |  ENSEMBL
#>          ENSG00000277428.1 GL000195.1       37434-37534      - |  ENSEMBL
#>          ENSG00000276256.1 GL000195.1       42939-49164      - |  ENSEMBL
#>                        ...        ...               ...    ... .      ...
#>   ENSG00000124334.17_PAR_Y       chrY 57184101-57197337      + |   HAVANA
#>   ENSG00000185203.12_PAR_Y       chrY 57201143-57203357      - |   HAVANA
#>    ENSG00000270726.6_PAR_Y       chrY 57190738-57208756      + |   HAVANA
#>   ENSG00000182484.15_PAR_Y       chrY 57207346-57212230      + |   HAVANA
#>    ENSG00000227159.8_PAR_Y       chrY 57212184-57214397      - |   HAVANA
#>                                type bp_length     phase                gene_id
#>                            <factor> <numeric> <integer>            <character>
#>          ENSG00000278704.1     gene      2237      <NA>      ENSG00000278704.1
#>          ENSG00000277400.1     gene      2179      <NA>      ENSG00000277400.1
#>          ENSG00000274847.1     gene      1599      <NA>      ENSG00000274847.1
#>          ENSG00000277428.1     gene       101      <NA>      ENSG00000277428.1
#>          ENSG00000276256.1     gene      2195      <NA>      ENSG00000276256.1
#>                        ...      ...       ...       ...                    ...
#>   ENSG00000124334.17_PAR_Y     gene      2504      <NA> ENSG00000124334.17_P..
#>   ENSG00000185203.12_PAR_Y     gene      1054      <NA> ENSG00000185203.12_P..
#>    ENSG00000270726.6_PAR_Y     gene       773      <NA> ENSG00000270726.6_PA..
#>   ENSG00000182484.15_PAR_Y     gene      4618      <NA> ENSG00000182484.15_P..
#>    ENSG00000227159.8_PAR_Y     gene      1306      <NA> ENSG00000227159.8_PA..
#>                                         gene_type   gene_name       level
#>                                       <character> <character> <character>
#>          ENSG00000278704.1         protein_coding  BX004987.1           3
#>          ENSG00000277400.1         protein_coding  AC145212.2           3
#>          ENSG00000274847.1         protein_coding  AC145212.1           3
#>          ENSG00000277428.1               misc_RNA       Y_RNA           3
#>          ENSG00000276256.1         protein_coding  AC011043.1           3
#>                        ...                    ...         ...         ...
#>   ENSG00000124334.17_PAR_Y         protein_coding        IL9R           2
#>   ENSG00000185203.12_PAR_Y              antisense      WASIR1           2
#>    ENSG00000270726.6_PAR_Y   processed_transcript AJ271736.10           2
#>   ENSG00000182484.15_PAR_Y transcribed_unproces..      WASH6P           2
#>    ENSG00000227159.8_PAR_Y unprocessed_pseudogene    DDX11L16           2
#>                                     havana_gene         tag
#>                                     <character> <character>
#>          ENSG00000278704.1                 <NA>        <NA>
#>          ENSG00000277400.1                 <NA>        <NA>
#>          ENSG00000274847.1                 <NA>        <NA>
#>          ENSG00000277428.1                 <NA>        <NA>
#>          ENSG00000276256.1                 <NA>        <NA>
#>                        ...                  ...         ...
#>   ENSG00000124334.17_PAR_Y OTTHUMG00000022720.1         PAR
#>   ENSG00000185203.12_PAR_Y OTTHUMG00000022676.3         PAR
#>    ENSG00000270726.6_PAR_Y OTTHUMG00000184987.2         PAR
#>   ENSG00000182484.15_PAR_Y OTTHUMG00000022677.5         PAR
#>    ENSG00000227159.8_PAR_Y OTTHUMG00000022678.1         PAR
#>   -------
#>   seqinfo: 374 sequences from an unspecified genome; no seqlengths

The bp_length column is the gene length we will use.

## metadata info
rse_tcga[[1]]@colData@listData %>% as.data.frame() %>% 
        `[`(1:5, 1:5)
#>   rail_id                          external_id study
#> 1  106797 43e715bf-28d9-4b5e-b762-8cd1b69a430e   ACC
#> 2  110230 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872   ACC
#> 3  110773 93b382e4-9c9a-43f5-bd3b-502cc260b886   ACC
#> 4  110869 1f39dadd-3655-474e-ba4c-a5bd32c97a8b   ACC
#> 5  116503 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb   ACC
#>              tcga.tcga_barcode                     tcga.gdc_file_id
#> 1 TCGA-OR-A5KU-01A-11R-A29S-07 43e715bf-28d9-4b5e-b762-8cd1b69a430e
#> 2 TCGA-P6-A5OG-01A-22R-A29S-07 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872
#> 3 TCGA-OR-A5K5-01A-11R-A29S-07 93b382e4-9c9a-43f5-bd3b-502cc260b886
#> 4 TCGA-OR-A5K4-01A-11R-A29S-07 1f39dadd-3655-474e-ba4c-a5bd32c97a8b
#> 5 TCGA-OR-A5LP-01A-11R-A29S-07 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb

each row/sample corresponds to the each column/sample of the count matrix.

Now, let’s write a function to convert the raw counts to TPM (transcript per million). Read this blog post by Matthew Bernstein https://mbernste.github.io/posts/rna_seq_basics/ to get a better understanding of the math.

Put in plain words: given a raw count matrix of gene x sample, first divide each gene (row) by its (effective) gene length, then get the sum of each column (total), divide each entry for the same column by the column sum then times 1e6.

To understand the effective gene length, read my old blog post https://crazyhottommy.blogspot.com/2016/07/comparing-salmon-kalliso-and-star-htseq.html

In our example, we will just use the gene length.

# some cancer associated genes
genes_of_interest<- c("MSLN", "EGFR", "ERBB2", "CEACAM5", "NECTIN4", "EPCAM", "MUC16", "MUC1", "CD276", "FOLH1", "DLL3", "VTCN1", "PROM1", "PVR", "CLDN6", "MET", "FOLR1", "TNFRSF10B", "TACSTD2", "CD24")

#### Creating TPM
count2tpm<- function(rse){
        count_matrix <- rse@assays@data$raw_counts
        gene_length <- rse@rowRanges$bp_length
        reads_per_rpk <- count_matrix/gene_length
        per_mil_scale <- colSums(reads_per_rpk)/1000000
        tpm_matrix <- t(t(reads_per_rpk)/per_mil_scale)
         ## Make sure they match the ENSG and gene order
        gene_ind<-  rse@rowRanges$gene_name %in% genes_of_interest 
        tpm_submatrix <- tpm_matrix[gene_ind,]
        rownames(tpm_submatrix)<- rse@rowRanges[gene_ind, ]$gene_name
        return(tpm_submatrix)
}

## convert raw count matrix per cancer type to TPM and subset to only the genes of interest 
tpm_data<- map(rse_tcga, count2tpm)

## get the metadata column 
metadata<- map(rse_tcga, ~.x@colData@listData %>% as.data.frame())

# bind the data matrix across cancer types together 
tpm_data2<- reduce(tpm_data, cbind)

## bind the metadata across cancer types together
metadata2<- reduce(metadata, rbind)
dim(tpm_data2)
#> [1]    20 11348
dim(metadata2)
#> [1] 11348   937

Total, we have 20 genes x 11348 samples in the combined count matrix and 937 metadata columns.

## rename the some columns 
metadata2<- metadata2 %>%
        select(tcga.tcga_barcode, tcga.cgc_sample_sample_type, study) %>%
        mutate(sample_type = case_when(
                tcga.cgc_sample_sample_type == "Additional - New Primary" ~ "cancer",
                tcga.cgc_sample_sample_type == "Additional Metastatic" ~ "metastatic",
                tcga.cgc_sample_sample_type == "Metastatic" ~ "metastatic",
                tcga.cgc_sample_sample_type == "Primary Blood Derived Cancer - Peripheral Blood " ~ "cancer",
                tcga.cgc_sample_sample_type == "Primary Tumor" ~ "cancer",
                tcga.cgc_sample_sample_type == "Recurrent Tumor" ~ "cancer",
                tcga.cgc_sample_sample_type == "Solid Tissue Normal" ~ "normal"
    ))

combine the metadata and count matrix into a single dataframe

final_df<- cbind(t(tpm_data2), metadata2)

head(final_df)
#>                                         TACSTD2      VTCN1       MUC1
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e  0.7035937 0.00000000 0.67502205
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872 25.4360736 0.00000000 2.01525394
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886  1.5756197 0.00000000 0.90784666
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b  0.2702156 0.09099681 0.04293345
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb  0.4122814 0.00000000 0.11484380
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc  4.5469193 4.85973690 0.04208195
#>                                         NECTIN4     FOLH1      FOLR1     CD276
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e 0.08620727  7.213342 0.00000000  52.75981
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872 0.07279804 23.552286 0.12154673  78.78551
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886 0.69905270  2.853812 1.01000271 145.84399
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b 0.01652257  1.157070 0.27942068  48.45022
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb 0.03168398  2.408137 0.04922458  42.25592
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc 0.06828305  1.010411 0.02248965  20.63795
#>                                            MSLN      CLDN6     ERBB2
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e 0.06674445 0.09704962  1.879518
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872 0.95554610 0.25458796  7.777976
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886 0.04563568 0.25701910  2.905926
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b 0.03154912 0.24746913  4.914280
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb 0.26968788 0.12576720  1.494744
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc 0.01336404 0.01823883 13.474689
#>                                             MUC16       DLL3 CEACAM5      PVR
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e 0.0011479879 0.49589978       0 52.08113
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872 0.0008049670 2.52244014       0 40.87926
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886 0.0026190288 0.77074712       0 33.26727
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b 0.0051705741 0.10636402       0 28.26457
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb 0.0004894306 0.04483123       0 41.66776
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc 0.0000000000 0.01184285       0 30.18711
#>                                          EPCAM       PROM1       CD24      EGFR
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e  4.521984 0.025311008 0.55036003  1.286481
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872  9.530414 0.023576862 9.67272890  5.373307
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886 42.358567 0.000000000 0.06939934  4.600918
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b 16.316524 0.007783431 0.84522244  3.010374
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb 12.529742 0.019204339 0.21369023 16.476552
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc  2.430109 0.043719865 4.95506593  2.010338
#>                                             MET TNFRSF10B
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e  0.9320235  12.80547
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872  8.0610999  31.46289
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886  0.1295387  65.57967
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b  2.9728030  24.31636
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb 19.7360055  21.11014
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc  8.6087283  37.91574
#>                                                 tcga.tcga_barcode
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e TCGA-OR-A5KU-01A-11R-A29S-07
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872 TCGA-P6-A5OG-01A-22R-A29S-07
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886 TCGA-OR-A5K5-01A-11R-A29S-07
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b TCGA-OR-A5K4-01A-11R-A29S-07
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb TCGA-OR-A5LP-01A-11R-A29S-07
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc TCGA-PK-A5H9-01A-11R-A29S-07
#>                                      tcga.cgc_sample_sample_type study
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e               Primary Tumor   ACC
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872               Primary Tumor   ACC
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886               Primary Tumor   ACC
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b               Primary Tumor   ACC
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb               Primary Tumor   ACC
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc               Primary Tumor   ACC
#>                                      sample_type
#> 43e715bf-28d9-4b5e-b762-8cd1b69a430e      cancer
#> 1a5db9fc-2abd-4e1b-b5ef-b1cf5e5f3872      cancer
#> 93b382e4-9c9a-43f5-bd3b-502cc260b886      cancer
#> 1f39dadd-3655-474e-ba4c-a5bd32c97a8b      cancer
#> 8c8c09b9-ec83-45ec-bc4c-0ba92de60acb      cancer
#> 85a86b91-4f24-4e77-ae2d-520f8e205efc      cancer

With everything in a single dataframe, we are ready to do anything you want:)

For each gene, let’s take a median of all samples within the same cancer type and make a heatmap for all genes.

Make a boxplot for one gene across all cancer types

final_df %>%
        filter(sample_type == "cancer") %>%
        ggplot(aes(x = study, y = EPCAM )) +
        geom_boxplot() +
        scale_y_continuous(trans=scales::pseudo_log_trans(base = 2),
                           breaks = c(0, 2, 4, 8, 16, 32, 128, 1024)) +
        theme_bw(base_size = 14) +
        xlab("") +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Rearrange the boxes by the median level:

final_df %>%
        filter(sample_type == "cancer") %>%
        ggplot(aes(x = study %>%
                           forcats::fct_reorder(EPCAM, median), y = EPCAM )) +
        geom_boxplot() +
        scale_y_continuous(trans=scales::pseudo_log_trans(base = 2),
                           breaks = c(0, 2, 4, 8, 16, 32, 128, 1024)) +
        theme_bw(base_size = 14) +
        xlab("") +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

This looks really nice!

Let’s make a heatmap by taking the median

library(ComplexHeatmap)
tcga_df<- final_df %>%
        filter(sample_type == "cancer") %>%
        group_by(sample_type, study) %>%
        summarise(across(1:20, ~log2(.x+1))) %>%
        summarise(across(1:20, median)) %>%
        arrange(study) %>%
        filter(!is.na(sample_type))

tcga_mat<- tcga_df[, -c(1,2)] %>% as.matrix()
rownames(tcga_mat)<- tcga_df %>% pull(study)

cell_fun = function(j, i, x, y, w, h, fill) {
    grid.rect(x = x, y = y, width = w, height = h, 
              gp = gpar(col = "black", fill = NA))
}

Heatmap(tcga_mat, cluster_columns = TRUE, cell_fun = cell_fun,
        name = "log2TPM")

Always do a sanity check to see if the results make any biological sense or not. We see MSLN is high in MESO, FOLH1 is high in prostate cancer (PRAD). We are probably on the right track!

You can also scale the expression for each gene across the cancer types.

Heatmap(scale(tcga_mat), cluster_columns = TRUE, cell_fun = cell_fun,
        name = "scaled\nlog2TPM")

Let’s try a different data resource using ExperimentHub

#BiocManager::install("ExperimentHub")
library(ExperimentHub)
eh<- ExperimentHub()
query(eh , "GSE62944")
#> ExperimentHub with 3 records
#> # snapshotDate(): 2021-10-19
#> # $dataprovider: GEO
#> # $species: Homo sapiens
#> # $rdataclass: SummarizedExperiment, ExpressionSet
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH1"]]' 
#> 
#>            title                                                              
#>   EH1    | RNA-Sequencing and clinical data for 7706 tumor samples from The...
#>   EH1043 | RNA-Sequencing and clinical data for 9246 tumor samples from The...
#>   EH1044 | RNA-Sequencing and clinical data for 741 normal samples from The...
tcga_data<- eh[["EH1043"]]

colData(tcga_data)[1:5, 1:5]
#> DataFrame with 5 rows and 5 columns
#>                                                  bcr_patient_uuid
#>                                                          <factor>
#> TCGA-G9-A9S0-01A-11R-A41O-07 084AC674-F4CF-4A4A-A6A2-687C05C272EA
#> TCGA-E1-5318-01A-01R-1470-07 400b2fa3-baf7-4baa-99b4-88b9f25b5049
#> TCGA-25-1625-01A-01R-1566-13 88d61634-913c-435a-8d25-e019c8dab7da
#> TCGA-A2-A0T1-01A-21R-A084-07 02bed00f-bef7-4fb7-b243-540354990e45
#> TCGA-ER-A19N-06A-11R-A18S-07 9EE83669-D7A4-476E-8857-304600B4917A
#>                              bcr_patient_barcode form_completion_date
#>                                         <factor>             <factor>
#> TCGA-G9-A9S0-01A-11R-A41O-07        TCGA-G9-A9S0           2014-6-17 
#> TCGA-E1-5318-01A-01R-1470-07        TCGA-E1-5318           2011-1-26 
#> TCGA-25-1625-01A-01R-1566-13        TCGA-25-1625           2009-6-2  
#> TCGA-A2-A0T1-01A-21R-A084-07        TCGA-A2-A0T1           2010-11-12
#> TCGA-ER-A19N-06A-11R-A18S-07        TCGA-ER-A19N           2012-4-20 
#>                              prospective_collection retrospective_collection
#>                                            <factor>                 <factor>
#> TCGA-G9-A9S0-01A-11R-A41O-07        NO                       YES            
#> TCGA-E1-5318-01A-01R-1470-07        NO                       YES            
#> TCGA-25-1625-01A-01R-1566-13        [Not Available]          [Not Available]
#> TCGA-A2-A0T1-01A-21R-A084-07        NO                       YES            
#> TCGA-ER-A19N-06A-11R-A18S-07        YES                      NO
count_mat<- assay(tcga_data)

count_mat[1:5, 1:5]
#>             TCGA-G9-A9S0-01A-11R-A41O-07 TCGA-E1-5318-01A-01R-1470-07
#> 1/2-SBSRNA4                           40                          105
#> A1BG                                  26                           32
#> A1BG-AS1                               3                            4
#> A1CF                                   0                            0
#> A2LD1                                390                           28
#>             TCGA-25-1625-01A-01R-1566-13 TCGA-A2-A0T1-01A-21R-A084-07
#> 1/2-SBSRNA4                           38                           23
#> A1BG                                  83                          157
#> A1BG-AS1                              15                           60
#> A1CF                                   2                            2
#> A2LD1                               1408                          331
#>             TCGA-ER-A19N-06A-11R-A18S-07
#> 1/2-SBSRNA4                           28
#> A1BG                                 889
#> A1BG-AS1                             139
#> A1CF                                   0
#> A2LD1                                368

Get the gene length from TxDb.Hsapiens.UCSC.hg19.knownGene package:

#BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
#BiocManager::install("org.Hs.eg.db")

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

hg19_exons<- exonsBy(txdb, "gene")

hg19_exons is a GRangesList object, each element is a gene of multiple exons.

# the first gene has 15 exons.
hg19_exons[[1]]
#> GRanges object with 15 ranges and 2 metadata columns:
#>        seqnames            ranges strand |   exon_id   exon_name
#>           <Rle>         <IRanges>  <Rle> | <integer> <character>
#>    [1]    chr19 58858172-58858395      - |    250809        <NA>
#>    [2]    chr19 58858719-58859006      - |    250810        <NA>
#>    [3]    chr19 58859832-58860494      - |    250811        <NA>
#>    [4]    chr19 58860934-58862017      - |    250812        <NA>
#>    [5]    chr19 58861736-58862017      - |    250813        <NA>
#>    ...      ...               ...    ... .       ...         ...
#>   [11]    chr19 58868951-58869015      - |    250821        <NA>
#>   [12]    chr19 58869318-58869652      - |    250822        <NA>
#>   [13]    chr19 58869855-58869951      - |    250823        <NA>
#>   [14]    chr19 58870563-58870689      - |    250824        <NA>
#>   [15]    chr19 58874043-58874214      - |    250825        <NA>
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome
# get the width of each exon
exon_length<- width(hg19_exons)
exon_length
#> IntegerList of length 23459
#> [["1"]] 224 288 663 1084 282 297 273 270 36 96 65 335 97 127 172
#> [["10"]] 101 1216
#> [["100"]] 326 103 130 65 102 72 128 116 144 123 62 161
#> [["1000"]] 1394 165 140 234 234 143 254 186 138 173 145 156 147 227 106 112 519
#> [["10000"]] 218 68 5616 50 103 88 215 129 123 69 66 132 145 112 126 158 41
#> [["100008586"]] 92 121 126 127
#> [["100009676"]] 2784
#> [["10001"]] 704 28 116 478 801 109 130 83 92 160 52
#> [["10002"]] 308 127 104 222 176 201 46 106 918 705
#> [["10003"]] 191 112 187 102 126 187 94 1193 99 ... 64 68 92 91 265 82 93 1054
#> ...
#> <23449 more elements>
exon_length[[1]] %>% sum()
#> [1] 4309
# sum it up for each gene, in R, everything is vectorized :)
sum(exon_length) %>% head()
#>         1        10       100      1000     10000 100008586 
#>      4309      1317      1532      4473      7459       466
# turn it to a dataframe

hg19_gene_length<- sum(exon_length) %>% 
        tibble::enframe(name = "gene_id", value = "exon_length")

head(hg19_gene_length)
#> # A tibble: 6 × 2
#>   gene_id   exon_length
#>   <chr>           <int>
#> 1 1                4309
#> 2 10               1317
#> 3 100              1532
#> 4 1000             4473
#> 5 10000            7459
#> 6 100008586         466
# map the Entrez ID to gene symbol
gene_symbol<- AnnotationDbi::select(org.Hs.eg.db, keys=hg19_gene_length$gene_id, 
                                    columns="SYMBOL", keytype="ENTREZID")

all.equal(hg19_gene_length$gene_id, gene_symbol$ENTREZID)
#> [1] TRUE
hg19_gene_length$symbol<- gene_symbol$SYMBOL

hg19_gene_length
#> # A tibble: 23,459 × 3
#>    gene_id   exon_length symbol    
#>    <chr>           <int> <chr>     
#>  1 1                4309 A1BG      
#>  2 10               1317 NAT2      
#>  3 100              1532 ADA       
#>  4 1000             4473 CDH2      
#>  5 10000            7459 AKT3      
#>  6 100008586         466 GAGE12F   
#>  7 100009676        2784 ZBTB11-AS1
#>  8 10001            2753 MED6      
#>  9 10002            2913 NR2E3     
#> 10 10003            5056 NAALAD2   
#> # … with 23,449 more rows
table(rownames(count_mat) %in% hg19_gene_length$symbol)
#> 
#> FALSE  TRUE 
#>  3499 19869
#setdiff(rownames(count_mat), hg19_gene_length$symbol)
#setdiff(hg19_gene_length$symbol, rownames(count_mat))
# it seems to be microRNAs, LINC RNAs that are not matching...

3499 genes in the count table are not in the dataframe. gene symbols can change constantly and are given new names or aliases. Use HGNChelper if needed: https://cran.r-project.org/web/packages/HGNChelper/index.html

I will just remove those genes

indx<- intersect(rownames(count_mat), hg19_gene_length$symbol)

count_mat<- count_mat[indx, ]

gene_df<- hg19_gene_length %>%
  slice(match(indx, symbol))

dim(gene_df)
#> [1] 19869     3
dim(count_mat)
#> [1] 19869  9264

convert counts to TPM

count_to_tpm<- function(mat, gene_length){
  mat<- mat/gene_length
  total_per_sample<- colSums(mat)
  mat<- t(t(mat)/total_per_sample)
  return(mat*1000000)
}

tpm_mat<- count_to_tpm(count_mat, gene_df$exon_length)

# can not find "NECTIN4", it might be a different name 

genes_of_interest
#>  [1] "MSLN"      "EGFR"      "ERBB2"     "CEACAM5"   "NECTIN4"   "EPCAM"    
#>  [7] "MUC16"     "MUC1"      "CD276"     "FOLH1"     "DLL3"      "VTCN1"    
#> [13] "PROM1"     "PVR"       "CLDN6"     "MET"       "FOLR1"     "TNFRSF10B"
#> [19] "TACSTD2"   "CD24"
genes_of_interest[!genes_of_interest %in% rownames(tpm_mat)]
#> [1] "NECTIN4"

other names for NECTIN4: LNIR; PRR4; EDSS1; PVRL4; nectin-4

I tried all of them, and it seems that in the data matrix, it was named PRR4. Welcome to the everyday work of Bioinforamtics: converting gene ids…

"PRR4" %in% rownames(tpm_mat)
#> [1] TRUE

Make a heatmap:

genes_of_interest2<- c("MSLN", "EGFR", "ERBB2", "CEACAM5", "PRR4", "EPCAM", "MUC16", "MUC1", "CD276", "FOLH1", "DLL3", "VTCN1", "PROM1", "PVR", "CLDN6", "MET", "FOLR1", "TNFRSF10B", "TACSTD2", "CD24")

tpm_sub<- tpm_mat[genes_of_interest2, ]

tpm_median<- cbind(t(tpm_sub), CancerType = as.character(colData(tcga_data)$CancerType)) %>%
  as.data.frame() %>%
  mutate(across(1:20, as.numeric)) %>%
  group_by(CancerType) %>%
  summarise(across(1:20, median)) 

tpm_sub_mat<- as.matrix(tpm_median[,-1])
rownames(tpm_sub_mat)<- tpm_median$CancerType

Heatmap(log2(tpm_sub_mat +1), cluster_columns = TRUE, cell_fun = cell_fun,
        name = "log2TPM")

There are fewer cancer types in the ExperimentHub than the recount3, but in general, the patterns of gene expression are similar.

You may look into TCGAbiolinks and cBioPortalData too.

Happy Learning!

Related

Next
Previous
comments powered by Disqus