PCA analysis on TCGA bulk RNAseq data

Jan 29, 2025 7 min read review

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

what is PCA?

Principal Component Analysis (PCA) is a mathematical technique used to reduce the dimensionality of large datasets while preserving the most important patterns in the data.

It transforms the original high-dimensional data into a smaller set of new variables called principal components (PCs), which capture the most variation in the data.

Key Concepts of PCA:

Dimensionality Reduction – PCA reduces complex datasets with many features (e.g., thousands of genes) into a few key components, making analysis easier.
Variance Maximization – The first principal component (PC1) captures the most variation in the data, the second (PC2) captures the second most, and so on.
Orthogonality – Principal components are uncorrelated (perpendicular to each other), ensuring that each captures a unique aspect of the data.

Please read my previous posts on PCA:

PCA in action Step by step guide to calculate PCA with R
permutation test for PCA components
How PCA projection and cell label transfer work in Seurat

PCA on TCGA data

In this example, we will download TCGA LUAD and LUSC RNAseq counts data, and do a PCA. You will see when we do not normalize the counts matrix with total library size. The first PC (PC1) is correlated with sequencing depth.

Let’s dive in!

Download TCGA data using TCGAbiolinks

There are many ways to get TCGA data. I like TCGAbiolinks the best as it is very easy.

It takes long time to download the data, so I saved it as and Rdata.

library(TCGAbiolinks)
library(SummarizedExperiment)
library(dplyr)
library(here)

Download TCGA data using TCGAbiolinks.

# Load TCGA data (example for LUAD - Lung Adenocarcinoma)
LUAD_query <- GDCquery(project = "TCGA-LUAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "STAR - Counts")

GDCdownload(LUAD_query)

# this returns a summarizedExperiment object 
TCGA_LUAD_data <- GDCprepare(LUAD_query)

saveRDS(TCGA_LUAD_data, "~/blog_data/TCGA_LUAD_SummarizedExperiment.rds"))

Read in the saved RDS file:

TCGA_LUAD_data<- readRDS("~/blog_data/TCGA_LUAD_SummarizedExperiment.rds")

## there are many metadata for each sample
colData(TCGA_LUAD_data) %>% 
        colnames() %>%
        tail()

#> [1] "paper_chromosome.affected.by.chromothripsis"    
#> [2] "paper_iCluster.Group"                           
#> [3] "paper_CIMP.methylation.signature."              
#> [4] "paper_MTOR.mechanism.of.mTOR.pathway.activation"
#> [5] "paper_Ploidy.ABSOLUTE.calls"                    
#> [6] "paper_Purity.ABSOLUTE.calls"

# CIMP methylation subtypes
colData(TCGA_LUAD_data) %>%
  as.data.frame() %>%
  janitor::tabyl(`paper_CIMP.methylation.signature.`)

#>  paper_CIMP.methylation.signature.   n    percent valid_percent
#>                              high   37 0.06166667     0.1957672
#>                      intermediate   76 0.12666667     0.4021164
#>                               low   76 0.12666667     0.4021164
#>                               <NA> 411 0.68500000            NA

# save the raw counts matrix 
TCGA_LUAD_mat<- assay(TCGA_LUAD_data)


TCGA_LUAD_mat[1:5, 1:5]

#>                    TCGA-73-4658-01A-01R-1755-07 TCGA-44-2661-11A-01R-1758-07
#> ENSG00000000003.15                         3659                         1395
#> ENSG00000000005.6                           188                            8
#> ENSG00000000419.13                          981                         1031
#> ENSG00000000457.14                          456                          541
#> ENSG00000000460.17                          158                          135
#>                    TCGA-55-6986-11A-01R-1949-07 TCGA-55-8615-01A-11R-2403-07
#> ENSG00000000003.15                         6760                         2257
#> ENSG00000000005.6                             3                            0
#> ENSG00000000419.13                         2070                          644
#> ENSG00000000457.14                         1110                          538
#> ENSG00000000460.17                          202                          212
#>                    TCGA-97-8177-01A-11R-2287-07
#> ENSG00000000003.15                         5009
#> ENSG00000000005.6                            13
#> ENSG00000000419.13                         2731
#> ENSG00000000457.14                          919
#> ENSG00000000460.17                          321

dim(TCGA_LUAD_mat)

#> [1] 60660   600

We have 600 samples.

You noticed that the gene name is in ENSEMBLE ID.

We will download the TCGA LUSC data in the same way.

LUSC_query <- GDCquery(project = "TCGA-LUSC",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "STAR - Counts")

GDCdownload(LUSC_query)

TCGA_LUSC_data <- GDCprepare(LUSC_query)

saveRDS(TCGA_LUSC_data, "~/blog_data/TCGA_LUSC_SummarizedExperiment.rds"))

read in the saved RDS

TCGA_LUSC_data<- readRDS("~/blog_data/TCGA_LUSC_SummarizedExperiment.rds")

# different transcription subtypes
colData(TCGA_LUSC_data) %>%
  as.data.frame() %>%
  janitor::tabyl(paper_Expression.Subtype)

#>  paper_Expression.Subtype   n    percent valid_percent
#>                     basal  43 0.07775769     0.2402235
#>                 classical  65 0.11754069     0.3631285
#>                 primitive  27 0.04882459     0.1508380
#>                 secretory  44 0.07956600     0.2458101
#>                      <NA> 374 0.67631103            NA

TCGA_LUSC_mat<- assay(TCGA_LUSC_data)

convert ENSEMBL id to gene symbols

library(org.Hs.eg.db)

# remove the version number in the end of the ENSEMBL id
TCGA_LUAD_genes<- rownames(TCGA_LUAD_mat) %>%
  tibble::enframe() %>%
  mutate(ENSEMBL =stringr::str_replace(value, "\\.[0-9]+", ""))

head(TCGA_LUAD_genes)

#> # A tibble: 6 × 3
#>    name value              ENSEMBL        
#>   <int> <chr>              <chr>          
#> 1     1 ENSG00000000003.15 ENSG00000000003
#> 2     2 ENSG00000000005.6  ENSG00000000005
#> 3     3 ENSG00000000419.13 ENSG00000000419
#> 4     4 ENSG00000000457.14 ENSG00000000457
#> 5     5 ENSG00000000460.17 ENSG00000000460
#> 6     6 ENSG00000000938.13 ENSG00000000938

# there are duplicated gene symbols for different ENSEMBL id
clusterProfiler::bitr(TCGA_LUAD_genes$ENSEMBL, 
                      fromType = "ENSEMBL",
                      toType = "SYMBOL",
                      OrgDb = org.Hs.eg.db) %>%
        janitor::get_dupes(SYMBOL) %>%
        head()

#>         SYMBOL dupe_count         ENSEMBL
#> 1 LOC124905422         16 ENSG00000199270
#> 2 LOC124905422         16 ENSG00000199334
#> 3 LOC124905422         16 ENSG00000199337
#> 4 LOC124905422         16 ENSG00000199352
#> 5 LOC124905422         16 ENSG00000199396
#> 6 LOC124905422         16 ENSG00000199910

# just keep one of them (simple solution)
TCGA_LUAD_gene_map<- clusterProfiler::bitr(TCGA_LUAD_genes$ENSEMBL,
                                           fromType = "ENSEMBL",
                                           toType = "SYMBOL",
                                           OrgDb = org.Hs.eg.db) %>%
        distinct(SYMBOL, .keep_all = TRUE) 
       
                       
TCGA_LUAD_gene_map<- TCGA_LUAD_gene_map %>%
  left_join(TCGA_LUAD_genes)

head(TCGA_LUAD_gene_map)

#>           ENSEMBL SYMBOL name              value
#> 1 ENSG00000000003 TSPAN6    1 ENSG00000000003.15
#> 2 ENSG00000000005   TNMD    2  ENSG00000000005.6
#> 3 ENSG00000000419   DPM1    3 ENSG00000000419.13
#> 4 ENSG00000000457  SCYL3    4 ENSG00000000457.14
#> 5 ENSG00000000460  FIRRM    5 ENSG00000000460.17
#> 6 ENSG00000000938    FGR    6 ENSG00000000938.13

Subset the gene expression matrices and replace the rownames to gene symbol.

TCGA_LUSC_mat<- TCGA_LUSC_mat[TCGA_LUAD_gene_map$value, ]
row.names(TCGA_LUSC_mat)<- TCGA_LUAD_gene_map$SYMBOL
TCGA_LUSC_mat[1:5, 1:5]

#>        TCGA-43-7657-11A-01R-2125-07 TCGA-43-7657-01A-31R-2125-07
#> TSPAN6                         2459                         2329
#> TNMD                              0                            0
#> DPM1                           1734                         2533
#> SCYL3                           795                          496
#> FIRRM                           206                          434
#>        TCGA-60-2695-01A-01R-0851-07 TCGA-21-1070-01A-01R-0692-07
#> TSPAN6                         4271                         2955
#> TNMD                              1                            0
#> DPM1                           2692                         2865
#> SCYL3                          2494                          887
#> FIRRM                          2850                          947
#>        TCGA-94-7033-01A-11R-1949-07
#> TSPAN6                         3405
#> TNMD                              2
#> DPM1                           1706
#> SCYL3                           600
#> FIRRM                           420

TCGA_LUAD_mat<- TCGA_LUAD_mat[TCGA_LUAD_gene_map$value, ]
row.names(TCGA_LUAD_mat)<- TCGA_LUAD_gene_map$SYMBOL
TCGA_LUAD_mat[1:5, 1:5]

#>        TCGA-73-4658-01A-01R-1755-07 TCGA-44-2661-11A-01R-1758-07
#> TSPAN6                         3659                         1395
#> TNMD                            188                            8
#> DPM1                            981                         1031
#> SCYL3                           456                          541
#> FIRRM                           158                          135
#>        TCGA-55-6986-11A-01R-1949-07 TCGA-55-8615-01A-11R-2403-07
#> TSPAN6                         6760                         2257
#> TNMD                              3                            0
#> DPM1                           2070                          644
#> SCYL3                          1110                          538
#> FIRRM                           202                          212
#>        TCGA-97-8177-01A-11R-2287-07
#> TSPAN6                         5009
#> TNMD                             13
#> DPM1                           2731
#> SCYL3                           919
#> FIRRM                           321

## check the dimentions 
dim(TCGA_LUSC_mat)

#> [1] 36940   553

dim(TCGA_LUAD_mat)

#> [1] 36940   600

combine TCGA LUSC and LUAD

PCA plot use top variable genes.

# double check the genes are the same
all.equal(rownames(TCGA_LUAD_mat), rownames(TCGA_LUSC_mat))

#> [1] TRUE

TCGA_lung_mat<- cbind(TCGA_LUSC_mat, TCGA_LUAD_mat)

TCGA_lung_meta<- data.frame(cancer_type = c(rep( "LUSC", ncol(TCGA_LUSC_mat)), 
                   rep("LUAD", ncol(TCGA_LUAD_mat))))

dim(TCGA_lung_mat)

#> [1] 36940  1153

PCA with the raw counts

library(ggplot2)
library(ggfortify)
# select the top 1000 most variable genes 
TCGA_gene_idx<- order(rowVars(TCGA_lung_mat), decreasing = TRUE)[1:1000]

TCGA_lung_mat_sub <- TCGA_lung_mat[TCGA_gene_idx, ]

TCGA_pca_res <- prcomp(t(TCGA_lung_mat_sub), scale. = TRUE)

autoplot(TCGA_pca_res, data = TCGA_lung_meta , color ="cancer_type") +
  scale_color_manual(values = c("blue", "red")) +
  ggtitle("TCGA NSCLC")

LUAD and LUSC samples are not separated in the first PC.

Let’s see how sequencing depth is correlated with first PC

read https://divingintogeneticsandgenomics.com/post/pca-in-action/ for more details for PCA.

# the PCs 
TCGA_pca_res$x[1:5, 1:5]

#>                                     PC1        PC2       PC3        PC4
#> TCGA-43-7657-11A-01R-2125-07  -9.143224  22.659680 7.0908829  9.0796187
#> TCGA-43-7657-01A-31R-2125-07 -13.494142 -14.609673 4.6152805  1.6117986
#> TCGA-60-2695-01A-01R-0851-07  -8.390102  -5.732319 0.3820473 -3.4361055
#> TCGA-21-1070-01A-01R-0692-07   4.081536  -4.608280 0.9799216  0.2175983
#> TCGA-94-7033-01A-11R-1949-07  -1.890497 -11.202800 1.3147853  4.1963502
#>                                    PC5
#> TCGA-43-7657-11A-01R-2125-07  2.449763
#> TCGA-43-7657-01A-31R-2125-07  1.826048
#> TCGA-60-2695-01A-01R-0851-07 -3.122432
#> TCGA-21-1070-01A-01R-0692-07  2.610431
#> TCGA-94-7033-01A-11R-1949-07 -4.869039

seq_depth<- colSums(TCGA_lung_mat)

# calculate the correlation of first PC and the sequencing depth
# the sign of the PCs are arbitrary, so let's get the absolute number
cor(TCGA_pca_res$x[, 1], seq_depth) %>%
        abs()

#> [1] 0.979129

A correlation of 0.98 for PC1!!

# PC2 correlation with sequencing depth
cor(TCGA_pca_res$x[, 2], seq_depth) %>%
        abs()

#> [1] 0.09609512

PC2 is not correlated with sequencing depth. As we can see PC2 separates the cancer types.

Let’s plot it on PCA.

TCGA_lung_meta$seq_depth<- seq_depth

autoplot(TCGA_pca_res, data = TCGA_lung_meta , color ="seq_depth") +
        scale_color_viridis_b() +
  ggtitle("TCGA NSCLC")

You see the sequencing depth is low to high from left to right in PC1.

PCA after cpm (counts per million) normalization

# convert the raw counts to log2(cpm+1) using edgeR.
TCGA_lung_cpm<- edgeR::cpm(TCGA_lung_mat, log = TRUE, prior.count = 1)

## top 1000 most variable genes
TCGA_gene_idx2<- order(rowVars(TCGA_lung_cpm), decreasing = TRUE)[1:1000]

TCGA_lung_cpm_sub <- TCGA_lung_cpm[TCGA_gene_idx2, ]

TCGA_pca_res2 <- prcomp(t(TCGA_lung_cpm_sub), scale. = TRUE)

autoplot(TCGA_pca_res2, data = TCGA_lung_meta , color ="cancer_type") +
  scale_color_manual(values = c("blue", "red")) +
  ggtitle("TCGA NSCLC")

Now we see the first PC separates cancer type: LUSC vs LUAD, which makes sense! It is interesting to see some of the LUSC samples mix with the LUAD. It will be interesting to further investigate those samples.

make a heatmap

My favorite package is ComplexHeatmap.

set.seed(123)
library(ComplexHeatmap)

TCGA_ha<- HeatmapAnnotation(df = TCGA_lung_meta,
    col = list(
               cancer_type = c("LUSC" = "red", "LUAD" = "blue"))
    )

Heatmap(t(scale(t(TCGA_lung_cpm_sub))),
        name = "TCGA lung RNA",
        show_column_names = FALSE,
        show_row_names = FALSE,
        show_row_dend = FALSE,
        #column_split = TCGA_lung_meta$cancer_type,
        top_annotation = TCGA_ha,
        row_names_gp = gpar(fontsize = 3),
        border = TRUE,
        row_km = 3
        )

This shows you how useful PCA is. In my next post, I will show you how a similar analysis is done in a single-cell ATACseq dataset. Stay tuned!

Watch the video here:

Do not forget to subscribe to my channel: Chatomics.