PCA analysis on TCGA bulk RNAseq data continued

Feb 7, 2025 10 min read review

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

In my last blog post, I showed you how to download TCGA RNAseq count data and do PCA and make a heatmap. It is interesting to see some of the LUSC samples mix with the LUAD samples and vice versa.

In this post, we will continue to use PCA to do more Exploratory data analysis (EDA).

Let’s further investigate those samples and use the rank matrix derived from the count matrix to do PCA.

Download and combine the data

The following code is the same as my last post.

It takes long time to download the data, so I saved it as and Rdata.

library(TCGAbiolinks)
library(SummarizedExperiment)
library(dplyr)
library(here)

Download TCGA data using TCGAbiolinks.

# Load TCGA data (example for LUAD - Lung Adenocarcinoma)
LUAD_query <- GDCquery(project = "TCGA-LUAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "STAR - Counts")

GDCdownload(LUAD_query)

# this returns a summarizedExperiment object 
TCGA_LUAD_data <- GDCprepare(LUAD_query)

saveRDS(TCGA_LUAD_data, "~/blog_data/TCGA_LUAD_SummarizedExperiment.rds"))

Read in the saved RDS file:

TCGA_LUAD_data<- readRDS("~/blog_data/TCGA_LUAD_SummarizedExperiment.rds")
TCGA_LUAD_mat<- assay(TCGA_LUAD_data)
dim(TCGA_LUAD_mat)

#> [1] 60660   600

We have 600 samples.

You noticed that the gene name is in ENSEMBLE ID.

We will download the TCGA LUSC data in the same way.

LUSC_query <- GDCquery(project = "TCGA-LUSC",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "STAR - Counts")

GDCdownload(LUSC_query)

TCGA_LUSC_data <- GDCprepare(LUSC_query)

saveRDS(TCGA_LUSC_data, "~/blog_data/TCGA_LUSC_SummarizedExperiment.rds"))

read in the saved RDS

TCGA_LUSC_data<- readRDS("~/blog_data/TCGA_LUSC_SummarizedExperiment.rds")

TCGA_LUSC_mat<- assay(TCGA_LUSC_data)

dim(TCGA_LUSC_mat)

#> [1] 60660   553

convert ENSEMBL id to gene symbols

library(org.Hs.eg.db)

# remove the version number in the end of the ENSEMBL id
TCGA_LUAD_genes<- rownames(TCGA_LUAD_mat) %>%
  tibble::enframe() %>%
  mutate(ENSEMBL =stringr::str_replace(value, "\\.[0-9]+", ""))

head(TCGA_LUAD_genes)

#> # A tibble: 6 × 3
#>    name value              ENSEMBL        
#>   <int> <chr>              <chr>          
#> 1     1 ENSG00000000003.15 ENSG00000000003
#> 2     2 ENSG00000000005.6  ENSG00000000005
#> 3     3 ENSG00000000419.13 ENSG00000000419
#> 4     4 ENSG00000000457.14 ENSG00000000457
#> 5     5 ENSG00000000460.17 ENSG00000000460
#> 6     6 ENSG00000000938.13 ENSG00000000938

# there are duplicated gene symbols for different ENSEMBL id
clusterProfiler::bitr(TCGA_LUAD_genes$ENSEMBL, 
                      fromType = "ENSEMBL",
                      toType = "SYMBOL",
                      OrgDb = org.Hs.eg.db) %>%
        janitor::get_dupes(SYMBOL) %>%
        head()

#>         SYMBOL dupe_count         ENSEMBL
#> 1 LOC124905422         16 ENSG00000199270
#> 2 LOC124905422         16 ENSG00000199334
#> 3 LOC124905422         16 ENSG00000199337
#> 4 LOC124905422         16 ENSG00000199352
#> 5 LOC124905422         16 ENSG00000199396
#> 6 LOC124905422         16 ENSG00000199910

# just keep one of them (simple solution)
TCGA_LUAD_gene_map<- clusterProfiler::bitr(TCGA_LUAD_genes$ENSEMBL,
                                           fromType = "ENSEMBL",
                                           toType = "SYMBOL",
                                           OrgDb = org.Hs.eg.db) %>%
        distinct(SYMBOL, .keep_all = TRUE) 
       
                       
TCGA_LUAD_gene_map<- TCGA_LUAD_gene_map %>%
  left_join(TCGA_LUAD_genes)

head(TCGA_LUAD_gene_map)

#>           ENSEMBL SYMBOL name              value
#> 1 ENSG00000000003 TSPAN6    1 ENSG00000000003.15
#> 2 ENSG00000000005   TNMD    2  ENSG00000000005.6
#> 3 ENSG00000000419   DPM1    3 ENSG00000000419.13
#> 4 ENSG00000000457  SCYL3    4 ENSG00000000457.14
#> 5 ENSG00000000460  FIRRM    5 ENSG00000000460.17
#> 6 ENSG00000000938    FGR    6 ENSG00000000938.13

Subset the gene expression matrices and replace the rownames to gene symbol.

TCGA_LUSC_mat<- TCGA_LUSC_mat[TCGA_LUAD_gene_map$value, ]
row.names(TCGA_LUSC_mat)<- TCGA_LUAD_gene_map$SYMBOL
TCGA_LUSC_mat[1:5, 1:5]

#>        TCGA-43-7657-11A-01R-2125-07 TCGA-43-7657-01A-31R-2125-07
#> TSPAN6                         2459                         2329
#> TNMD                              0                            0
#> DPM1                           1734                         2533
#> SCYL3                           795                          496
#> FIRRM                           206                          434
#>        TCGA-60-2695-01A-01R-0851-07 TCGA-21-1070-01A-01R-0692-07
#> TSPAN6                         4271                         2955
#> TNMD                              1                            0
#> DPM1                           2692                         2865
#> SCYL3                          2494                          887
#> FIRRM                          2850                          947
#>        TCGA-94-7033-01A-11R-1949-07
#> TSPAN6                         3405
#> TNMD                              2
#> DPM1                           1706
#> SCYL3                           600
#> FIRRM                           420

TCGA_LUAD_mat<- TCGA_LUAD_mat[TCGA_LUAD_gene_map$value, ]
row.names(TCGA_LUAD_mat)<- TCGA_LUAD_gene_map$SYMBOL
TCGA_LUAD_mat[1:5, 1:5]

#>        TCGA-73-4658-01A-01R-1755-07 TCGA-44-2661-11A-01R-1758-07
#> TSPAN6                         3659                         1395
#> TNMD                            188                            8
#> DPM1                            981                         1031
#> SCYL3                           456                          541
#> FIRRM                           158                          135
#>        TCGA-55-6986-11A-01R-1949-07 TCGA-55-8615-01A-11R-2403-07
#> TSPAN6                         6760                         2257
#> TNMD                              3                            0
#> DPM1                           2070                          644
#> SCYL3                          1110                          538
#> FIRRM                           202                          212
#>        TCGA-97-8177-01A-11R-2287-07
#> TSPAN6                         5009
#> TNMD                             13
#> DPM1                           2731
#> SCYL3                           919
#> FIRRM                           321

## check the dimentions 
dim(TCGA_LUSC_mat)

#> [1] 36940   553

dim(TCGA_LUAD_mat)

#> [1] 36940   600

combine TCGA LUSC and LUAD

PCA plot use top variable genes.

# double check the genes are the same
all.equal(rownames(TCGA_LUAD_mat), rownames(TCGA_LUSC_mat))

#> [1] TRUE

TCGA_lung_mat<- cbind(TCGA_LUSC_mat, TCGA_LUAD_mat)

TCGA_lung_meta<- data.frame(cancer_type = c(rep( "LUSC", ncol(TCGA_LUSC_mat)), 
                   rep("LUAD", ncol(TCGA_LUAD_mat))))

rownames(TCGA_lung_meta)<- c(colnames(TCGA_LUAD_mat), colnames(TCGA_LUSC_mat))

head(TCGA_lung_meta)

#>                              cancer_type
#> TCGA-73-4658-01A-01R-1755-07        LUSC
#> TCGA-44-2661-11A-01R-1758-07        LUSC
#> TCGA-55-6986-11A-01R-1949-07        LUSC
#> TCGA-55-8615-01A-11R-2403-07        LUSC
#> TCGA-97-8177-01A-11R-2287-07        LUSC
#> TCGA-49-6744-11A-01R-1858-07        LUSC

dim(TCGA_lung_mat)

#> [1] 36940  1153

PCA after cpm (counts per million) normalization

library(ggplot2)
library(ggfortify)

# convert the raw counts to log2(cpm+1) using edgeR.
TCGA_lung_cpm<- edgeR::cpm(TCGA_lung_mat, log = TRUE, prior.count = 1)

select_top_var_genes<- function(mat, top_n=1000){
        idx<- order(rowVars(TCGA_lung_cpm), decreasing = TRUE)[1:top_n]
        return(mat[idx,])
}
## top 1000 most variable genes

TCGA_lung_cpm_sub <- select_top_var_genes(TCGA_lung_cpm, top_n = 1000)

TCGA_pca_res2 <- prcomp(t(TCGA_lung_cpm_sub), scale. = TRUE)

autoplot(TCGA_pca_res2, data = TCGA_lung_meta , color ="cancer_type") +
  scale_color_manual(values = c("blue", "red")) +
  ggtitle("TCGA NSCLC")

Now we see the first PC separates cancer type: LUSC vs LUAD, which makes sense! It is interesting to see some of the LUSC samples mix with the LUAD samples and vice versa.

Let’s further investigate those samples.

LUAD and LUSC markers

LUAD originates in the alveolar epithelial cells and is often associated with EGFR mutations, particularly in non-smokers.

NKX2-1 (TTF-1) Lung epithelial differentiation marker. Highly expressed in LUAD but absent in LUSC (TCGA, IHC studies).

NAPSA (Napsin A). Aspartic protease in surfactant protein processing. High specificity for LUAD (IHC studies).

TCGA_lung_meta$NAPSA<- TCGA_lung_cpm["NAPSA", ]
TCGA_lung_meta$TFF1<- TCGA_lung_cpm["TFF1", ]

autoplot(TCGA_pca_res2, data = TCGA_lung_meta , color ="TFF1") +
        scale_color_viridis_b() +
        facet_wrap(~ cancer_type) + 
        ggtitle("TCGA NSCLC")

Let’s cap the TFF1 max value at 6

quantile(TCGA_lung_meta$TFF1, c(0.1,0.2,0.5,0.8,0.9, 0.951))

#>       10%       20%       50%       80%       90%     95.1% 
#> -5.660578 -4.480366 -2.351457  0.832825  3.428506  6.247145

# Cap the maximum value at 6
TCGA_lung_meta$TFF1 <- pmin(TCGA_lung_meta$TFF1, 6)

autoplot(TCGA_pca_res2, data = TCGA_lung_meta , color ="TFF1") +
        scale_color_viridis_b() +
        facet_wrap(~ cancer_type) + 
        ggtitle("TCGA NSCLC")

TFF1 mRNA is not high across the LUAD samples. TFF1 protein maybe a better marker?

quantile(TCGA_lung_meta$NAPSA, c(0.1,0.2,0.5,0.8,0.9, 0.951))

#>        10%        20%        50%        80%        90%      95.1% 
#>  0.5522465  3.3168667  7.5590742  9.9427697 10.4473737 10.9337256

# Cap the maximum value at 8
TCGA_lung_meta$NAPSA <- pmin(TCGA_lung_meta$NAPSA, 8)

autoplot(TCGA_pca_res2, data = TCGA_lung_meta , color ="NAPSA") +
        scale_color_viridis_b() +
        facet_wrap(~ cancer_type) + 
        ggtitle("TCGA NSCLC")

NAPSA level looks pretty high in most of the LUAD samples. The LUAD samples at PC1 > 0 have lower NAPSA levels. Also, the LUSC samples at PC1 < 0 have high NAPSA levels too.

Are those samples mislabeled?

Let’s check several LUSC markers.

LUSC arises from the bronchial epithelium and is strongly associated with smoking.

TP63 (ΔNp63).
Transcription factor crucial for squamous differentiation.
High expression in LUSC, absent in LUAD (TCGA, IHC studies).

KRT5 (Cytokeratin 5).
Cytoskeletal protein in basal epithelial cells.
Expressed in LUSC, absent in LUAD.

TCGA_lung_meta$TP63<- TCGA_lung_cpm["TP63", ]
TCGA_lung_meta$KRT5<- TCGA_lung_cpm["KRT5", ]

TCGA_lung_meta$TP63 <- pmin(TCGA_lung_meta$TP63, 8)
autoplot(TCGA_pca_res2, data = TCGA_lung_meta , color ="TP63") +
        scale_color_viridis_b() +
        facet_wrap(~ cancer_type) + 
        ggtitle("TCGA NSCLC")

autoplot(TCGA_pca_res2, data = TCGA_lung_meta , color ="KRT5") +
        scale_color_viridis_b() +
        facet_wrap(~ cancer_type) + 
        ggtitle("TCGA NSCLC")

Both TP63 and KRT5 are very high in most of the LUSC samples except the samples that are located in PC1 < 0.

Isolate those samples based on PC1

PC1_greater_zero<- TCGA_pca_res2$x[, 1] > 0

PC1_smaller_zero<- TCGA_pca_res2$x[, 1] < 0 

TCGA_lung_meta$PC1_greater_zero<- PC1_greater_zero
TCGA_lung_meta$PC1_smaller_zero<- PC1_smaller_zero

head(TCGA_lung_meta)

#>                              cancer_type     NAPSA      TFF1      TP63
#> TCGA-73-4658-01A-01R-1755-07        LUSC  8.000000 -4.324249 3.9402703
#> TCGA-44-2661-11A-01R-1758-07        LUSC -1.606127 -4.303566 8.0000000
#> TCGA-55-6986-11A-01R-1949-07        LUSC  6.353927 -3.275591 0.2082599
#> TCGA-55-8615-01A-11R-2403-07        LUSC  5.488879 -5.660578 8.0000000
#> TCGA-97-8177-01A-11R-2287-07        LUSC  5.493101 -1.190246 7.3304868
#> TCGA-49-6744-11A-01R-1858-07        LUSC  7.820269 -2.755688 7.9849378
#>                                   KRT5 PC1_greater_zero PC1_smaller_zero
#> TCGA-73-4658-01A-01R-1755-07  5.319164            FALSE             TRUE
#> TCGA-44-2661-11A-01R-1758-07 12.662280             TRUE            FALSE
#> TCGA-55-6986-11A-01R-1949-07  9.679160             TRUE            FALSE
#> TCGA-55-8615-01A-11R-2403-07 12.642629             TRUE            FALSE
#> TCGA-97-8177-01A-11R-2287-07 13.268546             TRUE            FALSE
#> TCGA-49-6744-11A-01R-1858-07 12.970187             TRUE            FALSE

### samples that are labeled as LUAD but mixed with LUSC
TCGA_lung_meta %>%
        filter(cancer_type == "LUAD", PC1_greater_zero)

#>                              cancer_type      NAPSA       TFF1        TP63
#> TCGA-86-8673-01A-11R-2403-07        LUAD  4.5663310 -2.9086627  8.00000000
#> TCGA-62-A46V-01A-11R-A24H-07        LUAD  8.0000000 -3.9525629  8.00000000
#> TCGA-50-5044-01A-21R-1858-07        LUAD  5.1896518  6.0000000  4.97915437
#> TCGA-68-7757-01B-11R-2296-07        LUAD  5.2821693 -2.6573753  2.13422809
#> TCGA-85-A510-01A-11R-A26W-07        LUAD  0.6750456  0.8804522  3.06914674
#> TCGA-46-3766-01A-01R-0980-07        LUAD  3.8035608  1.2287754  8.00000000
#> TCGA-22-5481-11A-01R-1949-07        LUAD  1.6385518  2.6653434  3.75590039
#> TCGA-22-5478-11A-11R-1635-07        LUAD  0.8663622 -2.4048553  6.74546294
#> TCGA-85-7696-01A-11R-2125-07        LUAD  4.2119379 -5.6605785  4.22054191
#> TCGA-22-4594-01A-01R-1201-07        LUAD  8.0000000  2.2538799 -0.25637501
#> TCGA-6A-AB49-01A-12R-A405-07        LUAD  3.3164053  6.0000000 -0.36926332
#> TCGA-56-7582-11A-01R-2045-07        LUAD  4.7683084  4.7304238  7.43781179
#> TCGA-56-8305-01A-11R-2296-07        LUAD  2.4551189 -3.2311191  6.68722426
#> TCGA-56-8082-11A-01R-2247-07        LUAD -1.0005648 -5.6605785  3.04340846
#> TCGA-46-3768-01A-01R-0980-07        LUAD  3.4054552 -1.5515290  0.01880819
#> TCGA-51-6867-01A-11R-2045-07        LUAD -0.1692585 -3.7407587 -0.22384634
#> TCGA-18-5592-01A-01R-1635-07        LUAD  7.6273065  4.3000062  1.39552859
#> TCGA-37-A5EN-01A-21R-A26W-07        LUAD  4.4655326 -2.5591610  7.30258763
#> TCGA-33-4566-01A-01R-1443-07        LUAD  3.6139878 -1.5412537  0.67660915
#> TCGA-37-3783-01A-01R-1201-07        LUAD  0.4009309 -0.6624330  3.43389639
#> TCGA-63-6202-01A-11R-1820-07        LUAD  2.6775120 -1.4593075  7.44113586
#> TCGA-85-A4PA-01A-11R-A24Z-07        LUAD -0.8290303  5.1433581  2.46188014
#> TCGA-34-2596-01A-01R-0851-07        LUAD  5.0267500 -4.0101523  7.61493110
#> TCGA-85-8351-01A-11R-2296-07        LUAD  3.5228858  2.0954799  0.82487449
#> TCGA-22-5479-01A-31R-1949-07        LUAD  3.5100065  4.0738093  1.32455573
#> TCGA-NK-A5CX-01A-11R-A26W-07        LUAD  6.0066781  3.8234108  2.41929047
#> TCGA-NC-A5HQ-01A-11R-A26W-07        LUAD  7.9814798 -0.5346360  7.74020469
#> TCGA-85-7699-01A-11R-2125-07        LUAD  4.8822744  3.2524249  0.83695151
#> TCGA-22-4604-01A-01R-1201-07        LUAD  2.2691027  6.0000000 -1.44367696
#> TCGA-34-5234-01A-01R-1635-07        LUAD -4.5754090 -5.6605785 -1.69885139
#> TCGA-58-8392-01A-11R-2326-07        LUAD -1.1835498  2.4616245  2.52401659
#> TCGA-63-A5MV-01A-21R-A26W-07        LUAD  1.4774444  3.7063909 -0.04737659
#>                                    KRT5 PC1_greater_zero PC1_smaller_zero
#> TCGA-86-8673-01A-11R-2403-07 13.2356777             TRUE            FALSE
#> TCGA-62-A46V-01A-11R-A24H-07 11.0546842             TRUE            FALSE
#> TCGA-50-5044-01A-21R-1858-07  8.6900116             TRUE            FALSE
#> TCGA-68-7757-01B-11R-2296-07 -1.6421941             TRUE            FALSE
#> TCGA-85-A510-01A-11R-A26W-07 -0.7983457             TRUE            FALSE
#> TCGA-46-3766-01A-01R-0980-07 12.3189714             TRUE            FALSE
#> TCGA-22-5481-11A-01R-1949-07  0.9659658             TRUE            FALSE
#> TCGA-22-5478-11A-11R-1635-07  4.9274942             TRUE            FALSE
#> TCGA-85-7696-01A-11R-2125-07  0.1741246             TRUE            FALSE
#> TCGA-22-4594-01A-01R-1201-07 -1.1152292             TRUE            FALSE
#> TCGA-6A-AB49-01A-12R-A405-07  0.5338189             TRUE            FALSE
#> TCGA-56-7582-11A-01R-2045-07 11.0139166             TRUE            FALSE
#> TCGA-56-8305-01A-11R-2296-07  4.3536074             TRUE            FALSE
#> TCGA-56-8082-11A-01R-2247-07 -1.2340165             TRUE            FALSE
#> TCGA-46-3768-01A-01R-0980-07  2.8018288             TRUE            FALSE
#> TCGA-51-6867-01A-11R-2045-07  5.7568558             TRUE            FALSE
#> TCGA-18-5592-01A-01R-1635-07  2.0926904             TRUE            FALSE
#> TCGA-37-A5EN-01A-21R-A26W-07  2.4172140             TRUE            FALSE
#> TCGA-33-4566-01A-01R-1443-07  0.6207392             TRUE            FALSE
#> TCGA-37-3783-01A-01R-1201-07  7.6497640             TRUE            FALSE
#> TCGA-63-6202-01A-11R-1820-07 11.1047418             TRUE            FALSE
#> TCGA-85-A4PA-01A-11R-A24Z-07  4.5905074             TRUE            FALSE
#> TCGA-34-2596-01A-01R-0851-07  9.2787683             TRUE            FALSE
#> TCGA-85-8351-01A-11R-2296-07  3.2194062             TRUE            FALSE
#> TCGA-22-5479-01A-31R-1949-07 -1.4336687             TRUE            FALSE
#> TCGA-NK-A5CX-01A-11R-A26W-07 -0.5332040             TRUE            FALSE
#> TCGA-NC-A5HQ-01A-11R-A26W-07 10.9722815             TRUE            FALSE
#> TCGA-85-7699-01A-11R-2125-07  1.8870867             TRUE            FALSE
#> TCGA-22-4604-01A-01R-1201-07  0.7859955             TRUE            FALSE
#> TCGA-34-5234-01A-01R-1635-07  2.1115393             TRUE            FALSE
#> TCGA-58-8392-01A-11R-2326-07 -3.2069578             TRUE            FALSE
#> TCGA-63-A5MV-01A-21R-A26W-07 -1.2697271             TRUE            FALSE

One can further investigate those samples (by imaging data or proteomics data) and see if there could be mislabeled.

### samples that are labeled as LUSC but mixed with LUAD
TCGA_lung_meta %>%
        filter(cancer_type == "LUSC", PC1_smaller_zero) %>%
        head()

#>                              cancer_type   NAPSA      TFF1     TP63     KRT5
#> TCGA-73-4658-01A-01R-1755-07        LUSC 8.00000 -4.324249 3.940270 5.319164
#> TCGA-67-3771-01A-01R-0946-07        LUSC 8.00000 -3.701993 4.260983 7.132368
#> TCGA-44-6776-11A-01R-1858-07        LUSC 8.00000 -5.660578 3.632858 6.775651
#> TCGA-55-8511-01A-11R-2403-07        LUSC 7.89183 -3.911218 5.754526 3.194861
#> TCGA-97-8172-01A-11R-2287-07        LUSC 8.00000 -4.485356 1.690959 1.436336
#> TCGA-55-6979-01A-11R-1949-07        LUSC 6.51150 -3.224689 4.954470 2.687894
#>                              PC1_greater_zero PC1_smaller_zero
#> TCGA-73-4658-01A-01R-1755-07            FALSE             TRUE
#> TCGA-67-3771-01A-01R-0946-07            FALSE             TRUE
#> TCGA-44-6776-11A-01R-1858-07            FALSE             TRUE
#> TCGA-55-8511-01A-11R-2403-07            FALSE             TRUE
#> TCGA-97-8172-01A-11R-2287-07            FALSE             TRUE
#> TCGA-55-6979-01A-11R-1949-07            FALSE             TRUE

PCA on the rank matrix

There could be certain batch effect that we do not know.

and the batch effect somehow force the LUAD and LUSC mixing in the PCA plot.

Let’s try rank the genes per sample by their expression level and use the rank matrix to do PCA.

# Convert to rank matrix (ranking each column separately)
# TPM may be better than CPM 
# rank_matrix <- apply(TCGA_lung_cpm, 2, rank, ties.method = "average")
# faster than the apply

library(matrixStats)
rank_matrix <- colRanks(TCGA_lung_cpm, ties.method = "average")

## colRanks returns sample x gene, transpose it back to gene x sample
rank_matrix<- t(rank_matrix)

hist(rank_matrix)

PCA assumes continuous, approximately normal data because it relies on variance computation. Rank values are discrete and not normally distributed. However, in practice, PCA can still work on ranked data, though its interpretation changes.

rank_matrix_sub<- select_top_var_genes(rank_matrix, top_n = 1000)
hist(rank_matrix_sub)

It looks approximately normal.

PCA on this matrix:

pca_result_rank<- prcomp(t(rank_matrix_sub), scale. = TRUE)

autoplot(pca_result_rank, data = TCGA_lung_meta , color ="cancer_type") +
  ggtitle("TCGA NSCLC raw rank")

autoplot(pca_result_rank, data = TCGA_lung_meta , color ="cancer_type") +
  facet_wrap(~cancer_type) +
  ggtitle("TCGA NSCLC raw rank")

PCA on the raw rank matrix somehow can separate LUAD and LUSC as well!

PCA on normalized rank matrix

This maps ranks to a normal distribution using the inverse normal transformation.

rank_normalized<- qnorm((rank_matrix - 0.5) / nrow(rank_matrix))
hist(rank_normalized)

rank_normalized_sub<- select_top_var_genes(rank_normalized, top_n = 1000)
hist(rank_normalized_sub)

It looks approximately normal.

PCA on the normalized rank matrix

pca_result_rank<- prcomp(t(rank_normalized_sub), scale. = TRUE)

autoplot(pca_result_rank, data = TCGA_lung_meta , color ="cancer_type") +
  ggtitle("TCGA NSCLC normalized rank")

autoplot(pca_result_rank, data = TCGA_lung_meta , color ="cancer_type") +
  facet_wrap(~cancer_type) +
  ggtitle("TCGA NSCLC normalized rank")

The PCA plot looks similar to the one with the raw rank and the cpm normalized counts. We still see those LUAD/LUSC mixing samples.

Key takeaways

Use PCA do do exploratory data analysis.
Use your biology knowledge (in our case, LUAD and LUSC known markers) to investigate the strange samples.
PCA on the raw rank matrix and the normalized rank matrix can separate the LUAD and LUSC samples.

There is a long post for how to use DESeq2 on large scale dataset.

https://bsky.app/profile/iansudbery.bsky.social/post/3lgdp62x5bc2q

In my next blog post, I will show you using DESeq2 vs wilcox rank sum test for differential gene expression analysis.