Are PDL1 RNA and protein levels correlated in cancer cell lines?

Oct 6, 2022 38 min read bioinformatics

Are protein and RNA levels correlated? This is a big question.

see replies to this tweet at https://twitter.com/slavov_n/status/1561844133496512512. In general, RNA and protein abundances should be correlated but there are exceptions of course. Biology is complicated/weird!

One of my favorite examples is Hypoxia-inducible factor 1-alpha, HIF-1α. The protein is efficiently degraded in most tissues most of the time unless stabilized by hypoxia.

William Kaelin won Nobel prize for his work on how cells sense oxygen through the HIF pathway in 2019.

Coincidentally, my first paper during PhD is on how CTCF functions as an enhancer block to attenuate the HIF1 enhancers.

From this paper Quantitative Proteomics of the Cancer Cell Line Encyclopedia, some genes are well-correlated between protein and RNA while others are not.

From this review mRNAs, proteins and the emerging principles of gene expression control:

Therefore, mRNA levels should not be interpreted as the final output of gene expression. Instead, it is more instructive to think of mRNAs as what they mechanistically are: the templates for protein synthesis.

Explore proteomics data

Let’s do some analysis to check if PDL1/CD274’s mRNA and protein are correlated or not across cancer cell lines. PDL1 is the ligand of PD1 and is a prominent immunotherapy target.

The protemomics data are from this paper Pan-cancer proteomic map of 949 human cell lines

Both the RNAseq and the proteomics data were downloaded from https://cellmodelpassports.sanger.ac.uk/downloads

library(tidyverse)
library(janitor)

data<- read_tsv("~/blog_data/protein_matrix_averaged_20220713.tsv", skip = 1)

#remove one more line
data<- data %>%
  dplyr::slice(-1) 

data<- data %>%
  dplyr::rename(model_name = symbol) %>%
  dplyr::rename(model_id = `...2`)

        
head(data)

#> # A tibble: 6 × 8,459
#>   model_name model_id  `IGLV3-10` TRBV19 `IGHV3OR16-12` `IGHV3OR15-7` TRBV18
#>   <chr>      <chr>          <dbl>  <dbl>          <dbl>         <dbl>  <dbl>
#> 1 201T       SIDM00055         NA     NA             NA            NA     NA
#> 2 22RV1      SIDM00499         NA     NA             NA            NA     NA
#> 3 23132-87   SIDM00980         NA     NA             NA            NA     NA
#> 4 42-MG-BA   SIDM00982         NA     NA             NA            NA     NA
#> 5 451Lu      SIDM01240         NA     NA             NA            NA     NA
#> 6 5637       SIDM00807         NA     NA             NA            NA     NA
#> # … with 8,452 more variables: `IGHV3-15` <dbl>, PIGBOS1 <dbl>,
#> #   `IGHV1-3` <dbl>, `IGHV1-18` <dbl>, NBDY <dbl>, MMP24OS <dbl>,
#> #   ZC3H11B <dbl>, GOLGA6L7 <dbl>, TLE7 <dbl>, RBM47 <dbl>, UBA6 <dbl>,
#> #   ESYT2 <dbl>, MED19 <dbl>, SHTN1 <dbl>, TMEM223 <dbl>, BICDL2 <dbl>,
#> #   MEX3A <dbl>, ILVBL <dbl>, C1orf226 <dbl>, PLEKHG3 <dbl>, SH3PXD2B <dbl>,
#> #   ZC3H12D <dbl>, WDR27 <dbl>, NBAS <dbl>, FAM234B <dbl>, FUOM <dbl>,
#> #   VWA8 <dbl>, SBNO1 <dbl>, GTPBP10 <dbl>, WDR91 <dbl>, KLRG2 <dbl>, …

so each row is a cell line, and each column is a gene. It also has a model_id column that the consortium use internally.

read in the metadata:

meta<- read_csv("~/blog_data/model_list_20220628.csv")

head(meta)

#> # A tibble: 6 × 97
#>   model_id sample_id patient_id parent_id model_name synonyms tissue cancer_type
#>   <chr>    <chr>     <chr>      <chr>     <chr>      <chr>    <chr>  <chr>      
#> 1 SIDM017… SIDS01659 SIDP01578  <NA>      PK-59      <NA>     Pancr… Pancreatic…
#> 2 SIDM001… SIDS00612 SIDP00541  <NA>      SNU-1033   <NA>     Large… Colorectal…
#> 3 SIDM014… SIDS01466 SIDP01347  <NA>      SNU-466    <NA>     Centr… Glioblasto…
#> 4 SIDM015… SIDS01363 SIDP01247  <NA>      IST-MES-2  <NA>     Lung   Mesothelio…
#> 5 SIDM016… SIDS01631 SIDP01557  <NA>      MUTZ-5     <NA>     Haema… B-Lymphobl…
#> 6 SIDM014… SIDS01493 SIDP01374  <NA>      TM-31      <NA>     Centr… Glioma     
#> # … with 89 more variables: cancer_type_ncit_id <chr>, tissue_status <chr>,
#> #   sample_site <chr>, cancer_type_detail <chr>, model_type <chr>,
#> #   growth_properties <chr>, model_treatment <chr>, sampling_day <dbl>,
#> #   sampling_month <dbl>, sampling_year <dbl>, doi <chr>, pmed <dbl>,
#> #   msi_status <chr>, ploidy_snp6 <dbl>, ploidy_wes <dbl>, ploidy_wgs <dbl>,
#> #   mutational_burden <dbl>, model_comments <chr>,
#> #   model_relations_comment <chr>, COSMIC_ID <chr>, BROAD_ID <chr>, …

dim(meta)

#> [1] 2078   97

Sanity check if model_id and model_name are 1:1 mapping or not.

## model_id and model_name are not 1:1 mapping
meta %>% select(model_id, model_name) %>%
        distinct(model_name) %>%
        dim()

#> [1] 2075    1

## three same cell line has duplicated lines
meta %>%
        count(model_name) %>%
        dplyr::filter(n >1)

#> # A tibble: 3 × 2
#>   model_name     n
#>   <chr>      <int>
#> 1 ML-1           2
#> 2 MS-1           2
#> 3 SLR24          2

## model id is unique, but not the model_name
meta %>%
        dplyr::filter(model_name %in% c("ML-1", "MS-1", "SLR24"))

#> # A tibble: 6 × 97
#>   model_id sample_id patient_id parent_id model_name synonyms tissue cancer_type
#>   <chr>    <chr>     <chr>      <chr>     <chr>      <chr>    <chr>  <chr>      
#> 1 SIDM004… SIDS01183 SIDP01069  <NA>      MS-1       <NA>     Skin   Other Soli…
#> 2 SIDM002… SIDS00693 SIDP00619  <NA>      MS-1       MS-1-L   Lung   Small Cell…
#> 3 SIDM004… SIDS01117 SIDP01005  <NA>      ML-1       <NA>     Haema… Acute Myel…
#> 4 SIDM004… SIDS01115 SIDP01004  <NA>      ML-1       <NA>     Thyro… Thyroid Gl…
#> 5 SIDM013… SIDS01455 SIDP01336  <NA>      SLR24      <NA>     Kidney Kidney Car…
#> 6 SIDM019… SIDS01788 SIDP01690  SIDM01851 SLR24      <NA>     Kidney Kidney Car…
#> # … with 89 more variables: cancer_type_ncit_id <chr>, tissue_status <chr>,
#> #   sample_site <chr>, cancer_type_detail <chr>, model_type <chr>,
#> #   growth_properties <chr>, model_treatment <chr>, sampling_day <dbl>,
#> #   sampling_month <dbl>, sampling_year <dbl>, doi <chr>, pmed <dbl>,
#> #   msi_status <chr>, ploidy_snp6 <dbl>, ploidy_wes <dbl>, ploidy_wgs <dbl>,
#> #   mutational_burden <dbl>, model_comments <chr>,
#> #   model_relations_comment <chr>, COSMIC_ID <chr>, BROAD_ID <chr>, …

Those lines have the same model_name but from different patients so a different model_id.

head(data)

#> # A tibble: 6 × 8,459
#>   model_name model_id  `IGLV3-10` TRBV19 `IGHV3OR16-12` `IGHV3OR15-7` TRBV18
#>   <chr>      <chr>          <dbl>  <dbl>          <dbl>         <dbl>  <dbl>
#> 1 201T       SIDM00055         NA     NA             NA            NA     NA
#> 2 22RV1      SIDM00499         NA     NA             NA            NA     NA
#> 3 23132-87   SIDM00980         NA     NA             NA            NA     NA
#> 4 42-MG-BA   SIDM00982         NA     NA             NA            NA     NA
#> 5 451Lu      SIDM01240         NA     NA             NA            NA     NA
#> 6 5637       SIDM00807         NA     NA             NA            NA     NA
#> # … with 8,452 more variables: `IGHV3-15` <dbl>, PIGBOS1 <dbl>,
#> #   `IGHV1-3` <dbl>, `IGHV1-18` <dbl>, NBDY <dbl>, MMP24OS <dbl>,
#> #   ZC3H11B <dbl>, GOLGA6L7 <dbl>, TLE7 <dbl>, RBM47 <dbl>, UBA6 <dbl>,
#> #   ESYT2 <dbl>, MED19 <dbl>, SHTN1 <dbl>, TMEM223 <dbl>, BICDL2 <dbl>,
#> #   MEX3A <dbl>, ILVBL <dbl>, C1orf226 <dbl>, PLEKHG3 <dbl>, SH3PXD2B <dbl>,
#> #   ZC3H12D <dbl>, WDR27 <dbl>, NBAS <dbl>, FAM234B <dbl>, FUOM <dbl>,
#> #   VWA8 <dbl>, SBNO1 <dbl>, GTPBP10 <dbl>, WDR91 <dbl>, KLRG2 <dbl>, …

exp_data<- data[, c("model_name", "model_id", "CD274")] %>%
        left_join(meta)

head(exp_data)

#> # A tibble: 6 × 98
#>   model_name model_id  CD274 sample_id patient_id parent_id synonyms      tissue
#>   <chr>      <chr>     <dbl> <chr>     <chr>      <chr>     <chr>         <chr> 
#> 1 201T       SIDM00055    NA SIDS00744 SIDP00668  <NA>      <NA>          Lung  
#> 2 22RV1      SIDM00499    NA SIDS00942 SIDP00842  <NA>      CWR22-Rv1;CW… Prost…
#> 3 23132-87   SIDM00980    NA SIDS00443 SIDP00380  <NA>      St 23132      Stoma…
#> 4 42-MG-BA   SIDM00982    NA SIDS00444 SIDP00381  <NA>      <NA>          Centr…
#> 5 451Lu      SIDM01240    NA SIDS00062 SIDP00048  SIDM00802 WM 451-Lu;WC… Skin  
#> 6 5637       SIDM00807    NA SIDS00057 SIDP00045  <NA>      <NA>          Bladd…
#> # … with 90 more variables: cancer_type <chr>, cancer_type_ncit_id <chr>,
#> #   tissue_status <chr>, sample_site <chr>, cancer_type_detail <chr>,
#> #   model_type <chr>, growth_properties <chr>, model_treatment <chr>,
#> #   sampling_day <dbl>, sampling_month <dbl>, sampling_year <dbl>, doi <chr>,
#> #   pmed <dbl>, msi_status <chr>, ploidy_snp6 <dbl>, ploidy_wes <dbl>,
#> #   ploidy_wgs <dbl>, mutational_burden <dbl>, model_comments <chr>,
#> #   model_relations_comment <chr>, COSMIC_ID <chr>, BROAD_ID <chr>, …

table(is.na(exp_data$CD274))

#> 
#> FALSE  TRUE 
#>   159   790

only 159 lines with PDL1 expression at protein level.

# distribution of the PDL1 protein level
exp_data %>%
        dplyr::filter(!is.na(CD274)) %>%
        ggplot(aes(x= CD274)) +
        geom_histogram(bins = 30, color = "white") +
        ggtitle("PDL1 expression histogram") +
        theme_classic(base_size = 14)

p<- exp_data %>%
        dplyr::filter(!is.na(CD274)) %>%
        ggplot(aes(x = cancer_type, y= CD274 ,text = paste(cancer_type, model_name, sep=":"))) +
        geom_boxplot() +
        geom_point() +
        theme_classic(base_size = 14) +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

library(plotly)

ggplot_p<- ggplotly(p)
ggplot_p

#save the interactive plots
#htmlwidgets::saveWidget(ggplot_p, here("data/Proteomics_20220713/PDL1_proteomics_cell_line_20220718.html"), selfcontained = T)

read in the RNAseq data:

rna<- read_csv("~/blog_data/rnaseq_tpm_20220624.csv")

rna<- rna %>%
        dplyr::slice(5: nrow(rna)) %>%
        dplyr::rename(gene = `...2`) %>%
        dplyr::select(-model_id) %>%
        dplyr::mutate(across(!gene, as.numeric))

head(rna)

#> # A tibble: 6 × 1,432
#>   gene     SIDM00001 SIDM00002 SIDM00003 SIDM00005 SIDM00006 SIDM00007 SIDM00008
#>   <chr>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1 A1BG          0.53      0.78      0.49      3.29      0.56      1.43      0   
#> 2 A1BG-AS1      2.43      6.01     11.5       7.71      7.53     16.2       0.06
#> 3 A1CF          0.03      0         0         0         0.04      0         0.06
#> 4 A2M           0.33      0.75     29.1       0.87     86.2       0.07      0.06
#> 5 A2M-AS1       0         0.71      0.83      5.59      1.04      1.67      0.06
#> 6 A2ML1         0         0.57      0         0.19      0         0         0.32
#> # … with 1,424 more variables: SIDM00009 <dbl>, SIDM00011 <dbl>,
#> #   SIDM00014 <dbl>, SIDM00015 <dbl>, SIDM00016 <dbl>, SIDM00017 <dbl>,
#> #   SIDM00018 <dbl>, SIDM00023 <dbl>, SIDM00025 <dbl>, SIDM00026 <dbl>,
#> #   SIDM00027 <dbl>, SIDM00029 <dbl>, SIDM00032 <dbl>, SIDM00036 <dbl>,
#> #   SIDM00040 <dbl>, SIDM00041 <dbl>, SIDM00042 <dbl>, SIDM00043 <dbl>,
#> #   SIDM00044 <dbl>, SIDM00045 <dbl>, SIDM00046 <dbl>, SIDM00047 <dbl>,
#> #   SIDM00048 <dbl>, SIDM00049 <dbl>, SIDM00050 <dbl>, SIDM00051 <dbl>, …

PDL1_rna_tpm<- rna %>%
        dplyr::filter(gene == "CD274") %>%
        pivot_longer(! gene, names_to = "model_id", values_to = "tpm") %>%
        pivot_wider(names_from = gene, values_from = tpm) %>%
        select(model_id, CD274)

PDL1_rna_tpm<- dplyr::rename(PDL1_rna_tpm, CD274_rna = CD274)

head(PDL1_rna_tpm)

#> # A tibble: 6 × 2
#>   model_id  CD274_rna
#>   <chr>         <dbl>
#> 1 SIDM00001     17.3 
#> 2 SIDM00002      0.04
#> 3 SIDM00003      1   
#> 4 SIDM00005      1.78
#> 5 SIDM00006      2.41
#> 6 SIDM00007     36.9

exp_data<- dplyr::rename(exp_data, CD274_protein = CD274)

If we only include cell lines with both protein and RNA expression, we see the protein and RNA are positively correlated.

left_join(exp_data, PDL1_rna_tpm) %>%
        dplyr::select(model_name, model_id, CD274_protein, CD274_rna, everything()) %>%
        dplyr::filter(!is.na(CD274_protein), !is.na(CD274_rna)) %>%
        arrange(desc(CD274_protein)) %>%
        ggplot(aes(x=CD274_rna, y = CD274_protein)) +
        scale_x_continuous(trans=scales::pseudo_log_trans(base = 2))+
        geom_point() +
        geom_smooth(method= "lm") +
        theme_classic(base_size = 14)

Let’s include all cell lines and designate the NAs to 0s.

p2<- left_join(exp_data, PDL1_rna_tpm) %>%
        dplyr::select(model_name, model_id, CD274_protein, CD274_rna, everything()) %>%
        dplyr::arrange(desc(CD274_protein)) %>%
        dplyr::mutate(CD274_rna = if_else(is.na(CD274_rna), 0, CD274_rna)) %>%
        dplyr::mutate(CD274_protein = if_else(is.na(CD274_protein), 0, CD274_protein)) %>%
        ggplot(aes(x= CD274_rna, y = CD274_protein, text = paste(model_name, cancer_type, sep =":"))) +
        geom_point() +
        scale_x_continuous(trans=scales::pseudo_log_trans(base = 2), breaks = c(0, 2, 4, 8, 16, 32, 64, 100, 200, 400, 600))+ 
        geom_smooth(method= "lm") +
        theme_classic(base_size = 14) 


ggplot_p2<- ggplotly(p2)

ggplot_p2

We see a bunch of cell lines with 0 protein level expression but have some levels of RNA expression. Are the NAs in the proteomics data means PDL1 is not detected or not assayed?

And what’s going on for MDA-MB-231 breast cancer cell line in the bottom? it has high RNA expression level but the protein is very low. Also, PC-3 (non-small cell lung cancer) and D-245MG (glioblastoma) cell line have decent protein expression but no RNA expression. I guess it is a technical problem.

bioinformatics

Are PDL1 RNA and protein levels correlated in cancer cell lines?

Explore proteomics data

read in the metadata:

read in the RNAseq data:

Related