How CCA alignment and cell label transfer work in Seurat

Last updated on Oct 26, 2024 15 min read education, single cell, bioinformatics, data, seurat

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

Understand CCA

Following my last blog post on PCA projection and cell label transfer, we are going to talk about CCA.

In single-cell RNA-seq data integration using Canonical Correlation Analysis (CCA), we typically align two matrices representing different datasets, where both datasets have the same set of genes but different numbers of cells. CCA is used to identify correlated structures across the two datasets and align them in a common space.

Let’s denote the two datasets as:

\(X\): a matrix with dimensions \(p \times n1\) (where \(p\) is the number of genes and \(n\) is the number of cells in dataset 1, e.g., PBMC 3k)

Y: a matrix with dimensions \(p \times n2\) (where \(p\) is the number of genes and \(n\) is the number of cells in dataset 2, e.g., PBMC 10k)

In this case, the number of genes (rows) is the same in both matrices, but the number of cells (columns) is different.

The goal of CCA here is to find linear combinations of the genes in both datasets that produce correlated cell embeddings in a lower-dimensional space. Essentially, we aim to align the two datasets by identifying the correlated patterns in gene expression that are shared between the two sets of cells.

How CCA Works

Given the two datasets \(X\) and \(Y\), CCA finds the linear combinations of the gene expression profiles that are maximally correlated. This means that for each cell in \(X\) and each cell in \(Y\) we find directions in the gene expression space that maximize the correlation between the two datasets.

Formulas for CCA

Input Data:

a matrix \(X \in \mathbb{R}^{p \times n_1}\) is the matrix of gene expression from the first dataset (3k cells), and \(Y \in \mathbb{R}^{p \times n_2}\) is the matrix of gene expression from the second dataset (10k cells).

Canonical Variates: We aim to find vectors \(a \in \mathbb{R}^p\) and \(b \in \mathbb{R}^p\) such that:

\[ u = X^T a \quad (\text{canonical variate for cells in } X) \]

\[ v = Y^T b \quad (\text{canonical variate for cells in } Y) \]

Where \(u\) and \(v\) are the canonical variates, and the goal is to maximize the correlation:

\[ \text{corr}(u, v) = \frac{\text{cov}(u, v)}{\sqrt{\text{var}(u)} \cdot \sqrt{\text{var}(v)}} \]

Covariance Matrices: \[ \Sigma_{XX} = \frac{1}{n_1 - 1} XX^T \quad \text{(covariance matrix of dataset } X) \]

\[ \Sigma_{YY} = \frac{1}{n_2 - 1} YY^T \quad \text{(covariance matrix of dataset } Y) \]

\[ \Sigma_{XY} = \frac{1}{\min(n_1, n_2) - 1} X^TY \quad \text{(cross-covariance matrix between } X \text{ and } Y) \]

Generalized Eigenvalue Problem (GEP):

To find the canonical weights \(a\) and \(b\), we solve the generalized eigenvalue problem:

\[ \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} a = \lambda a \]

Similarly for \(b\):

\[ \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} b = \lambda b \]

The eigenvalue \(\lambda\) corresponds to the canonical correlation. \(a\) is the canonical weight vector for dataset \(X\) \(b\) is the canonical weight vector for dataset \(Y\)

Solving the eigenvalue problem involves finding the eigenvalues and eigenvectors that maximize the correlation between the datasets, but it effectively solves the same problem as the SVD for the covariance matrix \(\Sigma XY\). One can demonstrate that mathematically using matrix calculation. I am not a math person, I will leave it to you to do the proof :). Again, I really wish I learned linear algebra better.

Note watch this for eigenvalues and eigenvectors from 3blue1brown.

Note read this blog post https://xinmingtu.cn/blog/2022/CCA_dual_PCA/ by Xinming Tu (with nice visuals!) to understand the relationship between CCA and PCA in a more mathematical way.

How SVD is used and how it relates to the GEP.

Perform SVD on the Cross-Covariance Matrix:

The SVD of the cross-covariance matrix \(\Sigma_{XY}\) can be expressed as:

\[ \Sigma_{XY} = U D V^T \]

where:
- \(U\) is a matrix of left singular vectors (canonical directions for dataset \(X\),
- \(V\) is a matrix of right singular vectors (canonical directions for dataset \(Y\),
- \(D\) is a diagonal matrix containing singular values, which represent the strength of correlations.

The matrix \(U\) contains the left singular vectors of \(\Sigma_{XY}\), which correspond to the canonical directions for dataset \(X\). These directions are the axes along which the data in \(X\) is maximally correlated with the data in \(Y\), as measured by the singular values in \(D\). Thus, \(U\) represents the canonical variates (directions) for \(X\), just as \(V\) represents the canonical variates for \(Y\).

Let’s see an example in action

library(Seurat)
library(Matrix)
library(irlba)  # For PCA
library(dplyr)
# devtools::install_github('satijalab/seurat-data')
library(SeuratData)
#AvailableData()
#InstallData("pbmc3k")

The pbmc3k data and pbmc10k data have different number of gene names, let’s subset to the common genes.

# download 10k dataset here curl -Lo pbmc_10k_v3.rds https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1 

pbmc3k<-UpdateSeuratObject(pbmc3k)
pbmc10k<- readRDS("~/blog_data/pbmc_10k_v3.rds")
pbmc10k<-UpdateSeuratObject(pbmc10k)

pbmc3k_genes <- rownames(pbmc3k)
pbmc10k_genes <- rownames(pbmc10k)

# Find common genes
common_genes <- intersect(pbmc3k_genes, pbmc10k_genes)

# reorder the genes to the same order
pbmc3k <- subset(pbmc3k, features = common_genes)
pbmc10k <- subset(pbmc10k, features = common_genes)

How the pbmc10k data look like:

# routine processing for 3k dataset
pbmc3k<- pbmc3k %>% 
  NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
  FindVariableFeatures(selection.method = "vst", nfeatures = 5000) %>%
  ScaleData() %>%
  RunPCA(verbose = FALSE) %>%
  FindNeighbors(dims = 1:10, verbose = FALSE) %>%
  FindClusters(resolution = 0.5, verbose = FALSE) %>%
  RunUMAP(dims = 1:10, verbose = FALSE)

DimPlot(pbmc10k, label = TRUE, repel = TRUE) + NoLegend()

The pbmc3k dataset comes with annotations (the seurat_annotations column). In this experiment, we will pretend we do not have it and use the 10k pbmc data to transfer the labels. Also the pbmc10k cell labels are a little more granular.

pbmc10k <- NormalizeData(pbmc10k)
pbmc10k <- FindVariableFeatures(pbmc10k, selection.method = "vst", nfeatures = 5000)
pbmc10k <- ScaleData(pbmc10k)

# Find common variable features, Seurat has a more complex function
# we will take the intersection
#variable_genes <- SelectIntegrationFeatures(list(pbmc3k, pbmc10k), nfeatures = 3000)

variable_genes <- intersect(VariableFeatures(pbmc3k), VariableFeatures(pbmc10k))

Note we need to scale them by genes across the cells first.

# Step 1: Center the datasets
centered_pbmc3k <- t(scale(t(pbmc3k@assays$RNA@data[variable_genes,]), center =TRUE, scale =TRUE))
centered_pbmc10k <- t(scale(t(pbmc10k@assays$RNA@data[variable_genes,]), center= TRUE, scale =TRUE))

dim(centered_pbmc3k)

#> [1] 2376 2700

dim(centered_pbmc10k)

#> [1] 2376 9432

SVD directly computes the singular values (canonical correlations) and singular vectors (canonical variates) by decomposing \(\Sigma XY\)

The singular value decomposition (SVD) can be mathematically related to solving the generalized eigenvalue problem.

Sigma_XY<- (1 / (min(ncol(centered_pbmc3k), ncol(centered_pbmc10k)) - 1)) * t(centered_pbmc3k) %*% centered_pbmc10k

dim(Sigma_XY)

#> [1] 2700 9432

# Perform SVD
k <- 20  # Number of CCA dimensions
cca_svd <- irlba::irlba(Sigma_XY, nu=k, nv=k)

# Get canonical variates and correlations
canonical_variates_3k <- cca_svd$u  # PBMC3k canonical variates
canonical_variates_10k <- cca_svd$v  # PBMC10k canonical variates

# cca_svd$d contains the singular values
canonical_cors <- cca_svd$d  # Canonical correlations

range(canonical_cors)

#> [1]   4.70585 165.14183

# normalize it so the first CCA dimension has a correlation close to 1 
canonical_cors<- cca_svd$d / sum(cca_svd$d)

barplot(canonical_cors, 
        main="Canonical Correlations", 
        xlab="Canonical Dimension", 
        ylab="Correlation",
        col="lightblue")

The sum of the first few CCA dimension should have a corrrelation close to 1.

L2 Normalization of Canonical Correlation Vectors

From the Seurat V3 paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC6687398/

While MNNs have previously been identified using L2-normalized gene expression, significant differences across batches can obscure the accurate identification of MNNs, particularly when the batch effect is on a similar scale to the biological differences between cell states. To overcome this, we first jointly reduce the dimensionality of both datasets using diagonalized canonical correlation analysis (CCA), then apply L2-normalization to the canonical correlation vectors (Figure 1A,B). We next search for MNNs in this shared low-dimensional representation

Purpose of L2 Normalization: After computing the canonical variates (the linear combinations of the original variables that maximize correlation), L2 normalization (also known as Euclidean normalization) is applied to these vectors. This process scales the vectors so that their lengths (norms) equal one.

Mathematical Formulation: For a vector \(v\) L2 normalization can be expressed as:

\(v_{\text{normalized}} = \frac{v}{\|v\|_2}\)

where \(\|v\|_2\) is the L2 norm (Euclidean norm) of the vector.

Benefits of L2 Normalization:

Stabilization: It helps to stabilize the comparison of canonical correlation vectors across datasets, making the downstream analysis more robust.

Interpretable Scale: Normalized vectors provide a more interpretable scale for distances and similarities in the integrated dataset.

Apply the L2 normalization for each cell.

l2_normalize <- function(x) {
  x / sqrt(rowSums(x^2))
}

# L2-normalize the canonical variates for PBMC 3k
canonical_variates_3k<- l2_normalize(canonical_variates_3k)

# L2-normalize the canonical variates for PBMC 10k
canonical_variates_10k <- l2_normalize(canonical_variates_10k)

# Check dimensions to make sure they remain the same
dim(canonical_variates_3k)  # Should be 2700 by 20

#> [1] 2700   20

dim(canonical_variates_10k)  # Should be 9432 by 20

#> [1] 9432   20

Function to find Mutual nearest neighbors (MNN)

We will use RANN package to do it at a high level. We used RcppAnnoy in the last post.

library(RANN)

# Step 1: Find Nearest Neighbors using RANN
k <- 30  # Number of neighbors to consider
nn_result_10k <- nn2(data = canonical_variates_10k, query = canonical_variates_3k, k = k)
nn_indices_10k <- nn_result_10k$nn.idx  # Indices of nearest neighbors in pbmc10k

# Find the nearest neighbors in pbmc3k for pbmc10k cells
nn_result_3k <- nn2(data = canonical_variates_3k, query = canonical_variates_10k, k = k)
nn_indices_3k <- nn_result_3k$nn.idx  # Indices of nearest neighbors in pbmc3k

# Step 2: Identify Mutual Nearest Neighbors
find_mnn <- function(nn_indices_10k, nn_indices_3k) {
  mnn_list <- vector("list", length = nrow(canonical_variates_3k))
  
  for (i in 1:nrow(canonical_variates_3k)) {
    neighbors_10k <- nn_indices_10k[i, ]  # Neighbors in pbmc10k
    mutual_neighbors <- c()
    
    for (neighbor in neighbors_10k) {
      # Check if this neighbor sees the original cell as its nearest neighbor
      if (i %in% nn_indices_3k[neighbor, ]) {
        mutual_neighbors <- c(mutual_neighbors, neighbor)
      }
    }
    mnn_list[[i]] <- mutual_neighbors
  }
  return(mnn_list)
}

# Find MNNs
mnn_indices <- find_mnn(nn_indices_10k, nn_indices_3k)

Label transfer for cells that have MNNs.

# Step 3: Label Transfer with MNN
# Assume we have labels for the PBMC10k dataset
pbmc10k_labels <- as.character(pbmc10k$celltype)

# Initialize transferred labels and scores
transferred_labels <- character(length = nrow(canonical_variates_3k))
transfer_scores <- numeric(length = nrow(canonical_variates_3k))

# Transfer labels with error handling
for (i in seq_along(mnn_indices)) {
  if (length(mnn_indices[[i]]) > 0) {
    # Get labels for the MNNs
    labels <- pbmc10k_labels[mnn_indices[[i]]]
    
    # Remove NAs from labels
    labels <- labels[!is.na(labels)]
    
    # Check if we have any valid labels left
    if (length(labels) > 0) {
      # Assign the most common label among the MNNs
      transferred_labels[i] <- names(sort(table(labels), decreasing = TRUE))[1]
      # Calculate transfer score as the proportion of matching labels
      transfer_scores[i] <- max(table(labels)) / length(labels)
    } else {
      # Assign NA or a default value if no valid labels
      transferred_labels[i] <- NA_character_  # Keep it as NA of character type
      transfer_scores[i] <- 0
    }
  } else {
    # For cells without MNN, assign NA or a default label (e.g., "unknown")
    transferred_labels[i] <- NA_character_
    transfer_scores[i] <- 0
  }
}

Step 4: Handle cells without MNN

You can assign a default label or propagate labels based on other criteria. For example, using a knn approach or global label distribution. Optionally, you could implement a fallback strategy like the following:

for (i in seq_along(transferred_labels)) {
  if (is.na(transferred_labels[i])) {
    # Look for the nearest neighbor labels in pbmc10k
    nearest_label_index <- nn_indices_10k[i, 1]  # Get the first neighbor
    transferred_labels[i] <- pbmc10k_labels[nearest_label_index]  # Assign its label
  }
}

# transferred_labels now contains labels transferred from pbmc10k to pbmc3k
# transfer_scores indicates the confidence of the label transfer for each cell

pbmc3k$transferred_labels<- transferred_labels

compare with Seurat’s wrapper

the default is pcaprojection as what we did in my last blog post.

# Step 1: Find transfer anchors
anchors <- FindTransferAnchors(
  reference = pbmc10k,     # The reference dataset
  query = pbmc3k,          # The query dataset
  dims = 1:100,                # The dimensions to use for anchor finding
  #reduction = "cca" # 
)

# Step 2: Transfer labels
predictions <- TransferData(
  anchors = anchors,           # The anchors identified in the previous step
  refdata = pbmc10k$celltype, # Assuming 'label' is the metadata containing the true labels in seurat_10k
  dims = 1:30                  # Dimensions to use for transferring
)


# Step 3: Add predictions to the query dataset
pbmc3k <- AddMetaData(pbmc3k, metadata = predictions)

# predicted.id is from Seurat's wrapper function, predicted is from our naive implementation
table(pbmc3k$transferred_labels, pbmc3k$predicted.id)

#>                         
#>                          B cell progenitor CD14+ Monocytes CD16+ Monocytes
#>   B cell progenitor                     86               0               0
#>   CD14+ Monocytes                        0             480              15
#>   CD16+ Monocytes                        0               5             131
#>   CD4 Memory                             0               0               0
#>   CD4 Naive                              0               0               0
#>   CD8 effector                           0               0               0
#>   CD8 Naive                              0               0               0
#>   Dendritic cell                         0               3               1
#>   Double negative T cell                 0               0               0
#>   NK cell                                2               0               0
#>   pDC                                    0               2               0
#>   Platelets                              0               3               0
#>   pre-B cell                            15               0               0
#>                         
#>                          CD4 Memory CD4 Naive CD8 effector CD8 Naive
#>   B cell progenitor               2         6            0         1
#>   CD14+ Monocytes                23        20            0         0
#>   CD16+ Monocytes                 0         0            0         0
#>   CD4 Memory                    452       165            5        21
#>   CD4 Naive                       5       334            0        18
#>   CD8 effector                   16         6          163         9
#>   CD8 Naive                       0        34            0        77
#>   Dendritic cell                  0         1            0         1
#>   Double negative T cell         21        13            4         2
#>   NK cell                         1         1           14         0
#>   pDC                             2         1            0         0
#>   Platelets                       0         0            2         0
#>   pre-B cell                      0         0            0         0
#>                         
#>                          Dendritic cell Double negative T cell NK cell pDC
#>   B cell progenitor                   0                      0       0   0
#>   CD14+ Monocytes                     4                      0       1   0
#>   CD16+ Monocytes                     0                      0       0   0
#>   CD4 Memory                          0                      1       2   0
#>   CD4 Naive                           0                      0       0   0
#>   CD8 effector                        0                      2       0   0
#>   CD8 Naive                           0                      0       0   0
#>   Dendritic cell                     27                      0       0   0
#>   Double negative T cell              0                     91       0   0
#>   NK cell                             0                      1     150   0
#>   pDC                                 0                      0       0   4
#>   Platelets                           0                      0       0   0
#>   pre-B cell                          0                      0       0   0
#>                         
#>                          Platelets pre-B cell
#>   B cell progenitor              0         10
#>   CD14+ Monocytes                1          0
#>   CD16+ Monocytes                0          0
#>   CD4 Memory                     0          0
#>   CD4 Naive                      0          0
#>   CD8 effector                   0          0
#>   CD8 Naive                      0          0
#>   Dendritic cell                 0          0
#>   Double negative T cell         0          0
#>   NK cell                        0          1
#>   pDC                            0          0
#>   Platelets                     12          0
#>   pre-B cell                     0        230

visualize in a heatmap

library(ComplexHeatmap)
table(pbmc3k$transferred_labels, pbmc3k$predicted.id) %>%
        as.matrix() %>%
        scale() %>%
        Heatmap(cluster_rows = FALSE, cluster_columns= FALSE, name= "scaled\ncell number")

clustering and UMAP

With the CCA embeddings, you can do clustering (k-means, hierarchical clustering etc) and UMAP by concatenating the cannoical variates from two datasets together.

# Load required libraries for clustering and UMAP
library(uwot)  # For UMAP
library(cluster)  # For clustering algorithms (like k-means)

# Combine canonical variates from both datasets 
combined_variates <- rbind(canonical_variates_3k, canonical_variates_10k)
combined_variates<- scale(combined_variates)

# Step 1: Clustering (using k-means as an example)
# Set the number of clusters (k)
k <- 13  # Choose the number of clusters based on your data ( I am cheating as I know 13 celltypes in the dataset)

set.seed(123)
kmeans_result <- kmeans(combined_variates, centers = k)

# Extract cluster labels
cluster_labels <- kmeans_result$cluster

# Step 2: UMAP Calculation
# Compute UMAP on the canonical variates
umap_results <- umap(combined_variates)

# UMAP Results
umap_data <- as.data.frame(umap_results)

# Add cluster labels and data source to UMAP results for visualization
umap_data$Cluster <- factor(cluster_labels)
umap_data$dataset<- c(rep("3k", nrow(canonical_variates_3k)),
                      rep("10k", nrow(canonical_variates_10k)))
umap_data$celltype<- c(as.character(pbmc3k$transferred_labels),
                       as.character(pbmc10k$celltype))

table(umap_data$Cluster, umap_data$celltype) %>%
        as.matrix() %>%
        scale() %>%
        Heatmap(cluster_rows = FALSE, cluster_columns= FALSE, name= "scaled\ncell number")

K-means works reasonably well. It mixes CD4 naive and CD8 naive together; pDC and Dendritic cells together. It split CD14+ monocytes into two different cluster 6 and 11. One can do many times of K-means and use the consensus as the clusters. e.g., SC3: consensus clustering of single-cell RNA-seq data.

Datasets integration

Plot UMAP

library(ggplot2)

ggplot(umap_data, aes(x = V1, y = V2, color = dataset)) +
  geom_point(size = 0.4) +
  labs(title = "UMAP Projection of CCA Canonical Variates",
       x = "UMAP 1",
       y = "UMAP 2") +
  theme_minimal() +
  scale_color_discrete(name = "dataset") +
        facet_wrap(~dataset)

We see those two datasets are aligned super well!

ggplot(umap_data, aes(x = V1, y = V2, color = dataset)) +
  geom_point(size = 0.4) +
  labs(title = "UMAP Projection of CCA Canonical Variates",
       x = "UMAP 1",
       y = "UMAP 2") +
  theme_minimal() +
  scale_color_discrete(name = "dataset")

Let’s color the cells with celltype:

# use https://oompa.r-forge.r-project.org/packages/Polychrome/polychrome.html for colors
library(Polychrome)

# length(unique(umap_data$celltype)) 
# total 13 distinct cell types

# remove the first white color
colors_to_use<- kelly.colors(n=14)[-1] %>% unname()

ggplot(umap_data, aes(x = V1, y = V2, color = celltype)) +
  geom_point(size = 0.4) +
  labs(title = "UMAP Projection of CCA Canonical Variates",
       x = "UMAP 1",
       y = "UMAP 2") +
  theme_minimal() +
        scale_color_manual(values = colors_to_use) +
        guides(colour = guide_legend(override.aes = list(size=2))) # increase the dot size in the legend

What if we do not do CCA integration?

What if we do not use the CCA covariates for clustering? We can just concatenate two seurat object together and go through the routine process:

# add a metadata column so you know which dataset the cells are from 
pbmc3k$sample<- "3k"
pbmc10k$sample<- "10k"

# combine two seurat objects together
pbmc<- merge(pbmc3k, pbmc10k)

pbmc<- pbmc %>% 
  NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
  FindVariableFeatures(selection.method = "vst", nfeatures = 5000) %>%
  ScaleData() %>%
  RunPCA(verbose = FALSE) %>%
  FindNeighbors(dims = 1:30, verbose = FALSE) %>%
  FindClusters(resolution = 0.5, verbose = FALSE) %>%
  RunUMAP(dims = 1:30, verbose = FALSE)

DimPlot(pbmc, label = TRUE, repel = TRUE, group.by= "sample")

Without using CCA, you see the cells are separated by samples. The idea of data integration is to make the cells of the same cell type group together even they are from different datasets while cells of different cell types still separate from each other.

What CCA Achieves in Single-Cell Data:

Aligning Cell Embeddings: CCA produces canonical variates \(u\) and \(v\), which are lower-dimensional embeddings of the cells from the two datasets. These embeddings are aligned in such a way that the cells from both datasets are maximally correlated in this new space.
Identifying Shared Structure: By finding the common canonical directions, CCA captures the shared structure in gene expression between the two datasets, allowing cells with similar profiles across datasets to be aligned.

Comparison to PCA

Unlike PCA, which focuses on capturing variance within a single dataset, CCA focuses on capturing correlation between two datasets. Both involve matrix decomposition, but while PCA decomposes a single covariance matrix, CCA involves solving for relationships between the covariance matrices of two datasets.

To summarize:

PCA: Finds directions of maximum variance within a single dataset.
CCA: Finds directions of maximum covariance between two datasets.

Happy Learning!

Tommy aka. Crazyhottommy

bioinformatics education PCA CCA