To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.
Understand the example datasets
We will use PBMC3k and PBMC10k data. We will project the PBMC3k data to the PBMC10k data and get the labels
library(Seurat)
library(Matrix)
library(irlba) # For PCA
library(RcppAnnoy) # For fast nearest neighbor search
library(dplyr)
# Assuming the PBMC datasets (3k and 10k) are already normalized
# and represented as sparse matrices
# devtools::install_github('satijalab/seurat-data')
library(SeuratData)
#AvailableData()
#InstallData("pbmc3k")
pbmc3k<-UpdateSeuratObject(pbmc3k)
pbmc3k@meta.data %>% head()
#> orig.ident nCount_RNA nFeature_RNA seurat_annotations
#> AAACATACAACCAC pbmc3k 2419 779 Memory CD4 T
#> AAACATTGAGCTAC pbmc3k 4903 1352 B
#> AAACATTGATCAGC pbmc3k 3147 1129 Memory CD4 T
#> AAACCGTGCTTCCG pbmc3k 2639 960 CD14+ Mono
#> AAACCGTGTATGCG pbmc3k 980 521 NK
#> AAACGCACTGGTAC pbmc3k 2163 781 Memory CD4 T
# routine processing
pbmc3k<- pbmc3k %>%
NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
FindVariableFeatures(selection.method = "vst", nfeatures = 3000) %>%
ScaleData() %>%
RunPCA(verbose = FALSE) %>%
FindNeighbors(dims = 1:10, verbose = FALSE) %>%
FindClusters(resolution = 0.5, verbose = FALSE) %>%
RunUMAP(dims = 1:10, verbose = FALSE)
Get an idea of how the pbmc3k data look like:
p1<- DimPlot(pbmc3k, reduction = "umap", label = TRUE, group.by =
"RNA_snn_res.0.5")
p2<- DimPlot(pbmc3k, reduction = "umap", label = TRUE, group.by = "seurat_annotations", label.size = 3)
p1 + p2
How the pbmc10k data look like:
# download it here curl -Lo pbmc_10k_v3.rds https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1
pbmc10k<- readRDS("~/blog_data/pbmc_10k_v3.rds")
pbmc10k<-UpdateSeuratObject(pbmc10k)
pbmc10k@meta.data %>% head()
#> orig.ident nCount_RNA nFeature_RNA observed simulated
#> rna_AAACCCAAGCGCCCAT-1 10x_RNA 2204 1087 0.035812672 0.4382022
#> rna_AAACCCACAGAGTTGG-1 10x_RNA 5884 1836 0.019227034 0.1017964
#> rna_AAACCCACAGGTATGG-1 10x_RNA 5530 2216 0.005447865 0.1392801
#> rna_AAACCCACATAGTCAC-1 10x_RNA 5106 1615 0.014276003 0.4949495
#> rna_AAACCCACATCCAATG-1 10x_RNA 4572 1800 0.053857351 0.1392801
#> rna_AAACCCAGTGGCTACC-1 10x_RNA 6702 1965 0.056603774 0.3554328
#> percent.mito RNA_snn_res.0.4 celltype
#> rna_AAACCCAAGCGCCCAT-1 0.02359347 1 CD4 Memory
#> rna_AAACCCACAGAGTTGG-1 0.10757988 0 CD14+ Monocytes
#> rna_AAACCCACAGGTATGG-1 0.07848101 5 NK cell
#> rna_AAACCCACATAGTCAC-1 0.10830396 3 pre-B cell
#> rna_AAACCCACATCCAATG-1 0.08989501 5 NK cell
#> rna_AAACCCAGTGGCTACC-1 0.06326470 1 CD4 Memory
DimPlot(pbmc10k, label = TRUE, repel = TRUE) + NoLegend()
The pbmc3k data and pbmc10k data have different number of gene names, let’s subset to the common genes.
the pbmc3k dataset comes with annotations (the seurat_annotations column). In this experiment, we will pretend we do not have it and use the 10k pbmc data to transfer the labels. Also the 10kpbmc cell labels are a little more granular.
pbmc3k_genes <- rownames(pbmc3k)
pbmc10k_genes <- rownames(pbmc10k)
# Find common genes
common_genes <- intersect(pbmc3k_genes, pbmc10k_genes)
pbmc3k <- subset(pbmc3k, features = common_genes)
pbmc10k <- subset(pbmc10k, features = common_genes)
all.equal(rownames(pbmc3k), rownames(pbmc10k))
#> [1] TRUE
Understand PCA/SVD
For Singular Value Decomposition (SVD), the decomposition of an \(đť‘‹\) matrix (with dimensions \(n\times p\))
- \(n\) is the number of cells/samples and
- \(p\) is the number of genes/features) is as follows:
\(X = U D V^T\)
Components of SVD:
\(U\) is a \(n \times n\) orthogonal matrix. It contains the left singular vectors (associated with the rows of \(X\) i.e., cells/samples).
\(V\) is a \(p \times p\) orthogonal matrix. It contains the right singular vectors (associated with the columns of \(X\) i.e., genes/features).
\(D\) is a \(n \times p\) diagonal matrix (with non-negative real numbers on the diagonal). The diagonal elements are the singular values of \(X\) which indicate the variance captured by each component.
Principal Components (PCs):
The principal components (PCs) are given by: \(Z = UD\)
This matrix has the dimensions \(n\times r\) (where \(r\) is the rank of \(X\))
\(Z\) contains the projection of your data onto the principal component space.
SVD is:
\(X = U D V^T\)
We \(\times V\) on both sides of the SVD
equation:
\(XV = U DV^TV\)
Since the V
matrix is orthonormal, \(V \times V^T = I\). \(I\) is the identity matrix.
The equation becomes:
\(XV = UD\)
So, alternatively, you can express the PCs as: \(Z = XV\)
In single-cell RNAseq analysis, the \(Z\) matrix is used to construct the k-nearest neighbor graph and clusters are detected using Louvain method in the graph. One can use any other clustering algorithms to cluster the cells (e.g., k-means, hierarchical clustering) in this PC space.
I really wish I learned linear algebra better during college:) Note, you can take MIT1806, which is a great course for linear algebra.
Let’s calculate the PCA from scratch with irlba
for big matrix. use built-in svd
if the matrix is small.
#install.packages("irlba")
library(irlba)
# use the scaled matrix
pbmc10k_scaled <- pbmc10k@assays$RNA@scale.data
dim(pbmc10k_scaled)
#> [1] 2068 9432
# Perform PCA using irlba (for large matrices). We transpose it first to gene x sample
pca_10k <- irlba(t(pbmc10k_scaled), nv = 100) # Keep 100 PCs. The orginal seurat object kept 100 PCs
read my previous blog post on some details of PCA steps within Seurat https://divingintogeneticsandgenomics.com/post/permute-test-for-pca-components/
By default, RunPCA
computes the PCA on the cell (n) x gene (p) matrix. One thing to note is that in linear algebra, a matrix is coded as n (rows are observations) X p (columns are features). That’s why by default, the gene x cell original matrix is transposed first to cell x gene: irlba(A = t(x = data.use), nv = pcs.compute, ...)
. After irlba
, the v
matrix is the gene loadings, the u
matrix is the cell embeddings.
This is the source code of RunPCA
from Seurat
:
pcs.compute <- min(pcs.compute, nrow(x = data.use)-1)
pca.results <- irlba(A = t(x = data.use), nv = pcs.compute, ...)
gene.loadings <- pca.results$v
sdev <- pca.results$d/sqrt(max(1, ncol(data.use) - 1))
if(weight.by.var){
cell.embeddings <- pca.results$u %*% diag(pca.results$d)
} else {
cell.embeddings <- pca.results$u
}
rownames(x = gene.loadings) <- rownames(x = data.use)
colnames(x = gene.loadings) <- paste0(reduction.key, 1:pcs.compute)
rownames(x = cell.embeddings) <- colnames(x = data.use)
colnames(x = cell.embeddings) <- colnames(x = gene.loadings)
Note, the diagonal matrix \(D\) in svd
/irlba
output in R
is the d
vector for the diagonal values, and you convert it to a matrix by diag(d)
.
# get the gene loadings (V matrix).
gene_loadings_10k <- pca_10k$v # Gene loadings (features/genes in rows, PCs in columns)
dim(gene_loadings_10k)
#> [1] 2068 100
rownames(gene_loadings_10k) <- rownames(pbmc10k_scaled)
colnames(gene_loadings_10k) <- paste0("PC", 1:100)
# 2068 most variable genes (after subsetting the common genes with the pbmc3k data)
# ideally, we should re-run FindVariableFeatures, but I am skipping it
VariableFeatures(pbmc10k) %>% length()
#> [1] 2068
# Get PCA embeddings/cell embeddings (U matrix * D matrix)
cell_embeddings_10k <- pca_10k$u %*% diag(pca_10k$d) # Cell embeddings (10k cells in rows)
dim(cell_embeddings_10k)
#> [1] 9432 100
rownames(cell_embeddings_10k) <- colnames(pbmc10k_scaled)
colnames(cell_embeddings_10k) <- colnames(gene_loadings_10k)
cell_embeddings_10k[1:5, 1:10]
#> PC1 PC2 PC3 PC4 PC5
#> rna_AAACCCAAGCGCCCAT-1 10.403049 5.696715 -3.163544 -0.09967979 1.671379
#> rna_AAACCCACAGAGTTGG-1 -16.407637 1.541569 -4.664094 -1.19047314 -7.783903
#> rna_AAACCCACAGGTATGG-1 8.260527 14.882676 20.279862 -3.38568692 -3.978372
#> rna_AAACCCACATAGTCAC-1 9.943605 -17.397959 3.370954 -0.37872376 -2.409626
#> rna_AAACCCACATCCAATG-1 10.418157 10.678506 13.450984 -2.11710310 -2.893075
#> PC6 PC7 PC8 PC9 PC10
#> rna_AAACCCAAGCGCCCAT-1 1.58299300 6.2184955 0.42955410 6.163026 -1.1548215
#> rna_AAACCCACAGAGTTGG-1 -0.03877799 -4.9201975 5.97044789 2.240565 -1.5349458
#> rna_AAACCCACAGGTATGG-1 -2.54951081 -5.5834799 -5.08844441 1.729374 -2.7080737
#> rna_AAACCCACATAGTCAC-1 -2.79861360 -0.4920498 1.06599051 -1.454068 2.1049640
#> rna_AAACCCACATCCAATG-1 -2.34159998 -0.5451218 -0.09535381 -1.472860 -0.9710287
center the 3k pbmc data with the 10k gene meas and scale:
pbmc3k_normalized <- pbmc3k@assays$RNA$data
# Center the 3k PBMC dataset based on 10k dataset's gene means
pbmc3k_scaled <- scale(t(pbmc3k_normalized), center = rowMeans(pbmc10k@assays$RNA$data), scale = TRUE)
Centering the transposed 3k PBMC dataset based on the 10k dataset’s gene means is an important step in the process of projecting one dataset onto another, particularly in single-cell RNA-seq analysis for label transfer. Here are the main reasons why this step is necessary:
Aligning Data Distributions: Centering the 3k dataset using the 10k dataset’s means ensures both datasets are aligned, reducing biases from differences in gene expression profiles.
Ensuring Consistency: If the datasets come from different conditions, centering standardizes them, making them more comparable.
Variance Representation: PCA is sensitive to the data’s mean; centering the 3k dataset with the 10k’s means ensures the variance is accurately captured by the principal components.
Improving Projection Accuracy: Proper centering improves projection accuracy and enhances the label transfer process by focusing on biological variation instead of technical noise.
dim(pbmc3k_scaled)
#> [1] 2700 11774
# subset the same genes for the scaled
pbmc3k_scaled<- pbmc3k_scaled[, rownames(pbmc10k_scaled)]
dim(pbmc3k_scaled)
#> [1] 2700 2068
Project the 3k cells onto the PCA space of 10K dataset
Now, we are using the \(U = XV\) formula. The expression matrix is the pbmc3k scaled matrix and the \(V\) matrix is from the 10k pbmc data.
library(ggplot2)
cell_embeddings_3k <- as.matrix(pbmc3k_scaled) %*% gene_loadings_10k
cell_embeddings_3k[1:5, 1:5]
#> PC1 PC2 PC3 PC4 PC5
#> AAACATACAACCAC 14.059153 3.5956611 -2.5969722 -0.9823900 1.446572
#> AAACATTGAGCTAC 10.120213 -10.7275385 3.5977444 -0.4405952 0.890632
#> AAACATTGATCAGC 12.226727 4.5484269 -3.9300445 -0.3233310 2.735518
#> AAACCGTGCTTCCG -2.538137 0.2757831 0.3029575 0.1371588 7.311531
#> AAACCGTGTATGCG 14.332570 6.5377602 9.7189625 -3.3478680 -2.428808
We can plot the 3K pmbc cells in the 10k pmbc PCA space:
all.equal(rownames(cell_embeddings_3k), rownames(pbmc3k@meta.data))
#> [1] TRUE
cbind(cell_embeddings_3k,pbmc3k@meta.data) %>%
ggplot(aes(x=PC1, y=PC2)) +
geom_point(aes(color = seurat_annotations)) +
theme_classic(base_size = 14)
# PCA space based on pbmc3k its own
DimPlot(pbmc3k, reduction = "pca", group.by = "seurat_annotations",
label = TRUE) +
NoLegend()
We see the cells roughly split into three major “islands”: B cells, myeloid cells (CD14/CD16+ monocytes) and the T cell/NK cells.
identification of the k nearest neighbors
Now that the 3k pbmc cell ebmeddings are projected to the 10k pbmc PCA space. We can find the k nearest neighbors in the 10k dataset to every cell in the 3k dataset.
We are using AnnoyAngular
which calculates the cosine distance in the PCA space.
# Use the Annoy algorithm to find nearest neighbors between 3k and 10k datasets
n_neighbors <- 30 # Number of nearest neighbors to find k =30
# Create Annoy index for 10k PBMC dataset
annoy_index <- new(AnnoyAngular, ncol(cell_embeddings_10k)) ##use cosine distance Angular
for (i in 1:nrow(cell_embeddings_10k)) {
annoy_index$addItem(i - 1, cell_embeddings_10k[i, ])
}
annoy_index$build(10) # Build the index with 10 trees
# Find nearest neighbors for each cell in 3k dataset
nn_indices <- t(sapply(1:nrow(cell_embeddings_3k), function(i) {
annoy_index$getNNsByVector(cell_embeddings_3k[i, ], n_neighbors)
}))
# nn_indices gives you the indices of nearest neighbors in the 10k PBMC dataset
# the rows are cells from 3k dataset, columns are the 30 nearest cells in the 10k dataset
dim(nn_indices)
#> [1] 2700 30
head(nn_indices)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] 8624 4993 275 8358 6192 3536 2828 457 7173 555 9058 5383 6643 6775
#> [2,] 8528 100 8576 6437 1351 4501 1822 4127 1407 3173 5457 5817 4931 46
#> [3,] 7950 754 5057 696 4787 3587 1578 3015 58 8249 2685 1371 8246 5740
#> [4,] 108 8617 4995 9206 8431 6035 6989 426 7892 3620 1215 6495 3897 6036
#> [5,] 452 8114 3868 5414 7884 1579 7633 4572 5017 3612 9159 8211 5712 1922
#> [6,] 4219 1815 6417 7895 8122 2080 6141 1256 3354 6037 6523 7299 7950 8519
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] 3558 8707 3529 1275 5868 4073 3811 1393 3855 5616 2381 7941
#> [2,] 6365 7079 7457 7123 8021 2107 2588 6516 7167 8056 5376 6102
#> [3,] 887 5788 5601 7588 1478 8934 6248 2030 7882 1239 2405 4230
#> [4,] 9132 3516 8077 9192 3507 5605 8443 6134 4658 7512 6104 1225
#> [5,] 2253 2091 1261 7797 4482 8450 7986 8537 8812 4113 1024 6570
#> [6,] 4808 949 8615 9422 696 5938 8860 8239 1319 2800 5870 4782
#> [,27] [,28] [,29] [,30]
#> [1,] 3688 2263 9398 7010
#> [2,] 7311 8774 973 1743
#> [3,] 8239 1484 4816 4560
#> [4,] 1241 6221 3313 4065
#> [5,] 8183 9221 3822 5012
#> [6,] 3566 6276 3166 1578
Label transfer based on the nearest neighbors
labels_10k<- as.character(pbmc10k$celltype)
# Transfer labels based on majority vote from nearest neighbors
transfer_labels <- apply(nn_indices, 1, function(neighbors) {
# Get labels for the nearest neighbors
neighbor_labels <- labels_10k[neighbors + 1] # Add 1 for R's 1-based index
# Return the most common label (majority vote)
most_common_label <- names(sort(table(neighbor_labels), decreasing = TRUE))[1]
return(most_common_label)
})
# Now, transfer_labels contains the predicted labels for the 3k PBMC dataset
head(transfer_labels)
#> [1] "CD8 Naive" "B cell progenitor" "CD4 Memory"
#> [4] "CD16+ Monocytes" "NK cell" "CD4 Memory"
pbmc3k$predicted<- transfer_labels
DimPlot(pbmc3k, reduction = "umap", group.by = "predicted", label = TRUE, repel=TRUE) +
NoLegend()
compare with Seurat’s wrapper
# Step 1: Find transfer anchors
anchors <- FindTransferAnchors(
reference = pbmc10k, # The reference dataset
query = pbmc3k, # The query dataset
dims = 1:100, # The dimensions to use for anchor finding
reduction = "pcaproject" # this is the default
)
# Step 2: Transfer labels
predictions <- TransferData(
anchors = anchors, # The anchors identified in the previous step
refdata = pbmc10k$celltype, # Assuming 'label' is the metadata containing the true labels in seurat_10k
dims = 1:30 # Dimensions to use for transferring
)
# Step 3: Add predictions to the query dataset
pbmc3k <- AddMetaData(pbmc3k, metadata = predictions)
# predicted.id is from Seurat's wrapper function, predicted is from our naive implementation
table(pbmc3k$predicted, pbmc3k$predicted.id)
#>
#> B cell progenitor CD14+ Monocytes CD16+ Monocytes
#> B cell progenitor 85 1 0
#> CD14+ Monocytes 0 266 0
#> CD16+ Monocytes 0 77 135
#> CD4 Memory 0 4 4
#> CD4 Naive 0 129 8
#> CD8 effector 0 1 0
#> CD8 Naive 0 2 2
#> Dendritic cell 0 10 0
#> Double negative T cell 0 0 0
#> NK cell 0 0 0
#> pDC 0 0 0
#> Platelets 0 1 0
#> pre-B cell 9 2 0
#>
#> CD4 Memory CD4 Naive CD8 effector CD8 Naive
#> B cell progenitor 0 0 0 0
#> CD14+ Monocytes 0 0 0 0
#> CD16+ Monocytes 0 0 0 0
#> CD4 Memory 417 6 27 0
#> CD4 Naive 120 523 13 19
#> CD8 effector 1 0 124 0
#> CD8 Naive 21 3 10 120
#> Dendritic cell 0 0 0 0
#> Double negative T cell 0 0 0 0
#> NK cell 0 0 4 0
#> pDC 0 0 0 0
#> Platelets 0 0 0 0
#> pre-B cell 0 0 1 0
#>
#> Dendritic cell Double negative T cell NK cell pDC
#> B cell progenitor 3 0 2 0
#> CD14+ Monocytes 0 0 0 0
#> CD16+ Monocytes 0 0 0 0
#> CD4 Memory 2 27 2 0
#> CD4 Naive 1 9 0 0
#> CD8 effector 0 1 10 0
#> CD8 Naive 0 2 0 0
#> Dendritic cell 24 0 0 0
#> Double negative T cell 0 62 0 0
#> NK cell 0 0 140 0
#> pDC 1 0 0 4
#> Platelets 0 0 0 0
#> pre-B cell 1 0 0 0
#>
#> Platelets pre-B cell
#> B cell progenitor 0 3
#> CD14+ Monocytes 0 0
#> CD16+ Monocytes 0 0
#> CD4 Memory 1 0
#> CD4 Naive 3 8
#> CD8 effector 0 0
#> CD8 Naive 1 0
#> Dendritic cell 0 0
#> Double negative T cell 0 0
#> NK cell 0 0
#> pDC 0 0
#> Platelets 8 0
#> pre-B cell 0 240
visualize in a heatmap
library(ComplexHeatmap)
table(pbmc3k$predicted, pbmc3k$predicted.id) %>%
as.matrix() %>%
scale() %>%
Heatmap(cluster_rows = FALSE, cluster_columns= FALSE, name= "scaled\ncell number")
Our native implementation of the k nearest neighbor label transferring is working decently well:)
Mutual nearest neighbors (MNN)
In Seurat, the mutual nearest neighbors (MNN) method is a key part of anchor identification during label transfer. Here’s a breakdown of what MNN does, how it differs from the PCA projection with k-nearest neighbors (kNN), and how labels are transferred for cells that are not mutual nearest neighbors.
- What is Mutual Nearest Neighbors (MNN) in Seurat?
Mutual nearest neighbors (MNN) is used to match cells from two datasets (query and reference) based on their proximity in the shared feature space (e.g., PCA space). In this context:
Mutual Nearest Neighbors: For a cell in dataset A, find its nearest neighbors in dataset B, and for a cell in dataset B, find its nearest neighbors in dataset A. If two cells are nearest neighbors of each other, they are considered mutual nearest neighbors (MNNs).
Anchor Identification: MNNs serve as anchors or points of correspondence between the two datasets. These anchors represent pairs of cells from the two datasets that have similar profiles, and they help align the datasets for further downstream tasks such as label transfer.
How is MNN Different from Your PCA Projection with kNN?
Mutual Nearest Neighbors (MNN):
MNN requires that the nearest neighbor relationship is mutual: a cell in the query dataset must be a nearest neighbor of a cell in the reference dataset and vice versa. MNN is designed to be more robust when integrating datasets, as it ensures bidirectional similarity between cells in the query and reference datasets. It captures correspondence between cells that truly resemble each other in both datasets, which is particularly important when datasets have batch effects or other technical differences.
- k-Nearest Neighbors (kNN) in PCA Projection:
In PCA projection with k-nearest neighbors, you project the query dataset into the reference dataset’s PCA space and then find the nearest neighbors in that space. The nearest neighbor relationship is one-sided: for each cell in the query dataset, you only find its nearest neighbors in the reference dataset.
This approach does not check if the reference dataset’s cells also treat the query dataset’s cells as nearest neighbors, which can introduce errors if the datasets are not perfectly aligned or suffer from batch effects.
Now, let’s find the nearest neighbors for each dataset separately:
library(RcppAnnoy)
# Number of nearest neighbors to find
n_neighbors <- 30
# Build an annoy index for the 10k dataset
#annoy_index_10k <- new(AnnoyEuclidean, ncol(cell_embeddings_10k))
annoy_index_10k <- new(AnnoyAngular, ncol(cell_embeddings_10k)) #use cosine distance instead
# Add each cell's PCA embeddings to the index
for (i in 1:nrow(cell_embeddings_10k)) {
annoy_index_10k$addItem(i - 1, cell_embeddings_10k[i, ]) # 0-based index for Annoy
}
# Build the index for fast nearest neighbor search
annoy_index_10k$build(10)
Find nearest neighbors in 10k for each cell in 3k
nn_10k_for_3k <- t(sapply(1:nrow(cell_embeddings_3k), function(i) {
annoy_index_10k$getNNsByVector(cell_embeddings_3k[i, ], n_neighbors)
}))
# Adjust for R's 1-based indexing
nn_10k_for_3k <- nn_10k_for_3k + 1 # convert to 1-based indexing for R
head(nn_10k_for_3k)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] 8625 4994 276 8359 6193 3537 2829 458 7174 556 9059 5384 6644 6776
#> [2,] 8529 101 8577 6438 1352 4502 1823 4128 1408 3174 5458 5818 4932 47
#> [3,] 7951 755 5058 697 4788 3588 1579 3016 59 8250 2686 1372 8247 5741
#> [4,] 109 8618 4996 9207 8432 6036 6990 427 7893 3621 1216 6496 3898 6037
#> [5,] 453 8115 3869 5415 7885 1580 7634 4573 5018 3613 9160 8212 5713 1923
#> [6,] 4220 1816 6418 7896 8123 2081 6142 1257 3355 6038 6524 7300 7951 8520
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] 3559 8708 3530 1276 5869 4074 3812 1394 3856 5617 2382 7942
#> [2,] 6366 7080 7458 7124 8022 2108 2589 6517 7168 8057 5377 6103
#> [3,] 888 5789 5602 7589 1479 8935 6249 2031 7883 1240 2406 4231
#> [4,] 9133 3517 8078 9193 3508 5606 8444 6135 4659 7513 6105 1226
#> [5,] 2254 2092 1262 7798 4483 8451 7987 8538 8813 4114 1025 6571
#> [6,] 4809 950 8616 9423 697 5939 8861 8240 1320 2801 5871 4783
#> [,27] [,28] [,29] [,30]
#> [1,] 3689 2264 9399 7011
#> [2,] 7312 8775 974 1744
#> [3,] 8240 1485 4817 4561
#> [4,] 1242 6222 3314 4066
#> [5,] 8184 9222 3823 5013
#> [6,] 3567 6277 3167 1579
Similarly, build an index for the 3k dataset
# annoy_index_3k <- new(AnnoyEuclidean, ncol(cell_embeddings_3k))
annoy_index_3k <- new(AnnoyAngular, ncol(cell_embeddings_3k))
for (i in 1:nrow(cell_embeddings_3k)) {
annoy_index_3k$addItem(i - 1, cell_embeddings_3k[i, ]) # 0-based index for Annoy
}
annoy_index_3k$build(10)
Find nearest neighbors in 3k for each cell in 10k
nn_3k_for_10k <- t(sapply(1:nrow(cell_embeddings_10k), function(i) {
annoy_index_3k$getNNsByVector(cell_embeddings_10k[i, ], n_neighbors)
}))
# Adjust for R's 1-based indexing
nn_3k_for_10k <- nn_3k_for_10k + 1 # convert to 1-based indexing for R
A key thing here is that RcppAnnoy
in C
is 0 based, and R is 1 based.
We need to add 1 to the index. Otherwise, I will get non-sense results!
Identify Mutual Nearest Neighbors (MNN)
Now identify Mutual Nearest Neighbors and include the transfer score:
labels_10k <- as.character(labels_10k)
# Create empty vectors to store the scores and labels
pbmc3k_transferred_labels <- rep(NA, nrow(cell_embeddings_3k))
pbmc3k_transfer_scores <- rep(0, nrow(cell_embeddings_3k))
# Loop through each cell in the 3k dataset to find the mutual nearest neighbors
for (i in 1:nrow(cell_embeddings_3k)) {
# Get nearest neighbors of the i-th 3k cell in 10k
nn_in_10k <- nn_10k_for_3k[i, ]
# Initialize count for mutual nearest neighbors
mutual_count <- 0
# Check mutual nearest neighbors
for (nn in nn_in_10k) {
# Check if i-th 3k cell is a nearest neighbor for the nn-th 10k cell
if (i %in% nn_3k_for_10k[nn, ]) { # Correct 1-based indexing
mutual_count <- mutual_count + 1
# Transfer the label from the 10k cell to the 3k cell
pbmc3k_transferred_labels[i] <- labels_10k[nn]
}
}
# Calculate the transfer score (mutual neighbor count / total neighbors)
pbmc3k_transfer_scores[i] <- mutual_count / n_neighbors
}
Handling Cells Without MNN:
Label Transfer for Non-Mutual Nearest Neighbors Seurat handles cells that are not mutual nearest neighbors (i.e., cells that do not have a direct anchor) in the following ways:
Weighting Anchors:
Seurat uses a weighted transfer system. Even if a query cell does not have a direct MNN, the label transfer process still considers the relationship between that query cell and the nearest anchors (the mutual nearest neighbors).
For cells that are not mutual nearest neighbors, their labels are predicted based on the similarity (distance) to the identified anchors. The influence of each anchor is weighted according to its distance from the query cell. Extrapolation for Non-MNN Cells:
For cells in the query dataset that don’t have mutual nearest neighbors, the labels can still be inferred based on their position relative to the MNN cells.
Seurat’s TransferData
function takes into account the proximity of these non-MNN cells to the anchor cells and extrapolates the labels based on the information from the anchor set. The cells closest to the MNN cells will receive a higher weight when transferring labels.
Prediction Confidence:
Seurat also provides prediction scores that indicate the confidence of the label transfer for each cell. Cells that do not have strong mutual nearest neighbors may receive lower confidence scores.
For cells without MNN, we can still assign labels based on the nearest neighbor from pbmc10k, but their transfer score will be lower (or zero). Our implementation just use the nearest neighbor in the 10k even they are not mutually nearest to simplify the demonstration.
# Fill in missing labels for cells without MNN based on nearest neighbor in 10k
for (i in 1:length(pbmc3k_transferred_labels)) {
if (is.na(pbmc3k_transferred_labels[i])) {
# Assign the label of the nearest 10k cell
nearest_10k_cell <- nn_10k_for_3k[i, 1] # First nearest neighbor
pbmc3k_transferred_labels[i] <- labels_10k[nearest_10k_cell]
# Assign a lower score for non-mutual neighbors
pbmc3k_transfer_scores[i] <- 0.01 # assign a small score like 0.01 for non-mutual
}
}
head(pbmc3k_transferred_labels)
#> [1] "CD4 Memory" "B cell progenitor" "CD4 Memory"
#> [4] "CD16+ Monocytes" "NK cell" "CD4 Memory"
head(pbmc3k_transfer_scores)
#> [1] 0.1000000 0.3666667 0.4000000 0.0100000 0.1000000 0.4333333
Let’s see how it looks like
# Add predictions to the query dataset
pbmc3k$pbmc3k_transferred_labels<- pbmc3k_transferred_labels
# predicted.id is from Seurat's wrapper function, pbmc3k_transferred_labels is from our naive MNN implementation
table(pbmc3k$pbmc3k_transferred_labels, pbmc3k$predicted.id)
#>
#> B cell progenitor CD14+ Monocytes CD16+ Monocytes
#> B cell progenitor 85 1 0
#> CD14+ Monocytes 0 296 0
#> CD16+ Monocytes 0 54 135
#> CD4 Memory 0 27 10
#> CD4 Naive 0 98 3
#> CD8 effector 0 1 0
#> CD8 Naive 0 4 1
#> Dendritic cell 0 5 0
#> Double negative T cell 0 0 0
#> NK cell 1 0 0
#> pDC 0 0 0
#> Platelets 0 3 0
#> pre-B cell 8 4 0
#>
#> CD4 Memory CD4 Naive CD8 effector CD8 Naive
#> B cell progenitor 0 1 0 0
#> CD14+ Monocytes 0 0 0 0
#> CD16+ Monocytes 0 0 0 0
#> CD4 Memory 507 92 29 9
#> CD4 Naive 41 431 3 12
#> CD8 effector 2 0 141 0
#> CD8 Naive 8 8 2 118
#> Dendritic cell 0 0 0 0
#> Double negative T cell 0 0 2 0
#> NK cell 0 0 1 0
#> pDC 0 0 0 0
#> Platelets 0 0 0 0
#> pre-B cell 1 0 1 0
#>
#> Dendritic cell Double negative T cell NK cell pDC
#> B cell progenitor 1 0 2 0
#> CD14+ Monocytes 0 0 0 0
#> CD16+ Monocytes 0 0 0 0
#> CD4 Memory 2 32 3 0
#> CD4 Naive 1 3 0 0
#> CD8 effector 0 9 28 0
#> CD8 Naive 0 2 0 0
#> Dendritic cell 24 0 0 0
#> Double negative T cell 0 55 0 0
#> NK cell 0 0 121 0
#> pDC 2 0 0 4
#> Platelets 0 0 0 0
#> pre-B cell 2 0 0 0
#>
#> Platelets pre-B cell
#> B cell progenitor 0 15
#> CD14+ Monocytes 0 0
#> CD16+ Monocytes 0 0
#> CD4 Memory 4 0
#> CD4 Naive 0 1
#> CD8 effector 0 0
#> CD8 Naive 0 0
#> Dendritic cell 0 0
#> Double negative T cell 0 0
#> NK cell 0 0
#> pDC 1 0
#> Platelets 8 0
#> pre-B cell 0 235
Visualize in a heatmap
table(pbmc3k$pbmc3k_transferred_labels, pbmc3k$predicted.id) %>%
as.matrix() %>%
scale() %>%
Heatmap(cluster_rows = FALSE, cluster_columns= FALSE, name= "scaled\ncell number")
Our native implementation of MNN is not exactly the same as Seurat. Seurat’s MNN implementation includes additional optimization like PC scaling and anchor filtering, but it is very good to see we get reasonable results!
In Seurat
(Tim Stuart et al 2019), the other way is to use Canonical Correlation Analysis (CCA) and we will leave it to a future post!
Final Note
I asked ChatGPT for help a lot, and it gave me a good starting point for the code. I had
to adapt the code when I had the errors. Anyway, it shows how powerful chatGPT
is and
you can use it as you study companion. Embrace AI or it will replace people who do not use it :)
Happy Learning!
Tommy