The T-cell receptor (TCR) is a special molecule found on the surface of a type of immune cell called a T-cell. Think of T-cells like soldiers in your body’s defense system that help identify and attack foreign invaders like viruses and bacteria.

The TCR is like a sensor or antenna that allows T-cells to recognize specific targets, kind of like how a key fits into a lock. When the TCR encounters a target it recognizes, it sends signals to the T-cell telling it to attack and destroy the invader.

T-cell receptor (TCR) sequencing data is a type of high-throughput sequencing data that provides valuable insights into the immune system's response to cancer. We can now even obtain single-cell TCR-seq data. In this blog post, we will build deep learning models that distinguish cancer samples from healthy controls using TCR-seq data.

This blog post is inspired by the publication *De novo prediction of cancer-associated T cell receptors for noninvasive cancer detection*. We will use the same training and testing data from https://github.com/s175573/DeepCAT.

For completeness, we will compare three different models:

- fully connected (dense) neural network
- 1D convolutional neural network (CNN)
- long short-term memory (LSTM) neural network

```
library(keras)
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
use_condaenv("r-reticulate")
# Set a random seed in R to make it more reproducible
set.seed(123)
# Set the seed for Keras/TensorFlow
tensorflow::set_random_seed(123)
```

Prepare the training and testing datasets.

```
normal_CDR3 <- read_tsv("/Users/tommytang/github_repos/DeepCAT/TrainingData/NormalCDR3.txt",
                        col_names = FALSE)

cancer_CDR3 <- read_tsv("/Users/tommytang/github_repos/DeepCAT/TrainingData/TumorCDR3.txt",
                        col_names = FALSE)

train_CDR3 <- rbind(normal_CDR3, cancer_CDR3)
train_label <- c(rep(0, nrow(normal_CDR3)), rep(1, nrow(cancer_CDR3))) %>% as.numeric()

normal_CDR3_test <- read_tsv("/Users/tommytang/github_repos/DeepCAT/TrainingData/NormalCDR3_test.txt",
                             col_names = FALSE)

cancer_CDR3_test <- read_tsv("/Users/tommytang/github_repos/DeepCAT/TrainingData/TumorCDR3_test.txt",
                             col_names = FALSE)

test_CDR3 <- rbind(normal_CDR3_test, cancer_CDR3_test)
test_label <- c(rep(0, nrow(normal_CDR3_test)), rep(1, nrow(cancer_CDR3_test)))

AAindex <- read.table("/Users/tommytang/github_repos/DeepCAT/AAidx_PCA.txt", header = TRUE)

## the 20 amino acids
total_aa <- rownames(AAindex)
total_aa
```

```
#> [1] "A" "R" "N" "D" "C" "E" "Q" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
#> [20] "V"
```
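Before fixing a length cutoff, it can help to look at the CDR3 length distribution. A minimal base-R sketch with made-up example sequences (in the real analysis you would use `train_CDR3$X1` instead of `cdr3_example`):

```
# hypothetical CDR3 sequences, for illustration only
cdr3_example <- c("CASSLKPNTEAFF", "CASSLGQAYEQYF", "CASRTGESNQPQHF", "CASSPDRGAFF")

# tabulate sequence lengths
len_table <- table(nchar(cdr3_example))
len_table

# fraction of sequences at least 12 amino acids long
mean(nchar(cdr3_example) >= 12)
```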

In the paper:

> Because CDR3s with different lengths usually form distinct loop structures to interact with the antigens (33, 34), we built five models each for lengths 12 through 16.

We will build a model with a `length_cutoff` of 12.

```
# one-hot encoding
encode_one_cdr3 <- function(cdr3, length_cutoff){
  # initialize a 20 x length_cutoff matrix of zeros
  res <- matrix(0, nrow = length(total_aa), ncol = length_cutoff)
  cdr3 <- unlist(str_split(cdr3, pattern = ""))
  # keep the first length_cutoff amino acids
  cdr3 <- cdr3[1:length_cutoff]
  row_index <- sapply(cdr3, function(x) which(total_aa == x)[1])
  col_index <- 1:length_cutoff
  res[as.matrix(cbind(row_index, col_index))] <- 1
  return(res)
}

cdr3 <- "CASSLKPNTEAFF"
encode_one_cdr3(cdr3, length_cutoff = 12)
```

```
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> [1,] 0 1 0 0 0 0 0 0 0 0 1 0
#> [2,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 1 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 1 0 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 0 0 0 1 0 0
#> [7,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [8,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [9,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [10,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [11,] 0 0 0 0 1 0 0 0 0 0 0 0
#> [12,] 0 0 0 0 0 1 0 0 0 0 0 0
#> [13,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [14,] 0 0 0 0 0 0 0 0 0 0 0 1
#> [15,] 0 0 0 0 0 0 1 0 0 0 0 0
#> [16,] 0 0 1 1 0 0 0 0 0 0 0 0
#> [17,] 0 0 0 0 0 0 0 0 1 0 0 0
#> [18,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [19,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [20,] 0 0 0 0 0 0 0 0 0 0 0 0
```
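Note that `encode_one_cdr3()` assumes every sequence has at least `length_cutoff` amino acids; for a shorter sequence, `cdr3[1:length_cutoff]` produces `NA`s and the matrix assignment fails. A self-contained sketch of a zero-padding variant (`encode_one_cdr3_safe` is my name, not from the paper; base R only, with `total_aa` spelled out so the block runs on its own):

```
# the 20 standard amino acids, in the same order as rownames(AAindex)
total_aa <- c("A","R","N","D","C","E","Q","G","H","I",
              "L","K","M","F","P","S","T","W","Y","V")

encode_one_cdr3_safe <- function(cdr3, length_cutoff) {
  res <- matrix(0, nrow = length(total_aa), ncol = length_cutoff)
  aa <- strsplit(cdr3, "")[[1]]
  # truncate long sequences; leave trailing columns of short ones all-zero
  n <- min(length(aa), length_cutoff)
  row_index <- match(aa[seq_len(n)], total_aa)
  res[cbind(row_index, seq_len(n))] <- 1
  res
}

m <- encode_one_cdr3_safe("CASSF", length_cutoff = 12)
dim(m)      # 20 x 12
colSums(m)  # 1 for the first 5 positions, 0 afterwards
```

For full-length sequences this encoding matches `encode_one_cdr3()` above.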

### fully connected (dense) neural network

```
length_cutoff <- 12

# encode all the CDR3 sequences
train_data <- purrr::map(train_CDR3$X1,
                         ~encode_one_cdr3(.x, length_cutoff = length_cutoff))

# flatten each 20 x length_cutoff matrix into a single vector of length
# 20 * length_cutoff, giving a 2D matrix of shape (n_sequences, 20 * length_cutoff)
train_data <- array_reshape(unlist(train_data),
                            dim = c(length(train_data), 20 * length_cutoff))
dim(train_data)
```

`#> [1] 70000 240`

`train_data[1,]`

```
#> [1] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [75] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [112] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
#> [186] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [223] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
```

Do the same for the test data:

```
test_data <- purrr::map(test_CDR3$X1,
                        ~encode_one_cdr3(.x, length_cutoff = length_cutoff))

test_data <- array_reshape(unlist(test_data),
                           dim = c(length(test_data), 20 * length_cutoff))
```

Build the model:

```
y_train <- as.numeric(train_label)
y_test <- as.numeric(test_label)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu",
              input_shape = c(20 * length_cutoff)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)
```

```
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> dense_2 (Dense) (None, 32) 7712
#> ________________________________________________________________________________
#> dense_1 (Dense) (None, 32) 1056
#> ________________________________________________________________________________
#> dense (Dense) (None, 1) 33
#> ================================================================================
#> Total params: 8,801
#> Trainable params: 8,801
#> Non-trainable params: 0
#> ________________________________________________________________________________
```

```
model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
```

If one uses too many epochs, the model will overfit. We will set aside 35,000 CDR3 sequences (half of the training data) for validation.

It is critical to shuffle the data with the `sample()` function so that the training and validation sets are a random mix of both classes.
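The split itself is just random index sampling; a tiny base-R sketch of the same pattern on toy data (the sizes here are made up):

```
set.seed(123)
n <- 10
x <- matrix(rnorm(n * 3), nrow = n)  # toy feature matrix
y <- rbinom(n, 1, 0.5)               # toy labels

val_idx <- sample(n, 4)              # hold out 4 random rows for validation
x_val   <- x[val_idx, , drop = FALSE]
x_train <- x[-val_idx, , drop = FALSE]
y_val   <- y[val_idx]
y_tr    <- y[-val_idx]

nrow(x_val)    # 4
nrow(x_train)  # 6
```

Note `drop = FALSE`, which keeps the result a matrix even if a single row were selected.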

```
set.seed(123)
val_indices <- sample(nrow(train_data), 35000)

x_val <- train_data[val_indices, ]
partial_x_train <- train_data[-val_indices, ]

y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)

plot(history)
```

Final training

```
model %>% fit(train_data, y_train, epochs = 20, batch_size = 512)
results <- model %>% evaluate(test_data, y_test)
results
```

```
#> loss accuracy
#> 0.4169728 0.8061706
```

81% accuracy. Not bad!

### precision-recall

Read this blog post by David: *Are you in genomics and building models? Stop using ROC - use PR*.

`predict(model, test_data[1:20, ])`

```
#> [,1]
#> [1,] 0.73146832
#> [2,] 0.17803338
#> [3,] 0.13752139
#> [4,] 0.02351084
#> [5,] 0.05476111
#> [6,] 0.12303543
#> [7,] 0.16548795
#> [8,] 0.04623273
#> [9,] 0.48983151
#> [10,] 0.66229594
#> [11,] 0.05498785
#> [12,] 0.27218163
#> [13,] 0.31630093
#> [14,] 0.31763732
#> [15,] 0.01344728
#> [16,] 0.04035982
#> [17,] 0.10508105
#> [18,] 0.18193811
#> [19,] 0.15098864
#> [20,] 0.16778088
```

```
res <- predict(model, test_data) %>%
  dplyr::bind_cols(y_test)

colnames(res) <- c("estimate", "truth")
res$truth <- as.factor(res$truth)
```

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR), at various threshold settings.

The ROC AUC is a bit inflated compared with the precision-recall AUC below.
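As a concrete reminder of what the two curves measure, here is a base-R sketch computing TPR, FPR, and precision at a single threshold on made-up labels and scores:

```
truth <- c(1, 1, 1, 0, 0, 0, 0, 0)                    # toy labels
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05)   # toy predicted probabilities

pred <- as.integer(score >= 0.5)     # classify at a 0.5 threshold

TP <- sum(pred == 1 & truth == 1)    # 2
FP <- sum(pred == 1 & truth == 0)    # 1
FN <- sum(pred == 0 & truth == 1)    # 1
TN <- sum(pred == 0 & truth == 0)    # 4

TPR <- TP / (TP + FN)        # recall = 2/3; y-axis of the ROC curve
FPR <- FP / (FP + TN)        # 1/5; x-axis of the ROC curve
precision <- TP / (TP + FP)  # 2/3; used by the PR curve, sensitive to class imbalance
```

The ROC curve sweeps the threshold and plots TPR against FPR; the PR curve plots precision against recall.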

`yardstick::roc_auc(res, truth, estimate, event_level = "second")`

```
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 roc_auc binary 0.889
```

```
yardstick::roc_curve(res, truth, estimate, event_level = "second") %>%
  autoplot() +
  ggtitle("ROC AUC")
```

precision-recall AUC:

`yardstick::pr_auc(res, truth, estimate, event_level = "second")`

```
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 pr_auc binary 0.818
```

```
yardstick::pr_curve(res, truth, estimate, event_level = "second") %>%
  autoplot() +
  ggtitle("precision-recall AUC")
```

### use conv1d

Note how I transformed the tensor to the right shape.

For more on tensor reshaping in R, read my previous blog post: Basic tensor/array manipulations in R
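The key gotcha is that `array_reshape()` fills in row-major (C) order, while base R's `c()` and `dim<-` are column-major; that is why each matrix is transposed with `t()` before flattening. A tiny base-R illustration of the two orders:

```
x <- 1:6

# column-major (base R default): fills down the columns
matrix(x, nrow = 2)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

# row-major (what array_reshape()/NumPy do): fills across the rows
matrix(x, nrow = 2, byrow = TRUE)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
```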

```
train_data <- purrr::map(train_CDR3$X1,
                         ~encode_one_cdr3(.x, length_cutoff = length_cutoff))

# transpose each matrix before flattening so that array_reshape (row-major)
# reassembles it correctly into shape (n_sequences, 20, length_cutoff)
train_data2 <- array_reshape(lapply(train_data, function(x) x %>% t() %>% c()) %>% unlist(),
                             dim = c(length(train_data), 20, length_cutoff))

# sanity check that the data are the same after reshaping
all.equal(train_data[[1]], train_data2[1,,])
```

`#> [1] TRUE`

```
test_data <- purrr::map(test_CDR3$X1,
                        ~encode_one_cdr3(.x, length_cutoff = length_cutoff))

test_data2 <- array_reshape(lapply(test_data, function(x) x %>% t() %>% c()) %>% unlist(),
                            dim = c(length(test_data), 20, length_cutoff))
```

```
y_train <- as.numeric(train_label)
y_test <- as.numeric(test_label)

model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 32, kernel_size = 4, activation = "relu",
                input_shape = c(20, length_cutoff)) %>%
  layer_max_pooling_1d(pool_size = 4) %>%
  layer_flatten() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

summary(model)
```

```
#> Model: "sequential_1"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> conv1d (Conv1D) (None, 17, 32) 1568
#> ________________________________________________________________________________
#> max_pooling1d (MaxPooling1D) (None, 4, 32) 0
#> ________________________________________________________________________________
#> flatten (Flatten) (None, 128) 0
#> ________________________________________________________________________________
#> dense_4 (Dense) (None, 16) 2064
#> ________________________________________________________________________________
#> dense_3 (Dense) (None, 1) 17
#> ================================================================================
#> Total params: 3,649
#> Trainable params: 3,649
#> Non-trainable params: 0
#> ________________________________________________________________________________
```

```
set.seed(123)

# sample half (35,000) of the sequences for the validation set
val_indices <- sample(dim(train_data2)[1], 35000)

x_val <- train_data2[val_indices,,] %>%
  array_reshape(dim = c(35000, 20, length_cutoff))
partial_x_train <- train_data2[-val_indices,,] %>%
  array_reshape(dim = c(35000, 20, length_cutoff))

y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)

plot(history)
```

```
model %>% fit(train_data2, y_train, epochs = 20, batch_size = 512)
results <- model %>% evaluate(test_data2, y_test)
results
```

```
#> loss accuracy
#> 0.4576261 0.7857693
```

We get a ~78% accuracy. It did not beat the fully connected dense neural network.

### use long short-term memory

```
model_lstm <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(20, length_cutoff)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model_lstm %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

summary(model_lstm)
```

```
#> Model: "sequential_2"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> lstm (LSTM) (None, 32) 5760
#> ________________________________________________________________________________
#> dense_6 (Dense) (None, 16) 528
#> ________________________________________________________________________________
#> dense_5 (Dense) (None, 1) 17
#> ================================================================================
#> Total params: 6,305
#> Trainable params: 6,305
#> Non-trainable params: 0
#> ________________________________________________________________________________
```

```
history <- model_lstm %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)

plot(history)
```

```
model_lstm %>% fit(train_data2, y_train, epochs = 19, batch_size = 512)
results <- model_lstm %>% evaluate(test_data2, y_test)
results
```

```
#> loss accuracy
#> 0.5177664 0.7456701
```

accuracy of 75%!

I also tried the CNN architecture from the original paper, and its performance was even worse. Of course, the original model has hyperparameters such as the learning rate and dropout rate that I did not experiment with.

### Conclusions

Understanding the biology is important. How should we represent the CDR3 sequences? We can use one-hot encoding, or we can use the BLOSUM62 matrix to incorporate amino acid properties. Further reading: the blog post *How to represent protein sequence*.
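As a sketch of the idea (the two-feature property matrix below is made up; a real analysis would substitute BLOSUM62 columns or the `AAidx_PCA.txt` principal components loaded earlier), each amino acid is mapped to a numeric feature vector instead of a one-hot column:

```
# hypothetical per-amino-acid properties (made-up numbers, 2 features per residue)
aa_props <- rbind(
  A = c(0.31, -0.2),
  C = c(0.54,  0.1),
  S = c(-0.05, 0.3),
  F = c(1.19, -0.4)
)

# encode a CDR3 as an (n_features x length) matrix by looking up each residue
encode_props <- function(cdr3, props) {
  aa <- strsplit(cdr3, "")[[1]]
  t(props[aa, , drop = FALSE])
}

m <- encode_props("CASSF", aa_props)
dim(m)  # 2 x 5: 2 properties, 5 residues
```

This denser representation lets the model exploit similarities between amino acids that one-hot encoding treats as unrelated.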

Tensor reshaping is one of the more difficult parts of building a deep learning model. For more on tensor reshaping in R, read my previous blog post: *Basic tensor/array manipulations in R*.

Always try a simple model first. Of course, one can fine-tune the hyperparameters to make a more complicated model perform as well or better, but that takes time and more compute.

Watch out for data leakage. The TCR sequences in the test set may have been seen in the training set. Some papers propose approaches based on Levenshtein similarity and Hobohm-style redundancy reduction. Further reading: *NetTCR-2.1: Lessons and guidance on how to develop models for TCR specificity predictions*.
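A minimal sketch of such a check with base R's `adist()` (Levenshtein edit distance): flag test sequences within a small edit distance of any training sequence. The distance cutoff of 1 and the sequences are arbitrary choices for illustration:

```
train_seqs <- c("CASSLKPNTEAFF", "CASSLGQAYEQYF")
test_seqs  <- c("CASSLKPNTEAFW",   # 1 edit away from a training sequence
                "CASRTGESNQPQHF")  # dissimilar from both

# pairwise Levenshtein distances: rows = test, cols = train
d <- adist(test_seqs, train_seqs)

# flag potential leakage: some training sequence within 1 edit
leaky <- apply(d, 1, min) <= 1
leaky
#> [1]  TRUE FALSE
```

Flagged sequences could then be removed from the test set before evaluation.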

In this example, we only used the most variable region, the CDR3, as input. Including V, D, and J gene usage and HLA typing information may further boost the accuracy.

Different models are good at different tasks. For example, LSTM does not outperform the fully connected approach in sentiment analysis, because the frequencies of positive and negative words are already good predictors; the additional context an LSTM captures does not help much. See my previous post *Long Short-term memory (LSTM) Recurrent Neural Network (RNN) to classify movie reviews*.

The architecture of a neural network is an art: how many layers should we use, and how many neurons in each layer? When the performance is good, we do not think too much about it; when the performance is bad, we need to tune those choices.