Understand word embedding and use deep learning to classify movie reviews

Picture this: a computer that can actually grasp the emotions hidden in movie reviews – sensing whether they’re shouting with joy or grumbling in disappointment. This mind-bending capability comes from two incredible technologies: deep learning and word embedding. But don’t worry if these sound like jargon; I am here to unravel the mystery.

Think of deep learning as a supercharged brain for computers. Just like we learn from experience, computers learn from data. Word embedding, on the other hand, is like converting words into a language computers understand – numbers! It’s like teaching your dog to respond to hand signals instead of just words.

In this blog post, I am your guide on this adventure. I will show you how deep learning and word embedding join forces to train computers in deciphering the tones of movie reviews. We’re talking about those moments when the computer understands that a review saying “This movie was a rollercoaster of emotions” is positive, not about a literal amusement park.

So, fasten your seatbelts. We’re about to journey into the realms of AI, exploring how machines learn, adapt, and finally crack the code of sentiments tucked within movie reviews.

Load the libraries.

library(keras)
library(reticulate)
library(ggplot2)
use_condaenv("r-reticulate")

Computers do not understand text. Let’s turn it into numeric vectors.

samples<- c("The cat sat on the mat.", "The dog ate my homework.")

tokenizer<- text_tokenizer(num_words = 1000) %>%
  fit_text_tokenizer(samples)

sequences<- texts_to_sequences(tokenizer, samples)

one_hot_results<- texts_to_matrix(tokenizer, samples, mode= "binary")

one_hot_results[1:2, 1:20]
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]    0    1    1    1    1    1    0    0    0     0     0     0     0     0
#> [2,]    0    1    0    0    0    0    1    1    1     1     0     0     0     0
#>      [,15] [,16] [,17] [,18] [,19] [,20]
#> [1,]     0     0     0     0     0     0
#> [2,]     0     0     0     0     0     0
word_index<- tokenizer$word_index
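
As a quick sanity check, we can map the integer sequences back to the original words. The exact indices depend on how the tokenizer was fitted (words are indexed by frequency), so treat this as a small illustrative sketch.

# the two toy sentences encoded as integer sequences
sequences

# invert the word index so that position i holds the word with index i
index_to_word<- names(word_index)[order(unlist(word_index))]

# reconstruct the first toy sentence from its integer codes
index_to_word[sequences[[1]]]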

Let’s use the IMDB movie-review sentiment-prediction dataset for demonstration.

The dataset contains 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer “3” encodes the 3rd most frequent word in the data.

Download the data from http://mng.bz/0tIo and unzip it.
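
As a side note, keras also ships the same reviews in pre-tokenized form via dataset_imdb(). The sketch below grabs the already-encoded version; in this post we work from the raw text files instead, so that we can run the tokenizer ourselves.

# optional shortcut: the IMDB reviews already encoded as integer sequences
imdb<- dataset_imdb(num_words = 10000)
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% imdb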

Read the reviews (text files) into R for the training set.

imdb_dir<- "~/blog_data/aclImdb"
train_dir<- file.path(imdb_dir, "train")
labels<- c()
texts<- c()

for (label_type in c("neg", "pos")){
  label<- switch(label_type, neg = 0, pos = 1)
  dir_name<- file.path(train_dir, label_type)
  for (fname in list.files(dir_name, pattern = glob2rx("*.txt"),
                           full.names = TRUE)){
    texts<- c(texts, readChar(fname, file.info(fname)$size))
    labels<- c(labels, label)
  }
}


length(labels)
#> [1] 25000
length(texts)
#> [1] 25000
# the first review 
texts[1]
#> [1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

Tokenize the data

maxlen<- 100
max_words<- 10000

tokenizer<- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(texts)

tokenizer$num_words
#> [1] 10000
tokenizer$word_index[1:5]
#> $the
#> [1] 1
#> 
#> $and
#> [1] 2
#> 
#> $a
#> [1] 3
#> 
#> $of
#> [1] 4
#> 
#> $to
#> [1] 5
word_index<- tokenizer$word_index

sequences<- texts_to_sequences(tokenizer, texts)

## first review turned into integers
sequences[[1]]
#>   [1]   62    4    3  129   34   44 7576 1414   15    3 4252  514   43   16    3
#>  [16]  633  133   12    6    3 1301  459    4 1751  209    3 7693  308    6  676
#>  [31]   80   32 2137 1110 3008   31    1  929    4   42 5120  469    9 2665 1751
#>  [46]    1  223   55   16   54  828 1318  847  228    9   40   96  122 1484   57
#>  [61]  145   36    1  996  141   27  676  122    1  411   59   94 2278  303  772
#>  [76]    5    3  837   20    3 1755  646   42  125   71   22  235  101   16   46
#>  [91]   49  624   31  702   84  702  378 3493    2 8422   67   27  107 3348
x_train<- pad_sequences(sequences, maxlen = maxlen)

## it becomes a 2D matrix of samples x maxlen
dim(x_train)
#> [1] 25000   100
y_train<- as.array(labels)
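
Note that pad_sequences() keeps exactly maxlen integers per review: with the defaults padding = "pre" and truncating = "pre", shorter reviews are padded with zeros at the front and longer ones keep only their last 100 words. A quick check on the first review (a small sketch):

# the first review has more than 100 tokens...
length(sequences[[1]])

# ...so only its last 100 tokens survive the truncation
all(x_train[1, ] == tail(sequences[[1]], maxlen))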

Do the same thing for the test dataset

test_dir<- file.path(imdb_dir, "test")
labels<- c()
texts<- c()

for (label_type in c("neg", "pos")){
  label<- switch(label_type, neg = 0, pos = 1)
  dir_name<- file.path(test_dir, label_type)
  for (fname in list.files(dir_name, pattern = glob2rx("*.txt"), 
                           full.names = TRUE)){
    texts<- c(texts, readChar(fname, file.info(fname)$size))
    labels<- c(labels, label)
  }
}

sequences<- texts_to_sequences(tokenizer, texts)
x_test<- pad_sequences(sequences, maxlen = maxlen)
y_test<- as.array(labels)

Build the model

embedding_dim<- 100

# in the embedding weights matrix, index 0 is not assigned to any word or token; it is reserved as a placeholder (used for padding). That's why we use tokenizer$num_words (10000) + 1 as input_dim

model<- keras_model_sequential() %>%
  layer_embedding(input_dim = tokenizer$num_words + 1, output_dim = embedding_dim, 
                  input_length = maxlen) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid") # for the binary classification

Compile the model

model %>%
  compile(
    optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metric = c("acc")
  ) 
  
summary(model)
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> embedding (Embedding)               (None, 100, 100)                1000100     
#> ________________________________________________________________________________
#> flatten (Flatten)                   (None, 10000)                   0           
#> ________________________________________________________________________________
#> dense_1 (Dense)                     (None, 32)                      320032      
#> ________________________________________________________________________________
#> dense (Dense)                       (None, 1)                       33          
#> ================================================================================
#> Total params: 1,320,165
#> Trainable params: 1,320,165
#> Non-trainable params: 0
#> ________________________________________________________________________________

The input is a 2D tensor of shape 25,000 (samples) x 100 (maxlen). The embedding layer turns it into a 3D tensor of shape 25,000 (samples) x 100 (sequence length) x 100 (embedding_dim).

Each sample is then flattened to a vector of length 100 x 100 = 10,000, connected to a dense layer of 32 units, and finally to a dense layer with a single unit and a sigmoid activation for the binary prediction.
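
The parameter counts in the summary can be checked by hand with the usual inputs x units + biases formula; the quick arithmetic below is just that sanity check.

# embedding layer: one 100-dim vector per row (10,000 words + 1 placeholder)
(max_words + 1) * embedding_dim    # 1,000,100

# first dense layer: 10,000 flattened inputs x 32 units + 32 biases
maxlen * embedding_dim * 32 + 32   # 320,032

# output layer: 32 inputs x 1 unit + 1 bias
32 * 1 + 1                         # 33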

Train the model

history<- model %>%
  fit(
    x_train, y_train,
    epochs = 10,
    batch_size = 32,
    validation_split = 0.2
  )

plot(history)
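
The validation curves typically show the model starting to overfit after a handful of epochs, which is why the final training run below uses only 5 epochs. We can read the best epoch off the history object directly (a minimal sketch; depending on the keras version the metric may be stored as val_acc or val_accuracy).

# per-epoch training/validation metrics recorded by fit()
str(history$metrics)

# epoch with the highest validation accuracy
# (the metric may be named "val_acc" or "val_accuracy" depending on the keras version)
which.max(history$metrics$val_acc)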

Final training

model %>%
        fit(x_train, y_train, epochs = 5, batch_size = 32)

Accuracy on the test set

metrics<- model %>% 
  evaluate(x_test, y_test)

metrics
#>     loss      acc 
#> 2.314948 0.790920

~80% accuracy! Not bad.
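
To see the model in action on individual reviews, we can ask for the predicted probability of the positive class. A minimal sketch:

# predicted probability that each of the first two test reviews is positive
model %>% predict(x_test[1:2, , drop = FALSE])

# the corresponding true labels (0 = negative, 1 = positive)
y_test[1:2]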

Understand the embedding

We can get the weights of the embedding layer:

# Get the weights of the embedding layer
embedding_layer <- model$layers[[1]]  # Assuming the embedding layer is the first layer
embedding_weights <- embedding_layer$get_weights()[[1]]

dim(embedding_weights)
#> [1] 10001   100
embedding_weights[1:5, 1:20]
#>              [,1]         [,2]         [,3]        [,4]         [,5]
#> [1,] -0.038108733  0.034164231 -0.010926410 -0.01945905  0.006259503
#> [2,]  0.006445600 -0.005163373  0.077650383  0.02167805 -0.063152276
#> [3,]  0.054408096 -0.025189560  0.059026342  0.05764490 -0.099164285
#> [4,]  0.043150455  0.041099135  0.004985804  0.01686079 -0.062039867
#> [5,] -0.004684128  0.046648771 -0.008564522  0.04492320 -0.034164682
#>              [,6]         [,7]         [,8]        [,9]        [,10]
#> [1,] -0.007477300  0.008997128  0.022076523  0.01617429 -0.011652625
#> [2,] -0.017004006 -0.060964763 -0.006172073 -0.04898103  0.008079287
#> [3,] -0.002061349 -0.079386219  0.038137231 -0.05804368  0.037487414
#> [4,] -0.041052923 -0.008180849 -0.002805086 -0.01827856  0.018388636
#> [5,]  0.016365541 -0.076942243 -0.059156016 -0.07605161  0.004471917
#>             [,11]        [,12]        [,13]       [,14]       [,15]       [,16]
#> [1,]  0.006929873 -0.001043276 -0.007549608 -0.01186891  0.00502130 0.002877086
#> [2,]  0.006907295  0.058686450 -0.020892052 -0.02132107 -0.01123977 0.040830452
#> [3,] -0.007843471 -0.038126167  0.003755015  0.02115629  0.03360209 0.039373539
#> [4,] -0.047682714 -0.017012399 -0.029800395  0.01147485  0.02615152 0.106722914
#> [5,] -0.054134578  0.002277025  0.015656594 -0.02716585  0.01887219 0.063780047
#>            [,17]       [,18]        [,19]       [,20]
#> [1,] -0.01611450  0.03136891 0.0213537812  0.01389796
#> [2,] -0.01134415 -0.01135702 0.0154857207  0.03979484
#> [3,] -0.01122028 -0.04163784 0.0848641545  0.06389144
#> [4,] -0.08052496 -0.06622744 0.0001179522 -0.02714739
#> [5,]  0.04176097  0.04005887 0.0083893426  0.04704465

The embedding weights matrix has dimension 10,001 (10,000 max_words + 1 placeholder) x 100 (embedding_dim).

Add the words as row names to the embedding matrix.

words <- data.frame(
  word = names(tokenizer$word_index), 
  id = as.integer(unlist(tokenizer$word_index))
)

words <- words %>%
  dplyr::filter(id <= tokenizer$num_words) %>%
  dplyr::arrange(id)

rownames(embedding_weights)<- c("UNKNOWN", words$word)

We can now find words that are close to each other in the embedding. We will use the cosine similarity:

library(text2vec)

find_similar_words <- function(word, embedding_matrix, n = 5) {
  similarities <- embedding_matrix[word, , drop = FALSE] %>%
    sim2(embedding_matrix, y = ., method = "cosine")
  
  similarities[,1] %>% sort(decreasing = TRUE) %>% head(n)
}


find_similar_words("bad", embedding_weights)
#>       bad     worst     awful     waste    sucked 
#> 1.0000000 0.8764589 0.8717442 0.8703569 0.8615548
find_similar_words("wonderful", embedding_weights)
#>   wonderful   excellent        rare     perfect excellently 
#>   1.0000000   0.7535562   0.7349462   0.7308548   0.7235736
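
Under the hood, sim2() is just computing cosine similarity: the dot product of two embedding vectors after scaling each one to unit length. A minimal base-R sketch of the same computation, without text2vec:

# cosine similarity between two word vectors, written out in base R
cosine_sim<- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_sim(embedding_weights["bad", ], embedding_weights["worst", ])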

We can plot the embeddings in 2-D after t-SNE or UMAP dimensionality reduction (just like in single-cell data analysis):

# Perform t-SNE dimensionality reduction
set.seed(123)
tsne_embeddings <- Rtsne::Rtsne(embedding_weights)

# Create a data frame for visualization
tsne_df <- data.frame(
  x = tsne_embeddings$Y[, 1],
  y = tsne_embeddings$Y[, 2],
  word = rownames(embedding_weights)
)

Plot the t-SNE visualization

words_to_plot<- c("good", "fantastic", "cool", "wonderful", "nice", "best", "brilliant", "amazing", "bad", "horrible","nasty", "poor", "awful")

ggplot(tsne_df, aes(x, y)) +
  geom_point(size = 0.2, alpha = 0.5) +
  geom_point(data = tsne_df %>% 
               dplyr::filter(word %in% words_to_plot), 
             color = "red") +
  ggrepel::geom_label_repel(data = tsne_df %>% 
                              dplyr::filter(word %in% words_to_plot), 
                            aes(label = word ), max.overlaps = 1000) +
  theme_minimal(base_size = 13) +
  labs(title = "t-SNE Visualization of Word Embeddings") 

We do see that the positive words and the negative words each cluster in their own region. That’s cool!

We can also use a pre-trained word-embedding matrix (e.g., word2vec or GloVe) when the training data are very small. Here we have 25,000 training samples, which is plenty, but if we only had, say, 200 reviews to train on, initializing the embedding layer with pre-trained vectors could be beneficial.
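
For reference, plugging in pre-trained GloVe vectors roughly amounts to parsing the GloVe text file into a matrix aligned with our tokenizer’s word index, loading it into the embedding layer, and freezing that layer. The sketch below assumes a local copy of glove.6B.100d.txt (the path is hypothetical) and is only meant to outline the idea:

# a sketch: build an embedding matrix from GloVe vectors aligned to our word index
# (assumes glove.6B.100d.txt has been downloaded; the path below is hypothetical)
glove_lines<- readLines("~/blog_data/glove.6B.100d.txt")

embedding_matrix<- matrix(0, nrow = max_words + 1, ncol = embedding_dim)
for (line in glove_lines){
  parts<- strsplit(line, " ", fixed = TRUE)[[1]]
  idx<- word_index[[parts[1]]]
  # row 1 stays as the placeholder; the word with index i goes to row i + 1
  if (!is.null(idx) && idx <= max_words){
    embedding_matrix[idx + 1, ]<- as.numeric(parts[-1])
  }
}

# load the pre-trained weights into the embedding layer and freeze it
get_layer(model, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()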

In my next blog post, I will try to implement a long short-term memory (LSTM) recurrent neural network (RNN), which takes the context of the words into account to better classify the reviews.
