Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to classify movie reviews

A major characteristic of all the neural networks I have used so far, such as densely connected networks and convnets (CNNs) (see my previous post), is that they have no memory. Each input shown to them is processed independently, with no state kept between inputs. In other words, they do not take into account the context of a word (the words around it).

Imagine you’re reading a book, and you want to understand the story by keeping track of what’s happening in the plot. Your brain naturally remembers information from the beginning of the book even as you read through new chapters. It’s like having a special memory that can remember important details from the past.

Long short-term memory (LSTM) is like that special memory for computers when they’re working with text or sequences of data. In regular computer programs, information can easily get lost as the program processes new data. But LSTM is designed to remember important stuff from the past, just like your brain when reading a book.

I highly recommend Josh Starmer’s video on Long short-term memory to understand the math behind it.
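For reference, the standard LSTM cell update (textbook notation, not anything specific to Keras or to this post) combines the current input $x_t$ with the previous hidden state $h_{t-1}$ through three gates that decide what to forget, what to write into memory, and what to output:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{long-term (cell) state} \\
h_t &= o_t \odot \tanh(c_t) && \text{short-term (hidden) output}
\end{aligned}
$$

Here $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication; the cell state $c_t$ is the "special memory" carried along the sequence.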

Load the libraries.

library(keras)
library(reticulate)
library(ggplot2)
use_condaenv("r-reticulate")

# Set a random seed in R to make it more reproducible 
set.seed(123)

# Set the seed for Keras/TensorFlow
tensorflow::set_random_seed(123)

Let’s use the IMDB movie-review sentiment-prediction dataset for demonstration again.

It is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews need to be preprocessed: each review will be encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. Read my previous post on word embedding.
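To make "encoded as a sequence of word indexes" concrete, here is a tiny made-up example (my illustration, not part of the dataset) using the same text_tokenizer() workflow we will apply to the real reviews below:

# three toy "reviews", invented just for illustration
toy_texts <- c("the movie was great",
               "the movie was not great",
               "great acting terrible plot")

toy_tokenizer <- text_tokenizer(num_words = 20) %>%
  fit_text_tokenizer(toy_texts)

# word_index maps each word to an integer; more frequent words get smaller indices
unlist(toy_tokenizer$word_index)

# each toy review becomes a sequence of those integers
texts_to_sequences(toy_tokenizer, toy_texts)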

Download the data from http://mng.bz/0tIo and unzip it.

Read the reviews (text files) into R for the training set.

imdb_dir<- "~/blog_data/aclImdb"
train_dir<- file.path(imdb_dir, "train")
labels<- c()
texts<- c()

for (label_type in c("neg", "pos")){
  label<- switch(label_type, neg = 0, pos = 1)
  dir_name<- file.path(train_dir, label_type)
  for (fname in list.files(dir_name, pattern = glob2rx("*.txt"),
                           full.names = TRUE)){
    texts<- c(texts, readChar(fname, file.info(fname)$size))
    labels<- c(labels, label)
  }
}
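Growing texts and labels with c() inside a double loop works, but it copies the vectors on every iteration, which is slow for 25,000 files. If you prefer, a vectorized sketch along these lines (my alternative, same result and ordering) avoids the repeated copying:

# alternative: list the files per class first, then read them in one pass
read_reviews <- function(dir, label){
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  list(texts  = vapply(files, function(f) readChar(f, file.info(f)$size), character(1)),
       labels = rep(label, length(files)))
}

neg <- read_reviews(file.path(train_dir, "neg"), 0)
pos <- read_reviews(file.path(train_dir, "pos"), 1)

texts  <- c(neg$texts, pos$texts)
labels <- c(neg$labels, pos$labels)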


length(labels)
#> [1] 25000
length(texts)
#> [1] 25000
# the first review 
texts[1]
#> [1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

Tokenize the data

maxlen<- 100
max_words<- 10000

tokenizer<- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(texts)

tokenizer$num_words
#> [1] 10000
tokenizer$word_index[1:5]
#> $the
#> [1] 1
#> 
#> $and
#> [1] 2
#> 
#> $a
#> [1] 3
#> 
#> $of
#> [1] 4
#> 
#> $to
#> [1] 5
word_index<- tokenizer$word_index

sequences<- texts_to_sequences(tokenizer, texts)

## first review turned into integers
sequences[[1]]
#>   [1]   62    4    3  129   34   44 7576 1414   15    3 4252  514   43   16    3
#>  [16]  633  133   12    6    3 1301  459    4 1751  209    3 7693  308    6  676
#>  [31]   80   32 2137 1110 3008   31    1  929    4   42 5120  469    9 2665 1751
#>  [46]    1  223   55   16   54  828 1318  847  228    9   40   96  122 1484   57
#>  [61]  145   36    1  996  141   27  676  122    1  411   59   94 2278  303  772
#>  [76]    5    3  837   20    3 1755  646   42  125   71   22  235  101   16   46
#>  [91]   49  624   31  702   84  702  378 3493    2 8422   67   27  107 3348
x_train<- pad_sequences(sequences, maxlen = maxlen)

## it becomes a 2D matrix of samples x maxlen
dim(x_train)
#> [1] 25000   100
y_train<- as.array(labels)
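As a quick sanity check (my addition, not part of the original workflow), you can decode a sequence back into words by inverting word_index; punctuation and capitalization are lost, but the text should still be recognizable as the first review above:

# build an index -> word lookup and decode the first review
reverse_word_index <- names(word_index)
names(reverse_word_index) <- as.character(unlist(word_index))

decoded <- sapply(sequences[[1]], function(i) reverse_word_index[[as.character(i)]])
cat(paste(decoded, collapse = " "))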

Do the same thing for the test dataset

test_dir<- file.path(imdb_dir, "test")
labels<- c()
texts<- c()

for (label_type in c("neg", "pos")){
  label<- switch(label_type, neg = 0, pos = 1)
  dir_name<- file.path(test_dir, label_type)
  for (fname in list.files(dir_name, pattern = glob2rx("*.txt"), 
                           full.names = TRUE)){
    texts<- c(texts, readChar(fname, file.info(fname)$size))
    labels<- c(labels, label)
  }
}

sequences<- texts_to_sequences(tokenizer, texts)
x_test<- pad_sequences(sequences, maxlen = maxlen)
y_test<- as.array(labels)
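A quick check (my addition) that the test set has the shape we expect and that the labels are balanced:

dim(x_test)
table(y_test)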

Let’s consider adding an LSTM layer. The underlying Long Short-Term Memory (LSTM) algorithm was developed by Hochreiter and Schmidhuber in 1997; it was the culmination of their research on the vanishing gradient problem.

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 32) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)

summary(model)
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> embedding (Embedding)               (None, None, 32)                320032      
#> ________________________________________________________________________________
#> lstm (LSTM)                         (None, 32)                      8320        
#> ________________________________________________________________________________
#> dense (Dense)                       (None, 1)                       33          
#> ================================================================================
#> Total params: 328,385
#> Trainable params: 328,385
#> Non-trainable params: 0
#> ________________________________________________________________________________
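As a back-of-the-envelope check (my own arithmetic, not from the original post), the parameter counts in the summary come out as follows:

# embedding: one 32-dimensional vector per token index (10,000 words + 1 for index 0 / padding)
(max_words + 1) * 32            # 320032

# LSTM: 4 gate/candidate transforms, each with 32 input weights, 32 recurrent weights and 1 bias per unit
4 * (32 * 32 + 32 * 32 + 32)    # 8320

# dense: 32 weights + 1 bias
32 + 1                          # 33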
history <- model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)

plot(history)

# retrain on the full training set (the run above held out 20% for validation)
model %>%
        fit(x_train, y_train, epochs = 7, batch_size = 32)

Evaluate the model on the testing data.

metrics<- model %>% 
  evaluate(x_test, y_test)

metrics
#>      loss       acc 
#> 0.4781031 0.8294800

~83% accuracy. It is a little better than the fully connected approach in my last post.

Take-home messages

I was expecting the accuracy to be even higher with such a computationally intensive LSTM.

  • We didn’t fine-tune the model: one reason it didn’t do better could be that we didn’t adjust settings such as how we represent the words (the embedding dimension) or the complexity of the model (the number of LSTM units).

  • We didn’t use regularization: another reason could be that we didn’t apply techniques (such as dropout) to keep the model from memorizing the training data, which can lead to poor generalization.

  • The main reason: honestly, though, the biggest reason is that for this specific task of deciding whether a review is positive or negative, we don’t really need the long-range structure of the entire review. We can do a good job just by looking at which words occur in the review and how often. That’s what our previous approach did (a rough sketch follows this list).

  • LSTM shines in tougher tasks: there are more challenging language-processing tasks where LSTMs show their strength. When answering complex questions or translating between languages, an LSTM is valuable because it is good at capturing context and relationships between words across long stretches of text. So while it might not be necessary for simple sentiment analysis, it becomes really useful in more complicated language tasks.
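For comparison, here is a rough sketch of that simpler "which words occur" approach (multi-hot encoding fed into a small densely connected classifier, along the lines of my previous post; details there may differ):

# sketch only: assumes `train_sequences` holds the tokenized *training* reviews
# (the `sequences` variable above was reused for the test set)
vectorize_sequences <- function(sequences, dimension = max_words){
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)){
    results[i, sequences[[i]]] <- 1   # flag every word index that appears in review i
  }
  results
}

x_train_bow <- vectorize_sequences(train_sequences)

bow_model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(max_words)) %>%
  layer_dense(units = 1, activation = "sigmoid")

bow_model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("acc")
)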

Bottom line: choosing the right method for the problem is more important than the complexity of the method. In fact, I always prefer to try simpler methods first (e.g., regression, random forest) because they are more interpretable.
