Using RNN, LSTM, and CNN with LSTM in Keras and Python for sentiment classification of IMDB movie reviews

Celestin Ntemngwa
10 min read · Jul 22, 2020


Description

This project uses sequence models to build a sentiment classification system. I used the IMDB movie review dataset to create benchmarks with a recurrent neural network (RNN) based on LSTM, LSTM with dropout, and LSTM combined with a convolutional neural network (CNN), and then compared the models’ accuracies.

The Dataset

The dataset comes from Stanford AI. According to the authors, Maas, Daly, Pham, Huang, Ng & Potts (2011), it is a dataset for binary sentiment classification. The authors provide 25,000 highly polar movie reviews for training, 25,000 for testing, additional unlabeled data, and both the raw text and an already-processed bag-of-words format.

I used deep learning models to train on the reviews and predict their sentiments. In this dataset the sentiments fall into two classes: positive and negative. I used the training set to build each model and the test set to measure its accuracy.

Required Libraries

The principal libraries that I used were Keras and TensorFlow. Install them with pip install.

I used the following imports:

import numpy

from keras.datasets import imdb
from keras.models import Sequential  # this helps make use of sequential information
from keras.layers import Dense, LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

# Then, for reproducibility, I used a fixed random seed
numpy.random.seed(7)

Using a sequence model assumes that the words in a review are not independent. For example, somebody could write, “this movie A was good but not as good as movie B.”

Thus, the sequence of words matters: parts of that review, read in isolation, could be classified as either positive or negative sentiment.

I used a recurrent neural network (RNN) to model the sequences. An RNN performs the same operation for every element of a sequence, and its output depends on the previous computations; in effect it has a memory that captures information about what has been calculated so far, which lets it make use of long sequences.

Load Data Set

I loaded the dataset, keeping only the top m words and zeroing out the rest:

top_words = 5000

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=top_words)

Here I kept the top 5,000 words. I could use any number of top words (I chose 5,000 somewhat arbitrarily); however, the speed of training is inversely proportional to the number of words kept, so the more words, the slower the run.

When I inspect x_train and y_train separately, x_train contains each review encoded as a sequence of word indices, while y_train, the target class, is binary (positive = 1 and negative = 0).
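Since the screenshot of that output is not reproduced here, this is a minimal sketch of how to inspect the loaded data (the exact values printed depend on the dataset version):

# each review is stored as a list of integer word indices, ordered by word frequency
print(len(x_train), len(x_test))  # 25000 25000
print(x_train[0][:10])            # first 10 word indices of the first training review
print(y_train[:10])               # labels: 1 = positive, 0 = negative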

The next task was to truncate and pad the input sequences.

Reviews do not all have the same number of words; one reviewer may write a single sentence while another writes a whole paragraph. The reviews must be converted into tokens to create features and hence a structured dataset, which can then be used to predict the sentiment classes.

However, the resulting sequences still have different lengths, and the network expects fixed-length input, so the lengths must be made equal. To do that, we use padding (and truncation).

I truncated and padded the input sequences as follows:

max_review_length = 500

x_train = sequence.pad_sequences(x_train, maxlen=max_review_length)

x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)

This forces every review to a length of 500: shorter reviews are padded and longer ones are truncated. I chose 500 somewhat arbitrarily; the higher max_review_length is, the more resources are needed to train the model.
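To make the effect concrete, here is a tiny toy example of my own (not from the original post) using the same sequence.pad_sequences call:

# two made-up "reviews" of different lengths
demo = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(sequence.pad_sequences(demo, maxlen=4))
# [[0 1 2 3]
#  [5 6 7 8]]  <- the short review is left-padded with zeros, the long one is truncated from the front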

Creating the model

I started with a Sequential model and set the embedding vector length to 32; this is the size of the compressed representation of each word that the hidden layers will see.

embedding_vector_length = 32

model = Sequential()

Then I added an embedding layer as follows:

model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))

Next, I added an LSTM layer with 100 units:

model.add(LSTM(100))

Then I added a dense output layer with one unit on top of the LSTM, using a sigmoid activation function:

model.add(Dense(1, activation='sigmoid'))

Then I compiled the model. For the loss function I used 'binary_crossentropy' because the target variable has two classes; for the optimizer I used 'adam', and for the metric, 'accuracy'.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

Next, I fitted or trained the model

model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=64)

When I ran the model, I got the following (part of the output):

Figure 2: sample model-fitting output with epochs=5

As seen in Figure 2, the model was still fitting (it was at 6,979 out of 25,000 samples) with a training accuracy of about 66%. I interrupted the run and changed the number of epochs.

I changed the number of epochs to 3 and reran the model to reduce the run time. Everything else stayed the same as in the model with epochs=5.

Increasing the number of epochs generally increases accuracy, up to the point where the model starts to overfit.

Figure 3: sample model-fitting output with epochs=3

For epoch 1/3, training accuracy is 77.92% and validation accuracy is 84.85%. Because training accuracy is below validation accuracy, the model is still underfitting and there is room for improvement; the gap between the two shrinks as the epochs increase. For epoch 3/3, training accuracy is 87.62% while validation accuracy is 86.31%, a difference of 1.31%, which is fine. This constitutes the baseline RNN model.

Increasing the number of epochs further can still boost accuracy; somewhere there is an epoch count that gives the best fit, where the model neither overfits nor underfits.

LSTM (long short-term memory) is a type of RNN that uses gated memory cells to retain information over both long and short spans of a sequence, and it is the most widely used RNN variant. The main difference from a vanilla RNN is the way the LSTM’s gates control how information is carried forward, which makes it much better at learning long-range dependencies.
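For reference (standard material, not shown in the original post), the LSTM cell update can be written as:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$

The gates decide what to forget, what to write into the cell state, and what to expose as output, which is what lets the cell carry information across long sequences.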

The next step was to find out how good or bad my model was in terms of prediction accuracy. The run reported an accuracy for each of the three epochs, so I evaluated the final model on the test data (x_test, y_test).

Evaluation of the model

I needed to see how the model trained over three epochs performs on unseen data. I evaluated it as follows:

scores = model.evaluate(x_test, y_test, verbose=0)

print("Accuracy: %.2f%%" % (scores[1]*100))

The result: as seen from my evaluation, the accuracy was 86.31%.

Next, I introduced LSTM with dropout

LSTM with dropout

To control overfitting of the model, I used dropout. Here I dropped out a fraction of the units at different points in the network, specifically after the Embedding layer and after the LSTM layer.

First, I had to import the Dropout layer from Keras.

from keras.layers import Dropout

I added the dropout layers:

model.add(Dropout(0.2))  # a dropout rate of 0.2 after the Embedding layer

model.add(Dropout(0.2))  # another dropout rate of 0.2 after the LSTM layer

Then I included these dropout layers in the model and reran it.

Figure 4: LSTM with the Dropout rate of 0.2
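Since Figure 4 is a screenshot, here is a sketch of the full model with dropout, assembled from the snippets above (training settings as in the earlier run):

model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Dropout(0.2))  # dropout after the embedding layer
model.add(LSTM(100))
model.add(Dropout(0.2))  # dropout after the LSTM layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)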

When I ran this model, I obtained the following results: with dropout, the accuracy is 87.12%, which is better than the 86.31% from my previous model without dropout.

I then tried the LSTM layer’s two built-in dropout arguments (regular dropout and recurrent dropout): regular dropout is applied to the layer’s inputs, while recurrent dropout is applied to the recurrent connections between time steps. My goal was to find out whether this would give better accuracy than the previous models. I added the following:

model = Sequential()

model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))

model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))

model.add(Dense(1, activation='sigmoid'))

I used these additions to retrain the model. The result: the accuracy was 84.27%, lower than the 87.12% obtained with the separate Dropout layers.

LSTM with CNN

The next part builds on the embedding layer, so it is worth asking: what is a word embedding? A word embedding (or word vector) is a vector representation of a document’s vocabulary in which words or phrases are mapped to vectors of real numbers. It can capture the context of a word in a document, its relation to other words, and semantic and syntactic similarity. In short, each word is represented as a point in a d-dimensional space.

Embeddings map items such as words to low-dimensional real-valued vectors in such a way that similar items end up close to each other.

CNN and LSTM for Sequence classification

The Keras library provides a convenient way to convert the positive-integer representation of words into word embeddings: the Embedding layer. This layer takes arguments that define the mapping, including the vocabulary size, and lets you specify the dimension of each word vector, called the output dimension.
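As a concrete illustration of those arguments (the values match the ones used earlier in this post):

# maps each of the top 5,000 word indices to a 32-dimensional vector;
# input_length is the padded review length
Embedding(input_dim=top_words, output_dim=embedding_vector_length, input_length=max_review_length)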

How did I introduce the embedding layer and word vectors in this case?

I introduced a 1D convolutional neural network (Conv1D) by first importing the layer from Keras:

from keras.layers.convolutional import Conv1D

I also introduced 1D max pooling (MaxPooling1D). All of this was my attempt to get a more robust prediction.

The next step was to apply Conv1D, max pooling, the LSTM, and an activation function, and then build the model, as sketched below.

My model configuration included a Conv1D layer with 32 filters, a kernel size of 3, ReLU activation, and 'same' padding, followed by max pooling with a pool size of 2 (pooling two features at a time).
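The original post showed this step as a screenshot. A sketch of the model, based on the configuration described above (the 100 LSTM units are carried over from the earlier models), would look roughly like this:

from keras.layers import MaxPooling1D

model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))  # 100 units, as in the earlier models
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())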

The resulting model summary output is not reproduced here.

Next, I fitted the model

model.fit(x_train, y_train, epochs=3, batch_size=64)

Then the final evaluation of the model:

The results show an accuracy of 88.00%

Then I introduced the concepts of Flatten and a fully connected network.

LSTM and Flatten and Fully Connected Network (FCN)

To further optimize the model, I introduced a Flatten layer and a fully connected network (FCN).

I added a dense layer of 250 neurons and a Flatten layer, and removed the LSTM layer. Again I took the embedding length to be 32, used a Sequential model with top_words for the embedding, and kept the other configurations, as shown in Figure 6.

Figure 5: import of flatten from keras
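Figure 5 is an image in the original; the import it shows is presumably the standard one:

from keras.layers import Flatten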

The configuration:

Figure 6: model with Flatten and FCN
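Since Figure 6 is also a screenshot, here is a sketch of the configuration described above (embedding length 32, Flatten, a 250-neuron dense layer, no LSTM); the ReLU activation on the 250-unit layer is my assumption:

model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Flatten())
model.add(Dense(250, activation='relu'))  # 250-neuron fully connected layer (activation assumed)
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())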

As seen in Figure 6, introducing the Dense layer increased the number of parameters, so I increased the batch_size to 128 when fitting the model to make it train faster.

model.fit(x_train, y_train, epochs=3, batch_size=128)

This gave me an accuracy of 88.22 %

I changed the batch_size to 64, and I got 86.18%

I did some fine-tuning by changing the activation function at the output layer (the Dense layer). When I changed it from sigmoid to ReLU, the accuracy dropped from 88% to 69%.

When I used softmax, the accuracy went down to 50%. (With a single output unit, softmax always outputs 1.0, so the model cannot separate the classes and accuracy falls to the 50% class balance.)

Thus: sigmoid 88%, ReLU 69%, and softmax 50%. Sigmoid wins!

Optimizer

I changed the optimizer from 'adam' to 'sgd', and the accuracy dropped to 51%. Not good!
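The change amounts to swapping the optimizer string when compiling (a sketch):

model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])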

So I reverted to the 'adam' optimizer.

Testing Predictions

I then generated predictions for the x_test data set, as sketched below.

I predicted the classes for x_test (giving the predicted sentiment for the test set), and then printed y_test (the actual sentiment) for comparison.

1 is positive, and 0 is negative.
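A sketch of that prediction step (the original screenshots are not reproduced here; I threshold model.predict at 0.5 instead of using the older predict_classes helper):

# predicted sentiment for the test set
y_prob = model.predict(x_test)
y_pred = (y_prob > 0.5).astype('int32').flatten()
print(y_pred[:10])  # predicted labels: 1 = positive, 0 = negative

# actual sentiment labels for comparison
print(y_test[:10])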

This concludes the project, which used a sequence-based approach to sentiment analysis of IMDB movie reviews.

References

Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142–150).
