June 18, 2024
LSTM Sentiment Analysis with Keras and PyTorch
Training an LSTM to Predict Sentiment
Photo by Madison Oren on Unsplash
Introduction
Humans easily understand emotion conveyed through text, but how can we train computers to do the same? Turns out, it's not that difficult, especially with the right framework.
In this article, I'll demonstrate how to train an LSTM model on a sentiment analysis dataset using both PyTorch and Keras, two of the leading Python frameworks for deep learning.
What are LSTMs?
Put simply, LSTM (Long Short-Term Memory) models are designed to take in sequential input and selectively remember information through the use of input, output, and forget gates. Here's a neat resource to learn more about how LSTMs and similar recurrent models work.
While the introduction of transformers in the 2017 paper Attention Is All You Need has made these models somewhat obsolete, they're still more than adequate for most problems involving short text classification or generation. Plus, they're simpler to write code for than transformers.
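To make the gate mechanics concrete, here's a minimal sketch of a single LSTM cell step in plain NumPy. The weight and bias names (W_i, b_i, and so on) are illustrative placeholders, not the parameters of any particular library:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c, b_i, b_f, b_o, b_c):
    z = np.concatenate([h_prev, x_t])   # previous hidden state + current input
    i = sigmoid(W_i @ z + b_i)          # input gate: how much new information to write
    f = sigmoid(W_f @ z + b_f)          # forget gate: how much old memory to keep
    o = sigmoid(W_o @ z + b_o)          # output gate: how much memory to expose
    c_t = f * c_prev + i * np.tanh(W_c @ z + b_c)  # updated cell state
    h_t = o * np.tanh(c_t)              # new hidden state
    return h_t, c_t

# toy usage: input size 3, hidden size 2
rng = np.random.default_rng(0)
h, c = np.zeros(2), np.zeros(2)
Ws = [rng.normal(size=(2, 5)) for _ in range(4)]
bs = [np.zeros(2) for _ in range(4)]
h, c = lstm_step(rng.normal(size=3), h, c, *Ws, *bs)
Keras and PyTorch implement this (plus a lot of optimization) for you, so in practice you never write the gates by hand.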
The Dataset
For this project, we'll be using a dataset from Kaggle called Sentiment140 which has 1.6 million tweets. For now, we'll only use a portion of the dataset.
First, download and unzip the file, then rename it to data.csv and move it into your project directory.
Then, make sure you have the pandas library installed to load the .csv file into a DataFrame:
pip install pandas
To read the file:
# train_sentiment_keras.py
import pandas as pd
import random
df = pd.read_csv("data.csv", encoding="latin1")
print(df.head())
""" 0 ... @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
0 0 ... is upset that he can't update his Facebook by ...
1 0 ... @Kenichan I dived many times for the ball. Man...
2 0 ... my whole body feels itchy and like its on fire """
The dataset has two columns of interest: the labels and the text. Incidentally, these are the first and last columns. Let's save the data into a list:
...
data = list(zip(df[df.columns[0]], df[df.columns[-1]]))
# combines the first and last columns of the DataFrame using zip.
print(random.sample(data, 5))
"""
[(0, "crap can't find my miley cyrus cd to play in the car to annoy my mum #LoveEverybody"), (4, 'going to pink concert tonight!! '),
(4, '@AdamSevani haha, I watched Step Up 2 the other day. HahaThe part where you lift yur shirt. Made my day a LOT better. '),
(0, '@KnightOnline does this include coin that has gone when servers were took offline? i lost 10gb from a succesful trade when they went down '),
(4, "@ebony1075 not me, lol. Didn't even know there was a match You ok? Did I miss much last night?????")]
"""
The first value in each list entry is the label. Here, 0 is negative sentiment and 4 is positive sentiment. We'll change these labels to 0 and 1 later. The second value is the text we're training on. The goal is to classify the text into the correct sentiment category.
Now we're ready to train the model!
Keras
Keras is a highly stable deep learning framework, popular with beginners who want to quickly train models that don't require advanced customization. The code snippets in this section are much shorter than in the PyTorch section :) Keras runs on top of another library called TensorFlow. Learn more about Keras here.
Installation
To install the required libraries:
pip install keras tensorflow scikit-learn
Once that's done, we're ready to tokenize the text and build the model. "Tokenization" is the process by which text in a dataset is split up into chunks, such as words or common sequences of characters. "Hello I am Bob" might become ["hello", "i", "am", "bob"].
Each unique token is assigned a numerical label. "hello" might be assigned the label 256. Models like LSTMs require an Embedding layer that transforms these labels into embeddings of some size.
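As a quick illustration (separate from the training script below, but using the same Keras Tokenizer), fitting a tokenizer on a toy corpus might look like this; the exact integer IDs depend on word frequency in the corpus:
from tensorflow.keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer(num_words=100)
toy_tokenizer.fit_on_texts(["Hello I am Bob", "Hello I am happy"])
print(toy_tokenizer.word_index)                               # e.g. {'hello': 1, 'i': 2, 'am': 3, 'bob': 4, 'happy': 5}
print(toy_tokenizer.texts_to_sequences(["Hello I am Bob"]))   # e.g. [[1, 2, 3, 4]]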
Text Preprocessing and Tokenization
...
data = random.sample(data, 100_000) # only use a subset of the data to train
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re
from sklearn.preprocessing import LabelEncoder
def preprocess_text(text):
    text = text.lower()  # lowercase the text
    text = re.sub(r'\s+', ' ', text)  # collapse excessive whitespace into single spaces
    return text
num_words = 10000 # the maximum number of words the tokenizer should have.
max_seq_length = 100 # maximum sequence length (of tokens) the model can handle, anything above this is cut off
tokenizer = Tokenizer(num_words=num_words) # initialize a tokenizer to chunk the text
labels = [x[0] for x in data] # gets all the labels (for later)
all_text_data = [preprocess_text(x[1]) for x in data] # gets all the text and cleans it
tokenizer.fit_on_texts(all_text_data) # train the tokenizer
sequences = tokenizer.texts_to_sequences(all_text_data)
print(sequences[:5])
"""
[[29, 540],
[210, 4, 51, 27, 762],
[177, 18, 1, 172, 86, 249, 1073],
[308, 140, 2, 226, 21, 7],
[1488, 1488, 167, 1, 731, 21, 53]]
"""
The code above takes a random subset of the data for training (random.sample), cleans the text by lowercasing it and collapsing extra whitespace (preprocess_text), and fits a tokenizer that converts all the text into sequences of numbers. We can turn these sequences back into strings like so:
...
original_text = tokenizer.sequences_to_texts(sequences[:5])
print(original_text)
"""
['good idea',
'made a back up account',
'tired but i cant miss lost xo',
'hi nice to tweet with you',
'heh heh yes i agree with u']
"""
We also need to pad the sequences so that they're all the same length. This pads the beginning of each sequence:
...
# Pad sequences to ensure uniform length
X = pad_sequences(sequences, maxlen=max_seq_length)
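As a quick check of what that does: pad_sequences defaults to padding='pre', so zeros are added to the front of sequences shorter than maxlen (and longer ones are truncated). For example:
from tensorflow.keras.preprocessing.sequence import pad_sequences
print(pad_sequences([[29, 540], [210, 4, 51, 27, 762]], maxlen=5))
# [[  0   0   0  29 540]
#  [210   4  51  27 762]]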
Also, let's save the tokenizer to a file for later:
...
import io
import json
tokenizer_json = tokenizer.to_json() # saves the tokenizer
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
Now that we have the text in a machine-readable format, let's update the labels so that "0" is 0 and "4" is 1.
...
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels) # gets the input labels and converts them to a more standard form
# ^^ this also works if you have more than two label types
print(y[:10])
"""originally: [4 4 0 0 4 4 0 0 0 0]"""
"""now: [1 1 0 0 1 1 0 0 0 0]"""
Model Setup + Training
That was (surprisingly) the hard part, since using Keras is pretty simple.
Model setup:
...
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
# Define the LSTM model
model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=128))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
"""
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ embedding (Embedding) │ ? │ 0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ spatial_dropout1d │ ? │ 0 (unbuilt) │
│ (SpatialDropout1D) │ │ │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm (LSTM) │ ? │ 0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense) │ ? │ 0 (unbuilt) │
└─────────────────────────────────┴────────────────────────┴───────────────┘
...
"""
The code above defines the LSTM model. First comes the Embedding layer we talked about, which turns the integer token IDs into embedding vectors (of size 128, in this case).
The embedding vectors then pass through the LSTM. The LSTM's output goes into a "Dense" layer, a linear layer that shrinks it down to 2 values (the number of classes in our dataset), and softmax turns those values into a probability per class. If the softmax output for a given input is [0.9, 0.1], the model is 90% confident that the input is negative (class 0).
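If you want to sanity-check this before training, you can run one padded sequence through the model. This continues the same script; the model is untrained at this point, so the probabilities themselves are meaningless:
probs = model.predict(X[:1])         # shape (1, 2): one probability per class
predicted_class = probs[0].argmax()  # 0 = negative sentiment, 1 = positive sentiment
print(probs, predicted_class)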
Train/Val split and Training
One last step before training the model is splitting the data into training and validation sets. The model is never trained on the validation data, so it gives us a fair measure of general performance and helps us spot overfitting. 20% of the data is set aside for validation.
...
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
Finally, we're ready to train the model. It is small enough to train quickly on a decent CPU. We do 3 passes (epochs) over the training data:
...
# Train the model
history = model.fit(X_train, y_train, epochs=3, batch_size=64, validation_data=(X_val, y_val), verbose=1)
print(history)
model.save('sentiment_analysis.keras') # saves the model for testing
"""
Epoch 1/3
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 84s 67ms/step - accuracy: 0.7138 - loss: 0.5478 - val_accuracy: 0.7891 - val_loss: 0.4566
Epoch 2/3
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 88s 70ms/step - accuracy: 0.8090 - loss: 0.4191 - val_accuracy: 0.7927 - val_loss: 0.4478
Epoch 3/3
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 89s 72ms/step - accuracy: 0.8281 - loss: 0.3774 - val_accuracy: 0.8011 - val_loss: 0.4348
"""
Testing
Now, in a new file, let's load the model to test it on our own input. Much of the code is the same:
# testing file
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.saving import load_model
import re
import json
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    return text
label_to_type = {
0: 'Negative',
1: 'Positive'
}
num_words = 10000
max_seq_length = 100
with open('tokenizer.json') as f:
    data = json.load(f)
tokenizer = tokenizer_from_json(data)
# Define the LSTM model
model = load_model("sentiment_analysis.keras")
while True:
    input_s = input("?")
    input_s = preprocess_text(input_s)  # apply the same cleaning used during training
    input_s = tokenizer.texts_to_sequences([input_s])
    input_s = pad_sequences(input_s, maxlen=max_seq_length)
    output = model.predict(input_s)[0].tolist()
    print('confidence:', max(output))
    val_idx = output.index(max(output))
    print('Predicted sentiment of input:', label_to_type[val_idx])
The output:
? I am happy :)
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step
confidence: 0.9804297685623169
Predicted sentiment of input: Positive
? I am sad
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
confidence: 0.9954018592834473
Predicted sentiment of input: Negative
? we got a dog
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
confidence: 0.5758517980575562
Predicted sentiment of input: Positive
? we got a cat
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
confidence: 0.8269914984703064
Predicted sentiment of input: Positive
? it snowed last night
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
confidence: 0.5409414172172546
Predicted sentiment of input: Negative
? on no, it snowed last night
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
confidence: 0.7260915040969849
Predicted sentiment of input: Negative
PyTorch
PyTorch is another library that can be used for training deep learning models like LSTMs. It offers more customization and so is the #1 choice for most researchers/developers. It's also often easier to debug than Keras.
PyTorch - Installation
You can install the required libraries with pip:
pip install torch
pip install pandas
pip install transformers
Text Preprocessing and Tokenization
First, import the required libraries
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
The reason for using the transformers library is that, as far as I can tell, PyTorch doesn't have an equivalent to the Tokenizer we used in the Keras version. AutoTokenizer is good enough, though, and offers very similar (often better) functionality.
Load the tokenizer from a pretrained model (BERT), as well as make a function to tokenize the text:
...
device = torch.device("cpu") # change to "cuda" if you're running on a gpu or "mps" if you have apple silicon
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', padding_side="left") # pads the beginning of the sequence. important!
# try padding right. the model won't converge :)
max_len = 100 # max length of the input. everything else will be cut off
def preprocess_text(text):
    return tokenizer(
        text,
        max_length=max_len,    # Maximum length of the sequence
        padding='max_length',  # Pad to maximum length
        truncation=True,       # Truncate longer sequences
        return_tensors='pt'    # Return PyTorch tensors, which are like NumPy arrays
    )
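To see what preprocess_text produces, you can run it on a short example. The exact token IDs come from the BERT vocabulary, so the values are only indicative:
encoded = preprocess_text(["I am happy"])
print(encoded['input_ids'].shape)    # torch.Size([1, 100]): one sequence, padded/truncated to max_len
print(encoded['input_ids'][0, -5:])  # the real tokens sit at the end because of left padding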
Loading the data with pandas:
...
import random
# Load data
data = pd.read_csv('data.csv', encoding='latin-1')
data = data.fillna('')
df = [[x[-1], 0 if x[0] == 0 else 1] for x in data.values.tolist()] # convert the labels from 0 and 4 to 0 and 1
random.shuffle(df)
df = random.sample(df, 100_000)
split_idx = int(len(df) * 0.8) # index of element around ~ the first 80% of data
a, b = df[:split_idx], df[split_idx:] # training: 80% of data, validation: 20% of data
X_train = preprocess_text([x[0] for x in a])['input_ids']
X_test = preprocess_text([x[0] for x in b])['input_ids']
y_train = [x[1] for x in a]
y_test = [x[1] for x in b]
# gets the labels for training and validation
PyTorch has something different from Keras called a DataLoader, which automatically batches the data for us. As input, a DataLoader requires a wrapper class for our data called a Dataset. Keras batches in the background (see batch_size), but in PyTorch we need to create a DataLoader to do it:
...
# Convert data to PyTorch DataLoader
class TwitterDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
num_classes = 2
y_train = torch.tensor(y_train, dtype=torch.long).to(device)
y_test = torch.tensor(y_test, dtype=torch.long).to(device)
train_dataset = TwitterDataset(X_train, y_train)
test_dataset = TwitterDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False) # no need to shuffle, order doesn't matter here
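As a sanity check, you can pull one batch from the loader and look at the shapes (the sizes below assume batch_size=64 and max_len=100 from above):
inputs, labels = next(iter(train_loader))
print(inputs.shape)  # torch.Size([64, 100]): 64 tokenized tweets, each padded to max_len
print(labels.shape)  # torch.Size([64]): one label per tweet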
Defining the model:
...
# Define LSTM model
class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, lstm_hidden_dim, output_dim, dropout_prob=0.2):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(tokenizer.vocab_size, embedding_dim)  # token IDs -> embedding vectors
        self.dropout = nn.Dropout(p=dropout_prob)
        # note: nn.LSTM's dropout argument only applies between stacked layers (num_layers > 1),
        # so PyTorch will warn that it has no effect here; the nn.Dropout above does the real work
        self.lstm = nn.LSTM(embedding_dim, lstm_hidden_dim, dropout=dropout_prob, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_dim, output_dim)  # the equivalent of Keras's Dense layer

    def forward(self, x):
        embedded = self.embedding(x)
        embedded_dropout = self.dropout(embedded)
        lstm_out, _ = self.lstm(embedded_dropout)
        lstm_out = lstm_out[:, -1, :]  # keep only the final timestep's hidden state
        output = self.fc(lstm_out)
        return output
embedding_dim = 128
hidden_dim = 100
output_dim = num_classes
dropout = 0.2
model = LSTMModel(embedding_dim, hidden_dim, output_dim, dropout).to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
We can see that the structure is very similar to the Keras version; however, here we describe more explicitly how the LSTM output is fed into the Linear layer ("Dense" in Keras). We also explicitly initialize the Adam optimizer, which drives the weight updates and helps the model converge.
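To be concrete about what lstm_out[:, -1, :] is doing: with batch_first=True, nn.LSTM returns a tensor of shape (batch, seq_len, hidden_dim), and the forward pass keeps only the last timestep, which is the real final token thanks to left padding. Here's a quick shape check you can run right after defining the model (the dummy batch is made up):
dummy = torch.randint(0, tokenizer.vocab_size, (4, 100)).to(device)  # a fake batch of 4 token-ID sequences
with torch.no_grad():
    emb = model.embedding(dummy)
    lstm_out, _ = model.lstm(emb)
    print(emb.shape)                 # torch.Size([4, 100, 128])
    print(lstm_out.shape)            # torch.Size([4, 100, 100]): a hidden state for every timestep
    print(lstm_out[:, -1, :].shape)  # torch.Size([4, 100]): only the last timestep
    print(model(dummy).shape)        # torch.Size([4, 2]): one score per class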
Finally, here's the training/validation loop:
...
# Training and validation loop
epochs = 3
for epoch in range(epochs):
    # Training
    model.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)  # compare predictions against the true labels
        loss.backward()  # compute gradients
        optimizer.step()  # update the weights
        train_loss += loss.item() * inputs.size(0)
        _, predicted = torch.max(outputs, 1)
        train_correct += (predicted == labels).sum().item()
        train_total += labels.size(0)
        # Print running training loss and accuracy
        if batch_idx % 100 == 9:  # print every 100 batches
            train_loss_avg = train_loss / train_total
            train_acc_avg = train_correct / train_total
            print(f'Epoch [{epoch+1}/{epochs}], Step [{batch_idx+1}/{len(train_loader)}], Train Loss: {train_loss_avg:.4f}, Train Acc: {train_acc_avg:.4f}')
    # Validation: use eval() so that dropout is disabled
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs, 1)
            val_correct += (predicted == labels).sum().item()
            val_total += labels.size(0)
    # Print validation loss and accuracy after each epoch
    val_loss_avg = val_loss / val_total
    val_acc_avg = val_correct / val_total
    print(f'Epoch [{epoch+1}/{epochs}], Val Loss: {val_loss_avg:.4f}, Val Acc: {val_acc_avg:.4f}')

# Save the model to a file for later
torch.save(model.state_dict(), 'sentiment_analysis.pth')
Now, let's train the model. Training log:
Epoch [1/3], Step [10/1250], Train Loss: 0.6979, Train Acc: 0.5312
Epoch [1/3], Step [110/1250], Train Loss: 0.6922, Train Acc: 0.5246
Epoch [1/3], Step [210/1250], Train Loss: 0.6719, Train Acc: 0.5756
Epoch [1/3], Step [310/1250], Train Loss: 0.6479, Train Acc: 0.6110
Epoch [1/3], Step [410/1250], Train Loss: 0.6300, Train Acc: 0.6336
Epoch [1/3], Step [510/1250], Train Loss: 0.6172, Train Acc: 0.6497
Epoch [1/3], Step [610/1250], Train Loss: 0.6042, Train Acc: 0.6634
Epoch [1/3], Step [710/1250], Train Loss: 0.5941, Train Acc: 0.6746
Epoch [1/3], Step [810/1250], Train Loss: 0.5846, Train Acc: 0.6832
Epoch [1/3], Step [910/1250], Train Loss: 0.5777, Train Acc: 0.6897
Epoch [1/3], Step [1010/1250], Train Loss: 0.5707, Train Acc: 0.6965
Epoch [1/3], Step [1110/1250], Train Loss: 0.5639, Train Acc: 0.7019
Epoch [1/3], Step [1210/1250], Train Loss: 0.5591, Train Acc: 0.7059
Epoch [1/3], Val Loss: 0.4856, Val Acc: 0.7661
Epoch [2/3], Step [10/1250], Train Loss: 0.4557, Train Acc: 0.7922
Epoch [2/3], Step [110/1250], Train Loss: 0.4636, Train Acc: 0.7845
Epoch [2/3], Step [210/1250], Train Loss: 0.4634, Train Acc: 0.7836
Epoch [2/3], Step [310/1250], Train Loss: 0.4606, Train Acc: 0.7839
Epoch [2/3], Step [410/1250], Train Loss: 0.4620, Train Acc: 0.7831
Epoch [2/3], Step [510/1250], Train Loss: 0.4617, Train Acc: 0.7829
Epoch [2/3], Step [610/1250], Train Loss: 0.4599, Train Acc: 0.7842
Epoch [2/3], Step [710/1250], Train Loss: 0.4583, Train Acc: 0.7856
Epoch [2/3], Step [810/1250], Train Loss: 0.4573, Train Acc: 0.7861
Epoch [2/3], Step [910/1250], Train Loss: 0.4564, Train Acc: 0.7865
Epoch [2/3], Step [1010/1250], Train Loss: 0.4563, Train Acc: 0.7864
Epoch [2/3], Step [1110/1250], Train Loss: 0.4553, Train Acc: 0.7867
Epoch [2/3], Step [1210/1250], Train Loss: 0.4544, Train Acc: 0.7869
Epoch [2/3], Val Loss: 0.4555, Val Acc: 0.7867
Epoch [3/3], Step [10/1250], Train Loss: 0.4043, Train Acc: 0.8172
Epoch [3/3], Step [110/1250], Train Loss: 0.4127, Train Acc: 0.8088
Epoch [3/3], Step [210/1250], Train Loss: 0.4131, Train Acc: 0.8091
Epoch [3/3], Step [310/1250], Train Loss: 0.4142, Train Acc: 0.8089
Epoch [3/3], Step [410/1250], Train Loss: 0.4133, Train Acc: 0.8090
Epoch [3/3], Step [510/1250], Train Loss: 0.4140, Train Acc: 0.8089
Epoch [3/3], Step [610/1250], Train Loss: 0.4136, Train Acc: 0.8098
Epoch [3/3], Step [710/1250], Train Loss: 0.4147, Train Acc: 0.8089
Epoch [3/3], Step [810/1250], Train Loss: 0.4139, Train Acc: 0.8095
Epoch [3/3], Step [910/1250], Train Loss: 0.4134, Train Acc: 0.8105
Epoch [3/3], Step [1010/1250], Train Loss: 0.4129, Train Acc: 0.8111
Epoch [3/3], Step [1110/1250], Train Loss: 0.4116, Train Acc: 0.8118
Epoch [3/3], Step [1210/1250], Train Loss: 0.4116, Train Acc: 0.8117
Epoch [3/3], Val Loss: 0.4619, Val Acc: 0.8004
And then, let's test with our own input:
import pandas as pd
import torch
import torch.nn as nn
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', padding_side="left")
max_len = 100
# Encode labels
reverse_label_encoder = {
0: 'Negative',
1: 'Positive'
}
num_classes = 2
def preprocess_text(text):
    return tokenizer(
        text,
        max_length=max_len,    # Maximum length of the sequence
        padding='max_length',  # Pad to maximum length
        truncation=True,       # Truncate longer sequences
        return_tensors='pt'    # Return PyTorch tensors
    )
class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, lstm_hidden_dim, output_dim, dropout_prob=0.2):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(tokenizer.vocab_size, embedding_dim)
        self.dropout = nn.Dropout(p=dropout_prob)
        self.lstm = nn.LSTM(embedding_dim, lstm_hidden_dim, dropout=dropout_prob, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        embedded_dropout = self.dropout(embedded)
        lstm_out, _ = self.lstm(embedded_dropout)
        lstm_out = lstm_out[:, -1, :]
        output = self.fc(lstm_out)
        return output
embedding_dim = 128
hidden_dim = 100
output_dim = num_classes
dropout = 0.2
device = torch.device("mps")  # change to "cpu" or "cuda" to match your hardware
model = LSTMModel(embedding_dim, hidden_dim, output_dim, dropout).to(device)
model.load_state_dict(torch.load("sentiment_analysis.pth", map_location=device))
model.eval()
while True:
    input_val = input("?")
    if not input_val:
        break
    inputs = preprocess_text([input_val])['input_ids'].to(device)
    outputs = torch.softmax(model(inputs), dim=1)
    print(outputs)
    _, predicted = torch.max(outputs, 1)
    print("predicted label:", reverse_label_encoder[predicted[0].tolist()])
And the results:
? I am happy
tensor([[0.0250, 0.9750]], device='mps:0', grad_fn=<SoftmaxBackward0>)
predicted label: Positive
? I am sad
tensor([[9.9952e-01, 4.8366e-04]], device='mps:0', grad_fn=<SoftmaxBackward0>)
predicted label: Negative