Code Request

Filling out this form will grant you access to all of the code on this site. In addition, you will join our mailing list, and we'll send you periodic updates regarding new articles.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Oct. 20, 2024

Text Generation with Keras and Pytorch (Part 1)

Exploring recurrent generative AI models

Click here to download this post's source code

Training an LSTM to generate poetry

Photo by Clark Young on Unsplash

LSTM
Data processing
Keras
Keras Model Text Generation
PyTorch
PyTorch Model Text Generation

In the last article we analyzed how language models learn to classify text using nothing but patterns in textual data. This article is focused on generating text, ChatGPT style :)

Unfortunately, none of us have anywhere close to the computational resources that big companies do, however we can still build some pretty decent models.

This series will be an iterative one, where each article offers model design improvements until we're left with a very powerful model. The goal: write poetry.

Now, I know, it may not be the most interesting of tasks, but there's something ironic about trying to optimize an inherently creative, subjective process. Anyway, ... let's build the first model.

LSTM

We'll train an LSTM in this article. LSTM (Long Short Term Memory) models are recurrent in nature—they work by using context from previous values to predict something. In this case, we'll be using characters, so the model will predict what the next character should be given a context size of previous characters. Given the last n characters, predict what n_plus_1 should be.

I wrote the model twice, once in Keras and once in PyTorch. Before we start with that, let's get into data processing. In this article, we'll use the same tokenizer, so the data processing steps are nearly identical.

Data processing

The poetry dataset we'll be using is Poetry Foundation Poems, which aggregates most of the poems from poetryfoundation.org, a collection of historical poetry.

Installation

To begin this section, make sure you have the required libraries installed:

pip install pandas numpy tqdm

Now let's have a look at our data. Make sure to download the dataset as a zip file, then unzip and save the csv as kaggle_poem_dataset.csv:

# tokenizer.py
import pandas as pd
import json
from tqdm import tqdm
import pickle
import numpy as np

All of the text data is in the "Poem" column, so...

...
df = pd.read_csv("kaggle_poem_dataset.csv")
print(df.head())
data = df["Poem"].values.tolist()
# [..."Ye myrtle wreaths, your fragrance shed\rAround a younger brow"...]

Now comes the fun :)

Tokenization

Recap: We need to convert our text to numbers so the LSTM can work with it. First, we chunk our text into segments (in this case, relevant characters), then we replace each character with its respective index, some number that represents it. "hello" will become ["h", "e", "l", "l", "o"], which then may become something like [8, 5, 12, 12, 15].

The model will have an Embedding layer of size (# classes, embedding_size) like (26, 100). Each character will be replaced by its respective embedding vector, and this is what is fed into the model. This Embedding layer is a bit like a lookup table.

Let's start by defining our tokenizer:

...
# [..."Ye myrtle wreaths, your fragrance shed\rAround a younger brow"...]
token_list_main = list(
    """abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-_=+;:\"\'—’.,<>/?`“”~!@#$%^&*(){}[]| \\""")
token_list = ['<pad>', '<begin>', '<end>', '\n', '\r', '\t', '<unk>'] + token_list_main
token_dict = {x: i for i, x in enumerate(token_list)}
reverse_token_dict = {i: x for i, x in enumerate(token_list)}
json.dump([token_dict, reverse_token_dict], open('tokenizer.json', 'w'))
begin_idx = token_dict['<begin>']
end_idx = token_dict['<end>']
pad_idx = token_dict['<pad>']
num_classes = len(token_dict)
max_len = 256

token_list contains most of the characters that are in our dataset, as well a some additional tokens that will become important later. "" and "" act as the beginning and ending tokens respectively, and will help our model indicate when an input starts and ends. "" is for tokens that aren't properly encoded, and "" is used to pad our sequences to a certain size.

We saved the tokenizer encoder and decoder into files, now let's try encoding and decoding some text!

...
def pad_sequences(inp, to_len):
    # custom - pad and return np array
    results = []
    for val in inp:
        dif = to_len - len(val)
        if dif > 0:
            val = [pad_idx] * dif + val
        results.append(val)
    return results


def to_tokens(strings):
    results = []
    for string in strings:
        result = [token_dict[char] if char in token_dict else token_dict['<unk>'] for char in list(string)]
        result = [begin_idx] + result + [end_idx]
        results.append(result)
    return results


def to_text(tokens):
    return ''.join([reverse_token_dict[token] for token in tokens])


if __name__ == '__main__':
    print(len(data))

    sequences = to_tokens(data)
    print(sequences[10])
    # [1, 4, 4, 3, 51, 14, 11, 104, 9, 21, 27, 18, 10, 20, 76, 26, 104, 14, 11, 18, 22, 104, 8, 27, 26, 104, 25, 26, 15,

    print(to_text(sequences[10]))
    """
    <begin>
    She couldn't help but sting my finger,
    clinging a moment before I flung her...
    """

Hopefully you can see that to_tokens will convert the strings to lists of numbers (and use the "" token if the character doesn't exist in the tokenizer dictionary), and to_text will reverse that process.

Now we're ready to tokenize and save all of the data for training:

...
if __name__ == '__main__':
    ... # continued from previous code block
    step = 20  # the number of characters to step by (making this smaller means more data but also longer training)
    big_data = []
    # now, split up the data into sequences for training
    for txt in tqdm(sequences):
        dif = len(txt) - max_len
        if dif > 0:
            txt = [pad_idx] * (max_len - step) + txt # add this so that the model is used to handling input with initial padding
            for i in range(0, dif, step):
                big_data.append([
                    txt[i:i + max_len],
                    txt[i + max_len]
                ])

        elif len(txt) > 0:
            big_data.append([
                txt[:-1],
                txt[-1]
            ])

    print('size:', len(big_data))

    X = np.array(pad_sequences([a[0] for a in big_data], to_len=max_len))
    y = np.array([a[1] for a in big_data])
    print('Each text sample is of size:', len(X[0]))
    pickle.dump([X, y], open('poetry_data.pkl', 'wb')) # saves the dataset to a pickle file

The following code split the tokenized data into chunks of size 256 (max_length); it did this through a combination of string slicing and padding. step refers to the step size of the text (how many characters to skip by). Consider the following example:

<begin>I like to eat cake.<end>

In the case where step is 5 and the context length is 5, we would expect the following characters per sample (untokenized here):

["I", " ", "l", "i", "k"]
and the next one (skipping 5):
["e", "a", "t", " ", "c"]

X and y are the output of this data sampling function. X will be the input to the model, and y is the last character, the character we want the model to predict. Unlike some other prediction models like transformers, with LSTMs we can only predict data sequentially. The context is fed into the model to result in an output token, which we can train for using categorical losses like cross entropy.

Finally, we converted the lists to numpy arrays and saved the data to a pickle file for further use. Now let's train our model!

Keras

Keras is a deep learning framework that is relatively simplistic and yet offers incredible functionality. To get started with the code for this section, make sure you have the required libraries (use pip to install):

pip install keras tensorflow tqdm pandas

And in a new file,

# train_poetry_keras.py
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
import pickle
from tokenizer import num_classes, max_len

X, y = pickle.load(open('poetry_data.pkl', 'rb'))

y = to_categorical(
    y,
    num_classes = num_classes
)

print(X[0].shape)
print(y[0].shape)
"""
(878835, 256) (878835, 106)
"""

As you can see, the data is in the form (num_text_inputs,input_size) and (num_text_inputs,num_classes), which is as we expected. to_categorical one-hot-encodes the token into an array of size num_classes, where every item in the array except for the token index is 0; the token index will have the value one. For example if the token is "24", the 24th indexed value will be "1". This array is expected by the Embedding layer (next code block) and that layer's output will correspond to the embedding vector for this token. If you're familiar with the dot product of a 1d matrix with a 2d matrix, this operation will make more sense.

Now let's write the Keras model:

# Define the LSTM model
model = Sequential()
model.add(Embedding(num_classes, 100, input_length=max_len))
model.add(LSTM(100))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

filepath="checkpoint-{epoch:02d}.keras"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1)

history = model.fit(X, y, epochs=30, batch_size=128, verbose=1, callbacks=[checkpoint])
print(history)
model.save('poetry_gen.keras')  # saves the model for testing

There's really not much to it, Keras is very powerful and easy to use. Sequential() allows us to write a model with multiple parts, which run in sequence during training and inference. Here, we start with an embedding layer that take in our tokens as input and outputs a 100-dimensional embeddings for each token. These embeddings are weights that are updated during training.

From there, the embeddings go into an LSTM layer that expects embeddings of size 100. The last of the model's layers is a "Dense" layer that outputs logits, or the model's predictions for what the token should be (of all the possible tokens). The most positive logit is the one that is most likely, and the most negative is the token that is least likely given the sequence.

Take the sequence "I like to rea". The most likely next character is "d", for "I like to read". This token would have a relatively high logit. We would not expect the next character to be "z", as that doesn't make any word.

model.fit() runs an entire training process with 30 epochs and a batch size of 128. 5p epochs means that the model will train over the entire dataset 30 times, and the batch size is how many sequences are fed into the model per training step. A higher batch size can lead to better generalization/text generation.

Let's see what loss we get when we run this. Training may take a while if you'd like to do this yourself:

Epoch 1/30
6866/6866 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.3546 - loss: 2.3533
Epoch 1: saving model to checkpoint-01.keras
6866/6866 ━━━━━━━━━━━━━━━━━━━━ 131s 18ms/step - accuracy: 0.3546 - loss: 2.3533
Epoch 2/30
6864/6866 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.4457 - loss: 1.9177
Epoch 2: saving model to checkpoint-02.keras
6866/6866 ━━━━━━━━━━━━━━━━━━━━ 140s 18ms/step - accuracy: 0.4458 - loss: 1.9177
...
Epoch 30/30
6865/6866 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.5324 - loss: 1.5762
Epoch 30: saving model to checkpoint-30.keras
6866/6866 ━━━━━━━━━━━━━━━━━━━━ 143s 19ms/step - accuracy: 0.5324 - loss: 1.5762

Keras Model Text Generation

Now we're ready to generate some text. For the sake of brevity, here's the code:

import json
import numpy as np
from tensorflow.keras.saving import load_model
tokenizer = json.load(open('tokenizer.json'))
print(tokenizer)

tokenizer_encoder, tokenizer_decoder = tokenizer
tokenizer_decoder = {int(i): x for i,x in tokenizer_decoder.items()}
begin_idx = tokenizer_encoder['<begin>']
end_idx = tokenizer_encoder['<end>']
num_classes = len(tokenizer_encoder)
max_len = 256


def pad_sequences(val, to_len):
    # custom - pad and return np array
    dif = to_len - len(val)
    if dif > 0:
        val = [tokenizer_encoder['<pad>']] * dif + val
    return val


def to_tokens(string):
    result = [tokenizer_encoder[char]
              if char in tokenizer_encoder and tokenizer_encoder[char] != '<pad>'
              else tokenizer_encoder['<unk>'] for char in list(string)]
    return result


def generate_text(seed_text, next_words, max_sequence_len, temperature=1.0):
    result = seed_text
    token_list = to_tokens(seed_text)
    token_list = pad_sequences(token_list, max_sequence_len)

    for _ in range(next_words):
        predict_x = model.predict(np.array([token_list]), verbose=0)[0]


        if temperature == 0.0:
            predicted = np.argmax(predict_x)
        else:
            predict_x = np.asarray(predict_x).astype('float64')
            predict_x = np.log(predict_x + 1e-7) / temperature
            exp_preds = np.exp(predict_x)
            predict_x = exp_preds / np.sum(exp_preds)
            predicted = np.random.choice(len(predict_x), p=predict_x)

        if predicted in tokenizer_decoder:
            result += tokenizer_decoder[predicted]
            print(token_list)
            token_list.pop(0)
            token_list.append(predicted)

    return result


model = load_model("checkpoint-30.keras")

# Example usage
seed_text = input('?')
generated_text = generate_text(seed_text, 500, max_len, temperature=0.5)
print(generated_text)

It generates 500 characters max with a temperature of 0.5 to give randomness to each output (so that output isn't the same for each input).

Here's what I got for a certain input:


Input: I am the hand beyond time
Output: 
I am the hand beyond time pastly quiet of the loold
of the story of the dead
and the silent lands on the ance the partance.   
The water sall and the way the picks of the ember
the hard and tenth the seen the soul watched or in the landles,
at the world to be bear like the plant gone,
a should cloud and something of the poets and seem at the month
its mouth the book of the end in the hood and his sat only in the room,
like a word the black corner.

PyTorch

We can train our model with PyTorch instead of Keras. PyTorch offers more flxibility when it comes to defining our model and how it is trained. In PyTorch, we write out the entire training loop.

First, make sure that the required libraries are installed:

pip install torch

Now for the code:
# train_poetry_pytorch.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tokenizer import num_classes
import pickle

# define the device to run the code on
if torch.cuda.is_available():
    device = torch.device("gpu")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

This first part checks if our system has a gpu or mps (Apple Silicon chip) to run the code on, which would be fatser than with native cpu.

Now, data loading:

...
# Convert data to PyTorch DataLoader
class PoemDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return torch.tensor(self.X[idx]), torch.tensor(self.y[idx])


X, y = pickle.load(open('poetry_data.pkl', 'rb'))
train_dataset = PoemDataset(X, y)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

PyTorch requires the use of DataLoaders, which loads data from a Dataset class when it is needed. This also handles data batching. Let's look at one of the samples:

...
example = next(iter(train_loader))
print(example[0].shape)
# torch.Size([128, 256]) # batch size is 128, each sample has 256 tokens

Now that we've handled making the DataLoader, let's define our LSTM model:

...
class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, lstm_hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(num_classes, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_hidden_dim, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out[:, -1, :]
        output = self.fc(lstm_out)
        return output


embedding_dim = 100
hidden_dim = 100
output_dim = num_classes

model = LSTMModel(embedding_dim, hidden_dim, output_dim).to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
epochs = 30

You can see that the model has an embedding layer like with the keras model, which also depends on the number of classes and the embedding dimension (which is 100). Then, the embeddings are passed through the LSTM layer. Finally, the output is run through a linear layer with an output dimension of num_classes. In this way, we'll get a logit (prediction value) for each token. Remember, the higher the logit, the more likely that token is. Please refer to the Keras section for a more in-depth explanation of some of these concepts.

The criterion is our loss function, which in this case is CrossntropyLoss, the default for classification problems in PyTorch

Finally, let's define the training loop:

...
for epoch in range(epochs):
    # Training
    model.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0

    for batch_idx, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * inputs.size(0)
        _, predicted = torch.max(outputs, 1)
        train_correct += (predicted == labels).sum().item()
        train_total += labels.size(0)

        # Print training loss and accuracy for each batch
        if batch_idx % 100 == 9:  # Print every 10 batches
            train_loss_avg = train_loss / train_total
            train_acc_avg = train_correct / train_total
            print(f'Epoch [{epoch+1}/{epochs}], Step [{batch_idx+1}/{len(train_loader)}], Train Loss: {train_loss_avg:.4f}, Train Acc: {train_acc_avg:.4f}')
            torch.save(model.state_dict(), f'checkpoint-{epoch+1}.pth')

# Save the model
torch.save(model.state_dict(), 'poem_gen.pth')

The model will train for 30 epochs, backpropagating the loss from the criterion to the model weights. Finally the model is saved to "poem_gen.pth". Let's see what we get when we run this:

Epoch [1/30], Step [10/6866], Train Loss: 4.5884, Train Acc: 0.0758
Epoch [1/30], Step [110/6866], Train Loss: 3.1926, Train Acc: 0.2158
Epoch [1/30], Step [210/6866], Train Loss: 2.9111, Train Acc: 0.2592
Epoch [1/30], Step [310/6866], Train Loss: 2.7549, Train Acc: 0.2842
Epoch [1/30], Step [410/6866], Train Loss: 2.6524, Train Acc: 0.3007
Epoch [1/30], Step [510/6866], Train Loss: 2.5862, Train Acc: 0.3128
Epoch [1/30], Step [610/6866], Train Loss: 2.5300, Train Acc: 0.3240
...
Epoch [30/30], Step [6866/6866], Train Loss: 1.5734, Train Acc: 0.5354

So it learned something... let's see what!

PyTorch Model Text Generation

import json
import numpy as np
import torch.nn as nn
import torch
from tqdm import tqdm
tokenizer = json.load(open('tokenizer.json'))
print(tokenizer)

tokenizer_encoder, tokenizer_decoder = tokenizer
tokenizer_decoder = {int(i): x for i,x in tokenizer_decoder.items()}
begin_idx = tokenizer_encoder['<begin>']
end_idx = tokenizer_encoder['<end>']
num_classes = len(tokenizer_encoder)
max_len = 256


def pad_sequences(val, to_len):
    # custom - pad and return np array
    dif = to_len - len(val)
    if dif > 0:
        val = [tokenizer_encoder['<pad>']] * dif + val
    return val


def to_tokens(string):
    result = [tokenizer_encoder[char]
              if char in tokenizer_encoder and tokenizer_encoder[char] != '<pad>'
              else tokenizer_encoder['<unk>'] for char in list(string)]
    return result


def generate_text(seed_text, next_words, max_sequence_len, temperature=1.0):
    result = seed_text
    token_list = [begin_idx] + to_tokens(seed_text)
    token_list = pad_sequences(token_list, max_sequence_len)

    for _ in tqdm(range(next_words)):
        predict_x = model(torch.tensor([token_list]).to(device))[0]
        predict_x = torch.softmax(predict_x, dim=0)
        predict_x = predict_x.cpu().detach().numpy().astype('float64')

        if temperature == 0.0:
            predicted = np.argmax(predict_x)
        else:
            predict_x = np.log(predict_x + 1e-7) / temperature
            exp_preds = np.exp(predict_x)
            predict_x = exp_preds / np.sum(exp_preds)
            predicted = np.random.choice(len(predict_x), p=predict_x)

        if predicted in tokenizer_decoder:
            result += tokenizer_decoder[predicted]
            print(token_list)
            token_list.pop(0)
            token_list.append(predicted)

    return result


class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, lstm_hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(num_classes, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_hidden_dim, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out[:, -1, :]
        output = self.fc(lstm_out)
        return output


if torch.cuda.is_available():
    device = torch.device("gpu")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
embedding_dim = 100
hidden_dim = 100
output_dim = num_classes
model = LSTMModel(embedding_dim, hidden_dim, output_dim).to(device)
model.load_state_dict(torch.load('poem_gen.pth'))
model.eval()
torch.no_grad()
# Example usage
seed_text = input('?')

generated_text = generate_text(seed_text, 500, max_len, temperature=0.5)
print(generated_text)

The above is the code for the text generation; it will by default generate 500 characters. Let's see what we get:


Input: I am the hand beyond time
Output: I am the hand beyond time of the stone of the most things. I say to be the words of the colum and shadows the book of the all the speech of more thy hands there,
the back the flashing and it is the part to the phruse of the sun where the simple of the small believe the clear and love and surfled and stell the town of pray of the things
and stars, the repless to the moutical book of the sun a many house hand,
and she sailors than a thouses that was a man.

fieryflamingo