Recurrent Neural Networks (RNN) and LSTM¶
Recurrent Neural Networks (RNN) are a type of neural network designed specifically to handle sequential data, such as time series or sentences. Unlike traditional neural networks that consider only the current input, RNNs remember information from previous inputs using something called a hidden state. This makes them useful for tasks like language modeling, speech recognition, and time series prediction.
In a basic RNN, at each step in a sequence, the network takes the current input and combines it with the hidden state from the previous step to produce an output and a new hidden state. This process repeats for every item in the sequence.
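To make this concrete, here is a minimal NumPy sketch of one RNN step. The dimensions and random weights are illustrative placeholders, not taken from any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4

W_x = rng.standard_normal((hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # The new hidden state combines the current input with the previous hidden state.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)  # initial hidden state
sequence = [rng.standard_normal(input_dim) for _ in range(5)]
for x_t in sequence:
    h = rnn_step(x_t, h)  # the same weights are reused at every step

print(h.shape)  # (4,)
```

Note that the same `W_x`, `W_h`, and `b` are applied at every step; only the hidden state changes as the sequence is consumed.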
However, basic RNNs struggle with long sequences: during training, the gradients that flow backward through many steps shrink toward zero, so the network fails to learn dependencies on early inputs and effectively forgets them. This is known as the vanishing gradient problem.
To solve this, we use a more advanced version of RNNs called Long Short-Term Memory networks, or LSTMs. LSTMs have a more complex internal structure with special units called gates. These gates decide what information should be kept, what should be forgotten, and what should be output. This allows LSTMs to remember important information for longer periods, making them much better for longer sequences.
There are three main gates in an LSTM:
- Forget gate: Decides what information to discard from the previous cell state.
- Input gate: Decides what new information to store in the cell state.
- Output gate: Decides what to output from the cell.
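The three gates can be sketched in plain NumPy as a single-step LSTM cell. All weights and dimensions below are illustrative placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 4

def init(shape):
    return rng.standard_normal(shape) * 0.1

# One weight matrix per gate, acting on [h_prev, x_t] concatenated.
W_f, W_i, W_o, W_c = (init((hidden_dim, hidden_dim + input_dim)) for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: what to discard from c_prev
    i = sigmoid(W_i @ z + b_i)        # input gate: what new information to store
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate cell contents
    c = f * c_prev + i * c_tilde      # updated cell state
    o = sigmoid(W_o @ z + b_o)        # output gate: what to expose
    h = o * np.tanh(c)                # new hidden state
    return h, c

h = c = np.zeros(hidden_dim)
for x_t in (rng.standard_normal(input_dim) for _ in range(5)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)  # (4,) (4,)
```

The key difference from the basic RNN step is the cell state `c`, which is updated additively (`f * c_prev + i * c_tilde`) rather than squashed through a nonlinearity at every step, which is what lets information survive across many timesteps.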
Now let’s talk about Bidirectional LSTMs. In a standard LSTM, the input is processed from the start of the sequence to the end. But sometimes, it is helpful to look at the sequence in both directions. Bidirectional LSTMs do this by running two LSTMs at the same time: one from the start to the end and the other from the end to the start. The outputs from both directions are then combined. This gives the network a better understanding of the context in both directions. It is especially useful in natural language processing where the meaning of a word can depend on both its left and right context.
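The bidirectional idea can be illustrated with a plain tanh RNN (random placeholder weights, tiny dimensions): run one pass forward and one pass backward over the same inputs, then concatenate the per-step outputs, which doubles the feature dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
input_dim, hidden_dim, steps = 3, 4, 5

def init(shape):
    return rng.standard_normal(shape) * 0.1

def run_rnn(xs, W_x, W_h):
    # Simple tanh RNN that returns the hidden state at every step.
    h = np.zeros(hidden_dim)
    outs = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        outs.append(h)
    return np.stack(outs)  # (steps, hidden_dim)

xs = [rng.standard_normal(input_dim) for _ in range(steps)]

# Two independent RNNs: one reads the sequence left-to-right,
# the other right-to-left (its outputs are re-aligned by reversing).
fwd = run_rnn(xs, init((hidden_dim, input_dim)), init((hidden_dim, hidden_dim)))
bwd = run_rnn(xs[::-1], init((hidden_dim, input_dim)), init((hidden_dim, hidden_dim)))[::-1]

combined = np.concatenate([fwd, bwd], axis=-1)
print(combined.shape)  # (5, 8) -- the feature dimension doubles
```

This is also why the `Bidirectional` wrapper used later in this notebook reports an output feature size of 128 for a 64-unit LSTM.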
Finally, let’s look at the encoder-decoder architecture, which is common in tasks like translating a sentence from one language to another. The encoder is an LSTM (or another type of RNN) that reads the entire input sequence and compresses it into a single context vector. This vector summarizes the input. The decoder is another LSTM that takes this context vector and produces an output sequence. For example, it could take a French sentence and produce its English translation. This architecture works well for many sequence-to-sequence tasks.
Key takeaways:
- RNNs are good for sequences but forget early information.
- LSTMs fix this with gates that help retain important information.
- Bidirectional LSTMs process sequences in both directions to get better context.
- Encoder-decoder structures are used to convert one sequence into another, like translating languages.
LSTM, Bidirectional LSTM, and Encoder-Decoder Model¶
Below we build and train three models on a toy sequence-reversal task:
- Unidirectional LSTM
- Bidirectional LSTM
- Encoder-Decoder architecture
We'll use a simple toy dataset of integer sequences and their reversed versions to demonstrate.
Step 1: Import Required Libraries¶
import numpy as np
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, Dense, Input, Bidirectional, RepeatVector, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
Step 2: Generate Toy Data¶
We create sequences of random integers between 1 and 9, and their reversed versions. For example, input [3, 7, 2] would have output [2, 7, 3]. This helps simulate sequence-to-sequence tasks.
def create_sequence_pairs(n_samples=1000, max_len=5):
    X, y = [], []
    for _ in range(n_samples):
        seq_len = np.random.randint(1, max_len + 1)
        seq = np.random.randint(1, 10, size=seq_len).tolist()
        X.append(seq)
        y.append(seq[::-1])
    return X, y
X_raw, y_raw = create_sequence_pairs()
Step 3: Preprocess with Padding and Reshaping¶
pad_sequences ensures all sequences are of equal length, which is required so they can be stacked into a single tensor for batching. expand_dims adds a third dimension to make the shape compatible with LSTM, which expects (samples, timesteps, features).
X_pad = pad_sequences(X_raw, maxlen=5, padding='post')
y_pad = pad_sequences(y_raw, maxlen=5, padding='post')
X_pad = np.expand_dims(X_pad, axis=-1)
y_pad = np.expand_dims(y_pad, axis=-1)
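For readers who want to see what this preprocessing produces, here is a simplified pure-NumPy sketch of post-padding. It mimics pad_sequences(..., padding='post') for sequences no longer than maxlen; Keras also handles truncation of longer sequences, which is omitted here:

```python
import numpy as np

def post_pad(seqs, maxlen):
    # Zero-pad each sequence at the end, up to maxlen
    # (simplified stand-in for pad_sequences(..., padding='post')).
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s
    return out

X_raw = [[3, 7, 2], [5], [1, 2, 3, 4, 5]]
X_pad = post_pad(X_raw, maxlen=5)
print(X_pad)
# [[3 7 2 0 0]
#  [5 0 0 0 0]
#  [1 2 3 4 5]]

X_pad = np.expand_dims(X_pad, axis=-1)  # add the feature axis the LSTM expects
print(X_pad.shape)  # (3, 5, 1)
```

Padding with zeros is safe here because the real data uses integers 1 through 9, so 0 never collides with a genuine token.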
Step 4: Split Data¶
Split the dataset into training and testing sets using train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X_pad, y_pad, test_size=0.2, random_state=42)
Step 5: Unidirectional LSTM Model¶
This model learns to predict the reversed sequence using a standard LSTM layer. return_sequences=True makes the layer emit its hidden state at every timestep rather than only the final one, so the model outputs a full sequence.
The TimeDistributed(Dense(1)) layer applies a Dense layer to each timestep.
model = Sequential()
model.add(Input(shape=(5, 1)))  # declare the input shape explicitly (preferred over input_shape=)
model.add(LSTM(64, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
model.summary()
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm (LSTM)                     │ (None, 5, 64)          │        16,896 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ time_distributed                │ (None, 5, 1)           │            65 │
│ (TimeDistributed)               │                        │               │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 16,961 (66.25 KB)
Trainable params: 16,961 (66.25 KB)
Non-trainable params: 0 (0.00 B)
Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - loss: 19.6335 - val_loss: 14.6432
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 14.6101 - val_loss: 10.9644
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 11.4844 - val_loss: 8.4385
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 8.8404 - val_loss: 7.0109
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 7.3830 - val_loss: 6.2394
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 6.5517 - val_loss: 5.7904
Epoch 7/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 6.1154 - val_loss: 5.4942
Epoch 8/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 5.7620 - val_loss: 5.2305
Epoch 9/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 5.4723 - val_loss: 5.0405
Epoch 10/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 5.3581 - val_loss: 4.8403
<keras.src.callbacks.history.History at 0x34d8d1850>
Step 6: Bidirectional LSTM¶
Bidirectional LSTM improves understanding by processing the sequence both forward and backward. This gives the model context from both past and future.
model_bi = Sequential()
model_bi.add(Input(shape=(5, 1)))  # declare the input shape explicitly (preferred over input_shape=)
model_bi.add(Bidirectional(LSTM(64, activation='relu', return_sequences=True)))
model_bi.add(TimeDistributed(Dense(1)))
model_bi.compile(optimizer='adam', loss='mse')
model_bi.summary()
model_bi.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ bidirectional (Bidirectional)   │ (None, 5, 128)         │        33,792 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ time_distributed_1              │ (None, 5, 1)           │           129 │
│ (TimeDistributed)               │                        │               │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 33,921 (132.50 KB)
Trainable params: 33,921 (132.50 KB)
Non-trainable params: 0 (0.00 B)
Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - loss: 15.1776 - val_loss: 8.0639
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 7.7579 - val_loss: 5.6367
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 5.5474 - val_loss: 4.6511
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 4.5240 - val_loss: 4.1445
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 4.4418 - val_loss: 3.8924
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 3.7937 - val_loss: 3.7824
Epoch 7/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 3.6953 - val_loss: 3.6852
Epoch 8/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 3.5451 - val_loss: 3.5972
Epoch 9/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 3.6024 - val_loss: 3.5251
Epoch 10/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 3.4208 - val_loss: 3.4395
<keras.src.callbacks.history.History at 0x34da65a00>
Step 7: Encoder-Decoder Architecture¶
This setup is commonly used for tasks like translation. The encoder processes the input and summarizes it into a context vector. The decoder then uses this vector to generate the output sequence.
RepeatVector copies the encoder's context vector once per output timestep so the decoder sees it at every step. Passing initial_state=encoder_states additionally initializes the decoder's hidden and cell states with the encoder's final states, so the decoder starts from the same context.
encoder_inputs = Input(shape=(5, 1))
encoder = LSTM(64, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]
decoder_inputs = RepeatVector(5)(encoder_outputs)
decoder_lstm = LSTM(64, return_sequences=True)
decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = TimeDistributed(Dense(1))
outputs = decoder_dense(decoder_outputs)
model_seq2seq = Model(encoder_inputs, outputs)
model_seq2seq.compile(optimizer='adam', loss='mse')
model_seq2seq.summary()
model_seq2seq.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
Model: "functional_4"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_2       │ (None, 5, 1)      │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lstm_2 (LSTM)       │ [(None, 64),      │     16,896 │ input_layer_2[0]… │
│                     │ (None, 64),       │            │                   │
│                     │ (None, 64)]       │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ repeat_vector       │ (None, 5, 64)     │          0 │ lstm_2[0][0]      │
│ (RepeatVector)      │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ lstm_3 (LSTM)       │ (None, 5, 64)     │     33,024 │ repeat_vector[0]… │
│                     │                   │            │ lstm_2[0][1],     │
│                     │                   │            │ lstm_2[0][2]      │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ time_distributed_2  │ (None, 5, 1)      │         65 │ lstm_3[0][0]      │
│ (TimeDistributed)   │                   │            │                   │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘
Total params: 49,985 (195.25 KB)
Trainable params: 49,985 (195.25 KB)
Non-trainable params: 0 (0.00 B)
Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 13.1608 - val_loss: 6.6521
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 6.2348 - val_loss: 4.8730
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 4.9934 - val_loss: 4.4931
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 4.4518 - val_loss: 4.3896
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 4.6612 - val_loss: 4.3684
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 4.3126 - val_loss: 4.2029
Epoch 7/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 4.2549 - val_loss: 4.1511
Epoch 8/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 4.0384 - val_loss: 4.0195
Epoch 9/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 4.1627 - val_loss: 3.9056
Epoch 10/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 3.8977 - val_loss: 3.7855
<keras.src.callbacks.history.History at 0x34da91250>