Last Updated on March 22, 2023

Long Short-Term Memory (LSTM) is a structure that can be used in neural networks. It is a type of recurrent neural network (RNN) that expects the input in the form of a sequence of features. It is useful for data such as time series or strings of text. In this post, you will learn about LSTM networks. In particular,

- What LSTM is and how it is different from other layers
- How to develop an LSTM network for time series prediction
- How to train an LSTM network

Let’s get started.

## Overview

This post is divided into three parts; they are:

- Overview of LSTM Network
- LSTM for Time Series Prediction
- Training and Verifying Your LSTM Network

## Overview of LSTM Network

An LSTM cell is a building block that you can use to build a larger neural network. While a common building block such as a fully-connected layer is merely a matrix multiplication of the weight tensor and the input to produce an output tensor, the LSTM module is much more complex.

A typical LSTM cell is illustrated as follows

It takes one time step of an input tensor $x$ as well as a cell memory $c$ and a hidden state $h$. The cell memory and hidden state can be initialized to zero at the beginning. Then, within the LSTM cell, $x$, $c$, and $h$ are multiplied by separate weight tensors and passed through some activation functions a few times. The result is the updated cell memory and hidden state. These updated $c$ and $h$ are used on the **next time step** of the input tensor. After the last time step, the output of the LSTM cell is its cell memory and hidden state.

Specifically, the equations of one LSTM cell are as follows:

$$
\begin{aligned}
f_t &= \sigma_g(W_{f} x_t + U_{f} h_{t-1} + b_f) \\
i_t &= \sigma_g(W_{i} x_t + U_{i} h_{t-1} + b_i) \\
o_t &= \sigma_g(W_{o} x_t + U_{o} h_{t-1} + b_o) \\
\tilde{c}_t &= \sigma_c(W_{c} x_t + U_{c} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \sigma_h(c_t)
\end{aligned}
$$

where $W$, $U$, and $b$ are trainable parameters of the LSTM cell. Each equation above is computed at every time step, hence the subscript $t$. These trainable parameters are **reused** for all time steps. This parameter sharing is what gives the LSTM its memory power.

Note that the above is only one design of the LSTM. There are multiple variations in the literature.
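For the particular design written above, one time step can be sketched in plain Python as follows. This is only an illustration, not how PyTorch implements it: it assumes $\sigma_g$ is the logistic sigmoid and that $\sigma_c$ and $\sigma_h$ are both $\tanh$ (a common choice, but an assumption here), and the names `lstm_step`, `run_sequence`, and `params` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, assumed here for the gate activation sigma_g."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b with plain lists (one row per hidden unit)."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One time step of the equations above; params maps gate name -> (W, U, b)."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]          # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]          # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]          # output gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # candidate memory
    # New cell memory: keep part of the old memory, add part of the candidate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    # New hidden state: gated, squashed view of the cell memory
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_sequence(xs, params, hidden_size):
    """Feed a sequence one step at a time, reusing the same params at every step."""
    h = [0.0] * hidden_size  # hidden state initialized to zero
    c = [0.0] * hidden_size  # cell memory initialized to zero
    for x in xs:
        h, c = lstm_step(x, h, c, params)
    return h, c


# Made-up parameters for 1 input feature and 2 hidden units: (W, U, b) per gate
gate = ([[0.5], [-0.5]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])
params = {"f": gate, "i": gate, "o": gate, "c": gate}
h, c = run_sequence([[1.0], [2.0], [3.0]], params, hidden_size=2)
print(h, c)
```

Note that `run_sequence()` passes the same `params` to every call of `lstm_step()`; only `h` and `c` change from one time step to the next.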

Since the LSTM cell expects the input $x$ in the form of multiple time steps, each input sample should be a 2D tensor: one dimension for time and another dimension for features. The power of an LSTM cell depends on the size of the hidden state or cell memory, which usually has a larger dimension than the number of features in the input.
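To give a sense of how the hidden size drives the size of the cell, the trainable parameters in the equations above can be counted. The symbols $d$ (number of input features) and $n$ (size of the hidden state) are introduced here only for illustration. Each of the four equations that carry weights has a $W$ of shape $n \times d$, a $U$ of shape $n \times n$, and a $b$ of length $n$:

$$
\underbrace{4}_{\text{equations}} \times \bigl( \underbrace{nd}_{W} + \underbrace{n^2}_{U} + \underbrace{n}_{b} \bigr) = 4n(d + n + 1)
$$

For example, with $d = 1$ and $n = 50$, this gives $4 \times 50 \times 52 = 10{,}400$ parameters. Implementations can differ slightly from this count: PyTorch's `nn.LSTM`, for instance, keeps two bias vectors per gate, which adds another $4n$ parameters.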


## LSTM for Time Series Prediction

Let’s see how LSTM can be used to build a time series prediction neural network with an example.

The problem you will look at in this post is the international airline passengers prediction problem. This is a problem where, given a year and a month, the task is to predict the number of international airline passengers in units of 1,000. The data ranges from January 1949 to December 1960, or 12 years, with 144 observations.

It is a regression problem. That is, given the number of passengers (in units of 1,000) in recent months, predict the number of passengers in the next month. The dataset has only one feature: the number of passengers.

Let’s start by reading the data. The data can be downloaded here.

Save this file as `airline-passengers.csv` in the local directory for the following.

Below is a sample of the first few lines of the file:

```
"Month","Passengers"
"1949-01",112
"1949-02",118
"1949-03",132
"1949-04",129
```

The data has two columns: the month and the number of passengers. Since the data are arranged in chronological order, you can take only the number of passengers to make a single-feature time series. Below, you will use the pandas library to read the CSV file and convert it into a 2D numpy array, then plot it using matplotlib:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('airline-passengers.csv')
timeseries = df[["Passengers"]].values.astype('float32')

plt.plot(timeseries)
plt.show()
```

This time series has 144 time steps. You can see from the plot that there is an upward trend. There is also some periodicity in the dataset that corresponds to the summer holiday period in the northern hemisphere. Usually a time series should be “detrended” to remove the linear trend component and normalized before processing. For simplicity, these steps are skipped in this project.
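Although detrending and normalization are skipped in this post, a minimal sketch of what they could look like is given below, in plain Python. It assumes the trend is a straight line fitted by least squares and that normalization means rescaling to zero mean and unit standard deviation; the function names `detrend` and `standardize` are hypothetical. In practice you would normally fit the line and the scaling on the training portion only.

```python
def detrend(series):
    """Subtract the least-squares straight line from a list of numbers."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    # Slope and intercept of the ordinary least-squares fit y = slope * t + intercept
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return [y - (slope * t + intercept) for t, y in enumerate(series)]


def standardize(series):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = (sum((y - mean) ** 2 for y in series) / n) ** 0.5
    return [(y - mean) / std for y in series]


# The first four passenger counts shown in the file sample above
sample = [112.0, 118.0, 132.0, 129.0]
residual = detrend(sample)      # what is left after removing the fitted line
scaled = standardize(residual)  # same shape, rescaled
print(residual)
print(scaled)
```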

To demonstrate the predictive power of our model, the time series is split into training and test sets. Unlike other datasets, time series data are usually split without shuffling. That is, the training set is the first two-thirds of the time series and the remainder is used as the test set. This can be easily done on a numpy array:

```python
# train-test split for time series
train_size = int(len(timeseries) * 0.67)
test_size = len(timeseries) - train_size
train, test = timeseries[:train_size], timeseries[train_size:]
```

The more complicated question is how you want the network to predict the time series. Usually, time series prediction is done on a window. That is, given data from time $t-w$ to time $t$, you are asked to predict for time $t+1$ (or deeper into the future). The size of the window $w$ governs how much data you are allowed to look at when you make the prediction. This is also called the **look back period**.

On a long enough time series, multiple overlapping windows can be created. It is convenient to create a function to generate a dataset of fixed-size windows from a time series. Since the data is going to be used in a PyTorch model, the output dataset should be in PyTorch tensors:

```python
import torch

def create_dataset(dataset, lookback):
    """Transform a time series into a prediction dataset

    Args:
        dataset: A numpy array of time series, first dimension is the time steps
        lookback: Size of window for prediction
    """
    X, y = [], []
    for i in range(len(dataset)-lookback):
        feature = dataset[i:i+lookback]
        target = dataset[i+1:i+lookback+1]
        X.append(feature)
        y.append(target)
    return torch.tensor(X), torch.tensor(y)
```

This function applies a sliding window to the time series, assuming the prediction is for one time step into the immediate future. It converts a time series into a tensor of dimensions (window sample, time steps, features). A time series of $L$ time steps produces $L - w$ windows for a window of size $w$, which is roughly $L$ when $w$ is small (a window can start at any time step as long as the window and its one-step-shifted target stay within the time series). Within one window, there are multiple consecutive time steps of values. In each time step, there can be multiple features. In this dataset, there is only one.

The “feature” and the “target” are intentionally given the same shape: for a window of three time steps, the “feature” is the time series from $t$ to $t+2$ and the “target” is from $t+1$ to $t+3$. What we are interested in is $t+3$, but the information from $t+1$ to $t+2$ is useful in training.
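The pairing of feature and target windows is easier to see on a tiny made-up series. The sketch below repeats the same slicing logic on a plain Python list (the name `make_windows` is hypothetical, and no tensors are involved) and prints each pair:

```python
def make_windows(series, lookback):
    """Pair each window of `lookback` values with the same window shifted by one step."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])          # values at t .. t+lookback-1
        y.append(series[i + 1:i + lookback + 1])  # values at t+1 .. t+lookback
    return X, y


series = [10, 20, 30, 40, 50, 60]
X, y = make_windows(series, lookback=3)
for features, targets in zip(X, y):
    print(features, "->", targets)
# [10, 20, 30] -> [20, 30, 40]
# [20, 30, 40] -> [30, 40, 50]
# [30, 40, 50] -> [40, 50, 60]
```

Six time steps with a window of three give three samples, and only the last element of each target is a value the model has not already seen in its input.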

Note that the input time series is a 2D array and the output from the `create_dataset()` function will be 3D tensors. Let’s try with `lookback=1`. You can verify the shape of the output tensors as follows:

```python
lookback = 1
X_train, y_train = create_dataset(train, lookback=lookback)
X_test, y_test = create_dataset(test, lookback=lookback)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

You should see:

```
torch.Size([95, 1, 1]) torch.Size([95, 1, 1])
torch.Size([47, 1, 1]) torch.Size([47, 1, 1])
```

Now you can build the LSTM model to predict the time series. With `lookback=1`, the accuracy will almost surely be poor because there are too few clues to predict from. But this is a good example to demonstrate the structure of the LSTM model.

The model is created as a class, in which an LSTM layer and a fully-connected layer are used.

```python
...
import torch.nn as nn

class AirModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=1, batch_first=True)
        self.linear = nn.Linear(50, 1)
    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.linear(x)
        return x
```

The output of `nn.LSTM()` is a tuple. The first element is the generated hidden states, one for each time step of the input. The second element is itself a tuple holding the final hidden state and cell memory of the LSTM, which is not used here.

The LSTM layer is created with the option `batch_first=True` because the tensors you prepared have the dimensions (window sample, time steps, features), and a batch is created by sampling on the first dimension.

The output of hidden states is further processed by a fully-connected layer to produce a single regression result. Since the LSTM produces one output per input time step, you can choose to pick only the last time step’s output, for which you would write:

```python
x, _ = self.lstm(x)
# extract only the last time step
x = x[:, -1, :]
x = self.linear(x)
```

and the model’s output will be the prediction of the next time step. But here, the fully-connected layer is applied to each time step. In this design, you should extract only the last time step from the model output as your prediction. However, with a window of 1, there is no difference between these two approaches.
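To see the two designs side by side, the following sketch passes a made-up random batch through an LSTM layer and a linear layer configured as in the model above, and prints the shapes at each stage. The batch size of 8 and window of 4 are arbitrary choices for illustration, and the weights are untrained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random weights repeatable

lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=1, batch_first=True)
linear = nn.Linear(50, 1)

# A made-up batch: 8 windows, 4 time steps each, 1 feature per step
x = torch.randn(8, 4, 1)

with torch.no_grad():
    out, (h_n, c_n) = lstm(x)
    per_step = linear(out)             # one prediction per time step
    last_only = linear(out[:, -1, :])  # prediction from the last time step only

print(out.shape)        # torch.Size([8, 4, 50])
print(h_n.shape)        # torch.Size([1, 8, 50])
print(per_step.shape)   # torch.Size([8, 4, 1])
print(last_only.shape)  # torch.Size([8, 1])
```

For this single-layer, one-directional LSTM, the last time step of `out` holds the same values as the final hidden state `h_n`, and the last time step of `per_step` matches `last_only`.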

## Training and Verifying Your LSTM Network

Because it is a regression problem, MSE is chosen as the loss function, which is to be minimized by the Adam optimizer. In the code below, the PyTorch tensors are combined into a dataset using `torch.utils.data.TensorDataset()` and batches for training are provided by a `DataLoader`. The model performance is evaluated once every 100 epochs, on both the training set and the test set:

```python
import numpy as np
import torch.optim as optim
import torch.utils.data as data

model = AirModel()
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(X_train, y_train), shuffle=True, batch_size=8)

n_epochs = 2000
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    if epoch % 100 != 0:
        continue
    model.eval()
    with torch.no_grad():
        y_pred = model(X_train)
        train_rmse = np.sqrt(loss_fn(y_pred, y_train))
        y_pred = model(X_test)
        test_rmse = np.sqrt(loss_fn(y_pred, y_test))
    print("Epoch %d: train RMSE %.4f, test RMSE %.4f" % (epoch, train_rmse, test_rmse))
```

As the dataset is small, the model should be trained long enough to learn the pattern. Over these 2000 epochs of training, you should see the RMSE on both the training set and the test set decreasing:

```
Epoch 0: train RMSE 225.7571, test RMSE 422.1521
Epoch 100: train RMSE 186.7353, test RMSE 381.3285
Epoch 200: train RMSE 153.3157, test RMSE 345.3290
Epoch 300: train RMSE 124.7137, test RMSE 312.8820
Epoch 400: train RMSE 101.3789, test RMSE 283.7040
Epoch 500: train RMSE 83.0900, test RMSE 257.5325
Epoch 600: train RMSE 66.6143, test RMSE 232.3288
Epoch 700: train RMSE 53.8428, test RMSE 209.1579
Epoch 800: train RMSE 44.4156, test RMSE 188.3802
Epoch 900: train RMSE 37.1839, test RMSE 170.3186
Epoch 1000: train RMSE 32.0921, test RMSE 154.4092
Epoch 1100: train RMSE 29.0402, test RMSE 141.6920
Epoch 1200: train RMSE 26.9721, test RMSE 131.0108
Epoch 1300: train RMSE 25.7398, test RMSE 123.2518
Epoch 1400: train RMSE 24.8011, test RMSE 116.7029
Epoch 1500: train RMSE 24.7705, test RMSE 112.1551
Epoch 1600: train RMSE 24.4654, test RMSE 108.1879
Epoch 1700: train RMSE 25.1378, test RMSE 105.8224
Epoch 1800: train RMSE 24.1940, test RMSE 101.4219
Epoch 1900: train RMSE 23.4605, test RMSE 100.1780
```

It is expected that the RMSE of the test set is several times larger than that of the training set. An RMSE of 100 means the prediction is typically off from the actual target by about 100 in value (i.e., 100,000 passengers in this dataset).
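To make the meaning of the RMSE figure concrete, here is a small plain-Python sketch on made-up numbers (the values below are not taken from the dataset, and `rmse` is a hypothetical helper). It also shows that RMSE is not exactly the average error: one large miss counts for more than several small ones.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length lists of numbers."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up monthly passenger counts, in thousands
actual = [400.0, 450.0, 500.0, 480.0]

# Every prediction is 100 too low
uniform = [300.0, 350.0, 400.0, 380.0]
print(rmse(uniform, actual))  # 100.0

# Three perfect predictions and a single miss of 200
uneven = [400.0, 450.0, 500.0, 280.0]
print(rmse(uneven, actual))   # 100.0
```

Both cases give an RMSE of 100, although the average absolute error in the second case is only 50.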

To better understand the prediction quality, you can plot the output using matplotlib, as follows:

```python
with torch.no_grad():
    # shift train predictions for plotting
    train_plot = np.ones_like(timeseries) * np.nan
    y_pred = model(X_train)
    y_pred = y_pred[:, -1, :]
    train_plot[lookback:train_size] = model(X_train)[:, -1, :]
    # shift test predictions for plotting
    test_plot = np.ones_like(timeseries) * np.nan
    test_plot[train_size+lookback:len(timeseries)] = model(X_test)[:, -1, :]
# plot
plt.plot(timeseries, c='b')
plt.plot(train_plot, c='r')
plt.plot(test_plot, c='g')
plt.show()
```

From the above, you take the model’s output as `y_pred` but extract only the data from the last time step as `y_pred[:, -1, :]`. This is what is plotted on the chart.
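The index arithmetic used to line up the predictions with the original series can be checked on a small made-up example. The sketch below uses plain lists and placeholder predictions instead of model output; the helper `place_predictions` and the sizes are hypothetical.

```python
def place_predictions(total_length, start, predictions):
    """Return a NaN-padded list with `predictions` placed from index `start` onward."""
    padded = [float("nan")] * total_length
    padded[start:start + len(predictions)] = predictions
    return padded


# Made-up sizes: 10 observations, the first 6 for training, windows of 2 steps
total_length, train_size, lookback = 10, 6, 2

# Each split yields (split length - lookback) windows
n_train_windows = train_size - lookback                  # 4
n_test_windows = (total_length - train_size) - lookback  # 2

# Placeholder predictions, one per window (the last time step of each output)
train_pred = [1.0] * n_train_windows
test_pred = [2.0] * n_test_windows

# The first window covers indices 0 .. lookback-1, so its last-step prediction
# belongs to index `lookback`; the test split is offset by `train_size` as well
train_plot = place_predictions(total_length, lookback, train_pred)
test_plot = place_predictions(total_length, train_size + lookback, test_pred)
print(train_plot)
print(test_plot)
```

The training predictions fill indices 2 to 5 and the test predictions fill indices 8 and 9; every other position stays NaN and is therefore left blank when plotted.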

The training set is plotted in red while the test set is plotted in green. The blue curve is what the actual data looks like. You can see that the model can fit well to the training set but not very well on the test set.

Tying it all together, below is the complete code, except that the parameter `lookback` is set to 4 this time:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

df = pd.read_csv('airline-passengers.csv')
timeseries = df[["Passengers"]].values.astype('float32')

# train-test split for time series
train_size = int(len(timeseries) * 0.67)
test_size = len(timeseries) - train_size
train, test = timeseries[:train_size], timeseries[train_size:]

def create_dataset(dataset, lookback):
    """Transform a time series into a prediction dataset

    Args:
        dataset: A numpy array of time series, first dimension is the time steps
        lookback: Size of window for prediction
    """
    X, y = [], []
    for i in range(len(dataset)-lookback):
        feature = dataset[i:i+lookback]
        target = dataset[i+1:i+lookback+1]
        X.append(feature)
        y.append(target)
    return torch.tensor(X), torch.tensor(y)

lookback = 4
X_train, y_train = create_dataset(train, lookback=lookback)
X_test, y_test = create_dataset(test, lookback=lookback)

class AirModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=50, num_layers=1, batch_first=True)
        self.linear = nn.Linear(50, 1)
    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.linear(x)
        return x

model = AirModel()
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
loader = data.DataLoader(data.TensorDataset(X_train, y_train), shuffle=True, batch_size=8)

n_epochs = 2000
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    if epoch % 100 != 0:
        continue
    model.eval()
    with torch.no_grad():
        y_pred = model(X_train)
        train_rmse = np.sqrt(loss_fn(y_pred, y_train))
        y_pred = model(X_test)
        test_rmse = np.sqrt(loss_fn(y_pred, y_test))
    print("Epoch %d: train RMSE %.4f, test RMSE %.4f" % (epoch, train_rmse, test_rmse))

with torch.no_grad():
    # shift train predictions for plotting
    train_plot = np.ones_like(timeseries) * np.nan
    y_pred = model(X_train)
    y_pred = y_pred[:, -1, :]
    train_plot[lookback:train_size] = model(X_train)[:, -1, :]
    # shift test predictions for plotting
    test_plot = np.ones_like(timeseries) * np.nan
    test_plot[train_size+lookback:len(timeseries)] = model(X_test)[:, -1, :]
# plot
plt.plot(timeseries)
plt.plot(train_plot, c='r')
plt.plot(test_plot, c='g')
plt.show()
```

Running the above code will produce the plot below. From both the RMSE measure printed and the plot, you can notice that the model can now do better on the test set.

This is also why the `create_dataset()` function is designed this way: when the model is given a time series from time $t$ to $t+3$ (as `lookback=4`), its output is the prediction for $t+1$ to $t+4$. However, $t+1$ to $t+3$ are also known from the input. By using these in the loss function, the model is effectively given more clues to train on. This design is not always suitable, but you can see it is helpful in this particular example.
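The difference between computing the loss on every time step and on the last time step only can be illustrated on a single made-up window. The numbers below are invented for the example and `mse` is a hypothetical helper; the training code in this post corresponds to the first variant.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# One made-up window with lookback=4
window_pred = [118.0, 130.0, 131.0, 125.0]    # model output for t+1 .. t+4
window_target = [118.0, 132.0, 129.0, 121.0]  # true values for t+1 .. t+4

# Loss over all four time steps: every step contributes an error term
loss_all_steps = mse(window_pred, window_target)

# Loss over the last time step only: a single error term per window
loss_last_step = mse(window_pred[-1:], window_target[-1:])

print(loss_all_steps)  # 6.0
print(loss_last_step)  # 16.0
```

With the first variant, each window contributes four error terms to the gradient instead of one, which is the sense in which the model gets more clues from the same data.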


## Summary

In this post, you discovered what LSTM is and how to use it for time series prediction in PyTorch. Specifically, you learned:

- What the international airline passenger time series prediction dataset is
- What an LSTM cell is
- How to create an LSTM network for time series prediction