Time Series Prediction with Transformers
At the latest since the advent of ChatGPT, Large Language Models (LLMs) have created a huge hype and are known even to those outside the AI community. Even though one needs to understand that LLMs inherently are "just" sequence prediction models without any form of intelligence or reasoning, the achieved results are certainly extremely impressive – with some even talking about another step in the "AI revolution".
Essential to the success of LLMs are their core building blocks, transformers. In this post, we will give a complete guide to using them in Pytorch, with particular focus on time series prediction. Thanks for stopping by, and I hope you enjoy the ride!

One could argue that all problems solved via transformers essentially are time series problems. While that is true, here we will put special focus on continuous series and data – such as predicting the spreading of diseases or forecasting the weather. The difference to the prominent application of Natural Language Processing (NLP) is simply the continuous input space – "simply" if this word is allowed in this context, since developing a model like ChatGPT and making it work naturally does require a multitude of further optimization steps and tricks – while NLP works with discrete tokens. Apart from this, however, the basic building blocks are identical.
In this post, we will start with a (short) theoretical introduction of transformers, and then move towards applying them in Pytorch. For this, we will discuss a selected example, namely predicting the sine function. We will show how to generate data for this and pre-process it correctly, and then use transformers for learning how to predict this function. Later, we will discuss how to do inference when future tokens are not available, and conclude the post by extending the example to multi-dimensional data.
Introduction to Transformers
The goal of this post is to provide a complete hands-on tutorial on how to use transformers for real-world use cases – not to theoretically introduce and explain these interesting models. For details, I'd instead like to refer to this amazing article and the original paper [1] (whose architecture we will follow throughout this post).
Still, for the sake of completeness, we'll give a (somewhat) brief walk-through of transformers. Transformers consist of an encoder and a decoder. The encoder processes an input sequence, and the decoder generates an output sequence. A prime example is machine translation, e.g. translating from German to English:

We start with a high-level overview, and then, in a second step, increase the level of detail.
Overview
Input and output sequences, as the names suggest, consist of several values or tokens. In contrast to e.g. Recurrent Neural Networks (RNNs), which process data sequentially, transformers process all tokens in parallel (one of their advantages). For this, each token is encoded / embedded into a higher dimension for further processing. Then, we add some positional encoding to this embedding – a way of identifying the order of elements. This is essential: since tokens are processed in parallel, without it the transformer would not understand the ordering of the data.
This representation is then handled by encoder and decoder – which are made of simple, repeated building blocks: self-attention and feed-forward layers. In the encoder, for each input token we attend over all other tokens in the input sequence and generate an accumulated value. We then process this via a feed-forward layer and repeat. The decoder is similar, except that we now also attend to tokens of the input sequence – meaning there are two types of attention: self-attention (decoder-decoder) and cross-attention (decoder-encoder). For the first, we additionally mask future tokens, s.t. the decoder cannot learn to "cheat" by looking at them.
Diving Deeper
Let's now repeat this explanation, but fill in a bit more detail along the way. Let's visualize the input to the encoder:

As stated above, each input token is embedded and we add positional encoding – yielding the input to the encoder.
Next, let's have a look inside a single encoder layer:

We see two blocks: self-attention and a feed-forward layer. First, via self-attention, for each token we generate a compressed representation of all the others – its context. At the core are three values: (Q)ueries, (K)eys, and (V)alues, which are obtained by multiplying each input token's embedding with learned weight matrices. Given one token's query, we use the keys of all other tokens to compute scores for them. Then, we multiply these scores with the corresponding values and aggregate.
Example: to compute the attention output for token 1, we compute the scores q1 · ki for all tokens i. We then multiply these scores with the values vi and (after some normalization, left out here) sum them up to obtain the resulting processed value z1.
And now visually:

In practice, transformers use Multi-Head Attention (MHA) – which simply means repeating the above process N times in parallel and concatenating the resulting values.
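To make this concrete, here is a minimal sketch of the attention computation in Pytorch – first a single head "by hand", then the equivalent built-in multi-head module. The dimensions are chosen arbitrarily for illustration, and the projection matrices would normally be learned:
import math
import torch

seq_len, d_model = 10, 64
x = torch.rand(seq_len, d_model)                      # embedded input tokens

# Single-head scaled dot-product attention "by hand".
w_q = torch.nn.Linear(d_model, d_model, bias=False)   # learned projections
w_k = torch.nn.Linear(d_model, d_model, bias=False)
w_v = torch.nn.Linear(d_model, d_model, bias=False)
q, k, v = w_q(x), w_k(x), w_v(x)
scores = q @ k.T / math.sqrt(d_model)                 # [seq_len, seq_len]
weights = torch.softmax(scores, dim=-1)               # attention weights per token
z = weights @ v                                       # [seq_len, d_model], the z_i values

# Multi-head attention via the built-in module (self-attention: query = key = value).
mha = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out, attn_weights = mha(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
print(z.shape, out.shape)                             # torch.Size([10, 64]) torch.Size([1, 10, 64])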
After this comes the feed-forward block, which is simply a fully connected layer that processes the output values of the previous step individually, per token.
This completes one encoder block – and we repeat this block several times (6 in the original paper).
On the decoder side, we do the same – i.e. we also stack M decoder layers made of self-attention and a feed-forward layer. As mentioned before, the differences to the encoder are:
- We also attend to all generated representations of the last encoder layer.
- We mask out values in the self-attention step, i.e. when decoding token i, we cannot attend to tokens i+1, … – in order to avoid cheating by simply copying future outputs.
Overall, the full model in simplified form can be visualized as such:

With these theoretical foundations set, let's move to using the transformer model in Pytorch. If not everything is perfectly clear to you yet, it might be worthwhile to just continue – maybe seeing things applied helps clear things up.
Transformers in Pytorch
In this section we will show how to use transformers in Pytorch – using the available transformer module. Since the model is already implemented, the main "difficulty" is pre-processing the input and output and using the transformer in the right way.
Toy Problem: Predicting the Sine Function
Let's introduce the problem we will solve: we want to train our transformer model to predict a noisy sine function. Towards the end of the post, we will then show how to extend this to multi-dimensional in- and output.
Let's create a script sine_generator.py to generate our training data and save it to disk, s.t. we can load it later (note: the full code of this post is also available in this repo):
from pathlib import Path

import numpy as np


def generate_data(data_path: Path, num_steps: int, interval: float = 0.1) -> None:
    x = np.linspace(0, num_steps * interval, num_steps)
    y = np.sin(x) + np.random.normal(0, 0.1, x.shape)
    np.savez(data_path, y=y)


generate_data("data.npz", 1000000)
We thus generate 1,000,000 data points of a noisy sine curve, which can look as follows:

The Model
Initializing the model is actually extremely easy with Pytorch – we can simply use the built-in Transformer module in our code:
self.transformer = torch.nn.Transformer(nhead=num_heads, num_encoder_layers=num_layers, num_decoder_layers=num_layers, d_model=embed_dim, batch_first=True)
As we can see, we specify the number of heads, number of encoder / decoder layers, the embed / feature dimension we will feed to the transformer, and specify that inputs / outputs start with the batch dimension.
If we wanted to customize this behavior or only use parts of the model, one could use encoder and decoder separately – but for our use case the full transformer is sufficient.
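As a quick sanity check, we can pass dummy tensors through such a transformer to see the expected input / output shapes (dimensions chosen arbitrarily for illustration):
import torch

transformer = torch.nn.Transformer(
    nhead=8, num_encoder_layers=2, num_decoder_layers=2, d_model=128, batch_first=True
)
src = torch.rand(4, 100, 128)  # [bs, src_seq_len, d_model]
tgt = torch.rand(4, 25, 128)   # [bs, tgt_seq_len, d_model]
out = transformer(src, tgt)
print(out.shape)               # torch.Size([4, 25, 128])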
Preparing the Data
As mentioned before, this and the next section will probably be the most insightful and relevant ones regarding using transformers in practice and making them "work". First, we need to preprocess and prepare our data, in particular decide what is input and output.
Let's start with the example of machine translation: in this, we have a source (src) and a target (tgt) sequence – src is the original sentence in German, tgt its English translation.
The input to the encoder is then src – that is, a sequence consisting of N tokens. Naturally, everything is batched, so the shape will be [bs, N, feat_dim] (we'll cover the feature dimension in the next section) – and you might have to pad some sentences.
We expect the decoder to generate tgt of shape [bs, M, feat_dim]. However, the input to the decoder is tgt shifted right by one – otherwise the decoder could simply pass the input token through to the output at each step. The following graphic visualizes this:

Thus, in machine translation and other NLP tasks, the decoder starts with a special token.
For our example, we could do something similar, but here we instead follow [2] and use the last token of src to begin the decoder input.
We do this in code as follows:
def split_sequence(
    sequence: np.ndarray, ratio: float = 0.8
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Splits a sequence into 2 (3) parts, as is required by our transformer
    model.

    Assume our sequence length is L, we then split this into src of length N
    and tgt_y of length M, with N + M = L.
    src, the first part of the input sequence, is the input to the encoder, and we
    expect the decoder to predict tgt_y, the second part of the input sequence.
    In addition we generate tgt, which is tgt_y "shifted right" by one - i.e. it
    starts with the last token of src, and ends with the second-last token in tgt_y.
    This sequence will be the input to the decoder.

    Args:
        sequence: batched input sequences to split [bs, seq_len, num_features]
        ratio: split ratio, N = ratio * L

    Returns:
        tuple[torch.Tensor, torch.Tensor, torch.Tensor]: src, tgt, tgt_y
    """
    src_end = int(sequence.shape[1] * ratio)

    # [bs, src_seq_len, num_features]
    src = sequence[:, :src_end]
    # [bs, tgt_seq_len, num_features]
    tgt = sequence[:, src_end - 1 : -1]
    # [bs, tgt_seq_len, num_features]
    tgt_y = sequence[:, src_end:]

    return src, tgt, tgt_y
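A quick example of the resulting shapes, using a random batch of sequences of length 10 and the default ratio of 0.8:
import torch

batch = torch.rand(2, 10, 1)  # [bs=2, seq_len=10, num_features=1]
src, tgt, tgt_y = split_sequence(batch)
print(src.shape)    # torch.Size([2, 8, 1])
print(tgt.shape)    # torch.Size([2, 2, 1])
print(tgt_y.shape)  # torch.Size([2, 2, 1])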
Embedding, Positional Encoding and Masking
Now we need to actually feed our data to the transformer. For this, we first need to embed our input – that is, map the 1D sine signal to a higher dimension, namely the one we specified when initializing the transformer. There are several options for this, but here we simply use a linear layer:
embedding = torch.nn.Linear(
    in_features=in_dim, out_features=embed_dim
)
seq = embedding(seq)
Next, we need to apply positional encoding. This is not part of Pytorch (yet), but there are several good implementations available, e.g. even in the official documentation:
# Taken from https://pytorch.org/tutorials/beginner/transformer_tutorial.html,
# only modified to account for "batch first".
class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000) -> None:
        super().__init__()
        self.dropout = torch.nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Adds positional encoding to the given tensor.

        Args:
            x: tensor to add PE to [bs, seq_len, embed_dim]

        Returns:
            torch.Tensor: tensor with PE [bs, seq_len, embed_dim]
        """
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
With this, our input sequence is processed as follows:
src = self.encoder_embedding(src)
src = self.positional_encoding(src)
On the decoder side, we execute the same steps, but additionally generate a mask to avoid attending to future timesteps:
# Generate mask to avoid attention to future outputs.
# [tgt_seq_len, tgt_seq_len]
tgt_mask = torch.nn.Transformer.generate_square_subsequent_mask(tgt.shape[1]).to(tgt.device)
# Embed decoder input and add positional encoding.
# [bs, tgt_seq_len, embed_dim]
tgt = self.decoder_embedding(tgt)
tgt = self.positional_encoding(tgt)
The generated mask is a square matrix, in which row i specifies which positions j the decoder may attend to when producing output i: all positions j ≤ i are allowed (value 0), while future positions j > i are masked out (value -inf) – yielding a triangular, causal mask.
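For a short sequence, this mask can be inspected directly – e.g. for a decoder input of length 4:
mask = torch.nn.Transformer.generate_square_subsequent_mask(4)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])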
Putting it All Together
Now, let's put everything together, train our model and visualize the results.
The full model (model.py) looks as follows (for a complete view I again would like to refer to the Github repo):
import math

import torch


# Taken from https://pytorch.org/tutorials/beginner/transformer_tutorial.html,
# only modified to account for "batch first".
class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000) -> None:
        super().__init__()
        self.dropout = torch.nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Adds positional encoding to the given tensor.

        Args:
            x: tensor to add PE to [bs, seq_len, embed_dim]

        Returns:
            torch.Tensor: tensor with PE [bs, seq_len, embed_dim]
        """
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)


class TransformerWithPE(torch.nn.Module):
    def __init__(
        self, in_dim: int, out_dim: int, embed_dim: int, num_heads: int, num_layers: int
    ) -> None:
        """Initializes a transformer model with positional encoding.

        Args:
            in_dim: number of input features
            out_dim: number of features to predict
            embed_dim: embed features to this dimension
            num_heads: number of transformer heads
            num_layers: number of encoder and decoder layers
        """
        super().__init__()

        self.positional_encoding = PositionalEncoding(embed_dim)
        self.encoder_embedding = torch.nn.Linear(
            in_features=in_dim, out_features=embed_dim
        )
        self.decoder_embedding = torch.nn.Linear(
            in_features=out_dim, out_features=embed_dim
        )
        self.output_layer = torch.nn.Linear(in_features=embed_dim, out_features=out_dim)

        self.transformer = torch.nn.Transformer(
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            d_model=embed_dim,
            batch_first=True,
        )

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        """Forward function of the model.

        Args:
            src: input sequence to the encoder [bs, src_seq_len, num_features]
            tgt: input sequence to the decoder [bs, tgt_seq_len, num_features]

        Returns:
            torch.Tensor: predicted sequence [bs, tgt_seq_len, feat_dim]
        """
        # if self.training:
        #     # Add noise to decoder inputs during training
        #     tgt = tgt + torch.normal(0, 0.1, size=tgt.shape).to(tgt.device)

        # Embed encoder input and add positional encoding.
        # [bs, src_seq_len, embed_dim]
        src = self.encoder_embedding(src)
        src = self.positional_encoding(src)

        # Generate mask to avoid attention to future outputs.
        # [tgt_seq_len, tgt_seq_len]
        tgt_mask = torch.nn.Transformer.generate_square_subsequent_mask(
            tgt.shape[1]
        ).to(tgt.device)

        # Embed decoder input and add positional encoding.
        # [bs, tgt_seq_len, embed_dim]
        tgt = self.decoder_embedding(tgt)
        tgt = self.positional_encoding(tgt)

        # Get prediction from transformer and map to output dimension.
        # [bs, tgt_seq_len, embed_dim]
        pred = self.transformer(src, tgt, tgt_mask=tgt_mask)
        pred = self.output_layer(pred)

        return pred
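Before wiring the model into the training loop, a quick shape check with dummy tensors (dimensions chosen arbitrarily for illustration) can verify that everything fits together:
model = TransformerWithPE(in_dim=1, out_dim=1, embed_dim=128, num_heads=8, num_layers=2)
src = torch.rand(4, 80, 1)   # [bs, src_seq_len, num_features]
tgt = torch.rand(4, 20, 1)   # [bs, tgt_seq_len, num_features]
pred = model(src, tgt)
print(pred.shape)            # torch.Size([4, 20, 1])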
The main train and testing loop (main.py):
import torch
from torch.utils.data import DataLoader

from model import TransformerWithPE
from utils import (
    load_and_partition_data,
    make_datasets,
    move_to_device,
    split_sequence,
    visualize,
)

BS = 512
FEATURE_DIM = 128
NUM_HEADS = 8
NUM_EPOCHS = 100
NUM_VIS_EXAMPLES = 1
NUM_LAYERS = 2
LR = 0.001


def main() -> None:
    # Load data and generate train and test datasets / dataloaders
    sequences, num_features = load_and_partition_data("data.npz")
    train_set, test_set = make_datasets(sequences)
    train_loader, test_loader = DataLoader(
        train_set, batch_size=BS, shuffle=True
    ), DataLoader(test_set, batch_size=BS, shuffle=False)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Initialize model, optimizer and loss criterion
    model = TransformerWithPE(
        num_features, num_features, FEATURE_DIM, NUM_HEADS, NUM_LAYERS
    ).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    criterion = torch.nn.MSELoss()

    # Train loop
    for epoch in range(NUM_EPOCHS):
        epoch_loss = 0.0
        for batch in train_loader:
            optimizer.zero_grad()
            src, tgt, tgt_y = split_sequence(batch[0])
            src, tgt, tgt_y = move_to_device(device, src, tgt, tgt_y)

            # [bs, tgt_seq_len, num_features]
            pred = model(src, tgt)
            loss = criterion(pred, tgt_y)
            epoch_loss += loss.item()
            loss.backward()
            optimizer.step()
        print(
            f"Epoch [{epoch + 1}/{NUM_EPOCHS}], Loss: "
            f"{(epoch_loss / len(train_loader)):.4f}"
        )

    # Evaluate model
    model.eval()
    eval_loss = 0.0
    with torch.no_grad():
        for idx, batch in enumerate(test_loader):
            src, tgt, tgt_y = split_sequence(batch[0])
            src, tgt, tgt_y = move_to_device(device, src, tgt, tgt_y)

            # [bs, tgt_seq_len, num_features]
            pred = model(src, tgt)
            loss = criterion(pred, tgt_y)
            eval_loss += loss.item()

            if idx < NUM_VIS_EXAMPLES:
                visualize(src, tgt, pred)

    avg_eval_loss = eval_loss / len(test_loader)
    print(f"Eval Loss on test set: {avg_eval_loss:.4f}")


if __name__ == "__main__":
    main()
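The helper functions imported from utils are part of the repo. For orientation, a minimal sketch of load_and_partition_data, make_datasets and move_to_device could look as follows – the window length and split ratio here are assumptions, and the repo versions (including visualize) may differ:
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split


def load_and_partition_data(path: str, window_len: int = 100) -> tuple[torch.Tensor, int]:
    y = np.load(path)["y"]
    # Cut the long series into non-overlapping windows of fixed length.
    num_windows = len(y) // window_len
    windows = y[: num_windows * window_len].reshape(num_windows, window_len, -1)
    return torch.tensor(windows, dtype=torch.float32), windows.shape[-1]


def make_datasets(sequences: torch.Tensor, train_ratio: float = 0.8):
    dataset = TensorDataset(sequences)
    num_train = int(len(dataset) * train_ratio)
    return random_split(dataset, [num_train, len(dataset) - num_train])


def move_to_device(device: torch.device, *tensors: torch.Tensor) -> tuple[torch.Tensor, ...]:
    return tuple(t.to(device) for t in tensors)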
After training for 100 epochs, we visualize the resulting predictions on the held-out test set:


In blue, we visualize the input to the encoder, in green the continued curve to be predicted by the model, and in red the actual prediction.
As we can see, the model has successfully learned to predict the sine function – and it is also interesting to observe the learned denoising capability of the model.
Inference
One question often arising with transformers is how to run inference. Yes, with all the code presented so far, our model will train nicely, and also yield good results on the test set – that is, when presented with ground truth data as encoder / decoder input (src / tgt), the prediction generated by the decoder will closely resemble the expected values (tgt_y). This method is known as teacher forcing – but it is not what happens during "real" inference!
At a given point in time, we have all the needed historical data, and now want to predict how our time series evolves in the future. Naturally, in this case we don't have any input to the decoder.
What we do in this case is iteratively generate the decoder output: we begin with an empty output sequence, in which only the first time step is filled with the last value of src. We then run a prediction step and insert the generated output of the first step at position 2 of the output sequence – and so on. And since we correctly mask the decoder inputs, it does not matter what the rest of the output sequence looks like (it's empty …) – the decoder does not use these tokens.
Let's add an infer function to our model, and also modify our training loop to run this on the test set:
def infer(self, src: torch.Tensor, tgt_len: int, tgt: torch.Tensor) -> torch.Tensor:
    """Runs inference with the model, meaning: predicts future values
    for an unknown sequence.

    For this, iteratively generate the next output token while
    feeding the already generated ones as input sequence to the decoder.

    Args:
        src: input to the encoder [bs, src_seq_len, num_features]
        tgt_len: number of future time steps to predict
        tgt: ground-truth target sequence (not used by this function)

    Returns:
        torch.Tensor: inferred sequence [bs, tgt_len, num_features]
    """
    output = torch.zeros((src.shape[0], tgt_len + 1, src.shape[2])).to(src.device)
    output[:, 0] = src[:, -1]

    for i in range(tgt_len):
        output[:, i + 1] = self.forward(src, output)[:, i]

    return output[:, 1:]
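With this in place, the evaluation loop in main.py can be extended to also run iterative inference – a sketch, assuming infer_loss is initialized next to eval_loss and that visualize accepts the inferred prediction as an additional argument (as in the repo):
# Inside the evaluation loop, after computing the teacher-forced prediction:
pred_infer = model.infer(src, tgt_y.shape[1], tgt)
infer_loss += criterion(pred_infer, tgt_y).item()

if idx < NUM_VIS_EXAMPLES:
    visualize(src, tgt, pred, pred_infer)

# After the loop:
avg_infer_loss = infer_loss / len(test_loader)
print(f"Infer Loss on test set: {avg_infer_loss:.4f}")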


In yellow, we have now drawn the predictions generated by this iterative approach – and we observe that the model does a good job predicting the function in this setting as well.
Covariate Shift
In this example, our model does a nice job in all tested tasks. However, inference and the distribution it induces at test time can sometimes cause (big) problems. The reason for this is distribution / covariate shift: we trained the model on samples drawn from some training distribution, but when running inference, the distribution induced by the model's own outputs can be different.
This is a big issue in the field of Imitation Learning, with one prominent application being autonomous driving. Already in the late 1980s it was noted that a car trained to follow a lane would sometimes drift out of bounds – i.e. run out of its known training distribution [3]. Later, many works have addressed this issue [4, 5].
In the field of NLP, this does not seem to cause issues – and understanding why is a very interesting question to me. Unfortunately, answering it would be beyond the scope of this post, but some common explanations are: language is inherently multi-modal, and the amount of training data is vast – the model thus can never really "run out" of distribution – whereas in imitation learning / autonomous driving we usually only see examples of good driving, e.g. around some lane center line.
The example discussed here (predicting a sine function) is not multi-modal – but the problem might just be easy enough (or our predicted sequence too short) for distribution shift not to become an issue.
Thus, I will leave it at that, but end this section with a discussion of what one could do when encountering covariate shift:
- Make the model not auto-regressive: one cause of the model "running out" of distribution is the auto-regressive property, i.e. iteratively generating one data point after another. A simple mitigation would be changing the model to a non-auto-regressive one, i.e. predicting all future data points in one go. This could be achieved by e.g. removing the decoder and predicting all points of tgt via some fully-connected layers. In fact, there are a multitude of language models next to ChatGPT, like BERT or LLaMA, which use different architectures, such as encoder-only and decoder-only (GPT-3.5, for instance, is a decoder-only architecture). However, the auto-regressive property is common to most of them, so it currently seems to be one success factor of LLMs.
- Adding noise: this is a popular mitigation strategy in imitation learning, popularized by [5] and [6]. During training we deliberately add noise to the data, s.t. the model is exposed to varying distributions. In our example, it could be helpful to add noise to the decoder inputs, s.t. the model learns to deal with imperfect sequences (see the sketch below).
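As a sketch of the second strategy, the commented-out lines in the model's forward function above show how such noise injection could look – the noise scale of 0.1 is an assumption, chosen to match the noise in the generated data:
# Inside TransformerWithPE.forward, before embedding the decoder input:
if self.training:
    # Perturb the decoder inputs so the model sees imperfect sequences during training.
    tgt = tgt + torch.normal(0, 0.1, size=tgt.shape).to(tgt.device)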
Extension to Multi-Dimensional Data
To conclude this post, we will show how to extend our model to handle multi-dimensional data. However, this merely amounts to changing the input data and applying the same model – as we already designed it to handle an arbitrary number of input and output features.
Our multi-dimensional data is generated as follows: we start with two sine curves of different frequencies (y1 and y2). Then, we generate y3 by overlaying (multiplying) these functions.
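In code, this could be generated analogously to generate_data in sine_generator.py – a sketch, where the function name, the frequency factor of 2 and the noise level are assumptions:
from pathlib import Path

import numpy as np


def generate_multidim_data(data_path: Path, num_steps: int, interval: float = 0.1) -> None:
    x = np.linspace(0, num_steps * interval, num_steps)
    y1 = np.sin(x) + np.random.normal(0, 0.1, x.shape)
    y2 = np.sin(2 * x) + np.random.normal(0, 0.1, x.shape)  # different frequency
    y3 = y1 * y2  # overlay (product) of y1 and y2
    np.savez(data_path, y=np.stack([y1, y2, y3], axis=-1))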

In this setting, we are interested in predicting y3. Towards this, we understand y1 and y2 as helper variables: we can certainly directly predict y3, but knowing y1 and y2 might help us and the model better understand how the resulting function is formed.
With this in mind, we can make a design choice about how to model this:
- We either only aim to predict y3, treating y1 and y2 as exogenous variables (i.e. given from external sources). This way, we require the decoder to untangle the complex relationship of how y3 is formed by itself – but we also reduce prediction to the one function we are interested in.
- Alternatively, we can simply predict all of y1, y2 and y3. This way, the decoder might better learn to understand how y3 is formed (which is often beneficial – cue: multi-task learning). But we also spend precious model capacity on learning variables which are purely auxiliary.
As so often in Machine Learning, there is no right or wrong, and the answer is often found empirically. Here, we go with approach 2, as this requires no changes to the model. But approach 1 can also be implemented with minimal tweaks, and I invite you to play around with this.
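For reference, with approach 2 nothing changes in the code except that the data now has three features; with approach 1, the model would be instantiated with a single output feature (a sketch – in that case tgt and tgt_y additionally need to be sliced to contain only y3):
# Approach 2 (used here): encode and predict all three signals.
model = TransformerWithPE(
    in_dim=3, out_dim=3, embed_dim=FEATURE_DIM, num_heads=NUM_HEADS, num_layers=NUM_LAYERS
)

# Approach 1: use y1, y2, y3 as encoder input, but only predict y3.
model = TransformerWithPE(
    in_dim=3, out_dim=1, embed_dim=FEATURE_DIM, num_heads=NUM_HEADS, num_layers=NUM_LAYERS
)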
For visualization, we only show our variable of interest, y3, for simplicity:


And as we can observe, our transformer model also handles 3-dimensional data well.
Conclusion
This brings us to the end of this post. In it, we showed how to use the transformer model in Pytorch.
First, we theoretically introduced transformers. Then, we moved to applying them in practice using Pytorch. To get started we selected the simple problem of predicting a noisy sine function. We showed how to initialize a transformer in Pytorch, what input and output shapes are needed, and what other techniques we need to apply to make it work (in particular embedding, positional encoding and decoder masking). Next, we discussed how to run "real-life" inference, when future labels are not available, and also briefly mentioned the problem of distribution shift. Lastly, we showed how to extend our problem and model to handle 3-dimensional data.
I hope, you enjoyed this post – thanks for reading! As stated, you can find the full code on Github.
References
[1] Attention Is All You Need, 2017 (arxiv)
[2] Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case, 2020 (arxiv)
[3] ALVINN: An Autonomous Land Vehicle in a Neural Network, 1988 (NeurIPS proceedings)
[4] A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, 2010 (arxiv)
[5] DART: Noise Injection for Robust Imitation Learning, 2017 (arxiv)
[6] ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst, 2018 (arxiv)