Understanding Positional Embeddings in Transformers: From Absolute to Rotary

Author: Murphy

One of the key components of Transformers is the positional embedding. You may ask: why? Because the self-attention mechanism in Transformers is permutation-invariant: it computes how much attention each token in the input receives from the other tokens in the sequence, but it does not take the order of the tokens into account. In effect, the attention mechanism treats the sequence as a bag of tokens. For this reason, we need another component, called the positional embedding, which accounts for the order of tokens and influences the token embeddings. But what are the different types of positional embeddings, and how are they implemented?

In this post, we take a look at three major types of positional embeddings and dive deep into their implementation.

Here is the table of contents for this post:

  1. Context and Background
  2. Absolute Positional Embedding
  • 2.1 Learned Approach
  • 2.2 Fixed Approach (Sinusoidal)
  • 2.3 Code Example: RoBERTa Implementation
  3. Relative Positional Embedding
  • 3.1 Layman's Explanation
  • 3.2 Technical Explanation
  • 3.3 Code Example: Transformer-XL Implementation
  4. Rotary Positional Embedding (RoPE)
  • 4.1 Layman's Explanation
  • 4.2 Technical Explanation
  • 4.3 Mathematical Proof of Relativity
  • 4.4 Rotation Matrix for Higher Dimensions
  • 4.5 Code Snippet: Roformer Implementation
  5. Conclusion
  6. References

1. Context and Background

In Natural Language Processing (NLP), the order of words in a sequence is essential to understanding the meaning, just as it is for humans. If the order is jumbled up, the meaning can change completely. Consider "Sam sits down on the mat" versus "The mat sits down on Sam", where the re-ordering changes the meaning entirely.

Transformers, which form the backbone of many modern NLP systems, process all words in parallel. This parallel processing happens in the attention mechanism, where they compute the attention scores each token receives from the other tokens in the input context. While parallelism is great for efficiency, it causes the model to lose all notion of word order. For this reason, Transformers have an additional component, the positional embedding, which creates vectors that carry the notion of position, i.e. the order of tokens in a sequence.

There are many different types of positional embeddings. The three major, well-known ones are absolute positional embeddings, relative positional embeddings, and rotary positional embeddings (RoPE).

2. Absolute Positional Embedding

Absolute positional embedding is like giving each token in a sentence a unique number. In practice, we create a vector for each position in the sequence. In the simplest case, the positional embedding of each token is a one-hot vector that is zero everywhere except at the index corresponding to the token's position. We then add these positional embedding vectors to the token embeddings before feeding them into the Transformer.

For example, in the sentence "I am a student", there are 4 tokens. Each token gets a unique positional embedding vector. Let's assume the embedding dimension is 4. Then the first token "I" gets the one-hot encoding [1, 0, 0, 0], the second gets [0, 1, 0, 0], and so on.

While one-hot encoding is a straightforward way to convey the idea of positional embeddings, in practice there are better ways to implement absolute positional embeddings. The different implementations are all simple and effective, yet they can struggle with very long sequences or with sequences longer than those seen during training.

Let's look at these implementations.

2.1 Implementation

The implementation of absolute positional embeddings typically involves creating a lookup table of size max_sequence_length × embedding_dim. That means for every position up to the maximum sequence length there is an entry in the lookup table, and that entry is of dimension embedding_dim.

There are two main types of absolute positional embeddings:

  1. Learned: in the learned approach, the embedding vectors are initialized randomly and then trained along with the rest of the model. This method was explored in the original Transformer paper [5] and is employed in popular models like BERT, GPT, and RoBERTa. Soon, we will see an example of this approach in code. A disadvantage of this approach is that it may not generalize well to sequences longer than those seen during training, simply because no entry exists in the lookup table for those positions.

2. Fixed: this approach, also known as sinusoidal positional encoding, was introduced in the seminal "Attention Is All You Need" paper [5]. This method uses sine and cosine functions of different frequencies to create unique patterns for each position. The formula for this encoding is as follows:

Sinusoidal positional encoding – Image by author
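
For reference, the formula from [5] written out is:

PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)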

In the above formula, PE is the positional embedding, d is the dimension of the model (also known as the embedding dimension), pos is the position of the token, and i indexes the embedding dimensions. Essentially, the positional embedding vector for position pos is governed by sine at even indices and by cosine at odd indices.

A key advantage of this approach is its ability to extrapolate to sequence lengths not encountered during training; this of course offers great flexibility in handling varying input sizes.
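
As a quick illustration, here is a minimal sketch (not taken from any particular library) of how these fixed sinusoidal encodings can be computed in PyTorch:

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # positions 0..max_len-1 as a column, even dimension indices 2i as a row
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even indices -> sine
    pe[:, 1::2] = torch.cos(position / div_term)  # odd indices -> cosine
    return pe  # shape: (max_len, d_model)

pe = sinusoidal_positional_encoding(512, 768)
print(pe.shape)  # torch.Size([512, 768])

Each row of pe is the encoding for one position and is simply added to the corresponding token embedding.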

Regardless of the type (learned or fixed), once absolute positional embeddings are created, they are added to the token embeddings:

Image by author
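
In symbols, the input embedding for the token x_i at position i becomes:

x_i' = \text{token\_embedding}(x_i) + PE_i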

Let's look at the learned positional embedding in the source code of the RoBERTa model together. The code is taken from the HuggingFace code repository here.

class RobertaEmbeddings(nn.Module):

    # Copied from transformers.models.bert.modeling_bert.BertEmbeddings.__init__
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        self.register_buffer(
            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
        )
        self.register_buffer(
            "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
        )

        # End copy
        self.padding_idx = config.pad_token_id
        self.position_embeddings = nn.Embedding(
            config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
        )

    def forward(
        self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
    ):
        if position_ids is None:
            if input_ids is not None:
                # Create the position ids from the input token ids. Any padded tokens remain padded.
                position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx, past_key_values_length)
            else:
                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)

        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
        # when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
        # issue #5664
        if token_type_ids is None:
            if hasattr(self, "token_type_ids"):
                buffered_token_type_ids = self.token_type_ids[:, :seq_length]
                buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
                token_type_ids = buffered_token_type_ids_expanded
            else:
                token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

Notice how the following line in the __init__ method initializes the learned positional embeddings with random values:

self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)

and then in the forward method, the positional embeddings are added to the token embeddings:

if self.position_embedding_type == "absolute":
    position_embeddings = self.position_embeddings(position_ids)
    embeddings += position_embeddings

Let's run the code together on a text input and see the result:

from transformers import RobertaConfig

config = RobertaConfig()

print(config)
RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50265
}

As you see in the config printed above, we have "position_embedding_type": "absolute", and the context window is 512 tokens long: "max_position_embeddings": 512. Let's create an object of RobertaEmbeddings:

emb = RobertaEmbeddings(config)
print(emb)
RobertaEmbeddings(
  (word_embeddings): Embedding(50265, 768, padding_idx=1)
  (position_embeddings): Embedding(512, 768, padding_idx=1)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

And you see above that the RobertaEmbeddings layer has a (word_embeddings): Embedding(50265, 768, padding_idx=1), which is an embedding matrix initialized at random with shape 50265×768; it means the vocabulary size is 50265 and each embedding vector is 768-dimensional.

Then we see (position_embeddings): Embedding(512, 768, padding_idx=1), which holds the positional embedding vectors; these are again initialized at random and are 768-dimensional. Note the size of this embedding matrix is 512×768, which means we have positional embeddings for only 512 positions. As a result, if a sequence longer than 512 tokens appears at inference time, we won't have a learned positional embedding for it! This is one of the caveats of learned absolute positional embeddings, which we discussed earlier.

Let's take a sequence and feed it to the embedding layer:

from transformers import RobertaTokenizer

# Initialize the tokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)

print(tokens)

# Get the input IDs
input_ids = tokenizer.encode(sentence, add_special_tokens=True)

print("nInput IDs:", input_ids)

It prints the following:

['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.']

Input IDs: [0, 133, 2119, 6219, 23602, 13855, 81, 5, 22414, 2335, 4, 2]

Note the encoded sequence has 12 tokens (the 10 word tokens plus the special start and end tokens). We pass it to the embedding layer:

import torch

input_tensor = torch.tensor(input_ids).reshape((1, -1))
emb(input_ids=input_tensor)
tensor([[[-0.7226, -2.3475, -0.5119,  ..., -1.3224, -0.0000, -0.9497],
         [-0.4094,  0.7778,  1.8330,  ...,  0.1183, -0.3897, -1.8805],
         [-0.7342, -1.6158,  0.2465,  ..., -0.0000, -1.4895, -0.8259],
         ...,
         [-0.2884, -3.0506,  0.6108,  ...,  0.8692,  0.9901,  0.6638],
         [ 0.6423, -2.1128,  1.2056,  ...,  0.2799,  0.5368, -1.0147],
         [-0.4305, -0.4462, -1.2317,  ...,  0.4016,  1.8494, -0.2363]]],
       grad_fn=<...>)

The output tensor is the summation of token embedding and positional embedding retrieved from their corresponding embedding matrices.

3. Relative Positional Embedding

Relative positional embeddings focus on how tokens in a sequence relate to each other in terms of distance, and do not take into account the exact positions of the tokens.

3.1 Layman's Explanation

Consider the sentence "I am a student". The exact position of "I" is 1, and the exact position of "student" is 4. These are the absolute positions of the tokens. Relative positional embeddings do not take these into account; they only consider that "I" is 3 positions away from "student" and 1 position away from "am".

Relative positional embeddings have an advantage when dealing with longer sequences and generalize better to sequence lengths not seen during training. Soon we will see how.

Some notable models that use relative positional embeddings are Transformer-XL [1], T5 (Text-To-Text Transfer Transformer)[2], DeBERTa (Decoding-enhanced BERT with Disentangled Attention)[3] and BERT with Relative Position Embeddings [4]. Feel free to read any of these papers to get a feeling for how they have implemented relative positional embeddings.

3.2 Technical Explanation

First of all, as opposed to absolute positional embeddings, which add a positional embedding to the token embedding, relative positional embeddings create matrices that represent the relative distances between tokens. For example, if token i is at position 2 and token j is at position 5, the relative position is j − i = 3.

Relative positional embeddings then modify the attention scores to include information about these relative positions. As you know, in the self-attention mechanism, attention scores are computed between pairs of tokens. Relative positional embeddings therefore either add a bias term to the attention scores based on the relative position, or incorporate a learnable embedding for each possible relative distance.

A common implementation of this method involves adding a relative position bias to the attention scores. If A is the attention score matrix, a relative position bias matrix B is added:

Image by author
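
Written out with the symbols defined below, the biased score is:

A_{ij} = \frac{Q_i K_j^{\top}}{\sqrt{d_k}} + B_{ij}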

Here, Q_i and K_j are the query and key vectors for tokens i and j, d_k is the dimensionality of the key vectors, and B_ij is the bias term based on the relative position j − i.

3.3 Code Example: Transformer-XL Implementation

Here's a simple example of how relative positional embeddings can be implemented in PyTorch. This implementation is close to how Transformer-XL implements it. To check out their code repository, please see here.

import torch
import torch.nn as nn

class RelativePositionalEmbedding(nn.Module):
    def __init__(self, max_len, d_model):
        super(RelativePositionalEmbedding, self).__init__()
        self.max_len = max_len
        self.d_model = d_model
        self.relative_embeddings = nn.Embedding(2 * max_len - 1, d_model)

    def forward(self, seq_len):
        # Generate relative positions
        range_vec = torch.arange(seq_len)
        range_mat = range_vec[None, :] - range_vec[:, None]
        clipped_mat = torch.clamp(range_mat, -self.max_len + 1, self.max_len - 1)
        relative_positions = clipped_mat + self.max_len - 1
        return self.relative_embeddings(relative_positions)

class RelativeSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads, max_len):
        super(RelativeSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        # relative embeddings live in the per-head dimension d_k
        self.relative_pos_embedding = RelativePositionalEmbedding(max_len, self.d_k)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        batch_size, seq_len, d_model = x.size()
        # (batch_size, num_heads, seq_len, d_k)
        Q = self.query(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.key(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.value(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Compute standard attention scores
        scores = torch.einsum('bhqd,bhkd->bhqk', Q, K) / (self.d_k ** 0.5)

        # Get relative position embeddings
        rel_pos_embeddings = self.relative_pos_embedding(seq_len)  # (seq_len, seq_len, d_k)
        rel_pos_embeddings = rel_pos_embeddings.transpose(0, 2).transpose(1, 2)  # (d_k, seq_len, seq_len)
        rel_scores = torch.einsum('bhqd,dqk->bhqk', Q, rel_pos_embeddings)

        # Add relative position scores
        scores += rel_scores
        attn_weights = self.softmax(scores)

        # Compute the final output
        output = torch.einsum('bhqk,bhkd->bhqd', attn_weights, V).contiguous()
        output = output.transpose(1, 2).reshape(batch_size, seq_len, d_model)
        return output

We can call it with the following parameters:

seq_len = 10
d_model = 512
num_heads = 8
max_len = 20

x = torch.randn(32, seq_len, d_model)  # Batch of sequences
attention = RelativeSelfAttention(d_model, num_heads, max_len)
output = attention(x)

Notice that seq_len (sequence length) refers to the actual length of the input sequence for a specific batch; seq_len can vary from batch to batch.

However, max_len (maximum length) is a predefined value that represents the maximum relative position distance the model will consider. This value determines the range of relative positions for which the model will learn embeddings. If max_len is set to 20, the model will have embeddings for relative positions from -19 to 19.

That's why self.relative_embeddings = nn.Embedding(2 * max_len - 1, d_model) is given this size: to accommodate all possible relative positions within the range defined by max_len.

Now, let's explain the code:

The first class, shown below, creates a learnable embedding matrix of size (2 * max_len - 1) × d_model. In the forward function, for a given sequence length, it retrieves the corresponding embeddings from the relative_embeddings matrix.

class RelativePositionalEmbedding(nn.Module):
    def __init__(self, max_len, d_model):
        super(RelativePositionalEmbedding, self).__init__()
        self.max_len = max_len
        self.d_model = d_model
        self.relative_embeddings = nn.Embedding(2 * max_len - 1, d_model)

    def forward(self, seq_len):
        # Generate relative positions
        range_vec = torch.arange(seq_len)
        range_mat = range_vec[None, :] - range_vec[:, None]
        clipped_mat = torch.clamp(range_mat, -self.max_len + 1, self.max_len - 1)
        relative_positions = clipped_mat + self.max_len - 1
        return self.relative_embeddings(relative_positions)
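
To make the indexing concrete, here is a small usage example of this class (the sizes are purely illustrative):

# For seq_len = 4, range_mat holds j - i for every pair (i, j):
#   [[ 0,  1,  2,  3],
#    [-1,  0,  1,  2],
#    [-2, -1,  0,  1],
#    [-3, -2, -1,  0]]
# Adding max_len - 1 = 19 shifts these values into valid row indices (16..22)
# of the (2 * max_len - 1)-row embedding table.
rel_emb = RelativePositionalEmbedding(max_len=20, d_model=64)
print(rel_emb(4).shape)  # torch.Size([4, 4, 64])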

The second class, shown below, takes a sequence x and computes the Query, Key and Value matrices. Note that each attention head has its own Q, K and V, which is why these tensors are first reshaped to (batch_size, seq_len, num_heads, d_k) and then transposed to (batch_size, num_heads, seq_len, d_k).

class RelativeSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads, max_len):
        super(RelativeSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.relative_pos_embedding = RelativePositionalEmbedding(max_len, self.d_k)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        batch_size, seq_len, d_model = x.size()
        Q = self.query(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.key(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.value(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Compute standard attention scores
        scores = torch.einsum('bhqd,bhkd->bhqk', Q, K) / (self.d_k ** 0.5)

        # Get relative position embeddings
        rel_pos_embeddings = self.relative_pos_embedding(seq_len)  # (seq_len, seq_len, d_k)
        rel_pos_embeddings = rel_pos_embeddings.transpose(0, 2).transpose(1, 2)  # (d_k, seq_len, seq_len)
        rel_scores = torch.einsum('bhqd,dqk->bhqk', Q, rel_pos_embeddings)

        # Add relative position scores
        scores += rel_scores
        attn_weights = self.softmax(scores)

        # Compute the final output
        output = torch.einsum('bhqk,bhkd->bhqd', attn_weights, V).contiguous()
        output = output.transpose(1, 2).reshape(batch_size, seq_len, d_model)
        return output

The line scores = torch.einsum('bhqd,bhkd->bhqk', Q, K) / (self.d_k ** 0.5) uses einsum, a powerful notation for specifying complex tensor operations concisely. In this context, it computes the dot product between the query and key vectors. The equation 'bhqd,bhkd->bhqk' can be interpreted as follows:

  • b: Batch size.
  • h: Number of attention heads.
  • q: Query sequence length.
  • k: Key sequence length (which is typically the same as the query sequence length in self-attention).
  • d: Depth of each head (i.e., self.d_k).

The einsum notation 'bhqd,bhkd->bhqk' specifies that the dot-product is computed between the last dimension of Q and K, while preserving the other dimensions.

The next line, rel_pos_embeddings = self.relative_pos_embedding(seq_len), retrieves the relative positional embeddings for all relative distances present in the sequence, which is why its shape is (seq_len, seq_len, d_k). We then transpose it to change the shape into (d_k, seq_len, seq_len). The line rel_scores = torch.einsum('bhqd,dqk->bhqk', Q, rel_pos_embeddings) then computes the contribution of the relative positional embeddings to the attention scores in the self-attention mechanism. This is the relative positional bias matrix B in the equation we saw earlier.

Finally, we add the matrix B to the original attention scores:

# Add relative position scores
scores += rel_scores
attn_weights = self.softmax(scores)

And multiply by the value matrix V to get the output:

# Compute the final output
output = torch.einsum('bhqk,bhkd->bhqd', attn_weights, V).contiguous()
output = output.transpose(1, 2).reshape(batch_size, seq_len, d_model)
return output

4. Rotary Positional Embedding

Rotary positional embedding, often called RoPE (Rotary Position Embedding), is a clever approach that combines some benefits of both absolute and relative embeddings. This method was proposed in Roformer paper [6].

4.1 Layman's Explanation

The key idea behind RoPE is to encode position information by rotating word vectors in a high-dimensional space. The amount of rotation depends on the position of the word or token in the sequence.

This rotation has a neat mathematical property: the relative position between any two words can be easily computed by how much one word's vector has rotated compared to the other. So, while each word gets a unique rotation based on its absolute position, the model can easily figure out relative positions too.

RoPE has several advantages: It can handle longer sequences more effectively than absolute positional embeddings. It naturally incorporates both absolute and relative position information. And as we see later, it's computationally efficient and easy to implement.

4.2 Technical Explanation

Given a token embedding and the position of that token, absolute positional embedding computes a positional vector and adds it to the token embedding:

Image by author

Rotary positional embedding, in contrast, takes a token embedding and its position and produces a new embedding that encodes the positional information directly:

Image by author

Let's see how this is computed. In a nutshell (and we expand on it soon):

Given a token, RoPE applies a rotation to its respective key and query vectors based on its position in the sequence. This rotation is achieved by multiplying the vectors with a rotation matrix. The rotated key and query vectors are then used to compute the attention scores in the usual way (dot product followed by softmax), and the rest of the computation in the Transformer happens as usual.

Let's see what a rotation matrix is and how it is applied to the query and key vectors.

Rotation Matrix: A rotation matrix in 2D space (the simplest case) is as follows, where θ is an arbitrary angle:

Rotation matrix with angle θ – image by author
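
In LaTeX form, this is the standard 2D rotation matrix:

R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}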

If you multiply the above matrix by a 2D vector, it only changes the angle of the vector and keeps its length the same. Do you agree?

product of rotation matrix and a vector x – Image by author
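
Written out, the product is:

R(\theta)\,x = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1\cos\theta - x_2\sin\theta \\ x_1\sin\theta + x_2\cos\theta \end{pmatrix}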

And we see that the norm of the rotated vector is the same as the original vector. Let's do the math:

rotation matrix preserves vector norm – Image by author
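
In symbols, using \sin^2\theta + \cos^2\theta = 1 (the cross terms cancel):

\|R(\theta)x\|^2 = (x_1\cos\theta - x_2\sin\theta)^2 + (x_1\sin\theta + x_2\cos\theta)^2 = x_1^2 + x_2^2 = \|x\|^2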

Now, how is it applied to the key and query vector?

Notice that the query vector is the product of the query matrix and the token embedding, i.e.:

query vector – Image by author
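
That is:

q = W_q\, x

where W_q is the query projection matrix and x is the token embedding.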

Now if we apply the rotation matrix to it, we are rotating the query vector.

product of rotation matrix and query vector – Image by author
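
In symbols:

R(\theta)\, q = R(\theta)\, W_q\, x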

But how on earth does it include the position information?

Great question. In all the math above, we assumed the token x occurs at position 1! If it occurs at an arbitrary position m, then the rotation matrix will contain m in it:

Image by author
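
That is, for a token x_m at position m:

R(m\theta) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}, \qquad q_m = R(m\theta)\, W_q\, x_m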

4.3 Mathematical Proof of Relativity

Now, let's prove that Rotary Positional Embedding (RoPE) is relative. To do so, we need to demonstrate that the attention score between two tokens depends only on their relative positions, not their absolute positions.

1. Let's define the RoPE operation as follows:

ROPE function definition – Image by author
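
Consistent with the derivation above (and restricted here to the 2D case), this can be written as:

\mathrm{RoPE}(x, m) = R(m\theta)\, W x

where W is the query (or key) projection matrix and m is the token's position.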

2. Consider two tokens at positions m and n:

Image by author
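
Their rotated query and key vectors are:

q_m = R(m\theta)\, W_q\, x_m, \qquad k_n = R(n\theta)\, W_k\, x_n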

3. We calculate the attention score as their dot product:

attention score – image by author
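
That is:

\mathrm{score}(m, n) = q_m^{\top} k_n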

Let's expand this as follows:

attention score – image by author
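
Substituting the definitions of q_m and k_n, and using (AB)^{\top} = B^{\top}A^{\top}:

q_m^{\top} k_n = \big(R(m\theta) W_q x_m\big)^{\top} \big(R(n\theta) W_k x_n\big) = x_m^{\top} W_q^{\top} R(m\theta)^{\top} R(n\theta)\, W_k\, x_n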
4. Rotation matrices have this nice property:
property of rotation matrices – image by author
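
Namely:

R(a)^{\top} R(b) = R(b - a)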
5. Therefore, the attention score becomes the following:
attention score – image by author
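
That is:

q_m^{\top} k_n = x_m^{\top} W_q^{\top} R\big((n - m)\theta\big)\, W_k\, x_n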

As you see, the score depends only on the relative position n − m, the difference between the two token positions, and not on m or n individually.
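
As a quick numerical sanity check (a minimal sketch with arbitrary, illustrative numbers), we can verify that shifting both positions by the same offset leaves the score unchanged:

import math
import torch

def rot(angle):
    # 2D rotation matrix R(angle)
    return torch.tensor([[math.cos(angle), -math.sin(angle)],
                         [math.sin(angle),  math.cos(angle)]])

theta = 0.3              # arbitrary base angle
q = torch.randn(2)       # 2D query vector (already projected by W_q)
k = torch.randn(2)       # 2D key vector (already projected by W_k)

score_a = torch.dot(rot(2 * theta) @ q, rot(6 * theta) @ k)    # positions m=2, n=6
score_b = torch.dot(rot(7 * theta) @ q, rot(11 * theta) @ k)   # both positions shifted by 5
print(torch.allclose(score_a, score_b, atol=1e-5))  # True: only n - m matters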

4.4 Rotation Matrix for Higher Dimensions

More often than not, the embedding dimension of the model is not 2 but some larger (even) number d. In that case, RoPE splits each d-dimensional query and key vector into d/2 two-dimensional pairs and rotates each pair by its own angle, which amounts to multiplying by a block-diagonal rotation matrix.

