Understanding Long RoPE in LLMs

Author:Murphy  |  View: 29716  |  Time: 2025-03-22 21:40:33

This blog post goes into detail about the Long RoPE methodology, which expands the context length of LLMs without significant performance degradation.

Image by Author – generated by Stable Diffusion 2.1

As the general public has begun using LLMs in their daily lives, one important problem arises when they have long conversations. After a few dialogue turns, the LLM can appear to completely forget what was said before! Behind the scenes, each line of dialogue is fed into the LLM's context, which you can think of as one giant input to the model. Once the conversation is too long for the context, some of the earlier data has to be removed.

Not only is this a bad customer experience, it also limits the amount of information that an LLM can reasonably process. Consequently, work has been ongoing to build LLMs with larger and larger contexts.

Today's paper, "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens," achieves just that.

Figure 1 from the paper

Looking at the above graph, we can see that the perplexity, a loss-based measure of how well the LLM predicts the next token (lower is better), stays low for LongRoPE but spikes for the other methodologies.

Let's dive into how these impressive results are achieved.


Context Length and Positional Encoding

Figure 1 from "Attention Is All You Need"

Starting from a high level, Transformers require two pieces of information for inputs: the token embeddings and the positional encodings. A tokenizer such as tiktoken maps text to token IDs drawn from a fixed-size vocabulary, and the embedding layer turns each ID into a vector. Through training, the model then learns the query, key, and value projections for each token so that it can generate the next token successfully with the information.
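To make the first input concrete, here is a minimal sketch of the token-ID and embedding lookup steps. The four-word vocabulary, the whitespace tokenizer, and the embedding size are all made up for illustration; a real model uses a learned subword tokenizer and a trained embedding table.

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
d_model = 8  # embedding dimension, chosen arbitrarily for this sketch

# Random stand-in for a trained embedding table: one row per vocabulary entry.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def tokenize(text):
    """Map whitespace-separated words to token IDs, using <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

token_ids = tokenize("The cat sat")
token_embeddings = embedding_table[token_ids]  # shape: (num_tokens, d_model)

# The same word gets the same vector wherever it appears in the sequence.
repeated = embedding_table[tokenize("cat sat cat")]
same_vector = np.array_equal(repeated[0], repeated[2])
```

Because the lookup ignores where a token sits in the sequence, the embedding alone cannot tell the first "cat" from the second. That gap is what the positional encoding has to fill.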

Equation 1 from "RoFormer: Enhanced Transformer with Rotary Position Embedding"

In addition to the embeddings, we also need positional information to tell the LLM where in a sentence the token is. The equations above show the most abstracted view for passing along the positional information. We have 3 functions, one each for the query, key, and value, and each takes a word embedding vector together with its position. To break it down, x_m and x_n are the embedding vectors of the tokens at positions m and n, with every x having the same dimensionality d.
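Written out, the general form from RoFormer looks roughly like this, where f_q, f_k, and f_v are the position-aware functions that produce the query, key, and value:

```latex
q_m = f_q(x_m, m), \qquad
k_n = f_k(x_n, n), \qquad
v_n = f_v(x_n, n), \qquad
x_m, x_n \in \mathbb{R}^{d}
```

The positional scheme is whatever you choose for these functions, so every approach discussed below is a different way of filling them in.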

One approach is to simply create a new vector for each token you see, so that the position is perfectly unique. Naturally, the trade-off here is that the unique vector makes it hard for the model to see similarities in the training data, degrading performance.

A secondary approach would be to create a vector that has a similarity factor with other vectors for each token. This way we still capture information about how similar a situation is to another distinct situation. Nevertheless, as we can create collisions of these vectors, there can be confusion that arises from this methodology.

How do we find the best combination of these approaches?

Rotary Position Embedding (RoPE)

The industry has largely focused on RoPE as a way to get the best of both worlds. Without going too deep into the mathematics, RoPE uses sinusoidal functions to assign positional values to the tokens. As sinusoidal functions are periodic by design, some positional values will be very similar to others. Consequently, positions that are similar will have a quantitative value indicating just how similar they are.

Equations 14 and 15 from "RoFormer: Enhanced Transformer with Rotary Position Embedding"

As you can see from the equation above, we have a sparse, block-diagonal matrix filled with sine and cosine functions of the position multiplied by the values θ, which are shared across all positions as a way to keep the positional encodings related.
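A short NumPy sketch of that rotation can make the idea less abstract. It assumes the standard RoFormer frequencies θ_i = 10000^(-2i/d) with a zero-based index i, and the function names are my own rather than anything from the paper. Instead of building the sparse matrix, it rotates each pair of dimensions directly, which gives the same result.

```python
import numpy as np

def rope_angles(position, d, base=10000.0):
    """Rotation angle for each of the d/2 dimension pairs at a given position."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    return position * theta

def apply_rope(x, position, base=10000.0):
    """Rotate each (even, odd) pair of a 1-D vector x by its position-dependent angle."""
    d = x.shape[-1]
    angles = rope_angles(position, d, base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=16)
k = rng.normal(size=16)

# Same relative distance (3), very different absolute positions.
score_near = apply_rope(q, 5) @ apply_rope(k, 2)
score_far = apply_rope(q, 105) @ apply_rope(k, 102)
```

The two scores match up to floating-point error, because the dot product of two rotated vectors depends only on the difference between their positions. This is the property that lets RoPE encode relative position inside the attention score.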

The exact way these θ are related is shown below:

Defining Theta in "RoFormer: Enhanced Transformer with Rotary Position Embedding"

The most critical part of this equation for context size is the value 10,000, which sets how quickly each pair of dimensions rotates as the position grows. As we have tried to create bigger contexts, the value of 10,000 has become a limiting factor: there are only so many positions you can tell apart cleanly with that number as your base.
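One way to see the role of the base is to look at the wavelength of each dimension pair, meaning how many positions it takes for that pair to complete one full rotation. The sketch below assumes θ_i = base^(-2i/d) with a zero-based index; the head dimension of 128 and the larger base of 500,000 are arbitrary values picked for comparison.

```python
import numpy as np

def rope_wavelengths(d, base=10000.0):
    """Number of positions each dimension pair needs for one full rotation."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    return 2 * np.pi / theta

wavelengths = rope_wavelengths(d=128)
shortest, longest = wavelengths[0], wavelengths[-1]

# A larger base (hypothetical value) stretches the slowest dimensions further.
wavelengths_big = rope_wavelengths(d=128, base=500000.0)

print(f"base 10,000:  shortest {shortest:.2f}, longest {longest:.0f} positions")
print(f"base 500,000: longest {wavelengths_big[-1]:.0f} positions")
```

The fastest pair repeats every 2π positions no matter the base, while the slowest pair is the one the base stretches. Raising the base is therefore one lever for covering longer contexts, at the cost of retraining or fine-tuning.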

Figure 1 from "RoFormer: Enhanced Transformer with Rotary Position Embedding"

Extending RoPE Before Long RoPE

While you could train a new model from scratch using a larger base value for your positional encodings, there are a few reasons stopping people at large from doing this. First, there is a huge cost associated with training from scratch. As only a few organizations in the world currently have the resources to do so, the burden is great. Second, it is incredibly difficult to find a large volume of high-quality long text. As the training requires trillions of tokens, finding quality long data at that scale is a major challenge.

Consequently, researchers have put forward different methodologies for extending RoPE to longer contexts without training from scratch.

The first method is linear positional interpolation (PI), where you expand the number of possible positions by dividing each rotation angle by some value λ, typically the ratio of the new context length to the original one. The equation below uses β to represent θ^(2/d), where θ here is the base value of 10,000 that ties all of the thetas from before together.

Equation 2 in the paper

While this works, the authors of the paper note that there is a crowding effect where some of the information ends up getting lost after the reduction.
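Here is a small sketch of PI and of the crowding effect. The trained length of 4,096 and the target length of 16,384 are assumptions picked for illustration, and rope_angles is my own helper rather than code from the paper.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0, scale=1.0):
    """Angles for every (position, dimension pair); scale > 1 applies linear interpolation."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    return np.outer(positions, theta) / scale

trained_len = 4096              # assumed original context length
target_len = 16384              # assumed extended context length
lam = target_len / trained_len  # interpolation factor, 4.0 here

# Largest angle reached at the last position, with and without interpolation.
extrapolated_max = rope_angles([target_len - 1], d=64).max()
interpolated_max = rope_angles([target_len - 1], d=64, scale=lam).max()

# Crowding: the angle gap between neighboring positions in the fastest dimension.
gap_before = rope_angles([1], d=64)[0, 0] - rope_angles([0], d=64)[0, 0]
gap_after = rope_angles([1], d=64, scale=lam)[0, 0] - rope_angles([0], d=64, scale=lam)[0, 0]
```

With interpolation, the last position stays inside the range of angles seen during training, but neighboring positions end up λ times closer together. That squeezing is the crowding the authors describe.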

The second method is YaRN (Yet another RoPE extensioN method), where we divide the RoPE dimensions into 3 groups by frequency and assign a different scaling rule to each of them. The basic idea is that dimensions that rotate at a high frequency should not be altered (their λ := 1), while the lower-frequency ones are. From the graph below, we can see that this works well at expanding up to 128k context length. The issue at play here is determining the groupings. The groups are determined by hand-designed rules, and thus there can be sub-optimal decisions made that reduce performance.

Figure 1 from "YaRN: Efficient Context Window Extension of Large Language Models"
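The grouping idea can be sketched as follows. This is a simplified illustration in the spirit of YaRN, not its full recipe: it only covers the frequency rescaling, the function name is hypothetical, and the two thresholds (alpha and beta, measured in full rotations within the trained context) are assumed values that a person has to pick.

```python
import numpy as np

def yarn_like_theta(d, scale, trained_len, base=10000.0, alpha=1.0, beta=32.0):
    """Blend untouched and interpolated frequencies per dimension pair.

    alpha and beta are hand-chosen thresholds (assumptions in this sketch).
    """
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    wavelength = 2 * np.pi / theta
    rotations = trained_len / wavelength  # full turns inside the trained context

    # gamma = 1: high frequency, leave untouched; gamma = 0: low frequency, interpolate fully.
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    new_theta = (1 - gamma) * theta / scale + gamma * theta
    return new_theta, gamma

new_theta, gamma = yarn_like_theta(d=128, scale=8.0, trained_len=4096)
untouched = int(np.sum(gamma == 1.0))
fully_interpolated = int(np.sum(gamma == 0.0))
in_between = gamma.size - untouched - fully_interpolated
```

The three counts correspond to the three groups. Moving alpha or beta changes which dimensions land in which group, which is exactly the manual decision the paragraph above points to as a weakness.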

Thus, while both YaRN and positional interpolation (PI) work, they have limitations that hold them back. Long RoPE takes the best of each idea and finds a clever way to combine them.

Long RoPE: 2 Insights

The Long RoPE researchers realized that to improve upon previous methods, they needed two key ideas: (1) the distribution of good λ is irregular, so searching for λ is better than assuming a correct answer, and (2) there is a subset of tokens that should simply not have their positions changed.
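To show how the two ideas fit together mechanically, here is a minimal sketch under my own assumptions: each dimension pair gets its own λ, and positions below a threshold (called n_hat here, and assumed to be the first positions of the sequence) keep their original angles. The λ values are invented for illustration and not searched, since the actual search needs a model to score each candidate.

```python
import numpy as np

def longrope_like_angles(positions, lambdas, n_hat, base=10000.0):
    """Per-dimension rescaling that leaves the first n_hat positions untouched."""
    positions = np.asarray(positions, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    d = 2 * lambdas.size  # one lambda per pair of dimensions
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)

    plain = np.outer(positions, theta)         # ordinary RoPE angles
    keep = (positions < n_hat)[:, None]        # rows that stay unchanged
    return np.where(keep, plain, plain / lambdas)

# Hypothetical, non-uniform factors: one per dimension pair (d = 8 here).
lambdas = np.array([1.0, 1.5, 3.0, 8.0])
angles = longrope_like_angles(positions=[0, 1, 2, 3], lambdas=lambdas, n_hat=2)
```

In this sketch, PI is the special case where every λ is the same and n_hat is 0. A search procedure would try many candidate (lambdas, n_hat) settings and keep the one with the lowest loss on long text.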

Both of these findings are captured in the formula below. To find the optimal λ, they created a loss function that they could minimize. The formula below is a reformatted version of RoPE with result of
