How to Improve Model Quality Without Building Larger Models
Recently OpenAI unveiled their newest model, o1. Rather than highlight the parameter count of this model, OpenAI instead showcased that the model performs significantly better because it takes more time. When you ask the model a question, it will often take multiple seconds to respond – a far cry from the millisecond speeds most people now expect from Large Language Models (LLMs). Nevertheless, this extra time appears to pay off, as o1 scores substantially higher than other models on the LMSYS Chatbot Arena.
Given this leap in performance, the question everyone is asking is: how did they do this?

While OpenAI has not publicly stated how they achieved these results, there have been a few papers recently that are good candidates for what is happening behind the scenes. One such paper is "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters", which goes into how you can leverage test-time compute to increase your model's accuracy.
Let's dive into that paper.
The first question you might ask is: what is test-time compute? In general there are two major points of computation when building LLMs: pre-training and inferencing. Pre-training is when we run enormous numbers of tokens through our model and use cross-entropy loss and backpropagation to teach the model how to better predict the next token (see more about this in my series here). Inferencing is when a user prompts our model and we need to give that user a result back. Inferencing is also called 'test-time'.
The basic question here is: can we trade off making the user wait longer for a better result?
Laying Out an Optimization Problem
The authors of the paper frame this as an optimization problem. We are given a limited amount of compute and a query from the user, and our goal is to determine the optimal allocation of compute to that query. If we give it too much, then we have wasted money. If we give it too little, then we are losing customer trust.
Importantly, not all queries are created equal. Some are easy and require less compute than harder ones. Thus, to really optimize our compute, we need to be aware of the query's difficulty. The way the authors determined difficulty was via pass@1.
Difficulty
As you can imagine, the more difficult our problem is, the more we want the model to reason. Importantly, difficulty depends on the model that you are using – one size does not fit all. Thus, we have to determine difficulty in a way that can be adapted to new models. For this paper, the authors computed pass@1 for the questions in the PRM800K dataset.
You find pass@1 by taking each question and running it through your model 2048 times. The rate of getting back a correct answer is your pass@1 rate. Naturally, finding these values is quite expensive – a back-of-the-envelope calculation shows you need to compute ~1,638,400,000 forward passes! Note also that this method of defining difficulty is directly connected to the model answering the questions. If you change the model, the expectation is that the pass@1 rate will change for the same questions.
The authors decided that the precise pass@1 rate didn't matter as much as the relative difficulty. Thus, they split the questions into quintiles based off their pass@1 rates.
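To make this concrete, here's a minimal sketch of how you could estimate pass@1 and bucket questions into difficulty quintiles yourself. The `generate` and `is_correct` callables are hypothetical stand-ins for sampling your model and checking an answer against ground truth – this is not the authors' code, just the idea.

```python
def estimate_pass_at_1(question, generate, is_correct, n_samples=2048):
    """Estimate pass@1 for one question by sampling the model n_samples times.

    `generate` samples one answer from the model; `is_correct` checks it
    against the ground truth. Both are assumed helpers, not real APIs.
    """
    correct = sum(is_correct(generate(question)) for _ in range(n_samples))
    return correct / n_samples


def difficulty_quintiles(questions, pass_rates):
    """Bucket questions into 5 bins by pass@1 rate (easiest bin first)."""
    ranked = sorted(zip(questions, pass_rates), key=lambda qp: qp[1], reverse=True)
    bin_size = (len(ranked) + 4) // 5
    return [ranked[i:i + bin_size] for i in range(0, len(ranked), bin_size)]
```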
With the problem laid out, the authors explored two ways to improve model behavior at test time. First, they devised ways for the model to explore multiple answers and pick the best one. Second, they explored fine-tuning the model so it naturally iterates on its own answers before committing to one. They call this adjusting the proposal distribution.
Scaling Test Time Compute with Verifiers
Our first strategy is to lean heavily on verifiers. We will generate a few answers at once, and as the LLM is outputting these answers, we will stop it at specific times to get feedback from the verifier. Based off that verifier's feedback we will then determine which answers to keep working on.
Process-supervised Reward Model Verifier
Naturally there are many ways to build a verifier. We could use LLM-as-a-judge (where we ask an LLM to rate the answer), but this would likely be very expensive.
Instead, the authors built a separate, smaller model trained on the PRM800K dataset (MIT license), which was created in the "Let's Verify Step by Step" paper. PRM stands for Process-supervised Reward Model, and the name gives us a good hint about the intentions here: the reward model is trained so that, as the main model is generating an answer, the reward model can check in and make sure its thinking process is correct.
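To give a feel for how a process-supervised verifier gets used, here's a small sketch. The `prm_step_score` call is a hypothetical stand-in for the PRM, and the way the per-step scores are aggregated (last step versus minimum) is a design choice I'm assuming, not something stated here.

```python
def score_answer_with_prm(question, answer_steps, prm_step_score):
    """Score a (partial) answer with a process-supervised reward model.

    `prm_step_score` is a hypothetical call returning the PRM's confidence
    that a given step is correct, given the question and all steps so far.
    Here we aggregate by taking the final step's score; taking the minimum
    over steps is another common choice.
    """
    scores = [prm_step_score(question, answer_steps[:i + 1])
              for i in range(len(answer_steps))]
    return scores[-1] if scores else 0.0
```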
The authors explored three ways to figure out when to stop a potential answer and get feedback from the verifier.

Best-of-N
Here we simply generate N different responses and then select the best one using our verifier. This is the cheapest of the options, since the verifier only runs once at the end – though that also means if one of the N answers took a wrong turn, it cannot be corrected. The slowdown here is also the smallest: we slow down by a factor of N (for N forward passes of the base model), plus whatever time it takes the verifier model to determine which answer is best at the end.
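A minimal sketch of Best-of-N, assuming hypothetical `generate_answer` and `verifier_score` helpers (the latter could be the PRM scorer sketched above):

```python
def best_of_n(question, generate_answer, verifier_score, n=8):
    """Generate n candidate answers independently, then keep the one the
    verifier scores highest. The verifier runs only once per candidate,
    at the very end."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: verifier_score(question, answer))
```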
Beam Search
Beam Search is similar to Breadth-First Search. Here we also generate N samples, but after each step (some fixed number of tokens generated) we make an assessment about which M branches are most promising and continue generating tokens from those. This continues, with us sampling N more candidates and again choosing the M most promising branches from those.
We determine which branches are promising using the verifier. This has the advantage of looking through many different possibilities while still focusing compute on only the most promising ones. The downside is that you have to run the verifier more often, which adds a time tax.
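Here's a rough sketch of what verifier-guided beam search could look like. `extend_step`, `verifier_score`, and `is_finished` are assumed helpers, and the branching scheme is my simplification of the setup, not a faithful reproduction of the paper's.

```python
def beam_search(question, extend_step, verifier_score, is_finished,
                n=8, m=4, max_steps=16):
    """Verifier-guided beam search over partial answers.

    `extend_step` appends one more step (a fixed chunk of tokens) to a partial
    answer, `verifier_score` rates a partial answer, and `is_finished` says
    whether an answer is complete. All three are hypothetical stand-ins.
    """
    beams = [extend_step(question, "") for _ in range(n)]
    for _ in range(max_steps):
        # Keep only the m most promising branches according to the verifier.
        beams.sort(key=lambda b: verifier_score(question, b), reverse=True)
        best = beams[:m]
        if all(is_finished(b) for b in best):
            break
        # Branch out again: sample several continuations from each kept beam.
        beams = []
        for b in best:
            if is_finished(b):
                beams.append(b)  # completed answers carry forward unchanged
            else:
                beams.extend(extend_step(question, b) for _ in range(n // m))
    return max(beams, key=lambda b: verifier_score(question, b))
```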
Lookahead Search
Lookahead Search extends Beam Search further. Rather than judging how promising a branch is by running the verifier on the tokens produced so far, we instead sample k steps further (with temperature set to 0). We then pass this longer chain to the verifier to judge its promise. The advantage here is that the verifier has the most information at its disposal, but this comes with the largest time tax and the most compute spent.
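The only change relative to the beam search sketch above is how a partial answer gets scored: roll it forward k steps greedily, score the rollout, and throw the rollout away. Something like this, again with assumed helpers:

```python
def lookahead_score(question, partial_answer, extend_step_greedy, verifier_score, k=3):
    """Score a partial answer by rolling it forward k steps at temperature 0,
    then asking the verifier how promising the rolled-out chain looks.
    `extend_step_greedy` is a hypothetical greedy (temperature 0) step function;
    the rollout is used only for scoring and is then discarded."""
    rollout = partial_answer
    for _ in range(k):
        rollout = extend_step_greedy(question, rollout)
    return verifier_score(question, rollout)
```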
Results of Verifier Approach
Our first comparison looks at how the searches fare under a limited compute budget. With Beam Search, the constraint M is the number of branches to continue down from each approved branch. With Lookahead, the constraint is how far to generate ahead before the verifier runs. The left graph below shows us that while initially Beam Search does better than Best-of-N and Lookahead, once the compute budget gets large enough the others catch up fairly quickly.

The graph on the right compares these options based off the quintile difficulty of the problem. As expected, Beam Search, with its greater number of possibilities considered, does better than Best-of-N at higher levels of difficulty.
Refining Proposal Distribution
The authors now turn their attention to getting the model to iteratively improve on its answer at test-time. Since simply asking the model to improve on its answer doesn't always yield better results, they instead fine-tuned their language model so that the model itself decides to iterate on its first thoughts before giving a final answer.
Data Generation for Fine-Tuning
The authors wanted their data to show a multi-turn exchange (think of a dialogue with many back-and-forth turns between the participants) where small mistakes made in the first iterations of the answer are corrected later on. This setup was hypothesized to teach the model how to pick these mistakes out and fix them. Naturally, the danger here is that you teach the model to make certain mistakes and thus increase the risk of it repeating them at test-time.
Interestingly, generating enough fine-tuning data for this was challenging. To make the most of their compute, the authors sampled 64 responses in parallel for each question. They then took each correct answer and randomly added between 0 and 4 wrong answers before it in the context. This final chunk of text was added to the fine-tuning dataset.
They wanted the wrong answers to get progressively closer to the right one, so they used a character edit distance metric to rank the wrong answers and picked the closest one as the final wrong answer in the context. Any remaining wrong answers they needed were simply sampled at random from the rest. They don't disclose which edit metric they used, but common character edit algorithms include Levenshtein Distance and Damerau–Levenshtein Distance, so it seems likely the metric is derived from one of these. It's interesting that they didn't use the edit metric to choose progressively less wrong answers for the earlier slots in the context – as far as I can tell, they didn't mention why they didn't go that way.
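Since the paper doesn't name the metric, here's a sketch of how that selection could work using plain Levenshtein distance – treat the metric choice and the `build_revision_context` helper as my assumptions, not the authors' pipeline.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming character edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def build_revision_context(correct_answer, wrong_answers, max_wrong=4):
    """Assemble one fine-tuning example: 0-4 wrong answers followed by the
    correct one, with the wrong answer closest to the correct answer (by edit
    distance) placed immediately before it."""
    n_wrong = random.randint(0, min(max_wrong, len(wrong_answers)))
    if n_wrong == 0:
        return [correct_answer]
    closest = min(wrong_answers, key=lambda w: levenshtein(w, correct_answer))
    others = [w for w in wrong_answers if w is not closest]
    earlier = random.sample(others, k=min(n_wrong - 1, len(others)))
    return earlier + [closest, correct_answer]
```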

Adjustments to the Fine-Tuned Model at Inference Time
The authors found that they could improve the model's pass@1 rate by running the model multiple times on the same question and then only putting the 4 most recent answers into the context. The graph below shows that the accuracy goes up a fair amount as you continue to generate.

As the model was trained to expect wrong answers in its context, it will sometimes take a correct answer and turn it into an incorrect one – the authors found this happened to roughly 38% of the correct answers in the context. To avoid this issue, the authors used the techniques from above, with the verifier model helping guide the model towards keeping the correct answer.
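Putting those pieces together, the inference loop might look something like the sketch below: a sliding window of the 4 most recent attempts, with a verifier picking the best answer produced anywhere along the chain so a correct answer that later gets "revised" into a wrong one isn't lost. The `revise` and `verifier_score` calls are assumed helpers.

```python
def sequential_revisions(question, revise, verifier_score, n_revisions=8, window=4):
    """Ask the revision-tuned model for attempt after attempt, feeding it only
    the `window` most recent attempts, then let the verifier pick the best
    answer produced anywhere along the chain."""
    attempts = []
    for _ in range(n_revisions):
        attempts.append(revise(question, attempts[-window:]))
    return max(attempts, key=lambda answer: verifier_score(question, answer))
```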

Balancing Sequential and Parallel Generation
After applying their correction, the authors were curious how parallel generation compared with sequential generation at test-time – that is, using the fine-tuned model to revise its answer multiple times versus having it generate independent answers in parallel. They found that sequential generation narrowly outperforms parallel generation.

The authors hypothesized that parallel and sequential generation have different strengths and weaknesses. Parallel generation has the model considering more options (unaffected by the previous answers), while sequential generation lets the model home in on the previous answer's thinking. Put simply, if you're on the right path, sequential generation is the key, but if you're not, parallel generation is very helpful.
To strike a balance between the two, the authors tried multiple different ratios of sequential to parallel generation. They found that question difficulty mattered most when choosing a ratio – easier questions benefited more from purely sequential compute, while more difficult questions did best with a balance of the two. A toy illustration of such a ratio policy is sketched below.
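The specific ratios here are made up for the example – the paper sweeps them empirically per difficulty bin – but the shape of the idea is: the harder the question, the more of the fixed budget goes to parallel chains.

```python
def allocate_budget(total_samples, difficulty_quintile):
    """Split a fixed sample budget (chains x revisions per chain) between
    parallel chains and sequential revisions, based on question difficulty.

    difficulty_quintile: 1 = easiest, 5 = hardest. The sequential:parallel
    ratios below are illustrative only.
    """
    seq_to_par_ratio = {1: 64, 2: 16, 3: 4, 4: 2, 5: 1}[difficulty_quintile]
    n_parallel = max(1, round((total_samples / seq_to_par_ratio) ** 0.5))
    n_sequential = max(1, total_samples // n_parallel)
    return n_parallel, n_sequential


# Example: with a budget of 64 samples, the easiest questions get one chain of
# 64 revisions, while the hardest get 8 parallel chains of 8 revisions each.
print(allocate_budget(64, 1))  # (1, 64)
print(allocate_budget(64, 5))  # (8, 8)
```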

Optimizing Test-Time Spend with Pre-Training Spend
Now that we've seen that increasing test-time compute can enhance the model's answers, the authors wondered how far you can push test-time compute. Is there a way to make a smaller model better than a bigger one via test-time compute?
The authors looked at this tradeoff from a computation-spend vantage point. They used the formulas below to estimate the number of Floating Point Operations (FLOPs) required to train and run inference with a model of a certain parameter size. N represents the number of parameters, and D represents the number of tokens used (either for pre-training or at inference time).
For pre-training, the number of tokens can be found by multiplying the tokens in the dataset by the number of epochs you'll train over. Generally, larger parameter models require more pre-training compute. For inference, the estimate has greater variability – you predict the total number of queries and multiply that by the average response length in tokens.
Remember, the idea here is that we have a set budget for compute, so we need to divide it between training the model and inferencing the model.
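For reference, the standard back-of-the-envelope approximation – which I believe is the one the paper uses, though treat the exact constants as my assumption – is roughly 6 FLOPs per parameter per training token and 2 FLOPs per parameter per inference token:

```latex
% N = parameter count, D = number of tokens processed
X_{\text{pretrain}} \approx 6\,N\,D_{\text{pretrain}},
\qquad
X_{\text{inference}} \approx 2\,N\,D_{\text{inference}}
```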


The graph below shows two things: (1) how increased spending on test-time compute improves performance, and (2) when spending is better put toward inferencing versus pre-training.

To understand (2), we need to focus on the star. The star shows the performance we would get from spending the equivalent FLOPs on pre-training a model with 14x more parameters. If the star is below the test-time compute line, the compute was better spent at test-time. If the star is above the line, the compute would have been better spent on pre-training.
In short, these results suggest that higher difficulty questions require more complex models (and thus more pre-training compute) than lower difficulty ones, though performance generally increases with higher compute budgets. Additionally, if we expect a large volume of inference tokens, it is more effective to spend more compute on pre-training.
Closing
With the performance gains seen from greater test-time compute, we can expect people to quickly follow this methodology. While the PRM800K dataset is freely available, we have not seen many publicly available verifier models yet. If these become open source, it will be very interesting to see how complex these models need to be to ensure high-quality performance. Additionally, we're seeing here that the more data you have about customer use cases, the lower you can make your inferencing cost structure.
It is an exciting time to be building.
[1] Snell, C., et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (2024), arXiv
[2] Lightman, H., et al., "Let's Verify Step by Step" (2023), arXiv
[3] OpenAI Team, "Introducing OpenAI o1" (2024), OpenAI