Boosting LLM Inference Speed Using Speculative Decoding


Intro

Large Language Models are extremely power-hungry and require a significant amount of GPU resources to perform well. However, the transformer architecture does not take full advantage of the GPU.

GPUs, by design, process work in parallel, but the transformer architecture is auto-regressive: to generate the next token, the model has to attend to all of the tokens that came before it. Transformers don't let you predict the next n tokens in parallel. Ultimately, this makes the generation phase of LLMs quite slow, since each new token must be produced sequentially. Speculative decoding is an optimization technique that aims to solve this issue.

Each forward pass produces a new token generated by the LLM
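To make the sequential nature concrete, here is a minimal greedy decode loop written with Hugging Face transformers (my own illustration, not from the original post; gpt2 is just a small placeholder model). Each iteration runs a full forward pass and produces exactly one new token.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works for illustration; gpt2 is a placeholder choice
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

# One forward pass per new token: the model must attend to every previous token
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))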

There are a few different methods for speculative decoding. The technique described in this article uses the two-model approach.

Speculative Decoding

Speculative Decoding works by having two models, a large main model and a smaller assistant model. The smaller assistant model first generates a sequence of n tokens. The main model then validates the sequence of tokens in a single forward pass.

The idea is that because the assistant model is small, it will produce tokens quickly. The main model, being larger and more accurate, does not need to generate every single token; it just needs to validate the tokens the assistant model has generated.

For example, let's say the assistant model produces the following 5 tokens.

The assistant model auto-regressively produces the tokens, while the main model verifies all tokens in one shot

The main model will perform a single forward pass over all 5 tokens to validate the correctness of each one. If one of the tokens is incorrect according to the main model, it discards that token and everything after it. The main model then auto-regressively fills in the rest of the sequence with the correct tokens.

Main model discards the outputs of the sequence starting from the first incorrect token

Using speculative decoding, you are guaranteed to get the exact same output as if you had run just the main model on its own, because the main model validates every single token and replaces the tokens it believes are incorrect. In an ideal scenario, where the assistant model produces the majority of tokens correctly, the main model only has to quickly validate them, which results in faster end-to-end generation.
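To make the verify-and-accept step concrete, here is a simplified greedy sketch of one speculative decoding round. This is my own illustration, not vLLM's implementation: draft_model and main_model are hypothetical callables, and real systems use probability-based acceptance rather than exact greedy matching.

def greedy_speculative_step(draft_model, main_model, tokens, k=5):
    """One round of greedy speculative decoding (simplified sketch).

    draft_model(seq) -> the draft model's greedy next token for seq
    main_model(seq, proposed) -> list of k+1 greedy next tokens from the main
        model, where entry i is its prediction given seq + proposed[:i]
        (all k+1 predictions come from a single forward pass)
    """
    # 1. The small draft model proposes k tokens auto-regressively (cheap).
    proposed = []
    for _ in range(k):
        proposed.append(draft_model(tokens + proposed))

    # 2. The main model scores every proposed position in one forward pass.
    main_preds = main_model(tokens, proposed)

    # 3. Accept proposals until the first disagreement; at that position take
    #    the main model's token instead and discard the rest of the draft.
    accepted = []
    for i, token in enumerate(proposed):
        if token == main_preds[i]:
            accepted.append(token)
        else:
            accepted.append(main_preds[i])
            break
    else:
        # Every proposal matched, so the main model's last prediction is a
        # free bonus token.
        accepted.append(main_preds[k])

    return tokens + accepted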

Speculative decoding in practice

Many popular inference services like TGI and vLLM support speculative decoding out of the box. The challenge is picking the right pair of assistant and main model. In general, it's good to pick models that share the same architecture and vocabulary.

For this tutorial, we'll be using Llama 3.1 70B Instruct as the main model and Llama 3.1 8B Instruct as the assistant model. vLLM provides support for both of those models, so we'll be using that as the inference service.

Installing Python requirements and importing packages

!pip install vllm==0.5.4

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

Downloading the models

model_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",             # main model
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assistant (draft) model
    num_speculative_tokens=5,    # how many draft tokens the assistant proposes per step
    trust_remote_code=True,
    tensor_parallel_size=4,      # shard the 70B model across 4 GPUs
    max_num_seqs=8,
    dtype="half",                # load weights in FP16
    use_v2_block_manager=True,
    enforce_eager=True,
)

llm_engine = AsyncLLMEngine.from_engine_args(model_args)
  • In the engine arguments, model specifies the main model to use while speculative_model is the assistant; num_speculative_tokens controls how many draft tokens the assistant proposes per step.
  • Since the 70B model takes up ~140 GB of VRAM, I decided to use four GPUs (A100 80GB), so tensor_parallel_size is set to 4.
  • Using dtype="half" loads the weights in FP16 precision, which takes half the memory of full FP32 precision.
  • The arguments use_v2_block_manager and enforce_eager are required for this speculative setup to function correctly.
  • The main reason to use vLLM's AsyncLLMEngine instead of the regular LLM class is that the async engine can handle concurrent requests; for comparison, a sketch of the simpler offline LLM usage follows this list.
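For comparison, here's roughly what the same setup looks like with vLLM's synchronous LLM class for offline, batch-style generation. This is a sketch that mirrors the engine arguments above rather than code from the original post.

from vllm import LLM, SamplingParams

# Offline, synchronous variant of the same speculative decoding setup
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
    trust_remote_code=True,
    tensor_parallel_size=4,
    dtype="half",
    use_v2_block_manager=True,
    enforce_eager=True,
)

outputs = llm.generate(
    ["Why is the earth flat?"],
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)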

Running inference with speculative decoding

import uuid
import asyncio

prompt = "Why is the earth flat?"
stream = True
max_tokens = 512
model_input = {"max_tokens": max_tokens}
sampling_params = SamplingParams(**model_input)

# Every request submitted to the engine needs a unique request id
idx = str(uuid.uuid4().hex)

async def stream_tokens():
    full_text = ""
    # llm_engine.generate returns an async generator that yields partial
    # outputs as new tokens are produced
    async for request_output in llm_engine.generate(prompt, sampling_params, idx):
        if len(request_output.outputs) > 0:
            text = request_output.outputs[0].text
            delta = text[len(full_text):]  # print only the newly generated text
            full_text = text
            print(delta, end='', flush=True)
    print()

# Top-level await works in a notebook; in a script wrap this in asyncio.run()
await stream_tokens()
  • At the top, we specify the general parameters such as prompt, max_tokens, and stream. To easily visualize the performance gains we'll set stream=True and print tokens as they arrive.
  • In the stream_tokens function, llm_engine.generate returns an async generator. Since our LLM engine is asynchronous, we loop over it with async for, compute the newly generated portion of the text, and print it.
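Since the AsyncLLMEngine can handle concurrent requests, you can also simulate multiple simultaneous users by firing several prompts at once with asyncio.gather. The sketch below reuses the llm_engine and sampling_params defined above; it's my own illustration, not the harness behind the benchmark numbers in the next section.

async def run_request(prompt: str) -> str:
    request_id = str(uuid.uuid4().hex)
    final_output = None
    # Drain the async generator; the last yielded item holds the full text
    async for request_output in llm_engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return final_output.outputs[0].text

async def run_concurrent(prompts):
    # Submit all requests at once; vLLM batches them internally
    return await asyncio.gather(*(run_request(p) for p in prompts))

results = await run_concurrent([
    "Why is the earth flat?",
    "Explain speculative decoding in one paragraph.",
    "Write a haiku about GPUs.",
])
for text in results:
    print(text[:80], "...")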

Performance Results

vLLM Llama 3.1 70B Benchmarks

TPS

  • Under lower concurrency we see a nearly 2X improvement in tokens per second (TPS).
  • Under higher concurrency the gain is about 30%.

Generation Time

  • Similar to the TPS gains, end-to-end latency is roughly cut in half when using speculative decoding.
  • The generation time does slowly creep up as the number of concurrent users increases, but it's still much lower than without spec-dec.

TTFT

  • TTFT (time to first token) is actually worse with speculative decoding.
  • This is because, before the main model can validate any tokens, the assistant model has to generate them first. That extra work at the start of a request pushes the TTFT higher.
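To get a rough feel for these numbers yourself, you can time the streaming generator directly. The sketch below (my own, not the benchmark setup used for the charts above) records time to first token and tokens per second for a single request, reusing llm_engine and sampling_params from earlier.

import time

async def measure(prompt: str):
    request_id = str(uuid.uuid4().hex)
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    async for request_output in llm_engine.generate(prompt, sampling_params, request_id):
        if ttft is None and len(request_output.outputs[0].token_ids) > 0:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens = len(request_output.outputs[0].token_ids)
    total = time.perf_counter() - start
    print(f"TTFT: {ttft:.2f}s | tokens: {n_tokens} | TPS: {n_tokens / total:.1f}")

await measure("Explain why the earth is not flat.")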

Potential challenges

Speculative decoding can seem like a home run, especially since it's so easy to set up using vLLM. However, there are a few things to keep in mind when using this inference acceleration technique.

  1. Picking the right assistant and main model

Not all large language models are compatible with each other when using spec-dec. For example, you can't use Llama 2 7B as the assistant model and Llama 3 70B as the main model.

This is because both the assistant model and the main model must share the same vocabulary. Llama 2 was trained with a vocab size of 32K tokens, while Llama 3 has 128K tokens in its vocab.
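A quick way to sanity-check a candidate pair is to compare their tokenizers directly. Here's a small sketch using Hugging Face transformers; the model names are just the ones used in this post.

from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
main_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")

# The draft and main models need matching vocabularies for token-level verification
print(len(draft_tok), len(main_tok))  # vocab sizes should be identical
assert draft_tok.get_vocab() == main_tok.get_vocab(), "Vocabularies differ"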

  2. The size of the assistant model matters

Since speculative decoding relies on the assistant model to do most of the heavy lifting, choosing a faster assistant model leads to higher performance.

Ideally, the assistant model should have very few parameters so that it can generate tokens quickly. The tradeoff is that a smaller model tends to be less accurate, which leads me to the third point.

  3. Main model acceptance rate

Having a high acceptance rate from the main model is crucial. If the main model keeps rejecting most of the tokens the assistant produces, it (the larger model) has to auto-regressively generate the rest of the sequence itself.

This can result in a big performance hit to TPS and overall generation time, since the same work has to be done twice. It's best to experiment with various models to see which pair has the highest acceptance rate.

For the Llama 3.1 70B and 8B pair in the example above, the acceptance rate of the main model is ~70%.
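As a rough back-of-the-envelope estimate, if each draft token is accepted independently with probability α and the assistant proposes k tokens per round, the expected number of tokens produced per main-model forward pass is (1 - α^(k+1)) / (1 - α), the standard approximation from the speculative decoding literature. The independence assumption is a simplification, but it gives a feel for the speedup.

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per main-model forward pass, assuming each of
    the k draft tokens is accepted independently with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# ~70% acceptance with 5 draft tokens -> roughly 3 tokens per main-model pass
print(expected_tokens_per_pass(0.7, 5))  # ≈ 2.94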

Conclusion

When it comes to LLM inference, speed is a major factor because nobody wants to keep their users waiting. A technique like speculative decoding can be a great way to accelerate generation while maintaining high output quality.

In this blog post, we covered the basics of how speculative decoding works and how to implement it using vLLM. Although it's not a perfect solution for every LLM use case, it's always good to have it in your toolbox.

I hope you found this post interesting. Thanks for reading!



Images

If not otherwise stated, all images are created by the author.

Tags: Artificial Intelligence Large Language Models Machine Learning Software Development Speculative Decoding
