Evaluations with Chat Formats


"Building solid evals should be the starting point for any LLM-based system or product (as well as conventional machine learning systems)." – Eugene Yan, link

tl;dr

Chat models are typically fine-tuned on datasets formatted with a prompt template. These chat templates are programmed recipes that convert a chat conversation into a single tokenizable string. At prediction time, it's standard practice to match an LLM's expected chat format – not doing so is often noted as causing performance degradations [1]. But do we actually see these degradations on evaluation benchmarks?

NB: This blog post is intended for readers with basic familiarity with Python programming and neural language modeling.

Introduction

If you've built on top of OpenAI's chat API, the following code will be recognizable. Under the hood, this input is transformed into one tokenizable string via the ChatML format:

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
  ]
)
"<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Who won the world series in 2020?<|im_end|>
<|im_start|>assistant
The Los Angeles Dodgers won the World Series in 2020.<|im_end|>
<|im_start|>user
Where was it played?<|im_end|>
<|im_start|>assistant"

It turns out there's a wide variety of chat templates across the LLM research community. Take an open-source model like Mixtral-8x7B-Instruct-v0.1. Its format looks wildly different from gpt-3.5-turbo's above:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "Write me a haiku about coding."},
]
tokenizer.apply_chat_template(chat, tokenize=False)
"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] Write me a haiku about coding. [/INST]"

Why bother with chat templates? Well, it's strongly advised to match the expected chat template at prediction time (for instance, see the info on "Instruction format" at the repo for Mixtral-8x7B-Instruct-v0.1). And, with proprietary chat models like gpt-3.5-turbo, chat templates are often applied behind the scenes of an endpoint whether you like it or not!
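
To make this concrete, here's a minimal sketch contrasting an unformatted prompt with the same prompt rendered through a model's chat template (the prompt string and variable names are just illustrations):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

prompt = "Write me a haiku about coding."

# Unformatted: the raw prompt, without the control tokens the model was fine-tuned on
unformatted = prompt

# Formatted: the same prompt wrapped in the model's expected chat template;
# add_generation_prompt=True signals that the assistant should respond next
formatted = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
# e.g. "[INST] Write me a haiku about coding. [/INST]" (exact string depends on the template)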

But how do we know whether chat formatting is indeed improving our performance? Enter LM evals.

LM evals

Evaluations are used to measure an AI/ML model's performance, and they can take many shapes and sizes. Evals include two core components: a dataset curated for a specific task and associated metric(s) measuring the modeling performance.
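
As a toy illustration of that dataset-plus-metric structure (all names below are hypothetical, not tied to any particular framework):

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the model output matches the reference exactly, else 0.0
    return float(prediction.strip() == reference.strip())

def run_eval(generate, dataset) -> float:
    # Average the metric over every example in the task's dataset
    scores = [exact_match(generate(ex["prompt"]), ex["reference"]) for ex in dataset]
    return sum(scores) / len(scores)

# Toy usage with a stand-in "model"
toy_dataset = [{"prompt": "2 + 2 =", "reference": "4"}]
print(run_eval(lambda prompt: "4", toy_dataset))  # 1.0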

Generative LM evals carry some additional nuances. For example, different frameworks measure text generation performance in different ways – even varying for the same eval (reference). When comparing scores across studies, it's therefore very important to confirm that the results were computed with the same code and config to avoid any errant analysis.

The superb Instruction-Following Evaluation (IFEval) [2] is used for our testing here. This eval includes 541 prompts that measure a language model's ability to follow verifiable natural language instructions. Examples of these verifiable instructions include:

"Write 450 to 500 words", "your entire output should be in JSON output", "include a title, and put it into two square brackets such as [[ title ]]"

For a given response and a verifiable instruction, we examine whether the instruction has been followed, using the following four metrics:

  1. Prompt-level strict-accuracy: The percentage of prompts for which all verifiable instructions are followed.
  2. Inst-level strict-accuracy: The percentage of verifiable instructions that are followed.
  3. Prompt-level loose-accuracy: Prompt-level accuracy computed with the loose criterion.
  4. Inst-level loose-accuracy: Instruction-level accuracy computed with the loose criterion.

Here, the loose criterion applies a few simple transformations to the response (such as stripping markdown markers and removing an introductory or closing line) before checking each instruction.

The average of these four metrics is reported here (Table 1), primarily so that a single number captures the most diverse signal available.
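
For concreteness, here's a rough sketch of how these four aggregates (and their average) can be computed from per-prompt results; the results structure below is a made-up stand-in, not IFEval's actual data format:

# One entry per prompt: a boolean per verifiable instruction,
# under both the strict and the loose criteria
results = [
    {"strict": [True, True], "loose": [True, True]},   # prompt 1: two instructions
    {"strict": [True, False], "loose": [True, True]},  # prompt 2: two instructions
]

def prompt_level(results, key):
    # fraction of prompts where *all* verifiable instructions are followed
    return sum(all(r[key]) for r in results) / len(results)

def inst_level(results, key):
    # fraction of individual verifiable instructions that are followed
    flags = [followed for r in results for followed in r[key]]
    return sum(flags) / len(flags)

metrics = {
    "prompt_level_strict": prompt_level(results, "strict"),
    "inst_level_strict": inst_level(results, "strict"),
    "prompt_level_loose": prompt_level(results, "loose"),
    "inst_level_loose": inst_level(results, "loose"),
}
print(metrics)                               # 0.5, 0.75, 1.0, 1.0
print(sum(metrics.values()) / len(metrics))  # 0.8125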

IFEval is an ideal test for exploring the impacts of chat templates, since the test is specifically designed to measure instruction-following capabilities on chat data. Another interesting line of questioning is whether chat templating positively impacts evals that aren't as well suited for chat data – a topic left for future research.

Chat templates for IFEval

EleutherAI's lm-eval is the de facto open-source package for LM evaluation. Since chat templating for more models is an oft-requested addition to the library, it was easy to sync up with other developers who wanted to work on this feature.
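
The idea behind the feature is simple: render each IFEval prompt through the model's chat template before generation, then score the completions with the strict/loose checkers as usual. A rough sketch of that preprocessing step (not lm-eval's actual internals) might look like:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

def to_chat_prompt(prompt: str) -> str:
    # Wrap a plain IFEval prompt as a single user turn in the model's template
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )

# Toy stand-in for real IFEval prompts
ifeval_prompts = ["Write a summary of at least 300 words. Do not use any commas."]
chat_prompts = [to_chat_prompt(p) for p in ifeval_prompts]
# chat_prompts are what get sent to the model; the completions are then scored
# against the verifiable instructions exactly as before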
