Unsupervised LLM Evaluations

Author:Murphy  |  View: 20400  |  Time: 2025-03-22 19:48:34
> Evaluating AI-generated outputs is critical for building robust applications of large language models because it allows complex AI applications to be split into simple stages with built-in error control.
>
> It is relatively straightforward to evaluate generative outputs in a supervised mode, where the "right answers" can be computed or hinted by human evaluators.
>
> At the same time, in many practical LLM applications the supervised approach is too restrictive, and there is a need for evaluations capable of tackling open-ended questions. The simplest way to build an unsupervised evaluator is to ask an LLM to evaluate itself. However, the ability of generative models to detect errors in their own output is not well understood.
>
> **We demonstrate that the quality of self-evaluations can be improved with iterative self-reflection**. Similar to the "Chain of Thought" technique, this method trades compute at inference for the robustness of the final result.

Link to Google Colab notebook with examples:

https://colab.research.google.com/drive/1q_dChQBMbnUXZ377JVwYsjvn7lZ_7qlZ?usp=sharing

Image source: Flux 1. Pro model prompted for "robot evaluating other robots"

Introduction

When building processing pipelines using large language models, an often-mentioned issue is the quality of generated outputs. If a good evaluation process is in place, it can highlight cases of poor performance and trigger LLM fine-tuning, prompt adjustments, escalation to human agents – or all these actions at once.

Here is a typical workflow that uses evaluations for training: an LLM goes over the input dataset, and any output discrepancies detected by the evaluator are used to generate synthetic data to fine-tune the model. The application is deployed only when the target quality metrics are met.

Image by the author: Evaluation loop for LLM fine-tuning
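The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the `generate` and `evaluate` callables, the `target_pass_rate` threshold, and the choice to treat an empty dataset as "not ready" are all assumptions made for the example.

```python
from typing import Callable, Iterable

def evaluation_loop(
    inputs: Iterable[str],
    generate: Callable[[str], str],        # hypothetical: wraps the LLM under test
    evaluate: Callable[[str, str], bool],  # hypothetical: True if the output is acceptable
    target_pass_rate: float = 0.95,        # assumed quality threshold
) -> tuple[bool, list[tuple[str, str]]]:
    """Run the model over the dataset and collect outputs the evaluator rejects.

    Returns (ready_to_deploy, failures). In the workflow above, the failures
    would seed synthetic data for the next fine-tuning round.
    """
    failures: list[tuple[str, str]] = []
    total = 0
    for text in inputs:
        total += 1
        output = generate(text)
        if not evaluate(text, output):
            failures.append((text, output))
    if total == 0:
        # Nothing was evaluated, so there is no evidence the target is met.
        return False, failures
    pass_rate = (total - len(failures)) / total
    return pass_rate >= target_pass_rate, failures

# Toy stand-ins: the "model" upper-cases its input, the "evaluator" rejects empty output.
ready, failures = evaluation_loop(
    ["a", "", "c"], str.upper, lambda _inp, out: bool(out), target_pass_rate=0.9
)
```

In a real pipeline the caller would repeat this function after each fine-tuning round and deploy only once it reports that the target is met.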

Using LLM evaluators in production is very similar – except that detected discrepancies are usually sent to a human agent to ensure the workflow can continue despite raising an error flag.

However, building a good LLM evaluator is not trivial. The complexity of this problem stems from two practical restrictions:

First, it is highly desirable to minimize human involvement in evaluations. For example, imagine a chatbot interacting with a user and missing a common colloquial pattern of ellipsis (using one word instead of the full output sentence):

Bot: Is that correct?

User: correct

Bot: Sorry, I didn't get that. Please try again.

User: yes it is correct

Given this dialog section, a human can easily spot the deficiency in the chatbot's response and suggest a fine-tuning course. However, in order to find this problem, a human evaluator would have to read the entire dialog (which can be very long). This approach does not work at scale, which means we should strive for evaluation without humans.

Second, the process of judging the LLM output without knowing the "ground truth" is comparable in complexity to the original task. This means a state-of-the-art LLM can (at most) employ an evaluator with similar capabilities (most likely itself), thus raising questions about the validity of such evaluation.

Supervised evaluations

If we look at the well-studied ways to evaluate LLMs today, we will notice they mostly center on supervised or semi-supervised use cases.

If the training dataset comes with "ground truth" answers, evaluation becomes trivial – and can even drive optimization frameworks like DSPy. The same is true when testing an enterprise LLM app against historical cases handled by human agents, where the "ground truth" equates to the judgments of those agents.

Another opportunity to check the output against the "ground truth" comes when the LLM output can be formally verified on its own – such as computer code that can be compiled and tested. Despite the fact that a computer program can be written in many different ways, the correct code should pass the tests regardless of the chosen implementation path.
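As a small illustration of this point, the sketch below checks two deliberately different implementations of the same function against one shared test suite. The palindrome task, the test cases, and the `passes_tests` helper are hypothetical and chosen only to keep the example short.

```python
def is_palindrome_slicing(text: str) -> bool:
    # Implementation 1: compare the cleaned sequence with its reverse.
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]

def is_palindrome_two_pointers(text: str) -> bool:
    # Implementation 2: walk inward from both ends.
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    left, right = 0, len(cleaned) - 1
    while left < right:
        if cleaned[left] != cleaned[right]:
            return False
        left += 1
        right -= 1
    return True

# The tests play the role of "ground truth": they do not care how the code is written.
TEST_CASES = [
    ("A man, a plan, a canal: Panama", True),
    ("hello", False),
    ("", True),
]

def passes_tests(candidate) -> bool:
    """Accept a candidate only if it returns the expected result for every case."""
    for text, expected in TEST_CASES:
        try:
            if candidate(text) != expected:
                return False
        except Exception:
            # Generated code that crashes counts as a failed evaluation.
            return False
    return True
```

Both implementations are accepted by `passes_tests`, while a candidate that always returns `True` is rejected by the `"hello"` case.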

Cases where the generative output cannot be formally verified usually require adding a human into the loop. For example, RLHF can be used to rate LLM outputs according to ordinal human preferences and thus steer the network toward complicated and nuanced policies.

Unsupervised self-evaluations

Meanwhile, there are many open-ended evaluation cases where the "ground truth" approach cannot be implemented, and RLHF is too lengthy or too costly. This explains the interest in unsupervised self-evaluation techniques.

So, assuming we have an open-ended LLM evaluation question that would normally require human involvement – like "how can this chatbot improve" – what can be done to automate it?

An economical evaluation harness can be built if we assume that contemporary large language models with rich semantic representations are inherently capable of self-evaluations. This means you can simply ask the model to evaluate its own output, or use another LLM for the same task to avoid cross-contamination in their training sets.

Unfortunately, a naïve implementation of LLM self-judging may fail. There are two possible reasons for this.

The first reason is the most obvious: an LLM is not guaranteed to produce coherent outputs on subjects not reflected in its training set or unreachable by retrieval tools (like web search). For example, consider this Google Maps snippet, which depicts an unincorporated town called Mineral in California and a small body of water (Martin Creek) that runs through it.

Mineral, CA vicinity (source: Google Maps)

The town of Mineral does not have its own Wikipedia page, and most LLMs do not use GIS data in training, so when asked which creek passes through the town, an LLM can easily hallucinate:

GPT-4o hallucination example

Mill Creek is a real water body that skirts the (much larger) census-designated place of which Mineral is a part. However, Mill Creek does not pass through the town of Mineral proper, so GPT-4o's answer is incorrect.

Intuitively, it feels like a failure of plain LLM inference, caused by a subject that lies outside the model's knowledge and cannot be reached via common tools (like web search), should not be easy for the same model to detect.

The second reason for failing self-evals is somewhat more subtle and is driven by the probabilistic nature of LLM outputs. It is easy to demonstrate in cases where evaluation should include multiple observations. For example, consider this dialog between a user and a chatbot:

Bot: I am a chatbot to help you find the best mobile phone plan.

User: I need at least 500Gb of data and priced below $50

Bot: The "Ultimate Freedom" plan delivers uncapped Internet access for $79.99

User: This might actually work.

Bot: Do you want to change the current plan?

User: Is "Ultimate Freedom" domestic only ? Can I have international roaming?

Bot: Sorry, I didn't get that. Please try again.

User: I am done for now. Please do not make any changes.

Bot: I am a chatbot to help you find the best mobile phone plan.

If we repeatedly ask an LLM to list the improvements that can be made to the bot based on this dialog, we will notice the output list is not stable between runs. One version may look like this:

{["Ensure the bot can handle follow-up questions after suggesting a plan.", 
"Improve bot's recognition of farewell messages.", 
"Make sure the bot can assist with additional features or package inquiries."]}    

While another may look like this:

{["Ensure the bot can switch contexts and answer follow-up questions.", 
  "Enable the bot to recognize when a conversation ends.", 
  "Improve the bot's ability to provide options for additional services", 
  "Support input about cost limitations, e.g. users stating a desired price."]}

Even though there is significant overlap between these two answers, it is clear that an exhaustive list of suggestions (similar to what a human expert would produce) was not generated in either case.
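One rough way to quantify this run-to-run instability is to check which suggestions in one run have no close counterpart in the other. The sketch below uses word-level Jaccard overlap as a crude stand-in for semantic similarity; the `tokens`, `jaccard`, and `unmatched` helpers and the 0.3 threshold are assumptions made for illustration, and an embedding-based comparison would likely be more reliable.

```python
import re

def tokens(text: str) -> set[str]:
    # Lower-cased alphabetic words; punctuation and digits are ignored.
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def unmatched(new_items: list[str], old_items: list[str], threshold: float = 0.3) -> list[str]:
    """Items in new_items with no sufficiently similar counterpart in old_items."""
    return [n for n in new_items if all(jaccard(n, o) < threshold for o in old_items)]

# The two evaluation runs quoted above.
run_a = [
    "Ensure the bot can handle follow-up questions after suggesting a plan.",
    "Improve bot's recognition of farewell messages.",
    "Make sure the bot can assist with additional features or package inquiries.",
]
run_b = [
    "Ensure the bot can switch contexts and answer follow-up questions.",
    "Enable the bot to recognize when a conversation ends.",
    "Improve the bot's ability to provide options for additional services",
    "Support input about cost limitations, e.g. users stating a desired price.",
]

new_in_b = unmatched(run_b, run_a)
```

The suggestion about cost limitations shares almost no vocabulary with the first run, so it is reported as new. Because the comparison is purely lexical, paraphrases of the same idea can also be reported as new, which is why this is only a first approximation.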

The unexpected power of self-reflection

Once we outline the typical failure modes for evaluations, it may seem like using an LLM to judge itself is a bad idea. After all, this sounds like asking a diligent student to re-check their own answers. Since a good student does not make many typos, re-checking merely reflects existing knowledge and should not result in improvements.

However, this is where our intuition about LLMs may go awfully wrong.

In fact, most LLMs are capable of corrective self-evaluation, even when the subject lies outside of their knowledge base.

To illustrate this phenomenon, let us return to the example of GPT-4o hallucinating about the body of water crossing the town of Mineral, CA. Interestingly enough, this particular hallucination can be dispelled during self-evaluation:

Self-evaluation in GPT-4o is capable of reversing hallucinations

So where is the magic?

In this example, the LLM does not have the knowledge or the tools to get the correct answer, so it hallucinates the "most plausible" completion. However, when asked to evaluate itself, it arrives at the conclusion that the facts it can access do not corroborate its previous statement. Even though GPT-4o does not know the correct answer, it can dismiss the incorrect one.

A more sophisticated model (like GPT-4o1) may be slightly harder to treat in the same way because it tends to produce more nuanced responses:

Hallucination in GPT-4o1 is more nuanced.

Instead of hallucinating a completion on the subject it cannot verify, GPT-4o1 may choose to answer the question it was never asked – like "Which primary body of water runs near Mineral, CA?". This evasion means that a direct self-evaluation prompt along the lines of "evaluate as True or False" may fail.

However, a more deliberative way of asking for self-evaluation can still be successful, even if it takes multiple iterations.

This ability of LLMs to self-reflect in an iterative way is, of course, well-known and is somewhat taken for granted in applications like code generation. Here we are just extending the same technique to self-evaluation.
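A deliberative, multi-round self-evaluation of this kind might look like the sketch below. It is not the prompt used in the screenshots: the prompt wording, the `VERDICT:` convention, and the `ask` callable (any function that sends a prompt to an LLM and returns its text reply) are assumptions introduced for the example.

```python
import re
from typing import Callable, Optional

VERDICT_RE = re.compile(r"VERDICT:\s*(TRUE|FALSE)", re.IGNORECASE)

def parse_verdict(reply: str) -> Optional[bool]:
    """Return the last explicit verdict in the reply, or None if there is none."""
    matches = VERDICT_RE.findall(reply)
    if not matches:
        return None
    return matches[-1].upper() == "TRUE"

def iterative_self_check(
    question: str,
    answer: str,
    ask: Callable[[str], str],  # hypothetical LLM wrapper: prompt in, text out
    max_rounds: int = 3,
) -> Optional[bool]:
    """Ask the model to judge an answer, re-prompting when it evades.

    Returns True or False once a verdict is given, or None if the model
    never commits to one within max_rounds.
    """
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "List the facts you can actually verify about this answer, then end "
        "with a line 'VERDICT: TRUE' or 'VERDICT: FALSE'."
    )
    for _ in range(max_rounds):
        reply = ask(prompt)
        verdict = parse_verdict(reply)
        if verdict is not None:
            return verdict
        # No verdict: feed the evasive reply back and ask a narrower question.
        prompt = (
            f"Question: {question}\n"
            f"Proposed answer: {answer}\n"
            f"Your previous assessment: {reply}\n"
            "That assessment did not judge the proposed answer itself. "
            "Does it answer the exact question asked? End with a line "
            "'VERDICT: TRUE' or 'VERDICT: FALSE'."
        )
    return None
```

Returning `None` instead of forcing a verdict keeps the "could not decide" case visible, so it can be escalated to a human rather than silently counted as a pass.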

The "expected" power of memoization

The same idea of iterative reflection is also applicable to LLM tasks that tend to produce incomplete outputs. If we revisit the bot dialog example and allow an LLM to iterate on a memoized list of improvements, we will observe the model is rarely "satisfied" with the result on the first attempt.

In other words, if we formulate a prompt like this:

iterative_prompt = """
Consider the following dialog between the user and the chatbot.
The bot's goal is to suggest a cheaper mobile plan based on the information the user provides.
The user's responses are not guaranteed to be consistent or coherent at all times.

This dialog was evaluated by an LLM and this evaluation is provided below. 

Your job is to assess the quality of evaluation and respond with "success"=True and repeat the original action list if there is nothing significant to add.
If there is something missing in evaluation, respond with "success"=False and a new list of action items to create better user experience integrating the old list with new suggestions. Make sure the list items are unique and not repetitive.

"""

Then it would typically take 2–4 passes over the list of improvements until the LLM converges on recommendations and declares the evaluation task to be successful.
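A driver loop for such a prompt could be sketched as follows. The details are assumptions rather than the notebook's actual code: `ask` is a hypothetical wrapper around an LLM call, the model is assumed to reply with JSON containing the keys `"success"` and `"actions"`, and `max_passes` is an arbitrary safety limit.

```python
import json
from typing import Callable

def refine_action_list(
    dialog: str,
    initial_actions: list[str],
    ask: Callable[[str], str],  # hypothetical LLM wrapper: prompt in, JSON text out
    iterative_prompt: str,      # the instruction text, e.g. the prompt shown above
    max_passes: int = 5,        # assumed cap in case the model never converges
) -> tuple[list[str], int]:
    """Iterate on a memoized action list until the model reports success.

    Returns the final list and the number of passes used.
    """
    actions = list(initial_actions)
    for n in range(1, max_passes + 1):
        prompt = (
            f"{iterative_prompt}\n"
            f"Dialog:\n{dialog}\n\n"
            f"Evaluation:\n{json.dumps(actions)}"
        )
        # Assumed reply shape: {"success": bool, "actions": [str, ...]}
        reply = json.loads(ask(prompt))
        if reply["success"]:
            return actions, n
        # Keep the merged list for the next pass, dropping exact duplicates
        # while preserving order.
        actions = list(dict.fromkeys(reply["actions"]))
    return actions, max_passes
```

The memoized list is the only state carried between passes, so each call sees everything found so far and only needs to decide whether something significant is still missing.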

         
