The Three Essential Methods to Evaluate a New Language Model


What is this about?

New LLMs are released every week, and if you're like me, you might ask yourself: Does this one finally fit all the use cases I want to utilise an LLM for? In this tutorial, I will share the three techniques I use regularly to evaluate new LLMs. None of them are new (in fact, I will refer to blog posts I have written previously), but by bringing them together I save a significant amount of time whenever a new LLM is released. I will demonstrate them by testing the new OpenChat model.

Why is this important?

When it comes to new LLMs, it's important to understand their capabilities and limitations. Unfortunately, figuring out how to deploy the model and then systematically testing it can be a bit of a drag. This process is often manual and can consume a lot of time. However, with a standardised approach, we can iterate much faster and quickly determine whether a model is worth investing more time in, or if we should discard it. So, let's get started.


Getting Started

There are many ways to utilise an LLM, but when we distil the most common uses, they often pertain to open-ended tasks (e.g. generating text for a marketing ad), chatbot applications, and Retrieval Augmented Generation (RAG). Correspondingly, I employ relevant methods to test these capabilities in an LLM.

0. Deploying the model

Before we get started with the evaluation, we first need to deploy the model. I have boilerplate code ready for this, where we can just swap out the model ID and the instance type to deploy to (I'm using Amazon SageMaker for model hosting in this example), and we're good to go:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
  role = sagemaker.get_execution_role()
except ValueError:
  iam = boto3.client('iam')
  role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

model_id = "openchat/openchat_8192"
instance_type = "ml.g5.12xlarge" # 4 x 24GB VRAM
number_of_gpu = 4
health_check_timeout = 600 # how much time do we allow for model download

# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID': model_id,
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(7000),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(8192),  # Max length of the generation (including input text)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
  image_uri=get_huggingface_llm_image_uri("huggingface",version="0.8.2"),
  env=hub,
  role=role, 
)

model_name = model_id.split("/")[-1].replace(".", "-")
endpoint_name = model_name.replace("_", "-")

# deploy model to Sagemaker Inference
predictor = huggingface_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type, 
  container_startup_health_check_timeout=health_check_timeout,
  endpoint_name=endpoint_name,
)

# send request
predictor.predict({
  "inputs": "Hi, my name is Heiko.",
})

It's worth noting that we can utilise the new Hugging Face LLM Inference Container for SageMaker here, because the new OpenChat model is based on the LLaMA architecture, which this container supports.

1. Playground

Testing a few prompts from the notebook can be cumbersome, and it may also discourage non-technical users from experimenting with the model. A much more effective way to familiarise yourself with the model, and to encourage others to do the same, is to build a playground. I have previously described how to create such a playground easily in this blog post, and with the code from that post we can get one up and running quickly.
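To give an idea of what such a playground can look like, here is a minimal sketch using Streamlit and the SageMaker runtime client. The endpoint name ("openchat-8192", as derived in the deployment code above) and the generation parameters are illustrative assumptions; the referenced blog post may structure things differently.

import json
import boto3
import streamlit as st

# Endpoint name as created in the deployment step above (assumption)
ENDPOINT_NAME = "openchat-8192"

sm_runtime = boto3.client("sagemaker-runtime")

st.title("LLM Playground")
prompt = st.text_area("Prompt", height=150)

if st.button("Generate") and prompt:
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512, "temperature": 0.7},
    }
    response = sm_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    # The LLM Inference Container returns a list of dicts with "generated_text"
    st.write(result[0]["generated_text"])

Saved as playground.py, this runs with streamlit run playground.py and gives anyone with access a simple UI for sending prompts to the endpoint.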

Once the playground is established, we can introduce some prompts to evaluate the model's responses. I prefer using open-ended prompts, where I pose a question that requires some degree of common sense to answer:

How can I improve my time management skills?

What if the Suez Canal had never been constructed?

Both responses appear promising, suggesting that it could be worthwhile to invest additional time and resources in testing the OpenChat model.

2. Chatbot

The second thing we want to explore is the model's chatbot capabilities. Unlike in the playground, where every prompt is stateless, we want to understand how well the model can "remember" context within a conversation. In this blog post, I described how to set up a chatbot using the Falcon model. It's a simple plug-and-play operation: by changing the SageMaker endpoint, we can point it at the new OpenChat model.
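To illustrate the core idea, here is a minimal sketch of how such a chatbot keeps context: it stores the conversation history and prepends it to every new request. The prompt template, endpoint name, and parameters are illustrative assumptions, not the exact implementation from the referenced blog post.

import json
import boto3

ENDPOINT_NAME = "openchat-8192"  # endpoint created in the deployment step above (assumption)
sm_runtime = boto3.client("sagemaker-runtime")

class SimpleChat:
    """Keeps the conversation history and resends it with every request."""

    def __init__(self):
        self.history = []  # list of (user, assistant) turns

    def _build_prompt(self, user_message):
        # Illustrative prompt template - the model's actual chat format may differ
        prompt = ""
        for user, assistant in self.history:
            prompt += f"User: {user}\nAssistant: {assistant}\n"
        prompt += f"User: {user_message}\nAssistant:"
        return prompt

    def ask(self, user_message):
        payload = {
            "inputs": self._build_prompt(user_message),
            "parameters": {"max_new_tokens": 256, "temperature": 0.7},
        }
        response = sm_runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        answer = json.loads(response["Body"].read())[0]["generated_text"].strip()
        self.history.append((user_message, answer))
        return answer

chat = SimpleChat()
print(chat.ask("Hi, my name is Heiko."))
print(chat.ask("What is my name?"))  # answerable only because the history is passed along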

Let's see how it fares:

The performance as a chatbot is quite impressive. There was one instance, however, where OpenChat abruptly tried to terminate the conversation, cutting off mid-sentence. This is actually not uncommon; we usually just don't notice it with other chatbots because they use dedicated stop words to force the model to stop generating text. That the issue surfaced in my app probably comes down to how stop words are handled in my application.
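For reference, the Hugging Face LLM Inference Container accepts a list of stop sequences in the request parameters, which lets generation halt cleanly at the end of a turn. A minimal sketch follows; the "<|end_of_turn|>" token is an assumption based on OpenChat's prompt format, so check the model card before relying on it.

# Stop sequences make generation halt cleanly at the end of the assistant's turn.
# "<|end_of_turn|>" is assumed to be OpenChat's end-of-turn token - verify on the model card.
predictor.predict({
    "inputs": "User: What is the capital of France?\nAssistant:",
    "parameters": {
        "max_new_tokens": 256,
        "stop": ["<|end_of_turn|>", "User:"],
    },
})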

Beyond that, OpenChat has the capability to maintain context throughout a conversation, as well as to extract crucial information from a document. Impressive.
