Hosting Multiple LLMs on a Single Endpoint

Image from Unsplash by Michael Dziedzic

The past year has witnessed an explosion in the Large Language Model (LLM) space, with a number of new models paired with various technologies and tools to help train, host, and evaluate them. Hosting/inference, specifically, is where the power of these LLMs, and of Machine Learning in general, is realized: without inference, a model produces no tangible result and serves no purpose.

As I've documented in the past, hosting these LLMs can be quite challenging due to the size of the models and the need to utilize the hardware behind them efficiently. While we've worked with model serving technologies such as DJL Serving, Text Generation Inference (TGI), and Triton, in conjunction with a model/infrastructure hosting platform such as Amazon SageMaker, to host these LLMs, another question arises as we try to productionize our LLM use-cases: how can we do this for multiple LLMs?

Why does this question even arise? When we get to production-level use-cases, it's common to have multiple models in play. For instance, maybe a Llama model handles your summarization use-case while a Falcon model powers your chatbot. While we could host each of these models on its own persistent endpoint, this carries heavy cost implications. We need a solution that considers both cost and performance/resource allocation.

In this article, we will explore how we can utilize an advanced hosting option known as SageMaker Inference Components to address this problem, and build out an example where we host both a Flan and a Falcon model on a single endpoint.

NOTE: This article assumes an intermediate understanding of Python, LLMs, and Amazon SageMaker Inference. I would suggest following this article for getting started with Amazon SageMaker Inference.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

Table of Contents

  1. Inference Components Introduction
  2. Other Multi-Model SageMaker Inference Hosting Options
  3. Hosting Multiple LLMs Example
  4. Additional Resources & Conclusion

1. Inference Components Introduction

SageMaker Inference Components are a newly introduced multi-model hosting option within SageMaker Real-Time Inference. With SageMaker Real-Time Inference you have a persistent endpoint backed by a set of dedicated hardware, for which you can enable AutoScaling. The general flow to create a vanilla SageMaker Endpoint is the following:

Vanilla SageMaker Endpoint Flow (Screenshot by Author)

For our new endpoint, to which we can add multiple Inference Components, the creation/orchestration process looks something like the following:

Inference Component Flow (Screenshot by Author)

Now what is an Inference Component? An Inference Component is very similar to a vanilla SageMaker Model object. You can specify model data and the container for the model that you are trying to host. The key difference is that you can specify the compute that is required to host this model. You can define this in an API call similar to the following:

"ComputeResourceRequirements": {
   "NumberOfAcceleratorDevicesRequired": 1, 
   "NumberOfCpuCoresRequired": 1, 
   "MinMemoryRequiredInMb": 1024
}

The accelerator devices can be either GPUs or AWS Inferentia-based devices, and you can also specify CPU cores and memory as needed. Additionally, at runtime you can set an initial copy count, meaning a model copy is already loaded and ready to serve requests, which helps mitigate cold starts and keeps the model warm.
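The copy count itself is set in the runtime portion of the same creation call; as a fragment (the full call appears later in this article), it looks similar to the following:

"RuntimeConfig": {
   "CopyCount": 1
}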

Within Inference Components, you can also apply AutoScaling based on the average number of invocations each copy of your model receives. For instance, if a single copy of a model crosses the invocation threshold you have specified in your AutoScaling policy, the number of copies is scaled up to the limit you have defined. Note that each copy reserves the compute requirements you have specified, so make sure there are enough instances behind your endpoint to host the number of copies you may create.

In essence, the main features of Inference Components to keep in mind are the following:

  • Each Inference Component has its own container and model data, and it is model server/container agnostic. For instance, you can have an IC that utilizes a TGI container and an IC that utilizes a DJL Serving container on the same endpoint.
  • You can add n Inference Components to your endpoint; this is the multi-model angle of this hosting option.
  • You can specify the compute resource requirements for each Inference Component/Model.
  • You can specify an initial copy count for the Inference Component/Model that you are dealing with.
  • You can scale on a per-Inference Component/Model basis with Application AutoScaling configured for the number of copies behind your IC. With this feature you can also scale your IC down to 0 copies (a minimal sketch of this follows the list below).
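
As referenced above, here is a minimal sketch of what per-IC copy scaling could look like with Application Auto Scaling via Boto3; the IC name, capacities, and target value are illustrative assumptions and are not used in the walkthrough later in this article:

import boto3

aas_client = boto3.client("application-autoscaling")
inference_component_name = "my-inference-component"  # hypothetical IC name

# Register the IC's copy count as a scalable target (0 copies allowed as the minimum)
aas_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"inference-component/{inference_component_name}",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Scale the number of copies based on average invocations per copy
aas_client.put_scaling_policy(
    PolicyName="ic-invocations-scaling-policy",
    ServiceNamespace="sagemaker",
    ResourceId=f"inference-component/{inference_component_name}",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "TargetValue": 5,        # target invocations per copy
        "ScaleInCooldown": 120,
        "ScaleOutCooldown": 120,
    },
)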

In today's example we won't explore the scaling portion beyond the sketch above, but we'll take a look at how we can add two separate LLMs as Inference Components to a SageMaker endpoint. Before that, let's cover the other multi-model options within SageMaker Inference to avoid confusion about which option to use when.

2. Other Multi-Model SageMaker Inference Hosting Options

While Inference Components introduce a new way of hosting multiple models on SageMaker Real-Time Inference, there are already a few existing options that can also be utilized depending on your use-case.

SageMaker already offers two multi-model hosting options: Multi-Model Endpoints (MME) and Multi-Container Endpoints (MCE). To understand the full difference between the two, please refer to my starter article here. In essence, MME can be used when you have multiple similarly flavored models, meaning the same container/model framework; with MME you mount multiple models on a single container on a dedicated endpoint. MCE can be utilized when you have different containers/model frameworks; with MCE you host multiple containers on a single endpoint. Inference Components can be thought of as a bit of a marriage between the two, but for certain use-cases it is still best to utilize one of the two existing options.
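
For context, here is a minimal sketch of how requests are targeted under each existing option with the Boto3 SageMaker runtime client; the endpoint names, model artifact name, container hostname, and payload below are hypothetical placeholders:

import json
import boto3

smr = boto3.client("sagemaker-runtime")
payload = json.dumps({"inputs": "Hello"})

# MME: one container, many model artifacts -- target a specific model file
smr.invoke_endpoint(
    EndpointName="my-mme-endpoint",            # hypothetical endpoint name
    TargetModel="summarization-model.tar.gz",  # model artifact loaded on demand
    ContentType="application/json",
    Body=payload,
)

# MCE: one endpoint, several containers -- target a specific container
smr.invoke_endpoint(
    EndpointName="my-mce-endpoint",            # hypothetical endpoint name
    TargetContainerHostname="secondContainer", # container hostname set at model creation
    ContentType="application/json",
    Body=payload,
)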

Multi-Model Options (Screenshot by Author)

There are specific considerations to think through within each advanced inference option as well. For instance, within MME there is automatic loading and unloading of models depending on traffic. Within MCE there is the option to chain the containers together into a serial inference pipeline. In use-cases where you do want to control the hardware behind each model and scale on a per-model basis, it's best to utilize Inference Components.

Each option has its own pros and cons, so it's important to evaluate your performance requirements and model(s) inference patterns to make the right hosting decision.

3. Hosting Multiple LLMs Example

For our example we'll be hosting both the Flan-T5 XXL and Falcon 7B models using Inference Components on a single endpoint. For our orchestration, we will be using both the SageMaker Python SDK and the AWS Python SDK (Boto3).

import boto3
import sagemaker
import time
from time import gmtime, strftime

#Setup
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")
boto_session = boto3.session.Session()
s3 = boto_session.resource('s3')
region = boto_session.region_name
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
print(f"Role ARN: {role}")
print(f"Region: {region}")

Traditionally, a SageMaker Model object is the first step of creation in SageMaker Real-Time Inference. In this case we skip that step, as we define the model data and container artifacts when creating the Inference Components themselves. We go directly to the SageMaker Endpoint Configuration creation, which has a few key changes.

# Container Parameters, increase health check for LLMs: 
variant_name = "AllTraffic"
instance_type = "ml.g5.24xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

# Setting up managed AutoScaling
initial_instance_count = 1
max_instance_count = 2
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")

# Endpoint Config Creation
epc_name = "ic-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())  # endpoint config name
endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=epc_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            # can set to least outstanding or random: https://aws.amazon.com/blogs/machine-learning/minimize-real-time-inference-latency-by-using-amazon-sagemaker-routing-strategies/
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

Traditionally, with an Endpoint Configuration we just define the initial instance count and instance type. In this case, however, we add a few key parameters:

  • Managed Instance AutoScaling: With Inference Components you can scale at the model/container level, increasing the number of model copies depending on the traffic you are receiving. However, the hardware behind your endpoint must also scale to have enough resources to host those extra copies. SageMaker offers managed instance autoscaling that follows the scaling of these Inference Components, and we enable it here in the Endpoint Configuration.
  • Routing Strategy: We can also configure how requests are routed across instances. The default is random routing, but here we enable Least Outstanding Requests (LOR) routing, in which requests are routed to the instance with the most capacity to serve them, based on the number of Inference Components and the traffic on each instance.

Once our Endpoint Configuration has been defined, we can create an endpoint in the usual manner.

#Endpoint Creation
endpoint_name = "ic-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=epc_name,
)
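
Before adding Inference Components, it helps to confirm that the endpoint has reached the InService status. Below is a minimal polling sketch using describe_endpoint (the Boto3 endpoint_in_service waiter would also work):

# Poll until the endpoint is ready to receive Inference Components
describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    time.sleep(30)
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
print(describe_endpoint_response["EndpointStatus"])  # expect "InService"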

Once the endpoint has been created, we focus on the new flow of adding an Inference Component. Once again, think of an Inference Component as very similar to a SageMaker Model object: you need the same model data and container specification defined. In this case we can even directly create a SageMaker Model object and have our Inference Component inherit the metadata it needs.

For our Flan and Falcon models, we use the Text Generation Inference (TGI) container, which pulls the model artifacts directly from the Hugging Face Hub via the HF_MODEL_ID environment variable (TGI will be the model server).

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import json

# utilizing huggingface TGI container
image_uri = get_huggingface_llm_image_uri("huggingface",version="1.1.0")
print(f"TGI Image: {image_uri}")

# Flan T5 TGI Model
flant5_model = {"Image": image_uri, "Environment": {"HF_MODEL_ID": "google/flan-t5-xxl"}}
flant5_model_name = "flant5-model" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(f"Flan Model Name: {flant5_model_name}")

#note: falcon 7b takes just one GPU, sharding is not supported
falcon7b_model = {"Image": image_uri, "Environment": {'HF_MODEL_ID':'tiiuae/falcon-7b'}}
falcon7b_model_name = "falcon7b-model" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(f"Falcon Model Name: {falcon7b_model_name}")

We then create a SageMaker Model object and pass in the model metadata to our Inference Component.

# create model object for flan t5
create_flan_model_response = client.create_model(
    ModelName=flant5_model_name,
    ExecutionRoleArn=role,
    Containers=[flant5_model],
)
print("Flan Model Arn: " + create_flan_model_response["ModelArn"])

# flan inference component creation, inherits the SageMaker Model object
flant5_ic_name = "flant5-ic" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_flan_ic_response = client.create_inference_component(
    InferenceComponentName=flant5_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": flant5_model_name,
        "ComputeResourceRequirements": {
            # enables tensor parallel via TGI, reserving 2 GPUs (g5.24xlarge has 4 GPUs)
            "NumberOfAcceleratorDevicesRequired": 2,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # can setup autoscaling for copies
    RuntimeConfig={"CopyCount": 1},
)

print("IC Flan Arn: " + create_flan_ic_response["InferenceComponentArn"])

Notice that we pass the SageMaker Model name, which contains our container and model data information, to our Inference Component. Our Inference Component (IC) inherits this information, and we then define the parameters that make an IC different from the traditional SageMaker Real-Time Inference flow.

The key differences for Inference Components lie in the hardware and scaling parameters we define. Notice that we set NumberOfAcceleratorDevicesRequired; this parameter covers both GPU and Inferentia-based accelerators. You also want to consider how much hardware you have available: in our case we use an ml.g5.24xlarge instance for our endpoint, which has 4 GPUs. By requiring 2 devices, we assign 2 GPUs to each copy of the model that we create, and we also define the CopyCount in the RuntimeConfig. We chose 2 GPUs for Flan because we want to enable Tensor Parallel via our TGI container. For certain models, multiple accelerators may not even be required; for instance, sharding is not supported for Falcon 7B, so for our Falcon Inference Component we only require 1 GPU, as follows:

# create falcon model object
create_falcon_model_response = client.create_model(
    ModelName=falcon7b_model_name,
    ExecutionRoleArn=role,
    Containers=[falcon7b_model],
)
print("Falcon Model Arn: " + create_falcon_model_response["ModelArn"])

falcon_ic_name = "falcon-ic" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
variant_name = "AllTraffic"

# falcon inference component creation
create_falcon_ic_response = client.create_inference_component(
    InferenceComponentName=falcon_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": falcon7b_model_name,
        "ComputeResourceRequirements": {
            # For falcon 7b only one GPU is needed: https://github.com/huggingface/text-generation-inference/issues/418#issuecomment-1579186709
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # can setup autoscaling for copies
    RuntimeConfig={"CopyCount": 1},
)
print("IC Falcon Arn: " + create_falcon_ic_response["InferenceComponentArn"])

Once your Inference Components have been created you can see in the Studio UI that there are multiple models enabled on your endpoint:

Inference Components (Creation)
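
You can also verify this programmatically. Below is a minimal sketch using describe_inference_component and list_inference_components from the Boto3 SageMaker client (assuming the status and summary fields exposed by the current API):

# Wait for the Falcon Inference Component to finish deploying
desc_ic_response = client.describe_inference_component(InferenceComponentName=falcon_ic_name)
while desc_ic_response["InferenceComponentStatus"] == "Creating":
    time.sleep(30)
    desc_ic_response = client.describe_inference_component(InferenceComponentName=falcon_ic_name)
print(desc_ic_response["InferenceComponentStatus"])  # expect "InService"

# List all Inference Components attached to the endpoint
list_ic_response = client.list_inference_components(EndpointNameEquals=endpoint_name)
for ic in list_ic_response["InferenceComponents"]:
    print(ic["InferenceComponentName"], ic["InferenceComponentStatus"])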

Invoking these Inference Components is very similar to invoking a traditional MME or MCE endpoint; we simply specify the Inference Component name as its own parameter.

import json

# sample request
payload = "What is the capitol of the United States?"
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=falcon_ic_name, #specify IC name
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(
        {
            "inputs": payload,
            "parameters": {
                "early_stopping": True,
                "length_penalty": 2.0,
                "max_new_tokens": 50,
                "temperature": 1,
                "min_length": 10,
                "no_repeat_ngram_size": 3,
                },
        }
    ),
)
result = json.loads(response["Body"].read().decode())
result
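
To hit the second model on the same endpoint, we simply swap in the Flan Inference Component name (flant5_ic_name, defined when we created that IC earlier). Below is a minimal sketch; the prompt and generation parameters are illustrative:

# Same endpoint, different Inference Component
flan_response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=flant5_ic_name,  # target the Flan T5 XXL IC
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(
        {
            "inputs": "Summarize: Amazon SageMaker is a fully managed machine learning service.",
            "parameters": {"max_new_tokens": 50},
        }
    ),
)
flan_result = json.loads(flan_response["Body"].read().decode())
flan_result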

4. Additional Resources & Conclusion

SageMaker-Deployment/LLM-Hosting/Inference-Components/falcon-flan-tgi-ic.ipynb at master ·…

The code for the entire example can be found at the link above. For additional resources around Inference Components, please refer to the official AWS Blog here and the SageMaker LLM workshop. Inference Components offer a production-level way to scale not just your LLMs, but any other ML models you may have, in a performance- and cost-optimal manner.

Stay tuned for more GenAI/AWS articles and deep dives into the topics we have discussed above and more. As always thank you for reading and feel free to leave any feedback.


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter.
