Deploying Large Language Models With HuggingFace TGI


Large Language Models (LLMs) continue to soar in popularity, with a new one released nearly every week. As the number of these models grows, so do the options for hosting them. In my previous article we explored how to utilize DJL Serving within Amazon SageMaker to efficiently host LLMs. In this article we explore another optimized model server and solution: HuggingFace Text Generation Inference (TGI).

NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. This article also assumes an intermediate understanding of SageMaker deployment; I would suggest following this article to understand deployment/inference more in depth.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

Why HuggingFace Text Generation Inference? How Does It Work With Amazon SageMaker?

TGI is a Rust-, Python-, and gRPC-based model server created by HuggingFace that can be used to host specific large language models. HuggingFace has long been the central hub for NLP, and TGI contains a large set of optimizations specifically for LLMs; a few are listed below, and the documentation has an extensive list.

  • Tensor Parallelism for efficient hosting across multiple GPUs
  • Token Streaming with SSE
  • Quantization with bitsandbytes
  • Logits warper (different parameters such as temperature, top-k, top-p, etc.)

A big positive of this solution that I noted is its simplicity of use. TGI at this moment supports a specific set of optimized model architectures (see the documentation for the up-to-date list) that you can directly deploy utilizing the TGI containers.

Even better, it integrates directly with Amazon SageMaker, as we will explore in this article. SageMaker now provides managed TGI containers, which we will retrieve and use to deploy a Llama model in this example.

TGI vs DJL Serving vs JumpStart

So far in this LLM Hosting series we've explored two alternative options with Amazon SageMaker:

  1. DJL Serving
  2. SageMaker JumpStart

When do we use what, and where does TGI fall into the picture? SageMaker JumpStart is great if the model is already provided in its offerings: you don't need to touch any containers or do model-server-level work, as this is all abstracted away for you. DJL Serving is great if you have a custom use-case with a model that is not supported by JumpStart; you can pull directly from the HuggingFace Hub or load your own artifacts into S3 and point towards them. Another positive is that if you've mastered a specific model partitioning framework DJL Serving integrates with, such as Accelerate or FasterTransformer, you can apply some advanced performance optimization techniques to your LLM hosting.

Where does TGI come in? As we've noted, TGI natively supports specific model architectures, and it is a great option if you are trying to deploy one of those supported models, since the server is optimized specifically for that architecture. In addition, if you understand the model server at a deeper level, you can tune some of the environment variables it exposes. Think of TGI almost as an intersection between ease of use and performance optimization. It's essential to choose the right option out of these three for your deployment, depending on your use-case and weighing the pros and cons of each solution.

Llama Deployment with TGI on Amazon SageMaker

To work with Amazon SageMaker we will utilize the SageMaker Python SDK to grab our TGI container and streamline deployment. In this example we will follow Philipp Schmid's TGI blog and adjust it to Llama. Make sure to first install the latest version of the SageMaker Python SDK and set up the necessary clients to work with the service.

!pip install "sagemaker==2.163.0" --upgrade --quiet

import sagemaker
import boto3
sess = sagemaker.Session()

sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

Next we retrieve the necessary SageMaker container for TGI deployment. AWS manages a set of Deep Learning containers (you can also bring your own) that integrate with different model servers such as TGI, DJL Serving, TorchServe, and more. In this case we point the SDK to the version of TGI we need.

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

Next comes the magic specific to the TGI container. The simplicity of TGI means we can just provide the HuggingFace Hub model ID for the LLM we are working with. Note that it must be one of the model architectures TGI supports. In this case we utilize a variant of the Llama-7B model.

import json
from sagemaker.huggingface import HuggingFaceModel

# Define Model and Endpoint configuration parameters
number_of_gpu = 4  # must match the GPU count of the instance type used for deployment (ml.g5.12xlarge has 4 GPUs)

config = {
  'HF_MODEL_ID': "decapoda-research/llama-7b-hf", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
}
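The same config can also carry optional TGI tuning knobs. As a sketch, bitsandbytes quantization could be enabled with one more entry; HF_MODEL_QUANTIZE is the variable name I am assuming here from the TGI container examples, so verify it against the container version you retrieved.

# Optional sketch: enable bitsandbytes quantization via an extra environment variable
# (HF_MODEL_QUANTIZE is assumed here; confirm against your TGI container version)
config['HF_MODEL_QUANTIZE'] = "bitsandbytes"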

Any other additional optimizations that the TGI container supports can be specified in the config in the same way. We then pass this config to a HuggingFaceModel object that SageMaker understands.

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

We can then directly deploy this model object to a SageMaker Real-Time Endpoint. Here you can specify your hardware; for these larger models a GPU instance family such as g5 is recommended. We also enable a larger container health check timeout due to the size of these models.

instance_type = "ml.g5.12xlarge"  # 4 x NVIDIA A10G GPUs (matches number_of_gpu above)
health_check_timeout = 300  # give the container time to download and load the model weights

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

Creation of the endpoint should take a few minutes; afterwards, you should be able to perform sample inference with the following code.

llm.predict({
    "inputs": "My name is Julien and I like to",
})
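Since TGI exposes its logits warper at request time, you can also pass generation parameters alongside the prompt. The snippet below is a sketch; the parameter names (do_sample, max_new_tokens, temperature, top_k, top_p) follow TGI's generate API, so adjust them to whatever your container version supports.

# Sketch: pass sampling parameters along with the prompt (names follow TGI's generate API)
llm.predict({
    "inputs": "My name is Julien and I like to",
    "parameters": {
        "do_sample": True,       # sample instead of greedy decoding
        "max_new_tokens": 128,   # cap the number of generated tokens
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.95,
    },
})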

Additional Resources & Conclusion

SageMaker-Deployment/LLM-Hosting/TGI/tgi-llama.ipynb at master · RamVegiraju/SageMaker-Deployment

You can find the code for the entire example at the link above; alternatively, you can generate similar deployment code directly from the SageMaker deployment tab on the HuggingFace Hub page for your supported model. Text Generation Inference provides a highly optimized model server that also greatly simplifies the deployment process. Before you go, remember to clean up the resources you created (a sketch follows below). Stay tuned for more content around the LLM space, and as always, thank you for reading; feel free to leave any feedback.
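As a minimal cleanup sketch, the predictor returned by deploy can delete both the SageMaker model and the endpoint so you stop incurring charges for a running Real-Time Endpoint:

# tear down the SageMaker model and endpoint created above
llm.delete_model()
llm.delete_endpoint()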


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you're new to Medium, sign up using my Membership Referral.
