Deploying Cohere Language Models On Amazon SageMaker


Large Language Models (LLMs) and Generative AI are accelerating Machine Learning growth across various industries. LLMs have expanded what Machine Learning can do to incredible heights, but they have also been accompanied by a new set of challenges.

The size of LLMs leads to difficult problems in both the Training and Hosting portions of the ML lifecycle. Hosting LLMs in particular raises a myriad of challenges. How can we fit a model onto a single GPU for inference? How can we apply model compression and partitioning techniques without compromising accuracy? How can we improve inference latency and throughput for these models?

Addressing many of these questions requires advanced ML Engineering: orchestrating model hosting on a platform that can apply compression and parallelization techniques at the container and hardware level. There are solutions such as DJL Serving that provide containers tuned for LLM hosting, but we will not explore them in this article.

In this article, we'll explore SageMaker JumpStart Foundational Models. With Foundational Models we don't worry about containers or model parallelization and compression techniques, but instead focus on directly deploying a pre-trained model on the hardware of our choice. Specifically, we'll look at Cohere, a popular LLM provider, and see how we can host one of their language models on SageMaker for Inference.

NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. The article also assumes an intermediate understanding of SageMaker Deployment; I would suggest reading this article for a deeper understanding of Deployment/Inference. For SageMaker JumpStart in particular, I would reference the following blog.

What is SageMaker JumpStart? What Are Foundational Models?

SageMaker JumpStart is, in essence, SageMaker's model zoo. It offers a variety of pre-trained models that are already containerized and can be deployed via the SageMaker Python SDK. The main value here is that customers don't need to worry about tuning or configuring a container to host a specific model; that heavy lifting is taken care of.
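
To make this concrete, here is a minimal sketch of how a JumpStart model is typically deployed with the SageMaker Python SDK. The model_id below is illustrative (not the Cohere model from this article), and an AWS session with a SageMaker execution role is assumed:

from sagemaker.jumpstart.model import JumpStartModel

# Illustrative JumpStart model id; pick any id listed in the JumpStart catalog
model = JumpStartModel(model_id="huggingface-text2text-flan-t5-xl")

# Deploy the pre-built container and model artifacts to a real-time endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
print(predictor.endpoint_name)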

Specifically for LLMs, JumpStart Foundational Models were launched with popular language models from a variety of providers such as Stability AI and, in this case, Cohere. You can view the full list of available Foundational Models in the SageMaker Console.

SageMaker JumpStart Foundational Models (Screenshot by Author)

These Foundational Models are also exposed via the AWS Marketplace, where you can subscribe to specific models that may not be accessible by default. The Cohere Medium model that we will be working with should be accessible via JumpStart without any subscription, but if you do run into any issues you can request access at the following link.

Cohere Medium Language Model Deployment

For this example we'll specifically explore how to deploy Cohere's GPT Medium language model via SageMaker JumpStart. Before we start, we install the cohere-sagemaker SDK. This SDK further simplifies the deployment process by wrapping the usual SageMaker Inference constructs (SageMaker Model, SageMaker Endpoint Configuration, and SageMaker Endpoint).

!pip install cohere-sagemaker --quiet

From this SDK we import the Client object that will help us create our endpoint and also perform inference.

from cohere_sagemaker import Client
import boto3

If we go to the Marketplace link, we see that this model is available as a Model Package, so for the next step we provide the Model Package ARN for the Cohere Medium model. Note that this specific model is currently only available in the us-east-1 and eu-west-1 regions.

# Currently us-east-1 and eu-west-1 only supported
model_package_map = {
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/cohere-gpt-medium-v1-5-15e34931a06235b7bac32dca396a970a",
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/cohere-gpt-medium-v1-5-15e34931a06235b7bac32dca396a970a",
}

region = boto3.Session().region_name
if region not in model_package_map:
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

Now that we have our Model Package ARN, we can instantiate our Client object and create our endpoint. With JumpStart we have to provide the Model Package details, the instance type and count, as well as the endpoint name.

# instantiate client
co = Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-gpt-medium",
                   instance_type="ml.g5.xlarge", n_instances=1)

For language models such as Cohere's, we recommend GPU-based instances for the instance type, such as the g5, p3/p2, or g4dn families. All of these instances have enough compute and memory to handle models of this size. For further guidance you can also follow the Marketplace recommendation for the specific model you choose.
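
Endpoint creation can take several minutes. If you want to confirm the endpoint has reached the InService status before sending requests, one option is a quick check with the low-level SageMaker boto3 client; a minimal sketch (the endpoint name matches the one passed to create_endpoint above):

import boto3

# Describe the endpoint created above and check its status
sm_client = boto3.client("sagemaker", region_name=region)
description = sm_client.describe_endpoint(EndpointName="cohere-gpt-medium")
print(description["EndpointStatus"])  # expect "InService" once deployment finishes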

Next we perform a sample inference with the generate API call, which generates text for the prompt we provide our endpoint with. This generate call serves as a Cohere wrapper around the invoke_endpoint API call we traditionally use with SageMaker endpoints; an equivalent low-level call is sketched after the sample inference below.

prompt = "Write a LinkedIn post about starting a career in tech:"

# API Call
response = co.generate(prompt=prompt, max_tokens=100, temperature=0, return_likelihoods='GENERATION')
print(response.generations[0].text)

Sample Inference (Screenshot by Author)
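
For context, here is a rough sketch of what the same request might look like with the low-level invoke_endpoint API that generate wraps. The payload keys shown ("prompt", "max_tokens", "temperature") are assumptions that simply mirror the generate arguments; the exact schema the Cohere container expects may differ:

import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name=region)

# Assumed payload schema mirroring the generate() arguments above
payload = {
    "prompt": "Write a LinkedIn post about starting a career in tech:",
    "max_tokens": 100,
    "temperature": 0,
}

raw_response = runtime.invoke_endpoint(
    EndpointName="cohere-gpt-medium",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(raw_response["Body"].read()))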

Parameter Tuning

For an extensive understanding of the different LLM parameters that you can tune, I would reference Cohere's official article here. We primarily focus on tuning the two parameters we saw in our generate API call.

  1. max_tokens: Max tokens, as the name indicates, is the limit on the number of tokens our LLM can generate. What counts as a token varies; it can be a character, a word, a phrase, or more. Cohere uses byte-pair encoding for its tokens; to fully understand how their models define tokens, refer to the following documentation. In essence, we can iterate on this parameter to find an optimal value: we don't want a value so small that the model can't properly answer our prompt, nor one so large that the response stops making sense. Cohere's generation models support up to 2048 tokens.
  2. temperature: The temperature parameter helps control the "creativity" of the model. When the next word is generated, there is a list of candidate words with varying probabilities. With a lower temperature the model tends to pick the word with the highest probability; as we increase the temperature the responses become much more varied, since the model starts selecting words with lower probabilities (see the small illustration after this list). This parameter ranges from 0 to 5 for this model.
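
To build intuition for what temperature does, here is a small, self-contained illustration (not Cohere's actual implementation): temperature rescales the model's next-token scores before sampling, so higher values flatten the distribution and make lower-probability words more likely to be chosen.

import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities; temperature flattens or sharpens the distribution."""
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())
    return exps / exps.sum()

# Hypothetical next-token scores for four candidate words
logits = np.array([2.0, 1.0, 0.5, 0.1])

# As temperature approaches 0 the model effectively always picks the top word
for t in [0.2, 1.0, 5.0]:
    print(t, softmax_with_temperature(logits, t).round(3))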

First we can explore iterating on the max_tokens value. We create an array of five arbitrary token limits and loop through them for inference while keeping the temperature constant.

token_range = [100, 200, 300, 400, 500]

for token in token_range:
    response = co.generate(prompt=prompt, max_tokens=token, temperature=0.9, return_likelihoods='GENERATION')
    print("-----------------------------------")
    print(response.generations[0].text)
    print("-----------------------------------")

As expected we can see the difference in the length of each of the responses.

Token Size 200 (Screenshot by Author)
Token Size 300 (Screenshot by Author)

We can also test the temperature parameter by iterating through values from 0 to 5.

for i in range(6):  # temperature values 0 through 5
    response = co.generate(prompt=prompt, max_tokens=100, temperature=i, return_likelihoods='GENERATION')
    print("-----------------------------------")
    print(response.generations[0].text)
    print("-----------------------------------")

We can see that at a value of 1 we have a very realistic output that makes sense for the most part.

Temperature 1 (Screenshot by Author)

At a temperature of 5 we see an output that makes some sense, but deviates wildly from the topic due to the word selection.

Temperature 5 (Screenshot by Author)

If you would like to test all the different combinations of these parameters to find your optimal configuration, you can run the following code block.

import itertools

# Create array of all combinations of both params
temperature = [0,1,2,3,4,5]
params = [token_range, temperature]
param_combos = list(itertools.product(*params))

for param in param_combos:
    response = co.generate(prompt=prompt, max_tokens=param[0],
                           temperature=param[1], return_likelihoods='GENERATION')
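
The loop above discards each response. A small extension (not part of the original notebook) is to store each combination's output so the responses can be compared side by side:

# Collect each (max_tokens, temperature) combination's output for comparison
results = []
for max_tokens, temp in param_combos:
    response = co.generate(prompt=prompt, max_tokens=max_tokens,
                           temperature=temp, return_likelihoods='GENERATION')
    results.append((max_tokens, temp, response.generations[0].text))

for max_tokens, temp, text in results:
    print(f"max_tokens={max_tokens}, temperature={temp}")
    print(text)
    print("-----------------------------------")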

Additional Resources & Conclusion

SageMaker-Deployment/cohere-medium.ipynb at master · RamVegiraju/SageMaker-Deployment

The code for the entire example can be found at the link above (stay tuned for more LLM and JumpStart examples). With SageMaker JumpStart's Foundational Models it becomes easy to host LLMs via an API call without doing the grunt work of containerizing and model serving. I hope this article was a useful introduction to LLMs on Amazon SageMaker; feel free to leave any feedback or questions, as always.


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you're new to Medium, sign up using my Membership Referral.

Tags: AWS Cohere Large Language Models Machine Learning Sagemaker
