Automating Your Containerised Model Deployments

We will be touching on a portion of "Serving Infrastructure" and "Process Management" in this post. Image adapted from Figure 1 in Sculley et al. (2015) [1]

The term MLOps, or Machine Learning Operations, is a loaded one, covering a complex web of tasks, processes and infrastructure to set up, automate and monitor. An early paper by Google from 2015, before the term was even coined, describes this well: ML code makes up only a small fraction of a real-world machine learning system [1].

In this article, I will walk through one component of MLOps: the process of deploying a containerised workload or model to a serving infrastructure, taking automation, security and feedback loops into consideration.

There are many platforms and tools to choose from for deployments, and what I will show is just one combination. While the AWS, Terraform, GitLab and Ansible stack used here is widely adopted in the real world, each component has popular alternatives. However, the methodology and process are agnostic, and should also give you insight into how to implement the same flow on whichever platforms you use.

Contents

  • Architecture Overview
  • Create an Ansible Playbook
  • Provision Cloud Resources
  • Build a GitLab-CI Pipeline

While I will attempt to provide context and details, it will help if you already have a basic understanding of AWS, Ansible, Terraform and GitLab-CI. I also assume that you already know Docker, docker-compose and how to build a model API.

Architecture Overview

A high-level diagram of what we want to achieve. (Image by author)

At a high level, this is a simple task consisting of three components. First, the Continuous Integration and Continuous Deployment (CICD) platform builds the image of the model API and pushes it to a container registry. Second, it instructs the server to pull the image from the registry and deploy the container API. Last, the status of the job is retrieved on completion.

The exact steps for this tutorial are more complex. Below is a sequence diagram showing the communication flows between the various resources.

Sequence diagram showing the communications from GitLab-CI to the various AWS resources. (Image by author)
  1. GitLab-CI is the CICD platform I will be using. It is a popular and mature Git repository and DevSecOps platform.
  2. The first CICD job builds the Docker image and pushes it to the AWS Elastic Container Registry (ECR).
  3. The second CICD job, for deployment, involves a more intricate sequence of steps. While we could communicate directly with the server via SSH, this has serious security implications: imagine a malicious party gaining access to the SSH credentials. Instead, we can use an intermediary AWS service, the Run Command feature of AWS Systems Manager (SSM). This allows us to run commands securely on the specified host servers. In some respects, SSM Run Command acts like a traditional bastion or jump host, but it is easier to use, with zero maintenance required.
  4. In this example, SSM will download an Ansible playbook from an S3 bucket (storage service). This playbook contains all the instructions to pull the image from ECR and deploy it as a container.
  5. After the job is completed, the logs are stored in AWS's logging service, in a CloudWatch Log Group.
  6. An AWS command is sent repeatedly to poll for the status of the SSM Run Command job. If the job completes with a failure, the pipeline grabs the logs from CloudWatch and prints them in the GitLab-CI job console so that the failure can be debugged easily.
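Before relying on SSM Run Command, it is worth verifying that the target instance has actually registered with SSM; if the agent is not connected, every Run Command will fail. A minimal check, assuming your AWS CLI is already configured with the right credentials and region:

# list the instances SSM can manage; the EC2 instance should appear
# here with a PingStatus of "Online" once its agent has registered
aws ssm describe-instance-information \
    --query "InstanceInformationList[*].[InstanceId,PingStatus]" \
    --output table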

Create an Ansible Playbook

Ansible is one of the three most popular configuration management tools, alongside Chef and Puppet. It is an open-source tool developed in Python. With Ansible, the instructions to execute on an instance are collected in a playbook: a set of tasks written in a YAML file.

---
- hosts: all
  become: true

  vars:
    home_dir: /home/ubuntu

  tasks:
  - name: login ECR
    shell: |
      AWS_ACCOUNT=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .accountId)
      AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
      aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com
    when: ansible_os_family == "Debian"

  - name: download docker-compose
    command: aws s3 cp s3://{{COMPOSE_PATH}}/docker-compose.yml {{home_dir}}/docker-compose.yml
  - name: prune docker
    command: docker system prune -f
  - name: pull docker images
    command: docker compose -f {{home_dir}}/docker-compose.yml pull
  - name: deploy docker images
    command: docker compose -f {{home_dir}}/docker-compose.yml up -d --remove-orphans

The above playbook contains the following instructions.

  • Define the home_dir variable as Ubuntu's default home path.
  • Log Docker in to ECR. Two extra lines of code fetch the AWS region and account number, which are available by default within the instance's metadata, and use them to construct the registry URL for the login.
  • Download the docker-compose.yml file; the bucket path is parameterised as COMPOSE_PATH, which we will pass in through the Run Command later.
  • The last three commands delete (prune) any obsolete images, containers and storage to save space, pull the image from ECR, and then launch the model container from the image.
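Before uploading the playbook to S3, you can catch obvious mistakes locally. A small sketch, assuming the playbook is saved as deployment.yml (the file name the pipeline expects later) and Ansible is installed on your machine:

# validate the YAML and task structure without executing anything
ansible-playbook deployment.yml --syntax-check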

Provision Cloud Resources

Setting up an entire infrastructure to support a machine learning workload is not trivial, and explaining all the working components would take more than one article.

However, I have provided a Terraform repository to provision all the necessary resources [2]. Terraform is the dominant Infrastructure-as-Code (IaC) tool right now because it is free, cloud-agnostic and simple to use. A few simple commands and you have everything up and running!

# add your AWS credentials for terraform to use
aws configure

# initialise the terraform scripts
terraform init
# verify and deploy the infrastructure
terraform apply
AWS infrastructure to support the stated deployment process. (Image by author)

After provisioning, you should have the architecture shown above. It comprises the following:

  • Resources – EC2 instance, ECR repository, S3 bucket and CloudWatch Log Group
  • Permissions – i.e., instance profile, which allows the instance to have access to the various resources
  • Installations – a user-data script [3] that installs Docker, Ansible and the AWS CLI when the instance is created
  • S3 objects – namely an example docker-compose.yml file which pulls a dummy model server API built using Flask, and an Ansible Playbook which contains the instructions for deployment.
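As a quick sanity check that provisioning succeeded, you can query a few of the resources directly. A rough sketch, using the resource names that the pipeline variables below assume (model-server, s3-autodeploy-storage and ecr-ml-inference):

# the EC2 instance should be in the "running" state
aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=model-server" \
    --query "Reservations[*].Instances[*].State.Name" --output text

# the bucket should contain the playbook and docker-compose.yml
aws s3 ls s3://s3-autodeploy-storage

# the ECR repository should exist (it stays empty until the first build)
aws ecr describe-repositories --repository-names ecr-ml-inference

When you are done experimenting, terraform destroy tears everything down again.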

You will also need to create an AWS IAM user and generate an access key for use in GitLab-CI later. You may want to use the policy provided in the repository at user/policy.json.

Retrieving an AWS access key for use in the CI pipeline later. (Screenshot by author)
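If you prefer the CLI over the console, the equivalent steps look roughly like the following; the user name gitlab-ci-deployer is a placeholder of my own choosing, and the policy path follows the repository layout mentioned above.

# create a dedicated IAM user for the pipeline
aws iam create-user --user-name gitlab-ci-deployer

# attach the repository's policy as an inline policy
aws iam put-user-policy --user-name gitlab-ci-deployer \
    --policy-name autodeploy-policy --policy-document file://user/policy.json

# generate the access key pair to store in GitLab in the next step
aws iam create-access-key --user-name gitlab-ci-deployer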

Build a GitLab-CI Pipeline

Save Global Variables and Secrets

After creating your AWS user and retrieving its access keys, save them into your GitLab repository.

Adding your AWS credentials into GitLab's CI variables (Screenshot by author)

They are located under Settings > CI/CD > Variables, via the Add variable button. For the secrets, remember to check the Masked option so that they are masked within your CI console logs to prevent exposure. Two other variables are stored here as well: your AWS account number and the region where your AWS services are hosted.
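If you would rather script this than click through the UI, GitLab's project variables API can do the same; the <project-id> and token values below are placeholders you need to fill in.

# create a masked CI/CD variable via the GitLab API
curl --request POST \
     --header "PRIVATE-TOKEN: <your-access-token>" \
     "https://gitlab.com/api/v4/projects/<project-id>/variables" \
     --form "key=AWS_SECRET_ACCESS_KEY" \
     --form "value=<your-secret-key>" \
     --form "masked=true"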

Building the Pipeline

A GitLab pipeline is built using a YAML file called .gitlab-ci.yml saved at the root of a repository.

stages:
  - build
  - deploy

build-image:
  stage: build
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  variables:
    TAG: "latest"
    IMAGE: "ecr-ml-inference"
  before_script:
    - REGISTRY=${AWS_ACCOUNT}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com
    - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $REGISTRY
  script:
    - docker build -t ${REGISTRY}/${IMAGE}:${TAG} .
    - docker push ${REGISTRY}/${IMAGE}:${TAG}

As mentioned earlier, there are two jobs in this pipeline. The first builds our image and pushes it to ECR, which can be done simply with the script shown above. Note that a default AWS Docker image hosted by GitLab is used, which already has the AWS CLI installed.
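After this job runs, you can confirm that the image actually landed in ECR with a quick query; the repository name matches the IMAGE variable above.

# list the tags pushed to the repository
aws ecr describe-images --repository-name ecr-ml-inference \
    --query "imageDetails[*].imageTags" --output text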

The second is to send an SSM run command to deploy the model. This consists of several AWS commands.

Flow chart showing how to use AWS commands for deployment. (Created by author)

As seen in the flow chart above, we first get the instance's unique ID by querying the instance name with aws ec2 describe-instances. This is important because the instance name will likely stay the same over the span of a project, while the instance ID changes whenever the instance has to be replaced, for example to apply infrastructure changes.

Next, we send the SSM Run Command with aws ssm send-command to instruct the instance to download the Ansible playbook and trigger the deployment. The run command's unique ID is captured here.

With the ID, we can then use aws ssm get-command-invocation to poll for the status every X seconds until we obtain a terminal state, unless it times out. If the job completes with a failure, aws logs get-log-events is used to grab the failure logs saved in CloudWatch.


deploy-ec2:
  stage: deploy
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  variables:
    INSTALL_DEP: "False"
    PLAYBOOK: "deployment"
    INSTANCE_NAME: "model-server"
    AWS_S3_BUCKET: "s3-autodeploy-storage"
    AWS_CLOUDWATCH: "/aws/ssm/runcommand/logs"
  script:
    # get instance ID
    - echo "[INFO] Getting INSTANCE_ID from tags"
    - INSTANCE_ID=$(aws ec2 describe-instances --filters "Name=tag:Name,Values=${INSTANCE_NAME}" "Name=instance-state-name,Values=pending,running,stopping,stopped" --query "Reservations[*].Instances[*].InstanceId" --output text)
    # send SSM run command to deploy
    # ------------------------------
    - COMPOSE_PATH=$AWS_S3_BUCKET
    - |
      result=$(aws ssm send-command \
              --document-name "AWS-ApplyAnsiblePlaybooks" \
              --document-version "1" \
              --targets '[{"Key":"InstanceIds","Values":["'${INSTANCE_ID}'"]}]' \
              --parameters '{"SourceType":["S3"],"SourceInfo":["{\"path\": \"https://'${AWS_S3_BUCKET}'.s3.'${AWS_DEFAULT_REGION}'.amazonaws.com/'${PLAYBOOK}'.yml\"}"],"InstallDependencies":["'${INSTALL_DEP}'"],"PlaybookFile":["'${PLAYBOOK}'.yml"],"ExtraVariables":["COMPOSE_PATH='${COMPOSE_PATH}'"],"Check":["False"],"Verbose":["-v"],"TimeoutSeconds":["3600"]}' \
              --timeout-seconds 600 \
              --max-concurrency "50" \
              --max-errors "0" \
              --cloud-watch-output-config '{"CloudWatchOutputEnabled":true,"CloudWatchLogGroupName":"'${AWS_CLOUDWATCH}'"}')
    - id=$(echo $result | jq -r .Command.CommandId)
    # poll status, 15mins (90 * 10sec) timeout
    # refer to status types here: https://docs.aws.amazon.com/systems-manager/latest/userguide/monitor-commands.html
    # ------------------------------
    - |
      count=90
      for i in $(seq $count); do
        checkstatus=`aws ssm get-command-invocation --command-id $id --instance-id $INSTANCE_ID`;
        getstatus=$(echo $checkstatus | jq -r .StatusDetails);
        if [[ $getstatus == "Pending" ]]; then
          echo "[INFO] pending... please wait"
          sleep 10
        elif [[ $getstatus == "InProgress" ]]; then
          echo "[INFO] in progress... please wait"
          sleep 10
        elif [[ $getstatus == "Success" ]]; then
          echo "[INFO] Success! Images deployed to EC2"
          break
        elif [[ $getstatus == "Failed" ]]; then
     # retrieve logs from cloudwatch if deployment fails
     # ------------------------------
          getloggroup=$(echo $checkstatus | jq -r .CloudWatchOutputConfig.CloudWatchLogGroupName)
          aws logs get-log-events 
              --log-group-name $getloggroup 
              --log-stream-name "$id/$INSTANCE_ID/runShellScript/stdout" 
              --query events[*].message 
              --output text
          echo "[ERROR] deployment failed, please check SSM logs above"
          exit 1
        fi;
      done;
    - echo "[INFO] Final Status:" $getstatus
    - |
      if [[ $getstatus == "InProgress" ]]; then
          echo "[ERROR] Timeout, consider increasing the polling count"
          exit 1;
      fi;

In script form, the flow chart translates into the second GitLab-CI job above. It should be self-explanatory if you break it down with reference to the earlier chart.

GitLab-CI console log. (Screenshot by author)

If all goes well, you should see output in the GitLab-CI job's console log like the above.
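As an optional smoke test, you can reuse SSM Run Command to inspect the deployed container without opening SSH or any extra ports. A sketch, assuming the INSTANCE_ID captured earlier and that the Flask API listens on port 5000 inside the container (adjust the port to your own image):

# run shell commands on the instance: list containers and ping the API
aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets '[{"Key":"InstanceIds","Values":["'${INSTANCE_ID}'"]}]' \
    --parameters 'commands=["docker ps","curl -s http://localhost:5000"]'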

Summary

I have shown how you can set up a deployment pipeline for your containerised ML workloads using the popular technologies GitLab, AWS, Terraform and Ansible.

While this work is usually carried out by MLOps or DevOps engineers, it is also helpful for data scientists to understand the overarching machine learning framework and processes, so that the team can work together smoothly.

For those who have little experience with these platforms and tools, it will take some patience and time to understand and implement everything. However, it should give you a solid insight into how this can be done on your own. Once you figure things out, these processes and scripts can be templatised and reused.

References

  • [1] Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems.
  • [2] Terraform repository to provision the infrastructure stated in this article. [Link]
  • [3] My previous article on setting up user data. [Link]

