How To Deploy and Test Your Models Using FastAPI and Google Cloud Run

Introduction
MLOps (Machine Learning Operations) is an increasingly popular function and skill among data professionals. Full-stack data scientists who are capable of taking machine learning (ML) projects from training to production are more and more in demand, so if you feel a bit uncomfortable in this area, or you want a quick refresher, this post is for you! If you've never deployed a model before, I highly encourage you to check out my previous post, which gives more of an introduction to ML deployment and explains many of the concepts used in this post.
This post is going to cover quite a few topics, so feel free to skip to the chapters you're most interested in. The structure of this post is:
- Use FastAPI to specify the API for an inference endpoint
- Containerise the endpoint using Docker
- Upload the Docker image to Artifact Registry in GCP
- Deploy the image to Google Cloud Run
- Monitor the inference speed and load-test the deployment

Setup
First things first, you'll need to install FastAPI, Uvicorn, and Docker on your local machine to complete the first two steps. The third and fourth steps require an active Google Cloud Platform account with the Artifact Registry and Cloud Run APIs enabled. Luckily, GCP offers a free trial period with free credits, which you can activate using your Gmail account. In addition, you'll want to install the gcloud CLI tool, which will allow you to interact with your GCP account through the command line. Finally, you'll want to install Locust for load testing. The exact installation steps will depend on your system, but below you can find all the links to the installation instructions.
- Install FastAPI and Uvicorn
- Download and install Docker
- Get free GCP account
- Install gcloud CLI
- Install Locust
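For the Python-side dependencies, a typical local install looks something like the command below (a sketch; Docker and the gcloud CLI are installed separately, so check the linked instructions for your platform):
pip install fastapi "uvicorn[standard]" locust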
All the code is available in this GitHub repo. Make sure to pull it and follow along, since it's the best way to learn (in my opinion).
Project Overview
To make the learning more practical, this post will show you how to deploy a model for loan default prediction. The model training process is out of scope for this post, but you can find an already trained model in the GitHub repo. The model was trained on the pre-processed U.S. Small Business Administration dataset (CC BY-SA 4.0 license). Feel free to explore the data dictionary to understand what each of the columns means.
The main goal of this project is to create a working endpoint (called /predict) which will receive POST requests with information about loan applications (i.e. the model features) and will output a response with default probabilities.

With all of this out of the way, let's finally get started! First, let's see how we can create an inference API using the FastAPI library.
Create Inference API
What even is an API?
An API (Application Programming Interface) is a set of established protocols for external programs to interact with your application. To explain it a bit more intuitively, here's a quote from a really good answer on Stack Overflow:
Web APIs are entry points on an application running on a web server that permit other tools to interact with that web service in some way. Think of it as a "user interface" for software. A list of URLs and how to interact with them is all any web API is, in the end.
In other words, our API will take care of specifying how to interact with the application that hosts the ML model. To be more precise, it will specify:
- What's the URL of the prediction service
- What information the service expects as input
- What information the service will provide as outputs
FastAPI
FastAPI is a Python framework for developing web APIs. It's quite similar to Flask (covered in my previous post) but there are a few notable advantages.
- In-built data validation using Pydantic
- Automatic documentation with SwaggerUI
- Asynchronous task support using ASGI
- Overall simpler syntax
The creator of this framework basically explored every other solution for building APIs, decided what he liked about them, and combined all the nice functionality into FastAPI (for which we're very grateful, so don't forget to star the GitHub repo).
Input and Output Validation
Before creating an endpoint, it's important to define the type of information that is expected as input and output. FastAPI accepts JSON requests and responses, but it's crucial to specify the fields that are required and their respective data types. To accomplish this, FastAPI utilizes Pydantic, which validates type hints during runtime and offers clear and user-friendly error messages whenever invalid data is submitted.
The way that Pydantic enforces these data types is through data models. Data models are simply Python classes that inherit from pydantic.BaseModel and list the expected fields and their types as class variables. I've specified the expected input in the LoanApplication class and the expected output in the PredictionOut class, as sketched below. This type of strict validation helps you catch errors early in the development process and prevents them from causing issues in production. We can also specify the acceptable ranges for each of these parameters, but I'll skip this step for now.
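Here's a minimal sketch of what these data models can look like. The exact input fields depend on the pre-processed SBA dataset the model was trained on, so the feature names below are illustrative assumptions rather than the real schema:

from pydantic import BaseModel

class LoanApplication(BaseModel):
    # Illustrative feature fields -- the real model expects the
    # pre-processed SBA dataset columns, which may differ from these
    Term: int
    NoEmp: int
    NewExist: int
    UrbanRural: int
    GrAppv: float
    SBA_Appv: float

class PredictionOut(BaseModel):
    # The /predict endpoint returns a single probability of default
    default_probability: float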
Application and Endpoint Creation
FastAPI is used to both create the application and define the different endpoints that will serve as APIs. Here's a very simple example of a FastAPI application that has a prediction endpoint.
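The sketch below shows what such an application can look like, assuming the LoanApplication and PredictionOut models from the previous snippet live in the same app.py file; the model file name matches the one in the repo, but the rest is a minimal reconstruction rather than the exact code:

import pandas as pd
from catboost import CatBoostClassifier
from fastapi import FastAPI

# LoanApplication and PredictionOut are the Pydantic models defined above

# Load the pre-trained CatBoost model shipped with the repo
model = CatBoostClassifier()
model.load_model("loan_catboost_model.cbm")

app = FastAPI()

@app.post("/predict", response_model=PredictionOut)
def predict(payload: LoanApplication):
    # Turn the validated request into a single-row DataFrame for the model
    df = pd.DataFrame([payload.dict()])  # .dict() is Pydantic v1; use .model_dump() on v2
    # predict_proba returns [[P(no default), P(default)]] -- keep the second column
    proba = float(model.predict_proba(df)[0, 1])
    return {"default_probability": proba}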
Let's digest the code above:
- It imports the necessary libraries and loads a pre-trained CatBoost model that has been saved in a file
- It creates a FastAPI app instance by assigning a FastAPI() object to the app variable
- It defines an API endpoint for making predictions that expects POST requests by decorating a function with the @app.post("/predict", response_model=PredictionOut) decorator. The response_model parameter specifies the expected output format of the API endpoint
- The predict function takes a payload, which is expected to be an instance of a Pydantic model called LoanApplication. This payload is used to create a pandas DataFrame, which is then used to generate a prediction using the pre-trained CatBoost model. The predicted probability of default is returned as a JSON object
Notice how the previously defined input and output data classes come into play. We specify the expected response format using the response_model parameter in the decorator, and the request format as the input type hint of the predict function. Very easy and clean, yet highly effective.
There you have it! In a few lines of code, we've created an application with a prediction endpoint (API) and full-fledged data validation capabilities. To start up this application locally, you can run this command.
uvicorn app:app --host 0.0.0.0 --port 80
If everything goes well, you'll see that the application is running at the specified address.

Documentation with Swagger UI
Here's a cool thing about FastAPI – we get the documentation for free! Head over to http://0.0.0.0:80/docs and you should see the Swagger UI screen.

On the same screen, you can test out your API by clicking on the POST endpoint (in green) and editing the default request body. Then click Execute and you'll see what the API responds with.

If you've specified everything correctly, you should get response code 200, which means that the request has been successful. The body of this response should contain JSON with default_probability, which we've previously specified as the output of our /predict endpoint.

Of course, you can test the API programmatically as well using the requests library, but the UI gives you a nice and intuitive alternative.
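For example, a programmatic check could look roughly like this (the payload fields are the same illustrative ones as above, and the URL assumes the app is running locally on port 80):

import requests

# Illustrative payload -- the field names must match the LoanApplication model
payload = {
    "Term": 84,
    "NoEmp": 5,
    "NewExist": 1,
    "UrbanRural": 1,
    "GrAppv": 50000.0,
    "SBA_Appv": 40000.0,
}

response = requests.post("http://0.0.0.0:80/predict", json=payload)
print(response.status_code)  # 200 if the request succeeded
print(response.json())       # e.g. {"default_probability": 0.08}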
Containerise the Application Using Docker
As I've mentioned in my previous post, containerisation is the process of encapsulating your application and all of its dependencies (including Python) into a self-contained, isolated package that can run consistently across different environments (e.g. locally, in the cloud, on your friend's laptop, etc.). FastAPI has great documentation on how to run FastAPI in Docker, so check it out here if you're interested in more details.
Below, you can see a Dockerfile (set of instructions for Docker) for our API.
It's identical to the one we've used with Flask, with the exception of the last command that gets run. Here's a brief description of what this Dockerfile achieves:
- FROM python:3.9-slim: Specifies the container's base image as Python version 3.9 with the slim variant, which is a smaller image compared to the regular Python image
- WORKDIR /app: Sets the working directory inside the container to /app
- COPY requirements.txt requirements.txt: Copies the requirements.txt file from the host machine to the container's /app directory
- RUN pip install --upgrade pip and RUN pip install -r requirements.txt: Upgrade pip and install the required packages specified in the requirements.txt file using pip
- COPY ["loan_catboost_model.cbm", "app.py", "./"]: Copies the trained CatBoost model and the app.py file, which contains the Python code for the application, from the host machine to the container's /app directory
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]: Specifies the command that runs when the container starts: the uvicorn server with the app module as the main application, listening on 0.0.0.0:80
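Putting those instructions together, the Dockerfile looks roughly like this (a reconstruction from the description above, so the exact file in the repo may differ slightly):

FROM python:3.9-slim

WORKDIR /app

# Install the Python dependencies first so this layer can be cached
COPY requirements.txt requirements.txt
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Copy the trained model and the application code
COPY ["loan_catboost_model.cbm", "app.py", "./"]

# Serve the FastAPI app with uvicorn on port 80
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]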
Build the Docker Image
Once the Dockerfile is defined, we need to build a Docker image based on it. It's quite simple to build the image locally – all you need to do is run the following command in the folder where your Dockerfile and the application code are.
docker build -t default-service-fastapi:latest .
This will create an image called default-service-fastapi and save it locally. You can run docker images in your CLI to see it listed among the other Docker images you have.
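If you want to sanity-check the image before pushing it, you can run the container locally and open the same Swagger UI as before (a quick sketch):
docker run -p 80:80 default-service-fastapi:latest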
Push the Image to Google Artifact Registry
The process of pushing this image to Google Artifact Registry is a bit more involved but it is very nicely documented by Google. There are five steps to this process:
- Configure Docker permissions to access your Artifact Registry
- Enable Artifact Registry in your GCP
- Create a repo in the Registry
- Tag your image using specific naming convention
- Push the image
To begin with, you'll need to determine in which location you want to have your container stored. You can find a full list here. I'll be choosing europe-west2, so the command for me to authenticate my Docker is:
gcloud auth configure-docker europe-west2-docker.pkg.dev
The next two steps can be done in the GCP UI. Go to the Artifact Registry in the navigation menu and click the button to enable the API. Then, go to the repositories and click the Create Repository button. In the pop-up menu, specify that this is a Docker repository and set its location equal to the previously selected one (for me it's europe-west2).

All the setup is done, so let's begin the main part. To tag an image with the appropriate name and to push it, we need to run two specific commands.
docker tag IMAGE-NAME LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE-NAME
docker push LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE-NAME
Since the image name is default-service-fastapi, the repository name is ml-images, and my project ID is rosy-flames-376113, the commands for me look as follows:
docker tag default-service-fastapi europe-west2-docker.pkg.dev/rosy-flames-376113/ml-images/default-service-fastapi
docker push europe-west2-docker.pkg.dev/rosy-flames-376113/ml-images/default-service-fastapi
The push will take some time to execute, but once it's done, you'll be able to see your Docker image in the Artifact Registry UI.
By the way, you can find your Project ID or create a new project in the GCP UI by clicking the dropdown button next to your Project name (My First Project in the example below).

Deploy Container on Google Cloud Run
The actual deployment step is relatively easy with GCP. The only thing you need to take care of is enabling the Cloud Run API in your GCP project by going to the Cloud Run section in the navigation menu and clicking the button to enable the API.
There are two ways to create a service in Cloud Run – through the CLI or in the UI. I'm going to show you how to do it using CLI, and you can explore the UI option on your own.
gcloud run deploy default-service \
    --image europe-west2-docker.pkg.dev/rosy-flames-376113/ml-images/default-service-fastapi \
    --region europe-west2 \
    --port 80 \
    --memory 4Gi
This command will create a default-service (a.k.a. our API) from the previously uploaded image in the Artifact Registry. In addition, it also specifies the region (the same as our Docker image), the port to expose, and the RAM available to the service. You can check out the other parameters you can specify in the official documentation.

If you see a URL in your command line – congrats! Your model is officially deployed!
Test the Deployment
Similar to the previous post, I'll be using Locust to test the deployment. Here's the scenario that will be tested: a user sends a request with loan information and receives a response with the default probability (very unrealistic, but it works for our purposes).
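A sketch of what the app_test.py test scenario can look like is below; the payload fields are the same illustrative ones used earlier:

from locust import HttpUser, task, between

class LoanApplicant(HttpUser):
    # Wait 1-2 seconds between simulated requests
    wait_time = between(1, 2)

    @task
    def predict(self):
        # Illustrative payload -- field names must match the LoanApplication model
        self.client.post("/predict", json={
            "Term": 84,
            "NoEmp": 5,
            "NewExist": 1,
            "UrbanRural": 1,
            "GrAppv": 50000.0,
            "SBA_Appv": 40000.0,
        })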
With the local Flask deployment, my laptop could take up to ~180 requests per second before it started failing. In addition, the response time was gradually increasing with the number of requests per second. Let's run the test for our cloud deployment and see if it performs any better. To launch the test, run locust -f app_test.py (app_test.py contains the test scenario from above) and input the URL of your server in the UI, located by default at http://0.0.0.0:8089. I've set the limit of users to 300 with a spawn rate of 5, but you can change it to whatever you like.
Keep in mind that Cloud Run bills per request, so be mindful with these parameters!


After the traffic reaches its peak, here are the charts that I see for my server. First things first, the server didn't fail even at 300 requests per second, which is great news. This means that the cloud deployment is more robust than running a model on your local machine. Secondly, the median response time is way lower and stays almost constant with the increased traffic, which means that this deployment is also more performant.
There are, however, two notable peaks in the 95th percentile response times – at the beginning of the test and closer to the end. The first spike can be explained by the fact that Cloud Run servers stand idle until they start receiving traffic. This means that the service needs to warm up, so low speeds and even some failures are to be expected at the beginning. The second bump is probably due to the server starting to reach its capacity. However, you might notice that by the end of the test the 95th percentile response time started to decrease. This is due to our service starting to automatically scale up (thanks, Google!), as you can see in the dashboard below.

At its peak, the service actually had 12 instances running instead of one. I'd say that this is the main advantage of managed services like Cloud Run – you don't need to worry about the scalability of your deployment. Google will take care of adding new resources (as long as you can pay, of course) and will ensure the smooth running of your application.
Conclusion
This has been quite a ride, so let's summarise everything we've done in this post:
- Created a basic API with an inference endpoint using FastAPI
- Containerised the API using Docker
- Built the Docker image and pushed it to the Artifact Registry in GCP
- Turned this image into a service using Google Cloud Run
- Tested the deployment's robustness and inference speed using Locust
Well done for following along with these chapters! I hope that you now feel more confident about the deployment step of the ML process and can put your own models into production. Let me know if you have any questions remaining, and I'll try to cover them in the next posts.