How To Deploy and Test Your Models Using FastAPI and Google Cloud Run

Introduction
MLOps (Machine Learning Operations) is an increasingly popular function and skill among data professionals. Full-stack data scientists who are capable of taking machine learning (ML) projects from training to production are more and more in demand, so if you feel a bit uncomfortable in this area, or you want a quick refresher, this post is for you! If you've never deployed a model before, I highly encourage you to check out my previous post, which gives more of an introduction to ML deployment and explains many of the concepts used in this post.
This post is going to cover quite a few topics, so feel free to skip to the chapters you're most interested in. The structure of this post is:
- Use FastAPI to specify the API for an inference endpoint
- Containerise the endpoint using Docker
- Upload the Docker image to Artifact Registry in GCP
- Deploy the image to Google Cloud Run
- Monitor the inference speed and load-test the deployment

Setup
First things first, you'll need to install FastAPI, Uvicorn, and Docker on your local machine to complete the first two steps. The third and fourth steps require an active Google Cloud Platform account with the Artifact Registry and Cloud Run APIs enabled. Luckily, GCP offers a free trial period with free credits, which you can activate using your Gmail account. In addition, you'll want to install the gcloud CLI tool, which will allow you to interact with your GCP account through the command line. Finally, you'll want to install Locust for load testing. The exact installation steps will depend on your system, but below you can find all the links to the installation instructions.
- Install FastAPI and Uvicorn
- Download and install Docker
- Get free GCP account
- Install gcloud CLI
- Install Locust
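For the Python-side dependencies, a typical local install looks something like the command below (a sketch; Docker and the gcloud CLI are installed separately, so check the linked instructions for your platform):
pip install fastapi "uvicorn[standard]" locust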
All the code is available in this GitHub repo. Make sure to pull it and follow along, since it's the best way to learn (in my opinion).
Project Overview
To make the learning more practical, this post will show you how to deploy a model for loan default prediction. The model training process is out of scope for this post, but you can find an already trained model in the GitHub repo. The model was trained on the pre-processed U.S. Small Business Administration dataset (CC BY-SA 4.0 license). Feel free to explore the data dictionary to understand what each of the columns means.
The main goal of this project is to create a working endpoint (called /predict) which will receive POST requests with information about loan applications (i.e. the model features) and will output a response with default probabilities.

With all of this out of the way, let's finally get started! First, let's see how we can create an inference API using the FastAPI library.
Create Inference API
What even is an API?
An API (Application Programming Interface) is a set of established protocols for external programs to interact with your application. To explain it a bit more intuitively, here's a quote from a really good answer on Stack Overflow:
Web APIs are entry points on an application running on a web server that permit other tools to interact with that web service in some way. Think of it as a "user interface" for software. A list of URLs and how to interact with them is all any web API is, in the end.
In other words, our API will take care of specifying how to interact with the application that hosts the ML model. To be more precise, it will specify:
- What's the URL of the prediction service
- What information the service expects as input
- What information the service will provide as outputs
FastAPI
FastAPI is a Python framework for developing web APIs. It's quite similar to Flask (covered in my previous post) but there are a few notable advantages.
- In-built data validation using Pydantic
- Automatic documentation with SwaggerUI
- Asynchronous task support using ASGI
- Overall simpler syntax
The creator of this framework basically explored every other solution for building APIs, decided what he liked about them, and combined all the nice functionality into FastAPI (for which we're very grateful, so don't forget to star the GitHub repo).
Input and Output Validation
Before creating an endpoint, it's important to define the type of information that is expected as input and output. FastAPI accepts JSON requests and responses, but it's crucial to specify the fields that are required and their respective data types. To accomplish this, FastAPI utilizes Pydantic, which validates type hints during runtime and offers clear and user-friendly error messages whenever invalid data is submitted.
The way that Pydantic enforces these data types is through data models. Data models are simply Python classes that inherit from pydantic.BaseModel and list the expected fields and their types as class variables. I've specified the expected input in the LoanApplication class and the expected output in the PredictionOut class, as sketched below. This type of strict validation helps you catch errors early in the development process and prevents them from causing issues in production. We can also specify the acceptable ranges for each of these parameters, but I'll skip this step for now.
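Here's a minimal sketch of what these data models can look like. The exact input fields depend on the pre-processed SBA dataset the model was trained on, so the feature names below are illustrative assumptions rather than the real schema:

from pydantic import BaseModel

class LoanApplication(BaseModel):
    # Illustrative feature fields -- the real model expects the
    # pre-processed SBA dataset columns, which may differ from these
    Term: int
    NoEmp: int
    NewExist: int
    UrbanRural: int
    GrAppv: float
    SBA_Appv: float

class PredictionOut(BaseModel):
    # The /predict endpoint returns a single probability of default
    default_probability: float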
Application and Endpoint Creation
FastAPI is used to both create the application and define the different endpoints that will serve as APIs. Here's a very simple example of a FastAPI application that has a prediction endpoint.
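The sketch below shows what such an application can look like, assuming the LoanApplication and PredictionOut models from the previous snippet live in the same app.py file; the model file name matches the one in the repo, but the rest is a minimal reconstruction rather than the exact code:

import pandas as pd
from catboost import CatBoostClassifier
from fastapi import FastAPI

# LoanApplication and PredictionOut are the Pydantic models defined above

# Load the pre-trained CatBoost model shipped with the repo
model = CatBoostClassifier()
model.load_model("loan_catboost_model.cbm")

app = FastAPI()

@app.post("/predict", response_model=PredictionOut)
def predict(payload: LoanApplication):
    # Turn the validated request into a single-row DataFrame for the model
    df = pd.DataFrame([payload.dict()])  # .dict() is Pydantic v1; use .model_dump() on v2
    # predict_proba returns [[P(no default), P(default)]] -- keep the second column
    proba = float(model.predict_proba(df)[0, 1])
    return {"default_probability": proba}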
Let's digest the code above:
- It imports the necessary libraries and loads a pre-trained CatBoost model that has been saved in a file
- It creates a FastAPI app instance by assigning a FastAPI() object to the app variable
- It defines an API endpoint for making predictions that expects POST requests by decorating a function with the @app.post("/predict", response_model=PredictionOut) decorator. The response_model parameter specifies the expected output format of the API endpoint
- The predict function takes a payload, which is expected to be an instance of a Pydantic model called LoanApplication. This payload is used to create a pandas DataFrame, which is then used to generate a prediction using the pre-trained CatBoost model. The predicted probability of default is returned as a JSON object
Notice how the previously defined input and output data classes come into play. We specify the expected response format using the response_model parameter in the decorator, and the request format as the input type hint of the predict function. Very easy and clean, yet highly effective.
There you have it! In a few lines of code, we've created an application with a prediction endpoint (API) and full-fledged data validation capabilities. To start up this application locally, you can run this command.
uvicorn app:app --host 0.0.0.0 --port 80
If everything goes well, you'll see that the application is running at the specified address.

Documentation with Swagger UI
Here's a cool thing about FastAPI – we get the documentation for free! Head over to http://0.0.0.0:80/docs and you should see the Swagger UI screen.

On the same screen, you can test out your API by clicking on the POST endpoint (in green) and editing the default request body. Then click Execute and you'll see what the API responds with.

If you've specified everything correctly, you should get response code 200, which means that the request has been successful. The body of this response should contain JSON with default_probability, which we've previously specified as the output of our /predict endpoint.

Of course, you can test the API programmatically as well using the requests library, but the UI gives you a nice and intuitive alternative.
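For example, a programmatic check could look roughly like this (the payload fields are the same illustrative ones as above, and the URL assumes the app is running locally on port 80):

import requests

# Illustrative payload -- the field names must match the LoanApplication model
payload = {
    "Term": 84,
    "NoEmp": 5,
    "NewExist": 1,
    "UrbanRural": 1,
    "GrAppv": 50000.0,
    "SBA_Appv": 40000.0,
}

response = requests.post("http://0.0.0.0:80/predict", json=payload)
print(response.status_code)  # 200 if the request succeeded
print(response.json())       # e.g. {"default_probability": 0.08}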
Containerise the Application Using Docker
As I've mentioned in my previous post, containerisation is the process of encapsulating your application and all of its dependencies (including Python) into a self-contained, isolated package that can run consistently across different environments (e.g. locally, in the cloud, on your friend's laptop, etc.). FastAPI has great documentation on how to run FastAPI in Docker, so check it out here if you're interested in more details.
Below, you can see a Dockerfile (set of instructions for Docker) for our API.
It's identical to the one we've used with Flask, with the exception of the last command that gets run. Here's a brief description of what this Dockerfile achieves:
- FROM python:3.9-slim: Specifies the container's base image as Python version 3.9 with the slim variant, which is a smaller image compared to the regular Python image
- WORKDIR /app: Sets the working directory inside the container to /app
- COPY requirements.txt requirements.txt: Copies the requirements.txt file from the host machine to the container's /app directory
- RUN pip install --upgrade pip and RUN pip install -r requirements.txt: Upgrade pip and install the required packages specified in the requirements.txt file using pip
- COPY ["loan_catboost_model.cbm", "app.py", "./"]: Copies the trained CatBoost model and the app.py file, which contains the Python code for the application, from the host machine to the container's /app directory
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]: Specifies the command that runs when the container starts: the uvicorn server with the app module as the main application, listening on 0.0.0.0:80
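Putting those instructions together, the Dockerfile looks roughly like this (a reconstruction from the description above, so the exact file in the repo may differ slightly):

FROM python:3.9-slim

WORKDIR /app

# Install the Python dependencies first so this layer can be cached
COPY requirements.txt requirements.txt
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Copy the trained model and the application code
COPY ["loan_catboost_model.cbm", "app.py", "./"]

# Serve the FastAPI app with uvicorn on port 80
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]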
Build the Docker Image
Once the Dockerfile is defined, we need to build a Docker image based on it. It's quite simple to build the image locally – all you need to do is run the following command in the folder where your Dockerfile and the application code are.
docker build -t default-service-fastapi:latest .
This will create an image called default-service-fastapi and save it locally. You can run docker images in your CLI to see it listed among the other Docker images you have.
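If you want to sanity-check the image before pushing it, you can run the container locally and open the same Swagger UI as before (a quick sketch):
docker run -p 80:80 default-service-fastapi:latest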
Push the Image to Google Artifact Registry
The process of pushing this image to Google Artifact Registry is a bit more involved but it is very nicely documented by Google. There are five steps to this process:
- Configure Docker permissions to access your Artifact Registry
- Enable Artifact Registry in your GCP
- Create a repo in the Registry
- Tag your image using specific naming convention
- Push the image
To begin with, you'll need to determine in which location you want to have your container stored. You can find a full list here. I'll be choosing europe-west2, so the command for me to authenticate my Docker is:
gcloud auth configure-docker europe-west2-docker.pkg.dev
The next two steps can be done in the GCP UI. Go to the Artifact Registry in the navigation menu and click the button to enable the API. Then, go to the repositories and click the Create Repository button. In the pop-up menu, specify that this is a Docker repository and set its location equal to the previously selected one (for me it's europe-west2).

All the setup is done, so let's begin the main part. To tag an image with the appropriate name and to push it, we need to run two specific commands.
docker tag IMAGE-NAME LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE-NAME
docker push LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE-NAME
Since the image name is default-service-fastapi, the repository name is ml-images, and my project ID is rosy-flames-376113, the commands for me look as follows:
docker tag default-service-fastapi europe-west2-docker.pkg.dev/rosy-flames-376113/ml-images/default-service-fastapi
docker push europe-west2-docker.pkg.dev/rosy-flames-376113/ml-images/default-service-fastapi
The push will take some time to execute, but once it's done, you'll be able to see your Docker image in the Artifact Registry UI.
By the way, you can find your Project ID or create a new project in the GCP UI by clicking the dropdown button next to your Project name (My First Project in the example below).

Deploy Container on Google Cloud Run
The actual deployment step is relatively easy with GCP. The only thing you need to take care of is enabling the Cloud Run API in your GCP project by going to the Cloud Run section in the navigation menu and clicking the button to enable the API.
There are two ways to create a service in Cloud Run – through the CLI or in the UI. I'm going to show you how to do it using CLI, and you can explore the UI option on your own.
gcloud run deploy default-service \
    --image europe-west2-docker.pkg.dev/rosy-flames-376113/ml-images/default-service-fastapi \
    --region europe-west2 \
    --port 80 \
    --memory 4Gi
This command will create a default-service (a.k.a. our API) from the previously uploaded image in the Artifact Registry. In addition, it also specifies the region (the same as our Docker image), the port to expose, and the RAM available to the service. You can check out the other parameters you can specify in the official documentation.

If you see a URL in your command line – congrats! Your model is officially deployed!
Test the Deployment
Similar to the previous post, I'll be using Locust to test the deployment. Here's the scenario that will be tested: a user sends a request with loan information and receives a response with the default probability (very unrealistic, but it works for our purposes).
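A sketch of what the app_test.py test scenario can look like is below; the payload fields are the same illustrative ones used earlier:

from locust import HttpUser, task, between

class LoanApplicant(HttpUser):
    # Wait 1-2 seconds between simulated requests
    wait_time = between(1, 2)

    @task
    def predict(self):
        # Illustrative payload -- field names must match the LoanApplication model
        self.client.post("/predict", json={
            "Term": 84,
            "NoEmp": 5,
            "NewExist": 1,
            "UrbanRural": 1,
            "GrAppv": 50000.0,
            "SBA_Appv": 40000.0,
        })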
With the local Flask deployment, my laptop could take up to ~180 requests per second before it started failing. In addition, the response time was gradually increasing with the number of requests per second. Let's run the test for our cloud deployment and see if it performs any better. To launch the test, run locust -f app_test.py (app_test.py contains the test scenario from above) and input the URL of your server in the UI, located by default at http://0.0.0.0:8089. I've set the limit of users to 300 with a spawn rate of 5, but you can change it to whatever you like.
Keep in mind that Cloud Run bills per request, so be mindful with these parameters!


After the traffic reaches its peak, here are the charts that I see for my server. First things first, the server didn't fail even at 300 requests per second, which is great news. This means that the cloud deployment is more robust than running a model on your local machine. Secondly, the median response time is way lower and stays almost constant with the increased traffic, which means that this deployment is also more performant.
There are, however, two notable peaks in the 95th percentile response times – at the beginning of the test and closer to the end. The first spike can be explained by the fact that Cloud Run servers stand idle until they start receiving traffic. This means that the service needs to warm up, so low speeds and even some failures are to be expected at the beginning. The second bump is probably due to the server starting to reach its capacity. However, you might notice that by the end of the test the 95th percentile response time started to decrease. This is due to our service starting to automatically scale up (thanks, Google!), as you can see in the dashboard below.

At its peak, the service actually had 12 instances running instead of one. I'd say that this is the main advantage of managed services like Cloud Run – you don't need to worry about the scalability of your deployment. Google will take care of adding new resources (as long as you can pay, of course) and will ensure the smooth running of your application.
Conclusion
This has been quite a ride, so let's summarise everything we've done in this post:
- Created a basic API with an inference endpoint using FastAPI
- Containerised the API using Docker
- Built the Docker image and pushed it to the Artifact Registry in GCP
- Turned this image into a service using Google Cloud Run
- Tested the deployment's robustness and inference speed using Locust
Well done for following along with these chapters! I hope that you now feel more confident about the deployment step of the ML process and can put your own models into production. Let me know if you have any questions remaining, and I'll try to cover them in the next posts.