Deploy a LightGBM ML Model With GitHub Actions

Author:Murphy | View: 25960 | Time: 2025-03-22 21:19:21

The internet is full of tutorials about how to train and tune Machine Learning models inside Jupyter notebooks. But, as Pau Labarta Bajo so perfectly puts it:

ML models inside Jupyter notebooks have a business value of $0

Executives have no appetite for proof-of-concept models stuck inside Jupyter notebooks. They need live models which deliver tangible value by continuously responding to new data.

In this tutorial, I'll show you how to build one.

We'll train a LightGBM machine learning model, write a script that uses the model to generate scores for a given input csv file, and deploy the whole thing as a batch prediction pipeline using GitHub Actions.

All for free.

If you've never productionised a model before (or if you don't know what that sentence even means), this guide is for you.

Background: What does it mean to "productionise" a model?

Generally, it means you upload your model to a server where it can automatically/repeatedly generate predictions.

For example, let's say you've trained a churn prediction model which, for a given customer, generates a score between 0 and 1 representing the likelihood that that customer will churn in the coming week.

You might have trained your model in a local Jupyter Notebook or .py script. For your model to be useful to others in your company, however, you need to get it out of your Jupyter Notebook and deploy it on a server where it can repeatedly generate predictions. That way, when someone wants to know "what's the likelihood that Customer A will churn this week?" they don't have to fire up your messy Jupyter Notebook and generate an ad-hoc prediction. They can just ping the server which hosts the deployed model and fetch the latest scores.

2 ways to productionise an ML model

Approach 1: Real-time prediction

If you need the ability to continuously ping your model on demand, you'll want to build a real-time inference pipeline.

Take a service like Google Maps, for instance. Behind the scenes of the Google Maps app, there is an ML model which predicts the total journey time for a given route.

The model is "real-time" in the sense that it continuously generates new/updated time estimates in response to new data.

This makes sense – if a traffic accident occurs midway through your journey, you'd expect the time estimate to increase in real-time. It wouldn't make sense for Google to generate all the journey time estimates in advance, because those estimates would quickly become outdated and inaccurate whenever a new pothole opened up.

Approach 2: Batch prediction

The other option is to build a batch prediction pipeline, in which you automatically generate predictions in "batches" at regular intervals.

This is the approach we'll take in this guide. Why? Because batch prediction is especially good for lots of "real-world" high-value business use cases like:

Generating churn likelihood scores for all customers in your base every week
Making call volume forecasts once a day, just before call centres open

If you want me to cover real-time systems in a future tutorial, let me know!

Let's put this into action by building a simple LightGBM regression model using the Mobile Price Prediction Data on Kaggle.

Step 1: Prepare the model

Preview the data

The data set contains the prices of 836 mobile phones, along with information about their specs (RAM, camera resolution, brand, etc.). Let's save it inside a folder called data/ and then inspect it in a notebook notebooks/eda.ipynb:

# notebooks/eda.ipynb

import pandas as pd

# Load data
df = pd.read_csv("../data/Mobile_Price_Prediction_Datatset.csv")

# Inspect
print(df.shape)
display(df.head())

Image by author. Data from the Mobile Price Prediction Data on Kaggle (Apache 2.0 license)

Preprocessing

There are four things we'll do to prepare the data:

Separate the features and the target
Reduce the number of categories in the Brand me column – this is necessary because the column contains 100s of unique values with very verbose names, e.g., "Apple iPhone 11 Pro (Space Grey, 512 )" and "Vivo Y17 (Mystic Purple, 128 )", and we don't want our model to overfit to these sparse values
Split the data into training, testing and validation subsets (the validation set is what we'll score after we've fitted/tested the model)
Create a preprocessing pipeline to handle null/missing values and encode the categorical feature Brand into numeric format
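To see what the brand simplification in the second item does, here's a minimal stand-alone sketch. It assumes the brand is always the first word of the listing name. That holds for the two examples quoted above, but it would truncate any multi-word brand, and an empty string would raise an IndexError.

```python
# Stand-alone illustration of the brand simplification step.
def identify_brand(brand_me: str) -> str:
    """Return the first whitespace-separated word of a listing name."""
    return brand_me.split()[0]

# The two listing names quoted in the list above
examples = [
    "Apple iPhone 11 Pro (Space Grey, 512 )",
    "Vivo Y17 (Mystic Purple, 128 )",
]

brands = [identify_brand(name) for name in examples]
print(brands)  # ['Apple', 'Vivo']
```

Collapsing hundreds of listing names into a handful of brands gives the one-hot encoder far fewer, far denser columns to work with.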

# notebooks/preprocessing.ipynb

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
import joblib

# 1. Separate X (features) and y (target)
TARGET = 'Price'
y = df[TARGET]
X = df.drop([TARGET, 'Unnamed: 0'], axis=1)

# 2. Simplify `Brand me` column
def identify_brand(brand_me):
    return brand_me.split()[0]
X['Brand'] = X['Brand me'].apply(identify_brand)

# 3. Split into training (70%), testing (15%) and validation (15%) subsets
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.3)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, test_size=0.5)

# 4. Save VALIDATION data for scoring (after we've trained and evaluated the model)
X_val.to_csv('../data/scoring_data.csv')

# 5. Create preprocessing pipeline

# 5.1 Preprocessing pipeline for numerical features
# Pipeline expects a list of (name, transformer) tuples
numerical_transformer = Pipeline([
  ('impute', SimpleImputer(strategy='most_frequent')),
])

# 5.2 Preprocessing pipeline for categorical feature (Brand)
categorical_transformer = Pipeline([
  ('impute', SimpleImputer(strategy='most_frequent')),
  ('encode', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# 5.3 Combine
preprocessing_pipeline = make_pipeline(ColumnTransformer(

    transformers=[
        ('impute_numerical', numerical_transformer, ['Ratings', 'RAM', 'ROM', 'Mobile_Size', 'Primary_Cam', 'Selfi_Cam', 'Battery_Power']),
        ('impute_categorical', categorical_transformer, ['Brand'])
    ],

    verbose_feature_names_out=False,
))

# 5.4 Fit and transform X DataFrames
preprocessing_pipeline.fit(X_train)
X_train = preprocessing_pipeline.transform(X_train)
X_test = preprocessing_pipeline.transform(X_test)

# 5.5 Save preprocessing pipeline
joblib.dump(preprocessing_pipeline, '../models/preprocessing_pipeline.joblib')

I'm using scikit-learn's Pipeline and ColumnTransformer to build a reusable preprocessing pipeline. I love these classes, but if you're unfamiliar with them, you can learn about them here:

Simplify Your Data Preparation With These 4 Lesser-Known Scikit-Learn Classes

Build simple model

We've prepared the data; now, let's build a simple model using LightGBM's scikit-learn API:

# notebooks/train.ipynb

import lightgbm as lgbm
from sklearn.metrics import mean_absolute_percentage_error, PredictionErrorDisplay
import joblib

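# Fit on the training features AND the training target
model = lgbm.LGBMRegressor()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred)}")

# Named error_display so it doesn't shadow the notebook's built-in display()
error_display = PredictionErrorDisplay(y_true=y_test, y_pred=y_pred)
error_display.plot()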

The MAPE is 103%. This looks pretty terrible, but I'm not too worried: it appears to be heavily inflated by a few outliers, and the majority of the predictions have an absolute percentage error below 50%.
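To see how a handful of outliers can inflate MAPE, here's a small sketch with made-up numbers (they are not taken from the phone data set). The helper absolute_percentage_errors is a hypothetical name of mine; it follows the usual |actual - predicted| / |actual| definition.

```python
import statistics

def absolute_percentage_errors(y_true, y_pred):
    """Per-prediction absolute percentage error, returned as fractions."""
    return [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]

# Hypothetical prices: three reasonable predictions and one bad miss
# on a very cheap phone
y_true = [10000, 20000, 15000, 500]
y_pred = [11000, 18000, 15750, 3000]

apes = absolute_percentage_errors(y_true, y_pred)
mape = sum(apes) / len(apes)
median_ape = statistics.median(apes)

print(f"MAPE:       {mape:.0%}")        # MAPE:       131%
print(f"Median APE: {median_ape:.0%}")  # Median APE: 10%
```

Three of the four predictions are within 10%, yet a single badly overestimated cheap phone drags the mean above 100%. Note also that scikit-learn's mean_absolute_percentage_error returns a fraction rather than a percentage, so a value of 1.03 corresponds to 103%.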

Besides, for the purpose of demonstrating how to productionise a model, it's good enough. Rather than optimising this further, let's look at how to productionise this model with GitHub Actions.

Step 2: Save the model

We'll use joblib to save the model so that it can be loaded in our scoring script:

# notebooks/train.ipynb (continued)

joblib.dump(model, '../models/model.joblib')

Step 3: Write a scoring script (.py)

To make our model (re)useable, we need to write a Python script that (1) fetches and preprocesses the data we want to score using our fitted pipeline, (2) loads the model, and (3) generates predictions:

# score.py

import pandas as pd
import joblib
import logging

def main():
    """
    Loads and preprocesses data, loads model, generates 
    predictions, and logs to scores.log
    """

    # Load data to be scored
    X_val = pd.read_csv('data/scoring_data.csv')

    # Preprocess data
    preprocessing_pipeline = joblib.load('models/preprocessing_pipeline.joblib')
    X_val = preprocessing_pipeline.transform(X_val)    

    # Load model
    model = joblib.load('models/model.joblib')

    # Generate scores
    scores = model.predict(X_val)

    # Log scores
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    f_handler = logging.FileHandler('scores.log')
    f_format = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    f_handler.setFormatter(f_format)
    logger.addHandler(f_handler)
    logger.info(f'Scores: {scores}')

if __name__ == '__main__':
    main()

We're using a Python script (rather than a Jupyter notebook) because Python scripts can be easily run from the command line (e.g., python score.py).
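Because the script runs from the command line, it could also take its inputs as arguments. The sketch below is one way that might look, using argparse from the standard library. The --input and --log-file flags are hypothetical additions of mine; they are not part of the score.py shown above, which hard-codes both paths.

```python
import argparse

def parse_args(argv=None):
    """Parse command-line options for a scoring script (hypothetical flags)."""
    parser = argparse.ArgumentParser(
        description="Score a CSV file with the saved model."
    )
    parser.add_argument(
        "--input",
        default="data/scoring_data.csv",
        help="CSV file to score (defaults to the path used in this tutorial)",
    )
    parser.add_argument(
        "--log-file",
        default="scores.log",
        help="File to write the scores to",
    )
    return parser.parse_args(argv)

# Passing explicit lists here stands in for real command-line input
defaults = parse_args([])
custom = parse_args(["--input", "data/new_customers.csv"])

print(defaults.input)   # data/scoring_data.csv
print(custom.input)     # data/new_customers.csv
print(custom.log_file)  # scores.log
```

With something like this in place, the same script could score a different file via python score.py --input data/new_customers.csv without any code changes.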

Step 4: Save package versions

Next, we need to save a record of which package versions we're using to a requirements.txt file:

$ pip freeze > requirements.txt

This is necessary so that, when GitHub Actions tries to execute our script, it knows which version of each package (e.g., pandas, scikit-learn, etc.) to install.
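One thing to be aware of: pip freeze records every package in the current environment, not just the ones score.py needs. If you'd rather pin only the essentials, a rough sketch using the standard library's importlib.metadata is below. The package list is my assumption based on the imports used in this tutorial, and pinned_requirements is a hypothetical helper name.

```python
from importlib import metadata

def pinned_requirements(packages):
    """Build 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Leave a comment line so the gap is visible in the output
            lines.append(f"# {name} is not installed in this environment")
    return lines

# Assumed list: the libraries imported in the notebooks and in score.py
needed = ["pandas", "scikit-learn", "lightgbm", "joblib"]

for line in pinned_requirements(needed):
    print(line)
```

The printed lines can be pasted into requirements.txt. Matching versions matters here because the saved .joblib files are generally only safe to load with the same library versions that created them.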

Step 5: Create a GitHub Action to run score.py script at regular intervals

Now we get onto the fun bit: scheduling the batch inference pipeline.
