Managing the Technical Debts of Machine Learning Systems


Explore the practices (design patterns, version control, and monitoring systems) for sustainably mitigating the cost of speedy delivery, with implementation code examples

As the Machine Learning (ML) community advances over the years, the resources available for developing ML projects are plentiful. For example, we can rely on the generic Python package scikit-learn, which is built on NumPy, SciPy, and matplotlib, for data preprocessing and basic predictive tasks. Or we can leverage the open-source collection of pre-trained models from Hugging Face for analyzing diverse types of datasets. These empower current data scientists to quickly and effortlessly tackle standard ML tasks while achieving moderately good model performance.

However, the abundance of ML tools often leads business stakeholders, and even practitioners, to underestimate the effort required to build enterprise-level ML systems. Particularly when faced with tight project deadlines, teams may rush systems into production without sufficient technical consideration. Consequently, the ML system often fails to address the business needs in a technically sustainable and maintainable manner.

As the system evolves and is deployed over time, technical debt accumulates: the longer the implied costs remain unaddressed, the more expensive they become to rectify.


There are multiple sources of technical debt in an ML system. Some are outlined below.

1 Inflexible code design that cannot cater to unforeseen requirements

To validate whether ML can address the enterprise challenges at hand, many ML projects commence with a proof of concept (PoC). We initially create a Jupyter Notebook or Google Colab environment to explore data, then develop several ad-hoc functions, creating the illusion for stakeholders that the project is near completion. Systems built directly from a PoC may end up consisting mostly of glue code: supporting code that connects otherwise incompatible components but performs no data analysis itself. Such code can be spaghetti-like, hard to maintain, and prone to errors.


Business stakeholders present new requirements from time to time or wish to scale up the project, such as trying out new data sources or new algorithms. We thus find ourselves frequently revisiting the codebase that covers the current preprocessing pipelines and model development processes. Inflexible code design makes it difficult to react to such changes and can force us to rewrite most of the code for even minor adjustments.

2 Messiness in the configurations of the ML system

Traditional software engineering automates tasks by defining rules for computers to follow, ensuring precise output for the same input, and software engineers are also concerned with the correctness of every corner case. ML systems, on the other hand, automate tasks by collecting feature data and feeding it into models to achieve desired target results. This experimental process embraces uncertainty and variability. As ML systems mature, they often accumulate multiple versions of configuration options, such as datasets with different feature combinations and algorithm-specific learning settings.


Input features in ML systems are inherently interconnected. Consider a scenario where feature A is no longer available as an input for your ML system in production: you are required to re-evaluate the weighting of the remaining features for production serving. However, after two months, feature A becomes available again. If you did not systematically save the original configurations, or even mistakenly modified them, rectifying the resulting decrease in performance would require additional computational resources and time.
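
One lightweight safeguard is to snapshot every serving configuration (active features and hyperparameters) to a timestamped file before changing it, so that an earlier setup can be restored when feature A returns. Below is a minimal sketch under that assumption; the feature names and parameter values are purely illustrative, and the version-control tooling discussed later offers a more complete solution.

import json
import os
from datetime import datetime, timezone

# Illustrative serving configuration: active features and model hyperparameters
config = {
  'features': ['feature_A', 'feature_B', 'feature_C'],
  'model_params': {'num_leaves': 16, 'n_estimators': 80},
}

# Snapshot the configuration with a timestamp so it can be restored later,
# e.g. when feature_A becomes available again after two months
os.makedirs('configs', exist_ok=True)
stamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
with open(f'configs/serving_config_{stamp}.json', 'w') as f:
  json.dump(config, f, indent=2)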

3 Limited ability to adapt to the evolving external world

ML systems often have dependencies on the external world, where various hidden factors continually evolve but are not appropriately considered or monitored.

External factors leading to potential degradation of model performance (image by author)
  • Unstable data output from upstream producers: The input signal of our ML system may come from another machine learning model that updates over time. Additionally, the system may rely on unstructured data such as signals from Internet of Things (IoT) devices, web-scraped data, or output from an audio-to-text converter. If the maintenance of these upstream tools is not properly communicated, or flawed patches are deployed, the performance of the ML system can degrade unexpectedly.
  • Drift in input data: Take demand forecasting in the retail industry as an example. The input data can shift to new distributions periodically (such as the seasonal cycle of purchasing behavior), gradually (like the inflationary cost of goods from suppliers), or suddenly (such as the entry of new competitors).

In the following sections, we will delve into some great practices for building ML systems and provide illustrative Python code examples to demonstrate their implementation.

Imagine that you would like to build a robust traffic system for a city, prepared for traffic peaks, so you have collected traffic data from sensors over the past two years. Your goal is to predict the traffic patterns (i.e., the number of vehicles) for the upcoming six months.

  • Training dataset: ID, DateTime, and no. of vehicles
  • Test dataset: ID and DateTime

1 Use design patterns for the ML codebase

To make the code design more flexible and reusable for future requirements, we can leverage design patterns. These patterns serve as templates for solving common problems in various situations, enabling us to decouple different parts of the codebase. As a result, they improve comprehension of the codebase and build a common language for communicating quickly about solutions.

The two primary components in ML projects are data and algorithms, which can benefit from design patterns.

  • Factory pattern

This creational pattern provides a layer of abstraction for generating objects at runtime. In ML systems, we can implement this pattern by creating a data loader class (CSVDataLoader in this example) that handles the loading, saving, and returning of training/test data with a consistent data type. We can then declare the DataProcessor interface without specifying the concrete loader implementation.

import pandas as pd

# CSV Data loader class
class CSVDataLoader:
  def __init__(self, file_path):
    self.file_path = file_path

  def get_data(self):
    # Load the CSV file into a DataFrame
    return pd.read_csv(self.file_path)

  def save_data(self, df):
    # Persist the transformed DataFrame under the 'data' folder
    df.to_csv(f'data/transformed_{self.file_path}', index=False)

# Interface
class DataProcessor:
  def __init__(self, train_data_loader, test_data_loader):
    self.train_data_loader = train_data_loader
    self.test_data_loader = test_data_loader

  def run(self):
    # Load training and test data using data loaders
    # (transformation and modelling steps are added under the Strategy Pattern below)
    train_df = self.train_data_loader.get_data()
    test_df = self.test_data_loader.get_data()

# Create a data processor instance
process = DataProcessor(
 train_data_loader=CSVDataLoader(file_path='train.csv'),
 test_data_loader=CSVDataLoader(file_path='test.csv')
)

# Run the data processing pipeline
process.run()

This approach allows you to extend the code without having to re-implement DataProcessor. For example, if you want to load the dataset from a JSON file, you can simply create a new JSONDataLoader class and pass it in when declaring the DataProcessor, as sketched below.
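
As a rough sketch of that extension (assuming the JSON files hold records in a format pd.read_json can parse; the file names are illustrative), the new loader only needs to expose the same get_data and save_data methods:

import pandas as pd

# JSON Data loader class exposing the same interface as CSVDataLoader
class JSONDataLoader:
  def __init__(self, file_path):
    self.file_path = file_path

  def get_data(self):
    # Load the JSON records into a DataFrame
    return pd.read_json(self.file_path)

  def save_data(self, df):
    # Persist the transformed DataFrame under the 'data' folder
    df.to_json(f'data/transformed_{self.file_path}', orient='records')

# Swap the loader in without touching DataProcessor
process = DataProcessor(
  train_data_loader=JSONDataLoader(file_path='train.json'),
  test_data_loader=JSONDataLoader(file_path='test.json')
)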

  • Strategy Pattern

Since there is no one-size-fits-all algorithm for ML problems with varying data distributions, we often find ourselves switching between algorithms and experimenting during prototyping or project enhancement. We can apply the Strategy Pattern by creating a new class DataTransformer for feature engineering and another class LGBMModel that encapsulates the strategy of fitting and predicting with a LightGBM model.

import time

import lightgbm as lgb
import pandas as pd

class DataTransformer:
  def transform_data(self, train_df, test_df):
    for idx, df in enumerate([train_df, test_df]):
      df['DateTime'] = pd.to_datetime(df['DateTime'])

      # Build 'Time' column
      df['Time'] = [date.hour * 3600 + date.minute * 60 + date.second for date in df['DateTime']]

      # Convert DateTime to Unix timestamp
      unixtime = [time.mktime(date.timetuple()) for date in df['DateTime']]
      df['DateTime'] = unixtime

      # Perform one-hot encoding on the DataFrame
      df = pd.get_dummies(df)

      if idx == 0:
        # Split training DataFrame into features (X_train) and target (y_train)
        X_train = df.drop(['Vehicles'], axis=1)
        y_train = df[['Vehicles']]
      elif idx == 1:
        # Store test DataFrame
        X_test = df

    return X_train, y_train, X_test


class LGBMModel:
  def __init__(self, num_leaves, n_estimators):
    self.model = lgb.LGBMRegressor(
      num_leaves=num_leaves,
      n_estimators=n_estimators
    )

  def fit(self, X, y):
    self.model.fit(X, y)
    self.model.booster_.save_model('model/lgbm_model.txt')
    return self

  def predict(self, X):
    predictions = self.model.predict(X)
    return predictions

The implementation and declaration of the DataProcessor interface are provided below. This is the end-to-end process: it loads the training and test data using train_data_loader and test_data_loader respectively, transforms the data using data_transformer, and fits the model to the transformed data using model. As a result, we can predict the number of vehicles for each record of the test dataset.

# Interface
class DataProcessor:
  def __init__(self, train_data_loader, test_data_loader, data_transformer, model):
    self.train_data_loader = train_data_loader
    self.test_data_loader = test_data_loader
    self.data_transformer = data_transformer
    self.model = model

  def run(self):
    # Load train and test data using data loaders
    train_df = self.train_data_loader.get_data()
    test_df = self.test_data_loader.get_data()

    # Transform the data using the data transformer
    X_train, y_train, X_test = self.data_transformer.transform_data(train_df, test_df)

    # Fit the model and make prediction
    self.model.fit(X_train, y_train)
    test_df['Vehicles'] = self.model.predict(X_test)

    # Save the transformed training data and test data
    self.train_data_loader.save_data(pd.concat([X_train, y_train], axis=1))
    self.test_data_loader.save_data(test_df)

# Create a data processor instance
process = DataProcessor(
    train_data_loader=CSVDataLoader(file_path='train.csv'),
    test_data_loader=CSVDataLoader(file_path='test.csv'),
    data_transformer=DataTransformer(),
    model=LGBMModel(num_leaves=16, n_estimators=80)
)

# Run the data processing pipeline
process.run()

You can easily add new blocks of code to implement other data transformation ideas or algorithms, as sketched below. Similar to the Factory Pattern, these changes do not require you to modify the DataProcessor interface. This design makes the code easier to maintain, even with a long list of algorithms, and the behavior of the ML system can vary dynamically based on the chosen strategy.
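
For instance, here is a hedged sketch of an alternative strategy built on scikit-learn's RandomForestRegressor (the class name and parameter values are illustrative, not part of the original project); it exposes the same fit and predict methods, so it can be passed to DataProcessor in place of LGBMModel:

from sklearn.ensemble import RandomForestRegressor

# Alternative model strategy exposing the same fit/predict interface as LGBMModel
class RandomForestModel:
  def __init__(self, n_estimators, max_depth):
    self.model = RandomForestRegressor(
      n_estimators=n_estimators,
      max_depth=max_depth
    )

  def fit(self, X, y):
    self.model.fit(X, y.values.ravel())
    return self

  def predict(self, X):
    return self.model.predict(X)

# Swap the strategy without modifying the DataProcessor interface
process = DataProcessor(
    train_data_loader=CSVDataLoader(file_path='train.csv'),
    test_data_loader=CSVDataLoader(file_path='test.csv'),
    data_transformer=DataTransformer(),
    model=RandomForestModel(n_estimators=100, max_depth=8)
)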

Of course, the above code implementation is only a preliminary template for development. For example, we can further enhance it by adding data validation, a hyperparameter tuning mechanism (a rough sketch follows), and model evaluation.
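
As one possible direction for the tuning enhancement, the sketch below wraps scikit-learn's GridSearchCV inside a model strategy; the parameter grid and scoring choice are illustrative assumptions rather than recommendations from the original project.

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Model strategy that tunes LightGBM hyperparameters before fitting
class TunedLGBMModel:
  def __init__(self, param_grid):
    self.param_grid = param_grid
    self.model = None

  def fit(self, X, y):
    # Search the illustrative parameter grid with 3-fold cross-validation
    search = GridSearchCV(
      estimator=lgb.LGBMRegressor(),
      param_grid=self.param_grid,
      cv=3,
      scoring='neg_mean_absolute_error'
    )
    search.fit(X, y.values.ravel())
    self.model = search.best_estimator_
    return self

  def predict(self, X):
    return self.model.predict(X)

# e.g. model=TunedLGBMModel(param_grid={'num_leaves': [16, 31], 'n_estimators': [80, 120]})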

2 Version control of the ML systems

Throughout the complex process of model development and management, we require proper version control. This enables us to maintain a history of modifications made by ourselves or team members, and to track versions of the ML system's components in the local environment, including data, trained models, and hyperparameters. We can thus address some common questions, including:

  • What was the change that led to the model's failure?
  • Which modifications resulted in improved model performance?
  • Which version of the model was most recently released?

Here we demonstrate how to utilize the versioning features of DVC, which works best within a Git repository, for tracking our original traffic data, transformed traffic data, and LGBM models.

# Initialise a Git and DVC project in the current working directory
git init
dvc init

# Capture the current state of transformed data in folder 'data' and latest LGBM model in folder 'model'
dvc add data model

# Commit the current state of 1st version
git add data.dvc model.dvc .gitignore
git commit -m "First LGBM model, with Time feature"
git tag -a "v1.0" -m "model v1.0, Time feature"

Let's consider a scenario where we have made the following changes in the second version of data processing:

  • Add the Weekday feature in the DataTransformer class (before DateTime is converted to a Unix timestamp)
df['Weekday'] = [date.weekday() for date in df['DateTime']]
  • Set new configuration parameters for the LGBM model in the DataProcessor interface
model=LGBMModel(num_leaves=20, n_estimators=90)

With the following commands, we can track the second version of the data and model in DVC and commit the .dvc files that point to them with Git.

# Capture the second version of the transformed data and model
dvc add data model

git add data.dvc model.dvc
git commit -m "Second LGBM model, with Time and Weekday features"
git tag -a "v2.0" -m "model v2.0, Time and Weekday features"

Though the workspace currently holds the second version of our data and model, we can easily switch back and restore the first snapshot whenever necessary.

git checkout v1.0
dvc checkout

The above commands cover the basic operations. We can further leverage the tool for project organization and collaboration, for example to understand how the data was initially built and to compare what changed between experiments, as shown below.
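
For example, the following commands (using the tags created above) are one way to surface what changed between the two snapshots:

# List the data and model files that changed between the two tagged versions
dvc diff v1.0 v2.0

# Inspect how the .dvc pointer files changed between the tags
git diff v1.0 v2.0 -- data.dvc model.dvc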

3 Test and monitor the ML systems continuously

Once the enhanced model can generate predictions, it is essential to perform sanity checks before releasing it to production. This is achieved by feeding a random sample of online data into the latest model offline and examining the results from various perspectives.

  • Ensuring the right access permissions: Confirm that the model results can be stored in the destination path (such as writing them to a Hive table).
  • Eliminating semantic errors: Visualize the distribution of the transformed features fed into the model, to identify any deviant behavior.
  • Assessing model performance: Re-score with the latest model and compare the results against the current online model using appropriate performance metrics (e.g., the F1-score is preferable to accuracy for imbalanced classification problems, while error metrics such as MAE suit regression); a minimal sketch follows this list.
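
Below is a minimal sketch of such a comparison for the traffic example, written as a hypothetical helper that receives the current online model, the candidate model, and a labelled random sample of recent online data (all names are illustrative):

from sklearn.metrics import mean_absolute_error

def sanity_check(current_model, candidate_model, X_sample, y_sample):
  # Score the labelled sample with the current online model and the candidate model
  current_mae = mean_absolute_error(y_sample, current_model.predict(X_sample))
  candidate_mae = mean_absolute_error(y_sample, candidate_model.predict(X_sample))

  # Compare an appropriate regression metric before promoting the candidate
  print(f'Current MAE:   {current_mae:.2f}')
  print(f'Candidate MAE: {candidate_mae:.2f}')
  return candidate_mae <= current_mae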

Even after the latest version of the ML system is released, ongoing monitoring is necessary to account for evolving external environments.

  • Monitoring data drift and model drift: Detect drift conditions through model performance metrics, statistical tests, and adaptive windowing techniques (see the sketch after this list).
  • Tracking upstream producers: Stay informed about changes in upstream processes, and routinely monitor them to ensure they meet a service level objective.
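
As a simple illustration of a statistical test for input drift, the sketch below applies SciPy's two-sample Kolmogorov-Smirnov test to a single numeric feature; the reference and live windows are synthetic placeholders standing in for your own training-time and recent production data.

import numpy as np
from scipy.stats import ks_2samp

# Placeholder windows: feature values at training time vs. recent production data
reference_window = np.random.normal(loc=50, scale=10, size=1000)
live_window = np.random.normal(loc=58, scale=12, size=1000)

# Two-sample Kolmogorov-Smirnov test on the feature distribution
statistic, p_value = ks_2samp(reference_window, live_window)

# Flag potential data drift when the distributions differ significantly
if p_value < 0.05:
  print(f'Potential drift detected (KS statistic={statistic:.3f}, p-value={p_value:.4f})')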

Wrapping it up

We have explored several effective practices that can be implemented to tackle technical debts in developing and deploying ML systems.

  • Use design patterns to create a modular and flexible data processing pipeline that can adapt to unforeseen requirements.
  • Utilize version control to track and manage ML artifacts, such as data and models, ensuring a less messy workflow.
  • Test and monitor the ML system to promptly and smoothly handle changes in the dynamic external world.

Before you go

If you enjoyed reading this, I invite you to follow my Medium page and LinkedIn page. By doing so, you can stay updated with exciting content related to data science side projects, Machine Learning Operations (MLOps) demonstrations, and project management methodologies.

Monitoring Machine Learning Models in Production: Why and How?

Optimizing Your Strategies with Approaches Beyond A/B Testing

