Scale your Machine Learning Projects with SOLID principles
When I was a junior Data Scientist, my goal was to write code that simply worked.
I used to see Python as little more than a tool for running Pandas, NumPy, or Matplotlib. Like everybody else, I started in a Jupyter Notebook, processing the data and training models cell by cell.
I remember my first job in a company.
As the project progressed, the notebook grew, and despite providing explanations with markdowns, the code began to get messy.
The first model was finally trained, its performance evaluated, and the model shipped to production with the developers' help.
However, like any Machine Learning project, deploying a model is not the end of the journey but the beginning…
Several weeks later, I had to start over and revisit the notebook. To be honest, it was almost easier to create a new notebook. Requirements had changed. The code was too messy to attempt any modifications.
Furthermore, shipping the processing algorithm to production was a painful task. Data had to be processed identically across the notebook, in the training pipeline, and in the inference pipeline.
The need to write the code three times meant that any modification in the notebook required corresponding changes in the different pipelines, increasing the likelihood of introducing bugs.
Doing Machine Learning at this time was painful for me.
Until I started to apply software engineering best practices.
My code, my relationship with my colleagues, and my efficiency in delivering ML pipelines improved significantly.
One of those best practices was about using SOLID principles.

Why should you learn to code with SOLID principles?
You probably recognized yourself in my story.
Don't worry—you're not alone.
Throughout my career, I've collaborated with dozens of data scientists. Despite the diversity in their roles and responsibilities, there was one commonality: they were all coders.
Their code was reviewed, deployed, modified, and tracked.
Having the skill to write Clean Code not only eases sharing your work with others but also significantly enhances your efficiency by minimizing the likelihood of introducing errors in your projects.
This is the promise of SOLID principles: make your code less prone to errors and make any modification easy to implement.
My goal with this article is to introduce you to a way to write better code and make you a better developer!
To address those principles, I'll take a common Data Science use case: processing data.
Starting with poorly written code, we will walk through improving the code step by step by using best practices. I'll address each principle one at a time, leading to the final refined code.
Here are the 5 SOLID Principles:
- Single Responsibility Principle
- Open/Closed Principle
- Liskov Substitution Principle
- Interface Segregation Principle
- Dependency Inversion Principle
Stick with me, you'll become a better developer after reading this article!
Start writing code that scales. Start using SOLID Principles.
Use case
To illustrate how to use SOLID Principles in your code, let's study a typical case in the Data Scientist job: pre-processing a dataset.
We create a simple dataframe containing 3 features: a numerical one, a categorical one, and one with missing values.
Raw data:
   feature_a feature_b  feature_c
0          1         a        0.0
1          2         a        0.0
2          3         b        NaN
3          4         b        1.0
4          5         c        1.0
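If you want to follow along, here is a minimal sketch of how such a dataset could be created (the file path simply matches the one used in the code below):

import numpy as np
import pandas as pd

# Illustrative construction of the example dataset shown above
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],               # numerical
    "feature_b": ["a", "a", "b", "b", "c"],     # categorical
    "feature_c": [0.0, 0.0, np.nan, 1.0, 1.0],  # contains a missing value
})
df.to_parquet("data/data.parquet")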
Let's say we need to preprocess this dataset to train a Machine Learning model such as linear regression. Using Pandas, the code would look like this:
import logging

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

logging.basicConfig(level=logging.INFO)

def process(path: str, output_path: str) -> None:
    """Load the raw data, preprocess it, and save the result as a parquet file."""
    df = pd.read_parquet(path)
    logging.info(f"Data: {df}")

    # Standardization
    std = np.std(df["feature_a"])
    mean = np.mean(df["feature_a"])
    standardized_feature = (df["feature_a"] - mean) / std

    # Categorical values
    encoder = LabelEncoder()
    encoded_feature = pd.Series(
        encoder.fit_transform(df["feature_b"]), index=df.index, name="feature_b"
    )

    # Missing values
    filled_feature = df["feature_c"].fillna(-1)

    processed_df = pd.concat(
        [standardized_feature, encoded_feature, filled_feature],
        axis=1
    )
    logging.info(f"Processed data: {processed_df}")
    processed_df.to_parquet(output_path)

def main():
    path = "data/data.parquet"
    output_path = "data/preprocessed_data.parquet"
    process(path, output_path)

if __name__ == "__main__":
    main()
As the code suggests, the raw data is loaded, processed, and saved as a parquet file. After being processed, it looks like this:
INFO:root:Processed data:
   feature_a  feature_b  feature_c
0  -1.414214          0        0.0
1  -0.707107          0        0.0
2   0.000000          1       -1.0
3   0.707107          1        1.0
4   1.414214          2        1.0
Nothing fancy, you would say. This is the typical code any data scientist would likely write in a Jupyter notebook.
But this code is poorly written for 3 reasons:
- The way the data is processed cannot be changed without modifying the function process(). This increases the risk of adding bugs later in the project.
- The function is hard to test. One could write a unit test for process(), but if the code changes, the test function needs to change as well.
- This code is not reusable. If another dataset needs to be processed, a new function has to be developed from scratch.
SOLID Principles solve these 3 problems. So let's jump right into it!
Single Responsibility Principle
Robert C. Martin, also known as Uncle Bob, is a prominent figure in software development and the author of the well-known book Clean Code. He expressed the Single Responsibility Principle as follows:
A class should have only one reason to change.
If we consider our code, we can apply this principle by dividing process() into several functions, each with a single responsibility.
Python">from typing import List
import logging
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
logging.basicConfig(level=logging.INFO)
def process(self, path: str, output_path: str) -> None:
df = load_data(path)
logging.info(f"Raw data: {df}")
normalized_df = normalize(df["feature_a"])
encoded_df = encode(df["feature_b"])
filled_df = fill_na(df["feature_c"], value=-1)
processed_df = pd.concat(
[normalized_df, encoded_df, filled_df],
axis=1
)
logging.info(f"Processed df: {processed_df}")
save_data(df=processed_df, path=output_path)
def standardize(df: pd.DataFrame) -> pd.DataFrame:
std = np.std(df)
mean = np.mean(df)
return (features - mean) / std
def encode(df: pd.DataFrame) -> pd.DataFrame:
encoder = LabelEncoder()
encoder.fit_transform(features)
array = np.atleast_2d(array) # Transform array into 2D from 1D or 2D arrays
processed_df = pd.DataFrame({name: data for name, data in zip(df.columns, array)})
return processed_df
def fill_na(df: pd.DataFrame, value: int = -1) -> pd.DataFrame:
return df.fillna(value=self.value)
def load_data(self, path: str) -> pd.DataFrame:
return pd.read_parquet(path)
def save_data(self, df: pd.DataFrame, path: str) -> None:
df.to_parquet(path)
That's already better!
Each stage of the data processing is represented as a function with one responsibility.
Any modification to the overall process now affects only a small part of it. It is also far easier to test each function with unit tests, making the project more robust to change.
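For example, here is a minimal pytest-style sketch of such a test (the test name and values are purely illustrative, assuming fill_na() is importable from the module above):

import pandas as pd

def test_fill_na_replaces_missing_values():
    feature = pd.Series([0.0, None, 1.0], name="feature_c")
    result = fill_na(feature, value=-1)
    assert result.tolist() == [0.0, -1.0, 1.0]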
But what if you would like the process to handle CSV files in addition to Parquet?
You would need to modify the load_data() function to handle both file types.
import os

def load_data(path: str) -> pd.DataFrame:
    extension = os.path.splitext(path)[-1]
    if extension == ".csv":
        return pd.read_csv(path)
    elif extension == ".parquet":
        return pd.read_parquet(path)
    else:
        raise ValueError(f"File type {extension} not handled.")
But this is bad practice: we had to modify an existing function, increasing the likelihood of introducing bugs.
That's where the second principle comes into play.
Open/Closed Principle
This principle was introduced by Bertrand Meyer. It says:
Software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification.
If we take our previous example, instead of modifying load_data(), we could create two functions, load_csv() and load_parquet(). This way, we don't touch the existing load_data(); it is simply renamed load_parquet().
def load_csv(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def load_parquet(path: str) -> pd.DataFrame:
    return pd.read_parquet(path)
The Open/Closed Principle is respected!
However, you will notice that if we add new features, such as loading CSV files, we still need to wire those new functions into process(). Thus, we still break the Open/Closed Principle…
To address this issue, let's dive straight into the third principle and begin writing better code!
Liskov Substitution Principle
The Liskov Substitution principle was initially introduced by Barbara Liskov and says:
Subtypes must be substitutable for their base types without breaking the program.
In other words, modules with the same behavior should inherit from a common base, and wherever that base appears in the code, any of its subtypes can stand in for it.
This is where Python classes become handy!
I have been waiting to introduce this 3rd principle to start using Python classes to rewrite our code. Let's now apply the first 3 principles using the power of Object-Oriented Programming!
Let's rewrite the load_parquet() and load_csv() functions as classes that inherit from a common base class, DataLoader.
from abc import ABC, abstractmethod

class DataLoader(ABC):
    @abstractmethod
    def load_data(self, path: str) -> pd.DataFrame:
        pass

class ParquetDataLoader(DataLoader):
    def load_data(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(path)

class CSVDataLoader(DataLoader):
    def load_data(self, path: str) -> pd.DataFrame:
        return pd.read_csv(path)
This code snippet respects the 1st and 2nd SOLID principles while using Python's Object-Oriented Programming capabilities.
In Python, a class meant to serve as a base should inherit from ABC. Combined with the @abstractmethod decorator, this prevents the base class from being instantiated directly and forces subclasses to override the abstract method, while also communicating the hierarchical structure of the code to other developers.
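A quick sketch of what that enforcement looks like in practice:

try:
    DataLoader()  # abstract class: load_data() is not implemented here
except TypeError as error:
    print(error)  # e.g. "Can't instantiate abstract class DataLoader ..."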
Now, let's say we want to add a new way to load data, such as JSON. Do you have the solution?
Exactly! We can add another class that inherits from DataLoader and, in the same way, ensure that the Open/Closed Principle is respected.
class JSONDataLoader(DataLoader):
    def load_data(self, path: str) -> pd.DataFrame:
        return pd.read_json(path)
We can do the same for the other data transformations, such as standardize(), encode(), and fill_na().
from typing import List

class FeatureProcessor(ABC):
    def __init__(self, feature_names: List[str]) -> None:
        # The features to process are stored on the base class
        self.feature_names = feature_names

    @abstractmethod
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        pass

class Standardizer(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        std = np.std(features)
        mean = np.mean(features)
        return (features - mean) / std

class Encoder(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        encoder = LabelEncoder()
        # Reshape to (n_samples, n_features) so both 1D and 2D encoder outputs are handled
        array = np.asarray(encoder.fit_transform(features)).reshape(len(features), -1)
        return pd.DataFrame(array, columns=features.columns, index=features.index)

class NaFiller(FeatureProcessor):
    def __init__(self, feature_names: List[str], value: int = -1) -> None:
        self.value = value
        super().__init__(feature_names)

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        return features.fillna(value=self.value)
For instance, if we want to add a new way to process numerical values, we just have to create a new FeatureProcessor module, and the Open/Closed Principle remains respected.
class Normalizer(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        minimum = features.min()
        maximum = features.max()
        return (features - minimum) / (maximum - minimum)
Now that the code is organized into modules, let's apply the Liskov Substitution Principle to the process() function and rewrite it in our modular style.
class DataProcessor:
    # __init__ respects the Liskov Substitution Principle:
    # it depends only on the base classes, not on concrete implementations.
    def __init__(
        self,
        feature_processors: List[FeatureProcessor],
        data_loader: DataLoader,
        data_saver: DataSaver
    ) -> None:
        self.feature_processors = feature_processors
        self.data_loader = data_loader
        self.data_saver = data_saver

    def process(self, path: str, output_path: str) -> None:
        df = self.data_loader.load_data(path)
        logging.info(f"Raw data: {df}")
        processed_df = pd.concat(
            [feature_processor.process(df) for feature_processor in self.feature_processors],
            axis=1
        )
        self.data_saver.save_data(df=processed_df, path=output_path)
        logging.info(f"Processed df: {processed_df}")
As you can see, DataProcessor takes the base classes FeatureProcessor, DataLoader, and DataSaver as inputs, removing any dependency on the concrete implementation being used. This way, if we need to modify the behavior, we simply swap modules when initializing DataProcessor.
processor = DataProcessor(
    feature_processors=[
        Normalizer(feature_names=["feature_a"]),
        # Standardizer(...),
        Encoder(feature_names=["feature_b"]),
        NaFiller(feature_names=["feature_c"], value=-1)
    ],
    data_loader=CSVDataLoader(),
    data_saver=ParquetDataSaver()
)
processor.process(
    path="data/data.csv",
    output_path="data/preprocessed_data.parquet"
)
Good job! You've made your code modular: open for extension but closed for modification!
But as you may have noticed, there are still a few things we can improve…
Interface Segregation Principle
Robert C. Martin (again) formulated another of the SOLID principles:
Clients should not be forced to depend upon methods that they do not use. Interfaces belong to clients, not to hierarchies.
In other words, a class shouldn't inherit methods (or attributes) that are not used. Those methods should be associated with appropriate classes instead.
In our example, we made sure that every module plugged into DataProcessor carries only the methods and attributes necessary to the algorithm.
But let's say another data processing module, used in another part of our project, requires a different version of normalization such as mean normalization.
As it is a normalization method, we could create a subclass of Normalizer called MeanNormalizer, which will handle this specific processing case.
class MeanNormalizer(Normalizer):
    def normalize(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        minimum = features.min()
        maximum = features.max()
        mean = features.mean()
        return (features - mean) / (maximum - minimum)
In this case, we called the method normalize(). However, once instantiated, MeanNormalizer not only carries the method normalize() but also all the methods of its parent classes, such as process(), which is not used in that other part of the program.
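A small sketch of the problem (df being any dataframe that contains feature_a):

mean_normalizer = MeanNormalizer(feature_names=["feature_a"])

normalized = mean_normalizer.normalize(df)  # what this client actually needs
unused = mean_normalizer.process(df)        # inherited min-max normalization, never used here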
We just broke the Interface Segregation principle.
To avoid this, we can create a new module that inherits directly from FeatureProcessor instead of Normalizer, reducing the depth of the hierarchy in our code base.
class MeanNormalizer(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        minimum = features.min()
        maximum = features.max()
        mean = features.mean()
        return (features - mean) / (maximum - minimum)
In the book The Pragmatic Programmer, the authors recommend keeping the module hierarchy as shallow as possible to avoid cluttering the code with overly specific details.
Dependency Inversion Principle
The Dependency Inversion Principle states that:
Abstractions should not depend upon details. Details should depend upon abstractions.
Let's say you want to use another categorical encoder, such as OrdinalEncoder instead of LabelEncoder from the scikit-learn library.
To respect the other SOLID principles, we could create new modules such as LabelEncoderProcessor and OrdinalEncoderProcessor.
class LabelEncoderProcessor(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        encoder = LabelEncoder()  # <<<<
        array = np.asarray(encoder.fit_transform(features)).reshape(len(features), -1)
        return pd.DataFrame(array, columns=features.columns, index=features.index)

class OrdinalEncoderProcessor(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        encoder = OrdinalEncoder()  # <<<<
        array = np.asarray(encoder.fit_transform(features)).reshape(len(features), -1)
        return pd.DataFrame(array, columns=features.columns, index=features.index)
It works. But it is not pretty at all! It doesn't even respect another programming principle: DRY, Don't Repeat Yourself!
The best solution here is to remove the dependency on a specific scikit-learn encoder by passing it as a parameter of the Encoder class instead of hard-coding it inside the method.
This is the Dependency Inversion Principle.
We use the TransformerMixin base class as the abstract representation of any scikit-learn encoder, following the Liskov Substitution Principle.
from sklearn.base import TransformerMixin

class Encoder(FeatureProcessor):
    def __init__(self, encoder: TransformerMixin, feature_names: List[str]) -> None:
        self.encoder = encoder
        super().__init__(feature_names)

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        # Reshape to (n_samples, n_features) so both 1D and 2D encoder outputs are handled
        array = np.asarray(self.encoder.fit_transform(features)).reshape(len(features), -1)
        return pd.DataFrame(array, columns=features.columns, index=features.index)
This way, Encoder no longer depends on which encoder is used, as long as it is a scikit-learn TransformerMixin exposing the fit_transform() method.
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

processor = DataProcessor(
    feature_processors=[
        Normalizer(feature_names=["feature_a"]),
        Encoder(encoder=LabelEncoder(), feature_names=["feature_b"]),
        # Or Encoder(encoder=OrdinalEncoder(), ...)
        ...
Final code
After implementing the 5 principles, our code looks like this:
from abc import ABC, abstractmethod
from typing import List
import logging

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

logging.basicConfig(level=logging.INFO)

class FeatureProcessor(ABC):
    def __init__(self, feature_names: List[str]) -> None:
        self.feature_names = feature_names

    @abstractmethod
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        pass

class Standardizer(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        std = np.std(features)
        mean = np.mean(features)
        return (features - mean) / std

class Normalizer(FeatureProcessor):
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        minimum = features.min()
        maximum = features.max()
        return (features - minimum) / (maximum - minimum)

class Encoder(FeatureProcessor):
    def __init__(self, encoder: TransformerMixin, feature_names: List[str]) -> None:
        self.encoder = encoder
        super().__init__(feature_names)

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        # Reshape to (n_samples, n_features) so both 1D and 2D encoder outputs are handled
        array = np.asarray(self.encoder.fit_transform(features)).reshape(len(features), -1)
        return pd.DataFrame(array, columns=features.columns, index=features.index)

class NaFiller(FeatureProcessor):
    def __init__(self, feature_names: List[str], value: int = -1) -> None:
        self.value = value
        super().__init__(feature_names)

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        features = df[self.feature_names]
        return features.fillna(value=self.value)

class DataLoader(ABC):
    @abstractmethod
    def load_data(self, path: str) -> pd.DataFrame:
        pass

class ParquetDataLoader(DataLoader):
    def load_data(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(path)

class DataSaver(ABC):
    @abstractmethod
    def save_data(self, df: pd.DataFrame, path: str) -> None:
        pass

class ParquetDataSaver(DataSaver):
    def save_data(self, df: pd.DataFrame, path: str) -> None:
        df.to_parquet(path)

class DataProcessor:
    def __init__(
        self,
        feature_processors: List[FeatureProcessor],
        data_loader: DataLoader,
        data_saver: DataSaver
    ) -> None:
        self.feature_processors = feature_processors
        self.data_loader = data_loader
        self.data_saver = data_saver

    def process(self, path: str, output_path: str) -> None:
        df = self.data_loader.load_data(path)
        logging.info(f"Raw data: {df}")
        processed_df = pd.concat(
            [feature_processor.process(df) for feature_processor in self.feature_processors],
            axis=1
        )
        self.data_saver.save_data(df=processed_df, path=output_path)
        logging.info(f"Processed df: {processed_df}")
Our code has become highly modular, open to extension, and it guards against many of the bugs a developer could introduce as the project evolves.
Congrats!
Conclusions
In this article, you learned the 5 SOLID Principles and how to apply them to improve your code.
IT projects naturally evolve, with new requirements continuously added, so code quickly becomes outdated as a project develops.
Learning how to craft code designed for extension is therefore essential, and it also accelerates your work once the infrastructure is in place.
However, there are a few points I would like to share before you start implementing the SOLID principles in all of your projects.
First, engineering your code following these principles makes it highly flexible and scalable, but it requires time to architect it.
Before applying these principles to a project, ask yourself whether it is worth the effort.
SOLID principles are designed for scale and efficiency, and they act as a guideline for other developers working on the same project.
But much of a Data Scientist's work is exploratory and lives in a self-contained environment such as a Jupyter notebook. Code that never touches the production environment often doesn't warrant this level of engineering.
Second, when applying the SOLID principles it is easy to get lost in deep inheritance hierarchies in object-oriented languages like Python (a subclass of a subclass of a subclass, and so on).
The Pragmatic Programmer discourages this approach, advising not to extend beyond a base class and a subclass. The 4th SOLID principle, the Interface Segregation Principle, implicitly discourages this practice as well.
To conclude, those principles are powerful and enable engineers like you to work cohesively with teams and accelerate your coding.
I remember the first time I used the SOLID principles in a project.
It took me some time to architect the code from scratch.
But later on, once everything was set up, my efficiency in developing new features went through the roof. Not because I became better at programming, but because new features were easy to implement.
I hope these principles will help you and guide you through your journey in the tech industry.
Happy coding!
If you liked this article, feel free to join my newsletter. I share my content about Machine Learning, MLOps, and entrepreneurship.
You can reach out to me on Linkedin, or check my Github.
If you're a business and want to implement Machine Learning into your product, you can also book a call!
References
- The Pragmatic Programmer, by David Thomas and Andrew Hunt
- Clean Code, by Robert C. Martin