Stop Overusing Scikit-Learn and Try OR-Tools Instead

Author:Murphy | View: 28140 | Time: 2025-03-22 23:07:49

Do you want my hot take on the state of Data Science in 2024?

Data Scientists are too obsessed with machine learning.

To someone with a hammer, every problem looks like a nail; to the modern Data Scientist, every problem apparently looks like a Machine Learning problem. We've become so good at translating problems into the language of analytics and ML that we sometimes forget there are other data-scientific approaches out there. And this is a massive shame.

In this article, I will introduce another branch of Data Science – Mathematical Optimisation (specifically, Constraint Programming)— and show how it can add value to your career as a Data Scientist.

If you've not got a strong Maths background, please don't be put off by the name. I didn't study Maths at university either, yet I found it surprisingly easy to get started with Mathematical Optimisation techniques thanks to Google's open-source Python library OR-Tools, which I'll introduce in this beginner-friendly article.

If you want to expand your Data Science toolkit and learn this high-demand skill, sit down and buckle up.

Optimisation 101 for Data Scientists

Optimisation is a suite of techniques for "find[ing] the best solution to a problem out of a very large set of possible solutions" (source: Google Developers).

Sometimes, that means finding the optimal solution to a problem; at other times, it just means finding all the feasible solutions. There are lots of situations where you'll encounter these types of problems, for example:

Imagine that you're working in the Data Science team at your local Amazon warehouse. There are 100 packages to deliver, 3 delivery drivers, and all the deliveries must be made within a 2-hour window. This is an example of an optimisation problem, where you need to find the optimal schedule/route for each of the drivers.
Or imagine that you're a teacher who's planning a group activity for your 10 students. You need to split the students into 3 groups, but there are a few annoying constraints ** which mean that you can't just split them up randomly: (1) Timmy can't be in the same group as Jimmy, (2) Billy _mus_t be in the same group as Willy, and (3) each group must contain at least one of Mickey, Dickie, Ricky and Vicky. This is an example of a constraint programmin**g problem, where you need to find a viable grouping which satisfies all those constraints.

These problems are examples of optimisation problems because there are hundreds/thousands/millions of possible solutions, but we need to narrow down to the feasible/optimal one(s).

Optimisation problems are like looking for a needle in a haystack— Image by Leshaesvan on Unsplash

Allocating students to groups

To explain how this works, I'll walk through the constraint Programming example I described above – split 10 students across 3 groups – but without the ridiculous names. As a reminder, the (non-ridiculous) conditions are:

Students 1 and 2 cannot be in the same group
Student 3 must be in the same group as Student 8
Each group must contain at least one of the following students: 7, 8, 9, 10

There are also a few common-sense constraints:

Each group must have a minimum of 3 people in it
Each student must be added to exactly 1 group (no more, no less)

Together, these make for quite complex constraints which are tough to solve with mental maths. Luckily, we can use Constraint Programming to help us find a possible solution. To implement it, we'll use OR-Tools, an open-source library released by Google.

Setup

First, let's install the OR-Tools Python library using pip:

pip install ortools

Next import the constraint programming (cp) module and initialise a CP Model:

from ortools.sat.python import cp_model
model = cp_model.CpModel()

Add the decision variables and constraints

To solve our constraint programming / allocation problem, we need to write the constraints as a series of linear equations.

This might sound fancy, but a linear equation is just an equation with the format:

ax + b = c

You might have encountered linear equations at school, e.g., through questions like:

Find the value of x given that: x + 3y = 10 2y + w = 5 w = 3

In our case, we can write each of our conditions as a series of linear equations. I warn you: at first, this might feel a bit like mental gymnastics. But I promise that it will make sense soon!

Let's go through them one by one.

Constraint 1: Each student must be added to one group

Writing this constraint as a linear equation is actually going to be pretty simple. For Student 1, for instance, we can write the equation as:

x + y + z = 1

where:

x is a Boolean variable which equals 1 if Student 1 is allocated to Group 1, else 0 if Student 1 is not allocated to Group 1
y is a Boolean variable which equals 1 if Student 1 is allocated to Group 2, else 0
z is a Boolean variable which equals 1 if Student 1 is allocated to Group 3, else 0

In constraint programming, these variables in a linear equation are known as decision variables. To help us keep track of them, we'll give them more informative names (instead of "x", "y" and "z"):

"x" will be written as "student1_group1"
"y" will be written as "student1_group2"
"z" will be written as "student1_group3"

This will give us the linear equation for this constraint for Student 1:

student1_group1 + student1_group2 + student1_group3 = 1

As you can see, this means that if Student 1 is allocated to Group 1 (i.e. if _student1group1 = 1), the other two decision variables must equal 0, and vice versa.

To implement this linear equation in OR-Tools, we need to (1) define the decision variables, and (2) add the constraint to the model:

# Initialise the variables and "add" them to the model
student1_group1 = model.NewBoolVar("student1_group1")
student1_group2 = model.NewBoolVar("student1_group2")
student1_group3 = model.NewBoolVar("student1_group3")

# Add the linear equation (i.e. the constraint) to the model
model.Add(student1_group1 + student1_group2 + student1_group3 == 1)

The nice thing about OR-Tool is that it lets you write linear equations in a straightforward way using model.Add(linear_equation_goes_here).

Next, we'll repeat this for each student, giving us 30 decision variables and 10 linear equations in total:

# Define the remaining decision variables
student2_group1 = model.NewBoolVar("student2_group1")
student2_group2 = model.NewBoolVar("student2_group2")
student2_group3 = model.NewBoolVar("student2_group3")
student3_group1 = model.NewBoolVar("student3_group1")
student3_group2 = model.NewBoolVar("student3_group2")
student3_group3 = model.NewBoolVar("student3_group3")
student4_group1 = model.NewBoolVar("student4_group1")
student4_group2 = model.NewBoolVar("student4_group2")
student4_group3 = model.NewBoolVar("student4_group3")
student5_group1 = model.NewBoolVar("student5_group1")
student5_group2 = model.NewBoolVar("student5_group2")
student5_group3 = model.NewBoolVar("student5_group3")
student6_group1 = model.NewBoolVar("student6_group1")
student6_group2 = model.NewBoolVar("student6_group2")
student6_group3 = model.NewBoolVar("student6_group3")
student7_group1 = model.NewBoolVar("student7_group1")
student7_group2 = model.NewBoolVar("student7_group2")
student7_group3 = model.NewBoolVar("student7_group3")
student8_group1 = model.NewBoolVar("student8_group1")
student8_group2 = model.NewBoolVar("student8_group2")
student8_group3 = model.NewBoolVar("student8_group3")
student9_group1 = model.NewBoolVar("student9_group1")
student9_group2 = model.NewBoolVar("student9_group2")
student9_group3 = model.NewBoolVar("student9_group3")
student10_group1 = model.NewBoolVar("student10_group1")
student10_group2 = model.NewBoolVar("student10_group2")
student10_group3 = model.NewBoolVar("student10_group3")

# Add the remaining constraints
model.Add(student2_group1 + student2_group2 + student2_group3 == 1)
model.Add(student3_group1 + student3_group2 + student3_group3 == 1)
model.Add(student4_group1 + student4_group2 + student4_group3 == 1)
model.Add(student5_group1 + student5_group2 + student5_group3 == 1)
model.Add(student6_group1 + student6_group2 + student6_group3 == 1)
model.Add(student7_group1 + student7_group2 + student7_group3 == 1)
model.Add(student8_group1 + student8_group2 + student8_group3 == 1)
model.Add(student9_group1 + student9_group2 + student9_group3 == 1)
model.Add(student10_group1 + student10_group2 + student10_group3 == 1)

(There are more efficient ways of writing this, but I won't cover those now because it makes the code less understandable for beginners).

Constraint 2: Each group must have a minimum of 3 people in it

To express this constraint as a linear equation, we need to collect the 10 decision variables related to each group and ensure that they sum to at least 3.

For Group 1, for example, we need to make sure that the 10 Boolean decision variables related to Group 1 (student1_group1, student2_group1, … , student10_group1) sum to at least 3, i.e.:

student1_group1 + student2_group1 + student3_group1 + student4_group1 + student5_group1 + student6_group1 + student7_group1 + student8_group1 + student9_group1 + student10_group1 >= 3

Using the same syntax as before, we can write this constraint as a series of 3 linear equations (one for each Group):

# Each group must have a minimum of 3 in it
model.Add(student1_group1 + student2_group1 + student3_group1 + student4_group1 + student5_group1 + student6_group1 + student7_group1 + student8_group1 + student9_group1 + student10_group1 >= 3)
model.Add(student1_group2 + student2_group2 + student3_group2 + student4_group2 + student5_group2 + student6_group2 + student7_group2 + student8_group2 + student9_group2 + student10_group2 >= 3)
model.Add(student1_group3 + student2_group3 + student3_group3 + student4_group3 + student5_group3 + student6_group3 + student7_group3 + student8_group3 + student9_group3 + student10_group3 >= 3)

Constraint 3: Students 1 and 2 cannot be in the same group

This one's a little simpler: we just need to ensure that student1_group1 and student2_group1 cannot both be equal to 1 at the same time – if either one of the equals 1, the other must equal 0 – and so on for Groups 2 and 3:

# Students 1 and 2 cannot be in the same group
model.Add(student1_group1 + student2_group1 <= 1)
model.Add(student1_group2 + student2_group2 <= 1)
model.Add(student1_group3 + student2_group3 <= 1)

Constraint 4: Each group must contain at least one of the following students: 7, 8, 9, 10

To write linear equations that satisfy this constraint, we'll first need to define a few more decision variables:

# Define three new decision variables
x1 = model.NewBoolVar('x1')  # 1 if student3_group1 + student8_group1 == 2, else 0
x2 = model.NewBoolVar('x2')  # 1 if student3_group2 + student8_group2 == 2, else 0
x3 = model.NewBoolVar('x3')  # 1 if student3_group3 + student8_group3 == 2, else 0

… and then add the constraints:

# Students 3 and 8 must be in the same group
model.Add(student3_group1 + student8_group1 == 2).OnlyEnforceIf(x1)
model.Add(student3_group2 + student8_group2 == 2).OnlyEnforceIf(x2)
model.Add(student3_group3 + student8_group3 == 2).OnlyEnforceIf(x3)

# Only one of x1, x2, x3 can be true (exclusive or)
model.Add(x1 + x2 + x3 == 1)

Notice that we use the .OnlyEnforceIf() method to specify that the conditions should only be applied if the corresponding decision variable is equal to 1.

Solve

You'll be relieved to hear that the hard bit is over!

Now that we've defined our decision variables and translated our constraints into a series of linear equations, we can instruct OR-Tools to solve the problem…

solver = cp_model.CpSolver()
status = solver.Solve(model)

… and show one of the solutions:

# Show the first solution
if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:

    # Show which group each student has been allocated to
    print("nnSummary:n")
    print(f"Student 1: ", show_group(student1_group1, student1_group2, student1_group3))
    print(f"Student 2: ", show_group(student2_group1, student2_group2, student2_group3))
    print(f"Student 3: ", show_group(student3_group1, student3_group2, student3_group3))
    print(f"Student 4: ", show_group(student4_group1, student4_group2, student4_group3))
    print(f"Student 5: ", show_group(student5_group1, student5_group2, student5_group3))
    print(f"Student 6: ", show_group(student6_group1, student6_group2, student6_group3))
    print(f"Student 7: ", show_group(student7_group1, student7_group2, student7_group3))
    print(f"Student 8: ", show_group(student8_group1, student8_group2, student8_group3))
    print(f"Student 9: ", show_group(student9_group1, student9_group2, student9_group3))
    print(f"Student 10: ", show_group(student10_group1, student10_group2, student10_group3))

else:
    print("No solution found.")

# Result
# Student 1:  Group 3
# Student 2:  Group 2
# Student 3:  Group 3
# Student 4:  Group 2
# Student 5:  Group 1
# Student 6:  Group 1
# Student 7:  Group 2
# Student 8:  Group 3
# Student 9:  Group 1
# Student 10:  Group 1

There you have it! Not too bad, eh?

Show all solutions

The solution printed out above is but one of many feasible solutions. Using OR-Tools, we can find all the solutions:

class VarArraySolutionPrinter(cp_model.CpSolverSolutionCallback):
    """
    Find all solutions.
    """

    def __init__(self, variables):
        cp_model.CpSolverSolutionCallback.__init__(self)
        self.__variables = variables
        self.__solution_count = 0

    def on_solution_callback(self):
        self.__solution_count += 1

        # Uncomment the next 3 lines if you want to print out the other solutions
        # for v in self.__variables:
        #     print(f"{v}={self.Value(v)}", end=" ")
        # print()

    def solution_count(self):
        return self.__solution_count

solver = cp_model.CpSolver()
solution_printer = VarArraySolutionPrinter([
    student1_group1, student1_group2, student1_group3,
    student2_group1, student2_group2, student2_group3,
    student3_group1, student3_group2, student3_group3,
    student4_group1, student4_group2, student4_group3,
    student5_group1, student5_group2, student5_group3,
    student6_group1, student6_group2, student6_group3,
    student7_group1, student7_group2, student7_group3,
    student8_group1, student8_group2, student8_group3,
    student9_group1, student9_group2, student9_group3,
    student10_group1, student10_group2, student10_group3
])

# Enumerate all solutions
solver.parameters.enumerate_all_solutions = True

# Solve
status = solver.Solve(model, solution_printer)

# Count the total number of solutions
num_solutions = solution_printer.solution_count()
print(f"There are {num_solutions} feasible solutions.")

# Result
# There are 1764 feasible solutions.

Where are all the Optimisers?

Optimisation is nothing new. Lots of Data Science teams already make use of optimisation techniques, and there are entire conferences and libraries dedicated to it. But for some reason, optimisation isn't commonly considered "canon" in the same way that Analytics and ML are, so it often gets unfairly neglected.

I think this is a great shame, and I'd love to see more Data Scientists making use of these tools. I hope you've found this article helpful, and let me know if you have any cool use cases you'd like to share!

One more thing –

I run theSQLgym and also write a free newsletter called AI in Five where I share 5 bullet points each week on the latest AI news, coding tips and career stories for Data Scientists/Analysts. Subscribe here if that sounds up your street!

Thanks for reading, and feel free to connect with me on Twitter or LinkedIn!

Tags: Data Science Getting Started Machine Learning Optimization Programming