5 Project Management Frameworks you can use in the context of Machine Learning

Photo by joszczepanska @ Unsplash.com

Ok, I admit it: project management is not one of the most fun concepts in Machine Learning and Data Science, particularly for technical users. But when it comes to reasons why ML projects fail, bad project management will likely be in the top 5, or at least influence some of the other reasons that show up in the top 5 (poor stakeholder management, not setting time aside for deployment, etc.).

Why is project management so hard in the context of Machine Learning and Data Science? The main reason is that there is normally a lot of uncertainty tied to the outcome of a project – some projects have a high intersection with research, and that grey area requires a lot of experimentation. This may enter into direct conflict with the majority of project management frameworks, which rely on strict timelines to keep projects within time and budget.

In this blog post, we are going to explore 5 famous project management frameworks that can be applied to the world of Data Science and Machine Learning and their main pros and cons. We'll start with frameworks that are more related to the software engineering and product world and narrow down to frameworks tailored to Data Science and Machine Learning.

Let's go!


Agile

Representation of an Agile Meetup, generated by AI – Image by Microsoft Designer

Agile is one of the most famous project management frameworks in the world. With its clear and straightforward Manifesto, Agile is more than a project management framework; it is a motto for software development.

It's normally rooted in 12 principles:

  1. Customer satisfaction by early and continuous delivery of valuable software.
  2. Welcome changing requirements.
  3. Deliver working software frequently.
  4. Close, daily cooperation between business users and developers.
  5. Projects are built around motivated individuals.
  6. Face-to-face conversation is the best form of communication.
  7. Working software is the primary measure of success and progress.
  8. Sustainable development.
  9. Continuous attention to technical excellence.
  10. Simplicity – the art of maximizing the amount of work not done – is essential.
  11. Best architectures, requirements, and designs emerge from self-organizing teams.
  12. Regularly, the team reflects on how to become more effective, and adjusts its software and methods accordingly.

With these principles in mind, several implementations of Agile project management emerged. In most of them, a few techniques and organization methods are standard:

  • 2-to-4-week development sprints with clear deliverables, while gathering user feedback along the way.
  • Quality focus – high quality of the technical delivery is guaranteed through continuous software usage and improvement.
  • Fail-often and fail-early principles.
  • Development sprints are open to change and are readjusted based on results from previous sprints and user feedback.

In the context of ML, Agile has been used extensively, because its flexibility tends to align well with model deliverables. Particularly in consulting gigs, where data scientists need to deliver results with a limited budget or timeline, Agile has been a friend to lean on.

In this context, one of the techniques I've been using is sprinting to a baseline model and getting the model's output in front of users as the first priority of Agile-based ML projects. When data scientists and machine learning engineers work under Agile principles, I've found that leaving thorough feature engineering and preprocessing for later sprints aligns better with the framework.
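As a rough illustration of what "sprinting to a baseline" can look like, here is a minimal sketch in Python: light default preprocessing, a simple model, and one headline metric to put in front of users at the end of the first sprint. The file name, column names and metric are illustrative assumptions, not a prescription.

```python
# Minimal baseline sketch for an early Agile sprint: default preprocessing,
# a simple model, and a single headline metric to show users.
# The file path, column names and metric below are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("customer_churn.csv")                # hypothetical dataset
X, y = df.drop(columns=["churned"]), df["churned"]    # hypothetical target column

numeric = X.select_dtypes(include="number").columns
categorical = X.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

baseline = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
baseline.fit(X_train, y_train)
print("Baseline ROC AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
```

The point is not this specific model: it is getting a working end-to-end result in front of users early, and leaving heavier feature engineering for later sprints.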

In a nutshell, in the context of ML and DS, Agile has several positive aspects:

  • The framework accommodates quick changes, something that may happen if you need to add more data, build new features or experiment with other models.
  • Users are involved quite early in the machine learning development, something that tends to lead to better results.
  • Data scientists are incentivized to have critical meetings with the business mid-project and avoid "tunnel-vision" development.

Agile is widely used in different industries and it definitely fits well in the Machine Learning world! Particularly in cases where the problem is complex and there is a lot of uncertainty about how to solve it, Agile is a good friend, as it is also understood by other data teams that tend to work in an Agile setting.


Waterfall

Representation of the Waterfall Methodology, generated by AI – Image by Microsoft Designer

In the waterfall method, projects are planned up front and each project phase cascades sequentially through the development process.

Normally leaving little room for changes after the initial planning, the waterfall method typically encapsulates the following stages:

  • Requirements
  • Design
  • Implementation
  • Verification & Testing
  • Deployment & Maintenance

It focuses extensively on documentation and, when using it, project managers assume there will be minimal last-minute changes to the software during development. This project management framework is relevant for:

  • Projects that have a lot of unambiguous requirements;
  • Projects where there is a low likelihood that stakeholders will want to keep changing the scope – something that is becoming rare as the world gets more complex;

Some might say that the waterfall method is getting a bit out of fashion for the fast-paced world we live in today, although it is still used throughout companies, particularly companies with strict requirements that need extensive validation from a lot of stakeholders under controlled budgets. As a rough generalization, startup companies have been leaning on Agile methodologies, while traditional companies still widely adopt waterfall methodologies.

In the context of machine learning and data science, it tends to be applied in projects that are highly replicable and where the team running them is already pretty familiar with the problem to solve.


SEMMA

Waterfall and Agile are also used by job roles that have nothing to do with ML; in particular, these methodologies have been used by software engineers, product managers and others for a long time.

Now we'll see the first project management methodology on this list that was specifically tailored for Machine Learning and Data Science. Conceptualized by the SAS Institute, it is one of the first methodologies built specifically around analytical models.

The SEMMA approach consists of 5 steps that go through the typical machine learning model development:

  • Sample – generate a sample and create the analytical base table for your ML models
  • Explore – visualize the data and check its descriptive statistics
  • Modify – treat outliers, preprocess the data and prepare it for the modelling phase
  • Model – use a statistical or machine learning model to predict the target
  • Assess – check the performance of your model on unseen data

This methodology is designed around iterating on each model-development phase – it was also specially targeted at SAS Enterprise Miner, a drag-and-drop tool by SAS.

Although it is considered an iterative framework, giving you room to go back and change some of the processes in each phase, it disregards business requirements gathering and model deployment, phases that have been gaining prominence in recent years.

The positive side of SEMMA is that it created a backbone project flow that is common to most ML model development. Going through it can give you ideas on how to allocate time across the phases, as data scientists typically spend most of their time in the Sample, Explore and Modify stages.
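To make the five stages more concrete, below is a rough sketch of how they might map onto a standard Python workflow. SEMMA itself is tool-agnostic (and was originally designed around SAS Enterprise Miner), so the dataset, column names and outlier rule here are purely illustrative assumptions.

```python
# A rough mapping of the five SEMMA stages onto a Python workflow.
# Dataset name, columns and the outlier rule are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Sample: build the analytical base table (here, a simple random sample).
abt = pd.read_csv("transactions.csv").sample(frac=0.1, random_state=0)

# Explore: profile and visualize the data.
print(abt.describe())
abt.hist(figsize=(10, 8))

# Modify: treat outliers and prepare features for modelling.
amount_cap = abt["amount"].quantile(0.99)
abt["amount"] = abt["amount"].clip(upper=amount_cap)
X = pd.get_dummies(abt.drop(columns=["is_fraud"]))
y = abt["is_fraud"]

# Model: fit a model on a training split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Assess: check performance on unseen data.
print(classification_report(y_test, model.predict(X_test)))
```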


CRISP-DM

Representation of CrispDM in Pop Art, generated by AI – Image by Microsoft Designer

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is one of the most widely used methodologies for project management in the context of ML and DS.

It consists of 6 different phases:

  • Business understanding, where a lot of time is spent understanding how a given solution should solve a business or research problem.
  • Data understanding, where data scientists combine the available data with the business or research logic.
  • Data preparation, where preprocessing (outlier removal, data cleaning) happens.
  • Modelling, where machine learning models are built and trained.
  • Evaluation, the phase where the different candidate models are compared against one or more metrics (see the sketch after this list).
  • Deployment, which goes through the details of how to serve models to users.
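
As a small illustration of the Evaluation phase, here is a sketch that compares a few candidate models against a single agreed-upon metric using cross-validation. The candidate models, data file and metric are placeholder assumptions, not part of CRISP-DM itself.

```python
# Sketch of an Evaluation step: compare candidate models against
# one agreed-upon metric before deciding what to deploy.
# The candidate models, data loading and metric are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("prepared_data.csv")              # output of the Data preparation phase
X, y = df.drop(columns=["target"]), df["target"]   # hypothetical target column

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```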

A lot of debate has been going on regarding whether CRISP-DM is an Agile- or Waterfall-style methodology. Although the phases can be seen as waterfall-like, CRISP-DM provides a lot of room for experimentation and iteration between phases. In particular, most people integrate CRISP-DM in the following way:

CRISP-DM as Agile Process – Credit to https://www.datascience-pm.com/crisp-dm-2/

CRISP-DM is extremely popular among data scientists, as a KDnuggets poll suggests. Although the framework is getting a bit old, most machine learning teams still incorporate some portion of CRISP-DM in their development processes.

In the end, CRISP-DM is adaptable and can be used as a great starting baseline for most projects' strategy and scaffold timeline. If you would like to read a bit more about a newer framework, check out bizML, a methodology by Harvard Business Review inspired by CRISP-DM.


KDD

Finally, we reach the end of our list with Knowledge Discovery in Databases (KDD). Not so much a project management framework as a conceptual one, KDD encompasses more than machine learning models and is also used in analytics projects (for example, A/B testing, pure historical analysis, etc.).

The goal of KDD is to set some common tasks, rules and stages for finding useful knowledge in databases. It's a bit more traditional and it can be considered an inspiration for CRISP-DM. It involves five stages:

  1. Selection
  2. Pre-processing
  3. Transformation
  4. Data mining
  5. Interpretation/evaluation.

Its importance is mostly related to how it influenced CRISP-DM, and going through it is especially useful if you are going to develop analytics other than a machine learning model.


Thank you for taking the time to read this post!

Choosing a project management methodology that fits your ML projects is not an easy feat, with so many choices and so much advice available. While most users find CRISP-DM a good fit for their own projects, it's always relevant to consider different options depending on the parameters and questions of your project:

  • How well do you know the problem you are going to solve?
  • How likely will requirements change during development?
  • Can you sprint to a baseline model fast?
  • Are users expecting extensive documentation at the end of the project?
  • Are you developing a POC that doesn't require model serving, or a production-level ML model?

These are some of the questions that you can ask when starting an ML project and that will help you choose between the different approaches available.

In the end, it's all a question of nuance and being aware of the different choices. Also, just like with ML models, ensembles of project management techniques are the way to go!

Tags: Data Analytics Data Science Machine Learning Project Management
