Machine Learning Orchestration vs MLOps

There is a saying I've heard from the ML engineers I've worked with: "most of machine learning operations (MLOps) is just data engineering". There is a blog post that puts the actual percentage at 98%. This is obviously tongue-in-cheek, but I think the sentiment is correct. However, what is meant by MLOps is still a moving target. While there are a lot of components and moving parts that can be considered part of MLOps, this definition from Cristiano Breuel is good enough for what I intend to discuss in this article:
ML Ops is a set of practices that combines Machine Learning, DevOps and Data Engineering, which aims to deploy and maintain ML systems in production reliably and efficiently.

Something I like to add to this definition is that a "good" ML system is:
one that solves a business need and delivers that solution effectively. The focus of the MLOps team should always be on solving the business need.
Like most workflow (or pipeline) based systems, an MLOps system requires an orchestrator. In this context it can be called a Machine Learning Orchestrator (or, for now, an MLOx). An MLOx's job is simply that of an orchestrator: a mechanism that can manage and coordinate complex workflows and processes on a defined schedule. One orchestrator that is often used in MLOps and many other ML-related systems is Apache Airflow; others include Dagster, Prefect, Flyte, and Mage. For this article I will focus on Airflow.
I like to think of an MLOx as being similar to a movie director (or a conductor for an orchestra, but that can be confusing since we would be talking about orchestrating an orchestra). A director works from a script to direct the various processes that deliver the final product: the movie. In the context of an MLOx, the workflow is the equivalent of the script. The role of the orchestrator is to ensure that the various processes in the workflow execute on schedule and in the correct sequence, and that failures are handled appropriately.
However, Airflow adds complexity to this analogy. Airflow also has compute capability, as it can use the environment it runs in to execute any Python code (similarly with Dagster, Prefect, and others). Being extensible and open source, it becomes more like an actor-director: an Airflow task can be an integral part of the workflow in the same way a director can also play the part of an actor in the movie. With Airflow, you can load data into memory, perform some processing, and then pass the data to the next task. In this way, Airflow can be an MLOps tool as well as the orchestrator. It can directly perform some of the required machine learning operations, or it can act purely as an orchestrator, instantiating processes on TensorFlow clusters, initiating Spark jobs, and so on.
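As a rough sketch of that actor-director role (the DAG name, values, and schedule are illustrative, and it assumes Airflow 2.4+ with the TaskFlow API), data can be passed in-process from one task to the next:

```python
from airflow.decorators import dag, task
import pendulum


# Illustrative DAG: Airflow itself does a small piece of feature work,
# passing the data between tasks rather than only scheduling external jobs.
@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def in_process_feature_prep():
    @task
    def load_rows():
        # In a real pipeline this would read from a warehouse or lake
        return [{"user_id": 1, "clicks": 12}, {"user_id": 2, "clicks": 3}]

    @task
    def compute_features(rows):
        # A small transformation performed inside the Airflow worker itself
        return [{**row, "clicked_a_lot": row["clicks"] > 10} for row in rows]

    compute_features(load_rows())


in_process_feature_prep()
```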
In short, an MLOx is an orchestrator doing its job with ML tooling.
Here is how ZenML defines an MLOx:
The orchestrator is an essential component in any MLOps stack as it is responsible for running your machine learning pipelines. To do so, the orchestrator provides an environment which is set up to execute the steps of your pipeline. It also makes sure that the steps of your pipeline only get executed once all their inputs (which are outputs of previous steps of your pipeline) are available.

The features that make Airflow particularly well-suited as an MLOx are:
- DAGs (Directed Acyclic Graphs): DAGs represent ML workflows as tasks and their dependencies, and Airflow renders them visually. This makes it easy to see the dependencies between tasks and to track the progress of a workflow.
- Scheduling: Airflow can be used to schedule ML workflows to run on a regular basis. This helps ensure that ML models stay up-to-date and are used to make predictions in a timely manner (a minimal scheduling sketch follows this list).
- Monitoring: Airflow provides a number of tools for monitoring ML workflows. These tools can be used to track the performance of ML models and to identify any potential problems.
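To make the scheduling and retry behaviour concrete, here is a minimal sketch of a daily retraining DAG. The DAG id, interval, and retry settings are made up, and it assumes Airflow 2.4+ (older versions use schedule_interval instead of schedule):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def retrain_model():
    # Placeholder for the actual retraining logic
    print("retraining model...")


# Hypothetical daily retraining DAG: the schedule keeps the model fresh,
# and retries give the scheduler a way to handle transient failures.
with DAG(
    dag_id="daily_model_retrain",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    PythonOperator(task_id="retrain", python_callable=retrain_model)
```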
Think "Airflow and", not "Airflow or"
One complication I see when people are deciding on ML tooling is the tendency to look for one tool to do everything, including orchestration. Some of these all-in-one-ders include a basic workflow scheduler to cover the minimum requirements, but it will likely not be nearly as capable as something like Airflow. Once your MLOx requirements exceed the included orchestrator's capability, you then need to bring in a more capable orchestrator, redo the scheduling work, and probably rewrite much of the code too.
Another issue I've seen is people comparing Airflow to ML tools that do vastly different things, some of which just happen to end in "flow", like MLflow or Kubeflow. MLflow is mostly used for experiment tracking and operates in a completely different way from Airflow. Airflow is just one component in the vast MLOps tooling space, which ranges from model registries to experiment tracking to specialised model training platforms. MLOps encompasses many components necessary for effective ML workflow management.
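The "Airflow and, not or" point is that these tools complement each other: Airflow can schedule and sequence a training run while MLflow records it. A hedged sketch, where the tracking URI, parameter, and metric values are placeholders:

```python
from airflow.decorators import dag, task
import pendulum


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def train_with_tracking():
    @task
    def train_and_log():
        import mlflow  # experiment tracking; Airflow handles the orchestration

        mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical server
        with mlflow.start_run(run_name="daily_training"):
            mlflow.log_param("model_type", "gradient_boosting")
            # ... fit the model here ...
            mlflow.log_metric("val_auc", 0.91)  # illustrative value

    train_and_log()


train_with_tracking()
```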
Some people moving into MLOps come from a more experimental data science environment and don't have experience with the more stringent requirements that DevOps and Data Engineering bring to MLOps. Data scientists tend to work in a more informal way, while MLOps requires a structured approach. To implement an end-to-end MLOps pipeline, a systematic, repeatable approach is necessary to take the data, extract the features, train the model, and deploy it. Airflow often orchestrates structured processes like these, and it may take some learning for people used to a more flexible, data-science-like way of working. You should, however, start with the right tool from the beginning if you plan to scale up your MLOps capabilities.
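That systematic, repeatable shape maps naturally onto an Airflow DAG. A sketch with every stage left as a stub (the paths and schedule are hypothetical):

```python
from airflow.decorators import dag, task
import pendulum


@dag(schedule="@weekly", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def end_to_end_training():
    @task
    def extract_data():
        # Pull raw data from the source system
        return "s3://example-bucket/raw/"  # hypothetical location

    @task
    def build_features(raw_path):
        # Turn the raw data into model-ready features
        return "s3://example-bucket/features/"  # hypothetical location

    @task
    def train_model(features_path):
        # Fit the model and persist the artifact
        return "s3://example-bucket/models/latest"  # hypothetical location

    @task
    def deploy_model(model_path):
        # Push the trained model to the serving environment
        print(f"deploying {model_path}")

    deploy_model(train_model(build_features(extract_data())))


end_to_end_training()
```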
A final consideration for using Airflow as an MLOx is that many organizations already have an Airflow implementation doing some kind of data orchestration work. If there is already someone who knows how to manage the Airflow infrastructure and can help with creating and running DAGs, you have everything you need to get your MLOx up and running. Add to this automated DAG generation tools like gusty and the Astro SDK, or ML-specific DAG generators from ZenML and Metaflow, and you can get a working MLOx without having to know much about Airflow.
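As one illustration of that last point, here is roughly what a ZenML pipeline looks like. This is a sketch assuming a recent ZenML release where step and pipeline are importable from the top-level package; the step names and values are made up. If the active ZenML stack is configured with the Airflow orchestrator, the same pipeline is compiled into an Airflow DAG rather than run locally:

```python
from zenml import pipeline, step


@step
def load_training_data() -> list:
    # Hypothetical data loading step
    return [1.0, 2.0, 3.0]


@step
def train_model(data: list) -> float:
    # Hypothetical training step; returns a stand-in "score"
    return sum(data) / len(data)


@pipeline
def training_pipeline():
    data = load_training_data()
    train_model(data)


if __name__ == "__main__":
    # Which orchestrator actually runs this is decided by the ZenML stack,
    # not by the pipeline code itself.
    training_pipeline()
```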
Conclusion
Machine learning orchestration, a.k.a. MLOx, is a vital component of MLOps, and it requires a comprehensive and adaptable solution. Apache Airflow is a powerful orchestration tool, enabling seamless workflow management and execution. By filling the role of both an orchestrator and an MLOps tool, Airflow enables organizations to efficiently deploy and maintain machine learning models. As the field of MLOps continues to evolve, embracing tools like Airflow becomes crucial for maximizing productivity and unlocking the true potential of machine learning for your organization.