Introduction

Operationalizing Data Science projects is no trivial task. At the very least, data analysis workflows have to run on a regular basis to produce up-to-date results: a report with last week's data, or re-training a Machine Learning model because of concept drift. In some cases, the output of such workflows needs to be exposed as an API: for example, a trained Machine Learning model that generates predictions when a REST endpoint is hit.

This calls for development practices that make workflows (also known as pipelines) reproducible, repeatable and easy to deploy. In recent years, plenty of open-source workflow management tools have popped up. Given the plethora of options, it is hard for teams to choose the tool that best suits their needs; this article reviews 13 open-source workflow management tools.

Evaluation criteria (summary)

Over the last 5 years, I have developed several Machine Learning projects in industry and academic research. The evaluation criteria are the result of that experience. Although there is an emphasis on Machine Learning workflows, this survey is also useful for projects that require batch processing or job scheduling.

The following table summarizes each evaluation criterion. If you want a detailed explanation (and justification) of the criteria, scroll to the last section of this post.

Section | Explanation
--- | ---
Ease of use | How easy it is to pick it up (API design).
Development experience | Support for incremental builds and local execution.
Debugging | Integration with existing Python debugging tools (i.e. pdb).
Testing | Support for integration tests and pipeline testing.
Deployment | Executing workflows in a production-scale system, preferably open-source (e.g. Kubernetes). Ability to re-use pre-processing training code in production to eliminate training-serving skew.
Programming languages | SQL compatibility. Support for other popular programming languages such as R or Julia.
Maintainability | Amount of pipeline code (less is better) and source code organization. Tool-specific characteristics that affect maintainability are also considered.
Jupyter notebooks support | Support for developing pipeline tasks interactively (i.e. using Jupyter notebooks/lab) and running notebooks programmatically to generate visual reports.

Each section is graded on the following scale:

Grade | Explanation
--- | ---
NA | Unsupported or major limitations.
1 | Supported with some limitations.
2 | Good.
3 | Excellent.

Bear in mind that these evaluation criteria are quite specific, as they are tailored to evaluating Machine Learning workflows. Each project prioritizes certain aspects over others; the primary objective of this survey is to give an overview of the tools so you can choose whichever option is best for your use case.

Disclaimer

I am the author of Ploomber, one of the reviewed tools. I started Ploomber in 2019 because none of the available tools suited my needs. Since Ploomber was designed with these evaluation criteria in mind from the beginning, it is natural that it does well in most sections, except the ones where features are still in development.

Evaluation

(in alphabetical order)

Tool | Ease of use | Development experience | Debugging | Testing | Deployment | Programming languages | Maintainability | Jupyter notebooks support
--- | --- | --- | --- | --- | --- | --- | --- | ---
Ploomber | 3 | 3 | 3 | 3 | 1 | 2 | 3 | 3
Airflow | 1 | 1 | NA | 1 | 2 | 2 | 2 | 1
Dagster | 1 | 1 | NA | 3 | 2 | 1 | 1 | 1
DVC (Data Pipelines) | 3 | 3 | NA | NA | NA | NA | 2 | 1
Elyra | 3 | 1 | NA | NA | 1 | NA | 2 | 3
Flyte | 2 | NA | NA | NA | 2 | 1 | 1 | NA
Kale | 3 | 1 | NA | NA | 2 | NA | NA | 2
Kedro | 1 | 1 | 2 | 2 | 2 | NA | 1 | 1
Kubeflow pipelines | 1 | NA | NA | NA | 2 | NA | 1 | NA
Luigi | 3 | 1 | NA | 1 | 2 | 2 | 1 | NA
Metaflow | 3 | 2 | 1 | 2 | 1 | 1 | 1 | NA
Prefect | 2 | 2 | 3 | NA | 1 | 1 | 3 | 1
Tensorflow Extended (TFX) | 1 | 2 | NA | 1 | 3 | NA | NA | 1

Use [tool] if…

(in alphabetical order)

Name | Use [tool] if…
--- | ---
Ploomber | You want a simple API with a great development experience (debugging and testing tools); support for all SQL backends that have a Python connector and for R (languages with a Jupyter kernel, such as Julia, are supported with minor limitations); and the ability to develop tasks interactively (notebooks, scripts or functions) with Jupyter and execute them programmatically. Supports deploying workflows to Airflow or Kubernetes (via Argo). Note: deployment to Airflow/Kubernetes is stable but has limited features.
Airflow | You already have an Airflow installation and data scientists are already familiar with it. Otherwise, I would recommend using a more developer-friendly library that can export to Airflow, to leverage the aspect where Airflow shines: a robust and scalable scheduler.
Dagster | You have enough resources to dedicate an engineering team to maintain a Dagster installation that can only run Dagster workflows, and your data scientists are willing to spend time learning a DSL, navigate the documentation to understand each module's API, and give up interactive development with notebooks. In exchange, you get workflows that run locally and can be tested like any other Python code, along with a web interface for monitoring execution and debugging workflows via logs. Alternatively, you can deploy workflows to Airflow, but you lose the power of Dagster's native execution engine.
DVC (Data pipelines) | You already use DVC for data versioning and want a simple (albeit limited) way to execute small pipelines locally, and you do not need to deploy to a production system like Kubernetes. DVC Data Pipelines are language-agnostic, since tasks are specified by a shell command. However, this flexibility has a tradeoff: since DVC is unaware of the nature of your scripts, the workflow specification (a YAML file) becomes verbose for large pipelines.
Elyra | You want to leverage an existing Kubernetes installation (Elyra requires Kubeflow) and want a way to develop workflows visually, with the limitation that each task must be a Jupyter notebook. Since workflows are executed with Kubeflow, you can monitor them using the Kubeflow Pipelines UI. However, it lacks important features such as debugging tools, support for integration testing, creating API endpoints, SQL/R support, or scheduling.
Flyte | You have enough resources to dedicate a team of engineers to maintain a Flyte (Kubernetes) installation, and your data scientists are proficient with Python, willing to learn a DSL, and willing to give up Jupyter notebooks. In exchange, you get a cloud-agnostic, scalable workflow management tool that has been battle-tested at Lyft. Important: documentation is still limited.
Kale | You want to leverage an existing Kubernetes cluster to deploy legacy notebook-based pipelines. For new pipelines, I would only recommend Kale for small projects, since pipelines are constrained to a single Jupyter notebook file.
Kedro | You are OK with following a specific project structure convention and a strictly function-based Python workflow (no direct support for scripts or notebooks), and your data scientists are willing to spend time understanding Kedro's API and can give up interactive development with Jupyter. In exchange, you get pipelines that you can deploy to several production services.
Kubeflow pipelines | You are skilled in Kubernetes, want complete control over computational/storage resources, do not need Jupyter notebooks for interactive development, and are willing to spend time learning the Python API. The main drawback is that you cannot execute or debug workflows locally.
Luigi | You already have a Luigi installation. Bear in mind that Luigi was designed for data engineering workflows; thus, it does not come with specific features for data science or machine learning. There is no support for interactive development with Jupyter notebooks or for executing notebooks programmatically.
Metaflow | You use AWS, have enough resources to dedicate an engineering team to maintain a Metaflow installation, and do not mind developing pipeline tasks in a non-interactive way (no support for Jupyter notebooks). I would recommend taking a detailed look at the Administrator's guide to make sure it suits your needs, because some of the deployment infrastructure is still closed-source (such as the DAG scheduler). In exchange, you get a very robust tool that provides powerful workflow development tools with a clean, well-designed Python API.
Prefect | You have enough resources to maintain a Prefect installation, which can only execute Prefect workflows. Prefect was designed with the dataflow paradigm in mind (closer to stream processing than to batch processing); hence, it lacks several features required for data science or machine learning workflows: no incremental builds, no support for integration testing, no interactive development. Important: Prefect Server has a non-standard license.
Tensorflow Extended (TFX) | Your data scientists are proficient with Tensorflow, willing to learn a new DSL, OK with giving up other libraries like numpy, pandas or scikit-learn, and mostly developing Deep Learning models. In exchange, you get a flexible framework that can be deployed using Airflow or Beam, robust monitoring capabilities, and pipelines that can easily be converted into API endpoints.

Individual reviews

Ploomber

Section | Score | Comments
--- | --- | ---
Ease of use | 3 | Uses a convention-over-configuration approach: to get started, you only need to include two special variables in your scripts/notebooks and Ploomber will orchestrate execution (see the sketch below). For more flexibility, you can specify your pipeline using YAML, and for advanced use cases, use the Python API.
Development experience | 3 | Workflows can be executed locally, either in a single process or in multiple processes (in parallel). Provides incremental builds.
Debugging | 3 | Integrates with pdb and ipdb; you can start line-by-line debugging sessions on any task, or let the pipeline run and start a post-mortem session if a task crashes.
Testing | 3 | Execute integration tests upon task execution using the on_finish hook. Pipelines are regular Python objects, so you can import them and test them with the testing framework of your choice.
Deployment | 1 | Exporting workflows to Airflow and Argo (Kubernetes) is supported; however, these features only offer basic functionality and are in active development. Python batch-processing pipelines can be exported to process observations in-memory, and the resulting object can be used with any web framework to create an API endpoint.
Programming languages | 2 | Supports any database with a Python connector. Complete R support. Limited support for languages (such as Julia) that have a Jupyter kernel.
Maintainability | 3 | A task can be a script (Python/R/SQL), a notebook or a Python function; you can choose whatever is most useful for your project. The task library (which exposes a unified API) provides functionality for common tasks (e.g. run a SQL script, dump a database table) to reduce boilerplate code.
Jupyter notebooks support | 3 | Scripts and notebooks can be developed interactively using Jupyter and then executed programmatically.
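
To give an idea of the "two special variables" convention mentioned above, here is a rough sketch of a Ploomber task written as a script (file names and products are hypothetical, and the exact spec may differ between versions; consult the Ploomber documentation):

```python
# clean.py - a Ploomber task developed as a script (opens as a notebook in Jupyter)
import pandas as pd

# %% tags=["parameters"]
# Ploomber reads these two special variables to build the DAG;
# concrete values are injected at runtime.
upstream = ['get_data']                 # this task runs after "get_data"
product = {'data': 'output/clean.csv'}  # what this task produces

# %%
df = pd.read_csv(upstream['get_data']['data'])
df = df.dropna()
df.to_csv(product['data'], index=False)
```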


Airflow

Section | Score | Comments
--- | --- | ---
Ease of use | 1 | Airflow is notoriously hard to pick up (a minimal DAG sketch follows below). Although there is a wide range of task types available, in practice, the recommended way to write workflows is to use the Kubernetes operator exclusively to ensure environment isolation.
Development experience | 1 | Workflows can run locally using the local executor, but this requires a full Airflow installation. Furthermore, one-off workflow executions are not straightforward, since Airflow makes strong assumptions about how and when workflows should execute. No support for incremental builds.
Debugging | NA | No tools for debugging.
Testing | 1 | Airflow workflows are Python objects; you can import them and inspect their properties. However, testing workflows this way does not seem to be an official or recommended practice.
Deployment | 2 | Scaling and scheduling are Airflow's core strengths. But there is no support for exposing workflows as an API endpoint.
Programming languages | 2 | Support for a wide variety of SQL backends.
Maintainability | 2 | Since each task is a Python object, you can organize large projects in multiple modules without any limitations. Some tasks are community-contributed and vary in quality.
Jupyter notebooks support | 1 | There is a task to execute notebooks programmatically; however, its use is not recommended since the code is executed in a global environment. There is no support for interactive development.
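
For reference, a minimal Airflow DAG sketch (function bodies and names are placeholders; the PythonOperator import path differs between Airflow 1.x and 2.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator in 1.x


def fetch_data():
    ...  # placeholder: download last week's data


def train_model():
    ...  # placeholder: re-train the model


with DAG(
    dag_id="weekly_training",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    fetch >> train  # declare the task dependency
```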


Dagster

Section | Score | Comments
--- | --- | ---
Ease of use | 1 | Workflows are written in Python. The API has a lot of features, but it is hard to pick up and read; for example, a lot of functionality is hidden in a context parameter (see the sketch below). Even for seemingly simple tasks such as executing a SQL query, one has to become familiar with several concepts and type a lot of code.
Development experience | 1 | Offers great flexibility to execute workflows locally and deploy to a distributed system (e.g. Airflow, Kubernetes). No support for incremental builds.
Debugging | NA | No tools for debugging.
Testing | 3 | Great support for testing: using hooks you can execute integration tests, and workflows can be imported and tested using a testing framework.
Deployment | 2 | Dagster comes with a full-featured executor and scheduler. However, this means that you have to maintain a Dagster installation which can only execute Dagster workflows. There is no documentation on exposing a workflow as an API, but this seems possible, since workflows can be imported and used like any other Python object.
Programming languages | 1 | Only supports postgres and snowflake. No support for R/Julia.
Maintainability | 1 | The configuration mechanism is extremely verbose; there are several optional packages to integrate with other systems, but they all have different APIs.
Jupyter notebooks support | 1 | Support for executing notebooks programmatically. No support for interactive development.
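
To illustrate the API style (and the context parameter mentioned above), here is a minimal sketch using the solid/pipeline API Dagster exposed at the time of writing (function bodies are placeholders; newer releases have changed these abstractions):

```python
from dagster import execute_pipeline, pipeline, solid


@solid
def load_data(context):
    context.log.info("loading data")  # much of the functionality lives in `context`
    return [1, 2, 3]


@solid
def train(context, data):
    return sum(data)  # placeholder for actual training


@pipeline
def training_pipeline():
    train(load_data())  # wires the dependency between the two solids


if __name__ == "__main__":
    execute_pipeline(training_pipeline)
```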


DVC (Data pipelines)

Section | Score | Comments
--- | --- | ---
Ease of use | 3 | Workflows are specified using a YAML file, where each task is specified by the command to execute, its file dependencies and its output files (see the sketch below).
Development experience | 3 | Can (exclusively) run workflows locally and provides incremental builds.
Debugging | NA | No tools for debugging.
Testing | NA | No support for integration testing. No support for pipeline testing.
Deployment | NA | No support for exporting to large-scale systems. No support for exposing a workflow as an API endpoint.
Programming languages | NA | The framework is language-agnostic; tasks are specified using commands. However, this implies that you have to provide a command-line interface for each script to pass arguments. No direct support for SQL.
Maintainability | 2 | Workflows are specified with YAML, which is good for small projects but not great for large ones. Once the project grows, the YAML file becomes redundant because you have to specify the same values multiple times (i.e. a script train.py appears both in the cmd section and the deps section). With a few dozen tasks, this becomes verbose and error-prone.
Jupyter notebooks support | 1 | Tasks are specified with a single command, which means you are free to use notebooks as pipeline tasks: edit them interactively and then run them programmatically. However, you have to manually specify the command and parameters to use for executing your notebook programmatically, since DVC is unaware of its content.
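
A minimal dvc.yaml sketch (script and data file names are hypothetical) shows both the simplicity and the redundancy mentioned above: each script appears in cmd and again in deps:

```yaml
stages:
  features:
    cmd: python features.py        # the shell command DVC runs
    deps:
      - features.py                # the script itself must also be listed as a dependency
      - data/raw.csv
    outs:
      - data/features.csv
  train:
    cmd: python train.py
    deps:
      - train.py                   # train.py appears both here and in cmd
      - data/features.csv
    outs:
      - model.pkl
```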


Elyra

Section | Score | Comments
--- | --- | ---
Ease of use | 3 | The pipeline visual editor makes it extremely simple to convert a set of notebooks into a pipeline.
Development experience | 1 | Pipelines can be executed locally. No support for incremental builds.
Debugging | NA | No tools for debugging.
Testing | NA | No support for integration testing. No support for pipeline testing.
Deployment | 1 | Runs workflows in Kubernetes via Kubeflow pipelines. No support for scheduling workflows. Due to its exclusively notebook-based nature, there is no easy way to convert a workflow into an API endpoint.
Programming languages | NA | Python-only.
Maintainability | 2 | The visual editor greatly facilitates workflow authoring; however, some people might prefer to have more control over the pipeline definition. The definition is written in JSON format, but it is unclear whether manually editing that file is recommended. The requirement that every task be a notebook is limiting.
Jupyter notebooks support | 3 | Elyra is a notebook-centric tool where each task is a notebook, hence you can develop tasks interactively. When you execute your pipeline, the notebooks are executed programmatically.


Flyte

Section | Score | Comments
--- | --- | ---
Ease of use | 2 | The API is clean. Tasks are defined using Python functions with a few decorators.
Development experience | NA | Workflows cannot be executed locally; they can only be executed in Kubernetes. No support for incremental builds.
Debugging | NA | No tools for debugging.
Testing | NA | No support for integration testing. No support for pipeline testing.
Deployment | 2 | Runs on Kubernetes, supports scheduling. Unclear whether it is possible to expose a workflow as an API endpoint.
Programming languages | 1 | There is support for some SQL-compatible systems such as Hive and Presto. Spark is also supported. No support for R/Julia.
Maintainability | 1 | The API is clean, but the documentation is still a work in progress and there are only a few code examples.
Jupyter notebooks support | NA | No support for interactive development or for executing notebooks programmatically.


Kale

Section | Score | Comments
--- | --- | ---
Ease of use | 3 | Deploying a pipeline with Kale only requires you to add tags to Jupyter notebook cells.
Development experience | 1 | Workflows can execute locally. No support for incremental builds.
Debugging | NA | No tools for debugging.
Testing | NA | No support for integration testing. No support for pipeline testing.
Deployment | 2 | Deployment for batch processing is seamless: once you annotate your notebook, you can submit the workflow to the Kubernetes cluster. However, there is no support for re-using feature engineering code for an API endpoint.
Programming languages | NA | Python-only.
Maintainability | NA | Pipelines have to be declared in a single notebook file, which may cause a lot of trouble since cell side-effects are difficult to track. Having multiple people edit the same file also causes a lot of trouble when resolving version control conflicts. Finally, the code that you write is not the code that gets executed (Kale generates Kubeflow code with jinja), which may cause problems for debugging.
Jupyter notebooks support | 2 | Kale is a notebook-first framework. You can develop your pipeline interactively and the notebook itself becomes a pipeline; however, it has to go through some pre-processing steps before it is executed.


Kedro

Section | Score | Comments
--- | --- | ---
Ease of use | 1 | Pipelines are defined using a Python API, where each task is a function (see the sketch below). Although the workflow API is clean, some extra modules have complex APIs. Furthermore, it is very opinionated and expects your project to follow a specific folder layout, which includes several kedro-specific configuration files.
Development experience | 1 | Workflows can execute locally. No support for incremental builds.
Debugging | 2 | Support for debugging nodes and pipelines, although the API looks complex.
Testing | 2 | There is support for testing tasks upon execution (hooks); however, similar to the debugging API, it looks complex.
Deployment | 2 | Supports deployment to Kubernetes (Argo and Kubeflow), Prefect and AWS Batch. It is unclear whether you can convert a batch pipeline to an online API.
Programming languages | NA | Python-only.
Maintainability | 1 | Expects your project to have a specific folder layout and configuration files. This is restrictive and overkill for simple projects.
Jupyter notebooks support | 1 | You can start a Jupyter notebook and export defined functions as kedro nodes (tasks), but interactivity is limited since the exported code has to be a single function. No support for executing notebooks programmatically.
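
A minimal sketch of Kedro's function-based API (the dataset names are hypothetical and would have to exist in the project's data catalog):

```python
from kedro.pipeline import Pipeline, node


def preprocess(raw_df):
    return raw_df.dropna()  # placeholder feature engineering


def train(features_df):
    ...  # placeholder: fit and return a model


# each node maps a plain Python function to named inputs/outputs from the data catalog
training_pipeline = Pipeline(
    [
        node(preprocess, inputs="raw_data", outputs="features"),
        node(train, inputs="features", outputs="model"),
    ]
)
```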


Kubeflow pipelines

Section | Score | Comments
--- | --- | ---
Ease of use | 1 | Workflows are written using a highly complex Python API (which is the reason why Kale exists); see the sketch below.
Development experience | NA | There is no way to run workflows locally, as it is a Kubernetes-only framework. No support for incremental builds either.
Debugging | NA | No tools for debugging.
Testing | NA | No support for integration testing. No support for pipeline testing.
Deployment | 2 | Batch processing deployment is simple since Kubeflow is tightly integrated with Kubernetes. Unclear whether you can compose training and serving pipelines to re-use feature engineering code for an API endpoint.
Programming languages | NA | Python-only.
Maintainability | 1 | The code is hard to read and contains too many details (see this example). The same parameters (project, cluster name, region) are passed to all tasks. Documentation is outdated.
Jupyter notebooks support | NA | No support for interactive development or for executing notebooks programmatically.
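
For reference, a minimal sketch using the kfp DSL (container images and commands are placeholders); each task runs in its own container:

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="training", description="Weekly training pipeline")
def training_pipeline():
    fetch = dsl.ContainerOp(
        name="fetch-data",
        image="gcr.io/my-project/fetch:latest",   # hypothetical image
        command=["python", "fetch.py"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",   # hypothetical image
        command=["python", "train.py"],
    )
    train.after(fetch)  # declare execution order


# compile to a workflow spec that can be uploaded to Kubeflow Pipelines
kfp.compiler.Compiler().compile(training_pipeline, "training.yaml")
```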


Luigi

Section | Score | Comments
--- | --- | ---
Ease of use | 3 | You need to become familiar with the API to get started; however, it is not as complex as others. It has a consistent set of concepts (tasks, targets and parameters), and tasks (defined as Python classes) all have more or less the same structure (see the sketch below).
Development experience | 1 | Can run workflows locally. No support for incremental builds (once a task is executed, running it again has no effect, even if input files change).
Debugging | NA | No debugging tools.
Testing | 1 | Although not specifically designed for that purpose, callbacks can be used for integration testing. No support for inspecting the pipeline to test its properties/definition.
Deployment | 2 | Very simple to deploy to the central monitoring tool. Limited scalability, no built-in scheduler. Batch-processing only, no conversion to API endpoints.
Programming languages | 2 | Support for some SQL backends.
Maintainability | 1 | Requiring workflows to be defined as a single class is problematic for collaboration and code organization. Furthermore, it might lead to sneaky bugs since tasks are not stateless (due to the existence of instance variables).
Jupyter notebooks support | NA | No support for interactive development. No support for executing notebooks programmatically.
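
A minimal Luigi sketch (file paths and bodies are placeholders). Note that completeness is determined by the existence of output targets, which is why re-running a completed task has no effect even if its inputs changed:

```python
import luigi


class PrepareData(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/features.csv")

    def run(self):
        ...  # placeholder: write features to self.output().path


class TrainModel(luigi.Task):
    def requires(self):
        return PrepareData()  # declares the upstream dependency

    def output(self):
        return luigi.LocalTarget("model.pkl")

    def run(self):
        ...  # placeholder: read self.input().path, train, save the model


if __name__ == "__main__":
    luigi.build([TrainModel()], local_scheduler=True)
```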


Metaflow

Section | Score | Comments
--- | --- | ---
Ease of use | 3 | Workflows are defined using a Python class, and decorators can be used for several things such as retrying tasks or installing dependencies before a task executes (see the sketch below).
Development experience | 2 | Workflows can be executed locally and you can resume execution from failed tasks. No support for incremental builds.
Debugging | 1 | If a workflow fails, you can inspect the data to determine what went wrong. Nonetheless, you can only debug workflows after they fail; there is no support for starting an interactive post-mortem debugging session, so you have to resort to print statements for debugging, which is far from ideal.
Testing | 2 | Workflows can be imported to inspect their definition and properties. Although not explicitly mentioned, there do not seem to be any restrictions, and you could use these tools with a testing framework like pytest.
Deployment | 1 | Metaflow comes with a built-in AWS tool to execute workflows, and it is also possible to schedule workflows using AWS Step Functions. However, Netflix uses an internal (closed-source) DAG scheduler. There are no options to deploy to other clouds. It appears that workflows can be exposed as APIs, but it is unclear whether this is part of the open-source package.
Programming languages | 1 | There is support for R workflows, although it is a separate tool that uses the Python library as a backend; you cannot mix R and Python in the same workflow. No support for SQL.
Maintainability | 1 | Requiring workflows to be defined as a single class is problematic for collaboration and code organization. Furthermore, it might lead to sneaky bugs since tasks are not stateless (due to the existence of instance variables).
Jupyter notebooks support | NA | No support for interactive development. No support for executing notebooks programmatically.
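
A minimal Metaflow sketch showing the class-based definition and decorator usage (the retry decorator is just one example; step bodies are placeholders):

```python
from metaflow import FlowSpec, retry, step


class TrainingFlow(FlowSpec):

    @step
    def start(self):
        self.data = [1, 2, 3]  # instance variables carry state between steps
        self.next(self.train)

    @retry(times=2)  # decorators handle cross-cutting concerns such as retries
    @step
    def train(self):
        self.model = sum(self.data)  # placeholder for actual training
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    TrainingFlow()
```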


Prefect

Section | Score | Comments
--- | --- | ---
Ease of use | 2 | Function-based workflows are written with a clean Python API (see the sketch below). The task library contains a wide variety of tasks; however, only a handful of them are relevant to Data Science/Machine Learning projects.
Development experience | 2 | Workflows can be executed locally. No support for incremental builds.
Debugging | 3 | You can inspect the output and status of each task, and there are tools that make workflow debugging simple.
Testing | NA | No support for integration testing. No support for pipeline testing.
Deployment | 1 | Workflows can be deployed (and scheduled) using the web interface; however, this interface can only execute Prefect workflows. No support for running workflows in other systems. Prefect Server has a non-standard open-source license.
Programming languages | 1 | Although there is support for a few SQL databases (e.g. postgres, snowflake, sqlite), each module has a different API. No support for R or Julia.
Maintainability | 3 | Great API with minimal boilerplate code.
Jupyter notebooks support | 1 | There is support for executing notebooks programmatically. No support for developing tasks interactively.
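
A minimal sketch using the Prefect (1.x) functional API available at the time of writing (task bodies are placeholders):

```python
from prefect import Flow, task


@task
def get_data():
    return [1, 2, 3]  # placeholder: fetch new observations


@task
def train(data):
    return sum(data)  # placeholder: train a model


# the Flow context manager builds the DAG from the task calls
with Flow("training") as flow:
    data = get_data()
    model = train(data)

flow.run()  # executes the flow locally
```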


Tensorflow Extended (TFX)

Section | Score | Comments
--- | --- | ---
Ease of use | 1 | The API is very complex; it requires you to become familiar with several modules before you get started.
Development experience | 2 | Pipelines can be executed locally. No support for incremental builds.
Debugging | NA | No tools for debugging.
Testing | 1 | Technically possible since you can run workflows locally; however, local workflows have to explicitly enable the interactive mode, and it is unclear whether this can cause trouble when running under a testing framework. No support for pipeline testing.
Deployment | 3 | There are three deployment options: Airflow, Kubeflow Pipelines and Apache Beam; however, examples are only provided for Google Cloud. Workflows can be exposed as an API using Tensorflow Serving.
Programming languages | NA | Python-only.
Maintainability | NA | TFX is a Tensorflow-exclusive framework, which means you cannot bring in other libraries like numpy or pandas. There is also a good amount of boilerplate code that makes pipelines difficult to follow for people unfamiliar with the Tensorflow ecosystem.
Jupyter notebooks support | 1 | You can develop pipelines interactively and run them in a single notebook. No support for executing notebooks programmatically.


Evaluation criteria

1. Ease of use

Having a clean API that allows users to get started quickly is essential to any software tool, but even more so for data analysis tools, where programming proficiency varies a lot among users. Often, data analysis projects are experimental/exploratory and go through an initial prototyping phase. Practitioners tend to stay away from "production tools" because they often have a steep learning curve that slows progress down; as a consequence, many data scientists write entire data pipelines in a single notebook. This is a terrible practice because it creates unmaintainable software. The only way to avoid it is to provide production tools that add minimal overhead, so they are appealing enough for practitioners to use them from day one.

2. Development experience

Data Science/Machine Learning is highly iterative, especially in the prototyping phase. We start with a rough idea of what kind of analysis we have to do and refine it as we learn more about the data. A data scientist spends a lot of time changing a small part of the code and re-running the analysis to see how such a change affects the results (e.g. adding a new feature). Incremental builds are key to facilitating this iterative process: when a data scientist modifies a few lines of code in a single task, there is no need to re-execute the pipeline end-to-end, only the tasks that were modified or affected by the change.
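
The idea behind incremental builds can be sketched in a few lines: skip a task when neither its source code nor its inputs have changed since the last run. This is a simplified illustration of the concept, not how any particular tool implements it:

```python
import hashlib
import json
from pathlib import Path

CACHE = Path(".build-cache.json")


def _digest(*paths):
    """Hash the contents of a task's source file and input files."""
    h = hashlib.sha256()
    for p in paths:
        h.update(Path(p).read_bytes())
    return h.hexdigest()


def run_if_outdated(name, source, inputs, func):
    """Run `func` only if the task's source or inputs changed since the last run."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    digest = _digest(source, *inputs)
    if cache.get(name) == digest:
        print(f"{name}: up to date, skipping")
        return
    func()
    cache[name] = digest
    CACHE.write_text(json.dumps(cache))
```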

Deployed data pipelines are usually managed by production systems such as the Airflow scheduler or Kubernetes; however, being able to develop locally without requiring any extra infrastructure is critical to foster rapid development.

3. Debugging

Data pipelines are notoriously hard to debug because errors may come from incorrect code or unexpected data properties. Being able to start an interactive session to debug our code is more efficient than looking at a bunch of print (or logging) statements. Workflow managers often execute our code in specific ways (e.g. multiprocessing, remote workers), which might cause the standard Python debugging tools not to work. We evaluate whether it is possible to use standard tools to debug workflows.
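
For example, with plain Python (no workflow manager in the way), a crashed task can be inspected interactively with the standard library debugger instead of through logs:

```python
import pdb
import sys


def task():
    df = None
    return df.dropna()  # raises AttributeError: simulates a crashing task


try:
    task()
except Exception:
    # start an interactive post-mortem session at the point of failure
    pdb.post_mortem(sys.exc_info()[2])
```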

4. Testing

Unexpected data can break pipelines in many ways. In the best case, a downstream task crashes because of a schema incompatibility; in the worst case, the pipeline runs "successfully" but produces incorrect results (e.g. a model with bad performance). To prevent such sneaky bugs, it is becoming increasingly common to test artifacts after task execution; this is known as integration testing or data testing. An example is to check that there are no NULL values in a dataset after we apply a transformation to it. We evaluate support for integration testing.
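
The NULL-value check mentioned above is a short assertion that a workflow manager can run as a hook right after a task finishes (pandas illustration; the file path is a placeholder):

```python
import pandas as pd


def check_no_nulls(product_path):
    """Integration (data) test: fail the pipeline if the artifact contains NULL values."""
    df = pd.read_csv(product_path)
    assert not df.isnull().values.any(), f"{product_path} contains NULL values"
```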

A second (often overlooked) feature is the ability to test the workflow itself. Workflows are complex because they encompass multi-stage procedures with dependencies among them. Being able to run and inspect our workflows in a test environment helps detect bugs during development rather than in production. The ability to test workflows using tools such as pytest is considered.
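
When a workflow is importable as a regular Python object, testing its definition becomes an ordinary unit test. A hypothetical pytest sketch (make_pipeline and task_names are illustrative; each tool exposes its own module and attributes):

```python
# test_pipeline.py - run with pytest
from my_project.pipeline import make_pipeline  # hypothetical factory function


def test_pipeline_contains_expected_tasks():
    pipeline = make_pipeline()
    # `task_names` is illustrative; the actual attribute depends on the tool
    assert {"get_data", "clean", "train"} <= set(pipeline.task_names)
```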

5. Deployment

The two most common deployment paradigms for data workflows are batch and online. An example of a batch deployment is a workflow that processes new observations (often on a schedule), makes predictions and uploads them to a database. The online scenario involves exposing the pipeline as an API (e.g. REST/RPC) that takes input data and returns a prediction. A common error during deployment happens when the pre-processing code at serving time differs from the one used at training time (training-serving skew). We assess the ability to re-use existing training code for serving to eliminate this problem.
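
Eliminating training-serving skew amounts to calling the same feature-engineering code in both paths. A schematic sketch (function names and the feature are placeholders):

```python
import pandas as pd


def add_features(df):
    """Single source of truth for pre-processing, used at training and serving time."""
    df = df.copy()
    df["ratio"] = df["a"] / df["b"]  # placeholder feature
    return df


def train(model, raw_training_df):
    # batch path: the training pipeline applies the transformation
    return model.fit(add_features(raw_training_df))


def predict(model, payload):
    # online path: the API endpoint applies the *same* transformation before predicting
    df = pd.DataFrame([payload])
    return model.predict(add_features(df))
```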

We also evaluate deployment options and favor integration with other open-source deployment tools (e.g. Airflow, Kubernetes).

6. Programming languages

Although training ML models with unstructured data is becoming increasingly common thanks to Deep Learning, tabular data and classical ML algorithms are still the most common type of application (see page 19 of this report). Tabular data is usually stored in SQL databases, which implies that our pipeline is often a combination of SQL and Python. We evaluate the ability to integrate SQL scripts as part of the workflow. Finally, we also evaluate support for other popular languages for data analysis, such as R and Julia.

7. Maintainability

This section evaluates project maintainability. The first aspect to consider is the amount of code needed to declare the pipeline (less code is better). The second aspect is code organization, we determine if the library imposes restrictions that limit our ability to organize our code in separate functions/modules. Tool-specific characteristics that may affect code maintainability are also evaluated.

8. Jupyter notebooks support

The use of notebooks in production pipelines always triggers a heated debate, but I believe the problems come from poor development practices rather than from notebooks themselves. Notebooks are a fantastic environment for exploratory work, which is exactly what we need when we are learning about the data. Being able to develop workflows interactively has a great positive impact on productivity.

A second important usage of notebooks is as an output format. The .ipynb format supports embedding tables and charts in a single file without any extra code. This is a huge time saver because debugging workflows is much easier when we can look at our data with charts; finding errors with text-based logs alone severely limits this process. Having a .ipynb file as the result of executing a workflow task is akin to having rich logs that facilitate debugging.
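
Tools that support this typically execute a notebook with parameters and save the rendered copy. For instance, with papermill (one common open-source option, not tied to any of the reviewed tools; paths and parameters here are hypothetical):

```python
import papermill as pm

# execute the notebook programmatically and keep the rendered copy as a rich log
pm.execute_notebook(
    "tasks/train.ipynb",              # hypothetical input notebook
    "output/train-executed.ipynb",    # executed copy with tables and charts embedded
    parameters={"n_estimators": 100},
)
```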