Introduction
Operationalizing Data Science projects is no trivial task. At the very least, data analysis workflows have to run on a regular basis to produce up-to-date results: a report with last week’s data, or re-training a Machine Learning model because of concept drift. In some cases, the output of such workflows needs to be exposed as an API, for example, a trained Machine Learning model that serves predictions through a REST endpoint.
This calls for development practices that make workflows (also known as pipelines) reproducible, repeatable and easy to deploy. In recent years, plenty of open-source workflow management tools have appeared. Given the plethora of options, it is hard for teams to choose the tool that best suits their needs; this article reviews 13 open-source workflow management tools.
Evaluation criteria (summary)
Over the last 5 years, I have developed several Machine Learning projects in industry and academic research. These evaluation criteria are the result of that experience. Although there is an emphasis on Machine Learning workflows, this survey is also useful for projects that require batch processing or job scheduling.
The following table summarizes the rationale for each evaluation section. If you want a detailed explanation (and justification) of the criteria, scroll to the last section of this post.
Section | Explanation |
---|---|
Ease of use | How easy it is to pick it up (API design). |
Development experience | Support for incremental builds and local execution. |
Debugging | Integration with existing Python debugging tools (i.e. pdb ). |
Testing | Support for integration tests and pipeline testing. |
Deployment | Executing workflows in a production-scale system, preferably open-source (e.g. Kubernetes). Ability to re-use pre-processing training code in production to eliminate training-serving skew. |
Programming languages | SQL compatibility. Support for other popular programming languages such as R or Julia. |
Maintainability | Amount of pipeline code (less is better) and source code organization. Tool-specific characteristics that affect maintainability are also considered. |
Jupyter notebooks support | Support for developing pipeline tasks interactively (i.e. using Jupyter notebooks/lab) and running notebooks programmatically to generate visual reports. |
Each section is graded as NA or on a scale from 1 to 3:
Grade | Explanation |
---|---|
NA | Unsupported or major limitations. |
1 | Supported with some limitations. |
2 | Good. |
3 | Excellent. |
Bear in mind that these evaluation criteria are very specific, since they are tailored to Machine Learning workflows. Each project prioritizes certain aspects over others; the primary objective of this survey is to give an overview of the tools so you can choose whichever option is best for your use case.
Disclaimer
I am the author of Ploomber, one of the reviewed tools. I started Ploomber in 2019 because none of the available tools suited my needs. Since Ploomber was designed with these evaluation criteria in mind from the beginning, it is natural that it does well in most sections, except the ones where features are still in development.
Evaluation
(in alphabetical order)
Tool | Ease of use | Development experience | Debugging | Testing | Deployment | Programming languages | Maintainability | Jupyter notebooks support |
---|---|---|---|---|---|---|---|---|
Ploomber | 3 | 3 | 3 | 3 | 1 | 2 | 3 | 3 |
Airflow | 1 | 1 | NA | 1 | 2 | 2 | 2 | 1 |
Dagster | 1 | 1 | NA | 3 | 2 | 1 | 1 | 1 |
DVC (Data Pipelines) | 3 | 3 | NA | NA | NA | NA | 2 | 1 |
Elyra | 3 | 1 | NA | NA | 1 | NA | 2 | 3 |
Flyte | 2 | NA | NA | NA | 2 | 1 | 1 | NA |
Kale | 3 | 1 | NA | NA | 2 | NA | NA | 2 |
Kedro | 1 | 1 | 2 | 2 | 2 | NA | 1 | 1 |
Kubeflow pipelines | 1 | NA | NA | NA | 2 | NA | 1 | NA |
Luigi | 3 | 1 | NA | 1 | 2 | 2 | 1 | NA |
Metaflow | 3 | 2 | 1 | 2 | 1 | 1 | 1 | NA |
Prefect | 2 | 2 | 3 | NA | 1 | 1 | 3 | 1 |
Tensorflow Extended (TFX) | 1 | 2 | NA | 1 | 3 | NA | NA | 1 |
Use [tool] if…
(in alphabetical order)
Name | Use [tool] if… |
---|---|
Ploomber | You want a simple API with a great development experience (debugging and testing tools), support for any SQL backend with a Python connector, and R support (languages with a Jupyter kernel, such as Julia, are supported with minor limitations). You can develop tasks interactively (notebooks, scripts or functions) with Jupyter and execute them programmatically. Workflows can be deployed to Airflow or Kubernetes (via Argo). Note: deployment to Airflow/Kubernetes is stable but still limited in features. |
Airflow | You already have an Airflow installation and your data scientists are already familiar with it. Otherwise, I would recommend using a more developer-friendly library that can export to Airflow, to leverage the aspect where Airflow shines: a robust and scalable scheduler. |
Dagster | You have enough resources to dedicate an engineering team to maintaining a Dagster installation (which can only run Dagster workflows), and your data scientists are willing to spend time learning a DSL, to navigate the documentation to understand each module’s API, and to give up interactive development with notebooks. In exchange, you get workflows that can run locally and be tested like any other Python code, along with a web interface for monitoring execution and debugging workflows via logs. Alternatively, you can deploy workflows to Airflow, but you lose the power of Dagster’s native execution engine. |
DVC (Data pipelines) | You already use DVC for data versioning and want a simple (albeit limited) way to execute small pipelines locally, and you do not need to deploy to a production system like Kubernetes. DVC Data Pipelines are language-agnostic, since tasks are specified by a shell command. However, this flexibility has a tradeoff: since DVC is unaware of the nature of your scripts, the workflow specification (a YAML file) becomes verbose for large pipelines. |
Elyra | You want to leverage an existing Kubernetes installation (Elyra requires Kubeflow) and want a way to develop workflows visually, with the limitation that each task must be a Jupyter notebook. Since workflows are executed with Kubeflow, you can monitor them using the Kubeflow Pipelines UI. However, it lacks important features such as debugging tools, support for integration testing, creating API endpoints, SQL/R support, and scheduling. |
Flyte | You have enough resources to dedicate a team of engineers to maintain a Flyte (Kubernetes) installation, your data scientists are proficient with Python, are willing to learn a DSL, and give up using Jupyter notebooks. In exchange, you get a cloud-agnostic, scalable workflow management tool that has been battle-tested at Lyft. Important: documentation is still limited. |
Kale | You want to leverage an existing Kubernetes cluster to deploy legacy notebook-based pipelines. For new pipelines, I would only recommend Kale for small projects since pipelines are constrained to be a single Jupyter notebook file. |
Kedro | You are OK with following a specific project structure convention and having a strictly function-based Python workflow (no direct support for scripts or notebooks), and your data scientists are willing to spend time understanding Kedro’s API and to give up interactive development with Jupyter. In exchange, you get pipelines that you can deploy to several production services. |
Kubeflow pipelines | You are skilled in Kubernetes, want complete control over computational/storage resources, do not need Jupyter notebooks for interactive development, and are willing to spend time learning the Python API. The main drawback is that you cannot execute or debug workflows locally. |
Luigi | You already have a Luigi installation. Bear in mind that Luigi was designed for data engineering workflows, so it does not come with specific features for data science or machine learning. There is no support for interactive development with Jupyter notebooks, nor for executing notebooks programmatically. |
Metaflow | You use AWS, have enough resources to dedicate an engineering team to maintain a Metaflow installation, and do not mind developing pipeline tasks in a non-interactive way (no support for Jupyter notebooks). I would recommend taking a detailed look at the Administrator’s guide to make sure it suits your needs, because some of the deployment infrastructure is still closed-source (such as the DAG scheduler). In exchange, you get a very robust tool with powerful workflow development features and a clean, well-designed Python API. |
Prefect | You have enough resources to maintain a Prefect installation, which can only execute Prefect workflows. Prefect was designed with the dataflow paradigm in mind (closer to stream processing than to batch processing); hence, it lacks several features that are required for data science or machine learning workflows: no incremental builds, no support for integration testing, no interactive development. Important: Prefect Server has a non-standard license. |
Tensorflow Extended (TFX) | Your data scientists are proficient with Tensorflow, are willing to learn a new DSL, are OK with giving up other libraries like numpy, pandas or scikit-learn, and are mostly developing Deep Learning models. In exchange, you get a flexible framework that can be deployed with Airflow or Beam, robust monitoring capabilities, and the ability to easily convert pipelines into API endpoints. |
Individual reviews
Ploomber
Section | Score | Comments |
---|---|---|
Ease of use | 3 | Uses a convention-over-configuration approach: to get started, you only need to include two special variables in your scripts/notebooks and Ploomber will orchestrate execution (see the sketch below). For more flexibility, you can specify your pipeline using YAML, and for advanced use cases, use the Python API. |
Development experience | 3 | Workflows can be executed locally either in a single process or in multiple processes (in parallel). Provides incremental builds. |
Debugging | 3 | Integrates with pdb and ipdb; you can start line-by-line debugging sessions on any task, or let the pipeline run and start a post-mortem session if a task crashes. |
Testing | 3 | Execute integration tests upon task execution using the on_finish hook. Pipelines are regular Python objects, so you can import them and test them with the testing framework of your choice. |
Deployment | 1 | Exporting workflows to Airflow and Argo (Kubernetes) is supported; however, these features only offer basic functionality and are in active development. A Python batch-processing pipeline can be exported to process observations in-memory, and this object can be used with any web framework to create an API endpoint. |
Programming languages | 2 | Supports any database with a Python connector. Complete R support. Limited support for languages (such as Julia) that have a Jupyter kernel. |
Maintainability | 3 | A task can be a script (Python/R/SQL), a notebook or a Python function; you can choose whatever is more useful for your project. The task library (which exposes a unified API) provides functionality for common tasks (e.g. run a SQL script, dump a database table) to reduce boilerplate code. |
Jupyter notebooks support | 3 | Scripts and notebooks can be developed interactively using Jupyter and then executed programmatically. |
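To illustrate the convention-over-configuration approach, below is a minimal sketch of a script-based task. The `upstream`/`product` variables reflect my reading of Ploomber's convention and the file names are made up; check the official documentation for the exact details.

```python
# clean.py -- a Ploomber task written as a plain Python script (illustrative sketch)
import pandas as pd

# Ploomber reads these two special variables to build the DAG:
upstream = ["load"]                      # this task runs after the "load" task
product = {"data": "output/clean.csv"}   # file(s) this task produces

# At execution time, Ploomber injects the upstream products, so the declared
# list above is replaced by a dictionary mapping task names to their outputs.
df = pd.read_csv(upstream["load"]["data"])
df.dropna().to_csv(product["data"], index=False)
```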
Resources
Airflow
Section | Score | Comments |
---|---|---|
Ease of use | 1 | Airflow is notoriously hard to pick up. Although there is a wide range of task types available, in practice the recommended way to write workflows is to use the Kubernetes operator exclusively to ensure environment isolation (see the sketch below for what a basic DAG looks like). |
Development experience | 1 | Workflows can run locally using the local executor, but this requires a full Airflow installation. Furthermore, one-off workflow executions are not straightforward, since Airflow makes strong assumptions about how and when workflows should execute. No support for incremental builds. |
Debugging | NA | No tools for debugging. |
Testing | 1 | Airflow workflows are Python objects; you can import them and inspect their properties. However, testing workflows this way does not seem to be an official or recommended practice. |
Deployment | 2 | Scaling and scheduling are Airflow’s core strengths. But there is no support for exposing workflows as an API endpoint. |
Programming languages | 2 | Support for a wide variety of SQL backends. |
Maintainability | 2 | Since each task is a Python object, you can organize large projects in multiple modules without any limitations. Some tasks are community-contributed and vary in quality. |
Jupyter notebooks support | 1 | There is a task to execute notebooks programmatically; however, its use is not recommended, since the code is executed in a global environment. There is no support for interactive development. |
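For reference, this is roughly what a basic DAG definition looks like (a minimal sketch using the PythonOperator; import paths vary slightly across Airflow versions, and the recommendation above is to use the Kubernetes operator instead):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data...")

def train():
    print("training model...")

with DAG(
    dag_id="ml_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)
    extract_task >> train_task  # declare the dependency between tasks
```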
Resources
Dagster
Section | Score | Comments |
---|---|---|
Ease of use | 1 | Workflows are written in Python. The API has a lot of features, but it is hard to pick up and read; for example, a lot of functionality is hidden in a context parameter (see the sketch below). Even for seemingly simple tasks such as executing a SQL query, you have to become familiar with several concepts and write a lot of code. |
Development experience | 1 | Offers great flexibility to execute workflows locally and deploy to a distributed system (e.g. Airflow, Kubernetes). No support for incremental builds. |
Debugging | NA | No tools for debugging. |
Testing | 3 | Great support for testing, using hooks you can execute integration tests. Workflows can be imported and tested using a testing framework. |
Deployment | 2 | Dagster comes with a full-featured executor and scheduler. However, it means that you have to maintain a Dagster installation which can only execute Dagster workflows. There is no documentation on exposing a workflow as an API, but this seems possible, since workflows can be imported and used as any other Python object. |
Programming languages | 1 | Only supports PostgreSQL and Snowflake. No support for R/Julia. |
Maintainability | 1 | The configuration mechanism is extremely verbose; there are several optional packages to integrate with other systems, but they all have different APIs. |
Jupyter notebooks support | 1 | Support to execute notebooks programmatically. No support for interactive development. |
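As a rough illustration of the points above (including the ubiquitous context parameter), here is a minimal sketch using the solid/pipeline API that Dagster exposed at the time of this review; newer releases have since moved to a different set of decorators:

```python
from dagster import solid, pipeline, execute_pipeline

@solid
def load(context):
    context.log.info("loading data")   # much of the functionality lives in `context`
    return [1, 2, 3]

@solid
def train(context, data):
    context.log.info(f"training on {len(data)} rows")

@pipeline
def ml_pipeline():
    train(load())

if __name__ == "__main__":
    execute_pipeline(ml_pipeline)      # runs locally, like any other Python script
```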
Resources
DVC (Data pipelines)
Section | Score | Comments |
---|---|---|
Ease of use | 3 | Workflows are specified using a YAML file, where each task is defined by the command to execute, its file dependencies and its output files (see the example below). |
Development experience | 3 | Can (exclusively) run workflows locally and provides incremental builds. |
Debugging | NA | No tools for debugging. |
Testing | NA | No support for integration testing. No support for pipeline testing. |
Deployment | NA | No support for exporting to large-scale systems. No support for exposing a workflow as an API endpoint. |
Programming languages | NA | The framework is language-agnostic; tasks are specified using shell commands. However, this implies that you have to provide a command-line interface for each script in order to pass arguments. No direct support for SQL. |
Maintainability | 2 | Workflows are specified with YAML, which is good for small projects but not great for large ones. Once the project grows, the YAML file becomes redundant because you have to specify the same values multiple times (e.g. a script train.py appears both in the cmd section and the deps section). With a few dozen tasks, this becomes verbose and error-prone. |
Jupyter notebooks support | 1 | Tasks are specified with a single command, so you are free to use notebooks as pipeline tasks, edit them interactively and then run them programmatically. However, you have to manually specify the command and parameters used to execute each notebook programmatically, since DVC is unaware of its content. |
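To make the comments above concrete, here is a small, hypothetical dvc.yaml; note how train.py and data/clean.csv each appear in more than one place, which is the redundancy mentioned in the maintainability row:

```yaml
stages:
  clean:
    cmd: python clean.py data/raw.csv data/clean.csv
    deps:
      - clean.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```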
Resources
Elyra
Section | Score | Comments |
---|---|---|
Ease of use | 3 | The visual pipeline editor makes it extremely simple to convert a set of notebooks into a pipeline. |
Development experience | 1 | Pipelines can be executed locally. No support for incremental builds. |
Debugging | NA | No tools for debugging. |
Testing | NA | No support for integration testing. No support for pipeline testing. |
Deployment | 1 | Run workflows in Kubernetes via Kubeflow pipelines. No support for scheduling workflows. Due to its exclusive notebook-based nature, there is no easy way to convert a workflow into an API endpoint. |
Programming languages | NA | Python-only. |
Maintainability | 2 | The visual editor is great for facilitating workflow authoring; however, some people might prefer to have more control over the pipeline definition. The definition is stored in JSON format, but it is unclear whether manually editing that file is recommended. It is also limiting that tasks have to be notebooks. |
Jupyter notebooks support | 3 | Elyra is a notebook-centric tool where each task is a notebook, hence, you can develop tasks interactively. When you execute your pipeline, the notebooks are executed programmatically. |
Resources
Flyte
Section | Score | Comments |
---|---|---|
Ease of use | 2 | The API is clean. Tasks are defined using Python functions with a few decorators. |
Development experience | NA | Workflows cannot be executed locally; they can only be executed in Kubernetes. No support for incremental builds. |
Debugging | NA | No tools for debugging. |
Testing | NA | No support for integration testing. No support for pipeline testing. |
Deployment | 2 | Runs on Kubernetes, supports scheduling. Unclear if it is possible to expose a workflow as an API endpoint. |
Programming languages | 1 | There is support for some SQL-compatible systems such as Hive and Presto. Spark is also supported. No support for R/Julia. |
Maintainability | 1 | The API is clean, but the documentation is still a work in progress and there are only a few code examples. |
Jupyter notebooks support | NA | No support for interactive development nor execute notebooks programmatically. |
Resources
Kale
Section | Score | Comments |
---|---|---|
Ease of use | 3 | Deploying a pipeline in Kale only requires you to add tags to Jupyter notebook cells. |
Development experience | 1 | Workflows can execute locally. No support for incremental builds. |
Debugging | NA | No tools for debugging. |
Testing | NA | No support for integration testing. No support for pipeline testing. |
Deployment | 2 | Deployment for batch processing is seamless, once you annotate your notebook, you can submit the workflow to the Kubernetes cluster. However, there is no support for re-using feature engineering code for an API endpoint. |
Programming languages | NA | Python-only. |
Maintainability | NA | Pipelines have to be declared in a single notebook file, which may cause a lot of trouble, since cell side-effects are difficult to track. Having multiple people edit the same file also makes resolving version control conflicts painful. Finally, the code that you write is not the code that gets executed (Kale generates Kubeflow code with jinja), which may cause problems when debugging. |
Jupyter notebooks support | 2 | Kale is a notebook-first framework. You can develop your pipeline interactively and the notebook itself becomes the pipeline; however, it has to go through some pre-processing steps before it is executed. |
Resources
Kedro
Section | Score | Comments |
---|---|---|
Ease of use | 1 | Pipelines are defined using a Python API, where each task is a function (see the sketch below). Although the workflow API is clean, some extra modules have complex APIs. Furthermore, it is very opinionated and expects your project to follow a specific folder layout, which includes several kedro-specific configuration files. |
Development experience | 1 | Workflows can execute locally. No support for incremental builds. |
Debugging | 2 | Support for debugging nodes and pipelines, although the API looks complex. |
Testing | 2 | There is support for testing tasks upon execution (hooks), however, similar to the debugging API, it looks complex. |
Deployment | 2 | Supports deployment to Kubernetes (Argo and Kubeflow), Prefect and AWS Batch. It is unclear whether you can convert a batch pipeline to an online API. |
Programming languages | NA | Python-only. |
Maintainability | 1 | Expects your project to have a specific folder layout and configuration files. This is restrictive and overkill for simple projects. |
Jupyter notebooks support | 1 | You can start a Jupyter notebook and export defined functions as kedro nodes (tasks), but interactivity is limited, since the exported code has to be a single function. No support for executing notebooks programmatically. |
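For reference, this is roughly how tasks and pipelines are declared (a minimal sketch; the dataset names are assumed to be defined in Kedro's data catalog, and the surrounding project scaffolding is omitted):

```python
from kedro.pipeline import Pipeline, node

def clean(raw_df):
    return raw_df.dropna()

def train(clean_df):
    ...  # fit and return a model

ml_pipeline = Pipeline(
    [
        node(clean, inputs="raw_data", outputs="clean_data"),
        node(train, inputs="clean_data", outputs="model"),
    ]
)
```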
Resources
Kubeflow pipelines
Section | Score | Comments |
---|---|---|
Ease of use | 1 | Workflows are written using a highly complex Python API (which is the reason why Kale exists). |
Development experience | NA | There is no way to run workflows locally as it is a Kubernetes-only framework. No support for incremental builds either. |
Debugging | NA | No tools for debugging. |
Testing | NA | No support for integration testing. No support for pipeline testing. |
Deployment | 2 | Batch processing deployment is simple since Kubeflow is tightly integrated with Kubernetes. Unclear whether we can compose training and serving pipelines to re-use feature engineering code for an API endpoint. |
Programming languages | NA | Python-only. |
Maintainability | 1 | The code is hard to read and contains too many details; see this example. The same parameters (project, cluster name, region) are passed to all tasks. Documentation is outdated. |
Jupyter notebooks support | NA | No support for interactive development nor executing notebooks programmatically. |
Resources
Luigi
Section | Score | Comments |
---|---|---|
Ease of use | 3 | You need to become familiar with the API to get started; however, it is not as complex as others. It has a consistent set of concepts (tasks, targets and parameters), and tasks, defined as Python classes, have more or less the same structure (see the sketch below). |
Development experience | 1 | Can run workflows locally. No support for incremental builds (once a task is executed, running it again has no effect, even if input files change). |
Debugging | NA | No debugging tools. |
Testing | 1 | Although not specifically designed for that purpose, callbacks can be used for integration testing. No support for inspecting the pipeline for testing its properties/definition. |
Deployment | 2 | Deploying workflows to the central monitoring tool is very simple. Limited scalability and no built-in scheduler. Batch processing only; no conversion to API endpoints. |
Programming languages | 2 | Support for some SQL backends. |
Maintainability | 1 | Requiring workflows to be defined as a single class is problematic for collaboration and code organization. Furthermore, it might lead to sneaky bugs since tasks are not stateless (due to the existence of instance variables). |
Jupyter notebooks support | NA | No support for interactive development. No support for executing notebooks programmatically. |
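The sketch below shows the typical structure of Luigi tasks (file names and logic are made up): each task is a class that declares its dependencies (requires), its output (output) and the work to do (run).

```python
import luigi

class Clean(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/clean-{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("col_a,col_b\n")  # placeholder for real cleaning logic

class Train(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return Clean(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"models/model-{self.date}.txt")

    def run(self):
        with self.input().open() as f, self.output().open("w") as out:
            out.write(f"trained on {len(f.readlines())} rows")
```

Something like `python -m luigi --module tasks Train --date 2021-01-01 --local-scheduler` would then run the whole chain.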
Resources
Metaflow
Section | Score | Comments |
---|---|---|
Ease of use | 3 | Workflows are defined using a Python class, and decorators can be used for several things, such as retrying tasks or installing dependencies before a task executes (see the sketch below). |
Development experience | 2 | Workflows can be executed locally and you can resume execution from failed tasks. No support for incremental builds. |
Debugging | 1 | If a workflow fails, you can inspect the data to determine what went wrong. Nonetheless, you can only debug workflows after they fail; there is no support for starting an interactive post-mortem debugging session, and you have to resort to print statements for debugging, which is far from ideal. |
Testing | 2 | Workflows can be imported to inspect their definition and properties. Although not explicitly mentioned, there do not seem to be any restrictions, so you could use these testing tools with a framework like pytest. |
Deployment | 1 | Metaflow comes with a built-in AWS tool to execute workflows, and it is also possible to schedule workflows using AWS Step Functions. However, Netflix uses an internal (closed-source) DAG scheduler. There are no options for deploying to other clouds. It appears that workflows can be exposed as APIs, but it is unclear whether this is part of the open-source package. |
Programming languages | 1 | There is support for R workflows, although it is a separate tool that uses the Python library as a backend; you cannot mix R and Python in the same workflow. No support for SQL. |
Maintainability | 1 | Requiring workflows to be defined as a single class is problematic for collaboration and code organization. Furthermore, it might lead to sneaky bugs since tasks are not stateless (due to the existence of instance variables). |
Jupyter notebooks support | NA | No support for interactive development. No support for executing notebooks programmatically. |
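The sketch below illustrates the class-based structure (the flow contents are made up): steps are methods decorated with @step, state is carried in instance variables, and decorators such as @retry add behavior.

```python
from metaflow import FlowSpec, step, retry

class TrainingFlow(FlowSpec):

    @step
    def start(self):
        self.data = [1, 2, 3]        # instance variables carry state between steps
        self.next(self.train)

    @retry(times=2)                   # re-run this step if it fails
    @step
    def train(self):
        self.model = sum(self.data)   # stand-in for real training code
        self.next(self.end)

    @step
    def end(self):
        print("model:", self.model)

if __name__ == "__main__":
    TrainingFlow()                    # Metaflow's CLI entry point (python flow.py run)
```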
Resources
Prefect
Section | Score | Comments |
---|---|---|
Ease of use | 2 | Function-based workflows are written with a clean Python API (see the sketch below). The task library contains a wide variety of tasks; however, only a handful of them are relevant to Data Science/Machine Learning projects. |
Development experience | 2 | Workflows can be executed locally. No support for incremental builds. |
Debugging | 3 | You can inspect the output and status of each task, and there are tools that make workflow debugging simple. |
Testing | NA | No support for integration testing. No support for pipeline testing. |
Deployment | 1 | Workflows can be deployed (and scheduled) using the web interface; however, this interface can only execute Prefect workflows. No support for running workflows in other systems. Prefect Server has a non-standard open-source license. |
Programming languages | 1 | Although there is support for a few SQL databases (e.g. PostgreSQL, Snowflake, SQLite), each module has a different API. No support for R or Julia. |
Maintainability | 3 | Great API with minimal boilerplate code. |
Jupyter notebooks support | 1 | There is support for executing notebooks programmatically. No support for developing tasks interactively. |
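For reference, here is a minimal sketch using the Flow API that Prefect exposed at the time of this review (1.x style; later releases use a different API):

```python
from prefect import task, Flow

@task
def clean():
    return [1, 2, 3]

@task
def train(data):
    print(f"training on {len(data)} rows")

with Flow("ml-pipeline") as flow:
    data = clean()
    train(data)

if __name__ == "__main__":
    flow.run()  # local execution; the server/UI is only needed for scheduling
```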
Resources
Tensorflow Extended (TFX)
Section | Score | Comments |
---|---|---|
Ease of use | 1 | The API is very complex; it requires you to become familiar with several modules before you can get started. |
Development experience | 2 | Pipelines can be executed locally. No support for incremental builds. |
Debugging | NA | No tools for debugging. |
Testing | 1 | Technically possible, since you can run workflows locally; however, local workflows have to explicitly enable interactive mode, and it is unclear whether this causes trouble when running under a testing framework. No support for pipeline testing. |
Deployment | 3 | There are three deployment options: Airflow, Kubeflow Pipelines and Apache Beam; however, examples are only provided for Google Cloud. Workflows can be exposed as an API using Tensorflow Serving. |
Programming languages | NA | Python-only. |
Maintainability | NA | TFX is a Tensorflow-exclusive framework, which implies that you cannot bring other libraries like numpy or pandas. There is also a good amount of boilerplate code that makes pipelines difficult to follow for people unfamiliar with the Tensorflow ecosystem. |
Jupyter notebooks support | 1 | You can develop pipelines interactively and run them in a single notebook. No support for executing notebooks programmatically. |
Resources
Evaluation criteria
1. Ease of use
Having a clean API that allows users to get started quickly is essential for any software tool, but even more so for data analysis tools, where programming proficiency varies a lot among users. Often, data analysis projects are experimental/exploratory and go through an initial prototyping phase. Practitioners tend to stay away from “production tools” because they often have a steep learning curve that slows progress down; as a consequence, many data scientists write entire data pipelines in a single notebook. This is a terrible practice because it creates unmaintainable software. The only way to avoid this bad practice is to provide production tools that add minimal overhead, making them appealing enough for practitioners to use from day one.
2. Development experience
Data Science/Machine Learning is highly iterative, especially in the prototyping phase. We start with a rough idea of the analysis we have to do and refine it as we learn more about the data. A data scientist spends a lot of time changing a small part of the code and re-running the analysis to see how the change affects results (e.g. adding a new feature). Incremental builds are key to facilitating this iterative process: when a data scientist modifies a few lines of code in a single task, there is no need to re-execute the pipeline end-to-end, only the tasks that were modified or affected by the change.
Deployed data pipelines are usually managed by production systems such as Airflow scheduler or Kubernetes, however, being able to develop locally without requiring any extra infrastructure is critical to foster rapid development.
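To make the idea of incremental builds concrete, here is a generic, tool-agnostic sketch of the underlying mechanism: hash a task's source code and inputs, and skip the task if nothing changed since the last run. The helper names and file paths are made up for illustration.

```python
import hashlib
import json
import pathlib

def fingerprint(paths):
    """Hash the contents of a set of files (task source code + inputs)."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(pathlib.Path(path).read_bytes())
    return digest.hexdigest()

def should_run(task_name, inputs, cache_file=".build-cache.json"):
    """Return True only if the task's inputs changed since the last run."""
    cache_path = pathlib.Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    current = fingerprint(inputs)
    if cache.get(task_name) == current:
        return False                      # nothing changed: skip the task
    cache[task_name] = current
    cache_path.write_text(json.dumps(cache))
    return True

if should_run("train", ["train.py", "data/clean.csv"]):
    print("inputs changed, re-running the train task")
```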
3. Debugging
Data pipelines are notoriously hard to debug because errors may come from incorrect code or unexpected data properties. Being able to start an interactive session to debug our code is more efficient than looking at a bunch of print (or logging) statements. Workflow managers often execute our code in specific ways (e.g. multiprocessing, remote workers, etc), which might cause the standard Python debugging tools not to work. We evaluate if it is possible to use standard tools to debug workflows.
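As a point of reference, this is the kind of workflow the evaluation looks for: dropping into a standard post-mortem session with pdb when a task crashes (a minimal, self-contained example):

```python
import pdb

def transform(record):
    # may raise on unexpected data (missing keys, zero counts, etc.)
    return record["amount"] / record["count"]

try:
    transform({"amount": 10})    # malformed record: the "count" key is missing
except Exception:
    pdb.post_mortem()            # inspect local variables right at the crash site
```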
4. Testing
Unexpected data can break pipelines in many ways. In the best case, a downstream task crashes because of a schema incompatibility; in the worst case, the pipeline runs “successfully” but produces incorrect results (e.g. a model with bad performance). To prevent such sneaky bugs, it is becoming increasingly common to test artifacts after task execution; this is known as integration testing or data testing. An example is checking that there are no NULL values in a dataset after we apply a transformation to it. We evaluate support for integration testing.
A second (often overlooked) feature is the ability to test the workflow itself. Workflows are complex because they encompass multi-stage procedures with dependencies among them. Being able to run and inspect workflows in a test environment helps detect bugs during development rather than in production. The ability to test workflows using tools such as pytest is considered.
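A minimal sketch of both kinds of tests: a data (integration) test on the artifact produced by a transformation, and a pipeline test that inspects the workflow definition with pytest. `build_pipeline` and `task_names` are hypothetical placeholders, since the exact objects depend on the tool.

```python
import pandas as pd

def test_no_nulls_after_transform():
    # integration/data test: check the artifact a task produces
    raw = pd.DataFrame({"age": [20, None, 35]})
    clean = raw.dropna()                      # the transformation under test
    assert clean.isna().sum().sum() == 0

def test_pipeline_structure():
    # pipeline test: import the workflow and inspect its definition
    pipeline = build_pipeline()               # hypothetical factory returning the DAG
    assert "train" in pipeline.task_names     # hypothetical attribute
```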
5. Deployment
The two most common deployment paradigms for data workflows are batch and online. An example of a batch deployment is a workflow that processes new observations (often on a schedule), makes predictions and uploads them to a database. The online scenario involves exposing the pipeline as an API (e.g. REST/RPC) that takes input data and returns a prediction. A common error during deployment happens when the pre-processing code at serving time differs from the one used at training time (training-serving skew). We assess the ability to re-use existing training code for serving to eliminate this problem.
We also evaluate deployment options and favor integration with other open-source deployment tools (e.g. Airflow, Kubernetes).
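A minimal sketch of what eliminating training-serving skew looks like in code: a single preprocessing function reused both when training in batch and when serving predictions (the feature logic here is made up):

```python
from sklearn.linear_model import LogisticRegression

def preprocess(record):
    # single source of truth for feature engineering
    return [record["age"] / 100, record["income"] / 1_000]

def train(rows, labels):
    features = [preprocess(r) for r in rows]
    return LogisticRegression().fit(features, labels)

def predict(model, record):
    # the same `preprocess` runs at serving time (e.g. behind a REST endpoint)
    return int(model.predict([preprocess(record)])[0])

model = train(
    [{"age": 30, "income": 5000}, {"age": 60, "income": 1000}],
    [1, 0],
)
print(predict(model, {"age": 45, "income": 3000}))
```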
6. Programming languages
Although training ML models with unstructured data is becoming increasingly common thanks to Deep Learning, tabular data and classical ML algorithms are still the most common type of application (see page 19 of this report). Tabular data is usually stored in SQL databases, which implies that our pipeline is often a combination of SQL and Python. We evaluate the ability to integrate SQL scripts as part of the workflow. Finally, we also evaluate support for other popular languages for data analysis, such as R and Julia.
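A small self-contained example of the SQL-plus-Python combination described above (SQLite is used here only so the snippet runs anywhere; in practice this would be a data warehouse):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 20.0), ("south", 5.0)],
)

# SQL task: aggregate inside the database
features = pd.read_sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn
)

# Python task: continue the analysis on the resulting table
print(features.sort_values("total", ascending=False))
```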
7. Maintainability
This section evaluates project maintainability. The first aspect to consider is the amount of code needed to declare the pipeline (less code is better). The second aspect is code organization, we determine if the library imposes restrictions that limit our ability to organize our code in separate functions/modules. Tool-specific characteristics that may affect code maintainability are also evaluated.
8. Jupyter notebooks support
The use of notebooks in production pipelines always triggers a heated debate, but I believe the problems come from poor development practices rather than from notebooks themselves. Notebooks are a fantastic environment for exploratory work, which is exactly what we need when we are learning about the data. Being able to develop workflows interactively has a great positive impact on productivity.
A second important usage of notebooks is as an output format. The .ipynb format supports embedding tables and charts in a single file without any extra code. This is a huge time saver, because debugging workflows is much easier when we can take a look at our data with charts; finding errors with text-based logs alone severely limits this process. Having a .ipynb file as the result of executing a workflow task is akin to having rich logs that facilitate debugging.
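For example, notebooks can be executed programmatically with papermill (file names and parameters below are made up):

```python
import papermill as pm

pm.execute_notebook(
    "analysis.ipynb",                 # notebook developed interactively
    "output/analysis-2021-01.ipynb",  # executed copy: charts/tables act as rich logs
    parameters={"date": "2021-01"},   # injected into a cell tagged "parameters"
)
```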