In this article, I will compare the main differences between Ploomber and Apache Airflow and cover some background information and motivations for both products.
Introduction
Airflow is a tool for developers to author, schedule, execute, and monitor their workflows. Initially developed by Airbnb, it is now a top-level Apache project. It is a popular, fully featured data engineering tool, but it is overkill for many projects (e.g. code to reproduce a paper’s results). Speaking for myself, I was frustrated by its setup process and steep learning curve when I first explored it. From my perspective, Airflow and many similar solutions (e.g. Luigi, Pinball) overlook the iterative nature of Data Science/Machine Learning pipelines, which involves running new experiments to improve a final result (e.g. a predictive model). Ploomber is motivated not only by data engineering needs but also by the data science and development sides, which makes it an excellent choice for bridging the gap between existing solutions and Data Scientists' actual needs.
Ploomber is the fastest and most complete solution to build data pipelines. It builds on top of papermill and extends it to allow writing multi-stage workflows where each task is a notebook. With Ploomber, users can choose their favorite editor (e.g. Jupyter, VSCode, PyCharm) to develop tasks interactively and then deploy them without code changes to various platforms, including Airflow, Kubernetes, and AWS Batch. Based on this survey and my own experience, I will compare Ploomber and Airflow in terms of ease of use, debugging and testing, development experience, maintainability, and deployment.
Ease of Use
Ploomber uses a convention-over-configuration approach, so getting started is straightforward: users only need to include two special variables in their scripts/notebooks, and Ploomber will orchestrate execution. For more flexibility, users can also specify their pipeline in YAML or use the Python API for advanced use cases.
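To make the convention concrete, here is a minimal sketch of how an orchestrator could derive execution order from per-task declarations. It assumes the two special variables are named upstream and product. The task names, file paths, and the execution_order helper are hypothetical. It uses only Python's standard library (graphlib, Python 3.9+) and is an illustration of the idea, not Ploomber's actual implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical task declarations: each task names the tasks it depends on
# ("upstream") and the file it produces ("product").
tasks = {
    "load": {"upstream": [], "product": "output/raw.csv"},
    "clean": {"upstream": ["load"], "product": "output/clean.csv"},
    "plot": {"upstream": ["clean"], "product": "output/report.html"},
}


def execution_order(tasks):
    """Return task names ordered so each task runs after its upstream tasks."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(spec["upstream"]) for name, spec in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


print(execution_order(tasks))  # ['load', 'clean', 'plot']
```

Because each task only states what it needs and what it produces, adding a new step means declaring one more entry rather than rewiring the whole pipeline.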
In contrast, Airflow is notoriously hard to pick up. Although a wide range of task types is available, in practice the recommended way to write workflows is to use the Kubernetes operator exclusively to ensure environment isolation.
Debugging and Testing
Ploomber is the best for debugging and testing among all similar products I have ever used. Ploomber integrates with pdb and ipdb, so users can start line-by-line debugging sessions on any task, or let the pipeline run and start a post-mortem session if a task crashes. For testing, Ploomber can run integration tests upon task execution via the on_finish hook. Because pipelines are regular Python objects, users can also import them and test them with the testing framework of their choice.
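The idea of running an integration test from a finish hook can be sketched in plain Python. The Task class, the toy clean_ages task, and the check_no_missing_ages check below are all hypothetical stand-ins written for illustration. They mimic the shape of an on_finish hook but are not Ploomber's API.

```python
class Task:
    """Hypothetical stand-in for a pipeline task (not Ploomber's class)."""

    def __init__(self, name, func, on_finish=None):
        self.name = name
        self.func = func
        self.on_finish = on_finish

    def run(self):
        product = self.func()
        # Run the check right after the task finishes, so bad data stops
        # the pipeline before any downstream task consumes it.
        if self.on_finish is not None:
            self.on_finish(product)
        return product


def clean_ages():
    """Toy task: drop records with a missing age."""
    raw = [{"age": 31}, {"age": None}, {"age": 47}]
    return [row for row in raw if row["age"] is not None]


def check_no_missing_ages(product):
    """Integration test: fail loudly if the output breaks expectations."""
    assert product, "task produced no rows"
    assert all(row["age"] is not None for row in product), "missing ages"


task = Task("clean", clean_ages, on_finish=check_no_missing_ages)
result = task.run()
print(len(result))  # 2
```

Since the task and its check are ordinary Python objects, the same check function could also be called from a unit test in any testing framework.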
Airflow, however, offers no option for debugging. Although Airflow workflows are also Python objects that users can import and inspect, testing workflows this way does not seem to be an official or recommended practice.
Development Experience
With Ploomber, workflows can be executed locally, either in a single process or in parallel. Airflow workflows can also run locally using the local executor, but this requires a complete Airflow installation, and one-off workflow executions are not straightforward since Airflow makes strong assumptions about how and when workflows should execute. Furthermore, Ploomber supports incremental builds (re-computing only the tasks that have changed since the last execution and reusing cached results for the rest), while Airflow does not.
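The incremental-build idea described above can be sketched with a small hash-based cache. This is only an illustration under simplifying assumptions: the build function, the metadata dictionary, and the task tuples are hypothetical, task source code is represented as a plain string, and Ploomber's real change detection may work differently.

```python
import hashlib


def source_hash(source):
    """Fingerprint a task's source code."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def build(tasks, metadata):
    """Run tasks whose source changed or whose upstream ran; skip the rest.

    tasks: list of (name, source, upstream_names), already in execution order.
    metadata: dict of name -> hash from the previous build (updated in place).
    Returns the names of the tasks that were executed.
    """
    executed = []
    for name, source, upstream in tasks:
        changed = metadata.get(name) != source_hash(source)
        upstream_ran = any(dep in executed for dep in upstream)
        if changed or upstream_ran:
            # A real runner would execute the task here.
            executed.append(name)
            metadata[name] = source_hash(source)
    return executed


metadata = {}
pipeline = [
    ("load", "read csv", []),
    ("clean", "drop nulls", ["load"]),
    ("plot", "make chart", ["clean"]),
]

first = build(pipeline, metadata)   # nothing cached yet: every task runs
second = build(pipeline, metadata)  # nothing changed: every task is skipped

# Editing one task re-runs it and everything downstream of it.
pipeline[1] = ("clean", "drop nulls and duplicates", ["load"])
third = build(pipeline, metadata)
print(first, second, third)
```

In this sketch, the third build skips load entirely, which is where the time savings come from when iterating on a late stage of a long pipeline.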
Regarding programming languages and development tools, Ploomber supports any database with a Python connector and provides limited support for languages with a Jupyter kernel, such as Julia. Scripts and notebooks can be developed interactively using Jupyter and then executed programmatically. Although Airflow also supports various SQL backends, its support for interactive development in Jupyter Notebook is not as good as Ploomber's.
Maintainability
With Ploomber, a task can be a script (Python/R/SQL), a notebook, or even a Python function, whichever best suits the project. In addition, the task library (which exposes a unified API) provides functionality for common tasks (e.g. running a SQL script, dumping a database table) to reduce boilerplate code. With Airflow, since each task is a Python object, users can organize large projects across multiple modules without limitations. However, some operators are community-contributed and vary in quality.
Deployment
Because scaling and scheduling are Airflow’s core strengths, Ploomber supports exporting workflows to Airflow (and to other platforms such as Argo and Kubernetes). These export features currently offer only basic functionality, but they are in active development. See this documentation for how to export a Ploomber pipeline to Airflow.
Summary
Ploomber and Airflow each have their strengths and can work together to make the whole workflow more efficient. Ploomber makes the first steps of building a data pipeline smoother: we can develop interactively, keep close track of source code changes, and get a better debugging and testing experience. Airflow, on the other hand, is a good choice for scheduling pipelines.
Thanks for reading! Ploomber is the fastest and most convenient way to build data pipelines, backed by a fast-growing and creative community. Give it a try! Got questions? Feel free to reach out to the Ploomber team on Slack or send an email.
References
Ploomber Documentation: https://docs.ploomber.io/en/stable/
Airflow Documentation: https://github.com/apache/Airflow
Ploomber Source Code: https://github.com/ploomber/ploomber
Airflow Source Code: https://github.com/apache/airflow
Introducing Ploomber Blog: https://ploomber.io/blog/ploomber/
Open-source Workflow Tools Survey: https://ploomber.io/blog/survey/