Today I am announcing the release of ploomber, a library to accelerate Data Science and Machine Learning pipeline experimentation. This post describes the motivation, core features, short and medium-term objectives.
When I started working on Data Science projects back in 2015, I realized that there were no standard practices for developing pipelines; which caused teams to develop fragile, hardly reproducible software. During one of my first projects, our pipeline consisted of a bunch of shell, SQL and Python scripts loosely glued together; each member would edit a “master” shell script so we could “reproduce” our end result, but since our pipeline would take several hours to run, no one would test that such script actually worked, hence, there was no guarantee that given a clean environment, the pipeline would execute without errors, let alone give the same final output.
Back then, a few colleagues started using drake (currently unmaintained) whose purpose was to manage complex data workflows. The concept caught my attention but it had two drawbacks 1) it is written in Clojure - which prevented me from digging into the codebase and 2) it resolved dependencies using file timestamps, thus assuming that every task produced files in the local filesystem, which does not hold for many pipelines interacting with remote systems.
Later next year, I found out a few other projects: Luigi, Pinball and Airflow . I think they are great, fully featured Data Engineering tools but they are not not what I was looking for. I tried Luigi and Airflow for two small projects but became frustrated with their setup and a steep learning curve; however, the primary reason I do not use them anymore is the lack of consideration to the iterative nature of Data Science/Machine Learning pipelines, which involves trying out new experiments (very often interactively) to improve a final result (e.g. a predictive model), their motivations were more aligned with the Data Engineering space (which is reflected in their documentation examples). I wanted something that kept track of my source code changes and update results accordingly, which I believe is what drake from rOpenSci (not related to the first one I mentioned) does.
Then I started graduate school and participated in another data project: YASS, a library for computational Neuroscience. One of my contributions was to make sure that the pipeline could be used by other teams, so I refactored the code in modules and created a “pipeline function” that would import all the tasks and execute them in the right order. I setup a CI service and coded a few smoke tests that executed the pipeline with some sample inputs. The pipeline had a lot of parameters that had to be customized by the user; requiring them to write their own “pipeline function” seemed to much, so we ended up using a common approach: a configuration file. This worked great for a while, until we started to experiment with the pipeline to enhance it. Any modification faced a challenge: adding new parameters meant modifying logic to validate the configuration file, which severely limited our ability to try out new experiments. While the configuration file still made sense for some end users, it did not make it for us, the developers. Since we wanted to add and remove building blocks, each one could potentially have their own “pipeline function”, but given the absence of a standard API for building blocks (where to save the data, what kind of input to expect), small changes would break the pipeline.
Based in my experience, I had a clear idea of which tool I needed. Since no existing tool fulfilled all my requirements, I started building a small library to facilitate experimentation for my current projects. As my projects grew in complexity, I found myself spending more and more time to fix bugs and develop new features until it reached a stable status. For a the second half of 2019, the codebase remained private and very few changes were introduced. Keeping that private benefits no one other than me, so here it is.
My main learning when working on YASS was that each team member should be able to experiment with the pipeline structure by re-organizing building blocks. I think Airflow’s syntax achieves that, each task is an object and dependencies are declared using an explicit operation. ploomber borrows Airflow’s expressive syntax with a few twists so pipeline declarations read like blueprints.
An important aspect I have seen overlooked in existing libraries is that even though tasks produce persistent changes (e.g. a new file, a table in a database), they are not part of the pipeline declaration but hidden in the task implementation. In my experience, this is a huge source of errors: paths to files (or table names) are hardcoded everywhere in the codebase, which leads to files getting deleted or overwritten accidentally. ploomber requires each task to declare its output, hence the pipeline declaration provides a complete picture: it not only includes which tasks to perform and in which order, but where the output will be stored and in which form.
This allows the same pipeline declaration to be easily customized by team members (e.g. each member stores outputs in
/data/YOUR_USERNAME/output) effectively isolating pipeline runs among each other.
One of the most common features I have seen in existing tools is support for containerization and orchestration, while these features are required for complex applications; even moderately complex pipelines (especially during the development phase) can achieve great progress by just having a virtual environment in a single machine. This is even true in the “big data” regime, where many applications can be solved by deferring computations to another system (such as an analytical database). In such cases, the pipeline just sends messages to other systems, with little to no computation happening in the client. Requiring every user to worry about containerization or orchestrations leaves out a lot of potential users that could benefit from a workflow management tool (e.g. scientists developing models in their laptops). In ploomber, once a pipeline is instantiated, it is ready to run.
There are of course cases when orchestration is needed, but I believe this is a separate problem to be solved by a different tool, rather than bundled it together in one solution.
Pipelines usually take hours or even days to run, during the development phase, it is wasteful to re-execute the pipeline on each code change. Building complex software systems is a similar process, which is why tools such as GNU Make exist. Taking a similar approach, ploomber keeps track of code changes and only executes a task if the source code has changed since its last execution. Since for many data pipelines outputs might not even exist in the same machine (e.g. the pipeline creates a table in a remote database), ploomber is able to check status in external systems.
Testable and interactive
ploomber produces standalone pipelines that are able to execute themselves (they are just Python objects) this makes testing much easier. For my current projects, I have the usual
tests/ folder where I import a function that instantiates a pipeline object, then I run the pipeline in a testing environment (to isolate it from the development environment) with a data sample, which makes them run end-to-end in a reasonable time.
ploomber also supports hooks to execute code upon task completion. Which allows me to write acceptance tests that explicitly state input assumptions (e.g. check a data frame’s input schema). Not stating data assumptions explicitly is a common source of errors when they do not hold anymore causing errors propagate to downtream tasks (e.g. a column that suddenly has NAs breaks a sum computation).
The combination of both types of tests serves different purposes:
tests/ allows me to check whether the pipeline still works after a few code changes and acceptance tests prevent unexpected conditions to propagate downstream in the pipeline.
Pipelines support more than just a
build command. You can (among other things) interact with any individual task by doing
dag['task_name'], which is useful for debugging and by doing
dag['task_name'].debug() you can start a debugging session (using the
Data projects usually involve several stakeholders spanning all kind of backgrounds. Explaining the pipeline’s embedded logic, assumptions and resources needed is key to build trust on it. While the pipeline declaration provides a blueprint for developers, it is not well-suited non-technical partners (or even technical ones from other disciplines such as software engineers or business analysts). ploomber is able to generate HTML summaries with code, output location and a diagram to communicate pipelines to a wider audience.
Present and future
My short-term goal is to consolidate ploomber as a robust and accessible tool by increasing testing coverage and improving documentation. The API has matured enough and it is not expected to change (besides trivial changes).
The medium-term goal is to improve current features to ease pipeline debugging and communication. There are some important features I will leave out of this project (containerization, orchestration, scheduling and serving), but I am currently evaluating tools to integrate with.
ploomber is available on Github and I am already using it in production projects. I hope it helps you in your next DS/ML pipeline!