Posts
Distribute your Python pipeline the right way (by packaging it)
Pipelines are a collection of source code files that, when executed in the right order, produce a final result (e.g. a chart, a report, an ML model). Whatever your goal is, you want your code to be easy to distribute and execute on computers other than yours, whether for reproducibility, for scaling up computations, or for taking a model to production.
Given how easy it is to package your Python code and the advantages that come with it, there is no reason not to do it.
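As a minimal sketch of what packaging involves (the package name, dependencies, and entry point below are hypothetical placeholders, not from the post), a pipeline can be declared with a `pyproject.toml`:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "my-pipeline"        # hypothetical package name
version = "0.1.0"
dependencies = [
    "pandas",               # list whatever your pipeline actually imports
]

[project.scripts]
# exposes a console command that runs the pipeline's entry function
run-pipeline = "my_pipeline.cli:main"
```

With a file like this at the project root, `pip install .` installs the pipeline (with its dependencies) into any environment, and collaborators get a `run-pipeline` command instead of having to know which script to execute.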
Introducing ploomber
Today I am announcing the release of ploomber, a library to accelerate Data Science and Machine Learning pipeline experimentation. This post describes the motivation, core features, short and medium-term objectives.
Motivation

When I started working on Data Science projects back in 2015, I realized that there were no standard practices for developing pipelines, which caused teams to build fragile, hard-to-reproduce software. During one of my first projects, our pipeline consisted of a bunch of shell, SQL, and Python scripts loosely glued together. Each team member would edit a "master" shell script so we could "reproduce" our end result, but since the pipeline took several hours to run, no one ever tested that the script actually worked. There was no guarantee that, given a clean environment, the pipeline would execute without errors, let alone produce the same final output.