How Evidation uses Ploomber to develop production-ready data science pipelines directly from Jupyter
At Evidation, we’ve developed a workflow to produce maintainable, robust, and production-ready machine learning pipelines directly from Jupyter. Our new workflow has made pipeline development and maintenance more manageable and more accessible. It has cut the development time of pipelines by 40% and reduced debugging and maintenance time by 60%.
Data scientists love using Jupyter notebooks for fast iteration; however, using them in production causes many issues since ipynb files are JSON files that contain both code and outputs. Tying up code and output makes ipynb files hard to control and review. Furthermore, using Jupyter also encourages the creation of monolithic notebooks (large notebooks that contain all the analysis logic in a single file). Monolithic notebooks are difficult to update and iterate upon because developing requires re-running the whole notebook file, even if it is just a tiny part of the pipeline that needs to be changed.
To prevent these issues, we used to port the notebooks to Python modules; however, this is time-consuming, menial work that nobody wants to do. Furthermore, it widens the distance between development and production, increasing the likelihood that critical parts of the code get lost during the porting process and cause a mismatch in model behavior between development and production.
None of these alternatives is ideal, so we did a comparative analysis of open-source pipeline orchestration tools that would allow us to keep a fast iteration speed but balance it with high code standards.
After researching many tools, we picked Ploomber. Not only is Ploomber easy to get started with, but it’s accessible to data scientists across our organization. In addition, it integrates with Jupyter, which is critical since we do not want to disturb data scientists' workflows and introduce more friction in the development process. In a sentence, Ploomber allows us to use Jupyter in a maintainable, responsible way.
Our enhanced development experience
Ploomber allows our data scientists to keep their Jupyter workflow and provides the tools to help them produce more maintainable work that doesn’t require refactoring when going to production.
We are able to produce highly maintainable production workflows because of several features Ploomber supports. The first one is modularization. Ploomber lets us organize our pipelines into small tasks instead of a single monolithic notebook. The modularization of notebooks has dramatically changed the way we work.
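To make the idea of modularization concrete, here is a minimal sketch of a pipeline expressed as small tasks with declared dependencies. It is not Ploomber's API; the task names and the `execution_order` helper are hypothetical, and the ordering relies only on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks it depends on,
# so every task can live in its own small file and be owned by one person.
PIPELINE = {
    "load": set(),
    "clean": {"load"},
    "features": {"clean"},
    "predict": {"features"},
    "report": {"predict", "clean"},
}


def execution_order(pipeline):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
print(order)
```

Because the dependencies are explicit, two people can work on `features` and `report` independently as long as the inputs and outputs of each task stay stable.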
In the past, collaboration was challenging because each of our data scientists often worked on their notebooks in isolation, causing duplicated work. With modularized pipelines, we can now have multiple data scientists take on separate tasks and work in parallel without interfering with each other’s work. In addition, the separation of tasks increases our team speed since improvements are merged to the main branch in smaller chunks and more rapidly.
Additionally, we are able to accelerate individual iteration since Ploomber supports caching results of previous runs. Hence, further executions are faster because they only execute tasks that have changed, giving each data scientist the confidence to run more experiments without worrying that each run will trigger a long-running job.
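The sketch below illustrates the general idea behind this kind of incremental execution, under the assumption that a task is re-run when its source code changed or when something upstream of it is being re-run. The function names and the in-memory `cache` dictionary are hypothetical; Ploomber's own bookkeeping may differ.

```python
import hashlib


def fingerprint(source):
    """Hash a task's source code so changes can be detected cheaply."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def tasks_to_run(sources, upstream, cache):
    """Return the tasks that need to run, in order.

    `sources` must list tasks in dependency order. A task is stale when its
    source differs from the cached fingerprint or any upstream task is stale.
    """
    stale = []
    for name, source in sources.items():
        changed = cache.get(name) != fingerprint(source)
        upstream_stale = any(dep in stale for dep in upstream.get(name, ()))
        if changed or upstream_stale:
            stale.append(name)
    return stale


def record_run(sources, names, cache):
    """Store fingerprints for the tasks that just ran successfully."""
    for name in names:
        cache[name] = fingerprint(sources[name])


# Hypothetical three-task pipeline; the strings stand in for real task code.
sources = {"load": "read raw table", "clean": "drop nulls", "predict": "score rows"}
upstream = {"clean": ["load"], "predict": ["clean"]}
cache = {}

first = tasks_to_run(sources, upstream, cache)    # nothing cached yet
record_run(sources, first, cache)
second = tasks_to_run(sources, upstream, cache)   # nothing changed
sources["clean"] = "drop nulls and duplicates"    # edit one task
third = tasks_to_run(sources, upstream, cache)
print(first, second, third)
```

After the edit, only `clean` and the task downstream of it are scheduled; `load` is skipped.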
Under the hood, Ploomber uses jupytext to convert between Python and ipynb files. Everyone can continue using Jupyter for interactive development, but files are stored as simple scripts in our repository, allowing us to easily perform code reviews. And since we’re using standard .py files, team members who prefer other development environments (like VSCode) can use them and still collaborate with the members who prefer Jupyter.
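To show why a plain script can stand in for a notebook, here is a simplified, standard-library-only sketch that splits a script written with `# %%` cell markers into a minimal notebook structure. It is an illustration of the idea, not jupytext's implementation; the `percent_script_to_notebook` function and the sample script are hypothetical, and real conversions handle many more cases (cell metadata, raw cells, other formats).

```python
import json


def _new_cell(kind):
    """Create an empty notebook cell of the given type."""
    cell = {"cell_type": kind, "metadata": {}, "source": []}
    if kind == "code":
        cell["execution_count"] = None
        cell["outputs"] = []  # the script stores no outputs, only code
    return cell


def percent_script_to_notebook(script):
    """Split a script on '# %%' markers into a minimal notebook dict."""
    cells = []
    for line in script.splitlines():
        if line.startswith("# %%"):
            kind = "markdown" if "[markdown]" in line else "code"
            cells.append(_new_cell(kind))
            continue
        if not cells:
            cells.append(_new_cell("code"))  # content before any marker
        cell = cells[-1]
        if cell["cell_type"] == "markdown":
            # Markdown cells are stored as comments; drop the comment prefix.
            line = line[2:] if line.startswith("# ") else line.lstrip("#")
        cell["source"].append(line)
    for cell in cells:
        cell["source"] = "\n".join(cell["source"]).strip("\n")
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


SCRIPT = """# %% [markdown]
# # Clean raw data

# %%
rows = [1, None, 3]
clean = [r for r in rows if r is not None]
"""

notebook = percent_script_to_notebook(SCRIPT)
print([cell["cell_type"] for cell in notebook["cells"]])
print(len(json.dumps(notebook)) > 0)
```

The script on disk contains only code and comments, which is what makes diffs and code reviews readable.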
Finally, since Ploomber supports Python and R, it allows members of our team to use their preferred language while continuing to collaborate and integrate their work efficiently.
Ploomber has changed the way we develop pipelines, and when deploying our first Ploomber pipeline to production, we realized its value goes well beyond the development stage.
Deployment and maintenance
Another critical feature is that Ploomber allows us to separate code from configuration. For example, we want to store pipeline artifacts in specific locations during development for debugging purposes, but we want to switch this location in production. Many configuration parameters were hardcoded in the past, forcing us to manually update values during the deployment process. Since Ploomber supports pipeline parameterization, we’re able to switch the configuration file during the deployment process and ensure we keep the development environment isolated from the production environment.
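As a rough sketch of separating code from configuration, the example below keeps one pipeline definition and swaps only the environment values at deployment time. Everything here is hypothetical (the environment names, paths, bucket, and the `resolve` helper), and it uses Python's `string.Template` rather than Ploomber's own configuration files, which should be consulted for the real syntax.

```python
from string import Template

# Hypothetical per-environment settings; only these change between dev and prod.
ENVIRONMENTS = {
    "dev": {"artifact_root": "/tmp/pipeline-dev"},
    "prod": {"artifact_root": "s3://example-bucket/pipeline"},
}

# Hypothetical pipeline definition; it never hardcodes a location.
PIPELINE_SPEC = {
    "clean": {"product": "${artifact_root}/clean.parquet"},
    "predict": {"product": "${artifact_root}/predictions.parquet"},
}


def resolve(spec, env):
    """Fill every placeholder in the pipeline spec with environment values."""
    return {
        task: {key: Template(value).substitute(env) for key, value in params.items()}
        for task, params in spec.items()
    }


dev_pipeline = resolve(PIPELINE_SPEC, ENVIRONMENTS["dev"])
prod_pipeline = resolve(PIPELINE_SPEC, ENVIRONMENTS["prod"])
print(dev_pipeline["clean"]["product"])
print(prod_pipeline["clean"]["product"])
```

`Template.substitute` raises `KeyError` when a placeholder has no value, so a missing setting fails loudly at deployment instead of silently writing to the wrong place.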
Debugging data pipelines is difficult. Data pipelines often fail because the source data has changed, and text-based log files are of limited use in that case: when investigating a production error, it is hard to tell from them what changed. Ploomber produces a set of notebooks (one per pipeline task) per run. These output notebooks give us a record of our results for every successful production run without writing extra code, and they enable us to respond much faster when a production run fails.
When an error occurs, Ploomber’s execution logs allow us to locate the point of failure immediately. Moreover, we can use the partially executed notebook and the produced pipeline artifacts to start an interactive debugging session: the recorded output in the notebook is a rich log that provides enough context to get started, and the interactivity enables us to investigate further. Finally, once the problem is fixed, we can restart the run from the point of failure and move on.
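The restart-from-failure behavior can be pictured with the following minimal sketch: a runner that remembers which tasks finished and, on the next invocation, skips them. The `run_pipeline` function, the task names, and the simulated failure are all hypothetical and only illustrate the pattern, not how Ploomber implements it.

```python
def run_pipeline(tasks, done):
    """Run tasks in order, skipping finished ones; stop at the first failure.

    Returns (failed_task_name, exception), or (None, None) on success.
    """
    for name, task in tasks:
        if name in done:
            continue  # finished in an earlier run, no need to repeat it
        try:
            task()
        except Exception as exc:
            return name, exc
        done.add(name)
    return None, None


calls = []
state = {"source_data_ok": False}  # simulates a change in the source data


def load():
    calls.append("load")


def predict():
    if not state["source_data_ok"]:
        raise ValueError("unexpected column in source data")
    calls.append("predict")


def notify():
    calls.append("notify")


TASKS = [("load", load), ("predict", predict), ("notify", notify)]
done = set()

failed, error = run_pipeline(TASKS, done)       # stops at "predict"
state["source_data_ok"] = True                  # the problem gets fixed
failed_after_fix, _ = run_pipeline(TASKS, done) # resumes at "predict"
print(failed, failed_after_fix, calls)
```

Note that `load` runs only once: the second invocation picks up at the task that failed.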
As of this writing, here at Evidation, we now have several projects running that leverage Ploomber, including:
Flu Monitoring - Provides flu education, personalized flu insights, and an opportunity to participate in flu research. Members can opt into connecting their wearable device to get personalized outreach when the system predicts a positive flu event from their wearable data. Over 100k predictions are made per day during flu season for members with Apple Watch, Fitbit, or Garmin devices. Ploomber is used to orchestrate data preprocessing, flu predictions, generation of personalized insights, and notifications.
Heart Health - Enables individuals to continuously monitor data relevant to their cardiovascular health, such as activity and symptom information to identify worsening symptoms, and access personalized content, resources, and tools for better managing their heart health. Ploomber is used to generate members' personalized heart health reports.
Month in Review Insight Emails - Shares personalized activity and health insights to our population using activity data from a variety of sources including Apple Watch, Fitbit, Garmin, Withings, and Oura. Ploomber is used to conduct data processing, personalized image and text generation, and sending messages to a queue.
Finally, we’d like to highlight the value of the community. We know we can rely on Ploomber’s community to get our questions answered. The Ploomber Slack is a unique place to discuss topics around using notebooks in production.
Ploomber’s core developers are always open to feedback and suggestions and have helped us multiple times to ensure we have a great experience with their tool.
Ploomber allows us to achieve an excellent balance between the interactivity of Jupyter and a reliable software development process. In the past, it seemed like we had to choose between two unsatisfying options: either run ipynb files in production or go through a painful refactoring process on each deployment. We no longer have to make that choice, enabling us to deliver more value to our members faster.
For more information on how Evidation uses Jupyter in production with the Python libraries Ploomber and nbdev, check out our talk at PyData Global 2021: From Jupyter to Production: Deploying an Influenza Monitoring System at Scale With Wearable Sensors.
The Evidation app is an easy-to-use mobile digital platform that encourages individuals to develop healthy habits – such as walking, meditating, and logging meals. It also gives them opportunities to participate in ground-breaking research and health programs. By securely sharing their health data from wearable devices such as smartphones, smartwatches, and fitness trackers, Evidation Members help provide insights to measure and improve health in everyday life for millions.