Have you ever wanted to schedule the execution of a data analysis workflow? For example, you may want to fetch new data hourly and generate an output report. This blog post will show you how you can easily schedule a data analysis workflow that generates an HTML report on each run.
cron: A simple job scheduler

cron is a job scheduler that comes pre-installed on Linux and macOS. It's a simple option for running scripts on a schedule. The main drawback is that it doesn't have a graphical user interface; however, it works great for simple use cases since it requires no setup.
Ploomber: An open-source framework for developing data analysis workflows
cron only takes care of scheduling jobs. For example, we could configure it to execute python create-report.py every hour. However, all the processing and output management logic is up to us: if we wanted to run a data pipeline and generate a report, we'd have to write code to ensure each run stores its output in a different folder, configure logging, add code to generate the HTML report, etc. Ploomber simplifies the process (including creating the HTML report), so we only have to write the data manipulation logic.
How does it work?
This section provides a high-level description of how the different pieces fit together. You can take a detailed look at the source code here.
In Ploomber, users can declare data analysis workflows in a pipeline.yaml file; in our case, it looks like this:
tasks:
  - source: scripts/load.py
    product:
      nb: products/{{timestamp}}/load.html
      data: products/{{timestamp}}/load.csv

  - source: scripts/plot.py
    product:
      nb: products/{{timestamp}}/plot.html
Our workflow has two stages: scripts/load.py and scripts/plot.py. The first script downloads some data, while the second one plots it. The product section on each task specifies where we'll store the output of each script; you can see that scripts/load.py generates an HTML report and a .csv file, while scripts/plot.py generates the final HTML report. Also, notice that all output paths contain the {{timestamp}} placeholder; Ploomber will replace this with the runtime timestamp, so each run stores its output in a different folder.
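To make this concrete, here is a minimal sketch of what a task script such as scripts/load.py could look like. This is illustrative, not the actual script from the sample code: the data URL is a placeholder, but the upstream/product convention is how Ploomber scripts declare their inputs and outputs (Ploomber injects the real product paths at runtime):

# %% tags=["parameters"]
# Ploomber reads this cell and injects the actual values when the pipeline runs
upstream = None  # this task has no upstream dependencies
product = None   # injected as {'nb': 'products/<timestamp>/load.html',
                 #              'data': 'products/<timestamp>/load.csv'}

# %%
import pandas as pd

# hypothetical data source; the sample code may fetch something else
df = pd.read_csv("https://example.com/data.csv")

# %%
# store the data at the path declared in pipeline.yaml
df.to_csv(product["data"], index=False)

Note that we don't write any code to produce the nb product: Ploomber executes the script as a notebook and exports it to HTML automatically.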
This example is a simple workflow, but you can extend it to suit your needs, such as adding more tasks, running them in parallel, etc.
To learn more about Ploomber, click here.
Getting the example source code
To get the sample code, we first need to install Ploomber; then we fetch the code:
# install ploomber
pip install ploomber
# get sample code
ploomber examples -n guides/cron -o cron
cd cron
Now, we configure the environment:
ploomber install --create-env --use-venv
source venv-cron/bin/activate
To ensure that everything is working correctly, we execute the pipeline:
ploomber build
If things are working, you’ll see something like this:
name    Ran?    Elapsed (s)    Percentage
------  ------  -------------  ------------
load    True    1.90359        39.4266
plot    True    2.9246         60.5734
Let's now schedule this pipeline using cron.
Scheduling
Ploomber can execute your workflow by invoking the ploomber build command; we'll tell cron to run this command every minute (we can define other intervals as well, as shown below).
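As a quick reference, the five fields in a cron schedule are minute, hour, day of month, month, and day of week. Some common patterns (standard cron syntax, not specific to this example):

* * * * *      # every minute
0 * * * *      # at the start of every hour
0 9 * * *      # every day at 9:00 AM
0 9 * * 1-5    # at 9:00 AM, Monday through Friday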
First, get the current working directory using the pwd command and copy it. Then, open the cron configuration file with:
crontab -e
Note: If using macOS Big Sur (11.6) or newer, you may need to follow a few extra steps to enable cron. If you need help, ping us.
Once it opens, add the following:
* * * * * PROJ=/path/to/cron && cd $PROJ && bash run.sh >> cron.log 2>&1
Replace /path/to/cron with the absolute path to the cron/ directory that contains the sample code.
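The crontab entry delegates the actual work to run.sh, which comes with the sample code. We won't reproduce the exact file here, but a minimal sketch of what such a script needs to do looks like this (assuming the venv-cron environment we created earlier):

#!/usr/bin/env bash
# sketch of run.sh: cron starts jobs with a minimal environment,
# so activate the virtual environment explicitly before building
source venv-cron/bin/activate
ploomber build

Redirecting the output to cron.log (the >> cron.log 2>&1 part of the crontab entry) keeps a record of each run, which comes in handy for debugging.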
After a minute, you’ll start to see more directories in the products folder; this is what mine looks like:
2022-03-12T11:14:47.506532/
2022-03-12T11:25:12.707618/
dev/
Congrats, you just scheduled a pipeline with cron and Ploomber! 🎉
If you don't see new folders, something may have gone wrong; check the cron.log file for hints. If you need help, reach out in our Slack community.
Final thoughts
This blog post showed how to schedule data analysis workflows using Ploomber and cron. Apart from cron, there are other alternatives for job scheduling, but they require more setup.
A popular alternative is to use GitHub Actions; the main advantage is that you don't have to keep a server running to run your tasks, since GitHub provides one for you. However, the computing environment disappears after the job finishes, so you need to configure a storage layer for logs and HTML reports; a common choice is Amazon S3 or Google Cloud Storage. Ploomber is compatible with Amazon S3, so if you want help setting up scheduled jobs with Ploomber, GitHub Actions, and Amazon S3 / Google Cloud Storage, ping us.
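To give a rough idea, a scheduled GitHub Actions workflow could look like the sketch below. This is an outline under stated assumptions, not a tested configuration: the file name, Python version, and upload step are placeholders.

# .github/workflows/pipeline.yml (hypothetical file name)
name: scheduled-pipeline

on:
  schedule:
    - cron: '0 * * * *'  # every hour, same cron syntax as above

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install ploomber
      - run: ploomber build
      # uploading products/ to S3 or Google Cloud Storage would go here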