Have you ever wanted to schedule a data analysis workflow? For example, you may want to fetch new data hourly and generate an updated report. This blog post shows how to easily schedule a workflow that produces an HTML report on each run.

cron: A simple job scheduler

cron is a job scheduler that comes pre-installed on most Linux distributions and on macOS. It's a simple option for running scripts on a schedule. The main drawback is that it doesn't have a graphical user interface; however, it works great for simple use cases since it requires no setup.
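
A crontab entry consists of five time fields followed by the command to run. For example, this entry would run a script every day at 6 AM (the script path is a hypothetical placeholder):

# field order: minute hour day-of-month month day-of-week command
0 6 * * * python /home/user/report.py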

Ploomber: An open-source framework for developing data analysis workflows

cron only takes care of scheduling jobs. For example, we could configure it to execute python create-report.py every hour; however, all the processing and output management logic would be up to us. If we wanted to run a data pipeline and generate a report, we'd have to write code so that each run stores its output in a different folder, configure logging, generate the HTML report, etc. Ploomber simplifies the process (including creating the HTML report), so we only have to write the data manipulation logic.

How does it work?

This section provides a high-level description of how the different pieces fit together. You can take a detailed look at the source code here.

In Ploomber, users can declare data analysis workflows in a pipeline.yaml file; in our case, it looks like this:

tasks:
  - source: scripts/load.py
    product:
      nb: products/{{timestamp}}/load.html
      data: products/{{timestamp}}/load.csv

  - source: scripts/plot.py
    product:
      nb: products/{{timestamp}}/plot.html

Our workflow has two stages: scripts/load.py and scripts/plot.py. The first script downloads some data, while the second one plots it. The product section of each task specifies where to store its output: scripts/load.py generates an HTML report and a .csv file, while scripts/plot.py generates the final HTML report. Also, notice that all output paths contain the {{timestamp}} placeholder; Ploomber replaces it with the timestamp at runtime, so each run stores its output in a different folder.
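
To make the connection between pipeline.yaml and the scripts concrete, here's a rough sketch of what scripts/load.py might look like: Ploomber scripts declare a "parameters" cell, and Ploomber injects the product paths at runtime. The data URL below is a placeholder, and the actual script in the sample code may differ:

# %% tags=["parameters"]
# Ploomber injects the real values here at runtime
upstream = None  # this task has no dependencies
product = None   # becomes a dict with the paths from pipeline.yaml

# %%
import pandas as pd

# download some data (placeholder URL)
df = pd.read_csv("https://example.com/raw.csv")

# %%
# save the data to the path declared in pipeline.yaml;
# the executed notebook itself is exported to product['nb']
df.to_csv(product["data"], index=False)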

This example is a simple workflow, but you can extend it to suit your needs, for example, by adding more tasks (sketched below) or running them in parallel.
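
For instance, adding a hypothetical cleaning step would only require listing one more entry in pipeline.yaml (scripts/clean.py doesn't exist in the sample code; it's shown here just to illustrate the pattern):

  # hypothetical extra task; Ploomber determines execution order
  # from the upstream references declared inside each script
  - source: scripts/clean.py
    product:
      nb: products/{{timestamp}}/clean.html
      data: products/{{timestamp}}/clean.csv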

To learn more about Ploomber, click here.

Getting the example source code

To get the sample code, we first need to install Ploomber; then we fetch the code:

# install ploomber
pip install ploomber

# get sample code
ploomber examples -n guides/cron -o cron
cd cron

Now, we configure the environment:

ploomber install --create-env --use-venv
source venv-cron/bin/activate

To ensure that everything is working correctly, we execute the pipeline:

ploomber build

If things are working, you’ll see something like this:

name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
load    True          1.90359       39.4266
plot    True          2.9246        60.5734

Let’s now schedule this pipeline using cron.

Scheduling

Ploomber executes your workflow when you invoke the ploomber build command; we'll tell cron to run this command every minute (other intervals work too; see the examples below).
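
For reference, here are a few common schedule expressions:

* * * * *      every minute (what we'll use below)
0 * * * *      every hour, at minute 0
0 9 * * 1-5    at 9 AM on weekdays
*/15 * * * *   every 15 minutes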

First, get the current working directory using the pwd command and copy it. Then, open the cron configuration file with:

crontab -e

Note: If using macOS Big Sur (11.6) or newer, you may need to follow a few extra steps to enable cron. If you need help, ping us.

Once it opens, add the following:

* * * * *  PROJ=/path/to/cron && cd $PROJ && bash run.sh >> cron.log 2>&1

Replace /path/to/cron with the absolute path to the cron/ directory that contains the sample code (the one you copied from pwd).
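
We won't reproduce run.sh here; conceptually, it only needs to activate the virtual environment and invoke ploomber build. A minimal version might look like this (the actual script shipped with the example may differ):

#!/bin/bash
# activate the environment created by ploomber install
source venv-cron/bin/activate

# run the pipeline; each run writes to a new {{timestamp}} folder
ploomber build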

After a minute, you’ll start to see more directories in the products folder; this is what mine looks like:

2022-03-12T11:14:47.506532/ 
2022-03-12T11:25:12.707618/ 
dev/

Congrats, you just scheduled a pipeline with cron and Ploomber! 🎉

If you don’t see new folders, something may have gone wrong; check the cron.log file for hints (two standard commands for this are shown below). If you need help, ask in our Slack community.
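
To inspect the log and confirm the crontab entry was registered:

# show the most recent log lines
tail cron.log

# list the currently installed cron entries
crontab -l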

Final thoughts

This blog post showed how to schedule data analysis workflows using Ploomber and cron. There are alternatives to cron for job scheduling, but they require more setup.

A popular alternative is GitHub Actions; the main advantage is that you don't have to keep a server running to execute your tasks, since GitHub provides one for you. However, the computing environment disappears after the job finishes, so you need to configure a storage layer for logs and HTML reports; a common choice is Amazon S3 or Google Cloud Storage. Ploomber is compatible with Amazon S3, so if you want help setting up scheduled jobs with Ploomber, GitHub Actions, and Amazon S3 / Google Cloud Storage, ping us.
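
As a rough sketch, a scheduled GitHub Actions workflow could look like the following (the file name, action versions, and omitted upload step are assumptions, not part of this example):

# .github/workflows/schedule.yml (hypothetical)
name: scheduled-pipeline

on:
  schedule:
    - cron: '0 * * * *'  # same five-field syntax as crontab; hourly here

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install ploomber
      - run: ploomber build
      # a real setup would add a step here to upload products/ to S3 or GCS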