Deploying a Data Science Platform on AWS: Parallelizing Experiments (Part III)


In our previous post, we configured Amazon ECR so we could push a Docker image to AWS, and set up an S3 bucket to store the output of our Data Science experiments.

In this final post, we’ll show you how to use Ploomber and Soopervisor to create grids of experiments that you can run in parallel on AWS Batch, and how to request resources dynamically (CPUs, RAM, and GPUs).

Authenticating with the aws CLI

We’ll be using the aws CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:

aws configure
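Before moving on, you can run a quick sanity check to confirm the CLI can actually reach your account (this snippet is an optional check, not part of the original setup):

```shell
# sanity check: verify the AWS CLI is authenticated
if aws sts get-caller-identity --output text >/dev/null 2>&1; then
    echo "authenticated"
else
    echo "not authenticated: run 'aws configure' first"
fi
```

If this prints your account details are reachable, you're good to continue.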

Checking Docker

We’ll be using Docker for this part, so ensure it’s up and running:

docker run hello-world

Creating an Amazon ECR repository

First, let’s create another ECR repository to host our Docker image:

# set this to the region you want to use
aws ecr create-repository \
    --repository-name ploomber-project-repository-grid \
    --query repository.repositoryUri \
    --region $AWS_REGION \
    --output text


Assign the output of the previous command to the REPOSITORY variable:
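One way to do this (if you didn't capture the output when creating the repository) is to query ECR for the URI of the repository we just created; the variable name REPOSITORY is the convention used in the rest of this post:

```shell
# fetch the URI of the repository created above and store it in REPOSITORY
REPOSITORY=$(aws ecr describe-repositories \
    --repository-names ploomber-project-repository-grid \
    --query 'repositories[0].repositoryUri' \
    --region $AWS_REGION \
    --output text)
echo $REPOSITORY
```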

Getting sample code

We’ll now get a sample project. First, let’s install the required packages.

Note: We recommend you install them in a virtual environment.

pip install ploomber soopervisor --upgrade --quiet

Download the example in the grid directory:

ploomber examples -n cookbook/grid -o grid


Loading examples...
Examples copy is more than 1 day old...
Cloning into '/Users/Edu/.ploomber/projects'...
remote: Enumerating objects: 606, done.
remote: Counting objects: 100% (606/606), done.
remote: Compressing objects: 100% (489/489), done.
remote: Total 606 (delta 116), reused 341 (delta 54), pack-reused 0
Receiving objects: 100% (606/606), 4.30 MiB | 16.45 MiB/s, done.
Resolving deltas: 100% (116/116), done.
==================== Copying example cookbook/grid to grid/ ====================
Next steps:

$ cd grid/
$ ploomber install

Open grid/ for details.

This downloaded a full project:

ls grid


README.ipynb        pipeline.yaml     scripts/         environment.yml   requirements.txt  tasks/

The example we downloaded prepares some data and then trains a dozen Machine Learning models in parallel.
Let’s look at the pipeline.yaml file, which specifies the tasks in our workflow:

cat grid/pipeline.yaml


# run tasks in parallel
executor: parallel

tasks:
  - source: tasks.raw.get
    product: products/raw/get.csv

  - source: tasks.features.sepal
    product: products/features/sepal.csv

  - source: tasks.features.petal
    product: products/features/petal.csv

  - source: tasks.features.features
    product: products/features/features.csv

  - source: scripts/
    # generates tasks fit-1, fit-2, etc
    name: fit-[[model_type]]-[[n_estimators]]-[[criterion]][[learning_rate]]
    # disabling static_analysis because the notebook does not have
    # a fixed set of parameters (depends on random-forest vs ada-boost)
    static_analysis: disable
    product:
      nb: products/report-[[model_type]]-[[n_estimators]]-[[criterion]][[learning_rate]].html
      model: products/model-[[model_type]]-[[n_estimators]]-[[criterion]][[learning_rate]].pickle
    grid:
      # generates 6 tasks (1 * 3 * 2)
      - model_type: [random-forest]
        n_estimators: [1, 3, 5]
        criterion: [gini, entropy]

      # generates 6 tasks (1 * 3 * 2)
      - model_type: [ada-boost]
        n_estimators: [1, 3, 5]
        learning_rate: [1, 2]

The pipeline.yaml is one interface that Ploomber has to describe computational workflows (you can also declare them with Python).

The tasks section contains five entries, one per task. The first four are Python functions that process the input data (tasks.raw.get, tasks.features.sepal, tasks.features.petal, tasks.features.features), and the last one is a script in the scripts/ directory that fits a model.

Note that the last entry is longer because it's a grid task: it uses the same script and executes it multiple times with different parameters. In total, the script will run 12 times, though a grid can generate many more tasks.
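To see where the 12 comes from, here's a quick sketch (plain shell, independent of Ploomber) that expands the two grid entries from pipeline.yaml into the task names they generate:

```shell
# expand the two grid entries from pipeline.yaml into task names
count=0
# first entry: random-forest x {1,3,5} x {gini,entropy} = 6 tasks
for n in 1 3 5; do
    for c in gini entropy; do
        echo "fit-random-forest-$n-$c"
        count=$((count + 1))
    done
done
# second entry: ada-boost x {1,3,5} x {1,2} = 6 tasks
for n in 1 3 5; do
    for lr in 1 2; do
        echo "fit-ada-boost-$n-$lr"
        count=$((count + 1))
    done
done
echo "total tasks: $count"  # total tasks: 12
```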

To learn more about the pipeline.yaml file and Ploomber, check our documentation.

Let’s now configure AWS Batch as our cloud environment (Kubernetes, SLURM, and Airflow are supported as well).

Configuring the project to run on AWS

cp grid/requirements.txt grid/requirements.lock.txt

cd grid
soopervisor add aws-env --backend aws-batch
cd ..


================================= Loading DAG ==================================
No found, looking for pipeline.yaml instead
Found /Users/Edu/dev/ Loading...
= Adding /Users/Edu/dev/ =
===================================== Done =====================================
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it:
	 $ soopervisor export aws-env
To force execution of all tasks:
	 $ soopervisor export aws-env --mode force

There are a few extra things we need to configure. To facilitate the setup, we created a script that automates these tasks depending on your AWS infrastructure. Let’s download it:

curl -O


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3211  100  3211    0     0  22935      0 --:--:-- --:--:-- --:--:-- 22935

Now, set the values for the AWS Batch job queue and artifacts bucket you want to use. If in doubt, you might want to revisit the previous tutorials (part I and part II).
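For reference, the variables used below look like this; the values shown are the example names used throughout this series (yours will differ):

```shell
# example values from this tutorial series; replace with your own
export AWS_REGION=us-east-1
export JOB_QUEUE=ploomber-batch-queue
export BUCKET_NAME=ploomber-bucket-3gsajz
```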


Let’s generate the configuration file that specifies the job queue to use and the ECR repository to upload our code:

python config \
    --directory grid \
    --queue $JOB_QUEUE \
    --region $AWS_REGION \
    --repository $REPOSITORY


Config file stored at: grid/soopervisor.yaml

Now, let’s specify the S3 client so the outputs of the pipeline are uploaded to the bucket:

python client \
    --directory grid \
    --bucket-name $BUCKET_NAME


Clients file stored at: grid/

Modify the pipeline.yaml so it uses the client we created in the step above:

python client-cfg \
    --directory grid

Upload the project to the ECR repository

Let’s upload our project to ECR:

pip install boto3 --quiet
aws ecr get-login-password \
    --region $AWS_REGION \
    | docker login \
    --username AWS \
    --password-stdin $REPOSITORY


Login Succeeded

Ensure boto3 is installed as part of our project, since we need it to upload the outputs to S3:

echo -e '\nboto3' >> grid/requirements.lock.txt

Execute jobs in AWS Batch

We’re now ready to schedule our workflow! Let’s use the soopervisor export command to build the Docker image, push it to ECR and schedule the jobs on AWS Batch:

cd grid
soopervisor export aws-env --mode force \
    --ignore-git --skip-tests > ../output.log 2>&1
cd ..

You can monitor execution in the AWS Batch console, or use the following command (just ensure you change the job name). The command below retrieves the status of the fit-random-forest-1-entropy task:

aws batch list-jobs --job-queue $JOB_QUEUE \
    --filters 'name=JOB_NAME,values=fit-random-forest-1-entropy' \
    --query 'jobSummaryList[*].status' \
    --region $AWS_REGION



After a few minutes, all tasks should be executed!
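If you'd rather wait from the terminal than refresh the console, you can wrap the command above in a small polling loop (the job name here is one of the tasks from our grid; adjust it as needed):

```shell
# poll AWS Batch until the job reaches a terminal state
while true; do
    status=$(aws batch list-jobs --job-queue $JOB_QUEUE \
        --filters 'name=JOB_NAME,values=fit-random-forest-1-entropy' \
        --query 'jobSummaryList[0].status' \
        --region $AWS_REGION \
        --output text)
    echo "status: $status"
    if [ "$status" = "SUCCEEDED" ] || [ "$status" = "FAILED" ]; then
        break
    fi
    sleep 30
done
```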

Checking output

Let’s check the outputs in the S3 bucket:

aws s3api list-objects --bucket $BUCKET_NAME \
    --query 'Contents[].Key'



You can see there’s a combination of .pickle files (the trained models), .csv (processed data), and .html (reports generated from the training script).
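For example, to list only the generated HTML reports, you can filter the listing with a JMESPath expression in the --query flag:

```shell
# list only the generated HTML reports
aws s3api list-objects --bucket $BUCKET_NAME \
    --query "Contents[?ends_with(Key, '.html')].Key"
```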

Let’s download one of the reports:

aws s3 cp s3://$BUCKET_NAME/outputs/products/report-random-forest-1-entropy-1.html report.html


download: s3://ploomber-bucket-3gsajz/outputs/products/report-random-forest-1-entropy-1.html to ./report.html

Open the report.html and you’ll see the outputs of the training script!

Requesting more resources

Let’s take a look at the grid/soopervisor.yaml file which configures the cloud environment:

cat grid/soopervisor.yaml


aws-env:
  backend: aws-batch
  container_properties: {memory: 16384, vcpus: 8}
  exclude: [output]
  job_queue: ploomber-batch-queue
  region_name: us-east-1

The soopervisor.yaml file specifies the backend to use (aws-batch), the resources to use by default ({memory: 16384, vcpus: 8}), the job queue, region and repository.

We can add a new section to specify per-task resources, overriding the default values:

task_resources:
  get: # resources for the "get" task
    memory: 16384
    vcpus: 8
  fit-*: # match all tasks that begin with "fit-"
    memory: 32768
    vcpus: 16
    gpu: 1

Closing remarks

In this final part, we showed how to create multi-step workflows, and how to parametrize a script to create a grid of experiments that can run in parallel. Now you have a scalable infrastructure to run Data Science and Machine Learning experiments!

If you need help customizing the infrastructure or want to share your feedback, please join our community!

To keep up to date with our content, follow us on Twitter, LinkedIn, or subscribe to our newsletter!

Epilogue: Cleaning up the infrastructure

Here’s the command you need to run to delete the ECR repository we created in this post. To delete the rest of the infrastructure, revisit the previous tutorials.

aws ecr delete-repository \
    --repository-name ploomber-project-repository-grid \
    --region $AWS_REGION


    "repository": {
        "repositoryArn": "arn:aws:ecr:us-east-1:0123456789:repository/ploomber-project-repository-grid",
        "registryId": "0123456789",
        "repositoryName": "ploomber-project-repository-grid",
        "repositoryUri": "",
        "createdAt": "2022-10-28T10:06:04-04:00",
        "imageTagMutability": "MUTABLE"
