preloader
blog-post

Deploying a Data Science Platform on AWS: Parallelizing Experiments (Part III)

author image

In our previous post, we configured Amazon ECR to push a Docker image to AWS and configured an S3 bucket to write the output of our Data Science experiments.

In this final post, we’ll show you how to use Ploomber and Soopervisor to create grids of experiments that you can run in parallel on AWS Batch, and how to request resources dynamically (CPUs, RAM, and GPUs).

Authenticating with the aws CLI

We’ll be using the aws CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:

aws configure

Checking Docker

We’ll be using Docker for this part, so ensure it’s up and running:

docker run hello-world

Creating an Amazon ECR repository

First, let’s create another ECR repository to host our Docker image:

# set this to the region you want to use
AWS_REGION=us-east-1
aws ecr create-repository \
    --repository-name ploomber-project-repository-grid \
    --query repository.repositoryUri \
    --region $AWS_REGION \
    --output text

Output:

0123456789.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository-grid

Assign the REPOSITORY variable to the output of the previous command:

REPOSITORY=0123456789.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository-grid

Getting sample code

We’ll now get a sample project. First, let’s install the required packages.

Note: We recommend you install them in a virtual environment.

pip install ploomber soopervisor --upgrade --quiet

Download the example in the grid directory:

ploomber examples -n cookbook/grid -o grid

Output:

Loading examples...
Examples copy is more than 1 day old...
Cloning into '/Users/Edu/.ploomber/projects'...
remote: Enumerating objects: 606, done.
remote: Counting objects: 100% (606/606), done.
remote: Compressing objects: 100% (489/489), done.
remote: Total 606 (delta 116), reused 341 (delta 54), pack-reused 0
Receiving objects: 100% (606/606), 4.30 MiB | 16.45 MiB/s, done.
Resolving deltas: 100% (116/116), done.
==================== Copying example cookbook/grid to grid/ ====================
Next steps:

$ cd grid/
$ ploomber install

Open grid/README.md for details.


This downloaded a full project:

ls grid

Output:

README.ipynb      _source.md        pipeline.yaml     scripts/
README.md         environment.yml   requirements.txt  tasks/

The example we downloaded prepares some data and trains a dozen Machine Learning models in parallel, here’s a graphical representation:

dag

Let’s look at the pipeline.yaml file, which specifies the tasks in our workflow:

cat grid/pipeline.yaml

Output:

# run tasks in parallel
executor: parallel

tasks:
  - source: tasks.raw.get
    product: products/raw/get.csv

  - source: tasks.features.sepal
    product: products/features/sepal.csv

  - source: tasks.features.petal
    product: products/features/petal.csv

  - source: tasks.features.features
    product: products/features/features.csv

  - source: scripts/fit.py
    # generates tasks fit-1, fit-2, etc
    name: fit-[[model_type]]-[[n_estimators]]-[[criterion]][[learning_rate]]
    # disabling static_analysis because the notebook does not have
    # a fixed set of parameters (depends on random-forest vs ada-boost)
    static_analysis: disable
    product:
      nb: products/report-[[model_type]]-[[n_estimators]]-[[criterion]][[learning_rate]].html
      model: products/model-[[model_type]]-[[n_estimators]]-[[criterion]][[learning_rate]].pickle
    grid:
      # generates 6 tasks (1 * 3 * 2)
      - model_type: [random-forest]
        n_estimators: [1, 3, 5]
        criterion: [gini, entropy]

      # generates 6 tasks (1 * 3 * 2)
      - model_type: [ada-boost]
        n_estimators: [1, 3, 5]
        learning_rate: [1, 2]

The pipeline.yaml is one interface that Ploomber has to describe computational workflows (you can also declare them with Python).

The tasks section contains five entries, one per task. The first four are Python functions that process some input data (tasks.raw.get, tasks.features.sepal, tasks.features.petal, tasks.features.features), and the last one is a script that fits a model (scripts/fit.py).

Note the last entry is longer because it’s a grid task: it’ll use the same script and execute it multiple times with different parameters. In total, the script will be executed 12 times, but this could be a larger number.

To learn more about the pipeline.yaml file and Ploomber, check our documentation.

Let’s now configure AWS Batch as our cloud environment (Kubernetes, SLURM, and Airflow are supported as well):

Configuring the project to run on AWS

cp grid/requirements.txt grid/requirements.lock.txt

cd grid
soopervisor add aws-env --backend aws-batch
cd ..

Output:

================================= Loading DAG ==================================
No pipeline.aws-env.yaml found, looking for pipeline.yaml instead
Found /Users/Edu/dev/ploomber.io/raw/ds-platform-part-iii/grid/pipeline.yaml. Loading...
= Adding /Users/Edu/dev/ploomber.io/raw/ds-platform-part-iii/grid/aws-env/Dockerfile... =
===================================== Done =====================================
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it:
	 $ soopervisor export aws-env
To force execution of all tasks:
	 $ soopervisor export aws-env --mode force



There are a few extra things we need to configure, to facilitate the setup, we created a script that automates these tasks depending on your AWS infrastructure, let’s download it:

curl -O https://raw.githubusercontent.com/ploomber/posts/master/ds-platform-part-ii/generate.py

Output:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3211  100  3211    0     0  22935      0 --:--:-- --:--:-- --:--:-- 22935

Now, set the values for the AWS Batch job queue and artifacts bucket you want to use. If in doubt, you might want to revisit the previous tutorials (part I and part II).

JOB_QUEUE=ploomber-batch-queue
BUCKET_NAME=ploomber-bucket-zxmmcm

Let’s generate the configuration file that specifies the job queue to use and the ECR repository to upload our code:

python generate.py config \
    --directory grid \
    --queue $JOB_QUEUE \
    --region $AWS_REGION \
    --repository $REPOSITORY

Output:

Config file stored at: grid/soopervisor.yaml

Now, let’s specify the S3 client so the outputs of the pipeline are uploaded to the bucket:

python generate.py client \
    --directory grid \
    --bucket-name $BUCKET_NAME

Output:

Clients file stored at: grid/clients.py

Modify the pipeline.yaml so it uses the client we created in the step above:

python generate.py client-cfg \
    --directory grid

Upload the project to the ECR repository

Let’s upload our project to ECR:

pip install boto3 --quiet
aws ecr get-login-password \
    --region $AWS_REGION \
    | docker login \
    --username AWS \
    --password-stdin $REPOSITORY

Output:

Login Succeeded

Ensure boto3 is installed as part of our project. We need to upload to S3:

echo -e '\nboto3' >> grid/requirements.lock.txt

Execute jobs in AWS Batch

We’re now ready to schedule our workflow! Let’s use the soopervisor export command to build the Docker image, push it to ECR and schedule the jobs on AWS Batch:

cd grid
soopervisor export aws-env --mode force \
    --ignore-git --skip-tests > ../output.log 2>&1
cd ..

You can monitor execution in the AWS Batch console. Or use the following command, just ensure you change the job name. The following command retrieves the status of the fit-random-forest-1-gini task:

aws batch list-jobs --job-queue $JOB_QUEUE \
    --filters 'name=JOB_NAME,values=fit-random-forest-1-entropy' \
    --query 'jobSummaryList[*].status' \
    --region $AWS_REGION

Output:

[
    "SUCCEEDED"
]

After a few minutes, all tasks should be executed!

Checking output

Let’s check the outputs in the S3 bucket:

aws s3api list-objects --bucket $BUCKET_NAME \
    --query 'Contents[].Key'

Output:

[
    "outputs/products/.model-ada-boost-1-1-6.pickle.metadata",
    "outputs/products/.model-ada-boost-1-2-7.pickle.metadata",
    "outputs/products/.model-ada-boost-3-1-8.pickle.metadata",
    "outputs/products/.model-ada-boost-3-2-9.pickle.metadata",
    "outputs/products/.model-ada-boost-5-1-10.pickle.metadata",
    "outputs/products/.model-ada-boost-5-2-11.pickle.metadata",
    "outputs/products/.model-random-forest-1-entropy-1.pickle.metadata",
    "outputs/products/.model-random-forest-1-gini-0.pickle.metadata",
    "outputs/products/.model-random-forest-3-entropy-3.pickle.metadata",
    "outputs/products/.model-random-forest-3-gini-2.pickle.metadata",
    "outputs/products/.model-random-forest-5-entropy-5.pickle.metadata",
    "outputs/products/.model-random-forest-5-gini-4.pickle.metadata",
    "outputs/products/.report-ada-boost-1-1-6.html.metadata",
    "outputs/products/.report-ada-boost-1-2-7.html.metadata",
    "outputs/products/.report-ada-boost-3-1-8.html.metadata",
    "outputs/products/.report-ada-boost-3-2-9.html.metadata",
    "outputs/products/.report-ada-boost-5-1-10.html.metadata",
    "outputs/products/.report-ada-boost-5-2-11.html.metadata",
    "outputs/products/.report-random-forest-1-entropy-1.html.metadata",
    "outputs/products/.report-random-forest-1-gini-0.html.metadata",
    "outputs/products/.report-random-forest-3-entropy-3.html.metadata",
    "outputs/products/.report-random-forest-3-gini-2.html.metadata",
    "outputs/products/.report-random-forest-5-entropy-5.html.metadata",
    "outputs/products/.report-random-forest-5-gini-4.html.metadata",
    "outputs/products/features/.features.csv.metadata",
    "outputs/products/features/.petal.csv.metadata",
    "outputs/products/features/.sepal.csv.metadata",
    "outputs/products/features/features.csv",
    "outputs/products/features/petal.csv",
    "outputs/products/features/sepal.csv",
    "outputs/products/model-ada-boost-1-1-6.pickle",
    "outputs/products/model-ada-boost-1-2-7.pickle",
    "outputs/products/model-ada-boost-3-1-8.pickle",
    "outputs/products/model-ada-boost-3-2-9.pickle",
    "outputs/products/model-ada-boost-5-1-10.pickle",
    "outputs/products/model-ada-boost-5-2-11.pickle",
    "outputs/products/model-random-forest-1-entropy-1.pickle",
    "outputs/products/model-random-forest-1-gini-0.pickle",
    "outputs/products/model-random-forest-3-entropy-3.pickle",
    "outputs/products/model-random-forest-3-gini-2.pickle",
    "outputs/products/model-random-forest-5-entropy-5.pickle",
    "outputs/products/model-random-forest-5-gini-4.pickle",
    "outputs/products/raw/.get.csv.metadata",
    "outputs/products/raw/get.csv",
    "outputs/products/report-ada-boost-1-1-6.html",
    "outputs/products/report-ada-boost-1-2-7.html",
    "outputs/products/report-ada-boost-3-1-8.html",
    "outputs/products/report-ada-boost-3-2-9.html",
    "outputs/products/report-ada-boost-5-1-10.html",
    "outputs/products/report-ada-boost-5-2-11.html",
    "outputs/products/report-random-forest-1-entropy-1.html",
    "outputs/products/report-random-forest-1-gini-0.html",
    "outputs/products/report-random-forest-3-entropy-3.html",
    "outputs/products/report-random-forest-3-gini-2.html",
    "outputs/products/report-random-forest-5-entropy-5.html",
    "outputs/products/report-random-forest-5-gini-4.html"
]

You can see there’s a combination of .pickle files (the trained models), .csv (processed data), and .html (reports generated from the training script).

Let’s download one of the reports:

aws s3 cp s3://$BUCKET_NAME/outputs/products/report-random-forest-1-entropy-1.html report.html

Output:

download: s3://ploomber-bucket-3gsajz/outputs/products/report-random-forest-1-entropy-1.html to ./report.html

Open the report.html and you’ll see the outputs of the training script!

Requesting more resources

Let’s take a look at the grid/soopervisor.yaml file which configures the cloud environment:

cat grid/soopervisor.yaml

Output:

aws-env:
  backend: aws-batch
  container_properties: {memory: 16384, vcpus: 8}
  exclude: [output]
  job_queue: ploomber-batch-queue
  region_name: us-east-1
  repository: 0123456789.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository-grid

The soopervisor.yaml file specifies the backend to use (aws-batch), the resources to use by default ({memory: 16384, vcpus: 8}), the job queue, region and repository.

We can add a new section to specify per-task resources, to override the default value:

task_resources:
  get: # resources for the "get" task
    memory: 16384
    vcpus: 8
  fit-*: # match all tasks that begin with "fit-"
    memory: 32768
    vcpus: 16
    gpu: 1

Closing remarks

In this final part, we showed how to create multi-step workflows, and how to parametrize a script to create a grid of experiments that can run in parallel. Now you have a scalable infrastructure to run Data Science and Machine Learning experiments!

If you need help customizing the infrastructure or want to share your feedback, please join our community!

To keep up-to-date with our content; follow us on Twitter, LinkedIn, or subscribe to our newsletter!

Epilogue: Cleaning up the infrastructure

Here’s the command you need to run to delete the ECR repository we created on this post. To delete all the infrastructure, revisit the previous tutorials.

aws ecr delete-repository \
    --repository-name ploomber-project-repository-grid \
    --region $AWS_REGION \
    --force

Output:

{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-east-1:0123456789:repository/ploomber-project-repository-grid",
        "registryId": "0123456789",
        "repositoryName": "ploomber-project-repository-grid",
        "repositoryUri": "0123456789.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository-grid",
        "createdAt": "2022-10-28T10:06:04-04:00",
        "imageTagMutability": "MUTABLE"
    }
}

Found an error? Click here to let us know.

comments powered by Disqus

Recent Articles

blog-post

Who needs MLflow when you have SQLite?

I spent about six years working as a data scientist and tried to use MLflow several times (and others as well) to track …

Try Ploomber Cloud Now

Get Started
*