Deploying a Data Science Platform on AWS: Running an experiment (Part II)

In our previous post, we saw how to configure AWS Batch and tested our infrastructure by executing a task that spun up a container, waited for 3 seconds, and shut down.

In this post, we’ll leverage the existing infrastructure, but this time, we’ll execute a more interesting example.

We’ll ship our code to AWS by building a container and storing it in Amazon ECR, a service that allows us to store Docker images.

Authenticating with the aws CLI

We’ll be using the aws CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:

aws configure

Checking Docker

We’ll be using Docker for this part, so ensure it’s up and running:

docker run hello-world

Output:


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Creating an Amazon ECR repository

We first create a repository, which will host our Docker images:

# set the aws region to use for all resources
AWS_REGION=us-east-1

Create ECR repository:

aws ecr create-repository \
    --repository-name ploomber-project-repository \
    --query repository.repositoryUri \
    --output text \
    --region $AWS_REGION

Output:

1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository

The command above prints the repository URI. Assign it to the following variable, since we’ll need it later:

REPOSITORY=1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository
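
The URI encodes the account ID, the region, and the repository name. As a minimal sketch (the `EcrRepository` and `parse_repository_uri` names are hypothetical, and the check assumes the standard `amazonaws.com` domain), the following Python snippet splits a URI into its parts, which is handy for confirming that the repository lives in the same region as `AWS_REGION`:

```python
from typing import NamedTuple


class EcrRepository(NamedTuple):
    account_id: str
    region: str
    name: str

    @property
    def registry(self) -> str:
        # the registry host, i.e. the URI without the repository name
        return f"{self.account_id}.dkr.ecr.{self.region}.amazonaws.com"


def parse_repository_uri(uri: str) -> EcrRepository:
    """Split '<account>.dkr.ecr.<region>.amazonaws.com/<name>' into parts."""
    host, _, name = uri.partition("/")
    parts = host.split(".")
    well_formed = (
        len(parts) == 6
        and parts[1:3] == ["dkr", "ecr"]
        and parts[4:] == ["amazonaws", "com"]
        and bool(name)
    )
    if not well_formed:
        raise ValueError(f"Not an ECR repository URI: {uri!r}")
    return EcrRepository(account_id=parts[0], region=parts[3], name=name)
```

For example, `parse_repository_uri(uri).region` can be compared against the value of `AWS_REGION` before going any further.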

Getting some sample code

We’ll now use two open-source tools (Ploomber and Soopervisor) to write our computational task, generate a Docker image, push it to ECR, and schedule a job in AWS Batch.

Let’s install the packages:

Note: We recommend installing them in a virtual environment.

pip install ploomber soopervisor --upgrade --quiet

Let’s get an example. This example trains and evaluates a Machine Learning model:

ploomber examples -n templates/ml-basic -o example

Output:

Loading examples...
================ Copying example templates/ml-basic to example/ ================
Next steps:

$ cd example/
$ ploomber install

Open example/README.md for details.


Let’s see the files:

ls example

Output:

README.ipynb      clients.py        pipeline.yaml
README.md         environment.yml   requirements.txt
_source.md        fit.py            tasks.py

The structure is that of a typical Ploomber project. Ploomber allows you to organize computational workflows as functions, scripts, or notebooks and execute them locally. To learn more, check out Ploomber’s documentation.

On the other hand, Soopervisor allows you to export a Ploomber project and execute it in the cloud.

The next command will tell Soopervisor to create the necessary files so we can export to AWS Batch:

cp example/requirements.txt example/requirements.lock.txt
pip install -r example/requirements.txt --quiet

cd example
soopervisor add aws-env --backend aws-batch
cd ..

Output:

================================= Loading DAG ==================================
No pipeline.aws-env.yaml found, looking for pipeline.yaml instead
Found /Users/Edu/dev/ploomber.io/raw/ds-platform-part-ii/example/pipeline.yaml. Loading...
= Adding /Users/Edu/dev/ploomber.io/raw/ds-platform-part-ii/example/aws-env/Dockerfile... =
===================================== Done =====================================
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it:
	 $ soopervisor export aws-env
To force execution of all tasks:
	 $ soopervisor export aws-env --mode force



soopervisor add will create a soopervisor.yaml file and an aws-env folder.

The aws-env folder contains a Dockerfile (which we need to create a Docker image):

ls example/aws-env

Output:

Dockerfile

The soopervisor.yaml file contains configuration parameters:

cat example/soopervisor.yaml

Output:

aws-env:
  backend: aws-batch
  container_properties: {memory: 16384, vcpus: 8}
  exclude: [output]
  job_queue: your-job-queue
  region_name: your-region-name
  repository: your-repository/name

There are a few parameters we have to configure here. We created a small script to generate the configuration file:

  • job_queue: the name of your job queue
  • region_name: the region where your AWS Batch infrastructure is located
  • repository: the ECR repository URI

Here are the values for my infrastructure (replace them with yours):

# this is the name we put in the previous post
JOB_QUEUE=ploomber-batch-queue

Note: If you don’t have the job queue name, you can get it from the AWS console (ensure you’re in the right region).

Let’s download a utility script to facilitate creating the configuration files:

curl -O https://raw.githubusercontent.com/ploomber/posts/master/ds-platform-part-ii/generate.py

Output:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3211  100  3211    0     0  10811      0 --:--:-- --:--:-- --:--:-- 10811

Create the soopervisor.yaml configuration file:

python generate.py config \
    --directory example \
    --queue $JOB_QUEUE \
    --region $AWS_REGION \
    --repository $REPOSITORY

Output:

Config file stored at: example/soopervisor.yaml

This is what the file looks like:

cat example/soopervisor.yaml

Output:

aws-env:
  backend: aws-batch
  container_properties: {memory: 16384, vcpus: 8}
  exclude: [output]
  job_queue: ploomber-batch-queue
  region_name: us-east-1
  repository: 1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository
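
To make the step less of a black box, here is a minimal sketch of how such a file could be rendered from the three values. This is not the actual generate.py; `TEMPLATE` and `render_config` are hypothetical names, and the fixed fields are copied from the output above:

```python
# Fields we do not customize are copied verbatim from the file shown above;
# doubled braces are literal braces in str.format templates.
TEMPLATE = """\
aws-env:
  backend: aws-batch
  container_properties: {{memory: 16384, vcpus: 8}}
  exclude: [output]
  job_queue: {queue}
  region_name: {region}
  repository: {repository}
"""


def render_config(queue: str, region: str, repository: str) -> str:
    """Return the contents of soopervisor.yaml for the given settings."""
    return TEMPLATE.format(queue=queue, region=region, repository=repository)
```

Writing the result to example/soopervisor.yaml (for instance with `pathlib.Path.write_text`) would produce the same file as above, assuming the same queue, region, and repository.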

Let’s now use soopervisor export to execute the project in AWS Batch. This command will do a few things for us:

  • Build the Docker container
  • Push it to the Amazon ECR repository
  • Submit the jobs to AWS Batch

We need to install boto3 since it’s a dependency to submit jobs to AWS Batch:

pip install boto3 --quiet

Authenticate with Amazon ECR so we can push images:

aws ecr get-login-password \
    --region $AWS_REGION \
    | docker login \
    --username AWS \
    --password-stdin $REPOSITORY

Output:

Login Succeeded

Let’s now export the project. Bear in mind that this command will take a few minutes:

cd example
soopervisor export aws-env --task get --mode force \
    --ignore-git --skip-tests > ../get-first.log 2>&1
cd ..

If all goes well, the log will end with something like this:

========================= Submitting jobs to AWS Batch =========================
=============== Registering 'example' job definition... ===============
============================== Submitting jobs... ==============================
Submitted task 'get'...
========================= Done. Submitted to AWS Batch =========================

Note: The command’s output is redirected to the get-first.log file, stored next to the example directory.

If you encounter issues with the soopervisor export command, or are unable to push to ECR, join our community and we’ll help you!

Once the command finishes execution, the job will be submitted to AWS Batch. Let’s use the aws CLI to list the jobs submitted to the queue:

aws batch list-jobs --job-queue $JOB_QUEUE \
    --filters 'name=JOB_NAME,values=get' \
    --query 'jobSummaryList' \
    --region $AWS_REGION

Output:

[
    {
        "jobArn": "arn:aws:batch:us-east-1:1234567890:job/44149f83-3549-406b-8deb-053fbad24619",
        "jobId": "44149f83-3549-406b-8deb-053fbad24619",
        "jobName": "get",
        "createdAt": 1666383930050,
        "status": "SUCCEEDED",
        "stoppedAt": 1666384038811,
        "container": {
            "exitCode": 0
        },
        "jobDefinition": "arn:aws:batch:us-east-1:1234567890:job-definition/example-5bf45a9e:1"
    }
]

After a minute, you’ll see that the task shows as SUCCEEDED (it’ll appear as RUNNABLE, STARTING, or RUNNING if it hasn’t finished).
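
The createdAt and stoppedAt fields are Unix timestamps in milliseconds, so the JSON above also tells us how long each job took from submission to completion (queue time included). A small sketch, assuming the output of the list-jobs command has been captured as a string; `summarize_jobs` is a hypothetical helper name:

```python
import json


def summarize_jobs(list_jobs_output: str) -> list:
    """Return name, status and elapsed seconds for each job summary."""
    jobs = json.loads(list_jobs_output) or []
    summary = []
    for job in jobs:
        stopped = job.get("stoppedAt")  # missing while the job is still active
        seconds = None
        if stopped is not None:
            # timestamps are in milliseconds since the epoch
            seconds = (stopped - job["createdAt"]) / 1000
        summary.append(
            {"name": job["jobName"], "status": job["status"], "seconds": seconds}
        )
    return summary
```

For the job listed above, the elapsed time is roughly 109 seconds.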

However, there is a catch: AWS Batch ran our code, but shortly after, it shut down the EC2 instance, so we no longer have access to the output.

To fix that, we’ll add an S3 client to our project, so all outputs are stored.

Creating an S3 bucket to store outputs

Let’s first create a bucket in S3. S3 bucket names must be globally unique. You can run the following snippet in your terminal to generate one, or choose a unique name yourself and assign it to the BUCKET_NAME variable:

BUCKET_NAME=$(python generate.py bucket)
echo "Bucket name is $BUCKET_NAME"

Output:

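Bucket name is ploomber-bucket-zxmmcm

Create the bucket (for regions other than us-east-1, you also need to pass --create-bucket-configuration LocationConstraint=$AWS_REGION):

aws s3api create-bucket \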
    --bucket $BUCKET_NAME \
    --region $AWS_REGION

Output:

{
    "Location": "/ploomber-bucket-zxmmcm"
}

Adding a client to our pipeline

Ploomber allows us to specify an S3 bucket and it’ll take care of uploading all outputs for us. We only have to create a short file. The generate.py script can create one for us:

python generate.py client \
    --directory example \
    --bucket-name $BUCKET_NAME

Output:

Clients file stored at: example/clients.py
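
We don’t reproduce the generated file here, but a clients file is typically only a few lines long. The sketch below renders one plausible version as text. It rests on several assumptions: that Ploomber’s `S3Client` is importable from `ploomber.clients` and accepts `bucket_name` and `parent`, that `parent` is "outputs" (inferred from the object keys that show up in the bucket later in this post), and that the function is called `get_s3` (a hypothetical name). Compare it with your actual example/clients.py before relying on it.

```python
# Assumed shape of a clients.py file; the real generated file may differ.
CLIENTS_TEMPLATE = '''\
from ploomber.clients import S3Client


def get_s3():
    # products are uploaded under the "outputs/" prefix of the bucket
    return S3Client(bucket_name={bucket_name!r}, parent="outputs")
'''


def render_clients(bucket_name: str) -> str:
    """Return the source code of a clients.py that targets bucket_name."""
    return CLIENTS_TEMPLATE.format(bucket_name=bucket_name)
```

Since no credentials are passed, a client like this would rely on the permissions available in the environment where the task runs.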

We need to configure our pipeline.yaml file so it uploads artifacts to S3. The generate.py script can do that for us:

python generate.py client-cfg --directory example

Furthermore, let’s add boto3 to our dependencies since we’ll be calling it to upload artifacts to S3:

echo -e '\nboto3' >> example/requirements.lock.txt

Giving AWS Batch permissions to access the bucket

Let’s add S3 permissions to our AWS Batch tasks. Generate a policy:

python generate.py policy --bucket-name $BUCKET_NAME

Output:

Policy file stored at: s3-policy.json

Apply it:

aws iam put-role-policy --role-name ploomber-ecs-instance-role \
    --policy-name ploomber-s3-policy \
    --policy-document file://s3-policy.json
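
For reference, an IAM policy that grants read and write access to a single bucket usually looks like the one built below. This is a sketch of the general shape, not necessarily the exact contents of the generated s3-policy.json; the `s3_policy` function name and the specific list of actions are assumptions:

```python
import json


def s3_policy(bucket_name: str) -> dict:
    """Build an IAM policy allowing listing, reading and writing one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # reading and writing apply to the objects inside it
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


def policy_document(bucket_name: str) -> str:
    """Serialize the policy as JSON, ready to be saved to a file."""
    return json.dumps(s3_policy(bucket_name), indent=4)
```

Scoping the resources to one bucket keeps the instance role from gaining access to the rest of your S3 data.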

Executing the workload

We’re now ready to execute our task in AWS Batch!

Let’s ensure we can push to ECR:

aws ecr get-login-password \
    --region $AWS_REGION \
    | docker login \
    --username AWS \
    --password-stdin $REPOSITORY

Output:

Login Succeeded

Submit the task again:

cd example
soopervisor export aws-env --task get --mode force \
    --ignore-git --skip-tests > ../get-second.log 2>&1
cd ..

Note that this time, the soopervisor export command is a lot faster, since it cached our Docker image!

Let’s check the status of the task:

aws batch list-jobs --job-queue $JOB_QUEUE \
    --filters 'name=JOB_NAME,values=get' \
    --query 'jobSummaryList[*].status' \
    --region $AWS_REGION

Output:

[
    "SUCCEEDED",
    "SUCCEEDED"
]

After a minute, you should see it as SUCCEEDED.

Let’s check the contents of our bucket. We’ll see the task output (a .parquet file):

aws s3api list-objects --bucket $BUCKET_NAME \
    --query 'Contents[].{Key: Key, Size: Size}'

Output:

[
    {
        "Key": "outputs/output/.get.parquet.metadata",
        "Size": 361
    },
    {
        "Key": "outputs/output/get.parquet",
        "Size": 5627
    }
]
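
Besides the .parquet file, the listing includes a hidden .metadata file stored next to it. When a bucket holds many outputs, it helps to separate the two. A minimal sketch, assuming the JSON printed by the command above has been captured as a string; `split_objects` is a hypothetical helper, and the dot-prefix plus .metadata suffix convention is inferred from the listing:

```python
import json
from pathlib import PurePosixPath


def split_objects(list_objects_output: str):
    """Split object keys into (products, metadata) lists."""
    # the query prints null when the bucket is empty
    objects = json.loads(list_objects_output) or []
    products, metadata = [], []
    for obj in objects:
        filename = PurePosixPath(obj["Key"]).name
        is_metadata = filename.startswith(".") and filename.endswith(".metadata")
        (metadata if is_metadata else products).append(obj["Key"])
    return products, metadata
```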

Wrapping up

In this post, we learned how to upload our code and execute it in AWS Batch via a Docker image. We also configured AWS Batch to read from and write to an S3 bucket. With this configuration, we can start running Data Science experiments in a scalable way without worrying about maintaining infrastructure!

In the next (and final) post of this series, we’ll see how to easily generate hundreds of experiments and retrieve the results.

If you want to be the first to know when the final part comes out, follow us on Twitter, LinkedIn, or subscribe to our newsletter!

Epilogue: Cleaning up the infrastructure

If you wish to delete the infrastructure we created in this post, here are the commands.

Delete ECR repository:

aws ecr delete-repository \
    --repository-name ploomber-project-repository \
    --region $AWS_REGION \
    --force

Output:

{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-east-1:1234567890:repository/ploomber-project-repository",
        "registryId": "1234567890",
        "repositoryName": "ploomber-project-repository",
        "repositoryUri": "1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository",
        "createdAt": "2022-10-21T15:20:27-05:00",
        "imageTagMutability": "MUTABLE"
    }
}

Delete S3 bucket:

aws s3 rb s3://$BUCKET_NAME --force

Output:

delete: s3://ploomber-bucket-zxmmcm/outputs/output/.get.parquet.metadata
delete: s3://ploomber-bucket-zxmmcm/outputs/output/get.parquet
remove_bucket: ploomber-bucket-zxmmcm
