In our previous post, we saw how to configure AWS Batch and tested our infrastructure by executing a task that spun up a container, waited for 3 seconds, and shut down.
In this post, we’ll leverage the existing infrastructure, but this time, we’ll execute a more interesting example.
We’ll ship our code to AWS by building a Docker image and storing it in Amazon ECR, a service that allows us to store Docker images.
Authenticating with the `aws` CLI
We’ll be using the `aws` CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:
aws configure
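If you want to double-check that your credentials work, `aws sts get-caller-identity` prints the account and identity you’re authenticated as:
```sh
# verify the credentials are valid (prints account id, user id, and ARN)
aws sts get-caller-identity
```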
Checking Docker
We’ll be using Docker for this part, so ensure it’s up and running:
docker run hello-world
Output:
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
Creating an Amazon ECR repository
We first create a repository, which will host our Docker images:
# set the aws region to use for all resources
AWS_REGION=us-east-1
Create ECR repository:
aws ecr create-repository \
--repository-name ploomber-project-repository \
--query repository.repositoryUri \
--output text \
--region $AWS_REGION
Output:
1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository
The command above will print the repository URI; assign it to the following variable, since we’ll need it later:
REPOSITORY=1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository
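If you’d rather not copy the URI by hand, you can fetch it with `describe-repositories` (a small convenience, assuming the repository above was created successfully):
```sh
# fetch the repository URI and store it in a variable
REPOSITORY=$(aws ecr describe-repositories \
    --repository-names ploomber-project-repository \
    --query 'repositories[0].repositoryUri' \
    --output text \
    --region $AWS_REGION)
echo $REPOSITORY
```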
Getting some sample code
We’ll now use two open source tools (Ploomber and Soopervisor) to write our computational task, generate a Docker image, push it to ECR, and schedule a job in AWS Batch.
Let’s install the packages:
Note: We recommend installing them in a virtual environment.
pip install ploomber soopervisor --upgrade --quiet
Let’s get an example that trains and evaluates a Machine Learning model:
ploomber examples -n templates/ml-basic -o example
Output:
Loading examples...
================ Copying example templates/ml-basic to example/ ================
Next steps:
$ cd example/
$ ploomber install
Open example/README.md for details.
Let’s see the files:
ls example
Output:
README.ipynb clients.py pipeline.yaml
README.md environment.yml requirements.txt
_source.md fit.py tasks.py
The structure is a typical Ploomber project. Ploomber allows you to easily organize computational workflows as functions, scripts, or notebooks and execute them locally. To learn more, check out Ploomber’s documentation.
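For reference, a Ploomber pipeline is declared in a `pipeline.yaml` file that lists tasks and their products. Here’s a minimal sketch (illustrative only; the actual `ml-basic` example declares more tasks and settings):
```yaml
tasks:
  # a task can be a function, script, or notebook; Ploomber tracks its product
  - source: tasks.get
    product: output/get.parquet
```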
On the other hand, Soopervisor allows you to export a Ploomber project and execute it in the cloud.
The next command will tell Soopervisor to create the necessary files so we can export to AWS Batch:
cp example/requirements.txt example/requirements.lock.txt
pip install -r example/requirements.txt --quiet
cd example
soopervisor add aws-env --backend aws-batch
cd ..
Output:
================================= Loading DAG ==================================
No pipeline.aws-env.yaml found, looking for pipeline.yaml instead
Found /Users/Edu/dev/ploomber.io/raw/ds-platform-part-ii/example/pipeline.yaml. Loading...
= Adding /Users/Edu/dev/ploomber.io/raw/ds-platform-part-ii/example/aws-env/Dockerfile... =
===================================== Done =====================================
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it:
$ soopervisor export aws-env
To force execution of all tasks:
$ soopervisor export aws-env --mode force
`soopervisor add` will create a `soopervisor.yaml` file and an `aws-env` folder. The `aws-env` folder contains a `Dockerfile` (which we need to create a Docker image):
ls example/aws-env
Output:
Dockerfile
The `soopervisor.yaml` file contains configuration parameters:
cat example/soopervisor.yaml
Output:
aws-env:
backend: aws-batch
container_properties: {memory: 16384, vcpus: 8}
exclude: [output]
job_queue: your-job-queue
region_name: your-region-name
repository: your-repository/name
There are a few parameters we have to configure here (we created a small script to generate the configuration file):
- `job_queue`: the name of your job queue
- `region_name`: the region where your AWS Batch infrastructure is located
- `repository`: the ECR repository URI
Here are the values for my infrastructure (replace them with yours):
# this is the name we put in the previous post
JOB_QUEUE=ploomber-batch-queue
Note: If you don’t have the job queue name, you can get it from the AWS console (ensure you’re in the right region).
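Alternatively, you can list the job queues in your account from the CLI:
```sh
# list job queue names in the current region
aws batch describe-job-queues \
    --query 'jobQueues[].jobQueueName' \
    --output text \
    --region $AWS_REGION
```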
Let’s download a utility script to facilitate creating the configuration files:
curl -O https://raw.githubusercontent.com/ploomber/posts/master/ds-platform-part-ii/generate.py
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3211 100 3211 0 0 10811 0 --:--:-- --:--:-- --:--:-- 10811
Create the `soopervisor.yaml` configuration file:
python generate.py config \
--directory example \
--queue $JOB_QUEUE \
--region $AWS_REGION \
--repository $REPOSITORY
Output:
Config file stored at: example/soopervisor.yaml
This is what the file looks like:
cat example/soopervisor.yaml
Output:
aws-env:
backend: aws-batch
container_properties: {memory: 16384, vcpus: 8}
exclude: [output]
job_queue: ploomber-batch-queue
region_name: us-east-1
repository: 1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository
Let’s now use `soopervisor export` to execute our pipeline in AWS Batch. This command will do a few things for us:
- Build the Docker container
- Push it to the Amazon ECR repository
- Submit the jobs to AWS Batch
We need to install `boto3`, since it’s required to submit jobs to AWS Batch:
pip install boto3 --quiet
Authenticate with Amazon ECR so we can push images:
aws ecr get-login-password \
--region $AWS_REGION \
| docker login \
--username AWS \
--password-stdin $REPOSITORY
Output:
Login Succeeded
Let’s now export the project. Bear in mind that this command will take a few minutes:
cd example
soopervisor export aws-env --task get --mode force \
--ignore-git --skip-tests > ../get-first.log 2>&1
cd ..
If all goes well, you’ll see something like this:
========================= Submitting jobs to AWS Batch =========================
=============== Registering 'example' job definition... ===============
============================== Submitting jobs... ==============================
Submitted task 'get'...
========================= Done. Submitted to AWS Batch =========================
Note: You can see the logs in the `../get-first.log` file.
If you encounter issues with the `soopervisor export` command, or are unable to push to ECR, join our community and we’ll help you!
Once the command finishes execution, the job will be submitted to AWS Batch. Let’s use the `aws` CLI to list the jobs submitted to the queue:
aws batch list-jobs --job-queue $JOB_QUEUE \
--filters 'name=JOB_NAME,values=get' \
--query 'jobSummaryList' \
--region $AWS_REGION
Output:
[
{
"jobArn": "arn:aws:batch:us-east-1:1234567890:job/44149f83-3549-406b-8deb-053fbad24619",
"jobId": "44149f83-3549-406b-8deb-053fbad24619",
"jobName": "get",
"createdAt": 1666383930050,
"status": "SUCCEEDED",
"stoppedAt": 1666384038811,
"container": {
"exitCode": 0
},
"jobDefinition": "arn:aws:batch:us-east-1:1234567890:job-definition/example-5bf45a9e:1"
}
]
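If you want to inspect the task’s logs, AWS Batch sends them to CloudWatch under the `/aws/batch/job` log group. Here’s a sketch (replace the placeholder with the `jobId` from the output above):
```sh
# get the log stream name for the job, then fetch its log events
LOG_STREAM=$(aws batch describe-jobs --jobs <your-job-id> \
    --query 'jobs[0].container.logStreamName' \
    --output text \
    --region $AWS_REGION)
aws logs get-log-events \
    --log-group-name /aws/batch/job \
    --log-stream-name $LOG_STREAM \
    --region $AWS_REGION
```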
After a minute, you’ll see that the task shows as SUCCEEDED (it’ll appear as RUNNABLE, STARTING, or RUNNING if it hasn’t finished).
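If you’d rather not re-run the command by hand, here’s a small polling sketch (it assumes a single job named `get` in the queue, so `jobSummaryList[0]` is the one we just submitted):
```sh
# poll the job status every 15 seconds until it finishes
while true; do
    STATUS=$(aws batch list-jobs --job-queue $JOB_QUEUE \
        --filters 'name=JOB_NAME,values=get' \
        --query 'jobSummaryList[0].status' \
        --output text \
        --region $AWS_REGION)
    echo "status: $STATUS"
    if [ "$STATUS" = "SUCCEEDED" ] || [ "$STATUS" = "FAILED" ]; then
        break
    fi
    sleep 15
done
```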
However, there is a catch: AWS Batch ran our code, but it shut down the EC2 instance shortly after, so we no longer have access to the output.
To fix that, we’ll add an S3 client to our project, so all outputs are stored.
Creating an S3 bucket to store outputs
Let’s first create a bucket in S3. S3 bucket names must be unique; you can run the following snippet in your terminal, or choose a unique name and assign it to the `BUCKET_NAME` variable:
BUCKET_NAME=$(python generate.py bucket)
echo "Bucket name is $BUCKET_NAME"
Output:
Bucket name is ploomber-bucket-zxmmcm
aws s3api create-bucket \
--bucket $BUCKET_NAME \
--region $AWS_REGION
Output:
{
"Location": "/ploomber-bucket-zxmmcm"
}
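Note: the command above works as-is in `us-east-1`. If you’re using another region, S3 requires an explicit location constraint, along these lines:
```sh
# in regions other than us-east-1, pass a LocationConstraint
aws s3api create-bucket \
    --bucket $BUCKET_NAME \
    --region $AWS_REGION \
    --create-bucket-configuration LocationConstraint=$AWS_REGION
```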
Adding a client to our pipeline
Ploomber allows us to specify an S3 bucket and it’ll take care of uploading all outputs for us. We only have to create a short file; the `generate.py` script can create one for us:
python generate.py client \
--directory example \
--bucket-name $BUCKET_NAME
Output:
Clients file stored at: example/clients.py
We need to configure our `pipeline.yaml` file so it uploads artifacts to S3. Let’s use the `generate.py` script to do this for us:
python generate.py client-cfg --directory example
Furthermore, let’s add `boto3` to our dependencies, since we’ll be calling it to upload artifacts to S3:
echo -e '\nboto3' >> example/requirements.lock.txt
Giving AWS Batch permissions to access the bucket
Let’s add S3 permissions to our AWS Batch tasks. Generate a policy:
python generate.py policy --bucket-name $BUCKET_NAME
Output:
Policy file stored at: s3-policy.json
Apply it:
aws iam put-role-policy --role-name ploomber-ecs-instance-role \
--policy-name ploomber-s3-policy \
--policy-document file://s3-policy.json
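To confirm the policy was attached, you can read it back:
```sh
# verify the inline policy is attached to the role
aws iam get-role-policy \
    --role-name ploomber-ecs-instance-role \
    --policy-name ploomber-s3-policy
```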
Executing the workload
We’re now ready to execute our task in AWS Batch!
Let’s ensure we can push to ECR:
aws ecr get-login-password \
--region $AWS_REGION \
| docker login \
--username AWS \
--password-stdin $REPOSITORY
Output:
Login Succeeded
Submit the task again:
cd example
soopervisor export aws-env --task get --mode force \
--ignore-git --skip-tests > ../get-second.log 2>&1
cd ..
Note that this time, the `soopervisor export` command is a lot faster, since the Docker image is cached!
Let’s check the status of the task:
aws batch list-jobs --job-queue $JOB_QUEUE \
--filters 'name=JOB_NAME,values=get' \
--query 'jobSummaryList[*].status' \
--region $AWS_REGION
Output:
[
"SUCCEEDED",
"SUCCEEDED"
]
After a minute, you should see it as SUCCEEDED.
If we check the contents of our bucket, we’ll see the task output (a `.parquet` file):
aws s3api list-objects --bucket $BUCKET_NAME \
--query 'Contents[].{Key: Key, Size: Size}'
Output:
[
{
"Key": "outputs/output/.get.parquet.metadata",
"Size": 361
},
{
"Key": "outputs/output/get.parquet",
"Size": 5627
}
]
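To inspect the output locally, you can download it with `aws s3 cp`:
```sh
# download the task's output from S3 to the current directory
aws s3 cp s3://$BUCKET_NAME/outputs/output/get.parquet .
```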
Wrapping up
In this post, we learned how to upload our code and execute it in AWS Batch via a Docker image. We also configured AWS Batch to read from and write to an S3 bucket. With this configuration, we can start running Data Science experiments in a scalable way without worrying about maintaining infrastructure!
In the next (and final) post of this series, we’ll see how to easily generate hundreds of experiments and retrieve the results.
If you want to be the first to know when the final part comes out, follow us on Twitter, LinkedIn, or subscribe to our newsletter!
Epilogue: Cleaning up the infrastructure
If you wish to delete the infrastructure we created in this post, here are the commands.
Delete ECR repository:
aws ecr delete-repository \
--repository-name ploomber-project-repository \
--region $AWS_REGION \
--force
Output:
{
"repository": {
"repositoryArn": "arn:aws:ecr:us-east-1:1234567890:repository/ploomber-project-repository",
"registryId": "1234567890",
"repositoryName": "ploomber-project-repository",
"repositoryUri": "1234567890.dkr.ecr.us-east-1.amazonaws.com/ploomber-project-repository",
"createdAt": "2022-10-21T15:20:27-05:00",
"imageTagMutability": "MUTABLE"
}
}
Delete S3 bucket:
aws s3 rb s3://$BUCKET_NAME --force
Output:
delete: s3://ploomber-bucket-zxmmcm/outputs/output/.get.parquet.metadata
delete: s3://ploomber-bucket-zxmmcm/outputs/output/get.parquet
remove_bucket: ploomber-bucket-zxmmcm