Customizing Your Cloud-Based Machine Learning Training Environment — Part 2

This is the second part of a two-part post on the topic of customizing your cloud-based AI model training environment. In the first part, a prerequisite for this part, we introduced the conflict that may arise between the desire to use a pre-built specially-designed training environment and the requirement that we have the ability to customize the environment to our project’s needs. The key to discovering potential opportunities for customization is a deep understanding of the end-to-end flow of running a training job in the cloud. We described this flow for the managed Amazon SageMaker training service while emphasizing the value of analyzing the publicly available underlying source code. We then presented the first method for customization — installing pip package dependencies at the very beginning of the training session — and demonstrated its limitations.

In this post we will present two additional methods. Both involve creating our own custom Docker image, but they are fundamentally different in their approach. The first method takes an official image provided by the cloud service and extends it according to the project's needs. The second takes a user-defined (cloud-agnostic) Docker image and extends it to support training in the cloud. As we will see, each has its pros and cons, and the best option will depend heavily on the details of your project.

Extending the Official Cloud Service Docker Image

Creating a fully functional, performance-optimized Docker image for training on a cloud-based GPU can be painstaking, requiring you to navigate a multitude of intertwined hardware and software dependencies. Doing this for a wide variety of training use cases and hardware platforms is even more difficult. Rather than attempt this on our own, our first choice will always be to take advantage of the pre-built image created for us by the cloud service provider. If we need to customize this image, we simply create a new Dockerfile that extends the official image and adds the required dependencies.

The AWS Deep Learning Container (DLC) github repository includes instructions for extending an official AWS DLC. This requires logging in to access the Deep Learning Containers image repository in order to pull the image, build the extended image, and then upload it to an Amazon Elastic Container Registry (ECR) in your account.
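As a concrete sketch, this pull-build-push workflow might look roughly as follows. The region, account ID, and repository name are placeholders; consult the linked AWS instructions for the authoritative steps.

```shell
# Placeholders: set these to your own region, account ID, and repository name
REGION=us-east-1
ACCOUNT=123456789012
REPO=my-extended-dlc

# 1. Log in to the AWS DLC registry (account 763104351884) so the base image can be pulled
aws ecr get-login-password --region $REGION | \
    docker login --username AWS --password-stdin 763104351884.dkr.ecr.$REGION.amazonaws.com

# 2. Build the extended image from your Dockerfile
docker build -t $REPO .

# 3. Log in to your own ECR, tag, and push
#    (create the repository first if it does not exist:
#     aws ecr create-repository --repository-name $REPO)
aws ecr get-login-password --region $REGION | \
    docker login --username AWS --password-stdin $ACCOUNT.dkr.ecr.$REGION.amazonaws.com
docker tag $REPO:latest $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
docker push $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
```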

The following code block demonstrates how to extend the official AWS DLC from our SageMaker example (in part 1). We show three types of extensions:

  1. Linux Package: We install Nvidia Nsight Systems for advanced GPU profiling of our training jobs.
  2. Conda Package: We install the S5cmd conda package which we use for pulling data files from cloud storage.
  3. Pip Package: We install a specific version of the opencv-python pip package.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker

# install nsys
ADD https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/NsightSystems-linux-cli-public-2023.1.1.127-3236574.deb ./
RUN apt install -y ./NsightSystems-linux-cli-public-2023.1.1.127-3236574.deb

# install s5cmd
RUN conda install -y s5cmd

# install opencv
RUN pip install opencv-python==4.7.0.72
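Once installed, s5cmd can be invoked from the training script, for example via subprocess. The sketch below is purely illustrative (the bucket path and local directory are placeholders, and the helper function name is our own):

```python
def pull_data_cmd(s3_uri, local_dir):
    """Build the s5cmd command for copying data files from cloud storage.
    In a real training script you would execute the returned command with
    subprocess.run(cmd, check=True)."""
    return ["s5cmd", "cp", f"{s3_uri}/*", f"{local_dir}/"]

if __name__ == "__main__":
    print(" ".join(pull_data_cmd("s3://my-bucket/train-data", "/tmp/data")))
```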

For more details on extending the official AWS DLC, including how to upload the resultant image to ECR, see here. The code block below shows how to modify the training job deployment script to use the extended image:

from sagemaker.pytorch import PyTorch

# define the training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./source_dir',
    role='<arn role>',
    image_uri='<account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>',
    job_name='demo',
    instance_type='ml.g5.xlarge',
    instance_count=1
)

A similar option for customizing an official image, assuming you have access to its corresponding Dockerfile, is to make the desired edits to the Dockerfile and build the image from scratch. For AWS DLC, this option is documented here. However, keep in mind that although based on the same Dockerfile, the resultant image might differ due to differences in the build environment and updated package versions.

Environment customization via extension of an official Docker image is a great way to get the most out of the fully functional, fully validated, cloud-optimized training environment predefined by the cloud service, while still giving you the freedom and flexibility to make the additions and adaptations your project requires. However, this option has its limitations, as the next section demonstrates.

Training in a User Defined Python Environment


For a variety of reasons, you may require the ability to train in a user-defined Python environment. This could be for the sake of reproducibility, platform independence, safety/security/compliance considerations, or some other purpose. One option you might consider would be to extend an official Docker image with your custom Python environment. That way you could, at the very least, benefit from the platform related installations and optimizations from the image. However, this could get kind of tricky if your intended use relies on some form of Python based automation. For example, in a managed training environment, the Dockerfile ENTRYPOINT runs a Python script that performs all kinds of actions including downloading the code source directory from cloud storage, installing Python dependencies, running the user defined training script, and more. This Python script resides in the predefined Python environment of the official Docker image. Programming the automated script to start up the training script in a separate Python environment is doable but might require some manual code changes in the predefined environment and could get very messy. In the next section we will demonstrate a cleaner way of doing this.
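To make this concrete, here is a highly simplified Python sketch of the kind of start-up flow such an automated script performs. The function names and paths are hypothetical, not SageMaker's actual implementation; the point is that every step runs inside the image's predefined Python environment.

```python
import subprocess
import sys

def build_startup_plan(code_s3_uri, requirements, entry_point):
    """Return the ordered list of shell commands a managed start-up
    script might run. Purely illustrative; paths are placeholders."""
    plan = []
    # 1. download the user's source directory from cloud storage
    plan.append(["aws", "s3", "cp", code_s3_uri, "/opt/ml/code", "--recursive"])
    # 2. install the user's Python dependencies into the *current* environment
    for pkg in requirements:
        plan.append([sys.executable, "-m", "pip", "install", pkg])
    # 3. launch the user-defined training script with the *current* interpreter
    plan.append([sys.executable, f"/opt/ml/code/{entry_point}"])
    return plan

def run_plan(plan, dry_run=True):
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_plan(build_startup_plan("s3://my-bucket/code", ["opencv-python"], "train.py"))
```

Because steps 2 and 3 are tied to `sys.executable`, redirecting them into a separate user-defined conda environment means patching this flow inside the predefined environment, which is exactly the messiness described above.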

Bringing Your Own Image (BYO)

The final scenario we consider is one in which you are required to train in a specific environment defined by your own Docker image. As before, the drive for this could be regulatory, or the desire to run with the same image in the cloud as you do locally (“on-prem”). Some cloud services provide the ability to bring your own user-defined image and adapt it for use in the cloud. In this section we demonstrate two ways in which Amazon SageMaker supports this.

BYO Option 1: The SageMaker Training Toolkit

The first option, documented here, allows you to add the specialized (managed) training start-up flow we described in part 1 into your custom Python environment. This essentially enables you to train in SageMaker using your custom image in the same manner in which you would use an official image. In particular, you can reuse the same image for multiple projects/experiments and rely on the SageMaker APIs to download the experiment-specific code into the training environment at start-up (as described in part 1). You do not need to create and upload a new image every time you change your training code.

The code block below demonstrates how to take a custom image and enhance it with the SageMaker training toolkit following the instructions detailed here.

FROM user_defined_docker_image

RUN echo "conda activate user_defined_conda_env" >> ~/.bashrc
SHELL ["/bin/bash", "--login", "-c"]

ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
RUN conda activate user_defined_conda_env \
&& pip install --no-cache-dir -U sagemaker-pytorch-training sagemaker-training

# sagemaker uses jq to compile executable
RUN apt-get update \
&& apt-get -y upgrade --only-upgrade systemd \
&& apt-get install -y --allow-change-held-packages --no-install-recommends \
jq


# SageMaker assumes conda environment is in Path
ENV PATH /opt/conda/envs/user_defined_conda_env/bin:$PATH

# delete entry point and args if provided by parent Dockerfile
ENTRYPOINT []
CMD []

BYO Option 2: Configuring the Entrypoint

The second option, documented here, allows you to train in SageMaker in a user-defined Docker environment with zero changes to the Docker image. All that is required is to explicitly set the ENTRYPOINT instruction of the Docker container. One of the ways to do this (as documented here) is to pass in ContainerEntrypoint and/or ContainerArguments parameters to the AlgorithmSpecification of the API request. Unfortunately, as of the time of this writing, this option is not supported by the SageMaker Python API (version 2.146.1). However, we can easily enable this by extending the SageMaker Session class as demonstrated in the code block below:

from sagemaker.session import Session

# customized session class that supports adding container entrypoint settings
class SessionEx(Session):
    def __init__(self, **kwargs):
        self.user_entrypoint = kwargs.pop('user_entrypoint', None)
        self.user_arguments = kwargs.pop('user_arguments', None)
        super(SessionEx, self).__init__(**kwargs)

    def _get_train_request(self, **kwargs):
        train_request = super(SessionEx, self)._get_train_request(**kwargs)
        if self.user_entrypoint:
            train_request["AlgorithmSpecification"]["ContainerEntrypoint"] = \
                [self.user_entrypoint]
        if self.user_arguments:
            train_request["AlgorithmSpecification"]["ContainerArguments"] = \
                self.user_arguments
        return train_request

from sagemaker.pytorch import PyTorch

# create session with user defined entrypoint and arguments
# SageMaker will run 'docker run --entrypoint python <user image> path2file.py'
sm_session = SessionEx(user_entrypoint='python',
                       user_arguments=['path2file.py'])

# define the training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./source_dir',
    role='<arn role>',
    image_uri='<account-number>.dkr.ecr.us-east-1.amazonaws.com/<tag>',
    job_name='demo',
    instance_type='ml.g5.xlarge',
    instance_count=1,
    sagemaker_session=sm_session
)

Optimizing Your Docker Image


One of the disadvantages of the BYO option is that you lose the opportunity to benefit from the specializations of the official pre-built image. You can manually and selectively reintroduce some of them into your custom image. For example, the SageMaker documentation includes detailed instructions for integrating support for Amazon EFA. Moreover, you always have the option of looking back at the publicly available Dockerfile to cherry-pick what you need.

Summary

In this two-part post we have discussed different methods for customizing your cloud-based training environment. The methods we chose were intended to demonstrate ways of addressing different types of use cases. In practice, the best solution will depend directly on your project's needs. You might decide to create a single custom Docker image for all of your training experiments and combine this with an option to install experiment-specific dependencies (using the first method). You might find that a different method, not discussed here, such as one that involves tweaking some portion of the sagemaker-training Python package, better suits your needs. The bottom line is that when you are faced with a need to customize your training environment — you have options; and if the standard options we have covered do not suffice, do not despair, get creative!
