AWS estimates that inference (the process of using a trained machine learning [ML] algorithm to make a prediction) makes up 90 percent of the cost of an ML model. Because you pay for what you use with AWS, we estimate that inference also generally accounts for most of the resource usage within an ML lifecycle.
In this series, we’re following the phases of the Well-Architected machine learning lifecycle (Figure 1) to optimize your artificial intelligence (AI)/ML workloads. In Part 3, our final piece in the series, we show you how to reduce the environmental impact of your ML workload once your model is in production.
If you missed the first parts of this series, in Part 1, we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing. In Part 2, we identified ways to reduce the environmental impact of developing, training, and tuning ML models.
Figure 1. ML lifecycle
Select sustainable AWS Regions
As mentioned in Part 1, select an AWS Region with sustainable energy sources. When regulations and legal aspects allow, choose Regions near Amazon renewable energy projects and Regions where the grid has low published carbon intensity to deploy your model.
Align SLAs with sustainability goals
Define SLAs that support your sustainability goals while meeting your business requirements:
- If your users can tolerate some latency, deploy your model on asynchronous endpoints to reduce resources that are idle between tasks and minimize the impact of load spikes. Asynchronous endpoints will automatically scale the instance count to zero when there are no requests to process, so you only maintain an inference infrastructure when your endpoint is processing requests.
- If your workload doesn’t require high availability, deploy it to a single Availability Zone to reduce the cloud resources you consume. Adjusting availability is an example of a conscious trade-off you can make to meet your sustainability targets.
- When you don’t need real-time inference, use Amazon SageMaker batch transform. Unlike persistent endpoints, clusters are decommissioned when batch transform jobs finish so you don’t continuously maintain an inference infrastructure.
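As a minimal sketch of the asynchronous option above, the following Python builds the request parameters you could pass to the SageMaker `create_endpoint_config` call and to the Application Auto Scaling `register_scalable_target` call. The endpoint, model, and bucket names are hypothetical placeholders, and the helper functions are illustrative rather than part of any AWS SDK. The sketch assumes you also attach a scaling policy, which is what actually drives the instance count down to zero.

```python
def async_endpoint_config(name, model_name, instance_type, s3_output_path,
                          max_concurrent_per_instance=4):
    """Build keyword arguments for SageMaker's create_endpoint_config call."""
    return {
        "EndpointConfigName": name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
        # Presence of AsyncInferenceConfig makes the endpoint asynchronous:
        # results are written to S3 instead of returned in the response.
        "AsyncInferenceConfig": {
            "OutputConfig": {"S3OutputPath": s3_output_path},
            "ClientConfig": {
                "MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance
            },
        },
    }


def scale_to_zero_target(endpoint_name, variant_name="AllTraffic", max_instances=2):
    """Build keyword arguments for register_scalable_target with a floor of zero."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,  # allows the variant to idle with no instances
        "MaxCapacity": max_instances,
    }


# Hypothetical names, for illustration only.
config = async_endpoint_config(
    "demo-async-config", "demo-model", "ml.m5.large", "s3://demo-bucket/async-out/"
)
target = scale_to_zero_target("demo-async-endpoint")
# With boto3 and credentials available, these could be sent as:
#   boto3.client("sagemaker").create_endpoint_config(**config)
#   boto3.client("application-autoscaling").register_scalable_target(**target)
```

Keeping the request construction separate from the API call makes it easy to review the settings before any infrastructure is created.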
Use efficient silicon
For CPU-based ML inference, use AWS Graviton3. These processors offer the best performance per watt in Amazon Elastic Compute Cloud (Amazon EC2). They use up to 60% less energy than comparable EC2 instances. Graviton3 processors deliver up to three times better performance compared to Graviton2 processors for ML workloads, and they support bfloat16.
For deep learning workloads, the Amazon EC2 Inf1 instances (based on custom designed AWS Inferentia chips) deliver 2.3 times higher throughput and 80% lower cost compared to g4dn instances. Inf1 has 50% higher performance per watt than g4dn, which makes it the most sustainable ML accelerator Amazon EC2 offers.
Make efficient use of GPU
While training jobs batch process hundreds of data samples in parallel, inference jobs usually process a single input in real time, and thus consume a small amount of GPU compute. Elastic Inference allows you to reduce the cost and environmental impact of your inference by using GPU resources more efficiently.
Optimize models for inference
Improve efficiency of your models by compiling them into optimized forms with the following:
- Various open-source libraries (like Treelite for decision tree ensembles)
- Third-party tools like Hugging Face Infinity, which allows you to speed up transformer models and run inference not only on GPU but also on CPU.
- SageMaker Neo, whose runtime consumes as little as one-tenth the footprint of a deep learning framework and which optimizes models to perform up to 25 times faster with no loss in accuracy (example with XGBoost).
Deploying more efficient models means you need fewer resources for inference.
Deploy multiple models behind a single endpoint
SageMaker provides three methods to deploy multiple models to a single endpoint to improve endpoint utilization:
- Host multiple models in one container behind one endpoint. Multi-model endpoints are served using a single container. This can help you cut up to 90 percent of your inference costs and carbon emissions.
- Host multiple models that use different containers behind one endpoint.
- Host a linear sequence of containers in an inference pipeline behind a single endpoint.
Sharing endpoint resources is more sustainable and less expensive than deploying a single model behind one endpoint.
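To illustrate the first method, here is a minimal sketch of the two pieces that distinguish a multi-model endpoint: a container definition in `MultiModel` mode that points at an S3 prefix holding many model archives, and an invocation that names the archive to use through `TargetModel`. The image URI, bucket, endpoint, and artifact names are hypothetical, and the helper functions are illustrative only.

```python
def multi_model_container(image_uri, s3_model_prefix):
    """Container definition for create_model's PrimaryContainer argument."""
    return {
        "Image": image_uri,
        "Mode": "MultiModel",             # one container serves many models
        "ModelDataUrl": s3_model_prefix,  # S3 prefix containing the model archives
    }


def invoke_kwargs(endpoint_name, target_model, payload, content_type="text/csv"):
    """Keyword arguments for the SageMaker runtime invoke_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # archive name relative to the S3 prefix
        "ContentType": content_type,
        "Body": payload,
    }


# Hypothetical values, for illustration only.
container = multi_model_container(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-image:latest",
    "s3://demo-bucket/models/",
)
request = invoke_kwargs("demo-mme", "customer-a.tar.gz", b"1.0,2.0,3.0")
# With boto3 and credentials available, the request could be sent as:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**request)
```

Because every model shares the same instances, adding a new model is a matter of uploading another archive under the prefix rather than provisioning another endpoint.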
Right-size your inference environment
Right-size your endpoints by using metrics from Amazon CloudWatch or by using the Amazon SageMaker Inference Recommender. This tool can run load testing jobs and recommend the proper instance type to host your model. When you use the appropriate instance type, you limit the carbon emission associated with over-provisioning.
If your workload has intermittent or unpredictable traffic, configure autoscaling inference endpoints in SageMaker to optimize your endpoints. Autoscaling monitors your endpoints and dynamically adjusts their capacity to maintain steady and predictable performance using as few resources as possible. You can also try Serverless Inference (in preview), which automatically launches compute resources and scales them in and out depending on traffic, eliminating idle resources.
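One common way to configure endpoint autoscaling is a target tracking policy on invocations per instance. The sketch below builds the parameters for the Application Auto Scaling `put_scaling_policy` call. The endpoint name, target value, and cooldown periods are assumptions chosen for illustration, so tune them against your own load tests.

```python
def target_tracking_policy(endpoint_name, variant_name, invocations_per_instance,
                           scale_in_cooldown=300, scale_out_cooldown=60):
    """Build keyword arguments for Application Auto Scaling's put_scaling_policy."""
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-invocations",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Keep average invocations per instance per minute near this value.
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": scale_in_cooldown,    # seconds before removing capacity
            "ScaleOutCooldown": scale_out_cooldown,  # seconds before adding capacity
        },
    }


# Hypothetical endpoint and target value, for illustration only.
policy = target_tracking_policy("demo-endpoint", "AllTraffic", 70)
# With boto3 and credentials available, the policy could be applied as:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

A longer scale-in cooldown than scale-out cooldown is a deliberate choice here: it adds capacity quickly during spikes and removes it cautiously to avoid oscillation.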
Consider inference at the edge
When working on Internet of Things (IoT) use cases, evaluate if ML inference at the edge can reduce the carbon footprint of your workload. To do this, consider factors like the compute capacity of your devices, their energy consumption, or the emissions related to data transfer to the cloud. When deploying ML models to edge devices, consider using SageMaker Edge Manager, which integrates with SageMaker Neo and AWS IoT Greengrass (Figure 2).
Figure 2. Run inference at the edge with SageMaker Edge
Device manufacturing represents 32-57 percent of the global Information Communication Technology carbon footprint. If your ML model is optimized, it requires fewer compute resources. You can then perform inference on lower-specification machines, which minimizes the environmental impact of the device manufacturing and uses less energy.
The following techniques compress the size of models for deployment, which speeds up inference and saves energy without significant loss of accuracy:
- Pruning removes weights (learnable parameters) that don’t contribute much to the model.
- Quantization represents numbers with fewer bits without incurring significant loss in accuracy. Specifically, you can reduce resource usage by replacing the usual single-precision floating-point (32 bit) parameters in an inference model with half-precision (16 bit), bfloat16 (16 bit, but the same dynamic range as 32 bit), or 8-bit integer values.
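The two techniques above can be sketched on a plain list of weights. This is a toy illustration in pure Python, assuming magnitude-based pruning and symmetric 8-bit quantization with a single scale factor. Production frameworks apply the same ideas per layer or per channel and usually fine-tune afterwards to recover accuracy.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Indices of the k weights that contribute least (smallest absolute value).
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    drop = set(smallest)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.02, -1.5, 0.7, -0.01, 0.3, 1.27]
pruned = prune_by_magnitude(weights, sparsity=0.5)
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Rounding bounds the error of each restored weight by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each quantized weight needs 8 bits instead of 32, so the stored parameters shrink by roughly a factor of four, at the cost of the bounded rounding error computed above.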
Archive or delete unnecessary artifacts
Compress and reduce the volume of logs you keep during the inference phase. By default, CloudWatch retains logs indefinitely. By setting limited retention time for your inference logs, you’ll avoid the carbon footprint of unnecessary log storage. Also delete unused versions of your models and custom container images from your repositories.
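As a sketch of setting a limited retention time, the helper below builds the parameters for the CloudWatch Logs `put_retention_policy` call. The API accepts only specific retention values. The tuple here is a partial list used as an assumption for illustration, so check the CloudWatch Logs documentation for the full set. The log group name is a hypothetical example.

```python
# Partial list of retention periods (in days), assumed for illustration;
# the CloudWatch Logs API accepts additional, longer values.
ALLOWED_RETENTION_DAYS = (1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365)


def retention_policy(log_group_name, desired_days):
    """Build put_retention_policy arguments, rounding up to an allowed value."""
    for days in ALLOWED_RETENTION_DAYS:
        if days >= desired_days:
            return {"logGroupName": log_group_name, "retentionInDays": days}
    raise ValueError(f"No retention value of at least {desired_days} days in this sketch")


# Hypothetical log group, for illustration only.
policy = retention_policy("/aws/sagemaker/Endpoints/demo-endpoint", 45)
# With boto3 and credentials available, the policy could be applied as:
#   boto3.client("logs").put_retention_policy(**policy)
```

Rounding up rather than down keeps logs at least as long as you asked for, which is the safer default when a retention requirement comes from an audit or compliance rule.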
Retrain only when necessary
Models usually need to be retrained because of model drift, robustness requirements, or new ground truth data becoming available. Instead of retraining arbitrarily, monitor your ML model in production, automate your model drift detection, and only retrain when your model’s predictive performance has fallen below defined KPIs.
Consider SageMaker Pipelines, AWS Step Functions Data Science SDK for Amazon SageMaker, or third-party tools to automate your retraining pipelines.
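The retraining rule in this section can be reduced to a small decision function that a monitoring job calls before starting a pipeline. The sketch below assumes a single "higher is better" metric such as accuracy, averaged over the most recent evaluation windows. The function name, threshold, and minimum sample count are hypothetical choices, not part of any AWS service.

```python
def should_retrain(recent_scores, kpi_threshold, min_samples=3):
    """Return True when the mean of recent scores falls below the KPI threshold.

    Requiring `min_samples` evaluations avoids retraining on a single noisy
    measurement, which would waste compute.
    """
    if len(recent_scores) < min_samples:
        return False
    return sum(recent_scores) / len(recent_scores) < kpi_threshold


# Hypothetical accuracy values from successive monitoring windows.
healthy = should_retrain([0.91, 0.92, 0.90], kpi_threshold=0.85)
degraded = should_retrain([0.84, 0.80, 0.82], kpi_threshold=0.85)
too_few = should_retrain([0.50], kpi_threshold=0.85)
```

Only the `degraded` case returns `True`, so a scheduled check built on this rule would leave a healthy model untouched and skip the training run entirely.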
Measure results and improve
To monitor and quantify improvements during the inference phase, track the following metrics:
- Resources provisioned for your endpoints
- Efficient use of these resources (including DiskUtilization) in the CloudWatch Console
- The total size of the data captured by Amazon SageMaker Model Monitor using Amazon S3 Storage Lens
- The size of your CloudWatch log groups
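If you prefer to track utilization programmatically rather than in the console, the following sketch builds a CloudWatch `get_metric_statistics` query for an endpoint metric and averages the returned datapoints. It assumes the `/aws/sagemaker/Endpoints` namespace that SageMaker uses for instance metrics. The endpoint name and the sample response are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utilization_query(endpoint_name, variant_name, metric_name="DiskUtilization",
                      hours=24, now=None):
    """Build keyword arguments for CloudWatch's get_metric_statistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,  # one datapoint per hour
        "Statistics": ["Average"],
    }


def mean_utilization(response):
    """Average the hourly averages in a get_metric_statistics response."""
    points = [p["Average"] for p in response.get("Datapoints", [])]
    return sum(points) / len(points) if points else None


# Hypothetical endpoint and response, for illustration only.
fixed_now = datetime(2022, 1, 2, tzinfo=timezone.utc)
query = utilization_query("demo-endpoint", "AllTraffic", now=fixed_now)
sample_response = {"Datapoints": [{"Average": 10.0}, {"Average": 30.0}]}
average = mean_utilization(sample_response)
# With boto3 and credentials available, a real response could be fetched as:
#   boto3.client("cloudwatch").get_metric_statistics(**query)
```

A persistently low average is a signal to revisit the instance type or count chosen for the endpoint.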
AI/ML workloads can be energy intensive, but as called out by the UN and mentioned in the last IPCC report, AI can contribute to the mitigation of climate change and the achievement of several Sustainable Development Goals. As technology builders, it’s our responsibility to make sustainable use of AI and ML.
In this blog post series, we presented best practices you can use to make sustainability-conscious architectural decisions and reduce the environmental impact for your AI/ML workloads.
Other posts in this series
- Optimize AI/ML workloads for sustainability: Part 1, identify business goals, validate ML use, and process data
- Optimize AI/ML workloads for sustainability: Part 2, model development
About the Well-Architected Framework
These practices are part of the Sustainability Pillar of the AWS Well-Architected Framework. AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Use the AWS Well-Architected Tool to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. For follow-up questions or comments, join our growing community on AWS re:Post.