Optimize your modern data architecture for sustainability: Part 2 – unified data governance, data movement, and purpose-built analytics

In the first part of this blog series, Optimize your modern data architecture for sustainability: Part 1 – data ingestion and data lake, we focused on the 1) data ingestion, and 2) data lake pillars of the modern data architecture. In this blog post, we will provide guidance and best practices to optimize the components within the 3) unified data governance, 4) data movement, and 5) purpose-built analytics pillars.
Figure 1 shows the different pillars of the modern data architecture. It includes data ingestion, data lake, unified data governance, data movement, and purpose-built analytics pillars.

Modern Data Analytics Reference Architecture on AWS

Figure 1. Modern Data Analytics Reference Architecture on AWS

3. Unified data governance

A centralized Data Catalog is responsible for storing business and technical metadata about datasets in the storage layer. Administrators apply permissions in this layer and track events for security audits.

Data discovery

To increase data sharing and reduce data movement and duplication, enable data discovery and well-defined access controls for different user personas. This reduces redundant data processing activities. Separate teams within an organization can rely on this central catalog. It provides first-party data (such as sales data) or third-party data (such as stock prices, climate change datasets). You’ll only need access data once, rather than having to pull from source repeatedly.

AWS Glue Data Catalog can simplify the process for adding and searching metadata. Use AWS Glue crawlers to update the existing schemas and discover new datasets. Carefully plan schedules to reduce unnecessary crawling.

Data sharing

Establish well-defined access control mechanisms for different data consumers using services such as AWS Lake Formation. This will enable datasets to be shared between organizational units with fine-grained access control, which reduces redundant copying and movement. Use Amazon Redshift data sharing to avoid copying the data across data warehouses.

Well-defined datasets

Create well-defined datasets and associated metadata to avoid unnecessary data wrangling and manipulation. This will reduce resource usage that might result from additional data manipulation.

4. Data movement

AWS Glue provides serverless, pay-per-use data movement capability, without having to stand up and manage servers or clusters. Set up ETL pipelines that can process tens of terabytes of data.

To minimize idle resources without sacrificing performance, use auto scaling for AWS Glue.

You can create and share AWS Glue workflows for similar use cases by using AWS Glue blueprints, rather than creating an AWS Glue workflow for each use case. AWS Glue job bookmark can track previously processed data.

Consider using Glue Flex Jobs for non-urgent or non-time sensitive data integration workloads such as pre-production jobs, testing, and one-time data loads. With Flex, AWS Glue jobs run on spare compute capacity instead of dedicated hardware.

Joins between several dataframes is a common operation in Spark jobs. To reduce shuffling of data between nodes, use broadcast joins when one of the merged dataframes is small enough to be duplicated on all the executing nodes.

The latest AWS Glue version provides more new and efficient features for your workload.

5. Purpose-built analytics

Data Processing modes

Real-time data processing options need continuous computing resources and require more energy consumption. For the most favorable sustainability impact, evaluate trade-offs and choose the optimal batch data processing option.

Identify the batch and interactive workload requirements and design transient clusters in Amazon EMR. Using Spot Instances and configuring instance fleets can maximize utilization.

To improve energy efficiency, Amazon EMR Serverless can help you avoid over- or under-provisioning resources for your data processing jobs. Amazon EMR Serverless automatically determines the resources that the application needs, gathers these resources to process your jobs, and releases the resources when the jobs finish.

Amazon Redshift RA3 nodes can improve compute efficiency. With RA3 nodes, you can scale compute up and down without having to scale storage. You can choose Amazon Redshift Serverless to intelligently scale data warehouse capacity. This will deliver faster performance for the most demanding and unpredictable workloads.

Energy efficient transformation and data model design

Data processing and data modeling best practices can reduce your organization’s environmental impact.

To avoid unnecessary data movement between nodes in an Amazon Redshift cluster, follow best practices for table design.

You can also use automatic table optimization (ATO) for Amazon Redshift to self-tune tables based on usage patterns.

SaleBestseller No. 1
Acer Aspire 3 A315-24P-R7VH Slim Laptop | 15.6" Full HD IPS Display | AMD Ryzen 3 7320U Quad-Core Processor | AMD Radeon Graphics | 8GB LPDDR5 | 128GB NVMe SSD | Wi-Fi 6 | Windows 11 Home in S Mode
  • Purposeful Design: Travel with ease and look great...
  • Ready-to-Go Performance: The Aspire 3 is...
  • Visibly Stunning: Experience sharp details and...
  • Internal Specifications: 8GB LPDDR5 Onboard...
  • The HD front-facing camera uses Acer’s TNR...
Bestseller No. 2
HP Newest 14" Ultral Light Laptop for Students and Business, Intel Quad-Core N4120, 8GB RAM, 192GB Storage(64GB eMMC+128GB Micro SD), 1 Year Office 365, Webcam, HDMI, WiFi, USB-A&C, Win 11 S
  • 【14" HD Display】14.0-inch diagonal, HD (1366 x...
  • 【Processor & Graphics】Intel Celeron N4120, 4...
  • 【RAM & Storage】8GB high-bandwidth DDR4 Memory...
  • 【Ports】1 x USB 3.1 Type-C ports, 2 x USB 3.1...
  • 【Windows 11 Home in S mode】You may switch to...

Last update on 2024-04-05 / Affiliate links / Images from Amazon Product Advertising API

Use the EXPLAIN feature in Amazon Athena or Amazon Redshift to tune and optimize the queries.

The Amazon Redshift Advisor provides specific, tailored recommendations to optimize the data warehouse based on performance statistics and operations data.

Consider migrating Amazon EMR or Amazon OpenSearch Service to a more power-efficient processor such as AWS Graviton. AWS Graviton 3 delivers 2.5–3 times better performance over other CPUs. Graviton 3-based instances use up to 60% less energy for the same performance than comparable EC2 instances.

Minimize idle resources

Use auto scaling features in EMR Clusters or employ Amazon Kinesis Data Streams On-Demand to minimize idle resources without sacrificing performance.

AWS Trusted Advisor can help you identify underutilized Amazon Redshift Clusters. Pause Amazon Redshift clusters when not in use and resume when needed.

Energy efficient consumption patterns

Consider querying the data in place with Amazon Athena or Amazon Redshift Spectrum for one-off analysis, rather than copying the data to Amazon Redshift.

Enable a caching layer for frequent queries as needed. This is in addition to the result caching that comes built-in with services such as Amazon Redshift. Also, use Amazon Athena Query Result Reuse for every query where the source data doesn’t change frequently.

Use materialized views capabilities available in Amazon Redshift or Amazon Aurora Postgres to avoid unnecessary computation.

Use federated queries across data stores powered by Amazon Athena federated query or Amazon Redshift federated query to reduce data movement. For querying across separate Amazon Redshift clusters, consider using Amazon Redshift data sharing feature that decreases data movement between these clusters.

Track and assess improvement for environmental sustainability

The optimal way to evaluate success in optimizing your workloads for sustainability is to use proxy measures and unit of work KPI. This can be GB per transaction for storage, or vCPU minutes per transaction for compute.

In Table 1, we list certain metrics you could collect on analytics services as proxies to measure improvement. These fall under each pillar of the modern data architecture covered in this post.

Pillar Metrics
Unified data governance
Data movement
Purpose-built Analytics
New
Naclud Laptops, 15 Inch Laptop, Laptop Computer with 128GB ROM 4GB RAM, Intel N4000 Processor(Up to 2.6GHz), 2.4G/5G WiFi, BT5.0, Type C, USB3.2, Mini-HDMI, 53200mWh Long Battery Life
  • EFFICIENT PERFORMANCE: Equipped with 4GB...
  • Powerful configuration: Equipped with the Intel...
  • LIGHTWEIGHT AND ADVANCED - The slim case weighs...
  • Multifunctional interface: fast connection with...
  • Worry-free customer service: from date of...
New
HP - Victus 15.6" Full HD 144Hz Gaming Laptop - Intel Core i5-13420H - 8GB Memory - NVIDIA GeForce RTX 3050-512GB SSD - Performance Blue (Renewed)
  • Powered by an Intel Core i5 13th Gen 13420H 1.5GHz...
  • Equipped with an NVIDIA GeForce RTX 3050 6GB GDDR6...
  • Includes 8GB of DDR4-3200 RAM for smooth...
  • Features a spacious 512GB Solid State Drive for...
  • Boasts a vibrant 15.6" FHD IPS Micro-Edge...
New
HP EliteBook 850 G8 15.6" FHD Laptop Computer – Intel Core i5-11th Gen. up to 4.40GHz – 16GB DDR4 RAM – 512GB NVMe SSD – USB C – Thunderbolt – Webcam – Windows 11 Pro – 3 Yr Warranty – Notebook PC
  • Processor - Powered by 11 Gen i5-1145G7 Processor...
  • Memory and Storage - Equipped with 16GB of...
  • FHD Display - 15.6 inch (1920 x 1080) FHD display,...
  • FEATURES - Intel Iris Xe Graphics – Audio by...
  • Convenience & Warranty: 2 x Thunderbolt 4 with...

Last update on 2024-04-05 / Affiliate links / Images from Amazon Product Advertising API

Table 1. Metrics for the Modern data architecture pillars

Conclusion

In this blog post, we provided best practices to optimize processes under the unified data governance, data movement, and purpose-built analytics pillars of modern architecture.

If you want to learn more, check out the Sustainability Pillar of the AWS Well-Architected Framework and other blog posts on architecting for sustainability.

If you are looking for more architecture content, refer to the AWS Architecture Center for reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more.

Original Post>