Simplify data integration pipeline development using AWS Glue custom blueprints

https://aws.amazon.com/blogs/big-data/simplify-data-integration-pipeline-development-using-aws-glue-custom-blueprints/

Organizations spend significant time developing and maintaining data integration pipelines that hydrate data warehouses, data lakes, and lake houses. As data volume increases, data engineering teams struggle to keep up with new requests from business teams.

Although these requests may come from different teams, they’re often similar, such as ingesting raw data from a source system into a data lake, partitioning data based on a certain key, writing data from data lakes to a relational database, or assigning default values for empty attributes. To keep up with these requests, data engineers modify pipelines in a development environment, then test and deploy to a production environment. This redundant code creation process is error-prone and time-consuming.

Data engineers need a way to enable non-data engineers like business analysts, data analysts, and data scientists to operate using self-service methods by abstracting the complexity of pipeline development. In this post, we discuss AWS Glue custom blueprints, which offer a framework for you to build and share reusable AWS Glue workflows.

Introducing AWS Glue custom blueprints

AWS Glue is a serverless data integration service that allows data engineers to develop complex data integration pipelines. In AWS Glue, you can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers.

With AWS Glue custom blueprints, data engineers can create a blueprint that abstracts away complex transformations and technical details. Non-data engineers can then use blueprints through a user interface to ingest, transform, and load data instead of waiting for data engineers to develop new pipelines. These users can also take advantage of blueprints developed outside their organization; for example, AWS has developed sample blueprints to transform data.

The following diagram illustrates our architecture with AWS Glue custom blueprints.

The workflow includes the following steps:

  1. The data engineer identifies common data integration patterns and creates a blueprint.
  2. The data engineer shares the blueprint via the source control tool or Amazon Simple Storage Service (Amazon S3).
  3. Non-data engineers can easily register and use the blueprint via a user interface where they provide input.
  4. The blueprint uses these parameters to generate an AWS Glue workflow. You can simply run these workflows to ingest and transform data.

Develop a custom blueprint

Data ingested in a data lake is partitioned in a certain way. Sometimes, data analysts, business analysts, data scientists, and data engineers need to partition data differently based on their query patterns. For instance, a data scientist may want to partition the data by timestamp, whereas a data analyst may want to partition data based on location. The data engineer can create an AWS Glue job that accepts parameters and partitions the data accordingly, then package the job as a blueprint to share with other users, who provide the parameters and generate an AWS Glue workflow. In this section, we create a blueprint to solve this use case.

To create a custom blueprint, a data engineer creates three components: a configuration file, a layout file, and AWS Glue job scripts (along with any additional libraries required by the resources specified in the layout file).

The AWS Glue job scripts typically perform the data transformation. In this example, the data engineer creates the job script partitioning.py, which accepts parameters such as the source S3 location, partition keys, partitioned table name, and target S3 location. The job reads data in the source S3 location, writes partitioned data to the target S3 location, and catalogs the partitioned table in the AWS Glue Data Catalog.
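
As a rough illustration, a minimal job script for this pattern might look like the following sketch. The parameter names (input_path, output_path, partition_keys) are assumptions, and it uses a plain Spark partitioned write; the actual blueprint script also catalogs the partitioned table in the Data Catalog, which is omitted here for brevity:

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the parameters that the generated workflow passes to the job
args = getResolvedOptions(sys.argv, ['input_path', 'output_path', 'partition_keys'])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the raw data from the source S3 location (JSON input is assumed here)
df = spark.read.json(args['input_path'])

# Write the data back out as Parquet, partitioned by the requested keys
partition_keys = args['partition_keys'].split(',')
df.write.mode('overwrite').partitionBy(partition_keys).parquet(args['output_path'])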

The configuration file is a JSON-based file in which the data engineer defines the list of inputs needed to generate the workflow. In this example, the data engineer creates blueprint.cfg, outlining all the inputs needed, such as the input data location, partitioned table name, and output data location. AWS Glue uses this file to create a user interface for users to provide values when creating their workflow. The following figure shows how parameters from the configuration file are translated to the user interface.
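
For illustration, a trimmed-down configuration file might look like the following sketch. The structure (a layoutGenerator entry pointing at the layout function, plus a parameterSpec map that drives the generated UI) follows the AWS Glue blueprint documentation, but this parameter list is abbreviated and the types are kept to plain strings; see blueprint.cfg in the sample repository for the real file:

{
    "layoutGenerator": "partitioning.layout.generate_layout",
    "parameterSpec": {
        "WorkflowName": {
            "type": "String",
            "collection": false,
            "description": "Name of the workflow to generate."
        },
        "InputDataLocation": {
            "type": "String",
            "collection": false,
            "description": "S3 location of the input data."
        },
        "OutputDataLocation": {
            "type": "String",
            "collection": false,
            "description": "S3 location for the partitioned output data."
        }
    }
}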

The layout file is a Python file that uses the user inputs to create the following:

  • Prerequisite objects such as Data Catalog databases and S3 locations to store ETL scripts or use them as intermediate data locations
  • The AWS Glue workflow

In this example, the developer creates a layout.py file that generates the workflow based on the parameters provided by the user (a skeletal sketch follows the list below). The layout file includes code that performs the following functions:

  • Creates the AWS Glue database based on inputs provided by the user
  • Creates the AWS Glue script S3 bucket and uploads partitioning.py
  • Creates temporary S3 locations for processing
  • Creates the workflow that first runs the crawler and then the job
  • Based on the input parameters, sets up the workflow schedule
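
The following skeletal sketch shows the shape of such a layout file, based on the Workflow, Job, Crawler, and Entities classes in the aws-glue-blueprint-libs project. It is simplified: the script upload, temporary locations, database creation, and schedule handling described above are omitted, and the ScriptLocation value is a placeholder:

from awsglue.blueprint.crawler import Crawler
from awsglue.blueprint.job import Job
from awsglue.blueprint.workflow import Entities, Workflow

def generate_layout(user_params, system_params):
    workflow_name = user_params['WorkflowName']

    # Crawler that catalogs the raw input data as the source table
    crawler = Crawler(
        Name='{}_crawler'.format(workflow_name),
        Role=user_params['IAMRole'],
        DatabaseName=user_params['DestinationDatabaseName'],
        Targets={'S3Targets': [{'Path': user_params['InputDataLocation']}]})

    # Job that runs partitioning.py once the crawler succeeds
    job = Job(
        Name='{}_partitioning_job'.format(workflow_name),
        Command={'Name': 'glueetl',
                 'ScriptLocation': 's3://path/to/scripts/partitioning.py',
                 'PythonVersion': '3'},
        Role=user_params['IAMRole'],
        WorkerType='G.1X',
        NumberOfWorkers=int(user_params['NumberOfWorkers']),
        DependsOn={crawler: 'SUCCEEDED'})

    # The returned workflow is what AWS Glue materializes for the user
    return Workflow(Name=workflow_name,
                    Entities=Entities(Jobs=[job], Crawlers=[crawler]))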

Package a custom blueprint

After you develop a blueprint, you need to package it as a .zip file, which others can use to register the blueprint. The archive should contain the following files:

  • Configuration file
  • Layout file
  • AWS Glue job scripts and additional libraries as required

You can share the blueprint with others using your choice of source control repository or file storage.

Register the blueprint

To register a blueprint on the AWS Glue console, complete the following steps:

  1. Upload the .zip file in Amazon S3.
  2. On the AWS Glue console, choose Blueprints.
  3. Choose Add blueprint.
  4. Enter the following information:
    1. Blueprint name
    2. Location of .zip archive
    3. Optional description
  5. Choose Add blueprint.

When the blueprint is successfully registered, its status turns to ACTIVE.
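
Registration can also be scripted. Assuming your AWS CLI version includes the AWS Glue blueprint commands, the console steps above map to something like the following sketch (the blueprint name and S3 path are placeholders):

$ aws glue create-blueprint \
    --name my-blueprint \
    --blueprint-location s3://path/to/blueprint/blueprint.zip

# Poll until the status shows ACTIVE
$ aws glue get-blueprint --name my-blueprint --query 'Blueprint.Status'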

You’re now ready to use the blueprint to create an AWS Glue workflow.

Use a blueprint

We have developed a few blueprints to get you started. To use them, download them, create .zip files, and register them as described in the previous section.

  • Crawl S3 locations – Crawl S3 locations and create tables in the AWS Glue Data Catalog
  • Convert data format to Parquet – Convert S3 files in various formats to Parquet format using Snappy compression
  • Partition Data – Partition files based on user inputs to optimize data layout in Amazon S3
  • Copy data to DynamoDB – Copy data from Amazon S3 to Amazon DynamoDB
  • Compaction – Compact input files into larger files to improve query performance
  • Encoding – Convert encoding in S3 files

In this post, we show how a data analyst can easily use the Partition Data blueprint. Partitioning data improves query performance by organizing data into parts based on column values such as date, country, or region. This helps restrict the amount of data scanned by a query when filters are specified on the partition. You may want to partition data differently, such as by timestamp or other attributes. With the partitioning blueprint, data analysts can easily partition the data without deep knowledge in data engineering.

For this post, we use the Daily Global & U.S. COVID-19 Cases & Testing Data (Enigma Aggregation) dataset available in the AWS COVID-19 data lake as our source. This data contains US and global cases, deaths, and testing data related to COVID-19, organized by country.


The dataset includes two JSON files, and as of this writing the total data size is 215.7 MB. This data is not partitioned, so it isn’t optimized for query performance. It’s common to query this kind of historical data by specifying a date range condition in the WHERE clause. To minimize the data scan size and achieve optimal performance, we partition this data using the date time field.

You can partition the datasets via nested partitioning or flat partitioning:

  • Flat partitioning – path_to_data/dt=20200918/
  • Nested partitioning – path_to_data/year=2020/month=9/day=18/

In this example, the input data contains the date field, and its value is formatted as 2020-09-18 (YYYY-MM-DD). For flat partitioning, you can simply specify the date field as the partitioning key. However, nested partitioning is trickier to implement: the developer needs to extract the year, month, and day from the date, which is hard for non-data engineers to code. This blueprint abstracts that complexity and can generate nested fields (such as year, month, and day) at any granularity from a date time field.
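
To make the abstraction concrete, here is roughly what the blueprint has to do on the user’s behalf: parse the date string and derive one column per granularity level before writing. This is an illustrative Spark sketch, not the blueprint’s actual code, and the output path is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, dayofmonth, month, to_timestamp, year

spark = SparkSession.builder.getOrCreate()
df = spark.read.json('s3://covid19-lake/enigma-aggregation/json/global/')

# Parse the YYYY-MM-DD string, then derive one column per granularity level
ts = to_timestamp(col('date'), 'yyyy-MM-dd')
df = (df.withColumn('year', year(ts))
        .withColumn('month', month(ts))
        .withColumn('day', dayofmonth(ts)))

# Produces nested folders such as year=2020/month=9/day=18/ under the output path
df.write.mode('overwrite').partitionBy('year', 'month', 'day') \
    .parquet('s3://path/to/output/data/location/')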

To use this blueprint, complete the following steps:

  1. Download the files from GitHub with the following code:
     $ git clone https://github.com/awslabs/aws-glue-blueprint-libs.git
     $ cd aws-glue-blueprint-libs/samples/
  2. Compress the blueprint files into a .zip file:
     $ zip partitioning.zip partitioning/*
  3. Upload the .zip file to your S3 bucket:
     $ aws s3 cp partitioning.zip s3://path/to/blueprint/
  4. On the AWS Glue console, choose Blueprints.
  5. Choose Add blueprint.
  6. For Blueprint name, enter partitioning-tutorial.
  7. For ZIP archive location (S3), enter s3://path/to/blueprint/partitioning.zip.
  8. Wait for the blueprint status to show as ACTIVE.
  9. Select your partitioning-tutorial blueprint, and on the Actions menu, choose Create workflow.
  10. Specify the following parameters:
    1. WorkflowName – partitioning
    2. IAMRole – The AWS Identity and Access Management (IAM) role to run the AWS Glue job and crawler
    3. InputDataLocation – s3://covid19-lake/enigma-aggregation/json/global/
    4. DestinationDatabaseName – blueprint_tutorial
    5. DestinationTableName – partitioning_tutorial
    6. OutputDataLocation – s3://path/to/output/data/location/
    7. PartitionKeys – (leave blank)
    8. TimestampColumnName – date
    9. TimestampColumnGranularity – day
    10. NumberOfWorkers – 5 (the default value)
    11. IAM role – The role that AWS Glue assumes to create the workflow, crawlers, jobs, triggers, and any other resources defined in the layout script. For a suggested policy for the role, see Permissions for Blueprint Roles.
  11. Choose Submit.
  12. Wait for the blueprint run status to change to SUCCEEDED.
  13. In the navigation pane, choose Workflows.
  14. Select partitioning and on the Actions menu, choose Run.
  15. Wait for the workflow run status to show as Completed.
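
Steps 9 through 15 can also be driven from the AWS CLI. Assuming your CLI version includes the blueprint run commands, the following sketch mirrors the console inputs above (the role names, ARN account number, and output path are placeholders, and the blank PartitionKeys value is omitted):

$ aws glue start-blueprint-run \
    --blueprint-name partitioning-tutorial \
    --role-arn arn:aws:iam::123456789012:role/GlueBlueprintRole \
    --parameters '{"WorkflowName": "partitioning", "IAMRole": "GlueJobRole",
        "InputDataLocation": "s3://covid19-lake/enigma-aggregation/json/global/",
        "DestinationDatabaseName": "blueprint_tutorial",
        "DestinationTableName": "partitioning_tutorial",
        "OutputDataLocation": "s3://path/to/output/data/location/",
        "TimestampColumnName": "date", "TimestampColumnGranularity": "day",
        "NumberOfWorkers": 5}'

# After the blueprint run succeeds, start the generated workflow
$ aws glue start-workflow-run --name partitioning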

You can navigate to the output location on the Amazon S3 console to see that the Parquet files have been written under partitioned folders of the form year=yyyy/month=MM/day=dd/.

The blueprint registers two tables:

  • source_partitioning_tutorial – The non-partitioned table that is generated by the AWS Glue crawler as a data source
  • partitioning_tutorial – The new partitioned table in the AWS Glue Data Catalog

You can access both tables using Amazon Athena. Let’s compare the data scan size for both tables to see the benefit of partitioning.

First, run the following query against the non-partitioned source table:

SELECT * FROM "blueprint_tutorial"."source_partitioning_tutorial"
WHERE date='2020-09-18'

The following screenshot shows the query results.

Then, run the same query against the partitioned table:

SELECT * FROM "blueprint_tutorial"."partitioning_tutorial"
WHERE year=2020 AND month=9 AND day=18

The following screenshot shows the query results.


The query against the non-partitioned table scanned 215.67 MB of data, whereas the query against the partitioned table scanned only 126.42 KB, roughly 1,700 times less. This technique reduces Athena usage costs.

Conclusion

In this post, we demonstrated how data engineers can use AWS Glue custom blueprints to simplify data integration pipelines and promote reusability. Non-data engineers such as data scientists, business analysts, and data analysts can ingest and transform data using a rich UI that abstracts the technical details so they can gain faster insights from their data. Our sample templates can get you started using AWS Glue custom blueprints. We highly encourage you to build blueprints and make them available to the AWS Glue community.


About the authors

Noritaka Sekiyama is a big data architect on the AWS Glue and AWS Lake Formation team. He is passionate about implementing software artifacts for building data lakes.

Keerthi Chadalavada is a software development engineer at AWS Glue. She is passionate about building fault tolerant and reliable distributed systems at scale.

Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop, and deploy modern data platforms.