Crawl Delta Lake tables using AWS Glue crawlers

Table of Contents ( Press the ← key in browser search bar to return TOC)

In recent evolution in data lake technologies, it became popular to bring ACID (atomicity, consistency, isolation, and durability) transactions on Amazon Simple Storage Service (Amazon S3). You can achieve that by introducing open-source data lake formats such as Apache Hudi, Apache Iceberg, and Delta Lake. Delta Lake is one of the common open-source data lake formats.

Delta Lake is available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. To utilize Delta Lake from Glue Spark jobs, follow this blog series. To utilize Delta Lake from Athena and Redshift Spectrum, you need to have specific table definitions on the AWS Glue Data Catalog, and there is an extra step to make it queryable from Athena and Redshift Spectrum.

One of the key capabilities of Delta Lake and other data lake formats is reading consistent snapshot using ACID transactions. Even when there are many concurrent writes, you can guarantee consistent version of the tables at the specific point in time without retrieving intermediate or incomplete state. It is highly demanded capability especially in complex data pipelines.

AWS Glue crawlers are designed to populate table definitions on the Data Catalog based on data dynamically. This year, AWS Glue crawlers started supporting Delta Lake. It simplifies those use cases by creating table definitions of Delta tables dynamically, populating the metadata from the Delta Lake transaction logs, and creating the manifest files in Amazon S3 for Athena and Redshift Spectrum to consume. With Delta lake crawler, you can easily read consistent snapshot from Athena and Redshift Spectrum. AWS Glue crawler integration with Delta Lake also supports AWS Lake Formation access control. You can grant Lake Formation permissions on the Delta tables created by the crawler to AWS principals that then query through Athena and Redshift Spectrum to access data in Delta tables.

This post demonstrates how AWS Glue crawlers work with Delta tables, and describes typical use cases to query Delta tables.

How AWS Glue Crawler works with Delta Lake

Delta Lake provides an abstraction known as a Delta table that encapsulates all metadata, data files, and partitions under a transaction log. Delta Lake stores the metadata and schema within the distributed file system rather than in a central data catalog.

To access data using the Delta Lake protocol, Redshift Spectrum and Athena need a manifest file that lists all files that are associated to a particular Delta table, along with the table metadata populated in the AWS Glue Data Catalog. Traditionally, this manifest file creation required running a GENERATE symlink_format_manifest query on Apache Spark.

The AWS Glue crawler populates the metadata from the Delta Lake transaction log into the Data Catalog, and creates the manifest files in Amazon S3 for different query engines to consume. To simplify access to Delta tables, the crawler provides an option to select a Delta Lake data store, which encapsulates all parameters that are required for crawling. For each Delta Lake data store, the crawler scans the Delta table’s transaction log to detect metadata. It populates the _symlink_manifest folder with the manifest files that are partitioned by the partition keys, based on configuration parameters that you choose.

Crawl Delta Lake tables using AWS Glue Crawler

In this tutorial, let’s go through how to crawl delta tables using AWS Glue Crawler.

Prerequisites

Complete the following prerequisite steps for this tutorial:

Install and configure the AWS Command Line Interface (AWS CLI).
Create your S3 bucket if you don’t have one.
Create an AWS Identity and Access Management (IAM) role for your AWS Glue crawler if you don’t have one. For instructions, refer to Create an IAM role for AWS Glue.
Run the following command to copy the sample Delta table into your S3 bucket (replace your_s3_bucket with your S3 bucket name):

$ aws s3 sync s3://aws-bigdata-blog/artifacts/delta-lake-crawler/sample_delta_table/ s3://your_s3_bucket/data/sample_delta_table/

Create a Delta Lake crawler

You can create a Delta Lake crawler via the AWS Glue console, the AWS Glue SDK, or the AWS CLI. In the SDK, specify a DeltaTarget with the following configurations:

DeltaTables – A list of Amazon S3 DeltaPath values where the Delta tables are located. (Note that each path must be the parent of a _delta_log folder).
WriteManifest – A Boolean value indicating whether or not the crawler should write the manifest files for each DeltaPath.
ConnectionName – An optional connection name stored in the Data Catalog that the crawler should use to access Delta tables backed by a VPC.

To create your crawler on the AWS Glue console, complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler.
For Name, enter delta-lake-crawler, and choose Next.
For Data source configuration, chooseNot yet.
For Data source, choose Add a data source.
For Data source, select Delta Lake.
For Include delta lake table paths, enter s3://your_s3_bucket/data/sample_delta_table/.
Select Enable write manifest, then choose Add a Delta Lake data source. Choose Next.
For IAM role, under Existing IAM role, choose your IAM role, then choose Next.
For Target database, choose Add database, then Create a database page is shown up.
For Name, enter delta_lake, then choose Create database. Then come back to the previous page. For Target database, click the reload button, and select delta_lake database.
For Frequency under Crawler schedule, choose On demand, then choose Next.
Review your configuration, and choose Create crawler. You can trigger the crawler to run manually via the AWS Glue console, or through the SDK or AWS CLI using the StartCrawl API. You could also schedule a trigger via the AWS Glue console. For this post, we run the crawler via the AWS Glue console.
Select delta-lake-crawler, and choose Run.
Wait for the crawler to complete.

After the crawler runs, it writes a single manifest table in the Data Catalog for each DeltaPath under its configuration that has a valid Delta table. The manifest table uses the format SymlinkTextInputFormat and the manifest location s3://your_s3_bucket/data/sample_delta_table/_symlink_format_manifest/.

You can see the Delta table definition on the AWS Glue console. The table location points to the preceding manifest location.

The table definition also includes an AdditionalLocations field, which is an array that points to the location s3://your_s3_bucket/data/sample_delta_table/. You can access this additional field through the following AWS CLI command:

$ aws glue get-table --database delta_lake --name sample_delta_table
{
    "Table": {
        "Name": "sample_delta_table",
        "DatabaseName": "delta_lake",
        "Owner": "owner",
        "CreateTime": "2022-07-07T17:49:43+09:00",
        "UpdateTime": "2022-07-07T20:33:09+09:00",
        "LastAccessTime": "2022-07-07T17:49:43+09:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "product_id",
                    "Type": "string"
                },
                {
                    "Name": "product_name",
                    "Type": "string"
                },
                {
                    "Name": "price",
                    "Type": "bigint"
                },
                {
                    "Name": "currency",
                    "Type": "string"
                },
                {
                    "Name": "category",
                    "Type": "string"
                },
                {
                    "Name": "updated_at",
                    "Type": "double"
                }
            ],
            "Location": "s3://your_s3_bucket/data/sample_delta_table/_symlink_format_manifest/",
            "AdditionalLocations": [
                "s3://your_s3_bucket/data/sample_delta_table/"
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "Compressed": false,
            "NumberOfBuckets": -1,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "Parameters": {}
            },
            "BucketColumns": [],
            "SortColumns": [],
            "Parameters": {
                "CrawlerSchemaDeserializerVersion": "1.0",
                "CrawlerSchemaSerializerVersion": "1.0",
                "UPDATED_BY_CRAWLER": "delta-lake-crawler",
                "classification": "parquet"
            },
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "CrawlerSchemaDeserializerVersion": "1.0",
            "CrawlerSchemaSerializerVersion": "1.0",
            "UPDATED_BY_CRAWLER": "delta-lake-crawler",
            "classification": "parquet",
            "last_modified_by": "hadoop",
            "last_modified_time": "1657193589",
            "transient_lastDdlTime": "1657193589"
        },
        "CreatedBy": "arn:aws:sts::123456789101:assumed-role/AWSGlueServiceRole-Default/AWS-Crawler",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "123456789101",
        "IsRowFilteringEnabled": false,
        "VersionId": "1",
        "DatabaseId": "7cb01f36ebbc460eb2d4dcce2c91ed2b"
    }
}

Query Delta tables

After you create the manifest table, AWS query engines such as Athena and Redshift Spectrum are able to query against the files by reading the manifest file locations to filter which data files to query in the Delta table.

Query from Athena

Athena users need to point their catalog to the AWS Glue Data Catalog. Open the Athena console in the same Region as where your table is registered in the Data Catalog, and confirm that the data source is set to AwsDataCatalog.

Now you’re ready to run queries on Athena. To access your Delta table, run the following query:

SELECT * FROM "delta_lake"."sample_delta_table" limit 10;

The following screenshot shows our output.

Query from Redshift Spectrum

Redshift Spectrum requires an external schema pointing to the database in which the Delta table was created.

To query with Redshift Spectrum, complete the following steps:

Create an IAM role for an Amazon Redshift cluster with the following configuration:
1. For permissions, use arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess, or your custom policy for reading your S3 bucket.
2. Use the following trust relationship:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
To lean more, visit IAM policies for Amazon Redshift Spectrum.
Launch a new Redshift cluster with the following configuration:
1. Choose dc2.large, 1 node.
2. Configure the IAM role you created in step 1.
Connect with the Amazon Redshift query editor v2.For instructions, see Querying a database using the query editor v2.

Create an external schema for the delta_lake database to use in Redshift Spectrum (replace <your IAM role ARN> with your IAM role ARN):

create external schema spectrum from data catalog 
database 'delta_lake' 
iam_role '<your IAM role ARN>'
create external database if not exists;

Run the following SQL against spectrum.sample_delta_table:

SELECT * FROM "dev"."spectrum"."sample_delta_table" LIMIT 10

The following screenshot shows our output.

Limitations of Delta crawlers and manifest tables

When the data or schema in a Delta table is updated, the manifest tables in the AWS Glue Data Catalog may become out of sync. It means that you can still query the manifest table and get the consistent result, but the result of the table is at the previous point of time. To get the latest result, you must update the manifest tables using the crawler or manually update the manifest table through the AWS Glue SDK or AWS CLI. When you want to keep the manifest table up-to-date, you can run Delta Lake crawlers on a schedule (for example, once an hour).

When the Delta table data is updated, the manifest files under the _symlink_manifest folder of your Delta tables may also become out of sync, in which case you need to rerun a crawler with writeManifest enabled.

Use of Delta tables in EMR and Glue Spark

The delta lake crawler is designed for use in Athena, Redshift Spectrum, and other engines compatible with parquet-based manifest tables. For EMR Spark or Glue Spark jobs, you do not need to create a manifest table by running the delta lake crawler, instead, you can read from and write to delta table directly using Delta Lake library. You can follow this blog series to understand how to process Delta tables on Glue Spark jobs.

Secure Delta tables using Lake Formation permissions

Manifest tables created by the Delta Lake crawler support Lake Formation access control, including cell-level security. It allows Data Lake administrators to filter specific rows and columns for certain users of their manifest tables. Through the use of CreateDataCellFilter and GrantPermissions APIs, you can grant row and column filters to the Delta manifest table. You can query the Delta manifest table from Athena and Redshift Spectrum with the use of these filters configured on the Delta manifest tables.

To learn more about Lake Formation cell-level security, refer to the following blog posts:

Clean up

Now to the final step, cleaning up the resources:

Delete the Amazon Redshift cluster.
Delete your data under your S3 path: s3://your_s3_bucket/data/sample_delta_table/.
Delete the AWS Glue crawler delta-lake-crawler.
Delete the AWS Glue database delta_lake.

Conclusion

This post demonstrated how to crawl Delta tables using an AWS Glue crawler, and how to query against the crawled tables from Athena and Redshift Spectrum. With AWS Glue crawlers, the manifest files are automatically created, so you can easily integrate Delta tables with Athena and Redshift Spectrum without manual effort in maintaining manifest files. It also enables you to manage cell-level security on the Delta tables using Lake Formation permissions.

Let’s start using Glue crawlers for your own Delta tables. If you have comments or feedback, please feel free to leave them in the comments.

About the authors

Kyle Duong is a Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems. In his free time, he enjoys cycling or playing basketball.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. This summer, he enjoyed goldfish scooping with his children.

Original Post>

Enjoyed this article? Sign up for our newsletter to receive regular insights and stay connected.

How AWS Glue Crawler works with Delta Lake

Crawl Delta Lake tables using AWS Glue Crawler

Prerequisites

Create a Delta Lake crawler

Query Delta tables

Query from Athena

Query from Redshift Spectrum

Limitations of Delta crawlers and manifest tables

Use of Delta tables in EMR and Glue Spark

Secure Delta tables using Lake Formation permissions

Clean up

Conclusion

About the authors

Like this:

Related

How AWS Glue Crawler works with Delta Lake

Crawl Delta Lake tables using AWS Glue Crawler

Prerequisites

Create a Delta Lake crawler

Query Delta tables

Query from Athena

Query from Redshift Spectrum

Limitations of Delta crawlers and manifest tables

Use of Delta tables in EMR and Glue Spark

Secure Delta tables using Lake Formation permissions

Clean up

Conclusion

About the authors

Share this:

Like this:

Related

Discover more from Global Intelligence and Insight Platform: IT Innovation, ETF Investment, plus Health Wellbeing