Swiftly Search Metadata with an Amazon S3 Serverless Architecture

As you increase the number of objects in Amazon Simple Storage Service (Amazon S3) , you’ll need the ability to search through them and quickly find the information you need. In this blog post, we offer you a cost-effective solution that uses a serverless architecture to search through your metadata.

Source: Original Postress-this.php?">Swiftly Search Metadata with an Amazon S3 Serverless Architecture

Using a serverless architecture helps you reduce operational costs because you only pay for what you use.

Our solution is built with Amazon S3 event notifications, AWS LambdaAWS Glue Catalog, and Amazon Athena. These services allow you to search thousands of objects in an S3 bucket by filenames, object metadata, and object keys. This solution maintains an index in an Apache Parquet file, which optimizes Athena queries to search Amazon S3 metadata. Using this approach makes it straightforward to run queries as needed without the need to ingest data or manage any servers.

Using Athena to search Amazon S3 objects

Amazon S3 stores and retrieves objects for a range of use cases, such as data lakes, websites, cloud-native applications, backups, archive, machine learning, and analytics. When you have an S3 bucket with thousands of files in it, how do you search for and find what you need? The object search box within the Amazon S3 user interface allows you to search by prefix, or you can search using Amazon S3 API’s LIST operation, which only returns 1,000 objects at a time.

A common solution to this issue is to build an external index and search for Amazon S3 objects using the external index. The Indexing Metadata in Amazon Elasticsearch Service Using AWS Lambda and Python and Building and Maintaining an Amazon S3 Metadata Index without Servers blog posts show you how to build this solution with Amazon OpenSearch Service or Amazon DynamoDB.

Our solution stores the external index in Amazon S3 and uses Athena to search the index. Athena makes it straightforward to search Amazon S3 objects without the need to manage servers or introduce another data repository. In the next section, we’ll talk about a few use cases where you can apply this solution.

Metadata search use cases

When your clients upload files to their S3 buckets, you’ll sometimes need to verify the files that were uploaded. You must validate whether you have received all the required information, including metadata such as customer identifier, category, received date, etc. The following examples will make use of this metadata:

  • Searching for a file from a specific date (or) date range
  • Finding all objects uploaded by a given customer identifier
  • Reviewing files for a particular category

The next sections outline how to build a serverless architecture to apply to use cases like these.

Building a serverless file metadata search on AWS

Let’s go through layers that are involved in our serverless architecture solution:

  1. Data ingestion: Set object metadata when objects are uploaded into Amazon S3. This layer uploads objects using the Amazon S3 console, AWS SDK, REST API, and AWS CLI.
  2. Data processing: Integrate Amazon S3 event notifications with Lambda to process S3 events. The AWS Data Wrangler library within Lambda will transform and build the metadata index file.
  3. Data catalog: Use AWS Glue Data Catalog as a central repository to store table definition and add/update business-relevant attributes of the metadata index. AWS Data Wrangler API creates the Apache Parquet files to maintain the AWS Glue Catalog.
  4. Metadata search: Define tables for your metadata and run queries using standard SQL with Athena to get started faster.

Reference architecture

Figure 1 illustrates our approach to implementing serverless file metadata search, which consists of the following steps:

  1. When you create new objects/files in an S3 bucket, the source bucket is configured with Amazon S3 Event Notification events (put, post, copy, etc.). Amazon S3 events provide the metadata information required for further processing and building the metadata index file on the destination bucket.
  2. The S3 event is sent to a Lambda function with necessary permissions on Amazon S3 using a resource-based policy. The Lambda function processes the event with metadata and converts it into an Apache Parquet file, which is then written into a target bucket. The AWS Data Wrangler API transforms and builds the metadata index file. The Lambda layer configures AWS Data Wrangler library for necessary transformations.
  3. AWS Data Wrangler also creates and stores metadata in the AWS Glue Data Catalog. DataFrames are then written into target S3 buckets in Apache Parquet format. The AWS Glue Data Catalog is then updated with necessary metadata. The following example code snippet writes into an example table with columns for year, month, and date for an S3 object.
SaleBestseller No. 1
HP 2022 Newest All-in-One Desktop, 21.5" FHD Display, Intel Celeron J4025 Processor, 16GB RAM, 512GB PCIe SSD, Webcam, HDMI, RJ-45, Wired Keyboard&Mouse, WiFi, Windows 11 Home, White
  • 【High Speed RAM And Enormous Space】16GB DDR4...
  • 【Processor】Intel Celeron J4025 processor (2...
  • 【Display】21.5" diagonal FHD VA ZBD anti-glare...
  • 【Tech Specs】2 x SuperSpeed USB Type-A 5Gbps...
  • 【Authorized KKE Mousepad】Include KKE Mousepad
SaleBestseller No. 2
ACEMAGIC Laptop Computer, 16GB DDR4 512GB SSD, 15.6 Inch Windows 11 Laptop with Intel Quad-Core N95(Up to 3.4GHz), Metal Shell, BT5.0, 5G WiFi, USB3.2, Type_C, Webcam, 38Wh Battery, 180° Open Angle
  • 【EFFICIENT PERFORMANCE】ACEMAGIC Laptop...
  • 【16GB RAM & 512GB ROM】Featuring 16GB of DDR4...
  • 【15.6" IMMERSIVE VISUALS】This 15.6 inch laptop...
  • 【NO LATENCY CONNECTION】The laptop computer...
  • 【ACEMAGIC CARE FOR YOU】 This slim laptop will...

wr.s3.to_parquet(df=df, path=path, dataset=True, mode="append", partition_cols=["year","month","date"],database="example_database", table="example_table")

  1. With the AWS Glue Data Catalog built, Athena will use AWS Glue crawlers to automatically infer schemas and partitions of the metadata search index. Athena makes it easy to run interactive SQL queries directly into Amazon S3 by using the schema-on-read approach.

Figure 1. Serverless S3 metadata search

Athena charges based on the amount of data scanned for the query. The data being in columnar format and data partitioning will save costs as well as improve performance. Figure 2 provides a sample metadata query result from Athena.

Figure 2. Athena sample metadata query results

Conclusion

New
HP Envy Desktop, Intel Core i7-13700, 64GB RAM, 4TB SSD, SD Card Reader, HDMI, VGA, RJ45, Wired Keyboard & Mouse, Wi-Fi 6, Windows 11 Home, Black
  • [High Speed RAM And Enormous Space] 64GB...
  • [Processor] Intel Core i7-13700 (16 Cores, 24...
  • [Tech Specs] 1 x USB 3.2 Type-C, 4 x USB 3.2...
  • [Operating System] Windows 11 Home - Beautiful,...
New
XZKKCD Archangel 3.0 Gaming Computer PC Desktop - Ryzen 5 3600 6-Core 3.6GHz, RTX 3060 12GB, 1TB SSD, 16GB DDR4 3200, RGB Fans, AC WiFi, 600W Gold PSU, Windows 11 Home 64-bit, White
  • AMD Ryzen 5 3600 6-Core 3.6 GHz (4.2 GHz Turbo)...
  • GeForce RTX 3060 12GB GDDR6 Graphics Card (Brand...
  • 802.11AC | No Bloatware | Graphic output options...
  • Heatsink & 3 x RGB Fans | Powered by 80 Plus Gold...
  • 1 Year Warranty on Parts and Labor | Lifetime Free...
New
jumper Laptop, Laptop Computer with 24GB LPDDR4 512GB SSD, Intel Celeron N5095 CPU(Up to 2.9GHz), 17.3" FHD IPS 1920x1200 Display, 38WH Battery, Intel® UHD Graphics, USB3.0 * 3, BT5.0, Front 2.0MP.
  • 【Excellent performance】 Laptop is equipped...
  • 【Do Your Tasks Easily】 Laptop computer comes...
  • 【Amazing Visuals】 The 17.3-inch laptop...
  • 【Poweful Cooling System】Laptops are equipped...
  • 【External Ports Design】Notebook computer comes...

This blog post shows you how to create a robust metadata index using serverless components. This solution allows you to search files in an S3 bucket by filenames, metadata, and keys.

We showed you how to set up Amazon S3 Event Notifications, Lambda, AWS Glue Catalog, and Athena. You can use this approach to maintain an index in an Apache Parquet file, store it in Amazon S3, and use Athena queries to search S3 metadata.

Our solution requires minimal administration effort. It does not require administration and maintenance of Amazon Elastic Compute Cloud (Amazon EC2) instances, DynamoDB tables, or Amazon OpenSearch Service clusters. Amazon S3 provides scalable storage, high durability, and availability at a low cost. Plus, this solution does not require in-depth knowledge of AWS services. When not in use, it will only incur cost for Amazon S3 and possibly for AWS Glue Data Catalog storage. When needed, this solution will scale out effortlessly.