Build a serverless pipeline to analyze streaming data using AWS Glue, Apache Hudi, and Amazon S3

To deliver on these requirements, organizations have to build custom frameworks to handle in-place updates (also referred as upserts), handle small files created due to the continuous ingestion of changes from upstream systems (such as databases), handle schema evolution, and compromise on providing ACID guarantees on its data lake

To consume this streaming data, we set up an AWS Glue streaming ETL job that uses the Apache Hudi Connector for AWS Glue to write ingested and transformed data to Amazon S3, and also creates a table in the AWS Glue Data Catalog.

Continue reading

green and yellow printed textile

Automate Amazon Redshift load testing with the AWS Analytics Automation Toolkit

In this post, we demonstrate the use of the AWS Analytics Automation Toolkit for JMeter load tests on cloud benchmark data, using Amazon Redshift as a target environment

To use the AWS Analytics Automation Toolkit to run a JMeter load test, deploy the toolkit with the JMeter option, load data into your Amazon Redshift cluster, and customize the default test plan as you see fit.

Continue reading

Audit AWS service events with Amazon EventBridge and Amazon Kinesis Data Firehose

In this post, we provide a working example of AWS service-generated events ingested to Amazon S3. To make sure we have some service events available in default event bus, we use Parameter Store, a capability of AWS Systems Manager to store new parameters manually. This action generates a new event, which is ingested by the following pipeline.

Continue reading

kuala lumpur building

How Gemini Built a Cryptocurrency Analytics Platform Using Lakehouse for Financial Services

With the sheer volume of historical and live data feeds being ingested, and the need for a scalable compute platform for backtesting and spread calculations, our team needed a performant single source of truth to build the application dashboards.

As the data sets would be leveraged by machine learning and analyst teams, the Delta Lake format provided unique capabilities for managing high volume market/tick data — these features were key in developing the Gemini Lakehouse platform:

Continue reading

green and yellow printed textile

Export JSON data to Amazon S3 using Amazon Redshift UNLOAD

Example 1 – Unload customer data in JSON format into Amazon S3, partitioning output files into partition folders, following the Apache Hive convention, with customer birth month as the partition key.

Example 3 – Unload line item data (With SUPER column) in JSON format into Amazon S3, partitioning output files into partition folders, following the Apache Hive convention, with customer key as the partition key

Continue reading

1 22 23 24 25 26 27