Hybrid Cloud Architectures Using Self-hosted Apache Kafka and AWS Glue

Using analytics to gain insights from a variety of datasets is key to successful transformation. There are many options to consider when realizing the full value and potential of your data in a hybrid cloud infrastructure. A common practice is to route data produced on-premises to a central repository or data lake, where it can be consumed by multiple applications.

You can use an Apache Kafka cluster to move data from on-premises systems to a data lake built on Amazon Simple Storage Service (Amazon S3). But you must either replicate the topics onto a cloud-based cluster, or develop a custom connector to read the topics and copy them to Amazon S3. This presents a challenge for many customers.

This blog presents another option: an architecture solution leveraging AWS Glue.

Kafka and ETL processing

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. You can use a Kafka cluster to move data between systems. Producers typically publish (push) data to a Kafka topic, where an application can consume it. Consumers are usually custom applications that feed data into their respective target applications. These targets can be a data warehouse, an Amazon OpenSearch Service cluster, or others.
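
To make the producer and consumer roles concrete, here is a minimal sketch using the kafka-python client. The library choice, broker address, topic name, and payload are all illustrative assumptions, not part of the referenced architecture:

```python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "kafka.on-prem.example:9092"  # hypothetical on-premises endpoint

# Producer side: push a record onto a topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(
    "insurance-claims",
    b'{"account_id": "A-1001", "claim_id": "C-42", "amount": "1250.00"}',
)
producer.flush()

# Consumer side: read records and hand them to a target application.
consumer = KafkaConsumer(
    "insurance-claims",
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
)
for record in consumer:  # blocks and iterates as records arrive
    print(record.value)
```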

AWS Glue offers the ability to create jobs that extract, transform, and load (ETL) data. This allows you to consume from many sources, such as Apache Kafka, Amazon Kinesis Data Streams, or Amazon Managed Streaming for Apache Kafka (Amazon MSK). The jobs cleanse and transform the data, and then load the results into Amazon S3 data lakes or JDBC data stores.
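
A Glue streaming ETL script is ordinary PySpark with Glue extensions. The skeleton below is a hedged sketch of how such a job might begin; the database and table names are illustrative assumptions, referring to the Data Catalog entry for the Kafka topic that is set up later in this post:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the on-premises Kafka topic as a streaming DataFrame, via the
# Data Catalog table that points at the Kafka connection.
kafka_stream = glue_context.create_data_frame.from_catalog(
    database="onprem_kafka",        # hypothetical catalog database
    table_name="insurance_claims",  # hypothetical catalog table
    transformation_ctx="kafka_source",
    additional_options={"startingOffsets": "earliest"},
)
```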

Hybrid solution and architecture design

In most cases, the first step in building a responsive and manageable architecture is to review the data itself. For example, if we are processing insurance policy data from a financial organization, our data may contain fields that identify customer data, such as an account ID, an insurance claim identifier, and the dollar amount of the specific claim. Glue provides the ability to convert any of these field types into the expected data lake schema type for processing.

Figure 1. Data flow – Source to data lake target


Next, AWS Glue must be configured to connect to the on-premises Kafka cluster (see Figure 1). Private and secure connectivity to the on-premises environment can be established via AWS Direct Connect or a VPN solution. Traffic from the Amazon Virtual Private Cloud (Amazon VPC) must be allowed to access the cluster directly. You can do this by creating a three-step streaming ETL job:

  1. Create a Glue connection to the on-premises Kafka source
  2. Create a Data Catalog table
  3. Create an ETL job, which saves to an S3 data lake

Configuring AWS Glue

  1. Create a connection. Using AWS Glue, create a secure SSL connection in the Data Catalog using the predefined Kafka connection type. Enter the hostname of the on-premises cluster and use the custom-managed certificate option for additional security. If you are working in a development environment, you may need to generate a self-signed SSL certificate. Use your Kafka SSL endpoint to connect to Glue. (AWS Glue also supports client authentication for Apache Kafka streams.) This step and the next are sketched in code after this list.
  2. Specify a security group. To allow AWS Glue to communicate between its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating this rule, you can restrict the source to the same security group in the Amazon VPC. Ensure you check the default security group for your VPC, as it could have a preconfigured self-referencing inbound rule for ALL traffic.
  3. Create the Data Catalog table. Glue can auto-create the data schema. Since this is a simple flat file, use Glue’s schema detection function. Specify the Kafka topic and reference the connection created in step 1.
  4. Define the job properties. Create the AWS Identity and Access Management (IAM) role to allow Glue to connect to S3 data. Select an S3 bucket and format. In this case, we use CSV and enable schema detection.
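
As promised above, here is a hedged boto3 sketch of steps 1 and 2. Every name, ID, and path is an illustrative assumption; adjust them to your environment:

```python
import boto3

# Step 1: the Kafka SSL connection in the Glue Data Catalog.
glue = boto3.client("glue")
glue.create_connection(
    ConnectionInput={
        "Name": "onprem-kafka-ssl",
        "ConnectionType": "KAFKA",
        "ConnectionProperties": {
            "KAFKA_BOOTSTRAP_SERVERS": "kafka.on-prem.example:9094",
            "KAFKA_SSL_ENABLED": "true",
            # Custom-managed certificate for the on-premises cluster.
            "KAFKA_CUSTOM_CERT": "s3://my-bucket/certs/kafka-ca.pem",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)

# Step 2: the self-referencing inbound rule so Glue components can
# communicate with each other on all TCP ports.
ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],
    }],
)
```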

The Glue job can be scheduled, initiated manually, or triggered by an event-driven architecture. Note that Glue does not yet support the “test connection” option within the console. Make sure you set the “Job Timeout” and enter a duration in minutes, because the default value is blank.
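
A hedged boto3 sketch of defining the job with an explicit timeout follows; the job name, role, script location, and sizing values are illustrative assumptions:

```python
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="onprem-kafka-to-s3",
    Role="GlueKafkaToS3Role",  # hypothetical IAM role with S3 access
    Command={
        "Name": "gluestreaming",  # streaming ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/kafka_to_s3.py",
        "PythonVersion": "3",
    },
    Connections={"Connections": ["onprem-kafka-ssl"]},
    Timeout=60,  # the "Job Timeout", in minutes; the console default is blank
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)
```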

When the job runs, it pulls the latest records from the source topics on the on-premises Kafka cluster. Glue supports checkpoints to ensure that all source data is processed. By default, AWS Glue processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data arriving later. You can modify this window size to increase timeliness or aggregation accuracy. AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read. AWS Glue bills hourly for streaming ETL jobs only while they are running.
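
Continuing the script skeleton from earlier, the window size and checkpoint location are set when the stream is handed to a batch function; the values below are illustrative assumptions:

```python
# process_batch (sketched below) transforms and writes each micro-batch;
# windowSize overrides the 100-second default.
glue_context.forEachBatch(
    frame=kafka_stream,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-bucket/checkpoints/kafka_to_s3/",
    },
)
```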

Now that the connection is complete and the job is created, we can format the source data needed for the data lake. AWS Glue offers a set of built-in transforms that you can use to process your data using your ETL script. The transformed data is then placed in S3, where it can be leveraged as part of a larger data lake environment.
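
As a hedged illustration of those built-in transforms, the batch function below renames and retypes the insurance fields from the earlier example before writing CSV output to S3; all field, bucket, and function names are assumptions for this sketch:

```python
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import ApplyMapping

def process_batch(data_frame, batch_id):
    # Wrap the micro-batch in a DynamicFrame so Glue transforms apply.
    dynamic_frame = DynamicFrame.fromDF(data_frame, glue_context, "batch")

    # (source field, source type, target field, target type)
    mapped = ApplyMapping.apply(
        frame=dynamic_frame,
        mappings=[
            ("account_id", "string", "account_id", "string"),
            ("claim_id", "string", "claim_id", "string"),
            ("amount", "string", "claim_amount", "double"),
        ],
    )

    # Land the transformed micro-batch in the S3 data lake as CSV.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/datalake/claims/"},
        format="csv",
    )
```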

Many additional steps can be taken to derive even more value from the information. For example, one team may choose to use a business intelligence tool like Amazon QuickSight to visualize and embed the data in an internal dashboard. Another team may want to use event-driven architectures to notify financial analysts and initiate downstream actions when specific types of data are discovered. The opportunities are endless and should be determined by the business needs.

Summary


In this blog post, we have given an overview of an architecture that provides hybrid cloud data integration and analytics capability. Once the data is transformed and hosted in the S3 data lake, we can provide secure, reliable access to gain valuable insights. This solution allows for a variety of different producers and consumers, with the ability to handle increasing volumes of data.

AWS Glue, along with Apache Kafka, can ensure that your on-premises workloads are tightly integrated with your larger data lake solution.