Centralize feature engineering with AWS Step Functions and AWS Glue DataBrew

One of the key phases of a machine learning (ML) workflow is data preprocessing, which involves cleaning, exploring, and transforming the data. AWS Glue DataBrew, announced at AWS re:Invent 2020, is a visual data preparation tool that enables you to develop common data preparation steps without having to write any code or install anything.


In this post, we show how to integrate common data preparation steps with training an ML model and running inference on a pre-trained model using DataBrew and AWS Step Functions. The solution is architected as an ML pipeline that trains a model on the publicly available Air Quality dataset to predict CO levels in New York City.

Overview of solution

The following architecture diagram shows an overview of the ML workflow, which employs DataBrew for data preparation and job scheduling, and uses AWS Lambda and Step Functions, via the AWS Step Functions Data Science SDK, to orchestrate ML model training and inference. We use Amazon EventBridge to trigger the Step Functions state machine when the DataBrew job is complete.

(Architecture diagram: scope of the solution)
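Before walking through the steps, here is a minimal boto3 sketch of the kind of EventBridge rule the stack wires between DataBrew and Lambda. The rule name, target Lambda ARN, and the success-only filter are illustrative assumptions, not the exact definitions the CloudFormation template deploys.

import json
import boto3

events = boto3.client("events")

# Hypothetical rule name and Lambda target ARN; the CloudFormation stack
# provisions its own versions of these resources.
RULE_NAME = "databrew-job-complete-rule"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:start-ml-pipeline"

# Match DataBrew job state-change events; filtering on SUCCEEDED is an
# assumption about how the deployed rule is scoped.
pattern = {
    "source": ["aws.databrew"],
    "detail-type": ["DataBrew Job State Change"],
    "detail": {"state": ["SUCCEEDED"]},
}

events.put_rule(Name=RULE_NAME, EventPattern=json.dumps(pattern), State="ENABLED")
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "1", "Arn": LAMBDA_ARN}])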

The steps in this solution are as follows:

  1.  Import your dataset to Amazon Simple Storage Service (Amazon S3).
  2.  Launch the AWS CloudFormation stack, which deploys the following:
    1. DataBrew recipes for training and inference data.
    2. The DataBrew job schedules for training and inference.
    3.  An EventBridge rule.
    4. A Lambda function that triggers the Step Functions state machine, which in turn orchestrates the states.
    5. The training state includes the following steps (a minimal Step Functions Data Science SDK sketch follows this list):
      1. Runs an Amazon SageMaker processing job to remove column headers.
      2. Performs SageMaker model training.
      3. Outputs the data to an S3 bucket to store the trained model.
    6. The inference state includes the following steps:
      1. Runs a SageMaker processing job to remove column headers.
      2. Performs a SageMaker batch transform.
      3. Outputs the data to an S3 bucket to store the predictions.
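For orientation, the following is a minimal sketch of how a training state like the one above can be assembled with the AWS Step Functions Data Science SDK. The estimator image, role ARN, bucket, state name, and job name are illustrative placeholders, not the exact definitions the CloudFormation stack deploys.

import sagemaker
from sagemaker.estimator import Estimator
from stepfunctions.steps import TrainingStep, Chain
from stepfunctions.workflow import Workflow

# Placeholder role and bucket; the real values come from the deployed stack.
role = "arn:aws:iam::123456789012:role/StepFunctionsWorkflowExecutionRole"
bucket = "<artifactbucket>"

# Illustrative estimator trained on the prepared CO features.
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", "us-east-1", version="1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/artifact-repo/model/",
)

# One training step; the deployed workflow also includes a preceding
# SageMaker processing step that removes column headers.
train_step = TrainingStep(
    "SageMaker training",
    estimator=estimator,
    job_name="co-model-training-job",  # placeholder job name
    data={
        "train": sagemaker.inputs.TrainingInput(
            f"s3://{bucket}/train_features/", content_type="text/csv"
        )
    },
)

workflow = Workflow(name="train-state-machine-sketch", definition=Chain([train_step]), role=role)
# workflow.create() would register the state machine in your account.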

Prerequisites

For this solution, you need an AWS account with access to the AWS services used in this post.

Load the dataset to Amazon S3

In this first step, we load our air quality dataset into Amazon S3.

  1. Download the Outdoor Air Quality Dataset for the years 2018, 2019, and 2020, limiting to the following options:
    1. Pollutant – CO
    2. Geographic Area – New York
    3. Monitor Site – All Sites
  2. For each year of data, split by year, month, and day, and use the data for 2018–2019 to train the model and the 2020 data to run inference.
  3. Run the following script, which stores each year's output in a NY_XXXX folder (NY_2018, NY_2019, and NY_2020):
import os
import pandas as pd

def split_data(root_folder, df):
    # Add year, month, and day columns derived from the Date column
    df["year"] = pd.DatetimeIndex(df["Date"]).year
    df["month"] = pd.DatetimeIndex(df["Date"]).month
    df["day"] = pd.DatetimeIndex(df["Date"]).day
    # Write one CSV per day under <root_folder>/<month>/<day>/
    for (m, d), daily in df.groupby(["month", "day"]):
        day_dir = os.path.join(root_folder, "{:02}".format(m), "{:02}".format(d))
        os.makedirs(day_dir, exist_ok=True)
        daily.to_csv(os.path.join(day_dir, "{:02}.csv".format(d)), index=False)

ny_data_2018 = pd.read_csv("<path to downloaded 2018 data file>")
ny_data_2019 = pd.read_csv("<path to downloaded 2019 data file>")
ny_data_2020 = pd.read_csv("<path to downloaded 2020 data file>")

split_data("NY_2018", ny_data_2018)
split_data("NY_2019", ny_data_2019)
split_data("NY_2020", ny_data_2020)
  4. Create an S3 bucket in the us-east-1 Region and upload the folders NY_2018 and NY_2019 to the path s3://<artifactbucket>/train_raw_data/.

  5. Upload the folder NY_2020 to s3://<artifactbucket>/inference_raw_data/ (a boto3 sketch covering both uploads follows these steps).
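If you prefer to script the uploads, a boto3 sketch along these lines mirrors steps 4 and 5; the bucket name is a placeholder for the bucket you created.

import os
import boto3

s3 = boto3.client("s3")
bucket = "<artifactbucket>"  # placeholder for your bucket in us-east-1

def upload_folder(local_folder, s3_prefix):
    # Walk the local partition folders and mirror them under the given S3 prefix
    for root, _, files in os.walk(local_folder):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            key = f"{s3_prefix}/{os.path.relpath(local_path).replace(os.sep, '/')}"
            s3.upload_file(local_path, bucket, key)

for folder in ["NY_2018", "NY_2019"]:
    upload_folder(folder, "train_raw_data")
upload_folder("NY_2020", "inference_raw_data")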

Deploy your resources

For a quick start of this solution, you can deploy the provided AWS CloudFormation stack. This creates all the required resources in your account (us-east-1 Region), including the DataBrew datasets, jobs, projects, and recipes; the Step Functions train and inference state machines (which include SageMaker processing, model training, and batch transform jobs); an EventBridge rule; and the Lambda function to deploy an end-to-end ML pipeline for a predefined S3 bucket.

  1. Launch the provided CloudFormation stack (a boto3 equivalent of these steps follows).
  2. For ArtifactBucket, enter the name of the S3 bucket you created in the previous step.

  3. Select the three acknowledgement check boxes.
  4. Choose Create stack.
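If you would rather launch the stack from code than the console, a rough boto3 equivalent of these steps is shown below; the stack name and template URL are placeholders for the ones provided with this post.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="ml-framework-databrew",  # placeholder stack name
    TemplateURL="https://<template-bucket>.s3.amazonaws.com/<template>.yaml",  # placeholder URL
    Parameters=[{"ParameterKey": "ArtifactBucket", "ParameterValue": "<artifactbucket>"}],
    # These capabilities correspond to the three acknowledgement check boxes
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)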

Test the solution

The CloudFormation template creates the DataBrew job km-mlframework-trainingfeatures-job, which is scheduled to run every Monday at 10:00 AM UTC. This job creates the features required to train the model.
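A Monday 10:00 AM UTC schedule like this is expressed as a cron expression on a DataBrew schedule. The stack already creates the schedule for you; the following boto3 sketch, with a placeholder schedule name, shows the equivalent call.

import boto3

databrew = boto3.client("databrew")

# Run the training-feature job every Monday at 10:00 AM UTC
databrew.create_schedule(
    Name="training-features-weekly",  # placeholder; the stack provisions its own schedule
    JobNames=["km-mlframework-trainingfeatures-job"],
    CronExpression="cron(0 10 ? * MON *)",
)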

When the template deployment is successfully completed, you can manually activate the training pipeline. For this, navigate to the DataBrew console, select the DataBrew job km-mlframework-trainingfeatures-job, and choose Run job.
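You can also start the job programmatically instead of using the console, for example with boto3 (assuming credentials with DataBrew permissions):

import boto3

databrew = boto3.client("databrew")
# Kick off the training-feature job outside its weekly schedule
run = databrew.start_job_run(Name="km-mlframework-trainingfeatures-job")
print(run["RunId"])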


The job writes the features to s3://<artifactbucket>/train_features/.

When the job is complete, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker training jobs via Step Functions.
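The Lambda function's role here is essentially to translate the DataBrew completion event into a Step Functions execution. A minimal sketch of such a handler follows; the STATE_MACHINE_ARN environment variable and the choice of which event fields to forward are assumptions, not the exact code the stack deploys.

import json
import os
import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # The DataBrew job name arrives in the EventBridge event detail
    job_name = event.get("detail", {}).get("jobName", "")
    # STATE_MACHINE_ARN is an assumed environment variable pointing at the
    # training (or inference) state machine created by the stack
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps({"databrew_job": job_name}),
    )
    return {"executionArn": response["executionArn"]}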


When the job is complete, the output of the model is stored in s3://<artifactbucket>/artifact-repo/model/.


In the next step, we trigger the DataBrew job km-mlframework-inferencefeatures-job, which is scheduled to run every Tuesday at 10:00 AM UTC. This job creates the inference features that are used to run inference on the trained model.

You can also activate the inference pipeline by manually triggering the DataBrew job on the DataBrew console.


The job writes the features to s3://<artifactbucket>/inference_features/.

When the job is complete, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker batch transform job via Step Functions.
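For reference, the batch transform that this state runs corresponds to a SageMaker transform job over the 2020 features. A rough boto3 equivalent is sketched below, with the model name and job name as placeholders for the ones the pipeline creates.

import boto3

sm = boto3.client("sagemaker")
bucket = "<artifactbucket>"

sm.create_transform_job(
    TransformJobName="co-predictions-2020",  # placeholder job name
    ModelName="co-regression-model",         # placeholder; created from the trained model artifact
    TransformInput={
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": f"s3://{bucket}/inference_features/"}
        },
        "ContentType": "text/csv",
        "SplitType": "Line",
    },
    TransformOutput={"S3OutputPath": f"s3://{bucket}/predictions/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)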


When the job is complete, the predictions are written to s3://<artifactbucket>/predictions/.

For more information on DataBrew steps and building a DataBrew recipe, see Preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Wait for any currently running activity to complete, or manually stop it (DataBrew, Step Functions, SageMaker).
  2.  Delete the scheduled DataBrew jobs km-mlframework-trainingfeatures-job and km-mlframework-inferencefeatures-job so they aren’t started again by their schedules (a boto3 sketch follows this list).
  3. Delete the S3 bucket created to store data and model artifacts.
  4. Delete the CloudFormation stack created earlier.
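If you prefer to script step 2 of the cleanup, a minimal boto3 sketch:

import boto3

databrew = boto3.client("databrew")

# Deleting the jobs prevents their schedules from starting new runs
for job_name in ["km-mlframework-trainingfeatures-job", "km-mlframework-inferencefeatures-job"]:
    databrew.delete_job(Name=job_name)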

Conclusion

DataBrew is designed to help data engineers and data scientists experiment with data preparation steps through a visual interface. With more than 250 built-in transformations, DataBrew can be a powerful tool for accelerating your ML lifecycle across development and production.


In this post, we walked through the process of creating an end-to-end ML framework with DataBrew, which you can use to train an ML model as well as run inference on a schedule. You can use the same framework with your own DataBrew recipe prepared on any dataset.

To learn more on applying the most frequently used transformations from within DataBrew, see 7 most common data preparation transformations in AWS Glue DataBrew.


About the Authors

Gayatri Ghanakota is a Machine Learning Engineer with AWS Professional Services, where she helps customers build machine learning solutions on AWS. She is passionate about developing, deploying, and explaining ML models.

Surbhi Dangi is a product and design leader at Amazon Web Services. She focuses on providing ease of use and rich functionality for both her analytics and monitoring products, Amazon CloudWatch Synthetics and AWS Glue DataBrew. When not working, she mentors aspiring product managers, hikes, and travels the world.