Detect and process sensitive data using AWS Glue Studio

Data lakes offer the possibility of sharing diverse types of data with different teams and roles to cover numerous use cases. This is very important in order to implement a data democratization strategy and incentivize the collaboration between lines of business. When a data lake is being designed, one of the most important aspects to consider is data privacy. Without it, sensitive information could be accessed by the wrong team, which may affect the reliability of a data platform. However, identifying sensitive data inside a data lake could represent a challenge due to the diversity of the data and also its volume.

Earlier this year, AWS  Glue announced the new sensitive data detection and processing feature to help you identify and protect sensitive information in a straightforward way using AWS Glue Studio. This feature uses pattern matching and machine learning to automatically recognize personally identifiable information (PII) and other sensitive data at the column or cell level as part of AWS Glue jobs.

Sensitive data detection in AWS Glue identifies a variety of sensitive data like phone and credit card numbers, and also offers the option to create custom identification patterns or entities to cover your specific use cases. Additionally, it helps you take action, such as creating a new column that contains any sensitive data detected as part of a row or redacting the sensitive information before writing records into a data lake.

This post shows how to create an AWS Glue job that identifies sensitive data at the row level. We also show how create a custom identification pattern to identify case-specific entities.

Overview of solution

To demonstrate how to create an AWS Glue job to identify sensitive data, we use a test dataset with customer comments that contain private data like Social Security number (SSN), phone number, and bank account number. The goal is to create a job that automatically identifies the sensitive data and triggers an action to redact it.

Prerequisites

For this walkthrough, you should have the following prerequisites:

If the AWS account you use to follow this post uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

  1. Launch your CloudFormation stack in us-east-1:
  2. Under Parameters, enter a name for your S3 bucket (include your account number).
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.

  5. Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.

Launching this stack creates AWS resources. You need the following resources from the Outputs tab for the next steps:

  • GlueSenRole – The IAM role to run AWS Glue jobs
  • BucketName – The name of the S3 bucket to store solution-related files
  • GlueDatabase – The AWS Glue database to store the table related to this post

Create and run an AWS Glue job

Let’s first create the dataset that is going to be used as the source of the AWS Glue job:

  1. Open AWS CloudShell.

  2. Run the following command:
    aws s3 cp s3://aws-bigdata-blog/artifacts/gluesendata/sourcedata/customer_comments.csv s3://glue-sendata-blog-<YOUR ACCOUNT NUMBER>/customer_comments/customer_comments.csv

    This action copies the dataset that is going to be used as the input for the AWS Glue job covered in this post.

    Now, let’s create the AWS Glue job.

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.

  2. Select Visual with blank canvas.

  3. Choose the Job Details tab to configure the job.

  4. For Name, enter GlueSenJob.
  5. For IAM Role, choose the role GlueSenDataBlogRole.
  6. For Glue version, choose Glue 3.0.
  7. For Job bookmark, choose Disable.

  8. Choose Save.

  9. After the job is saved, choose the Visual tab and on the Source menu, choose Amazon S3.

  10. On the Data source properties -S3 tab, for S3 source type, select S3 location.
  11. Add the S3 location of the file that you copied previously using CloudShell.
  12. Choose Infer schema.

This last action infers the schema and file type of the of the source for this post, as you can see in the following screenshot.

Now, let’s see what the data looks like.

  1. On the Data preview tab, choose Start data preview session.

  2. For IAM role, choose the role GlueSeDataBlogRole.
  3. Choose Confirm.

This last step may take a couple of minutes to run.

When you review the data, you can see that sensitive data like phone numbers, email addresses, and SSNs are part of the customer comments.

Now let’s identify the sensitive data in the comments dataset and mask it.

  1. On the Transform menu, choose Detect PII.

The AWS Glue sensitive data identification feature allows you to find sensitive data at the row and column level, which covers a diverse number of use cases. For this post, because we scan comments made by customers, we use the row-level scan.

  1. On the Transform tab, select Find sensitive data in each row.
  2. For Types of sensitive information to detect, select Select specific patterns.

Now we need to select the entities or patterns that are going to be identified by the job.

  1. For Selected patterns, choose Browse.

  2. Select the following patterns:
    1. Credit Card
    2. Email Address
    3. IP Address
    4. Mac Address
    5. Person’s Name
    6. Social Security Number (SSN)
    7. US Passport
    8. US Phone
    9. US/Canada bank account
  3. Choose Confirm.

After the sensitive data is identified, AWS Glue offers two options:

  • Enrich data with detection results – Adds a new column to the dataset with the list of the entities or patterns that were identified in that specific row.
  • Redact detected text – Replaces the sensitive data with a custom string. For this post, we use the redaction option.
  1. For Actions, select Redact detected text.
  2. For Replacement text, enter ####.

SaleBestseller No. 1
SAMSUNG Galaxy A54 5G A Series Cell Phone, Unlocked Android Smartphone, 128GB, 6.4” Fluid Display Screen, Pro Grade Camera, Long Battery Life, Refined Design, US Version, 2023, Awesome Black
  • CRISP DETAIL, CLEAR DISPLAY: Enjoy binge-watching...
  • PRO SHOTS WITH EASE: Brilliant sunrises, awesome...
  • CHARGE UP AND CHARGE ON: Always be ready for an...
  • POWERFUL 5G PERFORMANCE: Do what you love most —...
  • NEW LOOK, ADDED DURABILITY: Galaxy A54 5G is...
Bestseller No. 2
OnePlus 12,16GB RAM+512GB,Dual-SIM,Unlocked Android Smartphone,Supports 50W Wireless Charging,Latest Mobile Processor,Advanced Hasselblad Camera,5400 mAh Battery,2024,Flowy Emerald
  • Free 6 months of Google One and 3 months of...
  • Pure Performance: The OnePlus 12 is powered by the...
  • Brilliant Display: The OnePlus 12 has a stunning...
  • Powered by Trinity Engine: The OnePlus 12's...
  • Powerful, Versatile Camera: Explore the new 4th...

Last update on 2024-04-05 / Affiliate links / Images from Amazon Product Advertising API

Let’s see how the dataset looks now.

  1. Check the result data on the Data preview tab.

As you can see, the majority of the sensitive data was redacted, but there is a number on row 11 that isn’t masked. This is because it’s a Canadian permanent resident number, and this pattern isn’t part of the ones that the sensitive data identification feature offers. However, we can add a custom pattern to identify this number.

  1. On the Transform tab, for Selected patterns, choose Create new.

This action opens the Create detection pattern window, where we create the custom pattern to identify the Canadian permanent resident number.

  1. For Pattern name, enter Can_PR_Number.
  2. For Expression, enter the regular expression [P]+[D]+[0]\d\d\d\d\d\d
  3. Choose Validate.
  4. Wait until you get the validation message, then choose Create pattern.

Now you can see the new pattern listed under Custom patterns.

  1. On the AWS Glue Studio Console, for Selected patterns, choose Browse.

Now you can see Can_PR_Number as part of the pattern list.

  1. Select Can_PR_Number and choose Confirm.

On the Data preview tab, you can see that the Canadian permanent resident number has been redacted.

Let’s add a destination for the dataset with redacted information.

  1. On the Target menu, choose Amazon S3.

  2. On the Data target properties -S3 tab, for Format, choose Parquet.
  3. For S3 Target Location, enter s3://glue-sendata-blog-<YOUR ACCOUNT ID>/output/redacted_comments/.
  4. For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
  5. For Database, choose gluesenblog.
  6. For Table name, enter custcomredacted.

  7. Choose Save, then choose Run.

You can view the job run details on the Runs tab.

Wait until the job is complete.

Query the dataset

Now let’s see what the final dataset looks like. To do so, we query the data with Athena. As part of this post, we assume that a query result location for Athena is already configured; if not, refer to Working with query results, recent queries, and output files.

  1. On the Athena console, open the query editor.
  2. For Database, choose the gluesenblog database.

  3. Run the following query:
    SELECT * FROM "gluesenblog"."custcomredacted" limit 15;
  1. Verify the results; you can observe that all the sensitive data is redacted.

Clean up

New
Fadnou I23 Ultra Unlocked Cell Phone,Built in Pen,Smartphone Battery 6800mAh 6.8" HD Screen Unlocked Phones,6+256GB Android13 with 128G Memory Card,Face ID/Fingerprint Lock/GPS (Purple)
  • 【Octa-Core CPU + 128GB Expandable TF Card】...
  • 【6.8 HD+ Android 13.0】 This is an Android Cell...
  • 【Dual SIM and Global Band 5G Phone】The machine...
  • 【6800mAh Long lasting battery】With the 6800mAh...
  • 【Business Services】The main additional...
New
Huness I15 Pro MAX Smartphone Unlocked Cell Phone,Battery 6800mAh 6.8 HD Screen Unlocked Phone,6+256GB Android 13 with 128GB Memory Card,Dual SIM/5G/Fingerprint Lock/Face ID (Black, 6+256)
  • 【Dimensity 9000 CPU + 128GB Expandable TF...
  • 【6.8 HD+ Android 13.0】 This is an Android Cell...
  • 【Dual SIM and Global Band 5G Phone】Dual SIM &...
  • 【6800mAh Long lasting battery】The I15 Pro MAX...
  • 【Business Services】The main additional...
New
Jopuzia U24 Ultra Unlocked Cell Phone, 5G Smartphone with S Pen, 8GB+256GB Full Netcom Unlocked Phone, 6800mAh Battery 6.8" FHD+ Display 120Hz 80MP Camera, GPS/Face ID/Dual SIM Phone (Rose Gold)
  • 🥇【6.8" HD Unlocked Android Phones】Please...
  • 💗【Octa-Core CPU+ 256GB Storage】U24 Ultra...
  • 💗【Support Global Band 5G Dual SIM】U24 Ultra...
  • 💗【80MP Professional Photography】The U24...
  • 💗【6800mAh Long Lasting Battery】With the...

Last update on 2024-04-05 / Affiliate links / Images from Amazon Product Advertising API

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: Datasets, CloudFormation stack, S3 bucket, AWS Glue job, AWS Glue database, and AWS Glue table.

Conclusion

AWS Glue sensitive data detection offers an easy way to identify and process private data, without coding. This feature allows you to detect and redact sensitive data when it’s ingested into a data lake, enforcing data privacy before the data is available to data consumers. AWS Glue sensitive data detection is generally available in all Regions that support AWS Glue.

To learn more and get started using AWS Glue sensitive data detection, refer to Detect and process sensitive data.


About the author

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn