Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions

Data lakes, business intelligence, operational analytics, and data warehousing share a common core characteristic—the ability to extract, transform, and load (ETL) data for analytics. Since its launch in 2017, AWS Glue has provided serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

AWS Glue interactive sessions allows programmers to build, test, and run data preparation and analytics applications. Interactive sessions provide access to run fully managed serverless Apache Spark using an on-demand model. AWS Glue interactive sessions also provide advanced users the same Apache Spark engine as AWS Glue 2.0 or AWS Glue 3.0, with built-in cost controls and speed. Additionally, development teams immediately become productive using their existing development tool of choice.

In this post, we walk you through how to use AWS Glue interactive sessions with PyCharm to author AWS Glue jobs.

Solution overview

This post provides a step-by-step walkthrough that builds on the instructions in Getting started with AWS Glue interactive sessions. It guides you through the following steps:

  1. Create an AWS Identity and Access Management (IAM) policy with limited Amazon Simple Storage Service (Amazon S3) read privileges and associated role for AWS Glue.
  2. Configure access to a development environment. You can use a desktop computer or an OS running on the AWS Cloud using Amazon Elastic Compute Cloud (Amazon EC2).
  3. Integrate AWS Glue interactive sessions with an integrated development environments (IDE).

We use the script Validate_Glue_Interactive_Sessions.ipynb for validation, available as a Jupyter notebook.

Prerequisites

You need an AWS account before you proceed. If you don’t have one, refer to How do I create and activate a new AWS account? This guide assumes that you already have installed Python and PyCharm. Python 3.7 or later is the foundational prerequisite.

Create an IAM policy

The first step is to create an IAM policy that limits read access to the S3 bucket s3://awsglue-datasets, which has the AWS Glue public datasets. You use IAM to define the policies and roles for access to AWS Glue.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. On the JSON tab, enter the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*",
                    "s3-object-lambda:Get*",
                    "s3-object-lambda:List*"
                ],
                "Resource": ["arn:aws:s3:::awsglue-datasets/*"]
            }
        ]
    }
  4. Choose Next: Tags.
  5. Choose Next: Review.
  6. For Policy name, enter glue_interactive_policy_limit_s3.
  7. For Description, enter a description.
  8. Choose Create policy.

Create an IAM role for AWS Glue

To create a role for AWS Glue with limited Amazon S3 read privileges, complete the following steps:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type, select AWS service.
  4. For Use cases for other AWS services, choose Glue.
  5. Choose Next.

  6. On the Add permissions page, search and choose the AWS managed permission policies AWSGlueServiceRole and glue_interactive_policy_limit_s3.
  7. Choose Next.
  8. For Role name, enter glue_interactive_role.

  9. Choose Create role.

  10. Note the ARN of the role, arn:aws:iam::<replacewithaccountID>:role/glue_interactive_role.

Set up development environment access

This secondary level of access configuration needs to occur on the developer’s environment. The development environment can be a desktop computer running Windows or Mac/Linux, or similar operating systems running on the AWS Cloud using Amazon EC2. The following steps walk through each client access configuration. You can select the configuration path that is applicable to your environment.

Set up a desktop computer

To set up a desktop computer, we recommend completing the steps in Getting started with AWS Glue interactive sessions.

Set up an AWS Cloud-based computer with Amazon EC2

This configuration path follows the best practices for providing access to cloud-based resources using IAM roles. For more information, refer to Using an IAM role to grant permissions to applications running on Amazon EC2 instances.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type¸ select AWS service.
  4. For Common use cases, select EC2.
  5. Choose Next.

  6. Add the AWSGlueServiceRole policy to the newly created role.
  7. On the Add permissions menu, choose Create inline policy.

  8. Create an inline policy that allows the instance profile role to pass or assume glue_interactive_role and save the new role as ec2_glue_demo.

Your new policy is now listed under Permissions policies.

  1. On the Amazon EC2 console, choose (right-click) the instance you want to attach to the newly created role.
  2. Choose Security and choose Modify IAM role.

  3. For IAM role¸ choose the role ec2_glue_demo.
  4. Choose Save.

  5. On the IAM console, open and edit the trust relationship for glue_interactive_role.
  6. Add “AWS”: [“arn:aws:iam:::user/glue_interactive_user”,”arn:aws:iam:::role/ec2_glue_demo”] to the principal JSON key.

  7. Complete the steps detailed in Getting started with AWS Glue interactive sessions.

You don’t need to provide an AWS access key ID or AWS secret access key as part of the remaining steps.

Integrate AWS Glue interactive sessions with an IDE

You’re now ready to set up and validate your PyCharm integration with AWS Glue interactive sessions.

  1. On the welcome page, choose New Project.
  2. For Location, enter the location of your project glue-interactive-demo.
  3. Expand Python Interpreter.
  4. Select Previously configured interpreter and choose the virtual environment you created earlier.
  5. Choose Create.
SaleBestseller No. 1
SAMSUNG Galaxy A54 5G A Series Cell Phone, Unlocked Android Smartphone, 128GB, 6.4” Fluid Display Screen, Pro Grade Camera, Long Battery Life, Refined Design, US Version, 2023, Awesome Black
  • CRISP DETAIL, CLEAR DISPLAY: Enjoy binge-watching...
  • PRO SHOTS WITH EASE: Brilliant sunrises, awesome...
  • CHARGE UP AND CHARGE ON: Always be ready for an...
  • POWERFUL 5G PERFORMANCE: Do what you love most —...
  • NEW LOOK, ADDED DURABILITY: Galaxy A54 5G is...
Bestseller No. 2
OnePlus 12,16GB RAM+512GB,Dual-SIM,Unlocked Android Smartphone,Supports 50W Wireless Charging,Latest Mobile Processor,Advanced Hasselblad Camera,5400 mAh Battery,2024,Flowy Emerald
  • Free 6 months of Google One and 3 months of...
  • Pure Performance: The OnePlus 12 is powered by the...
  • Brilliant Display: The OnePlus 12 has a stunning...
  • Powered by Trinity Engine: The OnePlus 12's...
  • Powerful, Versatile Camera: Explore the new 4th...

The following screenshot shows the New Project page on a Mac computer. A Windows computer setup will have a relative path beginning with C:\ followed by the PyCharm project location.

  1. Choose the project (right-click) and on the New menu, choose Jupyter Notebook.

  2. Name the notebook Validate_Glue_Interactive_Sessions.

The notebook has a drop-down called Managed Jupyter server: auto-start, which means the Jupyter server automatically starts when any notebook cell is run.

  1. Run the following code:
    print("This notebook will start the local Python kernel")

You can observe that the Jupyter server started running the cell.

  1. On the Python 3 (ipykernal) drop-down, choose Glue PySpark.

  2. Run the following code to start a Spark session:
    spark
  3. Wait to receive the message that a session ID has been created.

  4. Run the following code in each cell, which is the boilerplate syntax for AWS Glue:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    glueContext = GlueContext(SparkContext.getOrCreate())
  5. Read the publicly available Medicare Provider payment data in the AWS Glue data preparation sample document:
    medicare_dynamicframe = glueContext.create_dynamic_frame.from_options(
        's3',
        {'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
        'csv',
        {'withHeader': True})
    print("Count:",medicare_dynamicframe.count())
    medicare_dynamicframe.printSchema()
  6. Change the data type of the provider ID to long to resolve all incoming data to long:
    medicare_res = medicare_dynamicframe.resolveChoice(specs = [('Provider Id','cast:long')])
    medicare_res.printSchema()
  7. Display the providers:
    medicare_res.toDF().select('Provider Name').show(10,truncate=False)

Clean up

You can run %delete_session which deletes the current session and stops the cluster, and the user stops being charged. Have a look at the AWS Glue interactive sessions magics. Also please remember to delete IAM policy and role once you are done.

Conclusion

New
Fadnou I23 Ultra Unlocked Cell Phone,Built in Pen,Smartphone Battery 6800mAh 6.8" HD Screen Unlocked Phones,6+256GB Android13 with 128G Memory Card,Face ID/Fingerprint Lock/GPS (Purple)
  • 【Octa-Core CPU + 128GB Expandable TF Card】...
  • 【6.8 HD+ Android 13.0】 This is an Android Cell...
  • 【Dual SIM and Global Band 5G Phone】The machine...
  • 【6800mAh Long lasting battery】With the 6800mAh...
  • 【Business Services】The main additional...
New
Huness I15 Pro MAX Smartphone Unlocked Cell Phone,Battery 6800mAh 6.8 HD Screen Unlocked Phone,6+256GB Android 13 with 128GB Memory Card,Dual SIM/5G/Fingerprint Lock/Face ID (Black, 6+256)
  • 【Dimensity 9000 CPU + 128GB Expandable TF...
  • 【6.8 HD+ Android 13.0】 This is an Android Cell...
  • 【Dual SIM and Global Band 5G Phone】Dual SIM &...
  • 【6800mAh Long lasting battery】The I15 Pro MAX...
  • 【Business Services】The main additional...
New
Jopuzia U24 Ultra Unlocked Cell Phone, 5G Smartphone with S Pen, 8GB+256GB Full Netcom Unlocked Phone, 6800mAh Battery 6.8" FHD+ Display 120Hz 80MP Camera, GPS/Face ID/Dual SIM Phone (Rose Gold)
  • 🥇【6.8" HD Unlocked Android Phones】Please...
  • 💗【Octa-Core CPU+ 256GB Storage】U24 Ultra...
  • 💗【Support Global Band 5G Dual SIM】U24 Ultra...
  • 💗【80MP Professional Photography】The U24...
  • 💗【6800mAh Long Lasting Battery】With the...

In this post, we demonstrated how to configure PyCharm to integrate and work with AWS Glue interactive sessions. The post builds on the steps in Getting started with AWS Glue interactive sessions to enable AWS Glue interactive sessions to work with Jupyter notebooks. We also provided ways to validate and test the functionality of the configuration.


About the Authors

Kunal Ghosh is a Sr. Solutions Architect at AWS. His passion is building efficient and effective solutions on cloud, especially involving analytics, AI, data science, and machine learning. Besides family time, he likes reading and watching movies. He is a foodie.

Sebastian Muah is a Solutions Architect at AWS focused on analytics, AI/ML, and big data. He has over 25 years of experience in information technology and helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. He enjoys cycling and building things around his home.

Original Post>