Field Notes: Scaling Browser Automation with Puppeteer on AWS Lambda with Container Image Support

https://aws.amazon.com/blogs/architecture/field-notes-scaling-browser-automation-with-puppeteer-on-aws-lambda-with-container-image-support/

This post is contributed by Bill Kerr, SHI and Raj Seshadri, Global SA Lead, AWS.

Imagine you are launching a brand new website selling goods and services. You are expecting a huge amount of traffic due to the seasonality of the product.

You would like to test 100K simultaneous connections to the website and make sure it is working properly. How would you go about doing that? Try a headless browser automation tool like Puppeteer. Puppeteer can now be packaged as a container image in a Lambda function to perform browser automation or any web scraping functionality.

Puppeteer is a Node library which allows you to automate tasks in headless Chrome. When using Puppeteer in Lambda with container image support, you can scale browser automation horizontally. With container image support, Node packages can be installed in the container itself instead of having to be put in Lambda layers. This post shows how to run Puppeteer and Chrome in a Lambda container function. In this example, multiple instances of Puppeteer simultaneously take screenshots of several popular news websites and store them in Amazon S3.

Solution Overview

The overall solution architecture is shown in the preceding diagram. Two Lambda functions are used in this example.

  1. A Puppeteer function that requires a URL and bucket name as inputs. This uses Puppeteer to take a screenshot of the URL in headless Chrome and save the image in the S3 bucket.
  2. A fan-out function that requires a list of URLs as input, which asynchronously invokes the Puppeteer function for each URL in the list.
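The fan-out step can be sketched roughly as follows. This is a minimal illustration rather than the code from the example repository: the `ScreenshotRequest` shape, the `InvokeAsync` type, and the `fanOut` function are hypothetical names, and the bucket is passed in as a parameter for simplicity. The invoker is injected so the sketch stays independent of any particular SDK version.

```typescript
// Hypothetical payload shape for the Puppeteer function: a URL and a bucket name.
interface ScreenshotRequest {
  url: string;
  bucket: string;
}

// Hypothetical abstraction over an asynchronous Lambda invocation. In a real
// function this could wrap the AWS SDK's Lambda invoke call with
// InvocationType set to "Event", which queues the request and returns
// without waiting for the target function to finish.
type InvokeAsync = (functionName: string, payload: string) => Promise<void>;

// Invoke the Puppeteer function once per URL and report which invocations
// could not be queued, without letting one failure stop the rest.
async function fanOut(
  urls: string[],
  bucket: string,
  functionName: string,
  invokeAsync: InvokeAsync
): Promise<{ queued: string[]; failed: string[] }> {
  const results = await Promise.all(
    urls.map(async (url) => {
      const request: ScreenshotRequest = { url, bucket };
      try {
        await invokeAsync(functionName, JSON.stringify(request));
        return { url, ok: true };
      } catch {
        return { url, ok: false };
      }
    })
  );
  return {
    queued: results.filter((r) => r.ok).map((r) => r.url),
    failed: results.filter((r) => !r.ok).map((r) => r.url),
  };
}
```

Because each invocation is queued independently, Lambda can run many Puppeteer instances at the same time, which is where the horizontal scaling comes from.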

Lambda container Dockerfile for Puppeteer function

Here is a documented version of the Dockerfile that is used to create a container for use with Lambda.

# Start with an AWS provided image that is ready to use with Lambda
FROM amazon/aws-lambda-nodejs:12

# Allow AWS credentials to be supplied when building this container locally for testing,
# so S3 can be accessed
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
ARG AWS_REGION=us-east-1

# Install Chrome to get all of the dependencies installed
ADD https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm chrome.rpm
RUN yum install -y ./chrome.rpm

ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    AWS_REGION=$AWS_REGION

# Copy files into the container
COPY jest.config.js package*.json tsconfig.json ${LAMBDA_TASK_ROOT}/
COPY bin/app.ts ${LAMBDA_TASK_ROOT}/bin/
COPY test/app.test.ts ${LAMBDA_TASK_ROOT}/test/

# Install, build and test the Puppeteer app and Chrome
RUN npm install
RUN npm run build
RUN npm test

# Lambda handler path
CMD [ "bin/app.lambdaHandler" ]
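The CMD above points Lambda at a `lambdaHandler` export in `bin/app`. The sketch below shows one way such a handler's control flow might look; it is not the repository's actual code. The event shape, the `Deps` interface, `screenshotKey`, and `makeHandler` are all hypothetical, and the object key naming scheme is an assumption. The browser and S3 calls are injected so the flow can be read without Chrome or AWS credentials.

```typescript
// Hypothetical event shape: the source only says the function takes a URL
// and a bucket name as inputs.
interface ScreenshotEvent {
  url: string;
  bucket: string;
}

// Hypothetical seams for the two side effects. In the real app, capture would
// drive Puppeteer (launch a browser, open a page, navigate, take a screenshot)
// and store would upload the image bytes to S3.
interface Deps {
  capture: (url: string) => Promise<Uint8Array>;
  store: (bucket: string, key: string, body: Uint8Array) => Promise<void>;
}

// Derive an object key under the screenshots/ prefix from the URL.
// The naming scheme here is an assumption for illustration.
function screenshotKey(url: string): string {
  const name = url
    .replace(/^https?:\/\//, "")
    .replace(/[^a-zA-Z0-9.-]+/g, "_")
    .replace(/_+$/, "");
  return `screenshots/${name}.png`;
}

// Build a handler in the shape Lambda expects: an async function of the event.
function makeHandler(deps: Deps) {
  return async (event: ScreenshotEvent): Promise<{ bucket: string; key: string }> => {
    if (!event || typeof event.url !== "string" || typeof event.bucket !== "string") {
      throw new Error("Expected an event with string fields 'url' and 'bucket'");
    }
    const image = await deps.capture(event.url);
    const key = screenshotKey(event.url);
    await deps.store(event.bucket, key, image);
    return { bucket: event.bucket, key };
  };
}
```

Throwing on a malformed event makes the failure visible in CloudWatch Logs instead of producing a missing screenshot with no explanation.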

Deploy the cloud infrastructure

Prerequisite

AWS CDK must be installed. Review the CDK installation instructions.

Download and install the dependencies and example CDK application

In a terminal, check out the code used in this article and install it.

# Clone the example CDK application code
git clone https://github.com/shi/crpm-lambda-container-puppeteer

# Change to the infrastructure directory containing CDK and CRPM
cd crpm-lambda-container-puppeteer/infra
# Install the CDK application
npm install

# Deploy the CDK Toolkit CloudFormation stack
cdk bootstrap aws://unknown-account/unknown-region

# Deploy the Puppeteer example CloudFormation stack
cdk deploy puppeteer

Puppeteer Usage

The next steps are performed in the AWS Console.

1. In the AWS Console, open the Lambda function that was created by CDK above.

  • Look for InvokeLambdaFunctionName in the Outputs section to get the name of the function to open.
  • You can also find the function name in the Resources tab of the CloudFormation stack in the AWS Console.

2. In the function, click on the Test tab.

3. Create a new test event whose JSON body is an array of URL strings (the input the fan-out function expects), and run it. Have fun changing the URLs to what you want.

4. Click on the Invoke button to invoke the fan-out function.

5. Open the S3 bucket that was created by the CDK deployment above.

6. Look for puppeteer.BucketName in the Outputs section to get the bucket name.

7. Within a minute of running the fan-out function, you should see a list of images in the screenshots folder in the bucket. They should slowly trickle in as you refresh the list of screenshots until all are done.

8. If any screenshots are missing, you can view CloudWatch Logs for the Puppeteer function.

9. Search all log streams for "error" to see what failed and to decide where the code needs improved error handling.

10. You could modify the app to perform functional testing of a website, and save screenshots in S3 whenever errors occur.
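The test event in step 3 is a list of URLs, matching the fan-out function's input. As a rough sketch, the input could be validated before any invocations are queued, so that a malformed event fails fast with a readable message. The `parseUrlList` function is a hypothetical helper, and the `example.*` domains are placeholders rather than the sites used in the original walkthrough.

```typescript
// Validate the fan-out function's input: a non-empty JSON array of
// http(s) URL strings. Returns the URLs, or throws with a readable message.
function parseUrlList(event: unknown): string[] {
  if (!Array.isArray(event) || event.length === 0) {
    throw new Error("Expected a non-empty JSON array of URLs");
  }
  return event.map((item, index) => {
    if (typeof item !== "string" || !/^https?:\/\/\S+$/.test(item)) {
      throw new Error(`Item ${index} is not an http(s) URL: ${JSON.stringify(item)}`);
    }
    return item;
  });
}

// Hypothetical test event; the example.* domains are placeholders.
const sampleEvent: unknown = JSON.parse(
  '["https://example.com", "https://example.org", "https://example.net"]'
);
const urls = parseUrlList(sampleEvent);
```

Rejecting bad input up front keeps the errors you later search for in CloudWatch Logs focused on genuine page-load or screenshot failures.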

Clean up

In the AWS Console, manually empty the bucket that was created by the CDK. Look for puppeteer.BucketName in the Outputs section to get the bucket name. You can also find the bucket name in the Resources tab of the CloudFormation stack. Then, run the following commands after the bucket has been emptied.

# Destroy the Puppeteer example CloudFormation stack
cdk destroy puppeteer

# Delete the CDK Toolkit CloudFormation stack
aws cloudformation delete-stack --stack-name CDKToolkit

Conclusion

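In this post, we showed you how to use Lambda functions packaged as container images to run Puppeteer and headless Chrome for browser automation and web scraping. Because each invocation runs in its own container, the same pattern extends to load testing, functional testing, and other browser-driven workloads that need to scale horizontally.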

For more serverless learning resources, visit the Serverlessland website.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Bill Kerr

Bill Kerr is a senior developer at Stratascale who has worked at startup and Fortune 500 companies. He’s the creator of CRPM and he’s a super fan of CDK and cloud infrastructure automation.