AWS Fault Injection Simulator – Use Controlled Experiments to Boost Resilience

AWS gives you the components that you need to build systems that are highly reliable: multiple Regions (each with multiple Availability Zones ), Amazon CloudWatch (metrics, monitoring, and alarms), Auto Scaling , Load Balancing , several forms of cross-region replication, and lots more.

Source: AWS Fault Injection Simulator – Use Controlled Experiments to Boost Resilience

When you put them together in line with the guidance provided in the Well-Architected Framework, your systems should be able to keep going even if individual components fail.

However, you won’t know that this is indeed the case until you perform the right kinds of tests. The relatively new field of Chaos Engineering (based on pioneering work done by “Master of Disaster” Jesse Robbins in the early days of Amazon.com, and then taken into high gear by the Netflix Chaos Monkey) focuses on adding stress to an application by creating disruptive events, observing how the system responds, and implementing improvements. In addition to pointing out the areas for improvements, Chaos Engineering helps to discover blind spots that deserve additional monitoring & alarming, uncovers once-hidden implementation issues, and gives you an opportunity to improve your operational skills with an eye toward improving recovery time. To learn a lot more about this topic, start with Chaos Engineering – Part 1 by my colleague Adrian Hornsby.

Advertisements

Introducing AWS Fault Injection Simulator (FIS)
Today we are introducing AWS Fault Injection Simulator (FIS). This new service will help you to perform controlled experiments on your AWS workloads by injecting faults and letting you see what happens. You will learn how your system reacts to various types of faults and you will have a better understanding of failure modes. You can start by running experiments in pre-production environments and then step up to running them as part of your CI/CD workflow and ultimately in your production environment.

Each AWS Fault Injection Simulator (FIS) experiment targets a specific set of AWS resources and performs a set of actions on them. We’re launching with support for Amazon Elastic Compute Cloud (EC2)Amazon Elastic Container Service (ECS)Amazon Elastic Kubernetes Service (EKS), and Amazon Relational Database Service (RDS), with additional resources and actions on the roadmap for 2021. You can select the target resources by type, tag, ARN, or by querying for specific attributes. You also have the ability to stop the experiment if one or more stop conditions (as defined by CloudWatch Alarms) are met. This allows you to quickly terminate the experiment if it has an unexpected impact on a crucial business or operational metric.

Advertisements

Using AWS Fault Injection Simulator (FIS)
Let’s create an experiment template and run an experiment! I will use four EC2 instances, all tagged with a Mode of Test:

I open the FIS Console and click Create experiment template to get started:

I enter a Description and choose an IAM Role. The role grants permission that are needed for FIS to perform actions on the selected resources so that it can perform the experiment:

Next, I define the action(s) that comprise the experiment. I click on Add action to get started:

Then I define my first action — I want to stop some of my EC2 instances (tagged with a Mode of Test for this example) for five minutes, and make sure that my system stays running. I make my choices and click Save:

Next, I choose the target resources (EC2 instances in this case) for the experiment. I click Add target, give my target a name, and indicate that it consists of all of my EC2 instances (in the current region) that have tag Mode with value Test. I can also choose a random instance or a percentage of all of instances that match the tag or the Resource filter. Again, I make my choices and click Save:

I can choose one or more stop conditions (CloudWatch Alarms) for the experiment. If an alarm is triggered, the experiment stops. This is a safety mechanism that allows me to make sure that a local failure does not cascade into a full-scale outage.

Finally, I tag my experiment and click Create experiment template:

My template is ready to be used as the basis for an experiment:

SaleBestseller No. 1
Acer Aspire 3 A315-24P-R7VH Slim Laptop | 15.6" Full HD IPS Display | AMD Ryzen 3 7320U Quad-Core Processor | AMD Radeon Graphics | 8GB LPDDR5 | 128GB NVMe SSD | Wi-Fi 6 | Windows 11 Home in S Mode
  • Purposeful Design: Travel with ease and look great...
  • Ready-to-Go Performance: The Aspire 3 is...
  • Visibly Stunning: Experience sharp details and...
  • Internal Specifications: 8GB LPDDR5 Onboard...
  • The HD front-facing camera uses Acer’s TNR...
Bestseller No. 2
HP Newest 14" Ultral Light Laptop for Students and Business, Intel Quad-Core N4120, 8GB RAM, 192GB Storage(64GB eMMC+128GB Micro SD), 1 Year Office 365, Webcam, HDMI, WiFi, USB-A&C, Win 11 S
  • 【14" HD Display】14.0-inch diagonal, HD (1366 x...
  • 【Processor & Graphics】Intel Celeron N4120, 4...
  • 【RAM & Storage】8GB high-bandwidth DDR4 Memory...
  • 【Ports】1 x USB 3.1 Type-C ports, 2 x USB 3.1...
  • 【Windows 11 Home in S mode】You may switch to...

Last update on 2024-04-05 / Affiliate links / Images from Amazon Product Advertising API

To run an experiment, I select a template and choose Start experiment from the Actions menu:

Then I click Start experiment (I also decided to add a tag):

I confirm my intent, since it can affect my AWS resources:

My experiment starts to run, and I can watch the actions:

As expected, the target instances are stopped:

My experiment runs to conclusion, and I now know that my system can keep on going if those instances are stopped:

I can also create, run, and review experiments using the FIS API and the FIS CLI. You could, for example, run different experiments against the same target, or run the same experiment against different targets.

Available Now
AWS Fault Injection Simulator (FIS) is available now and you can use it to run controlled experiments today. It is available in all of the commercial AWS Regions today except Asia Pacific (Osaka) and the two Regions in China. The remaining three commercial regions are on the roadmap.

New
Naclud Laptops, 15 Inch Laptop, Laptop Computer with 128GB ROM 4GB RAM, Intel N4000 Processor(Up to 2.6GHz), 2.4G/5G WiFi, BT5.0, Type C, USB3.2, Mini-HDMI, 53200mWh Long Battery Life
  • EFFICIENT PERFORMANCE: Equipped with 4GB...
  • Powerful configuration: Equipped with the Intel...
  • LIGHTWEIGHT AND ADVANCED - The slim case weighs...
  • Multifunctional interface: fast connection with...
  • Worry-free customer service: from date of...
New
HP - Victus 15.6" Full HD 144Hz Gaming Laptop - Intel Core i5-13420H - 8GB Memory - NVIDIA GeForce RTX 3050-512GB SSD - Performance Blue (Renewed)
  • Powered by an Intel Core i5 13th Gen 13420H 1.5GHz...
  • Equipped with an NVIDIA GeForce RTX 3050 6GB GDDR6...
  • Includes 8GB of DDR4-3200 RAM for smooth...
  • Features a spacious 512GB Solid State Drive for...
  • Boasts a vibrant 15.6" FHD IPS Micro-Edge...
New
HP EliteBook 850 G8 15.6" FHD Laptop Computer – Intel Core i5-11th Gen. up to 4.40GHz – 16GB DDR4 RAM – 512GB NVMe SSD – USB C – Thunderbolt – Webcam – Windows 11 Pro – 3 Yr Warranty – Notebook PC
  • Processor - Powered by 11 Gen i5-1145G7 Processor...
  • Memory and Storage - Equipped with 16GB of...
  • FHD Display - 15.6 inch (1920 x 1080) FHD display,...
  • FEATURES - Intel Iris Xe Graphics – Audio by...
  • Convenience & Warranty: 2 x Thunderbolt 4 with...

Last update on 2024-04-05 / Affiliate links / Images from Amazon Product Advertising API

Pricing is based on the number of minutes that your actions run; read the Original Postricing/" target="_blank" rel="noreferrer noopener">FIS Pricing page to learn more.

We’ll be adding support for additional services and additional actions throughout 2021, so stay tuned!

— Jeff;