Journey to Adopt Cloud-Native Architecture Series: #3 – Improved Resilience and Standardized Observability

https://aws.amazon.com/blogs/architecture/journey-to-adopt-cloud-native-architecture-series-3-improved-resilience-and-standardized-observability/

In the last blog, Maximizing System Throughput, we talked about design patterns you can adopt to address immediate scaling challenges and provide a better customer experience. In this blog, we talk about architecture patterns to improve system resiliency, why observability matters, and how to build a holistic observability solution.

As a refresher from previous blogs, our example ecommerce company’s “Shoppers” application runs in the cloud. It is a monolithic application (application server and web server) that runs on an Amazon Elastic Compute Cloud (Amazon EC2) instance. It connects with a PostgreSQL database running on Amazon Relational Database Service (Amazon RDS).

The monolithic application is tightly coupled with the database. A transaction from the front end to the database has to complete within 30 milliseconds, otherwise the user experience degrades. The application integrates with a number of external systems for enriching data, payment processing, and order fulfillment. Some of these external systems don't provide Service Level Agreements (SLAs) on response time but have a median response time of 15 milliseconds. It takes around 30 minutes before a new server starts serving user traffic: 10 minutes for application startup, 5 minutes to create and test connections, 10 minutes to run sanity tests, and 5 minutes to load the application cache.

Increase resiliency

The complexity of monolithic applications can surface unknown failure modes. You cannot avoid failure simply by testing every application and infrastructure failure scenario. Your application must be designed to handle cascading failures and maintain a continuous user experience even when backend systems are experiencing issues. In the following sections, we show you the steps we took to improve system resiliency for our example company.

Minimum business continuity for failover

Building disaster recovery (DR) strategies into your system requires you to work backwards from recovery point objective (RPO) and recovery time objective (RTO) requirements. Our business needs in this scenario required us to build high availability to prevent more than 30 minutes of continuous downtime (RTO) and to prevent persistent loss of user data beyond a few minutes (RPO).

To meet these goals, we developed a long-term business continuity plan per the Disaster Recovery of Workloads on AWS: Recovery in the Cloud whitepaper that consisted of the following:

Due to the stateful nature of the application, taking Amazon EBS snapshots wasn't going to prevent data loss. We updated our continuous integration and continuous delivery (CI/CD) pipeline so that we can deploy application code to different Regions, following the guidance in the Using AWS CodePipeline to Perform Multi-Region Deployments blog.

Improve load-balancing capability across servers

The monolith application handles all transactions, including long transactions that can take up to 3 minutes and short transactions that complete within 30 milliseconds.

To manage the resulting unbalanced traffic, we used the least outstanding requests algorithm on the Application Load Balancer, which distributes load more uniformly by routing each request to the target with the fewest in-flight requests.
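
A minimal sketch of this change with the AWS SDK for Python (boto3); the target group ARN below is a placeholder:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Switch the ALB target group from the default round_robin algorithm
# to least_outstanding_requests (placeholder target group ARN).
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/shoppers-tg/0123456789abcdef",
    Attributes=[
        {"Key": "load_balancing.algorithm.type", "Value": "least_outstanding_requests"},
    ],
)
```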

Exponential backoff

After implementing retries and timeouts, we learned that backoff is equally important, especially exponential backoff, where the wait time increases exponentially after every retry attempt. We know from the Timeouts, retries, and backoff with jitter article that retries without backoff and jitter can become an anti-pattern for application resiliency.
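
As an illustration, here is a minimal capped exponential backoff with full jitter in Python; call_dependency is a hypothetical stand-in for the downstream call being retried:

```python
import random
import time


def call_with_backoff(call_dependency, max_attempts=5, base_delay=0.05, max_delay=2.0):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call_dependency()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt.
            # The delay doubles on each attempt, capped at max_delay; full
            # jitter spreads retries out so clients don't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```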

We observed that database retries were overwhelming the database during periods of latency jitter. So, we implemented retry throttling within the AWS SDK for Java per the Introducing Retry Throttling blog.
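
Our change was made in the AWS SDK for Java; as a hedged illustration of the same idea, the AWS SDK for Python (boto3) exposes retry modes whose client-side retry quota stops a burst of failures from turning into an unbounded wave of retries:

```python
import boto3
from botocore.config import Config

# The "standard" and "adaptive" retry modes enforce a client-side retry
# quota, so repeated failures stop consuming retries instead of piling
# onto an already struggling dependency.
retry_config = Config(retries={"max_attempts": 4, "mode": "adaptive"})

# Any AWS client can share this configuration (SNS shown as an example).
sns = boto3.client("sns", config=retry_config)
```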

Decoupling integrations using event-driven design patterns 

Our application experienced a cascading failure when timeouts in the payment processing and order fulfillment APIs impacted business operations.

To fix this, we identified asynchronous messaging paths that would improve the customer experience. We updated our application logic to use the design pattern shown in the FIFO topics example use case.

This pattern allows us to retry failed requests while using the First-In-First-Out ordering and deduplication features described in the Introducing Amazon SNS FIFO – First-In-First-Out Pub/Sub Messaging blog.
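
A minimal sketch of publishing an order event to an SNS FIFO topic with boto3; the topic name, message group, and payload are illustrative:

```python
import json

import boto3

sns = boto3.client("sns")

# Create (or look up) a FIFO topic; the .fifo suffix is required.
topic = sns.create_topic(
    Name="order-events.fifo",
    Attributes={"FifoTopic": "true", "ContentBasedDeduplication": "false"},
)

# MessageGroupId preserves ordering per order, and MessageDeduplicationId
# lets SNS drop duplicate publishes when our retries fire.
sns.publish(
    TopicArn=topic["TopicArn"],
    Message=json.dumps({"orderId": "1001", "status": "PAYMENT_PENDING"}),
    MessageGroupId="order-1001",
    MessageDeduplicationId="order-1001-payment-pending",
)
```

In this pattern, downstream processors consume from queues subscribed to the topic, so a slow or failing external API no longer blocks the synchronous request path.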

Predictive scaling for EC2

Due to its monolithic architecture and high bootstrap time, the application couldn't scale quickly when traffic suddenly increased.

To predict future demand, we optimized for application availability and allowed automatic scaling to use historical CPU utilization data, the approach highlighted in the New – Predictive Scaling for EC2, Powered by Machine Learning blog. This allows us to adjust capacity by forecasting usage patterns, with a configurable warm-up time to account for the application bootstrap.
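
As a sketch (assuming the predictive scaling policy type on the Auto Scaling group; the group name and target value are illustrative), the roughly 30-minute bootstrap can be covered by launching instances ahead of the forecast:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Forecast on historical CPU utilization and scale ahead of demand.
# SchedulingBufferTime launches instances 30 minutes early to absorb the
# application's slow bootstrap (names and values are illustrative).
autoscaling.put_scaling_policy(
    AutoScalingGroupName="shoppers-monolith-asg",
    PolicyName="shoppers-predictive-scaling",
    PolicyType="PredictiveScaling",
    PredictiveScalingConfiguration={
        "MetricSpecifications": [
            {
                "TargetValue": 50.0,
                "PredefinedMetricPairSpecification": {
                    "PredefinedMetricType": "ASGCPUUtilization"
                },
            }
        ],
        "Mode": "ForecastAndScale",
        "SchedulingBufferTime": 1800,  # seconds
    },
)
```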

Standardize observability

Production outages are scary for everyone, but with the right system monitoring solution they can be made less stressful. After a few outages of our application, we realized we needed to rethink monitoring holistically rather than adding metrics on a one-time basis.

Logging into each server and comparing logs can get overwhelming when you are troubleshooting to identify the bottleneck behind unknown behavior. But we found the key to troubleshooting: observability and event correlation, as explained in the following sections.

Define application and infrastructure metrics

As part of standardizing monitoring and observability metrics, we followed the guidance provided in the Performance Efficiency pillar of the AWS Well-Architected Framework. In summary, we worked backwards from the customer experience to identify performance-related metrics for each system component, and we built dashboards to provide consolidated views.
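
For example, a customer-facing latency measurement can be emitted per component and then rolled up on a consolidated dashboard; the namespace, dimension, and value below are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Emit a customer-experience metric (checkout latency) tagged by component,
# so dashboards and alarms can aggregate it per tier (names are illustrative).
cloudwatch.put_metric_data(
    Namespace="Shoppers/Application",
    MetricData=[
        {
            "MetricName": "CheckoutLatency",
            "Dimensions": [{"Name": "Component", "Value": "payment-integration"}],
            "Unit": "Milliseconds",
            "Value": 27.0,
        }
    ],
)
```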

Centralized logging

When analyzing downtime events, we found that the time spent correlating different events was leading to a high mean time to resolve (MTTR). Additionally, we had no performance baseline. Our analysis also uncovered a gap in security monitoring.

To overcome these challenges, we deployed a centralized logging solution built on Amazon Elasticsearch Service (Amazon ES).

We used this solution as a baseline and updated the configuration as shown in Amazon ES Service Best Practices to optimize for performance and scale. We took manual hourly snapshots of the ES cluster to Amazon S3 and used Amazon S3 Cross-Region Replication so that we can restore the cluster in case of a DR event.
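
A hedged sketch of the replication side: the snapshot bucket in the primary Region replicates to a bucket in the recovery Region (bucket names, role ARN, and Regions are placeholders; versioning must already be enabled on both buckets):

```python
import boto3

s3 = boto3.client("s3")

# Replicate ES snapshot objects to the DR-Region bucket so the cluster can
# be restored there (placeholder names and ARNs; both buckets need versioning).
s3.put_bucket_replication(
    Bucket="shoppers-es-snapshots-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [
            {
                "ID": "replicate-es-snapshots",
                "Status": "Enabled",
                "Prefix": "",
                "Destination": {"Bucket": "arn:aws:s3:::shoppers-es-snapshots-us-west-2"},
            }
        ],
    },
)
```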

Manage service limits with Service Quotas and adopt a multi-account strategy

We observed that most workloads were running in a single account, which pushed us toward service limits. We implemented two strategies to address this situation: managing limits proactively with Service Quotas, and adopting a multi-account strategy.
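
As an illustration of the first strategy, quotas can be inspected and raised programmatically through Service Quotas; the quota code below is the EC2 limit for running On-Demand Standard instances, and the desired value is illustrative:

```python
import boto3

quotas = boto3.client("service-quotas")

# EC2 quota code for "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances".
EC2_STANDARD_QUOTA = "L-1216C47A"

current = quotas.get_service_quota(ServiceCode="ec2", QuotaCode=EC2_STANDARD_QUOTA)
print("Current limit:", current["Quota"]["Value"])

# Request an increase before the workload grows into the limit (illustrative value).
quotas.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode=EC2_STANDARD_QUOTA,
    DesiredValue=512.0,
)
```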

Figure 1. Current Architecture with improved resiliency and standardized observability

Conclusion

In this blog, we talked about design patterns you can adopt to improve overall resiliency, meet your RPO and RTO objectives, scale a monolithic application based on forecasted usage patterns, enhance observability, and plan for a multi-account framework. In the next blog, we will talk about more architecture patterns to evolve the current architecture.

Find out more

Other blogs in this series

Related information