Journey to Adopt Cloud-Native Architecture Series: #2 – Maximizing System Throughput

https://aws.amazon.com/blogs/architecture/journey-to-adopt-cloud-native-architecture-series-2-maximizing-system-throughput/

In the last blog, Preparing your Applications for Hypergrowth, we talked about hypergrowth and the technical challenges it presents to companies. As a reminder, we presented an example ecommerce company running a monolithic application on Amazon Elastic Compute Cloud (Amazon EC2).

This application connects with Amazon Relational Database Service (Amazon RDS).

The company recently experienced a hypergrowth event where user traffic grew tenfold within a few days. During this event, we observed degraded performance at peak times. In this blog, we talk about improving performance by maximizing system throughput through incremental improvements to the application and database layers.

Maximize System Throughput

In the following subsections, we explain system improvements we made to maximize the company’s system throughput and get consistent performance from their application stack.

Configure Synchronized Transaction Timeout

Transactions were timing out in the application, database, and network intermittently for different reasons:

When a transaction starts at the load balancer (connection idle timeout)
When it goes to application (SDK timeout)
When it makes a database connection (connection timeout)
When it runs a query (statement timeout)

We increased timeouts progressively from the backend to the front end so that rollbacks happen properly. For example, we increased the database connection timeout from 3 seconds to 10 seconds to accommodate long-running queries.
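
One way to keep the timeouts synchronized is to check that each layer waits longer than the layer it calls. The sketch below is an illustration only: the layer names follow the list above, the 10-second database connection timeout comes from this post, and every other value and the function name are assumptions.

```python
# Timeouts in seconds, ordered from the innermost (backend) layer to the
# outermost (front end) layer. Only the 10-second database connection
# timeout comes from the post; the other values are hypothetical.
LAYER_TIMEOUTS = [
    ("statement", 8.0),
    ("db_connection", 10.0),
    ("sdk", 15.0),
    ("load_balancer_idle", 30.0),
]


def find_timeout_violations(layers):
    """Return (inner, outer) name pairs where the outer layer's timeout is
    not strictly greater than the inner layer's timeout."""
    violations = []
    for (inner_name, inner_t), (outer_name, outer_t) in zip(layers, layers[1:]):
        if outer_t <= inner_t:
            violations.append((inner_name, outer_name))
    return violations


if __name__ == "__main__":
    # An empty list means every outer layer outlives the layer it calls.
    print(find_timeout_violations(LAYER_TIMEOUTS))
```

A check like this could run at deploy time, so a change to one layer's timeout cannot silently fall below the layer beneath it.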

Introduce Connection Pooling

We identified that idle connections were consuming memory and CPU after reading the Resources Consumed by Idle PostgreSQL Connections blog post. The post explains that opening more connections doesn't necessarily increase throughput and can instead leave many connections idle. This has the opposite effect and degrades performance, as explained in the Performance Impact of Idle PostgreSQL Connections blog post.

As a solution, we implemented connection pooling via Amazon RDS Proxy to improve overall performance and reduce unnecessary resource utilization. The proxy helps with the heavy lifting, so we don’t have to build this into our applications. It enables us to reuse database connections, which improved overall scalability and resiliency. For optimized performance, we tested applications against a small number of connections and stepped it up until the overall throughput reached a plateau. We observed adding more connections after the plateau had diminishing returns.
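
The step-up test described above can be sketched as a small search loop. This is a minimal sketch, not the tooling we used: `measure_throughput` is a hypothetical callback that runs a load test at a given connection count, and the 5% minimum gain is an assumed threshold.

```python
def find_plateau(measure_throughput, pool_sizes, min_gain=0.05):
    """Step through increasing pool sizes and return (size, throughput) for
    the last size that improved throughput by at least `min_gain` (relative).

    `measure_throughput` is a hypothetical callable: pool size -> throughput.
    """
    best_size = pool_sizes[0]
    best_tps = measure_throughput(best_size)
    for size in pool_sizes[1:]:
        tps = measure_throughput(size)
        if tps < best_tps * (1 + min_gain):
            break  # diminishing returns: the plateau has been reached
        best_size, best_tps = size, tps
    return best_size, best_tps


if __name__ == "__main__":
    # Made-up measurements (transactions per second) for illustration.
    fake_results = {5: 500, 10: 950, 20: 1700, 40: 1750, 80: 1740}
    print(find_plateau(fake_results.__getitem__, [5, 10, 20, 40, 80]))
    # prints (20, 1700)
```

With these made-up numbers, doubling from 20 to 40 connections gains less than 5%, so the search stops at 20.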


Introduce Read Replicas

To address read scaling in the database cluster, we added multiple read replicas to handle read-only queries. These read replicas also handle one-time processes that fetch data. This pattern is also known as Command Query Responsibility Segregation.

We updated the application code to determine read/write data access based on an API call. Based on the data access need, the application code will use different endpoints for running queries. This allows the application to offload read transactions to a different host and maximize system resources to increase write throughput. Introducing read replicas has boosted overall throughput and resource buffer (CPU and RAM) on the write database node.
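
As an illustration of this routing, the sketch below picks an endpoint from the kind of API call. It assumes that read-only calls can be recognized by HTTP method; the host names and the `choose_endpoint` helper are hypothetical, not the actual application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoints:
    writer: str  # primary (read/write) database endpoint
    reader: str  # read replica endpoint


# Assumption: these HTTP methods never modify data in this application.
READ_ONLY_METHODS = {"GET", "HEAD"}


def choose_endpoint(http_method, endpoints):
    """Send read-only API calls to the reader, everything else to the writer."""
    if http_method.upper() in READ_ONLY_METHODS:
        return endpoints.reader
    return endpoints.writer


if __name__ == "__main__":
    # Hypothetical host names.
    db = Endpoints(writer="writer.db.example.internal",
                   reader="reader.db.example.internal")
    print(choose_endpoint("GET", db))   # reader.db.example.internal
    print(choose_endpoint("POST", db))  # writer.db.example.internal
```

Replication to read replicas is typically asynchronous, so a call that must read its own recent write may still need the writer endpoint.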

Introduce Caching

We introduced a caching layer to improve performance and reduce latency when the user is waiting for a response. Though read replicas provided the ability to distribute read/write queries across different database instances, the queries still needed to be run. We needed an in-memory caching solution for submillisecond latency. After we reviewed the Performance at Scale with Amazon ElastiCache whitepaper, we were able to identify the right caching solution for different use cases.

We deployed Amazon ElastiCache for Redis in a lazy-loading pattern as mentioned in the Database Caching Strategies Using Redis whitepaper. Caching not only improved scale and performance but also helped to optimize the database cost.
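
The lazy-loading pattern can be sketched as follows. To keep the example self-contained, an in-process dictionary stands in for Redis; the class name, the `load_from_db` callback, and the 300-second TTL are assumptions for illustration.

```python
import time


class LazyLoadingCache:
    """Lazy loading (cache-aside): read from the cache first, and only on a
    miss load from the database and populate the cache.

    A plain dict stands in for Redis here.
    """

    def __init__(self, load_from_db, ttl_seconds=300, clock=time.monotonic):
        self._load = load_from_db  # hypothetical callable: key -> value
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit
        value = self._load(key)                  # miss or expired: query DB
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example after the underlying row is updated."""
        self._store.pop(key, None)


if __name__ == "__main__":
    db_calls = []

    def load_product(product_id):
        db_calls.append(product_id)
        return {"id": product_id, "name": "example"}

    cache = LazyLoadingCache(load_product)
    cache.get(1)
    cache.get(1)
    print(len(db_calls))  # prints 1: the second read was served from cache
```

Only data that is actually requested ends up in the cache, at the cost of one slower read on each miss.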

Introduce Purge and Data Archiving

The orders table is one of the most read/write heavy tables in the database. However, there was no business need to keep historical information in the transactional database for completed orders. So, we worked with our internal teams to define a data retention policy. We created tables partitioned by date, which allows us to analyze partitions with records that can be archived or purged based on the retention policy.

Applying these techniques reduced the table size by 60%. We also created automated jobs to purge older partitions and implemented backup jobs to archive data. Additionally, we adopted a strategy to migrate older data into Amazon Aurora Serverless (PostgreSQL-compatible) to offload infrequent data queries from the main database and optimize cost. This database supports users who want to view old orders on a one-time basis (Note: Aurora Serverless V2 is also now available in preview).
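
A purge job of this kind needs to decide which date partitions fall outside the retention policy. The sketch below assumes monthly partitions identified by (year, month) and a retention period counted in whole months; both are assumptions, as is the function name.

```python
from datetime import date


def partitions_to_purge(partition_months, today, retention_months):
    """Return the (year, month) partitions older than the retention window.

    The current month plus the previous `retention_months` months are kept.
    """
    # Convert (year, month) to a single month index so that comparison
    # works across year boundaries.
    cutoff = today.year * 12 + (today.month - 1) - retention_months
    return [(y, m) for (y, m) in partition_months
            if y * 12 + (m - 1) < cutoff]


if __name__ == "__main__":
    partitions = [(2021, 12), (2022, 1), (2022, 2), (2022, 3),
                  (2022, 4), (2022, 5), (2022, 6)]
    print(partitions_to_purge(partitions, date(2022, 6, 15), 3))
    # prints [(2021, 12), (2022, 1), (2022, 2)]
```

The returned partitions would first be archived by the backup job and then dropped, which is much cheaper than deleting rows one by one.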

Query Optimization

As we identified the different bottlenecks in database throughput, we used Amazon RDS Performance Insights to find bottlenecks in SQL queries. We used metrics like “read time per call” and “write time per call” to recognize the queries that took the most time to read and write. When we analyzed deeper, we observed that some queries were used by downstream online analytic processing (OLAP) processes. Before further optimizing the queries, we moved these SELECT queries to a separate read replica to free up resources on the primary database.
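
The idea behind per-call metrics can be illustrated with a small ranking function. The input shape (query text, call count, total time in milliseconds) is an assumption for this sketch and is not the format of any specific AWS tool.

```python
def rank_by_time_per_call(stats):
    """Rank queries by average time per call, slowest first.

    `stats` is a list of dicts with hypothetical keys 'query', 'calls',
    and 'total_time_ms'. Queries with zero calls are skipped.
    """
    ranked = []
    for row in stats:
        if row["calls"] == 0:
            continue
        ranked.append((row["query"], row["total_time_ms"] / row["calls"]))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # Made-up numbers for illustration.
    sample = [
        {"query": "SELECT ... FROM orders", "calls": 200, "total_time_ms": 90000},
        {"query": "UPDATE inventory ...", "calls": 5000, "total_time_ms": 25000},
    ]
    for query, avg_ms in rank_by_time_per_call(sample):
        print(f"{avg_ms:8.1f} ms/call  {query}")
```

Ranking by time per call highlights queries that are individually slow, while ranking by total time would highlight queries that are cheap but called very often; both views are worth checking.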

Introduce Database Sharding

Since one of our databases already runs on the largest available instance, we had to find a way to scale the database horizontally. Sharding allowed us to spread writes across multiple writer nodes. As described in the Sharding with Amazon Relational Database Service blog post, there are several approaches to sharding, each with pros and cons.

In our case, the order management component had the most contention. We used “customer_id” as the partition key and split the data into different shards. We developed a proxy layer to decide the mapping between customers and shards.
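
A mapping from customers to shards could look like the sketch below. The post does not describe how the proxy layer computes the mapping, so the stable hash, the override table, and the shard names here are all assumptions chosen for illustration.

```python
import hashlib


class ShardRouter:
    """Map a customer_id to a shard.

    Assumed scheme: a stable hash of customer_id modulo the shard count,
    with an optional override table for customers pinned to a shard.
    """

    def __init__(self, shard_names, overrides=None):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self._shards = list(shard_names)
        self._overrides = dict(overrides or {})

    def shard_for(self, customer_id):
        if customer_id in self._overrides:
            return self._overrides[customer_id]
        # hashlib gives the same result in every process, unlike the
        # built-in hash(), which is randomized per run for strings.
        digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]


if __name__ == "__main__":
    # Hypothetical shard names and customer IDs.
    router = ShardRouter(["orders-shard-0", "orders-shard-1", "orders-shard-2"],
                         overrides={1001: "orders-shard-2"})
    print(router.shard_for(42))
    print(router.shard_for(1001))  # orders-shard-2 (pinned)
```

Plain modulo hashing moves most customers when the number of shards changes, so a lookup table or consistent hashing may be a better fit if shards are added often.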

Figure 1. Current architecture with improved system resiliency

Conclusion

In this blog, we talked about design patterns you can adopt to address scaling challenges that can arise due to hypergrowth. These architecture patterns not only maximize system throughput but also prepare your applications for incremental improvements toward a highly scalable and resilient architecture. In the next blog, we will provide more design patterns to help you in your journey to adopt cloud-native architecture.
