In Part 4 of this series, Governing Security at Scale and IAM Baselining, we discussed building a multi-account strategy and improving access management and least privilege to prevent unwanted access and to enforce security controls.
As a refresher from previous posts in this series, our example e-commerce company’s “Shoppers” application runs in the cloud. The company experienced hypergrowth, which posed a number of platform and technology challenges, including enforcing security and governance controls to mitigate security risks.
With the pace of new infrastructure and software deployments, we had to ensure we maintain strong security. This post, Part 5, shows how we detect security misconfigurations, indicators of compromise, and other anomalous activity. We also show how we developed and iterated on our incident response processes.
Threat detection and data protection
With our newly acquired customer base from hypergrowth, we had to make sure we maintained customer trust. We also needed to detect and respond to security events quickly to reduce the scope and impact of any unauthorized activity. We were concerned about vulnerabilities on our public-facing web servers, accidental sensitive data exposure, and other security misconfigurations.
Prior to hypergrowth, application teams scanned for vulnerabilities and maintained the security of their applications. After hypergrowth, we established dedicated security team and identified tools to simplify the management of our cloud security posture. This allowed us to easily identify and prioritize security risks.
Use AWS security services to detect threats and misconfigurations
We use the following AWS security services to simplify the management of cloud security risks and reduce the burden of third-party integrations. This also minimizes the amount of engineering work required by our security team.
Detect threats with Amazon GuardDuty
We use Amazon GuardDuty to keep up with the newest threat actor tactics, techniques, and procedures (TTPs) and indicators of compromise (IOCs).
GuardDuty saves us time and reduces complexity, because we don’t have to continuously engineer detections for new TTPs and IOCs for static events and machine-learning-based detections. This allows our security analysts to focus on building runbooks and quickly responding to security findings.
Discover sensitive data with Amazon Macie for Amazon S3
To host our external website, we use a few public Amazon Simple Storage Service (Amazon S3) buckets with static assets. We don’t want developers to accidentally put sensitive data in these buckets, and we wanted to understand which S3 buckets contain sensitive information, such as financial or personally identifiable information (PII).
We explored building a custom scanner to search for sensitive data, but maintaining the search patterns was too complex. It was also costly to continuously re-scan files each month. Therefore, we use Amazon Macie to continuously scan our S3 buckets for sensitive data. After Macie makes its initial scan, it will only scan new or updated objects in those S3 buckets, which reduces our costs significantly. We added filter rules to exclude files of larger size and S3 prefixes to scan required objects and provided a sampling rate to further cost optimize scanning large S3 buckets (in our case, S3 buckets greater than 1 TB).
Scan for vulnerabilities with Amazon Inspector
We use Amazon Inspector to run continuous vulnerability scans on our EC2 instances and Amazon Elastic Container Registry (Amazon ECR) container images. With Amazon Inspector, we can continuously detect if our developers are deploying and releasing vulnerable software on our EC2 instances and ECR images without setting up a third-party vulnerability scanner and installing additional endpoint agents.
Aggregate security findings with AWS Security Hub
We don’t want our security analysts to arbitrarily act on one security finding over another. This is time-consuming and does not properly prioritize the highest risks to address. We also need to track ownership, view progress of various findings, and build consistent responses for common security findings.
With AWS Security Hub, our analysts can seamlessly prioritize findings from GuardDuty, Macie, Amazon Inspector, and many other AWS services. Our analysts also use Security Hub’s built-in security checks and insights to identify AWS resources and accounts that have a high number of findings and act on them.
Setting up the threat detection services
This is how we set up these services:
- Assigned a security tooling account (covered in a previous blog post in this series) as the delegated administrator for these services. The delegated administrator configures the services and aggregates findings from other member accounts.
- Used the AWS Security Reference Architecture and the associated scripts to assist with set up, which helped ensure we set up and configured the security services according to best practices.
- Used Security Hub’s new multi-Region aggregation to aggregate all findings into our primary Region.
- Integrated Jira with the security tooling account with the steps outlined in How to set up a two-way integration between AWS Security Hub and Jira Service Management to track ownership and remediation status.
Our security analysts use Security Hub-generated Jira tickets to view, prioritize, and respond to all security findings and misconfigurations across our AWS environment.
Through this configuration, our analysts no longer need to pivot between various AWS accounts, security tool consoles, and Regions, which makes the day-to-day management and operations much easier. Figure 1 depicts the data flow to Security Hub.
Figure 1. Aggregation of security services in security tooling account
Figure 2. Delegated administrator setup
Before hypergrowth, there was no formal way to respond to security incidents. To prevent future security issues, we built incident response plans and processes to quickly address potential security incidents and minimize the impact and exposure. Following the AWS Security Incident Response Guide and NIST framework, we adopted the following best practices.
Playbooks and runbooks for repeatability
We developed incident response playbooks and runbooks for repeatable responses for security events that include:
- Playbooks for more strategic scenarios and responses based on some of the sample playbooks found here.
- Runbooks that provide step-by-step guidance for our security analysts to follow in case an event occurs. We used Amazon SageMaker notebooks and AWS Systems Manager Incident Manager runbooks to develop repeatable responses for pre-identified incidents, such as suspected command and control activity on an EC2 instance.
Automation for quicker response time
After developing our repeatable processes, we identified areas where we could accelerate responses to security threats by automating the response. We used the AWS Security Hub Automated Response and Remediation solution as a starting point.
By using this solution, we didn’t need to build our own automated response and remediation workflow. The code is also easy to read, repeat, and centrally deploy through AWS CloudFormation StackSets. We used some of the built-in remediations like disabling active keys that have not been rotated for more than 90 days, making all Amazon Elastic Block Store (Amazon EBS) snapshots private, and many more. With automatic remediation, our analysts can respond quicker and in a more holistic and repeatable way.
Simulations to improve incident response capabilities
We implemented quarterly incident response simulations. These simulations test how well prepared our people, processes, and technologies are for an incident. We included some cloud-specific simulations like an S3 bucket exposure and an externally shared Amazon Relational Database Service (Amazon RDS) snapshot to ensure our security staff are prepared for an incident in the cloud. We use the results of the simulations to iterate on our incident response processes.
In this blog post, we discussed how to prepare for, detect, and respond to security events in an AWS environment. We identified security services to detect security events, vulnerabilities, and misconfigurations. We then discussed how to develop incident response processes through building playbooks and runbooks, performing simulations, and automation. With these new capabilities, we can detect and respond to a security incident throughout hypergrowth.
Other blog posts in this series
- Journey to Adopt Cloud-Native Architecture Series: #1 – Preparing your Applications for Hypergrowth
- Journey to Adopt Cloud-Native Architecture Series: #2 – Maximizing System Throughput
- Journey to Adopt Cloud-Native Architecture Series: #3 – Improved Resilience and Standardized Observability
- Journey to Adopt Cloud-Native Architecture Series: #4 – Governing Security at Scale and IAM Baselining