Chaos engineering is the use of experimental and potentially destructive failure testing or fault injection testing to uncover vulnerabilities and weaknesses within and among the varied elements of a complex system.
Chaos engineering tools enable software engineering teams to systematically plan, document, execute and analyze attacks on components and systems, both before and after implementation.
Site reliability engineers, software engineers and security experts can use these tools to proactively increase the resilience of infrastructure, applications, and processes.
“Organizations can use chaos engineering to augment their existing testing processes,” Gartner VP and analyst Manju Bhat explains. “Traditional testing methods help verify and validate system performance and behavior against known conditions.”
Chaos engineering, on the other hand, adopts an experimental approach to surface any unknown and latent weaknesses in the system when it is subject to unpredictable and random interaction patterns among system components.
“This helps gain confidence in our systems to withstand and recover from failures in production,” Bhat says.
He explains that chaos engineering improves enterprise resilience on two fronts: first, by ensuring that the system can gracefully degrade and recover from failure conditions and, second, by ensuring that the recovery processes for fault tolerance kick in and take effect when required.
Building Confidence Amid Chaos
Dan Benjamin, CEO and co-founder at Dig Security, says using chaos engineering in large-scale, distributed cloud environments can help organizations build confidence that their production environment can operate without unplanned downtime or exploitation by bad actors.
“With the explosion of data in public cloud environments, using chaos engineering practices could help enterprises protect sensitive data,” he explains.
Many organizations struggle to get visibility into where their most sensitive data is stored. Improper handling of that data can have disastrous consequences, such as compliance violations or trade secrets falling into the wrong hands.
“Using chaos engineering could help identify vulnerabilities that, unless remediated, could be exploited by bad actors within minutes,” Benjamin says.
Kelly Shortridge, senior principal of product technology at Fastly, says organizations can use chaos engineering to generate evidence of their systems’ resilience against adverse scenarios, like attacks.
“By conducting experiments, you can proactively understand how failure unfolds, rather than waiting for a real incident to occur,” she says.
The very nature of experiments requires curiosity — the willingness to learn from evidence — and flexibility so changes can be implemented based on that evidence.
“Adopting security chaos engineering helps us move from a reactive posture, where security tries to prevent all attacks from ever happening, to a proactive one in which we try to minimize incident impact and continuously adapt to attacks,” she notes.
Chaos Engineering: A Four-Step Process
Gartner’s Bhat says chaos engineering can help with security practices involving threat modeling, threat detection, incident response and remediation, as well as application programming interface (API) security and cloud security.
“Much like continuous integration and continuous delivery, chaos engineering must be treated as a continuous process that integrates with the software development life cycle,” he says.
A chaos engineering strategy involves four steps. The first is designing the experiments, which Bhat says is the most important step in building the practice because it involves brainstorming the potential failures that can affect systems.
“This phase must include everyone: product owners, developers, testers, platforms, security and operations,” Bhat says.
Experiments must mimic likely failures, including hardware failures, application failures, service provider outages, dependency failures, and failures due to executed changes. Failure injection should start with the infrastructure and move up through the application layers.
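As a rough illustration of what mimicking a dependency failure can look like in code, the sketch below wraps a call to a downstream service so that it fails at a configurable rate, then checks that the calling code degrades to a fallback. It is a minimal sketch, not a description of any particular chaos engineering tool. All names here (`InjectedFault`, `with_fault_injection`, `fetch_price_with_fallback`) are hypothetical and chosen only for this example.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Hypothetical error raised in place of a real dependency outage."""


def with_fault_injection(
    call: Callable[[], T], failure_rate: float, rng: random.Random
) -> Callable[[], T]:
    """Wrap a dependency call so it fails with probability `failure_rate`."""

    def wrapped() -> T:
        # rng.random() returns a float in [0.0, 1.0), so a rate of 1.0
        # always fails and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency outage")
        return call()

    return wrapped


def fetch_price_with_fallback(fetch: Callable[[], float], cached_price: float) -> float:
    """Hypothetical service code: fall back to a cached value on failure."""
    try:
        return fetch()
    except Exception:
        # Graceful degradation: serve stale data instead of failing the request.
        return cached_price


rng = random.Random(0)  # Seeded so an experiment run can be repeated.
live_call = lambda: 101.5
flaky_call = with_fault_injection(live_call, failure_rate=1.0, rng=rng)
price = fetch_price_with_fallback(flaky_call, cached_price=100.0)
```

With the failure rate set to 1.0, every call to the dependency fails and the caller should return the cached value. Lowering the rate lets a team observe behavior under intermittent failures.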
The second step is performing the experiments. Bhat recommends that all experiments start in non-production environments that match production as closely as possible.
He explains that chaos experiments should be controlled, have testing participants from affected services available, and have a kill switch to stop the experiment if there is any unexpected impact to live production environments.
“Invoke incident management processes as a part of the experiment, use clear communication to coordinate the testing, and use automation to repeat experiments,” Bhat says.
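The controlled experiment and kill switch described above can be sketched as a small loop that injects a fault, checks system health at each step, and always rolls the fault back. This is an illustrative sketch under stated assumptions: the function and field names (`run_experiment`, `ExperimentResult`, `is_healthy`) are hypothetical, and a real health check would query monitoring data rather than a local callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExperimentResult:
    completed_steps: int
    aborted: bool
    log: List[str] = field(default_factory=list)


def run_experiment(
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    steps: int,
) -> ExperimentResult:
    """Run a fault for up to `steps` health checks, aborting on unexpected impact."""
    result = ExperimentResult(completed_steps=0, aborted=False)
    inject_fault()
    result.log.append("fault injected")
    try:
        for _ in range(steps):
            if not is_healthy():
                # Kill switch: stop as soon as the impact exceeds what was planned.
                result.aborted = True
                result.log.append("kill switch triggered")
                break
            result.completed_steps += 1
    finally:
        # The fault is removed whether the run finished, aborted, or raised.
        rollback()
        result.log.append("fault rolled back")
    return result


# Usage with stand-in callables: the third health check reports a problem.
state = {"fault_active": False}
checks = iter([True, True, False, True])
result = run_experiment(
    inject_fault=lambda: state.update(fault_active=True),
    rollback=lambda: state.update(fault_active=False),
    is_healthy=lambda: next(checks),
    steps=4,
)
```

Putting the rollback in a `finally` block is a deliberate choice here: the injected fault is cleaned up even if the health check itself throws an error.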
The third step is to measure the system response. As teams conduct experiments, they must measure the response and behavior of the system, as well as the impact on users.
Bhat explains key indicators include the mean time to detect (MTTD), mean time to repair (MTTR), and service-level objectives.
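To make these indicators concrete, the sketch below computes MTTD and MTTR from timestamps recorded during experiments. It assumes MTTD is measured from fault injection to detection and MTTR from detection to recovery. Definitions vary between organizations, so treat those choices, along with the hypothetical `FaultRecord` type and the sample timestamps, as assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class FaultRecord:
    injected_at: datetime
    detected_at: datetime
    recovered_at: datetime


def _mean(durations: Sequence[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("at least one record is required")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(records: Sequence[FaultRecord]) -> timedelta:
    """Average time from fault injection to detection (assumed definition)."""
    return _mean([r.detected_at - r.injected_at for r in records])


def mean_time_to_repair(records: Sequence[FaultRecord]) -> timedelta:
    """Average time from detection to recovery (assumed definition)."""
    return _mean([r.recovered_at - r.detected_at for r in records])


# Made-up sample data for two experiment runs.
records = [
    FaultRecord(
        injected_at=datetime(2024, 1, 1, 10, 0),
        detected_at=datetime(2024, 1, 1, 10, 2),
        recovered_at=datetime(2024, 1, 1, 10, 12),
    ),
    FaultRecord(
        injected_at=datetime(2024, 1, 2, 10, 0),
        detected_at=datetime(2024, 1, 2, 10, 4),
        recovered_at=datetime(2024, 1, 2, 10, 24),
    ),
]
mttd = mean_time_to_detect(records)
mttr = mean_time_to_repair(records)
```

For the sample records, detection took 2 and 4 minutes and repair took 10 and 20 minutes, so the averages are 3 minutes and 15 minutes respectively. Either value can then be compared against a service-level objective.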
The final step is focused on learning and improvement and should be used to analyze metrics and information discovered during the experiments.
“Ensure that the teams document test parameters, outcomes and baseline metrics, and review the experiments and results with testing participants,” he says. “As learning and improvement practices mature, adopting a chaos engineering tool can help automate the repetitive parts of the test definition and the measurement of the effects.”
Chaos Engineering to Become More Vital
Shortridge notes that most business leaders are all too aware that existing, reactive cybersecurity strategies aren’t effective enough in outmaneuvering attackers.
“We struggle to prove better security outcomes with our traditional strategies,” she says. “We rely on industry folk’s wisdom to guide our decision-making. It doesn’t have to be this way.”
From her perspective, adopting a resilience approach – such as through chaos engineering – helps enterprises drive meaningful security progress that supports innovation and growth rather than stifling it.
“We’ll see real security outcomes rather than outputs,” she says. “We’ll live in confidence rather than fear, because we know through repeated experimentation that our systems are prepared for a variety of adverse scenarios. Our security strategy can be grounded in learning and adaptation, aligning security with business goals on a continual basis.”
As “software eats the world,” Shortridge says it will become even more vital that security teams collaborate with software engineering teams, empowering rather than controlling them.
“Security chaos engineering facilitates this collaboration, or even allows for greater decentralization of security activities, freeing up precious time and effort for under-resourced security teams,” she adds.
Bhat adds that chaos engineering strategies will become more important because of the growing adoption of cloud services, distributed system architectures and the increased cadence of software delivery.
“Most importantly, companies will treat reliability and resilience as a key competitive differentiator of their products and services,” he says.