The birth of chaos engineering happened somewhat accidentally in 2008 when Netflix moved from the data center to the cloud. The move didn’t go as planned.
The thinking at the time was that the data center locked Netflix into an architecture of single points of failure, like large databases and vertically scaled components. Moving to the cloud would necessitate horizontally scaled components, which would decrease the single points of failure. But the move to horizontally scaled cloud deployment practices did not produce the boost to uptime of the streaming service that they expected.
The specific methods for making a system robust enough to handle disappearing instances were not important. They might even differ depending on the context of the system. The important thing was that it had to be done, because the streaming service was facing availability deficits due to the high frequency of instance instability events. In a way, Netflix had simply multiplied the single-point-of-failure effect.
Enter Chaos Monkey. Chaos Monkey gave the company a way to proactively test everyone’s resilience to a failure, and do it during business hours so that people could respond to any potential fallout when they had the resources to do so, rather than at 3 a.m. when pagers typically go off.
Then, Dec. 24, 2012, happened. Amazon Web Services (AWS) suffered a rolling outage of elastic load balancers (ELBs). These components connect requests and route traffic to the compute instances where services are deployed. As the ELBs went down, additional requests couldn’t be served. Since Netflix’s control plane ran on AWS, customers were not able to choose videos and start streaming them.
To ensure that all of these teams had their services up to the task, an activity was created to take a region offline. Well, AWS wouldn’t allow Netflix to take a region offline (something about having other customers in the region), so instead Netflix simulated the activity, which was labeled “Chaos Kong.”
Fast-forward to about 2015. Netflix had Chaos Monkey, which worked on small-scale vanishing instances, and Chaos Kong, which worked on large-scale vanishing regions.
In these early days of chaos engineering at Netflix, it was not obvious what the discipline actually was. There were some catchphrases tossed around about “pulling out wires,” or “breaking things,” or “testing in production,” paired with many misconceptions about how to make services reliable, and very few examples of actual tools to help support that work.
To formalize the discipline, I was given the task of developing a charter and roadmap for a chaos engineering team at Netflix in early 2015. I built and managed that team for three years. My co-author of “Chaos Engineering: System Resiliency in Practice,” Nora Jones, joined the chaos engineering team early on as an engineer and technical leader. She was responsible for significant architectural decisions about the tools we built as well as overseeing implementation.
To start, I sat down with my teams to formally define chaos engineering. We specifically wanted clarity on:
• What is the definition of Chaos Engineering?
• What is the point of it?
• How do I know when I’m doing it?
• How can I improve my practice of it?
We researched resilience engineering and other disciplines in order to come up with a definition and a blueprint for how others could also participate in chaos engineering. After about a month of working on a manifesto of sorts, we produced the Principles of Chaos Engineering. The discipline was officially formalized:
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
This definition established that it is a form of experimentation, which sits apart from testing. It also mentions “turbulent conditions in production” to highlight that this isn’t about creating chaos. Chaos engineering is about making the chaos inherent in the system visible. The point of doing chaos engineering is to build confidence.
The principles go on to describe a basic template for experimentation, which borrows heavily from Karl Popper’s principle of falsifiability. In this regard, chaos engineering is modeled very much as a science rather than a technique.
What Chaos Engineering Is
The principles define the discipline so that we know when we are doing chaos engineering, how to do it and how to do it well. The more common definition for chaos engineering today is, “The facilitation of experiments to uncover systemic weaknesses.”
The steps of Chaos Engineering experimentation are as follows:
- Start by defining “steady state” as some measurable output of a system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
By design, there is great latitude in how to implement these experiments.
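As one illustration of that latitude, the four steps can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not how Netflix's tooling works: the `Instance` class, the round-robin `serve` function, the retry count, and the 1% tolerance are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    """Hypothetical stand-in for a compute instance."""
    name: str
    healthy: bool = True


def serve(instances: List[Instance], request_id: int, retries: int) -> bool:
    """Route a request round-robin; on failure, retry on the next instance."""
    for attempt in range(retries + 1):
        target = instances[(request_id + attempt) % len(instances)]
        if target.healthy:
            return True
    return False


def steady_state(instances: List[Instance], retries: int, n_requests: int = 1000) -> float:
    """Step 1: steady state as a measurable output (here, request success rate)."""
    succeeded = sum(serve(instances, i, retries) for i in range(n_requests))
    return succeeded / n_requests


def crash_one_instance(instances: List[Instance]) -> None:
    """Step 3: a variable that reflects a real-world event (a server crash)."""
    instances[0].healthy = False


def run_experiment(retries: int, tolerance: float = 0.01) -> Dict[str, object]:
    control = [Instance(f"control-{i}") for i in range(3)]
    experimental = [Instance(f"experimental-{i}") for i in range(3)]

    # Step 2: hypothesize that steady state continues in both groups.
    crash_one_instance(experimental)

    control_rate = steady_state(control, retries)
    experimental_rate = steady_state(experimental, retries)

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    disproved = abs(control_rate - experimental_rate) > tolerance
    return {
        "control": control_rate,
        "experimental": experimental_rate,
        "hypothesis_disproved": disproved,
    }


if __name__ == "__main__":
    print(run_experiment(retries=1))
    print(run_experiment(retries=0))
```

In this toy model, the hypothesis survives when clients retry once and is disproven when they do not retry at all. A disproven hypothesis is the interesting outcome, because it points at a weakness worth investigating.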
Experimentation Versus Testing
Testing, strictly speaking, does not create new knowledge. Testing requires that the engineer writing the test knows specific properties about the system that they are looking for in advance. Complex systems are opaque to that type of analysis—humans are simply not capable of understanding all of the potential side effects from all of the potential interactions of parts in a complex system. This leads us to one of the key properties of a test.
Tests make an assertion, based on existing knowledge, and then running the test collapses the valence of that assertion, usually into either true or false. Tests are statements about known properties of the system.
Experimentation, on the other hand, creates new knowledge. Experiments propose a hypothesis, and as long as the hypothesis is not disproven, confidence grows in that hypothesis. If it is disproven, then we learn something new. This kicks off an inquiry to figure out why our hypothesis is wrong. In a complex system, the reason why something happens is often not obvious. Experimentation either builds confidence, or it teaches us new properties about our own system. It is an exploration of the unknown.
No amount of testing in practice can equal the insight gained from experimentation, because testing requires a human to come up with the assertions ahead of time. Experimentation formally introduces a way to discover new properties. It is entirely possible to translate newly discovered properties of a system into tests after they are discovered. It also helps to encode new assumptions about a system into new hypotheses, which creates something like a “regression experiment” that explores system changes over time. Because chaos engineering was born from complex system problems, it is essential that the discipline favors experimentation over testing.
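The contrast can be made concrete with a small Python sketch. Everything here is hypothetical: `checkout_succeeds` stands in for a system whose behavior under slow dependencies is not known in advance, and the 300 ms timeout is an invented detail that the experimenter is assumed not to know beforehand.

```python
from typing import Iterable, List


def checkout_succeeds(dependency_latency_ms: int, timeout_ms: int = 300) -> bool:
    """Hypothetical system under study: fails once a dependency exceeds a timeout."""
    return dependency_latency_ms <= timeout_ms


def test_checkout_with_fast_dependency() -> None:
    """A test: an assertion from existing knowledge that resolves to true or false."""
    assert checkout_succeeds(dependency_latency_ms=50)


def experiment_latency_tolerance(latencies_ms: Iterable[int]) -> List[int]:
    """An experiment: hypothesize that checkout survives every latency tried,
    then return the conditions that disprove it. An empty list means the
    hypothesis was not disproven and confidence grows."""
    return [ms for ms in latencies_ms if not checkout_succeeds(ms)]


# Explore latencies from 0 ms to 1000 ms in 100 ms steps.
findings = experiment_latency_tolerance(range(0, 1001, 100))


def test_checkout_fails_when_dependency_is_slow() -> None:
    """The newly discovered property, translated into a test after the fact."""
    assert not checkout_succeeds(dependency_latency_ms=400)
```

The first test could be written before anyone knew how the system behaves under latency. The second could only be written after the experiment surfaced the failure, which is the sense in which experimentation feeds testing rather than replacing it.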
Verification Versus Validation
Using definitions of verification and validation inspired by operations management and logistical planning, we can say that chaos engineering is strongly biased toward the former.
Verification
Verification of a complex system is a process of analyzing output at a system boundary. A homeowner can verify the quality of the water (output) coming from a sink’s faucet (system boundary) by testing it for contaminants without knowing anything about how plumbing or municipal water service (system parts) functions.
Validation
Validation of a complex system is a process of analyzing the parts of the system and building mental models that reflect the interaction of those parts. A homeowner can validate the quality of water by inspecting all of the pipes and infrastructure (system parts) involved in capturing, cleaning, and delivering water (mental model of functional parts) to a residential area and, eventually, to the house in question.
Both of these practices are potentially useful, and both build confidence in the output of the system. As software engineers, we often feel a compulsion to dive into code and validate that it reflects our mental model of how it should be working. Contrary to this predilection, chaos engineering strongly prefers verification over validation. Chaos engineering cares whether something works, not how.
Note that in the plumbing metaphor we could validate all of the components that go into supplying clean drinking water, and yet still end up with contaminated water for some reason we did not expect. In a complex system, there are always unpredictable interactions. But if we verify that the water is clean at the tap, then we do not necessarily have to care about how it got that way. In most business cases, the output of the system is much more important than whether or not the implementation matches our mental model. Chaos engineering cares more about the business case and output than about the implementation or mental model of interacting parts.
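In code, verification amounts to checking outputs at the boundary while treating the system as opaque. The sketch below is only an illustration: `verify_at_boundary`, the two `start_stream` implementations, and the status codes are hypothetical names invented for the example.

```python
from typing import Callable, Iterable

CATALOG = {"title-a", "title-b"}


def start_stream_v1(title: str) -> int:
    """One hypothetical implementation: direct set lookup."""
    return 200 if title in CATALOG else 404


def start_stream_v2(title: str) -> int:
    """A different hypothetical implementation: linear scan over the catalog."""
    for known in sorted(CATALOG):
        if known == title:
            return 200
    return 404


def verify_at_boundary(
    system: Callable[[str], int],
    inputs: Iterable[str],
    is_acceptable: Callable[[int], bool],
) -> bool:
    """Check only the outputs at the system boundary; internals are never inspected."""
    return all(is_acceptable(system(x)) for x in inputs)


def is_ok(status: int) -> bool:
    """Acceptance criterion for the output: the request was served."""
    return status == 200


titles = ["title-a", "title-b"]
v1_verified = verify_at_boundary(start_stream_v1, titles, is_ok)
v2_verified = verify_at_boundary(start_stream_v2, titles, is_ok)
```

Both implementations pass the same verification even though their internals differ, which mirrors checking the water at the tap rather than inspecting the pipes.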
What Chaos Engineering Is Not
There are two concepts that are often confused with chaos engineering, namely breaking stuff in production and anti-fragility.
Breaking Stuff
Occasionally, in blog posts or conference presentations, we hear chaos engineering described as “breaking stuff in production.” While this might sound cool, it doesn’t appeal to enterprises running at scale and other complex system operators who can most benefit from the practice.
A better characterization would be “fixing stuff in production.” Breaking stuff is easy; the difficulty lies in mitigating blast radius, thinking critically about safety, determining if something is worth fixing, deciding whether you should invest in experimenting on it—the list goes on.
Breaking stuff could be done in countless ways, with little time invested. The larger question here is, how do we reason about things that are already broken when we don’t even know they are broken?
“Fixing stuff in production” does a much better job of capturing the value of chaos engineering, since the point of the whole practice is to proactively improve availability and security of a complex system. Chaos engineering is the only major discipline in software that focuses solely on proactively improving safety in complex systems.
Anti-Fragility
People familiar with the concept of anti-fragility, introduced by Nassim Taleb, often assume that chaos engineering is essentially the software version of the same thing. Taleb argues that words like “hormesis” are insufficient to capture the ability of complex systems to adapt, and so he invented the word “anti-fragile” as a way to refer to systems that get stronger when exposed to random stress.
A critical distinction between chaos engineering and anti-fragility is that chaos engineering educates human operators about the chaos already inherent in the system so that they can be a more resilient team. Anti-fragility, by contrast, adds chaos to a system in hopes that it will grow stronger in response rather than succumbing to it.
As a framework, anti-fragility puts forth guidance at odds with the scholarship of resilience engineering, human factors and safety systems research. For example, anti-fragility proposes that the first step in improving a system’s robustness is to hunt for weaknesses and remove them. This proposal seems intuitive, but resilience engineering tells us that hunting for what goes right in safety is much more informative than investigating what goes wrong.
The next step in anti-fragility is to add redundancy. This also seems intuitive, but adding redundancy can cause failure just as easily as it can mitigate it, and the literature in resilience engineering is rife with examples where redundancy actually contributes to safety failures. Perhaps the most famous example is the 1986 Challenger disaster. The redundancy of O-rings was one of three reasons that NASA approved the continuation of launches, even though damage to the primary O-ring had been well known internally for over fifty of the prior launch missions over the span of five years.
There are numerous other examples of divergence between these two schools of thought. Resilience engineering is an ongoing area of research with decades of support, whereas anti-fragility is a theory that exists largely outside of academia and peer review. It is easy to imagine how the two concepts become conflated, since both deal with chaos and complex systems, but the spirit of anti-fragility does not share the empiricism and fundamental grounding of chaos engineering. For these reasons, we should consider them to be different pursuits.
A combination of real-world experience applying the original four principled steps to experimenting on systems at scale, paired with additional thoughtful introspection, led the chaos engineering team at Netflix to push the practice further than experimentation. These insights became the “Advanced Principles,” which guide teams through the maturity of their chaos engineering programs and help set a gold standard toward which we can aspire—we’ll cover these in a follow-up post.