Admitting things will go wrong isn’t easy for anyone or any team. But modern engineering practices have moved beyond this fear, giving birth to a new practice in DevOps and site reliability engineering (SRE) known as resilience engineering.
Resilience engineering today isn’t thought of as a function. However, just as DevOps was a description of culture before it was a role and site reliability was an extension of operations before it was a focus, I wouldn’t be surprised if resilience engineering became a function in the near future. The first question most will ask, however, is, “Isn’t this just SRE?” The purpose of the term is to change the focus from simply reacting to incidents to developing long-term response strategies for them.
Because the expectation in these environments is that things will break, resilience is the responsibility of existing DevOps and cloud operations teams. When applications and services do break, a “fly by the seat of your pants” response strategy will not work.
A Focus on Frameworks
Resilience engineering, while rooted in engineering practices, is largely focused on building strategies and a framework for their execution. The process of building resilience into a system remains largely unestablished, in part because each system is unique. And how you respond to issues in that system will likely be unique, even if the management plane that reports issues is not.
The job of resilience engineering is to:
Establish procedures, habits and decision trees
When things break, fight is the only option. Operators and on-call engineers need to address issues in a systematic and repeatable way and do their best to remove emotion and fear from the equation. This not only helps triage and resolve issues, but it also makes sure the activity associated with the issue leads to meaningful insights in post mortems and future collaboration. Part of that is establishing habits and decision-making processes for those who are on-call. These processes help responders prioritize what to focus on and catch details that would otherwise be missed, because details are critical.
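To make the idea of a decision tree concrete, here is a minimal Python sketch. The questions, the incident fields (`customer_facing`, `recent_deploy`) and the recommended actions are all hypothetical; a real tree would come from your own on-call documentation. The point it illustrates is that the tree returns both a next step and the path taken to reach it, so the response stays repeatable and leaves a trail for the post mortem.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Action:
    """A leaf of the tree: the step the responder should take."""
    step: str


@dataclass
class Question:
    """A branch of the tree: a yes/no check against the incident."""
    text: str
    check: Callable[[dict], bool]
    if_yes: "Node"
    if_no: "Node"


Node = Union[Action, Question]


def triage(node: Node, incident: dict):
    """Walk the tree and return (recommended step, list of (question, answer))."""
    path = []
    while isinstance(node, Question):
        answer = node.check(incident)
        path.append((node.text, answer))
        node = node.if_yes if answer else node.if_no
    return node.step, path


# Hypothetical tree; the questions and actions are illustrative only.
TREE = Question(
    "Is the incident customer-facing?",
    lambda i: i.get("customer_facing", False),
    if_yes=Question(
        "Did a deploy land in the last hour?",
        lambda i: i.get("recent_deploy", False),
        if_yes=Action("Roll back the latest deploy"),
        if_no=Action("Escalate to the service owner"),
    ),
    if_no=Action("Open a ticket for business hours"),
)

if __name__ == "__main__":
    step, path = triage(TREE, {"customer_facing": True, "recent_deploy": True})
    for question, answer in path:
        print(f"{question} -> {'yes' if answer else 'no'}")
    print("Recommended step:", step)
```

Keeping the tree as data rather than as nested `if` statements means it can live next to the on-call documentation and be reviewed after each incident.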
Be data-driven
Resilience engineering must rely on data. This is another place where traditional SRE practices grow with a focus on resilience. In typical SRE environments, the focus is on the now, using real-time dashboards of the current state. When you think about resilience, you are thinking about the past’s impact on the future, not the now. The only way to do that is to make sure the data supports it; thus, part of resilience engineering is making sure the data is there. Instead of the data silos that most organizations have across their delivery chain, resilience engineering should ensure that telemetry across the entire delivery chain is captured, correlated and shared. This is critical because what happens early in the delivery chain directly impacts incidents. That activity can be the source of answers, the trigger for a rollback, or the clarity needed to prevent similar issues in the future. Without continuity between each stage of the delivery chain, it’s easy to miss correlated events that can lead to more systemic problems.
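As a rough illustration of what "captured, correlated and shared" can mean in practice, the Python sketch below joins incidents to earlier delivery-chain events for the same service. The record shapes (`stage`, `service`, `at`, `started_at`), the two-hour window and the sample data are all assumptions made up for the example, not output from any particular tool.

```python
from datetime import datetime, timedelta


def correlate(incidents, pipeline_events, window=timedelta(hours=2)):
    """Map each incident id to the pipeline events for the same service
    that happened within `window` before the incident started."""
    correlated = {}
    for incident in incidents:
        correlated[incident["id"]] = [
            event
            for event in pipeline_events
            if event["service"] == incident["service"]
            and timedelta(0) <= incident["started_at"] - event["at"] <= window
        ]
    return correlated


# Made-up sample data for illustration only.
pipeline_events = [
    {"stage": "build", "service": "checkout", "at": datetime(2024, 1, 10, 9, 0)},
    {"stage": "deploy", "service": "checkout", "at": datetime(2024, 1, 10, 9, 30)},
    {"stage": "deploy", "service": "search", "at": datetime(2024, 1, 10, 9, 45)},
    {"stage": "deploy", "service": "checkout", "at": datetime(2024, 1, 9, 9, 30)},
]
incidents = [
    {"id": "INC-1", "service": "checkout", "started_at": datetime(2024, 1, 10, 10, 0)},
]

if __name__ == "__main__":
    for incident_id, events in correlate(incidents, pipeline_events).items():
        stages = ", ".join(event["stage"] for event in events)
        print(f"{incident_id}: preceded by {stages}")
```

A time-window join like this is deliberately crude. It only works if every stage of the chain records events with a shared service name and comparable timestamps, which is exactly the continuity the silos tend to break.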
Engineer a way out of reproducible incidents
For most, the best part of resilience engineering is taking what is learned from previous incidents and finding ways to automate future resolution. Learning from data and having consistency in habit leads to the ability to create runbooks and automate remediation for known issues. Often incident response audit trails can read like playbooks for addressing issues of a particular type. When the resolution is not directly related to code and the issue is bound to resurface, building the intelligence to address it automatically saves waking someone up at midnight and shortens the impact on customers.
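One way to picture the step from runbook to automation is a small registry that maps known incident types to a remediation function and falls back to paging a human for everything else. This is only a sketch: the incident types, the `clear_temp_files` remediation and the `page_on_call` function are hypothetical placeholders, and a real version would call your own infrastructure and incident response tooling.

```python
from typing import Callable, Dict

# Registry of automated runbooks, keyed by incident type.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(incident_type: str):
    """Decorator that registers a remediation for a known incident type."""
    def register(fn: Callable[[dict], str]):
        RUNBOOKS[incident_type] = fn
        return fn
    return register


@runbook("disk_full")
def clear_temp_files(incident: dict) -> str:
    # Placeholder: a real remediation would invoke infrastructure tooling here.
    return f"cleared temp files on {incident['host']}"


def page_on_call(incident: dict) -> str:
    # Placeholder for a call into an incident response tool.
    return f"paged on-call for {incident['type']}"


def handle(incident: dict) -> str:
    """Run the automated runbook if one exists; otherwise page a human."""
    remediation = RUNBOOKS.get(incident["type"])
    if remediation is None:
        return page_on_call(incident)
    try:
        return remediation(incident)
    except Exception:
        # If the automation itself fails, fall back to a person.
        return page_on_call(incident)


if __name__ == "__main__":
    print(handle({"type": "disk_full", "host": "web-1"}))
    print(handle({"type": "unknown_latency"}))
```

The fallback matters as much as the automation: anything without a registered runbook, or any runbook that fails, still reaches a person, and each of those cases is a candidate for the next runbook.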
The Resilience Stack
Organizations looking to embrace resilience engineering need to have a toolkit built for it. It’s pretty straightforward but worth highlighting. The resilience stack will include:
- Observability and/or monitoring tool.
- Incident response tool.
- On-call strategy documentation.
- Post mortem process and documentation.
- Documented processes including recommended response steps.
- Path to automation.
For those with a relatively mature and automated environment, the next step is chaos engineering—embracing chaos as a way to get ahead of incidents before they happen in the wild.
Get your markdown skills to maximum level. There is a lot of documenting that needs to happen with comprehensive resilience engineering. But these documents should not be shelf-ware—they should be living and ultimately lead to the implementation of automation or feedback to development.
As for monitoring, observability and incident response tools, it’s not enough to simply implement them. Knowing how data will be collected, consumed and acted on is also necessary. So is establishing an on-call strategy with purpose, not just because having everyone on-call is the “cool thing to do.”
It may seem obvious to say that organizations want system resiliency. What is not obvious is how to execute it. When organizations see the gaps (and are often embarrassed by them), they understand that resilience is a focus for either current functions or new ones in the future. That is why it’s worthwhile to talk about resilience engineering and what makes it effective.