Site reliability engineers, or SREs, do many things. They help developers build reliability into applications. They manage SLAs and SLOs. They play a leading role in incident management and incident response.
For all of these tasks, SREs draw heavily on observability and monitoring. Although other parts of the IT organization also typically help to manage observability and monitoring, it’s virtually impossible for SREs to do their jobs effectively in today’s cloud-native world without observability and monitoring tools and data.
Here’s what observability and monitoring mean, why they’re so important for SREs in particular and how SREs can systematically measure the success of their monitoring and observability practices.
Defining Observability vs. Monitoring
Monitoring is the collection of data from applications, infrastructure or other IT resources to track the status and performance of those resources. Monitoring has been a core IT practice since the late 1990s, when the first modern monitoring tools, such as Nagios, appeared.
Observability goes further than monitoring. It is the collection of discrete types of data (such as logs, metrics and traces) from multiple systems and the correlation of those data sets to achieve actionable visibility into IT resources. One thing to note: Observability has a more technical definition in the field of control theory, where it refers to using a system’s external outputs to infer its internal state. When people talk about observability in the context of IT and SRE, however, they usually mean the collection and correlation of broad sets of data in order to understand complex systems.
The debate about the differences between observability and monitoring is ongoing, but the consensus is that the differences boil down to:
- What versus why: Monitoring helps SREs determine that a problem exists. Observability goes further by allowing them to understand why it exists.
- Context: Observability provides context around what is wrong with a system. Monitoring, for the most part, offers surface-level information about the existence of a problem.
- Actionability: Observability empowers SREs to respond effectively to outages, performance degradations or other problems. Monitoring merely lets them know that something is wrong and that it may require intervention.
In most respects, monitoring is one step in the greater observability process. But observability involves additional practices, such as the correlation of various types of data, to achieve a level of context and actionability that monitoring alone can’t deliver.
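The distinction can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `LogEvent` type, the service names, the sample values, the 500 ms threshold and the 30-second window are assumptions chosen for illustration, not the behavior of any particular tool. The first function does what monitoring does (flags that latency crossed a threshold), while the second adds one correlation step by pulling in error logs from the moments before each alert.

```python
from dataclasses import dataclass


@dataclass
class LogEvent:
    timestamp: float  # seconds on an arbitrary clock
    service: str
    level: str
    message: str


def monitor_latency(samples, threshold_ms):
    """Monitoring: report the timestamps at which latency exceeded the threshold."""
    return [ts for ts, latency_ms in samples if latency_ms > threshold_ms]


def correlate(alert_times, logs, window_s=30.0):
    """Observability-style step: attach error logs from the window before each alert."""
    context = {}
    for alert_ts in alert_times:
        context[alert_ts] = [
            event
            for event in logs
            if event.level == "ERROR"
            and alert_ts - window_s <= event.timestamp <= alert_ts
        ]
    return context


# Hypothetical data: (timestamp, latency in ms) samples and log events.
latency_samples = [(100.0, 120.0), (160.0, 950.0), (220.0, 130.0)]
logs = [
    LogEvent(95.0, "checkout", "INFO", "request ok"),
    LogEvent(150.0, "payments", "ERROR", "connection pool exhausted"),
    LogEvent(158.0, "checkout", "ERROR", "upstream timeout from payments"),
]

alerts = monitor_latency(latency_samples, threshold_ms=500.0)
context = correlate(alerts, logs)

for alert_ts, events in context.items():
    print(f"latency alert at t={alert_ts}")
    for event in events:
        print(f"  {event.service}: {event.message}")
```

The alert alone says only that latency spiked. The correlated log events suggest where to look next, which is the kind of context the list above attributes to observability.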
Why SREs Should Understand Observability
Observability has become a core practice for engineers in a variety of roles. Developers can use observability tools to help measure and optimize application performance before deployment. IT engineers can leverage observability to gain visibility into issues that exist in a production environment. QA engineers can use observability tools to determine why an application failed a test.
But no group has more to gain from observability than SREs. Given that maximizing the reliability and performance of systems is the core mission of SREs, the ability to not just detect problems via monitoring but also to understand them through observability is critical for modern SRE teams.
That’s all the more true for SREs tasked with managing reliability for complex systems, like those based on microservices. In complex applications, identifying the root cause of a reliability issue (such as which specific microservice contains buggy code or which type of request triggers a performance bug) can be very difficult. By providing as much context as possible around reliability problems, and by allowing teams to trace requests across complex, distributed systems, observability places SREs in the strongest position to find and fix reliability problems quickly.
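As a rough illustration of how trace data can narrow down a root cause, the Python sketch below takes the spans of a single request and attributes time to each service. The `Span` type, the service names and the durations are hypothetical, and the calculation assumes that a span's children run one after another without overlapping, which real traces do not guarantee.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    service: str
    duration_ms: float


def self_time_by_service(spans):
    """Time each service spent in its own work, excluding time spent in child spans."""
    child_total = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms

    totals = defaultdict(float)
    for span in spans:
        totals[span.service] += span.duration_ms - child_total[span.span_id]
    return dict(totals)


# Hypothetical trace: gateway -> checkout -> (payments, inventory).
trace = [
    Span("a", None, "gateway", 900.0),
    Span("b", "a", "checkout", 850.0),
    Span("c", "b", "payments", 700.0),
    Span("d", "b", "inventory", 50.0),
]

times = self_time_by_service(trace)
slowest = max(times, key=times.get)
print(times)
print("most time spent in:", slowest)
```

In this made-up trace, the gateway span is the longest overall, but most of the time is spent inside the payments service, so that is where an investigation would likely start.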
What’s more, beyond detecting issues in systems that have already been deployed, SREs can also leverage observability tools to help build reliability into applications before they are put into production. By observing applications in dev/test environments, SREs may be able to identify reliability risks and then use that insight to find ways to make the application inherently more reliable. Observability might reveal reliability weaknesses that stem from an application’s architectural design, for instance, or from the orchestration tool that manages the application.
How to Measure Observability and Monitoring
Getting the most from observability and monitoring requires measuring those initiatives systematically. Only by collecting metrics about outcomes can SREs ensure that they are improving reliability rather than merely collecting and analyzing data to no particular end.
While a full discussion of observability and monitoring measurement is beyond the scope of this article, consider the following types of metrics for assessing these practices:
- Incident frequency: How often do incidents (especially severe incidents) occur that impact production systems? Ideally, SREs will see a decrease in incident frequency that correlates with increased use of observability and monitoring tools.
- Rollback frequency: The less often an application release is rolled back to a previous version, the more effective the SRE team’s observability and monitoring practices likely are.
- Mean-time-to-remediate (MTTR): MTTR is a measure of the overall time it takes to resolve a reliability problem. It reflects how well you monitor (which determines how quickly you detect problems) and how well you observe (which is the key to resolving problems).
- Total data analyzed: Tracking the total volume of data that you collect and analyze through observability and monitoring tools helps you assess the overall scope and scalability of your practices.
- Total systems analyzed: In addition to tracking total data volume, consider measuring how many discrete applications, hosting environments, CI/CD tools and other systems you are managing as part of observability and monitoring workflows.
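The first three metrics in the list above can be computed from basic incident and release records. The Python sketch below is one possible way to do so. The `Incident` type, its field names and the sample figures are assumptions made for illustration, and MTTR is measured here from the start of a problem to its resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime
    severe: bool


def mean_time_to_remediate(incidents):
    """Average time from the start of an incident to its resolution."""
    if not incidents:
        return None
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


def incidents_per_week(incidents, period_days):
    """Incident frequency, normalized to a weekly rate."""
    return len(incidents) / (period_days / 7)


def rollback_rate(releases, rollbacks):
    """Fraction of releases that were rolled back to a previous version."""
    return rollbacks / releases if releases else 0.0


# Hypothetical records covering a 28-day period.
incidents = [
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30), severe=False),
    Incident(datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 15, 30), severe=True),
]

mttr = mean_time_to_remediate(incidents)
weekly_rate = incidents_per_week(incidents, period_days=28)
rollbacks = rollback_rate(releases=40, rollbacks=2)

print("MTTR:", mttr)
print("incidents per week:", weekly_rate)
print("rollback rate:", rollbacks)
```

Tracking these values over successive periods, rather than looking at a single snapshot, is what shows whether changes to observability and monitoring practices coincide with better outcomes.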
The ultimate goals of SREs extend beyond observability and monitoring. However, observability and monitoring serve as essential means to the broader ends that SREs are tasked with achieving: maximizing uptime and performance. And while monitoring alone was once sufficient for supporting the goals of SREs, it’s very difficult to imagine an SRE team today that doesn’t also make extensive use of observability, which takes monitoring to the next level.