4 steps to improve root cause analysis

Teams that follow observability standards, use monitoring tools, and foster a culture of collaboration can more quickly discover the root cause of system outages and performance issues.

Contributor, InfoWorld


When there’s a major systems outage or performance issue, IT teams come to the rescue to restore services as quickly as possible. Some IT organizations follow IT service management (ITSM) incident management practices to restore service, then follow problem management procedures to perform root cause analysis (RCA). More advanced organizations may employ site reliability engineers (SREs) involved in incident and problem management, but their primary responsibility is to drive more proactive steps to reduce error rates and improve service level objectives.

While much of IT operations tends to focus on major incidents like outages, disruptive performance issues, and security attacks, one of the more difficult challenges is finding the root cause behind sporadic, needle-in-a-haystack issues. These issues are infrequent, impact a small subset of users, or last for a very short duration. However, they can be very damaging to the business if they occur during critical operations performed by important end users.

Here are some examples:

A user creates a complex website search or database query that hoards system resources and bottlenecks all other searching activities.
A transaction locks system resources and only creates a performance issue when multiple users perform the same transaction simultaneously.
A faulty cable, network card, or other device creates packet loss, but the impact is only felt by end users during peak usage periods.
A database backup procedure’s duration increases as data grows, creating performance issues only for a subset of end users.
A third-party service has slower-than-usual response times and degrades performance for dependent applications.

“Narrowing down difficult application performance issues requires a functioning debugging and feedback loop,” says Liz Fong-Jones, field CTO of Honeycomb. “Simple, quick issues often turn up in a spike in a single pre-aggregated query on a dashboard, but any issue more complicated than that is, by definition, an ‘unknown unknown’ that was not previously seen or anticipated by the developer at the time they wrote the code.”

Finding the root cause of sporadic performance issues

As a developer in my younger days and later as a CIO, I’ve experienced many needle-in-the-haystack issues, and finding the root cause can be time-consuming and error-prone.

Sometimes, the challenge is sorting out the root cause from too much data, a problem AIops platforms can help address. Other times, there’s missing data, data quality issues, or data sets that need joining. Geoff Hixon, VP of solutions engineering at Lakeside Software, says, “Application performance issues aren’t always easy to find and fix, especially with gaps in data that can cause blind spots of the true root cause.”

How to perform root cause analysis (RCA)

What is needed is a process SREs, developers, and IT operational engineers can follow to perform RCA on issues that are harder to find. I propose four steps:

Manage observability as a product
Plan for top-down and bottom-up analysis
Determine whether it's a network issue
Collaborate and triangulate on root causes

Step 1: Manage observability as a product

In my book, Digital Trailblazer, I tell several stories about fixing performance issues using observability. “It’s easy for people to chase the white rabbits and take other wrong turns, and observability data should help guide teams on the optimal focus areas.”

A devops best practice is to improve the observability of microservices, data pipelines, applications, and other in-house developed software. The challenge for many organizations is creating and improving data standards so that consistency improves ease of use when RCA is needed.

Nick Heudecker, senior director of market strategy and competitive intelligence at Cribl, recommends taking standardization one step further and treating application logs as a data product designed to be consumed by IT operations. “The most important factor in identifying application performance issues is ensuring the telemetry coming from apps is usable by downstream systems. This means structuring logs, enriching them with the right context, and delivering them to relevant platforms. Sounds simple, but the challenge is that the developers producing the logs often aren’t the people using them on the operations side.”
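As a rough illustration of what structuring and enriching logs can look like, here is a minimal Python sketch built on the standard logging module. The field names (service, request_id, duration_ms), the "context" key, and the "checkout" logger are assumptions made for this example, not a standard that Cribl or any other platform prescribes.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # A dict passed as extra={"context": {...}} becomes record.context.
        # The "context" key is a convention assumed for this sketch.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Hypothetical service emitting one enriched event.
logger = build_logger("checkout")
logger.info(
    "payment authorized",
    extra={"context": {"service": "checkout", "request_id": "req-123", "duration_ms": 42}},
)
```

Because every event carries the same keys, an operations team can filter or join on request_id or service without parsing free-form message text.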

Standardizing observability data is one way to productize observability and simplify it for operational needs. Other best practices for devops observability include consulting with risk management on sensitive data and data retention policies. Devops teams should also take steps to educate SREs and people working in the network and security operation centers (NOCs and SOCs) to connect what the software does with how observability data is represented in logfiles and other repositories.

For large organizations developing many applications and microservices, observability standards must be coupled with automation, analytics tools, and models to make root cause analysis easier.

“A shift to a more targeted, real-time data analysis mindset in a company’s observability practice empowers engineers to proactively query the data and gain the insights needed to solve the most perplexing application performance issues,” says Asaf Yigal, co-founder and CTO of Logz.io. “To get to the root cause and resolve critical performance issues of modern microservice-heavy systems, a more efficient solution that cuts through the data using automation and enables proactive rather than reactive response is required.”

It’s important to apply a continuous improvement mindset and an incremental release strategy to observability standards. As NOCs, SOCs, and SREs encounter new issues, devops teams should use the feedback to improve data collection.

Step 2: Plan for top-down and bottom-up analysis

It’s relatively easy to find a slow query with basic database logfiles. Identifying root causes becomes more complex when query performance only degrades when the database is under load and multiple queries compete for the same system resources.

Grant Fritchey, devops advocate at Redgate Software, shares an example of a query that was running fast, about 6ms on average. “From a performance measurement standpoint, it was an unimportant query, until you saw the execution counts and realized that the query was called thousands of times per minute. Even at 6ms, it wasn’t running fast enough. This underscores the need for integrating observability and database monitoring tools to achieve a holistic and nuanced understanding of system performance.”
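Fritchey’s point can be sketched in a few lines of Python: rank queries by total time consumed (average duration multiplied by execution count) rather than by average duration alone. The query names and all numbers other than the 6ms average are hypothetical and exist only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStat:
    name: str
    avg_ms: float
    calls_per_min: int

    @property
    def total_ms_per_min(self) -> float:
        """Database time this query consumes each minute."""
        return self.avg_ms * self.calls_per_min


def rank_by_total_time(stats):
    """Sort queries so the largest consumers of database time come first."""
    return sorted(stats, key=lambda s: s.total_ms_per_min, reverse=True)


# Hypothetical sample: one slow but rare report, one fast but very frequent lookup.
stats = [
    QueryStat("monthly_report", avg_ms=900.0, calls_per_min=2),
    QueryStat("lookup_by_id", avg_ms=6.0, calls_per_min=5000),
    QueryStat("search", avg_ms=120.0, calls_per_min=40),
]

for s in rank_by_total_time(stats):
    print(f"{s.name}: {s.total_ms_per_min:.0f} ms of database time per minute")
```

In this made-up sample, the 6ms lookup tops the list because 5,000 calls per minute add up to 30 seconds of database time every minute, far more than the 900ms report that runs twice.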

Effective RCA requires monitoring tools to do more than basic alerting on outages or major performance problems. Ops and SREs need indicators when performance is outside the norm and tools for top-down analytics to drill into suspect transactions and activities. Tools should also help identify performance outliers, especially among high-volume and poorly performing activities. The better tools also help isolate end-user experiences, so when there’s a customer support call about a problem, operations can perform an RCA for that user.
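One simple way to flag performance that is outside the norm is to compare recent response times against a baseline using the median and the median absolute deviation (MAD), which are less distorted by a few extreme samples than the mean and standard deviation. The sketch below is an illustration, not how any particular monitoring product works. The 3.5 threshold is a common rule of thumb for this modified z-score, and the sample latencies are made up.

```python
from statistics import median


def find_slow_outliers(baseline, recent, threshold=3.5):
    """Return recent samples that are unusually slow relative to the baseline.

    Uses the modified z-score: 0.6745 * (x - median) / MAD.
    Only samples slower than the baseline are flagged.
    """
    center = median(baseline)
    mad = median(abs(x - center) for x in baseline)
    if mad == 0:
        # Baseline has no spread; treat anything slower than the median as unusual.
        return [x for x in recent if x > center]
    return [x for x in recent if 0.6745 * (x - center) / mad > threshold]


# Hypothetical response times in milliseconds.
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
recent = [101, 104, 180, 99]

print(find_slow_outliers(baseline, recent))
```

In practice the baseline would likely be segmented by transaction type and time of day, since what counts as normal for a nightly batch job differs from what is normal for an interactive search.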

Step 3: Determine whether it's a network issue

It’s easier for devops teams to point to problems in the network and infrastructure as the root cause of a performance issue, especially when these are the responsibility of a vendor or another department. That knee-jerk response was a significant problem before organizations adopted devops culture and recognized that agility and operational resiliency are everyone’s responsibility.

“The villain when there are application performance issues is almost always the network, and it’s always the first thing we blame, but also the hardest thing to prove,” says Nicolas Vibert of Isovalent. “Cloud-native and the multiple layers of network virtualization and abstraction caused by containerization make it even harder to correlate the network as the root cause issue.”

Identifying and resolving complex network issues can be more challenging when building microservices, applications that connect to third-party systems, IoT data streams, and other real-time distributed systems. This complexity means that IT ops need to monitor networks, correlate them to application performance issues, and perform network RCAs more efficiently.

“Integrated packet monitoring across virtualized environments over north-south and east-west traffic paths provides consistent, real-time insights into traffic and application performance,” says Eileen Haggerty, AVP of product and solutions marketing at NETSCOUT. “But every domain and location must have the same analytics, intelligence, and visibility level, no matter where workloads, apps, and services are running. A consistent measurement approach across every hosting environment enables easier and faster determination of the root cause and location of performance issues for applications across any network infrastructure.”
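When network and application measurements are collected over the same time windows, a first check on whether the network is a plausible suspect is to correlate the two series. The Python sketch below computes a Pearson correlation between packet loss and p95 latency. The metric names and sample values are hypothetical, and a high correlation only suggests where to look next. It does not prove the network is the cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series carries no information about co-movement.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical per-interval measurements taken over the same six time windows.
packet_loss_pct = [0.0, 0.1, 0.0, 2.5, 3.0, 0.2]
p95_latency_ms = [110, 115, 108, 340, 390, 120]

r = pearson(packet_loss_pct, p95_latency_ms)
print(f"correlation between packet loss and p95 latency: {r:.2f}")
```

A coefficient near zero for the same windows would be a reason to look for the cause elsewhere, such as the database or a third-party dependency, before escalating to the network team.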

Step 4: Collaborate and triangulate on root causes

Two final recommendations focus on how teams work together to resolve incidents and perform root cause analysis. The first concerns collaboration. I’ve led more than my fair share of bridge calls and rooms to find and fix issues, which can be a necessary evil during a major outage. However, these approaches are far less effective when solving sporadic performance issues that require correlating data from multiple tools and observability data sources. Many of these issues require a cross-disciplinary team to collaborate, share knowledge, and work together efficiently when an RCA is needed.

“I have observed a notable absence of application documentation and limited communication between teams in many larger and well-established organizations,” says Chris Hendrich, associate CTO at SADA. “Breaking down these disjointed silos can help companies improve their ability to conduct root cause analysis.”

The second speaks to how teams search for root causes. Fong-Jones of Honeycomb says, “It’s not necessary to leap directly to the needle in the haystack, only to be able to narrow down parts of the haystack that the needle is or isn’t in until you find the needle. But, tools can help generate questions that will help you filter the haystack.”
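The narrowing approach Fong-Jones describes can be approximated by grouping request events on one attribute at a time and checking where the slow requests concentrate. The Python sketch below is one hypothetical way to do that. The event fields (region, endpoint, duration_ms) and the 500ms threshold are assumptions for illustration, not a description of how Honeycomb or any other tool works.

```python
from collections import Counter


def slow_rate_by(events, field, slow_ms=500):
    """Fraction of requests at or above slow_ms, grouped by one attribute."""
    totals = Counter()
    slow = Counter()
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["duration_ms"] >= slow_ms:
            slow[key] += 1
    return {key: slow[key] / totals[key] for key in totals}


# Hypothetical request events.
events = [
    {"region": "us-east", "endpoint": "/search", "duration_ms": 120},
    {"region": "us-east", "endpoint": "/checkout", "duration_ms": 900},
    {"region": "eu-west", "endpoint": "/search", "duration_ms": 110},
    {"region": "eu-west", "endpoint": "/checkout", "duration_ms": 950},
    {"region": "us-east", "endpoint": "/search", "duration_ms": 130},
    {"region": "eu-west", "endpoint": "/checkout", "duration_ms": 100},
]

for field in ("region", "endpoint"):
    print(field, slow_rate_by(events, field))
```

In this sample, both regions show the same slow rate, so region can be set aside, while every slow request belongs to /checkout. The next question is then what distinguishes slow checkouts from fast ones, and the same function can be rerun on that subset with another attribute.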

All IT organizations run into performance issues that are hard to solve. Teams that collaborate, share information, create observability standards, and develop expertise in using monitoring tools can lower the stress, reduce the time, and improve the accuracy of their RCAs.


Isaac Sacolick is president of StarCIO and the author of the Amazon bestseller Driving Digital: The Leader’s Guide to Business Transformation through Technology and Digital Trailblazer: Essential Lessons to Jumpstart Transformation and Accelerate Your Technology Leadership. He covers agile planning, devops, data science, product management, and other digital transformation best practices. Sacolick is a recognized top social CIO and digital transformation influencer. He has published more than 900 articles at InfoWorld.com, CIO.com, his blog Social, Agile, and Transformation, and other sites.