These observability tips can help developers uncover issues that impede performance and derail customer experience
Modern technologies and methodologies such as cloud services, containers, DevOps, microservices and serverless have made it easier for organizations to deploy application code to production. However, the additional layers within the modern tech stack have increased complexity, making it more difficult to identify performance issues and trace them back to the root cause before customers are impacted.
While all these layers can be instrumented for observability, each emits different information aimed at different roles. If something goes wrong, where does a developer begin to look? Maybe they get mentioned on a support ticket. Or they end up staring at high-level visualizations of systems from the various performance monitoring and logging tools. But there are graphs everywhere, and there's no surefire way for the developer to quickly home in on what they're supposed to look at. By standing back and eyeballing the graphs, they'll obviously see a spike. But with all this information and so many variables, it's not easy to pinpoint why it happened.
When incidents impact large numbers of customers, development teams can spend days trying to find the answer. Without context into what's causing the issue, developers spend most of this time blindly following different trails and doing things like arbitrary code rollbacks. It's costly: studies show organizations spend about $4.6 million annually on incident management.
Observability: Strategies to Consider
Development teams can save time and reduce costs by implementing three overlooked observability strategies that offer a new perspective and carry learnings forward.
Compare Changes Over Time
To help reduce the time it takes to trace an issue back to a specific cause, developers should look at metrics over time. Find the closest representation of an identical window of activity and compare it to a previous point in time to help spot an outlier. For example, if a development team is alerted to an incident in which customers are reporting slow response times with an application, they can compare transaction data from similar time periods to look for correlations. Data may reveal the issue is only with API calls, even though individual API call times were not significant enough to meet thresholds and trigger alarms. However, when they compare those API calls to calls from a week ago under the same parameters—same time, same server, etc.—they discover the aggregate of the API calls is performing 20% slower.
Looking at performance comparisons over time gives developers visibility into performance degradation before it starts to hurt a large number of users.
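The week-over-week comparison described above can be sketched as a simple aggregate check. The latency figures below are illustrative, and the 20% threshold mirrors the article's example rather than any standard default:

```python
from statistics import mean

def compare_windows(current_ms, baseline_ms, threshold=0.20):
    """Compare aggregate latency for the same activity window now versus
    a previous one (e.g., a week ago, same time, same server).
    Returns the fractional slowdown and whether it exceeds the threshold."""
    current, baseline = mean(current_ms), mean(baseline_ms)
    slowdown = (current - baseline) / baseline
    return slowdown, slowdown > threshold

# Hypothetical per-call API latencies (ms) for the same hourly window.
last_week = [110, 120, 115, 118, 112]
this_week = [138, 150, 142, 145, 140]

slowdown, degraded = compare_windows(this_week, last_week)
print(f"aggregate slowdown: {slowdown:.0%}, degraded: {degraded}")
# prints "aggregate slowdown: 24%, degraded: True"
```

Note that no single call here would trip a per-request alarm; only the aggregate comparison against an identical baseline window surfaces the degradation.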
Analyze User Behavior Patterns
A development team should also evaluate the behavior patterns of individuals and IP addresses that are interacting with an application. This can provide insights that help to differentiate between external (user, browser) causes and internal causes (problems with code or a company’s systems).
For example, consider an e-commerce customer trying to modify items in their shopping cart before checking out. The application is responding slowly to their clicks (each is taking two seconds), so they click repeatedly. A single occurrence might be overlooked, but repeated two-second delays add up to a significant impact. They may even trigger alerts that the database is overloaded or there's a code problem when, in reality, user behaviors are compounding an underlying problem. Some log activity may look like an attack, but investigating the users impacted may reveal the root cause is actually a browser behaving unusually or a misconfigured proxy sending anomalous traffic.
In either case, developers end up wasting time by implementing corrective actions that may not solve the problem. But considering patterns—viewing events that fired before and after the problem and behavior from individual users along with IP addresses—will enable developers to quickly trace performance issues back to the root cause.
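One way to separate user-driven retry storms from genuine spikes in load is to group events by user and action and flag rapid repeats. This is a minimal sketch, assuming a simple in-memory click log; the event shape, five-second window and repeat count are illustrative choices, not from any particular tool:

```python
from datetime import datetime, timedelta

def repeated_clicks(events, window_s=5, min_repeats=3):
    """Flag (user, action) pairs that re-fire within a short window --
    a sign the slowdown is being compounded by retry behavior from a
    few users rather than an independent spike in overall load."""
    by_user = {}
    for user, action, ts in events:
        by_user.setdefault((user, action), []).append(ts)

    flagged = {}
    for key, times in by_user.items():
        times.sort()
        # Any min_repeats consecutive events inside the window flags the pair.
        for i in range(len(times) - min_repeats + 1):
            if times[i + min_repeats - 1] - times[i] <= timedelta(seconds=window_s):
                flagged[key] = len(times)
                break
    return flagged

# Hypothetical click log: (user_id, action, timestamp)
t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    ("u1", "update_cart", t0),
    ("u1", "update_cart", t0 + timedelta(seconds=2)),
    ("u1", "update_cart", t0 + timedelta(seconds=4)),
    ("u2", "update_cart", t0 + timedelta(seconds=1)),
]
print(repeated_clicks(events))  # prints {('u1', 'update_cart'): 3}
```

In practice the same grouping can be applied to IP addresses, which is how a single misconfigured proxy sending anomalous traffic would stand out from genuinely distributed load.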
Properly Assess Impact
Properly assessing the impact of an issue is another useful strategy that can help determine the right response to an alert. For instance, if a small subset of users (less than 0.5%) is repeatedly encountering an error with an API, unable to access the functionality (as in the shopping cart example) and persistently re-attempting the call, the error rate will spike in a way that implies a much bigger problem than actually exists.
Another example is newly rolled-out functionality that turns out to be incompatible with components of older accounts, such as incompatible profile information or data thresholds. As customers explore, a small subset of users will encounter the bug. It may appear as though the feature is broken when, in reality, the problem lies with some legacy data.
When problems affect only a few customers, developers benefit from leaning on monitoring tools for the alerts and context necessary to resolve an issue without waiting on support tickets.
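Impact assessment comes down to separating raw error volume from the number of distinct users affected. A minimal sketch, assuming a hypothetical list of error events with a `user` field:

```python
def impact_summary(error_events, total_users):
    """Separate raw error volume from the number of distinct users hit.
    A high error count with few distinct users usually means a small
    cohort is retrying, not a widespread outage."""
    affected = {event["user"] for event in error_events}
    return {
        "error_count": len(error_events),
        "affected_users": len(affected),
        "affected_pct": 100 * len(affected) / total_users,
    }

# Hypothetical: 40 errors on one endpoint, all from 4 retrying users.
errors = [{"user": f"u{i % 4}", "endpoint": "/cart"} for i in range(40)]
print(impact_summary(errors, total_users=1000))
# prints {'error_count': 40, 'affected_users': 4, 'affected_pct': 0.4}
```

The raw count (40 errors) looks alarming, but the distinct-user view shows only 0.4% of the user base is affected, which changes the urgency and likely root cause of the response.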
Apply Observability Learnings Forward in Post-Discovery
For issues that impacted many users or were difficult to diagnose, development teams should consider adding a custom metric or monitor when they do their post-discovery analysis. Whatever the metric is—an extra line in a log, a StatsD metric or putting a catch around it to log the trace—it is important to capture the lesson learned and make it actionable for the future, which will save time during discovery when it happens again. Without this step, the mean time to resolution for future issues may continue to worsen as software complexity increases.
Lack of actionable data is one of the reasons why companies spend millions of dollars annually on managing and resolving incidents. As development teams look to expand observability practices, they should first consider what puts the developer in the best position to resolve an error or performance issue quickly and ultimately deliver exceptional customer experiences.