In my role as a developer advocate for Coralogix, I speak with organizations across almost every conceivable industry, and there is a pattern across almost all of them. In the world of observability, the challenge is battling scale. At scale, observability becomes a complex problem to solve. Surprisingly, the most common pattern for handling observability data is also holding people back from unlocking the true power of their data.
What is the Data Silo in the World of Observability?
To understand the data silo, we first need to understand the typical architecture of an observability solution, whether it’s homegrown or SaaS. Observability data can be divided into three types: Logs, metrics and traces. These three types come together to give engineers a complete picture of the health and performance of their system.
These three data types are typically held in different storage mechanisms. A popular solution for logs is OpenSearch, while Prometheus holds the mindshare for metrics, and Jaeger is a very popular tracing solution. This means three solutions, three interfaces and, subsequently, three problem-solving workflows.
What’s wrong with separating data like this? Even though there are three types of data, there is still only one goal: To understand the health and performance of the system. This is the only use case. The data silo is the tendency of observability solutions to split that data by type and present a fragmented view of it. Engineers have to piece the story together as they go, one tool at a time, rather than correlating across all of the data at once to get a complete picture of system performance.
This slows down troubleshooting. If there is an issue, an engineer will first review one of the data types. Then, they will run different queries against the other data types to confirm what they found. So they may detect a 500 error in the logs, then move to the traces to see the impact on traffic, and finally, review the application metrics. This means a minimum of three queries, in three different tools, to assess a single problem.
Breaking the Data Silo
Breaking the data silo can be a daunting challenge. There are some key points to focus on when a company starts out and some finishing touches that make for an excellent developer experience.
Unify data storage as much as possible. For example, metrics derived from traces, such as span latencies, can be stored in Prometheus alongside infrastructure metrics rather than being locked inside a separate tracing tool. This will enable engineers to query, for example, trace latencies alongside CPU metrics. It will also minimize the UI work necessary to join data together. If an organization has the expertise, it’s possible to store everything inside an OpenSearch cluster. This means a single interface, with a single query language, to query all available observability data. Anyone assessing this option should base the decision on what their company can maintain and which skills it has in-house.
Present observability data as one. If storage unification isn’t an option, then focus on the presentation layer. Tools like Grafana can consume most observability solutions as external data sources and render them all on the same dashboard. While an engineer will still have to query them using the specific language of the tool, it means that they can render things alongside one another, making for an insightful dashboard.
Correlate data in one user interface. This means metrics are rendered in a single dashboard alongside logging and tracing data. This will give users a static view of the data at once, even if there is a limit to how queryable the data is.
Create universal values. If it’s impossible to follow any of the previous recommendations, then this is an easy, quick win that will at least make querying observability data more consistent. Add some tags or values that will exist in all logs, metrics, and traces. For example, the name of the application and the feature it belongs to: application is credit-card-service and feature is payments. An easy way to do this without making your code messy is to use a mechanism like MDC (Mapped Diagnostic Context, found in Java logging frameworks) or your tech stack’s equivalent. This means that if something goes wrong, there is already a set of key-value pairs that translate across every data type.
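As a rough illustration of the idea, here is a minimal sketch using only Python’s standard logging module, standing in for MDC in a Python stack. The tag names and values come from the example above; the class names, the logger name and the JSON layout are assumptions made for this sketch, not a prescribed format.

```python
import io
import json
import logging

# Shared tags from the example in the text. In a real service these would
# more likely come from configuration or the environment (an assumption).
UNIVERSAL_TAGS = {"application": "credit-card-service", "feature": "payments"}


class UniversalTagsFilter(logging.Filter):
    """Copy the shared tags onto every log record that passes through."""

    def __init__(self, tags):
        super().__init__()
        self.tags = dict(tags)

    def filter(self, record):
        for key, value in self.tags.items():
            setattr(record, key, value)
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including the shared tags."""

    def __init__(self, tag_names):
        super().__init__()
        self.tag_names = list(tag_names)

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in self.tag_names:
            payload[name] = getattr(record, name, None)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream, tags=UNIVERSAL_TAGS):
    """Return a logger whose every line carries the universal tags."""
    logger = logging.getLogger("universal-tags-demo")  # hypothetical name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(UniversalTagsFilter(tags))
    handler.setFormatter(JsonFormatter(tags.keys()))
    logger.addHandler(handler)
    return logger


def tagged_log_line(message):
    """Log one error message and return the emitted JSON line as a dict."""
    stream = io.StringIO()
    build_logger(stream).error(message)
    return json.loads(stream.getvalue())
```

Application code calls the logger as usual and never mentions the tags, which is what keeps it tidy. The same dictionary could then be reused as metric labels and span attributes, so that all three data types carry identical keys.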
At the very least, this will mean an engineer can issue consistent queries across the three solutions, albeit in different syntaxes. This represents the least value, but also the least rework and effort, and will lower the cognitive load of engineers who are already trying to debug a potentially serious outage.
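To show what “consistent queries in different syntaxes” might look like, the sketch below turns one set of shared tags into three filter expressions: a PromQL label selector, a Lucene-style query string of the kind OpenSearch accepts, and a logfmt-style tag filter of the kind used in Jaeger’s search form. The metric name http_requests_total and the helper function names are hypothetical, and escaping of special characters in values is deliberately left out.

```python
# Shared tags from the example in the text.
TAGS = {"application": "credit-card-service", "feature": "payments"}


def promql_selector(metric, tags):
    """Build a PromQL instant-vector selector such as metric{k="v",...}."""
    labels = ",".join(f'{key}="{value}"' for key, value in sorted(tags.items()))
    return f"{metric}{{{labels}}}"


def lucene_query(tags):
    """Build a Lucene-style query string: k:"v" AND k2:"v2"."""
    return " AND ".join(f'{key}:"{value}"' for key, value in sorted(tags.items()))


def logfmt_tags(tags):
    """Build a logfmt-style tag filter: k=v k2=v2."""
    return " ".join(f"{key}={value}" for key, value in sorted(tags.items()))


if __name__ == "__main__":
    # The metric name is an assumption made for illustration only.
    print(promql_selector("http_requests_total", TAGS))
    print(lucene_query(TAGS))
    print(logfmt_tags(TAGS))
```

The syntax differs per tool, but the keys and values do not, so an engineer only has to remember one vocabulary during an incident.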
The Data Silo Can be Beaten
A siloed observability strategy drives waste and inefficiency in engineering teams, precisely when they need efficiency the most. By breaking the data silo, observability data can be queried, analyzed, and correlated together to form a single, cohesive and complete picture of your system health.