What happens when you mess up data modeling?

28 Mar, 2024

There has been an uptick in discussions surrounding the modern data stack, data mesh, and self-service, promising to solve many of your company’s data problems. However, in the day-to-day client work of Xebia Data’s Analytics Engineering team, we still see companies large and small struggling to get the core concepts right.

It’s great that all of these new tools make data analytics more accessible than ever for anyone to get involved in, but this accessibility carries risks that are not often discussed.

In this blog post, we’d like to share our experience of the current state of data modeling at companies and what happens if you mess up data modeling at your company.

The current state of data modeling at companies

Much has been written about different data modeling techniques, but far less about the reality of data modeling in practice. Knowing a data modeling technique and implementing it are two entirely different things. Unsurprisingly, we see a lot of companies where people know a thing or two about data modeling but struggle to set it up properly.

Free-for-all data modeling

One of the most common issues we see at companies is that no one is in charge of how data is being modeled: the free-for-all approach. We mostly see no ownership and no vision, or a data mesh style in which every team creates its own way of working.

This usually originates from the need of individual teams to “do something with data” where coordinating with other teams is not a priority yet, leading to data silos.

While the increasingly popular data mesh pattern advocates for a decentralized approach to data analytics where every team has the autonomy to create their own pipelines, in reality this only works well for very few companies.

For most companies, not having a more centralized approach to data modeling leads to the following:

  • Teams creating data transformations in their own way. In this case, collaboration amongst teams is hard and a lot of effort is poured into understanding and reconciling the different ways in which data is being modeled. Each team will create their own pipelines leading to duplicate work. Furthermore, data transformation is done in several steps of the process (ingestion, data warehouse and visualization tool) making it even harder to see what is going on.

  • Teams creating their own data definitions. Since each team creates pipelines from scratch, alignment of definitions, metrics and KPIs is limited or even nonexistent and leads to discussions. The definition of revenue is a typical example of a metric that can be defined in many different ways. Some teams might include tax in the revenue metric, whereas others will not, leading to inconsistencies.
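To make the revenue example concrete, here is a minimal sketch of how two teams applying different definitions to the same orders end up reporting different numbers. All names and figures are hypothetical, for illustration only.

```python
# Two teams compute "revenue" from the same orders but use different
# definitions; the numbers and field names are hypothetical.

orders = [
    {"amount_excl_tax": 100.0, "tax": 21.0},
    {"amount_excl_tax": 250.0, "tax": 52.5},
]

# Team A defines revenue as gross: amount including tax.
revenue_team_a = sum(o["amount_excl_tax"] + o["tax"] for o in orders)

# Team B defines revenue as net: amount excluding tax.
revenue_team_b = sum(o["amount_excl_tax"] for o in orders)

print(revenue_team_a)  # 423.5
print(revenue_team_b)  # 350.0
```

Both numbers are "revenue", yet they disagree, and neither team is wrong by its own definition. This is exactly the kind of discussion a shared, centrally agreed metric definition prevents.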

Business not involved in decisions

In the end, data modeling is there to make the lives of business users easier. However, we often see that data teams are responsible for designing the data models. Although data teams usually have good knowledge of data concepts, they usually lack business context. This leads to situations where the language of the business is not reflected at all in data models.

Entities and/or column names might contain technical terminology, refer to source data systems, or result from what the data team thinks it should be. For example, developers might store the ID of logged-in website users in a column called uuid, whereas the business might actually use user when talking about this entity. This can become confusing and result in data models that are harder for business users to derive value from.
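One simple way to reflect the language of the business is a thin translation layer that maps technical source-system names to business terms. The sketch below illustrates the idea; the mapping and all column names are hypothetical.

```python
# Hypothetical mapping from technical source-system column names to the
# terms the business actually uses; names are illustrative only.
COLUMN_TO_BUSINESS_TERM = {
    "uuid": "user",
    "crm_acct_nm": "customer_name",
    "ord_ts": "order_date",
}

def rename_to_business_terms(record: dict) -> dict:
    """Expose a raw record under business-friendly names.

    Columns without a mapping keep their original name.
    """
    return {COLUMN_TO_BUSINESS_TERM.get(k, k): v for k, v in record.items()}

raw = {"uuid": "u-123", "crm_acct_nm": "Acme BV"}
print(rename_to_business_terms(raw))
# {'user': 'u-123', 'customer_name': 'Acme BV'}
```

In a real pipeline this mapping would typically live in the transformation layer (e.g. a staging model) rather than in application code, but the principle is the same: the business vocabulary is applied once, centrally, instead of being re-invented per team.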

Again, data teams support the business, not the other way around. Therefore, business teams should always be in the lead and involved when creating data models.

Data model debt

Strongly related to the first observation, we have seen many companies struggling to manage data models at scale as the company grows. Data models become large and complex, create lots of unnecessary computing costs, become hard to navigate, and, in the end, become hard to maintain or expand upon. Examples are (1) keeping historical snapshots in a data warehouse without any use cases for snapshots and (2) models with such a complicated lineage that nobody knows where to start when they need to be expanded.

We also see many ‘Shadow BI’ cases where teams work around the limitations and data model debt by plugging directly into operational systems, or by using raw data and creating their own models directly within visualization tools. This leads to even more data model debt. ‘Shadow BI’ and the discussion on where to do data modeling in your pipeline is a topic we will cover in a future blog post!

It might seem easy when you start your data modeling process, but the more data models you have, the greater the need for a solid data modeling vision and related processes. The data model debt that you build up over time becomes harder and harder to pay off.

Black boxes

Even if there is a more centralized approach to data modeling, it is often hard for other teams to understand how the data models are created. We’ve seen many cases where a consistent data modeling technique was used without documenting how these models are developed.

Imagine looking through a set of data models and not knowing how these are created and where the source data comes from. Would you fully trust these models?

Lack of data modeling skills

Another observation is that companies and data professionals do not always see data modeling as a core skill, even though data transformation is central to their work.

Most courses and education are focused on the sexy and shiny things of analytics engineering, like learning SQL, Python, and even learning specific tools. Less time is spent on data modeling. This is reflected in the hiring processes.

We’ve seen many cases where applicants for data roles at our clients have the hard skills but lack the knowledge to manage data models efficiently at scale. Furthermore, not all companies see data modeling as a core competency for data transformation roles. For example, data modeling is often not included in interviews and job descriptions. A data modeling strategy depends on people’s skills, which are just as important as any other aspect of data modeling.

So, what do you risk when messing up data modeling?

We’ve gone over a number of data modeling observations that we see happening often at companies, but what is really at stake?

Unnecessary computing and storage costs

A direct and measurable effect is increased computing and storage costs due to inefficient data modeling techniques, data teams recreating and storing the same data models, and a lack of data modeling skills. This will end up hurting your cloud bills.

Time to value increases

All the data activities at a company should eventually lead to actionable insights. However, if your data team struggles to maintain or expand data models due to an incoherent data modeling approach, the time it takes for data to reach the business increases.

Moreover, analysts will spend less time extracting insights and are forced to spend more time trying to figure out where to find data and how to use it. A well-defined and maintained data model can result in fewer data analysts delivering the same amount of value. Think about it: if you are working on an analysis to investigate a recent uptick in customers, you might want to find some marketing data to see if a recent campaign might be the cause.

Applying a proper data modeling technique allows you to quickly find the right data entity, including its documentation. Without it, you would have to spend more time looking for the right entity to use, ask questions of people who know more about it, and maybe even verify what is actually in the data.

Finally, suppose the tech debt accumulates to a point where it becomes unworkable. In that case, time is lost by having engineers refactor or even rebuild data models from scratch to fix the situation. A consistent approach to data modeling makes it easier for your data models to adapt to changing business requirements. An inconsistent approach leads to lots of discussions on how to expose new source data to data consumers for every new model and/or attribute you want to add.

These symptoms are more challenging to measure, but as they add up, they increase the time it takes to get value out of your data.

Time spent on irrelevant tasks

If the responsibility of prioritizing and defining the data models is in the hands of the data team instead of the business, there will be a misalignment of priorities. The data team will not always be working on what is essential for the business and will be less effective in providing the correct data to the analysts.

Governing your data gets hard

With teams using different ways of data modeling, not sharing data lineage, and data modeling being done in other parts of the pipeline, knowing where and by whom your data is used is tricky. Especially for enterprise-scale companies, data governance is critical and helps keep the large number of data models manageable. Having a proper data modeling technique makes your data models more consistent and organized and, therefore, easier to govern.

Take data modeling seriously

Not seeing data modeling as a core competency and core focus of your data team has many risks and makes your data team less effective. Of course, it’s easy for us to point out what is wrong in terms of data modeling at companies. But it’s even better to share how to fix it. If you’re interested in knowing more about data modeling, stay tuned for more data modeling blogs and read the Fundamentals of Analytics Engineering book written by members of the Analytics Engineering team at Xebia Data, which is now available to order!

Do you have a data modeling challenge for us? Feel free to reach out to us, and we will be happy to help you.

Photo by Omar Flores on Unsplash

Thom van Engelenburg
Thom obtained degrees in Business, Marketing and Entrepreneurship giving fundamental knowledge of the needs of a business. Besides studying, he complemented his business knowledge with technical skills by learning how to program and work with data. The combination of business and tech has since been present in his various data roles and allows him to easily bridge business and tech teams.