Friday, October 14, 2022

Data Engineering

The foundation of all software systems is persistent data.

That is, a big part of any solution provided by a software system is the ability to digitize events, inventories, and conversations, then collect them together and make them reliably available in the digital realm.

Like anything else, these foundations have to be in place first, before you can build any other type of computation on top. Building on top of broken foundations is a waste of time and resources.

An organization needs only one copy of any of the possibly trillions of digital records. They need to know where it is located, and unless it is accessed frequently, they don’t need speed. If it takes a few minutes to get to infrequently used data, that is fine. In some cases, an hour might be acceptable as well.
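As an illustration only, that latency tolerance can be written down as a simple tiering rule. The tiers and thresholds below are hypothetical, not a recommendation:

```python
from enum import Enum

class StorageTier(Enum):
    HOT = "hot"      # sub-second access, e.g. a primary database
    WARM = "warm"    # minutes to retrieve, e.g. cheaper object storage
    COLD = "cold"    # up to an hour, e.g. archival storage

def pick_tier(reads_per_day: float) -> StorageTier:
    """Pick a storage tier from how often a record type is actually read.

    The cutoffs here are illustrative; real ones depend on cost, volume,
    and the latency the business can tolerate.
    """
    if reads_per_day >= 100:
        return StorageTier.HOT
    if reads_per_day >= 1:
        return StorageTier.WARM
    return StorageTier.COLD
```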

Obviously, size, frequency, and usage are critical things that need to be understood.

For every different type of record, at least one person in the company needs to really understand it: its structure, its usage, its frequency, and any special cases that will show up during collection. If you just collect data blindly, the quality of that data will always be very poor. What you get out of data reflects the effort you put into making sure it is right.
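A minimal sketch of what that understanding might look like once it is actually written down, assuming a Python-style catalog; every field name and example value here is invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class RecordTypeSpec:
    """Minimal documentation for one type of record, owned by one person."""
    name: str                   # e.g. "customer_order"
    owner: str                  # the person who really understands it
    structure: dict[str, str]   # field name -> type, captured as it exists
    approx_size_bytes: int      # typical size of a single record
    arrival_rate_per_day: int   # how often new records show up
    usage: str                  # who reads it and why
    special_cases: list[str] = field(default_factory=list)  # oddities seen during collection

# Example entry; all values are made up.
orders = RecordTypeSpec(
    name="customer_order",
    owner="jane.doe",
    structure={"order_id": "uuid", "items": "list[line_item]", "placed_at": "timestamp"},
    approx_size_bytes=2_048,
    arrival_rate_per_day=50_000,
    usage="billing reports, fulfilment dashboard",
    special_cases=["orders amended after placement", "legacy ids without a uuid"],
)
```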

Often, besides the primary copy of the record, a lot of secondary copies get created as well. Sometimes for performance, more often because of politics. That is fine if all of them are ‘read-only’, but it is a really, really bad mistake if people are updating those secondary copies.
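One way to make that rule hard to break, sketched in Python; the wrapper and its names are hypothetical, and in a real system the same restriction would also be enforced with database permissions on the replica itself:

```python
class ReadOnlyCopy:
    """Wraps a secondary copy so callers can read but never write.

    This is an application-level guard only; the database behind it should
    also refuse writes (e.g. a user granted nothing beyond SELECT).
    """
    def __init__(self, records: dict):
        self._records = dict(records)   # private snapshot of the secondary copy

    def get(self, key):
        return self._records.get(key)

    def __setitem__(self, key, value):
        raise PermissionError("secondary copies are read-only; update the primary instead")

# Usage: reads work, writes fail loudly.
replica = ReadOnlyCopy({"order-1": {"status": "shipped"}})
print(replica.get("order-1"))
# replica["order-1"] = {"status": "lost"}   # would raise PermissionError
```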

You have to capture the data as it exists. You can’t flatten structured data without losing information, and you often can’t inflate it back due to ambiguities. So, if you spend the effort to capture it into a primary database, it is always worth the time and effort to get the structure correct. Collecting a lot of broken databases is a waste of resources.
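A small, made-up example of the problem: once nested structure is flattened into dotted keys, there is no longer enough information to reliably rebuild the original shape.

```python
# A nested record as it actually exists.
order = {"customer": {"id": 7, "name": "Ada"}, "items": [{"sku": "A1"}, {"sku": "B2"}]}

def flatten(record, prefix=""):
    """Flatten nested dicts/lists into dotted keys (a common 'simplification')."""
    flat = {}
    if isinstance(record, dict):
        for k, v in record.items():
            flat.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(record, list):
        for i, v in enumerate(record):
            flat.update(flatten(v, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = record
    return flat

print(flatten(order))
# {'customer.id': 7, 'customer.name': 'Ada', 'items.0.sku': 'A1', 'items.1.sku': 'B2'}
# Going back is ambiguous: nothing in the flat keys says whether 'items.0' was a
# list index or a field literally named "0", so the original shape is not
# guaranteed to be recoverable.
```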

If you have just one primary copy of good-quality data, then building out any other computations and interfaces on top of it is straightforward. If you find that what you are building is instead extremely complex, it is often because of poor foundations. It is way, way more efficient to fix the foundations first instead of just piling more problems on top. The house of cards only ever gets worse; it will not fix itself, and the real problem is below.

So, one core part of data engineering is to craft models for the incoming data that correctly list out the size, frequency, usage, and structure of each different ‘type’ of record. All of them. The other part is to ensure that for any organization there is only one primary source and that any secondary sources are always read-only. If that is in place, crafting databases, ETLs, or any other sort of data piping will go well. With that as a foundation, building anything on top will be cheaper and of far better quality.
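A rough sketch of checking those two rules against a catalog of sources; the Source fields and the example entries are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Source:
    record_type: str
    location: str
    is_primary: bool
    read_only: bool

def check_catalog(sources: list[Source]) -> list[str]:
    """Report violations of the two rules: exactly one primary per record
    type, and every secondary copy must be read-only."""
    problems = []
    by_type: dict[str, list[Source]] = {}
    for s in sources:
        by_type.setdefault(s.record_type, []).append(s)
    for record_type, copies in by_type.items():
        primaries = [s for s in copies if s.is_primary]
        if len(primaries) != 1:
            problems.append(f"{record_type}: expected exactly 1 primary, found {len(primaries)}")
        for s in copies:
            if not s.is_primary and not s.read_only:
                problems.append(f"{record_type}: writable secondary at {s.location}")
    return problems

# Example catalog; names and locations are made up.
print(check_catalog([
    Source("customer_order", "orders_db", is_primary=True, read_only=False),
    Source("customer_order", "reporting_db", is_primary=False, read_only=True),
    Source("invoice", "finance_db", is_primary=False, read_only=False),
]))
# ['invoice: expected exactly 1 primary, found 0', 'invoice: writable secondary at finance_db']
```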
