Sunday, August 2, 2020

Duality

There are two very similar ways of looking at software systems. 


The most common one is to see it as a lot of code that moves data around. Code is the primary concept. Its dual is to see it as data, with the code just assisting in getting that data from place to place. Code is a secondary issue.


They may seem similar, and it's easy to miss the difference between the two, but from a higher level the second perspective is a lot simpler and a lot more powerful.


When we first learn to program, we are taught to start assembling larger and larger code fragments. First, it is small examples of branches and loops, putting some code into functions, and calling other chunks of code to get them to do stuff. That, rather directly, establishes that ‘the code’ is super important. We get caught up in syntax issues, different languages, and various IDEs. Early coding tasks are to ‘write some code that ...’; interviews are uber-focused on the code too. Everything is about ‘code’.


Data is usually introduced as a secondary issue, and most often it is somewhat trivial and primitive. If we take a data structures course, even the actual data is abstracted away; we’re just left with abstract relationships like lists and trees.


That carries on through most programmers’ careers. Their work assignments are often crafted as producing some code to support a feature. In many shops, the code is started first; only later do they realize that some details about the data were missed. If there is a bug to be fixed, it is because some code is missing or calculating the wrong values.


So it’s code, code, code, all the way down. Data is an afterthought.


The point of taking a data structures course is lost on most people. Sure, the underlying data is abstracted away, but that’s not because it doesn’t matter. It’s the exact opposite. Data structures are a pretty complete means of decomposition. That is, you can take most large and complex programs and rephrase them as a set of data structures. Most programs are just a bunch of primitive operations happening on a fairly small set of structures like lists, trees, stacks, queues, etc. If those structural operations are pulled out and reused, the resulting code is way smaller and intrinsically has fewer bugs. That’s why Donald Knuth collected them all together in The Art of Computer Programming, and that is why they keep getting taught in classes. They are the ‘power tools’ of programming, but to get them to work you have to flip your perspective on the system.
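
To make that concrete, here is a toy Python sketch (the names and examples are mine, not from any particular codebase): one small structural piece of code, a stack, reused for two unrelated jobs, so neither job has to re-implement the structure.

class Stack:
    """A generic LIFO structure; it knows nothing about the data it holds."""
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def __bool__(self):
        return bool(self._items)


def undo_last_edit(history: Stack):
    # The same structure drives an undo feature...
    return history.pop() if history else None


def brackets_balanced(text: str) -> bool:
    # ...and a tiny parser, with no changes to the structural code.
    pairs = {')': '(', ']': '[', '}': '{'}
    open_brackets = Stack()
    for ch in text:
        if ch in '([{':
            open_brackets.push(ch)
        elif ch in pairs:
            if not open_brackets or open_brackets.pop() != pairs[ch]:
                return False
    return not open_brackets
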


Data structures aren’t the only form of this flipped approach. The general ideas around them got formalized and explicitly wired into languages as Object-Oriented programming. In non-OO languages, the programmers had to set up the structures themselves and keep any primitive operations nearby. With Objects, that became part of the language syntax. Objects bind code directly to the data not because it is fun, but so that structurally similar objects can be encapsulated and reused, and so they can point, polymorphically, to other types. It is exactly the same as basic data structures, just formalized into the language.
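
As a rough illustration of that formalization (all the class names here are hypothetical), the structural code below is bound to its structural data and can point, polymorphically, at any payload type.

class Node:
    """Structural code lives together with the structural data."""
    def __init__(self, payload):
        self.payload = payload          # any entity type works here
        self.children = []

    def add(self, child: "Node"):
        self.children.append(child)

    def walk(self):
        yield self.payload
        for child in self.children:
            yield from child.walk()


class Customer:
    def __init__(self, name): self.name = name

class Account:
    def __init__(self, number): self.number = number


# The same structural object handles either entity without modification.
root = Node(Customer("Ada"))
root.add(Node(Account("A-1001")))
for entity in root.walk():
    print(type(entity).__name__)
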


It’s also why people who write Object-Oriented code as super long lists of instructions end up getting into such a mess. Retrofitting a brute-force procedural style into objects is basically going against the grain of the language. Objects as mega-instructions clash with other such objects, which prevents reuse, makes them hard to understand collectively, and makes them prone to integration bugs. It makes the code awkward, and that keeps getting worse as the system gets larger.


While data structures are very powerful, they are just the tip of the iceberg and came of age in an era when data was really rare and hard to get. That all changed when computers became ubiquitous and networked. Now data is plentiful, everywhere, and often of extremely poor quality. Data was always important, but it's getting more important as we collect a lot more of it and want to draw wisdom out of what we have.


So the alternative perspective is to see a system just by its data, and how that data flows around.


For data to be valuable, it has to stay around for a long time, which happens when we persist it into a database. In all ways, the schema defines the possible scope and capabilities of a system. If you haven’t persisted the data, it isn’t available for usage. If you saved only some of it, or saved it in an awkward format, that will percolate upwards through all of the other functionality that uses it. If the data is available and guaranteed to be correct, then most of the features that require it are fairly simple. If it’s hard to write some code, it’s often because the incoming data is a mess.


All data starts somewhere in the real world. That may seem like a controversial statement, since data like log files originates only in response to changes in the digital realm, but if you accept that those changes come from the behavior of the underlying hardware, then it makes more sense. Besides operational data, the rest of it is entered by end-users, administrators, programmers, or third-party organizations. It starts in the real world, and really only has value in the real world. On top of this raw information, we can derive other useful relationships, but it all has to start somewhere.


What becomes important then is how the data flows from one location to another. For example, it may have started as an observation by a person. They used some interface to get it into persistence. Later it might be pulled from storage and used to augment some other data. That new data is shown in another interface or packaged together and sent to a remote location. Maybe in exchange, more data flows into the system from that remote site. 


If you just look at the data and ignore the code, threads, processes, etc., most systems are not particularly complex. They act as a hub, collecting and distributing data among a whole bunch of different sources, people, or other machines.


What’s needed then, to build a system that manages that data, is many fragments of code that can move, decorate, or translate the data as it circulates around. If you minimize the fiddling that code does as the data travels around, you’ve optimized a large portion of the system without even realizing it. The closer the data formats are in the widgets, middleware, and persistence, the less work is needed when the data is in motion. Moving data is always expensive.
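
A small sketch of what that can look like, with entirely made-up layer and field names: the record keeps one shape from persistence to display, and the surrounding code only moves, decorates, or translates it at the very edges.

from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class Quote:               # one shape, shared by persistence, middleware, UI
    symbol: str
    price: float
    as_of: date

def load_quote(row: dict) -> Quote:
    # 'move': persistence -> domain, no restructuring
    return Quote(**row)

def decorate(quote: Quote, source: str) -> dict:
    # 'decorate': add context without rebuilding the record
    return {**asdict(quote), "source": source}

def to_widget(record: dict) -> dict:
    # 'translate' only at the edge, where the display needs it
    return {**record, "price": f"{record['price']:.2f}"}

row = {"symbol": "ABC", "price": 12.5, "as_of": date(2020, 8, 2)}
print(to_widget(decorate(load_quote(row), source="exchange-feed")))
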


That perspective puts data first. It is more important than the code, and the idea is to minimize what the code is doing to it whenever possible. There still might be memoization or other macro optimizations that are possible, but those can be seen as refinements.


What becomes important then is keeping dependent pieces of data together and managing the structural relationships between these ‘entities’. This brings us right back to data structures. They deal with the relationships between these larger chunks and can be built to have indirect references to the underlying entities. Why? Because the structural relationships are smaller and more common. If you get them perfected for one type of entity, you can reuse them for all types of entities. That then just shifts the problem down to picking a good set of entities and relationships that best fit the data as it originated in the real world, or basically ‘modeling’.
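
Here is a minimal sketch of those indirect references (the ids and entity kinds are invented for illustration): the relationship structure only ever sees opaque ids, so once it works for one kind of entity it works for all of them.

class Relationships:
    """A parent -> children mapping over opaque entity ids."""
    def __init__(self):
        self._children = {}

    def attach(self, parent_id, child_id):
        self._children.setdefault(parent_id, []).append(child_id)

    def children_of(self, parent_id):
        return self._children.get(parent_id, [])


entities = {
    "cust-1": {"kind": "customer", "name": "Ada"},
    "acct-7": {"kind": "account", "number": "A-1001"},
    "dept-3": {"kind": "department", "name": "Research"},
    "emp-9":  {"kind": "employee", "name": "Grace"},
}

rels = Relationships()
rels.attach("cust-1", "acct-7")   # customers own accounts...
rels.attach("dept-3", "emp-9")    # ...and the same structure links staff

for child_id in rels.children_of("cust-1"):
    print(entities[child_id])
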


Now, the data perspective doesn’t magically eliminate performance problems. Systems aren’t intrinsically scalable; handling huge loads has to be explicitly engineered into the system in order for it to work correctly. But seeing the system as flowing data does make it a whole lot easier to lay out an industrial-scale architecture.


Take caching, for example. From a code perspective, programmers often just allocate a chunk of memory, set it up as a hash table, then do some type of lookup first to get the value. That seems to make caching easy, but eventually the real problems rear their ugly heads. If you see caching as a form of memoization, where we keep a smaller pool of data closer to the action, then the obvious questions become how to decide what’s in that pool and what isn’t. Caching stale data is bad, and letting the cache grow to be the same size as the persistence is rather pointless. On top of that, if the data may be in two places at the same time, what happens when you need to update it? The code perspective of caching forces programmers to think about memory, while the data perspective forces them to think about the quality of the data. It makes it easier to see the whole picture, which makes it easier to get the implementation correct. Once you realize that removing stuff from the cache and keeping it in sync during writes are the real problems, figuring out what code and architecture are needed to make this happen is a whole lot easier.
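
One way to sketch that data-first view of caching, with a plain dict standing in for the real persistence: a small bounded pool, an eviction rule for what stays in it, and write-through so the two copies never drift apart.

from collections import OrderedDict

class WriteThroughCache:
    def __init__(self, store: dict, capacity: int = 128):
        self._store = store              # the 'real' persistence
        self._pool = OrderedDict()       # the small pool near the action
        self._capacity = capacity

    def read(self, key):
        if key in self._pool:
            self._pool.move_to_end(key)  # recently used data stays in the pool
            return self._pool[key]
        value = self._store[key]         # miss: go back to persistence
        self._remember(key, value)
        return value

    def write(self, key, value):
        self._store[key] = value         # update persistence first...
        self._remember(key, value)       # ...then keep the pool in sync

    def _remember(self, key, value):
        self._pool[key] = value
        self._pool.move_to_end(key)
        if len(self._pool) > self._capacity:
            self._pool.popitem(last=False)   # evict the least recently used
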


The same is true for putting queues between systems to resolve speed differences, and for syncing external information into read-only internal copies. Pretty much all of the big enterprise issues get a whole lot less challenging. The difficult problems shift away from some tangled mess of code to keeping the whole thing organized and operating correctly.
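
For the queue case, a tiny standard-library sketch (the producer and consumer here are just stand-ins): a bounded queue absorbs the speed difference so neither side has to know about the other’s pace.

import queue, threading, time

buffer = queue.Queue(maxsize=100)   # the bound is the back-pressure

def fast_producer():
    for i in range(10):
        buffer.put({"event": i})     # blocks only if the consumer falls far behind
    buffer.put(None)                 # sentinel: no more events

def slow_consumer():
    while (event := buffer.get()) is not None:
        time.sleep(0.1)              # pretend this is a slow downstream system
        print("processed", event)

threading.Thread(target=fast_producer).start()
slow_consumer()
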


It applies to the really big systems as well. It’s hard to figure out how to massively parallelize systems until you realize that it is just an issue of dependencies. Where data is independent, it can be safely split across threads, processes, or machines. If it’s not independent, then there will be problems. So, instead of being a complex coding problem, it is really a complex data modeling one. What underlying model do we need that both approximates the real-world issues and guarantees some usable independence that can be parallelized? If you structure the data correctly, the rest is just spending the time to write out the code.
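
A rough sketch of that modeling step, with made-up data: once the partitions are genuinely independent (no partition needs another’s results), the parallel code itself is almost mechanical.

from concurrent.futures import ProcessPoolExecutor

def summarize(partition):
    # works only on its own slice of the data; no shared state to fight over
    return sum(record["amount"] for record in partition)

def parallel_total(partitions):
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(summarize, partitions))

if __name__ == "__main__":
    partitions = [
        [{"amount": 10}, {"amount": 5}],   # e.g. one region's records
        [{"amount": 7}],                   # another region's, fully independent
    ]
    print(parallel_total(partitions))      # 22
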


As an industry, we usually start by teaching new programmers to code. That’s okay, but we often fail to teach them how to flip to this dual perspective. Instead, we leave them dangling in the wind, trying to crank out code from what is clearly a much harder perspective. Sure, some people get data structures courses or even computing theory, but then they go out into the industry and none of that sticks. To make it worse, people tag it as ‘abstraction’ and ‘overcomplexity’ and keep insisting that writing more overly simple code, faster, is somehow going to make it better. It has even percolated back into our bad interview styles.


It permeates everything, making most serious programming tasks a whole lot harder than they need to be. If you’ve ever suspected that there was an ‘easier’ way to build systems, then you were right. If you forget about the code and focus on the data, getting it set up correctly, then most of the coding is easy and straightforward.
