Saturday, February 6, 2021

Theory and Practice

In many discussions, I’ve referenced the need to know some ‘theory’ when writing specific types of code. What I’ve seen is that people who bypass acquiring this knowledge end up writing code that just won’t ever work correctly. That is, there are some very deep, rather counter-intuitive problems buried at the heart of some discrete computational problems, and approaching them with clever ideas alone is not enough.

If you are okay with your stuff behaving the way software did 30 to 50 years ago, then you can attempt to reinvent these wheels, but if you need to get it out there, working as well as it can, then you should not ignore the underlying theories.

The first example is parsing. People have long noticed that string operations like ‘string split’ are pretty powerful, and they do some of what is needed to parse incoming text. So if you have a buffer of “x = y” statements, each on its own line, then you can split the lines and squeeze the whole thing into a dictionary. Easy, peasy.
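As a rough sketch of that approach (the function name and the input format here are just illustrative), it might look something like this in Python:

def parse_assignments(buffer):
    # Split each "x = y" line on the first '=' and store the pair.
    values = {}
    for line in buffer.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

print(parse_assignments("x = 1\ny = hello"))   # {'x': '1', 'y': 'hello'}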

It starts to fall apart when bits of the parsing become ‘modal’. The classic example is a quoted string in a .csv file that contains a comma, which is also the field separator for the file. The right way to handle this is to have two states: one that is normal and one that is ‘in quotes’. While ‘in quotes’ you treat the comma as a normal character. Otherwise, you treat it as a separator.

And this is where things get messy. The comma character has two distinct meanings. It’s either just a normal character in a string, or it is a special character used to act as a separator between fields. If you don’t keep track of the state, the meaning is ‘ambiguous’. You just don’t know. Split might handle the work of breaking up the string into pieces, known as tokenizing, but it provides no mechanics for tracking state.
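A rough sketch of that two-state idea is below. Real .csv handling is messier than this, with escaped quotes and embedded newlines, and Python’s own csv module already does it properly, but it shows the state being tracked:

def split_csv_line(line):
    fields = []
    current = []
    in_quotes = False                       # the state we have to track
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes       # flip between the two states
        elif ch == ',' and not in_quotes:
            fields.append(''.join(current)) # the comma as a separator
            current = []
        else:
            current.append(ch)              # the comma (or anything else) as a normal character
    fields.append(''.join(current))
    return fields

print(split_csv_line('name,"Smith, John",42'))   # ['name', 'Smith, John', '42']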

The syntax of a complex programming language has tonnes of these state-based issues. They are everywhere, from the operators we use to the way we handle blocks in the code. It’s a state-fest. This occurs because the bandwidth of our characters is somewhat smaller than the bandwidth of the things we want to do, so we pack our mechanics tightly into the symbols that we have available, thus causing lots of ambiguity. Language theory classifies grammars by how much lookahead a parser needs to sort this out, LR(1) being the classic case of a single token, but the underlying issue is that we often hit a token that is ambiguous at that point, so we try to process it one way, realize that is wrong, and then have to back up and do it some other way. If the language is really gnarly, we might have to back up a few times before we can resolve the ambiguity.

Without the ability to back up, basically anywhere at any time, the code will continue to hit ambiguities and do the wrong thing. So the underlying problem is that if you try to hard-code the mechanics, there will continue to be circumstances where that static coding won’t do what is expected. These special cases will keep occurring indefinitely. In the beginning you fix a lot of them; it slows down as you get way more code and the unobserved variance of the inputs shrinks, but it will never, ever go away. The code is incomplete because it can’t adapt dynamically to the input, so it is weak by construction. The mechanics can’t be static; they have to be dynamic enough to bend to any ambiguity they find anywhere in the text.
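A toy illustration of that backing up, with a grammar invented purely for the example: the parser tries one reading from a saved position, and if it fails partway through, it rewinds and tries the next.

def parse_alternatives(tokens, pos, alternatives):
    # Try each alternative reading from the same starting position;
    # 'backing up' is just resetting to pos before the next attempt.
    for alt in alternatives:
        if all(pos + i < len(tokens) and tokens[pos + i] == tok
               for i, tok in enumerate(alt)):
            return alt, pos + len(alt)     # this reading worked
    return None, pos                       # every reading failed

tokens = ['a', 'c', ';']
print(parse_alternatives(tokens, 0, [['a', 'b'], ['a', 'c']]))
# (['a', 'c'], 2) -- the 'a b' reading fails, so we back up and try 'a c'

Real parsers interleave this with recursion, so the rewinding can happen at any depth, but the principle is the same.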

Ambiguities are actually found all over the space of discrete systems. They are essentially just missing information, but missing in a way that sometimes even the physical universe itself can’t resolve. The most classic one is TGP, the Two Generals Problem. It’s at the very heart of all distributed processing.

Basically, if you have more than one ‘serialized computation’, sometimes described as Turing Machines, the communication between them has an intrinsic difficulty. If one ‘machine’ sends a message to another ‘machine’ and after some time there is no response, what happened is ambiguous. The sending of the message may have failed, or the sending of the reply may have failed, or some code in between may have failed. You can’t ever be sure which side of the communication was dropped. You can’t be sure if the work was done or not.

If the message was just asking for some information, basically a ‘read’, then when the first machine gets impatient it can simply ask again; repeating a read doesn’t change anything.

If the message was asking the second machine to change a value, then it suddenly gets funny. If the message never made it, the second machine obviously didn’t make the change. If the message made it but the reply didn’t come back, the second machine may have actually made the change. The lack of a reply tells you nothing about the state of the change.
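A toy simulation of that ambiguity, with everything here invented for illustration: from the sender’s side, a timeout looks identical whether the request was lost or only the reply was.

import random

def unreliable_change_request(server_state, key, value):
    if random.random() < 0.2:
        return None                  # request lost: no change was made, no reply
    server_state[key] = value        # the change actually happens here
    if random.random() < 0.2:
        return None                  # reply lost: the change was made, still no reply
    return "ok"

state = {}
reply = unreliable_change_request(state, "balance", 100)
if reply is None:
    # We can't tell which case we're in; 'balance' may or may not be 100.
    print("timed out: the state of the change is unknown")
else:
    print("confirmed:", state)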

If you don’t care whether the change has been made or not, it’s not much of a problem, but in a lot of cases we build up huge amounts of ‘related’ work that is distributed over a bunch of different computers. We wrap these in a ‘transaction’ so that we get the property that either all of the changes worked, or all of the changes failed. That’s critical in that we need to inform the user that their request, which was ‘singular’ to them, was either fully and properly processed, or there was some problem so nothing was done. We refer to this as ‘transactional integrity’. It’s one transaction to the user, but it may span a lot of different machines, databases, etc. A trivial example is transferring money between two systems. The amount must be decremented from one database if and only if the amount is incremented in the other. The transfer either works, or it does not. There can be no other intermediate states, or things will get ugly.

So, we have this problem of coordinating a bunch of different machines to either all do the work, or to all fail. But if one of them is late in replying, we don’t know whether that work was done or not. That ambiguity should stop the other machines from continuing until it has been correctly resolved. Worse, there may be lots of other transactions piling up behind this one, and we don’t know whether they are independent of this change or rely on it for their changes. Say, transferring the new balance in one of those earlier accounts to a third system.

If we built this type of computation to be entirely reliable, then every time one of the communications got dropped the entire system would grind to a halt until that message was recovered or resent. That, of course, would perform incredibly badly, as there are lots of reasons in the physical world that communications can be interrupted.

In practice, what we used to do was minimize the window of failure to be as small as possible. That is, a transaction is sent out to all participants to ‘stage’ the work. Once they all confirm that the work is staged and the remaining step is tiny, a second round of messages is sent out to turn it all live, to ‘commit’ it. If the work is successfully staged, it is treated as if it will go through. If that first part of the conversation fails, it is all rolled back and an error is propagated upwards to let the initiator know that it didn’t work.
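A bare-bones sketch of that staging/commit conversation, which is essentially two-phase commit. The class and method names are invented, and a real coordinator would also have to log its decisions and survive crashes of its own:

class Participant:
    def __init__(self, name):
        self.name = name
        self.staged = None
        self.committed = None

    def stage(self, change):           # phase one: do all the risky work up front
        self.staged = change
        return True                    # a real participant could vote 'no' here

    def commit(self):                  # phase two: the tiny final step
        self.committed = self.staged

    def rollback(self):
        self.staged = None

def run_transaction(participants, change):
    if all(p.stage(change) for p in participants):
        for p in participants:
            p.commit()
        return "confirmed"             # everything really was updated
    for p in participants:
        p.rollback()
    return "error"                     # nothing was done

accounts = [Participant("bank-a"), Participant("bank-b")]
print(run_transaction(accounts, "transfer 100"))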

If you get this right, the behavior of the system overall is nicely deterministic. When you see a confirmation, it means everything really was updated. When you see an error, it means that it was not. When you get this wrong, neither a confirmation nor an error has an exact meaning. It could mean it mostly worked, or it could mean that it kinda failed. You really can’t trust the system at that point.

In modern practice, the whole notion of transactional integrity is often ignored. All of the subsystems just get the requests and process them, or they error out. Everything is handled individually. The problem is amplified by the modern habit of adding more and more distributed systems. When most software was on the same hardware, the likelihood of problems was really tiny. When it is distributed across a lot of machines, the likelihood of problems is quite significant. It happens all of the time. If the data is spread all over the place, and there are a lot of mindless workers changing it without any integrity, the quality of the data as a whole is going to be really, really bad. Constant problems will cause gaps, overlaps, and contradictions. Mostly it will go unnoticed, but every once in a while someone will do the work of reconciling their different data sources and find out that collectively it is all a mess.

As if we needed more problems, we used to architect large-scale systems with transaction management technology, but since those tools were slow and programmers felt they were difficult to use and somewhat pedantic, we switched over to stateless API calls. No transactional integrity at all, on purpose. That might have been okay if we partitioned the systems based on data dependencies, but then along came ‘microservices’, where we break it all up even further into even more little, dependent parts. Rather obviously, that is a recipe for disaster. The finer-grained we go, and the more we distribute across more hardware while ignoring the necessity of transactions, the more we degrade the quality of the data we collect. This occurs as a natural result of tunnel-visioning on small performance or organizational problems while ignoring larger cross-cutting concerns like transactions. It’s not easy to cope with physical limitations, but it certainly isn’t advisable to just ignore them.

But it’s not just persistence that gets into trouble. Even though they are way faster, computers are still too slow. That oddly comes from their popularity. As we use them for more processing, we expect them to process way more stuff. The size of the stuff we are trying to handle these days has grown faster than the hardware has improved. That is, right now, the workload for software is outpacing Moore’s Law.

To get around that, we don’t just want faster hardware, we want to involve more hardware as well. In the early days of computing, when we parallelized stuff, we were very careful to sandbox it as well. So one computation, as it ran, could not be affected by any other running computation. We crafted a ‘process’ abstraction so that we could fit a bunch of these together on the same hardware while they maintained their ‘safety’. Later, when programmers decided that safety was just too awkward, they came up with ‘threads’. These are independent computations that can cause all sorts of havoc with each other. Basically, they are fighting for dominance within the hardware, and if they bang into each other, the results of the collisions can be rather messy.
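A small demonstration of that banging into each other: two threads increment a shared counter with no synchronization, and updates get lost. How often the loss shows up depends on the interpreter and the machine, which is exactly what makes these bugs so slippery.

import threading

counter = 0

def work():
    global counter
    for _ in range(100_000):
        counter += 1       # read-modify-write, and it is not atomic

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)             # can come out below 200000 when increments collide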

Realizing that multi-threaded code was a problem, we enhanced our knowledge of fine-grained synchronization. That is, two threads can share a larger piece of data like a string if they both nicely agree to use the same mechanics to synchronize their operations on it. The primitives to drive synchronization have been around since the early days, as they were needed for foundational technology like operating systems, networks, and databases. What changed was that we pushed their necessity upwards, higher and higher in the technical stack, so that the programmers writing code on top now also needed to understand how they worked. It’s a form of unencapsulating stuff, where we solved a problem, made it nicely usable, then went out and unsolved it again.
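The same counter as before, but with both threads agreeing to take the same lock around the shared update; a minimal sketch of the kind of fine-grained synchronization being described:

import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100_000):
        with lock:         # both threads must honor the same lock
            counter += 1

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)             # always 200000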

It’s not the only place where we handed people mechanics that were highly likely to let them injure themselves. Colloquially we call it “giving them enough rope”, and we’ve seen it over and over again in other areas like memory management, pointers, type management, and variable arguments.

It’s not enough to just avoid data collisions in the code; you also have to worry about starvation, race conditions, deadlocks, livelocks, etc. Synchronizing two dependent computations is non-trivial. But getting it wrong still lets the code mostly work. It works a lot, then strangely errors out, then magically works again. If the collisions are infrequent, testing mostly won’t notice them, and the users just put up with the occasional, unexplained problems.
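Deadlock, for instance, needs nothing more exotic than two threads taking the same two locks in opposite orders; each ends up waiting forever for the lock the other one holds. The threads are left unstarted in this sketch because running them would hang by design.

import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def first():
    with lock_a:
        time.sleep(0.1)    # give the other thread time to grab lock_b
        with lock_b:
            pass

def second():
    with lock_b:
        time.sleep(0.1)
        with lock_a:
            pass

# Starting both of these would almost certainly deadlock:
# threading.Thread(target=first).start()
# threading.Thread(target=second).start()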

There are many other known and unknown problems that quickly flow into theoretical issues, like caching, algorithmic growth, or even error handling to some degree. On the surface they seem easy, but underneath it’s incredibly hard to get them done correctly, or at least close to state-of-the-art knowledge.

Most code, however, is pretty trivial, and people often use brute force to make it even more trivial. Get some data from a database, throw it at the user, have them make some edits, and then save it again. Ask another system for an update, store it away forever. Spew out a few million slightly customized reports. Most of this type of programming needs a really good understanding of the way the data should be modeled, but the technical side is quite routine. The trick isn’t coding it; it is keeping it all organized as it becomes large and hard to handle. Keeping millions of little pieces neatly arranged is a somewhat different problem from getting one little tricky piece to behave as expected. What we need to do better is to keep the tricky bits encapsulated, so that they aren’t wrong so often.
