Sunday, June 21, 2020

Defensive Coding

Over my career, I have joined a number of ongoing software development projects that were having extreme difficulties.


Most of these troubled projects shared very similar symptoms.


On joining, the first, most obvious problem was that the developers were kept really busy with operational problems. There were lots of known bugs, but generally, the number of unknown or unrecorded bugs was way, way higher. So, a lot or most of the effort was going into trying to put bandaids on existing issues, without really trying to understand or solve them.


That sense of stress, lack of time, and anxiety was affecting not only how they were dealing with operational problems but also any new code additions. Since the environment was chaotic with frequent interruptions, the new stuff was also not receiving much cognitive focus. If they weren’t just hacking away at bugs, they were just blindly writing new code and hoping that it might work correctly.


Another interesting trait is that they generally blamed management for their problems. They felt like they were being forced to move too fast, and that the focus was too short-term.


I’ve seen this in both small and large companies, with huge projects and medium-sized ones. It basically spans all tech stacks and has pretty much been common over the 30 years I have been working.


What most of the developers wanted, when I arrived, was a magic bullet that would make all of their problems go away and suddenly put the project back under control. They usually figured there was something that management could change. What turned out to be true most times was that the core issues at the heart of the problem were the habits of the developers themselves. That is, they were hoping it was a management problem, and at times it was aggravated by high-level personality issues, but most often it turned out that it was the way the developers were working. Their code was a mess and getting way worse as it was growing larger, and that was throwing everything else out of whack. Often they were using the language, the framework, or other libraries incorrectly. They generally didn’t leverage any of their tools either. Sometimes there were critical things they needed to know but didn’t. They were pounding away furiously at the keyboard, often doing a lot of work, sometimes even consistently, but it was the work itself that was causing most of their problems.


It’s a really hard problem to fix when you have one or more programmers who are often over-confident but still struggling with getting something built. It makes it harder when management believes that too, but for the wrong reasons, like that the work itself is trivial. That compounds the rift between developers and management, particularly if there are actually some complicated parts to solving the problem. Neither side wants to admit any faults, and neither side wants to change.


The fix for this type of problem is always that the quality of the code needs to be better. Way better. But not just by adding something like a few piddly performance optimizations sporadically around the code or switching to some other alternative technology. It all needs to be better. It needs to be organized, disciplined, and every little part of it thought through. The developers need to change their habits, get tighter about their work, and keep the code as tidy as they can, given any time constraints. Then they need to spend more time thinking deeply about how they will add new features by extending the code, and less time at their keyboard experimenting or trying to route around code that they don’t want to spend time understanding. 


It’s not unlike a driver’s training course, where it’s not that hard to learn to drive a car, but it is difficult to be able to drive one safely in most major cities around the world. There are two skillsets there, not one. The first is to know how to do things with the car, the second is to know how to move around the environment properly without crashing into things. Programming is the same. The coding part of programming is just one piece of the puzzle, but knowing how not to end up with ‘bad’ code in a production environment that triggers these types of vicious cycles is a completely different skill.


So, the very first, high-level understanding of Defensive Coding is that not all code is created equal. 


There is ‘good’ code that you want to be running in production because it is stable and trustworthy and really solves the problems. Then all of the other code is essentially ‘bad’ code. It has one or more serious problems that affect either its operation (short-term) or its ability to be extended and grown with the rest of the system (long-term).


In modern software development, people often expect the work to be way less than what is actually involved. So, there is always a lot of pressure to find short-cuts, to reduce estimates, etc. In order to keep financing for any large software project, programmers need to occasionally bend to these types of pressure, to get something out quickly. So, it would be unreasonable to say that a production system will never have ‘bad’ code running, but it is actually reasonable to expect that the programmers who built the system were aware of what code is ‘good’ and what code is ‘bad’. And as a matter of their work habits, when they are forced to quickly put ‘bad’ code into production, they also allocate the time and effort necessary to replace it later with ‘good’ code. And then replace it. That is, if it’s written quickly, it will probably need to be replaced soon. It’s not until you’ve gone back over the existing code base a few hundred times, rechecking it and cleaning up little issues, that you are certain that the code is industrial-grade and really ready for a production system.


The converse is also true. There is a huge coding problem if the programmers can’t distinguish between ‘good’ and ‘bad’ code or they think that everything in production should be there and can be ignored. If they aren’t willing to revisit earlier work, then any of its bad issues will percolate outwards and destabilize everything else built on top.


Which gets us to the second major issue. Code is always stacked on top of other code, whether it was written by the current programmers or other people. If the lower levels of the code were written badly, then the upper levels are tainted. Nothing in a large system is truly independent; there are always interdependencies, whether they are explicit or implicit. Worse, they sometimes form when nobody is watching, so components may start out as independent, but as time progresses that will change.


The simplest way to think about this is that the system is only as good as its worst code. Technically that’s not entirely true, in that there may be onion-layered sub-systems that are awful and that support unused features, but even those infrequent ‘entry-points’ into the system represent significant risks that a user may accidentally trigger them someday, and that may cascade back into the rest of the functionality. So, for safety reasons, they need to be at least disabled, and for consistency reasons, marked as unusable. But it’s probably best that they are just removed, with the only traces of them left in deleted files in the repo.
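As a rough illustration of "disabled and marked as unusable", here is a minimal Python sketch. Everything in it is hypothetical: the `disabled` decorator, the `DisabledEntryPointError` exception, and the `legacy_export` function are made-up names, not part of any real system. It only shows one way an unused entry-point could be made to fail loudly instead of running untrusted code, as a stopgap until it is actually deleted.

```python
import functools


class DisabledEntryPointError(RuntimeError):
    """Raised when someone calls an entry-point that has been switched off."""


def disabled(reason):
    """Hypothetical decorator: mark a function as unusable and block calls to it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Fail loudly rather than letting the old code path run.
            raise DisabledEntryPointError(f"{func.__name__} is disabled: {reason}")
        # Leave a visible marker so tools and reviewers can find disabled code.
        wrapper.disabled_reason = reason
        return wrapper
    return decorator


@disabled("unused legacy feature, scheduled for removal")
def legacy_export(records):
    # Old, untrusted code that should no longer be reachable.
    return [str(r) for r in records]


if __name__ == "__main__":
    try:
        legacy_export([1, 2, 3])
    except DisabledEntryPointError as err:
        print(err)
```

The marker attribute makes it easy to list everything that is switched off, which helps when it comes time to remove the code for good.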


That gets us back to talking about the habit of cleaning up stuff. Really nothing in the system should be ‘off-limits’ and it is a recurring and constant task to go back and clean up the code. Oddly, even if it is not causation, there is usually a correlation between a large bug count and a lot of old dead code, invalid comments, lack of consistency, stray configurations, etc., probably because of bad habits. The two aspects always seem to come together, mostly because when the primary concern is not ‘good’ code, anything will get into production. Eventually, that will build up enough to trigger a downwards cycle.


Indirectly that leads us to a somewhat incalculable metric. If you were to get a full and honest account of all of the technical, domain, interface, and operational issues that ‘bug’ people in the current system, all of them, then for many systems that would be thousands of bugs. We could then divide the overall number of lines of code and coding comments currently written by that count. So, for easy example numbers, we might have 1500 bugs coming out of 40k lines of code. Although it’s not evenly distributed, that essentially gives us one issue for every 26.7 lines of code. That sets a rough quality marker for the work done and gives someone an idea of how frequently issues are getting built into the system. Way back, one of the developers I used to work with felt that about 1 in 200 was expectable, but that at least 4 out of 10 of those should be caught by the developer, and probably another 4 out of 10 should be caught by basic QA. That leaves 2 out of 10, or 1 bug leaking into an ‘average’ production system for every 1000 lines written. So, if we go back to 26.7, that is a huge number of bugs really, or basically poor quality. And the closer we get to 0 (which I know is asymptotic), the more we can say that the code is lower quality. This metric is incalculable though, because very few projects are actually diligent enough to really get a full list of bugs. More often, they don’t have explicit specifications or get full feedback from the users or the operations department, so their sense of problems is grossly underrepresented. They count only known, obvious, technical bugs, ignoring the other 3 categories.
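The arithmetic above is simple enough to sketch in a few lines of Python. The function names here are made up for illustration, and the second one assumes that both "4 out of 10" figures are fractions of the bugs originally written (so 2 out of 10 leak through), which is the reading that matches the 1-in-1000 figure.

```python
def lines_per_issue(total_lines, total_issues):
    """Rough quality marker: how many lines of code exist per known issue."""
    if total_issues <= 0:
        raise ValueError("need at least one recorded issue to compute the ratio")
    return total_lines / total_issues


def lines_per_leaked_bug(lines_per_bug_written, caught_by_dev_per_10, caught_by_qa_per_10):
    """Lines written per bug that escapes both the developer and basic QA.

    Assumption: both catch rates are out of 10 of the bugs originally written.
    """
    leaked_per_10 = 10 - caught_by_dev_per_10 - caught_by_qa_per_10
    if leaked_per_10 <= 0:
        raise ValueError("catch rates leave no leaked bugs to measure")
    return lines_per_bug_written * 10 / leaked_per_10


if __name__ == "__main__":
    # Example numbers from the post: 1500 issues in 40k lines.
    print(round(lines_per_issue(40_000, 1_500), 1))   # 26.7
    # 1 bug per 200 lines written, 4/10 caught by the developer, 4/10 by QA.
    print(lines_per_leaked_bug(200, 4, 4))            # 1000.0
```

The hard part is not the division, it is getting an honest value for the issue count in the first place.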


There is a lot more I can say about Defensive Coding, but this is getting too long for a blog post. Indirectly I’ve covered this a lot in past posts, but I think it’s best to try and focus it on a specific mindset going forward. Programmers don’t want their jobs to be hard or painful, but in order to avoid that fate, they have to choose good working habits that keep their ‘workspaces’ tidy and organized. That’s true of any profession, but the digital realm allows us to hide a huge mess and ignore it more easily, so we have to work harder to ensure that we are not shooting ourselves in the foot by focusing on the wrong issues. And much like driving a car, we have to realize that while some accidents are unavoidable, there are lots of things we can do to keep us out of obvious trouble, most of the time. Good programming requires discipline, even if that sometimes makes parts of the job boring. We shouldn’t focus on making ‘coding’ fun or fast; instead, we need to focus on what we build. Real satisfaction from building stuff comes afterward, when it is ‘good’, when it is used, and when it stays around a long, long time. Those basic principles can help us reframe our perspectives so that with our limited time we can get the right things done, at the right time, to keep everyone (mostly) happy and the project moving forward smoothly.
