Modern software developers have access to powerful tools and services that allow them to quickly develop, demo and deploy fully functional applications. But what happens when the single-user prototype satisfies all required functionality, but the user says the system seems slow? Or if the initial implementation works fine for one user, but multiple users start to experience delays? Or if users are happy, but the cost of auto-scaling grows prohibitive?
The Problem with Performance Engineering
Concern over non-functional aspects of computing systems—response times, resource utilization and costs—falls under the domain of computer systems performance engineering (CSPE). Unfortunately, the practice of CSPE has evolved into something more akin to art than engineering—with few standardized principles and practices. Because software is constantly evolving its languages and abstractions, CSPE practitioners appear obligated to evolve and change their abstractions and languages, too.1 As a result, part of our jobs as performance engineers is to understand what the heck other performance engineers really intend or mean.
There are many reasons for this state of affairs, but ultimately this lack of a common language and engineering standards for CSPE is a deficiency in the education and training of software engineers. There is very little in the typical university software engineering curriculum that prepares software engineers for doing CSPE. This sometimes leads to practices that are not backed by scientific or engineering principles. As a result, practitioners are forced to “wing it”—creating their own art or hitching their wagons to someone else’s star and learning (or not) through trial and error.
The PMWG – A POSIX for Performance?

The Performance Management Working Group (PMWG) was started in the late 1980s during the “UNIX wars,” when computer vendors were feuding over claims to the UNIX leadership mantle and had formed two factions: UNIX International and the Open Software Foundation. The group was unique in that its members included computer systems vendors on both sides of the “war.” Its members put aside those differences to try to come up with performance management standards and practices that would put UNIX on par (at least) with the performance management tools found on mainframes.

In retrospect, the PMWG was probably the best chance UNIX had for establishing, promoting and enshrining some basic performance engineering principles—a kind of POSIX for performance. Unfortunately, the group failed—in large part, in my opinion, because it put too much priority on the latter stages of the performance management data pipeline and did not spend enough effort toward establishing key primitives.

Today, in the absence of established rules and practices for performance engineering, history is repeating itself. Current efforts in observability continue to focus mainly on data presentation and data transport and storage, without much focus on the right metrics and data quality based on basic performance engineering principles. Presentation, transport and storage are important, but the efforts around metrics, logging and tracing coverage and data quality are sporadic at best.
Inertia, Expediency and Superstition
Why are more formal, engineering-based practices around CSPE needed? Consider the use of CPU utilization as a key performance indicator (KPI) of system performance. I’m a member of Bloomberg’s Trading Solutions SRE team, which manages hundreds of machines on which the company’s Trading Solutions software runs. We often see or hear statements like “This host’s CPU is too busy,” or “That host is out of CPU resources.” I’ve found that these observations are not actually very helpful and sometimes divert our attention from the real problem(s). When misused as a system KPI, CPU utilization can lead us down the wrong path and cause us to draw incorrect conclusions. To understand why, let’s first look at how we assess the performance of real-world, non-computing systems.
Take fast-food restaurants. One of the most important attributes of fast-food restaurants (besides their health inspection rating) is that they are fast. But we’ve all been to fast-food restaurants that are anything but. Without actually standing in line and waiting, we can often tell that we will have a slow experience at a fast-food joint if there is a long line. The line length gives us a lot of information. Our assessment of the situation differs depending on whether there are 100 customers, 10 customers or a single customer waiting in line.
In the real world, this line length (or queue length) assessment is almost universal. Supermarket checkouts, ATMs, airport security screening, toll booths, elevators and COVID-19 testing are all areas where we apply this semi-conscious assessment. In contrast, we almost never use “cashier utilization” or “self-serve soda fountain utilization” as a way of assessing how slow a fast-food restaurant is. We always say “the lines are long” or, if we actually choose to wait in line, we’ll make the more direct statement that “service is slow.”
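This everyday line-length assessment has a formal counterpart in operations research, where Little’s Law relates queue length, arrival rate and wait time. As a minimal back-of-the-envelope sketch (the service rate below is an invented illustrative number, not something from this article), here is why 100 customers versus 10 versus 1 tells us so much:

```python
def expected_wait(queue_length, service_rate_per_min):
    """Rough expected wait for the last person in line: queue_length
    customers ahead, one server handling service_rate_per_min customers
    per minute. This is the simple single-server intuition, not a full
    stochastic queueing model."""
    return queue_length / service_rate_per_min

# Illustrative: one cashier serving 2 customers per minute.
for n in (1, 10, 100):
    print(f"{n:3d} in line -> ~{expected_wait(n, 2.0):.1f} min wait")
```

The point of the sketch is that queue length translates directly into expected waiting time, which is exactly the quantity the customer cares about; a utilization figure for the cashier does not.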
Which brings us back to our computing systems. Why do we reflexively look to CPU utilization when computing systems are slow while, in the physical world, we rely on queue lengths? The question is even more relevant when we consider that the concept of CPU is less well-defined today. Are we measuring the CPU, its cores or hardware threads?2 I believe the contributing factors to this misuse include inertia, expediency and superstition.
In the early days of computing, CPUs were perhaps the most expensive component of the large monolithic computer systems of the time. Measuring how much these expensive components were being used was an important part of maximizing the financial investment in these large machines. Fortunately, measures of CPU utilization were relatively easy to estimate. Ease of implementation meant that CPU utilization measures were almost universally available.
The universal availability of this simple metric, combined with some correlation (sometimes weak) between high CPU utilization and system slowness made it a default indicator of systems performance. When a belief becomes ingrained to the point of superstition, even weak correlations are enough to validate and perpetuate it. Better metrics—for example, run queue lengths at the CPU(s)3—can help us shed our superstitions to better understand our systems and lead us in the right direction in identifying and correcting problems.
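On Linux, one readily available run-queue indicator is the `procs_running` line in `/proc/stat` (the count of currently runnable tasks; `/proc/loadavg` exposes a related runnable/total field). As a minimal sketch of pulling that number out, the parser below runs against a sample string so it is self-contained; on a real Linux host you would read `/proc/stat` itself and sample it over time:

```python
def parse_procs_running(stat_text):
    """Extract the number of currently runnable tasks from
    /proc/stat-style text (the 'procs_running N' line)."""
    for line in stat_text.splitlines():
        if line.startswith("procs_running"):
            return int(line.split()[1])
    return None

# Sample /proc/stat excerpt (illustrative values, not from a real host).
sample = """cpu  2255 34 2290 22625563 6290 127 456
procs_running 7
procs_blocked 1"""

print(parse_procs_running(sample))  # -> 7
```

Sampled regularly and compared against the number of available hardware threads, a value like this speaks to how many tasks are waiting for CPU, which is much closer to the “line length” assessment we use in the physical world than a utilization percentage.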
Dispelling Superstition and Other Irrational Practices
Which brings us to the title and purpose of this series. As a title, “Of Max and Min” is meant to evoke John Steinbeck’s classic novella “Of Mice and Men,” which in turn got its title from the famous line in Robert Burns’ poem “To a Mouse”: “The best-laid plans of mice and men oft go awry.” The two main characters in “Of Mice and Men” fail in their efforts to better their lives during the Great Depression because they are tragically ill-equipped to do so. Similarly, we should not be surprised if plans go awry when software engineers are tasked with performance engineering without proper training in its principles.
“Max and Min” also highlights the importance of data analysis and statistics in understanding systems behavior. Effective CSPE requires us to be familiar with concepts in statistics, numerical analysis, and even operations research—in addition to the more standard computer science areas of algorithmic complexity and computer architecture.
With “Of Max and Min,” in addition to calling out superstitions, I will be pointing out some common mistakes we make and blind spots that we have.4 Through case studies, I’ll illustrate how our blind spots can lead to mistakes that can persist for years. I will also try to cover some basic CSPE principles that I believe are skipped in most software engineering training. Like everyone else, I’ve been winging it for 40+ years. And I’m still learning. If you disagree with any of my points or have further insights into the topics that I am presenting, please let me know.
1But, as in the application of the principles behind algorithmic complexity, there are some performance engineering principles that can and should be adhered to regardless of the system and language du jour.
2There are in fact some bizarre implementations of “CPU utilization.” Here’s an example of how AIX “broke” the meaning of CPU utilization on its multiprocessor systems running in hyper-threading mode.
3UNIX and Linux do in fact have metrics that could be used as CPU queue length indicators. In a future installment, I will go into the quirks of some of the CPU queue length implementations and further argue for more consistent mathematical bases for queue length metric implementations.
4For example, have you ever wondered why, after so many outages caused by logging, malloc, DNS and other low-level services, we still don’t have good, out-of-the-box visibility into these and other low-level components? We are accepting key blind spots as a fact of life when we really shouldn’t.
About Me
Like many software engineers, I started off studying a different field (Physics). Unlike most software engineers, I’ve always wanted to be a performance engineer. When I started taking computer classes as an undergraduate in Columbia College’s Physics program, I did not wish for a career writing programs in Fortran (yes, it was that long ago). But I did envision myself making systems run better and faster.
To that end, when I added a Computer Science major to my undergraduate studies, I also added a minor in Industrial Engineering and Operations Research. For those unfamiliar with IEOR, it is a multidisciplinary engineering field that leverages mathematical and analytical methods to address complex system problems like resource allocation, supply chains, wait times, fairness, etc. Tools and methodologies from IEOR are invaluable to CSPE.
At the start of my career, I joined the Bell Labs Computer Center as a computer systems performance engineer, where I made the somewhat contrarian decision (for 1980) to work in the group supporting their in-house, low-key UNIX systems (as opposed to their commercially popular MVS mainframes). I didn’t realize it at the time, but having access to the source code of the system you support and study can be a boon to understanding. At Bell Labs, I had the opportunity to contribute to the UNIX System V kernel, where I co-developed the first general-purpose kernel and user-land tracing system for UNIX and made significant performance enhancements to the virtual memory subsystem. I also participated in the PMWG.
Since then, I’ve spent the bulk of my professional career working on financial software systems—ranging from market data to messaging to transaction management. I’ve worked on monitoring systems on an enterprise level. I’ve made and witnessed many CSPE mistakes. In my current role as an SRE on the Trading Solutions team at Bloomberg, I’m seeing and helping remedy lots of operating system issues that stem from our runtime scale.
Thanks
Thanks to everyone who reviewed this document and provided constructive feedback. Special thanks to Nate McNamara and Peter Wainwright for suggesting substantive organizational and structural improvements. Peter also reined in my default stream of consciousness, run-on style. None of them had any hand in the writing of this paragraph.