At First Glance, a Rendering Flaw; Ultimately, a Compiler Bug

Published in Seerene · 10 min read · Feb 21, 2020

Read the story of how we mistook a text rendering bug for a graphics developer’s programming error, dove deep into the source code, and ended up questioning programming itself. This story is aimed at software developers and low-level computer graphics enthusiasts.

One key takeaway: if you want the highest-quality software, you need expert developers who are capable of diving into the underlying frameworks and libraries. These are often considered flawless, off-the-shelf components, but things go wrong there, too. It is not enough to write defect-free, high-quality code yourself; you also need to be able to detect and correct defects in the underlying third-party software.

Screenshot of a Seerene software map of the Firefox source code from its GitHub repository.

Introduction

As developers, we sometimes find ourselves looking at next week’s work, even on the weekend. What we usually find are bugs that have conveniently been reported in detailed tickets. We get a quick overview, try to estimate our future involvement with the bug, and dismiss it until next week. We expect a debugging session, a reasonably straightforward fix, and a nice lesson learned.

Here at Seerene, we received such a bug ticket. At first we perceived it as an everyday bug, but it gained unexpected depth along the way. In short, the ticket reported illegible text rendering within our software map visualization. For the software map visualization and its text rendering, we use the open source framework webgl-operate, a WebGL rendering system. Besides its low abstraction level, which allows for massive scalability of the visualization, we value its native support for clean text rendering within three-dimensional scenes using distance field rendering (reference: “OpenLL: an API for Dynamic 2D and 3D Labeling”). Until now, our problems with this framework had ultimately been traced to browser specifics such as WebGL driver issues or unsupported features. This time, however, the bug indicated wrong resource handling or wrong rasterization by webgl-operate itself. In retrospect, webgl-operate was not the cause of the error, but this had to be determined first. Eventually, the bug turned out to be located not within webgl-operate but within an open source project that we, and probably a much broader community, rely on. Fortunately for the reader, this bug touches the work and domain of passionate graphics developers, which results in visually presentable artifacts for this post. Furthermore, you can follow the external communication on this bug on publicly accessible websites.

Bug Report

The bug we worked on was also filed for webgl-operate (link to the issue). The ticket reported erroneous rendering results for the hardware-accelerated text rendering module. Where nice, clean text rendering was expected, there were characters with different widths; some characters were so wide that they covered neighboring characters, and others were so thin that they were hardly legible, if they were visible at all.

Starting from that, we tried to collect three different pieces of information:
1. Which systems are affected?
2. What causes the bug?
3. How can this be fixed?
Question 3 usually comes with the more sophisticated questions “Is there a workaround?” and “When will this get fixed?” After all, the answer to the last question is what matters for production systems and customer success.

Answering Question 1: Which systems are affected?

As this text rendering has been in production use for quite a while, we had several machines where the rendering error could not be reproduced. In fact, reproducing the error was an hour-long task in itself. It turned out to be reproducible on Windows machines running Google Chrome in software rendering mode; more specifically, with software rendering by Google’s SwiftShader library. After this search for reproducibility, we had a first suspicion about the cause. On the one hand, there is a large number of systems whose WebGL implementation ultimately executes on graphics hardware and where no problems had been detected so far. On the other hand, there is one specific renderer that executes WebGL without graphics hardware and that causes the problems reported here.

Answering Question 2: What causes the bug?

As the final pixel color values were wrong, we focused directly on the fragment shader of the text rendering and started with the low-hanging fruit. First, we got an overview of the source code and tried to re-familiarize ourselves with the approach and its implementation. As there was not that much code, we validated the correctness of the implementation, too. After all, this shader isn’t complicated, uses mostly standard WebGL and, from our point of view, fully adheres to the WebGL and GLSL specifications. We stripped the code down and removed everything that was not used in our bugged example. The code is as follows.

Stripped-down distance field rendering fragment shader.

Apart from general programming language features, the shader was now stripped down to the most basic constructs a GLSL shader could use:
• Color output
• Uniform variable inputs
• Texture lookup
• Fragment discard
That was the moment we knew we were on an actual journey.

Digging Down

After our program understanding phase, we tried to get a feeling for the bug. We manipulated the shader and observed the changes in the output to isolate the effect and, hopefully, track down the erroneous lines of code. Naturally, we expected complex or rarely used features to cause problems, e.g., features introduced by extensions or features with numeric instability or non-intuitive implementations. As such, we identified four potential points of failure: the shader derivative feature (fwidth), varying variable interpolation, the texture lookup, and the smoothstep implementation. The texture lookup was quickly validated by returning the value parameter within aastep instead of the smoothstep computation. The result was as expected; we basically printed the distance field for each glyph, and it was the same as in the font asset texture.

Internally used font asset texture. The gradient for each character allows for approximation of the border of the actual glyph using linear interpolation of fragments and thresholding; the distance field rendering technique.
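The thresholding idea behind distance field rendering can be modeled on the CPU. The following Python sketch is not the shader discussed in this post; the sample row and the fixed `width` (a stand-in for what `fwidth` would supply in GLSL) are assumptions chosen purely for illustration.

```python
def smoothstep(edge0, edge1, x):
    """GLSL-style smoothstep: clamped Hermite interpolation between two edges."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def aastep(threshold, value, width):
    """Anti-aliased step: 0 outside the glyph, 1 inside, smooth near the border.

    `width` is an assumed constant here; a shader would typically derive it
    from screen-space derivatives of `value`.
    """
    return smoothstep(threshold - width, threshold + width, value)


# One row of distance-field samples across a glyph stroke (hypothetical values).
row = [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25, 0.0]

# Coverage per sample for a constant threshold of 0.5.
coverage = [aastep(0.5, value, 0.25) for value in row]
print(coverage)
```

Samples well below the threshold end up fully transparent, samples well above it fully opaque, and samples at the border receive partial coverage, which is what produces the anti-aliased edge.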

Next, we eliminated the uncertainty about the fwidth implementation by replacing smoothstep with a plain step. Unfortunately, this resulted in similarly erroneous rendering, albeit without anti-aliasing. As a quick test, we replaced the standard GLSL step function with its default implementation, i.e., a comparison that returns either 1 or 0. This didn’t fix the bug, either. We ruled out an error in varying variable interpolation, as the expected characters were at their expected positions; only their rendered width was off. Removing the fragment discard didn’t fix it either. To summarize our state of knowledge: Google SwiftShader seemed to fully support shader derivatives, varying variable interpolation, texture lookup, and smoothstep. We had not come a step further. The code afterwards was as follows.

The relevant excerpt of the further stripped-down shader.

Digging Deep

Our next approach was to find a pattern in when a character is rendered too wide, at normal width, or too narrow. Interestingly, we found that wide characters were on the left side and narrow characters on the right side of the texture, whereas characters around the center were rendered roughly as expected. Thus, we tried to validate this correlation and, hopefully, find a causation. With the distance field rendering technique, the rendered width of a character is controlled by a threshold value that approximates the border. Usually, this technique uses a constant threshold per font. On the target system, however, the threshold seemed to depend on the glyph instead of being constant. Moreover, we presumed that the value depends on the x position of the glyph within the asset texture. With this presumption, we investigated the source code further. As we had validated before, the texture lookup is correct, so we ruled out a mismatch in the value parameter of aastep. Similarly, the step implementation performs as expected. We reduced the aastep implementation further to return just the threshold parameter.

Minimal version of the aastep function.

Thus, we changed the function from returning 0 if the fragment is outside the border and 1 if it belongs to the glyph to just returning the compile-time constant threshold parameter, i.e., 0.5. You may imagine our surprise when we observed it to be non-constant. We started searching for an exorcist hotline.
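Why a non-constant threshold shows up as varying glyph widths can be illustrated with a small CPU-side model. This is a sketch under assumptions, not the original shader: the sample row and the three threshold values are hypothetical.

```python
def printed_width(row, threshold):
    """Count the samples of one distance-field row that lie inside the glyph."""
    return sum(1 for value in row if value >= threshold)


# One row of distance-field samples across a glyph stroke (hypothetical values).
row = [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25, 0.0]

# A low threshold widens the stroke, a high threshold thins it out.
for threshold in (0.125, 0.5, 0.875):
    print(threshold, printed_width(row, threshold))
```

With these values, the stroke covers 7 samples at a threshold of 0.125, 5 at 0.5, and only 1 at 0.875. A threshold that grows from left to right would therefore match the observed pattern of wide glyphs on the left and thin glyphs on the right.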

Still Digging

Knowing we were headed for a bug report, we decided to derive a more general example. From that, we hoped to pin down the actual technical problem or limitation that the shader code triggers. We converted the code to a ShaderToy demo. The full example code is as follows.

ShaderToy demo showcasing the correct and erroneous behavior (depending on the system).

With this example shader, we expected a constant orange across the whole image, as the resulting color depends only on constant values. The parameter a is derived from the result of subFunction, which returns the result of utilityFunction, which returns the forwarded parameter t, that is, the constant value 0.5. However, when executed with SwiftShader software rendering, the value passed to utilityFunction doesn’t seem to be the value of t but something different. What we actually observed was a horizontal color gradient from red to yellow. This is explainable if the value of a ranges from 0.0 to 1.0 instead of being constant at 0.5.
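The difference between the expected and the observed image can be reproduced with a few lines of arithmetic. The Python sketch below assumes that the demo blends linearly between red and yellow using a, in the manner of GLSL’s `mix`; the function names `expected_color` and `observed_color` are hypothetical and do not appear in the demo.

```python
RED = (1.0, 0.0, 0.0)
YELLOW = (1.0, 1.0, 0.0)


def mix(x, y, a):
    """GLSL-style linear blend: x * (1 - a) + y * a, per component."""
    return tuple(xi * (1.0 - a) + yi * a for xi, yi in zip(x, y))


def expected_color(u):
    # a is the constant 0.5, independent of the horizontal position u.
    return mix(RED, YELLOW, 0.5)


def observed_color(u):
    # Models the observation only: a behaves as if it ran from 0.0 to 1.0
    # across the image width instead of staying at 0.5.
    return mix(RED, YELLOW, u)


for u in (0.0, 0.5, 1.0):
    print(u, expected_color(u), observed_color(u))
```

The expected color is orange, (1.0, 0.5, 0.0), at every position, whereas the modeled observation runs from red at the left edge to yellow at the right edge and matches orange only in the middle.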

A side note for the inclined reader: please take your time to confirm that neither the value 0.5 nor the parameter t is ever operated on; the validity of the source code is in fact provable.

Digging Deeper

What remained conspicuous was the unnecessary texture lookup passed to utilityFunction, as its result is eventually ignored. If we remove this lookup, the resulting image is actually plain orange, as expected. Unfortunately, the texture lookup is required in the original code, which rules out this approach. At this point, we were still determined to find the actual erroneous pattern in the source code. We discarded all assumptions about defined behavior, programming languages, and the run-time execution of our GLSL shader. The question at hand was how the constant value 0.5 could be replaced with a dynamic value ranging from 0.0 to 1.0 during a short trip through function calls and parameter passing. Conveniently, a value with the range 0.0 to 1.0 was present for each fragment: the x coordinate of the uv variable. We tried rewriting the code in several ways and, fortunately, most of them made the bug disappear. We concluded that we had found a GLSL compiler bug triggered by our very specific combination of features: an inline texture lookup with array subscript access to a single component. It seems that under this condition the actual parameter list of the utilityFunction call is wrong, and the first component of the second parameter is passed as the first parameter (the one we returned in the ShaderToy example and the one we threshold against in the actual distance field rendering shader).

Answering Question 3: How can this be fixed?

After resurfacing from our excursion into function dispatch and parameter list management, we wanted to provide both a quick fix and an actual solution to the original bug. The former was quickly found: extract the inline texture lookup and store its result in a temporary variable. For the latter, we filed a bug on the Google SwiftShader bug tracker (link to the issue). However, we neither evaluated other contexts in which such misbehavior could occur nor dug deep enough into the SwiftShader source code to give its developers a clear hint where to search for the bug. After all, we assume they know their code better than we do. The last remaining question, “When will this be fixed?”, is a double-edged sword. On the one hand, we could work around this bug for ourselves and for the community using webgl-operate by providing a one-liner as a pull request for this specific shader. On the other hand, this would not eliminate the actual defect in the SwiftShader sources, and thus the risk of other bugs for a much broader community would remain unmitigated. Right now, we are waiting for Google developers to find the defect in their code base and eventually fix this software rendering bug in Chrome. Until then, we accept the workaround within webgl-operate.

Conclusions

This felt like one of the days a software developer waits for. We experienced a bug where the error did not come from a user error on our developers’ side, a misinterpretation of WebGL or GLSL, or an implementation error by the developers of webgl-operate or Google SwiftShader, but presumably from a more fundamental implementation error in the compiler code or shader execution environment of SwiftShader. This error didn’t seem to have been reported before, which makes it all the more important to share our discovery with the open source community. After all, that is the main force that drives open source forward: the shared goal and joint effort to create better software.

Willy Scheibel is a passionate software developer and has been working for Seerene for several years. Like many of his colleagues at Seerene, he is an expert not just in developing our own platform but also in analyzing external software architectures. Furthermore, he is a Ph.D. researcher in visualization system architecture, particularly system design, interfaces, and algorithms for hierarchy visualization, at Hasso Plattner Institute (HPI).