Debugging a Crash in OpenRCT2

2023-01-19

Last year, my interest in RollerCoaster Tycoon was renewed through Marcel Vos's in-depth videos on the game. Like most players today, I picked up OpenRCT2, which offers cross-platform support, higher resolutions, and other enhancements. This is the definitive way to experience the game, certainly exceeding the disappointing official ports.

While playing on Linux I experienced occassional crashes when placing coasters:

Crash occurs when a ride is selected and the mouse moved over the map.

While the bug itself was trivial and easily fixed, reproducing and locating it was more difficult. If you're impatient you can skip to the solution that finally found the problem, or read on for the full investigation.

Initial Observations

Before debugging I collected some thoughts and researched if this was a known issue. I found a single similar-sounding bug report, also from a Linux user, although only seen in multiplayer.

The crash only affected specific combinations of saved coaster designs and scenarios. Not every run of the game would crash, permitting one to save and reload a few times until getting lucky.

The bug's inconsistent appearance, and possible platform dependence, suggested an invalid access was always present, but the crash depended on the memory layout. Many build and runtime factors affecting layout would make reproduction difficult.

I was primed to suspect a use-after-free, after having seen other glitches in-game suggesting memory being incorrectly re-used during coaster construction. That would turn out to be a different bug entirely, unrelated to this article.

Initial thoughts before debugging.

I had briefly seen parts of the source before, to learn game rules and logic, but was otherwise unfamiliar with the codebase and had no idea where to start. My previous experience with 1990s game code primed me to expect undisciplined global variable use any port would necessarily inherit. I had seen the project contained a mix of coding styles, with some logic directly adapted from the original game assembly.

Debugging of Release Build

The first step was to run the game under gdb. This proved immediately informative, as the game didn't crash. To my frustration, I repeatedly placed a coaster which had previously been a consistent source of faults.

I had narrow previous experience with gdb, but knew programs could behave subtly differently within a debugger. If a program was loaded in differently, this would complicate reproduction of a layout-dependent crash. I suspected gdb was affecting ASLR, and a search confirmed gdb disables it by default:

"Disabling randomization is on". I had to read it a few times.

With ASLR re-enabled, the game crashed within the debugger as desired. Since this is a release build, gdb can't provide function or variable names to tell us where we are.

Some more exploration was possible from here. Repeating the crash several times and collecting the faulting addresses (the location the program incorrectly attempted to access), we see many common low bits:

The target address having a common page offset suggests, but does not prove, the program was attempting to read a global variable, rather than a stack or heap location. It also implies that if the program was reading outside the bounds of an array, it was doing so with the same (bad) index each time.

We can switch to the disassembly within gdb to see what the program was doing at the time of the crash. With luck we might match this up to something in the source code:

The faulting instruction, a load from an array, is highlighted.

The faulting instruction is attempting to load a value from an array. If, like me, you don't frequently read assembly in detail, here are the components of an x86 memory reference (presented in AT&T syntax, the default of the tools I was using):

A memory reference has up to four parts, and resembles using an array of structs in C. There is a base address, a scale for the size of the array elements (up to eight bytes), an index to be multiplied by the scale, and a displacement to add to the base, useful for example to select a member from within a larger structure. The base and index may be specified by register. The type of the value to be loaded is indicated by the suffix of the assembly mnemonic.

Play around with this Compiler Explorer example to see how the parts of a memory reference correspond to source code. While the full syntax permits expressiveness of common operations for the assembly programmer, compilers seeking to minimize code size will use available information to achieve a more concise representation, such as by combining base and displacement values when both are known in advance.

The faulting instruction in the debugger is a simple indexed load:

The zbl suffix indicates the value being moved is one byte, and it is being placed into a 32 bit register, with the remaining bits filled with zeros ("zero-extended"). The scale factor is 1, and there is no displacement, so the target is probably¹ an array of bytes.

The base and index are supplied by register. What are the contents of those registers at the time of the crash?

The maxed-out 16 bit index value is immediately attenion-grabbing. We're indexing far off the end of something. Though the crash is inconsistent, that bad index is probably always there, and such a loudly wrong value should be easier to chase down than some fuzzy distribution of junk numbers. Seeing 0xFFFF suggests integer underflow, but this turned out not to be the case.

At this point I could have also checked the base address to verify the target was a global variable, but I was quickly distracted and didn't get around to confirming this until later.

The next instruction is a return, so we're looking at the end of a function. We need to look at the preceding context to see what's going on and where that index value is coming from. This brought me to the first of many tooling bugs encountered during this exercise:

GDB had a bug where it couldn't scroll up to show previous instructions. And this bug only occurred when trying to debug a crash:

A debugger that can't debug programs!

The bug had gone unfixed for nearly a year, and was preserved in the LTS release of Ubuntu for me to experience another year later. Installing a different version of development tools on my main machine just for this was not something I wanted to leap into, so I turned my attention away from the release binary.

Debug Build

While poking at the faulting release build on my main machine, I had been setting up a clean VM to use for building the game. Reproducing a crash here would locate the source immediately.

After the usual mess of installing dependencies, I had everything needed to build, but encountered another problem:

I can't find a screenshot of the actual error so here's me Googling it instead.

GCC encountered an internal error trying to produce a debug build. I found no help for this error and was not eager to start changing compiler versions. Perhaps I was impatient after GitHub also went down at the same time I was setting up the build:

I switched to Clang, also supported, which compiled without issue. I copied over the game assets and finally had a running (but still crash-free) debug build:

This was less than ideal. My crashing release build had been made with gcc.² With a fault that was so sensitive to layout, there was no guarantee different builds would exhibit the crash. Changing to a different compiler could obscure the problem further. It would also lessen potential similarity in the disassembly between the builds, making it harder to match the faulting location in the release build to source code by searching for corresponding instruction patterns in the debug build.

Source Exploration

Without reproducing the crash in debug, I don't know what I expected to learn stumbling around an unfamiliar codebase. An obvious bug would likely have been fixed already. Getting acquainted to the degree necessary to narrow down a search for the culprit would take time and patience I didn't have.

The port overall is quality work, with familiar patterns and new features implemented in maintainable style. But much of the game's core necessarily conforms to the original RCT. The closer one gets to this old logic, the less clean things are. Consequently there were many unencapsulated global arrays and frequent lack of bounds checking. With so many suspicious locations there was no obvious place to start looking.

Failed Attempts

Array Instrumentation

Assuming the bad index was always present, I thought I might catch it by instrumenting accesses to global arrays. Using ASan was a possibilty, but I believed enabling checking globally on a program full of unrelated memory issues would be too noisy to be of use.

I made a simple wrapper class with a bounds-checked index operator, to be used as a drop-in replacement without affecting point of use code. Starting with arrays that sounded related to ride construction and drawing, I started substituting the checked type and repeatedly re-playing the same scenario as before.

I soon felt this wasn't going anywhere. RCT has many static lookup tables and none I touched were the problem. It turned out I was not even looking for the right thing - The problematic variable, I would later discover, was an array of bools, not a char or integer type. My naive find and replace wouldn't have found it.

One of RCT's many lookup tables. A modern game would tend to use a map rather than an array with fixed indexes, and populate it at run time based on loaded assets or mods. In this example, my understanding is the current game can't add new types of ride supports, forcing modders to fake them with a combination of cheats and custom scenery objects.

Disassembly Comparison

GDB had been unable to show the previous instructions, but after dumping the disassembly using objdump and grep-ing around for the instructions I had already seen, I found the full context within the release binary:

The entire relevant function body.

We'll discuss what this code is doing in a moment, but first I wanted to see if I could find something similar in the debug build. I soon realized this was not viable either.

With only these few, generic instructions, there was almost no hope of finding a corresponding location within the debug build. Checking for occurence counts of various patterns, I found the two builds differed greatly in the frequencies at which these two instructions appeared together, as well as the way memory was referenced. This should not have been surprising given the different compilers, and the many possible ways to generate logically equivalent code. Even if I could get a gcc debug build to compare to, this approach was probably not going to work without additional context to narrow the search.

Misc. Failed Reproduction Attempts

Other ways I tried to provoke the crash under debug were:

Loading a save file from a crashing scenario, from the failing machine, on the debug build.
Copying a crashing custom coster design from the failing environment, and loading under debug.
Running the debug build on the failing machine, and vice versa.
Various combinations of build options.
Enabling/disabling ASLR.

None of these worked. I ended up having to solve the problem without ever reproducing the crash on a debug build.

An Unfortunate Setback

While failing to reproduce the crash under debug, things got worse: The crash stopped occuring on my release build too, no matter what I changed. What had once been a semi-regular frustration had gone away just when I needed it. I was now tragically stuck with a working game.

I never caught the crash in a debugger again. What follows had to be done with only previous observations. To find the location of the crash within the release disassembly, I had resorted to referring back to a screen recording I had from my initial debugging session:

Thankfully, at this point we have sufficient information to get where we need without having to wait for the fault to recur, and continue the search from there.

Dead Reckoning

From the disassembly, we see in the instruction preceding the faulting load that the base address of the target array is somewhere in the bss section, and is being accessed by relative addressing. Let's review what this means.

A program's memory contains segments for code and data loaded from the binary image, as well as mappings provided at runtime for purposes such as stack and heap.³ Their corresponding memory pages are marked with permissions such as read, write, and execute. Many of these sections have annoying historical names, so "text" means the program's executable code, and "bss" means zero-initialized static variables.

Simplified model of a program's memory.

The program's code and data from the binary are brought in together. Their size is fixed, as is their relative position to each other.⁴ This means we always know where static variables can be found relative to instructions, unlike dynamic or stack allocations whose locations are only known at runtime. For security, the program's base address, and its heap and stack locations, are randomized by the OS (ASLR), which is possible because programs are not permitted to assume anything about their location.

Programs once hard coded most data addreses, but the preferred modern way is to reference memory relatively, as offsets from the current instruction. Such addressing is said to be "rip-relative" after of the name of the x86 instruction pointer register. This technique facilitates dynamic loading and security by permitting the program to be loaded in anywhere in virtual memory without having to fixup its memory references, a slow process.

Looking at the lea instruction in our disassembly that computes the base address for the following load, while the presence of a register name in the syntax suggests the target location is variable, the value of that register is always the same for a given point in the program, so we are actually looking at a fixed memory reference to a static variable. An annotation provided by the disassembly provides the relative location of that variable within the bss section:

If we read out of bounds of the array, we could read random data from elsewhere in the bss segment, and if we go far enough, we might end up in an unmapped page and crash. But we're not guaranteed a crash. Because the index is only 16 bits, depending on the size and layout of the bss section, we might not actually go outside of it. Even if we do, there might happen to be another valid mapping nearby, so we still may not fault. This is why the crash depended on a combination of platform, build, and runtime conditions, making it rare and difficult to reproduce.

Fortunately, because the array being referenced is static, rather than one provided by the caller as a pointer, we don't need to wait for the conditions of the crash to recur to find it.

What Actually Worked

With no reproduction, I thought I would try to establish a correspondence between the two builds based on searching their runtime data, rather than looking for code patterns. There was no guarantee this would work - different versions of the game might load assets in a different order, or assign different IDs, or something else that made their array contents incomparable. But I was lucky and a naive search for an exact pattern proved completely successful.

Here is an overview of the search steps that found the bug, which will be described below:

I first got back to the crash location within the release game running under gdb. An easy way to do this was to directly search the memory of the process for the instructions I had seen in my previous screen capture.⁵

To do this, with the game running, we can ask gdb where the executable's sections are located in memory (info files):

Then we use the memory search command, with the target address range being the text (code) section, and the search pattern being the bytes of the instruction we saw in the disassembly:

Adding a little more context narrows it down to two results. I chose the first address as it had the same low bits as the offset from the binary disassembly:

We can ask gdb to print out the contents of the memory at that address, interpreting it as an instruction, and see that it matches the disassembly:

At this point we're back at the crash site. If we wanted to, we could set a breakpoint conditional on the bad index register value, and it would be as if we had just reproduced the crash. But knowing I was unlikely to be able to make sense of the backtrace within the release build, I proceeded to the memory examination.

Subtracting seven bytes from this address gets us to the preceding lea instruction that computes the base of the array. GDB has helpfully annotated the instruction with the resulting virtual address:

Printing out a chunk of memory starting from that address, we can see ... nothing.

But this is a mutable data section. I had the game stopped at the main menu - maybe this array is only used within a scenario. I resumed, loaded the same scenario as before, then checked again. The target memory was now populated with values:

Every byte is either zero or one. I hadn't even considered the possibility the array was of bool type. I briefly searched the source to see if this pattern occurred anywhere such as an initializer list, but found nothing. Hopefully the same pattern could be found in the debug build at runtime.

I loaded the same scenario in the debug build in its VM, and halted it at the same point, having just placed the test coaster. As before, we ask gdb where the debug build's sections are located in memory:

We can then perform another byte search, this time with the target range being the bss section, and the search pattern being the contents of the array seen in the release build. There is one match.

GDB has already used its debug information to annotate the result with the name of the variable corresponding to this address, which is exactly what we are looking for. The hard part was over and I could finally start looking at relevant source code.

Here is the declaration of that array:

Great, so all I need to do is drop in the checked array wrapper and wait for it to assert:

The array type is bool. My hastily written test class wrapped a std::vector. Many of you already know what happens when you try to return a reference to a bool within a vector:

Right, vector<bool> is totally broken. Fortunately, there was no need to try and make this work, as I was done with the broad searching and this array was only used in a couple relevant places:

We know the program is crashing on a read, so I went to the only indexed read location, which was immediately suspicious:

The function ride_entry_is_invented does what the name suggests, checking if a particular ride asset is unlocked in the current scenario. Neither this helper function, nor the array type used, have any bounds checking, so anyone can pass a crazy number and crash the game. This function was called from many places.

An assertion placed here fired under the same conditions the release build had crashed:

Printing out the index, we see the invalid value continuously as the ride construction preview is being drawn:

Adding a breakpoint finally yields a debug backtrace:

The Bug

The bad index value was coming from a method of a class that is part of the command pattern. An object representing a game action has member functions to query if a potential change is valid ("you can't build that here") or execute it. TrackDesignAction had the same bug in both functions, which is why the crash happens from hovering the mouse over the map, which invokes the Query. The bug is shown below:

The code uses a global object manager to look up an ObjectEntryIndex for the track design's vehicle type. It checks if the index is equal to a magic null value, and if so, calls the "is invented" function with the index. If the entry is not invented, it sets the index to null again. Clearly, the intent had been to only do the invention check if the index was not null, but the test was inverted, ensuring the function was only called with null values.

What is the null value? It's defined as the maximum value of the index type, which is uint16_t:

For any vehicle type not loaded in the current scenario, the OOB read would occur. The bug had been present for many years, but may not have crashed until later changes disturbed the memory layout. As for in-game indications of the bug, nothing wrong probably resulted from an action attempting to build a ride with an uninvented vehicle type, perhaps due to another fallback check elsewhere. The game does warn about the vehicle type when placing the design, a clue I might have been able to notice earlier.

If the bug sounds unexciting and basic, it is. I opened a GitHub issue with my findings, and a fix was committed ten days later. The inverted test was fixed, and a bounds check added to the array access. In the meantime I commented out the problem code in my own build.

Conclusions

This article ended up being longer than I anticipated but I believe the detail is important. I was using several debugging features for the first time, searching for how to do almost everything as I went, so I wanted to cover every step to make it easier for the next person. GDB's ability to arbitrarily search and reinterpret memory has broad utility but I hadn't seen these commands demonstrated before, either in tutorials or using it for more basic troubleshooting.

Debugging

I spent about 8-10 hours on this bug. Some of this was unavoidable, like setting up the build environment and dealing with tooling bugs.

My development tools crashed more than the actual game.

I wasted time trying indirect, broad searches for the bug. Searching around the source code without knowing how it was organized was not going to guide me to the right location. Making small changes, adding assertions all over the place, and repeatedly trying to provoke the debug build to crash was unlikely to help, since after all, it didn't crash for most people most of the time anyway.

What worked was taking small but direct steps with the pieces of information I had before moving on, even when it didn't look like much to go on. I should have tried to get the full disassembly of the release build sooner, rather than getting distracted by the debug build and source exploration. As soon as I saw the previously obscured lea instruction, I had more context and a literal pointer telling me where to look next. The source code wasn't able to tell me anything until I'd already isolated the problem using other tools.

Bug Prevention

Opinionated ways to deter bugs like this:

Avoid C Arrays

The existing code had inconsistent bounds checks on global array accesses, even in similar, adjacent functions. Checks were probably added where crashes had occurred before. We know manual checks don't scale and add visual noise to code readers.

Container types can build in checks for range and value, log their usage, use custom allocation, etc. These features can be conditionally compiled out as needed.

One potential advantage of native arrays is they are known to the compiler and other tools, which may have their own memory checking instrumentation options. But the fact that bugs like this persist for many years is evidence these checks are not widely used, probably because they are slow and noisy. Games often prefer targeted debug features for specific systems or types that are toggled by their own compile-time switches and don't depend on individual compiler extensions.

Container classes have their own annoyances. In particular, writing a drop-in replacement static array class template that deduces its size from an initializer, doesn't require helper functions, and doesn't use heap allocation, requires advanced template knowledge. The fact that CTAD is all-or-nothing remains a significant impediment to common use cases of std::array and similar fixed container types.

Avoid In-Band Signal Values

Modern standards discourage using magic values to indicate null, failure, or other special conditions. Relying on programmer discipline to add manual checks does not scale, but was perhaps sufficient for the original game as developed by one person.

While it wasn't the cause of this bug, the faulting function also had an incorrect parameter type which was corrected in the same commit. Integer type mix ups are a similar source of subtle bugs in C. Typedefs improve readability but are transparent to the language, permitting undesired conversions.

Type systems can be used to enforce correct unit usage, and separate checked and unchecked domains. A type returned by an object manager system, which could be null, can be prevented from being passed to a query function that only makes sense for valid entity identifiers.

The argument against this practice is again that C++ can make such library types cumbersome to use in comparison to languages like Rust which give option and enum types first-class support. Within the standard library, std::optional is decent; std::variant is pretty ugly. Enforcement of checking usually relies on exceptions, or non-obvious control flow where all possible cases are handled by e.g. overloaded function call operators on a visitor object, or callables passed together as arguments to a selector function.


auto Result = ObjectManager::GetVehicleType(TrackDesign.VehicleType);

// this kinda sucks
Result.Match(
    [](ObjectEntryIndex Index) {
        // do something with the index
    },
    []() {
        // null case

        // Because this is a lambda, we can't affect enclosing 
        // control flow with break, continue, or return.
        // Functional style alternatives possible but not widespread in C++.
    }
);

An intermediate solution that preserves normal code flow is a debug-only enforcement that asserts or throws conditional on a flag indicating if the validity of the contained value was checked prior to a get function being called.

Aside: Assembly Programming

RollerCoaster Tycoon's development has been the subject of much programming lore. Social media posts express amazement at the superhuman mind required to create such an intricate game using primitive tools.

This perception is detrimental to the learning of newcomers, and the spirit of programming. Generations of programmers used assembly, and they were not superhuman. They created games, operating sytems, and other complex software, using similar organizational principles as programmers today. Even you, when learning a modern language, probably once felt lost in its details until you built up a mental repertoire of patterns, idioms, and organizing techniques that imbued that text with meaning greater than its keywords and grammar. The experienced assembly coder contemplates that language with similar richness.

So it is with domain knowledge as well. Even a skilled programmer, if unfamiliar with graphics and simulation, might regard games as magic. But only a few books will break this illusion. With moderate experience, you can look at a game and imagine its plausible data structures, processing steps, and organization. The experienced coder of any domain knows this is the essence of programming, and the realization of these relationships as code, while laborious, is not magic. You could simulate a theme park game in FORTRAN, or write it on punch cards; it's only a matter of familiarity, tooling, and time.

Besides, the creation of RCT did not literally consist of writing a great edifice of assembly instructions. It was written in a quite capable macro assembler, which enabled familiar program organization. This assembler supported structs, macros, variable names, function calls, math expressions, string manipulation, and more. I wouldn't choose to write software in it today, but it was a reasonable tool for the job, one the creator was already familiar with. In practice Chris Sawyer was likely operating a level of abstraction comparable to C. If you know C today, or know people who know C, you know creating impressive software in this way is not magic. It's just work, and you can do it.

Just because the faulting instruction is indexing by byte, doesn't mean the target has to be an array of bytes. The target could be a complex structure, like an array of larger structs, or a buffer of items of varying size. Instructions preceding the faulting one could have computed an index separately by more complex logic, which was then supplied to the mov operation. Even though at at this point I hadn't been able to see the preceding context, the nature of how games like RCT work led me to suspect frequent use of global tables of small values.
Not knowing exactly what I was doing, I ran strings on the release build, grepped for "gcc", and found some symbols unique to gcc's output.
Although a process has a contiguous segment named "heap", that can be resized with special calls, this was of greater importance historically. Memory allocators today use a syscall like mmap to request or release memory on demand, which the OS can place anywhere within the large virtual address space. There is no guarantee any dynamic variable actually resides in the segment labeled "heap", or has any consistent position relative to any other segment.
In reality there is not a one to one correspondence between "sections" in the binary and "segments" that get loaded into memory, but this is not a concern to us here. Additionally, because the bss section is zero-initialized, it doesn't occupy any space in the binary. It's just a placeholder for a fixed amount of reserved space. The loading of the program is more than just a straight copy of the executable contents, as things get nudged around and references fixed up during loading, but it's not a bad mental model.
I could have worked out the runtime location of the faulting instruction, or the target bss array, by using their offset info from the binary, and the program's runtime base address. But my first attempt at this was wrong, while the search worked immediately.