The game company I worked for until recently has a Mac/PC FPS wargame MMO which features myriad physically modeled vehicles such as tanks, trucks, planes and boats. Running in every vehicle is a highly complex parameterized component model of the drive train (engine, torque converter, gearboxes, etc.) complete with a damage model (leak too much oil and your engine begins to fail, etc.).
The codebase is more than a decade old and the original programmer for this part is long gone. At the time when this code was written there were only a few vehicles in use; subsequent additions of new vehicles meant creating new sets of parameters to model their drivetrains (based on original documents as much as possible). For the most part it seemed to work well.
There was one bug, however, that drove people nuts, caused them to rage quit and often unsubscribe. Every once in a while, generally during big battles, the player would attempt to restart an engine and it would get stuck starting up; the engine would never start but you couldn't stop it either, the sound of starting was audible to the other side, and all you could do was scream and spawn out. Since the game also modeled resource limits, the vehicle would be lost.
This existed in the game for years but everyone was scared to fix it. The code was written in straight C, ran of course in the physics loop (so a little bit at a time), had hundreds of variables and long complex functions, and was involved in every single vehicle in the game. Each modeled engine (or pair of engines, if there was more than one) had a list of the components it used, which could differ between vehicles: tanks have gearboxes but planes don't. Each component had dozens of parameters including masses of parts, inertia, damping, critical RPMs, etc. The damage model affected each component in certain ways as well; an oil leak from a bullet hole, for example, might over time burn out the engine and cause a fire. The upshot was a body of code absolutely critical and absolutely frightening to work on.
Now the bug itself happened randomly. Some people would get it often and some would never get it at all. Sometimes it came in spurts and then vanished. It happened on both platforms and could happen any time the engine was started, but people reported it mostly for certain tanks and generally in battle situations. It could never be reproduced on cue.
So I finally gave in and decided to tackle the bug.
First I tried to figure out if I could get it to happen offline (the game supported training offline) but no matter what I did it never happened. In fact in playing the game for years it only happened to me once during live play so I wasn't surprised.
So I looked at the code to understand what it was doing. Complex, deep, virtually impenetrable - you could use it for armor - it took me about a day to begin to understand not only the code flow but how the math worked. I looked up the relevant math as best I could, as there was no documentation at all and only a few comments. I could tell the programmer had implemented most of the math but clearly hit a wall and finally "fudged" things to make it work. In addition there were a couple random elements thrown in to make starting not be so deterministic (this was a feature).
Now at this point I still didn't know where the problem lay. The codebase had had various issues with uninitialized memory and stack contents in the past, which of course cause random behavior. Since people reported the problem occurring in battles and the physics loop ran inside the game loop, I wondered if it might only happen when frames got really slow; coupling the physics loop too closely to the game loop could be a problem in itself. Floating point math errors happened often, as we were using the Intel C++ Compiler, which had been in use for several years, so that might be a problem too. With the problem being reported more often for a certain tank model, I thought bad data might be an issue. Even the implementation of the physics math might have a hole. Another complicating factor was the programmer's reuse of variables for different purposes and copy-and-paste reuse of code without changing the names.
It was a cornucopia of potential problems. Compounding the issue was the real-time nature of the model; it ran in tiny physics step increments and there was no way to run it standalone except offline, which eliminated some of the potential contributing factors. My brain hurt by this point.
Instrumenting the entire drive train code and printing out every variable at each step produced too much data; I needed some way to focus. Not being able to reproduce the bug at all made it doubly hard. Even if I got lucky and it happened, the cause would likely already be in the past; like coming around a corner while driving and seeing a dead guy on the road. Was he hit by a car? Fell from a plane? Shot? Appeared there from a transporter malfunction?
So I decided to use the hit-it-with-a-hammer-and-see-what-breaks method. I started messing with the math. After a lot of trial and error I found that if I changed the flywheel calculation to make it a paper flywheel, every freaking vehicle got the starting error. In fact, with just the right values many fat sluggish tanks could be turned into drag racers, popping wheelies and eventually blowing up from too many RPMs. Funny stuff, but it got me thinking the problem was likely code and not modeling or random memory.
So I left in only the RPM values to be printed out. After watching them with and without the funky flywheels I saw what was happening. In the starter process the starter would be turned on and feed power to the engine, which would of course move that power down the drive train, and feedback would push it back, once per physics tick. The starter power parameter was different for each engine, of course. Since the desired behavior was not to start the engine the exact same way each time, a random modifier was added into the equation, but it was always added, never subtracted, so the power would only go up. Once the RPMs reached a critical value additional steps would happen and eventually the engine would reach a running state. Then the starter system would be switched off and the vehicle would be drivable.
In most cases the whole starting process took a couple of seconds or longer, a random amount of time as in real life. What I discovered by making an essentially weightless flywheel was that the starting process could go from 0 to enough RPMs to sustain the engine all in a single physics tick. The code assumed that this was impossible; the next physics tick assumed that the engine couldn't have started yet and would again add a little power to goose the RPMs up a little more. Now the engine could never start, as the code only handled cases where the engine had just got over the first stage RPM. It didn't handle the final stage (fully started) in the same function, assuming it would only happen later. So it would never notice the engine had jumped from a dead stop to fully runnable but instead slowly increased the power forever.
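To make the failure concrete, here is a minimal sketch of that kind of state logic. Everything in it is an assumption for illustration: the names, the two thresholds, and the three states are made up, and the real code had far more state and parameters than this.

```c
/* Hypothetical thresholds and names, chosen only for illustration. */
#define CRANK_RPM 200.0   /* first stage: engine begins to catch */
#define RUN_RPM   600.0   /* final stage: engine can sustain itself */

typedef enum { ENGINE_CRANKING, ENGINE_CATCHING, ENGINE_RUNNING } EngineState;

typedef struct {
    EngineState state;
    double rpm;
} Engine;

/* One physics tick of the starter. While cranking, only the "just got over
   the first threshold" window is handled. */
void starter_tick(Engine *e, double rpm_gain)
{
    if (e->state == ENGINE_CRANKING) {
        e->rpm += rpm_gain;                 /* power only ever goes up */
        if (e->rpm >= CRANK_RPM && e->rpm < RUN_RPM)
            e->state = ENGINE_CATCHING;
        /* no else: rpm >= RUN_RPM while still CRANKING is never handled */
    } else if (e->state == ENGINE_CATCHING) {
        e->rpm += rpm_gain;
        if (e->rpm >= RUN_RPM)
            e->state = ENGINE_RUNNING;      /* starter would switch off here */
    }
}

/* Cranks for up to max_ticks and returns the state the engine ended in. */
EngineState crank(double rpm_gain, int max_ticks)
{
    Engine e = { ENGINE_CRANKING, 0.0 };
    for (int i = 0; i < max_ticks && e.state != ENGINE_RUNNING; i++)
        starter_tick(&e, rpm_gain);
    return e.state;
}
```

With a modest gain per tick, say `crank(50.0, 1000)`, the engine passes through the first-stage window and ends up running. With a gain large enough to jump past both thresholds in one tick, say `crank(700.0, 1000)`, it stays in the cranking state no matter how many ticks you give it.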
Now the starter code would not turn off unless the engine was fully started, and the off switch only worked on a fully started engine so there you were: eternal starting you couldn't turn off except to spawn out of the game.
So why did it happen on normal vehicles with their correctly modeled flywheels? The random component would occasionally spit out a value which was just right to make the RPMs jump up just enough. The next time the function ran, the checks it made based on the RPM (and other state variables) were inconsistent with what the programmer expected and there was no code to notice the engine was actually ready to run. It's not simple code but imagine having several if's and no else. The programmer had only a few engines to test with and, with no idea of what kinds of values might be used in the future, never saw this happening.
It took me another day of testing to come up with a fix that affected only this exact scenario and did not break anything else (which would have been a very bad result). I was able to identify the situation and limit the fix to that; if the engine was starting and the RPMs were too high for the state 1 check, I simply lowered the RPMs to fit. After that the code ran as expected. Even with paper flywheels and other hammered values the engine would always start.
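In sketch form, a fix of that shape looks something like the following. This is a self-contained illustration, not the actual game code: the state names, the thresholds and the clamp value are all hypothetical.

```c
/* Hypothetical thresholds and names, chosen only for illustration. */
#define CRANK_RPM 200.0   /* first stage: engine begins to catch */
#define RUN_RPM   600.0   /* final stage: engine can sustain itself */

typedef enum { ENGINE_CRANKING, ENGINE_CATCHING, ENGINE_RUNNING } EngineState;

typedef struct {
    EngineState state;
    double rpm;
} Engine;

/* One physics tick of the starter, with the overshoot case handled. */
void starter_tick_fixed(Engine *e, double rpm_gain)
{
    if (e->state == ENGINE_CRANKING) {
        e->rpm += rpm_gain;
        /* The fix: if a single tick overshot the first-stage window, pull
           the RPM back just inside it so the existing check still fires. */
        if (e->rpm >= RUN_RPM)
            e->rpm = RUN_RPM - 1.0;
        if (e->rpm >= CRANK_RPM && e->rpm < RUN_RPM)
            e->state = ENGINE_CATCHING;
    } else if (e->state == ENGINE_CATCHING) {
        e->rpm += rpm_gain;
        if (e->rpm >= RUN_RPM)
            e->state = ENGINE_RUNNING;
    }
}

/* Returns the tick on which the engine reached the running state,
   or -1 if it never did within max_ticks. */
int ticks_to_start(double rpm_gain, int max_ticks)
{
    Engine e = { ENGINE_CRANKING, 0.0 };
    for (int i = 1; i <= max_ticks; i++) {
        starter_tick_fixed(&e, rpm_gain);
        if (e.state == ENGINE_RUNNING)
            return i;
    }
    return -1;
}
```

The point of clamping rather than adding a new branch is that the rest of the state logic stays untouched: an overshooting engine is simply fed back into the path the original code already handled, and a normal start behaves exactly as before.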
Nasty bug took 4 days to take down but made many players happy.