I love to read other people's tales of finding and fixing weird bugs, so here is one of my own. I did one earlier that people found interesting: Fixing a Nasty Physically Modeled Engine Bug in an FPS Game.
I work for a travel company, and our iPad app has many features, including car rentals. The app was originally written by a third-party company and developed under a previous group of executives who were "replaced". We got the code in a broken state and had only 10 days to complete it and put it in the App Store. It worked fairly well given the start.
During the first week a user sent in a bug report that, unlike most, actually included useful details. Apparently he had opened the app for the first time, gone to Branson, Missouri, and started looking at rental cars. After selecting a few, he hit one that crashed the app. So he restarted the app, went back to Branson, and repeated his search.
No cars were found. In fact no matter when he searched there were never any cars available for Branson. Yet there were cars available in other locations.
My first words were "WTF kind of crash could make a search in Branson fail every time?" or something less printable.
We didn't know exactly what he did other than the search, so I played with it in the simulator for a bit and was able to recreate the crash. I then restarted the app and, lo and behold, it found no cars in Branson. Bugs are always easier to fix when you can recreate them. Even bizarre ones.
Our apps and mobile site all talk with a JSON-based set of web services that our group develops. These communicate with an XML-based set of services that we have no control over, and those communicate with other services and often call into a mainframe system at our parent company and even third party systems.
So I tackled the crash first. I captured the JSON for the car rental search and pasted it into a JSON validator on the web, which reported a non-UTF-8 character, a real no-no in JSON. After looking at the character, which was in a single SUV's description field, I recognized it as likely being an EBCDIC character. Now our service gets its data from XML, and that gets its data from other services, but the data ultimately comes from the mainframe and was likely entered by some system at a car rental company. Somehow this character made it through a whole bunch of data processing intact. To this day we haven't seen another example.
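For anyone who wants to do the same check without a web validator, here is a minimal sketch in Python (not the language of the app, just a convenient one for illustration). The function name and the payload are hypothetical; the 0xC1 byte is only a stand-in for the stray character, chosen because it can never appear in valid UTF-8.

```python
def first_invalid_utf8(payload: bytes):
    """Return (offset, byte value) of the first byte that breaks strict
    UTF-8 decoding, or None if the payload is clean."""
    try:
        payload.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError as err:
        return err.start, payload[err.start]
    return None


# Hypothetical response body with one stray byte in a description field.
body = b'{"description": "Midsize SUV \xc1 4WD"}'

hit = first_invalid_utf8(body)
if hit is not None:
    offset, value = hit
    print(f"invalid byte 0x{value:02X} at offset {offset}")
```

Knowing the offset makes it easy to see which field the bad byte lives in.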
So in the bowels of the web service framework that the app uses (again, not our choice) there is a piece of code that converts the body of the GET response into an NSString using UTF-8 as the encoding. The conversion failed and returned nil, since NSString is pretty strict about conversions. The code failed to deal with the nil, which eventually led to a crash. Fairly simple to understand. Even if we fixed our web service layer to filter out the bad character, I felt it necessary to handle it in the app anyway. I used the GNU libiconv C library to clean the data if the conversion failed.
Now I had to figure out why a crash affected the Branson search but nowhere else. We don't cache results in the app since it makes little sense, so that wasn't it. Clearly something had to connect these two things.
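The shape of that fix, strict conversion first and a cleanup pass only on failure, looks roughly like this. It is a sketch in Python rather than the actual Objective-C, and it swaps libiconv for Python's built-in `errors="replace"` handler, which substitutes U+FFFD for each undecodable byte. The function name and sample payload are made up for illustration.

```python
import json


def decode_body(raw: bytes) -> str:
    """Decode a response body as UTF-8, cleaning it only if strict decoding fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: keep the response usable by replacing bad bytes
        # with U+FFFD instead of giving up (the nil case in the app).
        return raw.decode("utf-8", errors="replace")


# Hypothetical payload with one stray byte in the description.
dirty = b'{"description": "Midsize SUV \xc1 4WD"}'
car = json.loads(decode_body(dirty))
print(car["description"])
```

Whether to replace or silently drop the bad bytes is a judgment call; replacing keeps a visible marker that something upstream is sending junk.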
So I repeated the crash and aftermath several times, starting from a clean simulator. I looked at the requests and compared them to each other and finally noticed the time of day was different after the crash. When the app was first run the time of day for the rental was initialized to 10 AM since there was no previous default. After the crash it was set to 8 AM.
Turns out that in Branson, there are no car rental agencies open at 8 AM! So this wasn't an incorrect response; the search failed because people in Branson don't like to get to work early.
The real issue was that the time of day was never being set in NSUserDefaults correctly, leading the second run of the app to set the time of day for the search to 0, which was converted to 8 AM.
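This kind of bug is easy to write because NSUserDefaults' integerForKey: returns 0 when a key has no value, so "never saved" and "saved as 0" look identical. Below is a small Python model of that behavior. The `Defaults` class, the `pickupSlot` key, and the idea that the stored value is a slot index where 0 means 8 AM are all my assumptions for illustration; the app's real mapping from 0 to 8 AM may have been different.

```python
DEFAULT_HOUR = 10      # first-run default described above
FIRST_SLOT_HOUR = 8    # assumption: slot index 0 corresponds to 8 AM


class Defaults:
    """Tiny stand-in for a key-value defaults store (hypothetical)."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        self._values[key] = value

    def object_for_key(self, key):
        # Returns None when the key was never saved.
        return self._values.get(key)

    def integer_for_key(self, key):
        # Like integerForKey:, a missing key silently reads as 0.
        value = self._values.get(key)
        return value if isinstance(value, int) else 0


def pickup_hour_buggy(defaults):
    # A missing value reads as slot 0, which quietly becomes 8 AM.
    return FIRST_SLOT_HOUR + defaults.integer_for_key("pickupSlot")


def pickup_hour_fixed(defaults):
    # Check for presence first, then fall back to the intended default.
    slot = defaults.object_for_key("pickupSlot")
    if slot is None:
        return DEFAULT_HOUR
    return FIRST_SLOT_HOUR + slot


store = Defaults()  # nothing was ever saved, as in the second run
print(pickup_hour_buggy(store), pickup_hour_fixed(store))
```

The other half of the fix is, of course, actually saving the value in the first place.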
So in the end the crash had nothing to do with the search failure at all. A classic red herring!
Debugging is fun stuff. I wish more people would write posts about some classic bug so we can all enjoy. Maybe someone should start a Hacker News thread. I have more, including the time back in 1985 when I thought I had found a bug in the 68000 processor...