The Fine Art Of Solving Strange Bugs
Solving strange, inexplicable, and downright bizarre errors in production apps or code is something every programmer has to do sometimes. It's often much like Dr. House trying to figure out what's wrong with a patient who has missing or conflicting symptoms.
It's also one of my favorite things to do. After three decades I've had a lot of experience dealing with nasty bugs. If I had a job where all I did was figure out how to fix unreal bugs I'd be a happy camper.
Both user-reported and internally identified problems can turn into a mystery, since you may have incomplete, misleading, unknown, or even incorrect information. It's like coming around a corner and seeing a car smashed into a tree: you weren't there, the driver is dead, and no one saw anything. What happened?
I like watching true crime and medical mystery shows because the process is so much like debugging an odd bug. Recently I saw a program about a woman who suffered for three decades with terrible diarrhea and vomiting at random times. Every doctor she went to found nothing wrong. Only during her first pregnancy did the symptoms go away, but not in future ones. Eventually she got swelling and redness in her hands; that doctor said it was allergies. Then she went to yet another doctor who also found nothing, but who asked one more question: was anything else happening? When she mentioned her hands the light dawned. It was a rare disorder where a missing component of the immune response system caused random inflammation, all triggered by hormonal changes. Until someone was able to combine the symptoms and add additional information from experience, the problem went undiagnosed; once it was, she could finally get symptomatic relief.
In my present, soon-to-be-former job (all tech gone, just a marketing company now) our mobile team had a server API, native apps, and mobile web. The server talked with the main company APIs, our parent company's APIs, and eventually out to thousands of external vendors' APIs. So when something went wrong, the problem could be anywhere.
One of my tasks was to follow App Store reviews, read user feedback, watch crash reports (we used Crashlytics), follow Google Analytics, and track sales (I wrote an app for that) to spot problems. Often I would be the first one to take a look at anything that appeared to be wrong.
One interesting bug surfaced when a few customers gave us a bad review in the App Store: they had entered a city name like New York into our location search and gotten back "No results found". Some in the group figured it was just a bad internet connection, but I didn't think so. A few people tried it and of course it worked fine.
When you see an error that seems unlikely and rare it's easy to dismiss it as a "glitch" or, as we used to do, blame it on a "cosmic ray particle". But anytime a user takes time out to complain about something not working, I prefer to think of it like crime statistics: crime is often underreported because people are fearful, don't think it matters, or don't want to deal with the consequences, which means you never know the actual magnitude. Everything that goes wrong has a reason, but you might not yet know what it is, and the lack of reporting can mask a real problem. If a bug is transient it might not happen often to any one person while actually affecting a lot of people.
In this case the location service was upstream from our servers. I wrote a little test app that called the service in a loop for a long time. Eventually the problem appeared: during the timeframe when the error could happen, it occurred around 15% of the time. I went to the people who supported this service to see what the configuration was. It turned out there was a load balancer with six servers supporting our usage. They rebooted the servers and the problem went away. I wasn't convinced, since the reason was still unknown, but I had to move on.
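That test app was essentially just a loop. A minimal sketch of the idea in Swift, with a stand-in URL and a deliberately simplified "no results" check (the real service and its response format aren't shown here):

```swift
import Foundation

// Stand-in URL; the real location service endpoint is an assumption.
let url = URL(string: "https://api.example.com/locations?q=New%20York")!
let attempts = 1000
var failures = 0

for _ in 0..<attempts {
    let done = DispatchSemaphore(value: 0)
    URLSession.shared.dataTask(with: url) { data, _, error in
        // For this sketch, treat a transport error or an empty body
        // as a "no results" failure.
        if error != nil || (data?.isEmpty ?? true) { failures += 1 }
        done.signal()
    }.resume()
    done.wait()
}
print("failure rate: \(Double(failures) / Double(attempts))")
```

Run long enough, a failure rate hovering near one in six points straight at one bad server behind a six-way balancer.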
Eventually it happened again and more bad reviews came in. I reran the test app and got exactly the same result. Now we had to find out why. Since you got no results around one sixth of the time, one of the six servers had to be bad. The load balancer only sent traffic to a running server; however, it didn't bother to check whether results were actually returned, just that the server was up. They looked at the code and found the service could occasionally start up and fail to load any data. Since it only happened on occasion, a user seeing the no-results error would likely retry and then hit a working server. Only the rare people who got two or three failures in a row complained. They fixed the actual bug and the failures stopped.
This was a case where something that seemed like a glitch was a genuinely bad customer experience, even though it wasn't obvious at first how bad it was.
One case I heard about in the main web app went on for months before anyone noticed. If a user entered a name with a period at the end, as in "Joe Smith Sr.", the app would throw an exception. No one ever looked at the logs closely enough to categorize this exception (lots of others happened too). It wasn't until they started using a better client-side usage tracker that it became visible. If you don't have a process for watching your logs you might miss a terrible problem.
If no one reports a bug, does it still piss off customers? Of course!
Another really bizarre bug came to us via the in-app feedback system. A user complained that all of a sudden, when he picked a date for service in a known location in our iPad app, he got no results found, which should have been impossible (it wasn't this service, but imagine searching for Starbucks in NYC and having no coffee available). No one else reported it. Our product manager talked with him on the phone, but the only information he got was "no matter what date I pick from the calendar the result year is X", where X was a small number. It made no sense.
On iOS we used the system-supplied NSDate and NSDateFormatter, so having the year go berserk seemed impossible. I worked on and off on this for a while but didn't understand it. Eventually he reported the problem had suddenly gone away. Our product manager talked with him again and we found out he had switched the iPad to the Japanese (or maybe Buddhist, I forget) calendar for a while. That turned out to be the key piece of information. Our calendar picker didn't depend on formatting the date but simply started with a raw NSDate, so it showed the correct information for a Gregorian calendar. But in converting the raw date into text, NSDateFormatter used the system calendar by default; the resulting string was in the other calendar, where the year was something real-world like the Emperor's year of reign. That date was then sent to our servers with a year from the early Roman Empire, where of course there was no service.
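Here's a small Swift sketch of the mechanism (Date and DateFormatter are the modern names for NSDate and NSDateFormatter; the outputs in the comments are illustrative):

```swift
import Foundation

let date = Date()  // Date (NSDate) itself is calendar-agnostic

let formatter = DateFormatter()
formatter.dateFormat = "MM/dd/yyyy"

// DateFormatter defaults to the user's system calendar. Simulate a
// device that has been switched to the Japanese era calendar:
formatter.calendar = Calendar(identifier: .japanese)
print(formatter.string(from: date))  // era year, e.g. "0027" for Heisei 27

// Pinning the calendar and locale makes the output independent of the
// device setting — which would have been the per-formatter fix:
formatter.calendar = Calendar(identifier: .gregorian)
formatter.locale = Locale(identifier: "en_US_POSIX")
print(formatter.string(from: date))  // the expected Gregorian year
```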
The end result: since we only really supported the standard calendar and our business was US-only, it didn't matter much, and in any case the iPad code (which we had inherited from a third party) had hundreds of places that would have needed changes, which wasn't worth it to support something this rare.
Another strange bug came in when someone reported that they chose a date for service in the early part of the following year and the month was off by one. This happened soon after a new release went out. No one had ever seen such a bug in testing and it made little sense: only the month was off, everything else was correct. I played with the calendar and was able to reproduce it, but only for one month. Crazy.
So this time I did a diff on the version. There were two of us working on it, so I didn't know all the changes off the top of my head. I saw a change to an NSDateFormatter string, "MM/dd/YYYY", which seemed not quite right. I looked up what "YYYY" was and found it requests the week-based year used for ISO 8601-style week numbering. This special (and really odd, but for a good reason) year doesn't always match the regular "yyyy" calendar year, particularly around the new year. Changing the format to the correct "yyyy" fixed the problem. I've answered this question on Stack Overflow a number of times; it's such an easy mistake to make.
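You can see the difference with any date in the last few days of December; a minimal Swift demonstration:

```swift
import Foundation

let gregorian = Calendar(identifier: .gregorian)
// Dec 29, 2014 was a Monday, so it falls in week 1 of 2015.
let date = gregorian.date(from: DateComponents(year: 2014, month: 12, day: 29))!

let weekBased = DateFormatter()
weekBased.calendar = gregorian
weekBased.locale = Locale(identifier: "en_US_POSIX")
weekBased.dateFormat = "MM/dd/YYYY"     // week-based year

let calendarYear = DateFormatter()
calendarYear.calendar = gregorian
calendarYear.locale = Locale(identifier: "en_US_POSIX")
calendarYear.dateFormat = "MM/dd/yyyy"  // ordinary calendar year

print(weekBased.string(from: date))     // "12/29/2015" — off by a year
print(calendarYear.string(from: date))  // "12/29/2014" — correct
```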
Both of these show that bugs can be really confusing until you get enough information to make a diagnosis. You can't discount a problem just because you don't understand it; it might just be that some crucial bit of information is still missing.
Back in 1987, with our app Trapeze (which we wrote and marketed), we didn't have the benefit of the internet, of course, so our support channels were phone calls and snail mail. One day a user called and left a message saying that all of a sudden he couldn't launch Trapeze, either by itself or from a document, without it crashing immediately. That was the entire message. I fixed it an hour later.
How did I do that with so little information? By knowing the code backwards and forwards. This was in the days of people running from floppies and sometimes from a hard drive, so it could have been bad media; we could have just sent him a new install disk.
But what I knew was the launch sequence. The preferences file was read and validated; if the data was invalid the app was supposed to ignore it and start a fresh one. Since it crashed instead, I assumed some combination of options wasn't being handled. It was just a matter of going through all of them until I found a rare combination that fell through the validation and led to a crash. The fix was easy.
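The pattern is the classic validate-or-reset on load; a tiny modern sketch in Swift (the preference fields here are made up, and Trapeze certainly didn't use JSON):

```swift
import Foundation

// Hypothetical preferences; the real Trapeze options are not shown.
struct Prefs: Codable {
    var fontSize: Int
    var showGrid: Bool
}

let defaultPrefs = Prefs(fontSize: 12, showGrid: true)

// Read the preferences file, but fall back to safe defaults on *any*
// unreadable data or invalid combination instead of crashing at launch.
func loadPrefs(from url: URL) -> Prefs {
    guard let data = try? Data(contentsOf: url),
          let prefs = try? JSONDecoder().decode(Prefs.self, from: data),
          (6...72).contains(prefs.fontSize)  // reject out-of-range values
    else {
        return defaultPrefs
    }
    return prefs
}
```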
Fixing bugs is much easier when your knowledge of the codebase is deep enough to fill in the blanks of a user report. I have never been a fan of having programmers work on anything and everything in a codebase instead of each focusing on an area where they can maintain a deep understanding of the code. That way your experience with the code helps you fix bugs quicker. It's not always possible, but it does help.
Another problem we continually ran into at my current job was insufficient logging. Our mobile server API did not log enough information to solve complex problems in its interactions with the backend services. No one had taken the time to imagine and plan how logging would be used when things went wrong, a step that is immensely important long before you ever find a bug. In our case one important service only logged data when nothing was wrong; if any exception occurred it logged nothing at all, making problem solving almost impossible. You have to get this right up front; once there is a problem it's too late, unless you own a time machine.
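In sketch form the fix is almost embarrassingly simple. Assuming a Swift-style service call (all the names below are made up, and print stands in for a real logger), the point is that the failure path records at least as much as the success path:

```swift
import Foundation

// Hypothetical types standing in for the real backend service.
struct QuoteRequest { let id: String; let city: String }
struct Quote { let price: Double }
enum BackendError: Error { case timeout }

func callBackend(_ request: QuoteRequest) throws -> [Quote] {
    throw BackendError.timeout  // stub: pretend the backend failed
}

func fetchQuotes(_ request: QuoteRequest) throws -> [Quote] {
    print("fetchQuotes start id=\(request.id) city=\(request.city)")
    do {
        let quotes = try callBackend(request)
        print("fetchQuotes ok id=\(request.id) count=\(quotes.count)")
        return quotes
    } catch {
        // The failure path is exactly where you need context the most:
        // record the inputs and the error before rethrowing.
        print("fetchQuotes FAILED id=\(request.id) error=\(error)")
        throw error
    }
}
```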
Another thing we did to make tracking a set of calls in the logs possible (where logging was actually useful) was to add four parameters to every call an app made to our mobile servers. We called them DAVE: device-id, app name, version, and user id. The device-id was generated by us and saved in the application preferences. This way we could track all the calls someone had made, enough to identify the path of a problem. The call center would often want to know if a customer was due a refund but collected very little information; we could cross-reference what they had with our DAVE data. Eventually we automated the whole thing so they could do the search themselves.
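A minimal Swift sketch of that idea (the header names and preferences key are assumptions, not our actual ones):

```swift
import Foundation

// Generate the device-id once and keep it in the app's preferences.
func deviceID() -> String {
    let key = "deviceID"  // hypothetical preferences key
    if let id = UserDefaults.standard.string(forKey: key) { return id }
    let id = UUID().uuidString
    UserDefaults.standard.set(id, forKey: key)
    return id
}

// Attach the four DAVE values to every call to the mobile servers so
// any log line can be tied back to a device, app, version, and user.
func daveRequest(url: URL, appName: String, version: String,
                 userID: String) -> URLRequest {
    var request = URLRequest(url: url)
    request.setValue(deviceID(), forHTTPHeaderField: "X-Device-ID")
    request.setValue(appName, forHTTPHeaderField: "X-App-Name")
    request.setValue(version, forHTTPHeaderField: "X-App-Version")
    request.setValue(userID, forHTTPHeaderField: "X-User-ID")
    return request
}
```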
The important thing to take from all this is that users who report bugs are doing you a favor, so treat every report as important. If you want an app known for quality you have to not only build it that way in the first place but also plan for and understand how to deal with the inevitable issues. The definition of quality I like is "an application should always do what it is supposed to do and never do what it is not supposed to do". In the real world you will still get the latter. Better get good at it!
Like in House, it's not always lupus.
Another bug fix story: Fixing a Nasty Physically Modeled Engine Bug in an FPS Game.