Before You Fix The Bug Make Sure You Understand It Exactly

Jul 8, 2015

I am supporting the company I worked for until recently, part time, until they hire someone else (I am taking time to work on my house). Today the server guy reported that the beta iOS app (which I support; there is also an older production version) had generated a duplicate UUID. He immediately proposed how the server could report this as an error and let the client handle it (we also have Android and web clients).

The issue of course is that it had to be some other problem, since the odds of a duplicate random UUID are something like "generating a billion every second for 100 years gives you a 50% chance", assuming your generator is based on a sufficiently good random source. I thought we should actually figure out the real problem first, since assuming the impossible is rarely a good starting point.
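To put a rough number on those odds, here is a minimal sketch using the standard birthday-bound approximation. It assumes version 4 (random) UUIDs, which carry 122 random bits, and a perfectly uniform generator; the function names are mine, purely for illustration.

```python
import math

RANDOM_BITS = 122            # a version 4 UUID fixes 6 of its 128 bits
SPACE = 2 ** RANDOM_BITS     # number of distinct random UUIDs


def uuids_for_collision_probability(p: float) -> float:
    """Approximate how many random UUIDs give a collision with probability p."""
    return math.sqrt(2 * SPACE * math.log(1 / (1 - p)))


def collision_probability(n: float) -> float:
    """Approximate chance of at least one collision among n random UUIDs."""
    return -math.expm1(-n * n / (2 * SPACE))


needed = uuids_for_collision_probability(0.5)
years = needed / 1e9 / (365.25 * 24 * 3600)
print(f"{needed:.2e} UUIDs, about {years:.0f} years at a billion per second")
print(f"chance among a million UUIDs: {collision_probability(1e6):.1e}")
```

Under these assumptions the 50% point lands at roughly 2.7e18 UUIDs, or somewhat under a century at a billion per second, which is the same ballpark as the figure quoted above. For the number of records a single app generates, the chance is so small that a real bug is by far the more likely explanation.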

So he looked at the server logs to find out who first sent that UUID. It turned out to be the same device that sent it the second time, shortly after the first, both times as a POST. Normally an update is done with PUT; POST assumes the data has never been sent before.

Looking at my code I could see that the iOS code (this portion I didn’t write) stored an id in its database when a POST succeeded, and if an id already existed it would send a PUT instead. The only pathway that explained a double POST was that the server had stored the data but failed to return an id, so the next send (the data in question is designed to be updated over time) assumed the data was still new and used POST again. Why the server is leaving the id out of the JSON response is still being investigated.
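The pathway described above can be reproduced with a small simulation. This is only a sketch of the logic as I described it, not the actual client or server code: the class names, the `drop_id` flag and the response shape are all hypothetical.

```python
class FlakyServer:
    """Hypothetical server: stores every POST but may omit the id in its reply."""

    def __init__(self, drop_id_on_first_post):
        self.drop_id = drop_id_on_first_post
        self.log = []          # (method, uuid) pairs, standing in for the server log
        self.next_id = 1

    def post(self, uuid, data):
        self.log.append(("POST", uuid))   # the data is stored either way
        record_id = self.next_id
        self.next_id += 1
        if self.drop_id:
            self.drop_id = False
            return {}                     # stored, but no id in the response
        return {"id": record_id}

    def put(self, record_id, uuid, data):
        self.log.append(("PUT", uuid))
        return {"id": record_id}


class Client:
    """Hypothetical client: POST when no id is stored locally, PUT otherwise."""

    def __init__(self, server, uuid):
        self.server = server
        self.uuid = uuid
        self.record_id = None   # stands in for the id kept in the local database

    def send(self, data):
        if self.record_id is None:
            response = self.server.post(self.uuid, data)
            self.record_id = response.get("id")   # stays None if the id is missing
        else:
            self.server.put(self.record_id, self.uuid, data)


def run(drop_id):
    server = FlakyServer(drop_id)
    client = Client(server, "example-uuid")
    client.send({"answer": 1})
    client.send({"answer": 2})
    return [method for method, _ in server.log]


print(run(drop_id=False))   # normal case
print(run(drop_id=True))    # server forgets the id once
```

In the normal case the log shows a POST followed by a PUT. When the id is dropped once, the same UUID arrives in two POSTs, which from the server's side looks exactly like a "duplicate UUID".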

This is just a simple example, but the art of fixing bugs really does require that you not jump to a fix until you completely understand the problem.

I’ve covered a few cases like this in The Fine Art Of Solving Strange Bugs and Fixing a Nasty Physically Modeled Engine Bug in an FPS Game. These are worth reading if you haven’t seen them.

Over the decades I’ve seen a lot of people try to fix things based on guesses and a need to fix things fast, both likely to result in partial fixes or more bugs.

You have to consider not only the bug itself and what circumstances cause it, but also any potential fix and what that might do to the code, before you commit to changing anything. Sometimes this can be time consuming or tedious. Today I read a debugging story, An Apple Push Notifications Debugging Story, that was pretty intense. Really understanding why something doesn’t work isn’t always this hard, but you never know in advance when you can safely take a shortcut. Experienced programmers know debugging can be very hard and that being patient is the first requirement.

While we were working on the last version of Deltagraph we did (3.0), a few programmers at the publisher helped us out. Back in those days (early 90’s) we didn’t have a code repository, and we swapped code back and forth by FedExing hard drives and merging by hand. I found someone had fixed a bug in the most obvious place but it broke several other things, so I removed the fix and put it in the right place. The next version I got had the fix put back in, so this time I just commented it out and said “this is the wrong place, it breaks all these other things” and thankfully that stayed put.

I don’t know who made that fix but it was clear they saw the bug, looked at the obvious place, made the fix and tried it once. Fixing a bug can sometimes be harder than finding it in the first place. If you don’t determine exactly what the fix will affect, whether because you are lazy, or in a hurry, or have never seen the code before and miss the connections, you don’t know what that code interacts with. Ideally all code should be relatively uncoupled, but in practice that isn’t always possible, and this code was written in C 25 years ago, so it was nowhere near as decoupled as code tends to be today.

I always try to understand all the pathways my code goes through, both when I am writing it and afterwards when I have to fix something. It’s much easier to spend time deeply understanding what the code does and what your change will do than to fix even more bugs later. But it’s easy to skimp on the details, especially when users are clamoring for a fix now.

A bug in the production app (the one above is in a radically different future beta) was reported to me today. This app takes surveys (data collection) based on JSON sent by a server, and the JSON is generated by a web app which supposedly shouldn’t allow incorrect or inconsistent data. The data includes branching, so you could answer a question one way and jump to another question, or answer it another way and jump somewhere else.

The user reported that they answered a question and the app jumped to the last question, leaving them unable to do their job. Of course this was on iOS so I took a look. Now this is really old code, 99% of which I haven’t changed and didn’t write, and this doesn’t normally happen.

I was told the Android and Web apps didn’t have any problems with the same JSON.

I reproduced the bug (always a good start!) and then looked at the JSON data. The question the user was on clearly showed something like this, in pseudocode: if (1) goto X else goto X.

The iOS code on such a branch will look for the targeted question X by looping through all the questions. It would stop and return the matching question. If it got through the entire loop without a match it would return the last question seen, which of course is the last question! Getting through the entire loop and not finding something was by definition not supposed to be possible.
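The shape of that lookup is easy to show in a few lines. This is a sketch in Python rather than the app's actual code, with made-up question ids and function names; the second function is one possible alternative that treats a missing target as an error, not something the app does.

```python
def find_question_buggy(questions, target_id):
    """Mirrors the described behavior: a miss falls out of the loop with the last item."""
    found = None
    for question in questions:
        found = question
        if question["id"] == target_id:
            break
    return found   # on a miss, this is simply the last question seen


def find_question_strict(questions, target_id):
    """Hypothetical alternative: a branch target that does not exist is an error."""
    for question in questions:
        if question["id"] == target_id:
            return question
    raise LookupError(f"branch target {target_id!r} is not in this survey")


survey = [{"id": "Q1"}, {"id": "Q2"}, {"id": "Q3"}]

print(find_question_buggy(survey, "Q2"))   # a valid jump finds Q2
print(find_question_buggy(survey, "X"))    # a dangling jump silently lands on Q3
```

The "impossible" case is handled implicitly by whatever the loop variable happens to hold, so the code never gets a chance to notice that anything went wrong.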

I reported my findings and the server programmer said that old code (production code goes back many years) has a complex copy operation which apparently can fail by not updating all the jumps. In this case question X did not exist in this survey, so the jump was essentially pointing into space.

Why did the other apps not fail? Those codebases were implemented independently, and at that time assumed that an error in a branch might be possible; they decided if a branch wasn’t found to just move to the next question, which in this case was correct, but usually isn’t.
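For comparison, here is a sketch of the fall-through strategy as I understand the other clients to use it. I have not seen their code, so the function name, the index parameter and the data shape are assumptions for illustration only.

```python
def next_question_lenient(questions, current_index, target_id):
    """Follow a jump if its target exists; otherwise continue with the next question."""
    for question in questions:
        if question["id"] == target_id:
            return question
    # Dangling jump: ignore it and carry on in order (assumed behavior).
    fallback = current_index + 1
    return questions[fallback] if fallback < len(questions) else None


survey = [{"id": "Q1"}, {"id": "Q2"}, {"id": "Q3"}]

print(next_question_lenient(survey, 0, "Q3"))   # valid jump from Q1 to Q3
print(next_question_lenient(survey, 0, "X"))    # dangling jump from Q1 falls through to Q2
```

This hides the bad data just as thoroughly as the iOS behavior does. It merely happens to guess right when the broken jump should have pointed at the following question.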

So understanding the complete problem is required before you can ever decide on a fix. Sometimes what you imagine the issue to be at first is a complete red herring. When you try to explain a problem away without knowing the whole truth, you can make yourself look pretty silly later when it turns out to be wrong. Making fixes without understanding the whole problem can certainly lead you to even worse problems.

Today the NYSE was shut down due to (apparently, not confirmed at the moment) a software upgrade. Working on systems that complex is very hard and sometimes it’s almost impossible to clearly understand every last thing that can happen, making fixing things a scary prospect. Add the urgency (which may or may not be warranted) that comes with high-profile software, plus the pressure or even horrible overtime that follows, and debugging can become almost impossible. I’ve had to debug production problems in software that cost money every minute it wasn’t working, and that is incredibly hard. Staying patient and focused and still doing it right requires enormous effort, especially when everyone is yelling.

I always wondered what it would have been like if the Star Wars Defense System that the US Government wanted to build in the 1980’s had ever been put into production, something went wrong when missiles started to fly (or the software told you they had), and you had to debug the problem or die in the next 5 minutes. I guess if you couldn’t fix it your next annual evaluation wouldn’t ever happen!

So take time and be thorough and never cut corners. Debugging can be fun, but chopping heads off a hydra isn’t.