A Most Interesting Bug
This is the story of a fun afternoon spent tracking down a mystery that manifested as four different crashes, with no initial idea of what the cause could be.
My team supports code in two of our apps and is also working on a huge new project, something the company has never done before, so the five of us are quite busy.
One code base, in Objective-C, was written by an outsourced team before I even got here three years ago; the other was written entirely in Swift by me and another programmer and was released in mid-2017.
I check our crash reporting tool for both apps and read the App Store reviews daily, something I've been doing since I started iOS development long ago.
Our apps are built by a dozen or so teams out of dozens more repositories and have far too many features. The quality is usually rather average for both apps. However, about three months ago the crash rate, which had been hovering around 0.5-1.0%, suddenly dropped to 0.02%, and everyone cheered: all the insane process, deadlines, and other rigamarole we were (and are) subjected to had finally paid off!
Except I did not believe it could possibly be true; there was no way this overly complicated, bloated app could be that good. The best complete iOS app I ever released (back in the iOS 7 days, when the travel company I worked for was featured briefly in the keynote) had a lifetime crash rate of 0.07%.
Furthermore, the App Store reviews over the same timeframe were brutal: often 29 out of 30 daily reviews would be one or two stars and complain of constant crashes. These two facts could not both be true, and I believe our customers know what a crash looks like, especially a repeating one.
A few weeks ago a new release prompted a lot of people at work to complain that when they used one of the apps it crashed a lot, and finally people started to look into both why it was crashing and why our crash reporting vendor was not showing the same thing. Someone sent the vendor a support request asking why.
Oops. It turns out that several months ago they had implemented a de-dupe step in their crash reporting to remove duplicate reports, and apparently it was eliminating 99% of our crashes! When they fixed it two weeks ago, our crash rate was suddenly back to 0.7-1.2%.
This story is not about an app monitoring vendor who apparently failed to monitor their own app; that’s just the setup as finally we could see what was crashing.
The Swift codebase is the higher quality of the two; whole releases often go by without a single defect reported, and it had rarely crashed before, other than the usual random noise. But during the period when we had no visibility, four crashes appeared in our area of the code. Together they accounted for around 10-12% of all crashes, which one could live with at an overall rate of 0.7%, but my expectations are higher.
So I decided we should take a look and try to figure out what was going on. The strange thing was that all four crashes were in different places, yet they were always grouped together, each with similar numbers. Every day (well, at least since we started seeing crashes again) the same crashes appeared, at the same rate and in the same relative order.
That’s a really strange coincidence, which of course means it isn’t one. I’ve been shipping stuff for more than three decades, so I’ve seen a lot of weird errors. This one smelled of a single common problem causing different results because the crashes happened far removed from the actual problem, like driving down the road and seeing a car upside down with no idea how it got there.
There had been several releases since the crash reports went into hibernation, and not a single crash had been reported during testing in stage (either by our testers or the full-app testers), so it had to be something unique to production.
A service call failure wouldn’t make much sense: even with our sometimes recalcitrant services (the app calls an enormous variety of them), the odds of a minor failure recurring every single day at the same rate seemed remote. It had to be some user action that only occurred in production, but our code has thorough automated testing and excellent testers; it didn’t seem likely someone would overlook pushing a button.
So how to trigger this bug in stage, where no one had ever seen it? Note that we could not test in production with a debugger, as this app does not allow a controlled test.
Now our code in this app is a flow, with a common starting point and various paths until you reach an ending point where you confirm a choice. Usually one suspects things like clobbered memory, but in Swift that is very unlikely, although we do interact with other frameworks that are Objective-C, so one of those could have been leaving the land mine. But the everyday regularity of the four crashes didn’t fit a random error like that.
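To make the rest of this concrete, here is a minimal sketch of what a flow like this looks like in UIKit. The names (FlowCoordinator, FlowStartViewController, ConfirmViewController) are hypothetical stand-ins, not our actual classes:

```swift
import UIKit

// Hypothetical stand-ins for the flow's screens (not our real class names).
final class FlowStartViewController: UIViewController {}
final class ConfirmViewController: UIViewController {}

// Minimal sketch of the flow: each step is pushed onto a shared navigation
// stack, from a common starting screen to a final confirmation screen.
final class FlowCoordinator {
    private let navigationController: UINavigationController

    init(navigationController: UINavigationController) {
        self.navigationController = navigationController
    }

    func start() {
        navigationController.pushViewController(FlowStartViewController(), animated: true)
    }

    func advance(to step: UIViewController) {
        navigationController.pushViewController(step, animated: true)
    }

    func finish() {
        navigationController.pushViewController(ConfirmViewController(), animated: true)
    }
}
```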
The server-side code we call is currently different in stage and production, and there are various switches that influence the flow, so maybe something had been fixed in stage (though nothing had been released recently).
So we were trying various crazy things to make something happen (I call this Poke and Hope, from when I learned to play pool in said fashion) when one programmer mentioned he had gotten the app to crash while trying something unrelated. He described the steps, and then made a video to document them. After going through the flow a second time, on the last step the app displayed an error reporting a service failure (not an alert view, but inline) which we had never seen in stage before. Furthermore, dismissing the error returned the app to the beginning of the flow, but with a forward transition.
After a bit I remembered that the product folks at the time had insisted that the transition go forwards even though you are really going back to the beginning. It was not a big deal, since this error was expected to be extremely rare, and it had been tested by checking that the flow went forwards back to the start, which it did.
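Sketched with the same hypothetical types as the earlier snippet, the original error handling looked roughly like this: on the service error, push a fresh copy of the start screen so the transition animates forwards, leaving the old flow on the stack underneath.

```swift
// Original (buggy) behavior, sketched: on the rare service error we push a
// brand-new start screen so the animation goes forwards. The old flow's view
// controllers remain alive on the navigation stack underneath it.
func handleServiceError(on navigationController: UINavigationController) {
    navigationController.pushViewController(FlowStartViewController(), animated: true)
    // Stack is now roughly [Start, Step..., Confirm, Start]: two live flows.
}
```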
The key was that this error should never happen in stage, as it is a production-only thing (aka the real world). So why were we seeing it? And was this the same behavior as the production crashes?
After repeating the steps a few times and watching the flow go forwards, this started to look like the culprit. The user would get this error (which apparently had been increasing over the past few months while we were blind), go forwards to the start of the flow, and try again. The problem is that the flow is now on the navigation stack twice, so the various notifications that fire are being fed to two different sets of view controllers. When the code originally shipped this apparently caused no issues, but of course it was rarely if ever triggered, and the code has changed over time. In normal behavior the flow always starts over with a nice fresh stack.
Depending on what path the user took through the second flow, eventually something they did would update state in two places, and ultimately something got confused and led to one of the four crashes.
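Here is a sketch of why the duplicated stack matters; the notification name and observer code are made up for illustration, but the shape is the same: each live step registers for the flow's notifications, so with two copies of the flow on the stack the same event gets handled in two places.

```swift
import UIKit

// Hypothetical notification posted as the user makes choices in the flow.
extension Notification.Name {
    static let flowSelectionChanged = Notification.Name("flowSelectionChanged")
}

final class StepViewController: UIViewController {
    override func viewDidLoad() {
        super.viewDidLoad()
        // Every live instance observes the notification; none of them expects a twin.
        NotificationCenter.default.addObserver(
            self,
            selector: #selector(selectionChanged(_:)),
            name: .flowSelectionChanged,
            object: nil
        )
    }

    @objc private func selectionChanged(_ note: Notification) {
        // With the flow on the stack twice, this runs in both copies of each step,
        // each one mutating its own, now diverging, view of the flow's state.
    }
}
```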
The reason we were seeing the service error in stage was that the current stage services had the wrong switch turned on, which was triggering an error that should otherwise never have occurred! Without it we might have poked forever and found nothing.
Now that we could see the problem, the fix was easy: go back to the beginning the right way, by popping back to the start of the flow. All four crashes are now fixed.
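Sketched with the same hypothetical types as above, the fix amounts to popping back to the existing start screen instead of pushing a new one, so only one copy of the flow ever lives on the stack:

```swift
// Fixed behavior, sketched: pop back to the flow's original start screen so the
// old steps are torn down and only one copy of the flow remains on the stack.
func handleServiceError(on navigationController: UINavigationController) {
    if let start = navigationController.viewControllers.first(where: { $0 is FlowStartViewController }) {
        navigationController.popToViewController(start, animated: true)
    }
}
```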
A fun afternoon of debugging. I rarely get to do anything like this, since it's always better not to have any problems in the first place, and that is always a priority!