Monitor Or Fail

Soon after I started my present (soon to be former) job on our mobile team, the product manager suddenly discovered that one of our lines of business had had no sales in the past month.

Of course "sudden" and "the past month" sound like an oxymoron.

The situation at the time was that sales were recorded only in the upstream systems. We had our own mobile API which translated calls to and from the upstream ".com" systems for use by our mobile products. Although we managed our own servers, we had zero control over, or even input into, what those upstream folks did. Thus we were handed a report at the end of each month telling us what our sales had been, as if we didn't matter.

Naturally this was a shock, so we decided to build our own database of sales; since every sale passed through our servers, we simply recorded each one's details. One guy added the support and the database, and then I built an iOS app to collect and display the up-to-the-minute information. We didn't ask for permission either.
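The mechanics were nothing exotic: as each sale passed through our layer, write down its details before handing the response along. Here is a minimal sketch of that idea; the table schema and every name in it are my own illustration, not what we actually shipped:

```python
# Record a copy of each sale as it passes through our API layer.
# Names and schema here are hypothetical, for illustration only.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("sales.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        recorded_at TEXT,       -- when we saw the sale
        line_of_business TEXT,  -- which product line it belongs to
        order_id TEXT,          -- upstream order identifier
        amount_cents INTEGER    -- sale amount in cents
    )
""")

def record_sale(line_of_business, order_id, amount_cents):
    """Store our own copy of a sale that passed through our servers."""
    conn.execute(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(),
         line_of_business, order_id, amount_cents),
    )
    conn.commit()
```

Once the data is yours, a dashboard or an app on top of it is the easy part.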

My manager and I would look at the app many times a day (he called it his cocaine app) and we got pretty good at understanding the ebb and flow of sales in each area for each day of the week. It was easy to tell when a problem was happening. When we noticed something amiss, someone would analyze logs and other information, and usually it came down to something upstream being broken. If it was fixable in our layer we just fixed it ourselves and deployed the fix once it looked correct; waiting for the upstream to be fixed was often far too slow. After that, problems rarely caused customer distress.

The point of this long story is that what you monitor you can correct when things go wrong. What you don't know about will bite you in the ass.

One of our groups was sold to another company. We had been managing the code on a server for one part of their function, and when the sale happened we told them they needed to move it to the new company's servers. A couple of months later we got a panicky call that the (unmoved) server wasn't responding and their customers were angry.

The server had been in a data center our company had mostly abandoned in favor of our parent company's. Someone saw this server, had no idea what it was for, killed it, and wiped the hard drive. A week later we got the call. They hadn't been monitoring this server at all, unless you consider customers a monitoring service.

We were able to build a new one for them as a courtesy, but the point again was that no monitoring led to nothing functioning.

Not only is monitoring useful for keeping things running, it can improve security as well.

Target was a great story in ineptitude, but one thing got me especially: they had installed a security monitoring system of some kind, yet for months no one looked at the logs it produced, which would have told them they were being hacked. Monitoring is great, but if you don't look at it, it's pointless. Your monitoring systems have to be easy for people to see and understand or you are simply spinning CPU time. I made my app show everything on one screen: the state for today, with tabs for yesterday and the day before, so it was easy to digest. Make it easy enough, or better yet automate it as much as possible, and maybe people will actually pay attention.

Today I hassled with my new mortgage company's online site. I recovered the password since I had forgotten it (they emailed it to me, security fail 101) and decided to change it. I entered a new password and promptly got a 500 error with a default ASP error page, the kind that warns you it should be replaced with a custom one. Now, the software for this site is version 4.XXXX. Has no one noticed that every password change is followed by a 500 error in the logs? Has no one ever QA'd a password change? Apparently not. The password change was applied, but you couldn't tell. Logging which is ignored leads to people writing up your work in a blog post.
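Even a crude pass over the access logs would surface this sort of thing. A rough sketch of what I mean, with the log format and endpoint path assumed purely for illustration:

```python
# Count 5xx responses per endpoint in a common-format access log.
# The regex and the password-change path are assumptions, not the
# mortgage site's actual setup.
import re
from collections import Counter

LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def count_server_errors(log_path):
    errors = Counter()
    with open(log_path) as f:
        for line in f:
            m = LINE.search(line)
            if m and m.group("status").startswith("5"):
                errors[m.group("path")] += 1
    return errors

# e.g. count_server_errors("access.log") showing every hit on
# /account/change-password ending in a 500 would be hard to miss,
# if anyone were looking.
```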

Of course some systems are extraordinarily complex and monitoring them is an engineering problem in itself. But then again, even simple systems can be monitored in an equally simple-minded fashion.

Our upstream team had a location service that we had to use; it returned cities and other locations, plus some details, from a partial query. One day I saw an App Store review where a user complained that he entered New York City and the app told him "No Results Found", which he thought was stupid.

So did I. Someone tried a few queries and they all worked, so it was assumed the user just had a bad network. I didn't buy it and suspected something was actually wrong. I asked about the system design and found that the location service was running on four machines behind a load balancer. By running a number of automated queries I was able to get the problem to occur about 25% of the time.
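The brute-force check was nothing fancy; something along these lines, with the URL and response shape as placeholders rather than the real service:

```python
# Hit the search endpoint repeatedly and tally how often a query that
# should always succeed comes back empty. URL and JSON shape are
# placeholders for illustration.
import json
from urllib.parse import quote
from urllib.request import urlopen

def empty_result_rate(base_url, query="New York", attempts=100):
    empty = 0
    for _ in range(attempts):
        with urlopen(f"{base_url}/locations?q={quote(query)}") as resp:
            results = json.load(resp).get("results", [])
        if not results:
            empty += 1
    return empty / attempts

# With four backends behind the load balancer and one of them serving
# nothing, you'd expect a rate of roughly 0.25.
```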

The people monitoring the servers said everything was working on their end, but after some higher-level complaints they agreed to look a little closer. It turned out their monitor only tracked whether the service process was running, not whether it returned meaningful results. Sometimes when the service started up it failed to load its data but ran anyway, returning empty results. They happily monitored the system, just not in any useful way.
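The monitoring fix, in sketch form, is to check the answer, not just the process: a liveness check asks "is it running?", while a content check asks "does a query everyone knows should succeed actually return data?" The endpoint and payload shape here are again assumptions:

```python
# Content-aware health check: a query that must return results.
# Run it against each backend machine directly (not just through the
# load balancer) so the one node that started without its data is flagged.
import json
from urllib.request import urlopen

def healthy(base_url):
    try:
        with urlopen(f"{base_url}/locations?q=New%20York", timeout=5) as resp:
            results = json.load(resp).get("results", [])
    except (OSError, ValueError):
        return False  # unreachable or unparseable counts as unhealthy
    return len(results) > 0  # empty results count as unhealthy too
```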

So monitor everything, make the status easy to understand so people will pay attention, and make sure your monitoring is actually checking the right thing. Your customers might never notice anything went wrong, and that's exactly what you want.