WTF IT Stories #1: It's Not The Database, Stupid!

Feb 10, 2007

I went to work for a financial services company and soon found myself in the middle of a long war between the AS/400 (now called the System i) group and the java development group. Like all war stories one can learn a lot in the retelling.

Apparently in this company the AS/400 had been the database of record for a long time, and even when java web apps began to be built, they were required to access the data via the DB/2 variant that ran on top of the AS/400 operating system. Now this system is superb for running batch applications written in RPG, even though the team actually wrote all of their applications in a 4GL type of environment. The operating system is actually built around a database of sorts, but its not relational and is really designed for fast sequential I/O. The DB/2 variant actually sits on top of the OS database more as a client than anything else. The files beneath the DB/2 tables are not managed by DB/2 at all as the batch applications don't even know it exists. This makes for much fun.

The AS/400 team was considered the cream of the IT group and thus had a lot of authority and clout with upper management.

One issue facing the Java team was that the 4GL tool managed its constraints internal to the tool, and this was not available to any outside application including DB/2. Thus any constraints defined in the relational database were not adhered to by any batch applications accessing the same files (ie tables). Thus there was no requirement for primary keys or unique columns or foreign keys in the data to actually be unique or even point to anything (the batch applications had no such concepts). Writing web apps on top of such data became a massively frustrating exercise where any query could return multiple results for primary key or unique queries, dangling foreign key references, etc. On top of that the 4GL only used a few datatypes, and all others were defined internal to the tool, thus there were 7 different kinds of dates none of which were real SQL dates. All of this had to be dealt with manually in java code. The column and table names were also tool generated and were basically random names until someone manually built a view layer which gave real names to everything.

Needless to say the users of the web apps were eternally unhappy as data became missing, or random exceptions happened randomly, and such problems caused no end of complaints. The field staff (which were contract) had to pay for every application use and even their hardware and thus this caused them financial pain. Naturally the Java team (being the relatively new folks on the block) were considered talentless, malingering and slow.

But this was only a sidelight to the real battle. The main web app used by the field staff (basically their only tool to do everything) was mind-numbingly slow. Logging in took minutes, searching for a client was an exercise best done over lunch. No one dared to use the app with a client in the office for fear of looking like an idiot. And of course it was the Java developer's fault.

Oddly enough the application ran very fast when run against the Dev and QA partitions in the AS/400 (the system supports multiple partitions with own memory pool, fractional CPU usage, and disk mappings). When the same code ran against the Production partition the performance turned to stone. However this fact was ignored by the AS/400 team, as their platform was known to be incredibly fast and self-tuning, and thus it had to the the fault of Java. Or the Java programmers. Or JDBC. Or anything except the database.

So they hired J2EE consultants, Java performance experts, basically anyone who claimed they could fix the web apps (since the Java programmers were talentless, malingering and slow). They found nothing, which only showed that Java was the problem anyway (must be stupid consultants). Then they hired an AS/400 expert from IBM who went over the entire system and found everything to be in excellent shape and very fast (however he admitted knowing nothing about DB/2 and only evaluated the OS and the batch environment). Thus the proof was there that the Java team and their language was, well, talentless, malingering and slow. It's not the database, stupid!

OK, now I showed up as a new hire and wanted to find some way to speed things up. Of course it was obvious to me (and everyone else in the Java team) that the problem was in the AS/400 but how to prove it? It turned out one of the DBA's (Oracle, really smart guy) had built a mirror of the web tables from the AS/400 on an old desktop running Oracle. Now now one in the Java team was allowed access to the Production partition on the AS/400 but the DBA did. So I had an idea.

I built a simple Java application which scanned a schema using JDBC, read the names of all the tables, then built a simple "SELECT * FROM XXX FETCH FIRST 1 ROWS ONLY" for both databases and ran this for every table, tracking the time necessary (thus measuring only the time to access the first row of the table and overhead). I got access from the DBA's to both the production DB/2 on the AS/400 and the Oracle running on the old desktop. We did this without telling anyone what I was doing.

After running the tests a number of times I produced a chart of the results. The Oracle version ran an order of magnitude faster, even though it was on a pathetic piece of hardware, and the AS/400 was the database hardware of record. I then sent it to everyone I could think of. All we heard from the AS/400 group then was "hummana-hummana-hummana" as there wasn't anything to direct blame to. Same code, same machine, same network, same SQL, using the production JDBC drivers from IBM and Oracle. Oopsy.

So they were ordered to have their DBA (really just an analyst) run the code I gave him (same results) and then he watched the console from the AS/400 while the production web app ran (which they could have done years earlier) and low and behold, as soon as a big query came in (like find a client!) the system began to page like mad for minutes. It seemed that the minimum memory for the partition running DB/2 through which the entire fieldstaff ran their work had only 80MB of RAM. Ouch. The AS/400 would gradually increase the memory but it was tuned for batch applications, not instant web queries. Oddly enough if several queries came in sequence it got faster as some more memory would be applied. But then it would drop back down again.

After a bit more time of complaining there wasn't enough RAM in the AS/400 (like 16GB) they finally gave in and increased the memory to 1GB as a floor. And all of the performance problems went away just like that.

And no one ever said a thing after that, like the war never happened. Of course the data problems continue to this day (I no longer work there of course) but at least you can find a client now in a few seconds.

And yeah, it was the stupid database!