Thinking About Programming

WTF IT Stories #1: It's Not The Database, Stupid!

Posted: 02/11/2007, Readers: 18713 Perm Link


I went to work for a financial services company and soon found myself in the middle of a long war between the AS/400 (now called the System i) group and the java development group. Like all war stories one can learn a lot in the retelling.

Apparently in this company the AS/400 had been the database of record for a long time, and even when java web apps began to be built, they were required to access the data via the DB/2 variant that ran on top of the AS/400 operating system. Now this system is superb for running batch applications written in RPG, even though the team actually wrote all of their applications in a 4GL type of environment. The operating system is actually built around a database of sorts, but its not relational and is really designed for fast sequential I/O. The DB/2 variant actually sits on top of the OS database more as a client than anything else. The files beneath the DB/2 tables are not managed by DB/2 at all as the batch applications don't even know it exists. This makes for much fun.

The AS/400 team was considered the cream of the IT group and thus had a lot of authority and clout with upper management.

One issue facing the Java team was that the 4GL tool managed its constraints internal to the tool, and this was not available to any outside application including DB/2. Thus any constraints defined in the relational database were not adhered to by any batch applications accessing the same files (ie tables). Thus there was no requirement for primary keys or unique columns or foreign keys in the data to actually be unique or even point to anything (the batch applications had no such concepts). Writing web apps on top of such data became a massively frustrating exercise where any query could return multiple results for primary key or unique queries, dangling foreign key references, etc. On top of that the 4GL only used a few datatypes, and all others were defined internal to the tool, thus there were 7 different kinds of dates none of which were real SQL dates. All of this had to be dealt with manually in java code. The column and table names were also tool generated and were basically random names until someone manually built a view layer which gave real names to everything.

Needless to say the users of the web apps were eternally unhappy as data became missing, or random exceptions happened randomly, and such problems caused no end of complaints. The field staff (which were contract) had to pay for every application use and even their hardware and thus this caused them financial pain. Naturally the Java team (being the relatively new folks on the block) were considered talentless, malingering and slow.

But this was only a sidelight to the real battle. The main web app used by the field staff (basically their only tool to do everything) was mind-numbingly slow. Logging in took minutes, searching for a client was an exercise best done over lunch. No one dared to use the app with a client in the office for fear of looking like an idiot. And of course it was the Java developer's fault.

Oddly enough the application ran very fast when run against the Dev and QA partitions in the AS/400 (the system supports multiple partitions with own memory pool, fractional CPU usage, and disk mappings). When the same code ran against the Production partition the performance turned to stone. However this fact was ignored by the AS/400 team, as their platform was known to be incredibly fast and self-tuning, and thus it had to the the fault of Java. Or the Java programmers. Or JDBC. Or anything except the database.

So they hired J2EE consultants, Java performance experts, basically anyone who claimed they could fix the web apps (since the Java programmers were talentless, malingering and slow). They found nothing, which only showed that Java was the problem anyway (must be stupid consultants). Then they hired an AS/400 expert from IBM who went over the entire system and found everything to be in excellent shape and very fast (however he admitted knowing nothing about DB/2 and only evaluated the OS and the batch environment). Thus the proof was there that the Java team and their language was, well, talentless, malingering and slow. It's not the database, stupid!

OK, now I showed up as a new hire and wanted to find some way to speed things up. Of course it was obvious to me (and everyone else in the Java team) that the problem was in the AS/400 but how to prove it? It turned out one of the DBA's (Oracle, really smart guy) had built a mirror of the web tables from the AS/400 on an old desktop running Oracle. Now now one in the Java team was allowed access to the Production partition on the AS/400 but the DBA did. So I had an idea.

I built a simple Java application which scanned a schema using JDBC, read the names of all the tables, then built a simple "SELECT * FROM XXX FETCH FIRST 1 ROWS ONLY" for both databases and ran this for every table, tracking the time necessary (thus measuring only the time to access the first row of the table and overhead). I got access from the DBA's to both the production DB/2 on the AS/400 and the Oracle running on the old desktop. We did this without telling anyone what I was doing.

After running the tests a number of times I produced a chart of the results. The Oracle version ran an order of magnitude faster, even though it was on a pathetic piece of hardware, and the AS/400 was the database hardware of record. I then sent it to everyone I could think of. All we heard from the AS/400 group then was "hummana-hummana-hummana" as there wasn't anything to direct blame to. Same code, same machine, same network, same SQL, using the production JDBC drivers from IBM and Oracle. Oopsy.

So they were ordered to have their DBA (really just an analyst) run the code I gave him (same results) and then he watched the console from the AS/400 while the production web app ran (which they could have done years earlier) and low and behold, as soon as a big query came in (like find a client!) the system began to page like mad for minutes. It seemed that the minimum memory for the partition running DB/2 through which the entire fieldstaff ran their work had only 80MB of RAM. Ouch. The AS/400 would gradually increase the memory but it was tuned for batch applications, not instant web queries. Oddly enough if several queries came in sequence it got faster as some more memory would be applied. But then it would drop back down again.

After a bit more time of complaining there wasn't enough RAM in the AS/400 (like 16GB) they finally gave in and increased the memory to 1GB as a floor. And all of the performance problems went away just like that.

And no one ever said a thing after that, like the war never happened. Of course the data problems continue to this day (I no longer work there of course) but at least you can find a client now in a few seconds.

And yeah, it was the stupid database!

Tags: wtf, database, java
Adam 02/12/2007 11:06

I have experience developing web applications accessing data on both System i machines and Oracle database servers. Generally the code is the same or very similar. I also have some experience developing native AS/400 programs in RPG. I have not encountered any problems like the one described, but I can see how this problem could happen. In most larger System i (AS/400) shops, there is a big division between the development and operations staff. There generally isn't anyone with actual DBA experience because, to a large extent, the system is self-tuning. Operations doesn't have to worry about most DBA tasks, and batch programmers don't have to either. Having said all of that, the root cause of the problem should have been found by operations (of course I'm a developer, other people's opinions may vary).

I have to respond to some of the article. Here are some of the things which jumped out at me:

  • ... the team actually wrote all of their applications in a 4GL type of environment.

  • On top of that the 4GL only used a few datatypes, and all others were defined internal to the tool, thus there were 7 different kinds of dates none of which were real SQL dates.

Ugh! There's the first problem. Of the (AS/400) 4GL environments that I have seen, all of them produce junk code. This sounds like an especially bad 4GL. This probably lead to many of the data problems which are mentioned in the article. On the other hand, the author is wrong about a few things:

  • The DB/2 variant actually sits on top of the OS database more as a client than anything else. The files beneath the DB/2 tables are not managed by DB/2 at all as the batch applications don't even know it exists.

This isn't really true. The database is built into the operating system, there's just more than one way to access it. The files 'beneath' the DB/2 tables can be managed by DB/2. Apparently, in this case, they weren't. You can add primary and foreign key contraints to System i files, and you'll get errors if you try to break them. It was probably too much trouble to change their 4GL programs to handle those errors, so they didn't add the constraints. From here, the author describes what the consultants found:

  • They found nothing, which only showed that Java was the problem anyway (must be stupid consultants). Then they hired an AS/400 expert from IBM who went over the entire system and found everything to be in excellent shape and very fast (however he admitted knowing nothing about DB/2 and only evaluated the OS and the batch environment).

Wow, there was definitely some stupidity here. Either they should have asked IBM for someone with experience in setting up the System i as a database server, or they should have gotten their money back from IBM. Because the actual problem that they eventually found is not real complex. However, all of the blame shouldn't be heaped on the 'AS/400 expert'. The Java experts which were brought in could have done what the author did (built a copy of the database in Oracle and compared times to access it versus the System i database). I think this whole mess shows the value of asking the right questions. It seems like both teams just asked if their stuff was okay. When they were told that their stuff was fine, they assumed it was the other team's (or system's) fault. Nobody asked how can we make the existing system work better.

rkeene 02/12/2007 13:29

Similar problem on another project. The only difference was that it was the number of WebSphere threads allocated. Soon as we raised it preformance problems evaporated.

Definately the same problem with the 'expert' WebSphere consultant that came in.

Al 02/16/2007 17:19

""And yeah, it was the stupid database!". No it wasn't!

This article is interesting, but the author clearly is not truly familiar with the AS/400 and makes a number of erroneous statements, coming across as just a little too superior.

"The DB/2 variant actually sits on top of the OS database more as a client than anything else."

Well, not really. In fact, not at all. If there is one defining characteristic of the AS/400 integrated relational database, it is that it does NOT sit on top of the operating system, as in ALL other platforms, but is integrated into the operating system. Much of the database functionality on an AS/400 sits BELOW what is called the TIMI (Technology Independent Machine Interface), which, for Java people can be thought of as roughly equivalent to the virtual machine/bytecode interface of a JVM.

The database is relational, has been since most modern relational databases were just glints in the eyes of their creators. The AS/400 database often leads all other DB2 products in capability. For example, it was the first DB2 on any platform to deliver Vector Encoded Indexes, which is a technology similar in purpose, but technically superior to, the bitmap indexes offered by other database vendors.

To deal with the relational integrity issues the author describes, there would have been no problem in adding foreign key and other integrity constraints to the underlying AS/400 tables, as the previous poster suggested.

The data types problem the author describes is a limitation of the 4 GL tool, perhaps, but mechanisms exist on the AS/400 to do "virtual data types" if you like, so the date conversion could have been done there before being made available to the users as views, so the conversions didn't HAVE to be done in Java code. SQL data types are available, so they could have been used, but it seems the 4 GL tool stood in the way of redefining the underlying tables.

Most importantly, though, the problem he describes is not a database problem, but one in an are on the AS/400 call "Work Management", which includes the concepts of tuning and how resources are allocated to executing jobs, and how those jobs are placed on appropriate queues for execution etc.

It is true that AS/400 sites often do not have deep ranks of highly skilled "tuning" and "DBA" gurus like, for example, Oracle sites do. This is because the AS/400 does a remarkably good job of tuning itself, and most of the time these people are not needed. Most Oracle shops however, would go down in a few weeks or months if the "gurus" went on vacation!

The subsystem memory allocation problem the author describes is unusual in that it is an exception to this rule - the result of insufficient thought being given to the Work Management parameters for the production environment. This is a configuration and tuning issue, and NOT "the stupid database" as the author claims.

This shop clearly could have done with someone with a few more clues in tuning the AS/400. The fact that the application had good response times for the development and QA environments should have been enough for people to realize that it wasn't "Java" that was the source of the problem. (But what can you do - true rocket scientists go build rockets for more money. We, on the other hand, who build and run software systems, keep being squeezed by the bean counters trying to send our jobs to India. In the end, some of the more talented are persuaded to pursue other careers that are more satisfying.)

I'm glad for this author, and his client, that he found the problem when others didn't. But this doesn't justify his ill-informed and supercilious posturing about what remains one of the most elegant architectures, and robust, enterprise-level operating systems the industry has ever seen.

If the author truly wants to understand the AS/400 and its capabilities, I suggest he glance at "Inside the AS/400" by Frank G. Soltis. You'll never again look at competing PC or server architectures with the same respect!

DoctorEternal 02/21/2007 19:10

Ah, the memories... Over the years I've had more "AS/400 converting to Java" students than I can count (I'm a Sun Java Instructor). I personally still love the ole AS/400s. Clink, clank clunk!

Dr.E

http://www.turingshop.com/reports/01Java/