Building My Cheap, Scalable, High-Volume Query Site

March 27, 2007

The project I am working on at home will require an architecture to support a high volume of Ajax calls slicing and dicing data from a fairly large database. Since the data is only updated once a day, there are interesting options for how to build the system.

Web applications can be roughly divided into two types: mostly-read and read-write. The first type mainly shows the user information based on some kind of query, which could be explicit (like a query-by-example form) or, more commonly, driven by a UI that presents options. The second may do the same but also lets the user update the information. It's rare for an application to have no update features at all, so the type can be hard to pin down; I view the difference as whether the user can update the information they are querying.

Amazon, for example, has a lot of write features (ordering, building lists, commenting) but they mostly don't affect the catalog of items (other than availability).

Digg may spend most of its time showing similar content but users are actually modifying the data as they interact with it.

The site I am trying to build will be more like Amazon, in that the majority of the information only changes rarely (like once per day) and that other interaction is peripheral. This makes it easier to use all sorts of caching strategies.

When I worked out this problem in 1998 with Consumers Digest Online we were using WebObjects, which would run separate processes (called instances) on the same server (or on multiple servers). Originally my digested search information (about 20MB, mostly a big tree structure and associated indexes) was loaded into each instance, which somewhat limited how many we could run on a single server box. Later we modified the cache data (which was all static data) to run inside a shared memory space (HP/UX). This architecture then looked like this:

We were able to support as many as 2,000 simultaneous users with this structure, even though the search engine was fuzzy and did a lot of searching over derived data. The average response time through the cache for any query was under a second. There were something like 40,000 packages of features, 20,000 products, and an almost infinite number of potential result combinations. You couldn't cache individual results, as any given search was unlikely to ever appear again. The data changed once per week and was processed for the caches after the new database was loaded. All display data came directly from the database.

During my year of working on projects for Sabre I wrote a white paper for them on how I would build a replacement for their monster reservation system (with its 8,000-queries-per-second requirement) based on a similar but larger architecture. I don't think it went anywhere; eventually they built something with a more traditional database-caching architecture. In their case the flights don't change too often but the reservations do, so it is more of a type-2 application.

For my new application I will be managing a wide variety of slice-and-dice searches through a much larger space (iTunes, for example, has 2M+ tracks), but at least the information isn't derived from the database data like in the CDOnline system (with its complex required and disallowed package combinations). The biggest issue is that in building an Ajax front end you need fast responses even under heavy load, since queries are 95% of what the application will do.

The architecture for this system looks something like this:

The point is to replicate the cache servers, each of which has enough RAM to hold all of the data required for searching in memory. Since the data in the database is updated only once per day, it can be optimized for the fastest possible searching when it is loaded into the cache server. Of course, you could build something more traditional that lets the database manage the caching, or use a caching framework (of which there are plenty, both free and for-pay).
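To make that concrete, here is a minimal sketch in Java of the daily-reload idea: the entire searchable data set is rebuilt as an immutable snapshot once a day and swapped in atomically, so queries never block on the reload. The names (CacheServer, CatalogSnapshot, track titles as the data) are hypothetical stand-ins for whatever the real catalog looks like.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CacheServer {

    // The immutable, search-optimized snapshot built from the daily load.
    public static class CatalogSnapshot {
        private final List<String> trackTitles;

        public CatalogSnapshot(List<String> titles) {
            this.trackTitles = Collections.unmodifiableList(new ArrayList<String>(titles));
        }

        // Example query: a simple substring match over the in-memory data.
        public List<String> search(String term) {
            List<String> hits = new ArrayList<String>();
            String lower = term.toLowerCase();
            for (String title : trackTitles) {
                if (title.toLowerCase().contains(lower)) {
                    hits.add(title);
                }
            }
            return hits;
        }
    }

    // Readers always see a complete snapshot; the daily reload swaps it atomically.
    private final AtomicReference<CatalogSnapshot> current =
            new AtomicReference<CatalogSnapshot>(new CatalogSnapshot(new ArrayList<String>()));

    // Called once a day, after the new database has been loaded and digested.
    public void reload(List<String> freshTitles) {
        current.set(new CatalogSnapshot(freshTitles));   // in-flight queries keep using the old snapshot
    }

    public List<String> query(String term) {
        return current.get().search(term);
    }
}
```

Because the readers only ever see a complete snapshot, there is no locking on the query path at all, which is exactly what you want when queries dominate the workload.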

In my case I want to manage the searching myself, as most of it involves deeply nested trees, which are hard to optimize in a relational database (Oracle has some features for this, but I've never used them). RAM is always faster than hard drives (at least until the new Flash drives get cheap enough), and since I know exactly what I will allow users to do, I can build the cache precisely for those needs.
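As a rough illustration of "building the cache precisely for those needs", the nested category tree can be precomputed at load time so that a slice-and-dice query is just a walk down a map-of-maps rather than a SQL join. The class and field names below are hypothetical, not the actual data model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CategoryNode {
    private final String name;
    private final Map<String, CategoryNode> children = new HashMap<String, CategoryNode>();
    private final List<String> trackIds = new ArrayList<String>();   // tracks filed under this node

    public CategoryNode(String name) {
        this.name = name;
    }

    // Used while building the tree during the daily load.
    public CategoryNode getOrCreateChild(String childName) {
        CategoryNode child = children.get(childName);
        if (child == null) {
            child = new CategoryNode(childName);
            children.put(childName, child);
        }
        return child;
    }

    public void addTrack(String trackId) {
        trackIds.add(trackId);
    }

    // Follow a path like ["Rock", "1970s", "Live"] down the precomputed tree.
    public CategoryNode descend(List<String> path) {
        CategoryNode node = this;
        for (String step : path) {
            node = node.children.get(step);
            if (node == null) {
                return null;   // no such slice
            }
        }
        return node;
    }

    public List<String> getTrackIds() {
        return trackIds;
    }

    public String getName() {
        return name;
    }
}
```

Each query then costs a handful of hash lookups, no matter how large the catalog gets.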

The pleasant thing about this architecture is that I can not only grow it as needed but also distribute it over multiple data centers if necessary. This is basically what all of the big companies do. I expect to be able to handle as many as 100 transactions per second, all without any massive investment in servers (money I don't have anyway).
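Since every replica holds the full read-only data set, any of them can answer any query, so spreading the load can be as dumb as round-robin. A tiny hypothetical sketch (a real deployment would more likely sit behind a load balancer):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CachePool {
    private final List<String> cacheHosts;            // replicated cache servers, possibly in different data centers
    private final AtomicInteger next = new AtomicInteger(0);

    public CachePool(List<String> cacheHosts) {
        this.cacheHosts = cacheHosts;
    }

    // Pick the next replica in round-robin order.
    public String nextHost() {
        int i = (next.getAndIncrement() & Integer.MAX_VALUE) % cacheHosts.size();
        return cacheHosts.get(i);
    }
}
```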

One other note: I use Jetty as my appserver, which is really fast for Ajax calls thanks to its continuation-based architecture.
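For completeness, the Ajax endpoint itself can be an ordinary servlet that Jetty serves; the continuation machinery that keeps Jetty efficient under load is not shown here. CacheServer and the "q" parameter are hypothetical names carried over from the sketches above, and a real version would emit JSON or XML rather than plain text.

```java
import java.io.IOException;
import java.util.List;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchServlet extends HttpServlet {
    private final CacheServer cache = new CacheServer();   // shared, read-mostly in-memory cache

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        String term = request.getParameter("q");
        List<String> hits = cache.query(term == null ? "" : term);

        // Hand the results back to the Ajax caller as a trivial text payload.
        response.setContentType("text/plain");
        for (String hit : hits) {
            response.getWriter().println(hit);
        }
    }
}
```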