Home About The Codist RSS Feed

Building My Cheap, Scalable, High-Volume Query Site
Mar 29, 2007 08:57 perm link Readers: 3179

The project I am working on at home will require an architecture to support a high volume of ajax calls slicing and dicing data from a fairly large database. Since the data is only updated once a day, there are interesting options on how to build the system.

Web applications can be roughly divided into two types: mostly read and some write. The first type is mainly used to show the user some information based on some type of query, which could be explicit (like a query by example form) or more commonly based on a UI that gives them options. The second may do the same but allows updating of information by the user. It's rare for an application to not have some update features so what type it is may be hard to pin down. I view the difference as whether the user can update the information they are querying.

Amazon, for example, has a lot of write features (ordering, building lists, commenting) but they mostly don't affect the catalog of items (other than availability).

Digg may spend most of its time showing similar content but users are actually modifying the data as they interact with it.

The site I am trying to build will be more like Amazon, in that the majority of the information only changes rarely (like once per day) and that other interaction is peripheral. This makes it easier to use all sorts of caching strategies.

When I worked out the problem in 1998 with Consumers Digest Online we were using WebObjects, which would run separate processes (called an instance) in the same server (or multiple servers). Originally my digested search information (about 20MB mostly a big tree structure and associated indexes) was loaded into each instance, which somewhat limited how many we could run in a single server box. Later we modified the cache data (which was all static data) to run inside a shared memory space (HP/UX). This architecture then looked like this:

We were able to support as many as 2000 simultaneous users with this structure, even though the search engine was fuzzy with a lot of derived data searches. The average response time through the cache for any query was subsecond. There were something like 40,000 packages of features, 20,000 products, and an almost infinite combination of potential results. You couldn't cache individual results as each search was unlikely to ever appear again. The data changed once per week, and was processed for the caches after the new database was loaded. All display data came from the database directly.

During my year of working on projects for Sabre I wrote a white paper for them on how I would build a replacement for their monster reservation system (with its 8000-per-second query requirement) based on a similar but larger architecture. I don't think it went anywhere but eventually they built something with a more traditional database caching architecture. In their case the flights don't change too often but the reservations do so it is more of a type 2.

For my new application I will be managing a wide variety of slice-and-dice searches through a much larger space (iTunes for examples has 2M+ tracks) but at least the information isn't derived from the database data like in the CDOnline system (with its complex required and disallowed package combinations). The biggest issue is that in building an Ajax from end you need fast response even under heavy loads, since queries are 95% of what the application will do.

The architecture for this system looks something like this:

The point is to replicate the cache servers which have enough RAM to store the data required for searching all in memory. As the data in the database is updated only once per day, it can be optimized for the fasted possible searching when loaded into the cache server. Of course you could build something more traditional which would allow the database to manage the caching or use a caching framework (of which there are plenty both free and for-pay).

In my case I want to manage the searching myself as most of it involves deeply nested trees, which are hard to optimize in a relational database (Oracle has some benefits here but I've never used them). RAM is always faster than hard drives (at least until the new Flash drives get cheap enough) and since I know exactly what I will allow the users to do, I can build the cache precisely for those needs.

The pleasant thing about this architecture is that I can not only grow it as needed, but I can also distribute it over multiple data centers if necessary. This is basically what all of the big companies do. I expect to be able to manage as much as 100 transactions per second, all without any massive investment in servers (money I don't have anyway).

One other note is that I use Jetty as my appserver, which is really fast for Ajax calls with its continuation based architecture.

My Tags:

  • Anand Sharma: Mar 29, 2007 22:05

    I am wondering what software/service you're planning to use to provide the "cache server" capability?

  • nemlah: Mar 30, 2007 06:07

    We have developed an application with the exact same requirements, and while we don't have add a caching tier to our solution so far, if growth continues as planned we will have to add it soon enough. One thing I was always wondering about is how costly it is to create the caches (if for example we need to update the main database during the day, how fast will the caches be uptodate?). The application is written in rails and the obvious choice would be memcached, but I am looking forward to see your solution.

    Regards,

    Nemlah

  • Craig: Mar 30, 2007 09:42

    I would swap out parts of your existing data layer which are querying the database for results with a layer which queries Lucene for results.

    Then, simply regenerate your Lucene indexes nightly. You shouldn't need that Cache layer since I believe Lucene has mechanisms for caching frequent queries.

  • codist: Mar 30, 2007 09:52

    Yeah lucene is a good idea if you have common queries, especially full text ones. In my case it won't work (as in the CDOnline version) as the queries are multidimensional and rarely repeated.

  • Chris Lu: Mar 30, 2007 15:57

    Lucene is good, even for exact match because it's just file access. Database is simply too slow. What you cache actually did is to move data to file access.

    To try Lucene, you can use DBSight. It's super easy. You can create a production-level search in 3 minutes.

    Please take a look at this

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    You can create a full-text database search service, return results as HTML/XML/JSON. It uses the Lucene directly in java, but can be easily used with Ruby, PHP, or any existing database web applicatoins.

    You can easily index, re-index, incremental-index. It's also highly scalable and easily customizable.

  • wpbarr: Apr 12, 2007 14:26

    You have several good options for the distributed datastore:

    - Javaspaces

    - memcached, Tangosol

    - object-oriented databases

    - in-memory databases (Intersystems Cache)

    You should be able to get sub-millisecond accesss times with any of them. With any of them, you get to stick with a pure object model and, you can use either keyed or associative queries.

  • kjgha: Apr 16, 2007 01:59

    klh

  • XCoder: Apr 18, 2007 07:04

    Since nobody is talking about hubernate's caching. How bad is hibernates two layer caching mechanism?

  • fashionhause: Jun 22, 2007 20:17

    BALENCIAGA designer handbag 139 delivered to anywhere

    <img src="http://www.fashionhause.com/images/large/balenorange_bf_LRG.jpg">

    Dimension:12(L)X8(H)X4(D) Features:Motorcycle bag in leather. Made with the finest qualities of leather, This shoulder bag zips open into a fully lined inner cavity with a separate zippered pocket inside. Also comes with its leather trimmed mirror and extra shoulder strap , have dustproof bag ,Gennuine birth card ,Warranty card and introduction .

    Please check out our website:

    http://www.fashionhause.com There are top quality of replica handbags for sell

    with perfect weight, feel, and like the originals.or email us : info@fashionhause.com

    Breitling Navitimer Two-Toned Review

    <img src="http://www.fashionhause.com/images/large/Breitling-Navitimer-03_LRG.jpg">

    Bytor reviews another version of a replica Breitling Navitimer. This version benefits greatly in accuracy from the faults of its earlier brothers. Many of the common Breitling replica flaws have been addressed, leaving a very convincing replica watch. One point not mentioned by Bytor is the inaccurate date window; this still is a flaw that can be used to immediately spot a replica Breitling. Breitling’s true date wheel is quite unique to their brand and is easily identifiable.

    This article is reprinted here from The Replica Collector Forum.

    Please check out our website:

    <a href="http://www.fashionhause.com">fashionhause</a> There are top quality of replica handbags for sell

    with perfect weight, feel, and like the originals.or email us : info@fashionhause.com

    Original design by Paloma Picasso

    <img src="http://www.tiffany-sterling-silver.com/images/large/tiffany-rings-R031_LRG.jpg">

    Original design by Paloma Picasso,925 Sterling silver plated Tiffany Ring - Rings are the enduring symbol of eternity, of dedication, a pledge to love and a pledge to remember. To wear fine silver on your finger can mean any number of things, but it should always be comfortable and stylish, and suit your unique taste. That's why we've taken extra special care in selecting a wide array of rings, in all sizes, to match the jewel with the wearer and what it will come to stand for. You'll surely find the special ring you desire. Sumptuous pieces in brilliant hues: Daughter of, arguably, the twentieth century's most influential painter, Paloma Picasso has a lot to live up to, and a wealth of inspiration to draw on. Her world renowned jewelry designs have organic and ultra modern flare. This collection is Italian style from one among this century's boldest visionaries.

    Tiffany-sterling-silver delights in the opportunity to offer our customers fine sterling silver rings, fake Tiffany necklaces, pendants, replica Tiffany bracelets, bands, brooches, replica Tiffany earrings and more, all at remarkably low prices.

    Please check out our website:

    http://www.tiffany-sterling-silver.com

    Or email us:

    Info@tiffany-sterling-silver.com

    [url=http://www.fashionhause.com]replica handbags[/url]

    [url=http://www.fashionhause.com/forum/]Rolex forum[/url]

    [url=http://www.fashionhause.com/links/]replica watches[/url]

    [url=http://www.yesreplica.com/]rolex replica[/url]

    [url=http://www.fashiontrends.cn/]fake rolex[/url]

    [url=http://www.smokershops.com/]swiss rolex replica[/url]

    [url=http://www.tiffany-sterling-silver.com/]tiffany sterling silver[/url]

    <a href="http://www.tiffany-sterling-silver.com">tiffany sterling silver</a>

    <a href="http://www.fashionhause.com"> replica handbags</a>

    <a href="http://www.fashionhause.com/forum/"> Rolex forum</a>

    <a href="http://www.fashionhause.com/links/"> replica watches</a>

    <a href="http://www.yesreplica.com/"> rolex replica</a>

    <a href="http://www.fashiontrends.cn/"> fake rolex</a>

    <a href="http://www.smokershops.com/">swiss rolex replica</a>

  • Alex Popescu: Oct 06, 2007 04:46

    Hi there!

    I have built a couple of apps using the same approach. However, for those with very high requirements I have quickly find out the the DB node can become a bottleneck. So, without wanting to sound like an advise, I would probably try to partition the data or replicate the storage. Considering that in your case the DB updates are rare? then probably the simplest approach would be storage replication.

    bests,

    ./alex

    --

    .w( the_mindstorm )p.

  • Add Comment

Name:


Optional URL:


Comment:


Save Cancel

Copyright © 2007 By Andrew Wulf