Kindergarten, The Digg Effect, Shared Hosting and Traffic Rankings
Apr 09, 2007 09:40 perm link Readers: 1038

How's that for a complicated title?

I posted All I Need To Know To Be A Better Programmer I Learned In Kindergarten last Thursday morning, and watched it do absolutely nothing that day (600 readers or so, way below an average day).

The next day it was added to Reddit and during the day had about 6,000 readers (good, but no record). That night I went to Good Friday services and shared a meal with friends. Before going to bed (late) I checked the stats and noticed 2,000 people had read the article in the last hour. I saw that it had 69 diggs so I visited Digg to see where it was. It turned out it was #2 on Digg. This was the first time one of my articles went anywhere on Digg (usually Reddit is my main traffic site). In the next two days I had nearly 40,000 readers and over 1,000 diggs.

This site runs on a shared Linux host (some kind of mid-level single-CPU Athlon box) at kattare.com, with Apache on the front end passing all traffic to my Jetty server, which runs my Java blog application (BlogFiche) on top of my own web application framework (Fiche) using H2 as the database. This instance of Jetty is also running some other (lower traffic) sites. I wasn't remotely afraid of the "Digg Effect" since I knew exactly how much traffic this platform was designed for. I wouldn't get worried until traffic hit 5 requests per second or so sustained (that's 18K+ per hour). The max I saw was probably around 3K per hour, so the system was still mostly idle.
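To make the arithmetic concrete, here is a tiny sketch that converts between per-second and per-hour rates. The class and method names are made up for illustration; only the numbers come from the paragraph above.

```java
public class TrafficRates {
    /** Converts a sustained per-second request rate to requests per hour. */
    static long perHour(double perSecond) {
        return Math.round(perSecond * 3600);
    }

    /** Converts an hourly request count to an average per-second rate. */
    static double perSecond(long perHour) {
        return perHour / 3600.0;
    }

    public static void main(String[] args) {
        // The "worry" threshold: 5 per second sustained.
        System.out.println(perHour(5)); // prints 18000
        // The observed peak of roughly 3K per hour, as an average per-second rate.
        System.out.println(perSecond(3000));
    }
}
```

The observed peak works out to under one request per second on average, roughly a sixth of the threshold.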

My site caches all generated page content and only regenerates it if something changes (I edit the site or a comment is posted). All static content is served via the Jetty default servlet, which also does caching. Jetty itself uses Java NIO (non-blocking IO), so it's extremely efficient at serving web content. So in effect the only thing happening on 99.9% of all connections is IO. Dugg? No problem.
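The idea of caching generated pages and regenerating only on change can be sketched in a few lines. This is a minimal illustration in modern Java, not the actual BlogFiche code; `PageCache`, `get`, `invalidate` and the renderer function are all hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class PageCache {
    private final ConcurrentHashMap<String, String> pages = new ConcurrentHashMap<>();
    private final Function<String, String> renderer; // hypothetical page generator

    public PageCache(Function<String, String> renderer) {
        this.renderer = renderer;
    }

    /** Returns the cached page, rendering it only if no cached copy exists. */
    public String get(String path) {
        return pages.computeIfAbsent(path, renderer);
    }

    /** Call when an article is edited or a comment is posted on it. */
    public void invalidate(String path) {
        pages.remove(path);
    }

    public static void main(String[] args) {
        AtomicInteger renders = new AtomicInteger();
        PageCache cache = new PageCache(
                path -> "<html>" + path + " v" + renders.incrementAndGet() + "</html>");
        cache.get("/kindergarten");        // first request renders the page
        cache.get("/kindergarten");        // second request is served from the cache
        cache.invalidate("/kindergarten"); // e.g. a comment was posted
        cache.get("/kindergarten");        // rendered again after the change
        System.out.println("renders: " + renders.get()); // prints renders: 2
    }
}
```

With something like this in front of the page generator, a burst of readers on one article costs one render plus a lot of IO.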

Of course I wouldn't advertise myself as knowing and solving performance issues if my own site's performance sucked. Seeing blogs get dugg and then die is a little irritating; I can understand a graphics or video site getting swamped, but a simple blog shouldn't have any trouble with 1 connection per second (3,600 per hour).

The funny thing about the traffic is the difference between Google and Alexa. Even though I had 40,000 readers for this one article, my Alexa ranking dropped a lot over the same timeframe. Google's index is filled with references to (and outright copies of!) the article, but Alexa shows nothing happened at all. Maybe fewer people are using the Alexa toolbar these days.

I tend to look at my adgridwork account for quick updates on my traffic. Although I am collecting some stats, I have been too lazy to update the admin portion of this blog so I can see (and show) actual counts on the articles.

It was pretty cool (and humbling) that so many people read the article from all over the world. The web is an amazing place.

  • codist: Apr 09, 2007 11:45

    Another blog posted a similar article at Surviving the digg effect. Unfortunately he hit the digg front page twice at the same time and it leveled his server. Ouch.

  • Dan: Apr 12, 2007 13:26

    Congratulations! It was a great article!

Building My Cheap, Scalable, High-Volume Query Site
Mar 29, 2007 08:57 perm link Readers: 2907

The project I am working on at home will require an architecture to support a high volume of ajax calls slicing and dicing data from a fairly large database. Since the data is only updated once a day, there are interesting options on how to build the system.

Web applications can be roughly divided into two types: mostly read and some write. The first type is mainly used to show the user some information based on some type of query, which could be explicit (like a query by example form) or more commonly based on a UI that gives them options. The second may do the same but allows updating of information by the user. It's rare for an application to not have some update features so what type it is may be hard to pin down. I view the difference as whether the user can update the information they are querying.

Amazon, for example, has a lot of write features (ordering, building lists, commenting) but they mostly don't affect the catalog of items (other than availability).

Digg may spend most of its time showing similar content but users are actually modifying the data as they interact with it.

The site I am trying to build will be more like Amazon, in that the majority of the information only changes rarely (like once per day) and that other interaction is peripheral. This makes it easier to use all sorts of caching strategies.

When I worked out this problem in 1998 at Consumers Digest Online we were using WebObjects, which would run separate processes (called instances) on the same server (or on multiple servers). Originally my digested search information (about 20MB, mostly a big tree structure and associated indexes) was loaded into each instance, which somewhat limited how many we could run on a single server box. Later we modified the cache data (which was all static) to live in a shared memory space (HP-UX). The architecture then looked like this:

We were able to support as many as 2000 simultaneous users with this structure, even though the search engine was fuzzy with a lot of derived data searches. The average response time through the cache for any query was subsecond. There were something like 40,000 packages of features, 20,000 products, and an almost infinite combination of potential results. You couldn't cache individual results as each search was unlikely to ever appear again. The data changed once per week, and was processed for the caches after the new database was loaded. All display data came from the database directly.

During my year of working on projects for Sabre I wrote a white paper for them on how I would build a replacement for their monster reservation system (with its 8000-per-second query requirement) based on a similar but larger architecture. I don't think it went anywhere but eventually they built something with a more traditional database caching architecture. In their case the flights don't change too often but the reservations do so it is more of a type 2.

For my new application I will be managing a wide variety of slice-and-dice searches through a much larger space (iTunes, for example, has 2M+ tracks), but at least the information isn't derived from the database data like in the CDOnline system (with its complex required and disallowed package combinations). The biggest issue is that in building an Ajax front end you need fast response even under heavy loads, since queries are 95% of what the application will do.

The architecture for this system looks something like this:

The point is to replicate the cache servers, each of which has enough RAM to hold all the data required for searching in memory. As the data in the database is updated only once per day, it can be optimized for the fastest possible searching when loaded into the cache server. Of course you could build something more traditional which would allow the database to manage the caching, or use a caching framework (of which there are plenty, both free and for-pay).
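One way a cache server could handle the once-a-day update is to build a fresh immutable snapshot off to the side and swap it in atomically, so queries never see a half-loaded cache. This is only a sketch under that assumption; `SnapshotCache`, `Snapshot`, `lookup` and `reload` are hypothetical names, and the simple key-to-items map stands in for whatever search structures the real data needs.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotCache {
    /** Immutable view of the search data built from one daily load. */
    static final class Snapshot {
        final Map<String, List<String>> itemsByKey;

        Snapshot(Map<String, List<String>> itemsByKey) {
            this.itemsByKey = Map.copyOf(itemsByKey); // defensive, unmodifiable copy
        }
    }

    // Readers always see either the old snapshot or the new one, never a mix.
    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(Map.of()));

    /** Query path: a single in-memory lookup, no database access. */
    public List<String> lookup(String key) {
        return current.get().itemsByKey.getOrDefault(key, List.of());
    }

    /** Daily path: called after the database has been reloaded. */
    public void reload(Map<String, List<String>> freshData) {
        current.set(new Snapshot(freshData));
    }

    public static void main(String[] args) {
        SnapshotCache cache = new SnapshotCache();
        System.out.println(cache.lookup("jazz")); // prints []
        cache.reload(Map.of("jazz", List.of("Kind of Blue")));
        System.out.println(cache.lookup("jazz")); // prints [Kind of Blue]
    }
}
```

Because the snapshot is never modified after construction, the query path needs no locking at all.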

In my case I want to manage the searching myself as most of it involves deeply nested trees, which are hard to optimize in a relational database (Oracle has some benefits here but I've never used them). RAM is always faster than hard drives (at least until the new Flash drives get cheap enough) and since I know exactly what I will allow the users to do, I can build the cache precisely for those needs.
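As a rough illustration of why a tree search is easier to optimize in RAM than in SQL: if the data only changes at load time, everything under each node can be flattened once, so a "give me everything below this category" query becomes a single map lookup. The sketch below assumes unique node names and a made-up `CategoryTree` class; it is not the actual design of the application.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CategoryTree {
    /** A node in a deeply nested hierarchy (names are assumed to be unique). */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        final List<String> items = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Precomputed at load time: every item at or below each node.
    private final Map<String, List<String>> itemsUnder = new HashMap<>();

    public CategoryTree(Node root) {
        index(root);
    }

    /** Walks the tree once, recording the flattened item list for each node. */
    private List<String> index(Node node) {
        List<String> all = new ArrayList<>(node.items);
        for (Node child : node.children) {
            all.addAll(index(child));
        }
        itemsUnder.put(node.name, List.copyOf(all));
        return all;
    }

    /** Query time: no recursion, just a hash lookup. */
    public List<String> itemsUnder(String name) {
        return itemsUnder.getOrDefault(name, List.of());
    }

    public static void main(String[] args) {
        Node music = new Node("music");
        Node jazz = new Node("jazz");
        Node bebop = new Node("bebop");
        bebop.items.add("Ko-Ko");
        jazz.items.add("So What");
        jazz.children.add(bebop);
        music.children.add(jazz);

        CategoryTree tree = new CategoryTree(music);
        System.out.println(tree.itemsUnder("music")); // prints [So What, Ko-Ko]
    }
}
```

The trade is memory for speed, which is the whole point of a cache server with plenty of RAM.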

The pleasant thing about this architecture is that I can not only grow it as needed, but I can also distribute it over multiple data centers if necessary. This is basically what all of the big companies do. I expect to be able to manage as much as 100 transactions per second, all without any massive investment in servers (money I don't have anyway).
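Since every replicated cache server holds the same data, spreading queries across them can be as simple as round-robin. A minimal sketch, assuming a fixed list of replica addresses; the `ReplicaPicker` class and the host names are invented for illustration, and a real deployment would more likely leave this to a load balancer.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ReplicaPicker {
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    public ReplicaPicker(List<String> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("at least one replica is required");
        }
        this.replicas = List.copyOf(replicas);
    }

    /** Returns the next replica in rotation; safe to call from many threads. */
    public String pick() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }

    public static void main(String[] args) {
        // Hypothetical host names.
        ReplicaPicker picker = new ReplicaPicker(List.of("cache1", "cache2", "cache3"));
        for (int n = 0; n < 4; n++) {
            System.out.println(picker.pick());
        }
        // prints cache1, cache2, cache3, cache1 (one per line)
    }
}
```

Growing the system then means adding another entry to the list once the new server has loaded its copy of the data.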

One other note is that I use Jetty as my appserver, which is really fast for Ajax calls with its continuation based architecture.

  • Anand Sharma: Mar 29, 2007 22:05

    I am wondering what software/service you're planning to use to provide the "cache server" capability?

  • nemlah: Mar 30, 2007 06:07

    We have developed an application with the exact same requirements, and while we don't have a caching tier in our solution so far, if growth continues as planned we will have to add one soon enough. One thing I was always wondering about is how costly it is to create the caches (if, for example, we need to update the main database during the day, how fast will the caches be up to date?). The application is written in Rails and the obvious choice would be memcached, but I am looking forward to seeing your solution.

    Regards,

    Nemlah

  • Craig: Mar 30, 2007 09:42

    I would swap out parts of your existing data layer which are querying the database for results with a layer which queries Lucene for results.

    Then, simply regenerate your Lucene indexes nightly. You shouldn't need that Cache layer since I believe Lucene has mechanisms for caching frequent queries.

  • codist: Mar 30, 2007 09:52

    Yeah, Lucene is a good idea if you have common queries, especially full text ones. In my case it won't work (as in the CDOnline version) as the queries are multidimensional and rarely repeated.

  • Chris Lu: Mar 30, 2007 15:57

    Lucene is good, even for exact match, because it's just file access. A database is simply too slow. What your cache actually does is move the data to file access.

    To try Lucene, you can use DBSight. It's super easy. You can create a production-level search in 3 minutes.

    Please take a look at this

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    You can create a full-text database search service and return results as HTML/XML/JSON. It uses Lucene directly in Java, but can be easily used with Ruby, PHP, or any existing database web applications.

    You can easily index, re-index, incremental-index. It's also highly scalable and easily customizable.

  • wpbarr: Apr 12, 2007 14:26

    You have several good options for the distributed datastore:

    - Javaspaces

    - memcached, Tangosol

    - object-oriented databases

    - in-memory databases (Intersystems Cache)

    You should be able to get sub-millisecond access times with any of them. With any of them, you get to stick with a pure object model, and you can use either keyed or associative queries.

  • XCoder: Apr 18, 2007 07:04

    Nobody is talking about Hibernate's caching. How bad is Hibernate's two-layer caching mechanism?

  • Alex Popescu: Oct 06, 2007 04:46

    Hi there!

    I have built a couple of apps using the same approach. However, for those with very high requirements I quickly found out that the DB node can become a bottleneck. So, without wanting to sound like I'm giving advice, I would probably try to partition the data or replicate the storage. Considering that in your case the DB updates are rare, the simplest approach would probably be storage replication.

    bests,

    ./alex

    --

    .w( the_mindstorm )p.

Copyright © 2007 By Andrew Wulf