Building A Virtual Department Store

April 19, 2007

My little project to build a virtual affiliate store around a large number of product feeds is progressing slowly, mainly because of other real work, but it has made some real progress.

I already have the product feeds (all FTP based) loaded and indexed in various ways, including raw text search (inverted indexes based on stemming and lots of stopwords). I have a lot of set operations done as well. Everything runs in memory only (this is the read-only portion of the system). This memory-based query engine is designed to be distributed (accessed via REST), running on top of an embedded Jetty server and using XStream to marshal the data.
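To make the indexing idea concrete, here is a minimal sketch of an in-memory inverted index with stopwords, stemming, and AND queries done as set intersections. It is not the engine described above: the class name `ProductIndex`, the tiny stopword list, and the one-line "stemmer" (it only strips a trailing "s") are all assumptions made for illustration, and a real system would use a proper stemming algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: a tiny in-memory inverted index over product descriptions.
class ProductIndex {
    // Assumed stopword list; a real one would be much longer.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "for", "with"));

    // term -> sorted set of ids of the products whose description contains it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Naive stand-in for a real stemmer: strips one trailing "s" (but not "ss").
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lowercase, split on non-alphanumerics, drop stopwords, stem what remains.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOPWORDS.contains(raw)) {
                continue;
            }
            terms.add(stem(raw));
        }
        return terms;
    }

    void add(int productId, String description) {
        for (String term : tokenize(description)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(productId);
        }
    }

    // AND query: intersect the posting sets of every non-stopword query term.
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> ids =
                    postings.getOrDefault(term, Collections.<Integer>emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(ids);   // copy so the index is never mutated
            } else {
                result.retainAll(ids);         // set intersection
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Other set operations fit the same shape: a union is `addAll` and a difference is `removeAll` on the copied set.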

The biggest problems so far are (1) building a universal category tree and mapping all the merchant products into it (everyone has their own scheme) and (2) deciding what type of interface to provide.

The first issue is basically an information architecture problem. I looked at a lot of shopping sites and tried to discern what motivated their choices, then built one myself. Sadly, I can't yet see any way past manually mapping the store categorizations to mine and then dealing with unknown new ones as they appear. The real solution would be to analyze all the information on a product and build a system capable of mapping the data automatically. That will become necessary when I add larger merchants like Walmart (850K products) and iTunes (2M+).
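The manual approach can be sketched as a lookup table keyed by merchant and merchant category, with anything unrecognized collected for later review. This is only an illustration under assumptions: the class `CategoryMapper`, its method names, the example merchant "FlowerShop", and the use of URL-style paths for the universal tree are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: hand-maintained mapping from merchant categories
// to paths in a single universal category tree.
class CategoryMapper {
    // "merchant|merchant category" -> universal category path
    private final Map<String, String> mapping = new HashMap<>();
    // Keys seen in a feed that nobody has mapped yet.
    private final Set<String> unmapped = new TreeSet<>();

    // Normalize case and whitespace so trivial feed differences still match.
    private static String key(String merchant, String merchantCategory) {
        return merchant.trim().toLowerCase(Locale.ROOT)
                + "|" + merchantCategory.trim().toLowerCase(Locale.ROOT);
    }

    // Record a manual decision: this merchant category belongs at this path.
    void map(String merchant, String merchantCategory, String universalPath) {
        mapping.put(key(merchant, merchantCategory), universalPath);
    }

    // Look up a category; unknown ones are remembered for manual review.
    Optional<String> resolve(String merchant, String merchantCategory) {
        String k = key(merchant, merchantCategory);
        String path = mapping.get(k);
        if (path == null) {
            unmapped.add(k);
        }
        return Optional.ofNullable(path);
    }

    // Read-only view of the categories still waiting for a manual mapping.
    Set<String> pendingReview() {
        return Collections.unmodifiableSet(unmapped);
    }
}
```

Running each daily feed load through `resolve` and then checking `pendingReview()` would surface new merchant categories as they appear.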

The interface question is interesting, as I see three choices: (1) pure Ajax, (2) pure HTML, and (3) both.

A pure Ajax site has the advantage of being much faster to use, given a UI that emphasizes many different ways to slice and dice the products. The downside is that Google sees almost nothing, so you don't get the discovery from people searching for products via Google.

If I provide an interface that ultimately lists all the products from all merchants in a discoverable URL scheme (e.g. /products/flowers/roses/) then all the data will be found in search engines (barring being punished for duplicate content). The downside is a ton of processing as the engines devour all the pages (between Walmart and iTunes that's 300K pages or so). I can cache this information (perhaps as zipped HTML) but it's still a ton of bandwidth and time.
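One way the "zipped HTML" cache could look is a map from URL path to gzip-compressed page bytes, built with the JDK's `java.util.zip` classes. This is a sketch under assumptions, not a decided design: the class `GzipPageCache` and its methods are hypothetical, and it keeps everything in a `ConcurrentHashMap` rather than on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: cache rendered product pages as gzip-compressed bytes.
class GzipPageCache {
    // URL path (e.g. "/products/flowers/roses/") -> gzip bytes of the page
    private final Map<String, byte[]> pages = new ConcurrentHashMap<>();

    static byte[] gzip(String html) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer into the buffer.
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Compress once, when the page is rendered.
    void put(String path, String html) {
        pages.put(path, gzip(html));
    }

    // Returns the stored gzip bytes, or null if the page is not cached.
    byte[] getCompressed(String path) {
        return pages.get(path);
    }
}
```

Because the stored bytes are already in gzip format, they can be sent as-is with a `Content-Encoding: gzip` header to clients that accept it, which saves both recompression time and bandwidth.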

I would like to do both so that people can find the "store" via Google, but still have the benefit of alternate ways to find (and build RSS feeds from) stuff in the store. Performance of all this is very predictable and not too difficult to scale (most of the work happens in memory). If I choose to support reviews and comments then I have to build a more robust database architecture (something to preserve the information); currently the query engine is totally read-only and updated once per day (the data feeds update that way). Currently I can load 3M products per hour on my dual G5 development box.

So this is an interesting project so far, but I need to eat and pay bills so I can't work as much as I would like.

There are many shopping sites, but I see directions that haven't been explored yet. So far my only investment is my time, so once it goes live (whenever) it doesn't have to be huge to be successful.

More later.