Building A Virtual Department Store
Apr 20, 2007 12:06 perm link Readers: 889

My little project in building a virtual affiliate store around a large number of product feeds is progressing slowly, mainly due to other real work. But it has made some real progress.

I already have the product feeds (all ftp based) loaded and indexed in various ways including raw text search (inverted indexes based on stemming and lots of stopwords). I have a lot of set operations done as well. Everything runs in memory only (this is the read-only portion of the system). This memory-based query engine is designed to be distributed (accessed via REST), running on top of an embedded Jetty server and using XStream to marshal the data.
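To make the text-search part concrete, here is a minimal sketch of an in-memory inverted index with stopwords, stemming, and a set-intersection query. It is written in Python for brevity (the real system is Java), and the stopword list, the crude suffix-stripping `stem` function, and the sample products are all assumptions made up for illustration, not the engine's actual code.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "with"}


def stem(word):
    """Very naive suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def build_index(products):
    """Map each stemmed term to the set of product ids containing it."""
    index = defaultdict(set)
    for product_id, description in products.items():
        for term in tokenize(description):
            index[term].add(product_id)
    return index


def search(index, query):
    """AND query: intersect the posting sets of every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample data.
products = {
    1: "A dozen red roses in a vase",
    2: "Red tulips for spring",
    3: "Rose scented candles",
}
index = build_index(products)
print(search(index, "red rose"))
```

Because every posting list is a plain set, other set operations (union for OR, difference for NOT) come for free on the same index.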

The biggest problems so far are (1) building a universal category tree and mapping all the merchant products into it (everyone has their own scheme) and (2) deciding what type of interface to provide.

The first issue is basically an information architecture problem. I looked at a lot of shopping sites and tried to discern what motivated their choices, then built myself one. Sadly I can't see any way yet past manually mapping the store categorizations to mine and then dealing with unknown new ones as they appear. The real solution would be to analyze all the information on a product and build a system capable of mapping the data automatically. That will become necessary when I add larger merchants like Walmart (850K products) and iTunes (2M+).
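The manual mapping approach can be sketched roughly as a lookup table plus a review queue for categories that have not been mapped yet. This is only an illustration in Python; the merchant names, category strings, and the `map_products` helper are hypothetical.

```python
# Hand-maintained table: (merchant, merchant category) -> universal path.
# All merchant names and categories here are made up for illustration.
CATEGORY_MAP = {
    ("flowershop", "Bouquets/Roses"): "flowers/roses",
    ("flowershop", "Bouquets/Tulips"): "flowers/tulips",
    ("gardenmart", "Cut Flowers > Roses"): "flowers/roses",
}


def map_products(products, category_map):
    """Split a feed into mapped products and a queue of unknown categories."""
    mapped = []
    unknown = {}
    for product in products:
        key = (product["merchant"], product["category"])
        target = category_map.get(key)
        if target is None:
            # Count occurrences so the most common unknowns get mapped first.
            unknown[key] = unknown.get(key, 0) + 1
        else:
            mapped.append((product["id"], target))
    return mapped, unknown


# Hypothetical feed rows.
feed = [
    {"id": 1, "merchant": "flowershop", "category": "Bouquets/Roses"},
    {"id": 2, "merchant": "gardenmart", "category": "Cut Flowers > Roses"},
    {"id": 3, "merchant": "gardenmart", "category": "Seasonal > Wreaths"},
]
mapped, unknown = map_products(feed, CATEGORY_MAP)
print(mapped)
print(unknown)
```

Running this after each daily feed load would surface new merchant categories as they appear, which is the "dealing with unknown new ones" step.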

The interface question is interesting as I see three choices: (1) pure ajax, (2) pure html, and (3) both.

A pure ajax site has the advantage of being much faster to use, given a UI that emphasizes many different ways to slice-and-dice the products. The downside is that Google sees nothing much and you don't get the discovery from people searching for products via Google.

If I provide an interface that ultimately lists all the products from all merchants in a discoverable URL scheme (e.g. /products/flowers/roses/) then all the data will be found in search engines (barring being punished for duplicate content). The downside is a ton of processing as the engines devour all the pages (between Walmart and iTunes that's 300K pages or so). I can cache this information (perhaps as zipped HTML) but it's still a ton of bandwidth and time.
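As a rough sketch of the discoverable URL scheme with a zipped-HTML cache, the Python below builds a crawlable path from a category path and stores the rendered page gzip-compressed. The function names, the placeholder markup, and the in-process `page_cache` dict are assumptions for illustration only.

```python
import gzip
import re
from html import escape


def slugify(name):
    """Turn a category name into a URL-safe path segment."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def category_url(path):
    """Build a crawlable URL such as /products/flowers/roses/."""
    return "/products/" + "/".join(slugify(part) for part in path) + "/"


def render_listing(path, product_names):
    """Render a bare-bones HTML listing page (placeholder markup)."""
    title = escape(" / ".join(path))
    items = "".join("<li>%s</li>" % escape(name) for name in product_names)
    return "<h1>%s</h1><ul>%s</ul>" % (title, items)


# url -> gzip-compressed HTML bytes, rebuilt after each daily feed load.
page_cache = {}


def cache_page(path, product_names):
    """Render a listing once and keep only the compressed bytes."""
    url = category_url(path)
    html = render_listing(path, product_names)
    page_cache[url] = gzip.compress(html.encode("utf-8"))
    return url


url = cache_page(["Flowers", "Roses"], ["Red roses", "White roses"])
print(url, len(page_cache[url]), "bytes compressed")
```

Pages stored this way can be sent as-is with a `Content-Encoding: gzip` header to clients that accept it, so the compression cost is paid once per day rather than once per request.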

I would like to do both so that people can find the "store" via Google, but still have the benefit of alternate ways to find (and build RSS feeds from) stuff in the store. Performance of all this is very predictable and not too difficult to scale (most of the work is happening in memory). If I choose to support reviews and comments then I have to build a more robust database architecture (something to preserve the information); currently the query engine is totally read-only and updated once per day (the data feeds update that way). Currently I can load 3M products per hour into my Dual G5 development box.
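One common way to handle a read-only store that is refreshed once a day is to build the new data off to the side and swap it in with a single reference assignment. The Python sketch below shows that pattern under my own assumptions; the `ProductStore` class and the row layout are hypothetical and not the engine's actual design.

```python
import threading


class ProductStore:
    """Read-only snapshot that is replaced wholesale after each feed load."""

    def __init__(self):
        self._snapshot = {}
        self._reload_lock = threading.Lock()

    def get(self, product_id):
        # Readers use whatever snapshot is current; a snapshot is never
        # mutated after it has been swapped in.
        return self._snapshot.get(product_id)

    def reload(self, feed_rows):
        """Build a fresh snapshot off to the side, then swap it in."""
        with self._reload_lock:  # only one reload at a time
            fresh = {row["id"]: row for row in feed_rows}
            self._snapshot = fresh  # single reference assignment


store = ProductStore()
store.reload([{"id": 1, "name": "roses"}])
print(store.get(1))
store.reload([{"id": 2, "name": "tulips"}])
print(store.get(1), store.get(2))
```

For scale, 3M products per hour works out to roughly 833 products per second (3,000,000 / 3,600), so a full daily rebuild of a few million products fits comfortably inside the once-per-day update window.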

So this is an interesting project so far, but I need to eat and pay bills so I can't work as much as I would like.

There are many shopping sites but I see directions that haven't been explored yet. All my investment is my time so far, so once it goes live (whenever) it doesn't have to be huge to be successful.

More later.

  • Ryan Doherty: Apr 20, 2007 15:48

    You can have both AJAX and a discoverable URL scheme. Your homepage can have links to the discoverable URL scheme, but in JS you can attach event handlers to stop the browser from going there and dynamically load the content into whatever part of the page you want. Progressive enhancement/graceful degradation.

    You should also be gzipping your HTML anyway before you send it over the wire. And you could set an expires header for 1 day or something like that for the full listings. Hopefully that will reduce your bandwidth.

    Maybe you could do a user agent check to see if it's a search engine and feed them the cached versions of pages that are generated daily?

    And cache, cache, cache as much as you can. You have a lot of stuff to cache, but I'm sure there's a way to cache the most heavily used info so you reduce the workload.

    Good luck!

  • Stephen: Apr 23, 2007 16:12

    By "everything is in memory", do you mean the database too? And in any case, how big is memory? I remember when 10 megabytes was a big database. But 10^9 / (2*10^6) is still about 500 bytes per record. That's a lot. More than this comment. And that's only one gigabyte of RAM - the minimum RAM needed to run Vista. Face it, one gigabyte is a low end desktop. (Don't tell me Vista will run in 500 MB - that's fiction, or a lie, or something.)

  • codist: Apr 23, 2007 18:15

    There will be a database but only for a few constant items. Most of the data is loaded fresh every day into RAM and processed on the fly. RAM as in 2-4GB. I'd love an OctoMac, 16GB and 8 cores, but oh well.

  • Andries: May 13, 2007 01:06

    Our company tried the same thing. In the end the whole project was canned due to a single business reality: thousands of feeds does not mean that all products are equally useful to consumers. Even as a marketing company, we just could not find a way to market those products effectively. Our company also made one big technical blunder: instead of focusing all their time and resources on developing algorithms to filter out bogus products, they had us build a distributed search index. Sure, in the end it allowed us to search about 9 million products in under 1.5 seconds (worst case, including network latency and rendering in the browser), but it failed to resolve the fact that most queries brought back crap. Questions such as how to categorize products were put on the back burner, while we all had to pitch in on our then Chief Technical Officer's wet dream of gaining speed across volume.

    We gained speed by loading our master index into memory, and carefully monitoring each node on the network to ensure the index size never exceeded the available RAM. Scaling the search network simply meant adding another node on the network, either a physical box, or if a computer has more than one processor, we added another virtual node.

    In retrospect, using RAM for speed is a cop-out. It ups the cost of scaling, thereby negating the purpose of building a distributed system in the first place. Our CTO also forced us to store the actual data in the indexes, which increased the index size (we ended up using the indexes as a storage mechanism instead of a data retrieval mechanism). This limited how many products could be pushed to each node before we ran into the physical constraints of RAM.

    In the end, I would recommend thinking hard and constantly about the architecture. Committing to a search goal such as under 1 second per search (as in our case) may sound exciting (and believe me it is), but in the end you would like users to visit your site and buy stuff. And a user is much more forgiving when he actually finds what he is looking for. Otherwise, even if you serve up a search over many gigabytes of data in 0.3 seconds, that same user will go back to Yahoo, Amazon etc.

    Rather spend time and effort in a good cataloging system, or use a scheme such as digg whereby users rate products, and feed that back into your catalog & search algorithm.

Building My Cheap, Scalable, High-Volume Query Site
Mar 29, 2007 08:57 perm link Readers: 2902

The project I am working on at home will require an architecture to support a high volume of ajax calls slicing and dicing data from a fairly large database. Since the data is only updated once a day, there are interesting options on how to build the system.

Web applications can be roughly divided into two types: mostly read and some write. The first type is mainly used to show the user some information based on some type of query, which could be explicit (like a query by example form) or more commonly based on a UI that gives them options. The second may do the same but allows updating of information by the user. It's rare for an application to not have some update features so what type it is may be hard to pin down. I view the difference as whether the user can update the information they are querying.

Amazon, for example, has a lot of write features (ordering, building lists, commenting) but they mostly don't affect the catalog of items (other than availability).

Digg may spend most of its time showing similar content but users are actually modifying the data as they interact with it.

The site I am trying to build will be more like Amazon, in that the majority of the information only changes rarely (like once per day) and that other interaction is peripheral. This makes it easier to use all sorts of caching strategies.

When I worked out the problem in 1998 with Consumers Digest Online we were using WebObjects, which would run separate processes (called instances) on the same server (or multiple servers). Originally my digested search information (about 20MB, mostly a big tree structure and associated indexes) was loaded into each instance, which somewhat limited how many we could run in a single server box. Later we modified the cache data (which was all static data) to run inside a shared memory space (HP/UX). This architecture then looked like this:

We were able to support as many as 2000 simultaneous users with this structure, even though the search engine was fuzzy with a lot of derived data searches. The average response time through the cache for any query was subsecond. There were something like 40,000 packages of features, 20,000 products, and an almost infinite combination of potential results. You couldn't cache individual results as each search was unlikely to ever appear again. The data changed once per week, and was processed for the caches after the new database was loaded. All display data came from the database directly.

During my year of working on projects for Sabre I wrote a white paper for them on how I would build a replacement for their monster reservation system (with its 8000-per-second query requirement) based on a similar but larger architecture. I don't think it went anywhere but eventually they built something with a more traditional database caching architecture. In their case the flights don't change too often but the reservations do so it is more of a type 2.

For my new application I will be managing a wide variety of slice-and-dice searches through a much larger space (iTunes, for example, has 2M+ tracks) but at least the information isn't derived from the database data like in the CDOnline system (with its complex required and disallowed package combinations). The biggest issue is that in building an Ajax front end you need fast response even under heavy loads, since queries are 95% of what the application will do.
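A slice-and-dice search of this kind can be sketched as set intersections over per-attribute indexes, plus counts of what remains under each value. The Python below is only an illustration; the attribute names, the sample tracks, and the helper functions are hypothetical.

```python
from collections import Counter, defaultdict


def build_facets(products):
    """facets[attribute][value] -> set of product ids with that value."""
    facets = defaultdict(lambda: defaultdict(set))
    for product_id, attrs in products.items():
        for attribute, value in attrs.items():
            facets[attribute][value].add(product_id)
    return facets


def select(facets, all_ids, filters):
    """Intersect the id sets for every chosen attribute=value pair."""
    result = set(all_ids)
    for attribute, value in filters.items():
        result &= facets.get(attribute, {}).get(value, set())
    return result


def counts(products, ids, attribute):
    """How many selected products fall under each value of an attribute."""
    return Counter(
        products[i][attribute] for i in ids if attribute in products[i]
    )


# Hypothetical sample data.
tracks = {
    1: {"genre": "jazz", "decade": "1960s"},
    2: {"genre": "jazz", "decade": "1970s"},
    3: {"genre": "rock", "decade": "1970s"},
}
facets = build_facets(tracks)
jazz = select(facets, tracks, {"genre": "jazz"})
print(jazz, counts(tracks, jazz, "decade"))
```

Each additional filter only narrows an in-memory set, so rarely repeated, multidimensional queries do not depend on a result cache to stay fast.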

The architecture for this system looks something like this:

The point is to replicate the cache servers, which have enough RAM to store the data required for searching all in memory. As the data in the database is updated only once per day, it can be optimized for the fastest possible searching when loaded into the cache server. Of course you could build something more traditional which would allow the database to manage the caching or use a caching framework (of which there are plenty both free and for-pay).

In my case I want to manage the searching myself as most of it involves deeply nested trees, which are hard to optimize in a relational database (Oracle has some benefits here but I've never used them). RAM is always faster than hard drives (at least until the new Flash drives get cheap enough) and since I know exactly what I will allow the users to do, I can build the cache precisely for those needs.
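To illustrate why a purpose-built in-memory tree can beat a general relational query here, the sketch below precomputes the full product set for every subtree once at load time, so "everything under this category" becomes a short walk plus a lookup. It is a Python illustration under my own assumptions; the `CategoryNode` class and helper names are hypothetical.

```python
class CategoryNode:
    def __init__(self, name):
        self.name = name
        self.children = {}
        self.products = set()  # products filed directly under this node
        self.subtree = set()   # filled by precompute(): own + descendants


def add_product(root, path, product_id):
    """Walk (and create) nodes along the path, filing the product at the end."""
    node = root
    for name in path:
        node = node.children.setdefault(name, CategoryNode(name))
    node.products.add(product_id)


def precompute(node):
    """Post-order pass run once after loading; caches each subtree's products."""
    node.subtree = set(node.products)
    for child in node.children.values():
        node.subtree |= precompute(child)
    return node.subtree


def products_under(root, path):
    """Answer 'everything under this category' with no recursion at query time."""
    node = root
    for name in path:
        node = node.children.get(name)
        if node is None:
            return set()
    return node.subtree


root = CategoryNode("root")
add_product(root, ["flowers", "roses"], 1)
add_product(root, ["flowers", "tulips"], 2)
add_product(root, ["music", "jazz"], 3)
precompute(root)
print(products_under(root, ["flowers"]))
```

The precomputed sets trade RAM for query speed, which fits a design where the data only changes once a day and the allowed queries are known in advance.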

The pleasant thing about this architecture is that I can not only grow it as needed, but I can also distribute it over multiple data centers if necessary. This is basically what all of the big companies do. I expect to be able to manage as many as 100 transactions per second, all without any massive investment in servers (money I don't have anyway).

One other note is that I use Jetty as my appserver, which is really fast for Ajax calls with its continuation based architecture.

  • Anand Sharma: Mar 29, 2007 22:05

    I am wondering what software/service you're planning to use to provide the "cache server" capability?

  • nemlah: Mar 30, 2007 06:07

    We have developed an application with the exact same requirements, and while we haven't had to add a caching tier to our solution so far, if growth continues as planned we will have to add it soon enough. One thing I have always wondered about is how costly it is to create the caches (for example, if we need to update the main database during the day, how fast will the caches be up to date?). The application is written in Rails and the obvious choice would be memcached, but I am looking forward to seeing your solution.

    Regards,

    Nemlah

  • Craig: Mar 30, 2007 09:42

    I would swap out parts of your existing data layer which are querying the database for results with a layer which queries Lucene for results.

    Then, simply regenerate your Lucene indexes nightly. You shouldn't need that Cache layer since I believe Lucene has mechanisms for caching frequent queries.

  • codist: Mar 30, 2007 09:52

    Yeah lucene is a good idea if you have common queries, especially full text ones. In my case it won't work (as in the CDOnline version) as the queries are multidimensional and rarely repeated.

  • Chris Lu: Mar 30, 2007 15:57

    Lucene is good, even for exact matches, because it's just file access. A database is simply too slow. What your cache actually does is move the data to file access.

    To try Lucene, you can use DBSight. It's super easy. You can create a production-level search in 3 minutes.

    Please take a look at this

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    You can create a full-text database search service and return results as HTML/XML/JSON. It uses Lucene directly in Java, but can be easily used with Ruby, PHP, or any existing database web applications.

    You can easily index, re-index, incremental-index. It's also highly scalable and easily customizable.

  • wpbarr: Apr 12, 2007 14:26

    You have several good options for the distributed datastore:

    - Javaspaces

    - memcached, Tangosol

    - object-oriented databases

    - in-memory databases (Intersystems Cache)

    You should be able to get sub-millisecond access times with any of them. With any of them, you get to stick with a pure object model, and you can use either keyed or associative queries.

  • XCoder: Apr 18, 2007 07:04

    Since nobody is talking about Hibernate's caching: how bad is Hibernate's two-layer caching mechanism?

  • Alex Popescu: Oct 06, 2007 04:46

    Hi there!

    I have built a couple of apps using the same approach. However, for those with very high requirements I quickly found out that the DB node can become a bottleneck. So, without wanting to sound like I'm giving advice, I would probably try to partition the data or replicate the storage. Considering that in your case the DB updates are rare, the simplest approach would probably be storage replication.

    bests,

    ./alex

    --

    .w( the_mindstorm )p.

The Naked Ajax Application
Mar 05, 2007 15:57 perm link Readers: 1485

In my last two jobs I worked on three Java applications, all ajax web applications using no web framework whatsoever. All three were for an intranet, so there was no need to worry about Javascript being turned off.

It might not be possible for everyone, but it sure made development much easier and each application was well received by the user base.

Typically when developing a Java web application you would use a web framework, either using J2EE (JSP) and usually an additional framework such as Struts/Tiles, or perhaps avoiding most of J2EE by using Spring, either by itself or with another web framework, or even using a servlet-based framework like Tapestry or Wicket. However you do it, the framework supplies the environment, generates HTML, processes form data, and all the other usual web workings. In your code you interact with the framework and also interact with some kind of database framework, such as plain JDBC, EJB, Hibernate or iBatis.

In a naked web application, the web framework and its (usually heavy) configuration is replaced by a much leaner architecture.

In my case the ajax framework was DWR. DWR (Direct Web Remoting) is a popular framework for Java which allows calls to be made to Java beans on the server from dynamically generated Javascript objects in the browser. Essentially it provides a pipeline from your code on the client to your code on the server. In the upcoming version 2, DWR will also support reverse ajax, where server objects can asynchronously call client objects.
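The "pipeline" idea can be illustrated with a tiny dispatcher that accepts a JSON-encoded call from the client and invokes an explicitly exposed server-side method. To be clear, this is not DWR's API or wire format, just a rough Python analogue of the concept; the `Directory` class, the `EXPOSED` table, and the request shape are all invented for the example.

```python
import json


class Directory:
    """Stand-in for a server-side bean; the data is hypothetical."""

    def __init__(self):
        self._people = {"jdoe": "Jane Doe"}

    def find(self, username):
        return self._people.get(username)


# Only objects and methods listed here can be called from the client.
EXPOSED = {"directory": (Directory(), {"find"})}


def handle_call(request_body):
    """Decode a JSON call, dispatch to an exposed method, encode the reply."""
    call = json.loads(request_body)
    target, allowed = EXPOSED.get(call.get("bean"), (None, set()))
    method = call.get("method")
    if target is None or method not in allowed:
        return json.dumps({"error": "not exposed"})
    result = getattr(target, method)(*call.get("args", []))
    return json.dumps({"result": result})


reply = handle_call('{"bean": "directory", "method": "find", "args": ["jdoe"]}')
print(reply)
```

The explicit allow-list matters in any remoting layer of this kind, since the client chooses which method name to send.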

You could argue that I am simply replacing one web framework with a lighter one. I wouldn't complain, that's the whole point.

The three applications I worked on were an employee directory, a field office and staff information application (a sort of super directory with a lot of query options); and for a different employer, a digital printing press room management console.

The directories used Oracle and iBatis. The employee directory was a single HTML page, and the field application had a single HTML page and about 1500 supplementary static pages generated daily. The main HTML "page" held only the static areas of the application, and onload obtained information via DWR from Java beans on the server. As the user typed or chose various options, the interface and data display updated on the fly as needed. There were several alternate views of data which were made visible/hidden as needed. For these applications I used mostly DWR's javascript utilities.

The press application called an existing API and a bit of JDBC for data, and also interacted via a REST interface with other systems. This application was used to manage the company's digital presses, displaying all related print jobs in a large grid (updated in real time as orders came in), with various optional details and controls (deleting, pausing, restarting, etc), and allowed the operators to drag and drop jobs to the targeted press.

Development in all three cases was rapid and the details and functionality were constantly in flux. Spending a lot of time configuring a traditional web application, especially one with lots of XML to edit, would have been doable but far less productive. In two weeks I wrote 3 complete versions of the press application. It made testing new interfaces much easier, since I could easily build a functional prototype.

Now if I had to build a huge application with a large team, this approach might not be so easy. Of course readers might point out that I could have used Ruby On Rails but that wasn't an option for either employer. In any case this technique allowed me to get some of the benefits of the ROR approach without having to give up the Java environment.

Copyright © 2007 By Andrew Wulf