Building A Virtual Department Store
Apr 20, 2007 12:06

My little project in building a virtual affiliate store around a large number of product feeds is progressing slowly, mainly due to other real work. But it has made some real progress.

I already have the product feeds (all FTP-based) loaded and indexed in various ways, including raw text search (inverted indexes built with stemming and a long stopword list). I have a lot of set operations done as well. Everything runs in memory only (this is the read-only portion of the system). This memory-based query engine is designed to be distributed (accessed via REST), running on top of an embedded Jetty server and using XStream to marshal the data.
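
As a rough illustration of the indexing idea (not the actual code behind this project), here is a minimal Python sketch of an inverted index with a stopword list, a deliberately crude stemmer, and set intersection for multi-term queries. The stopword list, the `stem` rule, and the sample products are all assumptions made up for the example.

```python
from collections import defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "with"}


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords, then stem."""
    words = (w.strip(".,!?()").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOPWORDS]


def build_index(products):
    """Map each stemmed term to the set of product ids containing it."""
    index = defaultdict(set)
    for product_id, description in products.items():
        for term in tokenize(description):
            index[term].add(product_id)
    return index


def search_all(index, query):
    """Products containing every query term (set intersection)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample data.
PRODUCTS = {
    1: "A dozen red roses in a vase",
    2: "Red tulips for spring",
    3: "Rose scented candles",
}
INDEX = build_index(PRODUCTS)
```

With this sample data, `search_all(INDEX, "red roses")` narrows to product 1, because both "roses" and "Rose" stem to the same term and the "red" and "rose" posting sets are intersected.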

The biggest problems so far are (1) building a universal category tree and mapping all the merchant products into it (everyone has their own scheme) and (2) deciding what type of interface to provide.

The first issue is basically an information architecture problem. I looked at a lot of shopping sites and tried to discern what motivated their choices, then built a tree of my own. Sadly I can't yet see any way past manually mapping the store categorizations to mine and then dealing with unknown new ones as they appear. The real solution would be to analyze all the information on a product and build a system capable of mapping the data automatically. That will become necessary when I add larger merchants like Walmart (850K products) and iTunes (2M+).
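
The manual approach can be pictured as a lookup table plus a list of leftovers to review. The sketch below is only an assumption about how such a table might look; the merchant names and category strings are hypothetical.

```python
# Hypothetical merchant names and category labels, for illustration only.
CATEGORY_MAP = {
    ("merchant_a", "Flowers > Roses"): "/products/flowers/roses/",
    ("merchant_b", "Floral/Rose Bouquets"): "/products/flowers/roses/",
}


def map_category(merchant, merchant_category, unmapped):
    """Return the universal category, or record the pair for manual review."""
    key = (merchant, merchant_category)
    target = CATEGORY_MAP.get(key)
    if target is None:
        unmapped.add(key)  # new or unknown category: a human has to decide
    return target
```

Each daily feed load would then leave behind an `unmapped` set containing exactly the new merchant categories that still need a manual decision.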

The interface question is interesting, as I see three choices: (1) pure Ajax, (2) pure HTML, and (3) both.

A pure ajax site has the advantage of being much faster to use, given a UI that emphasizes many different ways to slice-and-dice the products. The downside is that Google sees nothing much and you don't get the discovery from people searching for products via Google.

If I provide an interface that ultimately lists all the products from all merchants in a discoverable URL scheme (e.g. /products/flowers/roses/) then all the data will be found in search engines (barring being punished for duplicate content). The downside is a ton of processing as the engines devour all the pages (between Walmart and iTunes that's 300K pages or so). I can cache this information (perhaps as zipped HTML) but it's still a ton of bandwidth and time.
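
For what caching "zipped HTML" could look like, here is a small sketch under stated assumptions: pages are keyed by their discoverable URL path and stored gzip-compressed in memory. The `PageCache` class and its method names are invented for this example.

```python
import gzip


class PageCache:
    """In-memory cache of rendered listing pages, stored gzip-compressed."""

    def __init__(self):
        self._pages = {}

    def put(self, path, html):
        """Compress and store the rendered HTML for a URL path."""
        self._pages[path] = gzip.compress(html.encode("utf-8"))

    def get_compressed(self, path):
        """Bytes that could be sent as-is to a client that accepts gzip."""
        return self._pages.get(path)

    def get_html(self, path):
        """Decompressed HTML, or None if the page is not cached."""
        data = self._pages.get(path)
        return None if data is None else gzip.decompress(data).decode("utf-8")
```

Storing the compressed bytes means a crawler request for a cached page costs neither rendering nor compression time, only the bandwidth.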

I would like to do both so that people can find the "store" via Google, but still have the benefit of alternate ways to find (and build RSS feeds from) stuff in the store. Performance of all this is very predictable and not too difficult to scale (most of the work is happening in memory time). If I choose to support reviews and comments then I have to build a more robust database architecture (something to preserve the information); currently the query engine is totally read-only and updated once per day (the data feeds update that way). Currently I can load 3M products per hour into my Dual G5 development box.

So this is an interesting project so far, but I need to eat and pay bills so I can't work as much as I would like.

There are many shopping sites but I see directions that haven't been explored yet. All my investment so far is my time, so once it goes live (whenever) it doesn't have to be huge to be successful.

More later.

  • Ryan Doherty: Apr 20, 2007 15:48

    You can have both AJAX and a discoverable URL scheme. Your homepage can have links to the discoverable URL scheme, but in JS you can attach event handlers to stop the browser from going there and dynamically load the content into whatever part of the page you want. Progressive enhancement/graceful degradation.

    You should also be gzipping your HTML anyway before you send it over the wire. And you could set an expires header for 1 day or something like that for the full listings. Hopefully that will reduce your bandwidth.

    Maybe you could do a user agent check to see if it's a search engine and feed them the cached versions of pages that are generated daily?

    And cache, cache, cache as much as you can. You have a lot of stuff to cache, but I'm sure there's a way to cache the most heavily used info so you reduce the workload.

    Good luck!

  • Stephen: Apr 23, 2007 16:12

    By "everything is in memory", do you mean the database too? And in any case, how big is memory? I remember when 10 megabytes was a big database. But 10^9 / (2*10^6) is still about 500 bytes per record. That's a lot. More than this comment. And that's only one gigabyte of RAM - the minimum RAM needed to run Vista. Face it, one gigabyte is a low end desktop. (Don't tell me Vista will run in 500 MB - that's fiction, or a lie, or something.)

  • codist: Apr 23, 2007 18:15

    There will be a database but only for a few constant items. Most of the data is loaded fresh every day into RAM and processed on the fly. RAM as in 2-4GB. I'd love an OctoMac, 16GB and 8 cores, but oh well.

  • Andries: May 13, 2007 01:06

    Our company tried the same thing. In the end the whole project was canned due to a single business reality: thousands of feeds does not mean that all products are equally useful to consumers. Even as a marketing company, we just could not find a way to market those products effectively. Our company also made one big technical blunder: instead of focusing all their time and resources on developing algorithms to filter out bogus products, they had us build this distributed search index. Sure, in the end it allowed us to search about 9 million products in under 1.5 seconds (worst case, including network latency and rendering in the browser), but it failed to resolve the fact that most queries brought back crap. Questions such as how to categorize products were put on the back burner, while we all had to pitch in to our then Chief Technical Officer's wet dream of gaining speed across volume.

    We gained speed by loading our master index into memory, and carefully monitoring each node on the network to ensure the index size never exceeded the available RAM. Scaling the search network simply meant adding another node on the network, either a physical box, or if a computer has more than one processor, we added another virtual node.

    In retrospect, using RAM for speed is a cop-out. It ups the cost of scaling, thereby negating the purpose of building a distributed system in the first place. Our CTO also forced us to store the actual data in the indexes as well, which also increased the size of the index (we ended up using the indexes as a storage mechanism instead of a data retrieval mechanism). This limited exactly how many products could be pushed to each node before we ran into the physical constraints of RAM.

    In the end, I would recommend thinking hard and constantly about the architecture. Committing to a search goal such as under 1 second per search (as in our case) may sound exciting (and believe me it is), but in the end you would like users to visit your site and buy stuff. And a user is much more forgiving when he actually finds what he is looking for. Otherwise, even if you serve up a search over many gigabytes of data in 0.3 seconds, that same user will go back to Yahoo, Amazon, etc.

    Rather spend time and effort in a good cataloging system, or use a scheme such as digg whereby users rate products, and feed that back into your catalog & search algorithm.

Links: Yahoo On Web/Ajax Performance
Apr 02, 2007 08:07

Found this series of articles on the yahoo UI blog. Some interesting stuff. Pardon if you all read this already. There is so much information on the web it's easy to miss it the first time around.

Performance Research Part 1: What the 80/20 Rule Tells Us about Reducing HTTP Requests

Performance Research Part 2: Browser Cache Usage - Exposed!

Performance Research Part 3: When the Cookie Crumbles

Building My Cheap, Scalable, High-Volume Query Site
Mar 29, 2007 08:57

The project I am working on at home will require an architecture to support a high volume of ajax calls slicing and dicing data from a fairly large database. Since the data is only updated once a day, there are interesting options on how to build the system.

Web applications can be roughly divided into two types: mostly read and some write. The first type is mainly used to show the user some information based on some type of query, which could be explicit (like a query by example form) or more commonly based on a UI that gives them options. The second may do the same but allows updating of information by the user. It's rare for an application to not have some update features so what type it is may be hard to pin down. I view the difference as whether the user can update the information they are querying.

Amazon, for example, has a lot of write features (ordering, building lists, commenting) but they mostly don't affect the catalog of items (other than availability).

Digg may spend most of its time showing similar content but users are actually modifying the data as they interact with it.

The site I am trying to build will be more like Amazon, in that the majority of the information only changes rarely (like once per day) and that other interaction is peripheral. This makes it easier to use all sorts of caching strategies.

When I worked out the problem in 1998 with Consumers Digest Online we were using WebObjects, which would run separate processes (called instances) on the same server (or across multiple servers). Originally my digested search information (about 20MB, mostly a big tree structure and associated indexes) was loaded into each instance, which somewhat limited how many we could run in a single server box. Later we modified the cache data (which was all static data) to run inside a shared memory space (HP/UX). This architecture then looked like this:

We were able to support as many as 2000 simultaneous users with this structure, even though the search engine was fuzzy with a lot of derived data searches. The average response time through the cache for any query was subsecond. There were something like 40,000 packages of features, 20,000 products, and an almost infinite combination of potential results. You couldn't cache individual results as each search was unlikely to ever appear again. The data changed once per week, and was processed for the caches after the new database was loaded. All display data came from the database directly.

During my year of working on projects for Sabre I wrote a white paper for them on how I would build a replacement for their monster reservation system (with its 8000-per-second query requirement) based on a similar but larger architecture. I don't think it went anywhere but eventually they built something with a more traditional database caching architecture. In their case the flights don't change too often but the reservations do so it is more of a type 2.

For my new application I will be managing a wide variety of slice-and-dice searches through a much larger space (iTunes, for example, has 2M+ tracks) but at least the information isn't derived from the database data like in the CDOnline system (with its complex required and disallowed package combinations). The biggest issue is that in building an Ajax front end you need fast response even under heavy loads, since queries are 95% of what the application will do.

The architecture for this system looks something like this:

The point is to replicate the cache servers, each of which has enough RAM to store all the data required for searching in memory. As the data in the database is updated only once per day, it can be optimized for the fastest possible searching when loaded into the cache server. Of course you could build something more traditional which would allow the database to manage the caching, or use a caching framework (of which there are plenty, both free and for-pay).
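
One way to picture a cache server that is rebuilt once per day is a snapshot that gets swapped wholesale. This is only a sketch under assumptions: `SnapshotCache` and the `loader` callable are hypothetical names, and the real system may reload quite differently.

```python
import threading


class SnapshotCache:
    """Read-only snapshot that is rebuilt off to the side and then swapped in."""

    def __init__(self, loader):
        self._loader = loader            # callable returning a fresh dict
        self._reload_lock = threading.Lock()
        self._snapshot = loader()

    def reload(self):
        """Build a new snapshot, then replace the old one in a single step."""
        with self._reload_lock:          # only one reload at a time
            fresh = self._loader()       # slow part; readers are not blocked
            self._snapshot = fresh       # readers see old or new, never a mix

    def query(self, key):
        """Look up a key in whichever snapshot is current."""
        return self._snapshot.get(key)
```

Because queries never mutate the snapshot, each replicated cache server can run the same daily `reload()` independently of the others.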

In my case I want to manage the searching myself as most of it involves deeply nested trees, which are hard to optimize in a relational database (Oracle has some benefits here but I've never used them). RAM is always faster than hard drives (at least until the new Flash drives get cheap enough) and since I know exactly what I will allow the users to do, I can build the cache precisely for those needs.
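
To make the "deeply nested trees" point concrete, here is a hedged sketch of one possible in-memory structure: a category tree where every node already holds the ids of all products beneath it, so a query at any depth is a walk down the path followed by a set read. The class and function names are assumptions, not the design actually used here.

```python
class CategoryNode:
    """Tree node holding the product ids of its whole subtree."""

    def __init__(self, name):
        self.name = name
        self.children = {}
        self.products = set()


def build_tree(rows):
    """rows: iterable of (category_path_tuple, product_id) pairs."""
    root = CategoryNode("")
    for path, product_id in rows:
        node = root
        node.products.add(product_id)
        for part in path:
            if part not in node.children:
                node.children[part] = CategoryNode(part)
            node = node.children[part]
            node.products.add(product_id)  # precomputed at load time
    return root


def lookup(root, path):
    """All products at or below the given category path."""
    node = root
    for part in path:
        node = node.children.get(part)
        if node is None:
            return frozenset()
    return frozenset(node.products)
```

The trade is memory for speed: ids are duplicated at every ancestor, which is affordable precisely because the data is read-only between daily loads.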

The pleasant thing about this architecture is that I can not only grow it as needed, but I can also distribute it over multiple data centers if necessary. This is basically what all of the big companies do. I expect to be able to manage as much as 100 transactions per second, all without any massive investment in servers (money I don't have anyway).

One other note is that I use Jetty as my appserver, which is really fast for Ajax calls with its continuation based architecture.

  • Anand Sharma: Mar 29, 2007 22:05

    I am wondering what software/service you're planning to use to provide the "cache server" capability?

  • nemlah: Mar 30, 2007 06:07

    We have developed an application with the exact same requirements, and while we haven't had to add a caching tier to our solution so far, if growth continues as planned we will have to add it soon enough. One thing I was always wondering about is how costly it is to create the caches (if, for example, we need to update the main database during the day, how quickly will the caches be up to date?). The application is written in Rails and the obvious choice would be memcached, but I am looking forward to seeing your solution.

    Regards,

    Nemlah

  • Craig: Mar 30, 2007 09:42

    I would swap out parts of your existing data layer which are querying the database for results with a layer which queries Lucene for results.

    Then, simply regenerate your Lucene indexes nightly. You shouldn't need that Cache layer since I believe Lucene has mechanisms for caching frequent queries.

  • codist: Mar 30, 2007 09:52

    Yeah lucene is a good idea if you have common queries, especially full text ones. In my case it won't work (as in the CDOnline version) as the queries are multidimensional and rarely repeated.

  • Chris Lu: Mar 30, 2007 15:57

    Lucene is good, even for exact match, because it's just file access. A database is simply too slow. What your cache actually does is move the data to file access.

    To try Lucene, you can use DBSight. It's super easy. You can create a production-level search in 3 minutes.

    Please take a look at this

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    You can create a full-text database search service and return results as HTML/XML/JSON. It uses Lucene directly in Java, but can easily be used with Ruby, PHP, or any existing database web applications.

    You can easily index, re-index, incremental-index. It's also highly scalable and easily customizable.

  • wpbarr: Apr 12, 2007 14:26

    You have several good options for the distributed datastore:

    - Javaspaces

    - memcached, Tangosol

    - object-oriented databases

    - in-memory databases (Intersystems Cache)

    You should be able to get sub-millisecond access times with any of them. With any of them, you get to stick with a pure object model, and you can use either keyed or associative queries.

  • XCoder: Apr 18, 2007 07:04

    Since nobody is talking about Hibernate's caching: how bad is Hibernate's two-layer caching mechanism?

  • Alex Popescu: Oct 06, 2007 04:46

    Hi there!

    I have built a couple of apps using the same approach. However, for those with very high requirements I quickly found out that the DB node can become a bottleneck. So, without wanting to sound like I'm giving advice, I would probably try to partition the data or replicate the storage. Considering that in your case the DB updates are rare, the simplest approach would probably be storage replication.

    bests,

    ./alex

    --

    .w( the_mindstorm )p.

Warning Signs Your Web Application Project May Fail
Mar 26, 2007 19:59

A few more points to consider when building or managing a web application project.

1. It's been __ months and there's nothing to see

Seeing is believing, and in a web application it shouldn't take long to get something running. If a significant project shows no visible progress in the first third of the development schedule (if you are using SDLC/waterfall) or in the first few iterations (if you are agile), then you should start to worry. If the team has been working for 6 or 9 months and there is no URL you can go to to view the current application running, then odds are the application isn't ever going to be complete.

You would think that this is obvious, yet I have seen it happen enough to think it's not that uncommon. Sometimes it's the development team that is incapable of making progress due to inexperience or lack of ability; even more commonly it's not their fault as the basic requirements keep changing (and not in an agile way) or management can't make needed decisions.

This isn't only a problem with web projects of course, but in a web project the browser already contributes a lot of functionality; all you need is some static HTML to demonstrate what is coming. I have heard people make excuses about building frameworks or firm foundations or working out detailed designs, but the only way for someone other than the developer to judge progress is by seeing something run.

The side benefit to seeing visible progress is that the customer can get an idea of what is being built and contribute to the shaping of the solution they are paying for. Even if you are not doing an agile project, having the customer be able to make comments on a live piece of software is a good thing (even if it will torpedo your schedule); after all, meeting the needs of the customer is the whole point of the project.

For developers I can only add that it is in your best interest as well to work on both the front and back ends of a project. Management and customers really appreciate seeing something work far more than dry project timelines and status reports.

2. You keep your source code where?

You would think in this modern era that everyone uses source code management systems (SCM), yet I have seen fairly large companies where source is "managed" on shared directories, or even kept on developers' hard drives. Even if they have an SCM there is often no attempt at organizing the repository, which is treated as nothing but a file dump. One place I worked at briefly on a contract had a manager who withstood 3 months of arguments before allowing the developers to set up an SCM system, arguing that using one would add too much time to project schedules.

Your SCM system should hold all source code, scripts, schemas, etc; basically everything necessary to build the entire system from scratch. It should also be backed up continuously and carefully (another thing I have seen people fail to do), and tested regularly. The repository should be designed (and the software chosen carefully) to make it easy to manage multiple versions of all projects for all contributors.

Many times only certain people will use a repository, such as the programmers, but other folks like DBAs, web designers, and QA will use other systems for their resources. It may not be as convenient for them but keeping all of the parts necessary to build and maintain a web project in an organized place is a huge benefit. Sometimes this can be a big pain: when I used Documentum in my projects, I couldn't easily save the state of the DocBase in any meaningful way (and my employer was too cheap to pay for any real possibilities). Even keeping snapshots of external resources at regular intervals is better than nothing.

The question to ask is, if our place of business burned to a crisp, could we rebuild and continue work on projects if all we had was a backup tape of the source repository?

My personal favorite is Subversion but there are many others.

3. Management says 'let's worry about systems architecture when the app is finished'

Another wonderfully bad idea. Let's design a new car and worry about the engine when we're done. Planning on how your application will be served up should begin at or even before the coders get started. What usually happens when you just throw the application at your existing environment is lost connections, bad customer experience, crashes, and hours of painful overtime while you try to scale or fix things live. This tells your customers your application is broken (even if the code is perfect) since most people don't know or don't care how the internet works.

Today I tried to log into my Linkshare account and the login page didn't come up (different subdomain than the rest of the site); instead I get a warning from the browser that the server didn't exist. They fixed it after a while (not sure how long it was out) but still, an ordinary person might think they had done something wrong or the site was gone. Imagine you are a financial services company holding my money for retirement and your website goes up and down all day, how would I feel about my money and you?

I actually worked at one of these once and would not put my money there for anything. One of their major web applications did go down at random points every night (due to a manual backup that killed database connections with no warning) with no meaningful message to the users.

4. Management says 'let's worry about testing when the app is finished'

Ditto to the last section. QA doesn't start after the developers go on vacation. It starts the same day as the development does (or even earlier). Not only should testing be planned early but the testing environment (see the last section) should be developed early as well. The aforementioned financial services company failed to spend enough money on the QA environment to mirror the architecture of the production system, so when they rolled out a new customer portal, people started seeing other folks' financial data on their pages. Oops. The app worked fine in the limited QA environment, which had a single server. The production environment had two, and the problem involved shared IDs that were not unique across the cluster.

Oops! is not a development methodology.
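
The exact nature of those IDs isn't described above, so the following is purely a hypothetical illustration of the class of bug: two servers each counting from 1 will hand out the same ID, which a single-server QA environment can never reveal. Prefixing a per-node identifier (the `NodeIdGenerator` name and the node labels are assumptions) is one simple way to keep IDs unique across a cluster.

```python
import itertools

# Naive approach: every server keeps its own counter starting at 1.
naive_a, naive_b = itertools.count(1), itertools.count(1)
assert next(naive_a) == next(naive_b)  # both servers issue 1: a collision


class NodeIdGenerator:
    """Issues ids that embed the node name, so nodes cannot collide."""

    def __init__(self, node_id):
        self._node_id = node_id
        self._counter = itertools.count(1)

    def next_id(self):
        return "%s-%d" % (self._node_id, next(self._counter))
```

A QA environment with at least two nodes would have exercised exactly the case the naive counters get wrong.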

13 Ways to Avoid Building Web Applications That Suck
Mar 21, 2007 09:26

I get so tired of going to websites that provide interactivity and discovering they suck megatons. This is especially irritating in Fortune 500 companies (like Bell/AT&T/whatever they are called this week with yellowpages.com). None of this is rocket science, yet I am amazed at how companies with deep pockets manage to produce drecky sites. This is your public face, would you go on a date with mud on your face and spinach in your teeth?

At the risk of giving away all of my (not so secret) secrets, here are some useful things to do.

1. Put javascript and css in their place

Javascript and css should be kept in separate files and not embedded in your web pages. I don't care that your IDE can parse and syntax color 8 simultaneous languages in one file, it's hard to read, hard to debug, and brittle when there are multiple people editing the page. I'm ambivalent about adding event handlers upon loading instead of in the html itself but it's not a bad idea. If you want to avoid linking to files then include them in the html using your language or framework include feature.

2. Use compression

All web servers and browsers are capable of dealing with compressed content. Turn it on. It doesn't cost anything, saves bandwidth, makes dialup users happy and dogs wag their tails. You can save more by compressing your javascript files than by removing all the spaces manually. In any case it's easier to code something you can read.
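
The claim about compression versus stripping spaces is easy to check for yourself. The sketch below compares three sizes for a made-up, repetitive JavaScript sample: the raw source, the source with every whitespace character removed (which would not even leave valid JavaScript, so it overstates what manual stripping can save), and the gzip-compressed source. The sample text is an assumption; real files will give different ratios.

```python
import gzip

# Hypothetical JavaScript: twenty near-identical event handlers.
SOURCE = "\n".join(
    "function handler%d(event) {\n"
    "    var target = event.target;\n"
    "    target.className = 'selected';\n"
    "    return false;\n"
    "}" % i
    for i in range(20)
)


def sizes(source):
    """Byte counts for raw, whitespace-free, and gzip-compressed source."""
    raw = source.encode("utf-8")
    stripped = "".join(source.split()).encode("utf-8")
    return {
        "raw": len(raw),
        "no_whitespace": len(stripped),
        "gzip": len(gzip.compress(raw)),
    }
```

For this sample, `sizes(SOURCE)` reports a gzip size well below the whitespace-free size, and the readable source stays readable.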

3. If you put a DocType in your page, validate against it

Of all the things you find in the wide world of web, crappy HTML is by far the most common. Why put a DocType in every web page and then serve up invalid content? I always work in strict XHTML even though it isn't served up that way, simply so that I can be assured it's well written. It's a discipline thing. Use the web developers toolkit or Firebug and keep checking all the time. I don't care if IE 6 shows your site correctly. I bet IE7 doesn't. Who would buy a car that only ran on blacktop and not on concrete roads?

4. Don't air your dirty laundry in public

If your web application is properly tested (you do have QA right?) the end user should never see an error message. God forbid your code throws an exception and vomits all over the end user with debugging information. One site I went to threw a monkey wrench and then showed me a page full of every server and application level variable it could find, and then for good measure included them in a comment in the web page. I have also seen so many messages like "could not connect to database" and "timeout occurred reading database" I felt like visiting the competition. And usually did.

If you have to demonstrate poor QA and show an error page at least put it in context of the site and don't scare your customers with it. Architect your logging system to track problems and provide information to the programmers. Don't be like one employer I had who hid the logs from the developers out of some stupid security paranoia.

Whatever you do, don't include your test code and development comments in your web pages or javascript. yellowpages.com happily loads their unittest.js file on every page. Not that I want to test for them.
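
A minimal sketch of the "log it, don't show it" idea from this section, assuming a hypothetical `handler(request)` function that renders a page: the full exception goes to the log for the programmers, while the user gets a generic page that reveals nothing about servers or databases.

```python
import logging

# Generic page shown to users; deliberately free of technical detail.
FRIENDLY_ERROR_PAGE = (
    "<h1>Sorry, something went wrong</h1>"
    "<p>We have been notified. Please try again in a few minutes.</p>"
)


def render_safely(handler, request, log):
    """Run a page handler; on failure, log the details and hide them from the user."""
    try:
        return 200, handler(request)
    except Exception:
        # log.exception records the message plus the full traceback.
        log.exception("unhandled error while serving %r", request)
        return 500, FRIENDLY_ERROR_PAGE
```

The important half is that the log actually reaches the developers; hiding it from them defeats the purpose.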

5. Use modern web design

Table based sites are so 19th century. Yet look around the web and see the fruits of old web design. CSS is your friend. Dreamweaver is your enemy. If your web designers are unable to build sites by hand using modern techniques and style with CSS then don't give them Dreamweaver until they can. There are millions of sites devoted to modern web design so there is no excuse to not learn how. If you can't write HTML by hand then reading the articles won't make an impression. Web design is programming + art. Accessibility, usability and maintainability are all made easy doing things the modern way and treating it as an engineering task.

6. Support all modern browsers

There aren't that many anyway. Other than IE6 and IE7 all the others are pretty similar. Saying you only support IE is stupid these days as there are two fairly different versions, with IE7 being somewhat closer to the others (yet screwed up in its own way). Code to the standards, adjust for IE6 and IE7.

Would you shop in a store that said "Whites Only!"? No, and I don't want to shop in an online store that says "IE only". Even worse are the places where they don't tell you you're not supported until the last minute. It's like going to a store and after filling up your cart, having the cashier tell you "we don't serve your kind here". Not real smart, telling customers to get lost. They will.

7. If your application has special needs, check up front

If you use javascript, flash, cookies, or something special (like a plugin), make sure you check on whatever the first page the customer sees, and if you absolutely can't do anything without it at least let the customer know up front. If you can work without it but have to degrade service, either make it transparent or let them know what doesn't work. I don't want to have to explain your site to my mom.

8. Test your application usability with real users

It's amazing how many sites out there are so hard to operate. Sometimes I wonder if anyone other than the engineer actually used the site for its intended purpose. If you offer search, can anyone find useful stuff with it? Is your navigation clear? I went looking for some security hardware for my door and wound up at a Honeywell site, wherein I couldn't find anything useful and gave up after several fruitless moments. Then I went to their competition. Engineers, QA people, managers and the like are not users. If I can't find something I want on your site, then how will my mom?

9. Use a real database

Access is not a real database. Neither is Foxpro. If you can't afford (or don't want to buy) something like Oracle then use MySQL, or even better Postgres (or my personal favorite H2). Access is a toy. Would you drive a lawn mower on the highway? I've always been amazed at companies that have mission-critical information stored on someone's desktop in an Access database. And don't get me started on people who manage data using shared Excel spreadsheets.

10. Don't use platform specific functionality

ActiveX controls are the most evil things you can use. They lock you into a platform, and yet don't work easily in IE7 anyway. Just say no. If you have to, Flash is always a better choice since it is available to most people.

11. Realize your customers don't all have T3 lines

An average directory page on amazon.com is around 350K and has 43 separate HTTP requests. Yikes. At least 20% (or more) of people in just the US still have dialup. One friend works by day on internet backbone switches and goes home where he has nothing but dialup available.
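To put that page weight in perspective, here is a rough back-of-the-envelope calculation in Python. The link speeds (a nominal 56 kbit/s modem and a 6 Mbit/s cable line) are assumed for illustration, 350K is taken as 350 × 1024 bytes, and latency, protocol overhead and the cost of the 43 separate requests are all ignored, so real dialup numbers would be worse.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Ideal transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


PAGE_BYTES = 350 * 1024   # the ~350K page mentioned above
DIALUP_BPS = 56_000       # assumed nominal 56k modem
CABLE_BPS = 6_000_000     # assumed 6 Mbit/s cable line

dialup_seconds = transfer_seconds(PAGE_BYTES, DIALUP_BPS)
cable_seconds = transfer_seconds(PAGE_BYTES, CABLE_BPS)

print(f"dialup: {dialup_seconds:.1f} s")   # dialup: 51.2 s
print(f"cable:  {cable_seconds:.2f} s")    # cable:  0.48 s
```

Even under these generous assumptions the dialup customer waits close to a minute for one page, while the cable customer waits about half a second.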

Compress, combine and cache where possible. I found that my biggest bandwidth hog was prototype.js. Once I compressed it and delivered it that way it almost vanished from the list. Just because I have a 6MB/sec cable line doesn't mean you aren't paying for it on your end. Most browsers only open 2 connections at a time, so why make it hard on your customers and yourself?
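As a minimal sketch of the "combine and compress" part, the Python below joins several script bodies into one payload and gzips it. The function name `combine_and_compress`, the stand-in script contents and the cache lifetime are all assumptions made up for illustration; a real setup would read the actual files from disk and would usually let the web server do this.

```python
import gzip


def combine_and_compress(sources: list[bytes]) -> bytes:
    """Join several script bodies into one payload and gzip it."""
    # The ';' between files guards against a missing trailing semicolon.
    combined = b"\n;\n".join(sources)
    return gzip.compress(combined)


# Stand-in content; real library files would be read from disk.
scripts = [
    b"function showComments() { document.getElementById('c').style.display = 'block'; }\n" * 40,
    b"function hideComments() { document.getElementById('c').style.display = 'none'; }\n" * 40,
]

bundle = combine_and_compress(scripts)
raw_size = sum(len(s) for s in scripts)

# Headers a server would send with the bundle; the one-day lifetime is an assumed value.
headers = {"Content-Encoding": "gzip", "Cache-Control": "public, max-age=86400"}

print(f"{len(scripts)} requests, {raw_size} bytes -> 1 request, {len(bundle)} bytes")
```

The `Cache-Control` header covers the third "c": once the browser has the bundle, it does not need to ask for it again until the lifetime expires.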

12. Deal with accessibility and internationalization early

If your webstore or product site only caters to rock-climbing Englishmen, maybe you can get away with ignoring the needs of the disabled and those who speak other languages. Think about this before you build.

13. Test early, test often

You do test your web application? Testing should begin long before it's complete. Test the code, test the database, test the usability, test the environment and infrastructure. Test as if people would die if you didn't. Test as if your customers would all leave you if you fail (they will!). Test on all major browsers. On my Macbook Pro I can run IE6, IE7 (I need a bit more disk space), Firefox, Safari and Opera. I am sure a Fortune 500 company can afford more.

If you do all of these things then life will be good, people will love you, you will get rich and everyone will shower you with goodies.

Maybe yes, maybe no; but at least you can know your web application doesn't suck.

  • Kevin Hoang Le: Mar 21, 2007 11:43

    14. Absolutely NO pop up windows. Pop up windows are evil. Once a window is popped up and somehow gets hidden behind other windows, the link that pops it up will appear not working. If you install Firefox add-ons, they show up again in the pop up windows. Very evil.

    15. Along with the comment about no tables, NO frame-based web application or site should ever be even considered.

    16. If your web application requires sign-in/log-in, ensure that it has a logout command and ensure the logout command does what it's supposed to do regardless of whether the back button is clicked. It's possible. Read my articles:

    http://www.javaworld.com/javaworld/jw-10-2006/jw-1006-logout.html

    or better yet check out my demo:

    http://pragmaticobjects.org/properLogoutDemo/

    No excuse not to do it right.

  • codist: Mar 21, 2007 12:04

    This site is a perfect example of suckiness: http://www.hrodc.com/. What a monster!

  • Impatient: Mar 21, 2007 13:53

    Here's an irony: that webpagesthatsuck.com site really kind of, well, sucks. Ads all over the place, difficult navigation, etc.

  • mike: Mar 21, 2007 14:21

    so, then, what do you suggest happen if the db is maxed out? to show nothing at all?

  • jd42: Mar 21, 2007 14:38

    I would add to #6 "please re-test the apps you wrote for IE to make sure they work with Firefox..." Amazing how many sites don't work with FF, esp. tech sites where users are prone to use Firefox.

    Good article. I've bookmarked so I can share with my team.

  • jtheory: Mar 21, 2007 17:36

    If the DB is maxed out, the point is that showing the details of the error doesn't help the user, any more than stack traces do (or exhaustive debug info...). Your logging should record all of that information for your developers, and probably shoot off an email automatically -- don't expect your customer to copy and paste the error details into an email (because chances are, they'll just go elsewhere).

    Instead, tell them what *they* want to know. Here's the sophisticated approach:

    "We apologize; our website is currently experiencing a temporary surge in traffic. The order you submitted could not be processed, and the problem has been automatically reported to the webmaster. Please try back later, or place your order over the phone, at ________. Thank you for your patience!"

    Or if you have a simpler error handling system (where you don't have custom messages set up for different errors):

    "There was an error processing your request. The details have been reported automatically to the webmaster; most problems are corrected within 24 hours. Please try again later -- we apologize for the inconvenience! If you have an urgent problem, please contact customer support at _____."

    You get the idea. The main point is to look at it from your user's standpoint. It's like when your camera's memory card is full -- do you want a screenful of data and a stack trace, or a simple message saying "your memory card is full"?
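The approach jtheory describes (record everything for the developers, show the customer a short plain message) might be sketched like this in Python. The names `place_order`, `handle_order` and `DatabaseBusyError` are hypothetical, invented for illustration, and the failure is simulated; this is only a sketch under those assumptions, not a complete error-handling system.

```python
import logging
import uuid

logger = logging.getLogger("webstore")

GENERIC_MESSAGE = (
    "There was an error processing your request. The details have been "
    "reported automatically to the webmaster. Please try again later."
)


class DatabaseBusyError(Exception):
    """Hypothetical error raised when the database cannot take more work."""


def place_order(order: dict) -> str:
    """Stand-in for real order processing; always fails in this sketch."""
    raise DatabaseBusyError("connection pool exhausted (max=50)")


def handle_order(order: dict) -> str:
    """Return text that is safe to show the customer."""
    try:
        return place_order(order)
    except Exception:
        reference = uuid.uuid4().hex[:8]
        # The full stack trace goes to the log for the developers...
        logger.exception("order failed, reference=%s", reference)
        # ...while the customer only sees a plain message and a reference.
        return f"{GENERIC_MESSAGE} (Reference: {reference})"


message = handle_order({"sku": "roses-12", "qty": 1})
print(message)
```

The reference number lets support match a customer's phone call to the logged stack trace. For the automatic email, the standard library's `logging.handlers.SMTPHandler` could be attached to the same logger.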

  • Tim Weaver: Mar 21, 2007 18:20

    DocType should not be optional. Always use DocType and validate against it.

    Use a Real Database isn't really great advice. Maybe better advice would be don't use a database if you don't need it. Understand the role of a database, choose wisely [no I'm not advocating Access or Excel], but blanket statements are dangerous. Many times people who don't understand the problem domain are making these decisions and they are just plain wrong.

    I would add one:

    Hire a real expert in web site development if only to test your ideas out and get an understanding of the platform, design decisions.

  • mike: Mar 21, 2007 19:11

    thanks for the clarification jtheory.

  • codist: Mar 21, 2007 19:23

    If your DB is getting maxed out it might be an indication of other problems as well. Sometimes all you need is caching; sometimes you might have to look at spreading the load over more servers. Problems aren't always a single point in your architecture.

    Of course if you don't need a database don't force yourself to use one, my comment was meant for those apps that need one.

  • Ian Lloyd: Mar 22, 2007 03:51

    Here is something of an irony - I couldn't add a comment on this page at work because you have used a JavaScript library technique to enable the comments facility and the library that you use is stripped by the firewall (because of eval statements contained)

    This is another example, I'm afraid, of a sucky technique. A comments facility that simply reveals the fields and requires a library to do this is the proverbial sledgehammer to crack a nut. So please add that to the list!

    My original reason for commenting was to say that pop-ups can be built accessibly and with standards in mind. I just updated the Perfect Pop-up article that I wrote in 2002 (eek!) to use unobtrusive JavaScript and have also created a pop-up builder to accompany it:

    http://accessify.com/features/tutorials/the-perfect-popup/

    http://accessify.com/tools-and-wizards/accessibility-tools/pop-up-window-generator/default.php

  • Mike Owens: Mar 22, 2007 06:48

    I'm not a fan of JavaScript not gracefully degrading, but once you have a proxy or firewall in place that is filtering JavaScript files based on arbitrary language constructs, the problem is pretty much on your end. It's not like eval() can generate any code that the author couldn't have placed in plaintext.

    I noticed the.codist{} doesn't even try, but how is any site supposed to gracefully degrade in that situation? Your browser would ignore noscript tags, and JS-based detection would pass, assuming it used a random subset of the language no one can test against.

  • Deepak Tiwari: Mar 22, 2007 06:49

    Excellent material!

    I have something to add for point no 2. (Use compression)

    There is a side effect of turning on compression. Compressing content before sending it over the network takes a lot of CPU cycles, so there is a trade-off. One should perform a few tests to find a threshold value (the size of your html/javascript/css file). Suppose the value comes out at 20 KB. Then it does not make sense to compress files smaller than 20 KB, because the time taken to compress them would be equal to or greater than what you saved in network time.
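The threshold idea in Deepak Tiwari's comment could be sketched as follows in Python. The 20 KB figure is simply the example value from the comment, and `maybe_compress` is a hypothetical helper name; the right cutoff would have to come from measuring your own server and network.

```python
import gzip

COMPRESS_THRESHOLD_BYTES = 20 * 1024   # example value from the comment; measure your own


def maybe_compress(
    body: bytes, threshold: int = COMPRESS_THRESHOLD_BYTES
) -> tuple[bytes, dict[str, str]]:
    """Gzip the body only when it is at least `threshold` bytes long."""
    if len(body) < threshold:
        # Small responses go out as-is; compressing them costs more than it saves.
        return body, {}
    return gzip.compress(body), {"Content-Encoding": "gzip"}


small_body, small_headers = maybe_compress(b"<p>hello</p>")
large_body, large_headers = maybe_compress(b"<li>product</li>\n" * 5000)

print(len(small_body), small_headers)
print(len(large_body), large_headers)
```

A real server would also check the request's `Accept-Encoding` header before compressing anything, which this sketch leaves out.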

  • Fabrizio: Mar 22, 2007 07:02

    0. DON'T USE Microsoft tools or Microsoft OS!

  • codist: Mar 22, 2007 07:07

    Prototype.js is used by lots of sites. Hard to imagine why anyone turns off 'some' javascript imports; there isn't much I can do to deal with this. But I have worked for paranoid network operations people before so I can see how it happens...

  • codex: Mar 22, 2007 16:27

    I disagree about keeping the javascript and css off the html page -- simply because keeping the number of requests down can greatly improve performance. I put common css and javascript for many pages (that'll just be called once) in an external file -- put anything for one page directly in that page.

  • turambar: Mar 25, 2007 13:16

    classic example of a programmer with one side of thinking. tableless layout doesn't make one site good and css/xhtml is not a miracle pill. content, organisation and ease of usability make one site great or not.

  • krallendoerfer: Mar 25, 2007 20:03

    Good advice but I had to pause at #9. I think what you really want to say with "Use a Real Database" is something more like match the DB to the purpose of the application, the characteristics of the data, and the amount of traffic you expect. I agree that a corporation shouldn't keep its accounts receivable in Access but what about, for example, my facility's loading dock equipment inventory? What possibly could be the harm in them using Access or Foxpro for the DB? This isn't a Fortune 500 app, to be sure, but those data are pretty important to them (mission-critical, in fact) and to the few dozen users each week. The choice for this type of app, given cost and other constraints, is often not between real and toy but instead between toy and jack. If someone (IT management) told them "you gotta use MySQL instead of Access," they'd just go back to emailing Excel sheets like they used to.

  • codist: Mar 25, 2007 22:09

    I guess for me the problem with using lightweight DBs is that often the web applications wind up growing bigger, and ultimately could become a nightmare to support. I've seen this many times so it makes me a bit wary. One app is not a problem, but if you wind up with 10 of them in various departments you have no idea where your data is sitting or if it will be backed up. You can also wind up with security issues when the mission critical data isn't protected. Like everything, there is a balance point.

  • Matt Schwartz: Mar 27, 2007 08:52

    Along with using "a real database" the application should be built for expandability from the start. Many web sites start small and then hit a wall when it's time to expand to multiple database servers. If replication (or something similar) had been planned from the start, expansion would be much easier. For some web apps it can mean a total rewrite.

  • codist: Mar 27, 2007 13:37

    It looks like reddit finally saw the compression light.

  • Harvey Sugar: Mar 28, 2007 19:51

    Great list, I'm book-marking this page.

    Especially the comments about tables and valid (X)HTML. I've just finished my third part time contract fixing a Web site that looked fine on IE6 but looked horrible on Firefox because of ancient formatting techniques and invalid HTML.

  • snlr: May 02, 2007 06:25

    Great article! Seriously guys, "not using a real database" is a mistake, "misusing tables for layout" is a mistake ... and mistakes should be pointed out and then avoided. Now IF database A is a real database or IF site B that uses tables is a great site, that is good stuff for another article. Now speaking of other articles ... ping:

    http://www.bitweaver.org/wiki/Use bitweaver to build a web site that does not suck


Copyright © 2007 By Andrew Wulf