Living In a Large Heterogeneous Travel Network
December 28, 2013
There are a few industries where the number of entities in the network your application lives in can be huge and highly diverse. Any large retailer like Walmart or Amazon can ultimately connect to hundreds of thousands of suppliers.
One industry not only has a huge number of potential connections but actually has to manage them live: online travel aggregators (OTAs).
I work for one at the furthest client end, in mobile, so I'm at the receiving end of this giant network cluster. Imagine your app connecting to a mobile API. This mobile API calls a number of internal services (sometimes not real services, and sometimes even screen scraping a website because no one was interested enough to make a service for us). Each of those services may call other internal services. Some may call into the GDS (global distribution system) we use, which manages some hotels and is where most air and car rental bookings go; some may call our internal hotel services.
Now each hotel service has to manage connections to thousands of hotel chain reservation systems. Smaller hotels might send and receive updates through a website (in the past this might have involved faxes), and we manage a bit more of their data. The GDS itself receives updates on airfares, generally hourly, from an outfit called ATPCO (which collects fare information from most of the world's airlines), and updates from each car rental company. It also manages thousands of other hotel companies. Each hotel chain may give the local hotel manager the ability to edit room information and rates.
The difference from Walmart or Amazon is that all of this eventually has to be connected in real time at the point of booking a hotel room or reserving a seat on a plane. Generally travel systems use caching to make searching feasible; you can't contact thousands of external servers while trying to provide fast lists of rooms or flights. At some point, however, we need to know whether a room or seat is actually available and at what current price. Usually this happens twice: when a booking page is displayed to the user, and again when they hit the book button to actually exchange money for a bed or seat.
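The cached-search-then-live-recheck pattern can be sketched roughly as follows. This is only an illustration in Python, not code from any real booking system: the `Offer` and `Recheck` types, the `recheck_offer` function and the `fetch_live` callback are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional


@dataclass(frozen=True)
class Offer:
    """A priced room or seat (hypothetical shape, for illustration only)."""
    item_id: str
    price: Decimal
    currency: str


@dataclass(frozen=True)
class Recheck:
    offer: Optional[Offer]   # the live offer, or None if nothing is available
    available: bool
    price_changed: bool


def recheck_offer(cached: Offer,
                  fetch_live: Callable[[str], Optional[Offer]]) -> Recheck:
    """Compare a cached search result against a live availability lookup.

    fetch_live stands in for whatever slow call reaches the real supplier;
    it returns None when the item is no longer available.
    """
    live = fetch_live(cached.item_id)
    if live is None:
        return Recheck(offer=None, available=False, price_changed=False)
    changed = (live.price, live.currency) != (cached.price, cached.currency)
    return Recheck(offer=live, available=True, price_changed=changed)
```

A client would run this once when showing the booking page and once more when the user hits the book button, and tell the user if `price_changed` is true before taking their money.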
In addition, much of the information we present to the user at some point came from an individual sitting in an office somewhere far removed from the client. This information, like a room description, was typed by this person into a local computer; that text may then have passed through a dozen systems before reaching the ultimate client. Sometimes it gets garbled, sometimes it loses vital character accents. Sometimes the data values are rounded incorrectly or might be lost completely. This makes the client end interesting and generates complex strings of curses.
Of course data also has to flow up the chain from the client, passing through however many systems to eventually reach the home of the bed or flight reservation. Each system may or may not handle the same character set and sometimes may contribute other subtle mutations to the data. On the client end this can result in oddball error messages or, even worse, wrong or unusable reservations. Naturally the client end gets the abuse from the customer if something goes wrong.
One of the real challenges is anticipating everything that can go wrong, which is basically impossible to do 100%. We only control the clients (and mobile website) and the mobile API; everything beyond that is outside of our control or even our understanding. The further up the chain you go, and certainly once you reach the suppliers, the more we can only guess. When things do go wrong, the hope is that the client can at least show an error message, even if it is not very helpful, and avoid crashing. The data manglers are very clever, though, and sometimes completely crazy problems happen that we can't do anything about.
For example, Spirit Airlines is one of the few that manage their own reservations and ticketing, so the connection to Spirit is uncommonly slow; it can sometimes take 3 minutes to complete a booking. Every system between the client and Spirit has to be able to wait for that response without timing out. At one point, if you were on AT&T cellular, their proxy would time out after 45 seconds, so we had to come up with a special polling solution (I think this is not happening any more). Internally there was an Apache proxy somewhere in our data center that also couldn't handle the long connection, which seems to have been found and fixed recently. A single airline required a ton of work just to make it possible to include it in a client.
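The general idea behind polling is to replace one three-minute connection with many short requests, each finishing well under the 45 seconds a proxy will tolerate. What follows is only a sketch of that general shape in Python, not the code that shipped; `poll_until_done`, its parameters and the `check_status` callback are hypothetical names, and the intervals are made-up defaults.

```python
import time
from typing import Callable, Optional


def poll_until_done(check_status: Callable[[], Optional[dict]],
                    interval_s: float = 5.0,
                    deadline_s: float = 240.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Optional[dict]:
    """Ask for the booking result repeatedly using short requests.

    check_status is assumed to make one quick call and return the final
    result as a dict, or None while the booking is still in progress.
    Returns None if no result arrives before the overall deadline.
    """
    give_up_at = clock() + deadline_s
    while True:
        result = check_status()
        if result is not None:
            return result
        # Stop if another wait would carry us past the deadline.
        if clock() + interval_s > give_up_at:
            return None
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without actually waiting. Note that a `None` return here means "we don't know", not "the booking failed", which is a separate problem.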
Every once in a while one hotel would have its latitude and longitude set to an impossible value, which made the iOS maps throw an exception, so I had to validate the data in the client. Of course if you are displaying a hotel on a map and the location is bogus, what do you do? I picked a random legal location, which of course is wrong. Another fun variation we found was that outside the US the location of hotels is often not really set, so someone just picked the center of town, leading to a map where all the hotel icons are piled up on top of each other.
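A minimal sketch of this kind of validation, written in Python for brevity rather than in the iOS client's own language. The function names and the `(name, lat, lon)` tuple shape are assumptions made for the example. Unlike the random-location fallback described above, this version returns `None` for a bad position and leaves the decision to the caller.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Position = Tuple[float, float]


def valid_coordinate(lat, lon) -> Optional[Position]:
    """Return (lat, lon) as floats if they form a legal position, else None."""
    try:
        lat_f, lon_f = float(lat), float(lon)
    except (TypeError, ValueError):
        return None  # missing or non-numeric values
    if not (math.isfinite(lat_f) and math.isfinite(lon_f)):
        return None  # NaN or infinity
    if not (-90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0):
        return None  # outside the legal latitude/longitude ranges
    return (lat_f, lon_f)


def stacked_hotels(hotels: Iterable[Tuple[str, object, object]]) -> Dict[Position, List[str]]:
    """Find legal positions shared by two or more hotels.

    Several hotels on exactly the same point often means someone entered
    the center of town instead of a real location.
    """
    groups: Dict[Position, List[str]] = defaultdict(list)
    for name, lat, lon in hotels:
        pos = valid_coordinate(lat, lon)
        if pos is not None:
            groups[pos].append(name)
    return {pos: names for pos, names in groups.items() if len(names) > 1}
```

Neither check can tell you where the hotel really is; they only keep the map from crashing and flag data that is probably a placeholder.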
Characters are also interesting. We use UTF-8 JSON for our mobile API, but we call mostly XML-based services, and the GDS we use works in EBCDIC (I think it might be partially UTF-EBCDIC), and who knows what all those thousands of hotel systems use. So characters can and do get mangled in odd ways. Sometimes, too, the entry system might have converted the text into HTML, so we often get odd fragments of HTML that were not stripped out completely. A perfect example of these challenges is my post Bug Story - The Lost Cars of Branson.
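For the leftover HTML fragments, one defensive approach is to strip tags and decode entities on the way into the client. The sketch below uses Python's standard `html.parser` module; the `strip_html_fragments` function is a hypothetical helper written for this example, and it does nothing about characters that were already mangled by an encoding mismatch upstream.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text content and replaces every tag with a space."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_fragments(text: str) -> str:
    """Remove stray tags, decode entities, and collapse runs of whitespace.

    Tags become spaces so that 'view<br/>King' does not turn into
    'viewKing'; the cost is that a tag in the middle of a word splits it.
    """
    parser = _TextOnly()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()
```

For example, `strip_html_fragments("Ocean view<br/>King bed &amp; sofa</p>")` yields `Ocean view King bed & sofa`, and accented characters such as the é in "Café" pass through untouched.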
Time is often the biggest problem. There are so many systems between the client and the ultimate booking system that timeouts are a fact of life. Since we only control the last sliver of the network, we are at the mercy of a large number of places where systems can time out, die or otherwise lose a connection, which results in us telling the user we lost it. This is especially bad if the connection is lost on the client side of the actual booking system. That means the user pushed Book and was told an error occurred, yet the far system still completed the booking anyway. Now the only way the user knows it worked is when they get a confirmation email. Often they try again, so we have to try to keep them from booking twice even though it appears it didn't work once! You'd think there was an easy way to fix this but there isn't. When you only control the front porch you don't get to tell the people in the kitchen what to do.
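One partial mitigation that needs no cooperation from the kitchen is for the client side to remember booking attempts whose outcome is unknown and warn the user before an identical retry. The Python sketch below is only an assumption about how such a guard could look, not a description of what we actually do; `booking_fingerprint`, `UnknownOutcomes` and the 30-minute window are all invented for the example.

```python
import hashlib
import time
from typing import Callable, Dict


def booking_fingerprint(traveler: str, item_id: str, start: str, end: str) -> str:
    """Hash the details that make two booking attempts 'the same'."""
    raw = "|".join(part.strip().lower() for part in (traveler, item_id, start, end))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class UnknownOutcomes:
    """Remembers recent attempts that ended in a timeout or lost connection."""

    def __init__(self, window_s: float = 1800.0,
                 clock: Callable[[], float] = time.time):
        self._window_s = window_s
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def record(self, fingerprint: str) -> None:
        """Call when a booking request fails without a definite answer."""
        self._seen[fingerprint] = self._clock()

    def might_be_duplicate(self, fingerprint: str) -> bool:
        """True if the same booking was attempted recently with unknown outcome."""
        recorded_at = self._seen.get(fingerprint)
        if recorded_at is None:
            return False
        if self._clock() - recorded_at > self._window_s:
            del self._seen[fingerprint]  # too old to matter; forget it
            return False
        return True
```

This can only warn ("you may already have this reservation, check your email first"); it cannot know whether the far system really completed the first booking.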
Temporary issues can also drive us nuts. At one point one of the servers that handled the location search (where you enter a partial city name and get a list back) would occasionally fail. The server would start up and try to load its data, but fail in a state where it still ran and was included in the load balancer. However, all calls would return an empty list, leading customers to complain that they were sure New York City was a real place. Of course we would try this and it would work fine. Eventually the software was fixed so that it either succeeded completely or failed completely.
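The "succeed completely or fail completely" rule can be illustrated with a small Python sketch. It is not the actual fix, whose details I don't have; `LocationIndex`, `is_healthy` and the canary query are hypothetical. The point is that a server with no data refuses to start, and a health check asks a question whose answer should never be empty.

```python
from typing import Iterable, List


class LocationIndex:
    """Prefix search over place names that refuses to exist without data."""

    def __init__(self, names: Iterable[str]):
        cleaned = sorted({n.strip() for n in names if n and n.strip()},
                         key=str.lower)
        if not cleaned:
            # Fail loudly at startup instead of serving empty results.
            raise RuntimeError("no location data loaded; refusing to serve")
        self._names = cleaned

    def search(self, prefix: str, limit: int = 10) -> List[str]:
        wanted = prefix.strip().lower()
        if not wanted:
            return []
        matches = [n for n in self._names if n.lower().startswith(wanted)]
        return matches[:limit]


def is_healthy(index: LocationIndex, canary: str = "New York") -> bool:
    """Health check for a load balancer: a known place must be findable."""
    return bool(index.search(canary))
```

A process that raises during startup never joins the load balancer, and one that loads partial data fails the canary check, so either way customers stop being told that New York City does not exist.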
That this works at all is a constant source of wonder. All OTAs have the same issues of course. Building a system of systems that integrates tens of thousands of online suppliers is a complex business. People who enter the travel booking business today generally either focus on a single slice (Hotel Tonight) or go meta (Kayak, TripAdvisor, etc.). Meta is actually easier despite the fact that you have to do massive screen-scraping (at least to start with, until you make friends).
When you work on the client end of travel you get all sorts of interesting challenges, most of which are not obvious to the people using the apps. Hopefully anyway.
I do wish I could build a huge network diagram with all the systems both internal and external and post it here to document how crazy it is. But it's not possible and even if I could make one, you'd be waiting a long time to download a giant ball of mud!