Index Mining, A New Search Approach and Interface For Certain Document Collections

Jul 26, 2007

I worked on this idea for a couple of months and hoped to find someone in the search industry to take a look at this, along with the prototype. Sadly I have been unable to convince anyone to even read this, so I've decided to take the advice of a reader and just put it out there.

Note the prototype that goes along with this is not currently running, as my shared server is too overloaded to permit to run it alongside this blog (if I do, swapping kills performance of everything). I may put the code out later (needs a bit of cleanup), and also get it running again when I have a decent server to put it on.

This idea may be brilliant or stupid; I have no expectations that it will solve the world's search needs, but I see places all the time where it could be useful. It makes more sense when you can use the prototype (images at the bottom).

Note that the language in this article is more academic than I usually use in this blog.

Summary

Index Mining is a new approach to Search which opens the knowledge inherent in a document index to the user. Instead of guessing search terms and examining a set of returned documents to modify the query, the user enters a single root term and is presented with lists of proximal terms along with a set of documents. Terms in the lists may be examined and selected to filter the set of documents. This experience may be described as "browsing" the document collection on a pivot (the root term).

Background

In a traditional search engine, the user enters some query terms and the engine returns a list of matched documents. If the query was unambiguous and the desired answer was popular (or there are many equivalent answers) then the first results will satisfy the request. Otherwise the user has to examine the results closely, possibly scanning many pages of result documents in an effort to determine a better set of query terms to restart or improve the process. This feedback loop may happen several times until either the desired answer is found or the user gives up.

The key problem with using a "peck and poke" search system is balancing (1) providing enough information to focus the search result while (2) not filtering out meaningful results. If you provide fewer terms which may be more ambiguous you will get a larger set of results but most of them may be "noise" unrelated to your desired information. If you add more specific terms to cut out the noise then you run the risk of cutting out potential answers. Since the only feedback from the search engine is a list of results, you have no way of judging the presence or absence of information. Data which is present may be buried, data which is not present at all is completely invisible.

Many efforts have been made in the last 15 years or so to improve the quality of the search results. Generally they fall into a number of areas (although there is much overlap):

  • Query languages
  • Relevance
  • Classification
  • Metaseach

Providing a natural query language provides the user with a more conversational way to specify the desired result but often results in a more ambiguous set of results (since conversational language is often not very precise). This can be fixed by adding more boolean and proximity modifiers (this and that near something) but at the cost of more complexity.

Relevance covers a lot of approaches which attempt to interpret the user's query and essentially sort the results to rank better answers closer to the top. Pagerank from Google is a perfect example where more popular results (as measured by link popularity) are given a larger weight in the sort. There are innumerable other approaches which analyze the result documents in an attempt to find the more promising ones.

Classification is an alternate approach to relevance where the result documents are grouped into buckets based on some of the same approaches. Instead of merely sorting the results the search engine presents the documents in classes which provide a quick filter for the user to find potential answers. This may assist the user in scanning the result set faster than looking at page summaries or reading whole documents.

Metaseach is a common optimization that ships the query (or possibly a modified one) to a number of search engines and then combines the results which may be display as a sort list or often with classification. The assumption behind this is that all search engines are imperfect and thus using many of them will improve the result.

There are also many points for optimization in gathering the raw data and building the index, as well as recording user behavior during searches and improving the index through usage. In the former better analysis of the original document's meaning and relationships may lead to better relevance or classification, and in the latter user behavior may provide clues to the quality of the answers.

The fundamental problem with most of these approaches is that the user has to guess on what query terms to use and then scan the results in an attempt to discern how effective they are. If the information desired is not readily found then the user has to puzzle out a change in strategy from clues in the documents that didn't work. The hope is that by looking at the incorrect documents a new set of terms, either in different direction (restart) or enhancement (filter), will lead to a better result.

Naturally it is easy to find popular items ("Apple Computer Inc") or unambiguous terms ("Model 1234675A7 Weed Wacker") or ask for information with many equivalent answers ("buy computer"). Here the search engine has no difficulty in determining acceptable results. As the information desired becomes less popular or more ambiguous or simply obscure ("personal views on wacking weeds") the user has difficulty in supplying query terms that will elicit the desired information. If the search space is large enough the information may be too buried to be found, or with additional attempts to filter out the noise the information may not be present in the results at all.

In addition to optimizing the search process itself, it is helpful to consider the nature of the documents being searched and the precision requirements of the user.

Searching the web covers a broad range of document types, from blog posts to storefronts to lists of URLs. Using single search method for such a diverse document set may make it difficult to narrow down the list of documents containing actual information, as opposed to pointers to other information or even random unrelated bits. If the user is interested in finding reviews on a particular product the results are filled with places to buy the product. Due to the popularity of online purchasing these "documents" are likely to be ranked ahead of individual comments on the product itself.

A search space filled with similar cohesive documents, such as the US Patent Database, is very different than the web itself. Search engine strategies used for optimizing finding anything may not be as effective as taking into account the nature of a more vertical document collection.

The requirements of the user are also important to consider. If the user is searching for a place to buy a product, then there are many mostly equivalent results (any). If they are searching, for example, the patent database then the need is to discover all possible results (all); the cost of missing results is much higher. Search engine optimizations that focus on popularity and summarization help "any" but may fail an "all" requirement.

The Theory

The question becomes how to deliver more feedback to the user to aid them in asking the right question in the first place (or more likely subsequently). Assuming the information is present in the search space and reachable via the search index the key is to provide the user with an "eye" on what the index knows. Rather than rely only on the result of applying the index to the search space (a set of documents to scan) an additional set of information could be provided on what related terms the user could try next.

Although this seems on the surface to be similar to the "Classification" methods mentioned above, the key difference is to focus on actual words instead of summarized classes. Building a broad set of classes for the user to examine is usually based on summarizing entire documents (to avoid generating too many classes) which has the effect of potentially hiding useful information.

In speed reading, and especially skimming, there is a trade-off between how fast text can be covered and comprehended. Learning to read faster involves training the brain to "skip" less important words and even whole sentences. Taken to an extreme comprehension will eventually be lost as the brain no longer has sufficient information to base it on. That this is possible at all demonstrates that much of language is redundant or structural; the brain can synthesize structure and generate meaning from the more important words in the text.

This observation, that there are key words in every section of text, can be further combined with another, that in general writing key words in proximity to each other generally may distinguish one document from another. The author of a document is writing on a particular subject, and the words used are unlikely to be random. Thus discussions on document searching are likely to use many key words different than discussions on searching for a mate. Although both share the root word "searching", words used in proximity to the root may be quite different (e.g.. "examine search terms" vs. "bar search woman").

The idea is to use key words and their neighbors as a way to divide up a search space. Unlike a traditional search method where the user has to guess a set of terms to balance too many results with too few, an interface can be developed which "aids" the user in narrowing the result set. Starting with a "root" term (you have to have some idea of where to start) the interface then supplies lists of words from around the root (and its stemming derivations). The user can scan the left or right lists, choose a word, and the interface updates the opposite list for all matches, and shows the document results below. For larger document collections additional lists can be added to either side as choices are made so that further filtering can be done.

This concept simulates skimming be giving the user a quick way to scan a large set of possible searches, as well as probe the result sets when potentially interesting phrases are seen. A "phrase" in this case is a set of words anchored by the original "root" (or one of its stemmed derivations). In the general case as many words may be chosen as there are lists, one from each. This way many potential "searches" can be accomplished quickly.

An additional aid for the user is to sort and show the number of times the word in each list appears near the root term in the document set. This way the user can identify the popular uses of the root while still allowing more obscure uses to be found.

The Prototype

A prototype was built using descriptions and titles from 5 months of front page stories on digg.com, comprising some 18,000 articles with 36,000 unique keywords. A stopword list was used to cull unimportant words; also all words under 3 letters were ignored. All terms are single words. This web application uses AJAX to minimize screen refreshes and aid in interactivity. Three panes of words are displayed, the center being the single "root" term and its variations. The left and right lists show the proximal terms along with a count. Choosing a word on either side updates the opposite side with terms matching all those selected. Document matches are shown below, along with a text area to enter a standard type search as an additional filter. Matching words are hilited in the results. No attempt is made to paginate the result display.

Note that the prototype is only a partial implementation of the concept, sufficient to demonstrate its usefulness.

Firefox has poor performance deleting thousands of nodes, so switching from a long list ("new") to a filtered list ("hot"+"new") may take several seconds.

The prototype is written in Java running on a custom web framework. All data is assembled on the fly into memory during application launch.

Discussion

Index Mining is most likely appropriate for vertical collections of documents, such as the Patent Database. Its usefulness for the raw web is far less assured, as the document collection is large (billions) and there are many unrelated types of documents (e.g. lists, storefronts) where there may be many unrelated phrases not representative of the page itself. It may be that Index Mining may be useful as a filter to an existing standard search on the web, or as a method to gain ideas for a search.

This concept could also be described as "browsing" a document collection. A user can examine and click through lists much faster than the documents themselves and "discover" potential information which was not known at the start. In this usage it becomes an investigative and research tool rather than an alternative to traditional search. Here it becomes highly useful for examining research reports, documentary legal evidence, tax codes and similar vertical collections where the desired information may not be well known at the start. Since this browsing experience is user-directed and does not require as much advanced knowledge of the terms that may be in the collection, it is well suited to a discovery process.

In a production system (as opposed to the prototype) terms may contain multiple words. The indexer would examine the content and extract terms based on a knowledge of the languages grammar and usage, as well as combining obvious short phrases (e.g. President Bush). Additional filtering such as only including sentence subjects and direct objects, leaving out adjectives or action words, may also be useful to limit the index to important concepts.

Superman also owns a blue cape.

In this sentence the key words are "superman" and "cape". "blue cape" and "owns" may also be useful. The degree of filtering could be an option. Choosing "owns" as the root term would show "superman" and "cape".

In the prototype only 3 lists are shown (left, root and right). For larger document collection additional left and right lists would be come active when terms were chosen (i.e. pick a word on the left, and the right changes to match plus another further right appears). In this way the length of the phrase would expand. In the above sentence picking "superman" would show "owns" to the right and an additional right list would appear with "cape". For a small collection this is not necessary but as the document collection grows, it may require additional filtering.

One potential negative may be that in larger and more diverse document collections, the more common root terms may have too many proximal terms. In the Patent Database, for example, choosing "apparatus" may not be considered useful since it occurs in virtually every document; however, depending on the proximal words, it could still provide a useful filter.

The key questions that still need to be answered are (1) scalability and (2) applicability. The first is a complex issue but is already well known by most public search engine companies. The second requires testing this concept with larger document collections of varying types. Tuning the choice of terms may depend on the nature of the documents themselves, and the requirements of the potential users.

Combining this technique with traditional search, either before or after, is likely to be a fruitful direction. Multiple sets may also be useful (to allow "and" and "or" mining). With careful analysis of the language longer terms (e.g. "U.S. defense officials") may make it easier to scan the lists for interesting information.

Another important discussion is the applicability of this concept to non-english and mixed-language documents. Non-roman languages such as Written Chinese may prove challenging to implement.