Monthly Archives: January 2006

Feature Extraction: From Web Searches to Genomics

On Monday 1/9 I heard on NPR’s Motley Fool Show that Google has started working with J. Craig Venter on a Personal Genome project (more about Craig and Celera Genomics in The Gene Wars).

Where’s the connection, and what does Web searching have in common with Genomics? They both employ Feature Extraction.

Feature Extraction, a technique from the field of Information Retrieval, provides the bedrock of Web searching. Feature Extraction maps documents and queries from the original search space into a feature space. This mapping ensures that:

  • The feature space is much smaller than the search space.
  • The search operation in the feature space is implemented in an efficient manner (i.e., fast response times).

By “compressing” the space and simplifying the search, Feature Extraction reduces the search time.

In the context of Web searching, a search engine first indexes Web documents, mapping each document into a point in the k-dimensional keyword (or feature) space, where k is the number of keywords. Typically this automatic indexing first removes common words like “and,” “at,” and “the.” Then it reduces the remaining words to their normalized form; for example, both “computer” and “computation” would be reduced to “comput.” Next, a dictionary of synonyms helps to assign each normalized form to a “concept class.” Finally, each document is represented as a vector in keyword space.
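The indexing steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular search engine works: the stop-word list, the suffix-stripping rule (a crude stand-in for a real stemmer), the synonym dictionary, and the names extract_features and to_vector are all assumptions made up for the example.

```python
from collections import Counter

# Assumption: a tiny illustrative stop-word list; real indexers use larger ones.
STOP_WORDS = {"and", "at", "the", "a", "of", "is", "in", "to"}

# Assumption: crude suffix stripping stands in for a real stemming algorithm.
SUFFIXES = ("ation", "ing", "ers", "er", "es", "s")

# Assumption: a hypothetical synonym dictionary from stems to concept classes.
CONCEPTS = {"comput": "computing", "calcul": "computing"}


def stem(word):
    """Reduce a word to a normalized form by stripping one known suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Map a text to counts of concept classes (or bare stems)."""
    counts = Counter()
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if not word or word in STOP_WORDS:
            continue  # drop punctuation-only tokens and common words
        normalized = stem(word)
        counts[CONCEPTS.get(normalized, normalized)] += 1
    return counts


def to_vector(counts, keywords):
    """Lay the counts out as a vector with one coordinate per keyword."""
    return [counts.get(keyword, 0) for keyword in keywords]


features = extract_features("The computer and the computation")
vector = to_vector(features, ["computing", "search"])
```

Here both “computer” and “computation” end up in the same concept class, so the sample text maps to a single non-zero coordinate in a two-keyword space.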

Once indexing completes, the search engine is ready to answer queries. To do so the engine maps the query into the keyword space, and then uses a similarity measure to find the relevant documents. Good similarity measures take little time to evaluate.

Without Feature Extraction, searching a collection of Web documents requires many string-matching operations in the search space. In keyword space, though, documents correspond to multi-dimensional vectors, and evaluating something like the cosine function (see my paper for details) is much faster than string matching.
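To make the cosine idea concrete, here is a minimal Python sketch of ranking documents against a query in keyword space. It assumes the documents and the query have already been mapped to equal-length numeric vectors; the function names and the sample vectors are invented for illustration and are not taken from the paper mentioned above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vector, document_vectors):
    """Return (score, document id) pairs, most similar document first."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in document_vectors.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical three-keyword space with two indexed documents.
documents = {"doc_a": [2, 0, 1], "doc_b": [0, 3, 0]}
results = rank([1, 0, 0], documents)
```

Each comparison costs one pass over k numbers, regardless of how long the original documents were, which is where the speedup over string matching comes from.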

Web searching is just one of the many areas that employ Feature Extraction. Virtually any domain that deals with large volumes of data and where queries don’t return exact answers can use this technique. This includes time-series and sequence databases, such as DNA data, which explains the similarity (pun intended) between Web searching and Genomics.