A visit to pandora.com prompted me to revisit the topic of feature extraction. Tim Westergren’s Music Genome Project is probably one of the coolest ways of exploring feature extraction and relevance feedback:
- The feature extraction part extracts the “phenotypes” from a piece of music you like and uses those features to find similar tunes.
- The relevance feedback part uses your input (thumbs up/down) to refine the search.
So, starting from “Jimmy Smith” and after a few course corrections, the suggestions (e.g., Lou Donaldson’s Funky Mama) started to sound like what I was after. It’s great to see feature extraction and relevance feedback demonstrated in such an intuitive way, and it’s great to see that the Music Genome Project got it right. Others still have trouble employing these technologies. For example, Amazon insists on basing its recommendations on items I bought for others, not for myself. I bet they’d get more mileage (read: sales) if their recommendation algorithms discriminated between an item’s intended recipient and the person buying it. Are you listening?
On Monday 1/9 I heard on NPR’s Motley Fool Show that Google has started working with J. Craig Venter on a Personal Genome project (more about Craig and Celera Genomics in The Gene Wars).
Where’s the connection, and what does Web searching have in common with Genomics? They both employ Feature Extraction.
Feature Extraction, a technique from the field of Information Retrieval, provides the bedrock of Web searching. It maps a query from the original search space into a feature space. This mapping ensures that:
- The feature space is much smaller than the search space.
- The search operation in the feature space can be implemented efficiently (i.e., with fast response times).
By “compressing” the space and simplifying the search, Feature Extraction reduces the search time.
In the context of Web searching, a search engine first indexes Web documents, mapping each document to a point in the k-dimensional keyword (or feature) space, where k is the number of keywords. Typically this automatic indexing first removes common words like “and,” “at,” and “the.” It then reduces the remaining words to their normalized forms; for example, both “computer” and “computation” would be reduced to “comput.” Next, a dictionary of synonyms helps assign each normalized form to a “concept class.” Finally, each document is represented as a vector in keyword space.
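The indexing steps above can be sketched in a few lines of Python. This is a toy illustration, not a real engine: the stop list and the suffix-stripping rules are simplified stand-ins for a proper stemmer (e.g., Porter’s algorithm), and the documents are made up.

```python
# Toy sketch of automatic indexing: stop-word removal, crude stemming,
# and mapping each document to a vector in keyword space.

STOP_WORDS = {"and", "at", "the", "a", "of", "in"}
SUFFIXES = ("ation", "ing", "er", "s")  # naive stand-in for a real stemmer

def stem(word):
    """Strip the first matching suffix, keeping a minimal root length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index(doc):
    """Map a document to a {stem: count} vector in keyword space."""
    vector = {}
    for word in doc.lower().split():
        if word in STOP_WORDS:
            continue
        root = stem(word)
        vector[root] = vector.get(root, 0) + 1
    return vector

# Both "computer" and "computation" collapse to the stem "comput".
print(index("the computer runs a computation"))  # {'comput': 2, 'run': 1}
```

Note how two surface forms of the same concept land on the same dimension of the keyword space, which is exactly what makes the later similarity comparison meaningful.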
Once indexing completes, the search engine is ready to answer queries. To do so, the engine maps the query into the keyword space and then uses a similarity measure to find the relevant documents. Good similarity measures take little time to evaluate.
Without Feature Extraction, searching a collection of Web documents requires many string-matching operations in the search space. In keyword space, though, documents correspond to multi-dimensional vectors, and evaluating something like the cosine function (see my paper for details) is much faster than string matching.
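For concreteness, here is a minimal cosine similarity over sparse keyword vectors (dicts mapping a stem to its weight). The example vectors are made up; in practice the weights would come from the indexing step.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse keyword vectors
    (dicts mapping term -> weight). 1.0 means same direction,
    0.0 means no terms in common."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc = {"comput": 2, "search": 1}    # hypothetical indexed document
query = {"comput": 1}               # hypothetical indexed query
print(cosine(doc, query))           # ≈ 0.894
```

A handful of multiplications and additions per document, which is why this beats repeated string matching over raw text.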
Web Searching is just one of the many areas that employ Feature Extraction. Virtually any domain that deals with large volumes of data, and where queries don’t return exact answers, can use this technique. This includes time-series databases such as DNA data, which explains the similarity (pun intended) between Web searching and Genomics.