Making the Internet Archive’s Full Text Search Faster
Giovanni Damiola (via Hacker News):
Analyzing the hot threads, we discovered that our bottleneck was in the parsing of Lucene's inverted index.
We have a lot of high-frequency words, some of them present in almost all of the 35M documents.
The combination of large documents (we index a whole book as a single document) and the need to support full phrase queries (and show the user the correct results) leads to a serious problem: when a phrase query includes a token present in a large portion of the documents, all of these candidate documents must be examined for scoring. The scoring process is very I/O-intensive because the relative proximity data for the tokens in each document must be determined.
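To make the cost concrete, here is roughly what such a query looks like at the Lucene level (a sketch only: the field name and the phrase are hypothetical, and this uses Lucene's PhraseQuery directly regardless of how the search frontend builds it):

```java
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

// A phrase query has to read per-document position lists for every term it
// contains, including very common ones like "of" and "the", so any book that
// contains all of the terms becomes a candidate that must be scored.
public class PhraseQueryCost {
    static Query bookPhrase() {
        // Field name "text" and the phrase are illustrative.
        return new PhraseQuery("text", "war", "of", "the", "worlds");
    }
}
```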
[…]
What if we were to compute proximity information at index time and somehow encode it in the inverted index? This is what the Common Grams tokenizer was made for.
For the few thousand most common tokens in our index, we push proximity data into the token space. This is done by generating tokens which represent two adjacent words instead of one when one of the words is a very common word. These two-word tokens have a much lower document hit count and result in a dramatic reduction in the number of candidate documents which need to be scored when Lucene is looking for a phrase match.
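The post doesn't include configuration, but Lucene ships this behavior as CommonGramsFilter in its analysis-common module (Elasticsearch exposes it as the common_grams token filter). Below is a minimal sketch of an index-time analyzer wired up this way, assuming a recent Lucene API; the class name and the tiny word list are illustrative stand-ins for the Archive's few thousand most frequent tokens:

```java
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch of an index-time analyzer that emits two-word "common gram" tokens
// alongside the ordinary unigrams whenever one word of an adjacent pair is
// on the common-word list.
public class CommonGramsSketchAnalyzer extends Analyzer {

    // Toy list for illustration; a real index would use its own most
    // frequent tokens.
    private static final CharArraySet COMMON_WORDS =
            new CharArraySet(Arrays.asList("of", "the", "and", "a", "in"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        // "war of the worlds" -> war, war_of, of, of_the, the, the_worlds, worlds
        // Each bigram is overlaid on the position of its first word, so
        // phrase queries rewritten to use the bigrams still line up.
        TokenStream grams = new CommonGramsFilter(source, COMMON_WORDS);
        return new TokenStreamComponents(source, grams);
    }
}
```

At query time, the companion CommonGramsQueryFilter keeps only the bigrams for pairs involving a common word, so a phrase such as "war of the worlds" is matched against the much rarer terms war_of, of_the, and the_worlds instead of walking the position lists of every document that contains "of" or "the".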