Thursday, May 14, 2015

Dropbox’s Firefly Full-Text Search Engine

Samir Goel et al.:

As a result, we chose a sharding function based on “namespace”. A namespace is a widely used concept in our production systems. Internally, we represent a user’s Dropbox as a collection of namespaces. Each namespace consists of files and directories, along with a directory structure, and is mounted at a certain directory path within a user’s Dropbox. In the simplest case, a user’s Dropbox consists of just one namespace mounted at “/”, which is called the “Root” namespace.

[…]

This is still inefficient, as a shard typically contains a large number of namespaces (in the millions), while a user typically has access to a handful of namespaces. To make the query processing even faster, we prefix each token in the search index with the ID of its namespace: {namespace-id:token => list of document IDs}. This corresponding list of documents contains only those that contain the token and are also present in the namespace. This allows us to focus our processing on the subset of the search index that is relevant to the set of namespaces that belong to the user. For example, if a user with access to the namespace with id ns1 issues the query “san francisco”, we process it by intersecting the list of document IDs for tokens: "ns1:san" and "ns1:francisco".

Comments RSS · Twitter

Leave a Comment