Two specific changes have enabled Facebook to use Mercurial at their repository size: modifying the status check to look only at the files the operating system reports as changed, rather than comparing the content of every file (by hooking into the operating system's list of file changes), and modifying the checkout to give a lightweight or shallow clone that does not need the full history.
Normally, a distributed version control system will generate hashes based on the content of data, rather than on timestamps. As a result, computing whether a repository has changes often involves scanning through every file and calculating a hash for each to determine whether its content is different. By limiting the set of files to check to the ones that the operating system has reported as having changed since the last scan, the cost becomes proportional to the number of files that have been touched, instead of to the number of files in the current workspace. Git tries to reduce the cost by running lstat to read file-specific information, but it still has to walk through every file in the repository in order to determine whether any have changed. By asking the operating system to provide the information, the repository can be optimised to scan only those files that the OS reports as having changed.
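To make the difference concrete, here is a minimal Python sketch of the two approaches described above. It is an illustration only, not Mercurial's or Facebook's actual code: the function names are hypothetical, and the `reported_paths` argument stands in for whatever list of touched files an operating-system file watcher would supply.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-1 hex digest of a file's content."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def snapshot(root):
    """Record a content hash for every file under root (relative path -> digest)."""
    recorded = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            recorded[os.path.relpath(full, root)] = file_digest(full)
    return recorded


def status_full_scan(root, recorded):
    """Walk and hash the whole tree; cost grows with the total number of files."""
    current = snapshot(root)
    every_path = set(current) | set(recorded)
    return sorted(p for p in every_path if current.get(p) != recorded.get(p))


def status_from_change_list(root, recorded, reported_paths):
    """Hash only the paths a (hypothetical) OS watcher reported as touched."""
    changed = []
    for rel in set(reported_paths):
        full = os.path.join(root, rel)
        # A reported path that no longer exists counts as a deletion.
        digest = file_digest(full) if os.path.isfile(full) else None
        if digest != recorded.get(rel):
            changed.append(rel)
    return sorted(changed)
```

Both functions give the same answer as long as the watcher's list is complete, but the second does work only for the reported paths. A file that was rewritten with identical content appears in the reported list and is still correctly left out of the result, because the final decision is made on content.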
Update (2014-05-05): Fred McCann:
Asking why does Facebook need a single source tree is the wrong question. Facebook’s process is to treat the codebase as a single thing, so they made tools that supported their process. Same with Google. When the Jenkins project had headaches with Git, I took exception with the criticism that the project should modify its process to better work with Git. That’s backwards thinking.