Thursday, September 24, 2015

Counting Git Objects With Bitmap Indexes

Vicent Martí:

The result is a highly dense forest of interlinked nodes stored in the shape of a graph, but essentially accessed as a key-value store (each object in the database is only indexed by the SHA1 of its contents).


Git doesn’t keep a definite list of all objects reachable from the graph, and it cannot send every single object in its database as a whole, because it could very well be that some of those objects are not reachable at all in the repository and should be thrown away instead of sent to the client. The only thing Git knows are the tips of all branches, so its only option is to walk down the graph, all the way to the beginning of the history, listing every single object that needs to be sent.


Generally speaking, caching specific results to queries is a weak approach to performance in complex systems. What you want to do instead is caching intermediate steps of the computation, to be able to efficiently answer any kind of query.


For any given commit, its bitmap index marks all the objects that can be reached from it. To find the objects that can be reached from a commit, we simply look up its bitmap and check the marked bits on it; the graph doesn’t need to be traversed anymore, and an operation that used to take several minutes of CPU time (loading and traversing every single object in the graph) now takes less than 3ms.


When Git noticed that none of the objects on the list could be sent because they were delta’ed against other objects that were not in the list, it was forced to re-delta these objects. […] This is how we were losing all the benefits of our optimization, and in fact making the process of generating a packfile 40% slower than it was before, despite the fact that the most expensive phase of the process was essentially optimized away.

Comments RSS · Twitter

Leave a Comment