Unpacking Git Packfiles
Aditya Mukerjee (via Hacker News):
I’ll count packfiles as the third strategy that Git uses to reduce disk space usage, even though packfiles were really created to reduce network usage (and increase network performance). It’s helpful to keep this in mind because the design of Git’s packfiles was informed by the goal of making network usage easy. Reducing the disk space needed is a pleasant side effect.
[…]
The packfile starts with 12 bytes of meta-information and ends with a 20-byte checksum, all of which we can use to verify our results. The first four bytes spell “PACK” and the next four bytes contain the version number – in our case, [0, 0, 0, 2]. The next four bytes tell us the number of objects contained in the pack. Therefore, a single packfile cannot contain more than 2^32 objects, although a single repository may contain multiple packfiles. The final 20 bytes of the file are a SHA-1 checksum of all the previous data in the file.
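The header and trailer layout described here can be checked in a few lines. What follows is a minimal Python sketch, not the author's Go implementation: the function names are hypothetical, and it assumes the whole packfile fits in memory and that the three header fields are big-endian, as the [0, 0, 0, 2] version bytes suggest.

```python
import hashlib
import struct

HEADER_SIZE = 12    # "PACK" + 4-byte version + 4-byte object count
CHECKSUM_SIZE = 20  # trailing SHA-1 digest


def parse_pack_header(data: bytes) -> tuple[int, int]:
    """Return (version, object_count) from the first 12 bytes of a packfile."""
    if len(data) < HEADER_SIZE + CHECKSUM_SIZE:
        raise ValueError("too short to be a packfile")
    # ">4sII": 4 raw bytes, then two big-endian unsigned 32-bit integers.
    magic, version, count = struct.unpack(">4sII", data[:HEADER_SIZE])
    if magic != b"PACK":
        raise ValueError("missing PACK signature")
    return version, count


def verify_pack_checksum(data: bytes) -> bool:
    """Check that the last 20 bytes are the SHA-1 of everything before them."""
    body, trailer = data[:-CHECKSUM_SIZE], data[-CHECKSUM_SIZE:]
    return hashlib.sha1(body).digest() == trailer


def make_empty_pack(version: int = 2) -> bytes:
    """Build a synthetic zero-object pack, only for exercising the parser."""
    body = struct.pack(">4sII", b"PACK", version, 0)
    return body + hashlib.sha1(body).digest()
```

Calling parse_pack_header(make_empty_pack()) should yield version 2 and an object count of 0. Because the count is an unsigned 32-bit field, it is the header itself that caps the number of objects in one pack.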
The heart of the packfile is a series of data chunks, with some metainformation preceding each one. This is where things get interesting! The metainformation is formatted slightly differently depending on whether the data chunk that comes after it is deltified or not. In both cases, they begin by telling us the size of the object that the packfile contains. This size is encoded as a variable-length integer with a special format.
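The excerpt does not spell out the variable-length format, so the sketch below follows Git's own pack-format documentation rather than the quoted text: the first byte carries a continuation flag in its high bit, a 3-bit object type, and the low 4 bits of the size, and each following byte contributes 7 more size bits, least significant first. The function name is hypothetical.

```python
# Object type codes as listed in Git's pack-format documentation.
OBJ_TYPES = {
    1: "commit",
    2: "tree",
    3: "blob",
    4: "tag",
    6: "ofs_delta",
    7: "ref_delta",
}


def read_object_header(data: bytes, offset: int = 0) -> tuple[int, int, int]:
    """Decode one per-object header.

    Returns (type_code, size, next_offset), where next_offset points at the
    first byte after the header.
    """
    byte = data[offset]
    offset += 1
    type_code = (byte >> 4) & 0x07  # bits 6..4 of the first byte
    size = byte & 0x0F              # low 4 bits of the size
    shift = 4
    while byte & 0x80:              # high bit set: another size byte follows
        byte = data[offset]
        offset += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return type_code, size, offset
```

For example, the two bytes 0x95 0x0A decode to type 1 (a commit) with size 5 + (10 << 4) = 165. The size is that of the object the entry describes, as the paragraph above says, so it does not tell you how many compressed bytes follow the header.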
[…]
While it’s possible to work around the aforementioned buffering issues and parse a packfile without ever reading the IDX file, the index makes it a lot easier. Like the packfile, a version 2 index file starts with a header, though the index file header is only eight bytes instead of 12. […] After the header, we encounter what Git calls a fanout table.
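As a rough illustration of how the fanout table is used, here is a Python sketch. The excerpt elides the header contents, so the details here are assumptions taken from Git's pack-format documentation: a 4-byte magic value followed by a 4-byte version number, then 256 big-endian 32-bit entries where entry N counts the objects whose ID has a first byte less than or equal to N. The function names are hypothetical.

```python
import struct

IDX_MAGIC = b"\xfftOc"     # version 2 index signature, per Git's documentation
IDX_HEADER_SIZE = 8        # 4-byte magic + 4-byte version
FANOUT_ENTRIES = 256
FANOUT_SIZE = FANOUT_ENTRIES * 4


def parse_idx_fanout(data: bytes) -> list[int]:
    """Validate a version 2 index header and return the 256 fanout entries."""
    magic, version = struct.unpack(">4sI", data[:IDX_HEADER_SIZE])
    if magic != IDX_MAGIC or version != 2:
        raise ValueError("not a version 2 index file")
    start = IDX_HEADER_SIZE
    return list(struct.unpack(">256I", data[start:start + FANOUT_SIZE]))


def bucket_range(fanout: list[int], first_byte: int) -> tuple[int, int]:
    """Return the half-open range of positions in the sorted object-ID list
    that hold IDs starting with first_byte."""
    lo = fanout[first_byte - 1] if first_byte > 0 else 0
    hi = fanout[first_byte]
    return lo, hi
```

The last entry, fanout[255], is the total number of objects in the pack. A lookup by object ID only needs to binary-search the slice that bucket_range returns instead of the whole sorted list.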
I discovered this while working on a clean-room implementation of Git in pure Go. While there are a lot of references to packfiles online, surprisingly, the actual format of packfiles was rather underdocumented. Most resources just mention that they exist, and describe how to use git verify-pack to inspect a packfile, without explaining how to parse packfiles and apply deltas.
I decided to write this up to save others the trouble of having to reverse-engineer it from scratch!