Wednesday, May 6, 2020

Advantages of the Arq 6 File Format

The problem [with Arq 5] is, as your backup set gets bigger, the number of “index files” that explain what’s in those pack files grows until it becomes unwieldy. To find an object, Arq had to check every index file, as well as the list of unpacked blobs, until it found the identifier it was looking for.
Arq 6 doesn’t do that. It stores the actual location in the data. A “snapshot” (backup record/commit) contains the path, offset and length of the trees and blobs it needs. The trees contain the paths, offsets and lengths of the trees and blobs they need. No more looking at all the index files.
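To make the idea of location-carrying references concrete, here is a minimal Python sketch. It is not Arq's actual on-disk format: the `Location` class, the `read_object` helper, and the `pack-0001` file name are all hypothetical, and the pack is just an in-memory byte string. The only point is that a tree holding (path, offset, length) for each child can fetch that child directly, without consulting any index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical pointer to a stored object: which pack, where, how long."""
    path: str
    offset: int
    length: int


def read_object(packs: dict, loc: Location) -> bytes:
    """Fetch an object straight from its recorded location (no index lookup)."""
    data = packs[loc.path]
    return data[loc.offset:loc.offset + loc.length]


# Two blobs stored back to back in one made-up pack file.
packs = {"pack-0001": b"hello" + b"world!"}

# A "tree" that records where each of its blobs lives.
tree = {
    "a.txt": Location("pack-0001", 0, 5),
    "b.txt": Location("pack-0001", 5, 6),
}

contents = {name: read_object(packs, loc) for name, loc in tree.items()}
```

The cost of each read here depends only on the one location being followed, not on how many pack or index files the backup set has accumulated.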
Arq 6 also doesn’t store “commits” like git did, where each commit contained the identifier of the parent commit. Deleting a commit from the bottom of that queue was costly. Arq 6 stores “snapshots” (replacement for commits) independently, so one can be deleted without affecting any others.
Also, Arq uses a sqlite database to keep a list of all the blobs, trees and commits and the references among them, so that finding and deleting unreferenced data is very quick. Enforcing a budget is also far faster than in Arq 5.
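As a rough illustration of why a reference database makes cleanup cheap, the sketch below uses Python's built-in `sqlite3` module. The schema (`objects` and `refs` tables), the object identifiers, and the function names are assumptions made up for this example; Arq's real schema is not described in the quoted post. Snapshots have no parent links, so deleting one touches only its own rows, and a single recursive query then lists everything no remaining snapshot can reach.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE objects (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
        CREATE TABLE refs (parent TEXT NOT NULL, child TEXT NOT NULL);
    """)
    db.executemany("INSERT INTO objects VALUES (?, ?)", [
        ("snap1", "snapshot"), ("snap2", "snapshot"),
        ("tree1", "tree"), ("tree2", "tree"),
        ("blobA", "blob"), ("blobB", "blob"), ("blobC", "blob"),
    ])
    db.executemany("INSERT INTO refs VALUES (?, ?)", [
        ("snap1", "tree1"), ("snap2", "tree2"),
        ("tree1", "blobA"), ("tree1", "blobB"),
        ("tree2", "blobB"), ("tree2", "blobC"),
    ])
    return db


def delete_snapshot(db: sqlite3.Connection, snapshot_id: str) -> None:
    """Remove one snapshot and its outgoing references; others are untouched."""
    db.execute("DELETE FROM refs WHERE parent = ?", (snapshot_id,))
    db.execute("DELETE FROM objects WHERE id = ? AND kind = 'snapshot'",
               (snapshot_id,))


def unreferenced(db: sqlite3.Connection) -> list:
    """Return ids of objects not reachable from any remaining snapshot."""
    rows = db.execute("""
        WITH RECURSIVE live(id) AS (
            SELECT id FROM objects WHERE kind = 'snapshot'
            UNION
            SELECT refs.child FROM refs JOIN live ON refs.parent = live.id
        )
        SELECT id FROM objects
        WHERE id NOT IN (SELECT id FROM live)
        ORDER BY id
    """).fetchall()
    return [row[0] for row in rows]


db = build_demo_db()
delete_snapshot(db, "snap1")
garbage = unreferenced(db)
```

In this toy data, `blobB` survives because `snap2` still reaches it through `tree2`, while `tree1` and `blobA` become garbage once `snap1` is gone.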

I’ll be interested to learn more about how this works. It seems like it would still need to do a complete scan to see whether a new file has the same content as one that was already backed up and to locate unreferenced blobs after pruning snapshots.
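One way a backup tool could avoid a full scan for the duplicate-content check is to key a local index by content hash, so that "have I stored this already?" becomes a single lookup. The Python sketch below is purely speculative and does not claim to show how Arq does it: the `BlobIndex` class, its in-memory dictionary, and the choice of SHA-256 are all assumptions for illustration.

```python
import hashlib


class BlobIndex:
    """Hypothetical content-addressed index: hash -> (offset, length)."""

    def __init__(self) -> None:
        self._by_hash = {}          # hex digest -> (offset, length)
        self._pack = bytearray()    # stand-in for a pack file

    def store(self, data: bytes):
        """Store data unless identical content is already present.

        Returns ((offset, length), was_new).
        """
        digest = hashlib.sha256(data).hexdigest()
        known = self._by_hash.get(digest)
        if known is not None:
            return known, False     # duplicate: reuse the existing location
        loc = (len(self._pack), len(data))
        self._pack += data
        self._by_hash[digest] = loc
        return loc, True


index = BlobIndex()
first = index.store(b"same contents")
second = index.store(b"same contents")   # same bytes, so nothing new is written
third = index.store(b"different")
```

With an index like this kept in a local database, the duplicate check would cost one hash computation plus one keyed lookup per file, rather than a pass over everything already backed up.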
