Wednesday, May 6, 2020

Advantages of the Arq 6 File Format

Stefan Reitshamer:

The problem [with Arq 5] is, as your backup set gets bigger, the number of “index files” that explain what’s in those pack files grows until it becomes unwieldy. To find an object, Arq had to check every index file, as well as the list of unpacked blobs, until it found the identifier it was looking for.

Arq 6 doesn’t do that. It stores the actual location in the data. A “snapshot” (backup record/commit) contains the path, offset and length of the trees and blobs it needs. The trees contain the paths, offsets and lengths of the trees and blobs they need. No more looking at all the index files.

Arq 6 also doesn’t store “commits” like git did, where each commit contained the identifier of the parent commit. Deleting a commit from the bottom of that queue was costly. Arq 6 stores “snapshots” (replacement for commits) independently, so one can be deleted without affecting any others.

Also, Arq uses a sqlite database to keep a list of all the blobs, trees and commits and the references among them, so that finding and deleting unreferenced data is very quick. Enforcing a budget is also far faster than in Arq 5.
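To make the “stores the actual location in the data” idea concrete, here is a minimal Python sketch. It is not Arq's actual format: the names `BlobLoc`, `Tree`, and `read_object`, and the in-memory fake pack file, are all hypothetical. It only illustrates a reference that carries its own path, offset, and length, so that an object can be fetched with a single seek and read instead of a search through index files.

```python
import io
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BlobLoc:
    """Hypothetical reference: which pack file, and where inside it."""
    path: str
    offset: int
    length: int


@dataclass
class Tree:
    """Hypothetical tree node mapping a file name to the location of its data."""
    children: dict = field(default_factory=dict)


def read_object(packs, loc):
    """Fetch an object with one seek and one read; no index files consulted."""
    f = packs[loc.path]
    f.seek(loc.offset)
    return f.read(loc.length)


# A fake pack file held in memory, containing two blobs back to back.
packs = {"pack-0001": io.BytesIO(b"hello worldsecond file")}

# The tree a snapshot would point at, with locations stored inline.
root = Tree(children={
    "a.txt": BlobLoc("pack-0001", 0, 11),
    "b.txt": BlobLoc("pack-0001", 11, 11),
})

print(read_object(packs, root.children["b.txt"]))  # b'second file'
```

The trade-off in a design like this is that references are tied to physical locations, so moving or repacking data would mean rewriting whatever points at it.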

I’ll be interested to learn more about how this works. It seems like it would still need to do a complete scan to see whether a new file has the same content as one that was already backed up and to locate unreferenced blobs after pruning snapshots.
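As a rough illustration of how a local database of references could answer both questions without scanning the backup destination, here is a sketch using Python's built-in `sqlite3` module. The schema (`objects`, `refs`) and the helper names are assumptions made up for this example, not Arq's real schema, and the duplicate check only works under the further assumption that blob identifiers are derived from file content.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE objects (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
CREATE TABLE refs (parent TEXT NOT NULL, child TEXT NOT NULL);
""")

# Two independent snapshots whose trees share one blob (blob2).
db.executemany("INSERT INTO objects VALUES (?, ?)", [
    ("snap1", "snapshot"), ("snap2", "snapshot"),
    ("treeA", "tree"), ("treeB", "tree"),
    ("blob1", "blob"), ("blob2", "blob"), ("blob3", "blob"),
])
db.executemany("INSERT INTO refs VALUES (?, ?)", [
    ("snap1", "treeA"), ("snap2", "treeB"),
    ("treeA", "blob1"), ("treeA", "blob2"),
    ("treeB", "blob2"), ("treeB", "blob3"),
])


def has_object(db, object_id):
    """Primary-key lookup: is an object with this ID already recorded?"""
    row = db.execute("SELECT 1 FROM objects WHERE id = ?", (object_id,)).fetchone()
    return row is not None


def delete_snapshot(db, snap_id):
    """Remove one snapshot and its outgoing references; other snapshots are untouched."""
    db.execute("DELETE FROM refs WHERE parent = ?", (snap_id,))
    db.execute("DELETE FROM objects WHERE id = ?", (snap_id,))


def unreferenced(db):
    """IDs of all objects not reachable from any remaining snapshot."""
    rows = db.execute("""
        WITH RECURSIVE live(id) AS (
            SELECT id FROM objects WHERE kind = 'snapshot'
            UNION
            SELECT refs.child FROM refs JOIN live ON refs.parent = live.id
        )
        SELECT id FROM objects
        WHERE id NOT IN (SELECT id FROM live)
        ORDER BY id
    """).fetchall()
    return [r[0] for r in rows]


delete_snapshot(db, "snap1")
print(unreferenced(db))  # ['blob1', 'treeA']
```

Here `blob2` survives the deletion because `snap2` still reaches it through `treeB`. Whether Arq 6 does anything like this is exactly the open question; the sketch only shows that a local reference table turns both checks into queries rather than full scans.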


3 Comments

Rob Mayoff

As I was reading the first paragraph of the quote, I was thinking "For god's sake just store it in SQLite!"

@Rob I think that’s just for the local cache, not the cloud data.

It's really disappointing that there's no user documentation or file format documentation yet.
