Monday, March 30, 2020

Xattrs Make Time Machine Backups Waste Space

Howard Oakley:

When metadata used to change relatively infrequently, this had little in the way of adverse effects. Now that security and privacy protection are doing so much with extended attributes, the unintended consequence is that many of the files which are copied into each Time Machine backup haven’t actually changed in substance, but a quarantine flag has been added, for instance.

It’s easy to demonstrate this in action if you’re making Time Machine backups. Simply create a sizeable PDF file which doesn’t have a quarantine flag attached to it, or strip the flag from a file which already has one. Leave the file alone for the next automatic backup. After that, open the document using Preview, which will in a fraction of a second automatically write a quarantine flag to it. Leave it for the next automatic backup, and that backup will contain a second copy of that PDF which only differs in that quarantine flag, maybe as little as 31 bytes in all. Imagine this happening to many 10 GB movie clips and you see where this is heading.
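The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `FileState`, `needs_copy_file_level`, and `bytes_copied` are hypothetical names, the attribute value is a placeholder, and the "any change triggers a full copy" rule is a simplification of the behaviour the excerpt describes, not Time Machine's actual logic.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileState:
    """Hypothetical snapshot of one file: its bytes plus its extended attributes."""
    data: bytes
    xattrs: tuple = ()  # (name, value) pairs


def content_digest(state):
    """Hash only the file's data, ignoring metadata."""
    return hashlib.sha256(state.data).hexdigest()


def needs_copy_file_level(old, new):
    """Toy file-level rule (an assumption): any difference in data or
    metadata means the whole file is copied again."""
    return old != new


def bytes_copied(old, new):
    """Bytes of file data written to the backup under the toy rule."""
    return len(new.data) if needs_copy_file_level(old, new) else 0


# A roughly 1 MB stand-in for the PDF, first without a quarantine flag.
before = FileState(data=b"%PDF-1.7" + bytes(1_000_000))
# The same bytes after a quarantine attribute is added (placeholder value).
after = FileState(
    data=before.data,
    xattrs=(("com.apple.quarantine", b"placeholder-flag-value"),),
)

same_content = content_digest(before) == content_digest(after)
copied = bytes_copied(before, after)
print(same_content, copied)
```

The data hash is identical before and after, yet the file-level rule still copies every byte of the file, which is the waste the excerpt is pointing at.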

11 Comments

But sparse bundles would ensure only small chunks of data are actually transferred, no? Why not make sure your Time Machine drive is using a sparse bundle?

@Nathan I haven’t tested this, but I suspect that a sparse bundle wouldn’t help in this case because Time Machine is going to copy the whole file from scratch.

Sören Nils Kuklau

>But sparse bundles would ensure only small chunks of data are actually transferred, no?

Sparse bundles don't help against the underlying issue that Time Machine works at a file level rather than a block level. This is an increasingly large problem.

(I wonder if Howard's reporting explains recent backups of mine I've seen where an unusual amount of the system itself keeps getting backed up, despite not having installed any updates. Still doesn't seem right, since it should be mounted read-only?)

Sparse bundles just mean you don't need to have a disk image that takes up the entire space (including free space) of the virtual volume inside the image. Instead, they only take up the non-free space, and slightly more.

@Sören
Actually sparse bundles are supposed to allow for slicing the large disk image into smaller 8MB bands and only changed bands are supposed to copy during backup. But you are likely correct: because Time Machine only handles files, a huge virtual machine image will get backed up in full and all of those many, many, many, many bands will copy over. A sparse image is different and works exactly the way you describe: a disk image that can grow to the maximum configured size but does not take up that space initially.

The sparse bundle is very useful as you can use rsync to copy it over and only some of the data will copy over, whereas an unmounted disk image will require the whole file as far as I know. Also, while neither a disk image nor a sparse image larger than 4GB could be placed on a FAT32-formatted drive, technically you could use a sparse bundle on the same drive, since each band is only 8MB.

@Nathan Isn’t rsync supposed to detect which parts of a large file have changed and only copy those?

Sören Nils Kuklau

@Nathan: oof, I forgot there’s a distinction between sparse disk images and sparse bundle disk images.

I think recent (all?) versions of Time Machine that use network storage use the sparse bundle variant, though.

>Actually sparse bundles are supposed to allow for slicing the large disk image into smaller 8MB bands and only changed bands are supposed to copy during backup.

Honestly, I don’t know how it handles this case. (I.e., a file is larger than 8 MiB, the file changes, but the entire span of those 8 MiB hasn’t changed.) Does it detect that the resulting band is identical, or does it write it anyway? Because if it does handle deduplication at this level, sort of, then it would need to be able to reconstruct both of those files, which partially stem from different bands, and partially from the same.

>Isn’t rsync supposed to detect which parts of a large file have changed and only copy those?

Yup, rsync has its own block-level diffing, at the transfer layer. The resulting files are just files, though.
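To make the idea of block-level diffing concrete, here is a deliberately simplified sketch. It compares blocks only at fixed offsets, whereas rsync itself uses rolling checksums so it can match blocks at any offset; `BLOCK_SIZE`, `block_digests`, and `changed_blocks` are hypothetical names chosen for illustration and are not part of rsync.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, just to keep the example readable


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one strong digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Indexes of blocks in `new` that differ from, or do not exist in, `old`.

    Simplification: blocks are compared only at the same offset, so an
    insertion that shifts the data marks every later block as changed.
    """
    old_digests = block_digests(old, block_size)
    new_digests = block_digests(new, block_size)
    return [
        index
        for index, digest in enumerate(new_digests)
        if index >= len(old_digests) or old_digests[index] != digest
    ]


old = b"aaaabbbbccccdddd"
new = b"aaaabbXbccccdddd"  # one byte changed inside the second block
to_send = changed_blocks(old, new)
print(to_send)
```

Only the block containing the modified byte needs to cross the wire; the receiver can rebuild the rest from the copy it already has. As noted above, the result on the destination is still an ordinary whole file.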

@Sören There’s nothing fancy going on there. If Time Machine is backing up from a sparse bundle of a VM image, it sees the band files and copies the ones that have changed. If it’s backing up to a sparse bundle, the band files are invisible to Time Machine.

Sören Nils Kuklau

>If Time Machine is backing up from a sparse bundle of a VM image, it sees the band files and copies the ones that have changed. If it’s backing up to a sparse bundle, the band files are invisible to Time Machine.

I mean backing up to a sparse bundle image, with a file in the source volume spanning more than one band, and one of those bands being entirely unchanged. Does the sparse bundle now include two identical bands (or one twice its original size), or is there deduplication going on at the band level?

@Sören I think you’re overthinking what the bands are. They’re just a way to divide the big disk image file into multiple, equally sized smaller ones. There’s no smarts to it, as far as I know. So if you modify one byte of a file that takes 2 bands to back up, Time Machine will copy it again and it will then take up 4 bands. Two of the bands would be identical, if everything happened to line up on even band boundaries, which it wouldn’t.
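The band arithmetic in that comment can be sketched as follows. This is a rough model that treats the disk image as one flat run of bytes cut into equal bands; the 8 MiB `BAND_SIZE` is the figure quoted in this thread (an assumption, since band sizes may vary), `bands_touched` is a hypothetical helper, and real file system allocation inside the image is more complicated.

```python
BAND_SIZE = 8 * 1024 * 1024  # 8 MiB, as quoted in the thread (assumed)


def bands_touched(offset, length, band_size=BAND_SIZE):
    """Return the band indexes covering bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // band_size
    last = (offset + length - 1) // band_size
    return range(first, last + 1)


file_size = 2 * BAND_SIZE  # a file that is exactly two bands long

# First copy, perfectly aligned to a band boundary: bands 0 and 1.
aligned = bands_touched(0, file_size)

# Second full copy written right after the first: bands 2 and 3.
second_copy = bands_touched(file_size, file_size)

# The same file starting 1 KiB into a band spills into a third band.
unaligned = bands_touched(1024, file_size)

print(list(aligned), list(second_copy), list(unaligned))
```

In the aligned case the two copies occupy four bands in total, matching the comment. In the unaligned case no band of the second copy is likely to be byte-identical to a band of the first, so there would be nothing to deduplicate even if something tried.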

Sören Nils Kuklau

Gotcha.

Here's hoping they'll figure out something for block-level diffs. (APFS may not be as obvious a choice here, as it's really not optimized for hard disks.)

Honestly, sparse images/bundles remain a bit confusing to me in how they interact with rsync. On Linux, I'm pretty sure sparse files (which I think are more similar to sparse disk images in OS X parlance) do not work well with rsync unless you use the correct options. Otherwise, say you have a 100GB sparse file that currently takes up only 10GB on your source: once you use rsync, you have a 10GB file on the source and a 100GB file on the destination, because the destination file is no longer sparse and now fills the entire amount. That is not ideal for network transfers, let alone over the Internet.

https://fedoramagazine.org/copying-large-files-with-rsync-and-some-misconceptions/

You have options like --inplace and --sparse and I never know exactly how to handle them. However, with the sparse bundle, you are only actually dealing with individual 8MB bands.

I hope my band numbers are correct. I am going off memory and I do not use Mac OS X anymore, so let me double-check quickly… yeah, it was 8MB at some point; there might be variable sizes now. But, either way, it is small slices of the overall image.
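The sparse file concern in the comment above comes down to the difference between a file's apparent size and the space actually allocated for it. A minimal sketch, assuming a POSIX system where `os.stat` reports `st_blocks` in 512-byte units; `make_sparse_file` and `sizes` are hypothetical helpers, and whether the hole is really left unallocated depends on the file system.

```python
import os
import tempfile

APPARENT_SIZE = 100 * 1024 * 1024  # 100 MiB, scaled down from the 100GB example


def make_sparse_file(path, size):
    """Create a file of `size` bytes by seeking past the end and writing one byte.

    On file systems that support sparse files, the skipped range is a hole
    that takes up little or no disk space.
    """
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")


def sizes(path):
    """Return (apparent size, allocated bytes) for path."""
    st = os.stat(path)
    allocated = st.st_blocks * 512  # POSIX assumption: 512-byte units
    return st.st_size, allocated


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.img")
    make_sparse_file(path, APPARENT_SIZE)
    apparent, allocated = sizes(path)

print(apparent, allocated)
```

A copy tool that simply reads the file sees all of the apparent bytes, zeros included, and writes them out in full unless it is told to recreate the holes. That is how the destination can end up far larger on disk than the source.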
