Monday, October 8, 2007

ZFS

Back in June, MacJournals wrote a 13,000-word article that explained some of the virtues of ZFS and gleefully debunked the (crazy) rumor that it was to be the default file system in Leopard. Amid solid points about areas where ZFS falls short compared to HFS+ was much sneering at “command-line-addled ‘everything invented by Apple must be evil’ true believers” who want Apple to switch to ZFS. (It was the Attitudinal, after all.) A few days ago, AppleInsider wrote a fairly typical rumors article about ZFS and Leopard. MacJournals responded with snark, which was not appreciated by Drew Thaler, a former Apple filesystems engineer. And now MacJournals has responded to Thaler, saying “You don’t have to hate ZFS to know it’s wrong for you.”

My opinion is pretty well summed up by Thaler’s statement in the comments:

If you’re an OS engineer and you aren’t hard at work solving the problems intrinsic to that TODAY, you’re not doing your job. HFS+ is “not that bad” for today’s needs. It’s going to be woefully inadequate real soon now.

ZFS isn’t ready to be the default Mac OS X filesystem today, but HFS+ is or soon will be a liability, and ZFS is perhaps the best candidate for its eventual replacement. Some features like snapshots and merging multiple drives into a storage pool may not make sense for all consumers, especially on notebooks, but there’s no requirement that they be used. There are still data integrity features that could benefit everyone, as I was reminded this weekend when I had problems with the catalogs of two HFS+ drives. And, personally, I think snapshots and pools will be relevant to consumers sooner than people think.

The problem is that, as MacJournals explains, ZFS (as it currently stands) can’t be a drop-in replacement for HFS+. It supports filenames only half as long. It’s case-sensitive rather than case-insensitive/case-preserving. It has inefficient storage for certain extended attributes. Thaler thinks the filesystem should be case-sensitive and the insensitivity should be layered on top by the application frameworks. This wouldn’t be perfect, but I think it could be “good enough.” (The current situation isn’t perfect, either; relying on the filesystem’s case folding means that some software won’t work properly on UFS or HFSX.) Apple to date hasn’t shown much interest in using extended attributes, so I think MacJournals’ concerns about increased space consumption under ZFS are overblown. Of course, I hope that Apple will start using extended attributes more, and if they do that could become a problem with ZFS.
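To make Thaler's idea concrete, here is a rough sketch of what "layering case-insensitivity on top" could look like in practice (the path and filename are made up). On a case-sensitive volume, a framework resolving a user-typed name would do a case-insensitive match within the directory, along the lines of:

find /Volumes/CaseSensitive/Documents -maxdepth 1 -iname 'readme.txt'

and get back README.txt, ReadMe.TXT, or whichever variant actually exists, while the filesystem itself keeps storing and comparing names case-sensitively.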

So what’s the way forward? Apple could stick with the current HFS+ or try to add more features to it. It could use ZFS or its own customized version of ZFS. Or it could use another filesystem entirely. I have no inside information, but my guess is that Apple will eventually ship Macs that boot from some version of ZFS. Perhaps it will make changes to ZFS and try to get them into the official version. I don’t think this is an area where it makes sense to swim against the tide. Users may end up having to make a few concessions, but frankly I think using 128-character filenames instead of 256-character ones is a great tradeoff in return for ZFS’s data integrity features alone. (Use of super-long filenames is limited due to OS X’s PATH_MAX, anyway.)

Update: Jesper covers a point I made above, but should have emphasized more: “‘Everything on’ isn’t the only way to run ZFS.”

Update 2: MacJournals responds to my post, still emphasizing the large storage pools and mirroring, but its main target seems to be “ZFS fans dishonestly asserting all of its ‘magic’ properties without honestly discussing its limitations.” I hadn’t seen much of that, but I try to avoid fanboy circles.

Update 3: Drew Thaler responds. I don’t understand why more people aren’t bothered by the error rates with today’s hard disks. This is one of the reasons that EagleFiler checksums all the files in its library.
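As a rough illustration, you can approximate that kind of integrity checking with nothing but the stock command-line tools; the folder below is just a placeholder, and this is not how EagleFiler actually does it:

find ~/Documents/Library -type f -print0 | xargs -0 md5 -r | sort -k 2 > /tmp/before.md5
# weeks later, after the drive has had a chance to misbehave:
find ~/Documents/Library -type f -print0 | xargs -0 md5 -r | sort -k 2 > /tmp/after.md5
diff /tmp/before.md5 /tmp/after.md5

Any line that differs is a file whose bits have silently changed (or been added or removed) since the first pass, which is exactly the class of problem a checksumming filesystem would catch on every read.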

Update 4: MacJournals responds to Thaler’s response. In the original article, they said ZFS was “completely unsuitable for a Mac OS X startup disk, now or in the foreseeable future.” Now they say, “We do, however, reject the notion that everyone would be willing to make that compromise with today’s hardware. We believe those who are should have the option,” which is basically what I think. It should be an option, soon, for those whose data and drive(s) are such that ZFS wouldn’t use too much disk space or be too slow. There’s also a strange part where MacJournals seems to suggest that Apple should try to integrate ZFS ideas into HFS+ or else build an entirely new filesystem. “The entire discussion of filesystems is skewed, from the start, towards the default position that Macintosh-related filesystem constructs are somehow bad.” I think it’s, rather, that some of us wanted certain ZFS features yesterday. ZFS already exists, and it could provide real benefits relatively soon. The other approaches would probably take much longer, for relatively little benefit. The primary reason for Apple to re-invent the wheel (unless it already has a secret project to do so that’s near completion) would be if it had the attitude towards Unix and open source that MacJournals ascribes to ZFS fans and Mac technologies (which, for the record, I tend to like).


I don't recall personally using more than 128 characters to name a file/folder, but is it a technical challenge to add 256-character support to ZFS?

I think the main stem of this whole argument is something that both parties (Thaler & MJ) agree upon but has gotten lost in all the technical talk:

Leopard.

ZFS will not be the default for Leopard. Nor, probably, for 10.6. In MJ's rebuttal article, they keep referring to "today's Mac user." But I don't really see Thaler addressing "today's" Mac user; he's addressing "today's" OS engineer, building an OS that will be delivered to "tomorrow's" Mac user, when, say, 1 TB notebook drives are commonplace.

1. ZFS as default for Leopard would be stupid.

2. ZFS as an option as soon as possible would be useful.

3. ZFS could be useful as a default in the future.

4. Something else could be useful as a default in the future.

I recently had a hard drive get damaged while moving my Power Mac G5. Most of the information was accessible and the SMART chip was OK, but big chunks couldn't be read. (I think the heads probably didn't get parked and banged against the platter.)

I remember wishing for ZFS that day. It would have been very useful to know *which* files got damaged, even if it couldn't automatically fix them (no RAID).

Doesn't ZFS make sense from an ease-of-use standpoint as far as Time Machine is concerned?

From my understanding, you have to use a separate partition for Time Machine right now. With ZFS, this would be a lot easier for the average user.

Sean Mills: It’s probably not technically difficult, but it wouldn’t be compatible with other ZFS implementations (including the one in Leopard).

Joshua: That’s a good summary, but clearly there’s also debate over “tomorrow’s Mac user” or else MacJournals wouldn’t have responded after Thaler agreed that rumors of Leopard defaulting to ZFS were “obviously fake.” MacJournals writes that “there’s even less reason why it should be the default file system for anything smaller than four drives and 2TB,” and that’s where we disagree. Assuming the performance were good, if bootable ZFS were available today I’d probably use it on my Macs and also install it on those I help manage (most of which have relatively small drives but plenty of disk space to spare, even in the event that ZFS uses it a bit less efficiently).

"Apple to date hasn’t shown much interest in using extended attributes"

Actually, they are used quite a bit under the hood right now. Finder Info is stored there, Disk Images have a copy of the checksum there, and resource forks can be accessed as an extended attribute as well.

There are other uses I've seen, but those are the ones that come to mind in 10.4.

The current usage suggests to me that Apple might view extended attributes as a way to preserve all of the metadata we've come to expect on filesystems other than HFS+, without resorting to messy workarounds like AppleSingle or AppleDouble.

Dave: The Finder info and resource fork can be accessed using extended attributes, but that's not how they’re stored on HFS+. The extended attributes API is a great way to write portable code that preserves metadata; how the filesystem stores that metadata is an entirely separate issue. It might be messy, but the mess might be hidden.

Anyway, the point I was trying to make is that I just don’t think it’s the case that 20% of the files in a typical Tiger or Leopard installation will need fatzap storage.

"Apple to date hasn’t shown much interest in using extended attributes, so I think MacJournals’ concerns about increased space consumption under ZFS are overblown. Of course, I hope that Apple will start using extended attributes more, and if they do that could become a problem with ZFS."

On Apple's web site => developer section => Leopard Technology Series for Developers => Leopard OS Foundations Overview => the "File System Improvements" paragraph, we can read:

"The file system also gains support for extended attributes on all file system types."

So yes, this is a problem with ZFS...

http://developer.apple.com/leopard/overview/osfoundations.html

The fatzap structure is only a problem if the total of all attributes is consistently small *and* if all of the attributes are unable to fit in microzap structures. From TFM:

ZAP objects come in two forms; microzap objects and fatzap objects. Microzap objects are a lightweight version of the fatzap and provide a simple and fast lookup mechanism for a small number of attribute entries. The fatzap is better suited for ZAP objects containing large numbers of attributes.

The following guidelines are used by ZFS to decide whether or not to use a fatzap or a microzap object. A microzap object is used if all three conditions below are met:

• All name-value pair entries fit into one block. The maximum data block size in ZFS is 128KB, and this size block can fit up to 2047 microzap entries.
• The value portion of all attributes is of type uint64_t.
• The name portion of each attribute is less than or equal to 50 characters in length (including the NULL terminating character).

If any of the above conditions are not met, a fatzap object is used.

So... type and creator info, date, Finder bits, labels, all that stuff fits just fine into microzap objects. Custom icons, resource forks... those will require fatzap objects. If you have a lot of spotlight tags, those will probably go into fatzap objects. I can't see how this is such an inefficient mechanism, unless you think that each attribute gets its own 128KB. Read the specs, they make a lot of sense.

Fred Hamranhansenhansen

How much video has been edited on ZFS? How much 64-channel 192kHz 24-bit audio? Yeah, that's what I thought.

Hakime: I don’t think that quote from ADC means what you think it does. Saying that the xattr API is supported says nothing about how it’s implemented on each filesystem.

Scooby: Yes, that’s what I was trying to say. Microzaps should be sufficient for most of the attributes that are in use today, except resource forks. I’m not sure what you mean about Spotlight tags; currently those are stored in Spotlight’s database, not in extended attributes.

Fred: How fast will you fill up an HFS partition with that? ZFS is how you store lots of data, and you need that for video: http://blogs.sun.com/jonathan/entry/going_bollywood

As for large-scale writing speeds, you'll be using up most of your time spoon-feeding your disks, leaving a bit of spare CPU time for your checksum. Checksumming isn't slow or computationally expensive. It also tells you when your drive's going bad -- nice to know if you don't want to lose your drive altogether when you're editing video.

And, if you decide you want to cut down your write times by having more than one disk store your data (have them write different parts in parallel), ZFS will do it faster than your normal LVM.

Finally, I'm really surprised more people aren't really excited about checkpoints -- it's a dead-simple way to do reliable backups and restores. Keep one checkpoint active after your last backup, and store the diff. Make a new checkpoint for now, and delete the old one. Old data only stays around long enough to be backed up. *And* those backups are serial byte streams, stuff you can shove through gzip before you write it to disk.

Think of all the disk thrashing you don't have to do to make good incremental backups. SuperDuper freaks me out in this area.
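Concretely, the whole cycle is just a couple of commands (the pool and filesystem names here are made up):

zfs snapshot tank/home@monday        # the checkpoint for "now"
zfs send -i tank/home@sunday tank/home@monday | gzip > /backup/home-mon.gz   # only the diff since the last backup
zfs destroy tank/home@sunday         # old data only stays around long enough to be backed up

The send stream is the serial byte stream I mentioned: no walking the whole directory tree, no comparing modification dates, no disk thrashing.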

Just for the record, MacJournals writes some very good stuff.

That said, they are also human, and fallible.

MacJournals repeatedly wrote about the impossibility of bringing journaling to HFS+ without major changes that would break everything, and also about how it would be impossible to go to the Intel architecture.

ZFS will come, the only question for now is when.

I wouldn't bet against MacJournals very often, but on this issue, I'll wait to see what surprises Apple has on Leopard-is-Shipping Announcement Day.

They have surprised us before, and will again.

Pete

Pete: MacJournals also wrote that Apple wouldn’t have 64-bit application frameworks anytime soon. But yes, on the whole their writing is very good.

A few random comments ...

Lally: If you think ZFS is faster than a normal LVM for large sequential writes (the audio/video case), you should benchmark it. The current implementation breaks both reads and writes into small chunks, even smaller if you're using RAID-Z, which hurts performance substantially. QFS could easily outperform ZFS by 2x or more on identical hardware for streaming read/write workloads. Even UFS was faster for many workloads. (It doesn't help that ZFS fragments your disk to an extreme.)

Scooby/Michael: One issue with the microzap is that it would require that *each* file attribute be given its own 4-byte entry. A FSGetCatalogInfo call would then require about 10 lookups per file. This is expensive in terms of CPU time. (It's only slightly wasteful for space.) Of course, any third-party attributes would still require use of a fatzap, which is 128KB per file (though compression could alleviate this, again at the cost of CPU).

Richard: If your hard disk was actually damaged and portions couldn't be read, you don't need ZFS to find out which files are unreadable. Any utility that scans the disk would do just fine ('find', 'xargs', and 'dd' if you're handy on the command line). ZFS only improves the situation if you have data blocks which contain data which was written incorrectly (or if you use mirroring, which you can also do with Apple RAID, SoftRAID, etc.).
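Something along these lines would do it (the volume name is just an example):

find /Volumes/Damaged -type f -print0 |
xargs -0 -n 1 sh -c 'dd if="$0" of=/dev/null bs=1m 2>/dev/null || echo "unreadable: $0"'

Any file that dd can't read end to end gets reported, no ZFS required.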

Tom: Snapshots don't really help Time Machine (or similar concepts), because they're not selective. You can't say, I don't want a copy of the 5 GB of pictures I just downloaded (because I'll toss 4 GB of them anyway), but keep the 5 MB of email around. They are an all-or-nothing solution. (And of course, they're not a backup at all, since they share the same media and even the same disk blocks.) Lally's approach would be the right way to use snapshots to advantage in the backup process, but it suffers from the same problem -- you can't ever delete anything. If you accidentally back up your 200 GB Parallels disk, too bad, you've got that 200 GB on your backup disk forever. Real backup software can [usually -- sadly, Retrospect on Mac OS doesn't, though the Windows version does, grrr] let you selectively remove items from the backup.

Anton: good point. I had two different thoughts (ZFS vs LVM and the speed advantages of striping) and incorrectly put them together in the same sentence.

Wouldn't the microzaps be loaded off the same block, resolving them from memory after the 1st?

As for backups, you don't do incrementals forever. Store a full snapshot once in a while (e.g. once a week/month), and store incrementals in between. Unless you're a real data pack-rat, the new snapshots overwrite the old (and the incrementals). That's when your 200GB Parallels disk goes away.
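In ZFS terms, the rotation might look something like this (again, the names are made up):

zfs send tank/home@2007-10 | gzip > /backup/home-2007-10-full.gz   # occasional full dump
zfs destroy tank/home@2007-09                                      # the old snapshot on the pool
rm /backup/home-2007-09-*.gz                                       # the old full and its incrementals on the backup disk

That last rm is the point at which the accidentally captured Parallels disk finally stops taking up space.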

Fragmentation: quoting, as I don't have the time to go in-depth on ZFS:

"Currently fragmentation has not been found to be a problem, the general rule with all properly designed filesystems is that you use no more than 90% of the space then you will have no problems with fragmentation. There currently isn't a defragger, if it becomes a problem it will be intergrated into zpool scrub and it will be able to run in the background or in the middle of the night. In the future they will come up recomendations for when and if you should run zpool scrub. It will check all your data for errors and fix any it finds.

I haven't experienced problems with fragmentation and I have exceeded the 90% rule quite frequently almost constantly in fact and have had no problems with fragmentation. I currently have 48 filesystems, and over 300 snapshots. On approximately 100GB of storage."

Lally: You're right, the microzaps will be all in the same block. It's a CPU cost more than an I/O cost; but doing multiple searches for each file access does hurt.

Fragmentation actually is quite a problem for some people (again, see the zfs-discuss mailing list -- a great resource if you're interested in learning more about the limited real-world experience here). It isn't for others. I don't know anyone who's tried to stream data from a ZFS file system at high rates, which is where fragmentation tends to be the biggest issue. (Note that Apple’s HFS+ implementation has the ability to preallocate space for a file contiguously on disk, important for audio/video and useful for file copies as well; presumably this feature could also be added to ZFS in the future.)

Since real data, or at least an approximation thereto, is better than guesses, I used rsync to copy most of my laptop disk (with the exception of /Users) to (a) an HFS+ disk image, and (b) a ZFS disk on a Solaris VM. (For those who want to play with ZFS, Solaris 10 U4 runs nicely under VMWare Fusion.) It's fairly large as I have a lot of applications, fonts etc. installed.

HFS+ required 19611576K to store this data (as measured by 'df -k').
ZFS required 19065405K, again as measured by 'df -k'.

I then used 'runat touch filetype' to create a microzap for each file on the ZFS partition.
ZFS required 20243967K after this -- a 6% overhead, or 1.1 GB. Not as bad as I was expecting, actually, but still not cheap.
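For anyone who wants to repeat this, the procedure was roughly the following (the hostname and pool name are placeholders, not my actual setup):

rsync -a --exclude=/Users mac:/ /tank/copy/                 # copy the laptop's disk onto the ZFS pool
df -k /tank/copy                                            # space used before adding attributes
find /tank/copy -type f -exec runat {} touch filetype \;    # give every file a 'filetype' extended attribute
df -k /tank/copy                                            # space used afterwards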

I did note that Sun lists a suggested project to add additional attributes to each file node proposed by 'a third party vendor'. I don't know whether this was Cluster File Systems (now purchased by Sun) or Apple, but either of them would seem likely candidates. I suspect that project is fairly low on Sun's priority list, but Apple obviously would have an interest in making it happen.

[...] Another respected Mac developer, Michael Tsai, also responded with a thoughtful post. [...]

ZFS does support filenames of up to 255 characters (well, at least ZFS on my Solaris Nevada b55 box).

Marc: ZFS supports 255 ASCII chars, but not 255 unichars.

ZFS' snapshots are VERY useful, but more so on the Time Machine volume than on the boot drive. All that hard-link hack magic of Time Machine can be done away with.

As for space: ZFS can run a transparently compressed storage pool, which in my tests saves about 20% of disk capacity. Particularly on a laptop, that's key, and it may even speed some things up, because generally I/O is the bottleneck, not the CPU time required to decompress, which is why such things as compressed or encrypted swap files are not a problem.
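Turning it on is a one-liner, and you can check what it's actually buying you (the pool name is just an example):

zfs set compression=on tank      # transparent compression for everything written from now on
zfs get compressratio tank       # shows how much space you're actually saving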

ZFS is the way to go, and should for some reason ZFS be too slow for some media application, then do what most media people do anyway: use a separate drive for recording/streaming, and you can use a legacy file system on that.

I actually use ZFS, and it performs very well on two of my workstations, one of which is five years old. I use ZFS for all my filesystems, and I was very happy when I was able to use it for my boot drives.

Even if I had only one drive, I would use ZFS; snapshots/clones are a gift from god. Just today I rolled back to a snapshot because one of my experiments wiped out part of the OS. Add to that checksums and compression, and I would choose ZFS over HFS any day. Also, your filesystem is always consistent on disk, so no more fsck. Anybody remember the fsck after a power outage? ZFS does not need that, so you are up and running a lot faster. I also think that ZFS will perform even better on solid state drives.
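The rollback I mentioned is literally two commands (the filesystem name is made up):

zfs snapshot tank/export@before-experiment     # taken before starting to poke at the OS
zfs rollback tank/export@before-experiment     # puts everything back the way it was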

I hope ZFS will replace HFS soon...
