Thursday, February 17, 2022

Apple SSD Benchmarks and F_FULLFSYNC

Hector Martin (Hacker News):

It turns out Apple’s custom NVMe drives are amazingly fast - if you don’t care about data integrity.

[…]

On Linux, fsync() will both flush writes to the drive, and ask it to flush its write cache to stable storage.

But on macOS, fsync() only flushes writes to the drive. Instead, they provide an F_FULLFSYNC operation to do what fsync() does on Linux.

[…]

So effectively macOS cheats on benchmarks; fio on macOS does not give numbers comparable to Linux, and databases and other applications requiring data integrity on macOS need to special case it and use F_FULLFSYNC.
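As a rough illustration of that special-casing, here is a minimal Python sketch. The helper names `durable_sync` and `write_durably` are hypothetical, not from any of the quoted sources. It assumes Python's `fcntl` module exposes `F_FULLFSYNC` on macOS and falls back to plain `fsync()` everywhere else, or when the file system rejects the request.

```python
import fcntl
import os

# F_FULLFSYNC is only defined by Python's fcntl module on macOS; None elsewhere.
_F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)


def durable_sync(fd: int) -> str:
    """Ask the OS (and, where possible, the drive) to persist data written to fd.

    Returns the name of the mechanism that was used, for illustration.
    """
    if _F_FULLFSYNC is not None:
        try:
            fcntl.fcntl(fd, _F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            # Some file systems reject F_FULLFSYNC; fall back to fsync().
            pass
    os.fsync(fd)
    return "fsync"


def write_durably(path: str, data: bytes) -> str:
    """Write data to path (truncating it) and sync before closing."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        return durable_sync(fd)
    finally:
        os.close(fd)
```

Whether the data really is on stable storage afterwards still depends on the drive honoring the flush, which is the subject of much of the discussion below.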

[…]

So, effectively, Apple’s drive is faster than all the others without cache flushes, but it is more than 3 times slower than a lowly SATA SSD at flushing its cache.
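To get a feel for this kind of comparison, one can time small writes with and without a sync after each one. The following is only a sketch, not the fio setup used for the numbers above; `sync_write_rate` is a hypothetical helper, and the results depend heavily on where the temporary directory lives (a RAM-backed tmpfs will make every variant look fast).

```python
import fcntl
import os
import tempfile
import time


def sync_write_rate(sync, n_writes: int = 200, block_size: int = 4096) -> float:
    """Return writes per second when sync(fd) is called after every write."""
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, block)
            sync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return n_writes / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    print("no sync:    ", sync_write_rate(lambda fd: None))
    print("fsync:      ", sync_write_rate(os.fsync))
    full = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on macOS
    if full is not None:
        print("F_FULLFSYNC:", sync_write_rate(lambda fd: fcntl.fcntl(fd, full)))
```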

As far as I can tell, the summary is:

  1. fsync() does different things on Mac and Linux for historical reasons.
  2. Many non-Apple SSDs don’t actually flush their cache when doing F_FULLFSYNC; they seem faster because they lie.
  3. Compared with other SSDs that actually do flush, Apple’s are (for unknown reasons) much slower, though they are faster when not flushing. Or, perhaps, these non-Apple SSDs are lying, too.
  4. Often, what you really want is F_BARRIERFSYNC, not F_FULLFSYNC.

Dominic Evans:

Surely that’s a mischaracterisation to claim they’re “cheating”; this is just legacy diversions. On earlier versions of the Linux kernel and in POSIX, fsync() didn’t use to flush the cache either. Darwin independently added the special fcntl to do a “FULLSYNC” long ago.

Yes, newer kernels (2.6 onward or something?) changed the semantics of fsync() to request the full cache flush too. Darwin didn’t change their fsync because they already had their fcntl to provide the option where needed.

Dominic Giampaolo, in 2005:

On MacOS X, fsync() always has and always will flush all file data from host memory to the drive on which the file resides. The behavior of fsync() on MacOS X is the same as it is on every other version of Unix since the dawn of time (well, since the introduction of fsync anyway :-).

I believe that what the above comment refers to is the fact that fsync() is not sufficient to guarantee that your data is on stable storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC, to ask the drive to flush all buffered data to stable storage.

Rosyna Keller:

Force Unit Access, what this “flush to permanent storage, not disk cache” command is called, is ignored by the majority of drive types (either through lying firmware or a bridge).

It’s not enabled by default in most kernels (Linux, Windows) due to synchronous writes being slow.

[…]

However, every disk Apple ships actually supports Force Unit Access (F_FULLFSYNC), and is under a different flag because most cross-platform developers don’t expect fsync() to actually be synchronous, leading to massive performance losses compared to drives that don’t support it.

If you write software that uses full flushing on firmware that isn’t a lying liar and actually goes through a flush to permanent storage, remember that every time you’re doing the full sync, you significantly impact the performance of the entire system, not just your software.

Andrew Wooster:

I was the backupd performance lead and I’d love to move on but it keeps coming up. 🤷‍♂️🤷‍♂️

[…]

I am thankful for the various people at Apple who made sure Apple hardware functioned correctly. Otherwise it would’ve been impossible to have both performance and correctness. The former is easy if you ignore the latter.

Unfortunately, there were problems in another layer that made Time Capsules corrupt their data all the time.

Hector Martin:

So you’re saying my WD NVMe drive lies about flushes, and yet they’re 10x slower than not flushing? Must be really bad at lying then…

The problem is Apple SSDs are 1000x slower when flushing. That’s called a firmware bug.

Maynard Handley:

As I described elsewhere, the traditional solution to ordering writes on unix is fsync. This is a highly sub-optimal solution because it does much more than required.

Apple’s solution is to use the equivalent of barriers, rather than flushes, to enforce ordering; and it works every bit as well as the equivalent solution (barriers rather than flushes) in a CPU pipeline.

Scott Perry:

There’s a third sync operation that lets you have your performance and write ordering too: F_BARRIERFSYNC. SQLite already uses it on Darwin, and it’s part of the best practices guide for I/O reduction
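A minimal sketch of how that might look from Python, for the case where only write ordering matters (say, a payload that must reach the disk before its commit marker). The helpers `ordering_barrier` and `append_then_commit` are hypothetical. Python's `fcntl` module may not expose `F_BARRIERFSYNC`, so the fallback value 85 is an assumption taken from Darwin's `<sys/fcntl.h>` and should be checked against your SDK headers.

```python
import fcntl
import os
import sys

# ASSUMPTION: 85 is the value of F_BARRIERFSYNC in Darwin's <sys/fcntl.h>.
# It is used only on macOS, and only if the fcntl module lacks the constant.
if sys.platform == "darwin":
    F_BARRIERFSYNC = getattr(fcntl, "F_BARRIERFSYNC", 85)
else:
    F_BARRIERFSYNC = None


def ordering_barrier(fd: int) -> str:
    """Order earlier writes before later ones; fall back to fsync()."""
    if F_BARRIERFSYNC is not None:
        try:
            fcntl.fcntl(fd, F_BARRIERFSYNC)
            return "F_BARRIERFSYNC"
        except OSError:
            pass  # not supported on this file system; use fsync() instead
    os.fsync(fd)
    return "fsync"


def append_then_commit(path: str, payload: bytes,
                       commit_marker: bytes = b"COMMIT\n") -> str:
    """Append payload, then a commit marker, with a barrier in between."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)  # small payloads assumed; no partial-write loop
        mechanism = ordering_barrier(fd)  # payload goes to disk before marker
        os.write(fd, commit_marker)
        ordering_barrier(fd)
        return mechanism
    finally:
        os.close(fd)
```

Note that a barrier orders writes but does not promise they are on stable storage when the call returns; code that needs that promise still wants the full flush.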

Update (2022-03-09): See also: Howard Oakley, MacRumors, Howard Oakley, JP Simard.

Russ Bishop (Hacker News):

I tested a random selection of four NVMe SSDs from four vendors. Half lose FLUSH’d data on power loss. That is, the flush went to the drive, confirmed, success reported all the way back to userspace. Then I manually yanked the cable. Boom, data gone.

The other half never lost data confirmed after a flush (F_FULLFSYNC on macOS) no matter how much I abused them. All four had perf hit from flushing so they are doing some work.

Top two performers on flush? One lost data 40% of the time. The other never lost any.

I guess review sites don’t test this stuff. Everyone just assumes data disappearing on crash/power loss is just how computers work?

I feel bad for the other two vendors who must have test suites and spent engineering hours making sure FLUSH works, only to find out no one cares
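For anyone wanting to try a similar experiment, here is one way it could be structured. This is an illustrative sketch, not the procedure used in the test quoted above. A writer appends checksummed, sequence-numbered records and treats a sequence number as confirmed only after the sync call returns; after the power cut, a checker finds the last intact record. The record layout and the names `append_record` and `last_intact_seq` are hypothetical.

```python
import os
import struct
import zlib

RECORD = struct.Struct("<QI")  # sequence number, CRC32 of the sequence bytes


def append_record(fd: int, seq: int, sync=os.fsync) -> None:
    """Append one record and sync; seq counts as confirmed once this returns."""
    seq_bytes = struct.pack("<Q", seq)
    os.write(fd, RECORD.pack(seq, zlib.crc32(seq_bytes)))
    sync(fd)


def last_intact_seq(path: str) -> int:
    """Return the last sequence number in the unbroken run of valid records.

    Sequence numbers are expected to start at 0; returns -1 if none is valid.
    """
    last = -1
    with open(path, "rb") as f:
        while True:
            raw = f.read(RECORD.size)
            if len(raw) < RECORD.size:
                break  # end of file or a torn final record
            seq, crc = RECORD.unpack(raw)
            if seq != last + 1 or crc != zlib.crc32(struct.pack("<Q", seq)):
                break  # gap or corruption
            last = seq
    return last
```

The writer would need to report each confirmed sequence number somewhere off the machine under test (another host, a serial console). If `last_intact_seq` later returns less than the last number reported, the drive lost data it claimed to have flushed. On macOS, `sync` would be a function issuing F_FULLFSYNC rather than `os.fsync`.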

2 Comments

There are two levels of flushing software or users may care about:

The first, mainly for software, is to make sure that anything I've just written to a file is actually flushed to the disk, so that, if I read back blocks _directly_ from the disk, I see those changes. That's relevant, for instance, when I unmount a volume and then try to access its blocks directly, outside of the file system that managed it when it was mounted. It's needed for disk repair tools, too.

Then there's the flush that ensures that the disk's cache is written to the permanent store of the disk. That matters if you want to disconnect (eject) the disk or put it to sleep, and make sure that even if you lose power, the disk still has all the latest data.

Those two flushes therefore have different purposes, and it's smart to perform them with different commands. If Linux does both in one fsync(), that's not smart but rather a lack of control over performance.

Nah, Apple are clearly wrong here. Props to them for getting there first with an actual means for application-layer code to force a disk flush, but POSIX is (non-normatively) clear that fsync(2) should guard against a system crash in general and that is clearly what system software authors expect. If Darwin doesn't do that, then it's just broken, plain and simple.

Now. If they can make guarantees about ordering of writes and can offer async transactions / barriers, then that's great. And if Linux (or Windows) could do that, too, by making guarantees about underlying hardware/firmware disk writes through flushes, then that would be awesome. But, it's a system-level optimisation for the I/O scheduler to just schedule a disk flush periodically so as to minimise disruption and maximise application throughput, and that's not a data integrity guarantee: if your system does crash, but your database rolls a valid database back to before the crash occurred, you've still lost data.

I suppose it's excusable, for notebooks. But not for the Mac Mini, or iMac.
