Apple SSD Benchmarks and F_FULLFSYNC
It turns out Apple’s custom NVMe drives are amazingly fast - if you don’t care about data integrity.
[…]
On Linux, fsync() will both flush writes to the drive and ask it to flush its write cache to stable storage. But on macOS, fsync() only flushes writes to the drive. Instead, they provide an F_FULLFSYNC operation to do what fsync() does on Linux. […]
So effectively macOS cheats on benchmarks; fio on macOS does not give numbers comparable to Linux, and databases and other applications requiring data integrity on macOS need to special-case it and use F_FULLFSYNC. […]
So, effectively, Apple’s drive is faster than all the others without cache flushes, but it is more than 3 times slower than a lowly SATA SSD at flushing its cache.
As far as I can tell, the summary is:

- fsync() does different things on Mac and Linux for historical reasons.
- Many non-Apple SSDs don't actually flush their cache when doing F_FULLFSYNC; they seem faster because they lie.
- Compared with other SSDs that actually do flush, Apple's are (for unknown reasons) much slower, though they are faster when not flushing. Or, perhaps, these non-Apple SSDs are lying, too.
- Often, what you really want is F_BARRIERFSYNC, not F_FULLFSYNC.
Surely that’s a mischaracterisation to claim they’re “cheating” — this is just legacy divergence. On earlier versions of the Linux kernel, and in POSIX, fsync() didn’t flush the cache either. Darwin independently added the special fcntl() to do a “full sync” long ago. Yes, newer kernels (2.6 onward or something?) changed the semantics of fsync() to request the full cache flush too. Darwin didn’t change their fsync() because they already had their fcntl() to provide the option where needed.
Dominic Giampaolo, in 2005:
On MacOS X, fsync() always has and always will flush all file data from host memory to the drive on which the file resides. The behavior of fsync() on MacOS X is the same as it is on every other version of Unix since the dawn of time (well, since the introduction of fsync anyway :-). I believe that what the above comment refers to is the fact that fsync() is not sufficient to guarantee that your data is on stable storage, and on MacOS X we provide a fcntl(), called F_FULLFSYNC, to ask the drive to flush all buffered data to stable storage.
Force Unit Access, what this “flush to permanent storage, not disk cache” command is called, is ignored by the majority of drive types (either through lying firmware or a bridge).
It’s not enabled by default in most kernels (Linux, Windows) due to synchronous writes being slow.
[…]
However, every disk Apple ships actually supports Force Unit Access (F_FULLFSYNC), and it is under a different flag because most cross-platform developers don’t expect fsync() to actually be synchronous, which would lead to massive performance losses compared to drives that don’t support it. If you write software that uses full flushing on firmware that isn’t a lying liar and actually goes through with a flush to permanent storage, remember that every time you do the full sync, you significantly impact the performance of the entire system, not just your software.
I was the backupd performance lead, and I’d love to move on, but it keeps coming up. 🤷‍♂️
[…]
I am thankful for the various people at Apple who made sure Apple hardware functioned correctly. Otherwise it would’ve been impossible to have both performance and correctness. The former is easy if you ignore the latter.
Unfortunately, there were problems in another layer that made Time Capsules corrupt their data all the time.
So you’re saying my WD NVMe drive lies about flushes, and yet they’re 10x slower than not flushing? Must be really bad at lying then…
The problem is Apple SSDs are 1000x slower when flushing. That’s called a firmware bug.
As I described elsewhere, the traditional solution to ordering writes on Unix is fsync. This is a highly sub-optimal solution because it does much more than required. Apple’s solution is to use the equivalent of barriers, rather than flushes, to enforce ordering; and it works every bit as well as the equivalent solution (barriers rather than flushes) in a CPU pipeline.
There’s a third sync operation that lets you have your performance and write ordering too: F_BARRIERFSYNC. SQLite already uses it on Darwin, and it’s part of the best practices guide for I/O reduction.
Update (2022-03-09): See also: Howard Oakley, MacRumors, Howard Oakley, JP Simard.
I tested a random selection of four NVMe SSDs from four vendors. Half lose FLUSH’d data on power loss. That is, the flush went to the drive, was confirmed, and success was reported all the way back to userspace. Then I manually yanked the cable. Boom, data gone.
The other half never lost data confirmed after a flush (F_FULLFSYNC on macOS), no matter how much I abused them. All four had a perf hit from flushing, so they are doing some work.
Top two performers on flush? One lost data 40% of the time. The other never lost any.
I guess review sites don’t test this stuff. Everyone just assumes data disappearing on crash/power loss is just how computers work?
I feel bad for the other two vendors who must have test suites and spent engineering hours making sure FLUSH works, only to find out no one cares.
There are two levels of flushing software or users may care about:
The first, mainly for software, is to make sure that anything I've just written to a file is actually flushed to the disk, so that, if I read blocks back _directly_ from the disk, I see those changes. That's relevant, for instance, when I unmount a volume and then try to access its blocks directly, outside of the file system that managed it when it was mounted. Needed for disk repair tools, too.
Then there's the flush that shall ensure that the disk's cache is written to the permanent store of the disk. That matters if you want to disconnect (eject) the disk or put it to sleep, and make sure that even if you lose power, the disk still has all the latest data.
Those two flushes therefore have different purposes, and it's smart to perform them with different commands. If Linux does both in one fsync(), that's not smart but rather a lack of control over performance.
Nah, Apple are clearly wrong here. Props to them for getting there first with an actual means for application-layer code to force a disk flush, but POSIX is (non-normatively) clear that fsync(2) should guard against a system crash in general and that is clearly what system software authors expect. If Darwin doesn't do that, then it's just broken, plain and simple.
Now. If they can make guarantees about ordering of writes and can offer async transactions / barriers, then that's great. And if Linux (or Windows) could do that, too, by making guarantees about underlying hardware/firmware disk writes through flushes, then that would be awesome. But, it's a system-level optimisation for the I/O scheduler to just schedule a disk flush periodically so as to minimise disruption and maximise application throughput, and that's not a data integrity guarantee: if your system does crash, but your database rolls a valid database back to before the crash occurred, you've still lost data.
I suppose it's excusable, for notebooks. But not for the Mac Mini, or iMac.