APFS’s “Bag of Bytes” Filenames
David Reed (via Alastair Houghton):
I received a polite reply from my bug stating:
“iOS HFS Normalized UNICODE names, APFS now treats all files as a bag of bytes on iOS. We are requesting that Applications developers call the correct Normalization routines to make sure the file name contains the correct representation.”
This is not that surprising, technically, because HFS+ was a file system outlier, having Unicode normalization built in. Things are much easier for the file system if it can just treat names as bags of bytes.
However, Apple failed to tell developers that it was making this change when it announced APFS at WWDC 2016. And, as far as I can tell, it’s not mentioned anywhere in the APFS guide. iOS 10.3 is scheduled to ship in a matter of weeks or months, and it will convert existing volumes to APFS. It’s not trivial to make an app that was accustomed to working with a normalized file system work without one. And since there was no announcement, I doubt most developers have even thought about this. So this is bound to cause lots of bugs.
Here are some thoughts that come to mind:
Depending on what an app does, it may not care about this at all. But some apps will require major changes. Think of any code that reads a filename from within a file, or from the network, or from NSUserDefaults, or from the user’s typing, or from a URL handler, and then looks for that file by name. Or code that reads the contents of a folder and compares the filenames with ones it has seen before. Also think of any code that compares a filename with a string, or puts filenames in a dictionary or set, or creates checksums or secure hashes of filenames. Even if code is just reading a filename from disk, so that it starts out with the right byte sequence, you have to be careful of any code that processes it changing the normalization.
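To make the comparison problem concrete, here is a minimal sketch in Python (chosen only because its standard library exposes Unicode normalization directly; the same idea applies in any language). The helper name_key is hypothetical, and NFC is an arbitrary choice of comparison form, not something APFS or Cocoa prescribes.

```python
import unicodedata

# Two spellings of "café.txt" that render identically on screen.
composed = "caf\u00e9.txt"      # é as a single code point (NFC)
decomposed = "cafe\u0301.txt"   # e followed by a combining acute accent (NFD)

# Plain comparison, set membership, and the UTF-8 bytes all see two names.
assert composed != decomposed
assert len({composed, decomposed}) == 2
assert composed.encode("utf-8") != decomposed.encode("utf-8")


def name_key(name: str) -> str:
    """Hypothetical helper: a key used only for comparing names.

    NFC is an arbitrary choice; what matters is using one form consistently.
    """
    return unicodedata.normalize("NFC", name)


# Keyed by the normalized form, both spellings find the same entry.
# The value keeps the exact name as it was read from disk.
seen = {name_key(composed): composed}
assert name_key(decomposed) in seen
```

The design point is to normalize only the key and keep the original string as the value, so the exact on-disk byte sequence is still available when the file has to be opened.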
If normalization is not handled by the file system, it becomes the responsibility of each app. Apple could update Cocoa to make the common cases easier, but it sounds like this has not been done, and it wouldn’t handle everything, anyway. Not all file system access goes through Cocoa. And, in truth, even with a file system that does Unicode normalization, some code needs to care about this.
The Apple engineer’s reply is not very helpful because it’s not clear what the “correct Normalization routines” are. If APFS is not normalized, then there really is no canonical form that you can expect to find on disk. Your code has to pick one and use it consistently. Cocoa has four different methods for normalizing strings. None of these normalize the same way that HFS+ did; HFS+ uses a variant of Form D.
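As far as I know, the four Cocoa methods correspond to the four standard Unicode normalization forms (NFC, NFD, NFKC, NFKD). The sketch below uses Python's unicodedata module to show how one name comes out under each form. The function all_forms and the sample name are made up for illustration, and none of these forms reproduces the HFS+ variant of Form D.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(name: str) -> dict:
    """Return the name under each of the four standard normalization forms."""
    return {form: unicodedata.normalize(form, name) for form in FORMS}


# Sample name: U+FB01 (the "fi" ligature), "anc", U+00E9 (é), ".txt".
sample = "\ufb01anc\u00e9.txt"

for form, text in all_forms(sample).items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{form:5} {len(text):2} code points: {code_points}")
```

The canonical forms (NFC and NFD) differ only in whether é is stored as one code point or two. The compatibility forms (NFKC and NFKD) also replace the ligature with a plain "f" and "i", which changes the name in a way that cannot be undone, so they are a risky choice for filenames.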
It’s not clear what normalization the HFS+ to APFS converter uses. It doesn’t really matter, though, since apps need to be able to handle all the cases, anyway.
Even if an app creates a file, I’m not sure it’s safe for it to rely on being able to find it again with the same name. The file may reside on a disk whose file system is migrated. Or the file could be restored from backup, cloud synced, or transferred in some other manner that might change its name.
As far as I know, there is no API to tell, given a path, whether its file system uses normalization.
As far as I know, if you have a path there is no easy and efficient way to look for it on disk if the normalizations may differ. If you get unlucky, you would have to start at the top and read the contents of each folder until you find the file or folder that, when normalized, matches that component of your path. And there is ambiguity here because, with a bag-of-bytes file system, there could be multiple items at each level that match the path component.
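The folder-by-folder search described above could look roughly like this. It is only a sketch, not a recommended API: matching_names and resolve are hypothetical names, NFC is an arbitrary comparison form, and paths are assumed to be relative and separated by "/". Because several items can match at each level, the function returns a list rather than a single path.

```python
import os
import unicodedata


def matching_names(names, component, form="NFC"):
    """Return every name that equals `component` once both are normalized."""
    wanted = unicodedata.normalize(form, component)
    return [n for n in names if unicodedata.normalize(form, n) == wanted]


def resolve(root, relative_path):
    """Return all on-disk paths under `root` that match `relative_path`
    when normalization differences are ignored (zero, one, or many)."""
    candidates = [root]
    # Skip empty components such as those produced by "a//b".
    for component in filter(None, relative_path.split("/")):
        next_candidates = []
        for directory in candidates:
            try:
                names = os.listdir(directory)
            except OSError:
                # Not a directory, or not readable: no match on this branch.
                continue
            for name in matching_names(names, component):
                next_candidates.append(os.path.join(directory, name))
        candidates = next_candidates
    return candidates
```

A caller has to decide what to do when the result has more than one entry; the code cannot know which of the matching files was meant. The cost is one directory listing per path component per candidate, which is why this is not something to do on every file access.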
More generally, once APFS is deployed users can legitimately end up with multiple files in the same folder whose names only differ in normalization. So it’s not simply a matter of having all application code convert all filenames to a canonical normalization. Apps need to be able to support files like this that coexist as well as the case where there’s only one file and the name in the file system doesn’t exactly match the name that the app stored or calculated.
Update (2017-03-25): See also: Hacker News.
I should also mention that, Unicode issues aside, APFS also allows multiple files in the same folder whose names differ only in case. HFS+ traditionally (on the Mac) does not and is case-insensitive. Applications will need to handle files that sync or transfer back and forth between these systems. Some folder structures created on APFS cannot exist on HFS+. And file references that rely on case-insensitivity to work on HFS+ will not work on APFS. Cocoa does have an API to detect how the file system handles case, and apps should already have been using it, but there will probably be more people running into these issues once the iOS/Mac worlds are no longer standardized on HFS+.
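For apps that sync or transfer folders between the two kinds of file system, one defensive step is to check a list of names for entries that would collapse into one on a case-insensitive, normalizing volume. The sketch below is an approximation under stated assumptions: folded_key and find_collisions are hypothetical helpers, and the key (NFD plus Python's str.casefold) is not the exact HFS+ folding or decomposition table.

```python
import unicodedata
from collections import defaultdict


def folded_key(name: str) -> str:
    """Approximate comparison key for a case-insensitive, normalizing volume.

    Assumption: NFD plus str.casefold(). HFS+'s own rules differ in detail.
    """
    decomposed = unicodedata.normalize("NFD", name)
    # Case folding can produce non-normalized text, so normalize again.
    return unicodedata.normalize("NFD", decomposed.casefold())


def find_collisions(names):
    """Group names that differ only in case or normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[folded_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["Notes.txt", "notes.txt", "caf\u00e9", "cafe\u0301", "unique"]
for group in find_collisions(names):
    print("would collide:", [n.encode("utf-8") for n in group])
```

Printing the UTF-8 bytes makes the normalization-only collisions visible, since the names themselves look identical on screen.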
Update (2017-03-27): iOS 10.3 has shipped with APFS.
I hope at least that the file name, even if not normalized, is guaranteed to be valid UTF-8, right?
[…]
However, when performing manipulations with NSString/NSURL/Swift String, do those preserve composition enough that developers can rely on them for that?
Previously: The Case Against Insensitivity.
The way OSX does it is pretty reasonable: the userspace functions like fileSystemRepresentation return NFD UTF8, and inside the kernel each filesystem renormalizes on its own to its desired format using a set of shared utf8 conversion routines. The people who made this the official kernel filesystem policy at Apple were the same ones who spent years working on the Unicode standard.
Update (2017-03-28): Pierre Lebeaupin:
Nevertheless, this shows Apple themselves sometimes get it wrong and normalize strings in a way that causes issues because the underlying namespace has a dumb byte string for key. So if they can get it wrong, then third-party developers will need all the help they can get to get it right.
Update (2017-04-06): See also: Howard Oakley.
Update (2017-07-17): Howard Oakley:
The most obvious problems arose with iOS users who transferred files from Windows (which prefers a different normalisation form to HFS+) which were named using Korean and other character sets, although this even included European languages with accented characters like ñ and é. There’s a chilling series of messages on the Apple Developer Forums in which an iOS app developer details how users running iOS 10.3 were transferring files using iTunes for Windows, but could not access those files once they were on an iOS device.
NSString implicitly normalizes paths whenever it interacts with them, which is what’s causing the core behavior you’re seeing. NSURL does NOT share this behavior and can be used. All of the NSURL NSFileManager APIs should work fine, none of the path/NSString APIs should be used.
Separately, some APIs that take NSURL arguments actually end up extracting the path and then using NSString. So, for example, while dataWithContentsOfFile doesn’t work for the reason above, dataWithContentsOfURL also fails because it’s not really using NSURL under the hood. On the other hand, NSFileHandle fileHandleForReadingFromURL (and variants) all work fine.
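The failure described in these reports can be imitated with a toy model: a dictionary keyed by raw bytes stands in for a bag-of-bytes directory, and the lookup path is normalized to a decomposed form before it reaches the "file system". This is only an illustration in Python under those assumptions; the variable names and the Korean sample name are made up, and nothing here calls a real Apple API.

```python
import unicodedata

# Toy bag-of-bytes directory: keys are the exact bytes of each stored name.
# The file arrives with a precomposed (NFC) name, as Windows tends to produce.
stored_name = unicodedata.normalize("NFC", "\ud55c\uae00.txt")  # "한글.txt"
directory = {stored_name.encode("utf-8"): b"file contents"}

# An API layer that silently decomposes paths hands over different bytes.
requested = unicodedata.normalize("NFD", stored_name).encode("utf-8")

print(len(stored_name.encode("utf-8")), "bytes stored")
print(len(requested), "bytes requested")
print("found:", requested in directory)

# Passing the name through untouched, byte for byte, succeeds.
untouched = stored_name.encode("utf-8")
print("found with original bytes:", untouched in directory)
```

Each Hangul syllable decomposes into three jamo, so the two byte strings differ substantially even though the names look identical. That is consistent with the advice quoted above: use an access path that keeps the name exactly as it was read from disk.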