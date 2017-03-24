APFS’s “Bag of Bytes” Filenames
David Reed (via Alastair Houghton):
I received a polite reply from my bug stating:
“iOS HFS Normalized UNICODE names, APFS now treats all files as a bag of bytes on iOS. We are requesting that Applications developers call the correct Normalization routines to make sure the file name contains the correct representation.”
This is not that surprising, technically, because HFS+ was a file system outlier, having Unicode normalization built in. Things are much easier for the file system if it can just treat names as bags of bytes.
However, Apple failed to tell developers that it was making this change when it announced APFS at WWDC 2016. And, as far as I can tell, it’s not mentioned anywhere in the APFS guide. iOS 10.3 is scheduled to ship in a matter of weeks or months, and it will convert existing volumes to APFS. It’s not trivial to make an app that was accustomed to working with a normalized file system work without one. And since there was no announcement, I doubt most developers have even thought about this. So this is bound to cause lots of bugs.
Here are some thoughts that come to mind:
1. Depending on what an app does, it may not care about this at all. But some apps will require major changes. Think of any code that reads a filename from within a file, or from the network, or from NSUserDefaults, or from the user’s typing, or from a URL handler, and then looks for that file by name. Or code that reads the contents of a folder and compares the filenames with ones it has seen before. Also think of any code that compares a filename with a string, or puts filenames in a dictionary or set, or creates checksums or secure hashes of filenames. Even if code is just reading a filename from disk, so that it starts out with the right byte sequence, you have to be careful of any code that processes it changing the normalization.
2. If normalization is not handled by the file system, it becomes the responsibility of each app. Apple could update Cocoa to make the common cases easier, but it sounds like this has not been done, and it wouldn’t handle everything, anyway. Not all file system access goes through Cocoa. And, in truth, even with a file system that does Unicode normalization, some code needs to care about this.
3. The Apple engineer’s reply is not very helpful because it’s not clear what the “correct Normalization routines” are. If APFS is not normalized, then there really is no canonical form that you can expect to find on disk. Your code has to pick one and use it consistently. Cocoa has four different methods for normalizing strings. None of these normalize the same way that HFS+ did. HFS+ uses a variant of Form D.
4. It’s not clear what normalization the HFS+ to APFS converter uses. It doesn’t really matter, though, since apps need to be able to handle all the cases, anyway.
5. Even if an app creates a file, I’m not sure it’s safe for it to rely on being able to find it again with the same name. The file may reside on a disk whose file system is migrated. Or the file could be restored from backup, cloud synced, or transferred in some other manner that might change its name.
6. As far as I know, there is no API to tell, given a path, whether its file system uses normalization.
7. As far as I know, if you have a path there is no easy and efficient way to look for it on disk if the normalizations may differ. If you get unlucky, you would have to start at the top and read the contents of each folder until you find the file or folder that, when normalized, matches that component of your path. And there is ambiguity here because, with a bag-of-bytes file system, there could be multiple items at each level that match the path component.
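To make the byte-level problem in the points above concrete, here is a minimal sketch. It is written in Python rather than Cocoa so that it runs anywhere, and it models a bag-of-bytes directory as a plain dictionary. The filename and the `canonical_key` helper are made up for illustration, and NFC is an arbitrary choice of form, not something APFS or Apple prescribes.

```python
import unicodedata

# "café" typed with a precomposed é (U+00E9); a hypothetical filename.
typed_name = "caf\u00e9"

nfc = unicodedata.normalize("NFC", typed_name)  # é stays a single code point
nfd = unicodedata.normalize("NFD", typed_name)  # é becomes e + U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Model a bag-of-bytes directory as a dict keyed by the raw name bytes.
# Assumption: the file was stored under its decomposed spelling.
directory = {nfd_bytes: "file contents"}

# An exact byte lookup with the composed spelling misses the file.
exact_hit = nfc_bytes in directory


def canonical_key(raw: bytes) -> str:
    """Normalize a raw UTF-8 name to the one form the app has chosen (NFC)."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))


# Comparing normalized keys on both sides finds the stored entry.
normalized_hits = [raw for raw in directory
                   if canonical_key(raw) == canonical_key(nfc_bytes)]

print(len(nfc_bytes), len(nfd_bytes), exact_hit, normalized_hits)
```

The two spellings look identical to a reader but differ in length and content as bytes, which is why any code that hashes, compares, or looks up filenames has to pick one form and apply it on both sides.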
More generally, once APFS is deployed users can legitimately end up with multiple files in the same folder whose names only differ in normalization. So it’s not simply a matter of having all application code convert all filenames to a canonical normalization. Apps need to be able to support files like this that coexist as well as the case where there’s only one file and the name in the file system doesn’t exactly match the name that the app stored or calculated.
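As a rough illustration of the coexisting-files case, the following Python sketch groups a directory listing by normalized name and reports the groups with more than one member. The function name `normalization_collisions` and the sample names are hypothetical, and NFC is again just one possible choice of comparison form.

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(names):
    """Map each NFC-normalized name to the on-disk names that share it,
    keeping only the groups that contain more than one on-disk name."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical listing: the first two names differ only in normalization.
listing = ["caf\u00e9.txt", "cafe\u0301.txt", "notes.txt"]
collisions = normalization_collisions(listing)
print(collisions)
```

An app that finds a non-empty result cannot simply pick "the" matching file. It has to keep the exact on-disk names around and decide, or ask the user, which item is meant.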
Update (2017-03-25): See also: Hacker News.
I should also mention that, Unicode issues aside, APFS also allows multiple files in the same folder whose names differ only in case. HFS+ traditionally (on the Mac) does not and is case-insensitive. Applications will need to handle files that sync or transfer back and forth between these systems. Some folder structures created on APFS cannot exist on HFS+. And file references that rely on case-insensitivity to work on HFS+ will not work on APFS. Cocoa does have an API to detect how the file system handles case, and apps should already have been using it, but there will probably be more people running into these issues once the iOS/Mac worlds are no longer standardized on HFS+.
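For code that does not go through Cocoa, one blunt alternative is to probe the volume empirically. The sketch below is Python, not the Cocoa API mentioned above, and `is_case_sensitive` is a made-up helper: it creates a mixed-case file in a temporary folder and checks whether the lowercased name resolves to it. It only describes the volume that holds the probed folder.

```python
import os
import tempfile


def is_case_sensitive(parent_dir=None):
    """Return True if the file system holding parent_dir (or the default
    temporary directory) treats names that differ only in case as distinct."""
    with tempfile.TemporaryDirectory(dir=parent_dir) as probe_dir:
        mixed = os.path.join(probe_dir, "CaseProbe")
        with open(mixed, "w"):
            pass  # an empty file is enough for the probe
        # On a case-insensitive volume the lowercased path finds the same file.
        return not os.path.exists(os.path.join(probe_dir, "caseprobe"))


print(is_case_sensitive())
```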
I want to add that the reason I'm excited is that, because the low-level API doesn't do any kind of normalisation, application developers can't get away with stupid crap like not working on case-sensitive file systems, because that's simply the default, and you have to deal with it.
Last week I learned that HFS+ uses Form D to normalize while Linux uses Form C to normalize (https://linux.die.net/man/1/convmv), which creates the weirdest problems when syncing with rsync.
It's totally frustrating when umlauts are part of many of your filenames and all of a sudden multiple files share the same name.
On the other hand syncthing automatically adjusts filenames to the correct normalization form, in case they are wrong (https://docs.syncthing.net/advanced/folder-autonormalize.html).
The C in NFC stands for canonical normalization form.
The D for derived, mainly they deviate in ligatures.
Umlauts are not treated by normalization, usually only apostrophes.
Normalization is important to keep filenames identifiable. Different written characters which are rendered the same, should be stored normalized.
should be stored and compared as . I.e. café -> café
It's the same name.
Karsten's comment about HFS vs. Linux with rsync makes no sense to me. HFS does normalize file names going in and out. So should Linux, meaning that even though they use different forms on disk, rsync uses the POSIX APIs, which would go through the normalization process and thus avoid any problems. Unless Linux's file system API does NOT normalize names. BTW, AFAIK, Windows also does normalization via their FS API, and it uses a composite format where HFS+ uses a decomposed format. But there's no sync problem there because, when exchanging file names, they may be in one or the other format, but the low level FS code will always convert it into its internal representation.
Oh, wait. Maybe what Karsten means is that rsync, when comparing directory file names from Mac and Linux, does not recognize that they are effectively the same, because rsync gets a Form D name from the Mac and a Form C name from the Linux system, and they do not match under rsync's file name comparison function? Well, that would be a bug in rsync, because the same problem would occur when the file systems are not case-sensitive (which HFS+ usually isn't), and then rsync would also have to understand the case-insensitive comparison rules of each operating system. But has anyone complained that rsync fails at that? Maybe it does. I never use rsync. Correct me if I'm wrong, but normalization when comparing file names from different file systems is technically no different from comparing case-insensitive file names. In both cases you need to know how to apply conversions to make them comparable. You can't claim to do it right if you don't address both challenges. Of course, the problem is that case-insensitive comparisons are a widely known issue whereas normalization isn't.
I was using rsync to quickly move a folder of files to Linux. Then on Linux I used syncthing to spread the files further. As far as I can tell the mac-normalization was kept by rsync. When I started syncthing, it was corrected. Subsequent rsync calls would then delete and upload the files instead of updating just the changed files because it wouldn't recognize the different normalization.
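The parallel between case and normalization can be sketched as a single comparison key. This Python example follows the Unicode notion of canonical caseless matching (decompose, case-fold, decompose again). The `comparison_key` helper is hypothetical, and real file systems such as HFS+ apply their own tables, so treat it as an approximation and not a description of any particular file system or of rsync.

```python
import unicodedata


def comparison_key(name, case_sensitive=False):
    """Build a key so that names differing only in normalization (and,
    optionally, in case) compare equal."""
    decomposed = unicodedata.normalize("NFD", name)
    if case_sensitive:
        return decomposed
    # Case folding can itself produce non-normalized text, so normalize again.
    return unicodedata.normalize("NFD", decomposed.casefold())


composed_upper = "CAF\u00c9"     # precomposed É
decomposed_lower = "cafe\u0301"  # e followed by a combining acute accent

same_ignoring_case = comparison_key(composed_upper) == comparison_key(decomposed_lower)
same_with_case = (comparison_key(composed_upper, case_sensitive=True)
                  == comparison_key(decomposed_lower, case_sensitive=True))
print(same_ignoring_case, same_with_case)
```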
@Samuel I would say it’s only trivial to implement if you never interact with any other systems with different rules and don’t care about enforcing any presentation niceties for the user.
> The C in NFC stands for canonical normalization form.
> The D for derived, mainly they deviate in ligatures.
That's wrong. C stands for canonical composition and D for canonical decomposition.
> Umlauts are not treated by normalization, usually only apostrophes.
Also wrong. Umlauts like U+00C4 are decomposed under NFD into U+0041 U+0308.
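The decomposition is easy to check. This short Python sketch uses the standard `unicodedata` module (not a Cocoa API) to list the code points that NFD produces for U+00C4 and confirms that NFC composes them back.

```python
import unicodedata

a_umlaut = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

decomposed = unicodedata.normalize("NFD", a_umlaut)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
recomposed = unicodedata.normalize("NFC", decomposed)

print(code_points)             # ['U+0041', 'U+0308']
print(recomposed == a_umlaut)  # True
```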
This is why Unicode should be more Nazi. Developers cannot be trusted. We need a Final Solution for the Apple Problem.
Did you forget about -[NSString fileSystemRepresentation]?
@Ken No, but I think that's about encoding rather than normalization (or case). And it doesn't look at the contents of the file system.
I'm the David Reed who posted the referenced issue. First, in defense of the Apple Engineer mentioned in #3, he indicated he is not a file system engineer so he was not certain what the correct thing to do was, but was at least trying to help by pointing out the issue. I admit to not knowing as much as I should about Unicode, but it appears based on your commentary in #3 that none of the NSString methods do the same thing HFS+ does so I'm at a loss as to what to do. I'm happy to change my code (although it's probably too late as iOS 10.3 will probably be out next week with the new iPads), but I still haven't been able to find documentation that indicates exactly what should be done. I will be submitting a DTS incident today to see what they say the correct thing to do is.
@Dave Is the issue that the filename you're getting from Core Data doesn't exist?
@Michael I believe so. This is obviously difficult to reproduce and test since it requires doing the conversion from HFS+ to APFS, which I suspect is not reversible (i.e., I wonder if you would be able to downgrade an iOS 10.3 device to 10.2). My app allows users to back up their Core Data files by zipping them (since UIManagedDocument creates a directory of files /StoreContent/persistentStore where persistentStore is the actual sqlite file) and either uploading to Dropbox or emailing. The person sent one of those zipped files to me (but who knows what the zip process did in its handling of the filenames). I have a simple plist with the list of filenames that the user has created, and when they pick a file to open, I tell the UIManagedDocument subclass to open that file. In this case, the success parameter in the completion handler of UIManagedDocument's openWithCompletionHandler is false.
https://developer.apple.com/reference/uikit/uidocument/1619977-openwithcompletionhandler?language=objc
So what it appears based on the discussion from everyone is that the NSString I got from the user entering in a UITextField (and using that to make a NSURL) is being converted differently to the filename under HFS+ than it is under APFS. I'll email you the details of what I put in the DTS incident that I just submitted this morning in case that helps.
This is going to be terrible. The case-sensitivity issues alone will be a pain. Ever try running macOS on case-sensitive HFS+? All sorts of apps break in weird, unexpected ways.
@Michael, I believe fileSystemRepresentation handles normalization as well.
NSString *base = @"Héllo";
NSString *formC = [base precomposedStringWithCanonicalMapping];
NSString *formD = [base decomposedStringWithCanonicalMapping];
NSLog(@"lengths:%lu %lu", (unsigned long)[formC length], (unsigned long)[formD length]);
const char *formCCStr = [formC fileSystemRepresentation];
const char *formDCStr = [formD fileSystemRepresentation];
NSLog(@"equality? %d", strcmp(formCCStr, formDCStr));
Prints
2017-03-25 10:13:09.254008-0700 FileSystemRepTest[2207:315911] lengths:5 6
2017-03-25 10:13:09.254256-0700 FileSystemRepTest[2207:315911] equality? 0
@ken Cool. Thanks.
Please tell me that this is not true (or I have misunderstood). How can APFS be a transparent replacement for HFS+ if it really sees file names as just a bag of bytes?
The update (end of the article) says "APFS also allows multiple files in the same folder whose names differ only in case". This is a fair definition of a case-sensitive filing system. My experience is purely with macOS but I assume that iOS currently adopts the same behaviour, where almost every system is configured to be case-insensitive.
The APFS Guide lists the Sierra implementation as having a "limitation" of "Filenames are case-sensitive only" because it is a Developer Preview version: I read "limitation" as meaning that it's not intended to be that way and won't be when it's released. If that's correct then there will at least be a case-insensitive option; that means that APFS cannot just treat the file name as a bag of bytes.
If people are working with the Sierra preview then this may just be a false alarm. If they're working with iOS and seeing this behaviour then I remain confused.
@Tony iOS has always used case-sensitive HFS+. The bug report is from iOS.
@ken I asked about using that on the Cocoa-dev mailing list, but I was told that going from NSString to a C string using fileSystemRepresentation and then back to NSString to send to NSURL was effectively a no-op. So what does one do to make a NSURL with APFS that will match what HFS+ created from the same NSString?
@Dave Depending on what the migrator did, you might be able to use -fileSystemRepresentation before looking up the path. It sounds like people are saying that that uses the same normalization as HFS+, though I don’t think it’s actually documented to do so. If the migrator did change the normalization (or that method doesn’t use the right one) then you would be in the #7 case that I mentioned and have to read the contents of the directory to find a match. In that case, you could pick any of the four NSString methods and use it to normalize both the directory listing and the string you got from Core Data to find the matching file. Or you could check for equality using -compare: without the literal option. (Don’t use -isEqual:.)
I've read through the comments on the cocoa-dev list and found a few good comments. But suggesting that Apple should just make one of the NS... functions or NSURL fix this would be a very bad idea, because those functions are only a high level layer on top of lower level APIs. What I mean is that if Apple does not fix this on the file system driver level (the HFS FS driver does it there so far), Apple must at least "fix" the low level file system APIs, e.g. the FS... and CF... APIs foremost, and probably also the BSD and POSIX functions, or whatever layer comes below that.
I can understand that the APFS devs do not want to have to deal with file name ambiguities. If their FS driver is asked for file named "abc", they do want to be able to look at the name by its binary representation. That's what they mean by "bag of bytes".
But for that to work, the level between NSURL and the APFS driver needs to do SOME conversion. For instance, it needs to make sure the file name is in a certain encoding and format. Such as UTF-8, because, as Michael Tsai has well explained in the post here, if that weren't the case, even "correct" names would not be found if they'd be encoded in UTF-16 in your code whereas they're stored in UTF-8 on the APFS end.
And I am quite sure that Apple has foreseen this need and is performing such an as-needed conversion on that lower layer before passing names to the APFS driver.
However, seeing that Dave Reed has run into an issue with that suggests to me that either they're not doing this correctly, or that there is a bug in the conversion code from HFS+ to APFS.
What's most upsetting, though, is the ignorance of the one responding to Dave Reed in the first place - even if the APFS driver sees names as just a "bag of bytes", he should understand the need of the higher levels to pass a consistent format of the text, and therefore suspect that there's some bug in there that needs to be addressed, and that QUICKLY, because otherwise it might be a disaster for Apple as soon as they do a world-wide release of this and likely screw up any iOS device that uses non-Roman file names.
@Thomas My understanding is that there already is a layer that converts to UTF-8 if you start with a higher level API. The issues are that the UTF-8 isn’t normalized and that you may be able to use a lower level API to pass a char* that’s not valid UTF-8 (probably more of a programmer error that would have gotten you in trouble before, too).
@Thomas and @Michael, yes I think the only choice I have at this point is to try each of the four conversion methods/properties on NSString and check if any of those directories exist. But I was confused by Michael's comment in the original blog post in item #3 where Michael states HFS+ uses none of them exactly but a variant of form D. If that's the case wouldn't none of the four match? I have just written a method to rescan the directory and give me all the filenames (which are actually directories since UIManagedDocument creates a directory for each file and stores the actual sqlite file in a subdirectory). The person using the app (with Arabic file names on iOS 10.3) who reported the problem to me stated that any new files created with Arabic names on iOS 10.3 worked. It's only the files created on iOS 10.2 and converted from HFS+ to APFS that can't be opened (i.e., found).
And yes, @Thomas, I agree it should be fixed at the lowest API level that takes a Unicode string and converts it to however the filesystem expects to see that name. At some level, it might expect just a sequence of bytes (i.e., maybe the open function passes a C array of bytes, of C char type, as the file name), but I think any level above it that accepts Unicode strings should pick one conversion from Unicode and do it for you.
@Dave Instead of normalizing your Core Data string and looking for it in the file system, you should read the directory contents from the file system, normalize those strings, and compare them to the normalized Core Data string. When you find a match, find the corresponding unnormalized filename (that you got from NSFileManager) and use it to access the file. That way your code doesn’t care what the file system actually stores. I hope that’s clearer?
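Here is a minimal sketch of that approach, in Python rather than NSFileManager terms so it can be run anywhere. `find_on_disk_names` is a hypothetical helper and NFC is an arbitrary choice of comparison form. It returns a list rather than a single name because, on a bag-of-bytes file system, more than one entry may match.

```python
import os
import tempfile
import unicodedata


def find_on_disk_names(directory, wanted, form="NFC"):
    """Return the exact on-disk names in `directory` whose normalized form
    equals the normalized form of `wanted`."""
    target = unicodedata.normalize(form, wanted)
    return [entry for entry in os.listdir(directory)
            if unicodedata.normalize(form, entry) == target]


# Demonstration with a hypothetical stored name versus the name the app kept.
with tempfile.TemporaryDirectory() as folder:
    stored = unicodedata.normalize("NFD", "caf\u00e9.sqlite")
    with open(os.path.join(folder, stored), "w"):
        pass  # create an empty file under the decomposed spelling

    remembered = unicodedata.normalize("NFC", "caf\u00e9.sqlite")
    matches = find_on_disk_names(folder, remembered)
    # Open the file by the name the file system reported, not the remembered one.
    openable = [os.path.exists(os.path.join(folder, name)) for name in matches]

print(len(matches), openable)
```

The important design point is that the app never constructs the on-disk name itself. It only uses normalization to recognize the entry, then goes back to the unnormalized name from the listing to open it.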
@Michael, yes, that makes more sense. thanks.
I'm so excited by APFS, and I think treating file names as an array of bytes is really the only sane thing for a file-system. It should be the responsibility of the layers above to do normalisation. In theory it's trivial to implement but as you correctly point out, hard to refactor existing code.