Friday, March 24, 2017

APFS’s “Bag of Bytes” Filenames

Dave Reed:

I received a polite reply from my bug stating:

“iOS HFS Normalized UNICODE names, APFS now treats all files as a bag of bytes on iOS. We are requesting that Applications developers call the correct Normalization routines to make sure the file name contains the correct representation.”

This is not that surprising, technically, because HFS+ was a file system outlier, having Unicode normalization built in. Things are much easier for the file system if it can just treat names as bags of bytes.
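
To make "bag of bytes" concrete, here is a minimal sketch of what such a file system sees. It is written in Python rather than Cocoa only so that it is self-contained, the filename is a made-up example, and it uses the standard Unicode forms C and D (not the HFS+ variant of Form D). The two strings look identical on screen, but their UTF-8 bytes differ, so a file system that compares bytes treats them as two different names.

```python
import unicodedata

# Hypothetical filename, written with an escape so the source form is unambiguous.
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # precomposed: e-acute is U+00E9
nfd = unicodedata.normalize("NFD", name)  # decomposed: "e" followed by U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)               # b'caf\xc3\xa9.txt'
print(nfd_bytes)               # b'cafe\xcc\x81.txt'
print(nfc_bytes == nfd_bytes)  # False: different names to a byte-comparing file system
```

A normalizing file system such as HFS+ maps both spellings to one stored form before comparing, which is why apps never had to notice the difference.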

However, Apple failed to tell developers that it was making this change when it announced APFS at WWDC 2016. And, as far as I can tell, it’s not mentioned anywhere in the APFS guide. iOS 10.3 is scheduled to ship in a matter of weeks or months, and it will convert existing volumes to APFS. It’s not trivial to make an app that was accustomed to working with a normalized file system work without one. And since there was no announcement, I doubt most developers have even thought about this. So this is bound to cause lots of bugs.

Here are some thoughts that come to mind:

1. Depending on what an app does, it may not care about this at all. But some apps will require major changes. Think of any code that reads a filename from within a file, or from the network, or from NSUserDefaults, or from the user’s typing, or from a URL handler, and then looks for that file by name. Or code that reads the contents of a folder and compares the filenames with ones it has seen before. Also think of any code that compares a filename with a string, or puts filenames in a dictionary or set, or creates checksums or secure hashes of filenames. Even if code is just reading a filename from disk, so that it starts out with the right byte sequence, you have to be careful of any code that processes it changing the normalization.
2. If normalization is not handled by the file system, it becomes the responsibility of each app. Apple could update Cocoa to make the common cases easier, but it sounds like this has not been done, and it wouldn’t handle everything, anyway. Not all file system access goes through Cocoa. And, in truth, even with a file system that does Unicode normalization, some code needs to care about this.
3. The Apple engineer’s reply is not very helpful because it’s not clear what the “correct Normalization routines” are. If APFS is not normalized, then there really is no canonical form that you can expect to find on disk. Your code has to pick one and use it consistently. Cocoa has four different methods for normalizing strings. None of these normalizes the same way that HFS+ did; HFS+ uses a variant of Form D.
4. It’s not clear what normalization the HFS+ to APFS converter uses. It doesn’t really matter, though, since apps need to be able to handle all the cases, anyway.
5. Even if an app creates a file, I’m not sure it’s safe for it to rely on being able to find it again with the same name. The file may reside on a disk whose file system is migrated. Or the file could be restored from backup, cloud synced, or transferred in some other manner that might change its name.
6. As far as I know, there is no API to tell, given a path, whether its file system uses normalization.
7. As far as I know, if you have a path there is no easy and efficient way to look for it on disk if the normalizations may differ. If you get unlucky, you would have to start at the top and read the contents of each folder until you find the file or folder that, when normalized, matches that component of your path. And there is ambiguity here because, with a bag-of-bytes file system, there could be multiple items at each level that match the path component.
8. More generally, once APFS is deployed users can legitimately end up with multiple files in the same folder whose names only differ in normalization. So it’s not simply a matter of having all application code convert all filenames to a canonical normalization. Apps need to be able to support files like this that coexist as well as the case where there’s only one file and the name in the file system doesn’t exactly match the name that the app stored or calculated.
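
The folder scan described in the last two points can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended implementation: it is in Python rather than Cocoa, the function names and the filename are hypothetical, it handles a single folder rather than every component of a path, and it uses standard Form D purely as a comparison key. The important detail is that the key is never used to open anything; the function returns the names exactly as the file system reported them.

```python
import os
import tempfile
import unicodedata
from typing import List


def comparison_key(name: str) -> str:
    # Used only for comparing names, never written to disk. Standard NFD is an
    # arbitrary but consistent choice here; it is not the HFS+ variant.
    return unicodedata.normalize("NFD", name)


def find_on_disk_names(folder: str, wanted: str) -> List[str]:
    """Return every entry of `folder` that is canonically equivalent to `wanted`,
    spelled exactly as the file system reported it."""
    wanted_key = comparison_key(wanted)
    return sorted(e for e in os.listdir(folder) if comparison_key(e) == wanted_key)


# Usage: the app remembered the precomposed spelling, the disk has the decomposed one.
stored_name = unicodedata.normalize("NFC", "Z\u00fcrich.txt")  # hypothetical name
disk_name = unicodedata.normalize("NFD", stored_name)

with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, disk_name), "w") as f:
        f.write("demo")
    matches = find_on_disk_names(folder, stored_name)

# No matches: the file is missing. One match: open the file using matches[0].
# Several matches: the folder holds names that differ only in normalization,
# and the app has to decide (or ask the user) which one was meant.
print(matches == [disk_name])
```

Returning a list rather than a single name is deliberate: on a bag-of-bytes file system more than one entry can match, and the caller is the only one who can resolve that.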

Update (2017-03-25): See also: Hacker News.

I should also mention that, Unicode issues aside, APFS also allows multiple files in the same folder whose names differ only in case. HFS+ traditionally (on the Mac) does not and is case-insensitive. Applications will need to handle files that sync or transfer back and forth between these systems. Some folder structures created on APFS cannot exist on HFS+. And file references that rely on case-insensitivity to work on HFS+ will not work on APFS. Cocoa does have an API to detect how the file system handles case, and apps should already have been using it, but there will probably be more people running into these issues once the iOS/Mac worlds are no longer standardized on HFS+.
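
One way an app might check for such folder structures before moving files to a case-insensitive, normalizing volume is sketched below. Everything here is an assumption made for illustration: the code is Python rather than Cocoa, the function names and filenames are hypothetical, and Python's Unicode case folding plus standard Form D only approximate the rules HFS+ actually applies.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def folded(name: str) -> str:
    # Approximation: Unicode case folding followed by standard NFD.
    # HFS+ uses its own case-folding table and its own variant of Form D.
    return unicodedata.normalize("NFD", name.casefold())


def find_collisions(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names that would map to the same entry once case and
    normalization are ignored; keep only groups with more than one name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[folded(name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical listing of a folder on a case-sensitive, bag-of-bytes volume.
names = ["README.txt", "Readme.txt", "notes.txt", "caf\u00e9.txt", "cafe\u0301.txt"]
collisions = find_collisions(names)

for group in collisions.values():
    print("would collide:", group)
```

With this listing, the two README spellings form one group and the two spellings of the accented name form another, while notes.txt is left alone.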

Update (2017-03-27): iOS 10.3 has shipped with APFS.

Pierre Lebeaupin:

I hope at least that the file name, even if not normalized, is guaranteed to be valid UTF-8, right?

[…]

However, when performing manipulations with NSString/NSURL/Swift String, do those preserve composition enough that developers can rely on them for that?

Previously: The Case Against Insensitivity.

Drew Thaler:

The way OSX does it is pretty reasonable: the userspace functions like fileSystemRepresentation return NFD UTF8, and inside the kernel each filesystem renormalizes on its own to its desired format using a set of shared utf8 conversion routines. The people who made this the official kernel filesystem policy at Apple were the same ones who spent years working on the Unicode standard.

Update (2017-03-28): Pierre Lebeaupin:

Nevertheless, this shows Apple themselves sometimes get it wrong and normalize strings in a way that causes issues because the underlying namespace has a dumb byte string for key. So if they can get it wrong, then third-party developers will need all the help they can get to get it right.

Update (2017-04-06): See also: Howard Oakley.

Update (2017-07-17): Howard Oakley:

The most obvious problems arose with iOS users who transferred files from Windows (which prefers a different normalisation form to HFS+) which were named using Korean and other character sets, although this even included European languages with accented characters like ñ and é. There’s a chilling series of messages on the Apple Developer Forums in which an iOS app developer details how users running iOS 10.3 were transferring files using iTunes for Windows, but could not access those files once they were on an iOS device.

KevinE:

NSString implicitly normalizes paths whenever it interacts with them, which is what’s causing the core behavior you’re seeing. NSURL does NOT share this behavior and can be used. All of the NSURL NSFileManager APIs should work fine, none of the path/NSString APIs should be used.

Separately, some APIs that take NSURL arguments actually end up extracting the path and then using NSString. So, for example, while dataWithContentsOfFile doesn’t work for the reason above, dataWithContentsOfURL also fails because it’s not really using NSURL under the hood. On the other hand, NSFileHandle fileHandleForReadingFromURL (and variants) all work fine.

57 Comments

Samuel Williams

March 25, 2017 1:36 AM

I'm so excited by APFS, and I think treating file names as an array of bytes is really the only sane thing for a file-system. It should be the responsibility of the layers above to do normalisation. In theory it's trivial to implement but as you correctly point out, hard to refactor existing code.

Samuel Williams

March 25, 2017 1:37 AM

I want to add, the reason why I'm excited, is that because the low level API doesn't do any kind of normalisation, application developers can't get away with stupid crap like not working on case-sensitive file systems, because that's simply the default, and you have to deal with it.

Karsten

March 25, 2017 3:08 AM

last week I learned that HFS+ uses Form D to normalize while Linux uses Form C to normalize (https://linux.die.net/man/1/convmv), which creates the weirdest problems when syncing with rsync.
It's totally frustrating when umlauts are part of many of your filenames and all of a sudden multiple files share the same name.
On the other hand syncthing automatically adjusts filenames to the correct normalization form, in case they are wrong (https://docs.syncthing.net/advanced/folder-autonormalize.html).

Reini Urban

March 25, 2017 4:30 AM

The C in NFC stands for canonical normalization form.
The D for derived, mainly they deviate in ligatures.

Umlauts are not treated by normalization, usually only apostrophes.

Normalization is important to keep filenames identifiable. Different written characters which are rendered the same, should be stored normalized.
should be stored and compared as . I.e. café -> café
It's the same name.

Thomas Tempelmann

March 25, 2017 6:02 AM

Karsten's comment about HFS vs. Linux with rsync makes no sense to me. HFS does normalize file names going in and out. So should Linux, meaning that even though they use different forms on disk, rsync uses the POSIX APIs, which would go through the normalization process and thus avoid any problems. Unless Linux's file system API does NOT normalize names. BTW, AFAIK, Windows also does normalization via their FS API, and it uses a composite format where HFS+ uses a decomposed format. But there's no sync problem there because, when exchanging file names, they may be in one or the other format, but the low level FS code will always convert it into its internal representation.

Thomas Tempelmann

March 25, 2017 6:08 AM

Oh, wait. Maybe what Karsten means is that rsync, when comparing directory file names from Mac and Linux, does not recognize that they are effectively the same, because rsync gets a Form D name from the Mac and a Form C name from the Linux system, and they do not match in rsync's file name comparison function? Well, that would be a bug in rsync, because the same problem would occur when the file systems are not case-sensitive (which HFS+ usually isn't), and then rsync would also have to understand the case-insensitive comparison rules of each operating system. But has anyone complained that rsync fails about that? Maybe it does. I never use rsync. Correct me if I'm wrong, but normalization when comparing file names from different file systems is technically nothing different than comparing case-insensitive file names. In both cases you need to know how to apply conversions to make them comparable. You can't claim to do it right if you don't observe both challenges. Of course, the problem is that case-insensitive comparisons are a widely known issue whereas normalization isn't.

Karsten

March 25, 2017 6:23 AM

I was using rsync to quickly move a folder of files to Linux. Then on Linux I used syncthing to spread the files further. As far as I can tell the Mac normalization was kept by rsync. When I started syncthing, it was corrected. Subsequent rsync calls would then delete and upload the files instead of updating just the changed files because it wouldn't recognize the different normalization.

Michael Tsai

March 25, 2017 6:44 AM

@Samuel I would say it’s only trivial to implement if you never interact with any other systems with different rules and don’t care about enforcing any presentation niceties for the user.

Nick Wellnhofer

March 25, 2017 7:09 AM

> The C in NFC stands for canonical normalization form.
> The D for derived, mainly they deviate in ligatures.

That's wrong. C stands for canonical composition and D for canonical decomposition.

> Umlauts are not treated by normalization, usually only apostrophes.

Also wrong. Umlauts like U+00C4 are decomposed under NFD into U+0041 U+0308.

Aubrey Kohn

March 25, 2017 8:08 AM

This is why Unicode should be more Nazi. Developers cannot be trusted. We need a Final Solution for the Apple Problem.

ken

March 25, 2017 8:13 AM

Did you forget about -[NSString fileSystemRepresentation]?

Michael Tsai

March 25, 2017 8:16 AM

@Ken No, but I think that's about encoding rather than normalization (or case). And it doesn't look at the contents of the file system.

Dave Reed

March 25, 2017 9:42 AM

I'm the David Reed who posted the referenced issue. First, in defense of the Apple Engineer mentioned in #3, he indicated he is not a file system engineer so he was not certain what the correct thing to do was, but was at least trying to help by pointing out the issue. I admit to not knowing as much as I should about Unicode, but it appears based on your commentary in #3 that none of the NSString methods do the same thing HFS+ does so I'm at a loss as to what to do. I'm happy to change my code (although it's probably too late as iOS 10.3 will probably be out next week with the new iPads), but I still haven't been able to find documentation that indicates exactly what should be done. I will be submitting a DTS incident today to see what they say the correct thing to do is.

Michael Tsai

March 25, 2017 10:54 AM

@Dave Is the issue that the filename you're getting from Core Data doesn't exist?

Dave Reed

March 25, 2017 11:32 AM

@Michael I believe so. This is obviously difficult to reproduce and test since it requires doing the conversion from HFS+ to APFS which I suspect is not reversible (i.e., I wonder if you would be able to downgrade an iOS 10.3 device to 10.2). My app allows users to back up their Core Data files by zipping them (since UIManagedDocument creates a directory of files /StoreContent/persistentStore where persistentStore is the actual sqlite file) and either uploading to Dropbox or emailing. The person sent one of those zipped files to me (but who knows what the zip process did with its handling of the filenames). I have a simple plist with the list of filenames that the user has created and when they pick a file to open, I tell the UIManagedDocument subclass to open that file. In this case, the UIManagedDocument openWithCompletionHandler's success parameter in the completion handler is false.

https://developer.apple.com/reference/uikit/uidocument/1619977-openwithcompletionhandler?language=objc

So it appears, based on the discussion from everyone, that the NSString I got from the user entering text in a UITextField (and used to make an NSURL) is being converted differently to the filename under HFS+ than it is under APFS. I'll email you the details of what I put in the DTS incident that I just submitted this morning in case that helps.

Mark Egli

March 25, 2017 12:36 PM

This is going to be terrible. The case-sensitivity issues alone will be a pain. Ever try running macOS on case-sensitive HFS+? All sorts of apps break in weird, unexpected ways.

ken

March 25, 2017 1:15 PM

@Michael, I believe fileSystemRepresentation handles normalization as well.

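// Build the same string in precomposed (Form C) and decomposed (Form D) form.
NSString *base = @"Héllo";
NSString *formC = [base precomposedStringWithCanonicalMapping];
NSString *formD = [base decomposedStringWithCanonicalMapping];
// The UTF-16 lengths differ: é is one code unit in Form C and two in Form D.
NSLog(@"lengths:%lu %lu", (unsigned long)[formC length], (unsigned long)[formD length]);
const char *formCCStr = [formC fileSystemRepresentation];
const char *formDCStr = [formD fileSystemRepresentation];
// strcmp returns 0 when the two C strings are byte-for-byte equal.
NSLog(@"equality? %d", strcmp(formCCStr, formDCStr));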

Prints

2017-03-25 10:13:09.254008-0700 FileSystemRepTest[2207:315911] lengths:5 6
2017-03-25 10:13:09.254256-0700 FileSystemRepTest[2207:315911] equality? 0

Tony

March 25, 2017 2:27 PM

Please tell me that this is not true (or I have misunderstood). How can APFS be a transparent replacement for HFS+ if it really sees file names as just a bag of bytes?

The update (end of the article) says "APFS also allows multiple files in the same folder whose names differ only in case". This is a fair definition of a case-sensitive filing system. My experience is purely with macOS but I assume that iOS currently adopts the same behaviour, where almost every system is configured to be case-insensitive.

The APFS Guide lists the Sierra implementation of having a "limitation" of "Filenames are case-sensitive only" through being a Developer Preview version: I read "limitation" as meaning that it's not intended to be that way and won't be when it's released. If that's correct then there will at least be a case-insensitive option; that means that APFS cannot just treat the file name as a bag of bytes.

If people are working with the Sierra preview then this may just be a false alarm. If they're working with iOS and seeing this behaviour then I remain confused.

Michael Tsai

March 25, 2017 3:04 PM

@ken Cool. Thanks.

Michael Tsai

March 25, 2017 3:07 PM

@Tony iOS has always used case-sensitive HFS+. The bug report is from iOS.

Dave Reed

March 25, 2017 4:29 PM

@ken I asked about using that on the Cocoa-Dev list but when going from NSString to a C-String using fileSystemRepresentation and then back to NSString to send to NSURL, I was told (on the Cocoa-dev mailing list) that was effectively a no-op. So what does one do to make a NSURL with APFS that will match what HFS+ created from the same NSString?

Michael Tsai

March 25, 2017 4:29 PM

@Dave Depending on what the migrator did, you might be able to use -fileSystemRepresentation before looking up the path. It sounds like people are saying that that uses the same normalization as HFS+, though I don’t think it’s actually documented to do so. If the migrator did change the normalization (or that method doesn’t use the right one) then you would be in the #7 case that I mentioned and have to read the contents of the directory to find a match. In that case, you could pick any of the four NSString methods and use it to normalize both the directory listing and the string you got from Core Data to find the matching file. Or you could check for equality using -compare: without the literal option. (Don’t use -isEqual:.)

Thomas Tempelmann

March 25, 2017 4:55 PM

I've read through the comments on the cocoa-dev list and found a few good comments. But suggesting that Apple should just make one of the NS... functions or NSURL fix this would be a very bad idea, because those functions are only a high level layer on top of lower level APIs. What I mean is that if Apple does not fix this on the file system driver level (the HFS FS driver does it there so far), Apple must at least "fix" the low level file system APIs, e.g. the FS... and CF... APIs foremost, and probably also the BSD and POSIX functions, or whatever layer comes below that.

I can understand that the APFS devs do not want to have to deal with file name ambiguities. If their FS driver is asked for file named "abc", they do want to be able to look at the name by its binary representation. That's what they mean by "bag of bytes".

But for that to work, the level between NSURL and the APFS driver needs to do SOME conversion. For instance, it needs to make sure the file name is in a certain encoding and format. Such as UTF-8, because, as Michael Tsai has well explained in the post here, if that weren't the case, even "correct" names would not be found if they'd be encoded in UTF-16 in your code whereas they're stored in UTF-8 on the APFS end.

And I am quite sure that Apple has foreseen this need and is performing such a as-needed conversion on that lower layer before passing names to the APFS driver.

However, seeing that Dave Reed has run into an issue with that suggests to me that either they're not doing this correctly, or that they're having a bug in the conversion code from HFS+ to APFS.

What's most upsetting, though, is the ignorance of the one responding to Dave Reed in the first place - even if the APFS driver sees names as just a "bag of bytes", he should understand the need of the higher levels to pass a consistent format of the text, and therefore suspect that there's some bug in there that needs to be addressed, and that QUICKLY, because otherwise it might be a disaster for Apple as soon as they do a world-wide release of this and likely screw up any iOS device that uses non-Roman file names.

Michael Tsai

March 25, 2017 5:05 PM

@Thomas My understanding is that there already is a layer that converts to UTF-8 if you start with a higher level API. The issues are that the UTF-8 isn’t normalized and that you may be able to use a lower level API to pass a char* that’s not valid UTF-8 (probably more of a programmer error that would have gotten you in trouble before, too).

Dave Reed

March 25, 2017 8:55 PM

@Thomas and @Michael, yes I think the only choice I have at this point is to try each of the four conversion methods/properties on NSString and check if any of those directories exist. But I was confused by Michael's comment in the original blog post in item #3 where Michael states HFS+ uses none of them exactly but a variant of form D. If that's the case wouldn't none of the four match? I have just written a method to rescan the directory and give me all the filenames (which are actually directories since UIManagedDocument creates a directory for each file and stores the actual sqlite file in a subdirectory). The person using the app (with Arabic file names on iOS 10.3) who reported the problem to me stated that any new files created with Arabic names on iOS 10.3 worked. It's only the files created on iOS 10.2 and converted from HFS+ to APFS that can't be opened (i.e., found).

And yes, @Thomas, I agree it should be fixed at the lowest API level that takes a Unicode string and converts it to however the filesystem expects to see that name. At some level, it might expect just a sequence of bytes (i.e., maybe the open function passes a C array of bytes (i.e., C char type) as the file name), but I think any level above it that accepts Unicode strings should pick one conversion from Unicode and do it for you.

Michael Tsai

March 25, 2017 10:08 PM

@Dave Instead of normalizing your Core Data string and looking for it in the file system, you should read the directory contents from the file system, normalize those strings, and compare them to the normalized Core Data string. When you find a match, find the corresponding unnormalized filename (that you got from NSFileManager) and use it to access the file. That way your code doesn’t care what the file system actually stores. I hope that’s clearer?

Dave Reed

March 26, 2017 8:58 AM

@Michael, yes, that makes more sense. thanks.

Tony

March 26, 2017 5:43 PM

The current (HFS+) situation on macOS, as I understand it, is that the -fileSystemRepresentation produces a particularly significant UTF-8 representation of any string (file name). This is HFS+'s canonical form: nothing magic but always the same choice of options for representing awkward accented characters etc. By experiment, that same form seems to be what comes back from the filing system. Hence, comparing file names in this format should give the same answers as HFS+ does. At least, that's my working hypothesis. To that extent, at a given level, even HFS+ uses a bag of bytes for file names (but that's actually obvious and common to all filing systems).

The thing I still don't understand is the case-sensitive bit. That implies a different comparison algorithm from today's devices and, what's more, allows several files with what today's systems see as the same name (as stated in the original item). How is this intended to interwork? With different OS versions or between iOS and macOS devices sharing files on iCloud Drive etc.

Michael Tsai

March 26, 2017 8:04 PM

@Tony That may be, but I don’t think the documentation actually promises that. It just says that it gives you a valid C string that you can pass to the POSIX APIs. And my understanding is that you don’t have to pass normalized strings to those APIs because the lower layers do that. So maybe -fileSystemRepresentation is currently doing the same thing, but redundantly.

iOS has always supported multiple files that the Mac would see as having the same name (because they only differ in case). I’m not sure how iCloud Drive and other services treat that.

norio_nomura

March 27, 2017 12:51 AM

Thank you for article.
After verifying the behavior of APFS with iOS 10.3, I found a bug in iOS 10.3 resulting from this behavior.
http://www.openradar.me/radar?id=4936770785378304

Clark

March 27, 2017 1:57 PM

Hmm. That's my real worry. Not that my code will break but that Apple's code will break.

has

March 27, 2017 2:17 PM

@Clark: Inconceivable!

eeeeee

March 27, 2017 4:20 PM

As far as I know, the normalization was only done on OS X (because of case insensitivity), and never done on iOS.

They planned to abandon case insensitivity on OS X for a decade, but it broke too many apps.

PS: Contrary to common belief, NTFS and Windows internally are not case insensitive. Only the Win32 API is emulating case insensitivity.

Jean-Daniel

March 28, 2017 7:57 AM

@Dave Reed: You can maybe get the exact string expected by the FS by using the CFString API with 'kCFStringEncodingMacHFS'. But this is just a guess, I'm not even sure about what kCFStringEncodingMacHFS does.

Jean-Daniel

March 28, 2017 7:59 AM

Also, the HFS decomposed format is vaguely documented here: https://developer.apple.com/library/content/qa/qa1173/_index.html

Dave Reed

March 28, 2017 10:16 AM

Thanks for the link @Jean-Daniel. I believe I've worked around it (update to my app is waiting for review) as Michael suggested (getting an array of filenames from the filesystem, applying -decomposedStringWithCanonicalMapping to them and to the string I got from the user and stored, and finding the match). Also, to anyone using ZipArchive (https://github.com/ZipArchive/ZipArchive), they use -UTF8String which seems to break when trying to unzip zip files on APFS that were created on HFS+. Using -fileSystemRepresentation seems to work around that. I've posted an issue on GitHub for them so someone with more expertise than me can check it.

Tony

March 28, 2017 5:00 PM

@eeeeee Normalisation (to a canonical form) is about a number of Unicode issues as well as about case (insensitivity) matching. The usual example is that many accented characters can be coded as precomposed - the character is an accented character - or use two characters - the plain letter followed by a 'composing' accent character that, when displayed, adds to the character to produce the same appearance. The filing system matching needs to meet the user expectation that these two indistinguishable representations are treated as matching. That's a Mac OS issue and I can't see how iOS avoids it too.

@Jean-Daniel The canonical decomposition is also alluded to in Apple TN1150 (HFS Plus Volume Format) (https://developer.apple.com/legacy/library/technotes/tn/tn1150.html). That in turn links to Unicode Decomposition table but that is impenetrable; some of the info is likely to be obscure, the example quoted concerning substitution of an illegal representation of some Korean Hangul characters (no - me neither). The TN does describe the matching algorithm and how case insensitivity is handled (it's the first operation, unsurprisingly).

@Dave Reed I realise that I misread your original point 8 where you said "More generally, once APFS is deployed users can legitimately end up with multiple files in the same folder whose names only differ in normalization.". If it's not too late, I don't think that's possible. The canonical form is about a unique representation of names that can be represented differently but, critically, still appear identical to the user (see first part of this comment). I think the case you cite could only arise if there was no single (canonical) representation for the filing system: the act of normalising any but the first will show the file name as in use (oh dear, I hope I'm not going in circles here).

@Michael Tsai I agree that the description for -fileSystemRepresentation is less than definitive but it's hard to put a different meaning on it. The method description says "Returns a C-string representation of a given path that properly encodes Unicode strings for use by the file system." and "Use this method if your code calls system routines that expect C-string path arguments.". I also feel that there's a hint in the name ;-). I have experimented with (and used) the call and it does produce a coding of the right nature but I too would have liked a simple statement that this is the one true encoding. The description of the same method for NSURL tantalisingly says "The file system representation format is described in File Encodings and Fonts" but I can't find that reference (and it's not a link).

Michael Tsai

March 28, 2017 6:44 PM

@Tony Yes, but see the quote from former Apple file system engineer Drew Thaler, where he says that it gets renormalized inside the kernel anyway. He says -fileSystemRepresentation returns NFD, but it’s not clear to me whether that’s the special HFS one or not. However, even if it is, I think one would still want to use one of the regular normalization methods in some cases.

eeeeee

March 29, 2017 3:05 PM

@Tony: I know what this is about. The Unicode normalization was introduced /because of/ the case insensitivity. It was a bad idea from the start, and Apple was trying to end this for a decade now. Even the ZFS plans failed because of this issue. Major apps like Adobe Photoshop even failed to work on case sensitive HFS+.

And the first APFS problems are being reported already:

http://www.openradar.me/radar?id=4936770785378304

Dave Reed

March 30, 2017 8:15 AM

I received a reply from DTS with lots of information and references. The suggestion for me was to store both the name the user enters and a filename that won't have the issue in my plist file with the list of "courses" (I assume they mean use an ASCII name such as a GUID). The other suggestion was to iterate over the files and apply the same decomposition to the filename and the user-entered string to find the match - that's what I did over the weekend and the update is now available.

Tony

March 31, 2017 12:58 PM

@eeeeee When I said that "Normalisation (to a canonical form) is about a number of Unicode issues as well as about case" I was leaning on the Apple HFS+ document that I referenced. This says:
"Unicode allows some sequences of characters to be represented by multiple, equivalent forms. For example, [snip about accented characters]. To reduce complexity in the B-tree key comparison routines (which have to compare Unicode strings), HFS Plus defines that Unicode strings will be stored in fully decomposed form, with composing characters stored in canonical order." Note that this can be read as the B-tree comparison being done on bags of bytes (very sensible at that level, IMHO).

Michael Tsai - Blog - APFS to Add Case-Insensitive Variant for Mac

March 31, 2017 2:26 PM

[…] the normalization issues that I raised last […]

Thomas Tempelmann

April 5, 2017 3:08 AM

I just talked to someone who had run into the same issue with his iOS app. And now I believe I fully understand what happened:

A file with the name "Zürich" had been created pre-iOS 10.3. First, the name was determined, e.g. from user input, and stored in precomposed form in a file or a database. Then a file based on that name was created on the HFS+ volume. Per HFS+ behavior, the name was stored decomposed. Then the user installed iOS 10.3 and the volume was converted to APFS, with the name preserved in its binary form. Now the same app runs again and wants to open the file that it once created, and that fails: even though the file was originally created with the precomposed name, it now has a slightly different name, i.e. the decomposed one, and APFS no longer forgives that difference.

The worst part of it is that you could think you did everything correctly pre-10.3. But what you would have had to do, and which NO ONE would have thought of, was to fetch the actual name back from disk after creating the file, and store that in the database, not the one you used to create the file. Only then would you end up with the decomposed name that APFS uses now.

An impossible situation, created by Apple without any warning ahead of it. Even now, I bet, they don't tell programmers that pitfall.

The result is that anyone who wrote iOS apps that created files whose names could contain non-ASCII characters, and who remembers those names in a file or database, now needs to explicitly call a normalization function to access them. That hasn't been clearly communicated yet, has it?
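To make the mismatch concrete, here is a small Python sketch (an illustration, not code from any of the apps discussed) comparing the two spellings of the same visible name. Standard NFD is assumed as an approximation of what HFS+ stored, and the variable names are made up for the example.

```python
import unicodedata

# Precomposed "ü" (U+00FC), as typically typed or stored in a database.
typed = "Z\u00fcrich"
# "u" followed by a combining diaeresis (U+0308), roughly what HFS+ wrote to disk.
on_disk = unicodedata.normalize("NFD", typed)

typed_bytes = typed.encode("utf-8")
disk_bytes = on_disk.encode("utf-8")

# A bag-of-bytes file system compares the raw bytes, which differ.
same_bytes = typed_bytes == disk_bytes
# Normalizing both sides the same way makes the names compare equal again.
same_after_normalizing = unicodedata.normalize("NFD", typed) == on_disk

print(typed_bytes, disk_bytes, same_bytes, same_after_normalizing)
```

The two byte strings have different lengths (7 versus 8 bytes), which is why a lookup with the stored name misses the migrated file.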

Thomas Tempelmann

April 5, 2017 4:09 AM

Damn, as usual, I wrote too soon. The scenario I described is what I assumed the person I spoke to had run into, but in fact he had trouble creating new files through UIManagedDocument, just like David did. In David's case it was apparently as I described in my previous post here, i.e. names migrated from HFS+ to APFS could no longer be found, whereas the person I just mentioned had the problem with newly created files, independent of any previous HFS+ migration. That is even more worrisome if it turns out to be a general problem (but maybe not, or we'd see more of this being brought up). I wonder if I'm just behind on all this or if this is really still that unclear even to those more involved with it.

Michael Tsai

April 5, 2017 9:12 AM

@Thomas Your first scenario sounds right to me. For those who want to fix their files/databases, you can get the name in the file system using NSURLNameKey. I’m not sure what’s going on with not being able to create new files after already migrating to APFS. In that case, it shouldn’t matter what normalization you use so long as you’re consistent. Maybe some other app or system code is inadvertently converting the string?
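For fixing up a stored list of names, the idea of reading the real names back from the file system might look like the sketch below. It is written in Python for illustration only. `repair_stored_names` is a hypothetical helper, and `names_on_disk` stands for whatever the directory listing returns (in Cocoa, the NSURLNameKey values mentioned above).

```python
import unicodedata


def repair_stored_names(stored_names, names_on_disk):
    """Replace each stored name with the exact spelling found on disk, if any."""
    # Index the on-disk names by a normalized key (standard NFD is an assumption).
    # If two on-disk names share a key, the later one wins; a real fix would
    # need to detect and handle that collision.
    by_key = {unicodedata.normalize("NFD", n): n for n in names_on_disk}
    # Names with no counterpart on disk are left unchanged.
    return [by_key.get(unicodedata.normalize("NFD", s), s) for s in stored_names]
```

After such a pass, the stored names are byte-for-byte what the file system holds, so later lookups no longer depend on which normalization the app happens to use.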

Tony

April 5, 2017 10:43 AM

I tried an experiment on iOS 10.3.1 and Pages (the version current as of today) on iPad. I turned off iCloud Drive so that I was saving documents onto the iPad. I created a trivial document and saved it as "vvvv", duplicated it and tried to save the second document as "VVVV". To my surprise, Pages (in document manager) reported that the name was already in use. If APFS is case-insensitive, where is this check being made?

Is there perhaps some internal Apple convention about the use of intermediate (Cocoa?) layers that hasn't quite been communicated to developers? BTW, my personal concern lies in preparing for APFS on macOS. I'm just a user of iOS, so I should welcome this consistency of behaviour.

Jürg

April 7, 2017 10:46 AM

That's my real worry. Not that my code will break but that Apple's code will break.

Here's what Apple's Documentation says:

To avoid introducing bugs in your code with mismatched Unicode normalization in filenames:

Use high-level Foundation APIs such as NSFileManager and NSURL when interacting with the filesystem
Use the fileSystemRepresentation property of NSURL objects when creating and opening files with lower-level filesystem APIs such as POSIX open(2), or when storing filenames externally from the filesystem

The UIManagedDocument (using CoreData) issue mentioned above boils down to this:

NSString *fileName = @"Zürich";
NSURL *fileURL = [folder URLByAppendingPathComponent:fileName];
self.document = [[MyUIManagedDocumentSubclass alloc] initWithFileURL:fileURL];
[self.document saveToURL:self.document.fileURL forSaveOperation:UIDocumentSaveForCreating completionHandler:^(BOOL success) {
    // (handler body omitted)
}];

=> crash
Why? UIManagedDocument creates the directory: /path/to/Zürich, but is unable to locate a required subdirectory (containing the CoreData db) /path/to/Zürich/StoreContent/MyMobileStore (on iOS 10.3 with APFS)

According to the docs, this should not be an issue, as only "high-level APIs" are being used.
So is this "my code that broke" or Apple's UIManagedDocument framework that broke?

Either way, the workaround is:

NSString *fileName = @"Zürich";
const char *fileNameFSR = [fileName fileSystemRepresentation];
fileName = [[NSFileManager defaultManager] stringWithFileSystemRepresentation:fileNameFSR length:strlen(fileNameFSR)];

Michael Tsai

April 7, 2017 11:01 AM

@Tony That is very interesting because APFS on iOS is supposed to be case-sensitive.

@Jürg We don’t know whose bug it is because Apple hasn’t actually documented how it’s supposed to work. And this confirms what I said in the follow-up post, which is that Apple’s documented advice doesn’t actually solve the problems. Your workaround addresses the case of the file not already existing, but others have reported that using -fileSystemRepresentation doesn’t help in the case of a pre-existing file and a migrated file system. I think you would need to actually read the file system and figure out which name is in use in order to know what to pass to MyUIManagedDocumentSubclass.

Tony

April 7, 2017 1:27 PM

@Michael My bad, I did of course mean case-sensitive ... which is why it was surprising. Oops.

For those who are interested, there are some experimental results on an APFS case-insensitive (and "not normalization-sensitive") volume here: https://eclecticlight.co/2017/04/07/apfs-and-macos-10-13-many-apps-and-tools-will-need-to-be-revised/
This uses the updated APFS in Sierra 10.12.4. I, at least, hope that this will be the version used in future macOS releases.

iOS App Fulfillment Issue Explained

May 4, 2017 11:04 AM

[…] https://mjtsai.com/blog/2017/03/24/apfss-bag-of-bytes-filenames/ […]

APFS: Mehr freier Speicher, vereinzelt Probleme mit Umlauten › iphone-ticker.de

May 25, 2017 4:26 PM

[…] In this regard, however, Apple must accept the criticism of not having adequately informed or instructed developers in advance […]

Michael Tsai - Blog - APFS Native Normalization

June 27, 2017 2:02 PM

[…] iOS transition to APFS seems to have gone very smoothly except for some Unicode normalization issues. Apple never really explained to developers how they could make their code work properly, most were […]

Apple File System - Wikipedia - SAPERELIBERO

October 8, 2021 11:28 PM

[…] (EN) David Reed, APFS’s “Bag of Bytes” Filenames, on mjtsai.com, 24 April […]

David

July 15, 2022 11:41 AM

For anyone else searching for it (https://mjtsai.com/blog/2017/03/24/apfss-bag-of-bytes-filenames/#comment-2697839):
The "File Encodings and Fonts" reference only seems to be available in the Web Archive, here:

https://web.archive.org/web/20041013053848/http://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/index.html
