Thursday, May 11, 2023

Getting Ready for Dataless Files

TN3150:

In a modern file system, a file’s content may not be available locally on the device. A file that contains only metadata is known as a dataless file. The file’s content typically resides on a remote server and is available to people or apps, transparently, when they access the file.

[…]

The system, or a person using the device, can make dataless files whenever they determine it’s appropriate, and your app needs to be ready to handle them. Specifically, avoid unnecessarily materializing dataless files and, when your app requires access to a file’s contents, perform that work asynchronously off the main thread.

[…]

UIDocument and NSDocument automatically access the file system in a coordinated and asynchronous manner.

[…]

If your app or framework uses low-level POSIX APIs to access the file system and you’re unable to migrate to the preferred methods, consider the following two options[…] Be aware that stat and getattrlist both trigger the materialization of any intermediate folders in the file’s path, if they themselves are dataless.

I find this rather confusing. On macOS, it seems like nearly any file could potentially be dataless. It’s less likely for files in Library but probably possible via symlinking. Even an action as simple as checking whether a file exists can now take an unexpectedly long amount of time. This breaks many longstanding assumptions.

If your app deals with user-created files, I guess the best practice is to do everything asynchronously and with file coordination. Without coordination, at least on older systems, you can run into the opposite problem: instead of a slow access to an evicted file, the file might simply stay unmaterialized. So you need to use the special APIs even if you already have your file code on a background thread.

But the NSFileCoordinator APIs are awkward, error-prone, and slow, and they infect your entire codebase. Hopefully you aren’t relying on any cross-platform code that’s not aware of them. And even with Apple-specific code, they make it hard to reuse the same code for working with folders that may or may not contain dataless files.

It all feels shoehorned in, like with the security-scoped URL APIs. Most APIs don’t do the right thing automatically, so you have to wrap uses of them. (But then some other APIs may secretly use coordination, so you have to not use it yourself in order to avoid deadlocks.) Any file-related code could potentially need special handling, but there’s no way to make sure that you didn’t miss a spot somewhere. And once you’ve done all this, your code is much harder to read and much slower for the common case of regular locally stored files.

Update (2023-05-12): Thomas Clement:

Out of curiosity I tried to stat() a non-local file as described in the tech note, but I get a “no such file” error. Same when trying to access it from Terminal. Not sure how we are supposed to test whether a file is dataless then.

Another thing that is not explained is what is the right way to monitor download progress in case the file is dataless.
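A minimal sketch of the check Thomas describes, in C. It assumes the dataless marker is the SF_DATALESS bit in st_flags, which macOS declares in <sys/stat.h>; the excerpt above elides the tech note's exact wording, so treat that as my assumption. The function names are hypothetical, and the fallback #define exists only so the sketch compiles on other systems.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* macOS defines SF_DATALESS in <sys/stat.h>. The fallback value below is an
   assumption that only keeps this sketch compiling on other platforms. */
#ifndef SF_DATALESS
#define SF_DATALESS 0x40000000u
#endif

/* True if an st_flags value has the dataless bit set. */
bool flags_are_dataless(uint32_t st_flags) {
    return (st_flags & SF_DATALESS) != 0;
}

/* Returns 1 if the file at path is dataless, 0 if it is not,
   and -1 if stat() fails (errno is left set by stat). */
int path_is_dataless(const char *path) {
    struct stat st;
    if (stat(path, &st) != 0) {
        return -1;
    }
#ifdef __APPLE__
    return flags_are_dataless(st.st_flags) ? 1 : 0;
#else
    return 0; /* struct stat has no st_flags here, so nothing to check */
#endif
}
```

A return of -1 with errno set to ENOENT would correspond to the “no such file” result he reports, in which case the flag check never gets a chance to run.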

Update (2023-08-10): Howard Oakley:

Over the last couple of weeks I have been exploring how macOS and its features handle dataless files. While apps that take advantage of AppKit’s NSDocument to read and write files should handle these problems seamlessly, there are some definite seams when it comes to macOS services. These result from three constraints:

  • features reliant on the contents of file data can’t be used with dataless files;
  • features reliant on file data stored outside the file aren’t available to other systems accessing that file from iCloud;
  • limitations on the total size of extended attributes in iCloud storage may require some to be removed.

8 Comments

We didn’t have NSFileCoordinator then, but this has been the situation for me since 1999 on Mac OS X DP3, and even longer on other Unix-type systems.

On lease-type network file systems, once the file is leased by the client, subsequent reads from and writes to the data that has been fetched are local-speed, but before that of course it will depend on network and server speeds.

I’ve had my home directory in a network file system (AFS) starting with the first Mac OS X Public beta, so these problems aren’t really new to the platform, but might be for many developers.

@Magnus Yes, it’s just that I think most developers (and Apple, too) ignored the issues before, but now it is much more mainstream.

I presume "dataless" files are those indicated in the Finder with a cloud and a downward arrow?

Do iCloud Drive and OneDrive both use this mechanism? Because they seem to behave differently.

A dataless file from iCloud Drive-backed ~/Documents seems to actually be stored with a prefix `.` and a suffix `.icloud`, and be a binary plist. Finder then transparently resolves that to the non-prefixed and -suffixed filename. It also shows the file size if downloaded, not the size the pseudo-file takes up on disk. Except… this breaks Quick Look, so it isn't really very transparent at all, IMHO. And Finder itself gets confused about this difference. If you drag such a file to Terminal, Finder complains "The file can’t be found."
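The naming scheme described in this comment (a leading `.` plus a trailing `.icloud`) can be sketched as a small string helper in C. This is only an illustration of the observed convention, not a documented API: the function name is hypothetical, and it says nothing about the binary plist inside the placeholder.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Maps a placeholder name such as ".Report.pdf.icloud" to the visible name
   "Report.pdf". Returns false if the name is not in placeholder form or if
   out is too small to hold the result. */
bool icloud_placeholder_to_visible(const char *name, char *out, size_t out_size) {
    static const char suffix[] = ".icloud";
    const size_t suffix_len = sizeof suffix - 1;
    const size_t len = strlen(name);

    /* Need the leading dot, at least one character of real name, and the suffix. */
    if (len <= suffix_len + 1 || name[0] != '.') {
        return false;
    }
    if (strcmp(name + len - suffix_len, suffix) != 0) {
        return false;
    }

    const size_t visible_len = len - suffix_len - 1;
    if (visible_len + 1 > out_size) {
        return false;
    }
    memcpy(out, name + 1, visible_len); /* skip the leading dot */
    out[visible_len] = '\0';
    return true;
}
```

Code that walks a folder and sees such names would have to do this mapping itself, which is part of why the scheme doesn't feel transparent outside of Finder.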

A dataless file from OneDrive, meanwhile, seems to always reside in ~/Library/CloudStorage, and doesn't seem to get a prefix or suffix. (It could also be that marking OneDrive files as non-local is a bit buggy.)

@Sören I think they use “dataless” to refer to the APFS feature, which I guess iCloud Drive is not yet using, but the new third-party cloud file providers are. But I’m not sure. You would think they would first prove the technology internally. Maybe the coordinator performance problems will improve when iCloud Drive switches?

I have not noticed any prefixes or suffixes or any other sort of chicanery when it comes to iCloud Drive.

Maybe I use it differently compared to other people? I only use it for sync. And only for specific files I put in that folder. The poorly explained "Desktop and Documents Folders" is disabled.

I guess, when set up like this, it's basically Dropbox from 2015. So congrats Apple, you managed not to screw up 2015... in 2023.

Yay. 😴

@Mike The iCloud Drive suffixes happen when you have it set to use optimized storage.

Noticed this in passing in launchd: see the MaterializeDatalessFiles option in launchd.plist(5). Apparently you can have control over I/O policy in your launchd jobs, including whether these files are "materialized" on access.
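For code that isn't a launchd job, the in-process counterpart appears to be setiopolicy_np(3). Here is a hedged sketch in C, assuming the IOPOL_TYPE_VFS_MATERIALIZE_DATALESS_FILES policy type and the IOPOL_MATERIALIZE_DATALESS_FILES_OFF value from macOS's <sys/resource.h>. The function name is hypothetical, and the non-Apple branch exists only so the sketch compiles elsewhere.

```c
#include <errno.h>
#include <sys/resource.h>

/* Asks the kernel not to materialize dataless files on behalf of this
   process. Returns 0 on success, -1 on failure with errno set. */
int disable_dataless_materialization(void) {
#if defined(__APPLE__) && defined(IOPOL_TYPE_VFS_MATERIALIZE_DATALESS_FILES)
    return setiopolicy_np(IOPOL_TYPE_VFS_MATERIALIZE_DATALESS_FILES,
                          IOPOL_SCOPE_PROCESS,
                          IOPOL_MATERIALIZE_DATALESS_FILES_OFF);
#else
    /* No such policy on this platform (or in this SDK). */
    errno = ENOTSUP;
    return -1;
#endif
}
```

Calling this once at startup, before any file access, would be the obvious place for something like a disk-wide search tool. How each file API then fails on a dataless file is not covered here and would need testing.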

But, really, maybe re-inventing network filesystems to make them a bit more predictable (because, better or worse, local files are always first-class citizens) is something you do higher up the API food-chain? At least they haven't made POSIX APIs completely unviable, but I'd suggest that this is essentially an Apple platform convenience feature, which on other platforms would just be an implementation detail of the filesystem which processes should be expected to deal with. That they haven't is unsurprising, yet I've never known these on-demand features to ever work predictably or in accordance with the wishes or expectations of users, so perhaps we should just not do that on consumer platforms?

I just added avoidance of dataless files to my scanning and content parsing function in Find Any File, so that, by default, searching on the entire disk won't pull all the offline files in from a server. Problem is that I don't know of a way to test this. Let's hope I can rely on what Apple wrote in TN3150.
