Thursday, January 19, 2023

Folders With High File Counts

Mike Bombich (Hacker News):

Any time a folder has more than a few thousand items in it, the filesystem is going to be a lot slower when working with that folder. Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts, so trivial tasks like that will take progressively longer as the file count increases. Gathering the enormous file list will also take progressively longer as the list gets larger.

[…]

Last week, one of our users found the task as shown above. Upon closer analysis, we determined that the “media” folder had 181,274 files in it. In other words, more than 10% of the files on the whole startup disk were in that “media” folder. In extreme cases like this, the delay to retrieve a file list can be so long (i.e. longer than 10 minutes) that the task aborts with an error[…]

[…]

For a contrasting example, consider how Mail organizes a potentially astronomic list of files. If you navigate to the hidden Library folder in your home folder, then to Mail > V10 > {any UUID} > {any mailbox} > {another UUID} > Data, you’ll see folders named by number, four layers deep, until you finally get to a Messages folder with actual files in it.

Apple should also employ this technique for Core Data external storage and Spotlight temporary files.
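To sketch the idea (a toy illustration, not Mail's actual naming scheme): spread the files across a few levels of small numeric folders so that no single folder ever holds more than a modest number of items.

```swift
import Foundation

// Toy sketch: map an item number to a nested numeric folder path so that
// no single folder ever holds more than `fanout` entries. The folder
// names, depth, and fan-out are invented; Mail's actual scheme differs.
func shardedFolder(for index: Int, under root: URL,
                   fanout: Int = 1000, depth: Int = 3) -> URL {
    var digits: [Int] = []
    var remaining = index
    for _ in 0..<depth {
        digits.append(remaining % fanout)
        remaining /= fanout
    }
    var url = root
    for digit in digits.reversed() {
        url.appendPathComponent(String(digit), isDirectory: true)
    }
    return url
}

// Item 181,274 lands in <root>/0/181/274, so each folder along the way
// stays small even when the total file count is large.
let root = URL(fileURLWithPath: "/tmp/ShardedStore")   // hypothetical location
let folder = shardedFolder(for: 181_274, under: root)
try? FileManager.default.createDirectory(at: folder, withIntermediateDirectories: true)
```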



What's slow, enumeration or search? Linux filesystems have long used b-tree directory indexes precisely to solve the problem of fast entry lookup. Is it really not done on APFS?


> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts, so trivial tasks like that will take progressively longer as the file count increases

Why does this need to be a slow process?

Just to take a trivial example: databases tackle this exact problem all the time, with many more records than any file system will ever see.

> Mail > V10 > {any UUID} > {any mailbox} > {another UUID} > Data

It's like the Mail folks have just figured out a rudimentary indexing system. What stops the file system designers from doing this invisibly behind the scenes (or something much smarter)?


What about Photos? These swell to several hundred gigabytes.
Photos, thumbnails, etc. under one folder.
Ticking time bomb?


Mail specifically changed its data storage to this subfolder scheme in 10.7 to speed up Time Machine backups. Time Machine uses directory hardlinks for efficiency, but if one file in a directory changes, it has to copy/hardlink every item in that directory, since the top-level hardlink can no longer be used. Thus, in a bad case, one new mail message could result in backup operations for a million messages. By adding a few layers of nesting, you can significantly reduce the amount of backup work.
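To put rough numbers on that (these folder sizes are invented, just to show the scaling):

```swift
// Invented numbers, only to illustrate the scaling argument above.
let totalMessages = 1_000_000
let fanout = 100     // items per folder at each level of the nested layout
let depth = 3        // levels of numeric subfolders (100^3 = 1,000,000 leaves)

// Flat layout: one changed message invalidates the single huge directory,
// so every entry has to be copied/re-hardlinked into the new snapshot.
let flatWork = totalMessages

// Nested layout: only the folders on the path to the changed message are
// recreated; each of those re-links about `fanout` entries, and every
// untouched sibling subtree is reused via a single directory hardlink.
let nestedWork = (depth + 1) * fanout

print(flatWork, nestedWork)   // 1000000 vs. 400
```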

I personally would have liked to see changes in the file system or Time Machine instead, to better support large folders, since any application by any developer (or even a user) that writes a lot of files could fall victim to this problem. However, I wasn't in a position at the time to push for this.


@Rob Yeah, it seems like the B-Tree part in the filesystem shouldn’t care that much for lookups. I’ve mostly noticed the slowness with listing a directory. That doesn’t seem like it should be more than linear, but I presume that somehow the higher layers are creating big lists of file records and copying them repeatedly or something. I don’t understand why it is so slow to enumerate not-that-many entries compared with what my apps deal with using SQLite or their own file formats, but it is.

@Del I don’t have a lot in Photos so I don’t know how its library scales, but it seems to have one folder of originals per hex digit. So it’s not all in one folder, but they should add multiple levels like with Mail if it doesn’t already do that.

@Jonathan Yeah, Time Machine was the impetus, but it also helped with browsing the directories. Oddly, I don’t remember that being particularly bad with the prior single-file-per-message folder structure from 10.4. I wonder if APFS is when the listing got slow.


I suppose this is made complicated by case-insensitive comparisons and Unicode equivalences. Nothing that can't be solved by indexing the collation elements for each file name, but I have to suppose they aren't doing that.
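For example (just a sketch using Foundation's string folding, not what APFS actually stores on disk):

```swift
import Foundation

// Sketch: reduce a file name to a canonical key so that names which are
// "the same" under case-insensitive, normalization-insensitive comparison
// map to the same key. An index over such keys makes conflict checks a
// lookup instead of a scan. Not APFS's actual on-disk representation.
func collationKey(for fileName: String) -> String {
    fileName
        .decomposedStringWithCanonicalMapping              // canonical (NFD) form
        .folding(options: [.caseInsensitive], locale: nil) // case-fold
}

// A precomposed "é" and "e" + combining accent, with different casing,
// still collide on the same key, so the duplicate is found.
print(collationKey(for: "Caf\u{00E9}.txt") == collationKey(for: "cafe\u{0301}.TXT"))  // true
```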


About B-Trees:
APFS, like HFS before it, uses B-Trees.

With HFS, the catalog gathers all entries of a given directory together in clusters, with the file/folder names ordered alphabetically. With that, an insertion does not need to scan ALL entries, as Mike B suggested, but can perform a binary search. Insertion is then also rather fast because of the B-Tree's nodes, often writing only a single block back to disk.

APFS is a bit different:

It uses two B-Trees, where one has pointers into the other. That makes it slower than HFS in some situations (that's the main reason you don't want to use APFS on hard disks: the doubled number of seek operations is quite noticeable on mechanical devices but not at all on an SSD).

Apart from that, the key for a tree node entry encodes not only the directory ID and the file name (which are ordered) but also some extra data for dealing with Unicode intricacies. It combines three parts: the directory ID, a hash code of the file name, and finally the file name itself ("j_drec_hashed_key_t"). I believe (but have not verified) that this extra hash code, which may appear rather random, means the file names are not ordered alphabetically, because the hash code takes precedence; but a lookup should still be doable as a binary search, because once you know the name you want to look up, you also know its hash code.
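Roughly, the ordering could be modeled like this (the field names and comparison are illustrative assumptions on my part, not the exact on-disk structures):

```swift
// Sketch of how a hashed directory-record key can order entries: first by
// the parent directory's ID, then by a hash of the (normalized) name, and
// only then by the name itself. Field names are illustrative, not APFS's.
struct DirectoryRecordKey {
    let parentID: UInt64
    let nameHash: UInt32   // derived from the normalized file name
    let name: String
}

// The resulting order is not alphabetical (the hash dominates), but a
// lookup is still a binary search: given a name, recompute its hash,
// build the key, and search the sorted records for it.
func orderedBefore(_ lhs: DirectoryRecordKey, _ rhs: DirectoryRecordKey) -> Bool {
    (lhs.parentID, lhs.nameHash, lhs.name) < (rhs.parentID, rhs.nameHash, rhs.name)
}
```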

So, I don't think it's a problem with APFS (although, it would be nice if someone did a similar test with HFS+), but it's the Finder.

The Finder is what really sucks at handling larger numbers of items in a folder.

Since I write a program (Find Any File) that imitates some of the features of the Finder, i.e. listing file system items in folders, I can confirm that large amounts cause quite a few complications, especially if you want the program (Finder, FAF) to update its shown contents when the folder contents change. And the Finder is very inefficient at fetching metadata (e.g. tags, and especially the icons of applications): my tests a few years ago with the "fs_usage" command showed that the Finder repeatedly fetches the same data from the lowest file system layer when its NSURL objects should have cached it all at a higher level. Also, sorting 200k items becomes a bit slow. But nothing should justify the long wait times the Finder has, even at those numbers. FAF, for instance, can still handle 200k items with little stalling.
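For what it's worth, the usual way to avoid those repeated fetches is to request the metadata up front when enumerating. A minimal sketch, with an arbitrary set of keys:

```swift
import Foundation

// Sketch: list a folder and prefetch the needed metadata in one pass, so
// later reads of those resource values come from the URL's cache instead
// of hitting the file system again for every item. The key list and the
// folder path are arbitrary; a real lister would request what it displays.
let folder = URL(fileURLWithPath: "/Applications")
let keys: [URLResourceKey] = [.isDirectoryKey, .fileSizeKey, .contentModificationDateKey]

let items = (try? FileManager.default.contentsOfDirectory(
    at: folder,
    includingPropertiesForKeys: keys,
    options: [.skipsHiddenFiles])) ?? []

for item in items {
    if let values = try? item.resourceValues(forKeys: Set(keys)) {
        print(item.lastPathComponent, values.fileSize ?? 0)
    }
}
```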


Ugh, I should have read Mike's article first. Part of my previous reply does not apply, namely the part about the Finder. What Mike describes does not involve the Finder but his own software, it seems.


@Thomas My experience with listing large APFS directories is that sampling shows that the delay is in the filesystem call, not (primarily) at the Finder level. This is consistent with it also affecting SuperDuper (fts) and Terminal (ls). (This is all on an Apple internal SSD, so it shouldn’t be waiting much on the storage device.)
