Thursday, February 23, 2023

Speeding Up Scanner in Swift

My first tip goes back to when I started using NSScanner in the Puma days. In short, you should never call scanCharacters(from:into:) in a loop because every time it’s called it creates an inverted copy of the character set. It then delegates to NSString.rangeOfCharacter(from:options:range:), passing that copy. The documentation contains the cryptic comment:

Using the inverse of an immutable character set is much more efficient than inverting a mutable character set.

But my experience is that it’s not fast with immutable character sets, either. It seems like there should be an NSCharacterSet subclass that flips the membership of another object. Then each character set could store its own inverse with minimal overhead and just return the same one each time. But there’s apparently no such optimization, so I recommend calling inverted yourself, storing the result, and then using scanUpToCharacters(from:into:), which will then use the character set unchanged.
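
As a rough sketch of that recommendation, the following collects runs of digits while inverting the set only once. The helper name digitRuns and the choice of digits are made up for illustration. It also uses the newer String-returning scanUpToCharacters(from:) (macOS 10.15 and later) for brevity, which is an assumption on my part: the sketch does not verify that this variant shares the internals of the from:into: methods discussed above.

```swift
import Foundation

// Compute the inverse once and keep it around, rather than letting
// each scan build its own inverted copy.
let digits = CharacterSet.decimalDigits
let nonDigits = digits.inverted

// Hypothetical helper: returns every run of digits found in `text`.
func digitRuns(in text: String) -> [String] {
    let scanner = Scanner(string: text)
    scanner.charactersToBeSkipped = nil   // no implicit whitespace skipping
    var runs: [String] = []
    while !scanner.isAtEnd {
        // "Scan up to a non-digit" is equivalent to "scan a run of digits",
        // but the cached inverse is passed in as-is.
        if let run = scanner.scanUpToCharacters(from: nonDigits), !run.isEmpty {
            runs.append(run)
        }
        // Skip the following run of non-digits; `digits` is used unchanged.
        _ = scanner.scanUpToCharacters(from: digits)
    }
    return runs
}
```

Calling digitRuns(in: "12ab345,6") should yield the three digit runs in order. The point is only that both sets are created once, outside the loop.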

Even this is very slow when calling from Swift, though. Whenever you call scanUpToCharacters(from:into:) with a CharacterSet, it calls CharacterSet._bridgeToObjectiveC(), which calls __CFCharacterSetCreateCopy(), which again makes an expensive copy. (I have been doing a lot of profiling but somehow didn’t notice this until Ventura, so I wonder whether something changed there.) In any case, currently CharacterSet does not bridge efficiently like Data and String do.

My first try at working around this was to do the bridging up front:

let fast = characterSet as NSCharacterSet

and then pass the same NSCharacterSet, which should bridge cheaply, each time. But this didn’t help.

What did work was to create an NSCharacterSet directly:

let fast = NSCharacterSet(bitmapRepresentation: characterSet.bitmapRepresentation)

With that change, the bridging overhead goes away. Scanner is still not particularly fast, though. Maybe this will improve with the forthcoming Swifty Foundation, or I may end up writing a replacement for just the few cases that I need that works directly on Swift strings.
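
Put together, the pattern might look like the sketch below. The names separators, fastSeparators, and fields are hypothetical, and passing the stored set back to Scanner with as CharacterSet is an assumption about how the workaround is wired up. It also uses the String-returning scanner methods from macOS 10.15 and later. Whether the speedup carries over to a given OS version is something to confirm in a profiler.

```swift
import Foundation

let separators = CharacterSet(charactersIn: ",;")

// Built once from the bitmap and kept for reuse across scans.
let fastSeparators = NSCharacterSet(bitmapRepresentation: separators.bitmapRepresentation)

// Hypothetical helper: splits `text` into non-empty fields.
func fields(in text: String) -> [String] {
    let scanner = Scanner(string: text)
    scanner.charactersToBeSkipped = nil
    var result: [String] = []
    while !scanner.isAtEnd {
        // Assumption: the stored NSCharacterSet is handed over via a cast.
        if let field = scanner.scanUpToCharacters(from: fastSeparators as CharacterSet),
           !field.isEmpty {
            result.append(field)
        }
        // Consume one separator, if any, so the loop always makes progress.
        _ = scanner.scanCharacter()
    }
    return result
}
```

The design choice is simply to pay for the NSCharacterSet construction once, at setup time, instead of on every scan.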


Update (2023-02-24): Another point to be aware of is that the documentation implies that the caseSensitive option applies to scanCharacters(from:into:). That method does pass the option into NSString.rangeOfCharacter(from:options:range:), but rangeOfCharacter(from:options:range:) is documented to ignore that flag, and in fact it does. So caseSensitive only actually applies to the Scanner methods that take strings.
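
A small way to check this yourself is sketched below. The function name scanResults is made up, and it uses the String-returning scanCharacters(from:) and scanString(_:) from macOS 10.15 and later, on the assumption that they behave like the from:into: variants in this respect.

```swift
import Foundation

// Hypothetical helper: runs a character-set scan and a string scan
// against uppercase input, using lowercase targets.
func scanResults(caseSensitive: Bool) -> (characters: String?, string: String?) {
    let lowercaseABC = CharacterSet(charactersIn: "abc")

    let setScanner = Scanner(string: "ABC")
    setScanner.caseSensitive = caseSensitive
    let characters = setScanner.scanCharacters(from: lowercaseABC)

    let stringScanner = Scanner(string: "ABC")
    stringScanner.caseSensitive = caseSensitive
    let string = stringScanner.scanString("abc")

    return (characters, string)
}

let insensitive = scanResults(caseSensitive: false)
let sensitive = scanResults(caseSensitive: true)
```

If the observation holds, the character-set scan gives the same result for both settings, while the string scan succeeds only when caseSensitive is false.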

Rhys Morgan:

swift-parsing from @pointfreeco is a really good library that’s usually faster than Foundation’s Scanner!

Update (2023-03-10): Jonathan Wight:

(NS)Scanner is truly one of the most under appreciated features of Foundation. I use it whenever I need to do structured parsing of text when a simple regex isn’t appropriate (or even possible).

But why limit your Scanning to just Strings?

Here’s my CollectionScanner that can scan any collection of arbitrary elements. Useful if you need to process arrays of data that aren’t necessarily Strings.

Indeed, I’ve found it really useful to have a Data scanner.

3 Comments

FWIW in my testing it doesn't use -[NSScanner scanUpToCharactersFromSet:intoString:] but rather -[NSString rangeOfCharacterFromSet:options:range:].

By the way, there are 2 separate inversions, because there is also -[NSScanner charactersToBeSkipped].

@Jeff Thanks! That’s what I meant to write—post updated. Yes, there are two inversions, but I think it caches the inverted skip set. Not much you can do if you have lots of short-lived scanners, though.

I ran into this hacking on BibDesk 15+ years ago, and I swore off NSScanner because the performance was so awful in unexpected ways. The -[NSScanner scanUpToCharactersFromSet:intoString:] implementation detail of creating an inverted/autoreleased NSCharacterSet caused me to run out of address space back in the 32-bit days, and it's annoying to sprinkle NSAutoreleasePools around when you're using alloc/init/release to avoid the overhead of -autorelease to begin with. I miss Shark, MallocDebug, and OmniObjectMeter.

It was a happy day when I finally discovered CFStringInlineBuffer, which worked great for the cases I was dealing with; reading the CFString source was pretty fascinating, and I found various sneaky ways to optimize things to avoid conversions and copying.
