Thursday, March 21, 2019

UTF-8 String in Swift 5

Michael Ilseman:

Switching to UTF-8 fulfills one of String’s long-term goals to enable high-performance processing, which is the most passionate request from performance-sensitive developers. It also lays the groundwork for providing even more performant APIs in the future. String’s preferred encoding is baked into Swift’s ABI for performance, so it was imperative that this switch happen in time for ABI stability in Swift 5.

[…]

Swift 5, like Rust, performs encoding validation once on creation, when it is far more efficient to do so. NSStrings, which are lazily bridged (zero-copy) into Swift and use UTF-16, may contain invalid content (i.e. isolated surrogates). As in Swift 4.2, these are lazily validated when read from.

This sounds great, as I’ve run into problems in Objective-C where strings that are not valid Unicode would cause strange failures a layer or two below my code. I don’t see any documentation of what happens when validation fails, but my guess from the code is that it repairs the string using replacement characters. That makes sense given the cases I’ve seen: set one bad attribute on a managed object, and the entire context fails to save. If validation were eager, maybe I could do better than replacement characters at the point of creation (assuming I’m even creating the strings myself). But when the string is only validated lazily, long after it was created, I don’t think there’s much to be done. It’s not worth risking data loss for the common case where the developer hasn’t anticipated this happening and written code to fix the strings.
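To make the replacement-character guess concrete, here is a minimal sketch using the standard library’s `String(decoding:as:)` initializer, which repairs ill-formed input at creation time. It does not exercise the lazy NSString bridging path described above, so treat it as an illustration of the repair behavior rather than proof of what bridging does. The variable names are mine.

```swift
// 0xFF can never appear in well-formed UTF-8.
let badUTF8: [UInt8] = [0x48, 0xFF, 0x69]            // "H", invalid byte, "i"
let repairedFromUTF8 = String(decoding: badUTF8, as: UTF8.self)

// 0xD800 is a high surrogate with no low surrogate after it (an isolated surrogate).
let badUTF16: [UInt16] = [0x0048, 0xD800, 0x0069]    // "H", isolated surrogate, "i"
let repairedFromUTF16 = String(decoding: badUTF16, as: UTF16.self)

// Inspect the resulting scalars; the invalid unit becomes U+FFFD (65533).
let scalarsFromUTF8 = repairedFromUTF8.unicodeScalars.map { $0.value }
let scalarsFromUTF16 = repairedFromUTF16.unicodeScalars.map { $0.value }
print(scalarsFromUTF8)   // [72, 65533, 105]
print(scalarsFromUTF16)  // [72, 65533, 105]
```

Once the string has been repaired, the original bad code unit is gone, which is why there is little left to do at read time.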

As mentioned above, Swift 5 switches from two native storage representations to one. This allows for better analyses and more aggressive optimizations with fewer potential code-size or compilation time costs.

For example, inlining is a compiler optimization that can improve run-time performance at a potential cost to code size. In Swift 4.2, most string methods contained a pair of implementations, one for each storage representation. No matter what form a 4.2 string was in, an entire portion of potentially-inlined code wouldn’t even be run; this increases the cost and diminishes the benefits of inlining. Furthermore, the greatest benefits of inlining come from follow-on analyses and optimizations specific to one call-site, which are exponentially more difficult to perform on a dual representation. Swift 5’s unified storage representation is far more amenable to inlining and follow-on optimizations.
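A toy sketch of why dual representations hurt inlining. The types below are hypothetical and have nothing to do with the real standard library internals. I’m also assuming that the two Swift 4.2 representations were ASCII and UTF-16, which the quote does not spell out.

```swift
// Hypothetical 4.2-style storage: two native representations.
enum DualStorage {
    case ascii([UInt8])
    case utf16([UInt16])

    // Every method carries one implementation per representation.
    // Whichever branch is not taken is dead weight at an inlined call site.
    func countOfSpaces() -> Int {
        switch self {
        case .ascii(let units):
            return units.filter { $0 == 0x20 }.count
        case .utf16(let units):
            return units.filter { $0 == 0x20 }.count
        }
    }
}

// Hypothetical Swift 5-style storage: one representation, one code path.
struct UnifiedStorage {
    var utf8: [UInt8]

    func countOfSpaces() -> Int {
        return utf8.filter { $0 == 0x20 }.count
    }
}

let text = "a b c"
let dualCount = DualStorage.utf16(Array(text.utf16)).countOfSpaces()
let unifiedCount = UnifiedStorage(utf8: Array(text.utf8)).countOfSpaces()
print(dualCount, unifiedCount)
```

Both versions compute the same answer. The difference is that the unified version has a single body for the optimizer to inline and specialize.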

Michael Ilseman:

String remembers performance-relevant information about its contents through the use of performance flags.

For example, a String that is known to be all-ASCII has a trivial UTF8View, UTF16View, and UnicodeScalarView. Also, mapping offsets between the two code unit views is trivial, so there is no need for any bookkeeping as part of Cocoa interop.
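The all-ASCII case can be observed from the public API, even though the performance flags themselves are internal. In this sketch, `isAllASCII` is a helper I made up that scans the bytes. It is not how String tracks the flag.

```swift
// Hypothetical helper: checks for ASCII by scanning, unlike String's stored flag.
func isAllASCII(_ s: String) -> Bool {
    return s.utf8.allSatisfy { $0 < 0x80 }
}

let ascii = "hello"
let nonASCII = "h\u{E9}llo"   // U+00E9 takes two UTF-8 code units, one UTF-16 code unit

// For ASCII, every view has the same length, so offsets map one-to-one.
print(ascii.utf8.count, ascii.utf16.count, ascii.unicodeScalars.count)
// For non-ASCII, the UTF-8 and UTF-16 views diverge.
print(nonASCII.utf8.count, nonASCII.utf16.count, nonASCII.unicodeScalars.count)

// Take the index three UTF-8 code units in, then measure it in UTF-16 code units.
let asciiIndex = ascii.utf8.index(ascii.utf8.startIndex, offsetBy: 3)
let asciiUTF16Offset = ascii.utf16.distance(from: ascii.utf16.startIndex, to: asciiIndex)

let nonASCIIIndex = nonASCII.utf8.index(nonASCII.utf8.startIndex, offsetBy: 3)
let nonASCIIUTF16Offset = nonASCII.utf16.distance(from: nonASCII.utf16.startIndex, to: nonASCIIIndex)

print(asciiUTF16Offset, nonASCIIUTF16Offset)
```

For the ASCII string the UTF-8 offset and the UTF-16 offset are the same number. For the non-ASCII string they differ, which is the bookkeeping that Cocoa interop otherwise has to do.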

Previously: String’s ABI and UTF-8.
