Tuesday, November 6, 2018

String’s ABI and UTF-8

Michael Ilseman:

We just landed String’s proposed final ABI on master. This ABI includes some significant changes, the primary one being that native Swift strings are stored as UTF-8 where they were previously stored either as ASCII or UTF-16 depending on their contents.
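As a small illustration of what UTF-8 storage means in practice, the sketch below compares the code-unit counts that a string reports through its different views. The sample string is my own, not from the announcement, and nothing here inspects the actual storage; it only shows that the UTF-8 view is the one whose length matches the bytes a native string would hold.

```swift
// "café" with a precomposed é (U+00E9), written as an escape so the
// source file's own normalization cannot change the result.
let word = "caf\u{E9}"

// UTF-8 code units: é takes two bytes (0xC3 0xA9).
let utf8Units = Array(word.utf8)
print(utf8Units)            // [99, 97, 102, 195, 169]

// UTF-16 code units: é fits in a single unit.
print(word.utf16.count)     // 4

// Characters (extended grapheme clusters).
print(word.count)           // 4
```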

[…]

Unifying the storage representation for ASCII and Unicode-rich strings gives us a lot of performance wins. These wins are an effect of several compounding factors including a simpler model with less branching, on-creation encoding validation of native Strings (enabled by a faster validator), a unified implementation code path, a more efficient allocation and use of various bits in the struct, etc.

[…]

By maintaining nul-termination in our storage, interoperability with C is basically free: we just use our pointer.
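A minimal sketch of what this looks like from the caller's side, using the long-standing `withCString` API and the implicit String-to-pointer conversion. Whether the pointer handed to C is the string's own storage or a temporary copy (for example, for small or bridged strings) is an implementation detail that this code cannot observe; I am only assuming that the C function sees the UTF-8 bytes followed by a nul.

```swift
import Foundation

let greeting = "h\u{E9}llo"   // 5 characters, 6 UTF-8 bytes

// Explicit form: the closure receives a nul-terminated UTF-8 pointer
// that is only valid for the duration of the call.
let explicitLength = greeting.withCString { pointer in
    strlen(pointer)
}

// Implicit form: a String can be passed directly where C expects
// a `const char *`.
let implicitLength = strlen(greeting)

print(explicitLength, implicitLength)   // 6 6
```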

[…]

Efficient interoperability with Cocoa is a huge selling point for Swift, and strings are lazily bridged to Objective-C. String’s storage class is a subclass of NSString at runtime, and thus has to answer APIs assuming constant-time access to UTF-16 code units. We solved this with a breadcrumbing strategy: upon first request from one of these APIs on large strings, we perform a fast scan of the contents to check the UTF-16 length, leaving behind breadcrumbs at regular intervals. This allows us to provide amortized constant-time access to transcoded UTF-16 contents by scanning between breadcrumbs.
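To make the breadcrumb idea concrete, here is a toy sketch of my own. It is not the standard library's implementation: the type name `UTF16Breadcrumbs`, the tiny default stride, and the tuple layout are all hypothetical. It scans the UTF-8 bytes once, records a crumb at roughly every `stride` UTF-16 code units, and then answers "what is the UTF-16 code unit at offset n?" by jumping to the nearest earlier crumb and scanning forward from there.

```swift
struct UTF16Breadcrumbs {
    let bytes: [UInt8]                    // valid UTF-8, copied from the string
    let stride: Int                       // UTF-16 units between crumbs
    private(set) var crumbs: [(utf8: Int, utf16: Int)] = []
    private(set) var utf16Count = 0

    // For a leading UTF-8 byte: (length in UTF-8 bytes, width in UTF-16 units).
    static func measure(_ lead: UInt8) -> (Int, Int) {
        if lead < 0x80 { return (1, 1) }
        if lead < 0xE0 { return (2, 1) }
        if lead < 0xF0 { return (3, 1) }
        return (4, 2)                     // non-BMP scalar: surrogate pair
    }

    init(_ string: String, stride: Int = 4) {
        precondition(stride >= 2, "a surrogate pair must fit within one stride")
        self.bytes = Array(string.utf8)
        self.stride = stride
        var byteOffset = 0
        while byteOffset < bytes.count {
            let (length, width) = Self.measure(bytes[byteOffset])
            // Drop a crumb at the scalar that covers the next multiple of stride.
            if crumbs.count * stride < utf16Count + width {
                crumbs.append((utf8: byteOffset, utf16: utf16Count))
            }
            byteOffset += length
            utf16Count += width
        }
    }

    // Scans at most about `stride` code units forward from a crumb.
    func codeUnit(atUTF16Offset target: Int) -> UInt16 {
        precondition(target >= 0 && target < utf16Count)
        let crumb = crumbs[target / stride]
        var byteOffset = crumb.utf8
        var utf16Offset = crumb.utf16
        while true {
            let (length, width) = Self.measure(bytes[byteOffset])
            if target < utf16Offset + width {
                // Transcode just this one scalar and pick the requested unit.
                let scalarBytes = bytes[byteOffset..<(byteOffset + length)]
                let units = Array(String(decoding: scalarBytes, as: UTF8.self).utf16)
                return units[target - utf16Offset]
            }
            byteOffset += length
            utf16Offset += width
        }
    }
}

// Usage: mixed ASCII, 2-byte, 3-byte, and 4-byte scalars.
let sample = "a\u{E9}\u{1F600}b\u{20AC}\u{1F680}cd"
let index = UTF16Breadcrumbs(sample)
let expected = Array(sample.utf16)
let actual = (0..<index.utf16Count).map { index.codeUnit(atUTF16Offset: $0) }
print(actual == expected)
```

The cost of a lookup is bounded by the stride rather than by the string's length, which is the trade-off the quoted passage describes: one linear scan up front, then short scans between crumbs.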

[…]

The branch also introduces support in the ABI (but currently not exposed in any APIs) for shared strings, which provide contiguous UTF-8 code units through some externally-managed storage. These enable future APIs allowing developers to create a String with shared storage from a [UInt8], Data, ByteBuffer, or Substring without actually copying the contents.

This sounds great. Here is the pull request and the implementation of StringObject (which is actually a struct). Note that this locks in the in-memory layout for Swift strings, because code that uses them may be inlined. In contrast, Cocoa’s NSString has an internal representation that has evolved over time. This was possible because access was indirected through the Objective-C runtime.

Previously: Swift String ABI, Performance, and Ergonomics, Swift ABI Stability Dashboard.

Update (2018-11-07): See this pull request from David Smith about improving performance for bridged strings (tweet).

Update (2018-11-09): See also: Hacker News.

Update (2018-12-11): Michael Ilseman:

Would you like to access the raw UTF-8 code units backing a String? Now you can, thanks to String.UTF8View.withContiguousStorageIfAvailable hook[…]
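A short sketch of how the hook might be used. The helper name `utf8Bytes(of:)` is hypothetical; the call returns `nil` when the string has no contiguous UTF-8 storage to offer (for example, some lazily bridged strings), so the sketch falls back to iterating the view.

```swift
// Hypothetical helper: copy out a string's UTF-8 bytes, using the
// contiguous-storage fast path when it is available.
func utf8Bytes(of string: String) -> [UInt8] {
    // The buffer is only valid inside the closure, so copy it before returning.
    if let fast = string.utf8.withContiguousStorageIfAvailable({ buffer in
        Array(buffer)
    }) {
        return fast
    }
    // Fallback: walk the UTF-8 view one code unit at a time.
    return Array(string.utf8)
}

let bytes = utf8Bytes(of: "caf\u{E9}")
print(bytes)   // [99, 97, 102, 195, 169]
```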
