{"id":23320,"date":"2018-11-06T12:12:18","date_gmt":"2018-11-06T17:12:18","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=23320"},"modified":"2018-12-11T16:25:42","modified_gmt":"2018-12-11T21:25:42","slug":"strings-abi-and-utf-8","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2018\/11\/06\/strings-abi-and-utf-8\/","title":{"rendered":"String&rsquo;s ABI and UTF-8"},"content":{"rendered":"<p><a href=\"https:\/\/forums.swift.org\/t\/string-s-abi-and-utf-8\/17676\">Michael Ilseman<\/a>:<\/p>\n<blockquote cite=\"https:\/\/forums.swift.org\/t\/string-s-abi-and-utf-8\/17676\">\n<p>We just landed String&rsquo;s proposed final ABI on master. This ABI includes some significant changes, the primary one being that native Swift strings are stored as UTF-8 where they were previously stored either as ASCII or UTF-16 depending on their contents.<\/p>\n<p>[&#8230;]<\/p>\n<p>Unifying the storage representation for ASCII and Unicode-rich strings gives us a lot of performance wins. These wins are an effect of several compounding factors including a simpler model with less branching, on-creation encoding validation of native Strings (enabled by a faster validator), a unified implementation code path, a more efficient allocation and use of various bits in the struct, etc.<\/p>\n<p>[&#8230;]<\/p>\n<p>By maintaining nul-termination in our storage, interoperability with C is basically free: we just use our pointer. <\/p>\n<p>[&#8230;]<\/p>\n<p>Efficient interoperability with Cocoa is a huge selling point for Swift, and strings are lazily bridged to Objective-C. String&rsquo;s storage class is a subclass of NSString at runtime, and thus has to answer APIs assuming constant-time access to UTF-16 code units. We solved this with a breadcrumbing strategy: upon first request from one of these APIs on large strings, we perform a fast scan of the contents to check the UTF-16 length, leaving behind breadcrumbs at regular intervals. This allows us to provide amortized constant-time access to transcoded UTF-16 contents by scanning between breadcrumbs.<\/p>\n<p>[&#8230;]<\/p>\n<p>The branch also introduces support in the ABI (but currently not exposed in any APIs) for shared strings, which provide contiguous UTF-8 code units through some externally-managed storage. These enable future APIs allowing developers to create a String with shared storage from a [UInt8], Data, ByteBuffer, or Substring without actually copying the contents.<\/p>\n<\/blockquote>\n\n<p>This sounds great. Here is the pull <a href=\"https:\/\/github.com\/apple\/swift\/pull\/20315\">request<\/a> and the implementation of <a href=\"https:\/\/github.com\/apple\/swift\/blob\/6636815568efa8af5a62bbd68d585691d981a82b\/stdlib\/public\/core\/StringObject.swift\">StringObject<\/a> (which is actually a struct). Note that this locks in the in-memory layout for Swift strings, because code that uses them may be inlined. In contrast, Cocoa&rsquo;s <code>NSString<\/code> has an internal representation that has <a href=\"https:\/\/mjtsai.com\/blog\/2015\/07\/31\/nstaggedpointerstring\/\">evolved<\/a> over time. This was possible because access was indirected through the Objective-C runtime.<\/p>\n\n<p>Previously: <a href=\"https:\/\/mjtsai.com\/blog\/2018\/01\/16\/swift-string-abi-performance-and-ergonomics\/\">Swift String ABI, Performance, and Ergonomics<\/a>, <a href=\"https:\/\/mjtsai.com\/blog\/2017\/04\/17\/swift-abi-stability-dashboard\/\">Swift ABI Stability Dashboard<\/a>.<\/p>\n\n<p id=\"strings-abi-and-utf-8-update-2018-11-07\">Update (2018-11-07): See this <a href=\"https:\/\/github.com\/apple\/swift\/pull\/20383\">pull request<\/a> from David Smith about improving performance for bridged strings (<a href=\"https:\/\/twitter.com\/Catfish_Man\/status\/1059993229771235329\">tweet<\/a>).<\/p>\n\n<p id=\"strings-abi-and-utf-8-update-2018-11-09\">Update (2018-11-09): See also: <a href=\"https:\/\/news.ycombinator.com\/item?id=18394640\">Hacker News<\/a>.<\/p>\n\n<p id=\"strings-abi-and-utf-8-update-2018-12-11\">Update (2018-12-11): <a href=\"https:\/\/twitter.com\/Ilseman\/status\/1072283524579831808\">Michael Ilseman<\/a>:<\/p>\n<blockquote cite=\"https:\/\/twitter.com\/Ilseman\/status\/1072283524579831808\">\n<p>Would you like to access the raw UTF-8 code units backing a String? Now you can, thanks to <code>String.UTF8View.withContiguousStorageIfAvailable<\/code> hook[&#8230;]<\/p>\n<\/blockquote>","protected":false},"excerpt":{"rendered":"<p>Michael Ilseman: We just landed String&rsquo;s proposed final ABI on master. This ABI includes some significant changes, the primary one being that native Swift strings are stored as UTF-8 where they were previously stored either as ASCII or UTF-16 depending on their contents. [&#8230;] Unifying the storage representation for ASCII and Unicode-rich strings gives us [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2018-11-06T17:12:20Z","apple_news_api_id":"e1740b6d-26dc-45f3-a733-abc39d341e8c","apple_news_api_modified_at":"2018-12-11T21:25:46Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAAAg==","apple_news_api_share_url":"https:\/\/apple.news\/A4XQLbSbcRfOnM6vDnTQejA","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[4],"tags":[69,46,138,71,901,258],"class_list":["post-23320","post","type-post","status-publish","format-standard","hentry","category-programming-category","tag-cocoa","tag-languagedesign","tag-optimization","tag-programming","tag-swift-programming-language","tag-unicode"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/23320","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=23320"}],"version-history":[{"count":4,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/23320\/revisions"}],"predecessor-version":[{"id":23656,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/23320\/revisions\/23656"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=23320"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=23320"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=23320"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}