{"id":24857,"date":"2019-04-03T14:30:32","date_gmt":"2019-04-03T18:30:32","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=24857"},"modified":"2019-04-04T15:24:52","modified_gmt":"2019-04-04T19:24:52","slug":"utf-8s-history-and-virtues","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2019\/04\/03\/utf-8s-history-and-virtues\/","title":{"rendered":"UTF-8&rsquo;s History and Virtues"},"content":{"rendered":"<p><a href=\"https:\/\/www.cl.cam.ac.uk\/~mgk25\/ucs\/utf-8-history.txt\">Rob Pike<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.cl.cam.ac.uk\/~mgk25\/ucs\/utf-8-history.txt\"><p>What happened was this.  We had used the original UTF from ISO 10646\nto make Plan 9 support 16-bit characters, but we hated it.  We were\nclose to shipping the system when, late one afternoon, I received a\ncall from some folks, I think at IBM - I remember them being in Austin\n- who were in an X\/Open committee meeting.  They wanted Ken and me to\nvet their FSS\/UTF design.  We understood why they were introducing a\nnew design, and Ken and I suddenly realized there was an opportunity\nto use our experience to design a really good standard and get the\nX\/Open guys to push it out.  We suggested this and the deal was, if we\ncould do it fast, OK.  So we went to dinner, Ken figured out the\nbit-packing, and when we came back to the lab after dinner we called\nthe X\/Open guys and explained our scheme.  We mailed them an outline\nof our spec, and they replied saying that it was better than theirs (I\ndon&rsquo;t believe I ever actually saw their proposal; I know I don&rsquo;t\nremember it) and how fast could we implement it?  I think this was a\nWednesday night and we promised a complete running system by Monday,\nwhich I think was when their big vote was.<\/p><p>So that night Ken wrote packing and unpacking code and I started\ntearing into the C and graphics libraries.  
The next day all the code\nwas done and we started converting the text files on the system\nitself.  By Friday some time Plan 9 was running, and only running,\nwhat would be called UTF-8.  We called X\/Open and the rest, as they\nsay, is slightly rewritten history.<\/p><p>Why didn&rsquo;t we just use their FSS\/UTF?  As I remember, it was because\nin that first phone call I sang out a list of desiderata for any such\nencoding, and FSS\/UTF was lacking at least one - the ability to\nsynchronize a byte stream picked up mid-run, with less than one\ncharacter being consumed before synchronization.  Because that was\nlacking, we felt free - and were given freedom - to roll our own.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/www.cl.cam.ac.uk\/~mgk25\/ucs\/utf-8-history.txt\">Ken Thompson<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.cl.cam.ac.uk\/~mgk25\/ucs\/utf-8-history.txt\"><p>Below are the guidelines that were used in defining the UCS\ntransformation format:<\/p>\n<p>1) Compatibility with historical file systems:<\/p>\n<p>Historical file systems disallow the null byte and the ASCII\n\tslash character as a part of the file name.<\/p>\n<p>\t2) Compatibility with existing programs:<\/p>\n<p>\tThe existing model for multibyte processing is that ASCII does\n\tnot occur anywhere in a multibyte encoding.  
There should be\n\tno ASCII code values for any part of a transformation format\n\trepresentation of a character that was not in the ASCII\n\tcharacter set in the UCS representation of the character.<\/p>\n<p>\t3) Ease of conversion from\/to UCS.<\/p>\n<p>\t4) The first byte should indicate the number of bytes to\n\tfollow in a multibyte sequence.<\/p>\n<p>\t5) The transformation format should not be extravagant in\n\tterms of number of bytes used for encoding.<\/p>\n<p>\t6) It should be possible to find the start of a character\n\tefficiently starting from an arbitrary location in a byte\n\tstream.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/twitter.com\/RichFelker\/status\/907794945964158981\">Rich Felker<\/a>:<\/p>\n<blockquote cite=\"https:\/\/twitter.com\/RichFelker\/status\/907794945964158981\">\n<p>Not only do ASCII bytes never appear in multibyte UTF-8 chars; NO character is ever a substring of another character.<\/p>\n<p>UTF-8 was really a work of brilliance, guaranteeing what&rsquo;s pretty much a maximal set of important desirable properties like this.<\/p>\n<p>Of course the desirable properties necessitate one property that&rsquo;s hard to like: not all byte sequences can be legal\/valid.<\/p>\n<\/blockquote>\n\n<p>See also: <a href=\"https:\/\/www.youtube.com\/watch?v=_2NI6t2r_Hs\">The History of Unix<\/a> (via <a href=\"https:\/\/news.ycombinator.com\/item?id=18417918\">Hacker News<\/a>).<\/p>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2019\/03\/21\/utf-8-string-in-swift-5\/\">UTF-8 String in Swift 5<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2018\/01\/04\/a-branchless-utf-8-decoder\/\">A Branchless UTF-8 Decoder<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2012\/04\/30\/utf-8-everywhere\/\">UTF-8 Everywhere<\/a><\/li>\n<\/ul>\n\n<p id=\"utf-8s-history-and-virtues-update-2019-04-04\">Update (2019-04-04): See also: <a href=\"https:\/\/news.ycombinator.com\/item?id=19565980\">Hacker 
News<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Rob Pike: What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it. We were close to shipping the system when, late one afternoon, I received a call from some folks, I think at IBM - I remember them being in Austin [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2019-04-03T18:30:36Z","apple_news_api_id":"3efe1c8c-f645-4e6c-b0ef-52ee07d0369b","apple_news_api_modified_at":"2019-04-04T19:24:57Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAAAA==","apple_news_api_share_url":"https:\/\/apple.news\/APv4cjPZFTmyw71LuB9A2mw","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[4],"tags":[45,295,697,71,258,163],"class_list":["post-24857","post","type-post","status-publish","format-standard","hentry","category-programming-category","tag-c","tag-history","tag-ibm","tag-programming","tag-unicode","tag-unix"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/24857","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=24857"}],"ve
rsion-history":[{"count":2,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/24857\/revisions"}],"predecessor-version":[{"id":24867,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/24857\/revisions\/24867"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=24857"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=24857"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=24857"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}