{"id":40740,"date":"2023-10-05T21:38:21","date_gmt":"2023-10-06T01:38:21","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=40740"},"modified":"2023-10-05T21:38:21","modified_gmt":"2023-10-06T01:38:21","slug":"the-absolute-minimum-every-software-developer-must-know-about-unicode-in-2023","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2023\/10\/05\/the-absolute-minimum-every-software-developer-must-know-about-unicode-in-2023\/","title":{"rendered":"The Absolute Minimum Every Software Developer Must Know About Unicode in 2023"},"content":{"rendered":"<p><a href=\"https:\/\/tonsky.me\/blog\/unicode\/\">Nikita Prokopov<\/a> (<a href=\"https:\/\/news.ycombinator.com\/item?id=37735801\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/tonsky.me\/blog\/unicode\/\"><p>The problem is, you don&rsquo;t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called &ldquo;<strong>extended grapheme clusters<\/strong>&rdquo;, or graphemes for short.<\/p><p>[&#8230;]<\/p><p>Even in the widest encoding, UTF-32, <code>&#x1F468;&#x200D;&#x1F3ED;<\/code> will still take three 4-byte units to encode. And it still needs to be treated as a single character.<\/p><p>[&#8230;]<\/p><p>But whatever you choose, make sure it&rsquo;s on the recent enough version of Unicode (15.1 at the moment of writing), because the definition of graphemes changes from version to version.<\/p><p>[&#8230;]<\/p><p>Unicode motivation is to save code points space (my guess). Information on how to render is supposed to be transferred outside of the string, as locale\/language metadata.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/news.ycombinator.com\/item?id=37739494\">jcranmer<\/a>:<\/p>\n<blockquote cite=\"https:\/\/news.ycombinator.com\/item?id=37739494\"><p>The truth of the matter is that there are several different definitions of &ldquo;character&rdquo;, depending on what you want to use it for. An extended grapheme cluster is largely defined on &ldquo;this visually displays as a single unit&rdquo;, which isn&rsquo;t necessarily correct for things like &ldquo;display size in a monospace font&rdquo; or &ldquo;thing that gets deleted when you hit backspace.&rdquo; Like so many other things in Unicode, the correct answer is use-case dependent.<\/p><\/blockquote>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/08\/08\/unicode-is-harder-than-you-think\/\">Unicode Is Harder Than You Think<\/a><\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Nikita Prokopov (Hacker News): The problem is, you don&rsquo;t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called &ldquo;extended grapheme clusters&rdquo;, or graphemes for short.[&#8230;]Even in the widest encoding, UTF-32, &#x1F468;&#x200D;&#x1F3ED; will [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2023-10-06T01:38:25Z","apple_news_api_id":"4c4fe598-29be-41a4-9176-6c6892715c49","apple_news_api_modified_at":"2023-10-06T01:38:25Z","apple_news_api_revision":"AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/w==","apple_news_api_share_url":"https:\/\/apple.news\/ATE_lmCm-QaSRdmxoknFcSQ","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[4],"tags":[71,258],"class_list":["post-40740","post","type-post","status-publish","format-standard","hentry","category-programming-category","tag-programming","tag-unicode"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/40740","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=40740"}],"version-history":[{"count":1,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/40740\/revisions"}],"predecessor-version":[{"id":40741,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/40740\/revisions\/40741"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=40740"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=40740"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=40740"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}