The Absolute Minimum Every Software Developer Must Know About Unicode in 2023
Nikita Prokopov (Hacker News):
The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.
[…]
Even in the widest encoding, UTF-32,
👨🏭
will still take three 4-byte units to encode. And it still needs to be treated as a single character.[…]
But whatever you choose, make sure it’s on the recent enough version of Unicode (15.1 at the moment of writing), because the definition of graphemes changes from version to version.
[…]
Unicode motivation is to save code points space (my guess). Information on how to render is supposed to be transferred outside of the string, as locale/language metadata.
The truth of the matter is that there are several different definitions of “character”, depending on what you want to use it for. An extended grapheme cluster is largely defined on “this visually displays as a single unit”, which isn’t necessarily correct for things like “display size in a monospace font” or “thing that gets deleted when you hit backspace.” Like so many other things in Unicode, the correct answer is use-case dependent.
Previously: