Thursday, October 5, 2023

The Absolute Minimum Every Software Developer Must Know About Unicode in 2023

Nikita Prokopov (Hacker News):

The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.

[…]

Even in the widest encoding, UTF-32, 👨‍🏭 will still take three 4-byte units to encode. And it still needs to be treated as a single character.

[…]

But whatever you choose, make sure it’s on the recent enough version of Unicode (15.1 at the moment of writing), because the definition of graphemes changes from version to version.

[…]

Unicode motivation is to save code points space (my guess). Information on how to render is supposed to be transferred outside of the string, as locale/language metadata.

jcranmer:

The truth of the matter is that there are several different definitions of “character”, depending on what you want to use it for. An extended grapheme cluster is largely defined on “this visually displays as a single unit”, which isn’t necessarily correct for things like “display size in a monospace font” or “thing that gets deleted when you hit backspace.” Like so many other things in Unicode, the correct answer is use-case dependent.
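To make the "several different definitions" point concrete, here is a minimal Python sketch that measures the 👨‍🏭 example from the quote above in four different units. The helper name `length_report` is made up for illustration. The sketch deliberately stops short of counting extended grapheme clusters, which needs a library that implements the Unicode segmentation rules.

```python
# U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F3ED FACTORY
factory_worker = "\U0001F468\u200D\U0001F3ED"


def length_report(text: str) -> dict[str, int]:
    """Return the length of `text` under several definitions of 'unit'.

    Hypothetical helper for illustration only.
    """
    return {
        "code_points": len(text),                          # Python str is indexed by code point
        "utf8_bytes": len(text.encode("utf-8")),           # 1-byte code units
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 2-byte code units
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 4-byte code units
    }


print(length_report(factory_worker))
# {'code_points': 3, 'utf8_bytes': 11, 'utf16_units': 5, 'utf32_units': 3}
```

None of the four numbers is 1, even though a reader sees a single character, which is why the answer to "how long is this string?" depends on what the length is for.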


3 Comments

What’s sad for us is that the rules defining grapheme clusters change every year as well. What is considered a sequence of two or three separate code points today might become a grapheme cluster tomorrow! There’s no way to know! Or prepare!

Things like this (among others) make me question if Unicode stewardship is in the right hands.

Okay, lemme toot my own horn since we're talking about code points and UTF-8 by sharing one of my favorite answers that I've posted to StackOverflow...

It explains how to truncate a UTF-8 string at the last whole character before a given byte length in C#, and does it, as one comment helpfully & accurately put it, in O(n) time.

It was an amazingly fun question -- I was on vacation fooling around with code for a text-based database while in Starbucks, waiting for some young'uns to finish up a vacationish activity, and the existing answers had code smells so bad I couldn't enjoy my coffee's aroma, so I followed my Spolsky & took the time to actually learn how the leading bits of UTF-8 bytes work.

TL;DR:

If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.

If the first byte after the cut has 11 in its leading bits, it is the start of a multi-byte character, so that's a good place to cut too!

If it has 10 in its leading bits, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
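The three cases above can be sketched in a few lines. This is not the C# from the StackOverflow answer, just a minimal Python illustration of the same bit test, assuming the input is already valid UTF-8; the function name `truncate_utf8` is made up for this example.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 `data` to at most `max_bytes` bytes without splitting a character.

    Hypothetical helper; assumes `data` is well-formed UTF-8.
    """
    if max_bytes >= len(data):
        return data  # nothing to cut
    cut = max(max_bytes, 0)
    # data[cut] is the first byte after the cut.
    # 0xxxxxxx (ASCII) or 11xxxxxx (lead byte): a character starts here, so the cut is clean.
    # 10xxxxxx (continuation byte): we are mid-character, so back up to its lead byte.
    while cut > 0 and (data[cut] & 0b1100_0000) == 0b1000_0000:
        cut -= 1
    return data[:cut]


encoded = "héllo".encode("utf-8")      # b'h\xc3\xa9llo'
print(truncate_utf8(encoded, 2))       # b'h'  (the cut would have split 'é')
print(truncate_utf8(encoded, 3))       # b'h\xc3\xa9'
```

In valid UTF-8 the loop steps back at most three bytes, so the check at the cut itself does not depend on the length of the string. Note that this only respects code point boundaries: a cut can still land inside a grapheme cluster such as 👨‍🏭.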

Not rocket science here, but there's something about taking a domain apart, learning it to the 0s and 1s, and putting it back together for a more efficient, elegant answer that's insanely enjoyable.

Some of you know what I'm talking about. ;^D (And now somebody's going to look at that and figure out a better way of doing it... which is fun too!)

Ha! I haven't been back in a while -- someone has already offered a "better" solution! Though I'm admittedly smiling to myself when I read, "but due to a bug it won't produce the correct result before .NET 5," which means I had the best on SO until at least then. Know your bits, folks! :^D

But the answerer skipped their Spolsky. That can't have been nearly as much fun to come up with. (Nice update to Spolsky from Nikita there, btw.)
