Tuesday, August 8, 2023

Unicode Is Harder Than You Think

Marco Cilloni (via Cédric Luthi, Hacker News):

Reading the excellent article by JeanHeyd Meneide on how broken string encoding in C/C++ is made me realise that Unicode is a topic that is often overlooked by a large number of developers. In my experience, there’s a lot of confusion and wrong expectations on what Unicode is, and what best practices to follow when dealing with strings that may contain characters outside of the ASCII range.

[…]

The fact UTF-32 is a fixed-width encoding is only marginally useful, due to grapheme clusters still being a thing. This means that the equivalence between codepoints and rendered glyphs is still not 1:1, just like in UCS-4[…]

[…]

As I previously mentioned, Unicode codepoints can be modified using combining characters, and the standard supports precomposed forms of some characters which have decomposed forms. The resulting glyphs are visually indistinguishable after being rendered, and there’s no limitation on using both forms alongside each other in the same text bit of text[…]Another headache is the fact Unicode also may define special forms for the same letter or group of letters, which are visibly different but understood by humans to be derived from the same symbol.

Previously:

Comments RSS · Twitter · Mastodon

Leave a Comment