Saturday, January 8, 2022

Why “utf8” in MySQL Is Not UTF-8

Florian Köhler (via Ken Harris):

For whatever reason, a few months later, in September 2002, a MySQL developer decided to push a one-byte commit, “UTF8 now works with up to 3 byte sequences only”, to the repository and change the allowed bytes from six to three.

Since then, the character set called utf8 has been a crippled and proprietary variation as it neither conforms to the old nor the new definition (RFC 3629) of UTF-8. The misleading name still causes issues today.

[…]

To remediate this mistake, MySQL added the utf8mb4 charset in version 5.5.3. utf8mb4 fully implements the current standard. Now utf8 is an alias for utf8mb3 and will be switched to utf8mb4 in a future release.
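
To make the three-byte limit concrete, here is a minimal Python sketch that checks whether a string would fit in a charset capped at three bytes per character, as the quote describes for utf8/utf8mb3. It does not talk to MySQL at all; it only measures standard UTF-8 lengths, and the helper names (`utf8_length`, `fits_in_utf8mb3`, `first_unsupported`) are hypothetical, made up for this illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a single character takes in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8.

    Assumption: this models only the 3-byte cap described above,
    not MySQL's actual validation or error behavior.
    """
    return all(utf8_length(ch) <= 3 for ch in text)


def first_unsupported(text: str):
    """Return (index, character) of the first 4-byte character, or None."""
    for index, ch in enumerate(text):
        if utf8_length(ch) > 3:
            return index, ch
    return None


if __name__ == "__main__":
    for sample in ["Köhler", "price: 5 €", "hi 😀"]:
        print(repr(sample), fits_in_utf8mb3(sample), first_unsupported(sample))
```

Characters in the Basic Multilingual Plane, such as “ö” (2 bytes) or “€” (3 bytes), pass the check. Anything above U+FFFF, which includes most emoji, needs four bytes and therefore requires utf8mb4.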

Update (2022-01-13): See also: Hacker News.
