Wednesday, April 3, 2019

UTF-8’s History and Virtues

Rob Pike:

What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it. We were close to shipping the system when, late one afternoon, I received a call from some folks, I think at IBM - I remember them being in Austin - who were in an X/Open committee meeting. They wanted Ken and me to vet their FSS/UTF design. We understood why they were introducing a new design, and Ken and I suddenly realized there was an opportunity to use our experience to design a really good standard and get the X/Open guys to push it out. We suggested this and the deal was, if we could do it fast, OK. So we went to dinner, Ken figured out the bit-packing, and when we came back to the lab after dinner we called the X/Open guys and explained our scheme. We mailed them an outline of our spec, and they replied saying that it was better than theirs (I don’t believe I ever actually saw their proposal; I know I don’t remember it) and how fast could we implement it? I think this was a Wednesday night and we promised a complete running system by Monday, which I think was when their big vote was.

So that night Ken wrote packing and unpacking code and I started tearing into the C and graphics libraries. The next day all the code was done and we started converting the text files on the system itself. By Friday some time Plan 9 was running, and only running, what would be called UTF-8. We called X/Open and the rest, as they say, is slightly rewritten history.

Why didn’t we just use their FSS/UTF? As I remember, it was because in that first phone call I sang out a list of desiderata for any such encoding, and FSS/UTF was lacking at least one - the ability to synchronize a byte stream picked up mid-run, with less than one character being consumed before synchronization. Because that was lacking, we felt free - and were given freedom - to roll our own.

Ken Thompson:

Below are the guidelines that were used in defining the UCS transformation format:

1) Compatibility with historical file systems:

Historical file systems disallow the null byte and the ASCII slash character as a part of the file name.

2) Compatibility with existing programs:

The existing model for multibyte processing is that ASCII does not occur anywhere in a multibyte encoding. There should be no ASCII code values for any part of a transformation format representation of a character that was not in the ASCII character set in the UCS representation of the character.

3) Ease of conversion from/to UCS.

4) The first byte should indicate the number of bytes to follow in a multibyte sequence.

5) The transformation format should not be extravagant in terms of number of bytes used for encoding.

6) It should be possible to find the start of a character efficiently starting from an arbitrary location in a byte stream.
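To make guidelines 4 and 6 concrete, here is a minimal sketch of how a reader can resynchronize in the middle of a byte stream. It assumes modern UTF-8 as restricted by RFC 3629 (sequences of at most four bytes), not the longer forms in the original 1992 proposal. The function names `is_continuation`, `sequence_length`, and `char_start` are made up for this illustration.

```python
def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def sequence_length(lead: int) -> int:
    """Guideline 4: the first byte alone tells how long the sequence is."""
    if lead < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def char_start(data: bytes, pos: int) -> int:
    """Guideline 6: back up from an arbitrary offset to the start of a character.

    Assumes `data` is well-formed UTF-8, so at most three steps back are needed.
    """
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


data = "aé€😀".encode("utf-8")  # characters of 1, 2, 3, and 4 bytes
start = char_start(data, 5)      # offset 5 is the last byte of "€"
length = sequence_length(data[start])
print(start, length, data[start:start + length].decode("utf-8"))
```

Because lead bytes and continuation bytes occupy disjoint ranges, the backward scan never needs to look at more than a few bytes, whatever the length of the stream.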

Rich Felker:

Not only do ASCII bytes never appear in multibyte UTF-8 chars; NO character is ever a substring of another character.

UTF-8 was really a work of brilliance, guaranteeing what’s pretty much a maximal set of important desirable properties like this.

Of course the desirable properties necessitate one property that’s hard to like: not all byte sequences can be legal/valid.
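Both of these points can be checked with a short sketch that leans on Python's built-in strict UTF-8 codec. The helper names `has_ascii_byte` and `is_valid_utf8` are hypothetical, chosen only for this example.

```python
def has_ascii_byte(ch: str) -> bool:
    """True if any byte of the character's UTF-8 encoding is in the ASCII range."""
    return any(b < 0x80 for b in ch.encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """True if the strict built-in decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Non-ASCII characters never contain ASCII bytes, so a "/" or NUL byte
# in a file name can only ever mean the ASCII character itself.
print(has_ascii_byte("/"), has_ascii_byte("é"), has_ascii_byte("€"))

# The encoding of one character is not a substring of another's.
print("é".encode("utf-8") in "€".encode("utf-8"))

# The price: many byte sequences are simply not valid.
print(is_valid_utf8(b"\x80"))      # lone continuation byte
print(is_valid_utf8(b"\xc0\xaf"))  # overlong encoding of "/"
print(is_valid_utf8(b"\xe2\x82"))  # "€" with its final byte missing
```

The overlong case shows why the restriction matters: if a decoder accepted `C0 AF` as a slash, the guarantee that ASCII bytes only ever represent ASCII characters would no longer protect code that scans raw bytes for path separators.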

See also: The History of Unix (via Hacker News).

Update (2019-04-04): See also: Hacker News.
