Thursday, January 4, 2018

A Branchless UTF-8 Decoder

Chris Wellons (via Matías N. Goldberg):

The CPU must correctly predict the length of the code point or else it will suffer a hazard. An incorrect guess will stall the pipeline and slow down decoding.

[…]

This reads four bytes regardless of the actual length. Avoiding doing something is branching, so this can’t be helped. The unneeded bits are shifted out based on the length. That’s all it takes to decode UTF-8 without branching.
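As a rough illustration of the idea in the quote, here is a minimal sketch in C. It is not necessarily Wellons's exact code: the function name `utf8_decode_sketch` and the table names are hypothetical, it omits all error detection (overlong forms, surrogates, bad continuation bytes), and it assumes the caller guarantees at least four readable bytes at `s`, for example by padding the end of the buffer with zeros.

```c
#include <stdint.h>

/* Hypothetical sketch: decode one code point starting at s and return a
 * pointer to the next sequence. Assumes 4 readable bytes at s and performs
 * no validation, so the result is meaningless for malformed input. */
static const unsigned char *
utf8_decode_sketch(const unsigned char *s, uint32_t *cp)
{
    /* Sequence length indexed by the top 5 bits of the lead byte.
     * 0 marks a byte that cannot start a sequence. */
    static const unsigned char lengths[32] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
    };
    /* Payload bits of the lead byte, indexed by length. */
    static const unsigned char masks[5]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    /* How many unneeded low bits to discard, indexed by length. */
    static const unsigned char shifts[5] = {0, 18, 12, 6, 0};

    int len = lengths[s[0] >> 3];

    /* Always assemble four bytes as if this were a 4-byte sequence. */
    uint32_t c = (uint32_t)(s[0] & masks[len]) << 18;
    c |= (uint32_t)(s[1] & 0x3f) << 12;
    c |= (uint32_t)(s[2] & 0x3f) << 6;
    c |= (uint32_t)(s[3] & 0x3f);

    /* Shift out the bits contributed by bytes past the real length. */
    *cp = c >> shifts[len];

    /* Advance by len, or by 1 for an invalid lead byte, so the caller
     * always makes progress. */
    return s + len + !len;
}
```

The length comes from a table lookup rather than a chain of comparisons, and the final shift depends on that length, so the source contains no conditional on the data. Whether the generated machine code is truly branch-free (for instance, how `!len` is compiled) depends on the compiler and target.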

C Programming Language Optimization Programming Unicode

Michael Tsai - Blog - UTF-8’s History and Virtues

April 4, 2019 3:25 PM

[…] A Branchless UTF-8 Decoder […]

Copyright © 2000–2025 Michael Tsai.