Wednesday, November 27, 2013

The String Type Is Broken

Edaqa:

I encourage you to run such tests in your favourite language. If you are doing work with international text it is vital that you understand what your ‘string’ type is actually doing. Once you’ve run this you should reconsider what your “string” type is actually doing for you. In my opinion they’re all broken.
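As a rough illustration of the kind of test meant here, the following Python sketch pokes at a string containing a combining character. The sample word and the variable names are chosen for this example and are not necessarily the tests Edaqa ran. Other languages will give different answers, which is rather the point.

```python
# "noe" + COMBINING DIAERESIS (U+0308) + "l" renders as "noël",
# which a reader perceives as four characters.
decomposed = "noe\u0308l"

# len() counts code points, not user-perceived characters.
code_point_count = len(decomposed)

# The UTF-8 encoding has yet another length, counted in bytes.
utf8_byte_count = len(decomposed.encode("utf-8"))

# Naive reversal moves the diaeresis from the "e" onto the "l".
reversed_naive = decomposed[::-1]

# Slicing by code point can separate a letter from its combining mark.
first_three = decomposed[:3]

print(code_point_count, utf8_byte_count)
print(ascii(reversed_naive), ascii(first_three))
```

Three different "lengths" (four perceived characters, five code points, six bytes) for one short word is a small-scale version of the problem the post describes.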

1 Comment

Edaqa is correct. I just hope no one's takeaway is that this can be fixed without a lot of hard work, or without surfacing data details that you would rather not worry about most of the time. Text handling is fucked up: even if you get everyone to agree on UTF-8, Unicode normalization will do you in. Even after that, where do round-tripping and error handling meet for forbidden sequences? This can't be thoroughly fixed without settling all these things once and for all and restarting everything.

Barring that, the best that can be done is to make things increasingly less broken for the general case and possible to handle for the mission-critical case.
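The two hazards named above, normalization and forbidden byte sequences, can be sketched in Python roughly as follows. This is only one language's set of trade-offs, and the sample strings and variable names are made up for illustration.

```python
import unicodedata

# The same visible "é" in two encodings of the same text.
composed = "\u00e9"      # single code point
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

# Plain comparison works on code points, so these differ.
equal_raw = composed == decomposed

# Normalizing both sides to the same form (NFC here) makes them compare equal.
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# The byte 0xFF can never appear in valid UTF-8.
raw = b"caf\xff"

# Option 1: strict decoding rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Option 2: replace the bad byte with U+FFFD. This never fails,
# but the original bytes cannot be recovered afterwards.
lossy = raw.decode("utf-8", errors="replace")
lossy_round_trip = lossy.encode("utf-8")

# Option 3: smuggle the bad byte through as a lone surrogate. This round-trips,
# but the resulting str is not valid Unicode text and fails a strict encode.
preserved = raw.decode("utf-8", errors="surrogateescape")
preserved_round_trip = preserved.encode("utf-8", errors="surrogateescape")

print(equal_raw, equal_nfc, strict_ok)
print(lossy_round_trip == raw, preserved_round_trip == raw)
```

None of the three decoding options is free: the first refuses data, the second destroys it, and the third preserves it only by letting invalid text into the string type.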
