Wednesday, May 28, 2014

Python 3 and Unicode

Drew Crawford links to two great posts about Python 3. Armin Ronacher:

Python 3 takes a very different stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except if we send you crazy reencoded data, and even then it's sometimes still unicode, albeit wrong unicode). Filenames are Unicode, Terminals are Unicode, stdin and out are Unicode, there is so much Unicode! And because UNIX is not Unicode, Python 3 now has the stance that it's right and UNIX is wrong, and people should really change the POSIX specification to add a C.UTF-8 encoding which is Unicode. And then filenames are Unicode, and terminals are Unicode and never ever will you see bytes again although obviously everything still is bytes and will fail.

Nick Coghlan:

The conceptual problem with this [Python 2] model is that it is an appropriate model for boundary code - the kind of code that handles the transformation between wire protocols and file formats (which are always a series of bytes), and the more structured data types actually manipulated by applications (which may include opaque binary blobs, but are more typically things like text, numbers and containers).

Actual applications shouldn’t be manipulating values that “might be text, might be arbitrary binary data”. In particular, manipulating text values as binary data in multiple different text encodings can easily cause a problem the Japanese named “mojibake”: binary data that includes text in multiple encodings, but with no clear structure that defines which parts are in which encoding.

Unfortunately, Python 2 uses a type with exactly those semantics as its core string type, permits silent promotion from the “might be binary data” type to the “is definitely text” type and provides little support for accounting for encoding differences.

In Cocoa terms, Python 2 uses a mix of NSData (str) and NSString (unicode) to represent strings. As long as you work with NSString, everything is fine. The problem is that some APIs give you NSData when they “should” give you NSString, or they take the NSString that you gave them and at some point convert it to NSData. You can end up in situations where you get back an NSData and don’t know its encoding, or Python tries to concatenate an NSData and an NSString or two NSDatas that implicitly have different encodings (later to be turned back into an NSString). Then you get an exception, and the exception’s stack trace is from when you tried to do something with the “bad” object, not from when the “bad” object was created.
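
The delayed failure is easy to reproduce. Here is a minimal sketch in Python 3, with bytes playing the role of the Python 2 str (the NSData). The sample word and the variable names are my own illustration, not taken from either post.

```python
# Two byte strings holding the same word, "café", in different encodings.
# Nothing in the objects themselves records which encoding was used.
name_utf8 = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
name_latin1 = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

# Combining them succeeds silently: this is where the "bad" object is created.
combined = name_utf8 + b" / " + name_latin1

# Decoding with the wrong guess does not fail at all; it yields mojibake.
mojibake = name_utf8.decode("latin-1")

# The mixed data only fails later, when something finally decodes it.
try:
    combined.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The concatenation is the real mistake, but the exception comes from the decode call, which could be in a completely different part of the program.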

Python 3 is more like Cocoa (or Java). It says that all strings will be NSString (str), which is always Unicode, and everything else is NSData (bytes). You need to take extra care to make sure that you are correctly converting to/from NSString. Otherwise, you can mess up things that might have worked in Python 2 “by accident.” But if you do the work of putting in the proper explicit conversions at the system boundaries, you should end up with a more reliable system where everything inside is clean.
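
As a rough sketch of what explicit conversions at the boundaries might look like in Python 3: read_text and write_text are hypothetical helpers I made up for illustration, and UTF-8 is an assumed default that a real program would take from the protocol or file format.

```python
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Boundary, inbound: turn bytes from the outside world into str."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Boundary, outbound: turn str back into bytes for the outside world."""
    return text.encode(encoding)


incoming = b"na\xc3\xafve"       # bytes as they might arrive from a socket or file
text = read_text(incoming)       # decoded once, at the edge
shouted = text.upper()           # everything inside works on str only
outgoing = write_text(shouted)   # encoded once, on the way out

# Mixing the two types fails immediately, at the line that mixes them.
try:
    text + incoming
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The TypeError is raised where str and bytes first meet, instead of much later when the combined value gets used.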

I’m not a Ruby programmer, but I gather that it uses NSData for everything, only each NSData has an associated encoding (which may be “BINARY”). At least you can’t lose track of the encoding. But, like Python 2, every NSData object is potentially “dirty.” If you try to combine two NSDatas that explicitly have different encodings, you get an exception—again, possibly far removed from where the offending object entered your system.

