Python 3 takes a very different stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except if we send you crazy reencoded data, and even then it's sometimes still unicode, albeit wrong unicode). Filenames are Unicode, Terminals are Unicode, stdin and out are Unicode, there is so much Unicode! And because UNIX is not Unicode, Python 3 now has the stance that it's right and UNIX is wrong, and people should really change the POSIX specification to add a C.UTF-8 encoding which is Unicode. And then filenames are Unicode, and terminals are Unicode and never ever will you see bytes again although obviously everything still is bytes and will fail.
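The "wrong unicode" remark most likely refers to Python 3's surrogateescape error handler, which turns undecodable bytes into lone surrogate code points. Here is a minimal sketch of that behavior. The filename bytes are made up for illustration, and the decoding is done explicitly rather than through the OS so that the result doesn't depend on the platform's locale.

```python
# A made-up filename whose bytes are Latin-1, not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Strict UTF-8 decoding rejects the stray 0xE9 byte.
strict_error = None
try:
    raw_name.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_error = exc

# surrogateescape keeps the bad byte as a lone surrogate (U+DCE9),
# so the result is a str, but not well-formed Unicode text.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# The original bytes can be recovered with the same error handler...
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

# ...but a strict encode of that str fails, possibly far from
# where the bad bytes entered the program.
encode_error = None
try:
    text_name.encode("utf-8")
except UnicodeEncodeError as exc:
    encode_error = exc
```

The string survives a round trip, but anything that encodes it strictly (printing, writing to a UTF-8 file) can fail later.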
The conceptual problem with this [Python 2] model is that it is an appropriate model for boundary code - the kind of code that handles the transformation between wire protocols and file formats (which are always a series of bytes), and the more structured data types actually manipulated by applications (which may include opaque binary blobs, but are more typically things like text, numbers and containers).
Actual applications shouldn’t be manipulating values that “might be text, might be arbitrary binary data”. In particular, manipulating text values as binary data in multiple different text encodings can easily cause a problem the Japanese named “mojibake”: binary data that includes text in multiple encodings, but with no clear structure that defines which parts are in which encoding.
Unfortunately, Python 2 uses a type with exactly those semantics as its core string type, permits silent promotion from the “might be binary data” type to the “is definitely text” type and provides little support for accounting for encoding differences.
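To make the mojibake problem concrete, here is a small Python 3 sketch. The sample text and the choice of Latin-1 as the wrong encoding are my own assumptions for illustration. It shows that decoding bytes with the wrong encoding raises no error at all; it just silently yields different text.

```python
original = "日本語"

# At the boundary, text becomes bytes in some specific encoding.
utf8_bytes = original.encode("utf-8")

# Decoding with the wrong encoding does not fail: Latin-1 maps every
# possible byte to a code point, so this quietly produces garbage.
garbled = utf8_bytes.decode("latin-1")

# Repair is only possible if you know which wrong encoding was applied.
repaired = garbled.encode("latin-1").decode("utf-8")
```

Once several such strings with different histories are concatenated, there is no longer a single encoding that can undo the damage.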
In Cocoa terms, Python 2 uses a mix of NSData (str) and NSString (unicode) to represent strings. As long as you work with NSString, everything is fine. The problem is that some APIs give you NSData when they “should” give you NSString, or they take the NSString that you gave them and at some point convert it to NSData. You can end up in situations where you get back an NSData and don’t know its encoding, or Python tries to concatenate an NSData and an NSString or two NSDatas that implicitly have different encodings (later to be turned back into an NSString). Then you get an exception, and the exception’s stack trace is from when you tried to do something with the “bad” object, not from when the “bad” object was created.
Python 3 is more like Cocoa (or Java). It says that all strings will be NSString (str), which is always Unicode, and everything else is NSData (bytes). You need to take extra care to make sure that you are correctly converting to/from NSString. Otherwise, you can mess up things that might have worked in Python 2 “by accident.” But if you do the work of putting in the proper explicit conversions at the system boundaries, you should end up with a more reliable system where everything inside is clean.
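As a rough sketch of what explicit conversions at the system boundaries might look like in Python 3: the helper names read_text and write_text are hypothetical, and UTF-8 is an assumed default rather than something the interpreter picks for you.

```python
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical inbound boundary: bytes from the wire become str."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical outbound boundary: str becomes bytes for the wire."""
    return text.encode(encoding)


incoming = b"na\xc3\xafve"      # UTF-8 bytes for "naïve"
text = read_text(incoming)      # inside the program: always str
shouted = text.upper()          # text operations work on code points
outgoing = write_text(shouted)  # back to bytes only when leaving

# Mixing the two types is rejected immediately, at the point of the
# mistake, rather than being silently promoted as in Python 2.
mixing_error = None
try:
    text + incoming
except TypeError as exc:
    mixing_error = exc
```

The encoding is named in exactly two places, so a wrong guess is at least easy to find.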
I’m not a Ruby programmer, but I gather that it uses NSData for everything, only each NSData has an associated encoding (which may be “BINARY”). At least you can’t lose track of the encoding. But, like Python 2, every NSData object is potentially “dirty.” If you try to combine two NSDatas that explicitly have different encodings, you get an exception, again possibly far removed from where the offending object entered your system.