{"id":8894,"date":"2014-05-28T13:54:34","date_gmt":"2014-05-28T17:54:34","guid":{"rendered":"http:\/\/mjtsai.com\/blog\/?p=8894"},"modified":"2014-05-28T13:54:34","modified_gmt":"2014-05-28T17:54:34","slug":"python-3-and-unicode","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2014\/05\/28\/python-3-and-unicode\/","title":{"rendered":"Python 3 and Unicode"},"content":{"rendered":"<p><a href=\"http:\/\/sealedabstract.com\/rants\/python-3-is-fine\/\">Drew Crawford<\/a> links to two great posts about Python 3. <a href=\"http:\/\/lucumr.pocoo.org\/2014\/5\/12\/everything-about-unicode\/\">Armin Ronacher<\/a>:<\/p>\n<blockquote cite=\"http:\/\/lucumr.pocoo.org\/2014\/5\/12\/everything-about-unicode\/\"><p>Python 3 takes a very difference stance on Unicode than UNIX does.  Python\n3 says: everything is Unicode (<em>by default, except in certain situations,\nand except if we send you crazy reencoded data, and even then it's\nsometimes still unicode, albeit wrong unicode<\/em>).  Filenames are Unicode,\nTerminals are Unicode, stdin and out are Unicode, there is so much\nUnicode!  And because UNIX is not Unicode, Python 3 now has the stance\nthat it's right and UNIX is wrong, and people should really change the\nPOSIX specification to add a <tt>C.UTF-8<\/tt> encoding which is Unicode.  And\nthen filenames are Unicode, and terminals are Unicode and never ever will\nyou see bytes again although obviously everything still is bytes and will\nfail.<\/p><\/blockquote>\n<p><a href=\"http:\/\/python-notes.curiousefficiency.org\/en\/latest\/python3\/questions_and_answers.html\">Nick Coghlan<\/a>:<\/p>\n<blockquote cite=\"http:\/\/python-notes.curiousefficiency.org\/en\/latest\/python3\/questions_and_answers.html\"><p>The conceptual problem with this [Python 2] model is that it is an appropriate model for\n<em>boundary<\/em> code - the kind of code that handles the transformation between\nwire protocols and file formats (which are always a series of bytes), and the\nmore structured data types actually manipulated by applications (which may\ninclude opaque binary blobs, but are more typically things like text, numbers\nand containers).<\/p>\n<p>Actual <em>applications<\/em> shouldn&rsquo;t be manipulating values that &ldquo;might be\ntext, might be arbitrary binary data&rdquo;. In particular, manipulating text\nvalues as binary data in multiple different text encodings can easily cause\na problem the Japanese named &ldquo;mojibake&rdquo;: binary data that includes text in\nmultiple encodings, but with no clear structure that defines which parts are\nin which encoding.<\/p>\n<p>Unfortunately, Python 2 uses a type with exactly those semantics as its core\nstring type, permits silent promotion from the &ldquo;might be binary data&rdquo; type\nto the &ldquo;is definitely text&rdquo; type and provides little support for accounting\nfor encoding differences.<\/p>\n<\/blockquote>\n<p>In Cocoa terms, Python 2 uses a mix of <code>NSData<\/code> (<code>str<\/code>) and <code>NSString<\/code> (<code>unicode<\/code>) to represent strings. As long as you work with <code>NSString<\/code>, everything is fine. The problem is that some APIs give you <code>NSData<\/code> when they &ldquo;should&rdquo; give you <code>NSString<\/code>, or they take the <code>NSString<\/code> that you gave them and at some point convert it to <code>NSData<\/code>. You can end up in situations where you get back an <code>NSData<\/code> and don&rsquo;t know its encoding, or Python tries to concatenate an <code>NSData<\/code> and an <code>NSString<\/code> or two <code>NSData<\/code>s that implicitly have different encodings (later to be turned back into an <code>NSString<\/code>). Then you get an exception, and the exception&rsquo;s stack trace is from when you tried to do something with the &ldquo;bad&rdquo; object, not from when the &ldquo;bad&rdquo; object was created.<\/p>\n<p>Python 3 is more like Cocoa (or Java). It says that all strings will be <code>NSString<\/code> (<code>str<\/code>), which is always Unicode, and everything else is <code>NSData<\/code> (<code>bytes<\/code>). You need to take extra care to make sure that you are correctly converting to\/from <code>NSString<\/code>. Otherwise, you can mess up things that might have worked in Python 2 &ldquo;by accident.&rdquo; But if you do the work of putting in the proper explicit conversions at the system boundaries, you should end up with a more reliable system where everything inside is clean.<\/p>\n<p>I&rsquo;m not a Ruby programmer, but I gather that it uses <code>NSData<\/code> for everything, only each <code>NSData<\/code> has an associated encoding (which may be &ldquo;BINARY&rdquo;). At least you can&rsquo;t lose track of the encoding. But, like Python 2, every <code>NSData<\/code> object is potentially &ldquo;dirty.&rdquo; If you try to combine two <code>NSData<\/code>s that explicitly have different encodings, you get an exception&mdash;again, possibly far removed from where the offending object entered your system.<\/p>","protected":false},"excerpt":{"rendered":"<p>Drew Crawford links to two great posts about Python 3. Armin Ronacher: Python 3 takes a very difference stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except if we send you crazy reencoded data, and even then it's sometimes still unicode, albeit wrong unicode). [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"","apple_news_api_id":"","apple_news_api_modified_at":"","apple_news_api_revision":"","apple_news_api_share_url":"","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[4],"tags":[69,46,71,232,287,258,163],"class_list":["post-8894","post","type-post","status-publish","format-standard","hentry","category-programming-category","tag-cocoa","tag-languagedesign","tag-programming","tag-python","tag-ruby","tag-unicode","tag-unix"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/8894","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=8894"}],"version-history":[{"count":0,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/8894\/revisions"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=8894"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=8894"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=8894"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}