Friday, November 19, 2021

Unicode and Copying and Pasting Code

Glenn Faison:

I recently saw first-hand why I should never copy and paste any code I found online (or anywhere, for that matter).

[…]

To cut a long story short, what looks like a loose inequality check on line #4 is actually an assignment, which is parsed as (environmentǃ = ENV_PROD). In JavaScript, an assignment expression evaluates to the assigned value, which in this case is truthy (it is treated as true wherever a boolean value is expected).

But isn’t environmentǃ an invalid variable name in JavaScript, you ask? It’s complicated. You’d be right to say an exclamation sign cannot be part of a variable name. However, the ǃ you see there is in fact not the everyday exclamation sign you know. It’s an obscure character that happens to be accepted as regular text by the JavaScript interpreter, and thus can be a valid part of a variable name.
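The original snippet is elided above, so the following is only a hypothetical reconstruction of the kind of code being described. The function name, the `IDENTIFIER` regex, and the `"PROD"` value are assumptions for illustration; the look-alike character is U+01C3 (LATIN LETTER RETROFLEX CLICK), which Unicode classifies as a letter.

```javascript
// Hypothetical reconstruction; the names mirror the quoted description.
const ENV_PROD = "PROD";

// Matches one identifier under the Unicode ID_Start / ID_Continue properties.
const IDENTIFIER = /^[\p{ID_Start}$_][\p{ID_Continue}$\u200C\u200D]*$/u;

function describeEnvironment(environment) {
  // Declared so the sketch also runs in strict mode; in sloppy mode an
  // undeclared `environmentǃ` would silently become a global instead.
  let environmentǃ;

  // Looks like `environment != ENV_PROD`, but the "ǃ" is U+01C3, a letter,
  // so this assigns ENV_PROD to `environmentǃ` and tests the truthy result.
  if (environmentǃ = ENV_PROD) {
    return "non-production branch";
  }
  return "production branch";
}

console.log(describeEnvironment(ENV_PROD));
console.log(IDENTIFIER.test("environment\u01C3"));
console.log(IDENTIFIER.test("environment!"));
```

The function takes the "non-production" branch no matter what is passed in, and the regex accepts the name ending in U+01C3 while rejecting the one ending in a real exclamation mark.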

This particular example is unlikely to happen in Swift, both because assignments don’t have values and because the compiler is picky about whitespace around operators.

Via Nick Lockwood:

This is why unicode (outside of string literals) in programming languages was a mistake.

[…]

Support for unicode in variables adds a massive new surface for hiding security exploits in plain sight (see also: unicode urls).

The supposed benefit of being able to use mathematical symbols for custom operators is mostly just an attractive nuisance since you can’t type them.

Inclusivity is good, but unicode variables offer little practical benefit to non-English speakers if the platform APIs and dominant 3rd party frameworks are not localized, and unicode is neither necessary nor sufficient to solve that (it should ideally be handled at IDE-level).
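For teams that do want to confine Unicode to string literals, a review-time check is one option. The following is a rough sketch, not a real linter: `findNonAsciiOutsideStrings` is a made-up name, and the masking regex only understands simple single-line quoted literals (no template literals, comments, or regex literals).

```javascript
// Naive match for single-line '...' and "..." literals with escapes.
const STRING_LITERAL = /"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'/g;
const NON_ASCII = /[^\x00-\x7F]/gu;

// Hypothetical helper: reports non-ASCII characters that appear outside
// simple string literals, with 1-based line and column numbers.
function findNonAsciiOutsideStrings(source) {
  const hits = [];
  source.split("\n").forEach((line, index) => {
    // Blank out literals but keep the length so columns stay meaningful.
    const masked = line.replace(STRING_LITERAL, (literal) =>
      " ".repeat(literal.length)
    );
    for (const match of masked.matchAll(NON_ASCII)) {
      const hex = match[0].codePointAt(0).toString(16).toUpperCase();
      hits.push({
        line: index + 1,
        column: match.index + 1,
        codePoint: "U+" + hex.padStart(4, "0"),
      });
    }
  });
  return hits;
}

const sample = 'if (environment\u01C3= ENV_PROD) {\n  log("caf\u00E9");\n}';
console.log(findNonAsciiOutsideStrings(sample));
```

On the sample, only the U+01C3 in the condition is reported; the "é" inside the string literal is masked and ignored.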

CVE-2021-42574 (via Daniel Martín):

The Rust Security Response WG was notified of a security concern affecting source code containing “bidirectional override” Unicode codepoints: in some cases the use of those codepoints could lead to the reviewed code being different than the compiled code.
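The bidirectional formatting characters in question are U+202A through U+202E and U+2066 through U+2069. As a minimal sketch of a pre-review check (the function name `findBidiControls` is an assumption, and this is not how rustc itself handles them), a scanner can simply flag every occurrence:

```javascript
// Bidirectional embedding, override, and isolate controls.
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/gu;

// Hypothetical helper: lists each bidi control character in the source
// text with its 1-based line and column.
function findBidiControls(source) {
  const hits = [];
  source.split("\n").forEach((line, index) => {
    for (const match of line.matchAll(BIDI_CONTROLS)) {
      const hex = match[0].codePointAt(0).toString(16).toUpperCase();
      hits.push({
        line: index + 1,
        column: match.index + 1,
        codePoint: "U+" + hex.padStart(4, "0"),
      });
    }
  });
  return hits;
}

// The second line hides a RIGHT-TO-LEFT OVERRIDE inside a string.
console.log(findBidiControls('let a = 1;\nlet b = "x\u202Ey";'));
```

Unlike a look-alike letter, these characters are invisible, so flagging them even inside string literals and comments seems the safer default.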

4 Comments

That is what happens when you have ideology-driven technology and not engineering.

In Python, they now have a PEP about the use of Unicode in source code:

https://www.python.org/dev/peps/pep-0672/

Swift operator whitespace requirements don't protect us here: xǃ=6 is a valid assignment in Swift.

And some limited shenanigans are possible in Swift as well:

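import Foundation

let a = NSObject()
let xǃ = NSObject()  // the name ends in U+01C3, a letter that resembles "!"

// Reads like `x !== a` at a glance, but compares `xǃ` and `a` with `==`.
if (xǃ==a) {
    print("never reached")
}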

Luckily Swift requires explicit declaration of variables, so this is quite obvious compared to JavaScript.

Inclusivity is good, but unicode variables offer little practical benefit to non-English speakers if the platform APIs and dominant 3rd party frameworks are not localized

Oh, hard disagree. If you’ve ever done consulting for an industry with very region-specific topics (say, accounting), you’ll quickly find that

• official translations for certain terms may or may not exist
• when they do, customers may or may not know about them
• sooner or later, you get to field a support ticket where a customer uses the region’s term, and good luck translating that back and forth with your code’s internal naming convention

IOW, it’s much simpler to use the customer’s jargon in the first place. And when your software’s primary users don’t have English as their first language, that means you have identifiers in your code that aren’t English either.

unicode is neither necessary nor sufficient to solve that (it should ideally be handled at IDE-level).

It would be nice if IDEs offered such mapping. Heck, that applies to APIs in general. Spreadsheets sort of get this right: launch Excel in German, and SUM() becomes SUMME().

But that’s not how we typically use IDEs. Except for, say, Smalltalk, we by and large write our code in plain text, and that means the tooling operates on plain text, and often doesn’t have semantic context (see also: most diff/merge tools not understanding that, when you merely switch the order of two functions, it’s not a semantically relevant change). And it doesn’t look like this is changing any time soon.
