Friday, August 7, 2020

Renaming Human Genes for Excel

James Vincent:

But [Excel’s] default settings were designed with more mundane applications in mind, so when a user inputs a gene’s alphanumeric symbol into a spreadsheet, like MARCH1 — short for “Membrane Associated Ring-CH-Type Finger 1“ — Excel converts that into a date: 1-Mar.

This is extremely frustrating, even dangerous, corrupting data that scientists have to sort through by hand to restore. It’s also surprisingly widespread and affects even peer-reviewed scientific work. One study from 2016 examined genetic data shared alongside 3,597 published papers and found that roughly one-fifth had been affected by Excel errors.

[…]

This week, the HGNC published new guidelines for gene naming, including for “symbols that affect data handling and retrieval.” From now on, they say, human genes and the proteins they expressed will be named with one eye on Excel’s auto-formatting. That means the symbol MARCH1 has now become MARCHF1, while SEPT1 has become SEPTIN1, and so on.

[…]

Many gene symbols that can be read as nouns have been renamed to avoid false positives during searches, for example. In the past, CARS has become CARS1, WARS changed to WARS1, and MARS tweaked to MARS1.

Update (2020-08-10): Michael Love:

As long as we’re renaming things around Excel, it would be great if we could change the Pinyin syllable ‘jun’ into something that Excel doesn’t interpret as a month.

8 Comments RSS · Twitter

I see so much trouble caused by Excel. It feels like there should be a huge market for a really high-performance no-frills spreadsheet that is super fast, works with huge amounts of data, does not corrupt CSV files and does not do any kind of auto-type detection. At work I see all the time business users trying to simply work with data and ending up with major corruption. They just want to sort, filter, modify, save before it goes back in to Oracle or whatever. I doubt they would ever give up Excel though.

I work with a startup in the Pharmacogenetics field, and Excel is used everywhere.

Most times, Excel is not even used to calculate data, but simply to manage (view, edit) tables of static data. It's basically, what a database does, but with a UI for editing it. And users like to be able to use background colors on rows or cells to highlight them.

The immediacy of the data input is what makes spreadsheets so popular. And while Apple's Numbers may be smarter with this kind of data (although it also sucks big at importing generic csv formats), portability, i.e. exchange with Windows, is a must in this field.

It would be nice of Microsoft would understand these needs and provide a way to set defaults that make all cells "Text" instead of "General" (which is where they'd replace "MARCH1" with a date), but then again, Excel is already quite modal in its preferences, causing lots of troubles, e.g. when using different system languages. Ideally, there should be default settings per document, but that's somehow not happening.

So, the scientists need to get pragmatic, and this renaming of genes is the logical result of this. Can't wait for Microsoft to come to their senses.

Another reason to use Apple’s Numbers - it doesn’t do any auto-formatting of MARCH1.

Just to throw in yet another perspective:
Our IT installs the MS Office suite on all Macs and Windows PCs, but many of our scientists use LibreOffice for working with CSV files, diff'ing Excel files, working with ginormous documents (think Word files with 1000 pages or so), and a few other tasks. LibreOffice is not particularly pretty to look at and definitely feels non-native, but it is shocking how much better it handles certain scenarios than Excel does, especially considering how mature Excel should be by now and how much money went into its development.

The situation got even worse on macOS with the MS Office 2016 rewrite, which made the applications much more sluggish and hungry for resources. It's certainly no coincidence that Excel 2011 only had a size of 47 MB, whereas Excel 2016 is a 1.7 GB monster.

Anecdote from just last week: customer asks me to please revert to the old behavior where date values in exported CSV are zero-padded (e.g., 08082020 for today; please don’t comment on the format — the customer’s app requires it that way). I check and, nope, I haven’t actually changed anything on my end. I did suspect that them opening and saving the file in Excel was the likely culprit (due to its parsing defaults), but it took me a while to figure out what was happening: previously, they had opened the file from my app (a database) in the default app, which presumably used to be some plain text editor. Then they saved it, and opened it from their database. Now, though, Excel is the default app, and when opening, tries to parse the dates as numbers. I.e., it was probably a change to default apps that actually triggered the customer’s issues.

My point being: this kind of stuff unfortunately happens all the time.

It would be nice of Microsoft would understand these needs and provide a way to set defaults that make all cells “Text” instead of “General” (which is where they’d replace “MARCH1” with a date), but then again, Excel is already quite modal in its preferences, causing lots of troubles, e.g. when using different system languages. Ideally, there should be default settings per document, but that’s somehow not happening.

I just wish when opening CSVs, there were a modal dialog first. Instead, there’s no UI at all; Excel quietly tries to parse every field in the CSV as “general”.

Now, you might think it’s because Microsoft hasn’t bothered to make such a UI. But they have! And it’s not completely terrible — it even has live preview. If, instead of opening the CSV, you open a blank spreadsheet and go to the Data tab in the Ribbon, you can actually configure (again, with live preview of the effect on the file) the encoding, the separators (field and row), and then, for each column, the initial data type. Select all columns, pick Text, you’re golden.

If there were a Task Dialog on open the let me pick “Let Excel pick defaults” or “Configure how to import this CSV file”, I could simply direct my customers to click the latter button.

Instead, I have to tell them to please don’t open and especially don’t save a CSV file in Excel, but rather to first save it from my app, then open Excel, then open the file from within Excel using the Data tab.

(https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/suggestions/10576116-add-a-setting-option-where-all-csv-delimited-files appears to be roughly this kind of feature request.)

It’s certainly no coincidence that Excel 2011 only had a size of 47 MB, whereas Excel 2016 is a 1.7 GB monster.

My guess is you’re not calculating that correctly. 2011-era Office had lots of satellite files spread over the file system, whereas newer ones put a lot more into their bundles. This means there’s far more duplication if you have multiple apps, but it also makes it more self-contained and more sandboxing-friendly. For example, both Word and Excel have the mso*.framework bundles inside each app bundle, with an identical 1825 items and 216 MB in size. These probably used to sit somewhere, once, in /Library/Frameworks or /Library/Application Support/…/Frameworks.

I’m sure 2016 is bigger, but I’m almost certain you’re omitting files that used to not be in the app bundle.

@Jules Unfortunately, Numbers is not very efficient at handling large CSV files.

You are right, Sören: If you include frameworks and other files that are not part of the app bundles, the 2016 versions are still significantly bigger, but the difference is nowhere near the 30x you get by just comparing the apps themselves.

There might have been some confirmation bias at work on my part, due to the fact that especially Excel is just so much more sluggish than its 2011 counterpart.

Adrian: “It feels like there should be a huge market for a really high-performance no-frills spreadsheet that is super fast, works with huge amounts of data, does not corrupt CSV files and does not do any kind of auto-type detection.”

Except these folks *aren’t* using it as a spreadsheet, they’re using it as an ad-hoc database. Which is an *extremely* common practice. It’d make more sense to decouple the GUI (grid layout) frontend from the backend, and allow different backends—spreadsheet (programmable parallel calculator), relational database, visual layout tool/end-user GUI toolkit, etc to be used as appropriate.

The problem is MS Excel has the de facto lock-in, in large part because of its monolithic complexity, thus MS have neither the motive nor the capability to radically redesign and rebuild Excel as a more modular system to better fit users’ real-world needs. It’ll take radical disruption by a very deep-pocketed and aggressive competitor (<cough> Google) to unseat it, and even there there’s been little progress made.

Leave a Comment