Don’t use Excel for biological data

Why? Because it quietly modifies your data.

How? Excel auto-reformats the contents of cells it recognises as being of being a particular format. The most common examples are things that look like dates, i.e. ’09-10′ will be formatted as ’09-Oct’. What’s wrong with that you ask? It is a problem because it modifies the data: this example would become ’09/10/2012′. If you don’t notice it, there is no going back to the data you entered.

This even happens to files you open in Excel that were created elsewhere.

In bioinformatics, text files formatted as comma- or tab-delimited are very common and are often viewed in Excel. Much of bioinformatics data relates to genes and genes have specific abbreviations such as BRCA2. The problem is that many of these gene abbreviations look like dates to excel e.g. MARCH1, SEPT10. These examples are changed to ’01-Mar’ and ’10-Sep’, respectively.

This is not new news. A group from the National Institutes for Health in Bethesda, USA published a paper highlighting this problem in…2004.

Now nearly nine years later it is still occuring in very public places. A Cell paper published in October last year (A good paper, worth a future blog post). The paper came with three supplementary files in Excel format, two of which had genes names as dates. One with 29 genes and the other with two. How did this get past the reviewers? It’s such an obvious error.

A little while later, in December, I was looking for some annotations for a microarray and found it in ArrayExpress, but it too had dates for gene names (for 85 probes). Fair dos to ArrayExpress as they fixed the annotation file after I pointed it out to them. More revealing was this:

We do quite often see problems created by Excel automatic formatting and try
to fix this where spotted in direct submissions to ArrayExpress.

I was informed that the original data was obtained from GEO ([HG-U219] Affymetrix Human Genome U219 Array), so I looked at their entry and true enough there were dates for gene names. I likewise contacted GEO regarding this issue and their response was reassuring and depressing at the same time:

Thank you very much for your message. In response to your inquiry we plan to
perform a review of GEO Platforms to identify instances of this problem.
However, it may be several weeks before all records are cleaned up.

Reassuring because they acknowledged the seriousness of the issue. Depressing because they weren’t aware of this beforehand.

How is it that two important data repositories for the scientific community did not know and did not deal with this problem? And how is that the reviewers or editors of a prestigious journal like Cell did not notice the errors in the paper?

In summary, Microsoft Excel is detrimentally affecting biological data. You could argue that this is a user error and there are ways to avoid it (not always), but its insiduous nature ensures that it is easy to miss and then it affects others who have not even used Excel on the data.

So, don’t use Excel. For all our sakes.

NB: I’m not the only one raising this issue: http://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not-learned/

Advertisements

13 thoughts on “Don’t use Excel for biological data

  1. Pingback: Links 2/12/13 | Mike the Mad Biologist

  2. Bloody right…I’ve had this problem too. The octamer-binding transcription factors (including Oct-4 , of embryonic stem cell fame) are also affected. What do you recommend as an alternative to Excel?

    Steve Pells

    • The only alternative to Excel is LibreOffice, but it works similarly.

      Take home message is only use it to view data and to *always* open via the wizard (LibreOpen does this) setting gene name columns as ‘Text’ not ‘General’.

  3. It’s a poor workman who blames the tools.

    While Excel does have its problems, autoformatting is neither the worst of these, nor should it (nor its other issues) stop you from using excel for it can be useful.

    Lesson 1: don’t open a document without a backup – and never save over your backup. It doesn’t matter *what* software you are using, if you are hurting your saved data because an autoformat overwrote something already saved, you are doing it wrong.

    • It’s the lack of feedback from Excel regarding that is the problem.

      If you open a tab-delimited file, sort on one of the columns and then save. You have no idea that it converted some ‘dates’ for you. That’s clearly wrong. In passing the file to a collaborator you’re giving erroneous data.

  4. So, what should we use? Is there another spreadsheet program that a) doesn’t do these very annoying automodifications AND b) produces files that can be easily shared with colleagues, WITHOUT requiring them to obtain esoteric software of their own?
    I’m aware of Open Office, and I’ve used it enough to know that it mostly satisfies both conditions. I say “mostly” because it has its own set of weird, unexpected automatic functions that can sometimes be turned off, and the files it produces can sometimes be shared with colleaugues who do not themselves have Open Office installed on their computers. But not always.

    • It depends.

      OpenOffice and LibreOffice are similarly affected, but they always open a wizard with plain text data. So, you can check the data. Excel doesn’t always and will modify the ‘dates’ as it sees fit.

      Excel is ok if it is used as a data viewer only. It is not ok if as part of a data processing workflow or for disseminating data.

      To share data only use formatted plain text.

  5. I use Excel 2007 at work, and often have to import text files. Step 3 in the text import wizzard lets you select data type (general/date/text) by column. You can also use MS Access, which would only import or link your data, not overwrite the original file.

  6. 1) At the time of importing data to excel from txt file, remember to choose text for this column in the import wizard.
    2) Adding a space, example: (space)2:00 will force excel not to format as date.
    3) Many problems occur when you copy/paste columns that are formatted as general into a new column. Right click, format as text as column, doesn’t work all the time.
    Solution: Select the column that you are having issues, go to data ribbon, text to columns and then follow the wizard and check TEXT. This will fix it.
    A good validation technique is to right click in the column, format and verify that the column is formatted as text each time that you copy and paste to different sheets or workbooks.
    For complicated issues go to mrexcel.com forum
    Regards,
    Marioustx

    • The serious issue is that you can open data in Excel bypassing the wizard which then corrupts your data. Yes, there are ways around it, but it should never be possible for Excel to edit your data quietly.

  7. Pingback: Open Refine for Librarians – Librarian of Things

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s