Why? Because it quietly modifies your data.
How? Excel auto-reformats the contents of cells it recognises as being of being a particular format. The most common examples are things that look like dates, i.e. ’09-10′ will be formatted as ’09-Oct’. What’s wrong with that you ask? It is a problem because it modifies the data: this example would become ’09/10/2012′. If you don’t notice it, there is no going back to the data you entered.
This even happens to files you open in Excel that were created elsewhere.
In bioinformatics, text files formatted as comma- or tab-delimited are very common and are often viewed in Excel. Much of bioinformatics data relates to genes and genes have specific abbreviations such as BRCA2. The problem is that many of these gene abbreviations look like dates to excel e.g. MARCH1, SEPT10. These examples are changed to ’01-Mar’ and ’10-Sep’, respectively.
This is not new news. A group from the National Institutes for Health in Bethesda, USA published a paper highlighting this problem in…2004.
Now nearly nine years later it is still occuring in very public places. A Cell paper published in October last year (A good paper, worth a future blog post). The paper came with three supplementary files in Excel format, two of which had genes names as dates. One with 29 genes and the other with two. How did this get past the reviewers? It’s such an obvious error.
A little while later, in December, I was looking for some annotations for a microarray and found it in ArrayExpress, but it too had dates for gene names (for 85 probes). Fair dos to ArrayExpress as they fixed the annotation file after I pointed it out to them. More revealing was this:
We do quite often see problems created by Excel automatic formatting and try
to fix this where spotted in direct submissions to ArrayExpress.
I was informed that the original data was obtained from GEO ([HG-U219] Affymetrix Human Genome U219 Array), so I looked at their entry and true enough there were dates for gene names. I likewise contacted GEO regarding this issue and their response was reassuring and depressing at the same time:
Thank you very much for your message. In response to your inquiry we plan to
perform a review of GEO Platforms to identify instances of this problem.
However, it may be several weeks before all records are cleaned up.
Reassuring because they acknowledged the seriousness of the issue. Depressing because they weren’t aware of this beforehand.
How is it that two important data repositories for the scientific community did not know and did not deal with this problem? And how is that the reviewers or editors of a prestigious journal like Cell did not notice the errors in the paper?
In summary, Microsoft Excel is detrimentally affecting biological data. You could argue that this is a user error and there are ways to avoid it (not always), but its insiduous nature ensures that it is easy to miss and then it affects others who have not even used Excel on the data.
So, don’t use Excel. For all our sakes.
NB: I’m not the only one raising this issue: http://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not-learned/