The spreadsheet problem in data science
Written by Robert Coyle
The following text is an extract from Designing For Data Scientist published by Creme Global recently. This contents summarises research carried out by IBM in their paper The Risks of Using Spreadsheets for Statistical Analysis.
Spreadsheets are a ubiquitous tool in data science. Spreadsheets are not only used for data manipulation, they are also used for statistical analysis and even running basic models. A team might have a spreadsheet sheet with macros that run an analysis on data. The user can be anyone from a Data Scientist to an analyst working in marketing , finance or any industry that models data. The common activity of forecasting is essentially a type of modelling.
Risks of statistical analysis with spreadsheets
Users stretch the limits of spreadsheets to the point where they are used to perform tasks beyond their ability to perform correctly. IBM carried out useful research on this topic. Their findings are summarised in their unambiguously titled paper The Risks of Using Spreadsheets for Statistical Analysis. Here are some of the insights:
- A typical spreadsheet will have a restriction on the number of records it can handle. If the scale of the job is large, a different tool like an SQL database might be more appropriate.
- If you only need a superficial review of your data, a spreadsheet may be a suitable tool. But if you suspect that there is valuable information in your data that isn’t immediately obvious, or if you need to perform a detailed analysis or find hidden patterns, a spreadsheet will not give you the functionality you need.
- Another factor to consider is the degree of accuracy required. Spreadsheet results can be unreliable, especially on large datasets and/or for complex calculations. If absolute accuracy is required, a spreadsheet may not suffice. Instead, a different, more reliably accurate tool should be considered.
- If the task is simply to analyze a limited quantity of historical data, a spreadsheet will do. However if you want to make reliable forecasts or draw trends, especially if they involve large dataset, then there are much better tools.
- Spreadsheets are prone to errors. A number of studies have been made concerning the frequency of errors in spreadsheets. Based on these studies the indication is that 90 percent of all spreadsheets contain at least one error. These include stealth errors that pass the attention of experts and can go undiscovered.
Type of errors
If we look at the type of errors spreadsheet users encounter we see more clearly the advantages of moving to a formal model
Functional Errors: These are errors that break the spreadsheet and give an error message.
Outlier Errors: The spreadsheet still works but an expert spots that these numbers can’t be correct.
Stealth Errors: These errors in results even pass the attention of experts. Often in statistics no one know what correct results should look like. Errors like these can go undiscovered.
“IBM studies reveal that 90 percent of all spreadsheets contain at least one error.”
Causes of spreadsheet errors
Some of the primary causes include:
Logic: Relating cells incorrectly might mean a logic error. Using the wrong function or operator are logic errors.
Copying formulas: Copying a formula might fall foul of unseen cell relationships. Hand typing a formula is also dangerous.
Copying numbers on top of formulas: Copy a constant number over an invisible formula in a cell and the formula is wiped out along with other cell relationships.
Using the wrong function: Some spreadsheet functions have dangerously similar names.
Leaving out data: Most spreadsheets setups will not give validation error if you have not completed full rows of data and have left some cells out.
Adding incorrect data: Incorrect data entered can often pass unnoticed. Ideally the entry would cause an output effect that is very noticeable – for example accidentally entering a date rather than a number – however this is not always the case.
Excel patches: Tragically patches have caused errors to existing sheets as shown by this report.
Accuracy: Spreadsheet have been shown to have accuracy issues when executing complex maths, even when error free. Excel has gotten criticism for its shortcomings in statistical analysis.
Complexity: Spreadsheet analyses are often made up of chains of calculations. It is very easy to alter links in these chains without realizing and “breaking” the analysis.
“Professional statisticians continue to write books with titles like ‘Statistics with Excel,’ but they now warn students not to bet their jobs on Excel’s accuracy.”
Bruce D. McCullough, “The Unreliability of Excel’s Statistical Procedures,” Foresight, (February 2006) Vol. 3, 44-45.
Other spreadsheet issues
Handling missing data values: A user here might try to enter a zero value which would upset the findings of a median value. If they enter a string it might be misinterpreted by an equation as a zero or something else entirely. One would at least need to come up with a standard for denoting missing values and stick to it. Mark Lambe here at Creme Global pointed out a related danger: “Excel’s default is to try to handle / interpret ambiguities rather than throw an error. This is a strength for simple stuff but becomes a disadvantage as complexity increases”
Categorical data with hidden meaning: Imagine you enter the value -1 or -2 in a cell to denote two different types of error. This categorical data needs to be documented somewhere. What if the author of the spreadsheet moves on?
Labour caused by focusing on cells: Spreadsheets are cell centric and modifying a cell can require the changing of formulas or data across a range of cells/rows/columns.
New data and equations: Equations can be set to extend automatically for new data, but sometimes this is not desired. Either way this will be a consideration and possible source of new errors.
Using spreadsheets for serious statistical work can be more trouble than it’s worth, and sometimes the user never even finds out about it, but reaps the negative consequences.
Furthermore in Excel the Data Scientist loses the reproducibility coding gives by way of versioning and pull requests.
For Designers working in data science, be aware of the ubiquity and consequences of spreadsheets and be able to bring this knowledge into your design workshops.