With the hype of deep learning neural nets, and machine learning algorithms, it’s easy to forget that most of the work in data science involves accessing and preparing data for analysis. Indeed, not all data is Kaggle-ready. The reality is: data is often far from perfect.
Very far from perfect.
Do your consultant (and budget) a favor and follow these rules-of-thumb when using spreadsheets to collect and organize your data.
First, do not rely on spreadsheet formatting to indicate associations in your data. The most common form of this mistake is to use color-coding to indicate treatment or group.
Here, the alternating red and green rows (are supposed to) convey information about alternating treatments. If the software (or data scientist) is color blind, however, then the association of “row color” with “treatment” is unclear, and likely to be lost. My recommendation is to explicitly capture and encode information about treatment in a separate column.
A rule-of-thumb for good data preparation, is that each row should not rely on it’s formatting or color to convey metadata or information that might be required or useful for analysis. You do not want to have to rely on your memory that red is ‘a’ and green is ‘b’.
Second, never merge spreadsheet cells. Another place I see room for improvement is with the use of “merged” cells in spreadsheets. Excel provides the user with the ability to combine cells, perhaps with an eye to improving the (human) readability of the data contained in the table.
Merged cells are problematic when exported to .csv
Here, the Position and Setting columns for each row has been grouped using the “merge” function in excel. The problem arises because most of the merged cells, when unmerged (during .csv export, say), are left blank. These blank cells result in the loss of required or useful group associations for these records.
Remember: in preparing data for analysis, each row of data should be a stand-alone entity, and not rely on its relative position or order to convey part of its meaning. A test of this principle is, for any given table of data, you should be able to re-order your rows in any order, and not lose any required or useful information for analysis.
When setting up your spreadsheet for data analysis, the rule-of-thumb is to simply never merge your data cells.
Much more clear, especially for my R scripts.
Third, always use Data Validation tools for data entry. Excel is an excellent tool for entering data. However, if you are capturing and entering data by hand, there are ample opportunities for making difficult-to-detect data entry errors. Can you spot the error in this table?
With only ten rows, you probably noticed that the Treatment entry in the third row (cell C6) has an UPPER CASE label, while the rest of the entries for Treatment have lower case labels.
Most scripting languages (R, python) are case sensitive — by default — and would group ‘a’ and ‘A’ separately, possibly leading to errors (unless detected during a preliminary data audit).
When you set up your spreadsheet for data entry, take time to consider checking data as it is entered. The above error could be prevented using the Data Validation function in Excel. Here, a list of acceptable values (a,b) could be used to confirm each entry. Entries not on the list are detected and can be corrected.
Conditional Formatting (typically, color highlighting) is another useful Excel tool for quickly checking that data is entered correctly. Expected values will have a format (e.g., color) and unexpected values will not, making them stand out during visual inspection.
The fourth rule of thumb is to never (ever!) delete rows of data if you want the data excluded from the analysis. Once a cell or row is deleted, you no longer have any indication that your deleted data existed! The best practice for excluding data in an analysis, is to use a new column in the data table to act as a flag- (or indicator-) variable, and for each record, encode and capture whether that record should be included (e.g., 0=”include”, 1=”exclude”).
For additional clarity, a second column can be added that provides an explanation (perhaps encoded, or that uses Data Validation) for why the data is to be excluded from the analysis. In this way, if your consultant (or your future self) is curious and wants to analyze these “excluded” data, he or she has ability to do so.
The quality of the data is indicated by the QC_flag column (0=good, 1=bad) and the rational for the QC status can be noted in the QC_description column.
(If you “Format as Table” your data, you can easily apply filters so that the data you want to exclude is made invisible.)
Fifth, to help your consultant (and perhaps your future self) understand and audit the types and ranges of your data, create a key that explains each column of data in a table. This key can live in a separate, human readable document — perhaps as a README.txt file that is updated and saved in the same folder as the spreadsheet file, or as information in a separate tab on the spreadsheet, or as information at the top of each spreadsheet. If a column of data contains a factor (e.g., a label, or class, or group, or treatment), enumerate all the possible values that this factor could take (In the above table, the QC flag can be 0=good, or 1=bad). If a column contains numbers, indicate the range (and perhaps information about the precision) for those numbers (In the above table, the x1 column is a measurement in mm units, ranging from 0-50). This information can be critical for communicating what the data means — and therefore how the data could be used, and how it should “typically” look. Departures from “typical” values (e.g., negative measurements) can be flagged for later consideration (or correction).
The sixth rule for spreadsheets is to preserve the integrity of the data by separating the data from the analysis, and requires the understanding of a principle. The principle is: the data are the data. As such, there should exist a single source of truth for the data. There isn’t my data, and there isn’t your data. There is only the data. In practice, there should not be different versions of the same(ish) data.
This does not mean that you can not make versions of the analysis of data, only that the raw data itself should not change.
I bring this up because when it comes to data analysis, spreadsheets are a double edged sword. Spreadsheets are useful because they allow you to capture data and do analyses of those data in the same setting (Excel!). This is good for two reasons. First, there is a relatively low barrier to entry for “doing data analysis.” Second, it is good because it enables rapid prototyping and exploration. Spreadsheets are dangerous, however, because they are too easy to alter the data (perhaps by accident) without indicating how or what the change was. This can lead to a proliferation of versions of the raw data, perhaps unintentionally.
Ideally the data should be sequestered from the user in a way that makes it easy to access, but very difficult to change. The user should be free to analyze the data in any way, as long as the original data isn’t altered. When organizing a data project (on your hard-drive) I would recommend making a subdirectory called “data” that contains only the data, and to save your data as read-only files. A second subdirectory, called “analysis” can contain files based on the raw data files, but including additional analyses or annotations (or even changes). If I am going to do some analyses in excel, I simply append “_analysis YYYY-MM-DD” to the name of the raw data file, and save it in the analysis subdirectory.
I would also be remiss if I didn’t take this opportunity to say that if you are planning on capturing a lot of (structured) data, it would be worth storing these data in a relational SQL database rather than in spreadsheet files. This practice helps to preserve the integrity of the data because it enforces the separation of data from analysis. (And I know a good consultant who can help you set up a database!)
The seventh rule for spreadsheets is to use a fixed spreadsheet template and to collect data in a series of spreadsheet files (rather than a series of tabs in a file). If you are collecting data from experiments that follow the same format, or make the same measurements or observations over different days, then consider setting up a template so that each collection of measurements is organized in the same way.
When analyzing your data, I’m going to be using a tool like R or a scripting language like python, and I will want to automate as much of that workflow as possible. To that end, using a spreadsheet template means the data is in a form that is expected (and anticipated by my R or python script) and makes the process of data inhalation easy, fast, and reliable.
Ideally, you’ll set up the data collection spreadsheet prior to collecting the data. It may also be a good idea to review the format of template after a few preliminary runs of data collection to make sure you have optimized the template before collecting most of your data.
(Even better, if you know who is going to do the analysis of your data, have that person review your spreadsheet template so that they can make suggestions to the template that will facilitate their analyses.)
There you have my seven rules for using spreadsheets to capture and organize your data for analyses and machine learning. Let me know if I’ve missed any useful rules!
- Murrell, Paul. Chapter 3: Data Intended for Human Consumption, not Machine Consumption in Bad Data Handbook, Edited by Q. Ethan McCallum, O’Reilly Media Inc., Sebastopol, 2013. This entire book is worth reading, although the topics in other chapters extend beyond the scope of this post.
- Pyle, Dorian. Data Preparation for Data Mining Academic Press, San Francisco, 1999. One of the first books I read on understanding how to best prepare data for machine learning.