Skip to Main Content

Reproducible Research Methods

Introduction

Many people use spreadsheets in their research and work.  A pitfall, especially when trying to create reproducible research, is trying to make the spreadsheet pleasant to look at, such as emulating a lab notebook.  Unfortunately, this will make the data more difficult to easily use, reuse, and share.  While data in spreadsheets has a dichotomy between human-readability and machine-readability, machine-readability should take precedence. 

 

There are other methods to make data-entry easier, such as using a data-entry form connected to a database (i.e. SQL or Access) or spreadsheet (Excel or Google Sheets).  Similarly, there are other methods to best present the data, such as visualizations or formatted tables.  However, in order to use visualizations - and reuse the same visualization on similar data - the data must be organized in a manner where computer programs can easily read them. 

Data in Spreadsheets

In general, data in a spreadsheet should be organized with columns of variables, and rows of observation.  Each column and row should represent one thing, either one variable or one observation. Research is easier to perform and reproduce if the data structure is determined prior to conducting research. 

Common Data Formatting Errors in Spreadsheets (and how to avoid them)

  • Multiple tables within one sheet: while this might make the data easier to look at (for the person who created the multiple tables), it will sometimes confuse other people, and will absolutely cause a computer to try and combine data from multiple observations into one. It's best to keep one table on one sheet, with one column for each variable, and one row for each observation. 
     
  • Multiple tabs in one file: While there are libraries/packages for programming languages such as R and Python to read data stored in separate tabs in an excel workbook, there still might be issues with storing data in this manner, and so it should be used with caution.  Issues arise because, ultimately, excel is a proprietary product that is subject to change over time. This requires developers to continuously develop and upgrade their packages/libraries, and can also require additional costs to be passed along to the end user.  It is for this reason that we suggest using the open and standard CSV format to store and share structured data.
     
  • Blank Cells: Viewers don't know what a blank cell means. Is it a null value? Is it a zero? Is it accidentally missing data? Be explicit and fill in all cells with values. 
     
  • Problematic null values: Null values can be tricky because they can be represented by different values depending on the programming language. The best practice for null values is the same practice for good data: clearly state what the value is in a data dictionary and be consistent. 
     
  • Using extensive formatting in the sheet to convey information and/or make the data sheet look pretty: While formatting a sheet (sometimes) makes it easier for humans to read, it makes it near impossible for computers (and sometimes other people) to read. Formatting your data as structured in a CSV file makes it easy for computer programs and other people to associate values in a cell with the appropriate observation and variable. 
     
  • Placing more than one datum within a cell, comments, units in cells, and/or entering more than one piece of information in a cell: similar to formatting, adding additional information in cells will throw errors or read the data incorrectly when a program loads the file. Place only datum (singular or structured) within a cell.
     
  • Problematic field names: field names should be concise and descriptive to the variable in the column. Avoid abbreviations not common to the field. Also, avoid spaces within the field name, as spaces are sometimes read a field separators by different programs. Underscores and CamelCase (i.e. no spaces) are good alternatives. 
     
  • White space and other special characters in data: White space characters (new line, tabs, etc) are read by computer programs as different things; usually new observations, or a different column. Additionally, characters such as the pipe ( | ), em-dash, and commas are read by programs as something other than data. It's best to avoid including these types of characters in your data.