Wednesday, November 24, 2021

Draft Checklist for Code and Data Archiving

 The following is a work in progress, posted here to obtain feedback. The goal is a succinct, user-friendly document that gives authors an accessible and relatively basic set of recommendations for complying with journal requirements for data archives. Archiving both your data and your code is already good practice: some journals now require one or the other, and you can expect both to be required soon. Feedback is welcome!


A CHECKLIST FOR REPRODUCIBLE ARCHIVING OF DATA AND CODE

IN ECOLOGY, EVOLUTION, AND BEHAVIOR

November 24, 2021

Daniel I. Bolnick (daniel.bolnick@uconn.edu), Roger Schürch (rschurch@vt.edu), Daniel Vedder (daniel.vedder@idiv.de), Max Reuter (m.reuter@ucl.ac.uk)


RATIONALE

The fundamental question you should ask yourself is, “If a reader downloads my data and code, will my scripts be comprehensible, and will they run to completion and yield the same results on their computer?” Any computer code used to generate scientific results should be easily usable by reviewers or readers. Sharing this information is vital for many reasons: it promotes appropriate interpretation of results, validity checks, future data synthesis, and replication, and it serves as a teaching tool for students learning to do analyses themselves. Shared code also provides greater confidence in results.


The following bullet points are meant to help you reach this goal. High priority points are in blue font, while black font indicates suggestions to follow ‘best practices’.



1. CLEAN DOCUMENTATION

➤  Prepare a README_SUMMARY file with important information about your repository as a whole (code and file contents). Plain-text (.txt) README files are readable by a wider variety of software tools, so they have greater longevity.

  • Author names, contact details.

  • A brief summary of what the study is about 

  • Link to publication or preprint if available

  • Identify who is responsible for collecting data, and writing code.

  • The versions of all packages and software you used (including the operating system), and dependencies (if these are not installed by the script itself). For instance, in R you can use sessionInfo().

  • Overview of folders/files and their contents

  • Workflow instructions for users to run the software (e.g. explain the project workflow, and any configuration parameters of your software)

  • For larger software projects: instructions for developers (e.g. the structure and interactions of submodules), and any subsidiary documentation files.

  • Links to protocols.io or equivalent methods repositories, where applicable

 Use informative names for folders and files (e.g. “code”, “data”, “outputs”)

 Give license information (either in the README_SUMMARY or in a separate file), such as Creative Commons or open-source license language granting readers the right to reuse your code. For more information on how to choose and write a license, see choosealicense.com.

➤  If applicable, list funding sources used to generate the data archive, and include information about permits (collection, animal care, human research).
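
As an illustration of the package-version item above: the checklist mentions sessionInfo() for R. For a project written in Python, a rough analogue might look like the sketch below. It uses only the standard library. The function name describe_environment and the package names in the usage lines are placeholders chosen for this example, not part of any journal requirement.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return text lines listing the OS, Python version, and package versions."""
    lines = [
        f"Operating system: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of stopping with an error.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines


if __name__ == "__main__":
    # Placeholder package list: replace with the packages your scripts import.
    print("\n".join(describe_environment(["numpy", "pandas"])))
```

The printed lines can be pasted into the README_SUMMARY so that readers know which versions produced the published results.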



2. CLEAN CODE

Thoroughly annotate your code with in-script comments indicating what each set of commands is meant to do, and why.

 Scripts should start by loading required packages, and importing raw data in a format exactly as it is archived in your data repository.

 Use relative paths to files and folders (e.g. avoid setwd() with an absolute path in R), so other users can replicate your data input steps on their own computers. 

 Where useful (e.g. if you have a lot of files) have one root directory (folder), with sub-directories containing data, code, outputs, figures, etc. Use this root directory as the address to which all other relative paths refer.

 Use informative names for input files and variables.

 Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, so that every change to the data is transparently documented.

 Organise your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis, and graphics and tables. Sections can be separate script files run in order (as explained in your README_SUMMARY), blocks of code within one script that are separated by clear breaks (e.g., comment lines, #--------------), or a series of functions (which can facilitate reuse of code for future work). Aim for 300 to 800 lines of code per file for easy review and proofreading; functions should not be longer than a screen.

 Label code sections with headers that match the figure number, table number, or text subheading of the paper.

 Omit extraneous code not used for generating the results of your publication, or place any such code in a Coda.

 Save intermediate steps as their own files. For instance, if you use raw data to calculate a table of group means and then run further analyses on those group means, provide both the raw data files and the intermediate table of means. Similarly, if your scripts include computationally intensive steps, you can provide their output as an extra file as an alternative entry point to re-running your code.

 If your code contains any stochastic process (e.g., random number generation, bootstrap re-sampling), set a random number seed at least once at the start of the script or, better, for each random sampling task. This will allow other users to reproduce your exact results.

 Test code before shipping, ideally on a pristine machine without any packages installed, but at least using a new session.

 If you are adapting other researchers’ published code for your own purposes, acknowledge and cite the sources you are using. Likewise, in your published article, cite the authors of the packages that you use.
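
Several of the points above (relative paths from one root directory, data manipulations done in the script, a saved intermediate table, and a fixed random seed) can be sketched together. The example is in Python for illustration only. The file names, column names, and function names are hypothetical, and the bootstrap interval is a simple percentile approximation, not a recommended method.

```python
import csv
import random
from pathlib import Path
from statistics import mean


def group_means(rows, group_col, value_col):
    """Compute the mean of value_col within each level of group_col."""
    groups = {}
    for row in rows:
        if row[value_col] == "NA":  # skip missing values
            continue
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return {group: mean(values) for group, values in sorted(groups.items())}


def bootstrap_mean_ci(values, n_boot, seed):
    """Approximate 95% percentile bootstrap interval for the mean."""
    rng = random.Random(seed)  # seeded generator: same seed, same resamples
    boots = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    return boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1]


def run(root):
    """Read raw data and save intermediate group means, relative to root."""
    raw_file = root / "data" / "wing_length.csv"     # hypothetical raw data
    out_file = root / "outputs" / "group_means.csv"  # intermediate table
    with raw_file.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    means = group_means(rows, "site", "wing_length")
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["site", "mean_wing_length"])
        writer.writerows(means.items())
    return means
```

A script stored in the "code" folder could call run(Path(__file__).resolve().parent.parent), so that every path is resolved from the project root and not from a location on the author's computer.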



3. CLEAN DATA


Checklist for preparing data to upload to DRYAD or another repository

 Repository contents 

All data used to generate a published result should be included in the archive. For papers with multiple experiments, this may mean a corresponding number of data files.

➤  Save each file with a short, meaningful file name (see DRYAD recommendations here), except the README_DATA file(s) which should just be called README_DATA.txt

Prepare a README_DATA text file to accompany each data file. The README_DATA should provide a brief overall description of the file’s contents and a list of all variable names with explanations (e.g., units), so that a new reader can understand what the numbers or other data in each column mean and relate this information to the Methods and Results of your paper. Alternatively, this may be a “Codebook” file in table format, with each variable as a row and columns providing variable names (as used in the file), descriptions (e.g., for axis labels), units, etc.

Save the README_DATA files as text (.txt) files and all of the data files as comma-separated values (.csv) files.

➤  If your data are in EXCEL spreadsheets, you are welcome to submit those as well (to indicate colour coding and provide additional information, such as formulas), but each worksheet of data should also be saved as a separate .csv file.

We recommend archiving all files used to generate data (e.g., photos, videos, etc.), but this may require more storage space than some repository sites allow. At a minimum, upload a few example files illustrating the range of outcomes.

Data file contents and formatting

Archived files should include raw data, not simply group means or other summary statistics; such summary statistics can be provided in a separate file, or generated by code archived with the data.

➤  Identify each variable (column name) with a short name. Names should preferably be fewer than 10 characters long and should not contain any special characters that could interfere with reading the data and running analysis code. Use an underscore (e.g., wing_length) or camel case (e.g., WingLength) to distinguish words if you think that is needed.

Omit variables not analyzed in the publication, for brevity.

A common data structure has every observation as a row and every variable as a column.

 Keep only one kind of data in each column (e.g., do not mix numerical values and comments or categorical scores in a single column).

 Use “NA” or equivalent to indicate missing data (and specify what you use in the README file).
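
The formatting points above can be checked automatically before upload. The following is a minimal sketch in Python using only the standard library. The function names, the warning wording, and the sample table are assumptions made for this example, and the character rule (letters, digits, and underscores only) is one possible reading of "no special characters".

```python
import csv
import io
import re

NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
MISSING = "NA"  # missing-data code; state your choice in the README


def is_number(text):
    """Return True if text can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_table(handle, max_name_length=10):
    """Return a list of warnings about column names and column contents."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    warnings = []
    for name in reader.fieldnames or []:
        column = [(row[name] or "") for row in rows]
        if len(name) >= max_name_length:
            warnings.append(
                f"{name}: name is {max_name_length} or more characters long"
            )
        if not NAME_PATTERN.match(name):
            warnings.append(f"{name}: name contains spaces or special characters")
        values = [cell for cell in column if cell not in ("", MISSING)]
        numeric = [is_number(cell) for cell in values]
        if any(numeric) and not all(numeric):
            warnings.append(f"{name}: mixes numbers and text")
        if "" in column:
            warnings.append(f"{name}: empty cells; use {MISSING} for missing data")
    return warnings


if __name__ == "__main__":
    # Made-up sample table with deliberate problems, for demonstration only.
    sample = io.StringIO("site,wing length (mm),mass\nA,10.2,3.1\nB,NA,heavy\nC,,2.0\n")
    for warning in check_table(sample):
        print(warning)
```

Warnings of this kind are prompts for a manual look, not hard rules: a slightly longer column name, for example, may be perfectly acceptable if it is explained in the README_DATA file.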




4. COMPLETING YOUR ARCHIVE


➤  It is a good habit to prepare your data and code archive, and associated README files, simultaneously with manuscript preparation (analysis and writing).

➤  Data and code should be archived on version-controlled repositories (e.g., DRYAD, ZENODO). Your own GitHub account (or another privately controlled website) does not qualify as a public archive, because it does not provide a DOI and because you control access and might take down the data at a later date.

➤  Provide all of the metadata and information requested by the repository, even if it is not required or is redundant with information contained in the README files. Metadata makes your data easier to find and understand.

➤  Obtain from the repository a private URL that editors and reviewers can use before your data are made public with a DOI. Provide that private URL when you submit your manuscript.

 

 



FOR MORE INFORMATION


More detailed guides to reproducible code principles can be found here:


A guide to reproducible code in Ecology and Evolution, British Ecological Society: https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf?utm_source=web&utm_medium=web&utm_campaign=better_science 


Dockta tools for building code repositories:

https://github.com/stencila/dockta#readme


Principles of Software Development - an Introduction for Computational Scientists (https://doi.org/10.5281/zenodo.5721380), with an associated code inspection checklist (https://doi.org/10.5281/zenodo.5284377).


Style Guide for Data Files

 See the Google R style guide (https://google.github.io/styleguide/Rguide.html) and the Tidyverse style guide (https://style.tidyverse.org/syntax.html#object-names) for more information

