Friday, December 3, 2021

Guidelines for archiving data AND code

The following is a cross-post from the Editor's blog of The American Naturalist, developed with input from various volunteers (credited below).



A CHECKLIST FOR REPRODUCIBLE ARCHIVING DATA AND CODE IN ECOLOGY, EVOLUTION, AND BEHAVIOR


December 1, 2021

Daniel I. Bolnick (daniel.bolnick@uconn.edu), Roger Schürch (rschurch@vt.edu), Daniel Vedder (daniel.vedder@idiv.de), Max Reuter (m.reuter@ucl.ac.uk), Leron Perez (leron@stanford.edu), Robert Montgomerie (mont@queensu.ca)



Starting January 1, 2022, The American Naturalist will require that any computer code (R scripts, Matlab scripts, Mathematica notebooks) used to generate reported results be archived in a public repository (we specifically suggest Dryad, see below). This has been our recommendation for a couple of years, and author compliance has been very common. As part of our commitment to Open and Reproducible Science, we are transitioning to make this a requirement. The following document, developed with input from a variety of volunteers, is intended to be a relatively basic guide to help authors comply with this new requirement.


RATIONALE


The fundamental question you should ask yourself is, “If a reader downloads my data and code, will my scripts be comprehensible, and will they run to completion and yield the same results on their computer?” Any computer code used to generate scientific results should be easily usable by reviewers or readers. Sharing this information is vital for many reasons. It promotes appropriate interpretation of results, validity checks, future data synthesis, and replication, and it serves as a teaching tool for students learning to do analyses themselves. Shared code provides greater confidence in results.


The recommendations below can help authors conduct a final check when finishing a research project, before setting it aside. In our experience, you will find it easier to build reusable clean code and data if you adhere to these recommendations from the start of your research project.


The following bullet points are meant to help you achieve usable code and reproducible data. In each category below, we list requirements and recommendations separately.



1. CLEAN DOCUMENTATION


  A useful template is available here: https://github.com/gchure/reproducible_research


REQUIRED:


➤ Prepare a single README file with important information about your data repository as a whole (code and data files). Text (.txt or .rtf) and Markdown (.md) README files are readable by a wider variety of free and open source software tools, so have greater longevity. The README file should simply be called README.txt (or .rtf or .md). The file should contain, in the following order:

Citation to the publication associated with the datasets and code 

Author names, contact details

A brief summary of what the study is about 

 The names of those responsible for collecting data and writing code

List all folders and files by name, and briefly describe their contents. For each data file, list all variables (e.g., columns) with a clear description of each variable (e.g., units)

Information about versions of packages and software used (including operating system) and dependencies (if these are not installed by the script itself). An easy way to get this information is to use sessionInfo() in R, or 'pip list --format freeze' in Python.
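
For Python users, the same version information can also be gathered from within a script. The following is a minimal sketch using only the standard library; the function name environment_report and the example package names are illustrative assumptions, not part of the journal's requirements.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Return text lines describing the OS, Python, and package versions."""
    lines = [
        f"Operating system: {platform.platform()}",
        f"Python: {platform.python_version()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of stopping the script.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines


if __name__ == "__main__":
    # Replace these example names with the packages your analysis imports.
    print("\n".join(environment_report(["numpy", "pandas"])))
```

The printed lines can be pasted into the README file.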


RECOMMENDED:

  Provide workflow instructions for users to run the software (e.g. explain the project workflow, and any configuration parameters of your software)

 Use informative names for folders and files (e.g. “code”, “data”, “outputs”)

  Provide license information (either in the README or a separate file), such as Creative Commons open source license language granting readers the right to reuse code. For more information on how to choose and write a license, see choosealicense.com.

 If applicable, list funding sources used to generate the archived data, and include information about permits (collection, animal care, human research).

  Link to Protocols.io or equivalent methods repositories where applicable



2. CLEAN CODE


REQUIRED:

  Scripts should start by loading required packages, and importing raw data in a format exactly as it is archived in your data repository.

  Use relative paths to files and folders (e.g. avoid setwd() with an absolute path in R), so other users can replicate your data input steps on their own computers. 

➤ Annotate your code with in-script comments indicating what the purpose of each set of commands is (i.e. “why?”). If the functioning of the code (i.e. “how?”) is unclear, strongly consider rewriting it to be clearer and simpler. In-line comments can provide specific details about a particular command.

➤ Annotate code to indicate how commands correspond to figure numbers, table numbers, or subheadings of results within the manuscript.

  If you are adapting other researchers’ published code for your own purposes, acknowledge and cite the sources you are using. Likewise, cite the authors of packages that you use in your published article.
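
As an illustration of the first two requirements above, here is a minimal Python sketch that builds every file path relative to a project root instead of hard-coding an absolute path. The function name load_raw_data and the file name data/wing_measurements.csv are hypothetical.

```python
import csv
from pathlib import Path


def load_raw_data(project_root, relative_path):
    """Read an archived CSV file from a path relative to the project root.

    Why: relative paths let other users run the script on their own
    computers without editing any folder names.
    """
    data_file = Path(project_root) / relative_path
    with data_file.open(newline="", encoding="utf-8") as handle:
        # Each row becomes a dict keyed by the column names in the header.
        return list(csv.DictReader(handle))


# Example use inside a script stored in the project root folder
# (the data file name is hypothetical):
#
#   root = Path(__file__).resolve().parent
#   rows = load_raw_data(root, "data/wing_measurements.csv")
```

Values are read as text exactly as archived; any conversion or cleaning then happens in later, commented steps of the script.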


RECOMMENDED:

  Test code before shipping, ideally on a pristine machine without any packages installed, but at least using a new session.

  Use informative names for input files, variables, and functions (and describe them in the README file).

  Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, for fully transparent documentation of any changes to the data.

  Organise your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis, and graphics and tables. Sections can be separate script files run in order (as explained in your README), blocks of code within one script that are separated by clear breaks (e.g., comment lines, #--------------), or a series of function calls (which can facilitate reuse of code).

  Label code sections with headers that match the figure number, table number, or text subheading of the paper.

  Omit extraneous code not used for generating the results of your publication, or place any such code in a Coda.

  Where useful, save and deposit intermediate steps as their own files. Particularly, if your scripts include computationally intensive steps, it can be helpful to provide their output as an extra file as an alternative entry point to re-running your code. 

  If your code contains any stochastic process (e.g., random number generation, bootstrap re-sampling), set a random number seed at least once at the start of the script or, better, for each random sampling task. This will allow other users to reproduce your exact results.

  Include clear error messages as annotations in your code that explain what might go wrong (e.g. if the user gave a text input where a numeric input was expected) and what the effect of the error or warning is.
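
A minimal Python sketch of two of the recommendations above, a fixed random seed and explicit error messages, might look like this. The function name bootstrap_means, its default seed, and the wording of its messages are illustrative assumptions.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=1000, seed=2021):
    """Return bootstrap means of `values`, reproducible for a given seed."""
    if not values:
        raise ValueError("No data supplied: 'values' must contain numbers.")
    for value in values:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(
                f"Expected numeric input but found {value!r}; "
                "check for text entries in the data column."
            )
    # A dedicated generator seeded for this task gives the same resamples
    # on every run, without affecting other random steps in the script.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))
        means.append(statistics.fmean(resample))
    return means
```

Calling the function twice with the same seed returns identical lists, which is what lets a reader reproduce a reported confidence interval exactly.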


3. CLEAN DATA

Checklist for preparing data to upload to DRYAD (or other repository)


3.a.  Repository contents 

REQUIRED: 

  All data used to generate a published result should be included in the archive. For papers with multiple experiments or sets of observations, this may mean a corresponding number of data files.

 Save each file with a short, meaningful file name and extension (see DRYAD recommendations here).

 Prepare a README text file to accompany the data. Our recommendation is to put this in the same README file described above. For complex repositories where this README becomes unmanageably long, you may opt to create a set of separate README files: one for the overall repository, with more specific files for code and for data. However, our preference is one README. The README file(s) should provide a brief overall description of each data file’s contents, and a list of all variable names with explanation (e.g. units). This should allow a new reader to understand what the entries in each column mean and relate this information to the Methods and Results of your paper. Alternatively, this may be a “Codebook” file in a table format with each variable as a row and columns providing variable names (in the file), descriptions (e.g. for axis labels), units, etc. 

 Save the README files as text (.txt) or Markdown (.md) files.

➤ Save all of the data files as comma-separated values (.csv) files. If your data are in Excel spreadsheets you are welcome to submit those as well (to be able to use colour coding and provide additional information, such as formulae), but each worksheet of data should also be saved as a separate .csv file.
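
One possible way to produce the “Codebook” table described above is sketched below in Python, using only the standard library. The column headings, the function name write_codebook, and the example entry are hypothetical; adapt them to your own variables.

```python
import csv

# One row per variable; these column headings are an assumption.
CODEBOOK_FIELDS = ["variable", "description", "units"]


def write_codebook(path, entries):
    """Write a codebook as a .csv file with one row per variable."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=CODEBOOK_FIELDS)
        writer.writeheader()
        writer.writerows(entries)


# Example use (the variable shown is hypothetical):
#
#   write_codebook("codebook.csv", [
#       {"variable": "wing_length",
#        "description": "Length of the right wing",
#        "units": "mm"},
#   ])
```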

RECOMMENDED:

 We recommend also archiving any digital material used to generate data (e.g., photos, sound recordings, videos, etc.), but this may use too much storage space for some repository sites. At a minimum, upload a few example files illustrating the nature of the material and a range of outcomes. We recognize that some projects entail too much raw data to archive all the photos, videos, etc. in their original state.


3.b. Data file contents and formatting 
 

REQUIRED: 

 Archived files should include raw data as they were when you first began analyses, not group means or other summary statistics; for convenience, summary statistics can be provided in a separate file, or generated by code archived with the data.

 Identify each variable (column names) with a short name. Variable names should preferably be <10 characters long and not contain any spaces or special characters that could interfere with reading the data and running analysis code. Use an underscore (e.g. wing_length) or camel case (e.g., WingLength) to distinguish words if you think that is needed.

RECOMMENDED: 

 Omit variables not analyzed in your code, for brevity.

 Use a common data structure in which every observation is a row and every variable is a column.

 Each column should contain only one data type (e.g. do not mix numerical values and comments or categorical scores in a single column).

  Use “NA” or equivalent to indicate missing data (and specify what you use in the README file)
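
The formatting points above can be checked automatically before upload. The following Python sketch is one way to do so, under stated assumptions: the pattern for acceptable column names, the function names, and the choice of “NA” as the missing-data code are all illustrative.

```python
import csv
import io
import re

# Assumed rule: names start with a letter and contain only letters,
# digits, and underscores (no spaces or special characters).
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_column_names(names):
    """Return the column names that break the naming rule above."""
    return [name for name in names if not NAME_PATTERN.match(name)]


def parse_numeric_column(rows, column, missing="NA"):
    """Convert one column to floats, with None for missing values.

    Raises ValueError if a cell mixes in comments or other text.
    """
    values = []
    # Row 1 of the file is the header, so data start at row 2
    # (assuming one line per record).
    for row_number, row in enumerate(rows, start=2):
        cell = row[column].strip()
        if cell == missing:
            values.append(None)
            continue
        try:
            values.append(float(cell))
        except ValueError:
            raise ValueError(
                f"Column '{column}', row {row_number}: expected a number "
                f"or '{missing}' but found '{cell}'."
            ) from None
    return values


if __name__ == "__main__":
    example = "id,wing_length\n1,10.5\n2,NA\n"
    rows = list(csv.DictReader(io.StringIO(example)))
    print(check_column_names(rows[0].keys()))
    print(parse_numeric_column(rows, "wing_length"))
```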



4. COMPLETING YOUR ARCHIVE

REQUIRED:

 Upload your data and code to a curated, version-controlled repository (e.g. DRYAD, Zenodo). Your own GitHub account (or other privately or agency controlled website) does not qualify as a public archive because you control access and might take down the data at a later date.

 Provide all of the metadata and information requested by the repository, even if this is optional and redundant with information contained in the README files. Metadata makes your archived material easier to find and understand.

 From the repository, get a private URL and provide this on submission of your manuscript so that editors and reviewers can access your archive before your data are made public.

 

RECOMMENDED:
 Prepare your data, code, and README files before or during manuscript preparation (analysis and writing).



5. FOR MORE INFORMATION



More detailed guides to reproducible code principles can be found here:



Documenting Python Code: A Complete Guide - https://realpython.com/documenting-python-code/



A guide to reproducible code in Ecology and Evolution, British Ecological Society: https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf?utm_source=web&utm_medium=web&utm_campaign=better_science 



Dockta tools for building code repositories:  https://github.com/stencila/dockta#readme



Dependency management and packaging for Python projects:  https://python-poetry.org/



Principles of Software Development - an Introduction for Computational Scientists (https://doi.org/10.5281/zenodo.5721380), with an associated code inspection checklist (https://doi.org/10.5281/zenodo.5284377).



Style Guide for Data Files

 See the Google R style guide (https://google.github.io/styleguide/Rguide.html) and the Tidyverse style guide (https://style.tidyverse.org/syntax.html#object-names) for more information

Google style guide for Python: https://google.github.io/styleguide/pyguide.html

 

6. WHY DRYAD OR ZENODO?


The American Naturalist requests that authors use DRYAD or ZENODO for their archives when possible. 
  • DRYAD and Zenodo are curated, which means that there is some initial checking by DRYAD for completeness and consistency in both the data files and the metadata. DRYAD requires some compliance before they will allow a submission.
  • We are finding it much easier and more convenient to download repositories from DRYAD or Zenodo than to search the manuscript for the files or repository.
  • Files on DRYAD or Zenodo cannot be arbitrarily deleted or changed by authors or others after publication. DRYAD will allow changes if a good case can be made; such changes are documented and original versions are retained.
  • DRYAD is free for Am Nat authors, and we have a good working relationship with them; they take our suggestions for improvement seriously.
  • Editors, reviewers, and authors will all become familiar with the workings of DRYAD.

Wednesday, November 24, 2021

Draft Checklist for Code and Data Archiving

 The following is a work-in-progress, posted here to obtain feedback. The goal is a succinct, user-friendly document presenting authors with an accessible and relatively basic set of recommendations for how to comply with journal requirements for data archives. At this point it is a good idea to archive both your data and your code: some journals now require one or the other, and you may expect both to be required soon. Feedback is welcome!


A CHECKLIST FOR REPRODUCIBLE ARCHIVING DATA AND CODE

IN ECOLOGY, EVOLUTION, AND BEHAVIOR

November 24, 2021

Daniel I. Bolnick (daniel.bolnick@uconn.edu), Roger Schürch (rschurch@vt.edu), Daniel Vedder (daniel.vedder@idiv.de), Max Reuter (m.reuter@ucl.ac.uk)


RATIONALE

The fundamental question you should ask yourself is, “If a reader downloads my data and code, will my scripts be comprehensible, and will they run to completion and yield the same results on their computer?” Any computer code used to generate scientific results should be easily usable by reviewers or readers. Sharing this information is vital for many reasons. It promotes appropriate interpretation of results, validity checks, future data synthesis, and replication, and it serves as a teaching tool for students learning to do analyses themselves. Shared code provides greater confidence in results.


The following bullet points are meant to help you reach this goal. High priority points are in blue font, while black font indicates suggestions to follow ‘best practices’.



1. CLEAN DOCUMENTATION

➤  Prepare a README_SUMMARY file with important information about your repository as a whole (code and file contents). Text (.txt) README files are readable by a wider variety of software tools, so have greater longevity. The file should include:

  • Author names, contact details.

  • A brief summary of what the study is about 

  • Link to publication or preprint if available

  • The names of those responsible for collecting data and writing code.

  • The versions of all packages and software you used (including the operating system), and dependencies (if these are not installed by the script itself). For instance, in R you can use sessionInfo().

  • Overview of folders/files and their contents

  • Workflow instructions for users to run the software (e.g. explain the project workflow, and any configuration parameters of your software)

  • For larger software projects: instructions for developers (e.g. the structure and interactions of submodules), and any subsidiary documentation files.

  • Links to protocols.io or equivalent methods repositories, where applicable

 Use informative names for folders and files (e.g. “code”, “data”, “outputs”)

 Give license information (either in the README_SUMMARY or a separate file), such as Creative Commons open source license language granting readers the right to reuse code. For more information on how to choose and write a license, see choosealicense.com.

➤If applicable, list funding sources used to generate the data archive, and include information about permits (collection, animal care, human research).



2. CLEAN CODE

Thoroughly annotate your code with in-script comments indicating what each set of commands is meant to do, and why.

 Scripts should start by loading required packages, and importing raw data in a format exactly as it is archived in your data repository.

 Use relative paths to files and folders (e.g. avoid setwd() with an absolute path in R), so other users can replicate your data input steps on their own computers. 

 Where useful (e.g. if you have a lot of files) have one root directory (folder), with sub-directories containing data, code, outputs, figures, etc. Use this root directory as the address to which all other relative paths refer.

 Use informative names for input files, and variables

 Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, for fully transparent documentation of any changes to the data

 Organise your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis, and graphics and tables. Sections can be separate script files run in order (as explained in your README_SUMMARY), blocks of code within one script that are separated by clear breaks (e.g., comment lines, #--------------), or a series of functions (which can facilitate reuse of code for future work). Aim for 300 - 800 lines of code per file for easy review and proofreading; functions should not be longer than a screen.

 Label code sections with headers that match the figure number, table number, or text subheading of the paper.

 Omit extraneous code not used for generating the results of your publication, or place any such code in a Coda.

 Save intermediate steps as their own files. For instance, if you use raw data to calculate a table of group means and then run further analyses on those group means, provide both the raw data files and the intermediate table of means. Similarly, if your scripts include computationally intensive steps, you can provide their output as an extra file as an alternative entry point to re-running your code.

 If your code contains any stochastic process (e.g., random number generation, bootstrap re-sampling), set a random number seed at least once at the start of the script or, better, for each random sampling task. This will allow other users to reproduce your exact results.

 Test code before shipping, ideally on a pristine machine without any packages installed, but at least using a new session.

 If you are adapting other researchers’ published code for your own purposes, acknowledge and cite the sources you are using. Likewise, cite the authors of packages that you use in your published article.



3. CLEAN DATA


Checklist for preparing data to upload to DRYAD or other repository

 Repository contents 

All data used to generate a published result should be included in the archive. For papers with multiple experiments, this may mean a corresponding number of data files.

➤  Save each file with a short, meaningful file name (see DRYAD recommendations here), except the README_DATA file(s), which should just be called README_DATA.txt.

Prepare a README_DATA text file to accompany each data file. The README_DATA should provide a brief overall description of the file’s contents, and a list of all variable names with explanation (e.g., units) so that a new reader can understand what the numbers or other data in each column mean and relate this information to the Methods and Results of your paper. Alternatively, this may be a “Codebook” file in a table format with each variable as a row and columns providing variable names (in the file), descriptions (e.g., for axis labels), units, etc. 

Save the README_DATA files as text (.txt) files and all of the data files as comma-separated values (.csv) files.

➤  If your data are in Excel spreadsheets you are welcome to submit those as well (to indicate colour coding and provide additional information, such as formulas), but each worksheet of data should also be saved as a separate .csv file.

We recommend archiving all files used to generate data (e.g., photos, videos, etc.), but this may use too much storage space for some repository sites. At a minimum, upload a few example files illustrating the range of outcomes. 

Data file contents and formatting

Archived files should include raw data, not simply group means or other summary statistics; such summary statistics can be provided in a separate file, or generated by code archived with the data.

➤  Identify each variable (column names) with a short name. Names should preferably be <10 characters long and not contain any special characters that could interfere with reading the data and running analysis code. Use an underscore (e.g. wing_length) or camel case (e.g., WingLength) to distinguish words if you think that is needed.

Omit variables not analyzed in the publication, for brevity.

Use a common data structure in which every observation is a row and every variable is a column.

 Ensure that each column contains only one kind of data (e.g., do not mix numerical values and comments or categorical scores in a single column).

 Use “NA” or equivalent to indicate missing data (and specify what you use in the README file).




4. COMPLETING YOUR ARCHIVE


➤  It is a good habit to prepare your data and code archive, and associated README files, simultaneously with manuscript preparation (analysis and writing).

➤  Data and code should be archived on version-controlled repositories (e.g., DRYAD, ZENODO). Your own GitHub account (or other privately controlled website) does not qualify as a public archive because it does not provide a DOI, and you control access and might take down the data at a later date.

➤  Provide all of the metadata and information requested by the repository, even if this is not required and redundant with information contained in the README files. Metadata makes your data easier to find and understand.

➤   From the repository, get a private URL that editors and reviewers can use before your data are made public with a DOI. Provide that private URL on submission of your manuscript.

 

 



FOR MORE INFORMATION


More detailed guides to reproducible code principles can be found here:


A guide to reproducible code in Ecology and Evolution, British Ecological Society: https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf?utm_source=web&utm_medium=web&utm_campaign=better_science 


Dockta tools for building code repositories:

https://github.com/stencila/dockta#readme


Principles of Software Development - an Introduction for Computational Scientists (https://doi.org/10.5281/zenodo.5721380), with an associated code inspection checklist (https://doi.org/10.5281/zenodo.5284377).


Style Guide for Data Files

 See the Google R style guide (https://google.github.io/styleguide/Rguide.html) and the Tidyverse style guide (https://style.tidyverse.org/syntax.html#object-names) for more information

