Friday, October 23, 2020

Data Archiving (and Open Access)

The following is a guest post by Bob Montgomerie, Queen's University, written with input from Dan Bolnick. It was first posted on the American Naturalist Editors blog and is cross-posted here for more visibility, as the issues span far beyond The American Naturalist.




As you may know, this is Open Access Week (19-26 October), celebrating the progress made so far in making science open and accessible to all, while reminding us that there are still some challenges. One key feature of open science is providing free access to the data underlying published reports. To that end, The American Naturalist requires that all authors make their data available either on Dryad (free to authors), on another public repository, or as a supplement published along with their paper. My own experience seeking data from work published in a wide variety of journals has been, shall we say, mixed in recent years. So, it seemed like a good time to assess the availability and quality of the data recently made available with American Naturalist papers.
 
To evaluate the quality of data archiving with The American Naturalist, I looked at 100 papers published in 2020 (the first 50 and the most recent 50). Of those 100 papers, at least 78 were based on data that should have been made available; the others were reviews, commentaries, or model-based (though some of those models seem to use data). The good news is that all but four of those 78 papers had made data available, either on Dryad (56 papers), on other public repositories (3), or as appendices or supplements published with the paper (12). Three papers have data embargoed for a while, and only 4 papers made none of the data available. This is, in my experience, a remarkably high level of compliance.
 
The downside is that only 7 of the 56 papers with datasets on Dryad provided data in a way that I, and I assume most users, would find convenient or even comprehensible. Here are the main issues:
•         no README (or any other) file explaining the variable names in data files
•         data files in Excel and other formats that are not easily read by statistical software. Yes, I know that R can read Excel files, but only if they are set up properly, and many of those were not
•         odd file extensions that are not explained. I counted more than 30 file extensions in those 58 repositories, a few of which I had never heard of, and many of which are unlikely to be accessible without expensive or eventually obsolete software
•         not all data made available—I did not check every paper, as that would have taken too long, but I did check a few and could not find the data supporting a couple of figures and tables. Sometimes authors provided summary data (means, SDs) and not the raw data from which those summaries were calculated.
•         no R, Python or other scripts or notebooks to replicate the analyses
•         analysis code not well-enough annotated to be comprehensible
•         code that does not run, presumably created in earlier versions of the software with unknown packages and package versions
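
Several of the issues listed above can be caught with a quick automated check before an archive is submitted. The following is only a minimal sketch in Python, not a tool used by The American Naturalist or Dryad: the function name audit_archive and the two lists of file extensions are assumptions made for illustration, and any real check would need to follow the journal's own guidelines.

```python
from pathlib import Path

# Assumed, illustrative lists; they are not taken from any journal's guidelines.
OPEN_DATA_EXTS = {".csv", ".tsv", ".txt", ".json"}
SCRIPT_EXTS = {".r", ".py", ".rmd", ".ipynb"}


def audit_archive(root):
    """Return a list of plain-language warnings for a data archive folder."""
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    warnings = []

    # Issue: no README (or any other) file explaining the variables.
    if not any(p.stem.lower() == "readme" for p in files):
        warnings.append("no README file found")

    # Issue: no scripts or notebooks to replicate the analyses.
    if not any(p.suffix.lower() in SCRIPT_EXTS for p in files):
        warnings.append("no analysis scripts or notebooks found")

    # Issue: Excel files and unexplained extensions.
    for p in files:
        if p.stem.lower() == "readme":
            continue
        ext = p.suffix.lower()
        if ext not in OPEN_DATA_EXTS and ext not in SCRIPT_EXTS:
            warnings.append(f"unfamiliar or proprietary format: {p.name}")

    return warnings


if __name__ == "__main__":
    # Hypothetical usage: audit the current folder and print each warning.
    for message in audit_archive("."):
        print(message)
```

A check like this cannot tell whether a README actually explains the variables, whether all the data are present, or whether the code runs; those still need a human reader.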
 
None of this is unexpected as (i) this whole idea about making data available is relatively new and not often part of our formal training in graduate school, (ii) journals rarely (ever?) provide guidelines for authors that detail what they consider to be best practices, and (iii) most journals have nobody checking to see if authors have actually complied with their requirements. There are many excellent reasons for all of us to want data to be freely available for every published study and I feel that we should take pride in doing as good a job with that as we do with our published papers. Good data will always be useful, whereas most papers have a short half-life if citation metrics are any indication.
 
The American Naturalist is now publishing guidelines for best practices in data archiving (see author instructions) and will have a small team of data editors checking each paper’s data repository to make sure that it is complete, comprehensive, and adequately documented. We are probably the first biology journal to fully embrace the value of open data in this fashion and we welcome your comments as we put this policy in place.  We also now encourage authors to submit private Dryad data links upon submission, so reviewers and editors have the option of checking compliance before manuscript acceptance (see Author Instructions for Submission for details). We will be asking authors resubmitting revisions to provide data links for checking prior to final acceptance.

If you find that a published paper's Dryad or related archive is unusable (incomplete or unclear), please contact the author and ask that they fix the deficiencies, with a cc to the editor. Since 2011, The American Naturalist has made complete data archiving (sufficient to reproduce the analyses and results) a condition of publication. Authors who have not done so are failing to live up to their side of the bargain that led to their publication.

It is now editorial policy that The American Naturalist reserves the right to publish Editorial Expressions of Concern when we are made aware of grossly deficient data archives that are not amended in a reasonable amount of time. In extreme cases, we reserve the right to retract papers that are not supported by appropriately archived data, or to hold up an author's future submissions until past deficiencies are amended. However, we also recognize that new policies entail growing pains and that compliance is understandably imperfect as we adjust to a new culture of more rigorous and complete data sharing.
