Friday, May 20, 2022

Not by vaults and locks: To archive code, or not?

The cases for and against archiving code

by Bob Montgomerie (Queen’s University)

Data Editor—The American Naturalist


As of the beginning of this year, The American Naturalist has required all authors of submitted manuscripts to upload to their data repository all of the code they used for analysis and modelling. About 80% of authors were already doing that by the time their paper was published, so it seemed that a general requirement for code at the reviewing stage would be useful for reviewers and editors, and especially for end users once the paper was published. Note that this requirement applies only if code was used; authors using drag-and-drop GUI software for analyses can generate code reports documenting their steps, but rarely do so.


As data editor, I have now looked at about 100 scripts of R and Python code provided by authors. Based on that experience and my own coding practices over the past 55 years(!), I am not so sure that requiring code in data repositories is such a good idea. To that end, Dan Bolnick (current EIC), Volker Rudolf (incoming EIC), and I will hold a discussion session about this issue, and data repositories in general, at the Asilomar meeting of The American Society of Naturalists in January 2023.





To provide a basis for that discussion, here are what I see as the cases for and against providing code, starting with the cons:


A1. Workload: For most of us, cleaning up the code we used for analysis or modelling can be a ton of work—organizing, deleting extraneous material, annotating, checking that results match what is reported, and providing workflow information. It can also be a bit daunting to think that someone is going to look at your crappy coding practices—I know that it took me a month of work and deliberation before I posted my first code script in a public repository. Even now, having done this a couple of dozen times, I know that I am looking at a full day’s work to bring my code up to my rather minimal standards of usability and transparency.


Let me allay one fear right here—none of the 100+ code scripts I have looked at so far has been horrible. Everyone using R and Python seems to have adopted reasonable coding—though not necessarily reasonable statistical—practices. One person wrote code that was so compact and so elegant that it took me hours to figure out what the author had done. Normally I would not spend more than a few minutes looking at code submitted with a manuscript, but I thought I might learn something useful, which I did. While I appreciated that author’s expertise, I would rather read and use code that is a little more friendly to the average user like me. My guess is that nobody cares how inelegant your code is as long as it is well annotated and gets the job done.


On requiring code for papers submitted to The American Naturalist, Dan Bolnick (the journal’s EIC) was a little worried that the extra work, and the exposure of mediocre coding practices, would reduce submissions. At first there appeared to be a small downturn in submission rates at the start of 2022, but there are other potential causes for that (pandemic burnout, gearing up for a potential renewal of field work, etc.). Even so, if this requirement discourages authors who are unwilling to be transparent about their analyses, then the journal and science benefit.


A2. Usefulness: One purpose, ostensibly, of published code is transparency, allowing reviewers, editors, and readers to replicate what is reported in the manuscript and eventual publication. But my experience so far is that hardly anybody does this. Reviewers and editors rarely seem to comment on the code that authors provide. Even the data editors do not check whether the code produces output that matches what is in the manuscript, nor whether all of the reported results are dealt with in the code. We do look at the code, but only to address some formatting and accessibility issues (see A3 below).


As a user of published data, I much prefer to write my own code to see if the results I get match those reported by the authors. If there is a mismatch, it might be interesting—but rather tedious—to figure out where the authors went wrong, but I am not sure there is much advantage to such forensic work unless scientific misconduct is suspected. A disturbing proportion of the authors’ code that I have looked at so far contains what I would call statistical, rather than coding, errors, especially with respect to violating important assumptions or ignoring without comment any warnings or errors thrown up by the code. Checking for those errors should really be the job of reviewers and editors but is probably considered far too time-consuming.
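Because silently ignored warnings are such a common issue, here is a minimal sketch, in Python, of one way an author could make them impossible to miss. The specific filter choices are purely illustrative, not something the journal requires.

import warnings

# Promote every warning to an error so it cannot be silently skipped
warnings.simplefilter("error")

# ... analysis code goes here ...

# If a particular warning is expected and acceptable, tolerate it locally and
# say why, rather than hiding all warnings globally
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    pass  # e.g., a known edge case that is handled downstream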


A3. Accessibility: The vast majority of R and Python scripts that I have looked at so far will not run ‘as is’ on my computer. The most common reasons are that authors write scripts that (i) try to load files from the author’s own hard drive, or even files that do not exist in their data repository, (ii) are so poorly annotated that their purpose is far from clear, (iii) use packages that are not on CRAN or other standard repositories, with no information about where to get them, (iv) throw up error messages with no comment as to why those errors were ignored, (v) use deprecated functions and packages from versions much earlier than the current ones (often with no package version information provided), or (vi) require a specific, and unspecified, workflow. I have enough experience that I can usually deal with those deficiencies in minutes, but I worry about other users simply throwing up their hands when code does not work, or not being willing to put in the time and effort to fix it. Moreover, code written today will not necessarily run ‘as is’ in a few months or years, as coding environments, packages, and functions evolve and are often deprecated. Unlike well-documented datasets saved in a universal, simple format like csv, code has the potential to become obsolete and almost unusable in the future. Can anyone open my VisiCalc files—my go-to spreadsheet app in the late 1970s?
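For concreteness, here is a hypothetical Python script header showing how a few of those problems can be avoided. The directory, file, and column names are invented for illustration, not taken from any submitted manuscript.

import sys
from pathlib import Path

import pandas as pd

# (v) record version information so future users can recreate the environment
print(f"Python {sys.version.split()[0]}, pandas {pd.__version__}")

# (i) load data relative to the script, not from a path on my own hard drive
DATA_DIR = Path(__file__).resolve().parent / "data"
nests = pd.read_csv(DATA_DIR / "nest_measurements.csv")  # placeholder file name

# (vi) state each workflow step and what it corresponds to in the manuscript
# Step 1: summarize clutch size by year (reported in Table 1)
print(nests.groupby("year")["clutch_size"].describe())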


The data editors at The American Naturalist ask authors to fix their code when we identify those deficiencies, but we simply do not have the time or inclination to work with authors to make their code completely transparent and usable. We take a fairly light touch here to help, rather than discourage, authors, and presumably practices will improve as everyone gets used to submitting code to journals. Most journal editors and reviewers probably do not have the time or expertise to deal with code in addition to handling manuscripts.


And the advantages, as a counterpoint to the disadvantages listed above:


F1. Workload: This is work you should really be doing whether or not you make your code publicly available in a data repository. Your future self will thank you when it comes time to revise your manuscript or, at some later date, to reanalyze those data. Reusing well-annotated code is also just standard practice among coders, rather than re-inventing the wheel every time you repeat an analysis or a modelling exercise. When developing analyses in R or Python, for example, it can be tremendously time-saving to (i) use version control, (ii) comment extensively to explain what you are doing and why, and (iii) use R or Jupyter (or other) notebooks to keep a running commentary on the purpose of chunks of code and the conclusions that you draw from analyses.
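To illustrate points (ii) and (iii), here is a short, hypothetical notebook-style chunk in Python; the data file, variables, and test are invented purely to show the running-commentary habit, not to prescribe an analysis.

import pandas as pd
from scipy import stats

# PURPOSE: compare wing length between the two study sites before pooling them;
# if the sites differ, later models will need a site term.
birds = pd.read_csv("data/wing_lengths.csv")  # placeholder file name
site_a = birds.loc[birds["site"] == "A", "wing_mm"]
site_b = birds.loc[birds["site"] == "B", "wing_mm"]

t, p = stats.ttest_ind(site_a, site_b, equal_var=False)  # Welch's t-test
print(f"Welch's t = {t:.2f}, p = {p:.3f}")

# CONCLUSION (illustrative): no clear site difference, so sites are pooled in
# subsequent analyses; that decision is what a reader needs to see explained.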


F2. Usefulness: Code is undoubtedly extremely useful for modelling exercises, where little or no empirical data are used, and the research depends entirely on the code. Presumably reviewers and editors handling such manuscripts take a close look at the code to make sure that the authors have not made any obvious mistakes that would lead to incorrect conclusions.


For research based on data analyses, published code can be very useful for training students, as well as researchers beginning to do some coding. I often use published code in my own work, especially when learning new analytical procedures, and in my graduate and undergraduate statistics courses students seem to appreciate those real-world examples. As I mentioned above, re-using code is standard practice, although most of us probably get the most useful code from blog posts, online manuals, and Stack Overflow. Despite that, usable code associated with a publication can help you replicate the authors’ methods in your own work.


F3. Accessibility: With R and Python, at least, previous language and package versions are available online, so users can recreate the environment used by the original coder, and there are sites that can facilitate this. As authors get into the practice of providing version information, and adequately commenting their code, the problems of accessibility should largely be alleviated even as the analysis software evolves.
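For example, a script can simply record its own environment when it runs. The sketch below, in Python with an invented output file name and package list, writes pinned ‘package==version’ lines that pip can reinstall later; R users can do much the same with sessionInfo() or an renv lockfile.

import sys
import importlib.metadata as metadata

PACKAGES = ["numpy", "pandas", "scipy"]  # whatever the analysis actually uses

with open("environment_versions.txt", "w") as f:
    f.write(f"# Python {sys.version.split()[0]}\n")
    for pkg in PACKAGES:
        try:
            f.write(f"{pkg}=={metadata.version(pkg)}\n")
        except metadata.PackageNotFoundError:
            f.write(f"# {pkg}: not installed when this script was run\n")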


Food for Thought

As the scientific publishing environment has evolved over the past 250 years, and especially in the last two decades, we have all more-or-less seamlessly adapted to the new environments. The requirement to provide data and code in a public repository, however, has been a sea change that not everyone is comfortable with yet. I am all-in for providing raw data, but I wonder whether the requirement to provide code has enough benefits to be worth pursuing, at least in the short term. We have identified errors and deficiencies in more than 75% of the datasets archived with manuscripts from The American Naturalist in the past year, and I would like to see that problem solved first. To that end, I feel that undergraduate and graduate student education needs to focus as much on data management as on any other aspect of doing good science.




“Since the first satellites had been orbited, almost fifty years earlier, trillions and quadrillions of pulses of information had been pouring down from space, to be stored against the day when they might contribute to the advance of knowledge. Only a minute fraction of all this raw material would ever be processed; but there was no way of telling what observation some scientist might wish to consult, ten, or fifty, or a hundred years from now. So everything had to be kept on file, stacked in endless airconditioned galleries, triplicated at the [data] centers against the possibility of accidental loss. It was part of the real treasure of mankind, more valuable than all the gold locked uselessly away in bank vaults.” 
― Arthur C. Clarke


“Let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.” 
― Thomas Jefferson*

*Jefferson is a divisive figure because of his defense of, and involvement in, slavery. 

1 comment:

  1. Thanks for these remarks. I'm a philosopher of science, and for me, being able to read the source code that was used in a paper is often invaluable--for example, for understanding how scientists use relationships between probabilities in simulations, statistical methods, and biological processes. This is a little bit like being a student or a postdoc--so my point is related to F2--but I'm often asking different questions than scientists would ask. I know I'm not part of the usual target audience for scientific papers, however. Perhaps another benefit of providing code is like the benefit of potentially irrelevant methods information ("we fed the rabbits with Purina Kibble #5"); you never know what might turn out to make a difference. Some library routines turn out to have bugs later.

