Friday, May 20, 2022

Not by vaults and locks: To archive code, or not?

 The cases for and against archiving code

By Bob Montgomerie (Queen’s University)

Data Editor—The American Naturalist


As of the beginning of this year, The American Naturalist required all authors of submitted manuscripts to upload to their data repository all of the code they used for analysis and modelling. About 80% of authors were already doing that by the time their paper was published, so it seemed like a general requirement for code at the reviewing stage would be useful for reviewers and editors, and especially end users once the paper was published.  Note that this requirement only applies if code was used; authors using drag-and-drop GUI software for analyses can generate code reports documenting their steps, but rarely do so. 


As data editor, I have now looked at about 100 scripts of R and Python code provided by authors. Based on that experience and my own coding practices for the past 55 years(!), I am not so sure that requiring code in data repositories is such a good idea. To that end, Dan Bolnick (current EIC), Volker Rudolf (incoming EIC), and I will hold a discussion session about this issue, and data repositories in general, at the meeting of The American Society of Naturalists at Asilomar in January 2023.





To provide a basis for that discussion, here are what I see as the cases for and against providing code, starting with the cons:


A1. Workload: For most of us, cleaning up the code that you used for analysis or modelling can be a ton of work—organizing, deleting extraneous material, annotating, checking that results match what is reported, providing workflow information.  It can also be a bit daunting to think that someone is going to look at your crappy coding practices—I know that it took me a month of work and deliberation before I posted my first code script in a public repository. Even now, having done this a couple of dozen times, I know that I am looking at a full day’s work to bring my code up to my rather minimal standards of usability and transparency.


Let me allay one fear right here: none of the 100+ code scripts I have looked at so far has been horrible. Everyone using R and Python seems to have adopted reasonable coding (though not necessarily reasonable statistical) practices. One person wrote code that was so compact and so elegant that it took me hours to figure out what the author had done. Normally I would not spend more than a few minutes looking at code submitted with a manuscript, but I thought I might learn something useful, which I did. While I appreciated that author’s expertise, I would rather read and use code that is a little more friendly to the average user like me. My guess is that nobody cares how inelegant your code is as long as it’s well annotated and gets the job done.


On requiring code for papers submitted to The American Naturalist, Dan Bolnick (EIC at The American Naturalist) was a little worried that the extra work, and exposure to mediocre coding practices, would reduce submissions to the journal. At first there appeared to be a little downturn in submission rates at the start of 2022, but there are other potential causes for that (pandemic burnout, gearing up for a potential renewal of field work, etc.). Even so, if this requirement discourages authors who are unwilling to be transparent about their analyses, then the journal and science benefit.


A2. Usefulness: One purpose, ostensibly, of published code is transparency, allowing reviewers, editors and readers to replicate what is reported in the manuscript and eventual publication. But my experience so far is that hardly anybody does this. Reviewers and editors rarely seem to comment on the code that authors provide. Even the data editors do not check to see if the code produces output that matches what is in the manuscript, nor even that all of the reported results are dealt with in the code. We do look at the code, but only to address some formatting and accessibility issues (see A3 below).


As a user of published data, I much prefer to write my own code to see if the results that I get match those reported by the authors. If there is a mismatch it might be interesting (but rather tedious) to figure out where the authors went wrong, but I am not sure that there are advantages to such forensic work unless scientific misconduct is suspected. A disturbing proportion of authors’ code that I have looked at so far contains what I would call statistical, rather than coding, errors, especially with respect to violating important assumptions or ignoring without comment any errors thrown up by the code. Checking for those errors should really be the job of reviewers and editors but is probably considered to be far too time-consuming.


A3. Accessibility: The vast majority of R and Python scripts that I have looked at so far will not run ‘as is’ on my computer. The most common reasons for this are that authors write scripts that (i) try to load files from the author’s hard drive, or even files that do not exist in their data repository, (ii) are so poorly annotated that the purpose is far from clear, (iii) use packages that are not on CRAN, etc., with no information as to where to get them, (iv) throw up error messages with no comments as to why those errors were ignored, (v) use deprecated functions and packages from versions much earlier than the current version (often with no package version info provided), or (vi) require a specific, and unspecified, workflow. I have enough experience that I can usually deal with those deficiencies in minutes, but I worry about other users just throwing up their hands when code does not work, or being unwilling to put in the time and effort to fix the code. Moreover, code written today will not necessarily run ‘as is’ in a few months or years, as the coding environments, packages, and functions evolve and often become deprecated. Unlike well documented datasets, saved in a universal simple format like csv, code has the potential to become obsolete and almost unusable in the future. Can anyone open my VisiCalc files, from my go-to spreadsheet app in the late 1970s?
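
As a minimal sketch of how a script might avoid problem (i), a Python script can build its file paths relative to its own location and stop with a clear message when a data file is missing. The function name, the 'data' folder, and the file name below are hypothetical, chosen only for illustration, and are not taken from any author's repository.

```python
from pathlib import Path


def resolve_data_file(script_dir: Path, filename: str) -> Path:
    """Return the path to a file in a 'data' folder next to the script.

    The 'data' folder name is an assumption made for this sketch.
    Raises FileNotFoundError with an explanatory message if the file is absent,
    rather than letting the analysis fail later with a cryptic error.
    """
    data_path = script_dir / "data" / filename
    if not data_path.is_file():
        raise FileNotFoundError(
            f"Expected {filename!r} in {data_path.parent}; "
            "check that the file is included in the data repository."
        )
    return data_path
```

In a script, this might be called as resolve_data_file(Path(__file__).resolve().parent, "measurements.csv"), so that the code finds its data wherever the repository is downloaded, rather than only on the author's hard drive.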


The data editors at The American Naturalist ask authors to fix their code when they identify those deficiencies, but we simply do not have the time or inclination to work with authors to make their code completely transparent and usable. We take a fairly light touch on this to help, rather than discourage, authors, and presumably practices will improve as everyone gets used to submitting code to journals. Most journal editors and reviewers probably do not have the time or expertise to deal with code in addition to handling manuscripts.


And the advantages, as a counterpoint to the disadvantages listed above:


F1. Workload: This is work you should really be doing whether or not you make your code publicly available in a data repository. Your future self will thank you when it comes to revising your manuscript, or at some later date needing to reanalyze those data. Reusing well-annotated code is also just standard practice among coders, rather than re-inventing the wheel every time you repeat an analysis or a modelling exercise. When developing analyses in R or Python, for example, it can be tremendously time-saving to (i) use versioning, (ii) comment extensively to explain what you are doing and why, and (iii) use R or Jupyter (or other) notebooks to keep a running commentary of the purpose of chunks of code and the conclusions that you draw from analyses.


F2. Usefulness: Code is undoubtedly extremely useful for modelling exercises, where little or no empirical data are used, and the research depends entirely on the code. Presumably reviewers and editors handling such manuscripts take a close look at the code to make sure that the authors have not made any obvious mistakes that would lead to incorrect conclusions.


For research based on data analyses, published code can be very useful for training students, as well as researchers beginning to do some coding. I often use published code in my own work, especially when learning new analytical procedures. In my graduate and undergraduate statistics courses in particular, students seem to appreciate those real-world examples. As I mentioned above, re-using code is standard practice, although most of us probably get the most useful code from blog posts, online manuals, and Stack Overflow. Despite that, usable code associated with a publication can help you to replicate the authors’ methods in your own work.


F3. Accessibility: With R and Python, at least, previous language and package versions are available on the internet so that users can recreate the environment used by the original coder. There are also sites online that can facilitate this. As authors get into the practice of providing version information, and adequately commenting their code, the problems of accessibility should largely be alleviated as the analysis software evolves.
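
As a rough illustration of what providing version information could look like in Python, the sketch below uses the standard library's importlib.metadata to report the interpreter version and the installed version of each package an analysis depends on. The function name and the package list are assumptions for the example; authors would substitute the packages their own scripts import.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return lines reporting the Python version and each package's installed version."""
    lines = [f"Python {platform.python_version()}"]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of silently skipping them.
            lines.append(f"{name}: not installed")
    return lines
```

Printing "\n".join(describe_environment(["numpy", "pandas"])) at the top of a script, or saving that output to a text file in the repository, gives later users the information they need to recreate the original environment.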


Food for Thought

As the scientific publishing environment evolved over the past 250 years, and especially in the last two decades, we have all more-or-less seamlessly adapted to the new environments. The requirement to provide data and code in a public repository, however, has been a sea change that not everyone is comfortable with, yet. I am all-in for providing raw data but wonder whether the requirement for providing code has enough benefits that it is worth pursuing at least in the short term. We have identified errors and deficiencies in more than 75% of the datasets archived with manuscripts from The American Naturalist in the past year, and I would like to see that problem solved first. To that end, I feel that undergraduate and graduate student education needs to focus as much on data management as on any other aspect of doing good science. 




“Since the first satellites had been orbited, almost fifty years earlier, trillions and quadrillions of pulses of information had been pouring down from space, to be stored against the day when they might contribute to the advance of knowledge. Only a minute fraction of all this raw material would ever be processed; but there was no way of telling what observation some scientist might wish to consult, ten, or fifty, or a hundred years from now. So everything had to be kept on file, stacked in endless airconditioned galleries, triplicated at the [data] centers against the possibility of accidental loss. It was part of the real treasure of mankind, more valuable than all the gold locked uselessly away in bank vaults.” 
― Arthur C. Clarke


“Let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.” 
― Thomas Jefferson*

*Jefferson is a divisive figure because of his defense of, and involvement in, slavery. 

Monday, May 9, 2022

The Spine of the Stickleback

How a first-year undergrad drove more than 15,000 km to brave mosquitoes and forest fires en route to photographing, fin clipping, and releasing more than 10,000 stickleback in the #WorldsGreatestEcoEvoExperiment. By Ismail Ameen

Left to Right: Ismail Ameen, Hillary Poore, and Victor Frankel somewhere in the Kenai Peninsula

With the bustle of Calgary fading behind me, and the rolling green hills of Alberta’s countryside flanking me on either side, the golden mid-July sun exposed the enormity of the country I was passing through. I had just completed a month of ecological field work in Kenai, Alaska, and was trekking back to Montreal with 3 fellow researchers. We were carrying a valuable cargo: countless samples detailing key features of lake ecosystems primed for an evolutionary journey.

I was first exposed to the field of Eco-Evolutionary Dynamics (eco-evo) after reading Jonathan Weiner’s book “The Beak of the Finch”, which covers the groundbreaking research of Peter and Rosemary Grant. The Grants’ research concerned the finches of Daphne Major in the Galapagos, where they found that natural selection was not limited to being a slow process as Darwin described, but could also act at rapid timescales. In the span of a few generations, selection could impose significant changes on a population. Instilled with a newfound curiosity, I reached out to Professor Andrew Hendry, who was teaching my introductory biology course. Following his recommendation, I read his book on Eco-Evolutionary Dynamics (2017 – Princeton University Press) to gain a somewhat better idea of the field and its goals. Essentially, eco-evo studies the following feedback loop:

Ecological dynamics shaping evolution

Merging the fields of evolution and ecology in this way is particularly useful since the two are intrinsically linked. Thus, understanding how the two interact has numerous downstream benefits that extend to applied sciences like conservation and agriculture.

Upon completing the book, I was inspired by the sense of power this field of research bestows. Understanding eco-evo is akin to grasping a fundamental law of life on this planet. Filled with a sense of purpose, I immediately sought to participate in research. My efforts paid off, as Professor Hendry offered me a spot on a field research team whose goal was to set up a large eco-evo study in Alaska.

The team, comprising PIs, grad students, and undergrads from multiple universities, had several objectives, but all of them revolved around one species: the threespine stickleback. This bony little fish provides an extremely powerful tool to evolutionary biologists, acting as a “supermodel” species that can be studied from multiple angles (molecular, genetic, ecological, etc.). Freshwater stickleback also happen to be a keystone species in lake ecosystems, meaning that they are essential to local food web stability. The essential niche they occupy isn’t set in stone, however, as two stickleback “ecotypes” arise depending on the circumstances. The first ecotype is the benthic form, which is found at the bottom of the water column, while the second ecotype is the limnetic form, which is found towards the surface of the lake. Benthic stickleback mostly feed on larval insects on the lake bottom, whereas limnetic stickleback mostly feed on zooplankton in the water column. Leveraging the importance of stickleback to food web stability, and the clear difference in ecotypes, Professor Hendry and the other PIs developed a unique experimental design.

The Experimental Design:

An ideal experiment to test the importance of stickleback ecotype to lake ecosystems might look like this: start with lakes devoid of any stickleback and introduce benthic stickleback into some of them and limnetic stickleback into others. Then track the resulting changes in various ecosystem parameters through time. Additionally, population characteristics of the stickleback could be tracked to elucidate the interactions between stickleback and their ecosystem.

Designing an experiment like this is one thing, but actually implementing it is another. Luckily, one of the PIs (Mike Bell) was working with Alaska Fish and Game, who were using rotenone to completely clear 10 lakes of their fish populations to combat invasive pike. This provided the opportunity to run the experiment in a natural setting.

Before I got to Alaska, several years of preparation had occurred, culminating in a plan to source and transplant stickleback. One of the most important aspects of this plan was the generation of pools of benthic or limnetic stickleback formed from multiple source lakes. These pools could then be transplanted into recipient lakes. Pooling improved the chances of recipient lake recovery, as the mixed populations would be more resilient to hazards that could destroy a single source lake population. Pooling would also allow for generalizations to be made about benthic and limnetic stickleback due to the replication embedded in the pools’ construction (multiple sources reduce the bias of any one source). Ultimately, 8 source lakes were decided upon, with an even division by geographic region and stickleback ecotype (Table 1). Four source lakes of the same ecotype contributed equally to each pool, so that no one source lake would dominate the others.

Table 1: Source lake pools

Source Lake     Stickleback Ecotype     Location
Tern            Benthic                 Kenai
Watson          Benthic                 Kenai
Walby           Benthic                 Mat Su
Finger          Benthic                 Mat Su
Spirit          Limnetic                Kenai
Wik             Limnetic                Kenai
South Rolly     Limnetic                Mat Su
Long            Limnetic                Mat Su

Table 1: Here I present the 8 source lakes used for transplant. Source lakes were classified as either benthic or limnetic. The 4 benthic and 4 limnetic lakes were combined, respectively, into pools which were then transplanted into recipient lakes.

Overall, there were 9 recipient lakes. Four of the recipient lakes received benthic stickleback, while another 4 received limnetic. Since one of the primary goals of the study was to observe the interactions between a particular ecotype and its environment, recipient lakes were paired based on ecological similarity and size. This would allow us to transplant the benthic pool into one lake and the limnetic pool into the other lake of each ecologically similar pair, increasing inferential power on the effects each ecotype has. It was especially important to keep lake pairs geographically isolated from each other to prevent ecotypes from mixing due to watershed connections. Additionally, to account for the size differences in lakes, non-linear scaling of stickleback transplant numbers was applied (Table 2). Finally, Loon Lake received equal proportions of the benthic and limnetic pools in order to monitor how the two ecotypes interact in the same environment.

Table 2: Recipient Lake Transplant Numbers

Recipient Lake     Ecotype Pool          Number of Stickleback
Leisure Pond       Benthic               400
Fred Lake          Limnetic              400
CC Lake            Benthic               800
Ranchero Lake      Limnetic              800
Leisure Lake       Benthic               1600
Crystal Lake       Limnetic              1600
G Lake             Benthic               2400
Hope Lake          Limnetic              2400
Loon Lake          Benthic + Limnetic    Dependent on sampling logistics

Table 2: Here I present the recipient lakes along with the number of stickleback they received, and the pools those stickleback came from. As seen in the “Number of Stickleback” column, scaling was non-linear in order to balance transplant logistics, Alaska Fish & Game advice, and experimental utility.
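
To give a feel for the numbers involved, here is a small back-of-the-envelope sketch in Python that totals the Table 2 transplant numbers for each pool and divides by the four source lakes per pool. It assumes the planned numbers were met exactly and that the equal-contribution rule applies to the pool totals. Loon Lake is left out because its number depended on sampling logistics. The variable and function names are purely illustrative.

```python
# Planned transplant numbers from Table 2 (Loon Lake omitted: number not fixed).
TRANSPLANTS = {
    "Leisure Pond": ("Benthic", 400),
    "Fred Lake": ("Limnetic", 400),
    "CC Lake": ("Benthic", 800),
    "Ranchero Lake": ("Limnetic", 800),
    "Leisure Lake": ("Benthic", 1600),
    "Crystal Lake": ("Limnetic", 1600),
    "G Lake": ("Benthic", 2400),
    "Hope Lake": ("Limnetic", 2400),
}
SOURCES_PER_POOL = 4  # four source lakes per ecotype pool (Table 1)


def summarize(transplants, sources_per_pool):
    """Return (fish per pool, fish per source lake) assuming equal contributions."""
    pool_totals = {}
    for ecotype, count in transplants.values():
        pool_totals[ecotype] = pool_totals.get(ecotype, 0) + count
    per_source = {eco: total / sources_per_pool for eco, total in pool_totals.items()}
    return pool_totals, per_source
```

Under those assumptions, summarize(TRANSPLANTS, SOURCES_PER_POOL) gives 5,200 fish per pool, or 1,300 from each source lake, for a total of 10,400 stickleback before Loon Lake is counted.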

Onsite: The Challenges of Field Work

Once I arrived at the field site, I realized just how rigorous the field work was going to be. Thousands of stickleback were going to have to be captured, processed, and transplanted to set up the experiment. Even the processing required a tremendous amount of detail be applied to each stickleback since every fish received an ID number, photograph, and a fin clipping for gene sequencing. The extra effort would be worth it, as the data that would be collected in that one month created a high resolution snapshot of the initial conditions in each lake. In essence, we were building a baseline that could support future research endeavors for decades to come.

Anesthetized stickleback after being ID'd and clipped. It would only be a short time before he started a new life in a fresh lake.

Needless to say, our task was not without complications. Three major obstacles needed to be overcome during the course of the field work. The first was the issue of optimizing stickleback processing. Thousands of fish needing to be processed meant that at any given moment of the day, at least 3 of us field assistants would be processing stickleback. As time went on, we each became better at our individual roles in the processing pipeline, forming a well-oiled machine that could handle up to 1,000 stickleback in a day. Another problem posed by the number of stickleback we needed to transplant was making sure they survived from capture to release. Some of the source lakes were several hours from the processing station, and stickleback could be held for several days before being released. In order to keep survival high, we constructed an elaborate “aquarium” outside the cabin with shade and air bubblers. Furthermore, clear communication between the processing team and the transplant/capture team allowed for the fastest possible transition between capturing, processing, and releasing a stickleback.

Stickleback capture! Here I'm showing off some of the minnow traps we used to catch stickleback

The second issue was that Alaska was at the height of forest fire season. High temperatures and summer storms brought about fires which could last weeks. In several instances these fires came right up to the lakes we were working with. Towards the end of the field work, a final collection from a source lake needed to happen to meet our quota. As I drove out to the lake with another researcher, we were forced to take a back road to get past traffic. With smoke limiting our visibility, and a few close calls with local wildlife, we finally made it to the lake, only to see swarms of firefighters preparing to control the incoming blaze. What followed was a manic sprint through the swampy shore to collect the stickleback before hightailing it back to the cabin, feeling, perhaps a bit dramatically, like we’d just escaped with our lives.

Driving beside a forest fire on the way to a transplant site

While this was definitely the closest call during the course of field work, forest fires constantly kept us on our toes logistically to make sure lakes received the right number of stickleback. Once again, clear communication and overnight trips helped us overcome this obstacle.

The third obstacle was gaining the trust of local residents. In a state like Alaska, which has a history of homesteading and self-sufficiency, residents were primed to be suspicious of government efforts. Gaining these people’s trust required direct communication and transparency. This was accomplished in two main ways. The first was through our main contact at Alaska Fish and Game, Rob Massengill. Rob was extremely excited by our work, and had a good relationship with all the residents we’d be interacting with. As a result, he became an invaluable bridge to communicate with the locals. With Rob’s endorsement, residents were more open to giving us a chance to work on their property.

The second step we took to build a relationship with residents was to host a barbecue. It turns out that when ~20 ecologists put their heads together, they come up with a pretty awesome barbecue. We saw a huge turnout of residents coming by our cabin for food, drink, and to learn what we were doing. The result was overwhelmingly positive. Once residents understood that our work would (A) benefit the lakes they lived near, and (B) give them better fishing opportunities, they tended to support us. It was at this barbecue that I realized that science requires communication: if it is never explained to the general public, it won’t do the maximum amount of good it can accomplish. Personally, I came to realize that my excitement towards the field isn’t universally shared, but through face-to-face communication I could frame that excitement in a way that appealed to someone who may have had reservations.

One of the final transplants. While it may not look momentous, this was one of the most profound moments of field work.

At the end of June our numbers had dwindled to just 5 remaining researchers. Our last release was down a muddy road and through a swarm of mosquitos. Once we reached the dock, and watched the last of the stickleback swim away, I was hit with the profundity of our work. It is rare in eco-evo to be able to set up such a large experiment in freshwater ecosystems. Smaller systems are often required, since expanding the system introduces more and more noise. In this special scenario we were able to access a large system with reduced noise. And with that we could collect data on the key markers for evolution and ecology (defensive traits, invertebrate populations, limnology, etc.). All of these features were captured in incredible detail due to our efforts. From these data, and future work, a deep understanding of how stickleback trait space is evolving, and how that evolution is influencing the environment, can be gained. As our understanding of how these lakes develop, and of what forces are at play, grows, our policy towards the conservation of these and other lakes can be more impactful. Furthermore, our work had yielded not only quality data, but also quality methods. We had optimized processing, worked out a logistics plan, and most importantly, built a relationship with local residents. Our work ensured that future sampling could be conducted much more easily.

Upon completing the journey back to McGill, I had to take a victory photo with some souvenirs we brought for the museum.

Ultimately, my field season was an episode of great personal growth, and quite a bit of learning. I was able to harness my passion for the field, and direct it towards something that furthered the field. I was able to see all the behind-the-scenes work that goes into doing good science: science that doesn’t just, hopefully, lead to important discoveries, but that benefits people in the process of being conducted. If I were to generalize, I would argue that this is one of the main reasons eco-evo is so important. Its utilitarian nature does good for science, for the environment, and for the community.

 

 


 

 

 




