Friday, May 20, 2022

Not by vaults and locks: To archive code, or not?

 The cases for and against archiving code

by Bob Montgomerie (Queen’s University)

Data Editor—The American Naturalist


As of the beginning of this year, The American Naturalist has required all authors of submitted manuscripts to upload to their data repository all of the code they used for analysis and modelling. About 80% of authors were already doing that by the time their paper was published, so a general requirement for code at the reviewing stage seemed like it would be useful for reviewers and editors, and especially for end users once the paper was published. Note that this requirement applies only if code was used; authors using drag-and-drop GUI software for analyses can generate code reports documenting their steps, but rarely do so.


As data editor, I have now looked at about 100 scripts of R and Python code provided by authors. Based on that experience, and my own coding practices over the past 55 years(!), I am not so sure that requiring code in data repositories is such a good idea. To that end, Dan Bolnick (current EIC), Volker Rudolf (incoming EIC), and I will hold a discussion session about this issue, and data repositories in general, at the meeting of The American Society of Naturalists at Asilomar in January 2023.





To provide a basis for that discussion, here are what I see as the cases for and against providing code, starting with the cons:


A1. Workload: For most of us, cleaning up the code that you used for analysis or modelling can be a ton of work—organizing, deleting extraneous material, annotating, checking that results match what is reported, providing workflow information.  It can also be a bit daunting to think that someone is going to look at your crappy coding practices—I know that it took me a month of work and deliberation before I posted my first code script in a public repository. Even now, having done this a couple of dozen times, I know that I am looking at a full day’s work to bring my code up to my rather minimal standards of usability and transparency.


Let me allay one fear right here—none of the 100+ code scripts I have looked at so far have been horrible. Everyone using R and Python seems to have adopted reasonable coding—though not necessarily reasonable statistical—practices. One person wrote code that was so compact and so elegant that it took me hours to figure out what the author had done. Normally I would not spend more than a few minutes looking at code submitted with a manuscript, but I thought I might learn something useful, which I did. While I appreciated that author’s expertise, I would rather read and use code that is a little more friendly to the average user like me. My guess is that nobody cares how inelegant your code is as long as it’s well annotated and gets the job done.


When we began requiring code for papers submitted to The American Naturalist, Dan Bolnick (the journal’s EIC) was a little worried that the extra work, and the exposure of mediocre coding practices, would reduce submissions. At first there appeared to be a slight downturn in submission rates at the start of 2022, but there are other potential causes for that (pandemic burnout, gearing up for a potential renewal of field work, etc.). Even so, if this requirement discourages authors who are unwilling to be transparent about their analyses, then the journal and science benefit.


A2. Usefulness: One purpose, ostensibly, of published code is transparency, allowing reviewers, editors, and readers to replicate what is reported in the manuscript and eventual publication. But my experience so far is that hardly anybody does this. Reviewers and editors rarely seem to comment on the code that authors provide. Even the data editors do not check whether the code produces output that matches what is in the manuscript, nor even whether all of the reported results are dealt with in the code. We do look at the code, but only to address some formatting and accessibility issues (see A3 below).


As a user of published data, I much prefer to write my own code to see if the results that I get match those reported by the authors. If there is a mismatch it might be interesting—but rather tedious—to figure out where the authors went wrong, but I am not sure there is any advantage to such forensic work unless scientific misconduct is suspected. A disturbing proportion of the authors’ code that I have looked at so far contains what I would call statistical, rather than coding, errors, especially with respect to violating important assumptions or ignoring, without comment, any errors thrown up by the code. Checking for those errors should really be the job of reviewers and editors but is probably considered far too time-consuming.


A3. Accessibility: The vast majority of R and Python scripts that I have looked at so far will not run ‘as is’ on my computer. The most common reasons are that authors write scripts that (i) try to load files from the author’s hard drive, or even files that do not exist in their data repository, (ii) are so poorly annotated that their purpose is far from clear, (iii) use packages that are not on CRAN (or similar repositories) with no information on where to get them, (iv) throw up error messages with no comment on why those errors were ignored, (v) use deprecated functions and packages from versions much earlier than the current version (often with no package version information provided), or (vi) require a specific, and unspecified, workflow. I have enough experience that I can usually deal with those deficiencies in minutes, but I worry about other users just throwing up their hands when code does not work, or not being willing to put in the time and effort to fix it. Moreover, code written today will not necessarily run ‘as is’ in a few months or years, as coding environments, packages, and functions evolve and are often deprecated. Unlike well-documented datasets saved in a universal, simple format like CSV, code has the potential to become obsolete and almost unusable in the future. Can anyone open my VisiCalc files—my go-to spreadsheet app in the late 1970s?
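For what it’s worth, many of these portability problems can be headed off with a few defensive lines at the top of a script. Here is a minimal sketch in Python; the file name, packages, and version numbers are hypothetical placeholders, not a prescription:

```python
# Hypothetical preamble for an archived analysis script: fail early, with
# helpful messages, instead of with cryptic errors on someone else's machine.
import sys
from pathlib import Path

# Use paths relative to the script, never absolute paths from one laptop.
HERE = Path(__file__).resolve().parent
DATA = HERE / "data" / "measurements.csv"   # hypothetical repository file
if not DATA.exists():
    sys.exit(f"Expected data file not found: {DATA}\n"
             "Download the full repository archive and keep its folder layout.")

# Name required packages and the versions used, with a clear message if one
# is missing, rather than letting a bare ImportError surprise the user.
REQUIRED = {"pandas": "1.4.2", "numpy": "1.22.3"}  # versions the authors used
for pkg, ver in REQUIRED.items():
    try:
        __import__(pkg)
    except ImportError:
        sys.exit(f"Missing package: {pkg} (tested with {ver}). "
                 f"Install with: pip install {pkg}=={ver}")

import pandas as pd  # safe to import now

measurements = pd.read_csv(DATA)  # the analysis proper starts here
```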


The data editors at The American Naturalist ask authors to fix their code when they identify those deficiencies, but we simply do not have the time or inclination to work with authors to make their code completely transparent and usable. We take a fairly light touch on this, to help rather than discourage authors, and presumably practices will improve as everyone gets used to submitting code to journals. Most journal editors and reviewers probably do not have the time or expertise to deal with code in addition to handling manuscripts.


And the advantages, as a counterpoint to the disadvantages listed above:


F1. Workload: This is work you should really be doing whether or not you make your code publicly available in a data repository. Your future self will thank you when it comes time to revise your manuscript, or to reanalyze those data at some later date. Reusing well-annotated code is also just standard practice among coders, rather than re-inventing the wheel every time you repeat an analysis or a modelling exercise. When developing analyses in R or Python, for example, it can be tremendously time-saving to (i) use versioning, (ii) comment extensively to explain what you are doing and why, and (iii) use R or Jupyter (or other) notebooks to keep a running commentary on the purpose of chunks of code and the conclusions that you draw from analyses.
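One cheap habit that supports all three of these: save a record of the computing environment alongside your results. In R this is a one-line call to sessionInfo(); below is a minimal Python sketch of the same idea (the package list and output file name are placeholders):

```python
# Write a lightweight "session info" file next to your results, recording
# the interpreter and package versions that produced them.
import platform
from importlib.metadata import version, PackageNotFoundError

PACKAGES = ["numpy", "pandas", "scipy"]  # whatever your analysis imports

with open("session_info.txt", "w") as f:
    f.write(f"Python {platform.python_version()} on {platform.platform()}\n")
    for pkg in PACKAGES:
        try:
            f.write(f"{pkg}=={version(pkg)}\n")
        except PackageNotFoundError:
            f.write(f"{pkg}: not installed\n")
```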


F2. Usefulness: Code is undoubtedly extremely useful for modelling exercises, where little or no empirical data are used, and the research depends entirely on the code. Presumably reviewers and editors handling such manuscripts take a close look at the code to make sure that the authors have not made any obvious mistakes that would lead to incorrect conclusions.


For research based on data analyses, published code can be very useful for training students, as well as researchers beginning to do some coding. I often use published code in my own work, especially when learning new analytical procedures. Especially in my graduate and undergraduate statistics courses, students seem to appreciate those real-world examples. As I mentioned above, re-using code is standard practice, although most of us probably get the most useful code from blog posts, online manuals, and Stack Overflow. Despite that, usable code associated with a publication can help you to replicate the authors’ methods in your own work.


F3. Accessibility: With R and Python, at least, previous language and package versions are available online, so users can recreate the environment used by the original coder, and online services (Binder, for example) and container tools can facilitate this. As authors get into the practice of providing version information, and adequately commenting their code, the problems of accessibility should largely be alleviated as the analysis software evolves.
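To make that concrete: if authors ship a list of pinned package versions (for example a requirements.txt, which pip install -r requirements.txt can restore), a later user can quickly check how far their own environment has drifted. Here is a minimal sketch, assuming plain package==version lines:

```python
# Check the running environment against the versions pinned by the authors.
# Assumes plain "package==version" lines; anything fancier needs a real
# tool such as pip or conda.
from importlib.metadata import version, PackageNotFoundError

def check_pinned(requirements_path="requirements.txt"):
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue
            pkg, _, wanted = line.partition("==")
            try:
                have = version(pkg)
            except PackageNotFoundError:
                print(f"{pkg}: MISSING (authors used {wanted})")
                continue
            status = "ok" if have == wanted else f"MISMATCH (you have {have})"
            print(f"{pkg}=={wanted}: {status}")

check_pinned()
```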


Food for Thought

As the scientific publishing environment has evolved over the past 250 years, and especially in the last two decades, we have all more-or-less seamlessly adapted to the new environments. The requirement to provide data and code in a public repository, however, has been a sea change that not everyone is comfortable with yet. I am all-in for providing raw data but wonder whether the requirement to provide code has enough benefits to be worth pursuing, at least in the short term. We have identified errors and deficiencies in more than 75% of the datasets archived with manuscripts from The American Naturalist in the past year, and I would like to see that problem solved first. To that end, I feel that undergraduate and graduate education needs to focus as much on data management as on any other aspect of doing good science.




“Since the first satellites had been orbited, almost fifty years earlier, trillions and quadrillions of pulses of information had been pouring down from space, to be stored against the day when they might contribute to the advance of knowledge. Only a minute fraction of all this raw material would ever be processed; but there was no way of telling what observation some scientist might wish to consult, ten, or fifty, or a hundred years from now. So everything had to be kept on file, stacked in endless airconditioned galleries, triplicated at the [data] centers against the possibility of accidental loss. It was part of the real treasure of mankind, more valuable than all the gold locked uselessly away in bank vaults.” 
― Arthur C. Clarke


“Let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.” 
― Thomas Jefferson*

*Jefferson is a divisive figure because of his defense of, and involvement in, slavery. 

Monday, May 9, 2022

The Spine of the Stickleback

How a first-year undergrad drove more than 15,000 km to brave mosquitoes and forest fires en route to photographing, fin clipping, and releasing more than 10,000 stickleback in the #WorldsGreatestEcoEvoExperiment. By Ismail Ameen

Left to Right: Ismail Ameen, Hillary Poore, and Victor Frankel somewhere in the Kenai Peninsula

With the bustle of Calgary fading behind me, and the rolling green hills of Alberta’s countryside flanking me on either side, the golden mid-July sun exposed the enormity of the country I was passing through. I had just completed a month of ecological field work in Kenai, Alaska, and was trekking back to Montreal with three fellow researchers. We were carrying valuable cargo: countless samples detailing key features of lake ecosystems primed for an evolutionary journey.

I first became exposed to the field of Eco-Evolutionary Dynamics (eco-evo) after reading Jonathan Weiner’s book “The Beak of the Finch”, which covers the groundbreaking research of Peter and Rosemary Grant. The Grants’ research concerned the finches of Daphne Major in the Galapagos, where they found that natural selection is not limited to the slow process Darwin described but can also act on rapid timescales. In the span of a few generations, selection can impose significant changes on a population. Instilled with a newfound curiosity, I reached out to Professor Andrew Hendry, who was teaching my introductory biology course. Following his recommendation, I read his book on Eco-Evolutionary Dynamics (2017, Princeton University Press) to gain a somewhat better idea of the field and its goals. Essentially, eco-evo studies the following feedback loop:

Ecological dynamics shape evolution, and evolutionary dynamics in turn shape ecology

Merging the fields of evolution and ecology in this way is particularly useful since the two are intrinsically linked. Thus, understanding how the two interact has numerous downstream benefits that extend to applied sciences like conservation and agriculture.

Upon completing the book, I was inspired by the sense of power this field of research bestows. Understanding eco-evo is akin to grasping a fundamental law of life on this planet. Filled with a sense of purpose, I immediately sought to participate in research. My efforts paid off, as Professor Hendry offered me a spot on a field research team whose goal was to set up a large eco-evo study in Alaska.

The team, comprising PIs, grad students, and undergrads from multiple universities, had several objectives, but all of them revolved around one species: the threespine stickleback. This bony little fish provides an extremely powerful tool for evolutionary biologists, acting as a “supermodel” species that can be studied from multiple angles (molecular, genetic, ecological, etc.). Freshwater stickleback also happen to be a keystone species in lake ecosystems, meaning that they are essential to local food-web stability. The essential niche they occupy isn’t set in stone, however, as two stickleback “ecotypes” arise depending on the circumstances. The first is the benthic form, found at the bottom of the water column; the second is the limnetic form, found towards the surface of the lake. Benthic stickleback mostly feed on larval insects on the lake bottom, whereas limnetic stickleback mostly feed on zooplankton in the water column. Leveraging the importance of stickleback to food-web stability, and the clear difference between ecotypes, Professor Hendry and the other PIs developed a unique experimental design.

The Experimental Design:

An ideal experiment to test the importance of stickleback ecotype to lake ecosystems might look like this: start with lakes devoid of any stickleback and introduce benthic stickleback into some of them and limnetic stickleback into others. Then track the resulting changes in various ecosystem parameters through time. Additionally, population characteristics of the stickleback could be tracked to elucidate the interactions between stickleback and their ecosystem.

Designing an experiment like this is one thing, but actually implementing it is another. Luckily, one of the PIs (Mike Bell) was working with Alaska Fish and Game, who were using rotenone to completely clear 10 lakes of their fish populations to combat invasive pike. This provided the opportunity to run the experiment in a natural setting.

Before I got to Alaska, several years of preparation had occurred, culminating in a plan to source and transplant stickleback. One of the most important aspects of this plan was the generation of pools of benthic or limnetic stickleback drawn from multiple source lakes. These pools could then be transplanted into recipient lakes. Pooling improved the chances of recipient-lake recovery, as the mixed populations would be more resilient to hazards that could destroy a single source-lake population. Pooling would also allow generalizations to be made about benthic and limnetic stickleback, owing to the replication embedded in the pools’ construction (multiple sources reduce the bias of any one source). Ultimately, 8 source lakes were decided upon, divided evenly by geographic region and stickleback ecotype (Table 1). Four source lakes of the same ecotype contributed equally to each pool, to mitigate the potential for one source lake to dominate the others.

Table 1: Source lake pools

Source Lake | Stickleback Ecotype | Location
Tern | Benthic | Kenai
Watson | Benthic | Kenai
Walby | Benthic | Mat Su
Finger | Benthic | Mat Su
Spirit | Limnetic | Kenai
Wik | Limnetic | Kenai
South Rolly | Limnetic | Mat Su
Long | Limnetic | Mat Su

Table 1: Here I present the 8 source lakes used for transplant. Source lakes were classified as either benthic or limnetic. The 4 benthic and 4 limnetic lakes were combined, respectively, into pools which were then transplanted into recipient lakes.

Overall, there were 9 recipient lakes: 4 received benthic stickleback, while another 4 received limnetic stickleback. Since one of the primary goals of the study was to observe the interactions between a particular ecotype and its environment, recipient lakes were paired based on ecological similarity and size. This allowed us to transplant the benthic or limnetic pool into two lakes of similar ecology, increasing the inferential power on the effects of each ecotype. It was especially important to keep lake pairs geographically isolated from each other to prevent the ecotypes from mixing through watershed connections. Additionally, to account for the size differences among lakes, non-linear scaling of stickleback transplant numbers was applied (Table 2). Finally, Loon Lake received equal proportions of both ecotype pools, in order to monitor how the two ecotypes interact in the same environment.

Table 2: Recipient Lake Transplant Numbers

Recipient Lake | Ecotype Pool | Number of Stickleback
Leisure Pond | Benthic | 400
Fred Lake | Limnetic | 400
CC Lake | Benthic | 800
Ranchero Lake | Limnetic | 800
Leisure Lake | Benthic | 1600
Crystal Lake | Limnetic | 1600
G Lake | Benthic | 2400
Hope Lake | Limnetic | 2400
Loon Lake | Benthic + Limnetic | Dependent on sampling logistics

Table 2: Here I present the recipient lakes along with the number of stickleback they received, and the pools those stickleback came from. As seen in the “Number of Stickleback” column, scaling was non-linear in order to balance transplant logistics, Alaska Fish & Game advice, and experimental utility.

Onsite: The Challenges of Field Work

Once I arrived at the field site, I realized just how rigorous the field work was going to be. Thousands of stickleback would have to be captured, processed, and transplanted to set up the experiment. Even the processing required that a tremendous amount of detail be applied to each stickleback, since every fish received an ID number, a photograph, and a fin clipping for gene sequencing. The extra effort would be worth it, as the data collected in that one month created a high-resolution snapshot of the initial conditions in each lake. In essence, we were building a baseline that could support future research endeavors for decades to come.

Anesthetized stickleback after being ID'd and clipped.
It would only be a short time before he started a new life in a fresh lake.
Needless to say, our task was not without complications. Three major obstacles needed to be overcome during the course of the field work. The first was the issue of optimizing stickleback processing. With thousands of fish needing to be processed, at any given moment of the day at least 3 of us field assistants would be processing stickleback. As time went on, we each became better at our individual roles in the processing pipeline, forming a well-oiled machine that could handle up to 1,000 stickleback in a day. Another problem posed by the number of stickleback we needed to transplant was making sure they survived from capture to release. Some of the source lakes were several hours from the processing station, and stickleback could be held for several days before being released. In order to keep survival high, we constructed an elaborate “aquarium” outside the cabin with shade and air bubblers. Furthermore, clear communication between the processing team and the transplant/capture team allowed for the fastest possible transition between capturing, processing, and releasing a stickleback.

Stickleback capture! Here I'm showing off some of the minnow traps we used to catch stickleback

The second issue was that Alaska was in the height of forest fire season. High temperatures and summer storms brought about fires which could last weeks. In several instances these fires came right up to the lakes we were working with. Towards the end of the field work, a final collection from a source lake needed to happen to meet our quota. As I drove out to the lake with another researcher we were forced to take a back road to get past traffic. With smoke limiting our visibility, and a few close calls with local wildlife, we finally made it to the lake only to see swarms of firefighters preparing to control the incoming blaze. What followed was a manic sprint through the swampy shore to collect the stickleback before hightailing it back to the cabin, feeling, perhaps a bit dramatically, like we’d just escaped with our lives.
Driving beside a forest fire on the way to a transplant site
While this was definitely the closest call during the course of the field work, forest fires constantly kept us on our toes logistically as we made sure each lake received the right number of stickleback. Once again, clear communication and overnight trips helped us overcome this obstacle.

In a state like Alaska, which has a history of homesteading and self-sufficiency, residents were primed to be suspicious of government efforts. Gaining their trust required direct communication and transparency. This was accomplished in two main ways. The first was through our main contact at Alaska Fish and Game, Rob Massengill. Rob was extremely excited by our work and had a good relationship with all the residents we’d be interacting with. As a result, he became an invaluable bridge to the locals. With Rob’s endorsement, residents were more open to giving us a chance to work on their property.

The second step we took to build a relationship with residents was to host a barbecue. It turns out that when ~20 ecologists put their heads together, they come up with a pretty awesome barbecue. We saw a huge turnout of residents coming by our cabin for food, drink, and to learn what we were doing. The result was overwhelmingly positive. Once residents understood that our work would (A) benefit the lakes they lived near, and (B) give them better fishing opportunities, they tended to support us. It was at this barbecue that I realized that science requires communication: if it is never explained to the general public, it won’t do the maximum amount of good it can. Personally, I came to realize that my excitement for the field isn’t universally shared, but through face-to-face communication I could frame that excitement in a way that appealed to someone who may have had reservations.

One of the final transplants. While it may not look momentous, this was one of the most profound moments of the field work.
At the end of June our numbers had whittled down to just 5 remaining researchers. Our last release was down a muddy road and through a swarm of mosquitoes. Once we reached the dock and watched the last of the stickleback swim away, I was hit with the profundity of our work. It is rare in eco-evo to be able to set up such a large experiment in freshwater ecosystems. Smaller systems are often required, since expanding the system introduces more and more noise. In this special scenario we were able to access a large system with reduced noise. And with that we could collect data on the key markers of evolution and ecology (defensive traits, invertebrate populations, limnology, etc.). All of these features were captured in incredible detail thanks to our efforts. From these data, and future work, a deep understanding of how stickleback trait space is evolving, and how that evolution influences the environment, can be gained. As our understanding of how these lakes develop, and of the forces at play, grows, our policies towards the conservation of these and other lakes can become more impactful. Furthermore, not only had our work yielded quality data, but also quality methods. We had optimized processing, worked out a logistics plan, and, most importantly, built a relationship with local residents. Our work ensured that future sampling could be conducted much more easily.

Upon completing the journey back to McGill, I had to take a victory photo with some souvenirs we brought for the museum.

Ultimately, my field season was an episode of great personal growth, and quite a bit of learning. I was able to harness my passion for the field and direct it towards something that furthered the field. I was able to see all the behind-the-scenes work that goes into doing good science: science that doesn’t just, hopefully, lead to important discoveries, but that benefits people in the process of being conducted. If I were to generalize, I would argue that this is one of the main reasons eco-evo is so important. Its utilitarian nature does good for science, for the environment, and for the community.






Thursday, April 7, 2022

The promise and perils of preprint scouts for journals

 This is a joint post co-written equally by:

Daniel Bolnick (daniel.bolnick@uconn.edu)

David Fisher (david.fisher@abdn.ac.uk)

Maurine Neiman (maurine-neiman@uiowa.edu)


Disclaimer: This essay is not a statement of policy for any journal or organization. 


The preprint era:

Preprints are increasingly used in biology as a means to rapidly disseminate research before peer review and to make scientific research freely and equitably available. The COVID pandemic illustrated the rewards and risks of such a system. Preprints provided a crucial means for rapidly conveying findings that helped shape public policy and medical practice, at a time when we could little afford the delays typical of the scientific review process. However, preprints also have the potential to facilitate the spread of flawed research. The rare examples of truly flawed preprints highlight the importance of following up with traditional peer review. Although peer review as ‘gate-keeping’ has a negative connotation in a lot of conversations on scientific publishing, peer review also has a genuine role to ensure that rigorous science is disseminated while flawed science (or, pseudoscience) is not. Preprints lack the gate-keeping of formal peer review, and so their flaws emphasize the genuine value that is added by traditional peer review, typically conducted by journals (in our experience, voluntary reviews on preprint servers tend to be scarce). 

A reasonable compromise between these perspectives seems to be increasingly accepted: authors go to preprints for rapid dissemination. The scientific community recognizes these preprints but treats them with a grain of salt. Meanwhile, authors submit to journals for the value-added provided by constructive review and formal publication in a recognized journal. In this publishing model, journals remain passive recipients of submissions. Authors choose which journals to submit to, based on their impression of journal prestige, subject area, readership, and fit to their own manuscript. And, we hope that policy and media attention focuses on the peer-reviewed publications.



Preprint Scouts

An interesting alternative is to view preprints as a display case, and journals as proactive ‘shoppers’. As concerns regarding equitable access to scientific publishing become increasingly apparent, Editors are also hoping to increase the diversity of the pool of submitting authors. A journal may also wish to increase the diversity of subjects it receives submissions on, perhaps if the journal wants to expand its remit, or if the Editorial board feels certain subjects among current topics are underrepresented. 

Journal Editors are also concerned with the prestige of the journals they manage, and seek to publish the best available submissions, which will inspire authors to submit other high-quality work. As Editors, we frequently feel a twinge of regret when we see a great paper published elsewhere that we wish had come to our journal. All these concerns can motivate Editors to encourage authors to submit promising preprints to their journal. A few journals (e.g., Proceedings B; Evolution Letters) have initiated a system of Preprint Scouts (or Preprint Editors) whose job is to watch for newly posted preprints, identify promising ones that would be a good fit for the journal’s goals, and encourage submission.

If this preprint scout system sounds radically new, it’s not. It is just a formal version of what has long been an informal process. Throughout the history of modern scientific publishing, Editors and Associate Editors have informally encouraged authors to submit exciting work to their journals. At conferences, Editors may give words of encouragement to a student after a particularly enticing lecture or poster. As a visiting speaker at a university, an Editor might hear about exciting new results in preparation and encourage the author to submit. From a cynical perspective, the preprint scout system simply formalizes what had been an informal networking system, with all of its associated challenges. In other words, it should be pretty obvious that the old informal system would have been riddled with unconscious biases. As Editors, we are more likely to give encouragement to someone working on a topic of particular interest to us, or to someone working in the lab of a close collaborator or friend. This doesn’t mean the bias is ill-intentioned, merely that it is a natural outgrowth of the fact that we inevitably interact socially and intellectually with a non-random subset of the broader scientific community (there are just too many people out there to know them all personally). And, we are more likely to encourage manuscript submission from the people we interact with. The hope is that formalizing this system of Preprint Scouts can reduce these biases by bringing it out from the shadows. Or better still, Preprint Scouts might be a tool for proactively removing biases and consciously diversifying a journal’s authorship or subject matter. The latter prospect is particularly enticing, but requires great care to implement successfully.

The remainder of this blog post aims to articulate the aspirations of Preprint Scout arrangements, the risks associated with the approach, and strategies to mitigate those risks. We provide specific recommendations for how journals might implement Preprint Scout systems, and how not to do so. We wish to emphasize that this document is written by us personally and does not represent a statement of institutional policy by any journal or university that we are affiliated with. Another useful reference that covers distinct but related issues can be found in Neiman et al. (2021).



Goals:

Diversity, Equity, and Inclusiveness: The primary goal of most journals is to publish the best science they can (rigorous, clear, innovative, impactful), in the field(s) that they address. In doing so, they hope to (1) increase knowledge of the world (a service to their readership), and (2) promote the careers of the authors who contribute excellent papers. In the interests of equity, fairness, and social justice, journals have a moral obligation to promote the careers of diverse authors (by nationality, ethnicity, gender, or other aspects of identity). An inclusive and diverse set of authors isn’t just a political statement. The variety of lived experiences and perspectives provides a richer view of the scientific topics that concern us and can make our scientific insights deeper and more varied. A key goal of Preprint Scouts, therefore, should be to advance the globalization and diversification of science. STEM fields are plagued by myriad inequalities affecting who can conduct research and who can publish in top journals; inequalities arising from cultural biases, geography, socioeconomic inequality (both personal and research funding), racism, and sexism. Preprint Scouts represent an interesting opportunity to proactively mitigate these inequalities, though they also have the potential to entrench inequalities if not done with care.


Subject Matter: As Editors, we are keenly aware that our journals are only able to publish the papers that get submitted to us. This limits our ability to diversify our authorship or the set of scientific topics that we wish to publish. Authors have impressions of what our journals do, or do not, publish. Authors also have impressions about journals’ openness (or lack thereof) to submissions from authors in the Global South, or student authors, or authors from groups that are historically underrepresented or marginalized. While biases surely exist (we do not contest this), there is often a mismatch between authors’ perceptions of what a journal will publish and what the Editors actually are interested in. Author misconceptions about Editor expectations lead to an author-generated bias in which papers get submitted to a given journal. For instance, the journal title “The American Naturalist” implies to some that the journal prioritizes research by American authors (it does not), or natural history (it is often a theory- or conceptually focused journal). As a result, authors from the Global South, or Asia, might be less likely to submit, generating a bias in our submissions that the Editors don’t even know exists. (As an aside: the journal name is a 150-year-old legacy that Editors have been hesitant to abandon, despite occasional conversations on the topic.)

Or, when it comes to subject matter, Editors often feel helpless to steer the journal into a new subject area when we receive few or no submissions on the topic. For example, The American Naturalist receives relatively few submissions using genomics or transcriptomics, and few submissions in ecosystem ecology, neurobiology, or ecophysiology: these are topics that the Editors truly value and wish to promote, but as long as authors believe the Editors don’t want submissions on these topics, they won’t submit. The result is a feedback loop: authors don’t submit papers on a given topic to a particular journal, so the journal doesn’t publish on that topic, and so authors perceive that the journal does not desire submissions on that topic. This problem is heightened by the reality that a journal that does not publish very often on a particular topic will also often lack Associate Editors with that topical expertise, meaning that even when papers that address the topic are submitted, they are often perceived as out of scope.

This feedback loop can be broken by Preprint Scouts, whose proactive solicitation of submissions conveys interest to authors, and who can even identify new topical areas that a journal should consider broadening to include. The experiences of existing Preprint Editors emphasize this point: both David Fisher and Maurine Neiman note that when they have encouraged submissions they frequently get replies stating “Oh, I didn’t know Evolution Letters {or, Proceedings B} published in this area”! Even if a given preprint has already been submitted to another journal, the solicitation conveys to authors that the journal is open to publishing in that subject area, and so they may be more inclined to plan a future submission. 


Risks and solutions

An informal and unscientific poll (screenshot below) suggests that there is cautious interest by the scientific community in Preprint Scouts, but also a great deal of concern over the potential for exacerbating biases.



Preprint Scouts are only a good choice for journals if they effectively achieve the goals listed above. If Preprint Scouts simply provide a means for journals to compete over already high-profile authors or sustain or exacerbate inequalities in access to top journals, then they should not be adopted. Here, we examine some potential concerns, paired with some potential solutions.


1. Biases in who posts on preprint servers

Not everyone is comfortable posting preprints. Authors may feel it is wise to “put your best foot forward”, presenting colleagues with only the most polished product possible. Getting reviews from journals is a means to get (hopefully constructive) feedback that improves the clarity of the writing and graphics, and perhaps catches errors in logic or analyses. Review and revision provide a means to minimize the risk of later embarrassment by letting a small number of anonymous peers confidentially check for profound flaws before the big reveal to everyone. So, preprints might be used more often by people with strong networks of colleagues who can give them feedback before they post on a preprint server, giving more confidence that the work is solid before it goes online.

It is our impression that career stage has a strong impact on one’s willingness to use preprint servers. Our older colleagues generally seem more skeptical than our students. In this sense, preprint scouts are more likely to invite submissions from junior scientists, which we see as a generally positive bias.

We expect that there may be geographic biases (e.g., by nationality) in who posts preprints, but we are not aware of data on this issue. Cultural differences in how preprints are evaluated may incentivize (or disincentivize) preprint use. The challenges of publishing in a second (or third, or fourth) language mean that some authors may be more comfortable working directly through a journal to get Editorial and copyediting help before making their work public. The lack of fees associated with preprints, which provide an open-access version of a manuscript, may be attractive for those working in countries or at institutions without the funds to pay the substantial open-access fees at many journals. Any disparities in which nationalities use preprints will generate biases in what preprints are available for Preprint Scouts to evaluate. Such biases are not necessarily a bad thing: if the Global South is over-represented in preprints, for instance, then Preprint Scouts will tend to promote submissions from areas that may be under-represented in traditional author-initiated submissions. We need data on whether geographic biases exist in preprint use to determine whether Scouts would exacerbate or ameliorate geographic biases in submissions.

Solutions to the above biases lie at a community level, in trying to level the playing field of who posts preprints, and are largely out of the hands of journals. However, if many journals initiate a Preprint Scout system, it may create incentives for using preprints that alter the landscape of who chooses to post preprints (as well as increasing the number of authors who choose to post preprints overall).


2. Biases in which preprints come to a scout’s attention

Preprint Scouts are tasked with the job of scanning weekly lists of new preprints. Given the volume of preprint submissions, this necessarily requires some winnowing before the Scouts can look at abstracts and manuscripts. The winnowing takes place by carefully choosing which preprint servers to monitor and what keywords to use. Both choices will tend to define what sets of papers a Scout will see. Different disciplines tend to gravitate to different preprint servers (for example, bioRxiv, arXiv, PCI, or EcoEvoRxiv). Preprint Scouts that focus on particular servers will entrench biases towards the subfields that prefer a given server. Keyword searches further narrow the set of visible papers. Within a given discipline there may be cultural variation in semantics, definitions, or even spelling, so that a keyword an American Preprint Scout might choose could be a term less often used by authors in Europe. Terms may also have a directionality to them (e.g., “assortative mating” might fail to reveal papers on “disassortative mating” or papers reporting negative results implying random mate choice). Such directional terms can exaggerate the “file drawer” effect, whereby journals preferentially publish results that are statistically significant (and perhaps in a particular direction that corroborates standard views).

The solutions to this problem are within reach. Journals using Preprint Scouts should have a diverse team (by discipline, nationality, etc.), and carefully discuss a set of systematic key words, or abandon key words altogether and look at all submissions in a discipline. Preprint Scouts should use search engines and auto-alert tools rather than Twitter or other social networking tools. Many of us do use Twitter or other social networks. Indeed, this is how the three authors of this document connected to draft this blog, and one of us (Neiman) has used social media, including Twitter, to recruit new members for her Preprint Editorial team at Proceedings B. There’s a good chance you are reading this because you (or someone you know) saw it posted on Twitter. One of the benefits of social networks is that they draw our attention to brand-new scientific publications (including preprints) that are of interest to us. This has great benefits but is also likely to entrench biases, because we are more likely to see papers by the people we choose to follow on social media (and because some people post about their preprints on social media and others do not). Therefore, Preprint Scouts should be discouraged from using their personal Twitter (or equivalent) feed as the primary means for finding preprints to invite.


3. Biases in which preprints a scout chooses to encourage submission

Preprint Scouts, being human, naturally have their own preferences (and dislikes). We all have subject areas or organisms that we find especially fascinating, and other topics that have just never excited us. Inevitably, these personality quirks will influence which articles a Preprint Scout finds exciting enough to send out an invitation. Then there is the potential for personal network biases: will a Preprint Scout be more likely to send an invitation to a close personal friend or collaborator? Or to the student of a friend or collaborator? Would a scout be more (or less) likely to send an invitation to a prominent author? That big name may be intimidating to reach out to, or may be someone the scout wishes to curry favor with. This in some ways mimics the potential for bias in an Editor’s decisions about which submitted articles to send out for review. Additionally, preprint servers may indicate the number of downloads or the amount of Twitter attention a preprint has received. Preprint Scouts trying to decide whether a candidate is “interesting” enough to be worth an invitation might use downloads or retweets as a guide (captured by Altmetric scores). While there is surely information in this online attention, allowing it to influence decisions risks biasing submissions towards authors with large social media networks, papers with particularly engaging titles, or flawed work that attracts attention for the wrong reasons.

Several solutions present themselves. First, having a large and diverse team of Preprint Scouts allows a journal to ‘average over’ the variation among individuals. This team should span nationalities, career stages, and subject matter to provide greater awareness of cultural differences and minimize the effect of personal biases in subject, study organism, or personalities. Second, Preprint Scouts should be encouraged to ignore the author list and go straight to the title, abstract, and main body of the work (this may be easier said than done). Third, journals may wish to institute a tiered system in which a team of Preprint Scouts makes recommendations to a Preprint Editor, who makes a final decision on a set of invitations for the week. Given that the Preprint Editor is likely to be more experienced than the Scouts, this allows a more experienced head to make the final call on what should be invited and what should not. This tiered system also allows the Preprint Editor to keep an eye on the overall diversity (nationality, gender, etc.) of invited authors. And it separates the step of identifying candidate papers from the person issuing the invitation, reducing the Preprint Scout’s temptation to curry favor with prominent authors. Finally, avoiding a reliance on altmetrics such as downloads or retweets when selecting which preprints to invite is a must.


4. Other potential drawbacks

If preprint scouts become pervasive, authors may begin to ‘expect’ invitations and get offended by the lack of an invitation. This strikes us as a relatively unlikely scenario.

If journals interested in similar topic areas implement Preprint Scout systems, then there will be significant duplicated effort. Each journal would have a team scanning an overlapping set of preprints, and perhaps issuing competing invitations to mutually appealing papers. Given how limited all of our time can be, duplicate effort perhaps should be avoided. The alternative is a system like PCI, or the now-defunct Axios, where a single team (PCI Editors) examines submissions and makes recommendations to authors as to which journal(s) might be a good fit. This, however, removes the crucial ability for journal Editors to use Preprint Scouts to move their journal into new subject areas (that an unaffiliated set of reviewers might not be aware of).

It can happen that Preprint Scouts invite a paper for submission, only to have that paper declined. In general, authors are more likely to react poorly to such a decline, because they have received mixed messages. This is particularly true when the Editor declines to even send an invited submission out for review (as has happened to at least one of us writing this essay). Part of the reason for such mixed messages is that different individuals issue the invitation, and evaluate submissions, and they may have different visions for the journal’s goals. For instance, an Editor might wish to use preprint invitations to proactively move the journal into publishing an emerging subject area that it has not previously featured. But an Associate Editor (or, reviewers) examining the submission may not know this intent and decide that the paper is not a good match to the journal. So, if a journal Editor seeks to use preprint scouting to shift the journal in a new direction, that direction must be clearly conveyed to all Associate Editors to avoid mixed expectations. And, reviewer comments based on misconceptions about the journal’s subject matter need to then be discounted. A more radical option (not likely to be popular among many journal Editors) is that a preprint scout invitation to submit comes with a guarantee that the paper would, at a minimum, not be desk rejected without review.

If particular Preprint Editors (or, Editors, Associate Editors, etc.) do not buy into the broader agenda of increasing subject matter diversity or author representation, then preprint solicitations will fail to achieve their goals. Open and transparent discussions with all Editorial Board members are required to articulate the goals and values of the journal, including training regarding implicit/explicit bias. Journals may need to consider bringing in new Associate Editors, or formally establishing new subject matter areas, to emphasize policy goals. A good example of this strategy is provided by the establishment of the new Biological Science Practices section at Proceedings B. This new paper type, focused on papers that analyze the way in which science is conducted within biology, and especially how these scientific practices influence research quality, scientific community health, and the public understanding of science, was instituted at Proceedings B as a direct consequence of a gap between desired scope, perceived scope, and Editorial board composition.

Preprint scouts need to be sufficiently familiar with the journal they are serving, to have a realistic view of what kinds of manuscripts stand a good chance at publication. Otherwise, there is a risk that scouts will tend to encourage authors to submit manuscripts that do not have a good chance at publication because of scientific flaws that will be critiqued in review or poor fit to the journal’s standards of novelty, clarity, or subject matter. Over-enthusiastic invitations to papers that stand little chance of publication risk wasting everyone’s time and generating substantial ill-will. Scouts should therefore have substantial familiarity with the journal as a reader, a reviewer, an author, or some combination of these, as well as having clear instructions from the Editor. 

An obvious difficulty with preprint scouts is that many authors post preprints more or less simultaneously with submission to a journal. This appears to be the most frequent situation. Chances are, the authors have prepared the manuscript with that journal in mind for some time in advance. Might Preprint Scouts thus be a waste of time and not yield submissions? We certainly have found that scouting can yield submissions, so it clearly is not a complete waste of time. First, some authors do post preprints first, and wait for feedback, before submitting. While these are a minority, they represent an opportunity for journals.  Second, if papers get declined from their first journal submission, the authors have a backup plan in place with an invitation. The most common response that Evolution Letters preprint scouts get is along the lines of “the paper is currently in review, but we will think of Evolution Letters in the future.”  Third, even if the paper in question is not submitted, the invitation raises awareness of the journal, particularly when an Editor seeks to publish more on an emerging topic.



Strategies for Implementing Preprint Scout Systems


1. Who gets invited to be a preprint scout

Given the central importance of diversity to science and the need to pay particular attention to bias generated by limited diversity, we believe that focusing first and foremost on bringing a diverse set of scientists to the table to serve as these Preprint Scouts is critical. With this in mind, we suggest using a simple application procedure that explicitly focuses on evaluating how the applicant can contribute distinct but relevant perspectives. A statement of motivation for joining the team is also helpful. At least one of us has taken on the philosophy of trying to accommodate most if not all early-career scientists interested in joining a Preprint Scout team, with the caveat that not all journals have the breadth or bandwidth to include a large group of scouts. 

Where seniority might be helpful is when it comes to the person actually issuing the invitation: to maximize the likelihood that the solicitation will be taken seriously and not viewed as spam, it is ideal if the email comes from an individual, institution, and/or journal with good potential for name recognition or a professional online presence that is readily findable via Internet search. 


2. Training

Training needs to cover three core pillars. First, Preprint Scouts must be trained to do the job fairly. This means avoiding the biases, described above, in how preprints are found and which are invited. Unconscious-bias training is a useful tool here for making people aware of their own existing biases, while highlighting existing biases in the field at large is also key. While many more senior academics may have received such training (even multiple times) as part of their existing roles, younger colleagues may not have. There exist myriad articles, videos, and interactive tools for teaching Associate Editors and Preprint Scouts about unconscious bias and how to mitigate it (see, for instance, a curated list of UB training tools posted by Elsevier).

Second, scouts need training in how to do the job efficiently. Enormous numbers of preprints within the biological sciences are posted daily, so there are a huge number of articles to look at, and as many decisions to make about whether each preprint is relevant and interesting enough to be worth an invitation. There is considerable potential for this process to use up a great deal of time, especially when the preprint scouting team is small and/or the journal’s remit is broad. Key-word alerts can make the process much more efficient by taking much of the searching out of the scouts’ hands, meaning they only need to peruse the generated lists of relevant preprints every few days or once a week. However, key words can (as noted above) be spelled differently, or be culturally biased, so a broader approach may be advisable. If the scouts are also writing the invitations, a template can be used (more details below), requiring only the addition of the author’s email, name, and paper title; these can be filled in automatically from spreadsheets if the relevant preprint information is collated there, though the specific reasons for the invitation will still need to be added manually.

Finally, scouts will need guidance on how to do the job to best achieve the goals of the journal. If the mission is primarily to diversify the submissions, then the methods described above may be sufficient. If the journal also wishes to target specific subject areas, then training may be required in how to find those preprints and approach those authors. As the goals of a journal may be diverse, prescribing the appropriate training is difficult, but we wish to highlight to interested Editors the need to make sure their Preprint Scouts have the tools necessary to achieve the journal’s goals.  


3. Procedures for finding prospective articles

When the preprint server covers a wider remit than the journal is interested in featuring, topic and keyword alerts allow the range of papers to be quickly narrowed to the most relevant. For example, bioRxiv offers both “Subject Collection” alerts, in which the title, author list, and a link for any preprint posted in an author-identified subject such as “Developmental Biology” or “Zoology” are sent to an email address, and “Relevant paper” alerts, in which preprints are identified by key words matched in titles, abstracts, or author lists. Alternatively, scouts can search the relevant repositories with defined key words, or in defined subject areas, each time they wish to find new preprints to invite. The OSF has made a guide for such an approach here.
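To illustrate what such an automated search can look like, here is a rough sketch of a keyword alert built on bioRxiv’s public API (https://api.biorxiv.org). The endpoint path and field names reflect our reading of the API documentation and should be verified before use; the keywords and date range are placeholders. Note that plain substring matching means a search for “assortative” would also catch “disassortative”, easing the directionality problem noted above.

```python
# Sketch of a weekly keyword alert over recent bioRxiv preprints, using the
# public bioRxiv API (https://api.biorxiv.org). Endpoint path and field
# names follow our reading of the API docs at the time of writing; verify
# them before relying on this. Keywords and dates are placeholders.
import requests

KEYWORDS = ["assortative", "eco-evolutionary", "stickleback"]
URL = "https://api.biorxiv.org/details/biorxiv/2022-04-01/2022-04-07/0"

records = requests.get(URL, timeout=30).json().get("collection", [])
for rec in records:
    text = (rec.get("title", "") + " " + rec.get("abstract", "")).lower()
    # Substring matching: "assortative" also catches "disassortative".
    if any(kw.lower() in text for kw in KEYWORDS):
        print(f"{rec.get('date')}  {rec.get('title')}")
        print(f"  https://doi.org/{rec.get('doi')}\n")
```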

4. Criteria for choosing which articles to invite

The journal goals for preprint solicitation should be always kept in mind when deciding which papers to invite, whether these goals be about broadening participation or scope or simply ensuring that the journal is soliciting preprints likely to also be solicited by competitor journals. The Preprint Editor should discuss with other members of the Editorial board regarding the extent to which solicited papers should meet, or come close to meeting, formatting requirements for submission such as length, section representation, etc. In our experience, it is frustrating to authors to receive a solicitation for a preprint that must be extensively reformatted prior to submission. 


5. Invitation procedures

We have found that it is critical to personalize (e.g., author name, paper title) invitation emails to preprint authors to reduce the risk that these emails will be perceived as spam. While this risk is not entirely eliminated by such personalization, correctly identifying authors, papers, etc. will likely increase the likelihood that the email will be taken seriously. How to achieve this personalization will depend on the number of solicitations. At Proceedings B, dozens of solicitations go out per month, requiring the Preprint Editor to use a custom-built Python script that scrapes a Google sheet for paper titles and names. A smaller-scale endeavor, say for a journal with a narrower remit, could instead be centered on individual emails written for each solicited preprint.

Regardless of approach, it seems important to ensure that all corresponding authors are included in the solicitation. As a mechanism to broaden participation by early-career researchers and to possibly increase the rate at which solicited preprints are submitted, one might also consider adding first authors (who are often early in their careers) to the solicitation. A template letter is provided at the end of this post.
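The core of any such script, large or small, is a simple mail merge. Here is a minimal sketch, assuming the scouting team’s information has been exported to a CSV with (hypothetical) columns "name", "email", and "title"; it only renders the letters, leaving the actual sending to your mail system.

```python
# Minimal mail-merge sketch: render personalized invitation letters from a
# CSV export (e.g., of a Google Sheet) with hypothetical columns
# "name", "email", and "title". Sending is left to your mail system.
import csv
from string import Template

LETTER = Template(
    "Dear Dr. $name,\n\n"
    "My colleagues and I read your preprint, \"$title\", with interest, "
    "and we would like to encourage you to consider submitting it to our "
    "journal...\n"
)

with open("invitations.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(f"To: {row['email']}")
        print(LETTER.substitute(name=row["name"], title=row["title"]))
```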


6. Procedures for submission and post-submission evaluation:

Two key considerations present themselves here. First, should scouted papers be flagged as such upon submission to a journal? This would signal to the handling Editor that the submission had been given a green-light by a Preprint Scout as likely appropriate for the journal. This would presumably make an Editorial “desk rejection” (without review) less likely. The closely related question is whether a journal treats such flagged papers differently as a matter of policy. Specifically, does an invitation convey a guarantee that the paper would at least be sent to an Associate Editor for detailed evaluation? A guarantee the paper would go to review? The specifics here are likely to vary among journals, and even among Editors. But, if scouted papers are just as likely as regular submissions to receive desk rejections, over time this may lead to disgruntled authors and ultimately devalue the notion of a Preprint Scout invitation. 


Conclusion.

We believe that Preprint Scouts, if deployed with appropriate training, policies, and deliberation, may be a valuable tool. They may help journals reverse historical and current biases in who publishes. They may help Editors steer journals into publishing subjects where they previously did not get as many submissions as desired. These opportunities can only be realized, however, with a diverse and well-trained staff of Preprint Scouts. The system is not without costs, though (personnel time, most notably). And, if many journals begin adopting this type of system, other journals may feel obliged to do the same.








Sample Invitation letter


Dear Dr. (author last name here),


My name is Maurine Neiman, and I am the Preprint Editor for the Proceedings of the Royal Society B ("Proc B"). Proc B is the Royal Society of London’s primary biological journal accepting original articles of outstanding scientific interest. Proc B's scope covers the breadth of biology and is described more fully at http://rspb.royalsocietypublishing.org/about.


My Preprint Editorial Team and I have used a survey of the papers published in bioRxiv over the last month to identify your manuscript, "manuscript title here", as one that we consider a potentially good fit for Proc B, pending, of course, formal Editorial consideration and peer review. You can learn more about our team and our process in our recently published paper (https://doi.org/10.1098/rspb.2021.1248). I am now writing to encourage you to submit your manuscript to Proc B via our online submission system (https://mc.manuscriptcentral.com/prsb). 


This invitation in no way assures selection for peer review. Indeed, one possible outcome is rejection without review. My invitation to you does indicate that we think that your paper could be appropriate for publication in Proc B. While you are not obligated to respond to this email, it will be very helpful for us going forward to know whether you (1) have interest in submitting to Proc B, (2) do end up submitting to Proc B, and (3), if (2), the outcome of the review process. 


In any event, if you do submit to Proc B, we do request that you mention that you were solicited via the Proceedings B Preprint Editorial Team in your cover letter. We also make the submission process very easy for you through automated transfer from bioRxiv, though you can also submit through the traditional route on the Proc B page listed above. If you choose to submit via bioRxiv, you can submit your manuscript to Proc B by selecting the journal from a drop-down list available in the bioRxiv author interface. You will then receive an automated email from Proc B with instructions on how to complete your submission. Once your submission is complete, your paper is treated at Proc B as a regular submission. The bioRxiv number remains on the submission so we will know that the paper was transferred via this route. One of our goals of this preprint solicitation endeavor is to broaden the scope of submissions to our journal. If you do choose to submit, please also ensure that you briefly explain in your cover letter why you believe the paper is a good fit for Proceedings B. You will need to format your manuscript and associated elements to meet the requirements for submission to Proc B (http://rspb.royalsocietypublishing.org/author-information).


Please don't hesitate to contact me if you have any questions or would like additional information.


Sincerely,


Maurine Neiman, Ph.D.

Professor

Department of Biology

Department of Gender, Women’s and Sexuality Studies

Provost Faculty Fellow for Diversity, Equity, and Inclusion

University of Iowa

Editor and Preprint Editor, Proceedings of the Royal Society of London B

maurine-neiman@uiowa.edu

http://bioweb.biology.uiowa.edu/neiman/

Twitter: @mneiman

