Tuesday, February 16, 2021

In praise of double-dipping from data.

 

There’s a widely held view in biology that one dataset should equal one paper. Using the same data in two papers is often viewed with suspicion. To readers, it may appear that the authors are trying to get twice the academic credit for a given amount of work. This even has a name, ‘double-dipping’. (Note that this is distinct from publishers’ ‘double-dipping’, in which they collect both author open access fees and institutional subscriptions.)

 

I recently double-dipped. Triple-dipped, quadruple-dipped, and more, really. I was very hesitant to do so, and yet happy with the result. I’m writing this post to explain what I did, why I did it, and why I think double-dipping can be an excellent choice for authors, readers, and funding agencies. So much so that I think funding agencies should consider fellowships to support salaries of students, postdocs, and perhaps even faculty to revisit published data and squeeze out more insights.

 

What I did.

In 2009, I obtained funding from the Howard Hughes Medical Institute to pursue a study of how stickleback immune genes evolved across complex landscapes of populations connected by varying degrees of gene flow (for hosts and parasites). The focus initially was meant to be on MHC.  Our very first order of business was to set the stage for the evolutionary genetics work by learning the basic natural history of the host-parasite system I intended to study. What parasites are present in the area I worked in on Vancouver Island? To what extent do they differ from one stickleback population to the next? To what extent is this parasite community variation attributable to abiotic variables, biotic communities, fish population traits, or fish genotypes? By answering such basic natural history questions, we get a foundation to choose the most interesting populations for evolutionary genetic contrasts. 

 




To collect such data, we conducted a field survey in May 2009. “We” included my graduate student Will Stutz (who helped conceive of the project to begin with), a graduate student collaborator from UBC, Travis Ingram, a new PhD student Yuexin Jiang, two undergraduates (Chris Thompson and Todasporn Rodbumrung), and a high-school biology teacher Kim Hendrix (most of us pictured above). Over the course of 4 weeks on Vancouver Island we sampled ~100 stickleback from each of 45 populations, a mix of lake, stream, and estuary sites. Then from fall 2009 through 2013 I employed Julie Day and then Kim Ballare as research technicians to count parasites, characterize stomach contents, and measure ecomorphology for ~3500 fish specimens. From 2013 to 2014, Hollis Woodard worked as a technician in my lab to genotype a couple thousand stickleback for MHC, and Yoel Stuart helped her do ddRADseq on a subset of individuals to get neutral genetic markers. The result was an enormous natural history dataset on diet, morphology, infection, MHC, and >100,000 SNPs, all set within a geographic context of varying lake sizes, elevations, etc., scattered across watersheds on north-eastern Vancouver Island.




 

The resulting data set was enormous, and intimidating. I had all the data in hand by sometime in 2014, and it took me nearly a year of on-and-off-again work just to organize and curate the data to check for errors, odd outliers, misspelled population IDs, and all those little things that can creep into a dataset that has been handled by many separate people. For several years in a row, the data would lie fallow for months on end, then I would find a bit of time to work on analyses, only to set it aside and start over half a year later. The problem was that this was nobody’s primary dataset. It was meant to be exploratory (with some a priori predictions to be sure), and there was much to explore, and I was the sole person delving into these explorations for a long time. Each summer I’d spend a couple weeks on Cape Cod with my family and I’d sneak in some time at a great French patisserie, spending a few hours analyzing these data while my kids were in summer camp (photo below). Over time I built up a set of analyses that answered my a priori questions and went a step further, describing the spatial structure of the parasite metacommunity in great detail.

 

Thousands of lines of R code later, it was time to write. But I pretty quickly found that the writing built up to over 100 pages of text. I was writing a book, and not even I want to read a book-length document on the natural history of stickleback parasites on Vancouver Island. The trick was that there are so many distinct questions that can come from a dataset of this size. Do we analyze parasite species richness? Or multivariate composition? Do we analyze each species of parasite separately, or via ordination in one group? Do we include genetics, or host diet, or lake abiotic conditions? These are all interesting, and all tell us something different, but doing them all simply took way too much text. It would strain the interest of any but the most dedicated readers.

 

At some point, in stepped Emlyn Resetarits (a PhD student with Mathew Leibold and myself), who helped convince me to split this up into bite-sized parts. The result:


1)    A paper focused on parasite metacommunity composition – which species are found where, and which species are found together or apart, and what predicts this variation? We throw in a big GWAS study of many parasites in the appendix, which might have been a paper unto itself. Bolnick et al 2020 Ecology




2)    A paper focused on parasite metacommunity diversity – not so much who is found where, but how diverse they are, which revealed a richly different story than species composition alone. Bolnick et al 2020 Ecography


3)    A third paper set aside the parasite information to focus on the evolutionary ecology of stickleback diet and individual specialization. This was revisiting a topic that was core to my academic beginnings, which I hadn’t touched in a few years. But the dataset on stickleback diets (collected to understand infection patterns) was also exactly something I’d hoped to achieve for years. This turned out to give a beautifully clean and intuitive result that generalist populations (eating roughly equal mixes of benthic and limnetic prey, in mid-sized lakes) had the greatest dietary and phenotypic diversity. But, these functional variances were unrelated to genomic heterozygosity, which increased steadily with lake size. In short, neutral genomic diversity and functional ecological diversity were unrelated, and responded to entirely different features of the populations’ environments (Bolnick and Ballare, 2020, Ecology Letters).


4)    That Ecology Letters paper happened to include, in passing, a GWAS analysis of SNPs related to lake size. Are there loci whose allele frequency varies predictably between smaller versus larger lakes, and whose heterozygosity was largest in mid-sized lakes? Well, at an American Naturalist conference right around when this paper came out, Diana Rennison and I compared notes. I had this GWAS between benthic versus limnetic allopatric lake populations, and she had population genomic data for benthic versus limnetic species pairs in sympatry. Why not compare these? Harer et al 2020 Molecular Ecology was the result. Remarkably, the benthic-limnetic species pairs show both more repeatable evolution, and greater divergence, than allopatric populations.


5)    Most recently, we finally got to the original motive for this data collection: Major Histocompatibility Complex genetic diversity. MHC (here MHC IIb) is among the most diverse genes in the vertebrate genome, frequently said to be under balancing or frequency-dependent selection to maintain this diversity. The whole point of this survey was to determine the diversity of MHC, and its association with parasite load and diversity. Well, with help from Stijn de Haan we finally ran the bioinformatics pipeline to identify alleles and genotype individuals (using a bioinformatics protocol developed by Will Stutz, the PhD student who first planned this study with me). Then Foen Peng adopted the dataset to run statistical analyses and write. The resulting paper was posted to Molecular Ecology just a few days ago: Peng et al 2021 Molecular Ecology. Disappointingly (i.e., contrary to our starting motives), very little about the parasite community tells us anything about MHC. Instead, MHC diversity seems to be best predicted by neutral genomic diversity, not parasite diversity. And MHC divergence between populations is best predicted by genomic Fst, and not parasite or ecological differences. This only deepens the puzzle of MHC diversity for us, because it certainly is insanely diverse, yet we twice now have failed to find a clear adaptive explanation for this variation (see also Stutz et al 2017 Molecular Ecology, which used a different set of populations and different analytical approach).


6) Ultimately, I am glad to say we mostly moved away from the MHC focus, which seems to not matter for the parasite that engages us most, the cestode Schistocephalus solidus (pictured below), and instead started doing QTL mapping, expression, and GWAS analyses. The data set collected in 2009 for some basic natural history proved to be extremely useful in motivating and guiding our genetic mapping studies (manuscripts in prep, and also Weber et al 2017 PNAS).



 

Why I did it: the benefits of N-dipping

Now, apologies for what must seem like a lengthy advertisement for a facet of my lab’s recent work (okay, it sort of is an advert). But I have a broader goal with this post. Here we had one survey, one dataset, that has yielded five papers (and more in queue). That’s some serious data recycling. So is it ethical? Absolutely yes, indeed I’d say it is morally preferable. 

 

Here’s why. First, each of the papers cited above asks an entirely different question of the data. The biggest overlap is between Peng et al 2021 and Stutz et al 2017, but those used two different datasets, and different analytical approaches, to ask the same question. Stutz et al used parapatric populations to take advantage of gene flow; Peng et al used allopatric populations but far more of them, with the added bonus of ddRADseq genomic data. And the MHC data analyses don’t really make sense until you have grappled with covariation between parasite species, and described their diversity, so we had to tackle the Ecology and Ecography papers first. So, the papers support each other, but they certainly aren’t redundant conceptually.

 

Second, we could have put all this into one paper, but it would have been a 150-page behemoth. You don’t want to read that, and I don’t want to write it. In fact, I did write it. At least, the Ecography and Ecology and Ecology Letters papers were all mashed into one >100 page manuscript at one point, and the MHC data would have added another >50. And it was so hard to keep clear threads of which analyses went with which results. Breaking it up made it easier to read and understand.

 

Third, funding agencies put significant funding into this dataset, which required salary for four technicians and two postdocs to pull together. Aren’t we morally obligated to wring every bit of insight out of that hard-won data?

 

A call to funds

This last question leads me to a suggestion. Many of us accumulate datasets as our careers progress. Often these data sets have a rich multi-layered nature, but we publish the most exciting bit of information then move on. This is partly because we prioritize publishing the highest-impact work we can, and for career advancement are better off setting aside less exciting findings, to spend time on the splashiest stuff. But there is also a financial angle. Data analyses, and re-analyses, and writing, take time. Which takes money. Most Associate Professors and Professors have sedimentary layers of unpublished or incompletely published analyses. These took funds to generate. It is a shame to have their results only partly published, with important elements gathering dust. The reason is that when I apply for my next grant, the review panel will want to see evidence of a new plan of action. What data will I collect next, what experiment will I conduct, what survey or model will I design? Revisiting older data to wring out more precious drops of insight? Not fundable. Well, it should be. Those data contain more insights. They are hard-won, costly to obtain, and carry more than one paper’s worth of knowledge and lessons. So I think it would be great if NSF or other funding agencies would support fellowships, whether for grad students, postdocs, or faculty, to revisit and repurpose existing data to achieve new ends. The data are there; we just need the time to delve deeper.


So to conclude, I think we need to encourage people to use their hard-won data more efficiently and thoroughly. That requires funds, and the social support for the practice. It also has the interesting side-effect that we end up with interconnected papers strewn across many different journals, that build a much larger holistic story when viewed together. I'm intrigued by the notion of bundling published papers to create a story arc that transcends a single paper in one journal. Consider the above description of a set of papers to be such a bundled set.
