Saturday, December 12, 2015

Archiving Primary Data (Or Not)

Scientists now work in an environment that might be called #OA-Shaming, where publishing behind a “paywall” is increasingly considered elitist: at best, unhelpful to science and, at worst, downright nefarious. Set against this backdrop, the DRY-BAR (Hendry-Barrett) lab meeting this week discussed the recent Trends in Ecology and Evolution (TREE) paper Archiving Primary Data: Solutions for Long-Term Studies. (Although I am on sabbatical in California, I was back in Montreal this last week.)

The paper has 63 authors, all scientists with individual-based long-term datasets. The paper was written as a response to the newish policy, now increasingly enforced by many journals and funding agencies, requiring that the data used to produce a paper be made freely and unconditionally available online. The main point of the TREE paper was that what might seem like a meritorious philosophy and policy (free data availability forever for all humanity!) might not be so obviously beneficial in some cases. I won’t detail their arguments in relation to long-term data sets, but rather reflect a bit on the issues from the perspective of someone who has experienced the transition in policy.

A first important point for young advocates of #OA data accessibility to recognize is that their philosophy is a logical extension of other areas of societal change – most obviously in entertainment. When I grew up, we mostly paid for our music and movies. Sure, we could copy tapes (one of my most prized possessions was my huge ghetto-blaster with high-speed tape-to-tape dubbing capacity) or VHS movies; but it was a pain and, really, we wanted to own the real thing. It was just the way it was. Now, of course, many young people rarely buy their music or movies, preferring instead to get them for free, a practice that really started with Napster and has since progressed through various re-incarnations. Without passing judgement on the merits of this philosophy, it is important to realize that free access to any product (music, movies, data) produced by someone else means that the other person (or entity or company) might not be receiving appropriate compensation for what could well have been produced at massive expense and effort. Sure, much of the science is publicly funded but scientists still clearly invest much of their life in procuring and analyzing data and writing papers.

My original file sharing site.
I found this photo of my cherished 1984 JVC PC-W300 tape-to-tape dubbing ghetto blaster online. I think it cost my parents $600 - best investment they ever made. This is what the ad says: "Fully functioning JVC dual deck portable. Heavy and very solidly made. Cleaned lubed and demagnetized. It sounds fantastic and is in excellent condition. Circa 1984! Dolby B, music search, auto-reverse, ceramic speaker drivers, phono input and much more. One recently ended on auction for over $500." It is owned by someone in Victoria, BC, where I owned mine in university. Maybe this one is mine! I could pay for it again!
A second important point is that a strong negative correlation likely exists between the extent to which a person feels data should be freely accessible and the amount of data that person has collected. (Which might just be an artifact of the fact that #OA advocates tend to be younger and so can’t have produced much data yet.) Interpreted cynically, it is simple, easy, and cost-free to demand that all data be freely accessible when one doesn’t have much data of their own. Once #OA advocates collect a large amount of data and realize first-hand the expense and effort and implications for their careers, then they will have a clear understanding of what they have been asking for up to that point. Perhaps they will feel the same way, or perhaps they will want to hang onto their data a bit longer.

Don’t worry, be happy.

Regardless of these personal opinions and any realities they might or might not reflect, my main point in this post might simply be characterized in one statement: Don’t worry, be happy! This sentiment is based on two realizations.

First, scientists who fear that others will scoop their research or use their data poorly just because it is freely accessible online are likely deluding themselves as to the demand for, and the likely use of, their data. This statement paraphrases something that I was told had been said by the editor of a major ecology/evolution journal at the time it started to require that data be published on DRYAD or other online archives. In essence, the basic argument is that most data published online will never be used, or if it is used, it will not be used in a way that harms the data-collector’s future research or career. This got me wondering – how have my data fared in this new regime?

A quick search for “Andrew Hendry” in DRYAD found data for 16 papers published between 2010 and 2015 (stats for all DRYAD on Dec. 12, 2015 are above). One of my papers was, in fact, based on a long-term (20 years) individual-based study run by my collaborators, who are not authors on the above TREE paper. These data “packages” (the webpage showing the paper information with the list of data files) have been viewed a total of 3788 times, a much larger number than I had expected. Three of the packages have been viewed more than 500 times and one nearly 1000 times! However, only a fraction of these views led to downloads. Counting only the most-downloaded data file per paper, downloads totaled 564, still a surprisingly large number. One data file has been downloaded more than 148 times! Some interesting (and perhaps obvious) patterns were evident. First, the number of downloads was strongly correlated with the number of views (first figure below), although this correlation is quite imperfect. For instance, one data file has been downloaded 25 times on 26 views (96%), whereas another has been downloaded only 68 times in 975 views (7%). Downloads and views are, not surprisingly, higher for older papers; and the highest frequency of downloads to views (96%) is for one of the most recent papers. Finally, the number of times a paper is cited is correlated with the number of downloads considering only data packages posted before 2014 (second figure below). Only part of this association is due to the effects of publication date.
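
For anyone who wants to check the percentages quoted above, here is a minimal Python sketch. It uses only the counts given in the text; the labels `file_a` and `file_b` are placeholders I made up, since the data files are not named here, and the overall rate is only a rough figure because the download total counts just the most-downloaded file per paper.

```python
# Download and view counts quoted in the post for two data files.
# The labels "file_a" and "file_b" are hypothetical placeholders.
files = {
    "file_a": {"downloads": 25, "views": 26},
    "file_b": {"downloads": 68, "views": 975},
}


def download_rate(downloads, views):
    """Return downloads as a percentage of views."""
    return 100.0 * downloads / views


# Per-file download rate, in percent.
rates = {name: download_rate(**counts) for name, counts in files.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f}% of views led to a download")

# Totals quoted in the post: 3788 package views and 564 downloads
# (downloads count only the most-downloaded file per paper, so this
# overall rate is a rough lower bound, not an exact figure).
overall_rate = download_rate(564, 3788)
print(f"overall: about {overall_rate:.0f}% of package views")
```

The two per-file rates round to the 96% and 7% given in the text, which shows just how much the download-to-view ratio can differ between packages.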

At this point, my first thought was "Wow, it looks like freely-accessible data is, indeed, freely accessed – frequently." So how often have I been scooped or how often has my data been used inappropriately? Never, to my knowledge. As far as I can tell, an analysis of these data has never been published anywhere. How can this be? Perhaps robots are downloading my data. Perhaps my data sucks and this is only noticed after a download. Perhaps the data are being used but only in meta-analyses. Or, perhaps, I am about to be scooped soon! However, I suggest the more innocent alternative. People are curious and interested but they have no intention of taking the data and publishing it to their own ends. Don’t worry, be happy.

My second realization speaks to the counterpoint. That is, even if data aren’t freely available, it won’t have a major negative impact on science. First, I would bet that nearly all reasonably recent data are accessible in one way or another. In fact, nearly every time I have asked a scientist for their data, they have provided it – though, admittedly, it has often taken some repeated prompting. My favorite instance occurred in 2005 when I was writing a paper about morphological changes in Darwin’s finches. It was 2004 and I was working at a site (Academy Bay, Santa Cruz Island) from which finches had been sampled in 1968 (the year I was born!) by Hugh Ford, who had published the data in 1973. I searched online and found that Hugh was a professor at the University of New England in Armidale, Australia. I emailed him and he responded that the data were in old notebooks and he would happily dig them out, enter the data in Excel, and send it to me. These data became a key part of the paper and I invited Hugh to be a co-author even though he hadn’t asked. On the flip side, I have been asked many times (I will speculate the number is over 50) to provide raw data from my previous work and I have – every single time – provided it. A few times I was a collaborator on the resulting paper but most of the time none of us saw the need for me to be an author.

The simple point is that data will generally be there simply for the asking, regardless of whether it is “archived” online. (An exception from my own experience is given below - but I just digitized it from the paper in the end anyway.) One might complain that such data often come with unreasonable demands for co-authorship but, really, if one subscribes to the #OA philosophy about the betterment of science and society, then who cares, really, if you add another author to your paper? If you want to exclude from co-authorship someone who contributes data to your paper, then surely you shouldn’t simultaneously complain when people don’t want to share their data.

So, no matter how this plays out, and I think where it is going is pretty clear, I am confident that science won’t really be that compromised either way. If data are truly valuable, then they can be obtained even without being freely accessible online. At the same time, if one puts their data online and makes it freely accessible, then it is extremely unlikely that doing so will ever harm their research programs. In fact, I have never heard of a scientist who has had a bad experience with data they have placed online – although I suspect there must be such an instance.

In conclusion, data archiving and #OA advocates and data archiving and #OA detractors both: don’t worry, be happy!

Sunday, December 6, 2015

An Evolutionary Biologist's Apology

A few months ago, I was attending the joint lab meeting of Rosemary Gillespie and George Roderick at UC Berkeley, where I am on sabbatical. At the start of the meeting, Philip Spieth showed us a review in Bioscience about a book called A Mathematician’s Apology that was, this year, celebrating its 75th anniversary. What made the book very interesting to evolutionary biologists was that it was written by G. H. Hardy, a British mathematician most of us know as the (co)originator of the Hardy-Weinberg equilibrium. If you do any work in population genetics or evolution or indeed in many other aspects of biology, you will know about this equation. “Hmmm,” I thought at the time, “that might be a cool read,” and so the same day I ordered it from Amazon. The slim book, of which about a third was an excellent introduction by C.P. Snow, arrived a few days later and it became my reading material for the next few nights.

In his book, Hardy used the term apology “in the sense of a formal justification or defence (as in Plato's Apology of Socrates), not in the sense of a plea for forgiveness” (from Wikipedia). In particular, Hardy mounted a defense of his brand of “pure” or “real” mathematics, in contrast to applied mathematics, which he described in terms such as “trivial,” “ugly,” and “dull.” Now, I am all for defending a science, or any endeavor really, in the sense of intrinsic interest or beauty – the pure delight of discovery. And this is what Hardy logically did for his real mathematics; yet the bizarre additional facet of his apology was that real mathematics could be justified because of its very uselessness. That is, if an endeavor can’t be of any use, then it can’t be of any misuse either. Indeed, the applied math that Hardy discussed was often considered in the context of its use in war. Thus, because real mathematics had no use, it couldn’t be used for horrible things, which provided a further justification for its existence. In short, Hardy’s justification boiled down to beautiful and useless.

I was recently caused to reflect (again) on how these sorts of justifications relate to my own vocation – evolutionary biology. I have been back in Montreal this past week for a meeting of our long-standing bioGENESIS core project, which was originally a part of DIVERSITAS and is now transitioning to Future Earth. Within DIVERSITAS, an NGO focused specifically on biodiversity, our role was to bring evolutionary perspectives, in both deep-time (e.g., phylogenetics) and contemporary time (e.g., genetic variation within species), to biodiversity science. The value of this role would initially seem straightforward given that all past, current, and future biodiversity is the product of evolution; yet we sometimes found ourselves having to “apologize” for our existence within the context of a growing emphasis not on biodiversity per se but on ecosystem services. In this context, we wrote a number of papers about the importance of evolutionary thinking not only for biodiversity but also for ecosystem services. Most directly, we pointed out that EVOsystem services were the foundation of all current and future ecosystem services, as well as many other useful and non-useful aspects of biodiversity. And, over the years, a few of these papers and the debates surrounding the ideas made it into some of my blog musings.

DIVERSITAS has now ended and its core projects, including bioGENESIS, are being folded into Future Earth, along with a number of other global change NGOs. Future Earth is a much bigger and more encompassing enterprise than was DIVERSITAS: its focus goes beyond biodiversity to immediate human concerns, such as health and wellbeing, alternative energy sources, sustainable development, social structures, and so on. Thus, in the context of transitioning to Future Earth, bioGENESIS again needs to justify its continuance. (I am not being pejorative here because, quite reasonably, all core projects transitioning into Future Earth need to do the same thing.) As a result, we spent several days writing a “transition document” that describes how we will fit into Future Earth, how we will address its core concerns and questions, and how we will interface with other core projects, as well as with likely stakeholders. In essence, one can think of the transition document as having elements of an “apology” in the sense of Hardy and Socrates, which made me wonder: What would an evolutionary biologist’s apology actually look like? (This apology is my own and does not necessarily reflect the views of bioGENESIS or Future Earth.)

Following Hardy, and countless other commentators, we might divide any scientific discipline into “basic” (Hardy’s “real”) and “applied” contexts. Basic evolutionary biology is interesting, fascinating, inspiring, and enjoyable but, at the same time, often useless. We might here consider paleontology. Just think of how much richer our understanding and appreciation of the world has become simply because of all those cool dinosaurs that have been described. But this discovery and knowledge is useless, right? Well, perhaps not in the sense that such discoveries increase revenues at museums and lead to Hollywood blockbusters. But what about more specific discoveries, such as the fact that dinosaurs had feathers? This finding is super cool but surely the information is truly useless. Jurassic Park would not have been any scarier, and perhaps less so, if the velociraptors had feathers (see the video below). Thus, paleontology is really about wonder and beauty that we appreciate in the sense of great art or music while being useless in an applied context. So it seems to me that this branch of evolutionary biology is pretty close to justifiable on the grounds by which Hardy justified “real” math.

Of course, most evolutionary biologists applying for grants do not say their work is useless. Instead, they often say precisely the opposite. That is, they find ways to make their work sound applied and relevant even if it isn’t, really. Sometimes they even write grants for applied work not because they want to address the applied question but rather because they think it is more likely to attract money, which will then allow them to piggy-back “real” science onto the “trivial”, “ugly”, and “dull” applied science that the grant outlines. So, in reality, many evolutionary biologists spend time justifying their existence in precisely the opposite way to Hardy – that their work is useful. Examples abound and our transition document for Future Earth makes four main cases (which I here rephrase in my own words and meanings).


Evolutionary history is relevant to many human endeavors. As just one example, knowing the evolutionary tree of life means that we can be sure to preserve particularly distinctive branches of life that might harbor properties that are useful for us in one way or another (remember, we are here justifying evolutionary biology in relation to its usefulness to humans). This perspective is often discussed in the context of conserving many diverse forms of life as “optional values” for the future.

Contemporary (sometimes called “rapid”) evolution is essential for projecting and shaping the future, including how changing environments will shape populations, communities, and ecosystems. For instance, evolutionary potential – and natural selection acting on it – unquestionably determines whether or not populations can persist in the face of climate change, invasive species, pollution, habitat loss, harvesting, and so on. Evolutionary potential thus also shapes all of the community and ecosystem properties (including “services”) that stem from organisms.

Evolutionary thinking has direct benefits in many directly applied contexts. As one key example, the pervasiveness of evolutionary thinking in medicine has allowed us to make incredible advances in the control of infectious diseases and cancer by slowing the evolution of resistance in pathogens and cancerous cell lines. As another example, evolutionary thinking has been very effective in agriculture in slowing the evolution of resistance to pesticides, herbicides, and fungicides. And, of course, we have biological control and the diversification/domestication/improvement of crops and so on.

Evolutionary tools can be applied to many other contexts. For instance, evolutionary thinking (we should seek a polymerase that can function at high temperatures in test tubes by finding hot spring bacteria that are naturally adapted to high temperatures) is what led to efficient PCR methods, which have completely revolutionized genetic analyses and therefore medicine and agriculture and much else. In addition, the basic ingredients of evolution (variation, selection, inheritance) provide an algorithm that has been useful in many engineering and design contexts (e.g., the use of “genetic algorithms” in many optimization procedures).


Clearly, evolutionary biology as a general field is critically important, indeed essential, for pretty much any human endeavor. “But wait,” I hear you saying, “this suggests that, beyond the gee-whiz dinosaur argument, we should not give any more money to basic evolutionary biology.” I too would – at this juncture of the apology – start to be concerned on the same account, not the least because much of my research is focused simply on understanding the way that various aspects of the world work. How fast do salmon evolve? What forces drove the evolution of Darwin’s finches? How do natural selection and gene flow oppose each other in threespine stickleback? How do different predation environments cause reproductive isolation to evolve between guppy populations? None of these – and countless other – evolutionary studies have any obvious immediate use. Yet these studies are important, perhaps more so than any of the other angles described above, for several reasons. First, the results of such studies are interesting, beautiful, amazing, inspiring, and just damn cool. Thus, they are justifiable in the same way as are paleontology and real math. Second, basic evolutionary biology elucidates patterns and mechanisms, the understanding of which can subsequently be adopted and used in applied evolutionary questions. For instance, the basic studies showing that evolution in natural populations can be rapid have subsequently had profound influences on conservation biology, natural resource management (fisheries!), agriculture, medicine, and so on. Thus, basic evolutionary biology studies of natural populations are justifiable on all levels: they have artistic appeal, like Hardy’s real math, and they have potential future utility, like Hardy’s applied math.

Hardy was concerned that if something was useful it could also be misused and thus cause harm: trigonometry helps design buildings but it also helps aim artillery shells. This same concern could well be leveled at evolutionary biology. Indeed, eugenics, which in extreme forms attempted to justify and promote the unfair treatment of various categories of humanity, took at least some twisted inspiration from applications of Darwin’s initially innocent ideas regarding “survival of the fittest.” Thus, evolution – like any other science or, for that matter, art – can be used for both good and bad. Here is where societal values and controls need to come into play. That is, science itself is neither good nor bad – these judgments must be rendered only to our uses of it. I am reminded of the "I have not come for what you hoped to do. I've come for what you did." scene in V for Vendetta. Nuclear physics can provide energy but it can also destroy the world. Fortunately, modern societies seem pretty good at – at least with time – sorting the good from the bad of any new scientific advance.

All species are the product of evolution and will evolve in the immediate and long-term future. Thus, all services and disservices that species provide have been, are being, and will continue to be shaped by evolution. Without applying evolutionary thinking to species and their biological communities, we will have a drastically reduced ability to respond to ecological and societal challenges. But we also need basic evolutionary biology because it is, first, fine art (at least some of it is) in that we appreciate and enjoy its discoveries, including literally in the form of museums and nature documentaries and also in an enhanced appreciation for how the world works as we walk or swim through it. Moreover, the fundamental truths revealed by basic evolutionary biology will often have applications that we can’t even envision. I am optimistic that these applications will be all (or nearly all) to the good in the years to come. But it is up to us and to you.

Dobzhansky’s famously overused phrase “Nothing in biology makes sense except in the light of evolution” is clearly a vast understatement. In fact, nothing in the world makes sense except in the light of evolution. Of course, much of the world still doesn’t make sense and so only the widespread application of evolutionary thinking will bring the necessary illumination.


Of course, I am not the first to consider justifications for the study of evolutionary biology, with a good previous example being:

Futuyma, D. J. 1995. The uses of evolutionary biology. Science 267:41-42.

A 25-year quest for the Holy Grail of evolutionary biology

When I started my postdoc in 1998, I think it is safe to say that the Holy Grail (or maybe Rosetta Stone) for many evolutionary biologists w...