Saturday, December 12, 2015

Archiving Primary Data (Or Not)

Scientists now work in an environment that might be called #OA-Shaming, where publishing behind a “paywall” is increasingly considered elitist: at best, unhelpful to science and, at worst, downright nefarious. Set against this backdrop, the DRY-BAR (Hendry-Barrett) lab meeting this week discussed the recent Trends in Ecology and Evolution (TREE) paper Archiving Primary Data: Solutions for Long-Term Studies. (Although I am on sabbatical in California, I was back in Montreal this last week.)

The paper has 63 authors, all scientists with individual-based long-term datasets. The paper was written as a response to the newish policy, now increasingly enforced by many journals and funding agencies, requiring that the data used to produce a paper be made freely and unconditionally available online. The main point of the TREE paper was that what might seem like a meritorious philosophy and policy (free data availability forever for all humanity!) might not be so obviously beneficial in some cases. I won’t detail their arguments in relation to long-term data sets, but rather reflect a bit on the issues from the perspective of someone who has experienced the transition in policy.

A first important point for young advocates of #OA data accessibility to recognize is that their philosophy is a logical extension of other areas of societal change – most obviously in entertainment. When I grew up, we mostly paid for our music and movies. Sure, we could copy tapes (one of my most prized possessions was my huge ghetto-blaster with high-speed tape-to-tape dubbing capacity) or VHS movies; but it was a pain and, really, we wanted to own the real thing. It was just the way it was. Now, of course, many young people rarely buy their music or movies, preferring instead to get them for free, really starting with Napster and thence progressing through various reincarnations. Without passing judgement on the merits of this philosophy, it is important to realize that free access to any product (music, movies, data) produced by someone else means that the other person (or entity or company) might not be receiving appropriate compensation for what could well have been produced at massive expense and effort. Sure, much of the science is publicly funded, but scientists still clearly invest much of their lives in procuring and analyzing data and writing papers.

My original file sharing site.
I found this photo of my cherished 1984 JVC PC-W300 tape-to-tape dubbing ghetto blaster online. I think it cost my parents $600 - best investment they ever made. This is what the ad says: "Fully functioning JVC dual deck portable. Heavy and very solidly made. Cleaned lubed and demagnetized. It sounds fantastic and is in excellent condition. Circa 1984! Dolby B, music search, auto-reverse, ceramic speaker drivers, phono input and much more. One recently ended on auction for over $500." It is owned by someone in Victoria, BC, where I owned mine in university. Maybe this one is mine! I could pay for it again!
A second important point is that a strong negative correlation likely exists between the extent to which a person feels data should be freely accessible and the amount of data that person has collected. (Which might just be an artifact of the fact that #OA advocates tend to be younger and so can’t have produced much data yet.) Interpreted cynically, it is simple, easy, and costless to demand that all data be freely accessible when one doesn’t have much data of one's own. Once #OA advocates collect a large amount of data and realize first-hand the expense and effort and implications for their careers, then they will have a clear understanding of what they have been asking for up to that point. Perhaps they will feel the same way, or perhaps they will want to hang onto their data a bit longer.

Don’t worry, be happy.

Regardless of these personal opinions and any realities they might or might not reflect, my main point in this post can be summed up in one statement: Don’t worry, be happy! This sentiment is based on two realizations.

First, scientists who fear that others will scoop their research or use their data poorly just because it is freely accessible online are likely deluding themselves as to the demand for, and the likely use of, their data. This statement paraphrases something that I was told had been said by the editor of a major ecology/evolution journal at the time it started to require that data be published on DRYAD or other online archives. In essence, the basic argument is that most data published online will never be used or, if it is used, it will not be used in a way that harms the data-collector’s future research or career. This got me to wondering – how have my data fared in this new regime?

A quick search for “Andrew Hendry” in DRYAD found data for 16 papers published between 2010 and 2015 (stats for all DRYAD on Dec. 12, 2015 are above). One of my papers was, in fact, based on a long-term (20 years) individual-based study run by my collaborators, who are not authors on the above TREE paper. These data “packages” (the webpage showing the paper information with the list of data files) have been viewed a total of 3788 times, a much larger number than I had expected. Three of the packages have been viewed more than 500 times and one nearly 1000 times! However, only a fraction of these views led to downloads. Counting only the most-downloaded data file per paper, downloads totaled 564, still a surprisingly large number. One data file has been downloaded more than 148 times! Some interesting (and perhaps obvious) patterns were evident. First, the number of downloads was strongly correlated with the number of views (first figure below), although this correlation is quite imperfect. For instance, one data file has been downloaded 25 times on 26 views (96%), whereas another has been downloaded only 68 times in 975 views (7%). Downloads and views are, not surprisingly, higher for older papers; and the highest ratio of downloads to views (96%) is for one of the most recent papers. Finally, the number of times a paper is cited is correlated with the number of downloads considering only data packages posted before 2014 (second figure below). Only part of this association is due to the effects of publication date.
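The download-to-view percentages quoted above are simple ratios. For anyone who wants to run the same check on their own DRYAD numbers, here is a minimal sketch in Python. The function name and labels are my own invention for illustration, and the only inputs are the two examples given in the text.

```python
def download_rate(downloads: int, views: int) -> float:
    """Return downloads as a percentage of views."""
    if views <= 0:
        raise ValueError("views must be positive")
    return 100.0 * downloads / views


# The two data files mentioned in the text, as (downloads, views).
# The labels are hypothetical; the post does not name the files.
examples = {
    "high-rate file": (25, 26),
    "low-rate file": (68, 975),
}

for label, (downloads, views) in examples.items():
    # Rounds to the nearest whole percent, as in the text.
    print(f"{label}: {download_rate(downloads, views):.0f}%")
```

This prints 96% and 7% for the two files, matching the figures above.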

At this point, my first thought was "Wow, it looks like freely-accessible data is, indeed, freely accessed – frequently." So how often have I been scooped or how often has my data been used inappropriately? Never, to my knowledge. As far as I can tell, an analysis of these data has never been published anywhere. How can this be? Perhaps robots are downloading my data. Perhaps my data sucks and this is only noticed after a download. Perhaps the data are being used but only in meta-analyses. Or, perhaps, I am about to be scooped soon! However, I suggest the more innocent alternative. People are curious and interested but they have no intention of taking the data and publishing it to their own ends. Don’t worry, be happy.

My second realization speaks to the counterpoint. That is, even if data aren’t freely available, it won’t have a major negative impact on science. First, I would bet that nearly all reasonably recent data are accessible in one way or another. In fact, nearly every time I have asked a scientist for their data, they have provided it – though, admittedly, it has often taken some repeated prompting. My favorite instance occurred in 2005 when I was writing a paper about morphological changes in Darwin’s finches. It was 2004 and I was working at a site (Academy Bay, Santa Cruz Island) from which finches had been sampled in 1968 (the year I was born!) by Hugh Ford, who had published the data in 1973. I searched online and found that Hugh was a professor at the University of New England in Armidale, Australia. I emailed him and he responded that the data were in old notebooks and he would happily dig them out, enter the data in Excel, and send it to me. These data became a key part of the paper and I invited Hugh to be a co-author even though he hadn’t asked. On the flip side, I have been asked many times (I will speculate the number is over 50) to provide raw data from my previous work and I have – every single time – provided it. A few times I was a collaborator on the resulting paper but most of the time none of us saw the need for me to be an author.

The simple point is that data will generally be there simply for the asking, regardless of whether it is “archived” online. (An exception from my own experience is given below - but I just digitized it from the paper in the end anyway.) One might complain that such data often come with unreasonable demands for co-authorship but, really, if one subscribes to the #OA philosophy about the betterment of science and society, then who cares, really, if you add another author to your paper? If you want to exclude from co-authorship someone who contributes data to your paper, then surely you shouldn’t simultaneously complain when people don’t want to share their data.

So, no matter how this plays out, and I think where it is going is pretty clear, I am confident that science won’t really be that compromised either way. If data are truly valuable, then they can be obtained even without freely-accessible online access. At the same time, if one puts their data online and freely accessible, then it is extremely unlikely that doing so will ever harm their research programs. In fact, I have never heard of a scientist who has had a bad experience with data they have placed online – although I suspect there must be such an instance.

In conclusion, data archiving and #OA advocates and data archiving and #OA detractors both: don’t worry, be happy!


  1. Andrew, very nice and balanced post. Particularly interesting to see both your stats and reflections on your own Dryad downloads.

    I was a bit confused about the analogy to music sharing. I think we can agree that skeptics of data sharing policies would not be mollified simply by making repositories like Dryad into subscription-based systems you had to pay to access, right? Paying for access may be central to the open access discussion for literature, but I don't see the connection when it comes to data archiving.

    If you haven't seen it, you might be interested in the study of Vines et al., which (like studies in other fields) shows the success rate of data sharing "on request" to be ~ 1/4 for recent papers and to deteriorate over time, though predominantly not because people are unwilling to share but rather because they cannot be reached or no longer have the data readily available. Repositories can be a benefit to data producers too.

  2. My own success rate with data sharing is certainly much higher than 1/4, more like 4/5, but maybe I am lucky or asking for less important data.

    As for music sharing, the analogy is the use of something produced at the expense and effort of someone else without compensating that person(s). I don't mean financial compensation; I simply mean that forcing someone to put their data freely online means that people who wish to use that data do not have to contact the originator to discuss potential shared use (or other compensation) of the data, which might involve coauthorship, for example. Note that I am not advocating for or against this perspective; I am merely pointing out that the attitude as now applied to data archiving is probably a logical extension of the attitude that has arisen out of music sharing.

  3. Thanks for writing this thoughtful piece. I'm 99% sure that Dryad screens out robots accessing the data, so those should all be real views and real downloads by real people.

    I'm probably a fundamentalist when it comes to mandatory data sharing at publication, but this stems from two observations:

    1) this is science, where you have to provide evidence for your conclusions. The data are without doubt part of that evidence, and withholding the dataset makes about as much sense as withholding the tables and figures.

    2) far too many researchers cannot be trusted to preserve their data in the long term, and cannot be trusted to provide the data whenever it is requested (cf the Current Biology paper that Carl points to). Having the data public from the moment of publication is the only effective solution.

    I agree that there are downsides for individual authors when they share their data at publication (risk of getting scooped, etc), but these are vastly outweighed by the benefits to the community from having a) access to the evidence underlying the paper, and b) the ability to re-use the data for new purposes. It's therefore up to funding agencies and journals to enforce the public good of data availability even when authors are reluctant.

  4. This comment has been removed by the author.

  5. Andrew, I couldn't agree more with your points. Usage of data without full knowledge of the system seems an odd excuse, particularly given that data will outlive researchers (see our small piece, Act to staunch loss of research data). Moreover, given that most data generation is funded by taxpayers, the data should perhaps be seen as public patrimony (well, government "secrets" are too and lots of people don't have an issue with them, or so I heard from a few researchers who are not keen on sharing data).

    One small addition to the good points you made is that availability of data also helps in training students. There are at least two reasons for that: 1) Re-analyzing existing data for its original purpose helps students re-create the original results and so practice the analyses used. This is particularly useful today when so much code is shared; 2) Discussion groups among students can use the data to test new ideas (e.g., meta-analyses, new hypotheses, etc). The two are obviously not exclusive to students but they help a great deal in the early stages of developing intuition and creativity within academic endeavours.

