Tuesday, January 4, 2022

What happens to those Dryad repositories?

A decade ago, The American Naturalist and a few other journals instituted data archiving policies, coinciding with the launch of DataDryad. The rationale for this move was nicely articulated in a paper by Mike Whitlock and the other editors of participating journals. In this blog post I want to present a little analysis of what has happened to my own data archives since they were posted.

First, some rationales. The shift to data archiving has some now-familiar benefits:

1) It allows readers to check authors' claims against the original data. This has led to a number of high-profile retractions, arising either from discoveries of flaws in the data that undermine the credibility of the claimed results, or from errors in data analyses that fundamentally alter the results. Journals that did not institute data archiving rules earlier now find themselves struggling to evaluate potential flaws in previously published data.

2) If authors upload their data archives before submission (include your code too, please!), they can provide a private key to the journal when they submit, so that the data remain embargoed until publication but reviewers and editors can check them. I have recently seen cases where flaws that were not obvious from the written manuscript emerged when the data files were checked. This practice also gives journals time to check the completeness and integrity of the data repositories, to ensure they contain everything they should (a practice that The American Naturalist recently instituted, with help from DataSeerAI).

3) Archiving your data protects you from losing your own hard-won data. It's a back-up for you. 

4) Archived data give students and colleagues an opportunity to practice data analysis and to learn how you did your analyses. It is a teaching tool. Badly archived data inhibit this, because readers can't tell which variables are which. So, follow good-practice guidelines to build usable, complete, and clearly documented data archives.

5) Archived data can be used in subsequent meta-analyses. For instance, I have an idea for a way to analyze phenotypic selection data that differs from Lande and Arnold's. I can't just draw on published estimates of linear and quadratic selection gradients for this; I need the raw data. So to do the meta-analysis, I need to go beyond what people put in their papers and get the raw data, and that requires usable archives. Note: around 2015 I emailed a hundred authors (mostly of papers from the '90s and 2000s) asking for the data underlying their stabilizing selection estimates, and I got four responses. I never finished the study (P.S., reach out to me if you're interested; I've not had time to follow up on this).

6) The last, and probably least important, benefit is that you contribute a citable published resource. Some people include their Dryad repositories as separate entries in their CV, since each represents a product of your work in its own right. Few people actually do this, judging from Twitter poll results:

It is this last point, arguably the least important, that I want to explore in depth today. I decided on the spur of the moment this morning, as a form of procrastination, to go ahead and add my data repositories to my CV: authors, year, title, URL. It took me about an hour. As I did this, I began to notice a few points of interest I wanted to convey.

First, some basic stats: my first data repository was in 2011 (in The American Naturalist!), and I've found 39 archived datasets with my name. I've published more papers than that in the past 10 years. I confess I was inconsistent with archiving in the early 2010s, doing it only when the journal required it. Some archives aren't on Dryad because they are archived with the publishing journal instead (e.g., my 2020 papers in Ecography and Ecology both posted the archived data as supplements on the publisher's website). Some of my students and postdocs may have built archives that don't have my name on them. And then there are theory or review papers that don't merit archives.

The next thing that intrigued me as I began looking at these is that my data archives were getting views and downloads. Frankly, this surprised me a bit. Not in a bad way; I just figured archives were sitting there in a dark virtual room, lonely. But people are actually looking at them! The average archive has been viewed 148 times (sd = 129), with the leaders being an analysis of lake-stream stickleback parallel evolution (647 views; Kaeuffer et al., Evolution), a meta-analysis of assortative mating (464 views; Jiang et al., The American Naturalist), and yeast epistasis (369 views; Kuzmin et al. 2019, Science). The Jiang et al. one in particular didn't surprise me, because it is a bit provocative and I know it stimulated some rebuttals and re-analyses. Here is a histogram of repository views:

An important caveat: some unknown fraction of views and downloads may be from bots with unknown motivations.

Dryad also tracks downloads. On average my repositories have been downloaded 47 times (sd = 77), and this time it is the Jiang et al. AmNat paper that leads, with 402 downloads, followed by Kuzmin et al. 2019 (Science) at 287. Only one repository wasn't downloaded at all, and that is the one still subject to an embargo (Paccard et al.), because it assembles data from many collaborators, some of whom had yet to publish their own analyses. All told, my repositories have had a total of 5,791 views and 1,822 downloads. To be clear, I'm not trying to pat my own back here; my point is that data repositories get (to my eye) a shockingly large amount of attention. I had no idea.

I'm not alone in being surprised that repositories are being downloaded and viewed. A quick twitter poll suggests most people who responded thought they would get few, if any, downloads. I bet you all will be surprised.

Now, it'll come as no surprise that there's a time effect here: older repositories have had more time to accumulate views. And repositories that are viewed more also tend to be downloaded more.

The last thing I wanted to check is whether a paper's citation rate is correlated with its data downloads. The answer is clearly yes. I used a generalized linear model with Poisson errors (log link) to test whether the number of downloads (approximately Poisson distributed) depends on the year the repository was made public, the number of views of the repository, and the number of citations to the paper. All three terms were significant, but the largest effect by far is that well-cited papers are the most likely to have their repositories downloaded:


                 Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)    -1.034e+02    2.248e+01     -4.598   4.27e-06 ***
Year            5.268e-02    1.114e-02      4.730   2.25e-06 ***
PaperCitations  7.422e-03    3.438e-04     21.588    < 2e-16 ***
Views           1.968e-03    2.377e-04      8.278    < 2e-16 ***

Focusing on the paper citation effect (in a log-log plot): 

Note: the one red point there is a paper by Pruitt in Proceedings of the Royal Society with suspect data, from which the co-authors (myself included) asked to have our names removed. I include it here out of morbid curiosity.

Why do more-cited papers get more data downloads? My guess, and it is just a guess, is that there's a mix of motivations led by a desire to try recreating results, a desire to learn how to do analyses, and simple curiosity. These downloads might also be class exercises in action. For example, I assigned my spring 2021 graduate class (on graphical analysis of data) the task of finding a paper they liked, with a figure they thought could be improved, then getting the data repository and building a new and improved version of that figure. Another possibility, suggested by April Wright via Twitter, is that this is all driven by bots. But I struggle to see how bots would generate such a strong paper citation effect, as opposed to a year effect.

The last thing I want to note here is that the original proponents of data repositories argued that these are citable sources that could accrue their own citations and their own impact factor. After seeing how much my repositories are downloaded, I do think it is worth tracking total repository downloads, at a minimum, though as far as I can tell there is no automated way to do this at present. But citations to repositories are basically useless as far as I can tell. The repositories posted in the last 0-1 years have zero citations, apparently because it takes a while for the published article's citation of the data file to link back to Dryad. And for repositories published from 2011-2018, every single one had exactly one citation: from the paper that reported the data.

The upshots:

1. repositories are widely viewed, often downloaded, but never cited. We need a way to track this impact (and to exclude bot false positives).

2. your data repository is not a throw-away waste of time to be done in a sloppy manner. Prepare it carefully and clearly with complete data and clear README.txt file documentation. People are likely to view it, and likely to download it. If you go in assuming that people will actually look, you will feel compelled to make a better quality and more usable repository. And that's good for everyone, even if it takes a bit more time and care.

Update: According to Daniella Lowenberg of DataDryad, "DataDryad standardizes views and downloads against a code of practice written by @makedatacount & @ProjectCounter to ensure we eliminate bots, crawlers, double clicks, etc!"

For transparency, the data and code are provided here (I'm not sure how to attach a .csv to this blog page, so apologies; here is the code):


# Blog on Dryad: views, downloads, and citations of my data repositories

dat <- read.csv("DryadInfo.csv")

# Histogram of repository views
par(mar = c(5, 5, 1, 1))
hist(dat$Views, col = rgb(0.2, 0, 0.3, 0.5), breaks = 14,
     xlab = "Views", main = "", cex.lab = 1.4)

# Histogram of repository downloads (small offset so zero counts plot)
hist(dat$Downloads + 0.001, col = rgb(0.2, 0, 0.3, 0.5), breaks = 24,
     xlab = "Downloads", main = "", cex.lab = 1.4)

# Downloads increase with views
plot(Downloads ~ Views, dat, pch = 16)
model <- lm(Downloads ~ Views, dat)
text(150, 350, "t = 5.837, P = 0.000001")

# Poisson GLM: do downloads depend on year, paper citations, and views?
model <- glm(Downloads ~ Year + PaperCitations + Views, dat, family = "poisson")
summary(model)

# Year effects on (log) downloads and views; zeros set to NA before logging
par(mfrow = c(1, 2))
dat$Downloads[dat$Downloads == 0] <- NA
dat$Views[dat$Views == 0] <- NA
dat$PaperCitations[dat$PaperCitations == 0] <- NA

plot(log(Downloads) ~ Year, dat, pch = 16)
model <- lm(log(Downloads) ~ Year, dat)
text(2013, 0.5, "t = -0.29, P = 0.000054")

plot(log(Views) ~ Year, dat, pch = 16)
model <- lm(log(Views) ~ Year, dat)
text(2013, 2.5, "t = -0.22, P = 0.000006")

# Log-log plot of downloads vs. paper citations, flagging the Pruitt paper
par(mfrow = c(1, 1))
colortouse <- as.numeric(dat$LeadAuthor == "PRUITT") + 1
PCHtouse <- abs(as.numeric(dat$LeadAuthor == "PRUITT") - 1) * 15 + 1
plot(log(Downloads) ~ log(PaperCitations), dat,
     pch = PCHtouse, col = colortouse, cex = 2)
model <- lm(log(Downloads) ~ log(PaperCitations), dat)
text(1.5, 5.5, "t = 0.68, P = 0.000004")

# Totals across all repositories
sum(dat$Views, na.rm = TRUE)
sum(dat$Downloads, na.rm = TRUE)

