Maintaining Trust AND Data Integrity - a forum for discussion.
#IntegrityAndTrust 3. Collaborate and share: data analysis practices are key to science credibility and reproducibility
Jacob Brownscombe, PhD, Banting Postdoctoral Fellow, Dalhousie University, @sci_angler on Twitter
The ongoing #PruittData controversy has prompted many of us in the field of ecology to further consider how we can ensure the accuracy and credibility of our science at the individual, team, and community level. This is particularly important with ecological studies due to the nature of data collection, which involves diverse methods and approaches, often in remote field or laboratory experiments with varied team members and sizes. Further, ecological systems are inherently complex, stochastic, and rarely well-controlled. Ecological experimental design and data analysis therefore present a particular set of challenges and may require controlling for a range of factors, such as spatial or temporal autocorrelation, to minimize bias and uncertainty. All of the above are likely contributing factors in the reproducibility crisis in ecology, which is also an issue in many other fields, and has led to a push for more open science methods including standardized protocols for experimental design, data collection, analysis, and reporting.
There is potential for bias and error to creep in at many stages of the scientific process due to mistakes, inappropriate approaches, or malicious practices. In #IntegrityAndTrust Idea 2 Dr. Cooke argued that the potential for these errors is reduced when data collection occurs in teams, and further suggested this could be extended to data analysis as well. There is indeed massive potential for error and bias to develop in the process of analyzing ecological data. In addition to the analysis being inherently challenging, datasets are in many cases analyzed by a single scientist. Many of these scientists are students, often conducting a specific analysis for the first time. The most popular analysis tools (e.g., the R statistical programming language, Python) involve computer coding, which enables a highly flexible and manual data processing and analytical process, but also creates potential for analytical misuse or error. It seems common practice to have all coauthors provide input to manuscript drafts, but coauthors less commonly double check the data analysis workflow. Enabling this Wild West of data analysis are peer-reviewed journals, very few of which require submission of analytical materials (e.g., R scripts) with published manuscripts in the field of ecology. A growing number of journals require that manuscript-related data be made publicly available, not just in the supplementary material (often paywalled) or on GitHub, but in a permanent online repository such as Zenodo. However, reviews indicate that to date, the materials provided by authors are often insufficient to enable reproducibility, which defeats the purpose.
On a personal level, I have historically been guilty of many of the practices I criticize above: analyzing data independently and failing to share associated data and analytical processes with my science products. This was in part a result of my ignorance of the importance of open data science practices, and in part due to fear that it would leave me open to greater criticisms of my work. Still today I’m never sure when my models are ‘good enough’ to hang my hat on. I suspect many ecologists may share similar feelings. Yet, if there is truly an error in the work that makes it inaccurate, we all surely would want to know, especially in cases where the work has implications for the lives of people and animals.
Online systems such as GitHub are making it increasingly tractable to analyze datasets collaboratively and remotely, with tools for direct interfacing with programs like RStudio and options for version control to keep any individual collaborator from compromising the database or analytical workflow. This site provides useful resources for getting established working with R and GitHub (Chapter 4) and setting up collaborative projects (Chapter 9). Collaborators can work simultaneously on the same project, although this can cause merge conflicts that need to be sorted out and are likely best avoided through an organized workflow. The primary author/data analyst could also choose to ask analysis-savvy coauthors to access the project’s established GitHub repository to double check their work at any time, certainly at least once prior to dissemination. Repositories can be kept private to individuals or teams and can be made public at any time by those with this administrative privilege, which is handy for sharing associated data and analyses when the work is disseminated. GitHub currently restricts individual file sizes to 100 MB, so if sharing large workspaces (files with all generated objects such as models and figures) is of interest, I personally use Dropbox.
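As a small illustration of the file size point, the sketch below lists files in a project folder that would exceed the 100 MB limit mentioned above, so they can be routed elsewhere before a push. It is only a sketch under stated assumptions: the function name `find_oversized_files` is hypothetical, 100 MB is treated as 100 × 1024 × 1024 bytes, and the `.git` folder is skipped because it holds version control internals rather than project files.

```python
from pathlib import Path

# Assumption: the 100 MB per-file limit cited in the text, taken as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(project_dir, limit_bytes=LIMIT_BYTES):
    """Return (relative path, size in bytes) for files larger than limit_bytes, largest first.

    Hypothetical helper for illustration; not part of any GitHub or R tooling.
    """
    root = Path(project_dir)
    oversized = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip version control internals; only project files matter here.
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                oversized.append((relative.as_posix(), size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Check the current working directory, e.g. the root of an analysis project.
    for rel_path, size in find_oversized_files("."):
        print(f"{rel_path}: {size / 1024**2:.1f} MB")
```

Run from the project root before committing, this would flag items such as a large saved workspace, which could then be shared through another service while the scripts and smaller data files stay in the repository.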
Data analysis is an essential piece of the puzzle for achieving open science objectives, and the tools are available to share and collaborate on analytical processes. Both top-down and bottom-up approaches are likely needed to push the envelope on this. Individuals and teams of scientists will benefit from more open practices by increasing the accuracy, credibility, and scope of their science and improving uptake (and likely citations) by making the science accessible and reproducible. Science journals also have an important role to play in mandating standardized practices that consider reproducibility; this includes requiring both reproducible data and the analytical process that is integral to reproducibility. These practices also have implications for database development and generation of big data. As we continue to generate larger volumes of ecological information, the potential for applications of big data is growing. Yet, to date top-down approaches to building these datasets have not been as successful as we had hoped, while there are a growing number of bottom-up examples of successful big data (at least big for our field) generation and analysis with more grass-roots approaches. The benefits of open science therefore extend beyond credibility and reproducibility of single studies, to building larger, more powerful datasets from which we can generate insights into ecological systems and improve conservation efforts.