Monday, February 10, 2020

#IntegrityAndTrust 4. Building environments that promote data integrity


By Grant E. Haines with contributions from Charles C.Y. Xu, David A.G.A. Hunt, Marc-Olivier Beausoleil, and Winer Daniel Reyes


After reading the accounts by Dan Bolnick and Kate Laskowski of how the issues with Jonathan Pruitt-authored or coauthored papers were detected, our weekly lab group meeting on Thursday was devoted largely to discussion of how similar issues may be avoided, identified, and corrected in the future. Many of the conclusions we came to have been previously articulated in other recent posts to this blog, in those by Joan Strassmann and Alexandra McInturf, and in the numerous perspectives bouncing around social media.

Like others, we came to (mostly) agree on the following, somewhat unsatisfying, understandings. First, as researchers become increasingly comfortable with the modern computational tools used to conduct scientific analyses, it will become both easier for them to falsify data and harder for others to detect the falsification. Second, no matter how transparent the documentation of data and code becomes, a sufficiently savvy researcher will always have some opportunity to manipulate data in dishonest ways. And third, collaboration, like other parts of the scientific process, depends on trust; it is thus not advisable to treat every dataset received from a collaborator with heightened skepticism.

None of this is to say, however, that improvements to the way we do our work are not worth pursuing. We identified several practices that could both strengthen the integrity of our science and disincentivize unethical shortcuts. None of them is foolproof, but each can reduce the opportunities for data manipulation, facilitate its discovery, or limit the impact of studies built on manipulated data in the scientific record.

First, all data and code should be accessible to multiple people within a lab and among collaborators working on projects that use them. This includes all versions of the data and code, kept either in multiple files or on platforms like GitHub and Labstep that track changes so that previous versions can be recovered. This practice means that, unless data is manipulated as it is collected, researchers can verify that the original data has not been unethically altered[1].
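To make this idea concrete, here is a minimal sketch of one way original data files could be made verifiable: recording a cryptographic checksum for each raw file when it is first archived, so that any later alteration changes the hash and can be flagged. The directory layout, file names, and manifest name below are invented for illustration, not a prescription.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: str, manifest: str = "checksums.json") -> None:
    """Record a checksum for every raw data file at archiving time."""
    hashes = {str(p): sha256_of(p) for p in sorted(Path(data_dir).glob("*.csv"))}
    Path(manifest).write_text(json.dumps(hashes, indent=2))

def verify_manifest(manifest: str = "checksums.json") -> list[str]:
    """Return the files whose contents no longer match the recorded checksums."""
    recorded = json.loads(Path(manifest).read_text())
    return [f for f, h in recorded.items() if sha256_of(Path(f)) != h]

# Example: run write_manifest("raw_data/") when data are first archived,
# then verify_manifest() before analysis to confirm nothing has changed.
```

A version-control platform like GitHub accomplishes the same thing implicitly, since every commit is itself hashed; the point of the sketch is simply that "the original data has not been altered" can be a checkable claim rather than a matter of trust.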

Second, when writing a manuscript or protocol that is shared with other labs, methods should be written clearly and completely enough (in supplementary materials if journal word limits are insufficient) that someone attempting to replicate the study can easily understand them. We have previously heard some scientists argue that this is not the purpose of a methods section, and that it should instead be used as a rhetorical tool to convince readers that the methods were appropriate for the study and that you know what you're doing. Because the robustness of scientific ideas is built on multiple lines of evidence, we reject this view in favor of one that facilitates the generation of those multiple lines of evidence, even if in different study systems. This should dilute the impact on the literature of studies that appear to show improbably large effects and, through replication, enable tests of the claims made in studies on which accusations of fraud have cast doubt.

Third, authors should include tests that produce statistically non-significant results, at least in supplementary materials. Because of journal word limits, scientists often trim some of the tests they conducted from their papers. This practice can help build a coherent narrative around the results, but it deprives the scientific record of useful information. This point goes hand-in-hand with the previous recommendation because it makes it more likely that results contradicting the strong effects reported in studies with fraudulent data will be available in the literature. It also has the happy byproduct of reducing the frequency with which researchers pursue hypotheses that other labs have already tested, found unsupported, and never published.

Fourth, and perhaps most importantly, the senior members of a research group (PIs, postdocs, and senior PhD students) should cultivate an environment in which everyone knows the data management expectations for the lab and, I cannot stress this enough, feels comfortable asking questions. People might ask stupid questions. That's good. Pretend they aren't. We all have different blind spots and areas of expertise, and everyone, especially those recording the bulk of the data in most labs (undergraduates and more junior Master's and PhD students), sometimes needs help filling in those blind spots. The rest of us should do everything we can to help them do so. The alternative is an environment in which junior researchers try to find or develop answers on their own, sometimes making serious mistakes in the process.

A good data management system boils down to having clear, shared standards, both in general and for specific projects or types of projects. These standards should encompass the "life cycle" of a research project, from start to finish, with every step recorded and communicated within the research group. This becomes especially important in long-term studies, over the course of which a lab may experience significant turnover. If the protocols are not maintained and communicated to the rest of the lab, it becomes difficult to ensure that the data produced throughout these projects has been systematically collected and analyzed.
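One way such standards can survive lab turnover is to encode them as a validation script that anyone in the group can run on newly entered data, so "systematically collected" becomes something the lab can check rather than assume. A minimal sketch, using a hypothetical long-term dataset whose column names and plausible ranges are invented here purely for illustration:

```python
import csv

# Hypothetical shared expectations for one long-term dataset; the column
# names and the mass range below are invented for this example.
EXPECTED_COLUMNS = {"site", "date", "individual_id", "body_mass_g"}
MASS_RANGE_G = (0.1, 500.0)

def validate(path: str) -> list[str]:
    """Check a CSV of field data against the lab's documented expectations.

    Returns a list of human-readable problems (empty if the file passes).
    """
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
            return problems
        for i, row in enumerate(reader, start=2):  # row 1 is the header
            try:
                mass = float(row["body_mass_g"])
            except ValueError:
                problems.append(f"row {i}: non-numeric body_mass_g")
                continue
            if not MASS_RANGE_G[0] <= mass <= MASS_RANGE_G[1]:
                problems.append(f"row {i}: body_mass_g {mass} out of range")
    return problems
```

A script like this doubles as documentation: a new student who inherits the project can read the expected columns and ranges directly from the code, even if the person who wrote the original protocol has long since left the lab.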

These principles can extend beyond individual labs to the influential voices in a field or subfield. The credibility of an entire body of research can benefit or suffer from the practices of these researchers, and their influence can be leveraged by highlighting procedures that maintain data integrity in guest lectures and conference presentations, building a culture of careful data management across labs.

The whole series of events that has unfolded over the past few weeks is obviously quite troubling. But if the scientific process is susceptible to manipulation on this scale, perhaps the healthiest response for its practitioners is not to ignore its weaknesses, but to adapt by creating reliable procedures and research practices that address those weaknesses head-on.




[1] Because of recent attempts by the U.S. EPA to prevent science relating to public health from being considered in the development of new environmental regulations under the guise of data transparency, it is important to note that, in order to maintain the confidentiality of subjects' personal information, this practice may not be advisable or possible in all cases.


1 comment:

  1. On Twitter ...
    "Jonathan Pruitt
    @Agelenopsis
    Behavioral ecologist, avid gaymer, and fast talker"

    "Fast talker" should have warned this scientific community.

