Tuesday, February 11, 2020

#IntegrityAndTrust 5. With Data Editors, Everyone Wins

Maintaining Trust AND Data Integrity - a forum for discussion. INFO HERE.

Trust is everything. If we don’t trust each other, then science ceases to be a creative and fun collaborative endeavor – which is what makes it so wonderful. For me, it all starts and ends with trust, from which the rest follows. Thus, I will NOT now interrogate the data of each student and collaborator for fraud. I will, of course, interrogate it for outliers and data entry mistakes and so on – but NOT for dishonesty. I want every student in my lab knowing they have my complete trust, and I want every collaborator working with me knowing likewise.

Ok, fine, that sounds good, but then how do we catch fraud from this position of trust? At the outset, I would suggest that outright fraud is exceptionally rare (I will later write a post on the many “flavors of scientific dishonesty”) – to the point that we can start from a default position of not worrying about it on a daily basis. Moreover, if the science matters, the fraud will eventually be caught by someone and corrected. After all, that is what is currently happening. The scientific record is being corrected and we will move forward from a better place. (I am not diminishing the harm done to those who have been impacted, nor the incredible work being done to correct it.) This self-correction has happened in the past in several well-publicized cases, and also in a few you haven’t heard about. Hence, although we definitely should always strive for improved recording, presenting, and archiving of data, I do not think that we should enter into any collaboration with the expectation that collaborators will be actively checking each other for fraud.

Moreover, any sort of data fraud checking among collaborators will not catch situations where collaborators are cheating together, and it won’t help for single-authored papers, and it won’t help in cases where only one author (e.g., a bioinformatician) has the requisite knowledge to generate and analyze the data. After all, if everyone had the ability to generate, analyze, and interpret the data equally well, then we wouldn’t need collaborations, would we? Instead, collaborators tend to be brought together for their complementary, rather than redundant, skills. I certainly won’t be able to effectively check the (for example) genomic data generated by specialist collaborators. Nor should all authors be spending their time on this – they should be focusing their efforts on areas where they have unique skills. Otherwise, why collaborate?

Yet we obviously need a better way to detect the cheaters. I can see one clear solution for detecting fraud before it hits the scientific record while not compromising an atmosphere of trust among collaborators. Consider this: the one place where trust does not currently exist is between journals and authors submitting papers. That is, reviewers/editors at journals don’t “trust” that the authors have chosen the right journal, that the experimental design is correct, that their analyses are appropriate, and that their interpretation is valid. Instead, reviewers/editors interrogate these aspects of each submitted paper to see if they trust the authors’ work and choice of journal.

Why not then have fraud detection enter in from the journal side of the process? For instance, many journals already run the text of submitted papers through a program that checks for plagiarism. Indeed, I was editor for a paper where plagiarism by a junior lead author was caught by just such a program and the senior author was able to fix it. Why not do the same for data? R packages are being developed to check for common approaches to fraud and can be used to interrogate data by officially sanctioned and recognized Data Editors. These Data Editors could be acknowledged just like regular editors on the back pages of journals and even on the papers themselves. The Data Editors can put this role on their CVs and be recognized for this contribution. I expect that many scientists – especially those with coding skills and a passion for data integrity – would jump at the opportunity to be an official Data Editor at a journal.
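
To give a flavor of what an automated data check might look like, here is a minimal sketch in Python. It is not taken from any of the R packages mentioned above; the function name `find_repeated_runs` and the example measurements are hypothetical. It flags blocks of consecutive values that appear more than once in a data column, which is one pattern a Data Editor might want to ask about.

```python
from collections import defaultdict


def find_repeated_runs(values, run_length=3):
    """Return {run: [start indices]} for every run of `run_length`
    consecutive values that occurs at more than one position."""
    positions = defaultdict(list)
    for start in range(len(values) - run_length + 1):
        run = tuple(values[start:start + run_length])
        positions[run].append(start)
    # Keep only runs seen at two or more positions.
    return {run: starts for run, starts in positions.items() if len(starts) > 1}


# Hypothetical measurements: the block (4.2, 5.1, 3.9) appears twice.
measurements = [4.2, 5.1, 3.9, 6.0, 2.7, 4.2, 5.1, 3.9, 7.3]
repeats = find_repeated_runs(measurements, run_length=3)
for run, starts in repeats.items():
    print(f"run {run} starts at rows {starts}")
```

A flag from a check like this is only a prompt for a question. Repeated blocks can arise honestly, for example in low-precision or count data, so the output would still need a human editor's judgment.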

Yes, I hear you saying “That would be a ton of work” – and so here is a suggestion to minimize unnecessary effort. Specifically, the Data Editors kick in only when the paper is already accepted. This would avoid duplication of effort when the same paper is rejected from one journal and then serially submitted to other journals. I suppose a few instances would be detected where the paper passed normal peer review and then was rejected (or corrected) after being examined by a Data Editor – but I expect this would be rare. Also, I am not suggesting Data Editors should be checking code – only the raw data itself. Also, I think they should be EDITORS, not reviewers. That is, the journal would engage 10 or so Data Editors who would then “handle” the data for all accepted papers according to their expertise.

I hope that the scientific community will seriously consider this Data Editor suggestion because it seems to me by far the best (perhaps the only) way to maintain trust and complementarity among collaborators while also improving data integrity. I think also that it would be an opportunity for data enthusiasts, programmers, and coders to get more recognition for their skills and efforts. Everyone wins.

Monday, February 10, 2020

#IntegrityAndTrust 4. Building environments that promote data integrity


By Grant E. Haines with contributions from Charles C.Y. Xu, David A.G.A Hunt, Marc-Olivier Beausoleil, and Winer Daniel Reyes


After reading the accounts by Dan Bolnick and Kate Laskowski of how the issues with Jonathan Pruitt-authored or coauthored papers were detected, our weekly lab group meeting on Thursday was devoted largely to discussion of how similar issues may be avoided, identified, and corrected in the future. Many of the conclusions we came to have been previously articulated in the other recent posts to this blog, those made by Joan Strassmann and Alexandra McInturf, and the numerous perspectives bouncing around social media.

As have others, we came to (mostly) agree on the following, somewhat unsatisfying, understandings: First, that as researchers become increasingly comfortable with the modern computational tools used to conduct scientific analyses, it will simultaneously become easier for them to falsify data and more difficult for others to detect falsified data. Second, that no matter how transparent the documentation of data and code becomes, there will always be some opportunities for those who are savvy enough to manipulate their data in dishonest ways. And third, that collaboration, like other parts of the scientific process, depends on trust. It is thus not advisable to treat every dataset received from a collaborator with heightened skepticism.

None of this is to say, however, that improvements to the way we do our work are not worth pursuing, and we identified several that could both build the integrity of our science and disincentivize the taking of unethical shortcuts. None of these are foolproof, but there are several practices that we think can either reduce the opportunities for data manipulation, facilitate its discovery, or reduce the impact of studies using manipulated data in the scientific record.

First, all data and code should be accessible by multiple people within a lab and among collaborators working on projects using it. This includes all versions of the data and code, either in multiple files or in platforms like GitHub and Labstep that track changes so that previous versions can be recovered. This practice would mean that, unless data is manipulated as it is taken, researchers can verify that the original data has not been unethically altered[1].
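
As one illustration of how such verification could work in practice, the sketch below records a SHA-256 checksum for every file in a data folder and later reports which files differ from that record. This is a minimal sketch, not a replacement for version control: the function names (`file_sha256`, `build_manifest`, `changed_files`) are hypothetical, and it assumes the manifest itself is kept somewhere the whole group can see.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    root = Path(data_dir)
    return {
        str(path.relative_to(root)): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir, manifest):
    """List files that were added, removed, or altered since the
    manifest was recorded."""
    current = build_manifest(data_dir)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )
```

A lab could record the manifest when the raw data are first entered and rerun `changed_files` before analysis or submission. As noted above, this cannot detect data that were manipulated as they were taken.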

Second, when writing a manuscript or protocol that is shared with other labs, methods should be written clearly and completely enough (even in supplementary materials if journal word limits are insufficient) that someone attempting to replicate the study can easily understand them. We have previously heard the perspective of some scientists that this is not the purpose of a methods section, and that it should instead be used more as a rhetorical tool to convince readers that the methods used were appropriate for the study and you know what you’re doing. Because the robustness of scientific ideas is built on multiple lines of evidence, we reject this view in favor of one that facilitates the generation of these multiple lines of evidence, even if in different study systems. This should dilute the impact on the literature of studies that appear to show improbably large effects, and, through replication, enable tests of the claims made in studies which have had doubt cast upon them by accusations of fraud.

Third, authors should include tests that produce statistically insignificant results in supplementary materials. Because of word limits provided by journals, scientists often trim some of the tests they conducted from papers. This practice can facilitate the creation of a coherent narrative around the results, but deprives the scientific record of useful information. This point goes hand-in-hand with the previous recommendation because it makes it more likely that results not supporting strong effects presented by studies with fraudulent data will be available in the literature. It also has the happy byproduct of reducing the frequency with which researchers will pursue hypotheses that other labs have tested and found to be unsupported but not published.

Fourth, and perhaps most importantly, the senior members of a research group (PIs, post-docs, and senior PhD students) should cultivate an environment in which everyone knows the data management expectations for the lab and, I cannot stress this enough, feels comfortable asking questions. People might ask stupid questions. That’s good. Pretend like they aren’t. We all have different blind spots and areas of expertise, and everyone, especially those recording the bulk of the data in most labs (undergrads and more junior Masters and PhD students), sometimes needs help filling those blind spots in. The rest of us should do everything we can to help them do so. The alternative is an environment in which more junior researchers try to find or develop answers on their own, sometimes making serious mistakes in the process. 

A good data management system boils down to having clear and shared standards, both in general, and for specific projects or types of projects. These standards should encompass the “life-cycle” of various research projects: from when a project starts to when it finishes. This means recording all of the steps and communicating them within the research group. This becomes especially important in long-term studies, throughout the course of which a lab may experience significant turnover. If the protocols are not maintained and communicated with the rest of the lab, it becomes difficult to ensure that the data produced throughout these projects have been systematically collected and analyzed.

These principles can extend beyond individual labs to the influential voices in fields or subfields. The credibility of an entire body of research can benefit or suffer by the practices of these researchers. This influence can be leveraged by highlighting procedures to maintain data integrity in guest lectures or conference presentations to build a culture of careful data management across labs.

This whole series of events that has unfolded over the past few weeks is obviously quite troubling. But perhaps if the scientific process is susceptible to manipulation on this scale, the healthiest response for its practitioners to adopt is not to ignore its weaknesses, but to adapt by creating reliable procedures and research that address these weaknesses head on.




[1] Because of recent attempts by the U.S. EPA to prevent science relating to public health being considered in the development of new environmental regulations under the guise of data transparency, it is important to note that in order to maintain the confidentiality of subjects’ personal information, this practice may not be advisable or possible in all cases.


Friday, February 7, 2020

#IntegrityAndTrust 3. Collaborate and share: data analysis practices are key to science credibility and reproducibility

Jacob Brownscombe, PhD, Banting Postdoctoral Fellow, Dalhousie University, @sci_angler on Twitter
The ongoing #PruittData controversy has prompted many of us in the field of ecology to further consider how we can ensure the accuracy and credibility of our science at the individual, team, and community level. This is particularly important with ecological studies due to the nature of data collection, which involves diverse methods and approaches, often in remote field or laboratory experiments with varied team members and sizes. Further, ecological systems are inherently complex, stochastic, and rarely well-controlled. Ecological experimental design and data analysis therefore present a particular set of challenges and may require controlling for a range of factors, such as spatial or temporal autocorrelation, to minimize bias and uncertainty. All of the above are likely contributing factors in the reproducibility crisis in ecology, which is also an issue in many other fields, and has led to a push for more open science methods including standardized protocols for experimental design, data collection, analysis, and reporting.

There is potential for bias and error to creep in at many stages of the scientific process due to mistakes, inappropriate approaches, or malicious practices. In #IntegrityAndTrust Idea 2 Dr. Cooke argued that the potential for these errors is reduced when data collection occurs in teams, and further suggested this could be extended to data analysis as well. There is indeed massive potential for error and bias to develop in the process of analyzing ecological data. In addition to being inherently challenging, in many cases, datasets are analyzed by a single individual scientist. Many of these scientists are students, often conducting a specific analysis for the first time. The most popular analysis programs (e.g., R Statistical Programming Language, Python) involve computer coding, which enables a highly flexible and manual data processing and analytical process, but also creates potential for analytical misuse or error. It seems common practice to have all coauthors provide input to manuscript drafts, but less commonly do coauthors double check the data analysis workflow. Enabling this Wild West of data analysis are peer-reviewed journals, very few of which require submission of analytical materials (e.g., R scripts) with published manuscripts in the field of ecology. A growing number of journals require that manuscript-related data be made publicly available, not just in the supplementary material (often paywalled) or GitHub, but in a permanent online repository such as Zenodo. However, reviews indicate that to date, the materials provided by authors are often insufficient to enable reproducibility, which defeats the purpose.

On a personal level, I have historically been guilty of many of my aforementioned criticisms – analyzing data independently and failing to share associated data and analytical processes with my science products. This was in part a result of my ignorance of the importance of open data science practices, and in part due to fear that it would leave me open to greater criticisms of my work – still today I’m never sure when my models are ‘good enough’ to hang my hat on. I suspect many ecologists may share similar feelings. Yet, if there is truly an error in the work that makes it inaccurate, we all surely would want to know, especially in cases where the work has implications for the lives of people and animals.

Online systems such as GitHub are making it increasingly tractable to analyze datasets collaboratively and remotely, with tools for direct interfacing with programs like RStudio and options for version control to keep any individual collaborator from compromising the database or analytical workflow. This site provides useful resources for getting established working with R and GitHub (Chapter 4) and setting up collaborative projects (Chapter 9). Collaborators can work simultaneously on the same project, although this can cause merge conflicts that need to be sorted out and are likely best avoided through an organized workflow. The primary author/data analyst could also choose to ask analysis-savvy coauthors to access the project’s established GitHub repository to double check their work at any time, certainly at least once prior to dissemination. Repositories can be kept private to individuals or teams and can be made public at any time by those with this administrative privilege, which is handy for sharing associated data and analyses when the work is disseminated. GitHub currently restricts individual file sizes to 100 MB, so if sharing large workspaces (files with all generated objects such as models, figures) is of interest, I personally use Dropbox.
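
For anyone bumping into that file size limit, a short script can list the files in a project folder that are too large before they are committed. The sketch below is one way to do it, under stated assumptions: the function name `oversized_files` is hypothetical, and the default threshold reads the "100 MB" above as 100 × 1024 × 1024 bytes, which can be changed through the `limit_bytes` argument.

```python
from pathlib import Path

# Assumption: the "100 MB" limit is read as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(project_dir, limit_bytes=LIMIT_BYTES):
    """Return (relative path, size in bytes) for each file larger than
    limit_bytes, skipping Git's own .git folder."""
    root = Path(project_dir)
    found = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file() and path.stat().st_size > limit_bytes:
            found.append((str(path.relative_to(root)), path.stat().st_size))
    return sorted(found)
```

Calling `oversized_files(".")` from the project folder returns the files that would need to be shared another way, such as the large workspace files mentioned above.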

Data analysis is an essential piece of the puzzle for achieving open science objectives and the tools are available to share and collaborate on analytical processes. Both top-down and bottom-up approaches are likely needed to push the envelope on this. Individuals and teams of scientists will benefit from more open practices by increasing the accuracy, credibility, and scope of their science and improving uptake (and likely citations) by making the science accessible and reproducible. Science journals also have an important role to play in mandating standardized practices that consider reproducibility – this includes requiring both reproducible data and the analytical process that is integral to reproducibility. These practices also have implications for database development and generation of big data. As we continue to generate larger volumes of ecological information, the potential for applications of big data is growing. Yet, to date, top-down approaches to building these datasets have not been as successful as we had hoped, while there are a growing number of bottom-up examples of successful big data (at least big for our field) generation and analysis with more grass-roots approaches. The benefits of open science therefore extend beyond credibility and reproducibility of single studies, to building larger, more powerful datasets from which we can generate insights into ecological systems and improve conservation efforts.

Sunday, February 2, 2020

#IntegrityAndTrust 2. Teamwork in research ethics and integrity

#IntegrityAndTrust 2. The role of teamwork in research ethics and integrity

Steven J. Cooke, Institute of Environmental and Interdisciplinary Science, Carleton University, Ottawa, Canada

About ten years ago I created an “expectations” document that I share and discuss with all new team members.  It consists of twelve topics and usually takes me about two hours to work my way through.  The very first topic on the list is ethics and the document states “Lab members will adhere to the highest level of standards related to ethical practices of data collection, analysis and reporting”.  Despite it being the FIRST thing I cover, I usually spend a grand total of 30 seconds as I fumble around the idea that they shouldn’t commit fraud, trying to do so without making them sound like criminals.  I think one of the reasons that I move past that point so quickly is that it seems so obvious – don’t fudge your data or do “unethical” things.  Upon reflection over the last few days I have concluded that I must do better and engage in a more meaningful discussion about ethics and integrity.  However, I am still struggling with what that means.  I look forward to learning from our community and formalizing how I approach issues of research ethics and integrity not just upon entry into our lab but as an ongoing conversation.

As I reflect on recent events, I am left wondering how this could happen.  A common thread is that data were collected alone.  This concept is somewhat alien to me and has been throughout my training and career.  I can’t think of a SINGLE empirically-based paper among those that I have authored or that has been done by my team members for which the data were collected by a single individual without help from others.  To some this may seem odd, but I consider my type of research to be a team sport.  As a fish ecologist (who incorporates behavioural and physiological concepts and tools), I need to catch fish, move them about, handle them, care for them, maintain environmental conditions, process samples, record data, etc – nothing that can be handled by one person without fish welfare or data quality being compromised.  Beyond that, although I am a researcher, I am also (and perhaps foremost) an educator.  And the best way for students to learn is by exposing them to diverse experiences which means having them help one another.  So – in our lab – collaborative teamwork during data collection is essential and the norm.

So, is having students (and post docs and/or technicians) work together to collect data sufficient to protect against ethical breaches? Unlikely… yet it goes a long way to mitigating some aspects of this potential problem. Small (or large) teams working together means more eyes (and hearts and minds) thinking about how research is being done. This creates opportunities for self-correction as different values, worldviews and personal ethics intersect. It also enhances learning opportunities especially if there are opportunities to circle back with mentors. I try to visit team members (especially on the front end of studies) or connect with them via phone or email to provide opportunity for adjustments beyond the “plans” that may have appeared in a proposal. When I do so I don’t just communicate with the “lead” student. Rather, I want to interact with all of those working on a given project. This provides an opportunity to reinforce the culture of intellectual sharing. It is not uncommon in our lab for well-designed plans to be reworked based on input from assistants during the early phases of a project – adjustments in what we do and how we do it. This level of ownership means that there is collective impetus to get it right (and I don’t mean finding statistically significant results). Creating an environment where all voices matter (regardless of status) has the potential to reduce bias and improve the quality of the science. As a community we already do this with respect to workplace safety where safety is “everybody’s business”. Why can’t this become the norm where “research ethics” and study quality are everybody’s business?

There is much to think about ranging from the role of open science and pre-registration of scientific studies to reframing our training programs to build an ethical culture. Yet, an obvious practice that I will be continuing is one where students (and others) work in teams during the data collection phase. There is a need to empower individuals to challenge their peers but do so in a constructive and collaborative manner. By working together in data collection there is opportunity for continual interaction and engagement that can only benefit science and enable the development of ethical researchers. There are also ways in which this can be extended to the data analysis phase (e.g., using GitHub) where team members work collaboratively on analysis and take time to double (and triple) check their code and analysis (I am not the best one to comment on this approach but we are actively pursuing how to do this in our lab).

Saturday, February 1, 2020

#IntegrityAndTrust 1. Publish statements on how data integrity was assured


By Pedro R. Peres-Neto

It is great to see how much effort is being put into scrutinizing these data and even understanding how the reported issues have happened. Perhaps this was a one-time incident, though data issues of less severe proportions are likely to occur from time to time. Perhaps we should take this opportunity to start thinking towards a more robust system to reduce future potential issues. Obviously, the number of papers is increasing dramatically, so data issues (from minor to huge) are likely to increase as a result of data management errors, lack of scientific rigor, or even data fraud. Collating data from different sources beyond single studies, for example, has become common practice in many fields. Simple or systematic ways to improve data integrity and scrutiny may be required by different fields and data sizes.

There are strong signs that the number of co-authors per paper is increasing through time in many fields, including ecology. Perhaps there can be some support to adhere to a publication policy in which co-authors explain which steps they took to assure data integrity. These could be in the form of published statements. One way could be to have more than one co-author (perhaps not all) spend a meaningful amount of time scrutinizing the data and analyses to assure that (we) co-authors did the best to reduce potential data issues. By no means am I saying that co-authors of retracted papers are to blame. We know all too well that issues like that can happen. What I’m saying is that having more than one co-author involved from the beginning is one way to increase data integrity control. There can certainly be other solutions and that’s why I’m suggesting that authors publish statements on how they did their best to assure data integrity.

It is time for a discussion on potential solutions to hold and preserve data integrity to the highest standards possible. Solutions (in my mind) are needed to keep the public’s continuing support in trusting and using research outcomes, and the support of the taxpayers who support much of our research.

Maintaining Trust AND Data Integrity - a forum

So many discussions are happening right now about ways to improve data integrity in published papers, while also maintaining trust in collaborators, supervisors, and students.

I am sure that many of these ideas will be expressed in other venues. However, we here wish to make a space available for constructive suggestions by those who wish to comment in depth but do not have another forum to do so.

These ideas will play out as a series of guest posts - as opposed to the back and forth in comment sections. We are striving for well-considered and deliberate ideas, rather than knee-jerk comments, criticisms, or the like.

These posts will not be about the Pruitt data debate specifically - but rather more general comments on how to improve the process overall - regardless of what happens with those papers.

These posts will be moderated by myself (Andrew Hendry). Please send me an email (andrew.hendry@mcgill.ca) if you would like to write one.


#IntegrityAndTrust 1. Publish statements on how data integrity was assured. 
By Pedro Peres-Neto.

#IntegrityAndTrust 2. The role of teamwork in research ethics and integrity.
By Steven Cooke.

#IntegrityAndTrust 3. Collaborate and share: data analysis practices are key to science credibility and reproducibility
By Jacob Brownscombe.

#IntegrityAndTrust 4. Building environments that promote data integrity
By Grant Haines and others.

#IntegrityAndTrust 5. With Data Editors, everyone wins
By Andrew Hendry.



COMPILING SOME EXTERNAL POSTS

Trust Your Collaborators
By Joan Strassmann

Publish and Perish: A Graduate Student Perspective
By Alexandra McInturf


Data Dilemmas in Animal Behavior
By TheFreePhenotype

To the early career scientists affected by #PruittData
By Wayne Maddison

Social Spider Research is Here to Stay
By Leticia Avilés

On Luck, Success, and Becoming a Professor
By Aleeza Gerstein

-------------------------------------------------------

Although this forum will not be specifically about the Pruitt data or controversy, I should make clear that:

1. I have never evaluated Jonathan or his work in any context. That is, I never reviewed any of his proposals or papers as reviewer, editor, or panel member. Nor have I provided assessment letters for his job applications or tenure or promotion. I have not been involved in any of the assessments of the data that is currently being questioned - although I have certainly heard about these issues as they unfolded. I have not collaborated with Jonathan, nor did I have any plans in place to do so.

2. I have met Jonathan at several meetings and when I visited his university for a seminar. I enjoyed meeting and talking with him, and he contributed a joke picture to #PeopleWhoFellAsleepReadingMyBook that I posted on twitter. Since the controversy erupted into public consciousness, I have emailed Jonathan offering him this blog as a venue should he wish to respond to the criticisms - and then later to let him know that he might want to seek legal advice before doing so.
