Wednesday, December 19, 2018

(Mis)adventures with P Values

In our multi-lab paper reading group at McGill, we have been reading "classics" - or various related engaging papers. Our most recent one was The Cult of Statistical Significance by Ziliak and McCloskey, in which the authors argue that adherence to inference from P values - as opposed to effect size - is stupid and damaging. During the discussion, two profs (myself and Ehab Abouheif) each related a story about our adventures with P values. Ehab's story was very interesting and so I asked if he could write it down. Here it is - followed by mine:

Ehab working on Rensch's Rule at Concordia.

By Ehab Abouheif

Today, I presented a paper by Ziliak and McCloskey called “The Cult of Statistical Significance,” where they argued that the Fisherian 5% statistical cutoff was misleading researchers and leading to fatal outcomes. I was first exposed to Ziliak and McCloskey’s work during my last sabbatical by the late Werner Callebaut, then Director of the KLI Institute, and found it fascinating. It brought me right back to my days as an MSc student, when I struggled with the meaning of statistical significance in biology.

As an MSc student in Daphne Fairbairn’s lab at Concordia University (Montreal) during the Phylogenetic Comparative Method revolution in the 1990s, I was hard at work collecting comparative data and getting my hands on as many phylogenetic trees as I could from the literature. I spent hours scouring journals in the Blacker-Wood Library at McGill University for phylogenies of birds, spiders, primates, water striders, reptiles, snakes – you name it, I was looking for it. Back then, phylogenies were hard to come by, but evolutionary biologists were pushing hard to incorporate any kind of phylogenetic information, in part or whole, into their comparative analyses. After two years, I published my first paper ever using the comparative method (Abouheif and Fairbairn, 1997, A comparative analysis of allometry for sexual size dimorphism. American Naturalist 149: 540-562). We coined the term Rensch’s Rule, and today it is one of my most highly cited papers.

This was my first taste of statistical inference and evolutionary biology up close and personal. This experience left me excited about evolutionary biology and ready to jump into a PhD, but it also left me with many questions about the comparative method, one could even say doubts, about the big assumptions we were making to account for evolutionary history while testing adaptive hypotheses. It felt like we were introducing a great deal of statistical uncertainty in the process of trying to achieve statistical precision.

In 1996, Emília P. Martins published a method of accounting for evolutionary history when the phylogeny is not known. In other words, she devised a way to account for phylogeny in the group we are working with even if the phylogeny for that group is unknown. Emília’s method randomly generated phylogenies, analyzed the data on each random tree, took the average across all random trees as the parameter estimate, and gave confidence intervals around this average estimate. I thought this was brilliant, and I had always admired Emília's pioneering work on the Phylogenetic Comparative Method. I was really curious to see how this method would perform on my MSc data. Would I come to the same conclusions about the patterns and significance of Rensch’s rule if I assumed no knowledge about the phylogeny? This question consumed me during the early days of my PhD, and so I started reanalyzing all of my MSc data using Emília’s method. It brought flashbacks of the Blacker-Wood Library at McGill and all the dust I had to breathe in for the sake of science.

Ehab and Daphne Fairbairn
Several months later, the answer finally came. The average of all random trees was … almost the same as not using a tree at all! Somehow the trees were ‘cancelling each other out’ and their average gave a similar estimate to not using a tree at all. The difference was in the confidence intervals, which had been inflated dramatically because they were taking into account phylogenetic uncertainty. For example, for water striders, a group where the statistical power for estimating the slope and its statistical difference from a slope of 1 was very high (0.998), we had estimated a slope of 0.859 with 95% confidence intervals of 0.805-0.916. Using random trees, the slope was 0.946 and the 95% confidence intervals were between -12.6 and 14.4! I published these results in Evolution in 1997, and needless to say, Emília was not very happy. I have an enormous amount of respect for Emília’s work, but this was about something larger. In an attempt to achieve greater precision and reduce the statistical errors that come from not accounting for phylogeny, we introduce another, perhaps larger, error: not recognizing real patterns that actually exist in nature!

Ever since, I have kept with me a healthy skepticism about statistical significance testing and the Fisherian 5% cutoff. Patterns should at least be weighted equally with the statistical significance of that pattern, but trying to convince my own students has been hard, and the journals even harder! Things are starting to change though, and so I am hopeful for the future. Thanks Andrew Hendry for getting me to write this, it brought me back a good number of years, and made me go back and read my own paper. I was pleasantly surprised, and realized how far back our thinking goes!

By Andrew Hendry

I first started to work on threespine stickleback during my postdoc at UBC in 1999-1999. As a diligent postdoc working on a new system, I read a bunch of papers about stickleback - especially those by the folks at UBC with whom I would be interacting. One of those was Dolph Schluter's 1993 Science paper Experimental Evidence that Competition Promotes Divergence in Adaptive Radiation. This paper included the now-famous figure, oft-republished in textbooks and reviews, showing that when a competitor was added, the slope of the relationship between traits and fitness (growth) changed in predicted ways.

By chance I happened to see the commentaries written about this paper in Science. Some were based on interesting conceptual questions about whether the manipulation was the right way to do it, but others were more narrowly focused on the statistics, saying, in essence, your P value is wrong:

To which Schluter replied: "ooops ..."

My point here isn't to criticize Dolph - he is an exceptionally insightful, rigorous (including statistically), and helpful scientist. Instead, it is to point out the particular attention paid to P values and whether they do or do not exceed some alpha level. If Dolph had calculated his P value correctly, he almost certainly would not have submitted his paper to Science - and, if he had, saying "My one-tailed P value is 0.11" - it never would have been published. Thus, a simple mistake in calculating a P value led to the publication of a paper that has proven extremely influential - it has been cited nearly 500 times.

Closing notes:

P values provide little insight into, well, anything. Consider these illustrative arguments:

1. Imagine you are comparing two populations to see if their mean values (for anything) are significantly different. The truth is that no two populations ever have the same mean trait value in reality - the population PARAMETER is never identical. Hence, the test for significance is pointless - you already know they differ. All a statistical test does is reveal whether your sample size is large enough to confirm what you already know to be true - the means differ.

2. Most statistical tests involve a test for whether the assumptions of that test are met. The reality is that they NEVER are. That is, no distribution of residuals is ever normal (or any other modeled distribution) and homoscedastic. If you fail to reject the null hypothesis of normality and homoscedasticity, it simply reflects low power of your test to reveal reality.

3. Ever see a paper that calculates P values for a parameter in a simulation model? As the parameter is always (by the mechanics of a model) having an effect, a P value simply reflects how many simulations you have run. Want your parameter to be significant - simply run more replicates of your simulations until it becomes so.
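The sample-size dependence behind these arguments can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two populations are hypothetical normal distributions with a tiny true difference in means (`true_diff=0.05`, SD 1), the function names are made up for this example, and the P value uses a large-sample z approximation rather than any particular published test.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def two_sided_p(a, b):
    """Two-sided P value for a difference in means (large-sample z approximation)."""
    se = math.sqrt(sample_variance(a) / len(a) + sample_variance(b) / len(b))
    z = (mean(b) - mean(a)) / se
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n, true_diff=0.05, seed=1):
    """Draw n individuals from each of two hypothetical populations.

    Returns (estimated difference in means, two-sided P value).
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(true_diff, 1.0) for _ in range(n)]
    return mean(b) - mean(a), two_sided_p(a, b)


# Same tiny true effect every time; only the sample size changes.
for n in (50, 500, 5000, 100000):
    diff, p = simulate(n)
    print(f"n = {n:>6}  estimated difference = {diff:+.3f}  P = {p:.3g}")
```

The true difference never changes across rows, yet with enough individuals the P value will drop below any alpha level you care to name, while the estimated difference simply settles near 0.05. Setting `true_diff=0.0` gives the contrasting case: with no true effect, the P value does not trend downward as n grows.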


What matters is effect size (difference between means, variance explained, etc.), your certainty in that estimate (credibility intervals, AIC weights, sampling distributions, etc.), and the extent to which uncertainty is due to sampling error or true biological variation (closely related to sample size). 
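As one way of reporting an effect size together with the certainty in that estimate, here is a minimal sketch of a percentile bootstrap interval for a difference between two means. The data are simulated and hypothetical, the function name `bootstrap_diff_ci` is invented for this example, and the percentile bootstrap is just one of several reasonable choices alongside the intervals mentioned above.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def bootstrap_diff_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a).

    Returns (point estimate, lower bound, upper bound).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_b) - mean(resampled_a))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return mean(b) - mean(a), lower, upper


# Hypothetical trait measurements from two populations (simulated, not real data).
data_rng = random.Random(42)
pop_a = [data_rng.gauss(10.0, 2.0) for _ in range(50)]
pop_b = [data_rng.gauss(11.0, 2.0) for _ in range(50)]

estimate, lower, upper = bootstrap_diff_ci(pop_a, pop_b)
print(f"difference in means = {estimate:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

The output is a statement about how big the difference is and how well it is known, which is the information a bare P value leaves out.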

Of course, oceans of ink have already been spilled on this, so I will stop now and suggest just a few additional resources and tidbits:

1. My post on How To Do Statistics

2. Steve Heard's Defence of the P value

3. Dan Bolnick's coding error that made a P value significant when it wasn't, which then led to a paper retraction. (The picture might show the moment he realized his P value was wrong.)

1 comment:

  1. I agree with almost all of this. However, I disagree with this:

    "Ever see a paper that calculates P values for a parameter in a simulation model? As the parameter is always (by the mechanics of a model) having an effect, a P value simply reflects how many simulations you have run. Want your parameter to be significant - simply run more replicates of your simulations until it becomes so."

    It can be the case that one does not know in advance whether a given parameter will have an effect in a simulation model. If you're writing a model just to demonstrate an effect that you already know exists, then yes, your point is valid. But if you're writing a model to find out whether a hypothesized effect exists or not, and the model is not written in such a way as to guarantee that the effect exists, then I don't see anything wrong with using a p-value to see whether or not the effect does, in fact, exist – and increasing the number of replicates will not magically make a parameter that in fact has no effect look like it does have an effect. I think a better point to make with respect to simulations is: If a given parameter *does* have any effect at all, no matter how tiny, you will be able to bring out that effect as "significant" by cranking up the number of replicates, but that doesn't mean the effect is *important*. So after finding a significant effect, the next step should always be an assessment of the effect size.

