*In our multi-lab paper reading group at McGill, we have been reading "classics" - or various related engaging papers. Our most recent one was The Cult of Statistical Significance by Ziliak and McCloskey, in which the authors argue that rigid adherence to inference from P values - as opposed to effect sizes - is both foolish and damaging. During the discussion, a couple of profs (myself and Ehab Abouheif) each related a story about our adventures with P values. Ehab's story was very interesting, so I asked if he could write it down. Here it is - followed by mine:*

Ehab working on Rensch's Rule at Concordia.

__By Ehab Abouheif__

Today, I presented a paper by Ziliak and McCloskey called the “Cult of
statistical significance,” in which they argue that the Fisherian 5% statistical
cutoff is misleading researchers and leading to fatal outcomes. I was first
exposed to Ziliak and McCloskey’s work during my last sabbatical by the late
Werner Callebaut, then Director of the KLI Institute, and found it
fascinating. It brought me right back to my days as an MSc student, when I
struggled with the meaning of statistical significance in biology.

As an MSc student in Daphne Fairbairn’s lab at Concordia University (Montreal)
during the Phylogenetic Comparative Method revolution of the 1990s, I was hard
at work collecting comparative data and getting my hands on as many
phylogenetic trees as I could from the literature. I spent hours scouring
journals in the Blacker-Wood Library at McGill University for phylogenies of
birds, spiders, primates, water striders, reptiles, snakes – you name it – and
I was looking for it. Back then, phylogenies were hard to come by, but
evolutionary biologists were pushing hard to incorporate any kind of
phylogenetic information, in part or whole, into their comparative analyses.
After two years, I published my first paper ever using the comparative method
(Abouheif and Fairbairn, 1997, A comparative analysis of allometry for sexual
size dimorphism. *American Naturalist* 149: 540-562). We coined the term
Rensch’s Rule, and today it is one of my most highly cited papers.
This was my first taste of
statistical inference and evolutionary biology up close and personal. While
this experience left me excited about evolutionary biology and I was ready to
jump into a PhD, I was left with many questions - even doubts - about
the big assumptions we were making to account
for evolutionary history while testing adaptive hypotheses. It felt like we
were introducing a great deal of statistical uncertainty in the process of
trying to achieve statistical precision.

In 1996, Emília P. Martins
published a method for accounting for evolutionary history when the phylogeny is
not known. In other words, she devised a way to account for phylogeny in
the group we are working with even if the phylogeny for that group is unknown.
Emília’s method randomly generated phylogenies, analyzed the data on each
random tree, took the average across all random trees as the parameter estimate,
and gave confidence intervals around this average. I thought this was
brilliant, and I had always admired Emília's pioneering work on the
Phylogenetic Comparative Method. I was really curious to see how this method
would perform on my MSc data. Would I come to the same conclusions about the
patterns and significance of Rensch’s rule if I assumed no knowledge of
the phylogeny? This question consumed me during the early days of my PhD, and
so I started reanalyzing all of my MSc data using Emília’s method. It brought
back flashbacks of the Blacker-Wood Library at McGill and all the dust I had to
breathe in for the sake of science.

Ehab and Daphne Fairbairn

Ever since, I have kept with me a healthy
skepticism about statistical significance testing and the Fisherian 5%
cutoff. Patterns should at least be weighed equally with the statistical significance
of those patterns, but trying to convince my own students has been hard, and the
journals even harder! Things are starting to change though, and so I am hopeful
for the future. Thanks Andrew Hendry for getting me to write this; it brought
me back a good number of years, and made me go back and read my own paper. I
was pleasantly surprised, and realized how far back our thinking goes!

__By Andrew Hendry__

I first started to work on threespine stickleback during my postdoc at UBC in 1999. As a diligent postdoc working on a new system, I read a bunch of papers about stickleback - especially those by the folks at UBC with whom I would be interacting. One of those was Dolph Schluter's 1994 Science paper *Experimental Evidence that Competition Promotes Divergence in Adaptive Radiation*. This paper included the now-famous figure - oft-republished in textbooks and reviews - showing that when a competitor was added, the slope of the relationship between traits and fitness (growth) changed in predicted ways.

By chance, I happened to see the commentaries written about this paper in Science. Some raised interesting conceptual questions about whether the manipulation was the right way to do the experiment, but others were more narrowly focused on the statistics, saying, in essence: your P value is wrong.

To which Schluter replied: "ooops ..."

My point here isn't to criticize Dolph - he is an exceptionally insightful, rigorous (including statistically), and helpful scientist. Instead, it is to point out the particular attention paid to P values and whether they do or do not exceed some alpha level. If Dolph had calculated his P value correctly, he almost certainly would not have submitted his paper to Science - and, had he submitted it saying "My one-tailed P value is 0.11," it never would have been published. Thus, a simple mistake in calculating a P value led to the publication of a paper that has proven extremely influential - it has been cited nearly 500 times.

__Closing notes:__

P values provide little insight into, well, anything. Consider these illustrative arguments:

1. Imagine you are comparing two populations to see if their mean values (for anything) differ significantly. The truth is that no two populations ever have exactly the same mean trait value in reality - the population PARAMETER is never identical. Hence, the test for significance is pointless - you already know the means differ. All a statistical test does is reveal whether your sample size is large enough to confirm what you already know to be true.

2. Most statistical tests come with tests of whether their assumptions are met. The reality is that the assumptions are NEVER met exactly: no distribution of residuals is ever truly normal (or any other modeled distribution) and homoscedastic. If you fail to reject the null hypothesis of normality and homoscedasticity, that simply reflects the low power of your test to reveal reality.

3. Ever see a paper that calculates P values for a parameter in a simulation model? Because the parameter always (by the mechanics of the model) has an effect, a P value simply reflects how many simulations you have run. Want your parameter to be significant? Simply run more replicates of your simulation until it becomes so.
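Points 1 and 3 can be sketched in a few lines of plain Python. This is purely illustrative - the two "populations" below are invented, their true means differ by a biologically trivial 0.02 standard deviations, and the Welch-style test uses a normal approximation rather than the exact t distribution. The only thing that changes across the runs is sample size, and that alone decides whether the difference comes out "significant":

```python
import math
import random
import statistics

def welch_p(a, b):
    """Two-sided Welch-style test of equal means, using a normal
    approximation to the t distribution (fine at these sample sizes)."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
# Two populations whose true means differ by a trivial 0.02 SD --
# the population parameters are NOT identical, just nearly so.
for n in (100, 10_000, 250_000):
    a = [random.gauss(0.00, 1.0) for _ in range(n)]
    b = [random.gauss(0.02, 1.0) for _ in range(n)]
    print(f"n = {n:>7}: P = {welch_p(a, b):.4f}")
```

The effect never changes; only n does. At small n the test "fails to find" a difference that is really there, and at large n it flags as significant a difference too small to matter.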
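Point 2 can be illustrated the same way. In this hypothetical sketch the "residuals" come from a mildly right-skewed normal mixture, so the normality assumption is genuinely false by construction; a simple (normal-approximation, illustrative) test of zero skewness only detects the violation once the sample gets large:

```python
import math
import random
import statistics

def skewness_p(xs):
    """Approximate test of zero skewness: under normality, sample
    skewness has standard error roughly sqrt(6/n)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    skew = statistics.fmean([((x - m) / s) ** 3 for x in xs])
    z = skew / math.sqrt(6 / len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(4)
def draw():
    """Mildly skewed residuals: a 90/10 mixture of two normals."""
    return random.gauss(0, 1) if random.random() < 0.9 else random.gauss(1.5, 1)

for n in (30, 30_000):
    xs = [draw() for _ in range(n)]
    print(f"n = {n:>6}: P (normality-ish test) = {skewness_p(xs):.4f}")
```

The distribution is equally non-normal in both runs; "passing" the assumption check at n = 30 says nothing except that the check had no power there.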

__Coda__

What matters is effect size (difference between means, variance explained, etc.), your certainty in that estimate (credibility intervals, AIC weights, sampling distributions, etc.), and the extent to which uncertainty is due to sampling error or true biological variation (closely related to sample size).
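As one minimal sketch of that alternative (invented data, plain Python, percentile bootstrap): report the standardized effect size and an interval expressing your certainty in it, rather than a P value.

```python
import math
import random
import statistics

def cohens_d(a, b):
    """Standardized difference between means (pooled-SD Cohen's d)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * statistics.variance(a) +
                        (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled

def bootstrap_ci(a, b, stat, reps=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for stat(a, b)."""
    vals = sorted(stat(random.choices(a, k=len(a)),
                       random.choices(b, k=len(b))) for _ in range(reps))
    return vals[int(reps * alpha / 2)], vals[int(reps * (1 - alpha / 2)) - 1]

random.seed(2)
# Hypothetical trait measurements from two populations (illustrative only)
pop1 = [random.gauss(10.0, 2.0) for _ in range(60)]
pop2 = [random.gauss(9.0, 2.0) for _ in range(60)]
d = cohens_d(pop1, pop2)
lo, hi = bootstrap_ci(pop1, pop2, cohens_d)
print(f"effect size d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An estimate with an interval tells a reader how big the difference is and how sure you are of it - both of which a bare "P < 0.05" hides.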

Of course, oceans of ink have already been spilled on this, so I will stop now and suggest just a few additional resources and tidbits:

1. My post on How To Do Statistics

2. Steve Heard's Defence of the P value

3. Dan Bolnick's coding error that made a P value significant when it wasn't that then led to a paper retraction. (The picture might show the moment he realized his P value was wrong.)

__A comment from a reader:__

I agree with almost all of this. However, I disagree with this:

"Ever see a paper that calculates P values for a parameter in a simulation model? As the parameter is always (by the mechanics of a model) having an effect, a P value simply reflects how many simulations you have run. Want your parameter to be significant - simply run more replicates of your simulations until it becomes so."

It can be the case that one does not know in advance whether a given parameter will have an effect in a simulation model. If you're writing a model just to demonstrate an effect that you already know exists, then yes, your point is valid. But if you're writing a model to find out whether a hypothesized effect exists or not, and the model is not written in such a way as to guarantee that the effect exists, then I don't see anything wrong with using a p-value to see whether or not the effect does, in fact, exist – and increasing the number of replicates will not magically make a parameter that in fact has no effect look like it does have an effect. I think a better point to make with respect to simulations is: If a given parameter *does* have any effect at all, no matter how tiny, you will be able to bring out that effect as "significant" by cranking up the number of replicates, but that doesn't mean the effect is *important*. So after finding a significant effect, the next step should always be an assessment of the effect size.
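The commenter's distinction can be sketched with a toy stochastic "model" (entirely hypothetical - a single noisy outcome standing in for any simulation output). A parameter with truly zero effect does not become significant no matter how many replicates are run; a parameter with a tiny real effect does become significant with enough replicates, while its effect size stays tiny:

```python
import math
import random
import statistics

def p_mean_zero(xs):
    """Normal-approximation one-sample test of mean(xs) == 0."""
    z = statistics.fmean(xs) / (statistics.stdev(xs) / math.sqrt(len(xs)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def run_model(effect):
    """One replicate of a hypothetical stochastic model: the recorded
    outcome is the parameter's true effect plus simulation noise."""
    return effect + random.gauss(0.0, 1.0)

random.seed(3)
results = {}
for effect in (0.0, 0.05):          # no effect vs. a tiny real effect
    for reps in (50, 50_000):       # few vs. many replicates
        outcomes = [run_model(effect) for _ in range(reps)]
        results[(effect, reps)] = p_mean_zero(outcomes)
        print(f"effect={effect}, reps={reps:>6}: "
              f"P={results[(effect, reps)]:.3f}, "
              f"mean outcome={statistics.fmean(outcomes):.3f}")
```

Cranking up replicates cannot conjure significance for the zero-effect parameter, but it does make the 0.05-effect parameter "significant" without making it important - which is exactly why the follow-up step should be an assessment of effect size.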