By Andrew Hendry
I first started to work on threespine stickleback during my postdoc at UBC in 1999-1999. As a diligent postdoc working on a new system, I read a bunch of papers about stickleback - especially those by the folks at UBC with whom I would be interacting. One of those was Dolph Schluter's 1993 Science paper, Experimental Evidence that Competition Promotes Divergence in Adaptive Radiation. This paper included the now-famous figure, oft-republished in textbooks and reviews, showing that when a competitor was added, the slope of the relationship between traits and fitness (growth) changed in predicted ways.
By chance, I happened to see the commentaries written about this paper in Science. Some were based on interesting conceptual questions about whether the manipulation was the right way to do the experiment, but others were more narrowly focused on the statistics, saying, in essence, "your P value is wrong."
To which Schluter replied: "ooops ..."
My point here isn't to criticize Dolph - he is an exceptionally insightful, rigorous (including statistically), and helpful scientist. Instead, it is to point out the particular attention paid to P values and whether they do or do not exceed some alpha level. If Dolph had calculated his P value correctly, he almost certainly would not have submitted his paper to Science - and, if he had, saying "My one-tailed P value is 0.11", it never would have been published. Thus, a simple mistake in calculating a P value led to the publication of a paper that has proven extremely influential - it has been cited nearly 500 times.
P values provide little insight into, well, anything. Consider these illustrative arguments:
1. Imagine you are comparing two populations to see if their mean values (for anything) are significantly different. The truth is that no two populations ever have the same mean trait value in reality - the population PARAMETER is never identical. Hence, the test for significance is pointless - you already know they differ. All a statistical test does is reveal whether your sample size is large enough to confirm what you already know to be true - the means differ.
2. Most statistical tests involve a test for whether the assumptions of that test are met. The reality is that they NEVER are. That is, no distribution of residuals is ever normal (or any other modeled distribution) and homoscedastic. If you fail to reject the null hypothesis of normality and homoscedasticity, it simply reflects low power of your test to reveal reality.
3. Ever see a paper that calculates P values for a parameter in a simulation model? As the parameter is always (by the mechanics of the model) having an effect, a P value simply reflects how many simulations you have run. Want your parameter to be significant? Simply run more replicates of your simulations until it becomes so.
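Arguments 1 and 3 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two populations are hypothetical, both are normal with a standard deviation of 1, their true means differ by a fixed and tiny 0.05, and the P value comes from a large-sample z approximation rather than any particular test used in the papers discussed here. The function names `two_sample_p` and `simulate` are made up for this example.

```python
import math
import random
import statistics


def two_sample_p(a, b):
    """Two-sided P value for a difference in means (large-sample z approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n, true_diff=0.05, seed=1):
    """Draw n individuals from each hypothetical population; return (estimated difference, P)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(true_diff, 1.0) for _ in range(n)]
    return statistics.fmean(b) - statistics.fmean(a), two_sample_p(a, b)


# The true difference never changes; only the sample size does.
for n in (20, 200, 2000, 50000):
    diff, p = simulate(n)
    print(f"n={n:>6}  estimated difference={diff:+.3f}  P={p:.4g}")
```

The true effect is identical in every row, so whether P ends up below 0.05 mostly tells you how many individuals (or simulation replicates) you were willing to sample.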
What matters is effect size (difference between means, variance explained, etc.), your certainty in that estimate (credibility intervals, AIC weights, sampling distributions, etc.), and the extent to which uncertainty is due to sampling error or true biological variation (closely related to sample size).
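As one possible way to report an effect size together with the certainty in that estimate, here is a minimal sketch of a percentile bootstrap interval for a difference between two means. It is an assumption-laden illustration, not a recommended analysis: the data are simulated stand-ins for trait measurements, the helper name `bootstrap_diff_ci` is hypothetical, and a percentile bootstrap is just one of several ways to approximate a sampling distribution.

```python
import random
import statistics


def bootstrap_diff_ci(a, b, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original sample sizes.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_b) - statistics.fmean(resampled_a))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical trait values for two populations (simulated, not real data).
data_rng = random.Random(42)
pop_a = [data_rng.gauss(10.0, 2.0) for _ in range(40)]
pop_b = [data_rng.gauss(11.0, 2.0) for _ in range(40)]

observed = statistics.fmean(pop_b) - statistics.fmean(pop_a)
low, high = bootstrap_diff_ci(pop_a, pop_b)
print(f"difference in means = {observed:.2f} (95% bootstrap interval {low:.2f} to {high:.2f})")
```

Reported this way, the reader sees how big the difference is and how well it is pinned down, rather than only whether it crossed an alpha level.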
Of course, oceans of ink have already been spilled on this, so I will stop now and suggest just a few additional resources and tidbits:
1. My post on How To Do Statistics
2. Steve Heard's Defence of the P value
3. Dan Bolnick's coding error, which made a P value appear significant when it wasn't and then led to a paper retraction. (The picture might show the moment he realized his P value was wrong.)