No two populations are identical for any trait. No two communities have the same species composition. No detectable phenotype is ever completely neutral (i.e., “some” selection is always present). No trait is completely free of genetic (or plastic) influences. No evolutionary trajectory is ever random. No behaviour lacks at least some repeatability. In all statistical tests, the data violate at least some of the assumptions. In short, the null hypothesis is always wrong – unless, of course, the researcher has a completely vacuous (i.e., not informed by the biology of the system) or intentionally disingenuous (e.g., already known to be wrong) alternative hypothesis.
The issue, of course, is that the EFFECT underlying any focal
hypothesis (two populations are different, a phenotype is heritable, a
behaviour is repeatable, the data violate a given assumption) is often very
weak. When effects are weak, it becomes very hard to reject a null hypothesis,
even though that null hypothesis is, in reality, inevitably wrong. A clear
example comes from simulation modeling. Researchers do not put variables into
models unless they plausibly have an effect – and so running statistical tests
to confirm that a model term has a “significant” effect is simply a function of
the number of replicate simulations one runs. Stated another way, even an
infinitesimal effect will be significant if you have enough data – in the
simulation-model case, you can simply run more replicates and, voilà, the
effect becomes significant.
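This replicate-pumping problem is easy to demonstrate. Here is a minimal sketch (Python; the two-sample test uses a normal approximation, and the “tiny” effect of 0.05 standard deviations is an arbitrary value I chose for illustration):

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation,
    which is adequate at the sample sizes used here)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (mb - ma) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
tiny_effect = 0.05  # a trivially small true difference in means (in SD units)

pvals = {}
for n in (100, 10_000, 250_000):  # "running more replicates"
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(tiny_effect, 1.0) for _ in range(n)]
    pvals[n] = two_sample_p(a, b)
    print(f"n = {n:>7}: p = {pvals[n]:.2e}")
```

The true effect never changes; only the number of replicates does – and with enough of them, even this trivial effect becomes overwhelmingly “significant.”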
I will first outline a series of problems associated with
the use of null hypotheses, while also emphasizing that the solution is to
focus on effect sizes (and the sampling distributions of those effect sizes). I
realize that some statisticians will (perhaps violently) object to some of the
statements below – and will also point out problems with specific statements or
suggestions. My point in this post, however, is to bring some biology back and
to promote an understanding of what statistics are used for in a biological
sense: they are simply a way of indicating a degree of confidence that one has
when stating a conclusion from data.
1.
Null hypothesis testing distracts from what
matters – the effect size.
When researchers use a null hypothesis, they tend to make a
dichotomous decision: that is, “my alternative hypothesis is not correct” or “my
alternative hypothesis is correct.” (Of course, the technical way to state the
outcomes is different – but the preceding statement is what everyone actually
wants to infer.) In the first case (failing to reject the null), researchers
often don’t even report the estimated effect size – an omission with a well-known
biasing effect on meta-analyses (i.e., publication bias). Researchers should always report the estimated
effect size even if the null is not rejected! (Remember, it is the estimated
effect size that matters). Further, researchers struggle with p-values that are
barely non-significant, describing them as “marginally significant,”
“suggestive,” or “a trend.” Again, all of this goes away if the focus is on
effect size.
Perhaps most insidious, researchers conclude that their data
fit the assumptions of their statistical model (e.g., error distribution) when
they fail to reject a null hypothesis that the assumptions are not violated.
The reality is that the data ALWAYS violate the assumptions of the model: no
data/residual distribution is ever exactly normal (or log-normal, or Poisson, or
whatever distribution one targets, with or without transformation), no two
populations ever have exactly the same variance, no X–Y relationship is ever
perfectly linear, and so on. What matters is the extent to which the data
violate the assumptions (that is, the effect size) and WHETHER THAT MATTERS FOR YOUR CONCLUSION. Similar
issues attend the removal of “non-significant” terms from models (e.g., terms
that do not provide a “significant improvement to the fit of the model to the
data”) – especially when those non-significant terms account for structure in
the data (Arnqvist 2019 - TREE).
In the reverse case (rejecting the null), researchers often
then just state they have found an effect without emphasizing the magnitude of
that effect. For example, a researcher who rejects a null hypothesis will
often state that two populations differ in trait value or that a trait is
repeatable or heritable. The reality is that no two populations are identical for
any trait, all traits are (at least to some extent) heritable, and all
behaviours are (at least to some extent) repeatable; what matters is HOW
heritable or repeatable. Simply stating that a trait is repeatable or heritable
provides no useful information – what we need to know is HOW heritable (as that
predicts the rate of evolution) or HOW repeatable the trait is (Hendry 2023 - Bioscience).
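To make the “HOW” point concrete, here is one minimal sketch of reporting an effect size with an interval rather than a verdict (Python; the SE formula is the standard large-sample approximation for Cohen’s d, and the trait values are invented):

```python
import math

def cohens_d_ci(a, b, z_crit=1.96):
    """Cohen's d with an approximate 95% CI (large-sample normal SE)."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    se = math.sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    return d, d - z_crit * se, d + z_crit * se

# invented trait values for two small samples from two populations
pop1 = [2.0, 2.5, 3.0, 3.5, 4.0]
pop2 = [1.0, 1.5, 2.0, 2.5, 3.0]
d, lo, hi = cohens_d_ci(pop1, pop2)
print(f"d = {d:.2f} [{lo:.2f}, {hi:.2f}]")
```

Note that this particular interval crosses zero – a null-hypothesis test would declare “no difference” – yet the estimated effect is large. The honest summary is “a large but poorly resolved difference,” which is far more informative than a dichotomous verdict.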
2.
The null hypothesis is given favorable status
when it is really just one of several competing hypotheses.
With limited data, many substantial effect sizes will be
deemed non-significant in a statistical test. The reason is that the null often
encompasses such a huge potential range of possibilities that it becomes very
hard to reject the null even when – as noted above – the null hypothesis is surely
wrong. Stated another way, the null hypothesis is given “favored status” among
a set of competing hypotheses. The solution to this problem is to instead gauge
the level of support for several alternative hypotheses, including – if one
wishes – a “random” equivalent to a null hypothesis.
An excellent example of this problem (and solution) comes
from attempts to detect non-random changes in paleontological time series. In a
particularly telling instance, Mike Bell collected fossil time series at high
resolution for threespine stickleback traits (Bell et al. 2006 – Paleobiology).
Although the traits (armour plates and pelvic structures) were well known to be
functional and to repeatedly evolve in response to selection in contemporary
populations, and even though the dramatic change in the trait through time was
perfectly consistent with contemporary time series, all available statistical
tests failed to reject the null hypothesis of random change. Gene Hunt then
took the much more reasonable approach for the same data set of estimating the
level of support for alternative hypotheses (e.g., drift versus adaptation to a
fitness peak), revealing that the adaptive hypotheses had much more support
than a drift hypothesis (Hunt et al. 2008 - Evolution). Gene Hunt also
applied this competing-hypothesis approach to other fossil time series, again revealing
much more support for selective mechanisms than had previously been inferred
using null-hypothesis approaches (Hunt 2007 PNAS).
This competing hypothesis approach is embraced by methods
such as model comparisons based on AIC or BIC. Importantly, however, one must
not fall into the trap of simply stating that one hypothesis fits better
than another; one should instead emphasize the level of support for the
different hypotheses (e.g., AIC weights).
| The time series of fossil stickleback traits analyzed by Bell et al. (2006) and then re-analyzed by Hunt et al. (2008). This figure is from the latter paper. |
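For readers who have not computed them, Akaike weights are a short step from the AIC values themselves. A sketch (Python; the two AIC values are invented for illustration and are not Hunt’s actual numbers):

```python
import math

def akaike_weights(aic_by_model):
    """Convert AIC values into Akaike weights: the relative support for
    each model, summing to 1 across the candidate set."""
    best = min(aic_by_model.values())
    rel = {m: math.exp(-0.5 * (a - best)) for m, a in aic_by_model.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}

# invented AIC values for two competing hypotheses
weights = akaike_weights({"random drift": 112.4, "adaptive peak": 100.1})
print(weights)
```

The final statement is then not “the adaptive-peak model fits better” but “the adaptive-peak model carries roughly 99.8% of the support within this candidate set.”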
3.
The null
hypothesis is subjective.
It might seem that null hypotheses are objective: that is,
“a variable of interest has no effect” or “two populations are not different.”
In many instances, however, the null hypothesis is subjective – and this
subjectivity, combined with the “favored status” described above, can
dramatically influence the conclusions drawn from the same set of data.
A first example involves correlations between two
variables (X and Y) in two groups (e.g., two populations or two treatments). A
logical question is whether the two slopes differ – and so the null hypothesis
would state that the interaction between group and the X variable has no
effect on Y. Assume the null hypothesis is then rejected
and the researcher goes home happy. At the same time, we would presumably only
be interested in the relationship between X and Y within a population if the
relationship within that population were significant (in the classical
approach), in which case the null hypothesis would be that no relationship was
evident within a population. Thus, in the case where one population shows a
slight positive correlation between X and Y and the other population shows a
negative correlation between X and Y, the two slopes can be significantly
different from each other (i.e., “the relationship differs between the two
populations”) even though neither slope is significantly different from zero
(i.e., “X does not influence Y in either population”). (I am not saying here
that both null hypotheses SHOULD be tested in such cases; I am merely pointing
out the cognitive dissonance that can arise via null-hypothesis logic.)
Another example comes from the use of one-tailed versus
two-tailed hypotheses. In many cases, researchers are interested in the
direction of an effect (e.g., Y increases as X increases) and so the null
hypothesis is that the correlation is NOT positive and the alternative
hypothesis is one-tailed (a positive relationship), thus increasing power. What
does a researcher do then, when they find a strong relationship in the opposite
direction to the alternative hypothesis? Yes, the alternative hypothesis was incorrect
and yet it would be silly to leave it at that. The evidence instead clearly
suggests a strong effect in the opposite direction – yet, technically, in
the null-hypothesis approach, all the researcher can conclude is
that the effect is not positive; they cannot conclude that the effect is negative.
Alternatively, the researcher might choose a two-tailed test, in which case the
null hypothesis is that the effect is zero. In this two-tailed case,
however (again, technically in the null-hypothesis approach), the researcher
cannot state the DIRECTION of the effect they find, because the alternative
hypothesis was simply that the slope was not zero.
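The arithmetic behind this trap is worth spelling out. For a symmetric test statistic, the one- and two-tailed p-values are related as follows (Python sketch using a normal approximation; the z-value of −3.2 is an invented example of a strong effect in the “wrong” direction):

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_one_tailed_positive(z):
    """H1: effect > 0, so only the upper tail counts as evidence."""
    return 1 - phi(z)

def p_two_tailed(z):
    """H1: effect != 0, so both tails count."""
    return 2 * (1 - phi(abs(z)))

z = -3.2  # a strong effect OPPOSITE to the one-tailed alternative
print(p_one_tailed_positive(z))  # near 1: cannot reject "not positive"
print(p_two_tailed(z))           # well below 0.05
```

Bound by the one-tailed framework, the researcher must report p ≈ 0.999 and conclude nothing – while the two-tailed view of the very same data yields p ≈ 0.001, although then (technically) without a licensed statement about direction.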
Other examples of subjective null hypotheses abound. In
studies of parallel evolution, for example, the null hypothesis can completely
flip depending on your interest. On the one hand, the null hypothesis might be
that evolutionary trajectories are random, with the alternative being that they
are not random. For other researchers, the null hypothesis might be that the
evolutionary trajectories are parallel (because they are interested in
deviations from parallelism), with the alternative being that they are not
parallel (De Lisle and Bolnick 2020 - Evolution).
Again, the solution is to estimate the effect size and the
sampling distribution for those effect sizes.
4.
Other issues
I probably haven’t yet mentioned one of the criticisms you
were expecting – that the critical value for rejection (usually 5%) is
arbitrary – in essence, made up by Ronald Fisher when he invented the null
hypothesis statistical approach. This specific arbitrary cut-off contributes to
many of the above effects, such as the “favored” status for the null and the lack
of emphasis on effect size. Importantly, however, this problem can’t be fixed
simply by changing the critical value.
Another problem arises when multiple comparisons are
involved. In some cases, it is argued that one needs to adjust one's p-values
(e.g., Tukey post-hoc tests, Bonferroni “corrections,” false discovery rates,
etc.) to maintain some experiment-wide or study-wide error rate. This adjustment is
convenient when one doesn’t want to reject a null hypothesis – usually when
testing for violated assumptions. But it is also problematic in the
other direction. Imagine, for instance, that you measure whether some particular
trait differs between populations and find that it does (because you can reject
the null that it doesn’t). Then you decide to measure more traits and realize
that – to maintain a study-level p-value – you need to adjust for the chance
that you will mistakenly reject the null hypothesis 5% of the time FOR EACH TRAIT. It doesn’t
take a lot of traits for your original significant relationship to become
non-significant. (Yes, I realize you can employ multivariate tests but – often
– trait specific conclusions are necessary.) Another common example is when you
test for differences among more than two populations – and find that you can
reject the null hypothesis that they are all the same. However, with such a test,
you can’t conclude which specific populations differ from which other
populations – so you then conduct post-hoc tests of one sort or another and find
that you can’t find ANY pairwise differences – even though you “know” that at least one
difference must be present.
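The trait example above is stark when written out. A Bonferroni sketch (Python; the p-value of 0.03 is hypothetical):

```python
def bonferroni_significant(p, n_tests, alpha=0.05):
    """Is p still "significant" after a Bonferroni correction for n_tests?"""
    return p < alpha / n_tests

p_first_trait = 0.03  # hypothetical: clearly "significant" on its own

print(bonferroni_significant(p_first_trait, n_tests=1))   # True
print(bonferroni_significant(p_first_trait, n_tests=10))  # False: same data!
```

Measuring nine more traits retroactively erases the “significance” of the first trait (0.03 > 0.05/10 = 0.005), even though nothing about that trait’s data changed – whereas its effect-size estimate is untouched.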
As noted above, the solution to these problems is to work
with effect sizes, such as R2, Hedges’ g, Cohen’s d, or estimated
slopes in linear models (or, for comparing models, measures of support such as AIC).
Importantly, however, these estimates should be
combined with presentations of their sampling distributions (the probability
distribution of effect-size estimates given the data and the sample sizes),
which can then be used to place confidence intervals (or equivalents) on
effect-size estimates. Of course, each estimator has its own set of uses and misuses –
but that is for another time.
Many of you might be thinking that I am a hypocrite because work from my lab does often use null hypothesis-based statistical testing. The reason that we do this is, of course, that reviewers and editors and journals often insist on it – and so, if one is to publish a paper, one needs to use that approach. HOWEVER, we always report and emphasize and compare the effect sizes – as opposed to null hypothesis rejection. Stated another way, we do report p-values so that people who care about them can see them – but we also report and emphasize the effect sizes, which is what matters.