Saturday, March 7, 2026

The Null Hypothesis is Always Wrong

No two populations are identical for any trait. No two communities have the same species composition. No detectable phenotype is ever completely neutral (i.e., “some” selection is always present). No trait is completely free of genetic (or plastic) influences. No evolutionary trajectory is ever random. No behaviour lacks at least some repeatability. In all statistical tests, the data violate at least some of the assumptions. In short, the null hypothesis is always wrong – unless, of course, the researcher has a completely vacuous alternative hypothesis (i.e., one not informed by the biology of the system) or an intentionally disingenuous one (e.g., already known to be wrong).

The issue, of course, is that the EFFECT underlying any focal hypothesis (two populations are different, a phenotype is heritable, a behaviour is repeatable, data violate some assumptions) is often very weak. When effects are weak, it becomes very hard to reject a null hypothesis, even though that null hypothesis is, in reality, inevitably wrong. A clear example comes from simulation modeling. Researchers do not put variables into models unless they plausibly have an effect – and so running statistical tests to confirm that a model term has a “significant” effect is simply a function of the number of replicate simulations one runs. Stated the opposite way, even an infinitesimal effect will be significant if you have enough data – in the simulation-model case, you can just run more replicates of your simulations and voilà – the effect will become significant.
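The arithmetic behind this point can be sketched in a few lines. Here is a minimal illustration (the 0.05-standard-deviation "true" effect and the sample sizes are invented for the example): for a fixed tiny true difference between two groups, the two-sided p-value of a simple z-test collapses toward zero as the number of replicates grows.

```python
import math

def two_sample_p(effect, n_per_group):
    """Two-sided p-value of a z-test for a known true mean difference
    'effect' (in standard-deviation units) with n replicates per group."""
    se = math.sqrt(2.0 / n_per_group)       # standard error of the difference
    z = effect / se                         # z statistic
    return math.erfc(z / math.sqrt(2.0))    # two-sided normal tail probability

# A tiny true effect (0.05 SD) is "non-significant" with 100 replicates...
print(two_sample_p(0.05, 100))        # ≈ 0.72
# ...but overwhelmingly "significant" with a million replicates.
print(two_sample_p(0.05, 1_000_000))  # far below any conventional alpha
```

The effect size never changes; only the sample size does – which is exactly why the significance verdict alone tells you nothing about biology.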

I will first outline a series of problems associated with the use of null hypotheses, while also emphasizing that the solution is to focus on effect sizes (and the sampling distributions of those effect sizes). I realize that some statisticians will (perhaps violently) object to some of the statements below – and will also point out problems with specific statements or suggestions. My point in this post, however, is to bring some biology back and to promote an understanding of what statistics are for in a biological sense: they are simply a way of indicating the degree of confidence one has when stating a conclusion from data.

1.      Null hypothesis testing distracts from what matters – the effect size.

When researchers use a null hypothesis, they tend to make a dichotomous decision: that is, “my alternative hypothesis is not correct” or “my alternative hypothesis is correct.” (Of course, the technical way to state the outcomes is different – but the preceding statement is what everyone actually wants to infer.) In the first case (failing to reject the null), researchers often don’t even report the estimated effect size – a practice well known to bias meta-analyses. Researchers should always report the estimated effect size even if the null is not rejected! (Remember, it is the estimated effect size that matters.) Further, researchers struggle with p values that are barely non-significant – describing them as “marginally significant” or “suggestive” or “a trend.” Again, all of this goes away if the focus is on effect size.

Perhaps most insidious, researchers conclude that their data fit the assumptions of their statistical model (e.g., the error distribution) when they fail to reject a null hypothesis that the assumptions are not violated. The reality is that the data ALWAYS violate the assumptions of the model: no data/residual distribution is ever normal (or log-normal or Poisson or whatever distribution you assume, with or without transformation), no two populations ever have the same variance, no X-Y relationship is ever perfectly linear, and so on. What matters is the extent to which (that is, the effect size) the data violate the assumptions and WHETHER THAT MATTERS FOR YOUR CONCLUSION. Similar issues attend the removal of “non-significant” terms from models (e.g., terms that do not provide a “significant improvement to the fit of the model to the data”) – especially when those non-significant terms account for structure in the data (Arnqvist 2019 - TREE).

In the reverse case (rejecting the null), researchers often just state that they have found an effect without emphasizing the magnitude of that effect. For example, a researcher who rejects a null hypothesis will often state that two populations differ in trait value, or that a trait is repeatable or heritable. The reality is that no populations are identical for any trait, all traits are (at least to some extent) heritable, and all behaviours are (at least to some extent) repeatable; what matters is HOW heritable or repeatable. Simply stating that a trait is repeatable or heritable provides no useful information – what we need to know is HOW heritable the trait is (as that predicts the rate of evolution) or HOW repeatable it is (Hendry 2023 - Bioscience).

2.      The null hypothesis is given favorable status when it is really just one of several competing hypotheses.

With limited data, many substantial effect sizes will be deemed non-significant in a statistical test. The reason is that the null often encompasses such a huge potential range of possibilities that it becomes very hard to reject the null even when – as noted above – the null hypothesis is surely wrong. Stated another way, the null hypothesis is given “favored status” among a set of competing hypotheses. The solution to this problem is to instead gauge the level of support for several alternative hypotheses, including – if one wishes – a “random” equivalent to a null hypothesis.

An excellent example of this problem (and solution) comes from attempts to detect non-random changes in paleontological time series. In a particularly telling instance, Mike Bell collected fossil time series at high resolution for threespine stickleback traits (Bell et al. 2006 – Paleobiology). Although the traits (armour plates and pelvic structures) were well known to be functional and to repeatedly evolve in response to selection in contemporary populations, and even though the dramatic change in the traits through time was perfectly consistent with contemporary time series, all available statistical tests failed to reject the null hypothesis of random change. Gene Hunt then took the much more reasonable approach, for the same data set, of estimating the level of support for alternative hypotheses (e.g., drift versus adaptation toward a fitness peak), revealing that the adaptive hypotheses had much more support than a drift hypothesis (Hunt et al. 2008 - Evolution). Gene Hunt also applied this competing-hypothesis approach to other fossil time series, again revealing much more support for selective mechanisms than had previously been inferred using null-hypothesis approaches (Hunt 2007 - PNAS).

This competing-hypothesis approach is embraced by methods such as model comparison based on AIC or BIC. Importantly, however, one must not fall into the trap of simply stating that one hypothesis is a better fit than another; one should instead emphasize the level of support for the different hypotheses (e.g., AIC weights).
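Akaike weights are easy to compute from a set of AIC scores. Here is a minimal sketch (the three AIC values are invented, simply to stand in for three competing models such as stasis, drift, and adaptation): each model's weight is proportional to exp(-ΔAIC/2), normalized to sum to one.

```python
import math

def akaike_weights(aics):
    """Convert a list of AIC scores into relative model weights:
    w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
    where delta_i = AIC_i - min(AIC)."""
    best = min(aics)
    rel = [math.exp(-(a - best) / 2.0) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical AIC scores for three competing models
# (e.g., stasis, random walk / drift, adaptation toward a fitness peak).
weights = akaike_weights([110.0, 102.0, 100.0])
print([round(w, 3) for w in weights])  # → [0.005, 0.268, 0.727]
```

Reporting the full set of weights (0.005 vs. 0.268 vs. 0.727) conveys the relative support for each hypothesis, rather than a bare verdict that one model "wins."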

The time series of fossil stickleback traits analyzed by Bell et al. (2006) and then re-analyzed by Hunt et al. (2008). This figure is from the latter paper.


3.       The null hypothesis is subjective.

It might seem that null hypotheses are objective: that is, “a variable of interest has no effect” or “two populations are not different.” In many instances, however, the null hypothesis is subjective – and, when combined with the “favored status” described above, this subjectivity can dramatically influence the conclusions drawn from the same set of data.

A first example involves correlations between two variables (X and Y) in two groups (e.g., two populations or two treatments). A logical question is whether the two slopes differ – and so the null hypothesis would state that the Y variable is not influenced by the interaction between group and the X variable. Assume the null hypothesis is then rejected and the researcher goes home happy. At the same time, we would presumably only be interested in the relationship between X and Y within a population if the relationship within that population were significant (in the classical approach), in which case the null hypothesis would be that no relationship is evident within a population. Thus, in the case where one population shows a slight positive correlation between X and Y and the other shows a negative correlation, the two slopes can be significantly different from each other (i.e., “the relationship differs between the two populations”) even though neither slope is significantly different from zero (i.e., “X does not influence Y in either population”). (I am not saying that both null hypotheses SHOULD be tested in such cases. I am merely pointing out the cognitive dissonance that can arise via null hypothesis logic.)

Another example comes from the use of one-tailed versus two-tailed hypotheses. In many cases, researchers are interested in the direction of an effect (e.g., Y increases as X increases), and so the null hypothesis is that the correlation is NOT positive and the alternative hypothesis is one-tailed (a positive relationship), thus increasing power. What does a researcher do, then, when they find a strong relationship in the opposite direction to the alternative hypothesis? Yes, the alternative hypothesis was incorrect, and yet it would be silly to leave it at that. The evidence instead clearly suggests a strong effect in the opposite direction – and yet, technically, in the null hypothesis approach, all the researcher can conclude is that the effect is not positive – they cannot conclude the effect is negative. Alternatively, the researcher might choose a two-tailed test, in which case the null hypothesis is that the effect is zero. In this two-tailed case, however (again, technically in the null hypothesis approach), the researcher cannot state the DIRECTION of the effect they find, because the alternative hypothesis was simply that the slope was not zero.

Other examples of subjective null hypotheses abound. In studies of parallel evolution, for example, the null hypothesis can completely flip depending on your interest. On the one hand, the null hypothesis might be that evolutionary trajectories are random, with the alternative being that they are not random. For other researchers, the null hypothesis might be that the evolutionary trajectories are parallel (because they are interested in deviations from parallelism), with the alternative being that they are not parallel (De Lisle and Bolnick 2020 - Evolution).

Again, the solution is to estimate the effect size and the sampling distribution for those effect sizes.

4.      Other issues

I probably haven’t yet mentioned one of the criticisms you were expecting – that the critical value for rejection (usually 5%) is arbitrary – in essence, made up by Ronald Fisher when he invented the null hypothesis statistical approach. This specific arbitrary cut-off contributes to many of the above effects, such as the “favored” status for the null and the lack of emphasis on effect size. Importantly, however, this problem can’t be fixed simply by changing the critical value.

Another problem arises when multiple comparisons are involved. In some cases, it is argued that one needs to adjust the P value (e.g., Tukey post-hoc tests, Bonferroni “corrections,” False Discovery Rates, etc.) to maintain some experiment-wide or study-wide error rate. This adjustment becomes advantageous when one doesn’t want to reject a null hypothesis – usually in the case of testing for violated assumptions. But it is also problematic in the other direction. Imagine, for instance, that you measure whether some particular trait differs between populations and find that it does (because you can reject the null that it doesn’t). Then you decide to measure more traits and realize that – to maintain a study-level p-value – you need to adjust for the chance that you will mistakenly reject the null hypothesis 5% of the time FOR EACH TRAIT. It doesn’t take a lot of traits for your original significant relationship to become non-significant. (Yes, I realize you can employ multivariate tests but – often – trait-specific conclusions are necessary.) Another common example is when you test for differences among more than two populations – and find that you can reject the null hypothesis that they are all the same. However, with such a test, you can’t conclude which specific populations differ from which others – so you then conduct post-hoc tests of one sort or another and find that you can’t find ANY differences – even though you “know” that at least one difference must be present.
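The "more traits kill your original result" scenario can be made concrete with a toy calculation (all p-values here are invented): a trait that clears alpha = 0.05 on its own fails the Bonferroni-corrected per-test threshold once nine more traits join the study.

```python
# Hypothetical p-values: the first trait was measured alone and was
# "significant" at alpha = 0.05; the other nine traits were added later.
p_values = [0.03, 0.20, 0.45, 0.08, 0.60, 0.12, 0.35, 0.50, 0.07, 0.90]
alpha = 0.05

# Trait 1 tested alone: significant.
print(p_values[0] < alpha)                 # True

# Same trait after a Bonferroni correction across all 10 traits:
# the per-test threshold shrinks to alpha / 10 = 0.005.
bonferroni_threshold = alpha / len(p_values)
print(p_values[0] < bonferroni_threshold)  # False
```

Nothing about the first trait's data changed; only the researcher's decision to measure more traits did – another illustration of why the dichotomous verdict is a poor summary of the evidence.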

The solution

As noted above, the solution to these problems is to work with effect sizes, such as R2, AIC, Hedges’ g, Cohen’s d, or estimated slopes in linear models. Importantly, however, these estimates should be combined with presentations of their sampling distributions (the probability distribution of effect size estimates given the data and the sample sizes), which can then be used to place confidence intervals (or equivalents) on effect size estimates. Of course, each estimator has its own set of uses and misuses – but that is for another time.
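As a minimal sketch of what "effect size plus its uncertainty" looks like in practice, here is Cohen's d for two groups with an approximate 95% confidence interval based on the standard large-sample standard error of d (the trait measurements are invented for illustration):

```python
import math

def cohens_d_with_ci(a, b):
    """Cohen's d (pooled-SD standardized mean difference) with an
    approximate 95% CI from the large-sample standard error of d."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (ma - mb) / pooled_sd
    se = math.sqrt((na + nb) / (na * nb) + d ** 2 / (2 * (na + nb)))
    return d, (d - 1.96 * se, d + 1.96 * se)

# Invented trait measurements for two populations.
pop1 = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3]
pop2 = [4.6, 4.9, 4.4, 4.7, 4.5, 4.8]
d, (lo, hi) = cohens_d_with_ci(pop1, pop2)
print(f"d = {d:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# → d = 1.98, 95% CI = (0.60, 3.37)
```

Reporting "d = 1.98 (95% CI 0.60–3.37)" tells a reader both HOW different the populations are and how precisely that difference is estimated – far more than "P < 0.05" ever could.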

In closing

Many of you might be thinking that I am a hypocrite because work from my lab does often use null hypothesis-based statistical testing. The reason that we do this is, of course, that reviewers and editors and journals often insist on it – and so, if one is to publish a paper, one needs to use that approach. HOWEVER, we always report and emphasize and compare the effect sizes – as opposed to null hypothesis rejection. Stated another way, we do report p-values so that people who care about them can see them – but we also report and emphasize the effect sizes, which is what matters.
