Saturday, January 3, 2015

How to do statistics

This fall, I wrote a series of “How to” blog posts that proved somewhat popular, or at least well-read.

I hadn't initially planned a series like this; it just kind of emerged. However, I had long planned one particular “How to” post. Ironically, that post was the one I still hadn’t written. Now that it is 2015, the time seems ripe to get back to the original idea. (Thanks to Ben Haller, Gregor Rolshausen, Joost Raeymaekers, and Chuck Fox for critical comments that helped improve this post.)

How to do statistics.

I used to teach statistics. Really! I was a whiz at SPSS and Systat, and I could find my way around JMP. I was almost at the cutting edge, which then was SAS. No one complained seriously about the stats in the papers I submitted. Now, it seems that – with the same statistical skills as before, and maybe even a bit better – I have become a dinosaur. Increasingly, the feeling seems to be that you can’t be considered even moderately competent at statistics unless you can do a GLMM in R. In this sea-change from [insert your previous stats package here] to R, I feel that several important points are getting lost – or at least under-emphasized. My goal in the present post is to revisit what statistics are supposed to be for and how you should do them. I do not mean the details of how to choose and run a particular model but rather how to view stats as a way of enhancing your science and refining your inference. I will outline these ideas through a series of assertions.

1. It’s all about the (appropriate) replication

An incredibly important route to improving your science is to maximize replication at the appropriate level of inference. Imagine you are interested in a particular effect, say the difference in an experiment between two treatments or the difference in some trait between populations in two environments. Here, you need to strive for maximum replication of the two treatments or the two environments. This might seem obvious but – as a reviewer/editor – I have seen many studies where people wish to make inferences about the effects of two environments, yet they have studied only one population in each environment. In such cases, they are entitled to draw conclusions about differences between the two studied populations but not between the two environments because – with only one population per environment – the investigator cannot gauge the difference between environments in relation to variation within environments. That is, it is quite possible that two populations within each environment would differ just as much as two populations sampled from the different environments. While the temptation is to get larger sample sizes for each measured population, what is much more important is to sample many populations. I have seen many papers rejected for lack of replication at the level for which inferences are desired.
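The replication argument can be illustrated with a small simulation. What follows is a minimal sketch in Python (standard library only), not an analysis from this post: it assumes the environment has no effect at all, that population means vary with a standard deviation of 1, and that individuals vary around their population mean with a standard deviation of 1. All function names, sample sizes, and parameter values are made up for illustration. Both designs measure 200 individuals in total; only the level of replication differs.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pooled_t(a, b):
    """Two-sample t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (mean(a) - mean(b)) / se

def sample_population(rng, n_ind, sd_pop=1.0, sd_ind=1.0):
    """Individuals from one population; the environment has NO effect here."""
    pop_mean = rng.gauss(0.0, sd_pop)
    return [rng.gauss(pop_mean, sd_ind) for _ in range(n_ind)]

def false_positive_rates(n_sims=500, seed=1):
    rng = random.Random(seed)
    hits_one_pop = 0
    hits_many_pops = 0
    for _ in range(n_sims):
        # Design A: one population per environment, 100 individuals each,
        # with individuals treated as the replicates (df = 198).
        a = sample_population(rng, 100)
        b = sample_population(rng, 100)
        if abs(pooled_t(a, b)) > 1.972:  # two-sided 5% critical value, df = 198
            hits_one_pop += 1
        # Design B: ten populations per environment, 10 individuals each,
        # with population means treated as the replicates (df = 18).
        env1 = [mean(sample_population(rng, 10)) for _ in range(10)]
        env2 = [mean(sample_population(rng, 10)) for _ in range(10)]
        if abs(pooled_t(env1, env2)) > 2.101:  # two-sided 5% critical value, df = 18
            hits_many_pops += 1
    return hits_one_pop / n_sims, hits_many_pops / n_sims

rate_one_pop, rate_many_pops = false_positive_rates()
print("Apparent 'environment effects' with one population each:", rate_one_pop)
print("Apparent 'environment effects' with ten populations each:", rate_many_pops)
```

Under these assumptions, the one-population design reports an "environment" difference most of the time even though none exists, because it is really detecting the difference between two populations. The many-population design stays close to the nominal error rate.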

2. The data are real – statistics are merely a way of placing a statement of confidence in an inference you draw from the data.

I have frequently seen students paralyzed by their inability to fit an appropriate error distribution in R. They spend weeks and weeks trying various options only to eventually give up and throw out the offending data. The opinion seems to be that, “if I can’t fully satisfy the requirements of a statistical test, then the data must be bad and I shouldn’t report them.” This is folly! The data are the real thing – the stats are just a tool to aid interpretation. What is infinitely better in cases where a perfect model cannot be fit is to present the data, analyze them the best possible way, and then own up to cases where the data do not fully satisfy the assumptions. The truth is that many statistical tests are extremely robust to small-to-modest violations of their assumptions as long as the P value (but see below) is not too close to the critical value.

Of course, I am not here advocating using a bad model when a better one exists. If a better model exists, by all means you should use it. However, this more practical point is already emphasized quite frequently nowadays to the point that it can become detrimental to a student’s progress, and I am here trying to push the pendulum back a bit. That is, finding the ideal model is valuable and helpful, but slavish dedication to this goal can sometimes detract from the quality of scientific education and insight. Of course, the most important thing is to have a good question and experimental design before you conduct the study, which will simultaneously improve the science and help avoid later statistical constraints.

3. It's not about the P value.

Although opinions are changing, many students are still fixated on obtaining a P value smaller than the critical level of 0.05. This goal is misguided – for three reasons. First, 0.05 is totally arbitrary. If you are focused on P values, what is much more useful is the actual P value – is it small or large? (Journals should always require actual P values in all cases.) Second, any particular set of data can be analyzed multiple ways and cycling through those options can lead to the temptation to choose the one that generates the smallest P value. Third, P values themselves (the probability, if the null hypothesis is true, of obtaining a result at least as extreme as the one observed) are a silly way to do science – sorry RA Fisher. Among the many reasons, the null hypothesis is – in traditional frequentist statistics – treated as a default rather than as an alternative model, and thus one often rejects the alternative hypothesis even when it has more support than the null hypothesis.

Instead of null hypotheses, it is much better to specify alternative hypotheses that are competed against each other with alternative statistical models to thereby judge the relative support for each hypothesis. Such comparisons can take the form of likelihood ratio tests, Bayesian credible intervals, AIC comparisons, or the like. One might argue that a level of arbitrariness creeps in here (because a standard yes-no threshold is sometimes lacking) but the truth is that such approaches are much less arbitrary because they quantitatively compare the level of support for competing hypotheses. The author can then draw whatever conclusions he/she wants from the levels of support, while still allowing the reader to draw some other conclusion from the same model comparisons should they wish to do so.
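As one concrete form of such a comparison, here is a minimal sketch of an AIC comparison in Python (standard library only). It assumes least-squares fits with Gaussian errors, for which AIC can be written as n·ln(RSS/n) + 2k up to a constant shared by all models fitted to the same data. The data values and variable names are invented for illustration, and the two candidate models ("no trend" versus "linear trend") are just stand-ins for whatever hypotheses you actually wish to compete.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors.

    Constants shared by every model fitted to the same data are dropped,
    so only differences between AIC values are meaningful.
    """
    return n * math.log(rss / n) + 2 * k

# Hypothetical data: a response measured along some gradient.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.1, 1.8, 3.5, 2.9, 4.2, 3.8, 5.1, 4.4, 6.0, 5.2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Hypothesis 1: no trend (mean only). Parameters: mean, residual variance.
rss_null = syy
aic_null = gaussian_aic(rss_null, n, k=2)

# Hypothesis 2: linear trend. Parameters: intercept, slope, residual variance.
rss_line = syy - sxy ** 2 / sxx
aic_line = gaussian_aic(rss_line, n, k=3)

# Akaike weights: relative support for each model within this candidate set.
best = min(aic_null, aic_line)
raw = [math.exp(-(a - best) / 2) for a in (aic_null, aic_line)]
weight_null, weight_line = (r / sum(raw) for r in raw)

print("AIC, no trend:", round(aic_null, 2), "weight:", round(weight_null, 4))
print("AIC, linear trend:", round(aic_line, 2), "weight:", round(weight_line, 4))
```

The output is a pair of support values rather than a yes-no verdict, which leaves the reader free to draw a different conclusion from the same comparison.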

4. Effect sizes are what matter.

P values are determined by an interaction between effect size (strength of an effect) and sample size. Thus, P values are NOT the strength of an effect. As a result, one cannot – without other information – say that a P value of 0.0001 represents a stronger effect than a P value of 0.05. It might simply be that the former analysis has a much larger sample size. Take simulation models as a particularly obvious example. In this case, one can have whatever sample size one wants given computing power and time. Thus, the exact same effect size (determined by the parameters of the simulation) can have totally different P values determined by the number of replicate simulations performed. If you have a tiny (but real) effect size, simply run more simulations and it will eventually become significant! The same logic applies to experiments and surveys. What matters are effect sizes based on how much variance in the data is explained, or based on the difference between group means weighted by the variance or the mean. Examples include R², Cohen’s d, and eta squared.
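The dependence of P on sample size can be made concrete with a few lines of arithmetic. This is a minimal sketch in Python (standard library only) under simplifying assumptions: two groups of equal size, a fixed standardized difference between group means (Cohen's d), and a normal approximation with known variance, so that the test statistic is z = d·√(n/2). The function name and the chosen values of d and n are illustrative only.

```python
import math

def two_sided_p(cohens_d, n_per_group):
    """Two-sided P value for a fixed standardized mean difference.

    Normal approximation with known variance: the standard error of the
    difference between two group means is sqrt(2 / n) in standard-deviation
    units, so z = d * sqrt(n / 2).
    """
    z = abs(cohens_d) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))

effect = 0.1  # the same small effect size throughout
for n in (20, 200, 800, 2000, 10000):
    print("n per group =", n, " P =", two_sided_p(effect, n))
```

The effect size never changes, yet the P value moves from clearly "non-significant" at small n to vanishingly small at large n.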

Of course, one still wants to place a statement of confidence in assertions about a given effect size, which is where one adds P values or – better yet – model comparisons as discussed above. Note that, when true effect sizes are small, they tend to be overestimated when sample sizes are also small, which generates the so-called funnel plot of meta-analyses. Thus, one still wants as large a sample size as possible and one would ideally correct the measured effect size for an estimate of the error – either using Bayesian approaches or through brute force. That is, a measured R² can be adjusted by the R² expected if no effect was present – with an example here.

Effect sizes (here estimates of the strength of selection) are higher when sample sizes are smaller. From Kingsolver et al. (2001 - American Naturalist).
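One possible reading of the brute-force correction mentioned above is a permutation approach: shuffle the response many times to break any real association, record the R² obtained each time, and subtract the average of those null values from the measured R². The sketch below, in Python (standard library only), is an assumption about how such a correction could be done and is not the example linked in the post. The data, the function names, and the choice to subtract (rather than rescale) are all illustrative.

```python
import random

def r_squared(x, y):
    """R² of a simple linear regression (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def null_r_squared(x, y, n_perm=2000, seed=1):
    """Average R² after shuffling y, i.e. the R² expected with no effect."""
    rng = random.Random(seed)
    shuffled = list(y)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += r_squared(x, shuffled)
    return total / n_perm

# Hypothetical small data set.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.1, 1.8, 3.5, 2.9, 4.2, 3.8, 5.1, 4.4, 6.0, 5.2]

observed = r_squared(x, y)
expected_null = null_r_squared(x, y)
adjusted = observed - expected_null

print("Measured R²:", round(observed, 3))
print("R² expected with no effect:", round(expected_null, 3))
print("Adjusted R²:", round(adjusted, 3))
```

With only ten observations the null expectation is far from zero (about 1/(n − 1) for a simple regression), which is the small-sample inflation that the funnel plot reflects.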

5. Graph your data

In many meetings with students where I am to see the outcome of their experiment or sampling for the first time, I am presented with detailed statistical tables where the student emphasizes whether or not particular effects are significant in this or that model. I find myself incapable of interpreting these results without seeing the data in graphical format. In fact, I think a student should first graph the data in a manner that addresses the original question before running ANY formal statistical tests. This aids not only the assessment of assumptions for subsequent statistical tests (hugely influential outlier errors sometimes pop up when I ask a student to do this) but also reveals – at a first glance – the gestalt effect size assessment that rarely ever changes much as time goes on, notwithstanding any ups and downs that occur in the subsequent formal statistics. In this way, the student and supervisor can have a rough picture of what the experiment has revealed before having to worry about the statistical details. I would bet that 90% of the important work (if not the time investment) is done once you graph your data in a way that informs the original hypothesis/question.

All data sets have the same means, variances, correlations, and regression lines. Only graphing shows how different they really are: Anscombe's quartet from Wikipedia.
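The point of the quartet is easy to check numerically. The sketch below, in Python (standard library only), computes the usual summaries for Anscombe's four data sets; the helper name `summarize` is made up for illustration. The summaries agree to about two decimal places, so a table of them cannot distinguish a clean linear trend from a curve, an outlier, or a single high-leverage point.

```python
def summarize(x, y):
    """Means, variances, correlation, and least-squares line for one data set."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```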

Some additional notes about statistical packages.

6. R is simply one of many useful platforms for drawing statistical inference.

Nowadays, students feel incompetent if they don’t analyze their data in R – hell, I even feel that way sometimes. However, R is simply a post-experiment tool – a hammer with which you help massage your data into optimal inference. SPSS, Systat, JMP, and SAS are also hammers – they too can massage your data. Perhaps R is a titanium hammer, better and more efficient at massaging the truth from data; but think of all the amazing inferences that were derived before R was popular. Does the failure of these countless previous studies to use R mean that we should not believe anything published before (and much published after) 2002? (Of course, re-analysis does change the conclusions of some previously-published and superficially-analyzed studies.) Does the fact that something else will eventually replace R mean that our current inferences with R then become incorrect? Nonsense. Valid and excellent inference can be obtained with any number of statistical packages.

Given that R is now the most common statistical program, it does make sense for new (and old) researchers to start with (or switch to) R. However, the main advantage is not – in my opinion – dramatically improved inference but rather ease of communication with other scientists, such as through the sharing of code. Moreover, R has many other components not present in canned packages, such as data exploration tools, connection to database and file system structures on your computer, if-else statements, while loops and other programming tools, detailed plotting functions, and connection to other programming languages such as C++ and Python, just to name a few. It also contains user-motivated novel statistical tools for specific applications that are simply not available in other packages.

How to program a Christmas tree in R.

In reality, however, most scientists seek much simpler assistance from statistical analysis, for which other packages can do the trick. Moreover, efforts to master R can take so much time and dedication that students sometimes neglect what is really important in science: good and novel ideas, good experimental design, diligent execution with high replication and large sample sizes, effective visual presentation of information, and common sense deduction. I would much rather have a student who mastered these skills and analyzed their data in SPSS than I would have a student who was an R whiz but neglected the key skills of scientific investigation. Of course, what I really want is a student who can do both, but the former is vastly more important. (Of course, most students who do learn R certainly don’t regret it afterward.)

7. R has its own foibles.

Any statistical program has bugs or flaws, and R is no different. Many issues with existing packages have been pointed out well after those packages were used in published studies. The simple fact is that R is modified by many people and can (like other statistical packages) suffer from the inadvertent introduction of errors that it takes time for others to discover and the originators to correct. Moreover, R has its own set of defaults that can be confusing or misleading. For instance, the standard default in R is Type I sums of squares (SS), whereas the default in many other stats packages is Type III SS. These different SS options have their own sets of positives and negatives and supporters and detractors. However, one must understand the differences between them. Of critical importance, Type I SS is sequential: the first term of the model is fitted first, and each later term is assessed only after accounting for the terms entered before it. Type III SS instead assesses each term after accounting for all of the other terms in the model. As a result – and as my students found out – you can get very different results if you run the same analysis in R and some other package, as well as if you change the order of entry of the terms in the model in R. (For my money Type III SS is usually more appropriate and my students now usually specify this option in R.)
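The order-dependence of sequential sums of squares can be seen in a tiny worked example. This is a minimal sketch in Python (standard library only, so not R's own output), assuming a regression with two correlated continuous predictors and no interaction; in that simple case, the SS for each term adjusted for the other term is what a Type III analysis would report. The data and all names are made up for illustration.

```python
def cross_product(u, v):
    """Sum of products of deviations from the means of u and v."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))

# Hypothetical data with two correlated predictors.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1, 3, 2, 5, 4, 6]

s11 = cross_product(x1, x1)
s22 = cross_product(x2, x2)
s12 = cross_product(x1, x2)
s1y = cross_product(x1, y)
s2y = cross_product(x2, y)

# SS explained by each predictor fitted on its own (plus an intercept).
ss_x1_alone = s1y ** 2 / s11
ss_x2_alone = s2y ** 2 / s22

# SS explained by both predictors together (solve the 2 x 2 normal equations).
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
ss_both = b1 * s1y + b2 * s2y

# Sequential (Type I style) SS: each term is adjusted only for earlier terms.
order_x1_first = {"x1": ss_x1_alone, "x2": ss_both - ss_x1_alone}
order_x2_first = {"x2": ss_x2_alone, "x1": ss_both - ss_x2_alone}

# Fully adjusted SS: each term is adjusted for the other, whatever the order.
adjusted = {"x1": ss_both - ss_x2_alone, "x2": ss_both - ss_x1_alone}

print("Sequential, x1 entered first:", order_x1_first)
print("Sequential, x2 entered first:", order_x2_first)
print("Each term adjusted for the other:", adjusted)
```

The total explained SS is identical in both orders, but how it is divided between the predictors changes with the order of entry. Only the last term entered receives the same SS as in the fully adjusted analysis.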

It is important to make clear that I am not suggesting that students forsake the use of R for some other package. In most cases, they should probably use R. What I am instead saying is that learning R is not the most important (although it could be the most useful) thing you do in your education. Do not think that R = science and that, if you don’t learn R, you are not a good scientist. Instead, think of R as a titanium hammer. If you need that hammer, then use it. If you don’t yet have any hammer, you might as well go titanium if you have the time. However, remember not to equate knowledge of R with intelligence or with a good study or with your own sense of self-worth. Learn R for the right reasons and don’t let it become your raison d’être – unless you wish to specialize in statistical analyses. Indeed, statistics and the development of R packages are certainly a branch of science in their own right – but my focus in the present post is empirical biologists who do not have a special interest in developing statistical methods.


There are some basic thoughts about statistics that are sometimes lost or forgotten in this brave new world of R-based statistics. The truth is, I am not a statistics expert by any stretch of the imagination, and so I have concentrated my comments on more basic, perhaps even philosophical, points. However, so much training is now provided in the mechanics of statistics, and R, that I think it is these more basic points that you are more in danger of forgetting or foregoing. Having said all this, it is perhaps time for #SPSSHero to also become #RHero, instead of relying on my lab members to do all the heavy lifting while I simply sit around and complain about it.


Links added later:


  1. One of the main advantages of R is simply that it's free ...

  2. Well, I am not so convinced by that argument. I can get several stats packages free through my university - plus most folks buy R books and attend R workshops, both of which can be expensive. In short, I am not sure the cost of R (at least for students) is really that much less than the cost of SAS, for example.

    1. What happens when those students graduate and no longer have access to cheap or reduced-cost SAS?

    2. I agree that one should not count on having access to the same proprietary software after one has moved to a different institution. Besides, it is not "free" when your university buys the licences. I guess they would not mind spending the money differently.

  3. The fact that R is free certainly contributed to my interest in it, because one never knows the future. McGill might have provided me with this or that statistical package for free, but I knew I wouldn't be at McGill forever, and I didn't know whether I would continue to have free access to those packages. I also didn't know whether collaborators at other universities would also have free access to the same software as me. R is not just free at McGill, it is free everywhere, forever. Students who are thinking about their long-term future in academia don't want to invest time and energy in learning a proprietary package, I think.

  4. Great article!

    Regarding the cost argument - SAS and SPSS may be free to you but site licenses are expensive for the universities.

    One advantage of tools like R is that they can easily be integrated into a script-based analysis so that it is more reproducible by the investigator and other researchers. We have written several papers that are executable R documents.

    I'm interested in hearing more about people's thoughts on reporting p values. We went around and around with a reviewer before giving up. They wanted any mention of a p value stripped, and wanted us merely to indicate whether it was significant or not relative to an alpha value that was stated in the methods. There seems to be mixed opinion on what is best. Ideas here?

  5. I have been wanting to chip in on this post for some time, but wasn't sure how. On the one hand, I respect Andrew's pragmatic approach and certainly applaud his emphasis on the importance of data visualization and effect size. On the other hand, I think we may have different philosophical stands on data analysis. I am of the (perhaps extreme) view that statistics is not an 'aid' to understand the data, but the mathematical formulation of the very scientific process involved in gathering and interpreting it. It's not an ancillary tool; it is the very definition of your hypothesis, experimental design and scientific philosophy. Data is there to feed that and, while indeed I agree that data is 'true' (which would make me a Bayesian, I guess ;) ), it is just one of the many possible observations/samples of reality (and reality, at the same time, is just a singular realization of the processes behind it, but that's perhaps getting too metaphysical).
    So, what's the point of me getting so 'philosophical' in this post, which is obviously about pragmatism? My point is that I have a hard time understanding how one can do model-free science. And by model I don't necessarily mean mathematical ones. The point I am trying to get to here is that a graph is a model too. What and how you choose to plot things depends on the model you have in your head of how data ought to be understood/classified. An 'innocent' barplot and standard errors will assume many things about the data: for example, that the mean and standard errors are the right measure of tendency to look at (that is only 'obvious' for symmetric distributions). The very scale of the y axis is a model assumption: it assumes the process behind the data is additive (as opposed, e.g., to a log scale). Any summary of data, however descriptive, is a model with assumptions at some level. We cannot avoid the fact that what we see is the product of the tool we use to see it in the very same way we cannot measure UV light with an RGB camera, or hear ultrasonic frequencies with human ears. The same goes for a linear model, as Andrew well illustrates with the Wikipedia figure, as for a bar plot.
    So to reconcile Andrew and me, by all means plot your data, but please don't fool yourself into thinking that this is a 'tool free' way to immunize you from model assumptions. After all, human visual perception is also just a model with its own filters, biases and assumptions, too (think of our remarkable ability to see patterns from randomness, like bunny-shaped clouds or Jesus Christ pancakes).
    Finally, a short comment on R. I confess I love R, as Andrew well knows. I like it not only because it is free (and thus accessible to any researcher of any institution however wealthy it is (or not)), but because it forces people to be explicit about what they are doing. Sure, there are packages with default assumptions that one can ignore... but that's a clear misuse of R... the assumptions are all there in the help file (and ultimately the open source). Once you write code, it is a testament to what you have done exactly that can be evaluated by anybody, replicated, corrected, criticized... and with journals more and more asking for code and data in the supplementary material, I see it as the most transparent way to do science to date. I don't see R as a fad or snob-tool, I see R as an opportunity to make science more transparent and universal. After all, if there are mistakes, the entire scientific community can correct them, which is not the case for commercial packages. Why should I trust SAS programmers to make fewer mistakes than the entire open statistical community that watches R?

  6. Perhaps to say it more clearly. Although I agree that the data 'is true', I do not agree that it is the truth we're after. Is science after understanding data (or the particular data we have in hand), or is data gathering simply the tool we use to understand natural processes, by acknowledging that they govern the kind of data we can record?

  7. I think a benefit of R over s/w that is less flexible is that the "R way" of working makes it more likely you will detect and address anomalies in your dataset.

    For the record, I spent a year as a SAS programmer, many moons ago, so speak from experience. The rigidity of such s/w encourages you to simply rejoice when the data imports and go straight to modelling and inference. Even if you do find some data problems, something like SAS makes it harder to, e.g., script the data prep & cleaning, so that it becomes a proper part of the analysis record.

    Totally agree you can do great science without R, but I think there is some causality here.