This fall, I wrote a series of “How to” blog posts that proved somewhat popular, or at least well-read:
How to write/present science – 2700+ views
How to be a reviewer/editor – 2600+ views
Where to submit your paper – 4000+ views
I hadn't initially planned a series like this; it just kind of emerged. However, I had long planned one particular “How to” post. Ironically, that post was the one I still hadn’t written. Now that it is 2015, the time seems ripe to get back to the original idea. (Thanks to Ben Haller, Gregor Rolshausen, Joost Raeymaekers, and Chuck Fox for critical comments that helped improve this post.)
How to do statistics.
I used to teach statistics. Really! I was a whiz at SPSS and Systat, and I could find my way around JMP. I was almost at the cutting edge, which then was SAS. No one complained seriously about the stats in the papers I submitted. Now, it seems that – with the same statistical skills as before, and maybe even a bit better – I have become a dinosaur. Increasingly, the feeling seems to be that you can’t be considered even moderately competent at statistics unless you can do a GLMM in R. In this sea-change from [insert your previous stats package here] to R, I feel that several important points are getting lost – or at least under-emphasized. My goal in the present post is to revisit what statistics are supposed to be for and how you should do them. I do not mean the details of how to choose and run a particular model but rather how to view stats as a way of enhancing your science and refining your inference. I will outline these ideas through a series of assertions.
1. It’s all about the (appropriate) replication
An incredibly important route to improving your science is to maximize replication at the appropriate level of inference. Imagine you are interested in a particular effect, say the difference in an experiment between two treatments or the difference in some trait between populations in two environments. Here you need to strive for maximum replication of the two treatments or the two environments. This might seem obvious but – as a reviewer/editor – I have seen many studies where people wish to make inferences about the effects of two environments, yet they have studied only one population in each environment. In such cases, they are entitled to draw conclusions about differences between the two studied populations but not between the two environments because – with only one population per environment – the investigator cannot gauge the difference between environments in relation to variation within environments. That is, it is quite possible that two populations within each environment would differ just as much as two populations sampled from the different environments. While the temptation is to get larger sample sizes for each measured population, what is much more important is to sample many populations. I have seen many papers rejected for lack of replication at the level for which inferences are desired.
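To make the point about the level of replication concrete, here is a minimal simulation sketch in Python (standard library only). Everything in it is an assumption chosen for illustration: the function names, the standard deviations, and the sample sizes are hypothetical. The environment has no effect at all, yet populations differ from one another at random. Design A samples one population per environment and treats individuals as replicates; Design B samples ten populations per environment and treats population means as replicates.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t, hardcoded to avoid dependencies.
T_CRIT_DF198 = 1.972  # two groups of 100 individuals
T_CRIT_DF18 = 2.101   # two groups of 10 population means


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; assumes len(a) == len(b)."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / (2 * pooled_var / n) ** 0.5


def sample_population(rng, n_ind, sd_pop=1.0, sd_ind=1.0):
    """Draw individuals from one population; the environment has NO effect."""
    pop_mean = rng.gauss(0.0, sd_pop)  # random population-level deviation
    return [rng.gauss(pop_mean, sd_ind) for _ in range(n_ind)]


def false_positive_rates(n_sims=1000, seed=1):
    """Return the share of 'significant' environment effects under each design."""
    rng = random.Random(seed)
    hits_one_pop = hits_many_pops = 0
    for _ in range(n_sims):
        # Design A: one population per environment, 100 individuals each.
        a1 = sample_population(rng, 100)
        a2 = sample_population(rng, 100)
        if abs(pooled_t(a1, a2)) > T_CRIT_DF198:
            hits_one_pop += 1
        # Design B: ten populations per environment, 10 individuals each;
        # the population means are the replicates.
        b1 = [statistics.mean(sample_population(rng, 10)) for _ in range(10)]
        b2 = [statistics.mean(sample_population(rng, 10)) for _ in range(10)]
        if abs(pooled_t(b1, b2)) > T_CRIT_DF18:
            hits_many_pops += 1
    return hits_one_pop / n_sims, hits_many_pops / n_sims


if __name__ == "__main__":
    rate_a, rate_b = false_positive_rates()
    print(f"Design A (1 population per environment):   {rate_a:.2f}")
    print(f"Design B (10 populations per environment): {rate_b:.2f}")
```

Under these assumed settings, Design A should declare an "environment effect" far more often than the nominal 5%, because it is really detecting a difference between two populations. Design B uses the same total number of individuals but should stay close to 5%.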
2. The data are real – statistics are merely a way of placing a statement of confidence in an inference you draw from the data.
I have frequently seen students paralyzed by their inability to fit an appropriate error distribution in R. They spend weeks and weeks trying various options only to eventually give up and throw out the offending data. The opinion seems to be that, “if I can’t fully satisfy the requirements of a statistical test, then the data must be bad and I shouldn’t report them.” This is folly! The data are the real thing – the stats are just a tool to aid interpretation. What is infinitely better in cases where a perfect model cannot be fit is to present the data, analyze them the best possible way, and then own up to cases where the data do not fully satisfy the assumptions. The truth is that many statistical tests are extremely robust to small-to-modest violations of their assumptions as long as the P value (but see below) is not too close to the critical value.
Of course, I am not here advocating using a bad model when a better one exists. If a better model exists, by all means you should use it. However, this more practical point is already emphasized quite frequently nowadays to the point that it can become detrimental to a student’s progress, and I am here trying to push the pendulum back a bit. That is, finding the ideal model is valuable and helpful, but slavish dedication to this goal can sometimes detract from the quality of scientific education and insight. Of course, the most important thing is to have a good question and experimental design before you conduct the study, which will simultaneously improve the science and help avoid later statistical constraints.
3. It's not about the P value.
Although opinions are changing, many students are still fixated on obtaining a P value smaller than the critical level of 0.05. This goal is misguided – for three reasons. First, 0.05 is totally arbitrary. If you are focused on P values, what is much more useful is the actual P value – is it small or large? (Journals should always require actual P values in all cases.) Second, any particular set of data can be analyzed multiple ways and cycling through those options can lead to the temptation to choose the one that generates the smallest P value. Third, P values themselves (the probability, if the null hypothesis is true, of obtaining a result at least as extreme as the one observed) are a silly way to do science – sorry RA Fisher. Among the many reasons, the null hypothesis is – in traditional frequentist statistics – treated as a default rather than as an alternative model, and thus one often rejects the alternative hypothesis even when it has more support than the null hypothesis.
Instead of null hypotheses, it is much better to specify alternative hypotheses that are competed against each other with alternative statistical models to thereby judge the relative support for each hypothesis. Such comparisons can take the form of likelihood ratio tests, Bayesian credible intervals, AIC comparisons, or the like. One might argue that a level of arbitrariness creeps in here (because a standard yes-no threshold is sometimes lacking) but the truth is that such approaches are much less arbitrary because they quantitatively compare the level of support for competing hypotheses. The author can then draw whatever conclusions he/she wants from the levels of support, while still allowing the reader to draw some other conclusion from the same model comparisons should they wish to do so.
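As one illustration of competing models against each other, the sketch below compares two simple Gaussian models with AIC: one in which both groups share a single mean, and one in which each group has its own mean. The data values and function names are hypothetical, invented only for this example, and the Akaike weights are just one common way of expressing relative support.

```python
import math


def gaussian_aic(residuals, n_params):
    """AIC for a Gaussian model, using the maximum-likelihood variance."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * n_params - 2 * log_lik


def compare_models(group_a, group_b):
    """Return AIC values and Akaike weights for 'one mean' vs 'two means'."""
    pooled = group_a + group_b
    grand_mean = sum(pooled) / len(pooled)
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)

    res_one = [y - grand_mean for y in pooled]
    res_two = [y - mean_a for y in group_a] + [y - mean_b for y in group_b]

    aic = {
        "one mean": gaussian_aic(res_one, 2),   # mean + variance
        "two means": gaussian_aic(res_two, 3),  # two means + variance
    }
    best = min(aic.values())
    rel = {name: math.exp(-0.5 * (value - best)) for name, value in aic.items()}
    total = sum(rel.values())
    weights = {name: value / total for name, value in rel.items()}
    return aic, weights


if __name__ == "__main__":
    # Hypothetical trait measurements from two environments.
    env_1 = [4.1, 5.0, 4.6, 5.3, 4.8, 5.1]
    env_2 = [6.2, 5.9, 6.8, 6.1, 7.0, 6.4]
    aic, weights = compare_models(env_1, env_2)
    for name in aic:
        print(f"{name}: AIC = {aic[name]:.2f}, weight = {weights[name]:.3f}")
```

The output is a graded statement of support for each model rather than a yes-no verdict, which leaves the reader free to weigh the evidence differently than the author does.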
4. Effect sizes are what matter.
P values are determined by an interaction between effect size (strength of an effect) and sample size. Thus, P values are NOT the strength of an effect. As a result, one cannot – without other information – say that a P value of 0.0001 represents a stronger effect than a P value of 0.05. It might simply be that the former analysis has a much larger sample size. Take simulation models as a particularly obvious example. In this case, one can have whatever sample size one wants given computing power and time. Thus, the exact same effect size (determined by the parameters of the simulation) can have totally different P values determined by the number of replicate simulations performed. If you have a tiny (but real) effect size, simply run more simulations and it will eventually become significant! The same logic applies to experiments and surveys. What matters are effect sizes based on how much variance in the data is explained, or based on the difference between group means weighted by the variance or the mean. Examples include R², Cohen’s d, and eta squared.
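The dependence of P on sample size can be sketched with a few lines of Python. The function names are hypothetical, and the P value uses a large-sample normal approximation to a two-sample test with equal group sizes, so treat the numbers as rough guides rather than exact test results.

```python
import math
import statistics


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided P for effect size d with equal group sizes.

    Normal approximation: z = d * sqrt(n / 2), P = 2 * (1 - Phi(|z|)).
    """
    z = abs(d) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    d = 0.1  # the same small effect size throughout
    for n in (20, 200, 2000, 20000):
        print(f"d = {d}, n per group = {n:>5}: P ~ {approx_two_sided_p(d, n):.2g}")
```

The effect size never changes in this loop; only the number of replicates does, and that alone moves the P value from clearly non-significant to vanishingly small.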
Of course, one still wants to place a statement of confidence in assertions about a given effect size, which is where one adds P values or – better yet – model comparisons as discussed above. Note that, when true effect sizes are small, they tend to be overestimated when sample sizes are also small, which generates the so-called funnel plot of meta-analyses. Thus, one still wants as large a sample size as possible and one would ideally correct the measured effect size for an estimate of the error – either using Bayesian approaches or through brute force. That is, a measured R² can be adjusted by the R² expected if no effect was present – with an example here.
Figure: Effect sizes (here estimates of the strength of selection) are higher when sample sizes are smaller. From Kingsolver et al. (2001 - American Naturalist).
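One brute-force way to estimate the R² expected when no effect is present is to shuffle the response variable many times and average the resulting R² values. The sketch below is my own assumption about how such a correction might be done, not necessarily the method behind the example mentioned above; the data are simulated and all names are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def null_r_squared(x, y, n_perm=2000, seed=1):
    """Mean R² after shuffling y, i.e. the R² expected with no effect."""
    rng = random.Random(seed)
    shuffled = list(y)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += r_squared(x, shuffled)
    return total / n_perm


def demo(n=10, seed=42):
    """Simulate a small sample with a weak true effect (assumed slope 0.3)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.3 * xi + rng.gauss(0, 1) for xi in x]
    observed = r_squared(x, y)
    expected_null = null_r_squared(x, y)
    return observed, expected_null, observed - expected_null


if __name__ == "__main__":
    observed, expected_null, adjusted = demo()
    print(f"observed R²:      {observed:.3f}")
    print(f"null R² (perm.):  {expected_null:.3f}")
    print(f"adjusted R²:      {adjusted:.3f}")
```

With only ten data points the null R² is far from zero (roughly 1/(n - 1) for a single predictor), which is one way to see why small samples inflate apparent effect sizes.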
5. Graph your data
In many meetings with students where I am to see the outcome of their experiment or sampling for the first time, I am presented with detailed statistical tables where the student emphasizes whether or not particular effects are significant in this or that model. I find myself incapable of interpreting these results without seeing the data in graphical format. In fact, I think a student should first graph the data in a manner that addresses the original question before running ANY formal statistical tests. This aids not only the assessment of assumptions for subsequent statistical tests (hugely influential outlier errors sometimes pop up when I ask a student to do this) but also reveals – at a first glance – the gestalt effect size assessment that rarely ever changes much as time goes on, notwithstanding any ups and downs that occur in the subsequent formal statistics. In this way, the student and supervisor can have a rough picture of what the experiment has revealed before having to worry about the statistical details. I would bet that 90% of the important work (if not the time investment) is done once you graph your data in a way that informs the original hypothesis/question.
Figure: All data sets have the same means, variances, correlations, and regression lines. Only graphing shows how different they really are: Anscombe's quartet from Wikipedia.
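The summary statistics of Anscombe's quartet are easy to check for yourself. The sketch below types in the four published data sets (transcribed from memory of the standard values, so verify them against the original before relying on them) and computes the mean, correlation, and regression slope for each; the function name is hypothetical.

```python
# Anscombe's quartet: sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return (mean of y, Pearson correlation, least-squares slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return my, sxy / (sxx * syy) ** 0.5, sxy / sxx


if __name__ == "__main__":
    for name, (x, y) in QUARTET.items():
        mean_y, r, slope = summarize(x, y)
        print(f"set {name:>3}: mean y = {mean_y:.2f}, r = {r:.3f}, slope = {slope:.3f}")
```

The printed rows are nearly indistinguishable, which is exactly the problem: a table of these numbers cannot tell a clean linear trend from a curve, an outlier, or a single influential point. A plot can.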
Some additional notes about statistical packages.
6. R is simply one of many useful platforms for drawing statistical inference.
Nowadays, students feel incompetent if they don’t analyze their data in R – hell, I even feel that way sometimes. However, R is simply a post-experiment tool – a hammer with which you help massage your data into optimal inference. SPSS, Systat, JMP, and SAS are also hammers – they too can massage your data. Perhaps R is a titanium hammer, better and more efficient at massaging the truth from data; but think of all the amazing inferences that were derived before R was popular. Does the failure of these countless previous studies to use R mean that we should not believe everything published before (and much published after) 2002? (Of course, re-analysis does change the conclusions of some previously-published and superficially-analyzed studies.) Does the fact that something else will eventually replace R mean that our current inferences with R then become incorrect? Nonsense. Valid and excellent inference can be obtained with any number of statistical packages.
Given that R is now the most common statistical program, it does make sense for new (and old) researchers to start with (or switch to) R. However, the main advantage is not – in my opinion – dramatically improved inference but rather ease of communication with other scientists, such as through the sharing of code. Moreover, R has many other components not present in canned packages, such as data exploration tools, connection to database and file system structures on your computer, if-else statements, while loops and other programming tools, detailed plotting functions, and connection to other programming languages such as C++ and Python, just to name a few. It also contains user-motivated novel statistical tools for specific applications that are simply not available in other packages.
Figure: How to program a Christmas tree in R.
In reality, however, most scientists seek much simpler assistance from statistical analysis, for which other packages can do the trick. Moreover, efforts to master R can take so much time and dedication that students sometimes neglect what is really important in science: good and novel ideas, good experimental design, diligent execution with high replication and large sample sizes, effective visual presentation of information, and common sense deduction. I would much rather have a student who mastered these skills and analyzed their data in SPSS than I would have a student who was an R whiz but neglected the key skills of scientific investigation. Of course, what I really want is a student who can do both, but the former is vastly more important. (Of course, most students who do learn R certainly don’t regret it afterward.)
7. R has its own foibles.
Any statistical program has bugs or flaws, and R is no different. Many issues with existing packages have been pointed out well after those packages were used in published studies. The simple fact is that R is modified by many people and can (like other statistical packages) suffer from the inadvertent introduction of errors that it takes time for others to discover and the originators to correct. Moreover, R has its own set of defaults that can be confusing or misleading. For instance, the standard default in R is Type I sums of squares (SS), whereas the default in many other stats packages is Type III SS. These different SS options have their own sets of positives and negatives and supporters and detractors. However, one must understand the differences between them. Of critical importance, Type I SS is sequential: it fits the first term of the model first and then tests each later term only after the terms entered before it, whereas Type III SS tests each term after accounting for all of the other terms in the model. As a result – and as my students found out – you can get very different results if you run the same analysis in R and some other package, as well as if you change the order of entry of the terms in the model in R. (For my money Type III SS is usually more appropriate and my students now usually specify this option in R.)
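The order dependence of sequential (Type I) SS can be shown without any statistics package. The sketch below uses two correlated continuous predictors and plain least-squares algebra; the simulated data, coefficients, and function names are all assumptions made for illustration. In a model with main effects only, as here, the Type III SS for a term equals its sequential SS when it is entered last.

```python
import random


def make_data(n=200, seed=1):
    """Simulate two correlated predictors that both affect y."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [a + 0.5 * rng.gauss(0, 1) for a in x1]       # correlated with x1
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return x1, x2, y


def sums_of_squares_for_x1(x1, x2, y):
    """SS attributed to x1 under different orders of entry."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))

    # Least-squares slopes for y ~ x1 + x2 (normal equations, centered data).
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    ss_both = b1 * s1y + b2 * s2y      # regression SS with both predictors
    ss_x2_alone = s2y ** 2 / s22       # regression SS for y ~ x2

    return {
        "Type I, x1 entered first": s1y ** 2 / s11,
        "Type I, x1 entered last": ss_both - ss_x2_alone,
        "Type III": ss_both - ss_x2_alone,
    }


if __name__ == "__main__":
    for label, ss in sums_of_squares_for_x1(*make_data()).items():
        print(f"SS for x1 ({label}): {ss:.1f}")
```

When x1 is entered first it takes credit for the variation it shares with x2; entered last, it gets only what x2 cannot explain. The same predictor in the same data can therefore look important or trivial depending on the order of terms and the SS type.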
It is important to make clear that I am not suggesting that students forsake the use of R for some other package. In most cases, they should probably use R. What I am instead saying is that learning R is not the most important (although it could be the most useful) thing you do in your education. Do not think that R = science and that, if you don’t learn R you are not a good scientist. Instead, think of R as a titanium hammer. If you need that hammer, then use it. If you don’t yet have any hammer, you might as well go titanium if you have the time. However, remember not to equate knowledge of R with intelligence or with a good study or with your own sense of self worth. Learn R for the right reasons and don’t let it become your raison d’etre – unless you wish to specialize in statistical analyses. Indeed, statistics and the development of R packages is certainly a branch of science in its own right - but my focus in the present post is empirical biologists who do not have a special interest in developing statistical methods.
These are some basic thoughts about statistics that are sometimes lost or forgotten in this brave new world of R-based statistics. The truth is, I am not a statistics expert by any stretch of the imagination, and so I have concentrated my comments on more basic, perhaps even philosophical, points. However, so much training is now provided in the mechanics of statistics, and R, that I think it is these more basic points that you are more in danger of forgetting or forgoing. Having said all this, it is perhaps time for #SPSSHero to also become #RHero, instead of relying on my lab members to do all the heavy lifting while I simply sit around and complain about it.
Links added later: