This fall, I wrote a series of “How to” blog posts that proved somewhat popular, or at least well-read:

How to write/present science – 2700+ views
How to be a reviewer/editor – 2600+ views
Where to submit your paper – 4000+ views
I hadn't initially planned a series like this; it just kind of emerged. However, I had long planned one particular “How to” post. Ironically,
that post was the one I still hadn’t written. Now that it is 2015, the time
seems ripe to get back to the original idea. (Thanks to Ben Haller, Gregor Rolshausen, Joost Raeymaekers, and Chuck Fox for critical comments that helped improve this post.)
How to do statistics.
I used to teach statistics. Really! I was a whiz at SPSS and
Systat, and I could find my way around JMP. I was almost at the cutting edge,
which then was SAS. No one complained seriously about the stats in the papers I
submitted. Now, it seems that – with the same statistical skills as before, and
maybe even a bit better – I have become a dinosaur. Increasingly, the feeling
seems to be that you can’t be considered even moderately competent at
statistics unless you can do a GLMM in R. In this sea-change from [insert your previous stats package here] to R, I feel
that several important points are getting lost – or at least under-emphasized. My
goal in the present post is to revisit what statistics are supposed to be for
and how you should do them. I do not mean the details of how to choose and run
a particular model but rather how to view stats as a way of enhancing your
science and refining your inference. I will outline these ideas through a
series of assertions.
1. It’s all about the (appropriate) replication
An incredibly important route to improving your science is
to maximize replication at the appropriate level of inference. Imagine
you are interested in a particular effect, say the difference in an experiment between
two treatments or the difference in some trait between populations in two
environments. Here you need to strive for maximum replication of the two
treatments or the two environments. This might seem obvious but – as a
reviewer/editor – I have seen many studies where people wish to make inferences
about the effects of two environments, yet they have studied only one
population in each environment. In such cases, they are entitled to draw
conclusions about differences between the two studied populations but not
between the two environments because – with only one population per environment
– the investigator cannot gauge the difference between environments in relation
to variation within environments. That is, it is quite possible that two
populations within each environment would differ just as much as two
populations sampled from the different environments. While the temptation is to
get larger sample sizes for each measured population, what is much more
important is to sample many populations. I have seen many papers rejected for
lack of replication at the level for which inferences are desired.
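To make this concrete, here is a minimal sketch in R (my own illustration, not from the original post) of how replicate populations within each environment let a mixed model judge the environment effect against among-population variation. The data are simulated and the lme4 package is assumed to be installed.

```r
# Hypothetical simulated example: testing an environment effect against
# among-population variation requires replicate populations per environment.
library(lme4)  # assumed installed

set.seed(1)
n_pops <- 5                                   # populations per environment
n_ind  <- 20                                  # individuals per population
env    <- rep(c("A", "B"), each = n_pops * n_ind)
pop    <- factor(rep(1:(2 * n_pops), each = n_ind))
pop_dev <- rnorm(2 * n_pops, sd = 1)          # among-population variation
trait  <- 0.5 * (env == "B") + pop_dev[as.integer(pop)] + rnorm(length(pop))

# With population as a random effect, the environment effect is judged
# relative to variation among populations, not just among individuals.
m <- lmer(trait ~ env + (1 | pop), data = data.frame(trait, env, pop))
summary(m)
```

With only one population per environment, the random effect and the environment effect would be confounded, and no such test would be possible.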
2. The data are real – statistics are merely a way of placing a statement of confidence in an inference you draw from the data.
I have frequently seen students paralyzed by their inability
to fit an appropriate error distribution in R. They spend weeks and weeks
trying various options only to eventually give up and throw out the offending data.
The opinion seems to be that, “if I can’t fully satisfy the requirements of a
statistical test, then the data must be bad and I shouldn’t report them.” This
is folly! The data are the real thing – the stats are just a tool to aid
interpretation. What is infinitely better in cases where a perfect model cannot
be fit is to present the data, analyze them the best possible way, and then own
up to cases where the data do not fully satisfy the assumptions. The truth is
that many statistical tests are extremely robust to small-to-modest violations
of their assumptions as long as the P value (but see below) is not too close to
the critical value.
Of course, I am not here advocating using a bad model when a
better one exists. If a better model exists, by all means you should use it.
However, this more practical point is already emphasized quite frequently nowadays
to the point that it can become detrimental to a student’s progress, and I am here
trying to push the pendulum back a bit. That is, finding the ideal model is valuable
and helpful, but slavish dedication to this goal can sometimes detract from the
quality of scientific education and insight. Of course, the most important
thing is to have a good question and experimental design before you conduct the
study, which will simultaneously improve the science and help avoid later
statistical constraints.
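As a hedged illustration (simulated data, standard base-R tools only), one practical workflow is: fit the best available model, inspect the diagnostics, and report the results together with an honest note about any modest violations.

```r
# Simulated example: imperfect errors are not a reason to discard data.
set.seed(42)
x <- runif(50, 0, 10)
y <- 2 + 0.8 * x + rexp(50)      # deliberately non-normal (skewed) errors

fit <- lm(y ~ x)

# Standard diagnostics: residuals vs fitted, Q-Q plot, scale-location, leverage.
par(mfrow = c(2, 2))
plot(fit)

# If the violations look modest, report the analysis and say so explicitly,
# rather than throwing out the data.
summary(fit)
```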
3. It's not about the P value.
Although opinions are changing, many students are still
fixated on obtaining a P value smaller than the critical level of 0.05. This
goal is misguided – for three reasons. First, 0.05 is totally arbitrary. If you are
focused on P values, what is much more useful is the actual P value – is it small or large? (Journals should always require
actual P values in all cases.) Second, any particular set of data can be
analyzed multiple ways and cycling through those options can lead to the
temptation to choose the one that generates the smallest P value. Third, P
values themselves (the probability of obtaining data at least as extreme as those observed if the null hypothesis is true) are a silly way to do science – sorry, R.A. Fisher. Among the
many reasons, the null hypothesis is – in traditional frequentist statistics – treated
as a default rather than as an alternative model, and thus one often rejects
the alternative hypothesis even when it has more support than the null
hypothesis.
Instead of null hypotheses, it is much better to specify
alternative hypotheses that are competed against each other with alternative
statistical models to thereby judge the relative support for each hypothesis.
Such comparisons can take the form of likelihood ratio tests, Bayesian
credibility intervals, AIC comparisons, or the like. One might argue that a
level of arbitrariness creeps in here (because a standard yes-no threshold is
sometimes lacking) but the truth is that such approaches are much less
arbitrary because they quantitatively
compare the level of support for competing hypotheses. The author can then draw
whatever conclusions he/she wants from the levels of support, while still
allowing the reader to draw some other conclusion from the same model
comparisons should they wish to do so.
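For instance, a minimal sketch of such model competition in R (using the built-in mtcars data purely for illustration) might look like this:

```r
# Compare competing models rather than testing a lone null hypothesis.
m1 <- lm(mpg ~ wt,      data = mtcars)   # hypothesis 1: weight alone
m2 <- lm(mpg ~ wt + hp, data = mtcars)   # hypothesis 2: weight + horsepower

anova(m1, m2)    # F test comparing the nested models
AIC(m1, m2)      # smaller AIC = more support; look at the difference

# Relative support can also be expressed as Akaike weights:
aic <- AIC(m1, m2)$AIC
exp(-0.5 * (aic - min(aic))) / sum(exp(-0.5 * (aic - min(aic))))
```

The Akaike weights put the support for each candidate model on a 0-to-1 scale, so the reader can judge the evidence directly rather than receiving only a yes/no verdict.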
4. Effect sizes are what matter.
P values are determined by an interaction between effect
size (strength of an effect) and sample size. Thus, P values are NOT the
strength of an effect. As a result, one cannot – without other information – say
that a P value of 0.0001 represents a stronger effect than a P value of 0.05.
It might simply be that the former analysis has a much larger sample size. Take
simulation models as a particularly obvious example. In this case, one can have
whatever sample size one wants given computing power and time. Thus, the exact
same effect size (determined by the parameters of the simulation) can have
totally different P values determined by the number of replicate simulations
performed. If you have a tiny (but real) effect size, simply run more
simulations and it will eventually become significant! The same logic applies to
experiments and surveys. What matters are effect sizes: measures based on how much variance in the data is explained, or on the difference between group means scaled by the variance or the mean. Examples include R², Cohen’s d, and eta-squared (η²).
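A small simulation (my own sketch, not from the post) makes the point: hold the true effect constant, and the P value is driven entirely by sample size, while an effect-size measure like Cohen's d stays put.

```r
# Same true effect, different sample sizes: P changes, effect size does not.
set.seed(7)

sim_one <- function(n, d = 0.2) {          # d = true standardized effect
  a <- rnorm(n, mean = 0)
  b <- rnorm(n, mean = d)
  sp <- sqrt(((n - 1) * var(a) + (n - 1) * var(b)) / (2 * n - 2))
  c(p = t.test(a, b)$p.value,              # shrinks as n grows
    cohens_d = (mean(b) - mean(a)) / sp)   # hovers around 0.2 regardless
}

sapply(c(20, 200, 2000), sim_one)
```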
Of course, one still wants to place a statement of
confidence in assertions about a given effect size, which is where one adds P
values or – better yet – model comparisons as discussed above. Note that, when true effect sizes are small, they tend to be overestimated when sample sizes are also small, which generates the so-called funnel
plot of meta-analyses. Thus, one still wants as large a sample size as possible
and one would ideally correct the measured effect size for an estimate of the
error – either using Bayesian approaches or through brute force. That is, a measured R² can be adjusted by the R² expected if no effect were present – with an example here.
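One can watch the funnel emerge in a few lines of simulated R (again my own sketch): many studies of a small true effect, with estimates scattering ever more widely as sample sizes shrink.

```r
# Simulated funnel: estimates of a small true effect (0.2) scatter more
# widely at small n; keeping only "significant" small-n studies would
# therefore inflate the apparent effect.
set.seed(99)
ns  <- sample(5:500, 300, replace = TRUE)          # study sample sizes
est <- sapply(ns, function(n) mean(rnorm(n, mean = 0.2)))

plot(est, ns, xlab = "Estimated effect", ylab = "Sample size")
abline(v = 0.2, lty = 2)                           # the true effect
```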
Figure: Effect sizes (here, estimates of the strength of selection) are higher when sample sizes are smaller. From Kingsolver et al. (2001, American Naturalist).
5. Graph your data
In many meetings with students where I am to see the outcome
of their experiment or sampling for the first time, I am presented with
detailed statistical tables where the student emphasizes whether or not
particular effects are significant in this or that model. I find myself incapable
of interpreting these results without seeing the data in graphical format.
In fact, I think a student should first graph the data in a manner that
addresses the original question before running ANY formal statistical tests.
This aids not only the assessment of assumptions for subsequent statistical
tests (hugely influential outlier errors sometimes pop up when I ask a student
to do this) but also reveals – at first glance – a gestalt assessment of effect size that rarely changes much as time goes on, notwithstanding
any ups and downs that occur in the subsequent formal statistics. In this way,
the student and supervisor can have a rough picture of what the experiment has
revealed before having to worry about the statistical details. I would bet that
90% of the important work (if not the time investment) is done once you graph
your data in a way that informs the original hypothesis/question.
Figure: Anscombe's quartet (from Wikipedia). All four data sets have the same means, variances, correlations, and regression lines; only graphing shows how different they really are.
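Conveniently, Anscombe's quartet ships with base R as the built-in `anscombe` data set, so you can reproduce the point yourself in a few lines:

```r
# Four data sets with near-identical summary statistics and fitted lines,
# but wildly different shapes -- visible only when plotted.
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Anscombe data set", i))
  abline(lm(y ~ x))   # essentially the same line in all four panels
}
```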
Some additional notes about statistical packages.
6. R is simply one of many useful platforms for drawing statistical inference.
Nowadays, students feel incompetent if they don’t analyze
their data in R – hell, I even feel that way sometimes. However, R is simply a
post-experiment tool – a hammer with which you help massage your data into
optimal inference. SPSS, Systat, JMP, and SAS are also hammers – they too can
massage your data. Perhaps R is a titanium hammer, better and more efficient at
massaging the truth from data; but think of all the amazing inferences that were
derived before R was popular. Does the failure of these countless previous
studies to use R mean that we should not believe everything published before
(and much published after) 2002? (Of course, re-analysis does change the
conclusions of some previously-published and superficially-analyzed studies.) Does
the fact that something else will eventually replace R mean that our current
inferences with R then become incorrect? Nonsense. Valid and excellent
inference can be obtained with any number of statistical packages.
Given that R is now the most common statistical program, it
does make sense for new (and old) researchers to start with (or switch to) R.
However, the main advantage is not – in my opinion – dramatically improved
inference but rather ease of communication with other scientists, such as
through the sharing of code. Moreover, R has many components not present in canned packages: data exploration tools, connections to databases and the file system on your computer, if-else statements, while loops, and other programming constructs, detailed plotting functions, and interfaces to other programming languages such as C++ and Python, just to name a few. It also contains user-contributed statistical tools for specific applications that are simply not available in other packages.
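As a small, purely hypothetical taste of that programming side (the "data" directory and file contents are invented for illustration):

```r
# Loop over a directory of (hypothetical) CSV files, skip the empty ones,
# and plot each -- the kind of scripted workflow that canned packages
# make awkward. Assumes each file's first two columns are numeric.
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
for (f in files) {
  dat <- read.csv(f)
  if (nrow(dat) == 0) next            # if-else / control flow
  plot(dat[[1]], dat[[2]], main = basename(f))
}
```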
Figure: How to program a Christmas tree in R. http://simplystatistics.org/2012/12/24/make-a-christmas-tree-in-r-with-random-ornamentspresents/
In reality, however, most scientists seek much simpler assistance
from statistical analysis, for which other packages can do the trick. Moreover,
efforts to master R can take so much time and dedication that students sometimes
neglect what is really important in science: good and novel ideas, good
experimental design, diligent execution with high replication and large sample
sizes, effective visual presentation of information, and common sense
deduction. I would much rather have a student who mastered these skills and analyzed their data in SPSS than one who was an R whiz but neglected the key skills of scientific investigation. Of course, what I really want is a student who can do both, but the former is vastly more important. (Of course, most
students who do learn R certainly don’t regret it afterward.)
7. R has its own foibles.
Any statistical program has bugs or flaws, and R is no
different. Many issues with existing packages have been pointed out well after
those packages were used in published studies. The simple fact is that R is
modified by many people and can (like other statistical packages) suffer from
the inadvertent introduction of errors that it takes time for others to
discover and the originators to correct. Moreover, R has its own set of
defaults that can be confusing or misleading. For instance, the standard
default in R is Type I sums of squares (SS), whereas the default in many other
stats packages is Type III SS. These different SS options have their own sets
of positives and negatives and supporters and detractors. However, one must
understand the differences between them. Of critical importance, Type I SS tests each term sequentially, in the order it enters the model, whereas Type III SS tests each term after adjusting for all of the others. As a result – and as my students found
out – you can get very different results if you run the same analysis in R and
some other package, as well as if you change the order of entry of the terms in
the model in R. (For my money Type III SS is usually more appropriate and my
students now usually specify this option in R.)
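A hedged sketch of the difference, using the car package (assumed installed) and R's built-in mtcars data:

```r
# Type I (sequential) vs Type III (marginal) sums of squares.
library(car)   # provides Anova(); assumed installed

dat <- transform(mtcars, cyl = factor(cyl), am = factor(am))
options(contrasts = c("contr.sum", "contr.poly"))  # needed for sensible Type III

m <- lm(mpg ~ cyl * am, data = dat)

anova(m)                               # Type I: terms tested in order of entry
anova(lm(mpg ~ am * cyl, data = dat))  # same terms, different order, different SS
Anova(m, type = 3)                     # Type III: each term adjusted for the others
```

For balanced designs the two approaches agree; it is with unbalanced data that the order of entry starts to matter for Type I SS.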
It is important to make clear that I am not suggesting that
students forsake the use of R for some other package. In most cases, they
should probably use R. What I am instead saying is that learning R is not the
most important (although it could be the most useful) thing you do in your education.
Do not think that R = science and that, if you don’t learn R you are not a good
scientist. Instead, think of R as a titanium hammer. If you need that hammer,
then use it. If you don’t yet have any hammer, you might as well go titanium if
you have the time. However, remember not to equate knowledge of R with intelligence
or with a good study or with your own sense of self-worth. Learn R for the right reasons and don’t let it become your raison d’être – unless you wish to specialize in statistical analyses. Indeed, statistics and the development of R packages are certainly a branch of science in their own right – but my focus in
the present post is empirical biologists who do not have a special interest in developing
statistical methods.
Coda
There are some basic thoughts about statistics that are
sometimes lost or forgotten in this brave new world of R-based statistics. The truth
is, I am not a statistics expert by any stretch of the imagination, and so I
have concentrated my comments on more basic, perhaps even philosophical,
points. However, so much training is now provided in the mechanics of statistics, and of R, that these basic points are the ones I think you are most in danger of forgetting or forgoing. Having said all this, it is perhaps time for #SPSSHero to also become #RHero, instead of relying on my lab members to do all the heavy lifting while I simply sit around and complain about it.
--------------------------------
Links added later:
https://scientistseessquirrel.wordpress.com/2015/02/09/in-defence-of-the-p-value/
--------------------------------
Comments:

One of the main advantages of R is simply that it's free ...
Well, I am not so convinced by that argument. I can get several stats packages free through my university – plus most folks buy R books and attend R workshops, both of which can be expensive. In short, I am not sure the cost of R (at least for students) is really that much less than the cost of SAS, for example.
What happens when those students graduate and no longer have access to cheap or reduced-cost SAS?
I agree that one should not count on having access to the same proprietary software after one has moved to a different institution. Besides, it is not "free" when your university buys the licences. I guess they would not mind spending the money differently.
The fact that R is free certainly contributed to my interest in it, because one never knows the future. McGill might have provided me with this or that statistical package for free, but I knew I wouldn't be at McGill forever, and I didn't know whether I would continue to have free access to those packages. I also didn't know whether collaborators at other universities would have free access to the same software as me. R is not just free at McGill, it is free everywhere, forever. Students who are thinking about their long-term future in academia don't want to invest time and energy in learning a proprietary package, I think.
Great article!
Regarding the cost argument – SAS and SPSS may be free to you, but site licenses are expensive for the universities.
One advantage of tools like R is that they can easily be integrated into a script-based analysis so that it is more reproducible by the investigator and other researchers. We have written several papers that are executable R documents.
I'm interested in hearing more about people's thoughts on reporting P values. We went around and around with a reviewer before giving up. They wanted any mention of a P value stripped, merely indicating whether it was significant or not relative to an alpha value that was stated in the methods. There seems to be mixed opinion on what is best. Ideas here?
I have been wanting to chip in on this post for some time, but wasn't sure how. On the one hand, I respect Andrew's pragmatic approach and certainly applaud his emphasis on the importance of data visualization and effect size. On the other hand, I think we may have different philosophical stands on data analysis. I am of the (perhaps extreme) view that statistics is not an 'aid' to understanding the data, but the mathematical formulation of the very scientific process involved in gathering and interpreting it. It's not an ancillary tool; it is the very definition of your hypothesis, experimental design, and scientific philosophy. Data is there to feed that and, while indeed I agree that data is 'true' (which would make me a Bayesian, I guess ;) ), it is just one of the many possible observations/samples of reality (and reality, at the same time, is just a singular realization of the processes behind it, but that's perhaps getting too metaphysical).
So, what's the point of me getting so 'philosophical' in this post, which is obviously about pragmatism? My point is that I have a hard time understanding how one can do model-free science. And by model I don't necessarily mean mathematical ones. The point I am trying to get to here is that a graph is a model too. What and how you choose to plot things depends on the model you have in your head of how data ought to be understood/classified. An 'innocent' barplot with standard errors assumes many things about the data: for example, that the mean and standard errors are the right measures of tendency to look at (that is only 'obvious' for symmetric distributions). The very scale of the y axis is a model assumption: it assumes the process behind the data is additive (as opposed, e.g., to a log scale). Any summary of data, however descriptive, is a model with assumptions at some level. We cannot avoid the fact that what we see is the product of the tool we use to see it, in the very same way that we cannot measure UV light with an RGB camera or hear ultrasonic frequencies with human ears. The same goes for a linear model, as Andrew well illustrates with the Wikipedia figure, as for a bar plot.
So, to reconcile Andrew and me: by all means plot your data, but please don't fool yourself into thinking that this is a 'tool-free' way to immunize yourself against model assumptions. After all, human visual perception is also just a model with its own filters, biases, and assumptions (think of our remarkable ability to see patterns in randomness, like bunny-shaped clouds or jesus-christ pancakes).
Finally, a short comment on R. I confess I love R, as Andrew well knows. I like it not only because it is free (and thus accessible to any researcher at any institution, however wealthy or not), but because it forces people to be explicit about what they are doing. Sure, there are packages with default assumptions that one can ignore... but that's a clear misuse of R... the assumptions are all there in the help file (and ultimately the open source). Once you write code, it is a testament to exactly what you have done that can be evaluated by anybody, replicated, corrected, criticized... and with journals more and more asking for code and data in the supplementary material, I see it as the most transparent way to do science to date. I don't see R as a fad or snob-tool; I see R as an opportunity to make science more transparent and universal. After all, if there are mistakes, the entire scientific community can correct them, which is not the case for commercial packages. Why should I trust SAS programmers to make fewer mistakes than the entire open statistical community that watches R?
Perhaps to say it more clearly: although I agree that the data 'are true', I do not agree that they are the truth we're after. Is science after understanding data (or the particular data we have in hand), or is data gathering simply the tool we use to understand natural processes, acknowledging that those processes govern the kind of data we can record?
I think a benefit of R over less flexible software is that the "R way" of working makes it more likely you will detect and address anomalies in your dataset.
For the record, I spent a year as a SAS programmer, many moons ago, so I speak from experience. The rigidity of such software encourages you to simply rejoice when the data import and go straight to modelling and inference. Even if you do find some data problems, something like SAS makes it harder to, e.g., script the data prep and cleaning so that it becomes a proper part of the analysis record.
Totally agree you can do great science without R, but I think there is some causality here.