Tuesday, September 17, 2019

How to make rational inferences from data

This post is motivated by the paralysis that many students encounter when attempting to fit a model to their data, typically in R. I have long been frustrated by how this process sometimes turns thinking students that are seeking new ideas into desperate technicians that are seeking engineering solutions. However, even more recently, I have become concerned by the counter-productive self-questioning hand-wringing that so many students encounter during this process – to the point that they sometimes don’t believe their own data and start to second-guess their biological questions and experiments and outcomes. Hence, I have here written a “take a step back” approach to inference where only the last 5% of the process is worrying about generating a P value or AIC difference or the equivalent, thus leaving the other 95% for thinking!

A.     TURN OFF YOUR COMPUTER. Don’t look at your data. Get out a pen and paper – or, better yet, a white board. Invite your lab mates over – maybe even your supervisor. Then proceed with the following steps. Of course, it is useful to do all of this before designing your study, but the realities of field data collection can mean that the data you end up with dictate the need to redo the following steps after data collection – but before analysis.

1.      Decide – based on your question/hypothesis/prediction – what the published “money plot” should be - the plot that will "get you paid"! That is, what graphical representation of the data will convey to the reader the specific answer to your biological question? Draw this figure or figures and indicate what interpretation you will draw from a given pattern. An example might be an x-y plot where a positive correlation would mean one thing, no correlation would mean another, and a negative correlation would mean something else again. Don’t just imagine various ways to plot your data; instead specifically design the plot(s) that will convey to the reader the answer to the question. You should be able to point to a specific pattern that would result in a specific conclusion directly relevant to your specific question.

2.      Decide how you will interpret a given effect size. For instance, if you are looking for a positive correlation coefficient between x and y, then perhaps you will compare that coefficient to a meta-analysis showing a distribution of similarly-obtained correlation coefficients. Or, to what other correlation between variables will you compare your target correlation – that is, can you define a non-causal variable that you can plot your y-axis data against – a “control” correlation that should show no true effect? Determining effect size and conveying its relative importance to the reader will be the absolute key to rational and useful inference.

3.      Figure out your unit of replication when it comes specifically to your questions and the money plot you intend to generate. In one sense, this point might be re-stated as “don’t pseudoreplicate”, which might seem obvious but – in practice – can be confusing; or, at the least, misapplied. If, for example, your question is to what extent populations or species show parallel evolution in response to a given environmental gradient, then your unit of replication for inference is the number of populations, not the number of individuals. If you have two experimental treatments that were each imposed on five experimental tanks – those tanks become your unit of replication.

4.      Decide what your fixed and random effects are. Fixed effects are factors for which you are interested in making inferences about differences between the specific levels within the factor. Random effects are, in essence, factors where the different levels are a conceptually-random selection of replicates. Random effects are things for which you can make an inference about the overall factor (e.g., different populations have different values) but not the individual levels of that factor (you would not, with a random effect, say “population A differed from population B but not population C”). Those sorts of direct among-level comparisons are not relevant to a random effect.
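
For later, once the computer is back on: the unit-of-replication decision in step 3 can be made concrete by collapsing individual measurements to one value per experimental unit before you plot anything. The sketch below is only an illustration in Python; the records, the treatment and tank names, and the `unit_means` helper are all made up for the example and are not from any real dataset or package.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (treatment, tank, trait value for one fish).
# Note the unequal numbers of fish per tank.
records = [
    ("control", "tank1", 10.0),
    ("control", "tank1", 12.0),
    ("control", "tank2", 11.0),
    ("warm", "tank3", 15.0),
    ("warm", "tank3", 17.0),
    ("warm", "tank3", 16.0),
    ("warm", "tank4", 13.0),
]


def unit_means(rows):
    """Collapse individual measurements to (mean, count) per (treatment, tank)."""
    by_unit = defaultdict(list)
    for treatment, tank, value in rows:
        by_unit[(treatment, tank)].append(value)
    return {unit: (mean(values), len(values)) for unit, values in by_unit.items()}


# One row per tank: these are the points that belong on the money plot.
tank_summary = unit_means(records)
for (treatment, tank), (tank_mean, n_fish) in sorted(tank_summary.items()):
    print(treatment, tank, tank_mean, n_fish)
```

Seven fish become four points, one per tank, which is the number of replicates you actually have. Keeping the count next to each mean also shows at a glance whether one tank contributes far more individuals than the others.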

B.     TURN ON YOUR COMPUTER AND OPEN A DATABASE AND GRAPHING PROGRAM. Excel, or something like that, is ideal here. If you are very comfortable in R already, then go ahead and use that but, importantly, do not open any module that will do anything other than plot data. Don’t attempt to fit any inferential models. Don’t attempt to statistically infer fits to distributions or identify specific outliers. Don’t generate any P values or AIC values or BIC values or log-likelihoods, etc. You are going to use your eye and your brain only! Now proceed with the following steps.

5.      Plot your data for outliers. Don’t use a program to identify them (not yet anyway) – use your eye. Look at data distributions and plot every variable against every other variable. Extreme outliers are obvious and are typically errors. These must be fixed or removed – or they will poison downstream analyses. Some errors can be easily identified and corrected by reference to original data sheets or other sources of information. If dramatic outliers cannot be fixed, delete the entire record from the dataset. Note: Don’t change or delete data just because they are contradictory to your hypotheses – the examination suggested here is hypothesis free.

6.      Decide which covariates you need to consider. If you are measuring organisms, an obvious example is body size. If you are doing experiments or observations, other examples include temperature or moisture. These covariates are things NOT directly related to your question but are instead things that might get between your data and your inference. Plot your data against these covariates to see if you need to consider them when making initial inferences from your data. It is very important to evaluate covariates within each level that you have in your data. For instance, you need to know whether body size is influencing your measured trait WITHIN each population or treatment, not across ALL data pooled.

7.      Plot your data in a fashion as close as possible to the money-plot you previously designed. If you have important covariates, make sure to correct for them as necessary. For instance, you can add an axis to your money plot that allows you to assess the key result across the range of body sizes. Make sure that your plot does not have unequal representation of experimental units (e.g., numbers of fish in different tanks) within a given level of your treatment. Otherwise, you might get tricked by one over-represented unit that has an anomalous result. This point is obviously related to the above comment about determining your unit of replication.

8.      Look at your plot and draw your inference. Does the (for example) line go in the direction you predicted? How steep is that line – that is, the effect size? How big is the difference between your control and your treatment in relation to the variation in each group (at the correct level of replication)? How does that result qualitatively compare to previous work? That comparison gives you some idea of the relative importance of the effect you have (or have not) uncovered.
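
To see why step 6 insists on looking at covariates WITHIN each level, here is a small made-up example in Python. The populations, body sizes, and trait values are invented, and `pearson_r` is a helper written just for this sketch. The correlation coefficient is used only as a descriptive companion to the plots, not as a test of anything.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient (descriptive only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical rows: (population, body size, trait value).
rows = [
    ("A", 1.0, 5.0), ("A", 2.0, 4.0), ("A", 3.0, 3.0),
    ("B", 6.0, 10.0), ("B", 7.0, 9.0), ("B", 8.0, 8.0),
]

# Pooled across ALL data: the population difference dominates.
pooled_r = pearson_r([size for _, size, _ in rows],
                     [trait for _, _, trait in rows])

# Within each population: the relationship you actually need to know about.
by_population = defaultdict(lambda: ([], []))
for population, size, trait in rows:
    by_population[population][0].append(size)
    by_population[population][1].append(trait)
within_r = {pop: pearson_r(xs, ys) for pop, (xs, ys) in by_population.items()}

print("pooled:", round(pooled_r, 2))
for pop in sorted(within_r):
    print("within", pop + ":", round(within_r[pop], 2))
```

In these invented data the trait decreases with body size inside each population, yet the pooled relationship is positive because population B is both larger and has higher trait values. A plot with the populations drawn in different colours would show the same thing to the eye.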

OK YOU ARE DONE. Congratulations. You know the answer. Write up your paper, defend your thesis, get a postdoc, get a job, get tenure, retire, and give your Nobel lecture.

Well, I suppose there is one more thing you should do – but, really, you are 95% done here in most cases. What you see in the data is the reality of the situation and you have interpreted it in light of previous work. Your eye is really really good at this stuff. The one small thing left to do is to figure out a way to state the level of confidence you have in the interpretation you have just drawn from the data. This minor thing is all that P values, AIC differences, BIC differences, confidence intervals, and so on are for. That is, the data are the real thing and you have drawn an interpretation from them – now all you need is a way of conveying to a reader how confident you are in that interpretation. I will make some suggestions in this regard, especially in relation to model fitting, in the next post.


  1. I don't understand this line: "If dramatic outliers cannot be fixed, delete the entire record from the dataset." Are you recommending that they delete outliers?

  2. If the outliers are clearly not realistic and therefore errors, and yet can't be corrected through reference to other information such as original data books, then - yes - they should be deleted.

