This post is motivated by the paralysis that many students
encounter when attempting to fit a model to their data, typically in R. I have
long been frustrated by how this process sometimes turns thinking students who
are seeking new ideas into desperate technicians seeking engineering
solutions. More recently, I have also become concerned by the
counter-productive, self-questioning hand-wringing that so many students
encounter during this process – to the point that they sometimes don’t believe
their own data and start to second-guess their biological questions and
experiments and outcomes. Hence, I have here written a “take a step back”
approach to inference where only the last 5% of the process is worrying about
generating a P value or AIC difference or the equivalent, thus leaving the other
95% for thinking!
A.
TURN OFF YOUR COMPUTER. Don’t look at your data.
Get out a pen and paper – or, better yet, a white board. Invite your lab mates
over – maybe even your supervisor. Then proceed with the following steps. Of
course, it is useful to do all of this before designing your study, but the
realities of field data collection can mean that the data you end up with dictate
the need to redo the following steps after data collection – but before
analysis.
1.
Decide – based on your
question/hypothesis/prediction – what the published “money plot” should be: the plot that will "get you paid"! That is, decide what graphical representation of the data will convey to the reader
the specific answer to your biological question. Draw this figure or figures
and indicate what interpretation you will draw from a given pattern. An example
might be an x-y plot where a positive correlation would mean one thing, no
correlation would mean another, and a negative correlation would mean something
else again. Don’t just imagine various ways to plot your data; instead
specifically design the plot(s) that will convey to the reader the answer to
the question. You should be able to point to a specific pattern that would result
in a specific conclusion directly relevant to your specific question.
2.
Decide how you will interpret a given effect
size. For instance, if you are looking for a positive correlation
coefficient between x and y, then perhaps you will compare that coefficient to
a meta-analysis showing a distribution of similarly-obtained correlation
coefficients. Or, to what other correlation between variables will you compare
your target correlation – that is, can you define a non-causal variable that
you can plot your y-axis data against – a “control” correlation that should
show no true effect? Determining effect size and conveying its relative
importance to the reader will be the absolute key to rational and useful
inference.
3.
Figure out your unit of replication when
it comes specifically to your questions and the money plot you intend to
generate. In one sense, this point might be re-stated as “don’t
pseudoreplicate”, which might seem obvious but – in practice – can be confusing
or, at the least, misapplied. If, for example, your question is to what extent
populations or species show parallel evolution in response to a given
environmental gradient, then your unit of replication for inference is the
number of populations, not the number of individuals. If you have two
experimental treatments that were each imposed on five experimental tanks –
those tanks become your unit of replication.
4.
Decide what your fixed and random effects
are. Fixed effects are factors for which you are interested in making inferences
about differences between the specific levels within the factor. Random effects
are, in essence, factors where the different levels are a conceptually-random
selection of replicates. Random effects are things for which you can make an
inference about the overall factor (e.g., different populations have different
values) but not the individual levels of that factor (you would not, with a
random effect, say “population A differed from population B but not population
C”). Those sorts of direct among-level comparisons are not relevant to a random
effect.
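This decision can be written down as a model formula long before any model is fit – a useful whiteboard exercise. A minimal sketch in R, using the widely used lme4-style notation (the names length_mm, treatment, and tank are hypothetical):

```r
# Fixed effects appear as plain terms; random effects as (1 | group).
# Here, treatment is fixed (we care about its specific levels) and
# tank is random (a conceptually-random selection of replicates).
planned_model <- length_mm ~ treatment + (1 | tank)

# Nothing has been fit -- this is just the plan, written down.
all.vars(planned_model)
```

Writing the formula down at this stage forces you to commit to which factors you will (and will not) interpret level-by-level.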
B.
TURN ON YOUR COMPUTER AND OPEN A DATABASE AND
GRAPHING PROGRAM. Excel, or something like that, is ideal here. If you are very
comfortable in R already, then go ahead and use that but, importantly, do not
open any module that will do anything other than plot data. Don’t attempt to
fit any inferential models. Don’t attempt to statistically infer fits to
distributions or specific outliers. Don’t generate any P values or AIC values or
BIC values or log-likelihoods, etc. You are going to use your eye and your
brain only! Now proceed with the following steps.
5.
Plot your data to check for outliers. Don’t use a
program to identify them (not yet anyway) – use your eye. Look at data
distributions and plot every variable against every other variable. Extreme
outliers are obvious and are typically errors. These must be fixed or removed –
or they will poison downstream analyses. Some errors can be easily identified
and corrected by reference to original data sheets or other sources of
information. If dramatic outliers cannot be fixed, delete the entire record
from the dataset. Note: don’t change or delete data just because they are
contradictory to your hypotheses – the examination suggested here is
hypothesis-free.
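As a minimal sketch of this eyeballing step in R (the measurements below are invented, including a deliberate decimal-slip error):

```r
# Hypothetical fish measurements with one obvious entry error:
# 400 mm was presumably meant to be 40.0 mm.
lengths <- c(38.2, 41.5, 39.9, 400, 42.1, 40.3)
masses  <- c(6.1, 7.0, 6.5, 6.8, 7.2, 6.6)

# Look, don't test: histograms and every-variable-against-every-other
# scatterplots make gross errors jump out to the eye.
pdf(NULL)  # null device so the example runs without a display
hist(lengths)
pairs(cbind(lengths, masses))
dev.off()

# The 400 mm "fish" is visibly impossible: check the original data
# sheet and fix it, or delete the whole record if it can't be fixed.
```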
6.
Decide which covariates you need to
consider. If you are measuring organisms, an obvious
example is body size. If you are doing experiments or observations, other examples
include temperature or moisture. These covariates are things NOT directly
related to your question but are instead things that might get between your
data and your inference. Plot your data against these covariates to see if you
need to consider them when making initial inferences from your data. It is very
important to evaluate covariates within each level that you have in your data.
For instance, you need to know whether body size is influencing your measured
trait WITHIN each population or treatment, not across ALL data pooled.
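A minimal sketch of this within-group check in R, with hypothetical simulated data (population, body_size, and trait are invented names):

```r
set.seed(1)
dat <- data.frame(
  population = rep(c("A", "B"), each = 20),
  body_size  = runif(40, 10, 30)
)
# Simulate a trait that depends on body size within each population,
# plus a population-level offset.
dat$trait <- 0.5 * dat$body_size +
  ifelse(dat$population == "A", 0, 5) + rnorm(40, sd = 1)

# Plot trait against the covariate WITHIN each population --
# pooling everything can manufacture or mask a relationship.
pdf(NULL)
for (pop in unique(dat$population)) {
  sub <- dat[dat$population == pop, ]
  plot(sub$body_size, sub$trait,
       main = paste("Population", pop),
       xlab = "body size", ylab = "trait")
}
dev.off()
```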
7.
Plot your data in a fashion as close as
possible to the money-plot you previously designed. If you have important
covariates, make sure to correct for them as necessary. For instance, you can
add an axis to your money plot that allows you to assess the key result across
the range of body sizes. Make sure that your plot does not have unequal
representation of experimental units (e.g., numbers of fish in different tanks)
within a given level of your treatment. Otherwise, you might get tricked by one
over-represented unit that has an anomalous result. This point is obviously
related to the above comment about determining your unit of replication.
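One way to guard against over-represented units, sketched in R with hypothetical tank data (treatment, tank, and growth are invented names): collapse to one value per tank before drawing the money plot.

```r
set.seed(2)
# Hypothetical experiment with UNEQUAL fish numbers per tank:
# tank c1 holds 20 fish, every other tank far fewer.
fish <- data.frame(
  treatment = rep(c("control", "warm"), times = c(30, 12)),
  tank      = c(rep(c("c1", "c2", "c3"), times = c(20, 5, 5)),
                rep(c("w1", "w2", "w3"), each = 4)),
  growth    = rnorm(42, mean = 1, sd = 0.2)
)

# One point per tank -- the correct unit of replication -- so the
# over-sampled tank c1 cannot dominate the apparent treatment effect.
tank_means <- aggregate(growth ~ treatment + tank, data = fish, FUN = mean)

pdf(NULL)
stripchart(growth ~ treatment, data = tank_means,
           vertical = TRUE, pch = 19, method = "jitter")
dev.off()
```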
8.
Look at your plot and draw your inference. Does
the (for example) line go in the direction you predicted? How steep is that
line – that is, the effect size? How big is the difference between your control
and your treatment in relation to the variation in each group (at the correct
level of replication)? How does that result qualitatively compare to previous
work – so that you have some idea of the relative importance of the effect you
have (or have not) uncovered?
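The steepness and strength of that line can be summarised descriptively, still without any P values. A minimal sketch in R with invented x-y data, computing a least-squares slope by hand:

```r
set.seed(3)
x <- runif(30, 0, 10)
y <- 0.8 * x + rnorm(30, sd = 2)

# Descriptive effect sizes read straight off the data:
slope <- cov(x, y) / var(x)  # least-squares slope (steepness of the line)
r     <- cor(x, y)           # correlation (strength of the association)
round(c(slope = slope, r = r), 2)

# These are the numbers to compare against previous work, e.g. a
# meta-analytic distribution of correlation coefficients.
```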
Well, I suppose there is one more thing you should do – but, really, you are 95% done here in most cases. What you see in the data is the reality of the situation, and you have interpreted it in light of previous work. Your eye is really, really good at this stuff. The one small thing left to do is to figure out a way to state the level of confidence you have in the interpretation you have just drawn from the data. This minor thing is all that P values, AIC values, BIC values, confidence intervals, and so on are for. That is, the data are the real thing and you have drawn an interpretation from them – now all you need is a way of conveying to a reader how confident you are in that interpretation. I will make some suggestions in this regard, especially in relation to model fitting, in the next post.
I don't understand this line: "If dramatic outliers cannot be fixed, delete the entire record from the dataset." Are you recommending that they delete outliers?
If the outliers are clearly not realistic and therefore errors, and yet can't be corrected through reference to other information such as original data books, then – yes – they should be deleted.