Saturday, September 8, 2018

On simulation, and genealogies

I’m writing this post about what I think will be a really useful new technique for forwards-time population genetics simulations, tree-sequence recording. (“I” in this post is Peter Ralph; see the author lists on the papers below for who else has been working on this.) This isn’t just about a under-the-hood technical advance: tree-sequence recording enables a bunch of other neat analyses that let you get more out of your simulations, in addition to making simulations faster (by orders of magnitude if your simulations are big). I’ll explain how and why, below. I’m particularly excited because it finally lets us simulate reasonably large populations with lots of selection on a whole-chromosome scale, and life is so much better when I can simulate the processes I am thinking about.
Everything I’m talking about here is described in two recent preprints:
There’s also some good explanation and examples in the latest SLiM manual and in the msprime documentation.
The basic idea is that by recording the genealogy of the population as you go along, you don’t have to carry around neutral mutations - you can add them in afterwards. This makes things much faster, and also gives us all the genealogies! I’ve been wanting to do this for a long time, and probably quite a few other people have had the idea also, but to make it happen took several new ideas and a lot of hard work by several people.

When should I use forwards-time versus coalescent simulations?

Let me get one thing out of the way. A forwards-time, individual-based simulation is the simple, natural thing to do: the computer keeps track of digital organisms, has them reproduce, mutate, die, etcetera. However, population genetics often uses coalescent simulations, which work by tracing paths of inheritance back through time from your sampled individuals. This is much faster than a forwards simulation because you skip over a lot of irrelevant detail, such as possible ancestors that never had offspring. Producing genome sequences with a coalescent simulation is equivalent to a forwards-time, individual-based simulation, but only if (a) there are no selective differences between genetic variants, and (b) all your populations are large. (There are some additional caveats - it’s possible to do a coalescent simulation around a single selected locus, for instance - but that is pretty much true.) This means that if those two conditions aren’t satisfied – neutrality and large populations – then you can’t use a coalescent simulation. (The technical reason is that these conditions are necessary for the reverse-time process to be Markovian.) Condition (b) also means, for subtle reasons, that you can’t do coalescent simulations in continuous space.
So, coalescent simulations are useful if you’re willing to assume neutral evolution in large, randomly mating populations that exchange migrants, but if you want to get much more complicated than that, you need to use forwards-time simulation.

What is a tree sequence?

Suppose you’ve sampled a collection of genomes. At any location on the genome, your samples are related by a genealogical tree (the “gene tree” there). As you move along the genome, the tree changes because of recombinations in ancestors somewhere back in the tree. Since each recombination only changes a small bit of the tree, nearby trees along the genome are similar. A tree sequence is a particularly compact and useful way of storing that sequence of correlated trees. Once you’ve got the trees, it’s easy to keep track of who has which alleles by just marking on the trees where every mutation occurred. The tree sequence does this, too - it stores both all the gene trees, and all the genotypes. It stores the trees very efficiently because adjacent trees are highly correlated, and it stores genome sequence very efficiently using the trees.
This data structure was introduced by Jerome Kelleher, Alison Etheridge, and Gil McVean in Kelleher et al for the purposes of implementing the super-fast coalescent simulator msprime. Its properties are discussed in that paper, and also in this recent preprint. It has been subsequently extended in various ways, notably to allow storing information about polyploid individuals. See the msprime documentation for the definitive specification. In information content, a tree sequence is equivalent to an Ancestral Recombination Graph (or “ARG”), but Kelleher et al chose the name “tree sequence” to refer to the data structure representing that information.

Why can it make my simulation go faster?

The reason that storing extra information can speed up your simulation is that it lets you ignore a bunch of other stuff. In many simulations, most mutations are neutral, and neutral mutations - by definition - don’t affect the ongoing dynamics of the simulation in any way. That means that if you know all the gene trees, you can add them in after the fact, which saves you the trouble of producing lots of mutations that don’t end up segregating in the final sample, and keeping track of genotypes at all those locations as you go along. It turns out that keeping track of all those neutral mutations really slows down forwards-time simulators, and that adding mutations after the fact turns out to be extremely fast.
Tree sequence recording isn’t free, so it doesn’t always speed things up. If your simulation has no neutral mutations (or if you forget to turn them off), it won’t help at all. Generally speaking, tree sequence recording helps more for larger simulations, and slows down very small ones. In our tests, we find that simulations that take more than a few hours are typically 1-2 orders of magnitude faster. This lets us simulate tens of thousands of individuals with whole-chromosome-scale genomes - simulations that previously would have taken weeks now take hours.

What else can I do with tree sequences?

The other really exciting thing about tree sequence recording is that, well, we end up with the tree sequence! That means that we don’t have to guess from genotypes how things are related; we can actually interrogate the gene trees themselves to find out. The msprime python package provides a lot of nice tools to work with tree sequences, and lets you iterate through all the trees in a genome very quickly. The tree sequence keeps all sorts of information: for instance, you can find out the ages of all the selected mutations, and identify any sites that had more than one mutation. You can see how well your local gene tree reconstruction method is doing, or whether those long IBD blocks are actually inherited from single common ancestors. Here’s some other fun things you can do.
Remembering individuals: By default, the tree sequence you get at the end of a simulation contains the history of the entire final generation. This might be more information than you need, but the tree sequence format is so efficient that it doesn’t matter; you can select the samples you want in post-simulation analysis. You can also ask SLiM to “remember” other individuals along the way. Normally, information extraneous to the genealogies of the final generation is discarded along the way, but remembered individuals won’t be discarded - their histories will be kept in the tree sequence for ever.
This lets us determine true local ancestry: for instance, you could introduce a few Neanderthals as remembered individuals into a simulated population of modern humans, and at the end ask which bits of genome in the final generation were inherited from those Neanderthals, just by seeing which branches of the trees trace through them. Or, you could remember entire populations at some point in the past, and then identify which segments of the final generation inherit from which ancestral population.
Recapitation: Forwards-time simulations have to start somewhere. But, you might ask, where did the first generation come from? Who were their parents? What are their genotypes? To avoid taking a stand on these questions, we generally try to simulate for long enough that their answers don’t matter - by the end of the simulation, the first generation is so far in the remote past that each of the initial individuals are ancestors to either all or none of the final generation. In other words, all the trees in the final tree sequence have coalesced to a single ancestor. However, this can take quite a long time: in a population of size N, you have to simulate for something like 10N generations. One way that people have made this requirement less onerous is by using a (much faster) coalescent simulation to provide the initial generation with history and genotypes. As I discussed above, this isn’t quite right, but using the coalescent for deep-time history is a lot less wrong than using it for modern dynamics. For instance, the effects of geography in the remote past wash out, and the deep-time portions of a genealogy, even with strong population structure, are expected to look like an unstructured coalescent. And besides: if N is very large, probably your species’ population structure looked pretty different N generations ago.
It’s easy to imagine gluing a coalescent simulation to the top of a recorded tree sequence to provide the initial generation a genealogical history (and, genotypes). But, we can go one step further: we can run a forwards-time simulation without providing the history of the first generation, and then, when it’s done, scan through the tree sequence looking for any uncoalesced segments of the genome. We can then perform a coalescent simulation using only those segments, thereby filling in only the portions of history necessary for understanding the final generation. Since those trees are missing their “heads”, which we then fill in, we call the method “recapitation”.

How does tree sequence recording work?

Imagine that as the simulator proceeds, we ask it to write down, for each new genome, who its parents were, which segments of genome it inherited from them, and the locations and states of any new mutations. This clearly tells us everything we need to know: from this information, we can reconstruct the tree sequence for every individual, ever. This is precisely what we do, and this description of how segments of genome are inherited perfectly mirrors the tree sequence data structure developed for msprime. That was the first good idea that made this possible: the data structure, and the excellent set of algorithms and tools that Jerome Kelleher implemented for it. But, we still needed another good idea, because it turns out that the history of everyone, ever, is way too much data. If N=10,000 and you simulate for 10N generations, then “everyone ever” is 109 individuals. So, we developed an algorithm, called simplification, to discard the extra stuff. It works by tracing the ancestry of the individuals we are interested in back up through the tree sequence: everything that doesn’t get traced through is discarded, for instance. (In fact, the information this process discards is exactly the information that gets ignored by a coalescent simulation; the difference is that using this method lets us extract that minimal information from (much more flexible) forwards simulations.)

What is the software?

  • SLiM v3.1: SLiM now has full support for tree sequence recording, including recording mutations, remembering specified individuals, and detecting coalescence, and can output recorded tree sequences as .trees files for analysis in Python.
  • a fwdpy11 implementation : an implementation of tree sequence recording using fwdpp, used for benchmarking in this paper.
  • ftprime: a python implementation interfacing with simuPOP, also in this paper.
  • msprime, the python package: tools for working with tree sequences, in python.
  • pyslim: a python package for reading and manipulating the extra information that SLiM stores in tree sequences, mostly a thin wrapper around msprime.
  • tskit: all the tools above use, under the hood, the same set of C code for working with tree sequences, bundled with msprime. In the future, we plan to split this code from the coalescent simulator as a standalone library.

Prior work

While working on this project we came upon a number of previous attempts in this direction. Padhukasahasram, Marjoram, Wall, Bustamante, and Nordborg kept track of a rolling window of the eight previous generations, only tracking segments having descendants over that period. A program called AnA-FiTS by Aberer and Stamatakis stored genealogies to allow putting neutral mutations on afterwards, but without the important step of simplification.
And, while working on this project, I passed the discard pile from a lab down the hall, and stumbled upon the following 28-year-old unpublished thesis, titled “The Tree Simulator”:

 It was a hefty tome, mostly 300 pages in an obscure-to-me programming language. Was this a prior implementation of the very same project? Were we only treading in the footsteps of a previous Oregon evolutionary biologist? Fortunately, Ben Haller was visiting at the time, and is fluent in Mac Pascal, the language the work was written in. We were able to determine that, in fact, the program’s purpose was to simulate realistic pictures of trees for landscape architecture drawings.

1 comment:

A 25-year quest for the Holy Grail of evolutionary biology

When I started my postdoc in 1998, I think it is safe to say that the Holy Grail (or maybe Rosetta Stone) for many evolutionary biologists w...