TL;DR – short summary
We can work out how all genomes are related to each other by looking at shared DNA sequences and deducing what their common ancestors might have looked like at the DNA level. We can then infer trees of relationships by mapping the known genomes to these pieces of ancestral DNA. This is orders of magnitude more efficient than all previous attempts to infer genetic history.
Because it efficiently captures all information on genomic ancestry, and this ancestry is responsible for similarities and differences in our DNA, our technique should permit genetic information to be stored in a fraction of the disk space normally required.
It’s been a long while since I posted anything on this blog. That’s mainly because I’m now helping with fascinating work at Oxford’s Big Data Institute, which promises to re-write some of the foundations of evolutionary biology. Together with coding on the OneZoom tree of life project, it’s what has been taking up most of my research time.
The Big Data research involves two major ideas. The first is that an evolutionary approach can be used to compress genetic data by many orders of magnitude. The second is that inferring ancestors of pieces of DNA in the past can allow us to reconstruct evolutionary history across millions of genomes. These two ideas turn out to be intimately linked, a realisation that is largely due to Gil McVean and Jerome Kelleher, my colleagues at the BDI and the principal drivers of the work I’m about to discuss.
By way of a preface, I should highlight an important connection between family trees (sometimes called “pedigrees” or “genealogies”) and evolutionary processes. The connection comes about because evolution concerns the differential inheritance of pieces of DNA, and DNA is inherited through the family tree. As Richard Dawkins and I stated in The Ancestor’s Tale:
a gene takes a single path chosen from the maze of criss-crossing routes mapped by the (people) family tree
The basic idea of our project, and others like it, is to find likely paths for every piece of DNA (or loosely “gene”) in a set of genomes. It might help to imagine these paths forming a diaphanous net, tracing out many of the routes of the complete genealogy.[a] We are aiming to trace these paths for millions if not billions of individuals, hence the title of this blog post, and of our preprint: inferring the ancestry of everyone.
This task sounds impossible. After all, each person has many millions of ancestors in their family tree. But there are two reasons to take heart. Firstly, as described in an excellent series of posts by Graham Coop, the number of ancestors from whom we inherit DNA – our “genetic ancestors”, if you will – is much, much smaller than the number of genealogical ancestors. Secondly, we can usually ignore “intermediate” ancestors.
What do I mean by intermediate? Well, think about the family tree shown here, which happens to link the two presidential candidates in the divisive 2016 US presidential election. This relationship shows them to be (at least) 19th cousins, sharing a distant pair of royal British ancestors. If all we are interested in is relationships, we only need to know that they share great-great-great-…-grandparents in the 14th century. We can forget about the identity of their intermediate ancestors: all those Percys, Roddams, Gordons, MacKays, MacLeods and so on. The genealogical relationship between the two candidates resides in the details of their most recent common ancestors: who they are (John of Gaunt & Katherine Swynford) and how long ago they lived (19 generations).
The same applies to DNA ancestry. If we want a comprehensive measure of the evolutionary links between two distant humans, or indeed between two oak trees, we don’t need to figure out the DNA sequences of all their parents, grandparents, great-grandparents, and so on. All we need are the most recent common ancestral sequences, together with some sort of timescale.
There is a complication here, because there isn’t just one most recent common ancestor (MRCA) sequence for a pair of genomes. Recombination, the mixing of DNA that happens as a prelude to sexual reproduction, means that different parts of the genome, even sections on the same chromosome, are inherited via different routes. For example, a chromosome that you inherited from your mother may consist of a mixture of parts from your mother’s mother and parts from your mother’s father. This means we can’t really think of a whole genome as following a single line of descent. Instead, different pieces of genome have their own ancestries, and therefore potentially different MRCAs.[b]
Computer simulations can give us an idea of what to expect. The picture below shows the ancestry of seven genome sequences (the seven black lines at the bottom). I have plotted their most recent common ancestral stretches of DNA as coloured lines, oldest (red) at the top. You can think of these lines as composed of an unbroken string of letters from the four-letter DNA alphabet, but at too small a resolution to make out the individual A’s, T’s, G’s, and C’s. An unbroken length of DNA sequence is sometimes known as a haplotype, so technically the coloured lines represent ancestral haplotypes (“ancestors” for short), and I call this sort of diagram an ancestral haplotype plot. The little coloured triangles above the middle of each line show the identity of the ancestor of each haplotype. Many of the lines have several triangles for the same reason I mentioned above: a single length of DNA in one generation may actually consist of sections inherited from different ancestors in previous generations. It’s worth trying to understand this plot, as it’s central to all I’m going to say.
Some lines (“ancestors”) in the plot span almost the whole genome from left to right. Others are much shorter. Together, the lines and their ancestral relationships represent the full evolutionary history of the seven “sample” genomes. To put it another way, the plot shows the best reconstruction a genetic historian can ever hope to produce from comparing these seven genomes.
The ancestral haplotype plot is an unusual way to represent DNA ancestry. You may have come across an equivalent and more conventional diagram: an evolutionary tree. In fact, if you take a single vertical slice through the ancestral haplotype plot, connecting the ancestors as per their coloured triangles, you should get one of these trees. If you switch to the other tab on the above plot, you’ll see an example I prepared earlier.
But one evolutionary tree isn’t an adequate picture of the genetic relationships shown in the plot. Each time an ancestor (i.e. a line) changes as we go from left to right in the ancestral haplotype plot, the tree changes too. This creates a series of trees along the genome: the picture here shows an example (drawn in an alternative, rectangular plotting style). Such a series of trees presents an attractive picture, but it misses an important truth. These trees are not independent of each other: the same ancestor, represented by a branch point in a tree, can be shared across adjacent trees. This is much clearer in the ancestral haplotype plot, where ancestral sequences persist across variable spans of the genome – often quite long spans if the ancestors are recent ones. As a result, there is no single point where one tree switches to a completely new one. Instead, the trees change shape gradually and incrementally along the genome.[c]
How to store everyone’s DNA on a USB stick
What’s this got to do with storing DNA sequences? Well, there’s a rather neat way to recreate DNA sequences given their evolutionary ancestry. It’s based on the observation that DNA differences between people reflect copying mistakes that they have inherited from their ancestors. We can think of a change in the past as occurring on one branch of an evolutionary tree: all the descendants of that branch therefore inherit that particular difference. If we place each DNA change (or “mutation”) on the correct branch of the correct tree, it’s easy to work out the pattern of inherited differences.
In the simple example here there are seven DNA sequences, containing four points of difference, marked in red at positions a, b, c, and d. The ancestry of these seven sequences is described by the trees above them. From the trees, you should be able to see that mutation a occurs on the leftmost branch, and is therefore inherited by sequences 2 and 3 (as shown in the sequence below the tree). The same goes for the other three mutations. The trees and the position of the mutations fully account for the differences between the seven sequences. Together with the default (reference) sequence, that’s enough to recreate the full DNA dataset.
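To make this concrete, here is a minimal sketch in Python of the decoding idea (my own toy data structures, not the authors’ actual file format): a tree plus a list of mutations is enough to recreate every sample’s sequence, because each mutation is applied to exactly the samples descended from the branch it sits on.

```python
# Toy decoding of DNA sequences from a tree plus mutations.
# All names and data below are hypothetical illustrations.

REFERENCE = "GATTACA"  # made-up 7-letter reference sequence
SAMPLES = [0, 1, 2]    # leaf (sample) node ids

# One tree covering the interval: node -> parent (None marks the root).
# Samples 0 and 1 descend from internal node 3; node 4 is the root.
TREE = {0: 3, 1: 3, 2: 4, 3: 4, 4: None}

# Mutations: (position, node whose branch carries the change, new letter)
MUTATIONS = [(1, 3, "C"), (5, 2, "T")]

def descendants(tree, node, samples):
    """Return the sample leaves at or below `node` in `tree`."""
    out = set()
    for s in samples:
        u = s
        while u is not None:
            if u == node:
                out.add(s)
                break
            u = tree[u]
    return out

def decode():
    """Recreate every sample's sequence from the reference, the tree,
    and the mutations placed on its branches."""
    seqs = {s: list(REFERENCE) for s in SAMPLES}
    for pos, node, letter in MUTATIONS:
        for s in descendants(TREE, node, SAMPLES):
            seqs[s][pos] = letter
    return {s: "".join(v) for s, v in seqs.items()}
```

Here the mutation on node 3’s branch is inherited by samples 0 and 1 only, while the mutation on sample 2’s own branch is private to it, mirroring how mutation a in the figure is inherited by sequences 2 and 3.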
My colleague Jerome Kelleher has come up with a very efficient way to store both the positions of mutations in a tree and, more importantly, the evolutionary trees along a genome (essentially by storing the lines in the ancestral haplotype plot). Because it’s possible to fully recreate DNA sequences from this information, his “succinct tree sequence” format[d] is what I’ve termed an evolutionary encoding of human DNA. And not just humans: it can potentially serve as an evolutionary encoding for any life on Earth.
I’m sure you’ll agree that this is a fascinating idea for an evolutionary biologist. And it’s just as compelling for computer scientists, because these “tree sequences” are amazingly compact. Essentially, we are using our knowledge of inheritance to devise a near-perfect compression method for genetic data. To give you an idea, the plot to the left, from our preprint, shows a rough extrapolation of how much disk space (on a logarithmic scale!) it would take to store genetic data from every human on the planet. Tree sequences have the potential to solve our DNA storage problems forever.
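As a back-of-envelope illustration (the tree and branch counts below are made up, though of realistic shape), the compactness comes from sharing: a branch common to thousands of neighbouring trees is stored once, as a single edge spanning a genomic interval, rather than once per tree.

```python
# Rough arithmetic, with hypothetical numbers, showing why storing
# trees as a shared-edge "tree sequence" is so compact.

n_trees = 10_000          # distinct trees along one chromosome
branches_per_tree = 200   # roughly 2n branches for n = 100 samples

# Storing every tree independently repeats almost every branch:
naive = n_trees * branches_per_tree

# Storing the first tree, then roughly one changed edge per tree
# transition (one subtree-prune-and-regraft between neighbours):
succinct = branches_per_tree + (n_trees - 1)

print(naive, succinct, naive // succinct)  # ~200x smaller in this toy case
```

The real saving in the preprint is measured on genuine data, but the principle is the same: the cost grows with the number of tree *changes*, not with the number of trees times their size.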
The ancestry of everyone
It’s all very well to propose tree sequences as a means of storing genetic information, but that requires knowing the ancestry of the sequences we want to store. If we have (say) a million DNA sequences, that means trying to reconstruct many evolutionary trees, each with a million tips. Now, many decades of research have gone into methods to infer evolutionary trees from real data, and all methods proposed so far are what one might call “computationally challenging”. At best, they can deal with trees with thousands or tens of thousands of tips, certainly not millions.
This is where the second part of our research comes in, based on an inspired idea by Gil McVean. It relies on the fact that there is a pretty efficient statistical computer method, called the Li and Stephens algorithm, which matches a target DNA sequence against a set (or panel) of other DNA sequences. This technique calculates, for each region of the target, which sequence in the panel is the best match. Or to put it another way, it finds the most likely source in the panel from which the target has “copied” its DNA letters. It’s an elegant algorithm that works pretty quickly, even for a panel of thousands (although perhaps not millions) of sequences. Nowadays it’s commonly used to match existing sequences against a set of well-studied reference sequences, a reference panel. Nice … yet that doesn’t sound particularly useful in our case, since we have millions of DNA sequences. That’s quite a lot of matching to do, and anyway, it’s not clear which sequences should serve as targets and which should be put into a reference panel.
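For the curious, here is a much-simplified sketch of the Li and Stephens idea (the probabilities are illustrative placeholders, and the real algorithm is considerably more refined): a Viterbi pass that decides, at each site, whether the target keeps copying from the same panel haplotype or recombines onto a different one.

```python
# Simplified Li & Stephens copying model: find the most likely
# sequence of panel haplotypes that a target sequence "copies" from.
# p_recomb and p_mismatch are illustrative, not calibrated rates.

from math import log

def copying_path(target, panel, p_recomb=0.01, p_mismatch=0.001):
    k, m = len(panel), len(target)
    stay, switch = log(1 - p_recomb), log(p_recomb / (k - 1))
    match, mismatch = log(1 - p_mismatch), log(p_mismatch)

    def emit(j, site):
        return match if panel[j][site] == target[site] else mismatch

    best = [emit(j, 0) for j in range(k)]   # log-prob of copying hap j so far
    pointers = []
    for site in range(1, m):
        prev_best = max(range(k), key=lambda j: best[j])
        new, ptr = [], []
        for j in range(k):
            # either keep copying haplotype j, or recombine in from the best
            if best[j] + stay >= best[prev_best] + switch:
                new.append(best[j] + stay + emit(j, site))
                ptr.append(j)
            else:
                new.append(best[prev_best] + switch + emit(j, site))
                ptr.append(prev_best)
        best = new
        pointers.append(ptr)

    # trace back the most likely sequence of copied haplotypes
    path = [max(range(k), key=lambda j: best[j])]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return path[::-1]
```

For example, matching a target `"AATTT"` against a two-haplotype panel `["AAAAA", "TTTTT"]` recovers a path that copies the first haplotype for two sites, then recombines onto the second.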
But what if we had the DNA sequences from our common ancestors? That is, what if, by some magic, we were simply handed the “ancestral haplotypes” (the coloured lines in the plot)? Then we could use these ancestral sequences as the equivalent of a reference panel: younger ancestors could copy from a reference panel of older ancestors, and finally the millions of current-day DNA sequences could copy from a reference panel of all the ancestors. Not only would we be using the Li & Stephens algorithm in a philosophically correct way (after all, descendants really do copy DNA from their ancestors), but as we’ve seen, there aren’t really that many ancestral sequences to copy from. Moreover, these ancestral haplotypes tend to be quite short, at least in the distant past, which further reduces the amount of matching we would need to do. Maybe, just maybe, that would make the problem solvable, perhaps even on a normal desktop computer. Then, by working out which sequences copied from which other ones in the past, we could reconstruct all knowable ancestral history![e]
Well, it turns out that we can do something almost as good as that. We can make a pretty good guess as to the sequences of our common ancestors in the past, because we can usually assume that a DNA change seen in some (but not all) people has a unique origin.[f] Imagine that you and I have a DNA sequence that contains a “T” at one particular (“focal”) position, but everyone else in the world has a “G” there instead. I can be pretty sure we both got our T from a common ancestor, and a recent one at that (since no-one else has it). In other words, we do know something about this particular piece of ancestral DNA. We know it contained a T at that particular position in the DNA sequence. That’s a start.
We can also work out a little more about the sequence in our hypothetical common ancestor. If everyone in the world has a C to the right of that position, then the ancestral DNA that you and I share almost certainly had a C there too. And if, at the next position along, we have an “A” but so do a thousand other people, then our recent common ancestor probably had an A. If, however, these thousand have an A, but we (along with the rest of the world) have a “G”, then our common ancestral DNA probably had a G instead. We can use this sort of logic, rightwards and leftwards from the focal position, to build up a good guess as to the recent common ancestral DNA sequence that you and I must have shared.
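A toy version of this reasoning (not the real ancestor-building algorithm, just a sketch of the logic above) can be written in a few lines: the guessed ancestor carries the shared letter at the focal site, and at neighbouring sites it takes the letter most common among the carriers of the focal mutation.

```python
# Toy ancestral-haplotype guess: consensus of the carriers of a focal
# mutation, with the derived letter fixed at the focal site itself.

from collections import Counter

def guess_ancestor(sequences, focal, derived):
    """Guess the ancestral haplotype shared by all sequences that carry
    the `derived` letter at the `focal` position."""
    carriers = [s for s in sequences if s[focal] == derived]
    ancestor = []
    for site in range(len(sequences[0])):
        if site == focal:
            ancestor.append(derived)
        else:
            # majority letter among the carriers at this site
            ancestor.append(Counter(s[site] for s in carriers).most_common(1)[0][0])
    return "".join(ancestor)
```

So for five toy sequences in which three carry a “C” at position 1, the guessed ancestor is the carriers’ consensus with that C fixed in place.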
A neat point about this ancestor reconstruction process is that we don’t need to build up the entire DNA sequence of our shared ancestor. As we saw in the simulations, the further back in time the ancestor (and therefore potentially the more unreliable our method), the smaller the chunk of ancestral DNA that we are likely to need to fit into our evolutionary history.
I think the easiest way to grok what we’re doing is to think back to the ancestral haplotype plot. We first try to make an educated guess as to the DNA sequence of each of the ancestral haplotypes (the coloured lines on the plot), by focussing on mutations shared by some but not all people. We don’t have to guess the whole sequence, only enough to be reasonably sure we have covered the true ancestral region. As in the plot, these ancestors need to be ordered by age; we do this by assuming that the mutations shared by very few people are likely to be more recent.[g] Then we add the arrows (the order of copying) to the ancestral haplotype lines on the plot, by using the Li and Stephens algorithm: older ancestors are treated as a sort-of reference panel against which to compare each younger ancestor in turn. Having worked out the most likely sequence of copying between our ancestors, we then match the real data against all our reconstructed ancestors, to create a final tree sequence of the entire dataset.
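The ordering step can be sketched as follows (a simplification of the real procedure): mutations carried by more samples are treated as older, and each ancestor is matched against the growing panel of all ancestors reconstructed before it.

```python
# Toy scheduling of ancestor matching: oldest (highest-frequency)
# mutations first, each matched against the panel of older ancestors.

def order_by_frequency(mutation_carriers):
    """mutation_carriers maps each mutation to its set of carrier ids.
    Returns mutations oldest-first (i.e. most carriers first)."""
    return sorted(mutation_carriers, key=lambda m: -len(mutation_carriers[m]))

def build_panel_schedule(mutation_carriers):
    """Yield (mutation, panel_so_far) pairs: each ancestor is matched
    against the panel of all older ancestors, which it then joins."""
    panel = []
    for m in order_by_frequency(mutation_carriers):
        yield m, list(panel)
        panel.append(m)
```

In this sketch the first (oldest) ancestor has an empty panel to match against, and the youngest is matched against everything older, just as the real samples are finally matched against all reconstructed ancestors.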
This method is astoundingly efficient. Here’s a plot showing how much computer time our “ts-infer” method takes to resolve the ancestry of tens of thousands of simulated DNA sequences. You can see that even with a hundred thousand sequences (a hundred thousand tips on each evolutionary tree) it takes on the order of hours. What’s more, unlike the other method shown in the plot, or other methods that take even more time, the computing effort increases roughly linearly as the number of sequences goes up. That means it only takes a few days of computing power to achieve what previously seemed impossible: inferring the shared ancestry of millions of genomes.
Our approach might sound a bit ad-hoc in places, but we have a number of reasons to be confident that it works well. Firstly, using simulated DNA sequences, we can show that as long as we have exactly the right ancestors, the Li and Stephens matching process gives exactly the right answer. We call this perfect inference. Secondly, even if we use our “educated guess” approach to ancestor reconstruction, we do as well as, if not better than, most other methods of ancestral tree inference, at least for small numbers of sequences (we can’t test large numbers, because other methods simply won’t run at that scale, even on high-powered computers). Thirdly, when we run our method on real data, such as the 2500 people in the 1000 Genomes Project, or the million human chromosomes from the UK Biobank, we get results on ancestry and geographical distribution of relatives that make complete sense. I’ll detail these in a subsequent post, but I hope this is enough to whet your appetite for now.
Many of the gory details are to be found in our preprint on BioRxiv, or you can contact me if you want to find out any more.
Notes
a. The network of ancestral routes taken by DNA through the family tree is technically known as the ARG (for Ancestral Recombination Graph). Programs like ARGweaver have previously been developed to try to estimate the true ARG from DNA, but are not scalable to large numbers of individuals.
b. One complication is that, looking back from genomes of individuals today, it is not always obvious where the ancestral pieces start and stop. In other words, it’s hard to work out the locations on the chromosome where recombination occurred in the past.
c. In fact, if every recombination event takes place at a unique position in the genome, it can be proved that one tree will differ from the next by only one “tree difference” or “tree edit”, in which a portion of the tree is cut off and pasted back on in another place: this is also known as a “subtree-prune-and-regraft” (SPR) operation, for obvious reasons. It is this property, that adjacent trees along the genome differ only slightly, that allows us to gain such efficiency in the algorithms we produce.
d. Somewhat confusingly, “tree sequence” refers to the series or sequence of trees, rather than to DNA sequences. Sorry about that.
e. I’m being deliberately grandiose. I really mean “all history that is knowable from these particular DNA sequences”. For humans we also have oral and written history, as well as other sources of information such as fossils.
f. That’s because the DNA replication process is so accurate. In humans, on average, each letter is wrongly copied only once or twice in 100 million generations. So if we both have an altered letter, there’s an extremely good chance it’s been inherited from a common ancestor, rather than exactly the same alteration happening twice by chance. That’s not to say alterations are rare over the whole genome: we have 6 billion letters in total, so a handful of your DNA letters will almost certainly have been miscopied from your parents, even though the chance of any one of them being mutated is minuscule.
g. I’ve skated over this assumption in my blog post for two reasons. Firstly, the age ordering is only relevant for regions close together on the chromosome. Secondly, we have some other plans up our sleeve for improving the age ordering of ancestors. Regardless of the general validity of this assumption, we can prove that our method works extremely effectively on simulated DNA data.