Friday, August 17, 2018

DNA genealogy 'splainer 2: maximum parsimony

So let's take the first two rows from the table in the last blogpost: that is, the STR results from the first two individuals, who, we're assuming, are living today. As you can see, there are three changes, in DYS464-3, CDY-1, and CDY-2. There is also, of course, a common ancestor, whom we'll call person A. Now in principle, person A could have any STR count, but what we'll assume is that a minimum number of changes in STR number happened in going from A to 1 and A to 2. This makes sense; changes are rare, perhaps one every four generations. Obviously, the minimum number of total changes three, and there are four ways to do that. Now you could argue, that because A lived probably more than 100 years before person 1 and person 2, and therefore about the same length of time before each, the changes from A to 1 and from A to 2 should be approximately equal. And that is true, but it's only a probability; there is a ¾ chance the changes are split 2:1 (top row), and a ¼ chance they are split 0:3 (bottom row). For the moment, we're going to assume that isn't enough of a difference to count on (If we did, the method would be called maximum likelihood, not maximum parsimony, and I'll get to it later). So, from a maximum parsimony standpoint, these tiny two-person trees (let's call them 2-trees) are equally good.

So let's try a 3 tree. Here are three individuals. Simply by trial and error, it's easy to come up with A as the 'ancestor'; A has two differences with 1, one with 2, and one with 3, for a total of four differences. We can depict it thusly. So this time there's a unique solution for A; no other will give us the minimum four total changes.

But we've been making one mistake, and that's to call A the ancestor. Imagine the common ancestor actually had the same STR counts as 3. That common ancestor was ancestor to A, who was then the common ancestor to 2 and 1. That scenario would have the exact same number of changes. In fact, it would really be the same tree. Using just maximum parsimony data, we can't distinguish. In fact, the 'root' -- the common ancestor -- could be either 1, 2, 3 or A, or even half-way between 1 and A.

This is a limitation of maximum parsimony within a single cluster of individuals; you can find the best tree, but you can't locate the root within the tree (maximum likelihood says it's A, but that's only a probability).

When we go to four individuals, we can construct all sorts of trees, and so we need to set some rules. There are the rules I'm going to set.

• All 'ancestors', which we've seen may not really be ancestors, so we'll call them nodes -- they're the blue, lettered circles -- are connected to exactly three other points, either other nodes, or individuals
• All individuals, which we've seen might also be ancestors, are connected to only one other point, which must be a node. We'll call them vertices.
• There are no cycles or rings, where (say) A is connected to B, which is connected to C, which is connected back to A. This doesn't work, even in Appalachia.
We've entered an area of mathematics called Graph Theory. Fortunately, our trees are almost the simplest kind of trees, so we don't actually need to know much about graph theory to use it.

You might object that one ancestor might have four or more descendants. And that's true. But we can take care of it by simply saying there are two connected nodes, A and B, with 0 distance between them. Or more.

The beautiful thing is we end up with a rather small number of possible trees, and a systematic procedure to find the minimum parsimony tree. In fact, there's the only possible 4-tree that obeys our rules. And below it is the only possible 5-tree.

So if we have say five individuals and want to find the most parsimonious tree, we have a three step procedure
1. Arrange the individuals in the empty vertex slots on the 5-tree, in every possible unique configuration
2. Use a systematic or random procedure (I use the latter, an algorithm called Monte-Carlo/Metropolis) to change the STR counts of the nodes until the maximum parsimony solution is encountered
3. Select the vertex arrangement which together with maximum parsimony gives the lowest overall number of changes.
The word 'unique' is important. because it doesn't matter how we order two vertices connected to a single node, and all trees related to each other by simple symmetry operations are all the same, there are actually only three ways to create a 4-tree.

Our final post will consider how to root the tree.