Simple Distance Tree

Move the slider to add random error to the distances

Distance Error
Topological Error

Discussion

This simple demonstration shows a dendrogram (a tree diagram) constructed from the pairwise differences between DNA contributors, in this case the five hobbits Adelard, Bilbo, Cottar, Drogo, and Everard.

Adelard and Bilbo are siblings, so their tMRCA (time to most recent common ancestor) is 1, the distance in generations from each to their father. Cottar is their first cousin and has a tMRCA of 2 to each of them. Drogo and Everard are second cousins with a common great-grandfather and a tMRCA of 3. They are fourth cousins to the others, with a tMRCA of 5 for all pairs. This is the "ground truth" and doesn't change. In the real world we don't know this; it's what we want to discover by analyzing DNA.

Imagine that the hobbits have done their Y DNA tests and we've estimated the pairwise tMRCAs for all 10 possible pairs (there are N(N - 1)/2 possible pairs of N things). We aren't concerned here with exactly how the tMRCAs are estimated, except to note that there is statistical error in the process.
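The pair count and the ground-truth distances can be written out explicitly. The following is a minimal sketch, not the code behind this page; the helper name `true_tmrca` and the lookup by first initial are illustrative assumptions, while the tMRCA values are the ones given above.

```python
from itertools import combinations

HOBBITS = ["Adelard", "Bilbo", "Cottar", "Drogo", "Everard"]

def true_tmrca(a, b):
    """Ground-truth tMRCA in generations for two hobbits (hypothetical helper)."""
    pair = {a[0], b[0]}          # identify each hobbit by first initial
    if pair == {"A", "B"}:
        return 1                 # siblings
    if pair == {"D", "E"}:
        return 3                 # common great-grandfather
    if pair <= {"A", "B", "C"}:
        return 2                 # Cottar with Adelard or Bilbo
    return 5                     # any pair spanning the two groups

# All unordered pairs: N(N - 1)/2 of them for N contributors.
pairs = list(combinations(HOBBITS, 2))
truth = {pair: true_tmrca(*pair) for pair in pairs}

print(len(pairs), "pairs")
for (a, b), generations in truth.items():
    print(f"{a:8s} {b:8s} tMRCA = {generations}")
```

In practice only estimates of these values are available, which is what the slider imitates.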

Now move the slider back and forth. The farther you move it to the right, the more error you add to the distance measurements. Each error is a normally distributed random value with the mean percent shown. The typical level of error associated with a given number of Y STR markers (such as Y12, Y37, Y111) is known, and it is shown next to the percent. The errors are expressed as percentages rather than generations or years because with real STRs the error is proportional to the tMRCA value, and that is how the error is applied here.
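To make "proportional" concrete, here is a hedged sketch of one way such an error could be applied. It is not the page's actual code: the function name `add_proportional_error` is hypothetical, and treating the slider percent as the standard deviation of a zero-mean relative error, with negative results clamped to zero, is an assumption.

```python
import random
import statistics

def add_proportional_error(tmrca, percent, rng):
    """Perturb a tMRCA by a zero-mean normal error that scales with its value."""
    sigma = percent / 100.0  # assumption: slider percent is the relative std. dev.
    noisy = tmrca * (1.0 + rng.gauss(0.0, sigma))
    return max(noisy, 0.0)   # assumption: a negative tMRCA is clamped to zero

rng = random.Random(42)      # fixed seed so runs are repeatable
for true_value in (1, 3, 5):
    samples = [add_proportional_error(true_value, 20, rng) for _ in range(10_000)]
    spread = statistics.pstdev(samples)
    print(f"true tMRCA {true_value}: spread of noisy values ~ {spread:.2f} generations")
```

Because the error scales with the true value, the same percent setting disturbs the tMRCA-5 pairs by more generations than it disturbs the sibling pair.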

Watch the dendrogram as you move the slider. Notice when the error meters start giving consistently high values.

Observations

If we had thousands of markers we could recreate family trees with complete accuracy, just from the STR data. The dendrogram converges on the true tree. While a dendrogram has no ancestors on it, in this limit we could reliably assign specific ancestors everywhere that a line of the diagram crosses a generation mark (in green above). It would become a family tree.

But today we often have 111 markers, which, as you can see, implies about 20% error in any given tMRCA estimate and results in similar errors in tree distances. Topological errors (incorrect branching) are rare at this level.
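The difference between a distance error and a topological error can be sketched in code. This page does not say how its dendrogram is built, so the example below assumes average-linkage (UPGMA) clustering and the same assumed error model as before (zero-mean normal, slider percent as relative standard deviation). All names are illustrative. A tree counts as topologically wrong when the set of groups it forms differs from the true groups (AB), (ABC), (DE).

```python
import random
from itertools import combinations

LEAVES = "ABCDE"  # Adelard, Bilbo, Cottar, Drogo, Everard
TRUE_CLADES = {frozenset(s) for s in ("AB", "ABC", "DE", "ABCDE")}

def true_tmrca(a, b):
    pair = {a, b}
    if pair == {"A", "B"}:
        return 1
    if pair == {"D", "E"}:
        return 3
    return 2 if pair <= set("ABC") else 5

def noisy_distances(percent, rng):
    """Pairwise tMRCAs with a proportional normal error (assumed model)."""
    sigma = percent / 100.0
    return {
        frozenset(p): max(true_tmrca(*p) * (1 + rng.gauss(0, sigma)), 0.0)
        for p in combinations(LEAVES, 2)
    }

def upgma_clades(dist):
    """Average-linkage clustering; return the set of leaf groups it forms."""
    clusters = [frozenset(leaf) for leaf in LEAVES]
    clades = set()

    def average_distance(pair):
        x, y = pair
        total = sum(dist[frozenset((i, j))] for i in x for j in y)
        return total / (len(x) * len(y))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        x, y = min(combinations(clusters, 2), key=average_distance)
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(x | y)
        clades.add(x | y)
    return clades

def has_topological_error(dist):
    return upgma_clades(dist) != TRUE_CLADES

rng = random.Random(1)
trials = 2000
for percent in (5, 20, 30, 50):
    wrong = sum(has_topological_error(noisy_distances(percent, rng)) for _ in range(trials))
    print(f"{percent:2d}% error: {wrong / trials:.1%} of trees have incorrect branching")
```

The rates this prints depend on the assumed error model and clustering method, so they should not be expected to match the meters on this page exactly.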

Above about 30% error, which is expected for Y37 and below, the dendrogram gets chaotic and unreliable. This illustrates why STRs got a bad reputation for reconstructing family trees when all we had were Y12 and Y25 kits. These data are still very good for excluding someone from a lineage (i.e. we're definitely not related, or the defendant is innocent), but they are unreliable for tree-building or asserting inclusion (we're related or the defendant is guilty).

Note also that the big patterns are more reliable than the fine details. In this example, (ABC) are separated from (DE) in nearly all cases while the fine structure of (AB)C vs (AC)B may jump around.