Cluster Y-DNA STR or mtDNA data to produce Distance Dendrograms


Open another browser tab or window, go to the FTDNA Y-DNA or mtDNA data page of your choice (classic, not colorized), select all and copy. If the project is large, set it to display 1000+ rows, wait for a refresh, and then select all and copy. Return here, click in the box below, paste — then click the blue Start button.


Note that the analysis may be slow for more than 1000 kits. Use a fast machine with Safari, Chrome, or Firefox.



Genetic genealogy (GG) has an interesting feel to it. I'm more accustomed to physics and biochemistry, where methods and standards are older and more formal. This feels like a new discipline with contributions from traditional genealogists and statisticians, plus a strong component of citizen science; indeed, while the data are collected by the private sector, most of the innovation in analysis and presentation is done by amateurs. In that spirit I have read through the literature and made my own decisions about methods.

Why use a distance method? This type of analysis goes back to Fitch and Margoliash (1967) Science 155: 279-284. Why dredge up a 50-year-old method when we have more accurate and deterministic methods of inferring family trees (maximum parsimony and others)? There are several reasons.

If your goal is best-possible accuracy in deducing your own recent ancestry, then you should use Dave Vance's excellent SAPP tool, especially because he lets you incorporate additional information from SNPs or paper genealogy.

Genetic distance: For STRs I calculate this like everyone else. If Alan has 11 copies of the STR at locus DYS393 and Bob has 13 copies, I count that as a mismatch of 2. Likewise I calculate the mismatches for multiple loci as per this FTDNA update. For mtDNA, genetic distance is the number of basepairs that differ in HVR1 and HVR2.
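The STR mismatch count described above can be sketched in a few lines. This is a minimal illustration, assuming a plain stepwise count (the sum of absolute repeat-count differences over the loci both kits share); it does not attempt the special handling of multi-copy markers in the FTDNA rules, and the function name and sample values are hypothetical.

```python
def str_distance(kit_a, kit_b):
    """Sum of absolute repeat-count differences over loci present in both kits.

    Assumption: simple stepwise counting; multi-copy markers and any
    vendor-specific rules are ignored in this sketch.
    """
    shared = kit_a.keys() & kit_b.keys()
    return sum(abs(kit_a[locus] - kit_b[locus]) for locus in shared)


# Hypothetical sample kits: locus name -> repeat count.
alan = {"DYS393": 11, "DYS390": 24, "DYS19": 14}
bob = {"DYS393": 13, "DYS390": 24, "DYS19": 15}

print(str_distance(alan, bob))  # 3: a difference of 2 at DYS393 plus 1 at DYS19
```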

Normalization across different sets of STRs: Much of the literature of GG is written around the peculiar historical sizes of the STR kits that you can buy: 12, 37, 67, or 111 STRs. You can mix these together but have to normalize the calculated distances. I use a variation on Bruce Walsh's model to express distance in units of tMRCA, not number of mismatches, right up front. So, for example, if Alan and Bob match in 12 of 13 STRs while Chris and Dave match in 62 of 67, their distances (using μ = 0.00231, so that 1/(2μ) ≈ 216) are

tMRCA(Alan, Bob) = (1/(2μ)) × ln(n/k) = 216 × ln(13/12) = 17.3 generations, while tMRCA(Chris, Dave) = 216 × ln(67/62) = 16.8 generations

which can be averaged later during tree generation -- which allows the mixing of Y37 with Y111 data without any loss of information. I've also done a lot of simulation comparing infinite alleles to a model with a Poisson distribution with a mean of 6 alleles (which is what we have in reality). The differences are negligible back to 400 generations -- mutations are simply too infrequent to be affected by the selection boundaries over that timescale.
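The worked numbers above can be reproduced with a short function. This is a sketch of the formula only, with n taken as the number of STRs compared and k as the number that match; the factor 216 corresponds to 1/(2μ) with μ ≈ 0.00231, the Y STR default quoted under Calibrating tMRCA. The function name is hypothetical.

```python
import math


def tmrca_generations(n_compared, n_matching, mu=0.00231):
    """Generations to the MRCA: (1 / (2 * mu)) * ln(n / k).

    n_compared: number of STR loci compared (n)
    n_matching: number of those loci that match exactly (k)
    mu: mutations per locus per generation (assumed default)
    """
    if not 0 < n_matching <= n_compared:
        raise ValueError("need 0 < n_matching <= n_compared")
    return math.log(n_compared / n_matching) / (2 * mu)


print(round(tmrca_generations(13, 12), 1))  # Alan vs Bob: 17.3
print(round(tmrca_generations(67, 62), 1))  # Chris vs Dave: 16.8
```

Because both results are in generations rather than mismatch counts, a 13-marker comparison and a 67-marker comparison end up on the same scale.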

In preparation for clustering I calculate all pairwise distances as tMRCA. If two people have different length STR sets I just use the subset that they have in common. At this point the value of μ is only a linear scaling factor and does not affect subsequent clustering.
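As a rough sketch of this step, the following computes tMRCA for every pair of kits from the loci they share. The kit data and function name are hypothetical, and treating a pair with no shared or no matching loci as infinitely distant is my assumption for the sketch, not something the tool is known to do.

```python
import math
from itertools import combinations


def pairwise_tmrca(kits, mu=0.00231):
    """Map each pair of kit names to a tMRCA in generations.

    kits: dict of kit name -> {locus: repeat count}
    Only loci present in both kits of a pair are compared.
    """
    distances = {}
    for (name_a, a), (name_b, b) in combinations(kits.items(), 2):
        shared = a.keys() & b.keys()
        n = len(shared)
        k = sum(1 for locus in shared if a[locus] == b[locus])
        if n == 0 or k == 0:
            # Assumption: no usable comparison means "infinitely distant".
            distances[(name_a, name_b)] = math.inf
        else:
            distances[(name_a, name_b)] = math.log(n / k) / (2 * mu)
    return distances


# Hypothetical kits of different lengths.
kits = {
    "alan": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11},
    "bob": {"DYS393": 13, "DYS390": 24, "DYS19": 15, "DYS391": 11},
    "chris": {"DYS393": 13, "DYS390": 23},
}
distances = pairwise_tmrca(kits)
for pair, d in distances.items():
    print(pair, round(d, 1))
```

Doubling mu halves every distance by the same factor, so the relative ordering of pairs, and therefore the clustering, is unchanged.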

Clustering and Display: I use average-linkage hierarchical clustering, which produces a tree data structure appropriate for display in a dendrogram. The only question in the display is what scale to use for the distance axis (the horizontal axis as I've drawn the linear dendrogram, and the radial distance from rim to center for the circular). The diagrams use tMRCA exactly as calculated as inter-person distance and as returned for branch nodes by the clustering. Linear and log scales are available as options.
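For readers unfamiliar with average linkage, here is a minimal textbook version (UPGMA) in pure Python. It is a generic sketch, not the code behind this page: the function name, the nested-tuple tree format, and the sample distances are all assumptions made for illustration.

```python
from itertools import combinations


def average_linkage(names, dist):
    """Build a tree by repeatedly merging the two closest clusters.

    names: list of leaf names
    dist: dict mapping frozenset({a, b}) -> distance between leaves a and b
    Returns nested tuples (left, right, height); leaves are plain names.
    """
    # Active clusters: key -> (subtree, number of leaves).
    active = {i: (name, 1) for i, name in enumerate(names)}
    d = {
        frozenset((i, j)): dist[frozenset((names[i], names[j]))]
        for i, j in combinations(range(len(names)), 2)
    }
    next_key = len(names)
    while len(active) > 1:
        pair = min(d, key=d.get)  # closest pair of active clusters
        height = d.pop(pair)
        i, j = sorted(pair)
        (tree_i, size_i), (tree_j, size_j) = active.pop(i), active.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # leaf-count-weighted mean of the two old distances.
        for k in active:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            d[frozenset((next_key, k))] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        active[next_key] = ((tree_i, tree_j, height), size_i + size_j)
        next_key += 1
    (tree, _size), = active.values()
    return tree


# Hypothetical tMRCA distances in generations.
names = ["alan", "bob", "chris"]
dist = {
    frozenset(("alan", "bob")): 2,
    frozenset(("alan", "chris")): 10,
    frozenset(("bob", "chris")): 14,
}
tree = average_linkage(names, dist)
print(tree)
```

In this example alan and bob merge first at height 2, and chris joins at the average of 10 and 14, so the root sits at 12. The height stored at each branch node is the value a dendrogram would plot on its distance axis.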

Calibrating tMRCA: The Y STR default is 0.00231 mutations per locus per generation, which is between the mean and median of Y111 values and which I have found empirically to give the best results over the range 50-200 generations. For mtDNA I use 1.64 × 10^-7 mutations/year/bp, optimized for best fit to published data; see this report.

Groups, Labels, and Clade Detection: The groups assigned by FTDNA administrators are shown in the linear dendrogram, but they are only labels. No group or haplotype information is used to create these diagrams, which are based only on pairwise Y STR or mtDNA differences. The manually assigned groups are usually in good agreement with the automated clustering, which is reassuring (the methods are similar but not identical). The code includes automated clade detection, as shown by pastel caps in the dendrograms. The method is good but does make mistakes; improvements under consideration include precise fitting to single-founder distributions and/or matching against reference signatures (perhaps modal haplotypes or mtDNA patterns) from complete trees at YFull or FTDNA.