Rapid evolution of the human mutation spectrum
More mutations at different rates between populations.
For Mark Jost or anyone who is familiar with Ken Nordtvedt's Y-STR-based Interclade age estimating methodology, I recently changed the TMRCA estimations in SAPP (my phylogenetic tree tool) to use that methodology instead of Nordtvedt's older adaptation of the Bruce Walsh methodology.
The main reason I switched is that Nordtvedt's older method required knowledge of the allele frequencies, which are specific to individual haplogroups. I built the L21 allele frequencies into the tool, but that made it less accurate for other haplogroups. The Interclade least-squares estimations don't require allele frequencies, so they are more widely applicable.
The tool recalculates an Interclade age estimation at every branching point (node) on the phylogenetic tree. Internally, every node has only two branches, so there are always two "sub-clades" to use for the methodology. When the phylogenetic tree is drawn, it eliminates unnecessary nodes so you only see the relevant branching points with their TMRCA estimates. I should also note that although I calculate coalescence ages as well, I don't currently report them, for simplicity.
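To illustrate the "two sub-clades at every node" idea, here is a minimal Python sketch. It is not SAPP's code and it is not Nordtvedt's Interclade least-squares formula. As a stand-in it uses the simpler average-squared-distance estimator, which assumes a symmetric single-step mutation model, so that the expected squared repeat difference at a marker with rate mu is 2 * mu * t for two lineages that split t generations ago. The `Node` class, the haplotypes and the rates are all invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Binary tree node: leaves carry a haplotype, internal nodes get an age."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    haplotype: Optional[List[int]] = None  # STR repeat counts, leaves only
    age: Optional[float] = None            # generations, internal nodes only


def leaves(node: Node) -> List[List[int]]:
    """Collect the haplotypes of all samples below a node."""
    if node.haplotype is not None:
        return [node.haplotype]
    return leaves(node.left) + leaves(node.right)


def asd_age(clade_a, clade_b, rates) -> float:
    """Average squared distance between two clades, converted to generations.

    Stand-in estimator (assumption): t = mean squared difference summed over
    markers, divided by 2 * sum of per-marker mutation rates.
    """
    total_sq = 0.0
    for a in clade_a:
        for b in clade_b:
            total_sq += sum((x - y) ** 2 for x, y in zip(a, b))
    mean_sq = total_sq / (len(clade_a) * len(clade_b))
    return mean_sq / (2.0 * sum(rates))


def annotate(node: Node, rates) -> None:
    """Estimate an age at every internal node from its two sub-clades."""
    if node.haplotype is not None:
        return
    annotate(node.left, rates)
    annotate(node.right, rates)
    node.age = asd_age(leaves(node.left), leaves(node.right), rates)


# Invented data: three samples, two markers, rates per generation.
rates = [0.002, 0.003]
inner = Node(left=Node(haplotype=[13, 24]), right=Node(haplotype=[13, 25]))
root = Node(left=inner, right=Node(haplotype=[15, 24]))
annotate(root, rates)
print(inner.age, root.age)  # roughly 100 and 450 generations
```

The point of the sketch is only the structure: because every internal node has exactly two children, there is always one pair of sub-clades to compare, whatever estimator is plugged into `asd_age`.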
Switching between the two methods, I find the Interclade methodology produces slightly younger TMRCA estimates but with tighter one-standard-deviation error ranges than Nordtvedt's older method.
Since the tool constrains the phylogenetic tree by SNP results, I also report the SNP TMRCA ages (where known) from YFull (V5.04 currently) against the SNPs reported on the tree. This allows for direct comparison around the tree between TMRCAs calculated by SNPs and TMRCAs calculated by STRs.
But... this is not necessarily an apples-to-apples comparison since the TMRCA nodes reported on the tree are only the TMRCAs for the group being charted, not the TMRCA for that SNP overall. You would expect then in general that the SNP TMRCAs should be older (MUCH older, in some cases) than the STR TMRCAs. So the way to compare the two ages is that the SNP's overall TMRCA is somewhere ABOVE the node shown on the tree, and the node itself is of the age reported for the STR-based calculation.
This is an example (from data invented for display purposes) of a node in the phylogenetic tree chart with both SNP (in blue) and STR (in green) ages shown:
A little late, but I have attached a short report that discusses my 2012 estimation method and its relation to Chandler's original approach. Basically, the (2012) estimation method reworks the relevant math, removing some approximations, and also, perhaps more importantly, introduces topological data weighting. It appears that, although unpublished, these estimates are still referred to on the web and even in some peer-reviewed papers. Some extra details then seem appropriate.
The YHRD dataset has been slowly expanding since 2012. New loci have been included and the dataset has been diversified. So I wrote a short description of the current situation with respect to the accuracy of the 111 estimates.
edit: I replaced the attachment to fix a comment related to the original error estimates
Last edited by MarkoH; 07-16-2021 at 09:23 PM.
There was the technical detail that the rate estimates are found by a probability maximization process. Such an estimate would then be the mode of a distribution rather than the mean, as I incidentally suggested in the document yhrd2021b.pdf.
It seems that I can no longer edit the previous post, so the document with a more consistent error estimate section is attached here.
Somehow I forgot to add the quite relevant 95% confidence interval for the overall scale of the estimates.
After all, errors are composed of per locus errors and an error in the overall scale. For the very fastest loci these are fairly similar in magnitude, while the per locus error dominates in the case of the slower loci.
Also, the scale factor uncertainty needed to be removed from the calculation that produces Fig. 1.
None of this of course changes anything material in the document....
Last edited by MarkoH; 07-21-2021 at 08:12 AM.
I would summarize my recent observations as follows:
Any pair of real-world haplotypes is related by a sequence of independent
periods of evolution (called "branches" in rrates.pdf). If each
"branch" of evolution is given an equal probability of occurrence
(by using a weighting scheme), random draws from an abstract haplotype
pair distribution can be simulated and mutation rates can be
estimated.
Even though my 2012 estimates did not score particularly well against
the father/son pair data that was available in 2012, they work very well
with today's more complete YHRD dataset. It is estimated that the
accuracy is at least similar to that of a father/son dataset of 10,000
samples, but the accuracy could potentially be as good as that of a set
of 80,000 father/son pairs.
Last edited by MarkoH; 07-22-2021 at 02:46 PM.
The use of a particular weighting function in the estimation algorithm description document rrates.pdf was critical for the accuracy. However, it may appear cryptic. To clarify the meaning of the weighting, I wrote a short description of it (attached).
For example, if m(a,b) is a mutation count between samples a and b, and t(a,b) is their time distance, the related mutation rate is simply
sum w(a,b) m(a,b) / sum w(a,b) t(a,b)
where w(a,b) is the weighting factor and the sum is over all sample pairs (a,b). This simple result is accurate, e.g., if the changes always happen in unique locations and can then be counted accurately from the pair observations without complications like back and parallel mutations.
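The weighted ratio above can be written out as a minimal Python sketch. The function name, the tuple layout, and all the numbers are invented for illustration. In particular the weights here are arbitrary and are not the weighting function from rrates.pdf or weighting.pdf, and time distances are assumed to be in generations.

```python
def weighted_rate(pairs):
    """Return sum(w * m) / sum(w * t) over sample pairs.

    pairs: iterable of (w, m, t) tuples, where w is the weight of the pair,
    m its mutation count and t its time distance (assumed: generations).
    """
    pairs = list(pairs)
    numerator = sum(w * m for w, m, _ in pairs)
    denominator = sum(w * t for w, _, t in pairs)
    if denominator == 0:
        raise ValueError("weighted time distance sums to zero")
    return numerator / denominator


# Invented pairs: (weight, mutation count, time distance in generations).
example_pairs = [
    (1.0, 2, 1000.0),
    (0.5, 1, 400.0),
    (0.25, 0, 200.0),
]
rate = weighted_rate(example_pairs)
print(rate)  # about 0.002 mutations per generation for these numbers
```

Note that a pair with zero observed mutations still contributes to the denominator, so it pulls the estimate down, as it should.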
Last edited by MarkoH; 07-26-2021 at 11:55 AM.
If the statistical tests of the document yhrd2021d.pdf are done for
"Heinila (2012)" rates, McDonald rates, and Chandler rates,
respectively, I find the following values with the data at
https://yhrd.org/pages/resources/mutation_rates :
Rate set          P value    maximum likelihood N0
Heinila (2012)    0.36       89000
McDonald          0.0016     12000
Chandler          3e-5       7300
This is surprising, since the McDonald rates are partly based on the (2012)
rates. However, the inclusion of relatively small father/son studies
may reduce the effective size N0. Perhaps the biggest single
difference between the (2012) rates and the Chandler rates was the use of the
summing theorem (weighting.pdf).
I started editing the documents I posted earlier since they omit some details. If you would like to get extended versions, please send a message.
I would strongly encourage P value tests for various mutation rate sets, perhaps simply ignoring complex loci like 385 & 389. The P value tells how likely it would be to see mutation counts as extreme as (or more extreme than) those observed in measured transmission events if the rate set in question were assumed to be accurate.
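As a rough illustration of such a test for a single locus, here is a minimal Python sketch. It assumes the mutation count in N transmissions is Poisson distributed with mean rate * N, and it counts as "as extreme or more extreme" every outcome that is no more probable than the observed one. This is not necessarily the test used in yhrd2021d.pdf, it does not attempt to combine loci, and the rate and counts are invented.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def two_sided_p(observed: int, rate: float, transmissions: int) -> float:
    """P value of an observed mutation count under an assumed rate.

    Sums the probabilities of all counts that are no more likely than the
    observed count (assumption: Poisson counts with mean rate * transmissions).
    """
    lam = rate * transmissions
    if lam <= 0:
        raise ValueError("rate * transmissions must be positive")
    p_obs = poisson_pmf(observed, lam)
    # Far enough into the upper tail that the remaining mass is negligible.
    upper = max(observed, int(lam + 20.0 * math.sqrt(lam) + 50))
    total = 0.0
    for k in range(upper + 1):
        p = poisson_pmf(k, lam)
        if p <= p_obs * (1.0 + 1e-9):  # small tolerance for rounding ties
            total += p
    return min(1.0, total)


# Invented example: assumed rate 0.003 per transmission, 1000 father/son pairs.
p_typical = two_sided_p(observed=3, rate=0.003, transmissions=1000)
p_extreme = two_sided_p(observed=20, rate=0.003, transmissions=1000)
print(p_typical, p_extreme)
```

With these invented numbers, observing 3 mutations is exactly what the assumed rate predicts and gives a P value close to 1, while observing 20 gives a very small P value and would speak against the rate.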
Last edited by MarkoH; 08-01-2021 at 09:45 PM.