Comparing SNP-based and STR-based TMRCA Age Estimations



Dave-V
11-27-2019, 09:44 PM
To my knowledge no studies have been done comparing STR-based (Walsh, Nordtvedt, etc.) and SNP-based (YFull, McDonald, etc.) estimation techniques for Time to Most Recent Common Ancestor (TMRCA). I just completed one, and although I'll need to write it up as a study, I'm posting the results here first.

The approach starts with 425 Big Y tests (about 50/50 Y500 and Y700) under R1b-L513, which YFull estimates at 3800 ybp (years before present, meaning about 1850 BC). That estimate has an error range, but note that having a fixed reference point matters more to the approach than the exact number. L513 is a large enough haplogroup to be representative, and while other haplogroups will vary in the details and averages, the approach is not L513-specific.

To create a SNP-based TMRCA reference, I counted all the SNPs in equivalent blocks under L513 on FTDNA's haplotree, along with the average private variants that FTDNA carries on its tree. I adjusted Y500-only branches to Y700 coverage by multiplying by 1.586 (the coverage difference between Y500 and Y700). In general, branches on the FTDNA tree with a mix of Y500 and Y700 tests already reflect the additional variants discovered through Y700 testing, so those equivalent blocks can be taken as already "Y700-adjusted".

This gives a total tree calibrated to Y700 coverage. Each branching SNP's age can then be calculated from the ratio of the number of SNPs above it (up to L513) to the average number of SNPs down to the present day across all the branches below it.
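As a rough illustration of that ratio, here is a minimal Python sketch. It reflects one reading of the description above and is not the actual spreadsheet logic: the function names and the example SNP counts are hypothetical, and only the 3800 ybp anchor and the 1.586 coverage factor come from the post.

```python
L513_AGE_YBP = 3800    # YFull estimate for L513 quoted above
Y500_TO_Y700 = 1.586   # Y500 -> Y700 coverage factor quoted above


def adjust_to_y700(snp_count, y500_only):
    """Scale a SNP count from a Y500-only branch up to Y700 coverage."""
    return snp_count * Y500_TO_Y700 if y500_only else float(snp_count)


def branch_age_ybp(snps_above, snps_below_per_branch, root_age=L513_AGE_YBP):
    """Estimate the age of a branching SNP (hypothetical helper).

    snps_above: SNPs between this branching point and L513.
    snps_below_per_branch: SNP counts from this point down to the
        present day, one entry per branch below it.
    The age is the share of the root-to-present SNP path that lies
    below the branching point, times the root age.
    """
    below = sum(snps_below_per_branch) / len(snps_below_per_branch)
    return root_age * below / (snps_above + below)


# Made-up example: 30 SNPs above, three branches below averaging 20 SNPs
print(branch_age_ybp(30, [18, 22, 20]))  # 1520.0
```

In this made-up example the branching point sits 20/50 of the way along the path from the present back to L513, so it comes out at 40% of 3800 ybp.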

Those ages are here (https://drive.google.com/open?id=1vbS1Ygkv4aFvHegXuOoerhKD5g-h-tBT) for 416 SNPs under L513. The years-per-SNP on average across the whole L513 tree was found to be 73.12 with a 27.8% variance. Note that a single Y700 test should on average find SNPs at a rate of about 83 years-per-SNP, but the haplotree gets the benefit of repeated Y500 and Y700 tests, which together cover an effectively larger area and so give a lower number of years-per-SNP.

Note also that the list of Formed and TMRCA ages is not smoothed the way YFull's is, where the TMRCA age of a higher block equals the Formed age of the lower block. The raw data often shows a lower block's Formed age that is earlier than the upper block's TMRCA age. Why? Because the lower block has more SNPs than the average of ALL the branches under the upper block, and so as a stand-alone it looks a little older. I preferred to use the raw data for purposes of this study, but that table would be more useful as a standalone if it were smoothed, and I'll get around to that.

Ok, so that's how the SNP ages were derived. The STR-based ages were calculated by putting the Y111 markers for all 425 Big Y kits into SAPP, which calculates TMRCAs at all the branching points according to Nordtvedt's Interclade technique. An explanation of this technique can be found here (http://dienekes.blogspot.com/2008/08/validation-of-ken-nordtvedts-interclade.html).

Please note that the point of this study was not to figure out which method is more accurate, only how consistent and how close the two methods are. Accuracy would demand that I know historically when each SNP occurred, and of course I don't know that. Just because SNPs and STRs might agree and say a branching point is 1400 AD doesn't mean they're correct; both methods may just be overestimating or underestimating by the same amount.

I'll put more details in the study write-up, but the summary picture of STR age estimates against SNP age estimates is here. Each dot represents a branching point under L513 (i.e. SNP), with its STR-based age on the X axis and its SNP-based age on the Y-axis.

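[Attachment 34969: scatter plot of STR-based age (X axis) against SNP-based age (Y axis) for branching points under L513]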

The red line shows where the dots would fall if every SNP age were identical to its STR age. I'll repeat that the red line does NOT represent the actual ages of the branching points, since I'm not measuring accuracy, only consistency. The two grey lines give the 27% high and low error margins of the SNP-based calculations I made, which is the approximate error range of the SNP (Y-axis) component of the blue dots (meaning the dots which fall inside the two grey lines are at least within one set of error margins).
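For anyone who wants to run the same check on their own numbers, here is a minimal Python sketch of the "inside the grey lines" test. It assumes one plausible reading of the chart, namely that a dot counts as inside when its STR age lies within 27% of its SNP age; the function name and the sample ages are hypothetical.

```python
SNP_ERROR_MARGIN = 0.27  # approximate SNP-side error range quoted above


def within_snp_margin(str_age, snp_age, margin=SNP_ERROR_MARGIN):
    """True if the STR age falls inside the SNP age's +/- margin band.

    Assumption: the band is taken around the SNP-based age, so a dot
    on the red line (str_age == snp_age) is always inside.
    """
    return abs(str_age - snp_age) <= margin * snp_age


# Made-up ages in ybp, as (STR-based, SNP-based) pairs
dots = [(1000, 1200), (1000, 2000), (2500, 2400)]
inside = [d for d in dots if within_snp_margin(*d)]
print(f"{len(inside)} of {len(dots)} dots fall inside the band")
```

This only covers the SNP side of the uncertainty; the STR estimates have their own error range, which the sketch ignores.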

While the blue dots are not aligned on the red line, the correlation between the two sets of data (STRs vs. SNPs) is astounding. The two estimates have a computed correlation of 0.8778 and a p-value of 6.2 x 10^-87 (as background, the closer the correlation is to 1 rather than 0, the more closely related the two variables are, and any p-value lower than 0.05 means their relationship is statistically significant). For those who care, the coefficient of determination is 0.77 (the correlation squared), meaning that 77% of the variance of one variable is explainable by the other.
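The statistics above can be reproduced from two lists of ages with a few lines of Python. This is a minimal sketch using only the standard library: the sample ages are made up for illustration and are not the study's data, and the helper names are my own. The t statistic shown is the usual starting point for the p-value of a Pearson correlation; turning it into a p-value needs a t-distribution, which is left out here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up ages in ybp, for illustration only
str_ages = [500, 1000, 1500, 2000, 3000]
snp_ages = [650, 900, 1700, 1800, 3200]

r = pearson_r(str_ages, snp_ages)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, t = {t_statistic(r, len(str_ages)):.2f}")

# The coefficient of determination quoted above is just the correlation squared
print(round(0.8778 ** 2, 2))  # 0.77
```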

In other words, STR-based and SNP-based age estimations are closely correlated and while they both can vary within independent error ranges, they both consistently represent a common underlying cause (i.e. the number of years over which mutations have occurred). They certainly do not mutate randomly compared to each other.

This is the slide I use to explain STR and SNP techniques and how they compare. On select branches compared to averages, the behavior of SNPs and STRs will fall somewhere within the circle shown. Depending on where that point is, either or both can be over- or under-estimated. The trick is to then use them both along with historical evidence to triangulate an accurate date.

[Attachment 34970: slide comparing STR and SNP age estimation techniques]