
Thread: Comparing SNP-based and STR-based TMRCA Age Estimations

  1. #1
    Registered Users
    Y-DNA (P)
    R1b-L21 L513*

    United States of America Ireland Germany Belgium Wallonia

    Comparing SNP-based and STR-based TMRCA Age Estimations

    To my knowledge no studies have been done comparing STR-based (Walsh, Nordtvedt, etc) and SNP-based (YFull, McDonald, etc) estimation techniques for Time to Most Recent Common Ancestor (TMRCA). I just completed one, and although I'll need to write it up as a study, I'm posting the results here first.

    The approach starts with 425 Big Y tests (about 50/50 Y500 and Y700) under R1b-L513, which YFull estimates at 3800 ybp (years before present, meaning about 1850BC). That estimate has an error range but note that the fixed reference point matters more to the approach than the exact number. L513 is a large enough haplogroup to be representative and while other haplogroups will vary as to the details and averages, the approach is not L513-specific.

    To create a SNP-based TMRCA reference, I counted all the SNPs in equivalent blocks under L513 on FTDNA's haplotree, along with the average private variant counts that FTDNA carries on their tree. I adjusted Y500-only branches to Y700 coverage by multiplying by 1.586 (the coverage difference between Y500 and Y700). In general, on branches with a mix of Y500 and Y700 tests the FTDNA tree already reflects the additional variants discovered through Y700 testing, so those equivalent blocks can be taken as already "Y700-adjusted".

    This gives a total tree calibrated to Y700 coverage. Each branching SNP's age can then be calculated from the ratio of the number of SNPs above it (up to L513) against the average number of SNPs down to the present day on all the branches below it.
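    The ratio described above can be sketched roughly as follows. This is only a minimal illustration, assuming the age of a branching point is the L513 reference age scaled by the share of SNPs lying below it. The function names and the example counts are hypothetical; the 3800 ybp reference and the 1.586 coverage factor are the figures quoted in the post.

```python
L513_AGE_YBP = 3800.0   # YFull estimate quoted in the post
Y500_TO_Y700 = 1.586    # coverage factor quoted in the post


def adjust_to_y700(snp_count, y500_only):
    """Scale a Y500-only SNP count up to Y700 coverage (hypothetical helper)."""
    return snp_count * Y500_TO_Y700 if y500_only else float(snp_count)


def branch_age_ybp(snps_above, snps_below_per_branch):
    """Estimate the age of a branching point (an assumed reading of the ratio).

    snps_above: Y700-adjusted SNPs between L513 and the branching point.
    snps_below_per_branch: Y700-adjusted SNP counts from the branching point
        down to the present day, one entry per branch below it.
    """
    avg_below = sum(snps_below_per_branch) / len(snps_below_per_branch)
    return L513_AGE_YBP * avg_below / (snps_above + avg_below)


# Made-up counts, for illustration only.
below = [adjust_to_y700(20, y500_only=True), adjust_to_y700(30, y500_only=False)]
age = branch_age_ybp(snps_above=20, snps_below_per_branch=below)
print(round(age))
```

    With zero SNPs above it a branching point simply gets the L513 age, and the estimate gets younger as more SNPs sit between it and L513.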

    Those ages are here for 416 SNPs under L513. The years-per-SNP on average across the whole L513 tree was found to be 73.12 with a 27.8% variance. Note that a single Y700 test should on average find SNPs at a rate of about 83 years-per-SNP, but the haplotree gets the benefit of repeated application of Y500 and Y700 test coverages, which results in an effectively larger coverage area and so a lower number of years-per-SNP.

    Note also that the list of Formed and TMRCA ages is not smoothed the way YFull does it, where the TMRCA age of a higher block equals the Formed age of the lower block. The raw data often shows a lower block's Formed age that is earlier than the upper block's TMRCA age. Why? Because the lower block has more SNPs than the average of ALL the branches under the upper block, and so as a stand-alone it looks a little older. I preferred to use the raw data for purposes of this study, but that table would be more useful as a standalone if it were smoothed, and I'll get around to that.

    Ok so that's how the SNP ages were derived. The STR-based ages were calculated by putting the Y111 markers for all 425 Big Y kits into SAPP which calculates TMRCAs at all the branching points according to Nordtvedt's Interclade technique. An explanation of this technique can be found here.

    Please note that the point of this study was not to figure out which was more accurate, only how consistent and how close the two methods are. Accuracy would demand that I know historically when each SNP occurred, and of course I don't know that. Just because SNPs and STRs might both agree that a branching point is 1400AD doesn't mean they're correct; both methods may just be overestimating or underestimating by the same amount.

    I'll put more details in the study write-up, but the summary picture of STR age estimates against SNP age estimates is here. Each dot represents a branching point under L513 (i.e. SNP), with its STR-based age on the X axis and its SNP-based age on the Y-axis.

    SNPs vs STRs Analysis.jpg

    The red line shows where the dots would fall if every SNP age were identical to the corresponding STR age. I'll repeat that the red line does NOT represent the actual ages of the branching points, since I'm not measuring accuracy, only consistency. The two grey lines give the 27% high and low error margins of my SNP-based calculations, which is the approximate error range of the SNP (Y-axis) component of the blue dots (meaning the dots which fall inside the two grey lines are at least within one set of error margins).

    While the blue dots are not aligned on the red line, the correlation between the two sets of data (STRs vs. SNPs) is astounding. The two estimates have a computed correlation of 0.8778 and a p-value of 6.2x10^-87 (as background, the closer the correlation is to 1 rather than 0, the more closely related the two variables are, and any p-value lower than 0.05 means their relationship is statistically significant). For those who care, the coefficient of determination is 0.77, meaning that 77% of the variance of one variable is explainable by the other.
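    For anyone who wants to check how the correlation and the coefficient of determination relate, here is a small sketch. The `pearson_r` helper is my own plain-Python illustration, and the five pairs of ages are hypothetical values, not data from the study. Only the 0.8778 figure is taken from the post.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Coefficient of determination implied by the correlation quoted in the post.
r_quoted = 0.8778
print(round(r_quoted ** 2, 2))  # 0.77

# Hypothetical ages (ybp) for five branching points, NOT data from the study.
str_ages = [500, 1000, 1500, 2500, 3500]
snp_ages = [650, 900, 1700, 2300, 3800]
r = pearson_r(str_ages, snp_ages)
print(round(r, 3), round(r * r, 3))
```

    Squaring the correlation gives the share of variance explained, which is where the 77% figure comes from.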

    In other words, STR-based and SNP-based age estimations are closely correlated and while they both can vary within independent error ranges, they both consistently represent a common underlying cause (i.e. the number of years over which mutations have occurred). They certainly do not mutate randomly compared to each other.

    This is the slide I use to explain STR and SNP techniques and how they compare. On select branches compared to averages, the behavior of SNPs and STRs will fall somewhere within the circle shown. Depending on where that point is, either or both can be over- or under-estimated. The trick is to then use them both along with historical evidence to triangulate an accurate date.

    Age Estimation with STRs and SNPs.jpg

  2. The Following 9 Users Say Thank You to Dave-V For This Useful Post:

     deadly77 (04-11-2020),  FionnSneachta (12-03-2019),  JMcB (08-03-2021),  MitchellSince1893 (08-03-2021),  palamede (11-29-2019),  RobertCasey (04-13-2020),  Roslav (08-16-2021),  sheepslayer (11-29-2019),  Tolan (02-08-2021)

  3. #2
    Registered Users
    Y-DNA (P)
    mtDNA (M)

    Quote Originally Posted by Dave-V View Post
    The years-per-SNP on average across the whole L513 tree was found to be 73.12 with a 27.8% variance. Note that a single Y700 test should on average find SNPs at a rate of about 83 years-per-SNP, but the haplotree gets the benefit of repeated application of Y500 and Y700 test coverages, which results in an effectively larger coverage area and so a lower number of years-per-SNP.
    I am trying to figure out what the Block Tree mutation rate is, not for a specific lineage but across the board, so as to have an expected mutation rate.

    I haven't received the help I've wanted on this but am aware the Block Tree Private Variant counts exclude SNPs from DYZ19, CEN and q12. I'm not sure how significant that is, but it should mean James Kane's data warehouse calculation of 83 years per SNP is too low (fast). I understand that no-coverage situations (the mitigation thereof) would mean a potentially faster rate for well tested branches (public SNPs). I'm not sure what the best rate is, and I get that I could back off to just ComBED SNPs, but you do lose a lot of resolution with that.

    Any comments on rough mutation rates for the Block Tree? Let's simplify and ask what mutation rates should be used for Big Y700 tested only branches?

  4. The Following User Says Thank You to TigerMW For This Useful Post:

     JMcB (08-03-2021)

  5. #3
    Registered Users
    Eastern North America
    80% Brit/Irish, 19% N Eur
    aDNA Match (1st)
    I12771 Mid Iron Age Derbyshire England 0.0216
    aDNA Match (2nd)
    I0774 Anglo-Saxon Cambridge England 0.0221
    aDNA Match (3rd)
    VK364 Viking Age Langeland Denmark 0.0228
    Y-DNA (P)
    mtDNA (M)
    Y-DNA (M)
    mtDNA (P)

    England Scotland Wales Germany Palatinate Ireland Leinster Sweden Finns
    Quote Originally Posted by Dave-V View Post
    To my knowledge no studies have been done comparing STR-based (Walsh, Nordtvedt, etc) and SNP-based (YFull, McDonald, etc) estimation techniques for Time to Most Recent Common Ancestor (TMRCA). I just completed one and I'll need to write it up as a study but I'm posting the results first here...
    I was under the impression that STR-based TMRCA estimates tended to be accurate for more recent ages, while SNP-based TMRCA estimates were better for older ages, as STRs can back mutate over time. I don't recall where I heard it, but I believe I remember that the transition period was somewhere in the 500-1500 years before present timeframe, i.e. STRs are better somewhere in the 500-1000 ybp range, then it transitions to SNPs being more accurate.

    Am I mistaken in this perception?
    Y DNA line continued: Z142>Z12222>FGC12378>FGC12401>FGC12384
    35% English, 15% Scottish, 14% Welsh, 14% German, 11% Ulster Scot, 5% Ireland, 3% Scandinavian, 2% French/Dutch, 1% India
    "Nemo est supra leges."

  6. #4
    Registered Users

    Hi folks,

    Mark Mitchell has put me onto this thread, since I don't generally lurk on Anthrogenica much! I did some STR versus SNP age estimations way back in the day, and I've done some more theoretical ones recently. The old ones are written up on pages 10-14 of this ancient and outdated summary for R-U106:
    while the new ones are written up in the examples of my paper here:

    In summary, SNP-based ages are most accurate; STR-based ages can be very accurate, but depend strongly on the methods used. The main issues are back mutations and convergent mutations, which become unseen in the data: the genetic distance of individual STRs quickly becomes dissociated from the number of mutations that have taken place. A table showing the approximate relationship between the two can be found in Table 1 of my paper, reproduced here:

    SNP-based ages can currently be thought of as the gold standard, and they are much better calibrated, so we are interested in trying to reproduce the SNP-based ages with STRs.
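    The way genetic distance becomes dissociated from the number of mutations can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are mine rather than taken from the paper: every mutation is a single step, up and down steps are equally likely, and a marker is followed along one line only. The function names are hypothetical.

```python
import random


def simulate_marker(n_mutations, rng):
    """Apply n single-step mutations (+1 or -1 repeat) to one STR marker.

    Returns the observed genetic distance: the absolute net change in repeats.
    Back mutations cancel earlier steps, so this can be less than n_mutations.
    """
    net = 0
    for _ in range(n_mutations):
        net += rng.choice((-1, 1))
    return abs(net)


def mean_observed_distance(n_mutations, trials=20000, seed=1):
    """Average observed distance over many simulated markers."""
    rng = random.Random(seed)
    total = sum(simulate_marker(n_mutations, rng) for _ in range(trials))
    return total / trials


# Compare the true number of mutations with the average distance seen.
for n in (1, 2, 4, 8):
    print(n, round(mean_observed_distance(n), 2))
```

    With one mutation the observed distance always equals the true count, but as mutations accumulate on the same marker the average observed distance falls further and further behind.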

    The short answer is that Bruce Walsh's infinite-alleles model starts to become inaccurate after about one mutation timescale of the fastest markers, typically ~800 years ago; the stepwise-alleles model is less accurate for recent relationships due to multi-step mutations, but becomes more accurate over the period 800~1300 years ago, or between about one and two mutation timescales of fast markers; Ken Nordtvedt's variance-based STR method then becomes most accurate, but variance seems to stop growing linearly after a few thousand years. Various exponential corrections have been proposed for this and, using these, several pre-2014 people have got fairly close to SNP-based estimates (I don't know whether by luck or ). My new method for STR-based TMRCAs can account for all these phenomena in a pairwise way (i.e. for two tests), but is subject to some inaccuracy over timescales of millennia due to hidden convergent mutations (a crude estimate would be ~15% in ~4000 years). See also Sofie Claerhout's recent paper on YMrCA:

    However, the bigger issue causing a divergence is normally mutation rates. Even authoritative sources for mutation rates vary by ~30% for the same STR. This leads to horribly large uncertainties for STR-based mutation rates (Dave, I think this might be what you're seeing). If you plot SNP- against STR-based TMRCAs, you should find a 1:1 correlation. There will be a linear term (change of slope) due to uncertain mutation rates. There will be a quadratic term (curvature of the relation) due to any second-order deviation like those in the previous paragraph. Over the timescales we can probe with reliable genealogical records, it can be difficult to separate the two, as it is essentially a scatter plot. An accurate STR-based TMRCA calculation should use mutation rates appropriate for their calculation method (e.g. using my method, they should be from father-son or close relation studies; other methods might be better with population-based estimates).

    Nevertheless, the most-accurate TMRCAs, which are required for recent relationships, are undoubtedly obtained by combining SNP- and STR-based TMRCAs. In recent times, you will get a much better answer from averaging the two than you will get from either alone. I wouldn't trust this averaging over anything much older than, say, R-L151/P310, but it should be a good way of getting more accurate results.



    (P.S. - replies may be slow)

  7. The Following 6 Users Say Thank You to imcdonald For This Useful Post:

     Jack Johnson (10-07-2021),  JMcB (08-03-2021),  MitchellSince1893 (08-03-2021),  Roslav (08-16-2021),  TigerMW (08-04-2021),  Wing Genealogist (08-03-2021)

  8. #5
    Registered Users

    I would like to point out one obvious problem with some STR based time estimators.

    It is that estimators like Nordtvedt's interclade variance are quite ok if both clades are approximately star clusters or otherwise close to balanced trees in general. (Or, more precisely, are not known to have any particular structure.) The problem is that the accuracy is reduced in the case where a clear substructure is present. In that situation, the worst case is that the estimate is reduced to a time distance between the two most populous subclusters within each clade... then there is effectively only one sample representing the separation between those subclusters, and possible parallel branches (whose inclusion would improve accuracy) are almost ignored. Optimal weighting would put more importance on the samples in the more distant sub-branches of each clade.

    If something is known about the substructure, the correct estimator would use a pair weight factor similar to the one used in the mutation rate estimation application in an adjacent forum thread. In general one can force the known clade separation if some kind of automatic tree estimation method is used. Then one can use the average weighting produced by the different potential cluster solutions... normally, since nowhere near complete information is available, there are many potential solutions which are close in probability.

    Edit: Actually I wouldn't call this estimator optimal either, except perhaps in some kind of limited sense... optimal weighting depends on the details of the tree time structure in the time estimation application, but not in the much simpler rate estimation case. Rate estimation can be formulated so that only the shape of the tree matters (such as the degree of nodes etc, but not explicit time values). The TMRCA problem is much uglier in comparison.
    Last edited by MarkoH; 08-15-2021 at 08:35 PM.

  9. The Following 4 Users Say Thank You to MarkoH For This Useful Post:

     Dave-V (08-15-2021),  JMcB (08-17-2021),  Roslav (08-16-2021),  sheepslayer (08-15-2021)

  10. #6
    Registered Users

    Perhaps something like this would include most available information:

    If T(s) is the TMRCA for samples s, we have for a d-way star cluster consisting of subclades s1,...,sd :

    T(s1,...,sd) = (1/d)( Tic(s1,...,sd) + T(s1)+...+T(sd) )

    Here Tic(s1,...,sd) is the interconnect time of subclusters s1,...,sd :

    Tic(s1,...,sd) = w_sum(s1,....,sd) - (w_sum(s1) + .... + w_sum(sd))

    Here w_sum(s) over samples s is

    w_sum(s) = sum(a<b) w_s(a,b) t(a,b)

    and w_s is calculated from the network for samples s, the sum is over pairs (a,b) within the sample set s, t(a,b) is a pairwise time distance estimate. w_sum reduces to the sum of edge time lengths in the respective trees

    Since the interconnect time Tic is like a two sample time estimator, one can work hierarchically and construct reasonable error bounds for T(s1,...,sd) using any possible knowledge of the tree shape.

    Edit: In the s1/s2 interclade case this just reduces to T(s1,s2) = (1/2) sum_ab w(a,b) t(a,b) where a and b belong to s1 and s2 respectively... the breakdown is needed for the error estimate however, since errors in the t(a,b)'s are correlated, so only error estimation is relatively hard. Still not clear in exactly what sense this construction would be optimal, however...
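    A quick consistency check of the two formulas above, as a minimal sketch. The function names are hypothetical, and the w_sum values here are exact tree lengths read off a made-up toy tree rather than estimates built from pairwise t(a,b) and network weights, which is an assumption made purely for illustration.

```python
def interconnect_time(w_sum_all, w_sum_subclades):
    """Tic(s1,...,sd) = w_sum(s1,...,sd) - (w_sum(s1) + ... + w_sum(sd))."""
    return w_sum_all - sum(w_sum_subclades)


def star_tmrca(w_sum_all, w_sum_subclades, t_subclades):
    """T(s1,...,sd) = (1/d) * (Tic(s1,...,sd) + T(s1) + ... + T(sd))."""
    d = len(t_subclades)
    tic = interconnect_time(w_sum_all, w_sum_subclades)
    return (tic + sum(t_subclades)) / d


# Toy tree (made up): the root is 1000 years old.
# s1 = two samples whose own common ancestor is 400 years old,
#      so its tree length is 400 + 400 = 800.
# s2 = a single sample, so T(s2) = 0 and its tree length is 0.
# Whole tree length = 800 + (1000 - 400) + 1000 = 2400.
print(star_tmrca(2400, [800, 0], [400, 0]))  # 1000.0
```

    On this toy tree the construction returns the root age exactly, because each subclade contributes its own TMRCA plus the length of the edge joining it to the root.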
    Last edited by MarkoH; 08-17-2021 at 06:15 PM.

  11. The Following User Says Thank You to MarkoH For This Useful Post:

     Roslav (08-19-2021)

  12. #7
    Registered Users

    Quote Originally Posted by MarkoH View Post
    ... Still not clear in exactly what sense this construction would be optimal, however...
    Since "w_sum's" are tree length estimators based on balanced minimum evolution principle, it is in that sense optimal....

  13. #8
    Registered Users
    English, Irish, German
    Y-DNA (P)

    England Germany Netherlands France Ireland Switzerland
    The traditional YSNP TMRCA estimates work quite well for the YSNP branches above predictable haplogroups - those older than the 1500 to 2500 YBP time frame. This time frame is variable because of statistical variation - so there is no hard time frame. This time frame is based on Y67 signatures. For Y111 signatures, you may be able to push the time back a few hundred years more.

    Once you get down to predictable haplogroups (those that can usually be predicted with over 99 % accuracy), the years per YSNP is highly dependent on the sample size. My predictable haplogroup, R-L226, now has just under 1,000 testers at Y67 or higher. We have 28 surname clusters, to which I have assigned a date of 1000 AD for this Irish haplogroup. Between L226 and its 28 surname clusters, the average years per YSNP is now down to 70 years, and a few more intermediate branches will be found over time. Two surname clusters average 48 years per YSNP and one YSNP path is 334 years per YSNP. This is a huge statistical variation that can be removed by actually counting the number of YSNP branches between the predictable haplogroup and its surname clusters. These TMRCA estimates are incredibly accurate, but only one-third of the YSNP branches fall between L226 and its surname clusters, a share that is increasing as the number of surname clusters increases. For YSNP branches that have no surname cluster in the path, I use the average of 70 years, which has changed several times in the last few years. Below surname clusters, I also use 70 years. The largest surname cluster - the royal O'Brien surname cluster - now has over 100 testers at Y67 or more. It also has seven levels of branches below the YSNP branch that defines the surname cluster. This makes the estimate 1490 AD (which is probably just a little bit too old). As the sample size increases, more surname clusters will be revealed, and we already have 41 % of L226 testers under surname clusters. The discovery of new intermediate branches is slowing down a lot, as there are very few branch equivalents left to split off into branches.
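    The arithmetic behind the 1490 AD figure can be sketched as below. This is a hedged illustration only: the function name is hypothetical, and it assumes each level of branching below a surname cluster is simply given the flat 70-year average, counting forward from the 1000 AD date assigned to the cluster.

```python
SURNAME_CLUSTER_YEAR_AD = 1000  # date assigned to surname clusters in the post
YEARS_PER_YSNP = 70             # current L226 average quoted in the post


def branch_year_ad(levels_below_cluster, years_per_ysnp=YEARS_PER_YSNP):
    """Estimated year (AD) of a branch a given number of levels below
    the YSNP branch that defines a surname cluster (assumed flat rate)."""
    return SURNAME_CLUSTER_YEAR_AD + levels_below_cluster * years_per_ysnp


# Seven levels below the O'Brien surname cluster, as described in the post.
print(branch_year_ad(7))  # 1490
```

    Passing a different `years_per_ysnp`, such as the 48 or 334 quoted for individual paths, shows how far the date moves when a single path's rate is used instead of the average.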

    This TMRCA method only really works with predictable haplogroups that have at least 500 testers. But this list keeps getting larger as new testers are added. Here are the predictable haplogroups that could use this methodology (numbers are for 1/1/21):

    M222 - 2,855
    L1065 - around 1,500
    CTS4466 - 1,064
    Z255 - 879
    L226 - 796
    L371 - 376 (will get to 500 in the next few years)

    There are dozens of predictable haplogroups that have 200 or 300 testers, so in five to ten years, there will be many predictable haplogroups that can use this TMRCA methodology. Another drawback is the dependency on surname clusters, which works well for most western European testers but would not work for Scandinavian testers (and many other geographies).

    Another interesting discovery is that there are very few YSTR mutations found in the upper parts of the charts for the 60 haplogroups that I have predicted to date under haplogroup R. Mutations (both YSNP and YSTR) have variable rates per year as they become younger. Each chart goes from having only a few paths to mutate at the top to 1,000 paths for the 1,000 testers under R-L226. The mutation rate is not changing - but the number of paths dramatically increases as you get closer to the testers. Also, Y700 YSTRs are becoming very useful as sample sizes increase. Once you have 500 testers, not only does a better TMRCA methodology become an option, but Y700 YSTR mutations can be used as well. Unfortunately, Y700 YSTRs are not available in any public FTDNA report. With 370 Big Y testers known for L226, it is being observed that Y700 doubles the number of YSTR mutations over just using Y111 markers. With over 35 % of our Y67 or higher testers having Big Y results, Y700 YSTRs are creating quite a few new Y700 YSTR-only branches (only five percent to date). But these really help with charting as well as determining how closely these testers are related to each other. There is a huge variation in Y700/Y111 mutations.

    With the L226 chart, even with 35 % Big Y tested, YSTR-only branches account for 75 % of the branches and YSNP branches only account for 25 % of the branches. So any TMRCA methodology that uses both YSTRs and YSNPs will be more accurate. Larger projects can also use Y700 YSTRs as well. YSTRs are noisier, primarily due to so many parallel mutations of the same marker to the same values. Backwards mutations and multiple mutations (where two mutations happen and the second mutation moves further away from the modal value) are pretty rare events. YSNP prediction works for around 75 % of P312 and L21 (you need around 50 to 100 testers). But U106, R1a and R1b Basal struggle to get 10 % predicted. But I have predicted two haplogroups for each of these older haplogroups. There are two issues with these older haplogroups: 1) sample sizes are just much smaller; 2) there are far fewer YSNP/YSTR bottlenecks found (there is a more even progression of mutations).

  14. The Following 3 Users Say Thank You to RobertCasey For This Useful Post:

     JMcB (09-08-2021),  Roslav (12-11-2021),  sheepslayer (09-08-2021)
