YFull - Accuracy of Estimated Dates



discreetmaverick
06-19-2021, 02:50 PM
Dates estimated by YFull play a significant role in everything from validating a claimed historical event to assessing the genetic impact of a historical event or events.



How accurate are the dates estimated by YFull for various clades?
Are there any cases where they have underestimated or overestimated, whether by a small time frame or by a large discrepancy?
Are there options for validating the accuracy of an estimate?

JMcB
06-19-2021, 05:43 PM
All age estimations have to be considered within the parameters of their confidence intervals, which are usually quite wide.


For example, the age estimation for my branch is 225 ybp (or 1795). Our genealogical records indicate that none of us share a common ancestor after 1720, although other measurements point to a common ancestor who lived in the 1600s.


YFull’s CI gives us a range of dates as follows: TMRCA CI 95% 400<->100 ybp
(Or 1620 - 1920)

All in all, I would say they’ve done about as well as can be expected, but there’s still a lot of leeway there, which we have been able to narrow down using genealogical records.


https://www.yfull.com/tree/I-A13242/



If I go further back in time to an earlier branch the ranges are naturally even wider:

TMRCA 1150 ybp (Or 870 AD)

formed CI 95% 2400<->1400 ybp, TMRCA CI 95% 1650<->700 ybp
(Or Formed 380 BC - 620 AD, TMRCA 370 AD - 1320 AD)
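
For anyone who wants to redo these conversions, here is a rough sketch of the arithmetic. It assumes "present" is 2020 (which is what reproduces the calendar years given above) and, like the figures above, it ignores the missing year 0. The function names are just illustrative.

```python
def ybp_to_calendar(ybp, present=2020):
    """Convert years-before-present to a calendar-year label.

    `present` is an assumption: 2020 reproduces the conversions in this post.
    The missing year 0 between 1 BC and 1 AD is ignored, as in the post.
    """
    year = present - ybp
    return f"{year} AD" if year > 0 else f"{-year} BC"


def interval_to_calendar(older_ybp, younger_ybp, present=2020):
    """Convert an 'older<->younger ybp' interval to a pair of calendar labels."""
    return (ybp_to_calendar(older_ybp, present),
            ybp_to_calendar(younger_ybp, present))


print(ybp_to_calendar(1150))             # TMRCA midpoint
print(interval_to_calendar(2400, 1400))  # formed CI 95%
print(interval_to_calendar(1650, 700))   # TMRCA CI 95%
```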


https://www.yfull.com/tree/I-A13248/



Here’s an approximate visualization of the above, although the formed date has changed slightly since I did this:


[Attachment 45238]

deadly77
06-20-2021, 09:52 AM
I'd agree with JMcB on this - most people just look at the midpoint of the age estimate listed on YFull's tree without considering the spread of the 95% confidence interval (found by hovering your cursor over the date). Within this confidence interval, it's fair to say that the TMRCA age estimates are accurate but imprecise (these two criteria often seem to be conflated, but really shouldn't be), with the caveat that about 5% of samples will fall outside the 95% confidence interval.

discreetmaverick
06-21-2021, 09:49 PM
Quoting poi:

FWIW - I am now permanently in YFull's Y-Tree, representing (for now) the R-BY160158 line. I have also messaged the unknown sample YF17019 who shares the same line as me within ~1900 years. I will update this thread when/if the person responds with his paternal origin. The other sample is a Tamil Sri Lankan from 1000 Genomes, so effectively anonymous.

[Attachment 37492]


YFull just increased my paternal line's TMRCA by 1000 years. So, instead of 1900ybp, as it was earlier, it is now 2900ybp.

https://www.yfull.com/tree/R-BY160158/



Continuing here,

That is a really large discrepancy of 1000 years. Did YFull provide an explanation for such a large error in estimation?

Can this be independently verified?

JMcB
06-21-2021, 11:47 PM
Not sure if this was meant for this thread or another one, because poi hasn’t posted in this thread, so he might not see it if it was intended for somewhere else. Be that as it may, judging from the information in the [info] box, there’s a large difference (15.5 SNPs) in the number of Novel Variants each tester has, which is why the date has changed so much. At any rate, poi can always email them to see what’s up.



Edit: Plus, judging from the coverage, one looks like it may be Y500 and the other Y700 or WGS.

discreetmaverick
06-22-2021, 01:35 PM
PMed him.

STR haplotype for the MRCA of R-BY160158 (minimum-parsimony hypothetical using 2 samples) with 632 STRs.


Age estimate based on the average of the two samples: (1890 + 3963) / 2 ≈ 2926 ybp

The age estimates for the two samples differ by 2073 years.

https://i.postimg.cc/13CnMFWb/Screenshot-from-2021-06-22-17-02-18.png (https://postimages.org/)


Can using a Big Y500 versus a Big Y700 or WGS result in a difference of 2000 years?

The other person is from the 1000 Genomes Project: https://www.internationalgenome.org/data-portal/sample/HG03685

If the other person took a Big Y700, would his result be closer to poi's? And wouldn't the date estimate for the clade then shift by another 1000 years?

Am I understanding this correctly?

JMcB
06-22-2021, 01:59 PM
The increased coverage in a Y700 test can lead to more Novel Variants being found. For example, in my Y500 results I had 7 Novel Variants, and in my Y700 test I received 14 NVs. So, assuming I’ve interpreted the coverage figures correctly, it’s a possibility, as YFull uses the better-quality NVs for dating purposes. Do you know which version of the test these two samples have taken?

I also noticed last night that YFull is in the process of updating their tree, and the difference in Novel Variants has changed slightly since then (it’s now 14.35). So it’s possible they haven’t finished analyzing the samples.

RobertCasey
06-22-2021, 03:00 PM
YFull, the Big Tree and U106 use very similar TMRCA estimates. These are quite accurate down to predictable haplogroups, which originate from 1,500 to 2,500 ybp. Most of these estimates are within 10 to 20%, but some are not. Accuracy assumes a reasonable sample size; otherwise accuracy can suffer. Unfortunately, YFull continues to use this methodology in very recent time frames, where accuracy is very questionable, whereas the Big Tree normally cuts off these questionable TMRCAs.

Below predictable haplogroups, counting YSNP mutations in a bottom-up approach is much more accurate. But this requires access to private YSNPs, and the estimates are strongly affected by the sample size. The problem with this approach is that the years per YSNP varies dramatically depending on the sample size and on which NGS test was taken. Big Y700 has better coverage and slightly longer read lengths, which produce a much lower years per YSNP. Also, statistical variation can be very significant in the genealogical time frame. The assumptions for years per YSNP vary a lot, which is the weak part of these estimates. You should also use some blend of years per YSNP to adjust for the mixture of Big Y500 and Big Y700 testers.
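
One simple way to read "some blend" is a tester-weighted average of the two rates. The sketch below is only that reading, not a published formula, and the rates in the usage lines are made-up round numbers used purely as placeholders (no Big Y500 or Big Y700 rate is given in this post).

```python
def blended_years_per_snp(n_y500, n_y700, rate_y500, rate_y700):
    """Tester-weighted average of two years-per-YSNP rates.

    Assumption: weighting by the number of testers of each kind is a
    reasonable blend; other weightings are possible.
    """
    total = n_y500 + n_y700
    if total == 0:
        raise ValueError("need at least one tester")
    return (n_y500 * rate_y500 + n_y700 * rate_y700) / total


# Placeholder rates only: 100 and 50 years per YSNP are NOT real figures.
print(blended_years_per_snp(n_y500=1, n_y700=1, rate_y500=100, rate_y700=50))
print(blended_years_per_snp(n_y500=1, n_y700=3, rate_y500=100, rate_y700=50))
```

The more Big Y700 testers there are under a branch, the closer the blended rate moves to the Big Y700 rate.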

If you have over 500 Y67 testers under your predictable haplogroup, I use yet another method, which depends on surname clusters (I use 1000 AD for my Irish haplogroup). This is by far the most accurate measurement between the origin of the haplogroup and L226's 25 surname clusters. You can actually calculate the real years per YSNP for 20 to 30% of the upper part of the haplotree. This adjusts for statistical variation, as the years per YSNP is allowed to vary between 48 and 350 years. From these branches, you can calculate a pretty reasonable "average" years per YSNP, which will decline over time as the sample size increases (now around 70 years per YSNP branch). However, in the genealogical time frame, estimates will remain much older until you have around 100 testers under a surname cluster, which rarely happens. Using the average produces somewhat conservative estimates that are usually somewhat older than they should be. Here is a presentation on this approach:

https://www.youtube.com/watch?v=sKaxanrxBgs&t=1682s

This approach is best of breed for larger predictable haplogroups: L1065, M222, L193, CTS4466, etc. Its weak point is that TMRCA estimates under surname clusters, and TMRCAs where no surname clusters exist, tend to be older if the sample sizes are lacking.

IanFitzpatrick
06-22-2021, 03:39 PM
Everything Robert says above is spot on: you simply need a large sample size for accuracy.

YFull, from what I have seen in the groups I am familiar with, seems to do a reasonable job with dating, but it simply does not have enough samples on the vast majority of lines, especially when it comes to the last 1000 years.

discreetmaverick
06-22-2021, 04:10 PM
The other person is from the 1000 Genomes Project: https://www.internationalgenome.org/data-portal/sample/HG03685

How many SNPs/STRs were tested for that person?

Sorry, I don't know about poi.


How much does a novel variant add, on average, to the time estimate? Where in the formula or calculation would this value of 14.35 be plugged in?

JMcB
06-22-2021, 04:32 PM
If it’s used for dating, an NV is counted as 144.41 years per mutation. The 14.35 figure is just the difference between the number of Novel Variants in each test and isn’t itself an input to the formula; I just happened to notice it.

The formula is in the [info] box:

For example: 12.67 (corrected number of SNPs) x 144.41 (years per mutation) + 60 (the average age of most testers) = 1890 (years before present)
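
As a rough sketch, the formula above and the two-sample average from earlier in the thread can be written out as follows. The constants 144.41 and 60 come from the [info] box; the function names are my own, and backing out the second sample's SNP count from its 3963 ybp estimate is just an inversion of the same formula, not something YFull reports.

```python
YEARS_PER_SNP = 144.41  # years per mutation, from the [info] box
TESTER_AGE = 60         # average age of most testers, from the same formula


def sample_age_ybp(corrected_snps):
    """Age estimate for one sample from its corrected SNP count."""
    return corrected_snps * YEARS_PER_SNP + TESTER_AGE


def implied_corrected_snps(age_ybp):
    """Invert the formula: corrected SNP count implied by an age estimate."""
    return (age_ybp - TESTER_AGE) / YEARS_PER_SNP


def branch_tmrca_ybp(sample_ages):
    """Branch TMRCA taken as the plain average of the per-sample ages."""
    return sum(sample_ages) / len(sample_ages)


print(round(sample_age_ybp(12.67)))      # 1890
print(branch_tmrca_ybp([1890, 3963]))    # 2926.5
print(implied_corrected_snps(3963) - 12.67)
```

The last line comes out at a little over 14.3 SNPs, which is where the 14.35 difference shows up: it separates the two per-sample estimates but is not multiplied into the branch age directly.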

deadly77
06-22-2021, 11:05 PM
I think this is a good demonstration of why the 95% confidence intervals of these age estimates are so large - and in this case one of the samples is even outside of that (and given that 5% of samples will be outside of that spread, that's not impossible). As a few of the others have said, it comes down to having a very small sample size of only two samples on that branch that are being considered for age estimation. If you had a lot more samples, you'd see some that were closer together and you'd have a better idea of which one was an outlier - of course, it's possible that both are outliers on different ends of the distribution. You'll get a better estimate if there are a lot of samples contributing to the age estimate as you will be able to distinguish between the samples that are more representative of the population versus statistical noise. With only two, you can't really tell. This is the issue with all age estimates - not enough samples.
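
To put rough numbers on the sample-size point, here is a back-of-the-envelope sketch. It is not how YFull computes its confidence intervals: it simply assumes each sample's usable SNP count is an independent Poisson draw, uses the 144.41 years per mutation and 60-year tester age quoted earlier in the thread, and ignores coverage differences and uncertainty in the mutation rate itself.

```python
import math

YEARS_PER_SNP = 144.41  # rate quoted earlier in the thread


def approx_95_half_width_years(true_age_ybp, n_samples, tester_age=60):
    """Approximate 95% half-width (in years) of the averaged age estimate.

    Assumption: SNP counts are independent Poisson draws with mean
    lambda = (true age - tester age) / years per SNP, so the mean of
    n counts has standard deviation sqrt(lambda / n).
    """
    expected_snps = (true_age_ybp - tester_age) / YEARS_PER_SNP
    sd_of_mean = math.sqrt(expected_snps / n_samples)
    return 1.96 * sd_of_mean * YEARS_PER_SNP


for n in (2, 8, 32):
    print(n, round(approx_95_half_width_years(2900, n)))
```

Under this toy model the spread shrinks with the square root of the number of samples, so going from 2 samples to 8 only halves it.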

If you want to know how YFull calculate their age estimates, it's worth reading their FAQ which can be found here https://www.yfull.com/faq/ with the most relevant sections here https://www.yfull.com/faq/what-yfulls-age-estimation-methodology/ and here https://www.yfull.com/faq/how-does-yfull-determine-formed-age-tmrca-and-ci/ as well as their paper here https://www.researchgate.net/publication/273773255_Defining_a_New_Rate_Constant_for_Y-Chromosome_SNPs_based_on_Full_Sequencing_Data

In a nutshell - count up SNPs to branch in question (novel variants plus known SNPs if not the terminal branch); evaluate each SNP on whether it is appropriate for age estimation (in combBED region, not an INDEL or MNP, not found too often in other branches/haplogroups, sufficient number of reads, not ambiguous) and filter that number accordingly; correct that number for missing coverage in the combBED region; multiply by assumed mutation rate. After that, do the same for each sample on the branch and then take the average of those. Other dating calculations may include more regions of the Y chromosome outside of the combBED region (such as Iain MacDonald) and accordingly have a different mutation rate per SNP https://www.mdpi.com/2073-4425/12/6/862/pdf
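
For what it's worth, the steps above can be sketched in code along these lines. This is only an illustration of the outline, not YFull's implementation: the `Snp` fields, the `min_reads` threshold and the linear coverage correction (dividing by the fraction of combBED covered) are all my assumptions, and the 144.41 rate and +60 tester age are the figures quoted earlier in this thread.

```python
from dataclasses import dataclass

YEARS_PER_SNP = 144.41  # years per mutation quoted earlier in the thread
TESTER_AGE = 60         # average tester age quoted earlier in the thread


@dataclass
class Snp:
    in_combbed: bool     # falls inside the combBED region
    is_simple_snp: bool  # not an INDEL or MNP
    recurrent: bool      # found too often in other branches/haplogroups
    reads: int           # number of supporting reads
    ambiguous: bool


def usable(snp, min_reads=2):
    # min_reads is a placeholder threshold, not YFull's actual cutoff.
    return (snp.in_combbed and snp.is_simple_snp and not snp.recurrent
            and snp.reads >= min_reads and not snp.ambiguous)


def sample_age(snps, covered_fraction):
    """Age for one sample; covered_fraction is the share of combBED covered."""
    count = sum(1 for s in snps if usable(s))
    corrected = count / covered_fraction  # assumed linear coverage correction
    return corrected * YEARS_PER_SNP + TESTER_AGE


def branch_age(samples):
    """samples: list of (snps, covered_fraction) pairs, one per kit."""
    ages = [sample_age(snps, frac) for snps, frac in samples]
    return sum(ages) / len(ages)
```

A kit with lower coverage gets its SNP count scaled up more, which is one reason a Big Y500 and a Big Y700 on the same branch can still be averaged together.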

Ultimately, these are estimates and they have a lot of assumptions built in. They'll get better and become more statistically accurate with refinement, but not genealogically exact... and never will be (line stolen from one of Maurice Gleeson's presentations).

There is a difference between what a Big Y500 and a Big Y700 contribute, but the difference of a small number of samples versus a large number of samples overwhelms that.

As for HG03685, YFull aren't including it in age estimates. This is likely because the sample is not of good enough quality for those purposes. The academic samples on YFull vary in this regard: some are included for age estimation, but I'd say the majority are not and are just used for branching. HG03685 is on Alex Williamson's Big Tree here https://www.ytree.net/SNPinfoForPerson.php?personID=2331 although it seems the analysis is not yet finished. It does, however, list a table of unique mutations, so I guess you could manually work through this using YFull's methodology, calculate the answer, and compare it to the two kits at YFull.