
View Full Version : The YFull TMRCA vs. Iain McDonald methods



Mikewww
10-21-2017, 12:09 AM
YFull uses a term called the "formed" date. I have read their definition, and the formed date of a subclade is just the date of the TMRCA of the parent clade. However, although the labeling does include "formed", in the detailed tables they show their calculations and have ages that some call "formed".

Below is an explanation. I understand how the tables/calculations work, I think, but I don't see these dates as particularly unique versus other SNP age estimates and their calculations. I've been told these are exclusives from YFull.

Please convince me of their value, but do so relative to the method that Iain McDonald uses. Is there really any added value in the term "formed" date? And are the sub-table calculations showing the relative age between brother subclades any more useful than Iain McDonald's TMRCA age estimates for subclades?

I do understand relative aging by SNP counting, either bottoms-up or tops-down but I do not think we have enough accuracy (because of variability in mutation rates) that getting into some of this precision is valuable anyway.

-------------------------------------------------------------------
https://www.dropbox.com/s/9k5g3mhhkkj59ty/SNP-Formation-4mike.jpg?dl=0

YFull provides two estimates each for “formed” and “TMRCA”.

The estimate provided in the info tables (“Formed 2” on dropbox .jpg above) is not labeled as such by YFull and is not readily available from any source other than YFull, to the best of my knowledge. If nothing else, it does provide a relative estimate of the order of formation of the children of an SNP. For example:

4814 YBP R-L21

4968 YBP ….. R-DF63
5096 YBP ….. ….. R-CTS6919
4529 YBP ….. ….. R-Y10997

4669 YBP ….. R-DF13
5527 YBP ….. ….. R-DF49
5476 YBP ….. ….. R-FGC19914
5280 YBP ….. ….. R-L679
5160 YBP ….. ….. R-CTS3386
5026 YBP ….. ….. R-Y14049
4855 YBP ….. ….. R-FGC5496
4601 YBP ….. ….. R-Y5717
4576 YBP ….. ….. R-CTS2501
4564 YBP ….. ….. R-CTS1751
4526 YBP ….. ….. R-Y16233
4483 YBP ….. ….. R-Z253
4454 YBP ….. ….. R-Z16503
4420 YBP ….. ….. R-Y9097
4396 YBP ….. ….. R-DF1
4289 YBP ….. ….. R-S1026
4285 YBP ….. ….. R-DF21
4275 YBP ….. ….. R-BY4048
4218 YBP ….. ….. R-Y5305
4210 YBP ….. ….. R-L1335
4133 YBP ….. ….. R-Z251
3923 YBP ….. ….. R-Y9090
3832 YBP ….. ….. R-FGC11134
3716 YBP ….. ….. R-S1051
3567 YBP ….. ….. R-Y30336
3294 YBP ….. ….. R-Z14303
3097 YBP ….. ….. R-Z255
2774 YBP ….. ….. R-L371

4236 YBP ….. R-A5846
4050 YBP ….. ….. R-Y16251

2949 YBP ….. R-A7906
2260 YBP ….. ….. R-FGC52350
1835 YBP ….. ….. R-A7908
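
The ordering argument can be reproduced in a few lines of Python. This is only a sketch: the four ages are the "Formed 2" values listed above for the immediate children of R-L21, while the dictionary and function names are made up for illustration and are not anything YFull publishes.

```python
# "Formed 2" estimates (YBP) for the immediate children of R-L21,
# copied from the list above. Variable names are illustrative only.
formed2_ybp = {
    "R-DF63": 4968,
    "R-DF13": 4669,
    "R-A5846": 4236,
    "R-A7906": 2949,
}

def formation_order(ages_ybp):
    """Return clade names sorted from the oldest to the youngest estimate."""
    return sorted(ages_ybp, key=ages_ybp.get, reverse=True)

order = formation_order(formed2_ybp)
print(" > ".join(order))
```

The same function works on any level of the list, for example the children of R-DF13.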

-------------------------------------------------------------------

Regardless, is the fact that YFull provides "formed" dates a major exclusive value-add for their method?

Mikewww
10-21-2017, 12:21 AM
YFull uses a term called the "formed" date. I have read their definition, and the formed date of a subclade is just the date of the TMRCA of the parent clade. ...

Regardless, is the fact that YFull provides "formed" dates a major exclusive value-add for their method?

Here are two explanatory web pages from YFull:

https://www.yfull.com/faq/what-yfulls-age-estimation-methodology/

https://www.yfull.com/faq/how-does-yfull-determine-formed-age-tmrca-and-ci/


Subclade "formed" age: The TMRCA (time to most recent common ancestor) of a subclade is used as the "formed" age of each branch of the subclade. Stated otherwise, the formed age of a branch is the same as the TMRCA of the "parent" subclade of that branch.
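
Put as code, that definition amounts to looking up the parent's TMRCA. The sketch below assumes a toy tree with invented clade names and invented TMRCA values (none of it is YFull data); it only shows the relationship between the two dates.

```python
# Hypothetical toy tree: clade -> (parent, TMRCA in YBP).
# Names and ages are invented for illustration; they are not YFull data.
tree = {
    "Parent": (None, 4500),
    "ChildA": ("Parent", 4200),
    "ChildB": ("Parent", 3100),
}

def tmrca(clade):
    """TMRCA of the clade itself."""
    return tree[clade][1]

def formed_age(clade):
    """Formed age of a clade = TMRCA of its parent clade."""
    parent = tree[clade][0]
    if parent is None:
        return None  # the toy tree does not define a formed age for its root
    return tmrca(parent)

for name in ("ChildA", "ChildB"):
    print(name, "formed", formed_age(name), "ybp, TMRCA", tmrca(name), "ybp")
```

Note that sibling branches always share one formed age under this definition, whatever their own TMRCAs are.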

Wing Genealogist
10-21-2017, 02:04 AM
Iain has published a technical explanation of the process he utilizes to arrive at his dates at: http://www.jb.man.ac.uk/~mcdonald/genetics/pipeline-summary.pdf

Please note it is very heavy in the statistical analysis and thus way over my pay grade.

rms2
10-21-2017, 02:41 AM
I'm no expert. I like YFull. I paid for their services. But I tend to think Iain McDonald is probably closer to right. Don't expect me to argue either way. I'm too damned tired.

MitchellSince1893
10-21-2017, 04:30 AM
I would like to provide a case study of YFull's and Alex Williamson's SNP dating methods (the latter based on McDonald's method) vs STR dating methods, using a relatively recent branch I'm quite familiar with (in the last 1000 years). As such it's in that area where both methods are useful (IMO once you get beyond 1000-1500 years the STR method tends to underestimate ages).

The branch is called FGC47884 on yfull and ytree.

https://www.yfull.com/tree/R-FGC47884/
http://ytree.net/BlockInfo.php?blockID=2638

Background: One man on this branch has done FGC Y Elite 2.1 and the other has done BigY and both have done FTDNA's 111 marker test. What is known from genealogical records is their MRCA had to have lived prior to 1720 AD.

1. Their Genetic Distance = 4 at 111 markers. According to FTDNA:

"A 107/111 match indicates a genealogical relationship. Most matches at this level are related as 10th or more recent cousins, and over half will be 6th or more recent cousins. This is well within the range of traditional genealogy."

From genealogical records, the closest these two men could possibly be is 8th cousins; more likely it is further back, 10th cousins or earlier.

2. Using the mymcgee tool with the FTDNA 111 markers at http://www.mymcgee.com/tools/yutility111.html set at 95% probability provides a TMRCA of 480 years (1470 AD).

3. Both men have submitted their data to YFull. Using YFull's STR matching tool, they have 19 out of 255 STRs that are different, for a distance of .075.

4. I combined the 111 marker and Yfull STR data to come up with 319 STRs to compare of which 24 had different values. Using the default settings at http://dna.cfsna.net/HAP/MLE.htm website and the 319 STRs provided TMRCA of 470 years (280 - 660 years ago). 1480 AD (1290 to 1670 AD)

5. One man took the FGC Y Elite 2.1 test and has 7 private SNPs in McDonald's BED region and 5 in the combBed region (used by Yfull for dating). The other took BigY and has 2 private SNPs in both McDonald's BED and the combBed region.

-On Yfull: TMRCA for this branch is 517 ybp or 1433 AD; with 95% CI of 900 AD to 1725 AD rounded to 500 ybp

-Alex Williamson's site
Using the aging method developed by Iain McDonald, the median age of this block is 738.557 YBP (1212 AD). The 95% confidence interval is 740 AD to 1564 AD.

So compared to the 319 STR dating method (470 ybp):
- YFull's best guess date is ~47 years older
- Alex Williamson's best guess date is 269 years older
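
For anyone who wants to redo the arithmetic, here is a minimal sketch. It assumes that "ybp" is counted back from 1950, which is inferred only from the pairing of 517 ybp with 1433 AD above; the variable and function names are mine.

```python
# Assumption: "ybp" is counted back from 1950, inferred from the pairing
# of 517 ybp with 1433 AD in the figures above.
PRESENT = 1950

def ybp_to_ad(ybp):
    """Convert years before present to a calendar year AD."""
    return PRESENT - ybp

str_ybp = 470  # best estimate from the 319-STR comparison
snp_estimates_ybp = {
    "YFull": 517,
    "Alex Williamson (McDonald method)": 738.557,
}

# How much older each SNP-based best guess is than the STR best guess.
diffs = {source: round(ybp - str_ybp) for source, ybp in snp_estimates_ybp.items()}

print("STR date:", ybp_to_ad(str_ybp), "AD")
print("YFull date:", ybp_to_ad(517), "AD")
for source, diff in diffs.items():
    print(f"{source}: {diff} years older than the STR estimate")
```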

In prior discussions with Iain McDonald, he stressed the 95% CI date range is more important than the best guess date. As such, YFull's and Alex Williamson's date ranges overlap the STR date (1480 AD for 319 STR method vs upper limit of 1725 AD for yfull and 1564 AD for Ytree.net).
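
The overlap claim can be checked mechanically. The sketch below uses only the ranges quoted in this post; the helper name `overlaps` is illustrative.

```python
# 95% ranges in calendar years AD, as quoted in this post.
ranges_ad = {
    "319 STRs": (1290, 1670),
    "YFull": (900, 1725),
    "ytree.net": (740, 1564),
}

def overlaps(a, b):
    """True if two closed intervals (low, high) share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]

str_range = ranges_ad["319 STRs"]
for source in ("YFull", "ytree.net"):
    print(source, "overlaps the STR range:", overlaps(ranges_ad[source], str_range))
```

Overlap is a weak test, of course: ranges this wide will overlap almost anything in the last 1000 years.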

This is not to imply that the STR dating method is more accurate than the SNP methods above. In this study of 111 markers, the STR method has proven to be quite variable. http://linealarboretum.blogspot.com/2016/07/are-111-marker-tests-better-at.html?m=1

While 111 markers aided in fine tuning our connectivity to those sharing our genetic and genealogical roots, genetic distance was not an accurate predictor of most relationships. Outliers can and do happen, as experienced with a GD=0; however, 78% of the participants at a GD=0 fell within the predicted level of six generations or less with a p ≤ .01, Two did not, and as explained earlier, this was due to convergence. We have seen close relatives (5th cousins and closer) having genetic distances up to 5, while 13th cousins, once removed have a GD=0.

But as I've been able to compare 319 STRs between these two men, I have to say I tend to place more confidence in this method's date.

If the MRCA for these two men is ever discovered (hundreds of man hours on my part have so far been to no avail), it will be interesting to revisit this branch.

Dibran
10-21-2017, 05:32 AM
I would like to provide a case study of yfull's, Iain McDonald's, and Alex Williamson's SNP dating methods vs STR dating methods, using a relatively recent branch I'm quite familiar with (in the last 1000 years). As such it's in that area where both methods are useful (IMO once you get beyond 1000-1500 years the STR method tends to underestimate ages)

The branch is called FGC47884 on yfull and ytree and R-BY12085 by Iain McDonald and FTDNA.

https://www.yfull.com/tree/R-FGC47884/
http://ytree.net/BlockInfo.php?blockID=2638

Background: One man on this branch has done FGC Y Elite 2.1 and the other has done BigY and both have done FTDNA's 111 marker test. What is known from genealogical records is their MRCA had to have lived prior to 1720 AD.

1. Their Genetic Distance = 4 at 111 markers. According to FTDNA
From genealogical records, the closest these two men could possibly be is 8th cousins; more likely it is further back, 10th cousins or earlier.

2. Using the mymcgee tool with the FTDNA 111 markers at http://www.mymcgee.com/tools/yutility111.html set at 95% probability provides a TMRCA of 480 years (1470 AD).

3. Both men have submitted their data to YFull. Using YFull's STR matching tool, they have 19 out of 255 STRs that are different, for a distance of .075.

4. I combined the 111 marker and Yfull STR data to come up with 319 STRs to compare of which 24 had different values. Using the default settings at http://dna.cfsna.net/HAP/MLE.htm website and the 319 STRs provided TMRCA of 470 years (280 - 660 years ago). 1480 AD (1290 to 1670 AD)

5. One man took the FGC Y Elite 2.1 test and has 7 private SNPs in McDonald's BED region and 5 in the combBed region (used by Yfull for dating). The other took BigY and has 2 private SNPs in both McDonald's BED and the combBed region.

-On Yfull: TMRCA for this branch is 517 ybp or 1433 AD; with 95% CI of 900 AD to 1725 AD rounded to 500 ybp

-Alex Williamson's site

-Iain McDonald's method: 988 AD (325 AD — 1455 AD)

So compared to the 319 STR dating method (470 ybp):
- YFull's best guess date is ~47 years older
- Alex Williamson's best guess date is 269 years older
- Iain McDonald's best guess date is 492 years older

In prior discussions with Iain, he stressed the 95% CI date range is more important than the best guess date. As such, YFull's and Alex Williamson's date ranges overlap the STR dates, and McDonald's is just outside of it (1480 AD for 319 STR method vs Iain's upper range of 1455 AD).

This is not to imply that the STR dating method is more accurate than the SNP methods above. In this study of 111 markers, the STR method has proven to be quite variable. http://linealarboretum.blogspot.com/2016/07/are-111-marker-tests-better-at.html?m=1


But as I've been able to compare 319 STRs between these two men, I have to say I tend to place more confidence in this method's date.

If the MRCA for these two men is ever discovered (hundreds of man hours on my part have so far been to no avail), it will be interesting to revisit this branch.

I tried copying and pasting my STR data. I just get a bunch of HTML text. Even when I paste it here, it still doesn't paste as a chart or anything. What am I missing?

JohnHowellsTyrfro
10-21-2017, 06:59 AM
I would like to provide a case study of yfull's, Iain McDonald's, and Alex Williamson's SNP dating methods vs STR dating methods, using a relatively recent branch I'm quite familiar with (in the last 1000 years). As such it's in that area where both methods are useful (IMO once you get beyond 1000-1500 years the STR method tends to underestimate ages)

The branch is called FGC47884 on yfull and ytree and R-BY12085 by Iain McDonald and FTDNA.

https://www.yfull.com/tree/R-FGC47884/
http://ytree.net/BlockInfo.php?blockID=2638

Background: One man on this branch has done FGC Y Elite 2.1 and the other has done BigY and both have done FTDNA's 111 marker test. What is known from genealogical records is their MRCA had to have lived prior to 1720 AD.

1. Their Genetic Distance = 4 at 111 markers. According to FTDNA
From genealogical records, the closest these two men could possibly be is 8th cousins; more likely it is further back, 10th cousins or earlier.

2. Using the mymcgee tool with the FTDNA 111 markers at http://www.mymcgee.com/tools/yutility111.html set at 95% probability provides a TMRCA of 480 years (1470 AD).

3. Both men have submitted their data to YFull. Using YFull's STR matching tool, they have 19 out of 255 STRs that are different, for a distance of .075.

4. I combined the 111 marker and Yfull STR data to come up with 319 STRs to compare of which 24 had different values. Using the default settings at http://dna.cfsna.net/HAP/MLE.htm website and the 319 STRs provided TMRCA of 470 years (280 - 660 years ago). 1480 AD (1290 to 1670 AD)

5. One man took the FGC Y Elite 2.1 test and has 7 private SNPs in McDonald's BED region and 5 in the combBed region (used by Yfull for dating). The other took BigY and has 2 private SNPs in both McDonald's BED and the combBed region.

-On Yfull: TMRCA for this branch is 517 ybp or 1433 AD; with 95% CI of 900 AD to 1725 AD rounded to 500 ybp

-Alex Williamson's site

-Iain McDonald's method: 988 AD (325 AD — 1455 AD)

So compared to the 319 STR dating method (470 ybp):
- YFull's best guess date is ~47 years older
- Alex Williamson's best guess date is 269 years older
- Iain McDonald's best guess date is 492 years older

In prior discussions with Iain, he stressed the 95% CI date range is more important than the best guess date. As such, YFull's and Alex Williamson's date ranges overlap the STR dates, and McDonald's is just outside of it (1480 AD for 319 STR method vs Iain's upper range of 1455 AD).

This is not to imply that the STR dating method is more accurate than the SNP methods above. In this study of 111 markers, the STR method has proven to be quite variable. http://linealarboretum.blogspot.com/2016/07/are-111-marker-tests-better-at.html?m=1


But as I've been able to compare 319 STRs between these two men, I have to say I tend to place more confidence in this method's date.

If the MRCA for these two men is ever discovered (hundreds of man hours on my part have so far been to no avail), it will be interesting to revisit this branch.

I don't understand the technicalities very well, but I'm sceptical about some of the estimated TMRCA dates on my paternal line of descent. In particular, there are numerous surnames of apparently both Welsh and Anglo-Saxon (or Old English) origin, which suggests to me that some of the dates may be much earlier than is currently indicated, or at least at the earlier end of the range.
Of course I may be wrong (probably am) and pending results may reveal more.

Wing Genealogist
10-21-2017, 10:35 AM
Please Note: Iain McDonald and YFull both create the age estimates from the Big Y (and in the case of YFull, the FGC) data submitted to them. If they don't have a good representation of the whole clade, then their analysis would be inaccurate.

This affects the youngest clades much more so than the ages of clades such as P312 or U106.

As I reported elsewhere, Iain and Greg Magoon have had some discussions regarding incorporating FGC results into his age analysis. The recent Build38 conversion may well delay this effort (as time spent dealing with the conversion is time which is not available to work on incorporating FGC results).

MitchellSince1893
10-21-2017, 12:56 PM
I tried copying and pasting my STR data. I just get a bunch of HTML text. Even when I paste it here, it still doesn't paste as a chart or anything. What am I missing?

If you are referring to the mymcgee tool, go to the FTDNA STR project page you are a member of, copy your row and at least two other samples' rows, and paste them into the mymcgee tool.

MitchellSince1893
10-21-2017, 02:26 PM
I would like to provide a case study of yfull's, Iain McDonald's, and Alex Williamson's SNP dating methods vs STR dating methods, using a relatively recent branch I'm quite familiar with (in the last 1000 years). As such it's in that area where both methods are useful (IMO once you get beyond 1000-1500 years the STR method tends to underestimate ages)

The branch is called FGC47884 on yfull and ytree and R-BY12085 by Iain McDonald and FTDNA.

https://www.yfull.com/tree/R-FGC47884/
http://ytree.net/BlockInfo.php?blockID=2638

.

My apologies. I made a mistake above, confusing FGC47884 with BY12085 as if they were the same branch. They are in fact parallel branches of FGC12401. I shouldn't post when not feeling well.

I have edited my post above to remove the BY12085 data and the associated McDonald dating method.

Michał
10-21-2017, 02:39 PM
YFull uses a term called the "formed" date. I have read their definition, and the formed date of a subclade is just the date of the TMRCA of the parent clade. However, although the labeling does include "formed", in the detailed tables they show their calculations and have ages that some call "formed".

Below is an explanation. I understand how the tables/calculations work, I think, but I don't see these dates as particularly unique versus other SNP age estimates and their calculations. I've been told these are exclusives from YFull.

Please convince me of their value, but do so relative to the method that Iain McDonald uses. Is there really any added value in the term "formed" date? And are the sub-table calculations showing the relative age between brother subclades any more useful than Iain McDonald's TMRCA age estimates for subclades?
I am not sure whether I understand your problem, but in case this is all about using the term "formed" (or let's say "formation date" or "birth date"), this is just to make customers realize that the age of each clade might be understood either as the "formation age" or as the "expansion age" (with the latter corresponding to TMRCA). Of course, many customers (especially the experienced ones) are perfectly aware that it is the TMRCA age of a parent clade that corresponds to that "formed" date, but this is not so obvious for a regular customer who usually does not discriminate between these two possibilities (so he/she would have no idea that one needs to look for the TMRCA age of the parent clade to learn when the clade was "born" rather than when it "expanded"). In fact, most customers still confuse the TMRCA age with the age of a given mutation that is used to define this clade (but this is a slightly different story).

To summarize, there is nothing in using the term "formed" date that would make the YFull approach fundamentally different from the approach used by Iain McDonald. At the moment, the major advantage of YFull is that they produce such age estimates for nearly all possible haplogroups/clades, while Iain McDonald does it for some selected subclades of R1b only. The major advantage of Iain McDonald is that he still does it for free.



I do understand relative aging by SNP counting, either bottoms-up or tops-down but I do not think we have enough accuracy (because of variability in mutation rates) that getting into some of this precision is valuable anyway.

Counting SNPs bottoms-up is much more reliable than counting them tops-down, especially when we have more than two descending lineages. Also, while I agree that the accuracy might not be high enough when a small number of descending lineages is analyzed, this is no longer the case when we have tens or hundreds of descending lineages for a given clade, so we get the average with a relatively low standard deviation, and thus comparing the ages of two parallel subclades can produce statistically significant differences.
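
To put a rough number on this, one can treat the SNP count on each descending lineage as an independent Poisson draw. That is an idealization (real lineages share ancestry and differ in coverage), and the average of 30 SNPs used below is hypothetical, but it shows how quickly the standard error of the average shrinks as lineages are added.

```python
import math

def standard_error_of_mean(mean_snps, n_lineages):
    """Standard error of the average SNP count, assuming independent
    Poisson-distributed counts (variance equal to the mean)."""
    return math.sqrt(mean_snps / n_lineages)

mean_snps = 30  # hypothetical average number of downstream SNPs
for n in (2, 10, 100):
    se = standard_error_of_mean(mean_snps, n)
    print(f"{n:>3} lineages: standard error about {se:.2f} SNPs")
```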

Wing Genealogist
10-21-2017, 03:17 PM
Iain readily acknowledges that he read the "white paper" YFull published regarding their age analysis calculations. He also acknowledges that by and large he uses a very similar approach, but states he adds in some calculations which YFull does not perform.

Iain had been using the VCF/BED files while YFull uses the BAM file. There is more data in the BAM files than in the old VCF/BED files (such as quality scores), which helps YFull refine the age estimates.

By and large, I have to trust that what Iain is saying is accurate (as the description of his calculations is beyond my mathematical abilities). He does note where he believes the extra steps he has added to his calculations do a slightly better job of estimating ages than YFull's.

By far, the biggest advantage Iain has over YFull (in regards to U106 and to a lesser extent P312) is the fact he has been able to collect many more kits. In this case, size does matter, as it allows more data and thus a better age estimation.

Mikewww
10-21-2017, 03:18 PM
I am not sure whether I understand your problem
...
To summarize, there is nothing in using the term "formed" date that would make the YFull approach fundamentally different from the approach used by Iain McDonald.
My problem was that a YFull advocate (not YFull) is claiming the "formed date" is an exclusive value-add from YFull. What you are saying is what I think. It is not. It is just another term but fundamentally the approaches are similar.


At the moment, the major advantage of YFull is that they produce such age estimates for nearly all possible haplogroups/clades, while Iain McDonald does it for some selected subclades of R1b only. The major advantage of Iain McDonald is that he still does it for free.
This is true. However, within those subclades of P312 and U106, or L151/L11(S27), McDonald has more individual results, probably many more. He will have both more branches and a larger sample.

L151 is the largest haplogroup for Western and Central Europe. This is a challenge when someone is charging for something when someone else is providing it for free. On the other end you've got FTDNA providing a free on-line Y chromosome browser now.

In that sense, free stuff is cutting the heart out of YFull's target market.

Mikewww
10-21-2017, 03:27 PM
Counting SNPs bottoms-up is much more reliable than counting them tops-down, especially when we have more than two descending lineages. Also, while I agree that the accuracy might not be high enough when a small number of descending lineages is analyzed, this is no longer the case when we have tens or hundreds of descending lineages for a given clade, so we get the average with a relatively low standard deviation, and thus comparing the ages of two parallel subclades can produce statistically significant differences.
What's the standard deviation value across the board? I have not seen the actual calculation of the variance of SNP mutation rates. By eyeball (anecdotally) I see some wide swings, so I am wary of YFull's confidence intervals.

Joe B
10-21-2017, 04:54 PM
My problem was that a YFull advocate (not YFull) is claiming the "formed date" is an exclusive value-add from YFull. What you are saying is what I think. It is not. It is just another term but fundamentally the approaches are similar.


This is true. However, within those subclades of P312 and U106, or L151/L11(S27), McDonald has more individual results, probably many more. He will have both more branches and a larger sample.

L151 is the largest haplogroup for Western and Central Europe. This is a challenge when someone is charging for something when someone else is providing it for free. On the other end you've got FTDNA providing a free on-line Y chromosome browser now.

In that sense, free stuff is cutting the heart out of YFull's target market.
This constant drumbeat against YFull is getting tiresome.

Trojet
10-21-2017, 05:05 PM
This constant drumbeat against YFull is getting tiresome.

Indeed! Furthermore, he's been constantly discrediting FTDNA's competitors, while promoting FTDNA's product. Sometimes I keep asking myself why?

MitchellSince1893
10-21-2017, 05:24 PM
Characterizing the owner of FGC's optimistic predictions about his company's future as the comments of a "loser" didn't help his cause.

Mikewww
10-21-2017, 05:39 PM
This constant drumbeat against YFull is getting tiresome.

Joe B, I think you could substitute other vendor names, particularly one, for "YFull" and we find that "tiresome" too. You agree, right?

I post much more frankly and curtly here than on other forums because I perceive the experience level on the Anthrogenica forum is much higher. That's a compliment. :) I like a good debate, which is fun. I'm no Antonin Scalia, of whom it was written "has been known for his love of debate, intellectual curiosity", but knowledge and understanding improve with logical debate.

I opened this topic to learn more and to clear up misunderstandings. I wasn't getting too far in terms of YFull experts on an L21 discussion group explaining the statement below. The only purported YFull expert wrote:

"​I think that leaves YFull as the-only-game-in-town for formation age estimates."
https://groups.yahoo.com/neo/groups/R1b-L21-Project/conversations/messages/35564

It didn't make sense to me based on YFull definitions and I didn't see that the differentiation between YFull and Iain McDonald on this point was very valuable, but as I noted on that forum I might have misunderstood. Hence, I opened this topic here on Anthrogenica.

I understand that this response is a "tit for tat" or like a child saying "but he hit me first." Still given the nature of the other forum, it was important to clear it up so people would either see the "formed" dates as valuable or not too valuable and not really that exclusive.

Don't forget I am a YFull customer too and will attempt to send them my new Hg38 BAM file when I get it. I've stated that before and I will again soon in this thread in this forum if you want to follow along.
https://groups.yahoo.com/neo/groups/R1b-YDNA/conversations/messages/7202

Originally, among my personal reasons for ordering YFull was to get the age estimates. That's no longer a good exclusive.

I think the extra Y STRs are potentially valuable, but not much so until their database fills up more and they have more in the way of matching systems. I don't know if that will ever happen.

They never discovered any new SNPs for me and my little subclade because we already dived into the raw results on our own. Unfortunately, they provided some redundant SNP names even though I asked them not to.

Still, for only $49, you get a second opinion on SNPs, you get added to another tree and you get more data (STRs). To me it's a good deal.

vettor
10-21-2017, 05:51 PM
Joe B, I think you could substitute other vendor names, particularly one, for "YFull" and we find that "tiresome" too. You agree, right?

I post much more frankly and curtly here than on other forums because I perceive the experience level on the Anthrogenica forum is much higher. That's a compliment. :) I like a good debate, which is fun. I'm no Antonin Scalia, of whom it was written "has been known for his love of debate, intellectual curiosity", but knowledge and understanding improve with logical debate.

I opened this topic to learn more and to clear up misunderstandings. I wasn't getting too far in terms of YFull experts on an L21 discussion group explaining the statement below. The only purported YFull expert wrote:

"​I think that leaves YFull as the-only-game-in-town for formation age estimates."
https://groups.yahoo.com/neo/groups/R1b-L21-Project/conversations/messages/35564

It didn't make sense to me based on YFull definitions and I didn't see that the differentiation between YFull and Iain McDonald on this point was very valuable, but as I noted on that forum I might have misunderstood. Hence, I opened this topic here on Anthrogenica.

I understand that this response is a "tit for tat" or like a child saying "but he hit me first." Still given the nature of the other forum, it was important to clear it up so people would either see the "formed" dates as valuable or not too valuable and not really that exclusive.

Don't forget I am a YFull customer too and will attempt to send them my new Hg38 BAM file when I get it. I've stated that before and I will again soon in this thread in this forum if you want to follow along.
https://groups.yahoo.com/neo/groups/R1b-YDNA/conversations/messages/7202

Originally, among my personal reasons for ordering YFull was to get the age estimates. That's no longer a good exclusive.

I think the extra Y STRs are potentially valuable, but not much so until their database fills up more and they have more in the way of matching systems. I don't know if that will ever happen.

They never discovered any new SNPs for me and my little subclade because we already dived into the raw results on our own. Unfortunately, they provided some redundant SNP names even though I asked them not to.

Still, for only $49, you get a second opinion on SNPs, you get added to another tree and you get more data (STRs). To me it's a good deal.

We all have our own opinions and also targets to achieve on YFull. For me, YFull is very beneficial, as it found a new SNP for my line and also a private SNP (they stated to me in a letter that it will no longer be a private SNP for me if they find others who have it; it will then be another branch splitting off).

To me, along with the FTDNA administrator, YFull has been money very well spent.

Mikewww
10-21-2017, 07:31 PM
Counting SNPs bottoms-up is much more reliable than counting them tops-down, especially when we have more than two descending lineages. Also, while I agree that the accuracy might not be high enough when a small number of descending lineages is analyzed, this is no longer the case when we have tens or hundreds of descending lineages for a given clade, so we get the average with a relatively low standard deviation, and thus comparing the ages of two parallel subclades can produce statistically significant differences.


What's the standard deviation value across the board? I have not seen the actual calculation of the variance of SNP mutation rates. By eyeball (anecdotally) I see some wide swings, so I am wary of YFull's confidence intervals.

Dr. Iain McDonald is an astrophysicist and I've conversed with him directly. He is sharp and I trust his intellect. That's an understatement. To read more about him see:
http://www.jb.man.ac.uk/~mcdonald/me.html

I picked out a couple of very large subclades with lots of Big Y results and divided the width of the 95% confidence range by the best estimate age in ybp.

U106 ... Chalcolithic-Bronze . 24.6% total range with 95% confidence
P312 ... Chalcolithic-Bronze . 25.8%
M222 ... Classical Period .... 42.6%
L226 ... Anglo-Saxon Era ..... 44.0%

http://www.jb.man.ac.uk/~mcdonald/genetics/table.html
http://www.jb.man.ac.uk/~mcdonald/genetics/p312/table.html
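
The ratio itself is simple to reproduce. The function below is a sketch of the calculation described above; the input numbers in the example are hypothetical and chosen only to show the arithmetic, since the real bounds have to be read off the linked tables.

```python
def relative_ci_width(lower_ybp, upper_ybp, best_ybp):
    """Total width of the 95% range as a percentage of the best estimate."""
    return 100.0 * (upper_ybp - lower_ybp) / best_ybp

# Hypothetical inputs, not taken from the tables linked above.
example = relative_ci_width(lower_ybp=900, upper_ybp=1100, best_ybp=1000)
print(f"{example:.1f}% total range with 95% confidence")
```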

Needless to say, I don't know if it is worth getting too excited about estimating ages down to a precision of less than a century. I'm okay with displaying precision and think that makes sense, but we have to recognize the accuracy is just not good enough to claim significant differentiation between these two SNP-based TMRCA models, in my opinion.

Humanist
10-21-2017, 11:45 PM
This thread is being reviewed. Some posts may be either edited or deleted if they are found to have violated our ToS. Please keep in mind the following (http://www.anthrogenica.com/faq.php):


3.12 Anthrogenica encourages its members to participate in discussions in a topic-focused manner. Personalization of discussions is completely prohibited at all times. This includes (and is not limited to) direct personal attacks, accusations, insinuations and false disclosures. Additionally, discussions that degenerate into inconsequential flaming or inanity will be deleted without prior notice. Note that this discussion policy also applies to Anthrogenica's Private Messaging and Visitor Message functions.

Michał
10-21-2017, 11:48 PM
However, within those subclades of P312 and U106, or L151/L11(S27), McDonald has more individual results, probably many more. He will have both more branches and a larger sample.
Unfortunately, this doesn't help the rest of us who are not R1b-L11 members, so we have practically no alternative to YFull (not to mention that YFull offers much more than age estimates).



This is a challenge when someone is charging for something when someone else is providing it for free.

At the moment, nobody is providing it for free, if not counting some subclade-specific or family-specific projects. Also, YFull doesn't charge for age estimates alone, as this is just a part of a much larger package that includes functions/tools not available elsewhere, so far, so it is not fair to compare it with what Iain McDonald offers for free (to the members of clade L11 only).



On the other end you've got FTDNA providing a free on-line Y chromosome browser now.

In that sense, free stuff is cutting the heart out of YFull's target market.
Wow, I didn't know FTDNA is offering this for free. You mean we don't need to pay for Big Y to get access to this service (otherwise this would be for free only in that sense in which the age estimates are for free at YFull)?



I picked out a couple of very large subclades with lots of Big Y results and divided the 95% confidence ranges by the best estimated age in ybp.

U106 ... Chalcolithic-Bronze . 24.6% total range with 95% confidence
P312 ... Chalcolithic-Bronze . 25.8%
M222 ... Classical Period .... 42.6%
L226 ... Anglo-Saxon Era ..... 44.0%

http://www.jb.man.ac.uk/~mcdonald/genetics/table.html
http://www.jb.man.ac.uk/~mcdonald/genetics/p312/table.html

Needless to say, I don't know if it is worth getting too excited about estimating ages down to a precision of less than a century. I'm okay with displaying precision and think that makes sense, but we have to recognize that the accuracy is just not good enough to claim a significant difference between these two SNP-based TMRCA models, in my opinion.
The 95% confidence interval for age estimate is not the same as the standard deviation for an average number of downstream SNPs. When calculating the confidence interval for the estimated TMRCA age (based on the average number of SNPs), one needs to take into account some additional factors, like the uncertainty associated with the number of years per SNP. In other words, it is possible that the difference between the average numbers of downstream SNPs for two parallel clades is statistically significant, while there is no such statistically significant difference between the predicted TMRCA ages.
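To illustrate the point about additional uncertainty, here is a rough sketch using first-order error propagation for age = (mean SNP count) x (years per SNP). All numbers, including the years-per-SNP rate and its error, are hypothetical and are not YFull's or McDonald's values. The sketch also treats each clade's age range on its own; a rate error shared by both clades would be correlated, which this simple comparison ignores.

```python
import math


def age_with_uncertainty(mean_snps, se_snps, years_per_snp, se_years_per_snp):
    """Return (age, standard error) for age = mean_snps * years_per_snp.

    First-order propagation for independent inputs: relative variances add.
    """
    age = mean_snps * years_per_snp
    rel_var = (se_snps / mean_snps) ** 2 + (se_years_per_snp / years_per_snp) ** 2
    return age, age * math.sqrt(rel_var)


def interval95(value, se):
    """Approximate 95% interval assuming a normal distribution."""
    return value - 1.96 * se, value + 1.96 * se


def overlaps(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical inputs, NOT taken from YFull or McDonald.
YEARS_PER_SNP, SE_YEARS_PER_SNP = 140.0, 14.0
clade_a = (30.0, 0.8)  # mean downstream SNPs, standard error of that mean
clade_b = (34.0, 0.8)

snp_intervals = [interval95(m, s) for m, s in (clade_a, clade_b)]
age_intervals = [
    interval95(*age_with_uncertainty(m, s, YEARS_PER_SNP, SE_YEARS_PER_SNP))
    for m, s in (clade_a, clade_b)
]

print("SNP-count ranges overlap:", overlaps(*snp_intervals))
print("Age ranges overlap:", overlaps(*age_intervals))
```

With these made-up inputs the SNP-count ranges are separated while the age ranges overlap, because the rate uncertainty dominates the width of each age range.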

leonardo
10-22-2017, 12:09 AM
I was in the first batch of the BigY test and YFull actually did my analysis (and others) for free. Not that I would have objected to the $49 charge for the following: They found several new subclades for my branch. I value their TMRCA estimation. I also got a Mt-Haplogroup classification (which FTDNA later removed from the BAM file) and STR results, with matches approaching 400 markers.

Saetro
10-22-2017, 01:25 AM
The 95% confidence interval for age estimate is not the same as the standard deviation for an average number of downstream SNPs. When calculating the confidence interval for the estimated TMRCA age (based on the average number of SNPs), one needs to take into account some additional factors, like the uncertainty associated with the number of years per SNP. In other words, it is possible that the difference between the average numbers of downstream SNPs for two parallel clades is statistically significant, while there is no such statistically significant difference between the predicted TMRCA ages.

I myself would like to have both the Standard Deviation and the 95% confidence limits.
Although the latter can be ridiculously wide if very few samples are available.

We need to encourage more people to do this extended testing and they need to be told something that is accurate.
The online tree itself needs to be brief (with an easily findable place for definitions and explanations).
New individuals cannot be expected to know the subtleties.
They need to have spelled out that the subclade above them has a fairly certain dating of whatever it is.
And that their downstream one cannot be accurately dated because few examples are available.
Then provide a very rough estimate based on whatever model and state which it is and whether it tends to bias in one direction or another and emphasize that it is rough.
And encourage the individual to bring others to test who may be able to reduce the uncertainty in this age estimate.

This last, surely, is what is needed.

Mikewww
10-22-2017, 02:15 AM
... Don't forget I am a YFull customer too and will attempt to send them my new Hg38 BAM file when I get it. I've stated that before and I will again soon in this thread in this forum if you want to follow along.
https://groups.yahoo.com/neo/groups/R1b-YDNA/conversations/messages/7202

Originally, among my personal reasons for ordering YFull was to get the age estimates. That's no longer a good exclusive.

I think the extra Y STRs are potentially valuable, but not much so until their database fills up more and they have more in the way of matching systems. I don't know if that will ever happen.

They never discovered any new SNPs for me and my little subclade because we already dived into the raw results on our own. Unfortunately, they provided some redundant SNP names even though I asked them not to.

Still, for only $49, you get a second opinion on SNPs, you get added to another tree and you get more data (STRs). To me it's a good deal.

Sometimes people forget my presentation often has two sides.

Below is what I just told 1,992 R1b folks, many of whom are newbies. This was on a contentious thread where one person strongly recommends YFull interpretations for all Big Y people, another says it is redundant and a waste, and another says U106 does not recommend YFull (although another U106 admin came in and corrected that.)


All testing and all interpretation and analysis services are optional.

Genetic genealogy is a hobby in the first place and we all have budgetary capabilities and constraints so all purchase decisions are personal and legitimate.

I have done both FGC interpretations and YFull interpretations. I've posted this elsewhere but I didn't really get any new SNPs out of my YFull interpretation nor did my branch-mate but that's probably just because we already dived deeply into the raw results. I like to get the significant public trees updated with my branch information so in that regard I don't mind, and even like the redundancy among the ISOGG, YFull and FTDNA trees.

I will send YFull my Hg38 BAM file when I get it and will pay for re-interpretation if they charge me. To me the $49 is not a big deal for a second opinion and to get documented on another tree. However, for some that value isn't there. That's okay.

Let's try to be careful and not knock any of the products and services. Everything has warts and different pricing. We just need to make sure we understand what we want and what we are getting.

Mike W https://groups.yahoo.com/neo/groups/R1b-YDNA/conversations/messages/7205

Anthrogenica is a forum for people who can eat meat (if you know the ancient words), so I am more direct here, as I posted in reply #18: "I post much more frankly and curtly here than on other forums because I perceive the experience level on the Anthrogenica forum is much higher."

Mikewww
10-22-2017, 05:13 PM
Dr. Iain McDonald is an astrophysicist and I've conversed with him directly. He is sharp and I trust his intellect. That's an understatement. To read more about him see:
http://www.jb.man.ac.uk/~mcdonald/me.

Iain McDonald posted the below yesterday. He gave me permission to quote him.


CLADE TMRCA DATE: This is the date at which everyone tested in the clade shares their most-recent common ancestor. This is the date that's most useful to most people, since it tells you when everyone with that mutation today was last related. The only significant drawback is that it's possible for someone new to come along who is more distantly related, which pushes the TMRCA date further back. This can be a problem for young clades.

BRANCH FORMATION DATE: This is the date that YFull uses as its "formed" date. This is actually the TMRCA of the parent clade.

SNP FORMATION DATE: This is when the mutation first formed. This date is often very uncertain, and rarely useful. If you have a long list of SNPs (e.g. M269 and the 106 equivalent SNPs), you don't know where M269 came in this sequence. It could have been first or 107th, but probably it was somewhere in the middle. Such formation dates can be calculated from the previous two dates, so also pick up the uncertainty in both of them. That means that M269 could have formed more than 13,000 years ago, or less than 6,400 years ago. The point is arbitrary, since it has no useful historical context, and since (in our example) M269 is simply picked arbitrarily from a list of equivalent SNPs to denote the clade R-M269. Its use is limited mainly to comparisons against ancient DNA, where the clade TMRCA may under-estimate the relevant date.

For these reasons, the formation date in YFull is not particularly useful, and the TMRCA date is the most valid for most people's research.
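As a rough illustration of why the SNP formation date is so uncertain, the sketch below interpolates where one SNP might fall between the branch formation date (the parent's TMRCA) and the clade TMRCA. It assumes the equivalent SNPs are evenly spaced in time, which is a simplification made only for this example. The dates are hypothetical and are not M269's actual estimates, and the function is not part of Iain's or YFull's method.

```python
def snp_formation_estimate(branch_formed_ybp, clade_tmrca_ybp, position, n_snps):
    """Interpolate when the SNP at `position` (1 = oldest) of `n_snps` formed.

    ASSUMPTION: the n_snps equivalent SNPs are evenly spaced in time along
    the branch. This is a simplification for illustration only.
    """
    if not 1 <= position <= n_snps:
        raise ValueError("position must be between 1 and n_snps")
    if branch_formed_ybp < clade_tmrca_ybp:
        raise ValueError("branch formation must not be younger than the clade TMRCA")
    span = branch_formed_ybp - clade_tmrca_ybp
    return branch_formed_ybp - span * position / (n_snps + 1)


# Hypothetical dates for a branch carrying 107 equivalent SNPs.
BRANCH_FORMED, CLADE_TMRCA, N_SNPS = 10000, 6000, 107
for pos in (1, 54, 107):
    estimate = snp_formation_estimate(BRANCH_FORMED, CLADE_TMRCA, pos, N_SNPS)
    print(pos, round(estimate))
```

Since the true position of the named SNP in the sequence is unknown, any value across nearly the whole branch is possible, and that is before adding the uncertainty in the two end dates themselves.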

Edit/addition:
Iain’s last statement is particularly important to my purpose in the topic. Sometimes people think I am trying to discredit a function or service out of the blue. Generally, I am triggered by something posted that doesn’t sound right or may be questionable. I investigate and attempt to correct if it is appropriate. This is to the benefit of the testing audience.

The trigger in this case is the following statement by a YFull advocate to 1,463 people who probably accept this statement as true unless there is more discussion. I think it was an overstatement.
"I think that leaves YFull as the-only-game-in-town for formation age estimates."
https://groups.yahoo.com/neo/groups/...messages/35564

Should I have let a misleading statement stand?

Mikewww
10-23-2017, 01:43 AM
I picked out a couple of very large subclades with lots of Big Y results and divided the 95% confidence ranges by the best estimated age in ybp.

U106 ... Chalcolithic-Bronze . 24.6% total range with 95% confidence
P312 ... Chalcolithic-Bronze . 25.8%
M222 ... Classical Period .... 42.6%
L226 ... Anglo-Saxon Era ..... 44.0%

http://www.jb.man.ac.uk/~mcdonald/genetics/table.html
http://www.jb.man.ac.uk/~mcdonald/genetics/p312/table.html

For comparison with McDonald’s 95% confidence ranges, here are YFull’s. I don’t understand why P312’s range is bigger. The sample size should be good. The two more youthful subclades have tighter ranges.

U106 ... Chalcolithic-Bronze . 21.3% total range with 95% confidence
P312 ... Chalcolithic-Bronze . 35.6%
M222 ... Classical Period .... 27.5%
L226 ... Anglo-Saxon Era ..... 34.5%

https://www.yfull.com/tree/R-L151/

JohnHowellsTyrfro
10-23-2017, 07:15 AM
Sorry, I'm having difficulty following this. (U106 Z326 under R-Y 20959 formed 3,300 ybp TMRCA 650ybp)
Our oldest ancestor SNP is FGC3946 which Iain estimates around 995AD. Our next upstream (older) SNP is FGC18850 estimated at about 778 BC.
So what should I be looking at and is this "young" or "old" and are the dates reliable or not so reliable? Thanks.

Michał
10-23-2017, 09:32 AM
Sorry, I'm having difficulty following this. (U106 Z326 under R-Y 20959 formed 3,300 ybp TMRCA 650ybp)
Our oldest ancestor SNP is FGC3946 which Iain estimates around 995AD. Our next upstream (older) SNP is FGC18850 estimated at about 778 BC.
So what should I be looking at and is this "young", "old" and are the dates reliable or not so reliable? Thanks.
Firstly, your above TMRCA ages according to Iain McDonald should be updated (it is 1064 AD for FGC3946 and 836 BC for FGC18850). Secondly, both YFull and Iain McDonald can estimate the TMRCA ages only for those clades for which they have any raw data (ie. BAM files in the case of YFull and VCF/BED files for Iain McDonald). Iain McDonald performs his calculations exclusively for clades R1b-U106 and R1b-P312, so he has access to most Big Y results from clade R1b-U106, while only a small fraction of those BigY-tested U106 people have submitted their BAM files to YFull. In other words, the U106 part of the YFull tree does not include many lineages that have been analyzed by Iain McDonald, so this strongly affects the TMRCA ages of particular clades (not to mention their very presence on the tree). In the above case, it seems that clade Y20959/FGC3946 is represented at YFull only by three quite closely related people (I guess these are all members of a more downstream subclade BY20441), so this is probably why their TMRCA age is much younger than the TMRCA age for all Y20959/FGC3946 members who were analyzed by Iain. If you are one of those who are FGC3946+ but BY20441-, then all this is because you have not submitted your BAM file to YFull. As for the difference between 3,300 ybp (according to YFull) and about 2,800 ybp (according to Iain McDonald) for the parental clade FGC18850, this is likely related to the slightly different procedures that YFull and Iain McDonald use in their calculations, so it is hard to say which age is closer to the truth. Also, please note that the margin of error is 2500-4200 ybp in the case of YFull and about 2250-3450 ybp for Iain's estimate, so both these estimates are within the range suggested by the other estimation.



Our oldest ancestor SNP is FGC3946 which Iain estimates around 995AD. Our next upstream (older) SNP is FGC18850 estimated at about 778 BC.
All this is wrong, and you should avoid making such statements that are very likely to confuse the less experienced readers (by suggesting that the TMRCA age is an age of a given mutation/SNP). See Iain's explanation that was quoted above by Mike:
http://www.anthrogenica.com/showthread.php?12370-The-YFull-TMRCA-vs-Iain-McDonald-methods&p=300014&viewfull=1#post300014

Michał
10-23-2017, 09:45 AM
The trigger in this case is the following statement by a YFull advocate to 1,463 people who probably accept this statement as true unless there is more discussion. I think it was an overstatement.
"I think that leaves YFull as the-only-game-in-town for formation age estimates."
https://groups.yahoo.com/neo/groups/...messages/35564

Should I have let a misleading statement stand?
That (over)statement was made on a forum that is closed to the public (and thus not accessible to most of us), so it would be enough to correct your opponent while discussing this question on that very forum. In fact, if not you, we would be still unaware of that unfortunate argument, so one could accuse you of propagating such misleading statements. ;)

leonardo
10-23-2017, 10:55 AM
Also, please note that the margin of error is 2500-4200 ybp in the case of YFull and about 2250-3450 ybp for Iain's estimate, so both these estimates are within the range suggested by the other estimation.

That was my first thought: neither estimate falls outside the range suggested by the other estimation or that of the standard deviation for the calculation, which typically has an extensive range - for all TMRCA estimates. It really is an educated guess at best, at this point, is it not?
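The mutual-consistency check can be written out as a short sketch using the FGC18850 figures quoted earlier in the thread (YFull: 3,300 ybp, 2500-4200; McDonald: about 2,800 ybp, about 2250-3450). The class and method names are hypothetical, and McDonald's figures are approximate as stated in the quoted post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeEstimate:
    source: str
    best_ybp: float
    low_ybp: float
    high_ybp: float

    def contains(self, age_ybp):
        """True if age_ybp lies inside this estimate's stated range."""
        return self.low_ybp <= age_ybp <= self.high_ybp


# Figures for FGC18850 as quoted in the thread (McDonald's are approximate).
yfull = AgeEstimate("YFull", 3300, 2500, 4200)
mcdonald = AgeEstimate("McDonald", 2800, 2250, 3450)

print(mcdonald.contains(yfull.best_ybp))  # True
print(yfull.contains(mcdonald.best_ybp))  # True

# Shared part of the two ranges, in ybp.
overlap = (max(yfull.low_ybp, mcdonald.low_ybp), min(yfull.high_ybp, mcdonald.high_ybp))
print(overlap)  # (2500, 3450)
```

Each best estimate sits inside the other source's range, so on these figures the two methods cannot be said to disagree.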

Cofgene
10-23-2017, 11:33 AM
Unfortunately, this doesn't help the rest of us who are not R1b-L11 members, so we have practically no alternative to YFull (not to mention that YFull offers much more than age estimates).


Iain's code is open source and individuals in other major haplogroup regions can take on the task of setting up the analysis for their region if they have the resources. It DOES take some effort to set up the branching constraints so that comparison and follow-on age estimation operations produce an output which overlays with the current tree structure. Iain's system needs some redesign work to get around some limitations when several thousand BigY results are analyzed in a complete run. Computers with SSD or M.2 drives will allow the analysis to be very fast.

JohnHowellsTyrfro
10-23-2017, 03:19 PM
Firstly, your above TMRCA ages according to Iain McDonald should be updated (it is 1064 AD for FGC3946 and 836 BC for FGC18850). Secondly, both YFull and Iain McDonald can estimate the TMRCA ages only for those clades for which they have any raw data (ie. BAM files in the case of YFull and VCF/BED files for Iain McDonald). Iain McDonald performs his calculations exclusively for clades R1b-U106 and R1b-P312, so he has access to most Big Y results from clade R1b-U106, while only a small fraction of those BigY-tested U106 people have submitted their BAM files to YFull. In other words, the U106 part of the YFull tree does not include many lineages that have been analyzed by Iain McDonald, so this strongly affects the TMRCA ages of particular clades (not to mention their very presence on the tree). In the above case, it seems that clade Y20959/FGC3946 is represented at YFull only by three quite closely related people (I guess these are all members of a more downstream subclade BY20441), so this is probably why their TMRCA age is much younger than the TMRCA age for all Y20959/FGC3946 members who were analyzed by Iain. If you are one of those who are FGC3946+ but BY20441-, then all this is because you have not submitted your BAM file to YFull. As for the difference between 3,300 ybp (according to YFull) and about 2,800 ybp (according to Iain McDonald) for the parental clade FGC18850, this is likely related to the slightly different procedures that YFull and Iain McDonald use in their calculations, so it is hard to say which age is closer to the truth. Also, please note that the margin of error is 2500-4200 ybp in the case of YFull and about 2250-3450 ybp for Iain's estimate, so both these estimates are within the range suggested by the other estimation.



All this is wrong, and you should avoid making such statements that are very likely to confuse the less experienced readers (by suggesting that the TMRCA age is an age of a given mutation/SNP). See Iain's explanation that was quoted above by Mike:
http://www.anthrogenica.com/showthread.php?12370-The-YFull-TMRCA-vs-Iain-McDonald-methods&p=300014&viewfull=1#post300014

Thank you. I AM one of those less experienced readers which is why I am seeking guidance and repeating what I was told by someone who I assumed understood these things. :)

So what is a "young clade" in the context mentioned below please?

"CLADE TMRCA DATE: This is the date at which everyone tested in the clade shares their most-recent common ancestor. This is the date that's most useful to most people, since it tells you when everyone with that mutation today was last related. The only significant drawback is that it's possible for someone new to come along who is more distantly related, which pushes the TMRCA date further back. This can be a problem for young clades."

Mikewww
10-23-2017, 03:33 PM
Iain's code is open source and individuals in other major haplogroup regions can take on the task of setting up the analysis for their region if they have the resources. It DOES take some effort to set up the branching constraints so that comparison and follow-on age estimation operations produce an output which overlays with the current tree structure. Iain's system needs some redesign work to get around some limitations when several thousand BigY results are analyzed in a complete run. Computers with SSD or M.2 drives will allow the analysis to be very fast.
Cofgene makes a good point, the Big Tree and McDonald’s Age Estimate tools are open and have expanded in usage over the years. These are great volunteer services and I am sure they could be leveraged to more haplogroups.

Today, Williamson’s Big Tree and McDonald’s Branch Age Estimates cover R-L151, so there is pretty good momentum behind them. R-L151 is the monster Y haplogroup of Western and Central Europe and therefore of many European migrants to the New World. See Figure 2 of "The cautionary tale of R-M269 in Europe", Busby et al. The frequency of R-S127(L11), aka R-L151, is well over 50% along the Atlantic facade of Europe and approximately 50% for most of Central Europe.

It’s something to think about but Alex and Iain are great guys so I don’t think you can go wrong. I know they are interested in R1b as a whole and there is already a haplogroup-R web site set up. This doesn’t mean we shouldn’t do YFull interpretations as well, as I will. It's just easier to get more people involved when something is free.

Mikewww
10-23-2017, 03:51 PM
I don’t want to lose this question and I don’t have the YFull expertise on some of those other forums (which is a compliment for you on Anthrogenica :) )

What makes the YFull P312 confidence range so wide?

P312 encompasses M222 and L226 and a lot more. There are tons of samples for P312. Please double check what I’m looking at and see what is happening.


(from McDonald I) divided the 95% confidence ranges by the best estimated age in ybp.

U106 ... Chalcolithic-Bronze . 24.6% total range with 95% confidence
P312 ... Chalcolithic-Bronze . 25.8%
M222 ... Classical Period .... 42.6%
L226 ... Anglo-Saxon Era ..... 44.0%

http://www.jb.man.ac.uk/~mcdonald/genetics/table.html
http://www.jb.man.ac.uk/~mcdonald/ge...312/table.html

For comparison with McDonald’s 95% confidence ranges, here are YFull’s. ...

U106 ... Chalcolithic-Bronze . 21.3%
P312 ... Chalcolithic-Bronze . 35.6%
M222 ... Classical Period .... 27.5%
L226 ... Anglo-Saxon Era ..... 34.5%

https://www.yfull.com/tree/R-L151/

Wing Genealogist
10-23-2017, 05:33 PM
Thank you. I AM one of those less experienced readers which is why I am seeking guidance and repeating what I was told by someone who I assumed understood these things. :)

So what is a "young clade" in the context mentioned below please?

"CLADE TMRCA DATE: This is the date at which everyone tested in the clade shares their most-recent common ancestor. This is the date that's most useful to most people, since it tells you when everyone with that mutation today was last related. The only significant drawback is that it's possible for someone new to come along who is more distantly related, which pushes the TMRCA date further back. This can be a problem for young clades."

Actually, it really isn't the AGE of a clade as much as how many INDIVIDUALS have submitted their Big Y test for analysis which matter. If you go to one of Iain's charts at http://www.jb.man.ac.uk/~mcdonald/genetics/report.html you will see that there are 17 S11136+ individuals in his analysis. These 17 individuals encompass 13 different surnames and include individuals whose MDKA is from the European mainland as well as the "Isles". Thus, we can consider S11136 rather well tested, and therefore more results likely will not significantly alter the age (unless a whole new branch of the family is discovered, similar to how the A00 line some years ago significantly altered the TMRCA of all human males).

Your next clade below S11136 is FGC18850/FGC18862. We have 11 kits who have tested positive for this clade, but almost half of them (5 of the 11) belong to the Cecil, etc. surname. It is likely future discoveries may cause this age to shift a bit. Your next subclade (BY21404) is even more dominated by the Cecil surname (5 out of 8) so it is even more likely than its parent clade to be significantly altered with new Big Y results (especially results outside of the Cecil surname).

vettor
10-23-2017, 05:42 PM
I don’t want to lose this question and I don’t have the YFull expertise on some of those other forums (which is a compliment for you on Anthrogenica :) )

What makes the YFull P312 confidence range so wide?

P312 encompasses M222 and L226 and a lot more. There are tons of samples for P312. Please double check what I’m looking at and see what is happening.

YFull, in its TMRCA calculation, uses an average based on the member samples they have.

Example: my individual TMRCA is 3410 ybp, yet I fall in the group with a stated TMRCA of 3100 ybp.

Branch ............. Unrounded age (ybp) ... Rounded age (ybp) ... Age by all samples (ybp)
T-Z19945 21 1 20 ... 3410 .................. 3400 (2200-5100) .... 3100 (2200-4100)

Wing Genealogist
10-23-2017, 06:38 PM
Actually, it really isn't the AGE of a clade as much as how many INDIVIDUALS have submitted their Big Y test for analysis which matter.

To be technically accurate, it isn't really the number of individuals per se, but how well the individuals encompass the whole clade (ie how well they cover all of the subclades). For the majority of the clades, we have absolutely no idea how widespread the clade is, so the more individuals we have, the more likely we would include most of its subclades.

The exception is for families like the Cecils, which (because they fall in the Gentry) are fairly well documented. It is likely the age of the Y21405 clade (which is the "family" clade of all of the Cecils) is more accurate than the age of its parent clade (simply because we don't know if we have Big Y results from all of the branches).

Dewsloth
10-23-2017, 06:48 PM
Actually, it really isn't the AGE of a clade as much as how many INDIVIDUALS have submitted their Big Y test for analysis which matter. If you go to one of Iain's charts at http://www.jb.man.ac.uk/~mcdonald/genetics/report.html you will see that there are 17 S11136+ individuals in his analysis. These 17 individuals encompass 13 different surnames and include individuals whose MDKA is from the European mainland as well as the "Isles". Thus, we can consider S11136 rather well tested, and therefore more results likely will not significantly alter the age (unless a whole new branch of the family is discovered, similar to how the A00 line some years ago significantly altered the TMRCA of all human males).

Your next clade below S11136 is FGC18850/FGC18862. We have 11 kits who have tested positive for this clade, but almost half of them (5 of the 11) belong to the Cecil, etc. surname. It is likely future discoveries may cause this age to shift a bit. Your next subclade (BY21404) is even more dominated by the Cecil surname (5 out of 8) so it is even more likely than its parent clade to be significantly altered with new Big Y results (especially results outside of the Cecil surname).

This is problematic for McDonald's DF19 estimate: He only has one DF19 report. It results in an extremely skewed age estimate for Z302/DF87.
This isn't his fault, he just needs more samples.
I've submitted mine, but I don't know when/if he's doing the next update.

McDonald:
1 Sample
DF19 2747 BC (3596 BC — 1773 BC)
DF87/Z302 1159 AD (432 AD — 1602 AD) [this is the major split under DF19; with DF88 and Z302 being the major branches underneath]
BY19316 1589 AD (1136 AD — 1857 AD)

There are 32 DF19 samples on Alex's Big Tree, but no age estimates.

Yfull:
28 Samples (4 Z302)
R-DF19 Z4161 * DF19/S232 * S1354/Z8192 formed 4500 ybp, TMRCA 4500 ybp [So ~2483BC]
R-Z302 CTS12966/DF87 * CTS9798 * Z302/S233 formed 4500 ybp, TMRCA 4500 ybp [So also ~2483BC]
R-DF88 S4274 * DF88/S4298 * Y3095/FGC11834 +5 SNPs formed 4500 ybp, TMRCA 4500 ybp


FTDNA DF19 Group (volunteer admins do the crunching)
~200 Big Y Samples (~35 Z302/DF87; almost all the rest DF88)
DF19 2400BC
DF88 and Z302/DF87 2300BC

JohnHowellsTyrfro
10-23-2017, 07:39 PM
Actually, it really isn't the AGE of a clade as much as how many INDIVIDUALS have submitted their Big Y test for analysis which matter. If you go to one of Iain's charts at http://www.jb.man.ac.uk/~mcdonald/genetics/report.html you will see that there are 17 S11136+ individuals in his analysis. These 17 individuals encompass 13 different surnames and include individuals whose MDKA is from the European mainland as well as the "Isles". Thus, we can consider S11136 rather well tested, and therefore more results likely will not significantly alter the age (unless a whole new branch of the family is discovered, similar to how the A00 line some years ago significantly altered the TMRCA of all human males).

Your next clade below S11136 is FGC18850/FGC18862. We have 11 kits who have tested positive for this clade, but almost half of them (5 of the 11) belong to the Cecil, etc. surname. It is likely future discoveries may cause this age to shift a bit. Your next subclade (BY21404) is even more dominated by the Cecil surname (5 out of 8) so it is even more likely than its parent clade to be significantly altered with new Big Y results (especially results outside of the Cecil surname).

Thank you for the clarification, I appreciate it.

Michał
10-24-2017, 11:53 AM
I don't want to lose this question and I don't have the YFull expertise on some of those other forums (which is a compliment for you on Anthrogenica :) )

What makes the YFull P312 confidence range so wide?

P312 encompasses M222 and L226 and a lot more. There are tons of samples for P312. Please double check what I'm looking at and see what is happening.
To be honest I don't know what the exact reason for this is, but I suspect it may have something to do with the way YFull tries to modify/adjust the age of a parental clade when a relatively older age for some downstream subclades/lineages is calculated. In this particular case, the average age for P312 is much higher when calculated based on all major downstream subclades (like U152, DF27, S461, DF19, etc) than when calculated based on multiple singleton lineages alone (ie. those marked as P312*). One possible explanation for this is that a significant fraction of those P312* singleton lineages (especially those with a relatively low number of downstream SNPs) are in fact members of a yet unknown subclade directly under P312 that is defined by a mutation not covered by Big Y (nor by Y Elite?). Also, one can suspect that what additionally characterizes this hypothetical subclade is a relatively low TMRCA age, so when assuming that the common ancestors of this subclade accumulated exceptionally few SNPs (or practically no SNPs that would be covered by Big Y), this would explain why the average number of downstream mutations in the singleton lineages under P312 is much lower than in the case of known downstream subclades. One way to verify this hypothesis is to test some of those singleton lineages (especially those with a low number of BigY-tested SNPs under P312) with Y Elite and then search for other singleton lineages that would share any "private" mutations that were found by YFull but are not covered by Big Y. It should be noted that if there is any such hidden subclade under P312 (with relatively low TMRCA age and "no defining SNPs"), this would not only narrow down the confidence range for the age of clade P312 but it would also make this estimated TMRCA age significantly increase (let's say to about 4800 ybp when using the YFull mutation rate, or even more when assuming this mutation rate is in fact significantly lower).

Wing Genealogist
10-24-2017, 12:11 PM
To be honest I don't know what the exact reason for this is, but I suspect it may have something to do with the way YFull tries to modify/adjust the age of a parental clade when a relatively older age for some downstream subclades/lineages is calculated. In this particular case, the average age for P312 is much higher when calculated based on all major downstream subclades (like U152, DF27, S461, DF19, etc) than when calculated based on multiple singleton lineages alone (ie. those marked as P312*). One possible explanation for this is that a significant fraction of those P312* singleton lineages (especially those with a relatively low number of downstream SNPs) are in fact members of a yet unknown subclade directly under P312 that is defined by a mutation not covered by Big Y (nor by Y Elite?). Also, one can suspect that what additionally characterizes this hypothetical subclade is a relatively low TMRCA age, so when assuming that the common ancestors of this subclade accumulated exceptionally few SNPs (or practically no SNPs that would be covered by Big Y), this would explain why the average number of downstream mutations in the singleton lineages under P312 is much lower than in the case of known downstream subclades. One way to verify this hypothesis is to test some of those singleton lineages (especially those with a low number of BigY-tested SNPs under P312) with Y Elite and then search for other singleton lineages that would share any "private" mutations that were found by YFull but are not covered by Big Y. It should be noted that if there is any such hidden subclade under P312 (with relatively low TMRCA age and "no defining SNPs"), this would not only narrow down the confidence range for the age of clade P312 but it would also make this estimated TMRCA age significantly increase (let's say to about 4800 ybp when using the YFull mutation rate, or even more when assuming this mutation rate is in fact significantly lower).


Within U106, there appears to be a significant bottleneck immediately after its origin. While the Z381 clade continued to spawn new subclades, all of the other subclades directly below U106 were likely spawned several hundred years after U106's origin. See http://www.jb.man.ac.uk/~mcdonald/genetics/tree.html for a graphical representation of this.

It is quite possible a somewhat similar situation may have happened with P312, where some subclades formed fairly quickly, but others went through a bottleneck and didn't separate from their parent for a significant time period.

Michał
10-24-2017, 12:44 PM
Within U106, there appears to be a significant bottleneck immediately after its origin. While the Z381 clade continued to spawn new subclades, all of the other subclades directly below U106 were likely spawned several hundred years after U106's origin. See http://www.jb.man.ac.uk/~mcdonald/genetics/tree.html for a graphical representation of this.

It is quite possible a somewhat similar situation may have happened with P312, where some subclades formed fairly quickly, but others went through a bottleneck and didn't separate from their parent for a significant time period.
What you describe is something perfectly normal (thus nothing unexpected). Also, this alone does not explain why the average number of SNPs under P312 is much higher for known subclades than for singleton lineages. A "hidden subclade" under P312 would of course explain this, but we first need to prove it.

MitchellSince1893
10-24-2017, 01:15 PM
Could it be that some mutations are sequentially occurring outside of the test coverage? They are happening but we just aren't detecting them? For example BigY covers 76% of the Y DNA 55000 known SNPs. What if some branches have a few SNPs occurring in the other 24%? Y Elite covers 97% but I'm assuming many branches presently consist of only BigY testers. And then there is the small chance SNPs sequentially occur in regions neither BigY nor FGC cover.

Michał
10-24-2017, 02:02 PM
Could it be that some mutations are sequentially occurring outside of the test coverage? They are happening but we just aren't detecting them? For example BigY covers 76% of the Y DNA 55000 known SNPs. What if some branches have a few SNPs occurring in the other 24%? Y Elite covers 97% but I'm assuming many branches presently consist of BigY testers. And then there is the small chance SNPs sequentially occur in regions neither BigY nor FGC cover.
If by "sequentially" you just mean the effect, not the cause (ie. you don't suggest that one mutation outside of the test coverage makes it more likely for the subsequent mutations to arise in those uncovered regions), then this seems perfectly possible (although relatively uncommon).

MitchellSince1893
10-24-2017, 02:39 PM
Yes just by luck they happen to occur one after another in a currently untested region.
If my math is right 24% chance 1 occurs there. 5.76% chance 2 in a row and 1.38% chance 3 in a row on a BigY test. Uncommon for sure.
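The arithmetic above can be checked with a short sketch. It assumes each SNP lands outside the tested region independently, with probability 0.24 (one minus the 76% coverage figure quoted above). The function name is made up for illustration.

```python
def prob_consecutive_misses(uncovered_fraction: float, n: int) -> float:
    """Probability that n independent SNPs all fall in the uncovered region."""
    return uncovered_fraction ** n


# 0.24 is one minus the 76% BigY coverage figure quoted in the post.
for n in (1, 2, 3):
    print(n, f"{prob_consecutive_misses(0.24, n):.2%}")
```

The independence assumption matches the "just by luck" reading; it would not hold if something made later mutations more likely to fall in the same uncovered region.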

Mikewww
10-24-2017, 09:12 PM
What you describe is something perfectly normal (thus nothing unexpected). Also, this alone does not explain why the average number of SNPs under P312 is much higher for known subclades than for singleton lineages. A "hidden subclade" under P312 would of course explain this, but we first need to prove it.
Thank you. Do you know how these models weight the immediate subclade branches when calculating the TMRCA for the parent?

At first glance weighting them equally makes sense so as not to provide bias to the most populous branches. On the other hand populous branches have the most data which should help accuracy.

Is this a problem and how is it handled? Bottlenecks with strings of equivalents for large groups, i.e. M222, may insert a bias from those blocks which have had mutation rates significantly off the average?

I see how YFull uses the immediate child subclades and then makes an adjustment at the parent TMRCA related to the parent's brothers and parent. Is this done all up and down the branching levels? That could mean a lot of manual adjustments which I think could cause error accumulation. I would think McDonald has to handle it the same way.

People don't often agree with me but I like having independent data types, like STRs and ancient DNA to use as validation checkpoints of fencing. Of course this is why I like computer simulations and linear programming models which can use all the data to find the "best fit".

Michał
10-25-2017, 12:11 AM
Thank you. Do you know how these models weight the immediate subclade branches when calculating the TMRCA for the parent?

At first glance weighting them equally makes sense so as not to provide bias to the most populous branches. On the other hand populous branches have the most data which should help accuracy.

I suspect the same weight is given to each subclade/lineage irrespective of the size, and this seems to be the best approach.

Mikewww
10-25-2017, 12:52 AM
I suspect the same weight is given to each subclade/lineage irrespective of the size, and this seems to be the best approach.
I agree. Look at the Big Tree for M222, the NW Irish.
http://www.ytree.net/DisplayTree.php?blockID=93

This is the behemoth of L21 as far as consumer testing. There are several hundred NGS results on the Big Tree for M222. Look at the big block of about 20 SNPs equivalent to M222. If you weight subclades by population size and this block had an abnormal mutation rate it could throw all of L21 and P312 off.

Does anyone know how McDonald handles this?

Wing Genealogist
10-25-2017, 08:06 AM
I agree. Look at the Big Tree for M222, the NW Irish.
http://www.ytree.net/DisplayTree.php?blockID=93

This is the behemoth of L21 as far as consumer testing. There are several hundred NGS results on the Big Tree for M222. Look at the big block of about 20 SNPs equivalent to M222. If you weight subclades by population size and this block had an abnormal mutation rate it could throw all of L21 and P312 off.

Does anyone know how McDonald handles this?

I cannot answer your specific question, but I have spoken with Iain in the past about the appearance of some subclades having a faster/slower mutation rate than average. His reply has been that this is to be expected, due to the random nature of mutations and that they average out in the end.

Given this point of view, I don't believe he would make any adjustment to clades like M222.

However, he doesn't directly weight subclades by population size. The larger number of M222 tests may indirectly weigh on his calculations.

I believe the bottom line is that the 95% Confidence Interval is more than wide enough to compensate for any such abnormalities.

Mikewww
10-30-2017, 02:40 PM
That was my first thought: neither estimate falls outside the range suggested by the other estimation or that of the standard deviation for the calculation, which typically has an extensive range - for all TMRCA estimates. It really is an educated guess at best, at this point, is it not?

Are the confidence intervals provided by SNP counting TMRCA estimation models misleading? I don't mean intentionally misleading, but misleading because the data is not too clean.

Robert Casey has done a nice job on evaluating STRs and makes the very strong and correct (I believe) case that massive YSNP testing is needed. He reminded me of the underlying assumptions used in applying models. Confidence intervals, at least as I know them, are based on the underlying variance in the data and assumptions about that variance.

The basic assumption is that the data fits a bell curve. Has anyone tried to look at mutation rates across the board, all haplogroups, by branch, to evaluate if there is truly a bell curve? I do not see that in these studies either by McDonald or YFull/Adamov. Is it there? We probably shouldn't use these confidence ranges if there is not a good fit with the bell curve.

If not, we just have to consider this stuff is an educated guess and recognize the greatest value is still in just understanding branch ages relative to each other.

Wing Genealogist
10-30-2017, 03:15 PM
Are the confidence intervals provided by SNP counting TMRCA estimation models misleading? I don't mean intentionally misleading, but misleading because the data is not too clean.

Robert Casey has done a nice job on evaluating STRs and makes the very strong and correct (I believe) case that massive YSNP testing is needed. He reminded me of the underlying assumptions used in applying models. Confidence intervals, at least as I know them, are based on the underlying variance in the data and assumptions about that variance.

The basic assumption is that the data fits a bell curve. Has anyone tried to look at mutation rates across the board, all haplogroups, by branch, to evaluate if there is truly a bell curve? I do not see that in these studies either by McDonald or YFull/Adamov. Is it there? We probably shouldn't use these confidence ranges if there is not a good fit with the bell curve.

If not, we just have to consider this stuff is an educated guess and recognize the greatest value is still in just understanding branch ages relative to each other.

I know the Confidence Intervals expressed by McDonald are quite wide, and he readily concedes in some cases the branch age may well fall outside of the CI (which is one of the primary reasons why the CI is 95% rather than 100%).

Given the fact the NGS tests are SNP tests (and not STR tests) many of the issues with STR testing don't pertain. But in the end, these estimates really are educated guesses. As you stated, the greatest value is in observing the branch ages relative to each other.

Dave-V
10-30-2017, 03:42 PM
The basic assumption is that the data fits a bell curve. Has anyone tried to look at mutation rates across the board, all haplogroups, by branch, to evaluate if there is truly a bell curve? I do not see that in these studies either by McDonald or YFull/Adamov. Is it there? We probably shouldn't use these confidence ranges if there is not a good fit with the bell curve.

I don't think anyone has done all haplogroups by branch, but I did do the SNP rate bell curve for L21 on this thread: http://www.anthrogenica.com/showthread.php?10785-What-mutation-rates-should-we-consider-for-Y-SNPs&p=244811#post244811.

For L21 at least it shows a bell curve but not a perfect one. The bigger issue however is that it's low and flat, i.e. the standard deviation is very large. At 1 SD (68% confidence) the 4400 ybp estimate is about +/- 1000 years (rounded), at 2 SD (95% confidence) it's about +/- 2000 years (rounded).
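As a rough illustration of what those rounded figures imply, here is a minimal sketch that turns an estimate and a standard deviation into symmetric ranges. It assumes a normal approximation, and the function name is hypothetical. The inputs are the rounded 4400 ybp and 1000 years from above.

```python
def normal_interval(estimate_ybp: float, sd_years: float, n_sd: float) -> tuple[float, float]:
    """Symmetric range of n_sd standard deviations around the estimate."""
    return (estimate_ybp - n_sd * sd_years, estimate_ybp + n_sd * sd_years)


# Rounded figures quoted in the post: 4400 ybp with an SD of about 1000 years.
print(normal_interval(4400, 1000, 1))  # roughly 68% under a normal approximation
print(normal_interval(4400, 1000, 2))  # roughly 95% under a normal approximation
```

A skewed distribution would give an asymmetric range instead, so this is only a first-order picture of how wide the uncertainty is.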

Translation - I think your point that both STR and SNP-based ageing analyses are educated guesses is valid.

Mikewww
10-30-2017, 03:47 PM
I know the Confidence Intervals expressed by McDonald are quite wide, and he readily concedes in some cases the branch age may well fall outside of the CI (which is one of the primary reasons why the CI is 95% rather than 100%). ...
The issues of variance in the data and fit to a bell curve are pertinent to mutation data regardless of type. Should anyone be using the "95%" number as a range? Is that just a guess? or a guess based on a statistical model that doesn't apply to the real world data? If so, then we probably shouldn't use terminology that is generally associated with data that conforms to a bell curve.

Don't shoot the messenger. I said "if so" above but I am concerned if anyone has really done a full analysis on variance in mutation rates by lineage. I suspect this is a complicated issue as we don't know the mutation rates by region and seem to assume all regions, within the scope of CombBed or McDonald or whatever have similar mutation rates.

Do you see why I advocate using computer simulation/analytic software tools?

Dave-V
10-30-2017, 04:02 PM
The issues of variance in the data and fit to a bell curve are pertinent to mutation data regardless of type. Should anyone be using the "95%" number as a range? Is that just a guess? or a guess based on a statistical model that doesn't apply to the real world data? If so, then we probably shouldn't use terminology that is generally associated with data that conforms to a bell curve.

Don't shoot the messenger. I said "if so" above but I am concerned if anyone has really done a full analysis on variance in mutation rates by lineage.

Statistical methods apply to data whether or not it fits a perfect bell curve. All a skew really does is change the CI range so one side (either + or -) is larger than the other. The real problem is not whether the statistical model applies, but that we then tend to use the output - whether it's an age estimate on a SNP, or a years-per-SNP estimate, etc, as if it was a single useful number while ignoring the CIs, skew, or other messiness that the analysis may have included.

Mikewww
10-30-2017, 04:14 PM
Statistical methods apply to data whether or not it fits a perfect bell curve. All a skew really does is change the CI range so one side (either + or -) is larger than the other. The real problem is not whether the statistical model applies, but that we then tend to use the output - whether it's an age estimate on a SNP, or a years-per-SNP estimate, etc, as if it was a single useful number while ignoring the CIs, skew, or other messiness that the analysis may have included.

I agree statistical methods apply but how they are applied in these models is a concern. I have not seen YSNP mutation rate variance evaluated so we can see what the skews are and how they affect the confidence intervals. This is an issue as well as the size/representativeness of the data. Maybe it is there and I'm not seeing it.

Look at the example of R1b-P312 that I noted earlier on this thread. I "divided the 95% confidence ranges by the best estimate age in ybp."

McDonald: P312 ... Chalcolithic-Bronze . 25.8%
YFull: P312 ...... Chalcolithic-Bronze . 35.6%

http://www.anthrogenica.com/showthread.php?12370-The-YFull-TMRCA-vs-Iain-McDonald-methods&p=300185&viewfull=1#post300185

YFull's range is about 40% larger (more conservative) than McDonald's. The P312 data set is large, surely, in both cases. Is this a difference in methodologies of presenting confidence intervals? [EDIT to correct per David V]
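For anyone wanting to reproduce the comparison, here is a small sketch of the calculation described above: the width of the 95% range divided by the best estimate. The function name and the example bounds are hypothetical. Only the 25.8% and 35.6% figures come from this thread.

```python
def relative_ci_width(lower_ybp: float, upper_ybp: float, best_ybp: float) -> float:
    """Width of the confidence range as a fraction of the best estimate."""
    return (upper_ybp - lower_ybp) / best_ybp


# Hypothetical bounds, only to show the calculation.
print(f"{relative_ci_width(4000, 5000, 4500):.1%}")

# Ratio of the two relative widths quoted above (35.6% vs 25.8%).
print(f"{0.356 / 0.258:.2f}")
```

The ratio works out to roughly 1.38, which is where the "about 40% larger" figure comes from.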

Dave-V
10-30-2017, 05:14 PM
I agree statistical methods apply but how they are applied in these models is a concern. I have not seen YSNP mutation rate variance evaluated so we can see what the skews are and how they affect the confidence intervals. This is an issue as well as the size/representativeness of the data. Maybe it is there and I'm not seeing it.

Look at the example of R1b-P312 that I noted earlier on this thread. I "divided the 95% confidence ranges by the best estimate age in ybp."

YFull: P312 ...... Chalcolithic-Bronze . 25.8%
McDonald: P312 ... Chalcolithic-Bronze . 35.6%

http://www.anthrogenica.com/showthread.php?12370-The-YFull-TMRCA-vs-Iain-McDonald-methods&p=300185&viewfull=1#post300185

McDonald's range is about 40% larger (more conservative) than YFull's. The P312 data set is large, surely, in both cases. Is this a difference in methodologies of presenting confidence intervals?

Mike, in looking back I think you quoted them reversed originally and it was YFull's CI range that was larger?

Regardless, both YFull's method and Iain's reduce uncertainties from further down the tree as they go up. If you check my cross-post in this thread just above, my "bell curve" of the raw data across all of L21 shows a much larger 95% uncertainty across all the data than either 25.8% or 35.6%. The same situation exists across P312 in general.

There ARE valid ways to reduce uncertainty, some better than others depending on factors like the amount of random effects in the data etc., but that's not an area I have any depth in. I think YFull's method though is just averages of averages, which is at best mathematically crude and at worst doesn't really achieve the reduction in uncertainty that they're stating in their confidence intervals.

Iain says his method does this better (from his website, emphasis below is mine, and "this method" refers to Iain's):

------------------------------------------------------------
The primary disadvantage of this method compared to YFull's implementation of Adamov's method is that we are reliant on the derivative VCF data from Family Tree DNA, which dictates a binary outcome for each SNP. In reality, each call comes with a certainty weighting that can be derived from the quality statistics and individual reads present in the BAM file which allows a more definite tree to be built up.

Conversely, there are a number of advantages to this method, which can be summarised as follows:

- Preserving the PDF ensures that uncertainties from child clades are properly propagated up through the tree.
- Multiplying the PDFs of each clade to obtain an age for a clade provides a more robust method of weighting the contribution of each clade.
- Reprocessing the tree from a top-down perspective prevents problems with causality: no manual correction is needed to make child clades younger than their parents, and a more accurate age estimate is provided.
-----------------------------------------------------------------------
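To make the "multiplying the PDFs" idea concrete, here is a minimal sketch on a grid of candidate ages. It is not Iain's actual pipeline. The three child estimates are hypothetical, and Gaussian shapes are used only as placeholders for whatever PDF each child clade really produces.

```python
import math


def gaussian_shape(grid, mean, sd):
    # Placeholder PDF shape; real per-clade PDFs need not be Gaussian.
    return [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in grid]


def combine_pdfs(pdfs):
    # Multiply the PDFs point by point, then renormalise so the values sum to 1.
    product = [math.prod(values) for values in zip(*pdfs)]
    total = sum(product)
    return [p / total for p in product]


def central_interval(grid, probs, level=0.95):
    # Walk the cumulative sum to find the central interval at the given level.
    tail = (1.0 - level) / 2.0
    cum, lower, upper = 0.0, None, None
    for t, p in zip(grid, probs):
        cum += p
        if lower is None and cum >= tail:
            lower = t
        if upper is None and cum >= 1.0 - tail:
            upper = t
    return lower, upper


grid = list(range(2000, 7001, 10))                  # candidate ages in ybp
children = [(4300, 400), (4700, 500), (4500, 300)]  # hypothetical (age, sd) pairs
pdfs = [gaussian_shape(grid, mean, sd) for mean, sd in children]
combined = combine_pdfs(pdfs)

peak = grid[combined.index(max(combined))]
lower, upper = central_interval(grid, combined)
print(peak, lower, upper)
```

Multiplying automatically gives more weight to the narrower (better constrained) child estimates, and the combined range comes out narrower than any single input. That is one way uncertainty can legitimately shrink going up the tree, provided the child estimates really are independent.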


Whether one is really better than the other I don't know. But the short answer is that they're different because they address the data differently, and HOW they're different is probably dependent both on the general uncertainties in the original data as well as their different methods.

Mikewww
10-30-2017, 05:22 PM
Mike, in looking back I think you quoted them reversed originally and it was YFull's CI range that was larger?
Yes, I edited the errant post. Here is the correct labeling. YFull has the 40% wider range.

McDonald: P312 ... Chalcolithic-Bronze . 25.8%
YFull: P312 ...... Chalcolithic-Bronze . 35.6%

Thanks for your longer review. I need to read closer so I make fewer mistakes!

Mikewww
10-30-2017, 05:48 PM
...
Regardless, both YFull's method and Iain's reduce uncertainties from further down the tree as they go up. If you check my cross-post in this thread just above, my "bell curve" of the raw data across all of L21 shows a much larger 95% uncertainty across all the data than either 25.8% or 35.6%. The same situation exists across P312 in general.

There ARE valid ways to reduce uncertainty, some better than others depending on factors like the amount of random effects in the data etc., but that's not an area I have any depth in. I think YFull's method though is just averages of averages, which is at best mathematically crude and at worst doesn't really achieve the reduction in uncertainty that they're stating in their confidence intervals.

Iain says his method does this better (from his website, emphasis below is mine, and "this method" refers to Iain's):

------------------------------------------------------------
The primary disadvantage of this method compared to YFull's implementation of Adamov's method is that we are reliant on the derivative VCF data from Family Tree DNA, which dictates a binary outcome for each SNP. In reality, each call comes with a certainty weighting that can be derived from the quality statistics and individual reads present in the BAM file which allows a more definite tree to be built up.

Conversely, there are a number of advantages to this method, which can be summarised as follows:

- Preserving the PDF ensures that uncertainties from child clades are properly propagated up through the tree.
- Multiplying the PDFs of each clade to obtain an age for a clade provides a more robust method of weighting the contribution of each clade.
- Reprocessing the tree from a top-down perspective prevents problems with causality: no manual correction is needed to make child clades younger than their parents, and a more accurate age estimate is provided.
-----------------------------------------------------------------------


Whether one is really better than the other I don't know. But the short answer is that they're different because they address the data differently, and HOW they're different is probably dependent both on the general uncertainties in the original data as well as their different methods.
Outstanding, David. I forgot about that thread.

If I am reading your snapshot analysis correctly, the range that includes 95% of all of YFull's 730 R1b-L21 kits is from a low of about 19 mutations to a high of 44.
http://www.anthrogenica.com/attachment.php?attachmentid=16750&d=1496977090

Yikes!


their [YFull] estimate of 4400 years to the L21 TMRCA could be in a range of 3544-5425 ybp at 68% confidence, or 2679-6339 ybp at 95% confidence.
...
we're still very far away from a universal SNP ageing methodology. And to be honest, that even YFull's methods have a much larger error margin than they're probably accounting for. http://www.anthrogenica.com/showthread.php?10785-What-mutation-rates-should-we-consider-for-Y-SNPs&p=244811#post244811

Using the YFull data, you have a 95% confidence range for L21's TMRCA as 2679-6339 ybp.
YFull has it today as 3900-4800 ybp.

https://www.yfull.com/tree/R-L21/

That's quite a difference in ranges. I'll just leave it at that to let people figure out what this might mean, if anything.

Let me know if I am missing something.

Mikewww
10-30-2017, 07:28 PM
Iain says his method does this better (from his website, emphasis below is mine, and "this method" refers to Iain's):
------------------------------------------------------------
...
Conversely, there are a number of advantages to this method, which can be summarised as follows:

- Preserving the PDF ensures that uncertainties from child clades are properly propagated up through the tree.
- Multiplying the PDFs of each clade to obtain an age for a clade provides a more robust method of weighting the contribution of each clade.
- Reprocessing the tree from a top-down perspective prevents problems with causality: no manual correction is needed to make child clades younger than their parents, and a more accurate age estimate is provided.

...

Just in case folks don't understand the abbreviation "PDF", it's Probability Distribution Function. This URL has the summary, which even has the comparison with Adamov. WingGenealogist pointed to this earlier. http://www.jb.man.ac.uk/~mcdonald/genetics/pipeline-summary.pdf

This is what I was getting at in reply #47. It should really be about the confidence range (not the best estimate) and it looks to me like McDonald does a nice job of NOT losing track of the uncertainties.

Do you know how these models weight the immediate subclade branches when calculating the TMRCA for the parent?

At first glance weighting them equally makes sense so as not to provide bias to the most populous branches. On the other hand populous branches have the most data which should help accuracy.

Is this a problem and how is it handled? Bottlenecks with strings of equivalents for large groups, i.e. M222, may insert a bias from those blocks which have had mutation rates significantly off the average?
http://www.anthrogenica.com/showthread.php?12370-The-YFull-TMRCA-vs-Iain-McDonald-methods&p=300871&viewfull=1#post300871

I think the paraphrases below are correct. Agree, disagree?

The advantages of McDonald method.

1) Confidence ranges should be more accurate because biases are accounted for up and down the tree.
2) The weighting of child subclade impact on the age of the parent is robustly and rationally applied.
3) No manual adjustment/fudge is needed when child subclades are calculated as older than the parent.

The current McDonald implementation should also have a larger sample size within the scope of R1b-L151 which should improve accuracy. Busby et al. (A Cautionary Tale of Y Chromosome Lineage R-M269) have L151 (L11/S127) as well over 50% of the population along the Atlantic Facade and about 50% in Central Europe.

The YFull method has the advantage of a more reliable base for SNP determination because of their exclusive use of BAM files.

Of course, the YFull implementation is the only option outside of R1b-L151 other than do-it-yourself. For non-R1b haplogroups and most MDKA's not from Western and Central Europe this is it!

leonardo
10-30-2017, 09:40 PM
Are the confidence intervals provided by SNP counting TMRCA estimation models misleading? I don't mean intentionally misleading, but misleading because the data is not too clean.

Robert Casey has done a nice job on evaluating STRs and makes the very strong and correct (I believe) case that massive YSNP testing is needed. He reminded me of the underlying assumptions used in applying models. Confidence intervals, at least as I know them, are based on the underlying variance in the data and assumptions about that variance.

The basic assumption is that the data fits a bell curve. Has anyone tried to look at mutation rates across the board, all haplogroups, by branch, to evaluate if there is truly a bell curve? I do not see that in these studies either by McDonald or YFull/Adamov. Is it there? We probably shouldn't use these confidence ranges if there is not a good fit with the bell curve.

If not, we just have to consider this stuff is an educated guess and recognize the greatest value is still in just understanding branch ages relative to each other.

I don't know. I guess one could argue this same point for any other's methodology. I would venture to guess that, statistically, the differences are nominal. I imagine the sample size will certainly affect its validity, as Michał has opined.

Mikewww
10-30-2017, 09:55 PM
I don't know. I guess one could argue this same point for any other's methodology. I would venture to guess that, statistically, the differences are nominal. I imagine the sample size will certainly affect its validity, as Michał has opined.
Leonardo, Dave-V straightened me out on that. It's not that statistics don't apply but the confidence ranges suffer. See his reply below:

Statistical methods apply to data whether or not it fits a perfect bell curve

If we look a little deeper, it appears that McDonald's method has some advantages as stated in reply 60.

However, in the big picture these age estimates just aren't that accurate anyway. They are great to have, but just more evidence. We should probably not focus on the best estimate dates but on the whole date ranges.

Wing Genealogist
10-30-2017, 09:56 PM
I believe Iain McDonald has a degree in Statistics. As such, he is very much aware of various factors which can influence/skew the results from a "normal" Bell curve. I have sent him an email regarding this issue and he told me he does wish to directly address it. The real issue for him is simply the time commitment. He has recently informed the U106 Haplogroup team (of which he is a co-admin) that he will be globetrotting over the next couple of weeks or so (job-related) so his ability to dedicate time to his hobbies is restricted.

Dave-V
10-30-2017, 10:06 PM
I think the paraphrases below are correct. Agree, disagree?

The advantages of McDonald method.

1) Confidence ranges should be more accurate because biases are accounted for up and down the tree.
2) The weighting of child subclade impact on the age of the parent is robustly and rationally applied.
3) No manual adjustment/fudge is needed when child subclades are calculated as older than the parent.

The current McDonald implementation should also have a larger sample size within the scope of R1b-L151 which should improve accuracy. Busby et al. (A Cautionary Tale of Y Chromosome Lineage R-M269) have L151 (L11/S127) as well over 50% of the population along the Atlantic Facade and about 50% in Central Europe.

The YFull method has the advantage of a more reliable base for SNP determination because of their exclusive use of BAM files.

Of course, the YFull implementation is the only option outside of R1b-L151 other than do-it-yourself. For non-R1b haplogroups and most MDKA's not from Western and Central Europe this is it!

I would agree with all of what you said here. However (and it's obvious but I'll say it anyway) both methods (McDonald and YFull) are still limited by available data and especially coverage of subclades by NGS testing which as we know varies significantly.

My point there is although for instance the McDonald confidence ranges may be a more accurate reflection of the available data, whether they are in turn a more accurate reflection of a range around the actual age is really dependent on how well the available data reflects the subclade. Especially when a subclade has only 1 or 2 NGS data points the answer there is probably "not well".

Again, that's obvious when stated, but it's another dependency that neither method does very well at highlighting.

Mikewww
10-30-2017, 10:27 PM
...
My point there is although for instance the McDonald confidence ranges may be a more accurate reflection of the available data, whether they are in turn a more accurate reflection of a range around the actual age is really dependent on how well the available data reflects the subclade. Especially when a subclade has only 1 or 2 NGS data points the answer there is probably "not well".
McDonald should have access to a larger sample size (for L151) but that is why I am perplexed about the very wide confidence range for R1b-P312. I have no doubt his computations are correct, but there may be significant underlying problems with the whole shooting match of SNP counting that we (project admins) should be more cautious about.

Wing Genealogist
10-30-2017, 11:05 PM
As I have said more than once, using a random mutation as a gauge to estimate an age is a poor tool, but at the moment it is the best tool we have.

Dave-V
10-30-2017, 11:37 PM
As I have said more than once, using a random mutation as a gauge to estimate an age is a poor tool, but at the moment it is the best tool we have.

Agreed. And it's also one reason I haven't given up on STR-based TMRCA estimation yet, because I'm not yet convinced that that chisel is worse than someone else's screwdriver at driving in a nail.

imcdonald
10-30-2017, 11:49 PM
Hi all,

I rarely darken Anthrogenica's doorway, as the throughput of traffic exceeds my available time to read it, but I'll take the opportunity here to clear a few things up. Get a cup of tea - it's going to be a long one, covering the multiple topics that have been discussed.

As a preface, the transition to Build 38 has caused major problems with my analysis. Many of you will have seen the new repository at James Kane's haplogroup-r.org website that we are now using. This underlies a broader effort to merge some of the existing analyses that occur over haplogroup R1b. The result of this is that all my analysis is suspended while we concentrate on working out a way forward. Hopefully, this will allow me to expand the scope of my age analysis and spend time removing some of the approximations I've made.

Firstly, a take-home message is to forget the bell curve - it's rarely important in the particular analyses that are done, since the results aren't distributed in bell curves (Gaussians).

There are fundamental differences between how time to most-recent common ancestor ages (TMRCAs) are calculated from Y-STRs and Y-SNPs, and between the various methods used on Y-STRs. Let's cover the STRs first.

The infinite alleles method (popularised through Dean McGee's excellent tool) assumes any difference between STRs is the result of a single mutation, e.g. 13->15. However, after about 1000 years, this is no longer generally true, and multi-step mutations (e.g. 13->14->15) become more common. This is the primary limitation of this method, which is why TMRCAs more than 800-1500 years old derived using this method shouldn't be trusted. A variant of this, back mutation (e.g. 13->14->13) compounds this problem.

The step-wise alleles method assumes that all mutations are multi-step mutations (e.g. 13->14->15, not 13->15). This is generally true for TMRCAs greater than 1000 years. However, it still doesn't account for back mutations. TMRCAs more than about 1200-2000 years old derived using this method shouldn't be trusted.

Both these methods use Bayesian statistics. They aren't affected by Gaussianity (conformance to a "normal" distribution) or most other statistical measures of data normalcy. The basic statistical test is the probability of whether or not one or more mutations have occurred.

For large groups, Y-STR TMRCAs are generally calculated using Ken Nordtvedt's variance method. This assumes completely random mutation, resulting in a Gaussian distribution. However, skewed inheritance patterns, random wander and biological stability of STRs (see S.C. Bird, 2012; journals.plos.org/plosone/article?id=10.1371/journal.pone.0048638) mean that this slowly stops being a good assumption.

If you read Ken Nordtvedt's discussions on the topic, you'll see that *intra-clade* variance (taking the variance within a group of STR results) is a very poor estimator of TMRCA. The better estimator is *inter-clade* variance: measuring the variance between Y-STR data from two different sub-clades. With careful attention, Y-STR variance can be used to provide accurate ages several millennia in the past which match with the archaeological record. Mark Jost's ages from several years ago still appear to be some of the most accurate ages we have today. Departures from Gaussianity can be significant, but I have not found any significant changes in TMRCAs by restricting the choice of STRs by Bird's q values, which are a measure of this non-Gaussianity.

Y-SNP TMRCAs are generally calculated very differently. At its simplest, one can count up the number of mutations and divide by a mutation rate to get a TMRCA. This is essentially what YFull does. However, proper determination of TMRCA should be done with reference not to a Gaussian distribution, but a Poisson distribution, which describes the occurrence of rare events like SNP mutations. The question then becomes slightly more complex. The key assumptions are that SNPs occur randomly, and that the mutation rate is constant across time and space.
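As a rough illustration of why the Poisson distribution matters, the sketch below turns a single observed SNP count into a TMRCA with an asymmetric confidence interval. It is a minimal sketch under stated assumptions: the SNP count and the years-per-SNP figure are invented, the mutation rate is treated as exactly known, and the interval is the standard exact (Garwood) Poisson interval found by bisection. It is not the calculation used by YFull or in the analysis described in this post.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def bisect_root(func, lo, hi, steps=200):
    """Find where func crosses zero, assuming func is positive at lo and negative at hi."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_mean_interval(observed, confidence=0.95):
    """Exact (Garwood) interval for the Poisson mean, given one observed count."""
    tail = (1.0 - confidence) / 2
    search_max = observed + 10 * math.sqrt(observed + 1) + 10
    if observed == 0:
        lower = 0.0
    else:
        # Mean at which seeing `observed` or more has probability `tail`.
        lower = bisect_root(
            lambda m: tail - (1 - poisson_cdf(observed - 1, m)), 0.0, search_max
        )
    # Mean at which seeing `observed` or fewer has probability `tail`.
    upper = bisect_root(lambda m: poisson_cdf(observed, m) - tail, 0.0, search_max)
    return lower, upper


snp_count = 30        # invented number of SNPs below the ancestor
years_per_snp = 150   # assumed rate, for illustration only

low, high = poisson_mean_interval(snp_count)
print("best guess:", snp_count * years_per_snp, "years")
print("95% range:", round(low * years_per_snp), "to", round(high * years_per_snp), "years")
```

The range is wider on the old side than on the young side, which a symmetric Gaussian error bar would not show, and it still excludes the uncertainty in the mutation rate itself.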

It's impossible to prove that these assumptions are exactly true, but we can place very stringent limits on whether they are *not* true. If they are not true then we should: (a) obtain different ages for clades from their individual sub-clades; and (b) see a difference between the expected number of SNPs and the observed number of SNPs. We expect some variability in these quantities anyway, but we should only see (e.g.) 5% of tests producing results outside of the 95% confidence ranges, 0.5% outside the 99.5% confidence ranges, etc.

To date, I have collected nearly 100 different triangulated lineages where two or more testers are traced back to a common ancestor and both have taken a BigY test, providing a list of mutations and a specified timeframe. This computes to 136 countable mutations over 23846 +/- 887 years, or a mutation rate of between 138 and 227 years/SNP over the counting range, or 5.71-9.37 x 10^-10 SNPs per base pair per year. This is only just enough to start probing differences between haplogroups and among different families. I see only the expected number of statistical outliers, and no obvious differences between the haplogroups: a slightly slower value in haplogroups R1a and R1b-L21 is probably due to unresolved structure in the family trees of the MacDonald and Stewart families, so I don't include those results in my estimates at present.
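The arithmetic behind those figures can be checked with a few lines of Python. The central value follows directly from the quoted totals. The number of base pairs is not stated in the post, so the sketch infers it from the two quoted pairs of limits (138 years/SNP with 9.37 x 10^-10, and 227 years/SNP with 5.71 x 10^-10); treat that figure as an assumption rather than a quoted coverage.

```python
mutations = 136
years = 23846

# Central estimate: total years divided by total mutations.
central_years_per_snp = years / mutations

# rate_per_bp_per_year = 1 / (years_per_snp * base_pairs), so each quoted
# pair of limits implies an effective number of base pairs counted.
implied_bp_fast = 1 / (138 * 9.37e-10)
implied_bp_slow = 1 / (227 * 5.71e-10)
implied_bp = (implied_bp_fast + implied_bp_slow) / 2

# Central rate per base pair per year, using the inferred coverage.
central_rate = 1 / (central_years_per_snp * implied_bp)

print(f"{central_years_per_snp:.0f} years per SNP")
print(f"{implied_bp / 1e6:.2f} million base pairs (inferred)")
print(f"{central_rate:.2e} SNPs per base pair per year")
```

This gives roughly 175 years per SNP, which sits inside the quoted 138-227 range, and an inferred coverage of a little under 8 million base pairs.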

We can compare this to estimates from the literature, both from modern father-son populations, archaeological timings and ancient DNA. These different methods agree with each other to within +/- 7.5%, showing no evidence of variation from hunter-gatherer lifestyles in paleolithic Russia and mesolithic Sardinia and America, to modern Icelandic and Chinese populations, despite vast differences in lifestyle, diet, generation length, exposure to carcinogens, etc. Hence, I adopt a final rate of 7.524-8.708 x 10^-10 SNPs per base pair per year, as the weighted average of all of the above. It should be noted, however, that on longer evolutionary timescales some differences may occur (e.g. between chimps and humans at 5 Myr ago).

The most significant problem is likely to be noise in the data (false positives / negatives due to bad alignments), or any non-random creation of SNPs (i.e. any form of MNP: multi-nucleotide polymorphism). There are instances where MNPs affect individual clades sporadically, but don't contribute much to the overall age uncertainty. Once phylogenetically inconsistent mutations and problematic regions like DYZ19 are cleared out, I haven't seen any issues with false positives / negatives except in a few cases where the sample was obviously contaminated. The largest issues with the BigY dates appear to be limited to shot noise (Poisson noise) due to the small number of SNPs, followed by the uncertainty in the mutation rate and any small remaining variance over different populations.

With current data, these uncertainties are large (as has been pointed out). Hence, I would be surprised if there are any unaccounted-for factors that would cause them to grow substantially. I have tried my best to think of and account for any such factors. I explore some of the issues in the FAQs on my website.

There are unaccounted-for factors in the data, which could affect the 95% confidence intervals. The most significant of these is the "questionable singletons", which are called in one clademate, but not the other(s). The position of these singletons affects the ages of a minority of the youngest clades, and I'm working on a fix to that.

The exception to these statements is a peculiar statistical property whereby populous clades gain more SNPs on average than small clades. It can be best understood by a population in which each man has two sons: A and B, who each have two sons (AA, AB, BA and BB), who each have two sons (AAA, AAB, ABA, ... BBB), etc. Let's say the founder is R-L151. If a SNP forms in A (let's say P312) but not B, then A will have fathered a clade. AA and AB each have the same probability of having a new SNP as BA and BB, but will start with one SNP more. Let's say AA picks up an SNP (L21), BA picks up an SNP (U106) and BB doesn't. So we have P312>L21, P312*, U106* and L151* for the four grandchildren. The clade will grow, but will remain 1/2 P312, 1/4 U106 and 1/4 split among more minor clades. As this process repeats itself, we find that the oldest clades tend to become more populous and have more SNPs. This can mean that the smallest branches of populous clades can have half a dozen SNPs more than rare clades in the same branch, simply by random chance and the way the tree works. Effectively, it provides a faster mutation rate for populous clades than rare ones.
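The two-sons example can be written out as a short Python sketch. It encodes only the three SNPs named above (P312 in A, L21 in AA, U106 in BA) and adds no random mutations, so it reproduces the worked example rather than simulating a real tree. The function names are made up for illustration.

```python
from collections import Counter
from itertools import product

# SNPs from the example, keyed by the lineage in which they arise.
# "A" is the founder's first son, "AA" that son's first son, and so on.
NEW_SNPS = {"A": "P312", "AA": "L21", "BA": "U106"}
FOUNDER = "L151"


def snp_path(lineage):
    """List the SNPs a man carries, from the founder's down to his own."""
    path = [FOUNDER]
    for depth in range(1, len(lineage) + 1):
        snp = NEW_SNPS.get(lineage[:depth])
        if snp:
            path.append(snp)
    return path


def clade_shares(generations):
    """Fraction of men in a given generation who carry each SNP."""
    men = ["".join(p) for p in product("AB", repeat=generations)]
    counts = Counter(snp for man in men for snp in snp_path(man))
    return {snp: counts[snp] / len(men) for snp in counts}


def mean_snp_count(men):
    """Average number of SNPs carried by a group of men."""
    return sum(len(snp_path(man)) for man in men) / len(men)


for grandson in ("AA", "AB", "BA", "BB"):
    print(grandson, ">".join(snp_path(grandson)))

print(clade_shares(5))                  # shares five generations down
print(mean_snp_count(["AA", "AB"]))     # grandsons inside P312
print(mean_snp_count(["BA", "BB"]))     # grandsons outside P312
```

Five generations down, P312 is still half the population and U106 a quarter, and the P312 grandsons carry one more SNP on average than the others, which is the head start described above.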

The process is strongest in times of rapid population expansion. It's for this reason that it's important to take age analysis back to bottleneck points, beyond which this has little effect, e.g. R-L151, R-M269, R-M343, haplogroup R, or the tree root. It's also important to get a good representative sample of all the sub-clades in a tree, so that you aren't biased by the populous ones.

Finally, weighting of the clades can be an important item. YFull do a straight average of all their clades, which leads to some very peculiar results. On the face of it, it seems reasonable: each clade represents the descendants of one man, so why shouldn't you average them? The answer is because it gives equal weight to clades where there is very little idea of what the age is, as it does to clades where the age is much more precisely determined.

I asked the techies behind YFull what they thought of this, and the response I got was that they looked into different weightings and didn't find it important, yet they have to fudge the age of U106 to not be vastly younger than Z381. One of the reasons I started doing this exercise myself was to prove that statement wrong, and I've done that to at least my satisfaction. By retaining the proper uncertainty estimates for each clade (including the parent clade) it's possible to present a tree that doesn't need a lot of manual fudging to avoid violating causality. At the end of the process, I was actually very pleased to see how similar YFull's ages and mine were. It suggests we're both pretty much on the right track, and the agreement with the archaeological DNA and Mark Jost's Y-STR ages is encouraging.
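The effect of weighting can be seen in a small Python sketch that compares a straight average with an inverse-variance weighted average. The ages and uncertainties are invented, and inverse-variance weighting is used here only because it is the standard textbook way to down-weight imprecise estimates. It assumes roughly symmetric errors, which SNP-count ages do not strictly have, and it should not be read as the actual procedure used by YFull or in the analysis described in this post.

```python
def straight_average(estimates):
    """Plain mean of the ages, ignoring how well each one is known."""
    return sum(age for age, _ in estimates) / len(estimates)


def inverse_variance_average(estimates):
    """Mean weighted by 1/sigma^2, plus the uncertainty of the combined value."""
    weights = [1 / sigma ** 2 for _, sigma in estimates]
    total = sum(weights)
    mean = sum(w * age for w, (age, _) in zip(weights, estimates)) / total
    combined_sigma = (1 / total) ** 0.5
    return mean, combined_sigma


# Invented parent-age estimates from three sub-clades: (age in years, 1-sigma).
# The third sub-clade has very few SNPs, so its age is poorly constrained.
sub_clades = [(4800, 200), (4600, 250), (3000, 1500)]

print(round(straight_average(sub_clades)))            # 4133
weighted, sigma = inverse_variance_average(sub_clades)
print(round(weighted), "+/-", round(sigma))
```

In this made-up case the straight average is dragged several centuries younger by the one poorly determined sub-clade, while the weighted value stays close to the two well-determined ones.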

While YFull and I differ on implementation, I'd encourage people not to pay too much attention to individual centuries and look at the overall range of possibilities. I disagree with the precise implementation that YFull uses in a number of cases, but they normally get a pretty good answer despite this. We could both improve the methods that we're using, and I'm sure we're both trying to do that now. In the end, the main limitation is the data: its quantity, its quality, and how well it is calibrated. As those factors improve, these little details will start to become important, but for now most differences between YFull and my ages are simply down to noise in our two different sets of input data.

Cheers,

Iain.

Dave-V
10-31-2017, 12:17 AM
Iain - first of all, thanks for the considered response. I look forward to reading it in more detail :-).


...The better estimator is *inter-clade* variance: measuring the variance between Y-STR data from two different sub-clades. With careful attention, Y-STR variance can be used to provide accurate ages several millennia in the past which match with the archaeological record. Mark Jost's ages from several years ago still appear to be some of the most accurate ages we have today...

On a quick note, this ("interclade") is the STR-based TMRCA method I have coded into SAPP from Ken Nordtvedt's descriptions, his Generations67 and 111 spreadsheets, and Mark Jost's spreadsheets.

While I don't claim accuracy from that (my earlier chisel and screwdriver comment still stands), if we ever want to do a run-off of an STR method against a SNP method, it's easy enough to generate the STR side.

Mikewww
10-31-2017, 03:10 AM
... Finally, weighting of the clades can be an important item. YFull do a straight average of all their clades, which leads to some very peculiar results. On the face of it, it seems reasonable: each clade represents the descendants of one man, so why shouldn't you average them? The answer is because it gives equal weight to clades where there is very little idea of what the age is, as it does to clades where the age is much more precisely determined.

I asked the techies behind YFull what they thought of this, and the response I got was that they looked into different weightings and didn't find it important, yet they have to fudge the age of U106 to not be vastly younger than Z381....

Thanks, Iain. This is what I thought. YFull has this issue with other subclades besides just the U106 and Z381 relationship. I think it is spread all over their tree. This has to mess up their confidence ranges. I could say this another way but to be positive let me just say I appreciate that you account for the uncertainties using statistical analysis.

There are plenty of issues with STR variance based TMRCAs so I wouldn't argue any advantage to them. I have used Jost's spreadsheet, which is just a better spreadsheet adaptation of Nordtvedt's methods. I will have to go back and dig them up but I did an analysis of L151 in about 2012 on Jost's spreadsheet and it wasn't that bad. A little young, I think, if I remember. You (we) do eliminate the multi-copy markers and such.

Edit: I found it. The Nordtvedt/Jost TMRCA inter-clade estimates are younger but they are in the right age. I used to argue with folks who said that R1b in Europe was Neolithic when I thought it (L11) was Bronze Age. From 2011:

R-L151(L11/S127) 2500 BC (3200 BC - 1800 BC)
https://www.dropbox.com/s/bpcvzlkq807zj6s/R1b-L11_Subclades_Timeline.jpg?dl=0

The McDonald ages for U106 and P312:
R-U106 3012 BC (3689 BC — 2456 BC)
R-P312 3155 BC (3898 BC — 2568 BC)

Mikewww
10-31-2017, 03:23 PM
The process is strongest in times of rapid population expansion. It's for this reason that it's important to take age analysis back to bottleneck points, beyond which this has little effect, e.g. R-L151, R-M269, R-M343, haplogroup R, or the tree root. It's also important to get a good representative sample of all the sub-clades in a tree, so that you aren't biased by the populous ones.
Joe B or you folks in the R1b Basal project, let me know if you are interested in this. M269 has a lengthy phylogenetic block but I think it would be helpful to look at this from an M269 MRCA and on in. In that regard, L151 is biasing the whole thing.

Joe B
10-31-2017, 04:33 PM
Just speaking for myself. There hasn't been any desire expressed from within the R1b Basal Subclades R1b-M343 (xP312 xU106) haplogroup project to change what we're doing. BAM file analysis. There's been a false narrative about no need to have a BAM file analysis. I strongly disagree. It complicates our efforts to have members submit BAM file links for project analysis and to YFull for a second analysis. Both analyses offer the member the opportunity to compare their results to a database outside of FTDNA.
Unless R1b Basal Subclades members and administrators express a desire to change, the answer is no. It certainly should not be dictated from downstream clades. Please, as administrator of the R1b Basal Subclades project, respect the xP312 xU106 boundary.

Mikewww
10-31-2017, 07:12 PM
Joe B or you folks in the R1b Basal project, let me know if you are interested in this. M269 has a lengthy phylogenetic block but I think it would be helpful to look at this from an M269 MRCA and on in. In that regard, L151 is biasing the whole thing.


Just speaking for myself. There hasn't been any desire expressed from within the R1b Basal Subclades R1b-M343 (xP312 xU106) haplogroup project to change what we're doing. BAM file analysis. There's been a false narrative about no need to have a BAM file analysis. I strongly disagree.
I think there is a lot more that can be done with the new generation of VCF/BED files but reviewing BAM files is clearly still a very positive thing to do. Do you have a particular post where there is a "false narrative"? I think the number of times one must go to a BAM review is diminished with the more robust VCF/BED files but would never say it is a bad idea or a waste to do BAM review. There is a guy on the L21 forum who is saying it is a waste, though.

The above offer wasn't a request to not do BAM file analysis anyway.


It complicates our efforts to have members submit BAM file links for project analysis and to YFull for a second analysis. Both analyses offer the member the opportunity to compare their results to a database outside of FTDNA.
Okay. I like to be on multiple trees but I guess it is more complicated.

The offer/interest level check above is related to a free SNP age analysis. That's it, a volunteer service.


Unless R1b Basal Subclades members and administrators express a desire to change, the answer is no. It certainly should not be dictated from downstream clades. Please, as administrator of the R1b Basal Subclades project, respect the xP312 xU106 boundary.
Who is dictating? The M269 branches have a relationship and there can be some synergies in analysis, as Iain indicated. There was an expansion point from M269's MRCA.

WOLFF Éric
11-08-2017, 07:58 PM
5026 YBP ….. ….. R-Y14049 origin ? :argue:

jamesdowallen
03-01-2018, 02:20 PM
I'm trying to get "best available" dates for Y-chromosome clading to display at my own website, which will show part of the "story" of haplogroup development. Naturally I start with Yfull data: I don't know any other site that provides easy to read date estimates for almost the entire Y-tree - is there one?

I appreciate that the "error bars" are huge, but still: the Yfull dates seem almost consistently more recent than other estimates. What's the general opinion on this?

I'm tempted to add 10% to all Yfull time spans before presenting them at my site (Of course I would note this "correction method" at the site). Would this be wise? Is there a better correction formula? (I realize that any estimation biases in Yfull's data may vary from one haplogroup to another, but I hope for something simple.)

palamede
03-02-2018, 11:19 AM
I'm trying to get "best available" dates for Y-chromosome clading to display at my own website, which will show part of the "story" of haplogroup development. Naturally I start with Yfull data: I don't know any other site that provides easy to read date estimates for almost the entire Y-tree — is there one?

I appreciate that the "error bars" are huge, but still: the Yfull dates seem almost consistently more recent than other estimates. What's the general opinion on this?

I'm tempted to add 10% to all Yfull time spans before presenting them at my site (Of course I would note this "correction method" at the site). Would this be wise? Is there a better correction formula? (I realize that any estimation biases in Yfull's data may vary from one haplogroup to another, but I hope for something simple.)
You haven't had a lot of responses because the question is not really settled.
For me, adding 10% to the YFull dates is a minimum. It is a cautious correction, and certainly a lower limit on what should be added.

Different mutation rates for different haplogroups are possible, but I am afraid nobody can give you a reliable answer on that.

Mikewww
03-02-2018, 04:08 PM
I'm trying to get "best available" dates for Y-chromosome clading to display at my own website, which will show part of the "story" of haplogroup development. Naturally I start with Yfull data: I don't know any other site that provides easy to read date estimates for almost the entire Y-tree — is there one?

I appreciate that the "error bars" are huge, but still: the Yfull dates seem almost consistently more recent than other estimates. What's the general opinion on this?

I'm tempted to add 10% to all Yfull time spans before presenting them at my site (Of course I would note this "correction method" at the site). Would this be wise? Is there a better correction formula? (I realize that any estimation biases in Yfull's data may vary from one haplogroup to another, but I hope for something simple.)

It is a concern if we add fudge factors arbitrarily. There already may even be fudge factors in dating when parallel branches have different ages coming to the same MRCA so we end up adding fudge factors on top of fudge factors.

I have to remind myself of this too... it is tempting to look at these dates with some precision but as you mentioned it is really the time span or confidence interval that is important and we can't really expect that the midpoint is actually "right".

If you look at the width of these ranges then adding 10% or 15% or 5% probably doesn't matter that much. It's just not that precise.

There is new news with one of the ancient DNA findings. Over on the U106 yahoo group Dr. McDonald suspects we may have to take all of the dates back further in time, his too.

jamesdowallen
03-02-2018, 07:07 PM
There is new news with one of the ancient DNA findings. Over on the U106 yahoo group Dr. McDonald suspects we may have to take all of the dates back further in time, his too.

Perhaps you're referring to the quotation in https://groups.yahoo.com/neo/groups/R1b1c_U106-S21/conversations/messages/52445 which begins


Hi folks,

As some of you may have been following on Anthrogenica ( https://anthrogenica.com/showthread.php?10565-The-Beaker-Phenomenon-And-The-Genomic-Transformation-Of-Northwest-Europe-Olalde&highlight=I7196 ), our new U106 ancient burial from Prague has now been further refined to:
U106 > Z2265 > BY30097 > Z381 > Z156 > Z306 > Z304 > DF98 > S1911 > S1894
Being S1894+ myself, this obviously makes me very happy. I've been very lucky to have two out of the five early U106+ burials be R-S1894. However, it also raises some interesting possibilities for everyone else. I should point out that the S1911 and S1894 assignments are made from single reads, but that's normally ok providing one is not searching for novel variants and provided the read quality is ok.

These are still very early days for understanding this burial and the context in which he was found. Perhaps the most important factor here is the date of the burial. Carbon dating hasn't (as far as I can work out) been performed. Contextually, the site has been given a date of between 2200 BC and 1700 BC in the Olalde publication, and we have to assume from the R-S1894 haplogroup that the burial lies towards the end of this period.

This is interesting in the context of the age and spread of the older U106 branches. This burial is the closest U106 individual to an existing lineage that we know of: the Swedish (RISE98), Dutch (Olalde) and York (Roman) burials are all many centuries more recent than their most-recent known haplogroups, and don't give us great constraints on the ages of haplogroups. I had dated S1894 to between 2545 BC and 1231 BC, with a best guess of 1866 BC. Clearly, if this burial is S1894+, the real date must be in the earlier part of this timeframe. This pushes back the most likely dates of all the haplogroups around it. The following are guideline ages for the youngest each clade is likely to be (i.e. we can be 95% confident the true ages are older than these dates):
R-S1894: before 1740 BC
R-S1911: before 1800 BC
R-DF98: before 1875 BC
R-Z304: before 1895 BC
R-Z306: before 1950 BC
R-Z156: before 2280 BC
R-Z381: before 2380 BC
R-BY30097: before 2410 BC
R-Z2265: before 2440 BC
R-U106: before 2470 BC
R-L11/L151/P311: before 2560 BC
Better constraints for U106 and L11 come from RISE98 and other ancient burials.
This doesn't actually contradict Yfull's mean dates, although it increases skepticism.

BTW, of 11 clades listed at the bottom of that excerpt, at least 3 are completely missing from Yfull. >:( That leads me to two questions:
(1) Is there a "machine readable" version of Dr. McDonald's tree? (an .svg image isn't easy for me to work with.)
(2) Does McDonald only study U106? Are there annotated machine-readable trees as rich as McDonald's for P312?

jamesdowallen
03-03-2018, 03:05 AM
I want to be able to look at Y-haplogroup MRCA dates and be able to try to match these up to known migrations, but without constantly correcting for the "known Yfull bias." Since there seems to be a consensus that Yfull ages are generally underestimated I will replace them with better estimates, McDonald's or derived by formula from Yfull dates. Otherwise I (and any others using my database) will have to be constantly correcting.

To keep a clear connection to the Yfull data, I will always show the Yfull mean TMRCA in brackets. For example a Yfull mean date of 2400 BC might display as

2700 BC [2400]

YFull's confidence interval is usually wider in proportion for more recent dates. I'm left with three choices for a general date correction formula:
(1) Add 10% to all TMRCA's. (Or do something similar but slightly more complicated.)
(2) Set the estimate halfway (or such) between Yfull's mean and Yfull's upper bound. Or, since (2) would be extra work, just
(3) Use a more complicated but single-variable formula as in (1), which will tend to approximate (2).
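Options (1) and (2) and the bracketed display format can be sketched in a few lines of Python. This is only an illustration: the input ages are invented, the function names are hypothetical, and the conversion from years before present to a BC date assumes a reference year of 2000 and ignores the missing year zero, which is irrelevant at this precision.

```python
def add_percentage(mean_ybp, percent=10.0):
    """Option (1): lengthen the time span by a fixed percentage."""
    return mean_ybp * (1 + percent / 100)


def halfway_to_upper(mean_ybp, upper_ybp):
    """Option (2): move the estimate halfway from the mean to the upper bound."""
    return (mean_ybp + upper_ybp) / 2


def format_bc(corrected_ybp, original_ybp, present_year=2000):
    """Show the corrected date with the original mean date in brackets."""
    corrected_bc = round(corrected_ybp) - present_year
    original_bc = round(original_ybp) - present_year
    if corrected_bc <= 0 or original_bc <= 0:
        raise ValueError("this sketch only formats BC dates")
    return f"{corrected_bc} BC [{original_bc}]"


# Invented example: a mean TMRCA of 4400 ybp with an upper bound of 4900 ybp.
mean_ybp, upper_ybp = 4400, 4900

print(format_bc(add_percentage(mean_ybp), mean_ybp))               # 2840 BC [2400]
print(format_bc(halfway_to_upper(mean_ybp, upper_ybp), mean_ybp))  # 2650 BC [2400]
```

Note that the percentage has to be applied to the time span in years before present, not to the BC date itself, or the size of the correction would depend on the calendar's starting point.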

To make this decision, I'd like input from others about Yfull's bias. Their Copper Age dates seem off by 10% or 15%. How about their dates in the Middle Paleolithic? Is the bias still 10% or more? Or, following Yfull's "confidence interval", closer to 5%?

Saetro
03-08-2018, 06:56 PM
I'm trying to get "best available" dates for Y-chromosome clading to display at my own website, which will show part of the "story" of haplogroup development. Naturally I start with Yfull data: I don't know any other site that provides easy to read date estimates for almost the entire Y-tree — is there one?

I appreciate that the "error bars" are huge, but still: the Yfull dates seem almost consistently more recent than other estimates. What's the general opinion on this?

I'm tempted to add 10% to all Yfull time spans before presenting them at my site (Of course I would note this "correction method" at the site). Would this be wise? Is there a better correction formula? (I realize that any estimation biases in Yfull's data may vary from one haplogroup to another, but I hope for something simple.)

I'm not sure which estimates you find are 10% low, but you can always give the low figure and say it is a lower limit and may be higher.

If the low figure you are talking about is TMRCA, then of course it will be low.
TMRCAs always start out low and then become higher as more descendant lines are found.
Usually they jump back and then rise asymptotically towards a final figure.

The other method of dating - SNP mutation - is generally higher.
On rare occasions when there is a sudden explosion of mutations, it may turn out a little low, but usually it tends to be the upper likely limit.

What's wrong with giving a range? By quoting both methods as probable lower and upper limits?
With conventional genealogy we may not know when a person moved from A to B exactly, but there are definite events either side.
With Y chromosome genealogy at the frontier, dates are always uncertain at first because data are few.
That's the nature of the thing.
And sometimes it can be even worse, when we are not even sure exactly what the chain of mutations is and an initial understanding is revised.
Then the dates can really change.

And there is great value in using the word "roughly" when it is needed.

parasar
03-08-2018, 08:59 PM
5026 YBP ….. ….. R-Y14049 origin ? :argue:

YFull has 4300. So YFull's is about 14% lower (put the other way, the 5026 figure is about 17% higher).
https://www.yfull.com/tree/R-Y14049/