Explanations for varying rates of Y-Chromosome mutation



theImmortal
06-11-2018, 03:21 PM
Most of us have seen how certain male lines mutate more/less than others (STR and SNP, even over thousands of years). What are the proposed reasons for this?

I've heard one theory that deals with endogamy. When a male line procreates within a small population over an extended period of time, this can decrease the rate of mutation. Perhaps it is because there is a limited pool of X-chromosomes with which the Y-chromosome interacts, thereby decreasing the chance of X-to-Y gene conversion (or some other mutative effect caused by the presence of a "foreign" X chromosome that the Y-chromosome has never interacted with before).

I think this theory attempts to explain the observation that "older" male lines seem to be more stable than "newer" (or more successful) lines. Why would a male line that stays in the same place mutate less than a line that moves around? The easiest example is R-L151, which has expanded rapidly in the last 4,500 years in Europe and North America (and appears to have supplanted the G haplogroup found in my ancient DNA samples). And my understanding is (and maybe this is all anecdotal) that its downstream branches have been observed to mutate relatively quickly.

After thinking about this theory for a while, I've come up with a related theory that attributes the aforementioned observation to evolution.

Under this theory, the rate does not actually vary. Lines with large numbers of mutations simply have not had enough time for evolutionary pressures to select them out of the gene pool. Thus, when a male line is successful in the modern era, we observe more mutations. But if we took samples in 10,000 years, we'd find that the surviving lines are those that experienced fewer mutations (since mutations are more likely to result in adverse rather than beneficial effects).

Curious to hear others' thoughts.

K33
06-11-2018, 04:16 PM
I have wondered the same thing... it cannot be a coincidence that basically all of the more "Basal" y-dna haplogroups are found in populations that are:

a) Genetically "isolated"/endogamous (at least relative to other human groups), and
b) Spatially "non-mobile", in the sense of having occupied the same ecological niche for many millennia (which would presumably decelerate evolutionary pressure on all portions of the genome, including the Y chromosome)

Why do Khoisan still overwhelmingly carry y-dna A? Why didn't this mutate much over 100k (or 200k or 300k) years? And the same for y-dna B in Central Africa.

The Aeta carry P*, P1, and P2. So obviously their y-dna mutated pretty far beyond y-dna "Adam", before suddenly "freezing" in time, with shockingly few downstream mutations dated beyond 30kya or so (around the TMRCA of P*, P1 etc). Is this evidence of an extensive "early" history of movement and admixture with other groups (i.e., during the first OOA wave), before ultimate isolation/endogamy beginning in the Mesolithic? And the same thing for K2 re: Sahulians?

Socotra islanders have been reported to harbor 70%+ basal J* (http://dienekes.blogspot.com/2008/11/y-chromosomes-and-mtdna-from-soqotra.html), which is dated to 48kya. Again, are they largely a relic population from the "Iberomaurusian" days who just shifted to Semitic languages during the BA? (I think Socotrans may be one of the last un-sequenced modern populations for which autosomal data can still teach us about human origins)

OTOH the most "downstream" of the major y-dna clades tend to be held by groups that show precisely the opposite pattern-- large-scale movements and admixture with radically different outgroups taking place right down to the common era: Indo-Europeans (R1) and Turkics (N, Q).

Native Americans would be an interesting case study since they obviously descend from quite mobile, exogamous groups well into the Mesolithic, but following the Bering Strait crossing 15kya, the spatial mobility continued while potential outbreeding was heavily restricted by the population bottleneck. In other words, the one element continued but the other element (exogamy) was substantially controlled for.

It would be nice to see higher resolution mapping of Native American y-dna--just how many "downstream" nodes are there for Native Q and C? Of course because of European sex bias you'd need to sequence 5-10 self-identified Natives to maybe get 1 genuine Native y-lineage but it would be highly interesting to see the results...

RobertCasey
06-11-2018, 04:59 PM
This question really depends on the time frame of discussion. For very old branches (over 2,500 years old), there are waves of YDNA replacement via conquest and spread of new technology (farming, metalwork). The patterns vary radically depending on geography.

Once you get below 2,500 years and closer to the genealogical time frame, we are now beginning to understand the mechanics of why mutation rates vary a lot. There are several factors:

1) Random statistical variation is the primary cause. The smaller your sample size, the less accurate the estimates based on limited data. For instance, years per YSNP are now being reported in the 30 to 60 year range vs. the 130 year range. It also depends on your criteria for rejecting YSNP branches that are found in complex areas or are inserts/deletes.
2) The geography being analyzed plays a significant role, as the share of the population tested varies widely. China and the Far East have by far the lowest penetration of testing (percent of population tested); the US and other former English colonies are by far the most tested, due to higher interest in making connections back to European ancestors.
3) There is another major factor: how prolific your line is. Being the conquerors vs. the conquered makes a huge difference in the number of living testers. L226, which is 1,500 years old, has over 1,000 testers at 37 or more markers. L96, which is around 2,500 years old, has only ten known testers.
4) The ability to thrive on unlimited access to land (at the expense of the native population - colonization). My Casey line arrived in South Carolina around 1750 after the massive crop failures caused by harsh weather conditions. Even though Casey ranks in the top 50 Irish surnames in Ireland today, there are more tested Caseys with ties to early South Carolina than tested Caseys who still reside in Ireland.
5) Every geography has its own individual story to tell on how YDNA evolved (people residing between large powerful countries tend to have more blended YDNA and more constant small waves of influx of new YDNA).

Only testing of yet larger quantities of archaic remains will shed more light on how YDNA spread across the world. Sample size = accuracy. I do not think isolation or mobility plays as much of a role as the massive spread of technology and conquest. During the last ice age to hit Europe, the population of northern Europe fell to zero, as no hunter-gatherers could survive living on top of several miles of ice. However, a small percentage survived via mobility by just walking south to avoid freezing and starving. Until archaic remains were analyzed in great detail, it was believed that R1b men were the first to return after the last ice age. That theory was dismissed, as no R1b was found in Europe for the first few thousand years after the last ice age. R1b from southern Russia, carried by those with bronze and horses, almost completely replaced the E haplogroup that was at least 25 % of early western Europe after the last ice age. Haplogroup Q, which was found in significant quantities in Germany, was almost entirely replaced.

The Vikings replaced around ten percent of the YDNA in Ireland; the Celtic culture allowed more mobility due to common languages and customs. The Anglo-Saxon (and Danish) invasion of England replaced ten percent or more of the YDNA in England. There are just constant influxes of new YDNA. New economic motivations do the same: the number of people claiming Irish ancestry living in England is about the same as the number living in Ireland. For every person living in Ireland today, there are ten people of Irish ancestry living in the United States. Even Canada, Australia and New Zealand have more Irish people living in these former British colonies than live in present day Ireland. So colonization plays a major role in the last 300 years.

The Middle East was dominated by the Ottoman Empire for centuries - this political control almost certainly had a major effect on the ease of YDNA flow. The Irish were recruited by the English to fight Prussians and others. Many of these Irish soldiers were granted land in Poland and Lithuania (where they celebrate St. Patrick's Day even today). So there are many smaller waves that are constant.

redifflal
06-11-2018, 05:09 PM
Is my understanding incorrect, or the OP's? I thought that a person with basal P that isn't Q or R still accumulates the same number of mutations every generation; it's just that P*(xQR) isn't as successful as the line that got the Q or R mutations. Nobody's line is "frozen in time". All the assignments in the charts are retrospective. The number of mutations should still be the same for both P*(xQR) and those with Q and R; it's just that a group of P* became uber successful, and so a subset of P* has an even more recent common ancestor than the rest of P*.

theImmortal
06-11-2018, 07:14 PM
Is my understanding incorrect, or the OP's? I thought that a person with basal P that isn't Q or R still accumulates the same number of mutations every generation; it's just that P*(xQR) isn't as successful as the line that got the Q or R mutations. Nobody's line is "frozen in time". All the assignments in the charts are retrospective. The number of mutations should still be the same for both P*(xQR) and those with Q and R; it's just that a group of P* became uber successful, and so a subset of P* has an even more recent common ancestor than the rest of P*.

I think we're talking about the same thing, which is why I was using "air quotes" to refer to "newer" and "older" lines. My experience has been that when you look at two branches after they diverged, the more successful branch will have, on average, a larger number of subsequent SNPs. Obviously, they're the same age. EDIT: So while basal P kits should have the same number of mutations as Q or R, they don't.

Maybe I'm wrong about that. That has been my general observation, and I recall reading somewhere that kits downstream of R-L21 have more subsequent SNPs on average than the rest of the kits downstream of R-P312, even though we know for a fact that R-P312 is older than R-L21.


I do not think isolation or mobility plays as much of a role as the massive spread of technology and conquest.

Isn't one really just a proxy for the other? Meaning haplogroups that have been extremely mobile, like R1a and R1b, were extremely mobile due to technology and conquest.

redifflal
06-11-2018, 07:31 PM
So the only way I can think of that the mutation rate is faster in a more prolific/"recent" (i.e., more recently defined MRCA) line than in its undefined brother line is that the more prolific line, besides being more prolific, is also reproducing at an earlier age. For example, say I am P*(xQR) and my brother is the first R*, and I'm having 1-2 sons while my brother is having 16 sons, and then my sons continue having sons at 25-30 years of age while my nephews are having sons at maybe 14-15 years old. Maybe prolific-ness has something to do with the age at which viable children can be produced, in which case there would end up being more mutations in a person downstream of R1a than in a P*.

This is like a macro level version of finding a distant uncle (your parents third cousin or something) that's actually younger than you. Pretty comical actually in joint family settings to call them, especially if girls, as auntie lol, because technically you're not wrong.

theImmortal
06-11-2018, 09:57 PM
So the only way I can think of that the mutation rate is faster in a more prolific/"recent" (i.e., more recently defined MRCA) line than in its undefined brother line is that the more prolific line, besides being more prolific, is also reproducing at an earlier age. For example, say I am P*(xQR) and my brother is the first R*, and I'm having 1-2 sons while my brother is having 16 sons, and then my sons continue having sons at 25-30 years of age while my nephews are having sons at maybe 14-15 years old. Maybe prolific-ness has something to do with the age at which viable children can be produced, in which case there would end up being more mutations in a person downstream of R1a than in a P*.

This is like a macro level version of finding a distant uncle (your parents third cousin or something) that's actually younger than you. Pretty comical actually in joint family settings to call them, especially if girls, as auntie lol, because technically you're not wrong.

I like this theory...a lot.

More prolific = more desirable/greater fitness = earlier lifetime procreation = more generations = more mutations.

This would explain why we seem to see one SNP rate per generation across the population in the present, but another across the population over time. And it definitely meshes with the idea that this coincides with pillaging/conquest.

On a side note, makes me feel better that my slightly higher SNP rate might not mean that I'm sitting on a line that's just waiting to be killed off :)
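
A rough sketch of the "more generations = more mutations" arithmetic, assuming (and this is only an assumption) that each generation adds the same expected number of SNPs no matter how old the father is. The per-generation rate and the two paternal ages below are hypothetical values picked for illustration, not measured ones.

```python
def expected_snps(years, avg_father_age, snps_per_generation):
    """Expected SNP count on one line, assuming a fixed rate per generation."""
    generations = years / avg_father_age
    return generations * snps_per_generation

YEARS = 4500          # time span mentioned earlier in the thread for R-L151
SNPS_PER_GEN = 0.25   # hypothetical rate, for illustration only

# Hypothetical average paternal ages for an "early" and a "late" line.
early_line = expected_snps(YEARS, avg_father_age=20, snps_per_generation=SNPS_PER_GEN)
late_line = expected_snps(YEARS, avg_father_age=30, snps_per_generation=SNPS_PER_GEN)

print(f"fathers at 20: {early_line:.2f} expected SNPs")
print(f"fathers at 30: {late_line:.2f} expected SNPs")
print(f"ratio: {early_line / late_line:.2f}")
```

Under that assumption the ratio of SNP counts is simply the inverse ratio of the paternal ages, whatever per-generation rate is plugged in.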

RobertCasey
06-12-2018, 01:22 PM
For more prolific lines - your current sample size is much greater, so you currently observe higher mutation rates. The mutation rates themselves do not change except for random statistical variation for smaller sample sizes. If you are finding TMRCA for very old haplogroups - over 2,500 years, the observed rates are much less since in that time frame, over 99 % of male lines have become extinct just due to random statistical variation. Between 1,000 years and 2,500 years, for prolific lines (like L226, which is 1,500 years old and has 125 Big Y tests and over 1,000 testers with 37 or more markers), I recently had to shift from 70 years per YSNP to 60 years per YSNP; the 70 came from my previous analysis, when there was 35 % less data available. There are a few surname clusters (people who share the same ancestor who first used a surname around 1,000 years ago - in Ireland, where clan names were first used) that have 100 to 200 testers for just one surname cluster. These people are getting one YSNP per generation (or every 30 years).
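
The claim about male lines dying out by chance alone can be sanity-checked with a simple branching-process sketch. The assumptions here are mine, not from the post: every man has a Poisson-distributed number of sons with a mean of exactly one (a stable population), and a generation is 30 years. The function name is hypothetical.

```python
import math

def extinction_probability(generations, mean_sons=1.0):
    """Chance that one founder's male line is extinct within `generations`,
    assuming each man has a Poisson(mean_sons) number of sons."""
    q = 0.0  # a living founder is not extinct at generation 0
    for _ in range(generations):
        # Poisson probability generating function: G(s) = exp(mean * (s - 1)).
        # Extinct by generation n+1 means every son's line is extinct by generation n.
        q = math.exp(mean_sons * (q - 1.0))
    return q

generations = round(2500 / 30)  # about 83 generations in 2,500 years
p_extinct = extinction_probability(generations)
print(f"{generations} generations: {p_extinct:.1%} of founder lines extinct")
```

With these assumptions the large majority of founder lines vanish over 2,500 years without any selection at all, which is the same order of magnitude as the figure quoted above. A different offspring distribution or a growing population would shift the number.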

For L226, we also make extensive usage of YSNPs in complex areas and inserts/deletes. If these are removed, our YSNP mutation rate drops by 20 to 40 % depending on which path of the L226 haplotree you look at. YFULL uses 130 years per YSNP as they are estimating older haplogroups and do not allow the usage of YSNPs in complex areas or any inserts/deletes. As a L226 admin, I use 60 years per YSNP for my current sample size. Once my sample size increases another 50 to 100 %, this will surely drop to 50 years per YSNP.

I know that these adjustments are required because of surname cluster dating. If I do not lower the years per YSNP mutation, YSNP branches become much younger than the surname cluster dates. Dates estimated from surnames are much more reliable than dates estimated from YSNPs, so I adjust the YSNP branch mutation rate to minimize overrunning the surname cluster dating. We now have around 25 surname clusters under L226, which helps calibrate the years per YSNP mutation value.

Another huge adjustment will be made when longer read lengths become available at reasonable costs in the next five to ten years. The YElite2.1 with long read can reveal twice as many base pairs as the Big Y tests. The normal YElite2.1 shows around 30 % more base pairs. So when these tests become the norm due to price reductions, we will see a quick 30 to 100 % increase in YSNP mutation rates.
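
As a back-of-the-envelope sketch of that last point: if the number of YSNPs found scales linearly with the number of base pairs covered (an assumption, since newly covered regions may not yield SNPs at the same rate), the years per YSNP shrinks by the same factor. The function name is hypothetical; the 60-year starting value and the 30 % and 100 % gains are the figures from the post.

```python
def rescaled_years_per_snp(current_years_per_snp, coverage_gain):
    """Years per YSNP after covering (1 + coverage_gain) times as many base
    pairs, assuming SNP discovery scales linearly with coverage."""
    return current_years_per_snp / (1.0 + coverage_gain)

for gain in (0.30, 1.00):
    years = rescaled_years_per_snp(60, gain)
    print(f"{gain:.0%} more base pairs: about {years:.0f} years per YSNP")
```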

theImmortal
06-12-2018, 04:10 PM
For more prolific lines - your current sample size is much greater, so you currently observe higher mutation rates. The mutation rates themselves do not change except for random statistical variation for smaller sample sizes. If you are finding TMRCA for very old haplogroups - over 2,500 years, the observed rates are much less since in that time frame, over 99 % of male lines have become extinct just due to random statistical variation.

I don't see how greater sample size = higher observed mutation rates. We should see a regression toward the mean. As you point out, we'll see random statistical variation for smaller sample sizes, but shouldn't they be scattered equally between high and low outliers? I take it you agree they are not.

What I like about the "younger-to-procreate" theory is that it also explains why I have not been seeing low outliers, only high outliers, in large samples that represent a long historical period. Before the Big Y platform change, I could see +500 men downstream from U152. It seemed like there was a floor at 17-18 novel SNPs, but the mode was 19-22. Only a handful of kits had fewer than 17 novel SNPs. Some men had 28-32, but no one had 10.

"Younger-to-procreate" would explain it like this. Some lines went through prolonged periods where the average age of procreation was 20, while the average age for all humans is closer to 30. But no lines go through prolonged periods where the average age of procreation was 40 or older. That just doesn't happen. This would be particularly true in a place like the British Isles, particularly Ireland, where Bronze Age men totally displaced the Neolithic settlers.

I'm interested in surname clustering, but I wonder, how do you ensure reliability? Some surnames go back to the 11th-13th century, some go back to the 17th century. Are you accounting for this or giving an average date for surname adoption? I'm also interested in the impact of higher resolution testing. Is there a case to be made that large samples skew toward newer tests, and newer tests pick up more SNPs?

redifflal
06-12-2018, 05:27 PM
If the younger to procreate theory does hold water, it would be interesting to understand how that sustains over time. It would have to be a cultural context that allows for such a phenomenon beyond just availability of resources. It might even have very little to do with the so called dominance of the male line and more to do with the resident female population's cooperativity as was noted in the other thread recently about grandma power
https://www.npr.org/sections/goatsandsoda/2018/06/07/617097908/why-grandmothers-may-hold-the-key-to-human-evolution?utm_source=facebook.com&utm_medium=social&utm_campaign=npr&utm_term=nprnews&utm_content=20180607

theImmortal
06-13-2018, 02:38 AM
If the younger to procreate theory does hold water, it would be interesting to understand how that sustains over time. It would have to be a cultural context that allows for such a phenomenon beyond just availability of resources. It might even have very little to do with the so called dominance of the male line and more to do with the resident female population's cooperativity as was noted in the other thread recently about grandma power
https://www.npr.org/sections/goatsandsoda/2018/06/07/617097908/why-grandmothers-may-hold-the-key-to-human-evolution?utm_source=facebook.com&utm_medium=social&utm_campaign=npr&utm_term=nprnews&utm_content=20180607

In my view, this doesn’t persist over time and that article has little relevance. Remember we’re not talking about hunter gatherers. We’re talking about subsistence farmers who had developed a complex society with power systems, including chieftains, fealty, slavery, and even a legal system. But wealth generation also depended on expanding into new territories. I think this was a very difficult time for women, who were forced to adjust to mobile warring tribes of men who used their superior weaponry to kill their sons and husbands and keep them as trophies.

RobertCasey
06-13-2018, 04:02 AM
I don't see how greater sample size = higher observed mutation rates. We should see a regression toward the mean. As you point out, we'll see random statistical variation for smaller sample sizes, but shouldn't they be scattered equally between high and low outliers? I take it you agree they are not.

For L226 - As we get more and more YSNP branches, it becomes very obvious that the true years per YSNP mutation is much less than those used in predicting much older haplogroups. During my previous iteration of analysis, I used 70 years per YSNP which made the dating of the YSNP branches fit neatly into the surname clusters. Since the number of branches went from 45 to 78, many of the YSNP branches now are younger than the surname clusters. Also, for very large surname clusters of 200 or 300 testers, they are finding years per YSNP to be around 30 years per YSNP.

Just think: when L226 reaches 200 branches, a lot will be in the last 1,000 years - but we are continually finding more branches older than 1,000 years as well. As we continue to add more and more intermediate branches, the years per YSNP mutation will have to come down. Also, we are finding more and more surname clusters as well (now 25 vs. 15 for the previous analysis). This also forces the years per YSNP to decrease, since there are more fixed time points that the older YSNP progression must fit into. Another general rule of thumb: the more recent the time frame that you are analyzing, the higher the number of YSNP branches present. The older your time frame, the more lines have died out, which hides mutations that could have been observed if all lines had survived. So the younger the time frame, the fewer lines have died out.

Also, as pointed out, the origin of surnames varies dramatically by geography. The example that I was giving of 1,000 years for surname creation applies to Irish and Scottish surnames. English surnames are a little younger, by another 100 or 200 years. Plus there are exceptions, as some royal lines used surnames more than 1,000 years ago. This surname date obviously does not work for Swedish lines, since they only converted to inherited surnames in the last 100 to 200 years. Turkey did not adopt surnames for most people until around 100 years ago.

neanderling
06-13-2018, 05:30 AM
For L226 - As we get more and more YSNP branches, it becomes very obvious that the true years per YSNP mutation is much less than those used in predicting much older haplogroups. During my previous iteration of analysis, I used 70 years per YSNP which made the dating of the YSNP branches fit neatly into the surname clusters. Since the number of branches went from 45 to 78, many of the YSNP branches now are younger than the surname clusters. Also, for very large surname clusters of 200 or 300 testers, they are finding years per YSNP to be around 30 years per YSNP.


I am intrigued but baffled by this observation. Is this a consequence of imperfect coverage by any single Big Y test, with the increased number of identified SNPs in a given line with more men tested due to the higher likelihood of two men sharing a region of the Y chromosome usually underrepresented by Big Y testing? Surely if coverage were perfect and complete, the number of SNP differences between any random pair of men would, after dividing by two, be an unbiased estimate of the number of mutations in the time since their common ancestor so long as there are no back mutations. The estimate in that case should not systematically decrease by adding more descendants of that ancestor, hence my interest to understand this better.
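
To make the pairwise estimator concrete, here is a minimal sketch. It assumes perfect and identical coverage for both men and no back mutations, as stated above; the SNP labels are invented for illustration and the function name is hypothetical.

```python
def mutations_since_mrca(snps_a, snps_b):
    """Half the number of SNPs carried by exactly one of the two men.
    Each man's line accumulated mutations independently since the common
    ancestor, so the pairwise difference counts both lines together."""
    differing = set(snps_a) ^ set(snps_b)  # symmetric difference
    return len(differing) / 2

# Hypothetical derived SNPs for two men; S1 and S2 are shared (inherited
# from the common ancestor), the rest arose after the split.
man_a = {"S1", "S2", "S3", "S4", "S5"}
man_b = {"S1", "S2", "S6", "S7", "S8", "S9"}

estimate = mutations_since_mrca(man_a, man_b)
print(estimate)  # 3 private SNPs on one line, 4 on the other
```

Nothing in this calculation depends on how many other descendants have been tested, which is the source of the puzzlement.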

RobertCasey
06-13-2018, 02:24 PM
I am intrigued but baffled by this observation. Is this a consequence of imperfect coverage by any single Big Y test, with the increased number of identified SNPs in a given line with more men tested due to the higher likelihood of two men sharing a region of the Y chromosome usually underrepresented by Big Y testing? Surely if coverage were perfect and complete, the number of SNP differences between any random pair of men would, after dividing by two, be an unbiased estimate of the number of mutations in the time since their common ancestor so long as there are no back mutations. The estimate in that case should not systematically decrease by adding more descendants of that ancestor, hence my interest to understand this better.

Years per YSNP is based only on shared YSNP mutations that define branches; the numbers of private YSNPs and branch equivalents are not in play. Every six months or so, I attempt to determine the TMRCA values. As more and more intermediate branches between 1,000 and 1,500 years are discovered via a larger sample size of Big Y testers (and increased numbers of branches under L226), the more I have to reduce the years per YSNP for two reasons: 1) there are now many more intermediate branches that have been discovered between 1,000 and 1,500 years; 2) there are now many more surname clusters found that are around 1,000 years of age that require YSNP branches not to have estimates younger than 1,000 years.

As the sample size grows for Big Y testers, the number of branches continues to increase. Many times this involves branch equivalents becoming branches, which adds another branch in their path that is older than 1,000 years. It also includes new YSNPs that create intermediate YSNP branches. So the growing number of branches between 1,500 years (the age of L226) and 1,000 years (where an increasing number of surname clusters are being added) results in having to continually decrease the years per YSNP.

Here is an example. Previous evaluation (45 branches) - L226 > FGC5660 > Z17669 > DC63 > DC29 > DC30 > DC189. We knew that DC189 was older than 1,000 years due to the mixture of surnames: 6 McNamara, 8 Bryan, 2 Small, 2 Davis, 2 O'Neil and 6 other surnames. Obviously, we are getting close to 1,000 years with 8 Bryans and 6 McNamaras - but having nine other surnames implies it is just over 1,000 years. Plus the Bryans are tightly clustered together. DC189 was estimated to be 1,500 - (6 x 70) = 1,080 years old, which is pretty close.

Fast forward to the present analysis (78 branches) - L226 > FGC5660 > Z17669 > DC63 > DC29 > DC367 > DC31 > DC30 > DC189 > DC191 > DC340. The Bryans were split off to their own branch, but DC191 remains older than 1,000 years due to surnames: 5 McNamara, 2 Davis, 2 O'Neil and 4 other surnames. So using 70 years per YSNP, we get the following estimate for DC191 = 1,500 - (9 x 70) = 870 years old. This is younger than what the surnames imply, which is around 1,060 years. Even with the new 60 years per YSNP, this path comes out a little young: 1,500 - (9 x 60) = 960. Allowing for some statistical variation, the 60-year estimate is much closer, but 870 is just way too young. For this particular path, 50 years per YSNP would be more appropriate: 1,500 - (9 x 50) = 1,050, which is very close to 1,060 years (two generations before surnames).
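
The arithmetic in this example can be written out as a short sketch. The function names are hypothetical; the ages, step count and candidate rates are the ones given above. The second function simply inverts the formula to show which years-per-YSNP value would hit the surname-implied date exactly for this one path.

```python
def branch_age(haplogroup_age, snp_steps, years_per_snp):
    """Age of a branch: parent haplogroup age minus the YSNP steps below it."""
    return haplogroup_age - snp_steps * years_per_snp

def implied_years_per_snp(haplogroup_age, target_age, snp_steps):
    """Years per YSNP that makes the branch land exactly on target_age."""
    return (haplogroup_age - target_age) / snp_steps

L226_AGE = 1500
# Path below L226 down to DC191 in the 78-branch analysis.
path = ["FGC5660", "Z17669", "DC63", "DC29", "DC367",
        "DC31", "DC30", "DC189", "DC191"]
steps = len(path)  # 9 YSNP steps

for rate in (70, 60, 50):
    print(f"{rate} years per YSNP -> DC191 is {branch_age(L226_AGE, steps, rate)} years old")

# Surnames suggest roughly 1,060 years for DC191.
print(f"implied rate: {implied_years_per_snp(L226_AGE, 1060, steps):.1f} years per YSNP")
```

This is a fit to a single path, so the implied rate only shows why 50 looks better than 60 or 70 here; other paths under L226 may pull the calibration in a different direction.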

neanderling
06-13-2018, 05:37 PM
the more I have to reduce the years per YSNP for two reasons: 1) there are now many more intermediate branches that have been discovered between 1,000 and 1,500 years; 2) there are now many more surname clusters found that are around 1,000 years of age that require YSNP branches not to have estimates younger than 1,000 years.

So the reason for the dependency on sample size is that more samples increases the likelihood of finding two men who descend from different brothers of a very early user of a given surname. If you have only two men with a surname and don't have the paper trail that connects them, perhaps they actually have a common ancestor from 300 years ago, but you can wonder if the common ancestor was 800 years ago. Perhaps the fifth man tested with that surname connects from an earlier branch, demonstrable by SNPs, making it no longer possible to consider the connection of the first two men to have been so ancient. Once you've got men from the earliest still extant split of the surname (though you never know when that has happened without a full paper trail), you have the best estimate that you can achieve. The other element here is having a good estimate of when surnames were adopted, which is what provides the number of years.

I wonder if the high rates that you are seeing may also be providing clues as to breakdowns in the assumptions that go into mutation rate estimates. Specifically, to convert SNPs into years rather than into generations, you have to assume either that the higher mutation rate in older fathers conveniently cancels out the number of extra years elapsed (e.g., that the mutation rate in a 72 year old father is twice that of a 36 year old father and 4 times that of an 18 year old father) or that the distribution of paternal ages over the course of a few hundred years doesn't vary much from one lineage to another. I can imagine that a pedigree who consistently married early would vastly outnumber a pedigree that consistently married late after several hundred years, so there is a selection bias towards overestimating the mutation rate if you don't have a paper trail and have to base the analysis on surnames and the date of surname adoption. If the increased mutation rate in older men falls short of compensating for the decreased number of generations ("older" here might just mean age 30 versus age 18), this would skew even more severely.
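
The cancellation condition can be illustrated with a toy model. Assume, purely for illustration, that the expected number of new SNPs per generation is a fixed baseline plus an amount proportional to the father's age; the function name and all parameter values below are hypothetical.

```python
def snps_per_year(father_age, base_per_generation, extra_per_year_of_age):
    """Per-year SNP rate on a line whose fathers all reproduce at father_age,
    under a linear paternal-age model (hypothetical parameters)."""
    per_generation = base_per_generation + extra_per_year_of_age * father_age
    return per_generation / father_age

ages = (18, 36, 72)

# Case 1: rate strictly proportional to age (72 = 2 x 36 = 4 x 18).
proportional = [snps_per_year(a, 0.0, 0.01) for a in ages]

# Case 2: the same age effect plus an age-independent baseline.
with_baseline = [snps_per_year(a, 0.2, 0.01) for a in ages]

for a, p, b in zip(ages, proportional, with_baseline):
    print(f"age {a}: proportional {p:.4f}/yr, with baseline {b:.4f}/yr")
```

In the proportional case the per-year rate is the same at every age, so SNP counts convert cleanly to years. With any baseline component, lines of younger fathers accumulate more SNPs per year, which is the skew described above.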

If there are systematic cultural variations in reproductive age, this does not bode well for a universally valid estimation of the date of the most recent common ancestor based on SNP counts within the genealogically relevant time frame, no matter how many men are tested. The more reliable estimates will necessarily derive from early branches that have certain paper trail documentation and SNP results from known descendant lines and from an understanding of the dates and patterns of surname adoption.

RobertCasey
06-13-2018, 07:00 PM
Specifically, to convert SNPs into years rather than into generations, you have to assume either that the higher mutation rate in older fathers conveniently cancels out the number of extra years elapsed (e.g., that the mutation rate in a 72 year old father is twice that of a 36 year old father and 4 times that of an 18 year old father) or that the distribution of paternal ages over the course of a few hundred years doesn't vary much from one lineage to another. I can imagine that a pedigree who consistently married early would vastly outnumber a pedigree that consistently married late after several hundred years, so there is a selection bias towards overestimating the mutation rate if you don't have a paper trail and have to base the analysis on surnames and the date of surname adoption. If the increased mutation rate in older men falls short of compensating for the decreased number of generations ("older" here might just mean age 30 versus age 18), this would skew even more severely.

If there are systematic cultural variations in reproductive age, this does not bode well for a universally valid estimation of the date of the most recent common ancestor based on SNP counts within the genealogically relevant time frame, no matter how many men are tested. The more reliable estimates will necessarily derive from early branches that have certain paper trail documentation and SNP results from known descendant lines and from an understanding of the dates and patterns of surname adoption.

I used to think years per generation was an issue. But in my calculations, this is not a factor. There are two assumptions: 1) the accuracy of the TMRCA date of your particular haplogroup - but this methodology is pretty well established and accepted (with the exception of how people deal with massive numbers of branch equivalents, which are common). For L226, we used to have this exposure, but with two recent Big Y tests, 80 % of the L226 equivalents have moved up to newly discovered branches just above L226; 2) the dating of surname clusters based on surname diversity. Surname adoption may be a little older or somewhat younger for different surnames, and this extra 100 years makes a difference. Also, what are the acceptable NPE rates for surname clusters? I have made a stab at using a variable surname cluster dating based on the number of surnames involved and the percentage of each surname. I am still playing around with varying the 1,000 year date between 900 and 1,100 years based on how many surnames are involved (there could be an early NPE event) and what the percentage of the main surname is (obviously a cluster with 100 % one surname is probably more recent than one with 70 % of one surname).

The age at reproduction and the average age per generation are minor factors. I think with 150 read length NGS testing, the real YSNP rate will approach one mutation per generation with enough testing. At 2,000 read length, this then becomes two YSNP mutations per generation. So all we need is 1000X testers and the true YSNP rate will be revealed. Another factor is the usage of YSNPs in complex areas and of inserts/deletes. This increases the years per YSNP by 20 or 30 percent or more. My analysis includes these - YFULL does not include these in their estimates.

CillKenny
06-14-2018, 08:58 AM
In terms of dating surnames I think it is fair to believe that they were first adopted by the then elite and only a few generations after that by the branches that split off earlier. In the case of L226 I think the O'Brien surname would have come before many of the others. I see this in my own group [Z255/S219 > Z16429 > BY519 > ZZ7 > DYS435=12 > Z16430] where the Byrne surname seems to emerge earlier than the disparate others under Z16430.

RobertCasey
06-14-2018, 02:09 PM
In terms of dating surnames I think it is fair to believe that they were first adopted by the then elite and only a few generations after that by the branches that split off earlier. In the case of L226 I think the O'Brien surname would have come before many of the others. I see this in my own group [Z255/S219 > Z16429 > BY519 > ZZ7 > DYS435=12 > Z16430] where the Byrne surname seems to emerge earlier than the disparate others under Z16430.
We know that many O'Briens are direct descendants of King Brian Boru (born 941). So it may be off one or two generations for this royal line. But they did not use surnames until several generations later - and probably many of his descendants took other surnames by the time that surnames were used. However, Sir Conor O'Brien, who is a direct descendant of King Brian Boru, has been YDNA tested. He is a proven descendant of King Brian Boru since one of his titles is "Chief of the O'Briens." This title has been formally passed down for 40 generations and documentation of this title is still available today.

We now have ten YSNP branches that are associated with this very large surname cluster. Dating of Y5610 is 1,080 years old - pretty close to the time of King Brian Boru: 1,500 - (7 x 60) = 1,080. Another major branch, DC36, which has large numbers of O'Briens (60 %), may later be proven to be part of this line as well (brother of Y5610). However, many testers who are not YSNP tested have been predicted DC36, and further YSNP testing could move these predicted DC36 testers to other non-DC36 branches.

The royal O'Brien surname cluster has an amazing ten YSNP branches associated with it. It also has another twelve YSTR branches for a total of 22 branches that divide up this surname cluster. The members of this surname cluster really want to get good time estimates, but years per YSNP gets less and less accurate the more recent you get due to lack of sample size. Their lowest branch (Y44000) is 12 levels below L226, giving it a date of 780 years ago - this estimate is way too old due to lack of sample size. During each iteration, they add more and more branches to their surname cluster - but they do not understand that other older lines become more tested as well, and every six to twelve months I have to reduce the years per YSNP (this time reducing from 70 years per YSNP to 60 years per YSNP).
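The arithmetic behind the two dates above can be sketched in a few lines. This is only an illustration of the straight "years per YSNP" subtraction, using the 1,500-year age for L226 and the 60 years per YSNP quoted in this post; the function name is made up for the example, and the 7 is taken from the Y5610 formula above on the assumption that it counts YSNP levels below L226.

```python
def branch_age(parent_age_years, levels_below, years_per_ysnp):
    """Estimate a branch age by subtracting a fixed number of years per YSNP level."""
    return parent_age_years - levels_below * years_per_ysnp


# Y5610: the 7 from the formula above, assumed to be levels below L226
print(branch_age(1500, 7, 60))

# Y44000: 12 levels below L226
print(branch_age(1500, 12, 60))
```

Because the subtraction is linear, every reduction in years per YSNP pushes all of the lower branches older at once, which is why the deepest branches are the most sensitive to the rate.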

The only alternative for dating would be YSTR based - but the statistical variation and accuracy for these estimates would be very problematic. So as they add more YSNP branches and get more levels, the time estimates do not change, because the years per YSNP gets lowered as more levels of branches are added. But YSTR estimates below YSNP dating are probably better than just pure speculation - but not by too much.

RobertCasey
04-13-2020, 12:28 AM
I don't see how greater sample size = higher observed mutation rates. We should see a regression toward the mean. As you point out, we'll see random statistical variation for smaller sample sizes, but shouldn't they be scattered equally between high and low outliers? I take it you agree they are not.

This is a reference to YSNP mutation rates. For R-L226, I calculate the actual years per YSNP mutation from R-L226 (1500 YBP) and Irish surname clusters (1000 YBP). As more intermediate YSNP branches are discovered, the average number of years per YSNP branch continues to decline. It started out with an average of 95 years per YSNP branch but is now down to 70 years per YSNP branch. The most extreme path to a surname cluster is 48 years per YSNP branch. There are a lot of currently unknown YSNP branches being revealed over time, as we have only scratched the surface, and Big Y700 is definitely decreasing the years per YSNP as more of the YCHR is now being analyzed.
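A rough sketch of why the average keeps dropping: the time span between R-L226 and the surname clusters is fixed at the two dates given above, so each newly discovered intermediate branch divides the same 500 years into smaller steps. The branch counts in the loop are hypothetical, chosen only to show the trend, and the function name is made up for the example.

```python
def years_per_ysnp(haplogroup_age, cluster_age, n_branches):
    """Average years per YSNP branch on a path from the haplogroup to a surname cluster."""
    return (haplogroup_age - cluster_age) / n_branches


# Hypothetical numbers of YSNP branches found along one path over time
for n in (5, 7, 10):
    print(n, round(years_per_ysnp(1500, 1000, n)))
```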

I would like to use charting to calculate Y500 and Y700 (as well as Y111) mutation rates. But currently, if I only use YSNP TMRCA dates, around half of the YSTR mutations are below terminal YSNP branches. Until YSNP testing reaches more recent time frames, the YSTR mutations under terminal YSNPs that I would have to use span too large a time frame to be that accurate. Also, 75 % of all charting branches are YSTR-only branches. This is with 859 Y67 or higher testers, 261 Big Y testers (over half are Big Y700) and 156 branches under R-L226.

Sorry, I did not see your post from around one year ago - I only just found it from a link in a discussion of mutation rates (Rox2 discussion).