
View Full Version : STR Wars, GDs, TMRCA estimates, Variance, Mutation Rates & SNP counting




TigerMW
04-21-2013, 04:44 AM
I'm opening this thread as a place to discuss and catalog information on using Y STR and Y SNP information to try to calculate aging within R haplogroups.

TigerMW
04-21-2013, 04:49 AM
Although I think the Law of Large Numbers can outweigh problems with individual STRs, it makes sense to recognize that NOT all STRs behave similarly. We are generally interested in those that can help us estimate the time to the most recent common ancestor (TMRCA).

If some are not good at that and we have enough alternatives it makes sense to me to use the alternatives.
Steve Bird at Texas State Univ. wrote this paper: "Towards Improvements in the Estimation of the Coalescent: Implications for the Most Effective Use of Y Chromosome Short Tandem Repeat Mutation Rates", 2012.
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0048638

He evaluates Y STRs on how well their variance grows linearly with time.

MJost
05-02-2013, 03:34 AM
A discussion is under way on the Yahoo L21 board, under "AlexWilliams 111 marker SNP based Haplotype PhyloTree", about the mutation rates that Anatole Klyosov uses, which are calibrated for 25 years per generation. One person converted AK's mutation rate to a 30-years-per-generation figure. The choice between 25 and 30 years per generation was also discussed. One poster believed 30 years should be used for the last 1000 years and 25 before that.

My understanding is that a mutation rate is calculated from the number of transmissions that occur before an STR mutation happens. For example, it is estimated that a mutation will occur only once every 500 transmissions (birth events) for a single Y-DNA STR marker, or roughly an overall mutation rate of 0.2%, a debated rate for the genetic mutation clock.

Anatole Klyosov uses several methods to produce ages based on a 25-years-per-generation mutation rate.
http://www.jogg.info/52/files/Klyosov1.pdf

Chandler has posted his own set of calculated mutation rates. His paper is found at: http://www.jogg.info/22/Chandler.pdf

Marko Heinila produced his own more recent mutation rates in May 2012 using Chandler's methods. He has a link to his 111-marker rates near the bottom of this web page.
https://dl.dropboxusercontent.com/u/50201824/old/jsphylosvg_trees.html

Marko Heinila's results are based on about 4,000 111-marker samples. He used an estimation process in which each haplotype pair was treated as an independent random draw from a model distribution. The model distribution gives the expected ratio of mismatches to matches at a given marker among pairs that share a given number of matching markers overall. The pair data was then used to solve for the mutation rates. He said that this is the same idea as in Chandler's paper on mutation rate estimation.

Ken Nordtvedt chose to use Heinila's 2012 mutation rates in his 111t Generations spreadsheet, and I kept them in my TMRCA Estimator spreadsheet as well.

We have more recent calculations that show more realistic transmission rates.

Recalculated using Marko Heinila's 2012 mutation rates


#Markers  Transmissions  BirthEvents  GenYrs=25.0  GenYrs=30.0

12                  495         41.3      1,031.3      1,237.5
25                  413         16.5        413.0        495.6
37                  280          7.6        189.2        227.0
67                  388          5.8        144.8        173.7
111                 382          3.4         86.0        103.2

12-mcm              556         61.8      1,544.4      1,853.3
25-mcm              428         26.8        668.8        802.5
37-mcm              319         13.3        332.3        398.8
67-mcm              452          9.0        226.0        271.2
111-mcm             411          4.4        109.3        131.2


#Mkrs     MarkoHCumlRate  perMarkerRate
12                0.0242         0.0020
25                0.0605         0.0024
37                0.1323         0.0036
67                0.1728         0.0026
111               0.2907         0.0026

12-mcm            0.0162         0.0018
25-mcm            0.0374         0.0023
37-mcm            0.0747         0.0031
67-mcm            0.1107         0.0022
111-mcm           0.2285         0.0024
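
The columns of the first table appear to be related by simple arithmetic. This is one reading of the figures, not something stated in the post: BirthEvents matches Transmissions divided by the number of markers compared, and the GenYrs columns match BirthEvents multiplied by the years per generation. A minimal Python sketch under that assumption (the function names are hypothetical):

```python
def birth_events_per_mutation(transmissions_per_marker, n_markers):
    """Average birth events before any one of n_markers is expected to mutate."""
    return transmissions_per_marker / n_markers


def years_per_mutation(transmissions_per_marker, n_markers, years_per_generation):
    """Convert birth events per mutation into years for a chosen generation length."""
    events = birth_events_per_mutation(transmissions_per_marker, n_markers)
    return events * years_per_generation


# Row "12" of the table: 495 transmissions per marker, 12 markers compared.
print(birth_events_per_mutation(495, 12))   # 41.25
print(years_per_mutation(495, 12, 25.0))    # 1031.25
print(years_per_mutation(495, 12, 30.0))    # 1237.5
```

The same arithmetic reproduces the other plain rows, for example 280 / 37 * 30 is about 227.0 for the 37-marker row.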

MJost

MJost
05-02-2013, 04:51 PM
Concepts overview: calculating the TMRCAs of a group of haplotypes in my TMRCA Estimator spreadsheet.


Intraclade means 'within a clade'. A clade is derived from a common ancestor within a higher-level grouping of a genetic haplogroup, such as M222, and includes those haplotypes that are known to have positive test results.


Technically, two things are calculated from a clade (haplotype) dataset: population variance and sample variance, which are used to calculate the Coalescence and Founder's Modal intraclade generation ages respectively. The sum of each type of variance is then divided by the sum of the mutation rates to give a generation (MRCA) age.
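
The calculation described above can be sketched in Python as follows. This is a minimal illustration, not the spreadsheet's actual formulas: the haplotypes and mutation rates are made up, and the function name is hypothetical.

```python
from statistics import pvariance


def coalescence_generations(haplotypes, mutation_rates):
    """Sum of per-marker variances divided by the sum of per-marker mutation rates."""
    columns = list(zip(*haplotypes))   # one tuple of repeat counts per marker
    total_variance = sum(pvariance(col) for col in columns)
    return total_variance / sum(mutation_rates)


# Made-up data: 4 haplotypes x 3 markers, illustrative rates per generation.
haplotypes = [
    (13, 24, 14),
    (13, 25, 14),
    (14, 24, 14),
    (13, 24, 15),
]
rates = [0.002, 0.003, 0.0025]

generations = coalescence_generations(haplotypes, rates)
print(generations, "generations,", generations * 25, "years at 25 years/generation")
```

Multiplying the generation count by 25 or 30 gives the age in years for the chosen generation length.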


Furthermore, when estimating the variance, the dataset used is technically a sample of the population space. Coalescence treats the data as a small population that is assumed to be close to the actual population, whereas the Founder's Modal section is an adjusted sample that represents the entire population.


The Coalescence whole-population (n) generation age is biased. The Coalescence sample (n-1) generation age is a corrected generation age that gives a 'true' unbiased result.


To explain the bias: this method of Coalescence estimation is close to optimal, with the caveat that it underestimates the variance by a factor of (n-1)/n. (For example, when n = 1 the variance of a single observation is obviously zero regardless of the true variance.) This bias should be corrected when n is small by multiplying by n/(n-1). This is why the Coalescence whole-population age is less than the Coalescence sample-population age.
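
The n versus n-1 distinction can be checked with Python's standard library. The repeat counts below are made up, and the mapping to Excel's VARP and VAR functions is an assumption about how the spreadsheet computes the two figures.

```python
from statistics import pvariance, variance

values = [10, 11, 11, 12, 11]    # made-up repeat counts at one marker
n = len(values)

var_n = pvariance(values)        # divides by n     (like Excel VARP)
var_n1 = variance(values)        # divides by n - 1 (like Excel VAR)

# The sample variance equals the population variance times n/(n-1).
corrected = var_n * n / (n - 1)
print(var_n, var_n1, corrected)
```

With only 5 haplotypes the correction factor is 1.25; with 500 it is about 1.002, which is why the correction matters mostly for small groups.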


My TMRCA spreadsheet can produce individual statistical variances, which should show a generational point where all haplotypes meet their common ancestor (think of the first two Coalescence ages, each of which is a variance; think of variance as counting fractional mutations, sort of).

I report three intraclade variance results to produce an estimated Most Recent Common Ancestor (MRCA) age:

Coalescence Age = Variance of Whole Population (n) (near to KenN's original Coalescence age using Varp functions)

Coalescence Age = Variance of Sample Population (n-1) (sample variance)

Founder's Modal Age Variance (using Ken's formula for Modal Method)

Use Coalescence(n) for close families where all family members below the MRCA node are known.

Use Coalescence(n-1) for groups with unknown or missing lineages below the MRCA node of the applied set of haplotypes (most runs).

Use Modal for the Founder's Age. The Founder's Age will be older than the Coalescence (n-1) age, since there are usually missing lineage branches and/or generations without mutations, and the haplotype markers are not 100 percent represented.


An interclade MRCA age is calculated for the last two results above [(n-1) and Modal] between the two clades studied, pointing to an MRCA age from each clade's node using a statistical pooled standard deviation method.
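
For reference, the textbook pooled standard deviation of two samples is sketched below. How the spreadsheet feeds clade ages into it is not described here, so the inputs are made up and the function name is hypothetical.

```python
from math import sqrt


def pooled_std(n1, s1, n2, s2):
    """Pooled standard deviation of two samples with sizes n1, n2 and SDs s1, s2."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


# Made-up example: two clades of 30 and 50 haplotypes with SDs of 10 and 14 generations.
print(round(pooled_std(30, 10.0, 50, 14.0), 2))
```

The larger clade carries more weight, so the pooled value lands closer to 14 than to 10.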


MJost

MJost
05-03-2013, 02:17 PM
I posted some TMRCAs on the Yahoo 1113 Combo forum, and poster Daryl raised some questions and skepticism about TMRCAs. I will reply here in this thread, as suggested by MikeW.

Daryl,

As I have always stated, I am not a math expert. But yes, I agreed with you when you said in a previous post that "TMRCA calculations are mostly speculative", and I said the results are all about their relevance. These are not error rates, as you pointed out, but only statistical probabilities. Let's review.

In probability theory and statistics, the variance is a measure of how far a set of numbers is spread out. For a normal distribution, about 68.27% of outcomes fall within 1 sigma (one standard deviation) of the mean.


Look at this chart, which shows the normal distribution curve and illustrates standard deviations. Each band is 1 standard deviation wide.

https://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg

The standard deviation is an important reference, because we can say that any generation value calculated is:
• likely to be within 1 standard deviation (68 out of 100 will be)
• very likely to be within 2 standard deviations (95 out of 100 will be)
• almost certainly within 3 standard deviations (997 out of 1000 will be)
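
These three figures follow from the normal distribution and can be reproduced with the error function in Python's standard library (the function name `coverage` is just for illustration):

```python
from math import erf, sqrt


def coverage(k):
    """Probability that a normally distributed value lies within k SDs of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, f"{coverage(k):.4f}")   # 0.6827, 0.9545, 0.9973
```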


I have an option in my spreadsheet to set the confidence level to any value, to evaluate the range of generations within which the MRCA falls at that confidence. In other words, at a 99.73% probability (3 standard deviations) the generation estimate for the sample falls between x and y generations. Confidence intervals (CI) indicate the reliability of an estimate. They consist of a range of values (an interval) that act as good estimates of the unknown population parameter.

The “Variance Method” (Slatkin, 1995; Stumpf, 2001) assumes that the variance (average-squared-distance from ancestral value) of each STR marker in a large population, is proportional to the TMRCA of that population.

Ken Nordtvedt has implemented variance in his Generations spreadsheet calculations, and the generation calculations are very close to each other.

Please note that Ken explains Variance Sigma (Standard Deviation) Concepts on his website.

http://knordtvedt.home.bresnan.net/Sigma%20for%20Variance.pptx

Yes, Statistically Relevant.

MJost

TigerMW
05-14-2013, 03:38 PM
I'm copying this over from another thread so we don't bog that one down. For some people this might be interesting, so I'll continue the conversation on estimating ages using Y STRs and some of the vagaries and benefits thereof.


Just a reminder - Anatole Klyosov calculated the age of E-V13 at 1000 BC and yet E-V13 has been found in a 5000 BC sample from Spain. Therefore, I don't know that we can take his dating techniques all that seriously. I do think that there is a small value, however, in showing that subclade A is older than subclade B, with the understanding that founder effects could be in play.


Not always, but Klyosov's TMRCA calculations for R1b are usually in line with others for R1b, so I don't think we need to distrust them just because they are from Klyosov. On the other hand, his error ranges are quite narrow compared to most alternative methods. I just ignore his error ranges.

I'm not sure the E-V13 ancient DNA is a good example to assess the Klyosov methodology's technical correctness. TMRCA intraclade estimates like Klyosov's only represent the MRCA (Most Recent Common Ancestor) for the remnant population. The E-V13 MRCA (single man) for those surviving may have very little to do with the ancient DNA E-V13 man found other than some very, very ancient connection. Of course, all of this is what drove Dienekes nuts, particularly as it relates to geographical differentiation.

However, problems with intraclade calculations can be mitigated by doing multiple phylogenetically comparable calculations and using interclade methods like what Ken Nordtvedt developed.

If we want to go deeper into the topic of methodologies we should probably go over to this thread designed for that purpose.


I am well aware of the limitations and will pass on going deeper into the topic, but I only mention it because I want people to be aware that Anatoly's dates only represent the successful living ancestors and are probably fraught with the reduced age of successful founder effects. It in no way means that we will not find R1b hundreds or even thousands of years older than his numbers.

I wasn't intending to slight anyone's understanding of the situation, so sorry if I sounded condescending. For people just catching up or tuning in, I just wanted to point out that Klyosov's methodology has nothing uniquely wrong with it although it suffers the same maladies as any Y STR based age estimation technique.

Probably the best and most fun initiation into this is Dienekes's blog entry here.
http://dienekes.blogspot.com/2011/08/y-str-variance-of-busby-et-al-2011.html

Here is the kick-off of the fun part. You have to scroll down to the comments.

From now on I am going on a Y-STR boycott on this blog. Y-STRs still have their obvious uses, for recent genealogy, or forensics. They may also convey some information about human prehistory in the broadest time scales.


Excellent, Dienekes. I truly appreciate your boycott. It means that one more person who understands nothing in the area, is out.
:biggrin1:

MJost
05-15-2013, 01:31 AM
The “Variance Method” (Slatkin, 1995; Stumpf, 2001) assumes that the variance (average-squared-distance from ancestral value) of each STR marker in a large population, is proportional to the TMRCA of that population.

Ken Nordtvedt has implemented variance into his Generations spreadsheet calculations. Please note that Ken explains Variance Sigma (Standard Deviation) Concepts on his website.

http://knordtvedt.home.bresnan.net/Sigma%20for%20Variance.pptx


Edit repost

TigerMW
05-16-2013, 03:17 AM
This may seem a little off track, but bear with me. This is about understanding MRCAs....

What's the value of a haplogroup?

What's the value of an SNP?

You might be surprised to hear me say this but I think there is very little value in haplogroups and SNPs
... at least in and of themselves.

A haplogroup is just a group of people with a common ancestor.

An SNP is just a single nucleotide polymorphism, a mutation, that marks the group of people with a common ancestor. It is just a signpost on a branch of the human family tree. The true nature of the haplogroup of people, any commonality in culture, location, etc., may not align with the SNPs that have marked the lineages. The SNP could mark either a subset or superset of the true group of people we care about.

This gets into some notions about value and philosophical concerns, but these are the points I'm getting at.

1) I do not care too much about all of the extinct lineages of mankind. There are many, many extinct lineages. On the Y chromosome/paternal side probably there are many, many more lineages that have gone extinct than those who survive.

2) I do care about how we got here and how, where, when and why they did what they did to get us to where we are today.

I think these notions are just conveying that what many hobbyists may care about most is the connection to genealogy and deeper ancestry.... and specifically our ancestry.

The net is that the most recent common ancestors (MRCAs) of the various branches remaining today (and in recent history) are critical people to try to understand. The better we understand more MRCAs at more layers and branches in the tree, the more chance we have to understand our ancestry.

I am not saying that all of the old extinct lineages were not important people or that SNPs are useless. I'm just trying to say they are most important in how they help us understand who we are and how we got here. They are just bread crumbs from an old trail.

Superconducting supercolliders smash atoms and look at the residue of the accident to try to get more detail on the characteristics of the atom. In the case of genetics; the accidents, bottlenecks, growth spurts, etc. have already taken place but, likewise, we are looking at the residue to try to ascertain what happened.

I don't care when an SNP first occurred. I care about the expansion and movements of my ancestry. The SNP-marked haplogroup ages may help put a maximum age in place for my ancestry. That's good, but it's not really the haplogroup I'm after.


P.S. Science may be interested in who the genetic Adam was or wasn't and some other things. That's fine with me but I'm really after understanding how we, the survivors, got here.

TigerMW
05-23-2013, 03:53 PM
I'm moving the following post here (into "STR Wars...") as it is on a generic topic of TMRCA, STR issues and concerns. This type of discussion can bog down any thread, so let's keep it here and refer to it as applicable to other discussions.


TMRCA estimates are subject to many vagaries. However, for R1b we now have several studies' worth of data, a lot of long haplotypes and interclade TMRCA methodologies. I think we have robust enough data, broken up by the SNP defined phylogeny, that we can have robust estimates. The issues are the methodologies themselves, or actually, more the STR mutation rates. The SNP methods need maturity in the coverage of the Y chromosome. Posted on the R1b Early Subclades subcategory phylogeny thread are several TMRCA estimates from different folks and methodologies that find essential agreement.

I would never say they are precise. Still the relative nature of the timing within the phylogeny along with the geographical distribution is robust.


TMRCA estimates are precisely what the definition says: "time to the most recent common ancestor". These estimates are based on the use of STR values and have nothing intrinsically to do with SNPs and when they occur. E.g., Clan Gregor had a founder c. 1350 with an apparent mutation from 11 to 10 at 385a. It turns out that an SNP mutation, L1065, defines the clan along with many other clans some 300 to 500 years before. Now there may be a more defining SNP mutation, but it hasn't been found yet. So, right now, we infer SNP sequences in part from STR data. This may be flawed, because some subset of a group may have had more advantageous conditions for procreation and out-produced other lines. Relative timing based on the apparent success of lines may not be robust?

TigerMW
05-23-2013, 04:14 PM
TMRCA estimates are precisely what the definition says: "time to the most recent common ancestor". These estimates are based on the use of STR values and have nothing intrinsically to do with SNPs and when they occur. E.g., Clan Gregor had a founder c. 1350 with an apparent mutation from 11 to 10 at 385a. It turns out that an SNP mutation, L1065, defines the clan along with many other clans some 300 to 500 years before. Now there may be a more defining SNP mutation, but it hasn't been found yet. So, right now, we infer SNP sequences in part from STR data. This may be flawed, because some subset of a group may have had more advantageous conditions for procreation and out-produced other lines. Relative timing based on the apparent success of lines may not be robust?

This can be mitigated by use of intraclade age estimates within known related groups, as defined by SNPs, and then comparing those estimates across a known tree of SNP based subclades. This is what Ken Nordtvedt's interclade TMRCA estimates are all about.

We also see other non-STR methods coming on-line. The 2008 Karafet study used a scientific sampling of Y chromosome SNPs to estimate ages. They estimated the R1 TMRCA, which is ancestral to R1b and R1a, as 18.5K ybp. This fits nicely with what the common (hobbyist and FTDNA) TMRCA estimation methods are getting for R1b subgroups, so there is some apparent corroboration of STR based methods from this "novel" (Karafet's word) SNP method.
"New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree" by Karafet et al., 2008. The et al. in this case includes Michael Hammer, FTDNA's Chief Scientist.

MJost
05-26-2013, 08:24 PM
In another thread, R1b Early Branching Phylogeny (SNP based family tree), I posted some variance numbers for L23 and below, using a limited subclade depth. For example, "L23 w/ only L51" was L23 with just L51 included and no other subclades. The SNPs labeled with # include only the first subclade below the specified clade, for example L23 and L51 but not L11, with no second, more recent subclades.

I chose to use a sliding window variance method for the following reasons. Including only one subclade below, and not the larger subclades three or more levels below, seems to limit the overall saturation that would wash out the older HTs (ratios) and give a younger age. If starting with a small ancestral clade that was much younger, it might be less affected.

MJost

TigerMW
06-03-2013, 06:34 PM
Rathna has some interesting principles that imply STR variance methods may not work as expected. Since they are really more about STRs, this would be the right thread to go into depth into such a discussion.

In another thread, he commented to the effect that a GD=27 at 37 markers was possibly misleading, in other words, falsely estimating a very distant age.

But I’d add my three principles, no one took in consideration. Mutations happen 1) around the modal 2) there is a convergence to the modal as time passes 3) sometime a mutation goes for the tangent. The combination of these principles makes that Nazarov has a GD of 27 out 37 as to you and others much less, but only because Nazarov belongs to a different line of R-L23 and others to your same line.
...

Any two haplotypes may contain anomalies and do not necessarily represent their subclades well, so TMRCA estimates between just two people are subject to very large error ranges.

On the other hand, I don't necessarily understand or agree with Rathna's principles, other than that there is no doubt that STRs can mutate in either direction, so they can diverge off on a tangent or converge/back-mutate towards the ancestral. I probably don't understand how his principles should be applied though.

R.Rocca
06-03-2013, 06:43 PM
Rathna has some interesting principles that imply STR variance methods may not work as expected. Since they are really more about STRs, this would be the right thread to go into depth into such a discussion.

In another thread, he commented to the effect that a GD=27 at 37 markers was possibly misleading, in other words, falsely estimating a very distant age.


Any two haplotypes may contain anomalies and do not necessarily represent their subclades well, so TMRCA estimates between just two people are subject to very large error ranges.

On the other hand, I don't necessarily understand or agree with Rathna's principles, other than that there is no doubt that STRs can mutate in either direction, so they can diverge off on a tangent or converge/back-mutate towards the ancestral. I probably don't understand how his principles should be applied though.

I seem to recall (I think) a study that stated that STRs do converge around a modal value, but tend to go up slightly in value over a very long period of time.

MJost
06-03-2013, 07:02 PM
I once studied this (lightly), and it appeared that up versus down mutations ran two to one. But since then I have realized that sub-branching and sheer volume can skew this determination. This is where AK is probably correct in using phylo trees to break out these types of branches. Then again, large numbers can smooth out these mutations, and, MikeW, using off-modal clusters can provide the same information for following descendant branching and tracing the mutation directions out to the twigs.

MJost

TigerMW
06-03-2013, 08:12 PM
I seem to recall (I think) a study that stated that STRs do converge around a modal value, but tend to go up slightly in value over a very long period of time.

I don't remember that, per se, but there is a paper that shows that a particular type of STR, typically with high STR counts, i.e. DYS449 and CDY, tend to get their STR values stuck at a high value, which could turn out to be a modal for some sub-group.

Marko Heinila's comment on this was that mutation rates actually increase as you reach the high end of some of these STRs. However, the increase is on the down side, so there are more back mutations, giving the "getting stuck" effect.

These are the kinds of things that Marko, Busby and Bird tried to assess and filter out by using linear (with time) duration models to detect badly behaving STRs. Generally speaking, multi-copy STRs are thrown out right at the start in the more current hobbyist based TMRCA estimation methods. When I do simple GD calculations, I use modified infinite allele methods for the multi-copy STRs. This effectively tones down their jumpiness which is eliminated by those who throw out certain STRs in their estimates.
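
To illustrate the difference in counting, here is a minimal Python sketch of a genetic distance (GD) that uses stepwise counting for ordinary markers and plain infinite allele counting (any difference counts as 1) for the markers named as multi-copy. The exact "modified" method is not spelled out in this thread, so this is an assumption, and the haplotype values are made up.

```python
def genetic_distance(hap_a, hap_b, infinite_allele_markers=()):
    """GD between two haplotypes given as {marker: repeat count} dicts."""
    gd = 0
    for marker, a in hap_a.items():
        b = hap_b[marker]
        if marker in infinite_allele_markers:
            gd += 1 if a != b else 0   # any change counts as a single event
        else:
            gd += abs(a - b)           # stepwise: one per repeat of difference
    return gd


hap_1 = {"DYS390": 24, "DYS391": 11, "CDYa": 36}
hap_2 = {"DYS390": 25, "DYS391": 11, "CDYa": 39}

print(genetic_distance(hap_1, hap_2))             # stepwise everywhere: 1 + 0 + 3 = 4
print(genetic_distance(hap_1, hap_2, {"CDYa"}))   # CDYa capped at 1:   1 + 0 + 1 = 2
```

Capping the multi-copy marker at one step is what tones down its contribution without discarding it.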

Everything I've read so far, however, indicates that SNP mutations and STR mutations are completely independent, at least as far as the stable types of SNPs used for phylogenetic trees. The net of that is, there is no reason for an STR value to be stuck within a particular haplogroup which is only defined by happenstance of discovery of SNPs that cordon off a branch on the Y tree.

razyn
06-04-2013, 02:13 AM
It was on a different forum that I was finally able to squeeze out of M. Jost the usually missing names of the 25 Bird-blessed markers (i.e. those that have a quality factor of .07 or higher, whatever that proves). So in case anybody else doesn't know, and cares (a condition that afflicted me for several weeks, until I was a complete pest about it back around March 3rd): in a 67-marker test, after throwing out all the multi-copy markers, you also throw out about half of the others, and are left with:
390
391
439
458
447
437
448
449
460
GATAH4
456
607
576
570
578
537
406S1
511
557
534
444
481
520
446
617

And in the (presumably remote) event that you want to compare people who have been tested any further than FTDNA's 67th position -- after throwing out more multi-copy ones -- these few and proud markers also have a high-enough q to be usable:
710
485
632
495
540
714
549
522
533
636
452
A10
463
441
712
650
532
715
504
513
635
643
510

There may be a few more good ones between positions 93 and 111, but Steven C. Bird's recent paper didn't go quite that high -- so in the present exercise, neither can we.

MJost
06-04-2013, 12:38 PM
Sorry, I could have given a list right from the Bird paper of the STRs with a q <=.07. All of them are listed in my TMRCA Estimate Speed worksheet. I even listed the STRs that didn't qualify in another post.

http://www.anthrogenica.com/showthread.php?560-A-TMRCA-Estimator-Excel-spreadsheet&p=5245&viewfull=1#post5245

The entire concept is that Bird used three factors: Allelic Range, Excess Range and Kurtosis Skewness


Towards Improvements in the Estimation of the Coalescent: Implications for the Most Effective Use of Y Chromosome Short Tandem Repeat Mutation Rates


MJost

MJost
06-13-2013, 05:16 PM
And I truly believe that calculations using SNPs are basically flawed; if they weren't, we could calibrate the Y-DNA STRs to them, but we can't. Your adjustment by two and a half is an assumption. Age is based on years per generation. Generations, in turn, are based on the variance and the STR mutation rates. Variance is computed across all the STRs and is used to measure how spread out the values are, which gives the standard deviation result.

So do you agree that variance is an effective measurement tool on stable STRs? And if you agree, then you must believe the mutation rates are 2.5 times slower than the experts (i.e. MarkoH) are producing?
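
The reasoning can be made concrete with a few lines of Python. Age scales inversely with the mutation rate, so stretching an age by 2.5 while keeping the same variance implies rates 2.5 times slower. The variance and rate values below are made up for illustration.

```python
def age_years(total_variance, total_rate, years_per_generation=25.0):
    """Age = (variance / mutation rate) generations x years per generation."""
    return total_variance / total_rate * years_per_generation


base = age_years(0.5, 0.005)          # about 100 generations, 2500 years
slow = age_years(0.5, 0.005 / 2.5)    # same variance, rates 2.5 times slower

print(base, slow, slow / base)
```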

MJost

Rathna
06-13-2013, 06:26 PM
And I truly believe that calculations using SNPs are basically flawed; if they weren't, we could calibrate the Y-DNA STRs to them, but we can't. Your adjustment by two and a half is an assumption. Age is based on years per generation. Generations, in turn, are based on the variance and the STR mutation rates. Variance is computed across all the STRs and is used to measure how spread out the values are, which gives the standard deviation result.

So do you agree that variance is an effective measurement tool on stable STRs? And if you agree, then you must believe the mutation rates are 2.5 times slower than the experts (i.e. MarkoH) are producing?

MJost

MJost, we have some STR values for gorillas and our other similar cousins, and frequently the gorilla and human values are close or the same. What do you think about this? Do you think that the variance is so low? And why does DYS391 mostly have the values 10 and 11, and rarely others upstream and downstream, if not because of the principles that mutations happen around the modal and that sometimes a mutation goes off on a tangent? Many mutations are destined to remain hidden, and many other factors influence this that I think cannot be calculated. A theory shouldn't be based only upon this calculation, but upon History, Geography, Linguistics etc., and above all (and I wrote this about Boattini's paper) on not excluding the outliers.
I did a calculation on an R1a1a from India (and I am not an expert in your field) using an outlier, and the age was similar to that of the infamous Zhiv method. I posted this on eng.molgen and nobody replied to me, not even Anatole Klyosov, whose 22 slowest markers are interesting anyway.
The lines arrived to us are a few, the most part went extinct, the modal (just for my principles) is an abstraction etc etc. and it isn't said that the most diffused values was the original etc etc.
What do we know about the DYS426 of R-M269 (was really 11?) or of R-L51 (was really 13?) etc etc. The ancient DNA of hg. G has demonstrated that it wasn't so for some markers and I wrote about this in some forum which probably banned me.

Rathna/Gioiello/Maliclavelli/Claire etc etc

TigerMW
06-19-2013, 02:20 PM
MJost, we have some STR values for gorillas and our other similar cousins, and frequently the gorilla and human values are close or the same. What do you think about this? Do you think that the variance is so low?

Can you show full 37 or 67 STR haplotypes of a gorilla, in Ysearch or in a spreadsheet, so we can analyze them? It's hard to have an opinion on something we don't usually look at.

I am not sure what you are trying to say in terms of any disagreement with MJost. MJost has taken great care to look at various assessments of Y STRs "linear duration" and what Bird's paper calculates in terms of its "q" values. The whole purpose of that is to consider that STR variance is not useful for indefinite periods of time. Everything I've seen is that MJost is taking great pains to account for this.

What are you critical of?

I don't see the purpose of citing that you've been banned on other forums when you throw in comments like "which probably banned me." It's just commentary devoid of content.

Rathna
06-20-2013, 08:02 PM
Can you show full 37 or 67 STR haplotypes of a gorilla, in Ysearch or in a spreadsheet, so we can analyze them? It's hard to have an opinion on something we don't usually look at.

I am not sure what you are trying to say in terms of any disagreement with MJost. MJost has taken great care to look at various assessments of Y STRs "linear duration" and what Bird's paper calculates in terms of its "q" values. The whole purpose of that is to consider that STR variance is not useful for indefinite periods of time. Everything I've seen is that MJost is taking great pains to account for this.

What are you critical of?

I don't see the purpose of citing that you've been banned on other forums when you throw in comments like "which probably banned me." It's just commentary devoid of content.

Mike, I expected MJost to reply to me, but I am glad that you replied. I have great esteem for MJost, and probably my remarks are worth something.
Why I wrote "I wrote about this in some forum which probably banned me"? Because I use only my memory, I haven't an archive on my PC (on my many PC I have had and lost) and I remember having written this but at the moment I am not sure when and where. I have printed all what I have written but it would be very difficult to find these my writings (I'd agree with you if you said that this isn't the right way to use PC). I remember that someone published the STRs (some of them) of some apes, I wrote something on them, but don't ask me where and when.
Anyway, when someone spoke of these Africans A00, I found some of them on SMGF (unfortunately now out), I extracted their data and put them on ySearch, where they surely still are.
You can take some of them (there is also a project at FTDNA), compare their STR values, and see that if you calculate the MRCA between them and any Y haplogroup with the MJost, Bird or any other method, you won't reach even a tenth of their age: more than 300,000 years. Why? Because mutations happen around the modal, because there is a convergence to the modal as time passes, and only sometimes does a mutation go off on a tangent.

TigerMW
06-21-2013, 12:37 PM
I am not sure what you are trying to say in terms of any disagreement with MJost. MJost has taken great care to look at various assessments of Y STRs "linear duration" and what Bird's paper calculates in terms of its "q" values. The whole purpose of that is to consider that STR variance is not useful for indefinite periods of time. Everything I've seen is that MJost is taking great pains to account for this.

Mike, I expected MJost to reply to me, but I am glad that you replied. I have great esteem for MJost, and probably my remarks are worth something.
....
You can take some of them (there is also a project at FTDNA) and compare their STR values, and you will see that if you calculate the MRCA between them and any Y haplogroup with the MJost, Bird or any other method, you won't reach even a tenth of their age: more than 300,000 years. Why? Because mutations happen around the modal, because there is a convergence to the modal as time passes, and only sometimes does a mutation go off on a tangent.

You use the terminology "mutations around the modal" and that is not common that I know of in scientific papers but I agree (and I think Mark would too) that Y STRs have general ranges. However, to use a gorilla as an example is an erroneous argument by exception. Mark has not been trying to estimate TMRCAs with gorillas. I've seen in the past where folks like Vince V have said STRs (throw out the multi-copy ones of course) are probably useful for estimating time up to the 15-20K year range. Of course, that is given a particular species and we are clearly of one species whereas a gorilla is exceptional when compared with us.
I don't know what the right time range is but people are actually studying this issue, e.g. Busby, Heinila and Bird among others. Mark is using this work appropriately.

Rathna
06-21-2013, 03:51 PM
You use the terminology "mutations around the modal" and that is not common that I know of in scientific papers but I agree (and I think Mark would too) that Y STRs have general ranges. However, to use a gorilla as an example is an erroneous argument by exception. Mark has not been trying to estimate TMRCAs with gorillas. I've seen in the past where folks like Vince V have said STRs (throw out the multi-copy ones of course) are probably useful for estimating time up to the 15-20K year range. Of course, that is given a particular species and we are clearly of one species whereas a gorilla is exceptional when compared with us.
I don't know what the right time range is but people are actually studying this issue, e.g. Busby, Heinila and Bird among others. Mark is using this work appropriately.

Do you want to see the latest failure of this method? Hong Shi et al., Genetic Evidence of an East Asian Origin and Paleolithic Northward Migration of Y-chromosome Haplogroup N, PLoS ONE. The authors had to use the Zhiv method, whereas others would want to divide by three: "An effective mutation rate of 0.0069 was used" (p. 4). But the ages should have been these:
"Earliest dispersal from southern East Asia (around 21 kya)" (p. 8), of course of N*, but when put to the test of the calculation based on haplotype variance they have to admit:
"The age of N*-M231 (13.69 kya), presumably the ancestral lineage of Hg. N, is younger than expected, likely as a result of yet-to-be-identified individuals having derived N-M231 sub-haplogroup when new Y SNP markers are uncovered in the future" (p. 6).
And if we divide 13.69 by 3 we have 4.56 kya. And this N* should be the ancestral haplogroup that peopled the whole continent from South Asia to Finland!

TigerMW
06-21-2013, 07:13 PM
Do you want to see the latest failure of this method? Hong Shi et al., Genetic Evidence of an East Asian Origin and Paleolithic Northward Migration of Y-chromosome Haplogroup N, PLoS ONE. The authors had to use the Zhiv method, whereas others would want to divide by three: "An effective mutation rate of 0.0069 was used" (p. 4). But the ages should have been these:
"Earliest dispersal from southern East Asia (around 21 kya)" (p. 8), of course of N*, but when put to the test of the calculation based on haplotype variance they have to admit:
"The age of N*-M231 (13.69 kya), presumably the ancestral lineage of Hg. N, is younger than expected, likely as a result of yet-to-be-identified individuals having derived N-M231 sub-haplogroup when new Y SNP markers are uncovered in the future" (p. 6).
And if we divide 13.69 by 3 we have 4.56 kya. And this N* should be the ancestral haplogroup that peopled the whole continent from South Asia to Finland!

I may not be following you here, but from what I see you've quoted, they are presuming N* is the ancestral lineage. That is completely erroneous if we are measuring modern populations. You are conflating N* as a subclade when it is an undifferentiated paragroup that is not necessarily representative of all of N. There is no reason to assume it has the ancestral haplotype values for all of N. There is no reason to assume N*'s modern values are representative. They are just people that are currently N*, but may be a very young subclade. They could be all N-XYZ where XYZ is undiscovered. I think the author says as much within your quote.

Rathna
06-21-2013, 07:51 PM
I may not be following you here, but from what I see you've quoted, they are presuming N* is the ancestral lineage. That is completely erroneous if we are measuring modern populations. You are conflating N* as a subclade when it is an undifferentiated paragroup that is not necessarily representative of all of N. There is no reason to assume it has the ancestral haplotype values for all of N. There is no reason to assume N*'s modern values are representative. They are just people that are currently N*, but may be a very young subclade. They could be all N-XYZ where XYZ is undiscovered. I think the author says as much within your quote.

That N* is ancestral to all N subclades is the theory of these Chinese scholars and not mine. If N* were a paragroup, being composed of many different haplotypes, the variance should be higher and not lower.
Something doesn't fit in what you say.

TigerMW
06-24-2013, 09:19 PM
That N* is ancestral to all N subclades is the theory of these Chinese scholars and not mine. If N* were a paragroup, being composed of many different haplotypes, the variance should be higher and not lower.
Something doesn't fit in what you say.

You are the one bringing up this study and the use of N* to prove your point. I can't help it if it doesn't make any sense to compare a paragroup, which is of unknown content, with a true subclade. Paragroups do not automatically have higher variance. They are a collection of haplotypes of unknown relation. They are a "partial" clade. What sense does it make to calculate the TMRCA of a partial group when we don't know if it is representative of the whole clade?

MJost
07-02-2013, 12:35 PM
Terry Robb posted this over on the RootsWeb board, and I thought it was pertinent here. Do we even have enough SNPs to do this with much justice?

MJost

Subject: [DNA] New results for TMRCA of Y-Haplogroups - based of Complete Genomics data

"Some new time-to-Most-Recent-Common-Ancestor (TMRCA) results for
Y-haplogroups, *without* relying on STR methods, are shown as a tree in
UPDATE20 at the bottom of the webpage:"

http://www.goggo.com/terry/HaplogroupI1/#CompleteGenomicsTMRCA

"The present results are broadly consistent with estimates I got last year.

In particular, I find that (under the assumption that CT-M168 splits at 70
thousand years ago):

1) The Afrasian DE-M145 haplogroup then splits at 63.4 (+/- 2.1 SD)
thousand years ago.

2) The Eurasian F-M89 haplogroup then splits at 50.7 (+/- 1.9 SD) thousand
years ago.

3) Haplogroup I1-M253 would split prior to 6.5 (+/- 0.7 SD) thousand years
ago. But need a full sequence I1-Z131 sample to see how much older, if any,
than that.

4) Haplogroup R1b-P312 (a subgroup of R1b-M269 found mainly in Western
Europe), would split at 6.1 (+/- 0.6 SD) thousand years ago."

TigerMW
07-02-2013, 01:03 PM
Terry Robb posted this over on the RootsWeb board ...
"Some new time-to-Most-Recent-Common-Ancestor (TMRCA) results for
Y-haplogroups, *without* relying on STR methods
...
4) Haplogroup R1b-P312 (a subgroup of R1b-M269 found mainly in Western
Europe), would split at 6.1 (+/- 0.6 SD) thousand years ago."

That is not complete or almost complete coverage of the Y chromosome so I wouldn't call this the best estimate so far, but I think we are seeing some cross-validation.

As I interpret it, this estimate would put the first P312 man at about 4000 BC. This man would be the oldest a P312 MRCA could be.
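
For anyone who wants to check that conversion, here is a minimal sketch of the arithmetic. It assumes "years ago" is counted back from 2013 (a choice made for this illustration; some authors count from 1950), and the function name is made up for the example.

```python
def years_ago_to_calendar(kya, reference_year=2013):
    """Convert an age in thousands of years ago to a calendar-year label.

    reference_year is an assumption: the year that "ago" is counted from.
    """
    year = reference_year - round(kya * 1000)
    # There is no year 0: astronomical year 0 is 1 BC, -1 is 2 BC, and so on.
    if year <= 0:
        return f"{1 - year} BC"
    return f"{year} AD"


# Terry Robb's P312 estimate: 6.1 +/- 0.6 (SD) thousand years ago
for kya in (6.1 + 0.6, 6.1, 6.1 - 0.6):
    print(f"{kya:.1f} kya -> {years_ago_to_calendar(kya)}")
```

With these assumptions the central value lands a little before 4000 BC, and the one-SD range spans roughly 1,200 years around it, so "about 4000 BC" is a fair rounding.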

Rathna
07-02-2013, 01:19 PM
4) Haplogroup R1b-P312 (a subgroup of R1b-M269 found mainly in Western
Europe), would split at 6.1 (+/- 0.6 SD) thousand years ago."

Like Euclidean Geometry all is based upon assumptions ("under the assumption that..."), but this is a historical science and we shall find documents and proofs.
6.1 +/- 0.6 could be a reliable age. At its maximum we arrived at 7,000YBP.
The agriculturalists who went from Italy to Iberia 7,500 YBP carried various haplogroups depending on their region of origin: probably Sardinians carried I-M26, G2a and a Basque-like language; Tyrrhenians, who spoke an IE language of the Celtic-Italic group, carried undoubtedly R-L51 (Valencia region and Middle Portugal) and R-P312/DF27-.

MJost
07-02-2013, 02:24 PM
That is not complete or almost complete coverage of the Y chromosome so I wouldn't call this the best estimate so far, but I think we are seeing some cross-validation.

As I interpret it, this estimate would put the first P312 man at about 4000 BC. This man would be the oldest a P312 MRCA could be.

Soon we will have a giant leap into a much expanded SNP realm from Full Y Testing. Phase II test data will be analyzed shortly, with a lot of U106 guys that should provide considerable data down to L11. Then with several P312 guys and 22 or more L21's we should see a much bigger picture of the entire tree deep into L21. We then should be able to tie STR mutation timing into the fray.

MJost

TigerMW
07-02-2013, 05:57 PM
Like Euclidean Geometry all is based upon assumptions ("under the assumption that..."), but this is a historical science and we shall find documents and proofs.
6.1 +/- 0.6 could be a reliable age. At its maximum we arrived at 7,000YBP...

Anatole Klyosov has responded to Terry Robb. Klyosov thinks the point of calibration should be a little more recent.

Your approach is certainly valid and useful. However, in this particular case you took the 70 kya for the CT split, which should rather be around 55 kya. As a result, all other estimates are shifted up. For example, R1b-P312 is about 4200 years "old" (3900 ybp on some accounts, which is within the margin of error anyway), I1 has a common ancestor around 3600 ybp, all over Europe. I do not know (and you did not explain) how specifically the 70 kya figure, which your calibration is based upon, is supported. http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2013-07/1372781288
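
To see how much of the difference the calibration point alone accounts for, here is a rough sketch. It assumes a plain SNP-counting clock with a single calibration point, so every age scales by the same factor when the anchor moves; the function name is made up for this example, and this is not necessarily how either Terry Robb or Klyosov computes things.

```python
def rescale_age(age_kya, old_anchor_kya, new_anchor_kya):
    """Rescale an age when the single calibration point is moved.

    Assumes ages are proportional to SNP counts, so every age scales
    by the same factor new_anchor / old_anchor.
    """
    return age_kya * new_anchor_kya / old_anchor_kya


# Terry Robb's figures (calibrated on CT-M168 at 70 kya), re-anchored at 55 kya
robb_estimates_kya = {"DE-M145": 63.4, "F-M89": 50.7, "I1-M253": 6.5, "R1b-P312": 6.1}
for clade, age in robb_estimates_kya.items():
    print(f"{clade}: {age} kya -> {rescale_age(age, 70, 55):.1f} kya")
```

Under that assumption P312 moves from 6.1 to about 4.8 kya, which is still older than the roughly 4.2 kya Klyosov quotes, so his figures evidently rest on more than the change of anchor.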

MJost
07-02-2013, 06:37 PM
I thought P312 was seemingly too old as well but SNP Calibration is not for the faint of heart.

MJost

Rathna
07-02-2013, 09:19 PM
Anatole Klyosov has responded to Terry Robb. Klyosov thinks the point of calibration should be a little more recent.

That Anatole of the Arbins, of the passage of R1b from North Africa to Spain, that of the African A00 a little less than Chimp cousins, that of no "Out of Africa" etc etc?

TigerMW
07-02-2013, 10:04 PM
That Anatole of the Arbins, of the passage of R1b from North Africa to Spain, that of the African A00 a little less than Chimp cousins, that of no "Out of Africa" etc etc?

Is that a sarcastic message intended to attack the person rather than the logic or the evidence? (ad hominem attack)

You might consider going on http://archiver.rootsweb.ancestry.com/th/index/GENEALOGY-DNA/2013-07 directly and try disagreeing there. I'm sure Anatole will respond.

TigerMW
07-02-2013, 10:13 PM
Soon we will have a giant leap into a much expanded SNP realm from Full Y Testing. Phase II test data will be analyzed shortly, with a lot of U106 guys that should provide considerable data down to L11. Then with several P312 guys and 22 or more L21's we should see a much bigger picture of the entire tree deep into L21. We then should be able to tie STR mutation timing into the fray.

MJost

Yes, I agree. We may be jumping the gun a little bit here on SNP counting. From prior discussions with Ken Nordtvedt I think coverage, and consistency of coverage, of the Y chromosome are critical for this method. Basically, it's like saying you have to have a representative sample and a low or measurable rate of testing errors.

Sooner or later, this will do wonders for tying down time estimates, but I don't see it as one method making another obsolete; rather, both can be used in concert, for cross-checking if nothing else.

MJost
07-03-2013, 03:47 AM
Yes, I agree. We may be jumping the gun a little bit here on SNP counting. From prior discussions with Ken Nordtvedt I think coverage, and consistency of coverage, of the Y chromosome are critical for this method. Basically, it's like saying you have to have a representative sample and a low or measurable rate of testing errors.

Sooner or later, this will do wonders for tying down time estimates, but I don't see it as one method making another obsolete; rather, both can be used in concert, for cross-checking if nothing else.

You're right: as time passes and new understanding and findings arrive, methods do change to keep up. Example:

Kenneth Nordtvedt recently stated that the traditional interclade variance, summed over all haplotype pairs a,b, is used to estimate node age GAB, with all haplotype pair variances weighted equally. He is currently working on obtaining a better (smaller) statistical sigma for the GAB estimate by weighting the haplotypes, and the variances they participate in, differently.

Nordtvedt said that in the near future he will put up on his website

http://knordtvedt.home.bresnan.net/

a more complete derivation of this and discussion of typical results confirmed by simulations. (Weighting Haplotypes for Tree Node Age Estimates is up now)

Ken added that the bottom line is that haplotypes related very recently to other haplotypes are down weighted relative to haplotypes which have primarily deeper times to common ancestors with the other haplotypes.

He advised that he is also preparing a method of weighting Y SNP counting for dating tree foundings and node ages. He mentioned it is not commonly discussed, but Y SNP counts as clocks are subject to very similar statistical uncertainties as are STR variance or GD clock methods, reflecting their common mechanism --- random mutations.
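
For readers who have not seen the traditional (equally weighted) interclade calculation written out, here is a minimal sketch. It assumes the simple symmetric single-step mutation model, under which the expected squared difference between two haplotypes from different clades is 2 x G x (sum of the per-marker rates). The haplotypes and rates are toy values invented for illustration, and Ken's new weighting scheme is not reproduced here since its details are on his website rather than in this post.

```python
def interclade_age_generations(clade_a, clade_b, rates):
    """Equal-weight interclade variance estimate of node age in generations.

    clade_a, clade_b: lists of haplotypes (lists of repeat counts, same marker order).
    rates: per-marker mutation rates per generation (assumed values).
    """
    total_sq_diff = 0.0
    for a in clade_a:
        for b in clade_b:
            total_sq_diff += sum((x - y) ** 2 for x, y in zip(a, b))
    mean_sq_diff = total_sq_diff / (len(clade_a) * len(clade_b))
    # Two lineages each accumulate variance mu * G per marker,
    # so the expected squared difference is 2 * G * sum(rates).
    return mean_sq_diff / (2 * sum(rates))


# Toy two-marker haplotypes, invented for illustration only
clade_a = [[13, 24], [13, 25]]
clade_b = [[14, 24], [14, 24]]
rates = [0.002, 0.002]
print(interclade_age_generations(clade_a, clade_b, rates))
```

Every a,b pair counts the same in the average here, which is exactly the equal weighting Ken says he wants to improve on.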

MJost

Rathna
07-03-2013, 10:04 AM
Is that a sarcastic message intended to attack the person rather than the logic or the evidence? (ad hominem attack)

You might consider going on http://archiver.rootsweb.ancestry.com/th/index/GENEALOGY-DNA/2013-07 directly and try disagreeing there. I'm sure Anatole will respond.

My "sarcastic message" isn't an ad personam attack, but the illustration of the theories of a man, a scholar, a teacher, an expert in Chemistry (and I have nothing to say against that), with whom I have exchanged many letters in the past (see "Dienekes' Anthropology blog" and elsewhere). I am waiting for him to respond to a letter of mine about an Indian R1a subclade, where I demonstrated that, even using the current methods of calculation, if the haplotype is an outlier (hence one of the few survivors of very ancient stocks), the infamous Zhiv rate could be verified. That letter is on eng.molgen and perhaps also on Worldfamilies.
And I think you can't use Klyosov where he seems to support your theories without also acknowledging all the rest, which you probably don't accept. This too is a rhetorical escamotage.
I cannot write on Rootsweb, because I was banned at the end of 2007 and I'll never write there (do you remember Bullock, the "torello"?), and if Klyosov writes only there, it means that he doesn't desire to exchange letters with me.
I said many times in the past that, even recognizing his great work in the chemical field, his theories on Genetics were a "massive waste of time".

TigerMW
07-03-2013, 06:52 PM
My "sarcastic message" isn't an ad personam attack, but the illustration of the theories of a man, a scholar, a teacher, an expert in Chemistry (and I have nothing to say against that), with whom I have exchanged many letters in the past (see "Dienekes' Anthropology blog" and elsewhere). I am waiting for him to respond to a letter of mine about an Indian R1a subclade, where I demonstrated that, even using the current methods of calculation, if the haplotype is an outlier (hence one of the few survivors of very ancient stocks), the infamous Zhiv rate could be verified. That letter is on eng.molgen and perhaps also on Worldfamilies.
And I think you can't use Klyosov where he seems to support your theories without also acknowledging all the rest, which you probably don't accept. This too is a rhetorical escamotage.
I cannot write on Rootsweb, because I was banned at the end of 2007 and I'll never write there (do you remember Bullock, the "torello"?), and if Klyosov writes only there, it means that he doesn't desire to exchange letters with me.
I said many times in the past that, even recognizing his great work in the chemical field, his theories on Genetics were a "massive waste of time".
My understanding of your use of "rhetorical escamotage" is that it conveys the meaning of trickery or deception. Are you saying you are communicating trickery and deception or Klyosov is? BTW, I don't necessarily agree with Klyosov on many things anyway but I respect his intelligence and his willingness to work and document. Your whole posting appears to be one of your interpretations and subjective perspectives of prior exchanges on multiple forums. I think that is a little off-track from this thread or at least any discussion of evidence and logic as it relates to this thread.

Well, anyway, Ken Nordtvedt has commented on SNP counting methods of TMRCA estimation. I think this is interesting in that Y STRs may remain extremely important for a long, long time.

On Rootsweb Hg I on 07/03/2013:

Y chromosome is said to have 58 million nucleotide sites. But for years we have been told that only about 27 million of those sites are used in searches for Y snps. Various technical reasons have been given.

But now some snp counters using full genome sequences say they use only about 9 million sites of the Y in their counts of Y snps on various segments of the tree (or between present day dna samples).

What technical factors lead to this further cut to 1/3 in the number of sites used? Is there a map showing what parts of the Y make up this 9 million sites?

9 million nucleotide sites has about the same total mutation rate as does a 111 STR haplotype, leading to about the same statistical sigmas for tmrca estimates on the short and moderate time spans into the past. http://archiver.rootsweb.ancestry.com/th/read/Y-DNA-HAPLOGROUP-I/2013-07/1372869010

It turns out 111 Y STR haplotypes may be the best thing around for short and moderate time spans, period.
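
Ken's comparison can be sketched with some round numbers. The per-site SNP rate and the average per-marker STR rate below are illustrative assumptions, not figures from his post, and the sigma is a plain Poisson counting uncertainty that ignores STR back-mutations and multi-step changes.

```python
import math


def total_rate(n_sites, rate_per_site):
    """Expected mutations per generation across all sites or markers."""
    return n_sites * rate_per_site


def relative_sigma(total_rate_per_gen, generations):
    """Poisson relative uncertainty, 1 / sqrt(expected mutation count)."""
    return 1.0 / math.sqrt(total_rate_per_gen * generations)


# Assumed illustrative rates, not taken from Ken's post
SNP_RATE = 3e-8   # per nucleotide per generation
STR_RATE = 0.002  # per marker per generation (about 1 in 500)

snp_total = total_rate(9_000_000, SNP_RATE)
str_total = total_rate(111, STR_RATE)
for g in (40, 160):
    print(g, round(relative_sigma(snp_total, g), 2), round(relative_sigma(str_total, g), 2))
```

With these assumed rates the two totals come out within about 20% of each other, which is the sense in which the statistical sigmas end up similar on short and moderate time spans.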

lgmayka
07-04-2013, 02:10 AM
It turns out 111 Y STR haplotypes may be the best thing around for short and moderate time spans, period.
I appreciate Ken's role as a "devil's advocate" against SNPs and in favor of STRs. But as such, he does not mention the primary argument against his case:

Within reasonable timespans, SNPs only change once, whereas STRs flip back and forth and can even jump multiple steps. This means that SNPs can construct an unambiguous haplotree, whereas STRs cannot.

razyn
07-04-2013, 02:26 AM
But counting SNPs to estimate age is an exercise in blind faith, while we know perfectly well that not all of the SNPs that define branching points have yet been found. By comparison, using STR variance within the said unambiguous haplotrees to calibrate their ages relative to each other is downright scientific (law of large numbers, and all that).

Is there some reason it needs to be one or the other?

lgmayka
07-04-2013, 02:36 AM
But counting SNPs to estimate age is an exercise in blind faith, while we know perfectly well that not all of the SNPs that define branching points have yet been found.
I was referring to a time, hopefully coming soon, when we have full Y sequences for all the major clades of interest. Just as with STRs, interclade calculations will be the most reliable (least biased).

Is there some reason it needs to be one or the other?
No, of course not--both are useful, and best used together.

MJost
07-06-2013, 04:19 AM
Mike,

Would you be interested in being a proctor for 'a Consumer Reports like evaluation on the various TMRCA methodologies' since you have an understanding of the TMRCA concepts? The data is to be provided by one of the moderators of Rootsweb. Someone else will perform AK's methods and I will use my version of the TMRCA spreadsheet.

MJost

jeanL
07-08-2013, 02:29 PM
Ok here is what I said on Rootsweb before being banned for "personal attacks" on Klyosov's persona:

http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2013-07/1372684216

Summary:

Mutation rates are a function of repeat number, so calibrating mutation rates on time frames of <2000 ybp and then extrapolating that to other time frames is erroneous. I asked Klyosov numerous times what he thought of that, and he ignored the question each time. If you want proof that they are indeed dependent on each other, I will more than gladly supply it.

There is a mutation bias whereby the modal value changes with time, e.g. what happened with G2a-P15+ in Avellaner, Catalonia, and Treilles, France.

TigerMW
07-08-2013, 03:03 PM
Mike,

Would you be interested in being a proctor for 'a Consumer Reports like evaluation on the various TMRCA methodologies' since you have an understanding of the TMRCA concepts? The data is to be provided by one of the moderators of Rootsweb. Someone else will perform AK's methods and I will use my version of the TMRCA spreadsheet.

MJost
Yes, Mark, I'm flattered, but if you are looking for a statistician's input/feedback that's not me. I'll do whatever you need or want, but be forewarned I should be treated as a layperson as far as statistics. I do have academic education (an undergraduate minor) but this is not my profession by any means.

TigerMW
07-08-2013, 03:05 PM
... There is a mutation bias whereby the modal value changes with time, e.g. what happened with G2a-P15+ in Avellaner, Catalonia, and Treilles, France.
Please could you explain the pertinent points related to this G2a-P15 finding on this forum?

TigerMW
07-08-2013, 03:09 PM
I appreciate Ken's role as a "devil's advocate" against SNPs and in favor of STRs. But as such, he does not mention the primary argument against his case:

Within reasonable timespans, SNPs only change once, whereas STRs flip back and forth and can even jump multiple steps. This means that SNPs can construct an unambiguous haplotree, whereas STRs cannot.

I agree, 100%. As has also been mentioned, it makes sense to use multiple methods, cross-check them and then consider them in context of each method's individual weaknesses as well as the multi-discipline backdrop.

jeanL
07-08-2013, 04:33 PM
Please could you explain the pertinent points related to this G2a-P15 finding on this forum?

I posted the ancient haplotypes in this forum:

http://www.anthrogenica.com/showthread.php?1081-Analysis-of-G2a-haplotypes-from-aDNA-studies

Basically what we have is two relatively close ancient samples spaced apart by 2000 years, so assuming that they are related isn't far-fetched. Using the neutral random mutation model, the modals of the two sets appear to have a TMRCA at 8500 ybp per Klyosov. But if we were to assume that there exists a mutational bias such that any STR locus will gravitate to either +1 or -1 in a timeframe of 1/mu, where mu is the biased mutation rate, then the set from Treilles could very easily descend from a man who lived among, or was directly related to, the Cardial Farmers from Avellaner, Catalonia. The other methodology places the TMRCA of the G2a-P15+ in 3000 BC SW France and those in 5000 BC Catalonia at 8500 ybp (6500 BC), at least per Klyosov, which would mean that they descend from different groups that arrived from the Middle East directly, and are not related to each other at all. Quite illogical if you ask me!!

MJost
07-09-2013, 01:11 AM
Yes, Mark, I'm flattered, but if you are looking for a statistician's input/feedback that's not me. I'll do whatever you need or want, but be forewarned I should be treated as a layperson as far as statistics. I do have academic education (an undergraduate minor) but this is not my profession by any means.
Mike, thanks for responding. You could do it. But the dataset of 39 known, paper-trail-proven mixed haplotypes has just been released, and there are three different (main) methods being utilized and maybe one additional one. The methods will be Bruce Walsh's, my modified Nordtvedt's and AK's counting.

We'll see how it comes out.

MJost

jeanL
08-01-2013, 11:49 PM
Looks like this study just sent us back to the drawing board in terms of age estimates, fellows:

Wei.et.al.2013 (http://www.sciencedirect.com/science/article/pii/S1872497313001026)




We have compared phylogenies and time estimates for Y-chromosomal lineages based on resequencing ∼9 Mb of DNA and applying the program GENETREE to similar analyses based on the more standard approach of genotyping 26 Y-SNPs plus 21 Y-STRs and applying the programs NETWORK and BATWING. We find that deep phylogenetic structure is not adequately reconstructed after Y-SNP plus Y-STR genotyping, and that times estimated using observed Y-STR mutation rates are several-fold too recent. In contrast, an evolutionary mutation rate gives times that are more similar to the resequencing data. In principle, systematic comparisons of this kind can in future studies be used to identify the combinations of Y-SNP and Y-STR markers, and time estimation methodologies, that correspond best to resequencing data.

Wing Genealogist
08-02-2013, 08:46 AM
Looks like this study just sent us back to the drawing board in terms of age estimates, fellows:

Wei.et.al.2013 (http://www.sciencedirect.com/science/article/pii/S1872497313001026)

Tim Janzen has posted elsewhere:
"I haven’t looked at that paper yet, but my suspicion is that they didn’t eliminate the medium and fast mutating markers from their analysis. I demonstrated the issues surrounding this back in 2009. See http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2009-07/1247384275. If you don’t use the right markers for older haplogroups you get wrong answers when you try to do TMRCA estimates."

Looking at the STRs used, they include most of the fastest mutating markers in the FTDNA 67 marker panel (with the exception of CDY, 464 and 534).

alan
08-02-2013, 12:42 PM
Ignoring the absolute dates, the branching diagram is nevertheless interesting in terms of relative age of branching. There does seem to be a bimodal pattern at a quick glance. There is an earlier period of strong branching which only involves G, E and J. It's hard not to think that is linked to the Neolithic. Then there is a later period where R1b, R1a and I2a1a undergo strong branching. In relative terms it would still fit quite nicely with the idea of G, E and J being involved in the early Neolithic and R1b and I2a1a expanding in the copper age. Seems to be a tolerable fit for ancient DNA evidence.

MJost
08-02-2013, 01:08 PM
Does anyone have the Wei paper supplements to review? I have the paper.

MJost

Never mind, I found the free version that is available here, and it has the supplements.


http://www.sciencedirect.com/science/article/pii/S1872497313001026

MJost
08-02-2013, 02:11 PM
Looks like this study just sent us back to the drawing board in terms of age estimates, fellows:

Wei.et.al.2013 (http://www.sciencedirect.com/science/article/pii/S1872497313001026)

I believe this is one of the best papers produced on the subject, as they are using all the major tools, even Fluxus to assist in identifying back mutations. They compared GENETREE times from the resequencing data with BATWING times calculated from Y-SNPs plus Y-STRs, using Pearson’s correlation coefficient (R2) and Spearman’s rank correlation coefficient (rho).

They do concede the possibility "that conclusions about deep relationships based on genotyping Y-SNPs plus Y-STRs may be unreliable."

They compared TMRCA estimates based on Y-SNP plus Y-STR genotyping using five different Y-STR mutation rates: two compilations of observed mutation rates, an observed mutation rate adjusted for population variation, an evolutionary mutation rate and a recalibrated evolutionary mutation rate. They "... conclude that BATWING time estimates based on an evolutionary mutation rate correlate best with the resequence data."

They do believe they "...now have a tool to evaluate time estimates based on Y-SNP plus Y-STR genotyping in a systematic way." But this has opened up more questions, including some STRs being more useful than others. I suspect they are wondering more about saturation effects and which STRs are less prone to them, i.e. my reason for considering Steve Bird's more recent list of stable STRs in TMRCA.

And of course there is the question of which mutation rates to consider, as the evolutionary mutation rate seems to make a better fit, as they have shown. 1/500 is the evolutionary mutation rate. Mutation rate(s) and STR selection are huge factors to consider in divining TMRCAs.

MJost

jeanL
08-02-2013, 03:40 PM
I believe this is one of the best papers produced on the subject, as they are using all the major tools, even Fluxus to assist in identifying back mutations. They compared GENETREE times from the resequencing data with BATWING times calculated from Y-SNPs plus Y-STRs, using Pearson’s correlation coefficient (R2) and Spearman’s rank correlation coefficient (rho).

They do concede the possibility "that conclusions about deep relationships based on genotyping Y-SNPs plus Y-STRs may be unreliable."

They compared TMRCA estimates based on Y-SNP plus Y-STR genotyping using five different Y-STR mutation rates: two compilations of observed mutation rates, an observed mutation rate adjusted for population variation, an evolutionary mutation rate and a recalibrated evolutionary mutation rate. They "... conclude that BATWING time estimates based on an evolutionary mutation rate correlate best with the resequence data."

They do believe they "...now have a tool to evaluate time estimates based on Y-SNP plus Y-STR genotyping in a systematic way." But this has opened up more questions, including some STRs being more useful than others. I suspect they are wondering more about saturation effects and which STRs are less prone to them, i.e. my reason for considering Steve Bird's more recent list of stable STRs in TMRCA.

And of course there is the question of which mutation rates to consider, as the evolutionary mutation rate seems to make a better fit, as they have shown. 1/500 is the evolutionary mutation rate. Mutation rate(s) and STR selection are huge factors to consider in divining TMRCAs.

MJost

Yeah, that's kind of what I said before. I don't think the evolutionary rate is the best estimate; it is just a rough estimate that seems to fit the data best over older time spans. Given that the mutation rate of an STR is a function of repeat number, the best estimators for long time spans will indeed be slow-mutating loci, but these too are subject to these effects, so I believe a Taylor linearization ought to be implemented to account for the variation of mutation rate as a function of repeat number. Of course, such a linearization would not work for fast-mutating STRs over longer time spans, given that the deviation from the center value would be too large, thus creating a greater error. In any case, from the observed data, it seems that the mutation rate of tandem repeats grows massively, roughly exponentially, over the first handful of repeats, namely from 6-10 repeats, but then its growth slows down considerably beyond the 10-15 repeat range.

MJost
08-02-2013, 04:06 PM
This study used only 33 HTs, and there were eight R1b's, of which four were F/S/GS(3). The father gave his son two mutations (DYS391 & 537) and the GSons had no changes. (FYI, Coal. TMRCA on the four is 54 years.) So my concerns: not enough HTs, and more F/Son HTs need to be utilized.

They did use 23 STRs,
393 390 19 391 385a 385b 439 389i 392 389ii-i 458 437 448 GataH4 456 576 570 438 481 549 533 635 643

Of which 15 were what Bird considered stable STRs
390 391 439 458 437 448 GataH4 456 576 570 481 549 533 635 643

So my point concerns the two mutations in the F/S/GS: one is on a stable STR and the latter STR is unstable. If the latter STR were thrown out, would the age be even longer? Using Bird's STRs, the F/S/GS now shows a Coal. of 50.5 years to MRCA. ...... NEED more.

MJost

alan
08-03-2013, 05:30 PM
http://dienekes.blogspot.co.uk/

On this diagram it seems to me that the E, G and J clades start exploding in one earlier period, then the R and I clades in another. There seems to be a two-period explosion. If the first is not farming then I don't know what it is. It would be pretty bizarre if the R and I clades' burst of branching corresponded with the early Neolithic and the G, E and J clades had their burst in the LGM. Seems to me that an early farming and then a later Neolithic/Copper Age explosion makes most sense of this.

DMXX
08-07-2013, 10:49 PM
I haven't posted before on the STR debate, but have been following it closely for some months. I don't believe this study's been posted on the forum before (either that or the Search function didn't pick it up):

Mutability of Y-Chromosomal Microsatellites: Rates, Characteristics, Molecular Bases, and Forensic Implications
Kaye N. Ballantyne, Miriam Goedbloed, Rixun Fang, Onno Schaap, Oscar Lao, Andreas Wollstein, Ying Choi, Kate van Duijn, Mark Vermeulen, Silke Brauer, Ronny Decorte, Micaela Poetsch, Nicole von Wurmb-Schwark, Peter de Knijff, Damian Labuda, Hélène Vézina, Hans Knoblauch, Rüdiger Lessig, Lutz Roewer, Rafal Ploski, Tadeusz Dobosz, Lotte Henke, Jürgen Henke, Manohar R. Furtado, and Manfred Kayser


Nonrecombining Y-chromosomal microsatellites (Y-STRs) are widely used to infer population histories, discover genealogical relationships, and identify males for criminal justice purposes. Although a key requirement for their application is reliable mutability knowledge, empirical data are only available for a small number of Y-STRs thus far. To rectify this, we analyzed a large number of 186 Y-STR markers in nearly 2000 DNA-confirmed father-son pairs, covering an overall number of 352,999 meiotic transfers. Following confirmation by DNA sequence analysis, the retrieved mutation data were modeled via a Bayesian approach, resulting in mutation rates from 3.78 × 10^-4 (95% credible interval [CI], 1.38 × 10^-5 to 2.02 × 10^-3) to 7.44 × 10^-2 (95% CI, 6.51 × 10^-2 to 9.09 × 10^-2) per marker per generation. With the 924 mutations at 120 Y-STR markers, a nonsignificant excess of repeat losses versus gains (1.16:1), as well as a strong and significant excess of single-repeat versus multirepeat changes (25.23:1), was observed. Although the total repeat number influenced Y-STR locus mutability most strongly, repeat complexity, the length in base pairs of the repeated motif, and the father’s age also contributed to Y-STR mutability. To exemplify how to practically utilize this knowledge, we analyzed the 13 most mutable Y-STRs in an independent sample set and empirically proved their suitability for distinguishing close and distantly related males. This finding is expected to revolutionize Y-chromosomal applications in forensic biology, from previous male lineage differentiation toward future male individual identification.


I'm going through the paper now and they've gone through almost 200 STR's, including the 17 in Y-Filer.

The paper itself is three years old; I've read posts on individual mutation rates for STR's before online but it doesn't seem population genetics papers have caught up with this approach yet? I cannot see why this is the case should it be true; if STR's are colloquially divided into slow-med-fast the global application of a single mutation rate (be it evolutionary or germline) is illogical.

That is, unless I (as well as the aforementioned authors) are missing something big.

TigerMW
08-07-2013, 10:51 PM
I see those Rho figures and the P312 ages in the 4200 and 4500 ybp range. They match nicely with long haplotype Y STR TMRCA based methods and I thought with Xue's SNP as well as Karafet's SNP methods as far as trying to extrapolate them downward into the closer in timeframes.


A calibrated human Y-chromosomal phylogeny based on resequencing
Wei et al.
Oct 2012
http://genome.cshlp.org/content/23/2/388.full

TMRCA of R1b check out the Rho numbers.
http://genome.cshlp.org/content/23/2/388/T1.large.jpg

MJost

Also, those interested may want to see these comments:
http://www.anthrogenica.com/showthread.php?377-A-calibrated-human-Y-chromosomal-phylogeny-based-on-resequencing&p=2153&viewfull=1#post2153

However, the Wei paper seems to line up more with the older ages resulting from Dr. Zhivotovsky's "effective" Y STR mutation rates.

Do you understand the calibration Wei used? I see Dr. T. Robb also has done his own SNP counting and comes out somewhere in between.

jeanL
08-08-2013, 12:42 PM
I see those Rho figures and the P312 ages in the 4200 and 4500 ybp range. They match nicely with long haplotype Y STR TMRCA based methods and I thought with Xue's SNP as well as Karafet's SNP methods as far as trying to extrapolate them downward into the closer in timeframes.

That's because he used a mutation rate of 1×10^-9 mutations/nucleotide/year. If you use the average of all papers published thus far, which is 0.7×10^-9, you get older datings (i.e. P312 being 6100 ybp old) that are no longer in agreement with the long STR haplotype TMRCA-based methods, see here:

http://www.anthrogenica.com/showthread.php?1167-If-the-R1a-and-b-clade-central-variance-dates-were-literally-true&p=10843&viewfull=1#post10843


I see Dr. T. Robb also has done his own SNP counting and comes out somewhere in between.

He generated a mutation rate by assuming the age of the CT node was 70 kya, and he came out with a very similar age estimate to the one I came out with using the mean mutation rate suggested by Michal. So the point is, SNP counting yields dates that are in between germline estimates and evolutionary estimates. However, as with STRs, SNP counting has its downside, but in a different way, i.e. not capturing enough diversity in small samples, or capturing extra diversity. Hence a simple SNP counting method should in fact be coupled with a coalescence analysis using Bayesian statistics to avoid biases due to recent relationships amongst the participants.

TigerMW
08-08-2013, 12:49 PM
Do you understand the calibration Wei used? I see Dr. T. Robb also has done his own SNP counting and comes out somewhere in between.

Terry Robb is using similar SNP counting techniques to estimate TMRCAs. He is open and courteous as well as intelligent, so we can ask him questions. He posted his most recent update here and I followed up with a question about why his numbers are different from Karafet's. This was in reference to "New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree" by Karafet, Mendez, Meilerman, Underhill, Zegura and Hammer in 2008. Hammer is FTDNA's Chief Scientist and at least was a key figure in the Y Chromosome Consortium (YCC), so it is an impressive crew.

Essentially, Robb's response is that he is using higher coverage data and therefore has better precision, and that his dates fall within the error ranges given by Karafet et al. Robb responded,

"Karafet et al (2008) have the following dates:
R -- 26,800 (19,900–34,300) years ago, and
R1 -- 18,500 (12,500–25,700) years ago.
And those dates are estimated on the basis of a few dozen SNP differences
(and hence the broad range).

The dates I get, using the high-coverage full-sequence y-chromosome data,
is roughly consistent:
My R would be 30,500 +/- 1400 years ago, and
My R1 would be 24,800 +/- 1300 years ago.

Any differences might be accounted for by differences in the quality of the
data used. I will check later to see if there might be any other reason." http://archiver.rootsweb.ancestry.com/th/read/Y-DNA-HAPLOGROUP-I/2013-08/1375934668

The relative age of R1 to R is about 69% (18.5/26.8) according to Karafet and about 81% (24.8/30.5) according to Robb, so the difference is not just the calibration of mutation rates and, per Robb's point, may just be a coincidence of Karafet's lower coverage data.
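The ratio check can be reproduced in a few lines. This is just a sketch of the arithmetic using the point estimates quoted above (in thousands of years); the function name is made up for illustration. It also shows why a ratio is a useful comparison: multiplying both ages by the same calibration factor leaves it unchanged, so a difference in ratios cannot come from the mutation-rate calibration alone.

```python
import math


def relative_age(child_age, parent_age):
    """Age of the younger node as a fraction of the older node's age."""
    return child_age / parent_age


karafet = relative_age(18.5, 26.8)  # R1 / R, Karafet et al. (2008)
robb = relative_age(24.8, 30.5)     # R1 / R, Robb's figures

print(f"Karafet: {karafet:.0%}, Robb: {robb:.0%}")

# Rescaling both ages by any common factor (a different mutation rate)
# does not move the ratio.
scale = 1.0 / 0.7
assert math.isclose(relative_age(18.5 * scale, 26.8 * scale), karafet)
```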

I'm still a little leery of the SNP counting methods which is why I think our conversations on these matters can be helpful, at least for me. I'm only leery because of the newness and my lack of understanding. I suppose I'm really wondering if it is time to say SNP counting methods are likely to be more accurate than Y STR based methods for the last 2000 to 1000 years?

Whether using STRs or SNPs, Ken Nordtvedt has explained it that all we are trying to do is build an aggregated or composite clock.
http://www.nav.ei.tum.de/fileadmin/w00bkq/layout/colloquium_suess_slides.pdf
Radiocarbon dating does the same thing. We are using statistics to average multiple data measurements together.

In the case of Y STRs, due to the fast mutation rates, we might consider that we are using a whole set of stopwatches. The more the better, and of course we could throw out the ones that we just don't think work, the non-linear duration STRs. This gives a lot of precision in the closer timeframes, but using stopwatches for measuring years may not work as well. SNPs are completely the opposite: they tick very, very slowly, maybe only once or twice, period. One tick is not very precise, but if we have millions of SNPs then we can count the number of clocks that ticked, so we can regain the precision using a vast number of measurements.
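To make the "count the clocks that ticked" idea concrete, here is a minimal sketch of an SNP-counting age estimate. Everything in it is an assumption for illustration: it treats SNP accumulation as a simple Poisson process, and the 50 SNPs and 10 million comparable bases are invented numbers, not data from Robb, Karafet or anyone else. Only the 0.8*10^-9 rate is a figure that appears in this thread.

```python
import math


def snp_count_age(n_snps, rate_per_base_per_year, comparable_bases):
    """Estimate age from an SNP count, assuming a Poisson mutation process.

    Returns (age_years, standard_error_years). The standard error uses
    sqrt(n_snps) as the Poisson uncertainty on the count itself.
    """
    snps_per_year = rate_per_base_per_year * comparable_bases
    age = n_snps / snps_per_year
    std_err = math.sqrt(n_snps) / snps_per_year
    return age, std_err


# Hypothetical inputs: 50 SNPs seen on a lineage over 10 million bases.
age, err = snp_count_age(50, 0.8e-9, 10_000_000)
print(f"{age:.0f} +/- {err:.0f} years")
```

Under this model the relative uncertainty is 1/sqrt(n_snps), so it shrinks as more SNPs are counted, which is the sense in which many slow clocks can regain precision. It says nothing about whether the SNPs counted come from stable regions.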

We have banged around Y STRs for a while and know that some don't have long enough linear durations (correlation with time) but do we know yet the characteristics of Y SNPs and how they can best be averaged together? I don't know, but I haven't seen the discussion and analysis on this yet.

I know that I have two SNPs that appear to have only happened twice (and survived) in our current Y phylogeny for mankind. For my R1b-L21 lineage's occurrence, they correlate beautifully with Y STR markers. However, in the other haplogroup's (I-M253) occurrence, there is one fellow that seems to stand alone. FTDNA has said that it appears that person had some kind of massive mutation event that caused his particular two SNP mutations along with a number of others. They also said they were leery of SNPs on this section of the Y chromosome that they felt was unstable.

In the case above, Robb is pointing to higher coverage as a positive, increasing the precision of his estimates versus Karafet's. This makes sense, but I worry that the higher coverage means more SNPs from unstable areas that foul up the mutation rates just like misbehaving (non-linear) Y STRs would. I don't know if Karafet's study took care to evaluate specific kinds of SNPs or if any of these new analyses are considering this and have assessed the potential impacts ???

TigerMW
08-08-2013, 12:57 PM
... He (Robb) generated a mutation rate by assuming the age of the CT node was 70 kya, and he came out with a very similar age estimate to the one I came out with using the mean mutation rate suggested by Michal. So the point is, SNP counting yields dates that are in between germline estimates and evolutionary estimates. However, as with STRs, SNP counting has its downside, but in a different way, i.e. not capturing enough diversity in small samples, or capturing extra diversity. Hence a simple SNP counting method should in fact be coupled with a coalescence analysis using Bayesian statistics to avoid biases due to recent relationships amongst the participants.

Makes sense. I think eventually there will be another wave of what Nordtvedt calls composite clocks, but they will aggregate both SNP and STR measurements. I can see the day will come where our confidence levels in this will be similar to radiocarbon dating. In some ways, interclade STR methods are just the beginning of that, but they don't exploit the vast number of SNPs now available.

FTDNA's Greenspan has said that he felt that, given the way they did true germ-line Y STR mutation rate calculations (based on their staff's genealogies) and with 111 markers, they would be very accurate for 2000 years. Beyond that he thought there could be a loss of precision, but since they were a genetic genealogy company that wasn't their thing anyway.

MJost
08-08-2013, 01:15 PM
TDRobb just posted an update on Genealogy-DNA with a revised mutation rate that produces a slightly younger P312 that would split at 6,200 (+/- 700 SD) years ago. I am still not convinced on using snps until we get a lot of Full Y results.

He stated:
"...updated the computed y-tree using a nominal y-mutation rate of
0.8*10^-9 mutations/nucleotide/year.
See
http://www.goggo.com/terry/HaplogroupI1/#CompleteGenomicsTMRCA

If you want to use another y-mutation rate, then just rescale the dates
accordingly.

Some dates from the tree, based on the high-coverage full sequence
y-chromosome data from Complete Genomics would be as follows:

* The Out-of-Africa CT-M168 haplogroup then splits at 71,200 (+/- 2200 SD)
years ago.
* The Afrasian DE-M145 haplogroup then splits at 64,500 (+/- 2100 SD) years
ago.
* The Eurasian F-M89 haplogroup then splits at 51,300 (+/- 1900 SD) years
ago.
* The Into-America Q-M3 haplogroup then splits at 14,400 (+/- 1000 SD)
years ago.

* Haplogroup C-M130, would split off prior to F-M89. But no sample was
available for C-M130.
* Haplogroup I1-M253 would split prior to 6,600 (+/- 700 SD) years ago. But
need a I1-Z131 sample to date I1-M253.

* Haplogroup R1b-P312 (a subgroup of R1b-M269 found mainly in Europe),
would split at 6,200 (+/- 700 SD) years ago.

The dates are essentially the same as I computed previously, where instead
I calibrated using CT-M168 at 70,000 years ago.

Note I have now added another Q-M3 sample so that I could get an estimate
for when that haplogroup split in America. The y-mutation rate and the Q-M3
date is consistent with the paper by Poznik et al published last week in
Science 341, 562 (2013).

I should mention again that I am assuming a y-mutation rate of 0.8*10^-9
mutations/nucleotide/year, and all dates are relative to that assumption.

Terry"

MJost

MJost
08-08-2013, 01:39 PM
I also just ran MikeW's list of 111 marker haplotypes that are positive for L21 or better, removing 19 duplicate HTs.

Using Bird's Stable STRs 111(48), Bird's q<=.07, NO MCM's used. Sheet Mutation Rate: 0.17703

YBP       +OR-YBP   Max-YBP
3,829.8   805.6     4,635.5
This SD has about a 90% CI.

Stepping out to a higher confidence level of 95.45% (2-sigma):
CI +OR-YBP
983.4
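For anyone wanting to move between confidence levels themselves, here is a minimal sketch that assumes the estimate's error is normally distributed. That is an assumption of this illustration, not necessarily how the spreadsheet derives its figures, and the function names are made up.

```python
from statistics import NormalDist


def z_for_level(level):
    """Two-sided z-score for a confidence level such as 0.90 or 0.9545."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)


def rescale_half_width(half_width, level_from, level_to):
    """Convert a +/- range quoted at one confidence level to another level."""
    sigma = half_width / z_for_level(level_from)
    return sigma * z_for_level(level_to)


# The +/- 805.6 years above, treated as exactly a 90% interval.
two_sigma_years = rescale_half_width(805.6, 0.90, 0.9545)
print(round(two_sigma_years, 1))
```

Treating 805.6 as exactly 90% gives a 2-sigma range near 980 years, close to but not identical to the 983.4 shown, which is consistent with the sheet's interval being only "about" 90%.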

TigerMW
08-08-2013, 03:29 PM
I also just ran MikeW's list of 111 marker haplotypes that are positive for L21 or better, removing 19 duplicate HTs.

Mark, can you do that for P312 as a whole? You'd have to combine the P312xL21 and L21 ExtHts tabs together. I'm not sure how to make it unbiased. Perhaps there should be a random, equal selection of L21, U152 and DF27 people.

MJost
08-09-2013, 02:29 AM
Ok, I added a macro to import your P312xL21 haplotypes. I ended up removing all the suspects to be able to combine the 67 lists and had plenty of rows for the 111, but again I only keep those HTs with positive P312 or better results. I ended up with a little over 1,900 111-marker HTs and 5,535 with the 67 length. I also removed Ysearch HTs. I left out the L21 duplicate HTs but did not check your P312 list as it didn't seem to affect the entire picture much.

I am going to upload a V8.5 (Re-fixed the Dup checker and added the P312xL21 import to the last row of your present sheet) and it will have the P312+'s All included.

I ran some initial checks and all of P312 is still young. Using Bird's 48 stable STRs out of the 111 marker panel with 30 years per generation, the results for Intraclade Founder's Modal Age are:

131.4 +-27.2 generations
3,941 +-817 years before present
with a max of 4,758 years

This SD range is at 94% probability (2-sigma is at 95.45%), so the likelihood that these haplotypes fit within this range is very good.

At 2-sigma (95.45%) this would be an 868-year range around the basal age of 3.94K ybp (just under 2000 BC).
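For readers who have not seen a variance-based age calculation, here is a stripped-down sketch of the general idea. It is not the TMRCA Estimator spreadsheet's method: it assumes a symmetric single-step mutation model, takes the per-marker mode as a stand-in for the founder's value, and uses invented haplotypes and mutation rates.

```python
from statistics import mode


def variance_generations(haplotypes, rates):
    """Estimate generations to the founder from STR variance.

    haplotypes: list of equal-length lists of repeat counts.
    rates: per-marker mutation rates per generation, same marker order.
    Sums the mean squared deviation from the modal value over all markers
    and divides by the summed mutation rate.
    """
    total_variance = 0.0
    for i in range(len(rates)):
        values = [h[i] for h in haplotypes]
        modal = mode(values)
        total_variance += sum((v - modal) ** 2 for v in values) / len(values)
    return total_variance / sum(rates)


def generations_to_years(generations, years_per_generation=30):
    """Convert generations to years; 30 is the figure used in the post."""
    return generations * years_per_generation


# Invented two-marker example, purely to show the mechanics.
sample = [[12, 14], [12, 15], [13, 14], [12, 14]]
gens = variance_generations(sample, [0.002, 0.003])
print(gens, generations_to_years(gens))
```

With 30 years per generation, 131.4 generations converts to about 3,942 years, matching the rounded figure above.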

The data and TMRCA Estimator is found here: http://tinyurl.com/TMRCA-Estimator

What can I say, but just to repeat what Anatole Klyosov said today on a thread: "In short, the SNP-based calculations are promising, however, they should be handled properly. At the moment they are MUCH less reliable compared with the STR based calculations." (genealogy-dna on rootsweb.com - [DNA] New results for TMRCA of Y-Haplogroups - based of Complete Genomics data).

Didn't we all think that Mr P312 wasn't very many generations prior to L21?

MJost
08-27-2013, 04:37 PM
I'm not entirely sure I know what you mean, change from what ?

Personally I don't find coalescence age calcs that useful. Generally you get a feel for the relative ages of clusters by looking at them, and when comparing clusters interclade calcs are preferable, but if you want to do some, help yourself.

I'd say the Swede cluster is probably the youngest of the Z14+, Z372- groups with the East Anglia and Cumberland coming in joint second.

I am showing that the TMRCA of an observed divergence reflects the migration time passed since the ancestor founded this particular population. Nothing more. Not the date of the Founder's Age. I am observing that coalescence via variance shows that divergence is mostly due to migration. Useful to most others.

As per Wiki:

"TMRCA calculations are considered critical evidence when attempting to determine migration dates of various populations as they spread around the world. For example, if a mutation is deemed to have occurred 30,000 years ago, then this mutation should be found amongst all populations that diverged after this date. If archeological evidence indicates cultural spread and formation of regionally isolated populations then this must be reflected in the isolation of subsequent genetic mutations in this region. If genetic divergence and regional divergence coincide it can be concluded that the observed divergence is due to migration as evidenced by the archaeological record. However, if the date of genetic divergence occurs at a different time than the archaeological record, then scientists will have to look at alternate archaeological evidence to explain the genetic divergence. The issue is best illustrated in the debate surrounding the demic diffusion versus cultural diffusion during the European Neolithic.[13]" Morelli L, Contu D, Santoni F, Whalen MB, Francalacci P, et al. (
2010).

http://en.wikipedia.org/wiki/Most_recent_common_ancestor

MJost

jdean
08-27-2013, 04:51 PM
alternatively : )

http://newsarch.rootsweb.com/th/read/GENEALOGY-DNA/2013-04/1365634125

[[[ Mikewww/Moderator on 08/27/2013: STR vagaries, TMRCA and mutation rate controversies are okay but they can sink any one thread so I'm going to move any posts that seem to be going off on a tangent over to the STR Wars thread: http://www.anthrogenica.com/showthread.php?828-STR-Wars-GDs-TMRCA-estimates-Variance-Mutation-Rates-amp-SNP-counting

I moved some of these posts over the P312/Celtic/U106/Germanic thread ]]]

MJost
08-27-2013, 05:06 PM
alternatively : )

http://newsarch.rootsweb.com/th/read/GENEALOGY-DNA/2013-04/1365634125

SNP counting/aging was the subject and is what I am sure Thomas was alluding to in this thread, not haplotype aging. Not quite mainstream and proven yet.

MJost

TigerMW
08-28-2013, 04:27 PM
Mark, I'm updating the U106 file but it'll probably take a day or two yet. There must be something interesting about Scotland's haplotypes. I don't get the higher variance but I think you are getting higher TMRCAs. It must be related to specific slower markers that are more varied there. Anyway, let me collect the data first before we attempt any further analysis.

TigerMW
09-23-2013, 05:08 PM
I created this graphic for a new group/project but I'll post this here too.

I use STRs to try to classify people into varieties/clusters but I consider this a speculative endeavor and strongly advocate SNP testing.

A picture is sometimes easier to understand than words so here is an illustration of how people from different haplogroups only related thousands of years ago can end up with the same STR value even though their haplogroups may have started out differently for that STR.

This shows why it is important to have more STRs. The more off-modal or unusual STR values you have in common with a particular variety/group the more likely you really are in that group.

Still, the STRs fool us from time to time so SNP testing is a very good checkpoint.
https://dl.dropboxusercontent.com/u/17907527/STR_Convergence.jpg

MJost
09-23-2013, 06:02 PM
Mark, I'm updating the U106 file but it'll probably take a day or two yet. There must be something interesting about Scotland's haplotypes. I don't get the higher variance but I think you are getting higher TMRCAs. It must be related to specific slower markers that are more varied there. Anyway, let me collect the data first before we attempt any further analysis.

I pulled your latest U106 data and posted an R1b dataset in my TMRCA Estimator sheet, linked in this post:
http://www.anthrogenica.com/showthread.php?560-A-TMRCA-Estimator-Excel-spreadsheet&p=13898&viewfull=1#post13898

Let's do some deeper variance analysis with it when you have some time.

MJost

MJost
09-23-2013, 06:20 PM
A picture is sometimes easier to understand than words so here is an illustration of how people from different haplogroups only related thousands of years ago can end up with the same STR value even though their haplogroups may have started out differently for that STR.
Visuals sure make it easier to get the convergence (overlapping) concepts.


This shows why it is important to have more STRs. The more off-modal or unusual STR values you have in common with a particular variety/group the more likely you really are in that group.


In my own Variety 1130-A1's that generally have around 16 Off-modal markers from the L21 founder .
If I ever get out of DF13* status all of my 40 plus haplotypes might just follow along with me depending on the age of the new SNP(s) found.

Mutation Rates are sorted slower to faster left to right

DYS: 531 / 497 / 511 / 441 / 19 / 385a / 513 / 447 / 552 / 446 / 557 / 533 / 464d / 534 / 449 / 576
M-L21: 11 / 14 / 10 / 13 / 14 / 11 / 12 / 25 / 24 / 13 / 16 / 13 / 17 / 15 / 30 / 18
1130A1: 12 / 15 / 11 / 14 / 15 / 12 / 11 / 24 / 25 / 14 / 15 / 12 / 18 / 16 / 31 / 17

Off-Modal ranges:
531=>12, 497=15, 511=11, 19=>15, 385a=12, 441=14, 552=25, 447=24, 513=11, 557=<15, 446=14, 464d=18, 456=18, 534=16, 449=31, 576=17, 710=36, 712=>21 (533=<13)
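The comparison in the three-row table above (DYS / M-L21 / 1130A1) can be checked mechanically. This sketch uses only the values from that table; the variable names are made up, and the "genetic distance" here is a plain sum of absolute repeat differences, which is one common convention rather than any particular vendor's calculation.

```python
markers = ["531", "497", "511", "441", "19", "385a", "513", "447",
           "552", "446", "557", "533", "464d", "534", "449", "576"]
l21_modal = [11, 14, 10, 13, 14, 11, 12, 25, 24, 13, 16, 13, 17, 15, 30, 18]
variety_1130a1 = [12, 15, 11, 14, 15, 12, 11, 24, 25, 14, 15, 12, 18, 16, 31, 17]

# Markers where the variety differs from the L21 modal, with both values.
off_modal = {
    name: (modal, value)
    for name, modal, value in zip(markers, l21_modal, variety_1130a1)
    if modal != value
}

# Stepwise distance: total number of repeat steps separating the two rows.
genetic_distance = sum(abs(m - v) for m, v in zip(l21_modal, variety_1130a1))

print(len(off_modal), "off-modal markers, distance", genetic_distance)
```

Every one of the 16 listed markers is one step away from the L21 modal, so the count of off-modal markers and the stepwise distance come out the same.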

Mike, maybe a new variety designation is due.


MJost

razyn
09-24-2013, 12:59 AM
In my own Variety 1130-A1's that generally have around 16 Off-modal markers from the L21 founder.
MJost

I have yet to be persuaded that modals tell us the marker values of the founder. They tell us about his most numerous survivors -- in the case of L21, several thousand years later. On the other hand, shared off-modals are really good pointers to subclades, I wouldn't argue with that.

TigerMW
09-24-2013, 02:46 AM
I have yet to be persuaded that modals tell us the marker values of the founder. They tell us about his most numerous survivors ...
I agree 100%, however, at least they give us a starting point for determining the ancestral values.

... and you can do more work with them. For instance, you can look at the modals for the peer and paragroup subclades. If they all match up with the modal for the subclade in question, your triangulation is about as good as we can get for an ancestral value.

Such would be the case for most of the Super WAMH modal markers. You can look at L11*, U106, Z18, Z381, P312, DF27, U152, L21 and with a little bit of triangulation you can get a pretty good fix on the ancestral values for some of these.
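The triangulation idea can be sketched as a simple agreement check across subclade modals. The clade names and repeat values below are placeholders invented for illustration, not real modal values, and the rule used (accept a value only when every clade's modal agrees) is a deliberately strict simplification of what is described above.

```python
def triangulate_ancestral(modals_by_clade):
    """Infer ancestral marker values where all clade modals agree.

    modals_by_clade: dict of clade name -> dict of marker -> modal value.
    Returns dict of marker -> agreed value, or None where clades disagree
    or a clade lacks that marker.
    """
    all_markers = set()
    for modal in modals_by_clade.values():
        all_markers.update(modal)

    result = {}
    for marker in sorted(all_markers):
        values = {modal.get(marker) for modal in modals_by_clade.values()}
        # A single shared, non-missing value counts as agreement.
        agreed = len(values) == 1 and None not in values
        result[marker] = values.pop() if agreed else None
    return result


# Placeholder data: three sibling clades, two markers.
example = {
    "cladeA": {"marker1": 13, "marker2": 24},
    "cladeB": {"marker1": 13, "marker2": 23},
    "cladeC": {"marker1": 13, "marker2": 24},
}
ancestral = triangulate_ancestral(example)
print(ancestral)
```

A marker that comes back as None is not necessarily unknowable; it just needs a closer look at the tree structure or the relative ages of the clades.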

Fire Haired
09-24-2013, 02:58 AM
Mikwww do u have any idea what the STR of the 4,600 year old Bell Beaker R1b is. Do u have any info to show evidence that it is R1b L11. Because I know one was positive as M269 I think it would make total sense the Bell Beaker R1b is connected with 50% R1b L11 in modern western Europe and would be under R1b L11 or at least R1b L51. Also I emailed u access to my raw data with SNP's from geno 2.0 hopefully it can help figure out if I am for sure P312 or Df27.

MJost
09-24-2013, 03:19 AM
I have yet to be persuaded that modals tell us the marker values of the founder. They tell us about his most numerous survivors -- in the case of L21, several thousand years later. On the other hand, shared off-modals are really good pointers to subclades, I wouldn't argue with that.

The modal haplotype does not necessarily correspond with the ancestral haplotype; the "modal haplotype" is simply the most frequently occurring haplotype in the set of haplotypes under study. Where possible, it requires enforcing SNP tree containment to eliminate convergence, along with consideration of any sub-clade structure(s), since these all affect the statistical confidence, such as the effect of the M222 subclade on the entire L21 clade. The larger the set of haplotypes the more accurate the modal will be. Ken N is the champion of the modal method.
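A minimal sketch of computing a modal, taking the most common value marker by marker, which is the usual practical shortcut for long haplotypes where no two people match exactly. The haplotypes are invented for illustration, and the tie handling is just whatever value was seen first, which a real analysis would want to treat more carefully.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Return the per-marker most common repeat value.

    haplotypes: list of equal-length lists of repeat counts.
    Ties go to the value encountered first, since Counter.most_common
    keeps first-seen order among equal counts.
    """
    modal = []
    for values in zip(*haplotypes):
        most_common_value, _count = Counter(values).most_common(1)[0]
        modal.append(most_common_value)
    return modal


# Invented three-marker haplotypes.
sample_hts = [
    [13, 24, 14],
    [13, 25, 14],
    [12, 24, 14],
]
print(modal_haplotype(sample_hts))
```

Note that a large, young subclade in the sample pulls the modal toward its own values, which is the M222-within-L21 effect mentioned above.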

MJost

razyn
09-24-2013, 02:27 PM
I'm not objecting to the foundation stones of the STR variance method, only to the imprecise language use. In this case, "16 off-modal markers from the L21 founder." It's really just 16 off-modals from the L21 modal, and we don't know the L21 founder's STR marker values -- though the modal surely points in their direction, and is probably (i.e. statistically) right for quite a few of them. In older and/or more highly fragmented haplogroups (like L21) there have probably been enough back mutations (now invisible), lineage extinctions, migrations to the country doing most of the testing, and other genetic issues to skew the modal somewhat in favor of the descendants time has proven to be the most successful breeders within the sampled population. And we should not assume that their averaged haplotype (aka modal) is that of the founder. It's meaningful, but that's not what it means.

TigerMW
09-26-2013, 03:37 PM
Mikwww do u have any idea what the STR of the 4,600 year old Bell Beaker R1b is. Do u have any info to show evidence that it is R1b L11. Because I know one was positive as M269 I think it would make total sense the Bell Beaker R1b is connected with 50% R1b L11 in modern western Europe and would be under R1b L11 or at least R1b L51. Also I emailed u access to my raw data with SNP's from geno 2.0 hopefully it can help figure out if I am for sure P312 or Df27.

Are you talking specifically of the R1b Bell Beaker skeleton found in Kromsdorf, Germany? I think there is limited STR information available, but I think you can find it through internet searches for the related research paper(s). It may take a couple of variations in search words but I have in my notes the name of the paper is "Emerging Genetic Patterns of the European Neolithic: Perspectives From a Late Neolithic Bell Beaker Burial Site in Germany" by Lee, published in 2012.

If instead, you are wanting a guess at the ancestral STR values for the R1b-L11 most recent common ancestor, I think they can best be estimated by following the method I outlined in an earlier post on this thread.

I agree 100%, however, at least they give us a starting point for determining the ancestral values.

... and you can do more work with them. For instance, you can look at the modals for the peer and paragroup subclades. If they all match up with the modal for the subclade in question, your triangulation is about as good as we can get for an ancestral value.

Such would be the case for most of the Super WAMH modal markers. You can look at L11*, U106, Z18, Z381, P312, DF27, U152, L21 and with a little bit of triangulation you can get a pretty good fix on the ancestral values for some of these.

krkerwin
09-30-2013, 07:31 PM
Mike:
You are exactly on track. I've followed your work for 5 years and in fact used your work to understand the South Irish base haplotype. You encouraged someone to take on the South Irish project and I was the first to answer up. Alex Williamson has provided me much insight into questions regarding SNPs. Of course I primarily use Anatole's (Dr. Klyosov) work in my research. I also just confirmed Dr. Nordtvedt’s base haplotype of the South Irish with his primary and secondary marker sets. Understanding comparative TMRCA formulas (Anatole's and Dr. Nordtvedt’s in particular) is very interesting to me since both scientists take different paths to confirm areas that I incorporate in my research.

SNPs have been overvalued because hobbyists can see them without reading or understanding anything else, or simply repeat what others say without confirming the knowledge. That said, I am really starting to understand what building the STR signatures of base haplotypes and comparing them to SNPs really means, while trying to clearly identify the South Irish base haplotype and the 4466 SNP and sub SNPs. SNPs occur randomly but are not inherited by the brothers, first cousins, uncles etc., so that a base haplotype can include those with and without the SNP. The base haplotype is higher in the tree. Pinpointing where this location is will naturally be a moving target because it is based on SNP testing and projected base haplotype STR signatures. I base the 4466 and South Irish haplotypes on member results who have tested + or - for SNPs so there is certainty and not speculation. There should be a continual feedback loop based on reassessment when new evidence is identified (every new 4466 + or - provides new evidence) and incorporated into the study. Every new +/- relevant SNP changes the base haplotype STR signature.

The Geno 2.0 results (once the troubleshooting is complete) will provide invaluable evidence of the SNP hierarchy and the specific base haplotype analyses. Much work needs to be done for this to occur.

The hobbyists have provided a wealth of SNP testing that has given more clarity to the phylogenetic trees for SNPs and base haplotype analysis.

That said, much work needs to be done to answer these questions about SNPs and base haplotypes. This is the basis of my current research.

What is it really all about? I believe it comes down to understanding a member's sub branch so they can see the best possible picture of their actual sub branch. Where did they come from? What group did they belong in? What is the migration pattern of their ancestors? Can they pinpoint a location so they can visit and say to themselves "my ancestors could very well have walked this path, lived in this town, or shared this common history"? This is the reason I overlap the research with family histories that can be compared and reviewed in the sub branch grouping. Curiously this is also the way scientists test their TMRCA theories.

Sidebar:
I first started my efforts trying to find clan or surname signatures. This turned out not to be the rule of thumb. However, the very rare clan or surname signatures still may be possible to identify. In particular the Eoganacht Aine - O'Kirby surname, Eoganacht Locha Lein - Moriarty surname, Eoganacht Chaisil - Sullivan Mor. As with any science it's the bottleneck, the exception or error that proves a point. In this case it may be groups with fewer members. In identifying the 4466 and/or South Irish STR signature (they are not the same and where they are in relation to each other is still a moving target), I noticed, using primary and secondary marker sets, possible uniqueness in the STR marker signatures of those with the Kirby, Moriarty or Sullivan Mor ancestry.

These groups are not extinct but small in numbers which may hold the key to possible sub group STR signatures. I've also seen groupings of surnames in sub branches, however since I'm not absolutely convinced of the sub branches yet this remains to be seen. My current delay in building sub branches is that the Geno 2.0 SNP results need troubleshooting from the programming and system testing level and cannot yet be relied upon. Once the Geno 2.0 SNP results are stable, I'll first build the 4466 SNP hierarchy and then separate out those with the 4466 STR signature and sub SNP signatures before building sub branches. I have already seen excellent results that may prove that there is a strong relationship between STR signatures and the 4466 sub SNPs. This identified sub branch will likely have a TMRCA calculation that is in the reasonable range and may confirm that this sub branch is relevant within 300 years.

I very likely will not be adding much input into this forum in the present since I have a mountain of work in front of me, however this sub forum is the basis of my research and I will be contributing in the future.

Kathleen



This may seem a little off track, but bear with me. This is about understanding MRCAs....

What's the value of a haplogroup?

What's the value of an SNP?

You might be surprised to hear me say this but I think there is very little value in haplogroups and SNPs
... at least in and of themselves.

A haplogroup is just a group of people with a common ancestor.

An SNP is just a single nucleotide polymorphism, a mutation, that marks the group of people with a common ancestor. It is just a signpost on a branch of the human family tree. The true nature of the haplogroup of people, any commonality in culture, location, etc., may not align with the SNPs that have marked the lineages. The SNP could mark either a subset or superset of the true group of people we care about.

This gets into some notions about value and philosophical concerns, but these are the points I'm getting at.

1) I do not care too much about all of the extinct lineages of mankind. There are many, many extinct lineages. On the Y chromosome/paternal side probably there are many, many more lineages that have gone extinct than those who survive.

2) I do care about how we got here and how, where, when and why they did what they did to get us to where we are today.

I think these notions are just conveying that what many hobbyists may care about most is the connection to genealogy and deeper ancestry.... and specifically our ancestry.

The net is that the most recent common ancestors (MRCAs) of the various branches remaining today (and in recent history) are critical people to try to understand. The more MRCAs we can understand, at more layers and branches in the tree, the better our chance of understanding our ancestry.

I am not saying that all of the old extinct lineages were not important people or that SNPs are useless. I'm just trying to say they are most important in how they help us understand who we are and how we got here. They are just bread crumbs from an old trail.

Superconducting supercolliders smash atoms and look at the residue of the accident to try to get more detail on the characteristics of the atom. In the case of genetics; the accidents, bottlenecks, growth spurts, etc. have already taken place but, likewise, we are looking at the residue to try to ascertain what happened.

I don't care when an SNP first occurred. I care about the expansion and movements of my ancestry. The SNP-marked haplogroup ages may help put a maximum age in place for my ancestry. That's good, but it's not really the haplogroup I'm after.


P.S. Science may be interested in who the genetic Adam was or wasn't and some other things. That's fine with me but I'm really after understanding how we, the survivors, got here.

Michał
10-11-2013, 05:23 PM
Below please find my SNP-based TMRCA estimates (in ky) for different R1b clades (and some upstream haplogroups) present in Sardinia. These estimates are based on an assumed mutation rate of 0.7 x 10^-9 per nucleotide per year (chosen for reasons mentioned in another thread (http://www.anthrogenica.com/showthread.php?709-New-DNA-Papers/page5&p=10824#post10824)), while the numbers given in parentheses correspond to the TMRCA values calculated using the mutation rates of 0.82 x 10^-9 and 0.53 x 10^-9, as proposed by Poznik and Francalacci, respectively.

63.0 (53.7-82.8) haplogroup F
61.9 (52.8-81.3) haplogroup IJK
58.8 (50.1-77.3) haplogroup K
40.2 (34.3-52.9) haplogroup P
36.6 (31.2-48.2) haplogroup I
33.5 (28.6-44.1) haplogroup R
27.6 (23.5-36.3) haplogroup R1
22.9 (19.5-30.1) R1b-P25
14.9 (12.5-19.6) R1b-V88
8.6 (7.3-11.3) R1b-M269
8.3 (7.1-10.9) R1b-L23
7.6 (6.5-10.0) R1b-L51
7.4 (6.3-9.7) R1b-Z2105
7.2 (6.1-9.5) R1b-M269(xL23)
6.6 (5.6-8.6) R1b-L11
6.2 (5.3-8.2) R1b-P312
6.1 (5.2-8.0) R1b-U152
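
A rough way to see where the bracketed ranges come from: for a fixed SNP count, a TMRCA estimate scales inversely with the assumed mutation rate. The sketch below (the function name is mine, not from the post) rescales a few of the 0.7-based values to the Poznik and Francalacci rates. It only approximately matches the posted brackets, presumably because those were computed from unrounded figures.

```python
# SNP-counting TMRCA scales inversely with the assumed mutation rate:
#   T_new = T_ref * (rate_ref / rate_new)
REFERENCE_RATE = 0.70  # x 10^-9 per nucleotide per year, rate used for the main column

def rescale_tmrca(tmrca_ky, new_rate, reference_rate=REFERENCE_RATE):
    """Rescale a TMRCA (in ky) computed at reference_rate to new_rate."""
    return tmrca_ky * reference_rate / new_rate

# A few values taken from the list above.
for clade, age_ky in [("haplogroup F", 63.0), ("R1b-M269", 8.6), ("R1b-U152", 6.1)]:
    younger = rescale_tmrca(age_ky, 0.82)  # faster rate (Poznik) gives a younger age
    older = rescale_tmrca(age_ky, 0.53)    # slower rate (Francalacci) gives an older age
    print(f"{clade}: {age_ky:.1f} ky ({younger:.1f}-{older:.1f})")
```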

Assuming that the number of downstream mutations found in members of some poorly represented subclades is not a reliable source of data (due to some technical reasons associated with using the low pass sequencing method), I have instead used the distance (i.e. the number of SNPs) between a parent clade and a common ancestor of a given subclade as a basis for calculating the age (TMRCA) of every subclade. For estimating the age of haplogroup F, I have used the average number of mutations downstream of haplogroup F in members of the well-represented (in Sardinia) clade I2a-M26, which was about 404 mutations. It is worth noting that the average number of mutations downstream of haplogroup F in clade R1b-U152 (a clade that is also frequent in Sardinia, but not as common as I2a-M26) was close to the above number but, nevertheless, evidently lower (392). Thus, when basing similar calculations on this slightly reduced number of SNPs found in members of R1b-U152, we get lower TMRCA values, as shown below.

61.3 (52.3-80.6) haplogroup F
60.2 (51.3-79.1) haplogroup IJK
57.1 (48.7-75.0) haplogroup K
38.5 (32.9-50.6) haplogroup P
34.9 (30.0-45.9) haplogroup I
31.8 (27.1-41.8) haplogroup R
25.9 (22.1-34.0) haplogroup R1
21.2 (18.1-27.9) R1b-P25
13.3 (11.3-17.4) R1b-V88
6.9 (5.9-9.0) R1b-M269
6.6 (5.6-8.6) R1b-L23
5.9 (5.1-7.8) R1b-L51
5.6 (4.8-7.4) R1b-Z2105
5.5 (4.7-7.2) R1b-M269(xL23)
4.8 (4.1-6.4) R1b-L11
4.5 (3.9-5.9) R1b-P312
4.4 (3.7-5.7) R1b-U152

Neither of the above sets of TMRCAs can be considered secure, but I think the first approach is slightly more likely to give correct values when using those Sardinian data alone.
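
The approach described above can be sketched roughly as follows: the age of a node is the number of SNPs still separating it from the present (the root-to-tip average minus the SNPs between haplogroup F and that node), multiplied by the years per SNP. The years-per-SNP value is back-computed from the first list (63.0 ky over about 404 SNPs), and the per-clade SNP distances are hypothetical placeholders, since the post does not give them. The sketch does not reproduce the lists exactly, but it shows why swapping 404 for 392 lowers every clade by the same absolute amount.

```python
# Back-computed from the first list: 63.0 ky for haplogroup F over about 404 SNPs.
YEARS_PER_SNP = 63_000 / 404

# Hypothetical SNP distances from the haplogroup F node down to each clade's
# common ancestor. These are NOT given in the post; they are illustrative only.
SNPS_FROM_F = {"haplogroup F": 0, "R1b-M269": 349, "R1b-U152": 365}

def node_ages_ky(root_to_tip_snps, snps_from_root=SNPS_FROM_F,
                 years_per_snp=YEARS_PER_SNP):
    """Age of each node (ky) = SNPs remaining below the node x years per SNP."""
    return {clade: (root_to_tip_snps - dist) * years_per_snp / 1000
            for clade, dist in snps_from_root.items()}

ages_i2a = node_ages_ky(404)   # root-to-tip average taken from I2a-M26
ages_u152 = node_ages_ky(392)  # root-to-tip average taken from R1b-U152

for clade in SNPS_FROM_F:
    drop = ages_i2a[clade] - ages_u152[clade]
    print(f"{clade}: {ages_i2a[clade]:.1f} vs {ages_u152[clade]:.1f} ky "
          f"(difference {drop:.2f} ky)")
```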

parasar
11-26-2013, 03:22 AM
Michał,

What would you estimate the age of M479 to be? There is an R2 Francalacci et al sample. It seems to be far older than R1 counting the number of mutations downstream from M207 say to HG03727 ITU. http://www.yfull.com/tree/R/

From the Mal'ta sample:
http://r1b.org/imgs/MA-1_Tree.png

Michał
11-26-2013, 03:32 PM
Michał,
What would you estimate the age of M479 to be? There is an R2 Francalacci et al sample.
I don’t have enough data to do so. This would require comparing the data for at least two fully sequenced people representing different sublineages of M479 (i.e. R2a and R2b).


It seems to be far older than R1 counting the number of mutations downstream from M207 say to HG03727 ITU. http://www.yfull.com/tree/R/

I am not sure what you mean by mutations downstream from M207 to HG03727 ITU. Are you saying that the number of mutations downstream of M207 in sample HG03727 is significantly larger than the number of mutations downstream of M420 in any R1a member? Have you seen such data? I am not aware of any R1a* member who has been fully sequenced, and as long as we lack this kind of data, it is impossible to provide any SNP-based calculations for the TMRCA of R1a.

newtoboard
11-26-2013, 03:39 PM
Is there such a thing as R2b?

Michał
11-27-2013, 01:11 AM
Is there such a thing as R2b?
Not yet, so I should indeed have written R2* and R2a instead.

alan
11-27-2013, 11:17 PM
It certainly looks as though SNP counting on modern subjects, plus just a modest amount done on radiocarbon-dated ancient subjects, could really resolve the dating issues for yDNA in the near future. It looks like the method is in place and it's just a case of doing it now.

Michał
01-06-2014, 11:35 PM
Using the full Y-DNA sequencing data from the recent Ashkenazi-Levites paper by Rootsi et al., I've done some calculations that should allow us to compare the ages of some major subclades of R1a and R1b. I have used a slightly modified mutation rate (0.66 x 10^-9 per nucleotide per year, instead of 0.7 x 10^-9 per nucleotide per year), mostly because this new value is not only the average of the rates calculated by Francalacci (0.53), Poznik (0.82) and Mendez (0.62), but was also suggested by another forum member, James Dow Allen (http://www.anthrogenica.com/showthread.php?1507-Some-provisional-calculations-for-haplogroup-R1a-based-on-the-first-FGC-result/page4&p=20887#post20887), to be consistent with the data provided by both Raghavan (the Mal'ta paper) and Francalacci (the Sardinian paper). This rate of 0.66 corresponds to 165 years per mutation found in the 8.97 Mb Y-DNA sequence.

Below please find my new TMRCA estimates (in ky). The values shown in the parentheses correspond to my two previous sets of estimates that were based on the Sardinian data. For all clades below the R1a-Z645 and R1b-L23 levels (where only some relatively small numbers of individuals were available), the estimates were based on taking into account both the average number of mutations downstream of a given branching point and the number of mutations separating this particular branching point from an upstream node.

38.8 (40.2, 38.5) haplogroup P
32.2 (33.5, 31.8) haplogroup R
26.9 (27.6, 25.9) haplogroup R1
8.0 (8.3, 6.6) R1b-L23
6.7 (7.4, 5.6) R1b-Z2103
6.1 (6.6, 4.8) R1b-L11
6.0 R1a-Z645
5.7 R1a-Z93
5.4 R1a-Z94
5.3 R1a-Z282
5.2 R1a-Z2123
5.2 R1a-Z2122
4.3 R1a-L657
3.9 R1a-M582
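
As a quick arithmetic check of the rate choice described above, the three published rates quoted in the post can be averaged directly. This is only a sketch of that one step. The dictionary keys are labels for the rates as cited in the post, all in units of 10^-9 per nucleotide per year.

```python
from statistics import mean

# Rates as quoted in the post (x 10^-9 per nucleotide per year).
rates = {"Francalacci": 0.53, "Poznik": 0.82, "Mendez": 0.62}

average_rate = mean(rates.values())
print(f"average rate: {average_rate:.4f} x 10^-9 per nucleotide per year")
print(f"rounded to two decimals: {round(average_rate, 2)}")
```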

alan
01-06-2014, 11:46 PM
I have to say that P, R and R1 admirably fit the dates of three main divisions of the upper palaeolithic in Siberia with P to R matching the apparent first intrusion of modern humans (apparently from the Levant via Iran and central Asia), R to R1 matching the middle upper palaeolithic and R1 matching the late upper palaeolithic.




alan
01-27-2014, 01:13 PM
Those calculations would suggest that

1. R1 occurred almost exactly at the time of the best calculations for the start of the LGM.

2. The age of c. 4700BC for R1b's main eastern European/SW Asian clade is still way too late to associate with an expansion linked to early farmers.

3. In Steppe terms the R1b expansion would be in a Sredny Stog sort of timeframe.

4. The R1a clades seem to centre on what would be Yamnaya times.
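
For readers who want to map the ky figures onto the calendar dates used in these points, here is a minimal conversion sketch. The reference year of 2014 (roughly when the estimates were posted) and the function name are my assumptions. The clade ages are taken from the list of estimates earlier in the thread.

```python
REFERENCE_YEAR = 2014  # assumption: "present" is taken as the year of the posts

def ky_to_calendar(age_ky, reference_year=REFERENCE_YEAR):
    """Convert an age in thousands of years before reference_year to a BC/AD label."""
    year = reference_year - round(age_ky * 1000)  # astronomical year numbering
    # There is no year 0 in the BC/AD scheme: astronomical year 0 is 1 BC.
    return f"{1 - year} BC" if year <= 0 else f"AD {year}"

for clade, age_ky in [("R1b-Z2103", 6.7), ("R1b-L11", 6.1),
                      ("R1a-Z93", 5.7), ("R1a-Z282", 5.3)]:
    print(f"{clade}: {age_ky} ky -> about {ky_to_calendar(age_ky)}")
```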

In general it's suggestive of a two-step model of an R1b expansion c. 4500BC and a little later and an R1a expansion c. 3300BC and a little later. This fits quite well a two-wave steppe model with R1b associated with the Suvorovo type groups and R1a mainly linked to Yamnaya. There are of course other options for both but it has a nice fit with the timing of steppe waves into east-central Europe and the Balkans. When you add the apparent association of R1a and b with different IE branches, with the R1b-rich groups tending to be linked to most of the earliest branching on the linguistic tree, then it does feel like a pretty good fit all things considered.

Another aspect of this is that if R1b is linked to the pre-Yamnaya Suvorovo type waves west derived from Sredny Stog groups around the Dnieper then it also makes some kind of sense in terms of the later environments they spread into. I say this because these Suvorovo groups seem to be descended from groups who had a lot more of a farming aspect to their economy and had longstanding links to the Balkans copper trade. Craniology shows that this was not just influence but almost certainly also involved real mixing. So, you could say that that wave had a bit of preparation for their future home in the farming world and would more easily integrate after a little time.

Conversely, if R1a was, as seems likely, associated with Yamnaya and its predecessors east of the Don then it really was peripheral to farming and not prepared in any way to easily integrate into the farming world. It was probably the first steppe culture, other than maybe some steppe Maykop elements a century before, to take up the wagon. That created both an incredible expansion opportunity within the interfluvial areas of the steppes and steppe-like land in east-central Europe but it also made them a further degree alien in terms of lifestyle with the farming world. This may have primed it for the later pattern that once it invaded the Balkans it remained somewhat aloof and confined to steppe-like land separate from the farmers.

In such a scenario of chronological and socio-economic differences the contrast in distribution between R1a and b seems understandable and the differences are foreshadowed in their earlier history on the steppes.



jeanL
03-21-2014, 01:51 PM
Here is a relatively new study which shows pretty much what I've been saying for a while: that the mutation rate isn't linear, as Klyosov and the like claim, and that it varies as a function of the repeat number.

Empirical Evaluation Reveals Best Fit of a Logistic Mutation Model for Human Y-Chromosomal Microsatellites (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3241406/)




The rate of microsatellite mutation is dependent upon both the allele length and the repeat motif, but the exact nature of this relationship is still unknown. We analyzed data on the inheritance of human Y-chromosomal microsatellites in father–son duos, taken from 24 published reports and comprising 15,285 directly observable meioses. At the six microsatellites analyzed (DYS19, DYS389I, DYS390, DYS391, DYS392, and DYS393), a total of 162 mutations were observed. For each locus, we employed a maximum-likelihood approach to evaluate one of several single-step mutation models on the basis of the data. For five of the six loci considered, a novel logistic mutation model was found to provide the best fit according to Akaike’s information criterion. This implies that the mutation probability at the loci increases (nonlinearly) with allele length at a rate that differs between upward and downward mutations. For DYS392, the best fit was provided by a linear model in which upward and downward mutation probabilities increase equally with allele length. This is the first study to empirically compare different microsatellite mutation models in a locus-specific fashion.

Here is more from the study:


It is well known that the pattern of microsatellite mutation varies across loci (Kelkar et al. 2008). To our knowledge, however, the present study is the first to systematically compare novel as well as previously described microsatellite mutation models for Y-STRs in a locus-specific fashion. This comparison was made possible by the accumulation of suitable genotype data from 15,285 father–son duos. The major advantage of this type of data is that all father–son relationships had been confirmed by independent genotyping of other markers. In studies using deep pedigree data for mutational analyses, this is typically not the case (Heyer et al. 1997), which renders discrimination between false paternity records and genuine mutations notoriously difficult. Furthermore, by basing our analysis on directly observed mutations (i.e., Y-STR mutations in father–son duos), we avoided the need for additional assumptions about the underlying population dynamics, mating behavior, or selective pressure. This is a clear advantage over studies that sought to investigate microsatellite mutation processes by comparing distantly related genomes (Dieringer and Schlötterer 2003).

In our analysis, we also avoided the complicating effects of recombination through choosing loci from the male-specific region of the Y chromosome. Although this restriction may at first glance seem to limit the general applicability of our results, it may be surmised that Y-chromosomal and autosomal microsatellite loci obey similar mutation models because they have similar biochemical properties and because replication slippage is responsible for STR mutations in both instances (Heyer et al. 1997; Kayser et al. 2000). This contrasts with minisatellite mutations, where recombination plays a significant role (Buard et al. 2000).

One caveat of our study is that the loci considered were originally selected for forensic applications because of their high variability. Therefore, we cannot exclude that our parameter estimates are biased toward higher mutation probabilities, but this seems unlikely to affect the general conclusion as to which models are most appropriate for microsatellites.

As was mentioned before, many statistical models have been proposed for the microsatellite mutation process (Calabrese and Sainudiin 2005). However, only a few of these turned out to be applicable to our data. For example, the model proposed by Kruglyak et al. (1998), which includes point mutations, was not deemed relevant to our study because, with the genotyping systems used in forensics, point mutations are not altering repeat counts (Gusmão et al. 2006). Moreover, we chose to restrict ourselves to one-step models owing to the scarcity of data on multistep mutations (Table 3). The three instances of mutations resulting in a change by 2 repeat units in our data were counted as single-step changes for model-fitting purposes. This concerned two of the six loci considered, for which the models should therefore be interpreted as dichotomizing all possible mutation events into up- and downward mutations, regardless of step size. However, since multistep mutations are very rare, this dichotomization should not affect our conclusions substantially. Nevertheless, should more data on multistep mutations become available in the future, a study of more sophisticated models may become worthwhile.

Our study was in part inspired by Whittaker et al. (2003), who were the first to suggest the use of maximum likelihood to fit microsatellite mutation models. However, as explained above, their exponential mutation model was not well defined, resulting in unbounded mutation probabilities. This problem does not occur with a logistic model, which to our knowledge has not been investigated before, because in the logistic model mutation probabilities are always bounded by parameter γ. Notably, for small repeat numbers, Whittaker’s model and the logistic model are similar in that both entail an exponential increase in mutation probabilities with increasing repeat number. Qualitatively, the logistic model is also similar to the best models emerging from genome comparisons, e.g., the PL1 model in Sainudiin et al. (2004). The main features in both instances are an allele-length–dependent mutation rate and a confinement to single-step mutations.

In principle, it would be possible to combine mutation models to obtain better fits. For example, a model with linearly increasing upward mutation probabilities but with downward mutation probabilities according to the logistic model fits our data for DYS390 and DYS391 somewhat better than the logistic model alone (ΔAIC = −21.9 and ΔAIC = −37.7, respectively, cf. Table 5). However, we decided to focus on pure models here to not exacerbate the multiple-testing problem.

Practical applications of our results are vast because many uses of microsatellite data require estimates of the respective mutation probabilities. The logistic model, which was shown here to provide the best fit to empirical mutation data, is readily applicable to likelihood-based kinship analysis, phylogenetic analysis, and coalescence methods used in population genetics. Our statistical evaluation of mutation models may also contribute to a better understanding of the underlying biological mutation mechanisms. In particular, the fact that the combined models fit the data better than the original ones suggests possible differences between the mechanisms of upward and downward mutation.

With these applications in mind, gathering of further mutation data, for example, in an international STR mutation database, seems to be warranted. With a growing database, it will become possible to further refine parameter estimates as well as the models themselves.
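
To illustrate the qualitative point in the quoted passages (a logistic model keeps the mutation probability bounded by γ, whereas an exponential model grows without limit), here is a small sketch. The functional form and all parameter values are my own illustrative assumptions, not the parametrization or fitted values from the paper. They are chosen only so that the two curves agree at short allele lengths.

```python
import math

def logistic_mutation_prob(repeats, gamma, alpha, beta):
    """Illustrative logistic form: rises with allele length, capped at gamma."""
    return gamma / (1.0 + math.exp(alpha - beta * repeats))

def exponential_mutation_prob(repeats, gamma, alpha, beta):
    """Illustrative exponential form: matches the logistic at short lengths, unbounded."""
    return gamma * math.exp(beta * repeats - alpha)

# Hypothetical parameters chosen for illustration only.
GAMMA, ALPHA, BETA = 0.02, 20.0, 0.8

for repeats in (10, 15, 20, 25, 30, 40):
    p_log = logistic_mutation_prob(repeats, GAMMA, ALPHA, BETA)
    p_exp = exponential_mutation_prob(repeats, GAMMA, ALPHA, BETA)
    print(f"{repeats:2d} repeats: logistic {p_log:.3e}, exponential {p_exp:.3e}")
```

With these made-up parameters the two forms are nearly identical below about 20 repeats, after which the exponential value exceeds 1 (so it is no longer a probability) while the logistic value levels off just under γ.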

parasar
04-03-2014, 03:00 AM

Another set of estimates by Michał


Z280 - 5121 years
CTS1211- 4770 years
CTS3402 - 4224 years
Y33 - 3784 years
CTS8816 - 3696 years
Y2900 - 2171 years
Z93 - 5456 years
Z94 - 5280 years
L657 - 4752 years
Y57 - 2904 years

http://eng.molgen.org/viewtopic.php?f=77&t=1300&start=60

alan
04-03-2014, 12:28 PM
I really get the impression that these new SNP counting dates are getting us within a few centuries of the real date. Z280's age and distribution could suggest that it is an R1a lineage that, although of presumably western Yamnaya origin, only really expanded when it entered the corded ware culture (including its Middle Dnieper and Fatyanovo variants). At a guess, the advantage such a line could have had is if it was the one that was responsible for hybridising steppe and non-steppe traits together, such as is seen in corded ware.

Z93 has a distribution which suggests that it expanded on the Asian steppe and kept that lifestyle for a long time. Any sort of transformation to non-steppe living was not made until some of its later subclades spilled towards the Indian subcontinent.





parasar
04-03-2014, 04:18 PM
There are just a few hundred years between Z93 and L657 or Z2125 (3-4 SNPs from Z93). So starting with Z93 we see a star-like expansion in the subcontinent.
The shared steppe SNPs (perhaps a couple) under Z93 are going to turn out to be much younger. Plus, just comparing two sets - the Khakas and the Altaians - the sets look quite distinct.

The Altai and Khakas data-set in Underhill's paper did not have even one M417*.

Europe (Estonia & Hungary)
M417 16 12 13 18 25 10 11 13 11 10 11 14 14 11 19 16 15 23 12
M417 15 12 13 17 25 11 11 13 12 11 ND ND ND ND ND ND ND ND ND

Near/Middle East Kordestan, Iran
M417 15 12 13 16 25 10 11 13 10 10 11 14 14 11 20 16 15 23 12

Near/Middle East Turk
M417 15 12 12 17 25 11 11 13 10 10 ND ND ND ND ND ND ND ND ND

South India
M417 15 12 14 17 24 11 11 13 10 10 14 14 14 11 20 17 15 23 12

While 11 is a common repeat at DYS385a (11,14 is modal for 385ab), the South India sample has 14.
If you look at the diverse data-set of the Saharia - http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0032546.s003
a number of them do not have 11 repeats at DYS385a (14,19; 14,17; 14,17), though 11 is still modal even in the Saharia. So, either the STR typing is incorrect or possibly the Saharia have a lot of M417* types. Genetic Affinities of the Central Indian Tribal Populations http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0032546
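
To make the comparison between the M417 haplotypes listed above more concrete, here is a minimal sketch that computes a simple stepwise genetic distance (the sum of absolute repeat differences) over the markers typed in both samples, skipping "ND" entries. The marker order is not stated in the post, so markers are treated positionally. The labels and function names are mine. A summed-step distance is only one of several conventions.

```python
# Haplotypes copied from the list above; "ND" means the marker was not determined.
HAPLOTYPES = {
    "Europe 1": "16 12 13 18 25 10 11 13 11 10 11 14 14 11 19 16 15 23 12",
    "Europe 2": "15 12 13 17 25 11 11 13 12 11 ND ND ND ND ND ND ND ND ND",
    "Kordestan": "15 12 13 16 25 10 11 13 10 10 11 14 14 11 20 16 15 23 12",
    "Turk": "15 12 12 17 25 11 11 13 10 10 ND ND ND ND ND ND ND ND ND",
    "South India": "15 12 14 17 24 11 11 13 10 10 14 14 14 11 20 17 15 23 12",
}

def parse(haplotype):
    """Turn a whitespace-separated haplotype string into ints, with None for ND."""
    return [None if token == "ND" else int(token) for token in haplotype.split()]

def genetic_distance(a, b):
    """Return (summed stepwise distance, number of markers compared)."""
    shared = [(x, y) for x, y in zip(parse(a), parse(b))
              if x is not None and y is not None]
    return sum(abs(x - y) for x, y in shared), len(shared)

names = list(HAPLOTYPES)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        gd, n = genetic_distance(HAPLOTYPES[first], HAPLOTYPES[second])
        print(f"{first} vs {second}: GD {gd} over {n} markers")
```

In the Kordestan versus South India comparison, 3 of the 8 steps come from the 11th position alone, which appears to be the DYS385a value discussed above (11 versus 14).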

An old paper had (IMO) apparently botched the SNP typing on the Sahariya, but perhaps their data-set was different.
http://www.nature.com/jhg/journal/v54/n1/images/jhg20082f4.gif
http://www.ncbi.nlm.nih.gov/pubmed/19158816?dopt=Abstract

alan
04-23-2014, 03:17 PM
I notice that there has been quite a bit of discussion on some other forums like yahoo groups about the SNP counting approaches when applied to L21.

The dates being argued for have a wide span of c. 4000BC to 1600BC or something like that. I would just like to throw in my tuppence worth that a date at the youngest end of that range would seem to me to be bordering on archaeological impossibility. However, there are a lot of ifs and buts to say anything with absolute confidence.

What I would say with much more confidence is that the date for P312 as a whole must be seen in terms of the apparently close correspondence to the beaker culture and its pan-European nature. So, a date for P312 any younger than c. 2700BC would seem nonsensical to me. I think that should seriously be considered when the arguments over years per SNP etc are being made. If a methodology makes P312 any younger than 4700 years it is almost certainly wrong.

So I ask this to Mike W and others who are looking at the SNP counting method - what rate of years per SNP would be the minimum that would arrive at a date of no less than 4700 years for P312? Whatever that is I think you should consider it the minimum that can possibly be compatible with the P312-bell beaker link. There is simply no archaeological phase after the bell beaker period where the pan-European spread P312 can be explained.

If that helps to eliminate some of the lowest end of the range of estimates of years per SNP then I think it can probably be a safe inference.

razyn
04-23-2014, 03:47 PM
2700 BC is 4700 years old, which are you arguing?

alan
04-23-2014, 03:55 PM
Sorry- total brainstorm - fixed now.

What I have read is that the total numbers of reliable SNPs for DF13 derived clade people averaged about 30. I am assuming that if that was extended to all P312 the average would by definition have to be a couple of SNPs higher - c. 32? Now if the archaeological rationale would put the minimum age of P312 as 4700 years then I would suppose that would bring it close to around 150 years per SNP. Anything much less than that would make zero sense for the beaker-P312 correlation that a large amount of people believe in. While some archaeological argument might be made for an even older date, I believe that there is zero chance of explaining P312's pan-European spread in any scenario later than the start of the major expansion phase of the beaker culture c. 4700 years ago.
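
The back-of-envelope figure above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic in the post: the 4700-year minimum age and the average of about 32 SNPs are the assumed inputs stated above, and the function name is made up for illustration.

```python
def min_years_per_snp(min_age_years: float, avg_snps: float) -> float:
    """Smallest years-per-SNP rate that still makes the clade at least min_age_years old."""
    return min_age_years / avg_snps

# Inputs taken from the post: P312 at least 4700 years old, ~32 SNPs on average.
print(round(min_years_per_snp(4700, 32)))  # 147, i.e. "close to around 150"

# A higher average SNP count below P312 lowers the implied years per SNP.
for avg in (30, 32, 35, 40):
    print(avg, round(min_years_per_snp(4700, avg), 1))
```

If the real average SNP count below P312 turns out to differ from the low 30s, only the second argument needs to change.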

That of course is just a total ballpark guess based on a half-understood reading of others' comments on this area, which I am not at all well versed in. I was hoping some people might weigh in with information on what the average no. of SNPs after P312 would be for people derived from that SNP who have had deep SNP testing.



alan
04-23-2014, 04:54 PM
In general looking at the archaeologically derived suggested link between beaker and P312, I would think 150 years per SNP sounds relatively close to the date of 4700 years divided by the average no of SNPs in individuals below P312. I am assuming the latter is in the low 30s. Obviously if the average no of SNPs below P312 is higher then the no of years per SNP will drop.

alan
04-23-2014, 05:11 PM
One other thought on SNP counting. I am assuming mutations would increase with age of father on average. Although we tend to think of generations as being a little shorter in the past, I think we should also consider what the effect of superbreeding chiefs would be. They presumably would have more of an ability to have children at an older age through greater life expectancy and serial marriages and concubines than the less well off people with fewer wives, poorer resources and lower life expectancy. I think this apparent picture of surviving lineages tending to be those of the elites needs to be factored in when considering generation length etc. I think consideration of that might slightly increase the figure as opposed to the sort of figure that one might come up with for a more normal monogamous society. IMO that top-down sort of push from chiefs might commence a little later and continue a little longer in terms of age than would be the norm for ordinary people. However, someone really would need to study the average ages at death of clan chiefs etc.

alan
05-04-2014, 04:36 PM
New paper

http://biorxiv.org/content/early/2014/05/03/004705

MitchellSince1893
05-04-2014, 06:03 PM
First thing that caught my eye in that paper was the decision to use 25 years per generation.

MJost
05-06-2014, 01:57 PM
This paper does state that the more recent branches appear to be different in mutation rates. Slow STRs are near sequencing rates. Bird and 1/3 slowest markers options are about the same in my TMRCA Estimator.

MJost

Michał
05-24-2014, 06:50 PM
I'm just using Mark's recent quote "The Experts have deemed from studies that the range is 70-90 years per SNP mutation." I think Michal uses 88 years per SNP. Some would say it's three generations per SNP. Perhaps Mark or Michal or Warwick can elaborate on these ranges or point to specific studies. I'm assuming these ranges are for FGC high reliability SNPs, not Big Y or YFULL.
I have already provided the basis for my calculations on numerous occasions, but since there are some new data available, let me summarize it again.

As for the Y-DNA mutation rate (or, more precisely, for the Y-DNA SNP rate), there have been four major papers that have provided some reasonable estimates for this rate. They have been briefly described in some previous posts of mine:
http://www.anthrogenica.com/showthread.php?709-New-DNA-Papers/page5&p=10824#post10824
http://eng.molgen.org/viewtopic.php?f=77&t=1300&start=02

Based on those four independent estimates, we could securely assume that the true mutation rate is roughly 0.7 (or between 0.6 and 0.8) x 10^-9 per bp per year. Using this well justified assumption, I have produced a series of estimates for some selected haplogroups, including some major subclades of R1b (based on the SNP data provided by the Sardinian paper by Francalacci et al., 2013):
http://www.anthrogenica.com/showthread.php?828-STR-Wars-GDs-TMRCA-estimates-Variance-Mutation-Rates-amp-SNP-counting/page8&p=15936#post15936

Shortly thereafter, it turned out that my estimates are in perfect agreement with the more recently published SNP data for the Siberian radiocarbon dated Mal’ta boy (24 kya, R*):
http://www.anthrogenica.com/showthread.php?1507-Some-provisional-calculations-for-haplogroup-R1a-based-on-the-first-FGC-result/page4&p=20838#post20838

In the meantime, I have switched to a slightly modified mutation rate by replacing the 0.7 rate with 0.66 x 10^-9 per bp per year (though this does not mean that I am strongly convinced that using the 0.7 rate would be inappropriate):
http://www.anthrogenica.com/showthread.php?828-STR-Wars-GDs-TMRCA-estimates-Variance-Mutation-Rates-amp-SNP-counting/page9&p=26002#post26002

More recently, my SNP-based estimates have also been confirmed by Underhill et al., 2014, who dated the R1a/R1b split to about 25 ky and calculated the R1a-Z645 clade to be 5.8 ky old (though the authors were, quite surprisingly, not aware that this is a TMRCA value for R1a-Z645 and not R1a-M417, as they claimed in their paper):
http://eng.molgen.org/viewtopic.php?f=77&t=1300&start=72

Using the above well-supported (and positively verified) mutation rate of 0.66 x 10^-9 per bp per year, I have estimated that each mutation in the so-called “gold region” (of about 10 Mb) covered by the FTDNA Big Y test should correspond to about 151 years:
http://eng.molgen.org/viewtopic.php?f=77&t=1300&start=60
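
The conversion from a per-base-pair rate to years per SNP can be sketched as follows. It assumes, as in the paragraph above, a covered region of about 10 Mb; the function and constant names are illustrative only and not taken from any of the linked posts.

```python
def years_per_snp(rate_per_bp_per_year: float, region_bp: float) -> float:
    """Average waiting time between SNPs in a region of region_bp base pairs."""
    expected_snps_per_year = rate_per_bp_per_year * region_bp
    return 1.0 / expected_snps_per_year

GOLD_REGION_BP = 10e6  # "about 10 Mb"; an approximation, the true Big Y range is slightly larger

for rate in (0.6e-9, 0.66e-9, 0.7e-9, 0.8e-9):
    years = years_per_snp(rate, GOLD_REGION_BP)
    print(f"{rate:.2e} per bp per year -> {years:.1f} years/SNP")
```

With 0.66 x 10^-9 and exactly 10 Mb this gives about 151.5 years per SNP, matching the "about 151" quoted above; a region slightly larger than 10 Mb would give a slightly smaller number.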

At the same time, I have also suspected that the number of years per each BigY-tested mutation should be slightly lower than 151 (because the range of Big Y is actually slightly larger than 10 Mb). On the other hand, the higher number of years per mutation (about 180) should be used in all those cases where the SNP count is based only on the number of reliable HQ variants reported by FTDNA (and not on an SNP count produced by some additional analysis of the vcf files, or by the analysis of the Big Y BAM files at YFull or FGC), which has been, unfortunately, wrongly interpreted as my suggestion that each BigY-tested SNP corresponds to about 180 years:
http://eng.molgen.org/viewtopic.php?f=85&t=1494&start=03

Since the Y-DNA region covered by the Full Genome Corp (FGC) test is much harder to define than in the case of Big Y, and it seems certain that some less reliable regions additionally included into the FGC test do not produce reliable SNPs at the same rate as the “standard” 8-10 Mb region used in most research studies, one way to estimate the number of years per each FGC-tested SNP marker is to calculate the percentage of the FGC-tested SNPs that are covered by Big Y (see the link shown above). I have initially assumed (based on some R1a-based calculations) that the Big Y test covers approximately 58% of all SNPs tested by FGC, which would correspond to 88 years per each FGC-tested SNP (assuming 151 years/SNP for Big Y), although now it seems that 60% would be more appropriate (which would then correspond to 91 years/SNP for FGC).

Using the above SNP rate and the huge collection of Big Y results from the R1b-U106 project, I have calculated a series of provisional SNP-based estimates for different subclades of R1b-U106. These estimates strongly indicated that R1b-U106 is much older than suggested based on some previous STR-based estimates:
http://www.anthrogenica.com/showthread.php?2420-SNP-based-TMRCAs-for-R1b-U106-and-subclades

Most recently, Iain McDonald from the R1b-U106 FTDNA project has used the Big Y results for four relatively large and deeply rooted families (three R1b and one R1a) with known genealogies to estimate that each reliable BigY-tested SNP (i.e. an SNP mutation verified by the analysis of the vcf files) corresponds to about 140 years, with a 95.5% confidence level of 104-197 years/SNP. Although the margin of error is still very large, it is easy to notice that this is fully consistent with the Y-DNA SNP mutation rates I was using so far (be it 0.66 or 0.7 x 10^-9 per bp per year), while negatively verifying all much lower (<0.5) or much higher (>0.9) mutation rates that were sometimes suggested.
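
To compare a genealogy-based figure in years per SNP with a per-base-pair rate, the same relation can be run in reverse. This is a rough sketch that assumes the Big Y SNPs come from a region of exactly 10 Mb, which is only an approximation; the function name is hypothetical.

```python
def rate_from_years_per_snp(years: float, region_bp: float) -> float:
    """Per-bp per-year mutation rate implied by a given number of years per SNP."""
    return 1.0 / (years * region_bp)

REGION_BP = 10e6  # assumed "about 10 Mb" region

for label, years in (("central", 140), ("lower bound", 104), ("upper bound", 197)):
    rate = rate_from_years_per_snp(years, REGION_BP)
    print(f"{label}: {years} years/SNP -> {rate / 1e-9:.2f} x 10^-9 per bp per year")
```

Under the 10 Mb assumption the central value of 140 years maps to roughly 0.71 x 10^-9, and the interval to roughly 0.51 to 0.96 x 10^-9. A covered region somewhat larger than 10 Mb would shift all three values down.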

Assuming that the above 140 years per each BigY-tested mutation is more or less correct (which should be considered very likely in view of all the data discussed above), this would also indicate that each FGC-tested mutation corresponds to about 84 years (or 81 years when assuming that Big Y covers only 58%, and not 60%, of the FGC-tested SNPs).
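
The scaling between the Big Y and FGC figures follows from a simple proportion: if Big Y detects only a fraction of the SNPs that FGC detects over the same time span, each FGC SNP stands for that fraction of the Big Y years per SNP. A minimal sketch, with an illustrative function name and the four input combinations mentioned in this post:

```python
def fgc_years_per_snp(big_y_years_per_snp: float, big_y_share: float) -> float:
    """Years per FGC-tested SNP, given the share of FGC SNPs that Big Y also covers."""
    return big_y_years_per_snp * big_y_share

for big_y_years in (151, 140):
    for share in (0.58, 0.60):
        fgc_years = fgc_years_per_snp(big_y_years, share)
        print(f"Big Y {big_y_years} years/SNP, share {share:.0%} -> FGC {round(fgc_years)} years/SNP")
```

This reproduces the 88, 91, 81 and 84 years per FGC-tested SNP quoted above.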

To summarize, before the above estimates are further refined based on investigating more families with known genealogies, or based on some new radiocarbon-dated archaeological remains, I would recommend using the above number of 84 years (or the 81-91 range) for each reliable FGC-tested SNP, and 140-150 years for each reliable BigY-tested SNP. And since I know that people frequently use such estimates to calculate the age of a single lineage, I would like to remind all of you that only by testing multiple independent lineages descending from a common ancestor (and calculating the average number of SNPs) may one get a fairly reliable TMRCA estimate. Also, when calculating the age of a specific clade, it is always good to compare it with the age of some sister clades, as it is always possible that a substantially decreased or increased number of mutations at the root of a given clade (due to some random fluctuations) may significantly affect such a TMRCA calculation.
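
The advice about averaging over several lineages can be illustrated with a short sketch. The SNP counts below are invented for the example (they are not real test results), and the function name is hypothetical; the 140 and 150 years per SNP are the Big Y figures recommended above.

```python
def tmrca_years(snp_counts, years_per_snp):
    """TMRCA estimate: mean SNP count below the common ancestor times years per SNP."""
    mean_snps = sum(snp_counts) / len(snp_counts)
    return mean_snps * years_per_snp

# Hypothetical SNP counts for four independent lineages under one common ancestor.
lineage_counts = [28, 35, 31, 34]

for rate in (140, 150):
    print(f"{rate} years/SNP -> averaged TMRCA {tmrca_years(lineage_counts, rate):.0f} years")

# A single lineage can land well away from the averaged estimate.
for count in lineage_counts:
    print(f"{count} SNPs alone -> {tmrca_years([count], 140):.0f} years")
```

With these made-up counts the single-lineage estimates at 140 years per SNP range from 3920 to 4900 years, while the average of the four gives 4480 years.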

alan
05-24-2014, 07:43 PM
Thanks that is a very useful post for numerically challenged people like myself.


jdean
05-25-2014, 11:53 AM
Thanks that is a very useful post for numerically challenged people like myself.

I second that !!

Using the 180 yr no. with my results (taking care to remove anything that could be contentious) correlated very well with TMRCA calculations I've done for my cluster using Ken Nordtvedt's 111 loci interclade spreadsheet.

Also the age of P312 came out at 3600 BC and DF13 2500 BC, which I think falls in line with Alan's thoughts ?

One query though is how many years per generation were used in this figure, I have trouble with the 25 yrs commonly quoted.

Michał
05-25-2014, 12:36 PM
Using the 180 yr no. with my results (taking care to remove anything that could be contentious) correlated very well with TMRCA calculations I've done for my cluster using Ken Nordtvedt's 111 loci interclade spreadsheet.
Also the age of P312 came out at 3600 BC and DF13 2500 BC, which I think falls in line with Alan's thoughts ?
As I have mentioned in my above post, I would use the 180 years/SNP rate only for the BigY-tested private SNPs that were “manually” extracted from the list of novel variants reported by FTDNA, as in case you have used any additional analysis of the vcf and/or BAM files, I would rather use the lower number of years per SNP (i.e. 150 or 140). Also, it seems to me that when the known SNPs upstream of the private ones are considered, the number of years per SNP should also be lower than 180, mostly because we are usually more willing to accept a low quality SNP that is shared by other members of our clade (which is not possible for the singletons/private SNPs) and because we frequently assume that we are positive for such a shared SNP (even if we get a no-call for it).



One query though is how many years per generation were used in this figure, I have trouble with the 25 yrs commonly quoted.
Frankly speaking, there was no need to assume any specific generation time in my calculations. This is because the initial estimates by Poznik and Francalacci did not require to assume a specific number of years per generation. Also, the radiocarbon dating provides only the number of years (and not the number of generations).

Of course, it would be interesting to know which generation time is more appropriate for a given population or for a given time period, but since we will be unable to determine this for any prehistoric period, I wouldn’t pay too much attention to this question when producing the estimates for R1b-P312 or R1b-DF13.

jdean
05-25-2014, 01:14 PM
Thanks Michal, the SNPs were extracted from my Big Y variance file.

Really I ought to do this for a few more DF49 kits but the process of removing the stragglers not picked up by my kill list is still quite time consuming and I'm feeling lazy today : )

haleaton
05-25-2014, 07:34 PM
Thanks! I am learning a lot from your summary. One thing though for myself (R-U152-L2*) having tested both at FGC BGI and FTDNA Big Y and having NGC [Edit: this should be FGC] analyze both of them, Big Y only covered 15/44 or 34% of valid Private SNPs. BIG Y did not report INDELs but they could be found in the VCF file and BIG Y found 1/6 or 17% of INDELs. I verified this was due to coverage differences as the NGC BGI mutations found were all outside of the FTDNA BED file. This coverage difference may depend on haplogroup or person as the BIG Y coverage may be biased towards previously tested regions with known SNPs.
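
For anyone wanting to run the same comparison on their own paired results, the shares above are plain ratios. A small sketch using the counts quoted in this post; the function name is made up for illustration.

```python
def coverage_share(found_by_big_y: int, found_by_fgc: int) -> float:
    """Fraction of FGC-reported variants that were also covered by Big Y."""
    return found_by_big_y / found_by_fgc

print(f"private SNPs: {coverage_share(15, 44):.0%}")  # 34%
print(f"INDELs:       {coverage_share(1, 6):.0%}")    # 17%
```

Both shares are for a single R-U152-L2* sample, so they may not carry over to other haplogroups or testers.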

Michał
05-25-2014, 10:09 PM
One thing though for myself (R-U152-L2*) having tested both at FGC BGI and FTDNA Big Y and having NGC analyze both of them
What is NGC? (Did you mean FGC?)


Big Y only covered 15/44 or 34% of valid Private SNPs.
What about the appropriate numbers for all your SNPs downstream of P312?

Also, are you sure that all of those FGC-tested "private" SNPs are indeed downstream of all non-private SNPs detected by Big Y (and not at the same level as some of the BigY-tested SNPs from the level just upstream of the "private level")? I am not saying that this is indeed your case, but such situation may happen when the number of BigY-tested people is much larger than a number of the FGC-tested people from a given subclade.


BIG Y did not report INDELs but they could be found in the VCF file and BIG Y found 1/6 or 17% of INDELs. I verified this was due to coverage differences as the NGC BGI mutations found were all outside of the FTDNA BED file.
On the other hand, many of the "high quality" (i.e. no asterisk or one asterisk) INDELS reported by FGC should be considered as not reliable (i.e. they have no phylogenetic value), at least this is suggested by the YFull analysis of the FGC BAM files I've seen, while many of the BigY-tested INDELs that are not included in the VCF file can be extracted from the Big Y BAM files (for example at YFull).

The major disadvantage of Big Y is not the lower percentage of the chromosome Y covered by the test (which is of course compensated by the lower price) but the fact that FTDNA does not provide any appropriate tool to interpret the raw data.



This coverage difference may depend on haplogroup or person as the BIG Y coverage may be biased towards previously tested regions with known SNPs.
Do you know any data that would strongly indicate that the BigY/FGC ratio (for all reliable SNPs, including the private and non-private ones) is higher for most well studied haplogroups, like R1b and G2a, than for the less studied ones, like T or L?

I can imagine that the well studied haplogroups would show a much lower (than average) BigY/FGC ratio for "private" SNPs but this should be compensated by the much higher (than average) BigY/FGC ratio for the non-private SNPs (and we should keep in mind that the ratio of private to non-private SNPs should be much lower in the well studied haplogroups).

Cofgene
05-25-2014, 11:09 PM
What is NGC? (Did you mean FGC?)

On the other hand, many of the "high quality" (i.e. no asterisk or one asterisk) INDELS reported by FGC should be considered as not reliable (i.e. they have no phylogenetic value), at least this is suggested by the YFull analysis of the FGC BAM files I've seen, while many of the BigY-tested INDELs that are not included in the VCF file can be extracted from the Big Y BAM files (for example at YFull).



The lack of phylogenetic value does not imply that an INDEL or SNP is "not reliable." Are all equivalent SNPs considered to be "unreliable?" Reliability should be associated with whether a mutation is highly recurrent. A more meaningful statement would be that with current limited test results one cannot establish if a particular INDEL has phylogenetic value. Under R1b-U106 we have a new INDEL that is present in 3 Big-Y and 2 Full-Y samples yet is not present in 4 other Big-Y results. This novel INDEL currently has no equivalent SNPs and seems to define a new intermediate level haplogroup.

haleaton
05-26-2014, 01:32 AM
What is NGC? (Did you mean FGC?)

I corrected my typo, I meant FGC - Full Genomes Corporation.



What about the appropriate numbers for all your SNPs downstream of P312?

Also, are you sure that all of those FGC-tested "private" SNPs are indeed downstream of all non-private SNPs detected by Big Y (and not at the same level as some of the BigY-tested SNPs from the level just upstream of the "private level")? I am not saying that this is indeed your case, but such situation may happen when the number of BigY-tested people is much larger than a number of the FGC-tested people from a given subclade.

Myself, I have not done a study between P312 and U152/L2, and so far others (FGC, YFull) have not reported anything new, but I should check the very few "no calls" in the FGC BGI data. There were 83 no calls for YFull's known SNPs. Big Y does have the problem of not covering (per the BED file) many important subclade-defining SNPs, such as L2 itself and Z49, but I have relied on multiple other tests and checked every relevant SNP that I could find in the public record. I had a previous post on just how bad Big Y was in my particular L2* case. In some cases, such as L2, data sufficient for FGC or YFull exists in the raw FTDNA BAM data, but gets excluded by the BED file and is not reported by FTDNA Big Y as positive.

FGC analysis finds everything that FTDNA does, but Big Y NGS data is a subset and the quality valuations can differ. FTDNA does not compare against public data sets from other labs such as 1K Genomes, but FGC and YFull do. Novel Variants called out by FTDNA Big Y that are shared by these samples were not considered to be Private to me. Multiple comparisons between U152+ and L2+ data samples have been made and I have tested negative for all known branches below L2+, including all those new ones on the FTDNA tree that are from GENO 2.0 results but have not been found in public data sets and for which no corroborating proof from GENO 2.0 has been provided.



On the other hand, many of the "high quality" (i.e. no asterisk or one asterisk) INDELS reported by FGC should be considered as not reliable (i.e. they have no phylogenetic value), at least this is suggested by the YFull analysis of the FGC BAM files I've seen, while many of the BigY-tested INDELs that are not included in the VCF file can be extracted from the Big Y BAM files (for example at YFull).

The major disadvantage of Big Y is not the lower percentage of the chromosome Y covered by the test (which is of course compensated by the lower price) but the fact that FTDNA does not provide any appropriate tool to interpret the raw data.


I did not find that YFull did much with reporting on complex INDELs found in either my FGC BGI or FTDNA Big Y BAMs. The single (1/6) INDEL that was in Big Y's coverage was marked as "PASS" in the VCF file but was not reported as a Novel Variant.

Since I am a heavily tested L2*, I cannot say anything about phylogenetic value of INDELs. I am interested to learn about mutation rate models for INDELS as well as the chemical mechanism and if cosmic rays play any role with folks migrating to high altitude during warm periods in the Alps (just to veer a bit off-topic).

The issue with Big Y is the time and expense wasted on individual orders for what is not covered, in order to compare with others. The advantage is that they have more people to compare with today. I still think FTDNA delivered everything they said they would with Big Y, and more. Raw data was the key.

YFull is a great product, but currently they cannot handle two BAM files from the same person, so I was not able to compare.

The advantage of FGC BGI is that they are much, much closer to full coverage. I don't see how Big Y can be that precise just counting total Private SNPs; it seems to add a millennium or two of error. I am not an expert in any of this and am learning from you all.




Do you know any data that would strongly indicate that the BigY/FGC ratio (for all reliable SNPs, including the private and non-private ones) is higher for most well studied haplogroups, like R1b and G2a, than for the less studied ones, like T or L?

I can imagine that the well studied haplogroups would show a much lower (than average) BigY/FGC ratio for "private" SNPs, but this should be compensated by the much higher (than average) BigY/FGC ratio for the non-private SNPs (and we should keep in mind that the ratio of private to non-private SNPs should be much lower in the well studied haplogroups).

No, this is just speculation based on my L2* case, on my new private SNPs being found in regions that Big Y does not cover and that Walk through the Y did not cover, and on the talk that Big Y could replace Deep Clade. GENO 2.0 replaces Deep Clade, but just for the FTDNA haplotree by self-reference.

Looking through YBrowse it is interesting how FGC named SNPs are filling in across the Y. Within R1b I speculate that those that truly have their "*" in an early terminal SNP through multiple tests, even following Big Y, have that because of coverage and should consider Full Genomes Corporation testing.

Michał
05-26-2014, 02:45 PM
The lack of phylogenetic value does not imply that an INDEL or SNP is "not reliable." Are all equivalent SNPs considered to be "unreliable?"
Firstly, you seem to use the term "phylogenetic value" in a very narrow sense (that was never suggested in my post), by excluding the equivalent mutations (both SNPs and INDELs) from the group of phylogenetically useful mutations. I can only say that I consider all equivalent mutations (as long as they are relatively reliable/stable) to be very useful from the phylogenetic point of view.

Secondly, even when using this very narrow definition of a phylogenetic value, I have not suggested that the mutations with no phylogenetic value are not reliable but rather that the unreliable mutations have no phylogenetic value, which in this particular case is quite a difference.




Reliability should be associated with whether a mutation is highly recurrent. A more meaningful statement would be that with current limited test results one cannot establish whether a particular INDEL has phylogenetic value. Under R1b-U106 we have a new INDEL that is present in 3 Big-Y and 2 Full-Y samples yet is not present in 4 other Big-Y results. This novel INDEL currently has no equivalent SNPs and seems to define a new intermediate level haplogroup.
Agreed. I have never stated that all INDELs should be considered not reliable from the phylogenetic point of view, only that surprisingly many “good quality” INDELs from the FGC reports seem to be considered unreliable by the YFull specialists (and I must admit that I trust their opinion in this respect, after being corrected myself on several occasions).

Michał
05-26-2014, 03:29 PM
Myself, I have not done a study between P312 and U152/L2,
This suggests that you have analyzed the SNPs downstream of U152. What is then your BigY/FGC ratio for all SNPs under U152?


I did not find that YFull did much with reporting on complex INDELs found in either my FGC BGI or FTDNA Big Y BAMs.

YFull is a great product, but currently they cannot handle two BAM files from the same person, so I was not able to compare.
The above two statements contradict each other, so I guess it is not true that you were able to use the YFull analysis to compare your INDELs reported by FGC and Big Y.



Looking through YBrowse it is interesting how FGC named SNPs are filling in across the Y.
If only FTDNA cared a bit about introducing their novel Big Y SNPs to YBrowse (and I am talking about those SNPs that have not been previously found by FGC) and was actually able to extract such reliable SNPs from their own results, you would be surprised how many of those SNPs have been already discovered (mostly by the admins of the haplogroup projects).


Within R1b I speculate that those that truly have their "*" in an early terminal SNP through multiple tests, even following Big Y, have that because of coverage and should consider Full Genomes Corporation testing.
It seems to me that this has much more to do with the number of people tested (with any NGS-based test) than with the difference between FGC and Big Y. Among the FGC-tested people in the R1a project, the proportion of those who have their terminal SNP marked with "*" is not smaller than among those who took Big Y. For example, among the six FGC-tested members of R1a-Z280, we have YP340*, Y33*, YP2902*, while among those tested with Big-Y are many who represent specific subclades of YP340, Y33 or Y2902, while no cases of YP340* or Y33* and only one case of Y2902*.

haleaton
05-26-2014, 04:36 PM
This suggests that you have analyzed the SNPs downstream of U152. What is then your BigY/FGC ratio for all SNPs under U152?


"U152/L2" - I only looked around or below L2. Big Y excluded L2 via the BED file, though both FGC & YFull called it out from the Big Y BAM.



The above two statements contradict each other, so I guess it is not true that you were able to use the YFull analysis to compare your INDELs reported by FGC and Big Y.

No, I compared the INDELs found in the Big Y & BGI BAMs by FGC analysis and in the BGI BAM by the YFull Raw Data viewer. I compared the INDEL in the VCF file of Big Y. I would have liked to have compared the Big Y F BAMs with YFull, but they told me they would have to modify their software to handle two very similar BAM files. That said, I did not find YFull all that useful for INDELs on just my FGC BAM, as they did not really call them out except for the tiny ones. [Edit - I had to learn the locations from FGC reports.]

I probably should just learn to use SAMTOOLS.



If only FTDNA cared a bit about introducing their novel Big Y SNPs to YBrowse (and I am talking about those SNPs that have not been previously found by FGC) and was actually able to extract such reliable SNPs from their own results, you would be surprised how many of those SNPs have been already discovered (mostly by the admins of the haplogroup projects).

Of the many Novel Variants called out by FTDNA in my sample that were shared with multiple other haplogroups, all but a couple have defined "rs" SNP names from previous, usually academic, tests of 1K Genomes type data. Again, I am only talking about my own L2* sample compared with others, not the statistics of the others.

It is not clear if FTDNA even maintains their own "Ymap" anymore. Outside of Admins in shared projects getting to see the data, I have not heard of people being informed by FTDNA that they have a matching Novel Variant.



It seems to me that this has much more to do with the number of people tested (with any NGS-based test) than with the difference between FGC and Big Y. Among the FGC-tested people in the R1a project, the proportion of those who have their terminal SNP marked with "*" is not smaller than among those who took Big Y. For example, among the six FGC-tested members of R1a-Z280, we have YP340*, Y33*, YP2902*, while among those tested with Big-Y are many who represent specific subclades of YP340, Y33 or Y2902, while no cases of YP340* or Y33* and only one case of Y2902*.

It probably depends on the particular case and location on the tree (age of terminal SNP) and the luck of the draw. I have no clue if there is an a priori way to know if Big Y returns 35% or 60% of reliable variants compared with FGC BGI results.

Interesting info, thanks.

Michał
05-26-2014, 07:44 PM
It probably depends on the particular case and location on the tree (age of terminal SNP) and the luck of the draw. I have no clue if there is an a priori way to know if Big Y returns 35% or 60% of reliable variants compared with FGC BGI results.

I have never heard about any lineage that would be at least 4000 years old and in which the BigY-tested SNPs would constitute close to the 35% of the FGC-tested SNPs, as suggested in your post. All examples I have seen so far indicate a range of about 50-65% (usually close to 60%). For example, the CTS6 subclade of R1a-F1345 that seems to be about 4500 years old shows 50 FGC-tested SNPs under F1345, while the average number of corresponding BigY-tested SNPs for this clade is 30.75 (61.5%). I guess MitchellSince1893 can provide the appropriate numbers for the Big Y and FGC-tested SNPs under P312 in your L2 subclade of U152.

FGC Corp
05-26-2014, 07:52 PM
All examples I have seen so far indicate a range of about 50-65% (usually close to 60%). For example, the CTS6 subclade of R1a-F1345 that seems to be about 4500 years old shows 50 FGC-tested SNPs under F1345, while the average number of corresponding BigY-tested SNPs for this clade is 30.75 (61.5%).


50%- 60% is consistent with our white paper estimates.

Note: This white paper, while still consistent with recent Big Y and FGC results, dates to Feb 2014.

haleaton
05-26-2014, 09:30 PM
I have never heard about any lineage that would be at least 4000 years old and in which the BigY-tested SNPs would constitute close to the 35% of the FGC-tested SNPs, as suggested in your post. All examples I have seen so far indicate a range of about 50-65% (usually close to 60%). For example, the CTS6 subclade of R1a-F1345 that seems to be about 4500 years old shows 50 FGC-tested SNPs under F1345, while the average number of corresponding BigY-tested SNPs for this clade is 30.75 (61.5%). I guess MitchellSince1893 can provide the appropriate numbers for the Big Y and FGC-tested SNPs under P312 in your L2 subclade of U152.

You just did though I won't speculate on the age of L2.

My personal estimate makes some minor adjustments based on detailed analysis and in progress Sanger sequencing. Again all these Private SNPs are just singletons at this point though I am getting my 12th paternal cousin to be tested by Full Genomes hopefully with results later this year.

Anyway, ignoring my adjustments, using FGC Analysis of my FGC BGI BAM versus my FTDNA BigY BAM previously posted:

FG1059-A (February Analysis) BGI
SNPs shared with reference samples (1723 total variants; 124 "high reliability" variants)
Private SNPs (684 total variants; 39 "high reliability" variants)
INDELs and MNPs shared with reference samples (391 total variants; 42 "high reliability" variants)
Private INDELs and MNPs (372 total variants; 6 "high reliability" variants)

4UEEK (April Analysis) BIG Y
SNPs shared with reference samples (901 total variants; 76 "high reliability" variants)
Private SNPs (2920 total variants; 16 "high reliability" variants)
INDELs and MNPs shared with reference samples (239 total variants; 24 "high reliability" variants)
Private INDELs and MNPs (465 total variants; 1 "high reliability" variants)

This is just cut and pasted from FGC reports. So here it is 16/39 = 41% Private SNPs, 1/6 = 17% Private INDELs
FGC does not count SNPs with only one match to a public dataset but includes them with shared.

Note also the large amount of what I call "noise" in low reliability SNPs in the Big Y data. The number of low reliability Private SNPs actually increases by a lot for Big Y data, though the coverage is smaller. Both FTDNA and BGI use barcoding. To my knowledge, the reason has not been found.

I have found (Edit - though not statistically significant) that NGC BIG Y data is of high quality at low reads, and I have validated both that I tried (2) by Sanger at YSeq.com. Also, I do count SNPs as private when they match only one public data set from a much different haplogroup, treating them as independent mutations. This brings the percentages down to 35-40%. Further data mining may bring this even lower, btw.

Just where are the missing SNPs that are not found in Big Y? They are all outside of the coverage/BED of Big Y.

Michał
05-26-2014, 10:19 PM
You just did though I won't speculate on the age of L2.

You have already admitted that:
1) those "private" SNPs are not all your BigY and FGC-tested SNPs under L2,
2) you don't know whether they are from the comparable levels on your tree,

Additionally:
3) when comparing all your "high reliability" SNPs shared with the reference samples, your BigY/FGC ratio is 76/124, which corresponds to 61.3%,
4) when additionally including the "high reliability" private SNPs, your overall BigY/FGC ratio is 92/163, which corresponds to 56.4%.

Thus, not even close to 35% (and well within the 50-65% range I have suggested).
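
For anyone who wants to re-check these percentages, here is a minimal Python sketch of the arithmetic. The counts are the "high reliability" figures from the FGC reports quoted above; the dictionary keys and the `ratio_pct` helper are hypothetical names chosen for illustration, not anything from FGC or FTDNA.

```python
# "High reliability" variant counts copied from the two FGC reports quoted in this thread.
fgc_bgi = {"shared_snps": 124, "private_snps": 39}   # FG1059-A (BGI)
big_y = {"shared_snps": 76, "private_snps": 16}      # 4UEEK (Big Y)


def ratio_pct(bigy_count, fgc_count):
    """Return the BigY/FGC ratio as a percentage, rounded to one decimal."""
    return round(100.0 * bigy_count / fgc_count, 1)


shared = ratio_pct(big_y["shared_snps"], fgc_bgi["shared_snps"])
private = ratio_pct(big_y["private_snps"], fgc_bgi["private_snps"])
combined = ratio_pct(sum(big_y.values()), sum(fgc_bgi.values()))

print(f"shared only:  {shared}%")
print(f"private only: {private}%")
print(f"shared + private: {combined}%")
```

The private-only ratio is the one that falls near 40%, while the shared and combined ratios fall inside the 50-65% range, which seems to be the source of the disagreement.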

haleaton
05-26-2014, 11:43 PM
You have already admitted that:
1) those "private" SNPs are not all your BigY and FGC-tested SNPs under L2,
2) you don't know whether they are from the comparable levels on your tree,

Additionally:
3) when comparing all your "high reliability" SNPs shared with the reference samples, your BigY/FGC ratio is 76/124, which corresponds to 61.3%,
4) when additionally including the "high reliability" private SNPs, your overall BigY/FGC ratio is 92/163, which corresponds to 56.4%.

Thus, not even close to 35% (and well within the 50-65% range I have suggested).

Sorry, I was confused. I thought I and others were talking about Private SNPs likely downstream.

Just like YFull analysis I do not count the shared high reliable SNPs when the public samples shared are in many different haplogroups.

Does anybody use these [Edit: Variant ] SNPs?

haleaton
05-27-2014, 01:06 AM


50%- 60% is consistent with our white paper estimates.

Note: This white paper, while still consistent with recent Big Y and FGC results, dates to Feb 2014.

Thanks! You have analyzed both my FGC BGI and FTDNA Big Y datasets--and boy I am glad I tested with FGC. Big Y was just for helping compare with future FTDNA Big Y matches, which so far have not materialized. Is the percentage you talk about for Private variants or Private + Shared Variants?

Inquiring minds want to know.

FGC Corp
05-27-2014, 03:56 AM
Thanks! You have analyzed both my FGC BGI and FTDNA Big Y datasets--and boy I am glad I tested with FGC. Big Y was just for helping compare with future FTDNA Big Y matches, which so far have not materialized. Is the percentage you talk about for Private variants or Private + Shared Variants?

Inquiring minds want to know.

Email sent.

Best,

Justin

Michał
05-27-2014, 05:30 AM
Sorry I was confused. I thought I and others were talking about Private SNPs likely downstream

Just like YFull analysis I do not count the shared high reliable SNPs when the public samples shared are in many different haplogroups.

Does anybody use these [Edit: Variant ] SNPs?
I guess those "high reliability" SNPs shared with reference samples include SNPs that are specific for your haplogroup (including those specific for your particular sublineage of L2) and not just some recurrent SNPs that are seen in other haplogroups. And of course the non-private (or shared) SNPs downstream of P312 (or downstream of L2) are important because these are the ones that define your exact position in the P312 tree.

palamede
05-27-2014, 11:52 AM
You just did though I won't speculate on the age of L2.
FG1059-A (February Analysis) BGI
SNPs shared with reference samples (1723 total variants; 124 "high reliability" variants)
Private SNPs (684 total variants; 39 "high reliability" variants)
INDELs and MNPs shared with reference samples (391 total variants; 42 "high reliability" variants)
Private INDELs and MNPs (372 total variants; 6 "high reliability" variants)

4UEEK (April Analysis) BIG Y
SNPs shared with reference samples (901 total variants; 76 "high reliability" variants)
Private SNPs (2920 total variants; 16 "high reliability" variants)
INDELs and MNPs shared with reference samples (239 total variants; 24 "high reliability" variants)
Private INDELs and MNPs (465 total variants; 1 "high reliability" variants)



I understand why the "high reliability" filter matters for the private variants: we have to be sure of the quality of the new SNPs in order to stand on solid ground and build a sound new phylogeny. I have more difficulty understanding its importance for the variants shared with reference samples. We can assume that the first number given for the variants shared with reference samples is approximately the number of non-private variants accumulated since the established "ADAM". The ratio between the "high reliability" variants and that first number then gives a clue about the corresponding ratio for the private variants, i.e. between the number of "high reliability" private variants and the real number of private SNPs.

For example, for SNPs from BGI, 124/1723 = 7.2%, therefore the real number of private SNPs to discover in 23M should be approximately 42*100/7.2 = 583 SNPs. I must say 583 seems too big to me.

haleaton
05-27-2014, 01:15 PM
I guess those "high reliability" SNPs shared with reference samples include SNPs that are specific for your haplogroup (including those specific for your particular sublineage of L2) and not just some recurrent SNPs that are seen in other haplogroups. And of course the non-private (or shared) SNPs downstream of P312 (or downstream of L2) are important because these are the ones that define your exact position in the P312 tree.

Thanks! Yes, I did not mean to imply that much of the shared were not part of the upstream subclade-defining SNPs. I think my position relative to P312 as U152-L2, and also not matching any known subclades of L2, was well established by the FGC and YFull analysis, though I am far from an expert. I did include the very few shared SNPs that might be independent mutations in my counts. My point was just that counting backwards from the present to L2 using FGC BGI versus FTDNA Big Y Private SNPs might give different results because of coverage differences, with a ratio more like 35% - 60%, just for the low number of Private downstream SNPs. My results may be a statistical outlier.

Michał
05-27-2014, 02:48 PM
Thanks! Yes, I did not mean to imply that much of the shared were not part of the upstream subclade-defining SNPs. I think my position relative to P312 as U152-L2, and also not matching any known subclades of L2, was well established by the FGC and YFull analysis, though I am far from an expert. I did include the very few shared SNPs that might be independent mutations in my counts. My point was just that counting backwards from the present to L2 using FGC BGI versus FTDNA Big Y Private SNPs might give different results because of coverage differences, with a ratio more like 35% - 60%, just for the low number of Private downstream SNPs. My results may be a statistical outlier.
I wasn't aware that you don't show any known (shared) SNPs under L2, so in that case the number of your BigY-tested private SNPs (or the number of SNPs under L2) is indeed extremely small and probably much below the average for members of L2. For example, MitchellSince1893 recently posted that his Big Y test revealed 38 SNPs under L2, including 10 known (named) SNPs and 28 private mutations. Those 38 BigY-tested SNPs under L2 are almost equal to the total of 39 reliable SNPs under L2 that were detected in your FGC test, so it seems like you are much below the average for both the FGC-tested and Big-Y tested mutations under L2. Does anybody know the average number of reliable SNPs in the FGC-tested members of L2?

BTW, I've seen a couple of examples of such huge difference in a number of SNPs found in some parallel lineages, and this includes both FGC and Big Y. For example, the number of the FGC-tested SNPs under Z645 ranges from 46 to 91 (though most R1a-Z645 people show 62-68 SNPs downstream of Z645). This is the reason why we should avoid calculating any TMRCAs based on a very small number of descending lineages.

alan
05-27-2014, 03:14 PM
That is why what I would like to see is an average number of SNPs in those Big Y and Full Y testers of all P312 clades, to get a good idea of its date. It also seems more achievable to do this early for P312 as a whole, because there are presumably a lot more of them when they are pooled. It would also be interesting to get an average number of SNPs among those tested, pooling all of L11. These seem much more likely to be achievable now than looking further down the tree, with the diminishing numbers of testers in each subclade and sub-subclade.


I wasn't aware that you don't show any known (shared) SNPs under L2, so in such case the number of your BigY-tested private SNPs (or a number of SNPs under L2) is indeed extremely small and probably much below the average for members of L2. For example, MitchellSince1893 has recently posted an information that his Big Y test revealed 38 SNPs under L2, including 10 known (named) SNPs and 28 private mutations. Those 38 BigY-tested SNPs under L2 are almost equal to the number of all 39 reliable SNPs under L2 that were detected in your FGC test, so it seems like you are much below the average for both the FGC-tested and Big-Y tested mutations under L2. Does anybody know the average number of reliable SNPs in the FGC-tested members of L2?

BTW, I've seen a couple of examples of such huge difference in a number of SNPs found in some parallel lineages, and this includes both FGC and Big Y. For example, the number of the FGC-tested SNPs under Z645 ranges from 46 to 91 (though most R1a-Z645 people show 62-68 SNPs downstream of Z645). This is the reason why we should avoid calculating any TMRCAs based on a very small number of descending lineages.

MJost
05-27-2014, 04:42 PM
I used these ten FGC Variant SNP file counts between R1 and DF13, added 10 SNPs not shown in the Variant File but shown in the YKnot file, and added back a flat average of 70 HQ SNPs under DF13, using a rate of 3.3 generations per SNP (99 years).



DF13Sub                       R1-DF13 SNPs   Total SNPs
FGC5496                       1478           1548
DF21                          1431           1501
DF21                          1533           1603
DF21                          1340           1410
DF21                          1431           1501
DF41                          1302           1372
DF49                          1340           1410
Z253                          1452           1522
Z254                          1313           1383
L1065                         1670           1740
Total                         14290          14990
Average                       1429           1499
Generations (@3.3 per SNP)    433            454
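
The bottom rows of the table can be re-derived with a short Python sketch. The per-sample counts are copied from the table; the variable names are hypothetical. One assumption is worth flagging loudly: the "Generations" row is reproduced here by dividing the average SNP count by 3.3, which is what the posted figures (433 and 454) imply, whatever the intended unit of the 3.3 rate.

```python
# R1-to-DF13 SNP counts per FGC sample, copied from the table above.
r1_to_df13 = [
    ("FGC5496", 1478), ("DF21", 1431), ("DF21", 1533), ("DF21", 1340),
    ("DF21", 1431), ("DF41", 1302), ("DF49", 1340), ("Z253", 1452),
    ("Z254", 1313), ("L1065", 1670),
]
FLAT_SNPS_UNDER_DF13 = 70  # flat average of HQ SNPs added back under DF13
DIVISOR = 3.3              # assumption: divisor implied by the table's last row

base_counts = [count for _, count in r1_to_df13]
total_counts = [count + FLAT_SNPS_UNDER_DF13 for count in base_counts]

avg_base = sum(base_counts) / len(base_counts)
avg_total = sum(total_counts) / len(total_counts)

# Dividing the averages by 3.3 reproduces the "Generations" row.
gens_base = round(avg_base / DIVISOR)
gens_total = round(avg_total / DIVISOR)

print(f"Total:       {sum(base_counts)}  {sum(total_counts)}")
print(f"Average:     {avg_base:.0f}  {avg_total:.0f}")
print(f"Generations: {gens_base}  {gens_total}")
```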



MJost

mcg11
05-28-2014, 01:56 PM
"BTW, I've seen a couple of examples of such huge difference in a number of SNPs found in some parallel lineages, and this includes both FGC and Big Y. For example, the number of the FGC-tested SNPs under Z645 ranges from 46 to 91 (though most R1a-Z645 people show 62-68 SNPs downstream of Z645). This is the reason why we should avoid calculating any TMRCAs based on a very small number of descending lineages."

This is true for any founder's tree. In Clan Gregor, we have a founder c. 1450 AD (the first uncertainty is who had the mutation: father, son, grandfather, grandson, etc.), with a list of over 80 identified descendants. The number of STR mutations varies from 0 out of 67 markers to 9 out of 67, depending on when branching from the main line occurred. The number of SNPs varies too, due to similar branching effects. Much as with STR analysis, care must be used in estimating TMRCAs, and SDs should also be provided.

Unlike the knowledge we have gained of the properties of STRs (rates, multiple steps, variable rates with allele value), we have little knowledge of the properties of SNPs. Do all SNPs mutate at the same rate? Is the rate the same from each of the values (G, C, A, T)? These questions and others will be dealt with in time, of that I am sure.

Jean M
05-30-2014, 07:37 PM
Patricia Balaresque et al., Gene Conversion Violates the Stepwise Mutation Model for Microsatellites in Y-Chromosomal Palindromic Repeats, Human Mutation, Volume 35, Issue 5, pages 609–617, May 2014. http://onlinelibrary.wiley.com/doi/10.1002/humu.22542/full (open access)


The male-specific region of the human Y chromosome (MSY) contains eight large inverted repeats (palindromes), in which high-sequence similarity between repeat arms is maintained by gene conversion. These palindromes also harbor microsatellites, considered to evolve via a stepwise mutation model (SMM). Here, we ask whether gene conversion between palindrome microsatellites contributes to their mutational dynamics. First, we study the duplicated tetranucleotide microsatellite DYS385a,b lying in palindrome P4. We show, by comparing observed data with simulated data under a SMM within haplogroups, that observed heteroallelic combinations in which the modal repeat number difference between copies was large, can give rise to homoallelic combinations with zero-repeats difference, equivalent to many single-step mutations. These are unlikely to be generated under a strict SMM, suggesting the action of gene conversion. Second, we show that the intercopy repeat number difference for a large set of duplicated microsatellites in all palindromes in the MSY reference sequence is significantly reduced compared with that for nonpalindrome-duplicated microsatellites, suggesting that the former are characterized by unusual evolutionary dynamics. These observations indicate that gene conversion violates the SMM for microsatellites in palindromes, homogenizing copies within individual Y chromosomes, but increasing overall haplotype diversity among chromosomes within related groups.

Jean M
05-30-2014, 07:59 PM
J Purps et al., A global analysis of Y-chromosomal haplotype diversity for 23 STR loci, Forensic Science International: Genetics, in press.
http://www.fsigenetics.com/article/S1872-4973%2814%2900084-2/abstract (open access)


In a worldwide collaborative effort, 19,630 Y-chromosomes were sampled from 129 different populations in 51 countries. These chromosomes were typed for 23 short-tandem repeat (STR) loci (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385ab, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4, DYS481, DYS533, DYS549, DYS570, DYS576, DYS643) and using the PowerPlex Y23 System (PPY23, Promega Corporation, Madison, WI). Locus-specific allelic spectra of these markers were determined and a consistently high level of allelic diversity was observed. A considerable number of null, duplicate and off-ladder alleles were revealed. Standard single-locus and haplotype-based parameters were calculated and compared between subsets of Y-STR markers established for forensic casework. The PPY23 marker set provides substantially stronger discriminatory power than other available kits but at the same time reveals the same general patterns of population structure as other marker sets. A strong correlation was observed between the number of Y-STRs included in a marker set and some of the forensic parameters under study. Interestingly a weak but consistent trend towards smaller genetic distances resulting from larger numbers of markers became apparent.

razyn
06-01-2014, 04:31 PM
I haven't seen a lot of recent evidence on this thread of interest in R1b-DF27 issues; but in case any should arise, I want to note that the main active thread about the North/South Cluster now includes some FGC "variant compare" data from a tested member thereof, yours truly. I'll link to that thread for deeper background, but FYI my results show 35 highly reliable SNPs below CTS4065 (which is below Z220 and Z295, on a branch of the latter that appears to be primarily non-Iberian). Some, but presumably not all, of the 35 should belong to my known L484+ subset of CTS4065.

http://www.anthrogenica.com/showthread.php?1275-DF27-Z295-CTS4065&p=41288&viewfull=1#post41288

dp
07-10-2014, 05:24 PM
I've seen estimates of TMRCA of L21, M529 at about 8700 ybp, and M222 at about 3800 ybp [Myres 2011, "...R1b Holocene founder effect...", Table S2].
What are the current estimates of TMRCA for L21, DF13, DF49, S476 (Z2976), DF23, Z2961 (Z2980, S6154), M222?
David Powell
dp :-)
PS: and maybe P312?

Brent.B
08-19-2014, 04:12 AM
BTW, I've seen a couple of examples of such huge difference in a number of SNPs found in some parallel lineages, and this includes both FGC and Big Y. For example, the number of the FGC-tested SNPs under Z645 ranges from 46 to 91 (though most R1a-Z645 people show 62-68 SNPs downstream of Z645). This is the reason why we should avoid calculating any TMRCAs based on a very small number of descending lineages.

Would this also apply to YP254 (downstream of L260)? Half the results show 10-12ish SNPs downstream, while the other half show 18-22ish SNPs downstream...

Lappa
10-07-2014, 11:11 AM
# of SNPs down from YP256 level (based on 13 Big Ys and the FTDNA algorithm) is:
MAXIMUM # is 19 (3 samples)
MINIMUM # is 11
AVERAGE is: 14,2 (by FTDNA reports) and 17 (by YFull.com - but only 6 samples were analysed and available)

Of course it's NOVEL and reliable variants only.

George Chandler
10-07-2014, 03:44 PM
# of SNPs down from YP256 level (based on 13 Big Ys and the FTDNA algorithm) is:
MAXIMUM # is 19 (3 samples)
MINIMUM # is 11
AVERAGE is: 14,2 (by FTDNA reports) and 17 (by YFull.com - but only 6 samples were analysed and available)

Of course it's NOVEL and reliable variants only.

Have the SNP's been vetted by YSEQ? The reason I ask is that of all the Big Y high and med quality test results which have come back still more are being culled by YSEQ.

George

Michał
10-07-2014, 05:02 PM
Have the SNPs been vetted by YSEQ? The reason I ask is that, of all the Big Y high- and medium-quality test results which have come back, still more are being culled by YSEQ.

There is no chance that all these private SNPs will be verified using Sanger sequencing. There are simply too many of them, and most people could not afford it. However, in the R1a project we (usually) analyse all candidates for "private" SNPs very carefully in order to exclude all unreliable mutations located in potentially unstable regions (like multicopy/palindrome sequences). If we have any doubts, we consult the experts from YFull. Also, many of those initially "private" SNPs have subsequently turned out to define new subclades, so many of them became testable at FTDNA. So far, I know of only one case in which such an "approved" NGS-derived mutation identified in our project turned out to be slightly less reliable (YP264), and this is only because the experts realized that it is located in a duplicated region. Even so, all Sanger results from FTDNA so far (both the positive and negative ones) are very consistent, so we have no reason to exclude this SNP marker as phylogenetically unreliable.

MJost
11-28-2014, 04:57 PM
OK, here are my SNP counting results, using my numbers from Full Genomes Y-DNA.

I have 304 total SNPs below DF13.

I have a total number of 1,495 private and public mutations under R1.

T. Karafet et al. estimated the age of R1, the parent of R1b, as 18,500 (12,500 - 25,700) years before present. (Note that this range is skewed slightly towards the older side.)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2336805/

Working the numbers:

Take the estimated age of 18,500 years. The last 2,000 years, presumed to average 30 years per generation, equal 67 generations. The previous 16,500 years at 25 years per generation equal 660 generations, for a total of 727 generations. 1,495 mutations over 727 generations equates to about 2 mutations per generation.

Taking the 'Max at 27,500 ybp': 67 + 1028 equals 1095 generations, which equates to 1.45 mutations per generation.

Because of the skew, there is a slightly better probability that R1 is a bit older than 18,500 than that it is younger. A plus-one-sigma age would be at most 23,396 ybp instead of 18,500. So recalibrating: 23,396 ybp - 2000 = 21,396 years at 25 years per gen = 856 + 67 = 923 total generations. 1426 SNPs / 923 gens = 1.56 SNPs per generation. Recalibrating to 23,396 instead of 18,500, and thus using 1.56 SNPs instead of 2.0 per generation: 194.9 generations (304 / 1.56) - 60.6 (the recent generations at 33 years each) = 134.3, or 2000 + 3357.5 = 5357.5 ybp.

Now take my 304 mutations below DF13: divided by 2 mutations per generation, that equals 152 generations. Those 152 (67 generations back to 0AD, 85 prior to 0AD) equal 2,000 + 2,125, a total of 4,125 years. This age is very close to my STR Founders calculations using variance for DF13.

At the max of 27,500 ybp for R1, 304 mutations below DF13 divided by 1.45 equals 210 generations. 67 + 143 at 30 and 25 years respectively gives 2000 + 3,575, a total of 5,575 ybp.

Recalibrating using 23,396 (1 sigma), 304 mutations below DF13 divided by 1.56 equals 194.9 generations. 67 + 127.9 at 30 and 25 years respectively gives 2000 + 3198, a total of 5,198 ybp.

DF13 = 4,125 ybp (max 1 SD 5,198 ybp). Using the high side of the 68% curve, adding five SNPs back to L21 gives 3.2 generations at 1.56 SNPs per gen, or 80 years, which puts L21 at 4205 years before present. Using 1950 as a base year, that is 2255 BC, against my previously stated L21 figure of 4,275 ybp.

Assuming the 30 years per generation average from the present back to 1AD is incorrect and the real number is higher, let's say 33 years per generation, that lowers the recent generations to 60.6 instead of 67. Using 304 mutations divided by 2 mutations per generation equals 152 generations. 152 - 60.6 = 91.4 gens before 1 BCE. 2000 + 2285 = 4285 ybp. Back to L21 will be approx. 2,365 BC.
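
The arithmetic in this post can be sketched in a few lines of Python. This is only an illustration under the post's own assumptions (the most recent 2,000 years counted separately, 25 years per generation before that, and a flat number of SNPs per generation). The function name and its parameters are made up for the example and are not taken from any tool.

```python
def age_from_snps(snps, snps_per_gen, recent_gens,
                  recent_years=2000, older_gen_len=25):
    """Years before present implied by a SNP count under a two-era model.

    recent_gens is the number of generations assumed to span the most
    recent `recent_years` (67 at 30 years per generation, 60.6 at 33).
    Everything older is counted at `older_gen_len` years per generation.
    """
    total_gens = snps / snps_per_gen       # e.g. 304 / 2 = 152 generations
    older_gens = total_gens - recent_gens  # generations before 0AD
    return recent_years + older_gens * older_gen_len


# 304 SNPs below DF13 at 2 SNPs per generation
print(round(age_from_snps(304, 2, recent_gens=67)))    # 30-year recent generations
print(round(age_from_snps(304, 2, recent_gens=60.6)))  # 33-year recent generations
```

Passing the recent generation count explicitly keeps the rounding choice (67 versus 66.7) visible, since that choice shifts the result by a few years.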

MJost

MJost
12-02-2014, 04:04 AM
So let's do another SNP calibration.

Mal'ta boy, the roughly 23,000-year-old remains from Mal'ta in Siberia near Lake Baikal, belonged to paragroup R-M207, i.e. basal haplogroup R* Y-DNA (xR1,R2). Radiocarbon dating estimated the age of the bones to be about 24K years. Sequencing the Mal'ta boy genome showed both East Asian and Eurasian roots. The study "determined the most likely phylogenetic affiliation of the MA-1 Y chromosome to a basal lineage of haplogroup R (Supplementary Information, section 8 and Supplementary Fig. 5a)."

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4105016/
http://www.illumina.com/documents/icommunity/article_2014_03_native_am.pdf
http://www.nature.com/nature/journal/v505/n7481/extref/nature12736-s1.pdf


I then pulled the Y-DNA R SNP list from YFull's Experimental YTree v2.28 dated 25 November 2014, which uses ISOGG information, in order to recalibrate the R branch, since we know Mal'ta boy's age. (See data under spoiler.) The radiocarbon-dated Mal'ta boy bone is 24,000 cal ybp.

Each count includes the first derived branch SNP listed below it.

Summary

R - 22
R.1 (Mal'ta Boy) - 4
R1 - 44
R1b - 2
R1b1 - 2
R1b1.1-L389 - 2
R1b1a - 2
R1b1a2 - 7
R1b1a2a - L23 - 3
R1b1a2a1 - L51 - 4
R1b1a2a1a - L11 - 7
R1b1a2a1a2 - P312 - 2
R1b1a2a1a2c - L21 - 4
Count = 101 SNPs under R to above DF13 (not counting the 4 on the R.1 Mal'ta boy branch)
DF13 derived - 29

This DF13 SNP count comes from my Full Genomes subclade FGC5494, with 29 SNPs: mostly Sanger validated, plus a few shared branching SNPs near FGC5494.

The 101 SNPs between R and DF13 plus 29 validated SNPs = 130 SNPs.

Adjustment to get the number of SNPs to a most recent common ancestor: 101 - 21 R SNPs = 80 SNPs + 4 SNPs from R.1 Mal'ta boy back towards R = 84, plus 29 SNPs = 113 SNPs.

(Note: the overall average is 24000 / 113 = 212.4 years per mutation.)

Calculating different periods with different years per generation.

Present to 1AD = 2000 years/30 years per gen = 67 generations.
1AD to 22000 bc = 22000/25 years per gen = 880 generations.
Total 947 generations to have 113 mutations.
947 / 113 = 8.4 generations per mutation.

@ 8.4 gen per mutation.
67 gens had 8 SNPs back to 0AD (2k yrs).
880 gens had 105 SNPs from 0AD to 22K bc (22K yrs).

The 880 generations / 105 = 8.4 x 25 years per gen = 210 years per SNP.

FGC5494 has 29 SNPs, less first 8 SNPs (67 gens) leaving 21 remaining SNPs prior to 0AD.

210 years per SNP x 21 SNPs = 4410 years before 0AD. (edited correcting result)

2000 + 4410 = 6410 ybp or 4410 bc for 29 SNPs to have occurred under DF13.

This assumes only one SNP mutation at each meiosis event.
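
The calibration steps above can be written out as a short Python sketch. It follows the post's assumptions (30-year generations for the last 2,000 years, 25-year generations before that, 113 SNPs over 24,000 years) and rounds where the post rounds. The function names are illustrative only.

```python
def calibrate(anchor_age=24000, anchor_snps=113,
              recent_years=2000, recent_gen_len=30, older_gen_len=25):
    """Return (SNPs in the recent era, years per SNP in the older era)."""
    recent_gens = round(recent_years / recent_gen_len)        # 67 generations
    older_gens = (anchor_age - recent_years) / older_gen_len  # 880 generations
    gens_per_snp = (recent_gens + older_gens) / anchor_snps   # about 8.4
    recent_snps = round(recent_gens / gens_per_snp)           # SNPs since 0AD
    older_snps = anchor_snps - recent_snps                    # SNPs before 0AD
    years_per_snp = round(older_gens / older_snps * older_gen_len)
    return recent_snps, years_per_snp


def node_age_ybp(snps_to_present, recent_snps, years_per_snp, recent_years=2000):
    """Age of a node given the number of SNPs between it and the present."""
    return recent_years + (snps_to_present - recent_snps) * years_per_snp


recent_snps, years_per_snp = calibrate()
print(recent_snps, years_per_snp)
print(node_age_ybp(29, recent_snps, years_per_snp))  # the 29 SNPs under DF13
```

Without the intermediate rounding, the rate comes out slightly below 210 years per SNP, so the final age moves by about ten years.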

MJost


Here is the YFull SNP tree Experimental YTree v2.28 dated 25 November 2014:

{} = tree SNPs from YFull tree includes the next downstream SNP.

R--M795/CTS11075/PF6078 * CTS2913/PF6034/M667 * V2573/YSC0001265/CTS3229/PF6036/M672... 19 SNPs --- "M207/UTY2/PF6038/PAGES00037 * P224/PF6050 * P227 *

P229/PF6019 * P232 * P280/PF6068 * P285/PF6059 * YSC0000201/PF6057/M734/S4 * YSC0000233/PF6077/L1347/M792 * CTS3622/PF6037 * V3466/CTS9200/PF5938 *

PF5953/M764 * Y453 * Y457 * Y472/F47/M607/PF6014/S9 * Y480 * YSC0000232/M789/L1225/PF6076 * M696/CTS5815/PF6044 * M732/CTS8311/PF6055" {22}

R* id:MA1

R-Y482--M799/PF6079/YSC237 * PF6040/YSC179 * PF5919/F356/M703... 1 SNPs --- "Y482/PF6056/F459" {4}
R-Y482*

R1--M813/CTS12618/PF6089 * M812/CTS12546/PF6088 * PF6073... 41 SNPs ---"V1478/PF6116/F102/M625 * V1356/PF6114/F93/M621 * PF6110 * CTS2680 *

M173/P241/PF6126/PAGES00029 * M306/S1/PF6147 * P225/PF6128 * P231 * P233/PF6142 * P234/PF6141 * P236/PF6137 * P238/PF6115 * P242/PF6113 * P245/PF6117 *

P286/PF6136 * P294/PF6112 * YSC0000230/L1352/M785 * L875/PF6131/YSC0000288/M706 * CTS3123/PF6124/M670 * CTS3321/PF6125/M673 * PF5477/F28 * PF6069 *

PF6118/M640 * PF6133/F378/M711 * F132/M632 * Y290/F211 * Y305/PF6031 * Y459 * Y464/PF6008 * Y465 * Y512 * M643 * M663/CTS2565/PF6122 * M714/CTS7066/PF6049 *

M717/CTS7122/PF6135 * M730/CTS8116/PF6138 * M748/YSC0000207 * M781/PF6145 * PF6011 * PF6119 * PF6146" {44}

R1*

(R1a--Y2362 * Y1432 * Y1975... 149 SNPs)

R1b--M343/PF6242 {2}

R1b1--L278 * M415/PF6251 {2}

R-L389/PF6531R {2}

R1b1a--L320/PF6092 * P297/PF6398 {2}

R1b1a*

(R1b1a1--M478)

R1b1a2--L265/PF6431 * M269/PF6517 * S3/PF6485... 4 SNPs --- S10/PF6399 * YSC0000269/PF6475/S17 * L1063/CTS8728/PF6480/S13 * PF6410/M520" {7}

R1b1a2*

R1b1a2a--L23/S141/PF6534 * L49.1/L49.2/PF6276/S349 * L150/PF6274/L150.1/PF6274.1/L150.2/PF6274.2 {3}

R1b1a2a*

R1b1a2a1--L51/M412/S167/PF6536 * CTS10373/PF6537 * PF6414... 1 SNPs --- "PF6355" {4}

R1b1a2a1*

R1b1a2a1a--L11/S127/PF6539 * L52/PF6541 * L151/PF6542... 5 SNPs --- "P310/S129/PF6546 * P311/S128/PF6545 * YSC0000191/PF6543/S1159 * CTS7650/PF6544/S1164 *

PF5856" {7}

R1b1a2a1a*

(R1b1a2a1a1--M405/S21/U106)

R1b1a2a1a2--P312/S116/PF6547 * Z1904/CTS12684/PF6548 {2}

R1b1a2a1a2*
(R1b1a2a1a2a--DF27/S250)
(R1b1a2a1a2b--S28/U152/PF6570)
(R1b1a2a1a2e--Z4161 * DF19/S232 * S1354)
(R1b1a2a1a2f--Z6001/DF99/S11987)
(R-Z2244--V1864/Z2246 * Z2244 * Z2245... 5 SNPs --- "Z2247 * Z2248 * Z2249 * Z2250 * Z2251")
R1b1a2a1a2c--L21/M529/S145 * L459 * Y2598/S552... 3 SNPs --- "Z245/S245 * Z260 * Z290/S461" {4}

R1b1a2a1a2c1--DF13/S521/CTS241

MJost
12-09-2014, 01:52 AM
Now I apply the pure SNP counting calibrated in my previous post to calculate node points, using the same 210 years per SNP from the M269 node to the present.

M269>> 55 SNPs to present, less the first 8 SNPs (67 gens), leaving 47 remaining SNPs prior to 0AD.
210 years per SNP x 47 SNPs = 9870 years before 0AD.
2000 + 9870 = 11870 ybp or 9870 bc for 55 SNPs to have occurred under M269.

For reference, this recent study shows two earlier SNP-based datings, from Xue and Mendez, for comparison.

http://mbe.oxfordjournals.org/content/early/2014/11/26/molbev.msu327.abstract
http://mbe.oxfordjournals.org/content/early/2014/11/26/molbev.msu327/suppl/DC1

Hallast_TablesS1-3_5-11Revised281014.xlsx

Table S6: TMRCA estimates based on SNPs and STRs.
Based on SNPs

Clade: R1b-M269, N = 145

Mutation rate source   TMRCA (years)   TMRCA range based on mutation rate C.I. (years)   rho stdev (years)
Xue et al 2009         4898            1959-16328                                        96
Mendez et al 2013      7939            6928-11158                                        156

MJost

MJost
12-09-2014, 10:54 PM
Back in Oct 14 over on the [email protected] forum I posted this info.

"Lets run some numbers.

The aim is to generate individual Y-chromosomal sequences with adequate coverage. These results then need to be appropriately filtered to generate validated Y-SNP calls, including by PCR and Sanger sequencing. There are currently both FGC and BigY kits with SNP calls ranked as High Quality that were retested and produced a reduced number of validated SNPs.

A calibrated human Y-chromosomal phylogeny based on resequencing

Wei Wei, Qasim Ayub, [...], and Chris Tyler-Smith

""estimate this number from the sequences of the three-generation family... observation of two germline mutations in two transmissions of 8.97 Mb is consistent with the expectation of ∼0.6 mutations in two transmissions (0.3 variants observed per meiosis in 10.5 Mb)""


Let's say we have a point in time of 4000 ybp. With 33-year generations (ypg), that equates to 121.2 generations. At 0.3 variants observed per meiosis, that works out to 36.4 mutations that would occur.

At 3750 ybp (my suggested DF13 age), 33-year generations equate to 113.6 generations. At 0.3 variants observed per meiosis, that works out to 34.1 mutations. The 250 fewer years make a difference of 7 to 8 generations.

A father who is older at the point of meiosis may increase the fractional number of variants observed, but only for that one generation. Even if everyone was 40 years old at meiosis, at 4000 ybp only 40 SNPs would have occurred.

Even at 5k ybp, as some prefer for DF13, same 33 ypg is 151.5 gens, 45.5 SNPs."
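
The figures in this passage come from one formula: age divided by generation length, times variants per meiosis. A minimal Python sketch, using the 0.3 variants per meiosis quoted from Wei et al. and the 33-year generation assumed in the post (the function name is made up for the example):

```python
def expected_snps(age_ybp, years_per_gen=33, variants_per_meiosis=0.3):
    """Return (generations, expected SNP count) for a lineage of a given age."""
    generations = age_ybp / years_per_gen
    return generations, generations * variants_per_meiosis


for age in (3750, 4000, 5000):
    gens, snps = expected_snps(age)
    print(f"{age} ybp: {gens:.1f} generations, {snps:.1f} expected SNPs")
```

The rate is tied to the amount of sequence covered in that study, so counts from a test with different coverage would need a different value.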

When asked why I used the above methods, I wrote:

"I am showing the alternate studies that produce similar results. Check out Ray Banks' comments (he has light years more understanding of these processes).
http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2012-11/1352614028
Ray, using the info for the total SNPs discovered (not just validated ones), noted that "a new detectable SNP arises every 55 yrs (or a little less)."

So my 70 (approx.) Full Genomes SNPs under DF13 at 55 years per SNP equate to 3850 ybp (my personal estimate of DF13 is around 3750). You mentioned 60 SNPs, which recalibrated would be 64 years per SNP. My deeper shared SNPs plus my Sanger'ed list (totaling 29 of the 70 SNPs) would be 133 years per SNP, close to my estimated 129 years per mutation. Initially it was shown that the average of 33 BigY SNPs would be 116 years per SNP.

I am just showing that the various studies and the different methods used appear to produce similar results, which can be used in the new world of calibrating SNPs."
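
Each "years per SNP" figure quoted here is simply an assumed node age divided by a SNP count. A small Python sketch of that division, using the counts and ages mentioned in the post (the helper name is illustrative):

```python
def years_per_snp(age_ybp, snp_count):
    """Average years per SNP implied by an assumed age and a SNP count."""
    return age_ybp / snp_count


cases = [
    ("60 SNPs at 3850 ybp", 3850, 60),
    ("29 validated SNPs at 3850 ybp", 3850, 29),
    ("29 validated SNPs at 3750 ybp", 3750, 29),
    ("33 BigY SNPs at 3850 ybp", 3850, 33),
]
for label, age, count in cases:
    print(f"{label}: {years_per_snp(age, count):.1f} years per SNP")
```

Because the age is an input, the rate inherits whatever uncertainty the assumed DF13 age carries.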

MJost

alan
12-10-2014, 08:55 AM
So let's do another SNP calibration. [...]

2000 + 4410 = 6410 ybp or 4410 bc for 29 SNPs to have occurred under DF13.

Thanks. I know Michal also produced ages that were significantly older than the STR dates for R1b clades. We clearly need a couple of more ancient R1b full genomes to get confidence on the SNP method.

However, dates as early as 4400BC for DF13 pose archaeological problems. The geography of DF13 is very much NW Europe, and it is extremely rare in eastern Europe. So a date like that would actually push DF13 into a period about the time of the arrival of farming in the area where it is largely found today. For the big picture of P312 as a whole, I am guessing that method would push it back to maybe almost 6000BC. P312 is still a predominantly northern, central, south-western and Atlantic clade, so that would place it right back at the start of the Neolithic within that zone. Again that is hard to believe, because the zone is actually split into LBK and Cardial streams, something that has always argued against P312's distribution being linked to one of those cultures.

L11 would be pushed back a further 1500 years into the pre-farming era of Europe. L51*, again barely known east of Austria, would end up in the early post-YD Mesolithic of the area. All of these dates would demand a much earlier presence of L51-derived R1b than any other evidence suggests, and would push the origins back to even pre-farming times in central Europe. Again this seems very unlikely.

In short, I suppose the geography and the ancient DNA suggest these dates are improbably old.

alan
12-10-2014, 09:47 AM
Just to check I am understanding this: are the 7 SNPs leading from L11 to P312 unknown in the line that leads from L11 to U106?

If so, your list of SNPs between each stated one and the one below it also highlights that what was once thought of as a rather lightning spread is not the case. If there are 7 SNPs between L11 and P312, then that is a large amount of separation no matter how we cut the years per SNP. Indeed, in general the SNP model gives a very different impression of the whole L51-L11-P312/U106-L21/U152/DF27-DF13 sort of sequence: that this was a far more drawn-out process of fission.

It certainly is no longer possible IMO to see L11 as a whole as linked to a short phase like beaker. L11 would seem to me to have a pre-beaker story of very significant length, which is very interesting given the distribution. I think with 7 SNPs between it and P312, and perhaps accepting the STR-based anchor date linking P312 with earlyish beaker, even very conservative estimates of SNP generations must place L11 way before the beaker era, in the c. 3200-4000BC kind of period.

These considerations and the distribution do make me think that L11* and perhaps U106 did exist, but not necessarily exclusively, in the corded ware culture, with L11* also having a history somewhere pre-dating CW in a culture of the 4th millennium BC which somehow also led to beaker.

Well, it's getting clearer that corded ware was a mixture of farming and steppe elements melded together c. 3000BC, and there is an absence of R1b in the former in ancient DNA. So it does suggest the latter is where L11 was hiding in the 4th millennium. However, this has the interesting implication that before 3000BC L11* must have been hiding in a place where it is not at all common today, in eastern Europe, which would support its Yamnaya or other similar steppe origins, discordant though that may look today.

If some L11 went into corded ware and P312 ended up linked to beaker, that at least seems to have some sort of echo in the L11xP312xU106 pattern, which has both Alpine and Baltic scatters. However, while we have a Corded Ware home for some of the L11 that can take us back to 3000BC and down to 2500BC and beyond (creating a plausible chrono-geographical-cultural context for U106 to arise), we have the issue of what the L11 line leading to P312 was doing while other L11 lines were likely living in Corded Ware. Corded ware and L11 through to U106 has a nice simple logic to it, without anything too low-visibility or geographically epic being required to explain it. It makes perfect sense.

The roots of P312 and beaker, on the other hand, are much less self-evident. I think it all boils down to one question: where was the L11 line leading to P312 living in pre-beaker times while the other L11 line leading to U106 was likely living in Corded Ware?




R1b1a2a1 - L51 - 4
R1b1a2a1a - L11 - 7
R1b1a2a1a2 - P312 - 2
R1b1a2a1a2c - L21 - 4
Count =101 SNPs under R to above DF13
DF13 derived - 29

Michał
12-10-2014, 03:57 PM
I know Michal also produced ages that were significantly older than the STR dates for R1b clades. We clearly need a couple of more ancient R1b full genomes to get confidence on the SNP method.
However, dates as early as 4400BC for DF13 pose archaeological problems.

I agree that these ages suggested by MJost seem to be overestimates. However, I don't understand many details of his procedure, so it is hard to point out where the problem lies.

On the other hand, it is worth noting that when accepting the DF27-related scenario for the initial Bell Beakers expansion, as proposed by Jean M, one needs to assume that P312 is at least 5000 years old, or let's say at least 5000-5500 years old, with L11 being at least 5500-6000 years old in such case, which is quite consistent with my SNP-based estimates.



The roots of P312 and beaker, on the other hand, are much less self-evident. I think it all boils down to one question: where was the L11 line leading to P312 living in pre-beaker times while the other L11 line leading to U106 was likely living in Corded Ware?


I would like to point out that P312 expanded very shortly after separating from U106. In fact, if this single mutation (P312) had not arisen, we would need to treat U106 as a sister clade of DF27, U152, L21, L238 and DF19. In other words, we don't have much data suggesting a completely different fate for U106 (especially when compared to such subclades of P312 as L238 and DF19) when the very initial stage of development of P312 and U106 is considered.

MJost
12-10-2014, 04:22 PM
> Alan: Just to check I am understanding this - are the 7 SNPs leading between L11 and P312 unknown in the line that leads from L11 to U106?

Both P312 and U106 are under P311 which is one SNP above.


The SNP counts below are given as the designated SNP block name followed by the number of unordered SNPs in the block, which includes the next identified SNP below, e.g. the L23 block has three SNPs, including L51.
R1b1a2 - M269 - 7
R1b1a2a - L23 - 3
R1b1a2a1 - L51 - 4
R1b1a2a1a - L11 - 7
R1b1a2a1a2 - P312 - 2
R1b1a2a1a2c - L21 - 4


My post:
"Let's say we have a point in time of 4000 ybp. With 33-year generations (ypg), that equates to 121.2 generations. At 0.3 variants observed per meiosis, that works out to 36.4 mutations that would occur.

At 3750 ybp (my suggested DF13 age), 33-year generations equate to 113.6 generations. At 0.3 variants observed per meiosis, that works out to 34.1 mutations."


Using my estimated 129 years per mutation, I date the SNP blocks above DF13 by working backwards from 3750 ybp (my suggested DF13 age), which gives results similar to those of Wei Wei, Qasim Ayub, [...], and Chris Tyler-Smith.

R1b1a2 - M269 - 7 < 903 yrs in block plus 6330 ybp = 7233 (5233bc)
R1b1a2a - L23 - 3 < 387 yrs in block plus 5943 ybp = 6330 (4330bc)
R1b1a2a1 - L51 - 4 < 516 yrs in block plus 5427 ybp = 5943 (3943bc)
R1b1a2a1a - L11 - 7 < 903 yrs in block plus 4524 ybp = 5427 (3427bc)
R1b1a2a1a2 - P312 - 2 < 258 yrs in block plus 4266 ybp = 4524 (2524bc)
R1b1a2a1a2c - L21 - 4 < 516 yrs in block plus 3750 ybp = 4266 (2266bc)
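
The block dating above is a running sum: start at the assumed DF13 age and add 129 years for every SNP in each block, walking up the tree. A minimal Python sketch, with the block sizes copied from the list above and hypothetical names:

```python
# (block name, SNPs in block), ordered from just above DF13 upwards
BLOCKS = [("L21", 4), ("P312", 2), ("L11", 7), ("L51", 4), ("L23", 3), ("M269", 7)]


def date_blocks(anchor_ybp=3750, years_per_snp=129, blocks=BLOCKS):
    """Return {block name: age in ybp}, accumulating upwards from the anchor."""
    ages = {}
    age = anchor_ybp
    for name, snps in blocks:
        age += snps * years_per_snp  # years spanned by this block
        ages[name] = age
    return ages


for name, age in date_blocks().items():
    print(f"{name}: {age} ybp ({age - 2000} bc)")
```

Changing `anchor_ybp` or `years_per_snp` shifts every node at once, which makes it easy to see how sensitive the older nodes are to the calibration.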



Using the Mal'ta calibrated age of 24000 ybp reported by the Raghavan study, those numbers give:
M269>> 55 SNPs
M269 has 55 SNPs to present, less the first 8 SNPs (67 gens), leaving 47 remaining SNPs prior to 0AD.
210 years per SNP x 47 SNPs = 9870 years before 0AD.
2000 + 9870 = 11870 ybp or 9870bc for 55 SNPs to have occurred under M269.

M269>> 55 SNPs: 11870 ybp (9870bc)
L23>> 48 SNPs: (48 - 8) = 40 x 210 = 8400 + 2000 = 10400 ybp (8400bc)
L51>> 45 SNPs: (45 - 8) = 37 x 210 = 7770 + 2000 = 9770 ybp (7770bc)
L11>> 38 SNPs: (38 - 8) = 30 x 210 = 6300 + 2000 = 8300 ybp (6300bc)
P312>> 36 SNPs: (36 - 8) = 28 x 210 = 5880 + 2000 = 7880 ybp (5880bc)
L21>> 32 SNPs: (32 - 8) = 24 x 210 = 5040 + 2000 = 7040 ybp (5040bc)


With such a difference between the Wei/Tyler-Smith and Mal'ta boy datings, the former is the dating that makes the most reasonable sense, considering its closeness to STR dating results.

MJost

vettor
12-11-2014, 05:46 PM
Is there a program which converts 23andMe data to STR results?

I have used the "felix" program which converts 23andme to SNP results and it works

Not for me, but for my son who is on the latest v4 23andme

Salkin
12-11-2014, 06:29 PM
Is there a program which converts 23andMe data to STR results?

I have used the "felix" program which converts 23andme to SNP results and it works

Not for me, but for my son who is on the latest v4 23andme

Unfortunately, I don't think 23andMe's chip-genotyping produces enough information to give STR values. I suspect it'd be hard to even make an educated guess good enough. All the genotype-to-STR converters I've seen require input from a big-gun Y sequence like BIG Y or FGC provide.

MitchellSince1893
12-12-2014, 12:41 AM
...
Using my estimated 129 years per mutation...
MJost

I was exploring the SNP based dating a few months ago in this thread.
http://www.anthrogenica.com/showthread.php?2420-SNP-based-TMRCAs-for-R1b-U106-and-subclades&p=38641&viewfull=1#post38641
http://www.anthrogenica.com/showthread.php?2420-SNP-based-TMRCAs-for-R1b-U106-and-subclades&p=38755&viewfull=1#post38755

In that thread I mentioned what others had found as it related to BigY vs FGC SNP detection.

...Based on the percentage of discovery of known/named SNPs between BigY and FGC (BigY detected 74% of the named SNPs FGC detected) and novel SNPs (BigY is averaging 47.8% of the novel SNPs compared to FGC).

Do you have any thoughts as to how your 129 years per mutation relates to BigY results?

Also,
FGC is seeing a new SNP every 75-90 years (average = 82.5 years). Then 82.5/74% gives BigY detecting a named SNP every 111.5 years, and 82.5/47.8% gives BigY detecting a novel SNP every 172.6 years. Source for the 75-90 years comment: http://www.anthrogenica.com/showthread.php?742-Full-Y-Chromosome-Sequencing-Phase-III-Pilot&p=33670&viewfull=1#post33670
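
The scaling used here can be sketched as follows: if one test detects only a fraction of the SNPs another test finds, its apparent years per SNP grow by the inverse of that fraction. The numbers are the ones quoted in this post; the function name is made up for the example.

```python
def scaled_years_per_snp(reference_years_per_snp, detection_fraction):
    """Years per detected SNP for a test that finds only a fraction of the SNPs."""
    return reference_years_per_snp / detection_fraction


fgc_years = (75 + 90) / 2  # midpoint of the 75-90 year range, 82.5
print(round(scaled_years_per_snp(fgc_years, 0.74), 1))   # named SNPs seen by BigY
print(round(scaled_years_per_snp(fgc_years, 0.478), 1))  # novel SNPs seen by BigY
```

This treats the detection fraction as constant across the tree, which is an assumption rather than something the quoted figures establish.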

Your 129 years is almost 50 years longer than what FGC was observing. Any thoughts on why there is such a difference?

emmental
12-12-2014, 01:44 AM
I was exploring the SNP based dating a few months ago in this thread.
http://www.anthrogenica.com/showthread.php?2420-SNP-based-TMRCAs-for-R1b-U106-and-subclades&p=38641&viewfull=1#post38641
http://www.anthrogenica.com/showthread.php?2420-SNP-based-TMRCAs-for-R1b-U106-and-subclades&p=38755&viewfull=1#post38755

In that thread I mentioned what others had found as it related to BigY vs FGC SNP detection.


Do you have any thoughts as to how your 129 years per mutation relates to BigY results?

Also, Source for the 75-90 years comment http://www.anthrogenica.com/showthread.php?742-Full-Y-Chromosome-Sequencing-Phase-III-Pilot&p=33670&viewfull=1#post33670

Your 129 years is almost 50 years longer than what FGC was observing. Any thoughts on why there is such a difference?

To put it simply - FGC has more coverage, therefore finds more SNPs. This makes the years per SNP less.

MitchellSince1893
12-12-2014, 01:46 AM
To put it simply - FGC has more coverage, therefore finds more SNPs. This makes the years per SNP less.

So the 129 years MJost mentioned is based on BigY?

MJost
12-12-2014, 02:38 AM
Do you have any thoughts as to how your 129 years per mutation relates to BigY results?

Also, Source for the 75-90 years comment http://www.anthrogenica.com/showthread.php?742-Full-Y-Chromosome-Sequencing-Phase-III-Pilot&p=33670&viewfull=1#post33670

Your 129 years is almost 50 years longer than what FGC was observing. Any thoughts on why there is such a difference?

I am using 129 based on having my 70 HQ Full Genomes SNPs reviewed and narrowed down for Sanger sequencing, resulting in my FGC5494 branch with 29 net SNPs. Most were confirmed by Sanger; a few are shared SNPs at the root of this branch, e.g. FGC5561, that cannot be single-SNP tested due to cross-over issues but are found only in the FGC5494 line.

My net validated SNP count is similar to that of BigY's 'Gold' position test ranges, where the results looked at initially produced an average of around 33 SNPs under DF13, a good representative sample. From all the other reviews I used, I determined that DF13 was spawned at around 1,750 (typo 1,375) bc, and this is the calibration age that produces 129 years per SNP.

MJost

MJost
12-12-2014, 02:40 AM
No, my initial 70 FGC SNPs would equate to around 54 years per mutation.

AND the average 33 BigY SNPs if all were Sanger validated, would be 114 years per SNP.

MJost

palamede
12-15-2014, 07:32 PM
my 70 HQ Full Genome SNPs reviewed and narrowed down for Sanger sequencing resulting my FGC5494 branch with 29 net SNPs, mostly by Sanger and a few by shared SNPs


MJost

I don't understand why you reject a great part of your 70 HQ Full Genome SNPs (below DF13) and for what reasons :
- Is it possible so many SNPs are confirmed negative by Sanger test ?
- Or maybe there are a lot of SNPS not possible to test by Sanger test for different reasons, but it is not a reason to reject them or it needs to do a separate count of these SNPS reading by high-coverage next-generation sequencing and which cannot be tested by Sanger test due by some hampering proximity . When the mutation rate is given by scholars, it is for all SNPS (SNSs + small indels) or for all SNS (Single Nucleotid Substitution) without this kind of restriction.

Other point :
About the number of SNPs at each branch node: YFull updated R1a 2 or 3 months ago, but it has not updated R1b yet. ISOGG lists 41 SNPs at the M269 level that YFull does not report yet. I hope YFull will update R1b soon.

http://isogg.org/tree/ISOGG_HapgrpR.html

MJost
12-16-2014, 12:20 AM
MJost

I don't understand why you reject a large part of your 70 HQ Full Genome SNPs (below DF13), or for what reasons:
- Is it possible that so many SNPs were confirmed negative by Sanger testing?
- Or maybe there are a lot of SNPs that cannot be tested by Sanger for various reasons. But that is not a reason to reject them; rather, it calls for a separate count of the SNPs that were read by high-coverage next-generation sequencing but cannot be Sanger tested because of some hampering proximity. When scholars give a mutation rate, it is for all SNPs (SNSs + small indels) or for all SNSs (single nucleotide substitutions), without this kind of restriction.

Other point :
About the number of SNPs at each branch node: YFull updated R1a 2 or 3 months ago, but it has not updated R1b yet. ISOGG lists 41 SNPs at the M269 level that YFull does not report yet. I hope YFull will update R1b soon.

http://isogg.org/tree/ISOGG_HapgrpR.html

I am using validated SNPs which can be single SNP tested. These would be the same clean SNPs as in the main part of the tree. The BigY ranges for SNPs were chosen to fit the 'Gold' MSY sections, which have few issues such as crossovers due to recombination, etc. My 70 SNPs were each evaluated for any primer issues, with the goal that Thomas K states: "...there are regions on the Y chromosome that are not similar to any other part of the genome whatsoever. SNPs that are in those regions are very unlikely to be affected by recombination and they are the best choice for constructing a phylogenetic tree". Sanger sequencing also gives the ability to test a single position going forward. The result was 27 Sanger-sequenced 'clean' validated SNPs. I have two SNPs that showed positive but for which primers could not be designed properly. These two are at the base of my branch and are considered to be NGS qualified; in this specific case they are indicator SNPs for either Chromo2 or BigY testers, pointing to a positive for FGC5494, and the latter would indicate positive for one of its subclades.

Until very recently, PCR testing was the dominant method of testing single SNPs, and new ones were found via this method.

Until all of the new SNPs get vetted using ISOGG rules, we wait.

MJost

warwick
12-16-2014, 12:55 AM
I am using validated SNPs which can be single SNP tested.

That's not the right approach, as Sanger sequencing may not work for NGS, high quality SNPs.

Warwick

--FGC team member

MJost
12-16-2014, 02:17 AM
That's not the right approach, as Sanger sequencing may not work for NGS, high quality SNPs.

Warwick

--FGC team member
I don't know exactly what to say. I am grateful for all the results from your Full Genome Corp. Your process for NGS is the future, but I cannot comment on how my remaining 40 HQ SNPs would compare, given that they can only be compared to other Full Genome testers. My 29 SNPs are currently the best choice for constructing a phylogenetic tree, because Sanger validation avoids most of the probability of false positives due to any kind of recombination.

If a very high percentage of NGS testers turn out positive for an ancient SNP, then I can see using it on the tree with a caveat: as Thomas K states, such SNPs might have been affected by recombination at some time, but after that much time an SNP has probably shown itself to be very stable and thus reliable, just like my FGC5561 and FGC5495 and the difficult-to-score FGC5494, which requires a human eye to confirm the autoscoring systems. So I see where you might be coming from.

MJost

warwick
12-16-2014, 02:32 AM
I don't know exactly what to say. I am grateful for all the results from your Full Genome Corp. Your process for NGS is the future, but I cannot comment on how my remaining 40 HQ SNPs would compare, given that they can only be compared to other Full Genome testers. My 29 SNPs are currently the best choice for constructing a phylogenetic tree, because Sanger validation avoids most of the probability of false positives due to any kind of recombination.

If a very high percentage of NGS testers turn out positive for an ancient SNP, then I can see using it on the tree with a caveat: as Thomas K states, such SNPs might have been affected by recombination at some time, but after that much time an SNP has probably shown itself to be very stable and thus reliable, just like my FGC5561 and FGC5495 and the difficult-to-score FGC5494, which requires a human eye to confirm the autoscoring systems. So I see where you might be coming from.

MJost

Greg M. can comment further if you email us directly. Overall, Sanger sequencing validation is a conservative approach, but, it may be the case that it results in an under count of the true SNP numbers. My personal SNP count is relatively consistent at 21 private SNPs with a common ancestor of someone else in Z326 at 375 AD. I haven't looked into Sanger validating my private SNPs personally, but 21 SNPs is close to expectation with a rate of 75 years/SNP for my FGC data.

It may be that in your case we have an over count, but my sense would be to rely on additional NGS data first, at least from my perspective.
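
The "close to expectation" remark above can be checked with simple arithmetic: divide the years elapsed since the common ancestor by the years-per-SNP rate. A small Python sketch, assuming the present is taken as 2014 (the year of this post) and with an illustrative function name:

```python
def expected_snps(ancestor_year_ad, years_per_snp, present_year=2014):
    """Expected number of private SNPs accumulated since a common ancestor."""
    elapsed_years = present_year - ancestor_year_ad
    return elapsed_years / years_per_snp

# Common ancestor at 375 AD, rate of 75 years per SNP (figures from the post above)
expected = expected_snps(375, 75)
print(f"expected about {expected:.1f} SNPs versus 21 observed")
```

With these inputs the expectation is a little under 22 SNPs, so an observed count of 21 is consistent with the 75 years/SNP rate.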

MJost
12-29-2014, 08:50 PM
I thought I would throw in Anatole A. Klyosov's and Iain McDonald's ages for reference. AK uses STR mutation counting and IM counted his own list of SNPs.

R1b1a2 - M269 - 7 < 903 yrs in block plus 6330 ybp = 7233 (5233bc) {AK*: ~7000 ybp}
R1b1a2a - L23 - 3 < 387 yrs in block plus 5943 ybp = 6330 (4330bc) {AK: 6000 ybp}
R1b1a2a1 - L51 - 4 < 516 yrs in block plus 5427 ybp = 5943 (3943bc) {AK: 4850 ybp}
R1b1a2a1a - L11 - 7 < 903 yrs in block plus 4524 ybp = 5427 (3427bc) {AK: 4600 ± 500 ybp}
R1b1a2a1a2 - P312/U106 - 2 < 258 yrs in block plus 4266 ybp = 4524 (2524bc) {AK: ~4200 ybp} [IM**: 4557 ybp (5265bc - 4001bc)]
R1b1a2a1a2c - L21 - 4 < 516 yrs in block plus 3750 ybp = 4266 (2266bc) {AK: 4100 ybp}
DF13 - 3750 (1750bc)

*AK- www.scirp.org/journal/PaperDownload.aspx?paperID=19567
**IM- https://xa.yimg.com/kq/groups/22032797/228208014/name/Ages+of+U106+Timeline+%28BigY+277%29+condensed.pdf
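
The block ages in the list above can be reproduced by walking up the tree from DF13 and adding 129 years for every SNP in each block. A minimal Python sketch of that bookkeeping, assuming the DF13 anchor of 3750 ybp and the 129 years/SNP calibration from earlier in the thread; the variable and function names are only illustrative:

```python
YEARS_PER_SNP = 129   # calibration rate used in the list above
DF13_AGE_YBP = 3750   # assumed anchor age for DF13

# (clade, number of SNPs in its block), ordered from just above DF13 up to M269
BLOCKS = [("L21", 4), ("P312/U106", 2), ("L11", 7),
          ("L51", 4), ("L23", 3), ("M269", 7)]

def block_ages(anchor_ybp, blocks, rate):
    """Accumulate block lengths upward from the anchor node."""
    ages = {}
    age = anchor_ybp
    for clade, snps in blocks:
        age += snps * rate   # each SNP in the block adds one rate interval
        ages[clade] = age
    return ages

for clade, age in block_ages(DF13_AGE_YBP, BLOCKS, YEARS_PER_SNP).items():
    print(f"{clade}: {age} ybp ({age - 2000} BC)")
```

Changing either the anchor age or the rate shifts every age upstream, so the whole column depends on those two assumptions.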

MJost

alan
02-15-2015, 12:03 PM
Below please find my SNP-based TMRCA estimations (in ky) for different R1b clades (and some upstream haplogroups) present in Sardinia. These estimations are based on an assumed mutation rate of 0.7 × 10^-9 per nucleotide per year, chosen for some reasons mentioned in another thread (http://www.anthrogenica.com/showthread.php?709-New-DNA-Papers/page5&p=10824#post10824), while the numbers given in parentheses correspond to the TMRCA values calculated using the mutation rates 0.82 and 0.53, as proposed by Poznik and Francalacci, respectively.

63.0 (53.7-82.8) haplogroup F
61.9 (52.8-81.3) haplogroup IJK
58.8 (50.1-77.3) haplogroup K
40.2 (34.3-52.9) haplogroup P
36.6 (31.2-48.2) haplogroup I
33.5 (28.6-44.1) haplogroup R
27.6 (23.5-36.3) haplogroup R1
22.9 (19.5-30.1) R1b-P25
14.9 (12.5-19.6) R1b-V88
8.6 (7.3-11.3) R1b-M269
8.3 (7.1-10.9) R1b-L23
7.6 (6.5-10.0) R1b-L51
7.4 (6.3-9.7) R1b-Z2105
7.2 (6.1-9.5) R1b-M269(xL23)
6.6 (5.6-8.6) R1b-L11
6.2 (5.3-8.2) R1b-P312
6.1 (5.2-8.0) R1b-U152
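
The parenthesised ranges in the list above reflect the fact that, for a fixed number of SNPs, a TMRCA estimate is inversely proportional to the assumed mutation rate. A minimal Python sketch of that rescaling, using the rounded rates quoted in the post (in units of 10^-9 per nucleotide per year); the function name is illustrative, and results agree with the listed values only approximately, presumably because the list was computed from unrounded inputs:

```python
BASE_RATE = 0.70   # rate assumed for the main estimates
FAST_RATE = 0.82   # rate attributed to Poznik in the post
SLOW_RATE = 0.53   # rate attributed to Francalacci in the post

def rescale_tmrca(tmrca_ky, old_rate, new_rate):
    """Rescale a TMRCA to a different mutation rate (same SNP count)."""
    return tmrca_ky * old_rate / new_rate

f_age = 63.0  # haplogroup F estimate at the base rate, in ky
low = rescale_tmrca(f_age, BASE_RATE, FAST_RATE)   # faster rate, younger age
high = rescale_tmrca(f_age, BASE_RATE, SLOW_RATE)  # slower rate, older age
print(f"haplogroup F: {f_age} ky, range about {low:.1f} to {high:.1f} ky")
```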

Assuming that the number of downstream mutations found in members of some poorly represented subclades is not a reliable source of data (due to some technical reasons associated with using the low pass sequencing method), I have instead used the distance (i.e. the number of SNPs) between a parent clade and a common ancestor of a given subclade as a basis for calculating the age (TMRCA) of every subclade. For estimating the age of haplogroup F, I have used the average number of mutations downstream of haplogroup F in members of the well-represented (in Sardinia) clade I2a-M26, which was about 404 mutations. It is worth noting that the average number of mutations downstream of haplogroup F in clade R1b-U152 (a clade that is also frequent in Sardinia, but not as common as I2a-M26) was close to the above number but, nevertheless, evidently lower (392). Thus, when basing similar calculations on this slightly reduced number of SNPs found in members of R1b-U152, we get lower TMRCA values, as shown below.

61.3 (52.3-80.6) haplogroup F
60.2 (51.3-79.1) haplogroup IJK
57.1 (48.7-75.0) haplogroup K
38.5 (32.9-50.6) haplogroup P
34.9 (30.0-45.9) haplogroup I
31.8 (27.1-41.8) haplogroup R
25.9 (22.1-34.0) haplogroup R1
21.2 (18.1-27.9) R1b-P25
13.3 (11.3-17.4) R1b-V88
6.9 (5.9-9.0) R1b-M269
6.6 (5.6-8.6) R1b-L23
5.9 (5.1-7.8) R1b-L51
5.6 (4.8-7.4) R1b-Z2105
5.5 (4.7-7.2) R1b-M269(xL23)
4.8 (4.1-6.4) R1b-L11
4.5 (3.9-5.9) R1b-P312
4.4 (3.7-5.7) R1b-U152

Neither of the above sets of TMRCAs can be considered secure, but I think the first approach is slightly more likely to give correct values when using those Sardinian data alone.

Michal, any further thoughts or refinements on the R1b sequence from M269 down? What is your best shot now, and has Reich made any difference? I would just like to check your current dates before doing any more speculation. I am particularly interested in L23, Z2105, L51 and L11. Just would like you to post your current best shot. It may help people like Jean and myself chew over the archaeological implications in light of Reich.

Michał
02-15-2015, 05:26 PM
Michal, any further thoughts or refinements on the R1b sequence from M269 down? What is your best shot now, and has Reich made any difference? I would just like to check your current dates before doing any more speculation. I am particularly interested in L23, Z2105, L51 and L11. Just would like you to post your current best shot. It may help people like Jean and myself chew over the archaeological implications in light of Reich.
I think the calculations that have been recently posted by Ebizur are very reasonable and I would definitely agree with most of his estimates. The only exception is that I consider L11 to be rather older than 5.1 ky, but I may be wrong about it. I am on vacation now, so don't have access to any details of my calculations, but here are my relatively recent rough estimates (in ky) taken from the notes I have with me:

R1b-M269 7.5 (7.0-8.1)
R1b-L23 7.2 (6.7-7.7)
R1b-Z2103 6.4 (5.9-6.9)
R1b-L51 6.7 (6.2-7.2)
R1b-L11 5.7 (5.2-6.2)
R1b-P312 5.6 (5.1-6.1)
R1b-U106 5.5 (5.0-6.0)

R1a-M417 6.2 (5.7-6.7)
R1a-CTS4385 5.8 (5.3-6.3)
R1a-L664 4.8 (4.3-5.2)
R1a-Z645 5.6 (5.1-6.1)
R1a-Z93 5.4 (4.9-5.9)
R1a-Z282 5.4 (4.9-5.9)

lgmayka
02-15-2015, 06:10 PM
YFull's haplotree now displays age estimates. (For example, the TMRCA of R-M269 is given as 12200 ybp (http://yfull.com/tree/R1b1a2/), much higher than estimates by other methods.) However:

1) YFull can only base its calculations on the samples available to it. More data--especially from rare cases--ought to improve the numbers.

2) This is only a first attempt. There are obvious anomalies that YFull needs to correct, particularly with respect to Paleolithic ages. Just as one example, G2a2b is defined by 49 different reliable SNPs (http://yfull.com/tree/G2a2b/), yet YFull's tree gives the same number (20800 ybp) for both formation and divergence (TMRCA). In other words, the YFull tree is asserting that 49 different reliable Y-SNPs all occurred in the same man!

I don't mean this as a criticism. Rather, I am congratulating this bold first step. I expect an improvement in the numbers as YFull tunes the algorithm and gets more data.

leonardo
02-15-2015, 06:16 PM
YFull's haplotree now displays age estimates. (For example, the TMRCA of R-M269 is given as 12200 ybp (http://yfull.com/tree/R1b1a2/), much higher than estimates by other methods.) However:

1) YFull can only base its calculations on the samples available to it. More data--especially from unusual cases--ought to improve the numbers.

2) This is only a first attempt. There are obvious anomalies that YFull needs to correct, particularly with respect to Paleolithic ages. Just as one example, G2a2b is defined by 49 different reliable SNPs (http://yfull.com/tree/G2a2b/), yet YFull's tree gives the same number (20800 ybp) for both formation and divergence (TMRCA). In other words, the YFull tree is asserting that 49 different reliable Y-SNPs all occurred in the same man!

I don't mean this as a criticism. Rather, I am pointing out that this is a bold first step. I expect an improvement in the numbers as YFull tunes the algorithm and gets more data.

I noticed this last night. It is a good effort and much appreciated. I believe most of us who have provided our data to YFull for analysis have interest in this, especially as we try to place our ancestors in relationship to others who have tested. Most everybody would like to know TMRCA.

George Chandler
02-15-2015, 06:17 PM
YFull's haplotree now displays age estimates. (For example, the TMRCA of R-M269 is given as 12200 ybp (http://yfull.com/tree/R1b1a2/), much higher than estimates by other methods.) However:

1) YFull can only base its calculations on the samples available to it. More data--especially from rare cases--ought to improve the numbers.

2) This is only a first attempt. There are obvious anomalies that YFull needs to correct, particularly with respect to Paleolithic ages. Just as one example, G2a2b is defined by 49 different reliable SNPs (http://yfull.com/tree/G2a2b/), yet YFull's tree gives the same number (20800 ybp) for both formation and divergence (TMRCA). In other words, the YFull tree is asserting that 49 different reliable Y-SNPs all occurred in the same man!

I don't mean this as a criticism. Rather, I am pointing out that this is a bold first step. I expect an improvement in the numbers as YFull tunes the algorithm and gets more data.

The formed age is likely closer to the actual age than any previous calculation methods to date. They are getting 5,200 years for the formed age of DF13 and mine is only 4,500 years. They probably have a closer estimate for that formation age than me.

alan
02-15-2015, 07:22 PM
I think the calculations that have been recently posted by Ebizur are very reasonable and I would definitely agree with most of his estimates. The only exception is that I consider L11 to be rather older than 5.1 ky, but I may be wrong about it. I am on vacation now, so don't have access to any details of my calculations, but here are my relatively recent rough estimates (in ky) taken from the notes I have with me:

R1b-M269 7.5 (7.0-8.1)
R1b-L23 7.2 (6.7-7.7)
R1b-Z2103 6.4 (5.9-6.9)
R1b-L51 6.7 (6.2-7.2)
R1b-L11 5.7 (5.2-6.2)
R1b-P312 5.6 (5.1-6.1)
R1b-U106 5.5 (5.0-6.0)

R1a-M417 6.2 (5.7-6.7)
R1a-CTS4385 5.8 (5.3-6.3)
R1a-L664 4.8 (4.3-5.2)
R1a-Z645 5.6 (5.1-6.1)
R1a-Z93 5.4 (4.9-5.9)
R1a-Z282 5.4 (4.9-5.9)

They do seem reasonable to me, with of course the interesting implication of being older than once thought, but even M269 as a whole is still too young to have been involved in the first farming thrust, which makes sense given their absence. L23 being 5200BC of course provides an interesting date, as the period after which all its descendants (the vast majority of European and SW Asian P297 derivatives) can stem from. The 4700BC sort of date for L51 also suggests a long period of very little expansion on the margins of farming or beyond it, probably on the steppes or some other place where farming arrived late. The dates for L11, P312 and U106 seem to speak collectively of a coming out of the blocks in the mid 4th millennium, a date that is close to Yamnaya. This IMO makes sense of the hints that L11 may have taken a dual route both north and south of the Carpathians with two streams, one to the Balkans in Yamnaya and one via Corded Ware, both exiting the steppes just after 3000BC. It suggests that L11, and probably the beginning of its two great branches, may have had at least a short steppe history. These dates do not seem to jar with the archaeological data IMO despite being earlier than once thought. We of course could contrast the SNP dates with the MRCA dates to see the difference between origin of SNP and dispersion. It's all good.

alan
02-15-2015, 07:23 PM
YFull's haplotree now displays age estimates. (For example, the TMRCA of R-M269 is given as 12200 ybp (http://yfull.com/tree/R1b1a2/), much higher than estimates by other methods.) However:

1) YFull can only base its calculations on the samples available to it. More data--especially from rare cases--ought to improve the numbers.

2) This is only a first attempt. There are obvious anomalies that YFull needs to correct, particularly with respect to Paleolithic ages. Just as one example, G2a2b is defined by 49 different reliable SNPs (http://yfull.com/tree/G2a2b/), yet YFull's tree gives the same number (20800 ybp) for both formation and divergence (TMRCA). In other words, the YFull tree is asserting that 49 different reliable Y-SNPs all occurred in the same man!

I don't mean this as a criticism. Rather, I am pointing out that this is a bold first step. I expect an improvement in the numbers as YFull tunes the algorithm and gets more data.

I don't think I can cope with tearing up everything and thinking through those sort of dates on a rainy Sunday night. Rum and coke beckons.

alan
02-15-2015, 07:24 PM
I think the calculations that have been recently posted by Ebizur are very reasonable and I would definitely agree with most of his estimates. The only exception is that I consider L11 to be rather older than 5.1 ky, but I may be wrong about it. I am on vacation now, so don't have access to any details of my calculations, but here are my relatively recent rough estimates (in ky) taken from the notes I have with me:

R1b-M269 7.5 (7.0-8.1)
R1b-L23 7.2 (6.7-7.7)
R1b-Z2103 6.4 (5.9-6.9)
R1b-L51 6.7 (6.2-7.2)
R1b-L11 5.7 (5.2-6.2)
R1b-P312 5.6 (5.1-6.1)
R1b-U106 5.5 (5.0-6.0)

R1a-M417 6.2 (5.7-6.7)
R1a-CTS4385 5.8 (5.3-6.3)
R1a-L664 4.8 (4.3-5.2)
R1a-Z645 5.6 (5.1-6.1)
R1a-Z93 5.4 (4.9-5.9)
R1a-Z282 5.4 (4.9-5.9)

Cheers Michal. Hope you are having an enjoyable break.

Michał
02-15-2015, 08:11 PM
These dates do not seem to jar with the archaeological data IMO despite being earlier than once thought. We of course could contrast the SNP dates with the MRCA dates to see the difference between origin of SNP and dispersion. It's all good.
Actually, these are all MRCA dates. For example, I can imagine that M269 itself is much older than 7.5 ky.

lgmayka
02-16-2015, 03:23 PM
Just as one example, G2a2b is defined by 49 different reliable SNPs (http://yfull.com/tree/G2a2b/), yet YFull's tree gives the same number (20800 ybp) for both formation and divergence (TMRCA). In other words, the YFull tree is asserting that 49 different reliable Y-SNPs all occurred in the same man!
This must be a major bug.
G2a2b - 49 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2 - 14 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a - 36 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1 - 18 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b - 75 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1 - 1 SNP, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a - 1 SNP, formation and TMRCA are both 20800 ybp
G-Z1823 - 2 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a2 - 2 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a2a - 3 SNPs, formation and TMRCA are both 20800 ybp

Over 200 SNPs, in the blink of an eye. :)

Again, I don't mean to criticize, only to warn readers that these numbers are merely a first attempt and clearly need more work.

George Chandler
02-16-2015, 03:40 PM
This must be a major bug.
G2a2b - 49 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2 - 14 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a - 36 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1 - 18 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b - 75 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1 - 1 SNP, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a - 1 SNP, formation and TMRCA are both 20800 ybp
G-Z1823 - 2 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a2 - 2 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a2a - 3 SNPs, formation and TMRCA are both 20800 ybp

Over 200 SNPs, in the blink of an eye. :)

Again, I don't mean to criticize, only to warn readers that these numbers are merely a first attempt and clearly need more work.

If they are placing the same formation and TMRCA on all the SNPs you've listed above, it must be an error. It's possible they are basing their formation age estimate on ancient remains, but I can't say for sure. Are those 49 SNPs just labelled "reliable", or are they reliable enough to pass Sanger and have primers made for them? If all listed SNPs are reliable and have passed Sanger sequencing, I would think 7,000 years +/- for G2a2b?

What do you get for a formation age?

George Chandler
02-16-2015, 03:42 PM
You're getting another 75 reliable SNPs below G2a2b2a1b???

George Chandler
02-16-2015, 03:54 PM
I'm trying to understand what you've posted. Just because you have 75 reliable SNPs below G2a2b2a1b obviously doesn't mean they are ancestral to G2a2b2a1b1. I have a lot of independent lines in my S1051 and over 300 verified SNPs, but obviously they are different lines and not added together. Am I missing your point here?

Heber
02-16-2015, 03:59 PM
I understand a paper is due shortly which explains the methodology used. In any event I would expect this to be a beta version with much tweaking based on ancient DNA results and more branches tested.
However it is the first time I have seen an entire phylogenetic tree contain age estimates.
Here are my YFull results and R1b analysis.
https://www.pinterest.com/gerardcorcoran/yfull/
https://www.pinterest.com/gerardcorcoran/r1b/

"In the version 3.4 of YFull Y-Tree
http://www.yfull.com/tree/R1b1a2a1a2/
we plan to show an estimation of age for all subclades with at least one Big Y or Y Elite in our database. The algorythm of estimation by SNP count we will explain later in an article written by Dmitriy Adamov, Vladimir Gurianov, Sergey Karzhavin, Vladimir Tagankin, Vadim Urasin. We have checked our estimation by information of common ancestors our clients. For 12 of 14 subclades estimated age is inside 95% confidence interval. But estimated age of I-YP1012 and I-A379 is not. In the chart you can see all 14 subclades with known ancestors. For all subclades in the Y-Tree a confidence interval depends from number of samples. For subclades with 1 sample a bounds of 95% confidence interval of age estimation are -48.8% +61.6% in the average, for subclades with 2 samples a bounds of 95% confidence interval are -43.3% +50.1% in the average, for 3 samples: -34.6% +37,3%, for 4 samples: -28,9% +32,2%. More samples - better estimation."
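
To see what the quoted confidence percentages mean in years, they can be applied to a point estimate. A minimal Python sketch, assuming the average percentages quoted above apply to the clade in question (YFull's actual per-clade intervals may differ) and using the 5,200-year DF13 formed age mentioned earlier in the thread purely as an illustrative input:

```python
# Average 95% CI offsets quoted above: samples -> (lower %, upper %)
CI_PERCENT = {
    1: (-48.8, 61.6),
    2: (-43.3, 50.1),
    3: (-34.6, 37.3),
    4: (-28.9, 32.2),
}

def ci_bounds(age_ybp, samples):
    """Return (lower, upper) age bounds for an estimate with 1 to 4 samples."""
    lower_pct, upper_pct = CI_PERCENT[samples]
    return age_ybp * (1 + lower_pct / 100), age_ybp * (1 + upper_pct / 100)

for n in sorted(CI_PERCENT):
    lower, upper = ci_bounds(5200, n)
    print(f"{n} sample(s): {lower:.0f} to {upper:.0f} ybp")
```

Even with four samples the interval spans several thousand years for an estimate of this size, which is worth keeping in mind when comparing the tree's ages with other methods.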

George Chandler
02-16-2015, 05:38 PM
I understand a paper is due shortly which explains the methodology used. In any event I would expect this to be a beta version with much tweaking based on ancient DNA results and more branches tested.
However it is the first time I have seen an entire phylogenetic tree contain age estimates.
Here are my YFull results and R1b analysis.
https://www.pinterest.com/gerardcorcoran/yfull/
https://www.pinterest.com/gerardcorcoran/r1b/

"In the version 3.4 of YFull Y-Tree
http://www.yfull.com/tree/R1b1a2a1a2/
we plan to show an estimation of age for all subclades with at least one Big Y or Y Elite in our database. The algorythm of estimation by SNP count we will explain later in an article written by Dmitriy Adamov, Vladimir Gurianov, Sergey Karzhavin, Vladimir Tagankin, Vadim Urasin. We have checked our estimation by information of common ancestors our clients. For 12 of 14 subclades estimated age is inside 95% confidence interval. But estimated age of I-YP1012 and I-A379 is not. In the chart you can see all 14 subclades with known ancestors. For all subclades in the Y-Tree a confidence interval depends from number of samples. For subclades with 1 sample a bounds of 95% confidence interval of age estimation are -48.8% +61.6% in the average, for subclades with 2 samples a bounds of 95% confidence interval are -43.3% +50.1% in the average, for 3 samples: -34.6% +37,3%, for 4 samples: -28,9% +32,2%. More samples - better estimation."

Just as a side note I was hoping Larry would qualify the number of SNP's for each of the G haplogroups he posted. I can get the more than 300 SNP's listed in ISOGG for the S1051 group and clutter up their page with SNP's for which the placement is unknown for most. The average number of SNP's using Big Y, YPrime and YElite testing is about 31 SNP's for each of the lines below DF13 within the S1051 group so obviously a big difference.

lgmayka
02-16-2015, 06:22 PM
Just because you have 75 reliable SNPs below G2a2b2a1b obviously doesn't mean they are ancestral to G2a2b2a1b1.
There are 75 reliable SNPs at the level of G2a2b2a1b. And yes, that does indeed mean that they are all ancestral to G2a2b2a1b1.

Please look carefully at the YFull haplotree for G2a2b (http://yfull.com/tree/G2a2b/) and downstream. The tree shows 49 reliable SNPs that are phylogenetically equivalent at the G2a2b level. The formation date is the date at which this clade diverged from its siblings; the TMRCA is the date when its subclades diverged. The two dates are the same, 20800 ybp. The tree is asserting that these 49 reliable SNPs were essentially simultaneous.

But then look farther down the tree. An entire chain of subclades are making the exact same date claim. All together, a sequential chain of 200 reliable SNPs are all claiming both formation and TMRCA dates of 20800 ybp.

This is a bug.

George Chandler
02-16-2015, 06:50 PM
I sent you a private message.

laurie
02-16-2015, 07:36 PM
YFull have just sent out this message -

"New feature:
In the YFull Y-Tree http://www.yfull.com/tree/A0-T we have shown an estimation of age for all subclades with at least one Big Y or Y Elite in our database. The algorythm of estimation by SNP count we will explain later in an article written by Dmitriy Adamov, Vladimir Gurianov, Sergey Karzhavin, Vladimir Tagankin, Vadim Urasin."

MJost
02-17-2015, 09:19 PM
I recently posted this observation:


The first thing to consider for SNP dating, with the new information, is that there is a major constraint: the BB sample QLB28b in Germany, with Hg P312xU106, is dated 2296-2206 cal BCE.

R1b1a2a1a2 P312/S116: I0806, the Beaker man from Quedlinburg in north central Germany, from the new Haak et al paper, noting that U152, DF27, and L21 were not read. But Ancestral Journeys reports that QLB 28 has been tested as: P312+, S1217-, Z262-, DF49-, DF23-, L554-, S868-, CTS6581-, CTS2457.2- (page created by Jean Manco 6 September 2009; last revised 13-02-2015, at http://www.ancestraljourneys.org/ancientdna.shtml)

(Culture: Bell Beaker 2500-2200/2050BC Neolithic mitochondrial haplogroup H genomes and the genetic origins of Europeans - Paul Brotherton et al (w/ Wolfgang Haak) 23 April 2013 )

My range runs from the beginning of the block of two SNPs at P312, 4524 ybp (2524bc), to the end of the L21 block (the beginning of DF13), 3750 ybp (1750bc). We do not know where in the order of mutations he fits, but he is no younger than DF13, so the cal age of the P312/S116 I0806 Beaker man from Quedlinburg fits my existing TMRCA ranges.

R1b1a2a1a2 - P312/U106 - (2 SNPs) Max 4524 (2524bc)
R1b1a2a1a2c - L21 - (4 SNPs) Max 4266 (2266bc)
DF13 - (1 SNP) Max 3750 (1750bc)

I suspect you will find I0806's actual haplotype in the L21 block and with the very low probability of being DF13 or any of its subclades.

http://www.anthrogenica.com/showthread.php?3474-Bell-Beakers-Gimbutas-and-R1b&p=69983&viewfull=1#post69983

MJost

I made an observation and posted the above information. This created a major SNP dating issue when counting SNPs with the above BB I0806 constraint and Mal'ta boy's 24kya basal R* max limit. Note that there may be a small number of unaccounted ancestral SNPs (five?) back to the root of R, creating a slightly older max date.

I speculated that I0806 lacked certain SNP tests below P312 that were well known during the last few years and/or that several SNPs were redacted from the study. This sample may actually be several or more SNPs below P312; I say this only because of third-party listed results showing several downstream DF27 SNPs negative, and several derived DF13 SNPs also negative. With the two unordered SNPs within the P312 clade, the five SNPs in the L21 block, and the negative DF27 subclade SNPs, we can see a small window of SNPs involved.

These facts made it possible to recalibrate and find the resulting years per SNP between R (24kya) and a node two SNPs below P312: 101 SNPs within a 19,796 year block of time, or 196 years per SNP. This is calculated from 24kya to the beginning of two SNPs below P312 (cal BC age: 2296 - 2206bc).

Now, using these upper and lower constraints, which produce 196 years per SNP in the R lineage, let's calculate the remaining number of SNPs under 4206 ybp: 4206/196 equals 21.4 SNPs. An odd result, maybe. But let's change the numbers.

Extending two SNPs below P312 takes us into my L21 line (this could apply equally to other P312 subclades). Let's remove those six SNPs: at 196 years per SNP that equals 1176 years, which puts the beginning of DF13 at 3030 ybp, or 1030bc. That brings us to the DF13 node on the tree, where we can evaluate new SNPs under DF13.

So the question becomes: how can the remaining NGS SNPs under DF13 be reconciled? My own HQ Full Genome SNPs under DF13, narrowed down to 26 Sanger sequenced and two NGS shared SNPs under the DF13>FGC5494 subclade, tally to 104 years per SNP. Using a BigY average of 40 SNPs equates to 76 years per mutation. We should now be able to reasonably recalibrate individual DF13 subclades.

These facts raise many questions as to why the years per SNP dropped by 50% or more from the recalibrated upper and lower date window, which showed 196 years per SNP. Or is something else amiss?
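
The recalibration steps above can be laid out as plain arithmetic. A minimal Python sketch using only the figures stated in this post; the variable names are illustrative, and using 29 net SNPs for the FGC5494 branch (the count given earlier in the thread) to reproduce the 104 figure is an assumption:

```python
SPAN_YEARS = 19796      # block of time between R (24kya) and the lower constraint
SNPS_IN_SPAN = 101      # SNPs from R to two SNPs below P312
LOWER_YBP = 4206        # lower constraint (2206 BC counted back from 2000 AD)
SNPS_TO_DF13 = 6        # SNPs removed to reach the DF13 node

rate = SPAN_YEARS / SNPS_IN_SPAN                 # years per SNP above the constraint
remaining_snps = LOWER_YBP / rate                # SNPs expected since the constraint
df13_ybp = LOWER_YBP - SNPS_TO_DF13 * rate       # implied age of the DF13 node

print(f"rate: {rate:.0f} years per SNP")
print(f"expected SNPs under {LOWER_YBP} ybp: {remaining_snps:.2f}")
print(f"DF13 node: {df13_ybp:.0f} ybp ({df13_ybp - 2000:.0f} BC)")

# Rates implied below DF13 for two observed SNP counts
for label, count in [("FGC5494 net SNPs (assumed 29)", 29),
                     ("BigY average", 40)]:
    print(f"{label}: {df13_ybp / count:.1f} years per SNP")
```

The gap between 196 years per SNP above the constraint and roughly 76 to 104 below it is the discrepancy the questions above are asking about.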

MJost

lgmayka
02-17-2015, 10:21 PM
Or is something else amiss?
An ancient sample may have any number of SNPs unknown today. After all, it is highly unlikely that a random ancient man will have patrilineal descendants living today who have ordered the Big Y!

If we get access to the ancient sample's entire Y chromosome, at high quality, we can then calculate further.

MJost
02-18-2015, 02:34 AM
An ancient sample may have any number of SNPs unknown today. After all, it is highly unlikely that a random ancient man will have patrilineal descendants living today who have ordered the Big Y!

If we get access to the ancient sample's entire Y chromosome, at high quality, we can then calculate further.
The SNPs in R down to L21 are vetted SNPs that everyone under L21 has so I don't know what you are exactly saying.

MJost

lgmayka
02-18-2015, 04:33 AM
The SNPs in R down to L21 are vetted SNPs that everyone under L21 has so I don't know what you are exactly saying.
But what proof do you have that I0806 is L21+ ? And even if he is, he could have any number of unknown SNPs beyond that.

Muircheartaigh
02-18-2015, 09:29 AM
The SNPs in R down to L21 are vetted SNPs that everyone under L21 has so I don't know what you are exactly saying.

MJost

What he is saying is that your calculations seem to assume that because the donor carried a particular SNP, the age of the SNP is the same age of the donor. If that is your assumption it is not correct. His ancestors could have been carrying the SNP for thousands of years. The donor was no doubt carrying down stream SNPs that were not tested for because they only tested for SNPs that have been found in today's population. They did not report on any of these "private" SNPs so we cannot assess the age of the reported SNPs. He is therefore suggesting that your calculation is flawed, which it is.

MJost
02-18-2015, 01:37 PM
But what proof do you have that I0806 is L21+ ? And even if he is, he could have any number of unknown SNPs beyond that.
I didn't say what his terminal SNP was. His testing to date showed he was P312+ but negative for several DF13 and DF27 subclades, and the three major SNPs below P312 (U152, DF27, and L21) were not read. I had merely speculated that this sample was probably actually several SNPs more recent than P312, within a tight range.

MJost

MJost
02-18-2015, 01:56 PM
What he is saying is that your calculations seem to assume that because the donor carried a particular SNP, the age of the SNP is the same age of the donor. If that is your assumption it is not correct. His ancestors could have been carrying the SNP for thousands of years. The donor was no doubt carrying down stream SNPs that were not tested for because they only tested for SNPs that have been found in today's population. They did not report on any of these "private" SNPs so we cannot assess the age of the reported SNPs. He is therefore suggesting that your calculation is flawed, which it is.

It's your understanding of the basic facts that is flawed. His P312+ had three major subclades and two of those were tested into and those SNPs went negative. This narrows down the window for what SNPs this sample could be possible for. This sample is documented to be around 2200 cal bc. We now have hard dating facts of the lower and upper times within the R to P312+ tree and we know the exact number of SNPs within that range. All I did was to show that I0806 is most likely several SNPs younger.

>He is therefore suggesting that your calculation is flawed, which it is.
Show me exactly what is flawed other than just state that 'carrying the SNP for thousands of years'.

MJost

lgmayka
02-18-2015, 03:08 PM
This narrows down the window for what SNPs this sample could be possible for.
No. The ancient sample could theoretically have 100 unknown SNPs. Until his entire Y chromosome is scanned, we just don't know how many SNPs beyond P312 he has. His P312+ status gives us only a lower bound on the age of P312, not an upper bound.

MJost
02-18-2015, 03:23 PM
No. The ancient sample could theoretically have 100 unknown SNPs. Until his entire Y chromosome is scanned, we just don't know how many SNPs beyond P312 he has. His P312+ status gives us only a lower bound on the age of P312, not an upper bound.

Sure you could say that his P312+ is 5000 bc and not 2200 cal bc and he is U152 with a 100 private SNPs.

MJost

Muircheartaigh
02-18-2015, 03:40 PM
It's your understanding of the basic facts that is flawed. His P312+ had three major subclades and two of those were tested into and those SNPs went negative. This narrows down the window for what SNPs this sample could be possible for. This sample is documented to be around 2200 cal bc. We now have hard dating facts of the lower and upper times within the R to P312+ tree and we know the exact number of SNPs within that range. All I did was to show that I0806 is most likely several SNPs younger.


>He is therefore suggesting that your calculation is flawed, which it is.
Show me exactly what is flawed other than just state that 'carrying the SNP for thousands of years'.
MJost

I really don't think that it's appropriate to suggest my understanding of the basic facts is flawed.

Why do you think that any of his downstream SNPs should be found in people who have tested with Big Y or FGC, i.e. in the ISOGG tree? After all, we know that the vast majority of the population contemporary to the donor have no living descendants.

The results provide a lower bound for the age of the relevant SNPs but assuming an upper bound on the basis that you have suggested is pure speculation.

Incidentally, what's your opinion of the 5100 ybp that Yfull have estimated for DF13?

MJost
02-18-2015, 04:26 PM
I really don't think that it's appropriate to suggest my understanding of the basic facts is flawed.

You're the one that agreed with lgmayka. You wrote "He (lgmayka) is therefore suggesting that your calculation is flawed, which it is." <<< your last three words state your opinion of my calculation.



Why do you think that any of his Downstream SNP should be found in people who have tested with Big Y or FGC ie. in the Isogg tree? After all, we know that the vast majority of the population contemporary to the donor have no living descendants. The results provide a lower bound for the age of the relevant SNPs but assuming an upper bound on the basis that you have suggested is pure speculation.

Again, this sample dates to 4200 years ago and, as such, there would be very few additional downstream SNPs based on the average time per SNP counted from R at 24Kya Cal and P312+ at 2200 Cal BC. I used YFull's own list of SNPs; P312 is a block of two SNPs ending at the beginning of the L21 block of SNPs (knowing this sample may belong to any other P312 subclade, but P312 still has two unordered SNPs). The span between R and the end of the P312 block of SNPs totals 19796 years.

My point is that it is reasonable to assume that P312 is, or is very close to, the actual terminal SNP of this sample, a very reasonable speculation.




Incidentally, what's your opinion of the 5100 ybp that Yfull have estimated for DF13?

No. Do you believe R is 7600 years older than Mal'ti Boy's R* 24Kya Calibrated as posted by YFull?

MJost

Muircheartaigh
02-18-2015, 05:10 PM
You're the one that agreed with lgmayka. You wrote "He (lgmayka) is therefore suggesting that your calculation is flawed, which it is." <<< your last three words state your opinion of my calculation.

You did ask ''what's amiss''. lgmayka was pointing out what was amiss and yes I agreed with him that your calculation was flawed. That wasn't a comment on you personally.



No. Do you believe R is 7600 years older than Mal'ti Boy's R* 24Kya Calibrated as posted by YFull?

MJost

I would need to know how many SNPs downstream of R* that Mal'ti Boy had before I would make an assessment, but I suspect that Yfull are probably in possession of other data that you and I don't have access to that enables them to make an assessment. What do you think? However, as far as the age of DF13 is concerned I think they are in the right ballpark.

seferhabahir
02-18-2015, 05:39 PM
I would need to know how many SNPs downstream of R* that Mal'ti Boy had before I would make an assessment, but I suspect that Yfull are probably in possession of other data that you and I don't have access to that enables them to make an assessment. What do you think? However, as far as the age of DF13 is concerned I think they are in the right ballpark.

The age of DF13 was batted back and forth last year on another thread and we came up with several scenarios where different subclades of DF13 showed 68 (or so) FGC SNPs below DF13, including my own Z251 and Mark's FGC5496. I thought back then that DF13 was at least 5000 years old, and still think so.

Megalophias
02-18-2015, 07:16 PM
Do you believe R is 7600 years older than Mal'ti Boy's R* 24Kya Calibrated as posted by YFull?
That date agrees very well with calculations from the Hallast et al data, which has on average 80 SNPs from R1'2 to present, and 129 SNPs from K to present. So putting K at 50 000 years ago gives us an average age of 31 000 years for R1'2 (plus or minus several thousand years).

I found the discussion in the Mal'ta boy paper a little hard to follow, but it says there were 143 called SNPs in MA-1's Y chromosome between DE and R on the phylogenetic tree. On Hallast's tree there were 113 SNPs between these points, which gives us 0.79 Hallast SNPs per MA-1 SNP, or 306 years/SNP on average. MA-1 had 138 derived and 5 ancestral SNPs between DE and R, so actually he splits off about 1500 years before the root of surviving R, the split of R1 and R2. He has 35 private SNPs not shared with R1'2, so he is about 9000 years after R1'2, which puts the root of surviving R at approximately 33 000 years ago.

Now since all of this should come with great big confidence intervals, the difference between 7600 and 9000 years is not significant. Though honestly I may be doing the math totally wrong here.
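For anyone who wants to re-run the arithmetic above, here is a minimal sketch of the scaling steps, using only the figures quoted in the post (K at 50 000 years, 129 and 80 SNPs to present, 113 Hallast SNPs versus 143 MA-1 SNPs between DE and R). The variable names are illustrative, not from any paper, and a constant SNP rate is assumed throughout.

```python
# Figures quoted in the post; names are illustrative assumptions.
K_AGE_YEARS = 50_000           # assumed age of K
SNPS_K_TO_PRESENT = 129        # Hallast average, K to present
SNPS_R12_TO_PRESENT = 80       # Hallast average, R1'2 to present
HALLAST_SNPS_DE_TO_R = 113     # Hallast tree, DE to R
MA1_SNPS_DE_TO_R = 143         # called SNPs on MA-1, DE to R
MA1_ANCESTRAL_SNPS = 5         # MA-1 ancestral calls between DE and R

# Years per SNP on the Hallast scale, assuming a constant rate.
years_per_hallast_snp = K_AGE_YEARS / SNPS_K_TO_PRESENT

# Age of the R1'2 node on that scale.
r12_age = years_per_hallast_snp * SNPS_R12_TO_PRESENT

# Convert to the MA-1 SNP scale (more SNPs were called over the same span).
hallast_per_ma1 = HALLAST_SNPS_DE_TO_R / MA1_SNPS_DE_TO_R
years_per_ma1_snp = years_per_hallast_snp * hallast_per_ma1

# How long before the R1'2 split MA-1's line branches off.
ma1_split_before_r12 = MA1_ANCESTRAL_SNPS * years_per_ma1_snp

print(round(r12_age), round(hallast_per_ma1, 2),
      round(years_per_ma1_snp), round(ma1_split_before_r12))
```

None of this carries the confidence intervals mentioned above; it only reproduces the point estimates.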

ETA: if instead you consider that there are on average 86 ancestral SNPs from R1'2 to present on the MA-1 chromosome compared to the other R reference samples, that gives a much larger number of 361 years/SNP, which would put R1'2 at 35 000 years ago - or counting the other way, MA-1 is only 20 000 years old, which is clearly wrong. Or take 30 out of 86 SNPs down from the R root to be 24 000 years ago, which gives an R1'2 root 37 000 years ago! So it seems you can get wildly different and contradictory answers depending on how you count. I'm not sure SNPs in different parts of the tree would be found at comparable rates, which would mean the whole calculation is invalid. Huh.

EATA: actually the original calculation seems most consistent if you assume that the MA-1 tree has proportionately shorter downstream R branches.
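The three alternative counts in the ETA can be laid side by side with a short sketch. It assumes the R1'2 age from the Hallast scaling (50 000 × 80 / 129), MA-1 at 24 000 years, and the 30-of-86 SNP split quoted above; the names are made up for illustration.

```python
# Inputs quoted in the posts above; names are illustrative assumptions.
R12_AGE = 50_000 * 80 / 129    # R1'2 age from the Hallast scaling
MA1_AGE = 24_000               # calibrated age of MA-1
SNPS_R12_TO_PRESENT = 86       # ancestral SNPs on the MA-1 scale
SNPS_R12_TO_MA1 = 30           # SNPs from the R root down to MA-1

# Rate if all 86 SNPs are spread over the R1'2 age.
years_per_snp = R12_AGE / SNPS_R12_TO_PRESENT

# (a) Count up from MA-1's date to the R1'2 root.
r12_from_ma1 = MA1_AGE + SNPS_R12_TO_MA1 * years_per_snp

# (b) Count down from the R1'2 age to MA-1.
ma1_from_r12 = R12_AGE - SNPS_R12_TO_MA1 * years_per_snp

# (c) Fix MA-1 at 30 of 86 SNPs below the root and rescale the whole branch.
snps_ma1_to_present = SNPS_R12_TO_PRESENT - SNPS_R12_TO_MA1
r12_rescaled = MA1_AGE * SNPS_R12_TO_PRESENT / snps_ma1_to_present

for label, value in [("years/SNP", years_per_snp),
                     ("(a) R1'2 from MA-1", r12_from_ma1),
                     ("(b) MA-1 from R1'2", ma1_from_r12),
                     ("(c) R1'2 rescaled", r12_rescaled)]:
    print(f"{label}: {value:,.0f}")
```

The spread between (a), (b) and (c) is the inconsistency described above: each count rests on a different assumption about where the rate is calibrated.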

MJost
02-18-2015, 07:21 PM
I would need to know how many SNPs downstream of R* that Mal'ti Boy had before I would make an assessment, but I suspect that Yfull are probably in possession of other data that you and I don't have access to that enables them to make an assessment. What do you think? However, as far as the age of DF13 is concerned I think they are in the right ballpark.


In the Raghavan July 2014 study, Figure SI 5a, MA-1 had only five R-specific ancestral SNPs, as I alluded to in my original post, where I stated that I didn't include this difference. I just took some time to take those five SNPs off R's 23 SNPs, reducing the total from 101 to 96 SNPs, to equalize to the node where MA-1 would branch off of R, resulting in a new rate of 206 years per SNP.

This would then add some years to MA-1's 24Kya Cal because the years per SNP increases from 196 to 206, adding 1031 years for a true R age of 25,031 years, not seven times that addition, as YFull's 31,600 years before present would imply.
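A minimal sketch of that adjustment, for anyone checking the numbers. It assumes the 19796-year span between the two calibration points quoted earlier in the thread and a constant rate between them; the variable names are illustrative only.

```python
# Values taken from the posts in this thread; names are illustrative.
SPAN_YEARS = 19796        # calibrated span down to the end of the P312 block
TOTAL_SNPS = 101          # SNPs counted from the top of R
MA1_ANCESTRAL_SNPS = 5    # R SNPs that MA-1 does not carry
MA1_AGE = 24_000          # calibrated age of MA-1 (ybp)

# Rate before the adjustment: the whole span spread over all 101 SNPs.
rate_before = SPAN_YEARS / TOTAL_SNPS

# Rate after: the same span spread over the 96 SNPs below MA-1's branch node.
rate_after = SPAN_YEARS / (TOTAL_SNPS - MA1_ANCESTRAL_SNPS)

# Years added above the MA-1 node to reach the top of R.
added_years = MA1_ANCESTRAL_SNPS * rate_after
r_age = MA1_AGE + added_years

print(round(rate_before), round(rate_after), round(added_years), round(r_age))
```

This is only the point estimate; it says nothing about how many undiscovered SNPs the sample itself may carry.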

MJost

MJost
02-18-2015, 09:17 PM
I thought I would present my dated version of the R to DF13 Tree chart using the Upper and Lower calibrated Red font dates and the identified YFull SNPs. MJost



YBP Begin   BCE     SUM   SNPs in Block   Chr Y HG
25031       23031     5    5              R root
24000       22000    23   18              Mal'ta R* Node
20289       18289    27    4              R.1-Y482
19464       17464    71   44              R1
10392        8392    72    1              R1b-M343/PF6242
10185        8185    74    2              R1b1.1 M415/PF6251 * L278
 9773        7773    75    1              R1b1.2 L389/PF6531
 9567        7567    77    2              R1b1a P297/PF6398 * L320/PF6092
 9155        7155    84    7              R1b1a2 - M269
 7711        5711    87    3              R1b1a2a - L23
 7093        5093    91    4              R1b1a2a1 - L51
 6268        4268    99    8              R1b1a2a1a - L11
 4618        2618   101    2              R1b1a2a1a2 - P312
 4206        2206   107    6              R1b1a2a1a2c - L21
 2969         969   109    2              DF13
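The chart can be regenerated from the block sizes alone. The sketch below assumes the dates were interpolated linearly between 25031 ybp (top of R) and 4206 ybp (end of the P312 block) over 101 SNPs, and that BCE is simply ybp minus 2000; both are inferences from the chart itself, and the function name is made up for illustration.

```python
# (label, SNPs in block) in tree order, as listed in the chart.
BLOCKS = [
    ("R root", 5), ("Mal'ta R* Node", 18), ("R.1-Y482", 4), ("R1", 44),
    ("R1b-M343/PF6242", 1), ("R1b1.1 M415/PF6251 * L278", 2),
    ("R1b1.2 L389/PF6531", 1), ("R1b1a P297/PF6398 * L320/PF6092", 2),
    ("R1b1a2 - M269", 7), ("R1b1a2a - L23", 3), ("R1b1a2a1 - L51", 4),
    ("R1b1a2a1a - L11", 8), ("R1b1a2a1a2 - P312", 2),
    ("R1b1a2a1a2c - L21", 6), ("DF13", 2),
]
UPPER_YBP = 25031       # start of the R block
LOWER_YBP = 4206        # end of the P312 block
SNPS_BETWEEN = 101      # SNPs between the two calibration points
PRESENT_OFFSET = 2000   # assumed: chart BCE = ybp - 2000


def block_starts(blocks=BLOCKS):
    """Return (label, ybp_begin, bce, cumulative_snps, snps_in_block) rows."""
    rate = (UPPER_YBP - LOWER_YBP) / SNPS_BETWEEN   # about 206 years per SNP
    rows, snps_before = [], 0
    for label, n_snps in blocks:
        ybp = round(UPPER_YBP - snps_before * rate)  # start of this block
        snps_before += n_snps                        # running SUM column
        rows.append((label, ybp, ybp - PRESENT_OFFSET, snps_before, n_snps))
    return rows


for row in block_starts():
    print(row)
```

Rounding can move a date by a year relative to the chart, so small differences are expected. Changing the two calibration dates shifts every row, which is why the choice of anchors matters so much in this discussion.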

MJost
02-19-2015, 03:20 PM
I wanted to post an Iain McDonald conversation, dated Feb 18 11:43 AM, from over on the Yahoo U106 board, where he re-calculated his own date of U106 (U106) as 2700 BC using the new information from the latest archaeological results of Haak et al. (2015), as compared with my own re-calculated Haak 2015 age of R1b1a2a1a2 - P312, which changed to 4618 (2618bc).

https://groups.yahoo.com/neo/groups/R1b1c_U106-S21/conversations/messages/32489


"The update below contains the latest estimates of SNP ages from 350 U106 BigY tests. It is a mostly automated update which does not include the very latest branches that have been discovered. The same (important!) caveats apply that have been mentioned in previous threads.

The main reason for this update is to include information from the latest archaeological results of Haak et al. (2015). This has pushed back the age of each SNP by 50-100 years, increased the upper bounding age of most SNPs by about 50 years, and increased the lower bounding age of most SNPs by 100-200 years. The uncertainty in each SNP's age has therefore decreased by about 100 years.

Some limited attempt has been made to regularise the SNP names here with those that the project admins are using. Some fixes have also been made by Andrew Booth where specific SNPs were not called but can be presumed positive. Most of these are reflected here too.

The mutation rates I am using are: 140.03 years/SNP over all of BigY excluding the DYZ19 region, with a 95% confidence interval of 127.39 to 157.46 years/SNP. Values are quoted as "SNP, best-estimate date, (95% confidence interval)".

Cheers,

Iain.

U106 (U106) 2700 BC (3328 BC - 2241 BC) "

I did leave, in my re-calculation, the number of years per SNP the same at 206 to arrive at a DF13 lower age (starting of its first tier of subclades).

But Iain stated his U106 BigY subclade SNPs are running around 140 years per SNP, whereas my Full genome Sanger sequenced list of 29 SNPs under my P312>L21>DF13>FGC5494 branch would need to be around 88 years per SNP. I have specified DF13's max age now at 2969 years. DF13 has two SNPs in a block, and its first tier of subclades was spawned around 2557 ybp, thus all years per SNP calculations down into each branch below should be limited to this new capped 2557 age.
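As a rough check of the two rates being compared, here is a small sketch using only figures from this post and from Iain's quoted message. The names are illustrative, and it assumes the two rates are directly comparable, which they may not be given the different test coverages involved.

```python
# Figures from the posts above; names are illustrative assumptions.
DF13_SUBCLADE_CAP_YBP = 2557      # capped start of the DF13 subclades
SNPS_BELOW_DF13 = 29              # SNPs in the FGC5494 line
IAIN_RATE = 140.03                # years/SNP quoted for BigY
IAIN_RATE_CI = (127.39, 157.46)   # quoted 95% confidence interval

# Rate needed to fit 29 SNPs under the 2557-year cap.
implied_rate = DF13_SUBCLADE_CAP_YBP / SNPS_BELOW_DF13

# Span the same 29 SNPs would cover at the quoted BigY rate.
span_at_iain_rate = SNPS_BELOW_DF13 * IAIN_RATE

# Does the implied rate fall inside the quoted interval?
inside_ci = IAIN_RATE_CI[0] <= implied_rate <= IAIN_RATE_CI[1]

print(round(implied_rate, 1), round(span_at_iain_rate), inside_ci)
```

At the quoted BigY rate the 29 SNPs would span roughly 4,000 years rather than 2,557, which is the size of the gap being discussed.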

IMHO, 'Big Daddy DF13' started a massive growth in male offspring for a number of generations within the period 2969 to 2557 ybp. Was this because family units were changing to start with much younger males, which resulted in the much lower years per SNP below DF13?

MJost

MJost
02-19-2015, 07:29 PM
Now to check my suggestion that the years per SNP is much less below DF13 to present, I present the following information on the only ancient YDNA under DF13>DF21, Hinxton#4, which is dated to around 1AD.

Hinxton4 (Iron age) background:
Insights into British and European population history from ancient DNA sequencing of Iron Age and Anglo-Saxon samples from Hinxton, England. See Quoted details below.
https://www.sciencenews.org/article/anglo-saxons-left-language-maybe-not-genes-modern-britons
http://www.ashg.org/2014meeting/abstracts/fulltext/index.shtml

Y_DNA DF25+ data shown at:
https://drive.google.com/folderview?id=0B7vzRsRM2aOQUGFBX3luN05rWkU&usp=sharing&tid=0B7vzRsRM2aOQTENJUlB4OVVWeUE#list





S. Schiffels, W. Haak, B. Llamas, E. Popescu, L. Loe, R. Clarke, A. Lyons, P. Paajanen, D. Sayer, R. Mortimer, C. Tyler-Smith, A. Cooper, R. Durbin

...British population history is shaped by a complex series of repeated immigration periods and associated changes in population structure. It is an open question however, to what extent each of these changes is reflected in the genetic ancestry of the current British population. Here we use ancient DNA sequencing to help address that question. We present whole genome sequences generated from five individuals that were found in archaeological excavations at the Wellcome Trust Genome Campus near Cambridge (UK), two of which are dated to around 2,000 years before present (Iron Age), and three to around 1,300 years before present (Anglo-Saxon period). Good preservation status allowed us to generate one high coverage sequence (12x) from an Iron Age individual, ...

DF21/S192>(Y2890/S5201>Y2891/S5199>CTS8704/S6375>Y2892 four unordered SNPs)>(Z246/S280>Y2889 unordered)>DF25/S253

DF13's first tier subclade DF21 has the age of 2547bc. There are seven SNPs under DF13>DF21 at 88 years per SNP = 616 years down to Subclade Z246/S280 (block of two unordered SNPs), which equals around 31 AD.

Note that Hinxton4 terminal SNP of DF25 is solidly placed in Cambridge, Cambridgeshire, UK with archaeological excavations dating around 1 AD.

MJost

Muircheartaigh
02-19-2015, 11:00 PM
Now to check my suggestion that the years per SNP is much less below DF13 to present, I present the following information on the only ancient YDNA under DF13>DF21 Hinxton#4 which is date to around 1AD.

DF13's first tier subclade DF21 has the age of 2547bc. There are seven SNPs under DF13>DF21 at 88 years per SNP = 616 years down to Subclade Z246/S280 (block of two SNPs un ordered) equals around 31 AD.

Note that Hinxton4 terminal SNP of DF25 is solidly placed in Cambridge, Cambridgeshire, UK with archaeological excavations dating around 1 AD.

MJost

Mark,

If DF21 has the age of 2547bc, how does it fit with your opinion in yesterday's post #190?

Quote IMHO, 'Big Daddy DF13' started a massive male off-spring growth for a number of generations near the of the period 2969 to 2557 ybp. Was this because family units were changing to start with much young males which resulted in the much lower years per SNP below DF13?Quote

MJost

MJost
02-19-2015, 11:45 PM
Exactly, what's your opinion?

MJost

Muircheartaigh
02-20-2015, 12:21 AM
Exactly, what's your opinion?

MJost

My opinion is that I'm confused by your last two posts. On the one hand your opinion is that DF13 has an age of 2969 years before present and in your next post which seems to be intended to support your theory you say that DF21 which is downstream of DF13 has an age of 2547bc ie c4547 years before present.

miiser
02-20-2015, 01:05 AM
IMHO, 'Big Daddy DF13' started a massive male off-spring growth for a number of generations near the of the period 2969 to 2557 ybp. Was this because family units were changing to start with much young males which resulted in the much lower years per SNP below DF13?

MJost

Without intending to argue in favor of or against your specific dates, I'll mention that there was some discussion not long ago in the L21 Yahoo group regarding the possibility of non uniform SNP mutation rates. It was pointed out that the mutation rate is highly dependent (greater than a linear relationship) on the age of the father. Children from older fathers have a much increased probability of SNP mutations occurring. This should have the effect that increased lifespan would increase the occurrence of SNPs per generation, resulting in a reduced number of years per SNP.

This would also have the effect of clumping multiple SNPs together within a single lineage during periods of prosperity within especially prosperous families. This effect may help explain the large variation in SNP mutation rate from one branch to another, and from one time period to another.

MJost
02-20-2015, 01:56 AM
My opinion is that I'm confused by your last two posts. On the one hand your opinion is that DF13 has an age of 2969 years before present and in your next post which seems to be intended to support your theory you say that DF21 which is downstream of DF13 has an age of 2547bc ie c4547 years before present.
There are six L21 SNPs in an unordered block and two SNPs in the DF13 block. Using 206 years per SNP, the L21 block begins at 4206 (2206bc) and the DF13 block ends at 2557 (557bc), which is where the first tier of ALL DF13 subclades starts. The SNPs below that calculate to around 88 years per SNP in my FGC5494 line, which is based on 29 SNPs. I made the statement that, for each branch, 2557 years is the cap for the start of each DF13 subclade.

I always try to make my posts pretty clear but I guess I didn't.

MJost

MJost
02-20-2015, 02:43 AM
Without intending to argue in favor of or against your specific dates, I'll mention that there was some discussion not long ago in the L21 Yahoo group regarding the possibility of non uniform SNP mutation rates. It was pointed out that the mutation rate is highly dependent (greater than a linear relationship) on the age of the father. Children from older fathers have a much increased probability of SNP mutations occurring. This should have the effect that increased lifespan would increase the occurrence of SNPs per generation, resulting in a reduced number of years per SNP.

This would also have the effect of clumping multiple SNPs together within a single lineage during periods of prosperity within especially prosperous families. This effect may help explain the large variation in SNP mutation rate from one branch to another, and from one time period to another.

You are correct to point out the possible mutation rate variation, which could be due to several factors, and I have commented about them as well. When I recalibrated the upper and lower end dates using the existing known SNPs, I had to move down from P312+ to match the I0806 calibrated age of 2206bc, which points to the sample most probably being one or two SNPs below P312, given that some downstream testing below DF13 and DF27 was done and came back negative. DF13 is not easy to test for, and using a chipset may have caused issues. But DF27 and U152 should have been ok. So I didn't understand why the report didn't have any special notations on why it wasn't covered in more detail, since this was a critical find. I wondered if it is being delayed for "further research"? But I digress.

The remaining years from 2206bc to present would not fit the past 206 years per SNP rate but would be at least 50% less for some reason. That was the point of my question.

MJost

lgmayka
02-20-2015, 05:26 AM
Sure you could say that his P312+ is 5000 bc and not 2200 cal bc and he is U152 with a 100 private SNPs.
Thank you. I'm glad you now understand the limitations of any calculation that attempts to use this one published P312+ for calibration.

For other readers: Right now, this P312+ provides only a lower bound on the age of P312, not an upper bound. We wait to see whether his full Y chromosome is published at a sufficient quality level to count his additional SNPs beyond (younger than) P312.

MJost
02-20-2015, 01:38 PM
I always understood what the lower bound was, namely the age of 2206bc, as error would be upwards only; that is not what you just implied (that I "NOW" understand). You must not have understood that P312 is a block of two unordered SNPs and that this sample could have been born anywhere between 4618 and 4206 if P312 was his most terminal SNP. I did write, as you pointed out in your earlier response quoting what I posted, "This narrows down the window for what SNPs this sample could be possible for."

I was showing it was possible but not statistically reasonable that P312 dates to 5000 BC.

MJost

Muircheartaigh
02-20-2015, 02:24 PM
I always understood what, not what you just implied as (you "NOW" understand), the lower bound was, the age of 2206bc, as error would be upwards only. You must have not understood that P312 is a block of two unordered SNPs and this sample could have been born anywhere between 4618 and 4206 if his P312 was his most terminal SNP. I did, as you pointed out in your earlier response quoting what I posted, "This narrows down the window for what SNPs this sample could be possible for."

I was showing it was possible but not statistically reasonable that P312 is 5000BC years old.

MJost

So do you think that it is statistically reasonable to assume that his terminal SNP was P312 with no downstream SNPs, in which case either his father or an extremely close ancestor was the founder of P312?

MJost
02-20-2015, 03:59 PM
So do you think that it is statistically reasonable to assume that his terminal SNP was P312 with no downstream SNPs, in which case either his father or an extremely close ancestor was the founder of P312.

Yes, since there are two unordered SNPs in the P312 block, but until the sample is tested for any SNPs inside the blocks of L21, U152 and DF27 we won't be 100% certain.

MJost

Muircheartaigh
02-20-2015, 04:40 PM
Yes. since there are two unordered SNPs in the P312 block, but until the sample is tested for any SNPs inside the blocks of L21 and U152 and DF27 we wont be 100% certain.

MJost

So let's get this straight. Your opinion is that this donor's father, grandfather, or great-grandfather was the founder of P312.

MJost
02-20-2015, 04:54 PM
How would we know that? There is a P312+ range of 412 or so years. But it may be that this sample was on the more recent end of the range, based on the average years per SNP.

I don't think you are truly interested in the results; you are for the much older R age and its associated subclade bounded rates. You have already posted that you do not believe in any 'ancient DNA results don't tell us the age of the SNPs' theory.

MJost

Muircheartaigh
02-20-2015, 07:10 PM
How would we know that? There is a P312+ range of 412 or so years. But it maybe that this sample was more on the more recent end of the range based on the average years per SNP.

I don't think you are truly interested in the results but you are for the much older R age and is associated subclade bounded rates. You have already posted that you do not believe in any 'ancient DNA results don't tell us the age of the SNPs' theory.

MJost

On the contrary I'm very interested in the results of Ancient DNA testing as it's the only way we are going to accurately determine the ages of the SNPs and branching points on the Human Tree. However we are not going to be able to obtain accurate values until we get NGS testing of Ancient DNA with the same coverage of either FGC or Big Y coverage instead of or in addition to the targeted testing that has been carried out on the ancient remains. You only need to look at what has been achieved by NGS testing compared to GENO 2 to realize the advantages.

Repeating what has been stated here before, if the recently discovered Ancient DNA were to have true NGS testing carried out, with comparative coverage to FGC or Big Y, it would allow us to determine the age of their most recent currently known SNP by count back from the donor's known ages, using estimates from the quantity of their downstream SNPs.

Without knowledge of any downstream SNPs, the age of their ''Terminal'' SNP is pure speculation. Except that is, for providing a lower bound on the SNP age which of course is a major step forward.

lgmayka
02-22-2015, 04:20 PM
This must be a major bug.
G2a2b - 49 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2 - 14 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a - 36 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1 - 18 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b - 75 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1 - 1 SNP, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a - 1 SNP, formation and TMRCA are both 20800 ybp
G-Z1823 - 2 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a2 - 2 SNPs, formation and TMRCA are both 20800 ybp
G2a2b2a1b1a2a - 3 SNPs, formation and TMRCA are both 20800 ybp

Over 200 SNPs, in the blink of an eye. :)
This bug has been at least partially fixed. New numbers:

G2a2b (http://yfull.com/tree/G2a2b/) : 16700-14200 ybp
G2a2b2a : 14200-10600 ybp
G2a2b2a1b : 10300-9700 ybp
G2a2b2a1b1a2a : 4200-3300 ybp

MJost
02-22-2015, 06:34 PM
On the contrary I'm very interested in the results of Ancient DNA testing as it's the only way we are going to accurately determine the ages of the SNPs and branching points on the Human Tree. However we are not going to be able to obtain accurate values until we get NGS testing of Ancient DNA with the same coverage of either FGC or Big Y coverage instead of or in addition to the targeted testing that has been carried out on the ancient remains. You only need to look at what has been achieved by NGS testing compared to GENO 2 to realize the advantages.

Repeating what has been stated here before, if the recently discovered Ancient DNA were to have true NGS testing carried out, with comparative coverage to FGC or Big Y, it would allow us to determine the age of their most recent currently known SNP by count back from the donor's known ages, using estimates from the quantity of their downstream SNPs.

Without knowledge of any downstream SNPs, the age of their ''Terminal'' SNP is pure speculation. Except that is, for providing a lower bound on the SNP age which of course is a major step forward.

I would have to believe that with all the R1b guys who have submitted their Full Genome NGS files to YFull, we have seen all of the intermediate SNPs down to L21 accounted for, allowing well defined SNP tree between R and DF13. So I say, ancient samples would not have any new information except branching SNPs not found today due to daughtering out, etc.

MJost

lgmayka
02-22-2015, 08:08 PM
I would have to believe that with all the R1b guys who have submitted their Full Genome NGS files to YFull, we have seen all of the intermediate SNPs down to L21 accounted for, allowing well defined SNP tree between R and DF13.
Not exactly. YFull has stated that it still needs Big Y examples from the early branches. Take a look at YFull's tree for R1b (http://yfull.com/tree/R1b/), and consider that they really need at least 1, and preferably 2, YF entries on each branch. So they need the following (if they exist):

2 of M343*
2 of L389*
1 more of P297* or M478
2 of M269*

Of course, the ages listed right now may not take into account the entries marked New.

The L23 haplotree (http://yfull.com/tree/R1b1a2a/) includes a new entry, YF02895, marked as L23* (xL51xZ2103). We shall see where that entry ends up, and hopefully its ancestry.

jdean
02-22-2015, 08:51 PM
The L23 haplotree (http://yfull.com/tree/R1b1a2a/) includes a new entry, YF02895, marked as L23* (xL51xZ2103). We shall see where that entry ends up, and hopefully its ancestry.

According to Atanas Kumbarov of the ht35 Project YF02895 belongs to R-PF7562.

Muircheartaigh
02-22-2015, 09:03 PM
I would have to believe that with all the R1b guys who have submitted their Full Genome NGS files to YFull, we have seen all of the intermediate SNPs down to L21 accounted for, allowing well defined SNP tree between R and DF13. So I say, ancient samples would not have any new information except branching SNPs not found today due to daughtering out, etc.

MJost

I agree that we probably have a full list of all of the SNPs between R and DF13, but that doesn't mean the tree as it stands is complete. As you know, what we refer to as SNPs are actually variants measured against Hg19, a reference constructed from present day donors, and not against a reference which predates Haplogroup R. That means that some of what we refer to as SNPs are not SNPs at all, because the mutations never occurred in our ancestral line; they occurred in the ancestral lines of the anonymous donors to Hg19. This means that the tree as it stands has errors and requires reconstruction to remove these ''Apparent SNPs''. Companies such as FGC, FTDNA, YFull and ISOGG should now be in possession of sufficient data to accurately reconstruct the tree from ''Adam'' to the present day, and hopefully they are using the data to do so and to agree on a Y chromosome reference that reflects the reality. That will have major implications that will really screw things up for us of course.

I cannot agree with the second part of your statement. Ancient DNA samples that are properly NGS tested are going to provide accurate dating of SNP groups and of the major branching of the tree, and should also assist in differentiating between Actual and ''Apparent'' SNPs.

MJost
02-22-2015, 09:03 PM
I would note YFull used vetted ISOGG R Tree SNPs and I accept any NGS-found SNP that fits into the tree properly. But you are right to suggest the need for multiple samples having these new SNPs but, unless the reads are poor, they should be ready for prime time.

MJost

MJost
03-17-2015, 04:02 AM
I wanted to show the equivalent levels for the various SNPs identified based on calibrated ages using my Mal'ti boy to BB I0806 dating, for better or worse, from the paper, Massive migration from the steppe is a source for Indo-European languages in Europe Haak et al 2015

https://drive.google.com/file/d/0By9Y3jb2fORNbzZvM1VwZXVYRGs/view?usp=sharing

MJost

razyn
03-20-2015, 06:54 PM
It's on the "new papers" thread in the General forum, but I think we need a link here to the YFull guys' new paper on their approach to the mutation rate in the Y chromosome. Available in English and Russian. I'll cite the Anthrogenica post rather than the paper, though I actually saw it on Facebook initially. http://www.anthrogenica.com/showthread.php?709-New-DNA-Papers&p=75177&viewfull=1#post75177

MJost
03-24-2015, 09:16 PM
New updated spreadsheet with the R1b project from MikeW. I took his list of haplotypes listed as R1b> P297- All and I ran these new numbers, and they are reasonably close to my calibrated dates from the Mal'ti boy and BB I0806 calibrated date range. I removed the V88 branch from the R1b as well.

https://drive.google.com/file/d/0By9Y3jb2fORNOEYwTkpWVkhQR1k/view?usp=sharing

MJost

MJost
04-01-2015, 09:36 PM
YFull has recently add a number of additional SNPs to their YTree v3.7 using NGS and ISOGG tree SNPs. I re-pulled them and added their Chr Y hg19 positions to the linked chart.

I recalibrated again using Mal'ti boy, and I decided to use Hinxton4, dated 1AD, instead of BB I0806, who was P312 at around 2206 cal bc but was not fully checked for his terminal SNP. This sets Hinxton4's known terminal DF25 SNP at ~1AD. This also assisted in setting the bounds of Mal'ti boy, using his 24kya cal age, to around three SNPs below R, working in both age directions. There are 143 SNPs in total from one to the other, resulting in 153.8 years per SNP across 22,000 years.

I then added the list of positions from R down to DF25 so I could enter them into DaveV's YFull-based SNP Block Age Estimator spreadsheet, just to see what happens with the qualified number of SNPs in the CombBED Region count. Out of 147 SNPs it only liked 64 SNPs. The years per SNP was not calculable. My belief is that most of the 147 SNPs were tested and confirmed before NGS results were combined, so they were all Sanger or PCR confirmed by different labs or papers and have a stable placement within ISOGG's existing R-tree.

https://drive.google.com/file/d/0By9Y3jb2fORNN3c1eG9MbWtpQlE/view?usp=sharing

MJost

MJost
04-02-2015, 02:29 AM
So I have entered my own FGC5494 and derived SNPs that have been Sanger sequenced and shown to be positive into the Number of SNPs in CombBED Region checker, and it came back with 23 below DF13 out of 27. These are my derived SNPs that occurred under DF13, with these four not in the CombBed ranges:
FGC5494s 3445114
FGC5505s 5856921
FGC5537s 18269066
FGC5513s 9929665

The remaining 23 were then created after the DF13 block of two SNPs, starting at 3027 ybp (1077 BC). My DF13 subclade FGC5494 CombBed qualified SNPs average 131.6 years per SNP from the standard 1950 date.


FGC5496s 7306721
FGC7448s 8501059
FGC5521s 14863542
FGC5511s 7915429
FGC5539s 18842540
FGC5522s 14885358
FGC5530s 17342711
FGC5533s 18048085
FGC5534s 18074610
FGC5541s 21313657
FGC5543s 21962352
FGC5544s 22205752
FGC5551s 22706864
FGC5510s 7811639
FGC5557s 23747059
FGC5552s 22842754
FGC5538s 18565974
FGC5523s 14998294
FGC5507s 6676738
FGC5508s 6745642
FGC5524s 15704259
FGC5554s 22888810
FGC7997s 23351605

131.6 yrs per SNP

My SNP GD 2/29 match is YSeq ID:263 FtDNA#: 316063 Watterson (STR GD1/67 & GD3/111).
1950 - 263.2 = 1686.8 AD max and 1950 - 131.6 = 1818.4 AD min. Actually the MRCA should be around 1840 AD according to his paper trail and other aDNA information, also considering that I have an STR GD3/67 & GD5/111 match that appears to be an NPE with an MRCA b. 1805 AD in Georgia, USA.

This MRCA's paternal line, William WATTERSON Jr. b: 1700 came to Virginia from County Down, Ireland in early 1760's with three sons. His father was William WATTERSON Sr. d: Nov 20 1703. Both born on the Isle of Man.

My next match has an SNP GD of 3/29: YSeq ID:306 FtDNA#: 316063 Watterson (STR GD6/67 & GD8/111), born on the Isle of Man. 1950 - 394.8 = 1555.2 AD max and 1950 - 263.2 = 1686.8 AD min.

Neither of the above two men had any negative SNPs outside the CombBed ranges. My next match, a Highland Scot, does have one: FGC5537. He has an SNP GD of 8/29: YSeq ID: 113 FtDNA#: 155812 Ross (STR GD 9/67 & 13/111). Seven negative CombBed SNPs equal 921.2 years. 1950 - 921.2 = 1028.8 AD max and 1950 - 789.6 = 1160.4 AD min.

Interestingly the Watterson surname has been documented since 1417 in the IOM government Key records.

For the most part, the end age of the DF13 block of SNPs begins to fit the CombBed qualified SNPs, which can be calibrated with the max date of 3027 ybp (1077 BC). Count your CombBed SNPs and divide them into 3027 years to get a new calibrated years per SNP under DF13.
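
The arithmetic in this post can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the inputs (3027 years, 23 qualifying SNPs, the 1950 reference year) are the figures quoted above, and the window for a match assumes that n differing SNPs span somewhere between n-1 and n SNP intervals, which is how the max/min dates above appear to be derived.

```python
REFERENCE_YEAR = 1950  # "present" used for ybp dates in this post


def years_per_snp(block_age_years, snp_count):
    """Average years per SNP for a lineage below a dated block."""
    return block_age_years / snp_count


def mrca_window(snp_gd, rate, reference_year=REFERENCE_YEAR):
    """Return (earliest, latest) calendar years AD for a match differing by snp_gd SNPs."""
    earliest = reference_year - snp_gd * rate
    latest = reference_year - (snp_gd - 1) * rate
    return earliest, latest


# 23 CombBed-qualified SNPs below a block dated 3027 ybp, rounded as in the post
rate = round(years_per_snp(3027, 23), 1)
for gd in (2, 3, 7):
    earliest, latest = mrca_window(gd, rate)
    print(f"SNP GD {gd}: {earliest:.1f} AD max, {latest:.1f} AD min")
```

Anyone with a different count of CombBed SNPs under DF13 would only need to change the second argument of `years_per_snp`.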

MJost

Wing Genealogist
04-02-2015, 11:33 AM
Does anyone know if they have been able to estimate how old the Hinxton4 individual was when he died? Ideally, we should be calculating the date of birth for him, rather than the date of burial.

Then again, I am not certain if his age would be dwarfed by the uncertainty in age of burial due to the radiocarbon dating.

MJost
04-02-2015, 01:43 PM
Felix Immanuel blogged that he evaluated Hinxton4's age using telomere length. Hinxton4's average telomere length from all runs is 2.24 kb, which means Hinxton4 died at around age 55.

See his blog and chart at:

http://www.fi.id.au/2014/10/hinxton-4-dna-analysis.html

Hinxton4 is 23 years older than the average 32 years old as suggested by the YFull paper.

MJost

MJost
04-02-2015, 04:44 PM
Does anyone know if they have been able to estimate how old the Hinxton4 individual was when he died? Ideally, we should be calculating the date of birth for him, rather than the date of burial.

Then again, I am not certain if his age would be dwarfed by the uncertainty in age of burial due to the radiocarbon dating.

Thanks for that very valid point concerning the birth age of Hinxton4. I adjusted to a 55 BC birth date for him and, with it, for his DF25+ terminal SNP. It recalculates to 153.5 years per SNP, down 0.3 years from 153.8. We should also consider that DF25 could have been spawned at any point up to another 154 years further back, or 209 BC at the maximum (which recalibrates to 152.4 years per SNP).

It would also be interesting to recalculate to Mal'ti boy's father's age at his conception, but that would probably extend the span back to the original range of 22000 years, 55BC to 24040-ish ybp.
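
A minimal sketch of this recalibration, assuming the 143-SNP count between the two samples given earlier in the thread and the 22004 BC figure for Mal'ti boy that is quoted in a later post. The function name is hypothetical, and the calculation is nothing more than span divided by SNP count.

```python
def calibrated_rate(older_year_bc, younger_year_bc, snp_count):
    """Years per SNP between two dated points on one lineage (both dates in years BC)."""
    span_years = older_year_bc - younger_year_bc
    return span_years / snp_count


SNPS_BETWEEN = 143      # SNPs between Mal'ti boy and Hinxton4, per this thread
MALTA_YEAR_BC = 22004   # assumed anchor date for Mal'ti boy

scenarios = [
    ("Hinxton4 born 55 BC", 55),
    ("DF25 formed as early as 209 BC", 209),
]
for label, year_bc in scenarios:
    rate = calibrated_rate(MALTA_YEAR_BC, year_bc, SNPS_BETWEEN)
    print(f"{label}: {rate:.1f} years per SNP")
```

Moving the younger anchor back by a full SNP interval changes the rate by only about one year per SNP, which is why the adjustment for age at death is small compared with the other uncertainties discussed here.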


MJost

MJost
04-03-2015, 02:23 AM
New updated spreadsheet with the R1b project from MikeW. I took his list of haplotypes listed as R1b> P297- All and ran these new numbers, and they are reasonably close to my calibrated dates from the Mal'ti boy and BB I0806 calibrated date range. I removed the V88 branch from the R1b as well.

https://drive.google.com/file/d/0By9Y3jb2fORNOEYwTkpWVkhQR1k/view?usp=sharing

MJost

I updated the chart with new SNP ages.

MJost

MJost
04-03-2015, 02:27 AM
YFull has recently added a number of additional SNPs to their YTree v3.7 using NGS and ISOGG tree SNPs. I re-pulled them and added their Chr Y hg19 positions to the linked chart.

I recalibrated again using Mal'ti boy, and I decided to use Hinxton4, dated 1AD, instead of BB I0806, who was P312 at around 2206 cal bc but was not fully checked for his terminal SNP. This sets Hinxton4's known terminal DF25 SNP at ~1AD. This also assisted in setting the bounds of Mal'ti boy, using his 24kya cal age, to around three SNPs below R, working in both age directions. There are 143 SNPs in total from one to the other, resulting in 153.8 years per SNP across 22,000 years.

I then added the list of positions from R down to DF25 so I could enter them into DaveV's YFull-based SNP Block Age Estimator spreadsheet, just to see what happens with the qualified number of SNPs in the CombBED Region count. Out of 147 SNPs it only liked 64 SNPs. The years per SNP was not calculable. My belief is that most of the 147 SNPs were tested and confirmed before NGS results were combined, so they were all Sanger or PCR confirmed by different labs or papers and have a stable placement within ISOGG's existing R-tree.

https://drive.google.com/file/d/0By9Y3jb2fORNN3c1eG9MbWtpQlE/view?usp=sharing

MJost

Adjusted Mal'ti boy and Hinxton4 to their birth ages, resulting in some minor realignment changes. Added my own SNPs below DF13 to compare recalibration of the SNPs after 1AD.

https://drive.google.com/file/d/0By9Y3jb2fORNN3c1eG9MbWtpQlE/view?usp=sharing


MJost

mcg11
04-05-2015, 07:10 PM
I am concerned about an apparent limitation of the present STR TMRCA analysis approach. In two cases where I computed the TMRCA of a founder, I was limited by the availability of other haplotypes. E.g., I can trace my earliest ancestor back to 1684, when he emigrated to the colonies. In another case I computed the TMRCA for a set of Gregorys at the Clan Gregor FtDNA site. In each case I am left with the problem of how to proceed further. I need more independent but related haplotypes to determine the founder's TMRCA.

I believe the timing of an SNP mutation and the timing of an STR mutation are mutually independent. In the case of an STR, we also don't know whether it was the person born at a certain time or one of his ancestors who had the mutation.

Additionally, in trying to determine what the ancestral value was, I usually use the modal value. If the data set contains many descendants of one branch of a tree, that will bias the estimate. It seems that only independent entries, not of the same branch, should be used.

Finally, returning to my first example: if we are estimating the TMRCA of a set of persons, how do we know that the analysis does not simply indicate when the haplotype first arrived at a different geographic location, rather than the true TMRCA of the STR mutation?
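
The bias described above can be illustrated with a small Python sketch. Everything in it is made up for illustration: the three-marker haplotypes, the branch labels, and the function names are hypothetical, and taking one modal per branch before combining is just one possible way to down-weight an oversampled branch, not an established procedure.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common repeat count at each marker across equal-length haplotypes."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)]


def modal_by_branch(branches):
    """Take one modal haplotype per branch first, then the modal of those."""
    return modal_haplotype([modal_haplotype(members) for members in branches.values()])


# Hypothetical three-marker haplotypes; branch "A" is heavily oversampled.
branches = {
    "A": [[13, 24, 14], [13, 24, 14], [13, 24, 14], [13, 24, 14]],
    "B": [[13, 25, 14]],
    "C": [[13, 25, 14]],
    "D": [[13, 25, 15]],
}
pooled = [h for members in branches.values() for h in members]
print(modal_haplotype(pooled))    # second marker follows the oversampled branch
print(modal_by_branch(branches))  # each branch counted once
```

In this toy data the pooled modal takes the value 24 at the second marker only because branch A contributes four of the seven haplotypes; counting each branch once gives 25.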

I realize that the current flavor of this thread is to better understand the limitations of using SNP's for time estimates, but I still use STR's and would like to read some discussion of these issues.

MJost
04-09-2015, 03:29 PM
In trying to understand the Adamov et al. paper's calculations, I am cautious. I appreciate the effort to quantify the number of mutations from high-quality Next Generation Sequencing (NGS), with a common-sense exclusion of the pseudoautosomal, heterochromatic, X-transposed, and ampliconic segments, as confirmed by a host of previous papers, and the definition of the 857 “good” regions tagged as the "combBED area”, covering a total length of 8,473,821 bp. The SNP mutation rate was calibrated within these ranges.

My simple method just counts the number of SNPs shown by YFull's 3.7 tree between Mal'ti boy (cal 24000 ybp) and Hinxton4 (cal 1AD), a range of approx. 22000 years: Mal'ti boy - Hinxton4 years per SNP = 153.5

We have a considerable number of BigY and Full Genomes Corp NGS results, and Adamov attempts to set a common framework for which SNPs can or should not be used for dating. But people will eventually want all discovered novel/private SNPs tested to find ancestral branching, and not all men will be able to test via NGS, so single-SNP testing will be the only option. Those SNPs will need to be Sanger validated. Some of these SNPs may be outside the 'combBed' ranges but still be great for branching information within a genealogical time frame or many millennia further back. FYI: In my own case, 28 SNPs were validated by being Sanger sequenced positive, but when comparing them against the CombBed ranges, only 23 qualify.

My point is that in my very simple review, YFull has 144 SNPs from Mal'ti Boy (four years old, R* equiv node, 22004 cal bc) to DF25 (born 45bc, d. cal 1AD). All but two were identified by multiple labs, mostly via Sanger/PCR methods. The remaining two were identified from NGS by YFull. Of those 144 positions, 50 are outside the 'combBed' ranges.

Do we throw 50 out for dating purposes? Well, I then took the remaining 90 and entered them into DaveV's SNP block estimator spreadsheet. Setting the Tested CombBED Length to 8473821 produced a Calculated Age of Phylogenetic Block of
12952.37 years before present or 11002BC (needs adjusting)
95% CI High = 15172.78 ybp =13223BC
95% CI Low = 11298.88 ybp =9349BC

Resulting SNP Mutation Rate: 143.92 years per SNP mutation
(High of 168.59 years, Low of 125.54 years)
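
The figures above are consistent with a simple relationship: years per SNP = 1 / (rate constant × tested length), and block age = SNP count × years per SNP. The sketch below reproduces the numbers on that assumption. It is not taken from DaveV's spreadsheet itself, and the function names are hypothetical.

```python
def years_per_snp(rate_constant, tested_length_bp):
    """Expected years between SNPs over a region: 1 / (rate per bp per year * bp covered)."""
    return 1.0 / (rate_constant * tested_length_bp)


def block_age(snp_count, rate_constant, tested_length_bp):
    """Age in years of a block of snp_count SNPs."""
    return snp_count * years_per_snp(rate_constant, tested_length_bp)


COMBBED_LENGTH = 8_473_821  # bp, full combBED area
RATE = 8.2e-10              # rate constant quoted from Adamov et al.
RATE_LOW = 7.0e-10          # lower 95% bound on the rate
RATE_HIGH = 9.4e-10         # upper 95% bound on the rate

print(round(years_per_snp(RATE, COMBBED_LENGTH), 2))       # central estimate
print(round(years_per_snp(RATE_LOW, COMBBED_LENGTH), 2))   # slower rate, more years per SNP
print(round(years_per_snp(RATE_HIGH, COMBBED_LENGTH), 2))  # faster rate, fewer years per SNP
print(round(block_age(90, RATE, COMBBED_LENGTH), 2))       # age of a 90-SNP block
```

Note that the low end of the rate interval gives the high end of the age interval, and vice versa.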

Since this is only the Hinxton4 to Mal'ti boy range, add 2055 years back to the start of Hinxton4.
14952.37 years before present or 13002BC
95% CI High = 17172.78 ybp =15223BC
95% CI Low = 13298.88 ybp = 11349BC

Now if I adjust the SNP Block Estimator to use all 144 SNPs instead of only the CombBed region SNPs, it calculates
20723.80 ybp = 18774BC, but adjusting from Hinxton4 at 2055 ybp results in
22723.80 ybp = 20774BC

This is closer to the calibrated Mal'ti boy site age of 24000 but is still short by 1276 years. Thus perhaps the mutation rate is slightly off and should be closer to a Rate Constant of SNP Mutations of 7.7E-10 (=0.00000000077).

This rate constant of 7.7E-10 from using all 144 SNPs is still within the paper's 2-SD confidence interval:
95% CI Min: 7.00E-10
95% CI Max: 9.40E-10

AND my simple formula, above, of 153.5 yps (years per SNP) matches the rate constant of 7.7E-10, or 153.3 yps, as calculated in DaveV's spreadsheet.

The total difference between Mal'ti boy and Hinxton4 is a 21949-year range (22004bc-55bc). Note: Hinxton4's age at death was around 55 years, based on his Chr-Y telomere length as estimated by Felix, and his burial site was cal 1AD, which I take as the latest possible formation of the DF25 SNP. Mal'ti boy was four years old when he died and his burial site was cal 24000 ybp.


MJost

MJost
04-09-2015, 05:03 PM
Next I used only 'CombBed range' SNPs from Mal'ti boy to DF13's two-SNP block: a total of 136 SNPs that were narrowed down to 86 qualified CB (CombBed) SNPs. I then added my own 23 qualified CB SNPs and entered them into DaveV's SNP Age Estimator. With a total of 109 CB SNPs, I had to modify the Rate Constant of SNP Mutations to 5.4E-10 (0.00000000054) to achieve 23821 ybp, for a resulting SNP Mutation Rate of 218.5 years per SNP mutation.


Using my manual method with the same CB SNPs
Mal'ti boy 24004 to 1950AD has total years difference: 23939
CB SNP Count: 109
Years per SNP: 219.6

Between the previous post and this one, it still needs to be resolved which type of SNPs should be used for dating.

MJost

MJost
04-09-2015, 06:07 PM
One thing I didn't do in the last posting's evaluation was to change the
Tested CombBED Length: 7.60E+06

which requires adjusting the value below to match the ~24000 year period:

Rate Constant of SNP Mutations: 6E-10 (0.0000000006)

MJost

Dave-V
04-09-2015, 06:59 PM
In trying to understand the Adamov et al. paper's calculations, I am cautious.

I share your caution so take this in the spirit of making sure I understand it also rather than defending the whole methodology :)



We have a considerable number of BigY and Full Genomes Corp NGS results, and Adamov attempts to set a common framework for which SNPs can or should not be used for dating. But people will eventually want all discovered novel/private SNPs tested to find ancestral branching, and not all men will be able to test via NGS, so single-SNP testing will be the only option.

I don't think that by proposing a common framework the paper is saying SNPs outside the combBED are unfit for dating purposes. The question of what regions are stable enough is separate from how to set a common reference for comparison purposes, and I think the Adamov paper deals almost entirely with the second rather than the first.

If I poll a representative group to estimate voting trends, I'm not saying the people I didn't poll aren't qualified to vote or shouldn't ever be polled, I've just chosen a representative sample of people that I believe will give me data that can be extrapolated. In the same way the paper is proposing the CombBED not as the best regions but as a representative area that is a) reliable in producing stable and measurable SNPs, b) covered consistently to a large percentage by all the NGS tests, and c) shows a constant rate of SNP mutation within the 2 SDs described.

With those criteria filled, the CombBED gives them a reference to propose for translating everyone's test results to a common standard, so that we can say my calculated years-per-SNP is comparable to yours and can be applied within an error range to a third test, etc. Otherwise I'm arguing that my 50 years per SNP calculation is better than your 100 years per SNP calculation when mine was over 20M base pairs and yours was over 10M and they're really the same.
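
The point about coverage can be made concrete with a short sketch. The 50 and 100 years-per-SNP figures and the 20M and 10M coverages are the hypothetical ones from the paragraph above; the function names are illustrative, and the conversion assumes SNPs accumulate at a uniform rate per base pair.

```python
def rate_per_bp_per_year(years_per_snp, covered_bp):
    """Convert a years-per-SNP figure over a given coverage to a per-base-pair yearly rate."""
    return 1.0 / (years_per_snp * covered_bp)


def rescale_years_per_snp(years_per_snp, covered_bp, reference_bp):
    """Express a years-per-SNP figure as if it had been measured over reference_bp."""
    return years_per_snp * covered_bp / reference_bp


COMBBED_LENGTH = 8_473_821  # common reference length proposed in the paper

mine = rate_per_bp_per_year(50, 20_000_000)
yours = rate_per_bp_per_year(100, 10_000_000)
print(mine, yours)  # the same underlying rate per bp per year

# Both figures land on the same value once expressed over the common reference.
print(rescale_years_per_snp(50, 20_000_000, COMBBED_LENGTH))
print(rescale_years_per_snp(100, 10_000_000, COMBBED_LENGTH))
```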

There are still many underlying assumptions... like whether the rate constant of SNP mutations really holds consistently across the whole CombBED (so that partial coverage yields consistent results), etc.

Dave

MJost
04-09-2015, 07:42 PM
Dave,

The paper's purpose was to conduct a SNP mutation rate calibration carried out for expected variants from samples in the "combBED area”, the location of 857 “good” regions of the Y-chromosome (total length of 8,473,821 bp).

They compared four independent calibrations, and ranking them in order of validity and reliability yielded independent but similar rate constants for SNP mutations (0.82 × 10^-9 per year per bp, 95% CI: (0.70 - 0.94) × 10^-9).

All I was trying to show was the result of using ALL known SNPs in the R tree, based on the span of years between known ancient samples, versus using only those SNPs that fit within the CombBed ranges, which reduces the total SNP count, and what the mutation rate would have to be to create an average years per SNP that fits within the framework of defined years.

Thus your questioning of the
>underlying assumptions... like whether the rate constant of SNP mutations really holds consistently across the whole CombBED (so that partial coverage yields consistent results), etc.

With the ChrY getting shorter, who is to say which section has been corrupted due to recombination, and when?

MJost

Dave-V
04-09-2015, 11:05 PM
Dave,
The paper's purpose was to ...
MJost

Agreed on all fronts.



My point is that in my very simplex review, YFull has 144 SNPs from Mal'ti Boy (four year old) R* equiv node 22004 cal bc) to DF25 born 45bc d. cal 1AD. All but two are multiple lab identified mostly via Sanger/PCR methods. The two remaining are from NGS identified by YFull. Now those 144 positions have 50 outside the 'combBed' ranges.

Do we throw 50 out for dating purposes? Well I then took the remaining 90 and entered them into DaveV's SNP block estimator spreadsheet. Setting the Tested CombBED Length to 8473821, produced a Calculated Age of Phylogenetic Block of
12952.37 years before present or 11002BC (<need adjusted)
95% CI High = 15172.78 ybp =13223BC
95% CI Low = 11298.88 ybp =9349BC

MJost

The calculation has to include at least a reasonable approximation of how much of the CombBED was actually tested to come up with those 94 SNPs. Using 8473821 implies that ALL of the CombBED was covered by testing in coming up with the delta SNPs, so there were no other SNP mutations in the CombBED over that period. Is that reasonable? (I don't know, I'm asking)

Calculating backwards you get that Tested CombBED length would have to be 5.2M or about 60% of the total CombBED to hit the ~22000 delta from Mal'ti Boy to DF25. I don't know that that's reasonable either, but it's the other way to explain the gap. At that coverage the years per SNP mutation go up to 234.5, but that makes sense when dealing with smaller regions.

Dave

MJost
04-10-2015, 02:48 AM
Agreed on all fronts.

The calculation has to include at least a reasonable approximation of how much of the CombBED was actually tested to come up with those 94 SNPs. Using 8473821 implies that ALL of the CombBED was covered by testing in coming up with the delta SNPs, so there were no other SNP mutations in the CombBED over that period. Is that reasonable? (I don't know, I'm asking)

Dave

Ok, then calculate a new total CombBed length from only those positions that fall within the list of CombBed regions (your "1"s), and show that summed total so it can be manually entered into the Tested CombBED Length cell. Then the rate constant wouldn't have to change. You may have an easier time doing a lookup to find each matching region, calculating each length, and then summing them.

MJost

MJost
04-30-2015, 04:12 PM
I used my estimated 153.5 years per SNP for the R-tree, which comes from my calibration between the four-year-old Mal'ti Boy (R*, cal 24000 ybp) and Hinxton4 (55 years old at death based on Y-telomere length, died cal 1 AD), a span of 143 SNPs. I extended it back to A0-T by counting SNPs. Recalibrating the years per SNP from 292600 ybp to the present equates to 153.4 years per SNP as a fixed number. Good or bad, here are my results. MJost











YFull Experimental YTree v3.8 (http://www.yfull.com/tree)
Extracted 4/30/15 MJost

HG           # of SNPs   YearsPer HG      YBP   Adj YearsPer HG
A0-T               583         89491   292600             89452
A1                  54          8289   203109              8285
A1b                 47          7215   194820              7211
BT                 405         62168   187606             62141
CT                 319         48967   125438             48946
CF                   4           614    76472               614
F                  162         24867    75858             24856
GHIJK                4           614    50991               614
HIJK                 1           154    50377               153
IJK                  5           768    50223               767
K                   17          2610    49456              2608
K(xLT)               1           154    46846               153
MP-1205              3           461    46693               460
P                  137         21030    46232             21021
R to DF13*         139         21337    25203             21327
FGC5494**           26          3866     3866              3989
Total:            1907        292600        0            292600

153.5* years per SNP down to R to DF13
153.5 extended back to A0-T
**148.7 years per Sanger SNPs FGC5494 to Present
Recal adjusted new Ave 153.4 years per SNP A0-T to Present
* my original Cal Mal'ti Boy to DF25
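
For anyone wanting to rebuild the chart, here is a minimal sketch of how its columns appear to be derived: each branch's years are its SNP count times the years-per-SNP rate, and YBP is the running total counted up from the present. The SNP counts and rates are the ones in the chart; the function and variable names are hypothetical.

```python
# (haplogroup, SNP count) from the top of the lineage down, as listed in the chart
BRANCHES = [
    ("A0-T", 583), ("A1", 54), ("A1b", 47), ("BT", 405), ("CT", 319), ("CF", 4),
    ("F", 162), ("GHIJK", 4), ("HIJK", 1), ("IJK", 5), ("K", 17), ("K(xLT)", 1),
    ("MP-1205", 3), ("P", 137), ("R to DF13", 139), ("FGC5494", 26),
]
DEFAULT_RATE = 153.5                 # years per SNP, Mal'ti boy to DF25 calibration
RATE_OVERRIDES = {"FGC5494": 148.7}  # Sanger-confirmed SNPs below DF13


def build_rows(branches, default_rate, overrides):
    """Return (name, snps, years_in_branch, ybp_at_start_of_branch) rows."""
    years = [n * overrides.get(name, default_rate) for name, n in branches]
    rows = []
    remaining = sum(years)
    for (name, n), span in zip(branches, years):
        rows.append((name, n, span, remaining))
        remaining -= span
    return rows


rows = build_rows(BRANCHES, DEFAULT_RATE, RATE_OVERRIDES)
for name, n, span, ybp in rows:
    print(f"{name:<10} {n:>5} {span:>9.0f} {ybp:>9.0f}")

total_snps = sum(n for _, n in BRANCHES)
total_years = rows[0][3]
print("Total SNPs:", total_snps)
print("Average years per SNP:", round(total_years / total_snps, 1))
```

The "Adj YearsPer HG" column then follows by multiplying each SNP count by the overall average instead of the per-branch rate.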

MJost
04-30-2015, 05:13 PM
Ok, just added the 1701 SNPs from A00. Really quite stunning is the age when A00 was spawned around 553K years before present. MJost










YFull Experimental YTree v3.8 (http://www.yfull.com/tree)
Extracted 4/30/15 MJost

HG           # of SNPs   YearsPer HG      YBP   Adj YearsPer HG
A00               1701        261104   553703            260933
A0-T               583         89491   292600             89452
A1                  54          8289   203109              8285
A1b                 47          7215   194820              7211
BT                 405         62168   187606             62141
CT                 319         48967   125438             48946
CF                   4           614    76472               614
F                  162         24867    75858             24856
GHIJK                4           614    50991               614
HIJK                 1           154    50377               153
IJK                  5           768    50223               767
K                   17          2610    49456              2608
K(xLT)               1           154    46846               153
MP-1205              3           461    46693               460
P                  137         21030    46232             21021
R to DF13*         139         21337    25203             21327
FGC5494**           26          3866     3866              3989
Total:            3608        553703        0            553467

lgmayka
05-01-2015, 02:49 AM
Really quite stunning is the age when A00 was spawned around 553K years before present.
Frankly, you must be misinterpreting something. YFull's own estimate of A00's age (http://yfull.com/tree/A00/) is only 194K ybp.

You certainly should not be adding A00's SNPs to A0-T's SNPs. Those are two parallel branches from "Adam." YFull gives the exact same formation date for A00 and A0-T, just as one would expect from two parallel branches.

Muircheartaigh
05-01-2015, 07:44 AM
Ok, just added the 1701 SNPs from A00. Really quite stunning is the age when A00 was spawned around 553K years before present. MJost














Mark,

Are the SNPs you are counting actual SNPs or are they variants measured against Hg 19, a synthetic Genome constructed from present day donors?

MJost
05-01-2015, 12:50 PM
Frankly, you must be misinterpreting something. YFull's own estimate of A00's age (http://yfull.com/tree/A00/) is only 194K ybp.

You certainly should not be adding A00's SNPs to A0-T's SNPs. Those are two parallel branches from "Adam." YFull gives the exact same formation date for A00 and A0-T, just as one would expect from two parallel branches.


Check out the Mendez et al. paper's date range as compared to the 553.5 kya I calculated, which is at the upper end of the ±2σ confidence interval for A00.

The paper "estimated the time to the most recent common ancestor (TMRCA) for the Y tree as 338 thousand years ago (kya) (95% confidence interval = 237–581 kya)."

Even A0-T at 292.6 kya is well within the paper's reported A0 age of 202 kya (95% CI = 125–382 kya).

"the analysis of relative ages of nodes shows that the TMRCA of the A00-rooted tree is 67% older (95% CI = 35%–126%) than that of the A0-rooted tree."


An African American Paternal Lineage Adds an Extremely Ancient Root to the Human Y Chromosome Phylogenetic Tree

Fernando L. Mendez, Thomas Krahn, Bonnie Schrack, Astrid-Maria Krahn, Krishna R. Veeramah, August E. Woerner, Forka Leypey Mathew Fomine, Neil Bradman, Mark G. Thomas, Tatiana M. Karafet, and Michael F. Hammer

We report the discovery of an African American Y chromosome that carries the ancestral state of all SNPs that defined the basal portion of the Y chromosome phylogenetic tree. We sequenced ∼240 kb of this chromosome to identify private, derived mutations on this lineage, which we named A00. We then estimated the time to the most recent common ancestor (TMRCA) for the Y tree as 338 thousand years ago (kya) (95% confidence interval = 237–581 kya).

http://www.cell.com/ajhg/abstract/S0002-9297(13)00073-6

http://haplogroup-a.com/Ancient-Root-AJHG2013.pdf

MJost

MJost
05-01-2015, 12:59 PM
Mark,

Are the SNPs you are counting actual SNPs or are they variants measured against Hg 19, a synthetic Genome constructed from present day donors?

As I reported at the top of each data chart, I pulled all SNP names under each branch that was in a direct lineage down to my DF13 subclade to present, and I used the SNPs listed in

YFull Experimental YTree v3.8

Extracted 4/30/15

I used the R-tree SNPs and correlated each SNP with its position, but I don't think I will do that for the 3443+ SNPs in A00 down to P.


MJost

Michał
05-01-2015, 03:05 PM
Check out this Mendez et al paper's date range as compared to the 553.5 KYA I calculated which is the upper end of the ±2σ confidence interval of A00.

You have apparently misunderstood the comment made by lgmayka. It doesn't really matter whether your estimate falls within the very large margin of error provided by Mendez et al. or not. The major problem is that you have nearly doubled the age of the A00-A0/T split by counting the downstream SNPs in both downstream lineages. The 1701 SNPs at the top of your list are neither ancestral to A0-T nor belong to this particular lineage. Instead, these are SNPs that are a part of a parallel branch A00, so they, optionally, can be compared with the number of SNPs downstream of the A00-A0/T node that were found in your lineage (under the condition that you know the average number of private SNPs for those two A00 samples analysed by YFull), which can be done to either verify or refine your calculations.

However, the major problem with your estimates is that you don't seem to know whether each SNP block on your list corresponds to the same DNA coverage (or a total size of DNA sequenced), so using a "standardised" value of 153.5 years per each SNP upstream of R (or even upstream of DF13) does not seem to be sufficiently supported by any data. This doesn't mean that the numbers you use, or rather the estimates you get, may not (coincidentally) turn out to be close to the "real" values, only that your "standardised" 153.5 number is not supported by any reasonable calculations.

Also, when estimating the age of R* (based on the age of the Malta boy), you seem to have overlooked the fact that this particular sample showed a significant number of SNPs downstream of R, which quite strongly suggests that the R* (or pre-R*) node has actually pre-dated the Malta boy by a couple thousand of years.

MJost
05-02-2015, 12:46 AM
lgmayka and Michal,

I stand corrected. I see now that A00 & A0-T are peer branches. I should NOT have added them together. I did it as an afterthought to the A0-T chart and wasn't thinking, period.

What is the number of SNPs from A0-T to the common ancestor of A?

A00 has 1701 SNPs to present, A0-T to present 1907.

http://haplogroup-a.com/Ancient-Root-AJHG2013.pdf

Reading it and looking at Fig. 1, I suggest that there could be a ratio of the number of SNPs back to A.



As to the dating and/or counting SNPs I checked my notes and have done some other adjustments.

Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans
Raghavan et al
Nov 2013
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4105016/

The Mal’ta (MA-1) bone has a radiocarbon measurement of 20,240±60 (14C age ± SD) and has 35 derived and five ancestral to R SNPs.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4105016/bin/NIHMS583477-supplement-Supp.pdf
See Figure SI 5a An unrooted Neighbor Joining tree of 24 Y chromosomes from the publicly available Complete Genomics dataset1 and MA-1.

The SUPPLEMENTARY INFORMATION part of the paper shows five ancestral back to R for MA-1, with a dated age of 20,240. Now what I should have done is adjust for the 35 derived SNPs at that R* split with more detailed information. That is a total of 40 SNPs back to R. R to DF13 has 139 SNPs plus my 29 to present equals 165 SNPs.

Subtracting 40 SNPs = 125 SNPs equivalent to MA-1 at 20240 cal ybp, or 161.9 years per SNP. Using 161.9 years per SNP times 165 SNPs equals 26713.5.

http://www.cbs.dtu.dk/suppl/malta/data/

So A0-T, which was 292600 ybp, is now 308743 ybp.


In a direct lineage to my present, I don't see why I can't take the beginning and end dates AND count the known total SNPs within that range. Everything is finite. That is not to say, really, that different lineages might not have slightly different years per SNP. A00 and A0-T could have only just a few SNPs to a most recent common ancestor.

MJost

palamede
05-02-2015, 06:07 PM
Mal’ta (MA-1) bone Radiocarbon measurements of 20,240±60 14C age ± SD and has 35 derived and five ancestral to R SNPs.
......
equivalent to MA-1 at 20240 cal ybp

MJost
I have no opinion on your estimation of the numbers of SNPs and date estimations, but I think that 20,240±60 14C age is the non-calibrated date of MA-1 and 24,000BP is the calibrated date.

MJost
05-03-2015, 05:01 PM
I have no opinion from your estimation of the numbers of SNPs and date estimations, but I think that 20,240±60 14C age is the non-calibrated date of MA1-1 and 24,000BP is the calibrated date.

I guess that is correct, as the 14C result has to be adjusted back to an actual calendar date. So I revised the data based on 24K ybp (which I started out using initially).

Again using the paper, it shows five ancestral back to R from R*. MA-1 has 35 derived SNPs that is below the R* split, or 40 SNPs total back to R. R to DF13 has 139 SNPs plus my 29 to present equals 165 SNPs. Subtracting 40 SNPs = 125 SNPs would be equivalent to MA-1 at 24000 cal ybp or 192 years per SNP. Using 192 years per SNP times 165 R SNPs equals 31680 years before present, the age of R.

This changed things quite substantially, pushing back the overall age of R and, of course, A0-T, now at 366144 ybp. I just don't know if the MA-1 35 derived SNPs were correctly quality controlled.
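
The recalibration above can be reproduced with a short sketch. It simply restates the arithmetic in this post (sample age divided by the SNPs remaining after subtracting those attributed to MA-1 below R) and does not address whether that subtraction is the right model. The function name is hypothetical; the SNP counts are the ones quoted in the thread.

```python
def calibrate(sample_age_ybp, snps_in_lineage, snps_below_split):
    """Years per SNP when the dated sample accounts for snps_below_split of the lineage's SNPs."""
    return sample_age_ybp / (snps_in_lineage - snps_below_split)


R_TO_PRESENT = 165     # SNPs from R to present in the poster's lineage, as stated
MA1_BELOW_R = 40       # 5 ancestral + 35 derived SNPs the post attributes to MA-1
A0T_TO_PRESENT = 1907  # SNPs from A0-T to present, from the earlier chart

for label, age in [("uncalibrated 14C age", 20_240), ("calibrated age", 24_000)]:
    rate = calibrate(age, R_TO_PRESENT, MA1_BELOW_R)
    print(f"{label}: {rate:.1f} yrs/SNP, "
          f"R = {rate * R_TO_PRESENT:.0f} ybp, A0-T = {rate * A0T_TO_PRESENT:.0f} ybp")
```

The uncalibrated case comes out slightly different from the figures posted earlier because those were computed from the rounded 161.9 rather than the unrounded rate.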

MJost

Heber
05-30-2015, 06:39 AM
Evaluating the Y chromosomal STR dating in deep-rooting pedigrees

"In this study, we used two deep-rooting pedigrees with full records and reliable dates to directly evaluate the Y chromosomal STR mutation rates and dating methods. We found that the Y chromosomal genealogical mutation rates (OMRB and lmMR) in BATWING method can give the best-fit estimation for historical lineage dating, which could provide a very efficient and reliable way for genealogy and historical anthropology researches."

http://www.investigativegenetics.com/content/6/1/8/abstract

alan
05-30-2015, 03:13 PM
It's an oddity that, for the cultural material Mal'ta was found with, his radiocarbon date of around 22000BC is 2000 years younger than any other date for that culture, which otherwise all fall into the period c. 30000-24000BC. So, if anything is amiss with the Mal'ta radiocarbon date, it's likely that it's younger than reality. The 22000BC date is also within the LGM, which is a little odd, as the LGM appears to have otherwise chased Baikal hunting groups south some 2000 years earlier. Perhaps there was a slightly warmer year or so within the LGM, but I would love to see the crucial Mal'ta bones RC dated again, with isotope analysis etc., to see if the RC date is sound and not subject to some distorting effect. I would not be surprised if the current RC date proved to be 2000 years younger than reality.

parasar
05-30-2015, 03:42 PM
It's an oddity that, for the cultural material Mal'ta was found with, his radiocarbon date of around 22000BC is 2000 years younger than any other date for that culture, which otherwise all fall into the period c. 30000-24000BC. So, if anything is amiss with the Mal'ta radiocarbon date, it's likely that it's younger than reality. The 22000BC date is also within the LGM, which is a little odd, as the LGM appears to have otherwise chased Baikal hunting groups south some 2000 years earlier. Perhaps there was a slightly warmer year or so within the LGM, but I would love to see the crucial Mal'ta bones RC dated again, with isotope analysis etc., to see if the RC date is sound and not subject to some distorting effect. I would not be surprised if the current RC date proved to be 2000 years younger than reality.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4105016/#SD1

Recent geoarchaeological investigations have shown site stratigraphy to be more complex than originally thought [3]. The profile is ~2.5 m deep with 10 lithostratigraphic units grouped into two sections. Ten cultural layers were found in the upper section ...
Additionally, Richards and colleagues [4] presented a direct date of 19,880 ± 160 (OxA-7129) 14C BP on one of the children from the double burial.

...
Similar to Mal’ta, several MUP sites from the Don River valley on the Russian Plain to the Dordogne River valley in France have carved ivory “Venus” figurines (Figure 1a). Though it has no female figurine forms, the earlier Yana RHS site (ca. 28,000 14C BP) in western Beringia also contains an elaborate ivory-carved art [18] (Figure 1a).

MJost
05-30-2015, 10:17 PM
I recently extracted the entire list of SNPs from YFull's experimental tree v3.8 from A0-T to R. I then downloaded the Ybrowse database and compared the YFull SNP names against the Ybrowse list of SNPs to identify their positions. Twelve YFull SNPs were excluded as indels or as having no identified position within A0-T>A1>A1b>BT>CF>CT>F>GHIJK>HIJK>IJK>K>K(xLT)>MP-1205>P>R>FGC5494.

I then modified Dave-V's YFull-based SNP Block Age Estimator spreadsheet to run each of the 1933 positions through the 875 CombBED regions searching for a match, pull the beginning and ending position of the matched region, calculate its specific length, and post it to the spreadsheet along with a running total. The spreadsheet counted 1326 of the 1933 positions as matching the CombBED regions. Many SNPs fell in the same CombBED region, so I removed those duplicates and recalculated the "True CombBED" length.
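The region lookup described above can be sketched roughly as follows. This is only an illustration, not the spreadsheet macro itself: the region list and SNP positions are made-up placeholders, and treating both ends of a region as inclusive is an assumption (real BED files use 0-based, half-open coordinates).

```python
from bisect import bisect_right

# Hypothetical callable regions as (start, end), sorted and non-overlapping.
regions = [(100, 200), (500, 900), (1500, 1600)]
# Hypothetical SNP positions to look up.
snp_positions = [150, 160, 700, 1200, 1550]

starts = [start for start, _ in regions]

def find_region(position):
    """Return the (start, end) region containing position, or None."""
    i = bisect_right(starts, position) - 1
    if i >= 0 and regions[i][0] <= position <= regions[i][1]:
        return regions[i]
    return None

matches = {pos: find_region(pos) for pos in snp_positions}
matched = [pos for pos, region in matches.items() if region is not None]

# Count each matched region once, even if several SNPs fall inside it.
unique_regions = {region for region in matches.values() if region is not None}
deduplicated_length = sum(end - start + 1 for start, end in unique_regions)

print(len(matched), "of", len(snp_positions), "positions matched")
print("deduplicated length:", deduplicated_length)
```

Collecting the matched regions in a set is what removes the duplicates: two SNPs in the same region add that region's length only once.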

The Adamov et al. paper suggests a "Rate Constant of SNP Mutations" of 8.2E-10. With the "True CombBED (x dups)" length of 7,020,846 bp, that produces a "True CombBED" based rate of 173.7 years per SNP.

Using the same rate constant of 8.2E-10 with the full-genome length of 8,473,821 bp results in 143.92 years per SNP. Calculating the total age of A0-T from the 1933 SNPs plus the 12 excluded SNPs, 1945 in all, produces 279,915 ybp.

The spreadsheet with all the A0-T to R SNPs, their positions and their respective regions is here:

https://drive.google.com/file/d/0By9Y3jb2fORNUndIQnZDVkVhSTQ/view?usp=sharing

I have posted the modified, macro-enabled version of Dave-V's YFull-based SNP Block Age Estimator spreadsheet, loaded with the A0-T to R SNPs and including my DF13>FGC5494 HQ Full Y Genome SNPs:

https://docs.google.com/spreadsheets/d/16iTK4NVlwEo7bvjb0XIyNwzNaQ_xvdcvBV1tFrzG70g/edit?usp=sharing

Mark

George Chandler
05-30-2015, 10:28 PM
Mark,

When you say "Using the "Rate Constant of SNP Mutations" of 8.2E-10 and Full Genome length of total of 8,473,821 results in 143.92 years per SNP", are you aging all SNPs with an average rate of 143.92 years? Can you elaborate?

George

MJost
05-30-2015, 10:40 PM
Would 18,125 ybp (16,175 BC) be too early for Mal'ta? That is what makes the timeline work for me with 143 years per SNP.

Nix that 18K year figure.

I incorrectly used my 26 Sanger SNPs instead of my 71 Full Y HQ SNPs, and that was throwing it off. The total number of 'R' SNPs, with my 71 under DF13>FGC5494, is 210. MA-1, with 5 ancestral SNPs back to R and 35 derived SNPs, is equivalent to 40 SNPs below R. Subtracting 40 leaves 170 derived SNPs below the node parallel to MA-1.

Using 143.92 years per SNP, MA-1 is ~24,600 years before present, or 22,650 BC. The paper gives MA-1 as 24,423-23,891 cal BP (2 sigma), calibrated with OxCal 4.2 using IntCal09.

MJost

MJost
05-31-2015, 03:27 AM
Yes I used that average rate for all the A0-T to R branch SNPs.

The process is to multiply the 'rate constant' by the 'length', which gives mutations per year; the reciprocal of that is the years per mutation. 'Years per mutation' times the number of SNPs equals an age. The length I used was the full-genome Y length.

1 / (0.00000000082 * 8473821) = 143.92
143.92 (unrounded 143.9153) * 1945 = 279,915 years before present
with:
95% CI High = 327900.65 ybp
95% CI Low = 244181.34 ybp
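As a quick check on the arithmetic, here is a minimal sketch of the calculation, assuming only the rate constant, lengths and SNP count quoted in this thread. The function names are made up for illustration, and the 95% CI figures are not reproduced because the thread does not say how they were derived.

```python
RATE_CONSTANT = 8.2e-10        # mutations per bp per year (figure quoted above)
FULL_LENGTH_BP = 8_473_821     # full-genome Y length quoted above
COMBBED_LENGTH_BP = 7_020_846  # "True CombBED (x dups)" length quoted above

def years_per_snp(rate, length_bp):
    """Years between SNPs: the reciprocal of expected mutations per year."""
    return 1.0 / (rate * length_bp)

def age_ybp(snp_count, rate, length_bp):
    """Age of a branch carrying snp_count SNPs, in years before present."""
    return snp_count * years_per_snp(rate, length_bp)

full = years_per_snp(RATE_CONSTANT, FULL_LENGTH_BP)
combbed = years_per_snp(RATE_CONSTANT, COMBBED_LENGTH_BP)
a0t_age = age_ybp(1945, RATE_CONSTANT, FULL_LENGTH_BP)

print(round(full, 2))     # about 143.92
print(round(combbed, 1))  # about 173.7
print(round(a0t_age))     # about 279915
```

Note that the product of rate and length is a number of mutations per year (about 0.0069), so it has to be inverted to get years per SNP.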


MJost

MJost
05-31-2015, 03:41 AM
Fixed sharing issue.

https://docs.google.com/spreadsheets/d/16iTK4NVlwEo7bvjb0XIyNwzNaQ_xvdcvBV1tFrzG70g/edit?usp=sharing

MJost

alan
05-31-2015, 09:02 PM
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4105016/#SD1

Having read over the details of the Mal'ta RC sample, it seems a good one. The level of collagen is not much short of modern, so it is a good-quality sample, much better than AG2. It does of course remain the youngest of all the dates for the MUP culture of south Siberia, but that doesn't mean it's a bad date. It's just rather odd that we seem to have caught and DNA tested one of the very last people of this culture, which was probably down to very small numbers by then, given it's the only date in a 2000-year span and no others come after it. That is not surprising given that the LGM was well underway by then.

I have pointed this out before, but by that date the options to 'escape' the Baikal area were not great, with cold desert areas to the south and east and impossible conditions to the north. Perhaps that is why this particular line did not survive. Lake Baikal was at the extreme edge of where the cold deserts to the north, east and south met the bitterly cold eastern end of the steppe-tundra. Mal'ta was in Cis-Baikal, west of the lake, bordering the eastern terminus of the steppe-tundra belt. The rest of the lake was in cold desert and uninhabitable.

However, what the LGM flora zones imply is that by 22000BC it was too late to escape south, east or north. Apart from the steppe-tundra belt, the east end of which was incredibly cold, anyone still hanging around the west shore of Lake Baikal as late as 22000BC had missed the option of escaping south, which existed 2000 years earlier. There is very little evidence of cultural contact between Siberia and the European parts of the steppe-tundra belt, or of migration west during the LGM, so this appears to have been either not feasible or not availed of. So, that Mal'ta's particular R line went extinct is unsurprising. I strongly suspect the surviving R lines either had never strayed to Baikal and lived nearer Altai, or they retreated south toward Altai 2000 years before Mal'ta boy lived, when the option was still there.

George Chandler
06-01-2015, 02:32 AM
Did you take the Full Genome test? I'm wondering what percentage of the new Y SNPs discovered from it are problematic. The average time between SNPs is similar to what I get, but only after YSEQ has worked their magic and removed the unreliable ones.

George

MJost
06-01-2015, 10:02 PM
Yes, I did take the Full Genome ChrY test. I had my 71 private HQ SNPs evaluated, and only 27 of them could be positively Sanger sequenced. The YFull tree v3.8 takes into account all the very stable high-quality NGS SNPs from Full Genome and BigY results, so I added my 71 NGS SNPs when dating the tree. The resulting years per SNP works great for dating MA-1: ~24.6 kybp to present at 143 years per SNP (the reciprocal of length times rate).

My 71 compared to the CombBED regions show only 40 that match. I have only 27 of the 71 that actually can be confirmed "Clean" by Sanger.

If I run the Phylogenetic Block Age Estimator with the number of my SNPs in CombBED regions (40 of 71), the age of that block changes to:

Calculated Age of Phylogenetic Block ybp: 5,757 = 3,807 BC
95% CI High = 6743.46 ybp = 4,793 BC
95% CI Low = 5021.72 ybp = 3,072 BC

Plugging in my "true clean" 27 SNPs:
Calculated Age of Phylogenetic Block ybp: 3861 = 1,911 BC.

Used this way, 143 years per SNP doesn't fit the known MA-1 age of 24.0 kybp.


Now if I take the entire A0-T to present 1933 SNPs and use only the 1346 CombBED-qualified SNPs, that equates to 173.7 years per SNP using the tested "true CombBED (xDups)" length of 7,020,846 bp, which pushes "R" to 22,251 ybp or 20,301 BC. I also don't know which of MA-1's five ancestral and 35 derived SNPs fall in the CombBED regions and which to exclude in order to find his equivalent node under R. But MA-1 has to be 24,000 ybp, and this is nowhere near that, since R itself dates to only 22,251 ybp.

So, back to my own theory, which is that something is incorrect in the factors that produce the years-per-SNP amount. If I use only my 27 Sanger SNPs, include all SNPs from DF13 back to A0-T (1889 SNPs in total), and place MA-1 at 24,000 ybp at 40 SNPs below "R", it takes 189 ± 94.5 years per SNP to line up neatly to the present (the last of my two private SNPs having arisen at or just before my birth).

To achieve 189 years per SNP, either the length of 8,473,821 bp or the defined mutation rate has to change. Let's assume it is the latter: the rate constant would have to be 0.000000000624, or 6.24E-10 instead of 8.2E-10, to align the tree with MA-1's known calibrated age.

Using the 6.24E-10 rate, the age of A0-T is 357,247 ybp or 355,297 BC.
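For anyone who wants to retrace that back-calculation, here is a small sketch. It assumes the length is held fixed at the full-genome figure and that only the rate constant is adjusted; the function name is hypothetical.

```python
FULL_LENGTH_BP = 8_473_821    # full-genome Y length used in this thread
TARGET_YEARS_PER_SNP = 189    # spacing needed to put MA-1 near 24,000 ybp
TOTAL_SNPS = 1889             # A0-T to present, counting the 27 Sanger SNPs

def implied_rate(years_per_snp, length_bp):
    """Rate constant (per bp per year) that yields the given SNP spacing."""
    return 1.0 / (years_per_snp * length_bp)

rate = implied_rate(TARGET_YEARS_PER_SNP, FULL_LENGTH_BP)
rate_label = f"{rate:.3g}"
print(rate_label)  # about 6.24e-10

# Age of A0-T when the rounded rate 6.24E-10 is plugged back in.
rounded_rate = 6.24e-10
a0t_age = TOTAL_SNPS / (rounded_rate * FULL_LENGTH_BP)
print(round(a0t_age))  # about 357247
```

Holding the rate at 8.2E-10 and changing the length instead would be the other way to reach 189 years per SNP; the same formula applies with the two roles swapped.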

Using 189 years per SNP, my 27 Sanger SNPs put the start of my DF13>FGC5494 subclade at around 4915 ybp or 2,965 BC. This first tier below the DF13 block is the founder's SNP age.

DF13 (block of two SNPs) starts at 5293 ybp = 3,343 BC
L21 (block of five SNPs) starts at 6427 ybp = 4,477 BC
P312 (block of two SNPs) starts at 6805 ybp = 4,855 BC
R1b1a2a1a-L11/S127/PF6539 block starts at 8317 ybp = 6,367 BC
R1b1a2a1-L51/M412/S167/PF6536 (R-Z2103 branched) starts at 9073 ybp = 7,123 BC
R1b1a2a-L23/S141/PF6534 block starts at 9451 ybp = 7,501 BC
R1b1a2-M269/PF6517 (M73, M478 branch split) block starts at 16633 ybp = 14,683 BC
R1b-M343/PF6242 (R1a split) starts at 17956 ybp = 16,006 BC
MA-1 node is at 24004 ybp, 40 SNPs below "R"
R-root is at 31191 ybp
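The BC figures in this list all follow from subtracting 1950 from the ybp age. A tiny sketch of that conversion is below; the 1950 reference year is inferred from the numbers rather than stated anywhere in the thread, the helper name is made up, and the simple subtraction ignores the missing year zero.

```python
PRESENT_YEAR = 1950  # reference year implied by the figures in this list

def ybp_to_bc(age_ybp, present=PRESENT_YEAR):
    """Convert years before present to a BC year by simple subtraction."""
    return age_ybp - present

# A few block ages taken from the list above.
blocks = {"DF13": 5293, "L21": 6427, "P312": 6805, "M269": 16633}
for name, age in blocks.items():
    print(f"{name}: {age} ybp = {ybp_to_bc(age):,} BC")
```

Ages younger than 1950 ybp would come out negative here, meaning an AD date.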

HG         Beg. Age (ybp)
A0-T       357044
A1         246787
A1b        236575
BT         227686
CF         152606
CT         151849
F           93222
GHIJK       62774
HIJK        62017
IJK         61828
K           60882
K(xLT)      57667
MP-1205     57478
P           56911
R           31191


As we now understand, SNP ages are significantly older than STR MRCA ages. The STR ages reflect the coalescence of the haplotypes, which may best show the time when an expansion occurred.

These recent "R" ages may make some people happy to hear.

MJost

George Chandler
06-02-2015, 01:57 AM
I've found there are some lines which have had a higher percentage of SNPs removed because they are from problematic areas, yet the number of raw SNPs is consistent with other tests from the same haplogroup. Sometimes the average number of years between the reliable SNPs is much higher because so many were removed from problematic areas. One of my S1051 members had 19 (18 high quality, if I recall) new SNPs turn up from his testing, but only seven were reliable enough for primers. I'm totally trusting YSEQ because every time they go through them the pieces seem to fit together. I would still rather see and record the 19 new SNPs even if they aren't posted. I'm just not sure we can use 143 years for all of the new SNPs turning up. I'm still finding that 139 years holds as the approximate average number of years between Sanger-validated SNPs.

I'm one of those who is happy about the new age estimates. I still hold that DF13 is a bit younger than that, only 4,500-4,700 years old, but I could be wrong.

George