
YFull's age estimation



thetick
08-29-2015, 08:05 PM
Hi,

I see YFull uses the following formula to estimate the SNP age for each individual: (# of corrected Novel SNPs) * 144.41 + 60.
The "corrected" SNP count factors in the combBED region.
They then average the results across all the individuals in the group to get the group age estimate.
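
To make the arithmetic concrete, here is a minimal Python sketch of the calculation as I understand it from the description above. The function names and the sample counts are made up for illustration, and the corrected Novel SNP count is taken as an input because the correction step itself is not spelled out here.

```python
from statistics import mean

# Constants taken from the formula quoted above.
YEARS_PER_SNP = 144.41
OFFSET_YEARS = 60


def individual_age(corrected_novel_snps: float) -> float:
    """Age estimate for one individual from a corrected Novel SNP count."""
    return corrected_novel_snps * YEARS_PER_SNP + OFFSET_YEARS


def group_age(corrected_counts) -> float:
    """Group estimate: the plain average of the per-individual ages."""
    return mean(individual_age(n) for n in corrected_counts)


if __name__ == "__main__":
    # Hypothetical corrected counts for three individuals in one subclade.
    counts = [10, 12, 14]
    for n in counts:
        print(n, round(individual_age(n), 2))
    print("group:", round(group_age(counts), 2))
```

Because the formula is linear, averaging the per-individual ages gives the same result as applying the formula to the average corrected count, so the spread between individuals only shows up if you look at the individual values.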

So how accurate do you think this is? I would imagine it's pretty accurate once there are as many individuals as Novel SNPs. Probably not so good with only a few individuals with 15+ Novel SNPs.

As many of you know, YFull published a paper, "Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data":
http://rjgg.molgen.org/index.php/RJGGRE/article/view/151/175 -- google translate will translate to English.

haleaton
08-29-2015, 09:32 PM

The pdf linked is in English, btw. Here are some additional notes they have:

FAQ
Q: What are the criteria for accepting/disqualifying a Known SNP or a Novel SNP for estimation of TMRCA?

A: The following five criteria are described in an article Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data by Adamov, Guryanov, Korzhavin, Tagankin, Urasin (2015):

1. “Region” criterion. There are derived variants (i.e. alleles different from the reference sequence) revealed in the BAM files. The nucleotide sequences under investigation had a total length between 13-15 Mbp for BigY, and about 23 Mbp for FGC. Single base read coverage varied from 1X to 8000X. The average coverage of commercial samples is about 60X. From this set of variants, we selected only those coordinates that fell into the combBED regions. As it was mentioned above, the combBED area was designed by the authors to select X-degenerate segments. The combBED area borders were formed by mutual overlapping BED file taken from the work of Poznik et al. (2013) (total length of 10.45 Mbp) and by the generalized BigY BED file (11.38 Mbp long), published in the BigY White Paper (2014). The result was 857 continuous segments of the Y-chromosome with a total length of 8,473,821 bp. The coordinates of the beginning and the end of these regions are contained in Table 1 of Supplement.
2. “Not indel” criterion. We excluded insertions and deletions (indels), as well as multiple nucleotide polymorphism (more than one base position in derived alleles, MNP) variants.
3. “Locations” criterion. We excluded variants which were detected in more than five different localizations. (Note: “localization” is defined as a group of samples from the YFull database [2,900 samples at February, 2015] belonging to the same subclade and having derived allele nomination that have been studied). In some cases, the same derived variants were revealed in samples from different subclades or haplogroups. One of the reasons consists of the fact that standard reference sequence is based on haplogroup R1b data and also to a lesser extent on haplogroup G data. Thus, some variants in some haplogroups are ancestral allele, not derived. Another reason is mapping errors. We found limit of five localization empirically. This criterion is soft but effective.
4. “Reads” criterion. We excluded from consideration any one or two read variants.
5. “Qual” criterion. We excluded variants with a read quality less than 90%. Quality is defined as weighted average of the quality index where correct values are taken with the positive and error values, with the negative.
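
For anyone who wants to see how the five criteria combine, here is a rough Python sketch of them as a single filter. It is not YFull's code: the `Variant` fields, the function names, and the inclusive start/end convention for the regions are all assumptions made for illustration, and the quality value is assumed to be already computed as a fraction between 0 and 1.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record layout, not a real YFull or VCF structure.
    pos: int            # Y-chromosome coordinate
    ref: str            # reference allele
    alt: str            # derived allele
    reads: int          # number of reads showing the derived allele
    quality: float      # read quality as a fraction, 0.0 to 1.0
    localizations: int  # number of localizations where the variant was seen


def in_regions(pos, regions):
    """True if pos falls inside one of the regions.

    regions is a sorted list of non-overlapping (start, end) pairs,
    treated here as inclusive on both ends (an assumption).
    """
    starts = [start for start, _ in regions]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and regions[i][0] <= pos <= regions[i][1]


def passes_criteria(v, regions):
    """Apply the five criteria from the FAQ, in order."""
    return (
        in_regions(v.pos, regions)                 # 1. "Region"
        and len(v.ref) == 1 and len(v.alt) == 1    # 2. "Not indel" (no indels or MNPs)
        and v.localizations <= 5                   # 3. "Locations"
        and v.reads >= 3                           # 4. "Reads" (drop 1- and 2-read variants)
        and v.quality >= 0.90                      # 5. "Qual"
    )


if __name__ == "__main__":
    # Made-up regions and variants, just to exercise the filter.
    regions = [(100, 200), (300, 400)]
    variants = [
        Variant(150, "A", "G", 10, 0.95, 1),   # passes everything
        Variant(250, "A", "G", 10, 0.95, 1),   # outside the regions
        Variant(350, "A", "GT", 10, 0.95, 1),  # insertion
        Variant(350, "A", "G", 2, 0.95, 1),    # only two reads
    ]
    for v in variants:
        print(v.pos, v.alt, passes_criteria(v, regions))
```

Only variants that survive all five checks would then be counted toward the SNP total used in the age estimate.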

thetick
08-29-2015, 09:47 PM
Thanks. Yeah, YFull shows the FAQ when you click the "more" link for the haplogroup age.

I was just curious what people here think about the accuracy.