
Thread: A closer look at the I1 samples from Batini et al, Nat. Commun. 2015

  1. #11
    Gold Class Member
    Posts
    1,536
    Sex
    Location
    Florida, USA.
    Ethnicity
    English, Scottish & Irish
    Nationality
    American
    Y-DNA
    I-A13243
    mtDNA
    H1e2

    England Scotland Ireland Prussia Italy Two Sicilies United States of America
    Quote Originally Posted by deadly77 View Post
    I realize that’s a bit of a read, but wanted to give the rationale behind where I grouped these samples in this dataset. Much easier to summarize into a table:
    Attachment 31529
    Or for those who prefer a more visual overview, I annotated Supplementary Material Figure 1 from the Hallast et al 2014 paper with branches based on derived SNPs that I could find.
    Attachment 31530
    This doesn’t necessarily mean the absolute position on the tree for the samples in this dataset as a lot of the SNPs weren’t covered in the 8 targeted x-degenerate regions of Y chromosome listed in Supplementary Data 2 of the Batini et al paper and in Figure 1 of the main manuscript in the Hallast et al 2014 paper. For some of the branches I was able to use phyloequivalent SNPs and at least I was able to break down the I1 samples in this dataset into known subclades.
    As always, very nicely done!

    I remember when that paper came out, taking notice of their younger estimate. Although, if you average them all together, you’re still in the same ballpark, 4100 ybp.

    One thing I’m curious to know is how all of the new upstream SNPs being added by Y700 are going to affect our timeline. I would imagine that YFull knows about some of them from their Y-Elite & WGS submissions, but I guess that would depend on where those testers are on the tree. I know everyone in my area of the tree is Big Y500. Then again, I’ve upgraded, so that may give them all they need for our branch. From what I’ve heard, we’re looking at a 40% increase in coverage once those new tests start filtering in.

    What are your thoughts?
    Last edited by JMcB; 07-09-2019 at 04:47 PM.
    Known Paper Trail: 45.3% English, 29.7% Scottish, 12.5% Irish, 6.25% German & 6.25% Italian. Or: 87.5% British Isles, 6.25% German & 6.25% Italian.
    LivingDNA: 88.1% British Isles (59.7% English, 27% Scottish & 1.3% Irish), 5.9% Europe South (Aegian 3.4%, Tuscany 1.3%, Sardinia 1.1%), 4.4% Europe NW (Scandinavia) & 1.6% Europe East, (Mordovia).
    FT Big Y: I1-Z140 branch I-F2642 >Y1966 >Y3649 >A13241 >Y3647 >A13248 (circa 830 AD) >A13242/YSEQ (circa 975 AD) >A13243/YSEQ (circa 1550 AD).

  2. The Following 2 Users Say Thank You to JMcB For This Useful Post:

     deadly77 (07-09-2019),  JonikW (07-09-2019)

  3. #12
    Registered Users
    Posts
    511
    Sex
    Location
    United Kingdom
    Ethnicity
    European
    Nationality
    British
    Y-DNA
    I-L338
    mtDNA
    J1c1

    United Kingdom England England North of England Norfolk Scotland Ireland
    Cheers - yes as you say a lot of those numbers are in the same ballpark - and a lot of them fall within the confidence intervals of each other. Perhaps it's a case of which educated guess is preferred.

    To be honest, I'm not sure that the new SNPs found in Y700 tests are going to affect the YFull age estimates all that much in the near future. I think the upstream placement of "novel" SNPs is more of a headache for FTDNA's haplotree than the YFull tree. As you say, YFull has seen a fair bit of YElite and WGS tests from customers as well as Y sequences from academic studies (ancient and modern). I don't know the numbers or breakdown, but I think it's fair to say that YFull has seen a lot of higher-coverage samples that FTDNA hasn't.

    The majority of these new Big Y Y700 SNPs being discovered are outside of the combBED regions - while the YFull tree is made up from SNPs from both the combBED region and outside of the combBED region, they only use the ones within the combBED region for age estimates (plus some other filtering). Big Y Y500 was targeted for the combBED regions so coverage is reasonably good for those regions.

    YFull also already accounts for differences in coverage in their calculation. Take your own kit for example - YFull filtered your novel SNPs downstream of your branch on the YFull tree down to 7 that are suitable for their age estimation. After that they look at how much of the combBED region (8467165 base pairs) your kit covers, which comes to 7579639 base pairs. Then when they do their age estimation they adjust your 7 novel SNPs to 7.82 to account for the missing coverage in your Big Y. Compare that to my WGS, which has 8475116 base pair coverage of the 8467165 base pair combBED region and only adjusts my 17 novel SNPs to 16.98 (negligible difference). I can't see it being much of a change unless the estimated number of missing SNPs accounted for in the adjustment is way off, and I don't expect it will be. Maybe some cleanup of ambiguous or low read SNPs, but in the grand scheme of things I don't expect a lot of changes to YFull age estimation.
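
    The coverage adjustment described above can be sketched in a few lines. This is only an illustration: it assumes the adjustment is a simple proportional scaling (raw SNP count multiplied by the combBED length and divided by the covered length). That assumption reproduces the 7.82 and 16.98 figures quoted in the post, but it is inferred from those numbers rather than taken from YFull's own documentation, and the function name `adjust_snp_count` is hypothetical.

```python
# Length of the combBED region in base pairs, as quoted in the post.
COMBBED_LENGTH_BP = 8_467_165


def adjust_snp_count(novel_snps: int, covered_bp: int,
                     region_bp: int = COMBBED_LENGTH_BP) -> float:
    """Scale a raw SNP count up (or down) to a full-coverage equivalent.

    Assumption: plain proportional scaling, inferred from the post's figures.
    """
    if covered_bp <= 0:
        raise ValueError("covered_bp must be positive")
    return novel_snps * region_bp / covered_bp


if __name__ == "__main__":
    # Big Y kit from the post: 7 usable novel SNPs, 7579639 bp covered.
    print(round(adjust_snp_count(7, 7_579_639), 2))
    # WGS kit from the post: 17 usable novel SNPs, 8475116 bp covered.
    print(round(adjust_snp_count(17, 8_475_116), 2))
```

    Under this assumption a kit with lower coverage gets a larger upward correction, which is why the Big Y count moves noticeably while the WGS count barely changes.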

    The newer SNPs may result in some rebranching which then has a little bit of adjustment, but that doesn't really change the calculation, more where it's pointing.

    The bigger changes would come if YFull decides to start using a different age estimate calculation, incorporating SNPs from regions outside of the combBED region for age estimation. One of the reasons that YFull likely decided on using only the combBED regions for age estimation is that they anticipated the majority of their customers sending in BAM files from FTDNA Big Y tests. Now that the Big Y has moved from Y500 to Y700, YFull may decide they have good enough coverage of regions outside of the combBED region to justify that. Then again, they may decide not to do that in order to keep compatibility with the existing Y500 kits on the YFull tree that don't move to Y700. It might be a numbers game (and I have no inside information - purely speculating). But I do think they designed their age estimate with the earlier Big Y in mind.

    As I said above, I think it's more of a headache for FTDNA than YFull. If I look at the FTDNA public haplotree, it looks like they have 75 variants listed as phyloequivalent to branch I-M253 while YFull has 312. Same with I2 (FTDNA 37, YFull 68) and branch I (FTDNA 82, YFull 199). I'm not sure if FTDNA is aware of the missing ones and just doesn't bother including them on the public haplotree as they won't affect any of their customers downstream, or if all of these need to be added as upstream novel variants and accounted for in the correct position in FTDNA's tree.

  4. The Following 2 Users Say Thank You to deadly77 For This Useful Post:

     JMcB (07-09-2019),  JonikW (07-09-2019)

  5. #13
    Thank you, that was a nice explanation! So basically, YFull is ahead of the curve and the changes are going to be marginal. Which is good to know!

  6. The Following 3 Users Say Thank You to JMcB For This Useful Post:

     deadly77 (07-09-2019),  JonikW (07-09-2019),  spruithean (07-09-2019)

  7. #14
    Well, I could be very wrong about a lot of that, so we'll see. I think it's important to remember that these phylogenetic trees are a bit more fluid than we might be considering, and some SNPs can be added to a branch or haplogroup and then removed if it later seems like not such a good idea. This is a little easier to follow on the YFull tree because you can look up past versions of the tree (archive button on the top right of the webpage). Here's the earliest version of the YFull tree on their website for I1: https://www.yfull.com/arch-3.07/tree/I1/ - notice that the I1 block is defined by 341 SNPs rather than the 312 of today's tree, so clearly some of them must have been moved to a different branch or removed altogether. It's not as easy to retroactively check FTDNA's tree as there doesn't seem to be a way to check archive versions.

    You can see the coverage of the upstream SNPs in your own Big Y test if you click on YReport on homepage. Looking at my FGC YElite (which is what the Big Y Y700 is going to resemble) it had a few no calls (grey) and ambiguous (yellow) calls. Of the I1 subclades, it had no call for 4 SNPs (including two SNPs phyloequivalent for DF29) but correct for 27 - not bad. Going up to the I1 block, no call for 37 SNPs, ambiguous for 3 out of the I1 312 SNPs - again not too bad. For the I block, no call for 25 SNPs, ambiguous for 2 of the 199 SNPs. Can do the same with the IJ, IJK, HIJK branches and so on. On my WGS, every SNP is read - there are zero no calls for these SNPs. Have some ambiguous calls - 2 in the I1 branch, 3 in the I branch, one in the IJ branch.
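
    The no-call and ambiguous counts above can be turned into rough clean-call percentages. This is a small sketch that assumes the no-call and ambiguous counts for each block are both out of the block totals given in the post (312 for I1, 199 for I); the function name `clean_call_rate` is made up for illustration.

```python
def clean_call_rate(total_snps: int, no_calls: int, ambiguous: int) -> float:
    """Percentage of a block's SNPs that received an unambiguous call."""
    if total_snps <= 0:
        raise ValueError("total_snps must be positive")
    clean = total_snps - no_calls - ambiguous
    return 100.0 * clean / total_snps


# (total SNPs, no calls, ambiguous calls) per block, as quoted in the post
# for the FGC YElite test.
yelite_blocks = {
    "I1": (312, 37, 3),
    "I": (199, 25, 2),
}

if __name__ == "__main__":
    for name, (total, no_calls, ambiguous) in yelite_blocks.items():
        rate = clean_call_rate(total, no_calls, ambiguous)
        print(f"{name} block: {rate:.1f}% cleanly called")
```

    On those figures both blocks come out in the high 80s percent, which fits the "not too bad" description.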

    Also at YFull, I can see where there are SNPs known and associated with a branch, but they're not on the tree. A lot of these have a one star rating. Some of these are associated with branches closest to me on the tree, and some are a bit further up. Take the SNP Z2741 for example - YFull is listing it as level I1, I have 27A reads (derived), but it only has a one star rating. Its position is 11066130, so perhaps a bit too close to the centromere, although perhaps observed in a lot of I1 folks. Then there's Y1948 - it's at 2839534 so in the combBED region, associated with level I1 but not on the YFull tree, and YFull gives it one star. The mutation is described as GAAAAAAAAAAT to GAAAAAAAAATT so maybe that's an MNP (multiple nucleotide polymorphism) rather than a SNP. Some of the other one star ones at this I1 level are perhaps not included because they're homologous with X or autosomes. I don't really know - I'm trying to think about this logically but appreciate that I'm often flailing around a bit blindly. But that's kind of fun though.

    I guess the upstream stuff doesn't always get a lot of attention outside of the academics - most people focus on their terminal as that's where their closest connections, matches and mismatches are going to have the most interaction.

  8. The Following 2 Users Say Thank You to deadly77 For This Useful Post:

     JMcB (07-10-2019),  JonikW (07-09-2019)

  9. #15
    Yes, time will tell and it will be interesting to see what happens, but I would say your assessment sounds quite reasonable to me. As I was reading the part about YFull’s adjustments, I remembered hearing that Big Y500 covered most, but not all, of the combBED region. Perhaps in the 85 to 90% range (don’t quote me on that, as I’m going from memory), so their adjustments make sense. Coincidentally, I’ve often noticed their adjustments - without considering why they were doing them - because I like to use their SNP counts to try out different mutation rates. Just to see how they affect the branches around me. I’m beginning to think I may have too much time on my hands ;-)

    Do you have any thoughts on the various mutation rates? I know that Williamson, Kane & Vance all use 131.6 years for Big Y500 and I believe McDonald uses 160 years. On the Facebook pages Kane & Vance are saying the rate for Y700 is going to be in the 82 year range. Many people seem to believe that McDonald’s calculations are the most accurate. On the other hand, in one of his own posts on the subject, he said he was glad to see that his results and YFull’s were fairly close, even though they used different methods. Plus, no matter what method you use, the variances leave a lot of room for leeway.
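
    As a rough illustration of how the years-per-SNP figures above relate to each other, here is a small sketch. It assumes the simplest possible model, age = SNP count x years per SNP, and ignores confidence intervals and any offsets that real calculators apply; the function names are hypothetical.

```python
def age_years(snp_count: float, years_per_snp: float) -> float:
    """Simplest model: branch age is SNP count times years per SNP."""
    return snp_count * years_per_snp


def implied_snp_yield_ratio(old_years_per_snp: float,
                            new_years_per_snp: float) -> float:
    """How many times more SNPs the new test must find, on average,
    for both rates to give the same age for the same branch."""
    return old_years_per_snp / new_years_per_snp


if __name__ == "__main__":
    # Rates quoted in the post: 131.6 years (Big Y500), about 82 years (Y700).
    ratio = implied_snp_yield_ratio(131.6, 82.0)
    print(f"Y700 would need about {ratio:.2f}x the SNPs of Y500")
    # Same branch, same age: 10 SNPs at the old rate versus the
    # scaled-up count at the new rate.
    print(age_years(10, 131.6), age_years(10 * ratio, 82.0))
```

    In other words, a lower years-per-SNP figure does not by itself make branches younger; it goes hand in hand with the larger number of SNPs the newer test is expected to find.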


    P.S. I agree about YFull’s archiving system. It’s definitely a nice feature to have. I was using it recently to see how much fluctuation there had been in the TMRCA of my main branch. Which was confirmed back in 2017.

    7F742BF7-E94E-4099-868C-DBB371390879.jpeg

    8A377740-07AC-4774-9F2A-6EE514A62ADB.jpeg

    I’m actually waiting for a new result to fully process, which I suspect is going to send it back to 1900 ybp.
    Last edited by JMcB; 07-10-2019 at 04:38 AM.

  10. The Following User Says Thank You to JMcB For This Useful Post:

     JonikW (07-10-2019)

  11. #16
    Yes, I also recall hearing the 85-90% figure for Big Y500 coverage of the combBED region, so I don't think you're wrong there. James Kane's website has a lot of good statistics on the coverage of some of the different tests from a Y perspective. So perhaps a maximum of 10-15% of a change in YFull's age estimate going from Big Y500 to a WGS, a little bit less for YElite, and in reality much less than that due to YFull adjusting the number of SNPs going into the age estimate calculation based on coverage of the combBED region. Of course that all changes if YFull moves to a different methodology for their age estimate.

    One paper I'd be interested to read regarding mutation rates is "The Y-chromosome point mutation rate in humans" by Helgason et al. (2015) in Nature Genetics (https://www.nature.com/articles/ng.3171). Unfortunately it's not open access, but I may try digging around in the supplementary information to see what I can find out. This paper was referenced in the Batini 2015 paper (this thread) and I've also seen it referenced by Iain McDonald in discussions about age estimation calculations. They find a substantial (18%) difference in the mutation rate among different regions of the Y chromosome.

    There's a good amount of information on Iain McDonald's genetics website here: http://www.jb.man.ac.uk/~mcdonald/genetics.html. A lot of it is well worth a read - he explains things very well and I've learned a lot from there. He describes his age estimation method as based on the same method as YFull (Adamov et al. (2015)), but with a few mathematical bells and whistles. He's a bit better at maths than I am (his background is in astrophysics, mine in chemistry). He acknowledges that a lot of the principles and basis for his model are the same as the calculation that YFull uses. There's a bit of give and take between his method and YFull's - for example, the Big Tree for some of the R1b subclades is built up from VCF files, while the YFull tree is built up from BAM files. So when YFull is assessing whether a SNP should be included for age estimation, they exclude SNPs which have a read quality of less than 90% and SNPs that have only one or two reads, and that's information that may be missed by looking at the VCF file alone without the BAM. So there are advantages and disadvantages to either method.

    I think the years per SNP figure is going to depend on how many SNPs you are using in the calculation - essentially how much quality control you want to apply to what you want to call "reliable" SNPs. There might be a bit of give and take between excluding SNPs that may be bad and missing out on their influence, or including ones that throw a lot more uncertainty into the mix. If you're using 144.14 years per SNP and you suddenly add a lot more SNPs from outside the combBED region, obviously it's going to push the date a lot further back, and so in that case you need to adjust the mutation rate to fewer years per SNP to account for that. YFull already excludes some SNPs from their calculation as said above (<90% read quality, 1 or 2 read SNPs), and also excludes SNPs found in more than 5 different "localizations" (other haplogroups/subclades), indels and SNPs outside the combBED region. There's a bit more to it than 144 years per SNP - which Bill Wood was pushing last time I read one of his ravings. It's rather ironic that he trashes YFull so much while using their age estimate (albeit incorrectly).
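
    The exclusion rules listed above can be written down as a simple filter. This is a sketch only: the `Variant` fields and the function `usable_for_age` are hypothetical names, the thresholds are the ones quoted in the post, and "read quality" is treated here as a fraction between 0 and 1, which is an assumption about how that figure is expressed.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    read_quality: float   # assumed: fraction of reads supporting the call
    reads: int            # number of reads covering the position
    localizations: int    # other haplogroups/subclades where it also appears
    is_indel: bool
    in_combbed: bool      # inside the combBED region


def usable_for_age(v: Variant) -> bool:
    """Apply the exclusion rules described in the post, nothing more."""
    return (
        v.read_quality >= 0.90      # drop calls below 90% read quality
        and v.reads > 2             # drop SNPs with only one or two reads
        and v.localizations <= 5    # drop SNPs seen in more than 5 localizations
        and not v.is_indel          # drop indels
        and v.in_combbed            # drop anything outside combBED
    )


if __name__ == "__main__":
    # Made-up variants, purely for illustration.
    candidates = [
        Variant("example-A", 0.98, 14, 1, False, True),
        Variant("example-B", 0.98, 2, 1, False, True),
        Variant("example-C", 0.98, 14, 1, False, False),
    ]
    kept = [v.name for v in candidates if usable_for_age(v)]
    print(kept)
```

    Loosening any one of these rules lets more SNPs into the count, which is exactly the situation where the years per SNP figure would need to come down to keep the dates comparable.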

    But yeah, it brings up a fair few questions as to what you want to include in the calculation. As well as SNPs, some indels seem rather stable and perhaps some of the slower mutating STRs as well. But these will all have different mutation rates. Perhaps there's a case for sorting mutations into different categories, including SNPs in different regions separated from each other, and applying individual mutation rates based on that rather than a weighted average. Although that makes the calculation hideously more complicated. There might also be a case for different mutation rates in different haplogroups. Again, perhaps it depends how complicated you want to get.

    I notice some people get really upset by age estimates, especially if they aren't what they expect. The YFull Facebook group is full of posts like that. I feel that some of those folks need to be a bit more cognizant that there's not a regular clock where say 144 years or 100 years pass and like clockwork: boom! - new SNP! The mutation process is a lot more random than that - there could be several in one generation, or none in several generations. These are average rates with a fair range - as you say, leaving a lot of leeway. I like one of Maurice Gleeson's presentations from a few years ago - he had a slide which said something like "Which age estimate is the best one?" followed by "the one that best fits your preconceived ideas", which I think sums things up quite nicely. He also had another one which said that when dating branching points "pedigree method is most accurate. Others are statistically accurate... but not genealogically accurate... and never will be". Which I agree is entirely correct. That doesn't mean that we shouldn't work on these, discuss them and try to make them as good as they can be. All very worthy endeavours.
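
    To put a number on how irregular that "clock" is, here is a minimal sketch that models new SNPs as a Poisson process. That model is a standard simplification and an assumption here, not something stated in the posts above, and the function name `prob_k_snps` is hypothetical.

```python
import math


def prob_k_snps(k: int, years: float, years_per_snp: float) -> float:
    """Poisson probability of exactly k new SNPs arising over `years`,
    given an average of one SNP per `years_per_snp` years."""
    expected = years / years_per_snp
    return math.exp(-expected) * expected ** k / math.factorial(k)


if __name__ == "__main__":
    # Using a round 144 years per SNP over a 144-year span.
    p_none = prob_k_snps(0, 144, 144)
    p_one = prob_k_snps(1, 144, 144)
    p_two_or_more = 1.0 - p_none - p_one
    print(f"no new SNP:       {p_none:.1%}")
    print(f"exactly one SNP:  {p_one:.1%}")
    print(f"two or more SNPs: {p_two_or_more:.1%}")
```

    Under that assumption, a line has roughly a 37% chance of picking up no SNP at all over a full "one SNP" interval and roughly a 26% chance of picking up two or more, so individual branches can easily sit well off the average.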

    I haven't been following as many of the discussions in Facebook groups recently. There's some really good information and intelligent discussion on there, but there are also a lot of ridiculous opinions, unpleasantness and keyboard warriors on there too. Lately it seems there was more of the latter, so I dialled back my participation in some of those groups. But the folks you mention are always worth listening to - I've always enjoyed reading David Vance and James Kane's comments. I haven't seen Iain McDonald or Alex Williamson in Facebook discussions much - more from others posting their updates or comments from other forums - but those two are always worth listening to as well. Some other individuals not so much.

  12. The Following 2 Users Say Thank You to deadly77 For This Useful Post:

     JMcB (07-10-2019),  JonikW (07-10-2019)

  13. #17
    Quote Originally Posted by deadly77 View Post
    Yes, I also recall hearing the 85-90% figure for Big Y Y500 for coverage of the combBED region, so I don't think you're wrong there. James Kane's website has a lot of good statistics on the coverage of some of the different tests from a Y perspective. So perhaps a maximum of 10-15% of a change in YFull's age estimate from Big Y500 to a WGS and a little bit less for YElite and in reality much less that that due to YFull adjusting the number of SNPs going into the age estimate calculation based on coverage of the combBED region. Of course that all changes if YFull moves to a different methodology for their age estimate.

    One paper I'd be interested to read regarding mutation rates is this one on The Y-chromosome point mutation rate in humans by Helgason et al in 2015 in Nature Genetics https://www.nature.com/articles/ng.3171 unfortunately it's not open access but I may try digging around in the supplementary information to see what I can find out about that. This paper was referenced in the Batini 2015 paper (this thread) and I've also seen it referenced by Iain McDonald in discussions about age estimation calculations. They find a substantial (18%) difference in the mutation rate among different regions of the Y chromosome.

    There's a good amount of information on Iain McDonald's genetics website here: http://www.jb.man.ac.uk/~mcdonald/genetics.html. A lot of it is well worth a read - he explains things very well and I've learned a lot from there. He describes that his age estimation method is based on the same method as YFull (Adamov et al. (2015)), but with a few mathematic bells and whistles. He's a bit better at maths than I am (his background in astrophysics, mine in chemistry). He acknowledges that a lot of the principles and basis for his model is the same as the calculation that YFull uses. There's a bit of give and take between his method and YFull's - for example, the Big Tree for some of the R1b subclades is built up from data using VCF files, while the YFull tree is built up from data using BAM files. So when YFull is assesing whether a SNP should be included for age estimation, they apply a filter of excluding SNPs which have a read quality of less than 90% and SNPs that have only one or two reads while that's information that may be missed by looking at the VCF file alone without the BAM. So advantages and disadvantages to either method.

    I think the year per SNP range is going to depend on how many SNPs you are using in the calculation - essentially how much quality control you want to apply to what you want to call "reliable" SNPs. There might be a bit of give and take between excluding SNPs that may be bad and missing out on their influence, or including ones that throw a lot more uncertainty into the mix. If you're using 144.41 years per SNP and you suddenly add a lot more SNPs from outside the combBED region, obviously it's going to push the date a lot further back, so in that case you need to adjust the mutation rate to fewer years per SNP to account for that. YFull already excludes some SNPs from their calculation as said above (<90% read quality, 1 or 2 read SNPs), and also excludes SNPs found in more than 5 different "localizations" (other haplogroups/subclades), indels and SNPs outside the combBED region. There's a bit more to it than 144 years per SNP - which Bill Wood was pushing last time I read one of his ravings. It's rather ironic that he trashes YFull so much while using their age estimate (albeit incorrectly).
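    The exclusion rules listed above can be sketched as a simple filter. This is only an illustration of the criteria as described in this post, not YFull's actual code: the `Variant` record and its field names are hypothetical, and the thresholds are taken from the wording above.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record for one called variant (field names are assumptions)."""
    name: str
    read_quality: float   # fraction of reads supporting the call, 0.0 to 1.0
    reads: int            # number of reads covering the position
    localizations: int    # number of other haplogroups/subclades where it also appears
    is_indel: bool
    in_combbed: bool      # True if the position lies inside the combBED region


def usable_for_dating(v: Variant) -> bool:
    """Apply the exclusion rules described in the post."""
    return (
        v.read_quality >= 0.90      # drop calls with less than 90% read quality
        and v.reads >= 3            # drop 1-read and 2-read calls
        and v.localizations <= 5    # drop variants seen in more than 5 localizations
        and not v.is_indel          # drop indels
        and v.in_combbed            # drop anything outside combBED
    )


variants = [
    Variant("snp_a", 0.98, 12, 0, False, True),   # passes every rule
    Variant("snp_b", 0.85, 10, 0, False, True),   # read quality too low
    Variant("snp_c", 1.00, 2, 0, False, True),    # only two reads
    Variant("snp_d", 0.99, 15, 7, False, True),   # too many localizations
    Variant("snp_e", 0.99, 15, 0, False, False),  # outside combBED
]
kept = [v.name for v in variants if usable_for_dating(v)]
print(kept)
```

    Tightening or loosening any one of these thresholds changes how many SNPs are counted, which is why the years-per-SNP figure only makes sense together with the filter it was calibrated against.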

    But yeah, it brings up a fair few questions as to what you want to include in the calculation. As well as SNPs, some indels seem rather stable and perhaps some of the slower mutating STRs as well. But these will all have different mutation rates. Perhaps there's a case for sorting mutations into different categories, including SNPs in different regions separated from each other, and applying individual mutation rates based on that rather than a weighted average. Although that makes the calculation hideously more complicated. There might also be a case for different mutation rates in different haplogroups. Again, it perhaps depends how complicated you want to get.

    I notice some people get really upset by age estimates, especially if they aren't what they expect. The YFull Facebook group is full of posts like that. I feel that some of those folks need to be a bit more cognizant that there's not a regular clock where say 144 years or 100 years pass and like clockwork: boom! - new SNP! The mutation process is a lot more random than that - could be several in one generation, could be none in several generations. Average rates with a fair range - as you say, leaving a lot of leeway. I like one of Maurice Gleeson's presentations from a few years ago - he had a slide which said something like "Which age estimate is the best one?" followed by "the one that best fits your preconceived ideas", which I think sums things up quite nicely. He also had another one which said that when dating branching points "pedigree method is most accurate. Others are statistically accurate... but not genealogically accurate... and never will be". Which I agree is entirely correct. That doesn't mean that we shouldn't work on these, discuss them and try to make them as good as they can be. All very worthy endeavours.

    I haven't been following as many of the discussions in Facebook groups recently. There's some really good information and intelligent discussion on there, but there are also a lot of ridiculous opinions, unpleasantness and keyboard warriors too. Lately it seems there was more of the latter, so I dialled back my participation in some of those groups. But the folks you mention are always worth listening to - I've always enjoyed reading David Vance and James Kane's comments. I hadn't seen Iain McDonald or Alex Williamson on Facebook discussions much - more from others posting updates or comments from other forums, but those two are always worth listening to as well. Some other individuals not so much.
    Unfortunately, I haven’t been able to enjoy any of BiL Wood’s mad ravings, since he unceremoniously and secretly blocked me from his BiG Y Page while it was still in its infancy. Although, I do remember your reference to Maurice Gleeson’s humorous presentation, and I’ve actually taken "whatever fits your preconceived notions" as my motto ever since then. ;-)

    Facebook can definitely be a wilder and, at times, more childish venue. On the other hand, you can still learn a lot there and it doesn’t take too long before you figure out who to ignore. I suspect McDonald and Williamson are far too busy to engage in forums, and from what I’ve seen, McDonald has only posted here on rare occasions, usually when one of his U106 members asks him to comment and he feels it’s worthwhile.


    If you don’t mind, I’d like to run one more question by you. What are your thoughts concerning the accuracy of the estimates once your results seem to be confirming them, as you start moving into a genealogical time frame?

    For example, my branch (A13248) is currently dated by YFull to 970 AD. YFull & FT are both giving me 7 Novel Variants of good quality. I have tested 2 matches against a SNP pack of my NVs, with one being positive for 1 of the 7 and the other being positive for 5 of the 7. The one who is positive for 5 of the 7 is also my closest surname match at 111 markers (2@111). Judging from our MDKAs (1720s & 1730s) and our genealogy, it looks like our connection probably dates to the late 1600s or early 1700s, although it’s possible it could be earlier. On the face of it, these numbers would seem to align nicely with the estimates.

    Is there a point where the numbers become more reliable, once we move closer to the present?
    Last edited by JMcB; 07-11-2019 at 04:44 AM.
    Known Paper Trail: 45.3% English, 29.7% Scottish, 12.5% Irish, 6.25% German & 6.25% Italian. Or: 87.5% British Isles, 6.25% German & 6.25% Italian.
    LivingDNA: 88.1% British Isles (59.7% English, 27% Scottish & 1.3% Irish), 5.9% Europe South (Aegian 3.4%, Tuscany 1.3%, Sardinia 1.1%), 4.4% Europe NW (Scandinavia) & 1.6% Europe East, (Mordovia).
    FT Big Y: I1-Z140 branch I-F2642 >Y1966 >Y3649 >A13241 >Y3647 >A13248 (circa 830 AD) >A13242/YSEQ (circa 975 AD) >A13243/YSEQ (circa 1550 AD).

  14. The Following 2 Users Say Thank You to JMcB For This Useful Post:

     deadly77 (07-11-2019),  JonikW (07-10-2019)

  15. #18
    Registered Users
    Posts
    511
    Sex
    Location
    United Kingdom
    Ethnicity
    European
    Nationality
    British
    Y-DNA
    I-L338
    mtDNA
    J1c1

    United Kingdom England England North of England Norfolk Scotland Ireland
    Oh, I haven't been anywhere near BiL's Facebook page for a very long time - I was one of the first booted from his Facebook group in an early mass purge when he really threw his toys out of the pram. Someone I have a lot of respect for said a while ago that being excluded from his group was a badge of honour, which I fully agree with. He would occasionally jump on to other Facebook groups to promote his own "echo chamber" Facebook group, which usually turned into him having an online slanging match. It must take so much energy being that angry all the time. But he hasn't done that for a while - mainly I meant some of the other groups with some different issues. Oh well - enough complaining about Facebook and back on topic.

    Absolutely it's entirely possible to get to a point where you can have a lot more confidence. I'm a bit over sceptical at times and a lot of that comes with the validation of a scientific method, and that's what I'm often applying to some of these questions - look to try it from as many angles as possible to disprove something. If it falls apart pretty quickly, it's likely what's being proposed isn't robust enough. If it holds up to more intense scrutiny, it's a stronger model. Questions are good, and I'm always re-evaluating my opinions as I learn and think about things more.

    But let's look back at your example. You're the only one on your branch on the YFull tree, which has its branch dated with a TMRCA of 1050 ybp, as you say, going back to 970 AD. But if you click on the info button for your branch you'll see that there's a contribution to the TMRCA of your branch from the four samples on the I-Y136323 branch below you (even though you don't share a common ancestor with them beyond A13248). If you count up the number of SNPs back to A13248 (novels for you, novels plus the SNPs on the tree at branch I-Y136323 for them) you'll see there's a bit of variance - 4, 4, 6, 7, 7. One of the 4-SNP samples, YF19279, has the highest combBED coverage. So after coverage adjustment, there's a range of contributions to the TMRCA from 4.1 SNPs at 652 years to 7.82 SNPs (you) at 1189 years. So the contribution from the four other kits (average 5.85 SNPs, 905 ybp) pulls your TMRCA 142 years earlier, from 1189 to 1047 (rounded to 1050 ybp on the tree).
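    The arithmetic in that paragraph can be reproduced with a few lines. This is a minimal sketch, not YFull's implementation: it assumes a rate of 144.41 years per coverage-adjusted SNP plus a 60-year offset for the average age of living testers, and it assumes the branch TMRCA is the plain average of the two sub-lineage estimates. Those assumptions were chosen because they reproduce the figures quoted above.

```python
YEARS_PER_SNP = 144.41      # assumed rate per coverage-adjusted SNP
TESTER_AGE_OFFSET = 60      # assumed average age of living testers, in years


def snp_age(adjusted_snps: float) -> float:
    """Age in years before present implied by a coverage-adjusted SNP count."""
    return adjusted_snps * YEARS_PER_SNP + TESTER_AGE_OFFSET


# Coverage-adjusted SNP counts quoted in the post
single_kit_line = snp_age(7.82)    # the lone kit directly under I-A13248
other_line = snp_age(5.85)         # average of the four I-Y136323 kits
lowest_kit = snp_age(4.1)          # the lowest individual contribution

# Assumed combination rule: average the two sub-lineages
branch_tmrca = (single_kit_line + other_line) / 2

print(round(lowest_kit), round(single_kit_line), round(other_line), round(branch_tmrca))
```

    Under these assumptions the four values come out at about 652, 1189, 905 and 1047 years, matching the numbers in the paragraph, with 1047 then displayed as 1050 ybp on the tree.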

    So this is counting the age estimates from the bottom up, while when you're looking at the people whose results for your novel variants you've had Sanger tested at YSEQ, you're switching over to a top down approach - setting a date at 970 AD and counting backwards (although maybe it would be better to count from 830 AD given your individual TMRCA of 1189 ybp). But for consistency with YFull, ideally you'd be counting back from their own number of novel variants back to I-A13248. Obviously we don't have that because they haven't taken an NGS test. You're assuming that they have 7 novel SNPs down from I-A13248. That's fine as an assumption but be aware that they may not - 3 out of the four kits at I-Y136323 have fewer than that, two as low as four, and after the adjustment for combBED coverage, your age estimate of 7.8 SNPs and 1189 ybp is at the high end of the range. Or it could go the other way if they have say 9 or 10 novel SNPs. Hence the 95% confidence interval that YFull uses for the TMRCA - in the case of I-A13248, 1550 to 600 ybp.

    Just a few more things to think about. But these are good close matches on the Y line. I'm rather jealous as I don't have anything closer than 2900 ybp so haven't had to consider these until now, so I'm rather thinking out loud and rambling. Of course, one could design a model for age estimates that counts from the top down, but we're using dates for TMRCA that have been established by a bottom up approach.

    But let's take another example which has a more extreme range. I hope JonikW doesn't mind me using his subclade as an example but it's an interesting one and he has his terminal branch I-A21912 listed below his profile. Ok, so two kits at this branch with rather different numbers of downstream novel SNPs - one kit has 4 novel SNPs, the other kit has 11 novel SNPs. After accounting for coverage (they're not too different to each other), these adjust to corrected numbers of 4.59 and 12.32 SNPs, resulting in two rather different estimated TMRCAs - one is 772 ybp and the other is 1838 ybp - more than a thousand years difference between the two taken individually, which is a really large range. Obviously with just two samples, it's difficult to say if one is the outlier or the other. Or they could both be outliers around an average of 8.46 adjusted SNPs, which is where YFull rounds out the TMRCA on the tree to 1300 ybp. But if we go a step upstream to I-A21901, there are now four kits and we can see that the two kits at I-A21912 are the outer ranges of the TMRCA at 888 and 2162 ybp, while the two kits at the I-A21901 branch are a little closer in towards YF13910 than YF13812, so there's a case for labelling YF13812 as the outlier in the SNP count. Still a small sample size though.

    In answer to your last question, I actually think numbers become less reliable as we move closer to the present. My feeling is that the branches further back in the past have more samples collectively contributing to their age estimate, which brings down the statistical noise of outliers. In contrast, as we move to the present the number of people contributing to the age estimate is smaller - may even be just one or two - and it becomes harder to assess whether a sample is in the middle of the statistical range (the common assumption) or at the edges. I also think there is more of a difference in perception closer to the present day - consider that 100 years either side of 4600 years ago doesn't really have much effect on thinking about who or where the common ancestor was, but 100 years either side of 1750 AD definitely does.
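    The point about sample numbers can be put in rough figures. The sketch below assumes SNP counts behave like Poisson counts and that the contributing lineages are independent, which real trees are not (kits share upstream branches), so treat it as a back-of-envelope illustration. The kit and SNP numbers are hypothetical.

```python
import math


def relative_uncertainty(lineages: int, snps_per_lineage: float) -> float:
    """Approximate relative standard error of a pooled Poisson SNP count.

    For a Poisson count N the standard deviation is sqrt(N), so the
    relative error is 1 / sqrt(N), with N the total SNPs pooled.
    """
    total_snps = lineages * snps_per_lineage
    return 1.0 / math.sqrt(total_snps)


# Hypothetical recent branch: 2 kits with about 8 SNPs each
recent = relative_uncertainty(2, 8)
# Hypothetical older branch: 20 kits with about 30 SNPs each
older = relative_uncertainty(20, 30)

print(f"recent branch: about {recent:.0%} relative uncertainty")
print(f"older branch:  about {older:.0%} relative uncertainty")
```

    The older branch ends up with a much tighter relative uncertainty, although the same percentage of a larger age is still a bigger number of years in absolute terms.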

  16. The Following 3 Users Say Thank You to deadly77 For This Useful Post:

     JMcB (07-12-2019),  JonikW (07-12-2019),  spruithean (07-12-2019)

  17. #19
    Gold Class Member
    Posts
    1,211
    Sex
    Location
    Kent
    Ethnicity
    Isles Celto-Germanic
    Nationality
    British
    Y-DNA
    I1 Z140+ A21912+
    mtDNA
    V

    Wales England Cornwall Scotland Ireland Normandie
    Thanks for all these recent informative posts, deadly77. I really appreciate you taking the time to look at my results (for the record I'm more than happy for anyone on this site to look at my SNPs, run my G25 coordinates etc at any time; there's so much to learn). I agree that my downstream match looks like the outlier at this early stage. That's why I think an origin for me among the Angles looks like the most parsimonious assumption right now. It also fits geographically with my own Y line location in the Peak District as well as that of my match and the others upstream in the instances where we know where their paternal lines lived in the past (so that includes Scania in the one case, Cambridgeshire etc in others).
    Living DNA Cautious mode:
    South Wales Border-related ancestry: 86.8%
    Cornwall: 8%
    Cumbria-related ancestry: 5.2%
    Y line: Peak District, England. Big Y match: Scania, Sweden; TMRCA 1,280 ybp (YFull);
    mtDNA: traces to Glamorgan, Wales, 18th century. Mother's Y line (Wales): R-L21 L371

  18. The Following User Says Thank You to JonikW For This Useful Post:

     deadly77 (07-12-2019)

  19. #20
    Junior Member
    Posts
    8
    Sex

    I do not want to derail the high level discussion. I’m not a geneticist but specialized in gene expression, so I am a layman when it comes to these discussions. However, it is interesting to me because I belong to the I haplogroup (M253). That makes me curious about my paternal lineage origins. Sounds like Denmark is the probable region of origin for M253? Family records show English patrilineal ancestry, migrating to America in the latter 1600s.

    Next question, if I may impose: is there a scientific layman-readable article or presentation on how the calculations are done regarding relatedness of populations based on the available genetic data?

    If any or all of this is hopelessly under water, please ignore and carry on!
