A closer look at the I1 samples from Batini et al, Nat. Commun. 2015



deadly77
07-09-2019, 08:07 AM
I was intrigued by the paper published in Nature Communications in 2015 by Batini et al on “Large-scale recent expansion of European patrilineages shown by population resequencing” that came up a few times recently in the Ancient I1 Samples thread here on Anthrogenica. The Batini et al 2015 paper doesn’t actually contain any ancient samples, aside from a list in Supplementary Table 8, which the manuscript references as follows: “Ancient MSY (male specific region of the Y chromosome) sequences show that hgs R1a and R1b are present in the steppe much earlier than observed in any European sites (Supplementary Table 8), making this region a likely source for these MSY expansion lineages.” Given that there’s no ancient sample analysis, and that the paper instead looks at the DNA sequences of modern individuals traced back to a TMRCA (time to most recent common ancestor), I thought I’d put it into a separate thread so it doesn’t go off on a tangent from the main Ancient I1 thread. I think it does have some interest from an I1 perspective, especially relating to the founder effect and patrilineal bottleneck of the I1 men living today.

The paper is open access, so anyone should be able to read it and it can be found here https://www.nature.com/articles/ncomms8152. It seems to be a more focused discussion on TMRCA using a dataset established in an earlier 2014 paper from Hallast et al titled “Y chromosome tree bursts into leaf…” which features a lot of the same authors as the Batini et al 2015 paper. The Hallast et al 2014 paper is also open access and can be found here https://academic.oup.com/mbe/article/32/3/661/977118. This one is useful because its Supplementary Figure 1 is a higher-resolution version of the phylogenetic tree with sample names, which is much easier to read than the corresponding Supplementary Figure 1 in the Supplementary Information of the Batini et al 2015 paper.

The study contains the MSY sequences of 334 males from 17 populations and uses NGS (next generation sequencing). So not a very high number of samples, but the Y-DNA analysis of those samples is in greater depth than some other studies. All samples are from anonymous donors and most were collected for this study, so a good chance to get a look at a dataset that’s outside of the heavily US-biased direct to consumer testing databases.

deadly77
07-09-2019, 08:12 AM
46 of the samples in the study were found to be I1, representing 13.8% of the dataset. As expected, the dataset is dominated by R1b, although several other haplogroups are represented in the study. No I1 samples were found in the Basque, Greek, Palestinian, Spanish, Italian or Turkish populations. This doesn’t mean that there are no I1 folks in these modern populations – each population was sampled with at most 20 individuals. A small sample size means that some haplogroups are going to be missed. For the populations where I1 samples were found in this study, the breakdown looks like this:
[Attachment 31525: table of I1 samples by population]
Can also follow these in the Batini et al 2015 manuscript on the pie charts depicted in Figure 1b where I1 samples are shown in a lime green colour.

The percentage of I1 in the Danish and Norwegian populations appears to match up closely with other studies referenced by Eupedia, around the 30% mark. It’s a bit surprising to see that the highest percentage of I1 was in the Frisian population at 50%. My feeling is that this is a consequence of the small sample size of 20 (10 of which are I1), which overestimates the percentage of I1. Compare that to Eupedia reporting 16.5% I1 in the Netherlands, citing a sample size of 500 to 1000 samples. It’s unlikely that 50% of men in that “Frisian” population are I1.
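To put a rough number on how much uncertainty a sample of 20 carries, here is a minimal sketch using a Wilson score interval. This is my own illustration rather than anything from the paper, and the function name `wilson_interval` is just a label I picked.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 10 I1 samples out of 20 Frisian samples, as reported in the study
low, high = wilson_interval(10, 20)
print(f"Observed 50%, plausible range {low:.0%} to {high:.0%}")
```

With only 20 samples the interval comes out at roughly 30% to 70%, so the 50% figure is a very loose estimate of the true frequency.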

Some of the population names are easy to follow – such as Danish, Irish, Norwegian, etc. Some are less clear. It’s not clear to me whether the Saami population in this study refers specifically to the Finno-Ugric people inhabiting the Sápmi region or instead refers to the population of Finland as defined by the current political borders. My feeling is the latter. Same with Frisian for the Netherlands and Bavaria for Germany. When the paper mentions England, it says Hertfordshire and Worcestershire in the Methods section, so perhaps the English samples were only collected from those regions. It also appears that the English and Orcadian samples reference the POBI (People of the British Isles) project. The CEU samples were collected from folks in Utah with Central European ancestry by the International HapMap Consortium.

deadly77
07-09-2019, 08:13 AM
One thing that’s interesting is that the Batini et al 2015 paper gives a couple of different age estimates for the TMRCA for I1. Using BEAST, it’s 4190 YA (years ago) with a 95% HPD (highest posterior density) interval of 3470-5070 YA. Then there’s another age estimate using rho with a TMRCA of 3460 YA and a range of 3180-3760 YA. Both are a bit younger than the YFull estimate of 4600 ybp (years before present) with a 95% CI (confidence interval) of 5200-4000 ybp. It’s important to remember that estimates are just that – estimates. Such estimates are not genealogically exact and they never will be. They’re in the same general area (and some of the ranges overlap each other), and all of the estimates agree that the TMRCA falls a long time after I branched into I1 and I2. I wanted to establish if the TMRCA estimate in the Batini et al 2015 paper was based on a subset of I1 – for example, if all of the samples came from just I-DF29 or a downstream branch such as I-Z58, I-Z2336 or I-Z63.
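For anyone who wants to see which of these ranges actually overlap, here is a small sketch that compares the three intervals quoted above. The dictionary labels are my own shorthand.

```python
from itertools import combinations

# (lower, upper) bounds in years ago, as quoted in the post above
estimates = {
    "Batini BEAST": (3470, 5070),
    "Batini rho": (3180, 3760),
    "YFull": (4000, 5200),
}

def overlap(a, b):
    """Return the shared (lower, upper) range of two intervals, or None."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

for (name_a, a), (name_b, b) in combinations(estimates.items(), 2):
    print(f"{name_a} vs {name_b}: {overlap(a, b)}")
```

The BEAST range overlaps both of the others, while the rho range and the YFull range don't overlap each other.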

The Supplementary Data also includes a VCF file of the samples reported in this study. This can be loaded up in the Broad Institute’s IGV software, which I use for looking at the BAM files of ancient I1 samples, to make some designations about the downstream I1 subclades of these samples. I can’t get as much information from the VCF compared to the BAM. Think of the BAM file as the raw sequence data aligned against a reference genome (hg19 or hg38 for example) and think of the VCF as more like an annotated report rather than the raw data itself. Also, both papers listed above used 8 targeted regions of the Y chromosome that were X-degenerate. The hg19 coordinates of these targeted X-degenerate regions are listed in Supplementary Data 2 of the Batini et al paper and in Figure 1 of the main manuscript in the Hallast et al 2014 paper.
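To illustrate what "annotated report" means in practice, here is a minimal sketch that reads per-sample calls from a VCF-style text. The excerpt is entirely hypothetical: the positions, sample names and calls are made up for illustration and are not taken from the paper's file. I'm assuming haploid genotype calls (0 for reference allele, 1 for alternate allele, "." for no call).

```python
# Hypothetical VCF excerpt: positions, samples and calls are invented.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB
Y\t1000100\t.\tC\tT\t.\tPASS\t.\tGT\t1\t0
Y\t1000200\t.\tG\tA\t.\tPASS\t.\tGT\t.\t1
"""

def read_calls(vcf_text):
    """Map (position, sample) -> 'REF', 'ALT' or 'no call'."""
    calls = {}
    samples = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta lines and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        pos = int(fields[1])
        for sample, genotype in zip(samples, fields[9:]):
            gt = genotype.split(":")[0]  # GT is the first FORMAT key here
            calls[(pos, sample)] = {"0": "REF", "1": "ALT"}.get(gt, "no call")
    return calls

calls = read_calls(VCF_TEXT)
print(calls[(1000100, "sampleA")])
```

Note that ALT doesn't automatically mean derived: whether a call is derived or ancestral depends on which allele the reference genome happens to carry at that position.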

deadly77
07-09-2019, 08:19 AM
This also means there are limitations on which SNPs can be extracted from the VCF file. For example, DF29 and the phyloequivalent SNPs for that branch aren’t covered. Same for Z58, which is one of the largest branches below I-DF29. However, I can see the SNPs Z2336, Z59 and Z63, which cover a lot of major branching points. Here’s how the numbers work out for this dataset:
[Attachment 31527: table of sample counts for I-Z59, I-Z2336, I-Z63 and other I-M253]
Most of the samples are I-Z59 and subclades. Second most common in this dataset is the I-Z2336 branch, followed by the I-Z63 branch. The rest are I-M253, which covers the samples that don’t fit into the other three categories: I-Z58 samples that aren’t I-Z59 (such as I-Z138), other I-DF29 subclades, and subclades of I1 that are negative for DF29.
Here’s a more detailed breakdown of how the main four I1 groups are distributed in this paper.
[Attachment 31526: detailed breakdown of the four main I1 groups]
Or for those of you who like a more visual representation:
[Attachment 31528: chart of the same breakdown]

deadly77
07-09-2019, 08:20 AM
All four of the CEU samples are on YFull as part of their academic samples list, which makes them a bit easier to track. From the YFull tree, sample CEU-NA11992 is on branch I-Y9414, which is downstream of I-Z17954. So the dataset here includes at least one sample that is outside the I-DF29 branch, going back to the same TMRCA as the YFull I1 tree, so that removes one of the main possible reasons for the differences in the TMRCA. There are several other reasons why the TMRCAs are different. YFull has significantly more samples and uses BAM files rather than VCF for the data. They also pull their pool of SNPs from different regions of the Y chromosome. YFull uses SNPs from the combBED region, filters some out on other criteria (indels and MNPs, SNPs that appear too often in multiple different haplogroups, read quality and depth), adjusts for coverage and then applies an assumed mutation rate of 144.41 years per SNP. The Batini et al 2015 paper used 8 targeted X-degenerate regions of the Y chromosome, so it doesn’t cover as much of the combBED region as YFull, and it uses a different mutation rate – for example, the rho method calculation uses 268.5 years per mutation, a much larger figure than YFull’s, but it uses fewer SNPs to do that calculation. There is some discussion of their mutation rate in the manuscript and they reference a 2015 paper by Helgason et al in Nature Genetics which estimated mutation rates from Icelandic Y chromosome genomes. I haven’t read the Helgason 2015 paper (not open access), but I’ve seen it referred to in Iain McDonald’s age estimate calculations, where a substantial (18%) difference in the mutation rate among different regions of the Y chromosome was mentioned.
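As a rough illustration of how a rho-style estimate works, here is a minimal sketch: rho is the average number of mutations separating each sample from the common ancestor, and the age is rho multiplied by the years per mutation. The mutation counts below are hypothetical, not values from the paper, and this is my simplified reading of the method rather than the authors' actual pipeline.

```python
def rho_tmrca(mutations_per_sample, years_per_mutation):
    """Average mutation count from ancestor to each sample, scaled to years."""
    rho = sum(mutations_per_sample) / len(mutations_per_sample)
    return rho * years_per_mutation

# Hypothetical mutation counts for three samples (illustration only)
counts = [12, 13, 14]
print(rho_tmrca(counts, 268.5))

# Working backwards from the figures quoted above: a TMRCA of 3460 years
# at 268.5 years per mutation implies an average of about 12.9 mutations.
print(3460 / 268.5)
```

The years-per-mutation figure only makes sense for the specific regions sequenced: fewer base pairs covered means fewer mutations seen per unit time, and so more years per mutation.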

The smaller coverage of the 8 targeted X-degenerate regions of the Y chromosome means that I’m not able to assign everything clearly, as not every relevant SNP is covered in the VCF, but I managed to find a few SNPs that were covered and was able to use some phyloequivalent SNPs as surrogates. After a bit of trial and error, I found that SNPs with a PH prefix were well covered in this paper. Apparently, this prefix is assigned to Pille Hallast, one of the authors of this paper, so that’s probably why. It may be that several of the PH SNPs were discovered and registered by this study. I was able to assign three of the samples in the M253 group to the I-Z138 group based on derived reads for Z139. These are I-Z58 but negative for the I-Z59 branch (I-Z59 accounts for 21 of the 46 samples, so I-Z58 is 24 samples). Within these three samples, Fri-1312 is derived for PH4482 and Fri-1725 is derived for S19185, while for Nor-20 I didn’t find anything further than Z139. We know CEU-NA11992 is I-Y9414 from its position on the YFull tree, and further analysis shows that, of the two remaining I-M253 group samples, Ork-525 has a derived read for PH2706, which is on the I-Z131 branch (and therefore negative for DF29), and Nor-15 has a derived read for PH2510, which defines a small branch below I-DF29 on the YFull tree. So, of the 46 samples, 44 are DF29+ and 2 are DF29-.

Could dig out some more downstream information on the other branches – three of the I-Z63 samples were derived for Y2245 (Bav-53, Eng-O109, Ser-12) while the other three I-Z63 samples were ancestral for Y2245 (Fri-1319, Nor-14, Ser-1). Managed to find derived read for PH2195 for Ser-1 and derived read for PH3482 for Ser-12.

deadly77
07-09-2019, 08:21 AM
For the I-Z2336 samples, I could separate five of the thirteen into I-Z74 based on derived reads for CTS1793. Of those five, Saa-5 was derived for L258 and the other four (CEU-NA11829, Den-152, Nor-2, Nor-21) went into I-L813 based on derived reads for S297. Of the other samples, three of the Saami samples (Saa-5, Saa-9, Saa-10) had a derived read for PH5383 downstream of the I-L22 branch. Of the rest, Ser-4 was [...] while for Den-158 and Hun-27 I didn’t find anything downstream of I-Z2336.
In the I-Z59 group (the largest group), three samples were negative for Z60 and positive for Z2041. Of those three, Hun-37 had a derived read for PH4774, Den-183 had a derived read for PH4362 and CEU-NA12750 was I-Z2042.

All the rest of the I-Z59 group were I-Z60. I didn’t find anything further for Den-207, but Den-113 was derived for PH902 and Nor-7 was derived for PH2834, placing that sample at I-BY453 downstream of I-F2642. Z140 and Z141 aren’t covered so there’s no read for those, although F2642 was covered and Nor-7 (mentioned above) was the only derived example. Z2535 was covered and six of the samples were derived – all of these were also derived for L338. Two of these I-L338 samples were derived for S1990 – Fri-1325 was also derived for PH5345 while Ire-0130 was ancestral for PH5345. The remaining four I-L338 samples were all derived for Y8337 – two of those (Fri-1309, Fri-1722) were also further derived for PH4462, while the other two (Fri-1048, Fri-1938) were not, which leaves them at I-Y8337.

The remaining nine I-Z60 samples were derived for S1948, putting them on the I-CTS7362 branch. For two of these (Fri-1048, Fri-1937), I didn’t get further down than S1948. Bav-57 was derived for PH2753, which would be at branch I-Y63100 on the YFull tree, a small branch of I-CTS7362. The rest of the I-CTS7362 samples could be grouped into I-Z73 based on a derived read for the phyloequivalent SNP Y2927. For two of these (Den-176, Ork-573), I couldn’t find anything downstream of that, but CEU-NA06994 was I-Y11026 and three of the Saami samples (Saa-11, Saa-19, Saa-20) were I-L1302.
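The placement logic I've been describing can be sketched in a few lines: map surrogate SNPs onto the branch they stand in for, then take the most downstream branch with a derived call and read the uncovered upstream branches (such as DF29 and Z58) off the tree. The `PARENT` map below is a heavily simplified slice of the I1 tree based on the relationships mentioned in this thread, and the function names are my own. It's a sketch and doesn't check for conflicting calls.

```python
# Simplified child -> parent map for a few I1 branches (illustration only)
PARENT = {
    "DF29": "M253",
    "Z58": "DF29",
    "Z2336": "DF29",
    "Z63": "DF29",
    "Z59": "Z58",
    "Z138": "Z58",
    "Z60": "Z59",
    "Z2041": "Z59",
}

# SNPs used as stand-ins for a branch whose own SNP isn't covered
SURROGATE = {"Z139": "Z138"}

def lineage(snp):
    """Path from the root (M253) down to the given SNP."""
    path = [snp]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]

def place(derived_snps):
    """Return the path to the most downstream branch with a derived call."""
    mapped = {SURROGATE.get(s, s) for s in derived_snps}
    known = [s for s in mapped if s == "M253" or s in PARENT]
    if not known:
        return []
    return max((lineage(s) for s in known), key=len)

print(" > ".join(place({"M253", "Z59", "Z60"})))
print(" > ".join(place({"M253", "Z139"})))
```

So a sample with derived calls at Z59 and Z60 is placed at I-Z60, and DF29 and Z58 are inferred as positive even though neither is covered in the VCF.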

deadly77
07-09-2019, 08:30 AM
I realize that’s a bit of a read, but wanted to give the rationale behind where I grouped these samples in this dataset. Much easier to summarize into a table:
[Attachment 31529: summary table of subclade assignments]
Or for those who prefer a more visual overview, I annotated Supplementary Material Figure 1 from the Hallast et al 2014 paper with branches based on derived SNPs that I could find.
[Attachment 31530: annotated Supplementary Figure 1 from Hallast et al 2014]
These aren’t necessarily the terminal positions on the tree for the samples in this dataset, as a lot of the SNPs weren’t covered in the 8 targeted X-degenerate regions of the Y chromosome listed in Supplementary Data 2 of the Batini et al paper and in Figure 1 of the main manuscript in the Hallast et al 2014 paper. For some of the branches I was able to use phyloequivalent SNPs, and at least I was able to break down the I1 samples in this dataset into known subclades.

spruithean
07-09-2019, 01:50 PM
Very interesting work. I remember reading through this study and finding the high-rate of I1 in Frisians, to not be surprising, but I felt that it was skewed. Great work! Interesting to note the distribution of these various subclades of I1.

JonikW
07-09-2019, 02:28 PM
I agree that this is excellent work and it's great to see that deadly77 is back. I was also drawn to the Frisians here and would love to see a detailed study on that region, particularly given the area's role as a springboard into England in the Migration Period. Ten out of 20 were I1: wow, it would be good to have 200 samples and see how much the I1 level changed...

deadly77
07-09-2019, 03:01 PM
Cheers. Indeed - I saw that when I was reading the paper and thought "crikey! that's a lot of I1 in the Netherlands. Why have I never heard that before". Yeah, I believe it's a consequence of the small sample size and grabbing 10 out of the 20 Frisian samples being I1. I think with a larger sample size, the proportion of I1 isn't as large. Works the other way in that there's only one I1 sample from England. Luck of the draw I guess.

I'm often seeing the percentage of I1 being quoted as "up to 50% of the population in some areas of Sweden" in various places. I'm feeling that figure is down to a low sample size. Going from Eupedia's Y-DNA haplogroups by country, it has 50% I1 in Gotland while 37% I1 in Sweden here https://www.eupedia.com/europe/european_y-dna_haplogroups.shtml

Looking through the list of Eupedia's sources, I got to the paper "Y-chromosome diversity in Sweden – A long-time perspective" by Karlsson et al in the European Journal of Human Genetics, 2006, here (open access): https://www.nature.com/articles/5201651 - Table 1 lists 40 samples from Gotland, 18 of which are I1a and 2 of which are I1c. Given the year it was published, I think what the paper calls I1c is what we would today classify as I2. It seems there was a reclassification in 2008 where I1a became I1. But if Eupedia is adding 18+2=20 I1 out of 40 samples from Gotland to get 50%, then again that could be an inflated figure from a small sample size. If anyone knows of any other studies supporting the 50% statistic, I'd be interested to check them out.

JMcB
07-09-2019, 04:24 PM
As always, very nicely done!

I remember when that paper came out, taking notice of their younger estimate. Although, if you average them all together, you’re still in the same ball park, 4100 ybp.

One thing I’m curious to know is how all of the new upstream SNPs being added by Y700 are going to affect our timeline. I would imagine that YFull knows about some of them from their Y-Elite & WGS submissions, but I guess that would depend on where those testers are on the tree. I know everyone in my area of the tree is Big Y500. Then again, I’ve upgraded, so that may give them all they need for our branch. From what I’ve heard, we’re looking at a 40% increase in coverage once those new tests start filtering in.

What are your thoughts?

deadly77
07-09-2019, 06:07 PM
Cheers - yes as you say a lot of those numbers are in the same ballpark - and a lot of them fall within the confidence intervals of each other. Perhaps it's a case of which educated guess is preferred.

To be honest, I'm not sure that the new SNPs found in Y700 tests are going to affect the YFull age estimates all that much in the near future. I think the upstream placement of "novel" SNPs is more of a headache for FTDNA's haplotree than the YFull tree. As you say, YFull has seen a fair bit of YElite and WGS tests from customers as well as Y sequences from academic studies (ancient and modern). I don't know the numbers or breakdown but I think it's fair to say that YFull has a lot of samples of greater coverage that FTDNA didn't.

The majority of these new Big Y Y700 SNPs being discovered are outside of the combBED regions - while the YFull tree is made up from SNPs from both the combBED region and outside of the combBED region, they only use the ones within the combBED region for age estimates (plus some other filtering). Big Y Y500 was targeted for the combBED regions so coverage is reasonably good for those regions.

YFull also already accounts for differences in coverage in their calculation. Take your own kit for example - YFull filtered your novel SNPs downstream of your branch on the YFull tree down to 7 that are suitable for their age estimation. After that they look at how much of the combBED region (8467165 base pairs) is covered by your kit, which comes to 7579639 base pairs. Then when they do their age estimation they adjust your 7 novel SNPs to 7.82 to account for the missing coverage in your Big Y. Compare to my WGS, which has 8475116 base pair coverage against the 8467165 base pair combBED region, which only adjusts my 17 novel SNPs to 16.98 (negligible difference). I can't see it being much of a change unless the estimated number of missing SNPs accounted for in the adjustment is way off, and I don't expect it will be. Maybe some cleanup of ambiguous or low read SNPs, but in the grand scheme of things I don't expect a lot of changes to YFull age estimation.
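The numbers above are consistent with a simple proportional scaling, sketched below. This is my own reconstruction from the figures quoted in this post and I'm assuming a plain ratio of region size to covered base pairs. I don't know that this is exactly how YFull implements it, and the function name is my own.

```python
COMBBED_BP = 8_467_165  # size of the combBED region quoted above

def adjust_snp_count(novel_snps, covered_bp, region_bp=COMBBED_BP):
    """Scale a SNP count by region size over covered base pairs."""
    return novel_snps * region_bp / covered_bp

print(round(adjust_snp_count(7, 7_579_639), 2))    # Big Y kit
print(round(adjust_snp_count(17, 8_475_116), 2))   # WGS kit
```

Both results round to the adjusted counts quoted above (7.82 and 16.98), which is why I think a plain ratio is at least close to what's going on.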

The newer SNPs may result in some rebranching which then has a little bit of adjustment, but that doesn't really change the calculation, more where it's pointing.

The bigger changes would come if YFull decides to start using a different age estimate calculation that incorporates SNPs from regions outside of the combBED region. One of the reasons that YFull likely decided on using only the combBED regions for age estimation is that they anticipated the majority of their customers sending in BAM files from FTDNA Big Y tests. Now that the Big Y has moved from Y500 to Y700, YFull may decide they have good enough coverage of regions outside of the combBED region to justify that. Then again, they may decide not to do that in order to keep compatibility with the existing Y500 kits on the YFull tree that don't move to Y700. It might be a numbers game (and I have no inside information - purely speculating). But I do think they designed their age estimate with the earlier Big Y in mind.

As I said above, I think it's more of a headache for FTDNA than YFull. If I look at the FTDNA public haplotree, it looks like they have 75 variants listed as phyloequivalent to branch I-M253 while YFull has 312. Same with I2 (FTDNA 37, YFull 68) and branch I (FTDNA 82, YFull 199). I'm not sure if FTDNA is aware of the missing ones and just doesn't bother including them on the public haplotree as they won't affect any of their customers downstream, or if all of these need to be added as upstream novel variants and accounted for in the correct position in FTDNA's tree.

JMcB
07-09-2019, 08:43 PM
Thank you, that was a nice explanation! So basically, YFull is ahead of the curve and the changes are going to be marginal. Which is good to know!

deadly77
07-09-2019, 10:51 PM
Well, I could be very wrong about a lot of that, so we'll see. I think it's important to remember that these phylogenetic trees are a bit more fluid than we might think, and some SNPs can be added to a branch or haplogroup and then removed if that later seems like not such a good idea. This is a little easier to follow on the YFull tree because you can look up past versions of the tree (archive button on the top right of the webpage). Here's the earliest version of the YFull tree on their website for I1: https://www.yfull.com/arch-3.07/tree/I1/ - notice that the I1 block is defined by 341 SNPs rather than the 312 of today's tree, so clearly some of them must have been moved to a different branch or removed altogether. It's not as easy to retroactively check FTDNA's tree as there doesn't seem to be a way to check archive versions.

You can see the coverage of the upstream SNPs in your own Big Y test if you click on YReport on homepage. Looking at my FGC YElite (which is what the Big Y Y700 is going to resemble) it had a few no calls (grey) and ambiguous (yellow) calls. Of the I1 subclades, it had no call for 4 SNPs (including two SNPs phyloequivalent for DF29) but correct for 27 - not bad. Going up to the I1 block, no call for 37 SNPs, ambiguous for 3 out of the I1 312 SNPs - again not too bad. For the I block, no call for 25 SNPs, ambiguous for 2 of the 199 SNPs. Can do the same with the IJ, IJK, HIJK branches and so on. On my WGS, every SNP is read - there are zero no calls for these SNPs. Have some ambiguous calls - 2 in the I1 branch, 3 in the I branch, one in the IJ branch.

Also at YFull, I can see where there are SNPs known and associated with a branch, but they're not on the tree. A lot of these have a 1 star rating. Some of these are associated with branches closest to me on the tree, and some are a bit further up. Take the SNP Z2741 for example - YFull is listing it as level I1 and I have 27 A reads (derived), but it only has a one star rating. Its position is 11066130, so perhaps a bit too close to the centromere, although perhaps observed in a lot of I1 folks. Then there's Y1948 - it's at 2839534 so in the combBED region, associated with level I1 but not on the YFull tree, and YFull gives it one star. The mutation is described as GAAAAAAAAAAT to GAAAAAAAAATT so maybe that's an MNP (multiple nucleotide polymorphism) rather than a SNP. Some of the other one star ones at this I1 level are perhaps not included because they're homologous with X or autosomes. I don't really know - I'm trying to think about this logically but appreciate that I'm often flailing around a bit blindly. But that's kind of fun though.

I guess the upstream stuff doesn't always get a lot of attention outside of the academics - most people focus on their terminal as that's where their closest connections, matches and mismatches are going to have the most interaction.

JMcB
07-10-2019, 01:00 AM
Yes, time will tell and it will be interesting to see what happens, but I would say your assessment sounds quite reasonable to me. As I was reading the part about YFull’s adjustments, I remembered hearing that Big Y500 covered most, but not all, of the combBED region. Perhaps in the 85 to 90% range (don’t quote me on that, as I’m going from memory), so their adjustments make sense. Coincidentally, I’ve often noticed their adjustments - without considering why they were doing them - because I like to use their SNP counts to try out different mutation rates, just to see how they affect the branches around me. I’m beginning to think I may have too much time on my hands ;-)

Do you have any thoughts on the various mutation rates? I know that Williamson, Kane & Vance all use 131.6 years for Big Y500 and I believe McDonald uses 160 years. On the Facebook pages Kane & Vance are saying the rate for Y700 is going to be in the 82 year range. Many people seem to believe that McDonald’s calculations are the most accurate. On the other hand, in one of his own posts on the subject, he said he was glad to see that his results and YFull’s were fairly close, even though they used different methods. Plus, no matter what method you use, the variances leave a lot of room for leeway.


P.S. I agree about YFull’s archiving system. It’s definitely a nice feature to have. I was using it recently to see how much fluctuation there had been in the TMRCA of my main branch. Which was confirmed back in 2017.

I'm actually waiting for a new result to fully process, which I suspect is going to send it back to 1900 ybp.

deadly77
07-10-2019, 09:27 AM
Yes, I also recall hearing the 85-90% figure for Big Y500 coverage of the combBED region, so I don't think you're wrong there. James Kane's website has a lot of good statistics on the coverage of some of the different tests from a Y perspective. So perhaps a maximum of 10-15% of a change in YFull's age estimate from Big Y500 to a WGS and a little bit less for YElite, and in reality much less than that due to YFull adjusting the number of SNPs going into the age estimate calculation based on coverage of the combBED region. Of course that all changes if YFull moves to a different methodology for their age estimate.

One paper I'd be interested to read regarding mutation rates is this one, "The Y-chromosome point mutation rate in humans" by Helgason et al in 2015 in Nature Genetics: https://www.nature.com/articles/ng.3171. Unfortunately it's not open access, but I may try digging around in the supplementary information to see what I can find out about that. This paper was referenced in the Batini 2015 paper (this thread) and I've also seen it referenced by Iain McDonald in discussions about age estimation calculations. They find a substantial (18%) difference in the mutation rate among different regions of the Y chromosome.

There's a good amount of information on Iain McDonald's genetics website here: http://www.jb.man.ac.uk/~mcdonald/genetics.html. A lot of it is well worth a read - he explains things very well and I've learned a lot from there. He describes his age estimation method as based on the same method as YFull (Adamov et al. (2015)), but with a few mathematical bells and whistles. He's a bit better at maths than I am (his background is in astrophysics, mine in chemistry). He acknowledges that a lot of the principles and basis for his model are the same as the calculation that YFull uses. There's a bit of give and take between his method and YFull's - for example, the Big Tree for some of the R1b subclades is built up from data using VCF files, while the YFull tree is built up from data using BAM files. So when YFull is assessing whether a SNP should be included for age estimation, they apply a filter excluding SNPs which have a read quality of less than 90% and SNPs that have only one or two reads, and that's information that may be missed by looking at the VCF file alone without the BAM. So there are advantages and disadvantages to either method.

I think the years per SNP figure is going to depend on how many SNPs you are using in the calculation - essentially how much quality control you want to apply to what you want to call "reliable" SNPs. There might be a bit of give and take between excluding SNPs that may be bad and missing out on their influence, or including ones that throw a lot more uncertainty into the mix. If you're using 144.41 years per SNP and you suddenly add a lot more SNPs from outside the combBED region, obviously it's going to push the date a lot further back, and so in that case you need to adjust the mutation rate to fewer years per SNP to account for that. YFull already excludes some SNPs from their calculation as said above (<90% read quality, 1 or 2 read SNPs), but also excludes SNPs found in more than 5 different "localizations" (other haplogroups/subclades), indels and SNPs outside the combBED region. There's a bit more to it than 144 years per SNP - which Bill Wood was pushing last time I read one of his ravings. It's rather ironic that he trashes YFull so much while using their age estimate (albeit incorrectly).
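
To make the "adjust the years per SNP" point concrete, here is a rough sketch. It assumes the mutation rate per base is uniform, so the years per SNP scale inversely with the number of bases you count SNPs over. That is a simplification (the Helgason paper reports differences between regions), and the region sizes below are hypothetical numbers picked only for illustration, not the real size of combBED or of any test.

```python
def rescale_years_per_snp(base_years_per_snp: float,
                          base_region_bp: float,
                          new_region_bp: float) -> float:
    """Rescale a years-per-SNP figure to a region of a different size.

    Assumes a uniform per-base mutation rate: a larger region picks up
    SNPs proportionally faster, so each SNP stands for fewer years.
    """
    return base_years_per_snp * base_region_bp / new_region_bp


# Hypothetical region sizes in base pairs, for illustration only.
base_region = 8.0e6
bigger_region = 12.0e6

# Round figure of 144 years per SNP, as quoted in the discussion.
print(rescale_years_per_snp(144.0, base_region, bigger_region))  # 96.0
```

So counting SNPs over a region 1.5 times larger while still multiplying by 144 would overstate the age by about half, which is the effect described above.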

But yeah, it brings up a fair few questions as to what you want to include in the calculation. As well as SNPs, some indels seem rather stable and perhaps some of the slower mutating STRs as well. But these will all have different mutation rates as well. Perhaps there's a case for sorting mutations into different categories, including SNPs in different regions separated from each other, and applying individual mutation rates based on that rather than a weighted average. Although that makes the calculation hideously more complicated. There might also be a case for different mutation rates in different haplogroups. Again, perhaps it depends how complicated you want to get.

I notice some people get really upset by age estimates, especially if they aren't what they expect. The YFull Facebook group is full of posts like that. I feel that some of those folks need to be a bit more cognizant that there's not a regular clock where say 144 years or 100 years pass and like clockwork: boom! - new SNP! The mutation process is a lot more random than that - could be several in one generation, could be none in several generations. Average rates with a fair range - as you say, leaving a lot of leeway. I like one of Maurice Gleeson's presentations from a few years ago - he had a slide which said something like "Which age estimate is the best one?" followed by "the one that best fits your preconceived ideas" which I think sums things up quite nicely. He also had another one which said that when dating branching points "pedigree method is most accurate. Others are statistically accurate... but not genealogically accurate... and never will be". Which I agree is entirely correct. That doesn't mean that we shouldn't work on these, discuss them and try and make them as good as they can be. All very worthy endeavours.
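
To illustrate how loose that "clock" is, here is a small sketch that treats SNP accumulation as a Poisson process. That model is my assumption for illustration (it is the usual way to describe rare random events, not something taken from a specific calculator), and 144 years per SNP is just the round figure from this discussion.

```python
import math


def poisson_pmf(k: int, expected: float) -> float:
    """Probability of seeing exactly k events when `expected` are expected."""
    return math.exp(-expected) * expected ** k / math.factorial(k)


years = 1000          # true age of a hypothetical branch
years_per_snp = 144   # round average figure, an assumption
expected_snps = years / years_per_snp

# Chance of each SNP count on a single line of descent over those years.
for k in range(13):
    print(f"{k:2d} SNPs: {poisson_pmf(k, expected_snps):.3f}")
```

Running it shows a wide spread of plausible SNP counts around the average for the very same true age, which is why a single kit on its own says little about where in the range it sits.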

I haven't been following as many of the discussions in Facebook groups recently. There's some really good information and intelligent discussion on there, but there are also a lot of ridiculous opinions, unpleasantness and keyboard warriors on there too. Lately it seems there was more of the latter, so I dialled back my participation in some of those groups. But the folks you mention are always worth listening to - I've always enjoyed reading David Vance and James Kane's comments. I hadn't seen Iain McDonald or Alex Williamson on Facebook discussions much - more from others posting updates or comments from other forums, but those two are always worth listening to as well. Some other individuals not so much.

JMcB
07-10-2019, 04:48 PM
Unfortunately, I haven’t been able to enjoy any of BiL Wood’s mad ravings, since he unceremoniously and secretly blocked me from his BiG Y Page while it was still in its infancy. Although, I do remember your reference to Maurice Gleeson’s humorous presentation and I’ve actually taken "whatever fits your preconceived notions" as my motto ever since then. ;-)

Facebook can definitely be a wilder and at times more childish venue. On the other hand, you can still learn a lot there and it doesn’t take too long before you figure out who to ignore. I suspect McDonald and Williamson are far too busy to engage in forums and from what I’ve seen, McDonald has only posted here on rare occasions, usually when one of his U106 members asks him to comment and he feels it’s worthwhile.


If you don’t mind, I’d like to run one more question by you. What are your thoughts concerning the accuracy of the estimates once your results seem to be confirming them, as you start moving into a genealogical time frame?

For example, my branch (A13248) is currently dated by YFull to 970 AD. YFull & FT are both giving me 7 Novel Variants of good quality. I have tested 2 matches against a SNP pack of my NVs, with one being positive for 1 of the 7 and the other being positive for 5 of the 7. The one who is positive for 5 of the 7 is also my closest surname match at 111 markers ([email protected]). Judging from our MDKAs (1720s & 1730s) and our genealogy, it looks like our connection probably dates to the late 1600s or early 1700s. Although, it’s possible it could be earlier. On the face of it, these numbers would seem to align nicely with the estimates.

Is there a point where the numbers become more reliable, once we move closer to the present?

deadly77
07-11-2019, 10:52 PM
Oh, I haven't been anywhere near BiL's Facebook page for a very long time - I was one of the first booted from his Facebook group in an early mass purge when he really threw his toys out of the pram. Someone I have a lot of respect for said a while ago that being excluded from his group was a badge of honour, which I fully agree with. He would occasionally jump on to other Facebook groups to promote his own "echo chamber" Facebook group, which usually turned into him having an online slanging match. It must take so much energy being that angry all the time. But he hasn't done that for a while - mainly I meant some of the other groups with some different issues. Oh well - enough complaining about Facebook and back on topic.

Absolutely, it's entirely possible to get to a point where you can have a lot more confidence. I'm a bit over-sceptical at times and a lot of that comes with the validation of a scientific method, and that's what I'm often applying to some of these questions - look to try it from as many angles as possible to disprove something. If it falls apart pretty quickly, it's likely what's being proposed isn't robust enough. If it holds up to more intense scrutiny, it's a stronger model. Questions are good, and I'm always re-evaluating my opinions as I learn and think about things more.

But let's look back at your example. You're the only one on your branch on the YFull tree, which has its TMRCA dated at 1050 ybp, as you say, going back to 970 AD. But if you click on the info button for your branch you'll see that there's a contribution to the TMRCA of your branch from the four samples on the I-Y136323 branch below you (even though you don't share a common ancestor with them more recent than A13248). If you count up the number of SNPs back to A13248 (novels for you, novels plus the SNPs on the tree at branch I-Y136323 for them) you'll see there's a bit of variance - 4, 4, 6, 7, 7. One of the 4 SNP samples, YF19279, has the highest combBED coverage. So after coverage adjustment, there's a range of contributions to the TMRCA from 4.1 SNPs at 652 years to 7.82 SNPs (you) at 1189 years. So the contribution from the four other kits (average 5.85 SNPs, 905 ybp) pulls your TMRCA 142 years earlier from 1189 to 1047 (rounded to 1050 ybp on the tree).
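
The arithmetic above can be reproduced with a short sketch. Two things in it are my assumptions rather than anything stated in this post: a rate of 144.41 years per SNP, and a fixed 60 years added to each estimate. I picked them because together they reproduce the figures quoted above; treat them as a back-calculation, not as a statement of exactly how YFull does it.

```python
YEARS_PER_SNP = 144.41  # assumed rate, chosen because it reproduces the quoted ages
OFFSET_YEARS = 60       # assumed fixed offset, same reason


def age_from_snps(adjusted_snps: float) -> float:
    """Age estimate (ybp) implied by a coverage-adjusted SNP count."""
    return adjusted_snps * YEARS_PER_SNP + OFFSET_YEARS


def branch_tmrca(subbranch_ages: list[float]) -> float:
    """Simple mean of the estimates from each sub-branch."""
    return sum(subbranch_ages) / len(subbranch_ages)


own_line = age_from_snps(7.82)      # the single kit on the I-A13248 branch
other_line = age_from_snps(5.85)    # average of the four I-Y136323 kits
lowest = age_from_snps(4.1)         # the kit with the highest coverage

print(round(lowest), round(other_line), round(own_line))
print(round(branch_tmrca([own_line, other_line])))
```

The point of the sketch is that the two sides of the split are weighted equally, so one kit with a high SNP count on one side is averaged against the mean of the four kits on the other.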

So this is counting the age estimates from the bottom up, while when you're looking at the people whose results for your novel variants you've Sanger tested at YSEQ, you're switching over to a top down approach - setting a date at 970 AD and counting backwards (although maybe it would be better to count from 830 AD given your individual TMRCA of 1189 ybp). But for consistency with YFull, ideally you'd be counting back from their own number of novel variants back to I-A13248. Obviously we don't have that because they haven't taken an NGS test. You're assuming that they have 7 novel SNPs down from I-A13248. That's fine as an assumption but be aware that they may not - 3 out of the four kits at I-Y136323 have fewer than that, two as low as four, and after the adjustment for combBED coverage, your age estimate of 7.8 SNPs and 1189 ybp is at the high end of the range. Or it could go the other way if they have say 9 or 10 novel SNPs. Hence the 95% confidence interval that YFull uses for the TMRCA - in the case of I-A13248 at 1550 to 600 ybp.

Just a few more things to think about. But these are good close matches on the Y line. I'm rather jealous as I don't have anything closer than 2900 ybp so haven't had to consider these until now, so rather thinking out loud and rambling. Of course, could design a model for age estimate that counts from the top down, but we're using dates for TMRCA that have been established by a bottom up approach.

But let's take another example which has a more extreme range. I hope JonikW doesn't mind me using his subclade as an example but it's an interesting one and he has his terminal branch I-A21912 listed below his profile. Ok, so two kits at this branch with rather different numbers of downstream novel SNPs - one kit has 4 novel SNPs, the other kit has 11 novel SNPs. After accounting for coverage (they're not too different to each other), these adjust to corrected numbers of 4.59 and 12.32 SNPs, resulting in two rather different estimated TMRCAs - one is 772 ybp and the other is 1838 ybp - more than a thousand years difference between the two taken individually, which is a really large range. Obviously with just two samples, it's difficult to say if one is the outlier or the other. Or they could both be outliers around an average of 8.46 adjusted SNPs, which is where YFull rounds out the TMRCA on the tree to 1300 ybp. But if we go a step upstream to I-A21901, there are now four kits and we can see that the two kits at I-A21912 are the outer ranges of the TMRCA at 888 and 2162 ybp, while the two kits at the I-A21901 branch are a little closer in towards YF13910 than YF13812, so there's a case for labelling YF13812 as the outlier in the SNP count. Still a small sample size though.
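
As a quick check on the "not too different to each other" coverage remark, the raw and adjusted SNP counts above can be turned back into an implied coverage figure. This sketch assumes the adjustment is a plain division of the raw count by the fraction of the combBED region covered, which is my simplification for illustration and may not be exactly what YFull does.

```python
def implied_coverage(raw_snps: float, adjusted_snps: float) -> float:
    """Coverage fraction implied if adjusted = raw / coverage (an assumption)."""
    return raw_snps / adjusted_snps


# Raw and adjusted counts for the two I-A21912 kits, as quoted above.
kits = {
    "kit with 4 novel SNPs": (4, 4.59),
    "kit with 11 novel SNPs": (11, 12.32),
}

for label, (raw, adjusted) in kits.items():
    print(f"{label}: implied coverage {implied_coverage(raw, adjusted):.1%}")
```

Under that assumption both kits land within a couple of percentage points of each other, so the large gap in their age estimates comes from the SNP counts themselves rather than from coverage.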

In answer to your last question, I actually think numbers become less reliable as we move closer to the present. My feeling is that the branches further back in the past have more samples collectively contributing to their age estimate, which brings down the statistical noise of outliers. In contrast, as we move to the present the number of people contributing to the age estimate is smaller - may even be just one or two - and it becomes harder to assess with confidence whether a sample is in the middle of the statistical range (the common assumption) or at the edges. I also think there is more of a difference in perception closer to the present day - consider that 100 years either side of 4600 years ago doesn't really have much effect on thinking about who or where the common ancestor was, but 100 years either side of 1750 AD definitely does.

JonikW
07-12-2019, 09:32 AM
Thanks for all these recent informative posts deadly77. I really appreciate you taking the time to look at my results (for the record I'm more than happy for anyone on this site to look at my SNPs, run my G25 coordinates etc at any time; there's so much to learn). I agree that my downstream match looks like the outlier at this early stage. That's why I think an origin for me among the Angles looks like the most parsimonious assumption right now. It also fits geographically with my own Y line location in the Peak District as well as that of my match and the others upstream in the instances where we know where their paternal lines lived in the past (so that includes Scania in the one case, Cambridgeshire etc in others).

RP48
07-12-2019, 07:22 PM
I do not want to derail the high level discussion. I’m not a geneticist but specialized in gene expression, so am a layman when it comes to these discussions. However, it is interesting to me because I belong to the I haplogroup (M253). That makes me curious about my paternal lineage origins. Sounds like Denmark is the probable region of origin for M253? Family records show English patrilineal ancestry, migrating to America in the latter 1600s.

Next question, if I may impose: is there a scientific layman-readable article or presentation on how the calculations are done regarding relatedness of populations based on the available genetic data?

If any or all of this is hopelessly under water, please ignore and carry on!

deanovermont
07-12-2019, 07:47 PM
Thanks for all these recent informative posts deadly77. I really appreciate you taking the time to look at my results (for the record I'm more than happy for anyone on this site to look at my SNPs, run my G25 coordinates etc at any time; there's so much to learn). I agree that my downstream match looks like the outlier at this early stage. That's why I think an origin for me among the Angles looks like the most parsimonious assumption right now. It also fits geographically with my own Y line location in the Peak district as well as that of my match and the others upstream in the instances where we know where their paternal lines lived in the past (so that includes Scania in the one case, Cambridgeshire etc in others).

Count me in as well. I'm happy to have anyone kick the tires of my BigY and SNP testing results any day. Really, I'm such a noob. When I perused the original post I was tickled simply to see that one of the test subjects is PH4462 (just one level up from my A18477). I was already aware of my haploline's connection to Holland/Frisia. But I spent the better part of an hour reading about the region. Now I need to get serious and go read the articles mentioned in William's fine post.

deanovermont
07-12-2019, 07:47 PM
Oops. Hit post twice.

deadly77
07-13-2019, 12:00 AM
Thanks for all these recent informative posts deadly77. I really appreciate you taking the time to look at my results (for the record I'm more than happy for anyone on this site to look at my SNPs, run my G25 coordinates etc at any time; there's so much to learn). I agree that my downstream match looks like the outlier at this early stage. That's why I think an origin for me among the Angles looks like the most parsimonious assumption right now. It also fits geographically with my own Y line location in the Peak district as well as that of my match and the others upstream in the instances where we know where their paternal lines lived in the past (so that includes Scania in the one case, Cambridgeshire etc in others).

I have a lot of good memories of the Peak District - I went to Uni at Sheffield and wasn't that far from there. Got out for some good weekends hiking around there quite a few years ago.

I'm also thinking Angles from my own line in Northumbria, although that's largely from feeling rather than evidence. My last known paternal ancestor was born in Gateshead just South of Newcastle sometime around 1831 based on his age in the subsequent census data. I'm pretty sure that he was illegitimate - he was born before birth certificates but there's no father listed on his marriage certificate in 1851 and no father listed with the family on the early census. From what I've read about the streets they lived on in the early census data, it was pretty much one of the poorest, most desolate, crime- and disease-ridden areas of the city. I found this map of the Gateshead cholera epidemic of 1854, with deaths represented by black dots, and they were living at Leonard's Court on the right of the map.
[Attachment 31670: map of deaths in the Gateshead cholera epidemic of 1854]
So they were lucky to stay alive long enough to have descendants and it's likely that my surname comes from my 4th great grandmother rather than my 4th great grandfather. Which I'm fine with but it means that my 4th great grandfather could be from anywhere, as long as he was in Northeast England sometime around 1830/31, never mind when his ultimate patrilineal ancestor came to the British Isles. So could be Northumbrian Angle, could be something else entirely.

deadly77
07-13-2019, 12:44 AM
I do not want to derail the high level discussion. I’m not a geneticist but specialized in gene expression, so am a layman when it comes to these discussions. However it is interesting to me because I belong to the I1 haplogroup (M253). That makes me curious about my paternal lineage origins. Sounds like Denmark is the probable region of origin for M253? Family records show English patrilineal ancestry, migrating to America in the latter 1600s.

Next question, if I may impose: is there a scientific layman-readable article or presentation on how the calculations are done regarding relatedness of populations based on the available genetic data?

If any or all of this is hopelessly under water, please ignore and carry on!

In all honesty, I'm not a geneticist either. My background is in chemistry, but I'm enjoying playing around with this. Real geneticists are probably shaking their heads...

In my opinion, it's unlikely that we'll ever find the real location for the origin of the I1 haplogroup. I1 is defined by over 300 SNP mutations - 312 on the current YFull tree, and M253 is just one of those, although it's the SNP most commonly used to name the haplogroup. This means that the ancestors of the haplogroup split away from I2 sometime about 27500 years ago and then genetically we're in a bit of a black hole until the most recent common ancestor that all I1 folks today descend from, at approximately 4600 years ago. The location of Denmark/Northern Germany is often postulated from the geographic spread of the descendants living today. So this represents an average point from locations around Europe where I1 has been successful and flourished rather than a confirmed origin. It's a reasonable approximation but it's unlikely we'll ever know for sure. Some people are very adamant about certain locations.

Regarding your question on "a scientific layman-readable article or presentation on how the calculations are done regarding relatedness of populations based on the available genetic data?" I'd ask you to specify whether you mean regarding Y-DNA or autosomal DNA, as the answers on where to point you will be rather different.

Welcome to the Anthrogenica forums and hope you enjoy the discussions here.

deadly77
07-13-2019, 11:15 AM
Count me in as well. I'm happy to have anyone kick the tires of my BigY and SNP testing results any day. Really, I'm such a noob. When I perused the original post I was tickled simply to see that one of the test subjects is PH4462 (just one level up from my A18477). I was already aware of my haploline's connection to Holland/Frisia. But I spent the better part of an hour reading about the region. Now I need to get serious and go read the articles mentioned in William's fine post.

Ah yes - I've seen some of your posts in the I-Z140 Facebook group, so welcome to the Anthrogenica forums here. I find that too - little nuggets of information that send me off on tangents to read about related subjects.

One of the things that drew me to digging further into the Batini et al 2015 paper is that it would give a slightly different perspective to the Y-DNA database established by commercial testing, such as FTDNA and YFull. I'm only seeing one individual with Netherlands in both the YFull tree and the FTDNA public haplotree on branch PH4462, although one individual at I-L338 in Netherlands that William has grouped together with that branch is in there too. Finding two I-PH4462 in this paper, along with another two a level up at I-BY463/Y3334 (due to derived Y8337) was neat. At 10 samples out of 46 I1, the Netherlands definitely has a lot of I1 in this dataset. I think with a larger sampling of the populations, we would see a bit more normalizing, but this study has an advantage in that it lets us dig into the subclade data, while a lot of other studies will just list a very basic high level haplogroup such as I1.

The SNPs with a PH prefix were discovered and registered by Pille Hallast, who is one of the authors on the Batini et al 2015 paper and the earlier 2014 paper. It's quite likely that these PH SNPs were discovered for the first time in this dataset that led to this paper. I had a look for some of the SNPs specific to your branch but they didn't show up in the VCF file - likely they weren't discovered until the Big Y tests of you and the other fellow on your branch after these papers published in 2014/15.

JMcB
07-13-2019, 04:43 PM
[…]

But let's look back at your example. You're the only one on your branch on the YFull tree, which has its branch dated with a TMRCA of 1050 ybp, as you say, going back to 970 AD. But if you click on the info button for your branch you'll see that there's a contribution to the TMRCA of your branch from the four samples on the I-Y136323 branch below you (even though you don't share a common ancestor with them beyond A13248). If you count up the number of SNPs back to A13248 (novels for you, novels plus the SNPs on the tree at branch I-Y136323 for them) you'll see there's a bit of variance - 4, 4, 6, 7, 7. One of the 4-SNP samples, YF19279, has the highest combBED coverage. So after coverage adjustment, there's a range of contributions to the TMRCA from 4.1 SNPs at 652 years to 7.82 SNPs (you) at 1189 years. So the contribution from the four other kits (average 5.85 SNPs, 905 ybp) pulls your TMRCA 142 years closer to the present, from 1189 to 1047 ybp (rounded to 1050 ybp on the tree).

So this is counting the age estimates from the bottom up, while when you're looking at the people that you've Sanger tested your novel variants at YSEQ, you're switching over to a top down approach - setting a date at 970 AD and counting backwards (although maybe it would be better to count from 830 AD given your individual TMRCA of 1189 ybp). But for consistency with YFull, ideally you'd be counting back from their own number of novel variants back to I-A13248. Obviously don't have that because they haven't taken an NGS test. You're assuming that they have 7 novel SNPs down from I-A13248. That's fine as an assumption but be aware that they may not - 3 out of the four kits at I-Y136323 have fewer than that, two as low as four, and after the adjustment for combBED coverage, your age estimate of 7.8 SNPs and 1189 ybp is at the high end of the range. Or it could go the other way if they have say 9 or 10 novel SNPs. Hence the 95% confidence interval that YFull uses for the TMRCA - in the case of I-A13248 at 1550 to 600 ybp.

Just a few more things to think about. But these are good close matches on the Y line. I'm rather jealous as I don't have anything closer than 2900 ybp so haven't had to consider these until now, so rather thinking out loud and rambling. Of course, could design a model for age estimate that counts from the top down, but we're using dates for TMRCA that have been established by a bottom up approach ....
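For anyone wanting to check the arithmetic in the quoted figures, here is a minimal sketch. The 144.41 years-per-SNP rate is the one used in this thread. The 60-year offset is an assumption on my part: it is not stated above, but it is the value that makes 7.82 SNPs come out at 1189 years and 4.1 SNPs at 652 years, and I believe it corresponds to an assumed average age of living testers. Taking a plain average of the two sub-branch estimates is likewise inferred from the 1189, 905 and 1047 figures rather than taken from any YFull documentation. The function and variable names are illustrative only.

```python
YEARS_PER_SNP = 144.41   # rate used in this thread
SAMPLE_AGE_OFFSET = 60   # assumption: inferred so the quoted figures line up


def snps_to_years(adjusted_snps):
    """Convert a coverage-adjusted SNP count to an age estimate in years."""
    return adjusted_snps * YEARS_PER_SNP + SAMPLE_AGE_OFFSET


own_line = snps_to_years(7.82)      # about 1189 years
other_branch = snps_to_years(5.85)  # about 905 years (average of the four kits)
lowest_kit = snps_to_years(4.1)     # about 652 years

# Assumption: the branch TMRCA is the plain average of the two sub-branches
branch_tmrca = (own_line + other_branch) / 2  # about 1047, shown as 1050 on the tree

print(round(own_line), round(other_branch), round(lowest_kit), round(branch_tmrca))
```

The gap between the single-line estimate and the branch estimate is the 142 years mentioned above (1189 minus 1047).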



Thank you once again for taking the time to give me your input. It’s nice to have someone to bounce this off of. To be honest, I didn’t really consider whether I was using a top down or bottom up calculation, as they were really back of the envelope affairs. Which is the only way I know how to do them. ;-)

Basically, I was using a variety of starting points, including YFull’s my line only estimate of 830 AD, and then adding in the number of matched SNPs to those figures. So for example, using that starting date for the single NV match, I simply used: 830 AD + 1 x 144.41 = 974 AD [A13242] and for the 5 NV surname match: 830 AD + 5 x 144.41 = 1552 AD [A13243]. As the latter was a surname match, I figured that match probably took place sometime after my name began to appear in Scotland. Which would give me an approximate window between 1320 - 1720. Coincidentally, William’s initial estimate also fell in that range, circa 1450 AD. With his usual provisos added.
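The back of the envelope sums above can be written out as a short sketch. It simply adds a fixed 144.41 years per matched novel variant to a chosen starting year; the function name and the choice of 830 AD as the start are taken from the example in the post, and nothing here accounts for the confidence interval around either figure.

```python
YEARS_PER_SNP = 144.41  # rate used in the post above


def estimate_year_ad(start_year_ad, matched_variants):
    """Add a fixed number of years per matched novel variant to a start year."""
    return start_year_ad + matched_variants * YEARS_PER_SNP


single_nv_match = estimate_year_ad(830, 1)  # about 974 AD
surname_match = estimate_year_ad(830, 5)    # about 1552 AD

print(round(single_nv_match), round(surname_match))
```

Swapping in a different starting year (for example 970 AD from the rounded branch TMRCA) shifts both results by the same amount, which is why the choice of starting point matters as much as the rate.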

Unfortunately, all I know about my matches - besides their basic genealogy - is the number of Novel Variants they’ve matched. So I didn’t try guessing how many NVs they might have. Although, I did experiment with variance swings and new variance estimations - as I’m no longer at A13248 - and different mutation rates, using YFull’s SNP counts. For the most part, I usually ended up in the window above. Which is about all I would expect. Although, I think you’ve made a good point about using the "this line only" date as my starting point.

Coincidentally, my Y700 results have come in and they’re now giving me 14 Novel Variants, instead of 7. Although, the reported numbers have been fluid over there, so I’m not sure if that’s going to stand. We’ll see!

O well, back to the drawing board ;-)


Looks like I’ll have to make a new SNP Pack

deadly77
07-13-2019, 11:52 PM
Thank you once again for taking the time to give me your input. It’s nice to have someone to bounce this off of. To be honest, I didn’t really consider whether I was using a top down or bottom up calculation, as they were really back of the envelope affairs. Which is the only way I know how to do them. ;-)

Basically, I was using a variety of starting points, including YFull’s my line only estimate of 830 AD, and then adding in the number of matched SNPs to those figures. So for example, using that starting date for the single NV match, I simply used: 830 AD + 1 x 144.41 = 974 AD [A13242] and for the 5 NV surname match: 830 AD + 5 x 144.41 = 1552 AD [A13243]. As the latter was a surname match, I figured that match probably took place sometime after my name began to appear in Scotland. Which would give me an approximate window between 1320 - 1720. Coincidentally, William’s initial estimate also fell in that range, circa 1450 AD. With his usual provisos added.

Unfortunately, all I know about my matches - besides their basic genealogy - is the number of Novel Variants they’ve matched. So I didn’t try guessing how many NVs they might have. Although, I did experiment with variance swings and new variance estimations - as I’m no longer at A13248 - and different mutation rates, using YFull’s SNP counts. For the most part, I usually ended up in the window above. Which is about all I would expect. Although, I think you’ve made a good point about using the "this line only" date as my starting point.

Coincidentally, my Y700 results have come in and they’re now giving me 14 Novel Variants, instead of 7. Although, the reported numbers have been fluid over there, so I’m not sure if that’s going to stand. We’ll see!

O well, back to the drawing board ;-)


Looks like I’ll have to make a new SNP Pack

To be honest mate, I would likely have done a lot of things that you're doing - I play around with back of the envelope as well. I think it's extremely valid that you're able to chart some of your downstream SNPs - these are real branch points, regardless of what the age estimates are.

I don't have all the answers - I'm just calling things as I see them with the information as I understand it. I may very well be wrong on a lot of the things that I say. But I try and make the best of what we have. I'm just giving you my opinion and thoughts and I appreciate the discussions.

I'd take a look at what the seven "new" novel variants are - either at YBrowse or YFull - I find the latter a bit easier although both work. Or give FTDNA a bit of time to figure out where some of these SNPs really should be.

JMcB
07-14-2019, 12:50 AM
To be honest mate, I would likely have done a lot of things that you're doing - I play around with back of the envelope as well. I think it's extremely valid that you're able to chart some of your downstream SNPs - these are real branch points, regardless of what the age estimates are.

I don't have all the answers - I'm just calling things as I see them with the information as I understand it. I may very well be wrong on a lot of the things that I say. But I try and make the best of what we have. I'm just giving you my opinion and thoughts and I appreciate the discussions.

I'd take a look at what the seven "new" novel variants are - either at YBrowse or YFull - I find the latter a bit easier although both work. Or give FTDNA a bit of time to figure out where some of these SNPs really should be.

Needless to say, I always appreciate your opinion. Not only in my case but in all the other ones you’ve been working on. It’s nice to see something happening in our haplogroup again and for the most part, you’re the one who’s been doing it.

At this point, I think I’ll let things settle down at FT because their reported numbers seem a little fluid. If it seems like the new NVs are going to stick, I’ll try weeding out the new ones from the old. Right now, I’d really like to get my Bam file to YFull but apparently FT is having a problem with people who upgraded and also previously asked for their 500 Bam files. It’s always something. ;-)


As an aside, David seems to be hinting that the Vikings are coming soon:

However, I'll be adding many more ancient samples to the Global25 datasheets as they become available, including lots of new Vikings, which should greatly improve the accuracy of these sorts of fine-scale mixture models.

http://eurogenes.blogspot.com/

That’s the second time I’ve seen him mention them in the last week or so.

lgmayka
07-14-2019, 02:13 AM
Right now, I’d really like to get my Bam file to YFull but apparently FT is having a problem with people who upgraded and also previously asked for their 500 Bam files.
This morning, my FTDNA account presented me with a button to Generate BAM. I pressed the button. Just now, I notice that the BAM is now available.

JonikW
07-14-2019, 11:00 AM
Needless to say, I always appreciate your opinion. Not only in my case but in all the other ones you’ve been working on. It’s nice to see something happening in our haplogroup again and for the most part, you’re the one who’s been doing it.

At this point, I think I’ll let things settle down at FT because their reported numbers seem a little fluid. If it seems like the new NVs are going to stick, I’ll try weeding out the new ones from the old. Right now, I’d really like to get my Bam file to YFull but apparently FT is having a problem with people who upgraded and also previously asked for their 500 Bam files. It’s always something. ;-)


As an aside, David seems to be hinting that the Vikings are coming soon:

However, I'll be adding many more ancient samples to the Global25 datasheets as they become available, including lots of new Vikings, which should greatly improve the accuracy of these sorts of fine-scale mixture models.

http://eurogenes.blogspot.com/

That’s the second time I’ve seen him mention them in the last week or so.

I posted an update on the Viking paper somewhere on this site. It's a Copenhagen/Gothenburg tie-up and should be published in three to five months.

deanovermont
07-15-2019, 10:33 PM
Ah yes - I've seen some of your posts in the I-Z140 Facebook group, so welcome to the Anthrogenica forums here. I find that too - little nuggets of information that send me off on tangents to read about related subjects.

One of the things that drew me to digging further into the Batini et al 2015 paper is that it would give a slightly different perspective to the Y-DNA database established by commercial testing, such as FTDNA and YFull. I'm only seeing one individual with Netherlands in both the YFull tree and the FTDNA public haplotree on branch PH4462, although one individual at I-L338 in Netherlands that William has grouped together with that branch is in there too. Finding two I-PH4462 in this paper, along with another two a level up at I-BY463/Y3334 (due to derived Y8337) was neat. At 10 samples out of 46 I1, the Netherlands definitely has a lot of I1 in this dataset. I think with a larger sampling of the populations, we would see a bit more normalizing, but this study has an advantage in that it lets us dig into the subclade data, while a lot of other studies will just list a very basic high level haplogroup such as I1.

The SNPs with a PH prefix were discovered and registered by Pille Hallast, who is one of the authors on the Batini et al 2015 paper and the earlier 2014 paper. It's quite likely that these PH SNPs were discovered for the first time in this dataset that led to this paper. I had a look for some of the SNPs specific to your branch but they didn't show up in the VCF file - likely they weren't discovered until the Big Y tests of you and the other fellow on your branch after these papers published in 2014/15.

I wanted to say thanks again and also apologise. In my post above I had meant to pay you a compliment by referring to your fine original post. But in a bit of a brain failure no doubt precipitated by knowing both your name and WH's from the Z140 Facebook group I misdirected it. Sorry S!

JonikW
07-15-2019, 10:46 PM
I wanted to say thanks again and also apologise. In my post above I had meant to pay you a compliment by referring to your fine original post. But in a bit of a brain failure no doubt precipitated by knowing both your name and WH's from the Z140 Facebook group I misdirected it. Sorry S!

Yes, deadly77's posts are much appreciated. Might be best not to name people without their permission though. This site values privacy.;)

deanovermont
07-16-2019, 12:58 AM
Thanks. Fixed that!

JonikW
07-16-2019, 08:03 AM
Thanks. Fixed that!

... Great. I've removed from the quote too.:)

deadly77
07-17-2019, 09:36 AM
Cheers fellas - no apologies necessary. I've a lot of respect for William - he's doing a very good job as admin of the I-Z140 project and also admins a fair few other I1 subclade projects as well. He's one of the early pioneers (along with Ken Nordvedt and others) on how the I1 tree and subclades were put together, which is even more impressive as we had far fewer SNPs discovered back then, and a lot of the groupings were done by STRs, which in hindsight seems a harder way to do things. I wasn't following this field when a lot of the early research was going on, but it was probably a big mix of fun and frustration - there must have been quite a few "three steps forward, two steps back" moments.