Discovering two challenges to charting all testers under predictable haplogroups

02-08-2017, 10:15 PM
There are now dozens of haplogroups under haplogroup R that are around 1,500 to 2,500 years old and share one common YSTR signature. With genetic distance filters added, these signatures can predict a positive test for these haplogroups with very high accuracy (90 to 99 %). However, after being able to chart around 75 % of L226, I am discovering two major issues that will make the last ten or twenty percent difficult to chart with accuracy. These two issues arise at the two extreme ends of the bell curve for genetic distance from the haplogroup signature (10 to 15 on one end and 0 to 4 on the other end). I believe that these two "boundary condition" issues will affect almost all predictable haplogroups:

1) The largest and most challenging issue is that there continues to be significant bias in testing. This is the same testing bias that appears in the boundary conditions of YSNP prediction. Those who have large genetic distances from the signature, and extremely large genetic distances from other haplogroup members, will be a challenge to get to test. When I sort the L226 spreadsheet by genetic distance from the L226 signature alone, the genetic distance is as high as 15 from the signature (and reaches 25 between L226 submissions). These genetic distances exclude the very volatile CDY markers.

Of the ten percent of submissions with the largest genetic distance from the L226 signature, NONE had tested for any of the 43 known branches under L226. This means we are making minimal progress on understanding the early branches of L226. This is understandable, as the goals of haplogroup admins do not match the primary goal of most testers, which is finding closely related matches. Our requests to test individuals with large genetic distances from others in the group do not draw much response, because charting early branches (around 1,500 years ago) is not aligned with their primary interest. This is probably aggravated by FTDNA's matching system, which states that a genetic distance of seven or more at 67 markers means not related - so why should they test? Convincing this isolated and genetically diverse crowd to test will be a challenge, but it is needed to accurately chart the older branches in any haplogroup. If you revert to YSTR only tree building in the 1,000 to 1,500 year time frame, reasonable accuracy should not be expected - but it makes for pretty complete charts. Admins need to encourage these individuals to test - but it will be a hard sell unless we can convince them that this helps them directly (as well as serving admin interests in charting early branches of their haplogroup).
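For anyone who wants to reproduce the sort described above on their own spreadsheet, here is a rough Python sketch. It is only an illustration under assumptions: the marker values and kit names are invented, the function names are my own, and distance is counted as a simple sum of repeat differences per marker, which is not necessarily how FTDNA counts every marker.

```python
# Hypothetical data: kit names and repeat values are invented for illustration.
EXCLUDED = {"CDYa", "CDYb"}  # the volatile CDY markers are left out

def genetic_distance(haplotype, signature, excluded=EXCLUDED):
    """Sum of absolute repeat-count differences from the signature.

    Assumption: a simple stepwise count on every marker shared by both.
    """
    return sum(
        abs(haplotype[m] - signature[m])
        for m in signature
        if m not in excluded and m in haplotype
    )

def rank_by_distance(submissions, signature):
    """Return (kit, distance) pairs sorted from closest to farthest."""
    scored = [(kit, genetic_distance(h, signature)) for kit, h in submissions.items()]
    return sorted(scored, key=lambda pair: pair[1])

def tail(ranked, fraction=0.10, farthest=True):
    """Pick one end of the sorted list (at least one kit)."""
    n = max(1, round(len(ranked) * fraction))
    return ranked[-n:] if farthest else ranked[:n]

signature = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "CDYa": 36, "CDYb": 38}
submissions = {
    "kit_A": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "CDYa": 37, "CDYb": 39},
    "kit_B": {"DYS393": 13, "DYS390": 25, "DYS19": 14, "CDYa": 36, "CDYb": 38},
    "kit_C": {"DYS393": 12, "DYS390": 26, "DYS19": 15, "CDYa": 36, "CDYb": 38},
}

ranked = rank_by_distance(submissions, signature)
print(tail(ranked))                  # kits farthest from the signature
print(tail(ranked, farthest=False))  # kits closest to the signature
```

With a real 67 marker spreadsheet, the two tails are the groups discussed in issues 1 and 2: the farthest kits are the ones to check for missing YSNP testing, and the closest kits are the ones with too little YSTR variation to place by signature.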

2) The second major issue is at the other end of the bell curve of the same spreadsheet sorted only by genetic distance. Another ten percent of L226 has a genetic distance of four or less from the L226 signature. My analysis always excludes CDY markers from my 67 marker database. Of course, there are probably some hidden parallel and backwards mutations that lower the genetic distance of these testers. Since there is very little YSTR variation to form any unique two marker signatures, the only alternative for these people is to extensively test the younger YSNPs under the predictable haplogroup.

When I filtered out submissions that had not been extensively YSNP tested (via NGS test or comprehensive SNP pack test), I found an unbelievable 80 % error rate among those with a genetic distance of four or less from the L226 signature. 80 % of the time, they tested positive for different branches that were over 1,000 years old (where surname variation is 80 % or higher). I have labeled this "lack of divergence" for lack of any existing term. It is related to convergence between older YSNP branches, but it is just the opposite: the lines never diverged in the first place.
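The filtering step can be sketched the same way. This is a hedged illustration only: the record layout, the field names and the branch labels are all made up, and "error" is assumed here to mean that the branch guessed from YSTR matching differs from the branch confirmed by YSNP testing. The sample records are invented and do not reproduce the L226 figures.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Submission:
    kit: str
    gd_from_signature: int            # CDY markers already excluded
    extensively_tested: bool          # NGS or comprehensive SNP pack
    str_guessed_branch: str           # hypothetical: branch guessed from YSTR matches
    snp_tested_branch: Optional[str]  # branch confirmed by YSNP testing, if any

def low_gd_error_rate(subs, max_gd=4):
    """Share of extensively tested, low-distance kits whose YSTR guess was wrong."""
    pool = [
        s for s in subs
        if s.extensively_tested and s.gd_from_signature <= max_gd
    ]
    if not pool:
        return None  # nothing to measure
    wrong = sum(1 for s in pool if s.str_guessed_branch != s.snp_tested_branch)
    return wrong / len(pool)

# Invented sample records, for illustration only.
sample = [
    Submission("kit_1", 2, True, "branch_a", "branch_b"),
    Submission("kit_2", 3, True, "branch_a", "branch_a"),
    Submission("kit_3", 4, True, "branch_c", "branch_a"),
    Submission("kit_4", 1, True, "branch_b", "branch_d"),
    Submission("kit_5", 2, False, "branch_a", None),       # dropped: not extensively tested
    Submission("kit_6", 9, True, "branch_e", "branch_e"),  # dropped: distance above 4
]

rate = low_gd_error_rate(sample)
print(f"{rate:.0%} of low-distance kits were placed on the wrong branch")
```

Kits without extensive YSNP testing are dropped before counting, since there is nothing to check their YSTR placement against.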

But the long term hope of these people testing is much more promising, as they now know that they are not really related to their close matches, as revealed by over 100 NGS and SNP pack tests across the 500 L226 submissions being charted. In this case, the goals of the admins and the members of the haplogroup are compatible. There is not only significant genealogical NGS testing but even greater SNP pack testing by these individuals. People are really concentrating on extensively testing for very recent branches. For the line related to Brian Boru, they have recently discovered their second branch at ten levels of branches below L226, and around 50 % have either NGS tested or tested with the L226 SNP pack, which currently has 90 % of the known branches under L226.

So, this second scenario has a better chance of continued testing. But eventually, you will be left with those who have simply lost interest in DNA testing in general, and these people may never get mapped to their terminal YSNP with any accuracy. So admins must encourage these individuals to test while interest in testing remains high. Since there is very little YSTR diversity among these testers, the accuracy of predicting these younger branches will be substantially lower than for older parts of the haplotree.