View Full Version : New YouTube for math supporting FTDNA matching system and issues with it



RobertCasey
10-29-2017, 06:40 AM
Below is a link to a very preliminary version of an analysis of normal distributions for genetic genealogy,
several examples of consistently high false positives across several haplogroups and surname clusters
and why YSNP testing and charting is really necessary and better than the FTDNA matching system:

https://www.youtube.com/watch?v=pmv7RyrVTa4

Look forward to comments - please be kind as this is my first pass at this topic.

Dave-V
10-29-2017, 02:19 PM
Hi Robert - nice video! As you know I agree with you that STR signatures are more important than GD alone in determining relatedness, and that STR analysis has to include the individual mutation rates. I think we will continue to get value out of STRs in tandem with SNPs until Y coverage and widespread NGS testing give us branching SNPs down to near-present day.

One suggestion - there are two reasons that STR analysis alone may not be effective - the first is (as you mentioned) that the curve may not fit a normal distribution, and the second is that the STR mutation frequencies themselves may not conform to calculated mutation rates (which is especially a concern within the relatively short time of surnames).

For the first issue we could probably calculate a Fisher-Pearson measure of skewness or a kurtosis figure for a particular group that could be used as a measure of the likely impact on branching. I haven't tried that yet. For the second issue we can calculate the actual mutation rates over the timespan - I HAVE done that with predictable results (i.e. they're usually off by some amount).
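
As a rough illustration of the skewness and kurtosis idea, here is a minimal Python sketch. It computes the Fisher-Pearson coefficient of skewness (g1) and the excess kurtosis from population moments. The list of genetic distances is invented purely for illustration and is not taken from any project.

```python
def skewness_and_kurtosis(values):
    """Return (g1, excess_kurtosis) using population (biased) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("all values are identical; skewness is undefined")
    g1 = m3 / m2 ** 1.5               # Fisher-Pearson coefficient of skewness
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return g1, excess_kurtosis


# Hypothetical genetic distances of testers from a cluster's modal signature.
hypothetical_gds = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
g1, excess_kurtosis = skewness_and_kurtosis(hypothetical_gds)
print(f"skewness g1 = {g1:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```

A positive g1 means the longer tail is toward high values and a negative g1 means it is toward low values; a perfectly symmetric group gives zero.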

Both approaches might be useful to gauge the effectiveness of a STR-only analysis. But in the end I certainly agree that YSNP testing should be incorporated as well, obviously.

Robert1
10-29-2017, 02:46 PM
Thank you for your presentation regarding problems with YSTR matching. I found it very interesting and watched the whole video. I suppose part of the problem may be FTDNA feels pressured to "deliver" matches so they enlarge the GD for a "match" but no one wants false positives as you waste so much time chasing the wrong rabbits down the hole.

I am glad I broke down and took the BigY SNP test as I was fortunate to get 3 solid matches with 0 known SNP variants. And I was able to contact all three and compare information. It is interesting to look at these men's Y37, Y67 and Y111 GD. FTDNA gives me 53 matches at Y67 but only 3 matter at this time, the three BigY matches. Now at Y111 I have only 2 matches and they also are BigY matches. (My third BigY match did not test at Y111.)

Of those 53 matches at Y67 there are a few more that likely would be BigY matches but certainly 40 or more of the 53 are not useful. The other side of the coin is that my 2 Y111 matches were at GD 9 and GD 10. If FTDNA had been stricter with an allowable match I would have lost two good matches at Y111. I can see FTDNA is walking a tightrope and fears throwing out the baby with the bath water.

Wing Genealogist
10-29-2017, 03:36 PM
Thank you for your presentation regarding problems with YSTR matching. I found it very interesting and watched the whole video. I suppose part of the problem may be FTDNA feels pressured to "deliver" matches so they enlarge the GD for a "match" but no one wants false positives as you waste so much time chasing the wrong rabbits down the hole.

With STR matching it works both ways, depending on your personal situation. For some folks, (those whose STR results are close to the modal value of an old haplogroup, like the WAMH), the matches can be very distant. However, for folks like myself (who happen to be off the beaten path, and also have a family which is currently demonstrating a higher than average mutation rate) some of my known family connections are not showing up as STR matches.

Dave-V
10-29-2017, 04:14 PM
With STR matching it works both ways, depending on your personal situation. For some folks, (those whose STR results are close to the modal value of an old haplogroup, like the WAMH), the matches can be very distant. However, for folks like myself (who happen to be off the beaten path, and also have a family which is currently demonstrating a higher than average mutation rate) some of my known family connections are not showing up as STR matches.

You get all kinds of in-between patterns too. One of my major subgroups in my surname project has virtually no variation in the 38-67 marker range, so what happens is matches just outside FTDNA’s range at 37 markers (like say at 4-6 GD at 37 markers) come back in as matches when they upgrade to 67 (because they go from say 5 GD at 37 to 5 or even 6 GD at 67), and then often disappear again when they upgrade to 111 because there is more variation there.
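
The in-and-out pattern described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the cutoffs (4 at 37 markers, 7 at 67, 10 at 111) are the commonly quoted ones, genetic distance is a plain sum of step differences rather than FTDNA's exact counting rules, and the two haplotypes are invented.

```python
# Assumed maximum GD that is still reported as a match, per panel size.
ASSUMED_CUTOFFS = {37: 4, 67: 7, 111: 10}


def genetic_distance(a, b, markers):
    """Sum of absolute step differences over the first `markers` markers."""
    return sum(abs(x - y) for x, y in zip(a[:markers], b[:markers]))


def match_status(a, b):
    """Return {panel size: True if the pair would show as a match}."""
    return {
        size: genetic_distance(a, b, size) <= cutoff
        for size, cutoff in ASSUMED_CUTOFFS.items()
    }


# Two invented 111-marker haplotypes.
tester_a = [12] * 111
tester_b = list(tester_a)
for i in (2, 9, 15, 22, 30):          # five one-step differences in markers 1-37
    tester_b[i] += 1
for i in (70, 75, 80, 85, 90, 100):   # six more in markers 68-111, none in 38-67
    tester_b[i] += 1

print(match_status(tester_a, tester_b))
```

With no variation in markers 38-67, the pair misses at 37 markers (GD 5), returns at 67 (still GD 5) and drops out again at 111 (GD 11).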

It all argues for a much more sophisticated matching approach than simple GD which in the end is Robert's point (besides using SNPs). If you're lucky enough to have a good project admin they can recognize the patterns and tailor their analysis appropriately but that's still more art than science.

RobertCasey
10-29-2017, 06:23 PM
With STR matching it works both ways, depending on your personal situation. For some folks, (those whose STR results are close to the modal value of an old haplogroup, like the WAMH), the matches can be very distant. However, for folks like myself (who happen to be off the beaten path, and also have a family which is currently demonstrating a higher than average mutation rate) some of my known family connections are not showing up as STR matches.

For the genetic overlap in pretty old haplogroups (2,500 to 5,000 years old), this has been called convergence. Convergence under R1b happens when testers are close to the WAMH signature. This only affects around five percent of testers but is very real for those five percent.

A much bigger issue happens in the more recent time frame, between the present and 2,500 years ago. This form of overlap is much more prevalent and is due to the fact that the distribution of genetic distance in the last 1,500 years is a bell curve that usually has a skew to the left and a fat tail on the left. As a result, many testers (20 to 30 %) under signature predictable YSNPs in the 1,500 to 2,500 year range have a genetic distance of only two to four from the signature. When this happens, false positives at 67 markers can go up to 80 to 90 % on a regular basis. This issue is being exposed by very prolific haplogroups in this time frame that have extensive NGS testing and extensive SNP pack testing. However, with a tighter genetic distance filter, reducing the cutoff from seven to four or five, the false positives become much more reasonable. I have been calling this a "lack of divergence" to have a different term for this time frame.

RobertCasey
10-29-2017, 06:42 PM
Dave V -

I thought initially that mutation rates, rarity of marker values and commonality of surnames would play a larger role. However, I am finding that these parameters are only needed as tie breakers for very small two marker signatures. Even overlap of two marker signatures is not that common in a chart of 600 testers. Obviously, two marker signatures associated with faster mutating markers are also the ones that have multiple occurrences. I still exclude CDY markers from charting as my calculations show that CDY mutations are almost as numerous as those of all the other 65 markers combined. With that kind of volatility, I do not use CDY markers in charting.

The lack of a normal distribution is a major factor but I am finding mutation rates (other than CDY) are not as important as I once believed. Once we go to 500 YSTR markers, it will be a very different story. We will probably be able to chart back another 1,000 to 1,500 years - IF we only use relatively slow mutating markers. If you greatly increase the number of markers and thereby include more markers with CDY-like rates, this could be very problematic.

I tried to avoid all the math needed to handle skew and tails, since the point of the video was to emphasize that YSNP testing is broadly revealing that the FTDNA (YSTR only) model is just too unreliable to depend upon. This results in the requirement for extensive YSNP testing to make up for the lack of reliability of YSTR only models. Another issue is the cost of exhaustively testing YSNPs down to your terminal YSNP (and verifying that you are negative for all of its sons as well). YSNP prediction and charting will greatly reduce overall YSNP testing costs since signatures do work well in the time frame down to the present.

RobertCasey
10-29-2017, 06:46 PM
Thank you for your presentation regarding problems with YSTR matching. I found it very interesting and watched the whole video. I suppose part of the problem may be FTDNA feels pressured to "deliver" matches so they enlarge the GD for a "match" but no one wants false positives as you waste so much time chasing the wrong rabbits down the hole.

I do not think FTDNA, the genetic community or the academics really knew just how bad YSTR only models would be. But now that we know better, it is up to us to encourage much more YSNP testing.

Dave-V
10-29-2017, 07:13 PM
I thought initially that mutation rates, rarity of marker values and commonality of surnames would play a larger role. However, I am finding that these parameters are only needed as tie breakers for very small two marker signatures. Even overlap of two marker signatures is not that common in a chart of 600 testers. Obviously, two marker signatures associated with faster mutating markers are also the ones that have multiple occurrences. I still exclude CDY markers from charting as my calculations show that CDY mutations are almost as numerous as those of all the other 65 markers combined. With that kind of volatility, I do not use CDY markers in charting.

No disagreement from me. I ignore single faster-mutating STRs for signature recognition in the higher branches of the tree but still use them if they occur in multiple-marker signatures and in lower-level branching since I still find them useful within more recent timeframes. But there doesn't seem to be a one-size-fits-all answer to this; it greatly depends on the timeframe being analyzed and by extension how much noise the faster-mutating markers are generating.

Cofgene
10-29-2017, 10:19 PM
For different lines and haplotree regions there will be different results and observations. Robert's situation is one dealing with a large 'recent' expansion within a short time frame. For other branches of the haplotree the situation just is different. As someone whose 67 STRs haven't changed for 8 generations, FTDNA's matching works great in identifying new relatives. Only by SNP testing have I been able to tease apart the STR data enough to show that I only have 1 or 2 STRs available to define specific related descendant branches. If I could ever get to 100 testers maybe something better will show but based upon the current samples that probably isn't the case. Going to 111 just brings in matches from individuals 1500-2200 years ago.

There were a couple of statements on the slides that I hope you have references to. They seem to be 'off' and I'll leave it at that. Is a normal distribution appropriate to describe the biological processes that you are evaluating? I'm 40 years away from my college stat classes but I'm not comfortable with the concept of fitting or correlating biological data to such a standard statistical model. For examples of modeling STR mutations and dependencies see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4863667/

RobertCasey
10-29-2017, 11:37 PM
Thank you for your presentation regarding problems with YSTR matching. I found it very interesting and watched the whole video. I suppose part of the problem may be FTDNA feels pressured to "deliver" matches so they enlarge the GD for a "match" but no one wants false positives as you waste so much time chasing the wrong rabbits down the hole.

I do not think FTDNA, the genetic community or the academics really knew just how bad YSTR only models would be. But now that we know better, it is up to us to encourage much more YSNP testing. But it is unlikely that FTDNA will alter their matching system since sometimes it works and other times the genetic distance criteria needs to be reduced.

RobertCasey
10-29-2017, 11:53 PM
For different lines and haplotree regions there will be different results and observations. Robert's situation is one dealing with a large 'recent' expansion within a short time frame. For other branches of the haplotree the situation just is different. As someone whose 67 STRs haven't changed for 8 generations, FTDNA's matching works great in identifying new relatives. Only by SNP testing have I been able to tease apart the STR data enough to show that I only have 1 or 2 STRs available to define specific related descendant branches. If I could ever get to 100 testers maybe something better will show but based upon the current samples that probably isn't the case. Going to 111 just brings in matches from individuals 1500-2200 years ago.

There were a couple of statements on the slides that I hope you have references to. They seem to be 'off' and I'll leave it at that. Is a normal distribution appropriate to describe the biological processes that you are evaluating? I'm 40 years away from my college stat classes but I'm not comfortable with the concept of fitting or correlating biological data to such a standard statistical model. For examples of modeling STR mutations and dependencies see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4863667/

We can only observe in great detail this "lack of divergence" that generates high false positives where there are very prolific numbers of offspring. That is why I switched to NPE rates for smaller haplogroups and clusters, to see if there is evidence of this issue for very small clusters (10 to 30), and it appears that the NPE rates are far exceeding 50 % at a genetic distance of only three and four.

The normal distribution is not a tool that you can create charts with, but it is very handy for identifying characteristics of your clusters so that you know what issues you are facing with your particular cluster. However, there is no doubt that binary logistic regression is the math that we should be using for two practical needs of genetic genealogists: 1) YSNP prediction; 2) charting using signatures. Neither usage of binary logistic regression works all the time since the degree of overlap between clusters varies a lot. However, for YSNP prediction, I think at least 80 % of testers can achieve over 95 % accuracy (the number of YSNP testers really does not need to be very large either). However, for charting, you really have to have significant YSNP testing to even be able to chart 80 % of the testers under a single signature predictable haplogroup (1,500 to 2,500 years old only, with current 67 marker FTDNA usage of YSTRs). Also, the accuracy of charting varies from 60 % to 95 % since the signature sizes are much smaller.

Wheal
10-30-2017, 01:00 AM
thanks Robert, was especially interesting to me as a beginner.

Jan_Noack
12-12-2017, 10:09 AM
Thanks Cofgene . "I'm 40 years away from my college stat classes but I'm not comfortable with the concept of fitting or correlating biological data to such a standard statistical model. For examples of modeling STR mutations and dependencies " , I concur from my 40+ years away uni days.

That pubmed article also mentions the selection of the STR in the panels leads to lower observed mutation rates than expected , as well as the sampling method of short reads (100bp to 150bp mostly) resulting in the elimination of the longer STR's (they've been chopped up), and it's the longer STR's that have the increased mutation rate. This was my understanding from a browse of the paper.

RobertCasey
12-17-2017, 04:37 PM
Thanks Cofgene . "I'm 40 years away from my college stat classes but I'm not comfortable with the concept of fitting or correlating biological data to such a standard statistical model. For examples of modeling STR mutations and dependencies "

YSNP prediction behaves very well statistically within the constraints of its assumptions: 1) it does not work at 99 % accuracy once you get close to genealogical times (1,200 years or so) since there is not enough time for statistical variation to appear; 2) it does not work for YSNP branches older than 2,500 years as the number of hidden mutations (via backwards mutations) becomes too large; 3) it works at much less than 99 % for haplogroups with significant convergence present (around ten percent); 4) in this case, it assumes that you are predicted to be Haplogroup R. It does work for charting (1,500 years and below), but as the signature size decreases, the accuracy declines to 60 % (or even much lower if you chart 100 % of testers, including those that have no matching signatures).

I just finished another round of statistical analysis using two variables: 1) signature size match; 2) genetic distance from the signature match. I downloaded a trial copy of SPSS (IBM statistical tool) and analyzed three well known YSNPs under R-L21: L226, L371 and L555. The models yield 100 % accuracy for all three YSNPs with Chi Square being 0.0000 and Significance being 1.0000 (accuracy). Here is the model used:

Prediction = e**(constant1 + Signature*constant2 + Genetic Distance*constant3) / ( 1 + e**(constant1 + Signature*constant2 + Genetic Distance*constant3) )

e is a constant like pi and is approximately 2.71828 (used in hundreds of scientific and engineering formulas). ** means raised to the power of the value in parentheses.



YSNP   Constant       Model       B values   Significance
L226   Constant       Constant1   -28.724    0.996
L226   Signature      Constant2    17.557    0.985
L226   Genetic Dist   Constant3    -8.007    0.968
L371   Constant       Constant1   -42.899    0.999
L371   Signature      Constant2     8.675    0.999
L371   Genetic Dist   Constant3    -1.510    0.999
L555   Constant       Constant1   -56.886    0.998
L555   Signature      Constant2    14.772    0.996
L555   Genetic Dist   Constant3    -5.178    0.993

The vast majority (80 % or higher) should have 100 % accuracy in YSNP prediction based on my previous experience with the L21 YSNP prediction tool. This version of prediction has its conditional boundary expanded from requiring L21 confirmed to only Haplogroup R predicted. Even this limit may not be necessary, as I just have not tried haplogroups other than haplogroup R.

Some other notable facts: 1) the accuracy of the constants will always be a little under 100 % as the model recognizes that these variables could change over time with additional input; 2) the constants are pretty consistent if you have the same size signature and the flex point is at the same signature value; 3) if you add more data in the transitional area (where negatives and positives are close to each other), the constants will change slightly; 4) you have to remember that this is a form fitting regression model that makes the constants fit the known tested data; 5) by removing faster mutating markers with 500 YSTRs, you could predict 1,000 to 2,000 years earlier (not using the FTDNA panels).
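
For anyone who wants to try the prediction formula and the three sets of constants listed above without a statistics package, here is a minimal Python sketch. The function name and the example inputs are made up for illustration; the post does not spell out how the Signature value is scored, so the numbers passed in below are placeholders only.

```python
from math import exp

# (Constant1, Constant2, Constant3) as listed in the post for each YSNP.
CONSTANTS = {
    "L226": (-28.724, 17.557, -8.007),
    "L371": (-42.899, 8.675, -1.510),
    "L555": (-56.886, 14.772, -5.178),
}


def predict(ysnp, signature, genetic_distance):
    """Evaluate e**z / (1 + e**z) with z = c1 + c2*signature + c3*GD."""
    c1, c2, c3 = CONSTANTS[ysnp]
    z = c1 + c2 * signature + c3 * genetic_distance
    # Two algebraically equal forms, chosen so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


# Placeholder inputs, not real tester data.
print(predict("L226", signature=2, genetic_distance=0))
print(predict("L226", signature=1, genetic_distance=0))
```

The result is always between 0 and 1, and it equals exactly 0.5 when the expression inside the exponent is zero.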

RobertCasey
01-15-2018, 03:58 AM
I have a small update to the model above. I went ahead and entered the L226 formula into my R-L226 spreadsheet (over 600 testers). For all verified testers, the accuracy was indeed 100 %. However, there are around four people in the spreadsheet who I am pretty confident are L226 positive, and they came back with a zero percent prediction of L226. I looked at the four exceptions and understand why they are zero - all four are untested and all four have a slightly higher genetic distance than any tested submission (from which the model is created). I reduced the influence of genetic distance by changing its (negative) constant from -8 to -6, and all four changed to 99 % or more predicted.

So if you do not test the boundary condition testers, the constants will change when more boundary condition testers are eventually tested. The form fitting algorithm changes all the constants to only get a formula that covers all tested data and then stops. It does not have any knowledge or care about the gray area that has not been tested. So this is a small weakness in this approach as it requires all your boundary condition testers to have tested. This is a tough thing to accomplish as these people are questionably related via genetic distance and signature matching - so they are reluctant to test since they will not be very close to other testers in the haplogroup and do not see the value of determining the extreme limits of your haplogroup. The haplogroup project will probably have to sponsor these kinds of tests.

This also means, the constants will change some over time. Unfortunately, you need an expensive copy of a statistical package to keep your constants updated when more extreme testers eventually test. On the plus side, this is only four testers being called wrong out of 635 testers. On another positive note, you can just make minor adjustments to the constants and quickly manually correct the model within reason.
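
To see what the manual change from -8 to -6 does, it helps to look at the genetic distance at which the prediction crosses 50 %. The sketch below is only an illustration: the function name is invented and the signature value of 4 is a hypothetical input, not one of the actual testers.

```python
def boundary_gd(c1, c2, c3, signature):
    """Genetic distance at which the predicted probability is exactly 0.5.

    The probability is 0.5 when c1 + c2*signature + c3*GD = 0,
    so GD = -(c1 + c2*signature) / c3 (c3 must be non-zero).
    """
    return -(c1 + c2 * signature) / c3


fitted = (-28.724, 17.557, -8.007)   # L226 constants from SPSS
adjusted = (-28.724, 17.557, -6.0)   # genetic distance constant changed by hand

signature = 4  # hypothetical signature value
print(boundary_gd(*fitted, signature))    # about 5.18
print(boundary_gd(*adjusted, signature))  # about 6.92
```

With these illustrative numbers, a tester at a genetic distance of 6 sits above the 50 % boundary for the fitted constants (predicted negative) and below it for the adjusted ones (predicted positive), which is the kind of flip described for the four boundary testers.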

Cofgene
01-15-2018, 10:21 AM
This also means, the constants will change some over time. Unfortunately, you need an expensive copy of a statistical package to keep your constants updated when more extreme testers eventually test. On the plus side, this is only four testers being called wrong out of 635 testers. On another positive note, you can just make minor adjustments to the constants and quickly manually correct the model within reason.

One does not need an expensive stat package. R is considered to be equivalent, if not more advanced in some areas, to SAS or SPSS. Firms have formally validated R over SAS or SPSS for regulatory data analysis use. Can you move your work over to the global standard of open source R-script?

RobertCasey
01-17-2018, 02:25 AM
One does not need an expensive stat package. R is considered to be equivalent, if not more advanced in some areas, to SAS or SPSS. Firms have formally validated R over SAS or SPSS for regulatory data analysis use. Can you move your work over to the global standard of open source R-script?

I have installed R on my PC and it is not exactly an end user tool. If you could step me through the process, I would create new models for several L21 YSNPs that I currently predict using signature as the only parameter. I have used Minitab and SPSS (not exactly user friendly either). This is the current model from SPSS using both signature match and genetic distance for the R-L226 prediction. It remains 100 % accurate for 250 positive testers and the closest 100 or so negative testers. However, four testers that I believe are L226 are predicted as negative. All four testers have just a slightly higher genetic distance but are not tested to date. Here is the model used in SPSS for two parameters and the constants generated by SPSS. This works with all of the haplogroup R data (54,000 testers at 67 markers):

Probability (L226) = e**(-28.724 + 17.557*Signature - 8.0069*GeneticDistance) / (1 + e**(-28.724 + 17.557*Signature - 8.0069*GeneticDistance))

I just changed the constant for Genetic Distance from -8.0069 to -6 and it now calls these four as positive while all the other, closer negative testers remain negative. I did have time to create models for L371 and L555 as well (prepping the data is a significant effort as there are a lot of new subbranches to extract). Unfortunately, Binary Logistic Regression assumes that all boundary condition testers are tested. It is quite easy to plot all positive and all negative manually but I would be a little hesitant to create a second model using predicted data as input (but this would work and not require constant updates). Then there are the boundary conditions where the results are mixed as genetic distance rises (no way to accurately chart these four or five testers out of 1,000 which are very predictable).

When you get to mixed results as genetic distance increases, individual YSTR values drive whether the results are positive or negative. I could assign weights to these markers to probably get prediction back to 100 % - but 99.5 % accurate is probably good enough, and getting these boundary condition testers to actually test is a rough task since their testing has minimal genealogical impact. However, I am quite pleased that all three YSNPs modeled currently have 100 % accuracy, which means this model is definitely the correct math to predict YSNPs based on YSTRs.

Cofgene
01-17-2018, 10:33 AM
Robert - I personally don't have enough solid R experience to translate models. The same stat functions in SAS or SPSS are present within the base build or as add-on modules from the R-repositories. 5 years ago I did some simple conversions from an old desktop stat program into R to facilitate method validation during an upgrade at work.

I would place a query to some of the larger discussion lists, or even the ISOGG Facebook group, asking for a statistician to help you with the task. There should be some researchers who are R savvy within the community who can help. You also might get help from some of the R user discussion lists.
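
Until someone R savvy steps in, here is a rough sketch of what fitting the two-parameter model looks like with free tools. It is written in plain Python rather than R, it is not the SPSS procedure, and it will not reproduce the SPSS constants. The training rows are invented (signature, genetic distance) pairs, and the function names, learning rate, step count and penalty are all assumptions. A small penalty is included because when positives and negatives are perfectly separated, an unpenalized fit has no finite best constants.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, labels, l2=0.001, rate=0.1, steps=5000):
    """Fit [constant1, constant2, ...] by gradient ascent on the mean
    log-likelihood, with a small L2 penalty on the non-intercept constants."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)  # weights[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            err = y - sigmoid(z)
            grad[0] += err
            for j, v in enumerate(x, start=1):
                grad[j] += err * v
        for j in range(n_features + 1):
            penalty = l2 * weights[j] if j > 0 else 0.0
            weights[j] += rate * (grad[j] / n - penalty)
    return weights


def predict(weights, x):
    """Predicted probability of being positive for one tester."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


# Invented training data: (signature, genetic distance) and tested result.
rows = [(4, 0), (4, 1), (4, 2), (3, 1), (3, 2), (4, 3),
        (1, 4), (0, 5), (1, 6), (2, 6), (0, 7), (2, 7), (1, 5), (0, 6)]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

weights = fit_logistic(rows, labels)
predictions = [predict(weights, x) for x in rows]
print(weights)
```

As with the SPSS model, the fitted constants only reflect the testers supplied, so untested boundary cases can still be called wrong until they are added to the training data.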