Update - R-Z253 YSNP prediction (Z253 is now 66 % predictable across 14 haplogroups)



RobertCasey
11-08-2020, 06:50 PM
I intentionally focused on this older major haplogroup to determine what percentage of Z253 could be predicted. Here is a summary of the haplogroups predicted (stated as the number of YSNP branches included in each predictable haplogroup):

L226 - 187
BY4086 - 73
S844 - 69
Subtotal (>50) - 329
Z17640 - 48
CTS9881 - 46
A503 - 28
Subtotal (25 - 49) - 122
BY414 - 17
A7037 - 14
BY411 - 13
BY4297 - 10
DF73 - 9
BY2848 - 9
L554 - 9
L643 - 5
Subtotal (<25) - 86

All of Z253 - 814 branches (FTDNA source)
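
The 66 % figure in the thread title can be reproduced from the counts above. The following is only a small sketch of that arithmetic; the variable names and the `tier` helper are my own labels for illustration, not part of any existing tool.

```python
# Branch counts per predictable haplogroup, copied from the list above.
branches = {
    "L226": 187, "BY4086": 73, "S844": 69,
    "Z17640": 48, "CTS9881": 46, "A503": 28,
    "BY414": 17, "A7037": 14, "BY411": 13, "BY4297": 10,
    "DF73": 9, "BY2848": 9, "L554": 9, "L643": 5,
}
Z253_TOTAL = 814  # all Z253 branches (FTDNA source)


def tier(count):
    """Assign a haplogroup to one of the three size tiers used in the list.

    Assumption: a count of exactly 50 falls in the middle tier; no
    haplogroup in the list has that value, so the choice does not matter here.
    """
    if count > 50:
        return ">50"
    if count >= 25:
        return "25-49"
    return "<25"


# Sum the branch counts within each tier.
subtotals = {}
for name, count in branches.items():
    key = tier(count)
    subtotals[key] = subtotals.get(key, 0) + count

predicted = sum(branches.values())
coverage = predicted / Z253_TOTAL

print(subtotals)
print(f"{predicted} of {Z253_TOTAL} branches = {coverage:.0%}")
```

The three subtotals add up to 537 predictable branches, which is where the 66 % of 814 comes from.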

There are pretty strict criteria for accurate YSNP prediction (otherwise the model accuracy can suffer significantly):

1) TMRCA must be in the 1500 to 2500 YBP range. There are very minor exceptions to this range.
2) There really needs to be between 10 and 25 branches, but pushing the lower bound is often not a big issue.
3) There needs to be between 50 and 100 testers for productive use of time and accuracy.
4) There needs to be around 10 to 20 branch equivalents in the two highest levels to gain genetic isolation,
but this can be pushed if the other criteria are strong.
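
For anyone who wants to screen candidate haplogroups against these four criteria, here is a rough sketch of how the check could be written. The `Haplogroup` record, its field names and the example values are all hypothetical (they are not real data and not taken from my database). Because the bounds are soft, the function only reports which criteria are outside their range rather than rejecting the haplogroup.

```python
from dataclasses import dataclass


@dataclass
class Haplogroup:
    # Hypothetical record layout, for illustration only.
    name: str
    tmrca_ybp: int              # TMRCA in years before present
    branches: int               # number of YSNP branches
    testers: int                # number of testers
    top_level_equivalents: int  # branch equivalents in the two highest levels


# Ranges taken from criteria 1) to 4) above, as (low, high) inclusive.
CRITERIA = {
    "tmrca_ybp": (1500, 2500),
    "branches": (10, 25),
    "testers": (50, 100),
    "top_level_equivalents": (10, 20),
}


def violations(h):
    """Return a description of every criterion that h falls outside of."""
    out = []
    for field, (lo, hi) in CRITERIA.items():
        value = getattr(h, field)
        if not lo <= value <= hi:
            out.append(f"{field}={value} outside {lo}-{hi}")
    return out


# Made-up example values, not a real haplogroup.
example = Haplogroup("EXAMPLE", tmrca_ybp=1800, branches=8,
                     testers=60, top_level_equivalents=12)
print(violations(example))
```

An empty list means all four criteria are met; a short list means the haplogroup may still be usable, with some expected loss of accuracy.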

I could probably push this from 66 % to 80 %, but the time involved for the smaller scope does not offer very
good economies of scale. It is probably a more productive use of my time to focus on larger predictable haplogroups.
The prediction model works very well for P312, L21, U106 and R1a. Creating and updating the database takes
around 90 % of the time; YSNP prediction and charting take only 10 %. Accuracy of YSNP prediction
is 99 to 100 % the vast majority of the time. The more you violate the criteria, the more the accuracy suffers (but
even 80 % accuracy is very useful, and it amounts to only 1 to 3 errors because of the smaller sample sizes).

Both automated data collection and automated discovery/analysis of YSNP prediction are possible if a programmer
would assist. Also, charting via SAPP has been extremely useful, even though it charts 100 % of testers. The limit of 600
testers is an issue, and the Y67/Y111 combination is also not as accurate. But it does an amazing job of
quickly identifying surname clusters and developing strong YSTR signatures. It is also quite amazing how consistently NEVGEN
and my binary logistic regression models predict. Both are constrained by the amount of time needed to collect the
data and create the models for each haplogroup.