
R1a - how to handle sortable string of YSNPs



RobertCasey
06-20-2016, 06:48 PM
I was not very familiar with R1a testing until recently. When FTDNA replaced the long version of the haplogroup (R1a1a1b1a3, etc.) with the short version (R-Z284, etc.), I lost the ability to sort haplogroups. So I determined the largest branches under R1a and created table lookups for all the terminal YSNPs for R1a. I ignored the chain of YSNPs, which is very long for R1a, and just set the first YSNP as one of the ten largest R1a haplogroups:

M420 - 1
M459 - 41
M512 - 2204 (obviously the most predictable older branch)
M417 - 80
Z645 - 176
Z283 - 562
Z282 - 346
Z280 - 566

And only two that were subbranches of the string of progression:

L664 - 122
Z284 - 278

After some minor investigation, I found that 100% of Z645 was either Z93 or Z283. Should I just
leave out Z645 as a major branch and only have Z283 and Z93? Also, Z92 is a pretty large branch
under Z280. How many submissions would qualify for adding Z92 as the first branch under R1a?
There are currently 95 67-marker submissions under Z280. Should anything over 100 get added
as a major branch, or what other criteria would anyone suggest?

For now, I have the above ten YSNPs as the first branch under the label R1a. Obviously, putting in the
entire progression of YSNPs would create too long a string to replace the older long versions of
the haplogroup character sequence. So the new labels are R1a-Z280>lower tested YSNPs, etc. I try
to keep the list to a maximum of six YSNPs in order to not consume too much width of the
spreadsheet.

If I used the entire progression of YSNPs, it would be the following just to get to Z280:

R1a-M420>M459>M512>M417>Z645>Z283>Z282>Z280

This would leave no room for younger YSNPs and this known progression really does not add
anything but the well documented progression that could be obtained from various haplotrees.
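
The lookup described above could be sketched roughly as follows. This is only a minimal illustration, not the actual spreadsheet logic: the function name `short_label`, the idea of anchoring at the youngest major branch found in a chain, and the rule of dropping the middle SNPs when the label exceeds six are all assumptions made for the example.

```python
# Full upstream chain quoted above (M420 down to Z280), oldest first.
TRUNK = ["M420", "M459", "M512", "M417", "Z645", "Z283", "Z282", "Z280"]

# The ten "first branch" anchors listed above.
MAJOR = {"M420", "M459", "M512", "M417", "Z645",
         "Z283", "Z282", "Z280", "L664", "Z284"}

MAX_SNPS = 6  # spreadsheet width limit mentioned above


def short_label(chain, major=MAJOR, max_snps=MAX_SNPS, prefix="R1a"):
    """Build a sortable label such as R1a-Z280>CTS1211>...

    chain is the full list of YSNPs, oldest first. The label starts at the
    youngest SNP in the chain that is a major branch (an assumption).
    """
    anchor_idx = max(
        (i for i, snp in enumerate(chain) if snp in major), default=None
    )
    if anchor_idx is None:
        raise ValueError("chain contains no major branch")
    kept = chain[anchor_idx:]
    if len(kept) > max_snps:
        # Assumed rule: keep the anchor plus the youngest SNPs.
        kept = [kept[0]] + kept[len(kept) - (max_snps - 1):]
    return prefix + "-" + ">".join(kept)


print(short_label(TRUNK))  # R1a-Z280
```

Because every label starts with one of a small set of anchors, a plain text sort on this column groups the rows by major branch.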

We do not have this issue as much under L21 since it has a huge starburst of major branches
under L21/DF13. R1a has a very bushy tree with a major trunk of YSNP progression.

Does anyone see any issues/disadvantages of using the ten major R1a branches as the
first branch under R1a? Remember, this string has to be short enough to fit into a spreadsheet
for sorting and is not intended to be the full chain of YSNP progression. It is intended to
be somewhat meaningful and not much larger than the older character version.

Another issue - FTDNA has dropped the R1a and R1b designations under their haplotree and
now only uses R. So the new R1a designation would be R-M420> which is longer and not as
descriptive as R1a. On the other end - R1b-L21, R1b-P312, etc. would become shorter
and lose very little value (R-L21, R-P312, etc.). I am sure that R1a would be preferred by
R1a testers but this seems to be not following the new FTDNA scheme, which would cause
confusion if we keep this old prefix. Most of R1b is now dropping the R1b in favor of R, so I
think this is a moot point - but are there any other issues/disadvantages with this change?

Michał
06-21-2016, 01:03 PM
Robert, I would recommend visiting our R1a project (https://www.familytreedna.com/public/R1a/default.aspx?section=yresults) at FTDNA to see how we handle this problem. Also, the Experimental YFull tree (or its R1a part (https://www.yfull.com/tree/R1a/)) should be very useful in this case.

Could you please let us know which particular FTDNA project you are working for? I think this information is crucial when choosing the best strategy for classifying the R1a results.


So I determined the largest branches under R1a and created table lookups for all the terminal YSNPs for R1a. I ignored the chain of YSNPs which is very long for R1a and just set the first YSNP as one of the ten largest R1a haplogroups:

M420 - 1
M459 - 41
M512 - 2204 (obviously the most predictable older branch)
M417 - 80
Z645 - 176
Z283 - 562
Z282 - 346
Z280 - 566

And only two that were subbranches of the string of progression:

L664 - 122
Z284 - 278
I am not sure what you mean by "the most predictable older branch". M512 (also known as M198 or M17) is indeed a standard prediction made by FTDNA, but this is neither the largest branch under R1a (this would be M459) nor the easiest branch to predict. Also, it encompasses >99% of all R1a members, so such prediction is not very informative. From the practical point of view, it is much more useful to distinguish 5 non-overlapping major branches within R1a that together encompass about 99% of all R1a members. These are:

CTS4385 (including its major subclade L664), or M420>M459>M198>M417>CTS4385
Z93, or M420>M459>M198>M417>Z645>Z93
Z280, or M420>M459>M198>M417>Z645>Z283>Z282>Z280
Y2395 (including its major subclade Z284), or M420>M459>M198>M417>Z645>Z283>Z282>Y2395
PF6155 (including its major subclade M458), or M420>M459>M198>M417>Z645>Z283>Z282>PF6155

Since CTS4385, Y2395 and PF6155 have been relatively recently discovered as clades parental to L664, Z284 and M458, respectively, the latter names remain much more commonly used. All these major branches are associated with quite specific geographical distributions, so one can also apply the geographical or ethnic names:

L664 - North-Western European
Z93 - Asian
Z280 - Central-Eastern European ("Balto-Slavic", though this should actually apply to its two major subclades CTS1211 and Z92 only)
Z284 - Scandinavian
M458 - Central European ("Slavic")
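
For anyone doing this in a script rather than by hand, the five-branch scheme above could be applied along these lines. It is just a sketch: the function name `classify` and the tuple layout are made up for the example, and the regional labels are simply the ones listed above.

```python
# Parent clade -> (commonly used name, regional label), as listed above.
MAJOR_BRANCHES = {
    "CTS4385": ("L664", "North-Western European"),
    "Z93": ("Z93", "Asian"),
    "Z280": ("Z280", "Central-Eastern European"),
    "Y2395": ("Z284", "Scandinavian"),
    "PF6155": ("M458", "Central European"),
}


def classify(chain):
    """Return (parent clade, common name, region) for a chain of YSNPs.

    chain is a list of SNP names, oldest first. The five branches do not
    overlap, so the first match is the only possible one. Returns None for
    results that sit above or outside all five branches.
    """
    for snp in chain:
        if snp in MAJOR_BRANCHES:
            common_name, region = MAJOR_BRANCHES[snp]
            return snp, common_name, region
    return None


chain = "M420>M459>M198>M417>Z645>Z283>Z282>Y2395>Z284".split(">")
print(classify(chain))  # ('Y2395', 'Z284', 'Scandinavian')
```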



After some minor investigation, I found that 100% of Z645 was either Z93 or Z283. Should I just
leave out Z645 as a major branch and only have Z283 and Z93?
I wouldn't do this, mostly because Z645 is the only indication that these two branches are more closely related to each other than to the more distantly related CTS4385/L664 branch. Also, we cannot rule out that there are some additional (yet unknown) subclades directly under Z645 (in addition to Z283 and Z93), so keeping Z645 will be crucial for a proper classification of such unexpected findings.



Also, Z92 is a pretty large branch under Z280. How many submissions would qualify for adding the Z92 as the first branch under R1a ?
The largest subclade under Z280 is definitely CTS1211 (not Z92). I would roughly estimate that CTS1211 encompasses about 2/3 of all Z280 members while Z92 corresponds to about 1/4 of the entire clade Z280. Both these subclades (as well as S24902, a third major subclade under Z280 that seems to be most common in Central-Western Europe) should be included in your classification (IMO).



There are currently 95 67-marker submissions under Z280. Should anything over 100 get added
as a major branch, or what other criteria would anyone suggest?
Well, it all depends on the context. For example, in a project that includes mostly people of Western European ancestry (let's say the Irishmen), distinguishing some very specific subclades under Z280, M458 or Z93 might be considered redundant, while this is definitely not redundant for people of Eastern European ancestry. Also, please note that the Asian branch Z93 encompasses more R1a members (worldwide) than all remaining branches taken together, yet it likely constitutes less than 20% of those R1a men who are customers of FTDNA.



Does anyone see any issues/disadvantages of using the ten major R1a branches as the
first branch under R1a? Remember, this string has to be short enough to fit into a spreadsheet
for sorting and is not intended to be the full chain of YSNP progression. It is intended to
be somewhat meaningful and not much larger than the older character version.
For a regional or surname project, I would use the five major branches (L664, Z93, Z280, Z284 and M458) as "reference points" for adding more downstream subclades. For example, I would use this string of SNPs for my own result:
R1a-Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>L1280>FGC19283>FGC19273, or
R1a>...>Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>L1280>FGC19283>FGC19273, or
R-M420>...>Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>L1280>FGC19283>FGC19273

Alternatively, one can use this shortened version:
R1a-Z280>...>L1280>FGC19283>FGC19273
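
The shortened form can be produced mechanically from the full chain. A minimal sketch, assuming the chain is held as a list ordered oldest first (the function name `abbreviate` and the `keep_last` parameter are invented for this example):

```python
def abbreviate(chain, anchor, keep_last=3, prefix="R1a"):
    """Return prefix-anchor>...>youngest SNPs, eliding the middle of the chain."""
    idx = chain.index(anchor)  # raises ValueError if the anchor is missing
    below = chain[idx + 1:]
    if len(below) <= keep_last:
        parts = [anchor] + below  # nothing to elide
    else:
        parts = [anchor, "..."] + below[len(below) - keep_last:]
    return prefix + "-" + ">".join(parts)


full = ("Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>"
        "L1280>FGC19283>FGC19273").split(">")
print(abbreviate(full, "Z280"))  # R1a-Z280>...>L1280>FGC19283>FGC19273
```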



Another issue - FTDNA has dropped the R1a and R1b designations under their haplotree and
now only uses R. So the new R1a designation would be R-M420> which is longer and not as
descriptive as R1a. On the other end - R1b-L21, R1b-P312, etc. would become shorter
and lose very little value (R-L21, R-P312, etc.). I am sure that R1a would be preferred by
R1a testers but this seems to be not following the new FTDNA scheme, which would cause
confusion if we keep this old prefix. Most of R1b is now dropping the R1b in favor of R, so I
think this is a moot point - but are there any other issues/disadvantages with this change?
I consider retaining the very short alphanumeric names (like R1a, R1b, I2a, I2b, G2a, J2a, E1b, etc) very helpful in all such classifications when different haplogroups are compared. This is of course much less useful in any specific haplogroup project, where all project members are R1a (or R1b, I2a, etc.).

RobertCasey
06-21-2016, 08:47 PM
Could you please let us know which particular FTDNA project you are working for? I think this information is crucial when choosing the best strategy for classifying the R1a results.



When FTDNA converted from the long format to the shorter single terminal YSNP designation, I could no longer easily separate all of the R haplogroup into the major haplogroup projects. Also, I had already expanded from L21 to P312, so I wanted to add U106 and R1a as I keep expanding the scope of my pulls. From my initial review of all of the R haplogroups, my L21 YSNP predictor tool no longer really needs L21 to be tested, but it does need to include both signature match and genetic distance. For instance, there is signature overlap between R1a and R1b-L21>L226 - but the genetic distance is around 30. I pull YDNA data around every three to six months and want to better reflect the new parts of the R haplogroup as well as share the data. There is also a need for standardized formats for YSNP data, and I do not want my L21 bias to miss out on the major differences that exist between major haplogroups. I am also assisting with the beta testing of the exciting new charting tool - SAPP (which currently is limited to L21 as well, but it could easily be expanded, as my predictor tool could be). This tool builds pretty accurate charts based on both YSTR and YSNP data. My first pass was able to predict one-third of the untested submissions under L226, which is a single signature YSNP that is around 1,500 years old.




I am not sure what you mean by "the most predictable older branch". M512 (also known as M198 or M17) is indeed a standard prediction made by FTDNA, but this is neither the largest branch under R1a (this would be M459) nor the easiest branch to predict. Also, it encompasses >99% of all R1a members, so such prediction is not very informative. From the practical point of view, it is much more useful to distinguish 5 non-overlapping major branches within R1a that together encompass about 99% of all R1a members. These are:

CTS4385 (including its major subclade L664), or M420>M459>M198>M417>CTS4385
Z93, or M420>M459>M198>M417>Z645>Z93
Z280, or M420>M459>M198>M417>Z645>Z283>Z282>Z280
Y2395 (including its major subclade Z284), or M420>M459>M198>M417>Z645>Z283>Z282>Y2395
PF6155 (including its major subclade M458), or M420>M459>M198>M417>Z645>Z283>Z282>PF6155

Since CTS4385, Y2395 and PF6155 have been relatively recently discovered as clades parental to L664, Z284 and M458, respectively, the latter names remain much more commonly used. All these major branches are associated with quite specific geographical distributions, so one can also apply the geographical or ethnic names:

L664 - North-Western European
Z93 - Asian
Z280 - Central-Eastern European ("Balto-Slavic", though this should actually apply to its two major subclades CTS1211 and Z92 only)
Z284 - Scandinavian
M458 - Central European ("Slavic")



Thanks for this detail as it really points out issues with my approach. First, I only looked at terminal YSNPs reported in the YSTR report. I also thought that the ISOGG tree would be good enough for older branches and then used the FTDNA haplotree when the ISOGG haplotree was incomplete for more recent branches. CTS4385, PF6155 and Y2395 are major branches for which FTDNA reports no terminal YSNPs, and the ISOGG haplotree does not include these three major branches. FTDNA has apparently added these YSNPs to their haplotree but has failed to add them to the terminal YSNP tree. You would think that these would be in sync.

Also, remember that this analysis is only based on YSTR report's terminal YSNP field - not any actual YSNP testing. So, I have added these major branches to the strings - even though no terminal YSNPs are reported for these branches (other than descendants of these branches). These YSNPs are pretty well tested (from YSNP reports):

CTS4385 - 75 positive and 36 negative
Y2395 - 155 positive and 20 negative
PF6155 - 65 positive and 5 negative




The largest subclade under Z280 is definitely CTS1211 (not Z92). I would roughly estimate that CTS1211 encompasses about 2/3 of all Z280 members while Z92 corresponds to about 1/4 of the entire clade Z280. Both these subclades (as well as S24902, a third major subclade under Z280 that seems to be most common in Central-Western Europe) should be included in your classification (IMO).



I already had 280 67-marker entries under CTS1211 at a lower level - Z280>CTS1211




Well, it all depends on the context. For example, in a project that includes mostly people of Western European ancestry (let's say the Irishmen), distinguishing some very specific subclades under Z280, M458 or Z93 might be considered redundant, while this is definitely not redundant for people of Eastern European ancestry. Also, please note that the Asian branch Z93 encompasses more R1a members (worldwide) than all remaining branches taken together, yet it likely constitutes less than 20% of those R1a men who are customers of FTDNA.



With only R1a - I could easily isolate all of R1a with just that short label. I am trying to create a better version of my current mapping. I take the terminal YSNPs and do a table lookup of sorts that has three characteristics: 1) reasonable length - maximum six YSNPs; 2) ability to sort to major branches; 3) include the most meaningful branches that help everyone identify which branch is being shown. My current preliminary format for L21 is R1b-L21 followed by one of the major starburst branches under L21/DF13; the third YSNP would be the single signature predictable YSNP branch (L226, M222, L555, etc.), and the last two are reserved for the ever growing number of recent branches (under 1,500 years of age). The lowest will always be the terminal YSNP. For R1a - the L21 equivalent (M420) is not needed, so this would allow two slots to sort out all the older branches (R1a really needs two slots for this, whereas L21 does have to be included for R1b since R1b does not equate to L21).




Alternatively, one can use this shortened version:
R1a-Z280>...>L1280>FGC19283>FGC19273



Here is my current version for L1280 (as usual FTDNA omits FGC & YSEQ branches as terminal YSNPs):



f182863/Rombel        f182863  Rombel        R1a-Z280>CTS1211>CTS3402>L1280
f113652/Oehlschlager  f113652  Oehlschlager  R1a-Z280>CTS1211>CTS3402>L1280
f198727/Milewski      f198727  Milewski      R1a-Z280>CTS1211>CTS3402>L1280
f245994/Focko         f245994  Focko         R1a-Z280>CTS1211>CTS3402>L1280
f219840/zUnkName      f219840  zUnkName      R1a-Z280>CTS1211>CTS3402>L1280
f291220/zUnkName      f291220  zUnkName      R1a-Z280>CTS1211>CTS3402>L1280
fE16468/Nikolenko     fE16468  Nikolenko     R1a-Z280>CTS1211>CTS3402>L1280
f182801/Guziewicz     f182801  Guziewicz     R1a-Z280>CTS1211>CTS3402>L1280



I would later fine tune this with actual YSNP testing results - but for now I am just trying to get the table lookup for terminal YSNPs to be much better. Is CTS1211, CTS3402 or L1280 a single signature YSNP branch that is predictable? This means no overlap with other R1a submissions and at least seven off-modal mutations from the M420 modal (not including CDY markers). The third YSNP would preferably be the predictable branch, but if there are no younger branches, I would still include other branches (and I do not know which are predictable as well as I did two years ago, when NGS tests just exploded for branch discovery).
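
The off-modal half of that test is easy to express in code. The sketch below is only an illustration and makes several assumptions: it counts markers that differ from the modal rather than mutation steps, it treats any marker whose name starts with "CDY" as excluded, it does not attempt the "no overlap with other R1a submissions" half, and the marker values are made up for the example (they are not a real M420 modal).

```python
def off_modal_count(haplotype, modal, exclude_prefixes=("CDY",)):
    """Count markers whose value differs from the modal, skipping CDY markers.

    Both arguments map marker name -> repeat value. Markers missing from the
    haplotype (not tested) are ignored.
    """
    count = 0
    for marker, modal_value in modal.items():
        if marker.startswith(exclude_prefixes):
            continue
        if marker not in haplotype:
            continue
        if haplotype[marker] != modal_value:
            count += 1
    return count


def meets_off_modal_threshold(haplotype, modal, threshold=7):
    """True if the signature has at least `threshold` off-modal markers."""
    return off_modal_count(haplotype, modal) >= threshold


# Illustrative values only, not real modal data.
modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "CDYa": 34, "CDYb": 35}
sample = {"DYS393": 13, "DYS390": 24, "DYS19": 15, "CDYa": 36, "CDYb": 35}
print(off_modal_count(sample, modal))  # 2
```

With 67 markers the same function applies unchanged; only the dictionaries get longer.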




I consider retaining the very short alphanumeric names (like R1a, R1b, I2a, I2b, G2a, J2a, E1b, etc) very helpful in all such classifications when different haplogroups are compared. This is of course much less useful in any specific haplogroup project, where all project members are R1a (or R1b, I2a, etc.).



For now, I also prefer to retain the R1a and R1b prefixes. However, it is not just R1a and R1b - there is also R, R1 and R2, which would remain confusing to most people. R2 is not much of an issue, but R and R1 would be confusing to many (I still keep seeing R and R1 submissions as a mistake and then verify yet again that these remain valid prefixes). There are not that many R and R1 submissions though, and most are probably just untested and poorly predicted by FTDNA (probably R1a or R1b if tested).

I am getting ready for another pull of data and just wanted the next iteration to be a major improvement for the new major haplogroups being pulled. My YSNP prediction tool could be easily expanded to R1a - but my current manual analysis is just too time consuming. It does appear that signature recognition could be automated using neural networks and AI languages - a skill probably lacking in most of us. I have a friend who uses this software for his job in process control automation - but he still has a very active day job for now. I also have another source that may be interested in automating this prediction methodology.

Here is a link to the R1a pull that I started in January (with some updates):

http://www.rcasey.net/DNA/Temp/R1a_Master_20160621B.xlsx

RVBLAKE
06-23-2016, 02:11 PM
Good Morning Michal,

I am awaiting the final results of my L664 Panel Test. It appears at this point that I may test positive for the SNP YP943. To date, I've tested positive for S2869, and my question is, is this a synonym for S2857? Ronald