
Thread: R1a - how to handle sortable string of YSNPs

  1. #1
    Registered Users
    Posts
    674
    Sex
    Location
    Texas
    Ethnicity
    English, Irish, German
    Nationality
    US
    Y-DNA (P)
    L21>L226>FGC5639

    England Germany Netherlands France Ireland Switzerland

    R1a - how to handle sortable string of YSNPs

    I was not very familiar with R1a testing until recently. When FTDNA replaced the long version of the haplogroup (R1a1a1b1a3, etc.) with the short version (R-Z284, etc.), I lost the ability to sort haplogroups. So I determined the largest branches under R1a and created table lookups for all the terminal YSNPs for R1a. I ignored the chain of YSNPs, which is very long for R1a, and just set the first YSNP of each label to one of the ten largest R1a haplogroups:

    M420 - 1
    M459 - 41
    M512 - 2204 (obviously the most predictable older branch)
    M417 - 80
    Z645 - 176
    Z283 - 562
    Z282 - 346
    Z280 - 566

    And only two that were subbranches of the string of progression:

    L664 - 122
    Z284 - 278

    After some minor investigation, I found that 100% of Z645 was either Z93 or Z283. Should I just
    leave out Z645 as a major branch and only have Z283 and Z93? Also, Z92 is a pretty large branch
    under Z280. How many submissions would qualify for adding Z92 as the first branch under R1a?
    There are currently 95 67-marker submissions under Z280. Should anything over 100 get added
    as a major branch, or what other criteria would anyone suggest?

    For now, I have the above ten YSNPs as the first branch under the label R1a. Obviously, putting in the
    entire progression of YSNPs would create too long a string to replace the older long versions of
    the haplogroup character sequence. So the new labels are R1a-Z280>lower tested YSNPs, etc. I try
    to keep the list to a maximum of six YSNPs in order not to consume too much width in the
    spreadsheet.

    If I used the entire progression of YSNPs, it would be the following just to get to Z280:

    R1a-M420>M459>M512>M417>Z645>Z283>Z282>Z280

    This would leave no room for younger YSNPs and this known progression really does not add
    anything but the well documented progression that could be obtained from various haplotrees.
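
A minimal sketch of the shortening described above, assuming the ten branches listed earlier are the allowed first YSNPs. The function name `short_label`, the rule for which YSNPs to drop when a label exceeds six, and the placeholder names `SNP1`..`SNP6` are illustrative assumptions, not part of the actual spreadsheet.

```python
# The ten major R1a branches used as the first YSNP of a label.
MAJOR_BRANCHES = {
    "M420", "M459", "M512", "M417", "Z645",
    "Z283", "Z282", "Z280", "L664", "Z284",
}


def short_label(chain, max_snps=6, prefix="R1a"):
    """Build a short sortable label from a full YSNP progression.

    chain: list of YSNP names ordered from oldest to youngest.
    The label starts at the deepest (youngest) major branch in the chain.
    Assumption: if more than max_snps remain, keep the major branch and
    the youngest (max_snps - 1) YSNPs so the terminal YSNP is always last.
    """
    if max_snps < 2:
        raise ValueError("max_snps must be at least 2")
    anchors = [i for i, snp in enumerate(chain) if snp in MAJOR_BRANCHES]
    if not anchors:
        raise ValueError("chain contains none of the major branches")
    kept = chain[anchors[-1]:]
    if len(kept) > max_snps:
        kept = [kept[0]] + kept[-(max_snps - 1):]
    return prefix + "-" + ">".join(kept)


full_chain = "M420>M459>M512>M417>Z645>Z283>Z282>Z280".split(">")
print(short_label(full_chain))
# Hypothetical downstream YSNPs, only to show the truncation rule.
print(short_label(full_chain + ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP6"]))
```

The eight-YSNP progression collapses to a label that starts at Z280, which leaves the remaining slots for younger YSNPs.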

    We do not have this issue as much under L21 since it has a huge starburst of major branches
    under L21/DF13. R1a has a very bushy tree with a major trunk of YSNP progression.

    Does anyone see any issues/disadvantages of using the ten major R1a branches as the
    first branch under R1a? Remember, this string has to be short enough to fit into a spreadsheet
    for sorting and is not intended to be the full chain of YSNP progression. It is intended to
    be somewhat meaningful and not much larger than the older character version.

    Another issue - FTDNA has dropped the R1a and R1b designations from their haplotree and
    now only uses R. So the new R1a designation would be R-M420>, which is longer and not as
    descriptive as R1a. On the other end, R1b-L21, R1b-P312, etc. would become shorter
    and lose very little value (R-L21, R-P312, etc.). I am sure that R1a would be preferred by
    R1a testers, but keeping this old prefix would not follow the new FTDNA scheme, which would cause
    confusion. Most of R1b is now dropping the R1b in favor of R, so I
    think this is a moot point - but are there any other issues/disadvantages with this change?
    Last edited by RobertCasey; 06-20-2016 at 07:09 PM.

  2. The Following 2 Users Say Thank You to RobertCasey For This Useful Post:

     Michał (06-21-2016),  parasar (06-20-2016)

  3. #2
    Registered Users
    Posts
    1,689
    Sex
    Location
    Warsaw, Poland
    Y-DNA (P)
    R1a-L1280>FGC41205
    mtDNA (M)
    H2a2(b)
    Y-DNA (M)
    R1a-L1029>YP517
    mtDNA (P)
    H5a2

    Poland European Union
    Robert, I would recommend visiting our R1a project at FTDNA to see how we handle this problem. Also, the Experimental YFull tree (or its R1a part) should be very useful in this case.

    Could you please let us know which particular FTDNA project you are working for? I think this information is crucial when choosing the best strategy for classifying the R1a results.

    Quote Originally Posted by RobertCasey View Post
    So I determined the largest branches under R1a and created table lookups for all the terminal YSNPs for R1a. I ignored the chain of YSNPs which is very long for R1a and just set the first YSNP as one of the ten largest R1a haplogroups:

    M420 - 1
    M459 - 41
    M512 - 2204 (obviously the most predictable older branch)
    M417 - 80
    Z645 - 176
    Z283 - 562
    Z282 - 346
    Z280 - 566

    And only two that were subbranches of the string of progression:

    L664 - 122
    Z284 - 278
    I am not sure what you mean by "the most predictable older branch". M512 (also known as M198 or M17) is indeed a standard prediction made by FTDNA, but this is neither the largest branch under R1a (this would be M459) nor the easiest branch to predict. Also, it encompasses >99% of all R1a members, so such prediction is not very informative. From the practical point of view, it is much more useful to distinguish 5 non-overlapping major branches within R1a that together encompass about 99% of all R1a members. These are:

    CTS4385 (including its major subclade L664), or M420>M459>M198>M417>CTS4385
    Z93, or M420>M459>M198>M417>Z645>Z93
    Z280, or M420>M459>M198>M417>Z645>Z283>Z282>Z280
    Y2395 (including its major subclade Z284), or M420>M459>M198>M417>Z645>Z283>Z282>Y2395
    PF6155 (including its major subclade M458), or M420>M459>M198>M417>Z645>Z283>Z282>PF6155

    Since CTS4385, Y2395 and PF6155 have been relatively recently discovered as clades parental to L664, Z284 and M458, respectively, the latter names remain much more commonly used. All these major branches are associated with quite specific geographical distributions, so one can also apply the geographical or ethnic names:

    L664 - North-Western European
    Z93 - Asian
    Z280 - Central-Eastern European ("Balto-Slavic", though this should actually apply to its two major subclades CTS1211 and Z92 only)
    Z284 - Scandinavian
    M458 - Central European ("Slavic")


    Quote Originally Posted by RobertCasey View Post
    After some minor investigation, I found that 100 % of Z645 was either Z93 or Z283. Should I just
    leave out Z645 as a major branch and only have Z283 and Z93 ?
    I wouldn't do this, mostly because Z645 is the only indication that these two branches are more closely related to each other than to the more distantly related CTS4385/L664 branch. Also, we cannot rule out that there are some additional (yet unknown) subclades directly under Z645 (in addition to Z283 and Z93), so keeping Z645 will be crucial for a proper classification of such unexpected findings.


    Quote Originally Posted by RobertCasey View Post
    Also, Z92 is a pretty large branch under Z280. How many submissions would qualify for adding the Z92 as the first branch under R1a ?
    The largest subclade under Z280 is definitely CTS1211 (not Z92). I would roughly estimate that CTS1211 encompasses about 2/3 of all Z280 members while Z92 corresponds to about 1/4 of the entire clade Z280. Both these subclades (as well as S24902, a third major subclade under Z280 that seems to be most common in Central-Western Europe) should be included in your classification (IMO).


    Quote Originally Posted by RobertCasey View Post
    There are currently 95 67 marker submissions under Z280. Should anything over 100 get added
    as a major branch or what other criteria would anyone suggest ?
    Well, it all depends on the context. For example, in a project that includes mostly people of Western European ancestry (let's say Irishmen), distinguishing some very specific subclades under Z280, M458 or Z93 might be considered redundant, while this is definitely not redundant for people of Eastern European ancestry. Also, please note that the Asian branch Z93 encompasses more R1a members (worldwide) than all remaining branches taken together, yet it likely constitutes less than 20% of those R1a men who are customers of FTDNA.


    Quote Originally Posted by RobertCasey View Post
    Does anyone see any issues/disadvantages of using the ten major R1a branches as the
    first branch under R1a? Remember, this string has to be short enough to fit into a spreadsheet
    for sorting and is not intended to be the full chain of YSNP progression. It is intended to
    be somewhat meaningful and not much larger than the older character version.
    For a regional or surname project, I would use the five major branches (L664, Z93, Z280, Z284 and M458) as "reference points" for adding more downstream subclades. For example, I would use this string of SNPs for my own result:
    R1a-Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>L1280>FGC19283>FGC19273, or
    R1a>...>Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>L1280>FGC19283>FGC19273, or
    R-M420>...>Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>L1280>FGC19283>FGC19273

    Alternatively, one can use this shortened version:
    R1a-Z280>...>L1280>FGC19283>FGC19273
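
The shortened form can be sketched as follows: keep the major branch, replace the middle of the chain with `...`, and keep the youngest few YSNPs. This is only an illustration; the function name `elided_label` and the default of three trailing YSNPs are assumptions chosen to reproduce the example string.

```python
def elided_label(chain, keep_tail=3, prefix="R1a"):
    """Return prefix-first>...>last few YSNPs.

    chain: YSNP names from the major branch down to the terminal YSNP.
    If the chain is already short enough, nothing is elided.
    """
    if len(chain) <= keep_tail + 1:
        return prefix + "-" + ">".join(chain)
    return prefix + "-" + ">".join([chain[0], "...", *chain[-keep_tail:]])


chain = "Z280>CTS1211>Y35>CTS3402>Y33>CTS8816>L1280>FGC19283>FGC19273".split(">")
print(elided_label(chain))
```

Because the label still begins with the major branch, it sorts next to other Z280 results even though the intermediate YSNPs are hidden.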


    Quote Originally Posted by RobertCasey View Post
    Another issue - FTDNA has dropped the R1a and R1b designations from their haplotree and
    now only uses R. So the new R1a designation would be R-M420>, which is longer and not as
    descriptive as R1a. On the other end, R1b-L21, R1b-P312, etc. would become shorter
    and lose very little value (R-L21, R-P312, etc.). I am sure that R1a would be preferred by
    R1a testers, but keeping this old prefix would not follow the new FTDNA scheme, which would cause
    confusion. Most of R1b is now dropping the R1b in favor of R, so I
    think this is a moot point - but are there any other issues/disadvantages with this change?
    I consider retaining the very short alphanumeric names (like R1a, R1b, I2a, I2b, G2a, J2a, E1b, etc) very helpful in all such classifications when different haplogroups are compared. This is of course much less useful in any specific haplogroup project, where all project members are R1a (or R1b, I2a, etc.).
    Last edited by Michał; 06-21-2016 at 01:06 PM.

  4. The Following 7 Users Say Thank You to Michał For This Useful Post:

     AJL (06-21-2016),  Amerijoe (06-21-2016),  KSDA (06-21-2016),  parasar (06-21-2016),  Pribislav (06-22-2016),  RobertCasey (06-22-2016),  Volat (06-21-2016)

  5. #3
    Registered Users
    Posts
    674
    Sex
    Location
    Texas
    Ethnicity
    English, Irish, German
    Nationality
    US
    Y-DNA (P)
    L21>L226>FGC5639

    England Germany Netherlands France Ireland Switzerland
    Quote Originally Posted by Michał View Post

    Could you please let us know which particular FTDNA project you are working for? I think this information is crucial when choosing the best strategy for classifying the R1a results.
    When FTDNA converted from the long format to the shorter single terminal YSNP designation, I could no longer easily separate the R haplogroup into the major haplogroup projects. Also, I had already expanded from L21 to P312, so I wanted to add U106 and R1a as I keep expanding the scope of my pulls. From my initial review of all the R haplogroups, my L21 YSNP predictor tool no longer really needs L21 to be tested, but it does need to include both signature match and genetic distance. For instance, there is signature overlap between R1a and R1b-L21>L226, but the genetic distance is around 30. I pull YDNA data around every three to six months and want to better reflect the new parts of the R haplogroup as well as share the data. There is also a need for standardized formats for YSNP data, and I do not want my L21 bias to make me miss the major differences that exist between major haplogroups. I am also assisting with the beta testing of the exciting new charting tool, SAPP (which is currently limited to L21 as well, but it could easily be expanded, just as my predictor tool could be). This tool builds pretty accurate charts based on both YSTR and YSNP data. My first pass was able to predict one-third of the untested submissions under L226, which is a single signature YSNP that is around 1,500 years old.
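
For readers less familiar with the term, genetic distance can be sketched as a sum of per-marker step differences between two YSTR haplotypes. This is a simplified stepwise count under stated assumptions: the marker values below are made up for illustration, and FTDNA's own calculation may treat some markers differently.

```python
def genetic_distance(a, b):
    """Sum of absolute repeat-count differences over shared markers.

    a, b: dicts mapping YSTR marker name -> repeat count (int).
    Markers missing from either haplotype are ignored.
    """
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


# Hypothetical values, not taken from any real submission.
hap_1 = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
hap_2 = {"DYS393": 13, "DYS390": 25, "DYS19": 16}
print(genetic_distance(hap_1, hap_2))
```

A signature match on a handful of markers can coincide with a large distance across all 67 markers, which is why both checks are needed.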

    Quote Originally Posted by Michał View Post

    I am not sure what you mean by "the most predictable older branch". M512 (also known as M198 or M17) is indeed a standard prediction made by FTDNA, but this is neither the largest branch under R1a (this would be M459) nor the easiest branch to predict. Also, it encompasses >99% of all R1a members, so such prediction is not very informative. From the practical point of view, it is much more useful to distinguish 5 non-overlapping major branches within R1a that together encompass about 99% of all R1a members. These are:

    CTS4385 (including its major subclade L664), or M420>M459>M198>M417>CTS4385
    Z93, or M420>M459>M198>M417>Z645>Z93
    Z280, or M420>M459>M198>M417>Z645>Z283>Z282>Z280
    Y2395 (including its major subclade Z284), or M420>M459>M198>M417>Z645>Z283>Z282>Y2395
    PF6155 (including its major subclade M458), or M420>M459>M198>M417>Z645>Z283>Z282>PF6155

    Since CTS4385, Y2395 and PF6155 have been relatively recently discovered as clades parental to L664, Z284 and M458, respectively, the latter names remain much more commonly used. All these major branches are associated with quite specific geographical distributions, so one can also apply the geographical or ethnic names:

    L664 - North-Western European
    Z93 - Asian
    Z280 - Central-Eastern European ("Balto-Slavic", though this should actually apply to its two major subclades CTS1211 and Z92 only)
    Z284 - Scandinavian
    M458 - Central European ("Slavic")
    Thanks for this detail, as it really points out issues with my approach. First, I only looked at terminal YSNPs reported in the YSTR report. I also thought that the ISOGG tree would be good enough for older branches and then used the FTDNA haplotree when the ISOGG haplotree was incomplete for more recent branches. CTS4385, PF6155 and Y2395 are major branches for which FTDNA reports no terminal YSNPs, and the ISOGG haplotree does not include these three major branches. FTDNA has apparently added these YSNPs to their haplotree but has failed to add them to their terminal YSNP tree. You would think that these would be in sync.

    Also, remember that this analysis is only based on YSTR report's terminal YSNP field - not any actual YSNP testing. So, I have added these major branches to the strings - even though no terminal YSNPs are reported for these branches (other than descendants of these branches). These YSNPs are pretty well tested (from YSNP reports):

    CTS4385 - 75 positive and 36 negative
    Y2395 - 155 positive and 20 negative
    PF6155 - 65 positive and 5 negative

    Quote Originally Posted by Michał View Post

    The largest subclade under Z280 is definitely CTS1211 (not Z92). I would roughly estimate that CTS1211 encompasses about 2/3 of all Z280 members while Z92 corresponds to about 1/4 of the entire clade Z280. Both these subclades (as well as S24902, a third major subclade under Z280 that seems to be most common in Central-Western Europe) should be included in your classification (IMO).
    I already had 280 67-marker entries under CTS1211 at a lower level - Z280>CTS1211

    Quote Originally Posted by Michał View Post

    Well, it all depends on the context. For example, in a project that includes mostly people of Western European ancestry (let's say Irishmen), distinguishing some very specific subclades under Z280, M458 or Z93 might be considered redundant, while this is definitely not redundant for people of Eastern European ancestry. Also, please note that the Asian branch Z93 encompasses more R1a members (worldwide) than all remaining branches taken together, yet it likely constitutes less than 20% of those R1a men who are customers of FTDNA.
    With just the R1a prefix, I could easily isolate all of R1a. I am trying to create a better version of my current mapping. I take the terminal YSNPs and do a table lookup of sorts that has three characteristics: 1) reasonable length - a maximum of six YSNPs; 2) the ability to sort into major branches; 3) inclusion of the most meaningful branches, which helps everyone identify which branch is being shown. My current preliminary format for L21 is R1b-L21 followed by one of the major starburst branches under L21/DF13; the third YSNP would be the single signature predictable YSNP branch (L226, M222, L555, etc.), and the last two are reserved for the ever growing number of recent branches (under 1,500 years of age). The lowest will always be the terminal YSNP. For R1a, the L21 equivalent (M420) is not needed, so this would allow two branches to sort out all the older branches. R1a really needs two slots for this, whereas L21 does have to be included since R1b does not equate to L21.
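
The table lookup can be sketched as a plain dictionary from terminal YSNP to label, followed by a text sort on the label. This is a rough illustration only: the kit names, the `label_for` and `sort_rows` helpers, and the fallback label for unknown terminal YSNPs are hypothetical, while the label strings follow the format used in this thread.

```python
# Terminal YSNP -> short sortable label.
LABELS = {
    "Z280": "R1a-Z280",
    "CTS1211": "R1a-Z280>CTS1211",
    "L1280": "R1a-Z280>CTS1211>CTS3402>L1280",
    "Z284": "R1a-Z284",
}


def label_for(terminal):
    """Look up the label; unknown terminal YSNPs get a hypothetical fallback."""
    return LABELS.get(terminal, "R1a-unassigned>" + terminal)


def sort_rows(rows):
    """rows: list of (kit, terminal YSNP); returns rows ordered by label."""
    return sorted(rows, key=lambda row: label_for(row[1]))


rows = [("kitA", "L1280"), ("kitB", "Z284"), ("kitC", "Z280"), ("kitD", "CTS1211")]
for kit, terminal in sort_rows(rows):
    print(kit, label_for(terminal))
```

Because each label starts with its parent's label, a plain text sort places a branch directly ahead of its own descendants, which is the same behavior a spreadsheet column sort gives.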

    Quote Originally Posted by Michał View Post

    Alternatively, one can use this shortened version:
    R1a-Z280>...>L1280>FGC19283>FGC19273
    Here is my current version for L1280 (as usual FTDNA omits FGC & YSEQ branches as terminal YSNPs):

    f182863/Rombel        f182863  Rombel        R1a-Z280>CTS1211>CTS3402>L1280
    f113652/Oehlschlager  f113652  Oehlschlager  R1a-Z280>CTS1211>CTS3402>L1280
    f198727/Milewski      f198727  Milewski      R1a-Z280>CTS1211>CTS3402>L1280
    f245994/Focko         f245994  Focko         R1a-Z280>CTS1211>CTS3402>L1280
    f219840/zUnkName      f219840  zUnkName      R1a-Z280>CTS1211>CTS3402>L1280
    f291220/zUnkName      f291220  zUnkName      R1a-Z280>CTS1211>CTS3402>L1280
    fE16468/Nikolenko     fE16468  Nikolenko     R1a-Z280>CTS1211>CTS3402>L1280
    f182801/Guziewicz     f182801  Guziewicz     R1a-Z280>CTS1211>CTS3402>L1280

    I would later fine-tune this with actual YSNP testing results, but for now I am just trying to get the table lookup for terminal YSNPs to be much better. Is CTS1211, CTS3402 or L1280 a single signature YSNP branch that is predictable? This means no overlap with other R1a submissions and at least seven off-modal mutations from the M420 modal (not including CDY markers). The third YSNP would preferably be the predictable branch, but if there are no younger branches, I would still include other branches (and I do not know which are predictable as well as I did two years ago, when NGS tests just exploded for branch discovery).
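
The off-modal part of that test can be sketched as below. This covers only the threshold of seven off-modal markers with CDY excluded; the check for no overlap with other R1a submissions is not shown. The function names and all marker values are hypothetical, and the sketch counts markers that differ rather than individual mutation steps.

```python
def off_modal_markers(signature, modal, excluded_prefixes=("CDY",)):
    """Return the markers whose value differs from the modal.

    signature, modal: dicts mapping YSTR marker name -> repeat count.
    Markers whose name starts with an excluded prefix are skipped.
    """
    return sorted(
        m for m in signature.keys() & modal.keys()
        if not m.startswith(excluded_prefixes) and signature[m] != modal[m]
    )


def meets_off_modal_threshold(signature, modal, min_off_modal=7):
    """True if the branch signature has enough off-modal markers."""
    return len(off_modal_markers(signature, modal)) >= min_off_modal


# Hypothetical values for illustration only.
modal = {"DYS393": 13, "DYS390": 25, "CDYa": 35}
branch = {"DYS393": 14, "DYS390": 25, "CDYa": 36}
print(off_modal_markers(branch, modal))
print(meets_off_modal_threshold(branch, modal))
```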

    Quote Originally Posted by Michał View Post

    I consider retaining the very short alphanumeric names (like R1a, R1b, I2a, I2b, G2a, J2a, E1b, etc) very helpful in all such classifications when different haplogroups are compared. This is of course much less useful in any specific haplogroup project, where all project members are R1a (or R1b, I2a, etc.).
    For now, I also prefer to retain the R1a and R1b prefixes. However, it is not just R1a and R1b - there are also R, R1 and R2, which would remain confusing to most people. R2 is not much of an issue, but R and R1 would be confusing to many (I still keep reading R and R1 submissions as mistakes and then verify yet again that these remain valid prefixes). There are not that many R and R1 submissions, though, and most are probably just untested and poorly predicted by FTDNA (they would probably be R1a or R1b if tested).

    I am getting ready for another pull of data and just wanted the next iteration to be a major improvement for the new major haplogroups being pulled. My YSNP prediction tool could easily be expanded to R1a, but my current manual analysis is just too time consuming. It does appear that signature recognition could be automated using neural networks and AI languages, a skill probably lacking in most of us. I have a friend who uses this software for his job in process control automation, but he still has a very active day job for now. I also have another source who may be interested in automating this prediction methodology.

    Here is a link to the R1a pull that I started in January (with some updates):

    http://www.rcasey.net/DNA/Temp/R1a_M...20160621B.xlsx

  6. The Following User Says Thank You to RobertCasey For This Useful Post:

     parasar (06-21-2016)

  7. #4
    Gold Class Member
    Posts
    627
    Sex
    Omitted
    Location
    Wisconsin
    Ethnicity
    English/French
    Nationality
    American
    Y-DNA (P)
    R1a>L664>YP943
    mtDNA (M)
    H4a1
    Y-DNA (M)
    J-M172

    United States of America Canada Quebec Acadia Denmark Switzerland England
    Good Morning Michal,

    I am awaiting the final results of my L664 Panel Test. It appears at this point that I may test positive for the SNP YP943. To date, I've tested positive for S2869, and my question is, is this a synonym for S2857? Ronald
