Have I understood long hand yDNA nomenclature?



Almagest
06-02-2016, 11:03 PM
If E-V32 is E1b1b1a1a1b (on ISOGG 2016);

E-Y15945 (YFull) should be E1b1b1a1a1b1 (unlike ISOGG 2016)
E-Y17750 should be E1b1b1a1a1b1a
E-Z813 should be E1b1b1a1a1b1b
Etc?
I think I'm missing something because with this method I would not need to use numbers 2 and 3 in the naming of a tree, but of course this does happen.

Also, shouldn't ISOGG be up to date with these kinds of things?

Almagest
06-10-2016, 03:20 PM
Actually, looking back at it, it does seem right.

The numbers and letters alternate.

Just like I used the A's and B's to show that two subclades are from the same SNP, I would do the same with the 1's and 2's, if it was their 'turn' for choosing.

So if E-Z813 is E1b1b1a1a1b1b;

E-Z809 (downstream of E-Z813) would be E1b1b1a1a1b1b1
E-Y17859 (also downstream of E-Z813) would be E1b1b1a1a1b1b2

https://www.yfull.com/tree/E-V32/
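
The alternating rule described above can be sketched in a few lines of Python. This is only an illustration of the pattern in this post, not ISOGG's actual procedure: the function name `child_label` is hypothetical, and the order in which sibling branches receive `a`/`b` or `1`/`2` is an assumption taken from the examples given here.

```python
import string


def child_label(parent: str, index: int) -> str:
    """Return the longhand label of the index-th (0-based) child of `parent`.

    Letters and numbers alternate: if the parent label ends in a digit the
    child gets a letter, otherwise it gets a number.
    """
    if parent[-1].isdigit():
        # Parent ends in a number, so it is the letters' turn.
        if index >= len(string.ascii_lowercase):
            raise ValueError("this sketch handles at most 26 lettered children")
        return parent + string.ascii_lowercase[index]
    # Parent ends in a letter, so it is the numbers' turn.
    return parent + str(index + 1)


e_v32 = "E1b1b1a1a1b"                 # E-V32 on ISOGG 2016, per the post
e_y15945 = child_label(e_v32, 0)      # E1b1b1a1a1b1
e_y17750 = child_label(e_y15945, 0)   # E1b1b1a1a1b1a
e_z813 = child_label(e_y15945, 1)     # E1b1b1a1a1b1b
e_z809 = child_label(e_z813, 0)       # E1b1b1a1a1b1b1
e_y17859 = child_label(e_z813, 1)     # E1b1b1a1a1b1b2
```

The numbers 2 and 3 only appear when a node has a second or third sibling at a level where it is the numbers' turn, which is why they did not come up in the first set of examples.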

gotten
06-10-2016, 03:35 PM
I prefer shorthand but for the conversions you can use http://dna.scangen.se/index.php?show=tools&lang=en. In my opinion, the year should be mentioned with longhand notation to avoid confusion (at least for the deep branches that change the most).

GTC
06-10-2016, 03:43 PM
Almagest wrote: "Also shouldn't ISOGG be up-to-date with these kind of things?"

ISOGG is stubbornly adhering to the alphabet soup for its haplotree, I guess mainly because it stores and displays the tree in plain old HTML. Unwieldy as the alphabet soup has become, it presumably simplifies the effort of maintaining the tree in that very basic format.

There was once talk of its developing a database to store the tree but, for whatever reason, that never came to anything.

Theramster
08-22-2016, 07:10 PM
I want to start a new thread to discuss the possibility of adopting a new, better nomenclature for Y-DNA haplogroups. I don't see the usefulness of the longhand method (different people use it differently and it gets confusing). The shorthand only gives a terminal SNP without informing about its path. A better method would use letters minimally and rely only on numbers to give the path, with the name of the terminal SNP included in parentheses. As trees get updated, the numbering will change. As trees get longer, letters should be used to denote the long common root. Basal groups must be marked with unique letters rather than *. Just preliminary thoughts.

RobertCasey
08-25-2016, 04:24 AM
The terminal YSNP is a very useful summary if you are very familiar with the part of the haplotree that you are analyzing. However, there are some extreme issues even with this YSNP designation: 1) FTDNA's list has around a 75% error rate on younger YSNPs, where FTDNA is just randomly assigning private YSNPs as branches, and it routinely drops solid branches with BY YSNPs. 2) Even Alex's summary lacks coverage beyond P312 and is missing all the branches discovered by individual testing and, soon, by SNP pack testing, which now includes dozens of private YSNPs. Also, Alex does not get all of the NGS files (but does get most of them for P312). 3) The YFULL haplotree just generates yet more Y labels without trying to preserve any existing labels. I sent in my files from Full Genomes, which had summaries where FGC had assigned FGC labels, and YFULL dropped those and replaced them with their Y labels.

But we also need a descriptive field that shows some progression of YSNPs and that is sortable, so that results can be organized. Since FTDNA replaced their terminal YSNP with the most recent YSNP, it is now impossible to sort the terminal YSNP and any progression of YSNPs is now gone. I started using a string of YSNPs to replace the long version. I decided that one prefix and five YSNPs would be needed to show the progression well enough (example for R1b):

1) R1b or R1a would be the prefix to let you know which major haplotree branch is being tracked.
2) Then the major haplogroups, based mainly on the major haplogroup projects: L21, DF27, U106, P312 (only those not under others already on the list), U152 and several others of smaller scope that do not fit under these broad categories.
3) Next would be the larger branches that are well known and have a large number of tests, along with smaller scope YSNPs that do not fit under major branches: Z253, Z255, DF21, DF41, DF13 (those that are not part of the others), etc.
4) Next would be the YSNPs that are predictable, which are around 1,200 to 2,500 years of age, such as M222, L1335, L193, L226, etc., along with some smaller ones like L371 that have been around and are well known.
5) Major branches under the predictable YSNPs that break them up into major groupings. Under L226, this would include FGC5660, FGC5628, FGC5659 and maybe some smaller scope branches that have several branches, such as DC33 and DC28.
6) Then there would be the terminal YSNP that is decided by researchers and not FTDNA.

Note that R1a does not have the starburst of branches under DF13, and R1a people would get two levels above the predictable YSNP branch. The key here is the predictable branches (those that have one solid signature that can be used to predict the YSNP based on 67 markers). Then you have two YSNPs above and two recent YSNPs below, plus the prefix of R1b or R1a, which is a well-established prefix.
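
As a rough sketch of how such a string could be built and sorted, the Python below joins a prefix and up to five YSNPs. The `>` separator, the function name `ysnp_string` and the kit names are assumptions made for illustration, and the example chains are assembled from the YSNPs named in the list above rather than taken from any project's data.

```python
def ysnp_string(prefix: str, ysnps: list[str]) -> str:
    """Join a prefix (e.g. R1b) and one to five YSNPs, the last being terminal."""
    if not 1 <= len(ysnps) <= 5:
        raise ValueError("expected between one and five YSNPs after the prefix")
    return ">".join([prefix, *ysnps])


# Hypothetical kits; the chains only illustrate the levels described above.
rows = [
    ("kit3", ysnp_string("R1b", ["L21", "Z253", "L226", "FGC5660"])),
    ("kit1", ysnp_string("R1b", ["L21", "DF13", "M222"])),
    ("kit4", ysnp_string("R1b", ["U106"])),
    ("kit2", ysnp_string("R1b", ["L21", "Z253", "L226", "FGC5628"])),
]

# Sort level by level rather than as one flat string, so that a short
# name such as "L2" can never be interleaved with "L21".
rows.sort(key=lambda row: row[1].split(">"))

for kit, label in rows:
    print(kit, label)
```

Sorting on the split list keeps every kit under the same signpost together, which is the point of having a sortable field in the first place.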

miiser
08-25-2016, 06:01 AM
I like this approach. This is basically the approach that I've ended up using in my projects, by a process of natural development.

Too often, when people get into the topic of nomenclature, they get caught up in a philosophical discussion of aesthetic perfection, wanting to explicitly list every single branching node, when all we really need is a practical approach. The point of nomenclature is so that one person can use a label, and other people with a reasonable degree of familiarity will know immediately, or be able to easily find, the haplogroup being discussed. It is not necessary to explicitly list every single branching node in order to accomplish this. All that's needed is easily recognizable "signpost" labels at major branches to point one in the right direction. Your approach accomplishes this.

Ideally, the number of branches at each "signpost" should be on the order of 10 - few enough that a person familiar with the topic can know of all the major branches within their region of the tree, but sufficiently many to get to the terminal label using a reasonably small number of signposts.

A bonus advantage of this approach is that non-unique SNPs, such as L1066, can still be used as haplogroup identifiers without difficulty. They are easily and uniquely identified by use of the upstream signposts.

Regarding the topic of "basal groups" mentioned by Theramster - I'm surprised and annoyed at how often this idea of "basal groups" gets misunderstood and misrepresented in this forum. There is nothing special or unique about a "basal" group. They are not some ancient relic fossil lineage, as they sometimes get spoken of. They are simply a modern branch, same as all the others, from which only one sample has been tested so far. If that sample gets their brother tested, suddenly they are no longer a "basal" group. At one point in time early in Y-DNA testing, there was just a single Z253 sample tested, and at that point in time it was a "basal" lineage of L21. And then more Z253 people got tested.

The * simply means that there are no shared SNPs, yet discovered, below the terminal listed SNP. In a sense, this label is really not strictly necessary for haplogroup identification. One might conceivably leave off the *, and just take it for granted that the last listed SNP is, in fact, the individual's "terminal" shared SNP. But the problem is that, inevitably, someone will ask, "Did you forget to list this person's branch below L21?" So, in order to avoid this confusion, the * label is a practical way to denote, "We didn't just neglect to list the downstream branches - this person really does not yet have any shared SNPs discovered below this SNP." The other source of potential confusion is when a person's haplogroup says only "L21", for example - is this because they haven't tested for anything below L21, or because they tested negative for all the previously known branches below L21? Thus, the * is a convenient way to denote that the person has been tested and found negative for previously known branches. The downside is that the * does tend to encourage people to speak strangely of such "basal" branches, as if they are somehow special and different from the previously known major branches.

Logically, it would make sense to put the *, or a similar substitute, at the end of EVERY haplogroup label, to signify this is the terminal shared SNP, and there is nothing more below this within the current database. I think if I were the manager of a haplogroup project or tree, this would be my favored approach. This would make it explicit that you are listing the "terminal" shared SNP, and would discourage the perception of "basal" branches as something special.

There is still maybe an issue of how to present these solo branches within trees. In trees such as Alex's Big Tree, they are clumped under a single collection as a matter of convenience. This can be misleading if one is not careful in their interpretation, because it makes the tree appear less bushy than it actually is. Suppose a node has 2 major branches, and 5 individuals grouped under the "basal" branch. One might be tempted to think that there are only 3 branches from the top node, when in fact there are 7. The alternative is to list every sample within its own private branch, so that the presentation will visually reflect the true bushiness of the tree. But this might become unwieldy, because the width of each tree would increase several times its current size.

If I were in charge, I think my approach would be to put an explicit terminator, such as * or #, at the end of every SNP sequence, to avoid ambiguity. And on tree graphics, for the sake of clarity, I would give each individual their own branch rather than clump "basal" branches together.

There really ought to be a convention to distinguish between people whose branching is unknown below a certain node, versus people who have been tested all the way to a "terminal" SNP. For example, within the Z253 project, suppose you have three different individuals:

R1b>L21>Z253>L1066>A6119# - designates the haplogroup of someone who has been fully tested for currently known branches, with A6119 being their terminal shared SNP.
R1b>L21>Z253# - designates the haplogroup of someone who has been fully tested for currently known branches, with Z253 being their terminal shared SNP (a so called "basal" branch).
R1b>L21>Z253} - designates the haplogroup of someone who has been determined to be a member of Z253, but whose placement below Z253 is unknown.
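
The three labels above are regular enough to be read by a program. Below is a minimal Python sketch, assuming exactly the two terminators proposed here (`#` for fully tested, `}` for placement unknown below the last SNP). The names `Haplogroup` and `parse_label` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Haplogroup:
    path: list[str]      # signpost SNPs from the prefix down to the last one listed
    fully_tested: bool   # True for '#', False for '}'


def parse_label(label: str) -> Haplogroup:
    """Split a label such as 'R1b>L21>Z253#' into its path and test status."""
    label = label.strip()
    if label.endswith("#"):
        fully_tested = True
    elif label.endswith("}"):
        fully_tested = False
    else:
        raise ValueError("label must end with '#' or '}'")
    path = label[:-1].split(">")
    if any(not part for part in path):
        raise ValueError("empty SNP name in label")
    return Haplogroup(path, fully_tested)


for text in ("R1b>L21>Z253>L1066>A6119#", "R1b>L21>Z253#", "R1b>L21>Z253}"):
    print(parse_label(text))
```

Because the terminator is mandatory, a label with neither symbol is rejected instead of being silently read as one case or the other, which is the ambiguity the convention is meant to remove.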

MitchellSince1893
08-25-2016, 11:59 AM
My thought. Stick with the current shorthand but with the addition of symbol(s) to indicate branches are being skipped.

Clades and subclades could be listed in multiple ways depending on the author's desire. For example, my own branch R-U152>L2>Z49>Z142>Z150>FGC12378>FGC12401>FGC12384 could be listed in the following ways:


R-U152(7x>)FGC12384. Use this when you know the number of levels and feel it's important to list it. As new levels are identified, it's easy to update.
R-U152(#>)FGC12384. Use this when you don't know the number of levels, or it's not important to list it.

Other variations:
R-U152>L2>Z49(#>)FGC12384

R-Z49(5x>)FGC12384 (you are in the L2 forum of Anthrogenica, or on the U152 Facebook page, so it's a given that you are in the U152 group).

This would give the author the flexibility to decide based on the audience the level of detail.

Don't get hung up on the actual symbols in the examples. They can be changed.
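
For anyone who wants to generate these forms from a full path, here is a minimal Python sketch. It assumes the count in `(7x>)` is the number of `>` steps between the last listed SNP and the terminal one, which is consistent with both counted examples above. The function name `abbreviate` and its parameters are hypothetical.

```python
def abbreviate(path: str, keep_head: int = 1, show_count: bool = True) -> str:
    """Collapse the middle of a '>'-separated path into a skip marker.

    keep_head  -- how many leading SNPs to keep spelled out
    show_count -- True gives '(Nx>)', False gives '(#>)'
    """
    nodes = path.split(">")
    if not 1 <= keep_head < len(nodes):
        raise ValueError("keep_head must leave room for the terminal SNP")
    steps = len(nodes) - keep_head  # '>' steps from the last kept SNP to the terminal
    if steps == 1:
        return path  # nothing is being skipped
    marker = f"({steps}x>)" if show_count else "(#>)"
    return ">".join(nodes[:keep_head]) + marker + nodes[-1]


full = "R-U152>L2>Z49>Z142>Z150>FGC12378>FGC12401>FGC12384"
print(abbreviate(full))                                   # R-U152(7x>)FGC12384
print(abbreviate(full, show_count=False))                 # R-U152(#>)FGC12384
print(abbreviate(full, keep_head=3, show_count=False))    # R-U152>L2>Z49(#>)FGC12384
print(abbreviate("R-Z49>Z142>Z150>FGC12378>FGC12401>FGC12384"))  # R-Z49(5x>)FGC12384
```

Since the count is derived from the path, adding a newly discovered level to the full path updates the abbreviated form automatically.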

Theramster
08-25-2016, 09:02 PM
I admire that you're questioning the inefficiency of current systems and trying to figure out another system to organize results and make them more intelligible. An ideal system would give information about major grouping and progression. With one look I could derive lots of information not only about the individual sample but also about its relation to others...

RobertCasey
08-26-2016, 05:09 AM
There really ought to be a convention to distinguish between people whose branching is unknown below a certain node, versus people who have been tested all the way to a "terminal" SNP. For example, within the Z253 project, suppose you have three different individuals:

R1b>L21>Z253>L1066>A6119# - designates the haplogroup of someone who has been fully tested for currently known branches, with A6119 being their terminal shared SNP.
R1b>L21>Z253# - designates the haplogroup of someone who has been fully tested for currently known branches, with Z253 being their terminal shared SNP (a so called "basal" branch).
R1b>L21>Z253} - designates the haplogroup of someone who has been determined to be a member of Z253, but whose placement below Z253 is unknown.

The lowest level would always be tested. Intermediate levels do not have to be tested and are present merely to sort and to quickly identify which major branches of the haplotree you belong to. If the submission is partially tested, it would be OK to include intermediate YSNP branches such as Z2534, which could be the terminal YSNP. There is also the tricky issue of predicted YSNPs. I think that should be another field altogether, but [L226] could indicate predicted vs. tested. Also, for FTDNA YSNPs not in Alex's haplotree, I always put them in parentheses, since FTDNA lists so many private YSNPs as branches on its haplotree. Another big issue is synonymous YSNPs. I usually use the first in the ISOGG list for those unless a much better option exists.

For terminal YSNPs, it takes two people testing positive for the YSNP to create a branch. If it is not a terminal YSNP, only one positive submission is required. If the YSNP is unique to one submission, this is generally called a private YSNP. There is also the issue of equivalent YSNPs being tested but the branching YSNP is unknown. I think that it would be best to list the equivalent YSNP in case it later turns out to be another branch but use {} to indicate an equivalent is present. I do not like adding the complexity of adding other intermediate information that requires too much interpretation. We should keep it simple to the bare minimum issues that are required for understanding. This YSNP string should have two main goals - quick identification of where you are on the haplotree and any additional information should not interfere with sorting.
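
One way to keep the extra markers from interfering with sorting is to strip them only when building the sort key, leaving the displayed string untouched. The Python below is a minimal sketch under that assumption. The `>` separator, the name `sort_key` and the example labels are made up for illustration, using the bracket conventions described above ([ ] predicted, ( ) FTDNA-only, { } equivalent).

```python
import re

# Square brackets, parentheses and braces used as annotations around a YSNP.
_MARKUP = re.compile(r"[\[\](){}]")


def sort_key(label: str) -> list[str]:
    """Return the YSNP names of a label with annotation brackets removed."""
    return [_MARKUP.sub("", part) for part in label.split(">")]


labels = [
    "R1b>L21>Z253>L226>FGC5660",
    "R1b>L21>Z253>[L226]",           # predicted, not tested
    "R1b>L21>Z253>L226>{FGC5628}",   # equivalent YSNP tested
]

for label in sorted(labels, key=sort_key):
    print(label)
```

Bracket characters do not sort like letters, so without the key a predicted `[L226]` entry could land away from the tested `L226` entries. With the key, all three stay grouped under Z253.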