Wouldn't it be nice if YFull, BigTree, and FTDNA had the same naming convention?



MitchellSince1893
09-07-2016, 06:05 PM
A well-known issue that seems solvable if the parties involved would just use the same rules for naming a new block of SNPs.

For example


In BigTree my branch/SNP block is called:
FGC12385.

In YFull it is called:
FGC12384

In FTDNA U152 project:
FGC7647

I asked Alex Williamson how he named blocks. He said typically it's by lowest position. In my case this would be FGC47872. But this is in an area not covered by BigY (it is covered by FGC). The first position covered by BigY is FGC12384. However this SNP "is on the edge of a coverage region. These...regions often indicate that the individual may be positive for a SNP even if there is no corresponding entry in the VCF file. This is particularly true when most men would otherwise show coverage at that particular position".

So I assume Alex didn't name the branch FGC12384 because it is on the edge of a coverage region and my BigY test's VCF file doesn't show me as positive for it. Only the two FGC tests on this branch were positive for FGC12384.

The next lowest positioned SNP is FGC12385 which he used.
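
To make the reasoning above concrete, here is a minimal Python sketch of the rule as I understand it: take the lowest-position SNP in the block, skipping any that BigY doesn't cover or that sit on the edge of a coverage region. This is only my reading of Alex's description, not his actual code. The `Snp` class, the `name_block` function, and the `require_bigy` switch are made up for illustration, and the `order` values are placeholder ranks rather than real coordinates.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    name: str
    order: int                  # placeholder rank by position, NOT a real coordinate
    covered_by_bigy: bool
    on_coverage_edge: bool = False


def name_block(snps, require_bigy=True):
    """Return the name of the lowest-position SNP that passes the assumed filters.

    Returns None when no SNP in the block passes.
    """
    usable = [
        s for s in snps
        if (s.covered_by_bigy or not require_bigy) and not s.on_coverage_edge
    ]
    if not usable:
        return None
    return min(usable, key=lambda s: s.order).name


# The block from this post, in position order (ranks are placeholders).
block = [
    Snp("FGC47872", 1, covered_by_bigy=False),
    Snp("FGC12384", 2, covered_by_bigy=True, on_coverage_edge=True),
    Snp("FGC12385", 3, covered_by_bigy=True),
]

print(name_block(block))                      # FGC12385
print(name_block(block, require_bigy=False))  # FGC47872
```

With the BigY filter switched off, the plain lowest-position rule gives FGC47872, which matches what Alex's rule would give if coverage were not a concern.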

Alex primarily uses VCF files for analysis, while YFull uses BAM files. I assume the issue seen with FGC12384 in the VCF isn't an issue in the BAM file, which would explain why YFull named it FGC12384 (my BigY sample is shown as positive for FGC12384 using the BAM file).

On the FTDNA U152 project, I assume they chose the lowest-numbered SNP, FGC7647. This SNP isn't highly rated by BigTree, receiving a ** (about 40% likely genuine) rating for the two FGC samples, and it isn't in a region covered by BigY (it's not detected by my BigY test). On YFull, FGC7647 receives only 1 of 4 stars and is in a region that "isn't used for analysis".

Any recommendations on getting universal agreement between these entities?

Wing Genealogist
09-07-2016, 07:46 PM
Yes, it would be nice. But it is like asking Ford, GM etc. to use the same parts (such as filters and fan belts) for their cars. Unfortunately, it ain't gonna happen.

Cofgene
09-07-2016, 08:57 PM
Yes, it would be nice, but one wants the name on a level to be testable by several methods. Regretfully, with everyone clamoring "what IS my haplogroup?" there is neither the time nor the procedure to establish a consensus rule for using names. So forget the name as the important point, since in the community the establishment of a new haplogroup is the primary effort.

MitchellSince1893
09-07-2016, 09:31 PM
With SNP names now as long as e.g. FGC52069, they are reaching the point where they aren't much shorter than the position plus mutation, e.g. 23433357C>T. That is an eight-character alphanumeric name vs 11 characters. Of course, when we change builds (e.g. 37 to 38) the positions change.

MitchellSince1893
09-11-2016, 12:26 AM
Got a response from Steve Gilbert on the procedures they use in the FTDNA U152 project.


Order of precedence when it comes to use of a particular SNP as the lead one for a new clade that contains several SNPs (node).

1- In most cases, if only the SNPs' positions are known, we use the lowest one.
2- In most cases, if only one of the SNPs has a series number, we use this one as the lead SNP.
3- In most cases, if more than one of the node SNPs are already known by a series number, we use the lowest one within the same series (BY1234, BY1235, BY1236, etc).
4- In most cases, if several series numbers exist for the same SNP, we give priority to the old series first: M, P, U, L, S, etc. If none exist, we use the "second generation" series: Z, DF, etc. If again none exist, we use the "third generation" series: BY, A, FGC, Y, etc. We give priority in this order of precedence for the "third generation".
5- In most cases, in order to minimize conflicts, we give priority to the series FTDNA uses on their haplotree even if it doesn't comply with the above steps. It would be nonsense to use a different series, as 96% of NGS people have been tested with the BigY product. This percentage comes from the YTree stats.
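
The five steps above can be sketched roughly in Python. This is my own simplification, not the project's actual procedure: every step is qualified with "in most cases", and the sketch flattens steps 3 and 4 into a single sort (oldest series first, then lowest number). The function names, the `(position, names)` block layout, and the `ftdna_series` parameter are hypothetical, and the positions in the example are placeholder ranks.

```python
import re

# Series order as listed in the steps: first, second, then third generation.
SERIES_ORDER = ["M", "P", "U", "L", "S", "Z", "DF", "BY", "A", "FGC", "Y"]


def split_name(name):
    """Split a name such as 'FGC12385' into ('FGC', 12385)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized SNP name: {name}")
    return match.group(1).upper(), int(match.group(2))


def sort_key(name):
    series, number = split_name(name)
    # Series missing from the list sort last (an assumption; the steps don't say).
    rank = SERIES_ORDER.index(series) if series in SERIES_ORDER else len(SERIES_ORDER)
    return (rank, number)


def pick_lead_snp(block, ftdna_series=None):
    """Pick a lead SNP for one node.

    block: list of (position, [names]) tuples, one per SNP in the node.
    ftdna_series: series FTDNA uses on its haplotree for this node, if known.
    """
    names = [n for _, snp_names in block for n in snp_names]
    if not names:
        # Step 1: only positions are known, so use the lowest one.
        return str(min(pos for pos, _ in block))
    if ftdna_series is not None:
        # Step 5: prefer the series FTDNA uses, when the node has a name in it.
        preferred = [n for n in names if split_name(n)[0] == ftdna_series.upper()]
        if preferred:
            names = preferred
    # Steps 2-4: oldest series first, then lowest number within that series.
    return min(names, key=sort_key)


# My block, with placeholder ranks instead of real positions.
my_block = [(1, ["FGC47872"]), (2, ["FGC12384"]), (3, ["FGC12385"]), (4, ["FGC7647"])]
print(pick_lead_snp(my_block))  # FGC7647
```

On my block this picks FGC7647 by step 3, since all four names are in the same series and 7647 is the lowest number.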

Also, my branch on the U152 project has been changed from FGC7647 to FGC12385. While Steve didn't elaborate, I think step 3 was initially used when FGC7647 was selected, but when I explained that FGC7647 isn't covered by BigY, step 5 partially came into play. Partially, because FTDNA hasn't yet given my current terminal branch a name, but why would you give it a SNP name that isn't covered by FTDNA's BigY? So I think FGC12385 was selected to be in line with the BigTree name for this branch, as the FTDNA project admins are working with Alex Williamson to get NGS tests in the U152 project onto his tree.

Cofgene
09-11-2016, 01:18 AM
The U152 naming process could represent technical laziness in a number of ways. We see examples of individuals not wanting to force an order on an array and just using a simple sort to name a level.

One should name a level with a variant that is thought to be the most testable across several technologies. The name should also represent the original discoverer or system. It makes no sense to name a clade using a variant that can only be tested with Elite or WGS results when there are other equivalents present.

RobertCasey
09-11-2016, 01:50 PM
There was a very long thread on YSNP naming standards on the Yahoo ISOGG forum with around 20 or 30 excellent posts. But unless the standards have some kind of enforcement, they are unlikely to have much influence. Enforcement can be as simple as stating that certain groups / vendors are not following the standards. With ISOGG's restrictive policies for exclusion of most branches, this leaves the FTDNA haplotree and Alex Williamson's Big Tree as the de facto standard. FTDNA is making much more of an effort these days to take input for additions and corrections to their haplotree, which is very encouraging. Neither addresses new branches discovered by YSEQ very well, and YSEQ is a major source of branch discovery. It is also very encouraging that Alex's chart and FTDNA's chart track each other a lot more closely than we thought. There are some major differences that need to be ironed out - but I am very glad to see that Alex must be getting the vast majority of NGS files. Unfortunately, around half of the posts are not relevant to the topic, as the thread got hijacked by another topic.

https://groups.yahoo.com/neo/groups/ISOGG/conversations/messages/40529


lgmayka
09-12-2016, 12:17 AM
RobertCasey wrote: "With ISOGG's restrictive policies for exclusion of most branches, this leaves FTDNA haplotree and Alex Williamson's Big Tree as the defacto standard."
I presume you're joking. A tree that covers one portion of one haplogroup cannot possibly be a standard for the human race as a whole.

RobertCasey
09-12-2016, 05:16 PM
lgmayka wrote: "I presume you're joking. A tree that covers one portion of one haplogroup cannot possibly be a standard for the human race as a whole."
For R-P312, it is the standard, and it is definitely the standard for the methodology that should be used for the rest of the haplogroups. They are expanding to include U106, which will then include all of R1b, which is probably around 1/3 of the submissions tested (but not 1/3 of the actual haplotree, since Western Europe is much more heavily tested than other parts of the haplotree). I do not know if Alex plans to continue to expand coverage to R1a, etc. But his excellent work is becoming the de facto standard for how to track NGS testing results. We recently compared the L21 part of the FTDNA haplotree and the Big Tree coverage and found them very comparable in size. So the R1b leadership is feeding 90% of the NGS files to Alex, which is pretty encouraging. Of course, neither Alex nor FTDNA includes the numerous branches discovered by YSEQ testing, which is another big issue to resolve.

I rarely use the ISOGG haplotree anymore and no longer submit new branches, since my primary interest is genealogical YSNPs. Another major improvement - FTDNA has a person who is very responsive to requests to fix the FTDNA haplotree, and Alex has always been very responsive to any issues as well. For L21, ISOGG now covers around 50% of the known branches.

I apologize for not stating that his web site is the de facto standard for the methodology only, as it does not include all haplogroups. His site is a public database (FTDNA does not do this properly for the Big Y results) - it is comprehensive in its inclusion of branches for the part of the haplotree that it covers. It documents FTDNA IDs for all NGS testing, for both Big Y and Full Genomes NGS tests. Its coverage of equivalents is far superior to FTDNA's list. Unlike FTDNA, it preserves older branch mutations rather than constantly replacing them with new equivalent mutations. He also uses YSNPs in unstable areas that have consistent testing results (YSEQ will not, and FTDNA does most of the time but not all the time). His list of advantages is pretty long and should be noted and replicated.

After the integration of U106, Alex's haplotree will be approaching the same number of branches as ISOGG's entire haplotree. I do not hear the same argument about missing branches being made against ISOGG. I am a hard-core genealogist, and genealogists drive most of the testing. But ISOGG is a volunteer organization with limited resources. They should be collecting dues to create tools like Alex's Big Tree web site.

lgmayka
09-13-2016, 01:29 AM
RobertCasey wrote: "They are expanding to include U106 which will then include all of R1b"
No, that still omits the fairly large Z2103 branch, not to mention the many small branches.

Cofgene
09-13-2016, 09:53 AM
The integration of the U106 region should also bring a major enhancement to Alex's tree, in that Iain McDonald's age estimation process will be embedded as one of the analysis processes that come with the U106 variant identification code. Alex's tree will need to display side-by-side the variants from the current Alex, FGC analysis, and U106 analysis pipeline results. That will be good, in that individuals will get to see which results are consistently called as higher quality and which ones may need further investigation. We still need to develop and work out a curation model and interface that will give multiple haplogroup admins the tools needed to curate their portion of the tree structure.

This week Iain's original BASH analysis script, which replicates the original U106 Mac analysis program, got its first major optimization with some Python code. So it is still very portable and just a heck of a lot faster. One more section will be optimized into Python while Iain works on welding his age determination code into the script.