R-U106 Fireside Chat



Peter M
01-27-2017, 10:13 AM
This thread was created as a fork based on a slightly off-topic post in "Bell Beakers, Gimbutas and R1b"


Yes, Z18 & L48 should be considered separately. A third clade to keep in mind is Z156. According to the Age analysis of Dr. Iain McDonald, Z156 is older & larger than Z18. In addition the Royal House of Wettin is a Z156 clade.
Z18 already IS considered separately. This was arranged some six years ago by a few seasoned people who had thoroughly considered the issues involved and decided that most progress would be made by starting a dedicated project focussing on R-Z18 (the achievements of the project have proven them right over the years).

The result was the R-Z18 project with its own web site and forum. Apart from extensively supporting Z18+ people and developing the R-Z18 Y-DNA tree, the other main activity of the project is the development of the tools that are used by the project. These tools (e.g. the R-Z18 Y-DNA SNP Database and the programs that generate our web site) make the R-Z18 project probably the most technically advanced Y-DNA haplogroup project (it's kind of hard to understand why nobody in ISOGG shows any interest in our Y-DNA Database :)).

The fact that the R-Z18 project is not uniformly recognised in the field has always prevented us from making any of these tools available to others (one cannot reasonably expect to freely receive things (software and data) from an entity whose existence is not recognised :P).

Peter M
01-27-2017, 10:39 AM
Possibly, it's an idea to respond to the content of the original message as well, as a very significant improvement is possible here.


Yes, Z18 & L48 should be considered separately. A third clade to keep in mind is Z156.
I personally am convinced it would be "much better" to apply the idea behind the R-P312 projects in R-U106 as well. This part would then consist of the following projects (four projects instead of the current four projects :)):


R-U106 (all U106* and all "smaller" SNPs under it; and concentrating on getting people to "test into" one of the other three);
R-Z18 (as it is now);
R-Z156 (including L1);
R-Z301 (including L48 and L47).

These projects could share things like a web site and possibly a common set of tools, but as such each project would focus on their own branch(es) of the tree, so that much more progress would be possible. The most relevant change would be for some people to move away from their concentration on "number of members" as the most important aspect of their project.

I am aware, there are people who think (apparently they do :)) that even the R-Z18 project should not exist and is ignored, because it's "part of" the R-U106 project. These people tend to forget, that if that rule is applied consistently, then the R-U106 project should not exist either, because it's part of the R1b project. If we apply this concept uniformly, then all projects should be stopped and only the haplogroup CT project will be continued (not considering Africans for the moment).


According to the Age analysis of Dr. Iain McDonald, Z156 is older & larger than Z18.
At this moment I'm not sure which data and analysis tools the said good Dr has been using; I'm not aware of him getting ANY information from the R-Z18 project on these issues. It's absolutely clear he's had no access to software for geographic analysis of haplogroups or parts thereof. BTW, I don't see the geographic facilities on the FT-DNA web site as particularly useful.

Z156 being larger than R-Z18 might well be true; in reality R-Z18 is only about 5% of R-U106, based on the information in the R-Z18 Profile Database (if one doesn't have reasonably reliable information, the group may look much bigger than it actually is).


In addition the Royal House of Wettin is a Z156 clade.
Interesting.

Wing Genealogist
01-27-2017, 11:30 AM
There are benefits and drawbacks to dividing up the clades as well as combining the clades. We certainly need experts at the various clade levels but we also need to look at the Big Picture.

Iain's work on the ages: http://www.jb.man.ac.uk/~mcdonald/genetics/tree.html shows what can be done by combining the clades. In fact, one of the major issues he is now trying to address is that much of the uncertainty around the earliest branches of the tree is due to the fact his work stops at U106. He needs to incorporate the results of the clades above U106 to tighten up his age estimates for U106 as well as its top branches (including Z18). He is actively working to bring in the P310 level and hopes to go back to the M269 level some day.

Iain fully recognizes his information is going to be "wrong" at many levels, but his willingness to put it out to the public does advance our knowledge.

In the end it is all about the research, and the folks ordering the various tests. I strongly believe the best thing FTDNA has done is to create the volunteer Project Administrators who have turned into the experts in the field. Many of them are light years ahead of the so-called professionals and their willingness to share their expertise with the customers (and the general public) is what has moved the field at the incredible pace it has achieved.

Peter M
01-27-2017, 11:45 AM
These projects could share things like a web site and possibly a common set of tools, but as such each project would focus on their own branch(es) of the tree, so that much more progress would be possible. The most relevant change would be for some people to move away from their concentration on "number of members" as the most important aspect of their project.

I'm well aware, I'm using the expression "much more" here, which is normally routinely used by fans of hackers, who (the fans) in reality don't have the faintest idea what they're talking about, so that "much more" is always a pretty suspect claim (at least to me).

In this context I mean that a group of people working together, e.g. on a Y-DNA tree, will make much more progress if they concentrate on a smaller, more coherent branch of the tree. An example of this is the sponsored R-Z18 Panel tests that have been organised by the project in the past. People who would normally not consider ordering such a test were enabled to do so as a group at a significant discount. This resulted in over 100 tests ordered (in a project with about 600 members), and gave the project results from a much larger group of people testing the new SNPs discovered (by the project: the ZPn SNPs) in the Big-Y results.

Currently, we (most people in the field) are still working with the test results from people who are interested enough and have the funds available to order a modern (Full-Y) test. These are not always the most interesting samples. If the member base of a project is more coherent, it might well be possible to organise "focussed testing" of the most interesting profiles, the tests being paid by the project using the donations of the members. This necessitates a lot of organisational work (for all the said panel tests in R-Z18, hundreds of emails have been sent), but as a result "much more" progress will be made. This will only be doable in a smaller, more focussed, project. The ideal case being a project in which groups of people know each other.

Peter M
01-27-2017, 12:35 PM
There are benefits and drawbacks to dividing up the clades as well as combining the clades. We certainly need experts at the various clade levels but we also need to look at the Big Picture.

Personally, I think projects should be focussed on branches that share a clear common theme and not so much on their actual size or the ambition of their administrators. This is especially important if we see a project as supporting their members (advising on tests and explaining test results) as this only works if there's something of a binding factor. Given the amount of time it takes to handle all requests in R-Z18 (about 600 members), a project of thousands of members seems a wee bit impractical.


Iain's work on the ages: http://www.jb.man.ac.uk/~mcdonald/genetics/tree.html shows what can be done by combining the clades. In fact, one of the major issues he is now trying to address is that much of the uncertainty around the earliest branches of the tree is due to the fact his work stops at U106. He needs to incorporate the results of the clades above U106 to tighten up his age estimates for U106 as well as its top branches (including Z18). He is actively working to bring in the P310 level and hopes to go back to the M269 level some day.


Of course, there's nothing wrong with (project-) independent researchers who do their own investigation in this case on the age of SNPs of a number of branches, but it could be on any subject the researcher deems relevant. There is no necessary implication to the scope of projects in this, he just creates his own blog somewhere. Whether one sees the good Dr. as an independent researcher is a matter of opinion, I guess (I'm aware, I'm not responding to his approach, BTW).


Iain fully recognizes his information is going to be "wrong" at many levels, but his willingness to put it out to the public does advance our knowledge.

The only problem I see, is that lots of people see his results as facts and start to act based on them, instead of as an opinion of a single person who works individually (at the very least, he hasn't ever consulted the R-Z18 project) and is a fan of age calculation. Especially in the case of newcomers in the field, this is an important issue.


In the end it is all about the research, and the folks ordering the various tests. I strongly believe the best thing FTDNA has done is to create the volunteer Project Administrators who have turned into the experts in the field. Many of them are light years ahead of the so-called professionals and their willingness to share their expertise with the customers (and the general public) is what has moved the field at the incredible pace it has achieved.
I agree, although it would be nice if the projects decided to co-operate more, as I think the field currently is not getting any stronger.

To summarise, I think there are two issues here. First, there are the projects that (a) support their members (advising on tests and helping to explain results, in effect a knowledgeable help desk as an alternative/enhancement to FT-DNA's) and (b) evaluate all results available and use them to develop a Y-DNA tree for their branch. Second, there are individual and independent researchers who refrain from advising members of any project, except of course in the area of their research (e.g. SNP ages in this case). These researchers should make very clear that what they're saying is primarily a result of their research and as such a personal opinion.

Joe B
01-27-2017, 04:28 PM
There are benefits and drawbacks to dividing up the clades as well as combining the clades. We certainly need experts at the various clade levels but we also need to look at the Big Picture.

Iain's work on the ages: http://www.jb.man.ac.uk/~mcdonald/genetics/tree.html shows what can be done by combining the clades. In fact, one of the major issues he is now trying to address is that much of the uncertainty around the earliest branches of the tree is due to the fact his work stops at U106. He needs to incorporate the results of the clades above U106 to tighten up his age estimates for U106 as well as its top branches (including Z18). He is actively working to bring in the P310 level and hopes to go back to the M269 level some day.

Iain fully recognizes his information is going to be "wrong" at many levels, but his willingness to put it out to the public does advance our knowledge.

In the end it is all about the research, and the folks ordering the various tests. I strongly believe the best thing FTDNA has done is to create the volunteer Project Administrators who have turned into the experts in the field. Many of them are light years ahead of the so-called professionals and their willingness to share their expertise with the customers (and the general public) is what has moved the field at the incredible pace it has achieved.
Nobody has made enquiries to the administrators of the R1b Basal Subclades project where the P310+, M269- folks reside. Frankly, we wouldn't let them go without a consensus of the members. Most of our members are committed to going with our project trees and YFull.

Wing Genealogist
01-27-2017, 05:07 PM
Nobody has made enquiries to the administrators of the R1b Basal Subclades project where the P310+, M269- folks reside. Frankly, we wouldn't let them go without a consensus of the members. Most of our members are committed to going with our project trees and YFull.

Iain is currently working on the various subclades of P312 and hopes to examine the M269+, P312- U106- clades at some later date. His work does not call for anyone to be moved from any projects. I cannot speak for him in regards to who he is contacting, but I do know he is working with Alex Williamson and would imagine he is working with the various project administrators as well.

His work is similar to the work YFull is doing on the age analysis, but he does add in some additional steps to further refine the estimated ages.

RobertCasey
01-27-2017, 05:23 PM
The other main activity of the project is the development of the tools that are used by the project. These tools (e.g. the R-Z18 Y-DNA SNP Database and the programs that generate our web site) make the R-Z18 project probably the technically most advanced Y-DNA haplogroup project (it's kind of hard to understand, nobody in ISOGG shows any interest in our Y-DNA Database :)).

Could you summarize the tools and their functions? I know about the TMRCA charting tool which is really nice. I am more interested in the tree building aspect than the age estimates. Also, I saw several posts where initial work was being done on creating a 400 - 500 YSTR database based on NGS tests. I am very interested in that.

I assume that you also have some form of database for U106 research as well. There are a lot of those databases out there - but they are mostly available in spreadsheet form or web sites which are less useful. Making a robust database available directly to the public (vs. personal research) may be an issue with FTDNA. The FTDNA terms and conditions could legally shut down such a database tool. There is a lot of useful YSEQ data that is tracked very manually due to the ID translation issue. The Big Tree is a database of sorts, but it is really an output summary vs. a true database. Glad to see that U106 assisted Alex with expanding the scope to include U106. I do have tools to collect FTDNA data which includes all of U106 (it covers over 4,000 projects) which is in a MySQL database.

We also have the issue of competing haplotrees. It used to be ISOGG leading the way (I rarely even go to their haplotree these days). When Big Tree hit the scene that was the best source for P312 YSNP haplotrees and he is now collecting YSTR data as well (but not from NGS tests). YFULL has pretty good coverage for certain haplogroups but really lacks P312 data. Shame on Full Genomes for not having any usable database. FTDNA has hardly even shown up on the real map for a viable NGS database that includes private YSNPs. Even their YSNP reports are bloated with worthless NATGEO data but miss 80 % of the relevant data from their Big Y tests. Here is a spreadsheet version of YSTR reports (last full pull was July, 2016 and 20 % was pulled in November, 2016 - caution this is a 29 MB file):


http://www.rcasey.net/DNA/Temp/HG_R_Master_20161117A.xlsx

Lately, I have become very interested in automating (coding) a charting tool. Now that R-L226 has 500 67-marker tests with around 20 % coverage with YSNPs and 42 branches, I am able to chart over 75 % of L226 with fairly high reliability:

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Home.pdf

Peter M
01-27-2017, 08:57 PM
You've covered a lot of ground in a few lines, and to respond reasonably to all the issues touched on will be hard, as I've worked on the vast majority of these issues for years and there's an awful lot to say about them. I will give things a fair try.


Could you summarize the tools and their functions ? I know about the TMRCA charting tool which is really nice. I am more interested in the tree building aspect than the age estimates. Also, I saw several posts where initial work was being done on creating a 400 - 500 YSTR database based on NGS tests. I am very interested in that.
I have the best part of a hard drive filled with software and data. I could write a book on this software. Some of it was written in the past and not used recently (e.g. a program that generated Network Diagrams in full colour (to indicate different sub clades)). Some are minor utilities that prepare input data for other programs. But essentially it is a set of tools that automate all steps in a project's workflow, as I tend to see it.

I'm not concerned with age estimates, for the simple reason that I have strong doubts about them. A few years ago, I wrote a program that automated the Generation5 spreadsheet by Ken Nordvedt (it's still used in the STR Results display on the R-Z18 web site). Unfortunately, it is not 100% reliable (e.g. it doesn't handle nulls correctly); I have to look at it again, but that's not high on my priority list. So I share your liking for the tree building process.

I'm not too enthusiastic about the idea of working with 400-500 STRs. I don't think this will be worthwhile, although it might be an idea to have a look at the results to date, trying to find a few extra STRs that might be useful and to include those in databases. I have a currently hibernating program that can be used to automatically call STRs (and SNPs) from NGS results in a .bam file (called gyca: Genealogical Y-Chromosome Analyser).


I assume that you also have some form of database for U106 research as well. There are a lot of those databases out there - but they are mostly available in spreadsheet form or web sites which are less useful. Making a robust database available directly to the public (vs. personal research) may be an issue with FTDNA. The FTDNA terms and conditions could legally shut down such a database tool. There is a lot of useful YSEQ data that is tracked very manually due to the ID translation issue. The Big Tree is a database of sorts, but it is really an output summary vs. a true database. Glad to see that U106 assisted Alex with expanding the scope to include U106. I do have tools to collect FTDNA data which includes all of U106 (it covers over 4,000 projects) which is in a MySQL database.

That's a safe assumption, but currently it is only used for R-Z18 research (no other groups have been interested; I have used things occasionally to help people sort out issues). The word "database" is used for a lot of different things. Personally, I tend not to see a spreadsheet as a database, for the simple reason that it is very difficult or next to impossible to actually DO something with the data that was not designed into the structure of the sheet.
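
To make the spreadsheet-versus-database distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name snp_result, its columns, and the kit IDs are all hypothetical and are not taken from the R-Z18 database; the only point is that a question nobody planned for when the data was laid out becomes a short query.

```python
import sqlite3

# Hypothetical rows: (kit ID, SNP name, result). Not real project data.
ROWS = [
    ("K1", "U106", "+"), ("K1", "Z18", "+"),
    ("K2", "U106", "+"), ("K2", "Z18", "-"),
    ("K3", "U106", "+"), ("K3", "Z18", "+"),
]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one row per kit/SNP result."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE snp_result (kit TEXT, snp TEXT, result TEXT)")
    con.executemany("INSERT INTO snp_result VALUES (?, ?, ?)", ROWS)
    return con


def positives_per_snp(con: sqlite3.Connection):
    """Ad hoc question 1: how many kits tested positive for each SNP?"""
    cur = con.execute(
        "SELECT snp, COUNT(*) FROM snp_result "
        "WHERE result = '+' GROUP BY snp ORDER BY snp"
    )
    return cur.fetchall()


def kits_with(con: sqlite3.Connection, positive: str, negative: str):
    """Ad hoc question 2: which kits are positive for one SNP, negative for another?"""
    cur = con.execute(
        "SELECT a.kit FROM snp_result a JOIN snp_result b ON a.kit = b.kit "
        "WHERE a.snp = ? AND a.result = '+' AND b.snp = ? AND b.result = '-' "
        "ORDER BY a.kit",
        (positive, negative),
    )
    return [row[0] for row in cur]
```

Neither question needs a change to how the rows are stored, which is the property a fixed sheet layout tends to lack.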

Making a data set (neutral term) available via the internet might annoy FT-DNA depending on the organisation behind it; it's not very likely FT-DNA will take legal steps to fight a party that represents the vast majority of their customer base. A useful database would add value to the things FT-DNA is already displaying, otherwise it would be useless and this added value is a good way of countering any copyright claims.


We also have the issue of competing haplotrees. It used to be ISOGG leading the way (I rarely even go to their haplotree these days). When Big Tree hit the scene that was the best source for P312 YSNP haplotrees and he is now collecting YSTR data as well (but not from NGS tests). YFULL has pretty good coverage for certain haplogroups but really lacks P312 data. Shame on Full Genomes for not having any usable database. FTDNA has hardly even shown up on the real map for a viable NGS database that includes private YSNPs. Even their YSNP reports are bloated with worthless NATGEO data but miss 80 % of the relevant data from their Big Y tests. Here is a spreadsheet version of YSTR reports (last full pull was July, 2016 and 20 % was pulled in November, 2016 - caution this is a 29 MB file):
Competing trees? These trees all share roughly the same issues: (a) they all assume there's a single party (person or very small team) who is responsible for defining a "covering" Y-Tree (a tree including all branches); (b) these trees are all defined offline by said party; and (c) are all using ancient and static technology for display of the tree.

as to (a): I don't believe a single entity will be able to follow all new developments in all branches of the Y-Tree in order to be able to define a complete Y-Tree. I am a strong believer in focussed groups that each concentrate on a single area of the tree and follow all test results from all labs.
as to (b): requires much more explanation, so I will leave this for the moment.
as to (c): none of these trees use state of the art graphics (as that would complicate things dramatically) and e.g. allow the viewer to drill down into the underlying SNP database and/or other data sets.

Apart from these shared issues, there are things that are particular to a single tree, e.g. the ISOGG tree is based on manual labour using only an html editor, which is an outdated tool (few people code web pages by hand nowadays, most use CMSs). ISOGG insist on not involving knowledgeable people in their tree maintenance decision-making and the results are revealing. FT-DNA are selling tests to individuals and do remarkably little, and/or with comparatively low quality, for groups of customers. FGC cannot be expected to offer analysis support as they are selling FULL-Y tests at the lowest prices possible to knowledgeable people. Y-Full AFAIA are not known for their co-operation with other parties and that's a real requirement if you want to build a tree and do not use all test results available.

Personally, I think a somewhat new and revolutionary approach is needed.


Lately, I have become very interested in automating (coding) a charting tool. Now that R-L226 has 500 67 marker tests with around 20 % coverage with YSNPs and 42 branches, I am able to chart over 75 % of L226 with fairly high reliability:

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Home.pdf
This looks nice in comparison to most of the other tree displaying pages I've seen around. For static display of a tree as part of a document, this will work fine, but for presentation online on the web, there are lots of things I would expect to be available that will be very hard, or impossible, to add using a document-oriented thing like .pdf. But it is a very nice example of what a Y-Tree could look like.

MitchellSince1893
01-27-2017, 10:21 PM
You two should start a dna database business together.

RobertCasey
01-28-2017, 06:14 AM
Some are minor utilities that prepare input data for other programs. But essentially it is a set of tools that automate all steps in a project's workflow, as I tend to see it.


Other than my "out of date" L21 SNP Predictor tool, most of my tools are tools that automate the analysis process as well. After the explosion of YSNP discovery, my predictor tool was too manual to keep up. But this tool could be automated as well. It could still be a very useful tool as 90 % of L21 could be predicted based on YSTR patterns with genetic distance filters for convergence. Prediction would only be for single signature predictable YSNPs in the 1,500 to 2,500 year range. Below that the charting tool would be used.

http://www.rcasey.net/DNA/R-L21_SNP_Predictor_Intro.html
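
As a rough illustration of signature-based prediction with a genetic distance filter, here is a minimal Python sketch. It is not the code behind the predictor linked above: the function names, the simple sum-of-absolute-differences distance, the rule of predicting only when exactly one branch matches, and all marker values are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

Haplotype = Dict[str, int]  # marker name -> repeat count

# Hypothetical layout: each branch has a signature (the marker values that
# set it apart) and a modal haplotype used for the distance filter.
Branches = Dict[str, Tuple[Haplotype, Haplotype]]


def genetic_distance(a: Haplotype, b: Haplotype) -> int:
    """Sum of absolute repeat differences over markers present in both."""
    return sum(abs(a[m] - b[m]) for m in a.keys() & b.keys())


def matches_signature(hap: Haplotype, signature: Haplotype) -> bool:
    """True when every signature marker has exactly the signature value."""
    return all(hap.get(m) == v for m, v in signature.items())


def predict_branch(hap: Haplotype, branches: Branches, max_gd: int) -> Optional[str]:
    """Return the one branch whose signature matches within max_gd, else None.

    The distance filter is there to drop convergent haplotypes: ones that
    happen to carry the signature values but sit far from the branch modal.
    """
    hits = [
        name
        for name, (signature, modal) in branches.items()
        if matches_signature(hap, signature)
        and genetic_distance(hap, modal) <= max_gd
    ]
    return hits[0] if len(hits) == 1 else None
```

A haplotype that matches no signature, or matches more than one, is left unpredicted in this sketch rather than guessed at.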



So I share your liking for the tree building process.


I think this tool is really prime for a roll-out for general usage. The first phase would have the normal "quick and dirty" output format for the web but this would be database driven (MySQL). Writing dozens of canned queries with high end web presentations would be a major task though - plus it is still somewhat a moving target. I do not use any network joining math methodology - it is all YSTR signature based methodology (same principle used in YSNP prediction).



I'm not too enthusiastic about the idea of working with 400-500 STRs. I don't think this will be worthwhile, although it might be an idea to have a look at the results to date, trying to find a few extra STRs that might be useful and to include those in databases. I have a currently hibernating program that can be used to automatically call STRs (and SNPs) from NGS results in a .bam file (called gyca: Genealogical Y-Chromosome Analyser).


If you look at the Irwin project where he has around 150 67-marker submissions and a couple of dozen NGS tests for just one surname cluster, you might see that 67/111 plus Big Y is just not enough resolution to do the job. 67 to 111 helps some and Y Elite could help a little more, but we will probably need even more YSTR resolution to assign two or three mutations per ancestor on our ancestor chart. Once the WGS test goes down to around $200 and has longer read lengths to catch all 111 markers, the data will be there for the taking. We do need to start sorting out the STRs that are hard to analyze or have issues. Also, we need to classify slow, medium and fast mutation rates and come up with the 300 slower markers plus another 100 faster mutation rates for the old Genetree approach where you test the faster markers for lots of descendants (10X) to chart more connections.
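
The classification step could be sketched roughly as follows in Python. The thresholds are passed in as parameters because no cut-off values have been agreed here, and the function names, marker names and rates used with it are hypothetical placeholders, not published mutation rates.

```python
from typing import Dict, List, Tuple


def classify_rate(rate: float, slow_below: float, fast_above: float) -> str:
    """Label a per-generation mutation rate as slow, medium or fast."""
    if rate < slow_below:
        return "slow"
    if rate > fast_above:
        return "fast"
    return "medium"


def select_panels(
    rates: Dict[str, float], n_slow: int, n_fast: int
) -> Tuple[List[str], List[str]]:
    """Pick the n_slow slowest and n_fast fastest markers, without overlap."""
    if n_slow + n_fast > len(rates):
        raise ValueError("not enough markers for two non-overlapping panels")
    # Sort by rate, breaking ties by marker name so the result is repeatable.
    ordered = sorted(rates, key=lambda marker: (rates[marker], marker))
    slow_panel = ordered[:n_slow]
    fast_panel = ordered[len(ordered) - n_fast:]
    return slow_panel, fast_panel
```

With a full rate table, select_panels(rates, 300, 100) would correspond to the 300 slower plus 100 faster markers mentioned above; markers flagged as hard to analyze would need to be removed from the table first.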



That's a safe assumption (about databases), but currently it is only used for R-Z18 research (no other groups have been interested; I have used things occasionally to help people sort out issues). The word "database" is used for a lot of different things. Personally, I tend not to see a spreadsheet as a database, for the simple reason that it is very difficult or next to impossible to actually DO something with the data that was not designed in, in the structure of the sheet.


I agree that spreadsheets are not databases, but they can be consumed by most end users (and further manipulated). All the FTDNA YSTR and YSNP reports are already in MySQL format. My spreadsheets are just end user consumable data. However, my YSEQ data is only in spreadsheet form and limited to R-L226. Have any good ideas on how to get a viable comprehensive YSEQ ID to FTDNA ID cross reference tool where pulling YSEQ could be done across all haplogroups ? YSEQ does not want to add an optional field for "FTDNA ID" for fear of FTDNA reaction (this really dilutes the usefulness of their database). All my NGS test data is really limited for L226 from BAM analysis from Dennis Wright - but I do use Alex's site for other haplogroups but its format is not even EXCEL exportable (which could be imported back into MySQL).

It is much quicker and easier to round trip from MySQL to EXCEL for derived fields. Surnames are important to genetic analysis - but they have to be manually extracted (no way to automate this). Also, when they converted to the short haplogroup format only, I now do table lookups in EXCEL for now. I only pull most projects every six months or so - except for L226 where I cut and paste for recent information (pretty sad). It is pretty automated now but I need to continue to fine tune the YSNP table. After 1.5 million rows, MySQL started finally to slow down to where I had to create a second source table. I really only pulled U106 and R1A since I could no longer sort out the YSTR report based on the long format.



Making a data set (neutral term) available via the internet might annoy FT-DNA depending on the organisation behind it; it's not very likely FT-DNA will take legal steps to fight a party that represents the vast majority of their customer base. A useful database would add value to the things FT-DNA is already displaying, otherwise it would be useless and this added value is a good way of countering any copyright claims.


Their recent addition of web site terms and conditions is pretty strong legally. The recent GEDMATCH issue shows their true predatory nature for sharing the FTDNA IDs (which they publish to their own website for YDNA data). My YSTR database is pretty robust but my YSNP data is still pretty crude - I just extract YSNPs on demand for haplogroups under analysis. If you rolled out a full database with dozens of good queries and very high end web output, they would probably react.

Another good tool would be the automatic detection of bad YSNP values - detecting when bad data from YSNPs in unstable areas happen, multiple mutations of the same YSNP, technology errors in SNP packs (pretty common), bad NGS data that FTDNA routinely publishes, etc. This takes a lot of time manually now.
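
One piece of such a checker, flagging calls that contradict the tree, might look like the Python sketch below. It covers only one of the error types listed (a kit positive for a SNP but negative for one of its ancestors). The function name and data layout are assumptions, and the tree used with it should be read as a simplified toy: each SNP maps to its nearest listed ancestor, with intermediate nodes omitted.

```python
from typing import Dict, List, Tuple


def find_conflicts(
    calls: Dict[str, str], parent: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (positive SNP, negative ancestor) pairs for one kit.

    calls maps SNP name -> "+" or "-"; untested SNPs are simply absent.
    parent maps SNP name -> its nearest listed ancestor in the tree.
    """
    conflicts = []
    for snp, call in calls.items():
        if call != "+":
            continue
        seen = {snp}  # guard against a malformed tree with a cycle
        node = parent.get(snp)
        while node is not None and node not in seen:
            if calls.get(node) == "-":
                conflicts.append((snp, node))
            seen.add(node)
            node = parent.get(node)
    return sorted(conflicts)
```

A flagged pair does not say which of the two calls is wrong; it only marks the kit for the manual review that currently has to be done for every result.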



Competing trees? These trees all share roughly the same issues: (a) they all assume there's a single party (person or very small team) who is responsible for defining a "covering" Y-Tree (a tree including all branches); (b) these trees are all defined offline by said party; and (c) are all using ancient and static technology for display of the tree.

Personally, I think a somewhat new and revolutionary approach is needed.


The true approach would obviously have to be database driven with some pretty high end / easy to navigate web output (along with exports to EXCEL for those inclined).

By default, FTDNA is the best output at this point in time. Glad to see them finally make some major improvements. For P312 and U106, Alex's Big Tree and FTDNA's haplotree are pretty comprehensive. Of course there are a lot of warts remaining but these two web outputs are pretty useful for regular usage (manual examination). You can drill down a lot more with Alex's tool but Alex is only NGS driven which is a major limitation. 13 of 42 L226 branches are YSEQ or L226 SNP discovered (this should be a lot higher). The latest major new wart from FTDNA is promoting private YSNPs as branches (all private YSNPs that show up in the L226 and L555 SNP Packs are listed as branches). These SNP Packs are extremely powerful for revealing new branches and collecting enough genetic data to chart 75 % of L226. Highly recommend rolling out SNP Packs with private YSNPs.



This looks nice in comparison to most of the other tree displaying pages I've seen around. For static display of a tree as part of a document, this will work fine, but for presentation online on the web, there are lots of things I would expect to be available that will be very hard, or impossible, to add using a document-oriented thing like .pdf. But it is a very nice example of what a Y-Tree could look like.


This charting tool is my primary focus for now - the timing is really perfect for this tool for well tested haplogroups. I am working with Dave Vance and others to create a tool that is more accurate and a little more user friendly. In its current form, it uses network joining which just does not work for the higher levels of tree building. For signature recognition though, this is a great tool over manual analysis. Dave just rolled out version 2 which removed time out problems and greatly improved performance. It also reverts to YSTR only charting where no YSNPs are available and we all know how low accuracy that will be. My chart was done with Adobe InDesign and then output to PDF format. Dave's generates one huge graphic but he is working on breaking it up to one graphic per YSNP branch. His format is very similar to mine since I have assisted him with his coding (providing design input and QA input to date).

http://www.jdvtools.com/SAPP/

Cofgene
01-28-2017, 01:44 PM
This discussion really should be focused on the consolidation of some of the utilities and concepts into a single system. It's time to transform ytree.net into the "GEDCOM" of the ySTR/variant world. The focus should be on how to establish a viable support structure to assist Alex's ytree and to build out additional functionality in a controlled manner.

If the ySTR predictor is that solid then it is time to integrate it with Alex's ytree.net so that a prediction provides links to the predicted regions to allow the user to explore additional information on STR motifs and geography related to the predictions.

Get on board with making sure tools work across the broader R1b region. Fragmentation aggravates the problem of looking at specific topics due to the deliberately compartmentalized and targeted utility of data and tools. Alex's Ytree is moving into the U106 world. Iain's bigY analysis, clade identification, and corresponding age analysis is expanding to cover other parts of R1b. We want to get Iain's results into Alex's tree as another analysis column but more importantly be able to provide individuals with current age estimates, with lower error estimates than Yfull, for R1b haplogroups.

RobertCasey
01-28-2017, 03:07 PM
This discussion really should be focused on the consolidation of some of the utilities and concepts into a single system. It's time to transform ytree.net into the "GEDCOM" of the ySTR/variant world. The focus should be on how to establish a viable support structure to assist Alex's ytree and to build out additional functionality in a controlled manner.


I talked with Alex at length at the last FTDNA conference. He and I are on the same page. His site is very dependent on comprehensive submissions of NGS files for analysis. As Peter pointed out, this is very fragmented across the genome. It was quite a major effort to recently add U106 (both political and time). If the R1a crowd wants to be added next, I am pretty sure that Alex would be able to handle another influx of data to analyze. In my response to Peter, I realized that BigTree is missing an EXCEL export feature which would also be another next step.



If the ySTR predictor is that solid then it is time to integrate it with Alex's ytree.net so that a prediction provides links to the predicted regions to allow the user to explore additional information on STR motifs and geography related to the predictions.


When the YSNPs were trickling in with "Walk the Y", I could manually keep up with the analysis to keep the tool updated for L21. However, each signature takes around 30 minutes to one hour to analyze. This process has to be automated via coding. The charting tool being developed has such methodology that may be modified for more automated signature recognition. But it takes programmers (which I am not a heavy weight) or funds to create these tools. If there are any programmers out there that are interested in the automation of YSNP prediction, I recently retired and would be more than happy to roll out a more robust version of this tool.

http://www.rcasey.net/DNA/R_L226/R_L226_Contact_Project.html



Get on board with making sure tools work across the broader R1b region. Fragmentation aggravates the problem of looking at specific topics due to the deliberately compartmentalized and targeted utility of data and tools. Alex's Ytree is moving into the U106 world. Iain's bigY analysis, clade identification, and corresponding age analysis is expanding to cover other parts of R1b. We want to get Iain's results into Alex's tree as another analysis column but more importantly be able to provide individuals with current age estimates, with lower error estimates than Yfull, for R1b haplogroups.


The first pass at U106 is now up and running at the link below:

http://www.ytree.net/DisplayTree.php?blockID=1147

If the R1a group wants to coordinate sending their NGS files to Alex, he would probably be open to adding them as well. Across the rest of the genome, the collection of NGS files becomes very fragmented. This is more of a political issue and leadership issue to get this to happen. If Haplogroup I is ready, that would be another possible next step. R1a and much of the I haplogroup use YFULL for NGS analysis instead of BigTree. It takes time to get the cooperation across the genome as these files are not available to download at will - they are manually sent to a common web site where Alex imports the files. R1b has led the way in creating this methodology - which could be expanded.

But there is also an operational aspect to these tools and databases. With each iteration of rollout, we automate more and are able to expand the scope in both coverage and functionality. Again, this is a lot of coding effort and operational time. For the database maintenance, I have to extract out a derived field for every new surname from three FTDNA fields (this cannot be automated). With the demise of the long haplogroup name, I now manually do table lookups for each new haplogroup added to the tree (this could be automated but is yet another automation/coding project). I need more programming assistance or funds to hire programmers. I was in the IT business for 40 years and outsourcing business for the last ten years, so I know how to get this incrementally rolled out.

Peter pointed out the low end output nature of the existing tools and databases - including FTDNA. FTDNA has a staff of ten or twenty programmers and IT staff just to get out the information that they produce. Probably one quarter to one-third of the cost of your FTDNA tests goes to the IT department. These tools will require funding to program and maintain/improve the tools/software. It also requires income to run a data center with 10,000s of end users, and extracting the data being added every day takes operational time as well. Interpreting new data requires time as well.

Any volunteers for coding the automation of YSNP prediction ? Or another approach, how much do you really support this advancement of tools and databases ? You could lead the effort to crowdfund the outsourcing of these programming efforts. Once the tools are operational, do you expect these tools and databases for free - it will take petabytes of hard disk space to load the BAM files and 100,000 end users would require $10,000s of hardware, software and communication support costs. I easily spend 30 or 40 hours per week on these smaller projects already.

I see the progression as charting tools first - they are extremely useful and it becomes much more obvious for what to test next - saving costs. Next tool would be the YSNP prediction. Just cut and paste your 67 markers and get your predicted YSNP down to where charting works 90 % of the time. Next collect your data and create your charts. Next would be a FTDNA database as input. Just enter your FTDNA ID, the prediction tools generates your predicted haplogroup. Next you run the tool but this depends on the crappy FTDNA database that is full of bloated data and numerous errors. It only partially includes Big Y data and does not include any FGC NGS tests or YSEQ WGS tests being rolled. So you export Alex's data to collect most of the NGS data (including missing Big Y data) and merge the FTDNA YSNP report into Alex's data. Remember Alex does not include SNP pack testing or individual YSNP testing, so this merge will be a challenge to determine which data is correct. But you are still missing the invaluable YSEQ YSNP data. Translation of IDs is more than a technical issue. It is a political and leadership issue - similar to R1b having a focal point for NGS uploads.

MitchellSince1893
01-28-2017, 04:54 PM
...Or another approach, how much do you really support this advancement of tools and databases ? You could lead the effort to crowdfund the outsourcing of these programming efforts. Once the tools are operational, do you expect these tools and databases for free - it will take petabytes of hard disk space to load the BAM files and 100,000 end users would require $10,000s of hardware, software and communication support costs. I easily spend 30 or 40 hours per week on these smaller projects already.

Just brainstorming...what about using an approach similar to gedmatch, where you have some basic free tools (maybe the ones that aren't that labor/IT intensive), and then have tiers of more advanced tools where paid subscriptions are required? The user could decide what would be of best use based on their knowledge and interest. I could see the FTDNA project admins being major users of the higher tier tools.

I used to use the freebee tools on gedmatch, but found I needed the tiered tools for my work...It's well worth the monthly $10, I've been paying for 1-2 years now. It would be difficult for me to go back to the basic free tools at this point.

EDIT: Or another thought is a collaboration between gedmatch, Alex Williamson, Robert Casey, Peter M, and others to combine resources and current infrastructure/tools

RobertCasey
01-28-2017, 05:25 PM
Just brainstorming...what about using an approach similar to gedmatch, where you have some basic free tools (maybe the ones that aren't that labor/IT intensive), and then have tiers of more advanced tools where paid subscriptions are required? The user could decide what would be of best use based on their knowledge and interest. I could see the FTDNA project admins being major users of the higher tier tools.

I used to use the freebee tools on gedmatch, but found I needed the tiered tools for my work...It's well worth the monthly $10, I've been paying for 1-2 years now. It would be difficult for me to go back to the basic free tools at this point.

That is a good way to startup a niche business which we should emulate. But it takes a core set of members with skills (design, programming, QA, promotion, etc.) and it takes a certain amount of funding to purchase software and infrastructure or fill in the gap for skills by hiring programmers. FTDNA has at least 10 to 20 full time individuals and yet we still need better tools and databases. GEDMATCH has three individuals with complementary skills but still lacks a full set of skills to roll out other than atDNA tools. I am also investigating teaming with GEDMATCH as well and teaming with other programmers as well. But we all have different availability of time, skills and varying goals/approaches. Normal software development issues.

I really need programmers to roll out tools/databases faster that are higher quality and more comprehensive, need QA to help beta test the tools when that time comes or funding to hire programmers (I was in outsourcing for the last ten years of my career and know how to outsource programming to the Far East, Middle East or former Soviet countries) at costs that the genetic community would accept. I need others' time or funding support to get things going. Signature recognition is an AI type of programming where there are specialized languages that can be used. But these languages are very expensive to roll out and few programmers have these skills - especially at reasonable rates that our genetic community could afford - so it will probably be coded in php or C++ (or some other language that volunteers have in depth knowledge of).

TigerMW
01-28-2017, 06:57 PM
..... As Peter pointed out, this is very fragmented across the genome. It was quite a major effort to recently add U106 (both political and time). If the R1a crowd wants to be added next, I am pretty sure that Alex would be able to handle another influx of data to analyze. In my response to Peter, I realized that BigTree is missing a EXCEL export feature which would also be another next step.
FYI, I think FTDNA is working on a CSV file export feature for their haplotree. That would be very helpful to be able to download phylogenetic equivalents. Phylogenetic equivalent viewing is one of the big benefits of the Big Tree. It is painful to look at the "More" pop-ups for equivalents the way FTDNA and YFull do it. It's not downloadable unless you can dissect html coding.

Long haplogroup labels may be included in the FTDNA CSV export as they are still there, just under the covers.



The first pass at U106 is now up and running at the link below:

http://www.ytree.net/DisplayTree.php?blockID=1147

Very good. I'd like to see the R1b-L51xP311 guys jump on board next.

The politics have been brought up. I think they will always be there as an inhibitor between vendor capabilities and infrastructure but I don't see why the volunteers shouldn't share what they have. Alternative methods of doing things is good for evaluation and advancement. There are control concerns between overlapping or master/subset projects but I don't see any reason that people can't be involved in multiple projects, and for some overlap to be present. There is a trade-off between skills in a group of people and interest level as it relates to specificity of the smaller and smaller subclades. There is nothing wrong with both being leveraged.



It only partially includes Big Y data and does not include any FGC NGS tests or YSEQ WGS tests being rolled. So you export Alex's data to collect most of the NGS data (including missing Big Y data) and merge the FTDNA YSNP report into Alex's data. Remember Alex does not include SNP pack testing or individual YSNP testing, so this merge will be a challenge to determine which data is correct. But you are still missing the invaluable YSEQ YSNP data. Translation of IDs is more than a technical issue. It is a political and leadership issue - similar to R1b having a focal point for NGS uploads.
There are at least two major considerations. I have set up for L513 a VCF-like format to merge Big Y relevant VCF, Pack and individual SNP results. The display generated is rough and tree-like but purely reporting oriented - little interpretation. This is not that hard to do. However, bringing in results from different vendors is a royal pain. I used to do this with YSEQ but the different synonyms used and the ID/Kit# cross-referencing system are the big limitation. The lack of sharing is disappointing as it limits the usefulness even if you overcome the data integrity and maintenance for the cross-referencing.

The second consideration is the phylogenetic analysis and interpretations in a cross-platform system. It's hard enough to do this with all NGS results of one vendor, then add a second and it is still semi-workable but then add in all of the individual SNP results where phylogenetic equivalents are not often included is a real challenge.
My advice for YFull and Alex is the same... keep it clean and like for like as much as possible. To go cross-platform really does require the true database, data section "owners", data integrity checking, etc.

Peter M
01-28-2017, 09:02 PM
Nice lively discussion :), my intention is to respond to as many view points I can, but will most likely split up my response into multiple posts.


If you look at the Irwin project where he has around 150 67 marker submissions and a couple of dozen NGS tests for just one surname cluster, you might see that 67/111 plus Big Y is just not enough resolution to do the job. 67 to 111 helps some and Y Elite could help a little more, but we will probably need even more YSTR resolution to assign two or three mutations per ancestor on our ancestor chart. Once the WGS test goes down to around $200 and has longer read lengths to catch all 111 markers, the data will be there for the taking. We do need to start sorting out the STRs that are hard to analyze or have issues. Also, we need to classify slow, medium and fast mutation rates and come up with the 300 slower markers plus another 100 faster mutation rates for the old Genetree approach where you test the faster markers for lots of descendants (10X) to chart more connections.

I'm very afraid, you'll find that even with 500 STR profiles and a number of Elite tests added, the job will not be easily done (depending on the complexity of the family). And then these will add new problems in addition (like the latest Elite 2.1 test being mapped to HuRef38). In general, I think when an endeavour like this was started with Big-Y (mapped to HuRef37), then it might be better to stay with Big-Y, to make sure all test results can be compared on an equal basis (if only because of PR issues in the project).

Please note I have not looked at the Irwin family in any detail, but we have families in R-Z18 that started a similar adventure. In addition, I do hope to have understood your issue correctly.

I guess (personally, I'm convinced), the issue here is you see an endeavour like this as a two step thing: (a) run as many tests as possible and (b) compare all results and try to design a tree that reflects those results and explains the family structure. My approach would have been to do things iteratively, by first looking at the highest level of the family tree which SNPs divide the family up in a number of "sub-families". This might involve first determining which STR profiles are the furthest removed from each other (in GD terms; might be more than two) and then asking those to order a Full-Y test (e.g. Big-Y). This would result in a number of top level SNPs. Then negotiate with YSeq a "family deal" to test as many family members for those high level SNPs as possible. It might be possible to find an STR pattern for one or more of these SNPs and these patterns might help you in the process.

After this first iteration, you'll most likely have split the "big Irwin problem" into a number of "small(er) Irwin Problem"-s and to each family branch you apply the same procedure. Most likely, with this approach you need fewer than "a few dozen NGS tests" to arrive at a reasonable family tree. It might well save a lot of testing costs.
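As a rough illustration of the first step (finding the STR profiles that are furthest apart before deciding who should order a Full-Y test), here is a minimal Python sketch. The kit names and repeat values are invented, and the distance is a plain sum of absolute repeat differences, which is only one simplified way of counting GD; it ignores multi-copy markers and any vendor-specific counting rules.

```python
from itertools import combinations


def genetic_distance(a, b):
    """Sum of absolute repeat-count differences over markers both profiles share."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


def most_distant_pair(profiles):
    """Return (gd, kit1, kit2) for the pair of kits with the largest distance."""
    return max(
        (genetic_distance(profiles[x], profiles[y]), x, y)
        for x, y in combinations(sorted(profiles), 2)
    )


# Hypothetical 4-marker profiles, for illustration only.
profiles = {
    "kit_A": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11},
    "kit_B": {"DYS393": 13, "DYS390": 23, "DYS19": 14, "DYS391": 10},
    "kit_C": {"DYS393": 12, "DYS390": 25, "DYS19": 15, "DYS391": 11},
}

gd, first, second = most_distant_pair(profiles)
print(first, second, gd)  # kit_B kit_C 5
```

In a real family project the same function would be run again inside each sub-family once the top level SNPs have split the group.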

This is an example of the "iterative focussed test" approach I discussed in an earlier post in this thread. Essentially one decides within a group of people which samples should be subjected to which type of test (paying for these tests by sharing the costs between the members of the group), instead of relying on anybody who feels like ordering a test and who happens to have the funds available to order. Please note, this approach requires some organisational effort and quite a bit of leadership, it might not be everybody's approach.

How useful those "extra STRs" will turn out to be to this process in practice will become clear along the way (my current opinion is: not too much).

TigerMW
01-28-2017, 11:00 PM
....
I'm very afraid, you'll find that even with 500 STR profiles and a number of Elite tests added, the job will not be easily done ....
How useful those "extra STRs" will turn out to be to this process in practice will become clear along the way (my current opinion is: not too much).
I wouldn't say much of anything here is done easily and without qualification.

However, if ("if") there is a change to the 111 STR panel every 3 generations, then 400 or 500 STRs would be getting the average rate down to less than one generation. Even in a lineage with a pause in mutations, on the high end of the variance rate, we should be seeing some changes every couple of father-son transmissions. This would be immensely helpful if we've already got the stable SNPs fencing off families down to the last several hundred years.
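The arithmetic behind that "if" can be written out as a small Python sketch. It assumes, purely for illustration, that every extra marker mutates at the same average rate as the markers in the 111 panel, so the interval between changes scales inversely with panel size; real STRs vary a lot in mutation rate.

```python
def generations_per_change(panel_size, base_panel=111, base_generations=3.0):
    """Average generations between STR changes, scaled linearly with panel size.

    Assumes the same average per-marker rate as the 111-marker panel,
    which is a simplification.
    """
    return base_generations * base_panel / panel_size


for size in (111, 400, 500):
    print(size, round(generations_per_change(size), 2))
# 111 3.0
# 400 0.83
# 500 0.67
```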

I don't think we'll get our high quality 500 STRs soon, if at all, but a couple of hundred more should help. The jury is still out, though. Even if we had them, we'd still want some kind of selection of "genealogically" focused STRs in their own panel, as Robert describes. The nice thing from a vendor standpoint is this is not like fixed SNP testing where only a few SNPs are relevant to any one family, but a standard panel could have a broad market.

The implication is we need a lot more review of the unlocked STRs, probably from Big Y results since there are plenty of individuals tested, then the step of selecting a genealogical panel could take place.

lgmayka
01-28-2017, 11:13 PM
It is painful to look at the "More" pop-ups for equivalents the way FTDNA and YFull do it.
YFull offers sharable reports, but they follow only one line of descent. For example:
G-L661 (http://www.yfull.com/share/yreport/6f4aedfe0c25784f16c73535aa3ce6af/)
R-Y6956 (http://www.yfull.com/share/yreport/1657247ce911bb577073a02cbbda1040/)
R-Y18894 (http://www.yfull.com/share/yreport/f18212b6676dbebe3f540e5a8d42ed8b/)
R-YP1337 (http://www.yfull.com/share/yreport/7f25b2ad9f0ecaec31f7baa3f0d60eb5/)
R-BY653 (http://www.yfull.com/share/yreport/9cf9a26b0272561786ad84215cfe9b7b/)
I-CTS10228 (http://www.yfull.com/share/yreport/bf5383bfa8811e97332b0faabbae7894/)

I agree that YFull ought to offer a more convenient way to look at an entire subtree in the same way.

Bollox79
01-28-2017, 11:23 PM
I guess (personally, I'm convinced), the issue here is you see an endeavour like this as a two step thing: (a) run as many tests as possible and (b) compare all results and try to design a tree that reflects those results and explains the family structure. My approach would have been to do things iteratively, by first looking at the highest level of the family tree which SNPs divide the family up in a number of "sub-families". This might involve first determining which STR profiles are the furthest removed from each other (in GD terms; might be more than two) and then asking those to order a Full-Y test (e.g. Big-Y). This would result in a number of top level SNPs. Then negotiate with YSeq a "family deal" to test as many family members for those high level SNPs as possible. It might be possible to find an STR pattern for one or more of these SNPs and these patterns might help you in the process.

After this first iteration, you'll most likely have split the "big Irwin problem" into a number of "small(er) Irwin Problem"-s and to each family branch you apply the same procedure. Most likely, with this approach you need fewer than "a few dozen NGS tests" to arrive at a reasonable family tree. It might well save a lot of testing costs.



This sounds a bit like what Dr. Iain McDonald has been doing for a while per the DF98 King's Cluster pdf, which I am very thankful for as I am DF98+ and also share some SNPs per Big Y with the Doctor himself and the "Roman?" skeleton 6drif-3 from Driffield Terrace ;-). I am just happy to have someone maintaining a pdf on our DF98 group!!! Really... I say thank you to everyone - Peter included - who put a lot of effort into figuring out all these SNP family trees for the different groups ;-). Dr. Iain's King's Cluster pdf: http://www.jb.man.ac.uk/~mcdonald/genetics/kings-cluster.pdf

Cheers!!

Charlie

TigerMW
01-28-2017, 11:25 PM
YFull offers sharable reports, but they follow only one line of descent. For example:
G-L661 (http://www.yfull.com/share/yreport/6f4aedfe0c25784f16c73535aa3ce6af/)
R-Y6956 (http://www.yfull.com/share/yreport/1657247ce911bb577073a02cbbda1040/)
R-Y18894 (http://www.yfull.com/share/yreport/f18212b6676dbebe3f540e5a8d42ed8b/)
R-YP1337 (http://www.yfull.com/share/yreport/7f25b2ad9f0ecaec31f7baa3f0d60eb5/)
R-BY653 (http://www.yfull.com/share/yreport/9cf9a26b0272561786ad84215cfe9b7b/)
I-CTS10228 (http://www.yfull.com/share/yreport/bf5383bfa8811e97332b0faabbae7894/)

I agree that YFull ought to offer a more convenient way to look at an entire subtree in the same way.

The great beauty of Alex's visual depiction of the Big Tree is exactly because he conquered this problem and brings the added value of a visual time element by demonstrating the lengths of each of the branches. He does and has a drill down, zoom out capability.

There are pitfalls to the Big Tree visual approach (particularly with very granular branching), but the phylogenetic equivalents are not hidden away.

MitchellSince1893
01-29-2017, 01:49 AM
The great beauty of Alex's visual depiction of the Big Tree is exactly because he conquered this problem and brings the added value of a visual time element by demonstrating the lengths of each of the branches. He does and has a drill down, zoom out capability.

There are pitfalls to the Big Tree visual approach (particularly with very granular branching), but the phylogenetic equivalents are not hidden away.

Also, the length of the SNP list is affected by the test type. Case in point, two FGC Y Elite testers are going to have a lot more SNPs than a branch with two BigY testers.

Branch on the left is two FGC test takers. Branch on the right is 2 BigY test takers. Yfull estimates age branch on right is 200 years younger, even though branch on left has a lot more SNPs.
http://www.ytree.net/DisplayTree.php?blockID=1082&star=false

RobertCasey
01-29-2017, 05:42 AM
The great beauty of Alex's visual depiction of the Big Tree is exactly because he conquered this problem and brings the added value of a visual time element by demonstrating the lengths of each of the branches. He does and has a drill down, zoom out capability.

There are pitfalls to the Big Tree visual approach (particularly with very granular branching), but the phylogenetic equivalents are not hidden away.

Also, since Alex's chart visually show that some YSNPs have a truckload of equivalents, genetic bottlenecks visually jump out at you. These genetic bottlenecks take a lot of time to develop. In contrast, FTDNA's cleaner and more simplistic approach requires you to click the popup menus to see equivalents and FTDNA is very inconsistent with including full lists of equivalents as well.

On the plus side for FTDNA, its simple approach is easier to navigate branching since there is so much less data to scan. So each visual has different pros and cons. I usually use FTDNA for manual analysis of haplotrees when I am only interested in the progression of branches. Also, Alex includes much more information that can be very useful - such as Full Genomes Y Elite results and full listing of private YSNPs which are both missing from FTDNA charts. For quick and dirty analysis (progression of branching), use FTDNA's haplotree. For more detailed analysis and completeness, use Alex's chart.

For L226 and L555 only, FTDNA's haplotree is really messed up. I do not even use them for L226 since around 50 % of the branches under L226 are not branches. They list all 100 of the private YSNPs included in these two SNP packs as real branches vs. private YSNPs. This will become a more serious problem if 20 or 30 SNP packs are updated with 50 private YSNPs each. My recommendation to include these private YSNPs with the equivalents with brackets around them in the popup menu received no response. Since these private YSNPs can be ordered, they probably wanted to make them available to see in their haplotree where they want everyone to place orders for YSNPs. Once they roll out many more private YSNPs in SNP packs, this issue will affect a lot more people and will hopefully get resolved.

BTW, the 50 private YSNPs in L226 is a huge success story for L226. The inclusion of 50 equivalents has not produced any results to date. We have revealed seven branches after around 50 SNP packs. But even much better, my ability to chart L226 reliably has gone from 40 % to 75 % with so much better YSNP information to use. Around 90 % of 42 L226 branches are included in this SNP pack which has really made a huge difference in understanding L226 YSTR progression with YSNP data added. The problem associated with private YSNPs being listed as branches is a trivial issue compared to the huge progress being made with private YSNPs being included in the L226 SNP pack. It's not only that 50 private YSNPs are included but 90 % of the branches are being included as well.

RobertCasey
01-29-2017, 06:57 AM
FYI, I think FTDNA is working on a CSV file export feature for their haplotree. That would be very helpful to be able to download phylogenetic equivalents. Phylogenetic equivalent viewing is one of the big benefits of the Big Tree. It is painful to look at the "More" pop-ups for equivalents the way FTDNA and YFull do it. It's not downloadable unless you can dissect html coding.

Long haplogroup labels may be included in the FTDNA CSV export as they are still there, just under the covers.


These will help some - but the issue I was pointing out was getting the leadership of other parts of the haplotree to send in files. The "Not Invented Here" issue is a major barrier for expansion. Also, R1b is really fortunate to have good leadership (partially due to lots of Americans with R1b ancestry that have built larger databases). Also, strong leadership is less available in some parts of the haplotree mainly due to less testing to date - a smaller pool of researchers.



Very good (about adding R1a). I'd like to see the R1b-L51xP311 guys jump on board next.

The politics have been brought up. I think they will always be there as an inhibitor between vendor capabilities and infrastructure but I don't see why the volunteers shouldn't share what they have. Alternative methods of doing things is good for evaluation and advancement. There are control concerns between overlapping or master/subset projects but I don't see any reason that people can't be involved in multiple projects, and for some overlap to be present. There is a trade-off between skills in a group of people and interest level as it relates to specificity of the smaller and smaller subclades. There is nothing wrong with both being leveraged.


Sharing is the key issue that we should stress, and a common well defined database is another strong point in getting others to post VCF files as most of R1b does. Also, hopefully U106 has a positive reaction to being added to Alex's database. Their recent experience (hopefully positive) should help with further expansion to other parts. Since you're an active admin in the R1b-L51xP311 space, hopefully that will help.

As far as haplogroup project management - we should always encourage new leadership to start their own projects. Once they become well established and have a good track record, we should encourage movement of testers to more recent haplogroup projects like you have been doing for R1b which is hard to even load these days (not sure why these larger projects have so many reliability issues lately).



There are at least two major considerations (concerning exporting). I have set up for L513 a VCF-like format to merge Big Y relevant VCF, Pack and individual SNP results. The display generated is rough and tree-like but purely reporting oriented - little interpretation. This is not that hard to do. However, bringing in results from different vendors is a royal pain. I used to do this with YSEQ but the different synonyms used and the ID/Kit# cross-referencing system are the big limitation. The lack of sharing is disappointing as it limits the usefulness even if you overcome the data integrity and maintenance for the cross-referencing.


We really need to develop some kind of standard for exporting of relevant YSNP data. First, we should just filter out all the NatGeo YSNPs unless they are relevant to the terminal YSNP. We should include all relevant equivalents and relevant private YSNP test data as well. We also need to include all relevant downstream negative results. Probably all of SNP packs could be included. Not sure about inclusion of all the valid ancestors of terminal YSNPs but since that is not that many, it would be OK (except pre-R1b values should be removed).

Here is a large option that most will not like but would be very useful. Even though spreadsheets really are bad for hierarchical data (YSNPs), we could export father relationships via automated scripts. Like DF13 would also include L21 as the father. You could include all the fathers of YSNP values included. For missing YSNPs in the haplotree, you just need to add a third field (tested/untested). The fourth field would be positive (for all upstream) and negative for relevant downstream. We could develop a tool that reconstructs a graphic with this format. The format is very common for tracking hierarchical data with two dimensional software programs like EXCEL (row/column). This could be automated for output but would require a tool to display it. This format can be imported in MySQL (etc.) with simple canned queries.
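To make the row/column idea concrete, here is a minimal Python sketch of the four-field layout (YSNP, father, tested/untested, positive/negative) and of a small routine that rebuilds an indented tree from it. The column names are my own, and "BranchA", "BranchA1" and "BranchB" are placeholder YSNP names invented for the example; only the L21 to DF13 relationship comes from the text above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export for one tester: YSNP, father YSNP, tested flag, result.
ROWS = """snp,father,tested,result
L21,,tested,positive
DF13,L21,tested,positive
BranchA,DF13,untested,
BranchA1,BranchA,tested,positive
BranchB,DF13,tested,negative
"""


def load_rows(text):
    """Parse the flat spreadsheet-style export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def build_children(rows):
    """Group rows by their father YSNP (an empty father marks the top of the export)."""
    children = defaultdict(list)
    for row in rows:
        children[row["father"]].append(row)
    return children


def render(children, father="", depth=0):
    """Rebuild the hierarchy as indented text lines, one YSNP per line."""
    lines = []
    for row in children.get(father, []):
        status = row["result"] if row["tested"] == "tested" else "untested"
        lines.append("  " * depth + f"{row['snp']} ({status})")
        lines.extend(render(children, row["snp"], depth + 1))
    return lines


tree_lines = render(build_children(load_rows(ROWS)))
print("\n".join(tree_lines))
```

The same rows load directly into a two-column-keyed table in MySQL or a spreadsheet; the display tool only needs the father field to put the hierarchy back together.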



The second consideration is the phylogenetic analysis and interpretations in a cross-platform system. It's hard enough to do this with all NGS results of one vendor, then add a second and it is still semi-workable but then add in all of the individual SNP results where phylogenetic equivalents are not often included is a real challenge.

My advice for YFull and Alex is the same... keep it clean and like for like as much as possible. To go cross-platform really does require the true database, data section "owners", data integrity checking, etc.


We do have a roadmap for NGS but YFULL is an issue. We need to stress for the R1a and I haplogroups that use YFULL that our approach is free of charge (for now until Alex needs economic support to 1000s of users). But Alex will probably have to merge some of YFULL's bells and whistles to win them over. Poor Alex is getting a lot of to-dos from us. He does have a life, career and family.

YSEQ data is an ID translation issue. We could scan all YSEQ data just like we scan FTDNA reports. However, without a comprehensive cross reference of IDs, that data is not much good. I do have a comprehensive list of cross reference for L226 (Dennis Wright and I maintain this file). However, there are so many admins at this level, it would be hard to get large numbers of admins to upload these cross references as we do with the VCF files. While we are uploading IDs, we might as well upload the YSNP results as well (interim need since we do not scan the data today). I think Thomas would be open to providing more direct access to the data if we can solve the ID issue. He would not add an optional field for the FTDNA ID.
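To make the role of the ID cross-reference concrete, here is a minimal sketch of how per-tester results from the two sources could be combined once such a list exists. Everything in it is an assumption for illustration: the function name, the dictionary layout and the made-up IDs are hypothetical and do not reflect how FTDNA or YSEQ actually deliver data.

```python
def merge_results(xref, ftdna, yseq):
    """Combine calls for cross-referenced testers.

    xref  : list of (ftdna_kit, yseq_id) pairs (hypothetical layout)
    ftdna : {ftdna_kit: {snp: call}}
    yseq  : {yseq_id: {snp: call}}

    Returns (merged, unmatched) where merged is
    {ftdna_kit: {snp: {"FTDNA": call, "YSEQ": call}}} and unmatched is
    the sorted list of YSEQ IDs with no cross-reference entry.
    """
    merged = {}
    unmatched = set(yseq)
    for kit, yseq_id in xref:
        calls = {}
        for snp, call in ftdna.get(kit, {}).items():
            calls.setdefault(snp, {})["FTDNA"] = call
        for snp, call in yseq.get(yseq_id, {}).items():
            calls.setdefault(snp, {})["YSEQ"] = call
        merged[kit] = calls
        unmatched.discard(yseq_id)
    return merged, unmatched_sorted(unmatched)

def unmatched_sorted(ids):
    """Return IDs in a stable order so reports are reproducible."""
    return sorted(ids)

if __name__ == "__main__":
    # Made-up IDs and calls, for illustration only.
    xref = [("K0001", "Y100")]
    ftdna = {"K0001": {"L226": "+"}}
    yseq = {"Y100": {"L226": "+", "SNP_A": "-"}, "Y200": {"L226": "+"}}
    print(merge_results(xref, ftdna, yseq))
```

The unmatched list is the part that shows the limitation: without a cross-reference entry, the YSEQ results cannot be attached to any tester.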

Peter M
01-29-2017, 09:08 AM
It is pretty automated now but I need to continue to fine tune the YSNP table. After 1.5 million rows, MySQL started finally to slow down to where I had to create a second source table. I really only pulled U106 and R1A since I could no longer sort out the YSTR report based on the long format.

That sounds a bit like a data modeling issue (1.5 million rows for just R1a and R-U106) ??

I don't think we should be getting too technical here, if you (or anybody else) want to have a more technical discussion on the R-Z18 software, Y-DNA Tree and analysis and web presentation techniques in general, there's a dedicated group on Facebook called Y-DNA Database (https://www.facebook.com/groups/YDNADatabase/) available for that. Why on Facebook ? Well, I would like to see who I'm talking to and Facebook makes a reasonable effort to force you to present yourself with your real-world name. This is a secret group, so you'll have to apply for admission. BTW, when I say technical, I really mean technical issues, I myself have always been a professional designer of large web systems in the telecoms field. I do not like e.g. chit-chatting about .html editors (in case you're not web-aware, that's previous decade/century technology).

MacUalraig
01-29-2017, 12:48 PM
We do have a roadmap for NGS but YFULL is an issue. We need to stress for the R1a and I haplogroups that use YFULL that our approach is free of charge (for now until Alex needs economic support to 1000s of users). But Alex will probably have to merge some of YFULL's bells and whistles to win them over. Poor Alex is getting a lot of to-dos from us. He does have a life, career and family.



Have you engaged with YFull at all? I'm not following all this in detail but it sounds like you are trying to take people away from YFull, am I reading this right? That would be a big shame, personally I think their service is way better than what Alex is doing (which at least in the early days was just parsing VCF reports).

It bugs me somewhat that people keep comparing the two as if they are equivalent 'services' since they are nothing of the kind.

Peter M
01-29-2017, 03:50 PM
Have you engaged with YFull at all? <...> it sounds like you are trying to take people away from YFull, am I reading this right? That would be a big shame, personally I think their service is way better than what Alex is doing (which at least in the early days was just parsing VCF reports).
This is interesting. I do hope you indeed have engaged with YFull and are able and willing to substantiate this "way better" claim with a full explanation of all the technical issues involved. Please feel free and encouraged to do so.

This is a very serious question.

Reason for me asking: there might be a small technical detail somewhere in either the Y-Full or the Alex Williamson approach we might be able to use to improve the R-Z18 software. And as you claim to have all the detailed knowledge to compare the two, you would be the man to help things a little (I'm sure there are other people here who are interested as well). The previous generation of Y-Full fans always claimed the Y-Full approach to be "much better", but they never substantiated this, or specified better than what, therefore their claims were not very useful to us.

MacUalraig
01-29-2017, 03:56 PM
This is interesting. I do hope you indeed have engaged with YFull and are able and willing to substantiate this "way better" claim with a full explanation of all the technical issues involved. Please feel free and encouraged to do so.

This is a very serious question.

Reason for me asking: there might be a small technical detail somewhere in either the Y-Full or the Alex Williamson approach we might be able to use to improve the R-Z18 software. And as you claim to have all the detailed knowledge to compare the two, you would be the man to help things a little (I'm sure there are other people here who are interested as well). The previous generation of Y-Full fans always claimed the Y-Full approach to be "much better", but they never substantiated this, or specified better than what, therefore their claims were not very useful to us.

I have better things to do than respond to sarcasm.

lgmayka
01-29-2017, 04:13 PM
I'm not following all this in detail but it sounds like you are trying to take people away from YFull, am I reading this right?
If indeed Alex's service remains free of charge, it does not need to "take people away" from YFull--Big Y customers can feel free to submit their raw data to both. As you point out, YFull provides additional services that Alex does not, and does not plan to, provide. For example, YFull provides safe permanent storage and convenient retrieval of raw data contents: At any time I can query YFull's database to show me exactly what readings I got at any location, by name or number:
---
Sample: #YF01476 (I-CTS10228)
ChrY position: 4178283 (+strand)
Reads: 12
Position data: 12G
Weight for G: 1.0
Probability of error: 0.0 (0<->1)
Sample allele: G
Reference (hg19) allele: C
Known SNPs at this position: Y3105 • FGC12080 (C->G) *****
Reference sequence (100bp): GTTCAAAGAAGATGTCCAAATAGTCAAGAGGAATATTAAAAGTTGCTCAA
C
ATCCTAATACCCAGGGAAATGCAATTCGACAACAATGAGATATTACCTTA
(4178232-4178333)
---

The real issue here is legal privacy. I must remind everyone, particularly project administrators, that FTDNA considers BAM files and even VCF/BED files to be sensitive private information, even more sensitive than mtDNA coding-region mutations. This is evident from the fact that a project member has an easy way (in the web interface) to let administrators see his mtDNA coding-region mutations, but has no such way to let administrators access his BAM file or even his VCF/BED files. The only available way to share these Big Y data files is via kit authorization (giving out the kit password) or direct file/hyperlink transfer.

I do not believe that FTDNA made up this policy just to spite administrators. Rather, I suspect that FTDNA's lawyers, after getting a description of these files' personal contents, decided that a strict privacy policy was necessary to maintain a defense against lawsuits. Consequently, FTDNA never officially encourages customers to transfer these files to any company or person, much less to post them on a public bulletin board (e.g., a Yahoo group).

There is a related legal issue here, perhaps peculiar to American law. My understanding is that our law makes a clear distinction between a commercial relationship, a personal relationship, and a broadcast (which is no relationship at all).

- If I pay for a service, the provider and I have a contract. I can sue him for breach of contract, or in some cases get an injunction or even an arrest warrant. (For example, our copyright laws have migrated over time from lawsuits to injunctions to criminal convictions.)

- If I ask someone to do me a favor (without compensation), we have no real contract. He may have a common-law obligation to take "slight care"--i.e., not to do anything deliberately harmful or grossly negligent--but no more than that.

- If I broadcast information (e.g., by posting it on a public bulletin board such as a Yahoo group), I essentially give up all rights to it--the information is now public. If the format expresses my personal creativity, that format may still be copyrighted; but facts are not copyrightable--they go into the public domain.

In consequence, I personally cannot possibly recommend that a project member post his Big Y raw data to a public bulletin board such as a Yahoo group--I cannot possibly accept the legal responsibility attached to such a recommendation. Frankly, I am surprised that FTDNA permits any project administrator to make such a recommendation.

Peter M
01-29-2017, 04:20 PM
Have any good ideas on how to get a viable comprehensive YSEQ ID to FTDNA ID cross reference tool where pulling YSEQ could be done across all haplogroups ? YSEQ does not want to add an optional field for "FTDNA ID" for fear of FTDNA reaction (this really dilutes the usefulness of their database).
FT-DNA and YSeq are in pretty unfriendly competition for obvious reasons. I'm convinced if we want to bring the test results from the two labs together, and add results from other labs and analysis of ancient Y-DNA, the task of actually "joining" the two data sets lies with the community. If the people co-operate, this is easily done.


If you rolled out a full database with dozens of good queries and very high end web output, they would probably react.
This is possible, but only if they recognised their IDs being used from the outside, for which there's no real necessity. Apart from that, personally, I don't think they would ever legally attack an association of their customers that is also generating new testing demand.


Another good tool would be the automatic detection of bad YSNP values - detecting when bad data from YSNPs in unstable areas happen, multiple mutations of the same YSNP, technology errors in SNP packs (pretty common), bad NGS data that FTDNA routinely publishes, etc. This takes a lot of time manually now.
Most of these things fall within the scope of the SNP Database. It may be hair-splitting, but I think the two (Trees and SNPs) should not be mixed unnecessarily for good reasons.
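One simple form of the automatic detection mentioned in the quote above is a consistency check against the tree: a tester who is positive for a YSNP should not be negative for any of its ancestors. The sketch below assumes a father lookup table without cycles and "+" / "-" calls; the function name and the placeholder SNPs `SNP_A` and `SNP_B` are hypothetical. A flagged pair does not say which of the two calls is wrong (it may equally be a back mutation or a misplaced branch), it only marks the pair for manual review.

```python
def find_conflicts(father_of, calls):
    """Flag (positive_snp, negative_ancestor) pairs for one tester.

    father_of : {snp: father_snp}, assumed to contain no cycles
    calls     : {snp: "+" or "-"} for the SNPs actually tested
    """
    conflicts = []
    for snp, call in calls.items():
        if call != "+":
            continue
        ancestor = father_of.get(snp)
        while ancestor:
            if calls.get(ancestor) == "-":
                conflicts.append((snp, ancestor))
            ancestor = father_of.get(ancestor)
    return sorted(conflicts)

if __name__ == "__main__":
    # Illustrative tree fragment; SNP_A and SNP_B are placeholders.
    father_of = {"DF13": "L21", "SNP_A": "DF13", "SNP_B": "SNP_A"}
    calls = {"L21": "+", "DF13": "-", "SNP_B": "+"}
    print(find_conflicts(father_of, calls))
```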


The true approach would obviously have to be database driven with some pretty high end / easy to navigate web output (along with exports to EXCEL for those included).

In my post I mentioned a "new and revolutionary approach". I was referring to the workflow. Everything that is being done (and being used in the process) to analyse test results and to draw conclusions from this analysis. I simply assume state of the art web technology to be used in the process.

By default, FTDNA is the best output at this point in time. Glad to see them finally make some major improvements. For P312 and U106, Alex's Big Tree and FTDNA's haplotree are pretty comprehensive. Of course there are a lot of warts remaining but these two web outputs are pretty useful for regular usage (manual examination). You can drill down a lot more with Alex's tool.


<...> but Alex is only NGS driven which is a major limitation
This is a very major limitation and unfortunately it is common to nearly all approaches I'm aware of. One key feature of a "new and revolutionary approach" would be for it to cover all test results from all labs.


This charting tool is my primary focus for now - the timing is really perfect for this tool for well tested haplogroups. <...>
I'm afraid, I see a few issues with it that might be good to have a good look at in the future.

Peter M
01-29-2017, 04:33 PM
<...>But there is also an operational aspect to these tools and databases. With each iteration of rollout, we automate more and are able to expand the scope in both coverage and functionality. Again, this is a lot of coding effort and operational time. For the database maintenance, I have to extract out a derived field every new surname from three FTDNA fields (this can not be automated). With the demise of the long haplogroup name, I now manually do table lookups for each new haplogroup added to the tree (this could be automated but is yet another automation/coding project). I need more programming assistance or funds to hire programmers.
Amazing, why is all that manual work needed ??

Peter M
01-29-2017, 04:47 PM
But there is also an operational aspect to these tools and databases. <...>
Important reminder !!


FTDNA has a staff of ten or twenty programmers and IT staff just to get out the information that they produce.
I'm a wee bit afraid that building functions that support groups of customers will not be very high on FT-DNA's priority list. Most of these programmers will be working on streamlining the running of new tests (for individual customers) and on optimising the operational cost of running the shop. I would not look to FT-DNA for providing any support to haplogroup teams.


Any volunteers for coding the automation of YSNP prediction ?
Considering the point where we stand at this moment, I personally would not see Y-SNP prediction as a very obvious first priority.


Or another approach, how much do you really support this advancement of tools and databases ? You could lead the effort to crowdfund the outsourcing of these programming efforts. Once the tools are operational, do you expect these tools and databases for free - it will take petabytes of hard disk space to load the BAM files and 100,000 end users would require $10,000s of hardware, software and communication support costs. I easily spend 30 or 40 hours per week on these smaller projects already.
I guess, first consider and define the process then specify and develop the tools needed.

Peter M
01-29-2017, 04:53 PM
Just brainstorming...what about using an approach similar to gedmatch, where you have some basic free tools (maybe the ones that aren't that labor/IT intensive), and then have tiers of more advanced tools where paid subscriptions are required? The user could decide what would be of best use based on their knowledge and interest. I could see the FTDNA project admins being major users of the higher tier tools.

I used to use the freebee tools on gedmatch, but found I needed the tiered tools for my work...It's well worth the monthly $10, I've been paying for 1-2 years now. It would be difficult for me to go back to the basic free tools at this point.

I guess gedmatch supplies tiers of tools to individual users; in the Y-DNA process the users are groups (haplogroups) of users, most likely led by an admin. I don't think it would be easy to define a payment/subscription model for such a user base.


Or another thought is a collaboration between gedmatch, Alex Williamson, Robert Casey, Peter M, and others to combine resources and current infrastructure/tools

That's a good idea. But keep in mind: a clear lack of co-operation is one of the weakest attributes of the genetic genealogy field.

Peter M
01-29-2017, 05:15 PM
It is painful to look at the "More" pop-ups for equivalents the way FTDNA and YFull do it. It's not downloadable unless you can dissect html coding.
Dissecting html won't help you: they either use js directly or (much more likely) they use ajax. What you need is more serious software ;)

A more serious question is about the popups. Would you be so kind as to explain exactly why you think all equivalents names should be displayed in a primary tree display ?? I'm not referring to indicating in the display the fact that there are an x number of alternative names, but to the necessity of an explicit list naming them all.


Long haplogroup labels may be included in the FTDNA CSV export as they are still there, just under the covers.
Would you be so kind as to explain why these long names are so useful we need to endanger our databases (might require explanation); what can people actually do with them ?? (serious question, I'm completely puzzled)


The second consideration is the phylogenetic analysis and interpretations in a cross-platform system. It's hard enough to do this with all NGS results of one vendor, then add a second and it is still semi-workable but then add in all of the individual SNP results where phylogenetic equivalents are not often included is a real challenge.
My advice for YFull and Alex is the same... keep it clean and like for like as much as possible. To go cross-platform really does require the true database, data section "owners", data integrity checking, etc.
If you mean multi-vendor (or multi-lab) by saying cross-platform, then multi-lab is the essence of what we are discussing here, I guess, and yes, that will require the use of serious databases and serious software. This in turn will require a serious effort from knowledgeable people. Then the problems you mention can relatively easily be solved.

RobertCasey
01-29-2017, 11:42 PM
Have you engaged with YFull at all? I'm not following all this in detail but it sounds like you are trying to take people away from YFull, am I reading this right? That would be a big shame, personally I think their service is way better than what Alex is doing (which at least in the early days was just parsing VCF reports).

It bugs me somewhat that people keep comparing the two as if they are equivalent 'services' since they are nothing of the kind.

I am just trying to figure out a single comprehensive way to share NGS data across the genome. The BigTree has extremely good coverage of R1b (probably exceeding 90 % of the NGS tests). I know that R1a and a lot of the I haplogroup use YFULL. What is bad for genetic genealogy is two (or even more than two) methodologies for doing the same thing. Each has its pros and cons; we need to discuss these pros and cons rather than not even trying to determine which methodology is the best long term.

I submitted my Y Elite 1.0 to YFULL back when it was free of charge two years ago. Since I already had my Full Genomes analysis, I did not learn too much. Here is a list of issues (which is biased to YTREE and FTDNA since R1b uses this approach).

1) Cost - A one-time charge of $50 is not an unreasonable cost for services. Eventually, Alex will need to charge for usage if his site's traffic results in higher costs that need to be recovered. However, there will be a certain percentage of individuals that will not pay the $50 - which is a minor disadvantage for now.

2) Coverage of haplogroup by each organization. When comparing BigTree to Alex's web site, Alex has FGC which is better than FTDNA which obviously does not allow uploads of competitive test results. I am sure that five to ten percent of R1b does not participate with the VCF file uploads. Do you know the coverage of YFULL when compared to FTDNA's haplotree for R1a or I haplogroups ?

3) YFULL and FGC use BAM file inputs which can reveal more information than the interpreted VCF summaries. So YFULL gets the edge for this functionality.

4) YFULL makes little effort to use existing YSNP names and creates YFS and Y labels that are highly redundant with FTDNA, ISOGG and FGC labels. This is a disadvantage for YFULL.

5) A major concern is covering the rest of the haplogroups that are not as well tested and have more splintered leadership - primarily due to a smaller population of testers in general. Even if YFULL and BigTree reach a happy consensus, we need a strategy that works better for all haplogroups.

6) I do believe that everyone should pay an extra $50 for either FGC or YFULL analysis but we need to reach a consensus on which files would be best. Obviously BAM files would be the best, but these files are very large and require a lot of processing power and storage to archive the files.

7) Compatibility with FTDNA YSNP reports. Since FTDNA is rolling out really nice SNP packs these days, having a compatible naming convention with the FTDNA YSNP reports is another important factor as SNP pack data needs to be merged with NGS data for a complete picture. Again, the YFULL naming convention has been an issue for me to require translations of synonymous labels.

8) Compatibility with YSEQ YSNP database and FGC output reports is another major issue where YFULL uses different labels. Don't get me wrong, there are a lot of translation issues between FGC, YSEQ and FTDNA - but YFULL makes minimal attempt to use existing labels. Of course, we could mitigate this issue with massive translation tables - but YFULL does not do a good job in YSNP naming.

9) YFULL is a viable company with a revenue stream and a history of success. Alex is an individual who might become unavailable, which is a continuity risk that could be solved. This is an advantage for YFULL but this could be corrected as Alex's business model will eventually require others to keep it going and an income stream to pay for IT and support costs someday.

If you have other issues, please let me know. I know my list is biased for Alex since I am an R1b researcher by heritage. Hopefully each will add the other's functionality over time, so that the decision will have less impact on either community.

lgmayka
01-30-2017, 12:37 AM
4) YFULL makes little effort to use existing YSNP names and creates YFS and Y labels that are highly redundant with FTDNA, ISOGG and FGC labels.
My understanding is that YFull gladly uses SNP names that have (already) been officially registered into Ybrowse, which is the closest we have to a common, mutually agreed Y-SNP database. Are other processors (FGC, FTDNA, Alex) diligent in registering their SNPs into Ybrowse? And are the other processors also reusing SNP names already in Ybrowse instead of making up new ones?

Does YFull download the Ybrowse database periodically instead of querying it for each SNP? I would say that's likely, and understandable--but it introduces a delay which can cause construction of synonyms.

RobertCasey
01-30-2017, 12:58 AM
My understanding is that YFull gladly uses SNP names that have (already) been officially registered into Ybrowse, which is the closest we have to a common, mutually agreed Y-SNP database. Are other processors (FGC, FTDNA, Alex) diligent in registering their SNPs into Ybrowse? And are the other processors also reusing SNP names already in Ybrowse instead of making up new ones?

YBrowse is run by Thomas Krahn - who runs YSEQ. When I submitted my files to YFULL, I included the FGC labels via VCF files, which YFULL did not use. My 40 or so FGC labels were later submitted to FTDNA, who added them to their haplotree. When requesting the YSNPs to be added at YSEQ (where they have been extensively tested of non Big Y coverage), YSEQ updated YBrowse. I know that YSEQ makes a real effort to collect YSNP labels but neither FTDNA nor FGC submits their data to YSEQ for publication since both view YBrowse as a competitor.

Most of R1b uses FTDNA labels unless FGC or YSEQ labels have been assigned - FTDNA does make an effort to check ISOGG, YBrowse and other sources for existing labels. YBrowse could be used by all as the master source, but it is a vendor-supported database and does not have the most user-friendly interface for a master source. Labeling is a major issue that none of the players are very good at. ISOGG is not a possible source as its inclusion is too restrictive in nature to be a major player anymore. Since FTDNA is the dominant player in this field, getting them to follow any standard approach is another major issue. But this issue needs a written standard with a downside for non-compliance (at least emails to all vendors who do not follow it). Until we are brave enough to create and softly enforce a standard, we will continue to have this major issue.

It is really hard to get FTDNA to change their labels. If YFULL would honor requests to synchronize their labels with more standard labels being used by the R1b community, that might work. However, there are some 5,000 plus NGS tests under P312. That is $250,000 at $50 each - too cost prohibitive to recommend at this point in time. Also, FGC testers will resist shoving out $50 just to get their YSNPs listed which is another issue since P312 has several hundred FGC tests to date.

It looks like only Dennis Wright, Dennis O'Brien and myself have submitted files to YFULL (the key researchers of L226). We are missing around 50 NGS tests. So adding L226 to YFULL would require $2,500 to get our branches listed. This is redundant with the work that Dennis Wright does with BAM files - which is even better than YFULL and FGC analysis. YFULL lists 21 IDs under the branches - that is 40 % of the NGS testers. If that many submitted data, their analysis missed a lot of branches. Also, there is yet another translation issue for ID numbers, which would be impossible to track at a global level.

RobertCasey
01-30-2017, 04:42 AM
Amazing, why is all that manual work needed ??

It is necessary to create derivative fields from the FTDNA reports. The first is the surname which can come from three sources. It is extremely doubtful that any automation could ever separate surnames from other parts of the full name and place names. The second derived field is a replacement of the long YSNP name with a string of relevant YSNPs. This could be automated a lot more, but the haplotree is dramatically changing day to day. I currently use a table lookup scheme (pretty crude) which could be replaced by a master haplotree extract from FTDNA - but that would require more coding. But every pull from the database would reveal newly discovered branches that constantly have to be updated. This may be a lot of effort for the terminal YSNP in the YSTR reports - but this is often the only source of the YSNP results since YSNP reports are private (project only) much more than YSTR reports.

The second manual aspect is running the extraction jobs. FTDNA time outs are very common these days and projects are becoming public (everyone) and private (project only) every day. Maintaining the project table list takes a lot of time. This could be more automated as well to recognize and report changes in the status of projects. Also, I continue to add smaller projects and projects that are more remote to Haplogroup R. Also, new haplogroup and new surname projects are added daily. There is a manual operational issue of maintaining the project table.

RobertCasey
01-30-2017, 05:08 AM
Considering the point where we stand at this moment, I personally would not see Y-SNP prediction as a very obvious first priority.


I had really given up on YSNP prediction across the genome until I got involved with the charting tool. To make YDNA charting work with a signature based methodology, it also requires the automation of signature recognition which I now have up and running (manually). This could probably be exported / modified for YSNP signature across the genome where 85 to 90 % of all 67 markers could be predicted with 90 to 99 % accuracy.

Here are my preferences for rolling out tools 1) the chart building tool as timing is ripe to make good charts and used by more modest skilled individuals to make testing recommendations; 2) YSNP prediction would be next; 3) database to drive the tools is largely up and running (but always needs continuous improvement from other testing companies).

You just enter your FTDNA ID number, it retrieves your 67 YSTR values and predicts your YSNP (1,500 to 2,500 years old) around 90 % of the time. It then retrieves all the relevant YSNP and YSTR data and produces a chart that currently would be 75 % of L226, this could include automated testing recommendations as well. Of course, if only based on FTDNA data from their YSNP and YSTR reports, a lot of genetic information would be lacking due to missing Big Y data, missing FGC data, missing YSEQ data and errors from the FTDNA reports as well. The charting would also allow user input if you have collected more comprehensive / accurate data.

This is why YSNP prediction has been added back onto my list: 1) it would be a major tool to roll out once the automated signature methodology is perfected from the charting tool; 2) it would be a great way to reduce the need to test multiple layers of SNP packs - you just go to the lowest level YSNP directly; 3) it allows those with modest skills to participate in making a better selection of the next step (FTDNA NGS, private YSNPs at YSEQ, higher resolution FGC or FTDNA SNP pack).

What would be your favorite tools to roll out first ?

Peter M
01-30-2017, 12:49 PM
It is extremely doubtful that any automation could ever separate surnames from other parts of the full name and place names.

Are you sure ? This can be done routinely with a success rate of at least 90% by an automated procedure. I agree, in a few cases manual intervention is needed. One just needs more clever software.

Peter M
01-30-2017, 01:14 PM
I had really given up on YSNP prediction across the genome until I got involved with the charting tool. To make YDNA charting work with a signature based methodology, it also requires the automation of signature recognition which I now have up and running (manually). This could probably be exported / modified for YSNP signature across the genome where 85 to 90 % of all 67 markers could be predicted with 90 to 99 % accuracy.

Here are my preferences for rolling out tools 1) the chart building tool as timing is ripe to make good charts and used by more modest skilled individuals to make testing recommendations; 2) YSNP prediction would be next; 3) database to drive the tools is largely up and running (but always needs continuous improvement from other testing companies).

<..>

What would be your favorite tools to roll out first ?
My preference would be to first discuss the workflow of the whole process from test results to trees (including geographics and time (for the fans)). For each step in the workflow a supporting tool should be defined including all interfaces needed. Then an identification of all parties involved would result in a good picture of who would need which function, which might lead to a practical business model per (group of) tool (e.g. is it reasonable to expect people to be paying for certain tools).

I'm well aware that this might sound a wee bit complex, but it lays a solid foundation for all subsequent work, e.g. development. I'm also aware that, to address the dependency on individual people (e.g. Alex and me), it would be better to have this done by a (corporate) entity or association, but this seems difficult at the moment (e.g. the ISOGG is mainly concentrating on politics and is a "no-budget" organisation, which will make running any serious tool problematic, so ISOGG will not be a practicable platform).

As to the actual priorities, I'm a huge fan of a layered design of an architecture (collection of systems). This means that individual systems or tools could be run by independent entities (people, corporations or associations). To give an example, I would expect the "ground layer" of a support architecture for genetic genealogy to consist of a SNP database, with a clearly defined interface, so that, within access rights, any person or party could read SNP data and add new SNPs to the system (one could argue that YBrowse is such a system, but there are a few issues here (e.g. interfacing)). Please note, this is just an example.
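To illustrate what a "clearly defined interface" for such a ground-layer SNP database might look like, here is a minimal in-memory sketch. It is not how YBrowse or the R-Z18 SNP Database works; the class, its methods and the choice to key a mutation by (position, ancestral, derived) are all assumptions made for the example. The sample entry reuses the two names shown in the YFull query output earlier in this thread (Y3105 and FGC12080, C->G at hg19 position 4178283); the order in which they are registered here is arbitrary.

```python
class SnpRegistry:
    """Toy SNP name registry: one mutation, any number of synonyms."""

    def __init__(self):
        self._by_key = {}   # (position, ancestral, derived) -> list of names
        self._by_name = {}  # name -> (position, ancestral, derived)

    def add(self, name, position, ancestral, derived):
        """Register a name; return the preferred (first registered) name."""
        key = (position, ancestral, derived)
        if name in self._by_name and self._by_name[name] != key:
            raise ValueError(f"{name} is already registered for a different mutation")
        names = self._by_key.setdefault(key, [])
        if name not in names:
            names.append(name)
        self._by_name[name] = key
        return names[0]

    def synonyms(self, name):
        """All names known for the same mutation, in registration order."""
        return list(self._by_key[self._by_name[name]])

    def preferred(self, name):
        """Translate any synonym to the preferred name."""
        return self._by_key[self._by_name[name]][0]

if __name__ == "__main__":
    reg = SnpRegistry()
    reg.add("Y3105", 4178283, "C", "G")
    # A second party registering the same mutation gets the existing name back.
    print(reg.add("FGC12080", 4178283, "C", "G"))
    print(reg.synonyms("FGC12080"))
```

With a lookup like this in the ground layer, tools in higher layers (trees, charts, reports) could translate every incoming label to one preferred name before comparing results from different labs.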

TigerMW
01-30-2017, 02:06 PM
My understanding is that YFull gladly uses SNP names that have (already) been officially registered into Ybrowse. ...
I haven't seen this to be the case but it may be you asked the right person and I did not. My experiences with YFull on this were in 2015 with several L513 people. I specifically asked for certain SNP names to be respected that were already in YBrowse and even available by Sanger Sequencing already. I asked in the comments during order/submission process and provided the existing SNP names. They didn't comply but I recognize it may have not been a high priority for them or that some times things just slip through cracks.

I think SNP synonyms are an annoying detriment, but I threw in the towel on this long ago. I've submitted recommendations to ISOGG and tried to get something going. No luck.

Wing Genealogist
01-30-2017, 02:35 PM
A bit of a history lesson:

Full Genomes originally contracted with YFull to perform the analysis of FGS test results. For whatever reason, this contract was terminated shortly after the testing started. I don't know why this happened (nor do I care to know why) but this termination likely caused some bitter feelings on both ends.

When YFull first started their own (independent) analysis of NGS files (from both FTDNA and Full Genomes) they did not use the FGC SNP names, but created their own Y (and YFS) SNP names. After a period of time, YFull decided to accept the FGC SNP names and stopped their initial practice of assigning their own Y/YFS names to previously named SNPs.

Given the Global nature of DNA testing, friction is inevitable given how different countries have different laws and values. "The West" has many laws which protect business interests (which has both positive and negative features). Many of these laws are not universally recognized in all countries (which again has positive and negative features).

Peter M
01-30-2017, 04:38 PM
A bit of a history lesson:

Full Genomes originally contracted with YFull to perform the analysis of FGS test results. For whatever reason, this contract was terminated shortly after the testing started. I don't know why this happened (nor do I care to know why) but this termination likely caused some bitter feelings on both ends.

When YFull first started their own (independent) analysis of NGS files (from both FTDNA and Full Genomes) they did not use the FGC SNP names, but created their own Y (and YFS) SNP names. After a period of time, YFull decided to accept the FGC SNP names and stopped their initial practice of using their own Y/YFS names to previously named SNPs.

Given the Global nature of DNA testing, friction is inevitable given how different countries have different laws and values. "The West" has many laws which protect business interests (which has both positive and negative features). Many of these laws are not universally recognized in all countries (which again has positive and negative features).
A bit of corrected history lesson: During the time FGC started discussing the concept of a Full-Y test, it was recognised that analysis would be needed and that the test lab was not able to provide that. The first agreement was that the existing software from somebody else from Europe would be used for analysis and for the web site. As a deal could not be made, for reasons that are not relevant for now, this agreement never materialised and they (FGC) defined an alternative strategy. This was most unfortunate as it would have made a lot of things easier.

Peter M
01-30-2017, 05:08 PM
My understanding is that YFull gladly uses SNP names that have (already) been officially registered into Ybrowse,
<...>



I haven't seen this to be the case <...>

And indeed it isn't the case. One of the issues in building the Y-DNA SNP database was the choice of how to name the SNPs that have both a Yxxxx and a FGCyyyy name (and no other names). When I was initially building the system, there were more than 5,000 of these SNPs. I do not know exactly who has been ignoring whom in naming these SNPs, but one thing is sure: these two parties are not mutually recognising each other's names.

A second observation is that at least one SNP that was discovered by the R-Z18 project, and as such named ZPsomething (I don't know the exact number offhand), was later renamed YPsomething (exactly the same number). As the number was exactly the same, the original name must have been known and this renaming must have been done deliberately. How about "gladly uses SNP names that have (already) been officially registered"? I guess it is also a little problematic to see SNP names in YBrowse as being officially registered, as the BY names that FT-DNA use tend not to be in YBrowse.

BTW, for those interested, SNPs discovered in the R-Z18 project that didn't have a name yet when discovered have been named ZPn and are registered as such in YBrowse. The series of SNP names starting with Z (e.g. Z18) was started by Greg Magoon in 2010/2011, and the initial Z has come to stand for "the gg-community". "Within" the Z-series, people have started using Z2000, Z4000 and Z8000. To simplify this naming, we started an explicit subseries with Z (community), P (my name) and a number, thereby inviting people to start naming SNPs they discovered in their own name series, such as ZD (for Joe Doe). The ZZ was reserved to start a second series if/when needed, as ZZAn etcetera. The ZZ series has in the meantime been allocated by Alex for his palindromic SNPs.

In general, I'm not at all worried about SNPs having multiple names, as the Y-DNA SNP Database allows each SNP to have an unlimited number of names, but I do like the SNPs in R-Z18 to have a number as low as possible, as this significantly improves the average SNP's attractiveness.

The most important thing I'm worried about is that this slightly chaotic naming of SNPs is not going to make the field any easier to enter for newcomers. There are many other contributions to this difficulty, but apparently I'm one of the few who see this as a problem.

lgmayka
01-30-2017, 05:54 PM
neither FTDNA and FGC submit their data to YSEQ for publication since both view YBrowse as a competitor.
That is a real problem!

My understanding is that Ybrowse does, from time to time, ask FTDNA for a list of BY-names. Perhaps Ybrowse makes a similar occasional query to FGC? But even if so, one can easily see that if every entity updates its SNP database periodically, the delays on all sides will create plenty of duplicate names.

Peter M
01-30-2017, 06:24 PM
That is a real problem!

My understanding is that Ybrowse does, from time to time, ask FTDNA for a list of BY-names. Perhaps Ybrowse makes a similar occasional query to FGC? But even if so, one can easily see that if every entity updates its SNP database periodically, the delays on all sides will create plenty of duplicate names.

FT-DNA do not supply (BY) information to YSeq (a fact, though my information is three to four months old; I expect it still to be true). Given the number of FGC variants in YBrowse, my assumption is that FGC supply this information to them in some form or other. And then we need to make a distinction between parties that publish their SNP data (YSeq) and parties that don't, won't and/or can't (due to the lack of a suitable system) (FT-DNA, FGC).

Most of this is based on the supposed competition between the various parties. The only party that could bridge the gaps is the user community. Then it would not be a real problem.

TigerMW
01-30-2017, 07:10 PM
FT-DNA do not supply (BY) information to YSeq (fact, but info from three-four months ago, I expect this to still be true), given the number of FGC variants in YBrowse, my assumption is, FGC supply this information to them in some form or other. And then we need to make a distinction between parties that publish their SNP data (YSeq) and parties that don't, won't and/or can't (due to a lack of a suitable system) (FT-DNA, FGC).

Most of this is based on the supposed competition between the various parties. The only party that could bridge the gaps is the user community. Then it would not be a real problem.

FTDNA provides BY pos-anc-der information to their testers, but it is on a request driven basis. I go on and submit requests for this data almost every week through the FTDNA contact request form.

My guess is that even an FGC principal can make these requests and receive the information. I don't know about the YSEQ principal, though, but anyone can make the request and turn over the details to whoever they want.

However, this is bad process on FTDNA's part. I requested over a year ago that they provide a link or download from the GAP tool and/or Advanced Tests menu to all pos-anc-der details for BY SNPs or for any named SNPs for that matter. Actually, I think I requested this about two years ago by now.

One of the problem areas is that FGC created SNP names for about anything that moved, but there was a time lag of a couple of years before some of the early private FGC SNP names reached YBrowse. They may have corrected that by now, as I see FGC names that were private pop up automatically on the Big Tree as Alex discovers and documents new shared SNPs for branches.

I recently sent the BY detail I have to Alex and Vince T. I periodically send this to Alex as I love the Big Tree and I'm P312 type.

lgmayka
01-30-2017, 07:39 PM
If YFULL would honor requests to synchronize their labels with more standard labels being used by the R1b community, that might work.
Many, many YFull SNPs are already shown with explicit multiple names (e.g., Y3131/FGC5637). Are you merely referring to the clade names--e.g., a clade is named R-Y12459 on YFull's tree, although the SNP name is FGC12290/Y12459 ?

FGC gives a name to every private (unshared) SNP it finds. FGC claims that this is what its customers generally want. In contrast, both FTDNA and YFull consider the naming and publication of private/unshared SNPs to be a legal privacy risk, which they avoid.

The discovery of a private SNP is not the same as the discovery of a public clade. If YFull discovers a clade, then later learns that FGC already gave the defining SNP an FGC name, that does not in itself imply that FGC discovered the clade first--precisely because FGC names every SNP regardless of whether it constitutes a clade or not.

Note: I don't know whether the "naming right of first discoverer" actually plays a significant role in YFull's clade naming. I know only that FGC's prolific ("promiscuous") naming practice complicates our community's ability to determine who discovered a clade first.

TigerMW
01-30-2017, 08:01 PM
... Most of this is based on the supposed competition between the various parties.
I think some of this is intentional, based on rivalries between the vendors, but I think that is largely overblown. The larger the vendor, the more likely it is that they just don't place priority on sharing information like they should and/or have the privacy concerns that Lawrence mentions. In that sense, it is more a management/expense/priority thing than an intentional attempt to hurt a rival.


The only party that could bridge the gaps is the user community. Then it would not be a real problem.
This has been and is true. Ironically (I won't name names), we have had an instance where a (non-vendor) tester/user acted as a go-between, but one of the vendors felt that the information flow was NOT bi-directional... hence there was a vendor re-alignment. This was intentional on the part of the two vendors, with a third as a result. I guess that qualifies as fireside chat. :)

lgmayka
01-30-2017, 08:01 PM
You just enter your FTDNA ID number, it retrieves your 67 YSTR values and predicts your YSNP (1,500 to 2,500 years old) around 90 % of the time.
Needless to say, this is only possible for the more densely populated subhaplogroups.

Also note that there is a very large geographical (and psychological) difference between the lower and upper time bounds you cite. The entire I-CTS10228 (I2a-Dinaric) clade (https://yfull.com/tree/I-CTS10228/) is only 2200 years old by YFull's reckoning, but sprawls across Central-Eastern Europe and beyond. A clade predictor that simply tells someone "you belong to I-CTS10228" is not very satisfying--indeed, 12 markers is often enough to do the same. Sadly, even an up-to-date SNP pack does not necessarily do much better--look at how many members of I-CTS10228 take the Big Y test, only to end up classified as I-S17250 (https://yfull.com/tree/I-S17250/). (They don't even get an asterisk, because a major subclade, PH908 (https://yfull.com/tree/I-PH908/), is not usually tested by the Big Y. :( )

lgmayka
01-30-2017, 08:13 PM
BTW, for those interested, SNPs discovered in the R-Z18 project, that didnt have a name yet when discovered, have been named ZPn and are as such registered in YBrowse. The series of SNP names starting with Z (e.g. Z18), has been started by Greg Magoon in 2010/2011 and the initial Z has emerged to stand for "the gg-community".
Apparently, the R-Z18 project did plenty of renaming itself:
ZP1 = Z8188
ZP8 = Z8191
ZP22 = CTS4753
ZP24 = S10198
ZP25 = S11601
ZP26 = S15815
ZP27 = S18329
ZP28 = S18973
etc.

I do not mean this as an accusation or even a criticism--only an observation that maintaining a single set of names without a centralized, agreed-upon repository is a difficult task for everyone.
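
As a rough illustration of what such a central repository could do on every submission, here is a small Python sketch that groups names by the variant they describe (position, ancestral, derived). Everything in it is an assumption made for illustration: the function name `find_synonyms`, the record layout and the example values are made up and do not come from YBrowse or any vendor.

```python
from collections import defaultdict


def find_synonyms(records):
    """Group SNP names that describe the same variant.

    `records` is an iterable of (name, position, ancestral, derived).
    Returns only the variants that carry more than one name.
    """
    by_variant = defaultdict(set)
    for name, position, ancestral, derived in records:
        by_variant[(position, ancestral, derived)].add(name)
    return {
        variant: sorted(names)
        for variant, names in by_variant.items()
        if len(names) > 1
    }


# Made-up submissions from three hypothetical sources.
submissions = [
    ("AAA1", 100, "C", "T"),
    ("BBB7", 100, "C", "T"),  # same variant, different name
    ("CCC3", 200, "G", "A"),
]
duplicates = find_synonyms(submissions)
```

Run periodically over everyone's name lists, a check like this would at least make the duplicates visible, even if it cannot decide which name should win.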

TigerMW
01-30-2017, 08:14 PM
.... both FTDNA and YFull consider the naming and publication of private/unshared SNPs to be a legal privacy risk, which they avoid...
I don't like how FTDNA does this because it makes life hard for project administrators, but I can appreciate their position and how it works.

Project admins cannot download BAM files and VCF/REGIONS files. The kit owner has to do it. They have to take overt actions to make their variant information available. Once they do (e.g. Big Tree, YFull or whatever), the cat is out of the bag and the information is public. I am concerned about privacy, but not on my Y DNA side. I'm more concerned about health data getting out the door. People say that is a hopeless situation, but I'm not throwing in the towel on releasing anything health related.

TigerMW
01-30-2017, 08:21 PM
Apparently, the R-Z18 project did plenty of renaming itself:
ZP1 = Z8188
ZP8 = Z8191
ZP22 = CTS4753
ZP24 = S10198
ZP25 = S11601
ZP26 = S15815
ZP27 = S18329
ZP28 = S18973
etc.

I do not mean this as an accusation or even a criticism--only an observation that maintaining a single set of names without a centralized, agreed-upon repository is a difficult task for everyone.
Lawrence, I will come to Peter's defense on this - not that I say I've never made any mistakes or that Peter has never made any mistakes.

Regardless of how you couch this, you said "Apparently, the R-Z18 project did plenty of renaming itself". That is a kind of criticism of some type.

It is not reliable to look at the dates entered into YBrowse per SNP as the dates of "discovery". YBrowse is not really a formal database with data integrity, nor is it really neutral. Still, it is a great service and we'd all be screwed if we didn't have it.

A second vagary is that this also circles back to the question of when an SNP should be named: when it is seen in one individual, or in two? And is it even really an SNP?

Peter M
01-30-2017, 08:25 PM
Apparently, the R-Z18 project did plenty of renaming itself:
ZP1 = Z8188
ZP8 = Z8191
ZP22 = CTS4753
ZP24 = S10198
ZP25 = S11601
ZP26 = S15815
ZP27 = S18329
ZP28 = S18973
etc.

I do not mean this as an accusation or even a criticism--only an observation that maintaining a single set of names without a centralized, agreed-upon repository is a difficult task for everyone.
You may well see this as criticism and it is correct. This happened in the early days of the SNP Database (note the low numbers; we have passed ZP200 by now), when not all SNP names were contained in the database (the full story is a little complicated). The problem has in the meantime been corrected and these ZP names are no longer primary. I currently don't know about the Z8000 series names; these might well be newer than the corresponding ZP names.

BTW, in the R-Z18 SNP database any SNP has a single primary name and an unlimited number of alternative names. Once an (alternative) name has been given, the name cannot normally be withdrawn.
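
For readers who prefer to see the rule as code, here is a minimal Python sketch of such a record. It is not the actual R-Z18 database code; the class name `SnpRecord` and its methods are hypothetical, and the choice of which name is primary in the example is only illustrative.

```python
class SnpRecord:
    """One SNP with a single primary name and append-only alternatives."""

    def __init__(self, primary: str) -> None:
        self._primary = primary
        self._names = [primary]  # every name ever given; never shrinks

    @property
    def primary(self) -> str:
        return self._primary

    @property
    def alternatives(self) -> tuple:
        return tuple(n for n in self._names if n != self._primary)

    def add_name(self, name: str) -> None:
        """Register a name; adding an already known name is a no-op."""
        if name not in self._names:
            self._names.append(name)

    def set_primary(self, name: str) -> None:
        """Promote a name; the old primary stays on as an alternative."""
        self.add_name(name)
        self._primary = name


# Illustrative only: one of the synonym pairs listed earlier in the thread.
record = SnpRecord("ZP22")
record.set_primary("CTS4753")
```

There is deliberately no method to remove a name, which mirrors the rule that a name, once given, is not normally withdrawn.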

lgmayka
01-30-2017, 09:24 PM
Once they do, i.e. Big Tree, YFull or whatever, the cat is out of the bag and the information is public.
No. That was my point earlier--YFull is a commercial concern with an explicitly published privacy policy, preventing the unrequested disclosure of personally identifiable information (e.g., private/unshared SNPs). You may disagree with this policy, but it is consistent with FTDNA's view, which was probably forged in the fire of lawsuit defense.

TigerMW
01-30-2017, 10:21 PM
No. That was my point earlier--YFull is a commercial concern with an explicitly published privacy policy, preventing the unrequested disclosure of personally identifiable information (e.g., private/unshared SNPs). You may disagree with this policy, but it is consistent with FTDNA's view, which was probably forged in the fire of lawsuit defense.
I don't disagree with the policy. I just find it an annoyance for project admins.

What I meant was out of the bag in terms of project administrators and free analysis through them like the Big Tree and what the U106 folks and Basal folks do. Pretty much everyone in R1b goes this route and then YFull is for those who want more.
... so if someone bypasses the project administration public types of analyses, and only goes to YFull, you are quite right.

EDIT: This is the U106 fireside chat so I think some assumptions about R1b content are natural.

lgmayka
01-31-2017, 04:21 AM
... so if someone bypasses the project administration public types of analyses, and only goes to YFull, you are quite right.
Disclosure to a project administrator is certainly not supposed to equate to worldwide publication, unless that project administrator specifies that clearly in the project description. ("Anything you send to the administrator may be broadcast worldwide.")

For example, R1a Project administrators often request VCF/BED files from Big Y participants within the project. But the tacit assumption is that only shared SNPs from such files will be published, in accordance with the project's stated goals. The administrators do not publish private/unshared SNPs--that would, I think, be a breach of FTDNA's privacy policy unless the project member were warned in advance.

FTDNA's Group Administrator Guidelines (https://www.familytreedna.com/learn/project-administration/gap-guidelines-ftdna-projects/) do not quite cover this precise situation, perhaps only because they have not been updated recently:
---
Privacy and confidentiality are a key responsibility for a Group Administrator. Group Administrators have access to data and contact information of the members in the project. This access is necessary to assist participants in understanding and interpreting their results. Family Tree DNA expects Group Administrators to protect members’ privacy and confidentiality.

Group Administrators shall not use this access to:
...
Post guarded DNA Results to a public website or otherwise make them public.

Guarded DNA Results are mtDNA Coding Region results, Factoid results, and Population Finder results.
---

The web site guards VCF/BED files and BAM files even more tightly than the "guarded DNA results" mentioned above. I get the impression, then, that the guidelines have simply not been updated yet to reflect Big Y testing.

TigerMW
01-31-2017, 05:01 AM
Disclosure to a project administrator is certainly not supposed to equate to worldwide publication, unless that project administrator specifies that clearly in the project description. ("Anything you send to the administrator may be broadcast worldwide.")

For example, R1a Project administrators often request VCF/BED files from Big Y participants within the project. But the tacit assumption is that only shared SNPs from such files will be published, in accordance with the project's stated goals. The administrators do not publish private/unshared SNPs--that would, I think, be a breach of FTDNA's privacy policy unless the project member were warned in advance.

The FTDNA's Group Administrator Guidelines (https://www.familytreedna.com/learn/project-administration/gap-guidelines-ftdna-projects/) do not quite cover this precise situation, perhaps only because they have not been updated recently:
---
Privacy and confidentiality are a key responsibility for a Group Administrator. Group Administrators have access to data and contact information of the members in the project. This access is necessary to assist participants in understanding and interpreting their results. Family Tree DNA expects Group Administrators to protect members’ privacy and confidentiality.

Group Administrators shall not use this access to:
...
Post guarded DNA Results to a public website or otherwise make them public.

Guarded DNA Results are mtDNA Coding Region results, Factoid results, and Population Finder results.
---

The web site guards VCF/BED files and BAM files even more tightly than the "guarded DNA results" mentioned above. I get the impression, then, that the guidelines have simply not been updated yet to reflect Big Y testing.
You will find that R1b project administration in U106 and P312 do provide that kind of clarity. I think the U106 guys even spell out their caveat in all caps.

[EDIT: Disclosure to a project admin and to the project public forums is not an explicit attempt at a "worldwide broadcast" and I don't know of cases where project admins are doing that. However, it is a fact of life today that something on any public or semi-public internet site is accessible world-wide by about anyone. That is the cat out of the bag.]

Cofgene
01-31-2017, 11:23 AM
FTDNA's knee-jerk reaction and definitions of privacy have been defined by a series of poor legal judgements. The general topic of genetic privacy is similar to the GMO squabbles. For an individual's list of singletons, unless there has been specific follow-up testing to establish their order, one cannot define which ones could be restricted due to privacy. Tossing 500-3000 years' worth of an individual's singletons under a shroud of privacy and secrecy is indefensible. This is why all projects and systems dealing with this information should state up front that the results are being placed into the public domain. This should especially apply to the increasing number of family projects which have multiple family member results present in a system.

The hiding of singletons from reports aggravates the acquisition of data necessary to improve the current mutation rate estimates and look deeper at topics including recurrency and use of INDELs, MNPs, STRs as phylogenetically useful markers.

TigerMW
01-31-2017, 01:48 PM
...
The hiding of singletons from reports aggravates the acquisition of data necessary to improve the current mutation rate estimates and look deeper at topics including recurrency and use of INDELs, MNPs, STRs as phylogenetically useful markers.
Singleton (some call them private) SNPs don't need to be hidden. That is my understanding. Y SNPs are not on the list of "guarded DNA results" that Lawrence posted below.

I think it's just a matter of practicality (possibly oversight too), on FTDNA's part, that singletons are not listed. The SNP display method on the Y DNA SNP report is an eyesore as it is.

FTDNA has no expert interpretation applied to singleton SNPs. That's coming from third parties, like YFull, or from project administrators.

FTDNA's BAM expert (M. Sager) interpretation only happens when SNPs are requested to be added to the haplotree (which also happens in the Pack development process) or a haplogroup label is requested to be forced into an update manually.

By default, interpretations are only done on SNPs that occur at least twice in their database, hence they aren't singletons. However, as has been noted, there are exceptions.

On the other hand, FTDNA's non-expert and very conservative automated method of calling SNPs puts singletons and other novel SNPs onto the Big Y matching and results screens, which can be accessed by matches as well as project admins. These are the dumb calls though. Many are correct, some are not - that's the state of their automation algorithms.

Peter M
02-01-2017, 09:37 PM
Singleton (some call them private) SNPs

I'm a bit afraid this is not just a matter of terminology: a singleton SNP and a private SNP may very well be two different things. A singleton is a SNP that, to date, has been seen only once, and a family/private SNP is a SNP that is only found in people in the same family (normally carrying the same surname). So singleton is a qualification that is given when the SNP is first seen, and the SNP might be reclassified as public or family/private after more testing and analysis.
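
Put as a small Python sketch, the distinction might look like this. The function name `classify_snp` and the use of the surname as a stand-in for "same family" are simplifying assumptions for illustration; real family groupings are of course not always visible from surnames.

```python
def classify_snp(carriers):
    """Classify a SNP from the testers found positive for it.

    `carriers` is a list of (kit_id, surname) pairs; the surname is
    used here as a crude proxy for "same family".
    """
    if not carriers:
        raise ValueError("a SNP needs at least one positive tester")
    if len(carriers) == 1:
        return "singleton"  # seen once so far; status may change later
    surnames = {surname.strip().lower() for _, surname in carriers}
    return "family/private" if len(surnames) == 1 else "public"


# Made-up kits and surnames.
first_sighting = classify_snp([("K1", "Doe")])
same_family = classify_snp([("K1", "Doe"), ("K2", "Doe")])
two_families = classify_snp([("K1", "Doe"), ("K3", "Roe")])
```

The same SNP can therefore move from "singleton" to either of the other two classes as more people test.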


Singleton (some call them private) SNPs don't need to be hidden. That is my understanding. <...>

I think it's just a matter of practicality (possibly oversight too), on FTDNA's part, that singleton's are not listed.
<...>

I assume this is about privacy. I can understand that FT-DNA (context of the original post) are carefully watching all privacy issues, to stand stronger in a possible confrontation with the FDA and/or privacy-loving customers.

I understand far less why individual customers would be so worried about privacy issues. Personally, I don't care much about other people seeing my Y-DNA results, as I expect at least hundreds of people to get essentially the same Y-DNA results if they were to order a test (people group S from sharing).

One could get the impression there are other people (group P from privacy), who think that after a DNA test, they'll find the result of the test on the FT-DNA site, complete with a full description of the meaning (!!) of those results, fully worked out as described by some specialised scholar in the scientific literature. And that after a DNA test the results can be studied in full isolation, so that nobody else will ever be informed about these very sensitive DNA results. For this group of people it might be completely unknown, that the meaning of the personal result is instead discovered by direct comparison to a lot of other results (possibly by people who are to some extent hobbyists), which fact in practice implies sharing results with others and making results known to others.

My guess is that group S(haring) is much larger than group P(rivacy), but nevertheless e.g. the default setting on the FT-DNA site for privacy is focussing on the P group, and if one doesn't change this setting explicitly, the test results are not shared (since about kit# 400000).

Any comments, anybody? Why do people tend to be so worried about privacy?

TigerMW
02-01-2017, 11:59 PM
I'm a bit afraid, this is not just a matter of terminology, a singleton SNP and a private SNP may very well be two different things. A singleton is, to date, only seen once and a family/private SNP is a SNP that is only found in people in the same family (normally carrying the same surname). So singleton is a qualification that is given, when the SNP is seen first and it might be classified as public or family/private after more testing and analysis.
My post was about singletons but I have been told various things about what private means in the SNP world.
I probably shouldn't even mention the word private because of the evolution of the term. It just becomes an annoyance to conversation.


Any comments anybody ? why tend people to be so worried about privacy ??
I don't think there is widespread worry about privacy when it comes to Y DNA and these projects. Y DNA is not on the guarded data list, people have to overtly join projects and overtly turn in Big Y raw results. Project admins in R1b have been clear about the public nature of the data.
Even people who join projects can adjust their profile settings.

RobertCasey
02-02-2017, 12:10 AM
Needless to say, this is only possible for the more densely populated subhaplogroups.


YSNP prediction does not need a very large sample size to accurately predict YSNPs in the 1,500 to 2,500 year range. It takes only a handful of tested submissions and 20 to 50 matching testers. However, there is a pretty narrow age range where prediction based on signatures works. For instance, L720 is quite young and tracks one marker mutation, and signatures do not give preference to one marker value. However, accuracy usually only drops to 90 % vs 99 % for most predictions. Older YSNPs that have multiple signatures have just too many hidden parallel and backwards mutations that can no longer be seen in present-day testers.

However, charting does require a substantial amount of genetic information to chart adequately. There are dozens of such branches under R1b - but that number probably drops radically as you migrate away from US driven testing - with some notable exceptions. For L226, we now have 44 branches under L226 (none two years ago) and have a signature that remains at 100 % after 150 tests. We also have 50 NGS tests, 100 critical private YSNPs tested at YSEQ, and 50 leading edge L226 SNP pack tests that include 50 private YSNPs and 50 branch equivalents. These YSNP packs have revealed eight new branches to date and YSEQ testing has revealed another 6 branches. I am now able to chart 75 % of the 500 67-marker submissions with only 20 % robustly tested.



Also note that there is a very large geographical (and psychological) difference between the lower and upper time bounds you cite. The entire I-CTS10228 (I2a-Dinaric) clade (https://yfull.com/tree/I-CTS10228/) is only 2200 years old by YFull's reckoning, but sprawls across Central-Eastern Europe and beyond. A clade predictor that simply tells someone "you belong to I-CTS10228" is not very satisfying--indeed, 12 markers is often enough to do the same. Sadly, even an up-to-date SNP pack does not necessarily do much better--look at how many members of I-CTS10228 take the Big Y test, only to end up classified as I-S17250 (https://yfull.com/tree/I-S17250/). (They don't even get an asterisk, because a major subclade, PH908 (https://yfull.com/tree/I-PH908/), is not usually tested by the Big Y. :( )


This is a range which would vary depending on genetic bottlenecks, random variation of mutation rates high in the haplotree, how much the population moved around, the penetration rate of testing among individuals belonging to this group, etc. YSNP prediction would eliminate the costly need for tiered SNP pack testing, since you would be able to order the most recent YSNP pack directly, reducing testing costs. It could also document signatures across the genome in a standard way and would be based on the statistics that should be used for any kind of signature prediction: binary logistic regression.
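
For those who have not met binary logistic regression before, here is a bare-bones sketch in plain Python. It is not the actual predictor: the feature encoding (1 if a tester's marker equals the signature value, 0 otherwise), the three-marker toy data and the function names are all assumptions made for illustration, and a real tool would use many more markers and a proper statistics library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x) -> float:
    """Probability that a tester with features x is positive for the YSNP."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Toy data: each row says whether three hypothetical markers match the
# signature; label 1 means the tester is known positive for the YSNP.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
     [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
weights, bias = fit_logistic(X, y)
full_match = predict_proba(weights, bias, [1, 1, 1])
no_match = predict_proba(weights, bias, [0, 0, 0])
```

The output is a probability rather than a yes/no answer, which is what makes it possible to quote figures such as "90 %" or "99 %" for a prediction.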

jdean
02-02-2017, 12:11 AM
But of course (and stating the obvious here) a lot of private SNPs are future branching SNPs waiting to be discovered

lgmayka
02-02-2017, 01:02 AM
Y SNPs are not on the list of "guarded DNA results" that Lawrence posted below.
As I specifically mentioned, that has to be a mere oversight (i.e., a failure to update the guidelines to reflect downloadable files). I must once again emphasize that FTDNA has deliberately chosen to make VCF/BED and BAM files unavailable to project administrators. Obviously, FTDNA believes that such personal information is private.

You are apparently looking for an official statement from FTDNA on this. I will ask them for one.

TigerMW
02-02-2017, 03:02 AM
As I specifically mentioned, that has to be a mere oversight (i.e., a failure to update the guidelines to reflect downloadable files). I must once again emphasize that FTDNA has deliberately chosen to make VCF/BED and BAM files unavailable to project administrators. Obviously, FTDNA believes that such personal information is private.

You are apparently looking for an official statement from FTDNA on this. I will ask them for one.
Why do you assume I am looking for an official statement? I am not looking for anything from you or FTDNA related to this. I think your concerns are overblown as far as the P312 and U106 projects that I'm aware of are concerned...

I send out a form letter/email whenever I request the VCF/REGIONS files and tell them exactly what we are doing with these files. The tester has to respond back with their approval and send the files too. That's what I mean by overt action on the part of the kit owner.

There are also caveats such as this one that I put on the R-L21 project web page...

"This is a public project. The more of us who test and share our information, the more we will all know. When you join this project, you are granting permission to place your Y SNP and STR data into the public domain, from which it can never be retrieved. We do not publish your full given name or contact info. "

The above is the kind of message that U106 puts in all caps.

The kit owners have to deliberately join these projects and deliberately send the project admins the files.

BTW, I rarely use them but I'm pretty sure the CSV files and Big Y matching/results web screens have singleton SNPs on them. These are free and clear to see by project admins. Hopefully you will not ask FTDNA to retract those capabilities.

TigerMW
02-02-2017, 03:35 AM
But of course (and stating the obvious here) a lot of private SNPs are future branching SNPs waiting to be discovered
Yes, and it is working exactly that way for the three L513 SNP Packs we have. I think the L226 guys are seeing some of this too.

In a couple of cases, for very large surname groups with only one NGS tester, I requested several of the singleton (phylo equivalent) SNPs in the packs because I knew surely one or two of them would break up the group.

RobertCasey
02-02-2017, 03:36 AM
No. That was my point earlier--YFull is a commercial concern with an explicitly published privacy policy, preventing the unrequested disclosure of personally identifiable information (e.g., private/unshared SNPs). You may disagree with this policy, but it is consistent with FTDNA's view, which was probably forged in the fire of lawsuit defense.

FTDNA has obviously decided to change this policy as 100 private YSNPs are now found in two SNP packs (L555 and L226). They had to assign labels to private YSNPs in order to report values in the SNP reports. Dennis Wright started using DC = Dal Cais for all branches and private YSNPs under L226 and FTDNA has honored the vast majority unless they have already assigned a label previously. It really makes it a lot easier to visually see L226 SNPs in the bloated SNP reports that have NatGeo YSNP or older tests of older branches. I really like putting labels on private YSNPs and it came in real handy when the L226 SNP pack was designed.

The amount of information that L226 is getting out of the L226 SNP pack is much better than we hoped for. It is not that cost effective for branch discovery, but during the two months we have more than tripled relevant YSNP data under L226 - allowing better chart building which greatly helps with recommending next steps in testing. I highly recommend inclusion of private YSNPs in SNP packs and designing SNP packs for predictable YSNP signatures. But you probably need around 200 or 300 67 marker submissions under your YSNP branch to get a SNP pack made. I am not a big fan of loading up SNP equivalents over all private YSNPs though. Just this week, we did convert our first equivalent branch into a real branch. But we have seven new branches from private YSNPs. We have added 2,000 relevant L226 YSNPs that define known branches, 2,500 tests of private YSNPs and 2,500 tests of equivalent YSNPs.

TigerMW
02-03-2017, 11:51 PM
FTDNA has obviously decided to change this policy as 100 private YSNPs are now found in two SNP packs (L555 and L226). They had to assign labels to private YSNPs in order to report values in the SNP reports. Dennis Wright started using DC = Dal Cais for all branches and private YSNPs under L226 and FTDNA has honored the vast majority unless they have already assigned a label previously. It really makes it a lot easier to visually see L226 SNPs in the bloated SNP reports that have NatGeo YSNP or older tests of older branches. I really like putting labels on private YSNPs and it came in real handy when the L226 SNP pack was designed.

The amount of information that L226 is getting out of the L226 SNP pack is much better than we hoped for. It is not that cost effective for branch discovery, but during the two months we have more than tripled relevant YSNP data under L226 - allowing better chart building which greatly helps with recommending next steps in testing. I highly recommend inclusion of private YSNPs in SNP packs and designing SNP packs for predictable YSNP signatures. But you probably need around 200 or 300 67 marker submissions under your YSNP branch to get a SNP pack made. I am not a big fan of loading up SNP equivalents over all private YSNPs though. Just this week, we did convert our first equivalent branch into a real branch. But we have seven new branches from private YSNPs. We have added 2,000 relevant L226 YSNPs that define known branches, 2,500 tests of private YSNPs and 2,500 tests of equivalent YSNPs.

There are some approaches to be thought out related to what SNPs to be included or not. I think the best approach might differ by subclade (of the pack) and the depth and breadth of prior testing by branch.

"But you probably need around 200 or 300 67 marker submissions under your YSNP branch to get a SNP pack made."

I'm not sure where you are getting this. In terms of FTDNA, they do look at subclade size, which they see as market size. They obviously have some kind of business decision as to number of packs to be sold for them to break-even. The number is surprisingly low and is not impacted by STR testing. Essentially the project admin/pack requestor has a lot to do with this.

I'm curious about the R1b-FGC5870 SNP Pack. I don't think that is a huge group.
http://ytree.net/DisplayTree.php?blockID=868&star=false

TigerMW
02-09-2017, 03:48 PM
FTDNA has obviously decided to change this policy as 100 private YSNPs are now found in two SNP packs (L555 and L226)...
I really like putting labels on private YSNPs and it came in real handy when the L226 SNP pack was designed.
...
I highly recommend inclusion of private YSNPs in SNP packs and designing SNP packs for predictable YSNP signatures
....
Since this is the U106 fireside chat, what's the status of U106 SNP Packs? I think there are seven today.

We can look at DF21, which is just one element of L21, and see six packs for DF21.

Are the U106 folks planning to exploit the pack technology? It doesn't mean other technologies are not of value, but it is a good option for people to have.

EDIT: I just checked and L21 has twenty-four packs. L21 is a bigger project than U106 but not that much bigger.

Wing Genealogist
02-09-2017, 06:33 PM
Since this is the U106 fireside chat, what's the status of U106 SNP Packs? I think there are seven today.

We can look at DF21, which is just one element of L21, and see six packs for DF21.

Are the U106 folks planning to exploit the pack technology? It doesn't mean other technologies are not of value, but it is a good option for people to have.

EDIT: I just checked and L21 has twenty-four packs. L21 is a bigger project than U106 but not that much bigger.

The U106 Project is working with FTDNA on updating the seven SNP packs falling under U106 (U106 upper level, *Z18, Z156, L48 upper level, L47, Z326, Z8). *The Z18 Project developed the Z18 SNP pack and we believe they will work on updating it. This process has been EXTREMELY slow, as we have been asking FTDNA to update our SNP Packs since March of last year and we have just heard from them where they are working on updating the Z8 SNP Pack.

We have had a fair amount of success getting roughly 150 or so SNPs on each pack (keeping my fingers crossed for the upcoming updates). While we are looking at the possibility of breaking down the Z156 SNP pack into an upper level, DF98 & DF96, at the present time, we are not looking at further breaking down the other SNP packs.

One of the issues which has plagued the "big brother" P312 progress is the fragmentation of its subclades into many projects. We recognize the size of P312 made it impractical to keep it all as a single project, but this fragmentation can cause confusion to new customers. There have been recent efforts like Alex's Big tree and James' work on combining the clade, and the U106 project (particularly Dr. Iain McDonald) is working on adding U106 to this combination and he has stated his ultimate goal would be to work on combining the data from M269 (in order to help create better dates for the formation of clades like U106 & P312).

Within U106, we are able to predict which SNP Pack test to take for over 90% of the folks, and to date we have only had a small number of incorrect predictions. By and large, the folks who have taken the backbone M343/M269 SNP Pack test and found they fall under U106 have not ordered a second SNP Pack to further refine their clade. We have also seen some folks who order the incorrect SNP Pack (without consulting the admin team) which almost always results in an unhappy customer.

The U106 Project definitely believes the SNP pack technology is a good option, as there will always be folks who are unable/unwilling to shell out the funds for a NGS/WGS test. But, we also feel we need to balance the SNP Packs so as not to leave folks with only "partial" answers (ie not need them to take multiple SNP packs in order to achieve their most recent known clade).

TigerMW
02-09-2017, 08:22 PM
...
One of the issues which has plagued the "big brother" P312 progress is the fragmentation of its subclades into many projects. We recognize the size of P312 made it impractical to keep it all as a single project, but this fragmentation can cause confusion to new customers. There have been recent efforts like Alex's Big tree and James' work on combining the clade, and the U106 project (particularly Dr. Iain McDonald) is working on adding U106 to this combination and he has stated his ultimate goal would be to work on combining the data from M269 (in order to help create better dates for the formation of clades like U106 & P312).....
What are the plagues of P312?

Fragmentation has negative connotations but if it is additional specialization, motivation and focus is that bad? I can assure you the focus we've had like James K on FGC111134x and CTS4466/Irish II, or Dennis W and Robert C on L226/Irish III, or Peter and David S on DF49x, or Mark J on FGC5494, or Greg H on Z253, Neal and Maurice and others on Z255, or George C on S1051, or Peter B and Patrick M on Z3000/Clan Colla, or James I on L555/Clan Irwin, or myself on L513, or Rory C on DF25, or Nigel on P314.2, or Susan H and David W and Linda M (and Iain K) on M222, or Alex W on S3058/Little Scots, Richard R and Steve G on U152, or Gareth H and Stephen P on Z198, or Dick H on Z209 and DF27, etc., etc. - all has been very helpful. I missed many, but you get the idea.

Yes, it has been true that these folks use different methods, but who is to judge which is best? It's a bit of a competition of ideas but these folks are generally quite willing to share.

Another consideration goes back to what Peter has brought up before. What is the proper level of consolidation ... or control ... or restriction? Who decides?

If fragmentation is a bad thing, let's can it all and just go have a single R1b project. Of course, that's not reasonable which is why the R1b project's gateway service focuses on getting people into smaller haplogroup projects. To do otherwise would discourage the passionate volunteers, like Peter, who are first motivated in their own subclades.

In grand consolidation projects what do we see happen? ... slow downs.

EDIT: I am pricking the sore side of this, I know. I was referred to this thread and its fireside chat nature so I want to have a little fun with it too... It's just fun. We are all volunteers doing what we can. This is just a hobby (for me) and a good argument is fun (for me).:)

Wing Genealogist
02-09-2017, 09:49 PM
What are the plagues of P312?
... In grand consolidation projects what do we see happen? ... slow downs.


I certainly understand you are having a little fun with it. We are all volunteers.

The "slow downs" which are occurring have nothing to do with the U106 Project, but instead lie with FTDNA's delay in responding to our repeated requests to update the SNP Packs. Is it possible the reason for the delay may have been caused (in part) by the myriad SNP packs being developed for other projects? It takes a lot of time and effort to develop each SNP Pack, so developing a host of smaller packs may have the unintended consequence of delaying the updates for other Packs.

You are certainly correct about the numerous individuals who have stepped up to the plate to help everyone, as well as the need for these experts not to be overwhelmed with too much data to analyze. Within the U106 Project we have developed an admin team to distribute the workload and share/expand our expertise.

TigerMW
02-09-2017, 11:22 PM
.This process has been EXTREMELY slow, as we have been asking FTDNA to update our SNP Packs since March of last year and we have just heard from them where they are working on updating the Z8 SNP Pack.

The "slow downs" which are occurring have nothing to do with the U106 Project, but instead lie with FTDNA's delay in responding to our repeated requests to update the SNP Packs. Is it possible the reason for the delay may have been caused (in part) by the myriad SNP packs being developed for other projects? It takes a lot of time and effort to develop each SNP Pack, so developing a host of smaller packs may have the unintended consequence of delaying the updates for other Packs....
Something is amiss here. A request from March of last year is almost a year old. I've not seen anything take anywhere near that long. You have both "repeated requests" to FTDNA and "we have just heard from them".

The reasons I can think of for year-long delay are:

1) They don't think there is much of a market (which doesn't make sense for U106 as it's a huge testing audience)

2) The requests from U106 were not well organized or incomplete (which doesn't make sense because the U106 project team is very sharp)

3) Other subclades, like mine, are receiving preferential treatment (this is possible but there are other subclades I have nothing to do with who also have received more responsive treatment)

4) ?

TigerMW
02-09-2017, 11:45 PM
The U106 Project is working with FTDNA on updating the seven SNP packs falling under U106 (U106 upper level, *Z18, Z156, L48 upper level, L47, Z326, Z8). *The Z18 Project developed the Z18 SNP pack and we believe they will work on updating it. This process has been EXTREMELY slow, as we have been asking FTDNA to update our SNP Packs since March of last year and we have just heard from them where they are working on updating the Z8 SNP Pack.

We have had a fair amount of success getting roughly 150 or so SNPs on each pack (keeping my fingers crossed for the upcoming updates). While we are looking at the possibility of breaking down the Z156 SNP pack into an upper level, DF98 & DF96, at the present time, we are not looking at further breaking down the other SNP packs...
I recommend breaking out into more packs and even getting into singleton SNPs for large surname groups, and doing so post-haste. With the delays you've had, U106 has every right to insist on the highest priority in the pack development queue.

I don't advocate just trying to copy an old deep clade package, barebones tree concept. FTDNA claims that they truly care about genetic genealogy, which means youthful SNPs. I'd blast them.



Within U106, we are able to predict which SNP Pack test to take for over 90% of the folks, and to date we have only had a small number of incorrect predictions. By and large, the folks who have taken the backbone M343/M269 SNP Pack test and found they fall under U106 have not ordered a second SNP Pack to further refine their clade.
That's very good. I know the U106 team agrees with the importance of everyone getting to 67 STRs, particularly since DYS492 is STR #66. I still get resistance from some project admins who think 37 STRs are enough. I've come to the point of thinking we need at least 111 STRs, and probably that is not enough.


The U106 Project definitely believes the SNP pack technology is a good option, as there will always be folks who are unable/unwilling to shell out the funds for a NGS/WGS test. But, we also feel we need to balance the SNP Packs so as not to leave folks with only "partial" answers (ie not need them to take multiple SNP packs in order to achieve their most recent known clade)...
I'm with you that we want to exploit STR signatures to leverage SNP testing.

I think you should have at least ten packs to cover U106. However, to make sure people can get to the right pack we need to have FTDNA include each of those ten lead into pack SNPs in the R1b-M343 Backbone. Good, bad or ugly, the M343&M269 Backbone is still very popular, DYS492 is not infallible and many, many, many still aren't at 67 STRs anyway.

How many NGS tests does U106 have? L21 has over 2000. If U106 has about 900, that's how I get at least ten packs needed. That still doesn't get you quite on par with L21 but it is in the ball game then. If U106 has well over 1000 NGS tests then probably you need at least a dozen packs.
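The arithmetic behind "at least ten" and "at least a dozen" can be sketched as follows. This is only a rough proportional scaling and assumes round numbers taken from this thread: about 2000 NGS tests and twenty-four packs for L21. The function name packs_for_parity is made up for the illustration.

```python
def packs_for_parity(ngs_tests, ref_tests=2000, ref_packs=24):
    """Packs needed to match the reference project's packs-per-NGS-test ratio."""
    return ngs_tests * ref_packs / ref_tests

# L21 reference: roughly 2000 / 24, i.e. about 83 NGS tests per pack.
for u106_tests in (900, 1000):
    print(u106_tests, "NGS tests ->", round(packs_for_parity(u106_tests), 1), "packs")
```

On those assumptions 900 NGS tests scale to 10.8 packs and 1000 tests to 12, which is why ten packs still falls a little short of parity with L21.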

Wing Genealogist
02-09-2017, 11:56 PM
Something is amiss here. ...

I cannot agree with you more.

Wing Genealogist
02-10-2017, 12:04 AM
As I reported earlier, the U106 project is under the impression we can get roughly 150 SNPs per SNP pack (and hopefully they will work well together). We have placed a lot of currently equivalent SNPs on the SNP packs and hope to continue the process of finding folks to split up some of the larger clades (such as Z326 and Z8). At this point in time, we are believing/hoping all the SNP packs (with the exception of Z156) are still okay. When and if we feel we need to expand the number of SNP packs, we will work with FTDNA on it.

If a major cause of the delay to updating U106 SNP Packs is due to the wanton proliferation of SNP Packs in other haplogroups, then creating more SNP packs for U106 will only exacerbate this problem.

TigerMW
02-10-2017, 12:21 AM
As I reported earlier, the U106 project is under the impression we can get roughly 150 SNPs per SNP pack (and hopefully they will work well together).

You can do 160 SNPs in a pack without challenge. Anything over that is likely to be challenged but we do have several packs with more SNPs... and they are packs that work! I am surprised you hadn't heard the 160 number.


We have placed a lot of currently equivalent SNPs on the SNP packs and hope to continue the process of finding folks to split up some of the larger clades (such as Z326 and Z8). At this point in time, we are believing/hoping all the SNP packs (with the exception of Z156) are still okay. When and if we feel we need to expand the number of SNP packs, we will work with FTDNA on it.
Who is the "we"? Is the "we" only U106 proper project admins? Is Peter and Z18 included? Are L1 and U198 project admins included? Are strong analysts who aren't admins included?


If a major cause of the delay to updating U106 SNP Packs is due to the wanton proliferation of SNP Packs in other haplogroups, then creating more SNP packs for U106 will only exacerbate this problem.

"wanton proliferation"??? What does that mean? FTDNA has the responsibility of deciding what products they want to develop given their capabilities and market justification. Surely, you think U106 has a large and justifiable market for packs, right?

I'll have to see if my memory serves me well on this stuff.

So you think "wanton proliferation" is the something that is "amiss". Please elaborate.

Wing Genealogist
02-10-2017, 02:00 AM
I have heard about 160 SNPs in a pack, but it depends on how you define "challenge". I don't understand all of the technical details, but my understanding is that as you add more and more SNPs to a SNP pack, it increases the odds of an effect where adding one SNP causes a previously reliable SNP to become unreliable. It is also my understanding it is a fairly time consuming process to properly place the SNPs to reduce (and hopefully eliminate) this problem. FTDNA has only one individual (that I am aware of) who works on the SNP Packs, so it is certainly possible this causes a bottleneck in their ability to create and update these products.

The U106 Project does work closely with both the L1/S26 and U198 Projects on these SNP Packs, and everyone has benefited from this cooperation & collaboration. Peter has taken the lead with the Z18 portion of the tree, and the U106 Project has not been involved with that SNP Pack. We are also working with folks outside of the admin team to help with tasks including the updating of the SNP Packs.

FTDNA has an absolute right and responsibility of deciding what products to develop and how to develop them as well as how to prioritize the development. I can well expect the P312 portion of the tree is a HUGE market for them (with the U106 project also being a large market). However, one may argue that other haplogroups suffer a bit as a consequence of any decision which prioritizes P312 (and to a somewhat lesser extent U106). Within U106, we have been able to bring many individuals to clades dating roughly 1,500-2,000 years. I wouldn't be surprised if many haplogroups are still back in the 5,000-10,000 year range.

TigerMW
02-10-2017, 02:17 AM
As I reported earlier, the U106 project is under the impression we can get roughly 150 SNPs per SNP pack (and hopefully they will work well together).

I have heard about 160 SNPs in a pack, but it depends on how you define "challenge". .
It's not that complicated. Ask the Y team at FTDNA and they will tell you that 160 is the maximum. This is why I question the U106 admin team on this.

I suspect I was referred to this thread for exactly these purposes. I'm okay with it. I treat Anthrogenica as an excellent forum for advanced genetic genealogy folks and by definition, then, project administrators. I would not argue these things with you on the U106/S21 or R1b-YDNA yahoo groups. I'm not trying to make anyone look bad.


my understanding is that as you add more and more SNPs to a SNP pack, it increases the odds of an effect where adding one SNP causes a previously reliable SNP to become unreliable..
Your concern is overblown here. This is probably something fed to you by people who aren't familiar with mass spectrometry and the MassArray equipment (or competitors).

There is a threshold for increased reliability concerns, but 160 is safe.


It is also my understanding it is a fairly time consuming process to properly place the SNPs to reduce (and hopefully eliminate) this problem. FTDNA has only one individual (that I am aware of) who works on the SNP Packs, so it is certainly possible this causes a bottleneck in their ability to create and update these products..
Are you saying you think you need to protect FTDNA employee time? I find this argument somewhat incredible as I don't see you as a defender of FTDNA. I look at it differently. I want to challenge these guys to work as hard as possible for us. The good news is there are some very good people, like Carlos and Michael. They can and do work weekends.

If you don't challenge them, you are doing U106 folks, in general, a disservice.

We are also working with folks outside of the admin team to help with tasks including the updating of the SNP Packs.
So who is "we"? Is "we" the U106 project admin team proper? Do you include specific expert analysts?


I can well expect the P312 portion of the tree is a HUGE market for them (with the U106 project also being a large market). However, one may argue that other haplogroups suffer a bit as a consequence of any decision which prioritizes P312 (and to a somewhat lesser extent U106).
You appear to be underestimating the value of the U106 market. I'd have to pull it up but Thomas K has gone on record as saying that L21 and U106 is what really counts, market-wise. Of course, this is not lost on FTDNA.

The U106 project team proper is very proud, as well it should be.

Cofgene
02-10-2017, 03:13 AM
Something is amiss here. A request from March of last year is almost a year old. I've not seen anything take anywhere near that long. You have both "repeated requests" to FTDNA and "we have just heard from them".

The reasons I can think of for year-long delay are:

1) They don't think there is much of a market (which doesn't make sense for U106 as it's a huge testing audience)

2) The requests from U106 were not well organized or incomplete (which doesn't make sense because the U106 project team is very sharp)

3) Other subclades, like mine, are receiving preferential treatment (this is possible but there are other subclades I have nothing to do with who also have received more responsive treatment)

4) ?


Mike,

I delivered 3 SNP pack upgrade lists to FTDNA during the last week of December 2015. They were complete submissions with all of the required information - positions, calls, supporting kit numbers. We provide better and more complete data than a lot of other haplogroup project submissions. One of the packs was briefly worked on, then all were lost in the shuffle of the effort to deliver all of the other R1b SNP packs. This was a FTDNA marketing decision. Revisions to 6 SNP packs were submitted again last June. Again they were not worked on due to priorities associated with new smaller project packs. Several weeks ago we finally got word that one of the packs was being worked on again. The thing is it is now 6 months out of date from our last request. An update is good but being out of date reduces the value. I am still unable to tell FTDNA customers that have been waiting for revisions since October 2015 that an updated pack is available for them.

In the U106 region we have a good set of STR signatures which allow us to provide consistent predictions into the 7 pack regions. Fragmenting the individual packs so that we can get improved coverage creates the problem that we significantly reduce our technical ability to predict which of the fragmented packs would best apply to an individual. We don't want to see users test several individual SNPs at FTDNA just to identify the correct pack that applies to them. At that point we just need to send them to YSEQ which provides a more fiscally responsible test.

From a technical perspective we would prefer to see the pack sizes increased to 200 or 240 (5 or 6 wells) over fragmenting the current packs. If FTDNA does not address their process of creating and updating packs the U106 project will end up fragmenting existing packs and calling them "NEW" just so that we can get them worked on.
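The well arithmetic here can be sketched in a few lines. The 40 SNPs per well figure is an inference from the numbers in this post (200 SNPs for 5 wells, 240 for 6), not an FTDNA specification, and wells_needed is a hypothetical helper name.

```python
import math

SNPS_PER_WELL = 40  # assumption: inferred from "200 or 240 (5 or 6 wells)"

def wells_needed(snp_count, snps_per_well=SNPS_PER_WELL):
    """Smallest number of wells that can hold snp_count SNPs."""
    return math.ceil(snp_count / snps_per_well)

for size in (150, 160, 200, 240):
    print(size, "SNPs ->", wells_needed(size), "wells")
```

On that assumption a 150 SNP pack and a 160 SNP pack both occupy 4 wells, so the proposed 200 or 240 SNP packs would mean one or two extra wells per pack.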

No one else should be getting revisions to their packs unless they have had requests in before Dec 26th, 2015. The U106 packs should be at the head of the queue for revisions.

TigerMW
02-10-2017, 03:34 AM
I delivered 3 SNP pack upgrade lists to FTDNA during the last week of December 2015.
2015? Is this a typo?


They were complete submissions with all of the required information - positions, calls, supporting kit numbers. We provide better and more complete data than a lot of other haplogroup project submissions. One of the packs was briefly worked on, then all were lost in the shuffle of the effort to deliver all of the other R1b SNP packs
As I said before, you are a very proud team, as well you should be. Your team is very sharp and diligent. It doesn't make sense that U106 admins would get lost in the shuffle or submit an incomplete request. This is why this all seems amiss to me.


This was a FTDNA marketing decision. Revisions to 6 SNP packs were submitted again last June. Again they were not worked on due to priorities associated with new smaller project packs.
The U106 leader, Charles is a superb politician. My guess is he is or was an outstanding businessman. Maybe he needs to be called in if the U106 team needs more help on business justification.


Several weeks ago we finally got word that one of the packs was being worked on again. The thing is it is now 6 months out of date from our last request. An update is good but being out of date reduces the value. I am still unable to tell FTDNA customers that have been waiting for revisions since October 2015 that an updated pack is available for them.
There really is something seriously wrong if you still have 2015 requests outstanding.


In the U106 region we have a good set of STR signatures which allow us to provide consistent predictions into the 7 pack regions. Fragmenting the individual packs so that we can get improved coverage creates the problem that we significantly reduce our technical ability to predict which of the fragmented packs would best apply to an individual. We don't want to see users test several individual SNPs at FTDNA just to identify the correct pack that applies to them. At that point we just need to send them to YSEQ which provides a more fiscally responsible test.
First, of all, who is "we" again? Who decides these things for all of U106? Who decides what is "fiscally responsible"?

I think I get it, but it is wrong for Ray W to blame FTDNA for slowness when the U106 admin team has made other judgements.

Look, I'm not saying you should not validate/verify whatever you want to with YSEQ. That's all fine, but that doesn't mean you should limit the exploitation of mass spectrometry, which FTDNA supports in the packs.

Earlier, Ray W, bragged about your (U106 admins I guess) ability to predict based on STRs. You think it is a problem that "Fragmenting the individual packs so that we can get improved coverage creates the problem that we significantly reduce our technical ability to predict".

The problem is simple, and we don't even need 67 STRs if you give me some more lead in SNPs to include in the R1b-M343&M269 SNP Pack. Don't misinterpret this. I think everyone in R1b should get to 111 STRs, at least one per MDKA, but many (most) haven't. They are at 37 STRs.

Another way to think about this is I could translate your pack fragmentation concern as really a problem with your (U106 admins I guess) ability to predict based on STRs. That's not a good reason to limit the availability of genealogical important SNPs in SNP Packs.


From a technical perspective we would prefer to see the pack sizes increased to 200 or 240 (5 or 6 wells) over fragmenting the current packs. If FTDNA does not address their process of creating and updating packs the U106 project will end up fragmenting existing packs and calling them "NEW" just so that we can get them worked on. ...
Is this the true issue? We have Ray W of the U106 admin team saying 150 is a limit and voicing reliability concerns at that level. We also have him saying FTDNA's time must be limited. We have you saying you want 200 to 240 when there is no such thing as a pack with over 200 that I know of. I guess the U106 admins don't agree on when there are reliability concerns about the Pack technology......

Ray W says you (U106 admins I think) have made repeated requests but haven't heard from FTDNA for months.

I get it. I've had personal email exchanges that show the situation.

Who else has two admins in Houston who will go in, and have gone in, to meet with Bennett face to face?

Please have Charles join this conversation. He is adept at explaining the nuances.

Again, I am pricking this as fun, which I think Charles can appreciate.

P.S. My paternal lineage is not U106, but I have three pure English lineages and four or five Czech (eastern Czech) lineages and two German lineages that are unknown. Probably, there is U106 in there somewhere... and some R1a1, etc.


I'll re-iterate that I think U106 should have at least ten or maybe more SNP Packs so that genealogical important SNPs can be better supported.

Cofgene
02-10-2017, 11:58 AM
2015? Is this a typo?


I'll re-iterate that I think U106 should have at least ten or maybe more SNP Packs so that genealogical important SNPs can be better supported.


No. 2015 is not a typo. You don't understand the impact low volume packs have had on FTDNA's ability to refresh older products applied to larger groups.

From a marketing perspective there might be SELECTIVE cases where putting some genealogically relevant variants onto packs for British Isle/US geographic origin haplogroup regions. For other geographic regions where sampling and participation is much lower we struggle to get results on haplogroups closer than 1500 years ago. Until we get several thousand move continental European results which allow us to consistently reach into the last 800 years across a broad spectrum of the U106 region we must continue to concentrate on haplogroup identification. Putting a number of genealogically relevant variants for well tested lineages will not occur at the expense of providing the opportunity of identifying rare haplogroup levels.

The U106 project has over 1000 "non-participating" individuals. This lack of participation comes from death, language, lack of interest, and availability of money to do a pack or specific refining SNPs.

Review the technology being used and more specifically exactly how the samples are prepped and introduced into the machine. 160 is a FTDNA marketing/lab operation imposed limit.

Fracturing into a large number of packs makes no sense if we have to convince individuals, who won't ante up for a BigY that due to inaccurate STR predictions they will need to purchase a 2nd pack. If we can get FTDNA to provide a $40 single well 40 variant U106 scan pack then that does open up the topic for fracturing the existing U106 packs into more subpacks. Instead of focusing on genealogical variants we would need to look at positioning the WGS/Elite variants that BigY doesn't report on. We have several hundred of those which need to be positioned on the tree.
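The pack sizes quoted in this thread all line up if a well holds 40 variants. That figure is an assumption inferred from the "single well 40 variant" scan pack and the "200 or 240 (5 or 6 wells)" remark above, not a documented lab specification. A minimal sketch of the arithmetic:

```python
# Assumption: 40 variants per well, inferred from the figures quoted in
# this thread ("single well 40 variant", "200 or 240 (5 or 6 wells)").
VARIANTS_PER_WELL = 40


def wells_needed(variants: int) -> int:
    """Return the number of wells needed for a pack, rounding up."""
    return -(-variants // VARIANTS_PER_WELL)  # ceiling division


def pack_capacity(wells: int) -> int:
    """Return how many variants fit in the given number of wells."""
    return wells * VARIANTS_PER_WELL


if __name__ == "__main__":
    for size in (40, 160, 200, 240):
        print(size, "variants ->", wells_needed(size), "well(s)")
```

Under that assumption the 160 figure corresponds to 4 wells, and the requested 200 or 240 would mean adding a fifth or sixth well.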

Wing Genealogist
02-10-2017, 12:18 PM
The U106 Project currently has just under 5,000 members. I am extremely happy that we have a little over 1,000 NGS/WGS results within the project but my own personal goal is to work on reducing the large percentage of folks who have done little to no SNP testing.

We currently have over 300 individuals who have not done any SNP testing, and are predicted to be M269 (but are predicted to be U106+ based on STR markers). In addition we have over 450 individuals who have not tested below U106. Looking at the 7 SNP Packs we currently have (U106, Z18, Z156, L48, L47, Z326, Z8) roughly 1,400 of our 5,000 members have not yet done any testing below them.

As I stated earlier, we are reasonably confident we can predict which of the seven SNP packs the vast majority of folks in our project will fall into. (We currently only have 31 individuals where we cannot predict which SNP pack would be best and many of them have not tested out to 67 STRs.) Even splitting up the Z156 SNP Pack into DF96 & DF98 would leave us with a pool of individuals where we are not confident we can predict where they would fall (within a 95% CI).
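For a sense of scale, the counts above can be turned into rough shares of the project. This is only a sketch: the membership total is "just under 5,000" and the other counts are "over" or "roughly" figures, so the rounded numbers used here are assumptions and the percentages are approximate. The groups overlap, so they should not be added together.

```python
# Rounded figures from the post above; all are approximate.
MEMBERS = 5000

groups = {
    "NGS/WGS results": 1000,
    "no SNP testing (predicted M269)": 300,
    "not tested below U106": 450,
    "not tested below any of the 7 packs": 1400,
    "pack cannot be predicted": 31,
}


def share_of_project(count: int, total: int = MEMBERS) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for label, count in groups.items():
        print(f"{label}: {share_of_project(count)} %")
```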

RobertCasey
02-10-2017, 02:09 PM
Since FTDNA did away with the long version of the haplogroup, I have been analyzing all R haplogroups for the last year or so. I have around 50,000 67 marker submissions in this database. As we all know, FTDNA is pretty business oriented (and should be) and probably rolling SNP packs according to existing known major haplogroups. This is only based on terminal YSNPs in YSTR reports and the last full pull was in July, 2016 (so it is a little dated):

R1 and R2 212
R1a 5,100
Early R1b 449 (L278 and M343)
M269 only 31,356
post M269 560
U106 3,234
P312 w/o L21 1,626
L21 6,600

Of course, the earlier R1b is just untested and most would be spread across the more recent haplogroups.

L21 6,600 39.3 %
R1a 5,100 30.4 %
U106 3,234 19.2 %
P312 1,626 9.7 %
R1 & R2 212 1.3 %
Total (major) 16,772

So if you just look at raw data, L21 should probably have twice as many SNP packs as U106 due to raw data above. But there is a genetic difference between R1a and U106 from L21 and P312. R1a and U106 have a lot of branches peeling off their haplotree and L21 and P312 have huge starbursts of major branches. Not sure how that would make a market share difference but is very observable in the haplotrees when numbers are assigned.
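The shares and the "twice as many" comparison above can be recomputed directly from the posted counts. A minimal sketch, using only the numbers in the table; note that rounding to one decimal can land a tenth of a percent away from a posted figure.

```python
# Counts as posted above (terminal YSNPs in 67-marker YSTR reports).
counts = {
    "L21": 6600,
    "R1a": 5100,
    "U106": 3234,
    "P312 w/o L21": 1626,
    "R1 & R2": 212,
}

total = sum(counts.values())

# Percentage share of each group, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# How many times larger L21 is than U106 in this data set.
l21_to_u106 = round(counts["L21"] / counts["U106"], 2)

if __name__ == "__main__":
    print("Total (major):", total)
    for name, pct in shares.items():
        print(f"{name}: {pct} %")
    print("L21 / U106 ratio:", l21_to_u106)
```

The ratio comes out at about 2, which is where the "twice as many SNP packs" estimate comes from if packs were allocated purely by submission counts.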

Another major issue is that FTDNA really cooperates a lot more with FTDNA friendly folks as well as those who understand the issues associated with Mass Array technology. Mike is right, 160 tests is the sweet spot and anyone who wants more will just have to wait until Mass Array is replaced by the next round of technology. You have to work within the limits of the technology that FTDNA has selected. It is pretty annoying that so many YSNPs come back with bad reads but that is the nature of the technology that delivers very targeted YSNPs at less than $1 per YSNP. FTDNA is now loading up SNP packs with private YSNPs (singletons) and equivalents. Unfortunately, equivalents will probably only find 10 % of the branches that random private YSNPs will yield but that is the methodology for rollout. It is also confusing for the general public to declare all private YSNPs as branches - but this is such a good test, you just have to roll with the flow sometimes. This private YSNP issue will eventually blow up and FTDNA will probably then fix it. As with any new technology, bad reads will improve (not go away) as FTDNA determines the root cause for some of these bad reads.

But I really love the L226 SNP Pack and in just two months, it has doubled the number of L226 submissions that are fully tested for most branches of L226 (50 NGS and 50 robust SNP packs). So we have doubled our knowledge of L226 at 20 % of the cost of Big Y. I have gone from charting 40 % of L226 to 75 % and should soon reach 80 % coverage. By charting at this level, I am learning about a lot of new issues and now have much better knowledge of who to test next and which tests should be next. These tests have also generated a lot of renewed interest in testing - since its price point is much more consumable. So, SNP Packs loaded with private YSNPs are hugely successful for L226.

Also, FTDNA responds well to smaller and more dedicated groups of admins where sales have been high in the past. L226, L193, L555, M222 and many other smaller branches (like L226) have two and three competent admins driving business for FTDNA and giving a lot of support to all their questions. L226 is prime for getting good SNP pack support having: 1) three very active admins; 2) market inertia with 500 67 marker tests; 3) having over 50 Big Y tests; 4) and in the first two or three months have ordered 50 L226 SNP packs - allowing FTDNA to probably earn a little profit; 5) in the first round of Z253 Packs, L226 drove over half of the Z253 Pack testing. So, these metrics are important. I really think the distributed leadership really helps L21 get more FTDNA support for SNP packs.

TigerMW
02-10-2017, 02:44 PM
No. 2015 is not a typo. You don't understand the impact that low volume packs have had on FTDNA's ability to refresh older products applied to larger groups.

U106 should be a higher priority. FTDNA is business oriented. In other words, they want to sell to the larger markets. This is a symptom of a communications problem. You can blame it on FTDNA but most communications problems are a two way street.

It's not helpful for the U106 community, though.

If so, then this is really a case where the large consolidated and strongly controlled project approach has hindered progress on some levels. It may have helped on other levels, but it hasn't been helpful on this front.



From a marketing perspective there might be SELECTIVE cases for putting some genealogically relevant variants onto packs for British Isle/US geographic origin haplogroup regions. For other geographic regions where sampling and participation is much lower we struggle to get results on haplogroups closer than 1500 years ago. Until we get several thousand more continental European results which allow us to consistently reach into the last 800 years across a broad spectrum of the U106 region we must continue to concentrate on haplogroup identification. Putting a number of genealogically relevant variants for well tested lineages will not occur at the expense of providing the opportunity of identifying rare haplogroup levels.

You've just described the case for why you need more packs. You are restricting forays into genetic genealogy for what you are calling "haplogroup identification" or while you wait on "haplogroup identification".



Review the technology being used and more specifically exactly how the samples are prepped and introduced into the machine. 160 is a FTDNA marketing/lab operation imposed limit.

At least I'm glad you really have heard about the 160 number although it took you all a while to admit it in this conversation.

What? Are you guys playing hardball with FTDNA because you know the technology and know that it should support 200 or 240 SNPs? I don't know if it is possible or if reliability would go down the tubes, but in any case that might actually cause a price change. I guess you are marketing and product management experts.


Fracturing into a large number of packs makes no sense if we have to convince individuals, who won't ante up for a BigY that due to inaccurate STR predictions they will need to purchase a 2nd pack. If we can get FTDNA to provide a $40 single well 40 variant U106 scan pack then that does open up the topic for fracturing the existing U106 packs into more subpacks. Instead of focusing on genealogical variants we would need to look at positioning the WGS/Elite variants that BigY doesn't report on. We have several hundred of those which need to be positioned on the tree.
(EDIT: I asked FTDNA a couple of years ago for a R1b Backbone Pack for $49 with just the high level SNPs. That evolved into the $99 pack since it was apparent that was the pricing level they wanted, ~$100. When I ran into resistance I just changed directions to do the best I could with the options... hence lots of SNPs in packs.)

I can see the U106 "we" admin team has their own approach but you are restricting options for your testing community because of your lack of willingness to exploit the MassARRAY technology. I'm not saying you should restrict other options that you deem more "fiscally responsible" but why take something off the table?

Perhaps you should let people decide for themselves if "focusing on genealogical variants" is a worthwhile venture.

Robert Casey's review of the subclade sizes does indicate you should have at least ten packs. I think it is more like twelve but ten for sure.

(EDIT: With 160 SNPs as an approximate pack capacity, the R1b-M343 & M269 Backbone Pack can handle a lot of pointer SNPs to other packs, particularly that L51x is now offloaded. It seems like a workable strategy.... )

lgmayka
02-10-2017, 02:58 PM
Fracturing into a large number of packs makes no sense if we have to convince individuals, who won't ante up for a BigY that due to inaccurate STR predictions they will need to purchase a 2nd pack.
Selling two SNP packs ($238) is even more difficult to project members who have already spent plenty of money on the earlier SNP testing options (Deep Clade tests, individual SNP tests, etc.). I have occasionally mentioned in this forum my frustration at SNP packs that totally ignore customers who have already spent considerable amounts of money on earlier SNP testing. One may be able to persuade such a project member to order one last SNP pack, but two is out of the question.

TigerMW
02-10-2017, 03:05 PM
...
Of course, the earlier R1b is just untested and most would be spread across the more recent haplogroups.

L21 6,600 39.3 %
R1a 5,100 30.4 %
U106 3,234 19.2 %
P312 1,626 9.7 %
R1 & R2 212 1.3 %
Total (major) 16,772


Thanks, Robert. This is along the lines of what I've suspected but I've never tried to count this across the projects like you have. I assume you have U152 and DF27 both in P312. If so, the P312 number is lower than I thought. U152 has seven packs so it is about even with U106. This means that it is not L21 that is over represented. It is U106 that is under represented in pack coverage.


... . FTDNA is now loading up SNP packs with private YSNPs (singletons) and equivalents. Unfortunately, equivalents will probably only find 10 % of the branches that random private YSNPs will yield but that is the methodology for rollout. It is also confusing for the general public to declare all private YSNPs as branches - but this is such a good test, you just have to roll with the flow sometimes. This private YSNP issue will eventually blow up and FTDNA will probably then fix it. As with any new technology, bad reads will improve (not go away) as FTDNA determines the root cause for some of these bad reads.
I agree. This is also where SNP selection while considering equivalents and age is important. In the first round of L513 I went hard at the older SNPs wanting to really exercise the phylogenetic equivalent blocks. We did break a couple, but most of the older blocks are still pretty much intact. Even though it was exciting to discover ancient branching it was easy to see the community didn't care as much about that as I did. They really do want the genealogically significant SNPs, as Robert indicates.
In this round (v2) I tried to hit the youthful SNPs. More people are ordering because they really want the genetic genealogy, not deep ancestral origins.


. I really think the distributed leadership really helps L21 get more FTDNA support for SNP packs.
We fight like cats and dogs sometimes but as long as we share information, the distributed leadership really is the best thing for the community overall. All in all, it has been a very rewarding experience and I've learned a lot.

TigerMW
02-10-2017, 03:12 PM
Selling two SNP packs ($238) is even more difficult to project members who have already spent plenty of money on the earlier SNP testing options (Deep Clade tests, individual SNP tests, etc.). I have occasionally mentioned in this forum my frustration at SNP packs that totally ignore customers who have already spent considerable amounts of money on earlier SNP testing. One may be able to persuade such a project member to order one last SNP pack, but two is out of the question.
You should only have to spend $218, not 238, as long as the backbone pack is properly designed. At least in R1b, the backbone pack has a regular price of $99.

Sometimes STR signatures allow you to skip the backbone pack but not in all cases, but that is no reason not to give people the genealogically useful SNPs they want.

I don't view this as a selling exercise. Not everyone will participate but the way people spend money varies. Spending $100 every six months is not a big deal to many. For others, they see the value of spending $400-500 for NGS or more for bigger NGS. It's all one shot but some are willing to bite it off.

Some, particularly surname project focused people, may incrementally move up the STR route and wait and hope for a good match to come up who has done NGS, allowing them some SNPs to target individually, hopefully just a couple.

I like the NGS option but it just isn't in the cards for everyone. Having these options and multiple vendors is a good thing.

Bollox79
02-10-2017, 08:21 PM
Just my two cents here - I'm often on the yahoo forum etc ;-). I often wonder if the lack of testing SNPs is an educational thing? I know I have helped and/or played the middle man for a couple people in my little branch of FGC14840 under DF98 (and FGC14814 on the big tree - Mr. Roman 6drif-3 is in there - I also have a keen interest in ancient SNPs!) and had to explain to them, in my honest opinion, what they are getting for their money. I use my own example as an example to them, and make it clear I'm quite happy currently with the results because this is a waiting game. I first did Geno 2.0. That got me to Z304-307. Then I came along just about the time DF98 was discovered and connected with Wettin etc. I tested that. Then I did my STRs to 111. Then I did the Big Y as soon as it came out. I make sure to tell them that it's a fairly decent test for finding new branches (more of a database to compare with and growing). I make sure I tell them I went through all that, got the results I got, matched an old skeleton (which is usually very unlikely I suppose), but would have done it again if I didn't get as many matches. Big Y (or any Next Gen - that's what it's called right?) sets you up for future matches as long as everyone is on the same page, and realizes that we need to share the VCF file to compare data and identify new branches!!! I think some people, who are new to this and not as OCD as I am, get lost in the shuffle! I try to help those in my little area of DF98! Like I said, just my two cents since I don't have a hand in U106 as an admin, but occasionally try to do some messages for Dr. Iain and do some of the leg work in my little areas of DF98!

Also, discussion of issues is a great thing, and I'm thankful for all that the admins do!!!

Cheers,
Charlie "Cathal Dubh"

Bollox79
02-10-2017, 08:28 PM
Additionally, I think that if you can explain the "big picture" to them about the future of all this testing, the massive potential of this testing, and the fact that we have modern people who already match ancient DNA samples... they will grasp the big picture. Then again they may not care about that, and are just interested in the recent history (last 500 years or so) of their y-line. Just depends on what they want. I try to make it clear in a positive way though, that the more people who test, hopefully to the Big Y level, the more it will help flesh out the branches and everyone benefits over time! In the end I make it a positive message, but I don't try and force them down one path or another. I just explain my experiences and how I'm very happy with the experience. I also make sure I mention that my modern day matches are only a few SNPs away from the Roman 6drif-3 match... and that makes my modern matches quite old, but that doesn't discourage me ;-). It's a journey of discovery, it's ok if it takes a few or more years! We are literally uncovering history as we speak. It's almost as if we used a time machine to go back and look at genetics 2000 years ago! Huge Potential!!!

Cheers!

TigerMW
02-10-2017, 09:12 PM
Just my two cents here - I'm often on the yahoo forum etc ;-). I often wonder if the lack of testing SNPs is an educational thing?
I agree. Most people starting into this don't know what they are getting into.



I know I have helped and/or played the middle man for a couple people in my little branch of FGC14840 under DF98
....
I think some people, who are new to this and not as OCD as I am, get lost in the shuffle! I try to help those in my little area of DF98! Like I said, just my two cents since I don't have a hand in U106 as an admin, but occasionally try to do some messages for Dr. Iain and do some of the leg work in my little areas of DF98!...

Thanks for your work. I think you present a good example of what Robert C called distributed team leadership. There are many potential volunteers out there who are most interested in their own little areas of the tree. Sometimes folks even fund others' testing. This is oftentimes the reason people want their own sub-projects.

RobertCasey
02-10-2017, 09:24 PM
With L226, we are pretty lucky to have a prolific number of offspring and survivors. This is because one of the founders of this haplogroup was King Brian Boru who was the first king to unite/conquer all of Ireland. Due to status, L226 prospered much more than most other lines. This line being prolific (and later migrating to the US in large quantities for future testing) set up pretty ideal conditions for having an ample supply of descendants to test. There are many other L21 haplogroups with similar decent sample sizes to enjoy. The largest predictable single signature haplogroup under R-L21 is probably R-M222 which is several times larger than L226 and their line is tied to Niall of the Nine Hostages who later had Ireland under his control and created a lot of offspring during and long after his reign.

I really have a lot of empathy for those lines that barely survived the dark ages. Even R-L226 has at most five percent of the older part of L226. Due to 20 to 30 equivalent L226 YSNPs via NGS testing, we now know via YSNP testing that there was at least another 1,000 years prior to the time of Brian Boru where L226 barely survived these harsh times due to mini ice ages (really just much colder than normal for decades where rain was much less plentiful) causing massive crop failures and subsequent large scale mortality rates due to starvation and diseases. These people have very high genetic distance and very different signatures from 95 % of the rest of L226.

Like you, I try to explain to them that their tests are much more long term in nature but with continued testing - larger sample sizes will eventually get there. Of course, many are not interested in investing in the future and want genealogical results like others in L226 are enjoying. However, these people will discover a lot of private YSNPs from NGS testing as well as good solid branches but will have a hard time in "building their cluster" due to the smaller sample size of their present-day survivors. Some gladly test for the future - but most of these people become disappointed in satisfying their genealogical goals with genetic testing and lose interest. This is the random nature of the DNA game in general.

There are similar issues in genealogy where much of your line daughters out, as my great grandfather William Martin Shelton has no male descendant with the Shelton surname. I have around 500 descendants charted but only three men named Shelton in this part of the family history. Going just two more generations back and I get 800 Shelton men. Also, my mother was born a Brooks and both myself and my son carry her surname as our middle name. But I could not break a brick wall in the mid 1700s. Finally genetics solved this issue, my ancestor was a NPE event of 1765 and was actually a Wade. Somehow, my interest in Brooks research waned, since any earlier research would be on Wades, or on Brooks that are not my real bloodline. Also, my Brooks Family History book now has around 8,000 descendants but all future Brooks research would be really unrelated to my true bloodline. I now feel the angst of adoption related issues for the first time.

Bollox79
02-10-2017, 10:25 PM
I agree. Most people starting into this don't know what they are getting into.



Thanks for your work. I think you present a good example of what Robert C called distributed team leadership. There are many potential volunteers out there who are most interested in their own little areas of the tree. Sometimes folks even fund others' testing. This is oftentimes the reason people want their own sub-projects.

No problem Mike! I figured it's the least I can do to help Dr. Iain with my small little Ancient British/later Scandinavian branch! In addition to just being thankful for all the work all U106 admins do (I'm sure it's a huge headache in addition to their normal jobs!) I'm very thankful for my S1894/S1900 clade mate Dr. Iain for all that work he put into the DF98 king's cluster pdf. It can only get bigger with more data and more branches ;-). It's the least I can do to lend a hand in my area and take some of the leg work off of the busier guys.

Cheers!

Bollox79
02-10-2017, 11:09 PM
Like you, I try to explain to them that their tests are much more long term in nature but with continued testing - larger sample sizes will eventually get there.




Robert! I am happy to hear the amount of progress you guys are making on L226!! Having a lot of 3rd-5th cousins from Ireland, Northern Ireland and Scotland (and Cape Breton, PEI etc)... I'm always interested in projects in those areas... particularly Tipperary... I get a lot of matches with autosomal links to all over Ireland, and especially the SW!

For myself... I am happy and very intrigued with the Driffield Terrace cemeteries and their context etc. Where there are two Z304-307 guys (a DF98 and a DF96) in a small number of skeletons from a martial cemetery... that means there may be more!!! Looking forward to the future of aDNA mostly... then perhaps find distant cousins from Pennsylvania surnamed Weaver. That's another I am dealing with! So many Weavers from PA (many in the Weaver DNA project etc), but I do not match any of them ;-). The quest continues!

Cheers!

lgmayka
02-10-2017, 11:51 PM
You should only have to spend $218, not 238, as long as the backbone pack is properly designed.
For R1b, that's true. Backbone SNP packs for other haplogroups are $119.
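The two totals quoted above ($218 and $238) follow from the backbone prices if a subclade SNP pack costs $119. That pack price is an assumption inferred from the totals in this exchange, not a quoted list price. A minimal sketch:

```python
# Prices in US dollars. Backbone prices are quoted in the thread;
# the subclade pack price is inferred ($218 - $99) and is an assumption.
R1B_BACKBONE = 99
OTHER_BACKBONE = 119
SUBCLADE_PACK = 119


def two_step_cost(backbone_price: int, pack_price: int = SUBCLADE_PACK) -> int:
    """Return the cost of a backbone pack followed by one subclade pack."""
    return backbone_price + pack_price


if __name__ == "__main__":
    print("R1b backbone + subclade pack:", two_step_cost(R1B_BACKBONE))
    print("Other backbone + subclade pack:", two_step_cost(OTHER_BACKBONE))
    print("Two subclade packs:", 2 * SUBCLADE_PACK)
```

Either way the $238 figure applies whenever two $119 packs are needed, which is the case being objected to when an STR prediction points to the wrong subclade pack.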

TigerMW
02-11-2017, 12:03 AM
delete - was going off topic

TigerMW
02-13-2017, 08:57 PM
... At least in R1b, the backbone pack has a regular price of $99.

Sometimes STR signatures allow you to skip the backbone pack but not in all cases, but that is no reason not to give people the genealogically useful SNPs they want.
I just joined about a half dozen new U106 folks to the U106 project. We had about a dozen new R1b-M343&M269 Backbone SNP Pack results which were U106+.

The flow of R1b-M343&M269 Backbone SNP Packs is pretty steady with no signs of slowing down.

Right now we have pointer SNPs in the Backbone to each of the U106 SNP Packs. If new packs are to be added it is critical to let me know that we need to add the pointer SNPs to the Backbone Pack. There's plenty of room but lead time is very important.

TigerMW
02-21-2017, 08:13 PM
... . FTDNA is now loading up SNP packs with private YSNPs (singletons) and equivalents. Unfortunately, equivalents will probably only find 10 % of the branches that random private YSNPs will yield but that is the methodology for rollout. It is also confusing for the general public to declare all private YSNPs as branches - but this is such a good test, you just have to roll with the flow sometimes. ...


I agree. This is also where SNP selection while considering equivalents and age is important. In the first round of L513 I went hard at the older SNPs wanting to really exercise the phylogenetic equivalent blocks. We did break a couple, but most of the older blocks are still pretty much intact. Even though it was exciting to discover ancient branching it was easy to see the community didn't care as much about that as I did. They really do want the genealogically significant SNPs, as Robert indicates.
In this round (v2) I tried to hit the youthful SNPs. More people are ordering because they really want the genetic genealogy, not deep ancestral origins.

I highly encourage the U106 folks to relook at this, or work on it post-haste, whichever may be the case. This is a post from this morning related to a part of L21.

"We have a new Thomas/Martin branch. The S5668 SNP Pack results for Martin (#495859) confirm he is Z17911 > BY11573 > FGC33966. FGC33966 was a singleton SNP for Thomas (#8633) that FTDNA added to the SNP Pack. This verifies a new branch below BY11573.

I'm not sure how FTDNA determines which singletons to add to their packs, but we've really struck gold with them - 2 new branches formed in recent weeks from SNP Pack results! "

People like this kind of thing.....:)