
View Full Version : Tracking R1b novel SNPs and their availability incl. a la carte ordering



TigerMW
09-16-2013, 01:47 PM
I don't think most people will do full genome scans without a dramatic reduction in price. That means that a la carte (stand-alone, one-at-a-time) SNP ordering will still be very critical. We know FTDNA raised their price from $29 to $39 per SNP earlier this year; it appears they removed the one-time lab transfer fee in the process. We know our favorite (if not best or primary) SNP advocate, Thomas Krahn, is no longer employed at FTDNA, so I don't know what that means for SNP discovery and offering development at FTDNA. We know that some on-order WTYs have been cancelled. My guess is they are watching the competition and are working on new offerings. That's just a guess.

Regardless, the a la carte (Advanced Orders) process for single SNPs is still available, and as far as we know, SNP tests are still being developed. I propose that we share and track SNP primer requests and availability. I've started this thread to do that for R1b.

DF99 is one SNP I have questions about. It appears to be a new peer subclade to U152, L21, DF27, DF19 and L238 under P312. Does anyone know the status at FTDNA, or how we can determine it?

FTDNA would probably appreciate it if we had efficient communications with them on these matters.

lgmayka
09-16-2013, 02:42 PM
as far as we know, SNP tests are still being developed.
I have the contrary impression, based on what Bennett Greenspan wrote to me on September 4:
---
For the time being I see no change [due to the departure of Thomas Krahn] other than the SNPs that others have asked him to calibrate up or a few WTYs to be scored, nothing should be out of the norm.
---

Thomas Krahn was the only one at FTDNA, to my knowledge, who had any interest in designing and offering new SNP tests. He had already ordered the primers for several more on my behalf when he was let go. I have no reason to believe that those tests will ever emerge.

My own suspicion right now is that, in regard to SNP tests, FTDNA will execute a mere holding action until it can:
- In the shorter term, offer the new version of Geno or a competitor to it
- In the longer term, offer full Y sequencing.

With some justification, FTDNA probably feels that individual SNP testing is an ever-deepening bed of quicksand. The requests for new tests continue to grow while the revenue per test continues to decline as tests become more specific to tiny recent clades.

Maybe a better modern analogy is an ever-widening sinkhole:
Georgia sinkhole threatens to swallow entire apartment complex (http://www.sott.net/article/264452-Widening-sinkhole-threatens-to-swallow-entire-apartment-complex-in-Cobb-County-Georgia)
Louisiana sinkhole swallows 100-foot-tall trees (http://www.cnn.com/2012/08/09/us/louisiana-bayou-sinkhole/index.html)
Florida sinkhole swallows resort near Disney World (http://www.cnn.com/2013/08/12/us/florida-resort-sinkhole/index.html)

The inability to get new individual SNP tests is quite tragic in the short term, of course, because there is no cost-effective alternative. Full Y sequencing is currently still much too expensive for routine recommendation, and chip offerings like Geno 2.0 cater primarily to Utahans and Sardinians.

GTC
09-16-2013, 02:52 PM
I have the contrary impression, based on what Bennett Greenspan wrote to me on September 4:
---
For the time being I see no change [due to the departure of Thomas Krahn] other than the SNPs that others have asked him to calibrate up or a few WTYs to be scored, nothing should be out of the norm.
---


I've had the feeling for some time that Bennett Greenspan is not up to speed with the operational details of the a la carte SNP and WTY side of the business. With Thomas' departure I would not be surprised if no more new SNPs are added to the menu.

RobertCasey
09-16-2013, 03:48 PM
I think that we really need to push Full Genomes for alternatives for testing individual YSNPs or groups of YSNPs. If FTDNA dramatically slows down the offering of newly discovered YSNPs, they are making a big business mistake by taking their eyes off their core business of YSTR testing AND extensive YSNP testing (not less than now). Testing individual YSNPs is probably not a money maker for FTDNA, since the overhead cost of scanning for one mutation on the Y chromosome has got to approach the $39 that they are collecting. I am afraid that their strategy will be pumping out new versions of Nat Geo every couple of years - which just will not be adequate. This could put a business squeeze on Full Genomes: if FTDNA does not offer individual YSNP tests, Full Genomes' business will really slow down. We should give FTDNA a few weeks to propose their new strategy on YSNP discovery and testing of newly discovered YSNPs. Our requests to add Full Genomes' newly discovered YSNPs will reveal a lot in the near future about the FTDNA long-term strategy. If Full Genomes finds a good alternative, I would really hate to manually track these tests vs. the FTDNA YSNP reports (even with all their problems).

If the Nat Geo / FTDNA contract allows, FTDNA could statically test 500 to 1,000 newly discovered YSNPs with new tests every 3 to 6 months for around $99. Over time, there could be around a dozen or so $99 tests that would give us the equivalent of the $39 individual tests. I am not sure if the economics of setting up several static chips would allow this, but this would also eliminate many unstable / private SNPs from the Nat Geo test. Nat Geo probably does not want to add all the private YSNPs because they do not really care about private/genealogical YSNPs. Then every couple of years, the Nat Geo test could add the broader and older YSNPs and remove any unstable or private YSNPs, which would only be available via the static $99 tests. This would be a reasonable solution that would allow us to continue testing groups of YSNPs at less than the Nat Geo test cost, and to test newly discovered SNPs much more frequently than a Nat Geo update every two years.

The 111-marker test ($359) and the Nat Geo test ($199) together are already $559 - 37% of the cost of the Full Genomes test. However, $1,499 is just not viable except for those of us who are willing and able to test at those costs. If FTDNA does not keep the ball rolling and the Full Genomes test falls to $750, FTDNA would have major exposure to losing their lucrative and exclusive testing of the profitable 67- and 111-marker YSTR tests, as well as Nat Geo orders from those who recommend YSNP testing. It is really sad that Ancestry.com and 23andMe have such dismal and disappointing long-term strategies. All three companies are pursuing high-volume atDNA tests, which are good tests for many scenarios - but cannot address the older brick walls that YDNA can reveal.

The above only addresses testing issues - we still have a massive database problem and a lack of tools, which is another major issue: having to download Nat Geo tests manually to make up for the broken Nat Geo to FTDNA transfer, and manually tracking Walk the Y test results (vs. having them load into the YSNP reports). You know that FTDNA will never allow uploads of Full Genomes YSNP or YSTR results. Just growing pains of a new and leading-edge niche industry.

TigerMW
09-16-2013, 09:17 PM
... The above only addresses testing issues - we still have a massive database problem and a lack of tools, which is another major issue: having to download Nat Geo tests manually to make up for the broken Nat Geo to FTDNA transfer, and manually tracking Walk the Y test results (vs. having them load into the YSNP reports). You know that FTDNA will never allow uploads of Full Genomes YSNP or YSTR results. Just growing pains of a new and leading-edge niche industry.
I was thinking about that database issue too. I see we have different volunteers in U106, Z18 and L21 using different formats to do the comparative analysis. We also have Chris Morley's tool, which must have its own database.

I wish there was a better example, but just as there is a Ysearch database with STR data elements and "one" haplogroup designation per record/kit/row, we need to have something similar for SNPs. There should be a record ID, an associated kit # or some other identifier, a surname as a double check, and then a series of structured data items/columns dependent on the haplogroup... something along those lines. Everybody should use the same format for the SNP mutation label/location and the allele, be it "A", "G", etc.

There are many, many SNPs, of course, so I'd recommend something like a relevant set of SNPs per haplogroup. This needs to be worked out, but a secondary index file might hold the haplogroup definitions, with a list of relevant SNPs per haplogroup and the ancestral allele/value for each of those SNPs. The primary/large file would have to have an index key stored per record to point to the appropriate "owning" haplogroup.
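To make the idea concrete, here is a minimal sketch of how such a layout could look as relational tables. All table and column names, haplogroup labels, kit numbers and alleles below are hypothetical, invented purely for illustration; the point is the secondary "index" of relevant SNPs per haplogroup, and records keyed back to their owning haplogroup.

```python
import sqlite3

# Hypothetical schema for the proposed "common SNP database" layout.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE haplogroup (
    hg_id INTEGER PRIMARY KEY,
    name  TEXT UNIQUE                -- e.g. 'R1b-L21'
);
CREATE TABLE haplogroup_snp (        -- the secondary index: relevant SNPs
    hg_id     INTEGER REFERENCES haplogroup(hg_id),
    snp       TEXT,                  -- consistent (e.g. ISOGG) SNP name
    ancestral TEXT                   -- ancestral allele: 'A', 'G', ...
);
CREATE TABLE record (                -- one row per kit
    rec_id  INTEGER PRIMARY KEY,
    kit     TEXT,
    surname TEXT,                    -- surname as a double check
    hg_id   INTEGER REFERENCES haplogroup(hg_id)  -- "owning" haplogroup
);
CREATE TABLE call (                  -- observed allele per record and SNP
    rec_id INTEGER REFERENCES record(rec_id),
    snp    TEXT,
    allele TEXT
);
""")
con.execute("INSERT INTO haplogroup VALUES (1, 'R1b-L21')")
con.execute("INSERT INTO haplogroup_snp VALUES (1, 'DF13', 'C')")
con.execute("INSERT INTO record VALUES (1, '12345', 'Walsh', 1)")
con.execute("INSERT INTO call VALUES (1, 'DF13', 'T')")

# Derived vs. ancestral then falls out of a join against the index table.
row = con.execute("""
SELECT r.kit, c.snp,
       CASE WHEN c.allele = s.ancestral THEN 'ancestral' ELSE 'derived' END
FROM record r
JOIN call c           ON c.rec_id = r.rec_id
JOIN haplogroup_snp s ON s.hg_id = r.hg_id AND s.snp = c.snp
""").fetchone()
print(row)  # ('12345', 'DF13', 'derived')
```

One design consequence worth noting: because the ancestral alleles live in the per-haplogroup index rather than in each record, project data from different haplogroups can be merged without transformations, which is the stated goal.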

What do you think? Essentially, I think something like ISOGG needs to endorse use of an independent database structure. There should be guidelines/controls for registering the SNPs, etc. They already have an SNP index, and perhaps that could be another auxiliary file used as the source of "correct"/consistent SNP names.

We've had a few years of WTY, then the larger volume of Geno 2.0, and now the multiple other formats that will come. I'm afraid it's getting ahead of us.

I suppose a database structure should be endorsed for Y STRs too, since Ysearch doesn't even hold FTDNA's 111 markers, and it sounds like we have more than that coming in these new offerings.

I recognize that the operational support of such databases is another issue, but at least we should come up with a common format so various project data can be joined together for analysis, etc., without a bunch of transformations. I'm advocating setting data format and naming standards, not providing the service itself. Companies will probably attempt to provide it, but we will see bias in those offerings, or lackluster ongoing support.

TigerMW
09-16-2013, 09:33 PM
... DF99 is one SNP I have questions about. It appears to be a new peer subclade to U152, L21, DF27, DF19 and L238 under P312. Does anyone know the status at FTDNA or how we can determine it. ...
I'm not sure of their status but Z192 is an important subclade under U152 and then we have DF100 directly under L11 as a peer to P312 and U106.

RobertCasey
09-16-2013, 11:02 PM
I think that there is hope for a common YSTR format, since it is an easier issue:

1) The dash format of multi-copy markers has to be parsed into separate fields for analysis (first deviation from FTDNA format);
2) Conversion to 389-delta format - far superior for analysis (second deviation from FTDNA format - a change in my format - my database is already in delta format);
3) Multi-copy markers beyond normal values should be kept separate from normal markers (change in your format for 464 - eliminate blank columns and put them out in a special area);
4) Must include the extra Full Genomes YSTRs (up to 400) - needed for future and current YSTR research;
5) Must include source project name and date extracted fields - I am slowly adding these so that others can easily pull values - I still can not find around 50 of your submissions (labeled SS);
6) Some way to load non FTDNA YSTR report submissions (YSearch, WFN and custom reports) - too hard to maintain but we need source field to track their original source;
7) Some way to track no-calls YSTRs and missing YSTRs (currently listed as 0* and 0 in FTDNA reports);
8) Manually be able to add the CCTG 464 variations as special fields;
9) Full Genomes has to be a special source field vs. just using the FTDNA project name improperly;
10) Splitting up the f and FTDNA ID fields - it is bad database design to combine fields - there should be a report source field and a report ID field (a change in both of our formats).
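Items 1 and 2 above are simple mechanical conversions, so a short sketch may help. This assumes multi-copy markers arrive as dash-joined strings (FTDNA style) and that "389-delta" means storing DYS389II as its increment over DYS389I, since DYS389II is reported as a cumulative count; the sample values are illustrative.

```python
def split_multicopy(value):
    """Parse a dash-format multi-copy marker (e.g. '15-15-16-17')
    into separate integer fields, as item 1 requires."""
    return [int(v) for v in value.split("-")]

def to_389_delta(dys389i, dys389ii):
    """Convert DYS389II from its cumulative reported value to a
    delta over DYS389I (item 2), so the two can vary independently."""
    return dys389ii - dys389i

print(split_multicopy("15-15-16-17"))   # [15, 15, 16, 17]
print(to_389_delta(13, 29))             # 16
```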

I try to emulate your spreadsheet (no need to re-invent the wheel) with a couple of minor exceptions. Hopefully, it will become the de facto standard to emulate. The YSTR and YSNP reports must be separate database tables, but output summary export reports can be joined together. We probably need a separate table for genealogical data (surname and estimated European place of origin - derived fields that are a major pain and a deviation from FTDNA) as well as the original FTDNA fields such as oldest known proven ancestor, origin, donor field and grouping header (I parse this).

Normalization of tables has major tradeoffs. Normalization states that any table entries that are not dependent should be separated. We currently have L21 tables (about all our spreadsheets can handle, and that is about to get ugly with many more entries). David Reynolds started doing normalization for his Nat Geo / WTY summaries, which is the correct thing to do. I would combine L96, L144.1, L371, L679 and CTS2457.2 into one table, DF13private. I would keep all the others in DF13new - which would include DF13*, as he has already done. Since M222 has so many SNPs (and Full Genomes will add many more) as well as so many submissions, you could split DF49 into DF49-M222 and DF49-other. Same with DF21: DF21-Z246 and DF21-other (normalize a few of these to get everyone used to constant changes for normalization).

However, this would greatly complicate our lives, since we would have to learn how to join numerous tables together for reports and analysis. Looking up ISOGG requirements for new DF13 sons would require looking at most tables, or joining them together. We will have no choice but to eventually normalize our data into different tables. Another major problem would be convergence. My recent analysis of CTS2687 showed major overlap between L1066 (Z253), L1333 (L513), CTS2687 (L513), P314.2 (DF21), 1515Who and 513-V (L513). As testing progresses, we would have to constantly move submissions from one table to another. Also, where do we place 1515Who - under L21*, under DF13*, or in its own separate table as L21unk?

Normalization could help the YSNP issue. Under normalization, you only have to store the raw data (the un-parsed YSNPs) and, for a submission in the L513 table, the SNPs relevant under L513. The relevant YSNPs really need to be separate fields. Output reports of relevant YSNPs would be a derived combination of those separate fields for spreadsheet conversion. YSNPs can get real ugly fast: we need to classify them as terminal, duplicate, unstable or volatile (L159.2 and L69.5), which is subject to a lot of debate. Do we only allow terminal YSNPs to be ISOGG-driven, or do we allow private YSNPs to be terminal YSNPs? Or do we just punt and give people the entire list, or filter some out (the unstable ones)?
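A tiny sketch of the storage idea above: keep the raw, un-parsed YSNP result string once, hold separate fields only for the SNPs relevant to the table, and derive the report columns on the way out. The SNP names, the "+/-" result format, and the volatility labels here are all invented for illustration.

```python
# Hypothetical per-table configuration for an L513-style table.
RELEVANT = ["L513", "CTS2687", "L1333"]                  # relevant SNP fields
CLASSES = {"L159.2": "volatile", "L69.5": "volatile"}    # debatable labels

def parse_raw(raw):
    """'L513+ CTS2687+ L1333-' -> {'L513': '+', 'CTS2687': '+', 'L1333': '-'}"""
    return {token[:-1]: token[-1] for token in raw.split()}

def table_row(raw):
    calls = parse_raw(raw)
    row = {snp: calls.get(snp, "?") for snp in RELEVANT}  # separate fields
    row["raw"] = raw                                      # raw data stored once
    # Derived report column: flag any volatile SNPs seen in the raw result.
    row["volatile"] = sorted(s for s in calls if CLASSES.get(s) == "volatile")
    return row

row = table_row("L513+ CTS2687+ L1333- L159.2+")
print(row["L513"], row["CTS2687"], row["L1333"])  # + + -
print(row["volatile"])                            # ['L159.2']
```

The design choice this illustrates: classification debates (terminal vs. unstable vs. volatile) only touch the derived output, never the stored raw data, so reclassifying an SNP never requires re-entering results.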

Another major issue for me is prediction. Most of the single-fingerprint YSNPs are probably predictable without an L21-positive result being a requirement. If I filter out P312 and others, I will not find convergence and would not be able to properly predict without a positive L21 test. M222 and L226 are very isolated and do not require testing L21+, and probably 80% of the single-fingerprint SNPs fit the same category. So for this analysis, I really need to combine L21 with its near relatives. We are already up to half of L21 being single fingerprints, and the percentage will continue to rise. Also, if we can prove some of your single-pattern signatures are genetically isolated, they could be predicted as L21 with minimal analysis. Of course, any signature/fingerprint can range from very stable to very speculative - so we will constantly have to move submissions between tables, a hassle that we do not currently have too much of.

jonesge
09-17-2013, 01:59 PM
Normalization of tables has major tradeoffs. Normalization states that any table entries that are not dependent should be separated. We currently have L21 tables (about all our spreadsheets can handle, and that is about to get ugly with many more entries). David Reynolds started doing normalization for his Nat Geo / WTY summaries, which is the correct thing to do. I would combine L96, L144.1, L371, L679 and CTS2457.2 into one table, DF13private. I would keep all the others in DF13new - which would include DF13*, as he has already done. Since M222 has so many SNPs (and Full Genomes will add many more) as well as so many submissions, you could split DF49 into DF49-M222 and DF49-other. Same with DF21: DF21-Z246 and DF21-other (normalize a few of these to get everyone used to constant changes for normalization).

However, this would greatly complicate our lives since we would have to learn how to join numerous tables together for reports and analysis. Looking up ISOGG requirements for new DF13 sons would require looking at most tables or joining them together. We will have no choice but to eventually normalize our data into different tables. Another major problem would be convergence.

Another major issue for me is prediction.

Robert, please drop your thinking about "normalization" and "predictions" and "Excel gymnastics" for a moment and consider this: L371 is a stand-alone SNP with a Welsh heritage and does not deserve to be COMBINED nor MIXED IN with other SNPs. One part of your brain seems to get clouded over from your overly Excelling at juggling things around. Just because there have not been so many people tested for L371 is no reason for you to relegate and COMBINE it.

RobertCasey
09-17-2013, 02:44 PM
Robert, please drop your thinking about "normalization" and "predictions" and "Excel gymnastics" for a moment and consider this: L371 is a stand-alone SNP with a Welsh heritage and does not deserve to be COMBINED nor MIXED IN with other SNPs. One part of your brain seems to get clouded over from your overly Excelling at juggling things around. Just because there have not been so many people tested for L371 is no reason for you to relegate and COMBINE it.

You are confusing database-related issues with reporting issues. Currently, L371 is lumped into all of R-L21; both Mike W. and I have R-L21 databases (Mike also has an R-P312 database). Normalization is necessary for database design, and R-L21 is getting too big for export to our external reporting spreadsheets. I will probably continue to create external reports at the YSNP level (L371 has a separate report). Prediction has the potential to save everyone a lot of unnecessary testing costs, and that is what Mike and I strive for (as best we can). Excel gymnastics is a necessary evil, as that is the lowest common denominator for public consumption. When you look at a spreadsheet, wouldn't you prefer to see only the smaller DF13 SNPs vs. all of R-L21? That is what normalization is all about. What we are talking about is splitting up the L21 database into more manageable chunks.

I will probably continue to export L371-related reports as standalone spreadsheets and web pages. The question is what to do when a new L21 submission is added to our database. Putting it all into one massive L21 database does have advantages (recognizing overlap of L371 with non-L371 submissions). From a prediction point of view, if I combine all of R-M269 into one database, I will be able to predict most L21 private YSNPs without the need to test positive for L21. This could avoid the Nat Geo test in the future and could allow people to test for L371 directly if they match. These issues are very important, and they affect your and your L371 cousins' pocketbooks for testing costs and the ability to rapidly discover and analyze newly discovered YSNPs.

Creating standards such as Mike W. has suggested means analysis between different parts of the haplotree will be more consistent, and that we will be better able to merge new and better ideas from other groups. By doing so, we will be better able to analyze L371 and any future descendants of L371. There are now around 20 R-L21 individuals in the Full Genomes tests, so database-related issues will become more important with the doubling of R-L21 SNPs by the end of the year. You might want to consider the Full Genomes test for L371 - you will probably discover 20 to 40 new YSNPs under and just above L371. With the impending deluge of new YSNPs coming, we are all concerned about the need to improve and standardize our analysis methodologies. This would separate all the Pugh submissions that dominate L371 from the 20% non-Pugh submissions. Our goal is to get you a son and grandson of L371 that identifies your particular part of L371. With 100,000+ YSNPs on the near horizon, combined with over 400 YSTRs, we may be able to assign combinations of YSNPs and YSTRs to individuals on your pedigree chart - and make some significant progress on our brick walls.

jonesge
09-18-2013, 01:45 AM
With the impending deluge of new YSNPs coming, we are all concerned about the need to improve and standardize our analysis methodologies. This would separate all the Pugh submissions that dominate L371 from the 20 % non-Pugh submissions. Our goal is to get you a son and grandson of L371 that identifies your particular part of L371.

Robert, in an earlier post I specifically explained to you the "reason" for the higher percentage of "Pugh"-surnamed individuals who have the L371 SNP. There are even some African American "West" (http://www.familytreedna.com/public/R-17-14-10/default.aspx?section=ycolorized) men who are L371 but are genetically YDNA L371 Welsh Pughs. That "reason" is that several years back there was a project admin "PROACTIVELY SEEKING" Pughs and Wests to be DNA tested. These surnamed Pughs are hence not a fairly distributed part of the larger Welsh-heritage L371 males being tested. Plus, these are mainly DNA tests from USA males... the real L371 hotspot is in NW Wales (and I have those results now in conjunction with my own research and the POBI project). Those are the hard and fast facts, and I trust you can now accept that reality when you do your Excelling - Standardizing - or whatever you do on your IBM computers.

Before making such assertions in the future about any L21 or L371 surname, you might also want to validate them and get a pragmatic sanity check on what I am again explaining to you, by looking at USA surname frequencies based on the 2000 Census.

Pugh ranks #817 (http://www.histopolis.com/Surname/Pugh), whereas Jones ranks #5 (http://www.histopolis.com/Surname/Jones) and the Griffith surname ranks #369 (http://www.histopolis.com/Surname/Griffith). My Jones YDNA line descended from a Griffith L371 line circa the 1400s to early 1500s. Again, I doubt your Excelling would discover that very quickly.

[[[ Edited out and copied elsewhere see note below from moderator]]]

The British are coming... The British are coming!!! The Data is coming... The Data is coming!!! Your sophomoric pleadings to get L371s and other L21s to shuck out big bucks for high-priced "Full Genome Testing" is comical. You said: "You might want to consider the Full Genomes test for L371 - you will probably discover 20 to 40 new YSNPs under and just above L371." Robert, I will take this Full Chromosome test... will you pay 50% of its cost if you are real wrong on the new (20 to 40!!!) YDNA SNPs you predict an L371 individual will have?

L371 is a relatively younger Y-SNP and looking at Y-STRs right now is the best way to ID Welsh surname relationships as noted in the above examples.

I would rather fund 10 new Y-37 DNA tests in NW Wales to define some L371 boundaries there, rather than shucking out over $1,500 for a Full Chromosome DNA test... but I'll do that Full Chromosome test now if you want to put some money (upfront) where your mouth is. This topic is about "novel" SNPs (and you brought my L371 into this discussion), so I think my comment calling you out is more than appropriate.


[[[Mikewww/moderator on 9/18/2013: I copied verbatim the two contiguous off-topic paragraphs on privacy concerns over to a general-category thread about DNA project data and privacy. Go here if interested: http://www.anthrogenica.com/showthread.php?1335-Concerns-about-privacy-of-data-of-public-DNA-projects-etc I think there are other comments going off on tangents here too.]]]

TigerMW
09-18-2013, 01:59 PM
I'm posting this as a moderator, so it is itself off-topic, but I don't want to let this thread go off onto tangents too broadly.

First, please try to present your positions without using insulting or provocative language. I'm referring to a post or two made by jonesge. We want to stay on topic related to the content of the discussion and keep off individuals' traits. There may be other postings by others too, but I just noticed these on this thread.

Second, I will look for a place to open a thread on project data privacy. That has been discussed on multiple forums, but if someone wants to discuss it, let's get it in the right category of this forum.

Third, we could easily go off on a tangent into a discussion comparing the different commercial testing options from a procurement standpoint. There may be other threads already set up to do this, so we should look for those if we want to discuss that more deeply. If there are no threads open for that, please start one in the proper, more general category.

TigerMW
09-18-2013, 08:08 PM
I think that there is hope for a common YSTR format since it is an easier issue:
...
4) Must include the extra Full Genomes YSTRs (up to 400) - needed for future and current YSTR research;
...

Have you seen an output file from Full Genomes? I'm looking for the labels and formatting they provide. Hopefully they will keep that constant, and I'll be able to transform at least the Y STRs into the format I'm using, which is really just the old FTDNA format with DYS389ii-i versus DYS389-2. On the ExtHts (extended haplotypes) tab, I plan to go ahead and add columns to the right for the additional Y STRs that Full Genomes reports.

Goldenhind, do you have your full report? Can you send me an example of how the Y STRs are reported?

GoldenHind
09-19-2013, 10:42 AM
Have you seen an output file from Full Genomes? I'm looking for the labels and formatting they provide. Hopefully they will keep that constant, and I'll be able to transform at least the Y STRs into the format I'm using, which is really just the old FTDNA format with DYS389ii-i versus DYS389-2. On the ExtHts (extended haplotypes) tab, I plan to go ahead and add columns to the right for the additional Y STRs that Full Genomes reports.

Goldenhind, do you have your full report? Can you send me an example of how the Y STRs are reported?

I'm afraid that will have to wait until I return home in October. I haven't downloaded my full report yet.

MJost
09-19-2013, 01:31 PM
Mike I will send you a link for an output file.

MJost

David
09-19-2013, 10:51 PM
...

Normalization of tables has major tradeoffs. Normalization states that any table entries that are not dependent should be separated. We obviously currently have
L21 tables (about all our spreadsheets can handle and that is about to get ugly with too many more entries). David Reynolds started doing normalization for his
Nat Geo / WTY summaries which is the correct thing to do. I would combine L96, L144.1, L371, L679 and CTS2457.2 into one table, DF13private. I would keep
all the others in DF13new - which would include DF13* which is already has done. Since M222 has so many SNPs (and Full Genomes will add many more) as well
as having so many submissions: you could split DF49 into DF49-M222 and DF49-other. Same with DF21: DF21-Z246 and DF21other (normalize a few of these
to get everyone used to constant changes for normalization).
...

Normalization is key to keeping up with the volume of data coming at us. My original process, which worked just fine with only WTY data, went to pieces after about three weeks of Geno 2.0 data coming in. :)

One thing I did internally is to keep a clear separation between the data and the presentation of the data. I store the data in a format that can easily be manipulated by traditional UNIX tools (Cygwin on my PC helps immensely!) or brought into Excel and worked with there. For presentation of the data, the Excel spreadsheets I publish as PDF files are composed almost entirely of VLOOKUP functions. To add a new entry, or move an entry to another sheet, all I have to do is insert a column, paste in the generic formulas, type in the kit number and date received, and I'm done.
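A rough Python analogue of that separation, for anyone who wants to script it: the data lives in a plain delimited file that UNIX tools or Excel can chew on, and the "presentation" layer only looks cells up by kit number, much like the VLOOKUP formulas. The file layout, column names and kit numbers here are invented for illustration.

```python
import csv
import io

# Hypothetical tab-separated data file: one row per kit, easy to edit
# with UNIX tools or Excel without touching the presentation layer.
DATA = (
    "kit\treceived\tterminal_snp\n"
    "N1234\t2013-09-01\tDF13\n"
    "N5678\t2013-09-10\tM222\n"
)

def load(data):
    """Index the raw rows by kit number."""
    return {row["kit"]: row for row in
            csv.DictReader(io.StringIO(data), delimiter="\t")}

def lookup(table, kit, column):
    """The VLOOKUP equivalent: fetch one cell for a given kit number."""
    return table[kit][column]

table = load(DATA)
print(lookup(table, "N1234", "terminal_snp"))  # DF13
print(lookup(table, "N5678", "received"))      # 2013-09-10
```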

Robert, your thoughts about further splitting off of the data are very similar to what I have been thinking of. I will be making some changes in that area soon.

Regards,
david

David
09-19-2013, 11:06 PM
I've had the feeling for some time that Bennett Greenspan is not up to speed with the operational details of the a la carte SNP and WTY side of the business. With Thomas' departure I would not be surprised if no more new SNPs are added to the menu.

I had a conversation with Bennett about a la carte SNP and WTY testing when Geno 2.0 testing was announced, and again in June at the SCGS Jamboree in Burbank.

My characterization would be that Bennett very much understands the operational details, but as a low-volume, low-profit product, it is not his priority. It is, however, a product he has been willing to provide as a service to the community.

I do think WTY is dead, but I don't know that that is necessarily a bad thing. Some portion of the people who WTY tested can have their needs met by a $200 Geno 2.0 or Chromo2 test. The rest can spend $300 more and get a far more useful full Y sequence done.

FTDNA has told us they will continue to do Y-SNP testing, and will continue to add new SNPs. I've no doubt it will take longer and probably be more painful to get new SNPs added, but I would expect them to live up to what they have publicly announced.

Regards,
david

lgmayka
09-19-2013, 11:19 PM
FTDNA has told us they will continue to do Y-SNP testing, and will continue to add new SNPs.
Has FTDNA actually said this recently? That is, since the new management team took over (http://www.prnewswire.com/news-releases/gene-by-gene-acquires-arpeggi-a-startup-health--and-ge-backed-company-to-build-worlds-leading-genetic-testing-and-genome-diagnostics-company-218666411.html)?
---
Arpeggi's Nir Leibovich was named Gene by Gene's Chief Business Officer, Jason Wang was named Chief Technology Officer and David Mittelman, Ph.D was named Chief Scientific Officer.
---

David
09-19-2013, 11:26 PM
Has FTDNA actually said this recently? That is, since the new management team took over (http://www.prnewswire.com/news-releases/gene-by-gene-acquires-arpeggi-a-startup-health--and-ge-backed-company-to-build-worlds-leading-genetic-testing-and-genome-diagnostics-company-218666411.html)?
---
Arpeggi's Nir Leibovich was named Gene by Gene's Chief Business Officer, Jason Wang was named Chief Technology Officer and David Mittelman, Ph.D was named Chief Scientific Officer.
---

Yes, that announcement was made 7 Aug, and the statements regarding continued Y-SNP testing were made after the news of the termination of the Krahns became public on 31 Aug.

--david

TigerMW
09-20-2013, 02:07 AM
Normalization is key to keeping up with the volume of data coming at us. My original process, which worked just fine with only WTY data, went to pieces after about three weeks of Geno 2.0 data coming in. :)

One thing I did internally is to keep a clear separation between the data and the presentation of the data. I store the data in a format that can easily be manipulated by traditional UNIX tools (Cygwin on my PC helps immensely!) or brought into Excel and worked with there. For presentation of the data, the Excel spreadsheets I publish as PDF files are composed almost entirely of VLOOKUP functions. To add a new entry, or move an entry to another sheet, all I have to do is insert a column, paste in the generic formulas, type in the kit number and date received, and I'm done.

Robert, your thoughts about further splitting off of the data are very similar to what I have been thinking of. I will be making some changes in that area soon.

Regards,
david

We may need to recognize, which is what I think you are getting at, that there may need to be both research databases and user(consumer) databases. My thoughts are primarily focused on the user databases, a "Ysearch for SNPs", so to speak.

I am suggesting that we have a user database format that would allow for spin-off types of analyses and research, albeit probably not as heavy duty, and more focused on generic analysis formats like .csv files or spreadsheets.

Generally, you want the data items to be described as data themselves so configurations can change easily without actual logic/programming changes. Hence, a well thought out data design is critical. I don't think the below is the ultimate answer, but I just want to spur thinking by providing the following hypothetical design for saving SNP data in a common format. Sorry, I'm sure some of the audience will drop off on this, but I don't have time to write manuals, programs, etc. to explain it properly and work out details.

Identifier:
Identifying name:
Source:
Source Date:
Configuration Version:
Record_Type: 1=Ancient Hg, 2=Master Hg Record, 3=Individual data Record
Parent Haplogroup Name: R1b-DF49
SNP 1 label: DF23
SNP 1 allele:
SNP 2 label:
SNP 2 allele:
thru
SNP N label:
SNP N allele:

The "Master" or "Ancient" record types would be artificial records that are identified by the haplogroup label and would contain the ancestral alleles for the SNPs to be recorded within the haplogroup's data set. These are important as they would be needed for determining derived vs. ancestral states in the individual data records. There would be multiple versions of these master records as the phylogeny is updated. Accordingly, the individual data records would have to point to the appropriate SNP configuration version as well as the appropriate haplogroup.

The "Individual" data records would have the kit # as the identifier and probably the MDKA surname as the Identifying name along with the actual results.

There will need to be an equivalent SNP transformation table as well, which would have record types of either equivalent or identical to go with the SNP label and whatever you are transforming it into, be it a haplogroup name or what I have called a "lead" SNP for a branch.

I hate to say this, but I think the haplogroup names should be the long haplogroup names since there is not necessarily just one SNP that identifies a phylogenetic branch and sometimes we don't know how many occurrences of an SNP there are. We never really know.

If the data is designed and stored in a standard, accessible format then there could be multiple ways to present results by the general users through spreadsheets, network diagramming type programs, etc. I'm a very strong advocate of publishing the data in an intelligent format, not just in fixed styles like .pdf's that can only be used for very specific purposes. .pdf's are okay; I'm just saying we need the publishing to output data formats too, just like FTDNA project screens have export features and are not pure pixel graphics.

Since the phylogeny will change as it is discovered I don't think we can avoid the idea of SNP configuration versions. I think the versions can be maintained at the haplogroup level, not dependent on the whole Y tree. It is a pain and may require some kind of transform program/process that runs through the database and recopies individual records into the new configuration formats, probably having to leave blanks for any new SNPs added. The users or data sources would have to add the SNP results. ... um, probably need some kind of adapter program/process for each new test offering to port data into the current configuration.
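To make the record layout above concrete, here is a minimal Python sketch. The field names follow the hypothetical design; the sample kit number, SNP alleles, and calls are invented for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class SnpRecord:
    identifier: str          # kit number, or haplogroup label for master records
    identifying_name: str    # MDKA surname, or haplogroup name
    source: str              # e.g. "FTDNA SNP report", "Geno 2.0"
    source_date: str
    config_version: int      # which SNP configuration version this record follows
    record_type: int         # 1=Ancient Hg, 2=Master Hg, 3=Individual
    parent_haplogroup: str   # e.g. "R1b-DF49"
    alleles: dict = field(default_factory=dict)  # SNP label -> allele

def call_states(master: SnpRecord, individual: SnpRecord) -> dict:
    """Compare an individual's alleles against the master (ancestral) record."""
    calls = {}
    for snp, ancestral in master.alleles.items():
        observed = individual.alleles.get(snp)
        if observed is None:
            calls[snp] = "untested"
        elif observed == ancestral:
            calls[snp] = "ancestral"
        else:
            calls[snp] = "derived"
    return calls

# Alleles below are placeholders, not real calls.
master = SnpRecord("R1b-DF49", "R1b-DF49", "manual", "2013-09-20", 1, 2,
                   "R1b-DF49", {"DF23": "G", "Z2961": "C"})
kit = SnpRecord("12345", "Smith", "FTDNA SNP report", "2013-09-20", 1, 3,
                "R1b-DF49", {"DF23": "A"})
print(call_states(master, kit))  # {'DF23': 'derived', 'Z2961': 'untested'}
```

The point of the master record is visible here: derived vs. ancestral is never stored in the individual record, only computed against the master for that configuration version.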

David
09-20-2013, 04:09 AM
I've been struggling with this in a bottom-up fashion, while trying to clean up the ISOGG SNP Index, as well as during the course of trying to put together a comprehensive listing of all SNPs.

For each Y location, there can be zero or more mutations. Each different mutation event normally has a different SNP name. But not always.

For each combination of Y location and mutation (or lack of mutation, including back-mutation), there can be one or more state changes associated with that pair. Each state change represents a different clade, and if there is more than one state change, then each different haplogroup is normally designated by the SNP name followed by a decimal counter. But not always.

And each SNP may have one or more synonyms that describe exactly the same chrY position, mutation event, and state change. The synonymous SNPs are typically assigned by different researchers. But not always.

And of course, looking from the other direction, each clade may have multiple mutations associated with it, where all are believed to be phylogenetically equivalent, and each mutation has a set of synonymous SNP names assigned to it.

Not a single clean relationship in the lot with scattered deviations from the "rules" all over the place.
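The many-to-many relationships just described (positions, recurrent mutations, decimal suffixes, synonyms, equivalent mutations) could be modeled roughly like this. This is only a sketch; every name, position, and allele below is invented purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Mutation:
    position: int   # chrY coordinate (placeholder values throughout)
    ancestral: str
    derived: str

@dataclass
class Clade:
    name: str
    # mutation -> set of synonymous SNP names for that mutation
    mutations: dict = field(default_factory=dict)

# One mutation at one position defining two independent clades:
# recurrent SNPs normally get a decimal suffix per clade (but not always).
m = Mutation(1234567, "A", "C")
clade1 = Clade("HgA-X.1", {m: {"X.1"}})
clade2 = Clade("HgB-X.2", {m: {"X.2"}})

# One clade with several phylogenetically equivalent mutations,
# each carrying its own set of synonyms from different researchers.
clade3 = Clade("HgA-Y", {
    Mutation(2345678, "G", "T"): {"Y", "Y_syn1"},
    Mutation(3456789, "C", "A"): {"Y_equiv"},
})
print(len(clade3.mutations))  # 2
```

Even in a toy model, the clade name cannot be derived from a single SNP name; it has to be stored explicitly, which is exactly the messiness being described.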

--david

TigerMW
09-20-2013, 03:40 PM
I've been struggling with this in a bottom-up fashion, while trying to clean up the ISOGG SNP Index, as well as during the course of trying to put together a comprehensive listing of all SNPs.

For each Y location, there can be zero or more mutations. Each different mutation event normally has a different SNP name. But not always.

For each combination of Y location and mutation (or lack of mutation, including back-mutation), there can be one or more state changes associated with that pair. Each state change represents a different clade, and if there is more than one state change, then each different haplogroup is normally designated by the SNP name followed by a decimal counter. But not always.

And each SNP may have one or more synonyms that describe exactly the same chrY position, mutation event, and state change. The synonymous SNPs are typically assigned by different researchers. But not always.

And of course, looking from the other direction, each clade may have multiple mutations associated with it, where all are believed to be phylogenetically equivalent, and each mutation has a set of synonymous SNP names assigned to it.

Not a single clean relationship in the lot with scattered deviations from the "rules" all over the place.

--david

I attempt to deal with this situation in my spreadsheets by defining a table (the Clades tab) that really is a data-driven way to describe the phylogeny. I name the "branches" but use the short haplogroup style and am forced to pick a "lead with" SNP for each branch. That is no problem for synonyms of an SNP (different names but the same SNP), except that multiple occurrences (the .1, .2, etc.) cause problems sometimes. However, for equivalent SNPs (different SNPs but the same branch), this only works as long as the equivalence is maintained, which could be forever, or maybe not.

Similar to a GEDCOM file or to my Clades table, we need an ISOGG-endorsed standard file/data format for describing an SNP-based phylogeny. Once we have that, people can write analysis or presentation tools by simply importing the phylogeny files. It'd be like having the ISOGG haplogroup R html page in a "structured data" intelligent file format.
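As a sketch of what such a structured phylogeny file could look like, here is a small JSON example. The fields are invented for illustration, and the long-form names are only approximate; only the L21 = M529 = S145 synonymy is a real, well-known example.

```python
import json

# Hypothetical phylogeny exchange format: one record per branch,
# with a lead SNP, its equivalents/synonyms, and a parent pointer.
phylogeny = {
    "version": "2013-09-20",
    "branches": [
        {"name": "R1b1a2a1a2c", "lead_snp": "L21",
         "equivalent_snps": ["M529", "S145"], "parent": "R1b1a2a1a2"},
        {"name": "R1b1a2a1a2c1", "lead_snp": "DF13",
         "equivalent_snps": [], "parent": "R1b1a2a1a2c"},
    ],
}

# Round-trip through text, then answer a simple query from the file alone.
text = json.dumps(phylogeny, indent=2)
loaded = json.loads(text)
children = [b["name"] for b in loaded["branches"]
            if b["parent"] == "R1b1a2a1a2c"]
print(children)  # ['R1b1a2a1a2c1']
```

Any tool that can read this file can rebuild the tree without screen-scraping an HTML page or a PDF, which is the whole argument for a structured format.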

This is needed regardless of whether or not we ever have a consumer/user oriented "Ysearch for SNPs" standard file format. Of course, since Ysearch is out of date in the number of STRs it supports, and does not support automated import/export (witness the captchas), we need a new "Ysearch for STRs" as well. Of course the individuals' SNP test result records and STR records need to point to each other, probably via using the same index key (Ysearch identifier).

This brings up the topic of another data/file piece needed, really a third type of record that should be maintained for each individual/consumer. There should be an MDKA reference index which uses the same identifier (i.e. Ysearch ID) but has a structured format like we've already seen in Ysearch for MDKA surname, probably variants, earliest known location and year (probably birth date) in more granular formatting like village/town/parish, county/district/shire, province/region, country, etc., to go with data source and source date. I actually have this in my spreadsheets but it is in a tab that I delete before posting. I maintain this because the current FTDNA and Ysearch data is not always in a consistent format, and so I often have to compare the two and "transform" it. It also allows for easier development of new presentations of the data. I have a column for surname heritage too (i.e. Irish, English, Basque, Spanish, etc.), but I've never used it because it is a can of worms itself and I'm not smart enough to figure out an authoritative name classification system.

RobertCasey
09-20-2013, 04:30 PM
The source data and source date probably need to be moved down to the YSNP level vs. the submission ID level. We currently have eight sources for YSNP testing results: 1) FTDNA YSNP report; 2) WTY Finch2; 3) Nat Geo 2.0 raw files; 4) Full Genomes (may have several flavors); 5) 23andMe (we rarely use); 6) Chromo2 files; 7) FTDNA YSTR report (if no YSNP report is available - if FTDNA ever updates the terminal YSNP, this could be significant); 8) the occasional manual entry.

Since our sources are getting more and more diverse, knowing the source at the YSNP level will be pretty important.

TigerMW
09-20-2013, 05:16 PM
The source data and source date probably need to be moved down to the YSNP level vs. the submission ID level. We currently have eight sources for YSNP testing results: 1) FTDNA YSNP report; 2) WTY Finch2; 3) Nat Geo 2.0 raw files; 4) Full Genomes (may have several flavors); 5) 23andMe (we rarely use); 6) Chromo2 files; 7) FTDNA YSTR report (if no YSNP report is available - if FTDNA ever updates the terminal YSNP, this could be significant); 8) the occasional manual entry.

Since our sources are getting more and more diverse, knowing the source at the YSNP level will be pretty important.

Yes, agreed. That's what I was trying to describe. Every record needs a source origin and source date. In the MDKA/Identifier cross-reference index that I don't publish I include the project name and date that I copied the data, but I don't update it every time I have an update (upgrade or advanced test result comes in). I really should have a change log file so I could repeat the update transactions if needed or trace them. This is also complicated in that some individuals' records will have input from multiple sources, i.e. FTDNA Y DNA SNP report, FTDNA WTY, 23andMe, and now FGS and Chromo2.

You can see why I mentioned on another forum today that the average consumers have to consider vendor support and integration in their testing decisions. It's not what it should be, but FTDNA does have the ability to have Geno2-derived results transferred to the Y DNA SNP report, so that reduces the data source issue a little. It sure would be nice if FTDNA would transfer relevant (only) negative results too, but I don't think they'll ever figure that out. Also, it sure would be nice if they replaced their WTY (which appears to be discontinued) with a full genome scan offering that transferred relevant results to the Y DNA SNP report.

Of course, while I'm at it, they should create some kind of structure for SNP display rather than one long string in one cell. I have a tab/worksheet you don't see where I do a bunch of data/string manipulations on the downloaded Y DNA SNP report combined with other data (WTY and other companies' manual SNP results stored on another table). I have formulas to go through each SNP string three times: first to transform any equivalent SNPs to add the "lead with" branch SNP I picked out, along with floaters (unpositioned); secondly to scan through the recreated string from youngest (outermost branch level) back to oldest (trunk) to find the "defining" or terminal SNP; and finally to scan the next branch level down (younger) from the defining SNP to find relevant negatives so I can classify asterisk/paragroup people. (We have to look after the asterisk people, you know! ;) Well, we need to know what's a true terminal SNP or not too.) This SNP manipulation worksheet can bring my computer to its knees sometimes, but it sure beats the eye test of manually looking through Y DNA SNP report Geno2 results with a set of manual notes on the side... and a phylogeny chart on the side.
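The three passes just described could be sketched in code roughly like this. The tiny tree, the equivalence table, and the sample results are illustrative only, not a real configuration.

```python
# Pass data: any synonym/equivalent folds onto the branch's "lead" SNP.
EQUIVALENTS = {"M529": "L21", "S145": "L21"}
# Child -> parent links for a tiny slice of the tree.
PARENT = {"DF13": "L21", "DF49": "DF13", "DF23": "DF49"}
CHILDREN = {}
for child, parent in PARENT.items():
    CHILDREN.setdefault(parent, []).append(child)

def classify(results: dict) -> str:
    """results: SNP label -> '+' or '-' as reported."""
    # Pass 1: fold equivalents onto lead SNPs.
    folded = {EQUIVALENTS.get(snp, snp): call for snp, call in results.items()}

    # Pass 2: the youngest (deepest) positive is the terminal SNP.
    def depth(snp):
        d = 0
        while snp in PARENT:
            snp, d = PARENT[snp], d + 1
        return d
    positives = [s for s, c in folded.items() if c == "+"]
    terminal = max(positives, key=depth)

    # Pass 3: if all known children tested negative, flag as paragroup (*).
    kids = CHILDREN.get(terminal, [])
    if kids and all(folded.get(k) == "-" for k in kids):
        return terminal + "*"
    return terminal

print(classify({"M529": "+", "DF13": "+", "DF49": "-"}))  # DF13*
```

A kit positive for M529 and DF13 but negative for DF49 folds M529 into L21, picks DF13 as terminal, and gets the asterisk because its only known child tested negative.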

I would be truly discouraged if I was a newbie. It's all of the volunteers like yourselves that save the day.

RobertCasey
09-21-2013, 10:01 PM
In the MDKA/Identifier cross-reference index that I don't publish I include the project name and date that I copied the data, but I don't update it every time I have an update (upgrade or advanced test result comes in). I really should have a change log file so I could repeat the update transactions if needed or trace them. This is also complicated in that some individuals' records will have input from multiple sources, i.e. FTDNA Y DNA SNP report, FTDNA WTY, 23andMe, and now FGS and Chromo2.

I think we have both been overly concerned about the possible updates to the derived fields - surname and origins. This manual analysis takes a tremendous amount of time and I have been wondering how much effort is worth the time to review. For a while, I thought that Paternal Ancestor Name could be unique to each project vs. unique to each FTDNA ID. So, I pulled the submissions that had joined the most projects to get an idea of how often these fields are updated over time. I found 26 submissions that belonged to over 20 projects. I found no variations between projects for the Paternal Ancestor Name field. Three had changes to the surname over the last six months (two unknown to known and one spelling change) and one other changed the field without any effect on the surname. So that is 4 out of 26 (15%) that changed and 3 out of 26 (12%) where the change was meaningful. Of course, any individual that joins many projects may be more likely to update these fields. The time spent to catch these changes could be spent on keeping the database more up to date or adding yet more smaller and obscure geographical / ethnic projects. Of course, neither of us has pulled all the projects as there are so many (that is yet another topic: breadth of coverage vs. keeping the database up to date).

Here is an interesting fact that I extracted regarding the possible L21 submissions that have joined more than 20 projects:

Over 50 include - 128, 114, 64 and 50 (I know we both recognize these FTDNA_ids)
30 to 49 - 36, 34, 32, 32, 32, 32, 31 and 31
25 to 29 - 29, 28, 27, 27, 26 and 25
20 to 24 - 24, 23, 22, 21, 21, 20 and 20

Of course, this number grows over time as these people join yet more projects and as we include more projects. Project formations seem to be very scant these days except for smaller haplogroup projects which rarely have many unique FTDNA IDs. So the project list is not changing very much. However, I see no end to adding new L21 submissions via adding more projects. When adding new projects (smaller ones), around 20 to 30 % of the L21 submissions are new to our databases - for L21 geographic projects like Poland & Hungary - the L21 discoveries are falling off since L21 content is lower as the geography gets less L21 oriented. But even geographic projects continue to add new possible L21 submissions but at a much lower rate than smaller projects.

R.Rocca
09-25-2013, 11:17 AM
Some very good news on the SNP testing front. From Bennett Greenspan...

"I want to assure you that we have no plans to curtail, rather to enlarge, the number of single Y SNP's we offer. The approx. 120 that Thomas had in various stages will be vetted and launched over the next 3-4 weeks."

GTC
09-25-2013, 02:22 PM
Some very good news on the SNP testing front. From Bennett Greenspan...

"I want to assure you that we have no plans to curtail, rather to enlarge, the number of single Y SNP's we offer. The approx. 120 that Thomas had in various stages will be vetted and launched over the next 3-4 weeks."

Thanks for that welcome news.

RobertCasey
09-25-2013, 02:50 PM
Some very good news on the SNP testing front. From Bennett Greenspan...

"I want to assure you that we have no plans to curtail, rather to enlarge, the number of single Y SNP's we offer. The approx. 120 that Thomas had in various stages will be vetted and launched over the next 3-4 weeks."

When I brought this issue up to Full Genomes, they replied that they had a source to test individual YSNPs. So if FTDNA falls way behind or refuses to add private YSNPs AND Full Genomes does not come up with a good alternative, the viability of full Y-chromosome tests will greatly slow down. Since I am primarily interested in private SNPs vs. ISOGG-qualifying SNPs, I would not test my DF27* submission unless I can find a way to test these private SNPs. My current FG test is the only L226 test for FG. If I cannot find any way to test for SNPs under L226 (to split up this large grouping of submissions), full Y-chromosome tests mean much less to me and other genealogists (who subsidize the vast majority of testing). Finding pre-L226 mutations could be very useful to other clusters - but that does not solve my genealogical problem of getting more YSNPs under L226. Several Nat Geo tests have not discovered any private SNPs under L226 to date. Maybe the Chromo2 test will have better luck. I am sure that the FG test will find some SNPs that descend from L226.

GTC
09-25-2013, 02:59 PM
When I brought this issue up to Full Genomes, they replied that they had a source to test individual YSNPs. So if FTDNA falls way behind or refuses to add private YSNPs AND Full Genomes does not come up with a good alternative, the viability of full Y-chromosome tests will greatly slow down.

Good point. Greenspan's statement applies to work that Thomas had in hand at the time he left. It does not address SNP discovery that came from Thomas' WTY work.

However, I guess that WTY would now be considered small beer compared to the recent offerings of other companies. I wonder what, if anything, FTDNA's response is going to be to that competition in the SNP-hunting arena.

TigerMW
09-25-2013, 03:31 PM
... Greenspan's statement applies to work that Thomas had in hand at the time he left. It does not address SNP discovery that came from Thomas' WTY work...

That is NOT my understanding. I've asked similar questions and the feedback I'm getting is that FTDNA will continue this work on individual SNPs. Obviously, it will have to work differently with their personnel changes.

On the other hand, I do expect offering changes related to things in the WTY realm. It just makes sense. There are less expensive methods out there of doing FG type work. I would be surprised if we don't have a couple of good competitors, including FTDNA... however, I still expect single SNP offerings to continue.

I'm more worried about pricing on "one at a time" a la carte orders than anything. They raised the price while removing the one-time transfer fee early this year, but prices are never guaranteed.

R.Rocca
09-25-2013, 03:46 PM
That is NOT my understanding. I've asked similar questions and the feedback I'm getting is that FTDNA will continue on this work. I expect offering changes. It just makes sense. There are less expensive methods out there of doing FG type work. I would be surprised if we don't have a couple of good competitors, including FTDNA... however, I still expect single SNP offerings to continue.

I'm more worried about pricing on one at a time than anything. They raised the price and removed the one time transfer fee early this year but prices are never guaranteed.

Correct, and the communication went on to say that FTDNA would set up a mechanism for accepting future SNPs as well.

MitchellSince1893
09-25-2013, 04:45 PM
Hoping Z150 makes the cut.

lgmayka
09-26-2013, 02:56 AM
Some very good news on the SNP testing front. From Bennett Greenspan...

"I want to assure you that we have no plans to curtail, rather to enlarge, the number of single Y SNP's we offer. The approx. 120 that Thomas had in various stages will be vetted and launched over the next 3-4 weeks."
That's great news. But in the meantime, without Ymap, how can we add SNPs to the ISOGG haplotree without their numerical locations? Did someone make a full copy of all the information on Ymap before it disappeared?

David
09-26-2013, 03:06 AM
That's great news. But in the meantime, without Ymap, how can we add SNPs to the ISOGG haplotree without their numerical locations? Did someone make a full copy of all the information on Ymap before it disappeared?

I have all the data. You can leave the Y-pos info off of what you send Alice, and I can supply it.

--david

lgmayka
09-26-2013, 02:24 PM
You can leave the Y-pos info off of what you send Alice and I can supply.
Thank you!

TigerMW
11-22-2013, 04:48 PM
Does anyone have any new thoughts on this? FGC and Chromo 2 results are now coming in. We'll have a flood of Big Y on top of that in a couple of months so we need to try to get something done with this. Mark J is also very interested.

To me, it would just be a shame if every little subclade does this in its own unique way and with different levels of discipline.

We may need to recognize, which is what I think you are getting at, that there may need to be both research databases and user(consumer) databases. My thoughts are primarily focused on the user databases, a "Ysearch for SNPs", so to speak.

I am suggesting that we have a user database format that would allow for spin-off types of analyses and research, albeit probably not as heavy duty, and more focused on generic analysis formats like .csv files or spreadsheets.

Generally, you want the data items to be described as data themselves so configurations can change easily without actual logic/programming changes. Hence, a well thought out data design is critical. I don't think the below is the ultimate answer, but I just want to spur thinking by providing the following hypothetical design for saving SNP data in a common format. Sorry, I'm sure some of the audience will drop off on this, but I don't have time to write manuals, programs, etc. to explain it properly and work out details.

Identifier:
Identifying name:
Source:
Source Date:
Configuration Version:
Record_Type: 1=Ancient Hg, 2=Master Hg Record, 3=Individual data Record
Parent Haplogroup Name: R1b-DF49
SNP 1 label: DF23
SNP 1 allele:
SNP 2 label:
SNP 2 allele:
thru
SNP N label:
SNP N allele:

The "Master" or "Ancient" record types would be artificial records that are identified by the haplogroup label and would contain the ancestral alleles for the SNPs to be recorded within the haplogroup's data set. These are important as they would be needed for determining derived vs. ancestral states in the individual data records. There would be multiple versions of these master records as the phylogeny is updated. Accordingly, the individual data records would have to point to the appropriate SNP configuration version as well as the appropriate haplogroup.

The "Individual" data records would have the kit # as the identifier and probably the MDKA surname as the Identifying name along with the actual results.

There will need to be an equivalent SNP transformation table as well, which would have record types of either equivalent or identical to go with the SNP label and whatever you are transforming it into, be it a haplogroup name or what I have called a "lead" SNP for a branch.

I hate to say this, but I think the haplogroup names should be the long haplogroup names since there is not necessarily just one SNP that identifies a phylogenetic branch and sometimes we don't know how many occurrences of an SNP there are. We never really know.

If the data is designed and stored in a standard, accessible format then there could be multiple ways to present results by the general users through spreadsheets, network diagramming type programs, etc. I'm a very strong advocate of publishing the data in an intelligent format, not just in fixed styles like .pdf's that can only be used for very specific purposes. .pdf's are okay; I'm just saying we need the publishing to output data formats too, just like FTDNA project screens have export features and are not pure pixel graphics.

Since the phylogeny will change as it is discovered I don't think we can avoid the idea of SNP configuration versions. I think the versions can be maintained at the haplogroup level, not dependent on the whole Y tree. It is a pain and may require some kind of transform program/process that runs through the database and recopies individual records into the new configuration formats, probably having to leave blanks for any new SNPs added. The users or data sources would have to add the SNP results. ... um, probably need some kind of adapter program/process for each new test offering to port data into the current configuration.

RobertCasey
11-22-2013, 08:06 PM
I think we need to discuss both the organization of the backend databases and the published output reports (GEDCOM-type reports for exchange, and publication to the general public via spreadsheets). First, any public publications really must be in xls format vs. PDF formats or HTML-type formats (which require even more parsing, which is just too labor-intensive). Also, the backend database needs to have two forms: 1) a source database (where you load up all the data with minimal processing and parsing); 2) production databases where duplicate information is removed and derived data can be added. I cannot tell you how many times I went back to the source tables to determine what was going on. For instance - FTDNA ID 213555 apparently was kicked out of or left the Carroll project but remained in the McGee project. FTDNA 208188 apparently was kicked out of or left the Allan project (or went private) and is no longer available in any project that I could find. Without source files, it would be impossible to determine what happened.

We also must allow for duplicate data across different sources, such as L21 test results from Full Genomes files as well as FTDNA WTY tests, FTDNA YSNP reports, Nat Geo raw files, Chromo2 raw files and Big Y source files (if not fully loaded into YSNP reports like Nat Geo raw files). We need to address how to handle errors in these sources. A simple example is the polarity switch of L371 results from Nat Geo uploads. We need to keep an errata correction file for such issues and correct the data (we should not just eliminate L371 until FTDNA fixes the problem). We also need to establish a common acceptable list of unstable YSNPs and defend this list as being filtered out for now. This excludes all ** and *** YSNPs in Full Genomes tests as well as the YSNPs David Reynolds advises filtering out. With so many high quality YSNPs to analyze, we need to be pretty brutal on YSNPs that may have issues. We capture these in our source files but purge them from our production and public files until research indicates they should be added back in.
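A sketch of the errata-file-plus-filter idea might look like this. The polarity-flip rule and the unstable list below are placeholders to show the mechanism, not real corrections.

```python
def clean(source, snp, call, errata, unstable):
    """Return the corrected call, or None if the SNP should be purged
    from the production files (but kept in the source files)."""
    if snp in unstable:
        return None
    fix = errata.get((source, snp))
    return fix(call) if fix else call

# Hypothetical errata entry: flip polarity for one SNP from one source.
errata = {("Nat Geo", "L371"): lambda c: {"+": "-", "-": "+"}[c]}
# Hypothetical unstable-SNP filter list.
unstable = {"UNSTABLE_EXAMPLE"}

print(clean("Nat Geo", "L371", "-", errata, unstable))            # +
print(clean("FTDNA", "L21", "+", errata, unstable))               # +
print(clean("FTDNA", "UNSTABLE_EXAMPLE", "+", errata, unstable))  # None
```

Keying the errata on (source, SNP) matters: the same SNP reported correctly by another source passes through untouched.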

I really do not like the proposed SNP label and SNP result format. It is only the equivalent of the list of YSNPs parsed into two pieces for each YSNP. This kind of organization will require a lot of parsing and will reduce our ability to analyze the data.

The purest form (with minimal reduction in analysis) is: ftdna_id, source, source_date, ysnp label, ysnp result (one entry for every YSNP tested). My database experts all agree this is the best approach (though it is hard to get your head around this format). Again, the source file captures all YSNPs and then the production file is normalized (L21 for the backend, with all the ancient positive results from Nat Geo removed). The other potential format would be one ftdna_id per entry but all YSNPs listed (which requires constant updates of fields and a lot of empty fields). This would have to be normalized to L21 or even a lower level.
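The "one row per tested YSNP" layout might look like this in practice, with the one-kit-per-row alternative derived from it as a view. Sample rows are invented.

```python
import csv, io

# Long/narrow form: ftdna_id, source, source_date, ysnp_label, ysnp_result.
rows = [
    ("12345", "FTDNA SNP report", "2013-11-22", "L21", "+"),
    ("12345", "Geno 2.0", "2013-06-01", "DF13", "+"),
    ("67890", "Full Genomes", "2013-10-15", "L226", "+"),
]

# Write it out as CSV, the exchange format everyone can parse.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ftdna_id", "source", "source_date", "ysnp_label", "ysnp_result"])
writer.writerows(rows)

# The "wide" one-kit-per-row form is then just a derived pivot,
# so it never needs schema changes when new SNPs appear.
wide = {}
for kit, _src, _date, snp, result in rows:
    wide.setdefault(kit, {})[snp] = result
print(wide["12345"])  # {'L21': '+', 'DF13': '+'}
```

Because the narrow form carries source and date on every SNP row, duplicate results from different test companies can coexist and be reconciled later, which the wide form cannot do cleanly.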

Rory Cain
01-14-2014, 12:04 AM
When I brought this issue up to Full Genomes, they replied that they had a source to test individual YSNPs. So if FTDNA falls way behind or refuses to add private YSNPs AND Full Genomes does not come up with a good alternative, full Y-chromosome tests viability will greatly slow down.

I received a similar reply from FGC but they seem to have done nothing further. Yseq, run by Thomas Krahn, has now filled this gap.

RobertCasey
01-14-2014, 06:27 PM
I received a similar reply from FGC but they seem to have done nothing further. Yseq, run by Thomas Krahn, has now filled this gap.

YSEQ is also ready to offer bulk discounts for full plate testing. They stated that they could test 96 YSNPs at one time for $499 (at $5.20 per YSNP). I put out a post on creating a Z253 haplogroup test that includes the 96 downstream YSNPs of Z253 (from the first two Z253 Full Genomes tests, of L226 and PF825). For any L226 testing candidate, I will pay $200 toward testing these 96 YSNPs if they pay the other $300. There should be around 24 YSNPs downstream of L226.

TigerMW
01-14-2014, 07:31 PM
I don't have much new insight to add. I'll just summarize what I'm thinking.

I think this will come down to intensive comparative focus of SNPs down at the fairly youthful subclade level, for example L226 and below. In my case, it is probably not L513 (about 2000-2500 years old) but probably L706.2/L705.2, about a 1000 years old. There is some, may be a lot of tree build-out to do between L513 and its major branches, like L705.2, L193, CTS3087 and the undiscovered peers. I will work hard on those layers too but my priority has to be the last 1000 years. I've got some thinking to do about L159.2 and Z220 as well but, but the skids aren't as well greased in those areas as they are in L706.2 and L513 for me.

The discovery of the tree from 1000 year on back to whenever is largely an academic exercise, in my opinion. I think it is still important and I do care about deep ancestry, history, prehistory, etc., but my guess is most of us will want to spend our hobbyist money on the last 1000 years or so. This is where deep ancestry and genetic genealogy will connect and actually do a genealogist some good.

I hate to say this because I know what it sounds like. It's like saying every man for himself. I don't like that attitude. All of this data is only useful if shared, compared and investigated. This will go on and I will do my part to present useful (hopefully) information for L21 people across the board.

However, it is impossible for someone, i.e. David Reynolds, to do the L21 comparative subclade analysis he did for Geno 2 and WTY and now apply that to hundreds of Big Y orders and FGC orders to go with Chromo 2 and probably a Geno 3. I like that YSEQ is there to do one at time but as a project administrator I shudder to think of tracking all of these test results one a time - manually. This is why I recommended to Thomas Krahn that he create a "opt-in" easy checkmark and a downloadable public display report like a Y DNA SNP project FTDNA report. We have to at least have a chance to automatically collect and clean data. I'm doing this for Chromo 2 now but it is way too manual (and error-prone) and I don't see the mechanism for automation since there is no "project" reporting system. It's also surprisingly hard to get people to share their data. You'd think anyone who was spending a lot of money would want to share results. Results are only good if compared and you have to share before you can compare.

Of course, one option is to pay other people do analysis. However, the consultants may not have access to all of the data you would down at the small subclade and surname project levels. Much of the effort is in recruiting, cajoling and educating your target audience anyway. A one time analysis will soon fall out of date as new data comes in. Also, at least for me, this is a hobby, so I won't pay others for consulting when the whole thing is actually simple. I would not charge for help given, either. If I have time and can help I will, but ultimately the trade-off is where to spend time and money. I submit the incentives for the non-academic people will be to focus on the last 1000 years.

Every major subclade needs an advocate to ensure testing, comparison and analysis is getting done at the mid-level branching. Those advocates won't have time to go into all of the last 1,000 years of branching, though. The last 1,000 years is where it's every surname/super-family/sub-subclade for him/herself.

The net for me is that most of my time, money and in-depth analysis of raw results will go to the last 1,000 years: subgroup B2, or L706.2, of L513.

I'll do the raw analysis for L513 down to the last thousand years since I'm the haplogroup project administrator; it also has clues as to the origin of L706.2. However, I won't spend the time recruiting, and money on testing, for the L513xL706.2 folks as I would for L706.2. I'll help but not lead, other than on the mid-level branching, and just offer guidance.

Back up at the L21>DF13 etc. branching levels, I'll keep the spreadsheet system stitched together for the intermediate timeframe of SNPs and integrate STR and SNP information. I won't get into any raw SNP results analysis, though; I just don't have time. Presently, via spreadsheet, I can handle about 500 unique branches within L21. That's not a count of SNPs; that's a count of branches marked by one or more (equivalent) SNPs. I will be able to handle more, maybe a couple of thousand branches, in the future. I'm not sure about the 3,500 that Robert estimated.

As for the last several hundred years of the tree, there is almost no end in sight. In a way, this dictates my evolving perspective of what's public versus private... where the capacity to broadly support the size of the tree runs out of gas, the rest is deemed private.

I'll also try to keep the pipeline stoked as much as I can at the gigantic R1b project level to encourage people to test out STRs and do some kind of package testing to get into the right subgroups of R1b.

RobertCasey
01-14-2014, 10:16 PM
I think this will come down to an intensive comparative focus on SNPs down at the fairly youthful subclade level, for example L226 and below. In my case, it is probably not L513 (about 2,000-2,500 years old) but probably L706.2/L705.2, about 1,000 years old. There is some, maybe a lot, of tree build-out to do between L513 and its major branches, like L705.2, L193, CTS3087 and the undiscovered peers. I will work hard on those layers too, but my priority has to be the last 1,000 years. I've got some thinking to do about L159.2 and Z220 as well, but the skids aren't as well greased in those areas as they are in L706.2 and L513 for me.

The discovery of the tree from 1,000 years ago on back to whenever is largely an academic exercise, in my opinion. I think it is still important, and I do care about deep ancestry, history, prehistory, etc., but my guess is most of us will want to spend our hobbyist money on the last 1,000 years or so. This is where deep ancestry and genetic genealogy will connect and actually do a genealogist some good.

I think there will be a major need for a new project type, quite small, for those YSNPs that are truly post-surname in creation (around 1,000 years ago, or a little earlier for clan surnames, and probably a little more recent for English surnames). Haplogroup projects will lose interest or will not have the time/skills to duplicate your descendant chart for all the truly private YSNPs. Hopefully, some of the surname admins will step up and realize how important YSNPs are becoming to their research. However, larger surnames will obviously struggle with this. My Casey project is 50% L226, which matches clan lore. So, for clan-based surnames with large clusters, there is some hope. L226 is 1,400 to 1,500 years old according to our TMRCA experts and has over 20 surname clusters with five to ten submissions that match. So, there are a lot more "public" YSNPs to be discovered in the near term.

But my interest is moving towards my L226 research for my Casey lines (and getting back to my DF27+/Z196- mother's Brooks line, which is very active again). For these projects, I will still need broad Z253/DF27 data to get to these private YSNPs, and it will take at least another year or so to truly sort out some of the private YSNPs for analysis. My original interest for getting involved was to prove that genetics could sort out the 40 or so different Casey men in western South Carolina around 1800. 111 markers is really not doing much in that respect, and it does not look like any 500-YSTR test has much interest (or is really needed). Getting these forty Casey lines to test Big Y at $600 each is just not going to happen very fast either. Also, L555 / Irwin research is stuck in the same place. It will be hard to get multiple L555 / Irwin testers to put down $600 per test to sort out the numerous Irwin branches. However, each test will reveal five to ten branches, so there is some hope. We really need a $100 test that includes 200 to 500 YSNPs that are L226- or L705.2-related (or 2,000 to 5,000 L513 or Z253 YSNPs for $200). Waiting for all the Big Y and Full Genomes tests to trickle in, then finding differences between them, will take too long.
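
The difference-finding step itself is at least mechanically simple. Here is a toy sketch of how two NGS testers' positive-SNP sets split into a shared backbone and private candidate branches; the SNP names and sets below are invented, not real Big Y results.

```python
def candidate_branches(calls_a, calls_b):
    """Split two testers' positive-SNP sets into the shared backbone
    and each man's private candidates. Private SNPs stay 'candidates'
    until further testers show which ones mark stable branches."""
    shared = calls_a & calls_b
    return shared, calls_a - calls_b, calls_b - calls_a

# Toy data: SNP names are invented, not real Big Y calls
kit_a = {"L21", "DF13", "L513", "FGC0001", "FGC0002"}
kit_b = {"L21", "DF13", "L513", "BY0001"}
shared, private_a, private_b = candidate_branches(kit_a, kit_b)
print(sorted(shared))     # backbone both testers share
print(sorted(private_a))  # candidate branch markers private to kit A
print(sorted(private_b))  # candidate branch markers private to kit B
```

Each new tester moves some SNPs from "private" to "shared", which is exactly the five-to-ten-branches-per-test effect described above; the bottleneck is the trickle of tests, not the comparison.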

I have always viewed the older-than-1,500-years research as a required investment to get to the younger-than-1,500-years YSNP discovery, which will have more direct impact on genealogical research. What is very frustrating to me is that I already have three or four YSNPs discovered that could split up my Casey surname cluster (which just sort of appeared around 300 years ago). But I have to sort out the branches of L226 first (which will be fun but will not solve my genealogical objectives for my Casey cluster). Waiting for the five L226 Big Y tests is very hard on my patience / timeline for making progress.

The database consolidation issue will be coming to a head very soon. The L21 project cannot be expected to track and compile thousands of downstream YSNPs. The Z253 and L513 project admins will have to step up, or progress will lag. I do not think the FTDNA database will step up to the challenge either (even for the Big Y data). They still do not include relevant negative results in Nat Geo uploads, and they do not filter out the numerous useless older YSNPs, which just confuse people. Also remember that WTY results were never loaded into the YSNP reports. Expecting FTDNA to properly handle the even bigger demands of the Big Y results is probably unrealistic (but one hopes). Plus, FTDNA will have "haplogroup deep clade" tests sometime in the later part of this year to track as well. If FTDNA handles this data better, more progress will be made and more orders will flow, but IT support is not cheap and will drive up testing costs. I am not sure how we are going to track all this data (both the quantity and the sources). I do not think people are willing to pay for IT support directly, as the 23andMe experience shows.

We really need to draft a standard for database upload interfaces, but that is some real "work". We need to lobby very hard or volunteer some IT support to address these issues. I really wish FTDNA would offer premium IT services at a yearly cost that has to be self-supporting. The heavy-weight users would sign up very quickly, but it needs to appeal to a pretty large base of testers to really be justified, and those testers are less likely to pay for this service. Same problems as in the past; they are just getting worse with a lot more data and more data sources.
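
As a strawman for what such an upload standard could look like, here is a tiny validator for a hypothetical JSON-lines record. The field names and the allowed call values are my own invention, purely illustrative, not any vendor's format.

```python
import json

# A strawman minimal record for a shared upload format. The field
# names and allowed call values are invented here, not any vendor's.
REQUIRED = {"kit", "snp", "call"}
CALLS = {"+", "-", "?"}  # positive, tested negative, no-call

def validate_record(line):
    """Parse one JSON-lines record and reject anything that would
    corrupt a consolidated database (missing fields, unknown calls)."""
    rec = json.loads(line)
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError("missing fields: %s" % sorted(missing))
    if rec["call"] not in CALLS:
        raise ValueError("unknown call value: %r" % rec["call"])
    return rec

rec = validate_record('{"kit": "N12345", "snp": "L513", "call": "+"}')
```

Even a schema this small would settle the two points argued above: every record says which kit it came from, and negatives are first-class data rather than an afterthought.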

Heber
01-14-2014, 10:52 PM
Has anyone defined somewhere a SNP discovery and matching process which can be automated, or at least assisted, using data analytics tools? My focus will be on DF21 and surname matching, to see if I can detect patterns. I was hoping that Dave Reynolds could help out with the heavy lifting, as he is also DF21 and has performed wonders to date.
Surname matching should become more relevant as we get into the last 1,000 years.
http://www.pinterest.com/gerardcorcoran/r1b-l21-df13-df21/
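
As one possible starting point for the surname-matching step, here is a sketch that counts surnames per terminal SNP and flags clusters. The tester list below is toy data, not real project results, and the cluster threshold is arbitrary.

```python
from collections import defaultdict

def surname_clusters(testers, min_size=2):
    """Count testers per (terminal SNP, surname) pair and keep only
    clusters of at least `min_size`; a surname cluster under one SNP
    hints at a post-surname-era branch worth investigating."""
    counts = defaultdict(lambda: defaultdict(int))
    for surname, snp in testers:
        counts[snp][surname.lower()] += 1
    return {snp: {s: n for s, n in names.items() if n >= min_size}
            for snp, names in counts.items()}

# Toy tester list of (surname, terminal SNP) pairs; not real data
clusters = surname_clusters([
    ("Casey", "L226"), ("Casey", "L226"), ("Smith", "L226"),
    ("Irwin", "L555"), ("Irwin", "L555"),
])
print(clusters)
```

Something like this only becomes interesting once terminal SNPs are reliably recorded per tester, which loops back to the data-collection problem discussed above.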

TigerMW
01-14-2014, 10:55 PM
I think there will be a major need for a new project type, quite small, for those YSNPs that are truly post-surname in creation (around 1,000 years ago, or a little earlier for clan surnames, and probably a little more recent for English surnames). Haplogroup projects will lose interest or will not have the time/skills to duplicate your descendant chart for all the truly private YSNPs. Hopefully, some of the surname admins will step up and realize how important YSNPs are becoming to their research. However, larger surnames will obviously struggle with this.
Agreed. I'm not sure there is a correct term for this new project level. It's not really a surname project, as it likely includes more than one surname and only fractions of those surnames. It's kind of a small/youthful haplogroup project, but people may not know whether or not they qualify initially. The unwashed, R1b1a2-predicted masses can't be expected to find all of these small projects or test one at a time for qualifying SNPs. Moderately priced package tests will provide qualifying information, hopefully, by the time we get the coverage needed out of a Geno 4 or Chromo 4 or an FGC "entry" package. STRs still look to play an important role for some time. The 11-13, now L513, project started out as a super-family project, but we discovered it was much bigger and older than we thought. I originally tried to call it a "clan" project, but that blew up when a couple of our continental participants complained we (I) were too Isles-centric.

Getting these forty Casey lines to test Big Y at $600 each is just not going to happen very fast either. Also, L555 / Irwin research is stuck in the same place. It will be hard to get multiple L555 / Irwin testers to put down $600 per test to sort out the numerous Irwin branches. However, each test will reveal five to ten branches, so there is some hope. We really need a $100 test that includes 200 to 500 YSNPs that are L226- or L705.2-related (or 2,000 to 5,000 L513 or Z253 YSNPs for $200). Waiting for all the Big Y and Full Genomes tests to trickle in, then finding differences between them, will take too long.
Agreed. The prices for these NGS tests are still way too expensive, even at Big Y's introductory $495 price. I think we need those to get to $199. The actual pricing today is $700 to $1,200, which is the reason I pushed so hard on Big Y. I knew it was the best chance we had to put a big dent in the L21 tree. It was a knock-down drag-out for my L706.2/B2 group. I have identified 131 suspects but can only communicate (and get a response) with about 40-50 of them. Ultimately, there are really only about 60 who I think want to investigate much further, and probably only 20 who are serious. Out of those 20, we had quite a time, but finally came up with three fairly diverse 111-STR haplotypes to test with Big Y. One of them involved pooling of funds. I think three will give us a good start, but we'll probably look back and say we wish we had one per surname when all of the quality and one-at-a-time SNP-pursuit decisions kick in.


The database consolidation issue will be coming to a head very soon. The L21 project cannot be expected to track and compile thousands of downstream YSNPs. The Z253 and L513 project admins will have to step up, or progress will lag. I do not think the FTDNA database will step up to the challenge either (even for the Big Y data).
Yes, I'm concerned about that too. As you note, it's important to have ancestral reference models at low enough levels to filter out irrelevant SNPs. We also need the negative results kept distinct from the no-call results. I currently don't see how all of that will be easily displayed. I feel like I should be spending time with them to help them see what they need to produce (IT-wise/reporting): something we can work with without too much pain.
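
A rough sketch of the filtering and display logic I have in mind, assuming a hypothetical ancestral reference model; the SNP names and panel below are just examples, not a real FTDNA report.

```python
# Hypothetical ancestral reference model for a subclade: SNPs every
# member is expected to be positive for, hence "old news" to hide.
REFERENCE_MODEL = {"M269", "L11", "P312", "L21", "DF13"}

def classify(calls, panel):
    """Sort a tester's calls into display buckets. `calls` maps
    SNP -> '+' or '-'; a panel SNP absent from `calls` is a no-call,
    which must not be conflated with a tested negative."""
    report = {"filtered": [], "derived": [], "negative": [], "no_call": []}
    for snp in sorted(panel):
        call = calls.get(snp)
        if call == "+" and snp in REFERENCE_MODEL:
            report["filtered"].append(snp)   # hide by default
        elif call == "+":
            report["derived"].append(snp)    # the interesting positives
        elif call == "-":
            report["negative"].append(snp)   # relevant negatives: keep
        else:
            report["no_call"].append(snp)
    return report

panel = REFERENCE_MODEL | {"L513", "CTS3087", "L705.2"}
calls = {"M269": "+", "L11": "+", "P312": "+", "L21": "+",
         "DF13": "+", "L513": "+", "CTS3087": "-"}
report = classify(calls, panel)
```

The whole argument about display comes down to those four buckets: hide the backbone, surface the new positives, and never let a no-call masquerade as a tested negative.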


Same problems as in the past; they are just getting worse with a lot more data and more data sources.
Yes, it's a bit daunting, but I guess the optimistic way to look at it is that at least we'll have more data. I guess I'm going Biblical here, but it's like finding the needle(s) in the haystack. The good news is we are getting much of the haystack; the work is that we have to separate the wheat from the chaff.