Extending a tool to U106



Dave-V
06-02-2017, 04:06 PM
I'm looking to extend my SAPP tool (http://www.jdvtools.com/SAPP) to other subclades outside of L21 and I wanted to try out U106 if anyone is interested in providing or pointing me to data and looking over the results.

As a quick overview, the tool generates STR mutation history trees restricted by SNP and (where known) genealogy information. It strives for maximum parsimony but it's not guaranteed to find it (and there's some debate about whether maximum parsimony is best anyway). An output example can be seen here (https://drive.google.com/file/d/0B1oWf7A5py4AU3dkczZCT3dldmM/view?usp=sharing).

I realize U106 already has a LOT of analysis work completed, so this may not be useful there. Still, one of the things I'm testing is where this type of analysis is most useful, so that input would be valuable either way.

Originally SAPP worked only for L21 because of two internal databases: it has the internal SNP tree to help it deconflict the SNPs given as input (so if you said a kit was positive for one SNP it would automatically know it would be negative for SNPs on other branches), and it contains allele frequencies for L21 for use in TMRCA calculations (which use Ken Nordvedt's adaptation of Bruce Walsh's algorithm).
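
A minimal sketch of the deconfliction idea described above, not SAPP's actual code. The toy tree and its SNP names (ROOT, A, B, A1, ...) are hypothetical placeholders, and the rule that downstream SNPs stay unknown is an assumption.

```python
# Hypothetical toy tree: child SNP -> parent SNP (names are illustrative only).
PARENT = {
    "A": "ROOT",
    "B": "ROOT",
    "A1": "A",
    "A2": "A",
    "B1": "B",
}


def ancestors(snp, parent=PARENT):
    """Return the chain of SNPs above `snp`, nearest first."""
    chain = []
    while snp in parent:
        snp = parent[snp]
        chain.append(snp)
    return chain


def deconflict(positive_snp, parent=PARENT):
    """Infer implied results from a single positive SNP.

    Everything upstream is positive. Any SNP that is neither the tested SNP,
    upstream of it, nor downstream of it sits on another branch and is
    inferred negative. Downstream SNPs are left unknown.
    """
    all_snps = set(parent) | set(parent.values())
    positives = {positive_snp, *ancestors(positive_snp, parent)}
    downstream = {s for s in all_snps if positive_snp in ancestors(s, parent)}
    negatives = all_snps - positives - downstream
    return positives, negatives


pos, neg = deconflict("A")
print(sorted(pos), sorted(neg))
```

With this toy tree, a kit positive for A is inferred positive for ROOT and negative for B and B1, while A1 and A2 remain untested.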

I have since added the L2 and U106 SNP trees. The TMRCA estimates are not that granular anyway, so I don't think using the L21 allele frequencies will make much difference there (but I'm also interested to find out).

What I need is STR data (67-111 markers would be best) and terminal SNPs for a well-defined subgroup of 20-150 kits. I could look at the FTDNA U106 project DNA tables but thought someone here could direct me to a good representative group to run. Something where the kits fall into known subgroups related within the past 2000 years would be great. And then I'd need someone to look over the results and tell me how right or wrong it was.

FYI the U106 SNP tree loaded into SAPP is this one - it produces both a "one-page" condensed view (https://drive.google.com/file/d/0B1oWf7A5py4ARDNUOXdPYzNTdHc/view?usp=sharing) of the SNP tree (shown below also) as well as a standard box chart (https://drive.google.com/file/d/0B1oWf7A5py4ANkJBdVN4VDIxY2M/view?usp=sharing) report (they're not on timescales, and ages shown are taken from YFull just for reference).

http://www.anthrogenica.com/attachment.php?attachmentid=16530&stc=1

Wing Genealogist
06-02-2017, 09:33 PM
Dave, first of all, thanks so much for your kind offer.

You should be hearing soon from Iain McDonald, one of the co-admins for the U106 Project. I am also a member of the U106 admin team and forwarded this message to the team. Iain has been busy of late, and he reported that your email to him is one of over 20 messages he has accumulated since he was last able to dedicate some time to the project.

Please note: We have a much more comprehensive tree for U106 at: https://app.box.com/s/afqsrrnvv2d51msqcz2o but this tree is in a much different format than your work. On the second sheet of this tree "Equivalent SNPs and Details" we do try to indicate some of the STR indicators for some of the clades.

Dave-V
06-05-2017, 10:00 PM
Thanks for that. In the meantime I took the liberty of replacing my former U106 tree for SAPP with the more comprehensive one you referenced, so I can now print it (although now the condensed tree doesn't fit on a page anymore for all of U106!)

The tool obviously has more to it than just printing SNP trees but the full "comprehensive" U106 tree is here: https://drive.google.com/file/d/0B1oWf7A5py4AWkhYemlkV2diZms/view?usp=sharing although I'm not sure why you'd ever want one that size! :) Subtrees print nicely though. I used the same colors for the SNP boxes that were shown in the spreadsheet you linked.

Incidentally in loading the larger U106 tree I noticed there are three SNPs under U106 - S933, Z17640, and BY118 - that are also in the L21 tree. I assume those should be referenced as .1 / .2 in the two subclades for clarity.
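
A small sketch of how such name collisions could be found and labeled, assuming each tree is available as a plain list of SNP names. The two "ONLY" names are hypothetical placeholders, and which subclade gets .1 versus .2 is an arbitrary choice here.

```python
def shared_snp_names(tree_a, tree_b):
    """Return SNP names that appear in both trees, sorted for stable output."""
    return sorted(set(tree_a) & set(tree_b))


def disambiguate(name, index):
    """Append a .1 / .2 style suffix to a recurrent SNP name."""
    return f"{name}.{index}"


# The three shared names come from the post above; the others are placeholders.
u106_names = ["S933", "Z17640", "BY118", "U106_ONLY_EXAMPLE"]
l21_names = ["BY118", "L21_ONLY_EXAMPLE", "S933", "Z17640"]

shared = shared_snp_names(u106_names, l21_names)
labels = {name: (disambiguate(name, 1), disambiguate(name, 2)) for name in shared}
print(labels)
```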

Wing Genealogist
06-06-2017, 12:04 PM
Dave,

I am at a loss as to how to use your program. You give a basic explanation of the inputs at: http://www.jdvtools.com/Inputs/ but the first link (which shows an input file) is broken.

I would have to admit to not being the most tech-savvy, and without a peek at an example input file, I cannot figure out how to make your program work.

I imagine other folks may well have this same issue, which is why I am posting here, rather than sending you a private message.

Dave-V
06-06-2017, 02:16 PM
Ah, that's the new Dropbox restrictions biting me again; I moved the sample file to Google Drive so that link should work now.

The input is similar to the old McGee utility except as a text file rather than onscreen. The only really required input is the STR marker data (/STRDATA section); the other sections are optional but obviously useful if you have SNP info (/SNPDATA section) or genealogy info (/GENDATA section).

My advice would be to create just a /STRDATA section first and try it, then add a /SNPDATA section and then other sections as desired (/IGNORE, or /MODAL, etc) for later runs as indicated.

I am available at [email protected] if you have any input issues or would like any help in building the file.

RobertCasey
06-06-2017, 09:03 PM
WingGenealogist and Dave Vance - here is a thoroughly tested file that represents most of R-L226. Here is some prep work that you have to complete in order to create the txt input file from an EXCEL spreadsheet:

1) You definitely need to create the /SNPTREE, as the default haplotrees from any source are usually not up to date and cause major accuracy issues. The format of this input is (father son grandson great-grandson etc.), not (father son son son etc.). I just enter father-son pairs, which are easier to track for changes.

2) /SNPDATA entries need to have parentheses around them - I just use an EXCEL formula to concatenate the required parentheses (using &).

3) For both /SNPDATA and /STRDATA, you need to concatenate the FTDNA ID and Surname (I separate by a slash) via EXCEL concatenation.

4) The values of null markers must be converted from zero to "N". I just highlight the YSTR values for all rows and columns and do a replace of all zeros with "N" (be sure to change the option to whole cell - otherwise 10 and 20 become 1N and 2N).
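
For anyone who would rather script the four prep steps above than do them in EXCEL, here is a minimal sketch. All function names are hypothetical, and exactly what /SNPDATA expects inside the parentheses is an assumption. The thread does not show the final line layout of the input file, so this only prepares the individual values.

```python
def father_son_pairs(chain):
    """Step 1: split a lineage chain into father-son pairs."""
    return list(zip(chain, chain[1:]))


def wrap_snp(snp):
    """Step 2: put parentheses around a SNP entry."""
    return f"({snp})"


def kit_label(ftdna_id, surname):
    """Step 3: join the FTDNA ID and surname with a slash."""
    return f"{ftdna_id}/{surname}"


def null_to_n(values):
    """Step 4: replace whole-cell zeros with "N"; 10 and 20 are left alone."""
    cleaned = [str(v).strip() for v in values]
    return ["N" if v == "0" else v for v in cleaned]


pairs = father_son_pairs(["A", "B", "C"])   # placeholder SNP names
label = kit_label("12345", "Smith")        # placeholder kit
markers = null_to_n([13, 0, 10, 20, "0"])
print(pairs, wrap_snp("L226"), label, markers)
```

Comparing the whole cell value to "0" is the scripted equivalent of the "whole cell" option in step 4, which is what keeps 10 and 20 from turning into 1N and 2N.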

My current input record now contains 543 testers, which exceeds the current limits of SAPP. I had to eliminate 42 testers that I cannot predict to get SAPP to return results. There is no error code returned; it just states the tree is completed but generates no output files. The good run took just over 30 minutes to complete, so be patient, as it will return good results if you wait.

Dave's tool does a very good job of collecting clusters of related individuals on the lower parts of the haplotree being charted. This is a real productivity tool for finding out where new testers belong in their respective clusters of related individuals. It also gives you a second opinion on possible YSTR evolution, which many times will be better than any manual method.

Any charting tool will struggle with accuracy if attempting to chart very old parts of haplotrees. Any chart attempting to plot all of U106 will just not work, because hidden YSTR mutations are so plentiful 2,500 to 5,000 years back. I also highly recommend omitting CDY markers, as these two markers produce 50% of the mutations for 67 marker testers. The sweet spot for charting is haplogroups that are single signature and YSNP predictable via signatures (1,500 to 2,500 year time frame).
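
One way to drop the CDY markers before building the input file, as a sketch. It assumes the marker names sit in a header row and that the two CDY columns are named with a "CDY" prefix (for example CDYa and CDYb). The haplotype values are made up for illustration.

```python
def drop_markers(names, values, excluded_prefixes=("CDY",)):
    """Return marker names and values with the excluded markers removed."""
    kept = [
        (name, value)
        for name, value in zip(names, values)
        if not name.upper().startswith(excluded_prefixes)
    ]
    return [name for name, _ in kept], [value for _, value in kept]


# Illustrative header and one made-up haplotype row.
header = ["DYS393", "DYS390", "CDYa", "CDYb", "DYS442"]
row = ["13", "24", "36", "38", "12"]

kept_names, kept_values = drop_markers(header, row)
print(kept_names, kept_values)
```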

Here is a robust 501 tester input file for L226:

http://www.rcasey.net/DNA/Temp/L226_Vance_20170606D.txt


For the creation of the /SNPTREE, here is my source for L226:

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Tree.pdf

Here is the chart that I have manually created using signatures:

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Home.pdf

Dave-V
06-06-2017, 09:42 PM
Thanks for the input examples Robert. Two minor clarifications to your points:



1) You definitely need to create the /SNPTREE, as the default haplotrees from any source are usually not up to date and cause major accuracy issues. The format of this input is (father son grandson great-grandson etc.), not (father son son son etc.). I just enter father-son pairs, which are easier to track for changes.

There shouldn't be a problem with the U106 SNP tree since I just loaded it from the one Wing Genealogist provided. The internal SNP tree for L21 was taken from Mike Walsh's data, so although the sources do vary they do come from accredited sources and I try to keep them up to date but yes, I can fall behind.

People do rely on different SNP trees (FTDNA, YFull, citizen-scientist trees, etc) so they do vary, plus any individual project (surname projects, etc) will often have their own extended SNP tree, so for all those reasons the "/SNPTREE" section is certainly a useful way to modify it or add further SNPs down to a surname project level etc.

Note that the /SNPTREE section only modifies the internal SNP tree for that run of the program, it's not a permanent change.
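
As a rough illustration of that per-run behavior (not the tool's actual code), the override can be thought of as applying father-son pairs to a copy of the internal tree. The tree contents and the function name here are hypothetical.

```python
import copy

# Hypothetical internal tree: parent SNP -> list of child SNPs.
INTERNAL_TREE = {"ROOT": ["A", "B"], "A": ["A1"]}


def tree_for_run(internal_tree, father_son_pairs):
    """Return a modified copy of the tree; the internal tree is untouched."""
    run_tree = copy.deepcopy(internal_tree)
    for father, son in father_son_pairs:
        # Detach the son from any previous parent before re-attaching it.
        for children in run_tree.values():
            if son in children:
                children.remove(son)
        run_tree.setdefault(father, []).append(son)
    return run_tree


# Add a new branch A1a under A1, and move A1 under B for this run only.
run_tree = tree_for_run(INTERNAL_TREE, [("A1", "A1a"), ("B", "A1")])
print(run_tree)
print(INTERNAL_TREE)
```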



3) For both /SNPDATA and /STRDATA, you need to concatenate the FTDNA ID and Surname (I separate by a slash) via EXCEL concatenation.

The kit "names" can really be anything you like not including spaces, +, or -. Most of my early examples just used the ID, but adding the surname is useful to see in the charts.
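
A quick check for that naming rule could look like the sketch below. It tests only the three characters named above (space, +, -), and the function name is hypothetical.

```python
FORBIDDEN_CHARS = set(" +-")


def is_valid_kit_name(name):
    """True if the name is non-empty and has no space, plus, or minus."""
    return bool(name) and not any(ch in FORBIDDEN_CHARS for ch in name)


for candidate in ["12345", "12345/Smith", "Smith Jones", "12345-Smith"]:
    print(candidate, is_valid_kit_name(candidate))
```

Hyphenated surnames would fail this check, so they would need the hyphen removed or replaced before use.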

RobertCasey
06-06-2017, 10:22 PM
There shouldn't be a problem with the U106 SNP tree since I just loaded it from the one Wing Genealogist provided. The internal SNP tree for L21 was taken from Mike Walsh's data, so although the sources do vary they do come from accredited sources and I try to keep them up to date but yes, I can fall behind.

But WingGenealogist will be adding YSNP branches on a weekly basis, so uploading his file only works until the next wave of YSNP branches is discovered. You really do not want to get into the YSNP branching end of this, as it changes too often and the accuracy of sources varies radically.


The kit "names" can really be anything you like not including spaces, +, or -. Most of my early examples just used the ID, but adding the surname is useful to see in the charts.

I do realize that the format is flexible - but the required format is not as flexible as it could be. FTDNA ID and Surname should be entered as two different fields, which would spare the end user from having to create EXCEL macros to concatenate fields (which can generate errors). I guess the real issue is only allowing one field as input when several would be much better.