
An alternative to Fluxus - for R1b-L21 only



Dave-V
04-06-2016, 10:40 PM
I've been testing out a phylogenetic tree-producer based on STR data and I believe it performs at least as well as Fluxus. It also has the advantage of being able to include SNP and genealogy data into its tree calculations.

This is an exercise many of us have done by hand (and explained so well in Maurice Gleeson's great videos), so if you've already done it and you don't need to change it, you won't need this tool. For me, as my groups get new testers, new SNP test results or the occasional new revelation from genealogy research, I was tired of re-doing the tree construction by hand. And I'm really sick of Fluxus :).

Other folks have given me sample data and it has performed very well. It definitely singles out the outliers, and groups the clearly closely-related kits together well. It's not perfect, but as I said I think it works at least as well as Fluxus, which certainly isn't perfect either. Please note the tool will only work correctly for R1b-L21 and subclades, for reasons explained on the FAQ page.

If you want to try it out or see more about what it's based on, the tool plus explanations etc is under the "SAPP" option at http://www.jdvtools.com. Comments welcomed.

As a quick overview, basically it takes a text file (*.txt) of input like this (this example has STR, SNP, and genealogy data):

/STRDATA
f10101/Smith <List of STRs like for the McGee utility>
f12211/Smith <List of STRs like for the McGee utility>
f231231/Jones <List of STRs like for the McGee utility>
f565656/Smith <List of STRs like for the McGee utility>
f312213/Adams <List of STRs like for the McGee utility>
/SNPDATA
f565656/Smith (ZZ516+ ZZ507-)
f312213/Adams (ZZ516+ ZZ507?)
f231231/Jones (ZZ516- ZZ507-)
f12211/Smith (ZZ516? ZZ507+)
/GENDATA
PapaSmith1725 (f10101/Smith f565656/Smith)

and produces this:

http://www.anthrogenica.com/attachment.php?attachmentid=8615&stc=1
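
For anyone who wants to script against the input layout shown above, here is a minimal Python sketch of a reader for the three sections. It is not SAPP's own code: the function name `parse_sapp_input` and the returned dictionary layout are assumptions made for illustration, and STR values are kept as strings because the exact value syntax is not specified in this thread.

```python
from typing import Dict


def parse_sapp_input(text: str) -> Dict[str, dict]:
    """Split a SAPP-style input file into STRDATA, SNPDATA and GENDATA.

    Hypothetical helper: the section names come from the example in the
    post; everything else (return shape, error handling) is an assumption.
    """
    data: Dict[str, dict] = {"STRDATA": {}, "SNPDATA": {}, "GENDATA": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            # Section header such as /STRDATA
            section = line[1:].upper()
            if section not in data:
                raise ValueError(f"unknown section header: {line}")
            continue
        if section is None:
            raise ValueError("data line found before any section header")
        # The label is the first whitespace-delimited token on the line.
        parts = line.split(None, 1)
        label = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if section == "STRDATA":
            data[section][label] = rest.split()
        elif section == "SNPDATA":
            # "(ZZ516+ ZZ507?)" -> {"ZZ516": "+", "ZZ507": "?"}
            calls = rest.strip("()").split()
            data[section][label] = {c[:-1]: c[-1] for c in calls}
        else:
            # GENDATA: ancestor label followed by the kits descended from it
            data[section][label] = rest.strip("()").split()
    return data
```

Because the label is taken as the first whitespace-delimited token, a label containing a space would shift every value after it, so labels need to be written without spaces.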

RobertCasey
04-07-2016, 07:40 PM
I have found that YSTR pattern recognition for building descendant charts is very unreliable in the under 800 to 1,200 year time frame. There are just too many parallel mutations in this time frame for it to be statistically reliable. However, in the 1,200 to 2,400 year time frame, signature recognition is extremely reliable (except where massive convergence happens). When analyzing private YSNP testing candidates based on the signature of NGS testers, it is amazing how many parallel mutations and a few backwards mutations of YSTRs have to happen when testing very recent YSNP mutations (under 1,000 years in general). Once there are enough younger YSNPs to eliminate these multiple YSTR mutations, I think a combination of both would produce very useful trees. Also, if somebody published all the 1,000s of 400 YSTR haplotypes associated with NGS tests, this resolution might allow descendant chart building with only YSTRs.

Nigel McCarthy is on the leading edge with this kind of tree building and he does it manually. Also, CDYs are just too volatile to use with any reliability. Weighting the YSTR values based on mutation rates also reduces errors. A much better approach is to work your way up the haplotree with signatures of NGS testers where both younger YSNPs and YSTRs can be used. These signatures are very small in scope, but they do easily discover lots of more recent YSNP branches from 1000s of private YSNPs.
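
As a rough illustration of the two points about CDY and weighting, a weighted distance between two haplotypes could be sketched in Python as follows. This is only one possible scheme and not anything from an existing tool: the function name, the `weights` mapping (for example, larger weights for slower markers) and the default weight of 1.0 are all assumptions, and no real mutation rates are implied.

```python
from typing import Dict, Sequence, Tuple


def weighted_distance(
    markers: Sequence[str],
    a: Sequence[int],
    b: Sequence[int],
    weights: Dict[str, float],
    skip_prefixes: Tuple[str, ...] = ("CDY",),
) -> float:
    """Sum of per-marker step differences, scaled by a per-marker weight.

    Markers whose name starts with one of skip_prefixes (CDY by default)
    are ignored entirely. Markers missing from `weights` count as 1.0.
    All weights are caller-supplied; none are built in.
    """
    total = 0.0
    for name, x, y in zip(markers, a, b):
        if name.upper().startswith(skip_prefixes):
            continue
        total += weights.get(name, 1.0) * abs(x - y)
    return total
```

With weights derived from whatever mutation-rate table the analyst trusts, a one-step difference on a slow marker can then count for more than a one-step difference on a fast one.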

I ran your tool for nineteen 67 marker tests under L226 and it produced 16 levels of branches - so your tool really needs 50 to 100 submissions to flatten out the chart. Also, L226 has 250 testers at 67 markers, so the 200 tester limitation would make it hard for even smaller YSNPs like L226 - M222 and L1335 would require many more, and smaller scope YSNPs would have too few testers to produce more accurate charts. It certainly looks much better than Fluxus which is very bad for this kind of connection (unless heavily modified). Also, you need to add a function to ignore the first n fields, as cutting and pasting from FTDNA reports currently has to go through a spreadsheet first and then a cut and paste to Wordpad. But it was fun to use and only took about 10 or 15 minutes to figure out how to make it work. Also, manually removing the extra multi-copy markers could be automated.

I tried many times to build a L226 chart and just gave up as picking the multiple mutations of YSTRs was just too much to guess at. I finally just built a bottom up approach with surname clusters with common YSTR of modal values. Now that we have 25 branches under L226 and quite a few panel and pack tests (plus around 50 NGS tests), both combined might produce a more reliable chart. Again, I am using another version of the bottom up approach of creating many new small branches via testing of private YSNPs (we now have five branches via testing private YSNPs). So your tool could soon have enough viable data to generate decent charts. But you still need to group several testers together and increase the limit of 200, otherwise M222, L1335, L226, etc. cannot use this technology. You can also combine this methodology with my L21 YSNP predictor methodology to filter out non-L226 submissions for L226 charts (which has over 99 % reliability).

Your GENDATA is an excellent idea as you can add genealogical data as well: 1) surname is very important (but with NPEs can be misleading at times); 2) several people test the same line, so these can be merged into one cluster based on proven genealogy (but then you become dependent on unreliable genealogical data sometimes).

Dave-V
04-07-2016, 10:03 PM
When analyzing private YSNP testing candidates based on the signature of NGS testers, it is amazing how many parallel mutations and a few backwards mutations of YSTRs have to happen when testing very recent YSNP mutations (under 1,000 years in general). Once there are enough younger YSNPs to eliminate these multiple YSTR mutations, I think a combination of both would produce very useful trees.

Fully agree. I gave up pretty quickly on creating a perfect STR-based algorithm. It functions well enough that I can use the first passes to see unlikely identical mutations on different branches, and using the SNP or Gendata functions I can manually adjust the tree. Using the tool in that way isn't much different from doing it manually, except that the tedium of drawing the tree and charting the mutations is eliminated.

It has an internal adjustment that calibrates for the likely amount of convergence in the input tree, which I COULD offer as a screen option to fine-tune the STR algorithm for older/younger trees or where more/less convergence is likely. Just wanted to keep it simple for the first release.

But to your point, my main objective was to allow the combination of STR and SNP data especially as younger SNP data becomes more widely available.

Thanks for the feedback!

Dave-V
04-07-2016, 10:36 PM
But you still need to...increase the limit of 200.

I raised it to 300 for now although it can probably handle more (need to test it). I don't find the tree reports in text form all that useful, but even shrinking the picture down to tiny fonts and boxes doesn't get much past 100 kits before the picture size gets too large to handle. 300 kits generates 599 boxes - need to figure out how to represent that efficiently as a picture with all the associated information!
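
The 599 figure is what a strictly binary tree would give: n kits as leaves need n - 1 branching nodes above them, for 2n - 1 boxes in total. Assuming that is how the count arises (the tool's internals are not described here), a tiny Python sketch of the arithmetic, with a hypothetical function name:

```python
def boxes_for_kits(n_kits: int) -> int:
    """Boxes in a strictly binary tree with n_kits leaves.

    n_kits leaf boxes plus (n_kits - 1) internal boxes = 2 * n_kits - 1.
    Assumes every internal node has exactly two children.
    """
    if n_kits < 1:
        raise ValueError("need at least one kit")
    return 2 * n_kits - 1


for kits in (80, 100, 300):
    print(kits, "kits ->", boxes_for_kits(kits), "boxes")
```

Any step that merges or hides internal nodes reduces this count, which is why the picture size depends on more than the number of kits.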

Mag Uidhir 6
04-08-2016, 01:46 AM
Dave,
My project has roughly 115 A2 Hts, 23 have tested NGS as seen here at Alex's Big Tree: http://www.ytree.net/DisplayTree.php?blockID=514&star=false

I'm gonna have to study up how to conform my data to fit into your SAPP (I'm slow and not too bright). However, my hat is off to your graphics, since I learn best visually, it is an awesome tool! Thanks!

Brad

RobertCasey
04-08-2016, 03:57 AM
Dave,
My project has roughly 115 A2 Hts, 23 have tested NGS as seen here at Alex's Big Tree: http://www.ytree.net/DisplayTree.php?blockID=514&star=false

I'm gonna have to study up how to conform my data to fit into your SAPP (I'm slow and not too bright). However, my hat is off to your graphics, since I learn best visually, it is an awesome tool! Thanks!

Brad

I already went through this process and here are the steps (from a YSTR FTDNA report): 1) cut and paste all submissions into an Excel spreadsheet; 2) delete / merge all the columns before the YSTRs begin into one column; 3) delete any extra multi-copy markers beyond the normal (like 464e and 464f); 4) cut and paste the spreadsheet into Wordpad - using paste special and text format; 5) save file and remember its directory and name; 6) upload text file; 7) redirect image to be opened to your favorite image editor (Photoshop, etc.); 8) save a jpg with medium resolution.

Do not forget to add the header for YSTR before YSTR data - /STRDATA. Add YSNPs in format shown in example in original post and surnames in format shown in example. Be sure to use at least 50 submissions for YSTR only - probably 20 or 30 when both YSNPs and YSTRs are available (assuming most testers have one or two YSNP results available). You can then use gendata to correct any obvious errors generated.
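
The spreadsheet steps above could also be scripted. Below is a minimal Python sketch that turns tab-separated rows (as pasted out of a spreadsheet) into an /STRDATA block. The column layout is an assumption: `kit_col`, `surname_col`, `skip_fields` and `drop_cols` are hypothetical parameters you would set to match your own export, and the sample values are made up.

```python
import csv
import io
from typing import Iterable


def rows_to_strdata(
    tsv_text: str,
    skip_fields: int,
    kit_col: int = 0,
    surname_col: int = 1,
    drop_cols: Iterable[int] = (),
) -> str:
    """Build an /STRDATA section from tab-separated spreadsheet rows.

    skip_fields: number of leading columns before the STR values begin.
    drop_cols:   0-based positions within the STR values to drop, e.g.
                 extra multi-copy markers such as 464e and 464f.
    """
    drop = set(drop_cols)
    lines = ["/STRDATA"]
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row or not row[kit_col].strip():
            continue  # skip blank rows
        # Merge the leading columns into one kit/surname label, no spaces.
        surname = row[surname_col].strip().replace(" ", "")
        label = f"{row[kit_col].strip()}/{surname}"
        values = [
            v.strip()
            for i, v in enumerate(row[skip_fields:])
            if i not in drop and v.strip()
        ]
        lines.append(label + " " + " ".join(values))
    return "\n".join(lines) + "\n"
```

The returned text can be saved as a .txt file and extended by hand with /SNPDATA and /GENDATA sections.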

RobertCasey
04-08-2016, 04:08 AM
I raised it to 300 for now although it can probably handle more (need to test it). I don't find the tree reports in text form all that useful, but even shrinking the picture down to tiny fonts and boxes doesn't get much past 100 kits before the picture size gets too large to handle. 300 kits generates 599 boxes - need to figure out how to represent that efficiently as a picture with all the associated information!

Rather than raising the data count, I would just merge some of the lower boxes and have multiple testers listed in one box if they are pretty similar. Having a very wide format is not bad if it handles more data. Alex Williamson's BigTree is super wide - just make sure that you opt for pushing graphics outward vs. down when possible. With enough data entered, the number of rows will seldom go above 10 rows with the 200 tester limit. This assumes both YSNPs and YSTRs are entered. If people put in data that is too distantly related (surnames with different deep ancestry), this will produce bad charts. This would work great for predictable single signature haplogroup projects like L226. It will not work for Z253, DF49, L513, etc. since these are multiple signature YSNPs. It could work for M222 if using only YSTRs with YSNPs tested and many similar testers being filtered out (use the program as a chart generator).

Dave-V
04-08-2016, 07:41 PM
Dave,
My project has roughly 115 A2 Hts, 23 have tested NGS as seen here at Alex's Big Tree: http://www.ytree.net/DisplayTree.php?blockID=514&star=false

I'm gonna have to study up how to conform my data to fit into your SAPP (I'm slow and not too bright). However, my hat is off to your graphics, since I learn best visually, it is an awesome tool! Thanks!

Brad

So I did a couple of sample runs you can play with - first I took as many Z16340-wide 111 marker results as I could fit into 80 kits (there were only about 95 total), and then I also took 80 of the A2 67 and 111 kits. Both of those charts and input files are below.

My usual way of getting data for the tool is to start with Mike W's R1b-L21 67/111 spreadsheet, so Robert's points about getting it into Excel are already done. By hiding columns or rows in Excel you can also get the data prepared for a simple cut-and-paste into a txt file (usually via either Notepad or Wordpad). I also use Excel to generate the "kit/surname" labels so the surnames end up in the tree too. Doesn't all take very long.

For the full Z16340 chart I organized the data using Mike W's subclade designations - you'll see I put those in the GENDATA section which tells the tool they're all descended from a common ancestor. There are drawbacks to this - it implies Mike's assignments are 100% accurate, for one thing - but I did it to show how that function works to guide the tree construction even if you don't have genealogy data.

I did not put in any SNP data because I couldn't find most of the kits in Alex's tree back in Mike's spreadsheet. (But I'm kind of psyched that the tool still split off the FGC9793* Clarkes first :)). The SNP function is really only for SNPs that break up the group, because that data is used to force the tree construction into the right SNP hierarchy instead of letting the tool decide for itself just on STRs. So if you have NGS tests that split this group up you could add a /SNPDATA section with those SNPs listed as + or - appropriately.

Feel free to add more kits to these - but if you do, the tool will switch from graphics to text output (try it and see). You'll see even these graphics are 7000x12000 pixels - only 1M in total size, but at the limits of what a tool can usefully generate.

File links:

Z16340 (all): PNG: https://dl.dropboxusercontent.com/u/106196821/Z16340.png
TXT: https://dl.dropboxusercontent.com/u/106196821/Z16340.txt (this input file generated the picture above)

Z16340-A2: PNG: https://dl.dropboxusercontent.com/u/106196821/Z16340-A2.png
TXT: https://dl.dropboxusercontent.com/u/106196821/Z16340-A2.txt (this input file generated the picture above)

RobertCasey
04-09-2016, 12:13 AM
I gave it a good effort today to use your tool for the analysis of L226. This haplogroup now has 459 67 marker submissions - so obviously it exceeds your limits of 300 (html output) and 80 (graphic file). In my first exercise, I tried reducing 459 to below 300 - a quick and easy approach was to eliminate submissions that had very small genetic distance from the L226 signature. Unfortunately, I failed to get the html files viewable with any browser or Dreamweaver. You need to post a better description of how to view this data. Here is the source Excel file and the source input to your program:

http://www.rcasey.net/DNA/Temp/L226_Chart_20160408B.xlsx

http://www.rcasey.net/DNA/Temp/L226_YSTR_chart_20160408B.txt
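
The first exercise (dropping submissions that sit very close to the haplogroup signature) can be sketched in Python as below. This is an illustration under simplifying assumptions, not the procedure actually used for the files above: haplotypes are lists of single integer values per marker, the signature is taken as the per-marker modal value of the input set, and genetic distance is a plain step count. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def modal_haplotype(haplotypes: Dict[str, List[int]]) -> List[int]:
    """Most common value at each marker position across all kits."""
    return [
        Counter(column).most_common(1)[0][0]
        for column in zip(*haplotypes.values())
    ]


def genetic_distance(a: List[int], b: List[int]) -> int:
    """Simple step-wise distance: sum of absolute differences per marker."""
    return sum(abs(x - y) for x, y in zip(a, b))


def drop_near_modal(
    haplotypes: Dict[str, List[int]], min_gd: int
) -> Dict[str, List[int]]:
    """Keep only kits at least min_gd steps away from the modal haplotype."""
    modal = modal_haplotype(haplotypes)
    return {
        kit: values
        for kit, values in haplotypes.items()
        if genetic_distance(values, modal) >= min_gd
    }
```

Raising `min_gd` step by step until the remaining count falls under the tool's limit is one way to trim a large set, at the cost of removing the kits closest to the signature.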

In the second exercise, I tried to include only tests that have tested positive for branches below L226. Unfortunately, there were 93 submissions meeting this criterion, so I had to eliminate the starburst of branches just below L226 and concentrated on the trunk of L226 where all the genetic bottlenecks reside. Unfortunately, I repeatedly got only 404 errors on this attempt:

http://www.rcasey.net/DNA/Temp/L226_YSTR_chart_20160408C.txt

The audience who would appreciate such a tool are those analyzing haplogroups. There are some smaller scope haplogroups that could use the tool. But larger single signature L21 haplogroups would love this tool if the limitations were increased (and it was a little easier to use and more stable). There are five or six single signature L21 haplogroups larger than L226 but getting it to work for M222 with 1,000 to 2,000 submissions would be challenging. The HTML output would be acceptable (but I could not get that to work). Maybe an Acrobat file would be more consumable as well. But the 300 limit misses a lot of your audience who are attempting these kinds of charts. I would make some of the high features as options (TMRCA, etc.) in order to push up the number of submissions. It needs to work like Alex Williamson's BigTree (it would probably be acceptable to generate the chart at your website and then we could just save the html file or Acrobat file to our hard drive).

Thanks for the effort though. I was really surprised that L226 had 93 submissions that have now tested positive for downstream branches of L226. If you analyze the signatures of NGS testers in order to test private YSNPs, it is a real eye opener how conflicting patterns reveal themselves which suggests that parallel mutations of YSTRs under L226 are widespread (even though the L226 fingerprint is predictable at 99.9 % accuracy).





RobertCasey
04-09-2016, 03:49 PM
I finally got the larger example to work by deleting another 50 or so rows (not sure this was an issue). However, the downloaded file has an extension of .html.css, which was causing programs to treat this HTML file as a CSS file. Once I copied the file and deleted the second extension (.css), normal browsers were able to view it. So you need to correct the extension for HTML downloads so programs do not automatically launch based on the incorrect .css extension that currently gets downloaded. Also, giving the user the option to change the download directories and file names would be a convenience improvement eventually (not a high priority).

I can definitely see the need for another switch to turn off the TMRCA functionality. More than half of the nodes generated have no YSTR/YSNP mutations and this makes the text mode extremely hard to follow.

As expected with YSTRs only, the top portion of the tree is pretty bizarre, however, the bottom portion is getting my very genetically isolated Casey cluster pretty well. I do not understand why exact YSTR matches are generating multiple TMRCA nodes (this seems to be an error in that portion of the code). On the top portion of the chart, there are 27 consecutive father/son TMRCA nodes with no intervening YSTR/YSNP changes. My lowest submission is 69 levels down from the top level (primarily due to the TMRCA nodes). I was able to connect all the dots - wow those TMRCA nodes really add a lot of extra time to remove. Again, even within my genetic Casey cluster, there are more TMRCA nodes than nodes based on DNA mutations.

This tool did get my South Carolina Casey cluster very accurately since it is so genetically isolated from others in L226. It also placed a Kersey submission there, which we believe belongs to our group as well. The Careys are pretty far up the tree - but remain genetically close. Unfortunately, they have tested negative for most of my private YSNPs (I got too aggressive with testing outliers hoping for a broader branch). Also, the second Casey cluster is amazingly close due to common YSTRs. However, YSNP testing has proven this to be random parallel mutations and reveals they are very distantly related under L226. The two Casey clusters share two L226 off modal values, and none of the other 25 surname clusters have either off modal value at all. This led us to believe that although they were genetically pretty different, they shared off modal values and could share a common ancestor. Your charts show that YSTRs by themselves can lead down the wrong path for generating charts like this.

The HTML mode is not very useful with all the clutter of TMRCA nodes. If you can create a switch to turn that off, I will add YSNPs and observe how that greatly improves accuracy (as we both know will happen). Also, since this tool got my cluster pretty accurately, this tool may have value for analyzing signatures of NGS testers to make recommendations for testing of private YSNPs. The limit of 80 kits for this analysis is not much for this level of analysis. This will be my next project.

I really enjoy testing out new code like this as it is so much better than Fluxus which people still use (but people are modifying that bad code to make it work better). BTW - my parents got me started on genealogy as well and I just retired from IBM after 40 years (development for seven years - then sales and services).

http://www.rcasey.net/DNA/Temp/L226_YSTR_chart_20160408D5.txt

Dave-V
04-09-2016, 05:42 PM
Robert, thanks very much for all the work. My apologies I'm distracted by some family activities this weekend. Some items though for now:

1) I just turned off the TMRCAs for the text trees for now. I can add it back as an option later.

2) I checked the error logs and you hit the maximum execution time of 30 secs a few times; I'm assuming that was why the larger number of kits didn't work for you. That's a server provider setting limitation and I'll work with them to get it raised.

3) Just FYI your input .txt file has a space in the name/label for kit "fN91115 /McInerney" which is causing the STR values to be offset and it will come out wrong in the data. I'll make that more robust in future but for now you'll need to just take out that space before the /McInerney.

4) The issue you're having with the downloaded .html is interesting... and somewhat frustrating because it seems to be a browser or browser setting difference (i.e. I get the html file just fine). I'll have to test it with multiple browsers again to replicate that - once I can narrow it down I'll either correct for it or add it to the help information. Thanks for catching it though.

5) FYI I'm shrinking the tree in the next version by eliminating nodes with no value - i.e. nodes where there are no STR mutations or SNP data. That will shorten the tree depth considerably. Should be in production early next week.

6) When you do add SNPs, just by way of explanation for the SNP labels in the tree - when it notices the SNP mutation could have occurred at different places in the tree (which will happen if there are kits possibly positive in between the known positives and known negatives) it uses the naming convention "<SNPname" for the highest level on the tree the SNP could be and "SNPname>" for the lowest level. If the SNP could only be at one node it just puts the name there. And if you don't see a "<SNPname" anywhere it's because ANY kit in the set could be positive for that SNP so the highest possible node is somewhere above the Group MRCA for the set of kits tested.

7) You can reach me directly at davevance01@gmail.com if we want to take further debugging offline.
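
To make the "<SNPname" / "SNPname>" convention in point 6 concrete, here is a minimal Python sketch of one way such bounds could be derived. It is a reading of the description above, not the tool's actual code: the tree is a hypothetical child-to-parent dictionary, calls are "+", "-" or "?", the lowest node is taken as the common ancestor of all known positives, and the highest node is the last ancestor above it with no known-negative kit underneath.

```python
from typing import Dict, List, Optional, Tuple


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Node, its parent, grandparent, ... up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def snp_bounds(
    parent: Dict[str, str], calls: Dict[str, str]
) -> Tuple[Optional[str], str]:
    """Return (highest, lowest) nodes where the SNP could have occurred.

    highest is None when no kit is known negative, i.e. the SNP could
    sit above the group MRCA.
    """
    positives = [k for k, c in calls.items() if c == "+"]
    negatives = [k for k, c in calls.items() if c == "-"]
    if not positives:
        raise ValueError("need at least one known positive kit")
    # Common ancestors of all positives, ordered from lowest to root.
    common = path_to_root(positives[0], parent)
    for kit in positives[1:]:
        other = set(path_to_root(kit, parent))
        common = [n for n in common if n in other]
    lowest = common[0]
    # Any node with a known-negative kit beneath it cannot carry the SNP.
    blocked = set()
    for kit in negatives:
        blocked.update(path_to_root(kit, parent))
    if lowest in blocked:
        raise ValueError("known negative under the positives' ancestor")
    highest = lowest
    for node in common[1:]:
        if node in blocked:
            return highest, lowest
        highest = node
    return None, lowest


def snp_labels(snp: str, highest: Optional[str], lowest: str) -> Dict[str, str]:
    """Map node -> label using the '<SNP' / 'SNP>' naming convention."""
    if highest is None:
        return {lowest: snp + ">"}
    if highest == lowest:
        return {lowest: snp}
    return {highest: "<" + snp, lowest: snp + ">"}
```

In this sketch a kit with an unknown ("?") result between the known positives and known negatives is what opens up a gap between the two labels; testing that kit would close it.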

RobertCasey
04-12-2016, 12:27 AM
Update - sorry, did not see your post before writing this one.

I just took what I thought to be the best testing candidates for the private YSNPs of my NGS test 77349. I ran three charts: 1) YSTR only; 2) YSTR + YSNP; 3) YSTR, YSNP and the few proven genealogical connections. All three properly identified my Casey surname cluster when compared to my manually done charts, and each increase of data slightly improved the chart as expected. To be honest, the cluster is very genetically isolated under L226, so this was not a big surprise. The first major surprise is that the possible connections above the surname cluster seem very plausible. After learning the tool somewhat, I think it could really help automate the analysis for testing of private YSNPs.

I think the largest issue with this tool is the limitation of 80 kits for graphic mode and 300 for HTML mode. I think the tool could greatly increase these limits by creating a switch to suppress any blocks being generated just for TMRCA nodes (more than double the current limitations). The HTML mode is just too difficult to analyze since there are no boxes generated - it is really just a text file. If the TMRCA boxes were filtered out, that would help a lot as well. Of course, HTML is not real strong on creating graphics.

Some observations about the tool. I really like the upper and lower bound ranges of YSNP positioning (not sure how accurate it is though). I also like that it seems to present a very plausible chart at 75 kits for one NGS tester but much of the important data (more kits) is missing for this analysis. The tool seems to prefer creating parallel mutations vs. any backwards mutations (which is probably in error for my particular cluster for YSTR 460). As with any graphic tool (like Alex Williamson's huge BigTree), it takes some time to analyze the charts since they are so large. Having a filter for TMRCA nodes would greatly reduce the analysis as well (the TMRCA nodes make the chart much more vertical than it really is just based on DNA and genealogy). I only included the final source for all three sources combined:

http://www.rcasey.net/DNA/Temp/L226_YSTR_chart_20160408B.txt
Below are the three scenarios, with more information added each time:

http://www.rcasey.net/DNA/Temp/NGS_77349_chart_YSTR_only_20160409A.jpg
http://www.rcasey.net/DNA/Temp/NGS_77349_chart_YSTR_&_YSNPs_20160409D.jpg
http://www.rcasey.net/DNA/Temp/NGS_77349_chart_YSTR_&_YSNP_&_GEN_20160409A.jpg




Dave-V
04-12-2016, 01:44 AM
The nodes are now optimized - meaning nodes are only shown if they contain SNP labels or carry STR mutations common across all their descendants. If you try your file again you'll find the tree MUCH flatter.

Since that also makes the tree smaller in graphic form I've upped the limit for drawing picture trees from 80 to 100 for now. Still need to get the provider limits upped for the server to allow for more kits, but that should be coming in due course.
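
As an illustration of this kind of node optimization, here is a minimal Python sketch that removes internal nodes carrying neither SNP labels nor STR mutations and hands their children to the nearest kept ancestor. The nested-dictionary tree, the `make_node` helper and the field names are assumptions for the example; the tool's real data structures are not shown in this thread.

```python
from typing import List


def make_node(label: str, mutations=(), snps=(), children=()) -> dict:
    """Hypothetical tree node: kits are nodes without children."""
    return {
        "label": label,
        "mutations": list(mutations),
        "snps": list(snps),
        "children": list(children),
    }


def collapse_empty(node: dict) -> dict:
    """Drop internal nodes with no mutations and no SNPs (root is kept)."""
    kept: List[dict] = []
    for child in node["children"]:
        collapse_empty(child)  # prune the subtree first
        is_kit = not child["children"]
        if is_kit or child["mutations"] or child["snps"]:
            kept.append(child)
        else:
            # Empty internal node: promote its children one level up.
            kept.extend(child["children"])
    node["children"] = kept
    return node


def depth(node: dict) -> int:
    """Number of levels from this node down to its deepest kit."""
    return 1 + max((depth(c) for c in node["children"]), default=0)
```

Chains of empty father/son nodes collapse completely under this rule, which is where most of the flattening would come from.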

RobertCasey
04-12-2016, 02:10 AM
Robert, thanks very much for all the work. My apologies I'm distracted by some family activities this weekend. Some items though for now:

2) I checked the error logs and you hit the maximum execution time of 30 secs a few times; I'm assuming that was why the larger number of kits didn't work for you. That's a server provider setting limitation and I'll work with them to get it raised.

I had the same issue under WAMP server and changed the Apache Server setting max_execution_time = 500 (default is 30). The php settings for same parameter are apparently overwritten with the Server values. I later had issues with the default setting of memory_limit of 8MB which I increased to 512MB (my php application is a real memory hog).

RobertCasey
04-13-2016, 03:24 AM
Thanks for eliminating the TMRCA nodes. I just analyzed the same data using around 70 testers. The number of boxes was reduced by 28 %, which should allow for a 39 % increase in testers allowed. Even more radical, there was a 58 % reduction of levels (24 to 10). This change really flattens out the graphic and now more accurately represents the descendant chart based on YDNA data. This is for the graphic version of the chart (under 100). The jpg file size was also reduced by 36 %. Here is the new graphic for the same file:

http://www.rcasey.net/DNA/Temp/NGS_77349_chart_YSTR_&_YSNP_&_GEN_20160412A.jpg
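
For anyone checking the percentages above: if the limit is really a limit on boxes, a 28 % reduction in boxes per chart leaves room for 1 / (1 - 0.28) times as many testers, which is about 39 % more. A short Python sketch of that arithmetic, with hypothetical function names and assuming boxes scale in proportion to testers:

```python
def extra_capacity(box_reduction: float) -> float:
    """Fractional increase in testers that fit in the same box budget."""
    return 1.0 / (1.0 - box_reduction) - 1.0


def reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


print(f"{extra_capacity(0.28):.0%} more testers")   # 28 % fewer boxes
print(f"{reduction(24, 10):.0%} fewer levels")      # 24 levels down to 10
```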


Dave-V
04-13-2016, 04:52 AM
Thanks Robert! The text file output should have the same output now (fewer TMRCA nodes). Sounds like you're not seeing my emails so if you still have problems, send your txt file to my Gmail account and I'll see if I can find out why. The server execution time is also raised so you shouldn't hit those limits either.

Dave

RobertCasey
04-13-2016, 07:19 PM
I used all the 67 marker tests from L226 and the text version went from 104 levels to only 4 levels. I had to eliminate 30 or 40 due to the timeout issue, so those need to be added back. I do not think I want to attempt any more charts without YSNPs when the numbers go over 100. Without YSNPs to help sort out the parallel and backwards mutations, YSTRs only with this quantity should not be attempted. However, it does continue to get most of my Casey cluster correctly since it is genetically isolated. L226 is chock full of parallel mutations which YSNPs should help sort out. I can see that this chart could be converted to a box chart, but it would take a lot of effort.

PS. I fixed the gmail issue, so I should be able to receive emails now.

RobertCasey
04-14-2016, 05:32 PM
Here is the updated chart for signature matches to NGS 77349. I increased the testers from 70 to 98 (new limit increased to 100) and added many relevant negative results from Big Y tests and one new test result. One of the Carey submissions moved back into the branch containing the FGC5639 branch. Also, the upper range of FGC5639 was moved way down to where the lower limit was last time due to the many FGC5639 negative results added (omission on my part). I did experience several timeout issues (network connection lost probably due to the long processing time or it could have been a bad time for your URL being too busy). But it ran first time this morning. For the branch containing NGS 77349 (scope of this chart), it produces a very good chart. There is a pending FGC5639 test for the Meredith tester (along with three other private YSNPs for 77349 being tested for the first time).

Here is the input file and the graphic output file:

http://www.rcasey.net/DNA/Temp/L226_NGS_77349_YSTR_chart_all_20160413E.txt

http://www.rcasey.net/DNA/Temp/NGS_77349_chart_YSTR_&_YSNP_&_GEN_20160414A.jpg
