Cladogram from STR Values



jbarry6899
05-29-2015, 08:27 PM
I would appreciate your thoughts about the best method for drawing a cladogram from STR values.

I have a group of men who share a close 37 marker match and are divided between two surnames, X and Y. I would like to try to create a cladogram to make an educated guess about whether X descended from Y, Y from X or both from another common ancestor, Z.

I found instructions for creating cladograms using McGee's Y Utility at
http://lawsondna.org/results/cladograminstruction.html
Unfortunately this approach uses Fluxus which is available only for Windows, not for my Mac. I explored Phylip, which doesn't offer this function, and MacClade, which is no longer supported in Mac OS X. So I thought I would consult a bit before deciding whether to spend some $$ to partition my Mac to run Windows programs.

A few questions:

1. Are you aware of any other programs that will run on a Mac to create cladograms from STRs?

2. If not, do you have any ideas for creating cladograms manually? I have made spreadsheets to compare each of the results with the modal and minimum values, and partitioned the group into clusters based on shared mutations. But I'm not sure where to go from there in plotting the differences in a cladogram.

3. If drawing a diagram manually isn't feasible, would you draw any conclusions regarding the earlier surname on the basis of distance from the modal or minimum haplotypes? If so, which one would give the better insight?

Thanks!

Jim
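
For questions 2 and 3, a minimal sketch of the spreadsheet step: build a modal haplotype for the group and count each kit's distance from it. The kit names, marker values, the tie-breaking rule and the stepwise distance (sum of absolute repeat differences) are all assumptions made for illustration, not data from this project.

```python
from collections import Counter

# Hypothetical haplotypes: marker name -> repeat count (values invented).
haplotypes = {
    "X1": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11},
    "X2": {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11},
    "Y1": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 10},
    "Y2": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 10},
    "Y3": {"DYS393": 13, "DYS390": 24, "DYS19": 15, "DYS391": 10},
}

def modal_haplotype(kits):
    """Most common value per marker; ties go to the smallest value (an assumption)."""
    markers = sorted(next(iter(kits.values())))
    modal = {}
    for m in markers:
        counts = Counter(k[m] for k in kits.values())
        best = max(counts.values())
        modal[m] = min(v for v, c in counts.items() if c == best)
    return modal

def genetic_distance(a, b):
    """Stepwise distance: sum of absolute repeat differences over a's markers."""
    return sum(abs(a[m] - b[m]) for m in a)

modal = modal_haplotype(haplotypes)
distances = {name: genetic_distance(h, modal) for name, h in haplotypes.items()}

for name, gd in sorted(distances.items(), key=lambda item: item[1]):
    print(name, gd)
```

Kits that sit closer to the modal are only weakly suggestive of the older line; with a handful of related kits the modal itself can be skewed.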

jbarry6899
06-03-2015, 12:22 PM
In case others are interested I found two useful tools:

Mesquite, a suite of programs for biologists that is Mac compatible: http://mesquiteproject.org

A manual methodology developed by John Robb: http://www.johnbrobb.com/Content/DNA/MutationHistoryTrees&FluxusDiagrams.pdf

MJost
06-03-2015, 12:44 PM
Isn't there a Windows on Mac software app?
http://www.imore.com/how-install-windows-10-mac-without-spending-dime

MJost

jbarry6899
06-03-2015, 12:54 PM
Yes, however I found Mesquite quite easy to use and John Robb had some good comments about Fluxus, so in the end I found good solutions without using a Windows program.

What I found was that the preliminary indication is that family X descended from Y, not the other way around, although we still have some test results in process that may refine that conclusion.

MJost
06-03-2015, 01:09 PM
Fluxus performs better when you have clear-cut branches out towards the leaves. It infers where a missing allele value should be, and the result gets messy if you don't add a major ancestral subclade modal to the mix in order to 'root' the tree.

More markers 'more better'.

MJost

jbarry6899
06-03-2015, 01:37 PM
Yes, in this case I was trying to infer the root, so Mesquite and the Robb manual method were useful. One of the problems is that two of the men in family X tested with Ancestry in the old days and have only 32 markers in common with the others. Recently I was able to track down another man from this line and his 37 marker test is under way. Depending on the results we may purchase an upgrade to more markers or to BigY.
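
When kits come from different labs with different marker counts, as with the 32 markers in common here, one hedged way to keep comparisons fair is to restrict the distance to the markers both kits report. The sketch below assumes invented marker values and a stepwise distance; values may also need converting between labs' reporting conventions before comparison, and that step is not shown.

```python
def shared_markers(a, b):
    """Markers reported in both kits, in sorted order."""
    return sorted(set(a) & set(b))

def distance_on_shared(a, b):
    """Stepwise distance over shared markers, plus how many markers were compared."""
    markers = shared_markers(a, b)
    gd = sum(abs(a[m] - b[m]) for m in markers)
    return gd, len(markers)

# Hypothetical kits: the older test lacks DYS570.
kit_37 = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 10, "DYS570": 17}
kit_old = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10}

gd, compared = distance_on_shared(kit_37, kit_old)
print(f"GD {gd} over {compared} shared markers")
```

Reporting the number of markers compared alongside the distance makes it harder to mistake a short comparison for a close match.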

FYI, all of this is being done in preparation for testing on the remains of one of the Earls of Barrymore: http://www.anthrogenica.com/showthread.php?2681-Barry-DNA-Project&p=82999&highlight=barrymore#post82999

In this case family X (Barry) is generally considered to be Anglo-Norman Irish and associated primarily with County Cork. Family Y is Christopher, a common surname in County Waterford. There is one subgroup of Barry men, probably I-L22 uN, which implies a Norse origin, and who have a number of close matches to Christopher men from County Waterford, a Viking stronghold. The phylogenetic trees generated by Mesquite and the Robb method imply that these Barrys were descended from a Christopher, not the opposite. So this tends to make the I-L22 Barry group a less likely candidate to match the Earl, whose line is traced directly to the original Anglo-Norman family.

There are uncertainties in this analysis, but it is useful background for the project. This discussion on NPEs is also part of our background research: http://www.anthrogenica.com/showthread.php?2921-NPEs-in-an-Anglo-Irish-Family&p=84018&highlight=NPEs+Irish+Family#post84018

I have an excruciatingly detailed background paper on DNA testing and Barry family history if anyone has the interest to read it: https://dl.dropboxusercontent.com/u/44452288/DNA%20Testing%20and%20Barry%20Family%20History.doc x

MJost
06-03-2015, 01:47 PM
You could use McGee's Utility to enter your haplotypes and produce a modal for them. But with a small number of haplotypes the modal can be skewed if the set is swamped by a subgroup whose members are much more closely related to each other than to the rest. It is far better to use a major ancestral subclade modal with a wider geographical distribution. For example, I prefer to use L21's modal for every subclade under it.

MJost
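
The skewing effect described above can be shown with a toy example. Everything here is invented for illustration: the kit names, the DYS390 values and the documented lines. The "one vote per line" modal is a different hedge from the external subclade modal recommended above, and it is not something McGee's Utility does.

```python
from collections import Counter

def modal(values):
    """Most common value; ties go to the smallest value (an assumption)."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

# Hypothetical DYS390 values: three close cousins share a recent mutation to 25.
dys390 = {"cousin_a": 25, "cousin_b": 25, "cousin_c": 25,
          "distant_d": 24, "distant_e": 24}
# Hypothetical paper-trail grouping of the same kits.
line_of = {"cousin_a": "line1", "cousin_b": "line1", "cousin_c": "line1",
           "distant_d": "line2", "distant_e": "line3"}

# Naive modal: the cousins outvote everyone else.
raw_modal = modal(dys390.values())

# One vote per documented line: collapse each line to its own modal first.
per_line = {}
for kit, line in line_of.items():
    per_line.setdefault(line, []).append(dys390[kit])
line_modal = modal(modal(vals) for vals in per_line.values())

print(raw_modal, line_modal)
```

Here the naive modal picks up the cousins' private value, while the per-line modal does not. An external ancestral modal avoids the problem altogether, since it does not depend on who happened to test.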

jbarry6899
06-03-2015, 02:00 PM
Yes, I have looked at that as well. The Christopher samples are closer to the modal haplotype, but again the small sample size and the interrelationships introduce uncertainties.

The hope is that testing of the Earl will yield some strong evidence. Based on nearly 100 test results we have three candidate groups that may relate to him: this I-L22 group and two in R1b, Z49 (27 men) and L159.2 (9 men). The forensic examination and taking of samples is scheduled for mid August, so time will tell.

MJost
06-03-2015, 02:08 PM
Sounds interesting. May take a long time.

MJost

jbarry6899
06-03-2015, 02:10 PM
Yes, we are hoping to get some preliminary results by the end of the year, but I've been working on this family for more than 30 years so I can wait a little longer.

ChrisR
10-14-2016, 10:07 PM
Despite the advances in next-generation sequencing there is still a need for Y-STR haplotype distance comparisons. I wonder if there are any new tools (Windows/Linux) that can be used with FTDNA Y-STR data without too much conversion work? New tutorials for more refined usage are always interesting too. Thanks

Dave-V
10-15-2016, 03:23 PM
Take a look at my STR-based tree generator here:

http://www.jdvtools.com/SAPP

It was intended to be an STR-based cladogram generator that can incorporate SNP and genealogy data. On its own it will generate the tree that is statistically most likely based on STR mutation rates. Of course, especially for small input datasets, that tree may not be representative of the actual tree, so modeling different scenarios is encouraged (the tool, for instance, allows adjusting the starting modal and ignoring certain markers, usually the faster-moving ones).

The tool was intended originally to be used by surname or smaller project administrators so the underlying system does not scale up well to large input sets. Robert Casey has pushed the envelope with several hundred or more kits and hit performance limitations. However the original intent of the tool was to try and distinguish family lines within a group of kits based on STRs so once you get into the hundreds of kits, you're really looking to answer different questions. So this may or may not be useful depending on what questions you're looking to answer.

Edit: Note the tool was originally created for R1b-L21 but that's because it contains within its database the modal for L21 and the allele frequencies specific to L21. It's not hard to adapt to other haplogroups outside those under L21.

If you have questions I can help either here or offline at [email protected]
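
To illustrate why ignoring fast-moving markers can change a tree, here is a small sketch that groups kits with plain average-linkage (UPGMA-style) clustering and prints a Newick string. This is not SAPP's algorithm or input format; the kits, the marker values and the choice of CDYa as the "fast" marker are assumptions for the example.

```python
from itertools import combinations

# Hypothetical kits; B carries a large change on the fast marker CDYa.
kits = {
    "A": {"DYS390": 24, "DYS391": 10, "DYS19": 14, "CDYa": 36},
    "B": {"DYS390": 24, "DYS391": 10, "DYS19": 14, "CDYa": 40},
    "C": {"DYS390": 25, "DYS391": 11, "DYS19": 14, "CDYa": 36},
    "D": {"DYS390": 25, "DYS391": 11, "DYS19": 15, "CDYa": 36},
}

def distance(a, b, ignore=frozenset()):
    """Stepwise distance, skipping any markers listed in `ignore`."""
    return sum(abs(a[m] - b[m]) for m in a if m not in ignore)

def cluster_tree(kits, ignore=frozenset()):
    """Repeatedly merge the two clusters with the smallest average distance."""
    clusters = {name: [name] for name in kits}  # label -> member kits

    def average(pair):
        x, y = pair
        ds = [distance(kits[i], kits[j], ignore)
              for i in clusters[x] for j in clusters[y]]
        return sum(ds) / len(ds)

    while len(clusters) > 1:
        # Sorting the labels makes tie-breaking deterministic.
        x, y = min(combinations(sorted(clusters), 2), key=average)
        merged = clusters.pop(x) + clusters.pop(y)
        clusters[f"({x},{y})"] = merged
    return next(iter(clusters)) + ";"

print(cluster_tree(kits))                   # all markers
print(cluster_tree(kits, ignore={"CDYa"}))  # fast marker ignored
```

With all markers, B is pushed to the outside of the tree by its CDYa value; with CDYa ignored, A and B pair up. Neither result is "the" answer, which is the reason for modeling several scenarios.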

MfA
10-15-2016, 03:36 PM
Can you post an example .txt file?

Edit: Never mind, there's already an example input file at http://www.jdvtools.com/Inputs/

RobertCasey
10-15-2016, 04:45 PM
I spent a substantial amount of time with SAPP, and it is still a useful tool that automates finding the clusters of related individuals. However, the early branching does not work well, and the 100 to 150 submission limit really misses the sweet spot for this kind of tool for larger predictable haplogroups like M222, L226, L193, etc.

I have developed a very laborious manual method that could be automated via coding. I found that three major parameters are needed to create such charts with any accuracy: 1) the YSTR signatures of tested submissions; 2) the overlapping signatures, which are very plentiful (a smaller signature should be replaced if it turns out to be part of a larger signature discovered later); 3) a check for genetic distance that is too large for the size of the signature. I have been able to take 500 67-marker submissions under L226 and produce a chart covering around 60% of the submissions (20% tested and 40% predicted).
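
A rough sketch of how those three parameters might fit together in code. It is only one reading of the description above: the branch names, signatures, modal values and the rule "allow at most 2 steps of distance per signature marker" are all hypothetical and are not taken from the L226 charts.

```python
# Hypothetical haplogroup modal and branch signatures (off-modal marker values).
MODAL = {"DYS390": 24, "DYS391": 10, "DYS19": 14, "DYS385a": 11, "DYS439": 12}
SIGNATURES = {
    "branch_A": {"DYS390": 25},
    "branch_A1": {"DYS390": 25, "DYS391": 11},  # overlaps and extends branch_A
}

def matches(kit, signature):
    """A kit matches when it carries every value in the signature."""
    return all(kit.get(m) == v for m, v in signature.items())

def assign(kit, signatures=SIGNATURES, modal=MODAL, gd_per_marker=2):
    """Pick the largest matching signature whose distance check passes."""
    candidates = [name for name, sig in signatures.items() if matches(kit, sig)]
    # Overlap rule: try larger signatures before the smaller ones they contain.
    for name in sorted(candidates, key=lambda n: len(signatures[n]), reverse=True):
        sig = signatures[name]
        branch = {**modal, **sig}  # expected haplotype for this branch
        gd = sum(abs(kit[m] - branch[m]) for m in modal)
        # Distance rule (invented threshold): small signatures tolerate less drift.
        if gd <= gd_per_marker * len(sig):
            return name
    return None

kit1 = {"DYS390": 25, "DYS391": 11, "DYS19": 14, "DYS385a": 11, "DYS439": 12}
kit2 = {"DYS390": 25, "DYS391": 10, "DYS19": 14, "DYS385a": 11, "DYS439": 13}
kit3 = {"DYS390": 25, "DYS391": 10, "DYS19": 16, "DYS385a": 13, "DYS439": 12}

for kit in (kit1, kit2, kit3, MODAL):
    print(assign(kit))
```

The third kit matches the one-marker signature but is rejected because its distance is too large for such a small signature, which is the kind of convergence the distance check is meant to catch.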

Here is a link to my R-L226 haplotree chart, the L226 YSNP branch summary chart and the L720 haplotree chart:

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Home.pdf

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Tree.pdf

http://www.rcasey.net/DNA/Temp/L720_Chart_20160823D.pdf

This methodology could be automated via programming if there are any programmers out there looking to make an impact on the genetic genealogy community. These charts are based on haplogroups that are predictable in nature (1,200 to 2,500 years old).

ChrisR
10-16-2016, 06:43 PM
Thanks Dave-V and RobertCasey,
the SAPP SNP and GEN options look interesting. I'm mainly looking for a tool that offers an advantage over using, for example, Y-Utility+Heinila2012+Fluxus/PHYLIP (or a simple GD comparison) on kits predicted or confirmed as J2, which means a very old haplogroup as well as a very big comparison dataset (thousands of kits). I know STRs have their limits even at Y111 or higher, and that consequently any tool building trees (or TMRCA estimates) on them is limited. But I suspect there is room for optimization in how clustering is used with Y-Utility+Heinila2012+Fluxus/PHYLIP, for example, and maybe in combining all input/output steps into one process (as SAPP seems to achieve). I will keep SAPP in mind when I work on a young cluster, but that is a rare case for me.

RobertCasey
10-16-2016, 07:38 PM
YSTR signatures are not reliable for very old haplogroups like J2: so much time has passed that the vast majority of YSTR mutations are hidden and can no longer be determined, due to excessively high percentages of backwards and parallel mutations. Chart building (which only works with signature-based algorithms) is only feasible for the dozens of predictable YSNP branches under J2. Only YSNPs can reliably tie those predictable YSNP branches together, as the reliability of YSTRs is lost in this time frame.
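
The "hidden mutations" point can be illustrated with a toy simulation of a single lineage. The assumptions are mine: 37 markers, a flat rate of 0.004 mutations per marker per generation, and single-step changes that are equally likely up or down. Real markers differ widely in rate, so this only shows the direction of the effect.

```python
import random

def simulate(generations, n_markers=37, rate=0.004, seed=1):
    """Return (mutation events that occurred, distance observable today)."""
    rng = random.Random(seed)
    offsets = [0] * n_markers  # repeat-count change relative to the ancestor
    events = 0
    for _ in range(generations):
        for i in range(n_markers):
            if rng.random() < rate:
                offsets[i] += rng.choice((-1, 1))
                events += 1
    # Back mutations cancel out, so only the net change per marker is visible.
    observed = sum(abs(o) for o in offsets)
    return events, observed

for generations in (40, 100, 1000):
    events, observed = simulate(generations)
    print(generations, events, observed)
```

The observed distance can never exceed the number of events, and the gap widens as the number of generations grows, which is why old haplogroups leave so little usable STR signal.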

However, there are two forms of YSNP prediction: 1) fairly recent YSNP branches in the 1,200 to 2,500 year time frame (R-M222, R-L226, L1335, etc.); 2) older YSNP prediction based on unique combinations of rarer YSTR values (R-M269 with high accuracy and R-U106 with reasonable reliability). There is a huge gap between these two time frames where prediction does not work well. Under R-L21, we could probably find enough recent YSNP branches to predict 80 to 90% of L21.

Charting of these recent branches also requires 20 to 30% robust YSNP testing at 67 markers, and even then 10 to 20% cannot be reliably charted without YSNP testing, due to the lack of divergence of YSTRs over the last 1,000 to 1,500 years. You will need a lot of NGS tests, SNP pack tests and individual tests of private YSNPs as well. With R-L226, we can now chart around 60% with 20% YSNP tested. We also have 500 67-marker submissions with 50 NGS tests and probably 100 SNP packs/panels, as well as 100 tests of private YSNPs associated with NGS tests. Charting the last 10 to 20% will be tough, since YSNP testing is revealing many individuals at 67 markers who do not really match despite genetic distances of 1 to 3 mutations (excluding CDY markers).

RobertCasey
12-08-2016, 06:23 PM
I have now created a pretty detailed procedure to build genetic descendant charts under predictable haplogroups (such as L226, M222, L193, etc.). This procedure is very laborious to carry out manually, but once you have created a pretty robust chart, keeping it maintained is less laborious since it is an additive process. The process is based on signatures, genetic distance, and adjustments for overlapping signatures as well as rare marker values, multi-step mutations, slow-moving markers, null markers, etc. This extends the signature prediction methodology of binary logistic regression down to the present. I am formalizing a functional specification for implementation via coding. This has been fully tested under L226, with a full pass on L743 (it worked very well, but this YSNP is a little too recent), and I am currently working on Z16437 under Z255 (this YSNP has some pretty significant convergence outside of Z255 and with other branches of Z255). Here is my latest version of the L226 chart. With the extensive testing of the new L226 SNP pack (which includes 50 private YSNPs), the coverage went from 50% to 68% of the 500 67-marker submissions that are predicted to be L226 positive. We should top 75% coverage in the next few months with incoming data and fine-tuning of the charting program.

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Home.pdf


If you are a programmer type and want to help the genetic genealogy community, this coding project would not be too rough for the first phase (90% complete), but as with any coding project, the more complete version would double or triple the coding effort. YSNP prediction could also be automated across the entire genome, which would be a second coding project (it is currently possible to use 67 markers to predict 70 to 80% of all submissions to predictable YSNPs). I am looking for some brave citizen scientist to step forward. These tools would eventually involve a small monthly charge to pay for IT costs as well as some programming time, so you might actually get paid minimum wage or more eventually (the beta versions would be free, of course).

http://www.rcasey.net/DNA/R_L226/R_L226_Contact_Project.html

