L513 STR analysis using RCC Correlation



Dave-V
09-08-2014, 08:05 PM
I thought I would cross-post this here for reference.

I don't think this replaces other STR analysis tools, but anyone working with STRs should see Bill Howard's RCC correlation approach; I've not seen this discussed before. This particular conversation started on the L21 Yahoo Group.

By way of background, Bill has developed this approach over several years along with many of the recognizable names in the genetic genealogy field. He has authored several papers for the Journal of Genetic Genealogy and was invited by Tim Janzen and CeCe Moore to present a few weeks ago at the 2014 International Genetic Genealogy Conference in Chevy Chase, MD, so this method hasn't exactly been ignored by the community. But it was certainly the first I had heard of it or seen it applied.

A full index to Bill's papers, applications of the method and presentations is provided here:

https://dl.dropboxusercontent.com/u/59120192/Genealogy/Papers%26TreesIndex.pdf

Like TiP reports or the McGee utility, the RCC correlation analysis calculates TMRCA between kits based on STRs. But instead of using genetic distance, it takes a completely different approach: it calculates the statistical correlation between kits. Kits that show a higher correlation have a more recent MRCA. Bill has been able to translate the correlations into "Revised Correlation Coefficients" (RCC) that equate to TMRCA in years. He then uses a tool to draw a phylogenetic tree for all the kits.
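To make the idea of "correlation between kits" concrete, here is a rough sketch in Python. It computes the ordinary Pearson correlation coefficient (CC) between two haplotypes, each treated as a list of marker values in the same order. The rescaling to RCC used here, 10000 × (1/CC − 1), is an assumption recalled from Bill's published papers and is not stated in this post, so check his index for the authoritative definition. The three kits are made-up values for illustration only, not real project data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length marker lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("haplotypes must have the same length (at least 2 markers)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when all markers are equal")
    return sxy / sqrt(sxx * syy)


def rcc(x, y):
    """Rescale the correlation so that closer kits give smaller numbers.

    ASSUMPTION: RCC = 10000 * (1/CC - 1). This formula is not given in the
    post; verify it against Bill Howard's papers before relying on it.
    """
    cc = pearson(x, y)
    if cc <= 0:
        raise ValueError("RCC is only meaningful for positive correlations")
    return 10000.0 * (1.0 / cc - 1.0)


# Hypothetical 12-marker haplotypes (illustrative values only).
kit_a = [13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 29]
kit_b = [13, 24, 15, 11, 11, 14, 12, 12, 12, 13, 13, 29]  # one marker differs
kit_c = [13, 25, 14, 11, 13, 14, 12, 12, 12, 14, 13, 29]  # three markers differ

print("RCC a-b:", round(rcc(kit_a, kit_b), 2))
print("RCC a-c:", round(rcc(kit_a, kit_c), 2))
```

The pair that differs on fewer markers ends up with the smaller RCC, which is the sense in which a higher correlation means a more recent MRCA.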

I first gave Bill all the 37 (or more) marker tests for the L513 kits in the Vance surname project. The results are shown here (on the RCC time scale at the bottom, 10 RCC = 408.5 years):

https://dl.dropboxusercontent.com/u/106196821/Vance37No155538newcode.dec2013.rev2.m.pdf

His tool VERY nicely separates all the 513-L193-A1V Vances from the 513-V Vances, and even breaks out the two 513-Va and 513-Vb Vance groups within 513-V. There are also "a" and "b" subgroups within A1V, and they cluster nicely on the tree with one exception (F92592) who has a different documented ancestry and deserves more analysis.

Bill then ran all the L513 111-marker tests for me, with results shown here (on this tree 10 RCC = 346.5 years, because of the influence of STRs 38-111):

https://dl.dropboxusercontent.com/u/106196821/L513111markersnewcode.dec2013.rev2.m.pdf
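For reading ages off the two trees, the conversion is simple arithmetic. The sketch below uses only the two scales quoted in this post (10 RCC = 408.5 years for the 37-marker tree, 10 RCC = 346.5 years for the 111-marker tree). The function and table names are made up for illustration, and the scales should be assumed to apply to these particular trees only.

```python
# Years per 10 RCC, as quoted for the two trees in this post.
# Keyed by the number of markers used to build the tree.
YEARS_PER_10_RCC = {37: 408.5, 111: 346.5}


def rcc_to_years(rcc_value, markers):
    """Convert an RCC reading from one of the two trees into years."""
    try:
        scale = YEARS_PER_10_RCC[markers]
    except KeyError:
        raise ValueError("no time scale quoted for %r markers" % (markers,))
    return rcc_value * scale / 10.0


# Example: the same RCC reading on each tree's time scale.
for marker_count in (37, 111):
    print(marker_count, "markers, RCC 25 ->", rcc_to_years(25, marker_count), "years")
```

The same RCC reading corresponds to fewer years on the 111-marker tree than on the 37-marker tree, so values from the two trees should not be compared directly without converting them first.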

If you spend any time with this 111-marker phylogenetic tree, you will see that this method is NOT an "automatic family tree generator". The RCC method is still prone to the same convergence errors as the TiP reports, the McGee utility, or any other STR-only analysis. To show this more clearly, Bill ran the same tree with the subgroups marked, which is shown here:

https://dl.dropboxusercontent.com/u/106196821/L513111markersSNPnewcode.dec2013.rev2.m.pdf

This version of the same 111 marker tree shows more clearly that in a few cases a 6365 SNP subgroup and a 5668 subgroup were mixed in incorrectly because that's how they look from STRs alone. That's where SNP analysis comes in.

But what the RCC approach DOES do is group the subgroups together very nicely, and it proposes a set of TMRCAs that can be used to validate or refine the TMRCAs from other methods. Bill has compared his calculations to traditional tools like TiP reports; those comparisons are in the papers and the conference presentation listed in his index. Bill points out that when two haplotypes are compared, the TiP report is not to be trusted if a marker shared by the two haplotypes differs by an absolute difference of 2, but the TiP report appears to be valid when the absolute difference is 1. Two-thirds of the time the TiP will be OK, whereas the RCC method appears to be OK for all DYS site differences, even when the absolute difference at a marker is two or more. Because these differences tend to accumulate with time, the TiP process will not be useful when applied to genetic time scales, whereas the RCC method will remain valid.
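As a quick way to apply Bill's rule of thumb to a pair of kits, the sketch below lists the per-marker absolute differences and flags the pair when any marker differs by 2 or more. It only restates the rule as described above. The function names and the sample values are hypothetical, and multi-copy markers are simply treated as one value per position.

```python
def marker_differences(hap_a, hap_b):
    """Absolute difference at each marker position for two haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must list the same markers in the same order")
    return [abs(a - b) for a, b in zip(hap_a, hap_b)]


def tip_needs_caution(hap_a, hap_b):
    """True when at least one marker differs by 2 or more.

    Per the rule of thumb described above, a TiP report for such a pair
    should be treated with caution.
    """
    return any(diff >= 2 for diff in marker_differences(hap_a, hap_b))


# Hypothetical marker values (illustrative only).
kit_x = [13, 24, 14, 11, 11, 14]
kit_y = [13, 24, 15, 11, 11, 14]  # one marker differs by 1
kit_z = [13, 24, 14, 11, 13, 14]  # one marker differs by 2

print(tip_needs_caution(kit_x, kit_y))
print(tip_needs_caution(kit_x, kit_z))
```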

If nothing else, since it groups the kits very closely to the current subgroups, his method is a very nice validation of the hard work that Mike Walsh and others have done to manually group us all into these sub-groups before SNPs below L513 were available. So I think for anyone who is still looking at STRs, this would make a very useful additional analysis method.

Bill is the expert here, so I'll have to let him handle most of the questions. Beyond the papers in his Index above, he is available at [email protected] if you have questions or want to try his method on a particular set of STRs. He will run trees for you if you send data to him formatted as described here:

https://dl.dropboxusercontent.com/u/59120192/Genealogy/Trees/Template.xlsx

Note in particular the DYS389ii format he needs is the larger value which is (DYS389i + FTDNA's DYS389ii).