
Statistics and Probability Theory supporting YDNA charting

RobertCasey
04-22-2017, 09:50 PM
Attached is a very preliminary paper that I have written on the mathematics that supports charting YDNA via YSTR signatures combined with massive YSNP testing. I am looking for feedback on the math involved and any suggestions as to why the mathematical model is producing much higher numbers than the observed data. Both track very well, but the ratio of observed data to the mathematical model continues to fall at a consistent rate as the mutation rate increases.

There are three documents that should be reviewed:

1) The fifteen page paper that summarizes the statistical models and probability theories used in the analysis. This paper also summarizes the mathematical model compared to the observed data for the haplogroup R-L226 charts.

http://www.rcasey.net/DNA/Temp/Charting_via_signatures_20170422.pdf
2) The Excel spreadsheet that summarizes the mutation rates of Burgarella and Heinilla and compares the mathematical models with the observed data from signature-based charting for L226:

http://www.rcasey.net/DNA/Temp/Mutation_Rates_20170422A.xlsx

3) The actual chart of R-L226 used as input to the mathematical model in the Excel spreadsheet:

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Home.pdf


RobertCasey
05-04-2017, 03:47 AM
Here is a major update on charting using signatures. It goes into a lot of detail about the probability of backwards mutations, two single-step mutations in the same direction, parallel mutations, etc. It also includes significant analysis of the two most comprehensive papers on mutation rates, by Burgarella and Heinilla, and possible issues with these mutation rates.

I made two significant updates to the number of mutation events per path, as suggested by Alex Williamson. Using YSNP branches and approximately 75 years per YSNP (I had to lower it some for a few branches) significantly reduced the average number of transmission events for 47 branches. I also added a little over a dozen surname clusters, which average 40 transmission events. These two changes took the average from 60 transmission events to 46. I also considered adding NGS clusters, where the date would be based on the number of private YSNPs and branch equivalents, but it seemed pretty difficult to determine who would belong to these clusters. Since we know that 46 is still pretty high, I later estimated a further reduction of 25 % based on finding new branches and surname clusters over the coming months (we are now averaging one new branch every two or three weeks).
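The arithmetic behind these adjustments can be sketched roughly as follows. This is only an illustration: the 75 years per YSNP and the 46 and 25 % figures come from the post, while the function names, the example branch with four YSNPs, and the 30 years per generation are assumptions made here and are not taken from the paper or spreadsheet.

```python
YEARS_PER_YSNP = 75  # approximate figure quoted in the post


def branch_age_years(ysnp_count, years_per_ysnp=YEARS_PER_YSNP):
    """Rough age of a branch from the number of YSNPs along its path."""
    return ysnp_count * years_per_ysnp


def transmission_events(age_years, years_per_generation):
    """Father-to-son transmissions over a span of years.

    years_per_generation is an assumption; the post does not state one.
    """
    return age_years / years_per_generation


def apply_reduction(average_events, reduction_fraction):
    """Scale an average down by a fractional reduction (0.25 means 25 %)."""
    return average_events * (1.0 - reduction_fraction)


if __name__ == "__main__":
    age = branch_age_years(4)  # hypothetical branch with 4 YSNPs
    print(age)
    print(transmission_events(age, years_per_generation=30))  # assumed 30 years
    print(apply_reduction(46, 0.25))  # the further 25 % reduction from 46
```

Applying the 25 % reduction to the average of 46 gives 34.5 transmission events per path.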

I also analyzed the genetic clusters of the individuals who could not be charted because their signatures have not been YSNP tested to date. These genetic clusters include 80 % of the genetic clusters with very large signatures and significant genetic distance within the cluster. Although only 22 % are not charted, this added 52 % to the number of mutations, which was a pretty big surprise. These three adjustments really closed the gap between the mathematical model and the observed values.

Next, I took a very hard look at the validity of the mutation rates by Burgarella and Heinilla. I drew on my experience with L21 signatures and looked at the number of off-modal mutations for 50,000 haplogroup R testers at 67 markers. My empirical observations tracked very well for medium and slow mutating markers, but I just could not tell anything about faster mutating markers. Burgarella was a major reduction from the Chandler numbers, and Heinilla was a major increase in mutation rates over Burgarella. The off-modal values and my empirical observations revealed that many Burgarella mutation rates did not track very well, but the Heinilla mutation rates tracked much better. However, the mutation rates of the faster markers seem very low in the Burgarella paper and very high in the Heinilla paper. So I decided to use the average of the two, and most of the discrepancies from the HG R modal values and my observations really smoothed out.
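Averaging the two rate tables marker by marker could look something like the sketch below. The marker names and rate values are placeholders invented for illustration, not figures from either paper, and the function name is hypothetical.

```python
def average_rates(rates_a, rates_b):
    """Per-marker mean of two mutation-rate tables.

    Only markers present in both tables are averaged; a marker missing
    from either source is left out rather than guessed.
    """
    shared = rates_a.keys() & rates_b.keys()
    return {marker: (rates_a[marker] + rates_b[marker]) / 2 for marker in shared}


if __name__ == "__main__":
    # Placeholder values only; real rates would come from the spreadsheet.
    source_a = {"marker_slow": 0.0005, "marker_fast": 0.004}
    source_b = {"marker_slow": 0.0007, "marker_fast": 0.010}
    for marker, rate in sorted(average_rates(source_a, source_b).items()):
        print(marker, rate)
```

A simple mean treats both sources as equally reliable for every marker. A weighted mean would be an easy variation if one source were trusted more for a given speed class.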

But after all these modifications, the mutation rates still appear to be an issue. The model for the slow mutation rates actually matched the observed values, and for the medium mutation rates it is only 14 % higher than the observed values. However, for the faster mutation rates it is 80 % higher, since the Heinilla mutation rates for faster markers are so much larger than those in the Burgarella paper. I am now convinced that the mutation rates in the Heinilla paper are probably too high for faster mutating markers. For the last adjustment, I just analyzed where the gaps were and how big they were. I will have to wait for the next round of mutation rate papers to further adjust the mutation rates.
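To put those gaps in terms of the ratio of observed data to the model, here is a small sketch. It assumes that "X % higher" means the model prediction is X % above the observed count, which is my reading of the paragraph above; the names are hypothetical.

```python
# Percent by which the model exceeds the observed values, per speed class,
# as quoted in the post (slow rates "actually matched", so 0 %).
MODEL_EXCESS_PERCENT = {"slow": 0.0, "medium": 14.0, "fast": 80.0}


def observed_to_model_ratio(excess_percent):
    """Observed / model when the model is excess_percent above observed."""
    return 1.0 / (1.0 + excess_percent / 100.0)


if __name__ == "__main__":
    for speed, excess in MODEL_EXCESS_PERCENT.items():
        print(speed, round(observed_to_model_ratio(excess), 3))
```

Under that reading the ratios come out at 1.0, about 0.877, and about 0.556, so the ratio keeps falling as the marker speed goes up.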

Here is the updated paper:

http://www.rcasey.net/DNA/Temp/Charting_via_signatures_20170503.pdf

Here is the updated source spreadsheet with all the details on calculations:

http://www.rcasey.net/DNA/Temp/Mutation_Rates_20170503A.xlsx

Here is the updated charting for L226 (since I do not change the file name, you may need to hit refresh to clear out your browser cache):

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Home.pdf
Here is the updated spreadsheet that was used to create the chart (since I do not change the file name, you may need to hit refresh to clear out your browser cache):

http://www.rcasey.net/DNA/R_L226/Haplotrees/L226_Signatures.xlsx

Jan_Noack
12-12-2017, 07:58 AM
Hi Robert,
I have only glanced at this, and lack of time for both reading and gaining the necessary background knowledge prevents me from giving the considered reply it deserves. But could it be that the mathematical models include the mutations that go back and forward and thus "cancel out" in the observed data? This effect would be expected to grow as mutation rates increase, as you have observed. In other words, at the faster mutation rates the observed data doesn't catch all the mutations, and would be expected to miss quite likely the majority of them. Sorry if this obvious suggestion has already been included in your theory.
BTW I've only tested my father and hubby to 37 STR markers, so I'm a newbie.
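A rough way to see the size of the cancellation effect suggested above is the sketch below. It is not the model from the paper. It assumes a symmetric single-step model (each mutation is +1 or -1 with equal chance) and a Poisson number of mutation events per marker along a path, and it compares the chance that a marker ends up visibly different from its starting value with the expected number of events. The function names are made up for this example.

```python
import math


def prob_net_change(expected_events, max_events=80):
    """Chance a marker shows a nonzero net change from its starting value.

    Assumes Poisson-distributed event counts with mean expected_events and
    symmetric +1/-1 steps. The sum is truncated at max_events, which is
    ample for the small means used here.
    """
    p_zero = 0.0
    for k in range(0, max_events + 1, 2):  # only an even count can net to zero
        poisson_k = math.exp(-expected_events) * expected_events**k / math.factorial(k)
        back_at_start = math.comb(k, k // 2) / 2**k
        p_zero += poisson_k * back_at_start
    return 1.0 - p_zero


def visible_fraction(expected_events):
    """Visible changes per marker divided by expected events per marker."""
    return prob_net_change(expected_events) / expected_events


if __name__ == "__main__":
    for mean_events in (0.01, 0.1, 0.5, 1.0):
        print(mean_events, round(visible_fraction(mean_events), 3))
```

Under these assumptions the visible fraction drops steadily as the expected number of events per marker rises, which points the same way as the observation that the gap widens for faster markers. Note that this also counts two steps in the same direction as a single visible change, so it mixes both effects.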

The PubMed paper mentioned in another post by Cofgene, article PMC4863667 (sorry, I am not allowed to post links), mentions that the selection of the STRs in the panels leads to lower observed mutation rates than expected, and that sampling with short reads (mostly 100bp to 150bp) eliminates the longer STRs (they've been chopped up), and it's the longer STRs that have the increased mutation rate. So a combination of these would also cause the mathematically expected mutation rate to exceed the observed rate.

Then, I read a while ago of one (very experienced) admin who found changes of up to 4 to 6 mutations in some father-son pairs. Would these be an NPE to FTDNA? Yet they were actually father and son, and some of the mutations changed back in the next generation, so I don't take FTDNA "matches or non-matches" as stated but try to look (if possible) at the detail. This admin had his own way of calculating genetic distance (different from FTDNA's), which he found worked a lot better for his group. I read this a while ago, so my memory is vague. I only mention it as it does show that mutations may occur to a greater extent than FTDNA allows for in "matches".