PDA

View Full Version : STR Wars, GDs, TMRCA estimates, Variance, Mutation Rates & SNP counting



Pages : 1 [2]

MJost
06-02-2015, 05:33 PM
I have updated the VBA automatic process to perform lookups of each position and if it matches one of 875 CombBED regions, I post the beginning and ending position with its length. Once done, I remove any region that appears more than once and then calculate the "True CombBED Length". I preloaded it with all the A0-T to R-DF13 positions, along with my own 27 Sanger Sequenced postive SNPs out of 71 original Full Genome HQ Private NGS DF13>FGC5494 sublcade SNPs lineage.

https://drive.google.com/file/d/0By9Y3jb2fORNUndIQnZDVkVhSTQ/view?usp=sharing

Edit: Uploaded a Zipped copy that will keep an error from occurring.

MJost

George Chandler
06-03-2015, 02:52 PM
I have updated the VBA automatic process to perform lookups of each position and if it matches one of 875 CombBED regions, I post the beginning and ending position with its length. Once done, I remove any region that appears more than once and then calculate the "True CombBED Length". I preloaded it with all the A0-T to R-DF13 positions, along with my own 27 Sanger Sequenced postive SNPs out of 71 original Full Genome HQ Private NGS DF13>FGC5494 sublcade SNPs lineage.

https://drive.google.com/file/d/0By9Y3jb2fORNUndIQnZDVkVhSTQ/view?usp=sharing

Edit: Uploaded a Zipped copy that will keep an error from occurring.

MJost

Ok thanks. Could I send you some positions to see which ones pass your validation process? Do you know how many Full Genomes tests have been done using proven ancestral lines?

George

MJost
06-03-2015, 03:10 PM
Ok thanks. Could I send you some positions to see which ones pass your validation process? Do you know how many Full Genomes tests have been done using proven ancestral lines?

George

Sure, send them. I will run them for you.

MJost

MJost
06-04-2015, 10:02 PM
Out of 27 of my new 71 SNPs under DF13, I share all but eight of them with 155812 Ross who his MDKA is from the Highlands. Using STRs he has a GD9 at 67 with me and 13 at 111 markers.

So the ninth SNP we share together. Using the CombBED length calculator I can enter see which SNPs fall outside of the CombBED region. The result is two.

Back nine SNPs to a MRCA, he and I could be as far back as 1,702 ybp = 248 AD (from 1950) Max (Using my recalibrated 190 years per SNP)

Dropping the few (two) SNPs that are outside of the normal 'Good" regions of the Chr Y, leaves seven SNPs back to a most recent ancestor, seven (using known good Chr Y regions) our common grandfather would be back:

1,007 = 943 AD (starting at the year 1950)<< minimum estimate.
95% CI High = 1180.11 ybp =770 AD
95% CI Low = 878.8 ybp =1,071 AD

Remember that the SNP has a 190 year width rate using all SNPs.

Used the Full Y CombBED Length for both: 8473821

The Isle of Man born Watterson (Juan) starts out at four normal 'Good" regions SNPs back at 756 ybp = 1,194 AD. Max.

And the same (4) normal 'Good" regions happen to fit in this calculation" using 143 years per SNP rate.
576 ybp = 1,374 AD (starting at the year 1950)<< minimum estimate.
95% CI High = 674.35 ybp =1,276 AD
95% CI Low = 502.17 ybp =1,448 AD

The Watterson surname has been documented on the Government Key Roles as far back as they have been keeping records, 1417 AD or over 598 years back.

I have a considerable number of Irish in my branch that have earlier STR TMRCAs.

I have to say I and Ross' lines could have split around or shortly after A.D. 400, when the people from Dál Riata began to settle across the Irish Sea along the Scottish coast in County Argyll. Other Irish migrants were also establishing footholds along the coast farther south, as far as Wales and even Cornwall, but the migrants from Dál Riata were known to the Romans as "Scotti" and they would eventually give their Gaelic language and their name to all of what is now known as Scotland. With the IOM being established a little later in history.

MJost

MJost
06-18-2015, 11:32 PM
Over on this thread I posted a time line of events but some missed a key fact.
http://www.anthrogenica.com/showthread.php?4710-R1b-U106-in-Swedish-Battle-Axe-Culture-(a-Corded-Ware-subgroup)&p=90233&viewfull=1#post90233

Lets go over it and discuss the numbers to arrive at a simple manual calculation of a years per SNP and then dating of the "R" tree SNPs.

The cal 24K old Mal'ti boy had five ancestral SNPs to 'R' and 35 derived SNPs for 40 total as per the paper's chart.

'R' to DF13 had 139 SNPs. Add back some number for under DF13 to present, I added my 27 Sanger validated SNPs, for total of 166. There are 40 SNPs from 'R' to a node equal to R*- MA-1 (40) , would be 126 SNPs to present from MA-1.

MA-1's calibrated age of 24000 divided by 126 equals around 190.5 years per SNP.

If 'R' to U106 has 130 as shown in YFull's R tree, the number of SNPs between the two, MA-1 and U106, less 40 equals 90 SNPs. 90 x 190.5 = 17,145 years. 24000 - 17,145 would be 6,855 years before present or 4,9k bc.

Using the five derived SNPs for U106 Rise98 and its calibrated date, pushes the years per SNP higher to 208. It would take around 14 total derived SNPs to maintain my 190.5 years per SNP instead of five. RISE98 of the Swedish Battle Axe culture, a Scandinavian Corded Ware subgroup was buried in Lilla Beddinge in southern Sweden between 2275 BC and 2032 BC.

The Mutation Rate would have to be 6.24E-10 instead of 8.2E-10 using straight counting. The lower years per SNP is inconsistent unless we are going to assume that several (infrequently) SNPs did mutate at one (same) time during the transmission event.




R" tree
Possible Location
YBP
BC from 1950
# of occur


R-P232
Most likely Home territory Lake Bailak
31434
= 29,484 BC
1


PF5953/M764 (R* MA-1 Mal'ti Boy Ancestor Split)

30672
= 28,722 BC
5


R.1-Y482/PF6056/F459
Ma-1 Derived SNP# 18
27243
= 25,293 BC
23


R1-M781/PF6145
Ma-1 Derived SNP# 22
26481
= 24,531 BC
27


40th SNP down from 'R'
(Mal'ti Boy Died at four year old, MA-1 R* equiv node 22000 cal bc 52.9°N 103.5°E near river Belaya that flows into Lake Baikal) Near Lake Bailak, Irkutskaya oblast', Russia
24004
= 22,054 BC
40


R1b-M343/PF6242 (R1a split)
Rudnichnyy rayon, Kemerovo, Kemerovskaya oblast' (1460km west of MA-1 site*)
18099
= 16,149 BC
71


R1b1-M415/PF6251

17908
= 15,958 BC
72


R1b1.1-L389/PF6531

17527
= 15,577 BC
74


R1b1a-P297/PF6398 (V88 Branch split) (block of 3)
Gusinyy Brod, Novosibirskaya oblast' (1679km)
17337
= 15,387 BC
75



(Duplenskaya, Novosibirskaya oblast', Russia (1792km)) (Moving westwards M269 block may have skirted parts of the the southern Urals)
16765
= 14,815 BC
78


S10/PF6399

10479
= 8,529 BC
111


R1b1a2a-L23/S141/PF6534 (Block of 2)
Southwest of Novaya, Respublika Bashkortostan, Russia Chishminsky District 54.679970, 55.325795 (3619km)
9526
= 7,576 BC
116


R1b1a2a1-L51/M412/S167/PF6536 (R-Z2103 branched) (4 SNP block)
Tuymazinsky District Republic of Bashkortostan, Russia 54.520512, 54.028872 near Pervomayskoye, Respublika Bashkortostan, Russia (3708km)
9145
= 7,195 BC
118


R1b1a2a1a-L11/S127/PF6539 ( block of 8 SNPs)
Sergiyevsky District Samara Oblast, Russia
53.928046, 51.332282 (3952km)
8383
= 6,433 BC
122


R1b1a2a1a2-P312/S116/PF6547 (U106 Branched)

6859
= 4,909 BC
130


R1b1a2a1a2c-L21/M529/S145 (first of block of 6)
(Kurgan Wave 1 4400 - 4200 BC)
6478
= 4,528 BC
132


R-DF13/S521/CTS241 ( first of two SNP block)
(Yamnaya I0439 [and I0443] L23+ xL51/M412, Z2105 basal R1b1a* 3305-2925 cal BCE) (Kurgan Wave II 3400-3200bc)(Oldest wooden wheel in the world ever found - in Slovenia dates 3340-3030 cal BC) (Wooden 4-wheel Wagon Kurgan Yamnaya (pit grave) culture c3200-2500bc) (Aegean Bronze Age begins around 3200 BCE)
5335
= 3,385 BC
138


FGC5494s (1st Tier) MJost DF13>FGC5494 subclade
(I0231 Ekaterinovka, Samara grave 1, 2910-2875 cal bc) (Kurgan Wave III 3000-2800bc)
4954
= 3,004 BC
140


FGC5496s (Derived SNP #2)
(Yamnaya phase III 2800bc)(Bell Beaker culture start Central Europe 2800bc)
4764
= 2,814 BC
141





MJost

George Chandler
06-20-2015, 08:30 PM
The average number of Sanger SNP's from the 26 different samples in the R-S1051 Group (using both FGC and Big Y) below DF13 is 32. I'm still using 139 years per Sanger and it's pretty close. What I'm wondering about is the Y Full count that you're using. When you look at S1051 they have S1050 listed (Y Full) but there are anywhere from four to seven reliable Sanger SNP's between the two which are not listed. Are you missing some of the critical SNP's between that just haven't been listed?

George

MJost
06-21-2015, 12:13 AM
The average number of Sanger SNP's from the 26 different samples in the R-S1051 Group (using both FGC and Big Y) below DF13 is 32. I'm still using 139 years per Sanger and it's pretty close. What I'm wondering about is the Y Full count that you're using. When you look at S1051 they have S1050 listed (Y Full) but there are anywhere from four to seven reliable Sanger SNP's between the two which are not listed. Are you missing some of the critical SNP's between that just haven't been listed?

George

I pulled every SNP in the Yfull YTree v3.8 tree down from A0-T to R to DF13, and pulled every Ybrowse SNP and position and matched them up to use in the Mark V's YFull-based SNP Block Age Estimator. YFull may show several equivalent SNP names but I had to uncover which one was actually shown in Ybrowse. I found all but five or so that didn't have positions back to A0-T.

I can get you the list of them if you wish.

MJost





.

George Chandler
06-21-2015, 02:10 AM
I pulled every SNP in the Yfull YTree v3.8 tree down from A0-T to R to DF13, and pulled every Ybrowse SNP and position and matched them up to use in the Mark V's YFull-based SNP Block Age Estimator. YFull may show several equivalent SNP names but I had to uncover which one was actually shown in Ybrowse. I found all but five or so that didn't have positions back to A0-T.

I can get you the list of them if you wish.

MJost





.


No that's ok - I just noticed that they have some unplaced SNP's but many are missing between S1051 and S1050 so I would suspect some of the Sanger SNP's have yet to be added to their list and this could account for part of the shortfall in terms of an SNP count.

George

MJost
06-21-2015, 05:07 AM
On time back I did a cursory count of DF13 subclade SNPs and it was right around 33. A Rho average really needs to be done on those DF13 subclade SNPs. We know that the BigY is pretty clean but way less SNPs than Full NGS results. At around 60 to 70% of a Full Y comparison. All I know is my own FullY SNPs amounted to 71 HQ but 40% proved to be Sanger quality and confirmed and is why I used them to show the total SNPs under MA-1 to present.


So I created this chart of years per SNP for any number of SNPs below DF13 based on my MA-1 calibration with my Sanger Sequence SNPs



First Column is number of SNPs derived under MA-1
2nd Column are years per SNP recalibrate a years per SNP.

For SNPs below DF13, and example would be your total derived SNPs under DF13 branch to present are 35 AND if you have a SGD of 5 SNPs back to a common shared SNPs then 179.1343 would be used per SNP or a TMRCA of 895.6 years.

In my case I have 27 derived SNPs under my DF13 branch which was originally calibrated from MA-1 of 24000 years before present with a per SNP of 190.5079.

My SGD (SNP GD) of 2 difference would have a TMRCA of 381 years. But one of those SNPs could have occurred at the present persons birth so subtract one SNP to get a minimum number of SGD, which would be SGD of one thus a TMRCA of 190.5 years. Since I was born in 1955, then 1955-(190.5 x 2) = 1574 AD max with a range using SGD1 to 1764.5 AD min.

Just something to play with.

98 262.4336 DF13
99 242.4646 CTS8221/Z2542
100 240.0400 Derived SNP #1
101 237.6634 Derived SNP #2
102 235.3333 Derived SNP #3
103 233.0485 Derived SNP #4
104 230.8077 Derived SNP #5
105 228.6095 Derived SNP #6
106 226.4528 Derived SNP #7
107 224.3364 Derived SNP #8
108 222.2593 Derived SNP #9
109 220.2202 Derived SNP #10
110 218.2182 Derived SNP #11
111 216.2523 Derived SNP #12
112 214.3214 Derived SNP #13
113 212.4248 Derived SNP #14
114 210.5614 Derived SNP #15
115 208.7304 Derived SNP #16
116 206.9310 Derived SNP #17
117 205.1624 Derived SNP #18
118 203.4237 Derived SNP #19
119 201.7143 Derived SNP #20
120 200.0333 Derived SNP #21
121 198.3802 Derived SNP #22
122 196.7541 Derived SNP #23
123 195.1545 Derived SNP #24
124 193.5806 Derived SNP #25
125 192.0320 Derived SNP #26
126 190.5079 Derived SNP #27
127 189.0079 Derived SNP #28
128 187.5313 Derived SNP #29
129 186.0775 Derived SNP #30
130 184.6462 Derived SNP #31
131 183.2366 Derived SNP #32
132 181.8485 Derived SNP #33
133 180.4812 Derived SNP #34
134 179.1343 Derived SNP #35
135 177.8074 Derived SNP #36
136 176.5000 Derived SNP #37
137 175.2117 Derived SNP #38
138 173.9420 Derived SNP #39
139 172.6906 Derived SNP #40
140 171.4571 Derived SNP #41
141 170.2411 Derived SNP #42
142 169.0423 Derived SNP #43
143 167.8601 Derived SNP #44
144 145.5278 Derived SNP #45
145 165.5448 Derived SNP #46
146 164.4110 Derived SNP #47
147 163.2925 Derived SNP #48
148 162.1892 Derived SNP #49
149 161.1007 Derived SNP #50
150 132.0867 Derived SNP #51
151 158.9669 Derived SNP #52
152 157.9211 Derived SNP #53
153 156.8889 Derived SNP #54
154 155.8701 Derived SNP #55
155 154.8645 Derived SNP #56
156 153.8718 Derived SNP #57
157 152.8917 Derived SNP #58
158 151.9241 Derived SNP #59
159 150.9686 Derived SNP #60
160 150.0250 Derived SNP #61
161 149.0932 Derived SNP #62
162 148.1728 Derived SNP #63
163 147.2638 Derived SNP #64
164 146.3659 Derived SNP #65
165 145.4788 Derived SNP #66
166 144.6024 Derived SNP #67
167 143.7365 Derived SNP #68
168 142.8810 Derived SNP #69
169 142.0355 Derived SNP #70
170 141.2000 Derived SNP #71




MJost

MJost
07-14-2015, 11:09 PM
R SNPs YFull Experimental YTree v3.12 on 02 July 2015 R-tree have been updated.

According to YFull, my Full Genome Chr Y has 1756 SNP A0-T to P.
YFull reports my R to DF13 line has 283 SNPs.
Full Genome reported that I have 71 high quality SNPs below DF13.
All my R SNPs total 354 SNPs.

Utilizing my modded YFull-based SNP Block Age Estimator,
My CombBED R SNPs total is 200.

Setting the following factors:
MJost Full Y - NGS Length coverage for age: 8402506
Rate Constant of SNP Mutations: 8.20E-10
Results in a SNP Mutation Rate: 145.14 years per SNP.

(Using Full Y - NGS Genome Length Coverage: 8473821 results in 143.92
years per SNP or BigY - NGS Genome Length Coverage: 7600000 results in
160.46 years per SNP)

R has a 'Calculated Age of Phylogenetic Block ybp': 29,027 (27,077 BC)
95% CI High = 34003.46 ybp =32,053 BC
95% CI Low = 25321.73 ybp =23,372 BC

Recalibration the R tree with R0-MA1 has five ancestral SNPs back to R
and has 35 derived SNPs for 40 SNPs below R. Assuming these 40 SNPs fit
into the CombBED ranges, leaves 160 SNPs starting at around 22000 cal BC
to present.

Considering 24004 ybp (age at death) divided by 160 SNPs equates to 150
years per SNP back up to 1946AD.

145 to 150 years per CombBed SNP.

R-P312 or 46 ComBED SNPs back from present is
~4,804 BC using 150 years per SNP or
~4,670 BC using 145 years.

Michal posted that he believes that 'U106 diverged most likely between
6500 and 5000 years ago, probably within the 6000-5500 BP time frame.'
(or 4000-3500bc)

http://www.anthrogenica.com/showthread.php?4710-R1b-U106-in-Swedish-Battle-Axe-Culture-(a-Corded-Ware-subgroup)&p=95592&viewfull=1#post95592

DF13 is five SNPs below P312 using CombBED list of SNPs, would have
spawned ~4,054 BC.

R to present counting years: 27,854 bc

MJost

Chad Rohlfsen
07-15-2015, 12:07 AM
That seems pretty old. Yfull has L23 younger than your P312.

MJost
07-15-2015, 12:43 AM
That seems pretty old. Yfull has L23 younger than your P312.I cant agree with their SNP TMRCA's.

MJost

Chad Rohlfsen
07-15-2015, 01:11 AM
I cant agree with their SNP TMRCA's.

MJost

They have L23 forming at 4400 BCE, while you have P312 up to 4800BCE. I wasn't referring to TMRCA. Do you not agree with that date?

MJost
07-15-2015, 01:36 AM
They have L23 forming at 4400 BCE, while you have P312 up to 4800BCE. I wasn't referring to TMRCA. Do you not agree with that date?no, I don't believe YFull is correct.

I am using MA-1 calibration. So MA-1 SNPs could be incorrect for SNP counting. Using both CombBED aging and the same list of SNPs counted, they are close to each other. Not too exact but close. These are the CombBed Qualified SNPs only between L23 and P312, ten between them. L23 would be 6,304 BC


R-L23-L23/S141/PF6534
R-L51-L51/M412/S167/PF6536
L11-L11/S127/PF6539
L11-L151/PF6542
L11-L52/PF6541
L11-P311/S128/PF6545
L11-YSC0000191/PF6543/S1159
L11-CTS7650/PF6544/S1164
L11-P310/S129/PF6546
L11-YSC0001249/CTS10353/S1175
R-P312-P312/S116/PF6547


MJost

MJost
07-15-2015, 05:31 AM
Running the entire tree from A0-T, I post below each derived branch and its age
using these constants:

Rate Constant of SNP Mutations: 8.20E-10
CombBED Length: 8402506 <<< MJost Full Y - NGS Length coverage for age:

Resulting in a years per SNP Mutation Rate: 145.14 years

As before I re-extracted and counted all my (YF03272) YReport which is from YFull's Experimental YTree v3.12 dated as of 02 July 2015

SNPs and Ages from my Full Genome NGS are as follows:

Number of A0-T to present SNPs in CombBED Region: 1428/(out of 2110)
Calculated Age of Phylogenetic Block ybp: 207,255 ybp = 205,305 BC
95% CI High = 242784.7 ybp = 240,835 BC
95% CI Low = 180797.1 ybp = 178,847 BC

A1 SNPs: 926/1577 | 134,397 ybp = 132,447 BC

A1b SNPs: 882/1462 | 128,011 ybp = 126,061 BC

BT SNPs: 857/1415 | 124,382 ybp = 122,432 BC

CT SNPs: 587/1009 | 85,195 ybp = 83,245 BC

CF SNPs: 412/690 | 59,796 ybp = 57,846 BC

F SNPs: 408/686 | 59,216 ybp = 57,266 BC

GHIJK SNPs: 307/524 | 44,557 ybp = 42,607 BC

HIJK SNPs: 304/520 | 44,122 ybp = 42,172 BC

IJK SNPs: 303/519 | 43,976 ybp = 42,026 BC

K SNPs: 298/514 | 43,251 ybp = 41,301 BC

K(xLT SNPs: 285/514 | 41,364 ybp = 39,414 BC

MP SNPs: 284/496 | 41,219 ybp = 39,269 BC

P SNPs: 282/494 | 40,929 ybp = 38,979 BC

R SNPs: 200/354 | 29,027 ybp = 27,077 BC

R-Y482 SNPs: 184/332 | 26,705 ybp = 24,755 BC

R1-M781/PF6145 SNPs: 182/328 | 26,415 ybp = 24,465 BC

R1b-M343/PF6242 SNPs: 157/284 | 22,786 ybp = 20,836 BC

R1b1-L278 SNPs: 131/236 | 19,013 ybp = 17,063 BC

R-L389-L389/PF6531 SNPs: 130/234 | 18,868 ybp = 16,918 BC

R-P297-P297/PF6398 SNPs: 128/232 | 18,578 ybp = 16,628 BC

R-M269-M269/PF6517 SNPs: 114/183 | 16,546 ybp = 14,596 BC

R-L23-L23/S141/PF6534 SNPs: 56/183 | 8,128 ybp = 6,178 BC

R-L51-L51/M412/S167/PF6536 SNPs: 55/98 | 7,983 ybp = 6,033 BC

L11-L11/S127/PF6539 SNPs: 54/93 | 7,837 ybp = 5,887 BC

R-P312-P312/S116/PF6547 SNPs: 46/81 | 6,676 ybp = 4,726 BC

R-L21-L21/M529/S145 SNPs: 45/79 | 6,531 ybp = 4,581 BC

R-DF13-DF13/S521/CTS241 SNPs: 41/73 | 5,951 ybp = 4,001 BC


MJost

MJost
07-15-2015, 03:06 PM
@Chad

I added YFull's formed age along with what the age would be if 150 years per SNP counting calibrated with MA-1.

Number of A0-T to present SNPs in CombBED Region: 1428/(out of 2110)
Calculated Age of Phylogenetic Block ybp: 207,255 ybp = 205,305 BC
95% CI High = 242784.7 ybp = 240,835 BC
95% CI Low = 180797.1 ybp = 178,847 BC

YFull formed 235,500 ybp < appears to be using ~162.64 years per SNP from 1428 total CombBED SNPs

(@150 years per SNP from R-MA-1 calibrated Age = 214200 = 212,250 BC
95% CI High = 242784.71 ybp =240,835 BC
95% CI Low = 180797.13 ybp =178,847 BC)

A1 SNPs: 926/1577 | 134,397 ybp = 132,447 BC
YFull formed 145,800 ybp (@150 = 138,900)

A1b SNPs: 882/1462 | 128,011 ybp = 126,061 BC
YFull formed 132,800 ybp (@150 = 132,300)

BT SNPs: 857/1415 | 124,382 ybp = 122,432 BC
YFull formed 126,600 ybp (@150 = 128,550)

CT SNPs: 587/1009 | 85,195 ybp = 83,245 BC
YFull formed 84,100 ybp (@150 = 88,050)

CF SNPs: 412/690 | 59,796 ybp = 57,846 BC
YFull formed 68,200 ybp (@150 = 61,800)

F SNPs: 408/686 | 59,216 ybp = 57,266 BC
YFull formed 65,400 ybp (@150 = 61,200)

GHIJK SNPs: 307/524 | 44,557 ybp = 42,607 BC
YFull formed 48,100 ybp (@150 = 46,050)

HIJK SNPs: 304/520 | 44,122 ybp = 42,172 BC
YFull formed 47,900 ybp (@150 = 45,600)

IJK SNPs: 303/519 | 43,976 ybp = 42,026 BC
YFull formed 47,900 ybp (@150 = 45,450)

K SNPs: 298/514 | 43,251 ybp = 41,301 BC
YFull formed 46,400 ybp (@150 = 44,700)

K(xLT SNPs: 285/514 | 41,364 ybp = 39,414 BC
YFull formed 44,500 ybp (@150 = 42,750)

MP SNPs: 284/496 | 41,219 ybp = 39,269 BC
YFull formed 43,800 ybp (@150 = 42,600)

P SNPs: 282/494 | 40,929 ybp = 38,979 BC
YFull formed 43,800 ybp (@150 = 42,300)

R SNPs: 200/354 | 29,027 ybp = 27,077 BC
YFull formed 31,400 ybp (@150 = 30,000)

R-Y482 SNPs: 184/332 | 26,705 ybp = 24,755 BC

R1-M781/PF6145 SNPs: 182/328 | 26,415 ybp = 24,465 BC
YFull formed 27,600 ybp (@150 = 27,300)

R1b-M343/PF6242 SNPs: 157/284 | 22,786 ybp = 20,836 BC
YFull formed 22,100 ybp (@150 = 23,550)

R1b1-L278 SNPs: 131/236 | 19,013 ybp = 17,063 BC
YFull formed 18,600 ybp (@150 = 19,650)

R-L389-L389/PF6531 SNPs: 130/234 | 18,868 ybp = 16,918 BC
YFull formed 16,600 ybp (@150 = 19,500)

R-P297-P297/PF6398 SNPs: 128/232 | 18,578 ybp = 16,628 BC
YFull formed 15,800 ybp (@150 = 19,200)

R-M269-M269/PF6517 SNPs: 114/183 | 16,546 ybp = 14,596 BC
YFull formed 13,300 ybp (@150 = 17,100)

R-L23-L23/S141/PF6534 SNPs: 56/100* | 8,128 ybp = 6,178 BC
YFull formed 6,400 ybp (@150 = 8,400)
(*Corrected typo number of total SNPs to present)

R-L51-L51/M412/S167/PF6536 SNPs: 55/98 | 7,983 ybp = 6,033 BC
YFull formed 6,100 ybp (@150 = 8,250)

L11-L11/S127/PF6539 SNPs: 54/93 | 7,837 ybp = 5,887 BC
YFull formed 5,600 ybp (@150 = 8,100)

R-P312-P312/S116/PF6547 SNPs: 46/81 | 6,676 ybp = 4,726 BC
YFull formed 4,900 ybp (@150 = 6,900)

R-L21-L21/M529/S145 SNPs: 45/79 | 6,531 ybp = 4,581 BC
YFull formed 4,600 ybp (@150 = 6,750)

R-DF13-DF13/S521/CTS241 SNPs: 41/73 | 5,951 ybp = 4,001 BC
YFull formed 4,500 ybp (@150 = 6150)

MJost

MJost
07-16-2015, 08:14 PM
I charted the three different ages, CombBED age using 145.14 years per SNP, YFULL formed and my MA-1 Calibrated @150 years per SNP from A0-T to R-DF13.

https://drive.google.com/file/d/0By9Y3jb2fORNcUktYmU4ZUVsT1k/view?usp=sharing

MJost

MJost
07-24-2015, 07:04 PM
In order to improve my SNP counting in respect to using my own 40 of 71 DF13>FGC5494 SNPs CombBED qualified, I went ahead and complied a list from AlexW's L21 'The Big Tree' of n=741 DF13 kits in his spreadsheet months earlier. Today I did some further calculations today. In the chart link below I complied a Frequency histogram, which I may have posted elsewhere, which shows a well defined bell curve with a peak at around 50 SNPs. I then actually went ahead and performed an Actual Frequency percentage and a normal distribution of frequency percentage analysis for these same DF13 SNP counts from 741 kits.

This results appear to follow a relatively normal distribution curve.




N=741 All SNP ranges
N=598 (Counts with only 41 to 69 SNP range)


Mean
54.95
51.30


STD Dev
13.32
0.23


Max
68.26
51.53


Min
41.63
51.06





https://drive.google.com/file/d/0By9Y3jb2fORNbTc3MENUZURuWVE/view?usp=sharing


MJost

MJost
07-24-2015, 10:51 PM
Using 51 (narrowed SD) as the number of SNPs below DF13 plus 160 R to DF13 CombBED SNPs less 40 MA-1 SNPs totals 211 back from present SNPs to four year old boy MA-1 age cal to 24000 ybp equals 171 SNPs. 24004 divided by 171 equals 140.374 rounded to 140.4 years per SNP.

Using the full mean of 54.95 DF13 SNPs for 741 kits and the 160 R to DF13 SNPs, less 40 MA-1 adds up to 174.95 SNPs for 137.2 years per SNP.

Using the Adamov_et al. (2015) paper calculation:
Full Y - NGS Genome Length Coverage: 8473821
Rate Constant of SNP Mutations: 8.20E-10
Results in a SNP Mutation Rate: 143.92 years per SNP

And if we consider that we are using CombBED ranges which the BigY results fit, then the BigY Y - NGS Genome Length Coverage would stay the same: 7600000 then the mutation rate must change to achieve the 140.4 years per SNPs calibrated to MA-1. It would need adjusted to a slightly faster rate of 9.40E-10 in order to get down to ~140 years per SNP or even faster to match the 54.95 SNPs under DF13 a mutation rate of 9.59E-10.

So how old is DF13, or P312 for that matter, are we talking using these new numbers of NGS SNPs under DF13???

DF13 son of L21 son of P312 age using a mean of 54.95 SNPs from present at 137.2 equals 7539.4 ybp or (5,539.4 - 60) 5,479 BC

MJost

alan
07-28-2015, 03:51 PM
I just looked at the full y http://www.yfull.com/tree/R1b/ and a couple of thinks stand out. If the dates are taken literally (which obviously they shouldnt be) then putting the dates for P312, and DF27 are lets say at least 150 years younger than some of the RC dates that have been claimed for beaker there.

While that could be down to error ranges, there is one more fundamental issue re P312 and beaker and Iberia. RC dates in central Europe for beaker (except dodgy ones) seem to not much predate 2500BC This may well mostly have been U152 and therefore the suggested date for this and the beaker RC dates are pretty closely aligned. However, Iberian beaker has been suggested to have started as early as 2800 or 2750BC by some authors recently reviewing them. It seems to me that there is no up to 300 years age between DF27 (the dominant Iberian P312) and U152 to match the difference in RC dates in SW and central Europe.

MJost
07-28-2015, 08:52 PM
We accept the hard RC dating for Ancient remains. But ChrY DNA aging methods such as counting seems to be a tool with incomplete results when considering how many SNPs we have now under DF13 running a mean average of 55, with any sample having 41 to 68 SNPs and these fit the the CombBED (CB) ranges criteria. If we add the five CB SNPs back to P312 range would be 46 to 73 SNPs to present at 1 sigma per any NGS test. When using NGS results, calculating a years per SNP number using a ChrY length for age number, everyone seems to be in each others ballpark. So these Ancient DNA samples will need to be tested using NGS method to compare apples to apples. With NGS we can not know how many SNPs we are missing to that RC date and it maybe that these could be daughter'ed out lineages not derived into any current branch. We already know the NGS list of SNPs under R and where MA-1 fits and the remaining SNPs down to DF13 that are 160 CB SNPs less MA-1's 40 SNPs under R. Then you add back the below DF13 SNPs taking the total into MA-1 RC 24000 ybp age for a years per CB SNP. (Noting that some of MA-1's SNPs may not match the CombBED ranges.)

I previously showed that three source generally follow the same age lines overall. Here are the adjusted ages of back to DF13 using MA-1 calibrated. The less SNPs under DF13 the larger the age as the SNPs at DF13 back to R do not change.

1 sigma range of NGS SNPs reported under DF13, MA-1 calibrated, produce:

41 SNPs under DF13 at 149.1 years per SNP = DF13 age of 6113.1 ybp
46 SNPs = 144.6 = 6651.6
51 SNPs = 140.0 = 7140
55 SNPs = 137.2 = 7546
68 SNPs = 127.7 = 8683.6

Maybe we have to have a fudge factor?

Using a Age length if the ChrY times the various studies' mutation rate of the ChrY to get a Years per SNPs is close.

I know!!!! What a mess this making soup thing has become. What is wrong here?

MJost

dp
07-28-2015, 09:25 PM
Mark,
Perhaps you need an older calibration. Are there any Chimp NGS studies that could identify the numbers of SNPs that have occured in the last 6 million years? That may be a bit extreme but they are getting autosomal DNA from Neanderthal's now, what about Y-DNA?
One reason I ask is maybe the mutation rate in the large scale is non-linear.
dp

MJost
07-28-2015, 09:51 PM
Mark,
Perhaps you need an older calibration. Are there any Chimp NGS studies that could identify the numbers of SNPs that have occured in the last 6 million years? That may be a bit extreme but they are getting autosomal DNA from Neanderthal's now, what about Y-DNA?
One reason I ask is maybe the mutation rate in the large scale is non-linear.
dp

Well, I have all the SNPs down from A0-T to R-DF13 from YFull's NGS built tree. I don't think the mutation rate from A0-T to present has changed especially in the combBED ranges used. That why they are used, they are very stable areas. I doubt the rate would be different when converted to a yearly mutation rate.

I know other have used other ancient remains the Ust-Ishim sample who I think is 45K years old and had good sequencing. If someone has a count of the derived SNPs back to the R common ancestor it could be done with no problem. Double the number of years approximately would help. But need a source Calibrated RC dating as well.

MJost

MJost
07-31-2015, 12:23 AM
I have recalibrated my years per SNP counting method using the Ust-Ishim sample date and Adamov's reported placement of this ancient DNA on the Tree, eleven SNPs derived under K(xLT) to R-DF13 to present using CombBED qualified SNPs (245 out of 425) from the SNPs listed in the YFull Experimental YTree v3.12 at 02 July 2015. Using the mean number of BigY SNPs of 55, totals 295 SNPs under K(xLT) to R-Present.

Information link here:

Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data
Adamov et al 2015

http://rjgg.molgen.org/index.php/RJGGRE/article/view/151/175

"Calibration using the Ust-Ishim sample

According to radiocarbon dating reported by Fu et al. (2014), Ust-Ishim Man (UIM) lived about 45 thousand years ago. On the family tree below, he is at the beginning of branch KM2335 (see Fig. 2). Fu et al. (2014) studied the full genome of the ancient sample. The presence of a mutation at M526 placed UIM at the beginning of haplogroup K(xLT). Detailed examination of derived options revealed a later mutation (M2308/Z4842). Studying the BAM file, we identified 11 mutations (including M2308), which emerged after M526 in the combBED area (total length of 8.463 Mbp). This number was used to calculate the average mutation rate."

The results are:

Calibration using the Ust-Ishim sample SNPs
41 SNPs under 157.1 years per SNP
46 SNPs = 154.3
51 SNPs = 151.6
55 SNPs = 156.25 Mean number of SNPs below DF13 via BigY 1 Sigma 41 to 68 SNPs range)
68 SNPs = 143

K(xLT) to present years 46,719 = 44,769
Ages back to A0-T here:



1
A0-T
1
225,000
223,050 BC


503
A1
1
146,875
144,925 BC


547
A1b
1
140,000
138,050 BC


572
BT
1
136,094
134,144 BC


842
CT
1
93,907
91,957 BC


1017
CF
1
66,563
64,613 BC


1021
F
1
65,938
63,988 BC


1122
GHIJK
1
50,157
48,207 BC


1125
HIJK
1
49,688
47,738 BC


1126
IJK
1
49,532
47,582 BC


1131
K
1
48,750
46,800 BC


1144
K(xLT)-M526/PF5979/
1
46,719
44,769 BC


1145
MP
1
46,563
44,613 BC


1147
P
1
46,250
44,300 BC


1229
R-P232
1
33,438
31,488 BC


1247
R1-M781/PF6145
1
30,625
28,675 BC


1272
R1b-M343/PF6242
1
26,719
24,769 BC


1315
R-M269-M269/PF6517
1
20,000
18,050 BC


1373
R-L23-L23/S141/PF6534
1
10,938
8,988 BC


1374
R-L51-L51/M412/S167/PF6536
1
10,782
8,832 BC


1375
L11-L11/S127/PF6539
1
10,625
8,675 BC


1383
R-P312-P312/S116/PF6547
1
9,375
7,425 BC


1384
R-L21-L21/M529/S145
1
9,219
7,269 BC


1388
R-DF13-DF13/S521/CTS241
1
8,594
6,644 BC






MJost

Mac von Frankfurt
07-31-2015, 12:57 PM
Given how far forward in time SNPs have come over the last two years I am curious how your opinion on the value of STR vs SNP might have changed.



This may seem a little off track, but bear with me. This is about understanding MRCAs....

What's the value of a haplogroup?

What's the value of an SNP?

You might be surprised to hear me say this but I think there is very little value in haplogroups and SNPs
... at least in and of themselves.

A haplogroup is just a group of people with a common ancestor.

An SNP is just a single nucleotide polymorphism, a mutation, that marks the group of people with a common ancestor. It is just a signpost on a branch of the human family tree. The true nature of the haplogroup of people, any commonality in culture, location, etc., many not align with the SNPs have marked the lineages. The SNP could mark either a subset or superset of the true group of people we care about.

This gets into some notions about value and philosopical concerns, but these are the points I'm getting at.

1) I do not care too much about all of the extinct lineages of mankind. There are many, many extinct lineages. On the Y chromosome/paternal side probably there are many, many more lineages that have gone extinct than those who survive.

2) I do care about how we got here and how, where, when and why they did what they did to get us to where we are today.

I think these notions are just conveying that what many hobbyists may care about most is the connection to genealogy and deeper ancestry.... and specifically our ancestry.

The net is that the most recent common ancestors (MRCAs) of the various branches remaining today (and in recent history) are critical people to try understand. The more MRCAs we can understand better at more layers and branches in the tree, then the more we have a chance to understand our ancestry.

I am not saying that all of the old extinct lineages were not important people or that SNPs are useless. I'm just trying to say they are most important in how they help us understand who we are and how we got here. They are just bread crumbs from an old trail.

Superconducting supercolliders smash atoms and look at the residue of the accident to try to get more detail on the characteristics of the atom. In the case of genetics; the accidents, bottlenecks, growth spurts, etc. have already taken place but, likewise, we are looking at the residue to try to ascertain what happened.

I don't care when an SNP first occurred. I care about the expansion and movements of my ancestry. The SNP marked haplotroup ages may help put a maximum age in place for my ancesty. That's good, but it's not really the haplogroup I'm after.


P.S. Science may be interested in who the genetic Adam was or wasn't and some other things. That's fine with me but I'm really after understanding how we, the survivors, got here.

MJost
07-31-2015, 02:11 PM
I sure Mike will agree, a lot has changed since his post back on 05-15-2013. To understand where we are today we have to have a precise understanding of where we have been in the SNP side of things with the advent of NGS results providing broad sweeping changes to the Y tree. Its the trek that matters.

MJost

MJost
08-06-2015, 08:56 PM
In previous I had post I had ran some SNP counts under DF13 but later I determined that I was including DF13 and L21's block of SNPs so the mean number should be lower. I wont re-calculated it but its off a tad. Using SNP counting I used only CombBED ('CB' SNPs from the Chr Y Tree back to K(xLT) to re-calibrate with Ust-Ishim - 45000 ybp, I didn't have the CB SNPs for DF13 and below and I used the Mean count from AlexW's. DF13_D(erived)_All, n=767, Aves_SNP=47.9 (Previously 55).

DF13_and_D(erived)_CB_All, n=767, Ave count: SNPs that fit the CombBED ranges = 31.1 +-5.2 or (range of 25.9 to 36.3 CB SNPs at DF13 and below). Now we can add apples to apples for a more correct years per SNP calibration back to 45000 ybp from present on the R-DF13 branch, recalibration is 171.1 years per CB SNP.
(167.9 to 174.4 range (26 to 36 CB SNPs))

This was complied from a more recent copy I had of AlexW's BigTree spreadsheet dated near the end of January 2015 before he converted his Excel data into an other format. I converted it to show each position and marked it with a "#" as the first character if it matched the 875 CombBED regions. From there I count the total number of CombBED ('CB" SNPs under L21. Below is the CB Counts and stats of DF13 subclades as well.


MJost

MJost
08-06-2015, 08:58 PM
Data posted here from my previous post


End of January's BigTree data.



P312 Derived All
n=
886



Ave
52.6



AVEDEV
9.8







DF13 Derived All
n=
767




Ave
47.9



AVEDEV
9.9



Min
26.0



Max
90



Median
44








DF13 D CB Only
n=
767



Ave
31.1




AVEDEV
5.2



Min
16.0



Max
56



Median
30







DF63 D CB
n=
35



Ave
31.1



AVEDEV
4.5




Min
16.0



Max
47



Median
32







DF13>DF49 CB
n=
101



Ave
41.0




AVEDEV
6.7



Min
22.0



Max
56




Median
43







DF13>L1335 CB
n=
62



Ave
29.5



AVEDEV
3.0



Min
21.0



Max
37




Median
29







DF13>L513 CB
n=
97



Ave
147420.9



AVEDEV
86582.3




Min
4479.0



Max
352942



Median
143683







DF13>Z251 CB
n=
34




Ave
28.0



AVEDEV
3.4



Min
21.0



Max
39




Median
28







DF13>Z253 CB
n=
83




Ave
29.5



AVEDEV
4.1



Min
19.0




Max
40



Median
29







DF13>Z255 CB
n=
45




Ave
26.5



AVEDEV
2.9



Min
20.0




Max
37



Median
26







DF13>DF21 CB
n=
122




Ave
30.7



AVEDEV
3.4



Min
18.0




Max
40



Median
31







DF13>FGC11134
n=
67



Ave
26.6



AVEDEV
2.3




Min
18.0



Max
34



Median
27







DF13>FGC5494
n=
27




Ave
30.7



AVEDEV
4.9



Min
21.0



Max
42



Median
31







DF13>DF41
n=
56



Ave
35.0



AVEDEV
4.1



Min
0.0



Max
44




Median
36







DF13>M222 only
n=
484



Ave
28.7



AVEDEV
4.0



Min
13.0




Max
46



Median
29







DF13>49xM222 only
n=
30



Ave
30.1



AVEDEV
3.4




Min
22.0



Max
39



Median
30.5




Some DF27 'CB' numbers counted below P312


P312>




DF27>Z195 D CB
n=
46



Ave
31.3



AVEDEV
3.8




Min
18.0



Max
45



Median
31.5







DF27>ZZ12 D CB
n=
23



Ave
28.6




AVEDEV
6.6



Min
17.0



Max
47



Median
25








DF27>ZZ12>ZZ19 D CB
n=
20



Ave
25.7



AVEDEV
3.9




Min
14.0




Max
37



Median
25.5






MJost

MJost
08-07-2015, 07:47 PM
So lets look at what the ALL SNPs and only CB SNPs counts I produced previously to do further SNP Age calculations.

I have been intending to get the below numbers for months but needing to have the DF13 and below SNPs counted with the CombBED ('CB') counts identified to average them, which I just did previously as already posted from the last copy I had of AlexW's Jan 2015 BigTree spreadsheet which is no longer being produced. This was very time consuming running each of the 886 kits list of SNPs below P312 and DF13 to determine which ones were CB qualified via several different spreadsheets using macros after I manually selecting first and last SNPs in column then having macros copy and paste between two spreadsheets. AlexW is looking at possiblity of identifing CB SNPs within his webpage in the future.

Method #1 Counting ALL SNPs

Using "DF13 and below ALL" 397 SNPs from K(xLT), to the end of the L21 block of SNPs, subtracting 11 SNPs down from K(xLT) to 45k year old Ust-Ishim and adding from my list of DF13 ALL's which have an average of 47.9 +- 9.9 (1 sigma range) of SNPs from the Jan '15 BigTree produces 98.0 years per SNP, calculates a SD range of 95.95 to 100.22 years per SNP. Using DF13 and derived All, n=767, Ave 47.9, AVEDEV 9.9. K(xLT) to present years: 46,078 YBP.

Method #2 counting CombBED only SNPs

Counting SNPs using only the 45K year old Ust-Ishim to present CB SNPs it calculates to 171.1 years per SNP back to present, with 233 CB SNPs down to the end of L21 Block and the 31 mean SNPs at or below DF13, total CB SNPs = 264.


Method #2 compared to Adamov et al. 2015

Next using CombBED ('CB') SNPs is a common denominiator for comparing any NGS results, from the study "Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data" Adamov_et al. (2015) http://ru.rjgg.org

I used their SNP Age formula from the paper using my modified DaveV SNP Block Age Estimator spreadsheet.

I then calculated a the total CombBEDxdups Length from the A0-T to present SNPs totaled 7077022 bp (see below), and using the resultant Years per CB SNP, the 45K year old Ust-Ishim's age is calculates to 45,492 ybp = 43,542 BC. The Adamov_et al. (2015) paper states that the BigY - NGS Genome Length Coverage is 7600000, but this figure is a little to long for my CB Aging into R-L21 Tree use.

In order to calculate a new CombBEDxdups Length: 7077022, I added DF13* Hughes #39420 31 CB SNPs up to the L21 block and the remaining CB's back to A0-T (total 1418 CB SNPs) to create a bench mark length from the SNP Block Age Estimator spreadsheet.

AND with the above new length, also using Adamov_et al. (2015) Rate: 8.2E-10 results in a SNP Mutation Rate equaled 172.3 years per SNP, calculating Age

back to the A0-T, the Phylogenetic Block ybp of:
244,350 = 242,400 BC.
95% CI High = 286,238.23 ybp =284,288 BC
95% CI Low = 213,156.13 ybp =211,206 BC

Using the new CB Length of 7,077,022, the mutation rate would need to be at 8.25E-10 vs 8.2E-10 to be closer to being equal the above SNP counting result of 171.1 which actually calculates to 171.3 years per SNP, close enough.

Using the closer 8.25E-10 rate for Ust-Ishim calculates to 45,217 ybp = 43,267 BC instead of the above 45,492 ybp = 43,542 BC, and for an even finer grain adjustment using a rate of 8.29E-10 produces 45K ybp.

Now here are the ages of P312 and DF13 at both Counting Age methods 1 & 2.

Method #2-CB only: 171.1 Years per SNP
P312: 4,039 BC
DF13: 3,184 BC +-855 (1 SD)

Method #1-ALL: 98.0 Years per SNP
P312: 3,360 BC
DF13: 2,576 BC +- 980 (1 SD)

Difference in BC range between each method:
DF13: 608 years is well within one STD Devation of each method

Using ALL the SNPs and not just CB SNPs for true Counting Age makes the most sense for me to completely date DF13 and then use the CB years per SNP when comparing various NGS results such a BIGY vs FGC on a level playing field.

ALL Counted SNP calibrated to Ust-Ishim, BC from 1950AD
*Note: my last DF13 STR age was ~1,750 BC



First tier subclade DF13
2380 BC


DF13(BigTreeMean-47 below)*
2576 BC +-980* (1 SD)


P312
3360 BC


L11-L11/S127/PF6539
4536 BC


R-L51-L51/M412/S167/PF6536
5026 BC


R-L23-L23/S141/PF6534
5222 BC


R-M269-M269/PF6517
13356 BC


R-P297-P297/PF6398
18158 BC


R1b1-L278
18550 BC


R-MA-1-24Kybp_5Anc/78Derived
22078 BC


R1b-M343/PF6242
23254 BC


R1-M781/PF6145
27566 BC


R-Y482-Y482/PF6056/F459
27958 BC


R-P232
30114 BC


Ust-Ishim_11d_below_K(xLT)
43050 BC


P-P237
43540 BC


K(xLT)-M526/PF5979/
44128 BC





MJost

kinman
08-08-2015, 03:15 AM
Hi MJost,
Thanks for all that data. I am not only interested in the timing when these haplogroups originated, but also where. In particular, the big gap between R-M269 and R-L23, which is about 7000 years according to YFull and about 8000 years according to your figures. There are apparently over 80 SNP mutations arising between the origins of those two haplogroups, so 8,000 years sounds about right (at 100 years per mutation).
My first question is WHERE these two haplogroups originated. If M269 originated in central Kazakhstan (just an educated guess), how far west might the origin of L23 be (perhaps western Kazakhstan)? And my second (somewhat related) question, is why we haven't found a lot of side branches coming off during that 8,000 years? Are all those side branches now extinct, or have we not looked in the right places to find living descendants of some of those side branches? And (3rd question) if most or all of those side branches do turn out to be extinct, what happened to cause so many of those branches to go extinct? Intense period of warfare?, disease?, or something else wiping out large numbers of the descendants of M269 during that time period?
------------Ken Kinman

MJost
08-09-2015, 08:36 PM
Hi MJost,
Thanks for all that data. I am not only interested in the timing when these haplogroups originated, but also where. In particular, the big gap between R-M269 and R-L23, which is about 7000 years according to YFull and about 8000 years according to your figures. There are apparently over 80 SNP mutations arising between the origins of those two haplogroups, so 8,000 years sounds about right (at 100 years per mutation).
My first question is WHERE these two haplogroups originated. If M269 originated in central Kazakhstan (just an educated guess), how far west might the origin of L23 be (perhaps western Kazakhstan)? And my second (somewhat related) question, is why we haven't found a lot of side branches coming off during that 8,000 years? Are all those side branches now extinct, or have we not looked in the right places to find living descendants of some of those side branches? And (3rd question) if most or all of those side branches do turn out to be extinct, what happened to cause so many of those branches to go extinct? Intense period of warfare?, disease?, or something else wiping out large numbers of the descendants of M269 during that time period?
------------Ken Kinman

I responded over here on this thread.

http://www.anthrogenica.com/showthread.php?3882-R1b-and-its-sibling-R1a-possible-route(s)-into-Europe&p=101144&viewfull=1#post101144

MJost

haleaton
08-09-2015, 09:21 PM
FGC & YFull analysis provides STR results from NGS sequencing of our BAM data and and also processes it relative to the Reference Sequence combining the various reads versus position.

LobSTR needs the raw BAM file to calculate STRs but are there STR count values for the processed sequence or the Reference Sequence itself? Is the STR data not part of the SNP data?

Can a SNP mutation change the STR count?

Sorry this is so basic a question.

MJost
08-10-2015, 12:31 AM
In the previous post I showed the CB counts of DF13 subclades and DF27. Here are the total counts. These you may possibly consider using to adjust the years per SNP for each specific subclade which maybe due to different average age per generation but the ages of P312 and DF13 will static.

Posting in parts.



P312 D All
n=
886



Ave
52.6



AVEDEV
9.8







DF13 D ALL Only
n=
767



Ave
47.9



AVEDEV
9.9



Min
26.0



Max
90



Median
44







DF63 D ALL
n=
35



Ave
45.0



AVEDEV
7.5



Min
26.0



Max
61




Median
43




MJost

MJost
08-10-2015, 12:33 AM
Part 2


DF13>DF49 ALL
n=
101



Ave
68.6



AVEDEV
14.5



Min
33.0



Max
90



Median
76







DF13>L1335 ALL
n=
62



Ave
53.0



AVEDEV
3.9



Min
36.0



Max
66



Median
53







DF13>L513 ALL
n=
97



Ave
43.2



AVEDEV
4.8



Min
33.0



Max
69



Median
42







DF13>Z251 ALL
n=
34



Ave
40.1



AVEDEV
5.0



Min
30.0



Max
74



Median
40







DF13>Z253 ALL
n=
83



Ave
45.2



AVEDEV
4.7



Min
33.0



Max
70



Median
44







DF13>Z255 ALL
n=
45



Ave
37.3



AVEDEV
2.6



Min
27.0



Max
48



Median
37




MJost

MJost
08-10-2015, 12:34 AM
Part 3



DF13>DF21 ALL
n=
122



Ave
46.5



AVEDEV
6.5



Min
28.0



Max
70



Median
45







DF13>FGC11134 ALL
n=
67



Ave
40.5



AVEDEV
2.8



Min
32.0



Max
61



Median
40







DF13>FGC5494 ALL
n=
27



Ave
45.3



AVEDEV
6.5



Min
29.0



Max
76



Median
44







DF13>DF41
n=
56



Ave
56.3



AVEDEV
10.8



Min
35.0



Max
81



Median
52







DF13>M222 only ALL
n=
484



Ave
42.4



AVEDEV
6.6



Min
17.0



Max
74



Median
42







DF13>49xM222 only ALL
n=
30



Ave
44.2



AVEDEV
5.6



Min
33.0



Max
62



Median
44.5




MJost

MJost
08-10-2015, 12:35 AM
Part 4



P312>




DF27>Z195 D ALL
n=
46



Ave
44.8



AVEDEV
5.1



Min
25.0



Max
61



Median
45







DF27>ZZ12 D ALL
n=
23



Ave
38.5



AVEDEV
7.4



Min
23.0



Max
56



Median
35







DF27>ZZ12>ZZ19 D ALL
n=
20



Ave
38.2



AVEDEV
4.3



Min
26.0



Max
49



Median
37.5




MJost

MJost
08-26-2015, 01:02 AM
FGC & YFull analysis provides STR results from NGS sequencing of our BAM data and and also processes it relative to the Reference Sequence combining the various reads versus position.

LobSTR needs the raw BAM file to calculate STRs but are there STR count values for the processed sequence or the Reference Sequence itself? Is the STR data not part of the SNP data?

Can a SNP mutation change the STR count?

Sorry this is so basic a question.

Sorry I meant to get back to your post but I didn't.

In reviewing ISOGG learning center http://www.isogg.org/wiki/Y_chromosome_DNA_tests

STRs data contains sequences of repeating nucleotides with a particular number of repetitions. SNPs are change to a single nucleotide in a DNA sequence which can occur even within STRs but are basically two different occurrences. Most NGS have shorter length reads which can not cover longer repeating STRs. In the future, maybe. STRs highly mutagenic and mutate at frequencies far greater than SNP point mutations. And it gets more complex than this explanation.

MJost

MJost
09-16-2015, 08:54 PM
I recounted the YFull SNPs from A0-T to L21 and added in the BigY average of 48 SNPs at or below DF13. Recalibrated using the Ust’-Ishim's 45K Cal ybp to come up with a fixed years before present for beginning for DF13 as a base number which then can be used to come up with a years per SNP. I calculated the 48 +-10 SD years per SNP. as reference. EVERYONE at or under DF13 shares these same SNPs. Only by determining the most freq number of SNPs bell curve and validating that the standard distribution fits a normal distribution would help confirm its validity of those kit numbers that are under L21 for the DF13+ men, 48 +-10 SNPs for over 750+ NGS kits.

In the "Adamov, et al" paper, "The average mutation rate calibrations of Anzick-1 and Ust-Ishim do not show any significant difference." For the Paleolithic run of 12,600-45,000 years BP which is about 72% of time to present it is a stable range of SNPs and I will assume that the previous and post mutations hold the same stable status. Thus the years per SNP using Ust'-Ishim's dates would also be very stable average.

Also, with all the recent dates used in various discussions, it is important to understand that referenced SNPs such as M269 is within a 83 SNP block and we currently see M269 as part of an unordered (placement within this group of SNPs is unknown). But as I have seen, based on aged sites and basic SNP testing that M269 most likely mutated near the end of the block of 83. L11 is also an unordered block of 11 SNPs, ranging from 4,588 BC to 3,416 BC when P312 was ~born.

The list of my SNPs A1 to the end of L21 block that are shared with others are found here:
http://www.yfull.com/share/yreport/4cb31ab46ae3067e90cfd36752d69817/

The A0-T portion of the tree is found here:
http://www.yfull.com/tree/A0-T/

The SNPs above and below of Ust-Ishim node at 11 SNPs below K(xLT) at 45000 calibrated years before present (46,880–43,210 cal BP at 95.4% probability or +-1,835 cal years).

MJost




AgeYBP
AgeBC
SNP Sum
Block SNPs
Haplogoup


197,813
= 195,813 BC
2,027
590
A0-T


140,221
= 138,221 BC
1,437
47
A1b


135,633
= 133,633 BC
1,390
406
BT


96,002
= 94,002 BC
984
319
CT


64,863
= 62,863 BC
665
4
CF


64,473
= 62,473 BC
661
162
F


48,659
= 46,659 BC
499
4
GHIJK


48,269
= 46,269 BC
495
1
HIJK


48,171
= 46,171 BC
494
5
IJK


47,683
= 45,683 BC
489
17
K


46,024
= 44,024 BC
472
1
K(xLT)


45,926
= 43,926 BC
471
5
MP-M1205


45,438
= 43,438 BC
466
137
P


32,065
= 30,065 BC
329
22
R


29,917
= 27,917 BC
307
4
R-Y482


29,527
= 27,527 BC
303
44
R1


25,232
= 23,232 BC
259
48
R1b


20,547
= 18,547 BC
211
2
R1b1


20,351
= 18,351 BC
209
2
R-L389


20,156
= 18,156 BC
207
49
R-P297


15,373
= 13,373 BC
158
83
R-M269


7,271
= 5,271 BC
75
2
R-L23


7,076
= 5,076 BC
73
5
R-L51


6,588
= 4,588 BC
68
12
R-L11


5,416
= 3,416 BC
56
2
R-P312


5,221
= 3,221 BC
54
6
R-L21


4,635
= 2,635 BC
48
48
DF13 and below 48 Ave BigY SNPs to Present





1,979
A0-T to end of R-L21 SNPs





424
K(xLT) to end of R-L21 SNPs





413
Less Ust-Ishim 11 SNPs





48
DF13 w/48 Ave BigY SNPs


45,000
= 43,000 BC
461
461
Ust-Ishim to present SNPs





97.61388286
YPS Calibated Ust-Ishim 45000/461 SNPs





1950
Ave birth year





2,027
A0-T to DF13>present SNPs





10
DF13 SNPs SD (+-)




37
125.2828751
DF13 Adj ave SD SNP from 4,635 ybp




57
81.32397153
DF13 Adj ave SD SNP from 4,635 ybp

pregan
01-11-2016, 02:52 AM
I am interested in running some TMRCA estimates between Big Y South Irish kits. My thinking was I would use the novel SNP list from Alex's Nig Tree as the filtered list and only use the ones marked '+' (high quality). Couple questions for you all :

1) Are there issues I need to be aware of using this method? One obvious one would be the small sample size where a highly mutated line or non-mutated line will skew results.

2) What would be a good value for years per SNP? Initially, I was thinking to use the rate YFull was using for the closes parent subclade, 144 years.

Thanks!

Heber
04-28-2016, 08:09 AM
Does this paper finally resolve the issue of STR mutation rates.

"This a newly published paper in the American Journal of Human Genetics: “Population-Scale Sequencing Data Enable Precise Estimates of Y-STR Mutation Rates”. Authors: Thomas Willems, Melissa Gymrek, David Poznik, Chris Tyler-Smith, The 1000 Genomes Project Chromosome Y Group and Yaniv Erlich. It is a companion piece to the Poznik paper “Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences”.

Abstract
Short tandem repeats (STRs) are mutation-prone loci that span nearly 1% of the human genome. Previous studies have estimated the mutation rates of highly polymorphic STRs by using capillary electrophoresis and pedigree-based designs. Although this work has provided insights into the mutational dynamics of highly mutable STRs, the mutation rates of most others remain unknown. Here, we harnessed whole-genome sequencing data to estimate the mutation rates of Y chromosome STRs (Y-STRs) with 2–6 bp repeat units that are accessible to Illumina sequencing. We genotyped 4,500 Y-STRs by using data from the 1000 Genomes Project and the Simons Genome Diversity Project. Next, we developed MUTEA, an algorithm that infers STR mutation rates from population-scale data by using a high resolution SNP-based phylogeny. After extensive intrinsic and extrinsic validations, we harnessed MUTEA to derive mutation-rate estimates for 702 polymorphic STRs by tracing each locus over 222,000 meioses, resulting in the largest collection of Y-STR mutation rates to date. Using our estimates, we identified determinants of STR mutation rates and built a model to predict rates for STRs across the genome. These predictions indicate that the load of de novo STR mutations is at least 75 mutations per generation, rivaling the load of all other known variant types. Finally, we identified Y-STRs with potential applications in forensics and genetic genealogy, assessed the ability to differentiate between the Y chromosomes of father-son pairs, and imputed Y-STR genotypes.

The paper is behind a paywall: http://www.cell.com/ajhg/abstract/S0002-9297(16)30093-3

However, it was previously made available as a preprint under a slightly different title:

http://biorxiv.org/content/early/2016/01/15/036590

http://biorxiv.org/content/early/2016/04/09/036590.figures-only

MitchellSince1893
06-26-2016, 01:39 AM
I'm sure this has been posted before (from 2014 International Genetic Genealogy Conference) but I hadn't seen it and found it very informative

Dr Bill Howard's talk on Using correlation techniques on Y chromosome haplotypes to date events


https://m.youtube.com/watch?v=zmuYUIEEp5I&feature=youtu.be

JMcB
06-26-2016, 02:23 AM
Originally Posted by MJost View Post

The “Variance Method” (Slatkin, 1995; Stumpf, 2001) assumes that the variance (average-squared-distance from ancestral value) of each STR marker in a large population, is proportional to the TMRCA of that population.

Ken Nordtvedt has implemented variance into his Generations spreadsheet calculations. Please note that Ken explains Variance Sigma (Standard Deviation) Concepts on his website.

http://knordtvedt.home.bresnan.net/S...0Variance.pptx

-------------------------------------------------------------------------------

For those who are interested the above link for Ken Nordtvedt is now dead but much of his work can still be found here:

https://onedrive.live.com/?id=8B35ADFFC37790D0%2189654&cid=8B35ADFFC37790D0

MitchellSince1893
06-26-2016, 03:05 AM
I'm sure this has been posted before (from 2014 International Genetic Genealogy Conference) but I hadn't seen it and found it very informative

Dr Bill Howard's talk on Using correlation techniques on Y chromosome haplotypes to date events


https://m.youtube.com/watch?v=zmuYUIEEp5I&feature=youtu.be

Found this paper (part 1 and 2 links) from Dr Howard that is quite similar to the video
http://www.jogg.info/52/files/Howard1.pdf
http://www.jogg.info/52/files/Howard2.pdf

And here is a educational video on his method.
https://www.youtube.com/watch?v=wmylJXlHZZs

And lots more info here https://dl.dropboxusercontent.com/u/59120192/Genealogy/Papers%26TreesIndex.pdf

MitchellSince1893
07-28-2016, 12:11 AM
Interesting article on a study of 111 markers, genetic distance and actual relationship distance.

http://linealarboretum.blogspot.com/2016/07/are-111-marker-tests-better-at.html?m=1

Except that caught my eye

We have seen close relatives (5th cousins and closer) having genetic distances up to 5, while 13th cousins, once removed have a GD=0.

Joe B
08-26-2016, 04:45 PM
Klyosov, A. and Kilin, V. (2016) Kilin-Klyosov TMRCA Calculator for Time Spans up to Millions of Years.
Advances in Anthropology, 6, 51-71. doi: 10.4236/aa.2016.63007 Pub. Date: August 24, 2016

This paper was posted by Mis on a R1b-Z2103 thread.

a little theory about Z2103
http://file.scirp.org/pdf/AA_2016082414320572.pdf
This paper covers much more than Z2103 and is appropriate this "STR Wars" thread.

I have to believe that the authors are looking for a good discusion based on the last sentence in Discussion and Conclusions. Page 19 of the PDF.


The authors express hope that a long debate over which mutation rates to employ, how to count mutations,
and how to calculate the TMRCA in various cases, including complicated ones, is eventually over.

Advances in Anthropology, 2016, 6, 51-71
Published Online August 2016 in SciRes.
http://www.scirp.org/journal/aa
http://dx.doi.org/10.4236/aa.2016.63007

Huntergatherer1066
08-26-2016, 09:01 PM
Just so you know, Advances in Anthropology is not considered a legitimate journal in anthropology, and Klyosov is not an anthropologist. We were always warned to steer clear of journals like "Advances in Anthropology" in graduate school. It's not unlike the veterinarian who created her own "scholarly" journal to publish her Bigfoot research.

https://scholarlyoa.com/2014/10/02/an-editorial-board-mass-resignation-from-an-open-access-journal/

Here is another good article from UCL that discusses the problems with illegitimate researchers like Klyosov: https://www.ucl.ac.uk/mace-lab/debunking/theories

Joe B
08-27-2016, 12:05 AM
Just so you know, Advances in Anthropology is not considered a legitimate journal in anthropology, and Klyosov is not an anthropologist. We were always warned to steer clear of journals like "Advances in Anthropology" in graduate school. It's not unlike the veterinarian who created her own "scholarly" journal to publish her Bigfoot research.

https://scholarlyoa.com/2014/10/02/an-editorial-board-mass-resignation-from-an-open-access-journal/

Here is another good article from UCL that discusses the problems with illegitimate researchers like Klyosov: https://www.ucl.ac.uk/mace-lab/debunking/theories
There is a long history of debating Anatole Klyosov on this thread since May of 2013. No affirmation of his work is intended.

Dave-V
08-27-2016, 01:50 AM
I'm sure this has been posted before (from 2014 International Genetic Genealogy Conference) but I hadn't seen it and found it very informative

Dr Bill Howard's talk on Using correlation techniques on Y chromosome haplotypes to date events


https://m.youtube.com/watch?v=zmuYUIEEp5I&feature=youtu.be

FYI we heard on one of the forums recently that Bill Howard has passed away. He was very helpful to me in my surname group analyses and his RCC Correlation technique is worth remembering.

A full list of his papers and applications of his method is at https://dl.dropboxusercontent.com/u/59120192/Genealogy/Papers%26TreesIndex.pdf. And I posted a more complete analysis of his technique here awhile ago: http://www.anthrogenica.com/showthread.php?3127-L513-STR-analysis-using-RCC-Correlation.

RIP Bill. We can't afford to lose the pioneers.

razyn
08-27-2016, 04:03 AM
his RCC Correlation technique is worth remembering.

Amen to that. I was present for Bill Howard's slide presentation of it at the I4GG Conference in Maryland, two years ago. And subsequently we discussed it by phone a bit, a very few months before he died. His method is actually different, and shouldn't be filed away in the archives because he's no longer here to defend it, use it, or improve it. However, I don't have the math chops to do so. Bill was an astrophysicist, I believe. And he was well past 80; so he knew the math the way guys had to know it who did their calculations with slide rules, before computers made some aspects of it much easier.

razyn
08-27-2016, 02:04 PM
Incidentally, there has also been just a hint or two here about another method of estimating the age of a mutation, based on pairwise mismatches. I don't pretend to know how it's done. Dr. Hans van Vliet, a geneticist in Amsterdam, discussed this technique just a bit on the old WorldFamilies group for R1b moderated by rms2; and MikeWww and I have alluded to it here, though not very recently. http://www.anthrogenica.com/showthread.php?827-Where-did-DF27-originate-and-when-and-how-did-it-expand&p=15433&viewfull=1#post15433

Megalophias
10-31-2016, 02:45 AM
Rapid evolution of the human mutation spectrum (http://biorxiv.org/content/early/2016/10/30/084343.full.pdf+html)
More mutations at different rates between populations.

Dave-V
06-23-2017, 09:29 PM
For Mark Jost or anyone who is familiar with Ken Nordtvedt's Y-STR-based Interclade age estimating methodology, I recently changed the TMRCA estimations in SAPP (my phylogenetic tree tool (http://www.jdvtools.com)) to use that methodology instead of Nordtvedt's older adaptation of the Bruce Walsh methodology (http://www.jogg.info/pages/42/files/Nordtvedt.htm).

The main reason I switched is because Nordtvedt's older method required knowledge of the allele frequencies which are specific to individual haplogroups. I built the L21 allele frequencies into the tool but that made it less accurate for other haplogroups. The Interclade least squares estimations don't require allele frequencies so they are more widely applicable.

The tool recalculates a Interclade age estimation at every branching point (node) on the phylogenetic tree. Internally, every node has only two branches, so there are always two "sub-clades" to use for the methodology. When the phylogenetic tree is drawn it eliminates unnecessary nodes so you only see relevant branching with their TMRCA estimates. I should also note that although I calculate coalescence ages as well, I don't currently report them for simplicity.

Switching between the two methods, I find the Interclade methodology produces slightly younger TMRCA estimates but with tighter one-standard-deviation error ranges than Nordtvedt's older method.

Since the tool constrains the phylogenetic tree by SNP results, I also report the SNP TMRCA ages (where known) from YFull (V5.04 currently) against the SNPs reported on the tree. This allows for direct comparison around the tree between TMRCAs calculated by SNPs and TMRCAs calculated by STRs.

But... this is not necessarily an apples-to-apples comparison since the TMRCA nodes reported on the tree are only the TMRCAs for the group being charted, not the TMRCA for that SNP overall. You would expect then in general that the SNP TMRCAs should be older (MUCH older, in some cases) than the STR TMRCAs. So the way to compare the two ages is that the SNP's overall TMRCA is somewhere ABOVE the node shown on the tree, and the node itself is of the age reported for the STR-based calculation.

This is an example (from data invented for display purposes) of a node in the phylogenetic tree chart with both SNP (in blue) and STR (in green) ages shown:

http://www.anthrogenica.com/attachment.php?attachmentid=17151&stc=1

MarkoH
04-06-2021, 12:55 AM
Anatole Klyosov uses several method to produce ages based on a 25 year per generation mutation rate.
http://www.jogg.info/52/files/Klyosov1.pdf

Chandler has posted his own set of calculated mutation rates. His paper is found at: http://www.jogg.info/22/Chandler.pdf

Marko Heinla has produced his own more recent mutation rates back in May 2012 using methods using Chandler's methods. He has a link to his 111 marker rates near the botton of this web page.
https://dl.dropboxusercontent.com/u/50201824/old/jsphylosvg_trees.html

Marko Heinila's results are based on about 4,000 111 level samples. He used an estimation process that each haplotype pair was considered an independent random draw from a model distribution. Model distribution suggests what is the ratio of mismatches and matches in a given marker if pairs with a given number of matching markers in general are considered. The pair data was then used to solve the mutation rates. He said that this is the same idea as in Chandler's paper on mutation rate estimation.

MJost

A little late, but I have attached a short report that discusses my 2012 estimation method and it's relation to the original Chandler's approach. Basically (2012) estimation method reworks relevant math removing some approximations and also, perhaps more importantly, introduces topological data weighting. It appears that, although unpublished, these estimates are still referred to in the web even in some peer-reviewed papers. Some extra details then seem appropriate.

MarkoH
07-16-2021, 07:38 PM
YHRD dataset as been slowly expanding after 2012. New loci have been included and the dataset has been diversified. So I wrote a short description of the current situation wrt the accuracy of the 111 estimates.

edit: I replaced the attachment to fix a comment related to the original error estimates

MarkoH
07-18-2021, 03:37 PM
There was the technical detail that the rate estimates are found by a probability maximization process. Such estimate would then be a mode of a distribution rather than a mean as I incidentally suggested in the document yhrd2021b.pdf .

It seems that I can't any more edit the previous post, so the document with a more consistent error estimate section is attached here.

MarkoH
07-21-2021, 01:48 AM
There was the technical detail that the rate estimates are found by a probability maximization process. Such estimate would then be a mode of a distribution rather than a mean as I incidentally suggested in the document yhrd2021b.pdf .

It seems that I can't any more edit the previous post, so the document with a more consistent error estimate section is attached here.

Somehow I forgot to add the quite relevant 95% confidence interval for the overall scale of the estimates.

After all, errors are composed of per locus errors and an error in the overall scale. For the very fastest loci these are fairly similar in magnitude, while the per locus error dominates in the case of the slower loci.

Also the scale factor uncertainty needed to be removed from calculation producing Fig 1.

None of this of course changes anything material in the document....

MarkoH
07-22-2021, 02:39 PM
I would summarize my recent observations as follows:

Any pair of two real-world haplotypes is related by a sequence of independent
periods of evolution (called "branches" in rrates.pdf). If each of the
"branches" of evolution are given an equal probability of occurrence
(by using a weighting scheme), random draws from an abstract haplotype
pair distribution can be simulated and mutation rates can be
estimated.


Even tough my 2012 estimates did not score particularly well with
father/son pair data that was available in 2012, they work very well
with today's more complete YHRD dataset. It is estimated that the
accuracy is at least similar to a father/son dataset of 10,000
samples, but the accuracy could potentially be as good as with a set
of 80,000 father/son pairs.

MarkoH
07-25-2021, 05:04 PM
The use of a particular weighting function in the estimatinon algorithm description document rrates.pdf was critical for the accuracy. However, it may appear cryptic. To clarify the meaning of the weighting I wrote a short description of it (attached).

For example, if m(a,b) is a mutation count between samples a and b, and t(a,b) is their time distance, the related mutation rate is simply

sum w(a,b) m(a,b) / sum w(a,b) t(a,b)

where w(a,b) is the weighting factor and the sum is over all sample pairs (a,b). This simple result is accurate eg if the changes always happen in unique locations and can then be counted accurately from the pair observations without complications like back and parallel mutations.

MarkoH
07-27-2021, 03:32 PM
If the statistical tests of the document yhrd2021d.pdf are done for
"Heinila (2012)" rates, McDonald rates, and Chandler rates,
respectively, I find the following values with the data at
https://yhrd.org/pages/resources/mutation_rates :

P value

0.36 , 0.0016 , 3e-5

maximum likelihood N0 value

89000, 12000, 7300

This is surprising since McDonald rates are partly based on (2012)
rates. However, an inclusion of relatively small father/son studies
may reduce the effective size N0. Perhaps the biggest single
difference of (2012) rates and Chandler rates was the use of the
summing theorem (weighting.pdf).

MarkoH
08-01-2021, 02:54 AM
I started editing the documents I posted earlier since they omit some details. If you would like to get extended versions, please send a message.

I would strongly encourage P value tests for various mutation rate sets, perhaps simply ignoring complex loci like 385 & 389. P tells how likely as extreme (or more extreme) mutation counts as seen in measured transmission events would be if the rate set in question was assumed to be accurate.

MarkoH
08-07-2021, 11:09 PM
To wrap up the madness (hopefully), I have somewhat extended version the combines the earlier documents. Somewhat better method description and some about the various problems with it. A very painful topic, but it worked better than expected.

Edit: -3 version has an additional example calculation in Appendix A.

MarkoH
08-09-2021, 12:22 AM
Edit: -3 version has an additional example calculation in Appendix A to make it less abstract.

is attached here...

MarkoH
08-10-2021, 05:29 PM
some edits on the previous.... but I think I am done now

Let me recap briefly how it came to this. Long time ago it seemed that Chandler essentially solved the problem in his paper. However, it appeared that the Chandler method heavily over counted events which likely reduced accuracy. I had some networks in my hands.... I couldn't imagine any other way to counteract over counting that to reduce pair weights as 2^-path length in a binary tree... any tree estimation errors likely wouldn't matter much since unweighted method also worked. The exponential down-weighing was simple and appeared to do what is needed. The idea produced an enormous improvement in cross validation. Nordtvedt chose those values in his spreadsheet since little else was available. For some reason, various people persisted in using the estimates over the following years.Then I noticed about 10 years later that, with more data, YHRD estimates had essentially converged to the same values within statistical accuracy. That then required some kind of more detailed explanation.

MarkoH
08-27-2021, 08:00 PM
That then required some kind of more detailed explanation.


An executive summary about the relation to least-squares estimates,
etc:

If we have a tree graph of n leaf nodes, we introduce the indices:

p = 1,...,n(n-1)/2 leaf node pairs ( A, B )
e = 1,...,Ne edges connecting tree nodes

one can define path/edge membership matrix

H(p,e) = 1 if edge e is on the path p, zero otherwise

Sample pair differences in mutations, time, etc can be described with:

g(p) = sum of f(e) over the edges e on path p between two samples

The f(e)'s can be solved with least squares method by minimizing the
squared error sum

sum_p w(p) ( g(p) - sum_e H(p,e) f(e) )^2

If w(p) is of the form

w(p) = 1/( (deg(vp1)-1) ... (deg(vpn)-1) )

where vp1, vp2,...vpn are the internal tree nodes on the path p, and
deg(v) is the degree of node v (ie number of edges directly connected
to the node v, eg, deg(v)-1 = 2 if v is a binary node)

then the sum of f(e)'s has a particularly simple expression

sum_e f(e) = sum_p w(p) g(p)


In matrix terms, f(e)'s can be solved by least squares minimization as

f = R g

with

R = inv(H' W H) H' W

where W is a diagonal matrix composed of w(p) values, ' indicates
transpose, inv() inverse, etc.

For any tree, the matrix R has the property

sum_e R(e,p) = w(p)


Applications:

- If tree is chosen to consist of one binary node and one star node,
choosing the binary node leaf nodes by sum f(e) minimization
produces the selection rule in the neighbor joining algorithm.

- Parameters of underlying additive random process on a tree can be
estimated by eliminating over counting of random events (mutations)
by weighting each pair observation fractionally by w(p). The n(n-1)/2
pair differences in n samples can be viewed to provide
sum_p w(p) = n/2 independent process observations in total.

- Popular interclade and intraclade time estimators can be seen as
hierarchical constructions based on least squares estimates for the
underlying tree node distances. They can be generalized for the case
of known tree information. For any given tree, error estimates can
be constructed.

- w(p) values can be averaged over many potential trees in the case of
incomplete information and better overall statistical results can be
obtained.

edit: deleted misleading wording from the second item; results like
sum_e R(e,p) = w(p) were tested numerically with random trees and
there is an outline of a general proof as well

razyn
08-27-2021, 08:45 PM
I hope some people who have the math chops for this are reading it. I followed your charts, back in the day. Dr. Bill Howard was working on similar tasks, at a high level, and I caught his well-illustrated slide presentation at the 2014 Institute for Genetic Genealogy (I4GG) conference. Also had a phone followup with him, since he was local (for me); and I believe I have his paper (not the slides) on the preserved hard drive of a deceased computer of that vintage. Apparently he had published a version in 2009, but updated it a lot for the 2014 conference. Here is his biographical sketch for the astronomical community, in which he spent most of his career. https://baas.aas.org/pub/william-e-howard-iii-1932-2016/release/1 The article is footnoted, and the introductory version of his article is the final footnote. I'll paste it in:


(10) Howard, William E. III 2009, The Use of Correlation Techniques for the Analysis of Pairs of Y-STR Haplotypes, Part 1: Rationale, Methodology and Genealogy Time Scale, J. Genetic Genealogy, 5, p.256.

MarkoH
08-29-2021, 01:56 AM
I hope some people who have the math chops for this are reading it....

My main worry is why the mutation rate estimation method that I used in 2012 worked so well with the hindsight of more recent YHRD data. It is easy to imagine any number of issues that could derail such estimation process. Perhaps the various populations are very different causing rates to differ, perhaps simple mutation models do not work accurately enough, perhaps trees cannot be constructed well enough to allow good estimates, a fairly small dataset of 4000 individuals was used, etc. However, the actual accuracy seems extremely good against all odds. The almost surreal fact is that YHRD dataset is not large enough to differentiate the 2012 estimates from hypothetical exact values, though they certainly cannot be exact. A key finding is that the method can be viewed as being based on least-squares estimation on trees. It is implicit in the algorithm that the least squares errors are weighted so that a larger relative error is allowed for relatively long paths that have many branching points on their way trough the tree. Tree construction algorithms certainly have harder time finding such long paths correctly, which makes this very appealing. If no large construction errors exists, weighting is not very relevant and the weighting algorithm can be rationalized as being the simplest possible, if not the only one that is at all practical.

Just to expand the first application item in the summary post as an example...

In neighbor joining we have a binary node and a star cluster of degree n-1. In this case w(i,j) will have the values

w(1,2) = 1/2
w(i,j) = 1/(2(n-2)) , i=1,2 ; j=3,...,n
w(i,j) = 1/(n-2) , 3<=i<j<=n

The sum of edges then consists of the distance terms after multiplied by 2(n-2) :

(n-2) d(1,2)

d(i,j) , i=1,2 ; j=3...n

2 d(i,j) , 3<=i<j<=n

simplify by subtracting the sum of the terms 2d(i,j) , 1<=i<j<=n , since this does not depend of the choice of samples connected to the binary node

(n-4) d(1,2)

-d(i,j) , i=1,2 ; j=3...n

collect terms and change the index range in the sum (also d(i,i) = 0)

(n-2)d(1,2) - sum_(i=1,..,n) ( d(1,i) + d(2,i) )

which is the quantity to minimize over selection of 1,2 for finding a pair to join in the binary node. A short derivation vs
https://academic.oup.com/mbe/article/4/4/406/1029664 since a general least-squares result is applied.
A subtle point is that the NJ paper uses constant squared error weighting. However, for a symmetric enough graphs the result for sum of edges is still the same as with a w(p) weight.

MarkoH
09-01-2021, 05:13 PM
However, for a symmetric enough graphs the result for sum of edges is still the same as with a w(p) weight.

what is symmetric enough ? The following observation seems to give the answer:

Let us place root node r0 on any edge of a tree. Let wr( v ) be a vector of arbitrary positive values for each internal tree node v and the root node r0. Further, let root(p) be the internal node on a leaf-to-leaf path p that is closest to the root node r0 (or root node if the path p goes through the root edge).

In this case, if f(e) is the solution of the least squares problem

sum_p wr( root(p) ) w(p) ( g(p) - sum_e H(p, e) f(e) )^2

then

sum_e f(e) = sum_p w(p) g(p)

where w(p) = 1/( (deg(v1) -1) ... (deg(vn) -1) ) and v1,...vn are internal nodes on the path p.


If the tree is symmetric enough, one can have equal weight

wr( root(p) ) w(p) = 1

if w(p) has the same value for all paths with the same root(p). There has to be a root edge selection that produces this situation for all internal nodes.

More broadly, we have sum_e f(e) = sum_p w(p) g(p) if the paths that have the same root(p) and have different numbers of internal nodes of any given degree are weighted by the factor w(p) just relative to each other.


This works in numerical tests with randomly generated trees, and can seemingly be proved in general.

(deleted the last paragraph)

MarkoH
09-02-2021, 07:45 PM
The least-squares edge sum estimator

sum_e f(e) = sum_p w(p) g(p)

could be viewed as a form that makes, by construction, minimal use of longer paths through the tree. This is reasonable when tree construction errors, for example, are expected to occur. This feature is likely to contribute to the accuracy of the rate estimates.

A way of motivating this statement is to model the squared error for path p as a sum of contributions from each edge e on the path p. In this model, a larger error is then allowed for long paths over many edges. If an edge e contributes E(e) to the squared error of all leaf/leaf paths crossing it, and if E(e)'s are identically and independently exponentially distributed, then w(p) least-square weight represents maximum likelihood estimation. This is since the w(p) weighted squared error sum reduces to sum_e E(e) which is proportional to negative sum of log probabilities of E(e)'s described by exponential distribution. The w(p) factor reappears in the formula sum_e f(e) = sum_p w(p) g(p) for the sum over components of the least-squares solution f(e).

MarkoH
09-03-2021, 11:04 PM
The least-squares edge sum estimator

sum_e f(e) = sum_p w(p) g(p)


Actually this kind of estimation is a generalization of the formula (4) in this publication to the least-squares framework and arbitrary node degrees:

https://academic.oup.com/mbe/article/21/3/587/1079580

(edit: )
The theorem 1 in the paper above shows that the tree length estimator is minimum variance when leaf/leaf distance variance used in least-squares estimation increases exponentially with path length (path length is measured as the number of binary tree edges). In that case long paths are certainly avoided. (The argument sketched in the previous post doesn't seem to work out in detail, however, and should be ignored.)

The partial resolution aspect of STR trees can be estimated with an entropy like quantity

S = - sum_p w(p) log(w(p))

if w(p) = 1/( (deg(v1)-1) ... (deg(vn)-1)) , or 2^(1-length(p)) in the binary case where length(p) is the number of edges on the path p, the quantity S can be converted to a sum over edges. The sum over edges in turn evaluates to a sum over internal nodes v:

S = (1/2) sum_v deg(v) log(deg(v)-1)

On the other hand, a tree where all internal nodes have the same degree d, has nv = (n-2)/(d-2) internal nodes, where n is the number of leaf nodes. In this case,

S = (n-2) d log(d-1) / (2(d-2))

This allows to estimate effective d if w(p)'s are averaged over a set of binary trees that have some randomness, by solving d from the equation

- sum_p <w(p)> log(<w(p)>) = (n-2) d log(d-1) /(2(d-2))

where <w(p)> represents averaged w(p). The average of 5 binary trees used in rate estimation document produced about d = 4.5 . This is quite far from a binary tree (d=3).

MarkoH
09-05-2021, 12:53 PM
A brief conclusion why estimates are accurate:

The tree length result of Pauplin (2000) was applied in the context of process parameter estimation. The result is very well founded in the theory and has been shown to be related to estimators that have very desirable properties. In the context of statistical modeling, the Pauplin weight factor 2^(1-pij) represents fractional observation weight. This is necessary for accurate results since the weighting merely ensures that the correct number of mutations (tree length) is included in the analysis. Under wide enough process randomness, this is also sufficient for correct results eg in the cases of Poisson and Skellam variables.

---

Pauplin (2000) cited in https://academic.oup.com/mbe/article/21/3/587/1079580

MarkoH
10-14-2021, 05:14 PM
To complete an earlier comment on abstract simulation and its relation to tree length formulas:


One can choose a random leaf/leaf path in a set of trees, also known as a forest, by using the following procedure: First, choose a random leaf node, then choose randomly any of the edges connected to the node and not yet traversed to move to the next node; continue until a leaf node is reached and the leaf/leaf path is complete. The probability that the leaf/leaf path p is chosen by using this process, is

P(p) = (2/n) / ( (deg(v1)-1) ... (deg(vk)-1) ) , where v1...vk are the internal nodes on the path p and n is the number of leaf nodes

In other words, to get a given path p, one needs to first choose one of the two terminal leaf nodes, this produces the factor 2/n. Then one has to choose the appropriate edges along the path: For internal node v, one of deg(v)-1 edges not yet used has to be chosen, producing the other factors in the formula. If the leaf nodes are directly joined, P(p) is just 2/n.

It also turns out that the probability that the edge e is on a randomly chosen path p, is 2/n in general. Since this probability is a constant, one can compute the average leaf/leaf path length over this kind of random path simulation as

sum_p P(p) F(p) = (2/n) sum_e F(e)

when F(p) = sum_(e on path p) F(e)

or

sum_e F(e) = (n/2) sum_p P(p) F(p)

This is the abstract simulation mentioned in an earlier post, and the argument produces a Pauplin (2000) type formula for a forest length where internal nodes v are not necessarily binary, and deg(v)'s can be larger than 3.

Since the forest is described by a probability distribution P(p), one can then, for example, construct a metric distance by between forests by computing the Jensen-Shannon (JS) metric distance between the P(p) distributions of any two forests. This metric is more general than Robinson-Foulds (RF) distance of trees, and has a direct connection to the internal node degrees. The JS metric applied for trees or forests is in fact an another application of the topological entropy used in an earlier post to estimate average internal node degree.

---
Jensen-Shannon metric for probability distributions:
DM Endres, JE Schindelin, IEEE Trans. Inform. Theory, vol 49, pp 1858-1860, 2003. https://research-repository.st-andrews.ac.uk/bitstream/10023/1591/1/Endres2003-IEEETransInfTheory49-NewMetric.pdf

MarkoH
11-17-2021, 12:15 AM
It appears that the same method that was used to estimate mutation rates in my report could also be used for estimating STR trees.

The likelihood function (called here L) used for mutation rate problem in the report can be written as

-log(L) = sum_(i<j) wij dij

where

dij = weighted number of loci that differ between haplotypes i and j: If locus m differs, a contribution Wk(m) is added to the distance dij:

Wk(m) = -log(sk(m)/(1-sk(m)) - (1/k) sum_m1 log(1-sk(m1))

sk(m) is the probability that locus m differs if k loci in total differ between haplotypes i and j. The latter term with 1/k factor is related to the average contribution of the loci that do not differ.

wij is a Pauplin factor, found in the report as 2^(1-topological distance in the tree).

If mutation rates are known, however, the effective distances dij can be computed and used to estimate the tree instead. This can be done by, for example, utilizing the neighbor joining method with the distance matrix dij since the neighbor joining and related methods have been specifically engineered to optimize the sums like the -log(L) above. The use of a particularly weighted distance matrix dij would then favor trees that produce the expected STR statistics.

The attached plot shows the locus weight factors Wk(m) for various distances k. In the mutation rate raport, the trees were found by using a weighted parsimony algorithm with a locus weight 1/(1+600m) . If compared to weight factors in the plot, this roughly speaking corresponds to 20 locus differences, k=20, taking into account that the overall scaling for this factor doesn't matter. For small enough rates m and k, Wk(m) is well approximated by C(k)-log(m) where C(k) depends on the total number of differences.
47491

Doing the overall rate estimation with a distance matrix method might also be advantageous. Also, more detailed STR modelling could be used for the distnace matrix calculation.

MarkoH
11-23-2021, 05:47 PM
Instead of using the number of differing loci k like in the previous post, this calculation is reorganized by using simple genetic distance (sum of absolute values of differences) for Skellam distributed STRs:
47550
Since sizes of the differences vary, the "weight" is computed as average contribution to genetic distance based -log(L) divided by average size of the difference. Each curve is labeled with genetic distance (gd), number of locus differences (k) and median time distance (g).

Unlike in the previous figure, faster STRs retain more weight at larger distances. This is more appropriate for comparing with weighted parsimony.


It seems likely that the trees estimated in this way are reasonably close to those that produce highest data compression if the joint mutation rate and tree estimation is viewed as a haplotype data compression scheme. This can be seen in the case where mutation rates are equal and the locus count N is large. In this situation "the probability that the locus m differs when k loci differ in total" reduces to a simple expression:

sk(m) = k/N

The contribution of a haplotype pair ij to -log(L) can then be expressed as

w(ij) ( - k log(k/N) - (N-k) log(1-k/N) )

= w(ij) N ( -x log(x) - (1-x) log(1-x) ) = w(ij) N H(x)

where x = k/N and w(ij) is a Pauplin factor. Here H(x) = -x log(x) - (1-x) log(1-x) is the well-known expression for information content of Bernoulli distribution with parameter x. With N large, N H(x) then represents the theoretical information content for a set of N loci. The -log(L) minimization approximates these quantities as a minimal sum of edge information contributions. The more realistic case with non equal mutation rates can be viewed as an extension of the same principle.

edit: reuploaded the figure after a fix

MarkoH
11-25-2021, 04:14 PM
A brief conclusion why estimates are accurate:

The tree length result of Pauplin (2000) was applied in the context of process parameter estimation. The result is very well founded in the theory and has been shown to be related to estimators that have very desirable properties. In the context of statistical modeling, the Pauplin weight factor 2^(1-pij) represents fractional observation weight. This is necessary for accurate results since the weighting merely ensures that the correct number of mutations (tree length) is included in the analysis. Under wide enough process randomness, this is also sufficient for correct results eg in the cases of Poisson and Skellam variables.

---

Pauplin (2000) cited in https://academic.oup.com/mbe/article/21/3/587/1079580

Even though this has a large part, the main reason might be that the estimation procedure was sufficiently close to optimal from information theoretic point of view. The objective function used, if optimized both with respect to rate estimates and underlying tree, measures residual unexplained variation: With large trees it would here apparently converge to the overall leftover statistical entropy of the haplotype difference distributions over various genetic distances. The data for estimating these distributions is weighted by the Pauplin factors to account for interdependencies. This construction then turned out to be particularly accurate.


Edit: in the previous post, the original version of the plot (attached) corresponds to my description, not sure why I changed it. The issue here is the average over difference sizes that impacts large time difference curves at high rates.
47569

MarkoH
11-28-2021, 02:28 AM
Both father/son studies and STR datasets suggest that STR down mutation rate increases with increasing repeat number. One can model these observations by using the following simple forms for up and down mutation rates from initial i STR repeats:

T(i -> i+1) = m/2

T(i -> i-1) = m/2 exp( a (i-i0) )

here m, a, i0 are fitting parameters.

Since a large repeat number i>i0 increases down mutations, and since with a small repeat number i<i0 only up mutations are common, there likely is a limiting distribution p(i) for the repeat number i. This is reached from any starting value after a very long time has passed.

The limiting distribution p(i) can be found by solving the following detailed balance equations

p(i) ( T(i -> i+1) + T(i -> i-1) ) = p(i+1) T(i+1 -> i) + p(i-1) T(i-1 -> i)

The solution to these conditions turns out to be a discretized Gaussian distribution:

p(i) = C exp( -a (i +1/2 - i0)^2 /2 )

where C is chosen so that sum_i p(i) = 1 .

This can be verified by first noting that p(i) T(i -> i-1 ) = (m/2) p(i-1) . With this, the balance equation is satisfied and also

sum_i p(i) ( T(i -> i+1) + T(i -> i-1) ) = m

meaning that m can be identified with the average mutation rate in the limiting population p(i). Since there is a limiting distribution, variance expansion for a difference of two STR values is limited to 2 times the variance of the p(i). By using usual variance for a Gaussian, the maximal difference variance is maxV = 2/a .

The detailed approach to saturation can be studied numerically. The result roughtly speaking follows the approximate formula:

var = mG (1 + 0.5 mG/maxV ) /( 1 + mG (1 + 0.5 mG /maxV ) / maxV )

where mG is mutation rate times time distance (time distance = 2 tmrca ) .

The following plot compares this with accurate numerical population averaged variance for various values of maxV :
47586

Similar observations have been likely used in various time estimation methods. The saturation of faster STR's would also result with smaller STR weighting in tree estimations. Typical maxV might be in the range 2...8.

MarkoH
11-29-2021, 05:42 PM
By using the observations of the previous post, multiplying p(i) by exp(-b i) would satisfy the balance conditions for a more general case:


T(i -> i+1) = (m/2) exp( b(i-i0) )

T(i -> i-1) = (m/2) exp( (a+b)(i-i0) ) , a > 0 , any b


For a limiting distribution to exist, the down rate T(i->i-1) has to grow faster than the up rate, and the parameter a has to be then positive.

In this case the limiting distribution is shifted by -b/a but its variance does not change from b=0:

p(i) = C(b) exp( -a(i+b/a+1/2-i0)^2/2 ) , C(b) = 1 / sum_i exp( -a(i+b/a+1/2-i0)^2/2 )


However, the p(i) averaged rate changes:

sum_i p(i) (T(i->i+1) + T(i->i-1)) = m exp( -b(b/a+1)/2 ) C(b)/C(0)

A nonzero b likely affects the details of variance saturation as well.

edit: The plot in the previous post is for b=0 or b=-a, a = 2/maxV, i0=integer+1/2 and the variance is between a draw from the limiting distribution and a sample that is forward in time by the given amount. With b>0 or b<-a, the variance sauration process slows down.

MarkoH
12-02-2021, 04:40 PM
In the past, I have estimated a and b parameters by studying changes between median nodes in a rooted and timed STR tree: that study produced estimates for 54 STRs of 67 ignoring cases of insufficient data and a few instances of negative a's. The b parameter had a median value of about 0.14 and 80% of the data was between -0.05 and 0.3 . However, the saturation variance 2/a varied even more widely with 80% of values in the range 1.4 to 17 and a median value of 4.1. It turned out that 2/a was correlated with the overall variation range in large datasets. For example, log(a) was 80% correlated with log(D-1.2) where selecting 1.2 produced fit coefficient -2.0 though other choices are possible. On the other hand, the a and b parameters didn't have apparent relation with mutation rates, eg. The following might describe the overall situation:

var = x / (1 + x/maxV) , where x = mG ( 1 + 0.3 mG / maxV ) and maxV = ((D-1.2)/3.7)^2

where D was based on mininum and maximum values in a dataset of 34,000 haplotypes covering several macrohaplogroups. Here the earlier 0.5 coefficient was replaced by 0.3 in order to account for a slower variance saturation due to a positive b value b=0.14. (In the case of b=0.3, for example, one would have x=mG.)

MarkoH
12-11-2021, 09:22 PM
How does Nordtvedt weighted STR tmrca variance formula change if the previous STR saturation model and relevant approximations are used? The result is

G = sum_i w0(i) w1(i) var(str(i)) /( 2 sum_i w0(i) m(i) )

where w0(i) = 1/(1 + 4 m(i) G) is the usual weighting factor, and an additional factor w1(i) models the saturation behavior:

w1(i) = (1 + 2 m(i) G (1 + 2 c m(i) G/maxV)/maxV)/ (1 + 2 c m(i) G/maxV)

or more simply if c=0 :

w1(i) = (1 + 2 m(i) G/ maxV )

The TMRCA value G can be solved iteratively. Here c and maxV can depend on i.

For example, if c=0 and there is only one STR, one obtains

G = var (1 + 2 m G/maxV) / (2 m)

The solution for G is

G = (var/(2m)) /(1 - var/maxV)

This is works like usual if maxV is large, G = var/(2m). However the time estimate diverges when var gets close to maxV.

Two ideas were used to obtain this estimator: 1) its expectation value is correct as long as variance saturation is modelled accurately, 2) the usual weight w0(i) is fairly accurate mean variance weight for scaled str w1(i) var(str(i)) to minimize error variance

MarkoH
12-14-2021, 03:50 AM
It seems that if saturation variance is large, and the process is symmetric b=-a/2, the STR saturation model reduces to a mean reversion behavior known as Ornstein–Uhlenbeck process.

This means that for maxV large (ie a small), and b=-a/2, the variance in this model saturates exactly as :

var = maxV ( 1 - exp(- m g/ maxV)) , where m = 2 tmrca.

This seems to be very accurate down to maxV smaller than 1. And also since with this

1/(1-var/maxV) = exp( m g /maxV) = 1 + mg/maxV + (mg/maxV)^2 /2 + ...

it comes clearer why my approximative formulas worked. They were of the form

1/(1-var/maxV) = 1 + mg/maxV + c (mg/maxV)^2

and truncating the series and using a smaller c than 1/2 describes slower variance saturation happening when b is larger (or smaller) than b=-a/2. I haven't found similar exact solution for other b values than b=-a/2.

---
https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process

MarkoH
12-15-2021, 09:43 PM
Perhaps something like this for the case of mean reverting STR's: (for reducing factors, here g=2*tmrca) :

g = sum_i w0(i) w1(i) var(str(i)) / sum_i w0(i) m(i)

where

w0(i) = 1 / (1 + 2 maxV ( exp(m(i)g/maxV) -1) )

w1(i) = (m(i)g/maxV) / (1-exp(-m(i)g/maxV))

With g=0, w0(i) = w1(i) = 1 and that can be a starting point of an iteration. if maxV is large, w0(i)=1/(1+2m(i)g) , w1(i)=1, ie it reduces to the Nordtvedt formula.

The expression reduces to g if var(str(i)) = maxV(1-exp(-mg/maxV)) so it is consistent with saturating STR's.

If mg/maxV is large, the related STR is downweighted exponentially, which is actually statistically correct in that case.

Finally, if STR has nonzero b the formula underestimates g. Estimation of b could be avoided and maxV's could be obtained from variation range in large datasets.

(If the variance of STR difference over time g is given by var(g) , then the weighting factors would be in the same way w1 = m g / var(g) and w0 = 1/( 1 + 2 m var(g) / (dvar(g)/dg) ) so that a small derivative d var(g) / dg leads to a small weight, f.ex.. This can be derived by using maximum likelihood estimation for the variance model parameter (g) in the case of continuous variables. This is then modified so that the result reduces to the usual formula over shorter time periods. )

Wheal
12-15-2021, 11:33 PM
Since I'm very new at this.... If I have 24 differences in Str values.... from a possible ancestor from 1100 ad... what would that be? I know this is a dumb question, but I think my covid vaccine affected my ability to think!!

MarkoH
12-15-2021, 11:56 PM
Since I'm very new at this.... If I have 24 differences in Str values.... from a possible ancestor from 1100 ad... what would that be? I know this is a dumb question, but I think my covid vaccine affected my ability to think!!

What STR set would that be.... f.ex. for ftdna 111, 24 differences is likely more than 1000 years tmrca.

Wheal
12-15-2021, 11:59 PM
Thank you Mark, I didn't think it was possible to be a match and yes, ftdna Y700

MarkoH
12-16-2021, 12:04 AM
Thank you Mark, I didn't think it was possible to be a match and yes, ftdna Y700

If it is Y700 it is quite possible, the more STR's tested, the shorter time distance for the given number of differences.... not too familiar with 700 str test mutation rates though.

MarkoH
12-21-2021, 04:39 PM
Some of the earlier guesses were embarassingly bad and can be improved. A recap: STR variance saturation can be up to a degree described by repeat number (i) dependent up and down mutation rates:

T(i->i+1) = (m0/2) exp(b (i-i0))
T(i->i-1) = (m0/2) exp((b+a) (i-i0)) , a>0

Here m0 is the break even mutation rate at repeat number i=i0 where up and down rates are equal. The m is the average mutation rate (or "the mutation rate") over limiting distribution p(i) when time distances are large, p(i)=C exp( -a(i+b/a+1/2-i0)^2/2 ), the break even rate is given by m0 = m exp(b(b/a+1)/2) .

The saturation variance maxV is two times the variance of the p(i) distribution, maxV = 2/a . It is the variance of difference of two STR samples that are very distant in time.

If b=-a/2 the variance saturation with time is very close to exponential in this model, even when maxV is quite small:

var(m g) = (2/a)(1 - exp( -m g a/2))

where g is the time distance (2 tmrca). This is similar to mean reversion behavior described by continuous Ornstein-Uhlenbeck process. In other cases, variance saturation can be studied numerically. For a general b, variance approaches saturation value at a lower rate than in the case b=-a/2.

Direct numerical evaluation of variance saturation and an empirical approximative formula are compared in this plot for various choices of a and b parameters (dashed lines "(a)" are based on the approximation):
47806

The dashed-line curves in the plot are given by

approximation: var(m g) = (2/a)(1 - f(t) )

where f(t) is a guess for a time autocorrelation function

f(t) = exp(-(log(1+A*sqrt(t)))^2/A^2 )

A=B(1+ 3B )/6 with B=| b+a/2 | / sqrt(a) , and t=m g a/2

This is based on likely properties for a solution for continuous STR variables and then scaling that back to the case of discrete STR's and tweaking the expression for A to fit the numerical results. The guess looks fairly accurate over a range of a and b parameters. Continuous STR model can be obtained by setting x = i sqrt(a) , b' = b/sqrt(a) and letting a->0 while x and b' have finite values. Large b values can cause numerical difficulties. Most typical case for real STR is probably 0 < b < a though this is likely not universally true.

MarkoH
12-27-2021, 10:19 PM
A quick test of the STR saturation ideas, just by using 111 STR's, small amount of data, somewhat arbitrary parameters, just to give a rough general idea.

With Nordtvedt like variance estimator (saturation variance Vsat = Infinity), the numbers in []'s are the TMRCA estimate ranges wrt all haplogroups higher on the list:

R/Q [ 12... 22] ky (5)
N [ 17... 32] ky (5)
IJ [ 18... 35] ky (6)
G [ 17... 33] ky (5)
C [ 22... 41] ky (6)
E [ 22... 41] ky (5)
A [ 42... 77] ky (7)

With saturating STR variance estimator :

R/Q [ 17... 33] ky (9)
N [ 28... 54] ky (11)
IJ [ 31... 60] ky (11)
G [ 29... 56] ky (11)
C [ 40... 78] ky (12)
E [ 39... 75] ky (12)
A [122...264] ky (22)

The number in ()'s is the iteration count. Pauplin like weighting was used for macrohaplogroup tree, like var(N,R/Q) = 0.5*var(N,R) + 0.5*var(N,Q), etc. Since error estimates don't model parameter and model uncertainly, errors are certainly larger than suggested here.

This used variance curve A parameter A=0.25 (previous post). Saturation variance Vsat was modeled with extreme values. log(Vsat) produced by this estimator was 75% correlated with previous tree based estimates for 67 STRs. Due to small amount of data and since previous estimates likely were too low, Vsat was scaled up by a factor of 1.6 in order to avoid underestimation.

The Octave code used for the calculations using R/Q as an example:



%
% STR variance based time estimator with STR saturation modeling
% in Octave https://www.gnu.org/software/octave/
%

% STRs
str = {...
'393','390','19a','391','385a','385b','426','388', '439','389i','392','389B',...
'458','459a','459b','455','454','447','437','448', '449','464a','464b','464c',...
'464d','460','Y-GATA-H4','YCAIIa','YCAIIb','456','607','576','570','CDY a','CDYb','442',...
'438','531','578','DYF395S1a','DYF395S1b','590','5 37','641','472','DYF406S1','511','425',...
'413a','413b','557','594','436','490','534','450', '444','481','520','446',...
'617','568','487','572','640','492','565','710','4 85','632','495','540',...
'714','716','717','505','556','549','589','522','4 94','533','636','575',...
'638','462','452','445','Y-GATA-A10','463','441','Y-GGAAT-1B07','525','712','593','650',...
'532','715','504','513','561','552','726','635','5 87','643','497','510',...
'434','461','435'};

% 111 STR mutation rates
m = 1e-3*[...
0.95 2.45 1.68 2.76 1.83 3.46 0.11 0.58 4.71 2.34 0.60 2.77 7.17 0.52 1.53 ...
0.27 0.20 3.16 0.83 1.24 8.34 1.87 3.20 4.24 3.72 3.31 2.37 0.50 1.14 5.39 ...
2.48 11.09 8.93 14.36 18.45 3.29 0.49 0.57 0.22 0.43 0.32 0.19 1.31 0.37 0.02 ...
1.61 1.29 0.24 2.29 1.53 3.50 0.43 0.07 0.28 8.19 0.11 3.56 4.37 1.82 3.44 ...
0.60 0.47 0.79 1.37 0.20 0.23 0.72 18.28 1.58 0.07 1.15 1.31 7.73 1.00 0.75 ...
1.66 1.20 4.99 0.88 1.98 0.22 3.71 0.41 0.18 0.52 0.56 1.69 0.92 4.10 1.17 ...
1.67 0.84 1.54 16.38 0.23 7.58 4.12 4.44 4.61 2.63 1.65 3.18 0.24 3.39 1.22 ...
1.35 0.97 3.17 0.28 2.03 0.22]';

% largest and smallest observed values (in each hg, outliers of more than
% max(8,8sigma) are removed) , 3986 haplotypes
min_str=[...
10 20 11 9 9 10 10 9 9 9 7 10 11 7 8 7 10 19 13 15 22 9 11 11 11 ...
8 7 13 17 12 8 9 13 28 30 8 7 8 5 12 12 6 8 6 8 7 7 12 12 14 ...
13 3 10 9 11 6 10 16 16 8 10 8 10 9 10 9 7 27 9 8 11 8 20 22 16 ...
9 8 9 6 8 8 6 9 8 8 8 24 6 10 17 10 6 6 16 13 15 7 19 11 10 ...
13 20 11 17 15 8 12 14 7 9 8 ]';
max_str=[...
16 27 18 13 19 21 14 19 15 17 17 21 24 10 11 12 14 32 17 23 38 17 19 20 20 ...
13 14 24 25 20 19 22 25 42 44 15 14 14 10 17 19 9 13 12 9 14 13 12 25 27 ...
22 13 14 18 20 10 16 31 24 22 15 13 16 13 13 15 14 45 19 10 19 15 31 29 22 ...
16 15 15 15 15 10 14 14 12 14 14 34 14 17 31 20 15 13 32 18 24 17 27 20 16 ...
18 29 16 27 24 15 17 24 12 15 13 ]';

A = repmat(0.25, size(m)); % variance curve A parameter

% saturation variance Vsat

% a rough Vsat estimate motivated by extreme value theory
Vsat = Vsat0 = max((max_str-min_str).^2/2, 1);
m0 = 7.5e-7; % m0 depends on the dataset size (eg)
idx = m > exp(1) * m0 .* Vsat0;
for i=1:15; Vsat(idx) = Vsat0(idx)./log( m(idx)./(m0 * Vsat(idx)) ); end;
% scale up
Vsat *= 1.6;


% for usual variance estimator, set
% Vsat(:) = Inf;

% for exponential STR saturation, set
% A(:) = 0;

%
% Iterative time estimate based on variances of STR differences, var(str(i)):
%
% g = sum_i w0(i) w1(i) var(str(i)) / sum_i m(i) w0(i)
%
% with error estimate:
% std(g) = sqrt( sum_i w0(i)^2 w1(i)^2 varvar(i) ) / sum_i m(i) w0(i)
%
% where
%
% w0(i) = 1/(1 + 2 m(i) ( var(i) / (dvar(i)/dg)))
% w1(i) = m(i) g / var(i)
%
% The w0 and w1 expressions follow from the maximum-likelihood optimality
% condition for continuous STR variables:
%
% (d/dg) log( prod_i (var(i))^(-N/2) exp( - N var(str(i)) / (2*var(i)) ) ) = 0
%
% which is modified so that it reduces to Nordtvedt type estimator in
% the linear variance case for discrete STR's:
% w0(i) = 1/(1+2*m(i) g) , w1(i) = 1 when var(i) = m(i) g
% The required modification is
% dvar(i)/dg/(2 m var(i)) -> 1/(1+ 2m var(i)/(dvar(i)/dg)) .
%
% The following model is used for the STR variance:
%
% var(i) = Vsat (1- exp( -(log(1+A sqrt(m g/Vsat))/A)^2 ) ) , A>0
%
% It has the properties:
%
% var(i) = Vsat (1 -exp( - m g/Vsat)) when A -> 0
%
% var(i) = m g , if m g/Vsat small at any A
%
% log(Vsat - var(i)) = O( ( log(g) )^2 ) , A > 0
%
function [w0, w1, vv] = weight(g, m, Vsat, A)
t = m*g./Vsat;
eps = 1e-4;

% A*sqrt(t) and t small
ne = t .* (1-A.*sqrt(t));
dne = 1 - 1.5*A.*sqrt(t);
iw1 = 1 - A.*sqrt(t) - 0.5*t; % 1/w1

% Vsat<Inf, A small : exponential saturation
idx = (A.^2 < eps & t > eps);
iw1(idx) = (1-exp(-t(idx)))./t(idx);

% Vsat<Inf
idx = ( A.*sqrt(t) > eps);
ne(idx) = (log(1+A(idx).*sqrt(t(idx))) ./A(idx)).^2;
dne(idx) = log(1+A(idx).*sqrt(t(idx))) ./ ...
(A(idx).*(sqrt(t(idx)) + A(idx).*t(idx)));
iw1(idx) = (1-exp(-ne(idx)))./t(idx);

% Vsat<Inf, small ne
w0 = 1./(1+ 2*Vsat.*(ne+0.5*ne.^2)./dne);

% Vsat<Inf
idx = (ne>eps);
w0(idx) = 1./(1 + 2*Vsat(idx).*(exp(ne(idx)) - 1)./dne(idx));

% Vsat=Inf : no saturation
idx = (t == 0);
w0(idx) = 1./(1+2*m(idx)*g);

w1 = 1./iw1; % w1 = m g / variance
v = m * g .* iw1; % variance
vv = v.*(1+2*v); % variance of variance
endfunction;

% vstr : input vector of STR difference variances


% test: R/Q interclade variance
vstr = [...
0.493,2.965,1.161,0.808,8.526,8.811,0.046,0.128,0. 735,0.525,3.058,0.857,2.949,0.102,0.805,0.016,0.19 1,1.999,1.059,1.484,...
1.858,2.133,2.513,3.348,2.867,0.834,2.251,0.355,9. 477,1.409,1.623,2.786,3.425,9.506,4.437,0.889,1.02 0,0.137,1.073,0.389,...
2.867,0.039,0.961,0.098,0.002,2.976,0.760,0.014,1. 521,1.088,2.002,0.991,0.114,0.086,2.525,0.054,0.74 0,6.820,9.690,0.997,...
0.443,0.142,0.170,0.791,0.529,0.585,0.682,4.245,0. 680,0.284,0.984,1.370,2.823,2.458,1.178,1.056,1.08 8,1.310,1.047,0.874,...
0.012,1.846,0.998,0.079,0.107,0.535,3.854,0.400,0. 975,13.143,3.896,1.022,1.441,7.789,0.422,7.053,2.9 96,2.472,1.724,0.840,...
0.857,1.039,0.094,2.028,5.754,2.326,0.765,0.867,0. 124,1.488,0.043]';


% iterative solution for g
g = 0;
maxi=50;
for i=1:maxi
[w0, w1, vv] = weight(g, m, Vsat, A);
g0 = g;
g = sum( w0.*w1.*vstr ) / sum( w0.*m ); % 2TMRCA
sigma = sqrt(sum( (w0.*w1).^2 .* vv )) / sum( w0.*m ); % std for 2TMRCA
if abs(g-g0) < 0.1
break;
endif;
end;

if i==maxi
disp('iteration limit!');
endif;

% -2*sigma ... +2*sigma range for TMRCA
fprintf ('tmrca = %4.0f [%3.0f ... %3.0f] kyears \n', 0.5*g*30/1e3, (0.5*g-sigma)*30/1e3, (0.5*g+sigma)*30/1e3);
% test for R/Q produces
% tmrca = 25 [ 17 ... 33] kyears


edit: log(Vsat-var(i)) not log(var(i) ) in the comments in the code

MarkoH
12-30-2021, 06:28 PM
Instead of the relatively complex method used in the code, one could also use simple mean/variance optimized expression

g = sum_i var(str(i)) / (1 + 2*var_i(g)) / sum_i (var_i(g)/g) / (1 + 2*var_i(g))

with a general model STR variance curve var_i(g) and with iterative solution for g. The iteration solves the model equation

sum_i var(str(i)) / (1 + 2*var_i(g)) = sum_i var_i(g) / (1 + 2*var_i(g))

for g. In other words, it matches the observed variances with the model variances with weighting. However, even fully saturated str's impact the result if their observed variances var(str(i)) differ from the model variances var_i(g) at the optimal g.

The w(i) = 1/(1+2*var_i(g)) weighting minimizes the error variance

var(g) = sum_i w(i)^2 var(var(str(i))) / (sum_i w(i) (var_i(g)/g) ))^2

when var(var(str(i)) is set equal to var_i(g) (1 + 2 var_i(g)) . This var(var(str(i))) expression is accurate in the case with var_i(g) = m(i) g with small g and also in the mean reverting case for continuous variables when it reduces to 2 var_i(g)^2 . But it is not exactly accurate with mean reverting strs with small saturation variance; this is a similar situation than with the maximum-likelihood like method used in the code.

It looks likely that that tree estimates for str model parameters underestimate both saturation variance and curve A parameters. This is since minimum distance optimization reduces apparent variance from actual level and also probably reduces the apparent rate difference between high and low repeat numbers.

Edit:
In the lowest order in g mean reversion then means that the effective mutation rate is reduced:

m(i) -> m(i) / ( 1 + A(i) sqrt(t) +t/2)

where t = m(i) g / Vsat(i)