Understanding Formal Statistics, f4 and D-Stats



TuaMan
09-07-2018, 10:23 PM
Hey all, a question that's been on my mind for some time and that I've been meaning to seek further clarity on. Everyone who follows population genetics and has an interest in the historical ethnogenesis of different populations has surely encountered reference to these two "formal" statistical methods of inferring population demographic history; they're standard features in the toolkit of any decent academic paper on historical population genetics. From my understanding, they're usually considered a more robust method of inferring population history than other tools like ADMIXTURE and whatnot.

I've read some of the big papers that explain the use of these methods, namely Green et al 2010 (which I believe was the paper that actually introduced D-Stats) and another paper by Nick Patterson on D-Stats (I believe he actually invented the methodology itself). While I think I have a decent rudimentary grasp of what these methods are and how they work, I'm not gonna lie, the papers were pretty technical and I don't have any university-level course work in genetics or much in the way of advanced statistics, so it's pretty dense reading for me.

So, I wanted to open a thread up on these formal stats in the hopes that some of the more knowledgeable (and patient) members here could elaborate on how these methodologies really work. I would like to approach this from as much of a blank slate as possible, in the interests of trying to capture as much information on the technicalities of these tools as possible.

1. In layman's terms, what are you actually testing when utilizing either of these methods?

2. What is the exact difference between the two? When would you want to use one but not the other?

3. What are some limitations or confounding factors that can skew results of these methods (I know choice of OutGroup is one, but how exactly?)

I've been meaning to make a thread on this for a while, so I wanted to just finally put something out there and hope someone would be patient enough to flesh things out a bit. I know this is an inherently technical topic, and I can ask a bit more targeted questions and explain more what I'm trying to get at if need be, but for now I figured I would try to keep it relatively basic. If anyone else reading this has their own questions about either of these methods, by all means feel free to chime in as well.

Kale
09-08-2018, 04:13 PM
I'm not sure my understanding is 100% correct, but this is what I've found likely to be true through observation.

- With D-stats, you have 4 populations (W,X Y,Z). You look at all the snps which all of the four populations have coverage for, and check how many mutations are shared between X & Y, X & Z, W & Y, and W & Z.

- There is some difference between haploid and diploid? The results will be skewed, with diploids attracted to diploids and haploids to haploids or something like that.
There are certain configurations where you can mix and the results won't be materially affected, such as 'outgroup diploid haploid haploid' or 'diploid diploid haploid haploid'.

- I believe the difference in outgroups has something to do with damage and ascertainment (how the snps were selected for interest in the first place).
Chimp for example would have a whole bunch of snps unique to them, but we're not typically looking at Chimp variation, so it just shows as underived for all human mutations. I think DNA damage would act in the same way.
A d-stat like Chimp Ust_Ishim Kostenki14 Bichon would probably be highly positive, suggesting either Ust_Ishim & Bichon share drift, or Chimp & Kostenki.
The latter would be more likely/accurate because Kostenki14 being super ancient probably has more damage and definitely has more neanderthal (which like chimp is not the variation we're looking for so would just be underived everything)
Mbuti Ust_Ishim Kostenki14 Bichon on the other hand would be 0. Africans (to some extent) have their own variation represented in the panel, and obviously share mutations common to all humans, so damage & neanderthal would not attract to them.
There is the option of restricting to transversions, which means if say an snp naturally degrades from a C to a T over time, the software will only look at C's and ignore T's? Thus restricting the effect damage has.
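On that last point, restricting to transversions is usually understood as dropping every SNP whose two alleles are C/T or G/A (a transition), rather than ignoring only the T calls, because ancient DNA damage mostly shows up as C-to-T changes (G-to-A on the opposite strand). A minimal sketch of such a filter, assuming SNPs are given as (id, ref, alt) tuples; the SNP ids and the `is_transversion` helper are made up for illustration and are not part of ADMIXTOOLS:

```python
# Allele pairs that define a transition; every other pair is a transversion.
TRANSITIONS = {frozenset("CT"), frozenset("AG")}

def is_transversion(ref, alt):
    """Return True when the two alleles do not form a C/T or A/G pair."""
    return frozenset((ref.upper(), alt.upper())) not in TRANSITIONS

# Hypothetical SNP list: (id, reference allele, alternate allele).
snps = [
    ("snp1", "C", "T"),  # transition, the typical damage pattern
    ("snp2", "A", "C"),  # transversion
    ("snp3", "G", "A"),  # transition, damage seen from the other strand
    ("snp4", "G", "T"),  # transversion
]

kept = [snp for snp in snps if is_transversion(snp[1], snp[2])]
print([snp[0] for snp in kept])
```

The cost is that only a minority of SNPs survive the filter, so the statistic is computed on less data.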

TuaMan
09-09-2018, 02:13 PM
A d-stat like Chimp Ust_Ishim Kostenki14 Bichon would probably be highly positive, suggesting either Ust_Ishim & Bichon share drift, or Chimp & Kostenki.
The latter would be more likely/accurate because Kostenki14 being super ancient probably has more damage and definitely has more neanderthal (which like chimp is not the variation we're looking for so would just be underived everything)

Interesting, if that stat was positive I just would've assumed it meant Ust-Ishim and Bichon shared drift. I thought one of the points of choosing your outgroup was that you assume it's closer to the root of the phylogenetic tree than the other three populations, and that it hasn't admixed with any of the others at any point. So basically, in that setup above W X Y Z (W being the outgroup), you would pick an outgroup (Chimp) that you assume doesn't share any real drift with Y or Z (Kostenki and Bichon) and so you'd really just be testing Ust Ishim's relationship to either of the two.


Mbuti Ust_Ishim Kostenki14 Bichon on the other hand would be 0. Africans (to some extent) have their own variation represented in the panel, and obviously share mutations common to all humans, so damage & neanderthal would not attract to them.


Why in this case wouldn't Neanderthal attract to Mbuti but in the above situation does attract to Chimp? All three of these pops would just show as underived relative to later human pops, correct?

Kale
09-09-2018, 05:11 PM
All humans share a lot of mutations to the exclusion of Neanderthal. The most logical deduction I can think of is that these mutations are represented in the snp panel, enough so that at least an African outgroup doesn't act as a garbage bin like Chimp does. The attraction between Ust_Ishim and Bichon will still take place, just to a lesser degree.
The best way I can describe that, if Mbuti and Kostenki14 share 100 mutations, and Ust_Ishim and Bichon share 200. 1% extra Neanderthal in Kostenki14 would disrupt 1 shared mutation with Mbuti, but 2 shared mutations with Ust_Ishim.
With that sort of thing, you'd want an outgroup (meaning no interaction with the other 3 pops) as close as possible phylogenetically to the other three populations in the equation.

TuaMan
09-09-2018, 06:25 PM
Can D-stats (or f4) tell you anything about the directionality of the gene flow between pops? I seem to have read conflicting things in the past regarding that question.

Kale
09-09-2018, 07:15 PM
I'm not sure about F4, but a single d-stat will say nothing about direction of flow, only presence.
Remember, it doesn't even say which two the flow is between. If W,X Y,Z is positive it could be flow between X&Z or W&Y.
You can run multiple with different populations to kind of get an idea though.
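The counting described earlier in the thread, and the sign ambiguity mentioned here, can be sketched from allele frequencies. This is a toy version under the usual frequency-based definition of D(W, X; Y, Z), with made-up frequencies; it is not the qpDstat code and it skips standard errors entirely:

```python
def d_stat(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP allele frequencies, one list per population."""
    num = 0.0
    den = 0.0
    for pw, px, py, pz in zip(w, x, y, z):
        # Sites where W matches Y (and X matches Z) minus sites where W matches Z.
        num += (pw - px) * (py - pz)
        # All sites that show either of those two patterns.
        den += (pw + px - 2 * pw * px) * (py + pz - 2 * py * pz)
    return num / den

# Hypothetical frequencies at four SNPs.
W = [0.0, 0.0, 0.0, 0.0]
X = [1.0, 1.0, 0.0, 1.0]
Y = [0.0, 1.0, 1.0, 0.0]
Z = [1.0, 0.0, 0.0, 1.0]

print(d_stat(W, X, Y, Z))  # positive: W&Y / X&Z sharing outweighs W&Z / X&Y
print(d_stat(X, W, Z, Y))  # same value: swapping both pairs changes nothing
print(d_stat(W, X, Z, Y))  # sign flips when only one pair is swapped
```

A positive value only says that the W&Y and X&Z patterns are in excess; the formula treats those two pairings identically, which is why a single statistic cannot say which pair is responsible.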

Kurd
09-09-2018, 07:33 PM
All humans share a lot of mutations to the exclusion of Neanderthal. The most logical deduction I can think of is that these mutations are represented in the snp panel, enough so that at least an African outgroup doesn't act as a garbage bin like Chimp does. The attraction between Ust_Ishim and Bichon will still take place, just to a lesser degree.
The best way I can describe that, if Mbuti and Kostenki14 share 100 mutations, and Ust_Ishim and Bichon share 200. 1% extra Neanderthal in Kostenki14 would disrupt 1 shared mutation with Mbuti, but 2 shared mutations with Ust_Ishim.
With that sort of thing, you'd want an outgroup (meaning no interaction with the other 3 pops) as close as possible phylogenetically to the other three populations in the equation.

I'm happy to see that you and a couple of others (Shaikorth Megalophias & Parasar come to mind) here have an understanding of some of the concepts affecting shared drift calculations.

You are correct in that ascertainment bias is one of the factors which skew results where polymorphism is ascertained in Europeans in SNP panels. I recall discussing this here in 2015 where IBS showed greater allele sharing between Mbuti and Chimp and mentioning that this does not imply Mbuti are more related to Chimps than Eurasians, but rather, the alleles Mbuti & Chimp share are ancestral, and if genomewide polymorphisms are analyzed then my bet is that the sharing between Mbuti and Chimp will not be greater than between Eurasians and Chimp.

The other factor skewing results is when Mbuti is used as an outgroup in a dstat such as this D (Europeans, W Asians; Steppe, Mbuti). Here because of greater mutation sharing W Asians - Mbuti vs Europeans - Mbuti, W Asian - Steppe shared drift is dampened. I discuss this in detail in a new article at http://www.eurasiandna.com/2017/12/24/novel-accurate-method-assessing-derived-ancestry/.

This is the reason I started using Chimp in lieu of Mbuti in dstats such as the above back in 2016, and I'm glad to see that others are doing so too now (the recent paper on Levant-Chl).

Abstract from my article:



BACKGROUND

Over the past decade various tools have been developed for ancient DNA analysis and assessing shared drift between populations. Some programs such as STRUCTURE and ADMIXTURE are allele frequency based. Others such as Reich Lab’s ADMIXTOOLS use both allele frequencies and direct allele comparisons. Others yet, such as IBS and IBD, compare genomes for allele matches and shared haplotypes. However, shared drift calculation accuracy due to relatively recent gene flow between an ancient population and a contemporary one is limited for the following reasons:

The GRCh37/Hg19 Human Reference genome which was introduced in 2009, and has been used to align/map the vast majority of the aDNA sequences published to date is based on a few anonymous individuals representing a few countries and is thus not representative of human diversity. Although the donor identity ethnic group is not public, evidence based on personal experience indicates that NW Europeans and Africans are over represented. This causes a bias towards Europeans and Africans during alignment/mapping aDNA sequences to the Human Reference because some aDNA reads that fall outside of European or African variation often map to the wrong regions of the Reference genome, and sometimes don’t map at all.

Researchers and genome bloggers should be aware of some of issues outlined herein which affect accuracy of the analysis results.

We share some solutions based on personal experience which should help researchers and genome bloggers achieve higher accuracy in shared drift or admixture analysis involving ancient DNA (aDNA).


Using outgroups such as Mbuti and Chimp in comparisons involving populations such as Europeans and Asians, which have significantly different drift histories and SNPs ascertained in European populations, leads to inaccurate inferences that many researchers and bloggers are not aware of.

Kale
09-09-2018, 09:17 PM
The other factor skewing results is when Mbuti is used as an outgroup in a dstat such as this D (Europeans, W Asians; Steppe, Mbuti). Here because of greater mutation sharing W Asians - Mbuti vs Europeans - Mbuti, W Asian - Steppe shared drift is dampened. I discuss this in detail in a new article at http://www.eurasiandna.com/2017/12/24/novel-accurate-method-assessing-derived-ancestry/.

This is the reason I started using Chimp in lieu of Mbuti in dstats such as the above back in 2016, and I'm glad to see that others are doing so too now ( the recent paper on Levant-Chl)

Sorry if I am missing something... but from the two charts at the top of your article, wouldn't it be more problematic to use Chimp?
The Chimp excess alleles chart is basically the farther from English you get, the more excess alleles with chimp there are. That's basically ascertainment bias favoring English samples right?

The Mbuti excess alleles chart looks like an inverse of what I'd imagine levels of archaic ancestry are in those populations.
That doesn't seem like bias, but actual sharing within the human-node. To be fair it does seem a bit exaggerated though.

TuaMan
09-09-2018, 09:49 PM
Kale, how long did it take for you to get a decent handle on running the different Admixtools methods? I'd like to see how feasible it would be for someone like me to pick up on f3, f4, d-stats, by playing around with the tools myself, provided the learning curve isn't too steep.

Kurd
09-09-2018, 10:53 PM
Sorry if I am missing something... but from the two charts at the top of your article, wouldn't it be more problematic to use Chimp?
The Chimp excess alleles chart is basically the farther from English you get, the more excess alleles with chimp there are. That's basically ascertainment bias favoring English samples right?

The Mbuti excess alleles chart looks like an inverse of what I'd imagine levels of archaic ancestry are in those populations.
That doesn't seem like bias, but actual sharing within the human-node. To be fair it does seem a bit exaggerated though.

Sorry, I should have clarified that biases against E/W/S Asians in D ( Europeans, W/S Asians; Europeans, OUTGROUP) exist for BOTH Mbuti and Chimp. The biases exist due to ascertainment bias AND African admixture. The 2 are not necessarily dependent factors acting upon the test populations.

For the majority of W Asians the NET (Mbuti bias - Chimp bias) is +ve. For S Asians and Oceanians it's -ve. In other words, for W Asians it's better to use Chimp, whereas for E/S Asians and Oceanians it's better to use Mbuti; however, neither outgroup is ideal when comparing Europeans with E/W/S Asians and Oceanians.

To better understand what I'm referring to, this is the table which was used to generate the barplots in the article:



Population | Excess_Alleles_Shared_With_MBUTI [M] | Excess over English | Excess_Alleles_Shared_With_CHIMP over English [C] | Net over English [M-C]
Abkhasian | 589 | 321 | 179 | 142
Kurds | 657 | 389 | 273 | 116
Armenian | 588 | 320 | 209 | 111
Adygei | 518 | 250 | 158 | 92
Druze | 722 | 454 | 392 | 62
Georgian | 583 | 315 | 269 | 46
Pathan | 660 | 392 | 348 | 44
Sicilian | 646 | 378 | 345 | 33
Chechen | 510 | 242 | 213 | 29
Greek | 442 | 174 | 166 | 8
Ukrainian | 311 | 43 | 35 | 8
French | 294 | 26 | 21 | 5
English | 268 | 0 | 0 | 0
Iran_Fars | 704 | 436 | 438 | -2
Jordanian | 1059 | 791 | 800 | -9
Finnish | 236 | -32 | 3 | -35
Albanian | 473 | 205 | 255 | -50
Belarusian | 271 | 3 | 61 | -58
Estonian | 210 | -58 | 41 | -99
Kalash | 419 | 151 | 269 | -118
GujaratiD | 835 | 567 | 705 | -138
Scottish | 234 | -34 | 105 | -139
Saudi | 894 | 626 | 772 | -146
Altaian | 404 | 136 | 307 | -171
Balochi | 553 | 285 | 494 | -209
Han | 230 | -38 | 391 | -429
Chukchi | 0 | -268 | 363 | -631
BedouinB | 937 | 669 | 1400 | -731
Papuan | 26 | -242 | 4127 | -4369



Notice that for Abkhazians, Kurds, Adygei, etc., it's better to use Chimp as an outgroup when testing against Europeans, whereas for S Asians, E Asians, and Papuans it's better to use Mbuti. However, also notice that neither Mbuti nor Chimp is bias free when testing E/S/W Asians against Europeans.

The plots are not exaggerated since I did not use normalized values here. When I do in cases to highlight differences that are very small, I state so in the description.

EDIT: To give an idea how bad it can get, take a look at Jordanians and Bedouin. For Bedouin, using Mbuti, there are 669 alleles shared with Mbuti to the exclusion of English, and using Chimp there are 1400 alleles shared with Chimp to the exclusion of English. Thus neither outgroup would work well for Bedouin or Jordanians.
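The arithmetic behind the last two columns of the table can be reproduced in a few lines. This sketch copies a handful of rows from the table and applies the rule of thumb stated in the post (positive net favours Chimp, negative net favours Mbuti); the function name and the layout of the dictionary are only for illustration:

```python
# (alleles shared with Mbuti [M], excess alleles shared with Chimp over English [C]),
# copied from the table in the post.
table = {
    "English": (268, 0),
    "Kurds": (657, 273),
    "Jordanian": (1059, 800),
    "BedouinB": (937, 1400),
    "Papuan": (26, 4127),
}

def net_over_english(pop):
    """Return (Mbuti excess over English, Chimp excess over English, net)."""
    m_raw, c_excess = table[pop]
    m_excess = m_raw - table["English"][0]  # subtract the English baseline
    return m_excess, c_excess, m_excess - c_excess

for pop in table:
    m_excess, c_excess, net = net_over_english(pop)
    # Rule of thumb from the post: pick the outgroup with the smaller excess.
    better = "Chimp" if net > 0 else "Mbuti" if net < 0 else "either"
    print(pop, m_excess, c_excess, net, better)
```

Note that a small net value does not mean a small bias: for Jordanians the net is only -9, yet both excesses are around 800.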

Kale
09-10-2018, 01:00 AM
Theoretically with full genomes, Jordanians & Bedouin should share a bit more with Mbuti than English do, by virtue of lower Neanderthal levels, just as Papuans share less with Mbuti than English because of excess Denisovan.
That's almost certainly not enough to account for the discrepancy (that's what I meant by exaggerated in the last post, not the plot itself), but it would be interesting to know how much it contributes.

Kurd
09-10-2018, 01:19 AM
Theoretically with full genomes, Jordanians & Bedouin should share a bit more with Mbuti than English do, by virtue of lower Neanderthal levels, just as Papuans share less with Mbuti than English because of excess Denisovan.
That's almost certainly not enough to account for the discrepancy (that's what I meant by exaggerated in the last post, not the plot itself), but it would be interesting to know how much it contributes.

I think most of the allele sharing is due to African <——> SW Asian introgression

Eterne
09-10-2018, 08:00 AM
Sorry if I am missing something... but from the two charts at the top of your article, wouldn't it be more problematic to use Chimp?
The Chimp excess alleles chart is basically the farther from English you get, the more excess alleles with chimp there are. That's basically ascertainment bias favoring English samples right?

There seems to be a lot of discussion here about how ascertainment will affect the relationship between the two outgroups (Mbuti, Chimp) and different modern populations. But not much so far on how this will affect the relationship between outgroups and ancients?

If there are ascertainment issues inflating relatedness between non-West Europeans and chimp (by monomorphic sites which are polymorphic in West Europeans), then will they not cut the other way and also inflate relatedness between ancients and chimp? Thereby also inflating relatedness between non-West Europeans and ancients! (esp. considered in two measures that don't use outgroup, like direct IBS between an ancient and modern sample).

There doesn't seem to be a reason why West Europeans+ancient will be unaffected by ascertainment, while non-West Europeans alone will be affected by ascertainment, deflating relatedness to the ancient. It seems like ancients should be affected by ascertainment too.

The only reason that this would not be the case for ancients is unless those ancients are genuinely more ancestral to West Europeans, and share the spectrum of variation for that reason...

Put another way, in a f4(A,B;C,D), I'd have thought any additional allele sharing between any three A,B, D due to ascertainment should cancel. It's only if you have ascertainment affecting any two pairs alone, e.g. A D, that you should have inflation/deflation?
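One narrow version of this cancellation intuition can be checked directly. Assuming f4(A,B;C,D) is computed as the average of (a - b)(c - d) over SNPs, a distortion that hits A and B equally drops out of the (a - b) term, while a distortion that hits A alone does not. The frequencies and the per-SNP `shift` below are invented for illustration, and real ascertainment effects are not a simple additive shift:

```python
def f4(a, b, c, d):
    """f4(A, B; C, D) as the mean of (a - b) * (c - d) over SNPs."""
    return sum((pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)) / len(a)

# Hypothetical allele frequencies at three SNPs.
A = [0.2, 0.5, 0.7]
B = [0.1, 0.4, 0.9]
C = [0.3, 0.6, 0.2]
D = [0.1, 0.1, 0.5]

# Hypothetical per-SNP distortion standing in for a bias.
shift = [0.05, -0.10, 0.02]
A_shifted = [p + s for p, s in zip(A, shift)]
B_shifted = [p + s for p, s in zip(B, shift)]

base = f4(A, B, C, D)
both = f4(A_shifted, B_shifted, C, D)  # bias applied to A and B alike
only_a = f4(A_shifted, B, C, D)        # bias applied to A alone

print(base, both, only_a)
```

Here `both` matches `base` up to rounding, while `only_a` differs, which is consistent with the idea that a bias matters when it affects the members of a pair unequally.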

lukaszM
09-10-2018, 08:43 AM
Can D-stats (or f4) tell you anything about the directionality of the gene flow between pops? I seem to have read conflicting things in the past regarding that question.

This is about D-stats on ADMIXTURE components, but is the theory provided by Dienekes applicable to other uses of D-stats?

http://dienekes.blogspot.com/2012/12/d-statistics-on-admixture-components.html

Kurd
09-10-2018, 03:18 PM
There seems to be a lot of discussion here about how ascertainment will affect the relationship between the two outgroups (Mbuti, Chimp) and different modern populations. But not much so far on how this will affect the relationship between outgroups and ancients?

If there are ascertainment issues inflating relatedness between non-West Europeans and chimp (by monomorphic sites which are polymorphic in West Europeans), then will they not cut the other way and also inflate relatedness between ancients and chimp? Thereby also inflating relatedness between non-West Europeans and ancients! (esp. considered in two measures that don't use outgroup, like direct IBS between an ancient and modern sample).

There doesn't seem to be a reason why West Europeans+ancient will be unaffected by ascertainment, while non-West Europeans alone will be affected by ascertainment, deflating relatedness to the ancient. It seems like ancients should be affected by ascertainment too.

The only reason that this would not be the case for ancients is unless those ancients are genuinely more ancestral to West Europeans, and share the spectrum of variation for that reason...

Put another way, in a f4(A,B;C,D), I'd have thought any additional allele sharing between any three A,B, D due to ascertainment should cancel. It's only if you have ascertainment affecting any two pairs alone, e.g. A D, that you should have inflation/deflation?



A great question which is very relevant to dstats, ADMIXTURE, and IBS, and sort of opens up a can of worms. There are many factors to consider here and I could probably take a couple of hours to completely cover the material.

Here are some main points to consider based on the following IBS run which I did at 100% genotype rate for an apples to apples comparison:

1- Ascertainment bias increasingly becomes a big factor the further back you go in time. Notice how Denisovan and Neanderthal share more alleles with Chimp than with Mbuti. This of course should not be the case.

2- Ascertainment bias and reference bias when mapping sequences with unusual variation, such as in Denisovan and Neanderthal, cause them to share more alleles with Mbuti than Eurasians do. That is why you see Neanderthal with 90%+ SSA in ADMIXTURE. Ascertainment bias at play big time!!

3- Ascertainment bias becomes noticeable as you go back 25K years. Notice the order of Ust -> Kostenki -> MA1 wrt allele sharing with Chimp.

4- The main reason I started diploid genotyping some of the published pseudo-haploid aDNA is because pseudo-haploids are inherently wrong at about 30% of the genome, because humans are heterozygous at about 30% of the genome. Notice how the pseudo-haploid steppe genomes only share 58-59% alleles with Mbuti vs their diploid counterparts at 64%.

5- Chimp is more forgiving on the published pseudo-haploid aDNA than Mbuti

6- Some of Ust Ishim's affinity with SSA may be real, or it could be reference bias. My experience with using Broad Institute's GATK pipeline to genotype was that it creates considerable reference bias.
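Point 4 can be illustrated with a toy model. The sketch below assumes genotypes coded as 0, 1 or 2 copies of an allele, builds a pseudo-haploid call by keeping one randomly chosen allele at each heterozygous site, and scores IBS as the average proportion of matching alleles. Real pseudo-haploid calls are drawn from sequencing reads rather than from a known genotype, and the function names are hypothetical:

```python
import random

def pseudo_haploid(genotypes, rng):
    """Keep one random allele per site: a het (1) becomes 0 or 2, homozygotes are unchanged."""
    return [rng.choice((0, 2)) if g == 1 else g for g in genotypes]

def ibs(g1, g2):
    """Mean proportion of alleles identical by state between two genotype lists."""
    return sum(1 - abs(a - b) / 2 for a, b in zip(g1, g2)) / len(g1)

# Hypothetical diploid genotypes at six SNPs; two of them are heterozygous.
diploid = [0, 1, 2, 1, 0, 2]
haploid = pseudo_haploid(diploid, random.Random(0))

print(haploid)
print(ibs(diploid, diploid))  # an individual matches itself completely
print(ibs(diploid, haploid))  # each het site now scores only half a match
```

Whichever allele is drawn, every heterozygous site costs half a match, so the pseudo-haploid version of this toy individual scores 5/6 against its own diploid genotype.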

BTW, Martiniano genotyped the diploid steppe aDNA and not me. I take a different approach than Rui when diploid genotyping ancients. I have expressed my reservations with his approach which entails dropping T genotypes where the reference is a C **. Although an ok approach, I have expressed to him the bias this creates. My diploid aDNA genotypes tend to be a tad more accurate. I'll post a dstat comparison of mine vs his later.



SAMPLE | IBS-CHIMP (68K SNPs)
CHIMP | 100.00%
Denisovan | 81.28%
Neanderthal-Altai | 79.77%
MBUTI | 68.05%
Ust-Ishim | 60.43%
Kostenki14 | 59.68%
MA1 | 59.48%
Eskimo1 | 59.10%
Eskimo2 | 59.04%
Karasuk-495-Diploid | 59.01%
Kotias-Diploid | 58.97%
BedouinB1 | 58.95%
BedouinB2 | 58.83%
Karasuk-493-Diploid | 58.82%
Karasuk-495-Haploid | 58.72%
Estonian2 | 58.72%
Loschbour_Imputed | 58.68%
Estonian1 | 58.66%
Loschbour | 58.63%
Karasuk-496-Diploid | 58.63%
Yamnaya-552-Dip | 58.62%
Kotias-Haploid | 58.48%
Sintashta-395-Dip | 58.39%
Yamnaya-552-Hap | 58.22%
Sintashta-395-Hap | 58.05%
Karasuk-493-Haploid | 58.00%
Karasuk-496-Haploid | 57.58%





SAMPLE | IBS-MBUTI (68K SNPs)
MBUTI | 100.00%
Denisovan | 69.18%
Neanderthal-Altai | 68.75%
CHIMP | 68.05%
BedouinB1 | 66.14%
Ust-Ishim | 66.07%
BedouinB2 | 65.94%
Eskimo2 | 65.33%
Eskimo1 | 65.31%
Karasuk-495-Diploid | 65.11%
Yamnaya-552-Dip | 64.55%
Kotias-Diploid | 64.43%
Karasuk-496-Diploid | 64.41%
Karasuk-493-Diploid | 64.40%
Sintashta-395-Dip | 64.39%
Estonian1 | 64.38%
Estonian2 | 64.33%
Loschbour_Imputed | 63.09%
Loschbour | 63.07%
Kostenki14 | 59.85%
Karasuk-495-Haploid | 59.48%
MA1 | 59.46%
Kotias-Haploid | 59.23%
Yamnaya-552-Hap | 59.15%
Sintashta-395-Hap | 58.72%
Karasuk-493-Haploid | 58.61%
Karasuk-496-Haploid | 58.41%



Edit: ** to mitigate aDNA damage

Arch Hades
10-10-2018, 06:04 PM
I don't really know how to properly read/understand most formal admixture stats either. But I see them popping up in roughly every new ancient DNA study I read, so I figure it's time to stop being such a lazy ass. I don't need to know the hard math, just the basics.

So let's take a look at this simple one from Lazaridis 2016 where they are testing whether or not the Natufians have SSA admixture relative to other Ancient West Eurasians.

I have a few questions. My first would be:

A. What does the third column mean? What does it stand for? You know, the one that says "f4(Other Ancient, African, Chimp)"? Also, what does the Z score mean? I know if the Z score was higher that would indicate admixture occurred, but that's about it.

https://3.bp.blogspot.com/-XMLt1tGOTH4/V2esZSlR-zI/AAAAAAAAAfI/Ohf7nsPxfpwcquDVws4tEdUjfUD6TqG_gCLcB/s1600/Lazaridis2016_EDT1.png

Megalophias
10-10-2018, 07:24 PM
The f4 column is the actual value of the f4 statistic, the Z value measures the significance of the stat (of course a stronger stat tends to be more significant but it doesn't correspond perfectly).

Arch Hades
10-10-2018, 09:59 PM
The f4 column is the actual value of the f4 statistic, the Z value measures the significance of the stat (of course a stronger stat tends to be more significant but it doesn't correspond perfectly).

OK, so what is the F4 statistic? And what does chimp have to do with anything? I'm lost

Bas
10-10-2018, 10:00 PM
There are so many people on here who could give a better answer than me, but if I'm correct, the third column (f4) and fourth column (Z score) are related, in that the Z score gives the 'strength/reliability' of the f4 stat: if there are two separate f4 stats with identical f4 values, the stat with the higher SNP count will get a stronger Z score, supposedly being 'more reliable' for the extra SNPs.

As I said, I stand to be corrected on that.

Edit: should have actually read the thread-Megalophias had already posted the answer as to what the f4 stat columns mean.

Arch Hades
10-11-2018, 03:45 PM
Can someone explain the difference between f4 and f3?

Bas
12-12-2018, 09:30 PM
f4 stats:

In the below example (http://eurogenes.blogspot.com/2016/06/the-discrepancy.html)

f4: Corded_Ware_Germany Anatolia_Neolithic CHG Chimp 0.002396 9.226 574503


if the f4 is pops:

A(Chimp)
X (CHG)
Y (Anatolia_N)
Z (Corded Ware Germany)

Then it shows how much closer or further away one population (Z) is to another population (X) compared to pop Y, using pop A as an outgroup. The example above is a good one as it shows Corded Ware has CHG admixture that is not present in Anatolia_N.

The first number (0.002396, shown in blue in the original post) is the f4 score, which being positive here means that in this stat, Z and X do share admixture to the exclusion of Y.

The second number (9.226, shown in red) is the Z-score, which is generally always going to agree in sign with the f4 score, the only difference being that the Z-score gives an indicator of how significant the f4 score is. Generally I have found that a highly positive/negative f4 stat run on few SNPs (the SNP count, 574503 here, is the final column) will result in a low Z-score. It's almost like a confidence score, as low SNP counts mean low confidence in the result.
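The link between SNP count and Z-score can be sketched with a delete-one-block jackknife, which is the general idea behind the standard errors these tools report. This is a simplified, unweighted version with equal-sized blocks of consecutive SNPs and made-up frequencies; ADMIXTOOLS uses a weighted jackknife over genomic blocks, so treat the details here as assumptions:

```python
import math

def f4(a, b, c, d):
    """f4(A, B; C, D) as the mean of (a - b) * (c - d) over SNPs."""
    return sum((pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)) / len(a)

def jackknife_z(a, b, c, d, n_blocks):
    """Return (f4, standard error, Z) using a delete-one-block jackknife."""
    n = len(a)
    size = n // n_blocks  # SNPs per block; any remainder at the end is never dropped
    leave_one_out = []
    for k in range(n_blocks):
        lo, hi = k * size, (k + 1) * size
        keep = [i for i in range(n) if i < lo or i >= hi]
        leave_one_out.append(f4(*([pop[i] for i in keep] for pop in (a, b, c, d))))
    mean = sum(leave_one_out) / n_blocks
    var = (n_blocks - 1) / n_blocks * sum((e - mean) ** 2 for e in leave_one_out)
    se = math.sqrt(var)
    estimate = f4(a, b, c, d)
    return estimate, se, estimate / se

# Hypothetical frequencies at four SNPs, one SNP per block.
A = [0.1, 0.2, 0.3, 0.4]
B = [0.0, 0.0, 0.0, 0.0]
C = [1.0, 1.0, 1.0, 1.0]
D = [0.0, 0.0, 0.0, 0.0]

estimate, se, z = jackknife_z(A, B, C, D, n_blocks=4)
print(estimate, se, z)
```

Z is simply the estimate divided by its standard error, so the same f4 value backed by more independent blocks tends to get a smaller standard error and therefore a larger |Z|.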

f3 stats:

To my knowledge, the difference between f4 and f3 stats is that f3 only shows evidence of admixture between two populations, with one population as an outgroup; for example, (Mbuti; Yamnaya, Corded Ware) would be hugely significant.
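As a rough illustration of the f3 idea, the statistic can be written as the average of (c - a)(c - b) over SNPs, where C is the population in the first slot. The sketch below uses invented frequencies and leaves out the sample-size correction that real implementations apply, so it is only meant to show the sign behaviour:

```python
def f3(c, a, b):
    """f3(C; A, B) as the mean of (c - a) * (c - b) over SNPs."""
    return sum((pc - pa) * (pc - pb) for pc, pa, pb in zip(c, a, b)) / len(c)

# Hypothetical allele frequencies at four SNPs.
A = [0.9, 0.1, 0.8, 0.3]
B = [0.1, 0.7, 0.2, 0.5]

# A population built as an even mix of A and B.
MIX = [(pa + pb) / 2 for pa, pb in zip(A, B)]

# An outgroup fixed for the other allele at every site.
OUT = [0.0, 0.0, 0.0, 0.0]

print(f3(MIX, A, B))  # negative: the first population sits between A and B
print(f3(OUT, A, B))  # positive: the "outgroup f3" form, larger when A and B share more drift
```

With the target in the first slot, a clearly negative value is evidence that it is a mixture of populations related to A and B. With an outgroup in the first slot, as in the (Mbuti; Yamnaya, Corded Ware) example, the value is read as a measure of how much drift the other two share.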

TuaMan
12-13-2018, 01:31 AM
f4 stats:

In the below example (http://eurogenes.blogspot.com/2016/06/the-discrepancy.html)

f4: Corded_Ware_Germany Anatolia_Neolithic CHG Chimp 0.002396 9.226 574503



Bas,

The f4 test is used to calculate admixture ratios, correct? In this instance, how could we tell anything about the admixture ratio of any of the pops X, Y, or Z?

This stat above just looks like a regular D-stat. If f4 can be used to calculate admix ratios, then what exactly does D-stats tell us that f4 doesn't?

EDIT: Also, just to clarify, the positive stat just means CWC and CHG share more drift with each other than Anatolian Neo and CHG share with each other, and not that Anatolian Neolithic is necessarily without any CHG, right?

Bas
12-13-2018, 03:09 AM
Bas,

The f4 test is used to calculate admixture ratios, correct? In this instance, how could we tell anything about the admixture ratio of any of the pops X, Y, or Z?

This stat above just looks like a regular D-stat. If f4 can be used to calculate admix ratios, then what exactly does D-stats tell us that f4 doesn't?

EDIT: Also, just to clarify, the positive stat just means CWC and CHG share more drift with each other than Anatolian Neo and CHG share with each other, and not that Anatolian Neolithic is necessarily without any CHG, right?

Yeah, I think you're right about the shared drift thing there, I worded it a bit clumsily! About the f4 stats and admixture ratios, qpAdm uses f4 stats to work out the admix proportions. This explains it: http://gensoft.pasteur.fr/docs/AdmixTools/4.1/pdoc.pdf .

Also: http://science.sciencemag.org/content/sci/suppl/2018/03/14/science.aar8380.DC1/aar8380_vandeLoosdrecht_SM.pdf

[For admixture modeling, we used the program qpAdm (v632) (16) of the admixtools v3.0 package. QpAdm can be viewed as a generalization of f4 statistics jointly modeling multiple of them. It tests if the observed target population and the proposed admixture model for it are symmetrically related to a set of outgroups, and summarizes the results of multiple such comparisons into a single statistic (16). It also estimates ancestry proportion coefficients, and their 5 cM block jackknife SEs, by minimizing the difference between the target and the model. More specifically, qpAdm requires a target population (T), source/surrogate populations (S) and a set of outgroups (O). Outgroups are differentially related to sources so that they can be distinguished by f4 statistics (Fig. S18). However, at the same time, outgroups must be related to the target and the sources distantly enough so that a source and its related ancestry in the target have a symmetrical genetic distance to all outgroups. An example of many scenarios to break this prerequisite is a post-mixture gene flow from the target into an outgroup]
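qpAdm fits many f4 statistics jointly, which is more than a short example can show, but the simplest special case of turning f4 into a proportion is a ratio of two f4 statistics. The sketch below builds a target X as an exact mix of two sources B and C and recovers the mixing fraction from f4(A,O;X,C) / f4(A,O;B,C). All frequencies are invented, A and O are just labelled as a reference population and an outgroup, and drift after admixture is ignored:

```python
def f4(a, b, c, d):
    """f4(A, B; C, D) as the mean of (a - b) * (c - d) over SNPs."""
    return sum((pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)) / len(a)

# Hypothetical allele frequencies at four SNPs.
A = [0.9, 0.2, 0.6, 0.3]  # reference population related to source B
O = [0.1, 0.1, 0.5, 0.5]  # outgroup
B = [0.8, 0.3, 0.7, 0.2]  # first source
C = [0.2, 0.4, 0.4, 0.6]  # second source

alpha = 0.3  # true share of B-related ancestry in the target
X = [alpha * pb + (1 - alpha) * pc for pb, pc in zip(B, C)]

estimate = f4(A, O, X, C) / f4(A, O, B, C)
print(estimate)
```

The ratio works here because x - c equals alpha times (b - c) at every SNP, so the numerator is exactly alpha times the denominator. A single f4 value on its own, like the one in the Corded Ware example, does not give a proportion.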

Difference between f4 stats and D-stats as stated by Nick Patterson: (actually copied this from Eurogenes comments section from a couple of years back)

As mentioned earlier, D-statistics are very similar to the 4-population test statistics introduced in REICH et al. (2009). The primary difference is in the computation of the denominator of D. For statistical estimation, and testing for ‘treeness’, the D-statistics are preferable, as the denominator of D, the total number of ‘ABBA’ and ‘BABA’ events, is uninformative for whether a tree phylogeny is supported by the data, while D has a natural interpretation: the extent of the deviation on a normalized scale from -1 to 1.

http://www.genetics.org/content/early/2012/09/06/genetics.112.145037
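The denominator point in that quote can be made concrete. In this sketch, which assumes the frequency-based definitions and uses made-up frequencies, f4 and D share the same numerator; f4 divides it by the number of SNPs, while D divides it by the total weight of sites showing either of the two informative patterns, which confines D to the range -1 to 1:

```python
def f4_and_d(w, x, y, z):
    """Return (f4, D) for populations W, X, Y, Z from per-SNP allele frequencies."""
    num = [(pw - px) * (py - pz) for pw, px, py, pz in zip(w, x, y, z)]
    den = [(pw + px - 2 * pw * px) * (py + pz - 2 * py * pz)
           for pw, px, py, pz in zip(w, x, y, z)]
    f4 = sum(num) / len(num)  # average per SNP, scale depends on the SNP set
    d = sum(num) / sum(den)   # normalized, always between -1 and 1
    return f4, d

# Hypothetical allele frequencies at four SNPs.
W = [0.10, 0.00, 0.20, 0.05]
X = [0.60, 0.30, 0.80, 0.40]
Y = [0.20, 0.10, 0.50, 0.30]
Z = [0.70, 0.40, 0.60, 0.35]

f4_value, d_value = f4_and_d(W, X, Y, Z)
print(f4_value, d_value)
```

Because the denominator is positive, the two statistics always have the same sign, so they agree on whether the tree is violated and differ only in scale.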

TuaMan
02-28-2019, 12:04 AM
Does anyone here run Admixtools out of a Linux virtual machine (I have a Windows PC), and if so which VM do you recommend? Ditto for the distribution as well.

anglesqueville
03-01-2019, 07:45 AM
Does anyone here run Admixtools out of a Linux virtual machine (I have a Windows PC), and if so which VM do you recommend? Ditto for the distribution as well.

I work with Fedora 25 (yes, only 25, and I don't want to update it, given the shared-library problems when installing admixtools on newer versions) in an Oracle VirtualBox VM. That works, but some problems are unavoidable: slowness and RAM management. I planned to buy another PC with Linux as the only system, but money, money, money...

Ruderico
03-02-2019, 02:09 PM
Sorry for the newbie question, but how does one make plots with f3-stats in PAST3, like Matt at Eurogenes did here https://imgur.com/a/42BjyWe ?

anglesqueville
03-03-2019, 09:47 AM
Sorry for the newbie question, but how does one make plots with f3-stats in PAST3, like Matt at Eurogenes did here https://imgur.com/a/42BjyWe ?

Assuming you want to plot a 2_columns matrix under a regression model (linear or polynomial), you choose "Model>Linear>Bivariate" or "Model>Polynomial". Example: 2 sÚries of D-stats showing affinity to WHG and natufian for some modern populations:

[Attachment 29126]

First select the 2 columns and make PLOT XY:

[Attachment 29127]

Obviously the regression is not linear. You select Model>Polynomial, and run. You get the natural parabolic regression:

[Attachment 29128]
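
For anyone without PAST3 to hand, the same linear-versus-polynomial comparison can be sketched in Python with NumPy. This is not what PAST does internally, just an ordinary least-squares fit via numpy.polyfit. The x and y values are synthetic numbers made up for the illustration (they are NOT real D-stats), and the names x, y and rss are my own.

```python
import numpy as np

# Synthetic stand-ins for two columns of D-stats (made-up values, not real data).
rng = np.random.default_rng(42)
x = np.linspace(-0.05, 0.05, 30)
y = 2.0 * x**2 + 0.1 * x + 0.01 + rng.normal(0.0, 1e-4, size=x.size)

# Least-squares fits of degree 1 (linear) and degree 2 (parabolic).
linear_coeffs = np.polyfit(x, y, 1)
quadratic_coeffs = np.polyfit(x, y, 2)

def rss(coeffs):
    """Residual sum of squares of a polynomial fit to (x, y)."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

print("linear RSS:   ", rss(linear_coeffs))
print("quadratic RSS:", rss(quadratic_coeffs))
```

A clearly smaller residual for the degree-2 fit is the numerical counterpart of seeing by eye that the regression is not linear.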

blackflash16
05-31-2019, 08:53 PM
Found this excellent guide from a pop. gen workshop, it covers everything from filtering/converting BAM files to working with ADMIXTOOLS:

https://buildmedia.readthedocs.org/media/pdf/comppopgenworkshop2019/latest/comppopgenworkshop2019.pdf

Nebuchadnezzar II
08-16-2020, 06:11 PM
Imagine 4 populations: 1,2,3,4

Their allele frequencies will be X1, X2, X3 and X4 respectively.

We're interested in a test for treeness, i.e. does the phylogeny of these populations resemble a tree? Consider a scenario in which populations 1-4 fit a tree where 4 is the outgroup, 3 is the next to branch off, and 1&2 are sister populations that split from each other last. Like in the attached picture:
[Attachment 39031]

Now imagine a polymorphic bi-allelic locus on the genome in which the ancestral allele is A and the derived is B.

How often would you expect the allele pattern ABBA - where pops 1&4 share allele A whilst pops 2&3 share allele B? Likewise, how often would you expect a BABA pattern - where 1&3 share B whilst 2&4 share A? If these populations are related in a tree-like fashion, both patterns are expected equally often, so a significant excess of either one over the other indicates admixture. Thus we compare the number of times we see a BABA pattern with the number of times we see an ABBA pattern, out of all the ABBA/BABA sites:

D = (BABA - ABBA) / (BABA + ABBA)

A significantly negative result means there are more ABBA events than BABA, and a significantly positive one means the opposite. A significant deviation either way rejects the tree model and thus supports a case of admixture. More technically:

P(ABBA)= E(X1 (1-X2) (1-X3) X4 + (1-X1) X2 X3 (1-X4))
P(BABA)= E((1-X1) X2 (1-X3) X4 + X1 (1-X2) X3 (1-X4))

Under the tree, a BABA pattern should be just as likely as an ABBA pattern, so we set these expected values equal to each other:

P(BABA)=P(ABBA)
therefore
P(BABA)-P(ABBA)=0

Subtracting these expected values results in:

E[(X1-X2)(X3-X4)] = 0

Which is another way of saying that the allele frequency difference between pops 1&2 and the allele frequency difference between pops 3&4 are expected to be uncorrelated, so their product averages out to zero.
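
For anyone who wants the subtraction step spelled out, here is the algebra as I would write it, using the same X1..X4 as above (the grouping into curly brackets is just my own bookkeeping):

```latex
\begin{align*}
P(\mathrm{BABA}) - P(\mathrm{ABBA})
 &= E\big[X_1(1-X_2)\{X_3(1-X_4) - (1-X_3)X_4\}\big]
  - E\big[(1-X_1)X_2\{X_3(1-X_4) - (1-X_3)X_4\}\big] \\
 &= E\big[\{X_1(1-X_2) - (1-X_1)X_2\}\,(X_3 - X_4)\big] \\
 &= E\big[(X_1 - X_2)(X_3 - X_4)\big]
\end{align*}
```

The two simplifications used are X3(1-X4) - (1-X3)X4 = X3 - X4 and X1(1-X2) - (1-X1)X2 = X1 - X2; the cross terms X3X4 and X1X2 cancel in each.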

The original f4 statistic is just the numerator of this D-statistic, i.e. it is not normalized. The f4 admixture ratio (i.e. the estimate of admixture proportions) uses a different normalization: it is the ratio of two f4 statistics.
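
To make the f4/D relationship concrete, here is a minimal Python sketch that evaluates both from per-site allele frequencies using the expressions above. It is only an illustration, not how ADMIXTOOLS is implemented: the frequencies are random made-up numbers rather than real populations, the function names are mine, and it does nothing about significance testing.

```python
import numpy as np

def abba_baba(x1, x2, x3, x4):
    """Per-site ABBA and BABA probabilities from allele frequencies."""
    abba = x1 * (1 - x2) * (1 - x3) * x4 + (1 - x1) * x2 * x3 * (1 - x4)
    baba = (1 - x1) * x2 * (1 - x3) * x4 + x1 * (1 - x2) * x3 * (1 - x4)
    return abba, baba

def f4_stat(x1, x2, x3, x4):
    """f4: average over sites of (X1 - X2)(X3 - X4), i.e. the unnormalized numerator."""
    return float(np.mean((x1 - x2) * (x3 - x4)))

def d_stat(x1, x2, x3, x4):
    """D: (BABA - ABBA) / (BABA + ABBA), summed over sites."""
    abba, baba = abba_baba(x1, x2, x3, x4)
    return float((baba.sum() - abba.sum()) / (baba.sum() + abba.sum()))

# Made-up, independent allele frequencies for 4 populations at 1000 sites.
rng = np.random.default_rng(0)
x1, x2, x3, x4 = rng.uniform(0.0, 1.0, size=(4, 1000))

abba, baba = abba_baba(x1, x2, x3, x4)
f4_value = f4_stat(x1, x2, x3, x4)
d_value = d_stat(x1, x2, x3, x4)

# Site by site, BABA - ABBA is exactly (X1 - X2)(X3 - X4).
print(np.allclose(baba - abba, (x1 - x2) * (x3 - x4)))
print("f4:", f4_value, "D:", d_value)
```

The only difference between the two functions is the division by BABA + ABBA, which is what puts D on the -1 to 1 scale.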

TuaMan
08-20-2020, 06:38 PM
Both f4 and D stats are solely assessing whether populations share an excess of derived alleles with one another then, correct? Neither takes ancestral allele sharing into account at all?

Nebuchadnezzar II
08-21-2020, 01:19 PM
The choice of allele does not affect the stat: counting the alternate allele instead simply flips the sign of both terms in the product, (X1-X2) and (X3-X4), so the product itself is unchanged.

Take P(ABBA) in the expression below, for example. There are two possible ways an ABBA pattern could emerge, hence the '+' between the two possibilities.

P(ABBA)= E(X1 (1-X2) (1-X3) X4 + (1-X1) X2 X3 (1-X4))

That is, populations 1&4 could have the ancestral allele whilst 2&3 share the derived allele, OR 1&4 could share the derived allele whilst 2&3 share the ancestral allele. Both constitute ABBA patterns.
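
If it helps, the point about allele choice can be checked numerically. In this small Python sketch (made-up random frequencies, function and variable names are my own) the counted allele is swapped at a random half of the sites, i.e. X is replaced by 1-X in all four populations at those sites, and f4 is recomputed.

```python
import numpy as np

def f4_stat(x1, x2, x3, x4):
    """Average over sites of (X1 - X2)(X3 - X4)."""
    return float(np.mean((x1 - x2) * (x3 - x4)))

# Made-up allele frequencies: 4 populations (rows) x 500 sites (columns).
rng = np.random.default_rng(1)
freqs = rng.uniform(0.0, 1.0, size=(4, 500))

# Swap which allele is counted at a random subset of sites.
flip = rng.random(500) < 0.5
swapped = np.where(flip, 1.0 - freqs, freqs)

f4_original = f4_stat(*freqs)
f4_swapped = f4_stat(*swapped)
print(f4_original, f4_swapped)
```

At a swapped site both (X1-X2) and (X3-X4) change sign, so each site contributes the same product as before and the two f4 values agree.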