# Thread: Understanding Formal Statistics, f4 and D-Stats

1. ## Understanding Formal Statistics, f4 and D-Stats

Hey all, a question that's been on my mind for some time and that I've been meaning to seek further clarity on. Everyone who follows population genetics and has an interest in the historical ethnogenesis of different populations has surely encountered references to these two "formal" statistical methods of inferring population demographic history; they're standard features in the toolkit of any decent academic paper on historical population genetics. From my understanding, they're usually considered a more robust way of inferring population history than other tools like ADMIXTURE.

I've read some of the big papers that explain the use of these methods, namely Green et al. 2010 (which I believe was the paper that actually introduced D-stats) and another paper by Nick Patterson on D-stats (I believe he actually invented the methodology itself). While I think I have a decent rudimentary grasp of what these methods are and how they work, I'm not gonna lie, the papers were pretty technical, and I don't have any university-level coursework in genetics or much in the way of advanced statistics, so it's pretty dense reading for me.

So, I wanted to open a thread on these formal stats in the hope that some of the more knowledgeable (and patient) members here could elaborate on how these methodologies really work. I would like to approach this from as much of a blank slate as possible, in the interest of capturing as much information on the technicalities of these tools as possible.

1. In layman's terms, what are you actually testing when utilizing either of these methods?

2. What is the exact difference between the two? When would you want to use one, but not the other?

3. What are some limitations or confounding factors that can skew the results of these methods? (I know choice of outgroup is one, but how exactly?)

I've been meaning to make a thread on this for a while, so I wanted to just finally put something out there and hope someone would be patient enough to flesh things out a bit. I know this is an inherently technical topic, and I can ask more targeted questions and explain more of what I'm trying to get at if need be, but for now I figured I would try to keep it relatively basic. If anyone else reading this has their own questions about either of these methods, by all means feel free to chime in as well.

2. ## The Following 3 Users Say Thank You to TuaMan For This Useful Post:

artemv (04-19-2019),  poi (05-17-2019),  Ruderico (09-08-2018)

3. I'm not sure my understanding is 100% correct, but this is what I've found likely to be true through observation.

- With D-stats, you have 4 populations (W, X; Y, Z). You look at all the SNPs for which all four populations have coverage, and check how many mutations are shared between X & Y, X & Z, W & Y, and W & Z.

- There is some difference between haploid and diploid data? The results will be skewed, with diploids attracted to diploids and haploids to haploids, or something like that.
There are certain configurations where you can mix them and the results won't be materially affected, such as 'outgroup diploid haploid haploid' or 'diploid diploid haploid haploid'.

- I believe the difference between outgroups has something to do with damage and ascertainment (how the SNPs were selected for the panel in the first place).
Chimp, for example, would have a whole bunch of SNPs unique to it, but we're not typically looking at Chimp variation, so it just shows as underived for all human mutations. I think DNA damage would act in the same way.
A D-stat like Chimp Ust_Ishim Kostenki14 Bichon would probably be highly positive, suggesting either that Ust_Ishim & Bichon share drift, or that Chimp & Kostenki14 do.
The latter would be more likely/accurate because Kostenki14, being super ancient, probably has more damage and definitely has more Neanderthal (which, like Chimp, is not the variation we're looking for, so it would just show as underived for everything).
Mbuti Ust_Ishim Kostenki14 Bichon, on the other hand, would be 0. Africans (to some extent) have their own variation represented in the panel, and obviously share mutations common to all humans, so damage & Neanderthal would not attract to them.
There is the option of restricting to transversions, which means the software drops transition SNPs (C/T and G/A) altogether. Since ancient DNA damage mostly shows up as C-to-T (and the complementary G-to-A) changes, this restricts the effect damage has.
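To make the transversions point concrete, here is a minimal Python sketch of that kind of filter. The SNP list and the `is_transversion` helper are made up for illustration; this is not how any particular tool implements it, just the idea that a site is kept only when one allele is a purine (A, G) and the other a pyrimidine (C, T).

```python
# A transition stays within a class (A<->G or C<->T);
# a transversion crosses between purines and pyrimidines.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(allele1: str, allele2: str) -> bool:
    """Return True when one allele is a purine and the other a pyrimidine."""
    a, b = allele1.upper(), allele2.upper()
    return (a in PURINES and b in PYRIMIDINES) or (a in PYRIMIDINES and b in PURINES)


# Hypothetical SNP panel entries: (snp_id, allele1, allele2)
snps = [
    ("snp1", "C", "T"),  # transition, the typical damage pattern
    ("snp2", "A", "C"),  # transversion
    ("snp3", "G", "A"),  # transition
    ("snp4", "G", "T"),  # transversion
]

# Keep only transversion sites before computing any statistic.
kept = [snp for snp in snps if is_transversion(snp[1], snp[2])]
print([snp[0] for snp in kept])
```

The trade-off is that you throw away most of the panel, since transitions are the more common kind of SNP, so the remaining stats get noisier.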

4. ## The Following 3 Users Say Thank You to Kale For This Useful Post:

parasar (09-10-2018),  poi (05-17-2019),  TuaMan (09-08-2018)

5. Originally Posted by Kale
A D-stat like Chimp Ust_Ishim Kostenki14 Bichon would probably be highly positive, suggesting either that Ust_Ishim & Bichon share drift, or that Chimp & Kostenki14 do.
The latter would be more likely/accurate because Kostenki14, being super ancient, probably has more damage and definitely has more Neanderthal (which, like Chimp, is not the variation we're looking for, so it would just show as underived for everything).
Interesting, if that stat was positive I would've just assumed it meant Ust_Ishim and Bichon shared drift. I thought one of the points of choosing your outgroup was that you assume it's closer to the root of the phylogenetic tree than the other three populations, and that it hasn't admixed with any of the others at any point. So basically, in the setup above, W X Y Z (W being the outgroup), you would pick an outgroup (Chimp) that you assume doesn't share any real drift with Y or Z (Kostenki14 and Bichon), and so you'd really just be testing Ust_Ishim's relationship to either of the two.

Originally Posted by Kale
Mbuti Ust_Ishim Kostenki14 Bichon, on the other hand, would be 0. Africans (to some extent) have their own variation represented in the panel, and obviously share mutations common to all humans, so damage & Neanderthal would not attract to them.
Why in this case wouldn't Neanderthal attract to Mbuti but in the above situation does attract to Chimp? All three of these pops would just show as underived relative to later human pops, correct?

All humans share a lot of mutations to the exclusion of Neanderthal. The most logical deduction I can think of is that these mutations are represented in the SNP panel, enough so that at least an African outgroup doesn't act as a garbage bin like Chimp does. The attraction between Ust_Ishim and Bichon will still take place, just to a lesser degree.
The best way I can describe it: say Mbuti and Kostenki14 share 100 mutations, and Ust_Ishim and Bichon share 200. 1% extra Neanderthal in Kostenki14 would disrupt 1 shared mutation with Mbuti, but 2 shared mutations with Ust_Ishim.
With that sort of thing, you'd want an outgroup (meaning no interaction with the other 3 pops) as close as possible phylogenetically to the other three populations in the equation.
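The toy numbers above can be written out as a couple of lines of Python. This is only the arithmetic from the example (the counts and the 1% figure are the made-up values from the post, and `disrupted_sites` is a hypothetical helper), not a real admixture calculation.

```python
def disrupted_sites(shared_sites: int, extra_percent: int) -> float:
    """Number of shared sites lost if extra_percent of them are overwritten."""
    return shared_sites * extra_percent / 100


shared_with_mbuti = 100      # toy count of mutations shared with the outgroup
shared_with_ust_ishim = 200  # toy count of mutations shared with Ust_Ishim
extra_neanderthal = 1        # 1% extra Neanderthal in Kostenki14

lost_mbuti = disrupted_sites(shared_with_mbuti, extra_neanderthal)
lost_ust_ishim = disrupted_sites(shared_with_ust_ishim, extra_neanderthal)
print(lost_mbuti, lost_ust_ishim)
```

The same percentage removes twice as many shared sites from the pair that shares more to begin with, and the closer the outgroup's count is to the other pair's, the smaller that imbalance becomes.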

7. ## The Following 2 Users Say Thank You to Kale For This Useful Post:

Nibelung (09-09-2018),  TuaMan (09-09-2018)

8. Can D-stats (or f4) tell you anything about the directionality of the gene flow between pops? I seem to have read conflicting things in the past regarding that question.

I'm not sure about f4, but a single D-stat will say nothing about the direction of flow, only its presence.
Remember, it doesn't even say which two the flow is between. If W, X; Y, Z is positive, it could be flow between X & Z or between W & Y.
You can run multiple with different populations to kind of get an idea though.
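For anyone who wants to see the sign logic in numbers, here is a rough Python sketch of f4 and D computed from per-SNP allele frequencies. It assumes the ADMIXTOOLS-style convention, where the numerator at each SNP is (w - x)(y - z) and the D denominator is (w + x - 2wx)(y + z - 2yz); the frequencies below are invented 0/1 calls, and a real run would also need standard errors (for example from a block jackknife), which this leaves out.

```python
def f4(w, x, y, z):
    """Mean of (w - x) * (y - z) over SNPs; inputs are allele frequencies."""
    terms = [(a - b) * (c - d) for a, b, c, d in zip(w, x, y, z)]
    return sum(terms) / len(terms)


def d_stat(w, x, y, z):
    """Same numerator as f4 (summed), normalized so the result lies in [-1, 1]."""
    num = sum((a - b) * (c - d) for a, b, c, d in zip(w, x, y, z))
    den = sum((a + b - 2 * a * b) * (c + d - 2 * c * d)
              for a, b, c, d in zip(w, x, y, z))
    return num / den


# Invented derived-allele calls at five SNPs (0 = ancestral, 1 = derived).
W = [0, 0, 0, 0, 0]  # outgroup, ancestral everywhere
X = [1, 1, 1, 0, 1]
Y = [0, 1, 0, 1, 0]
Z = [1, 0, 1, 1, 1]

# X and Z share the derived allele at three sites, X and Y at only one,
# so D(W, X; Y, Z) comes out positive.
print(d_stat(W, X, Y, Z))
print(f4(W, X, Y, Z))
# Swapping Y and Z flips the sign without changing the magnitude.
print(d_stat(W, X, Z, Y))
```

Note that the positive value only says the W & Y / X & Z pattern is in excess; nothing in the calculation distinguishes which of those two pairs is responsible, or which way any gene flow went.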

10. ## The Following User Says Thank You to Kale For This Useful Post:

TuaMan (09-09-2018)

11. Originally Posted by Kale
All humans share a lot of mutations to the exclusion of Neanderthal. The most logical deduction I can think of is that these mutations are represented in the SNP panel, enough so that at least an African outgroup doesn't act as a garbage bin like Chimp does. The attraction between Ust_Ishim and Bichon will still take place, just to a lesser degree.
The best way I can describe it: say Mbuti and Kostenki14 share 100 mutations, and Ust_Ishim and Bichon share 200. 1% extra Neanderthal in Kostenki14 would disrupt 1 shared mutation with Mbuti, but 2 shared mutations with Ust_Ishim.
With that sort of thing, you'd want an outgroup (meaning no interaction with the other 3 pops) as close as possible phylogenetically to the other three populations in the equation.
I'm happy to see that you and a couple of others here (Shaikorth, Megalophias & Parasar come to mind) have an understanding of some of the concepts affecting shared drift calculations.

You are correct that ascertainment bias is one of the factors that skew results when the polymorphisms in SNP panels were ascertained in Europeans. I recall discussing this here in 2015, when IBS showed greater allele sharing between Mbuti and Chimp. I mentioned then that this does not imply Mbuti are more closely related to Chimps than Eurasians are; rather, the alleles Mbuti & Chimp share are ancestral, and if genome-wide polymorphisms are analyzed, my bet is that the sharing between Mbuti and Chimp will not be greater than between Eurasians and Chimp.

The other factor skewing results is when Mbuti is used as an outgroup in a D-stat such as D (Europeans, W Asians; Steppe, Mbuti). Here, because of greater mutation sharing between W Asians and Mbuti than between Europeans and Mbuti, the W Asian - Steppe shared drift is dampened. I discuss this in detail in a new article at http://www.eurasiandna.com/2017/12/2...ived-ancestry/.

This is the reason I started using Chimp in lieu of Mbuti in D-stats such as the above back in 2016, and I'm glad to see that others are doing so too now (the recent paper on Levant-Chl).

Abstract from my article:

BACKGROUND

Over the past decade various tools have been developed for ancient DNA analysis and assessing shared drift between populations. Some programs, such as STRUCTURE and ADMIXTURE, are allele frequency based. Others, such as Reich Lab’s ADMIXTOOLS, use both allele frequencies and direct allele comparisons. Others yet, such as IBS and IBD, compare genomes for allele matches and shared haplotypes. However, the accuracy of shared drift calculations due to relatively recent gene flow between an ancient population and a contemporary one is limited for the following reasons:

The GRCh37/Hg19 Human Reference genome, which was introduced in 2009 and has been used to align/map the vast majority of the aDNA sequences published to date, is based on a few anonymous individuals representing a few countries and is thus not representative of human diversity. Although the ethnic identity of the donors is not public, evidence based on personal experience indicates that NW Europeans and Africans are overrepresented. This causes a bias towards Europeans and Africans when aligning/mapping aDNA sequences to the Human Reference, because some aDNA reads that fall outside of European or African variation often map to the wrong regions of the Reference genome, and sometimes don’t map at all.

Researchers and genome bloggers should be aware of some of the issues outlined herein, which affect the accuracy of the analysis results.

We share some solutions based on personal experience which should help researchers and genome bloggers achieve higher accuracy in shared drift or admixture analysis involving ancient DNA (aDNA).
Using outgroups such as Mbuti and Chimp in comparisons involving populations such as Europeans and Asians, which have significantly different drift histories, with SNPs ascertained in European populations, leads to inaccurate inferences that many researchers and bloggers are not aware of.

12. ## The Following 6 Users Say Thank You to Kurd For This Useful Post:

Jatt1 (09-10-2018),  Kale (09-09-2018),  lukaszM (09-09-2018),  parasar (09-10-2018),  Ruderico (09-10-2018),  TuaMan (09-09-2018)

13. Double posted

14. Originally Posted by Kurd
The other factor skewing results is when Mbuti is used as an outgroup in a D-stat such as D (Europeans, W Asians; Steppe, Mbuti). Here, because of greater mutation sharing between W Asians and Mbuti than between Europeans and Mbuti, the W Asian - Steppe shared drift is dampened. I discuss this in detail in a new article at http://www.eurasiandna.com/2017/12/2...ived-ancestry/.

This is the reason I started using Chimp in lieu of Mbuti in D-stats such as the above back in 2016, and I'm glad to see that others are doing so too now (the recent paper on Levant-Chl).
Sorry if I am missing something... but from the two charts at the top of your article, wouldn't it be more problematic to use Chimp?
The Chimp excess alleles chart is basically: the farther from English you get, the more excess alleles shared with Chimp there are. That's basically ascertainment bias favoring English samples, right?

The Mbuti excess alleles chart looks like an inverse of what I'd imagine the levels of archaic ancestry are in those populations.
That doesn't seem like bias, but actual sharing within the human node. To be fair, it does seem a bit exaggerated though.

15. ## The Following 2 Users Say Thank You to Kale For This Useful Post:

Eterne (09-10-2018),  Jatt1 (09-10-2018)

16. Kale, how long did it take for you to get a decent handle on running the different Admixtools methods? I'd like to see how feasible it would be for someone like me to pick up on f3, f4, d-stats, by playing around with the tools myself, provided the learning curve isn't too steep.
