Hi Kale,

yes, I think you're right about this, it has to do with the maxmiss variable, and more generally, with the selection of SNPs.

I have written some of this up here:

https://uqrmaie1.github.io/admixtool....html#biases-1
The short version is this: maxmiss = 0 is the default option, which is most conservative, but also uses the smallest number of SNPs. You can set maxmiss to higher values to retain more SNPs when extracting data for many populations, but often it's better to just extract data for fewer populations at a time.

The reasons why maxmiss > 0 can cause problems is that it is generally not the case that SNPs that are missing in some populations are just a random subset of all SNPs. After I realized that this is in fact often a problem in practice, I set the default to maxmiss = 0, and changed the documentation to advise against using maxmiss > 0. I don't want to take the option out completely, because in many cases this can improve power without introducing bias. It's just not easy know when it's a good idea to use this and when it isn't.

So what's the difference between option 1 (f2_from_precomp) and option 2 (A directory which contains pre-computed f2-statistics)?

When there is no missing data (or when you set maxmiss = 0), f4 is exactly equal to the sum of 4 f2-statistics, and also to the sum of the 4 average allele frequency products of those population pairs. When some of the data is missing (or when maxmiss > 0), this is still true in expectation, but it's not exactly true anymore. It turns out that in the case of having missing data, the sum of 4 average allele frequency products is a better approximation than the sum of 4 f2-statistics.

extract_f2() computes 3 different things for each population pair: f2, average allele frequency products, and fst (fst is a recent addition).

When you run f2_from_precomp(), it will read the f2-statistics by default (because it doesn't know what you want to do with the output, and you can't use average allele frequency products to compute f3, for example). But when you run f4/qpdstat (or qpadm, which uses f4-stats), it will instead read the average allele frequency products, because they give better approximations of f4 in cases where maxmiss was set to > 0 when running extrat_f2().

You could get the same behavior by running f2_from_precomp(..., afprod = TRUE), and then plugging that into the f4/qpdstat function.

There is another small difference between how the f2-stats and the average allele frequency products are computed, which is that SNPs which have exactly the same allele frequency in all populations will be ignored in one case, but not the other. This can change the number of SNPs used and to some extent the estimate and SE, but it will barely affect the Z-score. The reason why this is so is because it makes it easier to exactly match the output of various ADMIXTOOLS 1 programs, which sometimes do and sometimes don't count those SNPs.

I wish this was all a bit simpler and less confusing, but this was the best I came up with in trying to make things both as accurate as possible, and match the results of the original ADMIXTOOLS. Many of these complications disappear when there either is little missing data, or when you keep maxmiss = 0. Even with maxmiss = 0, the number of SNPs can differ depending on which populations you choose, and sometimes even that can make a difference! But all in all, you'll be better off with setting maxmiss = 0. If you lose too many SNPs this way, you can either reduce the number of populations, or always compute f-stats from scratch, by passing the genotype file prefix as the first argument to f4. This will select the largest number of SNPs that are non-missing in all 4 populations of every f4-stat.