1. Use ComplexHeatmap to display calculator results in multiple columns

Originally Posted by PLogan
I now want to experiment with maybe doing this with other calculators such as K36, although that might produce a very busy chart.
For calculators with a large number of components like Eurogenes K36 or MDLP K27, maybe a heatmap will be more clear. And a heatmap also has the advantage that you can easily see which components have small percentages like 0% or 1%.

If you have a heatmap with a large number of columns and rows, then I think it's easier to read if you have row and column names on all four sides of the heatmap or if you split the heatmap in multiple columns. Neither of those is supported by the pheatmap package I used earlier, but you can use the ComplexHeatmap package instead:

In order to reorder the columns of the heatmap, I did an MDS of the FST matrix of the calculator and I ordered the components based on their position in the first dimension of the MDS. I reordered the rows of the heatmap based on the position of the populations in a hierarchical clustering tree, where I reordered the branches of the tree so that populations with a high percentage of the first columns were placed first and populations with a high percentage of the last columns were placed last. Another approach to ordering the rows and columns would be to use the Seriation package: https://jokergoo.github.io/ComplexHe...tmap-seriation.

Code:
```library(ComplexHeatmap)
library(circlize) # for `colorRamp2`
library(vegan) # for `reorder.hclust`
library(colorspace) # for `hex`

mds=cmdscale(fst,ncol(fst)-1)
t2=as.matrix(t)%*%mds

t=t[,order(-mds[,1])] # reorder columns based on position in first dimension of MDS matrix of FST matrix

t=t[reorder(hclust(dist(t2)),as.matrix(t)%*%seq(ncol(t))^2)\$order,] # order rows based on position in hierarchical clustering tree so populations with a high percentage of the first columns are placed first
# t=t[order(prcomp(t2)\$x[,1]),] # order rows based on position in first dimension of PCA
# t=t[reorder(hclust(dist(t2)),prcomp(t2)\$x[,1])\$order,] # order rows based on position in hierarchical clustering tree weighed by PC1 of PCA
# ser=seriation::seriate(t,method="BEA_TSP");t=t[get_order(ser,1),get_order(ser,2)] # use the `seriation` package to reorder rows and columns (this doesn't take FST into account)

rows=40
mats=lapply(seq(1,nrow(t),rows),function(i)as.matrix(t[i:(i+rows-1),]))

cellsize=30
png("1.png",w=length(mats)*(ncol(t)*cellsize+1000),h=rows*cellsize+1000,res=72)

maps=lapply(mats,function(mat){
Heatmap(
mat,
show_heatmap_legend=F,
show_column_names=F,
show_row_names=F,
cluster_columns=F,
cluster_rows=F,
width=ncol(mat)*unit(cellsize,"pt"),
height=nrow(mat)*unit(cellsize,"pt"),
col=colorRamp2(seq(0,100,length.out=7),hex(HSV(c(210,210,130,60,40,20,0),c(0,rep(.5,6)),1))),
cell_fun=function(j,i,x,y,w,h,fill)grid.text(sprintf("%.0f",mat[i,j]),x,y,gp=gpar(fontsize=15))
)
})

draw(Reduce(`+`,maps))
dev.off()
system("mogrify -gravity center -trim -border 16 -bordercolor white 1.png")```
I also made a script which uses ComplexHeatmap to display the results of an ADMIXTURE run with multiple K values side by side: https://anthrogenica.com/showthread....l=1#post770253.

2. ## The Following 3 Users Say Thank You to Nganasankhan For This Useful Post:

David Bush (06-27-2022),  PLogan (06-24-2022),  Telfermagne (06-26-2022)

3. Thank you very much for the additional guidance.

4. Injecting my family into the MDLP K27 heatmaps...

5. ## The Following User Says Thank You to PLogan For This Useful Post:

Nganasankhan (06-24-2022)

6. Display the closest populations to a population as a heatmap

For some reason I hadn't thought of this before, but a nice way to show the closest populations to a population is to display them as a heatmap, so then you not only see the distance to each population but also the percentage of their components.

I used the same method to account for FST as before, where I used matrix multiplication to multiply the matrix of admixture percentages with an MDS matrix of the FST matrix of the calculator.

Code:
```library(ComplexHeatmap)
library(circlize) # for `colorRamp2`
library(vegan) # for `reorder.hclust`
library(colorspace) # for `hex`

targ="Komi"

mds=cmdscale(fst,ncol(fst)-1)
t2=as.matrix(t)%*%mds
dist=sort(as.matrix(dist(t2))[targ,])
t=t[,order(-mds[,1])] # rearrange columns based on the order of each component in the first dimension of MDS of the FST matrix

cellsize=30
png("1.png",w=ncol(t)*cellsize+1000,h=nrow(t)*cellsize+1000,res=72)

draw(Heatmap(
as.matrix(t),
cell_fun=function(j,i,x,y,w,h,fill)grid.text(sprintf("%.0f",t[i,j]),x,y,gp=gpar(fontsize=15)),
col=colorRamp2(seq(0,100,length.out=7),hex(HSV(c(210,210,130,60,40,20,0),c(0,rep(.5,6)),1))),
show_heatmap_legend=F,
show_column_names=F,
show_row_names=F,
cluster_columns=F,
cluster_rows=F,
width=ncol(t)*unit(cellsize,"pt"),
height=nrow(t)*unit(cellsize,"pt"),
column_title=paste0("Distance to ",targ," (multiplied by MDS of FST)"),column_title_gp=gpar(fontsize=36),
))

dev.off()
system("mogrify -gravity center -trim -border 16 -bordercolor white 1.png")```

7. ## The Following 3 Users Say Thank You to Nganasankhan For This Useful Post:

David Bush (06-27-2022),  PLogan (06-24-2022),  Telfermagne (06-26-2022)

8. Another learning lesson... converting the "Make a heatmap of the FST matrix of a calculator" example from K13 to K36

Had to comment out the /1e3 on the read.csv or I got all zeros in the cells.

Code:
```library(pheatmap)
library(vegan) # for `reorder.hclust`
library(colorspace) # for `hex`

hc=hclust(as.dist(t))
hc=reorder(hc,cmdscale(t)[,1])

pheatmap::pheatmap(
t*1000,
filename="K36_Heatmap_of_FST_Matrix.png",
clustering_callback=function(...)hc,
legend=F,
cellwidth=16,
cellheight=16,
treeheight_row=80,
treeheight_col=80,
fontsize=8,
border_color=NA,
display_numbers=T,
number_format="%.0f",
fontsize_number=7,
number_color="black",
colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0),c(0,rep(.5,6)),1)))(256)
)```

9. ## The Following User Says Thank You to PLogan For This Useful Post:

David Bush (06-27-2022)

10. Run calculators on the command line with ADMIXTURE

I now figured out a faster way to run a calculator for a large batch of samples, which is to simply use the `-P` option of ADMIXTURE to project the samples. See section 2.14 of the ADMIXTURE manual (https://dalexander.github.io/admixtu...ure-manual.pdf). For example the following code runs Eurogenes K36 for all samples in the 1240K version of the Reich dataset (which took only about 2 hours on my computer, even though it would've taken about 4 days with the Python admix script):

Code:
```wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_1240K_public.{anno,geno,ind,snp}
f=v50.0_HO_public;convertf -p <(printf %s\\n genotypename:\ \$f.geno snpname:\ \$f.snp indivname:\ \$f.ind outputformat:\ PACKEDPED genotypeoutname:\ \$f.bed snpoutname:\ \$f.bim indivoutname:\ \$f.fam)
unzip K36.zip
plink --allow-no-sex --bfile v50.0_1240K_public  --extract <(cut -d\  -f1 K36.alleles) --geno .9999 --make-bed
cut -d\  -f1 K36.alleles|paste - K36.36.F|awk 'NR==FNR{a[\$1]=\$0;next}{print a[\$2]}' - plink.bim|cut -f2 >plink.36.P.in
cut -d\  -f2 plink.fam|paste - <(awk '{for(i=1;i<=NF;i++)\$i=sprintf("%.2f",100*\$i)}1' plink.36.Q)|awk -F\\t 'NR==FNR{a[\$2]=\$13;next}{print a[\$1]":"\$0}' v50.0_1240K_public.anno -|tr '\t ' ,|sort>reich.k36.1240k```
The `--geno .9999` option is not actually needed in this case, but if you run ADMIXTURE with a smaller set of samples and you don't include the `--geno` option, then it can result in an error that "all genotypes are missing for a SNP locus".

You can download a zip file for the calculator files for K36 from here: https://drive.google.com/file/d/1vOJ...hgnDDL0zHUJ4YN. (Mirror: https://onedrive.live.com/?authkey=%...1FDFEB45%21141.)

Here's a CSV file for the K36 results of samples in the Reich dataset: https://drive.google.com/file/d/1dGD...IV4Cg6KnKHPeeX. I used the 1240K version when available and the HO version only for samples that were missing from the 1240K version. Eurogenes K36 has about 160,000 overlapping SNPs with the 1240K version but only about 50,000 overlapping SNPs with the HO version.

The zip file linked above includes the text file `Eurogenes_K36_refs.txt` which contains a list of samples that were used as references for Eurogenes K36. I excluded a total of 1868 samples from the CSV file where the part after the last underscore of the sample ID in the text file was the same as the version ID or master ID of a sample in the Reich dataset. But that may have resulted in some false positives, and it also resulted in false negatives because many samples had different names in the text file and the Reich dataset. I didn't bother to remove samples that suffer from the calculator effect manually, but they include some Chuvash samples which got almost 100% of the Volga-Ural component, even though other Chuvash samples only got about 20%.

The following code generates population averages so that it removes duplicate samples, it removes samples with less than 200,000 SNPs, it removes samples that have been marked as contaminated, and it removes various suffixes from the population names:

Code:
```curl -Lso reich.k36.csv 'https://drive.google.com/uc?export=download&id=1dGDy5_jpDR6vxXYz1qIV4Cg6KnKHPeeX'
curl -LsO https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_HO_public.anno
LC_ALL=C sort -t\$'\t' -rnk14 v50.0_HO_public.anno|awk -F\\t '!a[\$3]++&&\$14>=2e5{print\$2}'|awk -F'[:,]' 'NR==FNR{a[\$0];next}\$2 in a' - <(sed 1d reich.k36.csv)|sed 's/:[^,]*//'||grep -v ^\\.\\.|sed -E 's/(_(son|daughter|brother|sister|father|mother|published|all|dup)|[._][12]d\.|[._]rel)[^,]*//'|sed -E 's/\.(SG|SDG|DG|WGA)//'|grev -v ^Ignore_|grep -Ev '_(contam|fail)'|awk -F, '{n[\$1]++;for(i=2;i<=NF;i++){a[\$1,i]+=\$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%.2f",a[i,j]/n[i]);print o}}'|sort|cat <(head -n1 reich.k36.csv) ->reich.k36.csv.ave```

11. ## The Following 3 Users Say Thank You to Nganasankhan For This Useful Post:

Kale (06-27-2022),  MistH028 (07-06-2022),  PLogan (06-26-2022)

12. Thank you so much for this K36 file Nganasankhan. I'm doing a little cleanup with kit names that have commas in them.

Australian:IHW9118(BUR,E)
Australian:IHW9193(WON,M)
Australian:IHW9195(WON,C)
and the seven Ignore_Australian kits.

13. I made two errors in the last shell command in post #16: `||grep` should be `|grep` and `grev` should be `grep`.

And also in post #14, there's a comma missing from the end of the line that starts with `right_annotation`.

Originally Posted by PLogan
I'm doing a little cleanup with kit names that have commas in them.

Australian:IHW9118(BUR,E)
Australian:IHW9193(WON,M)
Australian:IHW9195(WON,C)
and the seven Ignore_Australian kits.
I updated the file at Google Drive so I removed the part within the parentheses from the names of those Australian samples, so I changed for example `IHW9118(BUR,E)` to `IHW9118`.

14. ## The Following User Says Thank You to Nganasankhan For This Useful Post:

PLogan (06-28-2022)

15. Use Mantel's test to calculate a correlation coefficient between G25 and calculators

I now used ADMIXTURE to project all samples in the 1240K+HO version of the Reich dataset using various calculators. I then selected 1053 modern samples which had the same sample name and population name in the Reich dataset and in G25, and for each calculator, I used Mantel's test to compare a distance matrix of the calculator results of the samples with a distance matrix of the same samples in G25 (https://anthrogenica.com/showthread....etic-distances). When I took FST into account by multiplying the matrix of calculator results with an MDS matrix of the FST matrix of the calculator, I got the highest correlation coefficient with Harappaworld (0.990), followed by Eurogenes K13 (0.987), Eurogenes K15 (0.986), and Eurogenes K36 (0.984):

Calculator Correlation with G25
(FST accounted)
Correlation with G25
(FST not accounted)
SNPs in common
with 1240K+HO
Dodecad weac2 (K=7) 0.909 0.821 45877
Eurogenes K13 0.987 0.852 52770
Eurogenes K15 0.986 0.852 53263
Eurogenes K36 0.984 0.684 46518
Harappaworld (K=16) 0.990 0.854 56582
MDLP K27 0.973 0.752 49757

In the table above, accounting for FST resulted in a more dramatic increase in correlation with higher-K calculators like Eurogenes K36 and MDLP K27.

I got the lowest correlation with Dodecad africa9, but it's probably because it's designed for Africans, so it has 7 African components, 2 West Eurasian components, and no East Eurasian or American component.

I didn't attempt to exclude samples which suffered from the calculator effect, so in some cases the calculators with a lower correlation might simply have more samples that suffer from the calculator effect. One thing that might give an unfair advantage to the Eurogenes calculators is that my experiment only included samples that were in G25, but the G25 datasheets don't include samples that were part of the reference run for G25, so if Davidski also used the same references for G25 and Eurogenes calculators, then it might reduce the number of samples in my experiment that were part of the reference run of the Eurogenes calculators.

For example this calculates a correlation coefficient between Dodecad K12b and G25:

Code:
```library(ade4)

pick=intersect(rownames(t),rownames(g25))

121,0,144,72,139,123,101,181,146,52,129,218
74,144,0,142,56,67,99,97,60,137,52,129
123,72,142,0,142,129,91,173,145,32,130,209
62,139,56,142,0,41,98,136,52,137,33,185
48,123,67,129,41,0,86,140,63,123,41,183
81,101,99,91,98,86,0,135,99,85,82,172
132,181,97,173,136,140,135,0,124,170,123,51
66,146,60,145,52,63,99,124,0,139,47,169
118,52,137,32,137,123,85,170,139,0,125,206
41,129,52,130,33,41,82,123,47,125,0,171
173,218,129,209,185,183,172,51,169,206,171,0")

t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)

mantel.rtest(dist(t2[pick,]),dist(g25[pick,]))```
I want to add more calculators to my comparison, but I have not yet found an FST matrix for other calculators.

16. ## The Following User Says Thank You to Nganasankhan For This Useful Post:

Tomenable (07-06-2022)

17. Use Mantel's test to calculate a correlation coefficient between calculator results and an f2 matrix

The following code downloads the 1240K+HO version of the Reich dataset, converts it to PLINK format, selects modern samples, ignores samples with certain affixes in the sample name, and selects the first 8 samples from each population with at least 8 samples. Then it uses ADMIXTOOLS 2 to create an f2 matrix between the populations:

Code:
```wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_HO_public.{anno,ind,snp,geno}
f=v50.0_HO_public;convertf -p <(printf %s\\n genotypename:\ \$f.geno snpname:\ \$f.snp indivname:\ \$f.ind outputformat:\ PACKEDPED genotypeoutname:\ \$f.bed snpoutname:\ \$f.bim indivoutname:\ \$f.fam)
igno()(grep -Ev '\.REF|rel\.|fail\.|\.contam|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_relative|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref')
x=eight;sed 1d v50.0_HO_public.anno|awk -F\\t '\$5==0{print\$2,\$7}'|tr -d \"|grep -Fv .|igno|awk 'NR==FNR{++a[\$0];next}a[\$2]>=8&&++b[\$2]<=8' <(cut -f7 v50.0_HO_public.anno) ->\$x.pick
plink --allow-no-sex --bfile v50.0_HO_public --keep <(awk 'NR==FNR{a[\$1];next}\$2 in a' \$x.pick v50.0_HO_public.fam) --make-bed --out \$x
awk 'NR==FNR{a[\$1]=\$2;next}{\$1=a[\$2]}1' \$x.{pick,fam}>\$x.fam.temp;mv \$x.fam{.temp,}
Then next the following code goes through the CSV files of results for various calculators, so that it selects the samples that are included in the f2 matrix from the CSV file, calculates their population average, and calculates the correlation coefficient of the population averages with the f2 matrix:

Code:
```library(ade4)

sapply(c("africa9","eurasia7","harappaworld","k12b","k13","k15","k36","k7b","mdlpk27","weac2"),function(c){

t=t[sub(".*:","",rownames(t))%in%pick,]
a=data.frame(aggregate(t,list(sub(":.*","",rownames(t))),mean),row.names=1)

a2=as.matrix(a)%*%cmdscale(fst,ncol(fst)-1)

mantel.rtest(dist(a2),as.dist(f[rownames(a2),rownames(a2)]))\$obs
})```
Here are the results (the calculator results are multiplied by MDS of FST in each case):

Calculator Correlation with f2 (no LD
pruning before f2, 597,573 SNPs)
Correlation with f2 (with LD
pruning before f2, 115,808 SNPs)
Eurogenes K13 0.947 0.925
Eurogenes K15 0.945 0.921
Eurogenes K36 0.964 0.950
Harappaworld (K=16) 0.967 0.951
MDLP K27 0.967 0.958

Earlier I wasn't sure if I should perform LD pruning before I calculate f2 distances, but now I got higher correlation when I didn't do LD pruning, so I think I'll stop using LD pruning before f2. (In another experiment where I compared G25 distance to distance in `plink --pca`, I also got lower correlation when I did LD pruning before the PCA.)

18. ## The Following User Says Thank You to Nganasankhan For This Useful Post:

PLogan (07-04-2022)

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•