
Thread: R and shell scripts for admixture calculators

  1. #1

    R and shell scripts for admixture calculators

    I have previously posted similar threads for G25, ADMIXTOOLS 2, SmartPCA, and ADMIXTURE: showthread.php?23441-Shell-commands-and-R-scripts-for-working-with-G25, showthread.php?23677-R-scripts-for-ADMIXTOOLS-2, showthread.php?23708-Shell-and-R-scripts-for-SmartPCA-and-ADMIXTURE.

    You can post your own scripts in other languages, and you can also post links to scripts made by other people.

    Make an MDS plot based on the FST matrix of a calculator

    MDS (multidimensional scaling) is essentially a version of PCA that takes a distance matrix as its input. If you have the FST matrix of a calculator, you can use MDS to plot the components of the calculator in space so that the Euclidean distance between the points of the components is virtually the same as the FST distance between the components. The input matrix for MDS needs to be a square matrix, and the maximum number of dimensions produced by MDS is one fewer than the number of rows and columns of the input matrix, so for example it's 12 in the case of the 13-component Eurogenes K13.
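    As a quick check of the distance-preservation claim (a minimal sketch using just the first three K13 components from the FST matrix below), the Euclidean distances between the MDS points reproduce the input distances:

    Code:
    d=as.dist(matrix(c(0,19,28,19,0,36,28,36,0),3,dimnames=rep(list(c("North_Atlantic","Baltic","West_Med")),2)))/1e3
    m=cmdscale(d,2) # 3 points fit in at most 2 dimensions
    round(as.matrix(dist(m))*1e3) # recovers the input values 19, 28, 36

    Here you can see the first 6 dimensions out of 12: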



    Code:
    library(tidyverse)
    library(cowplot)
    library(ggrepel)
    library(ggforce)
    library(colorspace) # for `rainbow_hcl`
    
    fst=as.matrix(as.dist(read.csv(text="North_Atlantic,Baltic,West_Med,West_Asian,East_Med,Red_Sea,South_Asian,East_Asian,Siberian,Amerindian,Oceanian,Northeast_African,Sub-Saharan
    ,,,,,,,,,,,,
    19,,,,,,,,,,,,
    28,36,,,,,,,,,,,
    26,32,36,,,,,,,,,,
    26,35,28,21,,,,,,,,,
    52,62,50,48,39,,,,,,,,
    64,65,76,57,60,82,,,,,,,
    114,114,122,110,111,127,76,,,,,,
    111,111,123,109,112,130,83,56,,,,,
    138,137,154,138,144,161,120,113,105,,,,
    179,181,187,177,176,191,146,166,177,217,,,
    122,127,124,116,108,121,113,145,151,185,203,,
    146,150,150,140,135,141,133,164,170,204,220,41,",check.names=F)))/1e3
    
    mds=cmdscale(as.matrix(fst),ncol(as.matrix(fst))-1,eig=T)
    
    p=as.data.frame(mds$points)
    pct=paste0("Dim ",1:ncol(p)," (",sprintf("%.1f",100*mds$eig[seq(ncol(p))]/sum(mds$eig)),"%)") # keep only as many eigenvalues as there are dimensions
    
    plots=lapply(seq(1,5,2),function(i){
      xpc=paste0("V",i)
      ypc=paste0("V",i+1)
      # draw a line from each component to its two nearest neighbors (j=1 is the component itself, at distance 0)
      seg=lapply(1:3,function(j)apply(as.data.frame(fst),1,function(x)unlist(p[names(sort(x)[j]),c(xpc,ypc)],use.names=F))%>%t%>%cbind(p[,c(xpc,ypc)]))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
    
      k=as.factor(cutree(hclust(as.dist(fst)),6))
    
      ggplot(p,aes(!!sym(xpc),!!sym(ypc)))+
      geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray60",size=.15)+
      ggforce::geom_mark_hull(aes(color=!!k,fill=!!k),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
      geom_point(aes(color=!!k),size=.5)+
      # geom_text(aes(color=!!k),label=rownames(p),size=2,vjust=-.7)+
      ggrepel::geom_text_repel(aes(color=!!k,label=rownames(p)),size=2,max.overlaps=Inf,force=2,segment.size=.2,min.segment.length=.2,box.padding=.05)+
      labs(x=pct[i],y=pct[i+1])+
      scale_x_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=c(.04,.04)))+
      scale_y_continuous(breaks=seq(-1,1,.02),expand=expansion(mult=c(.04,.04)))+
      scale_fill_manual(values=rainbow_hcl(nlevels(k),90,60))+
      scale_color_manual(values=rainbow_hcl(nlevels(k),90,60))+
      theme(
        axis.text=element_text(size=6,color="black"),
        axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
        axis.ticks=element_blank(),
        axis.ticks.length=unit(0,"cm"),
        axis.title=element_text(size=7),
        legend.position="none",
        panel.background=element_rect(fill="white"),
        panel.border=element_rect(color="gray85",fill=NA,size=.6),
        panel.grid.major=element_line(color="gray85",size=.2),
        plot.background=element_rect(fill="white")
      )
    })
    
    cowplot::plot_grid(plotlist=plots,ncol=3)
    ggsave("1.png",width=15,height=5)
    Calculate distance to a specified population so that FST is taken into account

    When you use Vahaduo to calculate distances between populations in a calculator like K13, a population that is 100% Baltic is equally distant from a population that is 50% Siberian and 50% Baltic as from one that is 50% North_Atlantic and 50% Baltic. To calculate distances so that the FST distance between the components is taken into account, you can first use MDS on the FST matrix of K13 to plot the components of the calculator in 12-dimensional space like in the previous script. Next you can obtain the coordinates of each population inside the same 12-dimensional space by expressing each population as a linear combination of the coordinates of its components: if for example a population is 50% Siberian and 50% Baltic, it's moved halfway from the origin of the space towards the point of the Siberian component, and halfway towards the point of the Baltic component (see the toy example below). To generate the linear combination for each row in the spreadsheet of a calculator, you can simply use matrix multiplication to multiply the matrix of the spreadsheet with the MDS matrix of the FST matrix.
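    Here's a toy example of the linear combination (a minimal sketch with made-up 2D coordinates for two components):

    Code:
    comp=rbind(Baltic=c(0,1),Siberian=c(4,3)) # hypothetical MDS coordinates of two components
    admix=matrix(c(.5,.5),1,dimnames=list("Pop",rownames(comp))) # a population that is 50% of each
    admix%*%comp # the population lands at the midpoint (2, 2)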

    For example in the original K13 spreadsheet, Maris have a high percentage of the Baltic component, so if FST is not taken into account, Maris are closer to most Balto-Slavic populations than to any Central Asian population. But if FST is taken into account, Maris end up moving closer to some of the Central Asian populations that have a level of Mongoloid admixture similar to that of Maris. This calculates the distance to Maris so that FST is taken into account:

    Code:
    t=read.csv("https://pastebin.com/raw/H4pm9UF1",row.names=1,check.names=F) # K13 original
    fst=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1)
    
    t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
    
    s=head(sort(as.matrix(dist(t2))["Mari",]),16)
    writeLines(paste(round(s),names(s)))
    Output:

    0 Mari
    656 Chuvash
    1261 Tatar
    1402 Nogay
    1478 Afghan_Turkmen
    1776 Afghan_Tadjik
    1850 Tadjik
    1893 Turkmen
    1947 Uzbeki
    2128 Aghan_Hazara
    2168 La_Brana-1
    2211 Burusho
    2411 East_Finnish
    2417 Afghan_Pashtun
    2516 MA-1
    2518 Erzya

    In another thread, I used Mantel's test to calculate a correlation coefficient between genetic distance calculated using admixture calculators and genetic distance calculated using f2 or G25 (showthread.php?22402-Mantel-s-Test-G25-vs-genetic-distances). I found that the correlation greatly increased when I multiplied the table of admixture percentages with an MDS matrix of the FST matrix of a calculator.
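    A minimal sketch of running a Mantel test with `vegan::mantel`, here just correlating the raw K13 distances with the FST-adjusted distances from the previous script (the linked thread compared calculator distances against f2 and G25 distances in the same way):

    Code:
    library(vegan)
    mantel(dist(t),dist(t2))$statistic # Mantel correlation between the two distance matrices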

    Compare regular distance with distance where FST is taken into account

    Here you can see that when FST is accounted for, then Maris move closer to Central Asians and South Asians but further from Europeans and Siberians:



    Code:
    library(tidyverse)
    library(ggforce)
    library(colorspace) # for `rainbow_hcl`
    
    t=read.csv("https://pastebin.com/raw/H4pm9UF1",row.names=1,check.names=F)
    fst=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1)
    
    t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
    
    targ="Mari"
    xy=data.frame(x=rank(as.matrix(dist(t))[,targ]),y=rank(as.matrix(dist(t2))[,targ]))
    
    k=as.factor(cutree(hclust(dist(t2)),24))
    
    ggplot(xy,aes(x,y))+
    ggforce::geom_mark_hull(aes(color=!!k,fill=!!k),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
    geom_abline(linetype="dashed",color="gray80",size=.3)+
    geom_point(aes(color=!!k),size=.5)+
    geom_text(aes(color=!!k),label=rownames(xy),size=2,vjust=-.7)+
    scale_x_continuous(breaks=seq(1,1000,20),expand=expansion(mult=c(.04,.04)))+
    scale_y_continuous(breaks=seq(1,1000,20),expand=expansion(mult=c(.04,.04)))+
    scale_fill_manual(values=rainbow_hcl(nlevels(k),90,60))+
    scale_color_manual(values=rainbow_hcl(nlevels(k),90,60))+
    labs(x=paste("Rank of K13 distance to",targ,"(FST not accounted for)"),y=paste("Rank of K13 distance to",targ,"(multiplied by MDS of FST)"))+
    theme(
      axis.text=element_text(size=6),
      axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"cm"),
      axis.title=element_text(size=8),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray85",fill=NA,size=.6),
      panel.grid.major=element_line(color="gray85",size=.2),
      plot.background=element_rect(fill="white"),
      plot.subtitle=element_text(size=7),
      plot.title=element_text(size=11)
    )
    
    ggsave("1.png",width=7,height=7)
    Generate a CSV file for a calculator spreadsheet multiplied by MDS of FST

    You can use the output of this script at Vahaduo to generate more accurate models than with a regular calculator spreadsheet (though in some cases it might be less accurate, because for example with European populations this method also amplifies differences in noise-level admixture of non-European components).

    Code:
    t=read.csv("https://pastebin.com/raw/H4pm9UF1",row.names=1,check.names=F) # K13 original
    fst=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1)/1e3
    
    mds=cmdscale(fst,ncol(fst)-1)
    t2=as.matrix(t)%*%mds
    
    write.table(t(apply(t2,1,function(x)sub("^0","",sprintf("%.5f",x)))),"k13.mds.csv",sep=",",col.names=F,quote=F)
    Make a PCA plot which accounts for the FST distance between the components

    If you make a PCA plot where you include all rows of the original K13 spreadsheet so you don't account for FST, then in a plot that shows PC1 and PC2, one of the three corners of the plot is occupied by South Asians and not by Africans like in a conventional global PCA. However when you take FST into account, then the PCA plot will look more like a typical global PCA with Africans in the third corner:



    Code:
    library(tidyverse)
    library(ggforce)
    
    t=read.csv("https://pastebin.com/raw/H4pm9UF1",row.names=1,check.names=F)
    fst=as.matrix(as.dist(read.csv("k13.fst",row.names=1,check.names=F)))
    
    t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
    
    p0=prcomp(t) # use prcomp(t2) instead for the version that takes FST into account
    pct=paste0(colnames(p0$x)," (",sprintf("%.1f",100*p0$sdev^2/sum(p0$sdev^2)),"%)") # percentage of variance explained
    p=as.data.frame(p0$x)
    
    p=p/sd(p[,1]) # divide the values of all PCs by the standard deviation of PC1 so the values are displayed on a normalized scale
    
    dmat=as.matrix(dist(t2)) # named `dmat` so it doesn't shadow the `dist` function
    k=as.factor(cutree(hclust(dist(t2)),24))
    
    set.seed(1)
    hue=seq(0,360,length.out=nlevels(k)+1)%>%head(-1)%>%sample()
    pal1=hcl(hue,100,55)
    
    # draw a line between each population and its two nearest neighbors
    seg=lapply(1:3,function(j)apply(dmat,1,function(x)unlist(p[names(sort(x)[j]),c(1,2)],use.names=F))%>%t%>%cbind(p[,c(1,2)]))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
    
    ggplot(p,aes(PC1,PC2))+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray50",size=.1)+
    ggforce::geom_mark_hull(aes(group=!!k),color=pal1[k],fill=pal1[k],concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
    geom_point(color=pal1[k],size=.3)+
    geom_text(aes(label=rownames(p)),color=pal1[k],size=2,vjust=-.6)+
    labs(x=pct[1],y=pct[2])+
    scale_x_continuous(breaks=seq(-100,100,.5),expand=expansion(.09))+
    scale_y_continuous(breaks=seq(-100,100,.5),expand=expansion(.03))+
    theme(
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"pt"),
      axis.text=element_text(color="black",size=6),
      axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
      axis.title=element_text(color="black",size=8),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray80",fill=NA,size=.4),
      panel.grid.major=element_line(color="gray90",size=.2),
      panel.grid.minor=element_blank(),
      plot.background=element_rect(fill="white",color=NA)
    )
    
    ggsave("1.png",width=6,height=6)
    Make a hierarchical clustering tree based on the FST matrix of a calculator

    This uses `reorder.hclust` from `vegan` to reorder the branches of the clustering tree based on the position of each component on the first dimension in an MDS matrix of the FST matrix. (So basically if you think of the first two dimensions of the MDS matrix as the x- and y-axis of a PCA-style scatterplot, then the position of each component on the x-axis is used as the weight for reordering the branches of the clustering tree.)



    Code:
    library(ape) # for `plot.phylo`
    library(vegan) # for `reorder.hclust`
    
    fst=as.matrix(read.csv("https://pastebin.com/raw/HxXagu1B",row.names=1,check.names=F))
    
    png("1.png",w=700,h=800)
    
    hc=hclust(as.dist(fst))
    hc=reorder(hc,cmdscale(fst)[,1]) # use first dimension of MDS as weights for reordering the branches
    
    plot(as.phylo(hc),cex=1.2,font=1,underscore=T) # `font=1` doesn't use italic and `underscore=T` doesn't replace underscores with spaces
    dev.off()
    
    system("mogrify -trim -bordercolor white -border 20 1.png") # remove empty space around the plot but add back a 20-pixel margin
    Make a clustered heatmap which accounts for FST distance between components

    The branches are reordered so that the weight for each population is the position of the population in the first dimension of a PCA of the admixture matrix multiplied by the MDS matrix of the FST matrix.



    Code:
    library(pheatmap)
    library(vegan) # for `reorder.hclust`
    library(colorspace) # for `hex`
    
    t=read.csv("https://pastebin.com/raw/H4pm9UF1",row.names=1,check.names=F) # K13 original
    fst=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1)
    
    t=t[sample(nrow(t),16),] # pick a random subset of 16 populations so the heatmap stays readable
    
    t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
    hc=hclust(dist(t2))
    hc=reorder(hc,prcomp(t2)$x[,1])
    
    # t=t[,order(colMeans(t))]
    t=t[,ncol(t):1]
    
    pheatmap::pheatmap(
      t,
      filename="1.png",
      cluster_cols=F,
      clustering_callback=function(...)hc,
      legend=F,
      cellwidth=16,
      cellheight=16,
      treeheight_row=80,
      treeheight_col=80,
      fontsize=8,
      border_color=NA,
      display_numbers=T,
      number_format="%.0f",
      fontsize_number=7,
      number_color="black",
      # breaks=seq(0,100,100/256),
      color=colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0),c(0,rep(.5,6)),1)))(256) # name pheatmap's `color` argument explicitly
    )
    Make a heatmap of the FST matrix of a calculator

    The distances are multiplied by 1000.



    Code:
    library(pheatmap)
    library(vegan) # for `reorder.hclust`
    library(colorspace) # for `hex`
    
    t=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1,check.names=F)/1e3
    
    hc=hclust(as.dist(t))
    hc=reorder(hc,cmdscale(t)[,1])
    
    pheatmap::pheatmap(
      t*1000,
      filename="1.png",
      clustering_callback=function(...)hc,
      legend=F,
      cellwidth=16,
      cellheight=16,
      treeheight_row=80,
      treeheight_col=80,
      fontsize=8,
      border_color=NA,
      display_numbers=T,
      number_format="%.0f",
      fontsize_number=7,
      number_color="black",
      color=colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0),c(0,rep(.5,6)),1)))(256) # name pheatmap's `color` argument explicitly
    )
    Sort populations based on their distance to a component

    When you use matrix multiplication to multiply the spreadsheet of a calculator with the FST matrix of the calculator, you get a matrix where, instead of the percentage of each component in each sample, each cell shows the distance of each sample to each component (the percentage-weighted combination of the FST distances between that component and all components).

    For example this finds populations in K13 that have the lowest distance to the Siberian component:

    Code:
    t=read.csv("https://pastebin.com/raw/9dMWpJcU",row.names=1,check.names=F)
    
    fst=as.matrix(as.dist(read.csv(text=",,,,,,,,,,,,
    19,,,,,,,,,,,,
    28,36,,,,,,,,,,,
    26,32,36,,,,,,,,,,
    26,35,28,21,,,,,,,,,
    52,62,50,48,39,,,,,,,,
    64,65,76,57,60,82,,,,,,,
    114,114,122,110,111,127,76,,,,,,
    111,111,123,109,112,130,83,56,,,,,
    138,137,154,138,144,161,120,113,105,,,,
    179,181,187,177,176,191,146,166,177,217,,,
    122,127,124,116,108,121,113,145,151,185,203,,
    146,150,150,140,135,141,133,164,170,204,220,41,",head=F)))
    
    t2=as.matrix(t)%*%fst
    s=head(sort(t2[,9]),16)
    cat(paste(round(s),names(s)),sep="\n")
    Output:

    1599 Evens
    1737 Evenki
    1818 Oroqen
    2070 Yakut
    2236 Dolgan
    3010 Koryak
    3109 Buryat
    3501 Hezhen
    3557 Tuvinian
    3690 Xibo
    3724 Chukchi
    4012 Japanese
    4081 Mongolian
    4372 Selkup
    4379 Ket
    4453 Naxi

    Find the 4 closest populations to each component

    Code:
    $ curl https://pastebin.com/raw/9dMWpJcU|tr -d \\r>k13
    $ curl https://pastebin.com/raw/6JmN2hRY|tr -d \\r>k13.fst
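    # multiply each row of the K13 spreadsheet by the FST matrix (a matrix product) and reattach the header and population names: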
    $ awk -F, 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;next}{for(i=1;i<=NF;i++){s=0;for(j=1;j<=NF;j++)s+=$j*a[i,j];printf"%s"(i==NF?"\n":FS),s}}' k13.fst <(sed 1d k13|cut -d, -f2-)|paste -d, <(cut -d, -f1 k13|sed 1d) -|cat <(head -n1 k13) ->k13.mult
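    # for each component, sort the populations by their distance to it and print the 4 closest: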
    $ for ((i=1;i<=$(wc -l<k13.fst);i++));do awk -F, '{print$(i+1),$1}' i=$i k13.mult|sed 1d|sort -n|awk '{$1=sprintf("%.0f",$1)}1'|head -n4|paste -sd\  -|sed s/^/$(head -n1 k13.mult|cut -d, -f$[i+1]):\ /;done
    North_Atlantic: 1341 Norwegian 1350 Southwest_English 1353 West_Scottish 1356 North_Dutch
    Baltic: 1465 Lithuanian 1587 Belorussian 1615 Russian_Smolensk 1685 Estonian_Polish
    West_Med: 1493 Sardinian 2020 French_Basque 2264 Southwest_French 2336 Spanish_Aragon
    West_Asian: 1553 Georgian 1587 Abhkasian 1891 Armenian 1956 Lezgin
    East_Med: 1650 Lebanese_Christian 1666 Lebanese_Druze 1797 Cyprian 1805 Kurdish_Jewish
    Red_Sea: 3234 Saudi 3493 Yemenite_Jewish 3995 Lebanese_Christian 4028 Samaritan
    South_Asian: 1690 Chamar 1746 Piramalai 1752 Sakilli 1950 North_Kannadi
    East_Asian: 919 Dai 946 She 1117 Vietnamese 1169 Miaozu
    Siberian: 1599 Evens 1737 Evenki 1818 Oroqen 2070 Yakut
    Amerindian: 45 Karitiana 1338 Pima 2017 Mayan 2844 North_Amerindian
    Oceanian: 889 Papuan 5132 NAN_Melanesian 14564 Austroasiatic_Ho 14931 Sakilli
    Northeast_African: 1365 Ethiopian_Gumuz 1586 Ethiopian_Anuak 1730 Sudanese 2144 Hadza
    Sub-Saharan: 404 Yoruban 753 Bantu_S.W. 808 Bantu_S.E. 863 Mandenka
    R version:

    Code:
    Rscript -e 't=read.csv("k13",row.names=1,check.names=F);fst=as.matrix(read.csv("k13.fst",head=F));t2=as.matrix(t)%*%fst;m=apply(t2,2,function(x){s=head(sort(x),4);paste(paste(round(s),names(s)),collapse=" ")});writeLines(paste0(names(t),": ",m))'
    Download a CSV file for the K13 results of all samples in the Reich dataset and generate subsets of the file

    I made a CSV file for the K13 results of all samples in the Reich dataset. My file uses the 1240K version when available and the HO version otherwise, because Eurogenes K13 has more than three times as many overlapping SNPs with 1240K as with HO. The code below uses the file to generate population averages of modern and ancient samples, where it only includes ancient samples with at least 500,000 SNPs. It then multiplies the resulting files with an MDS matrix of the FST matrix of K13, and uses the files multiplied by the MDS of FST to find the closest ancient populations to the modern Mari average.

    Code:
    $ curl -Lso reich.k13 'https://drive.google.com/uc?export=download&id=1pL2R5dDvQ0wsPw_SgoFn_zWLDrGXa9nM'
    $ curl https://pastebin.com/raw/6JmN2hRY|tr -d \\r>k13.fst
    $ wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_{1240K,HO}_public.anno
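    # igno: filter out reference sequences, relatives, duplicates, contaminated and other flagged samples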
    $ igno()(grep -Ev '\.REF|rel\.|fail\.|\.contam|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_relative|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref')
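    # tav: compute per-population averages (average the numeric fields of rows that share the same first field)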
    $ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i,j]/n[i]);print o}}' "FS=${1-$'\t'}")
    $ awk -F\\t '$5==0{print$2}' v50.0_HO_public.anno|awk -F'[:,]' 'NR==FNR{a[$0];next}$2 in a' - reich.k13|igno|sed 's/:[^,]*//'|tav ,|sort>reich.k13.modern.ave
    $ awk -F\\t '$5>0&&$14>=5e5{print$2}' v50.0_HO_public.anno|awk -F'[:,]' 'NR==FNR{a[$0];next}$2 in a' - reich.k13|igno|sed 's/:[^,]*//'|tav ,|sort>reich.k13.ancient.ave
    $ Rscript -e 'for(f in c("reich.k13.modern.ave","reich.k13.ancient.ave")){t=read.csv(f,row.names=1,check.names=F);fst=read.csv("k13.fst",row.names=1)/1e3;mds=as.matrix(t)%*%cmdscale(as.matrix(fst),ncol(fst)-2);write.table(round(mds,4),paste0(f,".mds"),sep=",",quote=F,col.names=F)}'
    $ dist()(awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{$1=sprintf("%."x"f",$1)}1' "x=${3-3}")
    $ dist reich.k13.ancient.ave.mds <(grep ^Mari.SG, reich.k13.modern.ave.mds)|head -n8
    0.483 Kazakstan_Sargat_IA
    0.507 Russia_Karasuk_oRISE.SG
    0.614 Russia_Tuva_IA_AldyBel
    0.700 Kazakhstan_Saka_IA
    0.715 Russia_Chalmny_Varre
    0.719 Kazakhstan_Tasmola_Saka_IA
    0.741 Kyrgyzstan_Saka_IA
    0.779 Kazakhstan_IA_Tasbas
    The code below uses my modified version of michal3141's CVXPY script to create a model for the modern Mari average using ancient population averages (https://github.com/michal3141/g25). The models generated by the script are usually almost identical to Vahaduo's. If you get an error that the solver CVXOPT failed, try running `sed -i s/CVXOPT/ECOS_BB/ mix`.

    Code:
    $ curl https://pastebin.com/raw/afaMiFSa|tr -d \\r>mix;chmod +x mix;pip3 install cvxpy
    $ ./mix reich.k13.ancient.ave.mds <(grep Mari reich.k13.modern.ave.mds)
    Mari.SG (.058):
    34% Latvia_BA
    23% Russia_KusnarenkovoKarajakupovo_Medieval.SG
    23% Russia_Chalmny_Varre
    12% Russia_Krasnoyarsk_BA.SG
    4% Georgia_Kotias.SG
    1% Russia_KolymaRiver_LN.SG
    1% Congo_Kindoki_Protohistoric
    1% Iran_ShahrISokhta_BA2
    0% Jordan_EBA
    Make a PCA of individual samples using Voronoi tessellation

    Documentation for `ggforce::geom_voronoi_tile`: https://ggforce.data-imaginist.com/r...om_delvor.html.



    Code:
    library(tidyverse)
    library(ggforce)
    
    t=read.csv("https://drive.google.com/uc?export=download&id=1bQVIJzio9NNxSlQjYA230z3nYSM9UQ7Z",row.names=1)
    fst=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1)
    
    t=t[!grepl("\\.REF|Ignore_|_o|_lc|_outlier|_dup|_contam|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|\\.rel\\.|Neanderthal|Denisova|Vindija|Orangutang|Macaque|Marmoset|Primate_Chimp|hg19ref",rownames(t)),]
    
    anno=as.data.frame(read_tsv(gsub("\u00A0"," ",read_file("https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_HO_public.anno"))))
    picks=anno[anno[,5]==0|anno[,5]>=4000,2]
    
    mds=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
    mds=as.data.frame(mds[sub(".*:","",rownames(mds))%in%picks,])
    
    rownames(mds)=sub("\\.(DG|SG|SDG|WGA)","",rownames(mds))
    mds=mds[!sub(":.*","",rownames(mds))%in%c("PUR","CLM"),]
    popav=data.frame(aggregate(mds,list(sub(":.*","",rownames(mds))),mean),row.names=1)
    poppick=names(head(sort(as.matrix(dist(popav))["Mari",]),200))
    mds=mds[sub(":.*","",rownames(mds))%in%poppick,]
    pop=sub(":.*","",rownames(mds))
    
    p0=prcomp(mds)
    p=as.data.frame(p0$x)
    pct=paste0(colnames(p0$x)," (",sprintf("%.1f",p0$sdev^2/sum(p0$sdev^2)*100),"%)") # percentage of variance explained
    
    set.seed(1)
    color=as.factor(sample(seq(1,length(unique(pop)))))
    col=rbind(c(60,80),c(25,95),c(100,70),c(30,70),c(70,50),c(60,100),c(20,50),c(15,40),c(15,70),c(20,30),c(80,30))
    hues=max(ceiling(length(color)/nrow(col)),7)
    pal1=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],x[1],x[2])))
    pal2=as.vector(apply(col,1,function(x)hcl(seq(15,375,length=hues+1)[1:hues],ifelse(x[2]>50,.8*x[1],.2*x[1]),ifelse(x[2]>50,.3*x[2],100))))
    
    for(i in seq(1,3,2)){
      xpc=sym(paste0("PC",i+1)) # PC2 or PC4 goes on the x-axis
      ypc=sym(paste0("PC",i)) # PC1 or PC3 goes on the y-axis
    
      mult=diff(range(p[,i+1]))/diff(range(p[,i]))
      p[,i]=p[,i]*mult # rescale the y-axis PC so that both axes span an equal range
      ranges=apply(p,2,function(x)abs(max(x)-min(x)))
      maxrange=max(ranges[c(i,i+1)])
    
      centers=aggregate(p,by=list(pop),mean)
    
      ggplot(p,aes(!!xpc,!!ypc,group=-1L))+
      ggforce::geom_voronoi_tile(aes(fill=color[as.factor(pop)],color=color[as.factor(pop)]),size=.07,max.radius=maxrange/40)+
      geom_label(data=centers,aes(x=!!xpc,y=!!ypc,label=Group.1,color=color,fill=color),alpha=.8,size=2,label.r=unit(.08,"lines"),label.padding=unit(.08,"lines"),label.size=0)+
      coord_fixed(expand=F)+
      labs(x=pct[i+1],y=pct[i])+ # label the axes to match the aesthetics above
      scale_fill_manual(values=pal1)+
      scale_color_manual(values=pal2)+
      theme(
        axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.ticks.length=unit(0,"pt"),
        axis.title=element_text(color="black",size=8),
        legend.position="none",
        panel.background=element_rect(fill="white"),
        panel.border=element_rect(color="gray85",fill=NA,size=.4),
        panel.grid.major=element_blank()
      )
    
      ggsave(paste0(i,".png"),width=7,height=7)
    }
    Make an area chart that displays changes in the global average share of components over time

    The code below uses my CSV file for the K13 results of all samples in the 1240K version of the Reich dataset. It rounds the age of each sample to the nearest 1000 years, and it then calculates the global average percentage of the K13 components for each 1000-year period.



    Code:
    library(tidyverse)
    
    t=read.csv("https://drive.google.com/uc?export=download&id=1-x0ZOywaMH3v1KsZYaK0fbPMpReDxE6m",row.names=1,check.names=F)
    anno=as.data.frame(read_tsv(gsub("\u00A0"," ",read_file("https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_1240K_public.anno"))))
    
    t=t[!grepl("\\.REF|rel\\.|fail\\.|\\.contam|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_relative|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref|_o",rownames(t)),]
    
    age=anno[,c(2,9)]%>%data.frame(row.names=1,check.names=F)
    age2=round(age/1000)
    tage=age2[sub(".*:","",rownames(t)),]
    ave=aggregate(t,list(tage),mean)%>%data.frame(row.names=1,check.names=F)
    
    taco=function(x)data.frame(V1=rownames(x)[row(x)],V2=colnames(x)[col(x)],V3=unname(do.call(c,x)),check.names=F)
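    # (taco flattens a wide data frame into long format: row name, column name, value)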
    t2=taco(ave)
    t2$V1=as.numeric(t2$V1)
    group=factor(t2$V2,levels=unique(t2$V2))
    
    pal=paste0("#",unlist(strsplit("79a6f8 44ccf9 ece879 c5ec79 f5c361 ec9f79 36d17b ff78b8 e178ff 7442b2 782e2e 714c33 4e3d32"," ")))
    
    ggplot(t2,aes(x=V1, y=V3,color=!!group,fill=!!group))+
    geom_area(size=.1,alpha=.8)+
    labs(x="Thousands of years BP",y="Percentage of component")+
    scale_x_continuous(breaks=seq(0,200,5),expand=expansion(mult=c(.005,.005)),trans="reverse")+
    scale_y_continuous(breaks=seq(0,100,20),expand=expansion(mult=c(.005,.005)))+
    scale_color_manual(values=pal)+
    scale_fill_manual(values=pal)+
    guides(color=F,fill=guide_legend(ncol=7))+
    theme(
      axis.text=element_text(size=6,color="black"),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"cm"),
      axis.title=element_text(size=8),
      legend.background=element_blank(),
      legend.box.just="center",
      legend.box.margin=margin(0,unit="cm"),
      legend.box.spacing=unit(0,"in"),
      legend.direction="vertical",
      legend.justification="center",
      legend.key=element_rect(fill="white"),
      legend.margin=margin(-1,0,0,0),
      legend.position="bottom",
      legend.text=element_text(size=7,vjust=.5),
      legend.title=element_blank(),
      panel.background=element_rect(fill="white"),
      panel.grid=element_blank(),
      plot.background=element_rect(fill="white")
    )
    
    ggsave("1.png",width=8,height=4)
    Shell version:

    Code:
    $ curl -Lso reich.k13 'https://drive.google.com/uc?export=download&id=1-x0ZOywaMH3v1KsZYaK0fbPMpReDxE6m'
    $ curl -LsO https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_HO_public.anno
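    # round each sample's age to the nearest 1000 years, join the ages with the K13 results, and average each component within each age bin: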
    $ awk -F\\t '{print sprintf("%.0f",$5/1000)*1000,$2}' v50.0_HO_public.anno|awk -F'[ :,]' 'NR==FNR{a[$2]=$1;next}{print a[$2]","$0}' - reich.k13|sed 1d|grep -v ^,|awk -F, '{for(i=3;i<=NF;i++)a[$1][i]+=$i;n[$1]++}END{for(i in a){o=i","n[i];for(j=3;j<=NF;j++)o=o","sprintf("%.1f",a[i][j]/n[i]);print o}}'|sort -t, -nk1,1|cat <(head -n1 reich.k13|sed s/^/age,n/) -|column -s, -t
    Make a scatterplot which shows the total percentage of two sets of components or the average distance to two sets of components

    Each population is connected with a line to its two nearest neighbors. The populations are divided into clusters by cutting a hierarchical clustering tree at the height where it has 24 subtrees.



    Code:
    library(tidyverse)
    library(ggforce)
    
    t=read.csv("reich.k13",row.names=1,check.names=F)
    fst=read.csv("k13.fst",row.names=1)/1e3
    
    anno=as.data.frame(read_tsv(gsub("\u00A0"," ",read_file("v50.0_HO_public.anno"))))
    t=t[sub(".*:","",rownames(t))%in%anno[anno[,5]==0,2],]
    t=t[!grepl("\\.REF|rel\\.|fail\\.|\\.contam|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_relative|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref|_o",rownames(t)),]
    
    t=aggregate(t,list(sub(":.*","",rownames(t)%>%gsub("\\.(DG|SDG|SG|WGA)|_published","",.))),mean)%>%data.frame(row.names=1)
    
    dm=as.matrix(t)%*%as.matrix(fst)
    colnames(dm)=colnames(t)
    t2=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
    
    p1=c("Baltic","North_Atlantic")
    p2=c("East_Asian","Siberian","Amerindian")
    xy=data.frame(x=rowSums(t[,p1]),y=rowSums(t[,p2]))
    # xy=data.frame(x=rowMeans(dm[,p1]),y=rowMeans(dm[,p2]))
    
    # pick=apply(xy<14,1,all)
    # xy=xy[pick,]
    # t2=t2[pick,]
    
    seg=lapply(1:3,function(i)apply(as.matrix(dist(t2)),1,function(x)unlist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
    
    lim=c(min(xy),max(xy))+c(-1,1)*(max(xy)-min(xy))/20
    
    k=cutree(hclust(dist(t2)),24)
    pal1=hcl(seq(15,375,length.out=n_distinct(k)+1)%>%head(-1),95,60)
    
    ggplot(xy,aes(x,y))+
    geom_abline(linetype="dashed",color="gray80",size=.3)+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray50",size=.1)+
    geom_mark_hull(aes(color=as.factor(!!k),fill=as.factor(!!k)),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
    geom_point(size=.5,aes(color=as.factor(!!k)))+
    geom_text(aes(color=as.factor(!!k)),label=rownames(xy),size=2,vjust=-.7)+
    scale_x_continuous(breaks=seq(0,100,10),expand=expansion(mult=c(.06,.06)))+
    scale_y_continuous(breaks=seq(0,100,10),expand=expansion(mult=c(.06,.06)))+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal1)+
    coord_cartesian(xlim=lim,ylim=lim,expand=F)+
    labs(x=paste0("Total percentage of ",paste(p1,collapse=" + ")),y=paste0("Total percentage of ",paste(p2,collapse=" + ")))+
    # labs(x=paste0("Average distance to: ",paste(p1,collapse=" + ")),y=paste0("Average distance to: ",paste(p2,collapse=" + ")))+
    theme(
      axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
      axis.text=element_text(size=6),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"cm"),
      axis.title=element_text(size=8),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray85",fill=NA,size=.6),
      panel.grid.major=element_line(color="gray85",size=.2),
      panel.grid.minor=element_blank(),
      plot.background=element_rect(fill="white"),
      plot.subtitle=element_text(size=7),
      plot.title=element_text(size=11)
    )
    
    ggsave("1.png",width=7,height=7)
    Visualize the results of a calculator as a polygonal diagram

    This is basically a version of a ternary plot that can be generalized to a higher number of dimensions than 3. A ternary plot is a type of graph where points are drawn inside an equilateral triangle so that the position of each point is determined by three variables that sum up to a constant (https://en.wikipedia.org/wiki/Ternary_plot).




    Code:
    library(tidyverse)
    library(ggforce)
    library(colorspace)
    
    t=read.csv("https://drive.google.com/uc?export=download&id=1pL2R5dDvQ0wsPw_SgoFn_zWLDrGXa9nM",row.names=1)/100
    fst=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1)
    
    t=t[!grepl("\\.REF|rel\\.|fail\\.|\\.contam|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_relative|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref|_o",rownames(t)),]
    
    rownames(t)=sub("\\.(SG|DG|SDG|WGA|WGC)|_published","",rownames(t))
    t=data.frame(aggregate(t,list(sub(":.*","",rownames(t))),mean),row.names=1,check.names=F)
    
    mds=as.matrix(t)%*%cmdscale(fst,ncol(fst)-1)
    
    t=data.frame(t[,2],t[,9],t[,8],rowSums(t[,c(1,3:7,10:13)]))
    names(t)=c("Baltic","Siberian","East_Asian","Other")
    
    start=ifelse(ncol(t)==4,.25,0)
    corners=sapply(c(sin,cos),function(x)x(head((start+seq(0,2,length.out=ncol(t)+1))*pi,-1)))
    corners=corners*min(2/diff(apply(corners,2,range)))
    corners[,2]=corners[,2]-mean(range(corners[,2]))
    
    xy=as.data.frame(as.matrix(t)%*%corners)
    grid=as.data.frame(rbind(cbind(corners,rbind(corners[-1,],corners[1,])),cbind(corners,matrix(apply(corners,2,mean),ncol=2,nrow=ncol(t),byrow=T))))
    
    seg=lapply(1:3,function(i)as.matrix(dist(mds))%>%apply(1,function(x)unlist(xy[names(sort(x)[i]),],use.names=F))%>%t%>%cbind(xy))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
    
    k=as.factor(cutree(hclust(dist(mds)),32))
    
    set.seed(1)
    hue=seq(0,360,length.out=nlevels(k)+1)%>%head(-1)%>%sample()
    pal1=hex(colorspace::HSV(hue,.5,1))
    pal2=hex(colorspace::HSV(hue,.3,1))
    
    angle=head(seq(360,0,length.out=ncol(t)+1),-1)
    angle=ifelse(angle>90&angle<=270,angle+180,angle)
    
    ggplot(xy,aes(x=V1,y=V2))+
    geom_polygon(data=as.data.frame(corners),fill="gray25")+
    # geom_text(data=as.data.frame(corners),aes(x=1.04*V1,y=1.04*V2),label=colnames(t),size=3.2,angle=angle,color="gray85")+ # use rotated labels (for pentagon and higher number of corners)
    geom_text(data=as.data.frame(corners),aes(x=V1,y=V2+sign(V2)*.02),vjust=(1-corners[,2])/2,hjust=(1+corners[,1])/2,label=colnames(t),size=3.2,color="gray85")+ # don't rotate labels (for square and triangle)
    geom_segment(data=grid,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray30",size=.4)+
    ggforce::geom_mark_hull(aes(group=!!k,color=!!k,fill=!!k),concavity=1000,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.15,size=.1)+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray10",size=.25)+
    geom_point(aes(color=!!k),size=.5)+
    geom_text(aes(label=rownames(xy),color=!!k),size=2.2,vjust=-.6)+
    coord_fixed(xlim=c(-1,1),ylim=c(-1,1))+
    scale_fill_manual(values=pal1)+
    scale_color_manual(values=pal2)+
    theme(
      axis.text=element_blank(),
      axis.ticks=element_blank(),
      axis.title=element_blank(),
      legend.position="none",
      panel.background=element_rect(fill="gray20"),
      panel.grid=element_blank(),
      plot.background=element_rect(fill="gray20",color=NA,size=0),
      plot.margin=margin(0,0,0,0)
    )
    
    ggsave("1.png",width=13,height=13)
    The following code first gets the coordinates of the corners of an equilateral triangle with a unit radius, and it then uses matrix multiplication to plot the Gedrosia K3 results of populations inside the triangle:

    Code:
    > triangle=sapply(c(sin,cos),function(x)head(x(seq(0,2,length.out=3+1)*pi),-1))
    > triangle
               [,1] [,2]
    [1,]  0.0000000  1.0
    [2,]  0.8660254 -0.5
    [3,] -0.8660254 -0.5
    > admix=read.csv(text="Finnish,10.90,0.00,89.10\nNganasan,96.27,0.00,3.73",header=F,row.names=1)/100
    > as.matrix(admix)%*%triangle
                    [,1]     [,2]
    Finnish  -0.77162863 -0.33650
    Nganasan -0.03230275  0.94405
    Run calculators on the command line with Python

    As far as I know, the DIYDodecad program only comes with binaries for Windows and Linux, and Dienekes wrote it in FORTRAN but he didn't publish its source code (https://dodecad.blogspot.com/2011/08...lator-for.html). However there's now also a Python-based alternative to DIYDodecad (https://github.com/stevenliuyi/admix):

    Code:
    $ wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_HO_public.{anno,ind,snp,geno}
    $ f=v50.0_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
    $ pip3 install git+https://github.com/stevenliuyi/admix
    $ plink --bfile v50.0_HO_public --keep <(grep Chuvash33 v50.0_HO_public.fam) --recode 23
    $ admix -f plink.txt -v 23andme -m K12b -t 1e-7
    
    Admixture calculation models: K12b
    
    Calcuation is started...
    
    K12b
    Gedrosia: 2.06%
    Siberian: 20.86%
    Northwest African: 1.27%
    Southeast Asian: 0.00%
    Atlantic Med: 4.46%
    North European: 53.12%
    South Asian: 0.71%
    East African: 0.00%
    Southwest Asian: 1.85%
    East Asian: 2.89%
    Caucasus: 12.78%
    Sub Saharan: 0.00%
    It's important to specify a lower termination tolerance with `-t 1e-7`, because with the default tolerance of 1e-3, certain samples end up getting close to 100% of one component and 0% of all other components, like for example this Vologda Russian sample with Eurogenes K13:

    Code:
    $ plink --bfile v50.0_HO_public --keep <(grep HGDP00899 v50.0_HO_public.fam) --recode 23
    $ admix -f plink.txt -mK13
    
    Admixture calculation models: K13
    
    Calcuation is started...
    
    K13
    North_Atlantic: 0.12%
    Baltic: 99.76%
    West_Med: 0.00%
    West_Asian: 0.00%
    East_Med: 0.00%
    Red_Sea: 0.00%
    South_Asian: 0.00%
    East_Asian: 0.00%
    Siberian: 0.07%
    Amerindian: 0.00%
    Oceanian: 0.05%
    Northeast_African: 0.00%
    Sub-Saharan: 0.00%
    
    
    $ admix -f plink.txt -mK13 -t1e-4
    
    Admixture calculation models: K13
    
    Calcuation is started...
    
    K13
    North_Atlantic: 3.95%
    Baltic: 71.60%
    West_Med: 5.06%
    West_Asian: 2.74%
    East_Med: 6.62%
    Red_Sea: 0.67%
    South_Asian: 0.00%
    East_Asian: 0.16%
    Siberian: 6.24%
    Amerindian: 1.26%
    Oceanian: 1.31%
    Northeast_African: 0.04%
    Sub-Saharan: 0.36%
    
    
    $ admix -f plink.txt -mK13 -t1e-5
    
    Admixture calculation models: K13
    
    Calcuation is started...
    
    K13
    North_Atlantic: 4.09%
    Baltic: 71.69%
    West_Med: 4.73%
    West_Asian: 2.64%
    East_Med: 6.46%
    Red_Sea: 0.91%
    South_Asian: 0.00%
    East_Asian: 0.11%
    Siberian: 6.05%
    Amerindian: 1.37%
    Oceanian: 1.55%
    Northeast_African: 0.03%
    Sub-Saharan: 0.36%
    
    
    $ admix -f plink.txt -mK13 -t1e-6
    
    Admixture calculation models: K13
    
    Calcuation is started...
    
    K13
    North_Atlantic: 4.09%
    Baltic: 71.69%
    West_Med: 4.73%
    West_Asian: 2.64%
    East_Med: 6.46%
    Red_Sea: 0.91%
    South_Asian: 0.00%
    East_Asian: 0.11%
    Siberian: 6.05%
    Amerindian: 1.37%
    Oceanian: 1.55%
    Northeast_African: 0.03%
    Sub-Saharan: 0.36%
    I tried running Eurogenes K13 with different tolerance parameters for all 87 samples with the population name "Russian" in the HO version of the Reich dataset, because besides the Vologda sample above, a few other samples also got messed-up results when I used the default tolerance of 1e-3. The average change in admixture proportions was about 0.3 percentage points between 1e-3 and 1e-4, but it fell to close to zero between 1e-5 and 1e-6, and even 1e-8 was almost as fast as 1e-3:

    Tolerance   Average difference in admixture percentages   Running time
                compared to previous tolerance                per sample
    1e-1        -                                             1.22
    1e-2        3.144810                                      1.65
    1e-3        3.515349                                      3.12
    1e-4        0.316737                                      3.47
    1e-5        0.002387                                      3.64
    1e-6        0.000097                                      3.51
    1e-7        0.000062                                      3.65
    1e-8        0.000088                                      3.65
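    A minimal sketch of how the middle column could be computed, assuming `a` and `b` are tables of admixture percentages for the same samples run at two adjacent tolerance settings:

    Code:
    mean(abs(as.matrix(a)-as.matrix(b))) # average absolute difference in percentage points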

    I picked 1e-7 as the tolerance partially because it's also specified as a tolerance in the parameter file of DIYDodecad calculators, like for example this one: http://magnusducatus.blogspot.com/20...alculator.html.

    The admix script doesn't come with calculator files for Eurogenes K13 or K15, but you can download them here: https://drive.google.com/file/d/1X1n...ozV6IRFvekY7l8, https://drive.google.com/file/d/1xWD...r5I14ANG4mUXWs.

    Here's how you can run a batch of multiple samples and make population averages:

    Code:
    printf %s\\n Nganasan Finnish|awk 'NR==FNR{a[$0];next}$3 in a{print$1,$3}' - v50.0_HO_public.ind|while read x y;do plink --bfile v50.0_HO_public --keep <(awk '$2==x' "x=$x" v50.0_HO_public.fam) --allow-no-sex --recode 23 --out 23;admix -f 23.txt -m K12b -t 1e-7|grep %|sed 's/%//;s/.* //'|paste -sd, -|sed "s/^/$y:$x,/">>admix;done
    sed 's/:[^,]*//' admix|awk -F, '{n[$1]++;for(i=2;i<=NF;i++){a[$1,i]+=$i}}END{for(i in n){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%.2f",a[i,j]/n[i]);print o}}'>admix.ave
    I ran Eurogenes K13 for all samples in the 1240K and HO versions of the Reich dataset (I didn't manually remove any samples, so my files include samples that suffer from the calculator effect because Davidski used them as reference samples for K13):

    - https://drive.google.com/file/d/1-x0...K0fbPMpReDxE6m - 1240K (more SNPs but fewer modern samples)
    - https://drive.google.com/file/d/1bQV...230z3nYSM9UQ7Z - HO (fewer SNPs but more modern samples)
    - https://drive.google.com/file/d/1pL2...Fn_zWLDrGXa9nM - 1240K and HO combined (duplicate samples with the same master ID are removed so only the sample with the biggest SNP count is kept; the 1240K version is used when available and the HO version only for samples missing from 1240K)

    Eurogenes K13 has about 180,000 overlapping SNPs with the 1240K version of the Reich dataset but only about 50,000 overlapping SNPs with the HO version:

    Code:
    $ wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V50/V50.0/SHARE/public.dir/v50.0_{HO,1240K}_public.snp
    $ for x in 1240K HO;do cut -d\  -f1 /usr/local/lib/python3.9/site-packages/admix/data/K13.alleles|LC_ALL=C sort|comm -12 <(awk '{print$1}' v50.0_${x}_public.snp|LC_ALL=C sort) -|wc -l|echo "$(cat) out of $(wc -l< v50.0_${x}_public.snp) SNPs overlap with $x";done
    177354 out of 1233013 SNPs overlap with 1240K
    52770 out of 597573 SNPs overlap with HO
    When I compared the K13 results of samples that were included in both the 1240K and HO versions, the average difference in admixture percentages reached over 2 percentage points for samples that had fewer than 100,000 SNPs in the 1240K version:



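    Here's a minimal sketch of that comparison, assuming `t1240` and `tho` are K13 percentage tables for the two versions with matching row names:

    Code:
    ids=intersect(rownames(t1240),rownames(tho)) # samples present in both versions
    d=rowMeans(abs(as.matrix(t1240)[ids,]-as.matrix(tho)[ids,])) # per-sample mean absolute difference
    head(sort(d,decreasing=T)) # samples with the biggest discrepancy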
    Here's how many SNPs different calculators have in common with the 1240K version of the Reich dataset (shown in the first column) and the HO version (shown in the second column):

    Code:
    $ for f in /usr/local/lib/python3.9/site-packages/admix/data/*.alleles;do awk 'ARGIND==1{a[$1];next}ARGIND==2{b[$1];next}$1 in a{na++}$1 in b{nb++}END{f=ARGV[3];sub(".*/","",f);print na,nb,f}' v50.0_{1240K,HO}_public.snp $f;done
    26711 23889 Africa9.alleles
    35641 35433 AncientNearEast13.alleles
    165122 48100 E11.alleles
    154754 43947 EUtest13.alleles
    159921 45048 Eurasia7.alleles
    182709 56582 HarappaWorld.alleles
    154754 43947 Jtest14.alleles
    161597 45265 K12b.alleles
    177354 52770 K13.alleles
    160674 46518 K13M2.alleles
    160674 46518 K14M1.alleles
    179286 53263 K15.alleles
    160674 46518 K18M4.alleles
    160674 46518 K25R1.alleles
    160674 46518 K36.alleles
    76223 75060 K47.alleles
    101280 101280 K7AMI.alleles
    160674 46518 K7M1.alleles
    161597 45265 K7b.alleles
    101280 101280 K8AMI.alleles
    92607 91944 KurdishK10.alleles
    114866 49757 MDLPK27.alleles
    76217 75054 MichalK25.alleles
    112059 110039 TurkicK11.alleles
    161125 45515 globe10.alleles
    161125 45515 globe13.alleles
    38491 9449 puntDNAL.alleles
    163875 45877 weac2.alleles
    165539 46293 world9.alleles
    Last edited by Nganasankhan; 06-20-2022 at 09:33 PM.


  3. #2

    Display calculator results as a circular stacked bar chart

    This uses the circlize package: https://jokergoo.github.io/circlize_book/book/.

    The colored groups are based on cutting a hierarchical clustering tree at the height where it has 20 subtrees (which is below the height of the branches that have black color in the clustering tree).



    Code:
    library(circlize)
    library(ComplexHeatmap) # for `Legend`
    library(vegan) # for `reorder.hclust`
    library(dendextend) # for `color_branches`
    library(colorspace) # for `hex`
    
    t=read.csv("https://pastebin.com/raw/23hWrRgG",row.names=1,check.names=F,header=T)/100
    fst=read.csv("https://pastebin.com/raw/6JmN2hRY",row.names=1)
    
    ord=c(2,1,3,5,6,4,7:13)
    t=t[,ord]
    fst=fst[ord,ord]
    
    hc=hclust(dist(as.matrix(t)%*%as.matrix(fst)))
    hc=reorder(hc,as.matrix(t)%*%seq(ncol(t))^2)
    
    hue=c(0,30,60,90,120,170,210,240,280,320)
    labelcolor=c(hcl(hue+15,60,70),hcl(hue+15,50,50))
    barcolor=c(hex(HSV(c(240,210,60,40,15,80,120,340,290),.5,.95)),hex(HSV(c(270,0,30),.5,.6)),hex(HSV(30,.5,.4)))
    textcolor=c(rep("black",9),rep("white",4))
    
    labels=hc$labels[hc$order]
    cut=cutree(hc,20)
    dend=color_branches(as.dendrogram(hc),k=length(unique(cut)),col=labelcolor[unique(cut[labels])])
    t=t[hc$labels[hc$order],]
    
    circos.clear()
    png("1.png",w=3500,h=3500,res=300)
    circos.par(cell.padding=c(0,0,0,0),gap.degree=360/nrow(t),points.overflow.warning=F)
    circos.initialize("a",xlim=c(0,nrow(t)))
    
    circos.track(ylim=c(0,1),bg.border=NA,track.height=.28,track.margin=c(.01,0),
      panel.fun=function(x,y)for(i in 1:nrow(t))circos.text(i-.5,0,labels[i],adj=c(0,.5),facing="clockwise",niceFacing=T,cex=.65,col=labelcolor[cut[labels[i]]]))
    
    circos.track(ylim=c(0,1),track.margin=c(0,0),track.height=.26,bg.lty=0,panel.fun=function(x,y){
      mat=as.matrix(t)
      pos=1:nrow(mat)-.5
      barwidth=1
      for(i in 1:ncol(mat)){
        seq1=rowSums(mat[,seq(i-1),drop=F])
        seq2=rowSums(mat[,seq(i),drop=F])
        circos.rect(pos-barwidth/2,if(i==1){0}else{seq1},pos+barwidth/2,seq2,col=barcolor[i],border="gray15",lwd=.1)
      }
      for(i in 1:ncol(mat)){
        seq1=rowSums(mat[,seq(i-1),drop=F])
        seq2=rowSums(mat[,seq(i),drop=F])
        lab=round(100*t[,i])
        lab[lab<=1]="" # don't show labels for 0% or 1%
        circos.text(pos,if(i==1){seq1/2}else{seq1+(seq2-seq1)/2},labels=lab,col=textcolor[i],cex=.5,facing="clockwise",niceFacing=T)
      }
    })
    
    circos.track(ylim=c(0,attr(dend,"height")),bg.border=NA,track.margin=c(0,.0015),track.height=.4,panel.fun=function(x,y)circos.dendrogram(dend))
    
    circos.clear()
    
    lg=Legend(at=colnames(t),legend_gp=gpar(col=barcolor),background=barcolor,type="points")
    draw(lg,just=c("center","center"))
    # draw(lg,x=unit(15,"mm"),y=unit(15,"mm"),just=c("left","bottom")) # draw legend in bottom left corner instead of middle
    
    dev.off()
    Last edited by Nganasankhan; 06-22-2022 at 03:42 AM.


  5. #3
    That is awesome Nganasankhan!
    Last edited by PLogan; 06-22-2022 at 03:13 AM.


  7. #4
    Nganasankhan, excuse my ignorance, is copy pasting the code into a shell compiler enough?
    Recently got literate in Python, but it seems not to help me much with running the code. Thanks for these very valuable scripts you provide.
    “Man cannot live without a permanent trust in something indestructible in himself, and at the same time that indestructible something as well as his trust in it may remain permanently concealed from him.”

    ― Franz Kafka

  8. #5
    Got an error on the legend section, but attached is my rendering injecting myself and my parents. Look in the English area, lower right.

    Code:
    > lg=Legend(at=colnames(t),legend_gp=gpar(col=barcolor),background=barcolor,type="points")
    Error in Legend(at = colnames(t), legend_gp = gpar(col = barcolor), background = barcolor,  : 
      could not find function "Legend"
    > draw(lg,just=c("center","center"))
    Error in draw(lg, just = c("center", "center")) : 
      could not find function "draw"
    > # draw(lg,x=unit(15,"mm"),y=unit(15,"mm"),just=c("left","bottom")) # draw legend in bottom left corner instead of middle
    > 
    
    > install.packages("ComplexHeatmap")
    Warning in install.packages :
      package ‘ComplexHeatmap’ is not available for this version of R
    
    A version of this package for your version of R might be available elsewhere,
    see the ideas at
    https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
    Code:
    >
    R version 4.1.2 (2021-11-01)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 19044)
    
    Matrix products: default
    
    Random number generation:
     RNG:     Mersenne-Twister 
     Normal:  Inversion 
     Sample:  Rounding 
     
    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
    [5] LC_TIME=English_United States.1252    
    system code page: 65001
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] circlize_0.4.15   dendextend_1.15.2 vegan_2.6-2       lattice_0.20-45   permute_0.9-7    
    
    loaded via a namespace (and not attached):
     [1] Rcpp_1.0.8          pillar_1.7.0        compiler_4.1.2      RColorBrewer_1.1-2  viridis_0.6.2       tools_4.1.2         viridisLite_0.4.0   nlme_3.1-153       
     [9] lifecycle_1.0.1     tibble_3.1.6        gtable_0.3.0        mgcv_1.8-38         pkgconfig_2.0.3     rlang_1.0.1         Matrix_1.3-4        DBI_1.1.2          
    [17] cli_3.2.0           ggrepel_0.9.1       parallel_4.1.2      gridExtra_2.3       cluster_2.1.2       dplyr_1.0.8         GlobalOptions_0.1.2 generics_0.1.2     
    [25] vctrs_0.3.8         grid_4.1.2          tidyselect_1.1.1    cowplot_1.1.1       glue_1.6.1          R6_2.5.1            fansi_1.0.2         polyclip_1.10-0    
    [33] pheatmap_1.0.12     ggplot2_3.3.5       purrr_0.3.4         tweenr_1.0.2        farver_2.1.0        magrittr_2.0.2      splines_4.1.2       scales_1.1.1       
    [41] ellipsis_0.3.2      MASS_7.3-54         assertthat_0.2.1    shape_1.4.6         ggforce_0.3.3       colorspace_2.0-3    utf8_1.2.2          munsell_0.5.0      
    [49] crayon_1.5.0
    Thanks for sharing.

    Last edited by PLogan; 06-22-2022 at 06:32 PM.

  9. #6
    Those Bioconductor packages require R 4.2. I'm apprehensive about upgrading; I've read horror stories from some who've upgraded and broken their environments.

    Is there a simpler way to make a legend?

    TIA

  10. #7

    Quote Originally Posted by Archetype0ne View Post
    Nganasankhan, excuse my ignorance, is copy pasting the code into a shell compiler enough?
    Recently got literate in Python, but it seems not to help me much with running the code. Thanks for these very valuable scripts you provide.
    Yeah I try to write most of my shell scripts so that they can just be copied and pasted to a terminal in macOS or GNU/Linux.

    You can install R from here: https://cran.r-project.org. Then open the R console application and paste in my code. To install the required packages, run for example `install.packages("tidyverse")`.
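    To grab everything the R scripts in this thread use in one go (a sketch; all of these are on CRAN, while ComplexHeatmap comes from Bioconductor as discussed below):

    Code:
    install.packages(c("tidyverse","cowplot","ggrepel","ggforce","colorspace","ape","vegan","pheatmap","circlize","dendextend"))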

    Quote Originally Posted by PLogan View Post
    Those BioConductor packages require R 4.2. I'm apprehensive to upgrade and I've read horror stories for some who've upgraded and broken their environments.

    Is there a simpler way to make a legend?

    TIA
    [s]You don't need BioConductor if you install the development version of ComplexHeatmap[/s]: `library(devtools);install_github("jokergoo/ComplexHeatmap")` (https://jokergoo.github.io/ComplexHe...eference/book/). (Edit: apparently you do.)
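    If you don't want to upgrade just for the legend, a simpler option is a plain base-R legend, since circlize draws on the base graphics device (a sketch using the `t` and `barcolor` objects from my script; call it before `dev.off()` in place of the ComplexHeatmap `Legend`/`draw` lines):

    Code:
    legend("center",legend=colnames(t),fill=barcolor,border=NA,bty="n",cex=.8)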
    Last edited by Nganasankhan; 06-22-2022 at 09:36 PM.


  12. #8
    Hmmm... got all excited.

    [attached images: install fail.JPG, install fail2.JPG]

    Appreciate the advice though.

  13. #9
    Quote Originally Posted by Nganasankhan View Post
    [s]You don't need BioConductor if you install the development version of ComplexHeatmap[/s]: `library(devtools);install_github("jokergoo/ComplexHeatmap")` (https://jokergoo.github.io/ComplexHe...eference/book/). (Edit: apparently you do.)

    Just wanted to give another shout out and say thank you. First for your generosity in sharing code examples, and second for motivating me to upgrade my R environments on my machines. Having everything current has solved a number of problems I'd been experiencing. I've been burned in the past when upgrading so was gun shy, but I really wanted to be able to reproduce and utilize your code for circlize.

    Here is my final render injecting myself and parents. They'll be in the lower right in the English area. I now want to experiment with maybe doing this with other calculators such as K36, although that might produce a very busy chart.

    Thanks again Nganasankhan!

    Last edited by PLogan; 06-23-2022 at 10:57 PM.

  14. #10
    Another tribute to Nganasankhan.

    Playing with different populations to better learn how the code works. This one uses a "K13 British Isles counties and cities" spreadsheet by Creoda that I collected. I collect all I can lay my hands on. LOL

    Here are the results.



