Results 1 to 7 of 7

Thread: R scripts for ADMIXTOOLS 2

  1. #1
    Registered Users
    Posts
    241
    Sex
    Ethnicity
    Finnish

    R scripts for ADMIXTOOLS 2


    The scripts in this post use the 1240K+HO dataset, which can be downloaded from here: https://reich.hms.harvard.edu/allen-...cient-dna-data.

    This installs ADMIXTOOLS 2 (https://github.com/uqrmaie1/admixtools):

    Code:
    install.packages("devtools")
    devtools::install_github("uqrmaie1/admixtools")
    library("admixtools")
    When I tried to install `devtools`, it failed at first because there was an error in installing the dependency `gert` of its dependency `usethis`. By running `install.packages("gert")`, I saw that I needed to run `brew install libgit2`. Even then, `admixtools` failed to install because there was an error in installing its dependency `igraph`. By googling for the error, I saw that I needed to run `brew uninstall --ignore-dependencies suite-sparse`.

    f3 as percentages with min-max scaling

    This shows the amount of shared drift with Han when using Yoruba as an outgroup. The values are min-max scaled so that they are displayed on a scale where Villabruna is 0 and Nganasan is 100.

    Code:
    $ printf %s\\n Bashkir Besermyan Chuvash Enets Estonia_N_CombCeramic.SG Estonian.DG Finnish Finnish.DG Hungarian.DG Italy_North_Villabruna_HG Karelian Mansi Mari.SG Mordovian Nganasan Norway_Viking_o1.SG Russia_Bolshoy Russia_Chalmny_Varre Russia_Chalmny_Varre Russia_EHG Russia_MLBA_Krasnoyarsk_o Russia_Shamanka_Eneolithic.SG Saami.DG Selkup Tatar_Mishar Tatar_Siberian_Zabolotniye Todzin Tofalar Udmurt>pops
    $ R -e 'library(tidyverse);library(admixtools);f3=f3("path/to/v44.3_HO_public","Yoruba","Han",readLines("pops"))%>%arrange(est);est=f3$est-min(f3$est);est=est/max(est);cat(paste(round(100*est),f3$pop3),sep="\n")'
    [...]
    0 Italy_North_Villabruna_HG
    2 Hungarian.DG
    6 Estonian.DG
    12 Finnish.DG
    12 Mordovian
    13 Finnish
    13 Russia_EHG
    14 Karelian
    20 Tatar_Mishar
    26 Besermyan
    27 Chuvash
    28 Mari.SG
    29 Russia_Chalmny_Varre
    30 Udmurt
    33 Saami.DG
    37 Russia_MLBA_Krasnoyarsk_o
    38 Bashkir
    40 Norway_Viking_o1.SG
    54 Russia_Bolshoy
    55 Tatar_Siberian_Zabolotniye
    57 Mansi
    70 Selkup
    80 Enets
    90 Tofalar
    94 Todzin
    99 Russia_Shamanka_Eneolithic.SG
    100 Nganasan
    If you get a low number of SNPs that remain after filtering, try to remove samples with a low SNP count. For example in a run before the run shown above, my SNP count was 12245. I ran a command like this to find samples with a low SNP count:

    Code:
    $ awk -F\\t 'NR==FNR{a[$0];next}$8 in a{print$15,$8,$2}' pops path/to/v44.3_HO_public.anno|sort -n|head -n8
    30001 Estonia_N_CombCeramic.SG MA974.SG
    58972 Estonia_N_CombCeramic.SG MA975.SG
    201719 Russia_Chalmny_Varre CHV002.A0101
    205900 Norway_Viking_o1.SG VK518.SG
    218757 Russia_Shamanka_Eneolithic.SG DA251.SG
    287462 Russia_Bolshoy BOO002.A0101
    289954 Russia_EHG UzOO77
    300800 Russia_Shamanka_Eneolithic.SG DA250.SG
    After I removed Estonia_N_CombCeramic.SG, my SNP count increased to 73054.

    f3 as a heatmap

    This applies min-max scaling to f3 values and displays the scaled values as percentages:

    Code:
    library(tidyverse)
    library(admixtools)
    library(pheatmap)
    library(colorspace)
    
    minmax=function(x)(x-min(x,na.rm=T))/(max(x,na.rm=T)-min(x,na.rm=T))
    
    p1="Yoruba.DG"
    p2=c("Nganasan","Russia_HG_Karelia","Norway_N_HG.SG","Turkey_N","Russia_HG_Tyumen","Han","Estonia_CordedWare","Russia_Bolshoy","Luxembourg_Loschbour","Russia_Samara_EBA_Yamnaya","Georgia_Kotias.SG")
    p3=c("Nganasan","Estonian.DG","Selkup","Hungarian.DG","Karelian","Mansi","Udmurt","Mordovian","Saami.DG","Besermyan","Finnish","Finland_Levanluhta","Tofalar","Russia_MLBA_Krasnoyarsk_o","Russia_Bolshoy","Enets","Russia_Chalmny_Varre","Norwegian")
    
    f3=f3("path/to/v44.3_HO_public",p1,p2,p3)
    
    t=f3%>%select(pop2,pop3,est)%>%pivot_wider(names_from=pop2,values_from=est)%>%data.frame(row.names=1)
    
    # don't show the f3 values of populations with themselves in order to not stretch the color scale
    makena=names(which(sapply(rownames(t),function(x)any(names(t)==x))))
    lapply(makena,function(x)t[x,x]<<-NA)
    
    range=apply(t,2,function(x)paste(sub("^0","",sprintf("%.3f",range(x,na.rm=T))),collapse="-"))
    names(t)=paste0(names(t)," (",range,")")
    
    t=minmax(t) # use the same scale for the whole table
    # t=apply(t,2,function(x)minmax(x)) # scale each column individually
    
    pheatmap(
      t*100,filename="a.png",
      # cluster_cols=F,cluster_rows=F,
      treeheight_row=70,treeheight_col=70,
      legend=F,cellwidth=16,cellheight=16,fontsize=8,border_color=NA,
      display_numbers=T,number_format="%.0f",fontsize_number=7,number_color="black",
      colorRampPalette(colorspace::hex(HSV(c(210,170,120,60,40,20,0),.5,1)))(256)
    )
    Here min-max scaling was applied to the whole table on the left side and to each column individually on the right side:



    The script below displays the original f3 values without scaling. It also demonstrates how `vegan::reorder.hclust` can be used to reorder the branches of the heatmap.

    Code:
    library(tidyverse)
    library(admixtools)
    library(pheatmap)
    library(colorspace) # for hex
    library(vegan) # for reorder.hclust
    
    p1="Yoruba.DG"
    p2=c("Nganasan","Russia_HG_Karelia","Norway_N_HG.SG","Turkey_N","Russia_HG_Tyumen","Han","Estonia_CordedWare","Russia_Bolshoy","Luxembourg_Loschbour","Russia_Samara_EBA_Yamnaya","Georgia_Kotias.SG","Finnish","Finland_Levanluhta","Russia_Shamanka_Eneolithic.SG")
    
    f3=f3("path/to/v44.3_HO_public",p1,p2,p2)
    
    t=f3%>%select(pop2,pop3,est)%>%pivot_wider(names_from=pop2,values_from=est)%>%data.frame(row.names=1)
    
    diag(t)=NA # don't show the f3 values of populations with themselves in order to not stretch the color scale
    
    pheatmap(
      t,filename="a.png",
      clustering_callback=function(...){reorder(hclust(dist(t)),wts=t[,"Han"]-t[,"Turkey_N"])},
      display_numbers=apply(t,1,function(x)sub("^0","",sprintf("%.3f",x))),
      legend=F,border_color=NA,cellwidth=16,cellheight=16,fontsize=8,fontsize_number=7,number_color="black",
      # breaks=seq(.160,.190,.03/256), # use fixed limits for color scale
      colorRampPalette(hex(HSV(c(210,170,120,60,40,20,0),.5,1)))(256)
    )
    Result:



    f3-outgroup nMonte

    This saves an f3 table as a CSV file so it can be used with Vahaduo or nMonte. It uses a shell wrapper for my simplified version of nMonte3.R.

    Code:
    $ printf %s\\n Estonia_CordedWare Finland_Levanluhta Finnish Georgia_Kotias.SG Han Luxembourg_Loschbour Nganasan Norway_N_HG.SG Russia_Bolshoy Russia_HG_Karelia Russia_HG_Tyumen Russia_Samara_EBA_Yamnaya Russia_Shamanka_Eneolithic.SG Turkey_N>pops
    $ R -e 'library(tidyverse);library(admixtools);pops=readLines("pops");f3=f3("path/to/v44.3_HO_public","Yoruba.DG",pops,pops);f3%>%select(pop2,pop3,est)%>%pivot_wider(names_from=pop3,values_from=est)%>%mutate(across(where(is.double),round,6))%>%write_csv("f3")'
    $ cat f3
    pop2,Estonia_CordedWare,Finland_Levanluhta,Finnish,Georgia_Kotias.SG,Han,Luxembourg_Loschbour,Nganasan,Norway_N_HG.SG,Russia_Bolshoy,Russia_HG_Karelia,Russia_HG_Tyumen,Russia_Samara_EBA_Yamnaya,Russia_Shamanka_Eneolithic.SG,Turkey_N
    Estonia_CordedWare,0.189574,0.048826,0.050368,0.046842,0.040376,0.05185,0.042084,0.049919,0.04784,0.050933,0.047266,0.051533,0.040938,0.049494
    Finland_Levanluhta,0.048826,0.148686,0.051402,0.045953,0.045448,0.052605,0.048901,0.053542,0.052522,0.053938,0.054203,0.051491,0.046715,0.048399
    Finnish,0.050368,0.051402,0.052422,0.047057,0.042304,0.053757,0.04443,0.052616,0.048967,0.053264,0.049454,0.050761,0.04272,0.0495
    Georgia_Kotias.SG,0.046842,0.045953,0.047057,0.185421,0.0394,0.045323,0.041384,0.046218,0.044534,0.047698,0.044243,0.048927,0.040304,0.046263
    Han,0.040376,0.045448,0.042304,0.0394,0.061217,0.040942,0.057035,0.041632,0.049274,0.043826,0.044467,0.040925,0.057394,0.038767
    Luxembourg_Loschbour,0.05185,0.052605,0.053757,0.045323,0.040942,0.188232,0.043033,0.062498,0.050054,0.057409,0.049533,0.051483,0.04226,0.049539
    Nganasan,0.042084,0.048901,0.04443,0.041384,0.057035,0.043033,0.070154,0.04448,0.054056,0.047517,0.047917,0.043077,0.059516,0.039822
    Norway_N_HG.SG,0.049919,0.053542,0.052616,0.046218,0.041632,0.062498,0.04448,0.189739,0.053043,0.062429,0.053055,0.052822,0.043558,0.048115
    Russia_Bolshoy,0.04784,0.052522,0.048967,0.044534,0.049274,0.050054,0.054056,0.053043,0.063846,0.057001,0.054654,0.049657,0.051358,0.04458
    Russia_HG_Karelia,0.050933,0.053938,0.053264,0.047698,0.043826,0.057409,0.047517,0.062429,0.057001,0.170124,0.062403,0.055766,0.046348,0.048079
    Russia_HG_Tyumen,0.047266,0.054203,0.049454,0.044243,0.044467,0.049533,0.047917,0.053055,0.054654,0.062403,0.18867,0.054148,0.046497,0.043883
    Russia_Samara_EBA_Yamnaya,0.051533,0.051491,0.050761,0.048927,0.040925,0.051483,0.043077,0.052822,0.049657,0.055766,0.054148,0.055635,0.041908,0.047615
    Russia_Shamanka_Eneolithic.SG,0.040938,0.046715,0.04272,0.040304,0.057394,0.04226,0.059516,0.043558,0.051358,0.046348,0.046497,0.041908,0.064461,0.039068
    Turkey_N,0.049494,0.048399,0.0495,0.046263,0.038767,0.049539,0.039822,0.048115,0.04458,0.048079,0.043883,0.047615,0.039068,0.055858
    $ nm()(grep -v ^, "$1">/tmp/nm1;grep -v ^, "$2">/tmp/nm2;Rscript -e 'source("~/bin/nmon.R");a=as.numeric(commandArgs(T));getMonte("/tmp/nm1","/tmp/nm2",Nbatch=a[1],Ncycles=a[2],pen=a[3])' "${3-500}" "${4-1000}" "${5-.001}")
    $ nm <(sed 1d f3|grep -v Finnish) <(sed 1d f3|grep Finnish)
    Target: Finnish (d=.0028)
    45.2 Russia_Samara_EBA_Yamnaya
    35.6 Turkey_N
    5.0 Russia_Bolshoy
    3.4 Nganasan
    3.0 Han
    2.6 Luxembourg_Loschbour
    1.4 Finland_Levanluhta
    1.4 Norway_N_HG.SG
    1.2 Russia_Shamanka_Eneolithic.SG
    0.8 Russia_HG_Karelia
    0.4 Estonia_CordedWare
    $ cat ~/bin/nmon.R
    getMonte=function(datafile,targetfile,Nbatch=500,Ncycles=1000,pen=0){
      do_algorithm=function(selection,targ){
        mySel=as.matrix(selection,rownames.force=NA)
        myTg=as.matrix(targ,rownames.force=NA)
        dif2targ=sweep(mySel,2,myTg,"-")
        Ndata=nrow(dif2targ)
        kcol=ncol(dif2targ)
        rowLabels=rownames(dif2targ)
        matPop=matrix(NA_integer_,Nbatch,1,byrow=T)
        dumPop=matrix(NA_integer_,Nbatch,1,byrow=T)
        matAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
        dumAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
        matPop=sample(1:Ndata,Nbatch,replace=T)
        matAdmix=dif2targ[matPop,]
        colM1=colMeans(matAdmix)
        eval1=(1+pen)*sum(colM1^2)
        for(c in 1:Ncycles){
          dumPop=sample(1:Ndata,Nbatch,replace=T)
          dumAdmix=dif2targ[dumPop,]
          for(b in 1:Nbatch){
            store=matAdmix[b,]
            matAdmix[b,]=dumAdmix[b,]
            colM2=colMeans(matAdmix)
            eval2=sum(colM2^2)+pen*sum(matAdmix[b,]^2)
            if(eval2<=eval1){
              matPop[b]=dumPop[b]
              colM1=colM2
              eval1=eval2
            }else{matAdmix[b,]=store}
          }
        }
        fitted=t(colMeans(matAdmix)+myTg[1,])
        popl=unlist(lapply(strsplit(rowLabels[matPop],";"),function(x)x[1]))
        populations=factor(popl)
        return(list("estimated"=fitted,"pops"=populations))
      }
    
      source=read.csv(datafile,header=F,row.names=1)
      target=read.csv(targetfile,header=F,row.names=1)
    
      for(i in 1:nrow(target)){
        row=target[i,]
        result=do_algorithm(source,row)
    
        if(i>=2)cat("\n")
        dist=sqrt(sum((result$estimated-row)^2))
        cat(paste0("Target: ",row.names(row)," (d=",sub("^0","",sprintf("%.4f",dist)),")\n"))
    
        tb=table(result$pops)
        tb=sort(tb,decreasing=T)
        tb=as.matrix(100*tb/Nbatch)
        cat(paste(sprintf("%.1f",tb),row.names(tb)),sep="\n")
      }
    }
    qpGraph with predefined node list

    Code:
    t=read.table(text="R Yoruba.DG
    R A
    A O
    A AA
    O E
    O CH
    E EU
    E L1
    EU WH
    EU EH
    AA EH
    EH EH2
    EH EH1
    EH AAA
    AA AAA
    WH CCE
    WH Italy_North_Villabruna_HG
    CH YM
    CH CH1
    CH AAN
    EH1 YM
    EH1 EH0
    AAA AAN
    CH1 EH0
    CH1 Georgia_Kotias.SG
    L1 Turkey_N_published
    L1 LB
    L1 YM1
    L1 CW
    EH2 LB
    EH2 CWC
    YM YM1
    YM CW
    EH0 Russia_HG_Karelia
    AAN Nganasan
    AAN CCA
    LB Germany_EN_LBK
    LB CCW
    YM1 Russia_Samara_EBA_Yamnaya
    CW CCW
    CW Estonia_CordedWare
    CCW CWC
    CWC CCE
    CCE CCA
    CCA Finnish.DG")
    
    pop=c("Finnish.DG","Estonia_CordedWare","Germany_EN_LBK","Italy_North_Villabruna_HG","Nganasan","Russia_HG_Karelia","Russia_Samara_EBA_Yamnaya","Turkey_N_published","Yoruba.DG","Georgia_Kotias.SG")
    
    f2=f2_from_geno("path/to/v44.3_HO_public",pops=pop)
    gr=qpgraph(f2,t)
    
    plot_graph(gr$edges)
    ggsave("a.png",width=9,height=9)
    system("mogrify -trim -border 64 -bordercolor white a.png")
    
    plotly_graph(gr$edges) # view Plotly graph in browser
    In order to generate a black-and-white graph, run `write_dot(gr$edges)` and paste the output here: http://viz-js.com.



    qpAdm popdrop table as a stacked bar chart

    The `popdrop` table contains alternative models for each non-empty subset of the left populations. The script below selects all models from the `popdrop` table with zero negative weights (where the value of the `feasible` column is not false) and with at least two left populations (where `f4rank` is not 0). It then plots the models sorted by their p score.

    In qpAdm, the number of right populations needs to be greater than or equal to the number of left populations. Otherwise you'll get an error like this: `Error in qpadm_weights(f4_est, qinv, rnk = length(left) - 2, fudge = fudge: Mat::cols(): indices out of bounds or incorrectly used`.

    I'm still afraid of qpAdm because I have no idea how to choose the outgroups, but I just selected a subset of the outgroups that were used in the paper titled "The Genomic History of Southeastern Europe": https://anthrogenica.com/showthread....pains-of-qpAdm. When the genetic distance between the right populations to left populations is high like in my model, the p values also tend to be higher.

    Code:
    library(admixtools)
    library(tidyverse)
    library(reshape2) # for melt
    library(colorspace) # for hex
    library(cowplot) # for plot_grid
    
    left=c("Turkey_Boncuklu_N.SG","Estonia_CordedWare.SG","Latvia_HG","Russia_HG_Karelia","Finland_Levanluhta","Nganasan")
    right=c("Ethiopia_4500BP_published.SG","Mbuti.DG","Russia_Ust_Ishim_HG_published.DG","Karitiana.DG","Papuan.DG","ONG.SG")
    target="Mari.SG"
    
    qp=qpadm("path/to/v44.3_HO_public",left,right,target)
    
    t=qp$popdrop%>%filter(feasible==T&f4rank!=0)%>%arrange(desc(p))%>%select(1,4,5,7:last_col(5))
    t2=melt(t[-c(2,3)],id.vars="pat") # this uses `melt` because `pivot_longer` doesn't preserve the order of columns
    
    lab=sub("e-0","e-",sub("^0","",sprintf(ifelse(t$p<.001,"%.0e","%.3f"),t$p)))
    lab=paste0(lab," (",ifelse(t$chisq<10,sub("^0","",sprintf("%.1f",t$chisq)),round(t$chisq)),")")
    
    subtit=str_wrap(paste("Right:",paste(sort(right),collapse=", ")),width=58)
    
    p=ggplot(t2,aes(x=fct_rev(factor(pat,level=t$pat)),y=value,fill=variable))+
    geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
    geom_text(aes(label=round(100*value)),position=position_stack(vjust=.5,reverse=T),size=3.5)+
    ggtitle(paste("Target:",target),subtitle=subtit)+
    coord_flip()+
    scale_x_discrete(expand=c(0,0),labels=rev(lab))+
    scale_y_discrete(expand=c(0,0))+ # remove gray margins on the left and right side of the plot
    scale_fill_manual("legend",values=hex(HSV(c(30,60,180,210,270,330),.5,1)))+ # use manual colors
    # scale_fill_manual(values=colorspace::hex(HSV(head(seq(0,360,length.out=1+ncol(t)-3),-1),.5,1)))+ # use automatic colors
    guides(fill=guide_legend(ncol=2,byrow=F))+
    theme(
      axis.text.x=element_blank(),
      axis.text.y=element_text(size=11),
      axis.text=element_text(color="black"),
      axis.ticks=element_blank(),
      axis.title=element_blank(),
      legend.direction="horizontal",
      legend.key=element_rect(fill=NA), # remove gray border around color squares
      legend.margin=margin(-4,0,1,0),
      legend.text=element_text(size=11),
      legend.title=element_blank(),
      plot.subtitle=element_text(size=12),
      plot.title=element_text(size=16)
    )
    
    ggdraw(p)
    hei=c(.5+.2*(str_count(subtit,"\n")+1)+.25*nrow(t),.1+.23*ceiling(length(unique(t2[!is.na(t2$value),2]))/2))
    ggdraw(plot_grid(p+theme(legend.position="none"),get_legend(p),ncol=1,rel_heights=hei))
    ggsave("a.png",width=5.5,height=sum(hei))
    In the image below, the number in parenthes after the p-value is chi-squared. Models with a large number of left populations tend to get a low chi-squared relative to the p-value. Models with only two left populations tend to get a high chi-squared relative to the p-value.



    Print qpAdm popdrop table as plain text

    Code:
    library(admixtools)
    library(tidyverse)
    
    left=c("Turkey_Boncuklu_N.SG","Russia_Samara_EBA_Yamnaya","Latvia_HG","Russia_HG_Karelia","Nganasan","Finland_Levanluhta","Russia_Bolshoy")
    right=c("Russia_Ust_Ishim_HG_published.DG","Russia_HG_Tyumen","Belgium_UP_GoyetQ116_1_published_all","Russia_Kostenki14","Iran_Wezmeh_N.SG","Georgia_Kotias.SG","Turkey_Epipaleolithic","China_YR_LN")
    target="Finnish"
    
    qp=qpadm("path/to/v44.3_HO_public",left,right,target)
    
    t=qp$popdrop%>%filter(feasible==T&f4rank!=0)%>%arrange(desc(p))
    p=apply(t%>%select(7:last_col(5)),1,function(x){y=round(100*sort(na.omit(x),decreasing=T));paste(paste0(y,"% ",names(y)),collapse=" + ")})
    paste0(sub("^0","",sprintf(ifelse(t$p<.001,"%.0e","%.3f"),t$p)),": ",p)%>%cat(sep="\n")
    Output:

    .630: 45% Russia_Samara_EBA_Yamnaya + 28% Turkey_Boncuklu_N.SG + 15% Latvia_HG + 11% Nganasan
    .335: 41% Finland_Levanluhta + 35% Russia_Samara_EBA_Yamnaya + 24% Turkey_Boncuklu_N.SG
    .266: 39% Russia_Samara_EBA_Yamnaya + 35% Finland_Levanluhta + 25% Turkey_Boncuklu_N.SG + 2% Nganasan
    .011: 38% Russia_Samara_EBA_Yamnaya + 34% Turkey_Boncuklu_N.SG + 23% Russia_Bolshoy + 5% Latvia_HG
    .010: 46% Russia_Samara_EBA_Yamnaya + 34% Turkey_Boncuklu_N.SG + 17% Russia_Bolshoy + 4% Nganasan
    .009: 42% Russia_Samara_EBA_Yamnaya + 35% Turkey_Boncuklu_N.SG + 24% Russia_Bolshoy
    .004: 49% Russia_Samara_EBA_Yamnaya + 33% Turkey_Boncuklu_N.SG + 11% Nganasan + 7% Russia_HG_Karelia
    .002: 58% Russia_Samara_EBA_Yamnaya + 31% Turkey_Boncuklu_N.SG + 12% Nganasan
    7e-11: 50% Finland_Levanluhta + 35% Turkey_Boncuklu_N.SG + 16% Russia_HG_Karelia
    5e-12: 59% Finland_Levanluhta + 29% Turkey_Boncuklu_N.SG + 12% Latvia_HG
    7e-13: 52% Turkey_Boncuklu_N.SG + 39% Russia_HG_Karelia + 9% Nganasan
    4e-13: 69% Finland_Levanluhta + 31% Turkey_Boncuklu_N.SG
    2e-14: 52% Turkey_Boncuklu_N.SG + 27% Russia_HG_Karelia + 21% Russia_Bolshoy
    1e-19: 46% Turkey_Boncuklu_N.SG + 28% Russia_Bolshoy + 26% Latvia_HG
    1e-23: 71% Finland_Levanluhta + 29% Russia_Samara_EBA_Yamnaya
    4e-24: 54% Turkey_Boncuklu_N.SG + 46% Russia_HG_Karelia
    2e-26: 44% Turkey_Boncuklu_N.SG + 44% Latvia_HG + 12% Nganasan
    7e-27: 63% Turkey_Boncuklu_N.SG + 37% Russia_Bolshoy
    8e-36: 55% Russia_Samara_EBA_Yamnaya + 36% Turkey_Boncuklu_N.SG + 9% Latvia_HG
    6e-38: 63% Russia_Samara_EBA_Yamnaya + 37% Turkey_Boncuklu_N.SG
    5e-48: 85% Turkey_Boncuklu_N.SG + 15% Nganasan
    4e-53: 56% Turkey_Boncuklu_N.SG + 44% Latvia_HG
    7e-61: 85% Russia_Samara_EBA_Yamnaya + 10% Nganasan + 5% Latvia_HG
    4e-62: 90% Russia_Samara_EBA_Yamnaya + 10% Nganasan
    3e-76: 98% Russia_HG_Karelia + 2% Nganasan
    5e-78: 88% Russia_Samara_EBA_Yamnaya + 12% Russia_Bolshoy
    2e-116: 91% Latvia_HG + 9% Nganasan
    1e-128: 92% Latvia_HG + 8% Russia_Bolshoy

    plot_map

    The `plot_map` function uses Plotly to generate an interactive map for entries from an anno file. You can see the code of the function by running `plot_map`.

    Code:
    library(admixtools)
    library(tidyverse)
    
    t=as.data.frame(read_tsv(gsub("\u00a0"," ",read_file("path/to/v44.3_HO_public.anno")))) # remove non-breaking spaces so `as.numeric` won't fail
    t=t[,c(8,11,12,6)] # group, latitude, longitude, years BP
    t=t[!grepl("\\.REF|rel\\.|fail\\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref|_o|_outlier",t[,1]),]
    t=t[t[,2]!=".."&t[,3]!=".."&t[,2]!="-"&t[,3]!="-",] # remove lines with missing lat or long field (can be `..` or `-`)
    t[t[,4]==0,4]=1 # change years BP 0 to 1 because entries with years BP 0 are not plotted when using a log scale (which is the default option)
    a=aggregate(apply(t[,-1],2,as.numeric),list(t[,1]),mean)
    a[,5]=a[,1]
    names(a)=c("iid","lat","lon","yearsbp","group")
    admixtools::plot_map(a)


    A second option is to first run this in a shell:

    Code:
    $ igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutang|Primate_Chimp|hg19ref')
    $ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
    $ awk -F\\t 'NR>1&&$11!=".."&&$11!="-"{print$8,$11,$12,$6}' path/to/v44.3_HO_public.anno|igno|grep -v _o|tav \ >map
    $ head -n2 map
    Ireland_EBA.SG 55.292132 -6.191685 3765.333333
    Germany_EN_LBK 49.674249 9.582716 7066.370370
    Then run this in R:

    Code:
    t=read.table("map");t[t[,4]==0,4]=1;t[,5]=t[,1];names(t)=c("iid","lat","lon","yearsbp","group");plot_map(t)'
    A third option is to use Plotly directly:

    Code:
    library(plotly)
    
    a=read.table("map")
    label=ifelse(a[,4]==0,a[,1],paste0(a[,1],"\u00a0",round(a[,4])))
    pl=plot_ly(a,type="scattermapbox",mode="markers",x=a[,3],y=a[,2],text=a[,1],hovertext=label,hoverinfo="text",marker=list(size=8,color=log10(a[,4]),colorscale="Portland",colorbar=list(thickness=15,outlinewidth=0)))
    plotly::layout(pl,mapbox=list(style="stamen-terrain"),margin=list(t=0,r=0,b=0,l=0,pad=0))
    In order to display the names of populations in addition to dots, you can use `mode="markers+text"`. It is currently only supported by the mapbox map styles which require a free access token: https://community.plotly.com/t/scatt...-as-text/35065. The `satellite-streets` style also includes the names of cities but `satellite` doesn't.

    Code:
    library(plotly)
    
    a=read.table("map")
    label=ifelse(a[,4]==0,a[,1],paste0(a[,1],"\u00a0",round(a[,4])))
    pl=plot_ly(a,type="scattermapbox",mode="markers+text",x=a[,3],y=a[,2],text=label,textfont=list(color="white"),textposition="top",marker=list(color=log10(a[,4]),size=8,colorscale="Jet"))
    plotly::layout(pl,mapbox=list(style="satellite-streets",accesstoken="youraccesstoken"),margin=list(t=0,r=0,b=0,l=0,pad=0))


    FST or f2 heatmap

    The default clustering method used by `pheatmap` is equivalent to `hclust(dist(t))`. Here the input is already a distance matrix, so you can use this instead: `clustering_callback=function(...)hclust(as.dist(t ))`.

    Code:
    library(admixtools)
    library(pheatmap)
    library(vegan) # for reorder.hclust
    library(colorspace) # for hex
    
    pop=unlist(strsplit("Besermyan Enets Estonian Finnish Hungarian Karelian Mansi Mordovian Nganasan Saami.DG Selkup Udmurt"," "))
    
    f=fst("path/to/v44.3_HO_public",pop)
    # f=f2("path/to/v44.3_HO_public",pop)
    
    # convert FST or f2 pairs to square matrix
    df=as.data.frame(f[,-4])
    df2=rbind(df,setNames(df[,c(2,1,3)],names(df)))
    t=xtabs(df2[,3]~df2[,1]+df2[,2])
    
    weights=t["Hungarian",]-t["Nganasan",]
    # weights=rowMeans(t) # sort by average distance to other populations
    
    pheatmap(
      1000*t,filename="a.png",
      clustering_callback=function(...)reorder(hclust(as.dist(t)),weights),
      # clustering_callback=function(...)hclust(as.dist(t)),
      legend=F,border_color=NA,cellwidth=18,cellheight=18,
      treeheight_row=80,treeheight_col=80,
      display_numbers=T,number_format="%.0f",number_color="black",
      colorRampPalette(hex(HSV(c(210,210,130,60,40,20,0),c(0,.5,.5,.5,.5,.5,.5),1)))(256)
    )
    Here the FST values are multiplied by 1,000 and not by 10,000 as usual:

    Last edited by Nganasankhan; 04-24-2021 at 11:28 PM.

  2. The Following 19 Users Say Thank You to Nganasankhan For This Useful Post:

     23abc (04-24-2021),  altvred (04-24-2021),  Bart (09-09-2021),  brauthaz (04-24-2021),  Bygdedweller (04-24-2021),  David Bush (04-30-2021),  discreetmaverick (04-28-2021),  DMXX (04-24-2021),  fabrice E (04-24-2021),  Kaazi (10-02-2021),  Kapisa (04-24-2021),  Kaspias (04-28-2021),  kolompar (04-24-2021),  mokordo (04-28-2021),  Nino90 (04-25-2021),  peloponnesian (05-20-2021),  randwulf (04-25-2021),  sheepslayer (04-25-2021),  Tsakhur (08-31-2021)

  3. #2
    Registered Users
    Posts
    241
    Sex
    Ethnicity
    Finnish

    The script in this post generates qpAdm models for a set of target populations, selects the model with the highest p-value for each target population, and displays the models as a stacked bar chart.

    Code:
    library(admixtools)
    library(tidyverse)
    library(reshape2) # for melt
    library(colorspace) # for hex
    library(cowplot) # for plot_grid
    
    left=c("Turkey_N_published","Estonia_CordedWare.SG","Latvia_HG","Russia_HG_Samara","Russia_HG_Tyumen","Nganasan")
    right=c("Russia_Ust_Ishim_HG_published.DG","Karitiana.DG","Papuan.DG","ONG.SG","Iran_Wezmeh_N.SG","Belgium_UP_GoyetQ116_1_published_all","Turkey_Epipaleolithic","Georgia_Kotias.SG","Japanese")
    target=c("Finnish.DG","Estonian.DG","Selkup","Hungarian.DG","Karelian","Mansi","Udmurt","Mordovian","Saami.DG","Besermyan","Finnish","Enets","Mari.SG","Finland_Saami_Modern.SG","Russia_Bolshoy","Norway_Viking_o1.SG","Russia_Chalmny_Varre")
    
    f2=f2_from_geno("path/to/v44.3_HO_public",pops=c(left,right,target))
    qp=sapply(target,function(x)qpadm(f2,left,right,x))
    
    # select model with highest p-value
    qp2=apply(qp,2,function(x)x$popdrop%>%filter(feasible==T&f4rank!=0)%>%top_n(1,p)%>%select(5,7:last_col(5)))
    
    # select model with highest number of left populations; choose highest p-value when tied
    # qp2=apply(qp,2,function(x)x$popdrop%>%filter(feasible==T)%>%arrange(desc(f4rank),desc(p))%>%head(1)%>%select(5,7:last_col(5)))
    
    qp3=do.call("rbind",qp2)
    qp3$target=rownames(qp3)
    qp3=qp3[order(qp3[,"Nganasan"],na.last=F),]
    # nona=qp3;nona[is.na(nona)]=0;qp3=qp3[order(3*nona[,"Nganasan"]+nona[,"Russia_HG_Tyumen"]+.8*nona[,"Russia_HG_Samara"]),]
    qp4=melt(qp3[,-1],id.var="target")
    
    lab=sub("e-0","e-",sub("^0","",sprintf(ifelse(qp3$p<.001,"%.0e","%.3f"),qp3$p)))
    capt=str_wrap(paste("Outgroups:",paste(sort(right),collapse=", ")),width=80)
    
    p=ggplot(qp4,aes(x=fct_rev(factor(target,level=qp3$target)),y=value,fill=variable))+
    geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
    geom_text(aes(label=round(100*value)),position=position_stack(vjust=.5,reverse=T),size=3.5)+
    labs(caption=capt)+
    coord_flip()+
    scale_x_discrete(expand=c(0,0),labels=rev(paste0(qp3$target," (",lab,")")))+
    scale_y_discrete(expand=c(0,0))+ # remove gray area on left and right side of bars
    scale_fill_manual(values=hex(HSV(c(30,60,180,210,120,330),.5,1)))+ # manual colors
    # scale_fill_manual(values=hex(HSV(head(seq(0,360,length.out=length(unique(qp4$variable))+1),-1),.5,1)))+ # automatic colors
    guides(fill=guide_legend(ncol=3,byrow=F))+
    theme(
      axis.text=element_text(color="black",size=11),
      axis.text.x=element_blank(),
      axis.ticks=element_blank(),
      axis.title=element_blank(),
      legend.direction="horizontal",
      legend.key=element_rect(fill=NA), # remove gray border around color squares
      legend.spacing.y=unit(.03,"in"), # reduce vertical space between color squares
      legend.margin=margin(0,0,-4,0),
      legend.text=element_text(size=10),
      legend.title=element_blank(),
      plot.caption=element_text(size=10)
    )
    
    hei=c(.05+.25*ceiling(length(unique(qp4[!is.na(qp4$value),2]))/3),.25*nrow(qp3)+.05+.18*(str_count(capt,"\n")+1))
    ggdraw(plot_grid(get_legend(p),p+theme(legend.position="none"),ncol=1,rel_heights=hei))
    ggsave("a.png",width=6,height=sum(hei))

  4. The Following 4 Users Say Thank You to Nganasankhan For This Useful Post:

     David Bush (04-30-2021),  Kaspias (04-28-2021),  Nino90 (04-28-2021),  Tsakhur (08-31-2021)

  5. #3
    Registered Users
    Posts
    241
    Sex
    Ethnicity
    Finnish

    The script in this post calculates a matrix of f3 distances between each pair of specified populations. It displays the distance to one specified population on the x axis and to another specified population on the y axis, but the clustering is based on the full matrix. The distances are displayed as min-max scaled percentages.

    Code:
    library(admixtools)
    library(tidyverse)
    library(ggrepel)
    
    p1=c("Altaian","Altaian_Chelkan","Azeri","Balkar","Bashkir","Besermyan","Buryat","Chuvash","Daur","Dolgan","Enets","Estonian","Even","Evenk_FarEast","Evenk_Transbaikal","Finnish","Gagauz","Hungarian","Kalmyk","Karachai","Karakalpak","Karelian","Kazakh","Kazakh_China","Khakass","Khakass_Kachin","Kumyk","Kyrgyz_China","Kyrgyz_Kyrgyzstan","Kyrgyz_Tajikistan","Mansi","Mongol","Mordovian","Nanai","Negidal","Nganasan","Nogai_Astrakhan","Nogai_Karachay_Cherkessia","Nogai_Stavropol","Salar","Selkup","Shor_Khakassia","Shor_Mountain","Tatar_Kazan","Tatar_Mishar","Tatar_Siberian","Tatar_Siberian_Zabolotniye","Todzin","Tofalar","Tubalar","Turkish","Turkish_Balikesir","Turkmen","Tuvinian","Udmurt","Uyghur","Uzbek","Veps","Xibo","Yakut","Yukagir_Forest","Yukagir_Tundra")
    p2=c("Korean","Yakut")
    p3="Khomani"
    
    f3=f3("path/to/v44.3_HO_public",c(p1,p2),c(p1,p2),p3)
    
    t=f3%>%select(pop1,pop2,est)%>%pivot_wider(names_from=pop1,values_from=est)%>%data.frame(row.names=1)
    t2=t[,p2]
    
    t2=as.data.frame(apply(t2,2,function(x)100*(x-min(x))/(max(x)-min(x))))
    
    t2$k=cutree(hclust(dist(t)),12)
    names(t2)=c("x","y","k")
    
    ggplot(t2,aes(x,y))+
    geom_abline(linetype="dashed",color="gray80",size=.2)+
    geom_polygon(data=t2%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.2)+
    geom_point(aes(color=as.factor(k)),size=.7)+
    geom_text_repel(aes(label=rownames(t2),color=as.factor(k)),max.overlaps=Inf,force=2,size=2.2,segment.size=.2,min.segment.length=.2,box.padding=.05)+
    xlab(paste0("100  minmax(f3(x, ",p2[1],", ",p3,"))"))+
    ylab(paste0("100  minmax(f3(x, ",p2[2],", ",p3,"))"))+
    scale_x_continuous(breaks=seq(0,100,10))+
    scale_y_continuous(breaks=seq(0,100,10))+
    scale_color_manual(values=rainbow_hcl(n_distinct(t2$k),100,50))+
    scale_fill_manual(values=rainbow_hcl(n_distinct(t2$k),100,50))+
    theme(
      axis.text=element_text(color="black",size=6),
      axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"pt"),
      axis.title=element_text(size=8),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray90",fill=NA,size=.4),
      panel.grid.major=element_line(color="gray90",size=.2)
    )
    
    ggsave("a.png",width=6,height=6)
    Last edited by Nganasankhan; 05-19-2021 at 08:56 PM.

  6. The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

     David Bush (05-19-2021),  Tsakhur (08-31-2021)

  7. #4
    Registered Users
    Posts
    241
    Sex
    Ethnicity
    Finnish

    Print f2 distance to a specified population

    Code:
    $ brew install R
    $ Rscript -e 'install.packages("devtools");devtools::install_github("uqrmaie1/admixtools")'
    $ wget https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.{anno,ind,snp,geno}
    $ printf %s\\n Bashkir Belarusian Chuvash Chuvash Croatian English Estonian Finnish French Greek Hungarian Icelandic Italian_North Italian_South Karelian Lebanese Lithuanian Maltese Mordovian Norwegian Russian Russian_Archangelsk_Krasnoborsky Russian_Archangelsk_Leshukonsky Russian_Archangelsk_Pinezhsky Saami.DG Sardinian Sicilian Spanish Tatar_Kazan Tatar_Mishar Udmurt Ukrainian Veps>pop
    $ Rscript -e 'library(admixtools);f=f2("v44.3_HO_public",readLines(pop),unique_only=F);s=f[f$pop1=="Russian_Archangelsk_Leshukonsky",2:3];s=s[order(s$est),];cat(paste(sprintf("%.4f",s$est),s$pop2),sep="\n")'
    0.0000 Russian_Archangelsk_Leshukonsky
    0.0020 Russian_Archangelsk_Krasnoborsky
    0.0021 Russian_Archangelsk_Pinezhsky
    0.0025 Karelian
    0.0027 Finnish
    0.0029 Tatar_Kazan
    0.0029 Russian
    0.0031 Veps
    0.0032 Mordovian
    0.0032 Estonian
    0.0034 Belarusian
    0.0034 Tatar_Mishar
    0.0034 Ukrainian
    0.0036 Chuvash
    0.0040 Hungarian
    0.0040 Lithuanian
    0.0041 Croatian
    0.0042 Norwegian
    0.0042 Bashkir
    0.0044 French
    0.0044 English
    0.0044 Icelandic
    0.0052 Spanish
    0.0055 Udmurt
    0.0055 Greek
    0.0055 Italian_North
    0.0061 Italian_South
    0.0062 Sicilian
    0.0068 Maltese
    0.0078 Lebanese
    0.0080 Saami.DG
    0.0088 Sardinian
    Compare f2 and f3 distance to a specified population in a scatterplot

    The argument `unique_only=F` to the `f2` function lists all population pairs instead of omitting the half of pairs that are already included in flipped order. A similar argument is missing from the `fst` function.

    Code:
    library(admixtools)
    library(tidyverse)
    
    p1=c("Bashkir","Belarusian","Chuvash","Chuvash","Croatian","English","Estonian","Finnish","French","Greek","Hungarian","Icelandic","Italian_North","Italian_South","Karelian","Lebanese","Lithuanian","Maltese","Mordovian","Norwegian","Russian","Russian_Archangelsk_Krasnoborsky","Russian_Archangelsk_Leshukonsky","Russian_Archangelsk_Pinezhsky","Sardinian","Sicilian","Spanish","Tatar_Kazan","Tatar_Mishar","Udmurt","Ukrainian","Veps")
    p3="Yoruba"
    targ="Finnish"
    
    blocks=f2_from_geno("v44.3_HO_public",pops=c(p1,p3))
    f2=f2(blocks,p1,unique_only=F)
    f3=f3(blocks,p3,p1,p1)
    # fst=fst(blocks,p1)
    
    f22=f2%>%select(pop1,pop2,est)
    f32=f3%>%select(pop2,pop3,est)
    # fst2=rbind(fst[,-4],setNames(fst[,c(2,1,3)],names(fst)[-4]),data.frame(pop1=p1,pop2=p1,est=0))%>%arrange(pop1,pop2)
    
    xy=cbind(f32%>%filter(pop2==targ)%>%select(pop3,est),f22%>%filter(pop1==targ)%>%select(est))
    names(xy)=c("pop","x","y")
    
    ggplot(xy,aes(x,y))+
    geom_smooth(method="lm",se=F,size=.5)+
    geom_point(size=.5)+
    geom_text(label=xy$pop,size=2,vjust=-.7)+
    labs(x=paste0("f3(",p3,", ",targ,", x)"),y=paste0("f2(",targ,", x)"))+
    scale_x_continuous(breaks=seq(0,100,.002),expand=expansion(mult=c(.08,.08)))+
    scale_y_continuous(breaks=seq(0,100,.002),expand=expansion(mult=c(.04,.04)))+
    theme(
      axis.text=element_text(size=6,color="black"),
      axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"cm"),
      axis.title=element_text(size=8),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray80",fill=NA,size=.6),
      panel.grid.major=element_line(color="gray85",size=.2),
      plot.subtitle=element_text(size=7),
      plot.title=element_text(size=11)
    )
    
    ggsave("a.png",width=5,height=5)


    Generate a CSV file for an f2 matrix of specified populations

    Code:
    $ Rscript -e 'library(admixtools);library(tidyverse);f=f2("v44.3_HO_public",c("Estonian","Finnish","Karelian","Mordovian","Udmurt","Veps"),unique_only=F);f[,-4]%>%pivot_wider(names_from=pop1,values_from=est)%>%data.frame(row.names=1)%>%round(6)%>%write.csv("f2.csv",quote=F)'
    $ cat f2.csv
    ,Estonian,Finnish,Karelian,Mordovian,Udmurt,Veps
    Estonian,0,0.000947,0.001606,0.000949,0.006555,0.002509
    Finnish,0.000947,0,0.000415,0.001708,0.00597,0.001848
    Karelian,0.001606,0.000415,0,0.001849,0.006219,0.001245
    Mordovian,0.000949,0.001708,0.001849,0,0.005448,0.002697
    Udmurt,0.006555,0.00597,0.006219,0.005448,0,0.006655
    Veps,0.002509,0.001848,0.001245,0.002697,0.006655,0
    Generate a CSV file for a global f2 matrix

    This took about one hour to run on my computer.

    This removes duplicate samples by selecting unique samples by the version ID, which is the third field in the anno file. However this only includes modern populations, and it omits all populations with a suffix like .DG or .SDG or .SG, so in this case the code for removing duplicate samples doesn't actually remove any sample.

    Code:
    $ igno()(grep -Ev '\.REF|rel\.|fail\.|Ignore_|_dup|_contam|_lc|_father|_mother|_son|_daughter|_brother|_sister|_sibling|_twin|Neanderthal|Denisova|Vindija_light|Gorilla|Macaque|Marmoset|Orangutan|Primate_Chimp|hg19ref')
    $ sed 1d v44.3_HO_public.anno|awk -F$'\t' '!($8~/\./)&&$6==0'|sort -t$'\t' -rnk15|awk -F\\t '!a[$3]++{print$2,$8}'|igno|grep -v outlier|cut -d' ' -f1 >ind
    $ Rscript -e 'library(admixtools);library(tidyverse);blocks=f2_from_geno("v44.3_HO_public",inds=readLines("ind"));f=f2(blocks,unique_only=F);f[,-4]%>%pivot_wider(names_from=pop1,values_from=est)%>%data.frame(row.names=1)%>%round(7)%>%write.csv("worldf2.csv",quote=F)'
    $ awk -F, 'NR==1{for(i=2;i<=NF;i++)if($i==x)break;next}$1!=x{print$i,$1}' x=Udmurt worldf2.csv|sort -n|head|awk '{$1=sprintf("%.5f",$1)}1'
    0.00159 Besermyan
    0.00244 Tatar_Kazan
    0.00253 Bashkir
    0.00306 Chuvash
    0.00372 Tatar_Mishar
    0.00378 Nogai_Karachay_Cherkessia
    0.00385 Tatar_Siberian
    0.00403 Russian_Archangelsk_Krasnoborsky
    0.00409 Uzbek
    0.00424 Yukagir_Forest
    Generate an MDS plot based on an f2 matrix

    Classical MDS (multidimensional scaling) is basically a version of PCA which takes a distance matrix as an input. When `mat` is a matrix, `cmdscale(dist(mat))` produces the same coordinates as `prcomp(mat)$x`, except the sign of the columns may arbitrarily be flipped.

    This uses the CSV file produced by the previous script as input. Each population is connected with a line to its two closest neighbors. The colored clusters are based on cutting a hierarchical clustering tree at the height where it has 24 subtrees, but you can comment out a line to use K-means instead. The second plot below only includes the 128 closest populations to Mansi.

    Code:
    library(tidyverse)
    library(ggforce) # for geom_mark_hull
    library(colorspace) # for hex
    
    t=read.csv("https://pastebin.com/raw/B1t0ESsj",row.names=1,check.names=F) # global f2 matrix
    
    # pick=as.matrix(t)[,"Mansi"]%>%sort%>%head(128)%>%names
    # t=t[pick,pick]
    
    mds=cmdscale(as.dist(t),k=8,eig=T)
    p=as.data.frame(mds$points)
    eig=head(mds$eig,ncol(p))
    pct=paste0("MDS dimension ",1:ncol(p)," (",sprintf("%.1f",100*eig/sum(eig)),"%)")
    
    p$k=as.factor(cutree(hclust(as.dist(t)),24))
    # p$k=as.factor(kmeans(as.dist(t),24)$cluster) # use K-means instead of hierarchical clustering
    
    pal1=hex(HSV(seq(0,360,length.out=n_distinct(p$k)+1)%>%head(-1),.55,1))
    pal2=hex(HSV(seq(0,360,length.out=n_distinct(p$k)+1)%>%head(-1),.25,1))
    
    i=1
    seg=lapply(1:3,function(j)apply(t,1,function(x)unlist(p[names(sort(x)[j]),c(i,i+1)],use.names=F))%>%t%>%cbind(p[,c(i,i+1)]))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
    
    p1=sym(paste0("V",i))
    p2=sym(paste0("V",i+1))
    
    ggplot(p,aes(!!p1,!!p2))+
    geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray80",size=.15)+
    geom_mark_hull(aes(group=k),color=pal1[p$k],fill=pal1[p$k],concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.2,size=.15)+
    geom_point(aes(x=!!p1,y=!!p2),color=pal1[p$k],size=.3)+
    geom_text(aes(x=!!p1,y=!!p2),label=rownames(p),color=pal2[p$k],size=2,vjust=-.7)+
    labs(x=pct[i],y=pct[i+1])+
    scale_x_continuous(breaks=seq(-1,1,.02))+
    scale_y_continuous(breaks=seq(-1,1,.02))+
    theme(
      axis.text=element_text(color="gray90",size=6),
      axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"pt"),
      axis.title=element_text(color="gray90",size=8),
      legend.position="none",
      panel.background=element_rect(fill="gray40"),
      panel.border=element_rect(color="gray30",fill=NA,size=.8),
      panel.grid.major=element_line(color="gray30",size=.2),
      panel.grid.minor=element_blank(),
      plot.background=element_rect(fill="gray40",color=NA)
    )
    
    ggsave("a.png",width=13,height=7)
    Last edited by Nganasankhan; 09-07-2021 at 02:16 AM.

  8. The Following User Says Thank You to Nganasankhan For This Useful Post:

     David Bush (09-12-2021)

  9. #5
    Registered Users
    Posts
    241
    Sex
    Ethnicity
    Finnish

    Make an f2 MDS plot that displays ratio of f2 distance to two specified target populations

    You can comment out one paragraph to show f2 distance multiplied by 10,000 instead.

    Code:
    printf %s\\n https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.{anno,ind,snp,geno}|xargs -n1 curl -sO
    f=v44.3_HO_public;convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
    x=north;awk -F\\t '$11>35&&$6==0&&$8!~/\./{print$2,$8}' v44.3_HO_public.anno|grep -Ev 'Ignore|_outlier|Denisova|Vindija_light'>$x.pick
    plink --allow-no-sex --bfile v44.3_HO_public --keep <(awk 'NR==FNR{a[$1];next}$2 in a' $x.pick v44.3_HO_public.fam) --make-bed --out $x
    plink --allow-no-sex --bfile $x --indep-pairwise 50 10 .1 --geno .01 --out $x;plink --allow-no-sex --bfile $x --extract $x.prune.in --make-bed --out $x.p
    awk 'NR==FNR{a[$1]=$2;next}{$1=a[$2]}1' $x.pick $x.fam>$x.p.fam
    Rscript -e 'library(admixtools);p="'$x.p'";f=f2(p,unique_only=F);df=as.data.frame(f);write.csv(round(xtabs(df[,3]~df[,2]+df[,1]),8),paste0(p,".f2"),quote=F)'
    Code:
    library(tidyverse)
    library(ggforce)
    library(colorspace)
    
    t=read.csv("north.p.f2",row.names=1,check.names=F)
    
    targ1="French"
    targ2="Han"
    
    igno=rownames(t)%in%c("AA","Chukchi1","Eskimo_Naukan","Eskimo_ChaplinSireniki","Chukchi","Itelmen","Koryak","Tlingit","Aleut")
    t=t[!igno,!igno]
    
    # pick=t[targ1,]%>%sort%>%t%>%head(64)%>%rownames
    # t=t[pick,pick]
    
    # show ratio of f2 distance between two populations scaled to range 0-100
    dist=50*(1+(t[,targ1]-t[,targ2])/(t[,targ1]+t[,targ2]))
    title=paste0("50*(1+(f2(",targ1,",x)-f2(",targ2,",x))/(f2(",targ1,",x)+f2(",targ2,",x)))")
    brk=seq(0,100,25)
    lab=brk
    popname=paste(rownames(p),round(dist))
    pal=hcl(c(240,190,110,90,60,30,0)+15,c(90,90,90,0,90,90,90),c(65,65,65,0,65,65,65))
    
    # # show difference of f2 distance to two populations
    # dist=t[,targ1]-t[,targ2]
    # brk=seq(-max(abs(dist)),max(abs(dist)),length.out=5)
    # if(abs(min(dist))<max(dist)){brk[1]=min(dist)}else{brk[length(brk)]=max(dist)}
    # # lab=sub("0\\.",".",sprintf("%.4f",brk))
    # lab=round(1e4*brk)
    # title=paste0("1e4*(f2(",targ1,",x)-f2(",targ2,",x))")
    # popname=paste(rownames(p),round(dist*1e4))
    # pal=hcl(c(240,190,110,90,60,30,0)+15,c(90,90,90,0,90,90,90),c(65,65,65,0,65,65,65))
    
    # # show f2 distance to single population
    # dist=t[,targ1]*1e4
    # brk=seq(min(dist),max(dist),length.out=5)
    # ## lab=sub("0\\.",".",sprintf("%.4f",brk))
    # lab=round(brk)
    # title=paste0("1e4*f2(",targ1,",x)")
    # popname=paste(rownames(t),round(dist))
    # pal=rainbow_hcl(100,90,60,260+15,0+15)
    
    mds=cmdscale(as.dist(t),nrow(t)-1,eig=T)
    p=as.data.frame(mds$points)
    eig=head(mds$eig,ncol(p))
    pct=paste0("MDS dimension ",1:ncol(p)," (",sprintf("%.1f",100*eig/sum(eig)),"%)")
    
    p$k=as.factor(cutree(hclust(as.dist(t)),8))
    
    for(i in seq(1,7,2)){
      seg=lapply(1:3,function(j)apply(t,1,function(x)unlist(p[names(sort(x)[j]),c(i,i+1)],use.names=F))%>%t%>%cbind(p[,c(i,i+1)]))%>%do.call(rbind,.)%>%setNames(paste0("V",1:4))
    
      xpc=sym(paste0("V",i))
      ypc=sym(paste0("V",i+1))
    
      ggplot(p,aes(!!xpc,!!ypc))+
      geom_segment(data=seg,aes(x=V1,y=V2,xend=V3,yend=V4),color="gray60",size=.15)+
      ggforce::geom_mark_hull(aes(group=k),color="gray50",fill="gray50",concavity=100,radius=unit(.15,"cm"),expand=unit(.15,"cm"),alpha=.1,size=.15)+
      geom_point(aes(x=!!xpc,y=!!ypc,color=dist),size=.3)+
      labs(x=pct[i],y=pct[i+1])+
      geom_text(aes(x=!!xpc,y=!!ypc,color=dist),label=popname,size=2,vjust=-.7)+
      # ggrepel::geom_text_repel(aes(label=rownames(p),color=dist),max.overlaps=Inf,force=5,size=2.2,box.padding=.0,point.padding=0,min.segment.length=.1,segment.size=.2)+
      scale_x_continuous(breaks=seq(-1,1,.002),expand=expansion(.06))+
      scale_y_continuous(breaks=seq(-1,1,.002),expand=expansion(.03))+
      scale_color_gradientn(breaks=brk,labels=lab,colors=pal,name=title)+
      theme(
        axis.text=element_text(color="black",size=6),
        axis.text.y=element_text(angle=90,vjust=1,hjust=.5),
        axis.ticks=element_blank(),
        axis.ticks.length=unit(0,"pt"),
        axis.title=element_text(color="black",size=8),
        legend.direction="horizontal",
        legend.justification="left",
        legend.key.height=unit(.4,"cm"),
        legend.key.width=unit(.7,"cm"),
        legend.margin=margin(0,0,-7,0),
        legend.position="top",
        # legend.position=c(.285,.955), # place legend inside plot
        legend.spacing.y=unit(.05,"cm"),
        legend.text=element_text(size=6,vjust=.5),
        legend.title=element_text(size=8,vjust=2/3),
        panel.background=element_rect(fill="white"),
        panel.border=element_rect(color="gray90",fill=NA,size=.4),
        panel.grid.major=element_line(color="gray90",size=.2),
        panel.grid.minor=element_blank(),
        plot.background=element_rect(fill="white",color=NA),
        plot.title=element_text(size=10,color="black")
      )
    
      ggsave(paste0(i,".png"),width=6,height=6)
    }
    Last edited by Nganasankhan; 09-30-2021 at 11:30 PM.

  10. The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

     al.Krivich (10-01-2021),  David Bush (10-05-2021)

  11. #6
    Registered Users
    Posts
    241
    Sex
    Ethnicity
    Finnish

    Use qpGraph to model a modern population as a combination of the common ancestors of other modern populations

    Code:
    t=read.table(text="R A
    R B
    A Nganasan
    A Han
    B Sardinian
    B Lithuanian
    A C
    B C
    C Udmurt")
    
    pop=unique(as.vector(as.matrix(t)))
    pop=pop[nchar(pop)>2]
    
    f2=f2_from_geno("v50.0_HO_public",pops=pop,maxmiss=1)
    gr=qpgraph(f2,t)
    
    plot_graph(gr$edges)
    ggsave("1.png",width=4,height=4)
    system("mogrify -trim -border 48 -bordercolor white 1.png")


    Use qpGraph to estimate the percentage of East Eurasian ancestry

    Code:
    p1=c("Nganasan","Han")
    p2=c("Sardinian","Lithuanian")
    p3=c("Mari.SG","Udmurt","Finnish","Bashkir","Besermyan","Chuvash","Enets","Estonian","French","Hungarian","Karelian","Korean","Mansi","Mordovian","Nganasan","Saami.DG","Selkup","Veps","Finland_Levanluhta","Norway_Viking_o1.SG","Russia_Bolshoy","Russia_Sargat_IA","Russia_Mezhovskaya.SG","Russia_KusnarenkovoKarajakupovo_Medieval.SG","Russia_Chalmny_Varre","Russia_HG_Karelia","Estonia_MN_CCC_1","Estonia_MN_CCC_2","Estonia_N_CombCeramic.SG","Estonia_CordedWare")
    
    f2=f2_from_geno("v50.0_HO_public",pops=c(p1,p2,p3),maxmiss=1)
    
    r=sapply(p3,function(x){
      tree=rbind(c("R","A"),c("R","B"),c("A","C"),c("B","C"),cbind("A",p1),cbind("B",p2),c("C",x))
      gr=qpgraph(f2,tree)
      (gr$edges%>%filter(from=="A"&to=="C"))$weight%>%list(gr$score)
    })%>%apply(1,unlist)
    
    r=r[order(r[,1]),]
    paste(sprintf("%.1f",100*r[,1]),rownames(r),round(r[,2]))%>%writeLines
    Output:

    0.7 French 443
    1.4 Estonian 130
    1.6 Hungarian 283
    4.2 Estonia_CordedWare 79
    7.1 Estonia_MN_CCC_1 162
    7.3 Finnish 206
    7.9 Mordovian 162
    8.3 Karelian 247
    10.2 Estonia_N_CombCeramic.SG 164
    10.4 Veps 255
    10.5 Estonia_MN_CCC_2 175
    17.1 Russia_Mezhovskaya.SG 137
    18.6 Russia_HG_Karelia 355
    25.2 Chuvash 356
    25.9 Saami.DG 260
    26.0 Besermyan 293
    26.0 Russia_Chalmny_Varre 192
    29.5 Finland_Levanluhta 208
    29.7 Udmurt 496
    33.5 Mari.SG 164
    37.2 Norway_Viking_o1.SG 221
    37.7 Bashkir 261
    40.1 Russia_Sargat_IA 326
    50.4 Russia_KusnarenkovoKarajakupovo_Medieval.SG 194
    53.8 Russia_Bolshoy 697
    57.3 Mansi 877
    71.6 Selkup 1317
    80.4 Enets 1269
    100.0 Korean 1139
    100.0 Nganasan 3751
    Last edited by Nganasankhan; 10-22-2021 at 11:31 PM.

  12. The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

     David Bush (10-23-2021),  michal3141 (10-23-2021)

  13. #7
    Registered Users
    Posts
    241
    Sex
    Ethnicity
    Finnish

    Apply the same qpGraph model to a series of target populations and make a table of admixture weights

    This reproduces Figure 6A from Tambets et al. 2018: https://genomebiology.biomedcentral....18-1522-1#Fig6.

    Code:
    graph=read.table(text="R Yoruba.DG
    R A
    A O
    A AA
    O E
    O CH
    E EU
    E L1
    EU WH
    EU EH
    AA EH
    EH EH2
    EH EH1
    EH AAA
    AA AAA
    WH CCE
    WH Italy_North_Villabruna_HG
    CH YM
    CH CH1
    CH AAN
    EH1 YM
    EH1 EH0
    AAA AAN
    CH1 EH0
    CH1 Georgia_Kotias.SG
    L1 Turkey_N_published
    L1 LB
    L1 YM1
    L1 CW
    EH2 LB
    EH2 CWC
    YM YM1
    YM CW
    EH0 Russia_HG_Karelia
    AAN Nganasan
    AAN CCA
    LB Germany_EN_LBK
    LB CCW
    YM1 Russia_Samara_EBA_Yamnaya
    CW CCW
    CW Estonia_CordedWare
    CCW CWC
    CWC CCE
    CCE CCA")
    
    A=function(x)unlist(strsplit(x," "))
    targets=A("Hungarian Estonian Finnish Karelian Veps Saami.DG Mordovian Besermyan Udmurt Mansi Selkup Enets")
    paths=list(LBK=A("LB CCW CWC CCE CCA"),Corded_Ware=A("CW CCW CWC CCE CCA"),WHG=A("WH CCE CCA"),EHG=A("EH2 CWC CCE CCA"),Nganasan=A("AAN CCA"))
    
    pop=unlist(graph)%>%unique%>%Filter(function(x)nchar(x)>3,.)
    
    f2=f2_from_geno("v50.0_HO_public",pops=c(pop,targets),maxmiss=1)
    
    m=sapply(targets,function(targ){
      gr=qpgraph(f2,rbind(graph,c("CCA",targ)))
      e=gr$edges
      sapply(paths,function(p)cbind(p[1:length(p)-1],p[-1])%>%apply(1,function(x)e[e$from==x[1]&e$to==x[2],]$weight)%>%reduce(`*`))
    })
    
    round(100*m)
    Output:

    Code:
                Hungarian Estonian Finnish Karelian Veps Saami.DG Mordovian
    LBK                30       11      13       12   10        0        12
    Corded_Ware        63       73      66       58   56       66        63
    WHG                 1        1       1        0    0        0         1
    EHG                 0       10      12       23   23       14        12
    Nganasan            6        4       8        8   11       20        12
                Besermyan Udmurt Mansi Selkup Enets
    LBK                 4      0     0      0     0
    Corded_Ware        64     63    37     30     7
    WHG                 0      0     0      0     0
    EHG                 1      5     6      1     9
    Nganasan           30     32    57     69    85

  14. The Following User Says Thank You to Nganasankhan For This Useful Post:

     David Bush (10-23-2021)

Similar Threads

  1. How to get AdmixTools running
    By Bas in forum Inquiries Corner
    Replies: 42
    Last Post: Yesterday, 11:06 AM
  2. Shell commands and R scripts for working with G25
    By Nganasankhan in forum Autosomal (auDNA)
    Replies: 52
    Last Post: 10-31-2021, 04:17 PM
  3. How to run Admixtools
    By Bas in forum Inquiries Corner
    Replies: 17
    Last Post: 10-13-2017, 09:30 PM
  4. Admixtools & Ubuntu
    By anglesqueville in forum Inquiries Corner
    Replies: 23
    Last Post: 03-02-2017, 05:42 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •