Page 1 of 4 123 ... LastLast
Results 1 to 10 of 39

Thread: Shell commands and R scripts for working with G25

  1. #1
    Registered Users
    Posts
    92
    Sex
    Ethnicity
    Finnish

    Shell commands and R scripts for working with G25

    I'll post R scripts later. You can also post your own scripts in other programming languages, or you can post versions of my scripts converted to other languages.

    Some of the shell commands require GNU `awk`. You can install it on macOS by running `brew install gawk`.

    Download datasheets.

    Code:
    printf %s\\n ais\ 1UrhcfNMLW0oMXIbHGUE60v2taCM7PFw1 aas\ 1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl mis\ 1HYrDwxEXv82DvDLoq736pS5ZTGJA4dn5 mas\ 1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y aiu\ 1YKkEOtyV5SISvmY_FyS4YSLXCxxYt5_W aau\ 1f0imQyVNZ9RPESNAYIeIkA8fx4wAVNYo miu\ 18GcEVEl3GI-ByviD-TgQQjvEaaTbNTr2 mau\ 1y49hyvviJpHj9esVqyeiFm32DhnPlfRQ|while read l m;do curl "https://drive.google.com/uc?export=download&id=$m" -Lso $l;done
    Find the closest ancient populations for a modern population.

    Code:
    $ dist()(awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{printf"%."x"f %s\n",$1,$2}' "x=${3-3}"|sed s,^0,,)
    $ dist aas <(grep Finnish, mas)|head -n4
    .038 VK2020_SWE_Gotland_VA_o
    .042 VK2020_UKR_Shestovitsa_VA
    .045 VK2020_SWE_Uppsala_VA
    .045 Baltic_EST_IA
    Find outliers by sorting individuals within a population by their distance to the population average.

    Code:
    $ dist <(grep ^Ket: mis) <(grep ^Ket, mas)
    .078 Ket:587_R01C02
    .092 Ket:584_R01C01
    .094 Ket:584_R02C01
    .108 Ket:586_R01C02
    .160 Ket:570_R02C02
    Pick populations or samples by listing their names.

    Code:
    $ 25p()(printf %s\\n "${@-$(cat)}"|awk -F, 'NR==FNR{a[$1]=$0;next}$0 in a{print a[$0]}' <(cat [ma][ai]s) -)
    $ 25p Komi FIN_Levanluhta_IA_o:JK2065
    Komi,0.1187174,0.0307705,0.0850408,0.0682822,-0.0046161,0.0141677,0.0094945,0.0130841,-0.0030884,-0.0298138,0.0099383,-0.0086022,0.0182258,-0.0062204,-0.0087676,-0.0047601,-0.0014733,-0.0014444,-0.0039595,-0.0039646,0.0014723,-0.0011993,-0.0001234,0.0012051,0.0004191
    FIN_Levanluhta_IA_o:JK2065,0.135449,0.121864,0.083344,0.072352,0.047393,0.019522,-0.0094,0.005077,0.011453,0.001276,0.001137,0.002548,-0.011001,-0.013625,0.034609,0.0118,-0.004563,0.001647,0.012067,0.019384,0.000374,-0.001237,-0.004683,-0.002048,0.005389
    Calculate population averages for individual samples.

    Code:
    $ tav()(awk '{n[$1]++;for(i=2;i<=NF;i++){a[$1][i]+=$i}}END{for(i in a){o=i;for(j=2;j<=NF;j++)o=o FS sprintf("%f",a[i][j]/n[i]);print o}}' "FS=${1-$'\t'}")
    $ printf %s\\n Finnish Nenets|awk -F: 'NR==FNR{a[$0];next}$1 in a' - mis|sed 's/:[^,]*//'|tav ,
    Finnish,0.130593,0.091871,0.089906,0.076616,0.031719,0.023260,0.008727,0.013184,0.003177,-0.023071,0.003681,-0.006034,0.011923,0.005808,0.002108,-0.001282,-0.008136,0.000659,0.001291,0.002760,0.007046,0.001171,0.002794,0.005695,0.002108
    Nenets,0.069546,-0.264850,0.129352,0.038502,-0.113160,-0.049810,0.017015,0.027853,0.011944,-0.014342,0.073416,0.002548,0.010169,-0.056673,-0.023792,-0.012530,-0.003286,0.006322,0.008698,-0.008929,0.008810,0.017014,0.030097,-0.005350,-0.001365
    Calculate averages of population averages.

    Code:
    $ for x in Lithuanian French;do grep ^$x mas|sed 's/_[^,]*//';done|tav ,
    Lithuanian,0.135170,0.125249,0.084170,0.081240,0.043740,0.031013,0.010828,0.015892,-0.003142,-0.031693,-0.003967,-0.011805,0.021551,0.029319,-0.011364,0.000373,0.003076,-0.000428,0.002890,0.002881,-0.004133,-0.005420,0.008418,-0.007730,0.000102
    French,0.127181,0.142569,0.046532,0.017254,0.040677,0.006796,0.001748,0.002848,0.011926,0.018863,-0.004382,0.006062,-0.012855,-0.009116,0.010056,0.003374,-0.002865,0.001810,0.001438,-0.000224,0.003377,0.002120,-0.001762,0.004233,-0.000823
    Find ancient populations whose closest modern population average is Uralic.

    Code:
    grep -v _contam aas|while read l;do dist mas <(echo $l) 6|head -n1|awk 'NR==FNR{a[$0];next}$2 in a' <(printf %s\\n Besermyan Estonian Finnish Finnish_East Hungarian Ingrian Karelian Khanty Komi Mansi Mari Mordovian Nenets Nganassan Saami Saami_Kola Selkup Udmurt Vepsian) -|sed s/^/${l%%,*}\ /;done|head -n4
    Baltic_EST_IA .028435 Estonian
    Baltic_LVA_MN .167850 Finnish_East
    BEL_GoyetQ2 .198076 Finnish_East
    Corded_Ware_Baltic .046021 Ingrian
    Find populations with the highest distance between individuals within the population.

    Code:
    $ awk -F, 'NR==FNR{p=$1;sub(":.*","",p);for(i=2;i<=NF;i++)a[p][NR][i]=$i;next}END{for(i in a){max=0;for(j in a[i]){for(k in a[i]){s=0;for(l=2;l<=NF;l++)s+=(a[i][j][l]-a[i][k][l])^2;s=s^.5;if(s>max)max=s}}print length(a[i]),i,max}}' mis|sort -rnk3|head -n4
    33 Tunisian_Douz 0.518493
    20 Berber_Algeria 0.51828
    14 Tunisian_Berber_Matmata 0.418829
    9 Greenlander_East 0.412162
    Find modern individuals whose distance to the closest modern population average is .1 or more. The first column shows the sample name, the second column shows the distance, and the third column shows the closest population average.

    Code:
    $ awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[$1][i]=$i;next}{min=-1;minp="";for(i in a){s=0;for(j=2;j<=NF;j++)s+=(a[i][j]-$j)^2;s=s^.5;if(min==-1||s<min){min=s;minp=i}}print$1,min,minp}' mas mis|awk '$2>=.1'
    Berber_Algeria:BerZN151 0.100094 Fulani
    Berber_Algeria:BerZN182 0.100732 Gambian
    Ju_hoan_North:A_Ju_hoan_North-5 0.184278 Ju_hoan_North
    Ju_hoan_North:B_Ju_hoan_North-4 0.186879 Ju_hoan_North
    Generate two-way models where the percentage of both source populations is 50%, with one target population (RUS_Karelia_HG below), one fixed source population (RUS_MA1 below), and one variable source population.

    Code:
    $ awk -F, 'NR==1{for(i=2;i<=NF;i++)a[i]=$i;next}NR==2{for(i=2;i<=NF;i++)b[i]=$i;next}{s=0;for(i=2;i<=NF;i++)s+=(a[i]-(.5*$i+.5*b[i]))^2;print s^.5,$1}' <(grep RUS_Karelia_HG aas) <(grep RUS_MA1 aas) aas|sort -n|awk '{printf"%.3f %s\n",$1,$2}'|sed s,^0,,|head -n8
    .055 VK2020_NOR_North_LN_HG
    .058 Baltic_LVA_MN
    .059 NOR_N_HG
    .059 NOR_Meso
    .062 UKR_Meso
    .065 RUS_Veretye_Meso
    .066 UKR_N
    .072 SWE_Meso
    Generate two-way models with one target population, one fixed source population, and one variable source population. The first column is the distance of the model, the second column is the percentage of the variable source population, and the third column is the name of the variable source population. In order to return percentages with one digit after the decimal point, replace `r+=1` with `r+=.1` and replace `%.0f` with `%.1f`.

    Code:
    $ 2way()(awk -F, 'NR==1{for(i=2;i<=NF;i++)a[i]=$i;next}NR==2{for(i=2;i<=NF;i++)b[i]=$i;next}{for(r=0;r<=100;r+=1){s=0;for(i=2;i<=NF;i++)s+=(a[i]-(.01*r*$i+(1-.01*r)*b[i]))^2;t[r]=s^.5};min=-1;for(i in t)if(min==-1||t[i]<min){min=t[i];minr=i};if(min!=0)printf"%f %.0f %s\n",min,minr,$1}' "[email protected]"|sort -n|awk '{$1=sprintf("%.3f",$1);sub(/^0/,"")}1')
    $ 2way <(grep Udmurt mas) <(grep RUS_Krasnoyarsk_BA aas) aas|head -n8
    .029 76 UZB_Kashkarchi_BA
    .029 78 KAZ_Mys_MLBA
    .031 77 RUS_Krasnoyarsk_MLBA
    .031 79 KAZ_Zevakinskiy_BA
    .032 76 RUS_Srubnaya_Alakul_MLBA
    .032 75 KAZ_Karagash_MLBA
    .032 76 KAZ_Aktogai_MLBA
    .033 75 UKR_MBA
    Same as above, but use an output format that shows the percentage of both source populations.

    Code:
    $ 2way2()(awk -F, 'NR==1{print"Target: "$1;for(i=2;i<=NF;i++)a[i]=$i;next}NR==2{x=$1;for(i=2;i<=NF;i++)b[i]=$i;next}{for(r=0;r<=100;r+=1){s=0;for(i=2;i<=NF;i++)s+=(a[i]-(.01*r*$i+(1-.01*r)*b[i]))^2;t[r]=s^.5};min=-1;for(i in t)if(min==-1||t[i]<min){min=t[i];minr=i};if(min!=0)printf"%f - %.0f%% %s + %.0f%% %s\n",min,minr,$1,100-minr,x}' "[email protected]"|(read;echo "$REPLY";sort -n|awk '{$1=sprintf("%.3f",$1);sub(/^0?/,"d=")}1'))
    $ 2way2 <(grep Udmurt mas) <(grep RUS_Krasnoyarsk_BA aas) aas|head -n9
    Target: Udmurt
    d=.029 - 76% UZB_Kashkarchi_BA + 24% RUS_Krasnoyarsk_BA
    d=.029 - 78% KAZ_Mys_MLBA + 22% RUS_Krasnoyarsk_BA
    d=.031 - 77% RUS_Krasnoyarsk_MLBA + 23% RUS_Krasnoyarsk_BA
    d=.031 - 79% KAZ_Zevakinskiy_BA + 21% RUS_Krasnoyarsk_BA
    d=.032 - 76% RUS_Srubnaya_Alakul_MLBA + 24% RUS_Krasnoyarsk_BA
    d=.032 - 75% KAZ_Karagash_MLBA + 25% RUS_Krasnoyarsk_BA
    d=.032 - 76% KAZ_Aktogai_MLBA + 24% RUS_Krasnoyarsk_BA
    d=.033 - 75% UKR_MBA + 25% RUS_Krasnoyarsk_BA
    Generate two-way models with two fixed source populations (RUS_Krasnoyarsk_BA and Baltic_LTU_BA below) and one variable target population. The second column displays the percentage of the first source population.

    Code:
    $ way2()(awk -F, 'NR==1{for(i=2;i<=NF;i++)a[i]=$i;next}NR==2{for(i=2;i<=NF;i++)b[i]=$i;next} {for(r=0;r<=100;r+=1){s=0;for(i=2;i<=NF;i++)s+=($i-(.01*r*a[i]+(1-.01*r)*b[i]))^2;t[r]=s^.5};min=-1;for(i in t)if(min==-1||t[i]<min){min=t[i];minr=i};if(min!=0)printf"%f %.0f %s\n",min,minr,$1}' "[email protected]"|awk '{$1=sprintf("%.3f",$1);sub(/^0/,"")}1')
    $ way2 <(grep RUS_Krasnoyarsk_BA aas) <(grep Baltic_LTU_BA aas) <(printf %s\\n Estonian Finnish_East Karelian Khanty Komi Mansi Mari Mordovian Nenets Nganassan Saami Saami_Kola Selkup Udmurt Vepsian Bashkir Chuvash Tatar_Kazan Tatar_Mishar Tatar_Siberian Tatar_Siberian_Zabolotniye FIN_Levanluhta_IA RUS_Chalmny-Varre RUS_Bolshoy_Oleni_Ostrov{,_o} VK2020_NOR_North_VA_o{1,2} VK2020_SWE_Gotland_VA_o|awk -F, 'NR==FNR{a[$0];next}$1 in a' - mas aas)|sort -nk2
    .033 3 Estonian
    .042 9 Mordovian
    .029 10 Vepsian
    .031 10 Karelian
    .035 10 Finnish_East
    .036 11 VK2020_SWE_Gotland_VA_o
    .058 17 Tatar_Mishar
    .053 18 VK2020_NOR_North_VA_o2
    .038 19 Komi
    .031 21 Saami_Kola
    .064 22 Tatar_Kazan
    .041 26 RUS_Chalmny-Varre
    .068 27 Chuvash
    .044 28 Saami
    .060 28 Udmurt
    .042 30 FIN_Levanluhta_IA
    .092 34 Mari
    .052 37 VK2020_NOR_North_VA_o1
    .077 40 Bashkir
    .082 46 Tatar_Siberian
    .094 47 RUS_Bolshoy_Oleni_Ostrov
    .063 51 Tatar_Siberian_Zabolotniye
    .077 54 Mansi
    .081 56 Khanty
    .080 58 RUS_Bolshoy_Oleni_Ostrov_o
    .077 70 Selkup
    .059 73 Nenets
    .073 99 Nganassan
    Compare scaled and unscaled coordinates by using both to calculate the distance to Finns. The second column shows the distance using unscaled coordinates, the third column shows the distance using scaled coordinates, and the first column shows the ratio between the second and third column. The ratio is low for SSA populations, because with unscaled coordinates relatively little weight is given to PC1 which differentiates SSAs from non-SSAs.

    Code:
    dist2()(awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1")
    $ paste -d' ' <(dist2 mau <(grep Finnish, mau)) <(dist2 mas <(grep Finnish, mas))|awk '$3{print$1/$3,$0}'|sort -rn|awk '{printf"%.3f %.3f %.3f %s\n",$1,$2,$4,$3}'|sed s/0\\././g|(sed -u 4q;echo ...;tail -n4)
    .723 .011 .015 Ingrian
    .668 .029 .043 Cossack_Kuban
    .633 .017 .027 Russian_Kostroma
    .623 .024 .039 Russian_Tver
    ...
    .105 .080 .766 Umbundu
    .105 .080 .761 Kaba
    .105 .080 .766 Nyaneka
    .104 .080 .769 Ngumba
    Look up information about samples from the Reich dataset.

    Code:
    $ curl -LsO reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.anno
    $ grep MNG_North_N ais|cut -d, -f1|cut -d: -f2|grep -f- v44.3_1240K_public.anno|cat <(head -n1 v44.3_1240K_public.anno) -|cut -f2,7,10,15,16,17,21,25,27|rs -c -C -T|tr \\t ,
    Version ID,I13179,I11697,I11696,I13701,I11698,I14000,I13698
    Publication,WangBioRxiv2020,WangBioRxiv2020,WangBioRxiv2020,WangBioRxiv2020,WangBioRxiv2020,WangBioRxiv2020,WangBioRxiv2020
    Date mean in BP [OxCal mu for a direct radiocarbon date, and average of range for a contextual date],7515,7500,7528,7515,7520,7480,7504
    Country,Mongolia,Mongolia,Mongolia,Mongolia,Mongolia,Mongolia,Mongolia
    Lat.,49.53469,49.3925,49.3925,49.53469,49.3925,49.53469,49.53469
    Long.,103.2825,102.7025,102.7025,103.2825,102.7025,103.2825,103.2825
    SNPs hit on autosomal targets,774348,838638,866778,468513,959767,788721,924341
    Y haplogroup  in ISOGG v15.73 notation (automatically called),..,C2a1a,C2a1a,..,C2a1a,..,C2a1a
    mtDNA haplogroup if ≥2 coverage or published (merged data or consensus if not available),C4a1a3,D4j8,D4j,C4a1a3,D4b1c,D4j,[email protected]
    Extra Ruby script: calculate distance between populations.

    Code:
    $ cat ~/bin/eud
    #!/usr/bin/env ruby -roptparse
    
    opt={}
    OptionParser.new{|x|
      x.on("-m NUM",Integer){|y|opt[:m]=y}
      x.on("-f NUM",Integer){|y|opt[:f]=y}
    }.parse!
    
    a=IO.readlines(ARGV[0]).map{|l|x,*y=l.chomp.split(",");[x,y.map(&:to_f)]}
    
    puts IO.readlines(ARGV[1]).map{|l|
      x,*y=l.chomp.split(",")
      y.map!(&:to_f)
      d=a.reject{|z|z[0]==x}.map{|z|[z[1].map.with_index{|v,i|(v-y[i])**2}.sum**0.5,z[0]]}.sort_by(&:first)
      d=d.take(opt[:m])if opt[:m]
      "Distance to: #{x}\n"+d.map{|x,y|("%.#{opt[:f]||3}f"%x).sub(/^0/,"")+" "+y}*"\n"
    }*"\n\n"
    $ eud -m4 mas <(egrep Mbuti\|Ju_hoan_North mas)
    Distance to: Ju_hoan_North
    .162 Khomani_San
    .458 Bakola
    .472 Bedzan
    .485 Bantu_S.E.
    
    Distance to: Mbuti
    .444 Bakola
    .458 Bedzan
    .467 Baka
    .481 Biaka
    Last edited by Nganasankhan; 03-31-2021 at 05:53 PM.

  2. The Following 18 Users Say Thank You to Nganasankhan For This Useful Post:

     Bygdedweller (04-07-2021),  David Bush (04-11-2021),  discreetmaverick (04-03-2021),  dosas (03-31-2021),  hokkanto (03-31-2021),  Kaspias (04-04-2021),  Megalophias (03-30-2021),  mokordo (04-17-2021),  Northern Adriatic (04-03-2021),  perttimp (04-05-2021),  randwulf (03-31-2021),  Ryukendo (04-04-2021),  sheepslayer (03-31-2021),  sweuro (04-12-2021),  Tahmed (04-07-2021),  Tomenable (04-04-2021),  Varun R (05-03-2021),  xerxez (04-19-2021)

  3. #2
    Registered Users
    Posts
    193
    Sex

    This R script is for linear interpolation. It's most useful when you have the coordinates of a parent and a child so you can use it to generate the coordinates of the missing parent but you can use it any way you like. In this case I'll be using it to remove steppe admixture from Swedes. The quality of the generated coordinates will depend on the magnitude of difference so the quality will be much higher if you do it for members of the same population.

    Code:
    left <- "Yamnaya_UKR,0.1160995,0.0878435,0.0465745,0.106106,-0.030467,0.038766,0.0074025,-0.0026535,-0.0502105,-0.070252,-0.002923,0.003297,-0.002453,-0.019198,0.0311475,0.013988,-0.0009775,-0.0118455,-0.0036455,0.001313,-0.001248,-0.001175,-0.003636,0.0211475,0.001856"
    middle <- "Swedish,0.1340007,0.1283722,0.0715329,0.0557909,0.0414202,0.0213225,0.006847,0.0089682,0.0046947,-0.0067261,-0.004451,0.001635,-0.0048315,-0.0036907,0.0150773,0.0039838,-0.0074379,0.002004,0.003851,0.0046385,0.0064205,0.0021357,0.0009244,0.0106969,-0.0002994"
    amount <- 2
    left <- c(strsplit(left,","))
    middle <- c(strsplit(middle,","))
    right <- "Result"
    
    for(i in 2:26)
    {
    a <- as.numeric(left[[1]][i])
    b <- as.numeric(middle[[1]][i])
    c <- ((1-amount)*a+amount*b)
    right <- paste(right,c,sep=",") 
    }
    noquote(right)
    Target: Swedish
    Distance: 0.0000% / 0.00000000
    50.0 Result
    50.0 Yamnaya_UKR

    Distance to: Result
    0.08161614 POL_Globular_Amphora
    0.08334166 UKR_Globular_Amphora
    0.08383578 SWE_Megalithic_Ansarve
    0.08964459 FRA_Occitanie_LN

    You don't need to download R to use it just copypaste it here
    https://rdrr.io/snippets/
    Last edited by Standardized Ape; 03-31-2021 at 05:34 PM.

  4. The Following 9 Users Say Thank You to Standardized Ape For This Useful Post:

     BalkanKiwi (04-03-2021),  discreetmaverick (04-03-2021),  dosas (03-31-2021),  kaazi (04-05-2021),  mokordo (04-17-2021),  Nganasankhan (03-31-2021),  sheepslayer (03-31-2021),  sweuro (04-12-2021),  Tomenable (04-04-2021)

  5. #3
    Registered Users
    Posts
    92
    Sex
    Ethnicity
    Finnish

    Here's a shell version of the linear interpolation script:

    Code:
    $ echo 'Yamnaya_UKR,0.1160995,0.0878435,0.0465745,0.106106,-0.030467,0.038766,0.0074025,-0.0026535,-0.0502105,-0.070252,-0.002923,0.003297,-0.002453,-0.019198,0.0311475,0.013988,-0.0009775,-0.0118455,-0.0036455,0.001313,-0.001248,-0.001175,-0.003636,0.0211475,0.001856
    Swedish,0.1340007,0.1283722,0.0715329,0.0557909,0.0414202,0.0213225,0.006847,0.0089682,0.0046947,-0.0067261,-0.004451,0.001635,-0.0048315,-0.0036907,0.0150773,0.0039838,-0.0074379,0.002004,0.003851,0.0046385,0.0064205,0.0021357,0.0009244,0.0106969,-0.0002994'>/tmp/a
    $ awk -F, 'NR==1{for(i=2;i<=NF;i++)a[i]=$i}NR==2{for(i=2;i<=NF;i++)printf",%s",2*$i-a[i];print""}' /tmp/a|tee /tmp/b
    ,0.151902,0.168901,0.0964913,0.0054758,0.113307,0.003879,0.0062915,0.0205899,0.0595999,0.0567998,-0.005979,-2.7e-05,-0.00721,0.0118166,-0.0009929,-0.0060204,-0.0138983,0.0158535,0.0113475,0.007964,0.014089,0.0054464,0.0054848,0.0002463,-0.0024548
    $ curl 'https://drive.google.com/uc?export=download&id=1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl' -Lso aas
    $ dist()(awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1"|sort -n|awk '{printf"%."x"f %s\n",$1,$2}' "x=${3-3}"|sed s,^0,,)
    $ dist aas /tmp/b|head -n4
    .082 POL_Globular_Amphora
    .083 UKR_Globular_Amphora
    .084 SWE_Megalithic_Ansarve
    .090 FRA_Occitanie_LN

  6. The Following 4 Users Say Thank You to Nganasankhan For This Useful Post:

     dosas (04-03-2021),  Kaspias (04-04-2021),  sheepslayer (04-04-2021),  Tomenable (04-04-2021)

  7. #4
    Registered Users
    Posts
    92
    Sex
    Ethnicity
    Finnish

    This is a simplified version of nMonte3.R that only prints the percentage of each population in the oracle and the distance of the model. I didn't make changes to the `do_algorithm` function apart from changes to formatting, such as removing comments, removing space characters, using 2 instead of 4 spaces for indentation, and replacing assignment arrows with equals signs. Unlike the original `nMonte3.R`, this version requires that the files for the source and target populations have no header line.

    Code:
    getMonte=function(datafile,targetfile,Nbatch=500,Ncycles=1000,pen=0){
      do_algorithm=function(selection,targ){
        mySel=as.matrix(selection,rownames.force=NA)
        myTg=as.matrix(targ,rownames.force=NA)
        dif2targ=sweep(mySel,2,myTg,"-")
        Ndata=nrow(dif2targ)
        kcol=ncol(dif2targ)
        rowLabels=rownames(dif2targ)
        matPop=matrix(NA_integer_,Nbatch,1,byrow=T)
        dumPop=matrix(NA_integer_,Nbatch,1,byrow=T)
        matAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
        dumAdmix=matrix(NA_real_,Nbatch,kcol,byrow=T)
        matPop=sample(1:Ndata,Nbatch,replace=T)
        matAdmix=dif2targ[matPop,]
        colM1=colMeans(matAdmix)
        eval1=(1+pen)*sum(colM1^2)
        for(c in 1:Ncycles){
          dumPop=sample(1:Ndata,Nbatch,replace=T)
          dumAdmix=dif2targ[dumPop,]
          for(b in 1:Nbatch){
            store=matAdmix[b,]
            matAdmix[b,]=dumAdmix[b,]
            colM2=colMeans(matAdmix)
            eval2=sum(colM2^2)+pen*sum(matAdmix[b,]^2)
            if(eval2<=eval1){
              matPop[b]=dumPop[b]
              colM1=colM2
              eval1=eval2
            }else{matAdmix[b,]=store}
          }
        }
        fitted=t(colMeans(matAdmix)+myTg[1,])
        popl=unlist(lapply(strsplit(rowLabels[matPop],";"),function(x)x[1]))
        populations=factor(popl)
        return(list("estimated"=fitted,"pops"=populations))
      }
    
      source=read.csv(datafile,header=F,row.names=1)
      target=read.csv(targetfile,header=F,row.names=1)
    
      for(i in 1:nrow(target)){
        row=target[i,]
        result=do_algorithm(source,row)
    
        if(i>=2)cat("\n")
        dist=sqrt(sum((result$estimated-row)^2))
        cat(paste0("Target: ",row.names(row)," (d=",sub("^0","",sprintf("%.3f",dist)),")\n"))
    
        tb=table(result$pops)
        tb=sort(tb,decreasing=T)
        tb=as.matrix(100*tb/Nbatch)
        cat(paste(sprintf("%.1f",tb),row.names(tb)),sep="\n")
      }
    }
    If you save the script above as `~/bin/nmon.R`, you can use it via this shell wrapper:

    Code:
    $ nm()(grep -v ^, "$1">/tmp/nm1;grep -v ^, "$2">/tmp/nm2;Rscript -e 'source("~/bin/nmon.R");a=as.numeric(commandArgs(T));getMonte("/tmp/nm1","/tmp/nm2",Nbatch=a[1],Ncycles=a[2],pen=a[3])' "${3-500}" "${4-1000}" "${5-.001}")
    $ nm <(grep -E '^(FIN_Levanluhta_IA|RUS_Krasnoyarsk_BA|RUS_Maykop|UKR_MBA),' aas) <(grep Udmurt mas)
    Target: Udmurt (d=.025)
    54.8 FIN_Levanluhta_IA
    27.2 UKR_MBA
    9.8 RUS_Krasnoyarsk_BA
    8.2 RUS_Maykop
    You can run the original `nMonte3.R` with this shell wrapper:

    Code:
    nm3()(f()(awk -F, 'NR==1{for(i=1;i<NF;i++)printf",%s",i;print""}$1!=""' "$1");f "$1">/tmp/nm1;f "$2">/tmp/nm2;Rscript -e 'source("~/bin/nMonte3.R");a=as.numeric(commandArgs(T));getMonte("/tmp/nm1","/tmp/nm2",Nbatch=a[1],Ncycles=a[2],pen=a[3])' "${3-500}" "${4-1000}" "${5-.001}")
    Last edited by Nganasankhan; 03-31-2021 at 11:24 PM.

  8. The Following 4 Users Say Thank You to Nganasankhan For This Useful Post:

     dosas (04-03-2021),  Kaspias (04-04-2021),  sheepslayer (04-04-2021),  Tomenable (04-04-2021)

  9. #5
    Registered Users
    Posts
    92
    Sex
    Ethnicity
    Finnish

    This visualizes a 3-way model created with Vahaduo as a scatterplot:

    Code:
    library(tidyverse)
    library(ggrepel)
    
    t=read.table("tsv-table-copied-from-vahaduo",skip=1,row.names=1) # output of COPY AS TSV from MULTI tab
    t=t[-nrow(t),]
    t2=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",row.names=1,header=T,check.names=F) # modern averages scaled
    t3=read.csv("https://drive.google.com/uc?export=download&id=1F2rKEVtu8nWSm7qFhxPU6UESQNsmA-sl",row.names=1,header=T,check.names=F) # ancient averages scaled
    t4=rbind(t2,t3)
    t5=t4[row.names(t),]
    
    t$lab=paste0(row.names(t)," (",sub("^0","",sprintf("%.3f",t[,1])),")")
    nk=12
    t$k=cutree(hclust(dist(t5)),k=nk)
    
    ggplot(t,aes(x=V4,y=V5))+
    geom_hline(yintercept=0,color="gray70",size=.5)+
    geom_vline(xintercept=0,color="gray70",size=.5)+
    geom_point(aes(color=as.factor(k)),size=.7)+
    geom_polygon(data=t%>%group_by(k)%>%slice(chull(V4,V5)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.3)+
    geom_text_repel(aes(label=lab,color=as.factor(k)),size=2.3,max.overlaps=Inf,box.padding=.05,force=5,force_pull=2,segment.size=.3,segment.color="black",min.segment.length=.3)+
    labs(x="RUS_Krasnoyarsk_BA (%)",y="TUR_Barcin_N (%)")+
    ggtitle("3-way model: RUS_Krasnoyarsk_BA + TUR_Barcin_N + Baltic_LTU_BA")+
    coord_cartesian(ylim=c(-2,35),xlim=c(-4,101),expand=F)+
    scale_x_continuous(breaks=seq(-1000,1000,10))+
    scale_y_continuous(breaks=seq(-1000,1000,10))+
    scale_color_manual(values=hcl(seq(15,375,length=nk+1)[1:nk],90,50))+
    theme(
      aspect.ratio=1/sqrt(2),
      axis.text=element_text(color="black"),
      axis.ticks.length=unit(0,"pt"),
      axis.ticks.x=element_blank(),
      axis.ticks.y=element_blank(),
      axis.title=element_text(color="black"),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray70",fill=NA,size=.4),
      panel.grid.major=element_line(color="gray70",size=.2),
      panel.grid.minor=element_blank(),
      plot.title=element_text(size=11),
    )
    
    ggsave("a.png",width=7,height=7)
    system("mogrify -trim -bordercolor white -border 16x16 a.png")
    Result:



    Alternative version:

    Code:
    library(tidyverse)
    library(ggrepel)
    
    t=read.table("tsv-copied-from-multi-tab-in-vahaduo",header=T,row.names=1)
    t=t[-nrow(t),]
    t2=read.csv("all-lines-in-source-tab",row.names=1,header=F)[row.names(t),]
    
    t$lab=paste0(row.names(t)," (",sub("^0","",sprintf("%.3f",t[,1])),")")
    nk=12
    t$k=cutree(hclust(dist(t2)),k=nk)
    
    n=names(t)
    names(t)[2:3]=c("x","y")
    
    ggplot(t,aes(x,y))+
    geom_abline(linetype="dashed",color="gray70",size=.3)+
    geom_polygon(data=t%>%group_by(k)%>%slice(chull(x,y)),aes(color=as.factor(k),fill=as.factor(k)),alpha=.2,size=.3)+
    geom_point(aes(color=as.factor(k)),size=.7)+
    geom_text_repel(aes(label=lab,color=as.factor(k)),size=2.5,max.overlaps=Inf,box.padding=.05,segment.size=.3,force=5,segment.color="black",min.segment.length=.15)+
    # geom_text(aes(label=lab,color=as.factor(k)),size=2.5,vjust=-.7)+
    labs(x=paste(n[2],"(%)"),y=paste(n[3],"(%)"))+
    ggtitle(paste("3-way model:",n[2],"+",n[3],"+",n[4]))+
    coord_fixed()+
    scale_x_continuous(breaks=seq(-1000,1000,10),minor_breaks=seq(-1000,1000,1))+
    scale_y_continuous(breaks=seq(-1000,1000,10),minor_breaks=seq(-1000,1000,1))+
    scale_color_manual(values=hcl(seq(15,375,length=nk+1)[1:nk],90,50))+
    theme(
      axis.text=element_text(color="black",size=7),
      axis.ticks.length=unit(0,"pt"),
      axis.ticks.x=element_blank(),
      axis.ticks.y=element_blank(),
      axis.title=element_text(color="black",size=9),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray70",fill=NA,size=.5),
      panel.grid.major=element_line(color="gray80",size=.3),
      panel.grid.minor=element_line(color="gray90",size=.2),
      plot.title=element_text(size=11),
    )
    
    ggsave("a.png",width=7,height=6.2)
    system("mogrify -trim -bordercolor white -border 16x16 a.png")
    Last edited by Nganasankhan; 04-03-2021 at 10:27 PM.

  10. The Following 5 Users Say Thank You to Nganasankhan For This Useful Post:

     discreetmaverick (04-03-2021),  dosas (04-03-2021),  Kaspias (04-04-2021),  sheepslayer (04-04-2021),  Tomenable (04-04-2021)

  11. #6
    Administrator
    Posts
    2,348
    Sex
    Y-DNA (P)
    DF27
    mtDNA (M)
    V33

    New Zealand Croatia Star of David Ireland Scotland Poland
    Quote Originally Posted by Standardized Ape View Post
    This R script is for linear interpolation. It's most useful when you have the coordinates of a parent and a child so you can use it to generate the coordinates of the missing parent but you can use it any way you like. In this case I'll be using it to remove steppe admixture from Swedes. The quality of the generated coordinates will depend on the magnitude of difference so the quality will be much higher if you do it for members of the same population.

    Code:
    left <- "Yamnaya_UKR,0.1160995,0.0878435,0.0465745,0.106106,-0.030467,0.038766,0.0074025,-0.0026535,-0.0502105,-0.070252,-0.002923,0.003297,-0.002453,-0.019198,0.0311475,0.013988,-0.0009775,-0.0118455,-0.0036455,0.001313,-0.001248,-0.001175,-0.003636,0.0211475,0.001856"
    middle <- "Swedish,0.1340007,0.1283722,0.0715329,0.0557909,0.0414202,0.0213225,0.006847,0.0089682,0.0046947,-0.0067261,-0.004451,0.001635,-0.0048315,-0.0036907,0.0150773,0.0039838,-0.0074379,0.002004,0.003851,0.0046385,0.0064205,0.0021357,0.0009244,0.0106969,-0.0002994"
    amount <- 2
    left <- c(strsplit(left,","))
    middle <- c(strsplit(middle,","))
    right <- "Result"
    
    for(i in 2:26)
    {
    a <- as.numeric(left[[1]][i])
    b <- as.numeric(middle[[1]][i])
    c <- ((1-amount)*a+amount*b)
    right <- paste(right,c,sep=",") 
    }
    noquote(right)
    Target: Swedish
    Distance: 0.0000% / 0.00000000
    50.0 Result
    50.0 Yamnaya_UKR

    Distance to: Result
    0.08161614 POL_Globular_Amphora
    0.08334166 UKR_Globular_Amphora
    0.08383578 SWE_Megalithic_Ansarve
    0.08964459 FRA_Occitanie_LN

    You don't need to download R to use it just copypaste it here
    https://rdrr.io/snippets/
    I attempted to generate coordinates for my mother using mine and my fathers, which seems to generate something somewhat realistic.

    G25 Modern Scaled

    Target: BKMother
    Distance: 2.5032% / 0.02503211 | ADC: 0.25x RC
    65.4 Shetlandic
    19.8 Basque_French
    6.0 Spanish_Asturias
    4.0 English_Cornwall
    3.0 French_Paris
    1.8 Papuan

    G25 Modern Unscaled

    Target: BKMother
    Distance: 1.4035% / 0.01403466 | ADC: 0.25x RC
    67.6 Shetlandic
    22.4 Spanish_Asturias
    4.2 English
    1.8 Papuan
    1.6 Gupta
    1.2 Ju_hoan_North
    1.2 She

    EDIT: I've gone and made this a Sticky thread, so its easier to find in future for others looking for scripts and resources.
    Ancestry on paper: English, Scottish, Irish, Welsh, Croatian, Ashkenazi, Polish and Māori.

  12. The Following 4 Users Say Thank You to BalkanKiwi For This Useful Post:

     dosas (04-03-2021),  Kaspias (04-04-2021),  Nganasankhan (04-03-2021),  sheepslayer (04-04-2021)

  13. #7
    Registered Users
    Posts
    92
    Sex
    Ethnicity
    Finnish

    This downloads the datasheet for scaled modern population averages. It then selects the 100 populations closest to Saami and the 100 populations closest to Sardinians, and it shows distance to Saami on the y-axis and distance to Sardinians on the x-axis. The clustering is based on hierarchical clustering using the complete linkage method (farthest neighbor clustering).

    Code:
    library(tidyverse)
    
    x="Sardinian"
    y="Saami"
    
    t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",header=T,row.names=1)
    
    d=as.matrix(dist(t),upper=T)
    head=unique(names(c(head(sort(d[,x]),100),head(sort(d[,y]),100))))
    xy=as.data.frame(d[head,c(x,y)])
    
    nk=16
    k=cutree(hclust(dist(t[head,])),k=nk)
    names(xy)=c("x","y")
    xy$k=k
    
    ggplot(xy,aes(x,y))+
    geom_hline(yintercept=0,color="gray80",size=.3)+
    geom_vline(xintercept=0,color="gray80",size=.3)+
    geom_abline(linetype="dashed",color="gray80",size=.3)+
    geom_polygon(data=xy%>%group_by(k)%>%slice(chull(x,y)),aes(color=as.factor(k),fill=as.factor(k)),alpha=.2,size=.3)+
    geom_point(aes(color=as.factor(k)),size=.5)+
    geom_text(label=rownames(xy),aes(color=as.factor(k)),size=2.2,vjust=-.5)+
    xlab(paste("Distance to",x))+
    ylab(paste("Distance to",y))+
    coord_fixed(expand=F,xlim=c(-.015,.445),ylim=c(-.015,.305))+
    scale_x_continuous(breaks=seq(-1,1,.1),minor_breaks=seq(-1,1,.01))+
    scale_y_continuous(breaks=seq(-1,1,.1),minor_breaks=seq(-1,1,.01))+
    scale_color_manual(values=hcl(seq(15,375,length=nk+1)[1:nk],90,50))+
    theme(
      axis.text=element_text(color="black"),
      axis.ticks.length=unit(0,"pt"),
      axis.ticks.x=element_blank(),
      axis.ticks.y=element_blank(),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.grid.major=element_line(color="gray80",size=.25),
      panel.grid.minor=element_line(color="gray90",size=.15)
    )
    
    ggsave("a.png",width=10,height=10)
    system("mogrify -trim -bordercolor white -border 16x16 a.png")


    This finds the closest populations to Saami:

    Code:
    > t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",header=T,row.names=1)
    > d=as.matrix(dist(t))
    > h=head(sort(d["Saami",]),8)
    > cat(paste(sprintf("%.3f",h),names(h)),sep="\n")
    0.000 Saami
    0.052 Saami_Kola
    0.052 Udmurt
    0.060 Besermyan
    0.071 Chuvash
    0.074 Komi
    0.082 Tatar_Kazan
    0.095 Mari
    This sorts populations based on their distance to Saami divided by their distance to Sardinians:

    Code:
    > t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",header=T,row.names=1)
    > d=as.matrix(dist(t))
    > d1=d["Saami",]
    > d2=d["Sardinian",]
    > d3=cbind(d1/d2,d1,d2)
    > d3=d3[order(d3[,1]),]
    > head(round(d3,3),8)
                        d1    d2
    Saami      0.000 0.000 0.296
    Udmurt     0.183 0.052 0.285
    Saami_Kola 0.198 0.052 0.261
    Besermyan  0.226 0.060 0.267
    Chuvash    0.258 0.071 0.276
    Mari       0.298 0.095 0.320
    Komi       0.303 0.074 0.243
    Bashkir    0.343 0.112 0.326
    Same using shells:

    Code:
    $ curl 'https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y' -Lso mas
    $ dist2()(awk -F, 'NR==FNR{for(i=2;i<=NF;i++)a[i]=$i;next}$1{s=0;for(i=2;i<=NF;i++)s+=($i-a[i])^2;print s^.5,$1}' "$2" "$1")
    $ paste -d' ' <(dist2 mas <(grep Saami, mas)) <(dist2 mas <(grep Sardinian mas))|awk '$3{print$1/$3,$0}'|sort -n|awk '{printf"%.3f %.3f %.3f %s\n",$1,$2,$4,$3}'|head -n8
    0.000 0.000 0.296 Saami
    0.183 0.052 0.285 Udmurt
    0.198 0.052 0.261 Saami_Kola
    0.226 0.060 0.267 Besermyan
    0.258 0.071 0.276 Chuvash
    0.298 0.095 0.320 Mari
    0.303 0.074 0.243 Komi
    0.343 0.112 0.326 Bashkir
    Last edited by Nganasankhan; 04-04-2021 at 12:54 AM.

  14. The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

     Kaspias (04-04-2021),  sheepslayer (04-04-2021)

  15. #8
    Registered Users
    Posts
    424
    Sex
    Location
    USA
    Ethnicity
    Pothwari/Punjabi
    Nationality
    American
    Y-DNA (P)
    R-Y7
    mtDNA (M)
    R30b

    United States of America Pakistan
    Quote Originally Posted by Standardized Ape View Post
    This R script is for linear interpolation. It's most useful when you have the coordinates of a parent and a child so you can use it to generate the coordinates of the missing parent but you can use it any way you like. In this case I'll be using it to remove steppe admixture from Swedes. The quality of the generated coordinates will depend on the magnitude of difference so the quality will be much higher if you do it for members of the same population.

    Code:
    left <- "Yamnaya_UKR,0.1160995,0.0878435,0.0465745,0.106106,-0.030467,0.038766,0.0074025,-0.0026535,-0.0502105,-0.070252,-0.002923,0.003297,-0.002453,-0.019198,0.0311475,0.013988,-0.0009775,-0.0118455,-0.0036455,0.001313,-0.001248,-0.001175,-0.003636,0.0211475,0.001856"
    middle <- "Swedish,0.1340007,0.1283722,0.0715329,0.0557909,0.0414202,0.0213225,0.006847,0.0089682,0.0046947,-0.0067261,-0.004451,0.001635,-0.0048315,-0.0036907,0.0150773,0.0039838,-0.0074379,0.002004,0.003851,0.0046385,0.0064205,0.0021357,0.0009244,0.0106969,-0.0002994"
    amount <- 2
    left <- c(strsplit(left,","))
    middle <- c(strsplit(middle,","))
    right <- "Result"
    
    for(i in 2:26)
    {
    a <- as.numeric(left[[1]][i])
    b <- as.numeric(middle[[1]][i])
    c <- ((1-amount)*a+amount*b)
    right <- paste(right,c,sep=",") 
    }
    noquote(right)
    Target: Swedish
    Distance: 0.0000% / 0.00000000
    50.0 Result
    50.0 Yamnaya_UKR

    Distance to: Result
    0.08161614 POL_Globular_Amphora
    0.08334166 UKR_Globular_Amphora
    0.08383578 SWE_Megalithic_Ansarve
    0.08964459 FRA_Occitanie_LN

    You don't need to download R to use it just copypaste it here
    https://rdrr.io/snippets/
    Interesting script, basically it generated an output similar to what I had predicted for my dad.
    Distance to: Dad_Sim_Scaled
    0.02356518 Khatri
    0.02447416 Kohistani
    0.02690789 Punjabi_Sikh_India
    0.02712940 Gujar_Pakistan
    0.02718107 Sindhi
    0.02925271 Kashmiri_Pandit

    Target: Dad_Sim_Scaled
    Distance: 1.6573% / 0.01657250 | R3P
    96.4 Khatri
    2.0 Evenk
    1.6 Bantu_S.E.

    I had previously used moms coordinates to estimate ancestral source of my dad.
    Target: Kapisa_scaled
    Distance: 1.1112% / 0.01111196 | R2P
    65.8 Kapisa_Mom_scaled
    34.2 Kohistani

    @Standardized Ape: is it possible to modulate the distance between the 'middle' set of coordinates and the 'right' set of coordinates being generated?
    Target: Kapisa_scaled
    Distance: 1.7453% / 0.01745296
    44.4 IRN_Shahr_I_Sokhta_BA2_I8728
    35.0 UZB_Bustan_BA
    18.0 KAZ_Ak_Moustafa_MLBA1
    2.6 MNG_East_N

    Mitochondrial Haplogroup: R30b found in Shahr-I-Sokhta_BA1 subclades found in Swat IA:Leobanr, Katelai & Barikot Historic.

  16. The Following User Says Thank You to Kapisa For This Useful Post:

     BalkanKiwi (04-03-2021)

  17. #9
    Registered Users
    Posts
    92
    Sex
    Ethnicity
    Finnish

    This visualizes the values of eigenvectors as a heatmap. The clustering is based on hierarchical clustering using the complete linkage method.

    Code:
    library(pheatmap)
    library(vegan) # for reorder.hclust()
    library(colorspace) # for hex()
    
    t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",row.names=1,header=T,check.names=F) # modern averages scaled
    pop=c("Finnish","German","Han_Shanghai","Igbo","Ju_hoan_North","Karitiana","Mansi","Mari","Mbuti","Nganassan","Onge","Papuan","Pashtun","Saami","Somali","Yoruba")
    t=t[row.names(t)%in%pop,]
    
    d=as.matrix(dist(t))
    sort=reorder(hclust(dist(t)),wts=d["Ju_hoan_North",]-d["Han_Shanghai",])
    
    pheatmap(
      t,
      filename="a.png",
      legend=F,
      cluster_cols=F,
      clustering_callback=function(...){sort}, # comment out to disable sorting clusters manually
      treeheight_row=80,
      cellwidth=16,
      cellheight=16,
      fontsize=8,
      border_color=NA,
      display_numbers=T,
      number_format="%.2f",
      fontsize_number=7,
      number_color="black",
      breaks=seq(-.6,.6,1.2/256),
      colorRampPalette(hex(HSV(c(210,210,0,0,0),c(.6,.3,0,.3,.6),1)))(256)
    )


    You can do the same thing with this web UI: https://biit.cs.ut.ee/clustvis/. Go to the "Data import" tab, select "Paste data", and paste lines from a datasheet with a header column. Then go to the "Data pre-processing" tab, uncheck "row centering", and change "Row scaling" from "unit variance scaling" to "no scaling". Then go to the "Heatmap" tab, select "change data options", set "Clustering distance for columns" to "no clustering", set "Clustering distance for rows" to "Euclidean", and set "Clustering method for rows" to "complete". Then select "change display options", set "Heatmap color range minimum" to -.6, and set "Heatmap color range maximum" to .6.
    Last edited by Nganasankhan; 04-04-2021 at 12:01 AM.

  18. The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

     Kaspias (04-04-2021),  sheepslayer (04-04-2021)

  19. #10
    Registered Users
    Posts
    92
    Sex
    Ethnicity
    Finnish

    The scripts in this post visualize a CSV file copied from Vahaduo's MULTI tab as a clustered heatmap or as a stacked bar chart.

    Code:
    library(pheatmap)
    library(colorspace) # for hex()
    library(vegan) # for reorder.hclust()
    
    t=read.csv("csv-table-copied-from-multi-tab",header=T,row.names=1,check.names=F)
    ad=t[nrow(t),1]
    t=t[-nrow(t),]
    
    t2=read.csv("all-lines-in-target-tab",header=F,row.names=1,check.names=F)
    t3=t2[row.names(t2)%in%row.names(t),]
    d=as.matrix(dist(t3))
    sort=reorder(hclust(dist(t3)),wts=d["Turkish_Central",]-d["Nganassan",])
    
    row.names(t)=paste0(row.names(t)," (",sub("^0","",sprintf("%.3f",t[,1])),")")
    t=t[-1]
    
    pheatmap(
      t,
      filename="a.png",
      clustering_callback=function(...){sort},
      cluster_cols=F,
      legend=F,
      main=paste("Average distance:",sub("^0","",sprintf("%.3f",ad))),
      treeheight_row=80,
      cellwidth=16,
      cellheight=16,
      fontsize=10,
      border_color=NA,
      display_numbers=T,
      number_format="%.0f",
      fontsize_number=8,
      number_color="black",
      breaks=seq(0,100,100/256),
      colorRampPalette(hex(HSV(c(210,210,90,60,40,20,0),c(0,.4,.6,.6,.6,.6,.6),c(1,1,1,1,1,1,1))))(256)
    )


    Code:
    library(tidyverse)
    library(reshape2) # for melt()
    library(colorspace) # for hex()
    
    t=read.csv("csv-table-copied-from-multi-tab",header=T,check.names=F)
    
    ad=t[nrow(t),2]
    t=t[-nrow(t),]
    
    name=paste0(t[,1]," (",sub("^0","",sprintf("%.3f",t[,2])),")")
    t=cbind(name,t[,3:ncol(t)]/100)
    
    t=t[order(-2*t[,2]-t[,3]+.5*t[,6]+2*t[,7]+2*t[,8]),]
    
    t2=melt(t) # this uses melt() because pivot_longer() doesn't preserve the order of columns
    lab=round(100*t2$value)
    lab[lab==0]=""
    t2$lab=lab
    
    p=ggplot(t2,aes(x=fct_rev(factor(name,level=t$name)),y=value,fill=variable))+
    geom_bar(stat="identity",width=1,position=position_fill(reverse=T))+
    geom_text(aes(label=lab),position=position_stack(vjust=.5,reverse=T),size=3.5)+
    ggtitle(paste("Average distance:",sub("^0","",sprintf("%.3f",ad))))+
    coord_flip()+
    scale_x_discrete(expand=c(0,0))+ # remove small gray area above and below bars
    scale_y_discrete(expand=c(0,0))+ # remove small gray area on left and right side of bars
    scale_fill_manual("legend",values=hex(HSV(c(20,40,190,220,120,270,300),.5,1)))+
    guides(fill=guide_legend(ncol=3,byrow=T))+
    theme(
      axis.text.x=element_blank(),
      axis.text=element_text(color="black"),
      axis.ticks=element_blank(),
      axis.title.x=element_blank(),
      axis.title.y=element_blank(),
      legend.direction="horizontal",
      legend.text=element_text(size=12),
      legend.title=element_blank(),
      legend.key.size=unit(.2,"in"), # size of color squares
      legend.spacing.y=unit(.02,"in"), # vertical space between color squares
      plot.title=element_text(size=14),
      text=element_text(size=16)
    )
    
    ggdraw(p)
    leg=get_legend(p)
    ggdraw(plot_grid(p+theme(legend.position="none"),leg,ncol=1,rel_heights=c(1,.12)))
    ggsave("a.png",width=7,height=7)

  20. The Following 6 Users Say Thank You to Nganasankhan For This Useful Post:

     agent_lime (04-07-2021),  David Bush (04-11-2021),  dosas (04-05-2021),  Kaspias (04-04-2021),  perttimp (04-11-2021),  sheepslayer (04-04-2021)

Page 1 of 4 123 ... LastLast

Similar Threads

  1. What Surnames Are You Working On?
    By msmarjoribanks in forum Genealogy
    Replies: 71
    Last Post: 03-09-2021, 08:30 AM
  2. Replies: 6
    Last Post: 09-27-2019, 10:31 PM
  3. Ancestry's DNA link not working.
    By rdegnen in forum AncestryDNA
    Replies: 4
    Last Post: 01-29-2019, 02:04 AM
  4. Replies: 3
    Last Post: 02-06-2016, 04:44 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •