Page 4 of 4 FirstFirst ... 234
Results 31 to 38 of 38

Thread: Shell and R scripts for SmartPCA and ADMIXTURE

  1. #31
    Registered Users
    Posts
    341
    Sex
    Ethnicity
    Finnish

    Quote Originally Posted by bce View Post
    The comb ceramic samples are called Latvia_HG and Estonia_HG in g25 i think
    Actually I think some of the samples that are labeled Baltic_LVA_MN on G25 might be from the Comb Ceramic culture, or at least two of the samples got about 90% EHG in a WHG-EHG model with a relatively good distance. But among the samples that can be modeled as Loschbour + Karelia_HG with a distance below .1, there are only two Estonian samples, which are both from the Narva culture:

    Code:
    $ curl -Ls 'https://drive.google.com/uc?export=download&id=1UrhcfNMLW0oMXIbHGUE60v2taCM7PFw1'>ais
    $ way2()(awk -F, 'NR==1{for(i=2;i<=NF;i++)a[i]=$i}NR==2{for(i=2;i<=NF;i++)b[i]=$i}NR>2&&$1!=""{min=-1;for(r=0;r<=100;r+=1){s=0;for(i=2;i<=NF;i++)s+=($i-((1-r/100)*a[i]+(r/100)*b[i]))^2;s=s^.5;if(min==-1||s<min){min=s;minr=r}}minp=sprintf("%.3f",min);sub("^0","",minp);print minp,minr,$1}' "[email protected]")
    $ way2 <(grep Losch ais;grep Karelia_HG:UzOO77 ais) <(grep Baltic ais)|awk '$1<.1'|sort -rnk2
    .037 92 Baltic_LVA_MN:I4435
    .036 89 Baltic_LVA_MN:I4554
    .040 74 Baltic_LVA_MN:I4436
    .042 56 Baltic_LVA_HG:I4630
    .056 53 Baltic_EST_Narva:Veibri4
    .058 43 Baltic_EST_Narva:Kivisaare3
    .037 43 Baltic_LVA_HG:I4551
    .069 42 Baltic_LTU_Narva:Donkalnis7
    .047 42 Baltic_LVA_HG:I4441
    .046 42 Baltic_LVA_MN:I4437
    .049 40 Baltic_LVA_MN:I4627
    .058 39 Baltic_LVA_HG:I4628
    .047 39 Baltic_LVA_HG:I4438
    .060 38 Baltic_LVA_HG:I4440
    .051 38 Baltic_LVA_HG:I4632
    .049 37 Baltic_LVA_HG:I4434
    .047 37 Baltic_LVA_HG:I4552
    .065 36 Baltic_LTU_Narva:Kretuonas1
    .046 35 Baltic_LVA_HG:I4595
    .049 34 Baltic_LVA_HG:I4553
    .048 34 Baltic_LVA_HG:I4432
    .046 34 Baltic_LVA_HG:I4596
    .064 33 Baltic_LTU_Narva:Kretuonas4
    .063 32 Baltic_LTU_Narva:Kretuonas2
    .040 31 Baltic_LVA_HG:I4439
    .059 29 Baltic_LVA_HG:I4626
    .044 27 Baltic_LVA_HG:I4550
    .057 23 Baltic_LTU_Narva:Spiginas1
    .052 22 Baltic_LTU_Narva:Donkalnis6
    .047 22 Baltic_LTU_Meso:Spiginas4
    The two samples that got about 90% EHG above have the label Latvia_MN_o3 in the Reich dataset, and the sample with 74% EHG has the label Latvia_MN_o2:

    Code:
    $ curl https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_HO_public.anno -Lso ho.anno
    $ gi()(awk 'NR==FNR{a[tolower($0)];next}{for(i in a)if(index(tolower($0),i)){print;next}}' <(printf %s\\n "[email protected]") -)
    $ copt()(awk -F\\t 'NR==FNR{a[NR]=$0;next}{for(i in a)printf"%s"(i==length(a)?"\n":FS),$(a[i])}' <(printf %s\\n "[email protected]") -)
    $ tab()(awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf("%-"(b[i]+n)"s",a[h][i]);print""}}' "n=${2-1}" FS="${1-$'\t'}")
    $ ho()(sed 1d ho.anno|gi "[email protected]"|copt 8 2 15 6 11 12 4|tab)
    $ ho ccc combce latvia_mn
    Latvia_MN_o1.SG          Latvia_MN1.SG 85977  6071 56.2833 25.1333 JonesCurrentBiology2017
    Latvia_MN_Comb_Ware.SG   Latvia_MN2.SG 392395 6043 56.2833 25.1333 JonesCurrentBiology2017
    Latvia_MN_o3             I4435         399976 5972 56.2833 25.1333 MathiesonNature2018
    Latvia_MN_o2             I4436         401778 6109 56.2833 25.1333 MathiesonNature2018
    Latvia_MN                I4437         400509 6164 56.2833 25.1333 MathiesonNature2018
    Latvia_MN_o3             I4554         394577 5946 56.2833 25.1333 MathiesonNature2018
    Latvia_MN                I4627         367460 6071 56.2833 25.1333 MathiesonNature2018
    Estonia_N_CombCeramic.SG MA974.SG      30001  5586 59.45   28.08   SaagCurrentBiology2017
    Estonia_N_CombCeramic.SG MA975.SG      58972  5488 59.45   28.08   SaagCurrentBiology2017
    Estonia_MN_CCC_1         Tamula1       85935  5407 57.505  26.584  MittnikNatureCommunications2018
    Estonia_MN_CCC_2         Tamula3       81193  5666 57.505  26.584  MittnikNatureCommunications2018
    The samples labeled Latvia_MN are from the Zvejnieki burial site, and two of the samples were initially published in Jones et al. 2017 ("The Neolithic Transition in the Baltic Was Not Driven by Admixture with Early European Farmers"), but more samples were later published in Mathieson et al. 2018 ("The genomic history of southeastern Europe"):

    Code:
    $ curl https://haplogroup.info/all-ancient-dna.txt|iconv -f iso-8859-1 -t utf-8|sed 1d|sort -t$'\t' -nk48|grep Latvia_MN|cut -f 1,2,3,4,5,8,21,45,48,52,54,55|tr -d \"|tr \; _|tr \\t \;
    I4437;ZVEJ15;56.2833;25.1333;F;U5a1d2;;MathiesonNature2018;-4214;Baltic_HG;Latvia_MN;Zvejnieki
    I4436;ZVEJ14;56.2833;25.1333;M;U4a1;R1b1a(xR1b1a2);MathiesonNature2018;-4159;Baltic_HG;Latvia_MN_o2;Zvejnieki
    I4627;Latvia_MN1.SG, ZVEJ26;56.2833;25.1333;M;U4a1;R1b1a(xR1b1a2);MathiesonNature2018 (New data_ Individual first published in JonesCurrBiol2017);-4121;Baltic_MN;Latvia_MN;Zvejnieki
    I4631;ZVEJ31;56.2833;25.1333;F;U4;;JonesCurrBiol2017;-4093;Baltic_MN;Latvia_MN_Comb_Ware.SG;Zvejnieki
    I4435;ZVEJ13;56.2833;25.1333;F;U4a1;;MathiesonNature2018;-4022;Baltic_MN;Latvia_MN_o3;Zvejnieki
    I4554;ZVEJ24;56.2833;25.1333;F;U4d1;;MathiesonNature2018;-3996;Baltic_HG;Latvia_MN_o3;Zvejnieki
    In Jones et al. 2017, Latvia_MN2 was associated with the Comb Ceramic culture, and it had much higher EHG ancestry than Latvia_MN1 which was only slightly older:

    Next we sampled two Middle Neolithic individuals, Latvia_MN1 (6,201–5,926 cal BP), from an isolated grave located among burials from earlier periods, and Latvia_MN2 (6,179–5,750 cal BP), who was interred in a collective burial with five other individuals. During the Middle Neolithic at Zvejnieki, mortuary practices from the preceding periods were partially maintained, but some new features appeared, including collective burials and votive deposits, which are associated with the Comb Ware culture or its influences in the Baltic [21]. Despite having been roughly contemporaneous, these Middle Neolithic samples cluster in different regions of our PCA plot (Figure 2A) and have distinct profiles in ADMIXTURE analysis (Figure 2B). In both analyses, Latvia_MN1 groups with the Mesolithic Latvian samples, suggesting a degree of continuity across the Mesolithic-Neolithic transition in this region and consistent with suggestions that the eastern Baltic was a genetic refugium for hunter-gatherer populations during the Neolithic period [28]. The persistence of hunter-gatherer ancestry in the Baltic until at least the Middle Neolithic also provides a possible source for the resurgence of hunter-gatherer ancestry that is proposed to have occurred in central Europe from 7,000–5,000 cal BP [1]. In contrast, Latvia_MN2 is placed toward EHG in PCA space and has several components in ADMIXTURE analysis that are found in Native Americans, Siberians, and hunter-gatherer samples from the Caucasus. In keeping with these results, we found that there has been a northern Eurasian influence in the Baltic region since the Mesolithic period, as suggested by significantly positive statistics for the test D(Mbuti, X; Latvia Mesolithic, Latvia_MN2) when X was an EHG, modern and ancient Siberian (including the Upper Palaeolithic Mal’ta genome [29]), or Native American (Table 2). This influence is supported archaeologically by the appearance of copper rings and amber jewelry in Middle Neolithic collective burials that bear similarities to artifacts found in Estonia, Finland, and northwestern Russia [21, 30].
    Last edited by Nganasankhan; 10-06-2021 at 11:08 PM.

  2. #32
    Registered Users
    Posts
    341
    Sex
    Ethnicity
    Finnish

    I think I'll start using the `--pca` option of PLINK instead of SmartPCA, because it's about two orders of magnitude faster, but the results don't seem to be much different:



    In PLINK 2, the `--pca` option is even faster than in PLINK 1.9. For example the dataset in the plots above consisted of 863 samples and 120,108 SNPs. Generating a PCA for the dataset took 169 seconds with SmartPCA, 2.8 seconds with PLINK 1.9, and 1.2 seconds with PLINK 2.

    BTW an easy way to exclude related samples in PLINK 2 is to use `--king-cutoff` (https://www.cog-genomics.org/plink/2...ce#king_cutoff): "If used in conjunction with a later calculation (see the order of operations page for details), `--king-cutoff` excludes one member of each pair of samples with kinship coefficient greater than the given threshold. (See above for threshold suggestions.) Alternatively, you can invoke this on its own to write a pruned list of sample IDs to `plink2.king.cutoff.in.id`, and excluded IDs to `plink2.king.cutoff.out.id`."

    In the plot on the right below, I used a fairly extreme setting for `--king-cutoff`, which removed many Nganasan samples, so PC2 ended up being less dominated by Nganasans. However it also removed 4 out of 5 Forest Yukaghir samples:


  3. The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

     bce (10-08-2021),  David Bush (12-23-2021)

  4. #33
    Registered Users
    Posts
    341
    Sex
    Ethnicity
    Finnish

    Another way to estimate the percentage of East Eurasian ancestry of different populations is to do a PCA of Eurasian samples, where PC1 differentiates between West Eurasians and East Eurasians. Then do min-max scaling on the values of PC1, so the smallest value becomes 0 and the largest value becomes 1:

    Code:
    > t=read.table("eurasia.p.eigenvec")
    > pop=t[,1]
    > pc=t[,3]
    > mima=(pc-min(pc))/(max(pc)-min(pc))
    > min=aggregate(mima,list(pop),min)
    > max=aggregate(mima,list(pop),max)
    > mean=aggregate(mima,list(pop),mean)
    > n=table(pop)
    > cat(sprintf("%.0f (%.0f-%.0f) %s (n=%i)",100*mean$x,100*min$x,100*max$x,names(n),n)[order(mean$x)],sep="\n")
    2 (0-3) Sardinian (n=27)
    4 (2-6) Basque (n=29)
    4 (3-5) Spanish_North (n=5)
    4 (3-6) Cardona_Italian (n=13)
    4 (3-6) Italian_North (n=20)
    4 (2-6) Spanish (n=173)
    4 (4-5) Italian_South (n=4)
    5 (4-5) Cypriot (n=8)
    5 (4-6) English (n=10)
    5 (4-6) Albanian (n=6)
    5 (4-7) Greek (n=20)
    5 (3-7) French (n=61)
    5 (4-7) Maltese (n=8)
    5 (3-7) Sicilian (n=11)
    5 (4-7) Lebanese_Christian (n=9)
    5 (4-7) Croatian (n=10)
    5 (4-6) Jew_Turkish (n=8)
    5 (4-7) Cardona_German (n=22)
    6 (4-7) Moldavian (n=9)
    6 (6-6) Tambets_Gagauzes (n=1)
    6 (5-7) Bulgarian (n=9)
    6 (6-6) Tambets_Germans (n=1)
    6 (5-6) Scottish (n=4)
    6 (5-7) Orcadian (n=12)
    6 (5-7) Jew_Libyan (n=9)
    6 (5-7) Romanian (n=10)
    6 (5-7) Druze (n=35)
    6 (5-7) Armenian_Hemsheni (n=8)
    6 (4-8) Gagauz (n=9)
    6 (5-8) Czech (n=10)
    6 (5-7) Jew_Iraqi (n=6)
    6 (5-7) Jew_Ashkenazi (n=7)
    6 (6-7) Icelandic (n=12)
    6 (5-7) BedouinB (n=19)
    6 (4-10) Cardona_British (n=19)
    6 (5-8) Armenian (n=10)
    6 (4-7) Jew_Yemenite (n=8)
    6 (5-7) Norwegian (n=11)
    6 (5-11) Georgian (n=22)
    7 (6-8) Lithuanian (n=10)
    7 (6-8) Saudi (n=8)
    7 (6-8) Hungarian (n=20)
    7 (6-8) Jew_Iranian (n=9)
    7 (5-9) Assyrian (n=11)
    7 (6-9) Lebanese_Muslim (n=11)
    7 (7-7) Kabardinian_outlier (n=1)
    7 (6-10) Jew_Georgian (n=7)
    7 (6-11) Yemeni_Northwest (n=27)
    7 (6-9) Belarusian (n=10)
    8 (6-9) Palestinian (n=31)
    8 (5-10) Yemeni_Desert (n=16)
    8 (6-8) Ukrainian (n=13)
    8 (6-9) Ukrainian_North (n=8)
    8 (6-13) Yemeni_Highlands (n=28)
    8 (5-11) Lebanese (n=8)
    8 (8-8) Avar_outlier1 (n=1)
    8 (8-8) Tambets_Latvians (n=1)
    8 (7-9) Abkhasian (n=9)
    8 (6-11) Jordanian (n=9)
    8 (8-8) Abazin_outlier (n=2)
    8 (4-11) Syrian (n=8)
    9 (7-11) Estonian (n=10)
    9 (6-10) BedouinA (n=24)
    9 (9-9) Avar_outlier2 (n=1)
    9 (9-9) Tambets_Swedes (n=1)
    9 (9-11) Kubachinian (n=6)
    9 (9-10) Ezid (n=7)
    10 (8-12) Kurd (n=8)
    10 (7-12) Tambets_Estonians (n=36)
    10 (8-11) Kaitag (n=8)
    10 (10-10) Yemeni_Highlands_Raymah (n=1)
    10 (8-12) Cardona_Avar (n=27)
    10 (9-11) Lezgin (n=9)
    10 (8-11) Tabasaran (n=10)
    10 (10-11) Darginian (n=8)
    10 (8-16) Tambets_Poles (n=5)
    10 (6-18) Russian (n=71)
    11 (9-12) Chechen (n=9)
    11 (9-12) Avar (n=8)
    11 (9-12) Lak (n=10)
    11 (8-16) Adygei (n=31)
    11 (10-13) Ingushian (n=9)
    12 (9-18) Cardona_Iranian (n=26)
    12 (9-15) Iranian (n=38)
    12 (10-13) Tambets_Russians_Central (n=2)
    12 (5-20) Turkish (n=50)
    12 (7-19) Cardona_Turk (n=13)
    12 (7-14) Kumyk (n=8)
    13 (10-15) Tambets_Ingrian (n=6)
    13 (11-14) Finnish (n=8)
    13 (12-14) Tambets_Karels (n=2)
    13 (11-16) Azeri (n=15)
    13 (10-15) Ossetian (n=16)
    13 (9-17) Tambets_Finns (n=19)
    13 (10-15) Mordovian (n=32)
    13 (11-15) Circassian (n=9)
    14 (12-16) Balkar (n=10)
    14 (12-15) Cardona_Mordva (n=6)
    14 (11-16) Karelian (n=15)
    14 (13-16) Russian_Archangelsk_Krasnoborsky (n=6)
    15 (13-18) Karachai (n=11)
    15 (13-18) Kabardinian (n=9)
    15 (14-16) Tambets_Vepsas (n=2)
    15 (14-17) Abazin (n=8)
    15 (14-17) Veps (n=9)
    15 (13-17) Iranian_Bandari (n=8)
    16 (16-17) Russian_Archangelsk_Pinezhsky (n=5)
    17 (16-20) Makrani (n=19)
    18 (17-19) Turkish_Balikesir (n=5)
    19 (17-22) Brahui (n=21)
    19 (17-24) Balochi (n=20)
    20 (19-20) Russian_Archangelsk_Leshukonsky (n=4)
    21 (19-23) Tatar_Mishar (n=9)
    22 (20-23) Kalash (n=15)
    23 (20-27) Pathan (n=17)
    24 (19-30) Tajik (n=31)
    24 (19-28) Tambets_Saami_Kola (n=7)
    25 (19-30) Tambets_Tatars (n=2)
    25 (22-27) Tatar_Kazan (n=13)
    26 (24-28) Sindhi_Pakistan (n=14)
    27 (18-29) Chuvash (n=15)
    27 (26-28) GujaratiA (n=4)
    27 (27-27) Turkmen_outlier (n=1)
    27 (19-33) Jew_Cochin (n=5)
    28 (27-29) Besermyan (n=6)
    28 (23-36) Nogai_Karachay_Cherkessia (n=9)
    30 (28-33) GujaratiB (n=5)
    31 (28-34) Udmurt (n=10)
    32 (29-34) Burusho (n=23)
    32 (27-35) Tambets_Saami_SWE (n=11)
    33 (32-35) GujaratiC (n=5)
    33 (32-35) Cardona_Mari (n=4)
    34 (15-46) Cardona_Turkmen (n=20)
    35 (34-35) GujaratiD (n=5)
    35 (27-40) Cardona_Tajik (n=11)
    36 (30-38) Punjabi (n=8)
    36 (11-59) Tambets_Mansi (n=23)
    36 (15-75) Cardona_Komi (n=12)
    36 (24-47) Bashkir (n=52)
    38 (32-46) Turkmen (n=6)
    39 (37-41) Cardona_Sri_Lankan (n=20)
    40 (36-45) Cardona_Indian (n=22)
    41 (22-57) Uzbek (n=27)
    43 (31-49) Bahun (n=4)
    43 (42-44) Bengali (n=7)
    48 (40-54) Tatar_Siberian (n=17)
    49 (49-49) Tambets_Buryats (n=1)
    49 (38-60) Nogai_Stavropol (n=10)
    52 (50-54) Tatar_Siberian_Zabolotniye (n=4)
    53 (45-56) Mansi (n=8)
    54 (52-57) Nogai_Astrakhan (n=6)
    54 (44-61) Altaian_Chelkan (n=6)
    54 (47-64) Uyghur (n=10)
    54 (34-68) Cardona_Khant (n=13)
    55 (49-62) Karakalpak (n=14)
    56 (53-58) Hazara (n=11)
    57 (54-61) Tambets_Khanty (n=8)
    58 (45-88) Cardona_Uygur (n=14)
    60 (60-60) Khakass_outlier (n=1)
    60 (53-68) Cardona_Kazakh (n=10)
    61 (56-64) Tubalar (n=21)
    61 (59-62) Shor_Mountain (n=7)
    62 (59-64) Shor_Khakassia (n=4)
    63 (57-68) Kazakh (n=17)
    64 (62-67) Newar (n=7)
    65 (51-73) Cardona_Ket (n=8)
    65 (63-67) Kyrgyz_Tajikistan (n=4)
    66 (52-74) Khakass (n=14)
    66 (63-71) Cardona_Teleut (n=10)
    66 (52-94) Cardona_Selkup (n=36)
    67 (62-69) Ket (n=14)
    67 (58-83) Selkup (n=11)
    67 (52-75) Tharu (n=5)
    67 (61-72) Kyrgyz_China (n=12)
    67 (65-71) Kyrgyz_Kyrgyzstan (n=9)
    69 (32-81) Cardona_Tundra_Nentsi (n=13)
    71 (35-92) Even (n=7)
    71 (65-78) Kazakh_China (n=7)
    72 (65-74) Cardona_Forest_Nentsi (n=17)
    72 (68-75) Khakass_Kachin (n=7)
    72 (69-76) Altaian (n=23)
    73 (56-76) Cardona_Altai_Kizhi (n=18)
    73 (68-78) Enets (n=2)
    77 (66-83) Kusunda (n=5)
    77 (62-92) Cardona_Dolgan (n=9)
    79 (76-81) Kalmyk (n=10)
    81 (39-87) Tofalar (n=10)
    81 (77-84) Tuvinian (n=18)
    81 (65-96) Evenk_FarEast (n=5)
    83 (81-86) Salar (n=8)
    83 (77-87) Mongol (n=34)
    84 (82-85) Eskimo_Naukan (n=9)
    84 (81-86) Tamang (n=7)
    84 (72-86) Cardona_Buryat (n=16)
    84 (75-91) Burmese (n=10)
    84 (82-85) Todzin (n=3)
    84 (80-87) Cardona_Mongol (n=10)
    84 (77-88) Buryat (n=37)
    85 (79-94) Magar (n=6)
    85 (84-86) Eskimo_ChaplinSireniki (n=7)
    85 (77-90) Malay (n=9)
    85 (83-90) Dongxiang (n=7)
    85 (79-88) Yakut (n=20)
    86 (80-88) Cardona_Yakut (n=13)
    86 (80-90) Dolgan (n=3)
    86 (82-93) Thai (n=8)
    86 (81-89) Khamnegan (n=9)
    86 (86-86) Chukchi1 (n=1)
    86 (84-90) Dungan (n=11)
    86 (85-88) Chukchi (n=16)
    87 (83-90) Cambodian (n=8)
    87 (71-89) Cardona_Koryak (n=9)
    87 (86-88) Itelmen (n=5)
    88 (87-90) Koryak (n=9)
    88 (84-93) Gurung (n=5)
    89 (84-93) Bonan (n=9)
    89 (83-93) Cardona_Yukagir (n=3)
    89 (87-91) Cardona_Indonesia_Java (n=21)
    90 (86-93) Tagalog (n=5)
    90 (87-94) Yukagir_Tundra (n=12)
    90 (85-93) Cardona_Nganasan (n=14)
    91 (86-94) Evenk_Transbaikal (n=8)
    91 (88-92) Tu (n=10)
    91 (90-93) Visayan (n=4)
    91 (76-95) Cardona_Evenk (n=12)
    92 (88-94) Nganasan (n=18)
    92 (88-95) Rai (n=5)
    93 (90-95) Cardona_Tibetan (n=18)
    93 (92-94) Sherpa (n=4)
    93 (86-98) Tibetan (n=95)
    93 (85-97) Yugur (n=16)
    93 (91-94) Mongola (n=6)
    94 (91-95) Dusun (n=10)
    94 (93-94) Murut (n=8)
    94 (92-95) Daur (n=9)
    94 (91-96) Xibo (n=7)
    94 (94-95) Cardona_Even (n=2)
    95 (95-95) Ilocano (n=2)
    95 (92-97) Oroqen (n=9)
    95 (94-96) China_Lahu (n=6)
    95 (95-97) Kinh (n=8)
    96 (95-96) Vietnamese (n=10)
    96 (94-98) Cardona_Vietnamese (n=10)
    96 (94-97) Hezhen (n=7)
    96 (95-96) Tibetan_Yunnan (n=3)
    96 (95-96) Naxi (n=8)
    96 (95-97) Dai (n=10)
    96 (95-98) Yi (n=10)
    96 (96-97) Li (n=4)
    96 (95-97) Atayal (n=7)
    96 (96-97) Nivh (n=8)
    97 (94-99) Qiang (n=19)
    97 (95-99) Ulchi (n=24)
    97 (94-98) Kankanaey (n=10)
    97 (96-98) Zhuang_outlier (n=2)
    97 (96-98) Gelao (n=8)
    97 (96-98) Ami (n=10)
    97 (96-99) Japanese (n=29)
    97 (96-98) Nanai (n=9)
    97 (96-99) Mulam (n=16)
    97 (96-98) Cardona_Manchu (n=14)
    97 (96-98) Zhuang (n=19)
    97 (97-98) Maonan (n=17)
    98 (97-98) Cardona_Han_South (n=13)
    98 (97-99) Dong (n=20)
    98 (97-98) Miao (n=10)
    98 (94-100) Han (n=107)
    98 (97-98) Negidal (n=3)
    98 (97-99) Tujia (n=10)
    98 (98-99) She (n=9)
    99 (97-99) Korean (n=6)
    99 (98-100) Cardona_Korean (n=18)
    Last edited by Nganasankhan; 10-08-2021 at 08:18 AM.

  5. #34
    Registered Users
    Posts
    341
    Sex
    Ethnicity
    Finnish

    Display the results of an ADMIXTURE run in a shell so that rows and columns are reordered accounting for FST

    This assumes that there are two files which are created by a wrapper function I use to run ADMIXTURE: `oiropa.8a` is a space-delimited file which contains population names in the first column and population averages for admixture weights in the remaining columns, and the file `oiropa.8.fst` contains the FST matrix of the run as CSV with no column names or row names.

    You can generate the FST file like this (where the log file is generated by `admixture -j4 oiropa.bed 8|tee oiropa.8.log`): `sed '1,/^Fst/d;$d' oiropa.8.log|sed 1d|cut -f2-|tac|awk -F\\t 'NR==1{w=NF}{NF=w}1' OFS=,|tac>oiropa.8.fst`.

    The columns are sorted based on their position in the first dimension of an MDS matrix of the FST matrix. MDS is basically a version of PCA where the input is a distance matrix, so the first dimension of the MDS matrix corresponds to PC1 in a PCA.

    In order to sort the rows, matrix multiplication is used to multiply the matrix of admixture weights with an MDS matrix of the FST matrix so that the Euclidean distance between the populations roughly corresponds to the genetic distance between the populations. Then the multiplied matrix is used for hierarchical clustering, and the hierarchical clustering tree is reordered so that the weights consist of the position of each population on PC1 in a PCA of the multiplied matrix. And then the order of the populations in the clustering tree is used as the order of the rows.

    Code:
    $ amf()(Rscript --no-init-file -e 'suppressPackageStartupMessages(library(vegan));f=commandArgs(T)[1];t=read.table(paste0(f,"a"),r=1);fst=as.matrix(as.dist(read.csv(paste0(f,".fst"),head=F)));mds=cmdscale(fst,ncol(fst)-1);t2=as.matrix(t)%*%mds;t=t[reorder(hclust(dist(t2)),prcomp(t2)$x[,1])$order,order(mds[,1])];write.table(cbind(t(apply(round(t*100),1,function(x)sprintf("%3s",x))),rownames(t)),"",row.names=F,col.names=F,quote=F)' "[email protected]")
    $ amf oiropa.8
      0   0   0   0   0   0   0 100 Kalmyk
      0   0   0  11   8   1  15  66 Nogai_Astrakhan
      0   0   1   9   9   1  22  58 Nogai_Stavropol
      0   1   0   6  41   3   4  45 Bashkir
      0   0   0   5   5   9  54  27 Nogai_Karachay_Cherkessia
      0   0   0   0  62   6   0  32 Udmurt
      0   0   0   1  64   6   0  29 Besermyan
      0   0   0   3  65   2   4  26 Chuvash
      0   0   0  11  52   2  12  23 Tatar_Kazan
      0   0   2   1   3   0  90   4 Abazin
      0   0   0   1   5   1  90   4 Kabardinian
      0   0   0   1   0   1  97   1 Karachai
      0   0   0   1   1   1  96   1 Balkar
      0   0   0   0   0   0  99   1 Russia_NorthOssetian
      0   0   0   1   3   1  94   1 Circassian
      0   0   0   0   0   1  98   0 Ossetian
      0   0   0   0   0   0 100   0 Russia_Abkhasian
      0   0   0   2   1   0  97   0 Adygei
      0   0   0   0   0  24  76   0 Ingushian
      2   0   5   3   1  39  48   3 Kumyk
      0   1   0   3   1  46  50   0 Chechen
      0   2   0   4   2  64  29   0 Lezgin
      0   0   1   2   2  84  10   0 Tabasaran
      0   0   0   0   2  90   7   0 Avar
      0   0   0   0   0  95   4   0 Lak
      0   0   0   0   0 100   0   0 Darginian
      0   0   0   0   0 100   0   0 Kubachinian
      0   0   0   1   0  98   0   0 Kaitag
      0   0   2  15  54   2  10  16 Tatar_Mishar
      0   1   0   0  88   0   0  12 Russian_Archangelsk_Leshukonsky
      0  18   0   0  73   0   0   9 Russian_Archangelsk_Pinezhsky
      0   0   0  11  76   3   5   5 Mordovian
      0   0   0   7  85   0   2   5 Russian_Archangelsk_Krasnoborsky
      0   0   0   0  96   0   0   4 Veps
      0   0   0   2  94   0   0   3 Karelian
      0   0   0  13  84   0   0   3 Finnish
      0   0   1  15  84   0   0   0 Lithuanian
      0   0   0  19  81   0   0   0 Estonian
      1  10   1   8  74   1   3   2 Russian
      1   0   1  20  73   1   3   0 Belarusian
      0   0   7  23  64   1   4   0 Ukrainian
      2   0   6  21  67   1   2   0 Ukrainian_North
      0   0  75  10   8   1   6   0 Greek
      4   0  63  17   9   0   7   0 Albanian
      0   0  94   2   0   1   3   0 Italian_South
      0   0  98   0   0   0   2   0 Jew_Ashkenazi
      0   0  99   1   0   0   0   0 Sicilian
      0   0 100   0   0   0   0   0 Maltese
      1   0   5  53  37   0   4   0 Czech
      0   0  19  43  33   0   3   0 Hungarian
      0   0   0  99   0   1   0   0 English
      0   0   0  99   1   0   0   0 Norwegian
      0   0   0 100   0   0   0   0 Icelandic
      0   0   0 100   0   0   0   0 Orcadian
      0   0   0 100   0   0   0   0 Scottish
      8  11  14  65   1   1   1   0 French
      0   0  42  29  30   0   0   0 Croatian
      1   0  55  17  21   2   4   0 Bulgarian
      0   0  51  23  24   0   3   0 Romanian
      0   8  52   7  27   0   6   0 Moldavian
      1  15  41   9  19   2  13   1 Gagauz
     20   1  34  39   0   0   6   0 Italian_North
     41   0   0  56   0   2   0   0 Spanish_Lleida
     38   9   4  44   1   1   2   0 Spanish
     11  55   0  34   1   0   0   0 Spanish_North
      0 100   0   0   0   0   0   0 Basque
    100   0   0   0   0   0   0   0 Sardinian
    100   0   0   0   0   0   0   0 Italian_Sardinian
    Another approach to sorting the branches of the clustering tree is to move branches with a high percentage of the first columns to the beginning and to move branches with a high percentage of the last columns to the end: `t=t[,order(mds[,1])];t=t[reorder(hclust(dist(t2)),as.matrix(t)%*%seq(ncol(t ))^2)$order,]`.
    Last edited by Nganasankhan; 06-30-2022 at 05:13 AM.

  6. The Following User Says Thank You to Nganasankhan For This Useful Post:

     PLogan (06-29-2022)

  7. #35
    Registered Users
    Posts
    341
    Sex
    Ethnicity
    Finnish

    One reason why the output of SmartPCA can differ from `plink --pca` is that SmartPCA does something like `--mac 1` by default, so it removes SNPs where the minor allele count is 0. When I ran SmartPCA for a dataset with the default parameters, it said that it killed 36381 SNPs:

    Code:
     snp        snp_1_1611813 ignored . allelecnt:     0  missing:     2
     snp        Affx-50013327 ignored . allelecnt:     0  missing:     2
     snp        snp_1_1934796 ignored . allelecnt:     0  missing:   154
     snp          rs111514148 ignored . allelecnt:     0  missing:     1
     snp        snp_1_2113592 ignored . allelecnt:     0  missing:     0
     snp        Affx-50018835 ignored . allelecnt:     0  missing:     1
     snp        snp_1_2326032 ignored . allelecnt:     0  missing:     0
     snp        Affx-50025378 ignored . allelecnt:     0  missing:     3
     snp        Affx-50028654 ignored . allelecnt:     0  missing:     1
     snp          rs111432807 ignored . allelecnt:     0  missing:     1
    total number of snps killed in pass: 36381  used: 556743
    When I ran `plink --pca` for the same dataset, it only said `Excluding 4449 variants on non-autosomes from PCA`. But then when I ran PLINK with `--autosome --mac 1` (or `--autosome --maf .00001`), it said that `36381 variants removed due to allele frequency threshold(s)`, which is the exact same number as the SNPs removed by SmartPCA. I wasn't sure earlier what `allelecnt` means in the output of SmartPCA, but apparently it refers to minor allele count.

    Some people seem to assume that SmartPCA does something special that `plink --pca` doesn't do because SmartPCA is so much slower. But now when I tried running PLINK PCA and SmartPCA for the same dataset, and I used `--mac 1` with PLINK and `numoutlieriter: 0` with SmartPCA, and I compared distance matrices of the first 20 dimensions of the PLINK and SmartPCA runs, I got a correlation coefficient of 0.9999388:

    Code:
    t=read.table("inter.eigenvec") # PLINK
    ev=scan("inter.eigenval")
    rownames(t)=paste0(t[,1],":",t[,2])
    t=t(t(t[,-c(1,2)])*sqrt(ev))
    
    t2=read.table("inter.evec",r=1) # SmartPCA
    t2=t2[,-ncol(t2)]
    ev2=scan("inter.eval")[1:ncol(t2)]
    t2=t(t(t2)*sqrt(ev2))
    t2=t2[rownames(t),]
    
    cor(dist(t[,1:20]),dist(t2[,1:20])) # 0.9999388
    
    g=read.csv("g/25/mis",r=1)[rownames(t),]
    
    cor(dist(t[,1:8]),dist(g)) # 0.9127597
    cor(dist(t2[,1:8]),dist(g)) # 0.9135776
    cor(dist(t[,1:20]),dist(g)) # 0.814285
    cor(dist(t2[,1:20]),dist(g)) # 0.8140387
    The last 4 lines above show that I got a higher correlation with G25 when I truncated the output of PLINK or SmartPCA to 8 dimensions, and I got lower correlation when I truncated the output to 20 dimension.

    But you should probably use higher `--maf` setting rather than `--mac 1`, because when I used `--maf .09`, I was able to increase the correlation between PLINK and G25 from about 0.913 to about 0.985 (but I got lower correlation with `--maf .08` and `--maf .1`).
    Last edited by Nganasankhan; 07-12-2022 at 12:54 PM.

  8. The Following User Says Thank You to Nganasankhan For This Useful Post:

     PLogan (07-12-2022)

  9. #36
    Registered Users
    Posts
    341
    Sex
    Ethnicity
    Finnish

    In the script in post #14, there's a bug where it displays the percentage labels of some populations in the wrong order when the K-value is 10 or higher. Here's a fixed version that also uses a color palette that is designed for higher K-values. This version also automatically reorders the components of the ADMIXTURE run by doing hierarchical clustering for the FST matrix of the run.

    Code:
    library(tidyverse)
    library(vegan) # for reorder.hclust
    `|`=`%>%`
    
    t=as.matrix(read.table("https://pastebin.com/raw/ZDSA03Ed",row.names=1))
    t=t[!grepl("_o",rownames(t)),]
    fst=as.matrix(read.table("https://pastebin.com/raw/v8Pjz2nk"))
    
    mds=cmdscale(fst,ncol(fst)-1)
    mult=t%*%mds
    hc=reorder(hclust(dist(mult)),prcomp(mult)$x[,1])
    k=as.factor(cutree(hc,32))[hc$order]
    t=t[hc$labels[hc$order],]
    
    t=t[,reorder(hclust(as.dist(fst)),mds[,1])$order] # rearrange columns based on hierarchical clustering
    # t=t[,order(mds[,1])] # rearrange columns based on first dimension of MDS
    # t=t[,c(15,27,10,9,20,16,5,3,2,21,19,8,24,23,6,31,1,25,18,14,17,7,22,29,4,30,32,28,11,12,13,26)] # arrange columns manually
    colnames(t)=sprintf("V%02i",1:ncol(t)) # prevent ggplot from reordering bars
    
    p1=ggplot(ggdendro::dendro_data(as.dendrogram(hc))$segment)+
    geom_segment(aes(x=y,y=x,xend=yend,yend=xend),size=.35,lineend="round",color="gray10")+
    scale_x_continuous(expand=expansion(mult=c(0,.01)))+
    scale_y_continuous(limits=.5+c(0,nrow(t)),expand=c(0,0))+
    theme(
      axis.text=element_blank(),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0,"pt"),
      axis.title=element_blank(),
      panel.background=element_rect(fill="white"),
      panel.grid=element_blank(),
      plot.background=element_rect(fill="white",color=NA),
      plot.margin=margin(5,5,5,0)
    )
    
    t2=data.frame(V1=rownames(t)[row(t)],V2=colnames(t)[col(t)],V3=unname(c(t)))
    lab=round(100*t2$V3)
    lab[lab<=1]=""
    
    # hue=seq(15,375,length.out=8+1)|head(-1)
    hue=c(0,210,100,60,300,30,150,250)+15
    cl=list(c(50,60),c(30,90),c(100,70),c(25,40))
    set.seed(0)
    shuf=t|ncol|sample
    pal1=lapply(cl,\(x)hcl(hue,x[1],x[2]))|unlist|`[`(shuf)|rep(each=t|nrow)
    pal2=sapply(cl,`[[`,2)<=40|ifelse("white","black")|rep(each=hue|length)|`[`(shuf)|rep(each=t|nrow)
    pal3=list(c(70,70),c(50,50),c(30,60),c(70,30))|lapply(\(x)hcl(hue,x[1],x[2]))|unlist|head(n_distinct(k))|sample
    
    p2=ggplot(t2,aes(x=factor(V1,levels=rownames(t)),y=V3,fill=V2))+
    geom_bar(stat="identity",width=1,position=position_fill(reverse=T),size=.2,fill=pal1,color="gray10")+
    geom_text(aes(label=lab),position=position_stack(vjust=.5,reverse=T),size=3.5,color=pal2)+
    coord_flip()+
    scale_x_discrete(expand=c(0,0))+
    scale_y_discrete(expand=c(0,0))+
    theme(
      axis.text=element_text(color=pal3[k],size=11),
      # axis.text=element_text(color="black",size=11),
      axis.text.x=element_blank(),
      axis.ticks=element_blank(),
      axis.title=element_blank(),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray10",fill=NA,size=.4),
      plot.background=element_rect(fill="white",color=NA),
      plot.margin=margin(5,0,5,5)
    )
    
    cowplot::plot_grid(p2,p1,rel_widths=c(1.2,1))
    ggsave("1.png",height=.27*nrow(t),width=11,limitsize=F,dpi=150)
    Last edited by Nganasankhan; 09-22-2022 at 11:47 PM.

  10. The Following User Says Thank You to Nganasankhan For This Useful Post:

     PLogan (09-22-2022)

  11. #37
    Registered Users
    Posts
    1,770
    Ethnicity
    Arab
    Y-DNA (P)
    J2a-J-L210-J-y15222

    Could someone take me through on how I can convert my full genome into plink format. Iím trying to learn qpadm thanks
    Target: Me
    Distance: 1.4392% / 0.01439195
    36.0 TUR_Tepecik_Ciftlik_N
    22.2 Iberomaurusian
    22.0 SYR_Tell_Qarassa_Early_Antiquity
    13.4 COG_Kindoki_230BP
    5.8 Steppe_Pastoralist
    0.6 Iran_Neolithic

    Target: Me
    Distance: 1.5117% / 0.01511723
    41.8 MAR_LN
    24.6 SYR_Tell_Qarassa_Early_Antiquity
    13.8 COG_Kindoki_230BP
    10.8 Iberomaurusian
    5.0 Caucasus_Hunter-gatherer
    4.0 Steppe_Pastoralist

    R11109 MALE 1 CE 1749.5 400 CE ARCHAEOLOGY Isola_Sacra Y-DNA: J-Y15222 mtDNA: X2m'n

  12. #38
    Registered Users
    Posts
    579
    Sex
    Location
    Missouri, U.S.
    Ethnicity
    Colonial American
    Nationality
    American
    aDNA Match (1st)
    VK2020_Scotland_Orkney_VA:VK207
    Y-DNA (P)
    R1b-U152 >R-FTA96415
    mtDNA (M)
    J1b1a1a
    Y-DNA (M)
    I2-P37 > I-BY77146
    mtDNA (P)
    H

    United States of America Scotland England Netherlands
    Europe subset using k=25

    https://cdn.discordapp.com/attachmen...Europe_K25.png

    UPDATE: Nganasan noticed in my chart that Avar was separating from the rest abnormally. He did his usual brilliant analysis and determined there is something amiss with the Avar sample(s).

    https://cdn.discordapp.com/attachmen...66142012/1.png
    Last edited by PLogan; 09-23-2022 at 12:13 AM.

Page 4 of 4 FirstFirst ... 234

Similar Threads

  1. R scripts for ADMIXTOOLS 2
    By Nganasankhan in forum Autosomal (auDNA)
    Replies: 10
    Last Post: 09-23-2022, 01:30 AM
  2. Shell commands and R scripts for working with G25
    By Nganasankhan in forum Autosomal (auDNA)
    Replies: 71
    Last Post: 08-13-2022, 03:30 PM
  3. Replies: 6
    Last Post: 09-27-2019, 10:31 PM
  4. Replies: 3
    Last Post: 02-06-2016, 04:44 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •