Page 2 of 3 FirstFirst 123 LastLast
Results 11 to 20 of 29

Thread: MDS plots for European modern individuals

  1. #11
    Global Moderator
    Posts
    3,723
    Sex
    Location
    Vissaiom
    Ethnicity
    Portuguese highlander
    Y-DNA (P)
    E-Y31991>FT17866
    mtDNA (M)
    H20 (xH20a)

    Asturias Galicia Portugal 1143 Portugal 1485 Portugal Order of Christ PortugalRoyalFlag1830
    Quote Originally Posted by anglesqueville View Post
    Ruderico, in the context of my last post, this question doesn't arise precisely because there is no dimension reduction before UMAP. The data consist of all the pairwise distances between individuals ("distances" in the sense of 1-ibs, computed on the panel of SNPs).
    Yes, I know that, but I was toying with truncations and am at a loss where should I place the limit at
    YDNA E-Y31991>PF4428>Y134097>Y134104>Y168273>FT17866 (TMRCA ~1100AD) - Domingos Rodrigues, b. circa 1690 Hidden Content , Viana do Castelo, Portugal - Stonemason, miller.
    mtDNA H20 - Monica Vieira, b. circa 1700 Hidden Content , Porto, Portugal

    Hidden Content
    Global25 PCA West Eurasia dataset Hidden Content

    [1] "distance%=1.6007"

    Ruderico

    NW_Iberia_IA,80.4
    Berber_EMA,11
    Roman_Colonial,8.6

  2. The Following 2 Users Say Thank You to Ruderico For This Useful Post:

     anglesqueville (02-08-2021),  JMcB (02-08-2021)

  3. #12
    Registered Users
    Posts
    1,041
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA (P)
    I2a2a1b2-CTS1977
    mtDNA (M)
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by Ruderico View Post
    Huijbregts how many dimensions would you suggest we truncate data at? I remember Angles made an experiment and suggested 8, based on distances between NW Europeans
    This is an important question when you are using unscaled Global25 data.
    The mathematical answer is "Use as many dimensions as the intrinsic dimensionality of the dataset".
    Unfortunately mathematicians don't agree about the best method to estimate the intrinsic dimensionality.
    What is certain however, is that the intrinsic dimensionaliteit is on the first place dependent on the number of samples.
    My rule of thumb is to calculate how many of samples you need to complete cover each half of each dimension.
    For one dimension you need two samples, one for the positive side and for the negative side.
    To completely cover 25 dimensions you need 2^25=33 554 432 samples.
    The Global25 of the ancient samples has 3830 samples. The 2log of 3830 is 7.162, so you can handle 7 dimensions (round downwards).
    The Global25 of the modern samples has 5917 samples. The 2log of 5917 is 7.544, so again you can handle 7 dimensions.
    But if you use a restricted dataset of say 500 samples, you can only use 5 dimensions (2log(500)= 5.398).

  4. The Following 9 Users Say Thank You to Huijbregts For This Useful Post:

     anglesqueville (02-08-2021),  Bart (02-10-2021),  JMcB (03-10-2021),  Nino90 (02-08-2021),  randwulf (02-09-2021),  Ruderico (02-08-2021),  sheepslayer (02-08-2021),  ThaYamamoto (02-10-2021),  Trelvern (02-09-2021)

  5. #13
    Gold Class Member
    Posts
    7,518
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA (P)
    R-BY3604-Z275
    mtDNA (M)
    H5a1

    Normandie Orkney Netherlands Friesland East Frisia Finland
    Back to post #8. After having experimented quite a bit (fortunately, all this runs very quickly on my new Linux machine), I think I can say that this "direct UMAP" suffers from the same limitations as the "direct MDS", or more precisely that it requires the same constraints to be satisfied. Especially the presence of samples with a high rate of no-calls has dramatical effects. I had promised the members of my family clan who had provided me with imputed data to send them back in exchange, in addition to qpAdm models, representations (independent of G25, because most of those concerned do not and will not have "their" G25) easily interpretable. Honestly, I didn't even point out the possibility of running UMAP, and gave them "their" MDS, which they were fully satisfied with. But I encourage hobbyists to experiment since it only uses PLINK and R, both available for Windows.
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  6. The Following 3 Users Say Thank You to anglesqueville For This Useful Post:

     Bart (02-10-2021),  JMcB (02-10-2021),  randwulf (02-09-2021)

  7. #14
    Registered Users
    Posts
    1,041
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA (P)
    I2a2a1b2-CTS1977
    mtDNA (M)
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by anglesqueville View Post
    Back to post #8. After having experimented quite a bit (fortunately, all this runs very quickly on my new Linux machine), I think I can say that this "direct UMAP" suffers from the same limitations as the "direct MDS", or more precisely that it requires the same constraints to be satisfied. Especially the presence of samples with a high rate of no-calls has dramatical effects. I had promised the members of my family clan who had provided me with imputed data to send them back in exchange, in addition to qpAdm models, representations (independent of G25, because most of those concerned do not and will not have "their" G25) easily interpretable. Honestly, I didn't even point out the possibility of running UMAP, and gave them "their" MDS, which they were fully satisfied with. But I encourage hobbyists to experiment since it only uses PLINK and R, both available for Windows.
    UMAP is computationally far more complicated than say PCA. So it is not unexpected that no-calls have a dramatic effect.
    Another vulnerability of UMAP is low sample number. If I remember it correctly, I have read that UMAPs on less than 500 samples should be considered not reliable. From your plots I get the impression that you may have than 500 samples, but not by much. Can you give us an indication?

  8. The Following 2 Users Say Thank You to Huijbregts For This Useful Post:

     JMcB (02-10-2021),  randwulf (02-09-2021)

  9. #15
    Gold Class Member
    Posts
    7,518
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA (P)
    R-BY3604-Z275
    mtDNA (M)
    H5a1

    Normandie Orkney Netherlands Friesland East Frisia Finland
    Quote Originally Posted by Huijbregts View Post
    UMAP is computationally far more complicated than say PCA. So it is not unexpected that no-calls have a dramatic effect.
    Another vulnerability of UMAP is low sample number. If I remember it correctly, I have read that UMAPs on less than 500 samples should be considered not reliable. From your plots I get the impression that you may have than 500 samples, but not by much. Can you give us an indication?
    Yes: 1102 individuals
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  10. The Following 3 Users Say Thank You to anglesqueville For This Useful Post:

     Huijbregts (02-09-2021),  JMcB (02-10-2021),  randwulf (02-09-2021)

  11. #16
    Registered Users
    Posts
    1,041
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA (P)
    I2a2a1b2-CTS1977
    mtDNA (M)
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by Huijbregts View Post
    This is an important question when you are using unscaled Global25 data.
    The mathematical answer is "Use as many dimensions as the intrinsic dimensionality of the dataset".
    Unfortunately mathematicians don't agree about the best method to estimate the intrinsic dimensionality.
    What is certain however, is that the intrinsic dimensionaliteit is on the first place dependent on the number of samples.
    My rule of thumb is to calculate how many of samples you need to complete cover each half of each dimension.
    For one dimension you need two samples, one for the positive side and for the negative side.
    To completely cover 25 dimensions you need 2^25=33 554 432 samples.
    The Global25 of the ancient samples has 3830 samples. The 2log of 3830 is 7.162, so you can handle 7 dimensions (round downwards).
    The Global25 of the modern samples has 5917 samples. The 2log of 5917 is 7.544, so again you can handle 7 dimensions.
    But if you use a restricted dataset of say 500 samples, you can only use 5 dimensions (2log(500)= 5.398).
    WITHDRAWAL #12
    Above quote from post 12 is bullshit (or even elephant shit, but not dinosaur shit).
    I have the reproachable habit of performing simple calculations not in an R console but in my browser.
    That would not have led to a blooper if I had entered the correct formula log2(x).
    Instead I used to enter 2log(x).
    In an R console this throws an error, but my browser interprets it as two times the logarithm with base 10: 2 * log10(x).
    Unfortunately that that value appears not implausible.
    With the correct formula the outcome is for 3830 samples: 11.90313 which means 11 dimensions.
    That appears too high. But notice that the rule of thumb supposes that all dimensions are nearly symmetrically filled which is overoptimistic.
    My assertion that you need 2^25=33 554 432 samples for 25 dimensions is not affected, because it used powers instead of logarithms.

    The question of how many dimensions to use remains an important one. Maybe I get a better idea.

  12. The Following 5 Users Say Thank You to Huijbregts For This Useful Post:

     anglesqueville (02-10-2021),  JMcB (02-10-2021),  randwulf (02-10-2021),  Ruderico (02-10-2021),  xerxez (02-11-2021)

  13. #17
    Gold Class Member
    Posts
    7,518
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA (P)
    R-BY3604-Z275
    mtDNA (M)
    H5a1

    Normandie Orkney Netherlands Friesland East Frisia Finland
    For info. Work on the MDS continued on the French-speaking forum, on a discussion thread that is less and less French-speaking (which I am very happy with). Very positive experience and rich in lessons. We have virtually abandoned the use of dna.land (which seems frozen), and I take care of the imputation phase myself. It's work, but it's worth it.

    https://anthrogenica.com/showthread....es-diplo%EFdes
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  14. The Following 3 Users Say Thank You to anglesqueville For This Useful Post:

     Finn (03-08-2021),  JMcB (03-10-2021),  randwulf (03-08-2021)

  15. #18
    Gold Class Member
    Posts
    3,144
    Sex
    Location
    Pennsylvania
    Ethnicity
    West European
    Nationality
    USA
    Y-DNA (P)
    R1b-U152-Z36+FGC6511
    mtDNA (M)
    H11a2a

    United States of America Germany England France Scotland Ireland
    Quote Originally Posted by Ruderico View Post
    Yes, I know that, but I was toying with truncations and am at a loss where should I place the limit at
    I added the ability to truncate at calculation time to the Ancestry calculator, so one can experiment:

    https://yk.github.io/ancestry/

    If you have suggestions for improvement, just let me know.

  16. The Following 2 Users Say Thank You to randwulf For This Useful Post:

     JMcB (03-10-2021),  Ruderico (03-10-2021)

  17. #19
    Registered Users
    Posts
    56
    Sex
    Ethnicity
    Finnish

    By MDS, do you mean classical MDS or non-metric MDS? Classical MDS should produce the same coordinates as PCA (except that the sign of some dimensions may be arbitrarily flipped, like in the case of the second dimension below).

    Code:
    > t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",header=T,row.names=1)
    > prcomp(t)$x[1:4,1:4]
                       PC1        PC2         PC3          PC4
    Abazin     0.009519441  0.1667842 0.003774957 -0.000634985
    Abkhasian  0.038130735  0.1943700 0.012288608  0.004753655
    Adygei     0.026658686  0.1892593 0.004977751  0.003864157
    Aeta      -0.199946861 -0.2023205 0.101169686  0.033809794
    > cmdscale(dist(t))[1:4,]
                      [,1]       [,2]
    Abazin     0.009519441 -0.1667842
    Abkhasian  0.038130735 -0.1943700
    Adygei     0.026658686 -0.1892593
    Aeta      -0.199946861  0.2023205
    If a plot made with classical MDS looks different from the original coordinates produced by `plink --pca`, it's because you're supposed to scale the coordinates before you do a second PCA/MDS on top of them. A PCA plot based on the unscaled coordinates will look the same as a classical MDS plot.

    Code:
    wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
    tar -xf v44.3_1240K_public.tar
    f=v44.3_1240K_public
    convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
    awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' g/g/12.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_father|_mother|_son|_daughter|_brother|_sister|_twin'|grep -Ev 'Neanderthal|Denisova|_published|Cameroon_SMA'>picks
    plink --allow-no-sex --bfile $f --keep <(awk 'NR==FNR{a[$1];next}$2 in a' picks $f.fam) --pca
    paste -d' ' <(cut -d' ' -f2 picks) <(cut -d' ' -f2- plink.eigenvec)>plink.eigenvec.pops
    Code:
    library(tidyverse)
    library(colorspace)
    
    t=read.table("plink.eigenvec.pops",sep=" ")
    eig=as.double(readLines("plink.eigenval"))
    # t=read.table("https://pastebin.com/raw/bGptFp12",sep=" ") # data I used
    # eig=c(22.7034,12.2596,8.09738,5.49967,4.92393,2.80936,2.39839,2.28679,2.20875,2.20267,2.12417,1.98691,1.81393,1.78919,1.74964,1.71181,1.66634,1.65334,1.62163,1.56336)
    
    ave=aggregate(t[,-c(1:2)],list(t[,1]),mean)
    row.names(ave)=ave$Group.1
    ave=ave[,-1]
    
    # ave=(t(t(ave)*sqrt(eig))) # scale by multiplying with square roots of eigenvalues
    
    p=as.data.frame(cmdscale(dist(ave),k=20)) # classical MDS (k=2 would be a truncated version of k=20, but k=20 makes the clustering more accurate because it won't be based on only 2 dimensions)
    # p=as.data.frame(prcomp(ave)$x) # uncomment to use PCA instead of MDS
    # p=as.data.frame(ave) # uncomment to plot original coordinates by `plink --pca`
    names(p)[1:2]=c("x","y")
    p$k=cutree(hclust(dist(p),method="complete"),k=32)
    
    ggplot(p,aes(x,y))+
    geom_point(aes(color=as.factor(k)),size=.5)+
    geom_polygon(data=p%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.3)+
    geom_text(aes(label=row.names(p),color=as.factor(k)),size=2,vjust=-.7)+
    theme(
      aspect.ratio=1/sqrt(2),
      axis.text=element_text(color="black",size=6),
      axis.ticks.length=unit(0,"pt"),
      axis.ticks.x=element_blank(),
      axis.ticks.y=element_blank(),
      axis.title=element_text(color="black",size=8),
      legend.position="none",
      panel.background=element_rect(fill="white"),
      panel.border=element_rect(color="gray75",fill=NA,size=.4),
      panel.grid.major=element_line(color="gray70",size=.15),
      plot.title=element_text(size=9),
    )+
    ggtitle("Original coordinates by `plink --pca` (scaled)")+
    labs(x="",y="")+
    scale_x_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.12))+
    scale_y_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.03))+
    scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)
    
    ggsave("a.png")
    system("mogrify -trim -bordercolor white -border 20x20 a.png")
    Below the unscaled MDS plot is identical with the unscaled PCA plot, and the scaled MDS plot is identical with the scaled PCA plot.





    Above you can also see that when I used the original coordinates by `plink --pca`, the clustering was messed up when I used unscaled coordinates.
    Last edited by Nganasankhan; 03-17-2021 at 02:11 PM.

  18. The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

     Ajeje Brazorf (03-17-2021),  sheepslayer (03-17-2021)

  19. #20
    Gold Class Member
    Posts
    3,144
    Sex
    Location
    Pennsylvania
    Ethnicity
    West European
    Nationality
    USA
    Y-DNA (P)
    R1b-U152-Z36+FGC6511
    mtDNA (M)
    H11a2a

    United States of America Germany England France Scotland Ireland
    Quote Originally Posted by Nganasankhan View Post
    By MDS, do you mean classical MDS or non-metric MDS?
    @angles is starting with our raw files, imputing SNPs to get to a common basic file format, and then doing the MDS based plotting. The results are very good. You should follow his link above and read the thread.

  20. The Following 3 Users Say Thank You to randwulf For This Useful Post:

     anglesqueville (03-17-2021),  Bygdedweller (03-17-2021),  JMcB (03-17-2021)

Page 2 of 3 FirstFirst 123 LastLast

Similar Threads

  1. Clustering the sub modern European G25 samples
    By Huijbregts in forum Autosomal (auDNA)
    Replies: 65
    Last Post: 01-26-2021, 01:18 AM
  2. Modern European Dimensions calculator (scaled - G25)
    By Norfern-Ostrobothnian in forum Autosomal (auDNA)
    Replies: 64
    Last Post: 01-02-2021, 09:51 PM
  3. PCAs on the basis of qpAdm models, for European modern groups.
    By anglesqueville in forum Autosomal (auDNA)
    Replies: 11
    Last Post: 09-01-2020, 10:24 AM
  4. Comparing Modern Lebanese to Ancient Individuals on Vahaduo
    By KingofPhoenicia001 in forum Autosomal (auDNA)
    Replies: 12
    Last Post: 08-14-2020, 10:38 PM
  5. Replies: 10
    Last Post: 06-22-2020, 03:10 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •