Thread: MDS plots for European modern individuals

  1. #1
    Gold Class Member
    Posts
    7,518
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA (P)
    R-BY3604-Z275
    mtDNA (M)
    H5a1

    Normandie Orkney Netherlands Friesland East Frisia Finland

    MDS plots for European modern individuals

    For several weeks I have been using a way of representing autosomal data known as "Multidimensional Scaling" (MDS). It is a very classical method, but little used on this forum. Some example results have already been posted in other threads, but I thought it useful to summarize things here for those who have not had the opportunity to see them. First, some information.
    a) The program used is PLINK, version 1.9.
    b) The input is genotype data in the form of a PLINK binary fileset (bed / bim / fam).
    c) This method is hypersensitive to sampling differences. It is therefore highly recommended that all analyzed individuals share the same genotyped SNPs. There is no question here of keeping individuals with more than 10% no-calls in the data. Since the basic data are the proportions of shared alleles, there is no question either of mixing diploid and pseudo-haploid genomes. In addition, it is good practice to prune the markers beforehand, so that pairs of markers are not in too high LD (linkage disequilibrium). These constraints mean that the available data must be of rather high coverage. In practice, I have often been forced to skip this pruning step because of insufficient coverage.
    d) For the reasons mentioned above, the presence of closely related individuals is highly toxic, which I realized very early on when I was experimenting with this methodology on family data. It is, of course, possible to cheat by producing separate MDS runs and then merging the components of the targeted individuals with those of the references. This is nothing more than an artifice, and prudence dictates sticking to separate individual plots in the case of related individuals.
    e) One of the difficulties was choosing the right number of MDS components. Some recent studies used 8. My final choice was only 6. In the examples below I'll post only components 1 and 2 (plotted in R with the ggplot2 package).
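
As a rough illustration of what an MDS run on pairwise distances does, here is a minimal Python sketch of classical (Torgerson) MDS. It is not PLINK's code: the function names (`double_center`, `top_eigenpairs`, `classical_mds`) and the toy distance matrix are assumptions made for the example, and a real analysis would start from the distances computed on the shared SNP panel rather than from four invented points.

```python
import math


def double_center(dist):
    """Turn a distance matrix D into the Gram matrix B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]


def top_eigenpairs(mat, k, iters=500):
    """Leading k eigenpairs of a symmetric matrix (power iteration + deflation)."""
    n = len(mat)
    work = [row[:] for row in mat]
    pairs = []
    for c in range(k):
        # Arbitrary deterministic start vector (an assumption of this sketch).
        vec = [math.sin(i + 1 + c) for i in range(n)]
        for _ in range(iters):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        val = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((val, vec))
        # Deflation: remove the component just found before looking for the next.
        for i in range(n):
            for j in range(n):
                work[i][j] -= val * vec[i] * vec[j]
    return pairs


def classical_mds(dist, k=2):
    """Coordinates of each individual on the first k MDS components."""
    pairs = top_eigenpairs(double_center(dist), k)
    return [
        [math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs]
        for i in range(len(dist))
    ]


# Toy example: four "individuals" at the corners of a 4 x 1 rectangle.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
toy_dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(toy_dist, k=2)
for p, c in zip(points, coords):
    print(p, "->", [round(x, 3) for x in c])
```

The recovered coordinates are only defined up to rotation and sign, which is why two MDS plots of the same data can look mirrored. Keeping 6 components instead of 2 would simply mean calling `classical_mds(dist, k=6)`.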

    If all these conditions are met, it is possible to obtain highly reliable plots, independent of any model, which faithfully reflect the true genetic proximities of the individuals in the data. Example: data from 1Kg, HumanOrigins (duplicates eliminated, of course) and Biagini (for the French, without the regional labels), about 1000 individuals in total, all restricted to the same SNP panel (around 450,000 SNPs); the individual no-call rates are never above 0.02. I kept the original labels for 1Kg (GBR = Britain, TSI = Tuscany, IBS = Spain, CEU = Americans of North European descent). I'll post a small .jpg and a large .pdf.
    basis_mds_1-2.jpg
    basis_mds_1-2.pdf

    At the initial request of members of my family, I have added private genomes to these public references. Since I needed good-quality genomes and knew I would face scepticism about the reliability of my own imputation work, I asked those concerned to provide me with data imputed by dna.land (and to direct any scepticism towards Pickrell). To date, I have placed more than twenty individuals (mostly family members, plus some AG forum members), without encountering any reason to doubt the results.

    A few examples. Unmixed individuals:
    Our friend Ruderico (Portuguese), JennyS (private genome, pure Orcadian lady), Kashubian (private genome of a Polish woman from Kashubia, kindly provided by an AG friend). I post the .pdf files; search for "ruderico", "jenny", "kashubian".
    rudy_mds_1-2.pdf
    jenny_mds_1-2.pdf
    kashubian_mds_1-2.pdf
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  2. The Following 19 Users Say Thank You to anglesqueville For This Useful Post:

     Просигој (02-06-2021),  Aben Aboo (02-06-2021),  Bart (02-10-2021),  Bygdedweller (02-06-2021),  dany198124 (02-08-2021),  Finn (02-06-2021),  Helgenes50 (02-08-2021),  JMcB (02-06-2021),  MethCat (03-08-2021),  monedula (02-07-2021),  Nino90 (02-06-2021),  ph2ter (02-06-2021),  randwulf (02-05-2021),  Ruderico (02-05-2021),  sheepslayer (02-06-2021),  Theconqueror (03-11-2021),  timberwolf (02-06-2021),  vettor (02-05-2021),  xerxez (02-11-2021)

  3. #2
    Mixed individuals (familial data):
    dieppe20 (75% Northern Normandy, 25% mixed from the Netherlands, Northern Germany and Finland), dieppe10 (his daughter; her mother is eastern Polish, very likely with many Lithuanian ancestors).
    dieppe20_mds_1-2.pdf
    dieppe10_mds_1-2.pdf

  4. The Following 7 Users Say Thank You to anglesqueville For This Useful Post:

     Просигој (02-06-2021),  Aben Aboo (02-06-2021),  JMcB (02-06-2021),  Nino90 (02-06-2021),  randwulf (02-05-2021),  sheepslayer (02-06-2021),  xerxez (02-11-2021)

  5. #3
    These MDS plots are well suited to a subsequent application of UMAP (as the Margaryan team does, for example, in its study of the Viking period). For my part, I am a bit cyclothymic about UMAP: one day it's fantastic, the next day I don't know. I'll leave it to you to judge (UMAP in R, with the default settings).
    basis_mds6_umap.pdf

  6. The Following 3 Users Say Thank You to anglesqueville For This Useful Post:

     JMcB (02-06-2021),  randwulf (02-06-2021),  xerxez (02-11-2021)

  7. #4
    Registered Users
    Posts
    1,041
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA (P)
    I2a2a1b2-CTS1977
    mtDNA (M)
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by anglesqueville View Post
    These MDS plots are well suited to a subsequent application of UMAP (as the Margaryan team does, for example, in its study of the Viking period). For my part, I am a bit cyclothymic about UMAP: one day it's fantastic, the next day I don't know. I'll leave it to you to judge (UMAP in R, with the default settings).
    basis_mds6_umap.pdf
    IMO your umap is very promising.
    Maybe Generalissimo should apply your workflow to the Global25 dataset.

  8. The Following 4 Users Say Thank You to Huijbregts For This Useful Post:

     anglesqueville (02-06-2021),  JMcB (02-06-2021),  Nino90 (02-06-2021),  randwulf (02-06-2021)

  9. #5
    Registered Users
    Posts
    4,804
    Sex

    Sorry, but I can only use analyses that produce familiar, easily reproducible shapes that have already appeared in major papers.

    The reason for this is simple: such shapes are likely to be more reliable. I can also cite such examples when anyone asks what I'm doing.

    Take a look at the PCAs in various studies focusing on PCA, genes vs geography, etc, and you'll come across exactly these shapes.

    https://vahaduo.github.io/g25views/#Europe1

    https://vahaduo.github.io/g25views/#WestEurasia

    https://vahaduo.github.io/g25views/#NorthEurasia1

  10. The Following 3 Users Say Thank You to Generalissimo For This Useful Post:

     Helgenes50 (02-08-2021),  JMcB (02-07-2021),  randwulf (02-07-2021)

  11. #6
    Regarding UMAP, I do not think it useful to go into detail again about my uncertainties. Simply put, when UMAP seems to make structures appear, I am never sure that those structures are real, and I always fear over-interpretation. For example, UMAP applied to the MDS components abruptly separates the Basque cluster. It would obviously be badly wrong to forget, on this basis, that the Basques nevertheless share most of their autosomal characteristics with the Spaniards and with the French of the Aquitaine region, which appears clearly in the original representations. In particular, I am not sure what UMAP adds when the original representations are already clear, which seems to be the case here for those obtained with MDS.

    As for MDS itself: I don't quite understand what Huijbregts has in mind with G25. The ability of smartpca to project poor-quality data onto a reference panel (lsqproject, for connoisseurs) is absent here, as I have already said emphatically. This fact alone makes it very difficult (to put it mildly) to imagine applying PLINK's MDS algorithm to a data file like G25's, even when limited to modern diploid data. This is also why you find neither Germans nor Dutch in the MDS plots I published: the German and Dutch genomes that I have are too poor in SNPs.

    My first aim was to answer a question asked by a member of my family clan. In short: "G25 is obtained by applying a PCA algorithm to genetic data. Are you sure that another algorithm would give at least a similar picture?" So far I have found no reason to answer anything other than "yes", at least for PLINK's MDS. My only open question is the one I mentioned a moment ago (which directly concerns my paternal family): Germans and Dutch. But there you go... no data, no points.

  12. The Following 2 Users Say Thank You to anglesqueville For This Useful Post:

     JMcB (02-07-2021),  randwulf (02-07-2021)

  13. #7
    Quote Originally Posted by Generalissimo View Post
    Sorry, but I can only use analyses that produce familiar, easily reproducible shapes that have already appeared in major papers.

    The reason for this is simple: such shapes are likely to be more reliable. I can also cite such examples when anyone asks what I'm doing.

    Take a look at the PCAs in various studies focusing on PCA, genes vs geography, etc, and you'll come across exactly these shapes.

    https://vahaduo.github.io/g25views/#Europe1

    https://vahaduo.github.io/g25views/#WestEurasia

    https://vahaduo.github.io/g25views/#NorthEurasia1
    When I half-seriously suggested that you should do a UMAP analysis, it was mainly because I am very happy with the Global25 dataset and I would love to see how it would look in UMAP.
    However, I understand your motivation to conform to the PCA plots from the major papers.

    On the other hand, PCA has its shortcomings.
    With data from the Bronze Age or older, PCA plots are fine, because the populations are genetically very different.
    But modern data, especially from Western Europe, are more similar, and the within-group variance may even be larger than the between-group variance. This makes PCA plots fuzzy and less easy to interpret.

    From a theoretical point of view, PCA is a dimensionality reduction that preserves the directions of maximum variance (the leading eigenvectors of the covariance matrix).
    As a consequence, PCA has a few drawbacks:
    - The directions of maximum variance depend on the relative frequencies of the populations in the sample (sampling frequency).
    - As a consequence, PCA favors heavily sampled population groups, i.e. continent-wide groups.
    - Contrasts between groups that have low frequencies or small genetic differences can in theory also be retrieved by PCA, but only in the higher dimensions.
    - Global25 has 25 dimensions, but unfortunately the use of the higher dimensions is unreliable because of the 'curse of dimensionality'. It would take (more than) millions of samples to adequately fill a 25-dimensional space.
    If one runs the calculation with too few samples, the result will be overfitted.
    - PCA achieves an effective dimensionality reduction by truncating after a low number of dimensions.
    - As a consequence, PCA on the Global25 dataset cannot reliably retrieve fine-grained differences between genetically similar populations, which only show up in the higher dimensions.
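
The first point in the list above can be checked on a toy example. The sketch below is plain Python with three made-up 2-D "populations" A, B and C (the coordinates, group sizes and the helper name `first_axis_angle_deg` are all invented for illustration, nothing comes from real genotype data). It computes the first principal axis of the pooled samples and shows that the axis rotates when the sampling proportions change, although the populations themselves have not moved.

```python
import math


def first_axis_angle_deg(points):
    """Angle (degrees, in [0, 180)) of the first principal axis of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the major axis of a 2 x 2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.degrees(angle) % 180.0


# Made-up group centres: the A-B contrast lies along x, the A-C contrast along y.
A, B, C = (0.0, 0.0), (2.0, 0.0), (0.0, 3.0)

many_AB = [A] * 50 + [B] * 50 + [C] * 2  # C is barely sampled
many_AC = [A] * 50 + [B] * 2 + [C] * 50  # B is barely sampled

angle_AB = first_axis_angle_deg(many_AB)
angle_AC = first_axis_angle_deg(many_AC)
print(round(angle_AB, 1), round(angle_AC, 1))
```

In the first sample the leading axis is close to horizontal (it tracks the A-B contrast), in the second it is close to vertical (it tracks A-C). The poorly sampled group is pushed to the next component in both cases.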

    So the question is: can we find a dimensional reduction that better preserves information on the nearby populations?
    Quite a number of nonlinear dimensional reduction algorithms can perform this trick.
    But at this moment the primus inter pares is UMAP, mainly because it preserves both the local and the faraway structure.


    It is true that UMAP doesn't produce the shapes that are familiar from PCA.
    It is also true that it shouldn't: UMAP doesn't preserve the eigenvectors, but the local structure (nearest neighbors). This is what it should do better than PCA.
    I don't think that the shapes of PCA should be called more reliable in general. They are so when the criterion is preservation of long-distance clines.
    But when the criterion is separation of local structures, UMAP should be better.

  14. The Following 7 Users Say Thank You to Huijbregts For This Useful Post:

     anglesqueville (02-07-2021),  Bart (02-10-2021),  Helgenes50 (02-08-2021),  JMcB (02-07-2021),  ph2ter (02-07-2021),  Ruderico (02-07-2021),  sheepslayer (02-07-2021)

  15. #8
    Huijbregts, for a long time I have been curious to know what UMAP would give if it could be applied directly to the genetic data, without first passing through a reduction algorithm (PCA or MDS). Stupidly, I had the solution right under my nose! It is enough to compute (with PLINK, for example) the matrix of pseudo-distances 1-ibs (ibs = proportion of shared alleles) and to apply UMAP to it (not forgetting to set input = "dist" in the settings). It works without difficulty. This is brand new, so I haven't touched the other settings. I do not yet know what use this may have, nor what the limitations of the method are (probably those inherent in any use of the 1-ibs pseudo-metric; I'll have to think about it). In any case, here is a first example of a "really pure" UMAP.

    basis_mdist_umap.jpg
    basis_mdist_umap.pdf
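
Before handing such a matrix to UMAP, it can be worth checking that it really is a well-formed distance matrix. The Python sketch below assumes a square, whitespace-delimited text matrix plus a two-column ID list (family ID, individual ID), which is the layout I would expect from PLINK 1.9's `--distance square 1-ibs` output; the toy values, the IDs and the helper names are invented for the example.

```python
def parse_square_matrix(text):
    """Parse a whitespace-delimited square matrix of floats."""
    rows = [[float(x) for x in line.split()] for line in text.strip().splitlines()]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("matrix is not square")
    return rows


def parse_ids(text):
    """Parse one (family ID, individual ID) pair per line."""
    return [tuple(line.split()[:2]) for line in text.strip().splitlines()]


def check_distance_matrix(dist, tol=1e-9):
    """Raise if the matrix is not symmetric, non-negative, with a zero diagonal."""
    for i in range(len(dist)):
        if abs(dist[i][i]) > tol:
            raise ValueError(f"non-zero diagonal at row {i}")
        for j in range(i):
            if abs(dist[i][j] - dist[j][i]) > tol:
                raise ValueError(f"asymmetry at ({i}, {j})")
            if dist[i][j] < 0:
                raise ValueError(f"negative distance at ({i}, {j})")


def nearest_neighbour(dist, i):
    """Index of the individual closest to individual i (excluding i itself)."""
    others = [j for j in range(len(dist)) if j != i]
    return min(others, key=lambda j: dist[i][j])


# Invented stand-ins for the matrix file and its ID file.
toy_mdist = """
0     0.02  0.30
0.02  0     0.31
0.30  0.31  0
"""
toy_ids = """
POP1 ind1
POP1 ind2
POP2 ind3
"""

dist = parse_square_matrix(toy_mdist)
ids = parse_ids(toy_ids)
check_distance_matrix(dist)
for i, (fid, iid) in enumerate(ids):
    print(iid, "is closest to", ids[nearest_neighbour(dist, i)][1])

# With the umap-learn package installed, the checked matrix could then be
# embedded with: umap.UMAP(metric="precomputed").fit_transform(dist)
```

The `metric="precomputed"` option of umap-learn plays the same role as the `input = "dist"` setting mentioned above for R: it tells UMAP that the rows are distances and not coordinates.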

  16. The Following 5 Users Say Thank You to anglesqueville For This Useful Post:

     Huijbregts (02-07-2021),  JMcB (02-07-2021),  mokordo (02-10-2021),  randwulf (02-08-2021),  Ruderico (02-07-2021)

  17. #9
    Global Moderator
    Posts
    3,723
    Sex
    Location
    Vissaiom
    Ethnicity
    Portuguese highlander
    Y-DNA (P)
    E-Y31991>FT17866
    mtDNA (M)
    H20 (xH20a)

    Asturias Galicia Portugal 1143 Portugal 1485 Portugal Order of Christ PortugalRoyalFlag1830
    Huijbregts, how many dimensions would you suggest we truncate the data at? I remember Angles made an experiment and suggested 8, based on distances between NW Europeans.
    YDNA E-Y31991>PF4428>Y134097>Y134104>Y168273>FT17866 (TMRCA ~1100AD) - Domingos Rodrigues, b. circa 1690 Hidden Content , Viana do Castelo, Portugal - Stonemason, miller.
    mtDNA H20 - Monica Vieira, b. circa 1700 Hidden Content , Porto, Portugal

    Hidden Content
    Global25 PCA West Eurasia dataset Hidden Content

    [1] "distance%=1.6007"

    Ruderico

    NW_Iberia_IA,80.4
    Berber_EMA,11
    Roman_Colonial,8.6

  18. #10
    Quote Originally Posted by Ruderico View Post
    Huijbregts, how many dimensions would you suggest we truncate the data at? I remember Angles made an experiment and suggested 8, based on distances between NW Europeans.
    Ruderico, in the context of my last post this question doesn't arise, precisely because there is no dimension reduction before UMAP. The input consists of all the pairwise distances between individuals ("distances" in the sense of 1-ibs, computed on the SNP panel).
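
To make "distance in the sense of 1-ibs" concrete, here is a minimal Python sketch. It assumes genotypes coded as allele counts 0/1/2 with `None` for a no-call, and it only counts SNPs called in both individuals; PLINK's exact handling of missing data may differ, and the three toy genomes and function names are invented for the example.

```python
def missing_rate(geno):
    """Fraction of no-calls (None) in one individual's genotype vector."""
    return sum(g is None for g in geno) / len(geno)


def one_minus_ibs(g1, g2):
    """1 - (proportion of alleles shared identical-by-state) over jointly called SNPs."""
    shared = 0
    called = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs with a no-call in either individual
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this SNP
        called += 1
    if called == 0:
        raise ValueError("no jointly called SNPs")
    return 1.0 - shared / (2 * called)


# Invented toy genomes over six SNPs.
genomes = {
    "a": [0, 1, 2, 2, 0, 1],
    "b": [0, 1, 2, 1, 0, None],
    "c": [2, 1, 0, 0, 2, 1],
}
names = sorted(genomes)
matrix = [[one_minus_ibs(genomes[p], genomes[q]) for q in names] for p in names]

for name, row in zip(names, matrix):
    print(name, [round(x, 3) for x in row], "no-call rate:", round(missing_rate(genomes[name]), 3))
```

A pseudo-haploid genome can only take the values 0 and 2 in this coding, so it can never be fully identical to a diploid heterozygote at a given SNP. That is one way to see why mixing the two kinds of genomes distorts the distances.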

  19. The Following 2 Users Say Thank You to anglesqueville For This Useful Post:

     JMcB (02-08-2021),  Ruderico (02-07-2021)
