
Thread: Scaled or unscaled, penalty on or off

  1. #1
    Gold Member Class
    Posts
    2,624
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland

    Scaled or unscaled, penalty on or off

    This may have been posted before, apologies if so.

    What are the pros and cons of each of these settings?

    Scaled, penalty on
    Scaled, penalty off
    Unscaled, penalty on
    Unscaled, penalty off
    Please support Mental health research and world community grid


  2. The Following User Says Thank You to firemonkey For This Useful Post:

     Adam.Krauze (01-10-2019)

  3. #2
    Junior Member
    Posts
    8
    Sex

    I'd also like to know. All I've read is that some people say one thing and others say another.

  4. #3
    Registered Users
    Posts
    1,067
    Sex
    Location
    Lisbon, Portugal
    Ethnicity
    Romanised Celtiberian
    Nationality
    Portuguese
    Y-DNA
    E-BY36857
    mtDNA
    H20

    Portugal 1143 Portugal 1485 Portugal Order of Christ
    Oh boy, this has the potential to get heated.

    What I can tell you is that Davidski uses scaled coordinates, but Ger Huijbregts (who created nMonte) doesn't, because scaling discards information at higher dimensions.
    If I recall correctly, scaling works by multiplying each coordinate by the square root of its eigenvalue, so at higher dimensions (with low eigenvalues) each coordinate will be close to 0.
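
    As a rough illustration of the scaling described above, here is a minimal sketch. The function name `scale_coords`, the eigenvalues and the sample coordinates are all made up for the example; they are not taken from the G25 files.

```python
import math

def scale_coords(unscaled, eigenvalues):
    """Multiply each PC coordinate by the square root of that PC's eigenvalue."""
    if len(unscaled) != len(eigenvalues):
        raise ValueError("need one eigenvalue per coordinate")
    return [x * math.sqrt(ev) for x, ev in zip(unscaled, eigenvalues)]

# Hypothetical eigenvalues, largest first, and a hypothetical unscaled sample.
eigenvalues = [100.0, 25.0, 1.0, 0.25]
sample = [0.02, -0.04, 0.03, -0.05]

scaled = scale_coords(sample, eigenvalues)
print(scaled)
```

    With these invented numbers the first coordinate is stretched by a factor of 10 while the last is halved, which is the shrinking of the higher dimensions mentioned above.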

    I don't remember seeing genetics papers with scaled coordinates, but then again I don't know how they build the PCAs in the papers (I suppose it's a lot more complex than G25), nor whether that makes a difference. With that said, and since I'm in no position to judge on the actual methodology of PCAs when it comes to genetic data, I stick with what I've seen them do and use unscaled coordinates most of the time. But I really don't care either way.

    I suspect most people here use scaled coordinates because Davidski does too. Keep in mind that a pleasing result doesn't validate an incorrect method (e.g. a dataset with numerous, rather similar pops combined with pen=0).
    Last edited by Ruderico; 01-11-2019 at 11:06 AM. Reason: fixed
    YDNA - E-Y31991>PF4428>BY36857. Domingos Rodrigues, b. circa 1680, Viana do Castelo, Portugal
    mtDNA - H20. Maria Josefa de Almeida, b. circa 1750, Porto, Portugal



    [1] "distance%=0.8188"

    Ruderico

    Scotland_LBA,39.6
    ALPc_MN,21.4
    England_CA_EBA,18.8
    Ukraine_N_o,14.6
    Iberomaurusian,5.6

  5. The Following 3 Users Say Thank You to Ruderico For This Useful Post:

     JMcB (01-10-2019),  Trelvern (01-11-2019),  Wing Genealogist (01-11-2019)

  6. #4
    Moderator
    Posts
    5,309
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA
    U152>L2>Z367
    mtDNA
    H5a1

    Normandie Netherlands Friesland Finland Orkney
    Quote Originally Posted by Ruderico View Post
    Oh boy, this has the potential to get heated. (1)

    What I can tell you is that Davidski uses scaled coordinates, but Ger Huijbregts (who created nMonte) doesn't because it discards information at higher dimensions.
    If I recall correctly, scaling works by multiplying each coordinate by the square root of its eigenvalue, so at higher dimensions (with low eigenvalues) each coordinate will be close to 0.

    I don't remember seeing genetics papers with scaled coordinates, but then again I don't know how they build the PCAs in the papers (2) (I suppose it's a lot more complex than G25), nor whether that makes a difference. With that said, and since I'm in no position to judge on the actual methodology of PCAs when it comes to genetic data, I stick with what I've seen them do and use unscaled coordinates most of the time. But I really don't care either way.

    I suspect most people here use scaled coordinates because Davidski does too. Keep in mind that a pleasing result doesn't validate an incorrect method (e.g. a dataset with numerous, rather similar pops combined with pen=0) (3)
    (1) wisdom speaks through your mouth
    (2) most recent papers, if not all, use smartpca (part of the EIGENSOFT bundle), as Eurogenes-G25 does.
    (3) 1000% agree
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  7. The Following 2 Users Say Thank You to anglesqueville For This Useful Post:

     JMcB (01-11-2019),  Ruderico (01-11-2019)

  8. #5
    Gold Member Class
    Posts
    2,624
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    Quote Originally Posted by Ruderico View Post
    Oh boy, this has the potential to get heated.

    What I can tell you is that Davidski uses scaled coordinates, but Ger Huijbregts (who created nMonte) doesn't because it discards information at higher dimensions.

    So two experts in such matters can't concur? That makes it very difficult for a simpleton in such matters, such as myself, to know which is best to use for accuracy. I think this plays into my anxiety over uncertainty. Perhaps I'm looking for a definitive answer that nMonte is not designed/equipped to give.

  9. The Following 3 Users Say Thank You to firemonkey For This Useful Post:

     JMcB (01-11-2019),  Ruderico (01-11-2019),  RVBLAKE (01-11-2019)

  10. #6
    Registered Users
    Posts
    1,067
    Sex
    Location
    Lisbon, Portugal
    Ethnicity
    Romanised Celtiberian
    Nationality
    Portuguese
    Y-DNA
    E-BY36857
    mtDNA
    H20

    Portugal 1143 Portugal 1485 Portugal Order of Christ
    Quote Originally Posted by firemonkey View Post
    So two experts in such matters can't concur? That makes it very difficult for a simpleton in such matters, such as myself, to know which is best to use for accuracy. I think this plays into my anxiety over uncertainty. Perhaps I'm looking for a definitive answer that nMonte is not designed/equipped to give.
    I'm far from being an expert in data science, but from what I know of statistics, unscaled makes more sense. Regardless, knowing the limits of my knowledge on the matter, I let the big boys call the shots. As far as I'm aware scientific papers don't scale either (I don't recall Nick Patterson doing it), which is why I use unscaled coordinates myself. If one day they start scaling or transforming the coordinates in whatever way, I'll gladly do the same. I have no strong feelings one way or the other.

  11. The Following 3 Users Say Thank You to Ruderico For This Useful Post:

     JMcB (01-11-2019),  sktibo (01-11-2019),  Trelvern (01-11-2019)

  12. #7
    Registered Users
    Posts
    730
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA
    I2a2a1b2-CTS1977
    mtDNA
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by firemonkey View Post
    So two experts in such matters can't concur? That makes it very difficult for a simpleton in such matters, such as myself, to know which is best to use for accuracy. I think this plays into my anxiety over uncertainty. Perhaps I'm looking for a definitive answer that nMonte is not designed/equipped to give.
    Wait until I publish nMonte4; you will have the time of your life.
    But seriously, I have some advice for you.
    You are mainly interested in your relation with modern NW-European populations.
    In that case even proponents of the scaled method should admit that the main differences between these populations are in the higher dimensions.
    And in that case the unscaled method is methodologically superior to the scaled method.

  13. The Following 4 Users Say Thank You to Huijbregts For This Useful Post:

     JMcB (01-11-2019),  ph2ter (01-11-2019),  PoxVoldius (01-11-2019),  Ruderico (01-11-2019)

  14. #8
    Gold Member Class
    Posts
    2,624
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    Quote Originally Posted by Huijbregts View Post
    This is how I became the grumbling coauthor of a method that I don't believe in.
    Are you talking about nMonte? If so are there better alternatives and how easy would it be for people to use them ?

  15. #9
    Gold Member Class
    Posts
    2,624
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    Quote Originally Posted by Huijbregts View Post
    Wait until I publish nMonte4; you will have the time of your life.
    But seriously, I have some advice for you.
    You are mainly interested in your relation with modern NW-European populations.
    In that case even proponents of the scaled method should admit that the main differences between these populations are in the higher dimensions.
    And in that case the unscaled method is methodologically superior to the scaled method.
    I tried scaled and unscaled on default settings, using the 1000/500 25% filter for the British Isles.

    The scaled made more sense.

  16. The Following User Says Thank You to firemonkey For This Useful Post:

     Huijbregts (01-11-2019)

  17. #10
    Registered Users
    Posts
    317
    Sex

    Quote Originally Posted by Ruderico View Post
    If I recall correctly, scaling works by multiplying each coordinate by the square root of its eigenvalue, so at higher dimensions (with low eigenvalues) each coordinate will be close to 0.

    I don't remember seeing genetics papers with scaled coordinates, but then again I don't know how they build the PCAs in the papers (I suppose it's a lot more complex than G25), nor whether that makes a difference. With that said, and since I'm in no position to judge on the actual methodology of PCAs when it comes to genetic data, I stick with what I've seen them do and use unscaled coordinates most of the time. But I really don't care either way.
    Yeah, higher dimensions will represent decreasing shares of variance with scaled coordinates, and with unscaled coordinates they will all represent an equal share of variance. But the shares should decrease; otherwise what does it mean for a PC to be PC1 or PC2 (they're ordered by rank of variance)?

    Scaling by the square root of the eigenvalues reproduces in the PCA the Euclidean distance between samples that you would get by calculating the Euclidean distance directly on the data. With unscaled data, of course, it won't.
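
    The distance claim above can be checked on a toy dataset. This is only a sketch under stated assumptions: the four two-feature "samples" are invented, `pca_2d` is a hypothetical helper, and "unscaled" is assumed to mean unit-length PC columns that "scaled" multiplies by a per-PC factor proportional to the square root of the eigenvalue. With all PCs kept the match is exact; with a truncated set such as 25 PCs it would only be approximate.

```python
import math

def center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def pca_2d(rows):
    """PCA of two-feature data via the closed-form 2x2 eigen-decomposition.

    Returns (unit_scores, singular_values): each column of unit_scores has
    unit length, and singular_values[j] is proportional to the square root
    of the j-th eigenvalue.
    """
    x = center(rows)
    a = sum(r[0] * r[0] for r in x)
    b = sum(r[0] * r[1] for r in x)
    c = sum(r[1] * r[1] for r in x)
    theta = 0.5 * math.atan2(2 * b, a - c)  # angle of the first principal axis
    axes = [(math.cos(theta), math.sin(theta)),
            (-math.sin(theta), math.cos(theta))]
    proj = [[r[0] * v[0] + r[1] * v[1] for v in axes] for r in x]
    sing = [math.sqrt(sum(p[j] ** 2 for p in proj)) for j in range(2)]
    unit = [[p[j] / sing[j] for j in range(2)] for p in proj]
    return unit, sing

# Invented data: four samples, two features.
raw = [[2.0, 1.0], [4.0, 3.0], [6.0, 4.5], [8.0, 7.5]]
unscaled, sing = pca_2d(raw)
scaled = [[u * s for u, s in zip(row, sing)] for row in unscaled]

d_raw = math.dist(raw[0], raw[3])
d_scaled = math.dist(scaled[0], scaled[3])
d_unscaled = math.dist(unscaled[0], unscaled[3])
print(d_raw, d_scaled, d_unscaled)
```

    Under these assumptions the raw and scaled distances agree, while the unscaled distance does not.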

    The lowest dimensions, which represent the largest shifts in allele frequencies, should also be the most robustly inferred and the least noisy.

    Anyway, this is not to argue that you should use the scaled data for nMonte (I would, though not too strongly; I have argued it before, but I can't find the post using this forum's search software, can't be bothered to write it again, and it wouldn't bring the matter to a conclusion anyway). If you want to calculate a single distance between two items, though, you should certainly use the scaled dataset, as the unscaled dataset will give non-meaningful information.

    Just for context on the relative importance of the early vs late dimensions, here are some distances for the closest 30 populations computed on the Welsh population average, using different numbers of scaled dimensions: JOSf63q.png

    You can kind of see that even two dimensions is pretty intuitive on what the closest neighbours would be expected to be for the Welsh.

    Even if you did discard all the later dimensions, which scaling does not do (or else results for 2 and 25 would be identical and they're not), two dimensions summarizes a lot, particularly for the main West Eurasian populations who are heavily sampled in this panel (perhaps less so, Papuans or Onge, but most of the audience are not Papuans). So the later dimensions don't actually *need* to be very large (they can be very small) and still infer the right closest neighbours, because they're functionally only for fine tuning anyway.
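
    The exercise described above (ranking the closest populations using only the first few dimensions) can be sketched as follows. The population names and coordinates are invented purely for illustration and `nearest_names` is a hypothetical helper; real input would be rows from a scaled G25 sheet.

```python
import math

def nearest_names(target, populations, dims):
    """Rank population names by Euclidean distance to target, using only the first `dims` coordinates."""
    ranked = sorted(
        (math.dist(target[:dims], coords[:dims]), name)
        for name, coords in populations.items()
    )
    return [name for _, name in ranked]

# Invented scaled coordinates (4 dimensions instead of 25, for brevity).
target = [0.10, 0.05, 0.010, -0.005]
populations = {
    "Pop_A": [0.11, 0.05, -0.020, 0.010],
    "Pop_B": [0.12, 0.06, 0.010, -0.005],
    "Pop_C": [0.30, -0.10, 0.000, 0.000],
}

order_2d = nearest_names(target, populations, dims=2)
order_4d = nearest_names(target, populations, dims=4)
print(order_2d)
print(order_4d)
```

    In this made-up case two dimensions already separate the distant population from the two near ones, and the small later dimensions only change the order of the two closest, which is the fine-tuning role described above.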

    (Comparatively same exercise using unscaled data - TMq33ji.png. I can't see any sign of a qualitative improvement in the rank order of the most intuitively close populations, and if anything a very slight decrease in correlation with intuition and geography and geographically expected recent ancestry.)

    Re: what papers do, the issue of whether they scale the values kind of doesn't matter, because they use the PCA for visualisation (where scaling doesn't matter, because all it does is change the magnitude of the numbers on the axes, not the appearance of the plot) and not for distance calculations, where it does.
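
    The visualisation point can be illustrated with a small sketch. It assumes each plot axis is auto-fitted to the range of its own data, as most plotting tools do by default; the coordinates and the scaling factor are made up, and `to_plot_fractions` is a hypothetical helper.

```python
import math

def to_plot_fractions(values):
    """Position of each point along an axis fitted to the data range (0 = left edge, 1 = right edge)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Invented unscaled PC1 values for four samples, and a hypothetical scaling factor.
pc1_unscaled = [0.02, -0.01, 0.03, 0.00]
factor = 12.0  # stands in for sqrt(eigenvalue) of this PC
pc1_scaled = [v * factor for v in pc1_unscaled]

pos_unscaled = to_plot_fractions(pc1_unscaled)
pos_scaled = to_plot_fractions(pc1_scaled)
print(pos_unscaled)
print(pos_scaled)
```

    The points land in the same relative positions either way; only the tick labels differ. If both axes were instead forced to share one numeric scale, the scaled plot would look squashed along the lower-variance PC.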

  18. The Following 4 Users Say Thank You to Eterne For This Useful Post:

     LTG (01-11-2019),  Ruderico (01-12-2019),  Ryukendo (01-19-2019),  sktibo (01-12-2019)
