
Thread: Scaled or unscaled, penalty on or off

  1. #11
    Gold Member Class
    Posts
    2,622
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    Quote Originally Posted by Ruderico View Post
    Oh boy, this has the potential to get heated.

    What I can tell you is that Davidski uses scaled coordinates, but Ger Huijbregts (who created nMonte) doesn't because it discards information at higher dimensions.

    So two experts in such matters can't concur? That makes it very difficult for a simpleton in such matters, such as myself, to know which is best to use for accuracy. I think this plays into my anxiety over uncertainty. Perhaps I'm looking for a definitive answer that nMonte is not designed/equipped to give.
    Please support Mental health research and world community grid


  2. The Following 3 Users Say Thank You to firemonkey For This Useful Post:

     JMcB (01-11-2019),  Ruderico (01-11-2019),  RVBLAKE (01-11-2019)

  3. #12
    Registered Users
    Posts
    361
    Sex
    Location
    United Kingdom
    Ethnicity
    English & Greek
    Nationality
    British
    Y-DNA
    J2-L397
    mtDNA
    H2a2a1

    United Kingdom England Greece Cyprus
    I am pretty conflicted after seeing Huijbregts' comment. I have personally used scaled from the beginning; even the PCA that Davidski provides with your first email and results is scaled, as are all of his models over at Eurogenes Blog. This does set a precedent to use scaled. For this reason, all of my personal models and PCA diagrams over the past six or so months have used scaled coordinates, as have most of the models created by other members on this site. Considering all of the wacky results I get with unscaled, including single item distances, unreliable models and strange positioning on PCA diagrams, suddenly accepting that this is in fact the canonical way to use the Global 25 is something that I'm not sure I'll be able to accept until a strong consensus has been reached.

    I place great value in Huijbregts' opinion for obvious reasons, but I also acknowledge that Davidski was the one who created the Global 25. He probably has the most extensive experience of anyone on this site when it comes to comparing scaled/unscaled models to the current academic literature to find out which one produces the most sound results. I would appreciate his input on this topic.

  4. The Following User Says Thank You to LTG For This Useful Post:

     michal3141 (01-11-2019)

  5. #13
    Registered Users
    Posts
    1,062
    Sex
    Location
    Lisbon, Portugal
    Ethnicity
    Romanised Celtiberian
    Nationality
    Portuguese
    Y-DNA
    E-BY36857
    mtDNA
    H20

    Portugal 1143 Portugal 1485 Portugal Order of Christ
    Quote Originally Posted by firemonkey View Post
    So two experts in such matters can't concur? That makes it very difficult for a simpleton in such matters, such as myself, to know which is best to use for accuracy. I think this plays into my anxiety over uncertainty. Perhaps I'm looking for a definitive answer that nMonte is not designed/equipped to give.
    I'm far from being an expert in data science, but from what I know of statistics unscaled makes more sense. Regardless, and knowing the limits of my knowledge on the matter, I let the big boys call the shots. As far as I'm aware scientific papers don't scale either (I don't recall Nick Patterson doing it), which is why I use unscaled coordinates myself. If one day they start scaling or transforming the coordinates in whatever way, I'll gladly do the same. I have no strong feeling one way or the other.
    YDNA - E-Y31991>PF4428>BY36857. Domingos Rodrigues, b. circa 1680, Viana do Castelo, Portugal
    mtDNA - H20. Maria Josefa de Almeida, b. circa 1750, Porto, Portugal



    [1] "distance%=0.8188"

    Ruderico

    Scotland_LBA,39.6
    ALPc_MN,21.4
    England_CA_EBA,18.8
    Ukraine_N_o,14.6
    Iberomaurusian,5.6

  6. The Following 3 Users Say Thank You to Ruderico For This Useful Post:

     JMcB (01-11-2019),  sktibo (01-11-2019),  Trelvern (01-11-2019)

  7. #14
    Registered Users
    Posts
    730
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA
    I2a2a1b2-CTS1977
    mtDNA
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by LTG View Post
    I am pretty conflicted after seeing Huijbregts' comment. I have personally used scaled from the beginning; even the PCA that Davidski provides with your first email and results is scaled, as are all of his models over at Eurogenes Blog. This does set a precedent to use scaled. For this reason, all of my personal models and PCA diagrams over the past six or so months have used scaled coordinates, as have most of the models created by other members on this site. Considering all of the wacky results I get with unscaled, including single item distances, unreliable models and strange positioning on PCA diagrams, suddenly accepting that this is in fact the canonical way to use the Global 25 is something that I'm not sure I'll be able to accept until a strong consensus has been reached.

    I place great value in Huijbregts' opinion for obvious reasons, but I also acknowledge that Davidski was the one who created the Global 25. He probably has the most extensive experience of anyone on this site when it comes to comparing scaled/unscaled models to the current academic literature to find out which one produces the most sound results. I would appreciate his input on this topic.
    What all of us have in common is the understanding that working with unscaled data will sooner or later run you into trouble.
    Where we differ is the solution that we propose.
    At the start of this long-running controversy is the assertion that pairwise distances of ancient populations from Global10 are inconsistent.
    Someone proposed a simplistic transformation to repair this by multiplying the PCA scores by the eigenvalues.
    That transformation failed to repair the inconsistencies. I commented that he should at least have multiplied by the square root of the eigenvalues. This is how I became the grumbling coauthor of a method that I don't believe in.
    But even with the square root the method is still so amateurish that it should not pass peer review at an academic journal.
    Moreover, in modern data science there are countless methods to control the overfitting of multivariate data.
    I think that the slight penalization of greater distances in nMonte3 is in line with regularization in modern data science.

    I suppose that the better performance with ancient data is the main reason that Davidski adheres to scaled data.
    However, this better performance can easily be explained.
    The Global25 dataset is dominated by modern samples and so is the Global25 PCA. So ancient data are mismatched in the Global25 PCA.
    This difference between ancient and modern data is a result of modern changes in population genetics, which are less omnipresent than the ancient substrates.
    So the recent changes in modern DNA have a higher probability of ending up in the higher dimensions of the PCA. The ancient populations are mainly represented in the lower dimensions; for them the higher dimensions are misfitted noise.
    As shown in the plot of my previous post, the scaling transformation drastically reduces the higher dimensions and keeps the lower dimensions.
    So when fitting ancient populations with nMonte, scaled models will do better than unscaled (unless you penalize with nMonte3, but even then the fact remains that in the Global25 PCA ancient populations are undersampled).
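
To make the effect of the two transformations mentioned in this post concrete, here is a minimal sketch. The eigenvalues and the sample coordinates are made-up illustrative numbers, not the real Global25 values, and `rescale` is a hypothetical helper, not part of nMonte.

```python
import math

# Hypothetical eigenvalues for a 5-dimensional PCA, in descending order.
# Illustrative only; these are NOT the real Global25 eigenvalues.
eigenvalues = [16.0, 9.0, 4.0, 1.0, 0.25]

# One sample's unscaled coordinates: every dimension has a similar magnitude.
unscaled = [0.10, -0.10, 0.10, -0.10, 0.10]

def rescale(coords, weights):
    """Multiply each PCA coordinate by the weight of its dimension."""
    return [c * w for c, w in zip(coords, weights)]

# Scaling by the square root of the eigenvalues (the "scaled" convention
# discussed in this thread).
by_sqrt = rescale(unscaled, [math.sqrt(ev) for ev in eigenvalues])

# Scaling by the raw eigenvalues (the earlier, more aggressive proposal).
by_raw = rescale(unscaled, eigenvalues)

print("sqrt(eigenvalue) scaling:", [round(v, 3) for v in by_sqrt])
print("raw eigenvalue scaling:  ", [round(v, 3) for v in by_raw])
```

With these toy numbers the first coordinate ends up 8 times larger than the last under square-root scaling and 64 times larger under raw-eigenvalue scaling, which is the sense in which scaling shrinks the higher dimensions relative to the lower ones.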

    You mention that both your single item distances and your models are wacky unless you scale them.
    I read from your info that your ethnicity is English & Greek. English and Greek DNA are far apart, and if you are a 50-50 mixture you are in the middle and a nearest neighbor to neither of them.
    So it is logical that your single item distance is bullshit. Actually it would be odd if your scaled data did show plausible near neighbors.
    I cannot explain why your unscaled nMonte models are not satisfactory. Maybe the Greek part of your DNA has some rare admixtures in the higher dimensions.
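
The midpoint argument can be checked with a small sketch. The coordinates below are invented 3-dimensional values standing in for two distant populations, and `mix` and `distance` are hypothetical helpers; nothing here is nMonte's actual fitting procedure.

```python
import math

def distance(a, b):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mix(a, b, share_a):
    """Coordinates of a mixture with proportion share_a of a, the rest of b."""
    return [share_a * x + (1 - share_a) * y for x, y in zip(a, b)]

# Hypothetical coordinates for two populations that lie far apart.
pop_a = [0.12, 0.02, -0.01]
pop_b = [0.04, 0.10, 0.03]

# A person who is an exact 50-50 mixture sits at the midpoint.
target = mix(pop_a, pop_b, 0.5)

d_a = distance(target, pop_a)    # single item distance to A
d_b = distance(target, pop_b)    # single item distance to B
d_ab = distance(pop_a, pop_b)    # distance between the two sources
d_model = distance(target, mix(pop_a, pop_b, 0.5))  # two-source model fit

print(d_a, d_b, d_ab, d_model)
```

The single item distances to either source are each half the distance between the sources, while the two-source model fits with zero distance. So a poor single item distance says little about whether a mixture model will work.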

  8. The Following 5 Users Say Thank You to Huijbregts For This Useful Post:

     JMcB (01-11-2019),  LTG (01-11-2019),  ph2ter (01-11-2019),  PoxVoldius (01-11-2019),  Ruderico (01-11-2019)

  9. #15
    Registered Users
    Posts
    730
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA
    I2a2a1b2-CTS1977
    mtDNA
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by firemonkey View Post
    So two experts in such matters can't concur? That makes it very difficult for a simpleton in such matters, such as myself, to know which is best to use for accuracy. I think this plays into my anxiety over uncertainty. Perhaps I'm looking for a definitive answer that nMonte is not designed/equipped to give.
    Wait until I publish nMonte4, you will have the day of your lifetime.
    But seriously, I have some advice for you.
    You are mainly interested in your relation with modern NW-European populations.
    In that case even proponents of the scaled method should admit that the main differences between these populations are in the higher dimensions.
    And in that case the unscaled method is methodologically superior to the scaled method.

  10. The Following 4 Users Say Thank You to Huijbregts For This Useful Post:

     JMcB (01-11-2019),  ph2ter (01-11-2019),  PoxVoldius (01-11-2019),  Ruderico (01-11-2019)

  11. #16
    Gold Member Class
    Posts
    2,622
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    Quote Originally Posted by Huijbregts View Post
    This is how I became the grumbling coauthor of a method that I don't believe in.
    Are you talking about nMonte? If so, are there better alternatives, and how easy would it be for people to use them?

  12. #17
    Gold Member Class
    Posts
    2,622
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    Quote Originally Posted by Huijbregts View Post
    Wait until I publish nMonte4, you will have the day of your lifetime.
    But seriously, I have some advice for you.
    You are mainly interested in your relation with modern NW-European populations.
    In that case even proponents of the scaled method should admit that the main differences between these populations are in the higher dimensions.
    And in that case the unscaled method is methodologically superior to the scaled method.
    I tried scaled and unscaled default using 1000/500 25% filter for the British Isles.

    The scaled made more sense.

  13. The Following User Says Thank You to firemonkey For This Useful Post:

     Huijbregts (01-11-2019)

  14. #18
    Registered Users
    Posts
    730
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA
    I2a2a1b2-CTS1977
    mtDNA
    H13a1a1

    Netherlands Belgium
    Quote Originally Posted by LTG View Post
    Considering all of the wacky results I get with unscaled, including single item distances, unreliable models and strange positioning on PCA diagrams, suddenly accepting that this is in fact the canonical way to use the Global 25 is something that I'm not sure I'll be able to accept until a strong consensus has been reached.

    I place great value in Huijbregts' opinion for obvious reasons, but I also acknowledge that Davidski was the one who created the Global 25. He probably has the most extensive experience of anyone on this site when it comes to comparing scaled/unscaled models to the current academic literature to find out which one produces the most sound results. I would appreciate his input on this topic.
    What do you mean, unreliable unscaled models? This one is perfect:
    distance%=0.8543

    LTG

    Turkish_Balikesir,19
    Turkish_Kayseri,15.6
    Irish,13.2
    Turkish_Istanbul,11.2
    Greek,10.2
    Greek_Crete,8.8
    Turkish_Aydin,6
    Scottish,4.2
    English,3.8
    English_Cornwall,2.6
    Orcadian,2.4
    Greek_Central_Anatolia,1.8
    Turkish_Adana,1.2

    or aggregated:
    British 26.2
    Greek 20.8
    Turkish 53.0
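
The aggregated figures can be reproduced from the model above. This is a small sketch: the percentages are copied from the post, while the grouping rule in `group_of` (prefix matching, with everything that is neither Turkish nor Greek counted as British) is an assumption inferred from the totals given.

```python
from collections import defaultdict

# Percentages copied from the unscaled nMonte model for LTG above.
model = {
    "Turkish_Balikesir": 19.0,
    "Turkish_Kayseri": 15.6,
    "Irish": 13.2,
    "Turkish_Istanbul": 11.2,
    "Greek": 10.2,
    "Greek_Crete": 8.8,
    "Turkish_Aydin": 6.0,
    "Scottish": 4.2,
    "English": 3.8,
    "English_Cornwall": 2.6,
    "Orcadian": 2.4,
    "Greek_Central_Anatolia": 1.8,
    "Turkish_Adana": 1.2,
}

def group_of(label):
    """Assumed grouping rule: Turkish_* and Greek* by prefix, the rest British."""
    if label.startswith("Turkish"):
        return "Turkish"
    if label.startswith("Greek"):
        return "Greek"
    return "British"

totals = defaultdict(float)
for label, pct in model.items():
    totals[group_of(label)] += pct

# Round to one decimal to hide floating-point noise from the summation.
aggregated = {group: round(total, 1) for group, total in totals.items()}
print(aggregated)
```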

  15. #19
    Registered Users
    Posts
    361
    Sex
    Location
    United Kingdom
    Ethnicity
    English & Greek
    Nationality
    British
    Y-DNA
    J2-L397
    mtDNA
    H2a2a1

    United Kingdom England Greece Cyprus
    Quote Originally Posted by Huijbregts View Post
    What all of us have in common is the understanding that working with unscaled data will sooner or later run you into trouble.
    Where we differ is the solution that we propose.
    At the start of this long-running controversy is the assertion that pairwise distances of ancient populations from Global10 are inconsistent.
    Someone proposed a simplistic transformation to repair this by multiplying the PCA scores by the eigenvalues.
    That transformation failed to repair the inconsistencies. I commented that he should at least have multiplied by the square root of the eigenvalues. This is how I became the grumbling coauthor of a method that I don't believe in.
    But even with the square root the method is still so amateurish that it should not pass peer review at an academic journal.
    Moreover, in modern data science there are countless methods to control the overfitting of multivariate data.
    I think that the slight penalization of greater distances in nMonte3 is in line with regularization in modern data science.

    I suppose that the better performance with ancient data is the main reason that Davidski adheres to scaled data.
    However, this better performance can easily be explained.
    The Global25 dataset is dominated by modern samples and so is the Global25 PCA. So ancient data are mismatched in the Global25 PCA.
    This difference between ancient and modern data is a result of modern changes in population genetics, which are less omnipresent than the ancient substrates.
    So the recent changes in modern DNA have a higher probability of ending up in the higher dimensions of the PCA. The ancient populations are mainly represented in the lower dimensions; for them the higher dimensions are misfitted noise.
    As shown in the plot of my previous post, the scaling transformation drastically reduces the higher dimensions and keeps the lower dimensions.
    So when fitting ancient populations with nMonte, scaled models will do better than unscaled (unless you penalize with nMonte3, but even then the fact remains that in the Global25 PCA ancient populations are undersampled).

    You mention that both your single item distances and your models are wacky unless you scale them.
    I read from your info that your ethnicity is English & Greek. English and Greek DNA are far apart, and if you are a 50-50 mixture you are in the middle and a nearest neighbor to neither of them.
    So it is logical that your single item distance is bullshit. Actually it would be odd if your scaled data did show plausible near neighbors.
    I cannot explain why your unscaled nMonte models are not satisfactory. Maybe the Greek part of your DNA has some rare admixtures in the higher dimensions.
    Thank you for the in-depth explanation, Huijbregts. It is much appreciated.

    Interestingly, now that you mention it, I have never had any real issues when using modern samples with the unscaled data. In fact, I can model my ancestry just as accurately as I can with the scaled data, to a literal 50/50 split. The ancients are a different story entirely, which makes a lot of sense now that you have pointed out their greater activity in the lower dimensions. That was always something that irritated me about using unscaled. I am certainly no expert on this topic, but much like Davidski my primary interest lies in modelling modern samples using ancient data. Considering that I can achieve low distances for my modern ancestry using three or fewer populations with scaled data, coupled with the apparent superiority of scaled data when it comes to modelling ancient ancestry, I will continue to use scaled data. This is simply because of the continuity that this method has across the entirety of the datasheet and across all time periods. Switching back and forth between unscaled and scaled methods is something that I want to avoid for my own sanity. I can see how some individuals with two or three sources of modern, highly correlated ancestries would be interested in using unscaled data for modelling, but this understandably does not apply to me.

  16. #20
    Registered Users
    Posts
    317
    Sex

    Quote Originally Posted by Ruderico View Post
    If I recall correctly, scaling works by multiplying each coordinate by the square root of its eigenvalue, so at higher dimensions (with low eigenvalues) each coordinate will be close to 0.

    I don't remember seeing genetics papers with scaled coordinates, but then again I don't know how they build the PCAs in the papers (I suppose it's a lot more complex than G25), nor whether that makes a difference. With that said, and since I'm in no position to judge on the actual methodology of PCAs when it comes to genetic data, I stick with what I've seen them do and use unscaled coordinates most of the time. But I really don't care either way.
    Yeah, higher dimensions will represent decreasing shares of variance with scaled coordinates, whereas with unscaled coordinates they will all represent an equal share of variance. But they should decrease; otherwise what does it mean for a PC to be PC1 or PC2 (they're ordered by rank of variance)?

    Scaling by the square root of the eigenvalues reproduces in the PCA the Euclidean distance between samples that would exist just from calculating the Euclidean distance directly on the data. In unscaled data of course it won't.
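
The distance claim can be checked numerically. The sketch below uses random toy data and a plain SVD-based PCA in NumPy; it assumes that "unscaled" means unit-normalised component scores and "scaled" means those scores multiplied by the square root of each eigenvalue, with all components kept. It is an illustration of the linear algebra, not a reconstruction of how the Global25 was built.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # toy data: 6 samples, 4 features
Xc = X - X.mean(axis=0)          # centre each feature

# PCA via SVD: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
n = len(X)
eigenvalues = s ** 2 / (n - 1)   # variance carried by each component

unscaled = U                             # unit-length columns (assumed "unscaled")
scaled = U * np.sqrt(eigenvalues)        # each column times sqrt(eigenvalue)

def pairwise(M):
    """Matrix of Euclidean distances between all rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_data = pairwise(Xc)
# The constant sqrt(n - 1) only reflects the eigenvalue convention used here.
d_scaled = pairwise(scaled) * np.sqrt(n - 1)
d_unscaled = pairwise(unscaled)

print(np.allclose(d_data, d_scaled))
print(np.allclose(d_data / d_data.max(), d_unscaled / d_unscaled.max()))
```

With all components retained, the scaled distances match the distances in the original data up to a constant factor, while the unscaled distances are not even proportional to them.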

    The lowest dimensions, which represent the largest shifts in allele frequencies, should also be the most robustly inferred and the least noisy.

    Anyway, this is not to argue that you should use the scaled data for nMonte (which I would, but not too strongly; I have argued it before, but I can't find the post using this forum's search software, can't be bothered to write it again, and it wouldn't bring the matter to a conclusion anyway). If you want to calculate a single item distance between two items, though, you should certainly use the scaled dataset, as the unscaled dataset will give non-meaningful information.

    Just for context on the relative importance of the early vs late dimensions, here are some distances for the closest 30 populations computed on the Welsh population average, using different numbers of scaled dimensions: JOSf63q.png

    You can kind of see that even two dimensions is pretty intuitive on what the closest neighbours would be expected to be for the Welsh.

    Even if you did discard all the later dimensions, which scaling does not do (or else results for 2 and 25 would be identical and they're not), two dimensions summarizes a lot, particularly for the main West Eurasian populations who are heavily sampled in this panel (perhaps less so, Papuans or Onge, but most of the audience are not Papuans). So the later dimensions don't actually *need* to be very large (they can be very small) and still infer the right closest neighbours, because they're functionally only for fine tuning anyway.
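
The exercise described here, ranking neighbours using only the first few dimensions, can be sketched as follows. The population names and coordinates are invented toy values chosen so that the third dimension is small, as it would be after scaling; `nearest` is a hypothetical helper, and the result says nothing about the real Global25 data.

```python
import math

# Toy "scaled" coordinates: the third dimension is deliberately small.
pops = {
    "Pop_A": [0.130, 0.020, 0.004],
    "Pop_B": [0.128, 0.024, -0.006],
    "Pop_C": [0.110, 0.060, 0.001],
    "Pop_D": [0.020, 0.150, 0.003],
}
target = [0.129, 0.021, 0.003]

def nearest(target, pops, n_dims):
    """Population names sorted by distance to target, using the first n_dims."""
    def dist(name):
        coords = pops[name]
        return math.sqrt(
            sum((t - c) ** 2 for t, c in zip(target[:n_dims], coords[:n_dims]))
        )
    return sorted(pops, key=dist)

rank_2 = nearest(target, pops, 2)   # first two dimensions only
rank_3 = nearest(target, pops, 3)   # all three dimensions

print(rank_2)
print(rank_3)
```

In this toy case the ranking is the same with two and with three dimensions even though the distances themselves change, which is the "fine tuning" role described above.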

    (For comparison, the same exercise using unscaled data: TMq33ji.png. I can't see any sign of a qualitative improvement in the rank order of the most intuitively close populations, and if anything a very slight decrease in correlation with intuition, geography and geographically expected recent ancestry.)

    Re: what papers do, the issue of whether they scale the values kind of doesn't matter, because they use PCA for visualisation (where scaling only changes the magnitude of the numbers on the axes, not the appearance of the plot) and not for distance calculations, where it does matter.

  17. The Following 4 Users Say Thank You to Eterne For This Useful Post:

     LTG (01-11-2019),  Ruderico (01-12-2019),  Ryukendo (01-19-2019),  sktibo (01-12-2019)
