Thread: Scaled or unscaled, penalty on or off

  1. #1
    Gold Member Class
    Posts
    2,506
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland

    Scaled or unscaled, penalty on or off

    This may have been posted before, apologies if so.

    What are the pros and cons of each combination?

    Scaled, penalty on
    Scaled, penalty off
    Unscaled, penalty on
    Unscaled, penalty off
    Please support Mental health research and world community grid


  2. The Following User Says Thank You to firemonkey For This Useful Post:

     Adam.Krauze (01-10-2019)

  3. #2
    Junior Member
    Posts
    8
    Sex

    I'd also like to know. So far I have only read that some people say one thing and others say another.

  4. #3
    Registered Users
    Posts
    845
    Sex
    Location
    Lisbon, Portugal
    Ethnicity
    Western Meseta Iberian
    Nationality
    Portuguese
    Y-DNA
    E-Y31991 > PF4428
    mtDNA
    H20

    Portugal 1143 Portugal 1485 Portugal Order of Christ
    Oh boy, this has the potential to get heated.

    What I can tell you is that Davidski uses scaled coordinates, but Ger Huijbregts (who created nMonte) doesn't, because scaling discards information at higher dimensions.
    If I recall correctly, scaling works by multiplying each coordinate by the square root of its eigenvalue, so at higher dimensions (with low eigenvalues) each coordinate will be close to 0.
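The scaling step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `scale_coordinates`, the eigenvalues, and the coordinates are all made up for the example and are not the actual Global25 values.

```python
import math


def scale_coordinates(coords, eigenvalues):
    """Multiply each PCA coordinate by the square root of its eigenvalue."""
    if len(coords) != len(eigenvalues):
        raise ValueError("coords and eigenvalues must have the same length")
    return [c * math.sqrt(ev) for c, ev in zip(coords, eigenvalues)]


# Hypothetical eigenvalues, largest first (NOT the real G25 eigenvalues).
eigenvalues = [16.0, 4.0, 1.0, 0.25]

# A hypothetical sample with the same unscaled value on every dimension.
unscaled = [0.5, 0.5, 0.5, 0.5]

scaled = scale_coordinates(unscaled, eigenvalues)
print(scaled)  # the later dimensions shrink relative to the first ones
```

With identical unscaled values, the scaled coordinates fall off from one dimension to the next in proportion to the square root of the eigenvalue, which is the effect on the higher dimensions that the post refers to.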

    I don't remember seeing genetics papers with scaled coordinates, but then again I don't know how they build the PCAs in the papers (I suppose it's a lot more complex than G25), nor whether that makes a difference. With that said, and since I'm in no position to judge on the actual methodology of PCAs when it comes to genetic data, I stick with what I've seen them do and use unscaled coordinates most of the time. But I really don't care either way.

    I suspect most people here use scaled coordinates because Davidski does too. Keep in mind that a pleasing result doesn't validate an incorrect method (i.e. using pen=0 with numerous and rather similar pops in a dataset)
    Last edited by Ruderico; 01-11-2019 at 11:06 AM. Reason: fixed

    DEIBABOR
    IGO
    DEIBOBOR
    VISSAIEIGO
    BOR

  5. The Following 3 Users Say Thank You to Ruderico For This Useful Post:

     JMcB (01-10-2019),  Trelvern (01-11-2019),  Wing Genealogist (01-11-2019)

  6. #4
    Registered Users
    Posts
    250
    Sex
    Location
    UK
    Ethnicity
    English & Greek
    Nationality
    British
    Y-DNA
    J2a-L397
    mtDNA
    H2a2a1

    United Kingdom Greece Byzantine Empire
    Scaled with pen=0 produces the most stable and logical results for me.

  7. #5
    Gold Member Class
    Posts
    2,506
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    See this thread https://anthrogenica.com/showthread....ntested-parent for an example of differences between scaled and unscaled results.

  8. #6
    Registered Users
    Posts
    1,837
    Sex
    Location
    Pennsylvania
    Ethnicity
    West European
    Nationality
    USA
    Y-DNA
    R1b-U152-Z36+FGC6511
    mtDNA
    H11a2a

    United States of America Germany England France Scotland Ireland
    Here are my thoughts, illustrated with an example. My mom is about evenly split Isles/German with small minor South Asian, Amerindian, and African admixture. This model with pen=0 and unscaled comes out really good with the French reference taking some of both the English and the German and a relatively balanced Scottish and French East showing the more extreme parts of the major ancestry range. The minor ancestry is always faint, but supported by really well-organized groupings of admixture at targeted locations (so, I suspect are real). The percentages are expected for this minor ancestry below or maybe slightly understated:

    [1] "distance%=0.6248"

    French,63.8
    Scottish,20.2
    French_East,13.4
    Brahmin_Uttar_Pradesh,1.4
    Mbuti,0.6
    Surui,0.6

    I have found if the model is built well and the major ancestry isn't overfit (based on tips from Huijbregts on dealing with this kind of mix), the model tends to work almost as well with the default penalty with regards to the non-majority mix. It doesn't disappear altogether, but the penalty does reduce it to lower than expected. The Isles/German major "balance" stays correct or maybe even is improved:

    [1] "distance%=0.6678"

    French,65
    French_East,19
    Scottish,14.4
    Brahmin_Uttar_Pradesh,0.8
    Mbuti,0.4
    Surui,0.4

    Scaling doesn't do as well. With pen=0, it destroys the major ancestry balance and has great difficulty recognizing the minor ancestry. Here are results:

    [1] "distance%=1.2452"

    French,55
    Scottish,35
    French_East,9.6
    Brahmin_Uttar_Pradesh,0.4

    Using the default penalty with the scaled model helps restore the major ancestry balance and models it pretty well, but still destroys the recognition of the minor ancestry:

    [1] "distance%=1.3353"

    French,54.2
    Scottish,25.6
    French_East,20
    Brahmin_Uttar_Pradesh,0.2

    I have seen this pattern repeated quite a bit, including with all of my family. I have seen that the modeling with the most ancient references (like the European Steppe/WHG/EEF kind of test) produces more expected results with scaling. But, some of those higher dimension PCA positions do things like distinguish Siberians from Amerindians, which is kind of important if one has some real Siberian or Amerindian ancestry. I don't find these kinds of PCA positions cause too much trouble with unscaled modeling because Europeans, for example, group together on those dimensions and they don't have much impact on the model (they don't get in the way, I mean). So, I don't feel the need to minimize their impact. I guess it depends on what one wants to get from the model. My conclusion is that I tend to work with unscaled, pen=0 to produce as simple a model as possible and then run with the default penalty to ensure the model stays pretty strong, like the one I am showing above.
    Last edited by randwulf; 01-11-2019 at 06:07 PM.

  9. The Following 7 Users Say Thank You to randwulf For This Useful Post:

     Adam.Krauze (01-11-2019),  JMcB (01-11-2019),  Nibelung (01-11-2019),  ph2ter (01-11-2019),  PoxVoldius (01-11-2019),  Ruderico (01-11-2019),  Trelvern (01-11-2019)

  10. #7
    Moderator
    Posts
    5,175
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA
    U152>L2>Z367
    mtDNA
    H5a1

    Normandie Netherlands Friesland Finland Orkney
    Quote Originally Posted by Ruderico View Post
    Oh boy, this has the potential to get heated. (1)

    What I can tell you is that Davidski uses scaled coordinates, but Ger Huijbregts (who created nMonte) doesn't, because scaling discards information at higher dimensions.
    If I recall correctly, scaling works by multiplying each coordinate by the square root of its eigenvalue, so at higher dimensions (with low eigenvalues) each coordinate will be close to 0.

    I don't remember seeing genetics papers with scaled coordinates, but then again I don't know how they build the PCAs in the papers (2) (I suppose it's a lot more complex than G25), nor whether that makes a difference. With that said, and since I'm in no position to judge on the actual methodology of PCAs when it comes to genetic data, I stick with what I've seen them do and use unscaled coordinates most of the time. But I really don't care either way.

    I suspect most people here use scaled coordinates because Davidski does too. Keep in mind that a pleasing result doesn't validate an incorrect method (i.e. using pen=0 with numerous and rather similar pops in a dataset) (3)
    (1) wisdom speaks through your mouth
    (2) most recent papers, if not all, use smartpca (part of the EIGENSOFT bundle), as Eurogenes-G25 does.
    (3) 1000% agree
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  11. The Following 2 Users Say Thank You to anglesqueville For This Useful Post:

     JMcB (01-11-2019),  Ruderico (01-11-2019)

  12. #8
    Gold Member Class
    Posts
    2,506
    Sex
    Location
    Calne,England
    Ethnicity
    British and Irish
    Nationality
    Great Britain
    Y-DNA
    E-Y45878
    mtDNA
    H67

    United Kingdom Scotland England Ireland
    Quote Originally Posted by randwulf View Post
    Here are my thoughts, illustrated with an example. My mom is about evenly split Isles/German with small minor South Asian, Amerindian, and African admixture. This model with pen=0 and unscaled comes out really good with the French reference taking some of both the English and the German and a relatively balanced Scottish and French East showing the more extreme parts of the major ancestry range. The minor ancestry is always faint, but supported by really well-organized groupings of admixture at targeted locations (so, I suspect are real). The percentages are expected for this minor ancestry below or maybe slightly understated:

    [1] "distance%=0.6248"

    French,63.8
    Scottish,20.2
    French_East,13.4
    Brahmin_Uttar_Pradesh,1.4
    Mbuti,0.6
    Surui,0.6

    I have found if the model is built well and the major ancestry isn't overfit (based on tips from Huijbregts on dealing with this kind of mix), the model tends to work almost as well with the default penalty with regards to the non-majority mix. It doesn't disappear altogether, but the penalty does reduce it to lower than expected. The Isles/German major "balance" stays correct or maybe even is improved:

    [1] "distance%=0.6678"

    French,65
    French_East,19
    Scottish,14.4
    Brahmin_Uttar_Pradesh,0.8
    Mbuti,0.4
    Surui,0.4

    Scaling doesn't do as well. With pen=0, it destroys the major ancestry balance and has great difficulty recognizing the minor ancestry. Here are results:

    [1] "distance%=1.2452"

    French,55
    Scottish,35
    French_East,9.6
    Brahmin_Uttar_Pradesh,0.4

    Using the default penalty with the scaled model helps restore the major ancestry balance and models it pretty well, but still destroys the recognition of the minor ancestry:

    [1] "distance%=1.3353"

    French,54.2
    Scottish,25.6
    French_East,20
    Brahmin_Uttar_Pradesh,0.2

    I have seen this pattern repeated quite a bit, including with all of my family. I have seen that the modeling with the most ancient references (like the European Steppe/WHG/EEF kind of test) produces more expected results with scaling. But, some of those higher dimension PCA positions do things like distinguish Siberians from Amerindians, which is kind of important if one has some real Siberian or Amerindian ancestry. I don't find these kinds of PCA positions cause too much trouble with unscaled modeling because Europeans, for example, group together on those dimensions and they don't have much impact on the model (they don't get in the way, I mean). So, I don't feel the need to minimize their impact. I guess it depends on what one wants to get from the model. My conclusion is that I tend to work with unscaled, pen=0 to produce as simple a model as possible and then run with the default penalty to ensure the model stays pretty strong, like the one I am showing above.

    Here are some results for my mother who is Scottish and Irish in that order with perhaps a little English.

    All 1000/500 25% filter


    Reference populations: English, Finnish, Irish, Norwegian, Orcadian, Scottish
    Target samples: Custom:Tim_Mother_test (unscaled), Custom:Tim_Mother_test_Scaled (scaled)

    Run                         Fit     English  Finnish  Irish  Norwegian  Orcadian  Scottish
    Unscaled, pen=0             0.9955  14.4     1        84.6   0          0         0
    Unscaled, default penalty   1.0486  5        0.4      70.4   0          15.8      8.4
    Scaled, pen=0               1.9553  36       10.4     53.6   0          0         0
    Scaled, default penalty     2.0688  15       6.4      42.8   7.6        2.8       25.4



    I hasten to add my mother was never tested. Her coordinates were worked out by Garvan from my father's coordinates and mine. They were confirmed by DMXX.
    Last edited by firemonkey; 01-11-2019 at 09:20 AM.

  13. #9
    Registered Users
    Posts
    2,859
    Sex
    Location
    Taiwan
    Ethnicity
    Métis
    Nationality
    Canadian
    Y-DNA
    R-Z198 (DF27)
    mtDNA
    T2B-T152C

    Canada England Scotland Germany Poland France
    FWIW, my preference is usually unscaled, run both with pen=0 and with the penalty on.
    Paper trail ancestry to the best of my knowledge:
    English (possibly containing some Welsh ancestry) 31.25%, Eastern European and Eastern German (Galicia, Poland) 25%, Scottish 17.96%, Scotch-Irish 12.5%, French 8.2%, Native American 1.95%, and Colonial American, 3.125%, which cannot be determined with complete certainty: there is Dutch (at least 1.36%) and some English. The rest could include Spanish, Norwegian, German, and French, but these percentages would be minuscule.

  14. The Following User Says Thank You to sktibo For This Useful Post:

     JMcB (01-11-2019)

  15. #10
    Registered Users
    Posts
    637
    Sex
    Location
    Netherlands
    Ethnicity
    South-Dutch
    Nationality
    Dutch
    Y-DNA
    I2a2a1b2-CTS1977
    mtDNA
    H13a1a1

    Netherlands Belgium
    The idea behind scaling is that multiplying PCA scores by the square root of the eigenvalues results in PCA scores with distances that better represent the original data.

    An obvious reason to distrust this idea is that the raw DNA data are not continuous but categorical (a SNP at a certain position). SmartPCA converts these categorical data into continuous PCA scores.
    I think the idea that post-SmartPCA distances should be adapted to pre-SmartPCA distances is preposterous.
    Moreover, the nature of eigenvalues is misunderstood. In the Global25 data the eigenvalues are largely determined by the sampling density, which is an interference in the calculations that should not be further enhanced.
    The scaling is implemented in an amateurish way. Before the multiplication, the scores should have been centered (by subtracting the mean).
    From modern data science we can learn that most algorithms for multivariate data perform poorly when the variables have different variances. Therefore a standard way to make statistical calculations more robust is to normalize the data, i.e. to divide each variable by its standard deviation.
    Scaling does the reverse. Ouch.
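To make the contrast between normalizing and scaling concrete, here is a minimal sketch. The two columns `pc1` and `pc2` and the eigenvalues `ev1` and `ev2` are hypothetical toy values chosen so that both columns start with the same spread; they are not Global25 data, and the helper names are made up for the example.

```python
import math
import statistics


def standardize(column):
    """Center a column on its mean and divide by its standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mean) / sd for x in column]


def scale(column, eigenvalue):
    """Multiply a column by the square root of its eigenvalue."""
    return [x * math.sqrt(eigenvalue) for x in column]


# Hypothetical PCA columns with equal spread, and hypothetical eigenvalues.
pc1 = [0.10, -0.10, 0.05, -0.05]
pc2 = [0.05, -0.10, -0.05, 0.10]
ev1, ev2 = 16.0, 4.0

sd = statistics.pstdev
ratio_unscaled = sd(pc1) / sd(pc2)
ratio_scaled = sd(scale(pc1, ev1)) / sd(scale(pc2, ev2))
ratio_standardized = sd(standardize(pc1)) / sd(standardize(pc2))

print(ratio_unscaled, ratio_scaled, ratio_standardized)
```

In this toy case standardizing leaves the two columns with the same standard deviation, while scaling pulls them apart by the ratio of the square roots of the eigenvalues.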

    Sometimes a bad idea has unintended beneficial consequences, even scaling.
    What many people don't realize is that scaling penalizes the higher dimensions (because the higher dimensions have by definition smaller eigenvalues).
    See the next plot of the standard deviation in the unscaled (green) and scaled (red) data.
    [Plot: scaled_vs_unscaled.jpg]

    Note the logarithmic vertical scale.
    So scaling strongly penalizes the higher dimensions. After dimension 10 the scaled higher dimensions contribute virtually nothing to the distance.
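The effect on distances can be illustrated with a small sketch. Because each coordinate is multiplied by the square root of its eigenvalue, each dimension's term in the squared Euclidean distance is multiplied by the eigenvalue itself. The eigenvalues and the two samples below are hypothetical, and `squared_distance_shares` is a name invented for this example.

```python
def squared_distance_shares(a, b, eigenvalues=None):
    """Return the fraction of the squared Euclidean distance between a and b
    that each dimension contributes.

    If eigenvalues are given, the coordinates are treated as scaled by
    sqrt(eigenvalue), so each squared difference is weighted by the eigenvalue.
    """
    weights = [1.0] * len(a) if eigenvalues is None else list(eigenvalues)
    terms = [w * (x - y) ** 2 for w, x, y in zip(weights, a, b)]
    total = sum(terms)
    if total == 0:
        raise ValueError("the two samples are identical")
    return [t / total for t in terms]


# Two hypothetical samples that differ equally on every dimension.
sample_a = [0.0, 0.0, 0.0, 0.0]
sample_b = [0.1, 0.1, 0.1, 0.1]

# Hypothetical eigenvalues, largest first (NOT the real G25 eigenvalues).
eigenvalues = [64.0, 16.0, 4.0, 1.0]

shares_unscaled = squared_distance_shares(sample_a, sample_b)
shares_scaled = squared_distance_shares(sample_a, sample_b, eigenvalues)

print(shares_unscaled)
print(shares_scaled)
```

In the unscaled case every dimension contributes a quarter of the squared distance; after scaling, the first dimension accounts for 64/85 of it and the last for only 1/85.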
    Now this may in some cases be beneficial.
    The many samples and the many dimensions make Global25 a great dataset.
    But because it has so much detail to select from, there is a considerable risk that oddball admixtures percolate in the results, a.k.a. overfitting.
    IMO this penalizing of the higher dimensions is why in some cases the scaled result appears more plausible than the unscaled result (with pen=0).
    But scaling penalizes the higher dimensions too strongly (again, the vertical scale on the plot is logarithmic).
    Also, scaling uses the eigenvalues for another purpose, and their side effect on overfitting is not optimal. The nMonte3 way of slightly penalizing the greater distances is much better tailored to the task.
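As a rough illustration of what penalizing the greater distances could look like, here is a hypothetical objective function. It is not nMonte3's code, and the exact form of the nMonte3 penalty is not given in this thread; the function name, the penalty term, and the value `pen=0.001` are all assumptions made for this sketch.

```python
def penalized_objective(weights, references, target, pen=0.001):
    """Hypothetical objective: squared distance of the mixture to the target,
    plus pen times the weighted squared distance of each reference to the target.
    """
    dims = range(len(target))
    mixture = [sum(w * ref[d] for w, ref in zip(weights, references)) for d in dims]
    fit = sum((m - t) ** 2 for m, t in zip(mixture, target))
    spread = sum(
        w * sum((r - t) ** 2 for r, t in zip(ref, target))
        for w, ref in zip(weights, references)
    )
    return fit + pen * spread


# Hypothetical target and two candidate models that both fit it perfectly.
target = [0.0, 0.0]
near_refs = [[0.1, 0.0], [-0.1, 0.0]]  # references close to the target
far_refs = [[1.0, 0.0], [-1.0, 0.0]]   # references far from the target
weights = [0.5, 0.5]

near_pen0 = penalized_objective(weights, near_refs, target, pen=0.0)
far_pen0 = penalized_objective(weights, far_refs, target, pen=0.0)
near_pen = penalized_objective(weights, near_refs, target)
far_pen = penalized_objective(weights, far_refs, target)

print(near_pen0, far_pen0)  # both models tie without a penalty
print(near_pen, far_pen)    # the penalty favours the nearby references
```

With pen=0 the two models cannot be told apart, since both average out exactly to the target. With a small penalty, the model built from distant references scores worse, which is the kind of guard against oddball admixtures that the post describes.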

  16. The Following 8 Users Say Thank You to Huijbregts For This Useful Post:

     JMcB (01-11-2019),  LTG (01-11-2019),  michal3141 (01-11-2019),  ph2ter (01-11-2019),  PoxVoldius (01-11-2019),  Ruderico (01-11-2019),  sktibo (01-11-2019),  Trelvern (01-11-2019)
