# Thread: Global25 automated nMonte for South/Central Asian members

1. Originally Posted by Generalissimo
Don't worry about any of this. The accuracy of the Global25 is validated by formal statistics models.

In other words, when the Global25 and formal models correlate closely, then there's no problem. When they don't correlate then there's a problem, and there's no point going down that road, like modeling or modeling with super old samples that aren't ancestral to any modern populations.
Hey, great to hear from the General himself. Please also read my post after that, which poi quoted just above. I recently got my Global25 coordinates from you and have been on this forum for a bit. I ask out of curiosity: you have probably seen these numbers back to front, so you might be able to estimate this by eye.
Say we are modeling p1 and p2 : nep_bram and up_bram (or pakistani) using these as base populations (DMXX's work):

Barcin N + Ganj D + Han + Khvalynsk + N Simulated AASI + W Siberia N (+MysteryP)

All of them have a non-zero distance. Could you produce a point MysteryP (25 numbers) such that both p1 and p2 have zero distance?
- Ideally, of all such points, we would want the one that is closest to the existing points.
- Maybe it's a stretch, but an algorithm like that could be used to speculate about a population that is related to both p1 and p2.

3. Originally Posted by dumbo007
Thank you. I'm glad you're looking at it this way.
I'll summarize the assumptions I am making, so we can agree on or correct them. I think you guys are doing a great job coming up with base populations that make historic sense and calculating these distances. I wanted to hack the math to maybe find a shortcut by looking the other way.
- The PCA basis that Davidski defined creates a 25-dimensional space where each population is a point.
- When looking for a fit for pop_x as an admixture of p1, p2, p3... we are looking for the point on the shape formed by p1, p2... that is closest to pop_x. The Euclidean distance from pop_x to that closest point is the fit distance. The coordinates of that point in the basis of p1, p2... are the admixture fractions we look at.
- The admixture fractions have to be less than 1 (you wouldn't have a person that is 1.1 ASI + 1.1 Iran etc.). So, in that example, the closest point would be on the plane defined by the three points (the normal projection of (1.1, 1.1, 1.1) onto that plane), and the fit distance would be non-zero (the length of the normal drawn from the point onto the plane in this case).
- Changing bases should preserve the Euclidean distance in a space. I think your star analogy might be inaccurate.
- I agree the points Ganj_dareh, Barcin, Steppe, ASI form a polygon (in 2D; a polyhedron in 3D, a general polytope in n-D), and the SA populations are scattered around the periphery of this polytope, so that all the fit distances are non-zero, and it is possible that they lie in different directions.
If this was 3D, let's say these components make up a shape like a glob of icecream, and say india_bram lies slightly outside in one direction (like a fruitfly), such that the fit distance (= Euclidean distance) to the closest point on the icecream glob is 2.7. Maybe nep_bram also lies at a distance of 2.7, but in a slightly different direction. But it should be possible to find a point in this space, say the tip of an icecream cone, such that when the cone touches the icecream glob, the point india_bram falls between (inside) the icecream and the tip of the cone. If the tip of the cone were taken as another base component, it should be possible to find a fit distance of zero with all admixture fractions less than 1.
- My speculation was that we are systematically missing an icecream-cone tip point (= some mystery population; possibly more, but maybe just one such point would explain this for more than one SA group), and that if we add that 'mystery component' it becomes possible to write these populations with admixture fractions less than 1 and fit distance zero.
- I understand that even if such a point exists and is found with math first (I've seen a lot of attempts to come up with what might be base components for these populations, but they still have a distance), it might not make sense genealogically, but I think it is worth thinking about because it's not ruled out. Once we find a point like that we can see what that mystery component is: which way the cone-tip points, or what the closest existing/ancient population is to that tip.
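The fit described in the list above (closest point on the shape spanned by the sources, with non-negative fractions that sum to 1) can be sketched roughly as below. This is only an illustration under assumptions: nMonte is described in this thread as a least-squares fit, and this sketch swaps in plain projected gradient descent rather than reproducing whatever search nMonte actually uses. The function names and the 2D coordinates are made up for the example and are not Global25 data.

```python
import math


def project_to_simplex(v):
    """Return the closest vector to v whose entries are >= 0 and sum to 1."""
    ordered = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        candidate = (running - 1.0) / i
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def mix(sources, weights):
    """Weighted average of the source coordinates."""
    dims = len(sources[0])
    return [sum(w * s[j] for w, s in zip(weights, sources)) for j in range(dims)]


def fit_admixture(sources, target, iterations=5000):
    """Minimise the distance |mix - target| over admixture fractions."""
    k = len(sources)
    weights = [1.0 / k] * k
    # 1 / (sum of squared coordinates) is a conservative, stable step size.
    step = 1.0 / sum(x * x for s in sources for x in s)
    for _ in range(iterations):
        residual = [m - t for m, t in zip(mix(sources, weights), target)]
        gradient = [sum(r * x for r, x in zip(residual, s)) for s in sources]
        weights = project_to_simplex(
            [w - step * g for w, g in zip(weights, gradient)]
        )
    return weights, math.dist(mix(sources, weights), target)


# Made-up 2D stand-ins for three source populations and one target.
sources = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
weights, distance = fit_admixture(sources, target=(1.0, 1.0))
print([round(w, 3) for w in weights], round(distance, 4))
```

In this toy case the target (1, 1) lies outside the triangle, so the fit lands on the midpoint of the far edge and the fit distance stays non-zero (about 0.707), which is the "fruitfly outside the icecream" situation.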
Yes you are correct in that 1.1 would not be a value in a scaled PCA (because it should all sum to 1), but I was just illustrating an example of a linear combination, which for a truly orthonormal choice of real eigenvectors should represent any point in the space, such as using unscaled basis vectors.
Here is my understanding of the PCA coordinates given by David and what nMonte is doing:
Each of the samples is represented by 25 different coordinates in a PCA space with 25 orthonormal vectors, where PC1 contains the most variance demonstrated by the data, followed by PC2, PC3 etc., with PC25 containing the least variance (caution: this is the variance of the whole data set, so it may wipe out the uniqueness of a highly divergent population with only a few samples). nMonte is fitting a reduced-dimensional space (identified as the surface formed by the points we choose as source populations) to this 25D space by least-squares projections, and then calculating our individual sample's location in that reduced space (using an overfit model, because the source populations themselves are related to each other).
Distances are preserved under maps such as translation and rotation, which keep the number of dimensions. Projections do not preserve distances. A simple example: if we collapse 2D space into 1D space using projection, a point originally at '(a,b)' is now at 'a', and the Euclidean distance from the origin to that point has shrunk from 'sqrt(a^2+b^2)' to '|a|' (sorry if I am being too basic, but I wanted to be inclusive of others in this discussion as much as possible, so that my own viewpoint is corrected if in error). This is where the astronomy example came in. (Regarding the ice-cream example, I wonder if the main problem is not where the fruit fly is with respect to the ice-cream in a given frame, but rather that the fruit fly appears in that location because it is superimposed onto the image with the ice-cream by removing time as a dimension.)
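The 2D-to-1D example can be checked numerically. The snippet below is just a small sketch with an arbitrary point (3, 4) and an arbitrary rotation angle; the helper names are made up for illustration.

```python
import math


def norm(point):
    """Euclidean distance of a point from the origin."""
    return math.sqrt(sum(x * x for x in point))


def rotate(point, angle):
    """Rotate a 2D point about the origin (keeps both dimensions)."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def project_onto_first_axis(point):
    """Collapse a 2D point to 1D by dropping the second coordinate."""
    return (point[0],)


point = (3.0, 4.0)
original = norm(point)                            # sqrt(3^2 + 4^2)
rotated = norm(rotate(point, 0.7))                # same length after rotation
projected = norm(project_onto_first_axis(point))  # only |a| survives
print(original, rotated, projected)
```

The rotated point keeps its distance of 5 from the origin, while the projected point is left with 3.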

Instead of manually deciding which source populations best fit all of us, I guess one can use monte-carlo simulations or step-wise regression to figure out the best set of source populations that works for all of us within the given population space.
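One possible reading of that suggestion is sketched below, under stated assumptions: it tries every subset of a given size from a pool of candidate sources, scores each subset by a crude Monte Carlo fit (random admixture fractions, keep the best), and sums the distances over all targets so the chosen set has to work for everyone at once. The candidate names, coordinates and function names are hypothetical, and an exhaustive search like this only scales to small pools.

```python
import itertools
import math
import random


def random_weights(k, rng):
    """Draw k non-negative fractions that sum to 1."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def monte_carlo_distance(sources, target, rng, trials=2000):
    """Best fit distance found over random admixture fractions."""
    best = math.inf
    for _ in range(trials):
        w = random_weights(len(sources), rng)
        mixed = [
            sum(wi * s[j] for wi, s in zip(w, sources))
            for j in range(len(target))
        ]
        best = min(best, math.dist(mixed, target))
    return best


def best_source_set(candidates, targets, size, seed=0):
    """Return (total distance, names) for the best subset of `size` sources."""
    rng = random.Random(seed)
    scored = []
    for names in itertools.combinations(sorted(candidates), size):
        sources = [candidates[name] for name in names]
        total = sum(monte_carlo_distance(sources, t, rng) for t in targets)
        scored.append((total, names))
    return min(scored)


# Hypothetical 2D candidates and two targets lying between A and B.
candidates = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 5.0)}
targets = [(0.4, 0.0), (0.6, 0.0)]
total, names = best_source_set(candidates, targets, size=2)
print(names, round(total, 3))
```

Summing over targets is one arbitrary way to express "works for all of us"; taking the worst single distance instead would be an equally defensible choice.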

5. fwiw - here are the eigenvals (used to convert the unscaled pcs into scaled)

Code:
`129.557,103.13,14.222,10.433,9.471,7.778,5.523,5.325,4.183,3.321,2.637,2.246,2.21,1.894,1.842,1.758,1.7,1.605,1.58,1.564,1.557,1.529,1.519,1.452,1.434`
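A minimal sketch of how these eigenvalues might be applied. The post does not say how the conversion is done, so the rule used here (scaled PC = unscaled PC multiplied by the square root of its eigenvalue) is an assumption, and the input coordinates are made up.

```python
import math

# Eigenvalues exactly as posted above, one per principal component.
EIGENVALUES = [
    129.557, 103.13, 14.222, 10.433, 9.471, 7.778, 5.523, 5.325, 4.183,
    3.321, 2.637, 2.246, 2.21, 1.894, 1.842, 1.758, 1.7, 1.605, 1.58,
    1.564, 1.557, 1.529, 1.519, 1.452, 1.434,
]


def scale(unscaled):
    """ASSUMPTION: scaled PC = unscaled PC * sqrt(eigenvalue)."""
    if len(unscaled) != len(EIGENVALUES):
        raise ValueError("expected one coordinate per eigenvalue")
    return [x * math.sqrt(ev) for x, ev in zip(unscaled, EIGENVALUES)]


def unscale(scaled):
    """Inverse of scale() under the same assumption."""
    if len(scaled) != len(EIGENVALUES):
        raise ValueError("expected one coordinate per eigenvalue")
    return [x / math.sqrt(ev) for x, ev in zip(scaled, EIGENVALUES)]


# Made-up unscaled coordinates, only to show the effect of the weighting.
made_up = [0.01] * len(EIGENVALUES)
scaled = scale(made_up)
print(round(scaled[0], 4), round(scaled[-1], 4))
```

Under that assumption PC1 is stretched by roughly 11.4 and PC25 by roughly 1.2, so Euclidean distances in scaled space are dominated by the first few PCs.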

7. Does anyone want to create a group in the tool for their ethnic group (or add to the existing groups)?

I added one for my ethnic group. I can do this for others as well... just make sure you are a "true" representative of your group, with all 4 grandparents from the same ethnic group.

It is interesting to see how the "average" looks versus the individuals. Also added the ability to use the g25 group averages as if they were individuals.
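For anyone curious what a group "average" means in coordinate terms, here is a rough sketch. It assumes the average is a plain per-coordinate mean of the members, which the post does not state; the member names and the three coordinates per person are invented (real Global25 rows have 25).

```python
import math


def group_average(members):
    """Per-coordinate mean across all members of a group."""
    count = len(members)
    return [sum(column) / count for column in zip(*members.values())]


def spread(members):
    """Distance of each member from the group average."""
    centre = group_average(members)
    return {name: math.dist(coords, centre) for name, coords in members.items()}


# Hypothetical members with made-up 3-coordinate rows.
group = {
    "member_1": [0.10, 0.02, -0.03],
    "member_2": [0.12, 0.04, -0.01],
    "member_3": [0.08, 0.03, -0.02],
}
average = group_average(group)
print([round(x, 4) for x in average])
print({name: round(d, 4) for name, d in spread(group).items()})
```

The average row can then be fed to a fit exactly like an individual row, and the spread shows how far each person sits from it.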

9. Originally Posted by soulblighter
Yes you are correct in that 1.1 would not be a value in a scaled PCA (because it should all sum to 1), but I was just illustrating an example of a linear combination, which for a truly orthonormal choice of real eigenvectors should represent any point in the space, such as using unscaled basis vectors.
Here is my understanding of the PCA coordinates given by David and what nMonte is doing:
Each of the samples is represented by 25 different coordinates in a PCA space with 25 orthonormal vectors, where PC1 contains the most variance demonstrated by the data, followed by PC2, PC3 etc., with PC25 containing the least variance (caution: this is the variance of the whole data set, so it may wipe out the uniqueness of a highly divergent population with only a few samples). nMonte is fitting a reduced-dimensional space (identified as the surface formed by the points we choose as source populations) to this 25D space by least-squares projections, and then calculating our individual sample's location in that reduced space (using an overfit model, because the source populations themselves are related to each other).
Distances are preserved under maps such as translation and rotation, which keep the number of dimensions. Projections do not preserve distances. A simple example: if we collapse 2D space into 1D space using projection, a point originally at '(a,b)' is now at 'a', and the Euclidean distance from the origin to that point has shrunk from 'sqrt(a^2+b^2)' to '|a|' (sorry if I am being too basic, but I wanted to be inclusive of others in this discussion as much as possible, so that my own viewpoint is corrected if in error). This is where the astronomy example came in. (Regarding the ice-cream example, I wonder if the main problem is not where the fruit fly is with respect to the ice-cream in a given frame, but rather that the fruit fly appears in that location because it is superimposed onto the image with the ice-cream by removing time as a dimension.)

Instead of manually deciding which source populations best fit all of us, I guess one can use monte-carlo simulations or step-wise regression to figure out the best set of source populations that works for all of us within the given population space.
I agree: if the coefficients are not constrained to be <=1, an orthogonal basis can span the entire space. Here, with admixtures, they are constrained, so in the example the closest point would be (because of symmetry) 1/3 p1 + 1/3 p2 + 1/3 p3, and the projection comes in when defining the distance from this point to pop_x. I'm pretty sure that we are talking about the fruitfly's distance from the icecream (comments please). Projecting a distant point onto a second plane and then using the projected point without also reporting its distance from the original point would be misleading.

Using Monte Carlo to find the distance of a given point from a set of known points is what is being done now, and it also makes sense genealogically. The reason I'm suggesting we look for a new point is that there might not be an existing point where the cone-tip would be, or else I'm sure you guys would've found it by trial already. Because all these SA samples still have some distance from sets of well chosen reference points, it might be instructive to ask: if I pick these n base populations, what extra point is needed (the cone-tip) to make the fit distances zero, i.e. so that the populations are entirely contained within the shape formed by the basis points?
In a way, it turns the fit distance from a scalar into a vector: it shows where all the SA data-points (fruitflies) are in reference to the glob (the defining base populations).
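A rough sketch of the scalar-to-vector idea and of one way to place a cone-tip, under assumptions. It takes the closest point on the source shape as already known (from whatever fit was run), keeps the whole residual vector instead of only its length, and puts a hypothetical tip further along the same ray so that the target becomes an exact mixture of the closest point and the tip. The coordinates and function names are made up. It handles one target at a time; a single tip that zeroes the distance for two targets at once need not exist, and the cosine between their residuals is only a rough hint of whether they point the same way.

```python
import math


def residual_vector(target, closest):
    """Vector from the closest point on the source shape to the target."""
    return [t - c for t, c in zip(target, closest)]


def cone_tip(target, closest, stretch=2.0):
    """Hypothetical extra point on the ray closest -> target, past the target."""
    if stretch < 1.0:
        raise ValueError("stretch must be >= 1 so the tip is not short of the target")
    residual = residual_vector(target, closest)
    return [c + stretch * r for c, r in zip(closest, residual)]


def cosine(u, v):
    """Cosine of the angle between two residual vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)


# Made-up 2D example: sources at (0,0), (1,0), (0,1); target outside the triangle.
closest = [0.5, 0.5]      # closest point on the triangle to the target
target = [1.0, 1.0]
stretch = 2.0
tip = cone_tip(target, closest, stretch)
tip_share = 1.0 / stretch  # admixture fraction assigned to the tip
rebuilt = [(1 - tip_share) * c + tip_share * m for c, m in zip(closest, tip)]

# Residual of a second made-up target whose closest point is the vertex (0, 1).
other_residual = residual_vector([0.2, 1.3], [0.0, 1.0])
alignment = cosine(residual_vector(target, closest), other_residual)
print(tip, rebuilt, round(alignment, 3))
```

A larger `stretch` pushes the tip further out and lowers the fraction the target takes from it, so the choice of tip along the ray is not unique; picking the one nearest the existing points corresponds to `stretch` close to 1.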

10. It turns out that G25 coordinates I had for me ("poi") and my sister-in-law ("poi_sil") were SWAPPED. No wonder "poi" was so close to "poi_motherinlaw", while "poi_sil" was so close to "poi_mom". At least part of the craziness is now explained.

So, it looks like I'm the most WestSiberianN+Barcin shifted of my group with extremely low KhvalynskEN. Also, it makes sense that my East Asian is elevated compared to others since my mom also has higher East Asian in the group. My mother in law is at the bottom when it comes to East Asian in our group.

ps -- I double checked everybody's coordinates and mine was the only instance that was screwed up, so your coordinate labels should be fine. I will fix this in the next update.

12. Does anyone have a breakdown of the Saidu Sharif outlier? Pegasus?

14. Originally Posted by MonkeyDLuffy
Does anyone have a breakdown of the Saidu Sharif outlier? Pegasus?
It's more or less a more AASI shifted version of SIS BA3 I believe. Although, it might get very minor Steppe.

Edit: Just modeled it. Exactly as I said.

Screen Shot 2018-09-14 at 1.07.00 AM.png

Modeled it alongside SIS BA2 and SIS BA3. Not 100% in line with the paper, most probably because the simulated AASI is still slightly off. It comes out a little inflated, as SIS BA3 should be getting around 42-44% and SIS BA2 closer to 14-15%.

Screen Shot 2018-09-14 at 1.15.36 AM.png

Modeled them using the Barcin N + Khvalynsk model as well.

Screen Shot 2018-09-14 at 1.39.22 AM.png

16. Originally Posted by MonkeyDLuffy
Does anyone have a breakdown of the Saidu Sharif outlier? Pegasus?
He's like 40% Iran_N, 50% AASI, 10% EHG but lacks Barcin, so it cannot be MLBA Steppe. Irula and Gond do score Barcin, but in their case it's extremely likely representing some archaic combo of Basal and WHG-ish ancestry, not actual ANF.

18. Originally Posted by pegasus
He's like 40% Iran_N, 50% AASI, 10% EHG but lacks Barcin, so it cannot be MLBA Steppe. Irula and Gond do score Barcin, but in their case it's extremely likely representing some archaic combo of Basal and WHG-ish ancestry, not actual ANF.
It gets me the best fits, better than Paniya, under 2.
