# Thread: Question about PCA: Eigenvector and explained variance

1. ## Question about PCA: Eigenvector and explained variance

Hi there,

I have a question about interpreting PCA plots from the Eurogenes site (Global 25). Is the eigenvalue of PC1 divided by the sum of the eigenvalues (PC1 to PC25) equal to the proportion of variance explained by PC1 relative to the total variance?

Here is an example using the Global 25 data. According to the site, the eigenvalues for PC1 to PC25 are as follows:

129.557,103.13,14.222,10.433,9.471,7.778,5.523,5.325,4.183,3.321,2.637,2.246,2.21,1.894,1.842,1.758,1.7,1.605,1.58,1.564,1.557,1.529,1.519,1.452,1.434
Total sum = 319.47

So the share of the eigenvalue sum for PC1 is 129.557/319.47 = 0.4055 (40.55%), and for PC2 it is 103.13/319.47 = 0.3228 (32.28%). Hence, a PC1 vs. PC2 plot explains 72.84% of the total variance, while a 3-dimensional PC1 vs. PC2 vs. PC3 plot explains 77.29% of the total variance.
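As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes the 25 values listed in the post are the complete input (reading the eighth value as 5.325, which is what makes the stated total of 319.47 work out). The names `values` and `share` are made up for illustration. Note that every share is measured against the sum of these 25 values only.

```python
# Values for PC1..PC25 as listed in the post (Global 25).
values = [
    129.557, 103.13, 14.222, 10.433, 9.471, 7.778, 5.523, 5.325,
    4.183, 3.321, 2.637, 2.246, 2.21, 1.894, 1.842, 1.758,
    1.7, 1.605, 1.58, 1.564, 1.557, 1.529, 1.519, 1.452, 1.434,
]

total = sum(values)  # sum over the 25 listed PCs only


def share(first, last):
    """Share of the 25-PC sum held by PC<first>..PC<last> (1-based, inclusive)."""
    return sum(values[first - 1:last]) / total


print(f"total      = {total:.2f}")
print(f"PC1        = {share(1, 1):.2%}")
print(f"PC2        = {share(2, 2):.2%}")
print(f"PC1 to PC2 = {share(1, 2):.2%}")
print(f"PC1 to PC3 = {share(1, 3):.2%}")
```

Run as is, this reproduces the 40.55%, 32.28%, 72.84% and 77.29% figures quoted above.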

For instance, in the PC1 vs. PC2 plot, all Papuan populations sit somewhere between the East Asian and European clusters. This is obviously misleading, as Papuans are not a mixture of the two. It is not until PC3, PC4, PC5, and PC6 that the Papuan populations form their own separate cluster. Since the eigenvalues of PC3 to PC6 account for 13.12% of the sum, it follows that the distinction between Papuans and other non-Africans is around 13.12% of the total variance in human populations.
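The 13.12% figure can be checked the same way. This is only a sketch of the arithmetic, assuming the four values listed for PC3 to PC6 and the stated 25-PC total of 319.47; the variable names are hypothetical. It confirms the share of the sum carried by those four axes, not whether that share can be attributed to any one population.

```python
# Values listed in the post for PC3, PC4, PC5 and PC6.
pc3_to_pc6 = [14.222, 10.433, 9.471, 7.778]

# Sum over PC1..PC25 as stated in the post.
total_25 = 319.47

pc3_to_pc6_share = sum(pc3_to_pc6) / total_25
print(f"PC3 to PC6 = {pc3_to_pc6_share:.2%} of the 25-PC sum")
```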

Is my understanding correct?

Thanks.

2. I don't think you can use PCA for this sort of thing, for one thing because there are many more dimensions than just the 25.

Sure, the 25 dimensions explain a lot of the variance, plenty enough to accurately model the ancestry of the samples, but what if someone runs a Global 500, would that affect your calculations? I don't know, I've never done it.
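To spell out why the number of dimensions matters for the calculation: write $\lambda_i$ for the value attached to PC $i$ (notation introduced here, not used in the thread) and let $N \ge 25$ be the number of components in a hypothetical larger run on the same data, so that the first 25 values are unchanged. Since every $\lambda_i$ is non-negative, adding more terms can only make the denominator larger:

$$
\frac{\lambda_1}{\sum_{i=1}^{N}\lambda_i} \;\le\; \frac{\lambda_1}{\sum_{i=1}^{25}\lambda_i} \;=\; \frac{129.557}{319.47} \;\approx\; 0.4055
$$

Under that assumption, a share computed from the 25 listed values is an upper bound on the share of the total variance, not the share itself. How far below the bound the true figure lies depends on the values beyond PC25, which are not given in this thread.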

3. Thanks for the reply. But I am slightly confused now. If there are many more dimensions than 25, then how well can a typical PC1 vs. PC2 plot explain the genetic variation across human populations? Unfortunately I'm unable to post a link, so I'll just refer to the Global 25 PCA plot.

In other words: if PC1 and PC2 don't explain most of the genetic variation, then why are we still plotting PC1 vs. PC2? Shouldn't we include everything up to PC500 to capture population structure with reasonable accuracy (i.e. roughly 90% confidence)?

But notice that the eigenvalues beyond PC2 drop off sharply. I would therefore expect the eigenvalue at PC500 to be small enough to be negligible (i.e. it contributes almost nothing to explaining population variation). Hence my initial assumption that PC1 and PC2 from the Global 25 are sufficient (~72% confidence) to model population structure. Please correct me if I am wrong.

4. Originally Posted by rasa.sayange36

> Thanks for the reply. But I am slightly confused now. If there are many more dimensions than 25, then how well can a typical PC1 vs. PC2 plot explain the genetic variation across human populations?
>
> In other words: if PC1 and PC2 don't explain most of the genetic variation, then why are we still plotting PC1 vs. PC2? Shouldn't we include everything up to PC500 to capture population structure with reasonable accuracy (i.e. roughly 90% confidence)?

PCA is a reduction of the data and a simplification of reality. It's generally just a way to visualize the data before moving on to more involved analyses.

Most of the time ancestry proportions aren't modeled with PCA data, because there are certain issues that need to be overcome to be able to do this, but it can work well if done properly.

> But notice that the eigenvalues beyond PC2 drop off sharply. I would therefore expect the eigenvalue at PC500 to be small enough to be negligible (i.e. it contributes almost nothing to explaining population variation). Please correct me if I am wrong.

Yes, that's basically always true, and this is why plotting just the first two or three PCs is an effective way to visualize the data.

But I still wouldn't use PCA for what you were attempting to use it for, because you really need to use raw data and formal stats for that type of thing.

5. Yes, I suppose you are right about that. I understand your points. However, I am just a hobbyist who became interested in approximating genetic variation across populations, and I am quite satisfied with a reasonable approximation (roughly 90% confidence). With that in mind, I just hope the PCA plots derived from Global 25 are sufficient to represent population structure, although I don't know exactly how much of that information Global 25 captures.

In terms of accuracy, there are diminishing returns from running more components (e.g. Global 500), since the eigenvalues tend to decrease at higher components. This means I would expect the improvement in accuracy going from Global 485 to Global 500 to be significantly smaller than going from Global 10 to Global 25. As you implied before, at some arbitrarily high number of dimensions, the PCA data should be able to model 99.99% of reality. But for me, modelling ~90% of reality is good enough.
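The diminishing returns described above can at least be seen inside the 25 listed values. The following is a rough Python sketch, not anything taken from Eurogenes: it assumes the Global 25 values quoted at the top of the thread, uses made-up variable names, and measures everything against the sum of those 25 values only. It therefore says nothing about what a hypothetical Global 500 would add.

```python
from itertools import accumulate

# Values for PC1..PC25 as listed at the top of the thread (Global 25).
values = [
    129.557, 103.13, 14.222, 10.433, 9.471, 7.778, 5.523, 5.325,
    4.183, 3.321, 2.637, 2.246, 2.21, 1.894, 1.842, 1.758,
    1.7, 1.605, 1.58, 1.564, 1.557, 1.529, 1.519, 1.452, 1.434,
]

total = sum(values)

# cumulative[k - 1] is the share of the 25-PC sum held by PC1..PCk.
cumulative = [running / total for running in accumulate(values)]

# Smallest number of leading PCs whose cumulative share reaches the target.
target = 0.90
k = next(i for i, c in enumerate(cumulative, start=1) if c >= target)

share_pc1_to_pc10 = cumulative[9]
share_pc11_to_pc25 = cumulative[24] - cumulative[9]

print(f"PCs needed for {target:.0%} of the 25-PC sum: {k}")
print(f"PC1 to PC10 : {share_pc1_to_pc10:.1%}")
print(f"PC11 to PC25: {share_pc11_to_pc25:.1%}")
```

Under these assumptions, 9 PCs are enough to reach 90% of the 25-PC sum, and PC1 to PC10 hold about 91.7% of it, leaving about 8.3% for PC11 to PC25.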

Cheers.
