European PCA and beyond

I am interested in how to best prepare a map of the European samples of the Eurogenes Global 25 PCA data.

The classical method is to use Principal Component Analysis (PCA).

Modern data scientists have developed a lot of alternative models. In this post I will compare PCA with t-distributed Stochastic Neighbor Embedding (t-SNE).

PCA transforms a set of correlated variables into a set of uncorrelated variables, which are called 'principal components' or, for short, 'dimensions'.

The method is not only used to get uncorrelated (orthogonal) components, but also to reduce the number of variables: because the higher components are noisier, dropping the higher dimensions results in much cleaner data.
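As a minimal sketch of both uses, here is PCA done with a plain SVD in NumPy. The data are synthetic (two hidden factors plus a little noise), not Global 25 data, and all variable names are my own choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: 200 samples, 5 correlated variables
# driven by 2 latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Centre the variables, then decompose with SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                      # principal component scores
eigenvalues = s**2 / (len(X) - 1)   # variance carried by each component
explained = eigenvalues / eigenvalues.sum()

# The scores are uncorrelated: their covariance matrix is diagonal.
cov_scores = np.cov(scores, rowvar=False)

# Keep only the first two components and drop the noisier rest.
reduced = scores[:, :2]

print(np.round(explained, 3))
```

In this toy setup nearly all of the variance sits in the first two components, so the remaining three mostly carry noise and can be dropped.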

The Global 25 dataset estimates the 25 PCA components of raw DNA samples.

In the complete set of the Global 25 samples the optimal number of components appears to be 4.

When selecting a subset of the Global 25, the independence of the variables may have been lost, so it is advisable to perform a secondary PCA to restore the independence of the scores.

The secondary PCA of the European subset appears to have only 2 principal components. This comes in handy for preparing scatter plots.
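The reason for a secondary PCA can be shown with a small sketch. The data below are synthetic and the helper `pca_scores` is a name I made up for the illustration. The scores of a global PCA are uncorrelated over all samples, but within a subset they can be strongly correlated, and a second PCA on the subset removes that correlation.

```python
import numpy as np

def pca_scores(X):
    """Return PCA scores and eigenvalues of X (rows are samples)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U * s, s**2 / (len(X) - 1)

rng = np.random.default_rng(1)

# Synthetic stand-in for a global dataset: two groups with different structure.
group_a = rng.normal(size=(300, 4)) * [5.0, 1.0, 0.5, 0.2] + [6.0, 0.0, 0.0, 0.0]
direction = rng.normal(size=(100, 1))
group_b = direction @ np.array([[1.0, 1.0, 0.0, 0.0]]) + 0.1 * rng.normal(size=(100, 4))
global_scores, _ = pca_scores(np.vstack([group_a, group_b]))

# Take the subset (the group_b rows) out of the global scores.
subset = global_scores[300:]
corr_before = np.corrcoef(subset, rowvar=False)

# Secondary PCA on the subset only.
secondary, sec_eigenvalues = pca_scores(subset)
corr_after = np.corrcoef(secondary, rowvar=False)

print(np.round(corr_before, 2))
print(np.round(corr_after, 2))
```

The first matrix printed has large off-diagonal entries, the second is the identity matrix up to rounding.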

In the ecosystem of Global 25 users it is customary NOT to use PCA scores, but to multiply them by the square root of the eigenvalues and call this 'scaled' data.
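To make the operation concrete, here is a small sketch of such a 'scaling'. The scores and eigenvalues are hypothetical numbers chosen for easy arithmetic, not values taken from the Global 25. It only shows the mechanics: after scaling, the leading components weigh more heavily in any Euclidean distance.

```python
import numpy as np

# Hypothetical values, for illustration only: scores of two samples on
# three components, and the eigenvalues of those components.
scores = np.array([[0.5, -1.0, 2.0],
                   [1.5,  0.0, 0.0]])
eigenvalues = np.array([9.0, 4.0, 1.0])

# The 'scaling': multiply each component by the square root of its eigenvalue.
scaled = scores * np.sqrt(eigenvalues)

# Euclidean distance between the two samples before and after scaling.
dist_unscaled = np.linalg.norm(scores[0] - scores[1])
dist_scaled = np.linalg.norm(scaled[0] - scaled[1])

# Share of the squared distance that comes from the third component.
share_unscaled = (scores[0, 2] - scores[1, 2])**2 / dist_unscaled**2
share_scaled = (scaled[0, 2] - scaled[1, 2])**2 / dist_scaled**2

print(dist_unscaled, dist_scaled)
print(share_unscaled, share_scaled)
```

With these numbers the third component accounts for 4/6 of the squared distance before scaling and 4/17 after it.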

When the subset has only 2 principal components, the bias of this 'scaling' will remain limited, but IMO it is an amateurish method without scientific basis.

Anyway, I do not use it.

I selected the European samples from the Iron Age to modern times. I dropped some very distant populations (Chuvash, Mari, Bashkir, Kalmyk, Nogai, England_Roman_o).

Here is the scatterplot of the first two dimensions of the secondary PCA.

To give an idea of where the populations are located on the plot, I have color-marked 10 populations. These labels are derived from the Global 25 labels; they are not the result of the PCA algorithm.

secPCA.png

In the past few years data scientists have developed alternative methods to handle multidimensional data. Many of these methods try to find a 'manifold'.

The idea is that high-dimensional data may locally be close to a lower dimensional manifold. An example is the surface of the earth. It has 3-dimensional coordinates, but at small distances it appears to be 2-dimensional.

Here is the scatterplot of the same 25-dimensional European data projected on a 2-dimensional t-SNE manifold.

t-SNE.png

Again the color-marking is from the Global 25 labeling, not from the t-SNE algorithm.
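For readers who want to try something similar, here is a minimal sketch of a t-SNE embedding with scikit-learn's `TSNE` class. It runs on synthetic 25-dimensional data with three clusters, not on the European samples, and the settings are not the ones behind the plot above. The perplexity of 30 is simply the library default, not a tuned value.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for 25-dimensional scores: three loose clusters.
centers = rng.normal(scale=5.0, size=(3, 25))
X = np.vstack([c + rng.normal(size=(100, 25)) for c in centers])

# Perplexity roughly sets how many near neighbours each point attends to.
# random_state fixes the otherwise random optimisation path.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)
```

Different perplexity values or random seeds can give visibly different pictures, so it is worth running the embedding several times before reading much into a plot.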

The scatterplot shows a much more detailed structure than the secondary PCA. Yet in a topological sense, the structure appears similar.

(Note the two yellow outliers in both plots. They are Italian_Northeast_o:ALP188 and Italian_Trentino-Alto-Adige_o:ALP414.)

On the lower right are two Caucasian clusters. On the upper right the North East European populations are separated from the more Western Europeans.

A t-SNE plot should not be interpreted in terms of genetic distances, but in terms of shared near neighbors.

t-SNE is considered a useful tool for exploring and visualizing, but the result should always be validated.
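One simple way to check an embedding in terms of shared near neighbors is to count, for every sample, how many of its k nearest neighbors in the original space are still among its k nearest neighbors in the 2-dimensional plot. The sketch below is my own illustration in plain NumPy, with synthetic data and made-up function names. It is not the validation used for the plots in this post.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of every row (self excluded)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def neighbour_overlap(high_dim, low_dim, k=10):
    """Mean fraction of each point's k nearest neighbours kept in the embedding."""
    nn_high = knn_indices(high_dim, k)
    nn_low = knn_indices(low_dim, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))   # synthetic stand-in for the scores

# An 'embedding' that keeps everything versus one unrelated to the data.
identity_score = neighbour_overlap(X, X)
random_score = neighbour_overlap(X, rng.normal(size=(200, 2)))

print(identity_score, random_score)
```

A score close to 1 means the plot keeps the neighborhoods of the original data, and a score close to k divided by the number of samples is what a random scatter would give.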

In this case the clustering by the color-marked populations appears to be plausible.

Yet my feeling is that the result could have been stronger with more data, especially in Middle and Eastern Europe.

t-SNE was designed for datasets with more dimensions and more samples.

As it is now, the clustering may be more or less overfitted, but most models are. I like it as an interesting exploration, which is methodologically very different from PCA.