# Thread: MDS plots for European modern individuals

1. Originally Posted by anglesqueville
Ruderico, in the context of my last post, this question doesn't arise precisely because there is no dimension reduction before UMAP. The data consist of all the pairwise distances between individuals ("distances" in the sense of 1-ibs, computed on the panel of SNPs).
Yes, I know that, but I was toying with truncations and am at a loss as to where I should place the cutoff.

2. ## The Following 2 Users Say Thank You to Ruderico For This Useful Post:

anglesqueville (02-08-2021),  JMcB (02-08-2021)

3. Originally Posted by Ruderico
Huijbregts, how many dimensions would you suggest we truncate the data at? I remember Angles ran an experiment and suggested 8, based on distances between NW Europeans
This is an important question when you are using unscaled Global25 data.
The mathematical answer is "Use as many dimensions as the intrinsic dimensionality of the dataset".
Unfortunately mathematicians don't agree about the best method to estimate the intrinsic dimensionality.
What is certain, however, is that the intrinsic dimensionality depends first of all on the number of samples.
My rule of thumb is to calculate how many samples you need to completely cover each half of each dimension.
For one dimension you need two samples, one for the positive side and one for the negative side.
To completely cover 25 dimensions you need 2^25=33 554 432 samples.
The Global25 of the ancient samples has 3830 samples. The 2log of 3830 is 7.162, so you can handle 7 dimensions (round downwards).
The Global25 of the modern samples has 5917 samples. The 2log of 5917 is 7.544, so again you can handle 7 dimensions.
But if you use a restricted dataset of say 500 samples, you can only use 5 dimensions (2log(500)= 5.398).

4. ## The Following 9 Users Say Thank You to Huijbregts For This Useful Post:

anglesqueville (02-08-2021),  Bart (02-10-2021),  JMcB (03-10-2021),  Nino90 (02-08-2021),  randwulf (02-09-2021),  Ruderico (02-08-2021),  sheepslayer (02-08-2021),  ThaYamamoto (02-10-2021),  Trelvern (02-09-2021)

5. Back to post #8. After having experimented quite a bit (fortunately, all this runs very quickly on my new Linux machine), I think I can say that this "direct UMAP" suffers from the same limitations as the "direct MDS", or more precisely that it requires the same constraints to be satisfied. In particular, the presence of samples with a high rate of no-calls has dramatic effects. I had promised the members of my family clan who had provided me with imputed data to send them back, in exchange, easily interpretable representations in addition to qpAdm models (independent of G25, because most of those concerned do not and will not have "their" G25). Honestly, I didn't even mention the possibility of running UMAP, and gave them "their" MDS, which they were fully satisfied with. But I encourage hobbyists to experiment, since it only uses PLINK and R, both available for Windows.

6. ## The Following 3 Users Say Thank You to anglesqueville For This Useful Post:

Bart (02-10-2021),  JMcB (02-10-2021),  randwulf (02-09-2021)

7. Originally Posted by anglesqueville
Back to post #8. After having experimented quite a bit (fortunately, all this runs very quickly on my new Linux machine), I think I can say that this "direct UMAP" suffers from the same limitations as the "direct MDS", or more precisely that it requires the same constraints to be satisfied. In particular, the presence of samples with a high rate of no-calls has dramatic effects. I had promised the members of my family clan who had provided me with imputed data to send them back, in exchange, easily interpretable representations in addition to qpAdm models (independent of G25, because most of those concerned do not and will not have "their" G25). Honestly, I didn't even mention the possibility of running UMAP, and gave them "their" MDS, which they were fully satisfied with. But I encourage hobbyists to experiment, since it only uses PLINK and R, both available for Windows.
UMAP is computationally far more complicated than, say, PCA. So it is not unexpected that no-calls have a dramatic effect.
Another vulnerability of UMAP is a low sample count. If I remember correctly, I have read that UMAPs on fewer than 500 samples should not be considered reliable. From your plots I get the impression that you may have more than 500 samples, but not by much. Can you give us an indication?

8. ## The Following 2 Users Say Thank You to Huijbregts For This Useful Post:

JMcB (02-10-2021),  randwulf (02-09-2021)

9. Originally Posted by Huijbregts
UMAP is computationally far more complicated than, say, PCA. So it is not unexpected that no-calls have a dramatic effect.
Another vulnerability of UMAP is a low sample count. If I remember correctly, I have read that UMAPs on fewer than 500 samples should not be considered reliable. From your plots I get the impression that you may have more than 500 samples, but not by much. Can you give us an indication?
Yes: 1102 individuals

10. ## The Following 3 Users Say Thank You to anglesqueville For This Useful Post:

Huijbregts (02-09-2021),  JMcB (02-10-2021),  randwulf (02-09-2021)

11. Originally Posted by Huijbregts
This is an important question when you are using unscaled Global25 data.
The mathematical answer is "Use as many dimensions as the intrinsic dimensionality of the dataset".
Unfortunately mathematicians don't agree about the best method to estimate the intrinsic dimensionality.
What is certain, however, is that the intrinsic dimensionality depends first of all on the number of samples.
My rule of thumb is to calculate how many samples you need to completely cover each half of each dimension.
For one dimension you need two samples, one for the positive side and one for the negative side.
To completely cover 25 dimensions you need 2^25=33 554 432 samples.
The Global25 of the ancient samples has 3830 samples. The 2log of 3830 is 7.162, so you can handle 7 dimensions (round downwards).
The Global25 of the modern samples has 5917 samples. The 2log of 5917 is 7.544, so again you can handle 7 dimensions.
But if you use a restricted dataset of say 500 samples, you can only use 5 dimensions (2log(500)= 5.398).
WITHDRAWAL #12
The above quote from post #12 is bullshit (or even elephant shit, but not dinosaur shit).
I have the reproachable habit of performing simple calculations not in an R console but in my browser.
That would not have led to a blooper if I had entered the correct formula, log2(x).
Instead I entered 2log(x).
In an R console this throws an error, but my browser interprets it as two times the logarithm with base 10: 2 * log10(x).
Unfortunately, that value did not look implausible.
With the correct formula, the outcome for 3830 samples is 11.90313, which means 11 dimensions.
That appears too high. But notice that the rule of thumb assumes that all dimensions are nearly symmetrically filled, which is overoptimistic.
My assertion that you need 2^25 = 33 554 432 samples for 25 dimensions is not affected, because it uses powers instead of logarithms.

The question of how many dimensions to use remains an important one. Maybe I'll get a better idea later.
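For anyone who wants to double-check the corrected arithmetic, here is a quick sanity check sketched in Python (the rule of thumb as corrected: usable dimensions = floor of the base-2 logarithm of the sample count; `2log` earlier was the browser's mis-parse):

```python
import math

# Corrected rule of thumb: usable dimensions = floor(log2(number of samples))
for n in (3830, 5917, 500):
    print(n, round(math.log2(n), 5), math.floor(math.log2(n)))
# 3830 gives 11.90313, i.e. 11 dimensions, matching the corrected value above
```

With the correct log2, the ancient dataset (3830 samples) supports 11 dimensions, the modern one (5917) supports 12, and a restricted 500-sample set supports 8.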

12. ## The Following 5 Users Say Thank You to Huijbregts For This Useful Post:

anglesqueville (02-10-2021),  JMcB (02-10-2021),  randwulf (02-10-2021),  Ruderico (02-10-2021),  xerxez (02-11-2021)

13. For info: work on the MDS continued on the French-speaking forum, in a discussion thread that is less and less French-speaking (which I am very happy about). A very positive experience, rich in lessons. We have virtually abandoned the use of dna.land (which seems frozen), and I handle the imputation phase myself. It's work, but it's worth it.

14. ## The Following 3 Users Say Thank You to anglesqueville For This Useful Post:

Finn (03-08-2021),  JMcB (03-10-2021),  randwulf (03-08-2021)

15. Originally Posted by Ruderico
Yes, I know that, but I was toying with truncations and am at a loss as to where I should place the cutoff.
I added the ability to truncate at calculation time to the Ancestry calculator, so one can experiment:

https://yk.github.io/ancestry/

If you have suggestions for improvement, just let me know.

16. ## The Following 2 Users Say Thank You to randwulf For This Useful Post:

JMcB (03-10-2021),  Ruderico (03-10-2021)

17. By MDS, do you mean classical MDS or non-metric MDS? Classical MDS should produce the same coordinates as PCA (except that the sign of some dimensions may be arbitrarily flipped, like in the case of the second dimension below).

Code:
```> t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",header=T,row.names=1)
> prcomp(t)$x[1:4,1:4]
                   PC1        PC2         PC3          PC4
Abazin     0.009519441  0.1667842 0.003774957 -0.000634985
Abkhasian  0.038130735  0.1943700 0.012288608  0.004753655
Aeta      -0.199946861 -0.2023205 0.101169686  0.033809794
> cmdscale(dist(t))[1:4,]
                  [,1]       [,2]
Abazin     0.009519441 -0.1667842
Abkhasian  0.038130735 -0.1943700
Aeta      -0.199946861  0.2023205```
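The same equivalence can be checked outside R with a small self-contained sketch (NumPy, random data, my own variable names): PCA scores from the SVD of the centered matrix match classical MDS coordinates obtained by double-centering the squared distance matrix, up to per-dimension sign flips.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# PCA scores via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca = U * S  # principal component scores

# Classical MDS: double-center the squared Euclidean distance matrix
D2 = np.square(np.linalg.norm(Xc[:, None] - Xc[None, :], axis=2))
J = np.eye(50) - np.ones((50, 50)) / 50
B = -0.5 * J @ D2 @ J
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:5]          # keep the 5 largest eigenvalues
mds = V[:, idx] * np.sqrt(w[idx])

# The two coordinate sets agree up to a per-dimension sign flip
for k in range(5):
    assert (np.allclose(pca[:, k], mds[:, k], atol=1e-6) or
            np.allclose(pca[:, k], -mds[:, k], atol=1e-6))
print("classical MDS reproduces PCA scores up to sign")
```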
If a plot made with classical MDS looks different from the original coordinates produced by `plink --pca`, it's because you're supposed to scale the coordinates before you do a second PCA/MDS on top of them. A PCA plot based on the unscaled coordinates will look the same as a classical MDS plot.

Code:
```wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
tar -xf v44.3_1240K_public.tar
f=v44.3_1240K_public
convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' g/g/12.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_father|_mother|_son|_daughter|_brother|_sister|_twin'|grep -Ev 'Neanderthal|Denisova|_published|Cameroon_SMA'>picks
plink --allow-no-sex --bfile $f --keep <(awk 'NR==FNR{a[$1];next}$2 in a' picks $f.fam) --pca
paste -d' ' <(cut -d' ' -f2 picks) <(cut -d' ' -f2- plink.eigenvec)>plink.eigenvec.pops```
Code:
```library(tidyverse)
library(colorspace)

# t=read.table("https://pastebin.com/raw/bGptFp12",sep=" ") # data I used
# eig=c(22.7034,12.2596,8.09738,5.49967,4.92393,2.80936,2.39839,2.28679,2.20875,2.20267,2.12417,1.98691,1.81393,1.78919,1.74964,1.71181,1.66634,1.65334,1.62163,1.56336)

ave=aggregate(t[,-c(1:2)],list(t[,1]),mean)
row.names(ave)=ave$Group.1
ave=ave[,-1]

# ave=(t(t(ave)*sqrt(eig))) # scale by multiplying with square roots of eigenvalues

p=as.data.frame(cmdscale(dist(ave),k=20)) # classical MDS (k=2 would be a truncated version of k=20, but k=20 makes the clustering more accurate because it won't be based on only 2 dimensions)
# p=as.data.frame(prcomp(ave)$x) # uncomment to use PCA instead of MDS
# p=as.data.frame(ave) # uncomment to plot original coordinates by `plink --pca`
names(p)[1:2]=c("x","y")
p$k=cutree(hclust(dist(p),method="complete"),k=32)

ggplot(p,aes(x,y))+
geom_point(aes(color=as.factor(k)),size=.5)+
geom_polygon(data=p%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.3)+
geom_text(aes(label=row.names(p),color=as.factor(k)),size=2,vjust=-.7)+
theme(
aspect.ratio=1/sqrt(2),
axis.text=element_text(color="black",size=6),
axis.ticks.length=unit(0,"pt"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.title=element_text(color="black",size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray75",fill=NA,size=.4),
panel.grid.major=element_line(color="gray70",size=.15),
plot.title=element_text(size=9),
)+
ggtitle("Original coordinates by `plink --pca` (scaled)")+
labs(x="",y="")+
scale_x_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.12))+
scale_y_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.03))+
scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)

ggsave("a.png")
system("mogrify -trim -bordercolor white -border 20x20 a.png")```
Below, the unscaled MDS plot is identical to the unscaled PCA plot, and the scaled MDS plot is identical to the scaled PCA plot.

Above you can also see that when I plotted the original coordinates from `plink --pca`, the clustering got messed up if I used unscaled coordinates.
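The effect of the scaling step can be illustrated with a toy example in Python (the eigenvalues below are made up for illustration, in the spirit of plink's .eigenval output, not real data):

```python
import numpy as np

# Hypothetical eigenvalues for four dimensions (illustration only)
eig = np.array([22.7, 12.3, 8.1, 5.5])

# Two samples one unit apart along PC1, and two one unit apart along PC4
a = np.array([0., 0, 0, 0]); b = np.array([1., 0, 0, 0])
c = np.array([0., 0, 0, 0]); d = np.array([0., 0, 0, 1])

# Unscaled eigenvector coordinates treat every axis equally
assert np.isclose(np.linalg.norm(a - b), np.linalg.norm(c - d))

# After multiplying by sqrt(eigenvalues), PC1 differences weigh more,
# so distance-based steps (MDS, hclust) see the variance structure
sa, sb, sc, sd = (v * np.sqrt(eig) for v in (a, b, c, d))
print(np.linalg.norm(sa - sb), np.linalg.norm(sc - sd))  # sqrt(22.7) vs sqrt(5.5)
```

This is why clustering on unscaled coordinates looks off: every dimension contributes equally to the distances, regardless of how much variance it actually explains.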

18. ## The Following 2 Users Say Thank You to Nganasankhan For This Useful Post:

Ajeje Brazorf (03-17-2021),  sheepslayer (03-17-2021)

19. Originally Posted by Nganasankhan
By MDS, do you mean classical MDS or non-metric MDS?
@angles is starting with our raw files, imputing SNPs to get to a common basic file format, and then doing the MDS-based plotting. The results are very good. You should follow his link above and read the thread.

20. ## The Following 3 Users Say Thank You to randwulf For This Useful Post:

anglesqueville (03-17-2021),  Bygdedweller (03-17-2021),  JMcB (03-17-2021)
