YDNA E-Y31991>PF4428>Y134097>Y134104>Y168273>FT17866 (TMRCA ~1100 AD) - Domingos Rodrigues, b. circa 1690, Viana do Castelo, Portugal - stonemason, miller.
mtDNA H20 - Monica Vieira, b. circa 1700, Porto, Portugal
Global25 PCA West Eurasia dataset
[1] "distance%=1.6007"
Ruderico
NW_Iberia_IA,80.4
Berber_EMA,11
Roman_Colonial,8.6
This is an important question when you are using unscaled Global25 data.
The mathematical answer is: use as many dimensions as the intrinsic dimensionality of the dataset.
Unfortunately, mathematicians don't agree on the best method for estimating the intrinsic dimensionality.
What is certain, however, is that the intrinsic dimensionality depends first of all on the number of samples.
My rule of thumb is to calculate how many samples you need to completely cover each half of each dimension.
For one dimension you need two samples: one for the positive side and one for the negative side.
To completely cover 25 dimensions you need 2^25 = 33,554,432 samples.
The Global25 ancient datasheet has 3830 samples. The 2log of 3830 is 7.162, so you can handle 7 dimensions (rounding down).
The Global25 modern datasheet has 5917 samples. The 2log of 5917 is 7.544, so again you can handle 7 dimensions.
But if you use a restricted dataset of, say, 500 samples, you can only use 5 dimensions (2log(500) = 5.398).
Back to post #8. After having experimented quite a bit (fortunately, all of this runs very quickly on my new Linux machine), I think I can say that this "direct UMAP" suffers from the same limitations as the "direct MDS", or more precisely, that it requires the same constraints to be satisfied. In particular, the presence of samples with a high rate of no-calls has dramatic effects. I had promised the members of my family clan who provided me with imputed data to send them in return, in addition to qpAdm models, easily interpretable representations (independent of G25, because most of those concerned do not and will not have "their" G25). Honestly, I didn't even mention the possibility of running UMAP, and gave them "their" MDS, which they were fully satisfied with. But I encourage hobbyists to experiment, since it only uses PLINK and R, both available for Windows.
En North alom, de North venom
En North fum naiz, en North manom
(Roman de Rou, Wace, 1160-1170)
UMAP is computationally far more complicated than, say, PCA, so it is not unexpected that no-calls have a dramatic effect.
Another vulnerability of UMAP is a low sample count. If I remember correctly, I have read that UMAP results on fewer than 500 samples should be considered unreliable. From your plots I get the impression that you may have more than 500 samples, but not by much. Can you give us an indication?
WITHDRAWAL #12
The quote above from post #12 is bullshit (or even elephant shit, but not dinosaur shit).
I have the reproachable habit of performing simple calculations not in an R console but in my browser.
That would not have led to a blooper if I had entered the correct formula, log2(x).
Instead I entered 2log(x).
In an R console this throws an error, but my browser interprets it as two times the base-10 logarithm: 2 * log10(x).
Unfortunately, that value did not appear implausible.
With the correct formula, the outcome for 3830 samples is 11.90313, which means 11 dimensions.
That appears too high. But note that the rule of thumb assumes that all dimensions are filled nearly symmetrically, which is overoptimistic.
My assertion that you need 2^25 = 33,554,432 samples for 25 dimensions is not affected, because it used powers instead of logarithms.
The question of how many dimensions to use remains an important one. Maybe I will come up with a better approach.
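The two formulas are easy to confuse when typing into a browser's calculator. A short Python check (illustrative, not from the original posts) shows how far apart they land for the 3830 ancient samples:

```python
import math

n = 3830  # ancient samples in the Global25 datasheet

# The blooper: the browser read "2log(3830)" as 2 * log10(3830).
wrong = 2 * math.log10(n)

# The intended formula: the base-2 logarithm.
right = math.log2(n)

print(round(wrong, 2))    # ~7.17, close to the value quoted in post #12
print(round(right, 5))    # 11.90313, as stated in the withdrawal
print(math.floor(right))  # 11 usable dimensions by the rule of thumb
```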
For info: work on the MDS continued on the French-speaking forum, in a discussion thread that is less and less French-speaking (which I am very happy about). A very positive experience, rich in lessons. We have virtually abandoned the use of dna.land (which seems frozen), and I now take care of the imputation phase myself. It's work, but it's worth it.
https://anthrogenica.com/showthread....es-diplo%EFdes
I added the ability to truncate at calculation time to the Ancestry calculator, so one can experiment:
https://yk.github.io/ancestry/
If you have suggestions for improvement, just let me know.
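For anyone who wants to reproduce the truncation locally, here is a minimal sketch of a distance computed on only the first k Global25 dimensions. The helper name and the coordinate values are hypothetical, not taken from the calculator's actual code:

```python
import math

def truncated_distance(a, b, k):
    """Euclidean distance using only the first k coordinates.

    a and b are Global25-style coordinate lists; k is the number of
    dimensions to keep (k=25 would use the full vector).
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:k], b[:k])))

# Toy 4-dimensional coordinates for illustration.
a = [0.10, 0.20, 0.05, -0.01]
b = [0.10, 0.10, 0.00, 0.03]
print(truncated_distance(a, b, 2))  # first two dimensions only
print(truncated_distance(a, b, 4))  # all four: never smaller than a truncation
```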
By MDS, do you mean classical MDS or non-metric MDS? Classical MDS should produce the same coordinates as PCA (except that the sign of some dimensions may be arbitrarily flipped, like in the case of the second dimension below).
If a plot made with classical MDS looks different from the original coordinates produced by `plink --pca`, it's because you're supposed to scale the coordinates before you do a second PCA/MDS on top of them. A PCA plot based on the unscaled coordinates will look the same as a classical MDS plot.

Code:
> t=read.csv("https://drive.google.com/uc?export=download&id=1wZr-UOve0KUKo_Qbgeo27m-CQncZWb8y",header=T,row.names=1)
> prcomp(t)$x[1:4,1:4]
                   PC1        PC2         PC3          PC4
Abazin     0.009519441  0.1667842 0.003774957 -0.000634985
Abkhasian  0.038130735  0.1943700 0.012288608  0.004753655
Adygei     0.026658686  0.1892593 0.004977751  0.003864157
Aeta      -0.199946861 -0.2023205 0.101169686  0.033809794
> cmdscale(dist(t))[1:4,]
                  [,1]       [,2]
Abazin     0.009519441 -0.1667842
Abkhasian  0.038130735 -0.1943700
Adygei     0.026658686 -0.1892593
Aeta      -0.199946861  0.2023205
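The equivalence above (classical MDS on Euclidean distances reproduces PCA scores up to the sign of each column) can be verified outside R as well. This is an illustrative sketch in Python with random toy data, not the forum poster's code:

```python
import numpy as np

# Toy data standing in for the coordinate table; any full-rank
# numeric matrix behaves the same way.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))

# PCA scores (what prcomp(t)$x returns): center, then SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U * s

# Classical MDS (what cmdscale(dist(t)) computes): double-center the
# squared Euclidean distance matrix and eigendecompose it.
n = X.shape[0]
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
w, V = np.linalg.eigh(B)
order = np.argsort(w)[::-1][:4]  # keep the 4 positive eigenvalues
mds_scores = V[:, order] * np.sqrt(w[order])

# The two coordinate sets agree up to the sign of each column.
print(np.allclose(np.abs(pca_scores), np.abs(mds_scores)))  # True
```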
Code:
wget reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V44/V44.3/SHARE/public.dir/v44.3_1240K_public.tar
tar -xf v44.3_1240K_public.tar
f=v44.3_1240K_public
convertf -p <(printf %s\\n genotypename:\ $f.geno snpname:\ $f.snp indivname:\ $f.ind outputformat:\ PACKEDPED genotypeoutname:\ $f.bed snpoutname:\ $f.bim indivoutname:\ $f.fam)
awk -F\\t 'NR>1&&$2~/\.DG/{print$2" "$13}' g/g/12.anno|grep -Ev '\.REF|Ignore_|_o|_dup|_contam|_SomeAfrican|_father|_mother|_son|_daughter|_brother|_sister|_twin'|grep -Ev 'Neanderthal|Denisova|_published|Cameroon_SMA'>picks
plink --allow-no-sex --bfile $f --keep <(awk 'NR==FNR{a[$1];next}$2 in a' picks $f.fam) --pca
paste -d' ' <(cut -d' ' -f2 picks) <(cut -d' ' -f2- plink.eigenvec)>plink.eigenvec.pops

Below, the unscaled MDS plot is identical to the unscaled PCA plot, and the scaled MDS plot is identical to the scaled PCA plot.

Code:
library(tidyverse)
library(colorspace)
t=read.table("plink.eigenvec.pops",sep=" ")
eig=as.double(readLines("plink.eigenval"))
# t=read.table("https://pastebin.com/raw/bGptFp12",sep=" ") # data I used
# eig=c(22.7034,12.2596,8.09738,5.49967,4.92393,2.80936,2.39839,2.28679,2.20875,2.20267,2.12417,1.98691,1.81393,1.78919,1.74964,1.71181,1.66634,1.65334,1.62163,1.56336)
ave=aggregate(t[,-c(1:2)],list(t[,1]),mean)
row.names(ave)=ave$Group.1
ave=ave[,-1]
# ave=(t(t(ave)*sqrt(eig))) # scale by multiplying with square roots of eigenvalues
p=as.data.frame(cmdscale(dist(ave),k=20)) # classical MDS (k=2 would be a truncated version of k=20, but k=20 makes the clustering more accurate because it won't be based on only 2 dimensions)
# p=as.data.frame(prcomp(ave)$x) # uncomment to use PCA instead of MDS
# p=as.data.frame(ave) # uncomment to plot original coordinates by `plink --pca`
names(p)[1:2]=c("x","y")
p$k=cutree(hclust(dist(p),method="complete"),k=32)
ggplot(p,aes(x,y))+
  geom_point(aes(color=as.factor(k)),size=.5)+
  geom_polygon(data=p%>%group_by(k)%>%slice(chull(x,y)),alpha=.2,aes(color=as.factor(k),fill=as.factor(k)),size=.3)+
  geom_text(aes(label=row.names(p),color=as.factor(k)),size=2,vjust=-.7)+
  theme(
    aspect.ratio=1/sqrt(2),
    axis.text=element_text(color="black",size=6),
    axis.ticks.length=unit(0,"pt"),
    axis.ticks.x=element_blank(),
    axis.ticks.y=element_blank(),
    axis.title=element_text(color="black",size=8),
    legend.position="none",
    panel.background=element_rect(fill="white"),
    panel.border=element_rect(color="gray75",fill=NA,size=.4),
    panel.grid.major=element_line(color="gray70",size=.15),
    plot.title=element_text(size=9),
  )+
  ggtitle("Original coordinates by `plink --pca` (scaled)")+
  labs(x="",y="")+
  scale_x_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.12))+
  scale_y_continuous(breaks=seq(-10,10,.1),expand=expansion(mult=.03))+
  scale_color_discrete_qualitative(palette="Set 2",c=80,l=40)
ggsave("a.png")
system("mogrify -trim -bordercolor white -border 20x20 a.png")
Above you can also see that when I used the original coordinates produced by `plink --pca`, the clustering got messed up unless the coordinates were scaled first.
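The scaling step in question (weighting each coordinate by the square root of its eigenvalue, as in the commented-out `ave*sqrt(eig)` line in the R code) can be sketched in Python. The coordinates here are toy values, not real plink output; the two eigenvalues are the first two quoted in the R code's comment:

```python
import numpy as np

# Toy stand-ins for plink's outputs: two PCs for three samples,
# plus the eigenvalues that would come from plink.eigenval.
coords = np.array([[0.010, 0.167],
                   [0.038, 0.194],
                   [-0.200, -0.202]])
eigenval = np.array([22.7034, 12.2596])

# Multiply each column by sqrt(eigenvalue): dimensions that explain
# more variance get proportionally more weight in distances, so
# distance-based clustering isn't dominated by noisy later PCs.
scaled = coords * np.sqrt(eigenval)
print(scaled)
```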
Last edited by Nganasankhan; 03-17-2021 at 02:11 PM.