P37 STR UMAP



ph2ter
11-25-2021, 09:21 AM
I am very pleased with the Y-DNA STR37 UMAP plot of all FTDNA users.
It seems to me that such a plot shows the relative time distance and kinship between various clades without even knowing the SNP results.


In the I-P37 plot, M26, L233 and M423 are generally clearly split into separate clusters.
Internal structure is also clearly visible inside M423: L161 is separated from L621, and both are internally structured.
We can see that Disles and Y18331 are more distant from the main L621 cluster, which consists mostly of I2a-Din.
I2a-Din is in turn split into two main clusters: Din-N and Din-S.
The affinity of BY128 to Y18331 can also be seen; this may be because S17250 is more similar to Y18331 than Z17855 or Y4460 are.




I-P37 STR37 UMAP


https://i.imgur.com/Db7PodE.jpg




I-M423 STR37 UMAP


https://i.imgur.com/qBIBnti.jpg

Nganasankhan
11-25-2021, 10:19 AM
Have you tried using a bigger `min_dist` setting so there's less empty space between the clusters? If you're using the `umap` R package, it can be changed like this:

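library(umap)
def <- umap.defaults       # start from the package defaults
def$min_dist <- 0.5        # larger values spread the points out more
umap(mat, config = def)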

Or how does it compare to PCA? I haven't found much use for UMAP, because it mostly seems useful for showing clustering through the spatial position of the points. But even with PCA you can visualize clustering, for example by drawing convex hulls around the points based on the result of hclust+cutree or kmeans, so there isn't really a need to show clustering through point positions as well.
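A minimal sketch of the PCA-plus-hulls idea, using base R only. The matrix `mat` here is made-up random data standing in for real STR values, and the choice of `k = 3` clusters is an arbitrary assumption for illustration:

```r
# Made-up data: 60 "kits" x 5 numeric markers in three loose groups
set.seed(1)
mat <- rbind(
  matrix(rnorm(100, mean = 10), ncol = 5),
  matrix(rnorm(100, mean = 13), ncol = 5),
  matrix(rnorm(100, mean = 16), ncol = 5)
)

pca <- prcomp(mat)                        # principal components
xy  <- pca$x[, 1:2]                       # first two PCs for plotting
cl  <- cutree(hclust(dist(mat)), k = 3)   # hierarchical clusters, cut into 3

plot(xy, col = cl, pch = 20)
for (k in unique(cl)) {
  pts  <- xy[cl == k, , drop = FALSE]     # points belonging to cluster k
  hull <- chull(pts)                      # indices of the hull vertices
  polygon(pts[hull, , drop = FALSE], border = k)
}
```

Swapping `cutree(hclust(dist(mat)), k = 3)` for `kmeans(mat, 3)$cluster` would give the kmeans variant of the same plot.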

ph2ter
11-25-2021, 10:07 PM
No, I haven't tried to play with min_dist, but I will.

PCA is not as good as UMAP at reconstructing genetic kinship.

ChrisR
12-05-2021, 05:46 PM
Sorry if this is off-topic, but I would like to ask when UMAP is appropriate and how to create such maps from FTDNA Y-STR data.
I found https://github.com/lmcinnes/umap and https://umap-learn.readthedocs.io/en/latest/
But I have no coding skills (in Python), so I am not sure whether someone like me, with limited shell command-line experience, can succeed with this.

ph2ter
12-09-2021, 10:44 AM
First, you must prepare the Y-DNA STR data. Not every STR marker has a single value. For example, you must separate DYS385a from DYS385b.
Some kits have three values for such a marker instead of two. One of the three must be excluded, and likewise for other STRs with multiple values.
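A rough sketch of that preparation step in base R. It assumes the raw export stores multi-copy markers as hyphen-separated strings such as "11-14"; that format, the toy table `raw` and the helper `split_marker` are hypothetical and only for illustration. Which of three values to exclude is a judgment call; here the extra value at the end is simply dropped:

```r
# Hypothetical raw export: a multi-copy marker stored as "a-b" strings
raw <- data.frame(
  Kit_Number = c("K1", "K2", "K3"),
  DYS385     = c("11-14", "13-16", "11-14-15"),   # K3 has three values
  stringsAsFactors = FALSE
)

# Split one multi-copy column into `keep` numeric columns (name + a, b, ...).
# Assumption: values beyond `keep` are dropped from the end.
split_marker <- function(x, name, keep = 2) {
  parts <- strsplit(x, "-", fixed = TRUE)
  out <- do.call(rbind, lapply(parts, function(p) as.numeric(p)[seq_len(keep)]))
  colnames(out) <- paste0(name, letters[seq_len(keep)])
  as.data.frame(out)
}

prepared <- cbind(raw["Kit_Number"], split_marker(raw$DYS385, "DYS385"))
print(prepared)
```

Each marker then occupies its own numeric column (DYS385a, DYS385b), which is the shape the UMAP call expects.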
You can make the UMAP in R. Install R, install the umap package and then load the umap library.


Input your values from the file:


mydatK37 <- read.table("C:/P37_STR.txt", sep = "\t", header = TRUE, row.names = 1, quote = "", fill = FALSE)



In your tab-separated input file, the first column must be some ID, and one of the columns must be named Kit_Number.
In my input file the STR values start at the 8th position (counting from 0) and the last STR value is at the 44th position (this depends on whether you have 37 STR values or more). Because row.names=1 turns the ID column into row names, these are columns 8 to 44 of the data frame.
The n_neighbors parameter must be less than the number of kits. You can play with this parameter.
random_state is the seed. If you always want the same result, you must use the same seed:


library(umap)
res.umapK37 <- umap(mydatK37[, 8:44], n_neighbors = 500, random_state = 123)
# Plot the two UMAP coordinates and label each point with its kit number
plot(x = res.umapK37$layout[, 1], y = res.umapK37$layout[, 2], pch = 20)
text(x = res.umapK37$layout[, 1], y = res.umapK37$layout[, 2], labels = mydatK37$Kit_Number, cex = 0.6, pos = 3)
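The requirements described above (a Kit_Number column, one number per STR column, n_neighbors below the number of kits) can be checked before calling umap. This is only a sketch: the table `kits`, its values and the helper `check_input` are made up for illustration, and the column range would be 8:44 for a file laid out like the one in this post.

```r
# Hypothetical stand-in for the table read from the file
kits <- data.frame(
  Kit_Number = paste0("K", 1:6),
  DYS393 = c(13, 13, 14, 13, 12, 13),
  DYS390 = c(24, 23, 24, 25, 24, 24)
)
str_cols    <- 2:3   # 8:44 in the layout described in this post
n_neighbors <- 5     # 500 is used above for a much larger table

# Stops with an error if any requirement is not met
check_input <- function(dat, str_cols, n_neighbors) {
  stopifnot(
    "Kit_Number" %in% names(dat),                             # needed for labels
    all(sapply(dat[, str_cols, drop = FALSE], is.numeric)),   # one number per marker
    n_neighbors < nrow(dat)                                   # fewer neighbours than kits
  )
  invisible(TRUE)
}

check_input(kits, str_cols, n_neighbors)
```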