PCA plot using FlashPCA



everest59
02-05-2014, 12:34 AM
Dienekes made a thread about this below:
http://dienekes.blogspot.com/2014/01/flashpca-very-fast-pca-on-large-scale.html

I decided to give it a test, and I must say it is very fast software.
I used R to generate the attached PCA plot, which covers mainly South Asian and Central Asian populations.
Overall, this was as expected. Not sure why the Pulliyars are forming their own cluster. It could be a mislabeling.
P.S. The Metspalu dataset has some mislabeling. For example, I think the Tamil Nadu scheduled castes may be mislabeled.

everest59
02-05-2014, 01:21 AM
I decided to clean up the PCA plot above a little bit. There are still some outliers (must be some mislabeling). I decided to add Singapore Indians to the plot above.
See attached.

Sein
02-05-2014, 09:45 AM
Excellent! everest59, could you try a global PCA with this? Also, could you try one with both of us on it? Thanks!

everest59
02-05-2014, 10:45 AM
Sure. Btw Sein, you are in there. You and your mom have been labeled Pathan2. I am labeled Nepali.
Sure, I will add 12 more populations to the PCA above. Is that good enough?

Sein
02-05-2014, 10:53 AM
Thanks everest! And sure, whatever you think would be good in terms of experimentation. Just out of curiosity, would you by any chance know the IDs of the two Pashtun samples which are close to me? Thanks again!

Edit: Another question, but in your opinion, is this approach as good as regular PCA?

Dr_McNinja
02-05-2014, 11:50 AM
Everest, could you add me as well? Curious how close I would be to you (Nepali) on this

Sapporo
02-05-2014, 01:14 PM
Add me as well mate? Thanks. At least to the first one?

everest59
02-05-2014, 02:15 PM
Well, I can only figure out the IDs by adding text labels. The labels appear jumbled up, but it looks like the one above you is 6af, whereas the one to the left of you is either 22af or 20af. I can't tell properly. All I see is af.
Edit: Okay, I'm not sure if 6af is that close to you. She seems to be the most north-shifted Pashtun.

- - - Updated - - -

Sapporo and Mcninja, email me your data and I'll add you to the PCA plot above.
Let me know if anybody else wants to be added.
I'd like to add the Mal'ta specimen pretty soon as well, perhaps to a global PCA.

Sein
02-05-2014, 02:21 PM
Thanks everest59! As a special request, is it possible for you to eventually create a PCA plot similar to the one I've attached? The PCA plot in question is basically only Eurasian populations. No Africans, North Africans, Amerindians, or Australasians-Oceanians. Just European, West Asian, Caucasus, South Asian, Central Asian, and East Asian-Siberian-Southeast Asian populations, minus all African or African admixed groups, and populations from the Americas and Oceania (sorry if that sounded repetitive). It would be interesting to compare to the Cristofaro PCA.


everest59
02-05-2014, 02:24 PM
Sure, I'll do that. I'll wait for Sapporo's and Mcninja's data first. Should be pretty easy.

Sein
02-05-2014, 02:29 PM
You're awesome man! Great stuff! Also, whenever you have the time, could you do another quick one for me? All of the Afghan samples + myself + Behar and Yunusbayev Central Asians + HGDP Pakistanis?

everest59
02-05-2014, 02:33 PM
Not a problem.
To everybody who's sending me their data, can you also let me know your ethnicity? Thanks.
Good to know there seems to be quite a bit of interest.

everest59
02-05-2014, 03:30 PM
Okay, I just created a chart using West Asians and Pakistani populations as well as Central Asians. I removed Kalash because they formed their own cluster. Your data is in there, so is mfa's. I don't know if you guys will be able to see yourself in there. Chart is attached.
As always there will be outliers. Perhaps some mislabelings as well (the Metspalu datasets I downloaded have some mislabelings).
Kurd2 is mfa
Pathan2 is Sein and his mom.

Sein
02-05-2014, 03:38 PM
Thanks! I can't find myself or my mom, but that's okay.

This has a nice east vs west dimension. Interesting to see that the Afghan Pashtuns are as West Eurasian as the HGDP samples. The Baloch seem to be about as West Eurasian as Iranians.

MfA
02-05-2014, 03:44 PM
Thanks everest59. Can you post the PCA as a .png? I can't see myself.

everest59
02-05-2014, 04:17 PM
Sure, I will post it in PNG format. If that doesn't work, I will remove some individuals. I am away from my computer right now.

Sapporo
02-05-2014, 05:22 PM
I'd say the Iranians (on average) are slightly more West Eurasian than the HGDP Baloch. Like 4-5%. I also sent my raw data.

Sein
02-05-2014, 05:44 PM
A very good point. That sounds about right, around 4%-5%

Sapporo
02-05-2014, 07:03 PM
I'd guess Baloch are around 88-90% West Eurasian on average while Iranians around 93-95%. Kurds a few % more than Iranians. At least on a very basic level. If you look at factors like ANE, etc. it is probably less. The HGDP Pashtuns and Afghan Pashtuns from the recent Di Cristofaro paper are probably around 84-88% on average. HGDP Sindhis around 79-82%. These are guesses based on Dodecad K12b's estimates though.

The largest difference is that the West Eurasian ancestry of the Baloch is more South Asian-shifted, in their Gedrosia/Baloch component.

Sein
02-05-2014, 07:15 PM
You know, that's an interesting angle. I guess if we look at it from a pre-Lazaridis et al. perspective, Iranians are around 90% West Eurasian, since we know that the Behar Iranians are around 8% ASI. The average HGDP Baloch is 18% ASI. The average HGDP Pashtun is also 18% ASI. I'm referring to one of Dienekes' experiments, here:
https://docs.google.com/spreadsheet/ccc?key=0ArJDEoCgzRKedEd3N2drM05sck1wcG03TFdWUnZaQmc&authkey=CIHIwKcO&hl=en_US&authkey=CIHIwKcO#gid=0

But if we look at it through a different lens, it would be exceedingly interesting to see how these populations break down in terms of ANE and Basal Eurasian, as well as West Eurasian proper (WHG, or a related element in the Near East and South Asia).

newtoboard
02-05-2014, 07:20 PM
If basal Eurasian is associated with Y-DNA E then I don't expect there to be much east of Turkey. I think E frequencies take a sharp dip at Iran's border with Turkey and Iraq, and even among Kurds, Iranian Kurds have less E. I always found this interesting because the SW Asian component is almost nonexistent east of Iran, as is E. But I think there was an E* sample from somewhere in India. As a whole, Indo-Iranian groups have some of the lowest E frequencies anywhere. I wonder what prevented E lineages as well as Semitic languages from ever really crossing the Zagros.

DMXX
02-05-2014, 10:19 PM
This sort of pattern would indicate Y-DNA E's current distribution occurred before the Indo-Iranian speakers arrived from the east.

Data concerning Y-DNA E in Central Asia, at present, is actually conflicting and not as clear cut as your post implied.

1) Wells et al.'s The Eurasian Heartland: A continental perspective on Y-chromosome diversity (http://www.pnas.org/content/98/18/10244.long) found E-M96 scattered across Central Asia at varying frequencies, some of which cannot be considered insignificant. Of particular note:

Tajik (Samarkand, Uzbekistan) - 10% (n=40)
Uzbek (Khorezm, Uzbekistan) - 7% (n=70)
Tajik (Shugnan, Pamirs, Tajikistan) - 11% (n=44)

2) In contrast, Zerjal et al. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC419996/) (a similar research team to Wells et al.) approached Central Asian Y-chromosomes barely a year after the aforementioned paper and came to the following conclusion regarding Y-DNA E (haplogroups 8 and 21):

"Haplogroup 21 was restricted to the Caucasus region, with a frequency of 17% among the Lezgi and 10% among the Armenians."

I am inclined to take Wells et al.'s findings over Zerjal et al.'s in this context, given the former has a much more comprehensive sampling strategy than the latter.

The scenario whereby the Iranian plateau served as some sort of lineage buffer vis-a-vis Y-DNA E is therefore not sustainable. Especially when compared with specific Iranian populations. Quite a few of the Wells Central Asian sites had greater Y-DNA E than Grugni et al.'s Iranian samples (http://2.bp.blogspot.com/-_mypYjUAwqw/UAjZdNrHAqI/AAAAAAAAAMY/nxbhHviubow/s1600/study4.png). The picture therefore isn't so clear cut.
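For a rough sense of how much uncertainty sits behind the Wells et al. percentages quoted above, here is a minimal Python sketch of a 95% Wilson score interval. It assumes the rounded frequencies and sample sizes exactly as listed in this post; the function name `wilson_interval` is just an illustrative choice.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion p_hat seen in n samples."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Frequencies and sample sizes as quoted above (rounded percentages, so approximate).
wells = {
    "Tajik (Samarkand)": (0.10, 40),
    "Uzbek (Khorezm)": (0.07, 70),
    "Tajik (Shugnan)": (0.11, 44),
}
intervals = {name: wilson_interval(p, n) for name, (p, n) in wells.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.1%} to {high:.1%}")
```

With samples of 40 to 70 men the intervals are wide, which is worth keeping in mind when comparing sites.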

everest59
02-05-2014, 11:22 PM
Okay, I just created another PCA. I had quite a bit of trouble with this one as the mother-son pairs were causing problems. So I removed all mother-son pairs (I had three of those pairs).
I know DMXX was about to send me a file. I will include you when I do it next time.
I pruned the SNPs down to a count of 50k.
I decided to keep 5 individuals per population so people can see where they are located. I think I have located everybody here.

Any opinions?
I can include more individuals per population if that's what people here want.
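For anyone wanting to try the same kind of thinning (dropping related samples and keeping only a few individuals per population), here is a minimal Python sketch. It is not the script used for these plots. It assumes PLINK .fam-style records with the population label in the first column; the function name `thin_samples`, the `related_ids` set and the example IDs are all hypothetical.

```python
from collections import defaultdict

def thin_samples(fam_lines, per_pop=5, related_ids=frozenset()):
    """Keep at most `per_pop` individuals per population, skipping related samples.

    fam_lines   : iterable of .fam-style lines (FID IID PAT MAT SEX PHENO);
                  the FID column is assumed to carry the population label.
    related_ids : hypothetical set of sample IDs to exclude (e.g. mother-son pairs).
    """
    kept = defaultdict(list)
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pop, sample_id = fields[0], fields[1]
        if sample_id in related_ids:
            continue
        if len(kept[pop]) < per_pop:
            kept[pop].append(sample_id)
    return dict(kept)

# Made-up example records, for illustration only.
example = [
    "PopA A1 0 0 1 -9",
    "PopA A2 0 0 2 -9",
    "PopA A3 0 0 1 -9",
    "PopB B1 0 0 2 -9",
    "PopB B2 0 0 1 -9",
]
selection = thin_samples(example, per_pop=2, related_ids={"A2", "B1"})
print(selection)
```

The selected IDs could then be written out as a keep list for whatever tool does the actual subsetting.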

DMXX
02-05-2014, 11:29 PM
One quick observation regarding the Tajiks.

They seem to sit in a genetic space between Iranians and Pashtuns but look to be twice as close to the latter. It'd be interesting to have Pamiri populations involved someday to help determine whether this is due to Tajiks being "1/4 Persian" from post-Islamic times, or it's an indication of older population affinities.

Sein
02-05-2014, 11:30 PM
Again, awesome work!

So I'm closest to an Afghan Tajik, pretty interesting for me. And I'm glad you removed the mother-son pairs, those always create problems in Admixture and PCA.

Edit: That's a very interesting observation. Despite the West Iranian language, they seem rather similar to East Iranians (Pashtuns). If I'm not mistaken, I believe I read somewhere that their ancestors did speak East Iranian languages, so that might be relevant. That, or just living alongside Pashtuns.

everest59
02-05-2014, 11:32 PM
I think it could be a Pashtun who got mislabeled.

newtoboard
02-05-2014, 11:33 PM
Well, you are correct. I should have remembered Central Asia is a complex place. I still think the Iranian plateau as a buffer between the Levant and South Asia applies. I think it is possible Central Asian E lineages are recent and can be traced back to the Silk Road and maybe the spread of Islam. I do recall even L657 in Central Asia being traced to Arab missionaries.

MfA
02-05-2014, 11:47 PM
Thanks Everest, I stand at the western front and surrounded by Stars of David :)

Here's the Eurogenes PCA of South and West Asia: https://docs.google.com/file/d/0B9o3EYTdM8lQTUpsTHNXSWQyWlE/edit
with ID's: https://docs.google.com/file/d/0B9o3EYTdM8lQWHc4SVZYYUc0Skk/edit

Humanist
02-05-2014, 11:54 PM
Hi everest. Would you mind including me? Seeing the populations you have included, I suspect I will cluster near or among the Armenians.

everest59
02-06-2014, 12:03 AM
Sure, pm me. I'll give you my email address.

Also, I included DMXX. See attached.

I'm gonna take a break today though.

everest59
02-06-2014, 12:52 AM
Alright, final plot for the day with Humanist included. Yeah, you plot right there with the Armenians. See attached.
I think I am not going to accept any more samples for now.
I'd like to do more plots, perhaps tomorrow, if not by weekend.
I have some ideas. A global plot, West Asian specific, South Asian specific, and so on.
I just had another epiphany. I want to do an ADMIXTURE run from K=2 to 20. Is it okay if I include everybody? However, one step at a time. I'll do some PCAs first.

Sein
02-06-2014, 01:13 AM
Great, so that settles it, this method is just as accurate as PCA produced using EIGENSOFT (identical results). A great find!

And thank you for trying this out! I'm already looking forward to the PCAs you'll be creating later. And the Admixture run sounds fun. Do you plan to run it with different random seeds, and cross-validation?

Humanist
02-06-2014, 01:20 AM
Yes, sir. Georgians are right there as well. Reminds me of this graphic (Cypriots, Druze, Samaritans, Armenians, Assyrians, Iraqi Mandaeans, Georgians, and Abkhazians):

http://i1096.photobucket.com/albums/g326/dok101/Faces/caucasus_map-1.jpg

Dr_McNinja
02-06-2014, 02:29 AM
What's the ID of that Hazara individual right next to me? I wonder if it would be possible to see their admixture breakdown.

Sapporo
02-06-2014, 03:05 AM
Check the Afghan study Hazara. 1-2 outliers there.

everest59
02-06-2014, 03:12 AM
That's probably it. It'll be pretty tough to go check out the individual ID's.

I will try to do some South Asian specific plots, hopefully tomorrow, no guarantee. Sein wanted to see plots with mainly Pakistani samples along with Central Asians and West Asians, so that was what these plots were.

parasar
02-06-2014, 04:42 AM
If basal Eurasian is associated with Y-DNA E then I don't expect there to be much east of Turkey. I think E frequencies take a sharp dip at Iran's border with Turkey and Iraq, and even among Kurds, Iranian Kurds have less E. I always found this interesting because the SW Asian component is almost nonexistent east of Iran, as is E. But I think there was an E* sample from somewhere in India. As a whole, Indo-Iranian groups have some of the lowest E frequencies anywhere. I wonder what prevented E lineages as well as Semitic languages from ever really crossing the Zagros.

There was an E* Bhil sample, I believe. E is essentially the same as F and C as far as shared SNPs are concerned.
http://4.bp.blogspot.com/-AFN4jjwuNBc/Us15U6bBVVI/AAAAAAAAJdc/uY8iTtxO0Mo/s1600/Scozzari.png

- - - Updated - - -


The Pakistani Hazaras (if my read is correct) are the ones with striking Turko-Mongol affinity, the Afghan Hazara much less, right?

DMXX
02-06-2014, 11:48 AM
Well, you are correct. I should have remembered Central Asia is a complex place. I still think the Iranian plateau as a buffer between the Levant and South Asia applies. I think it is possible Central Asian E lineages are recent and can be traced back to the Silk Road and maybe the spread of Islam. I do recall even L657 in Central Asia being traced to Arab missionaries.

I do agree the Zagros mountains served as some sort of genetic buffer but it isn't the case with Y-DNA E. Autosomal components seem to demonstrate this buffering better (SW Asian vs. Gedrosian).

There are several possibilities for Y-DNA E in Central Asia. It...
- Was one of the lineages which spread eastwards in the BMAC's founding and the patchy distribution is due to historical shifts
- Arrived in Central Asia with the Achaemenid Persians, who created irrigation systems and founded small colonies along the outer rim of their empire
- Founding of the Silk Road disseminated it from the western part of the Iranian plateau to there
- Founder effect of medieval Sassanid Persian lineages after the Islamic invasion of Persia took place
- A combination of some of the above, or potentially all?

Without STR and SNP data, there's nothing to convince us of any one of these.

everest59
02-06-2014, 12:45 PM
About some Afghan Hazara samples, I really think some of them got mislabeled, and are actually either Tajik or Pashtun. I will list the Hazara ID's when I can.

MfA
02-06-2014, 01:03 PM
Looking forward to the ADMIXTURE runs.

Dr_McNinja
02-06-2014, 03:34 PM
Thanks. I looked at the new Afghan study Hazaras and they're nowhere near me (or any other Indians/Pakistanis) on the Harappa map I made, and their admixture is not like mine either (aside from being close in Baloch % in one).

everest59
02-06-2014, 06:42 PM
Great, so that settles it, this method is just as accurate as PCA produced using EIGENSOFT (identical results). A great find!

And thank you for trying this out! I'm already looking forward to the PCAs you'll be creating later. And the Admixture run sounds fun. Do you plan to run it with different random seeds, and cross-validation?

Yeah, I think EIGENSOFT won't produce anything different.
I haven't tried cross-validation before. Isn't that very slow?
Also, I am not sure what random seeds are for. Are they needed if you only set a termination criterion? Typically I just let the program run till the end.

Sein
02-06-2014, 06:52 PM
Unfortunately, cross-validation is very slow, so it might not be practical in our case. But the nice thing is that it lets one say which run has more generalizable results.

I think so, but I believe there is a default?

everest59
02-06-2014, 06:59 PM
Yeah, that's the main use of cross-validation. I think I'll skip it.
Yeah, you are right, there is a default termination criterion that can be changed. I think the default is good.
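On the seed question: a seed and a termination criterion do different jobs. The seed decides where an iterative fit starts, and the termination criterion decides when it stops. The toy Python sketch below illustrates why people repeat a run with several seeds and keep the best one. It is only an illustration under stated assumptions: `toy_loglik` is an invented two-peak function and has nothing to do with ADMIXTURE's actual model or options.

```python
import random

def toy_loglik(x):
    # Invented objective with two peaks (near x = -1 and x = 2), standing in
    # for a likelihood surface that has more than one local optimum.
    return -((x + 1) ** 2) * ((x - 2) ** 2) + x

def climb(start, step=0.01, max_iter=10_000):
    """Simple hill climbing: move to a better neighbour until none exists."""
    x = start
    for _ in range(max_iter):
        best = max((x - step, x, x + step), key=toy_loglik)
        if best == x:
            break  # termination: no neighbour improves the objective
        x = best
    return x

def run_with_seed(seed):
    rng = random.Random(seed)          # the seed fixes the random starting point
    start = rng.uniform(-3.0, 4.0)
    end = climb(start)
    return end, toy_loglik(end)

# Repeat the fit with several seeds and keep the run with the highest objective.
runs = {seed: run_with_seed(seed) for seed in range(5)}
best_seed = max(runs, key=lambda s: runs[s][1])
print(runs)
print("best seed:", best_seed)
```

Every run stops by the same rule, yet runs that start on different sides of the valley settle on different peaks, so comparing a few seeds is a cheap check that a result is not just a local optimum.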

everest59
02-06-2014, 10:55 PM
Thanks. I looked at the new Afghan study Hazaras and they're nowhere near me (or any other Indians/Pakistanis) on the Harappa map I made, and their admixture is not like mine either (aside from being close in Baloch % in one).

Well, the ones I used were Hazara Afghans:
Hazara Hazara6_2Af 0 0 1 -9
Hazara Hazara6_6Af 0 0 1 -9
Hazara Hazara6_27Af 0 0 1 -9
Hazara Hazara6_66Af 0 0 1 -9
Hazara Hazara6_68Af 0 0 1 -9

I think the closest one to you has to be 66Af. Honestly, (s)he could be a Tajik that Metspalu mislabeled. However, not sure, could be a Hazara.
https://docs.google.com/spreadsheet/ccc?key=0AuW3R0Ys-P4HdFNHSFRLal83NnlxbUNFSy1sX0dtdWc&usp=sharing#gid=0
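The five lines listed above look like PLINK .fam records (family ID, sample ID, father, mother, sex, phenotype). A minimal Python sketch for pulling the sample IDs out of such lines follows, assuming that six-column layout and that the first column is being used as the population label; `FamRecord` and `parse_fam_line` are illustrative names.

```python
from typing import NamedTuple

class FamRecord(NamedTuple):
    family_id: str   # used here as the population label
    sample_id: str
    father_id: str   # "0" means not specified
    mother_id: str   # "0" means not specified
    sex: str         # "1" = male, "2" = female, "0" = unknown
    phenotype: str   # "-9" = missing

def parse_fam_line(line):
    """Split one whitespace-separated .fam line into a FamRecord."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    return FamRecord(*fields)

# The records exactly as listed in the post above.
lines = [
    "Hazara Hazara6_2Af 0 0 1 -9",
    "Hazara Hazara6_6Af 0 0 1 -9",
    "Hazara Hazara6_27Af 0 0 1 -9",
    "Hazara Hazara6_66Af 0 0 1 -9",
    "Hazara Hazara6_68Af 0 0 1 -9",
]
records = [parse_fam_line(line) for line in lines]
hazara_ids = [r.sample_id for r in records if r.family_id == "Hazara"]
print(hazara_ids)
```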

Sein
02-06-2014, 11:02 PM
Any luck with a global plot?

everest59
02-06-2014, 11:10 PM
Well, I haven't worked on that one yet. I wanna do regional plots first. Working on a South Asian one.

everest59
02-07-2014, 12:12 AM
Here is a regional, South Asia-specific plot. I will do a West Asian-specific plot tomorrow.
I will probably add some more populations to the chart below.
I tried to add some Xing populations, as they had some Punjabi Arain, Tamil, AP as well as Nepali samples, but they seemed to cluster together. So I decided to remove them. I also removed the Kalash, as they formed their own cluster.
P.S. That one GIH sample on top is probably a European sample (I think TSI if I remember correctly).

everest59
02-07-2014, 01:25 AM
Here, I did a more global plot. Unfortunately, I had to reduce total individual count per population. This definitely looks like the global 23andme PCA plot.

Here is my plan, I think. Tomorrow, I will do West Asian-specific PCA plots (one regional and one global like this one).
After that, I will do a global plot with everybody included.
Then I'll add the Mal'ta sample to the PCA and see where he clusters.

Sein
02-07-2014, 01:39 AM
I love this aspect of PCA analyses, the dimensionality one finds in the data is almost always robust (at least with living populations). The results usually make sense, despite the paucity of samples. If you tried something like this in Admixture/Structure, the results would be far from optimal, but with PCA, this is really all you need. In short, PCA/MDS just seems more reliable vs Admixture/Structure.

It will be very interesting to see where the Mal'ta sample clusters, that's something I'm definitely looking forward to.

Also, I think the CEU sample clustering near the Yoruba, but deviating in a minor European direction (this sample's distance to the Yoruba is comparable to the HGDP Pashtun distance from Georgians), is actually an ASW sample. That would explain why it is predominantly African, but slightly shifted towards Europeans.
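For readers new to what PCA does with genotype data, here is a minimal pure-Python sketch of the first principal component. It uses plain power iteration on a tiny invented genotype matrix, which is not the faster algorithm FlashPCA itself implements, and it skips the per-SNP rescaling that real tools typically apply. The function names and the data are illustrative only.

```python
import math

def center_columns(matrix):
    """Subtract each SNP's mean so every column averages to zero."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[row[j] - means[j] for j in range(len(means))] for row in matrix]

def first_pc(genotypes, iterations=200):
    """Unit-length PC1 coordinates for each individual, via power iteration."""
    x = center_columns(genotypes)
    n = len(x)
    # Individual-by-individual similarity matrix (X times X transposed).
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]   # arbitrary deterministic start
    for _ in range(iterations):
        vec = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

# Invented allele counts (0/1/2): rows are individuals, columns are SNPs.
genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],   # made-up group 1
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],   # made-up group 2
]
pc1 = first_pc(genotypes)
print([round(v, 2) for v in pc1])
```

The two invented groups end up with opposite signs on PC1, which is the same kind of separation the plots in this thread show along their main axis.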

everest59
02-07-2014, 01:55 AM
Yup, it is an ASW sample. Also the GIH outlier I talked about in one of my earlier posts is actually an East Asian sample, I think CHB. I think I said CEU previously.

Sein
02-07-2014, 02:25 AM
Now that is interesting, mislabeling seems quite common. I guess it's perfectly alright when we're dealing with populations that are substantially diverged (though, on a modern human scale; looking at the "big picture", we are a rather homogeneous species), like Sub-Saharan Africans and Europeans. But this does become a serious issue with geographically contiguous populations, like those sampled by Di Cristofaro et al. We have no real idea of whether the Pashtun-like Tajik and Hazara samples are mislabeled Pashtun samples, or just actual Tajik and Hazara samples that resemble Pashtuns due to gene-flow.

everest59
02-07-2014, 02:36 AM
Well, the datasets from the Metspalu website seem to have a lot of mislabelings, unless I did something wrong (which I doubt). The Indian dataset is essentially unusable because the samples that are supposed to be Brahmins have been labeled as something else (e.g. Dusadh) or vice versa. A lot of samples are mislabeled in that dataset. I know this because I ran a K=10 ADMIXTURE on a file with >3000 individuals that I created by combining various datasets. So I will say with 100% guarantee that the Afghan samples are mislabeled. The issue isn't in just one dataset.
However, the Behar dataset seems to be okay. So not all datasets are bad. I am disappointed by that Indian dataset.

Sein
02-07-2014, 02:48 AM
That is rather unfortunate. I think Zack pointed out something about the Behar et al. Paniya samples a while back at HAP? I'm guessing it must be difficult to properly keep track of these things during sample collection?

As far as the Afghan Tajik sample I resemble is concerned, I have to agree with you, probably an Afghan Pashtun. In the Eurogenes MCLUST analysis, this sample receives a nearly 100% classification in the Pashtun cluster! But when it comes to Hazara samples, if they are mislabeled, they can only be Tajik samples, not Pashtun. In the spreadsheet, none of the Afghan Hazara get classified in the Pashtun cluster, so that makes this an unlikely proposition:
https://docs.google.com/spreadsheet/ccc?key=0Ato3EYTdM8lQdE8xQ2N2VDBFUUQzS2RmRkhBVmNuZWc&usp=sharing#gid=0

Speaking of MCLUST, is there anything similar for PCA data? Correct me if I'm wrong, but I think MCLUST is only possible with MDS. If anything similar exists for PCA, that would be beautiful, since we can't examine all of the dimensions, and that amounts to a lot of variation we aren't seeing.
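On using more dimensions than the two that get plotted: the Python sketch below is a much simpler stand-in, not MCLUST (which fits Gaussian mixture models). It just measures unweighted Euclidean distance from one sample to each population's centroid across several PCs. All coordinates and labels are invented, and `nearest_population` is a hypothetical helper name.

```python
import math

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def nearest_population(sample, reference):
    """Return (population, distance) for the centroid closest to `sample`.

    reference maps a population label to a list of PC-coordinate tuples;
    all coordinates are assumed to come from the same PCA run.
    """
    distances = {
        pop: math.dist(sample, centroid(points))
        for pop, points in reference.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]

# Invented coordinates on four PCs, for illustration only.
reference = {
    "PopA": [(0.10, 0.02, -0.01, 0.03), (0.12, 0.01, 0.00, 0.02)],
    "PopB": [(0.11, 0.02, 0.09, -0.05), (0.10, 0.03, 0.11, -0.04)],
}
sample = (0.11, 0.02, 0.08, -0.04)

# Same question asked with only the first two PCs, then with all four.
first_two = {pop: [p[:2] for p in pts] for pop, pts in reference.items()}
print(nearest_population(sample[:2], first_two))
print(nearest_population(sample, reference))
```

In this made-up case the sample sits closest to PopA on the first two PCs but closest to PopB once all four are used, which is the kind of hidden variation the question above is getting at.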

everest59
02-07-2014, 02:56 AM
Yeah, I agree, I don't think the mislabeled Hazara samples are Pashtuns. They seem more like Tajiks to me. Actually, from the Afghan paper, one Hazara sample seems very Tajik-like.
Now, I really can't be sure because they only collected 5 samples per population. We don't know much about the diversity in Afghanistan within an ethnicity.

Sein
02-07-2014, 03:03 AM
I have to be honest, I was pretty disappointed when I found out they collected only 4-5 samples. I wish they could have collected at least 20-25 samples per ethnicity. And in an ideal world, I wish most of these research teams could collect around 40-45 samples. Although, that might be overkill for really homogeneous populations (like the Kalash; in a case like that, 5 samples seems reasonable).

everest59
02-07-2014, 03:13 AM
Yeah, they should have done a better job. 5 samples per pop isn't enough in a country like Afghanistan, which is very diverse.

Anyways, I decided to create a quick West-Asian specific plot. The participants probably won't be able to find themselves. I just wanted to show you all where various populations cluster. I will remove some samples tomorrow.

Sein
02-07-2014, 06:47 AM
Very interesting!

Everest, could you create a plot similar to the one at #23, but with a few more peninsular South Asian, East Asian, and European populations? In fact, could you include every population of predominantly Eurasian ancestry (Europeans, West Asians, South Asians, East Eurasians, and everyone in between), but excluding those that create clusters due to drift-inbreeding (Kalash)? Just 5 individuals per population (while we are on the Kalash, wouldn't using only 5 Kalash samples mitigate some of the usual issues)? I think this would also give Dr_McNinja his actual position in PCA space, since we would be dealing with richer dimensionality.

Of course, only if you feel it would add any value. I really appreciate all of this, and it's proving to be a lot of fun. Thank you.
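The "just 5 individuals per population" idea can be sketched as a simple random subsample taken before the PCA is run. This is only a minimal illustration in Python, assuming the sample list is available as (sample ID, population) pairs; the IDs and labels below are placeholders, not the real dataset:

```python
import random
from collections import defaultdict


def subsample_per_population(samples, max_per_pop=5, seed=0):
    """Keep at most max_per_pop individuals from each population.

    samples: list of (sample_id, population) pairs (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_pop = defaultdict(list)
    for sample_id, pop in samples:
        by_pop[pop].append(sample_id)

    kept = []
    for pop in sorted(by_pop):
        ids = by_pop[pop]
        # Small populations are kept whole; larger ones are sampled without replacement.
        chosen = ids if len(ids) <= max_per_pop else rng.sample(ids, max_per_pop)
        kept.extend((sample_id, pop) for sample_id in sorted(chosen))
    return kept


# Placeholder labels, for illustration only.
samples = [(f"Kalash{i}", "Kalash") for i in range(1, 21)]
samples += [(f"Pathan{i}", "Pathan") for i in range(1, 4)]
subset = subsample_per_population(samples)
for sample_id, pop in subset:
    print(sample_id, pop)
```

Capping a drifted group such as the Kalash at 5 individuals reduces how much weight it gets in the PCA, though it may not remove the clustering effect entirely.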

Dr_McNinja
02-07-2014, 07:03 AM
That South Asia one is very interesting. It seems laid out more like a geographic map and I'm right near Everest there.

- - - Updated - - -

Where would Bedouins plot on a global map with West Asians and South Asians? Wondering because I don't think that minor bit of SW-Asian that shows up in admixture runs is really of a Bedouin origin or even Bedouin-like. I guess that name "Southwest Asian" is appropriate though, but I wonder what it represents. Perhaps really old Caucasian/West Asian?

Humanist
02-07-2014, 07:59 AM
I guess that name "Southwest Asian" is appropriate though, but I wonder what it represents. Perhaps really old Caucasian/West Asian?

A post that is relevant to your question. From the Dodecad thread:


I posted this on another forum a while back. It is based on this Dienekes blog entry: Inter-relationships of the Dodecad K12b...components (http://dienekes.blogspot.com/2012/08/inter-relationships-of-dodecad-k12b-and.html)

Dienekes [order rearranged]:


Southwest Asian appears to be Caucasus
Gedrosia appears to be Caucasus + a slice of Siberian
Atlantic Med appears to be Caucasus + a slice of North European
Northwest African appears to be Caucasus + a minority Sub Saharan
South Asian appears to be Caucasus + East Asian
East African appears to be Sub Saharan + minority Caucasus

Caucasus appears Atlantic Med + Gedrosia + slices of Northwest African and Southwest Asian
North European appears to be Atlantic Med + Gedrosia with a slice of Siberian

http://i1096.photobucket.com/albums/g326/dok101/Caucasus.jpg

everest59
02-07-2014, 10:39 AM
Very interesting!

Everest, could you create a plot similar to the one at #23, but with a few more peninsular South Asian, East Asian, and European populations? In fact, could you include every population of predominantly Eurasian ancestry (Europeans, West Asians, South Asians, East Eurasians, and everyone in between), but excluding those that create clusters due to drift-inbreeding (Kalash)? Just 5 individuals per population (while we are on the Kalash, wouldn't using only 5 Kalash samples mitigate some of the usual issues)? I think this would also give Dr_McNinja his actual position in PCA space, since we would be dealing with richer dimensionality.

Of course, only if you feel it would add any value. I really appreciate all of this, and it's proving to be a lot of fun. Thank you.

That is one thing I would really love to do. The only trouble is that I don't know how to add more than 25 populations, because I think R has only 25 different point shapes.
If anybody knows how to do it, I would appreciate it. I started learning R just a few days ago.
I'll try to find some relevant code.
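One common way around a limited set of point shapes is to combine shapes with colours, so that each population gets its own (shape, colour) pair. A minimal sketch of that idea in Python, assuming R's standard symbol codes 0 to 25 and an arbitrary four-colour palette; the population names are placeholders, not the real labels:

```python
from itertools import product


def assign_styles(populations, shapes, colors):
    """Give every population a unique (shape, color) combination."""
    combos = list(product(colors, shapes))  # every colour paired with every shape
    if len(populations) > len(combos):
        raise ValueError("not enough shape/colour combinations for all populations")
    return {
        pop: {"shape": shape, "color": color}
        for pop, (color, shape) in zip(populations, combos)
    }


shapes = list(range(26))                         # assumed: symbol codes 0..25
colors = ["black", "red", "blue", "darkgreen"]   # arbitrary palette
populations = [f"Pop{i}" for i in range(1, 90)]  # 89 placeholder population names

styles = assign_styles(populations, shapes, colors)
print(styles["Pop1"], styles["Pop27"])
```

With 26 shapes and 4 colours there are 104 combinations, which is enough for 89 populations. The shape and colour values could then be supplied per point as `pch` and `col` when plotting in R.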

ZephyrousMandaru
02-07-2014, 11:05 AM
That South Asia one is very interesting. It seems laid out more like a geographic map and I'm right near Everest there.

- - - Updated - - -

Where would Bedouins plot on a global map with West Asians and South Asians? Wondering because I don't think that minor bit of SW-Asian that shows up in admixture runs is really of a Bedouin origin or even Bedouin-like. I guess that name "Southwest Asian" is appropriate though, but I wonder what it represents. Perhaps really old Caucasian/West Asian?

Southwest Asian likely branched off of Caucasus, but Caucasus, Mediterranean and Southwest Asian are all ultimately derived from Basal Eurasian. Which is probably the earliest ancestral West Eurasian component to ever exist. If I'm not mistaken, it peaks in Sardinians, Saudi Arabians and Yemenis. Polako from ABF has recently commented on this.


The most likely option is that Upper Paleolithic Middle Easterners were very much like Saudis and Yemenis minus the African, Caucasus-like and Eastern European-like admixture that they have now.

My guess is that today's Middle Easterners, that's including all of them even the minorities, are very different from the ones in the Paleolithic. Today's Middle Eastern components are modified versions of this Basal Eurasian component. Caucasus itself has ANE admixture, which explains why people who possess it in enormous amounts deviate towards the east. This probably applies to Southwest Asian as well, given that it's a derivative of Caucasus.

I think the earliest component to split off Basal Eurasian is probably Mediterranean, as it seems to lack this ANE-shift that Caucasus and Southwest Asian have.

Dr_McNinja
02-07-2014, 11:26 AM
Just wondering, does everyone consider Gedrosian to be West Asian? Aside from 23andMe/FTDNA that is, where I think they classify it as South Asian for a recent timeframe.

And which components would be considered West Eurasian? Gedrosian, Caucasian, North/east-European, Atlantic-Med, Southwest Asian? What about Siberian and other Arctic ones?

I added Dodecad K12b to the spreadsheet here: https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdFMtRHVlZDBuQ3lMcjhxMDE4V3JoYlE&usp=drive_web#gid=10

The three extra columns are for European (N-Euro+Atl-Med) and Caucasian (Gedrosian+Caucasian+SW-Asian), and then the third is the total of those two which I labeled as "West Eurasian?". This total is similar between Dodecad K12b and the HarappaWorld calculator.

Then I added my Geno 2.0 results:

50% Southwest Asian
19% Southeast Asian
16% Mediterranean
11% Northern European
2% Native American
2% Northeast Asian

That Southwest Asian represents some degree of West Eurasian and old Indian (according to the Geno 2.0 people). There was a discussion on that previously in the thread for my results. If you add the Mediterranean and North European components, you get 27%. Subtracting that 27% from the "West Eurasian?" field from the K12b sheet (57.96% for me) gives 30.96% which is unaccounted for West Eurasian in the Geno 2.0 results. Subtracting that from Geno 2.0's Southwest Asian gives 19.04%.

So that leaves two components, the Southeast Asian @ 19% and another mysterious one @ 19.04% to be the equivalent of the "South Indian" component in the other admixture calculators. That totals about 38% which is around what I got for South Indian in various calculators (37% in DIYHarappaWorld, 38-39% in Dodecad K12b). So can I take this to mean that the South Indian component is actually split almost 50/50 for me, with half representing ASI and half representing the original ANI? (In other words, Geno 2.0's SW-Asian+Med+N-Euro = Gedrosian/Caucasian/SW-Asian/NE-Euro/Atl-Med/Half-of-South-Indian) Or what else could that half of South Indian represent?
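The subtraction steps above can be written out as a quick check. The percentages are the ones given in this post; the variable names are made up for illustration:

```python
# Geno 2.0 percentages as listed in the post.
geno2 = {
    "Southwest Asian": 50.0,
    "Southeast Asian": 19.0,
    "Mediterranean": 16.0,
    "Northern European": 11.0,
    "Native American": 2.0,
    "Northeast Asian": 2.0,
}
k12b_west_eurasian = 57.96  # the "West Eurasian?" column from the K12b sheet

# Step 1: Mediterranean + Northern European.
european = geno2["Mediterranean"] + geno2["Northern European"]
# Step 2: West Eurasian in K12b that the two European components do not cover.
unaccounted = k12b_west_eurasian - european
# Step 3: what is left of Geno 2.0's Southwest Asian after removing that part.
residual = geno2["Southwest Asian"] - unaccounted
# Step 4: Southeast Asian plus the residual, to compare with "South Indian".
south_indian_equivalent = geno2["Southeast Asian"] + residual

print(round(european, 2), round(unaccounted, 2),
      round(residual, 2), round(south_indian_equivalent, 2))
```

The last figure comes out at roughly 38%, in line with the 37-39% South Indian estimates quoted above.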

What does that bode for the remaining Arctic and East Asian admixture in some South Asians? Does that represent ASI from the north of the subcontinent or more recent admixture events?

newtoboard
02-07-2014, 01:27 PM
I do agree the Zagros mountains served as some sort of genetic buffer but it isn't the case with Y-DNA E. Autosomal components seem to demonstrate this buffering better (SW Asian vs. Gedrosian).

There are several possibilities for Y-DNA E in Central Asia. It...
- Was one of the lineages which spread eastwards in the BMAC's founding and the patchy distribution is due to historical shifts
- Arrived in Central Asia with the Achaemenid Persians, who created irrigation systems and founded small colonies along the outer rim of their empire
- Founding of the Silk Road disseminated it from the western part of the Iranian plateau to there
- Founder effect of medieval Sassanid Persian lineages after the Islamic invasion of Persia took place
- A combination of some of the above, or potentially all?

Without STR and SNP data, there's nothing to convince us of either.

Well if the BMAC people were absorbed by Proto Indo-Aryans the first option looks unlikely. The next three look quite likely. The last three events are quite recent so it does look like the Zagros mountains served as a buffer for E in older times.

everest59
02-07-2014, 11:02 PM
Good news guys. I now know how to zoom in. This is the same data I used to create the image in post #57, but I managed to zoom in by changing the Cartesian coordinates. Basically, I need to create just one global plot in the future, and zoom in on populations of interest.
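For anyone wanting to do the same, zooming a PCA plot comes down to restricting the axis ranges to the region around the populations of interest. A minimal sketch in Python, assuming each point is a (population, PC1, PC2) tuple; the coordinates and labels below are invented for illustration and are not taken from the actual plot:

```python
def zoom_limits(points, populations, pad=0.1):
    """Axis ranges that enclose the chosen populations, with a small margin."""
    xs = [pc1 for pop, pc1, pc2 in points if pop in populations]
    ys = [pc2 for pop, pc1, pc2 in points if pop in populations]
    if not xs:
        raise ValueError("none of the requested populations are in the data")
    x_pad = (max(xs) - min(xs)) * pad
    y_pad = (max(ys) - min(ys)) * pad
    return (min(xs) - x_pad, max(xs) + x_pad), (min(ys) - y_pad, max(ys) + y_pad)


def points_in_view(points, xlim, ylim):
    """All points (from any population) that fall inside the zoomed window."""
    return [
        p for p in points
        if xlim[0] <= p[1] <= xlim[1] and ylim[0] <= p[2] <= ylim[1]
    ]


# Invented coordinates, for illustration only.
points = [
    ("Druze", 0.02, 0.01),
    ("Druze", 0.04, 0.03),
    ("Cypriot", 0.03, 0.00),
    ("Armenian", 0.035, 0.02),
    ("Han", -0.10, 0.08),
]
xlim, ylim = zoom_limits(points, {"Druze", "Cypriot"})
visible = points_in_view(points, xlim, ylim)
print(xlim, ylim, [p[0] for p in visible])
```

If the plot is made with ggplot2, passing limits like these to `coord_cartesian(xlim = ..., ylim = ...)` is one way to zoom without dropping any data from the underlying plot.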

Humanist
02-07-2014, 11:36 PM
Good news guys. I now know how to zoom in. This is the same data I used to create the image in post #57, but I managed to zoom in by changing the Cartesian coordinates. Basically, I need to create just one global plot in the future, and zoom in on populations of interest.

Thanks, everest. Interestingly, in this plot I am more similar to the Indo-European speaking Cypriots, than to the Semitic-speaking (albeit Arabic) Druze. The Arabs, as expected, form their own cluster on the right.

everest59
02-07-2014, 11:53 PM
Thanks, everest. Interestingly, in this plot I am more similar to the Indo-European speaking Cypriots, than to the Semitic-speaking (albeit Arabic) Druze. The Arabs, as expected, form their own cluster on the right.

It should be interesting to see where Zephyrous clusters.
Anyways, I forgot to include some populations in that plot (e.g. Kurds, Georgians).

MfA
02-08-2014, 12:35 AM
I overlap with a Turk next to the Armenians. This looks like an East Med/Med vs Southwest Asian PCA.

ZephyrousMandaru
02-08-2014, 01:50 AM
It should be interesting to see where Zephyrous clusters.
Anyways, I forgot to include some populations in that plot (e.g. Kurds, Georgians).

I'll probably cluster with Humanist, and end up in the Armenian and Georgian clusters, or perhaps the Cypriot cluster. If more Mesopotamian groups were added, I'd probably cluster with them.

Sein
02-08-2014, 04:20 AM
Just wondering, does everyone consider Gedrosian to be West Asian? Aside from 23andMe/FTDNA that is, where I think they classify it as South Asian for a recent timeframe.

And which components would be considered West Eurasian? Gedrosian, Caucasian, North/east-European, Atlantic-Med, Southwest Asian? What about Siberian and other Arctic ones?

I added Dodecad K12b to the spreadsheet here: https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdFMtRHVlZDBuQ3lMcjhxMDE4V3JoYlE&usp=drive_web#gid=10

The three extra columns are for European (N-Euro+Atl-Med) and Caucasian (Gedrosian+Caucasian+SW-Asian), and then the third is the total of those two which I labeled as "West Eurasian?". This total is similar between Dodecad K12b and the HarappaWorld calculator.

Then I added my Geno 2.0 results:

50% Southwest Asian
19% Southeast Asian
16% Mediterranean
11% Northern European
2% Native American
2% Northeast Asian

That Southwest Asian represents some degree of West Eurasian and old Indian (according to the Geno 2.0 people). There was a discussion on that previously in the thread for my results. If you add the Mediterranean and North European components, you get 27%. Subtracting that 27% from the "West Eurasian?" field from the K12b sheet (57.96% for me) gives 30.96% which is unaccounted for West Eurasian in the Geno 2.0 results. Subtracting that from Geno 2.0's Southwest Asian gives 19.04%.

So that leaves two components, the Southeast Asian @ 19% and another mysterious one @ 19.04% to be the equivalent of the "South Indian" component in the other admixture calculators. That totals about 38% which is around what I got for South Indian in various calculators (37% in DIYHarappaWorld, 38-39% in Dodecad K12b). So can I take this to mean that the South Indian component is actually split almost 50/50 for me, with half representing ASI and half representing the original ANI? (In other words, Geno 2.0's SW-Asian+Med+N-Euro = Gedrosian/Caucasian/SW-Asian/NE-Euro/Atl-Med/Half-of-South-Indian) Or what else could that half of South Indian represent?

What does that bode for the remaining Arctic and East Asian admixture in some South Asians? Does that represent ASI from the north of the subcontinent or more recent admixture events?

For South Asians, West Eurasian would be the sum total of one's Gedrosia, Caucasus, North_European, Atlantic-Med, and SW-Asian percentages, in addition to 50% of the South Asian component. And you are absolutely correct to split the South Asian component. According to Dienekes, the South Asian component for K12b is approximately 50% West Eurasian, and 50% ASI. Basically, West Eurasian ancestry for South Asians would be equivalent to your W-Eurasian+1/2S-Asian column. And it corresponds rather nicely with Geno 2.0. According to Geno 2.0, I'm 82% West Eurasian, and you are 77% West Eurasian, so that's an excellent correspondence between your estimate and Geno 2.0! For Geno 2.0, Iranians range between 86% and 95% West Eurasian, so you would probably obtain a similar estimate using your method. And for the Bengali individual, who is around 59% West Eurasian, that's a rather accurate estimate. I have seen other Bengali results, and Admixture tends to put them at 55%-65% West Eurasian, and PCA plots tend to construe them as slightly more West Eurasian than the more Western-shifted Hazara/Uyghur.
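As a rough sketch, the estimate described above (the five West Eurasian components plus half of the South Asian component) looks like this in Python. The dictionary keys and the percentages are invented for illustration and are not anyone's actual K12b results:

```python
# Components counted fully as West Eurasian in this estimate (key names are assumed).
WEST_EURASIAN = [
    "Gedrosia",
    "Caucasus",
    "North_European",
    "Atlantic_Med",
    "Southwest_Asian",
]


def west_eurasian_total(k12b, south_asian_share=0.5):
    """Sum the West Eurasian components and add a share of South Asian."""
    direct = sum(k12b.get(component, 0.0) for component in WEST_EURASIAN)
    return direct + south_asian_share * k12b.get("South_Asian", 0.0)


# Invented percentages that add up to 100, for illustration only.
example = {
    "Gedrosia": 25.0,
    "Caucasus": 15.0,
    "North_European": 8.0,
    "Atlantic_Med": 4.0,
    "Southwest_Asian": 3.0,
    "South_Asian": 40.0,
    "Siberian": 5.0,
}
total = west_eurasian_total(example)
print(total)
```

The 0.5 share is the approximate 50/50 split mentioned above, so the result is only as good as that assumption.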

I tried this with my mother's results, and she is around 85% West Eurasian. Makes sense, as she clusters in the same spot as my Afghan Pashtun Popalzai friend, despite being significantly more South Asian than him (and moderately more so than myself).

I think Arctic and Siberian components may have entered northern South Asia with the Indo-Aryans, and the Scythians. They must have had some East Eurasian admixture.

Dr_McNinja
02-08-2014, 05:38 AM
For South Asians, West Eurasian would be the sum total of one's Gedrosia, Caucasus, North_European, Atlantic-Med, and SW-Asian percentages, in addition to 50% of the South Asian component. And you are absolutely correct to split the South Asian component. According to Dienekes, the South Asian component for K12b is approximately 50% West Eurasian, and 50% ASI. Basically, West Eurasian ancestry for South Asians would be equivalent to your W-Eurasian+1/2S-Asian column. And it corresponds rather nicely with Geno 2.0. According to Geno 2.0, I'm 82% West Eurasian, and you are 77% West Eurasian, so that's an excellent correspondence between your estimate and Geno 2.0! For Geno 2.0, Iranians range between 86% and 95% West Eurasian, so you would probably obtain a similar estimate using your method. And for the Bengali individual, who is around 59% West Eurasian, that's a rather accurate estimate. I have seen other Bengali results, and Admixture tends to put them at 55%-65% West Eurasian, and PCA plots tend to construe them as slightly more West Eurasian than the more Western-shifted Hazara/Uyghur.

I tried this with my mother's results, and she is around 85% West Eurasian. Makes sense, as she clusters in the same spot as my Afghan Pashtun Popalzai friend, despite being significantly more South Asian than him (and moderately more so than myself).

I think Arctic and Siberian components may have entered northern South Asia with the Indo-Aryans, and the Scythians. They must have had some East Eurasian admixture.

I wonder if my West-leaning European admixture was somehow disconnected from the East Eurasian admixture. Like, combining Germanic European with Arctic admixture = NE-Euro? And they're both being picked up separately? Is that possible?

The two interpretations are to take the results at face value as separate components, where it seems to imply some kind of North Indian admixture, but puts me outside of Punjab and like near Nepal. Or to take it altogether where it places me in Punjab. In the case of the latter, that would mean tying the excess Arctic back up into the West Eurasian admixture. Even connecting some of the Caucasus-like admixture within South Indian with the Arctic and getting more Gedrosian as a result (and less South Indian) would "Punjabify" my results. It's just strange how they average up right but break down weird. I don't think high Arctic admixture is typical of the Himalayan region. It's close to Pashtun levels. The other clue is the absence (in Harappa and the other GEDmatch calculators) of East Asian admixture which does typify the Himalayan and East Indian region and even the far northwest (Pashtun). So I feel like my Arctic admixture doesn't make sense unless it came from a West Eurasian population.

And I keep coming back to what Dr. McDonald said regarding my mom's results that the European was hard to discern from South Asian. I think the community is settling for analyzing too few SNPs considering the size of the human genome. It's always fallacious to believe any small cross-section can truly/accurately represent the whole. Better reference populations, new chips that test more SNPs for ancestry, hope we won't have to wait too long for that.

EDIT: I remember you saying you thought there was a missing link, an East Eurasian population which was responsible for some ancestry in Pashtun. That's possible but it could also be that everyone's existing West Eurasian admixture is being underestimated by splitting off the Arctic. For some it's Gedrosian, for others it's Caucasian or European. We can't just add it back in though because the East Eurasian is a genuine possibility for Pashtun but I don't think it makes sense with me since my origins point to the East but there isn't enough East Asian. Also recall how I had like 6% archaic Denisovan/Neanderthal by Geno 2.0's estimates. I think it's just really old admixture that might not be counted right.

Sein
02-08-2014, 06:17 AM
You know, I think you might be on to something. The East Asian does seem unlikely in your case. In fact, maybe it represents ANE? In addition, I wouldn't be surprised if some of our ASI ancestry is really ANE. Back when 23andMe had a three-continent system of European, Asian, and African, most people from Punjab got 88%-96% West Eurasian. And if you use the Interpretome Chromosome Painting, you will almost certainly turn out to be around 90% West Eurasian. I would have been 94% "European" on the old 23andMe system, and Interpretome puts me around 95% West Eurasian. I asked Dr.McDonald to try a chromosomal painting, and I turned out 91% West Eurasian. For some reason, chromosome paintings always put me in the 90%-95% range for West Eurasian ancestry, while Admixture almost always puts me at 81%-83% West Eurasian. The disparity always confused me. But, perhaps ANE has something to do with it? ANE ancestry is basically West Eurasian, in contemporary terms (even though what I just said doesn't really make sense. Mal'ta is ancestral to living West Eurasians, so it isn't accurate to say that he was a West Eurasian, as we should really construe and conceptualize West Eurasians in relation to his clade). But it has an eastern affinity that might complicate things. That's why Europeans were appearing to be 20% Native American before the Mal'ta genome. And Pashtuns (and by implication, Punjabis) have some heavy ANE ancestry. Perhaps some excess ANE lacking in Georgians is being construed as ASI for South Asians? If so, instead of me being 82% West Eurasian, and you being 77% West Eurasian, I wouldn't be surprised if I'm actually 90% West Eurasian, and you're 85% West Eurasian.

It'll all be answered if researchers take a fresh look at South Asian genetics, taking into consideration the new ideas that have surfaced concerning the origins of Europeans. And if we live to see the day (lol) when ancient DNA will be acquired from South Asia, we have no idea what we'll have to rethink. Ancient DNA has a way of upsetting preconceived notions founded on data from living populations. And I couldn't agree more about improved SNP chips, and better ascertainment schemes. With full genomes, ascertainment bias won't be a big issue, but we still aren't there yet. And the more sampling, the better, no doubt about that.

parasar
02-08-2014, 06:52 AM
You know, I think you might be on to something. The East Asian does seem unlikely in your case. In fact, maybe it represents ANE? In addition, I wouldn't be surprised if some of our ASI ancestry is really ANE. Back when 23andMe had a three-continent system of European, Asian, and African, most people from Punjab got 88%-96% West Eurasian. And if you use the Interpretome Chromosome Painting, you will almost certainly turn out to be around 90% West Eurasian. I would have been 94% "European" on the old 23andMe system, and Interpretome puts me around 95% West Eurasian. I asked Dr.McDonald to try a chromosomal painting, and I turned out 91% West Eurasian. For some reason, chromosome paintings always put me in the 90%-95% range for West Eurasian ancestry, while Admixture almost always puts me at 81%-83% West Eurasian. The disparity always confused me. But, perhaps ANE has something to do with it? ANE ancestry is basically West Eurasian, in contemporary terms (even though what I just said doesn't really make sense. Mal'ta is ancestral to living West Eurasians, so it isn't accurate to say that he was a West Eurasian, as we should really construe and conceptualize West Eurasians in relation to his clade). But it has an eastern affinity that might complicate things. That's why Europeans were appearing to be 20% Native American before the Mal'ta genome. And Pashtuns (and by implication, Punjabis) have some heavy ANE ancestry. Perhaps some excess ANE lacking in Georgians is being construed as ASI for South Asians? If so, instead of me being 82% West Eurasian, and you being 77% West Eurasian, I wouldn't be surprised if I'm actually 90% West Eurasian, and you're 85% West Eurasian.

It'll all be answered if researchers take a fresh look at South Asian genetics, taking into consideration the new ideas that have surfaced concerning the origins of Europeans. And if we live to see the day (lol) when ancient DNA will be acquired from South Asia, we have no idea what we'll have to rethink. Ancient DNA has a way of upsetting preconceived notions founded on data from living populations. And I couldn't agree more about improved SNP chips, and better ascertainment schemes. With full genomes, ascertainment bias won't be a big issue, but we still aren't there yet. And the more sampling, the better, no doubt about that.

It all depends on the time.
100000 ybp we are all Africans.
50000 ybp we are, for the most part, all East Eurasians (though Basal Eurasian is still confusing: whether it is old Mediterranean, i.e., a very early offshoot from OoA, or an early branch of East Eurasian is not clear).

ANE (initial P, then NO, then C3) just happen to be later movements from eastern Eurasia.

That Arctic component may be signaling the initial ANE.
An Arctic origin of the Aryans, redux :) "Claiming that Arctic climes were much more moderate prior to 10,000 years ago, Tilak went on to hypothesize that during the preglacial and interglacial periods civilization was developed by the ancestors of the living peoples of India but that onset of colder conditions some 10,000 to 8,000 years ago, migration to the warmer lands of the south was initiated." http://books.google.com/books?id=W6zQHNavWlsC&pg=PA369

And, evidence of a floral and related faunal change!

The change in vegetation began roughly 25,000 years ago and ended about 10,000 years ago - a time when many of the big animals slipped into extinction ... The Arctic region once teemed with herds of big animals, in some ways resembling an African savanna. Large plant eaters included woolly mammoths, woolly rhinos, horses, bison, reindeer and camels, with predators including hyenas, saber-toothed cats, lions and huge short-faced bears. http://www.anthrogenica.com/showthread.php?2149-New-Theory-on-the-Extinction-of-Ice-Age-Mammals-in-North-America&p=30273&viewfull=1#post30273

Dr_McNinja
02-08-2014, 07:55 AM
Sein, what settings did you use on the Interpretome? Here's my chromosome painting at default settings with Hapmap 2:

http://i.imgur.com/PTh1JDR.png

Sein
02-08-2014, 08:13 AM
Sein, what settings did you use on the Interpretome? Here's my chromosome painting at default settings with Hapmap 2:

http://i.imgur.com/PTh1JDR.png

The absolute best settings would be:

1374

Although, it will take a very long time to run. But the results are solid.

Still though, your chromosome painting is what people from northwestern South Asia should expect. Almost completely European. I thought something could be wrong with the algorithm, so I tried it on various European, East Asian, and African American 23andMe raw-data files. The results made sense for all of them. 100% East Asian for the East Asians, 100% European for the Europeans, and mostly African+European for the African American. So the results are robust for non-South Asians, but very unexpected for South Asians. Not sure why, but haploblock approaches just give much higher estimates of West Eurasian ancestry for South Asians than Admixture/Structure, but identical results for everyone else.

Dr_McNinja
02-08-2014, 08:32 AM
The absolute best settings would be:

1374

With Hapmap 2 I take it?

http://i.imgur.com/oJ5z9tL.png

Sein
02-08-2014, 08:35 AM
With Hapmap 2 I take it?

http://i.imgur.com/oJ5z9tL.png

Definitely. So you're around 90%-95% West Eurasian according to this.

everest59
02-08-2014, 03:00 PM
I created a big 20x20 file (in inches) that has 89 populations in it. How do I post it? I think people will need to download it and then zoom in.

Sein
02-08-2014, 03:02 PM
I created a big 20x20 file (in inches) that has 89 populations in it. How do I post it? I think people will need to download it and then zoom in.

This sounds exciting!

everest59
02-08-2014, 03:08 PM
This sounds exciting!

Here is the file. I haven't really checked it thoroughly, but I say download it and then zoom in.
https://drive.google.com/file/d/0B3vEDdpZDjUpSXV0NGlubElVM2M/edit?usp=sharing

Actually, it is a 30x30 file due to so many populations. I haven't even looked for errors yet. Let me know what people see.

MfA
02-08-2014, 03:20 PM
Here is the file. I haven't really checked it thoroughly, but I say download it and then zoom in.
https://drive.google.com/file/d/0B3vEDdpZDjUpSXV0NGlubElVM2M/edit?usp=sharing

Actually, it is a 30x30 file due to so many populations. I haven't even looked for errors yet. Let me know what people see.

Thanks, Everest. I uploaded it to a third-party image host.

http://abload.de/img/test49kfb.png

Just a thought: maybe a text-searchable PDF file would be better for the big, crowded maps?

ZephyrousMandaru
02-08-2014, 03:41 PM
Here is the file. I haven't really checked it thoroughly, but I say download it and then zoom in.
https://drive.google.com/file/d/0B3vEDdpZDjUpSXV0NGlubElVM2M/edit?usp=sharing

Actually, it is a 30x30 file due to so many populations. I haven't even looked for errors yet. Let me know what people see.

Thanks. It appears that I'm clustering not with the Armenians, but with the Druze and one Syrian. Humanist also seems to be clustering with the Druze, although unlike me, he's not actually in the Druze cluster, just adjacent to it. Interesting.

everest59
02-08-2014, 03:50 PM
Here it is in pdf format.
https://drive.google.com/file/d/0B3vEDdpZDjUpMGhkS21INU81Nms/edit?usp=sharing

Zoom in and see where you're located. I say download the pdf file first. Not very clear via google docs.

Dr_McNinja
02-08-2014, 04:16 PM
It matches the Geno 2.0 a little too well. When I ran my actual Geno 2.0 raw data through the HarappaWorld calculator, it was under by about 2% (K12b says 77%, HarappaWorld says 76%, Geno2 raw data in Harappa says 74%, Geno 2.0's own estimate: 77%). So with the same raw data, Harappa is underestimating West Eurasian by about 3% compared to Geno 2.0. Considering the Geno 2.0 run also gave me a whopping 5.35% Arctic, I think they might be putting some of that into the West Eurasian components.

Ignoring myself and my mom, all the Jatts are hitting almost exactly the same West Eurasian. As European (NE-Euro+Med) goes down, Caucasian (Baloch+Caucasus+SW-Asian) goes up, or vice-versa, but these individuals are hitting the same number, including the Haryana Jatt with the 20% European total.

I think that is pretty significant.

I did that for the entire Harappa spreadsheet and the Jatts inhabit a range of 77% to 82% West Eurasian with HRP0008 at 84% as a bit of an outlier. The Haryana/Rajasthani Jatts are in the upper half but mixed with other Punjabi Jatts. My mom is at 76% and I'm at 73%. I believe if they were all run on the same calculator (I forgot at which number Zack changed it), the range would be even tighter.

The ones at 82% and above (4-5 individuals) have Arctic in the ~2-3% range and lower. The ones under have ~3-4% or higher. The ones under those are at 4-5% and I'm the highest at the bottom with 6%. A clear pattern is there.

HRP0370 is at 83.40% West Eurasian with 5.86% Arctic for reference, btw.

Once again the total Caucasian numbers are pretty geographically relevant. The higher numbers are in Punjab, the lower ones in Haryana. Mine places me right in between Punjab and Haryana. The only issue is my European is too low and my Arctic too high. Not just my Arctic, but I believe the South Indian is also boosted up (going by what Dr. McDonald said where the European and South Asian were hard to tell apart for my mom and us).

Taking all this into account in the context of our previous posts, I think we're seeing a pretty massive calculator effect across the board. The closer we get to the Caucasus, geographically, the more the West Eurasian admixture appears Caucasian (or Baloch depending on proximity to that). The further away we get, the more it shows up as European, culminating in the huge European numbers for the Haryana/Rajasthani Jatts. I don't think there's actually a variance in the actual admixture. I'd be willing to bet most, if not all, Jatts share very similar West Eurasian admixture (after tracing my family tree all over Punjab, that became kind of evident). But it forms a bit of a gradient and depending on which component an individual's ancestral location is near, the West Eurasian jumps into that category. Mine's actually jumping into South Indian, which according to the isotherms drawn up at Harappa long ago, actually peaks not too far away in Gujarat (and Punjabis do have an affinity towards Gujarat, and probably Y-DNA links too). But being that South Indian has ancestral ASI which resembles East Asian, a chunk is being left over in Arctic.

BTW, the isotherms: http://imgur.com/a/HrM8D#0

The problems are a lack of components and reference populations. Or rather, a good selection. There needs to be a North Indian or close ANI approximate in that region. It'd be interesting to see if the Haryana/Rajasthani Jatts (and Brahmins in general) suddenly lose a bunch of their European admixture to that because of how close they are to it. FTDNA uses one in their Population Finder and that thing placed me within spitting distance of our village, and Dr. McDonald also makes that distinction and placed my mom and I both in Pakistan. European just seems like a catch-all when no other component will pick up the West Eurasian. As for why mine is getting picked up by South Asian, who knows, but I think Muslim Jatts have a clear difference due to the inter-family marriages. There shouldn't be a difference because these are the same people, same clans (and more than half my family tree was Sikh two centuries ago), but the marriage habits are totally different (as Paul Gill called it, "Jatt law", death for marrying in the same clan! lol....). I think the difference between my family and the other Muslim Jatts is that we actually do hail from India (my grandfather used to mention Haryana) so in our case it's looking for the nearest ANI approximate and instead of defaulting to Northeast European like Haryana Jatts (all of whom have been Hindu I believe) or Caucasian/Baloch like Punjabi Jatt Muslims and Sikhs, goes for South Indian and Arctic (the West European-like admixture is still there getting picked up since it looks so different, and without as much surrounding NE-Euro, it comes out looking Norman/French). The inter-family marriage has probably been going on since our great-great and great-great-great grandparents.

This also affects the other Muslim South Asian populations. The European in Afghans might be severely underestimated (represented by all that Arctic hanging around). HRP0370's European admixture jumps like crazy in some calculators whereas the Jatts' does not.

I guess it just goes to show, as Razib Khan used to mention on his blog, we have to be epistemologically correct in how we treat these programs. Also I hate to go to the phenotype thing but the Haryana Jatts look just like other Jatts. That's 20% of West Eurasian admixture which is "European-like", not 20% actual Scandinavian or something as people talk about it. Hell, if I'm this East-shifted, my dad should be moreso and he (and his entire side) look more "Jatt" than I do. If someone could rerun the Harappa participants using two South Asian components like Dr. McDonald did, we'd probably be seeing a much tighter spread. My dad's phased results were sharply tilted towards West European (pre-Germanic) and my mom shows up as virtually all Germanic. But by McDonald's analysis, he only detected French for me, and none for her, despite her having the actual French in K36 and 9% North Sea in another one (K15 I think). The other calculators show that it's there but he couldn't pick it up at all.

Humanist
02-08-2014, 04:19 PM
Thanks, it appears that I'm clustering not with the Armenians but with the Druze and one Syrian. Humanist seems to be clustering with the Druze as well, although unlike me he's not actually in the Druze cluster, just adjacent to it. Interesting.

You = ZZ
Me = YY

http://i1096.photobucket.com/albums/g326/dok101/Faces/everest_.png

everest59
02-08-2014, 10:19 PM
I just did a quick Mal'ta plot. Seems like the Siberian boy is clustering with Turkmen. The Nogais aren't too far away either.
(Just download this image). My plot does not have Karitiana anywhere close.

I'll do some more pruning of the SNPs and see what I get.
https://drive.google.com/file/d/0B3vEDdpZDjUpZmthMHlaSzhiUzg/edit?usp=sharing

Here's another one after pruning out more SNPs. This one is in PDF format:
https://drive.google.com/file/d/0B3vEDdpZDjUpelJITmx6b2FsaDA/edit?usp=sharing

Not much of a difference IMO.

ZephyrousMandaru
02-08-2014, 11:02 PM
You = ZZ
Me = YY

http://i1096.photobucket.com/albums/g326/dok101/Faces/everest_.png

Thanks. Could the reason you're pulling slightly eastward be that you're more "Gedrosia"?

Humanist
02-08-2014, 11:26 PM
Thanks. Could the reason you're pulling slightly eastward be that you're more "Gedrosia"?

I would imagine so.

What are your K12b values? My values are below:

50.6-Caucasus

21.1-Gedrosia
18.4-Southwest_Asian

9.1-Atlantic_Med

0.5-East_Asian
0.2-South_Asian

0.0-North_European
0.0-Siberian
0.0-Northwest_African
0.0-East_African
0.0-Sub_Saharan
0.0-Southeast_Asian

ZephyrousMandaru
02-08-2014, 11:55 PM
I would imagine so.

What are your K12b values? My values are below:

50.6-Caucasus

21.1-Gedrosia
18.4-Southwest_Asian

9.1-Atlantic_Med

0.5-East_Asian
0.2-South_Asian

0.0-North_European
0.0-Siberian
0.0-Northwest_African
0.0-East_African
0.0-Sub_Saharan
0.0-Southeast_Asian

Dodecad K12b

Population
Gedrosia 18.05%
Siberian 0.55%
Northwest_African 0.87%
Southeast_Asian -
Atlantic_Med 11.45%
North_European 1.96%
South_Asian -
East_African 0.30%
Southwest_Asian 18.78%
East_Asian -
Caucasus 48.02%
Sub_Saharan -

I think it might be a combination of you being slightly more Gedrosia, and my nearly 1% Northwest African score. The latter is probably what's pushing me a little closer to the Druze.
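The comparison above can be made explicit by subtracting the two K12b profiles component by component. This is only a minimal sketch: the numbers are copied from the two posts, a "-" is treated as 0.0, and the function name `component_differences` is made up for illustration.

```python
# K12b percentages as posted above; "-" entries are treated as 0.0 (an assumption).
humanist = {
    "Caucasus": 50.6, "Gedrosia": 21.1, "Southwest_Asian": 18.4,
    "Atlantic_Med": 9.1, "East_Asian": 0.5, "South_Asian": 0.2,
    "North_European": 0.0, "Siberian": 0.0, "Northwest_African": 0.0,
    "East_African": 0.0, "Sub_Saharan": 0.0, "Southeast_Asian": 0.0,
}
zephyrous = {
    "Gedrosia": 18.05, "Siberian": 0.55, "Northwest_African": 0.87,
    "Southeast_Asian": 0.0, "Atlantic_Med": 11.45, "North_European": 1.96,
    "South_Asian": 0.0, "East_African": 0.30, "Southwest_Asian": 18.78,
    "East_Asian": 0.0, "Caucasus": 48.02, "Sub_Saharan": 0.0,
}


def component_differences(a, b):
    """Return (component, a - b) pairs, largest absolute difference first."""
    pairs = ((name, round(a[name] - b[name], 2)) for name in a)
    return sorted(pairs, key=lambda item: abs(item[1]), reverse=True)


diffs = component_differences(humanist, zephyrous)
for name, delta in diffs:
    print(f"{name:18s} {delta:+.2f}")
```

On these numbers the largest gaps are Gedrosia (+3.05) and Caucasus (+2.58) in Humanist's favour, with Northwest_African at -0.87, which is in line with the explanation given in the post.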

Sein
02-09-2014, 04:30 AM
I just did a quick Mal'ta plot. Seems like the Siberian boy is clustering with Turkmen. The Nogais aren't too far away either.
(Just download this image). My plot does not have Karitiana anywhere close.

I'll do some more pruning of the SNPs and see what I get.
https://drive.google.com/file/d/0B3vEDdpZDjUpZmthMHlaSzhiUzg/edit?usp=sharing

Here's another one after pruning out more SNPs. This one is in PDF format:
https://drive.google.com/file/d/0B3vEDdpZDjUpelJITmx6b2FsaDA/edit?usp=sharing

Not much of a difference IMO.

Thanks Everest!

Could you try a HapMap PCA with all of us and Mal'ta, but excluding Mexicans, African Americans, Luhya, and Mandenka? In fact, just CEU, YRI, CHB, JPT, and GIH? Thank you.

Dr_McNinja
02-09-2014, 05:21 AM
Sein, I added the Africa9 results to the sheet for some of us:

https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdFMtRHVlZDBuQ3lMcjhxMDE4V3JoYlE&usp=drive_web#gid=12

FST distances for the components:

https://docs.google.com/spreadsheet/ccc?key=0ArAJcY18g2GadGRjS1E3ZklyVllBVlhrNWVpNlFobEE&hl=en_US#gid=2

Surprisingly no one had any NW-African even though it's closest to SW-Asian and European. East African and West African are the next closest and that kind of clues us in that something weird is happening. I think East African is correlating to SW-Asian/Caucasian in some way and West African could be connected to either European or ASI.

Check out the adjusted column. There I assumed W_Africa, Mbuti, and S_Africa correlated to ASI, and E_Africa was the missing ANI to make South Indian, so I adjust the latter on the basis of the former. It makes more geographical sense. Biaka looks like it could represent some kind of excess East Asian.

It's just messing around but it illustrates the effect I think is happening. Me staying at 85% makes some sense (I checked with Photoshop's ruler, that Interpretome painting put me at 85-90%, not counting the blank spots).

ZephyrousMandaru
02-09-2014, 05:47 AM
Hey Everest, do you think you could generate those same PCA plots using dimensions other than 1 and 2, such as 1 and 3, 1 and 4, 1 and 5, and 1 and 6? The first two dimensions only represent some of the variance in the samples, and it'd be interesting to see how much more there is within and between the samples.

Sein
02-09-2014, 06:44 AM
Sein, I added the Africa9 results to the sheet for some of us:

https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdFMtRHVlZDBuQ3lMcjhxMDE4V3JoYlE&usp=drive_web#gid=12

FST distances for the components:

https://docs.google.com/spreadsheet/ccc?key=0ArAJcY18g2GadGRjS1E3ZklyVllBVlhrNWVpNlFobEE&hl=en_US#gid=2

Surprisingly no one had any NW-African even though it's closest to SW-Asian and European. East African and West African are the next closest and that kind of clues us in that something weird is happening. I think East African is correlating to SW-Asian/Caucasian in some way and West African could be connected to either European or ASI.

Check out the adjusted column. There I assumed W_Africa, Mbuti, and S_Africa correlated to ASI, and E_Africa was the missing ANI to make South Indian, so I adjust the latter on the basis of the former. It makes more geographical sense. Biaka looks like it could represent some kind of excess East Asian.

It's just messing around but it illustrates the effect I think is happening. Me staying at 85% makes some sense (I checked with Photoshop's ruler, that Interpretome painting put me at 85-90%, not counting the blank spots).

This looks very interesting! The correlations might be indicative of something.

Dr_McNinja
02-09-2014, 11:54 AM
This looks very interesting! The correlations might be indicative of something.
You can do a kind of ghetto chromosome painting with gedmatch's one-to-many feature. Select everyone with a match over 5cM (this is the longest step), then go to Chromosome browser. If you selected too many, copy and paste the URL into notepad and chop off the kit #s into multiple URLs and reload them.

Not sure how to turn that into a % though. Since I have my mom's kit on there, I did a rough guesstimate with Photoshop: about 30.66% of the bar I share with her is also shared with Europeans on Chromosome 1, and 22.36% on Chromosome 2. Overall it looks like 20-30% across the whole thing.

I only counted segments with multiple matches to ensure there was actual commonality with Europeans and not just chance. You'll notice there will be regions with dozens of matches in the same area, safe to say that part should be painted European/Caucasian on any chromosome painter (I don't even want to say Caucasian, the overwhelming majority of these people are from Northern/Western Europe with a few from Eastern Europe).

Dr_McNinja
02-09-2014, 01:01 PM
I checked out paulgill's and his is in the same range. His Chromosome 1 was 25% and Chr 2 was 13.14% (assuming my overlap with my mom as boundary of total chromosome). I found another Gill Punjabi Jatt in his results though, I added them to the spreadsheet (edit: since it's likely a relative of his, I didn't bother... although I match PG and not his relative).

everest59
02-09-2014, 02:34 PM
Hey Everest, do you think you could generate those same PCA plots using dimensions other than 1 and 2, such as 1 and 3, 1 and 4, 1 and 5, and 1 and 6? The first two dimensions only represent some of the variance in the samples, and it'd be interesting to see how much more there is within and between the samples.

Sure. Btw, I haven't even checked these files. Let me know if something looks awry. This was easy to do since I already created the PC data with 10 components.

1x3:
https://drive.google.com/file/d/0B3vEDdpZDjUpVFhPYnkxQ2xQaTg/edit?usp=sharing

1x4:
https://drive.google.com/file/d/0B3vEDdpZDjUpWWtzTTZyZzFXZ28/edit?usp=sharing

2x3:
https://drive.google.com/file/d/0B3vEDdpZDjUpTWItdk9yU044dEk/edit?usp=sharing

1x5:
https://drive.google.com/file/d/0B3vEDdpZDjUpUDhTUmtZX0JhcU0/edit?usp=sharing

I will work on the HapMap + Mal'ta + participant PCA.

I notice that some of the plots above are very tightly clustered. Not sure if you guys can see yourselves.
There are so many populations used that I need to create a big file.
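Since the PC data already holds 10 components per sample, producing a 1x3 or 2x3 view is just a matter of picking two columns. A minimal sketch of that selection step, assuming each sample is a label plus a list of PC coordinates; the sample labels and coordinates below are made up, and the real flashpca output layout may differ:

```python
from itertools import combinations


def pc_pair(rows, i, j):
    """Return (label, PC_i, PC_j) for every sample; i and j are 1-based."""
    return [(label, pcs[i - 1], pcs[j - 1]) for label, pcs in rows]


# Hypothetical samples with invented coordinates on the first five PCs.
rows = [
    ("Nepali", [0.01, -0.02, 0.03, 0.00, 0.01]),
    ("Pathan2", [0.02, -0.01, 0.01, 0.02, -0.01]),
    ("MA-1", [0.03, 0.04, -0.02, 0.01, 0.00]),
]

# The pairs requested in the thread so far.
requested = [(1, 3), (1, 4), (2, 3), (1, 5)]
views = {f"{i}x{j}": pc_pair(rows, i, j) for i, j in requested}

# With 10 components there are 45 possible two-dimensional views in total.
all_pairs = list(combinations(range(1, 11), 2))

for name, points in views.items():
    print(name, points)
```

Each entry in `views` can then be handed to whatever plotting tool is in use (R in this thread) as an x/y table.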

everest59
02-09-2014, 03:15 PM
Thanks Everest!

Could you try a HapMap PCA with all of us and Mal'ta, but excluding Mexicans, African Americans, Luhya, and Mandenka? In fact, just CEU, YRI, CHB, JPT, and GIH? Thank you.

Excellent idea, and a very interesting output.
Mal'ta is clustered with South Asians + West Asians. YRI is on the top right. CHB on the bottom, Europeans (Tuscans and CEU) on the top left.

I think there is something wrong here.

Sein
02-09-2014, 03:28 PM
Excellent idea, and a very interesting output.
Mal'ta is clustered with South Asians + West Asians. YRI is on the top right. CHB on the bottom, Europeans (Tuscans and CEU) on the top left.

That is an exceedingly interesting and unexpected output! Not really sure what to make of it. Any insight or ideas?

everest59
02-09-2014, 03:31 PM
That is an exceedingly interesting and unexpected output! Not really sure what to make of it. Any insight or ideas?

Well, I think there was an error when I created the file. I am trying something else: participants + SGVP Indian + HGDP Karitiana and Russian, as well as YRI.

Sein
02-09-2014, 03:51 PM
Well, I think there was an error when I created the file. I am trying something else: participants + SGVP Indian + HGDP Karitiana and Russian, as well as YRI.

This should prove very interesting.

everest59
02-09-2014, 03:55 PM
Mal'ta again. This time, I'll include two files. The first one is completely zoomed out. The other one is zoomed in. In the zoomed-in plot, the Mal'ta boy is close to the West Eurasian/South Asian populations. Karitiana is far away. These charts show MA-1 closer to the Gujarati types with high ASI, probably because MA-1 himself is shifted towards Karitiana. Keep in mind, ASI is very East Eurasian-like. Another issue: I was able to use only 7k SNPs.

[Attachment: 1382]

Sein
02-09-2014, 04:04 PM
Extremely interesting. In your opinion, would any changes occur in the output if you could have more SNPs at your disposal?

everest59
02-09-2014, 04:06 PM
Extremely interesting. In your opinion, would any change occur in the output if you could have more SNPs at your disposal?

I absolutely think so. I will need to recombine all the 23andMe files first, and then I think I would combine them, along with Mal'ta, with HGDP. I think HGDP is better than HapMap because it has more populations.

Sein
02-09-2014, 04:09 PM
I absolutely think so. I will need to recombine all the 23andMe files first, and then I think I would combine them, along with Mal'ta, with HGDP. I think HGDP is better than HapMap because it has more populations.

Great! I'm rather excited about this.

everest59
02-09-2014, 04:12 PM
Great! I'm rather excited about this.

It will be a little time consuming, so hopefully by tomorrow.

BMG
02-09-2014, 04:49 PM
Can you possibly use the 1000 Genomes data? There are a lot of South Asian samples there.

everest59
02-09-2014, 04:53 PM
Can you possibly use the 1000 Genomes data? There are a lot of South Asian samples there.

You mean Punjabis, Bengalis, etc? Do you know if those are publicly available? Because I really don't think they are. I use HAPMAP Gujaratis and Singapore Indians mainly.

BMG
02-09-2014, 05:05 PM
You mean Punjabis, Bengalis, etc? Do you know if those are publicly available? Because I really don't think they are. I use HAPMAP Gujaratis and Singapore Indians mainly.
There seems to be some data in here sequenced by Complete Genomics, but I don't know if that is what you are looking for:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data

everest59
02-09-2014, 05:50 PM
I may have found a way to calculate West Eurasian % using a PCA plot. Let's suppose Sardinians are the purest West Eurasian population (in reality Stuttgart would be the best population to use since she is a mix of EEF and WHG with no ANE). The F4 ratio formula using Sardinians would be:
West Eurasian % = F4(Basque, Yoruba; X, Chinese) / F4(Basque, Yoruba; Sardinian, Chinese)

However, I have found that there is no need to use Basque. We can actually do the following:
F4(Sardinian, Yoruba; X, Chinese) / F4(Sardinian, Yoruba; Sardinian, Chinese)

What happens is that the Sardinian, Yoruba in the numerator and denominator cancel out, and the formula becomes:
F4(X, Chinese) / F4(Sardinian, Chinese)

However, this was using allele frequency. What if we substitute distance for allele frequency using a PCA plot? Then the formula becomes:

West Eurasian % = Distance(X, Chinese) / Distance(Sardinian, Chinese)

Let's suppose the distance between you and Chinese is 8, whereas the distance between Sardinian and Chinese is 10 in a PCA plot.
So, your West Eurasian % would be 8/10, or 80%.

Any comments? I would just use a measuring tape and calculate West Eurasian % that way.
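The measuring-tape idea can be written down as a small calculation on PC coordinates. This is only a sketch of the heuristic as described: the coordinates are invented, the function names are made up, and a distance on a two-dimensional PCA plot is not an f4 statistic. The second function (projection onto the Chinese-Sardinian axis) is an alternative added here for comparison, not something proposed in the post.

```python
import math


def distance(a, b):
    """Euclidean distance between two points on a 2-D PCA plot."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def distance_ratio(x, reference, outgroup):
    """Distance(X, outgroup) / Distance(reference, outgroup), as in the post."""
    return distance(x, outgroup) / distance(reference, outgroup)


def projection_fraction(x, reference, outgroup):
    """How far X lies along the outgroup -> reference axis (0 = outgroup, 1 = reference)."""
    vx, vy = reference[0] - outgroup[0], reference[1] - outgroup[1]
    wx, wy = x[0] - outgroup[0], x[1] - outgroup[1]
    return (wx * vx + wy * vy) / (vx * vx + vy * vy)


# Invented coordinates reproducing the 8-versus-10 example.
chinese = (0.0, 0.0)
sardinian = (10.0, 0.0)
on_axis = (8.0, 0.0)    # lies on the Chinese-Sardinian line
off_axis = (0.0, 8.0)   # same distance from Chinese, but perpendicular to the line

print(distance_ratio(on_axis, sardinian, chinese))        # 0.8
print(distance_ratio(off_axis, sardinian, chinese))       # 0.8
print(projection_fraction(off_axis, sardinian, chinese))  # 0.0
```

The two functions agree for a sample sitting on the line between the two reference populations. They diverge for a sample pulled off that line by some third ancestry, which is the case where a plain distance ratio would overstate the percentage.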

everest59
02-09-2014, 06:05 PM
There seems to be some data in here sequenced by Complete Genomics, but I don't know if that is what you are looking for:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data

Well, those seem to be individual files. I found a file in vcf format, but it doesn't have Punjabis or Bengalis. I think those populations are not available.

Dr_McNinja
02-09-2014, 06:35 PM
I may have found a way to calculate West Eurasian % using a PCA plot. Let's suppose Sardinians are the purest West Eurasian population (in reality Stuttgart would be the best population to use since she is a mix of EEF and WHG with no ANE). The F4 ratio formula using Sardinians would be:
West Eurasian % = F4(Basque, Yoruba; X, Chinese) / F4(Basque, Yoruba; Sardinian, Chinese)

However, I have found that there is no need to use Basque. We can actually do the following:
F4(Sardinian, Yoruba; X, Chinese) / F4(Sardinian, Yoruba; Sardinian, Chinese)

What happens is that the Sardinian, Yoruba in the numerator and denominator cancel out, and the formula becomes:
F4(X, Chinese) / F4(Sardinian, Chinese)

However, this was using allele frequency. What if we substitute distance for allele frequency using a PCA plot? Then the formula becomes:

West Eurasian % = Distance(X, Chinese) / Distance(Sardinian, Chinese)

Let's suppose the distance between you and Chinese is 8, whereas the distance between Sardinian and Chinese is 10 in a PCA plot.
So, your West Eurasian % would be 8/10, or 80%.

Any comments? I would just use a measuring tape and calculate West Eurasian % that way.

I get ~66% for me, ~68% for Sapporo, and 71% for Sein measuring from the giant plot you initially posted.

everest59
02-09-2014, 06:38 PM
I get ~66% for me, ~68% for Sapporo, and 71% for Sein measuring from the giant plot you initially posted.

It seems to work to a degree. I got slightly less than yours for myself. It seems to be informative for South Asians, at least.
If I were allowed to plot the Onge, I would do it, but unfortunately, I have been told that it isn't available for distribution. With the Onge, the formula would be:
Distance (You, Onge)/Distance(Sardinian, Onge).
Another thing: since ANE came from West Eurasians, I would probably add ANE in there as well. So, let's say you are 10% ANE. I would add 10% to your calculation, meaning 76% total if you start at 66% (since Sardinians are 0% ANE).
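As a worked example of that adjustment, here is a tiny sketch. The function name is made up, the 10% ANE figure is the hypothetical one from the post, and capping the result at 100% is an assumption added here:

```python
def add_ane(distance_ratio_pct, ane_pct):
    """Add an ANE estimate to a distance-ratio estimate, capped at 100% (the cap is an assumption)."""
    return min(100.0, distance_ratio_pct + ane_pct)


# ~66% from the plot measurement plus a hypothetical 10% ANE.
print(add_ane(66.0, 10.0))  # 76.0
```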

Dr_McNinja
02-09-2014, 07:57 PM
One of the new Jatts I found on GEDmatch seems to have a similar pattern to my mother:

https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdFMtRHVlZDBuQ3lMcjhxMDE4V3JoYlE&usp=drive_web#gid=3

The Gill with kit M141527.

It's readily apparent in the Eurogenes K15 and EUTest. The differences are that some of my mother's West Asian leaked into East Med, and in EUTest some of M141527's East Euro leaked into North Euro.

It looks like she's just missing a chunk of NE-Euro and a little bit more of West Asian which got transplanted into South Asian otherwise it's very close. It looks like a West Asian-shifted version of the Nepalese Brahmin or a South Asian-shifted version of the other Gill Jatt.

parasar
02-09-2014, 08:13 PM
Excellent idea, and a very interesting output.
Mal'ta is clustered with South Asians + West Asians. YRI is on the top right. CHB on the bottom, Europeans (Tuscans and CEU) on the top left.

I think there is something wrong here.

I think it may be best to keep the datasets consistent, i.e. HapMap, 23andMe, HGDP, etc.
It appears that MA1 is finding more affinity to the 23andMe samples, so, say, European 23andMe samples would be closer to MA1 than the CEU ones.

Nevertheless, of the comparable sets CEU, GIH, CHB and YRI, the GIH are indeed appearing closest, and YRI the farthest.

everest59
02-10-2014, 12:57 AM
I think it may be best to keep the datasets consistent, i.e. HapMap, 23andMe, HGDP, etc.
It appears that MA1 is finding more affinity to the 23andMe samples, so, say, European 23andMe samples would be closer to MA1 than the CEU ones.

Nevertheless, of the comparable sets CEU, GIH, CHB and YRI, the GIH are indeed appearing closest, and YRI the farthest.


I am getting similar results again. This one is all HGDP data. Again, MA-1 is closest to Northwestern South Asians, specifically Sindhis, Burushos, and Pathans. MA-1 is marked by L. This time I used 67k SNPs.

https://drive.google.com/file/d/0B3vEDdpZDjUpUFlfX2xaSHJqd28/edit?usp=sharing

Generalissimo
02-10-2014, 01:33 AM
I am getting similar results again. This one is all HGDP data. Again, MA-1 is closest to Northwestern South Asians, specifically Sindhis, Burushos, and Pathans. MA-1 is marked by L. This time I used 67k SNPs.

https://drive.google.com/file/d/0B3vEDdpZDjUpUFlfX2xaSHJqd28/edit?usp=sharing

He's more or less placed where people of mixed European and Native American origin land on such plots, but in his case it's not because he's mixed, but rather because Europeans and Amerindians are the two groups with the highest ANE ancestry. He's closer to Europe because Europeans are also WHG, which is very similar to ANE.

This should be seen more clearly within a dimension that splits Amerindians and East/South Asians more effectively. There should basically be a straight line running from Europe to the Americas, with MA-1 somewhere along that line.

Generalissimo
02-10-2014, 01:49 AM
Here's the cline...

http://imageshack.com/a/img819/8638/a4kh.png

everest59
02-10-2014, 01:53 AM
I think you are probably correct. I posted a plot a while back in this thread where MA-1 was clustering with Central Asians.

Generalissimo
02-10-2014, 02:00 AM
I think you are probably correct. I posted a plot a while back in this thread where MA-1 was clustering with Central Asians.

I'd say that because the plots are based on modern samples, his position will basically be dictated by who today carries the most admixture from his close relatives. So on a global plot he'll fall between WHG/ANE-heavy North/East Europeans and ANE-heavy South Amerindians, while on a Eurasian plot he'll probably be positioned somewhere between North/East Europe, the North Caucasus, and Uzbekistan, or thereabouts.

everest59
02-10-2014, 02:02 AM
Edit: I'm removing the data. I don't know if my participants would like their PCA data made publicly available.

everest59
02-10-2014, 02:09 AM
Okay, a 1x3 PCA has Mal'ta closer to Native Americans:
https://drive.google.com/file/d/0B3vEDdpZDjUpY3NlUHV3N3hFX2M/edit?usp=sharing

A 1x4 plot where Mal'ta is equidistant from Native Americans and South Asians:

https://drive.google.com/file/d/0B3vEDdpZDjUpVUpFcEJtNG9YMDA/edit?usp=sharing

Okay, this is it.

Sein
02-10-2014, 08:10 AM
Hi Everest,

Once you get started on some Admixture runs, could you try some ANI-ASI "zombies"? You're obviously familiar with Dienekes' method for creating "zombies" from allele frequencies. Just as a refresher, http://dodecad.blogspot.com/2011/05/how-to-create-zombies-from-admixture.html. And I'm sure you've also seen Dienekes' highly interesting experiment utilizing ANI-ASI "zombies", http://dodecad.blogspot.com/2011/05/more-zombies-ancestral-north-indians.html. Finally, his attempt at analyzing the HGDP Pakistani populations using these "zombies", http://dodecad.blogspot.com/2011/05/aniasi-analysis-of-hgdp-pakistan-groups.html. The spreadsheet of the actual results, https://docs.google.com/spreadsheet/ccc?key=0ArJDEoCgzRKedEd3N2drM05sck1wcG03TFdWUnZaQmc&authkey=CIHIwKcO&hl=en_US&authkey=CIHIwKcO#gid=0. As you can see, the results are absolutely beautiful, from a statistical perspective. The correlations between his original results (using only 10 Sindhis and 15 Pashtuns), and the published values seen in the Reich et al. paper, are amazing. In fact, his percentages approach the Moorjani et al. paper in some cases. This method has quite a few substantial advantages vs Admixtools/F statistics/population mixture tests. It can differentiate between East Asian and ASI, you can see the full complexity of population affinities, and you don't need an outgroup (or any other complicated issues). And to top it off, you can analyze groups that don't show "simple" admixture between ANI and ASI. For example, Iranians are outside the classic ANI-ASI cline, but he was still able to infer around 8% ASI for Iranians, which is what one would expect. With this method, you don't have to deal with only South Asian populations. You can also find some information on creating "zombies" from allele frequencies at the "Geographic Structure Prediction" site.

In short, just create ANI-ASI "zombies" in the same exact manner as Dienekes did, and then create "zombies" for all of the non-South Asian components in one of your K-runs (maybe K-10 to K-12, minus the Baloch-specific and South Indian-specific components). And then just run Admixture in supervised mode, using these zombies. This method seems to yield very accurate ASI percentages, and it can differentiate between the various West Eurasian layers in South Asia. It just works, and everyone would love it if you'd try this.

I hope you do consider trying this, once you get the ball rolling with Admixture. As always, everyone truly appreciates what you're doing with the data, and your output is absolutely amazing. Thanks!
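For readers unfamiliar with the idea, the core of building "zombies" from allele frequencies can be sketched as drawing synthetic diploid genotypes from one component's per-SNP frequencies. This is an illustration of the general idea only, not Dienekes' actual script: the function name, the frequencies, and the 0/1/2 allele-count coding are all assumptions made here, and converting the result into a format ADMIXTURE can read is left out.

```python
import random


def make_zombies(allele_freqs, n_individuals, seed=0):
    """Simulate diploid genotypes (0, 1 or 2 copies of an allele) from per-SNP frequencies.

    Each of the two allele copies at a SNP is drawn independently with probability p.
    """
    rng = random.Random(seed)
    zombies = []
    for _ in range(n_individuals):
        genotype = [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        zombies.append(genotype)
    return zombies


# Hypothetical allele frequencies for one ancestral component at six SNPs.
component_freqs = [0.0, 0.12, 0.5, 0.73, 0.95, 1.0]

zombies = make_zombies(component_freqs, n_individuals=25)
print(zombies[0])
```

A set of such simulated individuals per component would then serve as the reference populations for a supervised run.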

Dr_McNinja
02-10-2014, 08:37 AM
Hi Everest,

Once you get started on some Admixture runs, could you try some ANI-ASI "zombies"? You're obviously familiar with Dienekes' method for creating "zombies" from allele frequencies. Just as a refresher, http://dodecad.blogspot.com/2011/05/how-to-create-zombies-from-admixture.html. And I'm sure you've also seen Dienekes' highly interesting experiment utilizing ANI-ASI "zombies", http://dodecad.blogspot.com/2011/05/more-zombies-ancestral-north-indians.html. Finally, his attempt at analyzing the HGDP Pakistani populations using these "zombies", http://dodecad.blogspot.com/2011/05/aniasi-analysis-of-hgdp-pakistan-groups.html. The spreadsheet of the actual results, https://docs.google.com/spreadsheet/ccc?key=0ArJDEoCgzRKedEd3N2drM05sck1wcG03TFdWUnZaQmc&authkey=CIHIwKcO&hl=en_US&authkey=CIHIwKcO#gid=0. As you can see, the results are absolutely beautiful, from a statistical perspective. The correlations between his original results (using only 10 Sindhis and 15 Pashtuns), and the published values seen in the Reich et al. paper, are amazing. In fact, his percentages approach the Moorjani et al. paper in some cases. This method has quite a few substantial advantages vs Admixtools/F statistics/population mixture tests. It can differentiate between East Asian and ASI, you can see the full complexity of population affinities, and you don't need an outgroup (or any other complicated issues). And to top it off, you can analyze groups that don't show "simple" admixture between ANI and ASI. For example, Iranians are outside the classic ANI-ASI cline, but he was still able to infer around 8% ASI for Iranians, which is what one would expect. With this method, you don't have to deal with only South Asian populations. You can also find some information on creating "zombies" from allele frequencies at the "Geographic Structure Prediction" site.

In short, just create ANI-ASI "zombies" in the same exact manner as Dienekes did, and then create "zombies" for all of the non-South Asian components in one of your K-runs (maybe K-10 to K-12, minus the Baloch-specific and South Indian-specific components). And then just run Admixture in supervised mode, using these zombies. This method seems to yield very accurate ASI percentages, and it can differentiate between the various West Eurasian layers in South Asia. It just works, and everyone would love it if you'd try this.

I hope you do consider trying this, once you get the ball rolling with Admixture. As always, everyone truly appreciates what you're doing with the data, and your output is absolutely amazing. Thanks!

I didn't know about this. Fascinating read, thanks. I wish someone would make a good Admixture tutorial; Razib's old one was a little skimpy and didn't explain much about what was really going on.

Dr_McNinja
02-10-2014, 08:47 AM
I did an estimation of paulgill's GEDmatch kit in the chromosome browser and came up with 17.8% European (not counting the X chromosome). That's within spitting distance of the sum of his European and Arctic components in Dodecad K12b, Harappa, and Eurogenes K13.

Comparison of my results:


Chr #SNPS S-Indn Baloch Caucasn NE-Euro SE-Asn Sibrn NE-Asn Papuan Amercn Berngn Med SW-Asn San E-African Pygmy W-African
1 14964 35.05 35.78 11.81 11.93 0.00 0.00 4.55 0.00 0.02 0.84 0.02 0.00 0.00 0.00 0.00 0.00
2 14946 35.78 39.57 8.10 8.75 0.74 0.00 0.82 1.31 0.01 4.91 0.00 0.00 0.00 0.00 0.00 0.00
3 12630 43.69 39.36 0.00 13.03 0.00 2.75 0.00 1.13 0.00 0.00 0.03 0.00 0.00 0.00 0.00 0.00
4 10988 36.57 41.67 1.31 6.87 0.01 0.63 0.00 2.41 0.57 0.00 3.20 6.77 0.00 0.00 0.00 0.00
5 11457 31.78 30.75 19.06 5.58 0.00 0.56 0.00 0.00 1.23 2.72 8.31 0.01 0.00 0.00 0.00 0.00
6 12223 38.79 40.52 7.30 2.59 0.71 0.00 0.96 1.67 3.70 0.01 3.75 0.00 0.00 0.00 0.00 0.00
7 10200 20.91 35.58 15.65 14.28 3.85 0.08 0.00 1.15 2.58 1.79 0.00 4.13 0.00 0.00 0.00 0.00
8 10504 41.59 28.51 8.92 11.43 0.27 5.74 0.00 0.00 0.00 0.00 0.00 1.33 0.00 0.00 0.00 2.21
9 9265 34.93 26.26 5.65 10.01 0.00 2.24 0.00 3.67 4.90 2.96 4.34 5.05 0.00 0.00 0.00 0.00
10 10207 34.18 32.94 17.94 6.96 0.00 0.00 0.00 0.79 5.25 0.01 0.00 1.93 0.00 0.00 0.00 0.00
11 9246 34.69 39.57 11.87 5.58 0.00 2.03 0.00 0.19 0.00 3.72 0.00 0.00 0.00 1.70 0.64 0.01
12 9268 47.40 33.13 0.01 0.00 0.00 0.00 0.00 0.95 0.32 5.58 5.17 7.44 0.00 0.00 0.00 0.00
13 7058 42.56 32.05 1.34 0.98 0.00 1.95 0.00 2.34 3.26 1.34 0.01 12.49 0.00 0.00 0.00 1.67
14 6416 30.16 26.45 14.00 13.92 4.59 3.20 0.02 0.00 3.26 0.97 0.02 0.00 0.00 0.00 3.42 0.00
15 5950 27.57 22.27 14.27 17.47 0.00 0.00 0.00 3.02 0.00 0.00 10.85 4.56 0.00 0.00 0.00 0.00
16 6186 35.73 37.36 14.58 10.03 0.00 0.00 0.00 0.00 0.01 1.37 0.92 0.00 0.00 0.00 0.00 0.00
17 5439 30.23 35.34 20.64 0.00 1.93 4.94 0.00 1.65 0.09 0.04 5.13 0.00 0.00 0.00 0.00 0.00
18 5980 42.19 35.17 4.32 5.73 4.53 1.34 0.01 0.00 0.00 3.19 0.00 3.52 0.00 0.00 0.00 0.00
19 3833 36.79 24.66 1.88 20.73 2.58 3.19 0.00 0.00 3.85 2.24 0.00 0.19 0.54 0.19 3.17 0.00
20 5279 26.01 29.63 15.05 18.46 0.01 1.11 1.69 3.01 0.00 0.00 0.00 3.95 0.00 0.05 0.74 0.29
21 3027 44.37 25.16 0.00 1.24 0.04 0.00 0.00 0.00 1.19 0.00 22.11 2.57 2.85 0.00 0.46 0.00
22 3107 37.69 26.38 0.01 24.05 0.00 0.00 0.00 3.55 0.00 0.00 5.69 0.00 2.61 0.00 0.01 0.00

Chromosome browser estimates, per chromosome:
Chr1 -- 30.66%
Chr2 -- 22.36%
Chr3 -- 9.7%
Chr4 -- 8.8%
Chr5 -- 8.31%
Chr6 -- 10.36%
Chr7 -- 12%
Chr8 -- 20%
Chr9 -- 30.67%
Chr10 - 17.4%
Chr11 - 16.34%
Chr12 - 8.55%
Chr13 - 11.56%
Chr14 - 24.5%
Chr15 - 53.8%
Chr16 - 27.5%
Chr17 - 29.0%
Chr18 - 18.5%
Chr19 - 23.3%
Chr20 - 34.5%
Chr21 - 34.55%
Chr22 - 46.9%
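One way to turn the per-chromosome estimates above into a single genome-wide figure is a weighted average, so that long chromosomes count for more than short ones. This is a rough sketch only: using the #SNPS column of the table as a stand-in for chromosome length is an assumption made here, and the two sets of numbers may not come from the same tool.

```python
# #SNPS per chromosome (1-22), copied from the table above.
snps = [14964, 14946, 12630, 10988, 11457, 12223, 10200, 10504, 9265, 10207, 9246,
        9268, 7058, 6416, 5950, 6186, 5439, 5980, 3833, 5279, 3027, 3107]

# Per-chromosome percentages (Chr1-Chr22), copied from the list above.
pct = [30.66, 22.36, 9.7, 8.8, 8.31, 10.36, 12.0, 20.0, 30.67, 17.4, 16.34,
       8.55, 11.56, 24.5, 53.8, 27.5, 29.0, 18.5, 23.3, 34.5, 34.55, 46.9]

# Weight each chromosome by its SNP count (a proxy for length, which is an assumption).
weighted_mean = sum(n * p for n, p in zip(snps, pct)) / sum(snps)

# Unweighted mean for comparison; this lets small chromosomes count as much as large ones.
simple_mean = sum(pct) / len(pct)

print(f"SNP-weighted mean: {weighted_mean:.2f}%")
print(f"Simple mean:       {simple_mean:.2f}%")
```

Because the highest values here sit on the smaller chromosomes (15, 20, 21, 22), the weighted figure comes out below the simple mean.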

Dr_McNinja
02-10-2014, 09:50 AM
Also added several more Punjabi Jatts (Bal, Sidhu, Gill) and a Northern Uttar Pradesh individual to the spreadsheet:

https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdFMtRHVlZDBuQ3lMcjhxMDE4V3JoYlE&usp=drive_web#gid=6

The UP individual's results are interesting. My Baloch is lower than Bihar, Nepal, Uttar Pradesh, and even two South Indian Brahmins (I added one more individual, who I think is a South Indian Brahmin, to the Harappa sheet only; their GEDmatch name was Iyengar).

everest59
02-10-2014, 10:45 AM
Hi Everest,

Once you get started on some Admixture runs, could you try some ANI-ASI "zombies"? You're obviously familiar with Dienekes' method for creating "zombies" from allele frequencies. Just as a refresher, http://dodecad.blogspot.com/2011/05/how-to-create-zombies-from-admixture.html. And I'm sure you've also seen Dienekes' highly interesting experiment utilizing ANI-ASI "zombies", http://dodecad.blogspot.com/2011/05/more-zombies-ancestral-north-indians.html. Finally, his attempt at analyzing the HGDP Pakistani populations using these "zombies", http://dodecad.blogspot.com/2011/05/aniasi-analysis-of-hgdp-pakistan-groups.html. The spreadsheet of the actual results, https://docs.google.com/spreadsheet/ccc?key=0ArJDEoCgzRKedEd3N2drM05sck1wcG03TFdWUnZaQmc&authkey=CIHIwKcO&hl=en_US&authkey=CIHIwKcO#gid=0. As you can see, the results are absolutely beautiful, from a statistical perspective. The correlations between his original results (using only 10 Sindhis and 15 Pashtuns), and the published values seen in the Reich et al. paper, are amazing. In fact, his percentages approach the Moorjani et al. paper in some cases. This method has quite a few substantial advantages vs Admixtools/F statistics/population mixture tests. It can differentiate between East Asian and ASI, you can see the full complexity of population affinities, and you don't need an outgroup (or any other complicated issues). And to top it off, you can analyze groups that don't show "simple" admixture between ANI and ASI. For example, Iranians are outside the classic ANI-ASI cline, but he was still able to infer around 8% ASI for Iranians, which is what one would expect. With this method, you don't have to deal with only South Asian populations. You can also find some information on creating "zombies" from allele frequencies at the "Geographic Structure Prediction" site.

In short, just create ANI-ASI "zombies" in the same exact manner as Dienekes did, and then create "zombies" for all of the non-South Asian components in one of your K-runs (maybe K-10 to K-12, minus the Baloch-specific and South Indian-specific components). And then just run Admixture in supervised mode, using these zombies. This method seems to yield very accurate ASI percentages, and it can differentiate between the various West Eurasian layers in South Asia. It just works, and everyone would love it if you'd try this.

I hope you do consider trying this, once you get the ball rolling with Admixture. As always, everyone truly appreciates what you're doing with the data, and your output is absolutely amazing. Thanks!

Thanks. I will see what I can do. Yes I always wanted to try this.
Well I don't know if there is any other plot I should create. If people have any requests, please let me know.
Otherwise I will do some ADMIXTURE runs. But in a few days. I need a break. LOL.

Sapporo
02-10-2014, 10:46 AM
Also added several more Punjabi Jatts (Bal, Sidhu, Gill) and a Northern Uttar Pradesh individual to the spreadsheet:

https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdFMtRHVlZDBuQ3lMcjhxMDE4V3JoYlE&usp=drive_web#gid=6

The UP individual's results are interesting. My Baloch is lower than Bihar, Nepal, Uttar Pradesh, and even two South Indian Brahmins (I added one more individual, who I think is a South Indian Brahmin, to the Harappa sheet only; their Gedmatch name was Iyengar).

How did you calculate the West Eurasian ancestry of the two Tajiks in Eurogenes V2 K15? I got 63.31% West Eurasian for the Afghan Kabuli Tajik (100) - (Arctic + East Asian + 1/2 South Asian). Are you counting Arctic as Euro? I disagree if you are because this Kabuli Tajik is very far East of me and Sein on 23andMe's Global Similarity. I also got 58.31% West Eurasian for the Tajikistan Tajik.
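For reference, the subtraction described above can be sketched in a few lines of Python. The function name and the input percentages are made up purely for illustration; they are not anyone's actual Eurogenes K15 output, and the formula is only the one stated in this post (it ignores any other non-West-Eurasian components a calculator might report).

```python
def west_eurasian_estimate(arctic, east_asian, south_asian):
    """West Eurasian % = 100 - (Arctic + East Asian + 1/2 South Asian).

    All arguments are calculator percentages (0-100). Counting only half
    of South Asian is the convention used in this post, not a standard.
    """
    return 100.0 - (arctic + east_asian + 0.5 * south_asian)


# Placeholder percentages, NOT real calculator output.
print(round(west_eurasian_estimate(arctic=2.0, east_asian=8.0, south_asian=50.0), 2))
```

Whether Arctic is subtracted or counted as West Eurasian is exactly the choice being questioned here; leaving it out of the subtraction raises the estimate by the Arctic percentage.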

Sein
02-10-2014, 10:46 AM
Thanks. I will see what I can do. Yes I always wanted to try this.
Well I don't know if there is any other plot I should create. If people have any requests, please let me know.
Otherwise I will do some ADMIXTURE runs. But in a few days. I need a break. LOL.

No doubt, you definitely deserve one! :)

Sapporo
02-10-2014, 10:51 AM
No doubt, you definitely deserve one!

Agreed. What you have been doing in terms of individually created plots and admixture runs is quite impressive, everest. Perhaps you can consider creating a blog or webpage to showcase all of your work in the near future when you have time? :)

Dr_McNinja
02-10-2014, 10:59 AM
How did you calculate the West Eurasian ancestry of the two Tajiks in Eurogenes V2 K15? I got 63.31% West Eurasian for the Afghan Kabuli Tajik (100) - (Arctic + East Asian + 1/2 South Asian). Are you counting Arctic as Euro? I disagree if you are because this Kabuli Tajik is very far East of me and Sein on 23andMe's Global Similarity. I also got 58.31% West Eurasian for the Tajikistan Tajik.

Whoops, their Arctic wasn't supposed to be counted, I only wanted to do that for the South Asians. I fixed that

Sapporo
02-10-2014, 12:46 PM
Whoops, their Arctic wasn't supposed to be counted, I only wanted to do that for the South Asians. I fixed that


Needs to be adjusted for Eurogenes K13 as well. :)

Dr_McNinja
02-10-2014, 01:15 PM
Re: ADMIXTURE

How does it work anyway? Where did they get the references from? Can it use more than 180-190k SNPs used by Harappa for instance? Could you feed in a bunch of 23andMe/FTDNA raw data files and start trying to find ancestral populations off the basis of 500-700k SNPs? At what point do they connect the ancestral populations to the reference samples?

Dr_McNinja
02-10-2014, 01:22 PM
Re: ADMIXTURE

How does it work anyway? Where did they get the references from? Can it use more than 180-190k SNPs used by Harappa for instance? Could you feed in a bunch of 23andMe/FTDNA raw data files and start trying to find ancestral populations off the basis of 500-700k SNPs? At what point do they connect the ancestral populations to the reference samples?

And out of curiosity, how many SNPs do 23andMe/FTDNA use for their Ancestry Composition/Population Finder?

Dr_McNinja
02-10-2014, 01:29 PM
Needs to be adjusted for Eurogenes K13 as well. :)

Fixed.

Looking at both phased kits of my parents (never paid close attention to the phased maternal until now):

https://docs.google.com/spreadsheet/ccc?key=0AuXBmvmgdkfVdEphd2JCSU9rY3prVjcxc19CU2xCS0E#gid=0

It looks like:

African <-> West Eurasian <-> Arctic <-> ASI <-> East Asian. (actually African is south of West Eurasian but you get the idea)

Some of my mom's Arctic and Far East (in her raw data) went into the ASI part of South Indian in her phased kit. At this point the data put out by these admixture utilities looks so messy I wonder how useful it really is.

Zack's slightly different calculator shifted everything over to the East. It put some ASI into East Asian (making my S-Indian go down) and moved even more West Eurasian into Arctic (and a little into African), and within West Eurasian moved some stuff over to SW-Asian.

EDIT: If I "fixed" my results by shifting back to the west (putting far east in ASI, 3.55% from ASI into Arctic (figuring 15% is a good baseline based on averages of the population), Arctic into West Eurasian), I'd get:

Far East = 0, ASI = 15.95, Arctic = 3.55, West Eurasian = 81.18, which puts me right in line with everyone else. If I do the same with my mom, she also winds up at 81% West Eurasian (which in the harappa spreadsheet is in the range of the other Punjabi Jatts).

My FTDNA Population Finder numbers:

82.36% North Indian (the two reference populations in India are North Indian and Southeast Indian, there was none of the latter picked up)
8.51% Middle East (SW-Asian pretty much with a bit of W-Asian)
9.13% Europe

That leaves me with 63.54% ANI and 18.82% ASI within that North Indian (edit: without shifting anything, it would be ~58 ANI, ~23.5 ASI, which again wouldn't be enough for Sindhi (46.1/23.5), but could be dumped into "North Indian").

The average for the HGDP Pathans (closest to the Jatts) from that link Sein posted was 48.5/18.5. So the ASI is just right, but that means ~15% of my N-Euro+W-Asian is getting caught as ANI. So instead of getting east-shifted into the Arctic, South India, and East Asia, it just shifts me into the North Indian ANI/ASI which doesn't change anything (averaging those out puts me in Pakistani Punjab).

Anyone know which populations they use for North Indian and which for Southeast Indian? Brahmins and Tribals?

Have any other Pakistani Pashtun/Pathan or Punjabi Jatts used Population Finder? What were their results?

Sein
02-10-2014, 01:29 PM
Re: ADMIXTURE

How does it work anyway? Where did they get the references from? Can it use more than 180-190k SNPs used by Harappa for instance? Could you feed in a bunch of 23andMe/FTDNA raw data files and start trying to find ancestral populations off the basis of 500-700k SNPs? At what point do they connect the ancestral populations to the reference samples?

ADMIXTURE works by producing panmictic populations from the data you give it, characterizing them by allele frequencies, and then assigning percentages. In a sense, the components are simply how an algorithm dissects the data, because one wants it to do so. A nice paper that explains it all, but in a very critical light:

http://genome.cshlp.org/content/19/5/703.full
"Recent analyses have used a Bayesian K-populations cluster analysis as the tool of choice for analyzing the large-scale human population genetic data that are now available. These applications involve evolutionary and historical as well as quasitaxonomic concepts. A Bayesian K-populations cluster approach to variation treats our species, or some specified area of the world, in admixture terms, as if populated by people who either are members of discrete populations or are admixed descendants of such populations. Users of the Bayesian K-populations cluster approach refer to the discrete populations in various kinds of ancestry terms, such as by referring to them as “parental.” For example, samples of African-Americans collectively reflect a majority of ancestors from African and smaller fractions from European or other parental populations (Parra et al. 1998; McKeigue et al. 2000; Shriver et al. 2003). The mixing proportions are estimated by statistical analysis based on the admixed sample and donor genotype frequencies taken from samples of presumed parental populations, and may take into account the temporal dynamics of the admixture process (Pfaff et al. 2001).

This kind of admixture approach to human variation has been done most frequently by using the program structure, or programs implementing modifications of the same or a similar conceptual approach (Pritchard et al. 2000; Falush et al. 2003; Hoggart et al. 2004; McKeigue 2005; Tang et al. 2005, 2006; Zhu et al. 2006; Montana and Hoggart 2007). Here, we will use the phrase “structure-like analysis” to refer generically to this approach, regardless of the specific program used in any given paper. The popularity and ease of use of structure-like programs have fueled a recent trend to use the term ”population structure” in the limited admixture sense, and indeed to view history from this rather platonic view, as comprised of parental entities and their offspring. The architects of structure and related programs are well aware of the limitations of the method and state them clearly in their papers (Pritchard et al. 2000; Falush et al. 2003; Tang et al. 2005). However, applications of such programs are often made without heeding caveats or recognizing the limitations of the underlying models with respect to the questions and data at hand.

In structure-like analysis, typical input data consist of globally distributed polymorphisms (STRs, SNPs, indels, etc.) that are genotyped in a sample of individuals. Depending on the purpose and specific program, these may be from a series of intracontinental or global samples. The program user can optionally either specify the number of parental populations and provide their allele frequencies from external data, or can specify that number and have the program statistically group the sample and optimize their allele frequencies, or can have the program estimate both the optimized number of parental populations (K) and their allele frequencies. A parental population is assumed to be randomly mating with Hardy–Weinberg equilibrium genotype proportions, and the program uses likelihood ratio or other similar significance-testing criteria to identify such internally statistically homogeneous populations, and minimize any linkage disequilibrium between them, that is, to determine the statistically optimal population number and allele frequencies represented by the supplied data.

Once the parental populations have been characterized in terms of their allele frequencies, each individual in the data is assigned an estimated fraction of ancestry from each of the parental populations (which can be 1.0 for a member of a parental population)."

Harsh, but that's the best summary one can find.

Your idea concerning 23andMe raw data is actually very interesting, and definitely possible. But one would need to create a robust, diverse data set of 23andMe/FTDNA raw-data files. Also, ADMIXTURE's model doesn't account for linkage disequilibrium, so you need to prune SNPs that are in LD before running it.
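To make "characterizing them by allele frequencies, and then assigning percentages" a bit more concrete, here is a toy Python sketch of the kind of binomial likelihood that structure-like programs try to maximize. It is a simplified illustration, not ADMIXTURE's actual code: the function name, the tiny data set, and the frequencies are all invented, and the real program also has to search for the best Q and P, which this does not do.

```python
import math


def log_likelihood(genotypes, Q, P):
    """Log-likelihood of genotypes given ancestry fractions and allele frequencies.

    genotypes[i][j]: copies (0, 1 or 2) of the counted allele, individual i, SNP j.
    Q[i][k]: fraction of individual i's ancestry from population k (rows sum to 1).
    P[k][j]: frequency of the counted allele in population k at SNP j (0 < p < 1).
    """
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            # Individual-specific allele frequency: ancestry-weighted mix of populations.
            f = sum(Q[i][k] * P[k][j] for k in range(len(P)))
            # Binomial(2, f) log-probability, dropping the constant term.
            total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total


# Invented toy data: two individuals, two SNPs, two "parental" populations.
genotypes = [[2, 0], [0, 2]]
P = [[0.9, 0.1], [0.1, 0.9]]
Q_good = [[1.0, 0.0], [0.0, 1.0]]  # each individual assigned to the matching population
Q_bad = [[0.0, 1.0], [1.0, 0.0]]   # assignments swapped

print(log_likelihood(genotypes, Q_good, P) > log_likelihood(genotypes, Q_bad, P))
```

Note that the sum simply adds up per-SNP terms, i.e. it treats SNPs as independent. That is the reason correlated (linked) SNPs are usually pruned first.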

everest59
02-10-2014, 02:13 PM
Well, I once tried running ADMIXTURE on a data set I created, which had 450k SNPs. I got an error message, so I had to perform linkage-disequilibrium-based pruning, after which I was left with 180k SNPs. The total number of SNPs you are left with depends on the instructions you give to Plink. For example, Zack used:
--indep-pairwise 50 5 0.3
On the other hand, Lazaridis et al. used:
250 25 0.4

The command Zack used ends up with fewer SNPs. When I create my data set I will use what Lazaridis used, which will probably mean more SNPs.
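For anyone who wants to script the pruning step, here is a rough Python sketch that builds the two Plink calls. It assumes a Plink 1.x binary named `plink` is on the PATH and that the data are already in binary (.bed/.bim/.fam) format; the file prefixes and the helper names are placeholders, not anything from an actual pipeline. The three numbers are the window size (in SNPs), the step, and the r^2 threshold.

```python
import subprocess


def prune_commands(bfile, out, window=50, step=5, r2=0.3):
    """Build the two Plink argument lists for LD-based pruning.

    bfile and out are hypothetical file prefixes chosen for this example.
    """
    # Step 1: find SNPs to keep; Plink writes <out>.prune.in and <out>.prune.out.
    find = ["plink", "--bfile", bfile,
            "--indep-pairwise", str(window), str(step), str(r2),
            "--out", out]
    # Step 2: write a new binary data set containing only the retained SNPs.
    keep = ["plink", "--bfile", bfile,
            "--extract", out + ".prune.in",
            "--make-bed", "--out", out + "_pruned"]
    return find, keep


def run_pruning(bfile, out, **params):
    """Actually run both steps; requires Plink to be installed."""
    for cmd in prune_commands(bfile, out, **params):
        subprocess.run(cmd, check=True)


# Print the commands for the two parameter sets mentioned above.
for params in ({"window": 50, "step": 5, "r2": 0.3},
               {"window": 250, "step": 25, "r2": 0.4}):
    for cmd in prune_commands("mydata", "mydata_ld", **params):
        print(" ".join(cmd))
```

How many SNPs survive depends on the data set as well as on these parameters, so the counts quoted in this thread should not be expected to carry over exactly.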

Dr_McNinja
02-10-2014, 02:23 PM
Well, I once tried running ADMIXTURE on a data set I created, which had 450k SNPs. I got an error message, so I had to perform linkage-disequilibrium-based pruning, after which I was left with 180k SNPs. The total number of SNPs you are left with depends on the instructions you give to Plink. For example, Zack used:
--indep-pairwise 50 5 0.3
On the other hand, Lazaridis et al. used:
250 25 0.4

The command Zack used ends up with fewer SNPs. When I create my data set I will use what Lazaridis used, which will probably mean more SNPs.

If you start off with 700-900k SNPs (23andMe/FTDNA range), would the pruning still push it all the way down to ~180k or would it be much higher?

everest59
02-10-2014, 02:32 PM
If you start off with 700-900k SNPs (23andMe/FTDNA range), would the pruning still push it all the way down to ~180k or would it be much higher?

Keep in mind some of the data sets have a lower SNP count than 23andMe data. For example, HGDP has 600k something, but only 500k in common with 23andMe. I think I may be able to push it to 200k+.
Admixture is a very slow program. At K=16 it takes more than a day on my computer. My laptop is probably 5 years old now.
Also keep in mind many people use FTDNA, which doesn't completely match 23andMe data.
One more thing: some 23andMe data have only 500k SNPs.

Dr_McNinja
02-10-2014, 02:38 PM
Keep in mind some of the data sets have a lower SNP count than 23andMe data. For example, HGDP has 600k something, but only 500k in common with 23andMe. I think I may be able to push it to 200k+.
Admixture is a very slow program. At K=16 it takes more than a day on my computer. My laptop is probably 5 years old now.
Also keep in mind many people use FTDNA, which doesn't completely match 23andMe data.

I hope it doesn't kill your laptop. -_- I'm waiting to get back (just under 2 weeks) so I can try and use it on my desktop (core i5, unfortunately not i7, but it's overclocked to 4.5GHz). It has very good cooling so it can run all day with the CPU usage maxed out.

If you want me to run it for you, if it's as simple as following some instructions and e-mailing back some files to you, I could do that (or someone with an even faster computer).

FTDNA has something like ~760k SNPs, any idea how many of those are shared with 23andMe? Most people get identical gedmatch calculator results between their raw data from both platforms.

everest59
02-10-2014, 02:44 PM
OK, let's do that since you have a fast computer. I can send you all my code. Did you download Wubi yet? It will allow up to 30 GB of space.
I am at work, so a little busy.
The biggest trouble is converting data sets to Plink format.
Yeah, that's the first thing you should do (download Plink for Ubuntu).

Dr_McNinja
02-10-2014, 02:55 PM
OK, let's do that since you have a fast computer. I can send you all my code. Did you download Wubi yet? It will allow up to 30 GB of space.
I am at work, so a little busy.
The biggest trouble is converting data sets to Plink format.
Yeah, that's the first thing you should do (download Plink for Ubuntu).

I won't be able to for about 2 weeks when I get back to my desktop (probably the 22nd or 23rd of this month). I'll message you as soon as I can. I was running Ubuntu off a 32GB USB stick but Wubi sounds a lot more convenient. I'll let you know when I get back.

everest59
02-10-2014, 03:05 PM
Do you know any programming languages? I mainly use perl. I don't know the Unix shell stuff.

Dr_McNinja
02-10-2014, 03:20 PM
I haven't programmed in like 10 years but I used to know perl and C. I can't find my way around the unix shell stuff either. -_- Using linux was painful. If only they had a Windows/DOS version of admixture...

Dr_McNinja
02-11-2014, 06:18 PM
I did that gedmatch chromosome browser thing with HRP0370's kit and got 21.2%.

Sapporo
02-11-2014, 08:30 PM
I did that gedmatch chromosome browser thing with HRP0370's kit and got 21.2%.


21.2% in terms of what?

Dr_McNinja
02-11-2014, 08:48 PM
21.2% in terms of what?

European pretty much. How much of the genome is shared with European people. I got 17.8% for paul gill.

Sapporo
02-11-2014, 11:21 PM
European pretty much. How much of the genome is shared with European people. I got 17.8% for paul gill.

Based on which calculator? I'm curious what I would get. Presumably something similar to Paul.

Dr_McNinja
02-12-2014, 02:57 AM
Based on which calculator? I'm curious what I would get. Presumably something similar to Paul.

It's not really a calculator, it's just that 17.8% of paul's raw data overlaps with people of European descent, actual members of Gedmatch's database. It will pick up SNP ranges that are weaker than what admixture would get (but all over 5cM for those SNP ranges).

Use Gedmatch's One-To-many feature, select everyone with a match over 5cM (very annoying step), then go to Chromosome Browser (you might have to copy/paste the URL into notepad to chop it up into two smaller URLs). Then print to a PDF to save the results. It doesn't tell you how long the base chromosome is (it's smaller than the cell devoted to each), but in my case I have my mom's kit on there so I used that as the boundary. Then I measured it using Photoshop's ruler and averaged together all the percentages for each chromosome.

I noticed paul and HRP0370 had a lot of matches for the chunks they did have. I had quite a few chunks which were only matching like 3 to 5 individuals. Still, even those chunks which matched like 20+ people weren't picked up in some places although McDonald did similarly paint half my 15th chromosome as European. But he also had more of my 7th painted as European than I found in matches. So I would figure these might all be underestimated slightly. I think we all (across the board throughout the subcontinent) have even higher European affinity and lower South Asian than admixture shows. That's pretty interesting considering how old the European is.

Dr_McNinja
02-12-2014, 03:40 AM
Paul's X chromosome had much less than mine, and most of HRP0370's entire X chromosome was lit up. Something crazy like 60+% I'd estimate. Dr. McDonald also painted it similarly. But he painted more European throughout than I picked up in matches. I'm not sure why.

It might be related to HRP0370's family (and in general with Muslim populations) having had lots of inbreeding at the first/second cousin level. On my family tree it just happened twice in a recent genealogical timeframe (my mom's parents are first cousins and my parents are second cousins). I'm going to get my grandmother tested with FTDNA's Family Finder, maybe my dad too since his parents weren't related, and see what they get in admixture calculators. I know inbreeding throws the linkage disequilibrium rate for a loop, but I have no idea how or why that would affect admixture results. Or maybe it just makes the calculator effect more pronounced for individuals who weren't sampled in the initial run (so Pathans were sampled, but Muslim Punjabis were not until later, I think). Who knows, maybe everyone in that region is being shifted around a little. I suspect their European numbers are slightly deflated because of this.

Dr_McNinja
02-12-2014, 03:43 AM
Sapporo, do you have a gedmatch kit # whose one-to-many results have been calculated (takes several weeks after you upload)? I'll do it for yours too, might as well get one more baseline to see the correlation with admixture.

Sapporo
02-12-2014, 10:34 AM
Sapporo, do you have a gedmatch kit # whose one-to-many results have been calculated (takes several weeks after you upload)? I'll do it for yours too, might as well get one more baseline to see the correlation with admixture.

Sorry mate. Don't have any Gedmatch kit#'s that you wouldn't already have access to.

everest59
02-12-2014, 06:51 PM
Before I do any more analysis, I would like to create a better data set. The current one has some issues. I think I realize what mistake I made. Then I'll run a quick ADMIXTURE at K=10. Should be ready in a couple of days.

Dr_McNinja
02-13-2014, 04:22 AM
If it's not too much trouble could you guys post your results using the weac, euro7 and globe4 calculators?

http://dodecad.blogspot.com/2011/09/weac-calculator.html

http://dodecad.blogspot.co.uk/2011/09/euro7-calculator.html

http://dodecad.blogspot.co.uk/2012/10/globe4-calculator.html

You can use it easily in DIYDodecadWrapper: http://www.y-str.org/tools/diy-dodecad-wrapper/

Dr_McNinja
02-13-2014, 11:24 AM
I ran the D-statistics script from Dienekes' blog ( http://dienekes.blogspot.co.uk/2012/12/d-statistics-on-admixture-components.html )

I just used globe13. So first my globe13 results:


Siberian 2.22
Amerindian 1.03
Southwest Asian 1.99
East Asian 0.04
Mediterranean 3.63
Australasian 0.54
Arctic 1.35
WestAsian 33.92
NorthEuropean 11.65
SouthAsian 43.61

Results:
"South_Asian" "Siberian" "Ancestral" "0.00905" "2.76"
"South_Asian" "Amerindian" "Ancestral" "-0.00197" "-0.58"
"South_Asian" "West_African" "Ancestral" "-0.0029" "-0.99"
"South_Asian" "Palaeo_African" "Ancestral" "-0.00195" "-0.62"
"South_Asian" "Southwest_Asian" "Ancestral" "-0.02838" "-8.68"
"South_Asian" "East_Asian" "Ancestral" "0.01626" "4.99"
"South_Asian" "Mediterranean" "Ancestral" "-0.03346" "-9.99"
"South_Asian" "Australasian" "Ancestral" "0.01905" "5.13"
"South_Asian" "Arctic" "Ancestral" "0.00103" "0.3"
"South_Asian" "West_Asian" "Ancestral" "-0.04589" "-14.58"
"South_Asian" "North_European" "Ancestral" "-0.03705" "-11.41"
"South_Asian" "East_African" "Ancestral" "-0.0051" "-1.7"

All the results:


Pop1 Pop3 Outgroup Dstat Z
Siberian Amerindian Ancestral 0.08235 22.04
Siberian West_African Ancestral -0.00404 -1.29
Siberian Palaeo_African Ancestral -0.00204 -0.62
Siberian Southwest_Asian Ancestral -0.03635 -10.24
Siberian East_Asian Ancestral 0.10341 29.96
Siberian Mediterranean Ancestral -0.0382 -10.4
Siberian Australasian Ancestral 0.03511 8.98
Siberian Arctic Ancestral 0.10017 27.47
Siberian West_Asian Ancestral -0.0465 -13.72
Siberian North_European Ancestral -0.03416 -9.8
Siberian South_Asian Ancestral -0.00963 -2.92
Siberian East_African Ancestral -0.00491 -1.53
Amerindian Siberian Ancestral 0.09245 24.56
Amerindian West_African Ancestral -0.00329 -0.97
Amerindian Palaeo_African Ancestral -0.00111 -0.3
Amerindian Southwest_Asian Ancestral -0.02747 -7.59
Amerindian East_Asian Ancestral 0.0795 21.46
Amerindian Mediterranean Ancestral -0.02966 -8
Amerindian Australasian Ancestral 0.02517 6.22
Amerindian Arctic Ancestral 0.13137 34.75
Amerindian West_Asian Ancestral -0.02981 -8.48
Amerindian North_European Ancestral -0.01105 -3.07
Amerindian South_Asian Ancestral -0.01108 -3.27
Amerindian East_African Ancestral -0.00479 -1.39
West_African Siberian Ancestral -0.28577 -80.54
West_African Amerindian Ancestral -0.29182 -81.48
West_African Palaeo_African Ancestral -0.0307 -9.46
West_African Southwest_Asian Ancestral -0.27147 -76.18
West_African East_Asian Ancestral -0.28053 -78.82
West_African Mediterranean Ancestral -0.28818 -81.27
West_African Australasian Ancestral -0.27066 -70.16
West_African Arctic Ancestral -0.29083 -80.84
West_African West_Asian Ancestral -0.30282 -90.69
West_African North_European Ancestral -0.30233 -87.21
West_African South_Asian Ancestral -0.30054 -90.45
West_African East_African Ancestral -0.07107 -21.87
Palaeo_African Siberian Ancestral -0.38624 -108.07
Palaeo_African Amerindian Ancestral -0.39156 -106.59
Palaeo_African West_African Ancestral -0.15975 -47.41
Palaeo_African Southwest_Asian Ancestral -0.37518 -106.32
Palaeo_African East_Asian Ancestral -0.38147 -105.47
Palaeo_African Mediterranean Ancestral -0.38824 -109.45
Palaeo_African Australasian Ancestral -0.36983 -96.17
Palaeo_African Arctic Ancestral -0.39111 -106.34
Palaeo_African West_Asian Ancestral -0.40099 -119.68
Palaeo_African North_European Ancestral -0.40038 -114.46
Palaeo_African South_Asian Ancestral -0.40081 -120.93
Palaeo_African East_African Ancestral -0.19443 -55.03
Southwest_Asian Siberian Ancestral -0.04719 -13.57
Southwest_Asian Amerindian Ancestral -0.04743 -12.99
Southwest_Asian West_African Ancestral 0.00305 1.02
Southwest_Asian Palaeo_African Ancestral 0.00268 0.86
Southwest_Asian East_Asian Ancestral -0.0458 -13.2
Southwest_Asian Mediterranean Ancestral 0.02425 7.26
Southwest_Asian Australasian Ancestral -0.04191 -11.19
Southwest_Asian Arctic Ancestral -0.04782 -13.17
Southwest_Asian West_Asian Ancestral -0.01642 -4.96
Southwest_Asian North_European Ancestral -0.00672 -2.03
Southwest_Asian South_Asian Ancestral -0.05727 -17.71
Southwest_Asian East_African Ancestral -0.00461 -1.5
East_Asian Siberian Ancestral 0.09776 27.29
East_Asian Amerindian Ancestral 0.06402 16.5
East_Asian West_African Ancestral -0.00256 -0.8
East_Asian Palaeo_African Ancestral -0.00021 -0.06
East_Asian Southwest_Asian Ancestral -0.03999 -11.03
East_Asian Mediterranean Ancestral -0.04404 -11.94
East_Asian Australasian Ancestral 0.03975 9.52
East_Asian Arctic Ancestral 0.07589 20.22
East_Asian West_Asian Ancestral -0.05235 -15.17
East_Asian North_European Ancestral -0.04409 -12.4
East_Asian South_Asian Ancestral -0.00775 -2.34
East_Asian East_African Ancestral -0.00327 -1
Mediterranean Siberian Ancestral -0.02902 -8.49
Mediterranean Amerindian Ancestral -0.02958 -8.11
Mediterranean West_African Ancestral 0.00051 0.17
Mediterranean Palaeo_African Ancestral 0.00231 0.72
Mediterranean Southwest_Asian Ancestral 0.04551 14.11
Mediterranean East_Asian Ancestral -0.02987 -8.75
Mediterranean Australasian Ancestral -0.02719 -7.38
Mediterranean Arctic Ancestral -0.02982 -8.33
Mediterranean West_Asian Ancestral 0.00599 1.84
Mediterranean North_European Ancestral 0.02386 7.41
Mediterranean South_Asian Ancestral -0.04222 -13.39
Mediterranean East_African Ancestral 0.00834 2.73
Australasian Siberian Ancestral -0.00526 -1.1
Australasian Amerindian Ancestral -0.02411 -4.96
Australasian West_African Ancestral -0.02211 -5.81
Australasian Palaeo_African Ancestral -0.0151 -3.78
Australasian Southwest_Asian Ancestral -0.06907 -15.6
Australasian East_Asian Ancestral 0.00448 0.92
Australasian Mediterranean Ancestral -0.07425 -16.62
Australasian Arctic Ancestral -0.01892 -3.95
Australasian West_Asian Ancestral -0.08557 -19.82
Australasian North_European Ancestral -0.07764 -17.73
Australasian South_Asian Ancestral -0.03945 -8.74
Australasian East_African Ancestral -0.0256 -6.51
Arctic Siberian Ancestral 0.10827 31.02
Arctic Amerindian Ancestral 0.12918 35.33
Arctic West_African Ancestral -0.00366 -1.16
Arctic Palaeo_African Ancestral -0.00213 -0.62
Arctic Southwest_Asian Ancestral -0.02977 -8.26
Arctic East_Asian Ancestral 0.08941 26.22
Arctic Mediterranean Ancestral -0.03181 -8.63
Arctic Australasian Ancestral 0.02855 7.25
Arctic West_Asian Ancestral -0.03451 -9.87
Arctic North_European Ancestral -0.01732 -4.88
Arctic South_Asian Ancestral -0.01009 -3.13
Arctic East_African Ancestral -0.00441 -1.37
West_Asian Siberian Ancestral -0.02387 -7.15
West_Asian Amerindian Ancestral -0.01619 -4.55
West_Asian West_African Ancestral -0.00368 -1.28
West_Asian Palaeo_African Ancestral -0.00061 -0.2
West_Asian Southwest_Asian Ancestral 0.01821 5.75
West_Asian East_Asian Ancestral -0.02478 -7.44
West_Asian Mediterranean Ancestral 0.01994 6.12
West_Asian Australasian Ancestral -0.02521 -6.88
West_Asian Arctic Ancestral -0.019 -5.39
West_Asian North_European Ancestral 0.0201 6.48
West_Asian South_Asian Ancestral -0.04123 -13.46
West_Asian East_African Ancestral 0.00147 0.5
North_European Siberian Ancestral -0.01069 -3.17
North_European Amerindian Ancestral 0.00342 0.95
North_European West_African Ancestral -0.00378 -1.29
North_European Palaeo_African Ancestral -0.001 -0.31
North_European Southwest_Asian Ancestral 0.02875 9.11
North_European East_Asian Ancestral -0.01575 -4.64
North_European Mediterranean Ancestral 0.03865 12.06
North_European Australasian Ancestral -0.0164 -4.51
North_European Arctic Ancestral -0.00095 -0.27
North_European West_Asian Ancestral 0.02079 6.71
North_European South_Asian Ancestral -0.03171 -10.14
North_European East_African Ancestral -0.00209 -0.69
South_Asian Siberian Ancestral 0.00905 2.76
South_Asian Amerindian Ancestral -0.00197 -0.58
South_Asian West_African Ancestral -0.0029 -0.99
South_Asian Palaeo_African Ancestral -0.00195 -0.62
South_Asian Southwest_Asian Ancestral -0.02838 -8.68
South_Asian East_Asian Ancestral 0.01626 4.99
South_Asian Mediterranean Ancestral -0.03346 -9.99
South_Asian Australasian Ancestral 0.01905 5.13
South_Asian Arctic Ancestral 0.00103 0.3
South_Asian West_Asian Ancestral -0.04589 -14.58
South_Asian North_European Ancestral -0.03705 -11.41
South_Asian East_African Ancestral -0.0051 -1.7
East_African Siberian Ancestral -0.2404 -68.31
East_African Amerindian Ancestral -0.24742 -68.26
East_African West_African Ancestral -0.013 -4.31
East_African Palaeo_African Ancestral -0.01047 -3.26
East_African Southwest_Asian Ancestral -0.23075 -65.56
East_African East_Asian Ancestral -0.23476 -66.57
East_African Mediterranean Ancestral -0.23638 -66.58
East_African Australasian Ancestral -0.22642 -59.09
East_African Arctic Ancestral -0.24564 -68.91
East_African West_Asian Ancestral -0.25388 -75.91
East_African North_European Ancestral -0.25618 -73.8
East_African South_Asian Ancestral -0.25676 -77.82

Now I'm going to run it again using HarappaWorld.

EDIT: I take these results to indicate my West Asian component has a common ancestry with the Southwest Asian, Mediterranean, and North European components. Also that I don't really have any Siberian/Amerindian/Arctic ancestry, but these components are linked to South Asian and North European (EDIT: It seems Siberian is more likely to have come from South Asian, Arctic just barely so, and Amerindian is barely closer to North European).
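A quick way to pull the notable rows out of a table like the one above is to filter on the Z column. The short Python sketch below does that for a few of the South_Asian rows copied from the results. The |Z| >= 3 cutoff is only a commonly used rule of thumb and an assumption on my part; it is not something the script itself reports, and the helper names are made up for this example.

```python
def parse_rows(text):
    """Parse whitespace-separated rows: Pop1 Pop3 Outgroup Dstat Z."""
    rows = []
    for line in text.strip().splitlines():
        pop1, pop3, outgroup, dstat, z = line.split()
        rows.append((pop1, pop3, outgroup, float(dstat), float(z)))
    return rows


def significant(rows, z_cutoff=3.0):
    """Keep rows whose |Z| reaches the cutoff (3.0 is an assumed convention)."""
    return [r for r in rows if abs(r[4]) >= z_cutoff]


# Four rows copied from the South_Asian block of the table above.
sample = """
South_Asian Siberian Ancestral 0.00905 2.76
South_Asian Amerindian Ancestral -0.00197 -0.58
South_Asian East_Asian Ancestral 0.01626 4.99
South_Asian West_Asian Ancestral -0.04589 -14.58
"""

for pop1, pop3, _, dstat, z in significant(parse_rows(sample)):
    print(pop1, pop3, dstat, z)
```

With that cutoff, the South_Asian/Siberian row (Z = 2.76) and the South_Asian/Arctic row (Z = 0.3) would not count as significant, which is worth keeping in mind when reading the interpretation above.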

EDIT: Harappa results:


[,1] [,2] [,3] [,4] [,5]
[1,] "Pop1" "Pop3" "Outgroup" "Dstat" "Z"
[2,] "S-Indian" "Baloch" "San" "-0.0501" "-21.82"
[3,] "S-Indian" "Caucasian" "San" "-0.03629" "-16.16"
[4,] "S-Indian" "NE-Euro" "San" "-0.03759" "-15.64"
[5,] "S-Indian" "SE-Asian" "San" "0.02167" "8.7"
[6,] "S-Indian" "Siberian" "San" "0.01054" "4.21"
[7,] "S-Indian" "NE-Asian" "San" "0.02165" "8.7"
[8,] "S-Indian" "Papuan" "San" "0.02364" "7.62"
[9,] "S-Indian" "American" "San" "-0.00028" "-0.1"
[10,] "S-Indian" "Beringian" "San" "0.00529" "1.96"
[11,] "S-Indian" "Mediterranean" "San" "-0.03735" "-15.3"
[12,] "S-Indian" "SW-Asian" "San" "-0.03183" "-14.01"
[13,] "S-Indian" "E-African" "San" "-0.00376" "-2.33"
[14,] "S-Indian" "Pygmy" "San" "-0.00184" "-1.34"
[15,] "S-Indian" "W-African" "San" "-0.00164" "-1.15"
[16,] "Baloch" "S-Indian" "San" "-0.0379" "-17.56"
[17,] "Baloch" "Caucasian" "San" "0.01392" "6.43"
[18,] "Baloch" "NE-Euro" "San" "0.0128" "5.69"
[19,] "Baloch" "SE-Asian" "San" "-0.01705" "-6.85"
[20,] "Baloch" "Siberian" "San" "-0.01391" "-5.62"
[21,] "Baloch" "NE-Asian" "San" "-0.01778" "-7.02"
[22,] "Baloch" "Papuan" "San" "-0.01889" "-6.17"
[23,] "Baloch" "American" "San" "-0.00724" "-2.72"
[24,] "Baloch" "Beringian" "San" "-0.01187" "-4.44"
[25,] "Baloch" "Mediterranean" "San" "0.01287" "5.58"
[26,] "Baloch" "SW-Asian" "San" "0.01552" "7.3"
[27,] "Baloch" "E-African" "San" "0.00089" "0.56"
[28,] "Baloch" "Pygmy" "San" "-0.00179" "-1.32"
[29,] "Baloch" "W-African" "San" "-0.00041" "-0.28"
[30,] "Caucasian" "S-Indian" "San" "-0.04264" "-19.65"
[31,] "Caucasian" "Baloch" "San" "-0.00524" "-2.46"
[32,] "Caucasian" "NE-Euro" "San" "0.02316" "10.11"
[33,] "Caucasian" "SE-Asian" "San" "-0.03314" "-13.92"
[34,] "Caucasian" "Siberian" "San" "-0.03072" "-12.83"
[35,] "Caucasian" "NE-Asian" "San" "-0.03316" "-13.72"
[36,] "Caucasian" "Papuan" "San" "-0.03357" "-11.09"
[37,] "Caucasian" "American" "San" "-0.03051" "-11.79"
[38,] "Caucasian" "Beringian" "San" "-0.03351" "-12.88"
[39,] "Caucasian" "Mediterranean" "San" "0.04592" "19.7"
[40,] "Caucasian" "SW-Asian" "San" "0.03362" "15.84"
[41,] "Caucasian" "E-African" "San" "0.0064" "4.01"
[42,] "Caucasian" "Pygmy" "San" "-0.00514" "-3.82"
[43,] "Caucasian" "W-African" "San" "-0.00342" "-2.35"
[44,] "NE-Euro" "S-Indian" "San" "-0.03193" "-13.91"
[45,] "NE-Euro" "Baloch" "San" "0.006" "2.68"
[46,] "NE-Euro" "Caucasian" "San" "0.03574" "15.71"
[47,] "NE-Euro" "SE-Asian" "San" "-0.01481" "-5.92"
[48,] "NE-Euro" "Siberian" "San" "-0.00542" "-2.17"
[49,] "NE-Euro" "NE-Asian" "San" "-0.0143" "-5.67"
[50,] "NE-Euro" "Papuan" "San" "-0.01769" "-5.77"
[51,] "NE-Euro" "American" "San" "0.00417" "1.53"
[52,] "NE-Euro" "Beringian" "San" "-0.00297" "-1.09"
[53,] "NE-Euro" "Mediterranean" "San" "0.04709" "19.52"
[54,] "NE-Euro" "SW-Asian" "San" "0.02987" "13.56"
[55,] "NE-Euro" "E-African" "San" "0.00229" "1.36"
[56,] "NE-Euro" "Pygmy" "San" "-0.00398" "-2.9"
[57,] "NE-Euro" "W-African" "San" "-0.00312" "-2.08"
[58,] "SE-Asian" "S-Indian" "San" "-0.00177" "-0.73"
[59,] "SE-Asian" "Baloch" "San" "-0.0517" "-20.7"
[60,] "SE-Asian" "Caucasian" "San" "-0.04908" "-20.13"
[61,] "SE-Asian" "NE-Euro" "San" "-0.04287" "-17.07"
[62,] "SE-Asian" "Siberian" "San" "0.0877" "30.69"
[63,] "SE-Asian" "NE-Asian" "San" "0.12524" "45.48"
[64,] "SE-Asian" "Papuan" "San" "0.04054" "11.66"
[65,] "SE-Asian" "American" "San" "0.05962" "18.55"
[66,] "SE-Asian" "Beringian" "San" "0.0767" "24.88"
[67,] "SE-Asian" "Mediterranean" "San" "-0.04429" "-17.23"
[68,] "SE-Asian" "SW-Asian" "San" "-0.04068" "-16.46"
[69,] "SE-Asian" "E-African" "San" "-0.00423" "-2.43"
[70,] "SE-Asian" "Pygmy" "San" "-0.00076" "-0.51"
[71,] "SE-Asian" "W-African" "San" "-0.00073" "-0.47"
[72,] "Siberian" "S-Indian" "San" "-0.00511" "-2.15"
[73,] "Siberian" "Baloch" "San" "-0.04134" "-16.69"
[74,] "Siberian" "Caucasian" "San" "-0.03941" "-16.41"
[75,] "Siberian" "NE-Euro" "San" "-0.02635" "-10.55"
[76,] "Siberian" "SE-Asian" "San" "0.09584" "34.85"
[77,] "Siberian" "NE-Asian" "San" "0.11511" "42.84"
[78,] "Siberian" "Papuan" "San" "0.03423" "10.44"
[79,] "Siberian" "American" "San" "0.08493" "28.11"
[80,] "Siberian" "Beringian" "San" "0.1149" "38.54"
[81,] "Siberian" "Mediterranean" "San" "-0.03301" "-12.58"
[82,] "Siberian" "SW-Asian" "San" "-0.03338" "-13.69"
[83,] "Siberian" "E-African" "San" "-0.00414" "-2.36"
[84,] "Siberian" "Pygmy" "San" "-0.00122" "-0.82"
[85,] "Siberian" "W-African" "San" "-0.00162" "-1.02"
[86,] "NE-Asian" "S-Indian" "San" "0.00065" "0.27"
[87,] "NE-Asian" "Baloch" "San" "-0.05007" "-19.93"
[88,] "NE-Asian" "Caucasian" "San" "-0.04678" "-19.1"
[89,] "NE-Asian" "NE-Euro" "San" "-0.04005" "-15.81"
[90,] "NE-Asian" "SE-Asian" "San" "0.1279" "46.8"
[91,] "NE-Asian" "Siberian" "San" "0.10949" "39.77"
[92,] "NE-Asian" "Papuan" "San" "0.04386" "12.75"
[93,] "NE-Asian" "American" "San" "0.07704" "24.07"
[94,] "NE-Asian" "Beringian" "San" "0.09949" "32.65"
[95,] "NE-Asian" "Mediterranean" "San" "-0.04238" "-16.31"
[96,] "NE-Asian" "SW-Asian" "San" "-0.03851" "-15.6"
[97,] "NE-Asian" "E-African" "San" "-0.0036" "-2.08"
[98,] "NE-Asian" "Pygmy" "San" "-0.00065" "-0.43"
[99,] "NE-Asian" "W-African" "San" "-0.00073" "-0.47"
[100,] "Papuan" "S-Indian" "San" "-0.02259" "-7.34"
[101,] "Papuan" "Baloch" "San" "-0.0751" "-25.12"
[102,] "Papuan" "Caucasian" "San" "-0.07106" "-24.24"
[103,] "Papuan" "NE-Euro" "San" "-0.06731" "-21.83"
[104,] "Papuan" "SE-Asian" "San" "0.01733" "4.89"
[105,] "Papuan" "Siberian" "San" "0.0035" "0.99"
[106,] "Papuan" "NE-Asian" "San" "0.01813" "5.14"
[107,] "Papuan" "American" "San" "-0.01117" "-3.01"
[108,] "Papuan" "Beringian" "San" "-0.00197" "-0.54"
[109,] "Papuan" "Mediterranean" "San" "-0.06476" "-20.97"
[110,] "Papuan" "SW-Asian" "San" "-0.0596" "-20.32"
[111,] "Papuan" "E-African" "San" "-0.01504" "-7.65"
[112,] "Papuan" "Pygmy" "San" "-9e-04" "-0.58"
[113,] "Papuan" "W-African" "San" "-0.00777" "-4.53"
[114,] "American" "S-Indian" "San" "-0.00741" "-2.96"
[115,] "American" "Baloch" "San" "-0.02662" "-10.23"
[116,] "American" "Caucasian" "San" "-0.03109" "-12.5"
[117,] "American" "NE-Euro" "San" "-0.00865" "-3.29"
[118,] "American" "SE-Asian" "San" "0.07639" "26.08"
[119,] "American" "Siberian" "San" "0.09387" "31.98"
[120,] "American" "NE-Asian" "San" "0.09144" "31.14"
[121,] "American" "Papuan" "San" "0.02764" "7.83"
[122,] "American" "Beringian" "San" "0.12912" "41.88"
[123,] "American" "Mediterranean" "San" "-0.02733" "-10.2"
[124,] "American" "SW-Asian" "San" "-0.02724" "-10.81"
[125,] "American" "E-African" "San" "-0.00377" "-2.15"
[126,] "American" "Pygmy" "San" "0.00058" "0.37"
[127,] "American" "W-African" "San" "-0.00089" "-0.56"
[128,] "Beringian" "S-Indian" "San" "-0.00395" "-1.59"
[129,] "Beringian" "Baloch" "San" "-0.0332" "-12.72"
[130,] "Beringian" "Caucasian" "San" "-0.03608" "-14.45"
[131,] "Beringian" "NE-Euro" "San" "-0.01776" "-6.8"
[132,] "Beringian" "SE-Asian" "San" "0.09151" "31.44"
[133,] "Beringian" "Siberian" "San" "0.12181" "42.73"
[134,] "Beringian" "NE-Asian" "San" "0.11193" "39.71"
[135,] "Beringian" "Papuan" "San" "0.03508" "10.33"
[136,] "Beringian" "American" "San" "0.12687" "41.2"
[137,] "Beringian" "Mediterranean" "San" "-0.03182" "-12.13"
[138,] "Beringian" "SW-Asian" "San" "-0.03055" "-11.95"
[139,] "Beringian" "E-African" "San" "-0.00318" "-1.69"
[140,] "Beringian" "Pygmy" "San" "-5e-04" "-0.31"
[141,] "Beringian" "W-African" "San" "-0.00082" "-0.5"
[142,] "Mediterranean" "S-Indian" "San" "-0.04905" "-21.62"
[143,] "Mediterranean" "Baloch" "San" "-0.01176" "-5.04"
[144,] "Mediterranean" "Caucasian" "San" "0.04017" "17.26"
[145,] "Mediterranean" "NE-Euro" "San" "0.02876" "12.04"
[146,] "Mediterranean" "SE-Asian" "San" "-0.03375" "-13.68"
[147,] "Mediterranean" "Siberian" "San" "-0.02974" "-11.82"
[148,] "Mediterranean" "NE-Asian" "San" "-0.03416" "-13.35"
[149,] "Mediterranean" "Papuan" "San" "-0.03262" "-10.47"
[150,] "Mediterranean" "American" "San" "-0.03215" "-11.73"
[151,] "Mediterranean" "Beringian" "San" "-0.03467" "-12.71"
[152,] "Mediterranean" "SW-Asian" "San" "0.04169" "18.82"
[153,] "Mediterranean" "E-African" "San" "0.00865" "5.2"
[154,] "Mediterranean" "Pygmy" "San" "-0.00514" "-3.71"
[155,] "Mediterranean" "W-African" "San" "-0.00123" "-0.81"
[156,] "SW-Asian" "S-Indian" "San" "-0.0666" "-30.32"
[157,] "SW-Asian" "Baloch" "San" "-0.0328" "-14.13"
[158,] "SW-Asian" "Caucasian" "San" "0.00371" "1.6"
[159,] "SW-Asian" "NE-Euro" "San" "-0.01212" "-5.04"
[160,] "SW-Asian" "SE-Asian" "San" "-0.05313" "-21.4"
[161,] "SW-Asian" "Siberian" "San" "-0.05315" "-21.47"
[162,] "SW-Asian" "NE-Asian" "San" "-0.05328" "-21.1"
[163,] "SW-Asian" "Papuan" "San" "-0.0504" "-16.68"
[164,] "SW-Asian" "American" "San" "-0.05505" "-20.33"
[165,] "SW-Asian" "Beringian" "San" "-0.05639" "-20.44"
[166,] "SW-Asian" "Mediterranean" "San" "0.01732" "7.26"
[167,] "SW-Asian" "E-African" "San" "0.00058" "0.36"
[168,] "SW-Asian" "Pygmy" "San" "-0.0049" "-3.54"
[169,] "SW-Asian" "W-African" "San" "1e-04" "0.07"
[170,] "E-African" "S-Indian" "San" "-0.24253" "-98.24"
[171,] "E-African" "Baloch" "San" "-0.24762" "-97.95"
[172,] "E-African" "Caucasian" "San" "-0.22733" "-89.33"
[173,] "E-African" "NE-Euro" "San" "-0.23999" "-89.34"
[174,] "E-African" "SE-Asian" "San" "-0.22273" "-84.17"
[175,] "E-African" "Siberian" "San" "-0.22881" "-83.93"
[176,] "E-African" "NE-Asian" "San" "-0.22412" "-84.97"
[177,] "E-African" "Papuan" "San" "-0.21391" "-66.62"
[178,] "E-African" "American" "San" "-0.23485" "-82.52"
[179,] "E-African" "Beringian" "San" "-0.23296" "-81.49"
[180,] "E-African" "Mediterranean" "San" "-0.22035" "-81.68"
[181,] "E-African" "SW-Asian" "San" "-0.20815" "-82.08"
[182,] "E-African" "Pygmy" "San" "0.0068" "4.43"
[183,] "E-African" "W-African" "San" "0.00729" "4.93"
[184,] "Pygmy" "S-Indian" "San" "-0.35421" "-138.94"
[185,] "Pygmy" "Baloch" "San" "-0.36144" "-136.73"
[186,] "Pygmy" "Caucasian" "San" "-0.3495" "-134.78"
[187,] "Pygmy" "NE-Euro" "San" "-0.35669" "-130.24"
[188,] "Pygmy" "SE-Asian" "San" "-0.33518" "-120.17"
[189,] "Pygmy" "Siberian" "San" "-0.34087" "-119.76"
[190,] "Pygmy" "NE-Asian" "San" "-0.33674" "-120.52"
[191,] "Pygmy" "Papuan" "San" "-0.32008" "-94.95"
[192,] "Pygmy" "American" "San" "-0.34506" "-117.11"
[193,] "Pygmy" "Beringian" "San" "-0.34467" "-116.58"
[194,] "Pygmy" "Mediterranean" "San" "-0.34467" "-123.52"
[195,] "Pygmy" "SW-Asian" "San" "-0.32867" "-124.38"
[196,] "Pygmy" "E-African" "San" "-0.13578" "-65.04"
[197,] "Pygmy" "W-African" "San" "-0.079" "-44.12"
[198,] "W-African" "S-Indian" "San" "-0.28595" "-115.36"
[199,] "W-African" "Baloch" "San" "-0.29314" "-113.88"
[200,] "W-African" "Caucasian" "San" "-0.28018" "-110.23"
[201,] "W-African" "NE-Euro" "San" "-0.28872" "-106.86"
[202,] "W-African" "SE-Asian" "San" "-0.2658" "-99.25"
[203,] "W-African" "Siberian" "San" "-0.27236" "-98.38"
[204,] "W-African" "NE-Asian" "San" "-0.26761" "-99.23"
[205,] "W-African" "Papuan" "San" "-0.25463" "-77.6"
[206,] "W-African" "American" "San" "-0.27781" "-97.53"
[207,] "W-African" "Beringian" "San" "-0.27644" "-96.37"
[208,] "W-African" "Mediterranean" "San" "-0.27351" "-101.3"
[209,] "W-African" "SW-Asian" "San" "-0.25498" "-100.4"
[210,] "W-African" "E-African" "San" "-0.05019" "-27.16"
[211,] "W-African" "Pygmy" "San" "0.01004" "6.44"
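
For anyone who wants to skim output like the above for the rows that stand out, here is a minimal Python sketch. It assumes the rows are pasted as text in the R matrix print format shown, and it uses |Z| >= 3 as the cutoff, which is a common rule of thumb and not something stated in this thread. The names `parse_dstat_rows` and `significant` are made up for illustration.

```python
import re

def parse_dstat_rows(text):
    """Parse lines such as: [26,] "Baloch" "SW-Asian" "San" "0.01552" "7.3"."""
    rows = []
    for line in text.splitlines():
        fields = re.findall(r'"([^"]*)"', line)
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        pop_a, pop_b, outgroup, d, z = fields
        rows.append((pop_a, pop_b, outgroup, float(d), float(z)))
    return rows

def significant(rows, z_cutoff=3.0):
    """Keep rows whose |Z| reaches the (assumed) cutoff."""
    return [r for r in rows if abs(r[4]) >= z_cutoff]

# Four rows copied from the table above.
sample = '''
[16,] "Baloch" "S-Indian" "San" "-0.0379" "-17.56"
[26,] "Baloch" "SW-Asian" "San" "0.01552" "7.3"
[27,] "Baloch" "E-African" "San" "0.00089" "0.56"
[112,] "Papuan" "Pygmy" "San" "-9e-04" "-0.58"
'''

rows = parse_dstat_rows(sample)
for pop_a, pop_b, outgroup, d, z in significant(rows):
    print(f"{pop_a:10s} {pop_b:10s} {outgroup:5s} D={d:+.5f} Z={z:+.2f}")
```

Of the four sample rows, only the two Baloch rows with S-Indian and SW-Asian pass the cutoff. Values printed by R in scientific notation, such as -9e-04, parse as ordinary floats.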

everest59
02-13-2014, 04:04 PM
I have something interesting to report. I was fooling around with Mclust (which I still don't understand properly), and I got some interesting results. I basically used the PCA data and fed it to Mclust. See the link below:

First things first. It seems like cluster 1 is the outlier cluster (groups like Kalash, Mbutipygmy as well as Papuans and Melanesians got lumped into this). However, the rest seem pretty much okay. Here's what's interesting:
Sein, me, Sapporo and Dr. Mcninja got lumped as 4, along with Baloch and Sindhis, which makes sense.
Zephorous and Humanist got lumped as 3, where we also find the Druze and the Bedouin.
Now here's where it gets interesting. DMXX, NK19191, mfa and Icebreaker are cluster 6, which is also the European cluster.

So, the link is below:
https://drive.google.com/file/d/0B3vEDdpZDjUpN2gyTW9LZ3NHdmc/edit?usp=sharing

Keep in mind, I don't know how to use Mclust very well yet.

P.S. The dataset is HGDP only, with all the 23andMe data merged in.
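
For readers wondering what "feeding PCA data to Mclust" amounts to: Mclust fits Gaussian mixture models to the coordinates and compares candidate models by BIC. The sketch below is not Mclust. It is a stripped-down stand-in in plain Python that fits spherical Gaussians by EM to made-up two-dimensional "PC" coordinates. Every name in it (`fit_spherical_gmm`, `bic`, the toy points) is hypothetical, and the farthest-point initialisation is just a convenient deterministic choice.

```python
import math

def fit_spherical_gmm(points, k, iters=50, var_floor=1e-3):
    """Fit k spherical Gaussians by EM; return (labels, log-likelihood)."""
    n, d = len(points), len(points[0])
    # Deterministic farthest-point start for the component means.
    means = [points[0]]
    while len(means) < k:
        means.append(max(points, key=lambda p: min(math.dist(p, m) for m in means)))
    variances, weights = [1.0] * k, [1.0 / k] * k
    for _ in range(iters):
        # E-step: per-point responsibilities, computed in log space.
        resp, loglik = [], 0.0
        for p in points:
            logs = [math.log(weights[j])
                    - 0.5 * d * math.log(2 * math.pi * variances[j])
                    - math.dist(p, means[j]) ** 2 / (2 * variances[j])
                    for j in range(k)]
            top = max(logs)
            total = top + math.log(sum(math.exp(v - top) for v in logs))
            loglik += total
            resp.append([math.exp(v - total) for v in logs])
        # M-step: update weight, mean and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = tuple(sum(r[j] * p[c] for r, p in zip(resp, points)) / nj
                             for c in range(d))
            sq = sum(r[j] * math.dist(p, means[j]) ** 2 for r, p in zip(resp, points))
            variances[j] = max(sq / (nj * d), var_floor)
    # Hard labels from the last E-step; loglik also comes from that E-step.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, loglik

def bic(loglik, n, d, k):
    """BIC in the 'larger is better' form: 2*logL - n_params*log(n)."""
    n_params = k * d + k + (k - 1)  # means, variances, free mixing weights
    return 2 * loglik - n_params * math.log(n)

# Toy coordinates: two tight, well-separated groups of five samples each.
pcs = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (-0.2, -0.2),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2), (4.8, 4.8)]
for k in (1, 2, 3):
    labels, ll = fit_spherical_gmm(pcs, k)
    print(k, round(bic(ll, len(pcs), 2, k), 1), labels)
```

With k = 2 the two toy groups end up with different labels. The real package tries many covariance shapes, not only the spherical one, and uses the same sign convention for BIC (larger is better), so the cluster numbers it reports depend on which model and which number of components win that comparison.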

Sein
02-13-2014, 10:09 PM
This is rather interesting: populations as diverse as Makrani/Baloch/Brahui, Burusho, Pashtun, Sindhi, Punjabi Jatt, and Nepali Brahmin are in a single cluster! Very cool stuff. How many PCs did you keep?

Everest, I have an idea. Try all of the HGDP Pakistani populations (minus the Kalash; this sort of method will never work well for them, since they are just too drifted and inbred. But do keep the Hazara, and add the Uyghur), all of the Di Cristofaro Afghans, and their equivalent co-ethnics from the Behar and Yunusbayev datasets.

That would leave you with 8 HGDP populations, all 5 Di Cristofaro populations, and 5 Central Asian populations from Behar+Yunusbayev. Run PCA on 18 dimensions. Then, run Mclust on all of your individuals, allowing K to be as high as 18.

I'm sure you'll obtain very interesting results. Thank you!

everest59
02-13-2014, 10:24 PM
I think the reason South Asian populations aren't being differentiated is the lack of South Asian-specific populations. Even when I used 52 dimensions, the results were the same, with South Asians clustering together. The only population that clustered differently was the Kalash.
Anyways, it was a good exercise, although not that accurate.
I will do what you said when I get a chance (although I'll have to combine some datasets).

Sein
02-13-2014, 10:27 PM
That makes sense.

Thank you!

DMXX
02-13-2014, 11:07 PM
First things first. It seems like cluster 1 is the outlier cluster (groups like Kalash, Mbutipygmy as well as Papuans and Melanesians got lumped into this). However, the rest seem pretty much okay. Here's what's interesting:
Sein, me, Sapporo and Dr. Mcninja got lumped as 4, along with Baloch and Sindhis, which makes sense.
Zephorous and Humanist got lumped as 3, where we also find the Druze and the Bedouin.
Now here's where it gets interesting. DMXX, NK19191, mfa and Icebreaker are cluster 6, which is also the European cluster.

Do you have any ideas how this happened, everest?

everest59
02-13-2014, 11:11 PM
Okay, I did another quick one. This time I added some more populations, like Kurds, Iranians and Turks, as well as Central Asian populations. Looks like I forgot the Hazaras.
BTW, this procedure seems to do a good job of identifying mislabelings. For example, one Tajik seems to be a Pashtun (cluster 13). Also, Sapporo is in cluster 13 with the Pashtun populations, while Dr_McNinja is in cluster 11. A couple of Gujarati samples and a few Singapore Indians are also in 13.

This time the Iranian, Kurdish, Assyrian and Turkish participants are clustering together. Probably because I added Kurd and Iranian populations (also added Turks). This was run on 35 dimensions.

https://drive.google.com/file/d/0B3vEDdpZDjUpelRpcXYyZlM2YWM/edit?usp=sharing



Do you have any ideas how this happened, everest?

I think it may be due to lack of certain populations. Check out my new file.
I know for example that in various calculators the Assyrian populations have no Northern European admixture. Perhaps that contributed.

Sein
02-13-2014, 11:26 PM
Now this is a very interesting output! And rather fascinating, Dr_McNinja and you get classified in the North-Northwest Indian cluster, while Sapporo gets classified in the Pashtun cluster.

Everest, could you run Mclust with only Pashtuns, me, and Sapporo, 100 dimensions? I wonder if it could differentiate the Afghan and Pakistani Pashtuns. So far, it seems PCA-MDS can't tell the difference, but the HarappaWorld results suggest substantial differentiation. So, I'm wondering if it could differentiate with 100 dimensions? Of course, once you find the time.

everest59
02-13-2014, 11:53 PM
Well, at 100 dimensions, everybody gets differentiated. I tried 50, and again everybody has a different cluster number. At 5, this is what I get (sample number, label, cluster):
1 Pashtun 1
2 Pashtun 2
3 Pashtun 1
4 Pashtun 1
5 Pashtun 3
6 Pathan 1
7 Pathan 1
8 Pathan 1
9 Pathan 1
10 Pathan 1
11 Pathan 2
12 Pathan 1
13 Pathan 1
14 Pathan 1
15 Pathan 1
16 Pathan 4
17 Pathan 1
18 Pathan 1
19 Pathan 1
20 Pathan 2
21 Pathan 2
22 Pathan 4
23 Pathan 1
24 Pathan 2
25 Pathan 4
26 Pathan 3
27 Pathan 1
28 Sapporo 1
29 Sein 1

I need to go now, but I'll try different dimensions and see what I can do.
Here is the order of the Pashtun + HGDP samples (rows 1-27 above):
[1] Pashtun2_6Af Pashtun2_8Af Pashtun2_20Af Pashtun2_22Af Pashtun10_17Af
[6] HGDP00213 HGDP00214 HGDP00216 HGDP00218 HGDP00222
[11] HGDP00224 HGDP00226 HGDP00228 HGDP00230 HGDP00232
[16] HGDP00234 HGDP00237 HGDP00239 HGDP00241 HGDP00243
[21] HGDP00244 HGDP00251 HGDP00254 HGDP00258 HGDP00259
[26] HGDP00262 HGDP00264

Sein
02-13-2014, 11:55 PM
Thanks Everest! This pretty much confirms what I suspected: HarappaWorld isn't very reliable in this case. The distribution seems random, which one would expect for a single population. On top of that, most of the Afghan and Pakistani Pashtuns get classified in #1. So it's safe to say that the Di Cristofaro and HGDP Pashtuns, despite being very geographically distant, belong to a single population (in terms of genetics).
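
The "most get classified in #1" observation can be checked by tabulating everest59's 5-dimension output by label. A minimal Python sketch follows; the cluster numbers are copied from the listing above, while the variable names and the choice to report the share in cluster 1 are only illustrative.

```python
from collections import Counter, defaultdict

# Cluster numbers copied from the 5-dimension run above, in row order.
assignments = (
    [("Pashtun", c) for c in (1, 2, 1, 1, 3)]  # rows 1-5, Afghan samples
    + [("Pathan", c) for c in (1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 4,
                               1, 1, 1, 2, 2, 4, 1, 2, 4, 3, 1)]  # rows 6-27, HGDP
    + [("Sapporo", 1), ("Sein", 1)]  # rows 28-29
)

# Cross-tabulate label against cluster number.
table = defaultdict(Counter)
for label, cluster in assignments:
    table[label][cluster] += 1

for label in ("Pashtun", "Pathan", "Sapporo", "Sein"):
    counts = table[label]
    total = sum(counts.values())
    breakdown = " ".join(f"c{c}={counts[c]}" for c in sorted(counts))
    share = counts[1] / total
    print(f"{label:8s} n={total:2d}  {breakdown}  share in cluster 1: {share:.0%}")
```

That gives 3 of 5 Afghan Pashtuns and 14 of 22 HGDP Pathans in cluster 1. With only five Afghan samples the comparison is rough, so it is consistent with a single population but is not a formal test of it.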

Sein
02-14-2014, 02:07 AM
Hey Everest,

Could you try this with Mclust? http://dodecad.blogspot.com/2012/01/fastibd-analysis-of-balkanswest-asia.html The post where Dienekes explains all the technicalities: http://dienekes.blogspot.com/2012/01/clusters-galore-fastibd-edition.html Should be interesting, but as always, only when you find the time. I'm sure you must be very busy.

Also, this is absolutely awesome, http://admixturemap.paintmychromosomes.com/. The actual paper, http://www.sciencemag.org/content/343/6172/747.abstract?sid=736c3d26-56ca-4f23-930f-1fdf773fd92d. Supplementary info, http://www.sciencemag.org/content/suppl/2014/02/12/343.6172.747.DC1/Hellenthal.SM.pdf. This is cool and everything, but I honestly have my doubts. This is very computationally intensive stuff. On top of that, the supplements are a very intimidating read (for someone like myself, but you are way more intelligent than myself). I honestly think this might be out of the question for now, but it is fun to look at. And who knows, one day you might get enough computational muscle to try this out, so it's just something to keep in mind.

The HGDP Pashtun results are very interesting. The first admixture event occurs around 2530 BCE and involves a Northwest European population and a Northwest Indian population; for this event, Pashtuns turn out 40% Northwest European and 60% Northwest Indian. The second admixture event occurs around 1362 CE and involves an Iranian-like population (really, a collage of Iranian+Sindhi+Northern European) and a Northwest Indian population (really, mostly Sindhi plus some Southeast Asian, and a little South Indian); for this event, Pashtuns turn out 77% Iranian and 23% Northwest Indian. In short, the HGDP Pashtuns are a mix between Northern Europeans, Iranians, and Northwest South Asians (I'm basing this on the companion website). Pretty cool result, and this method is very sophisticated. Like I said though, the unfortunate thing is that it's very computationally intensive, and you need to have some experience with ChromoPainter/FineStructure.

everest59
02-14-2014, 03:15 AM
I did not know that paper was available for free online. I haven't used ChromoPainter/FineStructure before, so I am going to read the whole paper first. If the software is all available for free, I'll see what I can do.
Does it require a good comp? Mine has 4 GB of memory and probably around 6 GB of disk space left. :(

Sein
02-14-2014, 03:37 AM
I think the paper isn't open access. But the supplements are extremely detailed (and freely available). Everything is explained, and all the nuts and bolts are worked through. Also, one can find the actual results in a nice format at the companion website. And I think all of the software is available for free.

I believe it may require a good comp. ChromoPainter/FineStructure has a reputation for being very difficult and intensive, and this seems to be identical in this regard. I guess it depends on what you're willing to subject your machine to. It all depends on you my friend, if it can be done, that would be great. But if not, that's perfectly alright.

everest59
02-14-2014, 03:48 AM
I'll need to learn the software first. Sounds like it would be interesting.
BTW, I know how to calculate admixture dates using Admixtools. For example, for Nepalese Brahmins I got an admixture date of 80-something generations using Dai and Georgian as the two reference populations (however, the standard error was really high). The Xing dataset does not have many SNPs in common with other datasets.

Also, ALDER can calculate admixture dates as well. It is based on linkage disequilibrium, so it can only infer more recent admixture dates. Admixtools, on the other hand, can calculate earlier admixture dates. It has its limitations though.
Speaking of Admixtools, the Harvard guys have another program they haven't released called qpgraph. That would have allowed me to calculate something like basal Eurasian.
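
To put an estimate in generations into years, one multiplies by an assumed generation time. The short Python sketch below shows how much that assumption matters. The figure of 80 generations is only illustrative (the post says "80 something", with a large standard error), and the generation times of 25 to 30 years are commonly assumed values, not ones taken from this thread.

```python
def generations_to_years(generations, years_per_generation=29.0):
    """Convert an admixture age in generations to years before the samples."""
    return generations * years_per_generation

generations = 80  # illustrative; the post only says "80 something"
for gen_time in (25.0, 29.0, 30.0):
    years = generations_to_years(generations, gen_time)
    print(f"{gen_time:.0f} years/generation -> about {years:.0f} years ago")
```

The three assumptions give roughly 2000, 2320 and 2400 years, so the choice of generation time alone moves the date by a few centuries before the standard error is even considered.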

Sein
02-14-2014, 03:58 AM
Great!

They compare Admixtools and ROLLOFF to their software in the supplements. The main difference is that their software relies on haplotype information, rather than single SNPs as in ROLLOFF. I think this gives their method an edge.

This qpgraph sounds very interesting.

Everest, before you proceed with studying the software above, could you try FastIBD in conjunction with Mclust, whenever you find the time? I think it's a much easier (and faster) form of analysis.

Dr_McNinja
02-14-2014, 04:11 PM
D-statistics results for HRP0370:


Pop1 Pop3 Outgroup Dstat Z
S-Indian Baloch Pygmy -0.05628 -24.95
S-Indian Caucasian Pygmy -0.05099 -23.83
S-Indian NE-Euro Pygmy -0.04962 -21.54
S-Indian SE-Asian Pygmy 0.03487 12.83
S-Indian Siberian Pygmy 0.01945 7.05
S-Indian NE-Asian Pygmy 0.0331 12.24
S-Indian Papuan Pygmy 0.04221 13.63
S-Indian American Pygmy 0.00736 2.51
S-Indian Beringian Pygmy 0.01562 5.6
S-Indian Mediterranean Pygmy -0.05257 -23.31
S-Indian SW-Asian Pygmy -0.04524 -21.39
S-Indian San Pygmy -0.00082 -0.56
S-Indian E-African Pygmy -0.00637 -4.4
S-Indian W-African Pygmy -0.00156 -1.16
Baloch S-Indian Pygmy -0.01269 -5.47
Baloch Caucasian Pygmy -0.00231 -1.13
Baloch NE-Euro Pygmy -0.00069 -0.32
Baloch SE-Asian Pygmy -0.003 -1.14
Baloch Siberian Pygmy -0.00453 -1.7
Baloch NE-Asian Pygmy -0.00551 -2.06
Baloch Papuan Pygmy 0.00086 0.29
Baloch American Pygmy 0.00058 0.21
Baloch Beringian Pygmy -0.00117 -0.44
Baloch Mediterranean Pygmy -0.00398 -1.88
Baloch SW-Asian Pygmy 0.00077 0.39
Baloch San Pygmy -0.00086 -0.6
Baloch E-African Pygmy -0.00186 -1.3
Baloch W-African Pygmy -0.00037 -0.29
Caucasian S-Indian Pygmy -0.01535 -6.7
Caucasian Baloch Pygmy -0.01039 -4.87
Caucasian NE-Euro Pygmy 0.01252 5.64
Caucasian SE-Asian Pygmy -0.01684 -6.57
Caucasian Siberian Pygmy -0.01908 -7.3
Caucasian NE-Asian Pygmy -0.01865 -7.16
Caucasian Papuan Pygmy -0.01153 -3.96
Caucasian American Pygmy -0.02043 -7.36
Caucasian Beringian Pygmy -0.02055 -7.7
Caucasian Mediterranean Pygmy 0.0321 15.65
Caucasian SW-Asian Pygmy 0.02183 11.19
Caucasian San Pygmy 0.00252 1.74
Caucasian E-African Pygmy 0.00675 4.83
Caucasian W-African Pygmy -3e-04 -0.24
NE-Euro S-Indian Pygmy -0.00522 -2.26
NE-Euro Baloch Pygmy 6e-05 0.03
NE-Euro Caucasian Pygmy 0.02149 10.21
NE-Euro SE-Asian Pygmy 0.00087 0.34
NE-Euro Siberian Pygmy 0.00566 2.09
NE-Euro NE-Asian Pygmy -4e-04 -0.15
NE-Euro Papuan Pygmy 0.0037 1.27
NE-Euro American Pygmy 0.01376 4.89
NE-Euro Beringian Pygmy 0.00945 3.5
NE-Euro Mediterranean Pygmy 0.03231 14.86
NE-Euro SW-Asian Pygmy 0.01709 8.15
NE-Euro San Pygmy 0.00134 0.92
NE-Euro E-African Pygmy 0.00155 1.07
NE-Euro W-African Pygmy -0.00106 -0.82
SE-Asian S-Indian Pygmy 0.0218 8.17
SE-Asian Baloch Pygmy -0.05893 -23.42
SE-Asian Caucasian Pygmy -0.06435 -26.89
SE-Asian NE-Euro Pygmy -0.0557 -22.05
SE-Asian Siberian Pygmy 0.09479 32
SE-Asian NE-Asian Pygmy 0.13458 46.74
SE-Asian Papuan Pygmy 0.05813 16.67
SE-Asian American Pygmy 0.0656 19.9
SE-Asian Beringian Pygmy 0.08521 27.93
SE-Asian Mediterranean Pygmy -0.0602 -23.23
SE-Asian SW-Asian Pygmy -0.05472 -22.74
SE-Asian San Pygmy -0.00183 -1.19
SE-Asian E-African Pygmy -0.00772 -4.85
SE-Asian W-African Pygmy -0.00157 -1.1
Siberian S-Indian Pygmy 0.01896 7.18
Siberian Baloch Pygmy -0.0487 -19.87
Siberian Caucasian Pygmy -0.0549 -23.32
Siberian NE-Euro Pygmy -0.0394 -15.7
Siberian SE-Asian Pygmy 0.10805 36.99
Siberian NE-Asian Pygmy 0.1254 43.61
Siberian Papuan Pygmy 0.05249 16.15
Siberian American Pygmy 0.09114 28.43
Siberian Beringian Pygmy 0.12364 41.29
Siberian Mediterranean Pygmy -0.04918 -19.02
Siberian SW-Asian Pygmy -0.04754 -19.8
Siberian San Pygmy -0.00139 -0.93
Siberian E-African Pygmy -0.00729 -4.67
Siberian W-African Pygmy -0.00209 -1.5
NE-Asian S-Indian Pygmy 0.02408 9.24
NE-Asian Baloch Pygmy -0.05756 -23.08
NE-Asian Caucasian Pygmy -0.06235 -26.3
NE-Asian NE-Euro Pygmy -0.05314 -21.09
NE-Asian SE-Asian Pygmy 0.13895 49.42
NE-Asian Siberian Pygmy 0.11627 41.07
NE-Asian Papuan Pygmy 0.06134 18.04
NE-Asian American Pygmy 0.08274 25.72
NE-Asian Beringian Pygmy 0.10769 36.23
NE-Asian Mediterranean Pygmy -0.05859 -22.68
NE-Asian SW-Asian Pygmy -0.05284 -21.9
NE-Asian San Pygmy -0.00195 -1.29
NE-Asian E-African Pygmy -0.00725 -4.57
NE-Asian W-African Pygmy -0.00172 -1.21
Papuan S-Indian Pygmy 0.00107 0.34
Papuan Baloch Pygmy -0.08138 -26.52
Papuan Caucasian Pygmy -0.0852 -28.25
Papuan NE-Euro Pygmy -0.07906 -25.76
Papuan SE-Asian Pygmy 0.02966 8.06
Papuan Siberian Pygmy 0.01169 3.26
Papuan NE-Asian Pygmy 0.02877 7.78
Papuan American Pygmy -0.00415 -1.09
Papuan Beringian Pygmy 0.00763 2.09
Papuan Mediterranean Pygmy -0.0796 -26.16
Papuan SW-Asian Pygmy -0.0727 -24.57
Papuan San Pygmy -0.00167 -1
Papuan E-African Pygmy -0.01816 -9.91
Papuan W-African Pygmy -0.00842 -5.32
American S-Indian Pygmy 0.0154 5.77
American Baloch Pygmy -0.03554 -13.76
American Caucasian Pygmy -0.04809 -19.12
American NE-Euro Pygmy -0.02332 -9.2
American SE-Asian Pygmy 0.08749 28.05
American Siberian Pygmy 0.10019 32.54
American NE-Asian Pygmy 0.10067 32.11
American Papuan Pygmy 0.04473 13.19
American Beringian Pygmy 0.13649 44.22
American Mediterranean Pygmy -0.04496 -17.31
American SW-Asian Pygmy -0.04295 -17.26
American San Pygmy -0.00317 -1.97
American E-African Pygmy -0.00855 -5.4
American W-African Pygmy -0.00303 -2.15
Beringian S-Indian Pygmy 0.01958 7.51
Beringian Baloch Pygmy -0.04117 -16.71
Beringian Caucasian Pygmy -0.0521 -21.62
Beringian NE-Euro Pygmy -0.03145 -12.54
Beringian SE-Asian Pygmy 0.10306 35.24
Beringian Siberian Pygmy 0.12844 43.65
Beringian NE-Asian Pygmy 0.12153 41.79
Beringian Papuan Pygmy 0.05278 15.42
Beringian American Pygmy 0.13202 40.86
Beringian Mediterranean Pygmy -0.04848 -19.11
Beringian SW-Asian Pygmy -0.04527 -18.8
Beringian San Pygmy -0.00213 -1.32
Beringian E-African Pygmy -0.007 -4.34
Beringian W-African Pygmy -0.002 -1.39
Mediterranean S-Indian Pygmy -0.02198 -9.09
Mediterranean Baloch Pygmy -0.01692 -7.45
Mediterranean Caucasian Pygmy 0.02703 12.78
Mediterranean NE-Euro Pygmy 0.01827 7.76
Mediterranean SE-Asian Pygmy -0.01754 -6.42
Mediterranean Siberian Pygmy -0.01817 -6.42
Mediterranean NE-Asian Pygmy -0.01975 -7.01
Mediterranean Papuan Pygmy -0.01072 -3.66
Mediterranean American Pygmy -0.02215 -7.62
Mediterranean Beringian Pygmy -0.02179 -7.61
Mediterranean SW-Asian Pygmy 0.03007 13.86
Mediterranean San Pygmy 0.00256 1.71
Mediterranean E-African Pygmy 0.00909 6.3
Mediterranean W-African Pygmy 0.00195 1.5
SW-Asian S-Indian Pygmy -0.04011 -16.74
SW-Asian Baloch Pygmy -0.03815 -16.82
SW-Asian Caucasian Pygmy -0.00964 -4.52
SW-Asian NE-Euro Pygmy -0.02291 -9.66
SW-Asian SE-Asian Pygmy -0.03738 -13.91
SW-Asian Siberian Pygmy -0.04199 -15.21
SW-Asian NE-Asian Pygmy -0.0393 -14.35
SW-Asian Papuan Pygmy -0.02905 -9.77
SW-Asian American Pygmy -0.04551 -15.92
SW-Asian Beringian Pygmy -0.04394 -15.61
SW-Asian Mediterranean Pygmy 0.00345 1.52
SW-Asian San Pygmy 0.0023 1.56
SW-Asian E-African Pygmy 0.0007 0.48
SW-Asian W-African Pygmy 0.00307 2.37
San S-Indian Pygmy -0.33458 -118.11
San Baloch Pygmy -0.36598 -138
San Caucasian Pygmy -0.35744 -135.87
San NE-Euro Pygmy -0.36367 -131.8
San SE-Asian Pygmy -0.32472 -109.91
San Siberian Pygmy -0.33353 -112.25
San NE-Asian Pygmy -0.32766 -109.82
San Papuan Pygmy -0.30506 -93.17
San American Pygmy -0.34039 -106.31
San Beringian Pygmy -0.33675 -110.51
San Mediterranean Pygmy -0.35312 -132.86
San SW-Asian Pygmy -0.3358 -127.54
San E-African Pygmy -0.14574 -77.51
San W-African Pygmy -0.09015 -57.1
E-African S-Indian Pygmy -0.227 -86.52
E-African Baloch Pygmy -0.25895 -103.01
E-African Caucasian Pygmy -0.24532 -99.16
E-African NE-Euro Pygmy -0.2558 -100.46
E-African SE-Asian Pygmy -0.21671 -78.4
E-African Siberian Pygmy -0.22657 -80.19
E-African NE-Asian Pygmy -0.21951 -77.93
E-African Papuan Pygmy -0.20307 -65.41
E-African American Pygmy -0.23396 -79.18
E-African Beringian Pygmy -0.22957 -78.34
E-African Mediterranean Pygmy -0.23888 -95.39
E-African SW-Asian Pygmy -0.22525 -90.27
E-African San Pygmy -0.00916 -6
E-African W-African Pygmy -0.00041 -0.31
W-African S-Indian Pygmy -0.2729 -103.41
W-African Baloch Pygmy -0.30562 -120.82
W-African Caucasian Pygmy -0.29889 -120.17
W-African NE-Euro Pygmy -0.3054 -117.66
W-African SE-Asian Pygmy -0.26193 -94.7
W-African Siberian Pygmy -0.27204 -96.86
W-African NE-Asian Pygmy -0.26508 -94.4
W-African Papuan Pygmy -0.24627 -79.47
W-African American Pygmy -0.27873 -94.15
W-African Beringian Pygmy -0.27498 -95.28
W-African Mediterranean Pygmy -0.29278 -115.27
W-African SW-Asian Pygmy -0.27305 -108.03
W-African San Pygmy -0.01232 -7.59
W-African E-African Pygmy -0.0627 -37.27
Interesting results. Using Baloch as their population, none of the other admixtures (i.e., Siberian) is significant. I don't think these Arctic components represent actual Arctic ancestry, even in Pashtuns. I think it's all bleed-off from the South Indian component (perhaps by way of ANE/ANI).
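
For anyone skimming a long table like the one above, here is a minimal Python sketch that parses the rows and keeps only those whose Z-score passes a cutoff. It assumes each row is "pop1 pop2 pop3 statistic Z" separated by whitespace, and it uses |Z| >= 3 as the cutoff, which is only a common rule of thumb and not something stated in this thread. The function names are made up for illustration.

```python
def parse_rows(text):
    """Parse whitespace-separated rows of the form: pop1 pop2 pop3 stat Z."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank lines or anything that is not a data row
        a, b, c, stat, z = parts
        rows.append((a, b, c, float(stat), float(z)))
    return rows


def significant(rows, z_threshold=3.0):
    """Keep rows whose absolute Z-score reaches the (assumed) threshold."""
    return [row for row in rows if abs(row[4]) >= z_threshold]


if __name__ == "__main__":
    sample = """Papuan S-Indian Pygmy 0.00107 0.34
Papuan Baloch Pygmy -0.08138 -26.52
SW-Asian E-African Pygmy 7e-04 0.48"""
    for row in significant(parse_rows(sample)):
        print(row)
```

float() reads scientific notation such as 7e-04 directly, so mixed number formats in the table do not need cleaning up first.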

Dr_McNinja
02-16-2014, 03:27 AM
Hey guys, I've been fooling around with admixture a little bit. Anyone know what these IDs represent?

http://www.evolutsioon.ut.ee/MAIT/public_data/india/india_paper_data_dbSNP-b131_pos-b37_1KG_strand.fam

From this study:

http://www.evolutsioon.ut.ee/MAIT/public_data/india/

http://www.evolutsioon.ut.ee/MAIT/public_data/india.jpg

Humanist
02-16-2014, 03:37 AM
Hey guys, I've been fooling around with admixture a little bit. Anyone know what these IDs represent?

http://www.evolutsioon.ut.ee/MAIT/public_data/india/india_paper_data_dbSNP-b131_pos-b37_1KG_strand.fam

Check this out (http://evolbio.ut.ee/india/SOM%20Table%201_update.xlsx).

Dr_McNinja
02-16-2014, 03:49 AM
Check this out (http://evolbio.ut.ee/india/SOM%20Table%201_update.xlsx).
I was looking at that but still can't make sense of which IDs correspond to which groups :/

Humanist
02-16-2014, 04:00 AM
I was looking at that but still can't make sense of which IDs correspond to which groups :/

If we take the first four samples from your link:


CHEND85 CHEND85 0 0 1 -9
GONC8 GONC8 0 0 1 -9
TN18 TN18 0 0 2 -9
KUR1 KUR1 0 0 1 -9


CHEND85 Chenchus India South Asia
GONC8 Gond India South Asia
TN18 Tamil Nadu Scheduled Caste India South Asia
KUR1 Kurumba India South Asia
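
The lookup above can be automated. Below is a rough Python sketch, assuming the spreadsheet has been exported so that each row gives the sample ID first and the population name second (that layout is an assumption, as is the function name). It reads the individual ID from the second column of each .fam line, which is the standard PLINK .fam layout, and marks anything not found in the table as UNKNOWN.

```python
def label_fam_ids(fam_lines, table_rows):
    """Pair each individual ID in a PLINK .fam file with a population label.

    fam_lines:  iterable of .fam lines (FID IID father mother sex phenotype)
    table_rows: iterable of (sample_id, population) pairs from the sample table
    """
    pop_by_id = {sample_id: population for sample_id, population in table_rows}
    labels = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        iid = fields[1]  # second column of a .fam file is the individual ID
        labels.append((iid, pop_by_id.get(iid, "UNKNOWN")))
    return labels


if __name__ == "__main__":
    fam = [
        "CHEND85 CHEND85 0 0 1 -9",
        "GONC8 GONC8 0 0 1 -9",
        "TN18 TN18 0 0 2 -9",
        "KUR1 KUR1 0 0 1 -9",
    ]
    table = [
        ("CHEND85", "Chenchus"),
        ("GONC8", "Gond"),
        ("TN18", "Tamil Nadu Scheduled Caste"),
        ("KUR1", "Kurumba"),
    ]
    for iid, pop in label_fam_ids(fam, table):
        print(iid, pop)
```

Keeping the table as (ID, population) pairs avoids splitting multi-word names like "Tamil Nadu Scheduled Caste" on whitespace.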

Dr_McNinja
02-16-2014, 05:37 AM
I think I'm getting the hang of it. Where can I find other datasets with ~500k SNPs? Are there any European and East Asian ones? I got ones for that Afghan study, an Indian study, and Caucasus.

There seems to barely be any difference with LD based pruning with just these few groups.

I think ideally someone could collect a bunch of the FTDNA/23andMe data files and make two runs at ~700k and ~900k SNPs to see if any different patterns emerge. Otherwise any use of the popular datasets will just give the same results.

Speaking of which, I can't figure out how to use zombies like on Dienekes' blog. He just said he uses them but not how. I figure he's taking the output from the ".P" file for the zombies? But how? Does he just convert it into .bed format somehow?

Dr_McNinja
02-16-2014, 10:45 AM
Experimented with LD pruning and noticed a pattern. I did 4 runs at k=4 (which resulted in Caucasian, East European, South Asian, and East Asian type components): one at ~500k SNPs, then pruned to 200-250k, then pruned to ~100-120k, then pruned to ~80k. Going from the lower SNP count to the higher, South Indian goes down and the West Eurasian-based ones (Baloch, Caucasian, whatever it is) go up. My South Indian decreases at almost twice the rate of HRP0341 and HRP0370 (I dropped 4.33% by ~500k SNPs while HRP0341 dropped 2.7% and HRP0370 dropped 1.58%).

Mine: 1.65% when going from ~80k to ~100-120k, then 0.8% going from ~100-120k to ~200-250k, then 1.9% going from ~200-250k to ~500k.

HRP0341: No drop from ~80k to ~100-120k, 1.55% when going from ~100-120k to ~200-250k, then 1.23% to ~500k. HRP0341 did jump almost 4% in the West Asian-like category at the expense of South Asian and European, whereas HRP0370 and I moved up a little there, a little in East Asian, and a little in Euro (the overall South Indian+East Asian still dropped by almost the same amounts as the SNPs increased, mine still more than the others).

Who knows how long the pattern would hold up or where it would wind up, but I figure it would likely persist to at least 1.5 million SNPs. Our kits are ~900+k SNPs but the data files here are from the ~500k chip.

So I'm still learning, but I don't see the point of LD-based pruning except as a trick to reduce processing time and still wind up in the right ballpark. Someone correct me if I'm wrong.

Not really sure how to do the zombies though. I understand the general concept of what Dienekes did, but I have no idea how to go about doing that. If anyone knows or can find out, I'd be glad to give that a try.

Dr_McNinja
02-16-2014, 01:21 PM
Everest, your inbox is full :)

This is what I was going to PM:

I'm having trouble converting the HGDP dataset to PLINK format. The script Zack posted on Harappa doesn't work for me and I'm not sure why (it's a Unix script, those don't seem to be working for me, maybe because of different versions... I'm on Ubuntu 13.04). Do you happen to know of a Perl script which will do the conversion? Then I run liftover.py to convert the PLINK .bed file from B36 to B37 format?

Wish someone would just upload these datasets in PLINK format already.

everest59
02-16-2014, 02:46 PM
McNinja, I actually used the Zack script. I think you may have forgotten to download the SampleInformation.txt file, which will cause problems. See below:
http://www.stanford.edu/group/rosenberglab/data/rosenberg2006ahg/SampleInformation.txt

Then it should work.

Dr_McNinja
02-16-2014, 04:54 PM
Btw, I'm having trouble with the liftOver script. I've been trying to follow instructions from here:

http://genome.sph.umich.edu/wiki/LiftOver

So, since I got .ped and .map files, I tried running liftMap.py but got this error:


./liftMap.py -m hapmapCEU.map -p hapmapCEU.ped -o hapmapB37HG19
SUCC: map->bed succ
sh: 1: /home/mcninja/Desktop/plink: Permission denied
Traceback (most recent call last):
  File "./liftMap.py", line 126, in <module>
    makesure(liftBed(oldBed, newBed, unlifted),
  File "./liftMap.py", line 48, in liftBed
    for ln in myopen(params['UNLIFTED']):
  File "./liftMap.py", line 19, in myopen
    return open(fn)
IOError: [Errno 2] No such file or directory: 'hapmapB37HG19.bed.unlifted'
It makes a new .bed file, but then gets this error. I changed the path in the script as it shows, but I don't know why it stops here.
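
The "Permission denied" line in the log comes from the shell trying to run whatever sits at /home/mcninja/Desktop/plink. One possible cause (a guess, not confirmed anywhere in this thread) is that the path points at a folder rather than the binary, or that the file lacks the execute bit. A minimal Python sketch for checking that, with made-up helper names:

```python
import os
import stat


def diagnose_executable(path):
    """Return a short description of why `path` may not be runnable."""
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        return "directory"  # e.g. the path names the plink folder, not the binary
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


def make_executable(path):
    """Add the execute bit for the file's owner (same effect as chmod u+x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)


if __name__ == "__main__":
    # Path copied from the error message; adjust to wherever plink really lives.
    print(diagnose_executable("/home/mcninja/Desktop/plink"))
```

If the result is "directory", the path in the script would need to include the binary's file name; if it is "not executable", make_executable (or chmod) would be the thing to try before reaching for root.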

everest59
02-16-2014, 05:07 PM
That's weird. Maybe you should go root and see if that solves the issue. I think the command is sudo -s.

Also, did you download the hg 18 to 19 chain file?

Dr_McNinja
02-16-2014, 05:24 PM
McNinja, I actually used the Zack script. I think you may have forgotten to download the SampleInformation.txt file, which will cause problems. See below:
http://www.stanford.edu/group/rosenberglab/data/rosenberg2006ahg/SampleInformation.txt

Then it should work.

It almost works, I got hgdp.tfam and hgdp.tped but when I try to convert to .bed I get this error:


ERROR:
Problem with line 1 in [ hgdp.tped ]
Expecting 4 + 2 * 1042 = 2088 columns, but found more
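
The message says each .tped row should have 4 leading columns plus 2 allele columns per individual, so 4 + 2 * 1042 = 2088 for the 1042 individuals listed in the .tfam. A rough Python sketch for finding which rows break that rule is below; the function name is invented for illustration, and it assumes whitespace-separated fields.

```python
def check_tped_columns(tped_lines, n_samples):
    """Report .tped rows whose column count differs from 4 + 2 * n_samples.

    Returns (expected_columns, [(line_number, columns_found), ...]).
    """
    expected = 4 + 2 * n_samples  # chrom, SNP id, cM, bp, then two alleles each
    bad = []
    for lineno, line in enumerate(tped_lines, start=1):
        found = len(line.split())
        if found != expected:
            bad.append((lineno, found))
    return expected, bad


if __name__ == "__main__":
    # Tiny made-up example with 2 individuals; the second row has extra alleles.
    demo = [
        "1 rs1 0 100 A A G G",
        "1 rs2 0 200 A A G G T T",
    ]
    print(check_tped_columns(demo, n_samples=2))
```

On the real files, n_samples would be the number of non-empty lines in hgdp.tfam and tped_lines an open handle on hgdp.tped. If every row has the same surplus, the .tfam is probably missing individuals rather than the .tped having stray columns.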

Dr_McNinja
02-16-2014, 05:43 PM
That's weird. Maybe you should go root and see if that solves the issue. I think the command is sudo -s.

Also, did you download the hg 18 to 19 chain file?

Yeah the chain file is in there. I tried using sudo with it and it still didn't work =\. Were you able to run that script? Are there any other versions around?

everest59
02-16-2014, 05:50 PM
It almost works, I got hgdp.tfam and hgdp.tped but when I try to convert to .bed I get this error:

Do you want me to upload the whole B37 file that I created online? It's in a nice format, with all populations nicely annotated.

I need some help from you. Since you have a fast computer, can you do the following:
1. Download a program called samtools:
http://samtools.sourceforge.net/

2. Download La Brana files here:
http://www.ncbi.nlm.nih.gov/sra/SRX388871

Then use SRA toolkit to convert the file into BAM format :
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std#s-6

Use the following command to convert to BAM format :
sam-dump SRR390728 | samtools view -Sb -o my_bam.bam -

Then if you could upload that file into google docs, that would be great.
If you would like, you can try to convert the BAM file into a usable PLINK file yourself by calling variants on it.

VCF files for La Brana will be available soon, but in the meantime, we could all use that file.
My stupidass computer keeps freezing. I've been trying this since yesterday. Comp's been freezing the whole day! Now I give up on my comp.

Dr_McNinja
02-16-2014, 07:45 PM
Do you want me to upload the whole B37 file that I created online? It's in a nice format, with all populations nicely annotated.

I need some help from you. Since you have a fast computer, can you do the following:
1. Download a program called samtools:
http://samtools.sourceforge.net/

2. Download La Brana files here:
http://www.ncbi.nlm.nih.gov/sra/SRX388871

Then use SRA toolkit to convert the file into BAM format :
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std#s-6

Use the following command to convert to BAM format :
sam-dump SRR390728 | samtools view -Sb -o my_bam.bam -

Then if you could upload that file into google docs, that would be great.
If you would like, you can try to convert the BAM file into a usable PLINK file yourself by calling variants on it.

VCF files for La Brana will be available soon, but in the meantime, we could all use that file.
My stupidass computer keeps freezing. I've been trying this since yesterday. Comp's been freezing the whole day! Now I give up on my comp.

Alright, I'll try that first thing tomorrow (bedtime here). I'll let you know if I have any problems.

EDIT: And yes please to the B37 upload! Is it the whole HapMap results? Or did you mean HGDP? If HapMap, I just wanted CEU and GIH.

everest59
02-21-2014, 01:18 AM
Okay, I just created the La Brana file. This PCA shows La Brana plotting with Europeans.
https://drive.google.com/file/d/0B3vEDdpZDjUpR1J4ODItdGxlLUU/edit?usp=sharing

This PCA, which is European-only, shows La Brana mainly plotting with the French.

https://drive.google.com/file/d/0B3vEDdpZDjUpc0pjUUtCb19FY0U/edit?usp=sharing

In a global plot, I had La Brana plotting close to the Burusho, so I decided not to include it.

parasar
03-17-2014, 07:13 PM
There was an E* Bhil sample I believe. E is essentially the same as F and C as far as shared SNPs are concerned.
...
Dungri Bhill
http://www.academia.edu/221582/Yap_Insertion_signature_in_south_Asia