View Full Version : How do you deal with eigenstrat files



beyondAtheism
05-14-2020, 05:37 PM
Hi,

I am new to this stuff and have managed to get Admixtools running and can do the examples.

However, I am not sure how I am supposed to deal with the data in the source files, because they are quite large and I am running on a cloud VM.

1. How do you get the names of the populations you are interested in from the files? Do you just open the files and look at the values in the column? The files are quite large.
2. Does Admixtools run on individuals or on populations (default)? How do you tell it to run on individuals, if that is possible?
3. How do you run calculations across different datasets? Do you just merge the data into one big dataset? I know admixr has a function for this.
4. Does the size of the dataset affect performance? If so, do you try to keep it as small as possible?

TuaMan
05-14-2020, 07:42 PM
What VM and distribution are you using?

beyondAtheism
05-14-2020, 08:32 PM
Ubuntu 18.04 on an Azure VM, Standard D2s v3 (2 vCPUs, 8 GiB memory), which I think is one of their standard/low-end models.

I'm kind of getting the hang of it. I'm just making a note of all the populations I am interested in by grepping the .ind file, so I'll then have the names to use in R. I currently have an issue getting modern populations (North Indian). I am using the combined dataset with the HO array from Reich's page, but I am not sure which populations are in that HO array.
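
For anyone who wants the population names without opening the file by hand, here is a minimal Python sketch. It assumes the usual EIGENSTRAT .ind layout of three whitespace-separated columns (sample ID, sex, population label); the function names and the file path in the usage note are made up for illustration. It reads the file line by line, so memory use stays small even on a low-end VM.

```python
from collections import Counter


def count_populations(ind_path):
    """Count samples per population label in an EIGENSTRAT .ind file.

    Assumes three whitespace-separated columns: sample ID, sex, population.
    """
    counts = Counter()
    with open(ind_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            counts[fields[2]] += 1
    return counts


def find_populations(counts, pattern):
    """Return the population labels that contain `pattern`, ignoring case."""
    needle = pattern.lower()
    return sorted(label for label in counts if needle in label.lower())
```

Something like `find_populations(count_populations("dataset.ind"), "india")` would then list every label containing "india", where `dataset.ind` is a placeholder for your own file. The labels in the real dataset may not match the names you expect, so it is worth printing the full set of keys once.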

Kale
05-15-2020, 05:25 AM
1. The .ind file is basically your list of samples. Change the population names in the last column to suit your needs.
2. Populations by default. The only way I know to run on individuals is to change the population label of each individual in the .ind file so that every individual has its own label.
3. Yep, I use plink, never tried admixr.
4. Yes absolutely. If you have a large master-dataset, and plan on doing extensive work with a small subset of populations, I'd highly recommend making a separate dataset with just those samples of interest. Making a separate dataset is quite quick.
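
The relabelling trick in point 2 can be sketched in Python as below. This is only an illustration under the same assumption about the .ind layout (sample ID, sex, population label); `relabel_as_individuals` is a made-up helper, not part of Admixtools. It writes a new .ind file rather than editing the original, so the master copy stays untouched.

```python
def relabel_as_individuals(ind_in, ind_out, populations):
    """Copy an .ind file, giving each sample in `populations` its own label.

    For matching rows the population column is replaced by the sample ID,
    so tools that group by population treat each sample separately.
    Returns the number of rows that were relabelled.
    """
    targets = set(populations)
    changed = 0
    with open(ind_in) as src, open(ind_out, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) >= 3 and fields[2] in targets:
                fields[2] = fields[0]  # population label becomes the sample ID
                dst.write(" ".join(fields) + "\n")
                changed += 1
            else:
                dst.write(line)  # leave every other row exactly as it was
    return changed
```

The row order is preserved, which matters because the .ind rows have to stay aligned with the genotype file. You would then point the `indivname` entry of your parameter file at the new .ind file.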

misnomer
05-22-2020, 04:03 PM
What he said. Also, only plink should be used for merging or for extracting a smaller dataset. If you use plink you will also need convertf, to convert between the EIGENSTRAT and plink formats.
If you have Reich's latest HO dataset, you really do not need to merge anything right now, as most of the papers' samples are already there, except maybe the 2-3 newest papers.

HO has more modern samples than the 1240k dataset, but I would still recommend using the 1240k dataset rather than the HO dataset (~530k SNPs) if possible.
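
For the convertf step, a parameter file along these lines is the usual shape. The small Python helper below just builds that text. The helper name and the file prefixes are placeholders, and it assumes the input files are named with .geno/.snp/.ind extensions, so adjust to whatever your dataset uses.

```python
def convertf_par_to_plink(in_prefix, out_prefix):
    """Build convertf parameter-file text for converting to plink bed/bim/fam.

    Assumes the input files are <in_prefix>.geno, <in_prefix>.snp and
    <in_prefix>.ind; the outputs are <out_prefix>.bed, .bim and .fam.
    """
    lines = [
        f"genotypename: {in_prefix}.geno",
        f"snpname: {in_prefix}.snp",
        f"indivname: {in_prefix}.ind",
        "outputformat: PACKEDPED",
        f"genotypeoutname: {out_prefix}.bed",
        f"snpoutname: {out_prefix}.bim",
        f"indivoutname: {out_prefix}.fam",
    ]
    return "\n".join(lines) + "\n"
```

Write the returned text to a file, for example `par.to_plink` (a made-up name), and run `convertf -p par.to_plink`. Going back from plink to EIGENSTRAT uses the same keys with the input and output roles swapped and a different `outputformat`.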