
View Full Version : Few questions on getting started with analysis



Kale
03-14-2018, 07:52 AM
Hi there, I'm looking to get started doing things like admixture, d-stats, etc. and I have a few questions.
1) Where to get datasets on modern human populations?
- I found one, George Busby's something-or-other, 163 world populations, but that leads me to question 2.
2) How do I view them in a text-editor friendly format?
- The .bim file in that dataset looks a lot like 23andMe's raw data (it's the first I've seen, so that's my frame of reference), and presumably lists the SNPs used in the dataset.
- The .bed file (which contains the actual calls for the populations) is, according to online searches, supposed to be text-editor friendly, but it's not; opening it in Notepad++ just gives gibberish.
3) How can I prune multiple datasets to get only the SNPs shared between them?
- I can automate this in OpenOffice Calc pretty easily, but Calc is sllooooow at mass comparisons; there has to be a faster way...

Thanks in advance for any help :)

anglesqueville
03-14-2018, 08:02 AM
PLINK
https://www.cog-genomics.org/plink2/data
http://zzz.bwh.harvard.edu/plink/dataman.shtml

Sangarius
03-14-2018, 10:40 AM
Regarding datasets, the most widely used one is probably the Human Origins dataset from the Reich Lab. (https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/NearEastPublic.tar.gz) It comes in EIGENSTRAT (geno) format, which can easily be converted to PLINK format with the convertf program.
Then there is the Estonian Biocentre (http://evolbio.ut.ee/), which hosts a lot of datasets in PLINK format. The largest of them is their own Estonian Genome Diversity Panel (EGDP) dataset (http://evolbio.ut.ee/CGgenomes_VCF/).
There is also the Simons Genome Diversity Project (SGDP) dataset. (http://reichdata.hms.harvard.edu/pub/datasets/sgdp/) This is whole-genome data, so quite large. Available in different formats.
And then there is of course the old 1000 Genomes dataset (http://www.internationalgenome.org/data#download).

There are probably many more sources that provide smaller datasets from papers, like datadryad (https://datadryad.org//), but these are the big ones that I know of.

To view large text files I would recommend glogg (https://glogg.bonnefon.org/). But .bed files are binary, so they aren't human-readable. Trying to read geno files in a text editor doesn't make much sense either.
Combining and editing datasets is best done in Plink, it's quite powerful.

anglesqueville
03-14-2018, 11:16 PM

Out of curiosity: over the last few months you have published many D-stats, including double-outgroup analyses, for ancient and modern data. I have some difficulty understanding how you could do that without being reasonably familiar with PLINK, admixtools, eigensoft, data management, etc. Am I missing something?

Kale
03-15-2018, 03:03 AM

Someone else made the D-stats and I'm moderately handy with excel ;) I don't need no nmonte, I can just eyeball it :P

anglesqueville
03-15-2018, 07:22 AM

OK. I'm not a professional, but I've worked a lot on D-stats, including on their mathematical background (mathematics is my job, even if statistics and probability theory are far from being "my" maths). I can't tell you how many hours I've spent on Patterson's fundamental text. I draw 3 conclusions from this work:
1) The best way to use D-stats is direct use.
2) The only sensible way to combine formal stats is qpAdm. The so-called "2 outgroups" method is nothing serious, to say the least.
3) There is a large empirical component to using D-stats, which often means you'll have to do many trials, looking inside the population clusters, down to individuals. Therefore you must work by yourself, no escape. Learn PLINK, learn EIGENSOFT, learn ADMIXTOOLS. I would like to be able to add "learn BEAGLE/FastIBD", but honestly my own trials with FastIBD so far have been just pathetic...

Kale
03-15-2018, 05:25 PM

1) Unquestionably.
2) Perhaps so, but if it can produce logical results that correspond with other formal methods (including those published in the literature), then there has to be some validity to it.
3) I shall.

Bas
03-15-2018, 06:20 PM
About FastIBD, how easy is it to run? Does it require LD pruning beforehand? I've tried IBD/IBS stuff in PLINK using the --genome flag, but I've heard some people say PLINK isn't great for IBD/IBS. My results with --genome were a bit weird, but that was probably because I didn't do any LD pruning first.

anglesqueville
03-15-2018, 06:57 PM

The only version of BEAGLE I know is the latest one (v4.1). That one is self-contained: everything (imputing, phasing, IBD detection) can be done without leaving the program. I've only tried it on my dad's genome (already imputed). Difficult, at least for me, and very slow on my old computer. I've never tried IBD detection in PLINK.

Kale
03-16-2018, 05:06 AM
I've just read a lot of the information on the PLINK website... I'm running into 2 obstacles. 1) I'm not a programmer. 2) It looks like it requires a lot of learning before one can do anything, and I'm only looking to do a few things (hell, I'd be content with pruning non-overlapping SNPs between genomes and D-stats, and nothing else). I figure it would be quicker (if it's possible to convert genomes to plain text) to just do the pruning in Excel rather than learn what looks to me like Yoda playing Sumerian charades.

EDIT: Actually Bas gave me a great guide on how to do D-stats, thanks again Bas! So I guess I really just need to know how to merge datasets and prune non-overlapping SNPs? Actually, I probably don't even need to prune non-overlapping SNPs; that's of secondary importance if anything. So how would I merge datasets?

anglesqueville
03-16-2018, 07:38 AM
Merging: 2 programs
1) "mergeit", inside the EIGENSOFT package. I have had only bad experiences with it.
2) the --bmerge flag in PLINK, on bed/bim/fam files

There is likely other software I'm not familiar with; anyway, everybody uses PLINK. If you want to manage genetic data, you have to use a management tool. Converting a file containing the info for several hundred thousand, or even several million, SNPs across several hundred individuals into a plain text file? If I write "haha!" again you will not appreciate it. And even if it were possible (pure nonsense), which program would you use to work with it? R itself does not tolerate a really big matrix, to say nothing of Excel! No way, you have to learn. Linux --> PLINK --> learning, no escape.

Helgenes50
03-16-2018, 07:53 AM
Being myself a newbie with all these programs, qpAdm, D-stats...
As I already said on the French sub-forum, Anglesqueville is a very good teacher, who knows very well what he is talking about, and above all has huge patience.
Thanks to him for his work.

anglesqueville
03-16-2018, 08:03 AM

YOU are a very brave and clever learner, with a huge tolerance for my guilty irritations :biggrin1:
edit: well, the difficult part is coming (imputed vcf files). Hopefully next week.

therrien.joel
03-16-2018, 09:01 PM
I'll interject a stupid question here... something I'm quite skilled at :biggrin1:...

I have tried to find a decent resource explaining just what D-stats are, but to no avail. Is there a resource that would make for a good introduction?

anglesqueville
03-16-2018, 10:41 PM

The reference is Patterson et al. 2012, "Ancient Admixture in Human History": http://www.genetics.org/content/genetics/early/2012/09/06/genetics.112.145037.full.pdf

Bas
03-17-2018, 02:14 AM

No problem Kale :)

Yes, as Anglesqueville said, use PLINK to merge. It really is quite easy. First you'll need the datasets you want to merge. The Reich Lab has some of the most significant sets from the last few years. Those are in EIGENSTRAT format; you'll need to convert them to BED/BIM/FAM in PLINK, though. To get there:

Converting EIGENSTRAT to PED/MAP in Eigensoft

1) Once you have Linux set up and PLINK and Admixtools downloaded, make sure you also have the EIGENSOFT program: download convertf from the same place you got Admixtools (the releases tab). You'll get an 'EIG-master' folder when you download EIGENSOFT. Put the EIGENSTRAT (GENO/IND/SNP) dataset files you want to merge in the folder where you have the convertf executable. This is where we're going to convert them, first to PED/MAP files, then to BED/BIM/FAM.

2) You need to write a parfile first in order to convert. There should be an example one in the EIG-master folder when you download EIG. Just fill the fields in with the paths to the relevant files. The parfile will look like this:


genotypename: file.geno
snpname: file.snp
indivname: file.ind
outputformat: PED
genotypeoutname: newfile.ped
snpoutname: newfile.map
indivoutname: newfile.ped
familynames: no

3) Save this parfile in the same folder as your datasets and convertf program.

4) Now you are ready to convert. Open up a terminal and navigate to the EIG-master folder where all this is with the 'cd' command and your directory path. So for example: 'cd /user/home/eig-master/bin' . Then, once in this directory, type in: '/user/home/eig-master/bin/convertf -p parfilename' and hit enter.

5) After a few minutes, 2 new PED/MAP files will have been written (there are just two)

Converting PED/MAP to BED/BIM/FAM in PLINK

6) Now move these PED/MAP files to the PLINK folder. In terminal, change directory to PLINK and input this in the terminal:
'/user/home/plink/plink --file newfile --make-bed --allow-no-sex --out newplinkfile1'

And 3 new files will be written. A BED,BIM and FAM file. That's one dataset ready to be merged, do the same for the other dataset(s)

Merging two or more sets

7) So once you have your BED/BIM/FAM datasets, still in PLINK, input this:
'/user/home/plink/plink --bfile newplinkfile1 --bmerge newplinkfile2.bed newplinkfile2.bim newplinkfile2.fam --make-bed --allow-no-sex --out newplinkmerge'

You can just repeat, merging this merge with others until you have a ridiculously large dataset that you will spend hours labelling :P (more on that later)

Getting this new merge back to EIGENSTRAT: converting the merge to PED/MAP

8) In the plink directory, input:
'/user/home/plink/plink --bfile newplinkmerge --recode --allow-no-sex --out newpedmerge'

This merge will be much bigger than the BED/BIM/FAM files. If you merge more than two datasets, the resulting PED/MAP files can get up to 15GB or more in size.

The final step: converting to EIGENSTRAT

9) Now, move these new PED/MAP files back to the EIG folder where you have convertf, and change directory to there in the terminal. It's the same process as in step 2, but obviously in the parfile you're going to have to rewrite the relevant filenames and also set 'outputformat' to 'EIGENSTRAT'.

10) With a big dataset, depending on your computer this final step could take an hour, but at the end you'll have a new dataset to merge.

*At this point, you'll have to start labelling all the individuals again, but you can just refer to your original EIGENSTRAT datasets for this. About the pruning, I don't know how to do this. I really should learn it though. With PLINK, you can also extract individuals from a dataset:

Extracting individuals in PLINK:

1) From the FAM file, make a new text file listing the individuals you want to keep (the first two columns: family ID and individual ID). Save this in the plink folder as 'keeplist'. Let's say these samples are from 'bellbeaker.fam'.

2) In PLINK, open command terminal and enter:
'/home/user/plink/plink --bfile bellbeaker --keep keeplist --make-bed --allow-no-sex --out extractedsamples'

This will leave the original file intact, and is vital when you want a smaller dataset. Then just keep merging these extracted sets until you have only the samples you need.
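On the pruning question left open above, one common approach (a sketch with toy filenames, not a tested recipe) is to intersect the rsID columns of the two .bim files and feed the result to PLINK's --extract:

```shell
# Toy .bim files with hypothetical rsIDs so the sketch can be run as-is;
# in practice these would be your real datasets' .bim files.
printf '1 rs1 0 100 A G\n1 rs2 0 200 A G\n1 rs3 0 300 A G\n' > setA.bim
printf '1 rs2 0 200 A G\n1 rs3 0 300 A G\n1 rs4 0 400 A G\n' > setB.bim

# Column 2 of a .bim file is the SNP ID; sort each list and intersect them.
awk '{print $2}' setA.bim | sort > setA.snps
awk '{print $2}' setB.bim | sort > setB.snps
comm -12 setA.snps setB.snps > shared.snps

# Then restrict each dataset to the shared SNPs before merging:
# plink --bfile setA --extract shared.snps --make-bed --out setA_shared
# plink --bfile setB --extract shared.snps --make-bed --out setB_shared
```

With the toy files above, shared.snps ends up containing just rs2 and rs3; the commented plink lines assume plink is on your PATH.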

Anglesqueville taught me how to use Admixtools too, so a very big thanks to him! :beerchug:

Kale
03-17-2018, 04:29 AM
Bas, the thanks button only works once per post...I'm going to have to ask you to make your posts less useful so that I don't feel guilty for only giving you one thanks :P

anglesqueville
03-17-2018, 07:44 AM
Bas, thanks for your thanks :biggrin1: . That said, just one thing: it's not at all obligatory to go through the PED format. You can convert directly from EIGENSTRAT to PLINK, or conversely, in exactly the same way, with a parameter file:


genotypename: file.geno
snpname: file.snp
indivname: file.ind
outputformat: PACKEDPED
genotypeoutname: newfile.bed
snpoutname: newfile.bim
indivoutname: newfile.fam

Note that in EIGENSOFT, "PACKEDPED" means PLINK (bed/bim/fam) format.

Labelling does not take hours if your original file is in EIGENSTRAT with a good .ind file. When you merge your two PLINK files, use the flag --indiv-sort 0 ("0" is the number zero, not the letter O), and the individuals in the merged file will keep their original order; then you just have to concatenate the two original .ind files with your text editor (2 minutes).
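That trick boils down to something like this (the PLINK call is shown as a comment with hypothetical filenames; the .ind files here are toy ones so the concatenation step can be run as-is):

```shell
# Merge while preserving the original individual order:
# plink --bfile setA --bmerge setB.bed setB.bim setB.fam \
#       --indiv-sort 0 --make-bed --allow-no-sex --out merged

# Because the order is unchanged, the merged .ind is just the two
# original .ind files concatenated:
printf 'Ind1 M PopA\nInd2 F PopA\n' > setA.ind
printf 'Ind3 M PopB\n' > setB.ind
cat setA.ind setB.ind > merged.ind
```

The concatenated merged.ind then lines up row-for-row with the individuals in the merged PLINK files.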

anglesqueville
03-17-2018, 07:49 AM
In fact, the only people we all have to thank are the heroes who made these two sites:
https://www.cog-genomics.org/plink2/data
http://zzz.bwh.harvard.edu/plink/dataman.shtml
All you need is inside. Very useful too: http://gaworkshop.readthedocs.io/en/latest/index.html

Kale
03-17-2018, 11:25 PM
Nevermind: Figured it out :)

Ok I feel like I'm almost there... here is where I'm at. Trying to run d-stat.

I have the directory /home/Admixtools-master/bin
Inside this directory I have the files qpDstat (downloaded from github), parqpdstat, and list_qpdstat
I also have another directory in there, NearEastPublic, with files HumanOriginsPublic2068.ind, .snp, and .geno inside

parqpdstat looks like this...
DIR: /home/user/Admixtools-master/bin/NearEastPublic/
SSS: HumanOriginsPublic2068
indivname: DIR/HumanOriginsPublic2068.ind
snpname: DIR/HumanOriginsPublic2068.snp
genotypename: DIR/HumanOriginsPublic2068.geno
popfilename: list_qpDstat

list_qpdstat looks like this...
Mbuti Yoruba Han French
Mbuti Yoruba Mozabite French

What do I enter into the command line?
I changed directory...I'm now in bin.
Bas's recommendation, and the documentation indicate...
qpDstat -p parqpdstat [-l lo] [-h hi] >logfile
I receive...
"bash: qpDstat: command not found..."

Putting the directory in front...
/home/user/Admixtools-master/bin/qpDstat -p parqpdstat [-l lo] [-h hi] >logfile
...gives "Permission denied"
- Changed qpdstat's permissions and allowed it to be executable (I thought everything was 'executable' by default in Linux)
New error message;
fatalx:
can't open file list_qpdstat of type r
error info: no such file or directory

Ok, so I didn't know it was case-sensitive, problem solved. First d-stats ever :amen:
result: Mbuti Yoruba Han French 0.0028 2.700 26308 26160 616935
result: Mbuti Yoruba Mozabite French -0.0069 -10.537 25766 26125 616935

looks about right :)

I have to say, this Fedora system is not user-friendly at all... can I do all this stuff in, say, Ubuntu? So I can at least have a desktop and a text editor that works when I tell it to, not when it feels like it?
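For anyone who hits the same "Permission denied" wall described above, the fix is a one-liner (a placeholder file stands in for the real qpDstat binary here, so the sketch can be run anywhere):

```shell
# Stand-in for the downloaded binary; in reality this would be the
# qpDstat executable from the ADMIXTOOLS release.
touch qpDstat
# Downloaded files are often not marked executable; chmod +x fixes that.
chmod +x qpDstat
# Then run it with its path, e.g.:
# ./qpDstat -p parqpdstat > logfile
```

After the chmod, the shell will happily execute the file instead of refusing with "Permission denied".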

Kale
03-18-2018, 03:48 AM

I've run into something converting back to EIGENSTRAT after successfully? merging two datasets.
'snp order check fail; snp list not ordered: dataset.bim (processing continues)' ...which repeats 8 times

Is that going to be an issue going forward? Did they not merge right?

Also, holy crap, the individual datasets I merged were ~330MB and ~70MB, but merged they are now 2.9GB!?

Running the same two d-stats I did before, instead of 2-3 minutes it took over an hour! The values were replicated though. There were about 500,000 lines of 'nodata: rs(bunch of numbers)' in the logfile though.
Did I do something wrong?

anglesqueville
03-18-2018, 09:49 AM

Very problematic. And very difficult to debug remotely. Which datasets were you trying to merge?

mephisto
03-18-2018, 10:46 AM
The error you got is actually self-explanatory, especially since from the D-stat logfile you should be able to see where something went wrong.

Maybe you should consider first bringing the two files to be merged to a common denominator; in other words, prepare the files so that only the same rsIDs are present in both files to be merged. Then you can merge them, convert to EIGENSTRAT and run D-stats.

Bas
03-19-2018, 12:18 AM

Now that you're up and running, looking forward to some good d-stattage from you!

D-stats involving nearly 100 individuals usually take just over 20 minutes on mine, and yes, the logfile I get afterwards contains a lot of what look like errors, 'no data', etc. But as you said, the actual values look solid and the SNP count is usually high enough.

The problem you mentioned while converting to EIGENSTRAT: off the top of my head I've encountered it before, but not for a while, and I can't remember what happened with it. It looks like everything turned out fine for you in the end, though.

Kale
03-22-2018, 06:03 PM

Sorry for the delay, I've been very sick lately. I tried to merge HumanOriginsPublic2068 with AncientLazaridis2016; they both come in the NearEastPublic dataset from the Reich Lab. Just as a note: the original stats I ran with HumanOriginsPublic2068 used ~616k SNPs, and those same ~616k SNPs plus the 500-something thousand SNPs reported 'missing' after the merge roughly match the ~1.2M SNPs that PLINK said it recognized during the merge. I don't know if that means anything. I meant to say everything went fine during the merge itself; the 'data missing' errors came when running stats after the merge and the conversion back to EIGENSTRAT.

Chad Rohlfsen
03-23-2018, 06:11 AM
Data missing might mean you need to check the labels in your ind file. If the pops say 'Control', you need to rename them to the population name.

Kale
03-23-2018, 04:13 PM
I relabeled them all after the merge and conversion back to EIGENSTRAT, and skimmed over the sample IDs just to make sure they lined up; it all looked right.

Kale
03-25-2018, 06:02 PM
Well, I extracted some samples from the merged set in PLINK, then converted to EIGENSTRAT. The problem did not replicate... there were only about 200 lines of 'nodata: rs(bunch of numbers)' in the logfile as well.
The time is back to reasonable; I managed to run 11 stats in about 15 minutes. This time I have a new problem though...
The results have gone completely bonkers.
All the samples are labeled correctly in the .ind file, but they are in a different order. Does that matter?

...order does matter. Problem solved, up and running finally :)

Is there a dataset of modern samples that uses the 1240k setup most of the new aDNA papers are using? Or at least more than the ~616k SNPs that the HumanOrigins dataset does?

anglesqueville
03-25-2018, 09:51 PM
Is there a dataset of modern samples that uses the 1240k setup most of the new aDNA papers are using?
http://evolbio.ut.ee/CGgenomes_VCF/
Plenty of good work there if you enjoy it (ask Lukasz, he knows what I'm talking about...): 3,426,792 SNPs for... chromosome 1 alone, 3,725,095 for chromosome 2, etc. Just to give you an idea, I recently computed a PCA of the HLA zone of chromosome 6. Not enough SNPs for qpAdm (even in Pagani, the HLA is only 6 cM or so), but enough for a PCA. That said, mostly Eurasian populations.
If you like diploid and imputed ancient genomes, have a look at Martiniano (around 30 million SNPs if I recall correctly): https://datadryad.org//resource/doi:10.5061/dryad.g9f5r

I seem to recall that you can download PLINK files for Pagani, but only a VCF for Martiniano.

Kale
03-25-2018, 09:56 PM

Haha, very nice. I think I'll be more than happy with just a single million SNPs though :P. Also, is there a function in EIGENSOFT to check the number of SNPs in each individual sample? I'm going to add Olalde 2018 to my current data pile, and it appears the authors conveniently forgot to put the coverage of each genome in the supp info :| I'd rather just have a few high-quality samples and leave out the oodles of low-coverage ones.

EDIT: Oh boy! I tried to merge Fu et al. 2016 with my HumanOrigins/AncientLazaridis2016 dataset and got thousands of...
"Warning: different genetic position for rs(bunch of numbers)"
And when trying to convert back to EIGENSTRAT, all the Fu samples were ignored...

Here is an example...
Fu: rs3094315 1 0.007526 752566 G A
Not-Fu: rs3094315 1 0.020130 752566 G A

The shared physical position (752566) suggests both are build 37 (or at least not 38); it's the genetic position in the third column that differs.

lukaszM
03-25-2018, 10:40 PM

But you can't finish merging? If I remember correctly, it was only a warning in my case.

Kale
03-25-2018, 10:59 PM

Wait I'm sorry. I can finish merging. The Fu samples are ignored when converting back to eigenstrat.

sktibo
03-25-2018, 11:02 PM
Hey folks,
Since Kale has started this thread (thanks to Kale for this!) I figured I'd jump in and ask some beginner questions, as it looks like we have quite a friendly crowd here. My current goal is to access the samples from the 1000 Genomes Project and run some of them through various GEDmatch calculators. I don't think my little old computer can handle this in a text document, so how would you folks recommend I go about it? Can anyone suggest a starting point?

Thank you kindly

anglesqueville
03-26-2018, 06:28 AM

What are the phenotype indexes of the Fu individuals in your PLINK files? If they are -9, replace them with 1 and see if it works.
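A quick way to apply that fix across a whole .fam file is a one-line awk (toy file contents here; the real file is the one PLINK wrote):

```shell
# Toy .fam whose 6th (phenotype) column holds -9, as PLINK often writes:
printf 'Fu Vestonice16 0 0 0 -9\nFu Villabruna 0 0 0 -9\n' > fu.fam
# Force the phenotype column to 1 for every individual:
awk '{$6 = 1; print}' fu.fam > fu.fixed.fam
```

Whatever was in column 6 (a -9, or even a stray sample name), every row of fu.fixed.fam now ends in 1.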

anglesqueville
03-26-2018, 06:39 AM

1000 Genomes provides VCF files. For VCF I myself use vcftools, but there are likely other tools. So if you want to use 1000 Genomes data, the way is clear (imho): LINUX > vcftools+plink+eigensoft > ... where "..." depends on what you plan to do: ADMIXTURE, admixtools, BEAGLE/FastIBD (if you are more clever than I am), and plenty of other tools that are only words to me. The starting point is LINUX+PLINK. Ask Helgenes50: 3 months ago he was working only with R; now every day he merges, extracts, converts, computes D-stats and qpAdms, etc. As we say in French, "ce n'est pas la mer à boire" (something like "it's not like drinking the sea").

edit: you need a powerful computer. Myself, as my Linux runs in a virtual machine (I want to keep Windows, ahem... I confess that I'm a gamer), everything is slow and painful. If you are a serious guy, I mean more serious than me, you will need Linux as your only operating system, at least 16 GB of RAM, and a recent CPU.
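As a concrete starting point for the LINUX > plink route above, getting a 1000 Genomes VCF into PLINK format is a single command (the filename is illustrative, and plink 1.9 is assumed to be on the PATH):

```shell
# Convert a per-chromosome VCF (plink 1.9 reads .vcf and .vcf.gz
# directly) into a bed/bim/fam trio ready for merging and extracting:
plink --vcf ALL.chr22.phase3.genotypes.vcf.gz --make-bed --out kg_chr22
```

From there the --keep/--extract/--bmerge workflow described earlier in the thread applies unchanged.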

sktibo
03-26-2018, 06:50 AM

No video card, but an okay CPU and 16 gb RAM. I'll look into getting Linux, currently only have Windows right now. I'm not too serious so I'll settle for virtual Linux. Thank you very much... I'll begin working on those basics

anglesqueville
03-26-2018, 06:57 AM

http://vcftools.sourceforge.net/index.html
https://www.cog-genomics.org/plink2/data
http://zzz.bwh.harvard.edu/plink/dataman.shtml
http://genetics.med.harvard.edu/reichlab/Reich_Lab/Software.html
https://faculty.washington.edu/browning/beagle/beagle.html
https://people.maths.bris.ac.uk/~madjl/finestructure-old/chromopainter_info.html
http://gaworkshop.readthedocs.io/en/latest/contents/01_intro/intro.html ( very nice as a starter)

Kale
03-26-2018, 04:50 PM
What are the phenotype indexes of the FU individuals in your plink? If they are -9, replace them with 1 and see if it works.

Good call, forgot to check that.
Hmm... column 6 was actually filled with the sample names (e.g. Vestonice16, Villabruna) for whatever reason; I hadn't seen that before. Changing column 6 to 1, as always, did the trick.
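For anyone hitting the same thing, a one-line awk pass can reset the phenotype column. A minimal sketch with a made-up toy .fam (assuming the standard 6-column, whitespace-delimited PLINK .fam layout; `toy.fam` is just an example file, not from any real dataset):

```shell
# Toy two-sample .fam with sample names sitting in the phenotype
# column (column 6), mimicking the situation described above.
printf 'Fam1 Vestonice16 0 0 0 Vestonice16\nFam1 Villabruna 0 0 0 Villabruna\n' > toy.fam

# Reset column 6 to 1 for every sample; the same command works on a real .fam.
awk '{ $6 = 1; print }' toy.fam > toy.fixed.fam
cat toy.fixed.fam
```

On a real dataset, point the awk command at your actual .fam and write the result to a new file before swapping it in.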

Some results...
Mota Barcin_N Vestonice16 Kostenki14 -0.0169 -3.451 34448 35633 682775
Mota Levant_N Vestonice16 Kostenki14 -0.0224 -3.46 27772 29046 557701
Mota Iran_N Vestonice16 Kostenki14 -0.0004 -0.054 29293 29314 579051
Mota Kotias Vestonice16 Kostenki14 -0.0033 -0.499 34659 34889 682704
Matches what I've seen others get...
Mota Natufian Vestonice16 Kostenki14 -0.0081 -1.081 15493 15745 309723
Hadn't seen this stat before, but this is what I expected to see.

Mota Barcin_N GoyetQ116-1 Kostenki14 -0.0083 -1.57 37429 38057 713832
Mota Levant_N GoyetQ116-1 Kostenki14 -0.0067 -1.019 30228 30634 587364
Mota Iran_N GoyetQ116-1 Kostenki14 0.0012 0.165 31561 31488 608515
Mota Kotias GoyetQ116-1 Kostenki14 -0.0028 -0.416 37273 37482 713955
Again, matches what others have gotten.
Mota Natufian GoyetQ116-1 Kostenki14 -0.0132 -1.638 16604 17047 325591
Hadn't seen this before, pretty interesting. 325k snps isn't as much as I'd like to see, but still, Levantine Aurignacian maybe?

Kale
03-26-2018, 04:51 PM
Now just to find some more genomes. I've gotten everything of interest from Reich Lab. Where do people find others? Specifically I'm looking for...
KEB/IAM
Taforalt
Ancient Egyptians (Schuenemann 2017)
Ancient South Africans (Schlebusch 2017)
Tianyuan
The new Southeast Asian genomes (if they are available yet).

Think that's it...

anglesqueville
03-26-2018, 08:32 PM
Look inside the papers, most often at the end. You'll find where the data have been submitted.

Kale
03-27-2018, 04:21 AM
That's unfortunate. The preprints don't seem to have this info, and many of the final versions are paywalled :\

The KEB/IAM preprint had an accession number to look up in the European Nucleotide Archive, but searching for that number returns 0 results.

...I'll make a deal with the general populace, give to me the samples that I seek and I will run any d-stats anybody ever wants, just ask :beerchug:

Kale
03-27-2018, 07:14 PM
Man this is getting frustrating...

After my latest merge (which added a whopping 13 samples), running qpDstat is returning "Killed" in the terminal after about 10 minutes. (It usually takes 10-15 minutes to run). The logfile just cuts off mid-thought.

I did try running more stats than usual (20), the max I had run prior was 15. Tried just doing 2 and it worked. Can I not run that many stats at once?

Megalophias
03-27-2018, 07:33 PM
I suspect the Moroccan ones haven't been uploaded, since the paper hasn't actually been published yet.

Reminds me again how much I appreciate the efforts of everyone who mucks around these gigantic files and finicky programs in Linux. :)

Kale
03-28-2018, 01:51 AM
I saw Chad doing some analysis with them, and someone posted a link on page 4/5 of the topic, but I can't for the life of me tell where the 'download here!' button is on the aforementioned link :/

Never mind, some more digging and I found it. ENA's accession search function doesn't seem to be working properly, however...

Samples are here
https://www.ebi.ac.uk/ena/data/search?query=taforalt

Found Tianyuan as well...
https://www.ebi.ac.uk/ena/data/view/PRJEB20217

Unfortunately I have no idea what a FASTQ or BAM file is or what to do with either.
Considering an Eigenstrat output for a handful of individuals is very tiny (and that's the format a lot of people want),
I'm venturing to say it would probably take less time to upload it than to convert it.
It would make more sense collectively if one person converts the file, then uploads it.
Specialization is what allows civilization!
I shall uphold this value and do my part by running any d-stat anybody wants, ever.

Just putting these here so I can reference them later...
22366 22365 22367

Kale
03-28-2018, 08:31 AM
A quick question: what is the syntax for using --extract in PLINK? I have a bunch of unused SNPs in my dataset (thanks, Fu2016 dataset), and just want to extract the 1,233,553 that I know are good.

anglesqueville
03-28-2018, 09:39 AM
--extract <list>
"list" is a text file of the wanted SNP IDs, one per line, with of course the same IDs as in the .bim.
./plink --bfile ancientfile --extract <list> --make-bed --allow-no-sex --out newfile

You can convert FASTQ to BAM with http://www.y-str.org/2015/08/srafastq-to-bam-kit.html (for example), and process the BAMs with http://www.y-str.org/2014/04/bam-analysis-kit.html (a bundle of all the classic tools). All that assuming you work on a powerful computer.
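To build that list file in the first place, the SNP IDs can be pulled straight out of a reference .bim (the ID is column 2). A sketch with a toy .bim standing in for the real panel file:

```shell
# Toy .bim (chr, id, cM, pos, A1, A2) standing in for e.g. the 1240k panel .bim.
printf '1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n' > panel.bim

# The SNP ID is column 2 of a .bim; one ID per line is exactly
# the format --extract expects.
awk '{ print $2 }' panel.bim > keep_snps.txt
cat keep_snps.txt
```

The resulting keep_snps.txt can then be passed to plink as the <list> argument of --extract.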

lukaszM
03-28-2018, 01:19 PM
With a very big BAM (more than 10 GB) it takes days... Always check whether the data are available somewhere in PLINK format.

anglesqueville
03-28-2018, 02:27 PM
Oh, I know that. At the beginning of last summer, as I was analysing the BAMs of the Minoans/Mycenaeans, my computer's cooling fan burnt out, fortunately without damage to the CPU. Anyway, at present my internet connection is very unstable (I live in a rather isolated countryside and the French telephone network is so pathetic), and unless some miracle happens I'm unable to download more than, say, 1.5 GB. But perhaps Kale lives in a big city with a cable connection, etc., and works on a high-powered computer?

Kale
03-28-2018, 03:18 PM
Unfortunately no, I have the opposite problem: great internet, crappy computer (that, and this Fedora 25 operating system is god-awfully unstable). No Tianyuan for me :\

lukaszM
03-28-2018, 03:43 PM
My computer is a few years old but was quite strong at the time, with (only?) 8 GB of RAM. And maybe that was the problem, because with big files I had the same experiences as anglesqueville. I think more RAM is the first thing to add when you want to process large data.


Here is Tianyuan data, but only for chromosome 21: http://www.y-str.org/2014/09/tianyuan-dna.html

Chad Rohlfsen
03-29-2018, 02:20 AM
You should remove same-position markers. Also, if you get a warning about ~4,000 of them, the dataset may be mapped to a different reference genome. You need to contact the authors for data, and D-stats with Z < 3-4 aren't worth making inferences from. D-stats are the weakest method anyway. Learn other stuff.
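For the same-position markers, one way to list them from a merged .bim is to flag any marker whose chr:pos has already been seen (PLINK 1.9's --list-duplicate-vars is another option). A sketch with a toy .bim; the IDs it prints can be fed to plink --exclude:

```shell
# Toy merged .bim with two markers at the same chr:position.
printf '1\trs1\t0\t1000\tA\tG\n1\trs1dup\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n' > merged.bim

# Print the ID (column 2) of any marker whose chr:pos was already seen;
# the first marker at each position is kept, later duplicates are listed.
awk 'seen[$1":"$4]++ { print $2 }' merged.bim > dup_snps.txt
cat dup_snps.txt
```

Then drop them with something like ./plink --bfile merged --exclude dup_snps.txt --make-bed --out deduped.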

Chad Rohlfsen
03-29-2018, 02:21 AM
Also, you'll need a good PC. Mine has 16 GB of RAM and about a 3.5 GHz CPU. Try to get close to that or you'll have issues.

Bas
03-29-2018, 02:43 AM
Yeah, I think the 'Killed' message is a memory issue. I usually get it with more than 10 at a time, so 15 at a time is good going.
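If it really is the out-of-memory killer, splitting the stat list into chunks and running qpDstat once per chunk is a workable workaround. A sketch, assuming the stats sit one per line in the file your par file's popfilename points at (the qpDstat loop itself is left as a comment, since it depends on your par file):

```shell
# Toy list of 20 D-stat quadruples, one test per line, standing in
# for the popfilename file given to qpDstat.
seq 1 20 | awk '{ print "Mota Barcin_N Vestonice16 Kostenki14" }' > all_stats.txt

# Split into chunks of 10 lines so each qpDstat run stays small.
split -l 10 all_stats.txt chunk_
ls chunk_*

# Then run qpDstat once per chunk (not executed here), pointing the
# par file's popfilename at each chunk in turn:
# for f in chunk_*; do ... ; done
```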

Kale
03-29-2018, 03:28 AM
It appears so...
Know how you were saying you had thousands of lines of 'no data:' in the D-stat logfile?
I did as well. All the datasets I have use the 1240k system, except HumanOrigins (which is a subset of it) and Fu2016, which dumped all those unused SNPs into the dataset.
By using the --extract function in PLINK to keep just the 1240k SNP panel, I managed to get rid of all those unused SNPs.

Overall results...
- Can run more D-stats at once
- Stats take less time to run
- No more 25 MB logfile with 900k lines of 'nodata'
- Actual geno file is ~60% of its former size
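On the related chore from the start of the thread, pruning several datasets down to their shared SNPs, a couple of shell lines over the .bim files beat a spreadsheet by a wide margin; the output goes straight to plink --extract. A sketch with toy .bim contents standing in for two real datasets:

```shell
# Toy .bim files (chr, id, cM, pos, A1, A2) for two datasets sharing one SNP.
printf '1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n' > a.bim
printf '1\trs2\t0\t2000\tC\tT\n1\trs3\t0\t3000\tG\tA\n' > b.bim

# Sort each dataset's SNP IDs (column 2), then keep only IDs present
# in both; comm -12 prints lines common to both sorted files.
awk '{ print $2 }' a.bim | sort > a_ids.txt
awk '{ print $2 }' b.bim | sort > b_ids.txt
comm -12 a_ids.txt b_ids.txt > shared_snps.txt
cat shared_snps.txt
```

shared_snps.txt is then a ready-made --extract list for each dataset before merging; for more than two datasets, repeat the comm step pairwise.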

anglesqueville
03-29-2018, 06:32 AM
What do you have in mind? Would you mind elaborating? qpGraph? Beagle/FastIBD?

Chad Rohlfsen
03-29-2018, 05:42 PM
Learn f3, f4-ratio, qpAdm, and especially qpGraph. Anything else wouldn't hurt. If you can figure out Rarecoal, that would be awesome!

lukaszM
03-29-2018, 06:13 PM
Rarecoal looks like a powerful tool. Is it very hard to learn? :)

Chad Rohlfsen
03-29-2018, 06:22 PM
It looks like it. I only looked at it for a few minutes, but my head spun a bit. I grew up a generation too late!

Chad Rohlfsen
03-29-2018, 06:22 PM
If someone can learn to use it, I'll pay you to teach me.

anglesqueville
03-30-2018, 02:36 AM
I looked at Rarecoal for more than a few minutes. It seems I lack some IQ points. That said, Beagle/FastIBD is not very easy either, to say the least. qpGraph is technically user-friendly, but my feeling is that this software is only for people with a deep knowledge of the ancient populations involved in the analyses. For a beginner, I would recommend f3, D-stats, and qpAdm, in that order. A beginner should not forget smartpca. I recently got very fine and convincing results on a very specific question (the chromosome 6 HLA zone with the Pagani dataset) with smartpca and (Chad is likely going to laugh) nMonte as an extra tool. Anecdotally, all this together makes a not-ridiculous tool bundle for amateurs... I was recently speaking with a man who teaches genetics at a French provincial university (hum... not Harvard or Cambridge) and obviously he was absolutely ignorant of all this. All he said was "Fst", and since I had quoted the name of Nick Patterson several times, he asked: "this guy you speak about, Nick something, who is he?" :\

Helgenes50
03-30-2018, 03:45 AM
And what do you think of qpBound ?

lukaszM
03-30-2018, 04:49 AM
But it seems our forum genetics masters (Generalissimo, Vadim, Kurd; I don't know about Chad) aren't professional geneticists from universities, as far as I know. Yet their knowledge, individually, not to mention taken together, is greater than that of most academic teams specialized in population genetics. At least speaking of Poland...

anglesqueville
03-30-2018, 09:52 AM
And what do you think of qpBound ?

Never used it. I may of course be wrong, but I am not sure it's very useful, as you can easily compute the standard error on the D result with f3, f4, or D (furthermore, I seem to recall there is an option in qpDstat to get it directly).

anglesqueville
03-30-2018, 09:55 AM
Don't know about Poland, but it's certainly true in France, at least in the small provincial universities. That said, I don't doubt there are actual specialists in Paris or Warsaw.

lukaszM
03-30-2018, 12:06 PM
I think many of them still live in Cavalli-Sforza times, of blood groups and enzymes...

Kale
04-05-2018, 02:48 AM
so... qpAdm and liblapack.so.3.
It wants it, but I couldn't find "liblapack.so.3"; I found "liblapack.so.3.6.2" and downloaded it. Now what? I put it in the directory qpAdm is in, but it isn't acknowledged. I tried renaming it just for fun, still nil.

Bas
04-05-2018, 03:02 AM
Make sure you have an internet connection and enter:

sudo dnf install liblapack-dev

Depending on speed it should take 15-40 minutes. I have to do this every time I run qpAdm from a live USB. For me it actually downloads 8 packages, including BLAS.
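For what it's worth, dropping the renamed .so into qpAdm's own directory doesn't work because the runtime linker searches the configured library path, not the program's folder, which is why the package install fixes it. If you ever do need a local copy instead, the usual pattern is a soname symlink plus LD_LIBRARY_PATH. A sketch with a dummy file (only safe when the versions are ABI-compatible; the real file would come from your distro):

```shell
# Dummy file standing in for the real liblapack.so.3.6.2.
mkdir -p locallib
touch locallib/liblapack.so.3.6.2

# Give it the soname the binary asks for, via a symlink...
ln -sf liblapack.so.3.6.2 locallib/liblapack.so.3

# ...and put that directory on the loader's search path:
export LD_LIBRARY_PATH="$PWD/locallib:$LD_LIBRARY_PATH"
ls -l locallib/
```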

Kale
04-05-2018, 06:20 AM
Aww, that's dumb; I didn't realize Linux 'installed' so many things. I'll have to hook my Linux computer up to the web tomorrow. If I can get it to work without having to install every time, feel free to send me any qpAdm runs you want, Bas, and I'll do them; you've been so much help :)

Kale
04-21-2018, 04:18 AM
Interesting fact to note in the event someone else has this problem.

You can't run qp3pop f3's with a target that is a single individual. With inbreeding off (the default) the SE will be in the trillions; with inbreeding on, the result is 'no data'.

EDIT: I think liblapack was updated or something; the command that worked (for Fedora 25 at least):

sudo dnf install lapack