PDA

View Full Version : How to merge 2 datasets with different number of SNPs using plink?



misnomer
01-29-2020, 07:29 AM
I have a 1240k eigenstrat file downloaded from Reich database.

I want to run analysis using data from Pathak 2018 which has eigenstrat file in 730k snps. How do I merge these 2 datasets using Plink without corrupting any data and retaining 1240k snps of the samples of the 1st dataset?

I have experience with merging 2 1240k snp datasets using Plink.

Thanks.

anglesqueville
01-29-2020, 08:37 AM
Here is how I usually work (I'm not claiming that this is the best method. Say that your 1040k file is A, B the other one).
1) Convert the 2 files in bed/bim/fam (PACKEDPED) with eigentools/convertf
2) In order to make less likely problems of positioning, exclude the SNPs in B that are not in A (you can try without this step, but in my experience it is often dangerous, and the conversion in eigenstrat may later result in problems of physical distance):
./plink --bfile A --write-snplist
./plink --bfile B --extract plink.snplist --make-bed --out B1
(you may have to add the flag --allow-no-sex. I assume that you don't want to filter or prune in anyway).
2) Try to merge
./plink --bfile A --bmerge B1.bed B1.bim B1.fam --indiv-sort 0 --make-bed --out essai
(the flag --indiv-sort 0 is here in order to get the individuals of file B following those of A in the resulting ind.file)
If you didn't encounter problems of multiallelic or multichromosomic snps, the job is done, you rename the "essai" files, and convert this plink file into eigenstrat
3) If not, you'll read in the essai-merge.missnp and the essai.log files the lists of the problematic snps. Usually I sacrifice them: I write the list of these markers (say "badsnps"), and ./plink --bfile B1 --exclude badsnps --make-bed --out B2; then step 2 to merge A and B2. If you don't want to sacrifice them, you can try to flip the multi-allelic snps, and see whether you have still a problem. About multipositions, I don't know whether there is a simple possibility not to sacrifice the snps involved.

misnomer
01-29-2020, 09:11 AM
Here is how I usually work (I'm not claiming that this is the best method. Say that your 1040k file is A, B the other one).
1) Convert the 2 files in bed/bim/fam (PACKEDPED) with eigentools/convertf
2) In order to make less likely problems of positioning, exclude the SNPs in B that are not in A (you can try without this step, but in my experience it is often dangerous, and the conversion in eigenstrat may later result in problems of physical distance):
./plink --bfile A --write-snplist
./plink --bfile B --extract plink.snplist --make-bed --out B1
(you may have to add the flag --allow-no-sex. I assume that you don't want to filter or prune in anyway).
2) Try to merge
./plink --bfile A --bmerge B1.bed B1.bim B1.fam --indiv-sort 0 --make-bed --out essai
(the flag --indiv-sort 0 is here in order to get the individuals of file B following those of A in the resulting ind.file)
If you didn't encounter problems of multiallelic or multichromosomic snps, the job is done, you rename the "essai" files, and convert this plink file into eigenstrat
3) If not, you'll read in the essai-merge.missnp and the essai.log files the lists of the problematic snps. Usually I sacrifice them: I write the list of these markers (say "badsnps"), and ./plink --bfile B1 --exclude badsnps --make-bed --out B2; then step 2 to merge A and B2. If you don't want to sacrifice them, you can try to flip the multi-allelic snps, and see whether you have still a problem. About multipositions, I don't know whether there is a simple possibility not to sacrifice the snps involved.

sounds awesome, exactly what i was looking for. does step 2 also work for snps in same position but with different names?

anglesqueville
01-29-2020, 09:39 AM
sounds awesome, exactly what i was looking for. does step 2 also work for snps in same position but with different names?

Hum, I was assuming that the B file used snpids, like Reich ( I don't know Pathak's study). If it's not the case, you may have before anything to rename the snps from Pathak ( in plink, --update-name with a file that contains the old and new names of the snps; you may also use ANNOVAR, but so far I always made it with plink).

Helgenes50
01-29-2020, 09:41 AM
sounds awesome, exactly what i was looking for. does step 2 also work for snps in same position but with different names?

Anglesqueville is a very good teacher !!!

misnomer
01-29-2020, 11:10 AM
Hum, I was assuming that the B file used snpids, like Reich ( I don't know Pathak's study). If it's not the case, you may have before anything to rename the snps from Pathak ( in plink, --update-name with a file that contains the old and new names of the snps; you may also use ANNOVAR, but so far I always made it with plink).

my file A is already a merge of 3 datasets, 2 from Harvard, one from Estonia tartu. I'll just redo all the merges from scratch and see whats happening.

misnomer
02-01-2020, 05:52 AM
Hum, I was assuming that the B file used snpids, like Reich ( I don't know Pathak's study). If it's not the case, you may have before anything to rename the snps from Pathak ( in plink, --update-name with a file that contains the old and new names of the snps; you may also use ANNOVAR, but so far I always made it with plink).

The culprit was Wang 2019 dataset. it had the exact same SNPs as Reich 1240k but diff names. Also noticed that the genetic position column (educated guess) in the snp file of Wang (the column with the decimals) are all set to 0, whereas the Reich files have decimals in those columns. Does this change anything?

Anyway, i managed to merge everything (8 datasets in all after sacrificing around 73k snps). Last question

Is it possible to retain population labels in the process of converting from eigenstrat to Packedped, merging and then converting back to eigenstrat? Currently I use excel to fill up population labels back as they were in the unmerged .ind files.

anglesqueville
02-01-2020, 08:56 AM
The culprit was Wang 2019 dataset. it had the exact same SNPs as Reich 1240k but diff names. Also noticed that the genetic position column (educated guess) in the snp file of Wang (the column with the decimals) are all set to 0, whereas the Reich files have decimals in those columns. Does this change anything?

Anyway, i managed to merge everything (8 datasets in all after sacrificing around 73k snps). Last question

Is it possible to retain population labels in the process of converting from eigenstrat to Packedped, merging and then converting back to eigenstrat? Currently I use excel to fill up population labels back as they were in the unmerged .ind files.

As far as I know (but I don't know everything, for sure) there is nothing in convertf to do that. Usually I use openoffice just like you.

Kale
02-03-2020, 03:32 PM
I copypaste the previous .ind files together in npp so that the samples order match.

misnomer
02-04-2020, 08:26 AM
Just a tip to speed up qpAdm processing time and save harddisk space (imp for virtual OS linux), might help someone.

After merging datasets and converting packedped to eigenstrat file, then modifying the .ind file to add back population labels - one should reconvert the eigenstrat to packedancestrymap format using convertf. also add this line in parameter file -hashcheck: NO. This allows you to edit population labels in ind file.

Size of geno file reduced by 75%, and qpAdm processing time reduced by 60%.

misnomer
02-07-2020, 04:04 PM
Just a tip to speed up qpAdm processing time and save harddisk space (imp for virtual OS linux), might help someone.

After merging datasets and converting packedped to eigenstrat file, then modifying the .ind file to add back population labels - one should reconvert the eigenstrat to packedancestrymap format using convertf. also add this line in parameter file -hashcheck: NO. This allows you to edit population labels in ind file.

Size of geno file reduced by 75%, and qpAdm processing time reduced by 60%.

You can eliminate 1 step from above by directly editing the .fam packedped individuals file and adding back population labels (replace '1s' in the 6th column inside .fam file). Then convertf to packedancestrymap format directly, instead of eigenstrat format. Remember to keep 'hashcheck: NO' in param file of convertf so that you can edit the resultant .ind file and still be able to run qpAdm without error.

Tanchik
07-19-2020, 12:18 PM
DnaStudioKit.

yinkagene
07-26-2021, 06:40 PM
Here is how I usually work (I'm not claiming that this is the best method. Say that your 1040k file is A, B the other one).
1) Convert the 2 files in bed/bim/fam (PACKEDPED) with eigentools/convertf
2) In order to make less likely problems of positioning, exclude the SNPs in B that are not in A (you can try without this step, but in my experience it is often dangerous, and the conversion in eigenstrat may later result in problems of physical distance):
./plink --bfile A --write-snplist
./plink --bfile B --extract plink.snplist --make-bed --out B1
(you may have to add the flag --allow-no-sex. I assume that you don't want to filter or prune in anyway).
2) Try to merge
./plink --bfile A --bmerge B1.bed B1.bim B1.fam --indiv-sort 0 --make-bed --out essai
(the flag --indiv-sort 0 is here in order to get the individuals of file B following those of A in the resulting ind.file)
If you didn't encounter problems of multiallelic or multichromosomic snps, the job is done, you rename the "essai" files, and convert this plink file into eigenstrat
3) If not, you'll read in the essai-merge.missnp and the essai.log files the lists of the problematic snps. Usually I sacrifice them: I write the list of these markers (say "badsnps"), and ./plink --bfile B1 --exclude badsnps --make-bed --out B2; then step 2 to merge A and B2. If you don't want to sacrifice them, you can try to flip the multi-allelic snps, and see whether you have still a problem. About multipositions, I don't know whether there is a simple possibility not to sacrifice the snps involved.

Thanks for this- so how do i carry out step 3? What do i do to generate the "essai-merge.missnp" if using Plink v1.9? As i still encounter problematic snps. thanks in advance for answering