Thread: How to merge 2 datasets with different number of SNPs using plink?

  1. #1

    How to merge 2 datasets with different number of SNPs using plink?

    I have a 1240k eigenstrat dataset downloaded from the Reich database.

    I want to run analyses that include data from Pathak 2018, which comes as an eigenstrat dataset with 730k SNPs. How do I merge these two datasets using Plink without corrupting any data, while retaining all 1240k SNPs for the samples of the first dataset?

    I have experience merging two 1240k SNP datasets with Plink.

    Thanks.

  2. #2
    Moderator
    Posts
    6,731
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA (P)
    R-BY3604-Z275
    mtDNA (M)
    H5a1

    Normandie Netherlands Friesland Finland Orkney
    Here is how I usually work (I'm not claiming that this is the best method). Say that your 1240k file is A and the other one is B.
    1) Convert both files to bed/bim/fam (PACKEDPED) with EIGENSOFT's convertf.
    2) To make position conflicts less likely, exclude the SNPs in B that are not in A (you can try without this step, but in my experience skipping it is often dangerous, and the conversion back to eigenstrat may later run into problems of physical distance):
    ./plink --bfile A --write-snplist
    ./plink --bfile B --extract plink.snplist --make-bed --out B1
    (You may have to add the flag --allow-no-sex. I assume that you don't want to filter or prune in any way.)
    3) Try to merge:
    ./plink --bfile A --bmerge B1.bed B1.bim B1.fam --indiv-sort 0 --make-bed --out essai
    (The flag --indiv-sort 0 keeps the individuals of file B after those of A in the resulting .ind file.)
    If you didn't run into multi-allelic SNPs or SNPs assigned to several chromosomes or positions, the job is done: rename the "essai" files and convert this plink fileset to eigenstrat.
    4) If not, the essai-merge.missnp and essai.log files list the problematic SNPs. Usually I sacrifice them: I write these markers to a list (say "badsnps") and run ./plink --bfile B1 --exclude badsnps --make-bed --out B2, then repeat step 3 to merge A and B2. If you don't want to sacrifice them, you can try to flip the multi-allelic SNPs in one of the files (--flip) and see whether the problem persists. As for SNPs with multiple positions, I don't know of a simple way to keep them.
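Editor's sketch of the "sacrifice the problematic SNPs" clean-up described above. It is only an illustration, not the poster's own script. The function names `position_conflicts` and `build_badsnps` are hypothetical. The sketch assumes standard six-column .bim files (chromosome, SNP ID, genetic distance, base-pair position, allele 1, allele 2) and that both files code chromosomes the same way (for example 23 rather than X). It finds chromosome or position conflicts directly from the .bim files instead of reading them out of the log.

```shell
#!/bin/sh
# Print the IDs of SNPs present in both .bim files whose chromosome or
# base-pair position differs between the two files.
position_conflicts() {
    # $1 = reference .bim (file A), $2 = other .bim (file B1)
    awk 'NR == FNR { loc[$2] = $1 ":" $4; next }
         ($2 in loc) && loc[$2] != $1 ":" $4 { print $2 }' "$1" "$2"
}

# Combine those conflicts with the list PLINK wrote after a failed merge
# (for example essai-merge.missnp, if present) into one sorted,
# de-duplicated exclusion list.
build_badsnps() {
    # $1 = A.bim, $2 = B1.bim, $3 = .missnp file (may not exist), $4 = output list
    {
        position_conflicts "$1" "$2"
        if [ -f "$3" ]; then cat "$3"; fi
    } | sort -u > "$4"
}

# Possible usage, with the file names from the post:
#   build_badsnps A.bim B1.bim essai-merge.missnp badsnps
#   ./plink --bfile B1 --exclude badsnps --make-bed --out B2
#   ./plink --bfile A --bmerge B2.bed B2.bim B2.fam --indiv-sort 0 --make-bed --out essai
```

Checking the length of badsnps before excluding (wc -l badsnps) shows how many SNPs are about to be sacrificed.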
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  3. The Following 4 Users Say Thank You to anglesqueville For This Useful Post:

     agent_lime (01-29-2020),  michal3141 (04-22-2020),  misnomer (01-29-2020),  Ruderico (01-29-2020)

  4. #3
    Sounds awesome, exactly what I was looking for. Does step 2 also work for SNPs that are at the same position but have different names?

  5. #4
    Quote Originally Posted by misnomer View Post
    sounds awesome, exactly what i was looking for. does step 2 also work for snps in same position but with different names?
    Hmm, I was assuming that the B file used the same SNP IDs as Reich (I don't know Pathak's study). If that's not the case, you may first have to rename the SNPs from Pathak (in plink, --update-name with a file that contains the old and new names of the SNPs; you could also use ANNOVAR, but so far I have always done it with plink).
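Editor's sketch of one way to build the old-name/new-name file for --update-name when the two datasets share positions but not SNP names. The function name `make_rename_table` is hypothetical. The sketch assumes both .bim files use the same genome build and chromosome coding. It matches purely on chromosome and base-pair position without comparing alleles, so the resulting table deserves a manual look before use. Positions that occur more than once in A are skipped as ambiguous, and only the first SNP in B at a given position is renamed.

```shell
#!/bin/sh
# Write "old_name new_name" pairs for SNPs in B that sit at the same
# chromosome:position as a SNP in A but carry a different ID.
make_rename_table() {
    # $1 = A.bim (names to keep), $2 = B.bim (names to replace)
    awk 'NR == FNR {
             key = $1 ":" $4
             if (key in name) dup[key] = 1     # position seen twice in A: ambiguous
             name[key] = $2
             next
         }
         {
             key = $1 ":" $4
             if ((key in name) && !(key in dup) && !(key in used) && name[key] != $2) {
                 print $2, name[key]           # old ID in column 1, new ID in column 2
                 used[key] = 1
             }
         }' "$1" "$2"
}

# Possible usage:
#   make_rename_table A.bim B.bim > rename.txt
#   ./plink --bfile B --update-name rename.txt --make-bed --out B_renamed
```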

  6. The Following User Says Thank You to anglesqueville For This Useful Post:

     misnomer (01-29-2020)

  7. #5
    Gold Class Member
    Posts
    2,573
    Sex
    Location
    Brittany
    Ethnicity
    NW European
    Y-DNA (P)
    I-L813 >Y36690
    mtDNA (M)
    H3s
    mtDNA (P)
    H3s

    Normandie France Bretagne
    Quote Originally Posted by misnomer View Post
    sounds awesome, exactly what i was looking for. does step 2 also work for snps in same position but with different names?
    Anglesqueville is a very good teacher !!!
    Recent Ancestry, full Normand. Known Genealogy 7/8 of the Cotentin peninsula 1/8 region of Coutances. Unfortunately, there are many missing branches on the maternal side.

  8. The Following 3 Users Say Thank You to Helgenes50 For This Useful Post:

     Dalluin (01-31-2020),  misnomer (01-29-2020),  Ruderico (01-29-2020)

  9. #6
    Quote Originally Posted by anglesqueville View Post
    Hum, I was assuming that the B file used snpids, like Reich ( I don't know Pathak's study). If it's not the case, you may have before anything to rename the snps from Pathak ( in plink, --update-name with a file that contains the old and new names of the snps; you may also use ANNOVAR, but so far I always made it with plink).
    My file A is already a merge of 3 datasets: 2 from Harvard and one from Tartu, Estonia. I'll just redo all the merges from scratch and see what happens.

  10. #7
    The culprit was the Wang 2019 dataset. It had exactly the same SNPs as Reich 1240k, but under different names. I also noticed that what I take to be the genetic position column (an educated guess: the column with the decimals) in Wang's .snp file is set to 0 everywhere, whereas the Reich files have decimal values there. Does this change anything?

    Anyway, I managed to merge everything (8 datasets in all, after sacrificing around 73k SNPs). Last question:

    Is it possible to retain population labels through the process of converting from eigenstrat to PACKEDPED, merging, and converting back to eigenstrat? Currently I use Excel to fill the population labels back in as they were in the unmerged .ind files.

  11. #8
    As far as I know (but I certainly don't know everything), there is nothing in convertf to do that. Usually I use OpenOffice, just like you.

  12. The Following User Says Thank You to anglesqueville For This Useful Post:

     misnomer (02-01-2020)

  13. #9
    Registered Users
    Posts
    1,702
    Sex
    Omitted

    I copy-paste the original .ind files together in Notepad++ (npp), in the same order as the merge, so that the sample order matches.
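Editor's sketch of a scripted alternative to refilling the labels by hand. The function name `restore_labels` is hypothetical. The sketch assumes three-column .ind files (sample ID, sex, population label) and unique sample IDs across the original datasets. It also strips anything up to a last colon before looking up an ID, in case the converted IDs came back with a "family:" prefix. That prefix handling is a guess about your files, so check a few lines of your own merged .ind first.

```shell
#!/bin/sh
# Rewrite a merged .ind file so that each sample gets the population
# label it had in the original (unmerged) .ind files. The sample order
# of the merged file is preserved, which keeps it in sync with the .geno.
restore_labels() {
    # $1 = merged .ind ; remaining arguments = original .ind files
    merged=$1
    shift
    cat "$@" | awk '
        NR == FNR { label[$1] = $3; next }    # originals: sample ID -> label
        {
            id = $1
            sub(/^.*:/, "", id)               # drop a leading "family:" prefix, if any
            if (id in label) {
                print $1, $2, label[id]
            } else {
                print $1, $2, $3              # unknown sample: keep the current label
                print "no original label for " $1 > "/dev/stderr"
            }
        }' - "$merged"
}

# Possible usage:
#   restore_labels merged.ind A.ind B.ind > merged.relabelled.ind
```

Any sample reported on stderr kept its converted label and needs to be fixed by hand.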

  14. The Following User Says Thank You to Kale For This Useful Post:

     misnomer (02-04-2020)

  15. #10
    Just a tip to speed up qpAdm and save hard disk space (important when running Linux in a virtual machine); it might help someone.

    After merging the datasets, converting PACKEDPED to eigenstrat, and modifying the .ind file to add back the population labels, reconvert the eigenstrat files to PACKEDANCESTRYMAP format using convertf. Also add the line hashcheck: NO to the parameter file; this allows you to edit population labels in the .ind file.

    For me, the size of the geno file went down by 75% and qpAdm processing time by 60%.
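Editor's sketch of a convertf parameter file for the reconversion described in this tip. The helper name `write_convertf_par` and the file prefixes in the usage lines are hypothetical. The parameter keys are the usual convertf ones, and hashcheck: NO is included as suggested in the post.

```shell
#!/bin/sh
# Print a convertf parameter file that converts an eigenstrat dataset
# (prefix $1) to PACKEDANCESTRYMAP (prefix $2).
write_convertf_par() {
    cat <<EOF
genotypename:    $1.geno
snpname:         $1.snp
indivname:       $1.ind
outputformat:    PACKEDANCESTRYMAP
genotypeoutname: $2.geno
snpoutname:      $2.snp
indivoutname:    $2.ind
hashcheck:       NO
EOF
}

# Possible usage:
#   write_convertf_par merged merged_packed > par.pack
#   convertf -p par.pack
```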
    Last edited by misnomer; 02-05-2020 at 04:30 AM.

  16. The Following User Says Thank You to misnomer For This Useful Post:

     Kale (02-07-2020)
