PDA

View Full Version : PLINK extracting to 23&Me format - too large size of file



lukaszM
10-10-2017, 10:09 AM
I tried to extract some genomes from the Pagani data. http://evolbio.ut.ee/CGgenomes_VCF/
The plink files are divided into separate chromosome files, but I merged them into one file.
Then I extracted one random genome to 23&Me format, but the resulting file is enormous (1 GB), though apart from that it is correct. How on earth can this tremendous size be reduced? I thought that the 23&Me format in PLINK would obviously drop some unimportant SNPs, but it seems it does not at all.

BTW, extracting 23&Me files from the separate chromosome files is no better: merging the resulting 23&Me files into one produced an even bigger file than the method above.

anglesqueville
10-10-2017, 03:22 PM
I thought that 23&Me format in PLINK obviously reduces some not important SNPs, but it seems not at all
In Plink, "23file" just means "text file with rsid in col 1, chromosome number in col 2, position in col 3 and genotype in col 4, without quotes and tab-separated", that's all. Fortunately, Plink doesn't touch the SNP list.
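A minimal Python sketch of that layout, just to make the four tab-separated columns concrete (the rsid/position values are taken from the UCSC excerpt later in this thread):

```python
# One line of PLINK's "23file" output: rsid, chromosome, position,
# genotype -- tab-separated, no quotes.
def make_23file_line(rsid, chrom, pos, genotype):
    return "\t".join([rsid, str(chrom), str(pos), genotype])

line = make_23file_line("rs4022986", "22", 10511116, "CC")
print(line.split("\t"))  # ['rs4022986', '22', '10511116', 'CC']
```

Since every genotyped SNP gets one such line, the output size is simply proportional to the SNP count; nothing is filtered.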

lukaszM
10-10-2017, 03:50 PM
Problem is only with Pagani dataset.

anglesqueville
10-10-2017, 04:11 PM
Problem is only with Pagani dataset.

The size of your resulting file depends on the number of snps. You'll experience this size problem with any very rich dataset. I don't recall how many snps are genotyped by Pagani/HGDP, but likely more than 10 million. In my opinion this problem is secondary, because it's easy to solve in Plink with --extract and a snp list. The main problem is, as I told you, that Pagani's bim has chromnum:pos in the first column, not snpid. If you want to get a 23&Me file usable on GEDmatch (likely that's what you are expecting), you have to solve this identifier problem before anything else. My R script is obviously useless on entire chromosomes (too slow). Try this: https://www.biostars.org/p/171557/

anglesqueville
10-10-2017, 04:38 PM
Just gave it a try: I downloaded the snpids for chromosome 22 from UCSC. That works fine:
#chrom chromEnd name
chr22 10510770 rs71207739
chr22 10510928 rs71207740
chr22 10511116 rs4022986
chr22 10511641 rs71207745
chr22 10513123 rs4084938
chr22 10513298 rs1829025
chr22 10513346 rs4084939
chr22 10513380 rs386831164
chr22 10513440 rs879023291
chr22 10513844 rs879124055
chr22 10516682 rs4022528
chr22 10516697 rs4022529
chr22 10516937 rs1829029
chr22 10519159 rs71267035
chr22 10519350 rs71286730
chr22 10519383 rs1913861
chr22 10519495 rs3850103 etc.....
You have to do this for each chromosome, make files with lines like "22:10510770 rs71207739", and use --update-map and --update-name in Plink. You'll very likely have to use --list-duplicate-vars and --exclude in order to get clean files.
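As a sketch of building that renaming file, here is a little Python snippet (assuming the three-column UCSC output shown above) that turns each UCSC row into a "chrom:pos rsid" mapping line for --update-name:

```python
# Convert UCSC rows ("chr22 10510770 rs71207739") into the
# "22:10510770 rs71207739" lines that Plink's --update-name expects,
# matching Pagani's chrom:pos identifiers.
def ucsc_to_update_name(rows):
    lines = []
    for row in rows:
        chrom, pos, rsid = row.split()
        chrom = chrom[3:] if chrom.startswith("chr") else chrom  # "chr22" -> "22"
        lines.append(f"{chrom}:{pos} {rsid}")
    return lines

rows = ["chr22 10510770 rs71207739", "chr22 10511116 rs4022986"]
print("\n".join(ucsc_to_update_name(rows)))
```

Write those lines to a file per chromosome (or one concatenated file) and feed it to Plink with --update-name.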

Kurd
10-10-2017, 06:15 PM
No need to overthink this. Just --extract MySNPs.txt, where MySNPs.txt is a list of SNPs, one per row.

Then, ./plink --bfile Pagani --recode 23 --out 23andMe. That simple :)

anglesqueville
10-10-2017, 08:39 PM
That simple, yes. But ...
1) if in your "MySNPs.txt" the snps are named by their snpid (rs#, indels, etc.), then, since in Pagani they are named by chrom#:position, Plink of course will extract NO snps at all.
2) if in your "MySNPs.txt" the snps are named by chrom#:position, that will of course work, but the result will be a text file with chrom#:position in the first column:


# This data file generated by PLINK at: Tue Oct 10 22:34:20 2017
#
# Below is a text version of your data. Fields are TAB-separated.
# Each line corresponds to a single SNP. For each SNP, we provide its
# identifier, its location on a reference human genome, and the genotype call.
# For further information (e.g. which reference build was used), consult the
# original source of your data.
#
# rsid chromosome position genotype
22:16050607 22 16050607 GG
22:16051269 22 16051269 GG
22:16051477 22 16051477 CC
22:16051867 22 16051867 TT
22:16052273 22 16052273 CC
22:16052684 22 16052684 AA
22:16052766 22 16052766 AA
22:16052777 22 16052777 CC
22:16053107 22 16053107 CC
22:16053201 22 16053201 GG
22:16053485 22 16053485 GG

In any case he has to rename the snps.
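The renaming itself is trivial once a chrom:pos → rsid mapping exists; a Python sketch of the per-line operation (the rsid in the example mapping is hypothetical, purely for illustration):

```python
# Replace the chrom:pos identifier in the first column of a recoded
# 23&Me line with an rsid, when the mapping knows one.
def rename_snp(line, mapping):
    ident, chrom, pos, genotype = line.split("\t")
    rsid = mapping.get(ident, ident)  # keep chrom:pos if unmapped
    return "\t".join([rsid, chrom, pos, genotype])

mapping = {"22:16050607": "rs9999999"}  # hypothetical rsid, for illustration
print(rename_snp("22:16050607\t22\t16050607\tGG", mapping))
```

In practice, though, doing the renaming inside Plink with --update-name (before --recode 23) is cleaner, since Plink then handles the bim/bed consistency for you.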

Kurd
10-11-2017, 02:51 AM
I didn't realize that they had not annotated the SNPs with dbSNP rs numbers. In this situation, I would personally annotate the VCF file using GATK variant annotation, which requires the proper human reference and dbSNP files and their associated files. Although it only takes me a few minutes, using GATK is tricky for anyone not familiar with it. In this case the UCSC approach described above may be your best option.

anglesqueville
10-11-2017, 06:25 AM
I didn't realize that they had not annotated the SNPs with dbSNP rs nos.
I experienced this problem some months ago, when I wanted to use smartpca on the HLA zone of Uralic populations and my father. But as everything lay on a segment within a single chromosome, I wrote a little R script to remake Pagani's annotation, and it all worked fine. But as I guess Lukasz wants 23&Me files for the whole genome, uploadable to GEDmatch, my script is useless there. Yesterday evening I used the UCSC+update procedure described on Biostars for the little chromosome 22, and it worked rather well, although I got a small number of duplicate vars (fewer than 200). The procedure on the 22 autosome and X bed files would take some time, but it works. Pagani is a very precious resource for people interested in modern Uralic genetics, but I cannot imagine why Pagani used this annotation. :frusty:

Kurd
10-11-2017, 06:41 AM
I experienced this problem some months ago ... but I cannot imagine the reason why Pagani used this annotation. :frusty:

I’m not sure why he didn’t, since, as I mentioned, it only takes a couple of minutes longer to annotate the VCF with dbSNP if you know what you are doing. I have had his dataset incorporated into mine for a while now, along with data from the Simons Genome Diversity Project, which are much larger files (they include 5+ million bases not even in the human reference), but I had forgotten that their data didn’t include rs numbers.

Perhaps he only wanted serious people to use his data :)

anglesqueville
10-11-2017, 06:57 AM
a couple of minutes longer to annotate the VCF with dbSNP ... rather a couple of couples of minutes, if Pagani has written one vcf per chromosome, as he did with the beds. :biggrin1: I guess that Pagani didn't imagine that somebody could need to make 23&Me files of individuals extracted from his dataset. Personally, my only problem was merging my father with Pagani, and of course I remade the annotation of my father's file, not of Pagani! :biggrin1:

lukaszM
10-11-2017, 11:21 AM
.. I guess that Pagani didn't imagine that somebody could need to make 23&Me files of individuals extracted from his dataset.

Yes, my reason was to extract three Aeta kits, out of curiosity: to check whether they have Filipino relatives on GEDmatch, and how many, and of course how they behave in various calculators.
What is strange about that?

@Kurd

Perhaps he only wanted serious people to use his data

I don't know whether my reasons above are serious or not. I learned PLINK just one week ago and have extracted all samples from all public datasets (those in PLINK format) except Pagani :)
All are uploaded on GEDmatch now, but hidden under the "Research" label... I don't want to completely dominate some one-to-many results with them (like SE Asia or Central Asia).

anglesqueville
10-11-2017, 11:39 AM
Lukasz:
What is strange about that? Strictly speaking, nothing. I just think that Prof. Pagani was (and still is) unable to imagine that somebody could use his data in this way. That said, the data are public, and you are allowed to use them as you want.

edit: honestly, the ways I myself have used Pagani (only twice) may themselves seem strange, or at least highly adventurous. In particular, the work I did on the HLA with the Uralics from Pagani, my father (imputed), and diverse admixtools programs was probably on the fringes of what is reasonable. :\

edit edit: Lukasz, as you are always looking for modern individuals, did you try the Simons Genome Diversity Project? (link on Reich's page at Harvard)

lukaszM
10-11-2017, 11:45 AM
Lukasz: Strictly speaking, nothing. I just think that Prof. Pagani was (and still is) unable to imagine that somebody could use his data in this way. That said, the data are public, and you are allowed to use them as you want.

Yes it's possible that he can't:)

I like complete solutions and it makes me sick that I can't use Pagani (yet...).
If something interests me I must gather all possible info:)

Kurd
10-11-2017, 01:25 PM
Yes it's possible that he can't:)

I like complete solutions and it makes me sick that I can't use Pagani (yet...).
If something interests me I must gather all possible info:)

I can put the plinks of the Aeta samples on Google Drive for you if you like, annotated with rs numbers against the hg19 reference?

lukaszM
10-11-2017, 01:29 PM
I can put the plinks of the Aeta samples on Google Drive for you if you like, annotated with rs numbers against the hg19 reference?

Ok if you can, would be great.

Kurd
10-11-2017, 01:39 PM
Ok if you can, would be great.

Here you go in Plink format. I also included the 3 Agta samples. All mapped to hg19 with dbSNP nos, with X chr, and about 900K SNPs. I removed no calls.

https://drive.google.com/file/d/0B5L_qxKAZF35NVk2WlVqZk04VkU/view?usp=sharing

lukaszM
10-11-2017, 02:22 PM
Here you go in Plink format. I also included the 3 Agta samples. All mapped to hg19 with dbSNP nos, with X chr, and about 900K SNPs. I removed no calls.

https://drive.google.com/file/d/0B5L_qxKAZF35NVk2WlVqZk04VkU/view?usp=sharing

Thanks!

Still, I will be trying to manage it by myself. I have an idea involving --extract, but I must test it.

lukaszM
10-12-2017, 08:17 PM
Thanks to an R solution provided by Anglesqueville by PM, I have finally converted the whole Pagani dataset to 23&Me format :)