DNA.Land imputed VCF file



Ann Turner
10-13-2015, 03:20 AM
The following was cross-posted to 23andMe and an ISOGG Facebook thread about the imputed VCF file.

The file I uploaded to DNA.Land was an AncestryDNA file. I also have v4 data from 23andMe, which I knew contained a fair number of SNPs not included in AncestryDNA. There is a very good concordance rate for my 23andMe and AncestryDNA results (only 14 SNPs out of the 299672 overlapping SNPs had different genotype calls), so the chip method is very robust. My goal was to see if the imputed results matched the 23andMe results even when AncestryDNA had no data.

Handling the humongous VCF file took a lot of finagling, so I ended up working with just chromosome 22. The VCF file for C22 has 489,000 rows with columns for REF / ALT alleles. The last column contains 0/0 if the SNP is homozygous for the reference allele, 0/1 if the SNP is heterozygous, and 1/1 if the SNP is homozygous for the alternate allele. I used spreadsheet formulas to convert the VCF format to 23andMe's format.
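For anyone who would rather script this conversion than use spreadsheet formulas, here is a minimal Python sketch of the genotype translation described above. The function name is made up for illustration. It assumes the genotype is the first colon-separated item in the last column, and it reports indels and anything else it does not understand as a no-call (`--`), so it is not a complete converter.

```python
def vcf_gt_to_genotype(ref, alt, sample_field):
    """Turn REF/ALT plus a GT value such as 0/1 into a genotype like AG."""
    alts = [] if alt == "." else alt.split(",")
    alleles = [ref] + alts
    # This sketch only handles single-base alleles; indels become no-calls.
    if any(len(a) != 1 or a not in "ACGT" for a in alleles):
        return "--"
    # Accept both unphased (0/1) and phased (0|1) separators.
    gt = sample_field.split(":")[0].replace("|", "/")
    indexes = gt.split("/")
    if len(indexes) != 2 or not all(i.isdigit() for i in indexes):
        return "--"
    if any(int(i) >= len(alleles) for i in indexes):
        return "--"
    return "".join(alleles[int(i)] for i in indexes)


print(vcf_gt_to_genotype("A", "G", "0/0"))  # homozygous for the reference allele
print(vcf_gt_to_genotype("A", "G", "0/1"))  # heterozygous
print(vcf_gt_to_genotype("A", "G", "1/1"))  # homozygous for the alternate allele
```

The returned two-letter string is what goes in the fourth column of a 23andMe-style raw data line (rsid, chromosome, position, genotype).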

For C22 there were 3615 SNPs with 23andMe data not included in my AncestryDNA results. 51 of those were no-calls, and the imputed file supplied values for 35 of those. 230 SNPs in the 23andMe set did not get any imputed values.

That left 3350 SNPs with data from both sources. 3238 of those (96.7%) were identical. Of the 112 SNPs with different values (3.3%), 55 were homozygous in the imputed file and heterozygous in the 23andMe file, 50 were the reverse, and 7 were opposite homozygotes.
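The same kind of breakdown can be tallied with a short script. This is only a sketch under a few assumptions: genotypes are two-letter strings keyed by rsid, `--` marks a no-call, and the function and category names are invented for the example.

```python
from collections import Counter


def is_hom(genotype):
    """True when both alleles of a two-letter genotype are the same."""
    return genotype[0] == genotype[1]


def classify(imputed, observed):
    """Classify one SNP by comparing the imputed and the observed genotype."""
    if len(imputed) != 2 or len(observed) != 2:
        return "no_call"
    if "-" in imputed or "-" in observed:
        return "no_call"
    if sorted(imputed) == sorted(observed):
        return "identical"  # allele order does not matter (AG == GA)
    if is_hom(imputed) and is_hom(observed):
        return "opposite_homozygotes"
    if is_hom(imputed):
        return "hom_imputed_het_observed"
    if is_hom(observed):
        return "het_imputed_hom_observed"
    return "other"


def tally(imputed_calls, observed_calls):
    """Count categories over the SNPs present in both dicts (rsid -> genotype)."""
    shared = imputed_calls.keys() & observed_calls.keys()
    return Counter(classify(imputed_calls[r], observed_calls[r]) for r in shared)


# Toy data with made-up rsids, only to show the categories.
imputed = {"rs1": "AG", "rs2": "AA", "rs3": "CC", "rs4": "TT", "rs5": "AG"}
observed = {"rs1": "GA", "rs2": "AG", "rs3": "TT", "rs4": "--", "rs5": "GG"}
print(tally(imputed, observed))
```

Dividing the "identical" count by the number of SNPs called in both sources gives the concordance, which for the figures above is 3238 / 3350 = 96.7%.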

When working with genotype (unphased) data for matching purposes, heterozygous results give you a free pass, since they are a universal match. Using imputed data might let a long consecutive run of half-identical SNPs continue a bit longer, but sooner or later (usually sooner) opposite homozygotes will crop up. Opposite homozygotes terminate the long consecutive run, but the algorithms build in some tolerance for genotyping error, so these rare errors will probably not have much of an impact on breaking up a matching segment.
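To make the "free pass" idea concrete, here is a small sketch of a half-identical test and of how opposite homozygotes split a run. It is a simplification, not any company's actual algorithm: it has no tolerance for genotyping errors, and treating no-calls as matches is an assumption made for the example.

```python
def half_identical(g1, g2):
    """True if two unphased genotypes share at least one allele."""
    if "-" in g1 or "-" in g2:
        return True  # assumption: a no-call never breaks a run
    return bool(set(g1) & set(g2))


def run_lengths(pairs):
    """Lengths of consecutive half-identical runs, split at each mismatch."""
    runs, current = [], 0
    for g1, g2 in pairs:
        if half_identical(g1, g2):
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# A heterozygous call matches either homozygote; GG vs AA ends the run.
pairs = [("AG", "AA"), ("CC", "CT"), ("GG", "AA"), ("TT", "TT")]
print(run_lengths(pairs))
```

In this toy input the third SNP is an opposite homozygote, so the four SNPs split into a run of two and a run of one. A real matcher would allow a few such mismatches before ending a segment.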

The next question is whether imputed data from a 23andMe v4 file would be good enough to fill a v3 template for uploading to Family Finder at FTDNA. I'm now feeling somewhat optimistic about this, but it should be tested. There will not be as many SNPs in a v4 file (~600K vs 700K), and that could make imputation less robust. It would take somebody with the capacity for dealing with very large files to work out an automated method. I did quite a bit of manual tweaking (but it was a learning experience).

Petr
10-25-2015, 08:34 PM
I tested the imputed results as follows. I took two files, the first from 23andMe v4 and the second from FTDNA. This is the result of comparing the two files with this tool:


302592 SNPs appear in both files

0.820 % (2480) of these 302592 common SNPs are NoCalls (i.e. NoCall in at least one file)

0.006 % (18) of the 302592 SNP results differ

3 differing SNPs are both homozygous
6 differing SNPs are homozygous in File1 but not in File2
9 differing SNPs are homozygous in File2 but not in File1
Then I created a file containing just the markers that are unique to 23andMe and compared the real 23andMe genotypes with the genotypes imputed by DNA.Land from the FTDNA file. Here are the results:

232936 SNPs appear in both files

4.582 % (10674) of these 232936 common SNPs are NoCalls (i.e. NoCall in at least one file)

2.053 % (4783) of the 232936 SNP results differ

159 differing SNPs are both homozygous
2204 differing SNPs are homozygous in File1 but not in File2
2418 differing SNPs are homozygous in File2 but not in File1
I don't know if this is a good result or not, but I'm sure it is nothing one could rely on where health is concerned.

I also generated a 23andMe v3 file from v4 + imputed markers, but https://www.familytreedna.com/autosomalTransfer/ has apparently not been working for at least two days, so I can't yet test whether FTDNA will accept it.

Ann Turner
10-25-2015, 10:48 PM
I'll look forward to hearing the results of your v4 + imputed markers. I spoofed a file with v4 data loaded into a v3 template, and the Family Finder results were quite similar for the longer segments. The file did generate more of those meaningless 1-3 cM segments, which actually fixed some cases where my son had a match not found in either parent. The additional small segments were enough to push the total cM over the (presumed) 20 cM threshold.

Thanks also for your reference to Total Commander in the other thread about differences in your imputed files. I was wondering if the lines with differences had strong genotype likelihood scores.

Ann Turner
10-26-2015, 09:57 PM
This link seems to work (the last part must be case sensitive):

https://www.familytreedna.com/AutosomalTransfer

Petr
10-27-2015, 12:30 AM
Thank you, both URLs work today. I uploaded the file and it was accepted. The parents of the uploaded person share 3196.95 and 3284.61 cM with it, with longest blocks of 224.15 and 199.60 cM respectively.

Nicolas
10-28-2015, 11:10 AM
Hi, is there a utility available to load DNA.Land imputed results into a v3 template, or should I write my own little script?

Ann Turner
10-28-2015, 09:43 PM
I've been asking around for someone who could put together a user-friendly utility to do just that. I know the logic to convert VCF to 23andMe format (including indels), but I don't have the skill set to handle such large files. Some line lengths are over 32K in size, due to a type of variant called "esv" (structural variant, with catalog numbers from the European Bioinformatics Institute).
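On the file-size problem: a script that reads one line at a time never holds the whole VCF in memory, and a 32K line is not an issue for Python. The sketch below is only an illustration with hypothetical function names. It assumes a tab-separated, single-sample VCF with the genotype in the last column, and it simply skips entries whose ID does not start with "rs" (such as the esv structural variants).

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_rs_records(lines):
    """Yield (rsid, chrom, pos, ref, alt, sample) one line at a time."""
    for line in lines:
        if line.startswith("#"):
            continue  # header and metadata lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a complete single-sample record
        chrom, pos, vid, ref, alt = fields[:5]
        if not vid.startswith("rs"):
            continue  # skips esv and other non-rs entries
        yield vid, chrom, pos, ref, alt, fields[-1]


# Tiny in-memory example with made-up positions and IDs.
example = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n",
    "22\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\n",
    "22\t200\tesv1\tA\t<CN0>\t.\tPASS\t.\tGT\t0/0\n",
]
for record in iter_rs_records(example):
    print(record)
```

With a real file the loop would be `with open_vcf(path) as handle: for record in iter_rs_records(handle): ...`, where `path` is wherever the downloaded VCF was saved.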

Ann Turner
10-28-2015, 09:45 PM
That sounds promising. Are you planning to pay for your whole list of matches?

Petr
10-28-2015, 11:58 PM
I uploaded 4 people for the free unlock.

Ann Turner
10-31-2015, 03:11 PM
Petr, in another thread you mentioned that you found differences in the imputed VCF files for two different uploads of the same FF file. Do you still have those kits at DNA.Land? If so, how many segments do the two "identical twins" show? I have an upload of an AncestryDNA and a 23andMe v4 kit, and there are quite a few breaks in the segments. I end up with 47 segments instead of the expected 22. A couple of the breaks are quite large, ~ 3 Mb.

Petr
10-31-2015, 05:54 PM
FTDNA(1) vs. FTDNA(2): # Shared Segments: 24, Total Shared Length: 3507.86 cM, break in chr9:136972587..137017847 and chr17:77186666..78417478
FTDNA(1) vs. FTDNA(3): # Shared Segments: 22, Total Shared Length: 3505.61 cM
FTDNA(1) vs. FTDNA(4): # Shared Segments: 23, Total Shared Length: 3505.16 cM, break in chr12:115016497..115073813
FTDNA(2) vs. FTDNA(3): # Shared Segments: 24, Total Shared Length: 3501.11 cM, break in chr4:4029347..4183831 and chr9:136972587..137017847
FTDNA(2) vs. FTDNA(4): # Shared Segments: 24, Total Shared Length: 3504.23 cM, break in chr9:136972587..137017847 and chr12:115016497..115073813
FTDNA(3) vs. FTDNA(4): # Shared Segments: 24, Total Shared Length: 3505.17 cM, break in chr4:4029347..4183831 and chr12:115016497..115073813
FTDNA(1) vs. 23andMe: # Shared Segments: 36, Total Shared Length: 3465.07 cM
FTDNA(2) vs. 23andMe: # Shared Segments: 44, Total Shared Length: 3465.69 cM
FTDNA(3) vs. 23andMe: # Shared Segments: 40, Total Shared Length: 3460.74 cM
FTDNA(4) vs. 23andMe: # Shared Segments: 42, Total Shared Length: 3470.01 cM
FTDNA(1) vs. Ancestry: # Shared Segments: 29, Total Shared Length: 3457.99 cM
FTDNA(2) vs. Ancestry: # Shared Segments: 31, Total Shared Length: 3451.63 cM
FTDNA(3) vs. Ancestry: # Shared Segments: 31, Total Shared Length: 3449.68 cM
FTDNA(4) vs. Ancestry: # Shared Segments: 27, Total Shared Length: 3451.87 cM
23andMe vs. Ancestry: # Shared Segments: 46, Total Shared Length: 3465.29 cM

FTDNA(1..4) are based on the same file.

GEDmatch shows 3587.1 cM between all 3 (FTDNA, 23andMe, Ancestry) files.

Ann Turner
11-23-2015, 11:40 PM
I succeeded in spoofing a v3 file from the imputed data and paid the $39 to unlock the kit. The major problem appears to be the error rate for imputed SNPs. It's not bad in absolute terms ( < 1% using my AncestryDNA kit as the gold standard), but it's enough to break up many segments into smaller pieces. My son shows 80 segments with his regular genotype kit at FTDNA and 210 segments with the phased file I uploaded (see http://www.thegeneticgenealogist.com/2015/03/30/guest-post-what-a-difference-a-phase-makes/ for background on that).

You will nonetheless get a certain percentage of the matches you would get with a 700K transfer from AncestryDNA. Here are my stats for the number of matches, broken into bins based on longest segment size.

bin        regular  imputed  percent
all           1183      821    69.4%
< 10 cM        599      349    58.3%
10-15 cM       391      292    74.7%
> 15 cM        193      180    93.3%

I achieved much better results by spoofing a v3 file with v4 data. That results in a lot of no-calls, but FTDNA seemed to give credit for a block of 100 SNPs if there were no contradictions. The longest segments were very similar, but the missing SNPs did let many more small ( < 5 cM) IBS segments sneak through. Ironically, that actually fixed some cases where my son had a match not found in either parent, apparently because the small segments would push the total over the 20 cM threshold.

If you'd like to see the effect at GEDmatch, you can look at my imputed kit M803675 (*Ann Imp Turner) and my son's phased kit PM751855M1.

Nicolas
01-26-2016, 04:34 PM
Ann, could you please outline the steps you used to upload your imputed DNA.land to FTDNA?

Ann Turner
01-26-2016, 06:59 PM
It was a very kludgy method, splitting the huge VCF file into smaller pieces, using Excel VLOOKUP formulas, and stitching the files back together. I'm sure there are better ways, but I can't recommend uploading such a file to FTDNA. The results weren't very satisfactory.
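For what it's worth, the VLOOKUP step amounts to a dictionary lookup in a script. The sketch below assumes the v3 template is available as (rsid, chromosome, position) rows and the imputed genotypes as an rsid-to-genotype mapping; both inputs and the function name are hypothetical, and it says nothing about whether the resulting file is worth uploading.

```python
def fill_template(template_rows, imputed):
    """Return 23andMe-style lines for every SNP in the template.

    template_rows: iterable of (rsid, chromosome, position) tuples
    imputed: dict mapping rsid -> two-letter genotype
    """
    lines = []
    for rsid, chrom, pos in template_rows:
        genotype = imputed.get(rsid, "--")  # no-call when nothing was imputed
        lines.append(f"{rsid}\t{chrom}\t{pos}\t{genotype}")
    return lines


# Made-up rsids and positions, only to show the output shape.
template = [("rs1", "22", "100"), ("rs2", "22", "200")]
for line in fill_template(template, {"rs1": "AG"}):
    print(line)
```

Because the lookup is a dictionary rather than a spreadsheet formula, there is no need to split the file into pieces and stitch it back together.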

Mellifluous
03-18-2016, 04:37 AM
Hello Petr, I'd like to know how you were able to create a v4 + imputed markers raw file. Which tools did you use? Thanks!

Petr
03-18-2016, 09:07 AM
I'm really sorry but I don't remember how I did this. Probably I used Visual FoxPro just because I'm able to work with it.

pofig37
01-10-2017, 01:43 PM
I came across a tool written by Sean Hewitt to convert an imputed DNA.Land VCF file into a 23andMe v3 file (https://github.com/SeanHewitt/vcf-to-23andme). It's a Python script, so you will also need to install Python (https://www.python.org/downloads/) to run it.

I was able to take my VCF file from DNA.land that was imputed from 23andme v4 data, convert it to v3 set, upload to FTDNA and unlock it successfully.

Ann Turner
01-10-2017, 03:36 PM
Be aware that imputation introduces errors, enough to split up some segments. The attached file shows how my sister compares with my regular file and the imputed file I uploaded to FTDNA. The errors do seem to cluster in certain regions. My son ends up with 84 segments when compared to me, vs the expected 23 segments.

Helves
04-05-2018, 07:18 PM
Is the imputed VCF file from DNA.Land compatible with GEDmatch Genesis? I'm trying to upload my VCF file (which was loaded to DNA.Land as a 23andMe v5 file) to GEDmatch, but it's giving me an error.

vettor
06-07-2018, 06:17 PM
Can these very large DNA.Land VCF files be used in some way with other programs or services, like YFull?

Miqui Rumba
09-23-2018, 08:29 AM
No, imputed VCFs contain neither chrY SNPs nor mtDNA markers. These kits are really sequencing speculation based on the SNPs of a genealogical kit. I have had a Gencove imputed VCF in GEDmatch Genesis for a year; it is based on my 23andMe v4 kit, yet my 23andMe Genesis kit does not match the imputed kit! Gencove uses Michigan imputation (Minimac) and its genotypes are phased (32M SNPs), while DNA.Land uses Oxford imputation without phasing and the obsolete VCF 4.0 version (but has more SNPs, 39M). Genes for Good offers another, smaller imputed VCF (8M SNPs, Michigan imputation) that can be converted to 23andMe v3 format with Wilhelm HO's DNA Kit Studio, with results as good as with the DNA.Land imputed VCF as input. There is an imputing app at Sequencing.com that may impute some chrY SNPs, but I could not trust the results: imputation gives terrible results for haplogroup prediction.