Results 1 to 10 of 21

Thread: DNA.Land imputed VCF file

  1. #1
    Registered Users
    Posts
    210
    Sex

    DNA.Land imputed VCF file

    The following was cross-posted to 23andMe and an ISOGG Facebook thread about the imputed VCF file.

    The file I uploaded to DNA.Land was an AncestryDNA file. I also have v4 data from 23andMe, which I knew contained a fair number of SNPs not included in AncestryDNA. There is a very good concordance rate for my 23andMe and AncestryDNA results (only 14 SNPs out of the 299672 overlapping SNPs had different genotype calls), so the chip method is very robust. My goal was to see if the imputed results matched the 23andMe results even when AncestryDNA had no data.

    Handling the humongous VCF file took a lot of finagling, so I ended up working with just chromosome 22. The VCF file for C22 has 489,000 rows with columns for REF / ALT alleles. The last column contains 0/0 if the SNP is homozygous for the reference allele, 0/1 if the SNP is heterozygous, and 1/1 if the SNP is homozygous for the alternate allele. I used spreadsheet formulas to convert the VCF format to 23andMe's format.
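
A minimal sketch of the genotype conversion described above, as an alternative to spreadsheet formulas. The function name `vcf_to_23andme` is hypothetical, and the sketch assumes a single-sample VCF, single-base alleles, and that the two letters should be written in alphabetical order with `--` for a no-call. Indels are returned as no-calls here because they need separate I/D coding.

```python
def vcf_to_23andme(ref: str, alt: str, sample_field: str) -> str:
    """Translate a VCF sample field into a two-letter genotype string."""
    # GT is the first colon-separated subfield of the sample column.
    gt = sample_field.split(":")[0]
    # Accept both unphased ("/") and phased ("|") separators.
    indexes = gt.replace("|", "/").split("/")
    if len(indexes) != 2 or "." in indexes:
        return "--"  # assumed no-call marker
    # Allele index 0 is REF; 1 and above are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")
    called = [alleles[int(i)] for i in indexes]
    if any(len(allele) != 1 for allele in called):
        return "--"  # indels are not handled in this sketch
    return "".join(sorted(called))
```

For example, `vcf_to_23andme("A", "G", "0/1")` gives `AG`, and a sample field of `1/1` with the same alleles gives `GG`.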

    For C22 there were 3615 SNPs with 23andMe data not included in my AncestryDNA results. 51 of those were no-calls, and the imputed file supplied values for 35 of those. 230 SNPs in the 23andMe set did not get any imputed values.

    That left 3350 SNPs with data from both sources. 3238 of those (96.7%) were identical. Of the 112 SNPs with different values (3.3%), 55 were homozygous in the imputed file and heterozygous in the 23andMe file, 50 were the reverse, and 7 were opposite homozygotes.
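
The breakdown above can be reproduced with a small tally over two genotype tables. This is only a sketch: `classify` and `tally` are hypothetical names, the inputs are assumed to be dictionaries mapping rsID to a two-letter genotype, and `--` or `00` are assumed to mark no-calls.

```python
from collections import Counter

NO_CALLS = {"--", "00"}  # assumed no-call markers


def classify(a: str, b: str) -> str:
    """Label how two unphased genotype calls for the same SNP compare."""
    if a in NO_CALLS or b in NO_CALLS:
        return "no_call"
    if len(a) != 2 or len(b) != 2:
        return "not_diploid"  # e.g. single-letter calls; skipped in this sketch
    if sorted(a) == sorted(b):
        return "identical"  # allele order is ignored, so AG equals GA
    hom_a, hom_b = a[0] == a[1], b[0] == b[1]
    if hom_a and hom_b:
        return "opposite_homozygotes"
    if hom_a:
        return "hom_in_first_only"
    if hom_b:
        return "hom_in_second_only"
    return "both_heterozygous"


def tally(first: dict, second: dict) -> Counter:
    """Count each comparison category over the SNPs present in both tables."""
    shared = first.keys() & second.keys()
    return Counter(classify(first[rsid], second[rsid]) for rsid in shared)
```

Dividing the `identical` count by the number of SNPs called in both files gives the concordance rate; with the figures above, 3238 / 3350 is 96.7%.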

    When working with genotype (unphased) data for matching purposes, heterozygous results give you a free pass, since they are a universal match. Using imputed data might let a long consecutive run of half-identical SNPs continue a bit longer, but sooner or later (usually sooner) opposite homozygotes will crop up. Opposite homozygotes terminate the long consecutive run, but the algorithms build in some tolerance for genotyping error, so these rare errors will probably not have much of an impact on breaking up a matching segment.
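
To illustrate why heterozygous calls are a "free pass" and why only opposite homozygotes break a run, here is a rough sketch of half-identical matching. It is not the algorithm any testing company actually uses: the names `half_identical` and `longest_run` and the `max_mismatches` tolerance parameter are assumptions for illustration, and no-calls are assumed to have been filtered out beforehand.

```python
from collections import deque


def half_identical(a: str, b: str) -> bool:
    """True when two unphased genotypes share at least one allele."""
    return bool(set(a) & set(b))


def longest_run(pairs, max_mismatches: int = 0) -> int:
    """Longest stretch of consecutive SNPs containing at most
    max_mismatches positions that are not half-identical.
    Tolerated mismatches are counted in the returned length."""
    best = 0
    start = 0  # index of the first SNP in the current window
    mismatches = deque()  # positions of mismatches inside the window
    for i, (a, b) in enumerate(pairs):
        if not half_identical(a, b):
            mismatches.append(i)
            if len(mismatches) > max_mismatches:
                # Drop the oldest mismatch by restarting just after it.
                start = mismatches.popleft() + 1
        best = max(best, i - start + 1)
    return best
```

With a biallelic SNP, a heterozygous call such as `AG` shares an allele with `AA`, `AG`, and `GG` alike, so only a pair like `AA` versus `GG` can end a run when `max_mismatches` is 0.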

    The next question is whether imputed data from a 23andMe v4 file would be good enough to fill a v3 template for uploading to Family Finder at FTDNA. I'm now feeling somewhat optimistic about this, but it should be tested. There will not be as many SNPs in a v4 file (~600K vs 700K), and that could make imputation less robust. It would take somebody with the capacity for dealing with very large files to work out an automated method. I did quite a bit of manual tweaking (but it was a learning experience).

  2. The Following 9 Users Say Thank You to Ann Turner For This Useful Post:

     AJL (11-23-2015),  anglesqueville (10-16-2015),  Cinnamon orange (03-21-2016),  dp (10-07-2016),  Mellifluous (03-17-2016),  MikeWhalen (10-13-2015),  MKitching57 (01-07-2017),  Sangarius (10-13-2015),  Táltos (10-13-2015)

  3. #2
    Registered Users
    Posts
    440
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    I tested the imputed results the following way. I took two files, the first 23andMe v4, the second FTDNA. This is the result of comparing these two files with this tool:

    Code:
    302592 SNPs appear in both files 
    
    0.820 % (2480) of these 302592 common SNPs are NoCalls (i.e. NoCall in at least one file)
    
    0.006 % (18) of the 302592 SNP results differ
    
    3 differing SNPs are both homozygous
    6 differing SNPs are homozygous in File1 but not in File2
    9 differing SNPs are homozygous in File2 but not in File1
    Then I created a file containing just the markers that are unique to 23andMe and compared the real 23andMe genotypes with the genotypes imputed by DNA.Land from the FTDNA file. Here are the results:
    Code:
    232936 SNPs appear in both files 
    
    4.582 % (10674) of these 232936 common SNPs are NoCalls (i.e. NoCall in at least one file)
    
    2.053 % (4783) of the 232936 SNP results differ
    
    159 differing SNPs are both homozygous
    2204 differing SNPs are homozygous in File1 but not in File2
    2418 differing SNPs are homozygous in File2 but not in File1
    I don't know whether this is a good result or not, but I'm sure it is nothing one could rely on for health purposes.

    I also generated a 23andMe v3 file from v4 + imputed markers, but https://www.familytreedna.com/autosomalTransfer/ has apparently not been working for at least two days, so I cannot yet test whether FTDNA will accept it.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  4. The Following 3 Users Say Thank You to Petr For This Useful Post:

     Ann Turner (10-26-2015),  Táltos (10-26-2015),  ViktorL1 (09-10-2017)

  5. #3
    Registered Users
    Posts
    210
    Sex

    I'll look forward to hearing the results of your v4 + imputed markers. I spoofed a file with v4 data loaded into a v3 template, and the Family Finder results were quite similar for the longer segments. The file did generate more of those meaningless 1-3 cM segments, which actually fixed some cases where my son had a match not found in either parent. The additional small segments were enough to push the total cM over the (presumed) 20 cM threshold.

    Thanks also for your reference to Total Commander in the other thread about differences in your imputed files. I was wondering if the lines with differences had strong genotype likelihood scores.

  6. The Following User Says Thank You to Ann Turner For This Useful Post:

     Táltos (10-26-2015)

  7. #4
    Registered Users
    Posts
    210
    Sex

    This link seems to work (the last part must be case sensitive)

    https://www.familytreedna.com/AutosomalTransfer

  8. #5
    Registered Users
    Posts
    440
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    Thank you, both URLs work today. I uploaded the file and it was accepted. The parents of the uploaded person show 3196.95 and 3284.61 shared cM, with longest blocks of 224.15 and 199.60 cM respectively.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  9. The Following 2 Users Say Thank You to Petr For This Useful Post:

     Táltos (10-27-2015),  vettor (10-27-2015)

  10. #6
    Junior Member
    Posts
    3
    Sex

    Hi, is there a utility available to load DNA.Land imputed results into a v3 template, or should I write my own little script?

  11. #7
    Registered Users
    Posts
    210
    Sex

    Quote Originally Posted by Nicolas View Post
    Hi, is there a utility available to load DNA.Land imputed results into a v3 template, or should I write my own little script?
    I've been asking around for someone who could put together a user-friendly utility to do just that. I know the logic to convert VCF to 23andMe format (including indels), but I don't have the skill set to handle such large files. Some line lengths are over 32K in size, due to a type of variant called "esv" (structural variant, with catalog numbers from the European Bioinformatics Institute).
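
One possible approach to the file-size problem is to stream the VCF one line at a time and keep only the records whose ID appears in the template, so the whole file never has to sit in memory. The sketch below is untested against a real DNA.Land file; `wanted_records` is a hypothetical helper, and it assumes a tab-delimited, single-sample VCF in which the ID column may hold several semicolon-separated identifiers (for example an rsID alongside an "esv" entry).

```python
from typing import Iterable, Iterator, Set, Tuple


def wanted_records(
    lines: Iterable[str], wanted_ids: Set[str]
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (rsid, ref, alt, sample_field) for records listed in wanted_ids."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a complete single-sample record
        # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
        for rsid in fields[2].split(";"):
            if rsid in wanted_ids:
                yield rsid, fields[3], fields[4], fields[-1]


# Usage sketch (the file name is hypothetical):
# import gzip
# with gzip.open("imputed.vcf.gz", "rt") as handle:
#     for rsid, ref, alt, sample in wanted_records(handle, {"rs123"}):
#         print(rsid, ref, alt, sample)
```

Iterating over a file object in Python reads one line per step whatever its length, so lines longer than 32K should not need special handling.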

  12. The Following User Says Thank You to Ann Turner For This Useful Post:

     Mellifluous (03-18-2016)

  13. #8
    Registered Users
    Posts
    210
    Sex

    That sounds promising. Are you planning to pay for your whole list of matches?

  14. #9
    Registered Users
    Posts
    440
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    I uploaded 4 people for free unlock.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  15. #10
    Registered Users
    Posts
    210
    Sex

    Petr, in another thread you mentioned that you found differences in the imputed VCF files for two different uploads of the same FF file. Do you still have those kits at DNA.Land? If so, how many segments do the two "identical twins" show? I have an upload of an AncestryDNA and a 23andMe v4 kit, and there are quite a few breaks in the segments. I end up with 47 segments instead of the expected 22. A couple of the breaks are quite large, ~ 3 Mb.
