
Dante Labs (WGS)




MacUalraig
03-15-2019, 07:53 AM
Any advice on sources for their interpretation?

They contain the reads before they were aligned to the reference genome and made into the BAM (with a tool like bwa). So they are of use in their own right if you want to do a different alignment - did they already give you an hg38-aligned BAM? If so, they're of less use.
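If you do want to try a realignment yourself, a minimal sketch of the usual bwa route (file names here are hypothetical; assumes bwa and samtools are installed and you have an hg38 reference FASTA to hand):

bwa index hg38.fa    # one-off indexing of the reference, takes a while
bwa mem -t 8 hg38.fa sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -@ 8 -o sample.hg38.bam -
samtools index sample.hg38.bam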

ybmpark
03-15-2019, 02:13 PM
Their system was under maintenance for about 10 days and it has been restored now, but my record is entirely wiped out. Even my order history.
They exhibit every characteristic of a fraudulent operation, and they are pretty smart; I bought during the Thanksgiving promotion and it is now way beyond 2 months, so I cannot work out a refund with PayPal.
All their replies seem automatically generated.

MacUalraig
03-16-2019, 08:43 AM
Their system was under maintenance for about 10 days and it has been restored now, but my record is entirely wiped out. Even my order history.
They exhibit every characteristic of a fraudulent operation, and they are pretty smart; I bought during the Thanksgiving promotion and it is now way beyond 2 months, so I cannot work out a refund with PayPal.
All their replies seem automatically generated.

I've had my results, thanks, so nice try, but that's a long way short of fraudulent. Slow - yes. That's why it's cheap.

In the DNA business delays are commonplace. For example, one of their rivals failed my sample after having had it for five months. Now I'm waiting for sample 2 to be tried.

ybmpark
03-16-2019, 09:02 PM
...so nice try ...
Nice try at unprovoked aggression.

Miqui Rumba
03-17-2019, 11:48 AM
If you request both BAM and FASTQ files, then theoretically you can realign the FASTQ against hg38, for example on usegalaxy.org. The problem is that I detected some duplicated FASTQ files and missing reverse pairs in some WGS data sets. My WES forward FASTQ doesn't pass FastQC, and I have terrible problems aligning it to the whole genome (neither hg37 nor hg38; only Bowtie2 and HISAT2 against hg38 work). Picard has a tool to revert a BAM to FASTQ files, although it is a risky process.
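For reference, a rough sketch of that reversion (file names hypothetical; the reads must be grouped by name first or the pairs come out scrambled, which is where it gets risky):

samtools collate -u -o collated.bam sample.bam    # group mates together by read name
samtools fastq -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -0 /dev/null -s /dev/null -n collated.bam
# or the Picard equivalent:
java -jar picard.jar SamToFastq I=sample.bam FASTQ=sample_R1.fastq.gz SECOND_END_FASTQ=sample_R2.fastq.gz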

MacUalraig
03-17-2019, 12:58 PM
If you request both BAM and FASTQ files, then theoretically you can realign the FASTQ against hg38, for example on usegalaxy.org. The problem is that I detected some duplicated FASTQ files and missing reverse pairs in some WGS data sets. My WES forward FASTQ doesn't pass FastQC, and I have terrible problems aligning it to the whole genome (neither hg37 nor hg38; only Bowtie2 and HISAT2 against hg38 work). Picard has a tool to revert a BAM to FASTQ files, although it is a risky process.

I'm also having a look at aligning; not sure which direction to go at the moment, as I have more hg19 files than hg38, so I might try aligning my YSEQ data back to hg19 whilst I await my hg38 Dante BAM/FASTQ.

Has anyone tried running a Dante VCF through ReMap? I had to change this line:
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

to
Number=.

ie
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
which then ran, but I haven't got it into shape enough to load into IGV yet.
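For anyone else hitting this, the header tweak can be done with a one-liner instead of a text editor (file names hypothetical; that ID=AD string only occurs on the header line, so a blanket substitution is safe here):

sed 's/ID=AD,Number=R/ID=AD,Number=./' dante.vcf > dante.remap.vcf
# or, for a bgzipped VCF:
zcat dante.vcf.gz | sed 's/ID=AD,Number=R/ID=AD,Number=./' | bgzip > dante.remap.vcf.gz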

Jan_Noack
03-18-2019, 06:00 AM
I ordered on 4th Mar 19, and I had a look at the tracking this morning. It came from the US to Sydney, Australia in a couple of days, sat at the airport for a couple of days, and was released on 12th Mar to cover the last 20 miles through the suburbs of Sydney to me. I just went out to the post box and there it was (Mon 18th). I logged on again and it's been deleted from the USPS tracking system, but I have an email from Dante saying it's arrived. If I just looked at the USPS tracking number, it would lead me to think the package had not yet arrived at their site: "tracking Number: LZ009036991US
Status
Label Created, not yet in the system" - and yet an hour ago it gave me the full tracking history.
Once in Australia, after the airport there is no tracking, just standard post, and yet they seem to know it has been delivered.
Unfortunately the free return post box only works in the US... so that will be $3 without tracking or $24 with tracking for return post. Seems a tad wasteful to give me a free return USPS box for use in the US only?

Anyway, nothing fraudulent so far at all; with others receiving results as well, I'm hopeful.

I haven't done any computing for about 35 years now, so I'm not up to scratch with the latest tech, LOL... but I'm hoping someone may be able to tell me what to do when I get these results back... or maybe just send them to Full Genomes for processing?

I'd love to be able to merge the FTDNA BigY results and Y-STR 111 results into the file produced by Dante though... in theory, it seems it should be easy enough. Has anyone done this, i.e. merged BAMs, or some other format, or would this only be possible with the raw data? Guess I need to know the formats and spend a load of time getting "into" it all. It just seems that should be the way to go to get the best of both worlds in the processing (if at all possible): the WGS 30X AND the deeper testing of the smaller number of FTDNA SNPs and STRs.
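For what it's worth, the mechanical merge itself is a one-liner with samtools, but it is only meaningful if both BAMs were aligned to the same reference build - a BigY hg38 BAM and a Dante hg19 BAM would need one of them remapped first. A sketch, file names hypothetical:

samtools merge merged.bam dante.bam bigy.bam
samtools index merged.bam

Whether any analysis service would accept such a mixed-provider BAM is another question.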


Is anyone doing this, or would Full Genomes take all your testing and merge it? I feel a bit more comfortable with Full Genomes than YFull, as some said that YFull is in Russia? I haven't checked this though... and I'm not biased much, but maybe the US seems "safer" long term with data... probably much of a muchness though. I gather most send to YFull, but I think Full Genomes does the same for about the same cost?

MacUalraig
03-18-2019, 07:53 AM
YFull and FGC are both run by fine people who are well respected in the community. I use YFull for analysis and yes I use their payment system too (as I cannot stand PayPal nor they me). Probably best if we leave it at that rather than start up on which foreign countries do this or that with your data!

Tonev
03-18-2019, 12:00 PM
YFull and FGC are both run by fine people who are well respected in the community. I use YFull for analysis and yes I use their payment system too (as I cannot stand PayPal nor they me). Probably best if we leave it at that rather than start up on which foreign countries do this or that with your data!

I found a discussion here from 2015 (https://anthrogenica.com/showthread.php?4908-YFull-vs-FGC). Are there more recent comments on the topic?

Tonev
03-20-2019, 09:46 AM
After it all, my WGS raw data from Dante Labs was finally received on an HDD; now shifting to YFull for detailed Y + mtDNA analysis. Sharing first statistics.

Giosta
03-20-2019, 03:11 PM
How long does it take from ordering the hard disk ... to receiving it?

Tonev
03-20-2019, 08:29 PM
How long does it take from ordering the hard disk ... to receiving it?

It took us (me and Dante) quite a bit longer than expected:

2018.08.15: Ordered WGS 30x (with return shipment included + explicit requirement for an HDD with raw data included)

2018.08.29: DHL shipment initiated by Dante Labs
2018.09.04: DHL return shipment departed from me to Dante (after some problems with the label for the pre-paid return shipment)
2018.09.05: DHL shipment delivered to Dante (results expected in 10-12 weeks according to Dante's marketing, i.e. 06-20 Nov. 2018)

2018.12.07: Link to my rare diseases file uploaded on Dante's web site, plus a refund notification.
2018.12.09: Link to my VCF file finally uploaded on Dante's web site.

2018.12.09: Question about the other raw data (BAM, FASTQ, FASTA).
2018.12.09: Dante Labs replied that if I need the pre-paid raw data cited above, they can add me to their HDD shipping queue?!
2018.12.10: My demand for shipping without any further unnecessary confirmations, plus a tracking number for the shipment.

2019.03.13: 30 weeks after receiving the VCF (~43 weeks after ordering), DHL shipment by Dante of the HDD with all the raw data.

At the end of the day, the price-quality ratio proved good enough for me (and hopefully this is shaping an industry-standard trend rather than a limited-time project with EU funding for, let's say, rare diseases... which could explain their web site and other issues). Anyway, a realistic time-frame offer could have helped, and still can help, Dante's image and our patience ;)
Thank god everything is behind me now; shifting to the next stage. Good luck to all of you guys and see you in other forums.

MacUalraig
03-22-2019, 08:45 AM
There is some Dante data on the PGP site if anyone is interested in having a peek - also some Gencove* (Nebula Genomics?) low-read examples and a Chromium LR too.

https://my.pgp-hms.org/public_genetic_data

To save anyone wasting a download, sample hu57D7DD looks to be hg E-P147. But if you just want an idea of data quality go ahead as there is a Y only extract BAM.

* their drop down list has 'Gencove (eg Nebula Genomics)', not sure who sold the individual test.

bjp
03-22-2019, 12:27 PM
Can somebody who actually received their BAM data on disk (preferably someone in/shipped-to the USA) let me know which email address at Dante they contacted to make that request before it actually went through? I've sent another request for BAM data for this kit with no response, though they did tell me months ago I would be in line to receive BAM data. Of course, they also said the VCF would be in hg38 and that didn't happen either so I am not particularly confident in receiving BAM data until I have a disk in hand. I have been using the generic contact @ dantelabs.com support email address. I don't know if it is a language barrier or just weak processes on their side but their replies have never inspired confidence, even if the VCF data looks legit.

MacUalraig
03-22-2019, 12:48 PM
Can somebody who actually received their BAM data on disk (preferably someone in/shipped-to the USA) let me know which email address at Dante they contacted to make that request before it actually went through? I've sent another request for BAM data for this kit with no response, though they did tell me months ago I would be in line to receive BAM data. Of course, they also said the VCF would be in hg38 and that didn't happen either so I am not particularly confident in receiving BAM data until I have a disk in hand. I have been using the generic contact @ dantelabs.com support email address. I don't know if it is a language barrier or just weak processes on their side but their replies have never inspired confidence, even if the VCF data looks legit.

Did you pay and place an order via

https://www.dantelabs.com/products/500-gb-hard-disk-containing-your-raw-data

Order History
Order Date Payment Fulfillment Total
#EU56xxW March 06, 2019 Paid Unfulfilled €59.00 EUR
#EU18xxW July 17, 2018 Paid Fulfilled €299.00 EUR

Giosta
03-22-2019, 12:53 PM
I ordered the BAM+FASTQ hard disk today. I used WELCOME10 discount code to get 10% lower price.

Erikl86
03-22-2019, 04:33 PM
I've received this email from Dante Labs today:

"Your DNA Sequencing was completed with success
Dear Customer,

We're excited to let you know that we completed the sequencing on your sample.

We are now preparing the reports and the data for delivery.

Please do not hesitate to contact me if you need any further assistance.

Best regards,

Dante Labs' Team"

How long should I expect before seeing my report and raw data?

bjp
03-22-2019, 05:19 PM
Did you pay and place an order via

https://www.dantelabs.com/products/500-gb-hard-disk-containing-your-raw-data


Thank you so much for that URL! No I did not, and no company rep ever directed me to it when previously telling me that I would be in the shipping queue.

(Edit to add: that link shows an order price in Euros when I visit it not logged in, but redirects me to the base us.dantelabs.com site when I visit it while logged in to my account. The runaround to get a simple answer out of that group is really something.)

JamesKane
03-23-2019, 12:01 AM
So Dante Labs is making you all pay around 70 Euro for something their rep said would be a free download during the Black Friday sale window? That is quite disappointing.

bjp
03-23-2019, 12:29 PM
So Dante Labs is making you all pay around 70 Euro for something their rep said would be a free download during the Black Friday sale window? That is quite disappointing.

I'm going to be nagging them twice a week until they get back to me, but for my (USA) kit they have intimated repeatedly that I was in the queue to have a disk sent but never hinted at a need for payment.

MacUalraig
03-23-2019, 10:12 PM
So Dante Labs is making you all pay around 70 Euro for something their rep said would be a free download during the Black Friday sale window? That is quite disappointing.

I bought before the Black Friday sale, but to be honest I've stopped bothering to follow the ins and outs of the claims about downloads and extra costs. Even with the extra for the disk it still comes in way cheaper than YSEQ (who charge extra for a disk but are open about it from the start) and FGC. I haven't yet had anything too big to download from FGC so can't remember if they include disk delivery?

Jan_Noack
03-25-2019, 11:30 AM
Wouldn't a blank disk + postage cost almost that much? I'm not concerned about the cost either, considering the comparative cost with Full Genomes and YSEQ.

JamesKane
03-25-2019, 10:25 PM
It may cost about that much for materials and postage, but that isn't my concern. Since Dante Labs must possess the BAM prior to generating your call files, it's a minute amount of additional effort to put the BAM into a cloud-storage solution with an expiring token to allow you to download it directly.

All the disk is doing is adding to complaints by their customers about the timeliness of results and generating more ill will, since they said there would be a free download solution.

MacUalraig
03-26-2019, 07:40 AM
It may cost about that much for materials and postage, but that isn't my concern. Since Dante Labs must possess the BAM prior to generating your call files, it's a minute amount of additional effort to put the BAM into a cloud-storage solution with an expiring token to allow you to download it directly.

All the disk is doing is adding to complaints by their customers about the timeliness of results and generating more ill will, since they said there would be a free download solution.

I believe the calling is done before return to Dante... (based on clues in the VCF header). Maybe the arrangement with the lab is that the BAMs follow at a more leisurely pace eg in a batch whereas the VCFs come back as soon as each one is ready.

My FGC WGS from 2015 had two separate pipelines and VCFs, one from the sequencing lab and the other from FGC.

ntk
03-26-2019, 05:40 PM
This shouldn't surprise anyone but I just tried this code on the US site and it doesn't work. I did get an order in in February and the returned kit was confirmed "delivered to the garage or an alternate location at the address at 3:39 pm on February 25, 2019 in DRAPER, UT 84020."

Sadly the status is still unchanged as "Waiting confirmation from Dante Labs". I suppose others in my cohort are still waiting on acknowledgment too? I'm in it for the long haul, as long as they don't go bankrupt and I get my results in the next year or so I'll be okay with that.
Anyone else have updates from the February sale? It's been a month since they got my kit and I'm still "Waiting confirmation." I'll contact support at this point but I'm interested to know how others in this batch are going. Not expecting results in a long time but acknowledgment they got my kit.

poi
03-26-2019, 08:41 PM
My BlackFriday purchase’s results are ready including the download. I haven’t checked it yet, just got the email.

aaronbee2010
03-26-2019, 09:02 PM
My BlackFriday purchase’s results are ready including the download. I haven’t checked it yet, just got the email.

Mine still says "waiting for confirmation from Dante Labs".

RIP

aaronbee2010
03-26-2019, 09:17 PM
Anyone else have updates from the February sale? It's been a month since they got my kit and I'm still "Waiting confirmation." I'll contact support at this point but I'm interested to know how others in this batch are going. Not expecting results in a long time but acknowledgment they got my kit.

Same here man. I sent them a message on Facebook 6 days ago but no response (they haven't even seen the message). It's probably worth sending them an email.

poi
03-27-2019, 03:50 AM
I have the Dante Labs VCF (almost 1 gig) and want to convert it to 23andMe, Ancestry or FTDNA (any one of them) format. Unfortunately, I don't have the references for those companies, so the straight-up "conversion" from Wilhelm's DNA Kit Studio is useless, with many no-calls due to the missing references. Has anyone solved this issue? Or should I just wait for the BAM files?

To illustrate my issue, the VCF->23andMe v3 converted raw file only has "96,578 SNPs" for the HarappaWorld calculator. That's very low. Obviously DNA Kit Studio did not output with the references.

MacUalraig
03-27-2019, 08:19 AM
I have the Dante Labs VCF (almost 1 gig) and want to convert it to 23andMe, Ancestry or FTDNA (any one of them) format. Unfortunately, I don't have the references for those companies, so the straight-up "conversion" from Wilhelm's DNA Kit Studio is useless, with many no-calls due to the missing references. Has anyone solved this issue? Or should I just wait for the BAM files?

To illustrate my issue, the VCF->23andMe v3 converted raw file only has "96,578 SNPs" for the HarappaWorld calculator. That's very low. Obviously DNA Kit Studio did not output with the references.

Have you had a go uploading the VCF itself? Louis Kessler reported his experiences here:

http://www.beholdgenealogy.com/blog/?p=2879

Some of his conclusions however shouldn't pass without comment. He closes with this statement:

"I don’t see that the WGS test provides enough added utility to make it something genetic genealogists need for matching purposes."

However he hasn't even mentioned YFull - ok he hasn't got his BAM file yet so maybe he will in due course but uploading to YFull is absolutely relevant to genetic genealogy so just focussing on GEDMatch is too restrictive.

poi
03-28-2019, 09:07 PM
Have you had a go uploading the VCF itself? Louis Kessler reported his experiences here:

http://www.beholdgenealogy.com/blog/?p=2879

Some of his conclusions however shouldn't pass without comment. He closes with this statement:

"I don’t see that the WGS test provides enough added utility to make it something genetic genealogists need for matching purposes."

However he hasn't even mentioned YFull - ok he hasn't got his BAM file yet so maybe he will in due course but uploading to YFull is absolutely relevant to genetic genealogy so just focussing on GEDMatch is too restrictive.

WGS is not for GEDmatch, because those calculators are old and rely on SNPs from 23andMe v3 and similar old chips. But WGS is absolutely necessary for medical genetics, and WGS holds the data for anything genetics-related for a lifetime (including YFull etc.).

As for my problem of the variant-only VCF file from Dante Labs (as my BAM hard disk has yet to arrive), I have 2 choices:

1. Grab the hg19 reference and use my Dante Labs VCF to build the mother of all SNP-call files (sketched below). Obviously it won't be as good as one generated from the BAM file, since the VCF only has the "PASSED" calls, but close enough I suppose.
2. Wait for the BAM hard disk, then learn and use the tools manually to process the files (which might take days), or upload to the cloud and pay for Sequencing.com's EVE app to process it within a few hours.
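As a rough illustration of option 1 (a sketch only; template.txt is a hypothetical chip-position list of rsid/chrom/pos/ref, and the whole approach quietly assumes every template position absent from the variant-only VCF was sequenced and is homozygous reference - exactly the risky part):

# positions that do appear in the variant-only VCF
bcftools query -f '%CHROM\t%POS\n' variants.vcf.gz | sort > in_vcf.txt
# any template position not in the VCF gets written out as homozygous reference
awk 'NR==FNR {seen[$1"\t"$2]=1; next}
     !(($2"\t"$3) in seen) {print $1"\t"$2"\t"$3"\t"$4$4}' in_vcf.txt template.txt > homref_calls.txt

The positions that are in the VCF still need their genotypes translated from the GT field separately.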

aaronbee2010
03-29-2019, 11:40 AM
Mine still says "waiting for confirmation from Dante Labs".

RIP


Same here man. I sent them a message on Facebook 6 days ago but no response (they haven't even seen the message). It's probably worth sending them an email.

I received a reply to my message on Facebook from a Dante Labs representative informing me that my DNA was extracted, had passed quality control and is now in the process of being sequenced.

They've given me a rough date estimate for my results by the end of June. 4 months from them receiving the sample to publishing the results is absolutely brilliant for the price I paid (if they do manage to fulfill the date estimate they gave me, of course).

bjp
03-29-2019, 01:27 PM
I'm going to be nagging them twice a week until they get back to me, but for my (USA) kit they have intimated repeatedly that I was in the queue to have a disk sent but never hinted at a need for payment.

I received a response a couple days ago apologizing for the delay and asking if I had an order number for the hard disk delivery. I told the rep the order number for the kit, and that if another order is required for the hard disk delivery to please tell me how to place that order, and pasted in comments from reps months ago saying I was in the HD queue.

Awaiting further details. It would be great to get an actual answer.

Tonev
03-31-2019, 05:05 PM
Just a part of my comments to another angry DL customer in the FB YFULL group that might be useful to you too: ... Having had the HDD folder structure and file names plus the web link to the VCF, I tested for online availability of the bam file and some others. With no success... So obviously, that is their decision for delivery...

Donwulff
03-31-2019, 05:46 PM
For example Amazon Web Services charges I believe $0.12 per gigabyte of outgoing bandwidth, so 100GB BAM file would cost 12 dollars to transfer. If people had to fetch this multiple times (Say 3 analysis services and maybe trying to download it via web browser 5 times unsuccessfully themselves) it starts quickly costing way more for the provider. I'm not sure if that's the case here, but just a viewpoint. I'm not sure how Sequencing.com does it for free (Or did last I checked) though. Another possibility would be they don't have the outgoing bandwidth to upload it anywhere themselves, though less credible.

Either way "sneakernet" is in many ways reasonable, but it's unfortunate I believe they kept promising online delivery but didn't do it. Right now their FAQ straight out says "When your results are ready, your Dante Labs account will have a link to download the VCF file(s) directly onto your computer or mobile device (about 150-200 MB of space). You can also ask us to send your BAM and FASTQ files via a hard drive if you wish (about 200-250 GB)." and "We would suggest using the gVCF app by our partner Sequencing.com. You can access the app on the Sequencing.com website, costing you less than $20." It always seemed to me like the easiest solution for them would be to offer to set it up on Sequencing.com for the customer, but there could be legal and other liabilities involved with that.

Sequencing.com is still saying "Dante Labs is a Preferred Provider of whole genome sequencing (WGS) and microarray genetic testing. They make it easy to access your genetic data via Sequencing.com.

If you're a client of Dante Labs, please email [email protected] with your Kit ID and your request to have your data files imported into your Sequening.com account." so has anybody tried that?

Tonev
03-31-2019, 07:20 PM
Preferred or not, Sequencing.com charges for each piece of raw data, usually for a 12-month period... With the current DL delivery schedule, if you are impatient you will have to pay: 1. for the VCF file analysis, 2. for the BAM file analysis... for instance for their Data Viewer Plus... Yet one of the HDD files has, for instance, all your pathogenic/risk data... for which Sequencing.com would possibly charge you as one or several different services/apps.

Location of the cited file (HDD/possible online): weblink\your-alfa-numerical-number\result_variation\sv\your-alfa-numerical-number.sv.gene.csv

Donwulff
03-31-2019, 08:59 PM
I'm not sure what you mean. https://sequencing.com/knowledge-center/free-unlimited-dna-genome-data-storage says "You will never be charged for DNA data storage, bandwidth, or processing. We've experienced companies that charge fees for storage, bandwidth and processing and each time we're left scratching our head." I have had almost a terabyte of genetic data on Sequencing.com for two years, and have never been charged a penny.

Admittedly they play loose with the definition of "processing", since they're selling third party app access. However, if you're asking for services beyond storage & downloading, you're actually asking for additional services for free. Sequencing.com does offer data viewer, basic wellness app and EvE free for most basic variant calling pipeline. If you were to set up a local or cloud-based bioinformatics environment, it would cost much, much more.

It's my understanding that Dante Labs provides the basic reports that they promise at the point of sale via their web site; people are just complaining about getting the BAM/FASTQ raw data. I received mine on a USB drive and uploaded them to Sequencing.com (a couple of years ago), so I don't have a reason to ask Dante Labs to upload it to Sequencing.com, but I was wondering if that would solve people's problem, since Sequencing.com claims they support that method, and people appear to be complaining about the lack of online delivery of the BAM.

Using and re-processing the raw data is always going to be your problem once you get it anyway, but yeah Sequencing.com and Galaxy (UseGalaxy.org 250GB free quota which may be enough for processing WGS) do offer significant services for free. But at least in my case, DL delivered the promised analysis online as soon as it was possible.

aaronbee2010
03-31-2019, 11:41 PM
For example Amazon Web Services charges I believe $0.12 per gigabyte of outgoing bandwidth, so 100GB BAM file would cost 12 dollars to transfer. If people had to fetch this multiple times (Say 3 analysis services and maybe trying to download it via web browser 5 times unsuccessfully themselves) it starts quickly costing way more for the provider. I'm not sure if that's the case here, but just a viewpoint. I'm not sure how Sequencing.com does it for free (Or did last I checked) though. Another possibility would be they don't have the outgoing bandwidth to upload it anywhere themselves, though less credible.

Either way "sneakernet" is in many ways reasonable, but it's unfortunate I believe they kept promising online delivery but didn't do it. Right now their FAQ straight out says "When your results are ready, your Dante Labs account will have a link to download the VCF file(s) directly onto your computer or mobile device (about 150-200 MB of space). You can also ask us to send your BAM and FASTQ files via a hard drive if you wish (about 200-250 GB)." and "We would suggest using the gVCF app by our partner Sequencing.com. You can access the app on the Sequencing.com website, costing you less than $20." It always seemed to me like the easiest solution for them would be to offer to set it up on Sequencing.com for the customer, but there could be legal and other liabilities involved with that.

Sequencing.com is still saying "Dante Labs is a Preferred Provider of whole genome sequencing (WGS) and microarray genetic testing. They make it easy to access your genetic data via Sequencing.com.

If you're a client of Dante Labs, please email [email protected] with your Kit ID and your request to have your data files imported into your Sequening.com account." so has anybody tried that?

I wonder if .BAM files are able to be imported into one's Sequencing.com account and not just .VCF files. That would be a lot cheaper and quicker than them sending over a hard drive.

MacUalraig
04-01-2019, 06:52 AM
Personally I'm not in a hurry to share my entire WGS data with a third party like sequencing.com; I don't even do that with YFull, and I know and trust them a lot more.

Has anyone asked YFull about taking Dante VCFs pending the BAM, like they did with the hg38 rebuild of BY?

Donwulff
04-01-2019, 07:15 PM
I wonder if .BAM files are able to be imported into one's Sequencing.com account and not just .VCF files. That would be a lot cheaper and quicker than them sending over a hard drive.

Good point, I automatically assumed that was about the BAM file because you can upload the VCF file yourself, but they don't actually say BAM file. It would solve any storage and bandwidth issues Dante Labs might be having if that's the problem, though. So has anybody asked? I uploaded my BAM myself, so I don't need to request it.

Although, the responses on this thread are illustrative why hard-drive delivery may still be their best bet. Just as a counterpoint though, as I reported at the time when I received the USB stick with the BAM file, I was standing outside because they usually have trouble getting in and the courier (DHL I think) just asked "Are you waiting for a package?" and handed it over to me without checking identity or anything (Almost certainly breaking company policy, but that's what humans do). Add to that that an international post-package like that can be opened & copied by any number of people including customs in-transit and handling. Privacy & security is main reason I would prefer online delivery via a service with privacy policy & safeguards in place.

Mich Glitch
04-01-2019, 09:17 PM
My Dante Labs BAM file is processing now in YFull.
I received it on an HDD (free) and sent it via Google Drive (4.99 CAD/month for 200 GB of storage).
I'll cancel the Google subscription soon. I needed it for the 101.6 GB file transfer only.
I ordered the test in April 2018.

Mich Glitch
04-01-2019, 09:19 PM
IMHO Dante Labs is a relatively good option (price/quality).

Mich Glitch
04-01-2019, 09:23 PM
For old BigY I have 2.4% non-read SNPs in YFull.
For Dante Labs 0.11% only.
I am waiting for Big Y-700 results to compare all 3 tests.

Tonev
04-02-2019, 07:57 AM
Aaronbee2010, BAMs are uploadable to Sequencing.com - I did it. Yet most apps charge per file, i.e. up until recently I have paid for apps to analyse my VCF (while waiting for the BAM); a BAM analysis would be a separate charge.
MacUalraig, I have asked YFull about accepting VCFs; the reply was negative.

Donwulff
04-02-2019, 07:07 PM
Tonev, it's still unclear. Sequencing.com's upload page says to contact Dante Labs support and ask *them* to upload the data; people are wondering if Dante Labs can upload the BAM to Sequencing.com.

At this point, after two pages of debating it, I guess it would've been easier for me just to ask DL if they can do that, but as I said, I uploaded my BAM myself to Sequencing.com two years ago and haven't paid a cent to Sequencing.com. It's just a way of delivering the BAM file for analysis for free. I've paid several hundred dollars to Amazon Web Services and my local utility company for self-processing and research, to say nothing of the working hours. It beats me why you would even want the BAM file if you don't intend to process it.

Regardless, I've just suggested an alternate method of delivery that both Dante Labs & Sequencing.com are advertising, because several people have complained about the lack of online delivery of the BAM. I'm not, by any means, suggesting that everybody HAS to put the BAM on Sequencing.com or that they should pay for any of the additional analysis offered by Sequencing.com, merely that according to the web site this is a delivery option.

In any case, EvE Free on Sequencing.com is *free*, as is UseGalaxy, if you need to process the BAM at no cost.

Donwulff
04-02-2019, 09:50 PM
Short version: If (and I mean only if) you require online delivery of the Dante Labs raw data, Sequencing.com says you can contact Dante Labs support to have them upload it to Sequencing.com. Sequencing.com prides itself on storage & bandwidth being entirely free.



Longer version: I believe nobody has yet confirmed whether Dante Labs will provide the BAM file this way, but because it should not incur storage & bandwidth costs on Dante Labs, it would make a lot of sense. Preferably someone who is waiting for a BAM and doesn't have a problem with Sequencing.com should try it and see if they'll do it. No, obviously you do not get any additional paid services for free, but the point is delivery. Unfortunately, not all services seem able to take files off Sequencing.com; I just tried UseGalaxy, and both the US and EU versions say Sequencing.com's sharing API returns error 500 (is this intentional, to prevent free analysis?). And yes, obviously you should check their privacy policy and general safeguards before trusting your data to them, though most people just won't, and I can assure you nobody is that interested in your DNA profile without your electronic health records or at least knowing who you are.

Unfortunately most of the useful analyses on Sequencing.com are now for-pay. EvE Free still supports Bowtie2 mapping + samtools calling into VCF, for example, which is often used, but the de-facto standard of bwa mem mapping + GATK calling into gVCF with ClinVar interpretation is now a paid service ($4.99, so it won't exactly force you to break the piggy bank). 23andMe output format is also supported, for use with sites that require it. On the other hand, I can guarantee it will cost you more to set up a computer, learn the commands and pay for the electricity to do it yourself. Cloud-service processing will come out slightly cheaper if you already know what you're doing, but then your time should be more valuable, unless you're looking to learn/experiment.

Disclaimer: It's possible that for whatever reason Dante Labs won't provide the BAM to Sequencing.com, in which case one can upload it with Big Yotta if they have access to a Windows machine with large bandwidth, and storage and downloads are still free. I haven't actually tried any of the current analysis formats, because I myself only needed storage & transfer. I'm not affiliated with Sequencing.com and don't get any royalties; I just saw people complain about the lack of online delivery. It's probably also the most realistic option for most people to actually make use of the genetic data with third-party services like GEDmatch (23andMe format; last I tried, the gVCF was too big for Genesis), Promethease (gVCF format) or YFull (BAM sharing link).

MacUalraig
04-03-2019, 07:42 AM
Is this tie-up with sequencing.com the reason the supposed discussions about direct transfer to yfull never bore fruit (or even took place)?

aaronbee2010
04-03-2019, 06:06 PM
Received this email from Dante Labs today:

Dante Labs: update and plans

Dear Aaron,

As valued Dante Labs users, I am glad to share some updates on our company. In the last two years, we managed to reduce the cost of whole-genome sequencing to a few hundreds of dollars/euros, providing advanced data interpretation and customized reports. Thousands of people accessed advanced genetics at affordable prices for the first time, making a difference in their lives.

In 2018, we had also delays and our communication wasn't always clear. There are no excuses. We own the responsibility for these delays.

In 2019, we are making the following investments to offer you even more value:

we are building our high-throughput sequencing center in Italy to internalize the sequencing process
we will release more reports and more advanced analytics

You will receive a new Pharmacogenomic Report in the next 4 weeks, regardless of when you purchased the test, free of charge. This is a value of hundreds of dollars.

In general, we will add more analysis, more reports and more functionalities in 2019, staying true to our mission of making advanced genomics accessible to everyone.

Thanks!

Andrea

Tonev
04-03-2019, 07:15 PM
Same e-mail received by me too tonight!

Donwulff
04-03-2019, 09:22 PM
I don't have any internal information about Dante Labs or their decision process, and haven't heard about any YFull deal, but at least my VCF and reports were served via Amazon S3. On https://aws.amazon.com/s3/pricing/ it can be seen that storage is $0.023 per GB per month, so if they served a 100 GB BAM and a 100 GB FASTA, it would be $4.60 a month, and with downloads at $0.09 per GB, that's $9 for the BAM. These costs would be borne by Dante Labs. Three years of storage alone would eat the entire price of their Black Friday special. Assuming someone decided to share the BAM with the world, it would quickly cost them more than even the full list price of the service. Of course, this could be mitigated by "Download it once within two months, and then we'll take it offline". But with Sequencing.com it won't cost anything for delivery (assuming they have the outgoing bandwidth), and we still don't even know if they do that.

I was just thinking that Sequencing.com isn't optimal for YFull delivery though, because YFull really should only get your Y-chromosome data. UseGalaxy can do that filtering, but if you have to download the file off Sequencing.com and then upload it to UseGalaxy, that will break most people's Internet connection. So Sequencing.com really needs a "Y chromosome only" app ;)
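That filtering is a one-liner anywhere samtools is available, for anyone with the BAM already in hand (file names hypothetical; the contig is chrY in UCSC-style references like Dante's hg19, plain Y in others):

samtools index sample.bam              # needed once before region queries
samtools view -b sample.bam chrY > chrY.bam
samtools index chrY.bam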

I got the Dante Labs e-mail as well. I'm especially interested in "In general, we will add more analysis, more reports and more functionalities in 2019", but we'll see; at the price they have been offering, I definitely expected them to just drop the data one time & offer no support after that. On the other hand, occasional updates will keep them on the table & people talking about it, so lots of hopefully positive marketing and publicity.

Mich Glitch
04-03-2019, 09:48 PM
YFull has received my BAM file. 101.66 GB. See my posts in this thread.

MacUalraig
04-04-2019, 06:38 AM
I believe that Full Genomes used AWS and asked customers to let them know when they'd done the BAM download so they could pull it. Certainly none of my files are currently downloadable. Given that it's six years since my first FGC BAM was released, it would be a bit unrealistic to expect them to pay to host it for me all that time.

Petr
04-04-2019, 08:51 AM
Short version: If (and I mean only if) you require online delivery of the Dante Labs raw data, Sequencing.com says you can contact Dante Labs support to have them upload it to Sequencing.com. Sequencing.com prides itself on storage & bandwidth being entirely free.



Longer version: I believe nobody has yet confirmed whether Dante Labs will provide the BAM file this way, but because it should not incur storage & bandwidth costs on Dante Labs, it would make a lot of sense. Preferably someone who is waiting for a BAM and doesn't have a problem with Sequencing.com should try it and see if they'll do it. No, obviously you do not get any additional paid services for free, but the point is delivery. Unfortunately, not all services seem able to take files off Sequencing.com; I just tried UseGalaxy, and both the US and EU versions say Sequencing.com's sharing API returns error 500 (is this intentional, to prevent free analysis?). And yes, obviously you should check their privacy policy and general safeguards before trusting your data to them, though most people just won't, and I can assure you nobody is that interested in your DNA profile without your electronic health records or at least knowing who you are.

Unfortunately most of the useful analyses on Sequencing.com are now for-pay. EvE Free still supports Bowtie2 mapping + samtools calling into VCF, for example, which is often used, but the de-facto standard of bwa mem mapping + GATK calling into gVCF with ClinVar interpretation is now a paid service ($4.99, so it won't exactly force you to break the piggy bank). 23andMe output format is also supported, for use with sites that require it. On the other hand, I can guarantee it will cost you more to set up a computer, learn the commands and pay for the electricity to do it yourself. Cloud-service processing will come out slightly cheaper if you already know what you're doing, but then your time should be more valuable, unless you're looking to learn/experiment.

Disclaimer: It's possible that for whatever reason Dante Labs won't provide the BAM to Sequencing.com, in which case one can upload it with Big Yotta if they have access to a Windows machine with large bandwidth, and storage and downloads are still free. I haven't actually tried any of the current analysis formats, because I myself only needed storage & transfer. I'm not affiliated with Sequencing.com and don't get any royalties; I just saw people complain about the lack of online delivery. It's probably also the most realistic option for most people to actually make use of the genetic data with third-party services like GEDmatch (23andMe format; last I tried, the gVCF was too big for Genesis), Promethease (gVCF format) or YFull (BAM sharing link).

I asked both sequencing.com and Dante:


Sarah (Sequencing.com)
Apr 2, 3:25 PM PDT
Hi Petr -

Thanks for contacting us. In order to have your information imported, you need to email your kit and account information to [email protected] They will then transfer it to you.

Thank you for your question. I've notified the web team to make the process clearer on the landing page.

Best Regards,
Hannah
Support Team



From: Dante-Labs <[email protected]>
Sent: Thursday, April 04, 2019 8:00 AM
Subject: Re: FW: [Sequencing.com] Re: Import of Dante RAW data

Hello Petr,

Thanks for your message.

Sequencying.com is a third company who is our preferred provider of bioinformatics and reports generation. I'm afraid, we won't be able to assist much in importing the data in your account you have with them.

If you'd like to get the FASTQ and BAM files (~100 GB each) to be uploaded to Sequencing.com, you can do so by obtaining the 512 GB HD here, and manually upload them: https://www.dantelabs.com/products/500-gb-hard-disk-containing-your-raw-data

Please let me know if you have any questions.

Kindest regards,
Mark


So apparently no way.

Donwulff
04-04-2019, 11:21 PM
Thanks! That's a shame; manually/personally transferring the files on the web won't be an option for a lot of users, and it seems the two companies indeed have different ideas of whether that delivery method is supported! It's Dante Labs that actually decides in this case, though.

An odd observation from earlier: the option to choose a mapping method like Bowtie2 for the EvE Free processing on Sequencing.com disappeared literally *hours* after I mentioned it on this thread. Since the mapping method is not applied by EvE Free on a BAM file (I tried before it became unavailable), I'm not sure if this is merely correcting a user-interface error or something else. When Sequencing.com launched they actually did offer many useful analyses for free, but unfortunately, again, despite claiming "many of which are completely free", after checking they have practically no truly free meaningful analyses available anymore. I'll note that free analyses are actually quite rare, and free storage & bandwidth are very useful; hopefully they'll stay that way.

Ric69
04-05-2019, 04:24 PM
Same e-mail received by me too tonight!

Ditto here...

kafky
04-10-2019, 11:18 AM
Hi to all! Like most of you in this thread, I have ordered the Dante WGS, with or without a clear statement of BAM or VCF files being available as results. Those with genealogy objectives may want to: 1. check genetic matches on GEDmatch; 2. clarify mtDNA and Y haplogroups and matches.

What would be the roadmap to achieve those goals? 1. Transforming the VCF into a 23andMe-format file and uploading it to GEDmatch? 2. Uploading the BAM (or FASTA), if available, to other websites for Y and mtDNA processing? Since the VCF is the only result that is available for sure, and is easier to transfer, should that be the new format for working with genetic genealogy info, with the BAM just for archiving?

A few questions to fuel this nice discussion!
Thanks!

MacUalraig
04-10-2019, 11:49 AM
You can dig out your Y haplogroup from the VCF yourself, to be honest; it doesn't take that long even if you do it manually. If you need some pointers or assistance, just post here and someone will aid you. I split out the Y chromosome from my Dante VCF and it's only about 1 MB, so quite easy to play around with (Excel/WordPad/Access/MySQL etc.). If you download a SNP database like YBrowse you can even find your novel SNPs.
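A sketch of that split, for anyone who wants to try it (file name hypothetical; this works on an uncompressed VCF - pipe through zcat first if yours is gzipped):

awk '/^#/ || $1=="chrY"' dante.vcf > chrY.vcf    # keep the header plus chrY records
# or, with bcftools and a tabix index in place:
# bcftools view -r chrY dante.vcf.gz > chrY.vcf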

I had hoped that YFull who do BAM file analysis would take Dante VCFs as an interim since they did this with the hg38 remapped BigY but apparently they won't so you will have to await the BAM.

mtDNA is only in the BAM file not the VCF.

JamesKane
04-10-2019, 11:54 AM
I believe that Full Genomes used AWS and asked customers to let them know when they'd done the BAM download so they could pull it. Certainly none of my files are currently downloadable. Given that it's six years since my first FGC BAM was released, it would be a bit unrealistic to expect them to pay to host it for me all that time.

AWS S3 storage costs depend on availability. Typically, you'd move the files to one of the Glacier tiers after a retention period of say 30 days. The newest Glacier tier is about $1 per TB per month. It takes 12 hours to make a file accessible though. I've been investigating this for my own use case as it's less costly than my current backup strategy.

As Donwulff already discussed the real cost in delivery is the outgoing network traffic. 9¢ per GB after the first free GB. Dante's WGS BAMs are purported to be 120GB on average and I've honestly no idea why they are 3.4x bigger than a comparable one run on Illumina sequencers.

MacUalraig
04-10-2019, 12:23 PM
AWS S3 storage costs depend on availability. Typically, you'd move the files to one of the Glacier tiers after a retention period of say 30 days. The newest Glacier tier is about $1 per TB per month. It takes 12 hours to make a file accessible though. I've been investigating this for my own use case as it's less costly than my current backup strategy.

As Donwulff already discussed the real cost in delivery is the outgoing network traffic. 9¢ per GB after the first free GB. Dante's WGS BAMs are purported to be 120GB on average and I've honestly no idea why they are 3.4x bigger than a comparable one run on Illumina sequencers.

For whom, though? I'm not that familiar with AWS, but I understand there is an option to make the downloader pay.

"With Requester Pays buckets, the requester instead of the bucket owner pays the cost of the request and the data download from the bucket. The bucket owner always pays the cost of storing data. ..You might, for example, use Requester Pays buckets when making available large datasets.."

https://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html

Donwulff
04-10-2019, 01:23 PM
~100 GB is pretty typical for a 30X sequence file; this can be confirmed for example at https://my.pgp-hms.org/public_genetic_data?utf8=%E2%9C%93&data_type=other which includes data from all kinds of providers, see "Full Genomes 30x WGS". Note that because repetitive, redundant data compresses exceptionally well, additional read depth doesn't increase the size of a chromosome-coordinate-sorted BAM file significantly. My 100bp Dante Labs BAM file was 99 GB, with an apparent read depth of 43X. Of course, the Y-chromosome-only data is less than one gigabyte.

MacUalraig
04-10-2019, 01:26 PM
My YSEQ 30x is about 32 GB; the sample Dante one I have is about 111 GB.

kafky
04-10-2019, 03:44 PM
You can dig out your Y and mt haplogroups from the VCF yourself to be honest, it doesn't take that long even if you do it manually. If you need some pointers or assistance just post here and someone will aid you. I split out the Y chromosome from my Dante VCF and its only about 1Mb so quite easy to play around with (excel/wordpad/access/mysql etc). If you download a SNP database like ybrowse you can even find your novel SNPs.

I had hoped that YFull who do BAM file analysis would take Dante VCFs as an interim since they did this with the hg38 remapped BigY but apparently they won't so you will have to await the BAM.


Yep, some tips to extract Y and mtDNA, and the best ways to manage/interpret the results, would be awesome!

MacUalraig
04-10-2019, 04:32 PM
You can dig out your Y and mt haplogroups from the VCF yourself to be honest, it doesn't take that long even if you do it manually. If you need some pointers or assistance just post here and someone will aid you. I split out the Y chromosome from my Dante VCF and its only about 1Mb so quite easy to play around with (excel/wordpad/access/mysql etc). If you download a SNP database like ybrowse you can even find your novel SNPs.



Correction: currently I believe Dante are not *calling* mtDNA in the VCF files they hand out, and I have no chrM records. Supposedly it is in the BAM file, which I haven't received yet (it's in the sample they let me see). I've edited my earlier post in case it confuses people.

traject
04-10-2019, 04:33 PM
Correction: currently I believe Dante are not *calling* mtDNA in the VCF files they hand out, and I have no chrM records. Supposedly it is in the BAM file, which I haven't received yet (it's in the sample they let me see).

Yeah, I didn't see anything in my VCF file for chrM.

kafky
04-10-2019, 05:00 PM
Yes, this is one of my doubts, since Dante (when I bought) said it included mtDNA. I have both a SNP VCF and an INDEL VCF.

Donwulff
04-10-2019, 05:51 PM
My VCF in 2017 had chrM, at the beginning of the file oddly enough, but it's the hg19 (pre-Revised Cambridge Reference Sequence rCRS) Yoruba one: http://haplogrep.uibk.ac.at/blog/rcrs-vs-rsrs-vs-hg19/
YFull will analyse both the Y chromosome and mtDNA if available, although that does raise the question of what women, who can't get Y-chromosome analysis, should do.
Time for an updated version of the BAM Analysis Kit?
https://anthrogenica.com/showthread.php?15694-Updated-BAM-Analysis-Kit-Any-interest
https://github.com/teepean/BAM-Analysis-Kit
Yet that's a pretty heavy approach that requires having the BAM at hand. Also, they really need to credit Felix on the project.
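For the BAM-in-hand case, a rough sketch of pulling out an mtDNA consensus FASTA for haplogroup tools (file names hypothetical, and mind the caveat above: calls against the hg19 Yoruba chrM won't line up with rCRS positions directly):

samtools index sample.bam              # if not already indexed
samtools view -b sample.bam chrM > chrM.bam
samtools index chrM.bam
bcftools mpileup -f hg19.fa chrM.bam | bcftools call -mv -Oz -o chrM.vcf.gz
bcftools index chrM.vcf.gz
samtools faidx hg19.fa chrM | bcftools consensus chrM.vcf.gz > chrM_consensus.fasta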

MacUalraig
04-10-2019, 06:26 PM
I can't make head or tail of some of the things they do, like skipping the mtDNA but explicitly mentioning it on the WGS product page which normally we would take as read...

"My Full DNA: Whole Genome Sequencing with mtDNA"

Perhaps a few more of us should report it as an error.

kafky
04-10-2019, 06:44 PM
I can't make head or tail of some of the things they do, like skipping the mtDNA but explicitly mentioning it on the WGS product page which normally we would take as read...

"My Full DNA: Whole Genome Sequencing with mtDNA"

Perhaps a few more of us should report it as an error.


I wrote to them and they replied in a few minutes, asking me to fill out a "Receive your Customized Report" request form.

Donwulff
04-10-2019, 07:09 PM
Well, I mean technically that's correct, WGS *does* sequence mtDNA as well ;) Also, it's in the raw data, if you pay extra for that and wait... To be honest, I'm not sure what is & isn't delivered with the product now; as I said, in 2017 when they launched, the VCF did have chrM. Was the "with mtDNA" there when people ordered?

I notice right now they barely mention VCF & raw data on the product, and their FAQ just sends people to Sequencing.com for gVCF's. On the other hand I can sympathize with their problem of selling what's essentially a highly technical product to the public. If they wrote "Our product consists of a VCF file of the primary assembly but due to lack of consensus on mtDNA reference, you'll have to do that yourself", the average customers would just shake their heads at the gibberish and move along.

I will note their FAQ specifically mentions "We found that most of the costs were not in the DNA sequencing process, but in the management of the samples before the sequencing, including DNA extraction, or in the bioinformatics processing after the sequencing. We have eliminated these inefficiencies and leveraged economies of scale to finally pass the cost savings to you." and urges you to go to Sequencing.com for gVCF files. So the low bioinformatics processing is a "feature".
And to be fair, even if you had to buy gVCF, mtDNA and SNP/23andMe processing separately, the price is a bargain.

But it should NOT take them pretty much any time per sequence to add an rCRS/RSRS mtDNA FASTA, a gVCF and a 23andMe/AncestryDNA-style picked-SNP file. The problem is, at a guess, almost none of their normal customers know to ask for those, and they're not something they can market to the average customer either.

Of course, it doesn't help that they haven't responded to my question about the long-read sequencing yet, so I'm wondering if they even received the questions. In 2017 they replied quickly to my questions. They may be overwhelmed with customer-support requests, and/or my question was too technical. Funnily enough, just before that they sent out an e-mail update saying "In 2018, we had also delays and our communication wasn't always clear. There are no excuses. We own the responsibility for these delays."

Anyway, that is the price you pay for a low cost with cuts to customer support & bioinformatics processing prices I guess. Still, it would be good to keep them aware there is demand for mtDNA FASTA, gVCF and 23andMe/AncestryDNA compatible picked SNP files.

MacUalraig
04-10-2019, 07:42 PM
They probably didn't expect to be hit by the dreaded 'expert users' :-)

kafky
04-10-2019, 09:04 PM
OK, this mtDNA issue is stalled until new info. How about the Y and GEDmatch issues?

For the Y, without the BAM file, what can we do? I know that my Y is inside the R1-L51 cluster. I can search each defining polymorphism one by one. There was a good tool that connected with Chrome and the ISOGG tree and could ease the process with 23andMe data, after extracting the Y SNPs and indicating the positive SNPs.

For autosomal (and X) GEDmatch use, what can we do? I failed to transform the VCF file into a 23andMe raw file using the DNA Kit Studio strategy.

Donwulff
04-10-2019, 10:03 PM
I guess the question is more like, what can YOU do. You have a VCF; is the VCF file still hs37?
There's https://isogg.org/wiki/Y-DNA_tools#Y-SNP_haplogroup_prediction_tools
I guess 23andMe's https://github.com/23andMe/yhaplo is among the best, but you need to be able to run Python scripts; also, the data files/SNPs/branches are a couple of years old now.
Annotating SNPs directly with YBrowse's SNP names would get the latest data, but it also shouldn't take a huge effort to just go down the tree branch by branch and check for SNPs in the VCF (if you have any way to read the VCF file) - though I did just realize YFull's tree doesn't include Y-SNP locations, so those need to be cross-referenced somewhere like YBrowse.

Plain VCF files are a pain for autosomal matching; I've tried imputing them, but that strategy isn't very good either.
If it's good high-depth WGS, you might get away with assuming unlisted SNPs are ancestral.
But quite honestly, since GEDmatch prohibited "artificial DNA kits" for "identifying someone" (such as a genealogical match), it seems that manipulating DNA data in any way is against GEDmatch's current ToS.
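Checking a single branch-defining SNP in the VCF really is that simple, though - something like this (the position here is a made-up placeholder; look the real hg19 coordinates up on YBrowse first):

awk '$1=="chrY" && $2==12345678' dante.vcf
# a hit shows the call; empty output on a variant-only VCF means no variant
# was called there - which could be ancestral, or simply not covered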

Petr
04-11-2019, 05:55 AM
Correction: currently I believe Dante are not *calling* mtDNA in the VCF files they hand out and I have no chrM records.

My VCF files received on April 2nd (and ordered on November 20th) contain chrM records.

MacUalraig
04-11-2019, 06:41 AM
You might want to double-check the chrM record situation then, if others have been given it.

Don't forget you can load your Dante VCF into IGV (the genome browser) along with the hg19 human reference; this can make some processing a bit easier. It needs an index file, which it will prompt you for the first time you load it.

http://software.broadinstitute.org/software/igv/download

You can compare the VCF against the YBrowse SNP database in SQL (I use MySQL for this), BUT YBrowse is hg38 only!! So you would have to convert one or the other to do that with a Dante VCF.
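On the index: IGV wants a bgzip-compressed VCF with a tabix index, which you can make up front (file name hypothetical; bgzip and tabix ship with htslib):

bgzip dante.vcf                # bgzip, not plain gzip
tabix -p vcf dante.vcf.gz      # writes dante.vcf.gz.tbi alongside it

And for the hg19-vs-hg38 mismatch, the usual route is a liftover tool such as CrossMap or Picard LiftoverVcf with the UCSC hg19ToHg38 chain file, though lifting over a VCF has its own pitfalls.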

pmokeefe
04-11-2019, 07:13 AM
My VCF files received on April 2nd (and ordered on November 20th) contain chrM records.
My VCF files received on Feb 25 did not contain chrM records.

Petr
04-11-2019, 08:19 AM
I have ordered the product: "My Full DNA: Whole Genome Sequencing with mtDNA".

The VCF files for my older orders in 2017 were for "Whole Genome Sequencing (WGS) - Full DNA Analysis" and did not contain the mtDNA data (or were incomplete).

The lines were starting like this:
chrM 73 . G A 3070 PASS
chrM 150 . T C 3070 PASS
chrM 200 . A G 3070 PASS
chrM 410 . A T 3070 PASS
chrM 2354 . C T 3070 PASS
chrM 2485 . C T 3070 PASS
chrM 2708 . G A 3070 PASS
chrM 4108 . C T 3070 PASS
chrM 5581 . C T 3070 PASS
chrM 7029 . T C 3070 PASS
chrM 8702 . G A 3070 PASS
chrM 9378 . G A 3070 PASS
chrM 9541 . C T 3070 PASS
chrM 9555 . G A 3070 PASS
chrM 10399 . G A 3070 PASS
chrM 10820 . G A 3070 PASS
chrM 10874 . C T 3070 PASS
chrM 11018 . C T 3070 PASS
chrM 11486 . T C 3070 PASS
chrM 11720 . A G 3070 PASS
chrM 11723 . C T 3070 PASS
chrM 12706 . T C 3070 PASS
chrM 12851 . G A 3070 PASS
chrM 13621 . T C 3070 PASS
chrM 14213 . C T 3070 PASS
chrM 14581 . G A 3070 PASS
chrM 14767 . T C 3070 PASS
chrM 14873 . C T 3070 PASS
chrM 14906 . A G 3070 PASS
chrM 15302 . A G 3070 PASS
chrM 15933 . C T 3070 PASS
chrM 16173 . C T 1489 LowGQX
chrM 16191 . C T 152 LowGQX
chrM 16194 . C T 271 LowGQX
chrM 16225 . T C 3070 PASS
chrM 16322 . T C 3070 PASS

(H13b1 haplogroup)

ChrisR
04-11-2019, 11:12 AM
My VCF files received on April 2nd (and ordered on November 20th) contain chrM records.
I now have access to DL results (VCF) which should also have been made downloadable on April 2nd (ordered in Europe also in late Nov. AFAIK).
41 chrM lines are present in the SNP VCF. The lines were almost at the end, before chrUn_gl000226 and chr18_gl000207_random.
Probably the interesting lines, including the Y and mt info and one variant call each, from the VCF:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=March-26-2019 at 19:xx UTC
##source=DanteLabs
##dataSourceType=WGS
##dataAnalysisProvider=Sequencing.com
##missingDataClarification=not_sequenced
##missingDataClarificationDescription=Chromosomal coordinates that are not included were not sequenced
##reference=HG19.USCS
##referenceInfo=HG19.USCS validated by Sequencing.com
##fileDate=20190326
##source=strelka
##source_version=2.9.10
##startTime=Tue Mar 26 19:xx 2019
##cmdline=/home/strelka/bin/configureStrelkaGermlineWorkflow.py --bam=sorted1055.bam --ref=/mnt/data/refData/1055/1055.fa --runDir ./outStrelka
##reference=file:///mnt/data/refData/1055/1055.fa
##contig=<ID=chrY,length=59373566>
##contig=<ID=chrM,length=16571>
##content=strelka germline small-variant calls
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the region described in this record">
##INFO=<ID=BLOCKAVG_min30p3a,Number=0,Type=Flag,Description="Non-variant multi-site block. Non-variant blocks are defined independently for each sample. All sites in such a block are constrained to be non-variant, have the same filter value, and have sample values {GQX,DP,DPF} in range [x,y], y <= max(x+3,(x*1.3)).">
##INFO=<ID=SNVHPOL,Number=1,Type=Integer,Description="SNV contextual homopolymer length">
##INFO=<ID=CIGAR,Number=A,Type=String,Description="CIGAR alignment for each alternate indel allele">
##INFO=<ID=RU,Number=A,Type=String,Description="Smallest repeating sequence unit extended or contracted in the indel allele relative to the reference. RUs are not reported if longer than 20 bases">
##INFO=<ID=REFREP,Number=A,Type=Integer,Description="Number of times RU is repeated in reference">
##INFO=<ID=IDREP,Number=A,Type=Integer,Description="Number of times RU is repeated in indel allele">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="RMS of mapping quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GQX,Number=1,Type=Integer,Description="Empirically calibrated genotype quality score for variant sites, otherwise minimum of {Genotype quality assuming variant position,Genotype quality assuming non-variant position}">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Filtered basecall depth used for site genotyping. In a non-variant multi-site block this value represents the average of all sites in the block.">
##FORMAT=<ID=DPF,Number=1,Type=Integer,Description="Basecalls filtered from input prior to site genotyping. In a non-variant multi-site block this value represents the average of all sites in the block.">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum filtered basecall depth used for site genotyping within a non-variant multi-site block">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed. For indels this value only includes reads which confidently support each allele (posterior prob 0.51 or higher that read contains indicated allele vs all other intersecting indel alleles)">
##FORMAT=<ID=ADF,Number=.,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=.,Type=Integer,Description="Allelic depths on the reverse strand">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Sample filter, 'PASS' indicates that all filters have passed for this sample">
##FORMAT=<ID=DPI,Number=1,Type=Integer,Description="Read depth associated with indel, taken from the site preceding the indel">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
##FORMAT=<ID=SB,Number=1,Type=Float,Description="Sample site strand bias">
##FILTER=<ID=IndelConflict,Description="Indel genotypes from two or more loci conflict in at least one sample">
##FILTER=<ID=SiteConflict,Description="Site is filtered due to an overlapping indel call filter">
##FILTER=<ID=LowGQX,Description="Locus GQX is below threshold or not present">
##FILTER=<ID=HighDPFRatio,Description="The fraction of basecalls filtered out at a site is greater than 0.4">
##FILTER=<ID=HighSNVSB,Description="Sample SNV strand bias value (SB) exceeds 10">
##FILTER=<ID=HighDepth,Description="Locus depth is greater than 3x the mean chromosome depth">
##Depth_chrY=12.00
##Depth_chrM=2236.00
##FILTER=<ID=LowDepth,Description="Locus depth is below 3">
##FILTER=<ID=NotGenotyped,Description="Locus contains forcedGT input alleles which could not be genotyped">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
##SnpSiftVersion="SnpSift 4.3s (build 2017-10-25 10:05), by Pablo Cingolani"
##SnpSiftCmd=""
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view -v snps sift.vcf.gz; Date=Tue Mar 26 19:xx 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT default
chrY 2668456 rs2058276 T C 130.0 PASS SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:15:12:6:0:0,6:0,2:0,4:-15.7:PASS:167,18,0
chrM 195 . C T 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1569:64:7,1562:3,825:4,737:-99:PASS:370,370,0

I'm now wondering if and how I could easily work out the mt haplogroup from the 41 chrM variant lines in the VCF. So far I've found no (semi-)automatic method.

Petr
04-11-2019, 11:44 AM
Strange thing: 4 tests, all delivered on April 2nd, with very different depths (one column per kit):

##Depth_chr1 16 11 8 21
##Depth_chr2 16 11 8 21
##Depth_chr3 16 11 8 21
##Depth_chr4 15 11 8 21
##Depth_chr5 16 11 8 21
##Depth_chr6 15 10 8 20
##Depth_chr7 16 11 8 21
##Depth_chr8 15 11 8 21
##Depth_chr9 15 10 8 20
##Depth_chr10 15 10 8 21
##Depth_chr11 15 11 8 21
##Depth_chr12 16 11 8 21
##Depth_chr13 15 11 8 21
##Depth_chr14 16 11 8 21
##Depth_chr15 16 11 8 21
##Depth_chr16 16 10 8 21
##Depth_chr17 15 10 8 20
##Depth_chr18 15 11 8 21
##Depth_chr19 15 10 8 20
##Depth_chr20 16 10 8 21
##Depth_chr21 16 11 8 21
##Depth_chr22 15 10 8 20
##Depth_chrX 8 10 4 10
##Depth_chrY 8 4 4 11
##Depth_chrM 1700 806 1132 3031
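For anyone wanting to make the same comparison, those figures can be pulled straight from the headers; a sketch assuming the delivered files are gzipped VCFs in the current directory:

for f in *.vcf.gz; do echo "== $f"; zgrep -h '^##Depth_' "$f"; done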

ChrisR
04-11-2019, 11:49 AM
I now used DNA Kit Studio v2.1 (https://wilhelmhgenealogy.wordpress.com/dna-kit-studio/) to create a raw data file from the VCF and then compared it to 23andMe data from the same person:
RAW Analyzer > RAW FILE COMPARISON

File 1: 23andMe 2011 V3
File 2: DanteLabs 2019 grch37.snp

> Total total SNPs file 1: 960545
> Total total SNPs file 2: 4132157

> Total SNPs in common and equal genotyping (including NoCalls): 429209
> Total SNPs in common but not equal genotyping (including NoCalls): 27374

>> Total SNPs genotypes half different: 735 (0,161%)
>> Total SNPs genotypes full different: 13195 (2,89%)

> Total SNPs NoCalls File 1: 6508
> Total SNPs NoCalls File 2: 0

> Total SNPs not found in file 2: 503962
> Total SNPs common in file 1 and file 2: 456583
Do I understand the penultimate line correctly: 503962 23andMe V3 SNPs were not found/present in the VCFtoSNP data file? That looks like a lot; is it plausible?

I also created a raw data file from the mt-only data and compared it to the 23andMe file, but the mthap analysis results suggest something else is needed before the DL file is correctly recognized / the rCRSdiff assumption is probably wrong?:

23andMe data file

Found 2441 markers at 2440 positions covering 14.7% of mtDNA.

NOTICE: You appear to have uploaded a 23andme v3 raw data file which has 9 known unreliable markers that will be excluded from this analysis.

Markers found (shown as differences to rCRS):

HVR2: 73G 150T 263G
CR: 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G
HVR1: 16263G 16270T

IMPORTANT NOTE: The above marker list is almost certainly incomplete due to limitations of genotyping technology and is not comparable to mtDNA sequencing results. It should not be used with services or tools that expect sequencing results, such as mitosearch.

Best mtDNA Haplogroup Matches:

1) U5b1b1

Defining Markers for haplogroup U5b1b1:
HVR2: 73G 150T 263G
CR: 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G
HVR1: 16189C 16192T 16270T
...
Imperfect Match. Your results contained differences with this haplogroup:
Matches(25): 73G 150T 263G 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G 16270T
Extras(1): 16263G
Untested(2): 16189 16192
DL VCF to raw file

rCRSdiff format was uploaded. Based on the markers found, assuming the following regions were completely sequenced: HVR1 (16001~16569) HVR2 (1~574) CR (575~16000).

Found 16651 markers at 16651 positions covering 100.0% of mtDNA.

Markers found (shown as differences to rCRS):

HVR2: 195C 410A
CR: 2354G 2485C 3198G 5302C 5581G 5657T 7386T 7769A 8702T 9378C 9411A 9478C 9541C 10399T 10820T 10874T 10928C 11018A 11468G 11723G 12309G 12373G 12619C 12706C 12851C 13618T 14183G 14213G 14581C 14906T 15302T 15933A
HVR1: 16094C 16173T 16225T 16265G 16272G 16322G 16521G

Best mtDNA Haplogroup Matches:

1) H2a2a1

MacUalraig
04-11-2019, 11:55 AM
I now have access to DL results (VCF) which should also have been made downloadable on April 2nd (ordered in Europe also in late Nov. AFAIK).
41 chrM lines are present in the SNP VCF. The lines were almost at the end, before chrUn_gl000226 and chr18_gl000207_random.
...
I'm now wondering if and how I could easily work out the mt haplogroup from the 41 chrM variant lines in the VCF. So far I've found no (semi-)automatic method.

ChrisR, glad to see you have your VCF. There are some tools at James Lick's site

https://dna.jameslick.com/mthap/

but I suspect it would be quicker to check against the tree manually rather than fiddle to get it into one of his accepted formats:

"Supported formats: 23andMe raw data, deCODEme raw data, Differences to rCRS (mutation list), FASTA, GenBank Flat File Format, ASN1"

I prefer to just go here:

http://phylotree.org/tree/R0.htm

where T152C pops up several times, etc.
The choice is up to you :-)

MacUalraig
04-11-2019, 12:24 PM
> Total SNPs not found in file 2: 503962
> Total SNPs common in file 1 and file 2: 456583
Do I understand the penultimate line correctly: 503962 23andMe V3 SNPs were not found/present in the VCFtoSNP data file? That looks like a lot; is it plausible?



Those will be SNPs that 23andMe tested where you match the human reference, and hence they're not in your VCF.
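That's also why, when rebuilding an array-style file from a plain VCF, an absent site should usually be filled with the reference genotype rather than a no-call (coverage permitting). A toy sketch, assuming a tabix-indexed VCF and a hypothetical sites.txt with rsid/chrom/pos/ref columns; real code would also parse GT to separate het from hom calls:

# hypothetical sites.txt columns: rsid chrom pos ref
while read rsid chrom pos ref; do
  alt=$(tabix sample.vcf.gz "$chrom:$pos-$pos" | awk -v p="$pos" '$2 == p { print $5; exit }')
  # no record at the site => assume homozygous reference
  printf '%s\t%s\t%s\t%s%s\n' "$rsid" "$chrom" "$pos" "${alt:-$ref}" "${alt:-$ref}"
done < sites.txt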

JamesKane
04-11-2019, 12:41 PM
One of the downsides of delivering the scored and filtered VCF files is you lack information about the non-variant sites. You can generally assume 90% of the genome territory has at least ten reads in a 30x WGS test, but you don't know where coverage is lacking. That's one of the points of the gVCF format, which Dante is already producing to give you the final product. The evidence is in their headers where you see GenotypeGVCFs being applied.

As customers you may want to reach out to them about providing that intermediate file.
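In the meantime, a rough way to see where coverage is lacking is to scan depth straight from the BAM; a minimal sketch, assuming samtools and a sorted, indexed BAM:

samtools depth -a -r chrM sample.bam | awk '$3 < 10 { print $1, $2, $3 }'   # positions under 10x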

Petr
04-11-2019, 12:55 PM
I have 2 results for women; the first is one year old and contains no chrY. The VCF file for the second woman, delivered on April 2nd, contains 21399 SNPs on chrY, 2194 of them with filter PASS. What could be the reason?

MacUalraig
04-11-2019, 12:55 PM
One of the downsides of delivering the scored and filtered VCF files is you lack information about the non-variant sites. ...

.....

TigerMW
04-11-2019, 04:42 PM
Probably in the hands of 'users/w******g' who I speculated previously may dwell in China... (if I can be pardoned the racial stereotyping)
What do you mean by that?

Giosta
04-11-2019, 04:51 PM
---
Dante Labs <[email protected]>

Dear Customer,
We're excited to let you know that we uploaded other VCF formats on your account: CNV, SV.
---

... Ok now I can download
snp.vcf
indel.vcf
cnv.vcf
sv.vcf

What are those new cnv and sv files? Relatively small size. What can I do with these?

ChrisR
04-11-2019, 05:01 PM
I now used DNA Kit Studio v2.1 (https://wilhelmhgenealogy.wordpress.com/dna-kit-studio/) to create a raw data file from the VCF and then compared it to 23andMe data from the same person: [...]

I also created a raw data file from the mt-only data and compared it to the 23andMe file, but the mthap analysis results suggest something else is needed before the DL file is correctly recognized / the rCRSdiff assumption is probably wrong? [...]

23andMe data file: Best mtDNA Haplogroup Matches: 1) U5b1b1
DL VCF to raw file: Best mtDNA Haplogroup Matches: 1) H2a2a1


There are some tools at James Lick's site https://dna.jameslick.com/mthap/
...
I prefer to just go here: http://phylotree.org/tree/R0.htm

Those will be SNPs that 23andMe tested where you match the human reference, and hence they're not in your VCF.
Thanks MacUalraig. Now that you say it, it's logical that the "SNPs not found in file 2" are ancestral calls from the SNP array.
Regarding mthap, I did check manually in the VCF, and while the autosomal and Y data of the individual (it is not me) seem to be the same in the 23andMe V3 and DL WGS 30x results, the mt data is completely different. Contamination of only the mt seems unlikely, so at the moment I'm out of ideas as to what is going on here. Will triple-check...

Petr
04-11-2019, 05:01 PM
I have ordered the product: "My Full DNA: Whole Genome Sequencing with mtDNA".

The chrM lines start like this:
chrM 73 . G A 3070 PASS
chrM 150 . T C 3070 PASS
...
chrM 4108 . C T 3070 PASS
chrM 5581 . C T 3070 PASS
...
chrM 8702 . G A 3070 PASS
chrM 9378 . G A 3070 PASS
...
chrM 16322 . T C 3070 PASS

(H13b1 haplogroup)

It looks like the alignment is wrong; position
4108 should be 4107,
5581 should be 5580,
8702 should be 8701,
9378 should be 9377,
etc.; maybe there are even more misalignments.
The same error is in all 4 files.

ChrisR
04-11-2019, 05:10 PM
I have 2 results for women; the first is one year old and contains no chrY. The VCF file for the second woman, delivered on April 2nd, contains 21399 SNPs on chrY, 2194 of them with filter PASS. What could be the reason?
Probably pseudoautosomal X/Y areas not mapped correctly. If you check, they should be unreliable Y-SNPs not used in the YFull, ISOGG or other Y trees.
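A quick way to eyeball that, assuming a bgzipped, tabix-indexed VCF:

tabix sample.vcf.gz chrY | wc -l                        # all chrY records
tabix sample.vcf.gz chrY | awk '$7 == "PASS"' | wc -l   # only those passing filters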


What are those new cnv and sv files? Relatively small size. What can I do with these?
SV: structural variation (https://en.wikipedia.org/wiki/Structural_variation)
CNV: copy-number variation (https://en.wikipedia.org/wiki/Copy-number_variation)
So potentially more variants, in addition to SNPs and INDELs, to consider for autosomal/X and Y comparison. AFAIK they are not yet used much, and probably only with long-read sequencing will they be mapped accurately and comprehensively enough for consistent use.
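For a first look at what was delivered, assuming the files are plain-text VCFs as listed above:

grep -v '^##' sv.vcf | head -5   # column header plus the first SV records; type/length sit in the INFO column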

Donwulff
04-11-2019, 06:46 PM
It looks like the alignment is wrong; position
4108 should be 4107,
5581 should be 5580,
8702 should be 8701,
9378 should be 9377,
etc.; maybe there are even more misalignments.
The same error is in all 4 files.

See https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=560274&viewfull=1#post560274 above; as said, that's the Yoruba reference, which is pretty useless. Converting the coordinates is trivial, but I spent an hour or so trying to fix the different ref alleles; I should've probably worked it out on a whiteboard first, because I don't seem to understand the problem yet ;)

tabix some.vcf.gz chrM | gawk 'BEGIN { split("73G 150T 195C 263G 750G 1438G 2352C 2483C 2706G 4769G 5580C 7028T 8701G 8860G 9377G 9540C 10398G 10819G 10873C 11719A 12705T 14212C 14766T 14905A 15301A 15326G 16172C 16189C 16223T 16320T", diff); } { if($2>=16193) $2-=2; else if($2>=3107) $2-=1; else if($2>=315) $2-=2; for(snp in diff) if ($2$4==diff[snp]) { delete diff[snp]; next; } print $2$5; } END { for(snp in diff) print diff[snp] }' | sort -n

Edit: My own Dante Labs-provided VCF seems to be missing calls for *a lot* of the mtDNA, which confused the heck out of me. This version of the shell command has the full Yoruban variant list, and seems to give the correct haplogroup at least for your data, although I can't vouch that all the differences are correct. Basically, if the variant IS on the Yoruban list then it's skipped, and if there's no call for that location then the Yoruban reference variant is output; this means we get the Yoruban ref allele for any possible "no calls". It also does not handle multiallelics. Still wondering if I can clean up that logic...

ChrisR
04-11-2019, 08:19 PM
tabix some.vcf.gz chrM | gawk 'BEGIN { split("239C 263G 408T 750G 1438G 2352T 2483T 2706A 3106C 3915A 4727G 4769G 7028C 8860G 9380A 9540T 11017T 14766C 15326G 16362C 16482G", diff); } { if($2>=16193) $2-=2; else if($2>=3107) $2-=1; else if($2>=315) $2-=2; for(snp in diff) if ($2$5==diff[snp]) { delete diff[snp]; next; } print $2$5; } END { for(snp in diff) print diff[snp] }' | sort -n
Sorry, I have no current practice with Linux genetic tools. I installed tabix on Debian. It would be great if you could help me with the conversion of this data:

HVR2: 195C 410A
CR: 2354G 2485C 3198G 5302C 5581G 5657T 7386T 7769A 8702T 9378C 9411A 9478C 9541C 10399T 10820T 10874T 10928C 11018A 11468G 11723G 12309G 12373G 12619C 12706C 12851C 13618T 14183G 14213G 14581C 14906T 15302T 15933A
HVR1: 16094C 16173T 16225T 16265G 16272G 16322G 16521G
The VCF File itself is like this

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT default
chrM 195 . C T 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1569:64:7,1562:3,825:4,737:-99:PASS:370,370,0
chrM 410 . A T 3070 PASS SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1388:168:0,1388:0,776:0,612:-99:PASS:370,370,0
chrM 2354 . C T 3070 PASS SNVHPOL=5;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:2224:70:0,2223:0,1122:0,1101:-99:PASS:370,370,0
chrM 2485 . C T 3070 PASS SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:2265:61:0,2265:0,1181:0,1084:-99:PASS:370,370,0
chrM 3198 . T C 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:2040:81:0,2040:0,990:0,1050:-99:PASS:370,370,0
chrM 5302 . A C 3070 PASS SNVHPOL=2;MQ=59 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:1726:22:1934:81:342,1592:152,763:190,829:-99:PASS:370,370,0
chrM 5581 . C T 3070 PASS SNVHPOL=3;MQ=56 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:18:1601:33:0,1601:0,754:0,847:-99:PASS:370,370,0
chrM 5657 . A G 3070 PASS SNVHPOL=3;MQ=55 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:14:1632:42:1,1631:0,886:1,745:-99:PASS:370,370,0
chrM 7386 . A G 3070 PASS SNVHPOL=2;MQ=53 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:18:1910:83:1,1909:1,940:0,969:-99:PASS:370,370,0
chrM 7769 . A G 3070 PASS SNVHPOL=2;MQ=59 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2106:45:0,2106:0,1111:0,995:-99:PASS:370,370,0
chrM 8702 . G A 3070 PASS SNVHPOL=3;MQ=59 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2055:111:1,2054:1,1087:0,967:-99:PASS:370,370,0
chrM 9378 . G A 3070 PASS SNVHPOL=2;MQ=58 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:18:2025:77:2,2023:0,1068:2,955:-99:PASS:370,370,0
chrM 9411 . A G 3070 PASS SNVHPOL=3;MQ=58 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:18:2081:56:1,2080:1,1045:0,1035:-99:PASS:370,370,0
chrM 9478 . G A 3070 PASS SNVHPOL=8;MQ=59 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2084:83:0,2082:0,1050:0,1032:-99:PASS:370,370,0
chrM 9541 . C T 3070 PASS SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1916:194:0,1916:0,937:0,979:-99:PASS:370,370,0
chrM 10399 . G A 3070 PASS SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:1907:68:0,1906:0,1000:0,906:-99:PASS:370,370,0
chrM 10820 . G A 3070 PASS SNVHPOL=6;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2156:92:0,2156:0,1014:0,1142:-99:PASS:370,370,0
chrM 10874 . C T 3070 PASS SNVHPOL=5;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2092:142:0,2092:0,1013:0,1079:-99:PASS:370,370,0
chrM 10928 . T C 3070 PASS SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1279:802:1,1278:0,268:1,1010:-99:PASS:370,370,0
chrM 11018 . C T 3070 PASS SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1676:419:0,1675:0,582:0,1093:-99:PASS:370,370,0
chrM 11468 . A G 3070 PASS SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:2056:40:0,2056:0,1028:0,1028:-99:PASS:370,370,0
chrM 11723 . C T 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:23:2080:107:0,2080:0,1062:0,1018:-99:PASS:370,370,0
chrM 12309 . A G 3070 PASS SNVHPOL=5;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2099:75:1,2098:1,1049:0,1049:-99:PASS:370,370,0
chrM 12373 . G A 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1982:241:3,1979:3,983:0,996:-99:PASS:370,370,0
chrM 12619 . G A 3070 PASS SNVHPOL=5;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:2299:50:28,2271:11,1088:17,1183:-99:PASS:370,370,0
chrM 12706 . T C 3070 PASS SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2218:68:22,2196:8,1063:14,1133:-99:PASS:370,370,0
chrM 12851 . G A 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2000:136:0,2000:0,972:0,1028:-99:PASS:370,370,0
chrM 13618 . T C 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:24:2155:60:1,2154:1,1103:0,1051:-99:PASS:370,370,0
chrM 14183 . T C 3070 PASS SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2016:82:1,2015:1,983:0,1032:-99:PASS:370,370,0
chrM 14213 . C T 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2023:96:0,2023:0,1016:0,1007:-99:PASS:370,370,0
chrM 14581 . G A 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1760:191:0,1760:0,996:0,764:-99:PASS:370,370,0
chrM 14906 . A G 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:2220:60:0,2220:0,1152:0,1068:-99:PASS:370,370,0
chrM 15302 . A G 3070 PASS SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:2013:61:0,2013:0,992:0,1021:-99:PASS:370,370,0
chrM 15933 . C T 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:30:2031:45:0,2031:0,964:0,1067:-99:PASS:370,370,0
chrM 16094 . T C 3070 PASS SNVHPOL=3;MQ=58 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:1931:15:1574:128:229,1345:76,444:153,901:-99:PASS:370,370,0
chrM 16173 . C T 3070 PASS SNVHPOL=3;MQ=53 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:1204:5:401:567:0,401:0,69:0,332:-99:PASS:370,370,0
chrM 16225 . T C 3070 PASS SNVHPOL=3;MQ=54 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:2258:14:765:186:1,764:1,249:0,515:-99:PASS:370,370,0
chrM 16265 . T C 3070 PASS SNVHPOL=6;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:2845:22:974:114:2,972:1,487:1,485:-99:PASS:370,370,0
chrM 16272 . C T 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:2643:22:894:220:1,892:1,419:0,473:-99:PASS:370,370,0
chrM 16322 . T C 3070 PASS SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:3070:22:1210:135:0,1210:0,736:0,474:-99:PASS:370,370,0
chrM 16521 . C T 3070 PASS SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:2855:22:977:14:4,973:0,195:4,778:-99:PASS:370,370,0

Ann Turner
04-11-2019, 08:27 PM
I wrote them and they replied in a few minutes asking me to fill out a "Receive your Customized Report" request form.
I got the same email (about requesting a customized report for mtDNA) but the link didn't work a few days ago. The page flipped to the home page after a few seconds. Today I'm getting a 404 "not found" error message.

Petr
04-11-2019, 08:33 PM
Thank you, I forgot that it is Yoruba.

Donwulff
04-12-2019, 11:17 AM
Tabix is very useful for working with vcf files, but actually it's not even necessary.

zcat some.vcf.gz | awk -v ORS=' ' 'BEGIN { split("73G 150T 195C 263G 750G 1438G 2352C 2483C 2706G 4769G 5580C 7028T 8701G 8860G 9377G 9540C 10398G 10819G 10873C 11719A 12705T 14212C 14766T 14905A 15301A 15326G 16172C 16189C 16223T 16320T", diff); } /^chrM[[:space:]]/ { if($2>=16193) $2-=2; else if($2>=3107) $2-=1; else if($2>=315) $2-=2; for(snp in diff) if ($2$4==diff[snp]) { delete diff[snp]; next; } print $2$5; } END { for(snp in diff) print diff[snp] }' | sort -n
Should work anywhere, even in Windows Subsystem for Linux. This is slower, because it goes over the whole vcf file.

408T 3197C 5301C 5656G 7385G 7768G 9410G 9477A 10927C 11017T 11467G 11722T 12308G 12372A 12618A 12850A 13617C 14182C 14580A 15932T 16093C 16263C 16270T 16519T 73G 150T 263G 750G 1438G 2706G 4769G 7028T 8860G 11719A 14766T 15326G 16189C
Feeding it into James Lick's mtHAP:

Best mtDNA Haplogroup Matches:

1) U5b1b1(T16192C)

Defining Markers for haplogroup U5b1b1(T16192C):
HVR2: 73G 150T 263G
CR: 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G
HVR1: 16189C 16270T

Marker path from rCRS to haplogroup U5b1b1(T16192C) (plus extra markers):
H2a2a1(rCRS) ⇨ 263G ⇨ H2a2a ⇨ 8860G 15326G ⇨ H2a2 ⇨ 750G ⇨ H2a ⇨ 4769G ⇨ H2 ⇨ 1438G ⇨ H ⇨ 2706G 7028T ⇨ HV ⇨ 14766T ⇨ R0 ⇨ 73G 11719A ⇨ R ⇨ 11467G 12308G 12372A ⇨ U ⇨ 16192T 16270T ⇨ U5 ⇨ 3197C 9477A 13617C ⇨ U5a'b ⇨ 150T 7768G 14182C ⇨ U5b ⇨ 5656G ⇨ U5b1 ⇨ 16189C ⇨ U5b1(T16189C) ⇨ 12618A ⇨ U5b1b ⇨ 7385G 10927C ⇨ U5b1b1 ⇨ 16192C ⇨ U5b1b1(T16192C) ⇨ 5301C 9410G 16093C 16263C

Good Match! Your results also had extra markers for this haplogroup:
Matches(27): 73G 150T 263G 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G 16189C 16192C 16270T
Extras(4): 5301C 9410G 16093C 16263C
That looks like a great match, because none of the "Extras" is on the Yoruban variant list, so no no-calls are hiding among them. I'm not sure why my own VCF seems to be missing most variants, though they're all in the BAM raw data. So if there are a lot of mismatches, the conversion didn't work and one has to wait for the raw data. I'm wondering if we can put these in a FAQ and/or web tool somewhere...

MacUalraig
04-12-2019, 01:08 PM
There is a need for *easy to use* tools to do a number of tasks, aimed at non-programmers who use Windows (and I say that as a programmer who has worked on various OSes in my time but also spends a lot of time with non-IT-literate users). Examples would be easy splitting off of the Y and mt BAMs (perhaps a batch file with params wrapping the command-line samtools calls sketched below), extraction of a GEDmatch upload file, etc. Or even a diddy GUI like the one Felix did for his BAM kit. This would counteract the idiot bloggers who are already going around saying WGS is no use for genetic genealogy.

:-)
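For the record, the samtools calls such a batch file would wrap are short; a sketch assuming samtools and a sorted BAM:

samtools index sample.bam                     # one-off, writes sample.bam.bai
samtools view -b sample.bam chrY > chrY.bam   # Y-only BAM, e.g. for Y analysis
samtools view -b sample.bam chrM > chrM.bam   # mtDNA-only BAM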

TigerMW
04-12-2019, 01:27 PM
There is a need for * easy to use * tools ... This would counteract the idiot bloggers who are already going around saying WGS is no use for genetic genealogy.
I agree with you; they need to get much, much easier, and in reality the analysis and tools should be blended into the product offering itself.

I haven't seen an answer on this. MacUalraig, can you explain? Are you talking about the laboratory Dante uses?


Probably in the hands of 'users/w******g' who I speculated previously may dwell in China... (if I can be pardoned the racial stereotyping)


What do you mean by that?

Donwulff
04-12-2019, 02:13 PM
Most customers, perhaps now and certainly in the future, won't even have a traditional desktop; right now it's mobile devices/phones, and soon enough someone will make augmented-reality glasses that actually stick, etc. In light of that, web-based analysis services are already where it's at.

Of course, I'll have to say a LOT of this would be solved if Dante Labs just catered to their more knowledgeable userbase and provided the gVCF/mtDNA FASTA/tab-format(23andMe etc.)/BAM's per chromosome (including unmapped) "out of the box", it wouldn't add much to the bioinformatics but would make the product SO much more valuable. Of course, they have to fix the mtDNA reference first ;) I suspect *most* of their customers just grab the somewhat sub-standard health interpretation and run with it, so people should make it clear that they're losing customers because of it. (Granted, as said before, Dante Labs is still generally cheapest alternative even if you have to buy full analysis from Sequencing.com/FGC/YFull/Galaxy)

And then we'll have to deal with the GRCh37/GRCh38/GRCh39 confusion, but there I agree that different reference genomes could be purchasable add-ons, as it's a non-trivial amount of re-calculation. (Although I suspect that mapping to GRCh38 and then a "liftover" to GRCh37 coordinates would be the best strategy going forward.) And yes, of course, if you get the BAM on a hard drive, web-based analysis will be a lot harder, although I understand several people are concerned about the safety of web services; in that case they might want to reconsider whether it's safe to use the Internet at all. We can probably have Java and virtual-appliance based solutions which can run on websites, cloud servers and locally. With FGC, for example, charging $250 for "advanced analysis" on the web, I'm wondering a bit about the economics. Some standard pipelines on Google Genomics https://cloud.google.com/genomics/ (again, not an entirely turnkey solution) would cost just dollars to run, but there's no incentive to provide anything for free. Services like Promethease have an edge because they've basically turned SNPedia contributions proprietary, so they have something others can't offer (without paying them). Sites like GEDmatch might also consider getting more involved in the bioinformatics/data-processing business, as DNA.Land for example already is with imputation under the hood. But BAM processing and reference conversions will have to be part of the service eventually.
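As one concrete liftover route (my assumption; CrossMap is just one of several tools, and hg19.fa is a placeholder for an uncompressed target-build reference FASTA):

pip install CrossMap
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
CrossMap.py vcf hg38ToHg19.over.chain.gz calls.hg38.vcf hg19.fa calls.hg19.vcf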

JamesKane
04-12-2019, 04:45 PM
Actually, I kind of agree with Mike. If Dante Labs wants to grow beyond delivering low cost sequencing results for predominantly medical purposes, it needs to start taking full pages from 23andMe's playbook.

I've said it before and will again. There is no money in sequencing in the future. It's a race to the bottom on price and margin with an eventual goal of $100 or less for a 30x short read WGS test. From there it becomes a question of extracting revenue by delivering insights into the data like trait reports or matching algorithms for genealogy, or selling access to de-identified data for medical research from customers who opt in. We can expect many of the current players to exit unless they adapt to this reality.

Sequencing.com kind of tries to fill this niche, but there isn't a seamless integration. Dante customers are on their own to get the 120GB of raw data into the cloud. Either that needs to be addressed or Dante must ultimately become more than it is today.

Donwulff
04-12-2019, 05:07 PM
Dante Labs has claimed in their communications that they're moving more and more of the actual sequencing into Italy (granted, labor is probably still cheaper in China, but at least they won't have to pay a middle-man), so it's also a game of being an established player with "one size fits all" economies of scale, which they've been working on with very aggressive pricing. In the e-mail they sent customers they promised "more reports" anyway; alas, they appear to think their customer base is mainly interested in medical interpretations, as reflected in their existing report & the new pharmacogenetics report they just added. So it's just a matter of making them aware that the genetic genealogy community exists and is a major player as well.

Of course, the medical reports are always a wager; compare 23andMe, who were first slammed for providing reports on too many variants and then, after the FDA stepped in, for providing reports on too few variants. There would be a huge pushback against Dante Labs as well if they hit the clinical laboratories' & for-profit hospitals' radar. Can you imagine the class-action lawsuits when people pick up on sequencing & interpretation errors? (No matter that clinical labs/hospitals make those as well, but there's a marketing/public-relations issue.)

TigerMW
04-12-2019, 05:40 PM
Dante Labs has claimed in their communications they're moving more and more of actual sequencing into Italy (Granted, labor is probably still cheaper in China, but at least they won't have to pay a middle-man), so it's also a game of being established player with "one size fits all" economies of scale which they've been working on with very aggressive pricing.
I think they have a fork in the road ahead of them, but their announcement of a lab in Italy is vaporware at this point, from what I can see. I asked them several days ago when they were opening it ... crickets! (nothing). It's the answer I have received before. [Edit still on 4/12: Don says they are doing processing in a lab in Italy. I don't want to say they are doing no lab processing in Italy; I don't know. All I know is they told me in writing that they use BGI for their consumer WGS test, which is what this thread is about. BGI's web site shows that the WGS testing for them is done in China. For full production purposes, I have no reason to doubt what Dante and BGI are saying. My posting is specific to WGS and to what Dante and BGI are saying.]

It's a good move marketing wise to announce a future lab in Italy to make the European market feel better.

However, the trade-offs cannot be avoided: price versus security (even if only perceived) and more interpretation services. Their lab partner*, BGI, shows they do the WGS testing in China.
https://www.bgi.com/global/resources/offices-and-laboratories/

BGI is massive and subsidized to boot. If you want cheap, BGI China is the future; Italy is not. I doubt the US is either, unless we all end up with near-100% automated (robot) labs. FTDNA says they have invested in automation in the lab, but who knows? If Dante says the Italian government is making a massive investment in an automated laboratory, that's what is needed. Otherwise, this means Dante is just buying customers today to build a database for a higher-quality, more complete product in the future. This is the fork in their road: price versus fuller capabilities. I think they'll go the way of fuller capabilities.


Of course, the medical reports are always a wager; compare 23andMe, who were first slammed for providing reports on too many variants and then, after the FDA stepped in, for providing reports on too few variants. There would be a huge pushback against Dante Labs as well if they hit the clinical laboratories' & for-profit hospitals' radar. Can you imagine the class-action lawsuits when people pick up on sequencing & interpretation errors? (No matter that clinical labs/hospitals make those as well, but there's a marketing/public-relations issue.)
23andMe is their real competitor. 23andMe could easily contract with BGI too, or may have enough money to do WGS in a high-automation environment. I see their lab is in North Carolina, which is a fairly pro-business state. I see GlaxoSmithKline invests in 23andMe. They have the money to do it.

BTW, I used to know the chairman of the old SmithKline. He was a very nice, humble man.

* On Nov. 19 '18 I asked Dante via email "I am interested in your promotional WGS offering. Who specifically are the laboratories you use?" On Nov. 20th I received this reply.

"thank you for your message. We have a partnership with BGI."

Donwulff
04-12-2019, 07:47 PM
Dante Labs has a certified, operational sequencing lab in Italy, as has been widely reported: https://nanoporetech.com/services/providers/dante-labs - when, or if, they'll also be running other sequencers there is unclear, but implied.
Did anybody ever deny they're using BGI-seq? The physical location is more open to interpretation, though I believe their samples are usually sent to Hong Kong; see my comments above about the labor costs (even if hardware & reagents do form a substantial part of the price). https://www.dantelabs.com/pages/faq

Where are our genomes sequenced?
Your genomes are sequenced in one of our selected third-party partner labs in either Germany, Denmark, New Jersey (USA) or Hong Kong. Before selecting the best partners to sequence your DNA we ran pilots and reviewed more than 15 genetic labs. All of our partner labs are certified, and we have worked with them to implement special quality workflows, going beyond the normal regulatory measures. We have now setup our own laboratory in L’Aquila, Italy, [ in the university district, near the GSSI PhD school]. Looking forward, more and more samples will be analyzed internally.
According to Dante Labs they ARE moving to analyze more and more samples "internally" in Italy. Clearly, this will save them paying the profits of the testing laboratory, although whether it makes sense in light of labor costs I can't say for sure.

Also, re. the conspiracy theorists:

How do you protect your patients’ privacy?
Dante Labs is built on trust. We respect your privacy and protect it with strong encryption and strict policies that govern how all of the data is handled. We are compliant with the GDPR and the UK Cyber Essentials, required by the National Health Service. We also have a sophisticated cloud structure to protect your data.
Furthermore, after you send us your saliva sample, your data, samples, results and reports are only identifiable by the barcode on the saliva collection tube.
Lastly, your reports and results include no personal identification information, meaning if lost you will remain anonymous.
Obtaining a DNA sample to analyse is the easiest/cheapest part of sequencing. Without any attached personal information, the DNA data is useless even for practically all medical-genetics purposes. Because the testing costs are basically paid by the customer, the medical companies are interested in obtaining the data, but only when customers co-operate with the research. (See: 23andMe with 42 investors, Genos Research sequencing which was sold to NantOmics, AncestryDNA owned by Permira, Gene By Gene (aka FTDNA).)

Can we please, please lay off the political conspiracy theories for just one thread, at least? This is ridiculously long as is ;)

Donwulff
04-12-2019, 08:08 PM
From the Dante Labs Facebook page, so this appears to actually be coming now:
[attached screenshot]

TigerMW
04-12-2019, 08:11 PM
Donwulff, you are laying out "political conspiracy theories". It could be construed as laying out strawman proposals. If you have seen some conspiracy theories, please provide the quotes and cite them specifically so we can discuss.

Open discussion is always good. Debate often helps understanding.

MacUalraig, I see you are on this thread with thanks to Donwulff. Please explain this post that I've already asked about.


Probably in the hands of 'users/w******g' who I speculated previously may dwell in China... (if I can be pardoned the racial stereotyping)

MacUalraig
04-12-2019, 08:16 PM
Thanks Donwulff for all your helpful and positive contributions to the topic. :-)

TigerMW
04-12-2019, 08:21 PM
Of course, the medical reports are always a wager; compare 23andMe, who were first slammed for providing reports on too many variants and then, after the FDA stepped in, for providing reports on too few variants. There would be a huge pushback against Dante Labs as well if they hit the clinical laboratories' & for-profit hospitals' radar. ...
You bring up a good question. Does Dante say their health reports are FDA approved? We know it took a couple of years for 23andMe to get that sorted out. I see Dante says they use a "non-invasive, FDA approved saliva collection method", but that is a very narrow statement.
https://us.dantelabs.com/products/whole-genome-sequencing-wgs-full-dna-analysis

ChrisR
04-13-2019, 08:19 AM
Defining Markers for haplogroup U5b1b1(T16192C):
HVR2: 73G 150T 263G
CR: 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G
HVR1: 16189C 16270T

Marker path from rCRS to haplogroup U5b1b1(T16192C) (plus extra markers):
H2a2a1(rCRS) ⇨ 263G ⇨ H2a2a ⇨ 8860G 15326G ⇨ H2a2 ⇨ 750G ⇨ H2a ⇨ 4769G ⇨ H2 ⇨ 1438G ⇨ H ⇨ 2706G 7028T ⇨ HV ⇨ 14766T ⇨ R0 ⇨ 73G 11719A ⇨ R ⇨ 11467G 12308G 12372A ⇨ U ⇨ 16192T 16270T ⇨ U5 ⇨ 3197C 9477A 13617C ⇨ U5a'b ⇨ 150T 7768G 14182C ⇨ U5b ⇨ 5656G ⇨ U5b1 ⇨ 16189C ⇨ U5b1(T16189C) ⇨ 12618A ⇨ U5b1b ⇨ 7385G 10927C ⇨ U5b1b1 ⇨ 16192C ⇨ U5b1b1(T16192C) ⇨ 5301C 9410G 16093C 16263C

Good Match! Your results also had extra markers for this haplogroup:
Matches(27): 73G 150T 263G 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G 16189C 16192C 16270T
Extras(4): 5301C 9410G 16093C 16263C
Thank you very much, Donwulff!
Indeed, now the results look great. So it's just a bad mtDNA reference used by DL (or Sequencing.com), which they hopefully will fix sooner or later. For direct comparison, here again is the mthap report for the same person's 23andMe MT data:

Defining Markers for haplogroup U5b1b1:
HVR2: 73G 150T 263G
CR: 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G
HVR1: 16189C 16192T 16270T

Marker path from rCRS to haplogroup U5b1b1 (plus extra markers):
H2a2a1(rCRS) ⇨ 263G ⇨ H2a2a ⇨ 8860G 15326G ⇨ H2a2 ⇨ 750G ⇨ H2a ⇨ 4769G ⇨ H2 ⇨ 1438G ⇨ H ⇨ 2706G 7028T ⇨ HV ⇨ 14766T ⇨ R0 ⇨ 73G 11719A ⇨ R ⇨ 11467G 12308G 12372A ⇨ U ⇨ 16192T 16270T ⇨ U5 ⇨ 3197C 9477A 13617C ⇨ U5a'b ⇨ 150T 7768G 14182C ⇨ U5b ⇨ 5656G ⇨ U5b1 ⇨ 16189C ⇨ U5b1(T16189C) ⇨ 12618A ⇨ U5b1b ⇨ 7385G 10927C ⇨ U5b1b1 ⇨ 16263G

Imperfect Match. Your results contained differences with this haplogroup:
Matches(25): 73G 150T 263G 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G 16270T
Extras(1): 16263G
Untested(2): 16189 16192

ChrisR
04-13-2019, 08:41 AM
Dante Labs has certified, operational & running sequencing lab in Italy as has been well & widely reported: https://nanoporetech.com/services/providers/dante-labs - when, or if they'll be also running other sequencers is unclear, but implied.
Did anybody ever denied they're using BGI-seq? The physical location is more open to interpretation, though I believe their samples are usually sent to Hong Kong, see my comments above about the labor costs (Even if hardware & reagents do form substantial part of the price). https://www.dantelabs.com/pages/faq

According to Dante Labs they ARE moving to analyze more and more samples "internally" in Italy. Clearly, this will save them paying the profits of the testing laboratory, although whether it makes sense in light of labor costs I can't say for sure.
In April 2018 I was able to attend a presentation by A. Riposati, co-founder and CEO of DL. After the presentation, together with a friend (a biomedical researcher), I tried to get some background info from Mr. Riposati. From my notes one year ago:

The European seat is in L'Aquila; A. Riposati was an Amazon manager in America, and co-founder M. Capulli an assistant professor. They have about 6 employees in Italy and a few in New York. Tests are available in Europe except DE, CH, NL, and the tests from European customers are bundled and sent to the partner laboratories in Bonn, the Netherlands and Portugal. NGS is handled by the University of Bonn (ISO, CLIA, Illumina certification), and specific procedures can be agreed with scientific partners. For private customers, the company has been operational for almost a year and is developing within its plans, against the forecasts of skeptical potential financiers in Europe - I think he mentioned they had sold almost 2000 NGS tests in Europe. They cooperate with private clinics, e.g. in Vienna, where doctors use the tests for patients, and are also expanding scientific cooperation, where they see themselves as service providers for logistics and DNA tests.
I do not know what has changed in the last 12 months.

MacUalraig
04-13-2019, 10:32 AM
ChrisR, was this presentation in Italy?

ChrisR
04-13-2019, 12:36 PM
ChrisR, was this presentation in Italy?
Yes, in North Italy.

Donwulff
04-13-2019, 03:03 PM
Funnily enough, they have a "research & development unit" & center for the Nordic countries in Finland: https://www.helsinkibusinesshub.fi/dante-labs-launched-a-new-unit-in-the-finnish-biomedical-ecosystem/ It appears they have had all the business paperwork filed since last summer, but it looks to exist only on paper so far. Judging from "It also introduced Dante Labs to key players, including Finnish biobanks, Business Finland and its funding opportunities, the Helsinki Region Chamber of Commerce, and the Finnish Patent and Registration Office.", I think THAT one may exist for business benefits only for now. But technically they have offices in Italy, New York and Finland.

https://academicpositions.com/ad/gran-sasso-science-institute/2019/call-for-31-phd-research-fellowships-in-the-fields-of-physics-mathematics-computer-science-and-social-sciences/127767
"1 scholarship available for the research project “Design and implementation of efficient algorithms and data structures for the analysis of genomic data” financed by Dante Labs srl. The PhD student will help to develop scalable tools for genome informatics that can handle very large genomic databases, starting from recent advances involving the run-length compressed Burrows-Wheeler Transform."
(That's pretty generic, but then again best not to put huge constraints on research)

mkkangas
04-13-2019, 05:13 PM
I've been waiting 4 months for my Dante Labs processing.

mkkangas
04-13-2019, 05:14 PM
Is there a tracker for how long each DNA provider takes to process kits?

MacUalraig
04-13-2019, 05:24 PM
Is there a tracker for how long each DNA provider takes to process kits?

I doubt it for WGS at this point; probably for the mass chip tests there are some stats floating around somewhere. I ordered a Dante test last July and an FGC test in August, and neither is fully complete and delivered yet. My YSEQ WGS test took place entirely in Germany and took only about 2 months from sample return.

MacUalraig
04-13-2019, 05:26 PM
I don't recommend buying just on the basis of delivery time, though. All those tests have something different to offer in some way.

JamesKane
04-13-2019, 05:44 PM
Processing time (https://thednageek.com/dna-tests/) has a table with a few NGS entries. It is missing the 30x WGS test I submitted last year, so I don’t know how accurate any of it really is with the short span of time it selects.

Donwulff
04-13-2019, 05:57 PM
"5Average time from mailing the test to the lab to receiving results; times based on user-submitted data from Dec’18–Feb’19."

It's a two-month sliding window, which makes sense for the fast-turnaround tests, where queues and lab situations change monthly and you won't be interested in their performance last year. WGS tests, which for DTC consumers take more than two months, could benefit from a longer window, or even from listing individual test times at this point. In general, I don't think the WGS testing time has much relevance to anything: you know you're going to have to wait months for it anyway, so it isn't instant gratification, and people freak out when it's a few days late. Granted, I've grumbled about how *all* companies are overly optimistic about sequencing times. So all I can say is, expect delays and be very, very pleasantly surprised when companies deliver on time. I realize it can be nerve-wracking to wonder if the results are coming at all (or when a lower price appears during the wait, but honestly, you'd probably wait even longer for the low-priced one), and again, it seems all companies could work a bit better on communicating delays to customers.

Giosta
04-13-2019, 06:17 PM
My Dante vcf does not contain any chrM lines

Donwulff
04-14-2019, 10:45 AM
The rCRS reference is usually expressed as chrMT, while the Yoruban one is chrM (plain M and MT are also possible). As said, in my VCF several calls which are clearly present in the BAM are missing (perhaps they had a filter on too-high sequencing depth, which usually indicates a problem, except on mtDNA where huge depth is normal), so it seems like they've changed the analysis pipeline around multiple times. If none of the chrM records exist, or calls are missing, you're out of luck until you get the BAM analysed.
The VCF is kind of large, so I'm not sure what people are using to read it; checking that whatever you're using can handle the file size & the mtDNA naming correctly is a good idea as well.
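A quick check that doesn't depend on the reader handling the full file, assuming a gzipped VCF:

zcat sample.vcf.gz | awk '$1 == "chrM" || $1 == "chrMT" || $1 == "M" || $1 == "MT"' | wc -l   # count mtDNA records under any common contig name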

bjp
04-14-2019, 12:18 PM
Yesterday the free Pharmacogenetics report landed in my account for the Dante kit I am dealing with. I've also received word (though no tracking yet) that my order for a hard disk will be fulfilled soon, after placing an order from the site that has been listed here a few times.

I definitely have no mtDNA calls in this VCF though, and it's not a toolset issue. Tabix, grep, and vim all confirm there are no mtDNA calls in the VCF, and none popped up as I went through my pipeline to lift over to hg38 and merge in dbSNP/ClinVar/snpEff data. I will just pull them out of the BAM once it arrives, and will post an updated timeline of order/receipt dates when the BAM arrives.

Donwulff
04-14-2019, 01:11 PM
One can do quick & dirty mtDNA variant calling with bcftools from the BAM:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# samtools/bcftools need a faidx-indexed fasta (bgzip-compressed or plain, not plain gzip):
zcat hg19.fa.gz | bgzip > hg19.fa.bgz && samtools faidx hg19.fa.bgz
samtools index some-mapped-sorted-bam-file.bam
bcftools mpileup -d 10000 -r chrM -f hg19.fa.bgz some-mapped-sorted-bam-file.bam | bcftools call -c --ploidy 1 | grep -v "[[:space:]]0:" --line-buffered | tee some-mapped-sorted-bam-file.chrM.vcf
(You can add -v to "bcftools call" for a similar effect to the grep, but grep with --line-buffered shows progress. Or you can use the multiallelic caller with -m instead of -c, but then you'll have to contend with results like "Adenine OR Cytosine", so the consensus caller may be easiest for genealogical mtDNA. For me the results are identical.)

Of course, that'll still be in the coordinates of whatever reference was used for the BAM, in this case the hg19 Yoruban chrM. Grabbing the chrM reads and re-mapping them to either the rCRS or RSRS reference would also be possible (a rough sketch below), but at that point you're starting to look at a full-fledged genome analysis pipeline. Galaxy, Sequencing.com, YFull (for males) and FGC can all do that as well if you can get the BAM file online.
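If you want to try the re-mapping route yourself, something like this untested sketch should work, assuming bwa is installed and you've saved an rCRS fasta as rCRS.fa (both assumptions, not anything Dante ships):

# Pull the chrM-mapped reads out of the full BAM:
samtools view -b some-mapped-sorted-bam-file.bam chrM > chrM.bam
# Group the read pairs back together and dump them to FASTQ:
samtools collate -u -O chrM.bam | samtools fastq -1 chrM_1.fq -2 chrM_2.fq -0 /dev/null -s /dev/null -n
# Re-map against rCRS, sort and index:
bwa index rCRS.fa
bwa mem rCRS.fa chrM_1.fq chrM_2.fq | samtools sort -o chrM.rcrs.bam -
samtools index chrM.rcrs.bam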

Ann Turner
04-15-2019, 03:45 AM
My Dante vcf does not contain any chrM lines
Nor did mine, but today I received working instructions on how to request a report. The first link they sent gave me a 404 error message.

Go to our website dantelabs.com
Click on Products
Click on My Full DNA: Whole Genome Sequencing with mtDNA
Over the far right, you should see a brief description of what this Kit offers as well as a "form" hyperlink above the cost.
Click on form and fill out the requested information.


It should look like the below:

If you seek to get a Customized Report on a disease or gene panel, after receiving your saliva collection kit, fill out the form

If results come back aligned to the Yoruba Reference Sequence, James Lick's Advanced utility lets you select that. I don't have any direct experience with it.

https://dna.jameslick.com/mthap-new/advanced.php

Petr
04-15-2019, 09:44 AM
Update:

Previously I wrote: "Now I sent the 3rd to 7th kits: Delivered (tracking): 2019-01-03 ... and I'm curious to see what the turnaround time will be this year." Here is where things stand:

3rd to 7th kit:
Delivered (DHL tracking): 2019-01-03
Dante Labs: Kit received: 2019-01-13
Successful DNA extraction - Level A: 2019-02-27
DNA Sequencing was completed with success: 2019-03-22
Your results are ready!: 2019-04-02
HDD ordered: 2019-04-04
More results (indel, CNV, SV): 2019-04-10
So for 4 kits I'm waiting for the HDD, and one kit still has the status "Successful DNA extraction - Level A"; according to Dante support, "The expected results are in the early of May".

8th to 10th kit:
Delivered (UPS tracking): 2019-03-08
Dante Labs: Kit received: 2019-04-02?? (no e-mail received)
So these kits still have status "Kit received".

kafky
04-15-2019, 10:40 AM
There are some relevant possible sub-topics here, derived from the massive Dante Labs approach.
1. Which file format would be the best deliverable for domestic users with genealogical and possibly health interests. The BAM file may be essential for a full product, but a BAM is a very big file full of irrelevant information. A VCF only has the results that differ from a reference genome; however, it should also include mtDNA, and it lacks the indels and other undefined information. I think that both BAM and VCF may be the best approach (if the VCF includes mtDNA).
2. Quality. We can verify the quality of the results. I must say, from my direct knowledge, that Portuguese DNA labs are extraordinarily good, and I also have a very good appraisal of the Italian and Finnish ones. They have a track record of a very good ethical and research-based attitude, as regards the GDPR compliance determined by the EU commission (no Brexit jokes here). FDA reports are not so relevant outside the USA.
3. Interpreting results. A new avenue is open that leads both to commercially driven services and to highly differentiated end-users who can do their own job. I like to think of myself as part of the second road. GEDmatch closed the door to VCFs; I do not know why, but assume the complexity is very high. There may be room for a seq-match service accepting only VCF files. For Y, tools to clarify haplogroups like ytree.morleydna.com could be better adapted to this new reality. For mtDNA, James Lick's tool is excellent, but you still need to have the mtDNA results in a feasible form, which may include extracting the mtDNA (and Y) from the whole VCF.

Just some directions for reflection.

kafky
04-15-2019, 11:04 AM
Success on Y haplogroup definition.

The steps I took are:

1. converted the VCF to raw format with DNA Kit Studio (option without RSID);
2. in a text editor, deleted all data except the header and the Y results; saved as txt.
3. uploaded the resulting file to ytree.morleydna.com

Voilà. DF27, without more definition than this.
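For step 2 there's a rough command-line shortcut (untested, and assuming a bgzipped hg19 VCF named genome.snp.vcf.gz): filter down to the header plus the Y calls before converting, so there is less to delete by hand.

# Keep the VCF header plus the chrY records only:
zcat genome.snp.vcf.gz | awk '/^#/ || $1=="chrY"' > chrY_only.vcf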

ChrisR
04-15-2019, 11:54 AM
1. Which file format would be the best deliverable for domestic users with genealogical and possibly health interests. The BAM file may be essential for a full product, but a BAM is a very big file full of irrelevant information. A VCF only has the results that differ from a reference genome; however, it should also include mtDNA, and it lacks the indels and other undefined information. I think that both BAM and VCF may be the best approach (if the VCF includes mtDNA).
I think if they would also make available for download the uniparental raw data (FASTQ and/or BAM for the Y and mt), that would suffice for most users interested in genetic genealogy, and it should be possible at least for a certain amount of time (3-6 months or even longer). If standardized, it should not cause much additional post-sequencing cost (analysis pipeline & hosting).
The full FASTQ and BAM can remain an additional "product", either delivered by HDD (or SD card/USB stick) or by download.
gVCF (records for all sites, whether there is a variant call there or not, plus an accurate estimate of the confidence that the sites are homozygous-reference) might be worth considering as a replacement for the currently used VCF.


3. Interpreting results. A new avenue is open that leads both to commercially driven services and to highly differentiated end-users who can do their own job. I like to think of myself as part of the second road. GEDmatch closed the door to VCFs; I do not know why, but assume the complexity is very high. There may be room for a seq-match service accepting only VCF files. For Y, tools to clarify haplogroups like ytree.morleydna.com could be better adapted to this new reality. For mtDNA, James Lick's tool is excellent, but you still need to have the mtDNA results in a feasible form, which may include extracting the mtDNA (and Y) from the whole VCF.
I think for Y (and now also for mt) sequence analysis and phylogeny reconstruction, the best option is currently YFull. mthap is excellent as long as the latest PhyloTree build is comprehensive enough. Similarly for other mt/Y phylogeny tools, where the reference used is key to the value.
Autosomal matching is much more complex and I'm not an expert. I see:
a) the difficulty of getting huge numbers of testers to upload to one place: GEDmatch and DNA.Land are good, but may not hold any of the matches that would make for a good analysis
b) the difficulty of stepping up from sub-1-million-SNP chips to full-genome variants, with robust matching that still includes all the old SNP-platform results

MacUalraig
04-15-2019, 11:59 AM
One does wonder if YFull will expand further into WGS analysis?

Donwulff
04-15-2019, 07:27 PM
A traditional medical/bioinformatics way of delivering results would be VCF + a BED file of confident regions, i.e. which parts of the genome were successfully tested. FTDNA's Big Y delivered results like this, and after I pointed it out YFull started accepting their VCF files. I assume YFull could do the same for Dante Labs if they started delivering gVCF and/or BED files, though it must be remembered YFull is in the business of analysing sequencing data. GEDmatch should have pre-emptively applied some default BED files for WGS and exome, derived from a few model samples. The gVCF format might be winning, and I see a singular benefit for it in online use: the callable regions are encoded in the same file, so there's no "forgot to upload the BED, or uploaded the wrong one".

Dante Labs & Sequencing.com have already declared they're working on transfer of the raw data to Sequencing.com ("working seamlessly within a few weeks", or something like that), where people can get the additional reports they need for a slight charge. I get that people want everything for free, and quite honestly I could only see that as a strong sales benefit for Dante Labs, but being able to get most formats you want, up to date, for a small charge is a great option. The first wellness report from Dante Labs is already produced with Sequencing.com. I don't know the technical details of that, but it seems that Sequencing.com *already* has access to your results (likely the VCF) whether you choose to use their services further or not.

Not sure what's up with GEDmatch Genesis and VCFs, since I had to leave there after their limits on "artificial kits", which seems like it could cover most bioinformatics workflows (i.e. anything not directly obtained from a vendor), but back before then it seemed that gVCF files were too large to upload there as-is. A kit which has the superset of SNPs on all autosomal tests might be one of the best options. For myself (before obtaining the BAM file), I experimented with putting the sequencing results through the Haplotype Reference Consortium imputation pipeline, which has two benefits: filling in any missing no-calls, and reducing the file size to known variants only. Of course, imputation has the possibility of adding some spurious matches/non-matches too, so this is clearly inferior to knowing the no-calls.

Useful deliverables would overall include:
gVCF file for Promethease; of course companies try to provide their own health interpretations for most consumers.
23andMe/AncestryDNA tab-files with identical SNP's (Some sites don't seem to be able to handle extra SNP's) for current genealogical services
mtDNA FASTA for mtDNA analysis
FASTQ/BAM as currently
Y-chromosome & mtDNA BAMs separated (because YFull takes either/both)
Additionally:
VCF file with all latest dbSNP variants, whether ancestral, derived or no-call, and any derived novel variants.
BED file of confident regions.
A variant browser that can handle those, online for mobile devices like most genomics services provide right now.
For future:
BAM of unmapped/weakly mapped reads because metagenome (ie. bacteria etc.) is still poorly tapped
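On that last item, a trivial untested sketch of pulling out the metagenome candidates with samtools (full.bam is a placeholder name):

# Reads that didn't map anywhere on the human reference:
samtools view -b -f 4 full.bam > unmapped.bam
# Plus the mapped reads whose mates are unmapped, if you want proper pairs downstream:
samtools view -b -f 8 -F 4 full.bam > mate-unmapped.bam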

Indels can be in the same VCF file as far as I can see, as long as they're normalized in the usual way. Everything I know of just ignores indels it can't handle, and the same VCF format handles indels and SNPs just fine. In fact in some cases you can have something like "Ancestral: A, Sample: G *and* AG", so separating them doesn't make much sense. Structural variants, where one segment can be flipped or moved elsewhere, are different, but few things use them yet and short-read sequencing isn't good at detecting them.
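Normalizing, for the record, is a single bcftools call; a sketch assuming the indel VCF is bgzipped and you have the same hg19 fasta it was called against:

# Left-align indels and split multiallelic records against the calling reference:
bcftools norm -f hg19.fa -m-any genome.indel.vcf.gz -Oz -o genome.indel.norm.vcf.gz
tabix -p vcf genome.indel.norm.vcf.gz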

I'm intentionally describing (somewhat) what would be useful "right now", but it should be borne in mind that what's most useful next year may not be the same, which is one reason the "vendor provides everything at once and you need nothing more" model isn't realistic, and third-party services play a part in converting bioinformatics formats (to say nothing of re-analysis with new genomic references and tools). With race-to-the-bottom sequencing costs you can't necessarily expect vendors to provide these & updates for free indefinitely, though it's certainly an important business edge.

Also, DNA.Land, MyHeritage and other established imputation-using genealogical matching sites could technically start taking whole-sequence data almost any time they want, which is interesting. In some ways it doesn't make sense to start a competing service, because they could beat any competition the moment they want to, but until then, that service is missing. (However, whole sequence doesn't currently add much to autosomal matching, since genealogical matches come in IBD segments; it's more about data compatibility, and perhaps later genome phasing and indels.)

Every company that targets/markets to the EU is bound by the GDPR. Of course, just because there's a law for something doesn't mean everybody will obey it; on the other hand, the GDPR means that there are at least some recourses available should a company get caught breaking it. However, that's veering dangerously into opinions and controversial topics... (Whee, long post now, lol)

Donwulff
04-15-2019, 10:08 PM
If people are getting their BAMs on hard drives, it's relatively easy to build a JavaScript/browser-based BAM reader that would extract and send over only the chrY- or mtDNA-mapped reads. I've been racking my brain trying to figure out how you would *prove* that only the selected data is sent, though. But I can't see the hard-disk delivery surviving long in the Internet age (and the Sequencing.com deal is a good indication), so it seems a waste of effort, albeit I'm aware the "sneaker-net" beats the best Internet transfer speeds if we're starting to see really massive data. The latest mobile phones have up to 1TB of storage though, so perhaps people will be keeping their whole family's BAM files on one, so they only have to worry about it being stolen, scrambled or hacked ;)

The autosomal BAM probably has less direct use once processed into a high-quality gVCF/VCF+BED, beyond very computationally intensive re-analysis for newish references etc. Although long-read sequencing polished with short-read sequencing raw data is a good example of new uses which may not be immediately apparent. Joint genotyping and family-trio analyses etc., taking advantage of similarities between genomes of close relatives, are another significant use for the BAM. But of course for starters you need one really good genotyping of the sequence anyway.

ybmpark
04-16-2019, 05:04 AM
If you inquired about your kit and received the following response
"I have checked on your DNA sample and appears to be still in the sequencing process which is a long and complex process, composed of several steps. Your results were expected to be ready on 12th last month. We are very sorry we failed to inform you there could be delays."

You should be alarmed, because it is auto-generated. It sounds personal but is actually auto-generated. Has anyone received results from last year's Thanksgiving promotion? I am curious whether they sequence the full-paying customers' samples first, or a randomly selected few first. If the latter, some people from last year's promo should have gotten their results by now.

MacUalraig
04-16-2019, 06:44 AM
It's just a stock phrase from testing labs. Tests are either ready or not ready and typically there isn't any refinement on 'not ready'.

MacUalraig
04-16-2019, 08:47 AM
Autosomal matching is much more complex and I'm not an expert. I see:
a) the difficulty of getting huge numbers of testers to upload to one place: GEDmatch and DNA.Land are good, but may not hold any of the matches that would make for a good analysis
b) the difficulty of stepping up from sub-1-million-SNP chips to full-genome variants, with robust matching that still includes all the old SNP-platform results

One guy I spoke to was panning the concept of WGS matching based on some research that suggested little advantage. He didn't cite the actual source but I believe it was this one:

"We estimate that WGS data provides a 5% to 15% increase in relationship detection
power relative to high-density microarray data for distant relationships. Our results identify regions of the genome that are
highly problematic for IBD mapping and introduce new software to accurately detect 1st through 9th degree relationships
from whole-genome sequence data."

The paper is a bit old now though and I've not read through all the more recent ones that cite it.

https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004144

That seems a rather defeatist/negative reaction.

kafky
04-16-2019, 09:29 PM
This week I received the results from a kit bought in the Thanksgiving promotion...

kafky
04-17-2019, 12:29 AM
About gVCF in Dante: I am not sure if their format is VCF or gVCF. I will post the header of my SNP file here; I hope there is no sensitive information in it that I should not make public. If there is anything personal or inappropriate, please tell me. I deleted the specific elements identifying the kit number.

I hope it helps to clarify which type of file is included. I should say that they also provided a separate INDELS file.

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=VQSRTrancheSNP99.00to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -4.1123 <= x < 0.6194">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -95126.5844">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -95126.5844 <= x < -4.1123">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.ApplyRecalibration=<ID=ApplyRecalibration,Version=3.7-0-gcfedb67,Date="Fri Mar 29 05:43:47 UTC 2019",Epoch=,CommandLineOptions="analysis_type=ApplyRecalibration input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/l3bioinfo/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_readi ng_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false input=[(RodBinding name=input source=.raw.snp.vcf.gz)] recal_file=(RodBinding name=recal_file source=.recalibrate_snp.recal) tranches_file=.recalibrate_snp.tranches out=/l3bioinfo/results/out_snp.vcf.gz ts_filter_level=99.0 useAlleleSpecificAnnotations=false lodCutoff=null ignore_filter=null ignore_all_filters=false excludeFiltered=false mode=SNP filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.GenotypeGVCFs=<ID=GenotypeGVCFs,Version=3.7-0-gcfedb67,Date="Fri Mar 29 03:54:21 UTC 2019",Epoch=,CommandLineOptions="analysis_type=GenotypeGVCFs input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/l3bioinfo/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_readi ng_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=[(RodBindingCollection [(RodBinding name=variant source=chr1.markdup.recal.g.vcf.gz)])] out=/l3bioinfo/results/chr1.markdup.recal.vcf.gz includeNonVariantSites=false uniquifySamples=false annotateNDA=false useNewAFCalculator=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 heterozygosity_stdev=0.01 standard_min_confidence_threshold_for_calling=10.0 standard_min_confidence_threshold_for_emitting=30. 0 max_alternate_alleles=6 max_genotype_count=1024 max_num_PL_values=100 input_prior=[] sample_ploidy=2 annotation=[] group=[StandardAnnotation] dbsnp=(RodBinding name= source=UNBOUND) filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Thu Mar 28 23:38:33 UTC 2019",Epoch=,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/l3bioinfo/chr1.markdup.recal.bam] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=[chr1] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/l3bioinfo/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_readi ng_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[StrandBiasBySample] excludeAnnotation=[ChromosomeCounts, FisherStrand, StrandOddsRatio, QualByDepth] group=[StandardAnnotation, StandardHCAnnotation] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GVCF bamOutput=null bamWriterType=CALLED_HAPLOTYPES emitDroppedReads=false disableOptimizations=false annotateNDA=false useNewAFCalculator=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 heterozygosity_stdev=0.01 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=30. 
0 max_alternate_alleles=6 max_genotype_count=1024 max_num_PL_values=100 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING pair_hmm_sub_implementation=ENABLE_ALL always_load_vector_logless_PairHMM_lib=false phredScaledGlobalReadMismappingRate=45 noFpga=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=false keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxReadsInMemoryPerSample=30000 maxTotalReadsInMemory=10000000 maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.SelectVariants.2=<ID=SelectVariants,Version=3.7-0-gcfedb67,Date="Fri Mar 29 05:45:41 UTC 2019",Epoch=,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/l3bioinfo/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_readi ng_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=out_snp.vcf.gz) discordance=(RodBinding name= source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/l3bioinfo/results/123191920493_WGZ.snp.vcf.gz sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=false excludeFiltered=true preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 maxNOCALLnumber=2147483647 maxNOCALLfraction=1.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.SelectVariants=<ID=SelectVariants,Version=3.7-0-gcfedb67,Date="Fri Mar 29 04:32:37 UTC 2019",Epoch=,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/l3bioinfo/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_readi ng_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=/l3bioinfo/123191920493_WGZ.vcf.gz) discordance=(RodBinding name= source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/l3bioinfo/results/123191920493_WGZ.raw.snp.vcf.gz sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=true excludeFiltered=false preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[SNP] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 maxNOCALLnumber=2147483647 maxNOCALLfraction=1.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">

Donwulff
04-17-2019, 01:24 AM
It has GenotypeGVCFs, which gets rid of the gVCF stuff. It's kind of weird, actually: the gVCF isn't intended as a final product, and you'd normally have a BED file for the relevant data. A gVCF will have something akin to:
1 1 . N <NON_REF> . . END=10021 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
1 10022 . C <NON_REF> . . END=10023 GT:DP:GQ:MIN_DP:PL 0/0:11:3:10:0,3,45
1 10024 . C <NON_REF> . . END=10024 GT:DP:GQ:MIN_DP:PL 0/0:12:6:12:0,6,90
1 10025 . T <NON_REF> . . END=10025 GT:DP:GQ:MIN_DP:PL 0/0:11:0:11:0,0,340
1 10026 . A <NON_REF> . . END=10026 GT:DP:GQ:MIN_DP:PL 0/0:12:6:12:0,6,90
1 10027 . A <NON_REF> . . END=10028 GT:DP:GQ:MIN_DP:PL 0/0:13:0:12:0,0,369
1 10029 . C <NON_REF> . . END=10030 GT:DP:GQ:MIN_DP:PL 0/0:16:12:15:0,12,180

Note the column with END and the genotype 0/0, which are the essential parts. That says, for example, that genomic locations 1 through 10021 are genotype 0/0 (homozygous reference allele) with a Genotype Quality of 0 (because, uh, it's an undefined section). Positions 10022 through 10023 are homozygous reference as well, but with a read depth of 11 and a genotype quality of 3. You'll notice that listed this way there'll be a line for almost every location of the genome, which is why gVCF files for end-users normally have less detail.
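If you ever wanted to derive the "confident regions" BED discussed above from such a file, a rough untested sketch (assuming a GATK-style gVCF named sample.g.vcf.gz, and arbitrarily treating reference bands with GQ >= 20 as confident):

# Reference bands are the lines with END=; GQ is the third value in GT:DP:GQ:MIN_DP:PL.
zcat sample.g.vcf.gz | awk -F'\t' '!/^#/ && $8 ~ /END=/ {
  split($8, a, "END="); split(a[2], e, ";");  # band end position from INFO
  split($10, s, ":");                         # per-sample values
  if (s[3] >= 20) print $1 "\t" ($2 - 1) "\t" e[1]
}' > confident.bed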

oagl
04-17-2019, 06:23 PM
I got a tracking number for the BAM hard drive shipment today. Didn't have to pay for the hard drive because I bought my two kits before this option was introduced. Took four months from request to shipment, but I can't complain for this price.

pinoqio
04-17-2019, 10:21 PM
On the product comparison table, read length is now listed as 150bp, while on the product page itself it is listed as 100bp.
So I asked about this apparent upgrade, and if it would apply to existing orders that have not been sequenced yet:

I confirm also existing orders receive 150bp.
:thumb:

ChrisR
04-19-2019, 09:21 AM
"We estimate that WGS data provides a 5% to 15% increase in relationship detection
power relative to high-density microarray data for distant relationships."
I also do not remember if I read this paper some years ago, had a little more then 15% in mind. The above paper (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004144) also states "(ERSA), has high power to detect relationships as distant as 8th-degree relatives (e.g., 3rd cousins once removed) from high-density SNP microarray data". I did not find a statement for WGS, but also if my memory works I saw statements that mentioned detection of relationships up to 8th cousins, while microarrays was limited at 5th or 6th cousins. I think it was possible because of unique SNPs which only happened in or in the generations before the common ancestor. Not sure however what coverage, read depth, read length and reference sequence was used, as all those are certainly crucial parameters for the cumputational power.
I also think ancestral population calculators could gain precision if "SNPs that occured after prehistory" would be identified and assigned to regions, so for example to get a good 500 AD prediction.

Donwulff
04-19-2019, 10:11 AM
Now I want a separate thread for WGS based relationship estimates, because that's not Dante Labs specific at all but meh...
I think the big question is how relevant they would really be at this point. I started counting the number of possible cousins, but that was done to death in the past; see for example https://isogg.org/wiki/Cousin_statistics - over 4,700 5th cousins, but over a million 9th cousins.
Do you really need millions of matches? But of course, you won't have millions of matches. Most of your ancient ancestors will disappear from your DNA entirely, the segments inherited from them simply gone in random recombination.
This means that as the degree of cousinhood grows, you won't be finding your relatives, or even your ancestors, but rather the people from whom you happened, by random chance, to inherit DNA, out of the millions of potential ancestors you have.
In fact, the number of potential ancestors very quickly exceeds the number of people who were living in the region at the time, hence you're quickly approaching the "identical ancestors point" for whatever region you're interested in, at which everybody who left modern-day descendants is an ancestor of everybody now living.
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001555 looked at this in some detail. So basically you're then just getting a completely random sample of everybody in the region you're from. (And not incidentally, this is getting close to the biogeographical/ancestry estimates, which due to linkage disequilibrium use only about ~300,000 SNPs.)

It gets worse than that, because with sequencing and smaller and smaller segments, the probability of various types of errors increases. In the hypothetical case that your DNA match was based on a single SNP, getting that single SNP wrong would give the wrong result. Three SNPs: what happens if the middle one is wrong? Structural variations and duplications all mean that below a certain segment size your results start to look pretty random, regardless of what DNA you actually inherited. And then there's pedigree collapse; in DNA terms that means you will inherit DNA from the same person via multiple paths, and below a certain segment length most of the matches start to be composites of segments inherited via different paths, so your matches could actually be a thousand years old, if not older.

So what you actually get is a random selection of random people out of everybody who ever lived. You can get some crazy matches, theoretically, but they're not genealogically meaningful or correct. The current microarrays provide more data than the companies can comfortably use; this is why they haven't moved to larger microarrays and use various caps on matches. And yes, I realize there are plenty of people who WOULD be willing to pay for that; it's just that it rapidly gets genealogically meaningless, beyond the ancestry/biogeographical ancestry/ethnicity estimates they already offer. DNA.Land uses imputation and ERSA already, and with DNA.Land's co-founder Yaniv Erlich moving to MyHeritage as chief scientist, I can only surmise they do something similar, so those two could seemingly take sequencing data any time they deemed it advantageous (which is tricky for the competition). DNA.Land has so few matches they could certainly improve by listing some stone-age matches ;)

Also, yes, there are marginal corner cases where it matters, so I'm sure it wouldn't be without any benefit. In particular, I think if you were able to use the phasing information in the genetic sequence usefully, that would improve the existing results significantly. Another thing is that you might be able to detect breakpoints between IBD segments from different ancestors... or perhaps it's structural variation instead... which makes all the data harder to use.

Ann Turner
04-20-2019, 11:01 PM
Donwulff or anybody:

How do you figure out an mtDNA variant from a gVCF file? In the fragment below, I understand that positions 196-235 and 237-301 match the Yoruba Reference Sequence -- but then why does the ALT column say <NON_REF>?

There is a variant at position 236. There are two ALT values plus <NON_REF>. The GT for this (and all rows) is ./. instead of the 0/1 nomenclature I see in regular VCF files. How do you figure out which ALT value is correct, and what happens if there is heteroplasmy?

chrM 196 . T <NON_REF> . . END=235 GT:DP:GQ:MIN_DP:PL ./.:2796:99:2365:0,120,1800

chrM 236 . T C,A,<NON_REF> . . DP=2272;ExcessHet=3.01;RAW_MQ=8179200.00 GT:AD:DP:GQ:PL:SB ./.:0,2255,2,0:2257:99:84395,6799,0,84299,6744,84268,84376,6799,84288,84362:0,0,696,1561

chrM 237 . A <NON_REF> . . END=301 GT:DP:GQ:MIN_DP:PL ./.:1974:99:1650:0,120,1800

Another sample row (sorry about the word wrap)

chrM 9541 . C T,A,<NON_REF> . . BaseQRankSum=3.92;ClippingRankSum=0.00;DP=1691;ExcessHet=3.01;MQRankSum=1.89;RAW_MQ=6000302.00;ReadPosRankSum=0.091 GT:AD:DP:GQ:PL:SB ./.:5,1670,6,0:1681:99:48615,4967,0,48467,4954,48449,48580,5002,48495,48594:5,0,873,803

JamesKane
04-22-2019, 09:45 AM
The NON_REF is part of the Broad Institute's implementation of the specification. It is used to represent the probability of the site not being the reference allele across the span.

The ./. call at the other sites appears to indicate there is a problem with the genotype quality. The read depths clearly support the first alt allele at both sites. The probabilities are pretty wild, so the algorithm appears to have given up. You will likely need to look at the alignments in another tool.

Donwulff
04-22-2019, 01:14 PM
I saw ./. somewhere before (perhaps in a Genos Research VCF file) and don't recall what it was about, but my guess is it's because of the extremely high read depth on mtDNA, due to there being many mitochondria in every cell. Usually, when thousands of short reads map to a single location on the reference genome it's some sort of "low complexity region" or a contaminant like primers, so you would legitimately wish to consider it an error or suspect, but not so on mtDNA. So it has been blanked out here into "no call". (The QUAL and FILTER columns are empty?)

The rows with "END=???" on them are gVCF probability bands for reference-calls, so tested but matches reference.

The 8th column gives the template for the sample-specific data; the forum formatting mangled it above, but from the original rows it's GT:AD:DP:GQ:PL:SB - the header gives the full explanation for these. These, as well as the corresponding values, are separated by colons. First is of course the GenoType as expected, here blanked out as noted above. But there's also AD, which is "Allelic depths for the ref and alt alleles in the order listed". So for position 236 that means there were 0 reads with T, 2255 reads with C, 2 reads with A, and no reads with any other allele at that location.
The next field is the total read depth DP. Next is GQ, "Genotype Quality", the Phred-scaled likelihood of the genotype call. This is a logarithmic scale; 99 is the maximum (to reduce file size), which means almost certain for the allele with the highest depth. (It's the Bayes theorem based probability from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/ - see also https://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it)
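As a quick sanity check of what the Phred scale means, you can turn a GQ back into an error probability on the command line:

# GQ is -10*log10(P(the call is wrong)), so invert it:
awk -v gq=99 'BEGIN{ printf "GQ %d -> P(wrong) = %.2g\n", gq, 10^(-gq/10) }'
# prints: GQ 99 -> P(wrong) = 1.3e-10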

The second variant at 9541 has allele-specific depths 5, 1670, 6, 0 out of 1681 total read depth, in the order they're listed in the VCF, so in this case that would be 1670 reads for T. Judging mtDNA heteroplasmy from just this is hard. Remember that Illumina sequencing can have 0.1 to 1% error per base, so at read depth 1681 that could be around 17 erroneous bases. In addition, reads mapping to the wrong location are always possible. I think the different alleles here are well within expected sequencing error, but one quick check by hand is comparing the read counts for the different alleles across the variants on mtDNA. Are they closer to the 2 of location 236, or the 5+6 of location 9541?

chrM 236 . T C,A,<NON_REF> . . DP=2272;ExcessHet=3.01;RAW_MQ=8179200.00 GT:AD:DP:GQ:PL:SB ./.:0,2255,2,0:2257:99:84395,6799,0,84299,6744,84268,84376,6799,84288,84362:0,0,696,1561
chrM 9541 . C T,A,<NON_REF> . . BaseQRankSum=3.92;ClippingRankSum=0.00;DP=1691;ExcessHet=3.01;MQRankSum=1.89;RAW_MQ=6000302.00;ReadPosRankSum=0.091 GT:AD:DP:GQ:PL:SB ./.:5,1670,6,0:1681:99:48615,4967,0,48467,4954,48449,48580,5002,48495,48594:5,0,873,803

Presumably this is from saliva; of course "somatic" variants like heteroplasmy can differ between organs. I saw somewhere that the methylation profile of saliva is closer to the methylation profile of brain, which would certainly make sense for heteroplasmy too, but that can't be taken as certain.
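If you want to run that check across the whole mtDNA rather than by hand, an untested sketch (assuming a bgzipped, tabix-indexed VCF with chrM naming; the 5% threshold is arbitrary):

# Print sites where the minor-allele fraction in AD exceeds 5%:
bcftools query -r chrM -f '%POS\t%REF\t%ALT\t[%AD]\n' genome.snp.vcf.gz |
awk -F'\t' '{ n = split($4, d, ","); tot = 0; max = 0
  for (i = 1; i <= n; i++) { tot += d[i]; if (d[i] > max) max = d[i] }
  if (tot > 0 && (tot - max) / tot > 0.05) print }'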

Ann Turner
04-23-2019, 02:34 AM
Thank you, Donwulff and James Kane. Sorry about the weird way my copy-and-paste ended up with special characters!

I think I see my way clear to assigning a base call now. The reads for the "other" alleles for all the variants are almost always between 0 and 6, with one exception where it was 48,780,0 (about 6%). FTDNA doesn't call heteroplasmy until it reaches 20%.

My next assignment to myself is to see if I can develop a spreadsheet method for converting the Yoruba to the rCRS. Your other message in this thread will help with that.

Donwulff
04-23-2019, 05:40 AM
I found out that https://www.mitomap.org/MITOMAP/MitoSeqs gives slightly different instructions for Yoruba/rCRS conversion from the Haplogrep page - the conversion table for the coordinate ranges, in particular. The credentials of the author aren't clear, but it's much more thorough. Subsequently I changed my script at https://github.com/Donwulff/bio-tools/blob/master/annotate/yoruban_to_rcrs.sh slightly. I haven't independently verified the changes between the references, because I wasn't aware James Lick's mthap had a new version that accepts the Yoruban reference directly: https://dna.jameslick.com/mthap-new/advanced.php There's another online conversion tool I've run into at https://mseqdr.org/mvtool.php though it seems more useful for annotation than outright conversion.

JamesKane
04-23-2019, 12:45 PM
Has anyone seen the new guarantee on Dante Labs site?

“We believe you should have a clear timeline of your genetic testing process. So we are introducing a 90 Day Guarantee for all orders from April 22nd 2019. If you don't get your results within 90 days from when we receive your saliva sample, then we'll give you a full refund as well as the full results.”

That's incredibly bold, or else the noise we hear about slow results is the exception rather than the rule.

Donwulff
04-23-2019, 01:07 PM
I think the noise is the exception, although it's been a bit confusing. On the one hand, it seems to me like the people making the largest noise (like starting a campaign to fill the Better Business Bureau/Facebook pages with complaints) are usually the ones in the 9th week of their "8 to 10 weeks" delivery time. On the other hand, it seems there have been some pretty big slip-ups in the past, and any publicity from their discount campaigns is easily lost on three or four people who give a negative review on delivery time (most of whom aren't even verified purchasers... on Amazon.com).

On the flip side, how many thousands have purchased the test and how many have complained? Still, they do have a public relations disaster on their hands from the delivery times. Does that 90 days include raw data? Because people are going to assume it does, and if it doesn't, there's going to be a LOT of fighting over that. But yes, hopefully this means they've streamlined everything so they can keep the promise of delivering everything, including the raw data, in 90 days.

Also note it's from when they receive the sample. I think there was at least one case on FB where someone asked what was taking so long with their results, only to be told their sample hadn't been received yet. People often take months to send their sample in, so I don't think you can blame Dante Labs for that, but it's definitely something to watch for. Most delivery time complaints I have seen are about the raw data, and again, I expect most people don't even seek the raw data; it's just us "genealogy nerds".

At least on the EU site right now, the front-page pic is the "Whole GenomeL", but the description is "The most comprehensive DNA Test" and the link actually takes you to the WGS test. I hope people are paying attention.

ybmpark
04-24-2019, 05:10 PM
I received my results this week. They are in VCF format. They tell me that I have to pay about 75 dollars for a hard drive that contains the BAM and FASTQ of my genome.
But someone said the BAM can be downloaded from the cloud free of charge. Has anyone downloaded the BAM file without any additional charge? Did everyone pay for their hard drives, or is it just me (or us) because we paid only 200 dollars?

MacUalraig
04-24-2019, 07:34 PM
I received my results this week. They are in VCF format. They tell me that I have to pay about 75 dollars for a hard drive that contains the BAM and FASTQ of my genome.
But someone said the BAM can be downloaded from the cloud free of charge. Has anyone downloaded the BAM file without any additional charge? Did everyone pay for their hard drives, or is it just me (or us) because we paid only 200 dollars?

Experiences are varying a bit depending on when you ordered, what exactly you were told at the time, plus how much you complain. I don't have a BAM link, but I haven't asked for one as I need it via disk anyway. We're talking several hundred GB of data here.

MacUalraig
04-24-2019, 07:38 PM
Has anyone seen the new guarantee on Dante Labs site?

“We believe you should have a clear timeline of your genetic testing process. So we are introducing a 90 Day Guarantee for all orders from April 22nd 2019. If you don't get your results within 90 days from when we receive your saliva sample, then we'll give you a full refund as well as the full results.”

That’s incredibly bold or the noise we hear about slow results being returned are exceptions rather than the rule.

If it relates to tests run from now on at their own dedicated lab in L'Aquila, it's believable. YSEQ manage that kind of turnaround via their local partner. If we're talking Asia, I'll believe it when I see it.

Donwulff
04-25-2019, 06:48 AM
Interesting. I thought that with the 90 day guarantee the ultra-low prices would be done for, because the 90-day announcement thread on FB was filled with complaints about the Black Friday sale, but there's now a DNA Day sale on the standard WGS.

Donwulff
04-25-2019, 07:42 AM
There's also a new blog post indicating they've indeed "streamlined" the processing: https://www.dantelabs.com/blogs/technical/turnaround-time
[attached image: turnaround-time chart from the blog post]
And wow, 7 months *average* processing time in 2018 summer. I think we can assume this is genuine; I was just wondering if there's any way we can get average processing times rather than "pissed enough to report" processing times.

All I know is that when I ordered my test in 2017, when everybody "knew for certain" it was a scam, it took 21 days from receiving the sample to reporting successful DNA extraction, and 80 days (11 and a half weeks) from sample receipt to results. At the time they promised 7 to 9 weeks, although their front page promised 3-4 weeks, which turned out to refer to the microarray test only. I requested BAM files on the day the results came; they were sent on a USB stick 86 days from sample receipt and arrived by courier service a couple of days later, stretching the raw data to almost 13 weeks from sample received. All in all, still under 3 months, and I didn't mind the wait because every testing company I've dealt with seems to have missed the promised time (including some microarrays), so I am used to waiting; but at the time, when *nobody* unaffiliated had yet reported receiving their results from Dante Labs, it was certainly more worrisome.

Putting a time guarantee on the sequencing results is certainly a bold move though, but I suspect it doesn't apply to raw data, as once again most people probably don't request that. On the FB board they're repeatedly saying that BAMs have to be ordered separately on the portable (SSD?) hard drive, a nice piece that might be worth the extra price alone if you needed one. What I got was just a lousy USB stick ;) However, they've also confirmed they're working to have the raw data downloadable via Sequencing.com, who produce the existing health report.

Ps. While BGI is a Chinese institute, according to the Dante Labs FAQ they're using labs in Germany, Denmark, New Jersey and Hong Kong. Also, it isn't any fly-by-night or clandestine government operation like all posts (about it) on this thread seem to imply. It was founded for the international Human Genome Project that most large countries contributed to, and upon which all human genome re-sequencing relies, and it has since achieved many public scientific firsts, including the first short-read de-novo WGS human sequence and the first ancient human genome sequence. There's no reason to doubt their capability, and if the FAQ entry about also using an American sequencing center is correct, the US government would be far more likely to obtain a copy of the sequence than China, but only if you were somehow considered & known during sequencing as an "enemy of the state" (which should not happen, because the samples are anonymized in sequencing - hence the saliva tube barcodes, and why it's not a bad idea to keep the barcode & sample ID private). On the other hand, it's entirely possible that companies have to wait until batches from a better-paying customer are processed to be able to run their own samples, but using multiple centers certainly helps with that.

Petr
04-25-2019, 01:48 PM
Today I received new Pharmacogenomics reports for 2 tests ordered in 2017.

NixYO
04-25-2019, 02:43 PM
WGS is not for GEDmatch, because those calculators are old and rely on SNPs from 23andMe v3 and similar old chips. But WGS is absolutely necessary for medical genetics, and WGS has data for anything genetics-related for a lifetime (including YFull etc).

As for my problem of the variant-only VCF file from DanteLabs (as my BAM harddisk has yet to arrive), I have 2 choices:

1. Grab the hg19 reference and use my Dante Labs VCF to build the mother-of-all SNP calls (see the sketch just after this list). Obviously it won't be as good as the BAM-file-generated one, since the VCF only has those "PASSED" calls, but close enough I suppose.
2. Wait for the BAM hard disk, learn and use tools manually to process the files (which might take days), or upload to the cloud and pay for Sequencing.com's EVE app to process it within a few hours.
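
To illustrate option 1, here's a minimal sketch in Python with pysam - assuming a bgzipped, tabix-indexed variant-only VCF and a hypothetical template file of array SNPs ("rsid chrom pos ref" per line); any site absent from the VCF is simply assumed to be homozygous reference, which ignores no-coverage regions and indels:

import pysam  # assumes pysam is installed and mygenome.vcf.gz has a .tbi index

vcf = pysam.VariantFile("mygenome.vcf.gz")  # hypothetical file name

with open("template_snps.txt") as template, open("full_calls.txt", "w") as out:
    for line in template:
        rsid, chrom, pos, ref = line.split()  # chrom naming must match the VCF
        pos = int(pos)
        genotype = ref + ref  # default: assume the site matches the reference
        for rec in vcf.fetch(chrom, pos - 1, pos):  # 0-based, half-open region
            alleles = next(iter(rec.samples.values())).alleles  # single sample
            if alleles and all(alleles):  # skip missing genotypes
                genotype = "".join(alleles)  # SNPs only; indels need more care
        out.write(rsid + "\t" + chrom + "\t" + str(pos) + "\t" + genotype + "\n")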

I received a reply to my message on Facebook from a Dante Labs representative informing me that my DNA was extracted, had passed quality control and is now in the process of being sequenced.

They've given me a rough date estimate for my results by the end of June. 4 months from them receiving the sample to publishing the results is absolutely brilliant for the price I paid (if they do manage to fulfill the date estimate they gave me, of course).

I don't have any internal information about Dante Labs or their decision process, and haven't heard about any YFull deal, but at least my VCF and reports were served via Amazon S3. On https://aws.amazon.com/s3/pricing/ it can be seen that storage is $0.023 per GB per month, so if they served a 100 GB BAM and a 100 GB FASTA, it would be $4.60 a month, and with downloads at $0.09 per GB that's $9 for the BAM. These costs would be borne by Dante Labs. Three years of storage alone would eat the entire price of their Black Friday special. Assuming someone decided to share the BAM with the world, it would quickly cost them more than even the full list price of the service. Of course, this could be mitigated by "download it once within two months, and then we'll take it offline". But with Sequencing.com it won't cost anything for delivery (assuming they have the outgoing bandwidth), and we still don't even know if they do that.
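
As a quick back-of-envelope check of those numbers (purely illustrative, using only the prices quoted above):

# S3 cost sketch with the prices quoted above
storage_gb = 200                      # ~100 GB BAM + ~100 GB FASTA
monthly = storage_gb * 0.023          # storage: $4.60 per month
one_download = 100 * 0.09             # egress: $9.00 for one 100 GB BAM
print(monthly, one_download, monthly * 36)  # 36 months of storage ~ $165.60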

I was just thinking that Sequencing.com isn't optimal for YFull delivery though, because YFull really only should get your Y-chromosome data. UseGalaxy can do that filtering, but if you have to download the file off Sequencing.com and then upload it to UseGalaxy, that'll break most people's Internet connection. So Sequencing.com really needs a "Y chromosome only" app ;)

I got the Dante Labs e-mail as well. I'm especially interested in "In general, we will add more analysis, more reports and more functionalities in 2019", but we'll see; at the price they have been offering, I've definitely expected them to just drop the data one time & offer no support after that. On the other hand, occasional updates will keep them on the table & people talking about it, so lots of hopefully positive marketing and publicity.

YFull has received my BAM file. 101.66 GB. See my posts in this thread.
Dante Labs has a DNA Day sale right now, so it costs only 199€/$229 to order a kit. Is this test like Big Y-700 plus mtFull Sequence from FTDNA plus a lot of health information? FTDNA also has a sale right now and Big Y-700 plus mtFull Sequence cost $598 (about 536.18€), while those two tests together normally cost $848 (about 760.48€).

Donwulff
04-25-2019, 03:08 PM
Right now, FTDNA testing is the only way to get INTO the FTDNA database. You're likely able to get project admins to let you into appropriate groups (and many of them use YFull information in addition in their analysis) without testing at FTDNA, and do comparisons against project members, but there are also people who are only in the FTDNA searchable database. If you get your current terminal Y-SNP via another method (YSEQ, DL+YFull), you can usually test just that SNP at FTDNA (I believe you have to go through their customer service, which might be opposed to the idea, but I have no real knowledge) to get into their Y-SNP tree.

TBH, I need to find someone who has done that to get their experiences. Also, with Dante Labs you may have to pay $75 extra for the raw-data BAM that is required for those analyses (they're working on being able to download the raw data free via Sequencing.com, but I'm not certain of the status of that), and you would need to find a way to get the mtDNA/Y-chromosome portion of that 100GB BAM online somehow.

The other thing is that, at least currently, you do need third-party analysis like YFull ($49) or FGC ($75 for Y+mtDNA) to get your mtDNA and Y-chromosome results out of Dante Labs. FGC has additional analyses like "GedMatch Analysis of Uploaded WGS data" ($25) or "Whole genome: Advanced Variant analysis" ($250), where it isn't clear to me what those analyses include, but it sounds like you'd be able to use those for autosomal matching as well. There are a few other third-party service providers like the aforementioned Sequencing.com and some private individuals who may be able to help with this, though of course to get into most commercial matching databases (FTDNA/YFull etc.) you have to pay those services as well.

JamesKane
04-25-2019, 03:09 PM
Is this test like Big Y-700 plus mtFull Sequence from FTDNA plus a lot of health information? FTDNA also has a sale right now and Big Y-700 plus mtFull Sequence cost $598 (about 536.18€), while those two tests together normally cost $848 (about 760.48€).

For some visuals on what 30x WGS covers in comparison to Big Y and mtDNA tests, I have some histograms and reporting on bases covered in this blog discussion. http://www.it2kane.org/2019/03/a-bit-about-wgs-testing/

Basically, a WGS stands as a suitable proxy for all the targeted tests except in very specific use cases where the reads are not long enough or additional coverage depth is needed.

Keep in mind the blog data is derived from 150 base pair reads with 500 base inserts on NovaSeq. The Dante Labs tests are shorter, but generally will look about the same for coverage.

As noted above, you are really going to need help sifting the data into information unless you are a motivated tinkerer.

Donwulff
04-25-2019, 04:14 PM
It's already been confirmed that new sequences from Dante Labs have a read length of 150. On my BigY the insert size average was 282, read length up to 165 with an average of 159. Another one had an insert size average of 265 and a read length average of 158. This means FTDNA is intentionally overlapping the paired-end reads by about 20 basepairs to form one ~270 basepair read, which is typical when you're doing targeted sequencing and know exactly where you want to place your reads. What BigY has going for it is read depth on the targeted regions, and remember the new BigY-700 is more widely targeted, although I'm not sure if we have any examples of BigY-700 yet.

What actual effect, if any, that read overlap & depth have on results is, I think, open to debate. People have posted their results on YFull from both sequences, and I think the differences are more per individual than per sequencing provider. For me, BigY (old) returned 839 STR's on YFull and Dante Labs only 710, but I believe others have reported the reverse numbers. Obviously, BigY being targeted, it returned far fewer Y-SNP's overall, but there were two known SNP's that YFull couldn't call from my Dante Labs sequence, and 10 known SNP's on my terminal SNP level where the results actually differed. (TBH I want to know what's up with that latter one; I think it may be because the terminal-level SNP's are most likely to be poor quality, ie. multiple similar sequences in the genome. That's the other thing: targeted sequencing is less likely to pull things from elsewhere in the genome. But overall it doesn't seem to have a *huge* impact.)

Oh and wait a sec, I just recalled that the comparison is against FTDNA's calls on the BigY data, if I remember right, which means it's likely also affected by FTDNA's algorithms predicting what SNP's "should" or "shouldn't" be there, even if they may not actually be. Not going to do a detailed comparison just now, though...

JamesKane
04-25-2019, 04:24 PM
It's already been confirmed new sequences from Dante Labs have read length of 150.

Was it? I know the store front was showing that, but the current details on the sale screen are back down to 100.

aaronbee2010
04-25-2019, 04:36 PM
Dante Labs has a DNA Day sale right now, so it costs only 199€/$222 to order a kit. Is this test like Big Y-700 plus mtFull Sequence from FTDNA plus a lot of health information? FTDNA also has a sale right now and Big Y-700 plus mtFull Sequence cost $598 (about 536.18€), while those two tests together normally cost $848 (about 760.48€).

For Y-DNA and mtDNA analysis, you need to purchase the .BAM file on a hard drive, which will set you back another €59/$66. Therefore, the effective cost is around €258/$288. I changed your DNA day sale value in dollars to reflect the current forex values as of the writing of this post.

If I had to compare it: you would get a Big Y-700, but only raw data; you would have to interpret it using another service, e.g. YFull ($49). You also get an mtFull Sequence, but (again) only raw data; you would have to interpret it using another service, e.g. YFull ($25).

You also get a full autosomal sequence. The file you use for this (.VCF) comes as a free download, so you could just download it and upload it straight to Sequencing.com or Promethease for health reasons. Since you're ordering a .BAM file anyway, you could send the .BAM file to Full Genomes for their $75 service that analyses your Y-DNA for SNP's + STR's and adds you to their Y-DNA database. They also analyse your mtDNA and your autosomal data. With the latter, they generate a file that you can upload to GEDmatch. If you just wanted FGC to analyse your Y-DNA, that would be $50, but since you would (or at least, you should) have this done by YFull anyway, this may be a bit redundant.

Of course, you could try and interpret the raw data yourself, but I wouldn't recommend it. YFull and FGC analysis will also add you to their databases. FTDNA's database is exclusive to customers who test Y-DNA with them, unfortunately; however, you can compare your Y-STR's with someone who's in a public FTDNA project.

Donwulff
04-25-2019, 04:49 PM
Read length confirmed at 150bp: https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=562007&viewfull=1#post562007
Granted, it's not impossible for it to change back - indeed, you say it's changed on the website. These technical specs are always "subject to change and availability", but last I heard it's 150bp.

mtDNA depends a bit on what you want to get out of it, and again, what analysis Dante Labs happens to do when the sample is processed. There's a chance that basic information is available from the VCF: https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=561174&viewfull=1#post561174

Dante Labs confirmed that online download of raw data via Sequencing.com is coming within weeks, which would facilitate YFull as well as Sequencing.com analysis (for a small charge): https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=560718&viewfull=1#post560718
Again, when it's not available I guess it's not certain, but that sounds pretty certain. Were you to have to get the BAM on a hard drive, you'd also need some way to submit it to YFull etc., which would cost you extra - about $10 to $$$ depending on your bandwidth availability and the service you choose to use. (Well, uploading to Sequencing.com and linking it from there to services that accept it is free, as long as you don't pay for outgoing bandwidth, have a computer and the time, and don't disconnect the USB drive.)

Edit: Whole GenomeZ with the Exome sequence included doesn't even have the datasheet. I wonder if it's possible the 200EUR/$222 offer could be run on the older, 100bp technology instead. And the comparison table says 120GB for WGS and 90GB (nucleotide bases, actually, I assume) for the WGS + Exome. Shouldn't that be the other way around, or has the yield actually increased? Ugh.

Donwulff
04-25-2019, 05:48 PM
For the sake of clarity I should point out that neither read depth (the average number of reads for each location over the genome) nor data size/yield (the total amount of data) should be directly affected by read length. Also, my BigY vs. Dante Labs comparison was definitely based on the 100bp read-length sequence I got from them in 2017.

At some point it was announced that Dante Labs was switching to the MGISEQ-2000, though right now I could find only this Twitter thread with DL replying as reference: https://twitter.com/jdidion/status/1064599527607267330 . The MGISEQ-2000, which is capable of 150-basepair reads, is an upgrade to the BGISEQ-500: https://www.biorxiv.org/content/10.1101/577080v2.full Now, since it's an upgrade to the sequencing machine, there are likely to be some BGISEQ-500's lying around underutilized.

Another consideration is that for short-read sequencing technology like Illumina and BGI/MGI, the read *quality* (likelihood that they got the base correct) falls fairly rapidly further down the read, which is the main reason they're "short read" technologies - so a longer read length isn't necessarily better. We don't know for sure, but the reason they might choose a shorter read length could be availability of machines, availability of consumables for the machines, or not meeting a required quality target, or it could very, very likely be a case of accidentally using an old copy of the website when they changed the price, unless someone hears something else.

NixYO
04-25-2019, 08:05 PM
For Y-DNA and mtDNA analysis, you need to purchase the .BAM file on a hard drive, which will set you back another €59/$66. Therefore, the effective cost is around €258/$288. I changed your DNA day sale value in dollars to reflect the current forex values as of the writing of this post.

If I had to compare it: you would get a Big Y-700, but only raw data; you would have to interpret it using another service, e.g. YFull ($49). You also get an mtFull Sequence, but (again) only raw data; you would have to interpret it using another service, e.g. YFull ($25).

You also get a full autosomal sequence. The file you use for this (.VCF) comes as a free download, so you could just download it and upload it straight to Sequencing.com or Promethease for health reasons. Since you're ordering a .BAM file anyway, you could send the .BAM file to Full Genomes for their $75 service that analyses your Y-DNA for SNP's + STR's and adds you to their Y-DNA database. They also analyse your mtDNA and your autosomal data. With the latter, they generate a file that you can upload to GEDmatch. If you just wanted FGC to analyse your Y-DNA, that would be $50, but since you would (or at least, you should) have this done by YFull anyway, this may be a bit redundant.

Of course, you could try and interpret the raw data yourself, but I wouldn't recommend it. YFull and FGC analysis will also add you to their databases. FTDNA's database is exclusive to customers who test Y-DNA with them, unfortunately; however, you can compare your Y-STR's with someone who's in a public FTDNA project.

Read length confirmed at 150bp: https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=562007&viewfull=1#post562007
Granted, it's not impossible for it to change back - indeed, you say it's changed on the website. These technical specs are always "subject to change and availability", but last I heard it's 150bp.

mtDNA depends a bit on what you want to get out of it, and again, what analysis Dante Labs happens to do when the sample is processed. There's a chance that basic information is available from the VCF: https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=561174&viewfull=1#post561174

Dante Labs confirmed that online download of raw data via Sequencing.com is coming within weeks, which would facilitate YFull as well as Sequencing.com analysis (for a small charge): https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=560718&viewfull=1#post560718
Again, when it's not available I guess it's not certain, but that sounds pretty certain. Were you to have to get the BAM on a hard drive, you'd also need some way to submit it to YFull etc., which would cost you extra - about $10 to $$$ depending on your bandwidth availability and the service you choose to use. (Well, uploading to Sequencing.com and linking it from there to services that accept it is free, as long as you don't pay for outgoing bandwidth, have a computer and the time, and don't disconnect the USB drive.)

Edit: Whole GenomeZ with the Exome sequence included doesn't even have the datasheet. I wonder if it's possible the 200EUR/$222 offer could be run on the older, 100bp technology instead. And the comparison table says 120GB for WGS and 90GB (nucleotide bases, actually, I assume) for the WGS + Exome. Shouldn't that be the other way around, or has the yield actually increased? Ugh.
The price in dollars is from the American site (https://us.dantelabs.com/). Okay, if one gets Dante's WGS for 199€ (and downloads the YUUUGE .BAM file from the Internet free of charge, which could be true soon based on information in this thread) and YFull's NGS Y-DNA interpretation for $49, one would have to pay circa £209.86 or 2,581.17 SEK instead of £348.25 or 4,283.57 SEK for BIG Y-700 from FTDNA.

aaronbee2010
04-25-2019, 08:13 PM
The price in dollars is from the American site (https://us.dantelabs.com/). Okay, if one gets Dante's WGS for 199€ (and downloads the YUUUGE .BAM file from the Internet free of charge, which could be true based on information in this thread) and YFull's NGS Y-DNA interpretation for $49, one would have to pay circa £209.86 or 2,581.17 SEK instead of £348.25 or 4,283.57 SEK for BIG Y-700 from FTDNA

If you're an EU customer, then it would be cheaper to purchase it from the EU site in that case (about a $7 difference, although this could change as the EUR/USD forex pair fluctuates in price). I know that currently you can't download the .BAM file for free; although this could change in the future, for now I think it's safe to assume it won't.

It's not really a secret that Dante Labs is a lot cheaper than pretty much all of its competitors when they have an offer on. Even if you have to pay extra for the hard drive, it's still a LOT cheaper than the nearest competitors (FGC and FTDNA) while sequencing all of your DNA, and not just the Y-DNA. A hard drive is a convenience as well.

Donwulff
04-25-2019, 11:10 PM
Both Sequencing.com and Dante Labs have confirmed on Facebook that the direct import to Sequencing.com & download from there is coming. The https://sequencing.com/genetic-testing page already has an automated e-mail request to Dante Labs for it (this does raise some security concerns). There's no "safe assumption" that this won't be possible, although I guess with technology you always have to take into account possible changes & problems. Of course, Sequencing.com may not be a viable alternative if you want the raw data on YOUR computer, due to the size of the files, but if you want to send them to YFull or use Sequencing.com (paid) apps to analyse the data, you don't need to care about the file size. (Alas, I'm not sure there's an easy solution for sending YFull *only* the chrY+mtDNA, if you're not comfortable letting them have your whole genome.)

It's worth stressing that this won't get you into the FTDNA database; if that's part of the goal (and it's a good idea), then getting some entry-level test at FTDNA (an autosomal transfer or something, or perhaps a terminal SNP test via their customer service) should be budgeted in, but as said, I'm not certain how this is best done. I'm not sure what the FGC database is like; I would assume most of their Y Elite testers are on YFull as well.

aaronbee2010
04-25-2019, 11:24 PM
Both Sequencing.com and Dante Labs have confirmed on Facebook that the direct import to Sequencing.com & download from there is coming.

Does this apply to the .BAM file or just the .VCF file?

Donwulff
04-25-2019, 11:58 PM
In the request template e-mail from the link provided, it says "Please import a copy of all of my raw genetic data (paired fastq, bam and vcf files) into my Sequencing.com account." The relevant FB discussion was at https://www.facebook.com/DanteLabs/posts/833012390395547?comment_id=834502640246522&comment_tracking=%7B%22tn%22%3A%22R%22%7D Last time somebody asked, they weren't yet ready to do the transfer (or rather, their customer service had not been informed). I can't vouch that it'll work, but honestly, they're both saying it does and there are no excuses :p

Also, for people who have a Linux or unix-like system (Windows 10 Subsystem for Linux counts), an unmetered Internet connection, the ability to run simple bioinformatics commands and some sort of cloud storage/sharing service, getting the hard drive, extracting chrY/chrMT, uploading it to a sharing folder and sending that to YFull may still be worth considering. At least I sent YFull my whole BAM file (over Amazon S3, which cost something like 7 dollars I think? Before I found out about Sequencing.com) and they appeared to have no problem with that, but it does contain health- and identity-related information not everybody will be comfortable letting YFull potentially have. For people who are on a mobile phone, not comfortable with the command line etc., Sequencing.com may be the best recourse.
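
For the extraction step, a minimal sketch in Python with pysam - the file names are hypothetical, the BAM needs a .bai index, and the contig names may be "Y"/"MT" rather than "chrY"/"chrM" depending on the reference your BAM was aligned to:

import pysam  # assumes pysam is installed and genome.bam has a .bai index

IN_BAM = "genome.bam"       # hypothetical input file
OUT_BAM = "chrY_chrM.bam"   # much smaller subset to share with YFull

with pysam.AlignmentFile(IN_BAM, "rb") as src:
    with pysam.AlignmentFile(OUT_BAM, "wb", template=src) as dst:
        for contig in ("chrY", "chrM"):  # adjust to match your reference
            for read in src.fetch(contig):  # fetch() requires the index
                dst.write(read)

pysam.index(OUT_BAM)  # index the subset so downstream tools can use it

(samtools view -b genome.bam chrY chrM would do the same in one line, if you have samtools available.)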

aaronbee2010
04-26-2019, 12:27 AM
In the request template e-mail from the link provided, it says "Please import a copy of all of my raw genetic data (paired fastq, bam and vcf files) into my Sequencing.com account." The relevant FB discussion was at https://www.facebook.com/DanteLabs/posts/833012390395547?comment_id=834502640246522&comment_tracking=%7B%22tn%22%3A%22R%22%7D Last time somebody asked, they weren't yet ready to do the transfer (or rather, their customer service had not been informed). I can't vouch that it'll work, but honestly, they're both saying it does and there are no excuses :p

Also, for people who have a Linux or unix-like system (Windows 10 Subsystem for Linux counts), an unmetered Internet connection, the ability to run simple bioinformatics commands and some sort of cloud storage/sharing service, getting the hard drive, extracting chrY/chrMT, uploading it to a sharing folder and sending that to YFull may still be worth considering. At least I sent YFull my whole BAM file (over Amazon S3, which cost something like 7 dollars I think? Before I found out about Sequencing.com) and they appeared to have no problem with that, but it does contain health- and identity-related information not everybody will be comfortable letting YFull potentially have. For people who are on a mobile phone, not comfortable with the command line etc., Sequencing.com may be the best recourse.

So it looks like .BAM files will indeed be able to be sent directly to Sequencing.com. Nice!

Given this, what would be the best way to transfer the .BAM file from Sequencing.com to YFull? Could I just give YFull a download link from Sequencing.com, or would I need to download the .BAM file to a cloud account, and then give that link to YFull?

Petr
04-26-2019, 08:57 AM
They are able to transfer VCF files only.

I used the form above and I received the following reply yesterday:


Hello Petr,

Thanks for your message,

Your Raw data is being uploaded to your sequencing.com account. Please take note these are only VCF files.

FASTQ and BAM files (~100 GB each) can only be delivered via a 512 GB HD. You can purchase one here: https://www.dantelabs.com/products/500-gb-hard-disk-containing-your-raw-data

Please let me know if you have any questions.

Kindest regards,

Mark

Donwulff
04-26-2019, 11:11 AM
They are able to transfer VCF files only.

I used the form above and I received the following reply yesterday:

Okay, they really need to sort that out, because both the Facebook thread with Dante Labs & Sequencing.com and the e-mail generated by Sequencing.com are talking about the BAM & FASTQ files. I get that there may be a bandwidth issue, plus the BAM/FASTQ may need to be separately generated, but this level of communication is really frustrating. (Even though I got my own raw data in 2017 when I ordered the test...)

Petr
04-26-2019, 11:17 AM
In addition I ordered 2 HDDs with data on April 4th and still no information when they will be shipped.

oagl
04-26-2019, 11:22 AM
In addition I ordered 2 HDDs with data on April 4th and still no information when they will be shipped.

It appears there is a long queue. Ordered in January and got it a few days ago. So prepare to wait a few months. But you will get it :)

Donwulff
04-26-2019, 05:43 PM
I'm having to wildly guess here, but typically genome sequencers produce their own proprietary data format which contains more metadata etc. than the standard FASTQ/BAM files, and all the intermediary files like the BAM are deleted after producing the VCF because they take a lot of space and hold the same information. Presumably people who want the raw data are in the minority, so it's cheaper to re-produce it when requested (doubly so if they can charge extra for it, although I wouldn't recommend that). This is why at every sequencing company I've seen, you have to separately request the raw data, and it takes a while to produce. Add to this that, assuming raw-data analysis is done by a third party like the sequencing center, they will have to wait either for the raw data to arrive to them on a hard drive, or to download ~200GB of data. That would explain why there is a queue and why they can't just offer download of the raw data to everybody at no charge (in addition to data transfer & storage costs), although it sounds like they need to do something to address that queue. However, I admit it doesn't really explain continuing to promise that the raw data will be downloadable and then somehow not having it downloadable...

JamesKane
04-26-2019, 06:04 PM
This is why at every sequencing company I've seen you have to separately request the raw data, and it takes a while to produce.

By default, both FGC and YSEQ make the BAMs available immediately with the call and interpretation files. You need to request the FASTQ files if you really want them. I don't have any direct dealings with Veritas, but they also used to provide the BAM in short order for those who were contributing the data to PGP: Harvard. That arrangement may have changed, as there hasn't been a new submission in quite some time.

Donwulff
04-26-2019, 07:39 PM
I've indeed not bought sequencing from those companies, but it would make sense for companies whose primary product is the raw data to provide that without a separate request. According to several sources, Veritas Genetics doesn't offer BAM/FASTQ at all (although, oddly, their FAQ states they retain a copy of the BAM for 10 years "for archival purposes only") and just the VCF costs $99 extra: https://twitter.com/veritasgenetics/status/1063492348997500928?lang=en
Companies which do bioinformatics or even the sequencing itself in-house are going to have an easier time supplying BAMs, although the specifics naturally vary from company to company and verge on trade secrets. While fulfilling promises to customers should always be a high priority, many people have probably been wondering how providing the BAM files could take so long.

Erikl86
04-26-2019, 09:13 PM
Hallelujah !

Got this today:

"Dante Labs: Your sequencing is done. We are running your bioinformatic analysis!
Dear Customer,

We're excited to let you know that we completed the sequencing on your sample.

We are running out the bioinformatic analysis and we would like to inform you that we might need some extra days before the delivery of the results!

Please do not hesitate to contact me if you need any further assistance.

Best regards,

Dante Labs' Team"

Purchased the kit Nov 23, sent the kit back on Jan 5th.

kafky
04-28-2019, 10:40 PM
Experiences with Dante raw results on Genesis: there is an enormous bias for false positive results. I now have on Genesis almost 100 twin brothers, or direct family with more than 3200 cM! Almost all from Dante Labs and WGS. GEDmatch has to move to a very different approach to matching WGS results.

Donwulff
04-28-2019, 11:22 PM
Normal (non-gVCF) VCF files only list sites that differ from the reference genome. Because of the way GEDmatch works, it can't tell if a location was genotyped or matches the reference, and indeed assumes these sites are "no call" and thus match every other sample. When I last tried, a plain gVCF file was too large for GEDmatch Genesis to handle, so the only solution seems to be to use a file with calls for all variants in the dbSNP list of "known variants", or a superset of microarray-test SNP locations. However, according to GEDmatch's new Terms of Service, you may use "An artificial DNA kit (if and only if: (1) it is intended for research purposes; and (2) it is not used to identify anyone in the GEDmatch database)". It's not exactly clear what an "artificial DNA kit" is, but it would seem that using a datafile that's been edited in any way is a breach of contract and possibly criminal due to bypassing technical limitations of the service. While it's unlikely GEDmatch would pursue this against their users, my attempt to find a consensus opinion about genetic genealogy with sequencing samples on the ISOGG FB forum led to people terming me a Nazi child-abuser, so it appears safe to say that the topic is very controversial. (Point being, it may be more of a social/cultural problem than a technical one.)

In any case, the false positives are a known issue on GEDmatch with every standard VCF file, and at least before the new Terms of Service they had a disclaimer about the false positives on the site. There are many workarounds, community norms allowing, but for now I've deemed it a lost cause. Perhaps MyHeritage or another imputation-based sample-upload matching service with a large existing userbase will do it correctly once WGS becomes more commonplace. Companies changing the SNP's on the microarray platforms they use, like the recent move to and from the Global Screening Array (GSA), are certainly making that attractive, because there is no longer a standard microarray that overlaps with most other tests.

Francisco
04-29-2019, 03:20 AM
Experiences with Dante raw results on Genesis: there is an enormous bias for false positive results. I now have on Genesis almost 100 twin brothers, or direct family with more than 3200 cM! Almost all from Dante Labs and WGS. GEDmatch has to move to a very different approach to matching WGS results.

Yep, all the Whole Genome uploads are broken.
I have a FullGenomes 30x and I am a brother of Vindija Neanderthal!!!
Of course I also have like 20 twin brothers and 40 cousins in countries my forefathers haven't been to since the Neolithic :)

There was even a girl that mailed me saying I was "a parent" because she was at a 4.1 distance from me. I told her the truth: in my book she was about number 200 in my distance list.

Any idea about a program that lets us extract the FTDNA or 23andme markers from the Whole Genome data, just to upload to GEDmatch?

kafky
04-29-2019, 10:21 AM
Normal (non-gVCF) VCF files only list sites that differ from the reference genome. Because of the way GEDmatch works, it can't tell if a location was genotyped or matches the reference, and indeed assumes these sites are "no call" and thus match every other sample. When I last tried, a plain gVCF file was too large for GEDmatch Genesis to handle, so the only solution seems to be to use a file with calls for all variants in the dbSNP list of "known variants", or a superset of microarray-test SNP locations. However, according to GEDmatch's new Terms of Service, you may use "An artificial DNA kit (if and only if: (1) it is intended for research purposes; and (2) it is not used to identify anyone in the GEDmatch database)". It's not exactly clear what an "artificial DNA kit" is, but it would seem that using a datafile that's been edited in any way is a breach of contract and possibly criminal due to bypassing technical limitations of the service. While it's unlikely GEDmatch would pursue this against their users, my attempt to find a consensus opinion about genetic genealogy with sequencing samples on the ISOGG FB forum led to people terming me a Nazi child-abuser, so it appears safe to say that the topic is very controversial. (Point being, it may be more of a social/cultural problem than a technical one.)

In any case, the false positives are a known issue on GEDmatch with every standard VCF file, and at least before the new Terms of Service they had a disclaimer about the false positives on the site. There are many workarounds, community norms allowing, but for now I've deemed it a lost cause. Perhaps MyHeritage or another imputation-based sample-upload matching service with a large existing userbase will do it correctly once WGS becomes more commonplace. Companies changing the SNP's on the microarray platforms they use, like the recent move to and from the Global Screening Array (GSA), are certainly making that attractive, because there is no longer a standard microarray that overlaps with most other tests.



It may open the way to a WGS GEDmatch with clear data protection, without off-system contact (no e-mails), ensuring complete respect for confidentiality.

karwiso
04-29-2019, 05:40 PM
Normal (non-gVCF) VCF files only list sites that differ from the reference genome. Because of the way GEDmatch works, it can't tell if a location was genotyped or matches the reference, and indeed assumes these sites are "no call" and thus match every other sample. When I last tried, a plain gVCF file was too large for GEDmatch Genesis to handle, so the only solution seems to be to use a file with calls for all variants in the dbSNP list of "known variants", or a superset of microarray-test SNP locations. However, according to GEDmatch's new Terms of Service, you may use "An artificial DNA kit (if and only if: (1) it is intended for research purposes; and (2) it is not used to identify anyone in the GEDmatch database)". It's not exactly clear what an "artificial DNA kit" is, but it would seem that using a datafile that's been edited in any way is a breach of contract and possibly criminal due to bypassing technical limitations of the service. While it's unlikely GEDmatch would pursue this against their users, my attempt to find a consensus opinion about genetic genealogy with sequencing samples on the ISOGG FB forum led to people terming me a Nazi child-abuser, so it appears safe to say that the topic is very controversial. (Point being, it may be more of a social/cultural problem than a technical one.)

In any case, the false positives are a known issue on GEDmatch with every standard VCF file, and at least before the new Terms of Service they had a disclaimer about the false positives on the site. There are many workarounds, community norms allowing, but for now I've deemed it a lost cause. Perhaps MyHeritage or another imputation-based sample-upload matching service with a large existing userbase will do it correctly once WGS becomes more commonplace. Companies changing the SNP's on the microarray platforms they use, like the recent move to and from the Global Screening Array (GSA), are certainly making that attractive, because there is no longer a standard microarray that overlaps with most other tests.

I have the same experience with gVCF - too large for GEDmatch. A superset of common DNA testing arrays like FTDNA, Ancestry, 23andme, MyHeritage and Living DNA gives around 1,5 million tested SNPs. Counting rows in dbSNP_151 with common SNPs (i.e. found in more than 1% of at least one population) gives around 37,3 million rows. Saving all this would require 1 byte for the chromosome number, 4 bytes for the chromosome position, 4 bytes for the rsID, 2 bytes (probably 1 byte) for the alleles called, and probably 1 byte for quality. Then we need to add some bytes for commas or tabs and newline symbols - 4 bytes at least. It adds up to 14-16 bytes per SNP. 37,3 million*15=560 MB, and I think it could be compressed to something like 180 MB. If there are some additional characters like " then it would probably be around 200 MB.
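
A toy calculation in Python, just to double-check that estimate with the numbers above:

# back-of-envelope check of the storage estimate
snps = 37_300_000
bytes_per_snp = 15   # chrom + pos + rsID + alleles + quality + separators
total_mb = snps * bytes_per_snp / 1e6
print(total_mb)      # ~560 MB uncompressed; ~3:1 text compression -> ~190 MB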

A normal FTDNA file is around 6.5 MB, and generating a file with 37,3 million SNPs will give a roughly 30 times bigger file (my estimation). It is a big challenge for GEDmatch to store and process such files. We could expect a few weeks of processing time if one wants to match against a million other profiles, and the storage is another challenge because it requires some investment in hard drives.
If I count lines in dbSNP_151 for all SNPs (All_20180418.vcf), then I get approx 660 million lines. Something like 20 times more data than in common SNPs - and corresponding to approx 1000 times a common DNA testing chip? Years to wait for a comparison! The size of the final files would be accepted by GEDmatch.

Anyway, common or all SNPs, one has to reckon with an enormous amount of duplicate data that requires storage and processing time, and that costs.

I think that we can serve GEDmatch with files of common SNPs; a Python program wouldn't take that much time to write. I still think that a server-side implementation of reference genomes would be a more efficient solution.

A superset of SNPs could be useful for FTDNA, if one imports the data there and wants to have matches with both "old" OmniExpress and "new" GSA chips' results.

Donwulff
04-30-2019, 02:23 AM
As discussed in another thread here, the latest dbSNP "known sites" dump, with sites only and no additional info, is about 6.4 gigabytes compressed. Of course, that's far less than the 100GB that the BAM file takes, or even the 30 GB that a "raw" gVCF takes. VCF files can of course hold all sorts of data; the gVCF created by Sequencing.com's "Genomic VCF" app was 866M with less data (and that, if I recall right, was too large for GEDmatch Genesis).

Of course, there's no need to use the latest dbSNP either; indeed, for genealogical purposes you should in any case use some sort of filtering for confidence. Most of the dbSNP sites now are indels, which aren't used either. So one question is the intended purpose of the file, of course. The 1000 Genomes Phase 1 high-confidence SNP list is about 2 gigabytes compressed. GEDmatch doesn't need to (indeed shouldn't) store the original files, and processing files in the gigabyte range is easy nowadays; if you fall below that, you're no longer factually dealing with sequencing information (not that there's necessarily anything wrong with a superset of DNA microarrays that matches them all, hopefully).

There are some highly efficient ways to store DNA data, and honestly we don't know what kind of internal representation GEDmatch uses now. I think that sequencing data somewhat complicates most of these; a service should probably already incorporate a template of high-confidence SNP's for WGS and exome sequencing, however. The most obvious optimization is that you want to primarily use haplotypes and not individual SNP's; otherwise, every time someone runs a match you have to compare every single SNP against everybody else in the database, which becomes immediately infeasible. In general, I would really not worry about how any specific matching service internally implements it; that's their problem. And in fact, MOST of this discussion technically is - they're in the best position to determine which SNP's are most informative for them, and what format would be most expeditious.

teepean47
04-30-2019, 05:34 AM
I think that we can serve GEDmatch with files of common SNPs; a Python program wouldn't take that much time to write. I still think that a server-side implementation of reference genomes would be a more efficient solution.


I have been working on updating extract23, which uses a filter where I have combined SNPs from different companies (23andMe + FTDNA + Ancestry etc.). I created it after GEDmatch disabled uploading VCFs - or it is not working for me, at least.

https://github.com/teepean/extract23
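
For anyone curious, here's a rough sketch of how such a combined position filter can be built (this is not the actual extract23 code; the file names and column layouts are assumptions - check them against your own raw-data exports):

import csv

positions = set()

def add_tab_file(path):
    # 23andMe/AncestryDNA-style raw data: tab-separated rsid, chromosome, position, ...
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            fields = line.rstrip("\n").split("\t")
            positions.add((fields[1], fields[2]))

def add_ftdna_csv(path):
    # FTDNA-style raw data: quoted CSV with RSID, CHROMOSOME, POSITION, RESULT
    with open(path) as f:
        for row in csv.reader(f):
            if row and row[0].startswith("rs"):  # skips the header row too
                positions.add((row[1], row[2]))

add_tab_file("23andme_raw.txt")   # hypothetical file names
add_ftdna_csv("ftdna_raw.csv")

# Write a two-column position list; chromosomes sort as strings here,
# which is good enough for a filter file
with open("superset_positions.txt", "w") as out:
    for chrom, pos in sorted(positions, key=lambda t: (t[0], int(t[1]))):
        out.write(chrom + "\t" + pos + "\n")

The resulting chromosome/position list can then be used as a regions file for e.g. bcftools (-R) when calling genotypes off the BAM.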

Donwulff
04-30-2019, 05:55 AM
Extract23 is one option, but it requires the BAM, and does genotype calling. If you have the BAM and are able to do genotype calling on your system, there's not much else you need. Go crazy: generate a gVCF, force calls on whatever genomic site you want, etc. If you have access to the BAM, the easiest way is to just upload it to Sequencing.com and use their paid EvE run to generate a gVCF or 23andMe file, though. Having a good superset list of high-confidence SNP's off the different DNA chips would help a lot.

But again, as a disclaimer, under the current GEDmatch terms of service actually doing that could be criminal. And many genetic genealogists consider sequencing or bioinformatics heretical, and are known to go berserk if those are mentioned. So I'm going to recuse myself from this discussion, other than to point out that DNA matching companies should see about getting a list of WGS & WES confidently-callable regions/variants and use those to interpret plain VCF files, and/or just feed them into their imputation pipeline if they're using one. Ultimately it's a bit tricky, and the "optimal" solution would be to map & call variants of all sequences through the same pipeline so that all variants are called the same way, but that's unlikely to happen at present. ERSA2 had some ideas about non-confidently-callable regions as well.

Edit: And sequencing companies should provide a pared-down gVCF, a 23andMe/AncestryDNA-formatted file of common SNP's, and perhaps a BED file of confident regions as well - but do any of them currently? So just assume not all of them are going to provide any of those, at least all the time; the more customers ask for them, though, the more likely they'll be provided. That would remove a lot of the ambiguity.

Jan_Noack
05-05-2019, 10:46 AM
I have the same experience with gVCF - too large for GEDmatch. A superset of common DNA testing arrays like FTDNA, Ancestry, 23andme, MyHeritage and Living DNA gives around 1,5 million tested SNPs. Counting rows in dbSNP_151 with common SNPs (i.e. found in more than 1% of at least one population) gives around 37,3 million rows. Saving all this would require 1 byte for the chromosome number, 4 bytes for the chromosome position, 4 bytes for the rsID, 2 bytes (probably 1 byte) for the alleles called, and probably 1 byte for quality. Then we need to add some bytes for commas or tabs and newline symbols - 4 bytes at least. It adds up to 14-16 bytes per SNP. 37,3 million*15=560 MB, and I think it could be compressed to something like 180 MB. If there are some additional characters like " then it would probably be around 200 MB.
...
A normal FTDNA file is around 6.5 MB, and generating a file with 37,3 million SNPs will give a roughly 30 times bigger file (my estimation)...

I'm a newbie to this. I follow up to the part about 1.5 million tested SNPs. But why do you then count rows?

Why not only have 1.5 million rows with the chromosome, rs ID, alleles, quality and formatting... so it should be about 1.5 million rows, or roughly double the size, I think - i.e. the FTDNA, MyHeritage, Ancestry etc. chips are about 700K SNPs each?

MacUalraig
05-05-2019, 01:22 PM
Bear in mind though that GEDmatch trims back your uploaded SNPs to a common subset that is worth comparing with other kits. For example, when I uploaded my YSEQ WGS '23andMe' file, these were the stats:

Total SNPs input 1483429
Total 'usable' 1225795
Total 'slimmed' ie comparable 913998

You can get these stats from the File Diagnostic Utility function.

Donwulff
05-05-2019, 04:16 PM
There was recent discussion on mtDNA heteroplasmy; perhaps not entirely coincidentally (because search and recommendation engines spy on us...), I saw a research paper titled "Assessing mitochondrial heteroplasmy using next generation sequencing: A note of caution" https://www.sciencedirect.com/science/article/pii/S1567724918300874 from last year but published now, recommended to me. "Mitochondrial DNA-like sequences in the nucleus (NUMTs) can interfere with the detection of heteroplasmy. For example, human chromosome 8 contains almost an entire mtDNA sequence inserted into the first intron of SDC2. This may yield misleading results in relation to heteroplasmy levels and highlights a need for greater rigor in the analysis of mtDNA sequence data, particularly as recent studies indicate that on average human individuals carries approximately 750 NUMTs, ~4 of which are typically unique to each individual." Indeed, generating a callability BED file for a Dante Labs sequence (PGP participant, GATK CallableLoci default settings) gives as an "incidental finding" for mtDNA:
state nBases
CALLABLE 15646
POOR_MAPPING_QUALITY 925

"Poor mapping quality" means, in general terms, that the reads covering those nucleotide bases map into multiple locations over the *reference genome*. As the cautionary note above points out, individual genomes have other copies of the mtDNA sequence which may not exist in the reference genome, and would thus go undetected by this method. (Improved reference genomes & longer read-lengths will of course help some) This highlights just one of the problems for matching relatives from sequencing data; even when a location has coverage, it could be non-specific, ie. could originate from several locations over the genome (duplications, pseudogenes, inherited translocations etc.). And while most of these locations are common between most genomes, some are specific to an individual. Microarray tests get around this by mostly testing SNP's known to be highly specific and reproducible. The microarray product files include a list of known exceptions, and companies like 23andMe and especially AncestryDNA which have added large number of clinical variants to the mix are, of course, in uncharted territory. If GEDMatch (Is this Genesis or normal?? Are the "slimmed" etc. defined in detail somewhere?) already filters variants to ones "known" to be reliable, on one hand that's close to what they should be doing, but on the other hand it defeats most of the benefits of using sequencing, except for maximizing overlap.

kafky
05-06-2019, 09:22 PM
Update on the mtDNA issue.
After I contacted Dante Labs about the lack of mtDNA results in my VCF file, they sent me a link to a specific VCF file with only mtDNA variants. I transformed the VCF to raw format in DNA Kit Studio and uploaded it to the James Lick tool. It suggested a different haplogroup from my last attempt with 23andme results: first it was H100, now with the Dante results it is H2a2a1. Which would be correct?

Here are the James Lick results:
From Dante VCF-to-Raw results:

Markers found (shown as differences to rCRS):
HVR2: 73G 150T 152C 195C 302G 410A
CR: 2354G 2485C 2708T 5049T 5581G 6777G 7029T 8702T 9378C 9541C 10399T 10755T 10820T 10874T 11018A 11720T 11723G 12359T 12706C 12851C 13612G 14213G 14581C 14767C 14906T 15302T 15933A
HVR1: 16173T (16183G) 16185T 16191T 16225T 16244A 16322G


Best mtDNA Haplogroup Matches:

1) H2a2a1

Defining Markers for haplogroup H2a2a1:
HVR2:
CR:
HVR1:

Marker path from rCRS to haplogroup H2a2a1 (plus extra markers):
H2a2a1(rCRS) ⇨ 73G 150T 152C 195C 302G 410A 2354G 2485C 2708T 5049T 5581G 6777G 7029T 8702T 9378C 9541C 10399T 10755T 10820T 10874T 11018A 11720T 11723G 12359T 12706C 12851C 13612G 14213G 14581C 14767C 14906T 15302T 15933A 16173T (16183G) 16185T 16191T 16225T 16244A 16322G

Good Match! Your results also had extra markers for this haplogroup:
Extras(39): 73G 150T 152C 195C 302G 410A 2354G 2485C 2708T 5049T 5581G 6777G 7029T 8702T 9378C 9541C 10399T 10755T 10820T 10874T 11018A 11720T 11723G 12359T 12706C 12851C 13612G 14213G 14581C 14767C 14906T 15302T 15933A 16173T (16183G) 16185T 16191T 16225T 16244A 16322G

Donwulff
05-06-2019, 10:42 PM
H2a2a1 is the rCRS mtDNA sequence "root", which you therefore also get if there are no known/identified variants in the sample. In addition, it's easy to see that ALL the variants listed in the sample are in the "Extras" section, ie. none of the variants in the sample were expected for that haplogroup. It's by now pretty well established that Dante Labs is using the Yoruban chrM reference, so you should be using the New Advanced version at https://dna.jameslick.com/mthap-new/advanced.php and choose the Yoruban reference, which I'm assuming you didn't use. DL should consider updating to rCRS to reduce confusion, but the fact is there are different references (including GRCh38/hg38) in use, so that's something people will have to continue to deal with. (Personally I think it's time to do everything in GRCh38 and liftover to hg19 for the 23andMe/AncestryDNA compatibility files.)

kafky
05-06-2019, 10:59 PM
H2a2a1 is the rCRS mtDNA sequence "root", which you therefore also get if there are no known/identified variants in the sample. In addition, it's easy to see that ALL the variants listed in the sample are in the "Extras" section, ie. none of the variants in the sample were expected for that haplogroup. It's by now pretty well established that Dante Labs is using the Yoruban chrM reference, so you should be using the New Advanced version at https://dna.jameslick.com/mthap-new/advanced.php and choose the Yoruban reference, which I'm assuming you didn't use. DL should consider updating to rCRS to reduce confusion, but the fact is there are different references (including GRCh38/hg38) in use, so that's something people will have to continue to deal with. (Personally I think it's time to do everything in GRCh38 and liftover to hg19 for the 23andMe/AncestryDNA compatibility files.)

Same result with Yoruba. But something may not be working well. The VCF file is very, very small - only 40 variants, those listed in the previous post... I already wrote to James Lick for advice...

Donwulff
05-06-2019, 11:26 PM
mtDNA has very few variants, especially if you're close to whatever haplogroup the reference is rooted at. I have 18 variants against my rCRS mtDNA reference; a Yoruban reference would give something around 40, because that's the number of differences the Yoruban reference has to the European one. But I don't know what your VCF conversion step does; presumably it messes up the reference. The entire variant list is lost somewhere in the history of this endless thread, but I'm sure that's the Yoruban reference problem that was being discussed back then.

Erikl86
05-07-2019, 07:09 AM
Hallelujah !

Got this today:

"Dante Labs: Your sequencing is done. We are running your bioinformatic analysis!
Dear Customer,

We're excited to let you know that we completed the sequencing on your sample.

We are running out the bioinformatic analysis and we would like to inform you that we might need some extra days before the delivery of the results!

Please do not hesitate to contact me if you need any further assistance.

Best regards,

Dante Labs' Team"

Purchased the kit Nov 23, sent the kit back on Jan 5th.

These "extra days" turn out to be quite long...

bmoney
05-08-2019, 01:35 AM
Hi Guys,

I haven't followed this thread, but just a general question:

Is Dante WGS still the best product in terms of cost-effective WGS-based ancestral analysis?

Also, do you rate it for health-related information?

MacUalraig
05-08-2019, 06:30 AM
Hi Guys,

I haven't followed this thread, but just a general question:

Is Dante WGS still the best product in terms of cost-effective WGS-based ancestral analysis?

Also, do you rate it for health-related information?

Yes, subject to the following caveats:
1. Their prices fluctuate a lot and it's best to wait for yet another sale rather than take the 'standard' price; for example, right now it says 549 Euros for the x30, and at that price I'd be tempted to go for the YSEQ x15 at 643 Euros - less read depth, but faster and more professional all around.
2. The information is best left up to you after you get the raw data, eg uploading to Promethease or YFull, rather than what they give you.
3. Previously they have taken ages to deliver the VCF, then a further age to get the BAM. They now have a 90-day guarantee to deliver the whole lot, but it only started in late April, so we don't know yet whether they can deliver.

aaronbee2010
05-08-2019, 02:38 PM
Hi Guys,

I haven't followed this thread, but just a general question:

Is Dante WGS still the best product in terms of cost-effective WGS-based ancestral analysis?

Also, do you rate it for health-related information?

It's the most cost-effective if you get it on offer. Their offers are usually really good (look out for offers that are EUR199 or less); however, if you want to upload Y-DNA/mtDNA data to YFull (which requires a .BAM file), a hard drive is mandatory, and this will set you back another EUR59. You don't need the hard drive for autosomal analysis (the .VCF that's available for download is sufficient for this), however it's still worth it. Even if you order the hard drive, it's still easily the cheapest WGS (and even the cheapest Y-DNA sequencing test, for that matter). The initial results (the .VCF file) have a 90-day guarantee, however I don't think the 90 days includes HDD (with .BAM file) delivery - somebody correct me if I'm wrong on this.

I ordered a Dante Labs x30 WGS back in February and my .VCF should be ready by the end of next month - the .BAM file will probably take another few months. They only introduced the 90-day guarantee recently.

I'm not privy to how good their WGS is for ancestry but it should be really good for health (I recommend uploading to either Promethease or Sequencing.com).

On another note - I think it's really good for uniparental marker analysis (uploading to YFull).

poi
05-08-2019, 06:31 PM
Hi Guys,

I haven't followed this thread, but just a general question:

Is Dante WGS still the best product in terms of cost-effective WGS-based ancestral analysis?

Also, do you rate it for health-related information?

Dante Labs is the most cost-effective way to get WGS afaik. I already have mine done (VCF and health reports are ready) and I'm awaiting the BAM hard drive (another 60 USD, already paid for). To build the full SNP calls with just the VCF (free download), you have to utilize the reference file, so it's a bit of a pain for me to build the full SNP calls right now. If anyone (looking at you, Mr. aaronbee) has figured out how to build the full SNP list, let me know. I have stopped digging on that front, as I will have everything on the BAM drive in a week or so - just a bit of a wait. Once I have the BAM drive, I will upload to Sequencing.com, where I can apparently use the EvE Premium app to build the SNP list. Also, I can upload to YFull for the customary Y-DNA analysis. AFAIK, regarding the health report, you should be able to upload the full VCF built from the BAM file to Promethease for their health report.

aaronbee2010
05-08-2019, 06:39 PM
Dante Labs is the most cost-effective way to get WGS afaik. I already have mine done (VCF and health reports are ready) and I'm awaiting the BAM hard drive (another 60 USD, already paid for). To build the full SNP calls with just the VCF (free download), you have to utilize the reference file, so it's a bit of a pain for me to build the full SNP calls right now. If anyone (looking at you, Mr. aaronbee) has figured out how to build the full SNP list, let me know. I have stopped digging on that front, as I will have everything on the BAM drive in a week or so - just a bit of a wait. Once I have the BAM drive, I will upload to Sequencing.com, where I can apparently use the EvE Premium app to build the SNP list. Also, I can upload to YFull for the customary Y-DNA analysis. AFAIK, regarding the health report, you should be able to upload the full VCF built from the BAM file to Promethease for their health report.

I would love to help, but what little free time I have after university is currently going towards analysing STRs from various South Asian studies in an almost-hopeless effort to see where my unicorn-ish Y-DNA came from. Well, that and trying to learn how to use HipSTR without any prior coding experience (unless "Hello World" counts as experience).

Also, I'm not sure exactly what you mean by building a SNP list.

poi
05-08-2019, 06:50 PM
I would love to help, but what little free time I have after university is currently going towards analysing STRs from various South Asian studies in an almost-hopeless effort to see where my unicorn-ish Y-DNA came from. Well, that and trying to learn how to use HipSTR without any prior coding experience (unless "Hello World" counts as experience).

Also, I'm not sure exactly what you mean by building a SNP list.

First off, I just realized that your YDNA in your profile has R1b. Unicorn indeed. I thought you were R2 before.

Anyway, back to the DanteLab VCF -- the VCF download only has the variant calls, not the full SNPs. To build the full SNPs, you'd have to merge the reference with your DanteLab VCF. At least that has been my understanding. With purely DanteLab's VCF, the Gedmatch admix calcs' results are wacky and the SNP coverage is extremely low.

aaronbee2010
05-08-2019, 07:02 PM
First off, I just realized that your YDNA in your profile has R1b. Unicorn indeed. I thought you were R2 before.

Anyway, back to the DanteLab VCF -- the VCF download only has the variant calls, not the full SNPs. To build the full SNPs, you'd have to merge the reference with your DanteLab VCF. At least that has been my understanding. With purely DanteLab's VCF, the Gedmatch admix calcs' results are wacky and the SNP coverage is extremely low.

It's a special disorder where I have two different Y-chromosomes. I also *coincidentally* happen to have mtDNA heteroplasmy (as is also shown in my profile), and also have two different mtDNA groups, which, as I've said, is *purely a coincidence*.

R1b in South Asia is generally a lot rarer than R2, but there are more R1b samples in South Asia than whatever my subclade is D:

On a parallel note, M is generally a lot more common in South Asia than U7 (although U7 isn't rare), however my dad's branch of M is a lot rarer than U7. Most of my uniparental markers are generally rare in South Asia - the only exception being my mother's mtDNA, which is woefully common around NW South Asia. I believe that translates to a 75% unicorn coefficient.

Regarding the VCF, I thought the VCF was just a simple list containing all mutations you have relative to the human reference sequence. Am I missing something here?

poi
05-08-2019, 07:17 PM
It's a special disorder where I have two different Y-chromosomes. I also *coincidentally* happen to have mtDNA heteroplasmy (as is also shown in my profile), and also have two different mtDNA groups, which, as I've said, is *purely a coincidence*.

R1b in South Asia is generally a lot rarer than R2, but there are more R1b samples in South Asia than whatever my subclade is D:

On a parallel note, M is generally a lot more common in South Asia than U7 (although U7 isn't rare), however my dad's branch of M is a lot rarer than U7. Most of my uniparental markers are generally rare in South Asia - the only exception being my mother's mtDNA, which is woefully common around NW South Asia. I believe that translates to a 75% unicorn coefficient.

Regarding the VCF, I thought the VCF was just a simple list containing all mutations you have relative to the human reference sequence. Am I missing something here?
I need to pick my jaw up off the floor and try to process your uni/duo-lineals. Amazing. May I ask how you found out about this? Through commercial genetic testing alone?

Regarding the DanteLab VCF, it only contains what's different from the HG19 reference, so the VCF file alone will have low coverage in Gedmatch calcs, for example. AFAIK, you'd have to build another file that combines your DanteLab VCF and the HG19 reference. Until you have done that, the DanteLab VCF is almost useless in GedMatch. Correct me if I'm wrong.

aaronbee2010
05-08-2019, 07:39 PM
I need to pick my jaw up off the floor and try to process your uni/duo-lineals. Amazing. May I ask how you found out about this? Through commercial genetic testing alone?

Regarding the DanteLab VCF, it only contains what's different from the HG19 reference, so the VCF file alone will have low coverage in Gedmatch calcs, for example. AFAIK, you'd have to build another file that combines your DanteLab VCF and the HG19 reference. Until you have done that, the DanteLab VCF is almost useless in GedMatch. Correct me if I'm wrong.

Paternal Y-DNA: Me testing with LivingDNA + YSEQ
Maternal Y-DNA: Maternal uncle testing with 23andMe + YSEQ (I got my mom to persuade him :D)
Paternal mtDNA: Father testing with 23andMe (I had to get this one for my dad, with the excuse that it was on offer at the time)
Maternal mtDNA: Me testing with FTDNA (mtFull Sequence)

I'm trying to find my maternal grandmother's Y-DNA now (I've asked my mom to try and persuade her mother's brother's son), however I have no idea how well that will end up going. I've also asked if my mom can persuade one of her father's sisters' daughters to do a test (for my maternal grandfather's mtDNA). My mom's a lot more accepting of my curiosity than my dad, who thinks it's all a pointless waste of money (I'm sure glad he doesn't know about my Dante Labs purchase - it cost me what little savings I had left after the mtDNA sequence, plus the money from selling an old graphics card from my computer). I want to see if I can test my paternal grandfather's mtDNA while he's still alive, however I don't know how long I have left there. My dad says if I get a good result for my university degree, he'll give me some leeway (I want him and his father to take a mtFull Sequence).

-

Regarding the VCF, is this what you're saying (just making sure I understand you correctly):

* The Dante Labs VCF has just the mutations relative to the hg19 reference.
* To make a full VCF file suitable for GEDmatch, you would need to take a file containing the whole hg19 reference, delete all of the positions in there that you have variants for in the Dante Labs file, and then place your variants in there (or something like that).

pinoqio
05-08-2019, 08:16 PM
It's a special disorder where I have two different Y-chromosomes.
That is absolutely fascinating - if you don't mind my prying, do you know how that came about?
I am aware of XYY syndrome (https://en.wikipedia.org/wiki/XYY_syndrome) but it would simply give you a duplicate Y of your father.
On the other hand, to inherit your mother's side Y, she would have to have the Y chromosome herself. But I suspect it would have to be broken in some way, so as not to cause an intersex condition and infertility in a woman?

aaronbee2010
05-08-2019, 08:21 PM
That is absolutely fascinating - if you don't mind my prying, do you know how that came about?
I am aware of XYY syndrome (https://en.wikipedia.org/wiki/XYY_syndrome) but it would simply give you a duplicate Y of your father.
On the other hand, to inherit your mother's side Y, she would have to have the Y chromosome herself. But I suspect it would have to be broken in some way, so as not to cause an intersex condition and infertility in a woman?

That was actually a joke on my part. It was in reference to the fact that I have two different Y-DNA subclades listed on my profile, which I joked was because I had two different Y chromosomes myself.

In reality, I'm just another XY male. R2-Y1383* is my paternal Y-DNA (my/my father's Y-DNA) and R1b-Z2109 is my maternal Y-DNA (my maternal uncle's Y-DNA).

I hope that's clarified things for you :D

pinoqio
05-08-2019, 08:50 PM
Ah, I might have to turn my sarcasm detector up a notch, eh ;)

To get back to @bmoney's question, I even think there's a solid chance of getting a free WGS at this point. If you check out the turnaround time (https://www.dantelabs.com/blogs/technical/turnaround-time), apparently it's just over 2 months - but that's an average, and with the huge variation and the general chaos at Dante Labs, I'd say you have a fair 20%+ chance that their 90-day money-back guarantee comes into effect. But of course it's tempting to wait, since they seem to be doing sub-200€ promotions every few months.

poi
05-08-2019, 09:37 PM
...
Regarding the VCF, is this what you're saying (just making sure I understand you correctly):

* The Dante Labs VCF has just the mutations relative to the hg19 reference.
* To make a full VCF file suitable for GEDmatch, you would need to take a file containing the whole hg19 reference, delete all of the positions in there that you have variants for in the Dante Labs file, and then place your variants in there (or something like that).

Basically yes. Dante Labs' VCF expresses your variants in relation to the reference, so you have to use both their VCF and the reference to build the full SNP calls. Technically, it is straightforward to build the full list by deduction, but it won't be completely accurate, because the VCF only contains "passing" variants. Your non-passing variants won't be in the VCF, so assuming the reference allele at those positions may be wrong. Regardless, the newly built full list should cover most of the SNP list until you get the BAM file and use proper tools to get the full list (based on the 30x reads).

That being said, I have no bioinformatics background, so I'm curious whether what I'm saying is right.
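
For illustration, here's a toy sketch of that deduction in shell - all file names and the template layout are made up, not Dante Labs outputs, and it assumes bcftools is installed. Any template position absent from the VCF gets the reference allele twice, which is exactly where the non-passing variants make it inaccurate:

# template.txt: rsid<TAB>chrom<TAB>pos<TAB>ref (array positions plus their hg19 reference allele)
bcftools query -f '%CHROM\t%POS\t[%TGT]\n' genome.snp.vcf.gz > calls.txt
awk -F'\t' 'NR==FNR { g=$3; gsub("/","",g); gt[$1"\t"$2]=g; next }
{ k=$2"\t"$3; print $1"\t"$2"\t"$3"\t"(k in gt ? gt[k] : $4 $4) }' calls.txt template.txt > full_calls.txt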

aaronbee2010
05-09-2019, 11:12 AM
Basically yes. Dante Labs' VCF expresses your variants in relation to the reference, so you have to use both their VCF and the reference to build the full SNP calls. Technically, it is straightforward to build the full list by deduction, but it won't be completely accurate, because the VCF only contains "passing" variants. Your non-passing variants won't be in the VCF, so assuming the reference allele at those positions may be wrong. Regardless, the newly built full list should cover most of the SNP list until you get the BAM file and use proper tools to get the full list (based on the 30x reads).

That being said, I have no bioinformatics background, so I'm curious whether what I'm saying is right.

If that's the case then it just seems best to wait for the .BAM file, as that has all the variants, not just the "passed" ones.

tontsa
05-09-2019, 04:51 PM
Hi,

Has anyone succeeded in creating a pipeline that gets you from either the FASTQ or the BAM file to a working, accepted 23andMe V3 or V5 file? I tried Sequencing.com's EvE Premium with the .bam and the resulting file is around 998 MB, with way too many SNPs even for Gedmatch Genesis. Teepean's extract23 seems to be the closest to a working solution... but I haven't succeeded with it yet... probably need to downgrade samtools or htslib.

Thanks in advance,
Toni
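
For what it's worth, the general idea behind such a pipeline can be sketched in a few lines of shell. This is not extract23 itself, just the same idea; the reference and sites file names are placeholders, and it assumes samtools/bcftools are installed. Genotypes are called only at the array's hg19 positions, then exported as a 23andMe-like table:

# rsIDs print as "." unless the calls are annotated against dbSNP or the chip template
bcftools mpileup -f human_g1k_v37.fasta -R sites.bed -Ou genome.bam \
| bcftools call -m -Ou \
| bcftools query -f '%ID\t%CHROM\t%POS[\t%TGT]\n' - > 23andme_like.txt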

Erikl86
05-09-2019, 07:43 PM
These "extra days" turn out to be quite long...

Yay! Results finally came!

P.S.

Didn't get the personal custom report.

Also, how can I upload the VCF SNP file to Gedmatch?

tontsa
05-10-2019, 04:39 AM
Also, how can I upload the VCF SNP file to Gedmatch?

Best I've come across is http://dnagenics.com/tools/dna-kit-studio-v2-2/ - with that I was at least able to get a 23andMe-compatible file, for which the Gedmatch Genesis analysis reports:
Number of original snps is 1467208
Usable SNPS is 1223256.
Usable SNPS(slim) is 922101.
Slimmed by 24.6 Pct.
Though I haven't yet tried the new version that fills in the blanks from the hg19 reference genome..

Best regards,
Toni

pmokeefe
05-12-2019, 12:26 PM
October 30, 2018: Ordered Dante Labs WGS kit with shipping to an address in the U.S.
Nov 15, 2018: Kit delivered to the address in the U.S.
Nov 21, 2018: Mailed kit back to Dante Labs using the provided shipping label, from the U.S. to their address in Italy
--- That shipment was rejected by customs in Italy and returned to the U.S. address
Dec 17, 2018: Mailed kit back to Dante Labs from the U.S. to their address in Utah, after requesting a U.S. shipping destination
May 9, 2019: VCF files and health report available for download on the Dante Labs web site

However, I did not receive an email from Dante Labs saying the results were ready this time (this was the third kit I have ordered; they emailed me when the earlier two were ready). I just happened to check the web site and saw it, so I'm not sure exactly when the results were first available.
In addition to the usual snp.vcf.gz and indel.vcf.gz, there were also sv.vcf.gz and cnv.vcf.gz files available for download (structural variants and copy number variants). Though they were not listed on the website for my previous kit, I was able to download the sv and cnv files by looking at the URL for the original snp.vcf.gz download and changing the 'snp' to 'sv' and 'cnv'. However, that did not work for my first kit, which was ordered much earlier.
So that might be worth trying if you are interested in the structural and copy number variants.
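
If you want to try the same trick, here's a sketch - the base URL is a made-up stand-in for whatever your own snp.vcf.gz download link is:

base="https://s3.amazonaws.com/example-bucket/raw_data/kit12345/kit12345.snp.vcf.gz"
for t in sv cnv; do
  curl -sIL "${base/snp/$t}" | head -n 1   # an HTTP 200 status here suggests the file exists
done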

Erikl86
05-12-2019, 01:01 PM
Best I've come across is http://dnagenics.com/tools/dna-kit-studio-v2-2/ - with that I was at least able to get a 23andMe-compatible file, for which the Gedmatch Genesis analysis reports:
Number of original snps is 1467208
Usable SNPS is 1223256.
Usable SNPS(slim) is 922101.
Slimmed by 24.6 Pct.
Though I haven't yet tried the new version that fills in the blanks from the hg19 reference genome..

Best regards,
Toni

Thanks - unfortunately, converting the SNP VCF file to 23andMe V3 format gave me bogus results on Genesis. I mean, I did get my closest results to Ashkenazi Jews, but at a distance of 12-15. Also, only a very small number of SNPs was detected by Genesis.

I will try the hg19 reference feature today.

On a positive note, I did get links to download my FASTQ files! 65 GB in total! Yeah!!

Any recommended tool to convert them to BAM files?

I've found this one:

http://www.y-str.org/2015/08/srafastq-to-bam-kit.html

But it's 4 years old - I'm sure there are newer tools around, no?

UPDATE: Just sent the links to YFull - hopefully they'll be able to analyze it and also provide me with a BAM file. Will update soon.

aaronbee2010
05-12-2019, 02:27 PM
Thanks - unfortunately, converting the SNP VCF file to 23andMe V3 format gave me bogus results on Genesis. I mean, I did get my closest results to Ashkenazi Jews, but at a distance of 12-15. Also, only a very small number of SNPs was detected by Genesis.

I will try the hg19 reference feature today.

On a positive note, I did get links to download my FASTQ files! 65 GB in total! Yeah!!

Any recommended tool to convert them to BAM files?

I've found this one:

http://www.y-str.org/2015/08/srafastq-to-bam-kit.html

But it's 4 years old - I'm sure there are newer tools around, no?

UPDATE: Just sent the links to YFull - hopefully they'll be able to analyze it and also provide me with a BAM file. Will update soon.

Would the EvE tool on Sequencing.com suffice?

Erikl86
05-12-2019, 03:19 PM
Would the EvE tool on Sequencing.com suffice?

Perhaps. I'm trying it at the moment.

aaronbee2010
05-12-2019, 05:28 PM
Thanks - unfortunately, converting the SNP VCF file to 23andMe V3 format gave me bogus results on Genesis. I mean, I did get my closest results to Ashkenazi Jews, but at a distance of 12-15. Also, only a very small number of SNPs was detected by Genesis.

I will try the hg19 reference feature today.

On a positive note, I did get links to download my FASTQ files! 65 GB in total! Yeah!!

Any recommended tool to convert them to BAM files?

I've found this one:

http://www.y-str.org/2015/08/srafastq-to-bam-kit.html

But it's 4 years old - I'm sure there are newer tools around, no?

UPDATE: Just sent the links to YFull - hopefully they'll be able to analyze it and also provide me with a BAM file. Will update soon.

I just realised that the order page on YFull (https://www.yfull.com/order/) already accepts FASTQ files (provided you have a shareable link to download them from), so you could just go straight for it.

Erikl86
05-12-2019, 05:31 PM
I just realised that the order page on YFull (https://www.yfull.com/order/) already accepts FASTQ files (provided you have a shareable link to download them from), so you could just go straight for it.

Yep, this is what I've done - I'm now waiting for their approval. Hopefully the Dante Labs results are going to pay off in terms of genealogy as well.

I've tried the hg19 reference btw, and got the following results on DNA Kit Studio:
> Processed: 4368706 lines
> Processed: 4301081 SNPs

Uploading it now to Genesis.

aaronbee2010
05-12-2019, 05:38 PM
Yep, this is what I've done - I'm now waiting for their approval. Hopefully the Dante Labs results are going to pay off in terms of genealogy as well.

I've tried the hg19 reference btw, and got the following results on DNA Kit Studio:
> Processed: 4368706 lines
> Processed: 4301081 SNPs

Uploading it now to Genesis.

By any chance, are you also interested in health (this was one of the driving factors that made me invest in a Dante Labs kit while it was on offer)? I believe there are certain diseases that are more prevalent in Ashkenazi Jewish populations, and knowing whether you have a genetic predisposition to at least one of them can help you plan accordingly.

jodyfanning
05-12-2019, 06:32 PM
I think the noise is the exception, although it's been a bit confusing. On the other hand, it seems to me that the people making the loudest noise about things like starting a campaign to fill Better Business Bureau/Facebook pages with complaints are usually the ones in the 9th week of their "8 to 10 weeks" delivery time. Then again, there have been some pretty big slip-ups in the past, and any publicity from their discount campaigns is easily lost on three or four people who give a negative review on delivery time (most of whom aren't even verified purchasers... on Amazon.com).

On the flip side, how many thousands have purchased the test, and how many have complained? Still, they do have a public relations disaster on their hands from the delivery times. Does that 90 days include raw data? Because people are going to assume it does, and if it doesn't, there's going to be a LOT of fighting over that. But yes, hopefully this means they've streamlined everything so they can hold to the promise of delivering everything, including the raw data, in 90 days.

Also note it's 90 days from when they receive the sample. I think there was at least one case on FB where someone asked what was taking so long with their results, only to be told their sample hadn't been received yet. But often people take months to send their sample in, so I don't think you can blame Dante Labs for that - it's just definitely something you have to watch for. I think most delivery time complaints I have seen are about the raw data, and again, I expect most people don't even seek the raw data; it's just us "genealogy nerds".

At least on the EU site right now, the front-page pic is the "Whole GenomeL" but the description is "The most comprehensive DNA Test", and the link actually takes you to the WGS test. I hope people are paying attention.

I paid during one of the specials in November and received the kit at the end of December. It took some messing around to get the shipping label after I realised I needed to actually request it (they never sent it automatically), and I sent it back in February. My status changed to "Kit Received" in late February. I have heard zero since then.

aaronbee2010
05-12-2019, 07:10 PM
I paid during one of the specials in November and received the kit at the end of December. It took some messing around to get the shipping label after I realised I needed to actually request it (they never sent it automatically), and I sent it back in February. My status changed to "Kit Received" in late February. I have heard zero since then.

Have you contacted them via email or Facebook? Dante Labs didn't update my kit status until after I contacted them. I'm pretty sure my kit had passed QC before I even sent a message their way.

tsunami
05-13-2019, 12:33 AM
On a positive note, I did get links to download my FASTQ files! 65 GB in total! Yeah!!

How many FASTQ files do you have?

Erikl86
05-13-2019, 03:47 AM
How many FASTQ files do you have?

Two files

Erikl86
05-13-2019, 04:44 PM
I just realised that the order page on YFull (https://www.yfull.com/order/) already accepts FASTQ files (provided you have a shareable link to download them from), so you could just go straight for it.

Unfortunately, I just received an email from them saying they don't work with WGS FASTQ files, as they would need to compile them into BAM files, and they don't support that because it's too resource-intensive a process.

Donwulff
05-15-2019, 05:46 AM
Y-chromosome-only FASTQs take very little processing, though. On that note, I've been wondering about Dante Labs data: are there any "best practices" for the remap to GRCh38 that I've been thinking of? I'm not sure if I should map with a lower seed length and run BQSR or not. (Probably not, because there are no accepted best practices for Y-chromosome BQSR; as noted in the other thread, I've been noticing the error profile seems quite wacky for the Y chromosome.)

Aside: you should be able to print the return shipping label from your Dante Labs account. It's a good idea to do that before you get the results. Back in 2017 they just sent it by e-mail; I don't recall if there was one in the sample kit, but I don't see why they don't just put one in the sample kit, since not everybody will even have easy access to a printer.

Petr
05-15-2019, 07:16 AM
I just noticed FASTQ files for download. Surprisingly, they are much smaller: last year I received FASTQ files of 40 to 52 GB each, while the new files are 19 to 24 GB each.

Is there any easy way to determine the read length and number of reads in these FASTQ files?

Donwulff
05-15-2019, 07:44 AM
Read length = (maximum) row length; number of reads = number of rows / 4. Longer reads need fewer total reads to reach the same read depth. In paired-end (PE) sequencing there are two paired FASTQ files, so either count all reads or the number of read pairs. Unfortunately, most editors will choke on files that large (if you even have room to extract them - IF they are compressed; lack of compression would certainly explain the larger size). One trick could be to interrupt the download quickly, see if an uncompresser will extract what was downloaded, and then estimate total reads from the compression ratio. I think there are some large-file editors you could try, but without knowing your computer & OS it's impossible to even guess.
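
On a Linux-style shell you can skip editors entirely; a small sketch, assuming a gzipped FASTQ with the usual 4-line records (this streams the whole file, so it can take a while on files this size):

zcat sample_1.fq.gz | awk 'NR%4==2 { n++; if (length($0) > max) max = length($0) }
END { print n " reads, max read length " max }'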

pinoqio
05-15-2019, 11:00 AM
I just noticed FASTQ files for download. Surprisingly, they are much smaller: last year I received FASTQ files of 40 to 52 GB each, while the new files are 19 to 24 GB each.

Is there any easy way to determine the read length and number of reads in these FASTQ files?

You can get statistics using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), this will report read length, quality, duplicate reads, etc

Do you have the smaller, downloadable FASTQs for the same kit that you already received on hard disk? Would be great to see a direct comparison to figure out how they are cutting down file size.

Petr
05-15-2019, 03:43 PM
So here are the sizes of the data inside my FASTQ files (read length x number of read pairs x 2 = total bases):

2018, on HDD
PS: 100 x 600184154 x 2 = 120 giga
HB: 100 x 636133334 x 2 = 127 giga

2019, downloaded
OB: 150 x 342766670 x 2 = 103 giga
BK: 150 x 417158704 x 2 = 125 giga
JV: 150 x 312527687 x 2 = 94 giga
HK: 150 x 302178190 x 2 = 91 giga
MB: 150 x 358447022 x 2 = 108 giga

FGC WGS 15x, for comparison:
150 x 167869529 x 2 = 50 giga


FastQC looks nice; it will take more time to get the results.

I have no FASTQ download links for old kits.

pmokeefe
05-15-2019, 04:47 PM
I just noticed FASTQ files for download. Surprisingly, they are much smaller: last year I received FASTQ files of 40 to 52 GB each, while the new files are 19 to 24 GB each.

Hi Petr,
How did you notice the FASTQ files for download? Was the URL for the FASTQ download similar to the URL for the VCF files for that kit? If so, what is specific to the FASTQ URLs? I have results back for three Dante Labs kits so far. I have been able to download files (like CNV and SV) that were listed for one kit but not another, by constructing a URL by hand. Hoping that might be true for the FASTQ files too.

Petr
05-15-2019, 09:51 PM
The links appeared in the Kit Manager.

The links look like:

https://s3.amazonaws.com/datareleasefastq5/raw_data/d560018010XXXXX/d560018010XXXXX_USD16089235L_HJK2LDSXX_L3_1.fq.gz
https://s3.amazonaws.com/datareleasefastq5/raw_data/d560018010XXXXX/d560018010XXXXX_USD16089235L_HJK2LDSXX_L3_2.fq.gz
or
https://s3.amazonaws.com/datareleasefastq5/raw_data/s_560018010XXXXX_1.fq.gz
https://s3.amazonaws.com/datareleasefastq5/raw_data/s_560018010XXXXX_2.fq.gz

Erikl86
05-16-2019, 07:06 AM
I just noticed FASTQ files for download. Surprisingly, they are much smaller: last year I received FASTQ files of 40 to 52 GB each, while the new files are 19 to 24 GB each.

Is there any easy way to determine the read length and number of reads in these FASTQ files?

Yeah I also have the two FASTQ files available to download:

[screenshot attachment]

I've downloaded and extracted the files - each one is 144 GB, ~300 GB in total:

[screenshot attachments]

Question is - how can I merge them?

Also, the YFull team is awesome - they compiled my FASTQ files into a BAM file themselves!!

Check out this update from them:

[screenshot attachment]

Next I'd like to get my BAM from them if possible, and use it to extract 23andMe raw data to upload to Genesis. Maybe even send it to Davidski to get Global25 coordinates.

pmokeefe
05-16-2019, 08:23 AM
I tried several variations of URLs modeled on the ones Petr kindly posted, but only received errors. It's not clear whether that was because the files don't exist or because my URL attempts were incorrect (or both). It was obvious that some substrings in Petr's FASTQ URLs were just the kit IDs, which I substituted with mine, but it wasn't so obvious how the other random-looking substrings should be copied between the VCF URLs and the FASTQ URLs. Or maybe the random-looking substrings are different for the VCF and FASTQ URLs and this exercise is futile? Any further hints from customers who have the FASTQ downloads would be most appreciated!

I also ordered the hard drives, but I move back and forth between America and Europe fairly frequently, so it can be hit-or-miss for me to receive a shipment.
I just contacted a Dante Labs representative about this via the chat facility on their website. They replied:

We are working in providing the download links for the raw files in each customers' account. I have forwarded your request to the relevant team for assistance.

Erikl86
05-16-2019, 08:42 AM
I tried several variations of URLs modeled on the ones Petr kindly posted, but only received errors. It's not clear whether that was because the files don't exist or because my URL attempts were incorrect (or both). It was obvious that some substrings in Petr's FASTQ URLs were just the kit IDs, which I substituted with mine, but it wasn't so obvious how the other random-looking substrings should be copied between the VCF URLs and the FASTQ URLs. Or maybe the random-looking substrings are different for the VCF and FASTQ URLs and this exercise is futile? Any further hints from customers who have the FASTQ downloads would be most appreciated!

I also ordered the hard drives, but I move back and forth between America and Europe fairly frequently, so it can be hit-or-miss for me to receive a shipment.
I just contacted a Dante Labs representative about this via the chat facility on their website. They replied:

Yes, I've also contacted them with a question about my BAM file, and they promised it would be available to download - yet no word since (it's been two weeks).

fabaud
05-16-2019, 10:15 AM
Yes, I've also contacted them with a question about my BAM file, and they promised it would be available to download - yet no word since (it's been two weeks).

Yesterday, I received an e-mail: "We sincerely apologize for the delay in providing your BAM files. We have nudged the relevant team for them to upload these results. You will receive a response in a few days. Please let us know if you have any other questions."

Sounds good...

Donwulff
05-16-2019, 10:37 AM
In FASTQ file names, "R1" is generally "Read 1" and "R2" is "Read 2". In paired-end (PE) sequencing, which is generally the standard now, the DNA is cut into short pieces of a few hundred basepairs each and then sequenced from each end for the read length. This is beneficial, because you know the paired reads should be located within a few hundred basepairs of each other on the human reference genome, which makes it easier to find their exact location. So the "merging" typically happens by solving the billion-piece jigsaw puzzle: mapping each of the read pairs against a specific reference genome, producing the Binary Alignment/Map (BAM) file. This is a relatively computationally intensive process, and because you also want to sort the reads into the familiar chromosome-coordinate order, it requires a large amount of memory to do quickly; SSDs will help a lot as well. Again, there are other options, but I think right now Sequencing.com is looking like the easiest one. You generally can't get this service for free due to the required resources & knowledge, although I have posted some pointers & scripts on the "Dante Labs technical" thread for doing it if you have a computer with Linux and a lot of CPU + memory + SSD drives (or go for the full works & grab sample scripts from the Broad Institute's GATK Best Practices site, but that's even more involved).
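
The shape of that mapping step, as a minimal sketch - reference and file names are placeholders, and this omits duplicate marking and the other best-practice stages discussed later in the thread:

bwa mem -t 8 GRCh38.fa sample_1.fq.gz sample_2.fq.gz | samtools sort -o sample.bam -
samtools index sample.bam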

Erikl86
05-16-2019, 11:22 AM
In FASTQ file names, "R1" is generally "Read 1" and "R2" is "Read 2". In paired-end (PE) sequencing, which is generally the standard now, the DNA is cut into short pieces of a few hundred basepairs each and then sequenced from each end for the read length. This is beneficial, because you know the paired reads should be located within a few hundred basepairs of each other on the human reference genome, which makes it easier to find their exact location. So the "merging" typically happens by solving the billion-piece jigsaw puzzle: mapping each of the read pairs against a specific reference genome, producing the Binary Alignment/Map (BAM) file. This is a relatively computationally intensive process, and because you also want to sort the reads into the familiar chromosome-coordinate order, it requires a large amount of memory to do quickly; SSDs will help a lot as well. Again, there are other options, but I think right now Sequencing.com is looking like the easiest one. You generally can't get this service for free due to the required resources & knowledge, although I have posted some pointers & scripts on the "Dante Labs technical" thread for doing it if you have a computer with Linux and a lot of CPU + memory + SSD drives (or go for the full works & grab sample scripts from the Broad Institute's GATK Best Practices site, but that's even more involved).

Well, the problem is that EvE (Premium) on Sequencing.com lets you convert each FASTQ file separately - it has no option for merging that I could find. I sent them a message a few days ago, without any response.

From what you just wrote, I get that Read 1 and Read 2 need to be combined, in the logical order of first R1 then R2.

In any case, I now have two separate BAM files on Sequencing.com, each about 52 GB in size.

MacUalraig
05-16-2019, 02:45 PM
I've experimented with bowtie2 at home; it wouldn't run on my older laptop (since retired) but was fine on an i7 with 8 GB - the total system RAM in use was 5 GB, and the old one only had 4. I didn't use an SSD as it was too full. According to my notes, bowtie itself was using 3.4 GB, in line with what the authors said:

"Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB."

If you know and have Linux, feel free to use it, but it is perfectly straightforward in Windows. Having said that, for the 10x Genomics FASTQs I just got today, I'm going to do it on a new Linux workstation, on the SSD too (bit of a squeeze), so I can justify all the money I spent on it. :-) Ideally of course I need barcode-aware alignment...

oagl
05-16-2019, 02:49 PM
I used NextGenMap to realign my FASTQs to GRCh38, as it seems to be a faster/better alternative to other aligners like bowtie2 and bwa mem: https://github.com/Cibiv/NextGenMap/wiki

JamesKane
05-16-2019, 06:20 PM
If you know and have Linux, feel free to use it, but it is perfectly straightforward in Windows. Having said that, for the 10x Genomics FASTQs I just got today, I'm going to do it on a new Linux workstation, on the SSD too (bit of a squeeze), so I can justify all the money I spent on it. :-) Ideally of course I need barcode-aware alignment...

LongRanger requires more than 64 GB of RAM (128 GB is the minimum recommended), so you may have trouble getting it running. Without it, 10x Chromium FASTQs come out looking like very pricey 30x WGS BAMs. I recommend just using FGC's results unless you have a very special need to do it yourself.

MacUalraig
05-16-2019, 06:44 PM
LongRanger requires more than 64 GB of RAM (128 GB is the minimum recommended), so you may have trouble getting it running. Without it, 10x Chromium FASTQs come out looking like very pricey 30x WGS BAMs. I recommend just using FGC's results unless you have a very special need to do it yourself.

I took the LongRanger 'requirements' into account when making my purchase. It was one of the top considerations, but the machine is dual-use, as I also need it for some neural-network chess I'm playing with, which requires dual RTX GPUs too - so not all the money went on CPU and RAM.

Donwulff
05-16-2019, 07:09 PM
The "industry standard" for mapping FASTQ files is bwa-mem + GATK MarkDuplicates (at the very least). I don't want to start an operating system holy war, but basically GATK only works in Linux (Either that, or I forgot how to define temporary file path so it works on Windows), so you can't do it "correctly". You could use Windows compiled samtools, but that can cause all kinds of format problems downstream.

BWA-mem and Bowtie2 both use Burrows-Wheeler transform FM-index and have relatively low memory usage, but the results are not equivalent (To say nothing of other mappers/aligners - the whole point is to have predictable, comparable results, and most tools don't even produce results that are compatible with downstream analysis). Personally I prefer CUSHAW3 ;) Either way, you'll need the executable binaries from somewhere which can be little tricky as well. So basically I'd just recommend running that on Linux. Of course, you can do everything on cloud platform yourself if you have the bandwidth and don't mind that, in which case computing resources are easy, mostly pricey.

The memory use in (non-Chromium) processing comes mostly from sorting. Going with my jigsaw-puzzle analogy, you might think it's easy - with the reads matched against their appropriate locations in the reference genome, just dump them out - but unfortunately the algorithm doesn't work like that. It would have to keep *all* the mapped reads in memory, taking at least as much memory as all the uncompressed FASTQ files together. Which, incidentally, is essentially what you have to do to put them in the right order. So this is done in batches as large as will fit in memory, the batches are dumped into files, and those sorted files are finally interleaved into one complete file in the right order. Because of this you'll need as much memory as possible, and preferably an SSD, or it'll be ridiculously slow.

Most other operations (besides LongRanger) have small working sets and don't necessarily require huge amounts of memory, although Java or running them on multiple CPU threads can multiply the memory requirements. Sort is different, because it effectively needs access to the whole sequence, all the time, to find out where the reads belong in order, and to output them in that order.
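
In practice this is why samtools sort's memory and temp-directory knobs matter; a small sketch with assumed paths - threads times per-thread memory has to fit in your RAM:

samtools sort -@ 8 -m 2G -T /ssd/tmp/sorttmp -o sample.sorted.bam sample.bam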

Donwulff
05-16-2019, 07:17 PM
Well, the problem is that EvE (Premium) on Sequencing.com lets you convert each FASTQ file separately - it has no option for merging that I could find. I sent them a message a few days ago, without any response.

From what you just wrote, I get that Read 1 and Read 2 need to be combined, in the logical order of first R1 then R2.

In any case, I now have two separate BAM files on Sequencing.com, each about 52 GB in size.

Ahh, now I get what you mean. Separate BAMs won't be helpful, unfortunately. There's a thing called "interleaved FASTQ" which *might* work; I'm not sure if there's a proper way to make one, but quick Googling turned up this straightforward Linux shell command: https://gist.github.com/nathanhaigh/4544979 Honestly, most tools take the read pairs as separate files, so I'm not sure what the deal is with EvE - it may or may not support interleaved FASTQ either. You could go with unmapped BAM (uBAM) using GATK - https://gatkforums.broadinstitute.org/gatk/discussion/6484/how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam has some details on that (Java needed, and no, I don't think that will work on Windows either, because it'll try to open temporary files in a Linux-specific way).
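
The interleaving trick from that gist boils down to something like this - a sketch assuming gzipped inputs with R1/R2 in matching order:

# paste - - - - folds each 4-line FASTQ record onto one tab-separated row;
# pasting the two streams side by side and unfolding alternates R1/R2 records.
paste <(zcat sample_1.fq.gz | paste - - - -) <(zcat sample_2.fq.gz | paste - - - -) \
| tr '\t' '\n' > interleaved.fq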

oagl
05-16-2019, 07:21 PM
... only works in Linux (either that, or I forgot how to define the temporary file path so it works on Windows), so you can't do it "correctly". You could use Windows-compiled samtools, but that can cause all kinds of format problems downstream.

Nowadays you can easily install a Linux subsystem on Windows, which basically installs an Ubuntu on top. You then have a Unix shell with apt-get, and you can use e.g. miniconda to install samtools etc. It works pretty well. So there's no need to install a different operating system if you have Windows 10.
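
For example, once miniconda is set up inside WSL/Ubuntu, something like this pulls in the common tools (channel names follow the usual bioconda convention):

conda install -c conda-forge -c bioconda samtools bcftools bwa gatk4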

JamesKane
05-16-2019, 07:32 PM
Beware of the Windows Subsystem for Linux until the new version ships. There are a lot of complaints about its I/O performance, and the GATK Best Practices pipeline does tons of I/O. Until then you will be better off with VirtualBox or another VM.

tsunami
05-16-2019, 07:46 PM
Yeah I also have the two FASTQ files available to download:

[screenshot attachment]

I've downloaded and extracted the files - each one is 144 GB, ~300 GB in total:

[screenshot attachments]

Question is - how can I merge them?

Also, the YFull team is awesome - they compiled my FASTQ files into a BAM file themselves!!

Check out this update from them:

[screenshot attachment]


Because there are 2 FASTQ files, did you make one order, or did you need to make two?

I don't see a way to share two URLs inside one order.

Donwulff
05-16-2019, 08:09 PM
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4618857/ has an independent benchmark showing bwa-mem with the highest proportion of "correctly" mapped reads on human genomes, with Bowtie2 and NextGenMap tying for last place for current NGS. I also think neither Bowtie2 nor NextGenMap is alternative-contig aware, which could cause major issues with GRCh38. And yeah, GATK is definitely not tested against them, so I'm a bit dubious that alternative mapping tools will just work as well; the slight changes in formats usually drive me nuts.
On the flip side, minimap2 *might* actually be better, and presumably compatible, as it was created by the same author as bwa-mem, as its replacement, and has the benefit of also working well on Oxford Nanopore - but it also seems that bwa mem will stay and continue to be improved with some ideas from minimap2: https://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa
In the FDA sequencing data processing challenges, for example, all the top entries have been BWA-MEM & GATK Best Practices compliant: https://precision.fda.gov/challenges/truth (Sentieon has their proprietary, optimized BWA-MEM/GATK implementation).

oagl
05-16-2019, 08:37 PM
with Bowtie2 and NextGenMap tying for last place for current NGS.

If I interpret this figure (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4618857/figure/Fig2/) correctly, NGM is way better than Bowtie2 for 100bp PE human reads, and twice as fast as BWA mem. If you look here (http://www.robertlanfear.com/blog/files/short_read_mappers.html), running NGM with higher sensitivity, so that the runtime is comparable with BWA mem's, yields better results, at least for this data set. I don't know, however, whether the output is GATK-ready, but it does output SAM/BAM files.


I also think neither Bowtie2 nor NextGenMap is alternative-contig aware, which could cause major issues with GRCh38.

True, but as far as I know it is recommended anyway to choose a reference sequence without alt contigs. See here (https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use).

Petr
05-20-2019, 11:46 AM
What is the current recommended procedure to get the Y and mt BAM for submitting Dante WGS results to YFull and Y-DNA Warehouse?

http://www.it2kane.org/2018/05/variant-discovery-process-update/ scripts 1a and 2?

oagl
05-20-2019, 03:19 PM
What is the current recommended procedure to get the Y and mt BAM for submitting Dante WGS results to YFull and Y-DNA Warehouse?

If you have the BAM file you can just use samtools to extract only those reads that are mapped to chrY and chrM:

samtools view -h -b /path/to/bam/file.bam chrY > /path/to/new/bam/file.chrY.bam
samtools view -h -b /path/to/bam/file.bam chrM > /path/to/new/bam/file.chrM.bam

Or to get one BAM file with both chrY and chrM:

samtools view -h -b /path/to/bam/file.bam chrY chrM > /path/to/new/bam/file.chrY.chrM.bam

Petr
05-20-2019, 08:20 PM
Thank you, but I don't have the BAM file, just FASTQ files. And the Dante BAM files are hg19 anyway.

Donwulff
05-20-2019, 09:28 PM
If you have the BAM file you can just use samtools to extract only those reads that are mapped to chrY and chrM:

samtools view -h -b /path/to/bam/file.bam chrY > /path/to/new/bam/file.chrY.bam
samtools view -h -b /path/to/bam/file.bam chrM > /path/to/new/bam/file.chrM.bam

Or to get one BAM file with both chrY and chrM:

samtools view -h -b /path/to/bam/file.bam chrY chrM > /path/to/new/bam/file.chrY.chrM.bam

You'll need an index first, though I believe it'll tell you to run "samtools index /path/to/bam/file.bam" first. I believe this is the best & sufficient method; it's the same one FTDNA appears to use for the BigY BAM. Of course, getting GRCh38 output from Dante Labs data requires having it mapped to GRCh38 first... If you've re-mapped the reads to the new build, in particular, you can do a dumb trick like:

(samtools view -H GRCh38.bwamem.bam; samtools view GRCh38.bwamem.bam | grep -P "\tchr(Y|MT?)\t") | samtools sort -@$(nproc) -o chrY.bam

This saves having to sort and index the whole BAM if you're *only* interested in chrY/M. In addition, it retains the paired-end information of any read pair mapped to chrY. I don't think there's any real benefit to that, other than a cleaner (and larger) file. You can always cut it down to chrY/M only with samtools view after that; this is just an optimization of the sort for Y/M-only use.

My personal preference of course is https://github.com/Donwulff/bio-tools/blob/master/mapping/revert-bam.sh which produces a sorted & indexed file. It's optimized in many ways, although I believe the results should still be equivalent to GATK best-practices mapping - I haven't checked that yet, and it's probably slightly harder to follow because of the optimizations. I think maybe nobody knows the best parameters to use with bwa mem for the Y chromosome, though. I'm curious whether it matters for YFull.

I considered updating my script for FASTQ files; right now you'd have to use Kane's 1a or something similar first, but the thing is, FastqToSam just has you stick all your own parameters in, so scripting it barely helps. I can make it extract a lot of the data from the file & read names though, but yeah... not there yet ;) Also, if I were working on that script now I'd use MarkDuplicatesSpark, which works multi-processor even on a single computer, but it wasn't available when I wrote the script and I forgot to test... The reference genome is a free choice, but the one in bwakit is theoretically best. Note to self: using postalt from bwakit should also refine the mapping quality of the primary assembly slightly, so I might have to do that in a later version.

Edit edit edit: Some feedback on Kane's: I'm not sure BQSR makes sense for the Y chromosome. My sample pipeline does it because I'm interested in autosomal as well, however I've been noticing that I get a very different error profile for the Y chromosome compared to the autosomal chromosomes. In particular, Kane's script may work poorly for targeted Y sequencing (BigY, Y Elite), because it doesn't include Y-chromosomal variants and there may be no real autosomal variants. Also, the known-sites list is incomplete. However, for YFull I think we can just skip BQSR, because the BAM files delivered by sequencing companies appear to be from before BQSR is run. Importantly, Dante Labs / BGI uses different adapters from Illumina, so MarkIlluminaAdapters goes all screwy. I finally settled on fastp, which is, as the name implies, fast, and works fine for Dante Labs data. Not trimming adapters at all is probably better than using MarkIlluminaAdapters on DL data. Also, it does MarkDuplicates in coordinate-sorted order, which is an old practice that doesn't remove all duplicates. The difference should be marginal, though.

JamesKane
05-21-2019, 11:55 PM
FGC applies BQSR to their Y Elite BAMs as well. The script in my gist includes dbSNP142 in the known-sites list, which should really be updated to the April 2018 version that's actually in use these days. The dbSNP142 database includes 185,130 SNPs on chrY and the newer one 451,300. That's a tad more than not any. ;)

For the purposes of submitting to YFULL, you probably don't need to apply BQSR as that is preprocessing for GATK's callers. As far as I can tell YFULL has rolled their own.

You are correct that my alignment script should not be applied as-is to [B/M]GISEQ data. That workflow makes the assumption that you are dealing with Illumina data. For most I'd say use bwa mem (or possibly minimap2) directly. It deals with untrimmed data fine, and there is a contingent of informaticians who disagree with the Broad Institute on that piece of preprocessing.

From the documentation on the workflow, regarding marking the duplicates first: "Because MarkDuplicates sees query-grouped read alignment records from the output of [1.3], it will also mark as duplicate the unmapped mates and supplementary alignments within the duplicate set." There may be a small benefit to changing my script to prepare the sort that way, but with about 12 TB of BAMs already prepared with the old recommendation, I'll leave proving it to others.

This should be a good starting point: (WARNING NOT TESTED!!!)
bwa mem -M -t 4 -R '@RG\tID:foo\tSM:bar\tLB:library1\tPL:mgiseq\tPU:unit1' GRCh38.fasta sample_1.fastq sample_2.fastq | \
samtools sort -n -O bam > sample.name_sorted.bam

Generally you can increase the -t value by one for every core and 4 GB of RAM you have. Sometimes it works with less memory per thread, but in my experience it will crash randomly.

Change the meta-variable names in the read group string to something sensible of course.

The sort is specifying query-name order, and the output would be ready to go for MarkDuplicates. Once that's done, re-sort by coordinate order, index the resulting BAM, and filter with view as mentioned up above.
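
Spelled out, those follow-on steps might look roughly like this - GATK4-style invocation assumed, output names made up:

gatk MarkDuplicates -I sample.name_sorted.bam -O sample.md.bam -M sample.md_metrics.txt
samtools sort -@ 4 -o sample.sorted.bam sample.md.bam
samtools index sample.sorted.bam
samtools view -h -b sample.sorted.bam chrY chrM > sample.chrY.chrM.bam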

Donwulff
05-22-2019, 01:05 PM
I should clarify that there isn't "one correct" way to create an analysis-ready BAM file. Small changes in the process can cause significant changes in how hard-to-sequence variants are called. Broad Institute's GATK Best Practices exist in part to standardize these methods, so that variants called by different parties are comparable to each other. Nevertheless, even the Best Practices leave a lot of choices technically open, for example which dbSNP version should be used for base-quality and variant recalibration. Unfortunately, there generally isn't an absolute "truth set" to compare against in sequencing experiments, so the best options are always at least somewhat subjective. In the interest of submitting to YFull, though, I believe the process should be as close to the standardized GATK Best Practices as reasonable.

And there are also things like MarkIlluminaAdapters, which is indeed part of GATK Best Practices but, as the name implies, is only appropriate for Illumina sequencing. As JamesKane originally identified, Dante Labs doesn't use Illumina but BGI/MGI-seq instead, so MarkIlluminaAdapters doesn't remove any actual adapters; instead it removes some valid sequences which happen to match Illumina adapters. In my example pipeline I originally solved this by dropping adapter trimming entirely, but after trial & error, partly commented on in the "Dante Labs technical" thread, I settled on fastp, which is both fast and extremely accurate at removing adapter read-through in all types of paired-end short-read sequencing.

Regarding BQSR, there is a paper which found that BQSR and VQSR aren't necessarily beneficial in many types of processing: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5048557/ and, as said, YFull probably doesn't make use of BQSR. My concern that the sequencing error profile of the Y chromosome looks different from the autosomal chromosomes, even with the addition of YBrowse SNPs, is also significant. Anyway, the BQSR script on the web site doesn't include a SNP reference at all, only indels, though I see now that the Gist includes dbSNPs. The GATK pages are actually pretty terrible on this point, but if you DO run it, the Best Practices workflow at https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/PairedEndSingleSampleWf.hg38.inputs.json shows that the complete GRCh38 calibration set actually includes Mills_and_1000G_gold_standard.indels.hg38.vcf.gz AND Homo_sapiens_assembly38.known_indels.vcf.gz. https://software.broadinstitute.org/gatk/documentation/article.php?id=1247 says that you should use "the most recent dbSNP release (build ID > 132)", which technically changes depending on when you construct your pipeline. The most recent release is currently dbSNP 152, while their actual pipeline uses dbSNP 138. Go figure.

It's a good point about running out of memory on multi-core computers; my script attempts to adapt to the number of processor threads & amount of free memory automatically, but it will fail quite ungracefully if the system doesn't have the minimum amount of memory for the cores. It may be necessary to adjust those options manually anyway. There are numerous optimizations where it doesn't affect the results: for example, instead of MergeBamAlignment I simply lift the headers into their appropriate places during mapping, which should lead to equivalent results but skips that stage entirely. Samtools is used for coordinate-sorting and generating indexes because it's a lot faster and more efficient than the Java-based GATK, with nevertheless equivalent results. The sequence is processed one chromosome/contig at a time where possible, to take advantage of parallel processing with no overlap or aliasing effects. The result is a GATK Best Practices compliant (if you run the optional BQSR stage) analysis-ready BAM that's produced efficiently on a single compute node (pending my version with MarkDuplicatesSpark, of course ;).

JamesKane
05-22-2019, 11:02 PM
Make sure to check your FASTQs' sequence identifiers before assuming [B/M]GI-SEQ! The 150-base PE sample I just finished processing depth-of-coverage statistics for has an Illumina NovaSeq flowcell ID.
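
A quick way to check, assuming gzipped FASTQs - Illumina-style read names look like @instrument:run:flowcell:lane:..., while BGI/MGI read names look quite different:

zcat sample_1.fq.gz | head -n 1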

CATEGORY GENOME_TERRITORY MEAN_COVERAGE SD_COVERAGE MEDIAN_COVERAGE
WHOLE_GENOME 3043453562 19.236666 8.82621 20
NON_ZERO_REGIONS 2815239871 20.796061 7.19637 21