
View Full Version : Dante Labs Long Read Test



pmokeefe
04-08-2019, 06:39 AM
https://www.dantelabs.com/collections/our-tests/products/long-reads-whole-genome-sequencing

WHOLE GENOMEL - LONG READS WHOLE GENOME SEQUENCING
Whole GenomeL - based on Third-generation sequencing (also known as long-read sequencing). The first Long Reads Whole Genome Sequencing available commercially, worldwide. Leveraging Third-generation sequencing technology in the Dante Labs Oxford Nanopore-certified lab
Out of curiosity I ordered the test. I don't really know that much about the Oxford Nanopore technology. If anyone is up on that I would like to learn more.

MacUalraig
04-08-2019, 06:48 AM
https://www.dantelabs.com/collections/our-tests/products/long-reads-whole-genome-sequencing

WHOLE GENOMEL - LONG READS WHOLE GENOME SEQUENCING
Whole GenomeL - based on Third-generation sequencing (also known as long-read sequencing). The first Long Reads Whole Genome Sequencing available commercially, worldwide. Leveraging Third-generation sequencing technology in the Dante Labs Oxford Nanopore-certified lab
Out of curiosity I ordered the test. I don't really know that much about the Oxford Nanopore technology. If anyone is up on that I would like to learn more.

Wow. I've been trying to follow this technology. I think Thomas Krahn at YSEQ has been experimenting with a MinION himself and posted about it on Facebook, but recently said it wasn't quite ready for market yet. He prefers this approach to the Chromium barcode stuff.

Donwulff
04-08-2019, 07:00 AM
I'm very interested in that for research, like scaffolding the Y chromosome for structural variation, but that price is a bit tough on many people's budgets! I'm not sure if one should expect a significant discount on this. The cost of BGI-seq isn't clear; they might be getting a very low price on it themselves, although the strategy seems to always have been to sell at a loss to establish themselves on the market. The Oxford Nanopore long read sequencing may be pretty close to at-cost too, though I should check the latest estimates for flowcell yield & price.

Note though:
https://www.dantelabs.com/collections/our-tests

Long-read sequencing doesn't include their standard variant interpretation, which may be just as well, as the order page states: "Optimized for analysis of repeated sequences, copy number variations and structural variations". The best results have been achieved by using Oxford Nanopore long reads for chromosome structure and short reads for individual variants. Of course both technologies provide both types of information, so techniques for combining them are an active research topic.

Edit: Just checked, e-mail I received from Dante Labs announcing this says among other things:
"Combined with your existing whole genome sequencing, long reads will give you the best genomic map of your DNA.

A special offer for you

As a valued Dante Labs customer, we are glad to share with you a 20% discount for this new test."

They're sort-of hinting at combining the results themselves, but I see no mention or reference of them actually doing it yet. 20% for existing customers is interesting though. The problem with running large discounts like the Black Friday special is that you're always expecting a better offer ;)

MacUalraig
04-08-2019, 07:05 AM
It would be great to see a comparison between a PacBio test and a nanopore one if you are still doing the former. I agree the Dante price is too good to let pass. The PacBio price at FGC is a bit painful though.

MacUalraig
04-08-2019, 07:11 AM
A special offer for you

As a valued Dante Labs customer, we are glad to share with you a 20% discount for this new test."



Oops, sounds like I should have rechecked my emails before ordering!

Donwulff
04-08-2019, 07:18 AM
When I ordered my original Dante Labs test, they lowered the price by $50 right after. I immediately e-mailed them and they reimbursed me with a $50 Amazon gift card (which I think I never got around to using, ouch). If you forgot to use the discount code, it's worth e-mailing their customer support, though I can't blame them if they've decided to offer no reimbursements since then.
As an aside, I probably shouldn't mention this, but I originally created my Dante Labs account under a different e-mail address from the one I ordered the test with, and I'm getting the discounts etc. on both e-mail accounts even though I didn't order anything on the second one. I still don't think we're supposed to share the discount code, though; they might check if you've actually ordered before ;)

MacUalraig
04-08-2019, 07:24 AM
Yeah I did fire off a contact message about it, fingers crossed :-)

Donwulff
04-08-2019, 07:39 AM
This may not be state of the art, because I believe I've seen better results, but for a bit of what to expect:
https://www.nature.com/articles/s41467-019-09025-z

"The accurate identification of DNA sequence variants is an important, but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5–15%. Meeting this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieves 99.67, 95.78, 90.53% F1-score on 1KP common variants, and 98.65, 92.57, 87.26% F1-score for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows Clairvoyante is sample agnostic and finds variants in less than 2 h on a standard server. Furthermore, we present 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open-source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize and visualize the model."

As said, long read sequencing's real strength is in structural variation, and in genomic locations that are so repetitive/similar that it's impossible to place 150 (or 100) basepair reads into just one location. The Y chromosome should have a lot of these, so as I suggested on the BigY-700 thread, I think this will eventually be a boon for Y chromosome phylogeny. Short reads aligned against the long sequencing read could provide evidence that a variant is a real one, and of course Sanger sequencing (up to 1,000 basepairs for a single read) could be used to validate any phylogenetically important variants. Or, just assume it's good enough for drawing phylogeny trees ;)

Donwulff
04-08-2019, 07:40 AM
Double post (Why does that always happen...)

JamesKane
04-08-2019, 10:51 AM
There will be a second example of this test result under R1b-S1121 to compare with pmokeefe's. There's about 1600 years separating them with roughly accurate Eóganachta pedigrees going back that far, so it should be interesting from the Y DNA perspective.

Donwulff
04-08-2019, 11:32 AM
The order page says "LONG READ RAW DATA Download your FASTQ, BAM and VCF files, with no extra charge." I wonder if downloads will actually be available this time, or if it'll be a separate order again?

Also regarding raw data: "Meanwhile, it also has been found that electrical signals in Nanopore sequencing are sensitive to epigenetic changes in the nucleotides [12, 13, 14]. Several studies have demonstrated that Nanopore sequencing can be used to detect DNA methylation." DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning

I wonder if we have to ask Dante Labs whether they already do methylation calling, and if the raw electrical signal data (known as FAST5 files) is available. While likely not interesting from a genealogical perspective, being able to extract methylation data from the results would make this extremely valuable for medical curiosity.

MacUalraig
04-08-2019, 12:20 PM
The order page says "LONG READ RAW DATA Download your FASTQ, BAM and VCF files, with no extra charge." I wonder if downloads will actually be available this time, or if it'll be a separate order again?

Also regarding raw data: "Meanwhile, it also has been found that electrical signals in Nanopore sequencing are sensitive to epigenetic changes in the nucleotides [12, 13, 14]. Several studies have demonstrated that Nanopore sequencing can be used to detect DNA methylation." DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning

I wonder if we have to ask Dante Labs whether they already do methylation calling, and if the raw electrical signal data (known as FAST5 files) is available. While likely not interesting from a genealogical perspective, being able to extract methylation data from the results would make this extremely valuable for medical curiosity.

I've got an epigenome test pending elsewhere; I agree it's all fascinating.

Donwulff
04-08-2019, 01:27 PM
I've got an epigenome test pending elsewhere; I agree it's all fascinating.

I asked Dante Labs about the raw data online download, FAST5 signal files, methylation calling (on second thought I'm not sure how this would be affected by using a saliva sample, but it's worth a try), and combining with the previously done WGS to "polish" the assembly and SNP's in the final result files. Let's see if I get any useful answers.

Anyway, even if we get raw data on disk, everything except methylation calling should be doable without the FAST5 files.

That's perhaps a bit on the "research" side since previous hybrid assembly polishes have used Illumina short reads, but I don't see any reason BGIseq shouldn't work at least as well.

Jan_Noack
04-09-2019, 12:19 AM
My first sample to Full Genomes took 9 weeks, and by that time I'd sent another sample (I have no idea if the second sample has arrived yet, even though I sent it tracked about a month ago, which is supposed to take 5 to 10 days... I must check!). That sample at Full Genomes is still awaiting processing for a 20X (sent back mid-January). I think it would be very difficult, if not impossible, to send blood overseas from Australia. I can get blood collected easily though.

I have posted a 30X to Dante... it's still in the mail to get there (been about 6 weeks so far).
BUT it is going to the US, as I ordered from the US and they enclosed a fully paid return mailer (from the US, so I could have returned it for free postage if I had flown to the US with it... which I have known people to do).

Does it then get sent to Italy? I think the saliva sample may not be optimal for nanopore by the time it gets there, IF I were allowed to upgrade? Also I've read in the Full Genomes section that blood is better, and I can't see much point if one can't get an optimal sample to the lab.
It's been what I've been waiting for though.

My other concern is the size of the data. How much bigger than a standard 30X would it be? Any idea?

Donwulff
04-09-2019, 05:45 AM
If they're both 30X, then by definition the basecalls will be an equivalent size. https://www.dantelabs.com/collections/our-tests supports this, as both WGS and Long Read are "Average Data Size: +180 GB" (I have no idea what that means, though. The BAMs are like 100GB, but uncompressed the real size is much larger than 180GB. A 3-billion-base genome times 30X is 90 GB of "data", plus some metagenome).

Side note, I just noticed that the table shows "Read Length PE150" for the second generation sequencing options, so it seems they have indeed upgraded to 150 basepair reads.

Shipping blood internationally would, even in the best circumstances, be quite a mess (no pun intended). Plus it would need to be shipped chilled, with 6+ weeks shipping time? I think even on the FGC thread the benefit of using blood was inconclusive. At its current price Oxford Nanopore is not "bleeding edge"; FGC Long Read costs $2,900, at which point it's indeed worth putting several hundred down for the best sample possible.

IMO the main reason you would want to use blood would be less contamination. I don't know how Dante Labs is running the Oxford Nanopore sequencing; those sequencers should have the capability of being selective about which molecules they sequence, so if it's not human they could reject it. Then again, some people will be specifically interested in the metagenome in their saliva as well.

pinoqio
04-09-2019, 08:17 PM
For nanopore sequencing, the basecalling accuracy seems to be about 2 orders of magnitude worse than NGS.
According to this paper, quality scores for PromethION basecalls hover around 8-10 (80-90% confidence per basecall):
https://www.biorxiv.org/content/biorxiv/early/2018/10/03/434118.full.pdf Raw Fastq data: https://www.ebi.ac.uk/ena/data/view/PRJEB26791

Does anyone know what the aggregate confidence for the consensus basecall in the BAM is, after combining the ~30 reads?

Donwulff
04-09-2019, 08:31 PM
See https://anthrogenica.com/showthread.php?16842-Dante-Labs-Long-Read-Test&p=559602&viewfull=1#post559602

The problem is that nanopore sequencing has a high "systematic error" rate, which means all reads will have the same error. Indeed, the Dante Labs product page doesn't even talk about returning SNP's/SNV's or variant interpretation reports, just CNV and SV. However, the way this is "normally" done is that the assembly is polished using short read sequencing, with long reads used for structure. This could be potentially very bad PR for Dante Labs if people don't fully understand the product and think it's just "better sequence". But yeah, combining it with WGS would be the best of both worlds, and honestly I'm not sure anybody knows the real accuracy of the latest flowcell chemistry + analysis software; they've been improving by leaps and bounds.

Unfortunately I've not yet got any reply (or confirmation of receipt) to my questions to Dante Labs, although admittedly the questions were highly technical.

pmokeefe
04-10-2019, 03:23 AM
Here's a quick update on the status of my order. I submitted the order on Monday morning April 8th (Rome time). Tuesday morning, the next day, Dante Labs emailed me the tracking number for the shipment. This morning (Wednesday) the shipper's web site shows the package has been picked up and is in transit. So far so good, though obviously this is just the very beginning of a long process. Will update again when I actually receive the kit.

MacUalraig
04-10-2019, 08:23 AM
Here's a quick update on the status of my order. I submitted the order on Monday morning April 8th (Rome time). Tuesday morning, the next day, Dante Labs emailed me the tracking number for the shipment. This morning (Wednesday) the shipper's web site shows the package has been picked up and is in transit. So far so good, though obviously this is just the very beginning of a long process. Will update again when I actually receive the kit.

Mine seems to be meandering around Germany at the moment, where was yours sent out from?

pmokeefe
04-10-2019, 08:42 AM
Mine seems to be meandering around Germany at the moment, where was yours sent out from?
Likewise

Date Time DPD parcel centre Parcel status
10.04.2019 03:14 Aschaffenburg (D ... In transit. Preload.
09.04.2019 17:02 DPD data centre Order information has been transmitted to DPD
09.04.2019 15:40 Mittenwalde (DE) ... In transit.

pmokeefe
04-11-2019, 02:09 PM
I received my Long Read kit today, Thursday April 11; I ordered it on Monday, so just three days. The kit appears identical to the 30X short read test from Dante Labs. However, there was no return shipping label enclosed, so I just emailed them with a request for that.

Donwulff
04-11-2019, 02:47 PM
Looks like Dante Labs raised the price on the long read test, although things are a bit confusing; I think they lowered it first, so it looks like 849 => 799 EUR => 899 EUR to me. If memory serves, processing time went up from 4 to 8 weeks to 8 to 10 weeks. Both changes suggest they were surprised by demand... meanwhile, still no reply to my questions about the product.

Someone said on another thread they received a reply within minutes; has anybody else been able to get any response about the Oxford Nanopore Long Read sequencing? I was eager to be among the first to play with the results, but if they don't provide actual raw FAST5 data at the same price, I may have to wait until someone does, as well as for the processing time to fall back down. And of course, not responding to queries while raising the price is a good way to lose a customer...

MacUalraig
04-11-2019, 03:21 PM
I sent them that query about the discount I omitted to apply and have not received a reply :-(

MacUalraig
04-11-2019, 03:30 PM
I also spot another oddity, the text says 'N50>20,000 bp' but the graphic alongside has the caption 'N50 > 8000'?

Donwulff
04-11-2019, 03:50 PM
I also spot another oddity, the text says 'N50>20,000 bp' but the graphic alongside has the caption 'N50 > 8000'?

I'm not sure if that's actually how they mean it, because yeah, it would be confusing to everybody. But technically there's no conflict between those claims, because one says "average N50>20,000bp" and the other "N50>8000".
Of course, this could technically mean some customers get 8,000bp reads while others get 32,000bp reads, which is actually not impossible, but hopefully the quality is more even than that. (Either way, it will be possible to join most of the reads together.)

MacUalraig
04-15-2019, 12:00 PM
My Long Reads kit arrived, just as I was finishing an early lunch. Now have to wait before I spit.

Donwulff
04-17-2019, 01:05 AM
I should've probably looked at this before asking questions: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel_3_4.md (Not related to Dante Labs, just a sample dataset)
On a quick count, it looks like the 30X signal-level FAST5 data for Oxford Nanopore weighs in at about 30 terabytes. There's certainly a case for "sneakernet", but at that size it would take a large shipment of hard drives!
I do not know if it could be reduced; that page says there's some redundancy, and there have been developments on minimizing the files.
That's a bit of a bummer though, because if true it practically prevents re-analysis of the signal files at home. The FASTQ & BAM files certainly will be useful. No word on whether Dante Labs can do the methylation calling themselves. This also comes with the caveats that the sample preparation must be PCR-free & the methylation profile will be specific to saliva, with whatever degradation has occurred before the analysis. So yeah, in summary, I don't know, but I wish Dante Labs would tell ;)

Dante Labs did respond to someone asking on Facebook about the short + long read combination, and their answer was that they don't do it now but will look into it. This can be done by a third party with the FASTQ/BAM files alone, for example with https://github.com/nanoporetech/ont-assembly-polish

It also sucks that they raised the price by 100 EUR; I want to wait for a better deal now, though to be honest that looks like a very low price, and they could run into legal trouble with their pharmacogenetics report, so it's a tough call.

MacUalraig
04-17-2019, 06:17 AM
They finally gave me some money back for the discount code I'd missed:

You have received a refund

Total amount refunded: €169.80 EUR

:-)

Francisco
04-17-2019, 07:55 AM
Bad news. At first the marketing representative said they would combine both tests (30x short + long reads) to make a complete map without the usual error problems of Nanopore (which has NON-random error, unlike PacBio, which has more errors but random ones). But after the "technical" consult, they said that they don't combine tests and don't know when they will do it (if they do it).
Methylation is another question, but we all know the answer: "not yet, perhaps in the future, no ETA available". Bad news.

What would be the real advantage of this test vs a 30x at FullGenomes/DanteLabs/YSEQ? It appears to be worse than the 10x Chromium at Full Genomes, but cheaper (less than half).
Any info about what this Nanopore test would give us? thx

JamesKane
04-17-2019, 11:29 AM
Any info about what this Nanopore test would give us? thx

Short answer: We don't know until we have results and something to compare with.

Longer answer:
On its own, I'm skeptical of the value of this test for genetic genealogy. The CNV and SV detection may be useful for medical applications.

I ordered one of these for the express purpose of hybridizing with a traditional short read 30x WGS test to attempt to get better resolution on the Y chromosome. Mostly the interest is in seeing how many of the STRs in Williams et al (2016), Chromosome-Wide Characterization of Y-STR Mutation Rates, can be resolved but there are also some SNP markers in Poznik's call mask that have failed Sanger validation I'd like to learn more about.

If I didn't have the 30x WGS test already, I'd be looking at the 10x Chromium instead.

Donwulff
04-17-2019, 04:51 PM
We have a fairly good idea of what Oxford Nanopore Technology is capable of: https://anthrogenica.com/showthread.php?16842-Dante-Labs-Long-Read-Test&p=559602&viewfull=1#post559602 Although Dante Labs Italy is Oxford Nanopore certified https://nanoporetech.com/services/providers#tabs-0=Dante-Labs until the first several people have received their results we don't know for certain about their overall quality & processing, and what customers are actually getting. Add to this that some select labs have received ONT R10 chemistry nanopore flowcells https://nanoporetech.com/about-us/news/new-r10-nanopore-released-early-access (supposed to be Q40 at launch), which as of yet haven't even appeared in studies, but might make sense for Dante Labs to be using due to the yield & their price.

Donwulff
04-17-2019, 05:22 PM
I want to stress this though: this is from the Dante Labs product page of the "Whole GenomeL" Oxford Nanopore sequencing. Note that nowhere do they say that you will receive SNP calls, and the product comparison page https://www.dantelabs.com/collections/our-tests explicitly shows you will NOT receive the SNP interpretations. This makes sense, since as you can see in the above reference, ONT is currently only 80-90% accurate for SNP's.
[attachment 29850]
"- Optimized for analysis of repeated sequences, copy number variations and structural variations
- Third generation sequencing"

Whether they'll eventually deliver SNP calls or not, there are going to be a whole lot of unhappy customers who didn't read the description carefully and expect to get "third generation" improved results on all measures for their more expensive test, even though DL never promises that. The standard WGS should rightfully be considered & marketed as the SNP/SNV test, and the ONT GenomeL as the chromosome structure test.

Of course, you get a bit of both from each one, and the best of both worlds if you combine them. Unfortunately, as reported by me and Francisco above, on FB DL said that (despite giving that impression in their e-mail to existing customers) they do not automatically combine the test results. They did promise to look at it, but I wouldn't hold my breath. But again, if you get BAM/FASTQ from both, it will be possible for a third party to do the data combination using the latest technology & data (and possibly other tests as well). This approach normally allows sequencing accuracy that exceeds what is possible with short read sequencing alone, and even sequencing of regions that do not exist in the human reference genome(s): https://www.nature.com/articles/nbt.4060 (Note they used a special ultra-long read protocol in this; I can never find the references I want to ;)

Donwulff
04-17-2019, 05:44 PM
I'll also add that measures of sequencing accuracy are notoriously tricky. First of all you have the issue of what the correct result is; no absolute "truth set" exists. As shown in the Illumina/PacBio/ONT paper, the existing NA12878 "Genome In a Bottle" call-set is considered the gold standard, but the paper found an apparently large number of extra variants and improved on some previously known ones.

Second is exactly what regions you're measuring accuracy over. There are regions in the genome that are pretty much intractable by any existing technology; if you were to somehow include those regions, of course the accuracy would be terrible. If you limit yourself to, say, only exons (the protein-coding sections that comprise about 1% of the whole genome), most technologies will have phenomenal accuracy.

Finally, and perhaps most significantly: if you have an error rate of 1 in a million, over the roughly 6 billion basepairs of the diploid human genome that comes out to 6,000 errors over the whole genome. However, if you do 30X reads and the errors are completely random, you'd likely get perfect results. If the errors always occur in the same place, you'd still have 6,000 errors - most likely false positives. However, if you have 1 in a million called *variants* being wrong, then you're pretty close to a perfect sequence, with errors being mostly false negatives. Of course, real error rates are much higher.

So 99% can be pretty good, or really terrible, depending on context, and the only way you'll *really* know is if multiple technologies are compared side by side with the same comparison method, like in the paper in question.

MacUalraig
04-17-2019, 08:05 PM
I want to stress this though: this is from the Dante Labs product page of the "Whole GenomeL" Oxford Nanopore sequencing. Note that nowhere do they say that you will receive SNP calls, and the product comparison page https://www.dantelabs.com/collections/our-tests explicitly shows you will NOT receive the SNP interpretations. This makes sense, since as you can see in the above reference, ONT is currently only 80-90% accurate for SNP's.
[attachment 29850]
"- Optimized for analysis of repeated sequences, copy number variations and structural variations
- Third generation sequencing"

Whether they'll eventually deliver SNP calls or not, there are going to be a whole lot of unhappy customers who didn't read the description carefully and expect to get "third generation" improved results on all measures for their more expensive test, even though DL never promises that. The standard WGS should rightfully be considered & marketed as the SNP/SNV test, and the ONT GenomeL as the chromosome structure test.



I think maybe you went a bit far but these days I think we should avoid over-analyzing the small print from Dante ;-)

It's probably fair to agree that as they haven't really produced much yet, they may well be undecided about what to hand out; they won't want to spoil their reputation with unreliable SNP reports (or ones which clash with other data sources) if they turn out that way.

But not to worry since

"Download your FASTQ, BAM and VCF files, with no extra charge. " :biggrin1::biggrin1::biggrin1:

Donwulff
04-17-2019, 09:02 PM
I think maybe you went a bit far but these days I think we should avoid over-analyzing the small print from Dante ;-)

Umm, wat? It's no small fine print, it's a huge table with cat-sized letters which says exactly that: no interpretations of SNP's will be provided:
[attachment 29857]
(And how do I make the image larger anyway? These postage-stamp-sized inserts are annoying.)

And that's just as well, since we KNOW that the genotyping quality of long-read sequencing alone is way below short-read sequencing. They MAY yet provide VCF files with SNP's anyway, particularly if customers absolutely demand them, but that doesn't change the fact that the technology itself is such that genotypes from long reads alone are worse. Long-read sequencing (at least as the technology stands now!) isn't "better sequencing", it's just a different kind of sequencing (different type of results etc.) that's complementary to short-read sequencing.

Even some people posting in this thread seem like they may be expecting it to be "better" sequencing (due to the higher price & IMO poor communication from Dante Labs), and nobody should really be giving people the impression that this is unequivocally better sequencing on its own, which was the point of my earlier posts. Only when properly combined with short read sequencing (which Dante Labs says they do not currently do, but which can be done with the raw data) can it beat either technology alone.

Donwulff
04-17-2019, 10:47 PM
For the deep phylogeny/genealogical use-case, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6315018/ "Selective single molecule sequencing and assembly of a human Y chromosome of African origin" is a very interesting recent paper (well, it is to me anyway). They used 25X Oxford Nanopore sequence and 30X Illumina (2x300PE though) to create a new Y-chromosome reference for the A0 haplogroup (NOTE! These are reads over the haploid chromosome and would thus be equivalent to around 50X ONT and 60X Illumina sequences over the whole genome). Of course, those were flow-sorted blood samples, which means they don't have to deal with leftover autosomal DNA and metagenome from saliva, so doing that with Dante Labs data won't be as clean, though the exact steps are in the paper. Conversely, if you're working on a European Y chromosome, the existing human reference/BigY data can serve to anchor some of your Y-contigs.

From abstract: "Due to their inherent assembly difficulties, high repeat content, and large ampliconic regions, only a handful of species have their Y chromosome properly characterized. To date, just a single human reference quality Y chromosome, of European ancestry, is available due to a lack of accessible methodology. To facilitate the assembly of such complicated genomic territory, we developed a novel strategy to sequence native, unamplified flow sorted DNA on a MinION nanopore sequencing device. Our approach yields a highly continuous assembly of the first human Y chromosome of African origin. It constitutes a significant improvement over comparable previous methods, increasing continuity by more than 800%. Sequencing native DNA also allows to take advantage of the nanopore signal data to detect epigenetic modifications in situ."

Details: "We used the Nanopore data to construct a de novo assembly using Canu. We performed a self-correction by aligning the reads used for assembly and called consensus using Nanopolish, correcting a total of 127,809 positions. Finally, the Illumina library served to polish residual errors within the assembly using pilon. By this means, we corrected a further 101,723 single-nucleotide positions and introduced 105,640 small insertions and 6983 small deletions. We also explored further polishing options and found that running one additional round of error correction with racon potentially resolves several remaining errors, despite also introducing additional discordances. The final assembly is comprised of 35 contigs, with an N50 of 1.46 Mb amounting to 21.5 Mb of total sequence, in contrast to a contig N50 of 6.91 Mb of the GRCh38 Y-chromosome assembly. Compared with the gorilla Y-chromosome assembly with a contig N50 of 17.95 kb, our assembly is two orders of magnitude more contiguous."

And finally: "We produced a stringent call set of structural variants (SVs) derived from alignments to GRCh38 using Assemblytics. We detect 347 SVs at least 50 bp in size (931 variants at least 10 bp in size) of which 82 are at least 500-bp long. The cumulative length of these variants sums to 184 kb."

Producing haplogroup-specific Y-references would be of use because they would allow even more fine-grained phylogeny tree construction for variants which occur on those structural variants, or in their vicinity, which is why I've already been working on de novo assembly with my BigY + Dante Labs WGS + 1000 Genomes samples of the same haplogroup. ONT long reads would allow resolving long-range differences & provide more insight on which contigs are actually on the Y chromosome. Although I need to pore (pun intended) over that paper to see how much benefit ONT actually brings in this case, because they seem to be saying the structural variants were already known.

Donwulff
04-18-2019, 02:05 AM
In a similar vein, on the autosomal side we have https://www.nature.com/articles/s41588-018-0273-y "Assembly of a pan-genome from deep sequencing of 910 humans of African descent", which caused quite a stir on the genetic genealogy scene with its conclusion that "Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome."

Although I have serious, serious reservations about the main claim of that paper, it still serves as a good reminder that even the GRCh38 reference genome is far from representing the full genetic variation of humans. Notably, THAT paper didn't use long-read sequencing, just deep short-read population sequencing, where long-read sequencing could have helped resolve many of the questions left open by the paper.

Of course, we aren't really able to use raw sequencing data for autosomal matching yet, let alone graph based reference genomes, but bringing this kind of power to the hands of consumers can only hasten research and development involving genomic structural variation, and enable "citizen science" on this important technology frontier.

https://www.sciencedirect.com/science/article/pii/S0092867419302156 has a good overview, including a table of the relative strengths of DNA microarrays, WGS and long-read sequencing near the top, if you want just that and not the gory details ;) Although many points on the table are kinda debatable; for example, DNA microarrays are extremely accurate for the SNP's usually selected on them, whereas short-read sequencing is error-prone and long-read sequencing extremely error-prone but can test more variants. Extremely large copy number variants are generally only visible on chromosome microarrays/raw intensity data and not in the SNP lists genetic genealogy companies provide, etc.

Donwulff
04-19-2019, 01:28 PM
Oxford Nanopore technology is improving so fast that by the time a research study comes out, it's usually already outdated. That, plus the fact that we don't know Dante Labs' process/method and that ONT gives experimental new kit to some of their customers, means we don't know *exactly* what the Long Reads product will be like until the first results are studied. This graph from "From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy" https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1462-9 illustrates perfectly both the speed of advancement and that the accuracy for a *single* Nanopore read is somewhere near 90%.
[attachment 29899]
Estimates of the systematic error rate - the amount of error that can't be corrected by consensus from multiple reads over the genome - are somewhat harder to come by. In one of the latest papers, "Comparative assessment of long-read error-correction software applied to RNA-sequencing data" https://www.biorxiv.org/content/10.1101/476622v2, different error-correction methods are compared on *RNA* sequencing. In this paper, using Oxford Nanopore data alone, LoRMA wins with a 2.91% per-base error rate. Of this, 2.51 percentage points are erroneous deletions and 0.03% erroneous insertions - usually, the Oxford Nanopore sequencer mis-counts the length of a run of identical nucleotides, a homopolymer. This leaves 0.37% per-base error for mismatches, commonly called SNP's/SNV's. Now, bear in mind this is still almost 4 in 1,000, a number that's ridiculously high if you're looking for a pathogenic SNP that has a true occurrence of 1 in a million, for example. You would have 4,000 false positives for each true one (assuming, of course, perfectly balanced error; they might also occur at predictable spots, perhaps methylated sites etc.). It's also a far cry from 20% error.



As said though, nobody knows what kind of magic Oxford Nanopore Technology & Dante Labs have under the hood currently. I'm just addressing the uncorrectable error rate of the current (now old) Oxford Nanopore Technology. Also, some of the error correction methods are currently computationally infeasible, although if it's *possible* from the data, then there's no reason to think that more efficient algorithms and cheaper computation power won't eventually make it a realistic option.

Nb. with 30X read depth, it's in the nature of random distributions that some locations will still get just one read, and thus a 10% error rate from random error alone. With two reads you actually don't know which read is in error, so almost 20% will be suspect, but with three reads you have a 0.1% chance of random error in all three, 0.01% for four, and so forth. (Using the round 10% value; nobody knows what the exact error rate will be.) With an average of 15 reads on the maternal or paternal copy of a chromosome, the chance of random error is almost non-existent, but systematic error (which occurs in most of the reads, e.g. mis-counting the length of a long stretch of identical nucleotides or a methylated CpG site) remains.
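
To put rough numbers on that reasoning, here's a minimal back-of-the-envelope sketch in Python, assuming a flat 10% per-read error rate and fully independent errors (both assumptions are illustrative, not measured Dante Labs figures):

[CODE]
from math import comb

def p_all_reads_wrong(depth, per_read_error=0.10):
    """Probability that every read covering a site is wrong,
    assuming independent (non-systematic) errors."""
    return per_read_error ** depth

def p_majority_wrong(depth, per_read_error=0.10):
    """Probability that more than half of the covering reads are wrong -
    a pessimistic proxy for a mis-called consensus, since the wrong
    reads would additionally have to agree on the same wrong base."""
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error)**(depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )

for depth in (1, 2, 3, 4, 15):
    print(f"{depth:2d} reads: all wrong {p_all_reads_wrong(depth):.2e}, "
          f"majority wrong {p_majority_wrong(depth):.2e}")
[/CODE]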

pinoqio
05-11-2019, 12:28 AM
My kit has been stuck on "Waiting for confirmation" for 3 weeks now, after they received the sample. Seems strange that they wouldn't even update to "Kit received".
Did anyone else's sample have more progress? I'm starting to think that their own Italy lab is not quite ready yet...

Petr
05-11-2019, 07:14 AM
My kit was delivered to Italy on April 26th and the status is "Waiting confirmation from Dante Labs" too.

My previous kits, delivered to Dante on March 8th, were marked as "Kit received" on April 2nd. In the past, it took just a few days to see "Kit Delivered" and then 2-3 weeks to see "Successful DNA extraction - Level A".

The situation looks much worse now.

MacUalraig
05-11-2019, 11:48 AM
My kit was delivered to Italy on April 26th and the status is "Waiting confirmation from Dante Labs" too.

My previous kits, delivered to Dante on March 8th, were marked as "Kit received" on April 2nd. In the past, it took just a few days to see "Kit Delivered" and then 2-3 weeks to see "Successful DNA extraction - Level A".

The situation looks much worse now.

Well, 'Mark' on the help desk seems to have been replaced with 'Shanece'. I emailed them once already about the same issue; first of all she got it jumbled up with my earlier WGS kit even though I was very clear, then when she realised which one I meant she simply informed me that its status said 'Waiting for confirmation' and that when that status changed they would email me.

I've just had another go!

JamesKane
05-11-2019, 12:32 PM
My Long Read kit was marked received on Monday. Three weeks after I dropped the sample in the mail to return to the Utah collection site.

pinoqio
05-14-2019, 02:31 PM
And today my long read kit jumped from "Waiting for confirmation" to "Success DNA - A" :thumb:
Looks like I was wrong and the problem is more with their IT system, considering they are also not sending out any more email notifications...

MacUalraig
05-14-2019, 06:47 PM
Astonishing!

In fact it's doubly astonishing, as so has mine, but I've been bombarding their help desk the last couple of days about updating my status... ;)

MacUalraig
05-17-2019, 06:12 PM
Newsflash - a 93Gb FASTQ file link has appeared attached to my nanopore kit!

oagl
05-17-2019, 07:48 PM
Newsflash - a 93Gb FASTQ file link has appeared attached to my nanopore kit!

That would have been fast results then :eek: Can you check whether the reads in the FASTQ file are really long reads?

pinoqio
05-17-2019, 08:47 PM
Indeed, I take everything back about their lab not being ready ;) (Edit: they were really fast and sequenced just 10 days after the sample arrived; the FASTQ shows the exact time/date)
My own long reads fastq.gz is 138GB and will be downloading for a while...
Some preliminary stats gathered from the first 6% (8GB compressed, 16GB unzipped) of the file:
Longest read: 192013 bp
Median read: 2616 bp
Average read: 5765 bp
N50: >13923 bp (50% of all data is contained in reads longer than ~14k bp)
Read quality: median Phred score ~14
Predicted coverage: ~45x (over 3.2 Gbp)

I have no idea how homogeneous the data will be, so these numbers could change completely once I have downloaded the remaining 94%.
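
For anyone who wants to reproduce these numbers on their own file, here's a minimal sketch in Python that streams the gzipped FASTQ; the file name is hypothetical, and the per-read quality here is a simple mean of Phred scores, which is a simplification (dedicated tools like NanoPlot, I believe, average error probabilities instead):

[CODE]
import gzip
import statistics

lengths = []
mean_quals = []

# Stream the gzipped FASTQ: records are 4 lines (header, sequence, '+', quality).
with gzip.open("long_reads.fastq.gz", "rt") as fq:   # hypothetical file name
    for i, line in enumerate(fq):
        if i % 4 == 1:                                # sequence line
            lengths.append(len(line.rstrip()))
        elif i % 4 == 3:                              # quality line, Phred+33 ASCII
            q = line.rstrip()
            mean_quals.append(sum(ord(c) - 33 for c in q) / len(q))

total = sum(lengths)
print("Reads:       ", len(lengths))
print("Longest read:", max(lengths), "bp")
print("Median read: ", statistics.median(lengths), "bp")
print("Average read:", total // len(lengths), "bp")

# N50: the read length at which reads this long or longer
# contain at least half of all sequenced bases.
running = 0
for length in sorted(lengths, reverse=True):
    running += length
    if running >= total / 2:
        print("N50:         ", length, "bp")
        break

print("Median of per-read mean Phred:", round(statistics.median(mean_quals), 1))
print("Predicted coverage: ~%.0fx" % (total / 3.2e9))
[/CODE]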

And of course, looking at their website, it's a bit difficult to tell how much they promised in terms of N50 length ;)
[attachment 30480]

MacUalraig
05-17-2019, 09:50 PM
Mine took about 3h to download on my recently upgraded broadband, but I've got to transfer it and then run some analysis, so I'll probably leave that running overnight now. Please keep the reports coming!

Donwulff
05-18-2019, 05:42 AM
And of course, looking at their website, it's a bit difficult to tell how much they promised in terms of N50 length ;)
[attachment 30480]

As discussed previously, it's not certain that's what they actually mean, but what it says is clear enough: the N50 of any one sample is over 8,000 basepairs, with an average N50 over 20,000 basepairs across samples. Of course, that's indeed not a promise. Although, since the molecule has to pass through the pore before it can be output, I suspect the beginning of the FASTQs would have the shortest reads, and with the pores getting stuck towards the end, read lengths would slowly fall again at the end. (That is assuming they've not been re-ordered, of course.)

MacUalraig
05-18-2019, 11:35 AM
Indeed, I take everything back about their lab not being ready ;) (Edit: they were really fast and sequenced just 10 days after the sample arrived; the FASTQ shows the exact time/date)
My own long reads fastq.gz is 138GB and will be downloading for a while...
Some preliminary stats gathered from the first 6% (8GB compressed, 16GB unzipped) of the file:
Longest read: 192013 bp
Median read: 2616 bp
Average read: 5765 bp
N50: >13923 bp (50% of all data is contained in reads longer than ~14k bp)
Read quality: median Phred score ~14
Predicted coverage: ~45x (over 3.2 Gbp)

I have no idea how homogeneous the data will be, so these numbers could change completely once I have downloaded the remaining 94%.

And of course, looking at their website, it's a bit difficult to tell how much they promised in terms of N50 length ;)
[attachment 30480]

Longest read 275622
N50 21174bp

I have a FastQC run going too; will post that when it finishes. Then I'll have a look at some aligning. Might try this one:

https://github.com/lh3/minimap2

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome;

MacUalraig
05-18-2019, 12:43 PM
FastQC output
summary

Filename redacted.fastq
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 11487338
Sequences flagged as poor quality 0
Sequence length 28-275622
%GC 40

Can't seem to attach the whole report, even zipped up; too big. Here is the Per base sequence content image though.

[attachment 30490]

MacUalraig
05-18-2019, 12:48 PM
Length distribution page

[attachment 30491]

pinoqio
05-18-2019, 01:14 PM
What tool did you use to calculate N50?

For my sample they had to use 2 flow cells. After the first one was dead after 2.5 days, they only had about 68Gbp (~22x), and had to use another one, which produced another 77Gbp.
No idea what went wrong; according to Oxford Nanopore, they should produce 2x-3x that much on average. Can you check if your reads are all from one flowcell? All the info is in the read identifier.
If they routinely have to use 2 flow cells, they will have to double the price...

I created some nice graphs; this one shows the volume vs read length:
[attachment 30494]
So even though there are many reads shorter than 500bp, they don't add up to more than 1Gbp in total. The orange line is N50.

The read identifier also provides timestamps and the channel id, so I was able to graph how long the nanopores survive:
[attachment 30493]
Quality decreases, but there seems to be a fitness-selection effect, where the longest-lasting pores are also the ones more likely to produce longer reads.
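
If you want to check your own file for multiple flowcells, here's a rough sketch; the header fields vary by basecaller version, so treat the field names (runid, ch) and the file name as assumptions to verify against your own headers:

[CODE]
import gzip
from collections import Counter

# Basecalled ONT FASTQ headers carry key=value fields; the exact set varies
# by basecaller version, but typically looks something like:
# @<read_id> runid=... read=... ch=... start_time=2019-05-01T12:34:56Z
runs = Counter()
channels = Counter()

with gzip.open("long_reads.fastq.gz", "rt") as fq:   # hypothetical file name
    for i, line in enumerate(fq):
        if i % 4 == 0:                                # header line
            fields = dict(
                kv.split("=", 1) for kv in line.rstrip().split()[1:] if "=" in kv
            )
            runs[fields.get("runid", "?")] += 1
            channels[fields.get("ch", "?")] += 1

print("Distinct run ids (roughly one per flowcell load):", len(runs))
for run_id, n in runs.most_common():
    print(" ", run_id[:16], n, "reads")
print("Distinct channels seen:", len(channels))
[/CODE]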

MacUalraig
05-18-2019, 01:23 PM
I did the N50 semi-manually in a spreadsheet (hence having to correct it) as per

https://www.biostars.org/p/134275/

I only got one FASTQ; it's 92GB .gz and 203GB extracted. I can see a flowcell id in the reads, and if I compare the start and end of the file they appear to be different - 50882 at the top and 51468 at the end.

pinoqio
05-18-2019, 01:41 PM
Yes, this is how I calculated it as well.

Those flow cells are some expensive consumables https://store.nanoporetech.com/flowcells.html
Either Dante Labs has negotiated some amazing discounts, or they are losing money hand over fist with this.
In theory, you can wash and reuse them, but at least in my case, they used them up completely.

Or they are barcoding the DNA, sequencing multiple customers together in a single flow cell, and then filtering the reads.
That way, maybe those 68Gbp I'm seeing could be half or a third of the total output.
With that, they could really maximize use of the cells, but that chemistry is not exactly cheap either.

Donwulff
05-18-2019, 04:32 PM
You can't wash and reuse Oxford Nanopore flowcells that are used up; the wash kits are for switching samples while the flowcell still has useful life left. They do have some techniques which could allow "unclogging" stuck pores for the same sample, but afaik they're not in use yet. ONT originally calculated the $1000 ONT whole genome on the basis of two flowcells. The flowcell yields are improving, so it would indeed be theoretically possible to get 30X on one, and more in future. https://nanoporetech.com/products/comparison says the customer best in April was 160Gb on a single flowcell, but most will get less, for now. If a significant number of them hit 100Gb it's kinda fine, but DL's marketing plan seems to also have been to take a loss while building up volume for economy of scale. (Where's the strikethrough format here? I just looked at the graphs, which do indeed strongly suggest they used two full flowcells, at least on that sample.)

Donwulff
05-18-2019, 05:43 PM
I'm very curious to see what people will come up with in terms of processing the ONT data, although hopefully Dante Labs follows through with their own analysis promptly. Minimap2 seems like a sane choice; I especially like that it can be used for short read sequencing as well, so there won't be technical compatibility issues between the two BAMs, for example. Of course, mapping is just the beginning. https://github.com/nanoporetech has all sorts of code from Oxford Nanopore Technologies themselves, but a lot of it is research-stuff which requires hundreds of thousands of hours of runtime.

MacUalraig
05-19-2019, 09:50 AM
OK, I've managed to align my FASTQ and view it in IGV. The minimap2 run took 6.5h, 13GB RAM and 3 cores, using all defaults. The final sorted file is 115GB. The output in IGV looks messier than either the YSEQ or 10X panes above - I had earlier done some sample BLAT searches on the reads, and typically they were getting match scores of 95-97% (one was only 92%). On the other hand, the matches were a lot less ambiguous; for comparison, one short read I looked at got 100% matches on about 4 different chromosomes.

I will be doing some manual inspections using VCFs from other tests, but will try to do some kind of variant calling directly in due course.

minimap2 comes with pre-compiled Linux binaries, by the way, so it's quite easy to set up and kick off.
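
For anyone wanting to replicate the run, here's roughly that pipeline scripted in Python; the reference/file paths and thread count are placeholders, and minimap2's map-ont preset piped into samtools sort/index is the standard pairing for nanopore reads:

[CODE]
import subprocess

ref = "GRCh38.fa"               # hypothetical reference path
reads = "long_reads.fastq.gz"   # hypothetical reads path
out = "dante_ont.sorted.bam"

# Map with the nanopore preset and pipe straight into a coordinate sort.
minimap = subprocess.Popen(
    ["minimap2", "-ax", "map-ont", "-t", "3", ref, reads],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "sort", "-@", "3", "-o", out, "-"],
    stdin=minimap.stdout, check=True,
)
minimap.stdout.close()
if minimap.wait() != 0:
    raise RuntimeError("minimap2 failed")

# Index the result so IGV (or pysam) can use it.
subprocess.run(["samtools", "index", out], check=True)
[/CODE]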

MacUalraig
05-19-2019, 10:23 AM
This is an IGV screenshot of a deletion on the X which was reported earlier by the 10X LongRanger pipeline. With the long reads, IGV labels the length of the deletion, although the linked reads in the pane above also spanned it. Either side of that, not quite so pretty!

In order from the top the files are YSEQ WGS, FGC-10X and Dante-ONT

[attachment 30503]

Donwulff
05-19-2019, 10:26 AM
One possible clarification, by the way: in this context "Gb" is usually Giga base pairs, referring to the complementary double-stranded pairs of DNA; Mb = Mega base pairs. The human genome is about 3 billion base pairs per parent, or 3,234.83 Mb (Mega base pairs) per haploid genome, 6,469.66 Mb total (diploid), technically. So 30X reads over the 3,234.83 Mb length is 97 Giga base pairs. When this is sequenced and written into a FASTQ, each read acquires a header and each base(pair) acquires a read quality value, more than doubling the storage size. Compression, depending on the compression parameters and specifics of the sequencing (how many base quality levels in particular, but also how repetitive the sequences are etc.), may about halve it again.

It's been common practice to express computer quantities in powers of 2 because computers use the binary system. Thus, for example, the number of different values expressible in a 10-bit quantity is 1024, and individual blocks of data and quantities are typically in multiples of two. These are commonly shortened as "GB"; technically "Gb" would be "Gigabit" (which is 1/8th of a gigabyte), often used for transfer and compression sizes, but you can see where this starts getting confusing... Storage devices, however, commonly express their capacity in base-10, i.e. 1000s, because that yields slightly larger numbers. To ease this confusion, the term "Gibibytes", shortened as GiB, was coined for base-2 quantities.

Therefore, a 128 Gigabyte (GB) SSD can often hold 119 Gibibytes (GiB) of data, which is less than 64 Giga base pairs (Gb) of uncompressed FASTQ, which would compress to 64/2*8 = 256 Gibibits (Gibit). Conversely, a 30X human genome is 97 Giga base pairs (Gb), which is > 194 Gigabytes (GB) of FASTQ that compresses to around 97 GB (convenient, that), which is about 90 GiB if displayed in base-2 units.

In the above posts it's sometimes hard to tell which is being talked about. One recommendation is to forget about bits, use GiB for space, and write Giga base pairs out. Almost nobody follows this convention, though ;)
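
A quick sanity check of those numbers in Python; the doubling and halving factors are the rough rules of thumb from above, not exact figures:

[CODE]
GB = 1000**3    # gigabyte (base-10)
GIB = 1024**3   # gibibyte (base-2)

haploid_bp = 3_234_830_000          # ~3,234.83 Mb haploid genome
coverage = 30
bases = haploid_bp * coverage       # ~97 Giga base pairs of base calls

fastq_bytes = bases * 2             # headers + quality roughly double it
fastq_gz_bytes = fastq_bytes / 2    # gzip roughly halves it again

print(f"{bases / 1e9:.0f} Gbp of base calls")
print(f"~{fastq_bytes / GB:.0f} GB uncompressed FASTQ ({fastq_bytes / GIB:.0f} GiB)")
print(f"~{fastq_gz_bytes / GB:.0f} GB compressed ({fastq_gz_bytes / GIB:.0f} GiB)")
[/CODE]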

MacUalraig
05-19-2019, 10:51 AM
I think I've only mentioned RAM/disk/file sizes so far. In case anyone is interested, when I did the N50 calc the total number of bases came out at 102,373,913,140.

MacUalraig
05-19-2019, 10:55 AM
Average read depth of the Dante-ONT BAM now calculated at 30.83x.
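
In case anyone wants to reproduce that figure, a quick sketch of the rough way to do it with pysam; the BAM name is hypothetical, and per-base tools like samtools depth or mosdepth are the more rigorous route:

[CODE]
import pysam

# Quick mean depth estimate: total aligned bases / total reference length.
bam = pysam.AlignmentFile("dante_ont.sorted.bam", "rb")   # hypothetical name

aligned_bases = 0
for read in bam.fetch(until_eof=True):
    # Count each base only once: skip unmapped, secondary and supplementary.
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        continue
    aligned_bases += read.query_alignment_length

print(f"Mean depth: {aligned_bases / sum(bam.lengths):.2f}x")
[/CODE]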

KCW
05-31-2019, 01:01 AM
I read somewhere that Dante offers a long read discount to returning customers. Is this true? I have some 30x sequences with them. Has anyone used this discount? Does it actually exist? How does one apply it? Many thanks!

Donwulff
05-31-2019, 01:38 AM
The discount code was e-mailed to everybody with a Dante Labs account; however, it expired weeks ago, and the starting price of the long read test went up. If they're using two flowcells for most tests, it's still better than the list price of the flowcells, but I hope they can squeeze those reads onto a single flowcell soon. At least, after they've run mine on two flowcells for 50X reads... *rubs hands* Haha ;)

As an aside, those discount codes usually say "Just for you" or something else implying exclusivity, but the Memorial Day discount, for example, included social media sharing buttons. So do they want us to share them or not?

pinoqio
05-31-2019, 10:46 AM
I read somewhere that Dante offers a long read discount to returning customers. Is this true? I have some 30x sequences with them. Has anyone used this discount? Does it actually exist? How does one apply it? Many thanks!

Expired, but the MEMORIALDAY promo code is still active, which brings it down to 700€.

Edit: Andrea Riposati talk on long reads at Dante Labs: https://www.youtube.com/watch?v=M1ESNXKIbME

My collection kit was a standard SpectrumDNA kit (http://www.spectrum-dna.com/products/dna-collection). Will be interesting to see if they can achieve longer reads with their "special" collection kit.

Ysearcher
06-03-2019, 03:14 PM
Bionano Genomics (San Diego) - https://bionanogenomics.com/technology/platform-technology/ - is a relative newcomer to the field of structural variant analysis, and recently researchers comparing many platforms determined that Bionano was the single best technology for discovery of human structural variants. The downside is that finding the service is currently almost impossible without a referral to a handful of research centers.

Good luck with your Dante Labs Nanopore whole genome sequencing. Please post back when you receive your results.

Donwulff
06-04-2019, 05:33 AM
Reference for "recently researchers comparing many platforms determined that Bionano was the single best technology for discovery of human structural variants", please? Also, it should be stressed that Bionano is "optical mapping" and not sequencing, and it's complementary to sequencing technologies like Oxford Nanopore (https://academic.oup.com/hmg/article/27/R2/R234/4996216). It's also important to note the applicability of the different techniques; for example, Bionano (with additional long-read sequencing) appears to be mainly used for particularly challenging domains like telomeric regions (https://www.nature.com/articles/s41598-018-34774-0) or plant genomes (https://www.nature.com/articles/s41467-018-07271-1). Clinically or genealogically relevant structural variation in human genomes tends to be adequately resolved by sequencing-only technologies, albeit these kinds of statements do carry the caveat that different techniques may eventually find more use when there's more data to compare to.

pinoqio
06-06-2019, 03:26 PM
They reduced the price back to 799€, with the still active MEMORIALDAY code 599€.

About a month after the FASTQ data, still no sign of a report or a BAM file though.

Jan_Noack
06-07-2019, 12:04 AM
Hmm, I ordered when it was 899 with the MEMORIALDAY discount. It has not shipped as yet? I have emailed Dante to ask about this reduction and if I can get a refund of 100 EUR. I think they should not charge your Mastercard or Visa until they can ship the kit? I'm a tad concerned here.

Jan_Noack
06-11-2019, 11:24 PM
Hmm, I ordered when it was 899 with the MEMORIALDAY discount. It has not shipped as yet? I have emailed Dante to ask about this reduction and if I can get a refund of 100 EUR. I think they should not charge your Mastercard or Visa until they can ship the kit? I'm a tad concerned here. Update: yesterday they read my emails and replied that they could only give me a full refund. I now have a full refund. They still did not have any kits to send out; maybe there is a shortage of them, or maybe it's because I'm in Australia? This was for the long read.

Donwulff
06-13-2019, 10:08 AM
Update: yesterday they read my emails and replied that they could only give me a full refund. I now have a full refund. They still did not have any kits to send out; maybe there is a shortage of them, or maybe it's because I'm in Australia? This was for the long read.

That's a bit weird, and also 100 EUR worth of disheartening, because I've asked for the discount too. However, Dante Labs has yet to respond to the request beyond saying it's been escalated higher up. In my case I ordered the kit on the 27th of May, and it was mailed out of Germany on the 28th.

Their chosen express delivery service, at least here, DPD, is awful though. The parcel arrived two weeks later, on the 11th of June, after having been "Out for delivery" for a week. DPD has not informed me about its arrival or where it was delivered in any way, and they do not have customer service or any way to contact them in this country. After some sleuthing I found the company responsible for their deliveries in Finland (PostNord) and managed to obtain the parcel, which had been thoroughly trashed and squeezed in transit. Like, wtf? Luckily the sample kit itself is sturdy and well packed, so it survived. Dante Labs isn't really responsible for the delivery company's doings, but I don't get why they don't just drop these in the normal mail; it would arrive in a couple of days with none of the drama & damage.

MacUalraig
07-09-2019, 08:00 AM
According to the new-fangled kit manager my long read test (which they already gave me a FASTQ for) has now reverted to Passed QC and is scheduled for sequencing. Whether this is meaningful who knows.

pinoqio
07-09-2019, 06:34 PM
Has it ever been further than "Success DNA - A" in the old kit manager?
Mine was on "Success DNA" and I believe this simply translates into "QC passed" in the new one.
But yeah, same situation as you, got the FASTQ two months ago and since then absolutely nothing.

MacUalraig
07-09-2019, 06:40 PM
No my status was exactly the same as yours.

To be fair we may have just got a pilot run, I'm not complaining really. I would much rather have it than not!

Petr
07-09-2019, 06:52 PM
My status is still "Awaiting Your Kit" even though according to the tracking it was delivered on April 26th.

On July 3rd the Dante support wrote: "The results are ready to be uploaded. "

Still nothing.

pinoqio
07-09-2019, 07:52 PM
Yes, I'm ready to cut them a lot more slack on the long reads test; I was just pointing out that I don't think it "reverted", they just renamed the "extraction done" status.
And with sequencing, they were really fast:

Sample received
10 days pass
Sequencing, using the first flow cell (took 2.5 days)
8 days pass
Sequencing, second flow cell (took 2.5 days)
5 days pass
FASTQ available for download

Added up, from sample delivery to FASTQ in just under a month. I thought that was very quick.

MacUalraig
07-09-2019, 08:08 PM
Can we tell if we were run on a MinION or the PromethION? They use different cells, but I'm not sure if the IDs in the output give a clue; I couldn't match them up with the cell product names.

It might explain why they churned one out quickly but it isn't 'official'.

Donwulff
07-10-2019, 06:36 AM
According to https://nanoporetech.com/services/providers/dante-labs it's PromethION, and they're certified (https://store.nanoporetech.com/lab-certification.html/), which means they should have produced data that meets quality standards within a set period of time on the PromethION. From the certification procedure it doesn't seem like mapping/assembly, i.e. bioinformatics, is necessarily required or included in the certification. From some of the announcements I've been wondering if Dante Labs could be using newer flowcells, though: https://nanoporetech.com/about-us/news/new-r10-nanopore-released-early-access - The page https://nanoporetech.com/about-us/news/highest-throughput-yet-promethion-breaks-7-terabase-mark mentions "the latter of which is enabling human long read genomes to be commercially offered for under $1000", and I would assume the highest throughput is with the latest revisions of the flowcells.

Donwulff
07-10-2019, 07:32 AM
8.4 Asked Dante Labs via their web form about long-read bioinformatics processing and FAST5 availability; no answer, but instead they raised the price, so I postponed the order.
27.5 Ordered
28.5 Mailed to me via DPD
6.6 First time "Out for delivery", didn't arrive.
7.6 Went through the delivery companies in the country and found the one responsible for the delivery, who told me that "due to changes in contracts" they were no longer delivering to residences, only to pick-up points, which would happen some day; and because the delivery didn't include an e-mail or phone number, they'd notify me by paper letter only.
I ask Dante Labs if they can apply the 100 EUR discount to my order because the price has been lowered by 100 EUR; DL answers the second question but not that one.
10.6 I send a second e-mail asking about the discount, saying I can still cancel & re-order at the lower price. Customer service says they'll escalate the query; again no response.
11.6 Second time "Out for delivery", delivered to pickup point
14.6 Scheduled a pickup with DHL via their web site, but they didn't turn up.
17.6 Called up DHL and they showed up to pick up the sample.
Received a mailed letter notifying me of the DPD parcel's arrival, saying it would be kept a few more days if not picked up (though I had already picked it up).
20.6 Dante Labs confirms receipt of the sample.
2.7 Kit status changes to "We Received Your Kit". I ask customer support about the availability of FAST5 signal-level raw data, and they respond with the standard FASTQ/BAM reply: "download link for the raw data files (FASTQ/ BAM) is still a work in progress".


36 days from order to the "Awaiting Quality Control Inspection" state, and it has not moved further in the three weeks since they received my sample. I'm not too fazed; problems with the delivery companies aren't exactly their fault (although DPD seems an odd choice). I'm sure they're not too thrilled to deal with discount/refund requests, although the lack of response is what's giving them a bad reputation. And I don't exactly expect their customer service to know about FAST5 etc., though I did stress that I don't mean FASTQ. Oxford Nanopore tech is developing so fast that if it takes half a year, people may legitimately feel ripped off, but there's also a chance that the processing gets done with newer technology if it takes longer, as long as the data is available eventually. But I'm sharing my experience amidst the "got my results in under a month!" posts, because for me it was more than a month just to get the sample in.

Overall I don't understand the fuss about Dante Labs. In FTDNA projects I'm involved in, people are reporting waits for Big Y longer than half a year; according to FGC's web site they're running only a few batches a year, so it has to be over half a year there as well, assuming they don't need re-runs, but you don't see people filling forums calling them a scam. Poor expectation management or poor social media management? Hitting a price point where people are making on-the-spot purchase decisions without understanding what it means to order experimental research wet-lab services? (Admittedly, yeah, there's a bit of an expectation management issue there...) And yes, I know, nobody is complaining about the long reads (yet), just pointing it out in general. But people definitely shouldn't expect long reads in a month from order, which is the impression one might get from the current reports.

MacUalraig
07-18-2019, 11:09 AM
The Long Read test is today listed as 'Sold Out'.

"Whole GenomeL - Long Reads Whole Genome Sequencing for Researchers

€799.00 EUR
Sold Out Regular price
€1,299.00 EUR

"

The product's Shop Now button is still enabled, but the next page reconfirms the 'sold out' status. So did they use up all their starter-kit flow cells?! Not even a 'we are expecting more shortly' message? Or maybe not enough customer interest? That would be a bummer if they bought the PromethION.

Donwulff
07-18-2019, 01:23 PM
Hmm, that's weird. One of the main selling points of Oxford Nanopore is the "pay as you go" model; there is very little up-front cost (although, in business terms, of course there's training, workforce etc.). The starter kits do cost more than new packs, though. It sounds to me like nobody has yet received their final results(?) on long reads, so I suspect it's some combination of a bioinformatics/data-processing holdup and possibly, as the "sold out" implies, reaching capacity. Given the price point, their presentation and the London Calling conference presentation, that wouldn't surprise me.

Alternatively, as someone earlier observed, if they're needing two flowcells per sequence, it may not be profitable for them at this point (although you would then expect them to raise the price, instead of dropping it as they've done), and they're possibly waiting for R10 flow cells to roll out. Finally, it's possible that Oxford Nanopore is simply out of flowcells as they switch to R10 production... Also, I notice the product description has changed to "for researchers" only, which makes sense/is how it should have been originally, because as noted earlier it's not going to be very useful for customers looking for genetic genealogy/SNP data.

MacUalraig
07-18-2019, 01:31 PM
Or, especially since the CEO is ex-Amazon, it could just be one of those 'only one remaining' sales gimmicks.

Donwulff
07-19-2019, 09:19 AM
You can't order it currently, though. Either way, I'll grant it's a little confusing; I should just sum it up as "we don't really know". I did note that the standard PromethION package includes 4 months of warranty & license, which I think would be up right about now. But if they were going to quit offering ONT, I doubt they'd have presented at the London Calling nanopore conference. If the issue were the price, I doubt they'd have lowered it just a bit earlier. I think if I were in their shoes, I might wait for the R10 flowcells though, as the price seems to be the same but the sequencing yield higher. And the ONT store says "We are aiming to ship R10 flow cells in July/August".

What's the status on the data they have released? On Facebook they originally announced "We started releasing our first long reads whole genome sequencing data to our customers, in less than 30 days after receiving the samples." but I understood people have only reported receiving FASTQ files? Any final deliverables, or information on them?

MacUalraig
07-19-2019, 09:54 AM
No, not a word since they posted the FASTQ on May 17th. The official status is:

"Your Kit will be Sequenced Shortly

Your kit has passed the quality care inspection and is scheduled to be sequenced shortly. After sequencing is complete your results will be posted soon after."

Petr
07-19-2019, 12:18 PM
My kit, delivered to them on April 26th, still shows: "We Are Awaiting Your Kit".

Several e-mails have been exchanged; several weeks ago they wrote me "The results are ready to be uploaded." - but still no change.

Donwulff
07-31-2019, 08:19 AM
Dante Labs WholeGenomeL Long Read sequencing is orderable again, at least on the EU site. No "last kit" or anything.
On the other hand, no updates for my kit since 20th June when they reported it received; it's still in QC, which is unfortunate. I recall someone earlier reported going from QC to FASTQ available in one go after a month, so fingers crossed...

ybmpark
08-01-2019, 11:53 AM
Not WGSL, but my data has been erased from my account and they don't appear to want to issue a refund. My kit number is still there.
Anyone else with the same experience?
And did you really pay 900 dollars to a company that a lot of people suspect of fraud?

tontsa
08-02-2019, 05:20 AM
The kit manager is just broken at the moment.. looking at the JSON they receive from their "API" backend, it seems to be missing some commas. They'll probably get it fixed soon though.

Donwulff
08-03-2019, 05:30 AM
Genome Manager is back, at least for me, at least for the time being. There seem to be some changes though: "We found high risk variants in these conditions" no longer shows conditions, just pharmacogenomics. But the Reports page had a direct link to "Health & Wellness". This may be temporary while they make changes, but that's not necessarily a bad change; as I commented before, the way they were displaying the genomic conditions seemed misleading because it read like a diagnosis even if you were only a carrier, for example, although they seemed to add some disclaimers earlier. The PDF reports presumably have the disclaimer preamble. Alternatively it's possible the "health conditions" reports will only be available by paying for them in their "New Reports" store, though right now it's a bit unclear what they're going for (maybe they're testing things out themselves).

I would note though that the "Genome Manager" was never part of the deal for the product as sold; in fact I don't believe they promise updates or continued service anywhere on the site, just the one-time download/disk delivery. Most DNA testing companies require a continuous subscription for updates to their product, which is borderline unethical, because the interpretations are always changing and there should be an efficient way of informing people of errors and new discoveries in their interpretations. Which makes me hope they maintain the free updates, but yeah, economically that may not make a whole lot of sense.

pmokeefe
08-20-2019, 10:58 AM
I ordered my "Long Reads" kit back in June. On June 24, Dante Labs emailed me that the sample was received.
Today August 20th I received the email 'The status of your kit ... has been updated to "QC Completed"'.
Anyone else with recent news on their Dante Long Reads kits?

Petr
08-20-2019, 01:36 PM
I noticed about a week ago that I have a FASTQ file available. The progress bar still shows "We Received Your Kit". The sample was delivered to Dante in Italy on April 26th.

I uploaded this file to sequencing.com, but received the following message:


Thank you for uploading 60820188482322.fastq.gz. Unfortunately, this file is not compatible with some of the apps at Sequencing.com. Because of this, you may not be able to select this file when starting some apps.

Donwulff
08-20-2019, 03:48 PM
I ordered my "Long Reads" kit back in June. On June 24, Dante Labs emailed me that the sample was received.
Today August 20th I received the email 'The status of your kit ... has been updated to "QC Completed"'.
Anyone else with recent news on their Dante Long Reads kits?

Interesting. I'm not quite sure which day I should go by, because DL e-mailed me that "your kit status is updated" to "Kit Received" on 20th June, though parcel tracking showed it had been accepted a few days earlier. Honestly, I wrote earlier that it went to QC on 2nd July, but I'm no longer certain, because that appears to be just some e-mail I received. Either way, on 13th August I asked them if something was wrong, because after 7 weeks it was still in QC. They replied almost instantly saying that yes, it's indeed still in QC, and that they will let me know if there's any problem with the sample, so there's no need to ask. (Honestly, some posts on their FB page, where they apologized for it, suggest that sometimes they haven't told people there's a problem with the sample...)

However, if they received yours on the 24th, they're apparently not releasing them in order. Not entirely surprising, as the sequencing takes days, so runs could finish at different times. I hope that means I get more data ;)

I'm also wondering where people are seeing the FASTQ downloads, so I'm not missing mine. https://genome.dantelabs.com/reports under the "Raw Data Library"?

tontsa
08-21-2019, 04:24 AM
Yeah, some lucky people see FASTQ and/or BAM links on the "Raw Data Library" page at https://genome.dantelabs.com/reports .. but it seems really hit-and-miss who gets them.. maybe they really have tens of different labs they send the samples to, and depending on lab capabilities they have to wait for HDDs to ship, or they get the raw data some other way..

Donwulff
08-21-2019, 11:44 AM
They're still listed on https://nanoporetech.com/services/providers#tabs-0=Dante-Labs for long-read sequencing, so they must have the equipment & have proven they can operate it. Sending samples to other centers for sequencing would cost them much more, and the gig would be up right away, with the other centers advertising that they're actually the ones doing the "under $1000 whole human long-read genomes". On the bright side, that would allow them to return the results in weeks rather than the months it seems to be taking now, but clearly that's not happening here.

The yield per flowcell varies, meaning the time the same flowcell is used varies, so it's likely some sequences take more flowcells and others take longer runtimes etc. Alternatively, of course, they could have a process of "we run each sample on two flowcells for 2.5 days each, exactly, and return whatever we get during the weekly upload", but in that case the results would be returned roughly in the order received. I'll have to stress that I'm talking about exactly one sample here, but it was interesting to note nonetheless. Of course, it's always possible they put all the samples in a large black top hat and pull one out blindfolded to sequence, but I kinda doubt that ;)

Post https://anthrogenica.com/showthread.php?16842-Dante-Labs-Long-Read-Test&p=569210&viewfull=1#post569210 earlier in this thread from pinoqio provides some great graphs about this: the number of reads per hour keeps dropping the longer the flowcell is in use (nanopores stop working, so fewer nanopores perform sequencing), but it also depends a bit on the flowcell.

JamesKane
08-21-2019, 11:49 AM
My original sample was rejected in QC and their partner in Draper received the new one on June 3. The new sample was marked awaiting QC within two weeks of that, but no updates since.

I now have access to a haplogroup R individual's results, but won't have time to compare them with his father's Big Y for at least a few weeks. My initial impressions of coverage depth when aligned to GRCh38 using minimap2 aren't great. On the other hand, the utility of the long reads is in de novo assembly, for which I won't have a workflow established until my own is returned.

Petr
08-24-2019, 07:28 AM
As I mentioned, the progress of my Whole GenomeL - Long Reads Whole Genome Sequencing for Researchers test was the following:
26-04-2019: Sample delivered by UPS to Dante in Italy
03-07-2019: Reply from Dante support: "The results are ready to be uploaded."
16-08-2019: I happened to find that the FASTQ file was available
The status is still "We Received Your Kit"
No other results available yet.

I asked YSEQ to map this FASTQ file; they used minimap2 to create the hg38 BAM, but the results do not look very good. James Kane wrote me: "Preliminary numbers indicate there is less than 4x coverage for most of chrY."
state nBases
REF_N 33591060
CALLABLE 7091657
NO_COVERAGE 146258
LOW_COVERAGE 12237917
EXCESSIVE_COVERAGE 0
POOR_MAPPING_QUALITY 4197763

I also submitted the hg38 BAM file to YSEQ and the results are also very bad:
SNPs (all): 155558
Positive: 2210 (1.42%)
Negative: 110187 (70.83%)
Ambiguous: 42517 (27.33%)
No call: 644 (0.41%)

STRs (all): 780
Reliable alleles: 229 (29.36%)
Uncertain alleles: 87 (11.15%)
N/A: 464 (59.49%)

No raw data statistics yet.

The specification from Dante is:
- Suggested for large structural variations (CNVs, SVs, large INDELs)
- Performed in the Dante Labs Oxford Nanopore-certified lab
- Results in only 8-12 weeks
- Whole Genome Sequencing 30X

Therefore I hoped that this test could be good for long STRs.

The lines in the FASTQ look like:
@26a2a1b9-1866-4fda-be5f-d28f3c58955a runid=8d381fe0057aee336a817a93f54a3901fed6c90c read=9 ch=2093 start_time=2019-05-16T16:13:50Z flow_cell_id=PAD31817 protocol_group_id=DL_AQ_20190516_003_3 sample_id=60820188482322_1
ATCGGTATTGCTTCGTTCAGTTACGTATTGCTAAGAAAAAAAGAAGAATCAAATAGACACAATAAAAAATGATAGGGATATACCACTGATCCACAGAAATACAAACTACCAATCAGAGAATACTACAAAACACCTCTACAAAAAATTAACTGGAAAATCTAGAAGAAATGGATGAAGATTACTGAACATACACCCTCAAGGCTAAACCAGGGAAGAAATTGAATCACTGAATGCACCAATAACAGGCTCTGAAATTGTGGCAGCGTCAATAGCTTACCAACCAAAAAGAGTCCAGGAACCAGATGGATTCACAGCCCGTTCTACCACAATTTTCTAG
+
*,(##-%(%"#&$))-22&+-/4614153-/424,,31672+%,,'28247>>::8;3&/.;110===2,)*&&(,,2342&*&&&$#'&..(++*-52/27:;+-.('+&$%'4)++/4526/2/9><,0/++,../,16===.):(5)/--//.&((/.-02-0;52697*$)'%$)$#%%'2<85.11./9+(*(((&$&*/684(55&4-+,,0+23%544/5654=>0.+.7.;2176+/(,3141)/.-2-+'%%&#))1235,+%''+*/5:(=(:=;1('(+,4-.,)-05984464*(8,4,.*'/,$$'1+2244/0'+(87/%##$
@c906e39b-9efa-49e9-bc65-8e96832e04e4 runid=8d381fe0057aee336a817a93f54a3901fed6c90c read=9 ch=1994 start_time=2019-05-16T16:13:50Z flow_cell_id=PAD31817 protocol_group_id=DL_AQ_20190516_003_3 sample_id=60820188482322_1
TCGGTATGCTTCGTTCAGTTACGTATTGCTCGCTTCTTTGCGAAGTGTTCTTTTGGCGTTTAATTTGTTCAGATTAACTCACCACCAACGCCACCAACGCCGATAGAAACATATCGCTTTCGCTTTTGTTAACAAAAAGCGCTTGGTGAGTCCTTTTACAGCAATCGCTTATTTTGCGCCACCGGCTGAAATTGAACGCTCAGAAGAACCTTGTGCAATTGCATTAAGCTGATATTCGCTTGTGCCAATGCAGAAGAAGCGGTACCG
+
&(')-+.*13=44175,234%))1*2)%&*.%(-)(()'#%$$%((-4?;121)*60/5?>6<1:8,.45*$#*5*5,.-),(%)#0$(*;<77725113<*)&*)15(&(,)/*$&&&*+&&(030)-,,(&*1783),)/-104141.2751;=0'%'#'%0):*,*0*/84,&(&'%#'/,509:6:3-6..;3/127.-15,0:/5.2,/.4:0;.+-+*059-13223)+70**2*,//4*//,(.+).*&-)&(%$%###"

So it looks like the sequencing was done just 3 weeks after delivery of the sample.

But now I have been waiting more than 3 months for the final interpreted results.

I don't know the reason for the poor quality of the Y results.

I'd like to check the quality of the FASTQ file, but I don't know which tool I can use on Ubuntu 18.04.2 LTS. I found some tools based on R version 3.6, but this version of Ubuntu only has 3.4.4. And some tools require a FAST5 file; I don't know if that is what I have or something different.

tontsa
08-24-2019, 06:25 PM
Your file looks to be just standard FASTQ; FAST5 is a binary format in an "HDF5" container. You can run FastQC from the command line too to get the report. If you have multiple FASTQs you can pipe them: zcat *.fq.gz | ./fastqc stdin

Petr
08-24-2019, 07:25 PM
Your file looks to be just standard FASTQ; FAST5 is a binary format in an "HDF5" container. You can run FastQC from the command line too to get the report. If you have multiple FASTQs you can pipe them: zcat *.fq.gz | ./fastqc stdin

fastqc ends with a memory overflow.

tontsa
08-25-2019, 05:43 AM
I think you can just edit the fastqc wrapper script and give additional memory to the Java invocation. So: `which fastqc`, then nano/vim/favorite-editor /path/to/fastqc.
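
If editing by hand feels fiddly, a minimal sketch, assuming a FastQC 0.11.x install whose Perl wrapper hardcodes -Xmx250m (the exact string can differ between versions, so grep first):
grep -n Xmx "$(which fastqc)"
sed -i 's/-Xmx250m/-Xmx4g/' "$(which fastqc)"   # let the JVM use 4 GB instead of 250 MB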

Donwulff
08-25-2019, 02:08 PM
I've been wondering if they could be using some sort of read-until for selecting human chromosomes; if that were the case, maybe they're also avoiding the Y chromosome? An earlier post in the thread did report about 30X coverage of the whole genome, which is diploid, so against the haploid Y it should average about 15X. Another possibility is that minimap2 with the chosen settings is just mapping very poorly against the Y chromosome, and the long reads might map preferentially against X in many cases. I'm curious about the STR performance; one of ONT's weaknesses has specifically been the varying speed of the molecule/chromosome passing through the pore, which can lead to spurious indels. However, STRs aren't single-nucleotide, so if you miss only part of the repeat motif it should be possible to reconstruct the repeat number - but not with standard tools, I think. Of course, I've yet to really delve into ONT & the Y chromosome. Maybe time to start collecting some resource URLs...
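
If someone wants to sanity-check the chrY coverage on their own BAM, mean depth is a one-liner; sample.bam here is a hypothetical name for a sorted & indexed alignment:
samtools depth -a -r chrY sample.bam | awk '{sum+=$3} END {if (NR) print sum/NR}'
(-a makes zero-coverage positions count toward the average as well.)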

Donwulff
09-17-2019, 02:44 PM
With the changes to the products on the Dante Labs site, the Oxford Nanopore Long Read Sequencing is no longer available to order on their list. Not sure if that's final; https://nanoporetech.com/services/providers#tabs-0=Dante-Labs still lists them as certified. Meanwhile, I've been waiting 90 days for any word/data from Dante Labs on the long-read sequence (the WGS I did in 2017 came as promised).

It seems most people on this thread reported getting FASTQ in under a month (I'm still not sure where exactly they should appear), including people who ordered after me, but some others received (or at least noticed they received) theirs four months later, if I'm reading right? And no BAMs or further analysis yet?

MacUalraig
09-17-2019, 03:00 PM
Yeah, still nothing further, so I have two half-tests: one with reports but no data, and one that produced data but no reports...

pmokeefe
09-17-2019, 03:32 PM
I ordered a Long Read kit on April 8, 2019, and have received nothing back. The status is "QC Completed" and has been for some time.

pinoqio
09-18-2019, 05:30 PM
For my nanopore kit, which has had downloadable FASTQ since May, the status changed from "QC passed" to "Results Ready" yesterday - but there is nothing new available for download.

Donwulff
09-18-2019, 06:03 PM
Related but not related: just today ONT announced a flowcell flush kit which is intended to extend read yields. The previous ones, AFAIK, were only intended for changing samples in the middle of a flowcell's lifetime, while this one specifically promises to rejuvenate some of the pores. The example on ONT's page shows a flowcell with two washes reaching 90GB yield, which would make Dante Labs's previous long-read pricing viable without huge discounts from the manufacturer. (But then, yields vary by center; I think they've quoted >90GB yields before, but as has been seen in real examples, multiple flowcells have been used.) The flush doesn't rejuvenate all pores, though, so eventually they'll be dead anyway.
https://nanoporetech.com/about-us/news/new-kit-extends-yields-flow-cells

There are established processes for producing the end-user deliverables once you have FASTQ, so I'm not exactly sure what the holdup for Dante Labs is. At least people who have their FASTQ files can (theoretically) process them themselves, for example with https://github.com/nanoporetech/pipeline-structural-variation
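
The linked pipeline is Snakemake-based; as a very rough standalone sketch of the same idea (minimap2 alignment plus Sniffles SV calling, Sniffles 1.x-era syntax assumed, file names hypothetical):
minimap2 -t8 --MD -ax map-ont GRCh38.fa reads.fq.gz | samtools sort -@8 -O bam -o longread.bam
samtools index longread.bam
sniffles -m longread.bam -v sv.vcf   # structural variant calls come out as VCF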

A reminder: the product page said "CNV AND SV - Leverage long reads for Copy Number Variations and Structural Variations"; they never spoke about SNPs, for which long-read technologies still have a very high error rate. They did say FASTQ, BAM and VCF though, so people are right to expect BAM and CNV/SV VCF files, as well as variant reports.

Donwulff
09-18-2019, 07:03 PM
Has anybody got further with the analysis of their FASTQ data? Without my own data I haven't really wanted to dig into it, but I'm curious whether there are any useful recipes for people who DO have their FASTQ data.
My initial thought in this thread was to run https://github.com/nanoporetech/ont-assembly-polish to map short-read sequences onto long-read scaffolding for optimal results; however, that was before I looked at the Canu assembler specifics, which say: "A well-behaved large genome, such as human or other mammals, can be assembled in 10,000 to 25,000 CPU hours, depending on coverage. A grid environment is strongly recommended, with at least 16GB available on each compute node, and one node with at least 64GB memory. You should plan on having 3TB free disk space, much more for highly repetitive genomes." That's nearly 3 CPU-years! Of course you'd run it on a grid with hundreds of servers, but that will cost more than the sequence itself, and I don't think you can even get that many compute nodes on AWS without an institutional account or something. I know de novo genome assembly is highly compute-intensive, but that still really surprised me given ONT read lengths; there should be plenty of overlap to work with. This might still be a potential avenue for something like the Y chromosome, if reads belonging to the Y can first be identified with good confidence.
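
For scale, the Canu command itself is tiny; it's the compute behind it that's brutal. A sketch with Canu 1.x-era options (file names hypothetical):
canu -p asm -d asm_out genomeSize=3.1g useGrid=true -nanopore-raw reads.fq.gz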

(Note that you don't need to de novo assemble the genome for most uses, this concerns advanced analysis)

Of course, a high-quality assembly already exists for humans, so there's the possibility of reference-guided assembly. My first thought would be to just minimap2 the reads to the human assembly, bridging over structural variants, and then apply short-read polishing over that. But there are also actual reference-guided assembly solutions, for example https://github.com/malonge/RaGOO, so perhaps just replacing the assembly stage with that could be doable. It says nothing about human genomes, though, so I'm not sure about the performance.
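
Per the RaGOO README, the reference-guided scaffolding step for contigs you already have is roughly a one-liner (file names hypothetical; the output should land in ragoo_output/ragoo.fasta):
python ragoo.py contigs.fasta GRCh38.fa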

I've seen a plethora of different methods and papers floating around, though, so it's really hard to tell what would make the most sense.

pinoqio
09-19-2019, 06:35 AM
YSEQ's Thomas Krahn has posted an experimental pipeline based on wtdbg2 on the Dante Customers facebook page: https://gist.github.com/tkrahn/3034da009271287695364786478a5bb8
At a glance, wtdbg2 seems to be in the ballpark of 100 CPU hours with 300GB of RAM; I haven't had time to look into it myself though.
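
For anyone who'd rather not dig through Facebook, the core of a wtdbg2 run per its own README looks roughly like this (ONT preset, ~3 Gb genome size; thread count and file names are placeholders):
wtdbg2 -x ont -g 3g -t 32 -i reads.fq.gz -fo dbg   # build the fuzzy de Bruijn graph & contig layout
wtpoa-cns -t 32 -i dbg.ctg.lay.gz -fo dbg.ctg.fa   # derive the consensus contigs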

Donwulff
09-20-2019, 02:39 PM
This is probably (somewhat) specific to my sample because of different starting times; however, I just received an e-mail saying "We would let you know that we are working hard to complete the bioinformatic analysis. We have almost completed it. Your comprehensive results, specifically VCF file, will be delivered by early Oct."

At this point I'm not completely certain what that even refers to, because I've been trying out their different services, including monthly updates, panels and a personalized report. Long Read sequencing is the one I'm particularly missing a VCF for, though, although as noted in the other thread, for uniformity & validation of the results it would be better if they re-analyzed the short-read sequence with a current, up-to-date pipeline when people pay for updates.

Anyway, this is a bit surprising, since I still do not appear to have FASTQ files; although I asked for raw FAST5 files too, customer support appears to have thought I meant FASTQ. I was wondering before whether it's possible they exist but the link is missing from the genome manager, or if there are some other steps/special places people have used to get their FASTQs. I just started downloading the Aruban individual research data https://www.ncbi.nlm.nih.gov/Traces/study/?acc=ERP108807 (they have used different basecaller versions) to experiment with pipelines; what if I get my own data before the downloads finish, haha...

Donwulff
09-20-2019, 03:22 PM
According to https://github.com/ruanjue/wtdbg2, the wtdbg2 run on ONT >30X human genomes took about 1000 CPU-hours; they used a 32-thread server to get the 35-hour wall-clock time. While this is a significant improvement over Canu's time, not everybody has a 32-thread (2 CPU x 16 core, maybe) server. Although less than a week's runtime on an 8-thread home computer doesn't sound so bad compared to plain genome alignment runs. However, the note "For Nanopore data, wtdbg2 may produce an assembly smaller than the true genome." at the end of the README might give people pause.

There is https://github.com/cvdelannoy/poreTally which might come in handy for benchmarking assembly; unfortunately, though wtdbg2 is included in the documentation, the example data they have doesn't include a wtdbg2 run. Additionally, genomes of different organisms behave in very different ways, and the runtime to test on a whole human genome would again be fairly prohibitive. I'm thinking one could try this on pre-selected reads for a single chromosome, possibly with randomly sampled reads of the metagenome to match the Dante Labs samples. Not completely realistic, and probably still taking months of runtime to benchmark...

In any case, https://github.com/fenderglass/Flye is looking good in the oldish poreTally benchmarks, and based on its own web page takes 2500h of CPU time on 35X ONT human. However, the memory requirement is so high you'd need something like an AWS r5d.metal at about $6.912 per hour. 2500h/50 (the FAQ says 50 threads is the most they use internally, so I assume it scales poorly beyond that) ~ 50h; assume a rough ballpark of 350 dollars (half that if performance scales linearly with threads). This isn't a recommendation btw, I didn't look at every spec and I'm not responsible if someone runs out of memory at 99% complete ;)

Again: genome assembly isn't needed for basic analysis, because humans have an existing high-quality reference genome, but it might help where there's lots of large-scale structural variation compared to the reference, including the Y chromosome, and it can be used to construct a personal reference genome to map sequenced short reads against for better alignment.

Donwulff
09-20-2019, 04:08 PM
wtdbg2 has a >200GB memory requirement for ONT >30X too, with a likely(?) incomplete assembly. Dante Labs ONT from saliva probably includes substantial oral microbiome, so it's like assembling two genomes at once, and the memory requirements will most likely be exceeded for wtdbg2 and Flye. If just requiring a quick & dirty assembly, I'll go with the reference-guided https://github.com/malonge/RaGOO first.

For a whole assembly I think I'll try a protocol like this (rough commands for a couple of the steps are sketched after the list):
Minimap2 to human ref to filter out reads which are definitely of human origin.
Assemble remaining reads with metagenome settings.
Polish contigs if needed (the Flye FAQ suggests it might not be needed).
Minimap2 contigs against known eHOMD or similar metagenome to gather definitely microbial contigs.
BLAST remaining contigs against human & metagenome databases to determine most likely origin.
Minimap2 whole original reads against the assumed microbial origin contigs to filter them out.
Assemble remaining reads with human genome settings.
Also look at possible overlap between "definitely human" and "microbial" reads.
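
A rough command sketch of steps 1 and 6, with hypothetical file names (reads.fq.gz, GRCh38.fa, microbial_contigs.fa):
# 1. split off reads that map to the human reference; the unmapped remainder goes on to metagenome assembly
minimap2 -t`nproc` -ax map-ont GRCh38.fa reads.fq.gz | samtools sort -@`nproc` -O bam -o human.bam
samtools fastq -f 0x4 human.bam | bgzip > rest.fq.gz
# 6. once microbial-origin contigs are identified, filter out the original reads that map to them
minimap2 -t`nproc` -ax map-ont microbial_contigs.fa reads.fq.gz | samtools fastq -f 0x4 - | bgzip > human_candidates.fq.gz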

But yeah, that's going to cost several hundred at the very least, and I first need the darned FASTQ, so I'm going with reference-guided assembly first, which might let me glean additional ideas about it. There are some standalone classifiers for microbial/human DNA sequences, but due to the high error rate of ONT long-read sequencing, they are probably highly unreliable on individual reads.

JamesKane
09-20-2019, 06:02 PM
I've been wondering if they could be using some sort of read-until for selecting human chromosomes; if that were the case, maybe they're also avoiding the Y chromosome? An earlier post in the thread did report about 30X coverage of the whole genome, which is diploid, so against the haploid Y it should average about 15X. Another possibility is that minimap2 with the chosen settings is just mapping very poorly against the Y chromosome, and the long reads might map preferentially against X in many cases. I'm curious about the STR performance; one of ONT's weaknesses has specifically been the varying speed of the molecule/chromosome passing through the pore, which can lead to spurious indels. However, STRs aren't single-nucleotide, so if you miss only part of the repeat motif it should be possible to reconstruct the repeat number - but not with standard tools, I think. Of course, I've yet to really delve into ONT & the Y chromosome. Maybe time to start collecting some resource URLs...

The problem with my numbers quoted above is that the old CallableLoci tool cannot deal with the noisy nature of the ONT long reads. Scrubbing through the BAM that Thomas created for Petr shows there is better than 15x coverage on Y. The problem is all the random deletions lining up in 11 of the reads, so it appears there are 4 reads or fewer. That causes the binning by loci to fail miserably.

I received a notice that my data should be available in a few weeks, so I'll be digging into things more at that time.

Donwulff
09-22-2019, 01:38 AM
I skipped "Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit" https://www.biorxiv.org/content/10.1101/715722v1 over earlier because of the "novel nanopore toolkit", but reading the paper it turns out they were using longer continuous read protocol, which isn't strictly necessary. (Admittedly it will mess up with genome assembler comparisons, but that's life...).

Some people, myself included, have been discussing and guessing the sequencing throughput, so I found this data from the recent paper (at that particular lab): "We used three flow cells per genome, with each flow cell receiving a nuclease flush every 20-24 hours. This flush removed long DNA fragments that could cause the pores to become blocked over time. Each flow cell received a fresh library of the same sample after the nuclease flush. A total of two nuclease flushes were performed per flow cell, and each flow cell received a total of three sequencing libraries. We used Guppy version 2.3.5 with the high accuracy flipflop model for basecalling (see Online Methods).

The nanopore sequencing for these eleven genomes was performed in nine days, producing 2.3 terabases of sequence. This was made possible by running up to 15 flow cells in parallel during these sequencing runs. Results are shown in Fig. 1 and Supplementary Tables 1, 2, and 3. Nanopore sequencing yielded an average of 69 gigabases (Gb) per flow cell, with the total throughput per individual genome ranging between 48x (158 Gb) and 85x (280 Gb) coverage per genome (Fig. 1a)."

The part of particular interest now is the comparison of Canu, Flye, wtdbg2 and Shasta, the new long-read assembler they present. I don't particularly know the innovations in all of those, but a nice approach presented in Shasta is reducing the sequences so that runs of the same nucleotide are represented only once, i.e. AAACCT would become ACT. Because nanopore sequencing works by measuring current across the pore while the DNA passes through it, small changes in speed lead to errors in run length, so this representation avoids the most common error mode in nanopore sequencing. However, it appears to also use enormous amounts of memory. This might not be a deal-breaker if all the alternatives take a cloud server anyway, and one with that much memory is available.
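
As a toy illustration of that run-length idea in shell (tr -s squeezes runs of repeated characters):
echo AAACCT | tr -s 'ACGT'   # prints ACT: each run of identical bases collapses to one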

https://www.biorxiv.org/content/biorxiv/early/2019/07/26/715722/F2.large.jpg

In general, every paper will present its own algorithm as the best (instead of "we did all this work, and found out it performs worse than existing methods"), so take it with a grain of salt. I think Flye pulls ahead of wtdbg2: the runtime isn't that much higher, but as one can see in box C, the number of apparent misassemblies between the two is pretty much even, while box A shows that wtdbg2 has the shortest recovered contigs by far. "Canu consistently produced the most contiguous assemblies, with contig NG50s of 39.0, 31.3, and 85.8 Mb, for samples HG00733, HG002, and CHM13, respectively (Fig. 2a). Flye was the second most contiguous, with contig NG50s of 24.2, 24.9, and 34.2 Mb, for the same samples. Shasta was next with contig NG50s of 20.3, 19.3, and 37.8 Mb. Wtdbg2 produced the least contiguous assemblies, with contig NG50s of 14.5, 12.2, and 13.6 Mb." Flye's base-level error rate is higher, but that could also mean that Flye assembles mismatched bases better, because the presumed misassemblies don't take that large a hit. Admittedly this is one of those situations where you can't be certain of the true answer. Of course, if the paper is right, then Shasta is actually the best of the full-genome assemblers on Oxford Nanopore reads.

Donwulff
09-24-2019, 03:00 AM
If Dante Labs has stopped offering the sequences, it's probably not worth starting a separate thread for general ONT sequencing. That would be a shame, though; third-generation sequencing is definitely the future, it's just a question of how and when it will be offered to the general public. Long research notes follow, warning! ;)

I had a look at the RaGOO package I linked above, and it turns out it's JUST an assembly scaffolding tool without the assembly part, so Shasta looks like the best bet for assembly right now. There are several mentions of reference-guided assembly for ONT, but I couldn't find any ready-made tool. For example, http://ibest.github.io/ARC/ appears to map reads to a reference, then subdivide the mapped reads into clustering regions, individually assemble those clusters, and use the assembled contigs to map reads again, repeating for as long as needed. This is roughly what I would do for the Y chromosome, although it needs long-read-compatible tools. Rebaler is another I've run into, which replaces parts of the reference sequence with matching reads; however, the author says it wouldn't work well for this purpose: https://github.com/rrwick/Rebaler/issues/3

While waiting for my own FASTQ data (odd, though, that my status is still awaiting QC; I don't know if they've forgotten to update the status or the e-mail is in error, but I'm so far resisting the temptation to ask them, again, what's going on. It seems like a few minutes spent making sure the accounts are up to date would save hours of support time and money...) I was playing with publicly available PromethION sequencing data. The public sequencing samples have some benefits, namely there's no saliva microbiome, and there are existing studies to compare results to. NA19240 https://www.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=GM19240 is unfortunately female, because the Y chromosome is one of the larger points of interest; however, it has a fairly good-quality de novo assembly https://www.ncbi.nlm.nih.gov/assembly/GCA_001524155.4/ in existence, which makes it a good test case. (I should note that these "reference quality assemblies" take a huge amount of work and different techniques; they're not comparable to what we can get from saliva sequencing with standard read lengths and one or two sequencing technologies without signal-level data, but that also allows us to compare how close we get.) There's raw data at https://gigabaseorgigabyte.wordpress.com/2018/05/24/promethion-human-genome-na19240/ and the specific run used is https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR3219853

A quick search did not reveal any special minimap2 parameters I should be using, so I started with just the defaults. I was curious what would happen if I used my updated reference sequence from the "Dante Labs Technical" thread. Minimap2 isn't alt-contig aware (alt contigs represent larger, often structural variation between individual genomes; different populations can have very different genomic segments, but the aligner should be aware these aren't completely separate pieces), and long reads tend to span over more than just the alt contig, making mapping and alignment complicated.

The reference I have contains all the published alternate sequences for genomic locations; fix sequences (where the original reference was in error, but correcting it would change the genomic coordinates of variants, so they're published as "patches" which are in many ways like alt sequences); Human Leukocyte Antigen (HLA) gene alt sequences, which are highly polymorphic regions of the human genome; the standard decoy sequences (mostly DNA which has been found in many human sequences but not placed on the assembly); and Expanded Human Oral Microbiome reference sequences, to serve the same purpose as the decoys for saliva sequencing, which contains oral microbiome. For the decoys & oral microbiome, any sequences with an exact match of 101 or more nucleotide bases to the human assembly plus HLAs are removed, following the original decoy sequence construction protocol.

Minimap2 takes more memory than the Burrows-Wheeler Transform FM-index (in a way, a tree structure containing every DNA sequence in the reference genome, encoded in a special efficient way that allows retrieving the genomic location of a given sequence) used by the likes of bwa mem/bowtie2. With the expanded reference genome, minimap2 required 29 gigabytes of memory with the default Oxford Nanopore settings, though long-read alignment is really quite fast: I got 4 million long reads through in about 32 CPU hours, or 8 hours per million reads. Of course, with 8 CPU threads that's about an hour per million reads wall-time.

The good news is that the expanded oral microbiome gets very few hits from the research sample, as it should, because it's not a saliva sample, and the remaining hits look like plausible bloodborne pathogens. This means there's a chance of separating the microbiome & human genome in silico for de novo assembly. Many decoy, alt and fix contigs get a comparatively large number of hits. A slight surprise to me, though, is that out of just over 4 million reads, over 600 thousand (15.3%) wouldn't align to the expanded reference with the default settings.

To explore this further, I iteratively aligned the remaining unmapped reads to various reference sequences in decreasing order of likelihood. These reference genomes would make too large a combined reference to run in one go in 32 gigabytes of memory, and they have a lot of overlap, so which reference each individual read would map against would be somewhat random. Perhaps I should align the whole sample against each reference genome and construct Venn diagrams of mapped, unmapped and possibly other categories, but the iterative approach seems more practical. Each iteration was produced with something resembling this Bash monstrosity (the first mapping took 29G of memory, so the stages need to be done with temporary files):

/usr/bin/time minimap2 -t`nproc` -2 -I8G -ax map-ont /mnt/GenomicData/asm.contigs.fasta.gz 20180413.unmapped.fq.gz 2> >(tee 20180413.asm.log 1>&2) | tee >(samtools view -f0x4 | samtools fastq | bgzip -@`nproc` > 20180413.asm.unmapped.fq.gz) | samtools sort -@`nproc` -m2G -O bam - | tee >(samtools index - 20180413.asm.sorted.bam.bai) > 20180413.asm.sorted.bam

First, I thought it pertinent to try minimap2 with the -H option, which performs "homopolymer compression", i.e. reducing consecutive identical nucleotides to a single one to match length errors. According to the documentation this helps with PacBio but isn't recommended for Oxford Nanopore; however, since ONT shares the same primary error mode, I decided to give it a try. This actually took a LOT more memory, but luckily I had only a small number of unmapped reads to process. The homopolymer compression -H option succeeded in mapping an additional 37,300 reads, or 0.9% of the whole.

For the third iteration, I attempted the earlier-mentioned NA19240 preliminary assembly version 3 from https://www.ncbi.nlm.nih.gov/assembly/GCA_001524155.4/ - as said, this is much higher quality than is possible "at home", but in some ways it illustrates the benefit of having a personal reference genome available (an African genome likely has more differences from the largely European-derived reference, but it's also more complete, so it's just indicative). This aligned an additional 4,100 reads, or 0.1% - which is actually pretty good, because it indicates the expanded GRCh38 reference used above is fairly complete, but there's still room for individual/personal improvement.

Next I used the Nanopore WGS Consortium reference from https://github.com/nanopore-wgs-consortium/CHM13 - they've recently published a *complete* reference of the X chromosome, from end to end. Using this alone aligned 7,751 more reads, or 0.2%. This likely also captured some reads from telomeric, centromeric etc. regions on other chromosomes, although with only the X chromosome sequenced end to end it's hard to tell just how specific it is. Last, I used the unpolished Canu assemblies from the same source, as they say "Unpolished Canu assemblies are available below for each data release and may be a more suitable basis for the structural analysis of other chromosomes". This yielded 2,367 further reads - about 0.06%, which actually works as a good check that we appear to have found most of the mappable reads.

Of course, the big question is that there are still 572,259 out of 4,049,538 reads, or 14%, that don't map to any of those references. The next stage might be using BLAST to try to figure out if they're something recognizable. An assembly and/or polishing stage would help to error-correct them, possibly yielding better matches to the reference. And of course, for the whole dataset it would be possible to cluster reads that map to the same region (like chr22) in different assemblies, and then de novo assemble those regions individually. Although in the grand scheme of things, it seems the additional references added under half a percent of mapped reads, so they may be of little benefit to genome assembly. And I'm wondering how many unmappable reads we get on Dante Labs ONT...

Donwulff
09-24-2019, 09:52 PM
On the mobile phone, that does look like a book ;)

I started running a BLAST search on the unmapped sequences of the NA19240 Oxford Nanopore PromethION research sample against current reference sequences, and it seems the majority of the reads hit the human centromeric regions, with up to 1000bp matches. For some reason the BLAST hits cluster on chromosome 18's centromeric region, but the centromeric satellite repeats are quite similar, so I'm not sure if that's specific. Chr18 could just be among the first matches BLAST finds, or it could be the most complete centromeric reference. https://genome.cshlp.org/content/24/4/697.full concerns their construction, although there's been progress at directly sequencing them (mainly via Oxford Nanopore sequencing) since then.

Reference genomes used for short-read genome mapping usually mask out the centromeric and other repeat regions, because short reads can't resolve things like the 171bp AT-rich alpha satellite repeats https://www.nature.com/articles/s41467-018-06545-y so it would be wasted processing and storage; the reference genome I use does so as well, although not for chr18. ("The file unmasked_cognates_of_masked_CEN_PAR.txt gives the locations of the unmasked cognate of those centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22 that were hard-masked with Ns because they are exact duplicates.")

This raises the question of whether it would be better to use a reference genome with complete centromeric, telomeric etc. models instead. However, the incremental mapping against the complete unmasked NA19240 reference sequence, the end-to-end X chromosome sequence and the ONT reference sequence, which should include centromeric sequences, resolved only a small number of additional reads. One reason might be that minimap2 has some built-in filters for processing efficiency; adjusting the minimizer frequency & minimizer chaining parameters might allow for more matches, although the centromeric regions may just be too repetitive and have too much structural variation for minimap2 to comfortably tackle.

Instead of the number of reads, I might have to do the statistics on combined read length. The average read length for Run 1 of the sample was 15,559.4 bases, but the final unmapped reads had an average length of only 4,929.77 bases. While shorter reads also give minimap2 less data to work with in finding matches, this could also be because shorter reads are more likely to span only repeats and no flanking recognizable sequence. Or ONT could have a read-length bias for them. This does mean, however, that only about 5% of the total sequence (but 15% of reads) is unmappable with minimap2 long-read settings. The above reference says that alpha satellites make up about 3% of the genome, and I could imagine pericentromeric/telomeric satellites taking up another 2%, which would explain the amount of unmappable reads. I'm trying to find references for the relative amounts, and for any possible ONT biases in centromeric regions.

While the female research sample does not contain a Y chromosome, this should be fairly relevant for Y chromosome sequencing, because the majority of the Y chromosome consists of highly repetitive heterochromatin like the centromere and telomeres. https://www.theatlantic.com/science/archive/2018/03/y-chromosome-sequencing/556034/ I think the heterochromatin itself is not subject to high interest currently and is usually intentionally ignored, although there are some large-scale structural variations related to it. The Small Supernumerary Marker Chromosomes http://ssmc-tl.com/Start.html for example are pretty interesting. But reads which are ONLY heterochromatin, e.g. from centromeres and telomeres, should probably be excluded from analysis in any case.

Summary: Okay, I can map most of the euchromatic ONT reads with minimap2. Additional/more complete reference genomes add only a little to the mapped reads, though for reference-guided assembly every bit of different DNA helps. There aren't any major hits to the (101bp-match-filtered) expanded human oral microbiome database from the blood-derived ONT sequence, so it should be possible to use it to eliminate oral microbiome from a saliva sequence. If heterochromatin reads are filtered, reference-guided assembly of reads that match a known region plus unmapped human euchromatin reads should be very feasible. It seems like comparing reads against the chr18 centromere model with a BLAST-like algorithm would filter out at least most centromeric reads, although I should still find out why chr18 in particular.

Donwulff
10-01-2019, 07:31 PM
And we have FASTQ! I mean... at least I have FASTQ. Or rather, Dante Labs has my FASTQ... One qualm I have is that they've added all sorts of security keys to the download, and though that's sweet for security, now I have to jump through all sorts of hoops to get it to download with a download manager (no pasting URLs, have to use the integrated one) or to another location with more space (JS download link without "save to", so I have to set the default download location instead). It's from Amazon S3 as usual, though the download doesn't seem to saturate the 4G connection I'm using, so it'll be a long wait.

The main page for the sequence jumped straight from "Waiting for QC" to "Your Results Are Ready", as I thought might happen. When I click on "Visit Report" it says "We are still processing your DNA samples", as I think others have reported as well; no structural variation VCFs or any reports. Oddly enough I could still buy specific panels, though the one free report credit I had before seems to have disappeared. FASTQ is in the "Raw Data Library", as I thought.

Regarding the NA19240 research sample, I appear to have spoken too hastily and did the unmappable-read analysis incorrectly. While trying to figure out why everything seemed to match chromosome 18, I realized that even when BLAST is asked to output a single alignment only, it will still list every alignment in the matched contig for the query sequence... So a centromeric sequence would get matched against every possible location in the chr18 centromeric region (the largest in the human genome), hugely inflating the match count for centromeric sequences. I'm going to look at it more carefully before making more proclamations about the contents of the unmappable reads (the high amount of noise in long-read sequencing makes them challenging to identify; they might just be too noisy to map properly), and repeat the same analysis on my own genome. Of course, Dante Labs ONT sequences have the added challenge that we KNOW they should contain a substantial amount of oral microbiome, so identifying the unmappable reads will be even more challenging, but useful for any de novo assemblies.
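
Assuming BLAST+, one way to stop a repeat-rich read from inflating the counts is to cap both the number of subjects and the HSPs per subject (database and file names hypothetical):
blastn -query unmapped.fa -db GRCh38 -outfmt 6 -max_target_seqs 1 -max_hsps 1 -num_threads `nproc` > besthits.tsv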

Donwulff
10-02-2019, 08:49 PM
I have to see what people used for the analysis of their FASTQs, because mine looks totally lackluster so far. The minimap2 run hit swap on 32GB right away, which I now suspect may partly be because of the short read lengths. Two flowcells, 88 gigabases (82G compressed), so not even the >90 promised for 30X (excluding the metagenome...). Average read length 9,765. The N50 will matter more, but I have to see what to analyze it with once the minimap2 run finishes eating up my memory. This certainly doesn't meet the specification they gave; it'll probably meet the N50 only on a technicality, since on the order page they said "N50>8000" but "average N50>20.000", which is presumably an average over all the samples they process. Maybe they were waiting to re-sequence it, but instead decided to stop long-read sequencing altogether? Although the product is still hidden in their store.

JamesKane
10-02-2019, 11:05 PM
I'm in the process of retrieving mine. It is only 72G compressed, but it will take the better part of the day to download. Discussions with others indicate they don't believe Dante will be able to hit the advertised N50>20,000bp with the saliva kits being used. We will see.

As a note the Long Read test is still there on the US version of the site. Nothing being hidden at all.

Donwulff
10-03-2019, 02:38 AM
*cough* I'm going to claim I meant median read length, not mean ;) I had a small error in my read-length calculation, because I was calculating it from file stats but forgot that the quality line counts as a line but not as a read... In other words, quick & dirty tool scripts are still hard. I've corrected to the right average above.
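For the record, a one-liner that counts only the sequence lines (every fourth line, offset 2) sidesteps that trap:
zcat *.fastq.gz | awk 'NR%4==2 {n++; bases+=length($0)} END {if (n) print bases/n}'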
However, I found NanoStat https://github.com/wdecoster/nanostat which claims the following stats (no nice graphics unfortunately, so I have to keep looking!):
General summary:
Mean read length: 9,765.0
Mean read quality: 8.4
Median read length: 3,401.0
Median read quality: 9.0
Number of reads: 9,033,335.0
Read length N50: 28,812.0
Total bases: 88,210,112,208.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 7648226 (84.7%) 82124.0Mb
>Q7: 6419935 (71.1%) 72075.9Mb
>Q10: 3173309 (35.1%) 39309.9Mb
>Q12: 221868 (2.5%) 2001.4Mb
>Q15: 25 (0.0%) 0.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 16.1 (244)
2: 15.9 (251)
3: 15.8 (821)
4: 15.8 (230)
5: 15.7 (326)
Top 5 longest reads and their mean basecall quality score
1: 993432 (3.1)
2: 660357 (3.8)
3: 321815 (4.4)
4: 288245 (3.1)
5: 232994 (9.1)

I'm surprised at the N50 this tool claims. https://anthrogenica.com/showthread.php?16842-Dante-Labs-Long-Read-Test/page5 had some N50 values reported, which varied widely. The Oragene saliva kit specs say >23k bp, and Dante Labs switched to a different supplier (I took note but forgot which), claiming that was to improve DNA quality & yield. Of course, that doesn't guarantee that sample preparation and ONT chemistry will be able to read fragments that long. But 29kb is certainly good, if it's genuine.
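
N50 is also easy to sanity-check by hand: sort the read lengths in descending order and report the length at which the running total crosses half of all bases. A simple, if memory-hungry, sketch:
zcat *.fastq.gz | awk 'NR%4==2 {print length($0)}' | sort -rn | awk '{len[NR]=$1; total+=$1} END {half=total/2; for (i=1; i<=NR; i++) {run+=len[i]; if (run>=half) {print len[i]; exit}}}'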

The first flowcell was started on the 26th of June, and the second on the 27th of July. As far as I know, the PromethION does basecalling in real time, so that would mean they sat on the FASTQ for two months, while I was asking if the QC had failed, before making it available. Of course, it may be that their bandwidth to AWS S3 is limited, especially given they have to upload a lot of samples there, perhaps via a monthly AWS Snowball (or if not, then they should do that!). Of note, I never received a DNA quality report as I did for the short-read WGS, so it's possible they don't do real sample QC for ONT but just load the sample on and hope for the best?

The EU site doesn't appear to have the Long Reads anymore, though I'll have to check the US site from time to time... There's a page for "all" products on their web store, though it doesn't appear to be linked from anywhere. I don't want to share it publicly, because I don't know if the products there are really available. It does have things like "Alignment to the GRCh38", "ACMG Panel" and "Upgrade Intro to Super Premium" at a higher price than Super Premium itself...

I'm splitting the FASTQ into the separate flowcells to run separate statistics on them. And perhaps finding a better statistics script...

Donwulff
10-03-2019, 03:15 AM
Meanwhile the Minimap2 run finished (with default ONT settings), and I was playing around with simple stats.
This gets the number of reads on each contig, filtering out secondary hits (ie. best mapping only) but leaving in supplementary mappings, in case parts of a read match multiple contigs.
samtools view -F0x100 sample2.bam | cut -f3 | sort | uniq -c | sort -nr > sample2.chr
The resulting list is kinda unwieldy, so let's group them a bit. Again, I'm using my expanded reference genome from the "Dante Labs technical" thread.
awk '/NCBI|SEQF|anae|vpar|smit|csho|crec|cgra|tlec|presp|bext|lbuc|einf|smoo/ { ehomd+=$0; next } /_random/ { random+=$0; next } /HLA/ { hla+=$0; next } /_alt/ { alt+=$0; next } /_fix/ { fix+=$0; next } /chrUn/ { if(/decoy/) decoy+=$1; else chrun+=$1; next } /chr[0-9XYM]+$/ { print; next } /ML|KZ/ { ncbi+=$0; next } 1; END { printf "%7d decoy\n", decoy; printf "%7d chrUn\n", chrun; printf "%7d NCBI\n",ncbi; printf "%7d alt\n",alt; printf "%7d fix\n",fix; printf "%7d HLA\n",hla; printf "%7d random\n", random; printf "%7d eHOMD", ehomd; }' sample2.chr | sort -nr
1860474 *
672680 chr1
626050 chr2
498045 chr3
468195 chr4
454894 chr5
426605 chr6
421128 chr7
377188 chr10
362316 chr8
350822 chr11
348794 chr12
340636 eHOMD
324224 chr9
252212 chr16
249108 chr13
241680 chr17
234028 chr14
214521 chr15
205731 chrX
197359 chr18
186072 chr20
183635 chr19
119245 chr22
116454 chr21
109803 fix
106526 alt
70583 chrY
60103 decoy
51091 chrUn
44221 random
42493 NCBI
3414 chrM
193 HLA

Since Minimap2 isn't alt-aware, this is troublesome. I'm currently working on using LiftOver/CrossMap to fold the alt, fix and HLA contigs over to the primary assembly. The second, and likely biggest, problem is fixing Mapping Quality: without an alt-aware mapper, MQ goes down when a read matches multiple alternate contigs, ie. it doesn't look like a unique mapping - even though biologically it is.

* = unmapped reads; this seems to be 20.6% of the reads, which is a huge amount.
eHOMD = Expanded Human Oral Microbiome Database reference sequences; 3.8% here, though I think oral DNA kits say 11% microbial on average
alt = alternate contigs, representing a different sequence for part of the genome in some populations
fix = contigs fixing an error in the reference assembly; these are meant to be incorporated into the assembly but would change coordinates etc.
decoy = sequences found in human genomes in various research projects, but not placed on the reference genome; might be contaminants or personal pieces of genome
chrUn = unplaced contigs, sequence included in the human genome which hasn't been definitely localized to any chromosome
random = contigs which have been localized to specific chromosome, but their exact location on the chromosome is unknown
NCBI = GenBank NCBI accession prefixes, contigs which haven't yet been placed on the UCSC reference genome
HLA = Human Leukocyte Antigen, highly polymorphic region of human genome

Donwulff
10-03-2019, 12:59 PM
There's NanoPlot from the same author as NanoStat. Most of the graphs didn't come out very informative, I feel. Also, I have to start running these on different samples to compare. This one is interesting, but with the extra contigs & no comparison point... The peak is at 91% though, which is certainly in the ballpark of what we expect from ONT.
[attachment 33616]

The prettiest graphs for the first & second flowcell. The first flowcell was in many senses a dud, yielding only 28 gigabases (are these flushed flowcells?) vs. 59 gigabases on the second. At least that means the majority of the sequence is higher quality. Note the logarithmic read length scale.
[attachment 33617]
[attachment 33618]
If they're using flush kits, it should certainly be on a single customer's sample, because there will be some small amount of carry-over contamination from the previous sample. Here the second run is better, with 2735 active channels vs. 2506 on the first. No evidence of them using a flush though, but considering the extra yield it gives, they should certainly consider re-running the same sample after a flush.

Command: NanoPlot -t8 --fastq_rich sample2.fastq.gz -o plot/ -p both --plots kde hex dot pauvre --store --raw --loglength --N50
Threads didn't seem to work, though disk IO is always the limiting factor. I'll have to try non-logarithmic plots from the saved stats to see if that produces more useful graphs, as well as limiting the max read length displayed. Separating flowcells had to be done manually: grep and cut the flowcell ID from the header lines to find the flowcells, then grep -n | head on the flowcell ID to get the beginning line, and zcat | head -n xxx & zcat | tail -n +xxx to separate them. If there are more than two, you'll need to partially repeat, or write an actual script that separates everything in a single pass, like the sketch below ;)
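Something like this awk sketch would do it in one pass, assuming each header line carries a flow_cell_id=... field (adjust the pattern to whatever your headers actually contain; the output file names are made up):

zcat sample2.fastq.gz | awk '
  NR%4==1 { fc="unknown"                    # header line: find the flowcell ID field
            for(i=1;i<=NF;i++) if($i ~ /^flow_cell_id=/) { split($i,a,"="); fc=a[2] }
            out="flowcell_" fc ".fastq" }
  { print > out }'                          # all four record lines go to that flowcell's file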

JamesKane
10-03-2019, 01:04 PM
Mine was successfully completed with a single flow cell.

Statistics from pauvre (https://github.com/conchoecia/pauvre). I ran the wrong options so didn't get the margin plot this run. It'll be a few hours to get that.


# Fastq stats for 60820188476960.fastq, reads >= 0bp
numReads: 19131752
%totalNumReads: 100.00
numBasepairs: 78706152350
%totalBasepairs: 100.00
meanLen: 4113.901975626696
medianLen: 1857.0
minLen: 1
maxLen: 559571
N50: 9754
L50: 1945428

# Fastq stats for 60820188476960.fastq, reads >= 1000bp
numReads: 13399663
%totalNumReads: 70.04
numBasepairs: 75612661332
%totalBasepairs: 96.07
meanLen: 5642.877834464942
medianLen: 2881.0
minLen: 1000
maxLen: 559571
N50: 10455
L50: 1792231

# Fastq stats for 60820188476960.fastq, reads >= 5000bp
numReads: 3966864
%totalNumReads: 20.73
numBasepairs: 53284992391
%totalBasepairs: 67.70
meanLen: 13432.523119270032
medianLen: 9593.0
minLen: 5000
maxLen: 559571
N50: 16823
L50: 945495

# Fastq stats for 60820188476960.fastq, reads >= 10000bp
numReads: 1889345
%totalNumReads: 9.88
numBasepairs: 38799236221
%totalBasepairs: 49.30
meanLen: 20535.81332207723
medianLen: 16833.0
minLen: 10000
maxLen: 559571
N50: 22560
L50: 572474

Donwulff
10-03-2019, 08:31 PM
For comparison. I don't think I like pauvre, or I'm not reading enough documentation. You mean I should've redirected the output to a file?!

pauvre stats -f sample2.fastq.gz -H
Unable to init server: Could not connect: Connection refused
Unable to init server: Could not connect: Connection refused

(pauvre:22054): Gdk-CRITICAL **: 13:59:06.667: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed


# Fastq stats for sample2.fastq.gz, reads >= 0bp
numReads: 9033335
%totalNumReads: 100.00
numBasepairs: 88210112208
%totalBasepairs: 100.00
meanLen: 9764.955269344046
medianLen: 3401.0
minLen: 1
maxLen: 993432
N50: 28812
L50: 962008

# Fastq stats for sample2.fastq.gz, reads >= 1000bp
numReads: 6725749
%totalNumReads: 74.45
numBasepairs: 87037653671
%totalBasepairs: 98.67
meanLen: 12940.960727347987
medianLen: 5902.0
minLen: 1000
maxLen: 993432
N50: 29257
L50: 941818

# Fastq stats for sample2.fastq.gz, reads >= 5000bp
numReads: 3697975
%totalNumReads: 40.94
numBasepairs: 79091165709
%totalBasepairs: 89.66
meanLen: 21387.696160466203
medianLen: 15072.0
minLen: 5000
maxLen: 993432
N50: 32332
L50: 812666

# Fastq stats for sample2.fastq.gz, reads >= 10000bp
numReads: 2452664
%totalNumReads: 27.15
numBasepairs: 70281581720
%totalBasepairs: 79.68
meanLen: 28655.201739822496
medianLen: 23630.0
minLen: 10000
maxLen: 993432
N50: 35822
L50: 683261

The N50 difference here is huge. I can think of differences in library preparation protocol; a shorter read length might improve yields with fewer reads getting stuck in pores, though I don't know if that's a thing. Or filtering out the shortest reads if the yield is high enough already, but I do have a large number of shorter reads. The NA19240 FASTQ looks to have N50>12192 with a 278134bp maximum (my largest is almost a million bp, although that's an outlier), from NanoPlot/NanoStat on the largest FASTQ file I grabbed, albeit that was presumably with older Nanopore chemistry.

The marginplot is a dismal failure; I'm not even sure what they were trying to do with that. Perhaps it's because I don't have proper fonts... Although both marginplot and stats print the above summary, stats does a histogram CSV as well.

pinoqio
10-03-2019, 09:02 PM
Here's my nanostat output for comparison:

General summary:
Mean read length: 5,421.5
Mean read quality: 9.5
Median read length: 2,377.0
Median read quality: 9.6
Number of reads: 27,020,271.0
Read length N50: 13,606.0
Total bases: 146,489,346,783.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 27020271 (100.0%) 146489.3Mb
>Q7: 27020253 (100.0%) 146489.0Mb
>Q10: 10346713 (38.3%) 65533.0Mb
>Q12: 40660 (0.2%) 99.1Mb
>Q15: 1 (0.0%) 0.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 15.8 (1852)
2: 14.4 (1567)
3: 14.3 (313)
4: 14.3 (596)
5: 14.3 (618)
Top 5 longest reads and their mean basecall quality score
1: 239071 (10.2)
2: 195472 (10.9)
3: 193471 (10.4)
4: 192013 (9.8)
5: 188967 (7.5)

So my self-calculated N50 matches the NanoStat output. This is the same sample for which I posted the graphs.
Because I got 146 Gbases, I could theoretically remove shorter reads until I'm down to 30x, and then my N50 would be 24700 - so I'd say my sample is within spec.


cover shortest N50 avg median
45.8x 34 13606 5421 2377 (original)
40x 2603 17249 10181 5564 (truncated to 40x)
35x 4416 20786 14287 9059 (truncated to 35x)
30x 7097 24706 19353 14303 (truncated to 30x)
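For the record, the truncation calculation is just: sort read lengths descending, keep the longest reads until the target coverage is reached, then take the N50 of what's kept. A minimal sketch (target below is 30x of a ~3.1 Gb genome; the file name is an example):

zcat sample.fastq.gz | awk 'NR%4==2 { print length($0) }' | sort -rn |
awk -v target=93000000000 '
  { if(total+$1 > target) exit; total+=$1; len[n++]=$1 }   # keep the longest reads up to the target
  END { for(i=0;i<n;i++) { run+=len[i]; if(run>=total/2) { print n" reads, "total" bases, N50 "len[i]; exit } } }'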

Donwulff
10-03-2019, 09:42 PM
While the way it was expressed apparently wasn't clear, for truth's sake I'll still try to clarify that the "spec" for Dante Labs was N50>8000. The 20k was for the "average", which can't be a promise for an individual sample. It may indeed be that they don't hit a 20k average between samples, but that would be hard if not impossible to show. JamesKane has the same problem I do, though: that's not 90 gigabases (significantly less, actually) and it's definitely not 30X over the human genome. I'm still seeing what I can get out of it; the mappable & unique lengths are certainly more important than simply looking at raw stats. Perhaps interestingly, the second flowcell had almost 30k N50 for me, the first one 28k.

I'm thinking that for casual use, a better mapping strategy could be to first map the FASTQ against the primary assembly only, from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz (saving a lot of memory etc.), and then map the unmapped reads separately to all the non-primary contigs (possibly primary + non-primary will do, it just takes more memory and resources), because the unmapped reads should have no matches to the primary assembly. I think that's pretty much what alt-aware mapping *tries* to do in a single pass, and it saves a lot of memory. It doesn't do the alt-mapping post-processing though, and I want to do the CrossMap because I'm looking at using completely different assemblies to hunt for novel contigs.
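A hedged sketch of that two-pass idea (map-ont is minimap2's ONT preset; the file names, and the choice of second reference, are placeholders):

# pass 1: primary assembly only
minimap2 -ax map-ont -t8 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz sample.fastq.gz | samtools sort -o primary.bam -
# pass 2: re-map only the leftovers (flag 0x4 = unmapped) against the non-primary contigs
samtools fastq -f 4 primary.bam > unmapped.fastq
minimap2 -ax map-ont -t8 nonprimary_contigs.fna.gz unmapped.fastq | samtools sort -o rescued.bam -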

Also the argument can be made that the decoy & oral microbiome should be there to prevent those reads from mapping into wrong places on the primary assembly, but I think that's a lot less likely with long reads, which have sufficient context to map uniquely. Also there are probably dozens of basic combinations that should be tried on a large number of sequences of different quality to really figure out which is best, so meh...

I'm over-optimizing/editing this, but since it looks like Minimap2 takes just over 16 gigabytes even on the primary assembly alone, memory-optimizing the run may not matter that much (although I like having enough memory left over to sort the BAM). I'll probably just end up aligning the whole FASTQ against different reference genomes and following the fate of each individual read, though I'd rather not!

pinoqio
10-03-2019, 09:54 PM
You're right, the spec is 30x / N50>8k.
Other than N50, the big difference here is the Q5 / Q7 values - but at 100% these must have been either filtered, or the (older?) basecaller was overconfident and simply never output quality scores below 7.

Edit:
One thing that is also interesting to look at is the ultra long reads. I must have lost my records but I remember looking for my longest read in the BAM, and it wasn't actually a continuous read, but rather seemed to be a few reads stuck together with the basecaller not detecting the fact. I believe it was mapped with half the sequence cut off.

Donwulff
10-03-2019, 10:12 PM
On the mapped (complete primary assembly + all the rest) NanoPlot, the lower-identity reads interestingly seemed to split into two base quality groups, one just around 5 and one around 7. I thought that might be the two flowcells, although it's hard to tell because I don't have separate mappings of the two flowcells (at least yet...). The individual FASTQ reports do show both flowcells had qualities go well below 7 though.
[attachment 33630]
I'm not up to date on whether they've made a new basecaller release in the meanwhile. Generally, of course, I think qualities should be going upwards.
I was curious about the ultra long reads myself, and was wondering if there's some easy way to find out the longest reads, or if I'll just have to script & go through the whole file.
OOh right, I just saw the NanoPlot/NanoStat report actually gives read ID's of the longest five, haha.
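Without NanoPlot, a one-liner does the same (file name is an example):

zcat sample2.fastq.gz | awk 'NR%4==1 { id=$1 } NR%4==2 { print length($0), id }' | sort -rn | head -5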
The aligned read lengths vs. sequenced read lengths plots in the report don't work very well, because it's auto-ranging the reports to that one 1 million bp read, I think I can cut it off at about 150k bp or so.

On this one the auto-ranging works: it shows that the apparent quality is best on reads below 10k basepairs, while there is a clear extension along the 91% identity towards higher read lengths (20k or so on the logarithmic scale?). Although this does include the metagenome and decoys. Genuine structural variation, and DNA breaks & repair, figure into it as well, not just the basecaller missing where one strand ends and another begins.
[attachment 33631]

JamesKane
10-04-2019, 09:47 AM
For comparison with the other NanoStats report, even though I prefer the graduated length bucketing. The whole point of the test was to get much longer reads to polish with the short reads in my existing 30x WGS.



General summary
Number of reads: 19,131,752
Total bases: 78,706,152,350
Median read length: 1,857
Mean read length: 4,113.9
Read length N50: 9,754

Top 5 longest reads and their mean basecall quality score
1: 559,571 (10.49)
2: 541,309 (10.01)
3: 479,304 (10.09)
4: 422,309 (10.79)
5: 343,730 (10.01)

Top 5 highest mean basecall quality scores and their read lengths
1: 373 (20.13)
2: 407 (19.95)
3: 500 (19.90)
4: 302 (19.77)
5: 237 (19.76)

Number of reads and fraction above quality cutoffs
Q5: 19,131,752 (100.0)
Q10: 18,942,292 (0.0)
Q15: 5,167,403 (0.0)
Q20: 1 (0.0)
Q25: (0.0)



And now I'm noticing I'm getting a different output format from what should be the same tool's bundle. Interesting...

The FASTQ is now aligned with MiniMap2 as well to do some quick visual inspections around some regions of interest. Now to review my notes for which read polishing workflow looked the most promising.

JamesKane
10-04-2019, 11:30 AM
Has anyone who received their long reads back started comparing the results with a short read test? If so, what's your call match rate? I've spot checked a dozen Y chromosome sites and so far nothing matches downstream of R-DF13. It's to the point I'm doubting the file is my sample.

Confirmed: This is not my sample. The signature is R1b-M222. While it matches my surname, it's not my haplogroup. Time to contact them.

Donwulff
10-04-2019, 11:31 AM
And now I'm noticing I'm getting a different output format from what should be the same tool's bundle. Interesting...

The FASTQ is now aligned with MiniMap2 as well to do some quick visual inspections around some regions of interest. Now to review my notes for which read polishing workflow looked the most promising.

It seems your quality scores span a much narrower range as well. I mapped the flowcells individually, against the "analysis ready" primary assembly only, and as predicted I'm noticing a very striking difference between the two flowcell runs. Particularly on the identity, ie. read correctness (here against *aligned* read length).

[attachment 33641]
[attachment 33642]

I might go as far as to say run 1 (at about 30 Gbases) is garbage, but perhaps I'll get some use out of it for long-range scaffolding. The sweet spot here is definitely below 10k bp length, maybe 2000bp-8000bp, though some reads are longer.

Most of the polishing workflows seem to rely on de novo assembling the long reads and/or using expected bases to guide signal-level basecalling. I couldn't find a reference-guided assembly workflow, but I will attempt assembling a continuous part of a chromosome from mapped reads when I'm satisfied with the mapping.
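Roughly what I have in mind, as a hedged sketch (the region, file names and assembler invocation are examples; Flye stands in for whichever long-read assembler):

samtools index sample.bam
samtools view -b sample.bam chr20:1000000-2000000 > region.bam   # reads mapped to one region
samtools fastq region.bam > region.fastq
flye --nano-raw region.fastq --out-dir region_asm --genome-size 1m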

Edit: And I forgot the most important point: I was using NanoStat/NanoPlot from pip3. I don't know if the Git repo has any major changes. NanoPlot is certainly more informative, and includes the stats section in the HTML report.

Donwulff
10-04-2019, 11:43 AM
https://anthrogenica.com/showthread.php?16842-Dante-Labs-Long-Read-Test&p=569436&viewfull=1#post569436 has comparison between different technologies.

Donwulff
10-04-2019, 01:38 PM
Here's a phylogenetically relevant variant from the heterochromatic region DYZ19 on my sequences. The area right of this is pretty much unmappable with short reads. Actually, what I'm most wondering about is that I just noticed YFull says my Big Y read depth is 2 for this variant, but it's 27 in GenomeBrowse. Dante Labs WGS has only 3 reads mapped here, possibly due to lower read length.

[attachment 33650]

The noisiness of the long reads is immediately apparent; we're expecting about 91% correctness here. Still, the phylogenetically relevant variant, the yellow C near the right edge, is easily apparent, as is the A to its right. On the left we have some weirdness: the leftmost C is nowhere to be seen on short reads, the G looks right, but there's no sign of the A or T near the middle.
Based on the right-edge C there's no question this is the same sample, though the errors I'm seeing on this stretch don't seem to really match what I've heard is typical for Oxford Nanopore: CG/GC methylation errors and length mismatches (usually deletions) on runs of the same nucleotide. Of course, it's possible some errors are in the short read sequencing.
Read depth here is barely 10, with ~15+ expected, but I don't think higher read depth would help with most of these calls. (One read is hidden because it maps to too many places and is clearly erroneous.)

Donwulff
10-04-2019, 07:33 PM
I wonder if there's any way to tell which flowcell version & basecaller Dante Labs is using. The flowcell ID might translate to a flowcell revision, and if the PromethION's internal basecalling is used, there probably aren't many alternatives. I'm a bit struck by the high systematic errors, on two flowcells there though, especially as I don't seem to recognize the two named "typical errors" of ONT. I guess I'll just have to have a crash course or something on ONT error profiles, but that's not quite what I'd expect if the basecaller is working right. On the other hand, ~90% identity is right; errors are expected. I'm thinking about what kind of analysis or processing to try on that. Standard Base Quality Score Recalibration might reveal something about the error profile, though, and with "exact" calls available from short read sequencing & the nanopore structure one could do better. That's just re-inventing the basecaller on some level, though, but surely someone has done that already (and yeah, there's signal-level polishing if signals were available).

Donwulff
10-04-2019, 08:42 PM
Also a chance discovery: "The Guppy qscore distributions (Additional file 1: Figure S6) show that while the selection process removed the lowest quality reads, the resulting reads still span a wide quality range, with 458 (∼ 3%) falling below ONT's 'fail' threshold of Q7." So the reason some sequences have no quality lower than 7 is that they must've turned the default filter off, presumably to get close to 90 Gbases with fewer flowcells. With the list price of flowcells their price is still a steal, and I agree it's nice to have access to all the data without filtering, but there's something to be said about meeting customer promises, aka contracts. If they can't meet those with the flowcells, they need to up the price, not return garbage; that won't even win them goodwill or reputation. With the first flowcell run having been so bad, there's only 72 Gbases of reads in total meeting Q7 (~80% accuracy), and a significant portion of that is metagenome. As the Y chromosome sample shows, this is 20X sequence at best.
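If you want to apply ONT's Q7 pass threshold yourself, NanoFilt (same author as NanoStat/NanoPlot) is one option - a hedged sketch, file names being examples:

zcat sample2.fastq.gz | NanoFilt -q 7 | gzip > sample2.q7.fastq.gz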

The systematic errors would in all likelihood be methylation in the heterochromatin, esp. if they're not using a modified-base-aware basecaller, although I'm scratching my head even at the methylation possibility. One of the A/T's could be 6mA I suppose https://www.cell.com/molecular-cell/pdfExtended/S1097-2765(18)30460-X but it doesn't look typical. I don't pretend to know much about methylation/modified bases though; there's probably a much wider range than I'm aware of. Oh, also just realized that since the pore encompasses about five nucleotides (R9.4 pore), the methylation site could theoretically be four nucleotides away from the error, if it's not a methylation-aware basecaller. The Dante Labs product page & I have stressed from the start that the Long Reads are mainly for structural variant detection; they should not be relied upon for short variants (SNV/INDEL). Trying to see if we can have a computationally efficient way of filling that gap with short reads, though (which would improve the alignment of the short reads, because you'd have a "personal genome reference" rather than one for the whole world).

Another edit: The serendipity is strong with this one. Found https://www.nature.com/articles/s41467-019-11713-9 which actually deals directly with detection of 6mA from Oxford Nanopore reads, so more reading I guess. (Oddly enough the link in ONT Digest's article leads to a completely different article.) Yeast RNA, but it's a start. Also their algorithm works on FASTQ.

Donwulff
10-06-2019, 03:09 AM
The term polishing generally means improving/finalizing a genome assembly, and it certainly seems to be getting the spotlight in ONT research right now. The more general term of interest is hybrid read error correction, however. The major problem is that there is a huge number of different approaches, and afaik no clear guidance on what might work best. Second is that they definitely don't go easy on the computing resources either; having an assembly to map against is in some ways much easier. But anyway:
https://doi.org/10.1186/s13059-018-1605-z "A comparative evaluation of hybrid error correction methods for error-prone long reads", Fu, S., Wang, A. & Au, K.F. Genome Biol (2019) 20: 26.
https://www.biorxiv.org/content/10.1101/476622v2 "Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data" Leandro Lima et. al Briefings in Bioinformatics, bbz058
LoRDEC has very flat memory usage in the DNA comparison, suggesting it's the only one which might fit on an up-to-32GB home computer. But the third problem is I'm not sure if there are any benchmarks of their use on current generation ONT human genomes. You would think this problem has been solved numerous times already, and the challenge is just finding the existing solution.
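A hypothetical LoRDEC invocation, with parameter values taken from its documentation examples as I recall them (file names are placeholders):

lordec-correct -2 short_R1.fastq.gz,short_R2.fastq.gz -k 19 -s 3 -i long_reads.fastq.gz -o long_reads.corrected.fasta

where -2 gives the short reads used to build the k-mer graph, -k the k-mer size and -s the solidity threshold.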
Going straight to de novo genome assembly, without losing possible data to error correction first, makes sense IF you can spare the computing power, which large/institute research projects generally can. For commercial/consumer use I still think it may make more sense to approach the problem progressively: since the majority of the reads map against existing reference(s), grab clusters of overlapping reads, assemble and error-correct/polish those as desired, and possibly use that as the next cycle. Granted, there are also graph-based alignment programs like HISAT2 which might be able to just construct the local reference on the fly.
Sadly, I don't have the computational resources (and time) to try everything...

Donwulff
10-18-2019, 12:34 PM
Calling genotypes from the Long Reads wasn't high on my priority list, since I know ONT performs poorly at that, but the earlier post about a suspected sample swap got me curious. Checking various phylogenetic Y-SNPs and autosomal variants between the short read & long read sequencing has already convinced me my sample is correct. But it's probably a good idea to check that the sample tube barcode, the account sample number and the raw data sample number all match.

Unfortunately for genotype calling, Clairvoyante and its successor Clair (claimed by the author to be the best variant caller for long-read sequencing) are trained on NGMLR genome alignments, so for compatibility and comparison purposes I'm re-aligning the long reads using NGMLR... 60 CPU-days and counting so far. It would probably be faster to re-train the variant caller on Minimap2 alignments, to the degree that the aligner choice even matters for the results.

Meanwhile I also aligned the long reads to the "analysis ready" GRCh38 set with no alt or decoy sequences (except for the standard Epstein-Barr virus sequence).
[attachment 33959]

The results are not quite as terrible as I had feared from the NanoPlot QC runs; I'm guessing a substantial amount of even those <Q7 (over 20% errors) "garbage" reads are aligning, so the autosomal chromosomes reach about 28X read depth. The Y chromosome near the very right edge isn't very clearly visible in this graph; most of the Y chromosome is heterochromatin that doesn't sequence well in any event, which is reflected in the genome reference itself.

The primary use for long-read sequencing is perhaps detecting large structural variants. To that effect I ran the Minimap2-aligned sequence through Sniffles; https://genome.cshlp.org/content/early/2019/06/11/gr.244939.118 suggests this may currently be the best method for SV analysis of long reads. Of note for anyone thinking of doing this themselves: Minimap2 needs to be either run with the --MD option (in addition to "-ax map-ont" for SAM-formatted ONT alignment, and the desired number of threads with something like -t8), or if aligned without --MD, then "samtools calmd" can re-calculate the MD tags for Sniffles.
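Put together, the pipeline described here is roughly (reference and file names are examples):

minimap2 -ax map-ont --MD -t8 GRCh38.fna sample.fastq.gz | samtools sort -o sample.md.bam -
samtools index sample.md.bam
sniffles -m sample.md.bam -v sample.sv.vcf
# or, for a BAM aligned without --MD:
samtools calmd -b sample.bam GRCh38.fna > sample.md.bam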

The next big question is what to do with the structural variants. You could use AnnotSV at https://lbgi.fr/AnnotSV/runjob; however, some of the ONT/Sniffles-derived structural variants are larger than AnnotSV can handle and it crashes. I corrected the problem in my local copy of the tool, however some of the annotation resources are only available to non-profit institutes/hospitals, so I can't do GeneHancer annotations from GeneCards or up-to-date OMIM disease genes, for example. Presumably due to the lack of the GeneHancer resource I can't get the AnnotSV structural variant pathogenicity classification/ranking. Worth noting at this point that structural variant interpretation is even more "bleeding edge" than single nucleotide variant classification; many if not most variants still have conflicting pathogenicity evidence and classification. It would still be interesting to see the results, of course.

I'm not really sure how I would verify the SV calls themselves, though for starters I ended up running Sniffles itself on my short-read sequencing results. While it's not meant for short-read sequencing, it would seem the majority of the ~50 basepair structural variants (small enough to be completely contained in the 100bp short reads) match in both datasets. Dante Labs did not provide me with structural variant calls for the short-read sequencing, and I'm still evaluating the best way to do structural variant calls on it (in fact, GATK's structural variant pipeline failed to run for me for some reason).

Dizos9
10-19-2019, 08:37 AM
I'm looking to get my whole genome sequenced, or at least the portions relevant to health issues. Does anyone recommend Dante labs?

tontsa
10-19-2019, 06:16 PM
Dante has gotten better as of late... though bear in mind they might face delays again after Black Friday. So be prepared to wait 6-18 months for delivery. SanoGenetics has also now launched a WGS kit, and they at least claim to provide better curated reports than what Dante provides.


I'm looking to get my whole genome sequenced, or at least the portions relevant to health issues. Does anyone recommend Dante labs?

MacUalraig
10-20-2019, 03:08 PM
I'm looking to get my whole genome sequenced, or at least the portions relevant to health issues. Does anyone recommend Dante labs?

You've posted in a thread dedicated to the LONG READ (Oxford Nanopore technology) test at Dante, so no, I wouldn't recommend it. Would I recommend their standard short read WGS? Sorry, but right now I wouldn't recommend that either. And I've done both. Or rather, I paid for them and sent my saliva in. This is the main Dante thread:

https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)

pmokeefe
10-23-2019, 02:24 AM
I just noticed two VCF files were available for download for my Dante Labs Long Read kit:
56001801069032.filtered.snp.vcf.gz 337MB
56001801069032.filtered.indel.vcf.gz 101MB

I've barely had a chance to look at them, but they appear to be aligned with GRCh37.
I did check the SNP for my terminal Y haplogroup, which was there, so hopefully it's actually my DNA :)

This is a link to a public folder on Google Drive which should contain my two vcf files: https://drive.google.com/open?id=1MkJ07RRpyDGjs1UHUCfs4BSRGS36PM_v

I was expecting and hoping for the FASTQ files, not VCFs; hopefully the FASTQs will show up at some point.
Has anyone else received VCF files supplied by Dante Labs for their Long Read test?
Did anyone get both VCF and FASTQ files?

pmokeefe
10-23-2019, 06:12 PM
Is it possible that my "Long Reads Whole Genome" results might actually be from a "short read" WGS test? Here's the beginning of the VCF file I downloaded from the Dante Labs site for my "Long Reads" test (the formatting appears a little off):

##fileformat=VCFv4.2
##DRAGENCommandLine=<ID=HashTableBuild,Version="SW: 01.003.044.3.3.5, HashTableVersion: 7",CommandLineOptions="/opt/edico/bin/dragen --lic-instance-id-location /root/.edico --build-hash-table true --ht-reference /data/input/appresults/126241120/hs37d5.fa --ht-build-rna-hashtable true --enable-cnv true --ht-alt-aware-validate true --output-directory /data/scratch/hs37d5">
##DRAGENCommandLine=<ID=dragen,Version="SW: 05.021.332.3.4.5, HW: 05.021.332",Date="Thu Oct 17 18:48:11 UTC 2019",CommandLineOptions="--lic-server https://XXXXXXXXXXXX:[redacted]@license.edicogenome.com --lic-instance-id-location /root/.edico --output_status_file /data/scratch/progress.log --enable-map-align true --enable-map-align-output true --output-format BAM --enable-duplicate-marking true --enable-cnv true --cnv-enable-self-normalization true --cnv-segmentation-mode cbs --enable-bam-indexing true --enable-variant-caller true --enable-vcf-compression true --vc-enable-bqd true --qc-cross-cont-vcf /opt/edico/config/sample_cross_contamination_resource_GRCh37.vcf --output-directory /data/output/appresults/7764758/56001801069032 --intermediate-results-dir /data/scratch/intermediate --output-file-prefix 56001801069032 --fastq-list /data/scratch/fastq_sheet.csv --ref-dir /data/scratch/hs37d5-cnv-anchor.v7 --skip-vc-on-contigs NC_007605,hs37d5 --enable-sv true --sv-call-regions-bed /data/scratch/SV_hs37d5.bed.gz --sv-reference /data/scratch/fasta/hs37d5.fa">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -39911.9977">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -39911.9977 <= x < -3.7324">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=ICNT,Number=2,Type=Integer,Description="Counts of INDEL informative reads based on the reference confidence model">
##FORMAT=<ID=LOD,Number=1,Type=Float,Description="Per-sample variant LOD score">
##FORMAT=<ID=MB,Number=4,Type=Integer,Description="Per-sample component statistics to detect mate bias">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PRI,Number=G,Type=Float,Description="Phred-scaled prior probabilities for genotypes">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias">
##FORMAT=<ID=SPL,Number=.,Type=Integer,Description="Normalized, Phred-scaled likelihoods for SNPs based on the reference confidence model">
##GATKCommandLine=<ID=ApplyVQSR,CommandLine="ApplyVQSR --recal-file /home/ubuntu/processing/data/56001801069032/snpoutput.recal --tranches-file /home/ubuntu/processing/data/56001801069032/snpoutput.tranches --output /home/ubuntu/processing/data/56001801069032/56001801069032.filtered.snp.vcf.gz --truth-sensitivity-filter-level 99.5 --mode SNP --variant /home/ubuntu/processing/data/56001801069032/56001801069032.ysnp.vcf.gz --reference /home/ubuntu/data/references/human_g1k_v37.fasta --use-allele-specific-annotations false --ignore-all-filters false --exclude-filtered false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.1.3.0-31-gf499656-SNAPSHOT",Date="October 19, 2019 4:33:04 AM UTC">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=FractionInformativeReads,Number=1,Type=Float,Description="The fraction of informative reads out of the total reads">
##INFO=<ID=LOD,Number=1,Type=Float,Description="Variant LOD score">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=R2_5P_bias,Number=1,Type=Float,Description="Score based on mate bias and distance from 5 prime end">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">
##bcftools_annotateCommand=annotate -x FILTER /home/ubuntu/processing/data/56001801069032/56001801069032.xsnp.vcf.gz; Date=Sat Oct 19 03:05:46 2019
##bcftools_annotateVersion=1.7+htslib-1.7-2
##bcftools_filterCommand=filter -i 'TYPE="snp" ' /home/ubuntu/processing/data/56001801069032/56001801069032.raw.vcf.gz; Date=Sat Oct 19 03:02:17 2019
##bcftools_filterVersion=1.7+htslib-1.7-2
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_007605,length=171823>
##contig=<ID=hs37d5,length=35477943>
##reference=file:///data/scratch/hs37d5-cnv-anchor.v7/reference.bin
##source=ApplyVQSR
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 56001801069032
1 10327 . T C 31.95 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.5;AN=2;DP=45;FS=0;FractionInformativeReads=0.511;MQ=24.33;MQRankSum=0.598;NEGATIVE_TRAIN_SITE;QD=0.71;R2_5P_bias=-1.323;ReadPosRankSum=0.724;SOR=0.346;VQSLOD=-4.174e+00;culprit=MQ GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:14,9:0.391:23:5,4:9,5:32:67,0,41:31.95,0.002956,43.75:0,34.77,37.77:1,13,1,8:8,6,6,3
1 10583 . G A 40.38 PASS AC=1;AF=0.5;AN=2;DP=28;FS=0;FractionInformativeReads=0.821;MQ=67.43;MQRankSum=-1.969;NEGATIVE_TRAIN_SITE;QD=1.44;R2_5P_bias=-13.674;ReadPosRankSum=2.227;SOR=0.275;VQSLOD=-1.302e+00;culprit=MQ GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB:PS 0|1:15,8:0.348:23:7,3:8,5:40:75,0,47:40.38,0.0004401,50.1:0,34.77,37.77:0,15,0,8:8,7,5,3:10583
1 12783 . G A 23.03 VQSRTrancheSNP99.90to100.00 AC=2;AF=1;AN=2;DP=21;FS=0;FractionInformativeReads=1;MQ=12.78;MQRankSum=0.251;NEGATIVE_TRAIN_SITE;QD=1.1;R2_5P_bias=0;ReadPosRankSum=2.06;SOR=3.16;VQSLOD=-6.934e+00;culprit=MQ GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:3,18:0.857:21:1,7:2,11:9:60,11,0:23.03,9.031,0.6046:0,34.77,37.77:3,0,18,0:1,2,7,11

JamesKane
10-24-2019, 12:54 AM
I don't see anything in the header that could distinguish the sequencing source, with an admittedly quick glance. However, it was produced by Illumina's DRAGEN platform. This doesn't seem to be the right tool for the job here, but someone with more time would need to dig into its long read support.

At least yours is R1b-CTS4466...

Petr
10-24-2019, 12:08 PM
I just noticed two VCF files were available for download for my Dante Labs Long Read kit:
56001801069032.filtered.snp.vcf.gz 337MB
56001801069032.filtered.indel.vcf.gz 101MB


I still see the FASTQ file only.

Donwulff
10-25-2019, 08:31 PM
Gawd, when posting VCF files, please click the "Go Advanced" button and check "Disable smilies in post", or the formatting is something out of this world... Alternatively embed it in CODE tags, which will be even nicer for people browsing through.

VCFs before FASTQ? Uh... the GQ Genotype Quality (Phred scaled) in that VCF has a mode around 48; since Phred 48 means an error probability of 10^(-4.8), that's roughly 99.998% accuracy, which wouldn't really be possible with Long Reads. Speaking of which, Long Reads are most useful for large structural variation, and those VCFs cover only the short variants that the technology is not very good at.

Silver lining: it appears they're actually using DRAGEN, which they announced on Facebook earlier; I thought that would be one of those "any year now" things because it's rather new, especially in the context of GATK Best Practices. In fact, my understanding is that the announcement is a bit premature, since the Broad Institute doesn't appear to have announced Best Practices for DRAGEN yet. And, uh, they're still using VQSR, which is already old news for single-sample workflows (it's Convolutional Neural Networks now). Of course, as opined, DRAGEN doesn't seem to support long reads in any way.


Ps. I agree companies need to deliver what they promise, though I think the delivery time is *mostly* a red herring. FTDNA for example had ~1 year sample-to-BAM-download times back when their "storage drive filled up", and they're an established clinical research firm (alias Gene by Gene) with an in-house laboratory. It's still the best deal on offer as long as people get their results. I don't even understand what people mean by "pre-canned answers". However, long read sequencing to me seems... dismal at best: timestamps indicate most people seem to be getting their FASTQs months after they were actually produced, despite in-house sequencing; there are still no structural variant results, which seem to have been promised (Long Reads seems to be back on the EU site but with no mention of deliverables now); and now it would seem the majority of people posting on this forum are getting something entirely different from what they ordered. Which is perplexing, and frustrating. I certainly hope they'll make good on the promises, especially since they seem to be the only provider currently. As mentioned elsewhere, I decided to also try out their update subscription, AI personalized report and custom health report, each of which still remains undelivered (actually listed "Unfulfilled" on my account). So, eh, only the short read sequences seem to work right now...

Donwulff
10-25-2019, 09:02 PM
Running (NGMLR or Minimap2 --MD) AND Sniffles to produce variant calls IF you get the FASTQ, then cutting the REF/ALT fields to 15k nucleotides, allows running the VCF through AnnotSV (again, caveat: a lot of noise even at the "known pathogenic" level). I think Sniffles is a bit slower on Minimap2 alignments, though I've not yet nailed down what affects the processing speed (on some files, perhaps metagenome-aligned ones, it seems limited to two processor cores) and whether there are significant differences (the research paper testing those seemed to think either is fine).


awk -vOFS=$'\t' '!/^#/ { if(length($4)>15000) $4=substr($4,1,15000); if(length($5)>15000) $5=substr($5,1,15000); } 1' sample.calmd.vcf | bcftools sort -Oz -o sample.calmd.15k.vcf.gz

Also, I ended up re-running Sniffles a few times on what should be identical input, which gave a few different structural variants. Might have to try SVIM too. So I think there's room for improvement, for sure. Still strange Dante Labs isn't doing even the mapping and SV calling; not everybody has access to a decent Linux machine with 16GB of memory or so to run that themselves.

https://genome.cshlp.org/content/early/2019/06/11/gr.244939.118
"The structural variant caller Sniffles after NGMLR or minimap2 alignment provides the most accurate results, but additional confidence or sensitivity can be obtained by combination of multiple variant callers. Sensitive and fast results can be obtained by minimap2 for alignment and combination of Sniffles and SVIM for variant identification."
https://github.com/philres/ngmlr
https://github.com/lh3/minimap2
https://github.com/fritzsedlazeck/Sniffles
https://github.com/eldariont/svim

MacUalraig
11-14-2019, 07:57 AM
Well, the latest I saw on social media, from someone who appears to be staff(?), was that not only are they binning the nanopore due to costs, but that all you are ever going to get is the FASTQ. In a way it's academic, as the few of us who did it have already processed it ourselves, but it would have been nice to get some VCFs, particularly if they used a different set of tools.

Donwulff
11-14-2019, 04:53 PM
That's a shame; they should just match the price to the costs. They should still be able to make it work with the streamlined operations they're boasting (and more affordable saliva samples). Besides which, with constantly increasing yields, and flush kits specifically for re-using flowcells on the same sample, the costs should soon be below what they were charging, assuming they have low waste.

Of course, it may be true they're not getting a whole lot of customers for it, and the certification/maintenance/facility and especially expertise costs can be significant. At the end of the day you can't transfer the 16,200 EUR annual lab + device certification cost to the single customer you get. That's of course not a problem if you manage to build and maintain significant demand for the product, but they never made it at all consumer-friendly. As direct-to-consumer genetics lies right now, perhaps it is a solution in search of a problem; it would require wide availability and popularity to build up demand. And researchers would require the original signal-level FAST5 data, which they may not be geared to provide.

That goes to say, I do hope this will become consumer-ready product soon (YSEQ, perhaps?) especially since I didn't seem to get enough high-quality data the first run around...

MacUalraig
11-15-2019, 03:41 PM
I agree with your para 1 - I thought the 800 euros or whatever was quite a bargain!

Petr
11-16-2019, 10:49 PM
After a long wait, YFull refused the BAM file generated by YSEQ because of low quality. The YFull statistics were:

ChrY BAM file size: 0.52 Gb Hg38
Reads (all): 465621
Mapped reads: 465621 (100.00%)
Unmapped reads: 0
Length coverage: 23527298 bp (99.54%)

Min depth coverage: 1X
Max depth coverage: 1960X
Mean depth coverage: 14.17X
Median depth coverage: 10X
Length coverage for age: 8202809 bp
No call: 109057 bp

SNPs (all): 164318
Positive: 2193 (1.33%)
Negative: 11337 (6.90%)
Ambiguous: 4872 (2.96%)
No call: 59 (0.04%)

STRs (all): 780
Reliable alleles: 224 (28.72%)
Uncertain alleles: 90 (11.54%)
N/A: 466 (59.74%)

Novel SNPs (all): 5
Best qual: 3 (60.00%) [0 (0.00%) - best; 3 (60.00%) - acceptable]
INDELs: 0
Ambiguous qual: 2 (40.00%)
One read!: 0
Low qual: 0

Donwulff
11-16-2019, 11:20 PM
I wonder if whoever is accepting submissions at YFull knows all the details, though; on the main Dante Labs thread it was reported they turned down a Dante Labs 4X sequence because it was "2X depth", when the stats seem to show 15X, which sounds like they just halved the nominal read depth and didn't look at the file. In the case of ONT sequences, I've been saying they're not good enough quality for variant calls, so there's no need to even look at the file unless they have special handling for long reads...

This does raise a number of questions though, like did they just stick it through their normal, short-read pipeline? I forgot how to interpret the SNP stats there, those don't sum up nearly to 100%, what's up with that? If it's not positive, negative, ambiguous or no call then what is it?

Out of curiosity I put my sample's mtDNA through bcftools variant calling, and after filtering the lowest quality calls out, the results matched perfectly with my short-read sequencing calls. Which is sort of a relief; all the reading I've done on nuclear-integrated mitochondrial segments and mtDNA phantom variants started making me suspect everything ;) The short indels weren't called, though. mtDNA has read depth >700, is haploid with no heterozygous calls to deal with, and is perhaps kinetically a little different from nuclear DNA, though. This was illustrated by the earlier genome browser view of multiple differences between the Y-chromosomal short reads and long reads, possibly due to altered bases. Which makes me curious about running some large-scale variant call comparisons.
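A hedged sketch of that kind of bcftools run (the file names, the QUAL cutoff and --ploidy 1 are my assumptions, not necessarily what was used; --ploidy 1 reflects the haploid mtDNA):

bcftools mpileup -f GRCh38.fna -r chrM sample.bam | bcftools call -mv --ploidy 1 -Oz -o sample.chrM.vcf.gz
bcftools view -e 'QUAL<30' sample.chrM.vcf.gz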

MacUalraig
11-21-2019, 02:50 PM
Big Nanopore study from Iceland using minimap2 and Sniffles

Long read sequencing of 1,817 Icelanders provides insight into the role of structural variants in human disease

Abstract
Long-read sequencing (LRS) promises to improve characterization of structural variants (SVs), a major source of genetic diversity. We generated LRS data on 1,817 Icelanders using Oxford Nanopore Technologies, and identified a median of 23,111 autosomal structural variants per individual (a median of 11,506 insertions and 11,576 deletions), spanning cumulatively a median of 9.9 Mb. We found that rare SVs are larger in size than common ones and are more likely to impact protein function. We discovered an association with a rare deletion of the first exon of PCSK9. Carriers of this deletion have 0.93 mmol/L (1.36 sd) lower LDL cholesterol levels than the population average (p-value = 2.4·10−22). We show that SVs can be accurately characterized at population scale using long read sequence data in a genomewide non-targeted fashion and how these variants impact disease.

https://www.biorxiv.org/content/biorxiv/early/2019/11/20/848366.full.pdf

pmokeefe
11-22-2019, 03:29 PM
Below I reproduce an email I sent to Dante Labs about my "long read test".
The FASTQ files and the fastp output for that kit are on a public Google Drive folder at this link: https://drive.google.com/open?id=1MkJ07RRpyDGjs1UHUCfs4BSRGS36PM_v
What does it look like to you?

Dear Dante Labs,
Today I received the FASTQ files from kit #56001801069032, which was purchased as a Long Read kit. Upon inspection, it appears the results are from a ~150 base pair short read test. I have included a log of the commands I used on MacOS below. I would like to receive results from a Long Read test. Can you please help me with that? Thank you.
Sincerely,
Patrick O'Keefe


Patricks-MacBook-Pro:56001801069032 patrick$ gzcat 56001801069032_S44_L001_R1_001.fastq.gz | head
@A00910:22:HTTGVDSXX:1:1101:8757:1000 1:N:0:CTAGTGCTCT+TGGAACAGTA
NCCCTAAGCCCATATTTGTTGTCAGTTTCACAAAAGTTCCATAGTTGGCATGCACTCTGGCAGAGATGGACCTGGTGAAGATCCAAGGCATGTACCCAAGTTGAGTCAGAATATTGGCCAGGGACCCAAGTCTGGAAGCCTGTCCCATAGG
+
#FFFFFFFFFFF:FFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,FFFFFFFFFFF:FF:FF::FFFFFF:FF:FFF:FFFFF:F:FFFFF,FFFFFFF:FFFF:,FFFFFF:F,FFFFFFFFFFFFFFFFF:F,:
@A00910:22:HTTGVDSXX:1:1101:10077:1000 1:N:0:CTAGTGCTCT+TGGAACAGTA
NCCCTATCCCTTTCTATTCCTCTGACCCCGCCTCCTTTCTAAATGCAGCGACCTCTGTTCTTCAGCCCTATCCCATTCTAATCCACATACCCCCCCTCCTATCTAAATACAGCGACCTCAGTTCCAAAGACATAACCACTTATAATCCACA
+
#FFF:F:FFFFFF:FFFFFFF:F:F:FFF,FF:FFFF,FFFFF:,:F,FFF:FF,FF:FF,FF:,,F:,,FFFF,,F,:F,FF,,F:,F::F,,F,,FF,,,F::FFF,,F,FFFFF:F,,FFF,,,,,,F,,:,,F,::,,,:,:,:,:,
@A00910:22:HTTGVDSXX:1:1101:11143:1000 1:N:0:CTAGTGCTCT+TGGAACAGTA
NTCCCGGGTTCAAGCCATTCTCCTGTCTCAGCCTCTTGAGTAGCTGGGATTACAGGCACACACCACCACACCCGGCTAATTTTTTTTTTGTTTTGTTTTTGTATCTTTAGTAGAGACGGGGTTTCACCATATTGGCCAGGCTGGTCTTGAA

Patricks-MacBook-Pro:56001801069032 patrick$ gzcat 56001801069032_S44_L001_R1_001.fastq.gz | wc -l
88788156


Patricks-MacBook-Pro:56001801069032 patrick$ gzcat 56001801069032_S44_L001_R1_001.fastq.gz | awk '{print length($0)}' | head
65
151
1
151
66
151
1
151
66
151


Patricks-MacBook-Pro:56001801069032 patrick$ fastp -i 56001801069032_S44_L001_R1_001.fastq.gz -I 56001801069032_S44_L001_R2_001.fastq.gz
*** This is the beginning of fastp.html, a file generated by the above command ***
fastp report

Summary

General
fastp version:0.20.0 (https://github.com/OpenGene/fastp)
sequencing:paired end (151 cycles + 151 cycles)
mean length before filtering:148bp, 148bp
mean length after filtering:147bp, 147bp
duplication rate:3.693047%
Insert size peak:261

Before filtering
total reads:44.394078 M
total bases:6.580795 G

MacUalraig
11-22-2019, 04:25 PM
Is it actually your data? Have you tried aligning it and looking at the SNPs?

pmokeefe
11-22-2019, 05:00 PM
Is it actually your data? Have you tried aligning it and looking at the SNPs?

Yes, that is my data. I already have plenty of short read data from other tests; I purchased a Long Read test and I would like Long Read test results.

JamesKane
11-23-2019, 11:28 AM
@A00910:22:HTTGVDSXX:1:1101:8757:1000 1:N:0:CTAGTGCTCT+TGGAACAGTA


From the 10x Genomics identification heuristic (and a few custom changes in my implementation), this read came from an Illumina NovaSeq, not an Oxford Nanopore device. https://github.com/10XGenomics/supernova/blob/master/tenkit/lib/python/tenkit/illumina_instrument.py
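In sketch form, the relevant part of that heuristic (the @A-prefix instrument naming for NovaSeq is my reading of the linked script; this is simplified, not the actual implementation):

# look at the instrument field of the first read header
zcat sample.fastq.gz | head -1 | awk -F: '{ if ($1 ~ /^@A[0-9]+$/) print "NovaSeq-style instrument ID"; else print "other/unknown instrument" }'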

Donwulff
11-23-2019, 12:59 PM
I'm guessing that's not 10X Chromium Linked Reads technology, though, despite the 10X mention? At least there's no BX:Z tag in the header. It's pretty obvious that's a paired-end short-read sequence, though. Since it seems like they would be taking a hit on most of the sales (bearing in mind we have no way of knowing exactly how much they pay for the consumables), it seems weird if all this is just an attempt to save money. Conversely, if it was the scam some people accuse them of, it wouldn't make sense for them to deliver other expensive products (whether VCF-only or SRS) instead. Right now it's looking terrible for Long Reads sequencing, though.

pmokeefe
11-23-2019, 05:49 PM
This is the first line of a FASTQ file from a short read kit that I recently received from Dante Labs:

@A00910:30:HN7JKDSXX:1:1101:22752:1000 1:N:0:ATAGCGGAAT+NCGCAGAGTA

This, on the other hand, is the first line of the FASTQ from Dante Labs that was supposed to be a long-read kit:

@A00910:22:HTTGVDSXX:1:1101:8757:1000 1:N:0:CTAGTGCTCT+TGGAACAGTA

In both cases the length of the second line, the one with the actual read, was 151.
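
The leading colon-separated fields of an Illumina header are the instrument name, run number and flowcell ID, so a quick way to compare kits is something like this (filenames are placeholders):

for f in shortread_R1.fastq.gz longread_R1.fastq.gz; do
  gzcat "$f" | head -1 | cut -d: -f1-3
done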

MacUalraig
11-23-2019, 06:17 PM
I hope they aren't going to mess you around any further; it's very frustrating, since currently no one else is doing DTC Nanopore. Didn't someone suggest they simply used up all their starter-kit flow cells and then gave up? They're priced at around £1k-plus per cell from then on. But then we only know of, at most, about half a dozen people who signed up.

pmokeefe
11-23-2019, 06:40 PM
This is the first line of a FASTQ file from a short read kit that I recently received from Dante Labs:

@A00910:30:HN7JKDSXX:1:1101:22752:1000 1:N:0:ATAGCGGAAT+NCGCAGAGTA

This, on the other hand, is the first line of the FASTQ from Dante Labs that was supposed to be a long-read kit:

@A00910:22:HTTGVDSXX:1:1101:8757:1000 1:N:0:CTAGTGCTCT+TGGAACAGTA

Here is the first line from a 10X genomics test I did last year via FGC:

@A00298:55:H37KYDSXX:3:1101:2049:1000 1:N:0:GGTATGCA

The length of the second line (the one with the actual read) was 151 in all three cases.

pmokeefe
11-23-2019, 06:48 PM
Nanopore sequencing undergoes catastrophic sequence failure at inverted duplicated DNA sequences (https://www.biorxiv.org/content/10.1101/852665v1)
Pieter Spealman, Jaden Burrell, David Gresham

Abstract
Inverted duplicated sequences are a common feature of structural variants (SVs) and copy number variants (CNVs). Analysis of CNVs containing inverted duplicated sequences using nanopore sequencing identified recurrent aberrant behavior characterized by incorrect and low confidence base calls that result from a systematic elevation in the current recorded by the sequencing pore. The coincidence of inverted duplicated sequences with catastrophic sequence failure suggests that secondary DNA structures may impair transit through the nanopore.

JamesKane
11-25-2019, 12:45 AM
I'm guessing that's not 10X Chromium Linked Reads technology, though, despite the 10X mention? At least there's no BX:Z tag in the header. It's pretty obviously a paired-end short-read sequence, though. Since it seems like they would be taking a hit on most of these sales (bearing in mind we have no way of knowing exactly how much they pay for the consumables), it seems weird if all this is just an attempt to save money. Conversely, if it were the scam some people accuse it of being, it wouldn't make sense for them to deliver other expensive products (whether VCF-only or SRS) instead. Right now it's looking terrible for Long Reads sequencing, though.

No, it's nothing to do with the chromium process. I'm just crediting the original implementation of the algorithm to detect the sequencing platform.

Donwulff
12-08-2019, 08:28 AM
The Whole GenomeH Hybrid Whole Genome Sequencing is interesting, especially as we had heard they might be quitting long-read sequencing entirely. Instead it's now the default landing page, at least in the EU. "Hybrid genome assembly using short and long reads" seems to suggest they're actually supposed to do assembly and polishing this time. Of course, as much as I want Dante Labs to succeed, they have a bit of a credibility problem now: I'm not sure anybody has received their long-read FASTQ files processed, and judging by the posts on this thread most people received something other than what was promised. Offering this as a complete product is otherwise the kind of great idea I've pointed out on this thread (though price-wise it would be even nicer if one could, successfully, buy the different tests separately and have them combined, instead of putting 1799 EUR down for an unproven product). (Interestingly, they probably chose 15X long reads because they can get that from one flowcell; with yields increasing, it could give more.)

In my case, after a couple of weeks Dante Labs replied that they were going to forward my query about the low-quality long reads to the bioinformatics team, but there has been nothing in the month since then. In my experience their customer support replies have been really fast, except for this long-reads one, but "escalating higher" is the kiss of death after which I hear nothing. Regarding the read quality, however, I'd like to mention this paper:

Pontefract, A., Hachey, J., Zuber, M., Ruvkun, G. and Carr, C. 2018. Sequencing nothing: Exploring failure modes of nanopore sensing and implications for life detection. Life Sciences in Space Research 18 (2018), 80–86.
https://www.sciencedirect.com/science/article/pii/S2214552418300245

A tantalizing title: it turns out that if you run an Oxford Nanopore device through the motions with no DNA/RNA sample, it will still produce data from random fluctuations in the pores, which it interprets as DNA due to the lack of a valid signal. In the conclusion they find the run "generated 5 passing reads out of a total of 3568 measured reads, and contained estimated sequences with low complexity that did not map to the NCBI database. The noise characteristics in all instances suggest that quality thresholds were appropriately chosen by ONT". In other words, the (MinION) device generated 3568 reads, almost all of which were filtered out using the standard quality threshold. Because of the higher read count in a PromethION whole-genome sequence, there could be many times more false reads in customer data if it is not properly quality-filtered. While the paper suggests these reads do not map to the NCBI genome database, this is a huge problem for genome assembly, because the assembly software would spend significant effort trying to place those false reads on the genome and, conceivably, could succeed in misplacing them in some cases.

Donwulff
12-08-2019, 09:31 AM
Running https://github.com/nanoporetech/pipeline-structural-variation on the raw FASTQ data. I had run NGMLR/MiniMap2 and Sniffles separately before, but pipeline-structural-variation has some magic for choosing appropriate read-count thresholds for detection, plus filters to remove some of the more unlikely results.


git clone https://github.com/nanoporetech/pipeline-structural-variation.git
cd pipeline-structural-variation/
cat Dockerfile

FROM continuumio/miniconda3:latest
MAINTAINER prescheneder

COPY env.yml /home/
COPY lib /home/lib/

RUN conda config --add channels defaults \
&& conda config --add channels bioconda \
&& conda config --add channels conda-forge \
&& conda install -y snakemake \
&& conda env update -n base -f=/home/env.yml \
&& pip install /home/lib/
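
In principle you could sidestep the conda setup below entirely by building and running this container directly. Untested on my part, but roughly along these lines (mount paths are placeholders):

docker build -t ont-sv-pipeline .
docker run --rm -v /mnt/GenomicData:/mnt/GenomicData -v $(pwd):/data -w /data \
    ont-sv-pipeline snakemake --cores 8 -p all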

First we need snakemake: according to the page, "conda install -y snakemake", but it turns out this didn't work on a clean system and pulled in a far too old version of snakemake. (For some reason I'm assuming people have miniconda or can set it up.) The Dockerfile (for building a virtual machine/container with a known configuration) gives us a hint:

conda config --add channels defaults \
&& conda config --add channels bioconda \
&& conda config --add channels conda-forge \
&& conda install -y snakemake
This actually worked, and snakemake --version returns 5.8.1, which is enough for the pipeline. I didn't stop to look at what was actually needed; possibly conda-forge provides some necessary dependencies for the later versions.

There are command line parameters, but I ended up editing config.yml to my liking. This includes the GRCh38 bare no-alt reference, because the aligners aren't alt-aware, and we don't want any of the reads to map to multiple alternate or decoy alleles. In the git development version we also need to add a line setting the number of processor threads. input_fastq (the input dir/path), workdir_top (for output) and sample_name (the subdirectory within it) also need to be edited to match; the rest I left as is.

reference_fasta: "/mnt/GenomicData/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
threads: 8
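
For reference, the other keys I edited, with placeholder values (the key names come from the pipeline's stock config.yml):

input_fastq: "/mnt/GenomicData/DanteLabs/longsample.q7.fastq.gz"
workdir_top: "/mnt/tmp/ont-sv"
sample_name: "my_sample_q7"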

In the FAQ it says:

How can i filter my reads by q-score?

The most recent version of MinKNOW will perform filtering automatically. When using the FASTQ files from the pass output folder no additional filtering is required. For unfiltered datasets use NanoFilt with a minimum q-score of 6.

The "Sequencing nothing" paper above states "At the time of sequencing, Metrichor applied passing quality (Q) scores of Q > 6 for 1D single strand se quencing" and "Moreover, the basecalling software has been reworked and is now conducted using a program called Albacore (v.2.1.3) which utilizes an updated version of the Recurrent Neural Net; the quality score threshold used for 1D reads is now Q > 7." so it's reasonable to assume the FAQ is referring to old version of basecalling.

For other customers, it appears the reads came pre-filtered. I resolved to run the process with both Q6 and Q7 filters:

gunzip -c longsample.fastq.gz | NanoFilt --logfile longsample.q6.log -q 6 | gzip > longsample.q6.fastq.gz
gunzip -c longsample.fastq.gz | NanoFilt --logfile longsample.q7.log -q 7 | gzip > longsample.q7.fastq.gz
(I'm omitting the NanoFilt installation; also, it didn't actually write a log.)
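
As an aside, the -q threshold is a mean per-read quality, which for nanopore tools is computed by averaging error probabilities rather than raw phred values. A quick sketch of that computation, with a placeholder filename:

import gzip
import math

def mean_qscore(qual, offset=33):
    # Convert each phred character to an error probability, average the
    # probabilities, then convert back to a phred score; a plain mean of
    # the phred values would overestimate read quality.
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))

passing = total = 0
with gzip.open("longsample.fastq.gz", "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 3:  # the quality line of each 4-line FASTQ record
            total += 1
            if mean_qscore(line.rstrip("\n")) >= 7:  # analogous to NanoFilt -q 7
                passing += 1
print(passing, "of", total, "reads pass Q7")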

With output from above command(s) into input_fastq in config.yml, finally we're ready to run the pipeline with the command from the documentation:

snakemake --use-conda -p all

minimap2 -t 1 -ax map-ont --MD -Y /mnt/GenomicData/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -d my_sample_qq/index/minimap2.idx
Activating conda environment: /mnt/tmp/ont-sv/.snakemake/conda/b218ddd4

...which doesn't work. We get single-threaded processing, and in my environment there's an error about missing symbols in a library. The run does, however, print the path of the conda environment it's trying to use. I have no idea why conda isn't using the correct library path, but let's set it manually using the path above, so our second try becomes:

export LD_LIBRARY_PATH=/mnt/tmp/ont-sv/.snakemake/conda/b218ddd4/lib
/usr/bin/time snakemake --cores 8 --use-conda -p all 1> my_sample_qq.log 2> my_sample_qq.err
Success! NB: you may want to skip the output redirections if you want to see progress in real time. For me one run ended up taking about 8 hours on an 8-core (physical) processor, but this will vary with read depth etc.

Comparing the q6- and q7-filtered runs shows there are definite differences, though. As usual, the problem is that there's no truth set for my personal genome, so it's hard to evaluate which results are closer to correct. Some papers, e.g. the supplements of Kosugi, S., Momozawa, Y., Liu, X., Terao, C., Kubo, M. and Kamatani, Y. 2019. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biology 20, 1 (Jun. 2019), 117, suggest that 60X 150bp PE short-read sequencing with certain SV callers might give a good reference set; alas, I don't have that currently. I'm also looking at whether there's a good way of comparing SV calls, but there's a good chance I'll have to write a script that counts some simple stats and compares to known SVs, as sketched below.
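
If I do write that script, its core would probably look something like this: a crude comparison of two Sniffles-style SV call sets that counts calls by SVTYPE and checks how many calls in one set have a same-type call nearby in the other. The file paths, the 500bp window and the INFO parsing are all placeholder choices, not anything the pipeline prescribes.

import re
from bisect import bisect_left
from collections import defaultdict

def load_svs(path):
    # Returns {(chrom, svtype): sorted list of positions} from an SV VCF.
    svs = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, info = fields[0], int(fields[1]), fields[7]
            m = re.search(r"SVTYPE=([^;]+)", info)
            if m:
                svs[(chrom, m.group(1))].append(pos)
    return {key: sorted(vals) for key, vals in svs.items()}

def matched(query, target, slop=500):
    # Count query calls with a same-type target call within slop bp.
    hits = 0
    for key, positions in query.items():
        tpos = target.get(key, [])
        for p in positions:
            i = bisect_left(tpos, p - slop)
            if i < len(tpos) and tpos[i] <= p + slop:
                hits += 1
    return hits

q6 = load_svs("my_sample_q6/sv_calls.vcf")  # placeholder paths
q7 = load_svs("my_sample_q7/sv_calls.vcf")
print("q6 calls:", sum(len(v) for v in q6.values()))
print("q7 calls:", sum(len(v) for v in q7.values()))
print("q6 calls with a q7 match:", matched(q6, q7))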

MacUalraig
12-17-2019, 09:52 PM
What do we think the *sale* price of the new hybrid short/long read test will be? I'm thinking of sponsoring someone in my project for it but the trick is timing it right.

The other question is: assuming they have actually built one of these pipelines (not a given; usually they seem to just launch stuff first), is it worth us asking them to apply it to earlier customers for whom they have both short and long read data but never did any analysis? Hey, I'm feeling benevolent, so I wouldn't even mind paying...