Thread: Dante Labs WES/WGS Sequencing Technical

  1. #161
    Gold Class Member
    Posts
    1,288
    Sex
    Location
    Birmingham, UK
    Ethnicity
    Indian - Punjabi Jatt
    Nationality
    British
    Y-DNA (P)
    R2-SK2142 > Y1383*
    mtDNA (M)
    U7a3a5a1
    Y-DNA (M)
    R1b-Z2109 > Y84821
    mtDNA (P)
    M5a1a (185G)

    England United Kingdom India Punjab India
    Quote Originally Posted by teepean47 View Post
    I was testing the latest bwa-mem2 release (2.1) and memory requirements are now reasonable. Indexing hs37d5.fa still takes around 80 gigabytes but the resulting index is around 16 gigabytes so mapping can now be done on machines with 32 gigabytes of memory.
    How does BWA-MEM2 handle ALT contigs (e.g. hs38DH) now? This review reported discrepancies between the BWA-MEM and BWA-MEM2 alignments for an earlier version of BWA-MEM2.

    On a somewhat related note, GATK-DRAGEN is apparently due out early next month as part of the next GATK release, so you may want to take a look at that.
    YFull: YF72440 (FTDNA - IN41220)

    Ancestral Haplos (Punjabi Jatt):
    * Father: R2-SK2142 > Y1383* - M5a1a (185G)
    * Maternal Uncle: R1b-Z2109 > Y84821 - U7a3a5a1
    * MGMs MGF: R1a-Z93 > Y7 - ?

    Friends Haplos:
    * North Moroccan Berber: E-M35 > M81 - R0
    * Han Chinese: O-M117 > F1531 - M7e
    * Gujarati Lohana: T-M70 > Y11151 - R30b1


  2. The Following 3 Users Say Thank You to aaronbee2010 For This Useful Post:

     Jatt1 (10-29-2020),  maroco (10-28-2020),  pmokeefe (10-28-2020)

  3. #162
    Gold Class Member
    Posts
    271
    Sex
    Nationality
    Finnish
    Y-DNA (P)
    R1b-Z142
    mtDNA (M)
    H10g

    Quote Originally Posted by aaronbee2010 View Post
    How does BWA-MEM2 handle ALT contigs (e.g. hs38DH) now? This review reported discrepancies between the BWA-MEM and BWA-MEM2 alignments for an earlier version of BWA-MEM2.
    As far as I know, bwa-mem2's output should be identical to bwa mem's. An issue from June confirms that, and the authors updated their article:

    Update June 11th, 2020:
    After our initial work for BWA-MEM2 evaluations, one of the developers, Vasimuddin Md (@wasim_galaxy), had contacted us about a number of improvements and shared with us the latest BWA-MEM2 binary on May 22nd, 2020. We have some updated information to share after testing out the suggested version and running parameters.

    With the latest binary, we revisited our discrepancy analysis between BWA-MEM and BWA-MEM2 with the three samples (SRR10150660, SRR10286881, and SRR10286930) that had the largest differences in our original post. We found out that the mapping results of BWA-MEM2 are exactly the same with BWA-MEM using this latest version.

    The identical result is discovered when aligning against both hg38 builds, hs38DH (GRCh38 primary contigs + decoy contigs + ALT contigs and HLA genes), as well as GCA_000001405.15_GRCh38_no_alt_analysis_set (GRCh38 primary contigs + decoy contigs, but no ALT contigs nor HLA genes). Additionally, while the per sample wall clock runtime is slightly longer with a range of 10%, BWA-MEM2 still gives the advantage of being around ~2X faster to its precedence, BWA-MEM, as we have found originally.

    Additionally, three of our previously failed BWA-MEM2 analysis, which were reported on bwa-mem2 github page (35, 49, and 50), can now be finished successfully by utilizing a lower number of thread usage (with bwa-mem2 option -t), or by increasing the memory capacity.
    https://github.com/bwa-mem2/bwa-mem2/issues/61
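For anyone who wants to try the thread-count workaround mentioned in the update, here is a minimal sketch that only builds the two bwa-mem2 command lines and prints them. The file names and the helper function are placeholders of mine, not anything from the article, and nothing is actually executed.

```python
import shlex


def bwa_mem2_commands(reference, reads_1, reads_2, threads=4):
    """Build (but do not run) the bwa-mem2 indexing and mapping commands."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Indexing is the memory-heavy step and is done once per reference.
    index_cmd = ["bwa-mem2", "index", reference]
    # -t sets the thread count; the quoted update suggests lowering it
    # (or adding memory) if a run fails.
    mem_cmd = ["bwa-mem2", "mem", "-t", str(threads), reference, reads_1, reads_2]
    return index_cmd, mem_cmd


# Placeholder file names, for illustration only.
index_cmd, mem_cmd = bwa_mem2_commands(
    "hs37d5.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", threads=4
)
print(shlex.join(index_cmd))
# bwa-mem2 writes SAM to standard output, so the mapping step is redirected.
print(shlex.join(mem_cmd) + " > sample.sam")
```

Keeping the commands as argument lists makes it easy to hand them to subprocess.run later, once the paths point at real files.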

    DRAGEN-GATK seems very interesting but I suspect the speed will be at the same level as GATK4 unless they have implemented a multi-core/cpu capability that existed in GATK3.

    EDIT: Looks like some of the DRAGEN improvements are already committed to master:

    https://github.com/broadinstitute/ga...c8a555066271d1
    Last edited by teepean47; 10-28-2020 at 09:44 PM.

  4. The Following 2 Users Say Thank You to teepean47 For This Useful Post:

     aaronbee2010 (10-29-2020),  pmokeefe (10-28-2020)

  5. #163
    Quote Originally Posted by teepean47 View Post
    As far as I know, bwa-mem2's output should be identical to bwa mem's. An issue from June confirms that, and the authors updated their article:



    https://github.com/bwa-mem2/bwa-mem2/issues/61

    DRAGEN-GATK seems very interesting but I suspect the speed will be at the same level as GATK4 unless they have implemented a multi-core/cpu capability that existed in GATK3.

    EDIT: Looks like some of the DRAGEN improvements are already committed to master:

    https://github.com/broadinstitute/ga...c8a555066271d1
    Somehow I completely missed the update at the top, so thanks for pointing that out. I'm not sure whether the update was there when I last viewed the article, but checking the top of an article for updates would be a good idea.

    There may be a difference in speed due to DRAGEN's algorithm (at least the part that replaces BWA-MEM), but the version due to be released on GitHub won't be hardware-accelerated. That said, the GATK-DRAGEN pipeline removes the need for both BQSR and VQSR, so there should already be improvements there. I'm not sure whether this version of the DRAGEN mapper will have duplicate marking built in like Illumina's proprietary version does, though. It doesn't seem to be the case judging by the diagram below (from their webinar), but I could be wrong on this:



    Also thought I should include this too:


  6. The Following 3 Users Say Thank You to aaronbee2010 For This Useful Post:

     Jatt1 (10-29-2020),  pmokeefe (10-29-2020),  teepean47 (10-29-2020)

  7. #164

    I compiled the latest from the GATK GitHub repository and it does have a DRAGEN mode for HaplotypeCaller (--dragen-mode true). I could not find the mapper, so I could not test that part. The extra DLLs in the folder enable Intel GKL on Windows-based machines.

    https://drive.google.com/drive/folde...L6?usp=sharing
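As a rough illustration of how the flag above would be passed, this sketch assembles a HaplotypeCaller command line with DRAGEN mode switched on. The file names and the helper are hypothetical; only the --dragen-mode true argument comes from the build described above, and the command is printed rather than run.

```python
import shlex


def haplotypecaller_dragen_cmd(reference, bam, out_vcf):
    """Build (but do not run) a HaplotypeCaller command with DRAGEN mode on."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,          # reference FASTA
        "-I", bam,                # aligned, sorted and indexed reads
        "-O", out_vcf,            # output VCF
        "--dragen-mode", "true",  # the switch mentioned in the post
    ]


# Placeholder file names, for illustration only.
cmd = haplotypecaller_dragen_cmd("hs37d5.fa", "sample.bam", "sample.dragen.vcf.gz")
print(shlex.join(cmd))
```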

  8. The Following User Says Thank You to teepean47 For This Useful Post:

     aaronbee2010 (10-29-2020)

  9. #165
    Quote Originally Posted by teepean47 View Post
    I compiled the latest from the GATK GitHub repository and it does have a DRAGEN mode for HaplotypeCaller (--dragen-mode true). I could not find the mapper, so I could not test that part. The extra DLLs in the folder enable Intel GKL on Windows-based machines.

    https://drive.google.com/drive/folde...L6?usp=sharing
    I might have been slightly mistaken then (apologies). The mapper itself might be released alongside GATK as a separate repository rather than in the GATK repository or executable, whereas the DRAGEN mode for HaplotypeCaller appears set to be released within the GATK repository and executable. Time will tell, I guess.
    Last edited by aaronbee2010; 10-29-2020 at 10:46 PM.

  10. #166
    Registered Users
    Posts
    450
    Sex

    To be fair, BWA-GATK has already been moving away from BQSR and VQSR. For example, Tian, S. et al. 2016, "Impact of post-alignment processing in variant discovery from whole exome data", BMC Bioinformatics 17(1), 403, found: "BQSR had virtually negligible effect on INDEL calling and generally reduced sensitivity for SNP calling that depended on caller, coverage and level of divergence. Specifically, for SAMtools and FreeBayes calling in the regions with low divergence, BQSR reduced the SNP calling sensitivity but improved the precision when the coverage is insufficient." (The study is about exomes, and WGS has many lower-coverage areas, so be careful.) And of course, VQSR has indeed been replaced with the convolutional neural net https://gatk.broadinstitute.org/hc/e...NScoreVariants in the latest iterations of GATK Best Practices.

    I believe the Broad Institute have themselves stated on their forums that they consider these steps generally unnecessary, but that one should test and validate this for each workflow and input type. As one should, in general. Of course, a large concern is compatibility with samples processed earlier. Imagine you had 10,000 samples processed with BQSR, no budget to re-process them, and then analysed another 10,000 without BQSR. Now whatever condition the first 10,000 samples had would look like it was associated with all the variants that BQSR brought up. This hits, to a lesser degree, even if you're just analysing your single sample with Promethease. But with the financially expedited switch to DRAGEN, the results are going to change anyway. But yeah, so far the "open source DRAGEN" seems disturbingly like vaporware, with no sightings, although they do keep talking about it.
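To put a number on the mixed-batch problem, here is a toy calculation with made-up counts: a variant with no real link to the condition, plus some extra calls that only show up in the BQSR-processed batch. The carrier counts are purely hypothetical and the odds ratio uses a simple 0.5 continuity correction; it is a sketch of the confounding, not a model of what BQSR actually does.

```python
def odds_ratio(case_carriers, case_total, control_carriers, control_total):
    """Odds ratio for carrying a variant, with 0.5 added to every cell."""
    a = case_carriers + 0.5
    b = case_total - case_carriers + 0.5
    c = control_carriers + 0.5
    d = control_total - control_carriers + 0.5
    return (a * d) / (b * c)


N = 10_000            # samples per batch, as in the thought experiment above
TRUE_CARRIERS = 500   # hypothetical: real carriers, identical in both groups
EXTRA_CALLS = 200     # hypothetical: extra calls seen only in the BQSR batch

# Both groups processed the same way: no apparent association.
same_pipeline = odds_ratio(TRUE_CARRIERS, N, TRUE_CARRIERS, N)

# Cases processed with BQSR, controls without: the extra calls look like
# an association with the condition even though the biology is identical.
mixed_pipeline = odds_ratio(TRUE_CARRIERS + EXTRA_CALLS, N, TRUE_CARRIERS, N)

print(f"same pipeline:  OR = {same_pipeline:.2f}")
print(f"mixed pipeline: OR = {mixed_pipeline:.2f}")
```

With these invented counts the mixed-pipeline odds ratio comes out well above 1, which is the spurious signal described above.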

    In general, I don't think BWA-MEM2 or DRAGEN offers a solution for at-home genome processing, as they seem to be even more resource-hungry. And if you're only analysing a handful of samples, performance isn't really an issue. Using a well-made cloud platform seems like the best bet: you don't need to care much about performance or hardware, the learning curve is low, and the results are standardized.

    I'm more interested in any possible improvements from DRAGEN. I believe their blog said that SNP results should be comparable to BWA-GATK, but indel calling would be improved. Then again, by then we should all be using long-read sequencing, though there's probably a sweet spot for short-read indels. BWA-MEM2 and Minimap2 seem to indicate that not everybody sees DRAGEN as the only game in town. But time will tell, when the first independent publications on the actual implementation start to come in (but first the implementation...).

    Speaking of changes, GRCh38.p14 should hit any day now; there are not too many working days of "second half of 2020" left. Of course, most people don't use the patches for analysis. I'm wondering whether it will incorporate many of the changes from the CHM13 telomere-to-telomere long-read human reference that just got finished and published, or whether that was too late for p14 and we'll have a few years to wait... One could map to CHM13 v1, but the genomic coordinates are different, I haven't seen a single liftover chain, and of course you should probably mask repetitive regions etc. as for the official GRCh38 reference. I want to see what difference a "complete" (how many times have people said the human genome is now complete?) reference makes.

  11. #167

    If you REALLY need the performance, something different to consider: https://developer.nvidia.com/clara-parabricks
    https://doi.org/10.5808/GI.2020.18.1.e10
    "Parabricks was able to process a 50× whole-genome sequencing library in under 3 h and Sentieon finished in under 8 h, whereas GATK v4.1.0 needed nearly 24 h. These results were achieved while maintaining greater than 99% accuracy and precision compared to stock GATK."

    However, with GATK's move to Illumina's DRAGEN, it will be interesting to see how Sentieon and NVIDIA Parabricks respond. For that matter, BGI might have a stake in this as well, although Illumina may have made Dante Labs an offer they couldn't refuse, so non-Illumina sequences seem like a rarity in the enthusiast sphere for the time being. But first we need to actually GET the open-source DRAGEN implementation, which isn't being developed in the open, and perhaps some unaffiliated comparisons of performance, accuracy and precision. (Recall that benchmarks tend to be at least partly based on BWA-GATK pipeline calls!)

    By the way, at https://www.ncbi.nlm.nih.gov/grc/human/issues all current GRCh38.p14 issues are Resolved, and there are already 9 issues open for p15, so I'm guessing they're indeed putting p14 through the final release and quality assurance steps. I might be mistaken, but I don't think either issue list yet incorporates anything specifically from https://github.com/nanopore-wgs-consortium/CHM13 v1.0. The NCBI re-mapping service https://www.ncbi.nlm.nih.gov/genome/...ap/docs/whatis supports CHM13 Draft coordinates, so I hope they will add v1.0 pronto. https://hgdownload.soe.ucsc.edu/gold...hg38/liftOver/ is another one to watch, but I don't think any of the versions are there. The procedure for generating liftover files is a bit, err, esoteric, although it shouldn't be hard in principle. I should probably just work on finding novel sequences and adding them as "decoys" though; that should be done in any event. But the CHM13 telomere-to-telomere reference is an appealing reference for long-read sequencing and a good starting point for re-assembly.
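For readers who haven't looked inside a liftover file, the idea can be sketched in a few lines: a list of aligned blocks, each saying where a stretch of the old assembly lands in the new one, with positions in the gaps between blocks left unmapped. The blocks below are invented numbers on a single chromosome and the same strand; real chain files and real tools handle far more than this.

```python
from bisect import bisect_right

# Hypothetical aligned blocks: (source_start, source_end, target_start),
# 0-based, half-open, sorted by source_start, same strand, one chromosome.
BLOCKS = [
    (0, 1000, 500),
    (1200, 2000, 1500),
    (2500, 3000, 2600),
]


def lift(pos, blocks=BLOCKS):
    """Map a source coordinate to the target assembly, or None if unmapped."""
    starts = [block[0] for block in blocks]
    # Find the last block that starts at or before pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    source_start, source_end, target_start = blocks[i]
    if pos >= source_end:
        return None  # pos falls in a gap between aligned blocks
    return target_start + (pos - source_start)


for position in (100, 1100, 1250):
    print(position, "->", lift(position))
```

The hard part alluded to above is producing the blocks in the first place, which takes a whole-genome alignment between the two assemblies; applying them is the easy bit.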

  12. The Following User Says Thank You to Donwulff For This Useful Post:

     pmokeefe (11-03-2020)

  13. #168

    Bit of a wall of text, but looking this over again (I think I posted some of it dozens of posts ago):
    https://gatk.broadinstitute.org/hc/e...-more-specific

    "To be clear though, we don't anticipate that the work we're doing together is going to substantially accelerate the software-only version of the tools that we distribute as GATK. If you have a need for speed on that order of magnitude, you're going to want to check out Illumina's commercial DRAGEN offering, which includes the hardware-accelerated version. But what this collaboration does offer is to make available some specific accuracy improvements in the software-only open-source version (labeled DRAGEN-GATK). The good news: the outputs of the two versions (pure software vs hardware-boosted) will be functionally equivalent and compatible for cross-dataset analyses."

    If taken literally, that means the open-source, software-only DRAGEN mapper would be slower than BWA-MEM: the pipeline does away with some of the GATK post-processing, so if the pipeline as a whole is not substantially faster, the time saved there has to be spent somewhere else. However, it is also unclear what is meant by "tools that we distribute as GATK": does that include DRAGEN or not? In any case, they've told us to expect only accuracy improvements. The hardware-accelerated process is, of course, the appeal for Illumina (who already have DRAGEN-GATK on their cloud platform).

    "You can see the very clear gains in indel and heterozygous SNP calling precision of the DRAGEN pipeline (blue) compared to our original GATK 4.1.1.0 single-genome pipeline with 2D CNN filtering (orange), as well as a modest but always welcome increase in indel sensitivity."

    They've updated the blog post, and this is actually news to me, as I earlier reported that I believed they said the SNP results were comparable to BWA-GATK. Now they're specifically mentioning heterozygous SNP calling precision, and even comparing against the new 2D Convolutional Neural Net Best Practices. That is indeed nice, and something to look forward to, as well as something for Sentieon, Parabricks and BGI to potentially respond to.

    "DRAGEN team found two areas of opportunity that were ripe for the picking: they decided to improve the indel error model to cope with PCR-induced stutter in STR regions, and refine the genotyping model by taking into account the presence of correlated pileup errors."

    The STR stutter handling would of course be valuable for genealogical analysis like that performed by YFull, although I do not know what kind of pipeline they're currently using for calling STRs. FTDNA has its own in-house pipeline too, I think, although I seem to recall they were making some changes with the latest Big-Y offerings (I should really check these things when writing). Either way, moving to DRAGEN would be an option for both of them. (It took quite a while for FTDNA to re-process the samples to GRCh38, and YFull doesn't get any extra money from re-analysis, but if the performance and price are right...)

    https://gatk.broadinstitute.org/hc/e...ease-timeframe

    "After weathering some delays due to the COVID-19 pandemic, we are now expecting to be able to release the full open-source software version of this first DRAGEN-GATK pipeline in early November of this year"

    Which should be... any day now, too. Of course, slipping schedules are pretty common, and I'm still not sure what to think of open-source projects that are developed behind closed doors. Why not release it early to make sure issues are ironed out before the first "production" release? This seems odd, but I guess we shall see soon.

  14. #169

    Quote Originally Posted by Donwulff View Post
    If you REALLY need the performance, something different to consider: https://developer.nvidia.com/clara-parabricks
    Nvidia's GenomeWorks passes all of its tests even on a consumer GPU, but I was wondering if their version of GATK is freely available anywhere.
