
Thread: Dante Labs (WGS)

  1. #101 | Donwulff (Registered Users, 185 posts)

    Quote Originally Posted by boaziz
    Can you please help me find software for reading the VCF file I have of my Dante WGS 30x results?
    Because it is very large, 979 MB (SNP.VCF)
    DNA.Land had some instructions at https://dna.land/vcf-info - unfortunately Dante Labs doesn't provide the VCF index file, so you'd need to figure out a way to generate it for the Genome Compass, if I recall right.

    If you're on a mobile device, cloud-based solutions are often the only option. I had some VCF exploring options at https://caelestic.com/variant-analysis-list though that list is always in need of revision.

    Actually, I think there need to be some new lists/resources and perhaps even tools for VCF files; nobody really goes into how to use grep etc., and that approach is far from perfect. In fact, if you can use grep, then you can probably use tabix (https://archive.codeplex.com/?p=bow maybe?), although one needs search by rsID too, which tabix doesn't do.
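    For example, a minimal sketch with htslib's bgzip/tabix, assuming an uncompressed file named snp.vcf (the file name and the rsID below are just placeholders):

        bgzip -c snp.vcf > snp.vcf.gz          # compress with bgzip (plain gzip won't work for indexing)
        tabix -p vcf snp.vcf.gz                # generate the missing .tbi index
        tabix snp.vcf.gz chr1:1000000-1001000  # query a genomic region
        zgrep -w rs53576 snp.vcf.gz            # rsID lookup the slow way: scans the whole file

    The region query is near-instant once the index exists; the rsID search is the part tabix can't index.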

    I do almost all bioinformatics on Linux, or maybe R, but if you're not already using them, the learning curve is very steep. Windows GUI tools are sadly lacking, and despite now packing the computing power of desktops from just a few years ago, mobile phones/tablets aren't really well geared for this either.

  2. The Following User Says Thank You to Donwulff For This Useful Post:

     boaziz (02-02-2018)

  3. #102 | Junior Member (8 posts)

    glogg (bonnefon.org)

    Thank you

  4. #103 | Donwulff (Registered Users, 185 posts)

    Lots of questions though; I'm kind of answering bit by bit when I have time. Although stuff like mapping should probably go in a thread/post of its own...
    Most of this and my previous comments are in the context of the Broad Institute GATK Best Practices https://software.broadinstitute.org/...est-practices/ which could be considered the reference book on this kind of pipeline.

    But on to the Dante Labs pipeline analysis - a long read!

    Identifying the reference genome, if not stated, isn't always straightforward. In Dante Labs's case, though, looking at the BAM file headers we have:
    @SQ SN:chrM LN:16571

    The 16571-basepair chrM is NC_001807, which UCSC used but which practically nobody uses nowadays (it's not the same as rCRS or RSRS). Since in mapping the reads align to the location they best match, changing the reference, even something as simple as the mitochondrial DNA, can cause reads to map differently elsewhere, and labs like to keep things unchanged so the results remain comparable. This one probably should be changed, though; in the meantime YFull interpreted full mtDNA from the sequence, which is probably the easiest option.
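    (For anyone wanting to check their own file, the headers can be listed with samtools; the file name here is just a placeholder:

        samtools view -H sample1.bam | grep '^@SQ' | head -5   # first few reference contigs and their lengths
        samtools view -H sample1.bam | grep '^@PG'             # the program group lines discussed below

    The chrM length in the @SQ lines is a quick tell for which hg19 flavour you have.)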

    Anyway, there are no decoy sequences either - bits of DNA that aren't part of the human genome reference, but are almost always present in sequenced samples, like the Epstein-Barr virus that inserts itself into host cells. Some studies have shown that results are better using the decoy sequence, because reads aren't forced to map where they don't belong on the human sequences, though I've not seen decoys used in any of the saliva-based sequencing results. Personally I use the hs37d5 reference, which has the decoy sequences, but this also means there will be more mapped reads on hs37d5 simply because the non-human reads are mapping as well.

    Finally, there are the program group headers, which in this case let us know the alignment used a genome reference named "ucsc.hg19/ucsc.hg19.fa". Now, that's a file name, so you could call it whatever you want, but as it happens this is consistent with the contigs (chrM) and the lack of decoy sequence, so we can safely assume this is the original UCSC hg19 reference. These contigs match the ones in the VCF, so we can assume at least the reference is the same, though it's possible the customer-delivered BAM file is different from what they used for calling.

    Since we have the program group lines, there are a few incidental things we can glean about the Dante Labs pipeline:

    @PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 10 -M -Y -R @RG\tID:CL100XXXXXX_L0X_17\tPL:ILLUMINA\tPU:CL100XXXXXX_LX\tLB:WHYZLAADKGAAXXXXXX-17\tSM:1500XXXXXXXXXXX ucsc.hg19/ucsc.hg19.fa /l3bioinfo/CL100XXXXXX_L0X_17_1.05_1.fq.gz /l3bioinfo/CL100XXXXXX_L0X_17_1.05_2.fq.gz

    This tells us it's mapped with bwa mem; that's the industry standard, so nothing surprising or useful there. 0.7.15 is actually a really recent version. The most recent changes in it have involved alternate contig support; this is when the reference contains multiple different sequences for the same genomic region. Normally the reads would map to the different alt contigs haphazardly, preventing a good sequence for the region, so the mapper needs explicit support for this. The hg19/GRCh37 reference doesn't have alt contigs though, so we're all good here; it's just good to see the pipeline is up to date. The "l3bioinfo" folder/directory suggests this is the L3 Bioinformatics pipeline and/or service. (And I don't consider this a trade secret because it's openly in the file header; there could be a number of reasons for that though.)

    Then there are the parameters: "-t 10" means 10 threads. There are actually 127 separate invocations of bwa mem in the headers, so at 10 threads each the mapping could have used up to 1270 threads in parallel! The actual mapping won't take long, assuming they have that many cores. -M is "mark shorter split hits as secondary", which is expected by Picard Tools and therefore always used, and -Y is "use soft clipping for supplementary alignments". Sometimes the whole read doesn't map in a single place, but it can be split with the parts mapped to different locations on the genome; normally the whole read is listed just once, with the rest of the matches showing only the matching part of the read. With "-Y" the full read is listed again with the matching region marked (i.e. soft-clipped). I'm not actually sure what requires this; it's possible HaplotypeCaller could benefit from it, or it may just be to make sure all reads listed are 100bp long. In short, the settings here are pretty standard and fit for Illumina 100bp paired-end sequencing.
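    Stripped of the masked sample identifiers, an equivalent invocation would look roughly like this (the read group values and file names are placeholders, not Dante's actual paths):

        bwa mem -t 10 -M -Y \
            -R '@RG\tID:run1\tPL:ILLUMINA\tSM:sample1' \
            ucsc.hg19/ucsc.hg19.fa reads_1.fq.gz reads_2.fq.gz > sample1.sam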

    The other kind of program group header we have here is MarkDuplicates, from Picard Tools. This is standard choice, and the reason for the -M on the bwa mem invocation:
    @PG ID:MarkDuplicates VN:2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/l3bioinfo/... ... ...] OUTPUT=chr1.markdup.bam METRICS_FILE=chr1.metrics.txt VALIDATION_STRINGENCY=STRICT MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates

    We can see this was run with just the standard settings, no special options, which means no reads are actually deleted/removed; they're just marked as duplicates. Good for extracting FASTQ from the BAM. However, there's one note I want to stress: the "READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values>" does not work, because that's not the format of the read names in these files! So the FASTQ read names have changed, but apparently nobody's noticed. Also, OPTICAL_DUPLICATE_PIXEL_DISTANCE should be 2500 or so for this kind of high-throughput sequencing. I'm not sure if this makes a practical difference, especially since the general duplicate marking is still done, but it means optical duplicate detection doesn't happen at all. (Actually, I'll just re-run the duplicate marking on it and compare stats.) In short: the sequencer is an optical instrument that reads the sequence from fluorescent dyes attached to the DNA bases. This makes it possible to read the sequence from the same DNA fragment multiple times, which would bias variant calling towards lower-quality optical reads, so you want to ignore similar sequences that are optically close by, which means you need to extract their location and distance on the flowcell. Again, the general duplicate detection catches most of these, so I'm not sure it's a problem.

    There are no other program group lines in the BAM. Specifically, Base Quality Score Recalibration (BQSR) appears not to have been run. This is good news if you want to extract the reads for your own use, because it means the reads haven't been messed with. Likely they ran BQSR before variant calling but provided the raw file for people's own analysis; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5048557/ suggests it may not always be beneficial, but the Broad Institute GATK Best Practices still call for it. If you're doing your own analysis on the Dante Labs provided BAM file, you may want to consider running BQSR. I should probably run a report on that as well. BQSR determines the probability of sequencing error depending on sequencing context (read group and preceding nucleotide bases) from the novel (not seen before) variants on the BAM sequence, and then uses that to adjust the quality of the bases read. I.e. if every time GATC is the sequence read, the C base is a novel variant not seen before, then maybe you should distrust the C whenever you read the sequence GATC.
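    For reference, with the GATK 3.x this pipeline uses, BQSR would be roughly this (a sketch; the dbSNP file and output names are placeholders you'd substitute yourself):

        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
            -R ucsc.hg19/ucsc.hg19.fa -I sample1.bam \
            -knownSites dbsnp.vcf.gz -o recal.table      # model errors from non-dbSNP mismatches
        java -jar GenomeAnalysisTK.jar -T PrintReads \
            -R ucsc.hg19/ucsc.hg19.fa -I sample1.bam \
            -BQSR recal.table -o sample1.bqsr.bam        # write the recalibrated BAM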

    The VCF file gives us the final part we can glean from these:
    ##GATKCommandLine=<ID=ApplyRecalibration,Version=3.3-0-g37228af,...,ts_filter_level=99.0>

    This is actually different from BQSR; variant quality score recalibration works on the variant-call level and not on the BAM itself, though the idea is somewhat similar: find systematic errors via machine learning and apply corrections to the variant calls. This one is called VQSR. In this case, the cut-off point (ts_filter_level=99.0) is set so that 99.0% of "known real variants" would pass the filter. I'd consider that pretty lenient, so further filtering, or at least checking the results, might be warranted if using this VCF. The raw result (without percentages) is VQSLOD in the INFO annotation; larger is better.
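    If you want to filter harder yourself, something like this works (a sketch with bcftools; the VQSLOD cut-off of 4 is an arbitrary example, not a recommendation):

        bcftools view -i 'INFO/VQSLOD>4' snp.vcf.gz -Oz -o snp.strict.vcf.gz   # keep only higher-confidence calls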

    ##GATKCommandLine=<ID=SelectVariants,Version=3.3-0-g37228af,..., variant=(RodBinding name=variant source=out_snp.vcf.gz)>

    This one appears to just be selecting SNPs which are in their "out_snp" list, so no novel SNPs are listed. Theoretically this would be "SNPs we've validated to genotype correctly in our lab", but I'd assume it's some old copy of dbSNP; I could add dbSNP revision annotation to the VCF and see what's the latest revision they list. I noticed that when I run the analysis myself, I get a hugely larger list of variants. If you run Genomic VCF on the BAM on Sequencing.com, the only header it has is UnifiedGenotyper, so I'm thinking that probably skips BQSR and VQSR entirely, but see the above reference for how they may not be beneficial, and at least you get all calls, including the gVCF ref-call blocks. (Also, UnifiedGenotyper is kinda old by now.)
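    (The dbSNP check would be something like this, assuming a bgzipped, tabix-indexed dbSNP build; all file names are placeholders:

        bcftools annotate -a dbsnp.vcf.gz -c ID snp.vcf.gz -Oz -o snp.rsid.vcf.gz   # re-fill the ID column from dbSNP

    and then compare how many IDs change between dbSNP revisions.)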

    Re-running the whole analysis pipeline might be worth it, but considering how relatively little you can do with novel SNPs at least, it may not benefit the average user, especially in light of how many mistakes one might make doing it themselves. If one is looking to learn and experiment, then by all means, have at it.

  5. The Following 6 Users Say Thank You to Donwulff For This Useful Post:

     dudeweedlmao (04-13-2018),  Eridani (03-25-2018),  gotten (02-04-2018),  JamesKane (02-04-2018),  karwiso (02-05-2018),  Tonev (12-16-2018)

  6. #104 | Donwulff (Registered Users, 185 posts)

    I'll have to admit my description of optical duplicates was slightly incorrect. I just tested this on the Dante Labs provided BAM file, as I suggested, and found it doesn't affect the duplicates found at all.
    The main difference is that it appears to be slightly faster with it; I should run multiple trials on an otherwise idle computer to be sure. I started these two in parallel with some free cores, hence 12G per task. The samtools flagstat outputs are exactly identical for both runs, although with the latest Picard Tools at least, the result is slightly different from the flags in the original BAM file (41999309 duplicates flagged in the original BAM made with Picard 2.5.0; 42330893 with Picard 2.17.6 - you don't normally want to run the latest version in bioinformatics pipelines, by the way; I'm doing experimenting and development).
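    (For reference, the flagstat comparison is just:

        samtools flagstat sample1_MarkDuplicates.bam   # read counts by flag, including duplicates

    run against each output BAM; the duplicate counts above come from its "duplicates" line.)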

    Duplicate marking runtime without optical duplicates:
    real 499m10.051s
    user 588m38.432s
    sys 8m4.488s

    Duplicate marking runtime with optical duplicates:
    time java -Xms12G -jar picard.jar MarkDuplicates INPUT=sample1.bam OUTPUT=sample1_MarkDuplicates.bam METRICS_FILE=sample1_MarkDuplicates.bam.metrics.txt
    real 471m32.868s
    user 454m16.360s
    sys 6m46.476s

    Broad Institute actually says at https://gatkforums.broadinstitute.or...withmatecigar:
    "The Broad's production workflow increases OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500, to better estimate library complexity. The default setting for this parameter is 100. Changing this parameter does not alter duplicate marking. It only changes the count for optical duplicates and the library complexity estimate in the metrics file in that whatever is counted as an optical duplicate does not factor towards library complexity. The increase has to do with the fact that our example data was sequenced in a patterned flow cell of a HiSeq X machine. Both HiSeq X and HiSeq 4000 technologies decrease pixel unit area by 10-fold and so the equivalent pixel distance in non-patterned flow cells is 250. You may ask why are we still counting optical duplicates for patterned flow cells that by design should have no optical duplicates. We are hijacking this feature of the tools to account for other types of duplicates arising from the sequencer. Sequencer duplicates are not limited to optical duplicates and should be differentiated from PCR duplicates for more accurate library complexity estimates."

    So my information on optical duplicates is actually slightly outdated; apparently they're no longer relevant for their original purpose on the newest technologies (granted, I can't tell if this sequencing WAS done on patterned flowcells or not). However, the setting DOES make a significant difference for estimating library complexity: about 1.2 billion without optical duplicates, 3.5 billion with the Broad Institute recommended OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 and the correct regex.
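    (For the record, the re-run with the Broad setting is just the earlier command plus one parameter; I'm leaving out the corrected READ_NAME_REGEX for these read names here:

        time java -Xms12G -jar picard.jar MarkDuplicates INPUT=sample1.bam OUTPUT=sample1_optdup.bam METRICS_FILE=sample1_optdup.metrics.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500

    The duplicate flags come out identical; only the optical-duplicate counts and the library size estimate in the metrics file change.)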

    Anyway, the bottom line is that in my writeup above I thought the optical duplicates would have relatively little effect on the actual duplicates, but testing shows it has exactly zero effect on which reads are marked as duplicate, and only matters if you're using the sequencing library complexity for something (probably not; it could be used for comparing different wet-lab workflows and preparation kits, for example). Skipping it will probably save you some runtime, though. The bioinformatics tool version does have a distinct effect, though, as you might expect; I don't see any difference in the options that should matter. (EDIT: I just remembered the Dante Labs pipeline marks duplicates per chromosome, probably to parallelise the runs. However, this probably makes it unable to mark duplicate fragments whose ends aren't on the same chromosome, which is a more likely reason for the large difference than the tool version. I'm not really sure I want to test that theory, though. Maybe.)
    Last edited by Donwulff; 02-04-2018 at 08:20 PM.

  7. The Following 3 Users Say Thank You to Donwulff For This Useful Post:

     Eridani (03-25-2018),  gotten (02-05-2018),  karwiso (02-05-2018)

  8. #105 | karwiso (Registered Users, 67 posts)
     Location: Stockholm, Sweden | Ethnicity: Ingrian + Russian | Nationality: Swedish, Russian
     Y-DNA: R1a1a1b1:Z92FGC46016 | mtDNA: J1c3k
    And here are some more statistics from my sample at YFull:

    ChrY BAM file size: 0.85 Gb (Hg38)
    Reads (all): 9877517
    Mapped reads: 9855042 (99.77%)
    Unmapped reads: 22475 (0.23%)
    Length coverage: 26360194 bp (99.79%)
    Min depth coverage: 1X
    Max depth coverage: 8010X
    Mean depth coverage: 32.61X
    Median depth coverage: 19X
    Length coverage for age: 8469975 bp
    No call: 54849 bp

  9. The Following User Says Thank You to karwiso For This Useful Post:

     gotten (02-05-2018)

  10. #106 | Donwulff (Registered Users, 185 posts)

    Quote Originally Posted by karwiso
    And here are some more statistics from my sample at YFull:

    ChrY BAM file size: 0.85 Gb (Hg38)
    Reads (all): 9877517
    Mapped reads: 9855042 (99.77%)
    Unmapped reads: 22475 (0.23%)
    Length coverage: 26360194 bp (99.79%)
    Min depth coverage: 1X
    Max depth coverage: 8010X
    Mean depth coverage: 32.61X
    Median depth coverage: 19X
    Length coverage for age: 8469975 bp
    No call: 54849 bp
    I wanted to ask: how did the Hg38 mapping happen? I was under the impression YFull didn't even accept Hg38 BAMs yet.
    I think I've probably posted mine before, but here they are again for comparison purposes. Hg19 though, so... yeah.

    ChrY BAM file size: 0.95 Gb (Hg19)
    Reads (all): 12090138
    Mapped reads: 12071826 (99.85%)
    Unmapped reads: 18312 (0.15%)
    Length coverage: 25578704 bp (99.71%)
    Min depth coverage: 1X
    Max depth coverage: 8016X
    Mean depth coverage: 40.56X
    Median depth coverage: 21X
    Length coverage for age: 8445972 bp
    No call: 74862 bp

    BigY:
    ChrY BAM file size: 0.41 Gb (Hg19)
    Reads (all): 6464708
    Mapped reads: 6464708 (100.00%)
    Unmapped reads: 0
    Length coverage: 14228866 bp (55.47%)
    Min depth coverage: 1X
    Max depth coverage: 7718X
    Mean depth coverage: 51.70X
    Median depth coverage: 36X
    Length coverage for age: 7946632 bp
    No call: 11424700 bp

  11. The Following 2 Users Say Thank You to Donwulff For This Useful Post:

     gotten (02-05-2018),  karwiso (02-05-2018)

  12. #107 | karwiso (Registered Users, 67 posts)
      Location: Stockholm, Sweden | Ethnicity: Ingrian + Russian | Nationality: Swedish, Russian
      Y-DNA: R1a1a1b1:Z92FGC46016 | mtDNA: J1c3k
    Quote Originally Posted by Donwulff
    I wanted to ask: how did the Hg38 mapping happen? I was under the impression YFull didn't even accept Hg38 BAMs yet.
    I think I've probably posted mine before, but here they are again for comparison purposes. Hg19 though, so... yeah.
    I have got FASTQ files from Dante Labs - those allow aligning the reads to any reference genome. YFull could extract the reads for the Y chromosome and align them to Hg38, but as I understood it, they would not normally work with whole genome data. I think this FASTQ option was on their order page, but it seems it has been removed now.

    I have ordered the third party analysis at FGC, so we can see what they can supply, probably a BAM aligned to Hg38. I will try Sequencing.com, but transferring 112-120 GB takes a lot of time. I have not tried samtools yet, but probably will later - I need to learn more about bioinformatics and genetic data processing.
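    (When I get to samtools, extracting just the Y chromosome reads should be as simple as this sketch, assuming an hg19-style BAM named sample1.bam:

        samtools index sample1.bam                       # needed once before region queries
        samtools view -b sample1.bam chrY > chrY.bam     # pull out just the chrY alignments
        samtools index chrY.bam

    The resulting chrY.bam should be roughly the size YFull reports - far easier to move around than the full 100+ GB.)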
    Last edited by karwiso; 02-05-2018 at 01:35 PM.

  13. #108 | JamesKane (Registered Users, 228 posts)
      Location: Wisconsin, USA | Nationality: American
      Y-DNA: R1b-FGC29071 | mtDNA: U5a1b1g*
    Quote Originally Posted by Donwulff
    I wanted to ask: how did the Hg38 mapping happen? I was under the impression YFull didn't even accept Hg38 BAMs yet.
    They have been taking Hg38 BAMs for a bit - about as long as YSEQ has been aligning their WGS files to the newer reference, anyway.

    Since a WGS BAM should contain the full set of reads from the FASTQ, anyone with a few tools and a good amount of free time could do this themselves and upload the result via some cloud storage.
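    The "few tools" part is essentially this (a sketch with samtools and bwa; all file names are placeholders):

        samtools collate -u -O sample1.bam | \
            samtools fastq -1 r1.fq.gz -2 r2.fq.gz -0 /dev/null -s /dev/null -n -   # BAM back to paired FASTQ
        bwa mem -t 8 GRCh38.fa r1.fq.gz r2.fq.gz | samtools sort -o sample1_hg38.bam -   # remap to Hg38
        samtools index sample1_hg38.bam

    The "good amount of free time" part is accurate too: the bwa step runs for many hours on a 30x genome.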

  14. #109 | Donwulff (Registered Users, 185 posts)

    Quote Originally Posted by JamesKane
    They have been taking Hg38 BAMs for a bit - about as long as YSEQ has been aligning their WGS files to the newer reference, anyway.

    Since a WGS BAM should contain the full set of reads from the FASTQ, anyone with a few tools and a good amount of free time could do this themselves and upload the result via some cloud storage.
    Dante Labs offers FASTQ directly too; I got the BAM because I wanted to see their alignment, though I didn't realize I should've asked for both, lol. And yes, as can be seen, I've already mapped the reads to several different references myself. I was curious how other people do it, though. I know you can pay for mapping on Sequencing.com, possibly, and there was talk of the upload being too large. I wasn't aware YFull was already taking Hg38 either, although in truth I've always been a little unsure of their pipeline. Accepting FASTQ in the past made it seem like they'd re-do the mapping themselves, but then they're requiring Hg38-mapped submissions, so I'm not sure. Also, the stats page on YFull is a straight copy of the original mapping stats, so either they take that before analysis (likely, in fact) or they don't map it themselves. And if you do the mapping yourself, there's still the choice of GRCh38 with or without alt contigs, with decoy sequences, analysis-ready or latest patch etc., all of which would subtly change the results.

    Anyway, yeah, "in this case YFull mapped it to Hg38 themselves" is a good answer.

    BAM is *probably* the better option normally, because it can carry more additional information than FASTQ usually does. The Broad Institute's preference for raw reads is now the "unmapped BAM" format. Of course, you can strip the mapping out, or just leave it in so you can check the pileup/alignment later. I rarely work with FASTQ directly, but I understand if YFull's workflow takes/took them.

    Also, while I'm here: I just finished testing duplicate-flagging the Dante Labs provided BAM in queryname (read name) sorted order, a new feature in Picard Tools that will also mark secondary alignments (when a read maps well to multiple locations on the genome) as duplicates if the primary alignment is a duplicate. Some newer Broad Institute pipelines are using this feature. For my BAM file, this flagged 253 950 additional mappings out of 1 976 391 secondary mappings as duplicates, going from 42 330 893 (3.21%) to 42 584 843 (3.23%) duplicates in total, versus 41 999 309 (3.19%) in the originally provided BAM.

    Unfortunately this is a fairly expensive operation for a small gain (depending, in fact, on whether the downstream pipeline even uses secondary alignments), in particular because Picard Tools requires the BAM file to be sorted in its own queryname order, which differs from everything else. Except... for the Broad Institute "unmapped BAM" order, so if you use that, you can run the new duplicate-marking flow on the freshly mapped uBAM before sorting it to chromosome/coordinate order, adding little to the normal runtime (however, it reduces the ability to analyse pieces of the same genome in parallel!). In theory at least, the more duplicates you're able to mark and exclude from variant calling, the fairer it will be.
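    (If you want to reproduce this on a coordinate-sorted BAM, a sketch; note the Picard SortSam step, since samtools' name sort isn't guaranteed to match Picard's queryname order. File names are placeholders:

        java -jar picard.jar SortSam INPUT=sample1.bam OUTPUT=sample1.qsort.bam SORT_ORDER=queryname
        java -jar picard.jar MarkDuplicates INPUT=sample1.qsort.bam OUTPUT=sample1.qsort.markdup.bam \
            METRICS_FILE=sample1.qsort.metrics.txt ASSUME_SORT_ORDER=queryname

    The SortSam pass is where most of the extra runtime goes.)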
    Last edited by Donwulff; 02-05-2018 at 11:29 PM.

  15. The Following 3 Users Say Thank You to Donwulff For This Useful Post:

     gotten (02-06-2018),  karwiso (02-05-2018),  Petr (02-05-2018)

  16. #110 | dbm
    Quote Originally Posted by karwiso
    I have got FASTQ files from Dante Labs - those allow aligning the reads to any reference genome. YFull could extract the reads for the Y chromosome and align them to Hg38, but as I understood it, they would not normally work with whole genome data. I think this FASTQ option was on their order page, but it seems it has been removed now.

    I have ordered the third party analysis at FGC, so we can see what they can supply, probably a BAM aligned to Hg38. I will try Sequencing.com, but transferring 112-120 GB takes a lot of time. I have not tried samtools yet, but probably will later - I need to learn more about bioinformatics and genetic data processing.
    It looks like Sequencing.com does the mapping for you - it really surprised me that they would be willing to do that; I sort of wonder what their privacy policy is. Regardless, FASTQ files are, I believe, significantly smaller than the BAM files.

    So they would take your FASTQ, align it to a reference genome (creating a BAM), and then create a VCF for you.

    I'd be very interested to see the differences in VCF files between Dante, Sequencing and any other provider's pipelines - including mapping to alternative reference genomes. I'm working on putting together a pipeline geared more towards single-genome analysis, based on Broad/GATK4. They actually have some pipelines already made public, but they are super confusing for a simpleton like me - and seem very much geared towards the analysis of massive numbers of genomes, not just a single one.
    Last edited by dbm; 02-06-2018 at 06:55 AM.
