Page 6 of 8
Results 51 to 60 of 80

Thread: Dante Labs WES/WGS Sequencing Technical

  1. #51
    Registered Users
    Posts
    373
    Sex

I landed the promised short-read alignment script using MarkDuplicatesSpark into my GitHub account at https://raw.githubusercontent.com/Do.../revert-bam.sh

I'll have to say that MarkDuplicatesSpark is probably the most frustratingly annoying bioinformatics tool I've come across. My first attempt failed on creating the BAM index because it ran out of memory with the default Java memory allocation; the second failed somewhere near the end because it throws temporary files all over the place and there's no way to continue; the third try failed because at the end of the process it complains about a leftover directory from the previous run; the fourth try failed midway because it realized there was a single-end read in the input... well, you start getting the idea.

On try n+1 it... just crashed on creating the BAM index for no apparent reason that I could find, essentially giving the same result as try #1. Okay, so I just generate the BAM index with samtools instead and call it a day. Or rather, call it the 48 hours the test runs took. Most of the issues wouldn't have been found on smaller test input, but ironically, on a four-core AMD the MarkDuplicatesSpark flow actually takes much longer than the previous MarkDuplicates workflow. This is contrary to what the Broad Institute says about its performance, although it might be in part because my MarkDuplicates version was already optimized with a fifo buffer, samtools for sorting and indexing etc. On the bright side MarkDuplicatesSpark manages to keep the cores busy most of the time, so it SHOULD be faster on 8 or 16 execution threads. Unfortunately, you can't make the BAM slices it processes larger than they are, so I imagine it's wasting a whole lot of work on juggling the Apache Spark/Hadoop blocks.

That being said, MarkDuplicatesSpark works on a single compute node with no additional requirements besides Java & GATK, so I made it the new workflow, because people running it probably have 8 cores or more and are in a hurry, right? I still preserved the old version, which can be used by commenting the GATK_SPARK definition out. Maybe I should automatically choose between them depending on the number of processor threads. But since MarkDuplicatesSpark seems to have a tendency to blow up, I added an extra define KEEP_TEMPORARY which defaults to preserving the raw bwa mem output before MarkDuplicatesSpark in case one has to go back to it. I would imagine it's really, really slow without an SSD for temp files due to it using literally tens of thousands of fragments, but I'm not crazy enough to try.
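The automatic choice mentioned above could be sketched roughly like this; note the thread cutoff and function name are my own inventions here for illustration, not what the actual script does (it toggles on the GATK_SPARK define):

```shell
#!/bin/sh
# Sketch only: auto-select the duplicate-marking tool by available threads.
# MIN_SPARK_THREADS is an invented cutoff, not from the real script.
MIN_SPARK_THREADS=8

pick_dedup_tool() {
    # $1 = number of hardware threads available
    if [ "$1" -ge "$MIN_SPARK_THREADS" ]; then
        echo "MarkDuplicatesSpark"   # parallel, but temperamental
    else
        echo "MarkDuplicates"        # slower overall here, battle-tested
    fi
}

THREADS=$(nproc 2>/dev/null || echo 1)
echo "would run: gatk $(pick_dedup_tool "$THREADS")"
```

Whether 8 is the right crossover point would need benchmarking on actual hardware.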

I would appreciate hearing about any unexpected effects should people try the script out, although in the end one is always responsible for checking the script to make sure it does what they need it to do. To that end, there's a bunch of links and references added in the comments.

2. #52

I started this thread in response to questions on how to properly align Dante Labs sequences to the reference genome, and then wrote a script illustrating how to prepare the raw data file in Broad Institute's GATK Best Practices' then-new "unmapped BAM" format. Just a reminder though that this isn't in any way Dante Labs specific; in fact my scripts include tricks for many other sequencing providers as well.

Anyway, I've re-submitted my Dante Labs sequence to YFull for analysis using the current version (Oh dear, I may have to start version-numbering the scripts...) of my re-mapping script. Just confirming that they recently finished the analysis based on that re-mapped GRCh38/hg38 BAM; I'm happy with the results, and at least YFull didn't have any comments about the mapping. I used the full options of the mapping pipeline script, i.e. BQSR, adapter-runthrough trimming, alt-aware mapping with k8/postalt post-processing, Spark parallelization, as well as my up-to-date GRCh38p13 + filtered decoy + HLA-3.36.0 reference (well, 3.37.0 is out just today, although it's dated on the 10th... I'm thinking of freezing this on HLA-3.36.0 because I don't want HLA changes affecting mapping comparisons).

There were no new novel SNPs found, which is somewhat expected for a WGS GRCh38 conversion, but one novel SNP was removed in my case. I've yet to get around to analysing why that happened. However, 11 "private" SNPs which have been seen in other phylogenetic branches of the Y-chromosome but not mine (i.e. high likelihood of being errors) were removed in comparison to the hg19 mapping. Two such SNPs were added. In addition 38 SNPs which should phylogenetically be there were found, and 10 lost, mostly due to ambiguous calls, i.e. reads supporting both variants. This gives me high confidence that the new mapping is an improvement over the earlier one. Consequently I'm ready to call that "version 1.0".

If I ever get a year-long vacation, I might get around to checking each of those SNPs in different mappings of the same raw data.

Currently I'm testing adding the oral microbiome to the reference genome, which might improve mapping of saliva samples; I think it's looking good right now. However, since this part isn't "best practices" or anything, I guess for this step I really should validate the results against Genome In A Bottle, simulated reads, and manual inspection of some affected variants. That's just... a lot of work.

EDIT: Whoops, https://github.com/Donwulff/bio-tool...master/mapping so people don't need to go through previous messages hunting for the link. Be forewarned that this is "of course" Linux/Unix, with plenty of dependencies and a significant amount of downloaded resources. I'm doing this for free, so it's not out-of-the-box ready, at least yet, although wrapping it in a virtual machine would be easy.
    Last edited by Donwulff; 07-05-2019 at 03:18 PM.

  3. The Following User Says Thank You to Donwulff For This Useful Post:

     pmokeefe (07-05-2019)

4. #53

Slight digression on the topic of reference genome validation: It's striking that "Similarities and differences between variants called with human reference genome HG19 or HG38" https://bmcbioinformatics.biomedcent...859-019-2620-0 published in March states "To date, the impact of genome version on SNV identification has not been rigorously assessed." While this claim hinges on the definition of "rigorous", and their assessment looks rigorous indeed, GRCh38 was released in 2013 and has been considered the standard reference assembly since 2014. This means it's been used for half a decade without "rigorous" assessment, on the assumption that since it's newer it must be better.

This new paper does end on the note "Our findings suggest caution when translating genetic findings between different versions of the human reference genome. After carefully reviewing the results of our in depth comparison, we recommend that newer version (HG38) should be used going forward in SNV analysis." though. In more layman terms, this means that GRCh38 should be used in new analysis pipelines and products, but the results are different enough that any interpretations (of the "rs123456 means you're going to die" variety) should be re-validated on the new reference genome.

Of course, various third party interpretation services certainly didn't validate those variants the first time around, but simply copied them from publications or 23andMe, so this might not pose a huge challenge for them. For really serious clinical interpretation this does pose a challenge, because the majority of studies and entries in databases like ClinVar have been made using GRCh37/hg19, and you shouldn't automatically trust that the findings will be the same using a new reference genome.

Returning to the issue at hand, that's also why you don't usually use the bleeding-edge latest reference genome with all imaginable additions - there's just not enough time in the world to validate and gain experience with every possible reference genome. (This is a very, very interesting point for GRCh39/graph-based reference genomes too; I think most studies are currently just dealing with the need for graph-based, population-specific or pan-genome references, but very few are considering the analytic complications of such.)

On the flip side, something like adding the oral microbiome to the reference will primarily mean that some reads are going to map preferentially to the oral microbiome, whereas without the microbiome in the reference they'd have mapped against the human genome. This is far more likely to increase false negatives than false positives. With the foreseeable state of genomics, false positives are a lot bigger deal than false negatives, because basically no genomic test can prove that you're immune to a condition, at most that you have lower genetic risk. Obviously, you don't want to lose too many real true variants, but I have a very strongly reasoned hunch that adding the microbiome should improve results from saliva sequencing. It's hard to exactly "prove" though. Simulate all ClinVar variants and align them against the reference genome with and without the microbiome, or simulate reads from the microbiome and see where in the human reference genome they will map? This approach will miss many complications in real genomes - microbiome that's not included in the reference, unexpected sequencing error types and biases etc. Or analyse a huge number of real genomes with both references, find discordant ClinVar variant calls and try to figure out what the true call is? Without the ability to Sanger sequence each discordant variant in a sample, this would be mostly a judgement call.

Nevertheless, it's an interesting topic to explore and compare, and as I've noted before the oral microbiome in DTC saliva sequences is as of yet a mostly untapped resource.


6. #54

In a similar vein to some wise-ass's crack that there are only two difficult issues in programming, caching and naming... frigging naming.

My bleeding-edge reference genome currently takes its cue from Heng Li's hg38DH, in which he named HLA sequences like "HLA-DPB2*01:01:01:01". Now, under the SAM file-format specification the colon : is an allowed character; however, considering how common the notation "chr1:20000" or "chr1:20000-30000" is, this turns out to be generally a Bad Idea. Yes, I did find that out the hard way. I haven't yet decided what I should use instead, though, because I haven't run into any other convention for identifying HLA alleles. I suppose underscore _ is pretty common for "unrepresentable character", though in the human assembly it's used explicitly for space, but I'm still thinking.
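For what it's worth, a minimal sketch of one possible renaming: map the problematic ':' (region syntax) and '*' (shell glob, see below) in hg38DH-style HLA contig names to '_'. This scheme is my own ad hoc idea, not an established HLA convention:

```shell
# Replace ':' and '*' in a contig name with '_'; purely illustrative.
sanitize_hla_name() {
    printf '%s\n' "$1" | tr ':*' '__'
}

sanitize_hla_name 'HLA-DPB2*01:01:01:01'
# -> HLA-DPB2_01_01_01_01
```

The obvious downside is that the renaming isn't reversible if the original name already contained underscores.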

And yes, if you're writing bad code (and who isn't, for quick tooling scripts, and sometimes intentionally wanting to support wildcards) the star is also going to cause problems. I'm thinking of letting it pass because that would force people to at least think about security when writing scripts for that, but yeah, basically I'm trying to come up with an acceptable alternative naming scheme for HLA alleles, so I probably need to change that as well as the minus character, because some downstream analysis is going to decide that it means contigs from HLA to DPB2.

I'm not sure there's a convention for the microbiome, and for various reasons this is actually going to be problematic - the microbe names aren't settled, many aren't even named, and few have very high quality contiguous reference genomes. I'm nevertheless thinking of prefixing them with "HOM_" for Human Oral Microbiome to prevent naming collisions and allow easy filtering. Of course, this will turn into a problem down the line because many of them aren't exclusive to the oral cavity. Perhaps people who have done commercial metagenome/microbiome tests know if they have a convention already? I don't think there exist easy solutions to the naming problem, though at least there are tools that allow easily re-mapping names in most bioinformatics file types to whatever convention is needed, so at least it's manageable.
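The proposed "HOM_" prefixing itself is trivial: tag every FASTA header line with the prefix while leaving sequence lines alone. The contig name below is made up for illustration:

```shell
# Prefix FASTA header lines (starting with '>') with HOM_; sequence
# lines pass through untouched.
prefix_fasta_headers() {
    sed 's/^>/>HOM_/'
}

printf '>s_mitis_contig1\nACGTACGT\n' | prefix_fasta_headers
# -> >HOM_s_mitis_contig1
#    ACGTACGT
```

Filtering such reads back out later is then a simple grep on "HOM_" in the contig column.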

And speaking of that, I'm using the UCSC naming (chr16_KI270853v1_alt etc.) only for contigs that UCSC has actually properly named; currently I decide this by inclusion in the UCSC reference genome, the latest version being at http://hgdownload.cse.ucsc.edu/golde...8/bigZips/p12/ The rest are left with NCBI GenBank accession naming like KI270853.1 (Or is that KI270853v1? Yes, this is hard! KI270853.1 is the official name, but I fully expect some analysis to choke on the dot, because v1 is used in UCSC names). There's a theoretical chance that the contig that ends up in the UCSC version could be different from the NCBI one, so one could wait until UCSC finalizes it. Although the revision should change in that case, I've seen this happen at least with the soft-mask changing between releases. On the other hand, THIS is going to be a problem because dbSNP is released with NCBI GenBank accession naming, meaning that versions from different times will become incompatible (which, in a way, illustrates why GenBank accession naming would be superior, but it isn't easy on the eyes or for filtering etc.). Luckily, few people really care about the alt-contigs and patches currently; right now you never see "The pathogenic variant is chrUn_KI270750v1:23602367" or use them for genetic genealogy matching. I might just choose to go full UCSC naming for human contigs if that ever happens.
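Converting between the two accession spellings discussed above is at least mechanical; an illustrative sketch that handles only the plain ACCESSION.N pattern, nothing else:

```shell
# GenBank-style "KI270853.1" <-> UCSC-style "KI270853v1".
genbank_to_ucsc() {
    printf '%s\n' "$1" | sed -E 's/^([A-Z]+[0-9]+)\.([0-9]+)$/\1v\2/'
}
ucsc_to_genbank() {
    printf '%s\n' "$1" | sed -E 's/^([A-Z]+[0-9]+)v([0-9]+)$/\1.\2/'
}

genbank_to_ucsc 'KI270853.1'   # -> KI270853v1
ucsc_to_genbank 'KI270853v1'   # -> KI270853.1
```

Real contig names like chr16_KI270853v1_alt embed extra context (chromosome, _alt suffix) that a round-trip like this can't recover, which is part of why the naming problem stays hard.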

    But this is just to note and put a little caveat that, yes, naming is still hard.

7. #55

Another footnote: After mapping the "common" oral microbiome in my sample, the vast majority of the remaining unidentified reads are from Sus scrofa... more commonly known as pork. The sample-taking instructions normally tell you not to eat, drink or chew gum for 30 minutes prior to giving the sample, which is more about avoiding washing out the cells with your own DNA from the mouth. I'm sure I observed at least that requirement, but I have a faint memory that I ate something hours earlier and thought it would be funny if I got its genome along. That was a couple of years ago so I no longer remember the specifics, but I guess we may know now! What strikes me as odd is that I'm not seeing any plant-based DNA though; you'd think there would be potatoes or rye or something. Some articles have talked about how long foreign DNA would likely survive in the mouth, which normally isn't very long, and I recall brushing my teeth earlier and hoping the chemicals wouldn't mess with the chemistry. It could also be contamination, but in any case it should be noted that there are a LOT of reads in a whole genome sequence, and these are a very, very small part of it.

However, in light of this, in case this kind of analysis ever becomes commonplace, people taking saliva-based sequencing tests now might want to make careful note of what they eat and how long before giving the sample - or why not even choose something interesting to eat beforehand!

Meanwhile I'm left with the problem of these reads complicating my analysis, because it's relatively computationally intensive to identify unknown genomic sequences. But as luck would have it, an improved Sus scrofa reference genome has JUST been published: http://dx.doi.org/10.1101/668921 so it's easier to just align to this genome now.
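Before aligning against the pig assembly, one would pull out the reads that didn't map to the human-plus-microbiome reference, i.e. SAM records with flag bit 0x4 set. A toy sketch of that flag filtering on streamed SAM text (in real use `samtools view -f 4` does this directly, and the records below are inlined dummy data):

```shell
# Keep only unmapped SAM records: skip header lines, test flag bit 0x4.
unmapped_only() {
    awk -F'\t' '!/^@/ && int($2/4) % 2 == 1'
}

printf 'read1\t4\t*\nread2\t0\tchr1\n' | unmapped_only
# -> read1	4	*
```

The filtered reads could then be converted back to FASTQ (e.g. with samtools fastq) and fed to the aligner with the Sus scrofa reference.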


  9. #56
    Gold Class Member
    Posts
    180
    Sex
    Location
    Rome, Italy
    Ethnicity
    Polish/British Isles
    Nationality
    U.S.
    Y-DNA (P)
    R-A9185
    mtDNA (M)
    H1

    Poland England Ireland Munster

    hg19.fasta file for Dante Labs BAM?

    I'm about to start working on some BAM files I received from Dante Labs.
    I noticed in the VCF header that corresponds to the BAM:
    ##reference=file:///l3bioinfo/hg19.fasta
    I happen to already have an hg19.fa fasta file that I built by downloading the file "hg19.2bit" sometime last year for a different purpose. Unfortunately I don't recall where I got that.
    Does anyone happen to know how I might be able to determine if the fasta file I already have is the same file that Dante used? If they differ just for the mtDNA, I don't care, as I'm not interested in mtDNA right now. But would that cause problems for the rest of the sequence when using
    bcftools mpileup
    for example?
    If my fasta file is different, in a way that could cause problems for non-mtDNA, any ideas for a workaround, or where to get the proper fasta file for the Dante Labs BAM?
    YFull: YF14620 (Dante Labs 2018)

10. #57

The hg19.fasta file Dante Labs' bioinformatics workflow used on my sample had the contigs in a different order from the "official" hg19. It's not too surprising, as a common way to construct it is by concatenating the individual contigs/chromosomes together. This could be easily seen by mtDNA being at the beginning of the VCF. The contents of the contigs themselves should not differ, or there's an error somewhere. I'm not sure if there is a common reference where they're in that order that you could just download.

    To get the contigs in same order, with shell & samtools you could do:

    wget http://hgdownload.cse.ucsc.edu/golde...ips/hg19.fa.gz
    gzip -d hg19.fa.gz
    samtools faidx hg19.fa
    samtools faidx hg19.fa `samtools idxstats my_bam_file.bam | head -n-1 | cut -f1` > hg19_dante.fa
    Those are backticks. Is there markup for /CCDE...
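To check up front whether an existing FASTA matches what the BAM was mapped against (per the question above), comparing contig names and lengths is usually enough - the @SQ lines from `samtools view -H` against the name/length columns of the .fai index. A sketch, with placeholder file names, demonstrated here on a single inlined header line:

```shell
# Extract "name<TAB>length" pairs from SAM @SQ header lines.
sq_from_header() {
    awk -F'\t' '/^@SQ/ {
        name = ""; len = "";
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) name = substr($i, 4);
            if ($i ~ /^LN:/) len = substr($i, 4);
        }
        print name "\t" len
    }'
}

printf '@SQ\tSN:chr1\tLN:249250621\n' | sq_from_header
# -> chr1	249250621

# Real usage (untested sketch, file names are placeholders):
#   samtools view -H my_bam_file.bam | sq_from_header | sort > bam_contigs.txt
#   cut -f1,2 hg19.fa.fai | sort > fasta_contigs.txt
#   diff bam_contigs.txt fasta_contigs.txt && echo "names and lengths match"
```

Matching names and lengths doesn't strictly prove identical sequence content, but in practice a mismatch here is what makes tools refuse to run.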


12. #58
Quote Originally Posted by Donwulff View Post
The hg19.fasta file Dante Labs' bioinformatics workflow used in my sample had the contigs in different order from the "official" hg19. [...]
    Thanks. In the meantime I started running a very long bcftools process using my previous hg19 FASTA, it looks like it's going to take around 24 hours just on the initial pass through the BAM! Hopefully the FASTA I'm using won't make a significant difference. But if something looks wrong and I have to start over, I will try what you recommend.
    YFull: YF14620 (Dante Labs 2018)

13. #59

I couldn't remember why I had to re-order the fasta in the first place; I think it's mainly GATK, which talks about "ROD"s as "reference-ordered data". When you're combining data from several huge resources, it helps to have them in the same order so programs can just stream them in order and have the same data from all files at once. File indexes get around that problem though, so bcftools may work just fine with a different order of contigs.

The main thing is that the contigs being processed need to be called the same (e.g. chr1 vs. 1, or even something like K12415v2 etc.) and have the same length. I believe every tool checks that those match. At the plain variant calling phase, the reference is mainly used to determine which variants match the reference, and which are different from the reference and therefore need to be output to the standard VCF. There are some interesting things one could intentionally do there, like the hg19K https://bmcgenomics.biomedcentral.co...864-019-5854-3 reference. Because the hg19 reference is based on actual humans, the sequence contains some minor alleles, i.e. those that differ from the common/ancestral variant. hg19K has attempted to replace those minor alleles with the more common alleles. If you did variant calling against that reference, it would accomplish some of what the gVCF/BED files try to, i.e. list variants even when they match the canonical reference if they're true minor alleles. Unfortunately I don't know where you could download hg19K, but now I'm feeling like I should construct one.

Other than that, every hg19 reference ought to have the same sequences; they may just be in a different order.


15. #60

    creating a GVCF file from Dante Labs raw data?

    My immediate issue is that I have two recent Dante Labs tests without a gvcf and an older one that did come with a gvcf. I have found the gvcf useful. In the short run I'd be happy with a quick and dirty solution which works OK for SNPs - I can wait for decent indels etc. I found this simple solution:
    bcftools mpileup -Ou --gvcf 0 -f hg19.fa my.bam |
    bcftools call -m --gvcf 0 -Oz -o my.g.vcf.gz
    https://www.biostars.org/p/360278/
    It's currently running on my old kit (has been for many hours!) and if it ever finishes and produces a result, I will compare that with the gvcf I received directly from Dante last year. If that looks good I will run my two new kits.
    I've also ordered analysis which should include hg38 gvcfs from companies including sequencing.com and FGC.

Any other methods or companies I should consider?
    YFull: YF14620 (Dante Labs 2018)


