
Thread: Dante Labs WES/WGS Sequencing Technical

  1. #1 (Donwulff)

    Finally splitting this off into a separate thread, as I and others have been spamming the main Dante Labs thread with sequencing and bioinformatics notes. I ordered my own Dante Labs WGS on Jul 10, 2017, at which time reviews implied DL had already delivered several sequences. The paper "A reference human genome dataset of the BGISEQ-500 sequencer", stating "Here, we present the first human whole-genome sequencing dataset of BGISEQ-500", had been published only on 1st April 2017, which made it seem impossible for Dante Labs to be using BGISEQ-500.

    On March 1st 2018 JamesKane observed "I really think they are using BGISEQ-500 here", which made me take another look at the possibility, and we can now conclude that at least my sequence has definitely been produced on a BGISEQ instrument. This isn't bad news, as the above paper, and the subsequent January 10th paper "Germline and somatic variant identification using BGISEQ-500 and HiSeq X Ten whole genome sequencing", indicate that BGISEQ-500 sequencing quality is at least comparable to the Illumina HiSeq X Ten. It may, however, have a few implications for re-analysis and bioinformatics work on the Dante Labs sequences. There have also been other questions and discussion about the workflows, so let's try to collect them here. Note that the Dante Labs pages have always implied they use several different technologies, so it's technically possible they use different sequencers for different samples.

    Key information about the bioinformatics workflow Dante Labs is using can be found in the BAM file's header section. Based on this, they are using a very standard GATK Best Practices compliant bioinformatics workflow: nothing specific to BGISEQ-500, and nothing that looks out of the ordinary compared to defaults. One thing of note is that the BAM alignment I received uses the UCSC hg19 chrM mtDNA reference, which is by now badly outdated and not accepted by most third-party services. YFull will re-align it and identify the maternal haplogroup as part of their service. Their reference also doesn't include any "decoy sequences", i.e. DNA commonly found in human samples (viruses etc.) that isn't part of the human reference genome. For example, "Toward better understanding of artifacts in variant calling from high-coverage samples" finds that "In general, we conclude that hs37d5 and GRCh38 are more complete than GRCh37", so re-mapping the sequence against hs37d5, which contains a newer mtDNA reference and the decoy sequences, may be justified even for those who want to do additional analysis in GRCh37.
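    To check your own file, the header can be dumped with samtools; a minimal example, assuming samtools is installed and the delivered BAM is named sample1.bam:
    Code:
    # Show the reference contigs (@SQ), read groups (@RG) and programs/parameters used (@PG)
    samtools view -H sample1.bam | grep -E '^@(SQ|RG|PG)'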

    The VCF files provided by Dante Labs at present contain only a subset of derived/non-reference SNPs from the sequence, ones that have been validated in some way. My bet is that it's based on some previous dbSNP (the official database of named/previously seen DNA variants) version, though I've yet to check which version it matches. For use in identity-by-descent matching like GEDmatch Genesis, which assumes that locations without a call are always a match, it would be beneficial to be able to call variants at all locations of the genome. Such a genomic VCF file can be generated on Sequencing.com from the Dante Labs provided BAM file, though the resulting gVCF file is so large that it can't be processed by all services. Hopefully there will be improvement all around as sequencing gets more common. Thomas Krahn of YSEQ has also developed the extract23 (Linux) shell tool for producing a 23andMe-style raw data file from a whole genome BAM file, for use with services that accept 23andMe data (yes, it does SNP calling using samtools/bcftools, so you'll need some computing resources).
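    For the curious, a minimal sketch of the same general idea (not extract23 itself): call genotypes only at a known list of array positions and print them in a simple layout. This assumes an hg19 reference FASTA, a tab-separated positions file with chromosome and position columns, and reasonably recent samtools/bcftools; all file names are illustrative.
    Code:
    # Call only at the listed positions, then print chromosome, position, alleles and genotype
    bcftools mpileup -f hg19.fa -T array_positions.tsv sample1.bam -Ou |
      bcftools call -m -Oz -o sample1.sites.vcf.gz
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' sample1.sites.vcf.gz | head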
    Last edited by Donwulff; 03-03-2018 at 03:16 PM.

  2. The Following 6 Users Say Thank You to Donwulff For This Useful Post:

     dbm (03-04-2018),  gotten (03-03-2018),  karwiso (03-03-2018),  Petr (04-01-2018),  pmokeefe (03-10-2018),  Wireless (03-12-2018)

  3. #2 (Donwulff)

    My analysis of the Dante Labs WGS pipeline at the time of my sequence, reconstructed from the BAM file and originally posted as an answer to someone asking about it:
    The BAM header line "@SQ SN:chrM LN:16571" identifies the 16,571 bp UCSC hg19 chrM, and hence the hg19 assembly. A fairly recent BWA-MEM 0.7.15 was used, together with Picard 2.5.0 and GATK 3.3, with SNP variants apparently selected from some list (but see post #4 below).

    It turns out that, with current Picard Tools (2.17.6), using a BGISEQ-500-specific read name template for identifying optical duplicates does not change the duplicate marking results in any way, though it does affect the estimate of sequencing library complexity, and possibly speeds up the duplicate marking phase slightly.
    This forum appears to have somehow eaten my command lines. For Picard MarkDuplicates the relevant options were OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 READ_NAME_REGEX="CL10.......L.C([0-9]+)R([0-9]+)_([0-9]+)" (check that this matches your read names, I may have assumed the wild-card dots wrong).
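    A hedged reconstruction of the full invocation (input/output file names and memory are illustrative, and the regex should be verified against your own read names):
    Code:
    java -Xmx8G -jar picard.jar MarkDuplicates \
        I=sample1.qsorted.bam \
        O=sample1.markdup.bam \
        M=sample1.markdup_metrics.txt \
        OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
        READ_NAME_REGEX="CL10.......L.C([0-9]+)R([0-9]+)_([0-9]+)"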

    The Dante Labs BAM file has been duplicate-marked in parallel, which ends up missing some duplicate reads. I tested the efficiency of whole-genome duplicate marking and of the new GATK 4 style queryname-sort-order duplicate marking.
    "For my BAM file, this flagged 253 950 additional mappings out of 1 976 391 secondary mappings as duplicate, going from 42 330 893 (3.21%) to 42 584 843 (3.23%) duplicates total. And 41 999 309 (3.19%) in the originally provided BAM."

    The supplementary data of the first BGISEQ-500 whole-genome paper identifies the canonical adapters used in the sequencing. Theoretically the GATK Best Practices pipeline should ignore non-matching sequence at the ends of a read, so this information is mostly useful for de-novo assembly, though trimming should also speed up alignment.
    SOAPnuke 1.5.3:
    Code:
    SOAPnuke filter -l 10 -q 0.1 -n 0.01 -Q 2 -G -f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA -r CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT

  4. The Following 4 Users Say Thank You to Donwulff For This Useful Post:

     gotten (03-03-2018),  karwiso (03-03-2018),  pmokeefe (03-10-2018),  Wireless (03-12-2018)

  5. #3 (Donwulff)

    On a philosophical note, the problem with sequencing bioinformatics is that generally you don't know the "correct answer". In theory, it could even be possible that the reference sequence is all wrong, and every process since then has repeated the same errors to arrive at (roughly) the same results. Fortunately, from different sequencing technologies, electron microscopy etc. we know the reference sequence is mostly correct. For determining the optimal bioinformatics workflow, however, this poses a major challenge, as you have no simple way of determining which result is "more correct". Studies often employ simulated sequencing data, for which the correct answer is known, but these in turn rely on the assumption that, for example, the simulated sequencing errors match those in real data.

    Given that, the debate on the significance of trimming adapters remains inconclusive. Basically, in modern NGS the DNA to be sequenced is broken into random fragments of some hundreds of bases each, adapter sequences are then ligated to both ends to facilitate sequencing, and the fragments are sequenced from both ends. When a fragment is shorter than the read length (100bp, 150bp etc.) a read-through will occur, where the sequencer reads past the fragment into the adapter.

    In the Complete Genomics/BGISEQ technology the resulting pieces of DNA are joined into a circle. The adapter sequences have a combined length of 74 bases, so in theory, if the fragment in question was shorter than 26 bases, it could get sequenced twice within the 100bp read length. Another possibility would be that multiple fragments, with two or more adapter pairs, end up in one DNA nanoball, but I doubt the technology could cope with such reads. I recall there's an adapter trimmer that can preserve the second sequence "past" the adapter on the other side, but I can't remember which one; given the extremely short read length possible that way, I'll just ignore those.

    [Attached image gix024fig1.jpg no longer available; apparently figure 1 from one of the BGISEQ-500 papers.]

    You might think it trivial to find these adapters, but in the presence of sequencing errors, and when the read-through can be as short as one or two bases, it becomes a challenging computational task. Because of this, in basic genomic re-sequencing (against a reference genome) for small variants the tendency has been to just map the reads against the reference, soft-clipping the ends of the read that don't match it. In the SAM/BAM file this can be seen in the CIGAR field: for example "4S30M66S" means the first 4 bases did not match the reference, the middle 30 did, and the last 66 again didn't. But this could also indicate a structural variant, a genuine difference from the reference.
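    As a rough way to see how common soft-clipping is in practice, here restricted to chrM to keep it quick (assumes samtools and a coordinate-sorted, indexed BAM named sample1.bam):
    Code:
    # Count reads whose CIGAR (field 6) contains a soft clip (S)
    samtools view sample1.bam chrM | \
      awk '$6 ~ /S/ {clipped++} {total++} END {printf "%d of %d reads soft-clipped (%.2f%%)\n", clipped, total, 100*clipped/total}'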

    One thing makes this problem more palatable, though: if a read-through occurred, then the fragment was shorter than the read length, and so the reads from the two ends will overlap, with the valid sequence from the fragment appearing in both reads as mirror images (reverse complements) of each other. Trimmomatic was one of the first adapter trimmers to use this fact for matching reads. It's widely used and well regarded, though performance benchmarks depend a bit on who is doing them and what parameters are used, plus the "no right answer" problem above. Let's try it on the Dante Labs sequence...

    samtools fastq sample1.bam -t -0 sample1.0.fq.gz -1 sample1.1.fq.gz -2 sample1.2.fq.gz # Getting FASTQ for Trimmomatic if we only have the BAM; process substitution for the outputs would parallelize this

    # Make the palindromic paired-end adapter sequence file
    cat << EOF > BGISEQ-PE.fa
    >PrefixPE/1
    AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA
    >PrefixPE/2
    CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT
    EOF

    time java -Xms26G -jar trimmomatic-0.36.jar PE -phred33 -trimlog sample1.bgiseq.log -validatePairs sample1.1.fq.gz sample1.2.fq.gz -baseout sample1.fq.gz ILLUMINACLIP:BGISEQ-PE.fa:2:30:10:1:true
    TrimmomaticPE: Started with arguments:
    -phred33 -trimlog sample1.bgiseq.log -validatePairs sample1.1.fq.gz sample1.2.fq.gz -baseout sample1.fq.gz ILLUMINACLIP:BGISEQ-PE.fa:2:30:10:1:true
    Multiple cores found: Using 4 threads
    Using templated Output files: sample1_1P.fq.gz sample1_1U.fq.gz sample1_2P.fq.gz sample1_2U.fq.gz
    Using PrefixPair: 'AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA' and 'CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT'
    ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
    Input Read Pairs: 658303183 Both Surviving: 658303183 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
    TrimmomaticPE: Completed successfully

    real 1302m35.373s
    user 2518m6.356s
    sys 270m3.212s

    This is a rather old computer, but that's about 22 hours. While this can be run highly in parallel on multiple computers, and there are faster options by now, it's easy to see why the preference is to skip this phase.

    The parameters in the Trimmomatic call above mean: allow 2 mismatches in the 16-base adapter seed, require roughly 30/0.6 = 50 bases to mirror each other (the fragment between the adapters), or roughly 10/0.6 = 17 bases to match the adapter sequence (mismatches increase the number of matches needed in both cases in a complex manner, see the documentation). The shortest adapter length clipped is 1 (since we can identify them confidently), and both reads are kept even though they're mirror images of each other, because most tools expect that.

    This identifies adapters in 3,214,470 out of 1,316,606,366 reads, or 0.24%, with an average read length of 99.9867bp after adapter clipping. Looking at the filtered output, 19 exact copies of the reverse adapter are left in the forward reads; these appear to all be adapter dimers, i.e. adapters joined together with no sample fragment in between. This level of adapter contamination is unlikely to significantly affect SNP calling from NGS shotgun sequencing, even if the variant caller uses soft-clipped sequences. GATK HaplotypeCaller incidentally does use soft-clipped sequences unless you give it --dontUseSoftClippedBases, because it essentially does local re-assembly to figure out structural variation by itself.
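    A quick way to repeat that check on the trimmed output, assuming the paired forward reads ended up in sample1_1P.fq.gz as in the Trimmomatic log above:
    Code:
    # Count exact occurrences of the reverse adapter among the trimmed forward-read sequences
    zcat sample1_1P.fq.gz | awk 'NR%4==2' | grep -c 'CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT'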

    In conclusion, trim adapters or don't, depending on your personal preference, but whatever you do, don't use the Illumina adapter sequences, as they will remove valid sequence while leaving the BGISEQ adapters in the reads. This is particularly important if you're using pre-made pipelines, as they often include an adapter trimming stage that doesn't expect BGISEQ adapters. Personally I had to look into this because I'm doing some de-novo assembly on the Y chromosome and the metagenome, where the adapter sequences matter, and I'm satisfied with the above results for now, though I might still run the whole variant calling pipeline on differently trimmed reads out of curiosity.
    Last edited by Donwulff; 03-04-2018 at 09:57 AM. Reason: attached image didn't work first time, probably expired

  6. The Following 4 Users Say Thank You to Donwulff For This Useful Post:

     dbm (03-10-2018),  karwiso (03-07-2018),  pmokeefe (03-10-2018),  Wireless (03-12-2018)

  7. #4 (Donwulff)

    On the main Dante Labs thread there was some question about how complete the Dante Labs VCF is. On my original reading of the VCF file headers, I figured that the parameter "variant=(RodBinding name=variant source=out_snp.vcf.gz)" to GATK SelectVariants meant that out_snp.vcf.gz contained the list of variants they would output. This is frequently done if there's a list of variants which have been validated in some way, e.g. by being listed in dbSNP as having been encountered elsewhere before.

    However, on closer look the "variant" parameter is actually the input sample, and the SelectVariants option actually in use is "excludeFiltered=true", which means it's used to leave out variants that have failed one of the filters.

    I can easily find SNPs where the read depth is 1 or the allelic variant quality is below 1. The lowest mapping quality is 21.0. The main filter looks to be Variant Quality Score Recalibration with the truth sensitivity set at 99%, i.e. each variant listed should be at least 99% certain compared to the assumed true variants, such as those found in dbSNP.
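    A couple of quick checks along the same lines, assuming bcftools and the delivered VCF as sample1.vcf.gz:
    Code:
    # Tabulate the FILTER column; with excludeFiltered only PASS records should remain
    zcat sample1.vcf.gz | grep -v '^#' | cut -f 7 | sort | uniq -c
    # List a few of the passing calls with read depth 1
    bcftools view -H -i 'FORMAT/DP=1' sample1.vcf.gz | head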

    So this means variants are listed regardless of whether they have been seen before, as long as they're at least 99% likely to be true. That makes the VCF fairly complete, and I don't think re-analysis is warranted on that basis, unless you know about and want to deal with highly uncertain variants. The UCSC reference with the old mtDNA and no decoy sequences, along with the fact that a genomic VCF isn't provided, may be reason enough to re-map and re-call the sequence. Or doing crazy experimentation.
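    If you do go that route, here is a minimal re-mapping sketch against the hs37d5 reference mentioned earlier, assuming bwa and samtools and the adapter-trimmed paired FASTQ files from the earlier posts (read-group values are illustrative); it covers only alignment and sorting, not duplicate marking, BQSR or variant calling:
    Code:
    bwa index hs37d5.fa   # one-off, slow
    bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' hs37d5.fa \
        sample1_1P.fq.gz sample1_2P.fq.gz | \
      samtools sort -@ 4 -o sample1.hs37d5.bam -
    samtools index sample1.hs37d5.bam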

  8. The Following 2 Users Say Thank You to Donwulff For This Useful Post:

     pm1113 (03-25-2018),  Wireless (03-20-2018)

  9. #5 (Donwulff)

    I wish I were able to edit some of the old posts: after checking, I believe the adapter trimming above did not work as expected. I'm working on validating something that works, and may edit this post with the details.

  10. #6 (Donwulff)

    Previously I was going through FASTQ files in order to test adapter trimming, which works best on FASTQ files. However, following up on the main Dante Labs thread, I moved on to testing the Broad Institute GATK 4.0 Best Practices recommended uBAM workflow. Here's the basic command for preparing a uBAM from, well, most BAM files. In the case of the Dante Labs provided BAM I found I had to add XS and XA to the tags to clear.

    I ended up trying to make this fairly foolproof; I guess I should still check whether Java is installed and install it if needed, etc. This script *only* creates an unmapped BAM from the original one. It should work equally well for Big Y, for example, with the SANITIZE step removing unpaired reads etc. I'll edit this post if I see any obvious problems. And I need to stick these on a public GitHub... DONE: https://github.com/Donwulff/bio-tools

    Code:
    #!/bin/sh
    bamfile=${1:-"sample1.bam"}
    outfile=${bamfile%%.bam}.u.bam
    
    if [ ! -e $bamfile ];
    then
      echo "Runs Picard Tools on Java to create Broad Institute uBAM from input BAM"
      echo "$0 [input.bam] [gigabytes of memory to use, if not all available memory]"
      exit
    fi
    
    if [ ! -e picard.jar ];
    then
      wget https://github.com/broadinstitute/picard/releases/download/2.18.2/picard.jar
    fi
    
    # Different fs from data files, prefer SSD
    tmp=${TMP:-"/tmp"}
    
    # Sizes in bytes
    inputsize=`du -D -b $bamfile | gawk '{print $1}'`
    tempsize=$((inputsize+(inputsize/2)))
    outfree=`LC_ALL=C df -B1 $bamfile | grep -v "^File" | gawk '{print $4}'`
    tempfree=`LC_ALL=C df -B1 $tmp | grep -v "^File" | gawk '{print $4}'`
    
    if [ $tempsize -gt $tempfree ];
    then
      echo "Approximately 1.5X input size is required for temporary storage $tmp"
      echo "Run with TMP=/path/to/tmp to use different path"
      exit
    fi
    
    if [ $inputsize -gt $outfree ];
    then
      echo "Output file $outfile probably won't fit on remaining space"
      exit
    fi
    
    totalmem=`LC_ALL=C free | grep -e "^Mem:" | gawk '{print $7}'`
    # Allow 2 gigabytes for runtime
    javamem=${2:-$((totalmem/1024/1024-2))}
    # https://sourceforge.net/p/picard/wiki/Main_Page/#q-a-picard-program-that-sorts-its-output-sambam-file-is-taking-a-very-long-time-andor-running-out-of-memory-what-can-i-do
    bamrecords=$((javamem*250000))
    
    # Ref: https://gatkforums.broadinstitute.org/gatk/discussion/6484
    java -Xmx${javamem}G -jar picard.jar RevertSam \
        I=$bamfile \
        O=$outfile \
        SANITIZE=true \
        MAX_DISCARD_FRACTION=0.005 \
        ATTRIBUTE_TO_CLEAR=XT \
        ATTRIBUTE_TO_CLEAR=XN \
        ATTRIBUTE_TO_CLEAR=AS \
        ATTRIBUTE_TO_CLEAR=OC \
        ATTRIBUTE_TO_CLEAR=OP \
        ATTRIBUTE_TO_CLEAR=XS \
        ATTRIBUTE_TO_CLEAR=XA \
        SORT_ORDER=queryname \
        RESTORE_ORIGINAL_QUALITIES=true \
        REMOVE_DUPLICATE_INFORMATION=true \
        REMOVE_ALIGNMENT_INFORMATION=true \
        MAX_RECORDS_IN_RAM=$bamrecords \
        TMP_DIR=$tmp
    Last edited by Donwulff; 04-10-2018 at 01:08 AM.

  11. The Following User Says Thank You to Donwulff For This Useful Post:

     gotten (04-11-2018)

  12. #7 (Petr)

    Worked fine.

    Dante WGS file:
    picard.sam.RevertSam done. Elapsed time: 316.98 minutes.

    Input BAM file size: 81.2 GiB
    Output uBAM file size: 86 GiB
    TMP space used: 128.3 GiB

    But ValidateSamFile shows:
    Code:
    ERROR: Read name CL100054478_L01_10, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group
    ERROR: Read name CL100054478_L01_11, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group
    ERROR: Read name CL100054478_L01_12, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group
    ERROR: Read name CL100054478_L01_9, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group
    (gawk is not part of a standard Ubuntu 16.04 LTS install, but can easily be installed)

    (Genos BAM file has 2 additional attributes to clear, BD and BI)

    Unfortunately it does not work with FTDNA BAM files:
    Code:
    [Wed Apr 11 20:26:03 CEST 2018] RevertSam INPUT=/home/petr/Storage/f195364/f195364_Rinda_CZE_BigY.bam OUTPUT=/home/petr/Storage/f195364/f195364_Rinda_CZE_BigY.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=7000000    OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Wed Apr 11 20:26:03 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
    [Wed Apr 11 20:26:05 CEST 2018] picard.sam.RevertSam done. Elapsed time: 0.03 minutes.
    Runtime.totalMemory()=637009920
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: WARNING: Read name SN7001368:163:H8TUEADXX:2:1202:11355:57332, No M or N operator between pair of D operators in CIGAR
            at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
            at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:253)
            at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:606)
            at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1575)
            at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2087)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:811)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
            at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
            at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
            at picard.sam.RevertSam.doWork(RevertSam.java:297)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
    Adding VALIDATION_STRINGENCY=LENIENT helped:
    Code:
    [Thu Apr 12 00:38:22 CEST 2018] picard.sam.RevertSam done. Elapsed time: 5.15 minutes.
    But I'm not sure that the resulting BAM file is really correct. ValidateSamFile shows no errors.

    Now I'm trying FGC Elite 1.0, FGC WGS 15x and Genos WES.

    Edit: For 2016 FTDNA Big Y file, the error is different:
    Code:
    [Thu Apr 12 00:21:44 CEST 2018] RevertSam INPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.bam OUTPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=6750000    OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Thu Apr 12 00:21:44 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
    INFO    2018-04-12 00:21:49     RevertSam       Reverted     1,000,000 records.  Elapsed time: 00:00:04s.  Time for last 1,000,000:    4s.  Last read position: chrY:7,983,283
    INFO    2018-04-12 00:21:54     RevertSam       Reverted     2,000,000 records.  Elapsed time: 00:00:09s.  Time for last 1,000,000:    4s.  Last read position: chrY:13,451,192
    INFO    2018-04-12 00:21:59     RevertSam       Reverted     3,000,000 records.  Elapsed time: 00:00:14s.  Time for last 1,000,000:    5s.  Last read position: chrY:15,783,059
    INFO    2018-04-12 00:22:05     RevertSam       Reverted     4,000,000 records.  Elapsed time: 00:00:20s.  Time for last 1,000,000:    5s.  Last read position: chrY:19,677,785
    INFO    2018-04-12 00:22:09     RevertSam       Reverted     5,000,000 records.  Elapsed time: 00:00:24s.  Time for last 1,000,000:    4s.  Last read position: chrY:22,893,813
    INFO    2018-04-12 00:22:16     RevertSam       Reverted     6,000,000 records.  Elapsed time: 00:00:31s.  Time for last 1,000,000:    6s.  Last read position: chrY:25,050,194
    INFO    2018-04-12 00:23:15     RevertSam       Reverted     7,000,000 records.  Elapsed time: 00:01:30s.  Time for last 1,000,000:   58s.  Last read position: chrY:28,818,028
    INFO    2018-04-12 00:23:15     RevertSam       Detected quality format for GRC14388618_S53_L00: Standard
    INFO    2018-04-12 00:23:26     RevertSam       Discarded 7027084 out of 7027084 (100.000%) reads in order to sanitize output.
    [Thu Apr 12 00:23:26 CEST 2018] picard.sam.RevertSam done. Elapsed time: 1.70 minutes.
    Runtime.totalMemory()=10565451776
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" picard.PicardException: Discarded 100.000% which is above MAX_DISCARD_FRACTION of 0.500%
            at picard.sam.RevertSam.doWork(RevertSam.java:336)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
    Last edited by Petr; 04-11-2018 at 11:02 PM.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  13. The Following User Says Thank You to Petr For This Useful Post:

     gotten (04-11-2018)

  14. #8 (Petr)

    FGC Elite 1.0

    Code:
    [Wed Apr 11 20:30:41 CEST 2018] RevertSam INPUT=/home/petr/Storage/KK2P9/KK2P9.bam OUTPUT=/home/petr/Storage/KK2P9/KK2P9.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=7000000    OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Wed Apr 11 20:30:41 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
    INFO    2018-04-11 20:30:48     RevertSam       Reverted     1,000,000 records.  Elapsed time: 00:00:05s.  Time for last 1,000,000:    5s.  Last read position: 1:78,627,563
    ....
    INFO    2018-04-11 20:44:20     RevertSam       Reverted    69,000,000 records.  Elapsed time: 00:13:37s.  Time for last 1,000,000:    3s.  Last read position: MT:1,113
    [Wed Apr 11 20:44:20 CEST 2018] picard.sam.RevertSam done. Elapsed time: 13.64 minutes.
    Runtime.totalMemory()=18256756736
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 69004234, Read name FCC63CUACXX:4:2306:8461:63745#, MAPQ should be 0 for unmapped read.
            at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:812)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
            at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
            at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
            at picard.sam.RevertSam.doWork(RevertSam.java:297)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
    FGC WGS 15x OK

    Code:
    [Wed Apr 11 23:07:23 CEST 2018] picard.sam.RevertSam done. Elapsed time: 115.50 minutes.
    Genos WES

    Code:
    [Wed Apr 11 23:31:43 CEST 2018] picard.sam.RevertSam done. Elapsed time: 21.26 minutes.
    Runtime.totalMemory()=12203327488
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" picard.PicardException: Discarded 2.572% which is above MAX_DISCARD_FRACTION of 0.500%
            at picard.sam.RevertSam.doWork(RevertSam.java:336)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
    Last edited by Petr; 04-11-2018 at 09:34 PM.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  15. #9 (Donwulff)

    A bit surprised about the missing gawk, it's pretty commonly used. I think Java needs to be installed separately as well, though; I can think about doing without gawk.

    The ValidateSamFile result on the Dante Labs BAM is expected, because it's (correctly) tagged as Complete Genomics (PL:COMPLETE), which not all tools know how to handle. I'm not sure how it will affect downstream analysis; it might be necessary to reheader the BAM and change the platform to ILLUMINA.
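    A hedged sketch of re-tagging the read-group platform (this rewrites the header only, the reads are untouched), assuming samtools and that COMPLETE occurs only as the PL value:
    Code:
    samtools view -H sample1.u.bam | sed 's/PL:COMPLETE/PL:ILLUMINA/' > new_header.sam
    samtools reheader new_header.sam sample1.u.bam > sample1.reheadered.u.bam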
    The discard-fraction error on Big Y is normal too; it's more of a warning, and comes about because FTDNA left out the mates of read pairs that didn't map to the Y chromosome, so the now-unpaired reads are discarded. (You could map them as single-ended reads, but that takes slightly more work.)
    I didn't look at the Genos exome yet, I'll try that.

    Also, no idea about the older FTDNA BAM; that's a warning as well, but it's not exactly expected. FGC reports an error, but does it finish the conversion? Those are both errors in the original BAM, and RevertSam should just fix/revert them: the CIGAR and MAPQ are cleared, and the mapper will fill them in again according to the new reference and mapping.

  16. #10 (Petr)

    With VALIDATION_STRINGENCY=LENIENT I was able to finish all BAM files (Big Y 2014-2015, FGC Elite 1.0) with the exception of one 2016 Big Y BAM file; this is the one that doesn't contain any chromosomes other than Y. I have just the one, so I cannot verify whether this is a problem with all newer Big Y BAM files or just this one.

    ValidateSamFile reports no errors for FGC Elite 1.0 and the following error for Big Y 2014-2015:
    Code:
    ERROR: Read name GRC14388645_S18_L00, A platform (PL) attribute was not found for read group
    Summary ValidateSamFile report for original 2014-2015 BigY files:
    Code:
    Error Type      Count
    ERROR:MISSING_PLATFORM_VALUE    1
    WARNING:MISSING_TAG_NM  13361471
    and for the 2016 file:
    Code:
    Error Type      Count
    ERROR:MATE_NOT_FOUND    7027084
    ERROR:MISSING_PLATFORM_VALUE    1
    WARNING:MISSING_TAG_NM  7027084
    These errors look like:
    Code:
    ERROR: Read name SN7001368:387:H5MNGADXX:1:2213:14624:97598 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
    ERROR: Read name SN7001368:387:H5MNGADXX:1:2115:5176:52276 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
    ERROR: Read name SN7001368:387:H5MNGADXX:2:2215:14715:88988 1:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
    ERROR: Read name SN7001368:387:H5MNGADXX:2:1204:7990:35829 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
    ERROR: Read name SN7001368:387:H5MNGADXX:1:2107:6608:72296 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
    I think "2:N:0:GGACTCCT+GTAAGGAG" should not be part of the read name.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A
