Dante Labs WES/WGS Sequencing Technical



Donwulff
03-03-2018, 02:29 PM
Finally splitting this off into a different thread, as I and others have been spamming the main Dante Labs thread with sequencing and bioinformatics notes... I ordered my own Dante Labs WGS on Jul 10, 2017, at which time reviews implied DL had already delivered several sequences. A reference human genome dataset of the BGISEQ-500 sequencer (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467036/) stating "Here, we present the first human whole-genome sequencing dataset of BGISEQ-500" was published on 1st April 2017, making it seem impossible that Dante Labs was using BGISEQ-500.

On March 1st 2018 JamesKane observed "I really think they are using BGISEQ-500 here", which made me take another look at this possibility, and now we can conclude that at least my sequence has definitely been produced on a BGISEQ instrument. This isn't bad news, as the above paper, and the subsequent January 10th paper Germline and somatic variant identification using BGISEQ-500 and HiSeq X Ten whole genome sequencing (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467036/), indicate that BGISEQ-500 sequencing quality is at least comparable to Illumina HiSeq X Ten. This may, however, have a few implications for re-analysis and bioinformatics work on the Dante Labs sequences. There have also been other questions and discussion about the workflows, so let's try to put them here. Note that Dante Labs pages have always implied they're using different technologies, so it's technically possible they could use different sequencers for different samples.

Key information about the bioinformatics workflow that Dante Labs is using can be found in the BAM file's header section. Based on this, they are using a very standard GATK Best Practices compliant bioinformatics workflow, with nothing specific for BGISEQ-500 and nothing that looks out of the ordinary compared to defaults. One thing of note is that the BAM alignment I received uses the UCSC hg19 chrM mtDNA reference, which today is highly outdated and not accepted by most third-party services. YFull will re-align it and identify the maternal haplogroup as part of their service. Their reference also doesn't include any "decoy sequences (http://www.cureffi.org/2013/02/01/the-decoy-genome/)", which are DNA sequences often found in human samples that aren't part of the human reference genome (viruses etc.). For example, Toward better understanding of artifacts in variant calling from high-coverage samples (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271055/) finds that "In general, we conclude that hs37d5 and GRCh38 are more complete than GRCh37", so re-mapping the sequence against hs37d5 (http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use), which contains a newer mtDNA reference and decoy sequences, may be justified even for those who want to do additional analysis in GRCh37.
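
For reference, the relevant header lines can be pulled out directly with samtools (a minimal example, assuming the delivered file is named sample1.bam as in the commands later in this thread):

# Show the programs/parameters (@PG) and reference contigs (@SQ) recorded in the BAM header
samtools view -H sample1.bam | grep '^@PG'
samtools view -H sample1.bam | grep '^@SQ' | head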

The VCF files provided by Dante Labs at present contain only a subset of derived/non-reference SNP's from the sequence, ones that have been validated in some way. My bet is that it's based on some previous dbSNP (the official database of named/previously seen DNA variants) version, though I've yet to check which version it matches. For use in identical-by-descent matching like GEDMatch Genesis, which assumes that locations without a call are always a match, it would be beneficial to be able to call variants at all locations of the genome. Such a Genomic VCF file (https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf) can be generated on Sequencing.com from the Dante Labs provided BAM file, though the resulting gVCF file is so large it can't be processed by all services. Hopefully there will be improvement all around as sequencing gets more common. Thomas Krahn of YSEQ has also developed the extract23 (Linux) shell tool (https://github.com/tkrahn/extract23) for producing a 23andMe raw data style file from a whole genome BAM file, for use with services that accept 23andMe data (yes, it does SNP calling using samtools/bcftools, so you'll need some computing resources).
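
For those curious what that kind of SNP calling looks like in practice, here's a rough sketch of the general samtools/bcftools approach (this is not the actual extract23 script; the positions file 23andMe_positions.bed is a placeholder you'd have to supply, the reference must be the same one the BAM was mapped against, and the BAM needs to be indexed):

# Call genotypes only at a predefined list of positions and write a small VCF;
# extract23 additionally reformats the calls into 23andMe-style raw data.
bcftools mpileup -f hg19.fa -R 23andMe_positions.bed sample1.bam -Ou |
  bcftools call -c -Oz -o sample1.positions.vcf.gz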

Donwulff
03-03-2018, 02:52 PM
My analysis of the Dante Labs WGS analysis pipeline at the time of my sequence, based on the BAM file, as an answer to someone asking about it. (https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=343002&viewfull=1#post343002)
The BAM header "@SQ SN:chrM LN:16571" is the UCSC hg19 chrM and assembly. A very recent version of BWA MEM (0.7.15) was used, with Picard 2.5.0 and GATK 3.3, and SNP variants were selected from some list.

It turns out that with current Picard Tools (2.17.6), using the BGISEQ-500 specific read name template for identifying optical duplicates does not change the duplicate marking results in any way, though it has an effect on the estimate of sequencing library complexity, and possibly speeds up the duplicate marking phase slightly. (https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=343261&viewfull=1#post343261)
This forum appears to somehow have eaten my command lines. For Picard MarkDuplicates: OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 READ_NAME_REGEX="CL10.......L.C([0-9]+)R([0-9]+)_([0-9]+)" (check that this matches the read names, I may have assumed the wild-card dots wrong).
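
For clarity, a reconstruction of roughly what that full command line would look like (file names are placeholders, and as said the regex should be checked against the actual read names first):

java -jar picard.jar MarkDuplicates \
    I=sample1.bam \
    O=sample1.md.bam \
    M=sample1.md_metrics.txt \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
    READ_NAME_REGEX="CL10.......L.C([0-9]+)R([0-9]+)_([0-9]+)"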

The Dante Labs BAM file has been duplicate-marked in parallel, which ends up missing some duplicate reads. I tested the efficiency of whole-genome duplicate marking and the new GATK 4 queryname sort-order duplicate marking. (https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=343788&viewfull=1#post343788)
"For my BAM file, this flagged 253 950 additional mappings out of 1 976 391 secondary mappings as duplicate, going from 42 330 893 (3.21%) to 42 584 843 (3.23%) duplicates total. And 41 999 309 (3.19%) in the originally provided BAM."

The BGISEQ-500 first whole genome paper's supplementary data identifies the canonical adapters used in the sequencing. Theoretically the GATK Best Practices pipeline should ignore non-matching sequences at the ends of a read, so this information is mostly useful for de novo assembly, though it should also speed up alignment. (https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=357507&viewfull=1#post357507)
SOAPnuke 1.5.3: SOAPnuke filter -l 10 -q 0.1 -n 0.01 -Q 2 -G -f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA -r CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT

Donwulff
03-04-2018, 09:41 AM
On a philosophical note, a problem with sequencing bioinformatics is that generally you don't know the "correct answer". In theory, it could even be possible that the reference sequence is all wrong, and every process since then has repeated the same errors to arrive at (roughly) the same results. Fortunately, from different sequencing technologies, electron microscopy etc. it's known that the reference sequence is mostly correct. For determining the optimal bioinformatics workflow, however, this poses a major challenge, as you have no simple way of determining which result is "more correct". Studies often employ simulated sequencing data (https://omictools.com/read-simulators-category), where the correct answer is known, but these in turn rely on the assumption that the simulated sequencing errors, for example, match those in real data (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/).

Given that, the debate on the significance of trimming adapters (https://www.ecseq.com/support/ngs/trimming-adapter-sequences-is-it-necessary) still remains inconclusive. Basically, in modern NGS the DNA to be sequenced is broken into random segments of some hundreds of bases each, then adapter sequences are ligated at both ends to facilitate sequencing, and the fragments are then sequenced from both ends. When the fragment is shorter than the read length (100bp, 150bp etc.) a read-through will occur, where the sequencer reads into the adapter past the fragment.

In Complete Genomics/BGISEQ sequencing the resulting pieces of DNA are joined into a circle. The adapter sequences have a combined length of 74 bases, so in theory, if the fragment in question was shorter than 26 bases, it could get sequenced twice within the 100bp read length. Another possibility could be that multiple fragments with two or more adapter pairs could end up in one DNA nanoball, but I doubt the technology could cope with such reads. I recall there's an adapter trimmer that can preserve the second sequence "past" the adapter on the other side, but I can't remember which one; given the extremely short read length possible that way, I'll just ignore those.


You might think it trivial to find these adapters, but in the presence of sequencing errors, and when the read-through can be as short as one or two bases, it becomes a challenging computational task (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7). Due to this, in basic genomic re-sequencing (against a reference genome) for somatic variants the tendency has been to just map the reads against the reference, soft-clipping the ends of the read that don't match the reference. In the SAM/BAM file this can be seen in the CIGAR field (http://bioinformatics.cvr.ac.uk/blog/java-cigar-parser-for-sam-format/): for example "4S30M66S" means the first 4 bases did not match the reference, the middle 30 did, and the last 66 again didn't match the reference. But this could also indicate a structural variation, a genuine difference from the reference.
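
If you want to see how common soft-clipping is in your own file, a quick and dirty count looks like this (assuming sample1.bam; the 0x904 flag mask skips unmapped, secondary and supplementary alignments):

# Count primary mapped reads whose CIGAR (column 6) contains a soft-clip (S) operation
samtools view -F 0x904 sample1.bam | awk '$6 ~ /S/' | wc -l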

One thing makes this problem more palatable, though: if a read-through occurred, then the fragment was shorter than the read length... and so the reads from both ends will overlap, with the valid sequences from the fragment being mirror images of each other. Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) was one of the first adapter trimmers to use this fact for matching reads. It's widely used and well regarded (https://www.reddit.com/r/bioinformatics/comments/63nu1f/comparing_quality_trimming_and_adapter_removing/), though performance benchmarks depend a bit on who is doing them and what parameters are used (https://academic.oup.com/bioinformatics/article/30/15/2114/2390096), plus there's the "no right answer" problem above. Let's try it on the Dante Labs sequence...

samtools fastq sample1.bam -t -0 sample1.0.fq.gz -1 sample1.1.fq.gz -2 sample1.2.fq.gz # Getting fastq for Trimmomatic if we only have the BAM; process substitution for the outputs would parallelize this

# Make the palindromic paired-end adapter sequence file
cat << EOF > BGISEQ-PE.fa
>PrefixPE/1
AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA
>PrefixPE/2
CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT
EOF

time java -Xms26G -jar trimmomatic-0.36.jar PE -phred33 -trimlog sample1.bgiseq.log -validatePairs sample1.1.fq.gz sample1.2.fq.gz -baseout sample1.fq.gz ILLUMINACLIP:BGISEQ-PE.fa:2:30:10:1:true
TrimmomaticPE: Started with arguments:
-phred33 -trimlog sample1.bgiseq.log -validatePairs sample1.1.fq.gz sample1.2.fq.gz -baseout sample1.fq.gz ILLUMINACLIP:BGISEQ-PE.fa:2:30:10:1:true
Multiple cores found: Using 4 threads
Using templated Output files: sample1_1P.fq.gz sample1_1U.fq.gz sample1_2P.fq.gz sample1_2U.fq.gz
Using PrefixPair: 'AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA' and 'CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 658303183 Both Surviving: 658303183 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully

real 1302m35.373s
user 2518m6.356s
sys 270m3.212s

This is a rather old computer, but that's about 22 hours. While this can be run highly parallel on multiple computers, and there are now faster options (https://github.com/OpenGene/fastp), it's easy to see why the preference is to skip this phase.

The parameters in the above Trimmomatic call mean: allow 2 mismatches in the first 16 bases of the adapter, and require about 30/0.6 = 50 bases to mirror each other (the fragment between the adapters) or 10/0.6 ≈ 17 bases to match the adapter sequence (mismatches increase the number of needed matches in both cases in a complex manner, see the documentation). The shortest adapter length clipped is 1 (since we can identify them confidently), and both reads are kept even though they're mirror images of each other, because most tools expect that.

This identifies adapters in 3 214 470 out of 1 316 606 366 reads, or 0.24%, with an average read length of 99.9867bp after adapter clipping. Looking at the filtered output, 19 exact cases of the reverse primer are left in the forward read; these appear to all be degenerated dimers, i.e. adapters joined together with no sample fragment. This level of adapters is unlikely to significantly affect somatic SNP calling from NGS shotgun sequencing, even if the variant caller uses soft-clipped sequences. GATK HaplotypeCaller incidentally does use soft-clipped sequences unless you give it --dontUseSoftClippedBases, because it essentially does local re-alignment (https://software.broadinstitute.org/gatk/documentation/article.php?id=4146) to figure out structural variation by itself.
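
As an illustration only, a GATK 3.x invocation with that flag might look roughly like this (reference and file names are placeholders and must match whatever the BAM was mapped against; this is not the Dante Labs pipeline's actual command):

# HaplotypeCaller does local reassembly and by default also uses soft-clipped bases;
# --dontUseSoftClippedBases turns that off if adapter read-through is a concern.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R hg19.fa -I sample1.bam \
    --dontUseSoftClippedBases \
    -o sample1.vcf.gz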

In conclusion, trim adapters or don't depending on your personal preference, but whatever you do, don't use the Illumina adapter sequences (https://support.illumina.com/downloads/illumina-customer-sequence-letter.html), as they will remove valid sequences while leaving the BGISEQ adapters in the reads. This is particularly important if using pre-made pipelines, as they often include an adapter trimming stage that doesn't expect BGISEQ adapters. Personally I had to look into this because I'm doing some de novo assembly on the Y chromosome and metagenome, where the adapter sequences matter, and I'm satisfied with the above results for now, though I might still run the whole variant calling pipeline on reads trimmed in different ways out of curiosity.

Donwulff
03-20-2018, 02:44 PM
On the main Dante Labs thread there was some confusion over how complete the Dante Labs VCF is. On my original reading of the VCF file headers, I figured that the parameter "variant=(RodBinding name=variant source=out_snp.vcf.gz)" to GATK SelectVariants meant that out_snp.vcf.gz contained a list of the variants they would output. This is frequently done if there's a list of variants which have been validated in some way, e.g. by being listed in dbSNP as having been encountered elsewhere before.

However, on a closer look the "variant" parameter is actually the input sample, and the actually used SelectVariants option is "-excludeFiltered=true", which means it's used to leave out variants which have failed one of the filters.

I can easily find SNP's where the read depth is 1 or the allelic variant quality is below 1. The lowest mapping quality is 21.0. The main filter looks to be the Variant Quality Score Recalibration with the truth sensitivity set at 99%, i.e. each variant listed should be at least 99% certain compared to assumed true variants, such as those found in dbSNP.
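
For anyone wanting to verify this against their own files, a couple of quick checks (assuming the provided VCF is named sample1.snp.vcf.gz and bcftools is installed):

# Tally the FILTER column values, then count calls made at read depth 1
zcat sample1.snp.vcf.gz | grep -v '^#' | cut -f7 | sort | uniq -c
bcftools view -H -i 'FORMAT/DP=1' sample1.snp.vcf.gz | wc -l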

So this means variants are listed regardless of whether they've been previously seen or not, as long as they're at least 99% likely to be true. That means the VCF is fairly complete, and I don't think re-analysis is warranted on that basis, unless you know of and want to deal with highly uncertain variants. The UCSC reference with old mtDNA and no decoy sequences, along with the fact that no Genomic VCF is provided, may be reason enough to re-map & call the sequence. Or doing crazy experimentation.

Donwulff
04-09-2018, 09:55 PM
It would be nice to be able to edit some old posts: after checking, I found out that the above adapter-trimming did not work as I expected. I'm working on validating something that works, and I may edit this post with the details ;)

Donwulff
04-09-2018, 10:58 PM
Previously I was going through FASTQ files to be able to test adapter-trimming, which works best on FASTQ files. However, with reference to the main Dante Labs thread, I moved on to test the Broad Institute GATK 4.0 Best Practices recommended uBAM (https://gatkforums.broadinstitute.org/gatk/discussion/6484) workflow. Here's the basic command for preparing a uBAM from, well, most BAM files. In the case of the Dante Labs provided BAM, I found out I had to add XS and XA to the tags to clear.

I ended up trying to make this fairly foolproof; I guess I should still check whether java is installed and install it if needed, etc. ;) This script *only* creates an unmapped BAM from the original one. This should work equally well for Big Y for example, with the SANITIZE bit removing unpaired reads etc. However, I'll edit this post if I see any obvious problems. And I need to stick these on a public GitHub... DONE: https://github.com/Donwulff/bio-tools



#!/bin/sh
bamfile=${1:-"sample1.bam"}
outfile=${bamfile%%.bam}.u.bam

if [ ! -e $bamfile ];
then
echo "Runs Picard Tools on Java to create Broad Institute uBAM from input BAM"
echo "$0 [input.bam] [gigabytes of memory to use, if not all available memory]"
exit
fi

if [ ! -e picard.jar ];
then
wget https://github.com/broadinstitute/picard/releases/download/2.18.2/picard.jar
fi

# Different fs from data files, prefer SSD
tmp=${TMP:-"/tmp"}

# Sizes in bytes
inputsize=`du -D -b $bamfile | gawk '{print $1}'`
tempsize=$((inputsize+(inputsize/2)))
outfree=`LC_ALL=C df -B1 $bamfile | grep -v "^File" | gawk '{print $4}'`
tempfree=`LC_ALL=C df -B1 $tmp | grep -v "^File" | gawk '{print $4}'`

if [ $tempsize -gt $tempfree ];
then
echo "Approximately 1.5X input size is required for temporary storage $tmp"
echo "Run with TMP=/path/to/tmp to use different path"
exit
fi

if [ $inputsize -gt $outfree ];
then
echo "Output file $outfile probably won't fit on remaining space"
exit
fi

totalmem=`LC_ALL=C free | grep -e "^Mem:" | gawk '{print $7}'`
# Allow 2 gigabytes for runtime
javamem=${2:-$((totalmem/1024/1024-2))}
# https://sourceforge.net/p/picard/wiki/Main_Page/#q-a-picard-program-that-sorts-its-output-sambam-file-is-taking-a-very-long-time-andor-running-out-of-memory-what-can-i-do
bamrecords=$((javamem*250000))

# Ref: https://gatkforums.broadinstitute.org/gatk/discussion/6484
java -Xmx${javamem}G -jar picard.jar RevertSam \
I=$bamfile \
O=$outfile \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
ATTRIBUTE_TO_CLEAR=XS \
ATTRIBUTE_TO_CLEAR=XA \
SORT_ORDER=queryname \
RESTORE_ORIGINAL_QUALITIES=true \
REMOVE_DUPLICATE_INFORMATION=true \
REMOVE_ALIGNMENT_INFORMATION=true \
MAX_RECORDS_IN_RAM=$bamrecords \
TMP_DIR=$tmp

Petr
04-11-2018, 06:31 PM
Worked fine.

Dante WGS file:
picard.sam.RevertSam done. Elapsed time: 316.98 minutes.

Input BAM file size: 81.2 GiB
Output uBAM file size: 86 GiB
TMP space used: 128.3 GiB

But ValidateSamFile shows:
ERROR: Read name CL100054478_L01_10, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group
ERROR: Read name CL100054478_L01_11, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group
ERROR: Read name CL100054478_L01_12, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group
ERROR: Read name CL100054478_L01_9, The platform (PL) attribute (COMPLETE) + was not one of the valid values for read group

(gawk is not part of standard Ubuntu 16.04 LTS, but can be easily installed)

(Genos BAM file has 2 additional attributes to clear, BD and BI)

Unfortunately it does not work with FTDNA BAM files:

[Wed Apr 11 20:26:03 CEST 2018] RevertSam INPUT=/home/petr/Storage/f195364/f195364_Rinda_CZE_BigY.bam OUTPUT=/home/petr/Storage/f195364/f195364_Rinda_CZE_BigY.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=7000000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Apr 11 20:26:03 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
[Wed Apr 11 20:26:05 CEST 2018] picard.sam.RevertSam done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=637009920
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: WARNING: Read name SN7001368:163:H8TUEADXX:2:1202:11355:57332, No M or N operator between pair of D operators in CIGAR
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:253)
at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:606)
at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1575)
at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2087)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:811)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.RevertSam.doWork(RevertSam.java:297)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)


Adding VALIDATION_STRINGENCY=LENIENT helped:
[Thu Apr 12 00:38:22 CEST 2018] picard.sam.RevertSam done. Elapsed time: 5.15 minutes.
But I'm not sure that the resulting BAM file is really correct. ValidateSamFile shows no errors.

Now I'm trying FGC Elite 1.0, FGC WGS 15x and Genos WES.

Edit: For 2016 FTDNA Big Y file, the error is different:

[Thu Apr 12 00:21:44 CEST 2018] RevertSam INPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.bam OUTPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=6750000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Apr 12 00:21:44 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
INFO 2018-04-12 00:21:49 RevertSam Reverted 1,000,000 records. Elapsed time: 00:00:04s. Time for last 1,000,000: 4s. Last read position: chrY:7,983,283
INFO 2018-04-12 00:21:54 RevertSam Reverted 2,000,000 records. Elapsed time: 00:00:09s. Time for last 1,000,000: 4s. Last read position: chrY:13,451,192
INFO 2018-04-12 00:21:59 RevertSam Reverted 3,000,000 records. Elapsed time: 00:00:14s. Time for last 1,000,000: 5s. Last read position: chrY:15,783,059
INFO 2018-04-12 00:22:05 RevertSam Reverted 4,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 5s. Last read position: chrY:19,677,785
INFO 2018-04-12 00:22:09 RevertSam Reverted 5,000,000 records. Elapsed time: 00:00:24s. Time for last 1,000,000: 4s. Last read position: chrY:22,893,813
INFO 2018-04-12 00:22:16 RevertSam Reverted 6,000,000 records. Elapsed time: 00:00:31s. Time for last 1,000,000: 6s. Last read position: chrY:25,050,194
INFO 2018-04-12 00:23:15 RevertSam Reverted 7,000,000 records. Elapsed time: 00:01:30s. Time for last 1,000,000: 58s. Last read position: chrY:28,818,028
INFO 2018-04-12 00:23:15 RevertSam Detected quality format for GRC14388618_S53_L00: Standard
INFO 2018-04-12 00:23:26 RevertSam Discarded 7027084 out of 7027084 (100.000%) reads in order to sanitize output.
[Thu Apr 12 00:23:26 CEST 2018] picard.sam.RevertSam done. Elapsed time: 1.70 minutes.
Runtime.totalMemory()=10565451776
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Discarded 100.000% which is above MAX_DISCARD_FRACTION of 0.500%
at picard.sam.RevertSam.doWork(RevertSam.java:336)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

Petr
04-11-2018, 07:11 PM
FGC Elite 1.0


[Wed Apr 11 20:30:41 CEST 2018] RevertSam INPUT=/home/petr/Storage/KK2P9/KK2P9.bam OUTPUT=/home/petr/Storage/KK2P9/KK2P9.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=7000000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Apr 11 20:30:41 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
INFO 2018-04-11 20:30:48 RevertSam Reverted 1,000,000 records. Elapsed time: 00:00:05s. Time for last 1,000,000: 5s. Last read position: 1:78,627,563
....
INFO 2018-04-11 20:44:20 RevertSam Reverted 69,000,000 records. Elapsed time: 00:13:37s. Time for last 1,000,000: 3s. Last read position: MT:1,113
[Wed Apr 11 20:44:20 CEST 2018] picard.sam.RevertSam done. Elapsed time: 13.64 minutes.
Runtime.totalMemory()=18256756736
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 69004234, Read name FCC63CUACXX:4:2306:8461:63745#, MAPQ should be 0 for unmapped read.
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:812)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.RevertSam.doWork(RevertSam.java:297)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)


FGC WGS 15x OK


[Wed Apr 11 23:07:23 CEST 2018] picard.sam.RevertSam done. Elapsed time: 115.50 minutes.

Genos WES


[Wed Apr 11 23:31:43 CEST 2018] picard.sam.RevertSam done. Elapsed time: 21.26 minutes.
Runtime.totalMemory()=12203327488
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Discarded 2.572% which is above MAX_DISCARD_FRACTION of 0.500%
at picard.sam.RevertSam.doWork(RevertSam.java:336)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

Donwulff
04-12-2018, 04:39 AM
Bit surprised about the missing gawk, it's pretty commonly used. I think java needs to be installed as well, though, but I can think about doing this without gawk.

The ValidateSam result on Dante Labs is expected, because it's (correctly) tagged as Complete Genomics, which not all tools know how to handle. I'm not sure how it'll affect downstream analysis; it might be necessary to reheader it/change it back to ILLUMINA.
The discard fraction error on Big Y is normal too; that's more of a warning, and comes about because FTDNA left out the mates of read pairs that didn't map to the Y chromosome, so the now-unpaired reads are discarded. (You could map them as single-ended, but that takes slightly more work.)
I didn't look at the Genos exome yet, I'll try that.

Also, no idea about the older FTDNA BAM; that's a warning as well, but that one's not exactly expected. FGC reports an error, but does it finish the conversion? Those are both errors in the original BAM, and RevertSam should just fix/revert them. The CIGAR and MAPQ are cleared, and the mapper should fill them in again according to the new reference & mapping.

Petr
04-12-2018, 08:07 AM
With VALIDATION_STRINGENCY=LENIENT I was able to finish all BAM files (Big Y 2014-2015, FGC Elite 1.0) with the exception of one 2016 Big Y BAM file; this is the one that does not contain chromosomes other than Y. I have just the one, so I cannot verify whether this is a problem with all newer Big Y BAM files or just this one.

ValidateSamFile shows no errors for FGC Elite 1.0 and the following error for Big Y 2014-2015:
ERROR: Read name GRC14388645_S18_L00, A platform (PL) attribute was not found for read group

Summary ValidateSamFile report for original 2014-2015 BigY files:


Error Type Count
ERROR:MISSING_PLATFORM_VALUE 1
WARNING:MISSING_TAG_NM 13361471
and for the 2016 file:
Error Type Count
ERROR:MATE_NOT_FOUND 7027084
ERROR:MISSING_PLATFORM_VALUE 1
WARNING:MISSING_TAG_NM 7027084
These errors look like:

ERROR: Read name SN7001368:387:H5MNGADXX:1:2213:14624:97598 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
ERROR: Read name SN7001368:387:H5MNGADXX:1:2115:5176:52276 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
ERROR: Read name SN7001368:387:H5MNGADXX:2:2215:14715:88988 1:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
ERROR: Read name SN7001368:387:H5MNGADXX:2:1204:7990:35829 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
ERROR: Read name SN7001368:387:H5MNGADXX:1:2107:6608:72296 2:N:0:GGACTCCT+GTAAGGAG, Mate not found for paired read
I think "2:N:0:GGACTCCT+GTAAGGAG" should not be part of the read name.

Petr
04-12-2018, 02:15 PM
It is interesting that sometimes the uBAM file is smaller than the original BAM file and sometimes it is bigger:

Test            bam hg19    u.bam    J Kane hg38*
Dante WGS          83133    88048
FGC Elite 1.0       9339     6216       12962
FGC WGS 15x        26744    30102       74318
Big Y (2014)         829     1178        2462
Big Y (2014)         825     1129        2607
Big Y (2014)         831     1133        2415
Big Y (2014)         901     1226        2517
Big Y (2015)         812     1184        2449
Big Y (2015)         868     1174        2597
Big Y (2015)         807     1236        2580
Big Y (2015)         670      917        2122
Big Y (2016)         490    830**        1284
Genos WES           6396     5480       11153

File sizes in MiB.
* - the third column contains BAM files created by http://www.it2kane.org/2016/10/y-dna-variant-discovery-workflow-pt-1-1/
** - the uBAM file could not be created from the original BAM file, so it was created from the hg38 file produced by the J. Kane workflow.

Donwulff
04-12-2018, 02:48 PM
It's unclear which of those are from ValidateSam and which are from the script. The script works for my Big Y done in 2015, soon after they started filtering out non-Y reads. The script copies the original read group header, and for some reason FTDNA didn't put the platform there. All of that is annoying, because some tools expect it; I do wonder if I should make the script fill it in if missing.

[email protected]:~$ bio-tools/mapping/revert-bam.sh
Reverting sample1.bam into sample1.unmapped.bam with Picard Tools
14:03:25.386 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/_/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Apr 12 14:03:30 GMT 2018] RevertSam INPUT=sample1.bam OUTPUT=sample1.unmapped.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=2500000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Apr 12 14:03:30 GMT 2018] Executing as [email protected] on Linux 4.4.0-17133-Microsoft amd64; OpenJDK 64-Bit Server VM 9-internal+0-2016-04-14-195246.buildd.src; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
INFO 2018-04-12 14:03:35 RevertSam Reverted 1,000,000 records. Elapsed time: 00:00:05s. Time for last 1,000,000: 5s. Last read position: chrY:8,476,786
INFO 2018-04-12 14:03:41 RevertSam Reverted 2,000,000 records. Elapsed time: 00:00:10s. Time for last 1,000,000: 5s. Last read position: chrY:13,708,840
INFO 2018-04-12 14:04:05 RevertSam Reverted 3,000,000 records. Elapsed time: 00:00:35s. Time for last 1,000,000: 24s. Last read position: chrY:17,177,919
INFO 2018-04-12 14:04:11 RevertSam Reverted 4,000,000 records. Elapsed time: 00:00:40s. Time for last 1,000,000: 5s. Last read position: chrY:21,174,636
INFO 2018-04-12 14:04:15 RevertSam Reverted 5,000,000 records. Elapsed time: 00:00:45s. Time for last 1,000,000: 4s. Last read position: chrY:24,115,859
INFO 2018-04-12 14:04:38 RevertSam Reverted 6,000,000 records. Elapsed time: 00:01:07s. Time for last 1,000,000: 22s. Last read position: chrY:26,139,146
INFO 2018-04-12 14:04:40 RevertSam Detected quality format for _: Standard
INFO 2018-04-12 14:05:04 RevertSam Sanitized 1,000,000 records. Elapsed time: 00:00:23s. Time for last 1,000,000: 14s. Last read position: */*
INFO 2018-04-12 14:05:17 RevertSam Sanitized 2,000,000 records. Elapsed time: 00:00:37s. Time for last 1,000,000: 13s. Last read position: */*
INFO 2018-04-12 14:05:30 RevertSam Sanitized 3,000,000 records. Elapsed time: 00:00:50s. Time for last 1,000,000: 13s. Last read position: */*
INFO 2018-04-12 14:05:44 RevertSam Sanitized 4,000,000 records. Elapsed time: 00:01:03s. Time for last 1,000,000: 13s. Last read position: */*
INFO 2018-04-12 14:05:57 RevertSam Sanitized 5,000,000 records. Elapsed time: 00:01:16s. Time for last 1,000,000: 13s. Last read position: */*
INFO 2018-04-12 14:06:10 RevertSam Sanitized 6,000,000 records. Elapsed time: 00:01:30s. Time for last 1,000,000: 13s. Last read position: */*
INFO 2018-04-12 14:06:16 RevertSam Discarded 39868 out of 6464708 (0.617%) reads in order to sanitize output.
[Thu Apr 12 14:06:16 GMT 2018] picard.sam.RevertSam done. Elapsed time: 2.85 minutes.
Runtime.totalMemory()=10068426752
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Discarded 0.617% which is above MAX_DISCARD_FRACTION of 0.500%
at picard.sam.RevertSam.doWork(RevertSam.java:336)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

That error is expected, since the BAM file contains unpaired reads, which need to be discarded or paired-end mapping won't be possible. The unpaired reads can be mapped separately, although it's possible they don't belong on the Y chromosome, because their other end mapped somewhere else. With FTDNA cutting the pairs out, it's hard to tell. The script could leave that error out, but I personally think it's good to warn about possible loss of data.

The "WARNING:MISSING_TAG_NM" warnings are expected, because the result is unmapped BAM, which doesn't include mapping information.

2:N:0:GGACTCCT+GTAAGGAG is from the Illumina Casava 1.8 FASTQ flags; since the script in question works on BAMs, I'm not sure where those are coming from.

The latest version on GitHub now includes a script to call bwa (if the reference index is available) to show one way to map the unmapped BAM file, though it doesn't exactly follow the GATK Best Practices (I'm using samtools for sorting because it's faster, and gluing the unmapped BAM header with the RG lines onto the mapped BAM instead of using MergeBamAlignment for the same reason). There's no base recalibration or variant calling; basically the BAM is left in the same state as the raw BAMs delivered by sequencing companies. I'll try to write a user-friendly BQSR & calling script when I have time, though it's somewhat re-inventing the wheel. Technically one can use the GATK pipelines on the unmapped BAM, though they have more dependencies/prerequisites.
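
The core of that mapping step, for those who'd rather type it out by hand, is roughly the following (a sketch only, with placeholder file names and thread counts; the actual script on GitHub also carries the read group information over from the unmapped BAM's header):

# Convert the queryname-sorted uBAM to FASTQ, map with bwa mem in smart-pairing
# mode (-p, input treated as interleaved pairs), and coordinate-sort with samtools
samtools fastq sample1.unmapped.bam | \
  bwa mem -p -t 8 hs37d5.fa - | \
  samtools sort -@ 8 -o sample1.srt.bam -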

Petr
04-12-2018, 03:30 PM
In the case of Dante Labs provided BAM, I found out I had to add XS and XA to tags to clear.
I noticed that FTDNA Big Y BAM files contain some additional flags in lower case: qf, sc, sp, ac, nh.

Donwulff
04-12-2018, 03:33 PM
Nowadays, bioinformatics pipelines are usually written in some high-level language or framework. For example, Broad Institute's GATK has moved to using their own Workflow Definition Language, WDL (https://software.broadinstitute.org/wdl/) - of course, this means lots more dependencies, learning and added complexity. There may still be some room for easily put-together and hopefully understandable scripts that do the same.

That said, as GATK now mentions on their Best Practices front page:
"If someone hands you a script and tells you "this runs the GATK Best Practices", start by asking what version of GATK it uses, when it was written, and what are the key steps that it includes. Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. And if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single parameter in the commands. If anything is unfamiliar to you, you should find out what it does. If you can't find it in the documentation, ask us in the forum. It's one or two hours of your life that can save you days of troubleshooting on the tail end of the pipeline, so please protect yourself by being thorough." https://software.broadinstitute.org/gatk/best-practices/

Which, basically, translates to: ONLY use the Broad Institute provided official pipelines. I mean, they don't even provide the commands on their Best Practices pages anymore; it's all in the pipeline code. Another thing is that computing has moved ahead, and as https://software.broadinstitute.org/gatk/blog?id=11415 details, there are great advantages to be had from using specific cloud computing configurations. On the other hand, one may not have the knowledge, transfer bandwidth or desire to send their data to large cloud computing companies, where it could be intercepted. It's also possible people may want to learn & tweak the process themselves.

I do wonder a bit where GATK Best Practices is headed. They've been pushing the unmapped BAM format for a while now, and since nobody wanted to use it, now they seem to be forcing it down people's throats. Personally, I think the uBAM is actually a great idea, and it would be nice if bioinformatics tools had better support for it, but still GATK's current approach seems a bit heavy-handed. Now they're also talking of doing away with BWA, for performance reasons of course, while their Java-based GATK & Picard are the slowest tools out there. But for example their MarkDuplicates is considered not just the best but practically the only way to properly do duplicate marking. There are faster tools, like samblaster, that do streaming/piped duplicate marking and sorting, which saves you the processing power and IO needed to compress and store temporary files, but the results aren't identical to Picard MarkDuplicates. So to some degree you're stuck with that.

On the other hand, they now seem to be headed towards doing *everything* by themselves, their own way. I'm wondering if they will try to close it up as proprietary at some point, or if another, open standard will emerge. Some of the challenges the FDA has run seem to indicate that GATK Best Practices do produce some of the highest quality results, though: https://precision.fda.gov/challenges/truth/results And perhaps more importantly, the pipelines and results are reproducible, and comparable to each other. (The caveats about what is "true" for genomics apply, though; the expected/correct answers are pretty much defined by GATK-like workflows right now.) On the other hand, if you just want to get your hands dirty and experiment, knock yourself out.

Petr
04-12-2018, 03:38 PM
This is my unsuccessful run:

[Thu Apr 12 17:35:06 CEST 2018] RevertSam INPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.bam OUTPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=7000000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Apr 12 17:35:06 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
INFO 2018-04-12 17:35:11 RevertSam Reverted 1,000,000 records. Elapsed time: 00:00:04s. Time for last 1,000,000: 4s. Last read position: chrY:7,983,283
INFO 2018-04-12 17:35:16 RevertSam Reverted 2,000,000 records. Elapsed time: 00:00:09s. Time for last 1,000,000: 4s. Last read position: chrY:13,451,192
INFO 2018-04-12 17:35:22 RevertSam Reverted 3,000,000 records. Elapsed time: 00:00:14s. Time for last 1,000,000: 5s. Last read position: chrY:15,783,059
INFO 2018-04-12 17:35:28 RevertSam Reverted 4,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 5s. Last read position: chrY:19,677,785
INFO 2018-04-12 17:35:32 RevertSam Reverted 5,000,000 records. Elapsed time: 00:00:24s. Time for last 1,000,000: 4s. Last read position: chrY:22,893,813
INFO 2018-04-12 17:35:38 RevertSam Reverted 6,000,000 records. Elapsed time: 00:00:31s. Time for last 1,000,000: 6s. Last read position: chrY:25,050,194
INFO 2018-04-12 17:35:47 RevertSam Reverted 7,000,000 records. Elapsed time: 00:00:39s. Time for last 1,000,000: 8s. Last read position: chrY:28,818,028
INFO 2018-04-12 17:36:43 RevertSam Detected quality format for GRC14388618_S53_L00: Standard
INFO 2018-04-12 17:36:53 RevertSam Discarded 7027084 out of 7027084 (100.000%) reads in order to sanitize output.
[Thu Apr 12 17:36:53 CEST 2018] picard.sam.RevertSam done. Elapsed time: 1.78 minutes.
Runtime.totalMemory()=12125732864
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Discarded 100.000% which is above MAX_DISCARD_FRACTION of 0.500%
at picard.sam.RevertSam.doWork(RevertSam.java:336)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

Donwulff
04-12-2018, 03:46 PM
Also, in the new version on GitHub I changed the extension to .unmapped.bam, because that's what Broad Institute used. If you re-name the old .u.bam to .unmapped.bam, the script won't need to re-make it for mapping.

I'm not seeing all those tags in my Big Y file. Most of the BAM file tags won't matter in current use, because "samtools fastq -t" only outputs the RG, BC and QT tags, which I think I mentioned before. I'll see about adding them to be stripped, though; I left the FTDNA proprietary ones in because they might hold additional metadata not related to mapping, though it's doubtful.

Donwulff
04-12-2018, 04:06 PM
Your 2016 Big Y BAM looks "broken": it has the Illumina Casava 1.8 headers included in the read/query names. I think it would be possible to fix that, though I don't have a file to test it on exactly. Maybe I should add that to the script too, if it's more common...

(samtools view -H sample1.bam; samtools view sample1.bam | awk '{ OFS="\t"; split($1,q," "); $1=q[1]; print }') | samtools view -b -o out.bam

Did they return the non-Y chromosome reads in those, or what's that?

BAM files can use different levels of compression, though most tools don't let you choose the level. I'm thinking of changing the script to COMPRESSION_LEVEL=9, actually; the Broad Institute GATK page doesn't say to set that, but if you're going to save the unmapped BAM it might be worth using maximum compression. In practice, that's likely to make very little difference usually.

Another thing is that when a BAM file is sorted in chromosome order, it will compress a lot more due to the redundant/repetitive content in overlapping reads. A sorted BAM can therefore be smaller than an unmapped BAM. An unmapped BAM in chromosome sort order would be the smallest, but the queryname order is required by full MarkDuplicates, and provides for better insert length determination on mapping.

Petr
04-12-2018, 06:39 PM
Thank you Don, I think a dash sign for stdin is missing, but it still gives an error:

[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[main_samview] fail to read the header from "-".

Edit: this error was probably caused by using the old version of samtools.

With the samtools 1.8, there is another error:

[W::sam_read1] Parse error at line 96
[main_samview] truncated file.

If I understand correctly, the awk script changed the space delimiter to a tab delimiter.

For some reason this was not acceptable as input format for samtools.

So I tried to use sed -e s/\ [^\\t]*\\t/\\t/ instead of awk to remove the Illumina Casava 1.8 headers, and now the result of RevertSam is very similar to your result:


[Thu Apr 12 22:01:15 CEST 2018] RevertSam INPUT=out.bam OUTPUT=out.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=7000000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Apr 12 22:01:15 CEST 2018] Executing as [email protected] on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
INFO 2018-04-12 22:01:19 RevertSam Reverted 1,000,000 records. Elapsed time: 00:00:04s. Time for last 1,000,000: 4s. Last read position: chrY:7,983,283
INFO 2018-04-12 22:01:24 RevertSam Reverted 2,000,000 records. Elapsed time: 00:00:09s. Time for last 1,000,000: 5s. Last read position: chrY:13,451,192
INFO 2018-04-12 22:01:32 RevertSam Reverted 3,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 7s. Last read position: chrY:15,783,059
INFO 2018-04-12 22:01:36 RevertSam Reverted 4,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 3s. Last read position: chrY:19,677,785
INFO 2018-04-12 22:01:42 RevertSam Reverted 5,000,000 records. Elapsed time: 00:00:26s. Time for last 1,000,000: 5s. Last read position: chrY:22,893,813
INFO 2018-04-12 22:01:46 RevertSam Reverted 6,000,000 records. Elapsed time: 00:00:31s. Time for last 1,000,000: 4s. Last read position: chrY:25,050,194
INFO 2018-04-12 22:01:53 RevertSam Reverted 7,000,000 records. Elapsed time: 00:00:38s. Time for last 1,000,000: 6s. Last read position: chrY:28,818,028
INFO 2018-04-12 22:02:44 RevertSam Detected quality format for GRC14388618_S53_L00: Standard
INFO 2018-04-12 22:02:56 RevertSam Sanitized 1,000,000 records. Elapsed time: 00:00:11s. Time for last 1,000,000: 11s. Last read position: */*
INFO 2018-04-12 22:03:07 RevertSam Sanitized 2,000,000 records. Elapsed time: 00:00:23s. Time for last 1,000,000: 11s. Last read position: */*
INFO 2018-04-12 22:03:18 RevertSam Sanitized 3,000,000 records. Elapsed time: 00:00:34s. Time for last 1,000,000: 11s. Last read position: */*
INFO 2018-04-12 22:03:30 RevertSam Sanitized 4,000,000 records. Elapsed time: 00:00:46s. Time for last 1,000,000: 11s. Last read position: */*
INFO 2018-04-12 22:03:42 RevertSam Sanitized 5,000,000 records. Elapsed time: 00:00:57s. Time for last 1,000,000: 11s. Last read position: */*
INFO 2018-04-12 22:03:54 RevertSam Sanitized 6,000,000 records. Elapsed time: 00:01:09s. Time for last 1,000,000: 11s. Last read position: */*
INFO 2018-04-12 22:04:05 RevertSam Discarded 61588 out of 7027084 (0.876%) reads in order to sanitize output.
[Thu Apr 12 22:04:05 CEST 2018] picard.sam.RevertSam done. Elapsed time: 2.84 minutes.
Runtime.totalMemory()=10565976064
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Discarded 0.876% which is above MAX_DISCARD_FRACTION of 0.500%
at picard.sam.RevertSam.doWork(RevertSam.java:336)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

This problematic Big Y file is dated January 19th 2016.
The latest "good old" Big Y file I have is dated March 28th 2015.

Donwulff
04-13-2018, 02:29 PM
Ah yes, the dash; it's not needed in the current version, so it's easy to forget ;)

samtools fastq actually works correctly on this file, returning proper Illumina Casava headered FASTQ, so that's an interesting case for the unmapped BAMs.

gawk actually breaks input fields at any whitespace by default, which I missed because I didn't have anything to test it with.
(samtools view -H sample1.bam; samtools view sample1.bam | awk 'BEGIN { FS="\t"; OFS="\t"; } { split($1,q," "); $1=q[1]; print }') | samtools view -b -o out.bam -

The first columns of headers like @HD and @RG don't have spaces in them, so it would probably be okay to pipe both the header and the mappings through the gawk script, but better safe than sorry: the above filters only the mappings through it. Columns (fields) are split at the tabulator, the first column is split at the space, and the part before the space is stored back into the first column. The columns are printed out tab-separated, and the resulting text-format (SAM) file is piped through samtools to store it as compressed BAM.

I don't get why "samtools view" doesn't support a higher compression level, though it would save very little space. Testing & thinking on that a bit: for archival, one might want to use xz compression. xz actually supports block-based parallel compression, though it would currently need a bit of tweaking to get xz to work with highly parallelized pipelines where you'd split the input BAM into multiple sections. xz compression level 3 in blocks is still faster than bgzip level 9, and 13% smaller. For a mapped BAM, CRAM is best, but this got me thinking about what would be the best option for unmapped BAMs, especially if you were going to re-process them 10+ times with different pipeline settings.
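
As a concrete example of the sort of thing I mean (the file name is a placeholder and the exact savings will vary):

# Multi-threaded xz splits the input into independently compressed blocks;
# -3 is the compression level, -k keeps the original file for comparison.
xz -T0 -3 -k sample1.unmapped.bam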

Petr
04-15-2018, 09:13 AM
I'm not sure what the purpose of READ_NAME_REGEX is; I see in the log READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values>

If I check the read naming for the various tests:

FGC Elite 1.0
FCC63CUACXX:4:2206:18663:2479#
FGC WGS 15x
ST-E00126:132:HJHMLCCXX:7:2208:17695:46947
Dante
CL100054478L1C001R004_35916
FTDNA Big Y
SN7001368:163:H8TUEADXX:1:1110:12426:93783
Genos
SN7001218:858:HJLM5BCXY:1:1213:9857:47107

So if I understand correctly, the default behavior will take the last three ":" separated fields, i.e., in this example, surely correct for FTDNA, Genos and FGC WGS. For Dante there is your expression. I'm not sure how it is with FGC Elite 1.0, because the third (last) field is 2479#, and I don't know if it will be evaluated as a numeric value. The BAM file contains READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*

Donwulff
04-15-2018, 10:26 AM
I would expect the default read name regex to parse numbers until the first non-numeric character, but again, I'm not sure without testing. The BAM file parameter you quote (though the "quick reply" option on the forum messes up the formatting) suggests it may have to be set manually.

Actually, the script I have on GitHub is always a work in progress; the regex for Dante Labs currently doesn't work because of wrong quotation marks. I'm presently testing it through and updating the script. However, as I discussed on the main Dante Labs thread and this one after earlier testing, the read name regex doesn't really affect the variant calling. It affects some of the statistics calculated, and possibly has a slight effect on performance. Though setting it right will avoid the warning, and be cleaner.

Donwulff
04-19-2018, 01:09 AM
I downloaded an FGC sample off the Personal Genomes Project to check the script; unfortunately, however, it didn't have the Y-Elite style "FCC63CUACXX:4:2206:18663:2479#" read names. Unsurprisingly, I guess, looking at your list above, the FGC WGS sample had normal read names. I might try to simulate those just to see if the normal regex will catch them.

Y-chromosome-only sequences have another issue in analysis further down the line. Normally, the Base Quality Score Recalibration is run over chromosome 22, one of the shortest autosomal chromosomes, finding patterns for the "expected" variants. Of course, when you sequence the Y chromosome only, there aren't enough reads on chromosome 22 to do that. I'm not sure anybody has actually done a study of what the best strategy would be in that case - the male-specific region of the Y chromosome doesn't recombine, so the "expected" variants depend on the haplogroup of the sample. Nevertheless, the dbSNP database of known variants does include many Y chromosome variants, though the YBrowse list of phylogenetically informative SNP's might be better, because they're more likely to be correct. Looking at the discussed pipeline at http://www.it2kane.org/2016/10/y-dna-variant-discovery-workflow-pt-1-1/ though, unless I'm missing something, it looks like it doesn't use the dbSNP file for base quality recalibration at all. On that note, I've also not really seen any research into what effect, if any, the dbSNP release used could have (the latest GATK instructions seem to say "most recent dbSNP"; in the past I think they recommended the version where the 1KG variants were added, which was probably before a lot of false positive and rare variants got added).
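
For reference, restricting the recalibration to the Y chromosome with a known-sites file would look something along these lines in GATK 3.x (a sketch with placeholder file names; whether dbSNP or a YBrowse-derived VCF makes the better known-sites set is exactly the open question above):

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R ucsc.hg19.fasta -I sample1.bam -L chrY \
    -knownSites dbsnp_138.hg19.vcf.gz \
    -o sample1.recal.table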

This has given me some interesting ideas in the past, though, because the non-recombining nature of the MSY means that you pretty much know most of the SNP's you're expecting to get. So if you've got a good phylogenetic reference, your base quality recalibration could be quite exact, since you would have a pretty good "truth set" available with very few novel SNP's, both for training the recalibration and for checking the results. Though you'd probably need to also limit your recalibration to the general high-confidence region, since the palindromic and heterochromatic regions will have a lot of errors. With a WGS, maybe even check whether the recalibration differs for the Y chromosome.

Which mostly leaves me wondering if anybody has done anything of the sort before. It could take a while to test all the options on enough genomes to see if there's a significant effect.

karwiso
05-13-2018, 11:15 PM
Hi!

I am just learning to do alignment myself.
I have a question about the reference genome sequence - which one do you use?
There is a recommendation here (http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use), but there are more up-to-date ones, like release 92 of Ensembl (ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/dna/), which is updated to patch 12 of GRCh38. I have tried NCBI's GRCh38 patch 12 (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.27_GRCh38.p12/), but it uses accession numbers instead of chromosome names.
Is there any point in trying to use the reference genome with the most recent patch, or should I/we stick with the unpatched reference?

Donwulff
05-15-2018, 04:01 PM
Things like GEDMatch, Promethease and the annotation sources one could use currently stick to the primary assembly, i.e. chromosomes 1-22, X, Y and possibly MT. As such, one doesn't currently gain more information from using another reference genome sequence. Additionally, the Broad Institute GATK Variant Quality Score Recalibration only really works for the diploid primary assembly (that's 1-22). Yet another consideration is that all other tools and statistics are based on one of the official reference sequences; a genome mapped to another reference isn't entirely comparable, because some reads will map to the different/alternative contigs instead.

I put a script on my Git tree (linked somewhere before) to build the latest GRCh38 reference sequence based on the analysis-ready set, with UCSC names for the contigs UCSC has officially released, the decoy sequence, and the latest Human Leukocyte Antigen alleles, mostly because I'm using it in my ongoing Y-chromosome and metagenome experiments. In principle, those contigs should act as a sink for reads that actually belong elsewhere than the human primary assembly, allowing one to pick the Y-chromosomal/metagenome reads apart. There are bound to be a number of issues though, so I'm definitely calling that experimental.

Here's a really good blog post from Heng Li, author of the BWA aligner, on which reference genome to use: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Of course, the next question is going to be whether to use the GRCh38 or hg37 build. Resources are moving, slowly as ever, to GRCh38, so currently it's kind of chaotic: you'll find some services require hg37-based calls and others require GRCh38-based ones. It's probably most beneficial to process both for now, which also has the additional benefit of letting you check uncertain variants on both.

karwiso
06-11-2018, 03:30 PM
I have been trying to use BaseRecalibrator with files from 1000Genomes: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/ and other mapping resources such as the Mills gold standard indels and the ALL wgs 1000G phase 3 indels, but I kept getting errors about incompatible contigs. Looking into the Mills gold standard indels I realized that although it is marked as b38 in the file name, the naming convention and reference genome used in it are still GRCh37/b37. I tried to re-index and re-sort, but that just produced more errors about malformed VCF files.

I have downloaded the GATK bundle from ftp://ftp.broadinstitute.org/bundle/hg38/
1000G_phase1.snps.high_confidence.hg38.vcf.gz, dbsnp_146.hg38.vcf.gz and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz were accepted by BaseRecalibrator just fine.
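
A quick way to diagnose that kind of contig mismatch (a sketch, assuming the standard GATK bundle file names) is to compare the contig names in the reference dictionary against the ones in the known-sites VCF header:

grep '^@SQ' Homo_sapiens_assembly38.dict | cut -f2 | head
zcat Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | grep '^##contig' | head

If one list says "chr1" and the other "1" (or shows GRCh37 contig lengths), the files can't be used together as-is.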

All files must be in the same directory. The .fa or .fasta file should have exactly the same base name as the .fa.alt or .fasta.alt file. If the .fa or .fasta file ends with .gz it is better to unpack it, and then index it with bwa:


./bwa index /...../reference.fasta
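
The GATK/Picard steps later on also expect a .fai index and a .dict sequence dictionary next to the reference; a minimal sketch (tool paths are placeholders):

samtools faidx reference.fasta
java -jar picard.jar CreateSequenceDictionary R=reference.fasta O=reference.dict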

I have done all the steps without shell scripting and piping, mainly because I wanted to learn and see what could go wrong and exactly where. I also start from the FASTQ files, not the aligned BAM. I will publish my processing steps and comments.

karwiso
06-11-2018, 03:32 PM
Each pair of FASTQ files is processed with bwa ALT-aware alignment and eight sam files are produced:
Alternative A:

./bwa mem -t 6 -r 1.0 -T 20 -C -M -R "@RG\tID:CL1000....._L01_81\tPL:COMPLETE\tPU:CL1000....._L01_81\tLB:WH.....-81\tSM:1500.....7A\tCN:BGI" /mnt/ramdrv/grch38alt.fa /mnt/ramdrv/CL1000....._L01_81_1.fq.gz /mnt/ramdrv/CL1000....._L01_81_2.fq.gz > seqpair1.sam

Alternative B:

/dnatools/bwa/bwa mem -t 6 -k 15 -r 1.0 -T 20 -C -M -R "@RG\tID:CL1000....._L01_81\tPL:COMPLETE\tPU:CL1000....._L01_81\tLB:WH.....-81\tSM:1500.....7A\tCN:BGI" /dnatools/reference/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta CL1000....._L01_81_1.fq.gz CL1000....._L01_81_2.fq.gz > seqpair1.sam

Adjust _L01_81_ according to your FASTQ file names and for each processed pair of files. Think about adjusting it in the @RG description as well!

-t specifies the number of threads used for processing. 8 threads are possible on my CPU, but using them all can almost lock up the PC; using 6 threads is OK and it is still possible to do other work (CPU Intel i7-6700K 4.0 GHz). FASTQ files are loaded to a ramdrive together with the reference genome files, and output is written to a hybrid 2xHDD in RAID1. Processing time is approximately 1.5 hours per pair (-r 1.0, Alternative A).

It is best to give the full path to the reference sequence .fa file. Just indexing the .fa file and referring to the index name does not work for ALT-aware alignment. One can probably copy the .fa and .fa.alt into the bwa directory, but I haven't experimented with that.

Read group information could be added already at this step. I think it is then reasonable to skip merging aligned and unaligned BAMs. Read group parameters are added after the -R option.

The -T option is 30 by default and sets the minimum alignment score to output. I have adjusted it to 20, which should give a few more alignments, but of lower quality. I think that could be reasonable for genealogical purposes, but an alignment for medical purposes should be more stringent.

The -k and -r options adjust the seed size and re-seeding; the defaults are 19 and 1.5. Lower values can give better accuracy but increase processing time. I adjusted them to "-k 15 -r 1.0", but it significantly increased processing time, to approximately 5-6 hours per pair (Alternative B). The produced SAM files are also slightly bigger than in Alternative A, by about 1.5-2 GB. It is probably not worth it, and one can just stick with the default values.

The output:

[M::bwa_idx_load_from_disk] read 3171 ALT contigs
[M::process] read 600000 sequences (60000000 bp)...
[M::process] read 600000 sequences (60000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (6, 234754, 9, 2)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (188, 223, 258)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (48, 398)
[M::mem_pestat] mean and std.dev: (223.23, 53.29)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 468)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 600000 reads in 483.206 CPU sec, 80.696 real sec

The messages about skipping orientations are normal; they are just informational, so don't worry about them. Watch for "read 3171 ALT contigs" - if the reference genome FASTA file is not indexed properly you will probably get "read 0 ALT contigs" instead. Not a big deal, but then you are not aligning to the alternative sequences, only to the primary assembly.
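
One thing worth checking before a long run (sketch only, the path is a placeholder): bwa only does ALT-aware mapping when an .alt file sits next to the FASTA/index with the same base name, so

ls -l /path/to/reference.fasta.alt

should list the file; if it is missing, it can be copied from bwakit or the GATK bundle (e.g. Homo_sapiens_assembly38.fasta.alt).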

Donwulff
06-17-2018, 02:00 PM
I have been trying to use BaseRecalibrator with files from 1000Genomes: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/ and other mapping resources such as the Mills gold standard indels and the ALL wgs 1000G phase 3 indels, but I kept getting errors about incompatible contigs. Looking into the Mills gold standard indels I realized that although it is marked as b38 in the file name, the naming convention and reference genome used in it are still GRCh37/b37. I tried to re-index and re-sort, but that just produced more errors about malformed VCF files.

I have downloaded the GATK bundle from ftp://ftp.broadinstitute.org/bundle/hg38/
1000G_phase1.snps.high_confidence.hg38.vcf.gz, dbsnp_146.hg38.vcf.gz and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz were accepted by BaseRecalibrator just fine.


I'm not sure if that's a question... the GATK resource bundle https://software.broadinstitute.org/gatk/download/bundle (your FTP link won't work) is definitely hg38, and I've been using it successfully with GRCh38-based pipelines. It's not exactly clear what you're using as the reference genome FASTA/version, however. There are different naming conventions for the contigs, not only UCSC "chr1" vs. "1", but the "raw" GRCh38 also uses sequence identifiers. One potential problem is that the Mills indels file contains HLA contigs in its header, and the lengths of the HLA contigs vary between HLA reference builds. It may be necessary to filter them out, although I'm not sure there are references with different HLA alleles floating around.

karwiso
06-23-2018, 09:51 AM
Merging SAM files to one file
I ran into problems when merging my seqpairX.sam files. Java was running out of memory and there were some other complaints from Picard. After upgrading my system I still had the same problems. Then I looked at the file headers (samtools view -H seqpairX.sam) and discovered that one SAM file was corrupted (I don't know why). I simply realigned the corresponding FASTQ files and then the process ran smoothly.


java -Xmx55g -jar /dnatools/picard/picard.jar MergeSamFiles I=seqpair1.sam I=seqpair2.sam I=seqpair3.sam I=seqpair4.sam I=seqpair5.sam I=seqpair6.sam I=seqpair7.sam I=seqpair8.sam O=/home/user/Dokument/jjs_alt_merged.sam USE_THREADING=true SORT_ORDER=unsorted TMP_DIR=/home/user/Dokument/ MAX_RECORDS_IN_RAM=10000000

The -Xmx55g option sets the limit for Java memory usage, 55 GB in this case.
TMP_DIR should be set if you don't have unlimited RAM, but it is not strictly necessary here. I have it in case I go back and modify something that does need a temporary directory.
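
Before merging, it can save time to read every SAM all the way through first; a parse error flags a corrupted file (minimal sketch):

for f in seqpair*.sam; do
    samtools view -c "$f" > /dev/null || echo "Problem reading $f"
done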

Processing with bwa-postalt.js
After merging the SAM files, the combined SAM file is processed with bwa-postalt.js from bwakit (which should be downloaded separately):


/dnatools/bwa/bwa.kit/k8 /dnatools/bwa/bwa.kit/bwa-postalt.js -p /dnatools/bwa/bwa.kit/resource-GRCh38/hs38DH-extra.fa /dnatools/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta.alt jjs_alt_merged.sam > jjs_alt_merged_postalt.sam

Sorting SAM and producing BAM file


java -Xmx55g -jar /dnatools/picard/picard.jar SortSam I=jjs_alt_merged_postalt.sam O=jjs_alt_merged_postalt_sorted.bam SORT_ORDER=coordinate MAX_RECORDS_IN_RAM=10000000 TMP_DIR=/home/user/Dokument/

MAX_RECORDS_IN_RAM of 10 million works well for me; 20 and 15 million were able to start, but processing aborted after some time.

karwiso
06-23-2018, 10:02 AM
I'm not sure if that's a question... the GATK resource bundle https://software.broadinstitute.org/gatk/download/bundle (your FTP link won't work) is definitely hg38, and I've been using it successfully with GRCh38-based pipelines. It's not exactly clear what you're using as the reference genome FASTA/version, however. There are different naming conventions for the contigs, not only UCSC "chr1" vs. "1", but the "raw" GRCh38 also uses sequence identifiers. One potential problem is that the Mills indels file contains HLA contigs in its header, and the lengths of the HLA contigs vary between HLA reference builds. It may be necessary to filter them out, although I'm not sure there are references with different HLA alleles floating around.

I decided to work with the reference genome in the GATK bundle (GRCh38) and followed Tutorial 8017 (https://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38) for ALT-aware processing.

I'm working on base recalibration now and it seems to be going smoothly.

karwiso
06-24-2018, 08:14 AM
Change the default Java JRE, because in Ubuntu 18.04 the default is OpenJDK 11 (right now it is actually OpenJDK 10):
run sudo update-alternatives --config java and choose Java 8. If it is not present you need to install it (use Synaptic in Ubuntu).

Base recalibration is single threaded. As far as I understand it is the only option.


/dnatools/gatk/gatk --java-options "-Xmx55g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
BaseRecalibrator \
-R /dnatools/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta \
-O recal_data.table \
-I jjs_alt_merged_postalt_sorted_dup.bam \
--known-sites /dnatools/gatk-bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites /dnatools/gatk-bundle/hg38/dbsnp_146.hg38.vcf.gz \
--known-sites /dnatools/gatk-bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz

You should see something like this:

11:02:44.477 INFO ProgressMeter - Starting traversal
11:02:44.477 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
11:02:54.498 INFO ProgressMeter - chr1:1998772 0.2 535000 3203273.1
11:03:04.501 INFO ProgressMeter - chr1:3386832 0.3 1068000 3200159.8
11:03:14.516 INFO ProgressMeter - chr1:4873235 0.5 1685000 3365624.7
11:03:24.523 INFO ProgressMeter - chr1:6313851 0.7 2283000 3420566.3
11:03:34.534 INFO ProgressMeter - chr1:7840490 0.8 2916000 3495215.5
11:03:44.538 INFO ProgressMeter - chr1:9316286 1.0 3527000 3523476.5

Applying the re-calibration data:


/dnatools/gatk/gatk --java-options "-Xmx55g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
ApplyBQSR \
-R /dnatools/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta \
--bqsr-recal-file recal_data.table \
-I jjs_alt_merged_postalt_sorted_dup.bam \
-O jjs_alt_merged_postalt_sorted_dup_recal.bam

karwiso
06-28-2018, 05:22 PM
BamDeal can be downloaded from https://github.com/BGI-shenzhen/BamDeal
Unpack it, make .configure executable, then run ./configure and make. The software ends up in the /bin directory; make it executable as well, cd into /bin, and run it as ./BamDeal_Linux.

Create text file “keep.list” with following content:

chrY
chrY_KI270740v1_random
chrM

Run BamDeal and extract Y and M chromosome aligned reads:


/dnatools/BamDeal/bin/BamDeal_Linux modify bamSubChr \
-i jjs_alt_merged_postalt_sorted_dup_recal.bam \
-k keep.list \
-o jjs_alt_merged_postalt_sorted_dup_recal_extract.bam -r


I am not sure if this is any good: it removed some trash from the BAM header, but it also removed all the other information in the header, like the commands that had been run. I have tried removing the "-r" at the end (-r stands for re-heading, i.e. writing a new BAM file header), but Picard tools don't like that either. I don't know how to get reliable coverage statistics for just the Y chromosome.
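
An alternative I might try is extracting with samtools, which should keep the full header (including the @PG command history) intact, and getting rough Y-only coverage directly; a sketch I haven't verified, assuming the recalibrated BAM is indexed:

samtools view -b jjs_alt_merged_postalt_sorted_dup_recal.bam chrY chrY_KI270740v1_random chrM > chrY_extract.bam
samtools index chrY_extract.bam
samtools depth -a -r chrY jjs_alt_merged_postalt_sorted_dup_recal.bam | awk '{sum+=$3; if($3>0) cov++} END {print "mean depth:", sum/NR, "covered bases:", cov}'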

karwiso
06-28-2018, 05:31 PM
Some statistics anyway.

samtools idxstats:

name / length / mapped reads / unmapped reads

chrY 57227415 10655132 47833
chrM 16569 347320 173
chrY_KI270740v1_random 37240 7168 3

BamDeal statistics:

#ID Length CoveredBase TotalDepth Coverage% MeanDepth
chrY 57227415 16998046 385613660 29.70 6.74
chrM 16569 16569 33844029 100.00 2042.61
chrY_KI270740v1_random 37240 26160 411621 70.25 11.05
#Genome 57281224 17040775 419869310 29.75 7.3

And some depth frequency statistics from BamDeal:
#Depth Number
0 40240449
1 149250
2 105245
3 95005
4 94016
5 99383
6 108567
7 119324
8 145157
9 183274
10 247905
11 327865
12 430610
13 556184
14 692226
15 835277
16 965610
17 1067544
18 1146164
19 1183656
20 1175601
21 1130570
22 1043963
23 934127
24 814229
25 691385
26 570508
27 456628
28 357236
29 269933
30 202853
31 148628
32 109403
33 77886
34 56217
35 41656
36 30256
37 22796
38 18104
39 14335
40 11969
41 10387
42 8801
43 8080
44 7577
45 7033
46 6531
47 5639
48 5362
49 4897
50 4872

I haven't been successful in applying Picard's CollectRawWgsMetrics. With the original header it calculates coverage for the whole genome, not only the chromosomes present. With the re-headed BAM it starts with chr1 anyway and the statistics are completely wrong. I'll try to sort out how to do the statistics with Picard.

And some statistics from Yfull:

ChrY BAM file size: 0.44 Gb Hg38
Reads (all): 5028416
Mapped reads: 5028416 (100.00%)
Unmapped reads: 0
Length coverage: 23591946 bp (89.31%)
Min depth coverage: 1X
Max depth coverage: 208X
Mean depth coverage: 20.18X
Median depth coverage: 18X
Length coverage for age: 8467364 bp
No call: 2823097 bp

The original BAM file was around 1.1 GB, but YFull now filters out duplicate reads, so the file is smaller after cleaning. Still, the length coverage is much better than for BigY tests as far as I can see.

Petr
06-28-2018, 09:59 PM
My YFull statistics:

ChrY BAM file size: 0.34 Gb Hg38
Reads (all): 3925945
Mapped reads: 3925945 (100.00%)
Unmapped reads: 0
Length coverage: 23598887 bp (89.34%)
Min depth coverage: 1X
Max depth coverage: 206X
Mean depth coverage: 16.33X
Median depth coverage: 15X
Length coverage for age: 8430528 bp
No call: 2816156 bp

and

ChrM BAM file size: 471.67 Kb
Length coverage: 16569 bp (100.00%)
Min depth coverage: 5X
Max depth coverage: 98X
Mean depth coverage: 51.05X
Median depth coverage: 51X
One reading position!: 0
No call: 0 bp

FGC Corp
06-29-2018, 09:04 PM
My YFull statistics:

ChrY BAM file size: 0.34 Gb Hg38
Reads (all): 3925945
Mapped reads: 3925945 (100.00%)
Unmapped reads: 0
Length coverage: 23598887 bp (89.34%)
Min depth coverage: 1X
Max depth coverage: 206X
Mean depth coverage: 16.33X
Median depth coverage: 15X
Length coverage for age: 8430528 bp
No call: 2816156 bp

and

ChrM BAM file size: 471.67 Kb
Length coverage: 16569 bp (100.00%)
Min depth coverage: 5X
Max depth coverage: 98X
Mean depth coverage: 51.05X
Median depth coverage: 51X
One reading position!: 0
No call: 0 bp

FYI:
FGC offers a beta health report for people with WGS data.
See here:
https://anthrogenica.com/showthread.php?742-Full-Y-Chromosome-Sequencing-Phase-III-Pilot&p=425433&viewfull=1#post425433

Donwulff
06-30-2018, 06:45 PM
Base recalibration is single threaded. As far as I understand it is the only option.

You can run most of these phases in parallel over ranges of the genome. A notable exception is GATK's new queryname-sorted duplicate marking.
On the other hand, I'm not entirely clear on what you're trying to do here. If you want a 100% GATK Best Practices compliant workflow, you should use the scripts provided by the Broad Institute.
If you just want to get it done and possibly understand & experiment with the parameters, I already posted the link to scripts to do it at the beginning of this thread; for BQSR:
https://github.com/Donwulff/bio-tools/blob/master/mapping/BQSR.sh

That supports GRCh38 and hg37 (I need to put in a simple switch), uses the latest dbSNP 151 and GATK 4(.0.4.0), and processes chromosomes in parallel, which is the cleanest way to do it. I see I have the amount of memory per Java thread hardcoded, but BQSR doesn't really benefit from extra memory.

It looks like I haven't put in VQSR script yet.

The point about selecting the Java implementation is a good one, but I'm not sure why you feel Java 8 is needed - is that for performance, compatibility or correctness?

The GATK Best Practices instructions look like kind of a mess right now. 1000G_phase1.snps.high_confidence.hg38.vcf.gz has never seemed to be a recommended BQSR reference, though. Looking at the latest scripts, they're now using Homo_sapiens_assembly38.known_indels.vcf.gz, which might well be ftp://[email protected]/bundle/hg38/beta/Homo_sapiens_assembly38.known_indels.vcf.gz

karwiso
07-02-2018, 07:00 PM
Another kit processed:

Yfull:
ChrY BAM file size: 0.46 Gb Hg38
Reads (all): 4796766
Mapped reads: 4796766 (100.00%)
Unmapped reads: 0
Length coverage: 23610458 bp (89.38%)
Min depth coverage: 1X
Max depth coverage: 210X
Mean depth coverage: 19.05X
Median depth coverage: 17X
Length coverage for age: 8454220 bp
No call: 2804585 bp

BamDeal:
#ID Length CoveredBase TotalDepth Coverage% MeanDepth
chrY 57227415 16879450 352861175 29.50 6.17
chrM 16569 16569 49190700 100.00 2968.84
chrY_KI270740v1_random 37240 25216 394344 67.71 10.59

samtools idxstats:
chrY 57227415 9360180 143596
chrM 16569 505275 202
chrY_KI270740v1_random 37240 7037 42

Geldius
07-02-2018, 07:18 PM
Yfull:

ChrY BAM file size: 0.47 Gb Hg19
Reads (all): 5999478
Mapped reads: 5999478 (100.00%)
Unmapped reads: 0
Length coverage: 25591690 bp (99.76%)
Min depth coverage: 1X
Max depth coverage: 206X
Mean depth coverage: 23.14X
Median depth coverage: 21X
Length coverage for age: 8458127 bp
No call: 61876 bp

ChrM BAM file size: 232.37 Kb
Length coverage: 16569 bp (100.00%)
Min depth coverage: 6X
Max depth coverage: 45X
Mean depth coverage: 24.90X
Median depth coverage: 25X
One reading position!: 0
No call: 0 bp

RobertN
07-03-2018, 05:54 PM
Hello! I had a WES from Dante Labs and I have asked for the BAM/FASTQ files, since the VCF file provided by Dante Labs seems incomplete. Now I want to create a gVCF file (which I will use with Promethease and other similar services) from the .bam file with the EvE app from Sequencing.com.

The thing is that I am a bit overwhelmed with so much information and so many options, and I don't really understand what is better in my case, like:

Preprocessing: CutAdapt or FastQ or both

Annotation: SnpEff or Vep

Alignment / Mapping: BWA / Bowtie2 / Hisat / Isaac / Star / TopHat2

Interpretation: ClinVar Report / ClinVar Annotation ( i'm guessing this is not needed)

Variant Calling: GATK / SamTools

If I am not wrong, and from what I have gathered, I should use BWA (alignment/mapping) + GATK (variant calling), but I'm not sure whether I need any annotation or preprocessing for the most reliable results? I am more interested in health info.

Thank you.

JamesKane
07-05-2018, 11:11 PM
The issue with the 1KG-hosted reference files is that the data was lifted over but the headers were not corrected. GATK3, which was standard at the time, didn't care about this; GATK4 started validating the headers. It is probably best just to use the bundle hosted by the Broad Institute at this point.
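
If one does want to patch the lifted-over files instead of re-downloading, the contig lines in a VCF header can in principle be rewritten from the reference .fai; a minimal sketch with bcftools (untested on those particular files, names are placeholders):

bcftools reheader --fai Homo_sapiens_assembly38.fasta.fai -o fixed.hg38.vcf.gz lifted.hg38.vcf.gz
tabix -p vcf fixed.hg38.vcf.gz

Still, the Broad-hosted bundle is the simpler route.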

Miqui Rumba
07-14-2018, 07:43 PM
I have uploaded my Dante Labs WES BAM to Sequencing.com 3 times and it always said the file was corrupt. I've opened this BAM in IGV 4.2 several times and it seems fine, though only the first chromosome is aligned with hg19 (the lab only aligns chr1 in the BAM file Dante Labs sends us).
From Sequencing.com I ran both the SNP & indel VCFs through EvE Premium with annotations (SnpEff seems better, I have CAVA also), because the ClinVar report shows the gene names and the proteins affected by variants. This generated a new VCF of 117 MB unzipped.
I've aligned some chromosomes with BWA MEM using the FASTQ files Dante Labs sent me (both UCSC hg19 and GRCh38). I am very surprised that 25% of the SNP variants in the Dante Labs snp.VCF didn't align in the new BAM file (sorted and indexed with samtools). Finally, I used bcftools to convert the new sorted BAMs (one for every chromosome) into a raw BCF file first, and then "call -m (multiallelic) -ploidy (Genome Reference) -o VCF.gz" to obtain a lot of new variants, though losing the 25% of SNPs that Dante Labs filtered from the original raw VCF.

Miqui Rumba
08-13-2018, 09:49 AM
Finally, I merged my 22 autosomal WES VCFs with "bcftools merge". It is not a good option, because bcftools generates a new column for every VCF file, named after the sorted.bam, in the header. If a column is empty, it gives GT 0/0 by default. Even though GEDmatch Genesis accepted this merged VCF, it could only read the variants; the new kit matches neither my Dante Labs WES.snp.VCF nor my 23andMe v4 raw data (or my Gencove-imputed VCF from 23andme.txt). I generated a multi-FASTA GRCh37.p13 (UCSC), though when I tried to index it with BWA the run stopped at the .fa.sa step (I would need double the RAM for this!). It is very easy to index chromosome by chromosome and align with BWA MEM (I did it with all my chromosomes in GRCh37 and some in GRCh38), using the paired-end FASTQ files sent by Dante Labs, then samtools view sam > bam, then samtools sort -o output.sorted.bam, then bcftools mpileup -f chr.fa -o output.bcf, then bcftools view -v snps,indels -o output.vcf, and finally bcftools call -m -Oz -o output.vcf.gz. See the condensed sketch below.
If you want to merge all the gzipped VCFs, you must first run "tabix" on each chr.vcf.gz before running "bcftools merge". After merging, I also did "bcftools isec" with my 23andMe raw data to extract the common SNPs and indels shared with my new raw merged VCF (only 11,000 SNPs) and with the filtered snp.VCF that Dante Labs sent me (only 71,000 SNPs! Why can't BWA MEM align all the reads for positions the chip detected from my fq files?). I have made this file available as an altruistic share in the Sequencing.com database and in my GenomeSpace shared unit; you are free to check it.
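
Condensed, the per-chromosome steps described above look roughly like this (a sketch only; file names are placeholders):

bwa mem -t 4 chr21.fa reads_1.fq.gz reads_2.fq.gz | samtools sort -o chr21.sorted.bam
samtools index chr21.sorted.bam
bcftools mpileup -f chr21.fa chr21.sorted.bam | bcftools call -m -Oz -o chr21.vcf.gz
tabix -p vcf chr21.vcf.gz
bcftools merge chr*.vcf.gz -Oz -o mergedWES37.vcf.gz

For per-chromosome VCFs of a single sample, "bcftools concat" may be a better fit than "bcftools merge", since concat stitches the records together without creating an extra sample column per input file.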
Here comes some data after running "bcftools stats -v":
ID 0 mergedWES37.vcf.gz
# SN, Summary numbers:
# SN [2]id [3]key [4]value
SN 0 number of samples: 22
SN 0 number of records: 22157494
SN 0 number of no-ALTs: 17002553
SN 0 number of SNPs: 5037615
SN 0 number of MNPs: 0
SN 0 number of indels: 117326
SN 0 number of others: 0
SN 0 number of multiallelic sites: 21801
SN 0 number of multiallelic SNP sites: 21033
# TSTV, transitions/transversions:
# TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 3258007 1800641 1.81 3254038 1783577 1.82
# SiS, Singleton stats:
# SiS [2]id [3]allele count [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
SiS 0 1 2031249 1391036 640213 30948 0 0 30948
# AF, Stats by non-reference allele frequency:
# AF [2]id [3]allele frequency [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
AF 0 0.000000 2031249 1391036 640213 30948 0 0 30948
AF 0 0.040000 3027399 1866971 1160428 87146 0 0 87146

Miqui Rumba
09-14-2018, 08:26 AM
Sequencing.com has changed some rules and now the Dante Labs WES FASTQ files (forward.fq and reverse.fq) seem to be accepted for upload, although only the first FASTQ (forward.fq) can be processed by the free EvE (Bowtie2 aligner, UCSC GRCh38.86 FASTA). The results look better than the SNP and INDEL VCFs that Dante Labs sent me last year (abnormal quality, 25% of variants not read). Getting all the exome variant sites in hg38 is a big improvement. I have made my EvE VCF results an altruistic share in the Sequencing.com database for everybody who wants to check it: CL100040257_L01_501_1.vcf.gz
Here Genome Overview analysis:
Warnings 150,681
Errors 256
Number of lines (input file) 201,252
Number of variants (before filter) 202,799
Number of not variants (i.e. reference equals alternative) 0
Number of variants processed (i.e. after filter and non-variants) 202,799
Number of known variants (i.e. non-empty ID) 0 (0%)
Number of multi-allelic VCF entries (i.e. more than two alleles) 1,179
Number of effects 1,025,109
Genome total length 47,871,129,608
Genome effective length 3,095,807,492
Variant rate 1 variant every 15,265 bases
Variants rate details
Chromosome Length Variants Variants rate
1 248,956,422 17,567 14,171
2 242,193,529 15,715 15,411
3 198,295,559 12,487 15,880
4 190,214,555 11,182 17,010
5 181,538,259 10,619 17,095
6 170,805,979 10,509 16,253
7 159,345,973 10,660 14,948
8 145,138,636 8,043 18,045
9 138,394,717 8,445 16,387
10 133,797,422 9,922 13,484
11 135,086,622 11,501 11,745
12 133,275,309 10,314 12,921
13 114,364,328 5,940 19,253
14 107,043,718 6,446 16,606
15 101,991,189 6,681 15,265
16 90,338,345 7,538 11,984
17 83,257,441 8,568 9,717
18 80,373,285 4,564 17,610
19 58,617,616 9,289 6,310
20 64,444,167 5,446 11,833
21 46,709,983 2,803 16,664
22 50,818,468 4,180 12,157
GL000008.2 209,709 9 23,301
GL000009.2 201,709 5 40,341
GL000194.1 191,469 1 191,469
GL000205.2 185,591 60 3,093
GL000208.1 92,689 4 23,172
GL000214.1 137,718 2 68,859
GL000216.2 176,608 21 8,409
GL000218.1 161,147 6 26,857
GL000219.1 179,198 33 5,430
GL000220.1 161,802 1 161,802
GL000221.1 155,397 3 51,799
GL000224.1 179,693 6 29,948
GL000225.1 211,173 320 659
GL000251.2 1 0 0
GL000252.2 1 0 0
GL000253.2 1 0 0
GL000254.2 1 0 0
GL000255.2 1 0 0
GL000257.2 1 0 0
GL339449.2 1 0 0
GL383533.1 1 0 0
GL383545.1 1 0 0
GL383557.1 1 0 0
GL383567.1 1 0 0
GL383573.1 1 0 0
GL383580.2 1 0 0
GL383581.2 1 0 0
GL582966.2 1 0 0
JH159136.1 1 0 0
JH159137.1 1 0 0
JH159146.1 1 0 0
KI270438.1 112,505 55 2,045
KI270442.1 392,061 68 5,765
KI270467.1 3,920 3 1,306
KI270517.1 3,253 1 3,253
KI270538.1 91,309 3 30,436
KI270589.1 44,474 4 11,118
KI270706.1 175,055 1 175,055
KI270707.1 32,032 1 32,032
KI270708.1 127,682 1 127,682
KI270711.1 42,210 6 7,035
KI270713.1 40,745 6 6,790
KI270716.1 153,799 1 153,799
KI270719.1 176,845 3 58,948
KI270722.1 194,050 24 8,085
KI270723.1 38,115 3 12,705
KI270725.1 172,810 3 57,603
KI270726.1 43,739 3 14,579
KI270728.1 1,872,759 1 1,872,759
KI270729.1 280,839 28 10,029
KI270732.1 41,543 1 41,543
KI270733.1 179,772 13 13,828
KI270735.1 42,811 1 42,811
KI270736.1 181,920 6 30,320
KI270738.1 99,375 1 99,375
KI270744.1 168,472 13 12,959
KI270746.1 66,486 26 2,557
KI270749.1 158,759 2 79,379
KI270750.1 148,850 1 148,850
KI270751.1 150,742 2 75,371
KI270754.1 40,191 7 5,741
KI270758.1 1 0 0
KI270779.1 1 0 0
KI270783.1 1 0 0
KI270791.1 1 0 0
KI270803.1 1 0 0
KI270817.1 1 0 0
KI270819.1 1 0 0
KI270830.1 1 0 0
KI270846.1 1 0 0
KI270847.1 1 0 0
KI270850.1 1 0 0
KI270853.1 1 0 0
KI270854.1 1 0 0
KI270855.1 1 0 0
KI270860.1 1 0 0
KI270872.1 1 0 0
KI270875.1 1 0 0
KI270878.1 1 0 0
KI270902.1 1 0 0
KI270904.1 1 0 0
KI270908.1 1 0 0
KI270927.1 1 0 0
KI270938.1 1 0 0
KN196479.1 1 0 0
KN196484.1 1 0 0
KN196487.1 1 0 0
KN538361.1 1 0 0
KN538362.1 1 0 0
KN538363.1 1 0 0
KN538369.1 1 0 0
KN538372.1 1 0 0
KQ031388.1 1 0 0
KQ090016.1 1 0 0
KQ090026.1 1 0 0
KQ458383.1 1 0 0
KQ759759.1 1 0 0
KQ983258.1 1 0 0
KV575245.1 1 0 0
KV766195.1 1 0 0
KV766196.1 1 0 0
KV766198.1 1 0 0
KZ208908.1 1 0 0
KZ208915.1 1 0 0
KZ208916.1 1 0 0
KZ559106.1 1 0 0
KZ559107.1 1 0 0
KZ559109.1 1 0 0
MT 16,569 30 552
X 156,040,895 3,263 47,821
Y 57,227,415 72 794,825
Total 3,095,807,492 202,799 15,265
Number variants by type
Type Total
SNP 184,222
MNP 0
INS 9,616
DEL 8,961
MIXED 0
INV 0
DUP 0
BND 0
INTERVAL 0
Total 202,799
Number of effects by impact
Type (alphabetical order) Count Percent
HIGH 4,340 0.423%
LOW 46,067 4.495%
MODERATE 27,181 2.652%
MODIFIER 947,265 92.429%
Number of effects by functional class
Type (alphabetical order) Count Percent
MISSENSE 25,612 45.06%
NONSENSE 240 0.422%
SILENT 30,988 54.518%

Missense / Silent ratio: 0.8265

Donwulff
03-31-2019, 04:15 PM
GRCh38 reference genome Patch 13 hit the public just this month (well, this month in most of the world anyway), so I took the opportunity to check & upgrade my processing tools. As usual these come with no warranty, explicit or implied, and blah blah blah; they're just simple tooling scripts I've used for my own analysis & research, published because people keep asking how to do it. They use Bash, which theoretically allows their use on Linux or Linux-like environments, and easy use & modification without needing to install and learn additional complex frameworks. If you need full GATK Best Practices, or cluster support, look elsewhere. I've tested these with (old) BigY, Dante Labs and Genos Research raw data with some special cases; I don't have access to FGC data to test.

https://github.com/Donwulff/bio-tools includes https://github.com/Donwulff/bio-tools/blob/master/mapping/GRCh38_bwa_index.sh for creating my research BWA MEM reference index. This uses UCSC id's for the parts UCSC has officially published (up to Patch 11), with the official hs38d1 decoy sequences; this allows one to just cut the BAM at a suitable point for compatibility. NCBI naming is used for Patch 12 and Patch 13, which are added sequentially at the end. In addition it pulls whatever are the latest HLA reference sequences (these change frequently). In the previous version the path to one of the patches was incorrect, apologies about that.

The decoy-sequence is NOT updated to match the new contigs (which means that the decoy may inadvertently attract reads intended for the new alternate contigs) though that's on my TODO list. BWA mapper is alternate-contig aware, which means the reads should preferentially map to the primary assembly instead of the alternate contigs, but they can be useful for experimenting with what's in the genome.

https://github.com/Donwulff/bio-tools/blob/master/mapping/revert-bam.sh generates a uBAM (unmapped BAM file) from a source BAM for use according to the GATK Best Practices workflow. I've included simple tooling to drive BWA MEM for standard mapping and duplicate marking against a given reference index, streamlined from GATK Best Practices. New in this version: if the BAM file's ReadGroup is missing the PLatform tag (as the BigY ones do), it's added during mapping to avoid problems in downstream analysis. I've also added in the comments a tested workflow (using an undocumented BWA MEM feature) for combining BigY hs37 & GRCh38 mapped BAMs into a single BAM, which recovers a couple of reads that were filtered from one or the other. Since they don't map against the reference they're most likely garbage, but I thought why not.

https://github.com/Donwulff/bio-tools/blob/master/mapping/BQSR.sh does Base Quality Score Recalibration. The basic idea is that BQSR looks for differences from the reference which aren't known SNP's/variants. Some of these will be genuine novel variants, but the majority are actually sequencing errors. Importantly, the genuine novel variants are randomly distributed in the reads and shouldn't depend on the state of the sequencing instrument. This allows building a statistical model of where the sequencing instrument is most likely to make errors, and taking that into account.

I've now experimentally added YBrowse's list of known Y-chromosome variants into the list of known (ignored) variants; hg38/GRCh38 only, sorry - I might add a liftover, but almost everybody is using GRCh38 now. In theory, BigY is targeted to the Y chromosome and filtered to only the Y chromosome region, so the off-target reads might be garbage, but I've decided to include the whole primary assembly in the BQSR analysis now. With my own DNA, including the off-target reads changes 11 calls, which interestingly then match my WGS results rather than BigY. However, some regions of the Y chromosome sequence very poorly or get interference from elsewhere in the genome in WGS. I'm trying to find out whether BQSR on Y only or genome-wide helps with BigY results, but there's really no "ground truth" to compare against.

A snapshot of dbSNP variant database 152 should hit the public next week, at which point I'll update the BQSR script to use that for whole-genome analysis. There also seems to be a lack of analysis on how the version of the dbSNP snapshot affects the results. Many people use an old version, either because the software pipelines haven't been updated in years, or because they want to restrict the excluded variants to mostly those known before high-throughput sequencing, on the theory that many of the later additions are themselves artifacts of sequencing. Based on the mathematical theory of BQSR and a hunch, I feel it's more important to exclude as many genuine variants as possible, and the older databases are seriously lacking in non-white/non-European samples, which would further bias against "ethnic" variants, so I'm personally using the latest variant database.

The scripts do not currently include variant calling & genotyping, in part because there are different schools of thought on that. For testing purposes my present approach follows GATK Best Practices, which produces an allele-specific genomic VCF for joint genotyping (the SNP names are prepared in the BQSR script above, although the naming requires that all alt alleles match, so many are missed).
gatk-4.1.1.0/gatk --java-options -Xms30G HaplotypeCaller -R /mnt/GenomicData/GRCh38/hg38-p13-all.fa -I sample1BigYmerge.qsort.srt.bam -D /mnt/GenomicData/snps_hg38.vcf.gz -ERC GVCF --ploidy 1 -O sample1BigYmerge.mapsort.chrY.ploidy1.AS.g.vcf.gz -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation -L chrY

You may want to parallelize over different ranges/chromosomes, and set the ploidy according to the chromosome (standard diploid genotyping may be of interest as well for palindromic regions, or simply for detecting dubious calls, but haploid matches commercial Y analysis best).

Donwulff
04-02-2019, 11:37 PM
The human variant rs-ID database at https://www.ncbi.nlm.nih.gov/snp states on its front page: "April 2, 2019: Starting next week dbSNP will release the new Entrez updates that will include the latest dbSNP build 152 data for user testing." Regardless, looking at the data & updates page it looks like the old-style VCF files of the database are coming back, and even if they are, they're lagging several months behind the dbSNP 152 snapshot. (I'll check what they say about this on the GATK forums in time.)

Therefore I figured out a procedure to convert the NCBI sequence-id based dbSNP VCF into the more familiar UCSC contig id's used by GATK. It's now included in the BQSR.sh script. The conversion took half an hour for me to run (maximum compression, because the result file will be saved for a while), although it will also need to download the 14 gigabyte dbSNP dump into the working directory the first time.
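
The same idea can be sketched with bcftools alone, assuming a hand-made two-column mapping of RefSeq accessions to UCSC names (NC_000001.11 -> chr1, NC_000002.12 -> chr2, and so on); my script does it its own way, this is just the gist:

bcftools annotate --rename-chrs refseq_to_ucsc.txt GCF_000001405.38.bgz -Oz -o dbsnp152.hg38.ucsc.vcf.gz
tabix -p vcf dbsnp152.hg38.ucsc.vcf.gz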

Contrary to the "official" GATK dbSNP dumps of the past, the file created in this script includes every contig with a UCSC id listed by the Genome Reference Consortium https://www.ncbi.nlm.nih.gov/grc/human data. I'm teetering on this a bit, because at least in a BigY BAM (which is targeted AND filtered), most non-Y-chromosome hits are likely to be random chance rather than actual hits, and it's a significant departure from GATK Best Practices. However, comparing the results on my own BigY, the changes in calls seem altogether reasonable, so I've included this change for experimenting.

I'm considering validating this "simple" pipeline against HG001/NA12878 to see if I've diverged far from what's considered canonical. However, I'd also like to stress that, contrary to what many people may have heard, whole genome sequences rarely achieve much better than about 99% concordance with each other or with the "expected" results. Some of the regions on the Y chromosome are even harder, and when you get differences there you don't really know which call is better, or if it's just noise. But at least I can employ the philosophy of "Use the latest data, use ALL the data" ;)

MacUalraig
04-03-2019, 07:45 AM
The conversion took half an hour for me to run (maximum compression, because the result file will be saved for a while), although it will also need to download the 14 gigabyte dbSNP dump into the working directory the first time.



My monthly limit is 15 GB. I have been toying with upgrading to the 'unlimited' (ahem) level, but often I don't need it, so it's a tricky choice.

Donwulff
04-03-2019, 08:39 PM
Ah yeah, limited mobile plans are interesting too... I'd say bioinformatics pretty much requires an unlimited plan; the dbSNP dump isn't the only big file needed. On the other hand, to make this more accessible there are things like UseGalaxy, and I was toying with the idea of making a pre-built Amazon virtual machine image for some of these analyses.

And finally, the reference genome patches & dbSNP dumps are by nature such that most of the existing data remains unchanged, so it would be possible to just download the changes and merge them into the previous version. Unfortunately, nobody's really thinking of metered connections in bioinformatics. Heck, it's a royal pain even with unmetered connections; the BLAST refseq-genomic database is 350 GB and every update requires a complete re-download.

The dbSNP re-organization seems to mean that rather than just new SNP's being added, all the row formats have changed too, so this isn't even a "just get around to doing it" thing this time.

/mnt/GenomicData$ zgrep -m1 rs775809821 All_20180418_GRCh38p7.vcf.gz GCF_000001405.38.bgz
All_20180418_GRCh38p7.vcf.gz:1 10019 rs775809821 TA T . . RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
GCF_000001405.38.bgz:NC_000001.11 10019 rs775809821 TA T . . RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL

Though bioinformatics workflows in general don't even need that INFO string; you just need "1 10019 rs775809821 TA T", which gives the chromosome, the coordinate on that chromosome (in that build), the SNP rs-ID name, and the reference and alt alleles. (Actually BQSR strictly doesn't require more than contig & coordinate to exclude a site, but I'm assuming that at some point you'll want to annotate the rs-ID's onto the correct variants for easy lookup, and check that the ref and alt match.) I can imagine that if this was wrapped into some simple "internet appliance" type image or app, you'd want to minimize the downloads.
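
Dropping the INFO column in bulk is likewise close to a one-liner with bcftools (a sketch; the output name is a placeholder):

bcftools annotate -x INFO GCF_000001405.38.bgz -Oz -o dbsnp152.noinfo.vcf.gz
tabix -p vcf dbsnp152.noinfo.vcf.gz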

Donwulff
04-03-2019, 08:43 PM
Double post

Donwulff
04-05-2019, 10:41 AM
Original size of GCF_000001405.38.bgz (GRCh38 dbSNP build 152):
14,038,308,213 bytes (14 gigabytes)

Maximum bgzip compression with GATK contig names:
13,581,306,990

INFO column removed (Loses gene annotation, but there are other resources for this):
6,705,966,745

This is kind of significant already; cutting the file size to less than half should actually speed up the BQSR workflow somewhat, because it's half as much IO and decompression for that file.

I kind of forgot how annoying bioinformatics can be while I looked at producing dbSNP build differentials. The dbSNP dumps are sorted only by genomic site; the ordering of different variants within the same genomic site varies. This means a simple comparison between dbSNP versions reports almost all lines as changed. I think I knew this before, but it took a while to find out that Picard Tools VcfSort is the only method that actually sorts the rsID's within genomic sites as desired.

The next problem is that Linux diff will want to load the WHOLE files + more into memory to diff the whole genome. One could probably pull this off on a cloud machine with 256 GB of memory or something. This gave me the idea that, knowing the files are already sorted by genomic site, a script could produce a smaller delta between revisions with a lot fewer resources, but this turned out to be quite a brainteaser to write. I had to stick to testing with a single chromosome; a Unix patch-file for the changes in dbSNP 151 => 152, xz-compressed, takes about 6% of the original bgzipped chr1 dbSNP size, so a patch for the whole file should be about 1 gigabyte.

In addition, it would STILL be necessary to download the original dbSNP dump to apply the delta to. Using xz compression instead of bgzip seems to reduce the file size to about 77% of maximum bgzip compression, so the dbSNP dump without the INFO column should be about 5 gigabytes, but it would require extracting and re-compressing with bgzip for use. I'm also going to have to see if I can host these files off Sequencing.com; otherwise I'm not sure how it would be possible to offer them. However, it's worth noting that either way, processing the xz-compressed dbSNP dump ready for use, to say nothing of the actual bioinformatics workflow, will require very significant computational resources & electricity, so I'm not really sure the effort was worth it ;)

Donwulff
04-06-2019, 04:31 AM
My bad, Picard Tools VcfSort doesn't actually sort variants at the same genomic site (chromosome coordinates); it just preserves the existing ordering, and GCF_000001405.38.vcf.bgz happened to be in the order I wanted. I made my dbSNP comparison script sort them in the order I wanted before comparing, but this will make it slightly harder to apply the delta/diff between versions. The dbSNP 151 to 152 change list came out to 222 megabytes xz-compressed.

Just tested and updated my https://github.com/Donwulff/bio-tools/blob/master/mapping/BQSR.sh script to use a pre-prepared dbSNP 152 GRCh38 VCF file hosted off Sequencing.com (well, if the download succeeds). It came out to 6.3 gigabytes in the end, compared to the 14 GB original.

Unfortunately, Sequencing.com is pretty picky about what kind of files it will store, and xz-compressed VCF isn't one of them. This is kind of a shame, because xz-compressed the file size comes to 4.7 gigabytes, although it would be necessary to uncompress & recompress it with bgzip to make it usable with GATK etc. The dbSNP delta couldn't be served off Sequencing.com either; one possibility might be to make the format conform to VCF, although that would make it larger, so I'm thinking of just using BitTorrent if there's ever demand for these files.

Worth noting that the BQSR.sh script will *also* pull all the GRCh37 resources, so people on metered connections should check what it does. Most of the script is just fetching & preparing the known-sites lists for BQSR from various sources; the *actual* BQSR is basically just:

tabix -l ${DBSNP} \
| xargs -IZ -P${CORES} gatk-${GATK}/gatk --java-options -Xms4G BaseRecalibrator -R ${REF} \
--known-sites ${DBSNP} \
--known-sites ${INDEL1} \
--known-sites ${INDEL2} \
--known-sites ${YBROWSE} \
-I ${SAMPLE} -O ${SAMPLE}.Z.recal -L Z
gatk-${GATK}/gatk GatherBQSRReports -I chr.list -O ${SAMPLE}.recal
gatk-${GATK}/gatk ApplyBQSR -R ${REF} --bqsr ${SAMPLE}.recal -I ${SAMPLE} -O ${SAMPLE%%.srt.bam}.bqsr.bam

But you need to know which known sites to use, where to get them, and how to prepare them for use, which the script hopefully helps with. The scripts are intended for people who have access to a high-performance Linux system (cloud servers will do just fine) and want to play with re-processing their Dante Labs or BigY BAM's (other providers will likely work just as well). Ideally, people would experiment with different genomic references, known-sites lists and contigs to include, for example in the BQSR statistical model, and co-operatively find the best solution.

Donwulff
05-26-2019, 05:32 PM
I landed the promised MarkDuplicatesSpark-based short-read alignment script in my GitHub account at https://raw.githubusercontent.com/Donwulff/bio-tools/master/mapping/revert-bam.sh

I'll have to say that MarkDuplicatesSpark is probably the most frustratingly annoying bioinformatics tool I've come across. My first attempt failed on creating the BAM index because it ran out of memory with the default Java memory allocation, the second failed somewhere near the end because it throws temporary files all over the place and there's no way to continue, the third try failed because at the end of the process it complains about a leftover directory from the previous run, the fourth try failed midway because it realized there was a single-end read in the input... well, you start getting the idea.

On try n+1 it... just crashed on creating the BAM index for no apparent reason that I could find, essentially giving the same result as try #1. Okay, so I just generate the BAM index with samtools instead, and call it a day. Or rather, the 48 hours it took for the test runs. Most of the issues wouldn't have been found on smaller test input, but ironically, on a four-core AMD the MarkDuplicatesSpark flow actually takes much longer than the previous MarkDuplicates workflow. This is contrary to what the Broad Institute says about its performance, although it might be in part because my MarkDuplicates version was already optimized with a fifo buffer, samtools for sorting and indexing etc. On the bright side, MarkDuplicatesSpark manages to keep the cores busy most of the time, so it SHOULD be faster on 8 or 16 execution threads. Unfortunately, you can't make the BAM slices it processes larger than they are, so I imagine it's wasting a whole lot of work on juggling the Apache Spark/Hadoop blocks.

That being said, MarkDuplicatesSpark works on a single compute node with no additional requirements besides Java & GATK, so I made it the new default workflow, because people running it probably have 8 cores or more and are in a hurry, right? I still preserved the old version, which can be used by commenting the GATK_SPARK definition out. Maybe I should automatically choose between them depending on the number of processor threads. But since MarkDuplicatesSpark seems to have a tendency to blow up, I added an extra define, KEEP_TEMPORARY, which defaults to preserving the raw bwa mem output before MarkDuplicatesSpark in case one has to go back to it. I would imagine it's really, really slow without an SSD for temp files, due to it using literally tens of thousands of fragments, but I'm not crazy enough to try.
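
For reference, the invocation is essentially of this shape (a sketch with placeholder names; the input needs to be queryname-sorted or queryname-grouped, and I index with samtools afterwards as noted above):

gatk --java-options "-Xmx30G" MarkDuplicatesSpark \
    -I sample.qsort.bam \
    -O sample.markdup.bam \
    -M sample.markdup_metrics.txt \
    --tmp-dir /ssd/tmp
samtools index sample.markdup.bam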

I would appreciate hearing about any unexpected effects should people try the script out, albeit in the end one is always responsible for checking the script to make sure it does what they need it to do. To that end, there are a bunch of links and references added in the comments.

Donwulff
07-05-2019, 03:11 PM
I started this thread in response to questions on how to properly align Dante Labs sequences to the reference genome, and then wrote a script illustrating how to prepare the Broad Institute's GATK Best Practices then-new "unmapped BAM" format raw data file. Just a reminder, though, that this isn't in any way Dante Labs specific; in fact my scripts include tricks for many other sequencing providers as well.

Anyway, I've re-submitted my Dante Labs sequence to YFull for analysis using the current version (oh dear, I may have to start version-numbering the scripts...) of my re-mapping script. Just confirming that they recently finished the analysis based on that re-mapped GRCh38/hg38 BAM, I'm happy with the results and at least YFull didn't have any comments about the mapping. I used the full options of the mapping pipeline script, i.e. BQSR, adapter read-through trimming, alt-aware mapping with k8/postalt post-processing, Spark parallelization, as well as my up-to-date GRCh38p13 + filtered decoy + HLA-3.36.0 reference (well, 3.37.0 is out just today, although it's dated on the 10th... I'm thinking of freezing this on HLA-3.36.0 because I don't want HLA changes affecting mapping comparisons).

There were no new novel SNP's found, which is somewhat expected for a WGS GRCh38 conversion, but one novel SNP was removed in my case. I've yet to get around to analysing why that happened. However, 11 "private" SNP's which have been seen in other phylogenetic branches of the Y chromosome but not mine (i.e. with a high likelihood of being errors) were removed in comparison to the hg19 mapping. Two such SNP's were added. In addition, 38 SNP's which should phylogenetically be there were found, and 10 were lost, mostly due to ambiguity, i.e. reads supporting both variants. This is giving me high confidence that the new mapping is an improvement over the earlier one. Consequently I'm ready to call that "version 1.0".

If I ever get a year long vacation, I might get around to checking each of those SNP's in different mappings of the same raw data ;)

I'm currently testing adding the oral microbiome to the reference genome, which might improve mapping of saliva samples; I think it's looking good right now. However, since this part isn't "best practices" or anything, I guess for this step I really should validate the results against Genome In A Bottle, simulated reads, and manual inspection of some affected variants. That's just... a lot of work.

EDIT: Whoops, https://github.com/Donwulff/bio-tools/tree/master/mapping so people don't need to go through previous messages hunting for the link. Be forewarned that this is "of course" Linux/Unix, with plenty of dependencies and a significant amount of downloaded resources. I'm doing this for free, so it's not out-of-the-box ready, at least yet, although wrapping it in a virtual machine would be easy.

Donwulff
07-05-2019, 05:02 PM
Slight digression on the topic of reference genome validation: it's striking that "Similarities and differences between variants called with human reference genome HG19 or HG38" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2620-0), published in March, states "To date, the impact of genome version on SNV identification has not been rigorously assessed." While this claim hinges on the definition of "rigorous", and their assessment looks rigorous indeed, GRCh38 was released in 2013 and has been considered the standard reference assembly since 2014. This means it has been used half a decade without "rigorous" assessment, on the assumption that since it's newer it must be better.

This new paper does end on the note "Our findings suggest caution when translating genetic findings between different versions of the human reference genome. After carefully reviewing the results of our in depth comparison, we recommend that newer version (HG38) should be used going forward in SNV analysis." though. In more layman's terms, this means that GRCh38 should be used in new analysis pipelines and products, but the results are different enough that any interpretations (of the "rs123456 means you're going to die" variety) should be re-validated on the new reference genome.

Of course, various third-party interpretation services certainly didn't validate those variants the first time around, they simply copied them from publications or 23andMe, so this might not pose a huge challenge for them ;) For really serious clinical interpretation this does pose a challenge, because the majority of studies and entries in databases like ClinVar were made using GRCh37/hg19, and you shouldn't automatically trust that the findings will be the same using a new reference genome.

Returning to the issue at hand, that's also why you don't usually use the bleeding-edge latest reference genome with all imaginable additions - there's just not enough time in the world to validate and gain experience with every possible reference genome. (This is a very, very interesting point for GRCh39/graph-based reference genomes too; I think most studies are currently just dealing with the need for graph-based, population-specific or pan-genome references, but very few are considering the analytic complications of such.)

On the flip side, something like adding the oral microbiome to the reference will primarily mean that some reads map preferentially to the oral microbiome, whereas without the microbiome in the reference they'd have mapped against the human genome. This is far more likely to increase false negatives than false positives. In the foreseeable state of genomics, false positives are a much bigger deal than false negatives, because basically no genomic test can prove that you're immune to a condition, at most that you have lower genetic risk. Obviously, you don't want to lose too many real true variants, but I have a very strong reasoned hunch that adding the microbiome should improve results from saliva sequencing. It's hard to exactly "prove" though. Simulate all ClinVar variants and align them against the reference genome with and without the microbiome, or simulate reads from the microbiome and see where in the human reference genome they map? This approach will miss many complications in real genomes - microbiome that's not included in the reference, unexpected sequencing error types and biases etc. Or analyse a huge number of real genomes with both references, find discordant ClinVar variant calls and try to figure out which is the true call? Without the ability to Sanger-sequence each discordant variant in a sample, this would be mostly a judgement call.

Nevertheless, it's an interesting topic to explore and compare, and as I've noted before, the oral microbiome in DTC saliva sequences is an as-yet mostly untapped resource.

Donwulff
07-05-2019, 08:23 PM
In a similar vein to some wise-ass's crack that there are only two difficult issues in programming, caching and naming... frigging naming.

My bleeding-edge reference genome currently takes its cue from Heng Li's hg38DH, in which he named HLA sequences like "HLA-DPB2*01:01:01:01". Now, under the SAM file-format specification the colon : is an allowed character, but considering how common the notation "chr1:20000" or "chr1:20000-30000" is, this turns out to be generally a Bad Idea. Yes, I did find that out the hard way ;) I haven't yet decided what I should use instead, though, because I haven't run into any other convention for identifying HLA alleles. I suppose underscore _ is pretty common for "unrepresentable character", though in the human assembly it's used explicitly for space, but I'm still thinking.

And yes, if you're writing bad code (and who isn't, for quick tooling scripts, and sometimes you intentionally want to support wildcards) the star is also going to cause problems. I was thinking of letting it pass because that would force people to at least think about security when writing scripts around it, but yeah, basically I'm trying to come up with an acceptable alternative naming scheme for HLA alleles, so I'll probably need to change that as well as the minus character, because some downstream analysis is going to decide it means the contigs from HLA to DPB2 ;)

I'm not sure there's a convention for the microbiome, and for various reasons this is actually going to be problematic - the microbe names aren't settled, many aren't even named, and few have very high quality contiguous reference genomes. I'm nevertheless thinking of prefixing them with "HOM_" for Human Oral Microbiome, to prevent naming collisions and allow easy filtering. Of course, this will turn into a problem down the line because many of them aren't exclusive to the oral cavity. Perhaps people who have done commercial metagenome/microbiome tests know if those have a convention already? I don't think there are easy solutions to the naming problem, though at least there are tools that make it easy to remap names in most bioinformatics file types to whatever convention is needed, so it's at least manageable.
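For what it's worth, the renaming itself is trivial; a minimal sketch of the kind of header rewriting I mean (the input file names are hypothetical, and only the FASTA header lines are touched):

# Replace the * and : characters in HLA contig names with underscores
sed '/^>/ s/[*:]/_/g' hla_gen.fasta > hla_gen.renamed.fasta
# Prefix every eHOMD contig name with HOM_ for collision-free, easily filterable names
sed '/^>/ s/^>/>HOM_/' ehomd_genomes.fasta > ehomd_genomes.prefixed.fasta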

And speaking of that, I'm using the UCSC naming (chr16_KI270853v1_alt etc.) only for contigs that UCSC has actually properly named; currently I decide this by inclusion in the UCSC reference genome, the latest version being at http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/p12/ The rest keep the NCBI GenBank accession naming like KI270853.1 (or is that KI270853v1? Yes, this is hard! KI270853.1 is the official name, but I fully expect some analysis to choke on the dot, because v1 is used in the UCSC names). There's a theoretical chance that the contig that ends up in the UCSC version could differ from the NCBI one, so one could wait until UCSC finalizes it. The revision should change in that case, but I've seen at least the soft-mask change between releases. On the other hand, THIS is going to be a problem because dbSNP is released with NCBI GenBank accession naming, meaning that versions from different times will become incompatible (which, in a way, illustrates why GenBank accession naming would be superior, but it isn't easy on the eyes, or for filtering etc.). Luckily, few people really care about the alt contigs and patches currently; right now you never see "The pathogenic variant is chrUn_KI270750v1:23602367" or see them used for genetic genealogy matching. I might just switch to full UCSC naming for human contigs if that ever happens.
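When files with mixed conventions do need to be reconciled, the remapping is at least mechanical. A minimal sketch, assuming you maintain a two-column mapping file (old name, tab, new name) yourself:

# chr_map.txt has lines like: KI270853.1<TAB>chr16_KI270853v1_alt
bcftools annotate --rename-chrs chr_map.txt -Oz -o renamed.vcf.gz input.vcf.gz
tabix -p vcf renamed.vcf.gz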

But this is just to note and put a little caveat that, yes, naming is still hard.

Donwulff
07-05-2019, 11:27 PM
Another footnote: After mapping the "common" oral microbiome in my sample, the vast majority of the remaining unidentified reads are from Sus scrofa... more commonly known as pork. The sample-taking instructions normally tell you not to eat, drink or chew gum for 30 minutes prior to giving the sample, which is more about not washing the cells carrying your own DNA out of your mouth. I'm sure I observed at least that requirement, but I have a faint memory that I ate something a few hours earlier and thought it would be funny if I got its genome along. That was a couple of years ago, so I no longer remember the specifics, but I guess we may know now! ;) What strikes me as odd is that I'm not seeing any plant DNA, though; you'd think there would be potatoes or rye or something. Some articles have talked about how long foreign DNA is likely to survive in the mouth, which normally isn't very long, and I recall brushing my teeth earlier and hoping the chemicals wouldn't mess with the chemistry. It could also be contamination, but in any case it should be noted that there are a LOT of reads in a whole genome sequence and these are a very, very small part of it.

However, in light of this, in case this kind of analysis ever becomes commonplace, people taking saliva-based sequencing tests now might want to make careful note of what they eat and how long before giving the sample, or why not even choose something interesting to eat beforehand!

Meanwhile I'm left with the problem of these reads complicating my analysis, because it's relatively computationally intensive to identify unknown genomic sequences, but as luck would have it an improved Sus scrofa reference genome has JUST been published: http://dx.doi.org/10.1101/668921 so it's easier to just align them to this genome now.
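For anyone who wants to poke at their own leftovers, roughly what this looks like (file names are placeholders; -f 12 keeps only pairs where neither mate mapped to the human reference, so reads whose mate did map are ignored):

# Pull out read pairs where neither mate mapped to the human reference
samtools view -u -f 12 my_sample.bam | samtools sort -n - | \
  samtools fastq -1 unmapped_R1.fq -2 unmapped_R2.fq -s /dev/null -
# Align the leftovers against the pig assembly (bwa index it first) and get a rough mapping rate
bwa mem -t 8 Sus_scrofa_11.1.fna unmapped_R1.fq unmapped_R2.fq | samtools sort -o leftovers_vs_pig.bam -
samtools flagstat leftovers_vs_pig.bam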

pmokeefe
07-24-2019, 10:49 PM
I'm about to start working on some BAM files I received from Dante Labs.
I noticed in the VCF header that corresponds to the BAM:

##reference=file:///l3bioinfo/hg19.fasta
I happen to already have an hg19.fa fasta file that I built by downloading the file "hg19.2bit" sometime last year for a different purpose. Unfortunately I don't recall where I got that.
Does anyone happen to know how I might be able to determine if the fasta file I already have is the same file that Dante used? If they differ just for the mtDNA, I don't care, as I'm not interested in mtDNA right now. But would that cause problems for the rest of the sequence when using

bcftools mpileup
for example?
If my fasta file is different, in a way that could cause problems for non-mtDNA, any ideas for a workaround, or where to get the proper fasta file for the Dante Labs BAM?

Donwulff
07-25-2019, 08:28 AM
The hg19.fasta file Dante Labs' bioinformatics workflow used in my sample had the contigs in a different order from the "official" hg19. It's not too surprising, as a common way to construct it is to concatenate the individual contigs/chromosomes together. This can easily be seen from the mtDNA being at the beginning of the VCF. The contents of the contigs themselves should not differ, or there's an error somewhere. I'm not sure if there is a common reference with them in that order that you could just download.

To get the contigs in same order, with shell & samtools you could do:


# Download and index the stock UCSC hg19 reference
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gzip -d hg19.fa.gz
samtools faidx hg19.fa
# Re-extract the contigs in the order the BAM header lists them
# (idxstats' last line is the unmapped "*" bin, hence head -n-1)
samtools faidx hg19.fa `samtools idxstats my_bam_file.bam | head -n-1 | cut -f1` > hg19_dante.fa

Those are backticks. Is there markup for [CODE] blocks here...
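As for checking whether your existing hg19.fa has the same content, one way that ignores contig order is to compare the per-contig MD5 checksums that samtools dict computes (file names here are placeholders):

# Per-contig name + MD5 for both references, order-independent
samtools dict my_hg19.fa | awk '/^@SQ/ { for(i=1;i<=NF;i++) if($i ~ /^(SN|M5):/) printf "%s ", $i; print "" }' | sort > my_hg19.md5
samtools dict dante_hg19.fa | awk '/^@SQ/ { for(i=1;i<=NF;i++) if($i ~ /^(SN|M5):/) printf "%s ", $i; print "" }' | sort > dante_hg19.md5
diff my_hg19.md5 dante_hg19.md5   # no output means the sequences are identical, whatever the order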

pmokeefe
07-25-2019, 09:48 AM
The hg19.fasta file Dante Labs' bioinformatics workflow used in my sample had the contigs in a different order from the "official" hg19.
...

Thanks. In the meantime I started running a very long bcftools process using my previous hg19 FASTA; it looks like it's going to take around 24 hours just for the initial pass through the BAM! Hopefully the FASTA I'm using won't make a significant difference. But if something looks wrong and I have to start over, I will try what you recommend.

Donwulff
07-25-2019, 05:10 PM
I couldn't remember why I had to re-order the fasta in the first place; I think it's mainly GATK, which talks about "ROD"s, "reference-ordered data". When you're combining data from several huge resources, it helps to have them in the same order so programs can just stream them and have the same data from all files at once. File indexes get around that problem, though, so bcftools may work just fine with a different contig order.

The main thing is that the contigs being processed need to be named the same (eg. chr1 vs. 1, or even something like K12415v2 etc.) and have the same length. I believe every tool checks that those match. At the plain variant calling phase, the reference is mainly used to determine which variants match the reference and which differ from it and therefore need to be output to a standard VCF. There are some interesting things one could intentionally do there, like the hg19K https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5854-3 reference. Because the hg19 reference is based on actual humans, the sequence contains some minor alleles, ie. ones that differ from the common/ancestral variant. hg19K attempts to replace those minor alleles with the more common alleles. If you did variant calling against that reference, it would accomplish some of what the gVCF/BED files try to, ie. list variants even when they match the canonical reference, if they're true minor alleles. Unfortunately I don't know where you could download hg19K, but now I'm feeling like I should construct one ;)

Other than that, every hg19 reference ought to have the same sequence, though; they may just be in a different order.
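A quick sanity check along those lines - comparing contig names and lengths between the BAM header and the FASTA index - could look something like this (file names are placeholders):

# Contig name and length as the BAM header sees them
samtools view -H my_bam_file.bam | awk -F'\t' '/^@SQ/ { sub("SN:","",$2); sub("LN:","",$3); print $2, $3 }' | sort > bam_contigs.txt
# Contig name and length from the FASTA index
samtools faidx hg19.fa
cut -f1,2 hg19.fa.fai | tr '\t' ' ' | sort > fasta_contigs.txt
diff bam_contigs.txt fasta_contigs.txt   # empty output = same names and lengths, order aside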

pmokeefe
07-25-2019, 06:03 PM
My immediate issue is that I have two recent Dante Labs tests without a gvcf and an older one that did come with a gvcf. I have found the gvcf useful. In the short run I'd be happy with a quick and dirty solution which works OK for SNPs - I can wait for decent indels etc. I found this simple solution:

bcftools mpileup -Ou --gvcf 0 -f hg19.fa my.bam |
bcftools call -m --gvcf 0 -Oz -o my.g.vcf.gz
https://www.biostars.org/p/360278/
It's currently running on my old kit (has been for many hours!) and if it ever finishes and produces a result, I will compare that with the gvcf I received directly from Dante last year. If that looks good I will run my two new kits.
I've also ordered analysis which should include hg38 gvcfs from companies including sequencing.com and FGC.

Any other methods or companies I should consider?

Donwulff
07-25-2019, 07:46 PM
Depends a lot on what your goals are with the processing. For one thing, I'm starting to want to get the word out that gVCFs aren't really "a must". Sure, they're better, in the few places that accept them. However, for example with Promethease they *mostly* fill one's report with a lot of spammy "This variant matches the reference, so it *doesn't* have this effect..." entries. As the hg19K/hg19Kindel paper above points out, there are some variants where the minor allele isn't reported, but the catch is that all researchers have been using stock hg19, so pathogenic variants matching the reference genome just haven't been studied. In other words, gVCF/BED or hg19K might make a difference for research, but I would wager it doesn't really add anything of value to Promethease reports most of the time. This does make me think that I should make something like a GRCh38KClinVar reference with all ancestral & non-pathogenic alleles, which would produce calls for pathogenic variants as well. Field, M.A. et al. 2019. Recurrent miscalling of missense variation from short-read genome sequence data. BMC Genomics 20, Suppl 8 (Jul. 2019), 546 is interesting in this respect.

I've been working on a "bleeding edge" pipeline for personal computers, work I've been partly announcing on this thread. That started a few years back, though, so the bleeding edge is now closer to the recommended workflow. I haven't yet published any of the variant calling bits, but those are even closer to the older GATK workflows. The alignment & reference building scripts live at https://github.com/Donwulff/bio-tools/tree/master/mapping - I'll have to document the *actual* features there though ;)

- exact trimming of adapter read-through, including for Dante Labs data; all existing pipelines mess up on the BGISEQ adapters (see the sketch after this list)
- removal of all duplicate reads according to the latest GATK Best Practices recommendations
- optionally using the latest reference genome patches and Human Leukocyte Antigen references
- proper handling and post-processing of alternate contigs; this can significantly improve mapping in many places
- decoy sequences cleaned of those that collide with the newly added reference contigs; oral microbiome work ongoing
- Base Quality Score Recalibration using the latest known variant sites from dbSNP
(Note to self: Need to add gnomAD-SV; also possibly insert length covariate)
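To illustrate the first point in isolation, here's a rough sketch of BGISEQ adapter trimming with fastp. The adapter sequences below are the commonly cited BGISEQ/MGISEQ ones and should be treated as an assumption to verify against the overrepresented sequences in your own FastQC report; my script handles trimming differently, this is just a standalone example:

# Trim BGISEQ adapter read-through from a paired-end FASTQ set (adapter sequences assumed, verify first)
fastp \
  -i sample_R1.fq.gz -I sample_R2.fq.gz \
  -o trimmed_R1.fq.gz -O trimmed_R2.fq.gz \
  --adapter_sequence    AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA \
  --adapter_sequence_r2 AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG \
  --json trim_report.json --html trim_report.html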

For variant calling, besides using the latest dbSNP etc. resources, there's also joint genotyping, which would allow one to exploit similarities between sequences. I'm working on adding new stuff whenever I can, though. Of course, as noted, this started as an example of how to do it a few years back; right now one might want to roll out the Broad Institute reference pipeline that's been published since then on a cloud service (or a personal cluster, given access to one). If the goal is to produce data comparable to Broad Institute/medical sequences then that would be a full stop there... if looking to do certain kinds of research then upgrading the resources in it is an option too.

pmokeefe
07-25-2019, 11:43 PM
Depends a lot on what your goals are with the processing.
...

Thanks for the comprehensive reply, all good points, I don't disagree.

But here's why I still like the gVCF. I'm constantly browsing bioRxiv and Pubmed looking for interesting papers, especially those which mention SNPs. Here's an example of a paper that caught my eye: Genome-wide association study provides new insights into the genetic architecture and pathogenesis of heart failure (https://www.biorxiv.org/content/10.1101/682013v1). The paper mentioned rs17042102, which did not appear in my Dante Labs VCF (nor in my test results from Ancestry and 23andMe). So the question is: did rs17042102 fail to appear in my VCF because I had a high-quality read with two copies of the reference allele, or because it was a low-quality read? When I looked at the gVCF it was high quality with two copies of REF.

I've received the results from two Dante Labs WGS 30x tests, and using vcf-compare, there were about 4% unique sites in each test. So 4% appears to be a lower bound on the number of sites which are missing from a single Dante Labs VCF due to low quality. So I'm guessing that, unlike the case of rs17042102, >4% of the time when I look in the gVCF for a SNP that's missing from my VCF, I'm going to find it's a low-quality read - as opposed to two copies of REF. I'd like to know that.

Results from papers and preprints in bioRxiv and Pubmed may not appear in ClinVar for a while. I suspect in many cases they will take a while to show up in the commercial interpretation services as well. Also, some things are interesting to me personally even though they don't meet the criteria for inclusion in ClinVar. When I'm skimming a new paper, I'm often curious about my own variant and I want to look it up while I'm reading. If it's the usual case and I have the non-risk variant, I can relax:) If I do have a copy of the risk allele, I'll probably investigate further:( If the quality of the call is low, maybe I need to take another DNA test;)

Donwulff
07-26-2019, 07:28 PM
Yes, that's why I specifically mentioned that this applied to Promethease in particular. If you have a (very) short list of variants of interest that you want to check with something like tabix, you probably want to know whether they were no-calls or reference calls. But then again, if you have the BAM file, it would probably be better to inspect the alignment in the BAM file at that point.
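For example, something along these lines, where the coordinate is a made-up placeholder for whatever variant you looked up (both the gVCF and the BAM need to be indexed):

# Look the site up in the bgzipped, tabix-indexed gVCF; a <NON_REF> block covering it means a reference call
tabix my.g.vcf.gz chr4:111111111-111111111
# Then eyeball the supporting reads and depth straight from the BAM at the same position
samtools view my.bam chr4:111111111-111111111 | head
samtools depth -r chr4:111111111-111111111 my.bam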

It's worth cautioning that sequencing (or any genotyping, really) produces false positives as well, depending on the processing settings. If a variant appears in one sequence but not in another from the same person, there's a higher chance it's a false positive, so you can't take the difference as a lower bound on missing sites. Incidentally, in the BGISEQ paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467036/ they found their sample sequence to have 3.8% missing benchmark SNPs (vs. 3.6% for Illumina 150PE). They also had 0.22% false positives (0.16% for Illumina 150PE). This means the expected maximum difference between two sequences of the same person would be around 2 x (3.8% + 0.22%), i.e. about 8.0% (7.52% for Illumina 150PE; note the BGISEQ-500 run used only 100 bp reads, which accounts for some of the error). Both platforms had sequencing depth >10X for more than 97.8% of the genome, so a significant number of variants were missed for reasons other than read depth. The situation for saliva-derived sequences could be slightly worse, due to DNA quality & microbes in the sample.

(*) The reference genome doubtlessly should include regions which don't exist in every human genome, so 100% coverage may not even be theoretically possible. Additionally, many of the false positives/negatives will actually be due to technical factors, such as bad mapping quality, which means they are more likely to be the same variants in both sequences. I edited this post to also include false positives from both sequences, although as said, most of the missed variants would be expected to be missed in both sequences. It's also possible, if not likely, that the analysis pipeline used in the paper isn't optimal, because better variant sensitivity has been reported elsewhere.

pmokeefe
07-26-2019, 08:31 PM
I ran vcf-compare on my two Dante Labs tests:

# This file was generated by vcf-compare.
# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) 56001801067578A_WGZ.snp.vcf.gz 15001710504417A.snp.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN 1 .. number of sites unique to this particular combination of files
#VN 2- .. combination of files and space-separated number, a fraction of sites in the file
VN 138673 15001710504417A.snp.vcf.gz (4.1%)
VN 145447 56001801067578A_WGZ.snp.vcf.gz (4.3%)
VN 3264403 15001710504417A.snp.vcf.gz (95.9%) 56001801067578A_WGZ.snp.vcf.gz (95.7%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN Number of REF matches: 3264403
SN Number of ALT matches: 3264318
SN Number of REF mismatches: 0
SN Number of ALT mismatches: 85
SN Number of samples in GT comparison: 0

Donwulff
07-27-2019, 11:59 AM
I was just wondering if that included INDELs, but seeing the .snp. I'm assuming that's only single nucleotide variants (SNV). Looking at the sequence IDs, is one of them 100PE and the other 150PE, or are they both the same read length?

Otherwise I was thinking that a better match for that situation would be looking at studies about read depth, for which incidentally I don't remember a reference just now... And at least if the sequences you have are the same read length, it would probably make more sense to treat them as a single sequence: combine the BAMs (samtools merge combined.bam sequence1.bam sequence2.bam) & call variants from that. I don't think anybody's really looked at what happens if the read lengths differ, but I don't think that would pose a real problem if alignment has been done separately (because they have different insert lengths) against an identical reference. In turn, it would be possible to sub-sample from that and try to determine how much the results differ at different simulated read depths (and in the original sequences, in particular).
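Roughly what I mean, as a sketch (file names are placeholders; samtools view -s takes a seed.fraction argument, so 42.5 keeps about 50% of the reads with seed 42):

# Merge the two runs into one ~60X BAM and index it
samtools merge combined.bam sequence1.bam sequence2.bam
samtools index combined.bam
# Call variants jointly from the combined data, in the same style as earlier in the thread
bcftools mpileup -Ou -f hg19.fa combined.bam | bcftools call -m -Oz -o combined.vcf.gz
# Subsample to roughly half the reads to simulate the original ~30X depth
samtools view -b -s 42.5 combined.bam > combined_50pct.bam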

Ah actually found a new paper for reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6370902/
I'll have to digest that for a while; it has results for many interesting questions, like https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6370902/figure/Fig4/ where the sequencing results are compared to DNA microarray results at different depths (it's indeed common practice to treat DNA microarrays as the truth set that sequencing is compared against, despite most people trying to sell sequencing as the gold standard; both have their own errors) and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6370902/figure/Fig5/ which gives the concordance of subsampled read depths against the full 500X sequence. Interpolating from that, the difference between 30X and 60X (or 20X and 50X as the closest lines) isn't very much, certainly not 4% in that study.

pmokeefe
07-27-2019, 08:02 PM
Both my tests were 100PE. Has anyone been getting 150PE from Dante Labs?

Correct, the stats were SNPs only.

I'm definitely keen to do a combined analysis on the two tests as you suggest. I have a request out to FGC to do that for me, but also looking into trying it myself.

Most of my test data is on a public Google Drive folder including both my Dante Labs results. Feel free to use and share as you see fit: https://drive.google.com/open?id=1_x7ZtSenJNUyb9nsq0hcNbyizx1a39F3 The FASTQ files are there, but the BAM files are not alas.

tontsa
08-09-2019, 06:13 PM
This might be a stupid question.. but how do you do ALT+HLA mapping with bwa-postalt.js when you want to use GRCh38.p12 or p13 reference? That seqs_for_alignment_pipelines folder only exists for the non patched genome? Or is it just ok to do the fastq->sam using .p13 and then do the postalt with that base version?

Donwulff
08-09-2019, 06:42 PM
For an older HLA set there's a packaged solution: bwakit https://github.com/lh3/bwa/tree/master/bwakit contains scripts to create an alt-enabled reference with HLA and decoy sequences.
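For reference, the packaged route looks roughly like this (run-gen-ref and bwa-postalt.js ship with bwakit, and the postalt step runs under the k8 JavaScript runtime bundled with it; file names are illustrative):

# Build the hs38DH reference (GRCh38 + alt contigs + decoys + bundled HLA) and index it
run-gen-ref hs38DH
bwa index hs38DH.fa
# Alt-aware alignment, piping through the postalt script so alt/HLA hits get lifted properly
bwa mem -t 8 hs38DH.fa reads_1.fq.gz reads_2.fq.gz \
  | k8 bwa-postalt.js hs38DH.fa.alt \
  | samtools sort -o sample.hs38DH.bam -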

As I've explained somewhere in the depths of this thread, I've been constructing my own version of the reference, mainly for testing, with GRCh38.p13, up-to-date HLA & other things. This is not entirely straightforward, because Heng Li appears never to have documented/scripted the complete process. However, the essence is that the .alt file of the reference contains the mappings of the alternate contigs against the primary assembly. bwa-postalt.js uses that file to copy and "lift over" reads to the correct coordinates in the primary & alternative contigs as needed, so the .alt file essentially just provides the offsets of the alts within the primary assembly.

I've been using the -x intractg preset, which most closely resembles this situation, for generating them, but I've noted that the results don't entirely match the .alt files provided by Heng Li & NCBI.

The decoy sequences act differently from alt contigs. The idea of the decoy sequences is to "soak up" reads that belong to the decoy sequences (mostly retroviruses that integrate into the host genome; these references have been designed for blood, which is mainly host genome plus retroviruses). But because of this, it's important that the decoy sequences do not also soak up reads that match the host genome. The GRCh37d5/hg19d5 decoy sequences had a more complex method of ensuring this; for the GRCh38 version, the paper just states something to the effect that they "removed sequences with 101 bases or higher alignment to the reference". Since the decoy sequences map to some of the new alt/fix/HLA sequences in patch 13, I've re-done that by simply removing outright any decoy sequence which has a 101-base similarity to the new host reference sequences. I have not, at this point, added new decoy sequences, mostly because the source materials are unclear on that.
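One blunt way to redo that filter is sketched below: map the decoy contigs against the newly added sequences and drop any decoy with an alignment block of 101 bases or more. File names are placeholders, minimap2's default preset may miss very short matches, and this is an approximation of the published criterion rather than a reproduction of it:

# Map the hs38d1 decoy contigs against the new patch/fix/HLA contigs (PAF output; column 11 = alignment block length)
minimap2 new_patch_contigs.fa hs38d1_decoys.fa > decoys_vs_new.paf
# Decoys with a 101+ base alignment block get dropped; keep the rest
cut -f1,11 decoys_vs_new.paf | awk '$2 >= 101 { print $1 }' | sort -u > decoys_to_drop.txt
samtools faidx hs38d1_decoys.fa
grep -v -w -f decoys_to_drop.txt <(cut -f1 hs38d1_decoys.fa.fai) | xargs samtools faidx hs38d1_decoys.fa > decoys_filtered.fa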

This turned up some interesting discoveries that I've since promptly forgotten ;) For example, some of the decoy sequences were very close to contigs added in some of the patches. That is intentional and to be expected, but interesting, and it underlines that rather than just adding the different sequence sets together, one has to do some quality assurance.

tontsa
08-09-2019, 07:03 PM
Yeah, I looked briefly at your script.. but it got me even more confused. I guess I need to study more of the theory and not just try to align my .fqs to the latest and greatest in the hope of getting the ultimate vcf/gvcf that can be used in services like https://hgpat.tinybox.io.

The GATK stuff is even more confusing.. the uBAM format.. does it really need both the aligned reads produced with bwa and the unaligned reads from the .fq, which you then somehow merge into one gigantic .bam?

Donwulff
08-09-2019, 07:13 PM
These are vaguely referenced in my code:

Probably best source on the decoy sequences (GRCh37p4d5 version) from Heng Li's presentation:
http://lh3lh3.users.sourceforge.net/download/decoyseq.pdf

Also, my explanation above is a little off, because these are called "the missing human sequences": the set contains the Epstein-Barr virus and I believe some other viral sequences, but many are actually missing pieces of the reference genome as well. The process chart on slide three shows this is fairly involved; some of the tool scripts do exist, and the flow chart actually makes it clearer. But, for example, the "List of problematic BACs provided by Deanna Church et al." I have no idea how to reproduce for new decoys.

But now we can forget all about that, because that's GRCh37 and we now have GRCh38.
"The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.", 2016 by Mallick S, Li H et al. Supplementary section 5 goes into detail on the decoy sequences.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5161557/bin/NIHMS798259-supplement-supp_info.pdf

"For each sample and for each fosmid and BAC clone from GenBank (v203), we extracted >500bp segments on which each 101-mer is absent from the human reference genome GRCh38, including all ALT contigs. We mapped these segments to GRCh38 plus HLA (from IMGT/HLA 3.18.0), and dropped sequences that had fewer than 20 mismatches/gaps or over 99% identity to GRCh38+HLA."

There's much more to it in that paper, but since I wasn't adding more decoy sequences at this point, I haven't pursued it further (I'm not even sure how I would easily extract only fosmid and BAC clones from GenBank). I'll probably have to do the rest for my oral microbiome work, and also for "public" saliva sequence samples, as they likely contain additional sequences not included in the various sources. The decision to use a 101-base match for filtering is also not expanded on; one thing I'm curious to see is whether dropping the limit to, for example, 75 or 66 bases would be more appropriate for 100PE sequences.

bwakit documentation on the alt-processing is here:
https://github.com/lh3/bwa/blob/master/README-alt.md

"At this step, BWA-MEM reads the ALT contig names from "idxbase.alt", ignoring the ALT-to-ref alignment, and labels a potential hit as ALT or non-ALT, depending on whether the hit lands on an ALT contig or not." [bwa-postalt.js] "reads all potential hits reported in the XA tag, lifts ALT hits to the chromosomal positions using the ALT-to-ref alignment, groups them based on overlaps between their lifted positions, and then re-estimates mapQ across the best scoring hit in each group."

Worked example of alt-handling from Broad Institute/GATK:
https://software.broadinstitute.org/gatk/documentation/article?id=8017

Experimental script I'm using for building the "all inclusive" reference genome:
https://github.com/Donwulff/bio-tools/blob/master/mapping/GRCh38_bwa_index.sh

That has a version of the Human Oral Microbiome I'm currently using for testing, which is definitely experimental. There are some undesirable effects on duplicate marking at least, because some reads will map to thousands of similar microbe sequences; I would prefer to be able to do metagenomic processing straight from the same BAM file, but for efficiency I may have to remove similar sequences from the oral microbiome reference.

Donwulff
08-09-2019, 07:24 PM
The GATK stuff is even more confusing.. the uBAM format.. does it really need both the aligned reads produced with bwa and the unaligned reads from the .fq, which you then somehow merge into one gigantic .bam?

In revert-bam.sh I may have to make that more obvious/flexible; it started in this thread as just an example of how to turn a mapped BAM from a sequencing company into an unmapped BAM according to the Broad Institute GATK "specs". How I personally use it currently is that I just symlink the unmapped BAMs I have lying around to both sample.bam and sample.unmapped.bam (this lets me add the current experiment to the name, like sample-hg38DHO-p13.bam etc.) and then run the script on sample.bam; it will detect that sample.unmapped.bam already exists and pick up from there. But yeah, my "next TODO" is to have it detect the .unmapped.bam extension and/or actually split it into two scripts that do what they say.

In the GATK workflow, though, yeah, what they do is lift the bwa mem alignment information from the generated BAM file into the unmapped BAM. This is probably unnecessary, because you can lift the file & read headers into the aligned BAM with bwa mem on the fly, like my script does, saving on some disk I/O. The unmapped BAM is meant as the long-term storage format which contains both the reads and the metadata, though. If GATK RevertSam didn't go through a name sort, it might be useful for generating the unmapped reads on the fly from a mapped BAM, meaning you wouldn't have to keep the unmapped BAM around... so eh, more possibilities for development.

Donwulff
09-18-2019, 01:39 PM
Every time I get interested in some aspect of sequencing, there's a research paper focused on it... hm... The lead time to publication is months, sometimes years, so those ideas must have come long before I had the thought, but it's really handy when wondering about these things. Also, since the matched sequences from this paper have been made public, it will be possible for me to test my method on them. Yay!
"Impact of DNA source on genetic variant detection from human whole-genome sequencing data", Brett Trost, Susan Walker et. all
https://jmg.bmj.com/content/early/2019/09/11/jmedgenet-2019-106281

Actually this one had matched blood, saliva and buccal samples, which isn't really possible to do without institutional backing. Summary: "Unless saliva or buccal samples are preferred for reasons such as those outlined above, we recommend using DNA derived from blood samples for WGS, as it equalled or surpassed saliva and buccal samples (although often only slightly) for all comparisons performed in this study." Well, for DTC-style testing saliva definitely is preferred. The reassuring thing is that, once again, saliva/buccal isn't that much worse than whole blood. I need to add this paper to my large pile of papers to plod through in detail, though.

At present, I've been looking at separating microbial contamination in mapping, with the experimental genome reference mentioned in this thread.

I'm currently waiting for IPD-IMGT/HLA 3.38.0, which should be out any day now, to make the next version, though I've already changed the naming convention of the HLA alleles (turns out that while * and : in HLA nomenclature are allowed in contig names, the vast majority of tools interpret them as a wildcard and a field separator, completely messing up downstream analysis, so let's just use the technical contig names).

There's also a new release of eHOMD, the expanded Human Oral Microbiome Database which I alluded to above, and I've switched to using the contig names from their newest release (unfortunately those are still problematic, so I might add prefixes to them). Mapping the eHOMD reference against the human genome reference shows there definitely are some homologous sequences which might mess up saliva-based sequencing analysis; it's my hope that having the microbial sequences included in the reference will make reads map preferentially to those, comparable to the decoy sequences for whole blood, though some microbial sequences are too similar and simply have to be dropped from the reference. I do recall seeing a paper or method on such a combined reference before, but haven't found it again yet.

Finally there are the usual bug fixes; a dropped filter parameter on output caused alternative sequences with poor mapping quality to end up in the reference genome alt file. I'm not entirely certain this has much relevance, because they're homologous anyway, but since the official reference genomes appear to omit them, that's been fixed (pending some validation to see which is better, at least). A bug in the reference generation script caused extraction of the decoy sequences to fail if the uncompressed source sequence wasn't present; now the script uncompresses it as required.

That probably made no sense to most people, but I know there are some folks interested in tinkering with genomic analysis out there, in which case, go crazy. xD

Donwulff
09-18-2019, 04:01 PM
Also, Dante Labs https://www.dantelabs.com/blogs/news/10-000-oral-microbiome-project
Just for the record, I've got nothing to do with that, lol.

Geldius
09-28-2019, 12:53 AM
I have been trying to figure out how to sort two disordered FASTQ files (150 GB each), because BWA cannot process it.

Such attempt exits with errors:
[mem_sam_pe] paired reads have different names:

Although I have already aligned this data set with Bowtie 2, which doesn't require mate reads to be in the same order, I was curious whether BWA aligns it better, with more mapped reads.
The magical tool which successfully sorted my FASTQs is fastq-sort (a component of fastq-tools), available on GitHub: https://github.com/dcjones/fastq-tools
Aligning the sorted dataset with BWA was routine. It turned out that BWA produced more mapped reads than Bowtie 2.
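The sorting step itself was just one command per file (from memory, --id sorts by the read identifier; check fastq-sort --help):

# Sort both mates by read ID so bwa mem sees them as matching pairs again
fastq-sort --id reads_1.fastq > reads_1.sorted.fastq
fastq-sort --id reads_2.fastq > reads_2.sorted.fastq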

Here is comparison spreadsheet with details:
https://docs.google.com/spreadsheets/d/1Slml6jrpohaOxsSDY1Ogza-Qp-dKAuthZxbi_Uwh8YI/edit?usp=sharing

Donwulff
09-28-2019, 01:44 AM
I have been trying to figure out how to sort two disordered FASTQ files (150 GB each), because BWA cannot process it.

How do you get FASTQ files in different ordering? I've seen this sometimes happen when aggressive filtering drops one read of a pair entirely; however, I would strongly recommend using a pair-aware trimmer that can drop both reads of a pair if one needs to be dropped. Other than that, I'd be concerned there's something else wrong with the FASTQs. If you get a mapper to align paired-end FASTQ files which are in different order, odds are it wasn't running in paired-end mode but thought those were all separate reads, in which case the results will be far worse than with a true paired-end file. BWA MEM and Bowtie2 shouldn't have a huge mapping efficiency difference (as long as one uses the right parameters! Bowtie2 defaults to global alignment, for starters; --very-sensitive-local makes it more comparable) but the results do differ, which means you get some different SNPs in the end. Of course, variant filtering matters too, so it's not like you're guaranteed identical results in any case, but the results do differ (slightly).

Petr
09-28-2019, 06:19 AM
I had this problem with my first FASTQ files more than a year ago; I reported it to Dante and received a new HDD with correct FASTQ files.

Geldius
09-28-2019, 08:26 PM
How do you get FASTQ files in different ordering?

I think Dante was delivering such FASTQs during 2018.
Looking at the header of the BAM (hg19) delivered by Dante, it was aligned from 16 PE FASTQs using BWA (0.7.15-r1140). So I guess the FASTQs delivered by Dante were merged/concatenated together.

@PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_582\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_582_1.clean.fq.gz CL100056732_L02_582_2.clean.fq.gz
@PG ID:bwa.1 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_582\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_582_1.clean.fq.gz CL100056732_L01_582_2.clean.fq.gz
@PG ID:bwa.2 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_581\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_581_1.clean.fq.gz CL100056732_L01_581_2.clean.fq.gz
@PG ID:bwa.3 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_581\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_581_1.clean.fq.gz CL100056732_L02_581_2.clean.fq.gz
@PG ID:bwa.4 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_578\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_578_1.clean.fq.gz CL100056732_L01_578_2.clean.fq.gz
@PG ID:bwa.5 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_578\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_578_1.clean.fq.gz CL100056732_L02_578_2.clean.fq.gz
@PG ID:bwa.6 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_577\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_577_1.clean.fq.gz CL100056732_L02_577_2.clean.fq.gz
@PG ID:bwa.7 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_577\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_577_1.clean.fq.gz CL100056732_L01_577_2.clean.fq.gz
@PG ID:bwa.8 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_576\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_576_1.clean.fq.gz CL100056732_L01_576_2.clean.fq.gz
@PG ID:bwa.9 PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_576\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_576_1.clean.fq.gz CL100056732_L02_576_2.clean.fq.gz
@PG ID:bwa.A PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_575\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_575_1.clean.fq.gz CL100056732_L01_575_2.clean.fq.gz
@PG ID:bwa.B PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_575\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_575_1.clean.fq.gz CL100056732_L02_575_2.clean.fq.gz
@PG ID:bwa.C PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_574\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_574_1.clean.fq.gz CL100056732_L01_574_2.clean.fq.gz
@PG ID:bwa.D PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_574\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_574_1.clean.fq.gz CL100056732_L02_574_2.clean.fq.gz
@PG ID:bwa.E PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L01_573\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L01_573_1.clean.fq.gz CL100056732_L01_573_2.clean.fq.gz
@PG ID:bwa.F PN:bwa VN:0.7.15-r1140 CL:/bioapp/bwa-0.7.15/bwa mem -t 8 -M -Y -R @RG\tID:CL100056732_L02_573\tPL:COMPLETE\tPU:B5EHUMiazRAAAEAAA-573\tLB:B5EHUMiazRAAAEAAA-573\tSM:15001702300459A\tCN:BGI /rdata/genedock/hg19_broad/ucsc.hg19.fasta CL100056732_L02_573_1.clean.fq.gz CL100056732_L02_573_2.clean.fq.gz

I have used Bowtie 2 with the default parameters so far (--end-to-end --sensitive); it's worth testing it with --very-sensitive and comparing.

It seems to me that Dante Labs was more flexible with results delivery during 2018. It got worse this year; I wonder whether they'll deliver FASTQs/BAMs for kits ordered during Black Friday 2018 (I've been waiting since May).

Donwulff
09-28-2019, 09:34 PM
See http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment for global vs. local alignment. In very simple terms, global alignment tries to find a place in the reference genome where the read matches completely from end to end, whereas local alignment allows for partial matches. The fragments being sequenced in second generation sequencing are fixed to the substrate using adapter sequences ligated (glued) to the fragments. Consequently, if the fragment happens to be shorter than the read length, part of the adapter sequence is read instead. In addition, sequencing error accumulates the further along the read the sequencing proceeds, and it's common to see genomic rearrangements (structural variation), all of which can cause the read ends to become "garbage". bwa mem chooses the best alignment, local or global, while Bowtie2 defaults to expecting a global (end-to-end) match unless using --very-sensitive-local etc.

That having been said, it's really challenging to reasonably estimate aligner performance. For example, reads with garbage at the end may have garbage at the front too, and actually hurt the final results, so it might be worth trying something different. On the other hand, as far as I know/remember there isn't a single paper or source where Bowtie2 has been used with BGISEQ sequencing; all the validation and review papers have used bwa mem. Consequently, there's no guarantee that Bowtie2 works with the BGISEQ error profile at all. So definitely at your own risk.

I assume you read something like http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#paired-inputs, ie. giving the paired-end files with -1 and -2. I no longer recall exactly how Bowtie2 handles that, but if it can't pair the sequential reads from those two files, it will default to just doing single-ended alignment if it can find a unique location for each read. bwa mem does mate rescue too, but it expects the read names from both sides to match. Also, with Bowtie2 you're expected to give the correct fragment length range and orientation as parameters, though the defaults may work, whereas bwa mem determines them from the input data. Paired-end is an ingenious sequencing trick where the same DNA fragments are sequenced from both ends, and therefore you know that: 1) the paired ends are not very far from each other in the DNA sequence; 2) the sequences are oriented towards each other along the chromosome; 3) in fact, a lot of the time the DNA fragment lengths are chosen so that many read pairs slightly overlap each other, giving an unbroken read of almost twice the length. If the read ends aren't in the same order, or the paired-end options are wrong, you lose all that, of course.

Sorting the FASTQ files does solve a few things, at least assuming all the reads & files are in there, they're named correctly... and read groups are somehow separated.
Interestingly, if you go via the unmapped BAM format, this sorting & matching problem should (ahem) sort itself out, although that's probably still more complicated, and FastqToSam will likely complain that the reads don't match...

Donwulff
09-28-2019, 10:12 PM
If one runs "samtools fastq" on a reference/coordinate-sorted BAM file, they get a jumbled mess of FASTQ files. As the samtools-fastq man page says, "If the input contains read-pairs which are to be interleaved or written to separate files in the same order, then the input should be first collated by name. Use samtools collate or samtools sort -n to ensure this." Depending on what exactly is being done, other things can go wrong too. For example, "By default, BWA-MEM uses soft clipping for the primary alignment and hard clipping for supplementary alignments," so the first alignment that "samtools fastq" encounters in a reference/coordinate-sorted BAM file may actually be hard-clipped, ie. a local alignment with the non-matching parts clipped off, if the read has both local and global alignments. Base qualities might be recalibrated, which is good or bad depending... Worse, it might be missing some reads & pairs entirely if the FASTQ is produced from mapped reads only (this sort of happens with FTDNA Big Y), in which case the pairs won't match even simply sorted.

I should probably try out whether GATK FastqToSam can be made to accept non-matching paired-end files; then you could queryname/readname-sort that file and use "samtools fastq" to extract the matched pairs only, and be certain that all reads have their pairs. Or you can write an ugly script to validate and drop reads which don't have pairs, if needed. Read groups should probably be added by flowcell ID, though I'm not sure there's any more convenient way to do that than to grep by flowcell ID for separate FastqToSam runs.

Donwulff
09-28-2019, 11:51 PM
Just as I was afraid, FastqToSam won't accept mismatched FASTQ files, and getting around it is almost hard enough to be worth just writing a separate utility to do it all ;)



# Make our mismatched FASTQ test data
echo -e "@read1\nCATG\n+\n####\[email protected]\nCATG\n+\n####\[email protected] d3\nCATG\n+\n####" > pair1
echo -e "@read2\nCATG\n+\n####\[email protected]\nCATG\n+\n####\[email protected] d2\nCATG\n+\n####" > pair2

# Make both FASTQ into separate SAM files; if you give /dev/stdout it will write bam which can't be easily edited...
gatk-4.1.3.0/gatk FastqToSam -F1=pair1 -O=read1.sam -RG=pair -SM=test -PL=ILLUMINA
gatk-4.1.3.0/gatk FastqToSam -F1=pair2 -O=read2.sam -RG=pair -SM=test -PL=ILLUMINA

# Edit the sam files to set correct paired-end flags on each file; we have to pass @ headers as-is
gawk '{ if(substr($1,1,1)!="@") { OFS="\t"; $2=77; } print }' read1.sam > read1-pair.sam
gawk '{ if(substr($1,1,1)!="@") { OFS="\t"; $2=141; } print }' read2.sam > read2-pair.sam

# Combine edited sam files holding opposite pairs into single file, collating identical read-groups
# You can use "samtools merge -cuf /dev/stdout ..." to skip compression & IO and pipe it to samtools collate;
# this is meant just as an illustration. Note the bam is unmapped and can't be used for analysis directly.
samtools merge -@ `nproc` -c merge.bam read1-pair.sam read2-pair.sam

# Group results by queryname/readname and output pairs only
samtools collate merge.bam -O | samtools fastq -s /dev/null -

# Output; use -1 and -2 to get separate files instead of interleaved
[M::bam2fq_mainloop] discarded 1 singletons
[M::bam2fq_mainloop] processed 6 reads
@read3/1
CATG
+
####
@read3/2
CATG
+
####
@read2/1
CATG
+
####
@read2/2
CATG
+
####

Edit: Current versions of samtools require "-" as the input file name to accept piped input; I keep forgetting that, so I added it to the samtools fastq example.

Donwulff
01-10-2020, 07:34 AM
On the Git repository https://github.com/Donwulff/bio-tools, the "bleeding edge reference genome" generation script https://github.com/Donwulff/bio-tools/blob/master/mapping/GRCh38_bwa_index.sh had its last change, on Sept. 14th 2019, break the naming of the hs38d1 decoy sequences in the generated BWA alt file. The repository contains a correct copy of the alt file in question to avoid the script having to re-calculate it, which is why I didn't notice the code broke. So it affects results only if these files were re-generated. Since these are decoy sequences (ie. they aren't analyzed in bioinformatics), this shouldn't have a major impact on results even if the file was re-generated, but just a heads-up.

I obviously fixed that error, and will try to test the script from an empty directory (ie. re-generating everything) from now on, in case anybody other than me is using it. It's experimental and subject to change anyway, but I'm noticing a few people discussing doing similar things themselves.

In addition I've changed the script to pull the version number of the Human Leukocyte Antigen database from the website at the same time it pulls the database, and incorporate that into the reference genome version. https://www.ebi.ac.uk/ipd/imgt/hla/ was last updated to 3.38.0 in October last year (though it only became available for download in November), which is the reason for this update. It's also possible to set the decoy & oral microbiome database versions to an empty string, which will exclude them from the reference (useful, because the oral microbiome increases memory usage for alignment significantly).

I also *tried* to clarify the comments in the script on why things are done the way they are (it's good to look at your comments once in a while and see if they still make sense even when you no longer remember what the heck you were doing at the time ;).

GitHub "recommends" that repositories are kept at under 1GB each, and I don't like the idea of download of the repo being multiple gigabytes every time, which precludes including pre-processed Human Oral Microbiome Database and pre-computed whole genome reference in the repo, although perhaps I could put a separate repo containing just the pre-computed reference genome if there's interest in downloading it.

tontsa
01-10-2020, 08:09 AM
Good timing! :) I was about to ask about the .alt file as I started playing with your script yesterday. I'll get back to you once I've gotten whole pipeline ran with some preliminary variant calling too..

Donwulff
01-10-2020, 08:29 AM
I should probably use GNU Make to manage the dependencies & prerequisites in the reference genome build script. It now avoids unnecessary downloads and calculation if additional_hg38_contigs.alt already exists, but that file depends on the HLA database version, which is dynamically downloaded (because for some reason IMGT/HLA doesn't provide a link to specific versions, just the newest one).

That said, several open questions remain about how best to do things, like the naming of the HLA alleles (* and : in names tend to break downstream processing and aren't stable, so I say nay), where to use UCSC and where NCBI identifiers, whether to soft-mask sequences or not, and how best to map the alt contigs against the primary assembly (Maximal Exact Match is not guaranteed to find all matches, but it looks better than Smith-Waterman; I can't reproduce the exact same results as Heng Li, though), for starters. Therefore people will probably want to tweak the script & results to whatever they expect anyway.

tontsa
01-10-2020, 09:30 AM
At least GATK UnifiedGenotyper and HaplotypeCaller don't have issues with HLA-A*01:01:01:01:1 style contigs.. you just have to put + after the contig if you create an interval file. Outside of pure research, what the average person can do with those contigs is beyond me, though.

Donwulff
01-10-2020, 11:51 AM
I was just thinking of bringing back the HLA-A*01:01:01:01:1 style contig names as an option, but there's a major issue: like Y-chromosomal haplogroup names, when a new branch is added upstream in the HLA tree, all the allele-tree names downstream change, and therefore they will tend to mean different things after a while. And to answer what they're good for: people are doing HLA typing from them, although you can't tell the HLA type directly from the alignments. Possibly I should add the HLA gene name to the contig names; that way you could more easily separate out HLA-A, for example. Right now you need to check the HLA database for mappings like "HLA:HLA12864 DPB1*415:01 11510 bp". Should that be HLA_DPB1_12864 or DPB1_12864, for example? Although for that use, one also has to extract the reads that align to the HLA regions in the primary assembly, so it will take some trying in any event.
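A sketch of that read extraction step (the MHC coordinates are approximate, file names are placeholders, and this assumes the sanitized contig names - with the * and : names the region strings would be misparsed, which is the whole point):

# Reads mapping to the MHC region on the primary assembly (roughly chr6:28.5-33.4 Mb on GRCh38)
samtools view -b aligned.hs38DH.bam chr6:28500000-33400000 > hla_region.bam
# Plus reads assigned to the HLA contigs themselves, collected from the BAM header
samtools view -H aligned.hs38DH.bam | awk -F'\t' '/^@SQ/ && $2 ~ /HLA/ { sub("SN:","",$2); print $2 }' > hla_contigs.txt
samtools view -b aligned.hs38DH.bam $(cat hla_contigs.txt) > hla_contigs.bam
samtools merge hla_reads.bam hla_region.bam hla_contigs.bam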

Other than that, they theoretically serve as additional decoy sequences, since HLA alleles may have diverged significantly from the canonical reference and therefore cause false positives elsewhere (hm, so many things one could test, like cutting the HLA database into 100bp or 150bp fragments and seeing where they map on the primary assembly). Also the postalt script in bwakit lifts reads that map only to the alt/fix contigs back into the matching part of the primary assembly, usually rescuing some false negatives that are caused by divergence from the canonical reference.

Which is incidentally what the alt-file alignment is used for, and that is... tricky. I should probably just mail Heng Li to ask what exact parameters he used to generate the published alt file, though it seems he hasn't answered that in public posts. There is potentially an opportunity for research there, rather than just using "magic numbers". Otherwise, using algorithms similar to what the actual final alignment will use (hence bwa mem) seems to make the most sense, so as not to introduce extra bias. But there's lots of opportunity for work and study in the alt alignment.

I think one of the issues is that GRCh39 is expected to turn it all around with graph-based reference genomes, where alt contigs are no longer alt contigs, but I'm not sure the graph-based genomes are actually getting anywhere. Some new aligners are instead starting to build their own graphs and essentially do assembly on the fly; this seems like it will always be slower than aligning to a static reference, but with continuous improvements to sequencing, like long-read sequencing and higher read depth, it may start to seem reasonable to have NO reference bias. The "downside" is that genomic coordinates will stop mattering; everything has to be defined by context (flanking sequences as in dbSNP, or HGVS identifiers like protein change counted from the translation initiation site, where a gene can be identified). In general that would actually be an upside, given the genome build and coordinate confusion, though ;)

Donwulff
01-10-2020, 11:55 AM
At least GATK UnifiedGenotyper and HaplotypeCaller don't have issues with HLA-A*01:01:01:01:1 style contigs.. you just have to put + after the contig if you create an interval file. Outside of pure research, what the average person can do with those contigs is beyond me, though.

I should see if I still have the notes, but basically if you try to give GATK tools a range of contigs to process, it will interpret the asterisk as a genuine wildcard, and there's no way around it without changing the GATK source code. Which they should fix, because * and : are legal in contig names under the SAM file spec. But as noted, the tree-style names are untenable if you intend to keep your database/contigs up to date (although for reproducibility and statistics you might want to freeze the HLA database version too, so all options are possible).

tontsa
01-10-2020, 12:13 PM
Heng Li has at least commented in https://github.com/lh3/bwa/issues/231 that "There was no minimap2 when this GRCh38 .alt file was generated. The alignments there were provided by GRC with some manual curation as I remember. I would use minimap2 to generate .alt files these days.". So I'm guessing there was some manual editing back and forth there.

Donwulff
01-10-2020, 01:00 PM
Thanks, that's a good find that I somehow didn't run into while trying to study this. I guess I should look at the differences again, although this is one of those things where one can spend days and then forget everything because it's not written down anywhere...

One thing I particularly want to add to that is that bwa, for example, does nothing with the alignments in the .alt file. For bwa, it simply prefers the contigs that aren't listed in the .alt file, with the intention that the alignments to the non-alt contigs should be nearly identical to the results had the alt contigs not been there at all. It's the postalt JavaScript/ECMAScript script from bwakit, which is a bit of a mess, that does the read re-assignment/duplication when needed.

I was pondering a bit where to best put all sorts of notes, maybe I should be posting Jupyter fragments with the notes & experiments for the alt-processing etc.

Donwulff
01-10-2020, 03:20 PM
I also wonder if I should move this to the Open-Source Projects forum, especially as this now has nothing to do with Dante Labs.

National Center for Biotechnology Information Analysis Set has...
GCA_000001405.15_GRCh38_full_analysis_set.fna.alt which I've copied at https://github.com/Donwulff/bio-tools/blob/master/mapping/GCA_000001405.15_GRCh38_full_analysis_set.fna.alt so it isn't necessary to download the whole BWA index for it.

Funny thing: if you look at that, it's ACTUALLY a SAM file with headers. The mapping quality of every row is 255 and only flag 16 (read reverse strand, i.e. the sequence is complementary to the reference) is used. The sequence itself is masked out, making the file smaller. Only the NM attribute, edit distance (number of changes from the reference), is used.

If you get the bwakit resource kit from https://sourceforge.net/projects/bio-bwa/files/bwakit/ (unfortunately I think there's no revision history for these files, which might show how they were created), the hs38DH.fa.alt has the same beginning as the Analysis Set reference, but then come the decoys (read unmapped flag, no CIGAR - remember decoys are not supposed to overlap with the reference, they're just a sink for the "trash" reads, preventing them from being forced to align against the primary assembly, but they're in the .alt file so the aligner will prefer the primary assembly if they happen to align against it). Then there are the HLA contigs with MapQ of 60, including a few supplementary reverse alignments. The attributes AS = Alignment Score and XS = Suboptimal Alignment Score are used in addition to NM, with an occasional XA for Alternative Alignments.

Row "HLA-DQB1*06:02:01 16 chr6 32660031 60 7102M * 0 0 * * NM:i:0 AS:i:14204 XS:i:6462" shows the match score for new contigs is 2 per nucleotide. "HLA-DQB1*06:03:01 16 chr6 32660031 60 2633M1I4469M * 0 0 * * NM:i:7 AS:i:14169 XS:i:6437" has the same 7102M length. The edit distance is 7, suggesting 6 substitutions plus the one insertion we see in the CIGAR. (7102-6)*2 = 14192, which is 23 over the alignment score; 23 = 6*3 + 5, which would be a penalty of 3 per substitution plus 5 for gap open + extension.

"HLA-DQB1*06:09:01 16 chr6 32660031 60 2682M1I1060M1D3359M * 0 0 * * NM:i:42 AS:i:13992 XS:i:6487" has a 1bp insertion and a 1bp deletion in the CIGAR, so edit distance 42 leaves 40 substitutions. With 7101 matched bases, (7101-40)*2 - 40*3 - 2*5 = 13992, which checks out.

Okay, "HLA-DRB1*15:01:01:02 16 chr6 32578769 60 5249M2D6322M * 0 0 * * NM:i:2 AS:i:23136 XS:i:498" has one 2bp deletion and matched length 11571, times two gives 23142 points. That leaves 6 points, which means gap extension is 1 point and gap open is 4. Let's see if I can now reproduce this...

bwa mem GCA_000001405.15_GRCh38_no_alt_analysis_set.fna hs38DH-extra.fa -A2 -B3 -O4 -E1 | tee extra-test.sam
samtools view -q60 extra-test.sam | gawk '{ OFS="\t"; $10 = "*"; print }' > extra-test.alt
dwdiff -c hs38DH.fa.alt extra-test.alt | less -RS

And we're now VERY close. Writing thoughts down to work on them sort of works. There are some extra supplementary alignments and a few other things I need to tweak; there may be a minimum score cutoff. Also it wasn't at all as straightforward as this post implies - I had to try this with several different parameters while I was trying to rush through it. It's also possible the Analysis Set is using different parameters; it's harder to tell because the alignment score isn't shown there, and it's difficult to tell which one is better anyway.

I'll also note the alignment parameters aren't as important as they might seem, because it's the CIGAR that's used to determine the lift-over to the primary assembly, but the alignment parameters do determine what gets lifted. The intractg preset I was using was really conservative compared to this, meaning less stuff gets lifted over.
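For anyone who wants to repeat the check, here's a rough gawk sanity check of that scoring model (match 2, mismatch 3, gap open 4, gap extend 1 per base; soft and hard clips score nothing) against the HLA rows of the .alt file - a sketch, not a validated tool:

gawk '$1 !~ /^@/ && $6 != "*" {
  cigar=$6; nm=0; asval=0;
  for(i=12;i<=NF;i++){ if($i ~ /^NM:i:/) nm=substr($i,6)+0; if($i ~ /^AS:i:/) asval=substr($i,6)+0 }
  m=0; gapbases=0; gapcost=0;
  # walk the CIGAR: count matched bases and charge 4 + length per indel
  while(match(cigar, /[0-9]+[MIDSH]/)) {
    len=substr(cigar,RSTART,RLENGTH-1)+0; op=substr(cigar,RSTART+RLENGTH-1,1);
    if(op=="M") m+=len;
    if(op=="I" || op=="D"){ gapbases+=len; gapcost+=4+len }
    cigar=substr(cigar,RSTART+RLENGTH)
  }
  mm=nm-gapbases;  # substitutions = edit distance minus inserted/deleted bases
  pred=2*(m-mm)-3*mm-gapcost;
  print $1, "AS="asval, "predicted="pred
}' hs38DH.fa.alt | grep '^HLA-'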

Donwulff
01-10-2020, 05:08 PM
Bit ugly shell exploration:


egrep "[[:space:]](2048|2064)[[:space:]]" hs38DH.fa.alt | cut -f 1-5,7-14
chr9_GL383539v1_alt 2048 chr9 7435082 60 * 0 0 * * NM:i:65 AS:i:14614 XS:i:307
chr9_GL383540v1_alt 2048 chr9 69426557 60 * 0 0 * * NM:i:102 AS:i:13613 XS:i:0
chr17_GL383566v1_alt 2048 chr17 77225258 60 * 0 0 * * NM:i:23 AS:i:11794 XS:i:110
HLA-DRB1*07:01:01:01 2064 chr6 32587382 60 * 0 0 * * NM:i:285 AS:i:4557 XS:i:4330
HLA-DRB1*07:01:01:02 2064 chr6 32587382 60 * 0 0 * * NM:i:283 AS:i:4567 XS:i:4340
HLA-DRB1*09:21 2064 chr6 32587382 60 * 0 0 * * NM:i:279 AS:i:4549 XS:i:4326

egrep "[[:space:]](2048|2064)[[:space:]]" extra-test.alt | grep -v "decoy" | gawk '{ if (int(substr($14,6))>=4000) print; }' | cut -f 1-5,7-12,14-15
HLA-DQB1*02:01:01 2064 chr6 32663898 60 * 0 0 * * NM:i:327 AS:i:4903 XS:i:3762
HLA-DQB1*02:02:01 2064 chr6 32663898 60 * 0 0 * * NM:i:326 AS:i:4890 XS:i:3750
HLA-DRB1*07:01:01:01 2064 chr6 32587382 60 * 0 0 * * NM:i:285 AS:i:4557 XS:i:4330
HLA-DRB1*07:01:01:01 2064 chr8 120886094 60 * 0 0 * * NM:i:180 AS:i:4068 XS:i:2981
HLA-DRB1*07:01:01:02 2064 chr6 32587382 60 * 0 0 * * NM:i:283 AS:i:4567 XS:i:4340
HLA-DRB1*07:01:01:02 2064 chr8 120886094 60 * 0 0 * * NM:i:180 AS:i:4068 XS:i:2981
HLA-DRB1*09:21 2064 chr6 32587382 60 * 0 0 * * NM:i:279 AS:i:4549 XS:i:4326


The alt and decoy sequences are not part of this test. Anyway, does that mean there's a filter which drops supplementary alignments with an alignment score less than 4549 or an edit distance over 285? Or are these hand-curated? Or did a different version of BWA just not list those supplementary alignments? Primary alignments seem to be listed regardless of the alignment score & edit distance, and there's even this one alignment in the original .alt-file with MapQ 30 (I filter to MapQ 60 because that seems to be the rule otherwise) and different results from the new run. This one is also soft-clipped while all the others are hard-clipped. There's also a 346 basepair long deletion in the CIGAR; I can somewhat see this having been manually curated in, maybe due to some spotted discrepancy in the alignment.

Old:
HLA-B*15:01:01:02N 16 chr6 31355091 30 61S133M93D277M135D3M346D6M83D5M3D2M7D258M245D138M1I2M1D146M10D91M1I84M * 0 0 * * NM:i:968 AS:i:1106

New:
HLA-B*15:01:01:02N 2064 chr6 31355091 43 61H133M93D277M737H * 0 0 * * NM:i:105 AS:i:663 XS:i:469

Looking at bwa-postalt script, it seems to handle soft and hard clipping, and reverse flag, but doesn't do anything with headers, other flags or mapping quality.

aaronbee2010
01-15-2020, 02:59 AM
I aligned one of the .p13 sequences (KQ031383.1) to the chr1 reference and I got this:


chr1_KQ031383v1_fix 2048 chr1 12818488 60 135897M331246H * 0 0 * * NM:i:0
chr1_KQ031383v1_fix 0 chr1 13004385 60 158724S308419M * 0 0 * * NM:i:0
chr1_KQ031383v1_fix 2064 chr1 13171885 60 255737H16M3I5M1D8M2I20M1I8M1D62M1D3M1I1M2D3M1I14M2I2M2D16M7I76M1D2M1I194M5D971M6I6653M1D1777M1D862M10I504M3D7765M2D28M2D36M1I9M1D10M1D2M1D2M1D2M1D2M1D2M1D29M2D775M4D2506M1D1208M1D7843M10D1M2D48M3D53M9I9M1I10M2D49M1D7M1I14537M1I1047M1I589M1I2960M1I5060M1D198M1I10285M1I1650M10D15907M5I6293M1I4744M3D2267M1D3758M1I3103M4I4007M3I122M2D6034M4I44M24I121M6D4260M18D538M1D3062M9D558M2D2462M1I2237M12D1697M1I6578M1D2457M1D354M1I400M2D1918M4D16M4I2893M1I241M2I3191M5D2033M1D1691M1I160M1D1695M1D937M21D504M3D1048M1I203M55848H * 0 0 * * NM:i:1056


Is it normal to have three alignments for the same fix or not? I did notice that a few alt contigs present in hs38DH.alt have more than one line in that .alt index, so I'm guessing this is fine, maybe? In case you're wondering, this was my command line:

minimap2 -a -t 6 chr1.fa chr1_KQ031383v1_fix.fa > chr1_KQ031383v1_fix.sam

Also, the last line is a reverse-strand alignment. Is that also fine?

Any feedback would be appreciated!

Donwulff
01-19-2020, 09:44 AM
I would say, generally no reason to doubt those, stuff happens. DNA gets cut & pasted around all the time.
However, in this case the two first lines have especially good explanation: It's a fix patch which is correcting that particular location. From the CIGAR string we can see the second match skips the first 159k nucleotides and matches the next 308k nucleotides. The first line is a supplementary alignment which matches the first 136k nucleotides and then skips (hardclip because it's supplementary) 331k nucleotides. The patch sequence description says "This scaffold closes the gap between GRCh38 Chromosome 1 components AC245034.2 and AC245056.3 with the addition of AC256219.1 (NC_000001.11:12,954,385-13,004,384)" and those matches are just the anchor sequences.
The reverse strand match may be part of why this section was missed in the original reference genome, because many reads would also have matched the duplicate inversion of the genome sequence.

Of course, this is also why it's very important to use an alt-aware mapper with these, because you want the reads to align preferentially to the main assembly, not to the anchor sequences at the ends of the fix-patch. In fact the fix-patches are a bit of a debatable addition to the reference, because they're not even "alternative" and should be incorporated into the reference at the location indicated by the anchor sequence, but I use them decoy-like. If this sequence is not included in the mapping, reads from that location in the genome would likely map incorrectly to the duplicate inversion instead, leading to false positives in that region. (Although variant callers and filters tend to notice there are more than two haplotypes there, which can also lead to false negatives.)

Though I should test how that region REALLY behaves, because the alt-handling, which makes reads map preferentially to the primary assembly, means it may not prevent most of the reads from mapping to the wrong place. Ultimately it might make more sense to handle the fix-patches more like decoy sequences, removing all exact matches to the primary assembly from them. But right now there's no "perfect" solution other than to wait for GRCh39, where they're incorporated in the reference and the wrong sequences removed.

aaronbee2010
01-23-2020, 08:55 PM
Of course, this is also why it's very important to use an alt-aware mapper with these, because you want the reads to align preferentially to the main assembly, not to the anchor sequences at the ends of the fix-patch. In fact the fix-patches are a bit of a debatable addition to the reference, because they're not even "alternative" and should be incorporated into the reference at the location indicated by the anchor sequence, but I use them decoy-like. If this sequence is not included in the mapping, reads from that location in the genome would likely map incorrectly to the duplicate inversion instead, leading to false positives in that region. (Although variant callers and filters tend to notice there are more than two haplotypes there, which can also lead to false negatives.)

Of course. I'm only using minimap to generate an alt index for BWA, as the patches are essentially long-read sequences. Mapping .fq files to a whole reference genome would be done with BWA-MEM (alt-aware thanks to the alt index file), followed by the best practices pipeline from GATK.


Though I should test how that region REALLY behaves, because the alt-handling, which makes reads map preferentially to the primary assembly, means it may not prevent most of the reads from mapping to the wrong place. Ultimately it might make more sense to handle the fix-patches more like decoy sequences, removing all exact matches to the primary assembly from them. But right now there's no "perfect" solution other than to wait for GRCh39, where they're incorporated in the reference and the wrong sequences removed.

Post-alt processing from the bwa-postalt.js file in bwa-kit would assign reads that map well to both the primary assembly and the alternate contigs to both (as opposed to only one), so this shouldn't be an issue.

I thought that the main issue with decoys is that they're genomic sequences that can't be mapped to a specific location on the human reference genome (like problematic regions such as centromeres), as opposed to fix-patches, so I'm not sure if treating them like decoys is a good idea.

Speaking of decoys, would you recommend compiling a list of oral microbiome genomes as a sink, like the EBV genome? By this I mean, incorporating the sequences into the .fa file, but not the .fa.alt index. I'm guessing this would help speed up alignments as oral microbiome reads which don't belong on the human reference would just get quickly absorbed by the oral microbiome references in the .fa file instead of the aligner spending time trying to assign them to a place on the human reference where they obviously do not belong.

In addition to the various species present (e.g. Streptococcus salivarius), there also appear to be multiple different strains of each species, so I'm not sure whether I should add all strains from each species or if just one from each (e.g. the most common strain or the strain with the longest genome) would be sufficient?

pmokeefe
01-24-2020, 02:51 PM
I just tried a very simple experiment. I found the most duplicated read in a FASTQ file (one of the old Dante Labs 100bp BGI tests).
I searched for that here: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&BLAST_SPEC=MicrobialGenomes
It turned out to be an exact match for the bacteria Sanguibacteroides justesenii (https://en.wikipedia.org/wiki/Sanguibacteroides). I couldn't find much specifically about that particular strain, but it is related to common environment bacteria, some of which are found in the human gut.

I just checked a family member's test - same type of test, same bacterial sequence present. I haven't actually seen this person in decades, so that's some evidence it may be common.

One question that comes to mind: is there a possibility this bacteria was not present in our saliva, but was introduced subsequently?

How about you, what is your most common read?

aaronbee2010
01-24-2020, 08:33 PM
I just tried a very simple experiment. I found the most duplicated read in a FASTQ file (one of the old Dante Labs 100bp BGI tests).
I searched for that here: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&BLAST_SPEC=MicrobialGenomes
It turned out to be an exact match for the bacteria Sanguibacteroides justesenii (https://en.wikipedia.org/wiki/Sanguibacteroides). I couldn't find much specifically about that particular strain, but it is related to common environment bacteria, some of which are found in the human gut.

I just checked a family member's test - same type of test, same bacterial sequence present. I haven't actually seen this person in decades, so that's some evidence it may be common.

One question that comes to mind: is there a possibility this bacteria was not present in our saliva, but was introduced subsequently?

How about you, what is your most common read?

I haven't used that tool, but I used CosmosID and this is what I got for my top 10 bacterial sequences from my DL .bam, sorted by relative abundance:

https://i.gyazo.com/d6039c06002824d13555c150a7a64f41.png

tontsa
01-30-2020, 06:18 AM
Aaron, did you already reach some conclusion how to process those new GRCh38 patches so you end up with usable .alt file for bwa-mem?

I tried Donwulff's GRCh38_bwa_index.sh but it ends up with 99844 different contigs in the .bam making it pretty impossible to process.. or should I just filter out all those non primary assembly contig reads before running variant calling?

aaronbee2010
01-30-2020, 02:41 PM
Aaron, did you already reach some conclusion how to process those new GRCh38 patches so you end up with usable .alt file for bwa-mem?

I tried Donwulff's GRCh38_bwa_index.sh but it ends up with 99844 different contigs in the .bam making it pretty impossible to process.. or should I just filter out all those non primary assembly contig reads before running variant calling?

I've been busy with exams so far. I've just had the last exam for this semester on Monday, so I'm getting back to it now.

The problem with mapping all the patches (in one file) to the entire primary assembly (also in one file) is that some patches will align to chromosomes in the primary assembly that they're not supposed to, so that can inflate the number of total alignments in the alt file, although I'm not sure if I ever approached 99844 alignments.

What I've done is separate the patches into chromosomes and then align them to the corresponding chromosome in the primary assembly. I just use minimap2 for this, which seems to work fairly well, e.g. "minimap2 -a chr12_primary.fa chr12_patches.fa > chr12_patches.sam", although I got a few slight discrepancies between minimap2 and BLAST. I'm not sure if there are better parameters for minimap2 for this sort of thing. Would Oxford Nanopore settings be better?
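For what it's worth, minimap2 also has assembly-to-reference presets (asm5/asm10/asm20, for increasingly diverged contigs) that are intended for aligning things like patch scaffolds rather than reads, so something along these lines might be worth trying (an untested sketch):

minimap2 -a -x asm5 -t 6 chr12_primary.fa chr12_patches.fa > chr12_patches.sam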

There are still alignments from one fix patch to the correct chromosome, but the wrong position, so I use a list of GRCh38.p11 patches that specifies their target region. I can't find a corresponding list for p13, so for the patches not in p11, I just verify those coordinates via BLAST.

I also have patches that have multiple alignments to the same chromosome at the same position, so I'm guessing it's best to use the alignment at that position with the lowest edit distance, and delete the rest?

Thanks for your help.

Also, as far as variant calling goes, I would personally just focus on variants within the primary assembly, unplaced/unlocalised contigs, and if present, the ALT and HLA contigs. I don't think there's any use for calling variants in the patch regions. It may be interesting to do so, but I don't think anybody will have any use for those regions until GRCh39 comes out, in which case, it would just be best to just align to that instead. I think the best purpose of aligning with patches before the release of GRCh39 would be to improve alignment to the primary assembly, but that's just my opinion.

aaronbee2010
02-05-2020, 01:17 PM
Since the fixes are significantly larger than the reference regions they apply to, would an RNA-seq aligner be a better choice (pardon my ignorance) for producing the .alt index, as opposed to a DNA-seq aligner like minimap2? I used Magic-BLAST to align the fix patches and they come up as one alignment in the .sam file as opposed to several alignments.

Here are the .alt files I produced from the .sam files for each aligner (for the chrY_KZ208923v1_fix patch):

minimap2:

@SQ SN:chrY LN:57227415
chrY_KZ208923v1_fix 2048 chrY 9034110 60 12805M35565H * 0 0 * * NM:i:0
chrY_KZ208923v1_fix 2048 chrY 9055175 60 15991H167M1D2266M29946H * 0 0 * * NM:i:2
chrY_KZ208923v1_fix 2048 chrY 9107609 60 23112H4260M20998H * 0 0 * * NM:i:0
chrY_KZ208923v1_fix 2048 chrY 9112716 60 28489H1430M1D173M18278H * 0 0 * * NM:i:3
chrY_KZ208923v1_fix 0 chrY 9116372 60 32025S8966M1D109M1D7270M * 0 0 * * NM:i:2

Magic-BLAST:

@SQ SN:chrY LN:57227415
chrY_KZ208923v1_fix 0 chrY 9034110 255 12806M3184I8258D170M1D2264M4688I50000D4261M1116I846D1442M1D161M1933I2052D8979M1D112M1D7254M * 0 0 * * NM:i:72084

As the patch is being mapped to a reference with gaps represented by N's, and also contains more bases than its target sequence on the chrY reference genome, there are insertions followed by splicing:

https://i.gyazo.com/5e3b15af8de764a32164bc7bcc256b41.png

The gap between two sequences on the reference genome represented by N's on the reference are treated like introns in a gene and are classed as a splice by the aligner. The additional bases present on the patch that aren't present on the reference are treated as insertions. Carrying this analogy forward, the regions of the references with bases would be treated like exons in a gene.

My only concern (albeit probably a minor one) is the presence of 1 base that protrudes into the N region before and after the splicing/insertion. I don't think this would have a particularly adverse effect on alignment, but was something I noticed straightaway and thought it would nonetheless be worth mentioning.

newuser
02-15-2020, 07:18 PM
I have some issues creating my bams from my fastq files.
The BAMs that I have created are a lot smaller than the BAM received from Dante, with roughly 1/10 of the size and of the read1 and read2 counts:

hg19 Dante bam:
119.9GB total size
Running samtools flagstat on the bam results:
1425875935 + 0 in total (QC-passed reads + QC-failed reads)
2469421 + 0 secondary
0 + 0 supplementary
110086983 + 0 duplicates
1158281787 + 0 mapped (81.23% : N/A)
1423406514 + 0 paired in sequencing
711703257 + 0 read1
711703257 + 0 read2
1117928144 + 0 properly paired (78.54% : N/A)
1152533864 + 0 with itself and mate mapped
3278502 + 0 singletons (0.23% : N/A)
23105190 + 0 with mate mapped to a different chr
18562690 + 0 with mate mapped to a different chr (mapQ>=5)

hg19 created by myself:
Created with bwamem, samtools:
Steps I took:
*Downloaded ucsc.hg19.fasta.gz (a good file, used successfully to create my gVCF with GATK from Dante's BAM)
*Indexed the ucsc.hg19.fasta with bwa command:
bwa index ucsc.hg19.fasta
*Map fastq files with bwa mem pipe trough samtools sort to get my bam, command used:
bwa mem -M -t 4 ucsc.hg19.fasta.gz kitid_SA_L001_R1_001.fastq.gz kitid_SA_L001_R2_001.fastq.gz | samtools sort -@4 -o outputhg19.bam
14.4 GB total size
Running samtools flagstat on the bam results:
158295167 + 0 in total (QC-passed reads + QC-failed reads)
295167 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
123137949 + 0 mapped (77.79% : N/A)
158000000 + 0 paired in sequencing
79000000 + 0 read1
79000000 + 0 read2
118426766 + 0 properly paired (74.95% : N/A)
122456940 + 0 with itself and mate mapped
385842 + 0 singletons (0.24% : N/A)
2614094 + 0 with mate mapped to a different chr
2068901 + 0 with mate mapped to a different chr (mapQ>=5)

GRCh38.p13 created by myself:
Created similar approach as the hg19
14.5 GB total size
Running samtools flagstat on the bam results:
158094033 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
294033 + 0 supplementary
0 + 0 duplicates
123102913 + 0 mapped (77.87% : N/A)
157800000 + 0 paired in sequencing
78900000 + 0 read1
78900000 + 0 read2
118502764 + 0 properly paired (75.10% : N/A)
122438990 + 0 with itself and mate mapped
369890 + 0 singletons (0.23% : N/A)
2593204 + 0 with mate mapped to a different chr
1949976 + 0 with mate mapped to a different chr (mapQ>=5)


Downstream of my generated BAMs I find roughly 1/10 of the variants when annotating the VCF with dbSNP, compared with the result I got from Dante's BAM.
The variants that exist in all files match by base and genotype, but I find 10 times fewer in the files created from my self-created BAMs compared to the BAM from Dante.

Running fastp on my fastq files results:

Read1 before filtering:
total reads: 711703257
total bases: 71170325700
Q20 bases: 69152402634(97.1647%)
Q30 bases: 62991550198(88.5082%)

Read2 before filtering:
total reads: 711703257
total bases: 71170325700
Q20 bases: 67804343770(95.2705%)
Q30 bases: 60583369636(85.1245%)

Read1 after filtering:
total reads: 708551965
total bases: 70843453743
Q20 bases: 68843418213(97.1768%)
Q30 bases: 62718803973(88.5315%)

Read2 aftering filtering:
total reads: 708551965
total bases: 70843453743
Q20 bases: 67658943065(95.5049%)
Q30 bases: 60505206601(85.4069%)

Filtering result:
reads passed filter: 1417103930
reads failed due to low quality: 5091752
reads failed due to too many N: 1210832
reads failed due to too short: 0
reads with adapter trimmed: 1159894
bases trimmed due to adapters: 23578296


Is something completely wrong with the approach I used to generate the BAMs, or are the FASTQ files I received at fault for my poor results?

tontsa
02-15-2020, 07:36 PM
newuser: did you use fastp's processed fastqs (-o/-O flag) or did you just use fastp for statistics? You might get better results with the filtered fastqs. Also on hg19 I don't think you need/want -M flag for bwa mem. For "fun" try bwa mem -t 16 -Y -v 3 -K 100000000 -R '@RG\tID:LANENAME\tPL:ILLUMINA\tPU:xTen\tLB:SOMETHING\tSM:SOMETHING\tCN:ILLUMINA' instead. That's what gatk's wgs pipelines use.

tontsa
02-15-2020, 07:44 PM
Also another point when it comes to Dante's BAMs.. their in-house processed ones use hs37d5 instead of straight hg19. It differs from hg19:
- Integrated reference sequence from the GRCh37 primary assembly (chromosomal plus unlocalized and unplaced contigs)
- The rCRS mitochondrial sequence (AC:NC_012920)
- Human herpesvirus 4 type 1 (AC:NC_007605)
- Concatenated decoy sequences (hs37d5cs.fa.gz)

newuser
02-15-2020, 09:58 PM
newuser: did you use fastp's processed fastqs (-o/-O flag) or did you just use fastp for statistics? You might get better results with the filtered fastqs. Also on hg19 I don't think you need/want -M flag for bwa mem. For "fun" try bwa mem -t 16 -Y -v 3 -K 100000000 -R '@RG\tID:LANENAME\tPL:ILLUMINA\tPU:xTen\tLB:SOMETHING\tSM:SOMETHING\tCN:ILLUMINA' instead. That's what gatk's wgs pipelines use.

Used fastp only for statistics.
Used the fastq files provided by dante in bwa mem without editing.
It is a BGI sequence.
I'm downloading ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz now to see if it gives a better result.

newuser
02-16-2020, 06:35 PM
Using hs37d5 was not a success.
I think I have found the issue.
Looking at the stdout generated by bwa mem, I see this as the last line:
'[mem_sam_pe] paired reads have different names: "CL100103014L2C001R002_51", "CL100103014L2C001R002_23"'
Will try fixing the fastq files for the next attempt.

Donwulff
02-16-2020, 10:28 PM
The problem with mapping all the patches (in one file) to the entire primary assembly (also in one file) is that some patches will align to chromosomes in the primary assembly that they're not supposed to, so that can inflate the number of total alignments in the alt file, although I'm not sure if I ever approached 99844 alignments.

As the original post reported, the BAM has 99844 *contigs*, not 99844 *alignments*. This is because it includes the decoy sequences as well as, experimentally, the Human Oral Microbiome sequences without significant (101 bases, following the hs38d1 decoy sequence methods) overlap with the human assembly. Indeed, as you note, the correct approach is to consult your variant caller's documentation on how to tell it to process only specified contigs, whether on the command line or in a separate file. There's probably no reason ever to run variant calibration or call variants on the decoy sequences (although I've done that for "fun", so it's doable, but it indeed gets slow).
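As a concrete sketch (the contig names here assume the GRCh38 analysis-set naming, and ref.fa/sample.bam are placeholders; adjust to your reference), restricting calling to the primary chromosomes looks something like this:

seq 1 22 | sed 's/^/chr/' > primary_contigs.list
printf 'chrX\nchrY\nchrM\n' >> primary_contigs.list
gatk HaplotypeCaller -R ref.fa -I sample.bam -L primary_contigs.list -ERC GVCF -O sample.primary.g.vcf.gz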

Thank you for the feedback on the fix-patches, although it will take me some time to digest your comments. It should be noted that the only thing that cares about the alignment of the additional contigs to the primary assembly is the bwa-postalt.js script originally from bwakit, which is not widely used, but the alt-homology mapping strictly follows that tool's convention. Since the fix-patches do not generally have GRCh38 genomic coordinates, they behave somewhat like decoy sequences. Alt-aware aligners, which are the only aligners that should be used with the alt-contigs, prefer the primary-assembly alignment over the anchor sequences of the fix-patches.

But as I've noted somewhere, a reasonable argument can be made that the fix-patches should not be included in a reference, or that they should be reduced to just the sections without direct homology to the primary assembly, to be used only as additional decoy sequences. The new reference is experimental though, and some people might have an interest in using the fix-patches as additional sequences in analysis, anchor sequences and all. Since the fix-patch regions are already mis-assembled in the reference, I'm not sure it's even possible to make the situation worse, but I have not yet had time to analyze how the inclusion of those sequences affects the quality of those regions. So as always, ensure that the results are suitable for your intended use.

It's possible to disable things like the decoy/HOMD sequences in the script by removing their version definition, but it's not (yet) possible to disable generation of the fix-patch contigs.

Donwulff
02-16-2020, 10:36 PM
Using hs37d5 was not a success.
I think i have found the issue.
Was looking at the stdout generated by bwa mem and see this as last line :
'[mem_sam_pe] paired reads have different names: "CL100103014L2C001R002_51", "CL100103014L2C001R002_23"''
Will try fixing the fastq files for next attempt.

hs37d5 has the contigs in a different order from what Dante Labs is using. Of course if you are mapping the reads yourself, the reference compatibility won't be an issue. However, what command are you using to do the mapping and what are the inputs? I did write the example script https://github.com/Donwulff/bio-tools/blob/master/mapping/revert-bam.sh mentioned in this thread/post with the intention of not having to go through it again every time though. Also, I still heartily recommend using a commercial service rather than learning to do it yourself and/or buying new servers to do it. However, there does seem to be a variety in the services offered (Namely, I'm having trouble getting Sequencing.com's service even working for myself anymore;)

Donwulff
02-16-2020, 11:33 PM
Speaking of decoys, would you recommend compiling a list of oral microbiome genomes as a sink, like the EBV genome? By this I mean, incorporating the sequences into the .fa file, but not the .fa.alt index. I'm guessing this would help speed up alignments as oral microbiome reads which don't belong on the human reference would just get quickly absorbed by the oral microbiome references in the .fa file instead of the aligner spending time trying to assign them to a place on the human reference where they obviously do not belong.

In addition to the various species present (i.e. Streptococcus salivarus), there also appear to be multiple different strains of each species, so I'm not sure whether I should add all strains from each species or if just one from each (i.e. the most common strains or the strains with the longest genomic length) would be sufficient?

Whoops, I notice most of what I wrote above was already discussed. But anyway, that's essentially what the https://github.com/Donwulff/bio-tools/raw/master/mapping/GRCh38_bwa_index.sh script does if you don't disable the HOMD generation. At least all the decoy sequences are in the BWAKIT .alt-file, and although I can't check it just now, I believe the EBV is in the default alt-file for the reference that includes EBV. I would be wary about not listing them in the .alt-file, as that would inevitably change the results of the alignment compared to the primary assembly only (the purpose of being alt-aware is that the alignment to the primary assembly should remain mostly identical).
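In rough outline, adding such sink genomes while keeping the primary-assembly alignment unchanged looks like this (a sketch with placeholder file names, not the exact commands from the script): append the extra FASTA, list every added contig in the .alt file as an unmapped decoy-style line, and rebuild the index.

cp hs38DH.fa.alt custom_ref.fa.alt
cat hs38DH.fa homd_extra.fa > custom_ref.fa
# one unmapped, CIGAR-less line per added contig, mimicking the hs38d1 decoy rows
gawk -v OFS="\t" '/^>/ { print substr($1,2),4,"*",0,0,"*","*",0,0,"*","*" }' homd_extra.fa >> custom_ref.fa.alt
bwa index custom_ref.fa
samtools faidx custom_ref.fa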

I believe including all strains is useful, and analogous to the handling of the different HLA variants. The strains are sufficiently diverged that the reads might not match (preferentially) to them otherwise. Because they contain some conserved sequences, the oral microbiome alignments aren't likely to be very useful for any direct analysis. You'd need a similar approach to the HLA processing, with assembly and realignment to the personal oral microbiome, or a graph-based/reference-free aligner for them. Standard microbiome pipelines exist too. Of course, the script is available, so I encourage people to try different options and report any useful results.

karwiso
02-17-2020, 08:02 PM
Do you have any comments about using bwa-mem2? It should be 2-3 times faster than the original bwa-mem. Have you used it for alignment?

Donwulff
02-18-2020, 12:26 AM
Do you have any comments about using bwa-mem2? It should be 2-3 times faster than the original bwa-mem. Have you used it for alignment?

I remain somewhat perplexed by people who have one or even a handful of samples to process spending so much time trying to over-optimize it. The time & effort spent even talking about it is surely more than the time possibly saved by using another aligner. On top of that, people usually seem to have monsters of machines which could do 10X the work - with a significant risk of incorrect results.

Regarding bwa-mem2, the most important thing, I think, is to heed the comment at the top of https://github.com/bwa-mem2/bwa-mem2: "WIP; not recommended for production uses at the moment". I would be interested in actually seeing comparisons of bwa-mem vs. bwa-mem2 results, to see whether they match. According to the GitHub summary, "It produces alignment identical to bwa and is ~80% faster." Of course, if you still have to do the slower alignment to compare, you haven't actually won anything, besides gaining some knowledge.

However, based on https://anthrogenica.com/showthread.php?12075-Dante-Labs-(WGS)&p=622998&viewfull=1#post622998 by tontsa "I managed to index hg38 with bwa-mem2 by using 320 gigs of swap space.. it's really really memory hungry. Haven't tried aligning yet though..". 320 GiB seems pretty limiting to me, plus given that the reference genome I'm working with currently (with the eHOMD decoys etc.) is twice as large, it would likely require 640 GiB or more to index. In the comparisons on the GitHub page they're using (mostly) a server with 56 processor threads and 100 GiB memory. This is still way beyond the 32 GiB max usually available on home computers.

Oddly enough, the DRAGEN reference seems to have disappeared off GATK site for now as suddenly as it appeared: https://gatk.broadinstitute.org/hc/en-us/community/posts/360057733992-Where-s-the-DRAGEN- but that's something else to keep tabs on.

tontsa
02-18-2020, 05:54 AM
The bwa-mem2 authors have actually clarified the memory requirement for the indexing part; the actual alignment doesn't need that much memory. From https://github.com/bwa-mem2/bwa-mem2/issues/19: "The code needed 34N GB memory to build the index where N is the size of the reference sequence. So, for N = 3 Billion (for human genome), it would have needed nearly 100 GB memory. The latest commit (6f21de2873a892207c4d315a3a0e59cfffb0b71b) reduces the memory requirement to 28N GB (=84 GB for human genome)." I must have had some very bad version as it required nearly 300 gigs of memory to finish.. but still, 84 GB for straight hg38 is quite a lot.

teepean47
02-18-2020, 11:11 AM
Oddly enough, the DRAGEN reference seems to have disappeared off GATK site for now as suddenly as it appeared: https://gatk.broadinstitute.org/hc/en-us/community/posts/360057733992-Where-s-the-DRAGEN- but that's something else to keep tabs on.

The progress should be visible on Github.

https://github.com/broadinstitute/gatk/projects/6#card-29391735

Donwulff
02-18-2020, 10:09 PM
The progress should be visible on Github.

https://github.com/broadinstitute/gatk/projects/6#card-29391735

Thanks. That shows that there have been no reported completed tasks, even to get it to produce some results, and the last update is indeed from 3 months ago. I couldn't find any announcement of it being abandoned when I checked earlier, but it definitely looks that way. I was confused about how this would work anyway, with DRAGEN seeming to require proprietary hardware (FPGA/ASIC) accelerators. If they were putting in the effort to make it work on regular hardware but still faster, it would be something to look forward to. On the other hand, as I remarked before, there seems to be some conflict of interest in "requiring" people to suddenly shift to an Illumina-controlled processing pipeline, especially if only a low-performance version is available for free. It also seems like, from a scientific perspective, major changes to the processing pipeline would be best reserved for when the reference genome changes, but I'm not sure if GRCh39 is even coming, due to technical challenges. A lot of places are still to move even to GRCh38 though, likely partly because they're waiting for GRCh39.

Statistical studies like genome-wide association studies, and the increasingly popular polygenic risk scores, depend on variants being called the same way, usually using the same reference and pipeline. Otherwise the results may be associated with the variant-calling differences rather than the variants themselves (although quality control procedures - like keeping only variants which are called in the majority of samples and whose frequency matches other sources, Hardy-Weinberg equilibrium, Mendelian inheritance, as well as other available statistical indications - can help). This can be a problem for checking known disease-associated variants as well, especially due to the "multiple testing" problem when using services like Promethease. If there's even a 0.01% chance of a false positive per variant (different alignment by a different aligner, variant caller etc., in addition to plain sequencing error) but 10,000 pathogenic variants are tested, an average of one false positive pathogenic variant would be expected. This is why processing should be as close to the discovery study as reasonable; most research seems to be using a GATK Best Practices style pipeline. For genealogical uses, incorrect variant calls of course cause false breakpoints.

Good news is that bwa-mem2 results should be identical to bwa-mem, but as said, even according to the GitHub page they don't recommend using it in production (because there could be an error). So you should probably run the old bwa-mem as well to compare that the results actually are identical. If Broad Institute switches to DRAGEN-GATK, it will be used widely enough that significant differences in variant-calls would likely be noticed and taken into account, though that might take a while. Dante Labs already using DRAGEN based pipeline suggests we're going that way anyway, but I don't think you can run DRAGEN at home yet. (It's available for a price on some commercial platforms).

Of course, by the same token you probably shouldn't use the expanded reference genomes like the script I've been referencing for "production" either. As the comments note, it's strictly for research, or perhaps a second opinion after running analysis with the regular GRCh38 reference. Oral microbiome in saliva samples already introduces some mis-calls, which inclusion of the oral microbiome in the reference should fix, though. Hopefully there will be some paired blood & saliva samples to test that against. There are a few recent studies like https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-020-0664-7 "Quality of whole genome sequencing from blood versus saliva derived DNA in cardiac patients. Yao RA, Akinrinade O, Chaix M, Mital S. BMC Med Genomics 13:2020 11" which are relevant here, but the sequence data sets are generally not available due to privacy.

newuser
02-21-2020, 11:22 PM
hs37d5 has the contigs in a different order from what Dante Labs is using. Of course if you are mapping the reads yourself, the reference compatibility won't be an issue. However, what command are you using to do the mapping and what are the inputs? I did write the example script https://github.com/Donwulff/bio-tools/blob/master/mapping/revert-bam.sh mentioned in this thread/post with the intention of not having to go through it again every time though. Also, I still heartily recommend using a commercial service rather than learning to do it yourself and/or buying new servers to do it. However, there does seem to be a variety in the services offered (Namely, I'm having trouble getting Sequencing.com's service even working for myself anymore;)



Thanks for the script.
I was originally using "bwa mem -M -t 4 reference kitid_SA_L001_R1_001.fastq.gz kitid_SA_L001_R2_001.fastq.gz" for mapping.
Tried some other commands with the same result.
But the issue is the fastq files I received from Dante.
The R1 and R2 fastq files need to have a proper pair at the same location,
with a header like "@CL100103014L2C011R057_51/1" in R1 and "@CL100103014L2C011R057_51/2" in R2.
In the fastq files received from Dante this is not the case for the largest part of the two files.
I debugged the issue by stripping all data in the files after the first case where the pairs don't match and sending that to bwa mem, and I got the exact same sam file as before without getting the error.
So it seems that bwa mem stops mapping after the first error it gets where the pairs don't match.
Tried repairing the fastq files with BBTools repair.sh, but my laptop crashes because it has too little RAM.

My goal is to create a bam mapped to a different reference.
So I think it is now better to start from the bam file provided by Dante instead of the fastqs.

tontsa
02-22-2020, 06:18 AM
Did you also try fastp -i kitid_SA_L001_R1_001.fastq.gz -I kitid_SA_L001_R2_001.fastq.gz -o kitid_SA_L001_fixed_R1_001.fastq.gz -O kitid_SA_L001_fixed_R2_001.fastq.gz to get repaired fastqs?

MacUalraig
02-22-2020, 08:38 AM
If Broad Institute switches to DRAGEN-GATK, it will be used widely enough that significant differences in variant-calls would likely be noticed and taken into account, though that might take a while. Dante Labs already using DRAGEN based pipeline suggests we're going that way anyway, but I don't think you can run DRAGEN at home yet. (It's available for a price on some commercial platforms).



Sano are also headed in the same direction...:

"We are likely going to switch over to a DRAGEN pipeline and new hg38 with patches in the next few months."

Donwulff
02-22-2020, 05:20 PM
My goal is to create a bam mapped to a different reference.
So i think it is now better to start from the bam file provided by dante instead of the fastq's.

There were some previous discussions on how to fix FASTQs that are in the wrong order, although I think that most people just requested new, correct ones from Dante Labs. If you have the BAM file, the revert-bam.sh script above should also be just what's needed. Originally it was just a quick example of how to create the unmapped BAM (uBAM) file used in GATK's new "best practices" workflow, but then I had to add in an example of how to use the uBAM for mapping, and then do some process pipelining etc. to optimize it. Anyway, I think GATK still won't run on Windows due to some silly temporary-file-path issues, but if you have Linux or Windows Subsystem for Linux it should be fine.

I have to also get around to incorporating FASTQ handling to the revert-bam script, though that complicates the simple command line syntax.
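For the FASTQ case, the uBAM could be created with something along these lines (a sketch; it assumes the R1/R2 files are already in matching order, which is exactly what these Dante FASTQs were not, and the file and read-group names are placeholders):

gatk FastqToSam \
--FASTQ kitid_SA_L001_R1_001.fastq.gz \
--FASTQ2 kitid_SA_L001_R2_001.fastq.gz \
--OUTPUT kitid.unmapped.bam \
--SAMPLE_NAME kitid \
--READ_GROUP_NAME kitid.L001 \
--PLATFORM ILLUMINA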

Example workflow where I converted both FASTQ's to BAM, script-touched them to make them paired-end, merged and sorted the BAM's and extracted sorted FASTQ's (Yeah, it's that complicated):
https://anthrogenica.com/showthread.php?13642-Dante-Labs-WES-WGS-Sequencing-Technical&p=605467&viewfull=1#post605467

@Geldius found a script that just sorts the FASTQ files, assuming all paired-ends are there, I haven't checked how well this works:
https://anthrogenica.com/showthread.php?13642-Dante-Labs-WES-WGS-Sequencing-Technical&p=605105&viewfull=1#post605105

I think most of the FASTQ repair attempts would just filter out any unmatched read pairs, which would be most of them. Paired-end alignment works by trying to find a location where both ends match the reference within the expected "insert length" distance (the length of the fragment whose ends were sequenced). If both ends aren't available at the same time, it can't do that. You could of course align them as individual, non-paired-end reads, but the alignment would be much, much worse because it loses the information about the proximity of the read ends, and the ability to detect duplicate fragments (with identical endpoints), so... don't do that.

aaronbee2010
02-22-2020, 07:40 PM
Aaron, did you already reach some conclusion how to process those new GRCh38 patches so you end up with usable .alt file for bwa-mem?

I tried Donwulff's GRCh38_bwa_index.sh but it ends up with 99844 different contigs in the .bam making it pretty impossible to process.. or should I just filter out all those non primary assembly contig reads before running variant calling?

Just to give you an update, the .alt index made with Magic-BLAST is almost done now. When it's finished, I'll start aligning the patches with bwa mem tonight. I'll also prepare an .alt index using minimap2 and then align that with bwa mem afterwards so I can compare the two. Any feedback is appreciated.

newuser
02-23-2020, 02:10 AM
There were some previous discussions on how to fix FASTQs that are in the wrong order, although I think that most people just requested new, correct ones from Dante Labs. If you have the BAM file, the revert-bam.sh script above should also be just what's needed. Originally it was just a quick example of how to create the unmapped BAM (uBAM) file used in GATK's new "best practices" workflow, but then I had to add in an example of how to use the uBAM for mapping, and then do some process pipelining etc. to optimize it. Anyway, I think GATK still won't run on Windows due to some silly temporary-file-path issues, but if you have Linux or Windows Subsystem for Linux it should be fine.

I have to also get around to incorporating FASTQ handling to the revert-bam script, though that complicates the simple command line syntax.

Example workflow where I converted both FASTQ's to BAM, script-touched them to make them paired-end, merged and sorted the BAM's and extracted sorted FASTQ's (Yeah, it's that complicated):
https://anthrogenica.com/showthread.php?13642-Dante-Labs-WES-WGS-Sequencing-Technical&p=605467&viewfull=1#post605467

@Geldius found a script that just sorts the FASTQ files, assuming all paired-ends are there, I haven't checked how well this works:
https://anthrogenica.com/showthread.php?13642-Dante-Labs-WES-WGS-Sequencing-Technical&p=605105&viewfull=1#post605105

I think most of the FASTQ repair attempts would just filter out any unmatched read pairs, which would be most of them. Paired-end alignment works by trying to find a location where both ends match the reference within the expected "insert length" distance (the length of the fragment whose ends were sequenced). If both ends aren't available at the same time, it can't do that. You could of course align them as individual, non-paired-end reads, but the alignment would be much, much worse because it loses the information about the proximity of the read ends, and the ability to detect duplicate fragments (with identical endpoints), so... don't do that.

Thanks.
Requesting new files would be difficult since Dante is not responding to customer messages anymore.
And I have a bandwidth limit, so I will fix them myself.

Yes the fastq files are out of order.
Wrote a little Python script to load 1/10 of the unpaired header IDs from R1 into RAM (too little RAM to load them all),
and checked whether they were present in R2.
Found all of them, so sorting will fix it.

Will see if I have enough RAM to run the script Geldius found.
Otherwise I will write my own to work with a lower amount of RAM, in Go or Rust; Python would be a bit too slow for sorting two 167.6 GB fastq files, I think.

Donwulff
02-24-2020, 05:24 AM
If you Google, there are multiple solutions presented; I liked the UNIX tool-think elegance of the top suggestion at https://www.biostars.org/p/15011/ (Note this command DOES NOT fix any possible missing/duplicated pairs; it assumes each read name occurs once and only once. Individual FASTQ sorting solutions have this limitation. Although according to your test above, all the pairs should be there.)


zcat reads.fq.gz \
| paste - - - - \
| sort -k1,1 -S 3G \
| tr '\t' '\n' \
| gzip > sorted/reads.fq.gz


However, I would like to warn that some tools like GATK expect the reads in a very specific order. For the aligners it's enough that the read ends match each other, and their ordering is irrelevant. It may require an extra step of sorting the output into the appropriate read-name order (there are a few different sort orders!) before processing with GATK etc. The uBAM approach avoids this, as the "unmapped BAM" is already in the order expected by GATK, and so is the output.

A possible gotcha for bwa-mem/bwa-mem2 and similar is that they "sample" the insert length of the fragments whose ends are being sequenced (because sequencing errors accumulate with each "cycle" of sequencing, there's this ingenious trick of reading 100/160 basepairs from each end of a fragment, sometimes overlapping, to get two high-quality reads within an expected distance) to dynamically determine the insert fragment length distribution. So a read ordering which affects that distribution can cause problems. Sequencing order (that is, essentially, a numeric sort on the numeric parts of the read name) would conceivably be best, although other solutions are to shuffle the read order (keeping paired ends together; samtools collate in the FASTQ order fixing script, for example, does this shuffling) or to determine the insert fragment length as well as possible (for example from the BAM file stats of an earlier alignment) and use fixed values (bwa mem -I <insert size>, see the manual for optional distribution values, although the defaults are probably good enough) over the whole sequence.
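For the fixed-value route, something like this would work (the numbers are placeholders; read the mean and standard deviation off samtools stats of the existing BAM first):

samtools stats dante.bam | grep '^SN' | grep 'insert size'
bwa mem -t 4 -I 450,100 hs37d5.fa R1.sorted.fastq.gz R2.sorted.fastq.gz > remapped.sam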

Now all that is fairly advanced issues though...

Donwulff
02-25-2020, 06:33 AM
Broad Institute just announced this poster session for the Advances in Genome Biology and Technology conference now underway. While the focus is NOT on the software version of DRAGEN-GATK, it's mentioned at the end of the poster, so it appears to not be dead & buried.
https://drive.google.com/file/d/1fa0RG9RjoPOBlqvPoH6CXL121NldqAph/view
"The Illumina DRAGEN Bio-IT Platform (DRAGEN) uses highly reconfigurable field-programmable gate array technology (FPGA) to provide hardware-accelerated implementations of common genomic analysis algorithms."
"The GATK team is working on porting the improvements developed by the DRAGEN team into our open-source, software-only, functionally equivalent version." (Elsewhere they say "within margin of error that we consider inconsequential"). Note that the results WILL differ from BWA-GATK; that's rather the point.

The poster sows some more confusion as they appear to call the hardware and software version both DRAGEN-GATK, while elsewhere it's said DRAGEN-GATK is specifically the future result compatible software version.

https://gatk.broadinstitute.org/hc/en-us/articles/360039984151-DRAGEN-GATK-Update-Let-s-get-more-specific
Their new blog-entry descends further into, er, shall we say wapor-vare confusion:
"Given those results, we've determined that the DRAGEN pipeline we tested (v 3.4) is the standard that we want to aim for, so we consider it to be the hardware-accelerated version 1.0 of the DRAGEN-GATK pipeline for germline short variants. In other words, if you've been running DRAGEN 3.4, congratulations, you've been running the DRAGEN-GATK pipeline this whole time :)"

So the proprietary hardware-accelerated stock version of DRAGEN is the release version 1.0 of their new open-source software-only DRAGEN-GATK pipeline. Congrats, release done! Err, what?

"The porting work mainly involves modifications to the HaplotypeCaller, which will be released as part of the regular GATK software package."

...so what about the DRAGEN aligner now? By comparison, in earlier announcement they wrote "For example, we're going to replace BWA with the DRAGEN aligner, which is quite a bit faster, in our DNA pre-processing pipelines (full details and benchmarking results to follow)" ... "All the software involved in the DRAGEN-GATK pipelines will be fully open source and available in Github, including a new open source version of the DRAGEN aligner, and we'll continue to publish WDL workflows for every pipeline in Github and in Terra workspaces." which is why I brought it up as improved aligner.

According to the blog entry it seems it's not expected to be (much) faster, but more accurate for indels.

aaronbee2010
02-29-2020, 01:22 PM
I've just finished making the GRCh38.p13 BAM using Magic-BLAST to make the .alt index (one large alignment for each patch, as opposed to multiple smaller alignments per patch as with minimap2). I'm working on the minimap2 .alt index right now (it just needs some polishing) to make the BAM file for that. In the meantime, I'm currently aligning a BAM using an .alt index without CIGAR strings and edit distances (here, the patches are being treated as decoys). Here's an example of a patch's line in the .alt index for each aligner:

Magic-BLAST:

KQ031383.1 0 chr1 12818488 255 135897M22827I50000D308419M * 0 0 * * NM:i:72827

minimap2:

KQ031383.1 2048 chr1 12818488 60 135897M331246H * 0 0 * * NM:i:0
KQ031383.1 0 chr1 13004385 60 158724S308419M * 0 0 * * NM:i:0

Treating patches as decoys:

KQ031383.1

I've made two copies, one with and one without post-alt processing. I've extracted the chrY regions into separate BAM files, removed duplicates with Picard (Optical Duplicate Pixel Distance = 2500) and ran BamQC on the data (analysing regions within YSEQ's SNP criteria) and compared it to other non .p13 BAMs I've done previously. Here are the results I have so far:

https://i.gyazo.com/6544490f9b18f62c136f71c28af872ef.png

The results for the GRCh38.p13 Magic-BLAST BAMs with and without post-alt processing are completely identical except for the post-alt BAM having slightly lower average MAPQ (39.30 vs 39.34) than the BAM without post-alt processing. The Magic-BLAST BAMs look similar to the DRAGEN BAM aligned by Dante Labs, with higher coverage and more no-calls than the other samples.

EDIT: I'm also considering running the above test again on all BAM files with non-MAPQ=60 reads filtered out afterwards. I would look forward to anyone's feedback as to whether or not this is a good idea.
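For reference, the filtering step itself would just be something like this (file names are placeholders; 60 is the maximum MAPQ bwa mem reports):

samtools view -b -q 60 chrY.dedup.bam > chrY.dedup.q60.bam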

newuser
03-02-2020, 06:40 AM
Update.
Got my fastq files sorted and bwa mem completes without throwing an error.

Used like Donwulff suggested:


zcat reads.fq.gz \
| paste - - - - \
| sort -k1,1 -S 3G \
| tr '\t' '\n' \
| gzip > sorted/reads.fq.gz


But I found an issue with this method.
If I run 'zcat R1.fq.gz | head -n 12' on the sorted file I get the following IDs:
@CL100103014L2C001R002_0/1
...
@CL100103014L2C001R002_100000/1
...
@CL100103014L2C001R002_10000/1
...

And for sorted R2:
@CL100103014L2C001R002_0/2
...
@CL100103014L2C001R002_100000/2
...
@CL100103014L2C001R002_100001/2
...

The /1 and /2 in the sequence IDs made the order different between the two files.
I inserted a little patch to temporarily flip the /2 to /1 before sorting and flip it back before gzipping.
code for flip.py patch:


import sys
# Swap /1 and /2 in the first (header) field of each paste'd FASTQ record so
# R1 and R2 sort into the same order; running the script again flips them back.
output = sys.stdin.readline()
while output:
    output = '\t'.join([x if n != 0 else x.replace('/2','/1') if "/2" in x else x.replace('/1','/2') for n,x in enumerate(output.split('\t'))])
    print(output.rstrip('\n'))
    output = sys.stdin.readline()

And resorted R2:


zcat R2.fastq.gz \
| paste - - - - \
| python3 flip.py \
| sort -k1,1 -S 3G \
| python3 flip.py \
| tr '\t' '\n' \
| gzip > R2fixed.fq.gz


This made the order equal for R1 and R2 file.

Donwulff
03-02-2020, 02:03 PM
Thanks for the addition! The read pair IDs /1 and /2 aren't really supposed to be in the FASTQ though; I believe simply cutting them off would be enough. Be careful not to cut off any other flags that might follow the read name. Although all of that is really unsatisfactory, as the FASTQ generation method used for those files loses possible metadata.
I need to get around to testing, but if the read-pair ID is there, it should be possible to concatenate and sort those FASTQs into a single interleaved file (pair ends following each other), depending on the sorting algorithm of course. While this doubles the amount of data to sort at once, with any sane sorting algorithm I don't think that should present an issue.
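Something along these lines might do it (an untested sketch; LC_ALL=C keeps the sort byte-wise so the /1 record of each pair lands immediately before its /2 mate, and bwa mem can then read the interleaved file with -p):

{ zcat R1.fastq.gz; zcat R2.fastq.gz; } \
| paste - - - - \
| LC_ALL=C sort -k1,1 -S 3G \
| tr '\t' '\n' \
| gzip > interleaved.fq.gz
bwa mem -p hs37d5.fa interleaved.fq.gz > remapped.sam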

Donwulff
03-02-2020, 02:22 PM
I'm wondering if we have any good way to benchmark genome alignment or variant call accurracy. There's the Genome In a Bottle, for example, which provides high confidence call-sets https://github.com/ga4gh/benchmarking-tools/blob/master/resources/high-confidence-sets/giab.md to many academic samples. However, for DTC use there's the issue that those research samples are from whole blood, and not saliva, ie. they don't have complicating metagenome.

An additional complication is that the interplay of the variant caller and the alignment means one alignment method is likely better with one variant caller, while another works better with a different one (I need to find the papers comparing multiple different combinations). This means that to get comparable data, one would need to either settle on a variant calling methodology to use "in any case", or preferably produce comparisons with multiple ones. (And, unfortunately, I'm simplifying, because each tool can be run with a multitude of different parameters which would be better in different cases!)

Either one could ignore the metagenome and just benchmark different tools against GIAB, assuming the metagenome is sufficiently different for the comparisons to still be valid, or possibly take all unmapped reads from a saliva-based genome and add them to a GIAB reference genome for benchmarking. There are also limited options for simulating reads both from the human genome & microbiome, ensuring that one has an absolute "truth set" of what variants are in the sample, but this approach is only valid if sequencing errors are correctly modeled.

A final consideration: if we have participants who have multiple high-quality sequences of their own genome, and aren't averse to making the data public, it would be possible to construct a test set from that. Use the sequences to derive a high-confidence callset, then use that callset to train a variant quality recalibrator/convolutional neural network to cross-filter variants, possibly sticking to variants which are visible in most sequences, etc. I need to look at how GIAB has done the validation; it's always a bit subjective, but multiple sequences & deep sequencing always allow for higher confidence in variant calls. Parent sequences to check for Mendelian violations would be even better. Of course, I have yet to do most of that with my limited sequences.
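
For the GIAB route, the ga4gh benchmarking-tools repository above points to hap.py as the reference comparison tool, I believe; a minimal sketch of comparing a callset against a truth set (the truth VCF, confident-regions BED and reference paths are placeholders):

# Compare a query VCF against a GIAB truth set, restricted to the high-confidence regions;
# hap.py reports SNP and indel precision/recall.
hap.py HG001_GRCh38_truth.vcf.gz my_calls.vcf.gz \
  -f HG001_GRCh38_confident.bed \
  -r GRCh38_full_analysis_set.fa \
  -o happy_benchmark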

aaronbee2010
03-05-2020, 09:41 AM
What are your thoughts on fabricating simulated reads (like GATK did with their tutorial on mapping reads to references with alt contigs - https://gatk.broadinstitute.org/hc/en-us/articles/360037498992?id=8017) and using those reads to benchmark alignment methods? This seems ideal to me because you only have the reads you're interested in (i.e. no microbiome or adapter sequences to worry about) and you immediately have a 100% accurate reference to compare the benchmarks to (i.e. comparing benchmarked alignments to your original simulated reads and calculating some sort of percentage accuracy score). You could also add SNPs and indels to the simulated reads and evaluate how accurate each aligner is in taking those into account.

Of course, the main disadvantage seems to be the time needed to create simulated reads across the entire GRCh38 reference, but you could just focus on one (or more) part(s) of said reference. For a lot of users here, the focus would specifically be on the accuracy of alignments to the Y chromosome. Maybe somebody could code an algorithm that randomly splits regions of the GRCh38 reference into overlapping 150 bp fragments? I lack the know-how to do this, but I'm sure somebody else here could work with that idea.
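
Edit: an existing read simulator such as wgsim (from the samtools family) might already cover much of that; a rough sketch for chrY-only simulated 150 bp pairs, with all parameters and file names purely illustrative:

# Pull out chrY and simulate one million 150 bp read pairs with injected SNPs/indels;
# wgsim prints the variants it introduces to stdout, which becomes the truth set.
samtools faidx GRCh38.fa chrY > chrY.fa
wgsim -N 1000000 -1 150 -2 150 -e 0.001 -r 0.001 chrY.fa simY_R1.fq simY_R2.fq > simY_truth.txt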

On another note, I'm still working on the .alt index for GRCh38.p13 generated by minimap2, but it's taking a lot more time than I expected, and juggling all of this with my university degree (I'm in the final semester of my final year) and other errands is proving difficult.

Donwulff
03-05-2020, 11:35 PM
It all kinda depends on what your goals for the benchmarking are. Personally I'm specifically fretting about how to deal with the metagenome. Since I assume most of us here are mostly talking about re-analysing direct-to-consumer sequencing data, those are usually from saliva, and therefore any "validation" of the variant-calling method should preferably take the microbiome into account. Paired whole-blood & saliva samples, like in that study https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-020-0664-7 I linked earlier, would be ideal. However, researchers aren't allowed to share those for home studies.

I'm still wondering if we have any enthusiasts who have managed to get both blood & saliva sequences (preferably at high read depth; another trick is randomly sub-sampling those sequences, because higher read depth means fewer overall errors) and would be willing to share.

If the intent is to JUST see how the alignment/variant calling performs under optimal conditions, from whole blood etc., then simulating reads from a known genome is of course feasible, and generally done (in parallel with real samples) in any study about a new aligner or variant caller. However, these simulations are only as good as the simulated error profile and, in this context, perhaps more importantly the reference genome. In terms of testing alt contigs and fix patches etc., I don't think there is much real benefit to be gained from testing with simulated reads.

It does get complicated though, because there's no "typical human genome", so validating a saliva sequence against, for example, a deep linked-read whole-blood sequence for one person just indicates that it works for their structural variants and metagenome. But yes, of course, there's lots to be learned this way by looking at the potential errors and seeing if specific changes seem to be making things better or worse.

On that note, Dante Labs is now advertising their Hybrid Genome with WGS, WES and long reads combined; has anybody tried that? I hate to say it, but seeing as they never really delivered as promised on the long-read sequence and some of the bioinformatics products, I'd hate to throw good money after bad. On paper it looks like quite a good deep (exome) sequencing and structural-variant product, though I'm not finding any details on how they perform the combination.

aaronbee2010
03-06-2020, 09:28 PM
Big update: I've managed to find a list of .gff files from NCBI's FTP server which give alignments of all patch contigs, so the previous tests I've done fabricating .alt indexes with Magic-BLAST and minimap2 appear to be no longer necessary (which is a shame, as I just finished my minimap2 .alt index last night!). I'll start aligning with the NCBI .alt index as soon as I get it completed. This shouldn't take too long, but it does involve me having to determine whether any additional non-primary alignments are secondary or supplementary (these terms confuse me a little bit at the moment).

Donwulff
03-14-2020, 05:44 AM
Kinda interesting how the synchronicity/serendipity keeps up, I guess. Now we have this new paper: https://genome.cshlp.org/content/early/2020/03/10/gr.255349.119
"Genomic loci susceptible to systematic sequencing bias in clinical whole genomes"

"Current standard practices to differentiate between loci that can and cannot be sequenced with high confidence utilize consensus between different sequencing methods as a proxy for sequencing confidence. These practices have significant limitations, and alternative methods are required to overcome them. We have developed a novel statistical method based on summarizing sequenced reads from whole-genome clinical samples and cataloging them in “Incremental Databases” that maintain individual confidentiality. Allele statistics were cataloged for each genomic position that consistently showed systematic biases with the corresponding MPS sequencing pipeline. We found systematic biases present at ∼1%–3% of the human autosomal genome across five patient cohorts. We identified which genomic regions were more or less prone to systematic biases, including large homopolymer flanks (odds ratio = 23.29–33.69) and the NIST high confidence genomic regions (odds ratio = 0.154–0.191)."

I've really been wanting to work on that sort of solution myself. The paper itself is unfortunately not open access for six months.

On the other hand this gets slightly side-tracked from the challenge of testing/validating alternate/fix-patches, I guess.

teepean47
03-14-2020, 08:23 PM
https://sci-hub.tw/10.0000/genome.cshlp.org/genome/early/2020/03/10/gr.255349.119

aaronbee2010
03-14-2020, 09:28 PM
On the other hand this gets slightly side-tracked from the challenge of testing/validating alternate/fix-patches, I guess.

I've just realised that some gff files that annotate the alignments don't mention minute changes in individual bases, so after I've finished adding CIGARS and edit distances to my .alt index template, I'll need to manually verify the alignments myself.

GRCh38.p14 is apparently due to be released in Spring (although SARS-CoV-2 may delay that somewhat) so if this takes too long, I might need to add those extra patches in too. I wouldn't be surprised if things took that long with the workload my degree is giving me at the moment.

aaronbee2010
03-16-2020, 09:22 PM
newuser: did you use fastp's processed fastqs (-o/-O flag) or did you just use fastp for statistics? You might get better results with the filtered fastqs. Also on hg19 I don't think you need/want the -M flag for bwa mem. For "fun" try bwa mem -t 16 -Y -v 3 -K 100000000 -R '@RG\tID:LANENAME\tPL:ILLUMINA\tPU:xTen\tLB:SOMETHING\tSM:SOMETHING\tCN:ILLUMINA' instead. That's what GATK's WGS pipelines use.

Interesting. Where did you find this, if you don't mind me asking? I'm trying to get the hang of GATK's Best Practices pipeline and I couldn't find these parameters for bwa mem on GATK's website. I've tried to merge a uBAM with an aligned BAM before to no avail, but I'm currently trying again, hence my interest.

tontsa
03-17-2020, 05:19 AM
Interesting. Where did you find this, if you don't mind me asking? I'm trying to get the hang of GATK's Best Practices pipeline and I couldn't find these parameters for bwa mem on GATK's website. I've tried to merge a uBAM with an aligned BAM before to no avail, but I'm currently trying again, hence my interest.

https://github.com/gatk-workflows/gatk4-genome-processing-pipeline/blob/master/tasks/UnmappedBamToAlignedBam.wdl
String bwa_commandline = "bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta"

That same UnmappedBamToAlignedBam is used also in:
https://github.com/gatk-workflows/gatk4-exome-analysis-pipeline

https://github.com/gatk-workflows/gatk4-mitochondria-pipeline/blob/master/tasks/alignment-pipeline.wdl has slightly different but still:
String bwa_commandline = "bwa mem -K 100000000 -p -v 3 -t 2 -Y $bash_ref_fasta"

Cheers,
Toni

aaronbee2010
03-17-2020, 12:37 PM
Brilliant find! I dismissed their .wdl + .json pipelines after failing to get Cromwell working from my external HDD, so I didn't think to look there again. Thank you very much!

I might try converting their .wdl + .json pipelines into written instructions for my own reference. I can share those instructions here as well if anyone's interested :)

tontsa
03-17-2020, 12:40 PM
I've also done a crude bash script based on those workflows for my own use, as spinning up those Dockers is a bit slow on my own computer and I have no desire to run them on Google Cloud at the moment.

aaronbee2010
03-18-2020, 01:51 AM
I've also done a crude bash script based on those workflows for my own use, as spinning up those Dockers is a bit slow on my own computer and I have no desire to run them on Google Cloud at the moment.

Does processing 100000000 input bases in each batch regardless of nThreads (the "-K 100000000" option) yield improved mapping?

On another note, my current attempt at aligning with GATK's Best Practices pipeline has worked out successfully so far and I'm now at the BQSR stage, but I'm not sure exactly which known-sites (covariate) .vcf.gz file to use. I'm currently downloading the "00-All.vcf.gz" file from ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/. If/when this completes, I'll do the same thing again, but with the GATK BWA-MEM options enabled so I can compare the two. I would like to see if the inclusion of the -K and -Y options has a noticeable effect on the MAPQ and the number of no-calls.

tontsa
03-18-2020, 05:26 AM
I think the purpose of the -K 100000000 option is to ensure you get consistent run-to-run results with WGS data... not sure it really affects mapping quality. I've been lazy with the dbSNP build version, as it's quite tedious to rename the contigs in later builds to match your reference... currently I'm using GATK's https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/ since for my uses I haven't really needed the newer dbSNP builds, as many services just use co-ordinates instead of rsXXXX IDs.
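
For reference, with those bundle files the two BQSR steps look roughly like this (file names as in that bucket; adjust to whatever your local copies are called):

# Build the recalibration table from known variant sites, then apply it to the BAM.
gatk BaseRecalibrator \
  -I sample.hg38.bam \
  -R Homo_sapiens_assembly38.fasta \
  --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
  --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz \
  -O recal.table

gatk ApplyBQSR \
  -I sample.hg38.bam \
  -R Homo_sapiens_assembly38.fasta \
  --bqsr-recal-file recal.table \
  -O sample.hg38.recal.bam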

aaronbee2010
03-29-2020, 01:13 PM
I've tried to split my FASTQ files by header (in this case: instrument ID, run ID, flowcell ID and flowcell lane). Here are the commands I used:

zgrep "@" R1.fastq.gz > 1.txt (Example line: @A00966:8:HVYV3DSXX:3:1101:25482:1094 1:N:0:ATACTTGTTC+TAACCGCCGA)

awk -F ":" '{ print $1, $2, $3, $4 }' 1.txt > 2.txt (Example line: @A00966 8 HVYV3DSXX 3)

uniq 2.txt > 3.txt (removes adjacent duplicate lines)

3.txt > manually replace " " with ":" > 4.txt (Example line: @A00966:8:HVYV3DSXX:3) - at this stage, I had 8 unique headers.

zgrep -A 3 "@A00966:8:HVYV3DSXX:3" R1.fastq.gz | bgzip -c > 1_R1.fastq.gz (repeat for all other lines)

cat 1_R1.fastq.gz 2_R1.fastq.gz 3_R1.fastq.gz 4_R1.fastq.gz 5_R1.fastq.gz 6_R1.fastq.gz 7_R1.fastq.gz 8_R1.fastq.gz > R1_split_recombined.fastq.gz

When evaluating the GBase count for the original R1.fastq.gz file I received from Dante Labs and the R1_split_recombined.fastq.gz file I generated from the original R1.fastq.gz file, there was a 0.1 GBase discrepancy between them (original = 46.32 GBases and split/recombined = 46.22 GBases). Even when doing awk with the decompressed R1.fastq file (awk 'NR % 4 == 1' R1.fq > 1.txt), I still ended up with a 4.txt file being completely identical to the 4.txt generated by starting with zgrep instead of awk.

I then tried splitting the original R1.fastq.gz file into sections of 4 lines then immediately recombining and recompressing, all in one command:

zgrep -A 3 "@" R1.fastq.gz | bgzip -c > R1_split_recombined.fastq.gz

When evaluating the original R1.fastq.gz file and the R1_split_recombined.fastq.gz file, the same discrepancy exists.

Since FASTQ files only have 4 lines per read, none of the commands I've posted above should theoretically result in any loss of data (for either set of commands), so I'm having a hard time figuring out why this is happening. I would appreciate the input of anyone here who has more knowledge with this sort of thing than I do. Many thanks in advance.
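
Edit: one possible gotcha I've since read about is that grep isn't really FASTQ-aware: quality strings can legitimately contain (or even start with) "@", and grep -A inserts "--" group separators between non-adjacent matches, so downstream base counting can get confused. A splitter keyed only on the header line of each 4-line record may be safer; a sketch (assuming the lane is the fourth ":"-separated field, as in the example header above):

# Split by flowcell lane, looking only at the first (header) line of each record.
zcat R1.fastq.gz \
| paste - - - - \
| awk -F'\t' '{ split($1, h, ":"); print $1 "\n" $2 "\n" $3 "\n" $4 | ("gzip > R1_lane" h[4] ".fastq.gz") }'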

Donwulff
03-29-2020, 09:17 PM
https://gatkforums.broadinstitute.org/gatk/discussion/21351/bwa-mem-m-option discusses -M and -K options.
On -M: "However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments."
AFAIK it hurts nothing but improves compatibility; however, with DRAGEN etc. "compatibility" is admittedly in flux. Also, I suffered a boot-drive failure at home, so I can't test whether there's any difference in the eventual results even if I wanted to. In many/most cases the answer should be "try it out and report what happens"... remembering we don't really have a good benchmark for "correct" results on direct-to-consumer sequences.

"So, the -K option is set internally to speed up the pipeline so it does not need to be set by the user." I think their technical explanation about "overwrite the file" makes no sense, lol, but yeah pipeline speedup. In fact I should try that out in my own pipeline, because you definitely don't want those buffered line by line.

Speaking of pipelines, see https://github.com/Donwulff/bio-tools/blob/master/mapping/BQSR.sh - I should probably stop publishing these example scripts if nobody uses or looks at them ;) They do make a few assumptions about the environment they're being run in; patches gladly accepted. You can get the training sets and weights off the Broad Institute pipelines too, but the resource locations/IDs and non-matching contigs are a real pain. Currently, though, the BQSR script applies the recalibration to the whole BAM, including the oral microbiome if you have it included... which is just wasted effort. There's now https://gatk.broadinstitute.org/hc/en-us/articles/360040097972-ApplyBQSRSpark-BETA- which might be useful, but applying the BQSR covariates should be about as fast as concatenating the BAM segments, which makes things interesting. You'd want a file system that allows linking files in a series after each other without actually moving anything.

And that script in turn reminds me I want to add the insert length as a covariate in BQSR. I seem to recall the old "commercial" GATK 3 had that, or at least selectable covariates. I've long felt insert length should be included, and there was a recent paper claiming it explains much of the residual error in sequences.

aaronbee2010
04-02-2020, 01:18 PM
Interesting point about the -M option. I was wondering why YSEQ used the -M option when generating my BAM file as I thought that the version of Picard I use wouldn't need it, but now it seems worthwhile to add it to my personal workflow guide.

Regarding your scripts, I've been put off the idea of scripting as I was only aware .wdl scripts existed and was unaware of .sh scripts (yes, I'm a massive Linux novice, make fun of me all you want lol). Now that I've discovered this, I'm more open to the idea of using .sh scripts to pipe multiple commands from my personal workflow together. I've been working on an .alt index for the .p13 patches, but I realised that for the .gff3 files NCBI provides containing alignment statistics of each patch to the primary assembly, I was making a big mistake before. I assumed that you just needed to add the total insertions and deletions from the CIGAR string to obtain the edit distance, but it turns out I also need the value of "num_mismatch" when calculating the edit distance, as it needs to include the number of mismatches within the parts of the patches that line up on the primary assembly (the "M" regions within the CIGAR strings). Guess I'm off to start all over again :)

Here is what the .alt file for chr1 looks like so far (ignore the bold, that's just to remind me which contigs I've completed):

https://i.gyazo.com/df4cc8b4950795788a0e32ec85d47184.png

One thing I would also like to do is rename the FASTA contig headers to the equivalents for the UCSC naming convention for patches (i.e. chr1_KQ031383v1_fix instead of KQ031383.1 and chr1_KQ983255v1_alt instead of KQ983255.1).

Also, I've looked at your BQSR script, and when I downloaded the snps_hg38.vcf.gz file from the YBrowse raw data, I got multiple errors with BQSR, because the chrY length in the VCF header was that of the hg19 reference and because there were lots of so-called variants with ancestral and derived alleles identical to each other, so I corrected the contig length and identical-allele errors and the BQSR went smoothly. Unfortunately, the YBrowse VCF is updated all the time, so I don't think I'll be updating my VCF with corrections that regularly. Using YBrowse is a really clever idea from you though.
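
Something along these lines should reproduce those two fixes (assuming the header line reads ID=chrY,length=59373566 and the correct hg38 chrY length of 57227415):

# Fix the chrY contig length in the header and drop records where REF equals ALT.
zcat snps_hg38.vcf.gz \
| sed 's/ID=chrY,length=59373566/ID=chrY,length=57227415/' \
| awk -F'\t' '/^#/ || $4 != $5' \
| bgzip > snps_hg38.fixed.vcf.gz
tabix -p vcf snps_hg38.fixed.vcf.gz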

I've looked at the dbSNP153.vcf.gz and I now completely understand why tontsa wants nothing to do with it. Imagine having to replace bucketloads of NC_ contigs with contig names that match the hs38DH reference. I'm sure there's a command I can use to replace all of them, but doing this in a file that's 13.4 GB when compressed seems tedious. I found a script on GitHub that converts the contig names to UCSC format, but the script lists all patch contigs as chr0_XX000000v0_alt, even though most of the patches are fix patches and not novel patches, so this would need correcting.
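
Edit: it looks like bcftools annotate --rename-chrs can do the renaming in bulk; a sketch, where chr_map.txt is a hypothetical two-column "old_name new_name" file with one line per contig (e.g. "NC_000001.11 chr1"):

# Rename RefSeq-style contigs to hs38DH-style names throughout the VCF.
bcftools annotate --rename-chrs chr_map.txt dbSNP153.vcf.gz -Oz -o dbSNP153.hs38DH.vcf.gz
tabix -p vcf dbSNP153.hs38DH.vcf.gz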

Also, for predicted insert length, I used BBTools and Picard to obtain predicted values and got two very different results, so I'm not sure which one to use. BBTools has the advantage of being able to fetch predicted values directly from the FASTQ files, whereas with Picard, I have to align with BWA then sort before I can obtain these values, but I'm guessing Picard values are best for the sake of consistency, if that makes any sense?

When I'm done aligning BAMs from all of my split FASTQ's, I'll do a BamQC test with BAMs aligned with and without GATK best practices and compare them. I've already done one BAM with and without GATK BP and the results look like this (bear in mind I haven't added predicted insert length to the headers of either BAM):

https://i.gyazo.com/df6ed69340865f94866e915a8e277fe1.png

It looks like you're trading the number of reads for mapping quality. Bear in mind this disparity is probably due to me setting MergeBamAlignment to unmap suspected contaminant reads though, which GATK does in their .wdl script. Once I'm done with this, I could try experimenting with patches and oral microbiome sequences too. I've finished my .alt index for the eHOMD genomes as well. I'm considering concatenating all of the .gff files for each patch and finding a way to convert the CIGAR strings into SAM format and calculating edit distances. Seems like much less effort than doing it all manually, since I've gotten some practice with awk, grep etc. over the last week.

Donwulff
04-03-2020, 08:41 AM
UCSC isn't just a naming convention, they release their "own" contigs. Theoretically it would be possible for them to differ from the raw contigs, though I hope not. The soft-masking does vary, but that's another can of worms. Perhaps it's laziness talking, but I prefer to keep the contig names according to the source of the contigs to avert any possible mismatch. As you've noted, the challenge in renaming them is finding out which are fix patches and which are novel patches; I'm sure this is relatively easy, but why bother when (presumably) nothing can process them yet? One exception is if one wants to exclude fix patches from the reference, ugh, which I've noted earlier ;)

The BQSR.sh script contains some tricks to fix the various reference files; in many cases it involves just removing contig lines from the VCF header, because the tools used don't require them, but they sensibly fail if the contig definitions between resources don't match. This, of course, is a problem when people put in wrong contigs, or contigs they're not really using. A future note for me: I should probably validate the contig lines and/or replace the YBrowse ones with correct ones, so that people receive an error if they're using a genuinely wrong resource.

I recall an earlier version of YBrowse had some malformed variant lines which I had to work around; perhaps that's happened again, I will have to see if it needs processing.

I'm not sure what the exact difference for "aligned with GATK best practices" is in your table. BWA estimates insert length on the fly, so I'm not sure why that would be required, unless you're trying to map unmapped reads again (and I wonder how often I forget to set insert length manually then...). Reference-less tools only report insert length for overlapping reads, in which case insert length < 2X read length. This is obviously wrong for things like read alignment, and insert length can only be counted from aligned reads with samtools etc.

aaronbee2010
04-04-2020, 12:37 AM
UCSC isn't just a naming convention, they release their "own" contigs. Theoretically it would be possible for them to differ from the raw contigs, though I hope not. The soft-masking does vary, but that's another can of worms. Perhaps it's laziness talking, but I prefer to keep the contig names according to the source of the contigs to avert any possible mismatch. As you've noted, the challenge in renaming them is finding out which are fix patches and which are novel patches; I'm sure this is relatively easy, but why bother when (presumably) nothing can process them yet? One exception is if one wants to exclude fix patches from the reference, ugh, which I've noted earlier ;)

Yes, I am aware of the presence of soft-masking within UCSC that isn't present in the NCBI contigs. When I wrote what I said above, I was referring specifically to the naming convention. Apologies for not making this clear in my previous post.

As far as the distinction between which patches are fix patches and which ones are novel patches, this information is available on the NCBI "Nucleotide" section. Renaming all of them manually seems a bit tedious though.


The BQSR.sh script contains some tricks to fix the various reference files; in many cases it involves just removing contig lines from the VCF header, because the tools used don't require them, but they sensibly fail if the contig definitions between resources don't match. This, of course, is a problem when people put in wrong contigs, or contigs they're not really using. A future note for me: I should probably validate the contig lines and/or replace the YBrowse ones with correct ones, so that people receive an error if they're using a genuinely wrong resource.

I recall an earlier version of YBrowse had some malformed variant lines which I had to work around; perhaps that's happened again, I will have to see if it needs processing.

When preparing the YBrowse vcf.gz file for BQSR, the only issues I had were the mismatched contig length for chrY and the presence of SNPs with derived alleles being identical to the ancestral alleles. After resolving those issues, the BQSR went smoothly. These issues don't appear to be that serious to me, but from what I've heard, GATK tools can be quite strict with things like this.


I'm not sure what the exact difference for "aligned with GATK best practices" is in your table. BWA estimates insert length on the fly, so I'm not sure why that would be required, unless you're trying to map unmapped reads again (and I wonder how often I forget to set insert length manually then...). Reference-less tools only report insert length for overlapping reads, in which case insert length < 2X read length. This is obviously wrong for things like read alignment, and insert length can only be counted from aligned reads with samtools etc.

Oh good lord I am a novice! For some reason that makes no sense in hindsight, I mistakenly thought you were referring to adding insert lengths to the header of the uBAM, despite you specifically mentioning the BQSR stage. If we can all pretend I never said that, then that would help me sleep at night! I will note down what you've said about the issues of predicting insert length from the FASTQ files without a reference, though.


Aaron, did you already reach some conclusion how to process those new GRCh38 patches so you end up with usable .alt file for bwa-mem?

I've just finished the .alt index for the patches by converting the alignment data for the patches provided by NCBI themselves (sorry this took so long!) so I would like to see how this compares to the .alt indexes I previously generated using Magic-BLAST, minimap2 and just listing the contig names in the .alt index without any extra information. Since I've now managed to run GATKs Best Practices workflow, I would probably need to do another comparison from scratch, which shouldn't be a big issue as I still have the other .alt indexes saved. I have a feeling this will probably be the best .alt index to use, so I don't think a comparison will be particularly important.

It turns out that concatenating the .gff3 files provided by NCBI and using awk, grep, Notepad++ and Excel to convert the CIGAR strings to SAM format and calculate the edit distances (insertions from CIGAR + deletions from CIGAR + mismatches within overlapping regions) was much better than doing it all manually as I was doing previously. It only took me a day to generate the entire .alt index, whereas I could only manage the contigs for one chromosome a day when doing it manually. Learning about awk and grep when splitting my FASTQ files was probably the best thing I've done all week, as it came in really handy. I feel silly for trying to do it all manually back then!

Hopefully I can learn more about Linux bash commands like these, as I feel they may be useful in the future. In particular, it would be handy to learn how to make .sh scripts, so I can perform an entire workflow with one script instead of performing each step separately. When GATK releases their .wdl files for the DRAGEN aligner, it would be handy to copy their commands into a .sh script, as I've only managed to set up Cromwell on my C: drive, which is a small SSD (only 240GB), whereas I failed when trying to set it up on my external HDD, so I would prefer to avoid .wdl scripts.

In any case, here are the links to the .alt files for the NCBI patches and eHOMD genomes (they're missing headers, as I intended them to be concatenated with the existing .alt index for the reference one wishes to use - adding headers to the .alt indexes below is easy to do anyways):

Patches from NCBI: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_assembly_structure/PATCHES/alt_scaffolds/FASTA/alt.scaf.fna.gz
My .alt index (without header) generated from the NCBI .gff3 files: https://drive.google.com/file/d/1JsPHU2n6VlCqpEyEo96Vw0i9DHOslGDf/view?usp=sharing
eHOMD genomes: ftp://ftp.homd.org/genomes/PROKKA/V9.03/fsa/ALL_genomes.fsa
My .alt index (without header) generated from the eHOMD genome .fsa file: https://drive.google.com/file/d/11du8q-R_ESM2COecSVAS8TC8TNkx7Gx3/view?usp=sharing
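
To use these alongside an existing hs38DH-style reference, the concatenation would look roughly like this (file names illustrative; as far as I know, bwa mem only switches on ALT-aware mapping when the .alt file is named after the reference FASTA):

# Append the patch and eHOMD sequences to the reference, concatenate the .alt records,
# then rebuild the index so bwa mem picks up the combined <reference>.alt file.
cat hs38DH.fa > hs38DH_plus_patches.fa
zcat alt.scaf.fna.gz >> hs38DH_plus_patches.fa
cat ALL_genomes.fsa >> hs38DH_plus_patches.fa
cat hs38DH.fa.alt patches_p13.alt ehomd.alt > hs38DH_plus_patches.fa.alt
bwa index hs38DH_plus_patches.fa
samtools faidx hs38DH_plus_patches.fa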

It may be worth making a version with the patch names following the UCSC naming convention, as well as removing fix patches from the .fna file and .alt index, if that's your preference @Donwulff.

This is all new to me, so any feedback is appreciated! :)

teepean47
05-26-2020, 07:25 PM
BioRxiv has a paper about a new mapper called Urmap. "URMAP is an order of magnitude faster than BWA and Bowtie2 with comparable accuracy on a benchmark test using simulated paired 150nt reads of a well-studied human genome."

https://www.biorxiv.org/content/10.1101/2020.01.12.903351v1

https://github.com/rcedgar/urmap

Creating an index requires around 41 gigabytes of memory, and hitting swap slows it down to a crawl. First tests show it is very fast, especially with the "veryfast" option, but it seems to quit without error messages when the source data has errors. The author is very busy with COVID-19-related stuff at the moment.

Adamm
06-25-2020, 08:44 PM
Question:

YSEQ has converted my Dante Labs FASTQ files (FASTQ mapping to hg38) and now I want to get my autosomal DNA results from it (to upload to FTDNA, MyHeritage and other autosomal websites); what steps do I need to take? Do I use WGS Extract to make the autosomal files out of the BAM file that YSEQ delivered to me?

tontsa
06-26-2020, 04:32 AM
You get better autosomal results if you use the hs37d5 BAM file Dante made for you with WGS Extract, as those co-ordinates don't have to be lifted over from hg38 to hg19.

Adamm
06-26-2020, 08:08 AM
You get better autosomal results if you use the hs37d5 BAM file Dante made for you with WGS Extract, as those co-ordinates don't have to be lifted over from hg38 to hg19.

Is that the standard BAM file they delivered?

aaronbee2010
06-26-2020, 08:35 AM
Is that the standard BAM file they delivered?

Yes. They include (or at least, they're supposed to include) a hs37d5 BAM file for you alongside your FASTQ and VCF files.

Adamm
06-26-2020, 08:39 AM
Yes. They include (or at least, they're supposed to include) a hs37d5 BAM file for you alongside your FASTQ and VCF files.

Yes thanks; the problem is that WGSExtract doesn't work with that BAM file. When I try to generate a raw file (MyHeritage or 23andMe format), the program just freezes and doesn't do anything. Do you perhaps have a solution to this?

aaronbee2010
06-26-2020, 09:59 AM
Yes thanks; the problem is that WGSExtract doesn't work with that BAM file. When I try to generate a raw file (MyHeritage or 23andMe format), the program just freezes and doesn't do anything. Do you perhaps have a solution to this?

Normally, the "please wait" window in WGSExtract is supposed to freeze until the autosomal file is generated. Now, I have an issue that occurs more often than not when I try to generate an autosomal file. The "please wait" window pops up, but when I look at the Windows Task Manager, I see that nothing is actually happening. Every time this happens, I have to keep restarting WGSExtract and trying again until I see a "Python" process in my Task Manager taking around 16.5% of my CPU, which tells me that it's now working. I'm not sure if this is the issue that you may be having?

Adamm
06-26-2020, 10:34 AM
Normally, the "please wait" window in WGSExtract is supposed to freeze until the autosomal file is generated. Now, I have an issue that occurs more often than not when I try to generate an autosomal file. The "please wait" window pops up, but when I look at the Windows Task Manager, I see that nothing is actually happening. Every time this happens, I have to keep restarting WGSExtract and trying again until I see a "Python" process in my Task Manager taking around 16.5% of my CPU, which tells me that it's now working. I'm not sure if this is the issue that you may be having?

Now I've managed to get it to work; the problem is that it generates the file, but it's empty. It only shows this:

https://i.imgur.com/Yr0UffD.png

aaronbee2010
06-26-2020, 11:12 AM
Now I've managed to get it to work; the problem is that it generates the file, but it's empty. It only shows this:

https://i.imgur.com/Yr0UffD.png

I don't recall anything like that happening with me! Did anything suspicious pop up on the command prompt that opens alongside the WGSExtract program?

Looking at the extract23 bash script that WGSExtract uses internally, the main commands performed therein are:

1. samtools mpileup -C 50 -v -l ${REF_23ANDME} -f ${REF} ${BAMFILE_SORTED} > 23andMe_raw.vcf.gz (finding and recording parts of the BAM file that correspond to coordinates present in genotyping chips)

2. bcftools call -O z -V indels -m -P 0 23andMe_raw.vcf.gz > 23andMe_called.vcf.gz (uses certain parameters to make position calls from the previous file [i.e. AA, CT or -- (no-call)])

3. bcftools annotate -O z -a ${REF_23ANDME} -c CHROM,POS,ID 23andMe_called.vcf.gz > 23andMe_annotated.vcf.gz (adds rsIDs present in chip genotyping files)

4. bcftools query -f '%ID\t%CHROM\t%POS[\t%TGT]\n' 23andMe_annotated.vcf.gz | sed 's/chr//' | sed 's/\tM\t/\tMT\t/g' | sed 's/\///' | sed 's/\.\.$/--/' | sed 's/TA$/AT/' | sed 's/TC$/CT/' | sed 's/TG$/GT/' | sed 's/GA$/AG/' | sed 's/GC$/CG/' | sed 's/CA$/AC/' > 23andMe_V3_hg19.tab (converts the previous vcf.gz file to a 23andMe-style tabulated file and performs some string replacement using "sed" for compatibility)

5. sort -t $'\t' -k2,3 -V 23andMe_V3_hg19.tab > 23andMe_V3_hg19_sorted.tab (sorts tabulated file by chromosomal coordinates)

6. cat header.txt 23andMe_V3_hg19_sorted.tab >> ${OUTFILE} (adds 23andMe-style header to tabulated file)

All I can see from your picture is that step 6 worked fine, but something appears to have gone wrong in at least one of the previous 5 steps. My gut feeling is that it's either step 1 or 2 that failed in this case, as the other steps are simpler in nature and I feel an error in those is a lot less likely. Would you be able to try and get this going again and see what the command prompt says?
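
If it happens again, a couple of quick sanity checks might also help narrow down whether step 1 even sees the BAM properly (the file name is just a placeholder): is the BAM intact, and do its contig names (e.g. "1" vs "chr1") match what the script's template expects?

samtools quickcheck -v sample.bam && echo "BAM looks intact"
samtools view -H sample.bam | grep '^@SQ' | head    # contig naming in the BAM header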

Adamm
06-26-2020, 11:25 AM
Only this happens:

https://i.imgur.com/Fal9Uph.png

And it stays like this.

Normally how long would you need to wait in order for it to generate an autosomal file?

aaronbee2010
06-26-2020, 12:10 PM
Only this happens:

https://i.imgur.com/Fal9Uph.png

And it stays like this.

Normally how long would you need to wait in order for it to generate an autosomal file?

Quite a long time. I haven't counted exactly how long, but for a 30x BAM file, it's generally taken me 1-2 hours, and that's using 16.5% of a quad core Core i7 with a mild overclock.

All I can think of on your end is deleting the WGSExtract folder and re-downloading it (this has helped whenever the program would quit upon me selecting a BAM file). Apart from that, I'm at a loss here :(

Adamm
06-26-2020, 12:35 PM
Quite a long time. I haven't counted exactly how long, but for a 30x BAM file, it's generally taken me 1-2 hours, and that's using 16.5% of a quad core Core i7 with a mild overclock.

All I can think of on your end is deleting the WGSExtract folder and re-downloading it (this has helped whenever the program would quit upon me selecting a BAM file). Apart from that, I'm at a loss here :(

Thanks! I will do that.

teepean47
06-26-2020, 10:30 PM
Thanks! I will do that.

Cleaning WGSExtract\temp folder might help as well.

Adamm
06-27-2020, 01:04 PM
Cleaning WGSExtract\temp folder might help as well.

Yes, I've downloaded WGSExtract again from scratch and now it works; thanks everyone for the input! :)

Adamm
07-11-2020, 12:47 PM
Hi all, so I've been added to Yfull only I don't understand one thing:

https://i.imgur.com/VpQv8F9.png

As you can see I'm at the top of E-CTS12227 but according to YSEQ I got the following results:

E-L19 > E-M81 > E-PF2548 > E-Y8827 > E-PF2546 > E-CTS12227 > E-MZ284


And 'E-MZ284' is the equivalent of E-Y177201 on the haplo-tree (https://www.yfull.com/tree/E-Y177201/)

So how come I've not been put into that branch and instead I'm at the top?

Ibericus
07-11-2020, 01:48 PM
According to this tree from FTDNA's E-M81 Project, Y177201 is actually a subclade of MZ284.


https://docs.google.com/spreadsheets/u/0/d/1die-cBnThrGT3ysMnvEl91B6HqdF4eiuH8YBB3vwaZA/htmlview?pli=1#gid=314601745

Adamm
07-11-2020, 01:53 PM
According to this tree from FTDNA's E-M81 Project, Y177201 is actually a subclade of MZ284.


https://docs.google.com/spreadsheets/u/0/d/1die-cBnThrGT3ysMnvEl91B6HqdF4eiuH8YBB3vwaZA/htmlview?pli=1#gid=314601745

Shouldn't that mean I should be placed into that branch on YFull?

Ibericus
07-11-2020, 02:01 PM
Well, now that I check carefully I see Y177920, but not Y177201. Maybe nobody is quite sure where to place it, I don't know...

Adamm
07-11-2020, 02:36 PM
Well, now that I check carefully I see Y177920, but not Y177201. Maybe nobody is quite sure where to place it, I don't know...

I e-mailed YFull and they told me they will place me in a new branch in the upcoming tree update. Thanks for helping me!

teepean47
10-28-2020, 11:44 AM
I was testing the latest bwa-mem2 release (2.1) and memory requirements are now reasonable. Indexing hs37d5.fa still takes around 80 gigabytes but the resulting index is around 16 gigabytes so mapping can now be done on machines with 32 gigabytes of memory.
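
For anyone wanting to try it, the command line mirrors classic bwa; roughly (thread counts and file names are just examples):

# Index once (this is the memory-hungry step), then map and sort as with bwa mem.
bwa-mem2 index hs37d5.fa
bwa-mem2 mem -t 16 -K 100000000 -Y hs37d5.fa R1.fastq.gz R2.fastq.gz \
| samtools sort -@ 8 -o sample.hs37d5.bam -
samtools index sample.hs37d5.bam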

aaronbee2010
10-28-2020, 08:09 PM
I was testing the latest bwa-mem2 release (2.1) and memory requirements are now reasonable. Indexing hs37d5.fa still takes around 80 gigabytes but the resulting index is around 16 gigabytes so mapping can now be done on machines with 32 gigabytes of memory.

How does BWA-MEM2 handle ALT contigs (e.g. hs38DH) now? This review (https://blog.dnanexus.com/2020-03-10-bwa-mem2-review/) reported discrepancies between BWA-MEM and BWA-MEM2 alignments for an earlier version of BWA-MEM2.

On a somewhat related note, GATK-DRAGEN is apparently due to release early next month in the next GATK release, so you may want to take a look at that.

teepean47
10-28-2020, 09:41 PM
How does BWA-MEM2 handle ALT contigs (e.g. hs38DH) now? This review (https://blog.dnanexus.com/2020-03-10-bwa-mem2-review/) reported discrepancies between BWA-MEM and BWA-MEM2 alignments for an earlier version of BWA-MEM2.

As far as I know, bwa-mem2's output should be identical to bwa mem's; an issue from June confirms that, and the authors updated their article:


Update June 11th, 2020:
After our initial work for BWA-MEM2 evaluations, one of the developers, Vasimuddin Md (@wasim_galaxy), had contacted us about a number of improvements and shared with us the latest BWA-MEM2 binary on May 22nd, 2020. We have some updated information to share after testing out the suggested version and running parameters.

With the latest binary, we revisited our discrepancy analysis between BWA-MEM and BWA-MEM2 with the three samples (SRR10150660, SRR10286881, and SRR10286930) that had the largest differences in our original post. We found out that the mapping results of BWA-MEM2 are exactly the same with BWA-MEM using this latest version.

The identical result is discovered when aligning against both hg38 builds, hs38DH (GRCh38 primary contigs + decoy contigs + ALT contigs and HLA genes), as well as GCA_000001405.15_GRCh38_no_alt_analysis_set (GRCh38 primary contigs + decoy contigs, but no ALT contigs nor HLA genes). Additionally, while the per sample wall clock runtime is slightly longer with a range of 10%, BWA-MEM2 still gives the advantage of being around ~2X faster to its precedence, BWA-MEM, as we have found originally.

Additionally, three of our previously failed BWA-MEM2 analysis, which were reported on bwa-mem2 github page (35, 49, and 50), can now be finished successfully by utilizing a lower number of thread usage (with bwa-mem2 option -t), or by increasing the memory capacity.



https://github.com/bwa-mem2/bwa-mem2/issues/61

DRAGEN-GATK seems very interesting but I suspect the speed will be at the same level as GATK4 unless they have implemented a multi-core/cpu capability that existed in GATK3.

EDIT: Looks like some of the DRAGEN improvements are already committed to master:

https://github.com/broadinstitute/gatk/commit/48afe160c9cfba5a82e40a6be9c8a555066271d1

aaronbee2010
10-29-2020, 05:14 AM
As far as I know, bwa-mem2's output should be identical to bwa mem's; an issue from June confirms that, and the authors updated their article:



https://github.com/bwa-mem2/bwa-mem2/issues/61

DRAGEN-GATK seems very interesting but I suspect the speed will be at the same level as GATK4 unless they have implemented a multi-core/cpu capability that existed in GATK3.

EDIT: Looks like some of the DRAGEN improvements are already committed to master:

https://github.com/broadinstitute/gatk/commit/48afe160c9cfba5a82e40a6be9c8a555066271d1

Somehow I completely missed the update at the top, so thanks for pointing that out. I'm not sure whether the update was there before I last viewed the article or not, but maybe checking the top of an article for updates would be a good idea :P

There may be a difference in speed due to DRAGEN's algorithm (at least the part that replaces BWA-MEM), but the version due to be released on GitHub won't be hardware-accelerated, although the GATK-DRAGEN pipeline removes the need for both BQSR and VQSR, so there should already be improvements there. I'm not sure if this version of the DRAGEN mapper will have duplicate marking built in like Illumina's proprietary version does, though. It doesn't seem to be the case judging by the diagram below (from their webinar), but I could be wrong on this:

https://i.gyazo.com/126df24ae015997595cbc55412c4e540.png

Also thought I should include this too:

https://i.gyazo.com/0e317a1162127404838438e881bdf083.png

teepean47
10-29-2020, 06:19 PM
I compiled the latest from the GATK GitHub and it does have DRAGEN mode when using HaplotypeCaller (--dragen-mode true). I could not find the mapper to test, though. The extra DLLs in the folder enable Intel GKL on Windows-based machines.

https://drive.google.com/drive/folders/1jFax6MYv28qPlVOaxTb4U8yGpO5N9IL6?usp=sharing
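
A basic invocation with the new flag looks roughly like this (reference and file names are placeholders):

# HaplotypeCaller with the DRAGEN-GATK error/genotyping model enabled.
gatk HaplotypeCaller \
  -R Homo_sapiens_assembly38.fasta \
  -I sample.hg38.bam \
  --dragen-mode true \
  -O sample.dragen.vcf.gz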

aaronbee2010
10-29-2020, 09:45 PM
I compiled the latest from the GATK GitHub and it does have DRAGEN mode when using HaplotypeCaller (--dragen-mode true). I could not find the mapper to test, though. The extra DLLs in the folder enable Intel GKL on Windows-based machines.

https://drive.google.com/drive/folders/1jFax6MYv28qPlVOaxTb4U8yGpO5N9IL6?usp=sharing

I might have been slightly mistaken then (apologies :P). The mapper itself might just be released alongside GATK as a separate repository, rather than within the GATK repository or executable, whereas the DRAGEN mode for HaplotypeCaller appears set to ship within GATK itself. Time will tell, I guess.

Donwulff
11-03-2020, 02:33 PM
To be fair, BWA-GATK has already been moving away from BQSR and VQSR. For example, in Tian, S. et al. 2016, Impact of post-alignment processing in variant discovery from whole exome data, BMC Bioinformatics 17(1), 403: "BQSR had virtually negligible effect on INDEL calling and generally reduced sensitivity for SNP calling that depended on caller, coverage and level of divergence. Specifically, for SAMtools and FreeBayes calling in the regions with low divergence, BQSR reduced the SNP calling sensitivity but improved the precision when the coverage is insufficient." (The study is about exomes, and WGS has many lower-coverage areas, so be careful.) And of course, VQSR has indeed been replaced with the convolutional neural net https://gatk.broadinstitute.org/hc/en-us/articles/360050815552-CNNScoreVariants in the latest iterations of GATK Best Practices.

I believe the Broad Institute have themselves stated on their forums that they consider these steps generally unnecessary, but to test & validate for each workflow and input type. As one should, in general. Of course, a large concern is compatibility with earlier-processed samples. Imagine you had 10,000 samples with BQSR, no budget to re-process them, and analysed another 10,000 without BQSR. Now whatever condition the first 10,000 samples had would look like it was associated with all the variants that BQSR brought up. This hits, to a lesser degree, even if you're just analysing your single sample with Promethease. But with the financially expedited switch to DRAGEN, the results are going to change anyway. But yeah, so far the "open source DRAGEN" seems disturbingly vaporware-like with no sightings, although they do keep talking about it.

In general, I don't think BWA-MEM2 or DRAGEN offer a solution for at-home genome processing, as it seems they may be even more resource-hungry. And if you're only analysing a handful of samples, performance isn't really an issue. Using a well-made cloud platform seems like the best bet; in that case you won't even need to care much about performance or resources/hardware, and you get a low learning curve and standardized results.

I'm more interested in any possible improvements from DRAGEN; I believe their blog said that SNP results should be comparable to BWA-GATK, but indel calling would be improved. Then again, by then we should all be using long-read sequencing, though there's probably a sweet spot for short-read indels. BWA-MEM2 and Minimap2 seem to indicate that not everybody sees DRAGEN as the only game in town. But time will tell, when the first independent publications on the actual implementation start to come in (but first, the implementation...).

Speaking of changes, GRCh38.p14 should hit any day now; there aren't too many working days of "second half of 2020" left. Of course, most people don't use the patches for analysis. I'm wondering if it will get to incorporate many of the changes from the CHM13 telomere-to-telomere long-read human reference that just got finished & published, or if that was too late to hit p14 and we'll have a few years to wait... One could map to CHM13 v1, but the genomic coordinates are different, I haven't seen a single liftover chain, and of course you should probably mask repetitive regions etc. as for the official GRCh38 reference. I want to see what difference a "complete" (how many times have people said the human genome is now complete?) reference makes.

Donwulff
11-03-2020, 03:18 PM
If you REALLY need the performance, something different to consider: https://developer.nvidia.com/clara-parabricks
https://doi.org/10.5808/GI.2020.18.1.e10
"Parabricks was able to process a 50× whole-genome sequencing library in under 3 h and Sentieon finished in under 8 h, whereas GATK v4.1.0 needed nearly 24 h. These results were achieved while maintaining greater than 99% accuracy and precision compared to stock GATK."

However, with the GATK move to Illumina's DRAGEN it will be interesting to see how Sentieon and NVIDIA Parabricks might respond. For that matter, BGI might have a stake in the matter as well, although Illumina may have made Dante Labs an offer they couldn't refuse, so non-Illumina sequences seem like a rarity in the enthusiast sphere for the time being. But first we need to actually GET the open-source DRAGEN implementation, which isn't being developed in the open, and perhaps some unaffiliated comparisons of performance, accuracy and precision. (Recalling that benchmarks tend to be at least partly based on BWA-GATK pipeline calls!)

By the way, at https://www.ncbi.nlm.nih.gov/grc/human/issues all current GRCh38.p14 issues are Resolved, and there are already 9 issues open for p15, so I'm guessing they're indeed putting p14 through the final release and quality-assurance steps. I might be mistaken, but I don't think either issues list is yet incorporating anything specifically from https://github.com/nanopore-wgs-consortium/CHM13 v1.0. The NCBI re-mapping service https://www.ncbi.nlm.nih.gov/genome/tools/remap/docs/whatis supports CHM13 Draft coordinates, so I hope they will add v1.0 pronto. https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/ is another one to watch, but I don't think any of the versions are there. The procedure for generating liftover files is a bit, err, esoteric, although it shouldn't be hard in principle. I should probably just work on finding the novel sequences and adding them as "decoys" though; that should be done in any event. But the CHM13 telomere-to-telomere reference is an appealing reference for long-read sequencing and a re-assembly starting point.

Donwulff
11-03-2020, 04:37 PM
Bit of a wall of text, but looking this over again (I think I posted some of it dozens of posts ago):
https://gatk.broadinstitute.org/hc/en-us/articles/360039984151-DRAGEN-GATK-Update-Let-s-get-more-specific

"To be clear though, we don't anticipate that the work we're doing together is going to substantially accelerate the software-only version of the tools that we distribute as GATK. If you have a need for speed on that order of magnitude, you're going to want to check out Illumina's commercial DRAGEN offering, which includes the hardware-accelerated version. But what this collaboration does offer is to make available some specific accuracy improvements in the software-only open-source version (labeled DRAGEN-GATK). The good news: the outputs of the two versions (pure software vs hardware-boosted) will be functionally equivalent and compatible for cross-dataset analyses."

If taken literally, that means the open-source, software-only DRAGEN would be slower than BWA-MEM, since the pipeline does away with some of the GATK post-processing yet isn't expected to be any faster overall. However, there's also the question of what is meant by "tools that we distribute as GATK": does that include DRAGEN or not? In any case, they've told us to expect only accuracy improvements. The hardware-accelerated version is, of course, the appeal of this for Illumina (who already have DRAGEN-GATK on their cloud platform).

"You can see the very clear gains in indel and heterozygous SNP calling precision of the DRAGEN pipeline (blue) compared to our original GATK 4.1.1.0 single-genome pipeline with 2D CNN filtering (orange), as well as a modest but always welcome increase in indel sensitivity."

They've updated the blog post, and this is actually news to me, as I earlier reported that I believed they said the SNP results were comparable to BWA-GATK. Now they're specifically mentioning heterozygous SNP calling precision, and even compared against the new 2D Convolutional Neural Net Best Practices. That is indeed nice, and something to look forward to, as well as something for Sentieon, Parabricks and BGI to potentially respond to.

"DRAGEN team found two areas of opportunity that were ripe for the picking: they decided to improve the indel error model to cope with PCR-induced stutter in STR regions, and refine the genotyping model by taking into account the presence of correlated pileup errors."

The STR-stutter handling would of course be valuable for genealogical analyses like those performed by YFull, although I do not know what kind of pipeline they're currently using for calling STRs. FTDNA has their own in-house pipeline too, I think, although I seem to recall they were making some changes with the latest Big Y offerings (I should really check these things when writing). Either way, moving to DRAGEN would be an option for both of them. (It took quite a while for FTDNA to re-process the samples to GRCh38, and YFull doesn't get any extra money from re-analysis, but if the performance and price are right...)

https://gatk.broadinstitute.org/hc/en-us/articles/360049493892-DRAGEN-GATK-update-Upcoming-webinar-and-pipeline-release-timeframe

"After weathering some delays due to the COVID-19 pandemic, we are now expecting to be able to release the full open-source software version of this first DRAGEN-GATK pipeline in early November of this year"

Which should be... any day now, too. Of course, slipping schedules are pretty common, and I'm still not sure what to think of open-source projects being developed behind closed doors. Why not release early, to make sure issues are ironed out before the first "production" release? It seems odd, but I guess we shall see soon.

teepean47
11-03-2020, 05:12 PM
If you REALLY need the performance, something different to consider: https://developer.nvidia.com/clara-parabricks


Nvidia's GenomeWorks passes all of the tests even on a consumer GPU, but I was wondering if their version of GATK is freely available anywhere.

aaronbee2010
01-29-2021, 09:56 PM
GATK New Year update (https://gatk.broadinstitute.org/hc/en-us/articles/360055062832-New-Year-update-GATK-takes-on-2021)

Regarding DRAGEN-GATK:


As mentioned previously (https://gatk.broadinstitute.org/hc/en-us/articles/360039984151-DRAGEN-GATK-Update-Let-s-get-more-specific), we've been working with the DRAGEN team at Illumina on a joint DRAGEN-GATK Best Practices pipeline for calling short germline variants that will replace the current Best Practices for that use case. In our most recent release estimate, we predicted that the new pipeline would be ready in November 2020. Obviously that hasn't happened, so here's what's going on.

The new pipeline will use DRAGMAP instead of BWA for genome mapping and alignment, which was originally implemented on the proprietary DRAGEN hardware. Since a key goal of the DRAGEN-GATK collaboration is to provide a fully open-source version of the joint pipeline, the DRAGEN team has been working on reimplementing DRAGMAP as an open-source software project. It's taking a bit longer than we expected, which is not unusual for the very first implementation of a rather sophisticated tool, so the release has been delayed as a result.

At this time it's difficult to predict exactly how much longer it will be, but I'm confident the release will happen sometime in the first half of 2021. In the meantime, we're looking into whether it would make sense for folks to already proceed with evaluations of the downstream changes to the pipeline (mainly HaplotypeCaller's new DRAGEN mode) using data aligned with BWA. Stay tuned for more on that.
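For anyone wanting to try the "downstream changes" ahead of the full release, later GATK builds expose a DRAGEN mode on HaplotypeCaller. A rough sketch of what evaluating it on BWA-aligned data might look like (the --dragen-mode flag name is taken from documentation for later 4.x releases, so treat it as an assumption and check your version; file names are placeholders):

gatk HaplotypeCaller -R GRCh38.fa -I bwa_aligned.bam -O sample1.dragen_mode.vcf.gz --dragen-mode true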