Page 3 of 5 FirstFirst 12345 LastLast
Results 21 to 30 of 50

Thread: Dante Labs WES/WGS Sequencing Technical

  1. #21
    Registered Users
    Posts
    174
    Sex

    I would expect the default read name regex will likely parse number until the first non-numeric character, but again not sure without testing. The BAM file parameter you quote (Though the "quick reply" option on the forum messes up the formatting) suggests it may have to be set manually.

    Actually the script I have on GitHub is always work in process, the regex for Dante Labs currently doesn't work because of wrong quotation marks. Presently testing it through and updating the script. However, as I discussed on the main Dante Labs & this thread, after earlier testing, the read name regex doesn't really affect the variant calling. It affects some statistics calculated, and possibly slight effect on performance. Though setting it right will avoid the warning, and be cleaner.

  2. The Following User Says Thank You to Donwulff For This Useful Post:

     Petr (04-15-2018)

  3. #22
    Registered Users
    Posts
    174
    Sex

    I downloaded an FGC sample off Personal Genomes Project to check the script, however unfortunately it didn't have the Y-Elite style "FCC63CUACXX:4:2206:18663:2479#" read names. Unsurprisingly, I guess, looking at your list above the FGC WGS sample had normal read names. I might try to simulate those just to see if the normal regex will catch them.

    Y-chromosome only sequences have another issue in analysis down the line. Normally, the Base Quality Recalibration is ran over chromosome 22, the shortest autosomal chromosome, finding patterns for the "expected" variants. Of course, when you sequence the Y chromosome only, there aren't enough reads for chromosome 22 to do that. I'm not sure anybody has actually done a study what the best strategy would be in that case - male-specific region of the Y chromosome doesn't recombine, so the "expected" variants depend on the haplogroup of the sample. Nevertheless, the dbSNP database of known variants does include many Y chromosome variants, though the YBrowse list of phylogenetically informative SNP's might be better because they're more likely to be correct. Looking at the discussed pipeline at http://www.it2kane.org/2016/10/y-dna...rkflow-pt-1-1/ though, unless I'm missing something, it looks like it doesn't use the dbSNP file for base quality recalibration at all. On that note, I've also not really seen any research into what, if any, effect of the dbSNP release used could have (Latest GATK instructions seem to say "most recent dbSNP", in the past I think they recommended the version where 1KG variants were added, which was probably before a lot false positive and rare variants got added).

    This has given me some interesting ideas in the past though, because the non-recombinatorial naturre of MSY means that you pretty much know most of the SNP's you're expecting to get. So if you've got a good phylogenetic reference, your base-quality recalibration could be quite exact since you would have pretty good "truth set" available with very few novel SNP's, both for training the recalibration, and for checking the results. Though, you'd probably need to also limit your recalibration to the general high confidence region, since palindromic and euchromatic regions will have a lot of errors. With a WGS, maybe even check if the recalibration differs for Y chromosome.

    Which mostly leaves me wondering if anybody has done anything of the sort before. It could take a while to test all the options for enough genomes to see if there's significant effect.

  4. The Following 2 Users Say Thank You to Donwulff For This Useful Post:

     karwiso (04-19-2018),  Petr (04-19-2018)

  5. #23
    Registered Users
    Posts
    67
    Sex
    Location
    Stockholm, Sweden
    Ethnicity
    Ingrian + Russian
    Nationality
    Swedish, Russian
    Y-DNA
    R1a1a1b1:Z92FGC46016
    mtDNA
    J1c3k

    Finland Russian Federation Sweden
    Hi!

    I am just learning to align myself.
    I have a question about reference genome sequence - which one do you use?
    There is a recommendation here, but there are more updated like 92 release of Ensembl, which is updated to the patch 12 for GRCh38. I have tried NCBI's GRCh38 patch 12, but it uses accessions numbers instead of chromosomes.
    Is there any point to try to use the reference genome with the most recent patch or should I/we stick just with the unpatched reference?

  6. #24
    Registered Users
    Posts
    174
    Sex

    Things like GEDMatch, Promethease and annotation sources one could use currently stick to the primary assembly, ie. 1-22, X, Y and possibly MT. As such one doesn't currently gain more information from using another reference genome sequence. Additionally, the Broad Institute GATK Variant Quality Score Recalibration only really works for diploid primary assembly (That's 1-22). Yet another consideration is indeed all other tools and statistics are based on one of the official reference sequences; a genome mapped to another reference sequence isn't entirely comparable, because some reads will map to the different/alternative contigs instead.

    I put on my Git tree linked somewhere before a script to build the latest GRCh38 reference sequence based on the analysis ready set with UCSC names for the contigs UCSC has officially released, decoy sequence, and latest Human Leukocyte Antigens mostly because I'm experimenting with it on my ongoing Y-chromosome and metagenome experimenting. In principle, those contigs should act as sink for reads that actually belong elsewhere on the human genome, allowing picking Y-chromosomal/metagenome reads apart. There are bound to be a number of issues though, so I'm definitely calling that experimental.

    Here's a blog from Heng Li, author of the BWA aligner, on which reference genome to use that's really good: http://lh3.github.io/2017/11/13/whic...-genome-to-use

    Of course, the next question is going to be just whether to use GRCh38 or hg37 build. Resources are slowly as ever moving to GRCh38, so currently it's kind of a chaos where you'll find some services require hg37 and other require GRCh38 based calls. It's probably most beneficial to process both currently, and will have the additional benefit of checking uncertain variants on both.

  7. The Following User Says Thank You to Donwulff For This Useful Post:

     karwiso (05-16-2018)

  8. #25
    Registered Users
    Posts
    67
    Sex
    Location
    Stockholm, Sweden
    Ethnicity
    Ingrian + Russian
    Nationality
    Swedish, Russian
    Y-DNA
    R1a1a1b1:Z92FGC46016
    mtDNA
    J1c3k

    Finland Russian Federation Sweden
    I have been trying to use BaseRacalibrator with files from 1000Genomes: http://ftp.1000genomes.ebi.ac.uk/vol...erence_genome/ and other mappings resources as Mills gold standard indels and ALL wgs 1000 phase 3 indels, but I have been getting constant errors about incompatible contigs. Looking in Mills gold standard indels I understood that it is marked as b38 in the file name, but naming convention and reference genome used in there is still GRCh37, or b37. I have tried to re-index and to re-sort, but there were just more errors about mis-formatted vcf files.

    I have dowloaded GATK bundle from ftp://ftp.broadinstitute.org/bundle/hg38/
    1000G_phase1.snps. high_confindence.hg38.vcf.gz, dbsnp_146.hg38.vcf.gz and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz were accepted by BaseRecalibrator just fine.

    All files must be in the same directory. fa och fasta files should have exactly the same name and extension as fa.alt or fasta.alt file. If fa och fasta end with .gz it is better to unpack it, and then index it with bwa:

    Code:
    ./bwa index /...../reference.fasta
    I have done all steps without shell scripting and piping. Mainly because I wanted to learn and see what and exactly where could go wrong. I also start with FASTQ files, not the aligned BAM. I will publish my processing steps and comments.
    Last edited by karwiso; 06-11-2018 at 03:41 PM.

  9. #26
    Registered Users
    Posts
    67
    Sex
    Location
    Stockholm, Sweden
    Ethnicity
    Ingrian + Russian
    Nationality
    Swedish, Russian
    Y-DNA
    R1a1a1b1:Z92FGC46016
    mtDNA
    J1c3k

    Finland Russian Federation Sweden

    Processing FASTQ paired reads to SAM files

    Each pair of FASTQ files is processed with bwa ALT-aware alignment and eight sam files are produced:
    Alternative A:
    Code:
    ./bwa mem -t 6 -r 1.0 -T 20 -C -M -R "@RG\tID:CL1000....._L01_81\tPL:COMPLETE\tPU:CL1000....._L01_81\tLB:WH.....-81\tSM:1500.....7A\tCN:BGI" /mnt/ramdrv/grch38alt.fa /mnt/ramdrv/CL1000....._L01_81_1.fq.gz /mnt/ramdrv/CL1000....._L01_81_2.fq.gz > seqpair1.sam
    Alternative B:
    Code:
    /dnatools/bwa/bwa mem -t 6 -k 15 -r 1.0 -T 20 -C -M -R "@RG\tID:CL1000....._L01_81\tPL:COMPLETE\tPU:CL1000....._L01_81\tLB:WH.....-81\tSM:1500.....7A\tCN:BGI" /dnatools/reference/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta CL1000....._L01_81_1.fq.gz CL1000....._L01_81_2.fq.gz > seqpair1.sam
    Adjust _L01_81_ according to your FASTQ file names and for each processed pair of files. Think about adjusting it in the @RG description as well!

    -t specifies number of threads that could be used for processing. 8 threads are possible for my CPU, but using them all can almost lock the PC, using 6 threads is ok and it possible to do other work (CPU Intel i7-6700K 4.0 GHz). FATSQ files are loaded to ramdrive together with reference genome files, output is written to hybrid 2xHDD in RAID1. Processing time – approximately 1.5 hour per pair (-r 1.0, Alternative A).

    It is best to have the path to the reference sequence fa file. Just indexing fa file and referring to the index name is not working for ALT aware alignment. One can probably copy fa and fa.alt to bwa directory, but I haven’t experimented with it.

    Read group information could be added already at this step. I think it is then reasonable to skip merging aligned and unaligned BAMs. Read group parameters are added after the -R option.

    -T option is 30 by default and is for the quality of alignments. I have adjusted it to 20, that should give a bit more alignments but with lower quality. I think it could be reasonable for genealogical purposes, but an alignment for medical purposes should be more stringent.

    Options -k and -r adjust seed sizes and reseeding, the defaults are 19 and 1.5. Lower values can give a better accuracy but increase the processing time. I adjusted them to “-k 15 -r 1.0”, but it significantly increased processing time, approximately 5-6 hours per pair (Alternative B ). The size of produced SAM files is slightly bigger than in Alternative A - +1.5-2Gb. Probably it is not worth it and one can just stick with the defaults values.

    The output:
    Code:
    [M::bwa_idx_load_from_disk] read 3171 ALT contigs
    [M::process] read 600000 sequences (60000000 bp)...
    [M::process] read 600000 sequences (60000000 bp)...
    [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (6, 234754, 9, 2)
    [M::mem_pestat] skip orientation FF as there are not enough pairs
    [M::mem_pestat] analyzing insert size distribution for orientation FR...
    [M::mem_pestat] (25, 50, 75) percentile: (188, 223, 258)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (48, 398)
    [M::mem_pestat] mean and std.dev: (223.23, 53.29)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 468)
    [M::mem_pestat] skip orientation RF as there are not enough pairs
    [M::mem_pestat] skip orientation RR as there are not enough pairs
    [M::mem_process_seqs] Processed 600000 reads in 483.206 CPU sec, 80.696 real sec
    The messages about skipping orientations are normal. They are more for information, don’t care about them. See “read 3171 ALT contigs” - if reference genome FASTA file is not indexed properly you will probably get “read 0 ALT contigs”. Not a big deal, but you are not aligning to alternative sequences, rather primary assembly only.
    Last edited by karwiso; 06-11-2018 at 04:14 PM.

  10. #27
    Registered Users
    Posts
    174
    Sex

    Quote Originally Posted by karwiso View Post
    I have been trying to use BaseRacalibrator with files from 1000Genomes: http://ftp.1000genomes.ebi.ac.uk/vol...erence_genome/ and other mappings resources as Mills gold standard indels and ALL wgs 1000 phase 3 indels, but I have been getting constant errors about incompatible contigs. Looking in Mills gold standard indels I understood that it is marked as b38 in the file name, but naming convention and reference genome used in there is still GRCh37, or b37. I have tried to re-index and to re-sort, but there were just more errors about mis-formatted vcf files.

    I have dowloaded GATK bundle from ftp://ftp.broadinstitute.org/bundle/hg38/
    1000G_phase1.snps. high_confindence.hg38.vcf.gz, dbsnp_146.hg38.vcf.gz and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz were accepted by BaseRecalibrator just fine.
    I'm not sure if that's a question... the GATK resource bundle https://software.broadinstitute.org/...ownload/bundle (Your FTP link won't work) is definitely hg38, and I've been using it successfully with GRCh38 based pipelines. It's not exactly clear what you're using as reference genome FASTA/version, however. There are different naming conventions for the contigs, not only UCSC "chr1" vs. "1", but the "raw" GRCh38 also uses sequence identifiers. One point of problem is the Mills indels file contains HLA-contigs in the header, and the lengths of the HLA contigs vary between HLA reference-build. It may be necessary to filter them out, although I'm not sure there are references with different HLA alleles floating around.

  11. The Following User Says Thank You to Donwulff For This Useful Post:

     karwiso (06-23-2018)

  12. #28
    Registered Users
    Posts
    67
    Sex
    Location
    Stockholm, Sweden
    Ethnicity
    Ingrian + Russian
    Nationality
    Swedish, Russian
    Y-DNA
    R1a1a1b1:Z92FGC46016
    mtDNA
    J1c3k

    Finland Russian Federation Sweden

    Merging SAM files, postalt processing and BAM file

    Merging SAM files to one file
    I ran into problems when merging my seqpairX.sam files. Java was running out of memory and some other complains from picard. After upgrading my system I still had the same problems. Then I tried to see file headers for the files (samtools view -H seqpairX.sam) and discovered that one sam file was corrupted (don’t know why). Then I just realigned corresponding fastq files and the process ran smoothly.

    Code:
    java -Xmx55g -jar /dnatools/picard/picard.jar MergeSamFiles I=seqpair1.sam I=seqpair2.sam I=seqpair3.sam I=seqpair4.sam I=seqpair5.sam I=seqpair6.sam I=seqpair7.sam I=seqpair8.sam O=/home/user/Dokument/jjs_alt_merged.sam USE_THREADING=true SORT_ORDER=unsorted TMP_DIR=/home/user/Dokument/ MAX_RECORDS_IN_RAM=10000000
    -Xmx55g option is setting the limit for memory usage, 55G in this case.
    TMP_DIR should be set if you don't have unlimited RAM, but it is not necessary here. I have it for the case I go back and modify something and probably a temporary directory is needed then.

    Processing with bwa-postalt.js
    After the merging of FASTQ files the sam-file is processed with bwa-postalt.js in bwakit (should be downloaded):

    Code:
    /dnatools/bwa/bwa.kit/k8 /dnatools/bwa/bwa.kit/bwa-postalt.js -p /dnatools/bwa/bwa.kit/resource-GRCh38/hs38DH-extra.fa /dnatools/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta.alt jjs_alt_merged.sam > jjs_alt_merged_postalt.sam
    Sorting SAM and producing BAM file

    Code:
     java -Xmx55g -jar /dnatools/picard/picard.jar SortSam I=jjs_alt_merged_postalt.sam O=jjs_alt_merged_postalt_sorted.bam SORT_ORDER=coordinate MAX_RECORDS_IN_RAM=10000000 TMP_DIR=/home/user/Dokument/
    MAX_RECORDS_IN_RAM 10 millions work well for me, 20 and 15 millions were able to start, but processing was aborted after some time.
    Last edited by karwiso; 06-23-2018 at 09:59 AM. Reason: some additional information

  13. #29
    Registered Users
    Posts
    67
    Sex
    Location
    Stockholm, Sweden
    Ethnicity
    Ingrian + Russian
    Nationality
    Swedish, Russian
    Y-DNA
    R1a1a1b1:Z92FGC46016
    mtDNA
    J1c3k

    Finland Russian Federation Sweden
    Quote Originally Posted by Donwulff View Post
    I'm not sure if that's a question... the GATK resource bundle https://software.broadinstitute.org/...ownload/bundle (Your FTP link won't work) is definitely hg38, and I've been using it successfully with GRCh38 based pipelines. It's not exactly clear what you're using as reference genome FASTA/version, however. There are different naming conventions for the contigs, not only UCSC "chr1" vs. "1", but the "raw" GRCh38 also uses sequence identifiers. One point of problem is the Mills indels file contains HLA-contigs in the header, and the lengths of the HLA contigs vary between HLA reference-build. It may be necessary to filter them out, although I'm not sure there are references with different HLA alleles floating around.
    I decided to work with the reference genom in the GATK bundle (GRCh38) and followed Tutorial 8017 for ALT aware processing.

    Working with base recalibration now and it seems to go soomthly now.

  14. #30
    Registered Users
    Posts
    67
    Sex
    Location
    Stockholm, Sweden
    Ethnicity
    Ingrian + Russian
    Nationality
    Swedish, Russian
    Y-DNA
    R1a1a1b1:Z92FGC46016
    mtDNA
    J1c3k

    Finland Russian Federation Sweden

    GATK 4 BaseRecalibrator

    Change default java jre because in Ubuntu 18.04 the default is OpenJDK11 (right now it is actually OpenJDK10):
    Code:
    sudo update-alternatives --config java
    and choose java 8. If it not present you need to install it (use Synaptic in Ubuntu).

    Base recalibration is single threaded. As far as I understand it is the only option.

    Code:
    /dnatools/gatk/gatk --java-options "-Xmx55g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
    BaseRecalibrator \
    -R /dnatools/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta \
    -O recal_data.table \
    -I jjs_alt_merged_postalt_sorted_dup.bam \
    --known-sites /dnatools/gatk-bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --known-sites /dnatools/gatk-bundle/hg38/dbsnp_146.hg38.vcf.gz \
    --known-sites /dnatools/gatk-bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
    You should see something like this:

    11:02:44.477 INFO ProgressMeter - Starting traversal
    11:02:44.477 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
    11:02:54.498 INFO ProgressMeter - chr1:1998772 0.2 535000 3203273.1
    11:03:04.501 INFO ProgressMeter - chr1:3386832 0.3 1068000 3200159.8
    11:03:14.516 INFO ProgressMeter - chr1:4873235 0.5 1685000 3365624.7
    11:03:24.523 INFO ProgressMeter - chr1:6313851 0.7 2283000 3420566.3
    11:03:34.534 INFO ProgressMeter - chr1:7840490 0.8 2916000 3495215.5
    11:03:44.538 INFO ProgressMeter - chr1:9316286 1.0 3527000 3523476.5


    Applying the re-calibration data:

    Code:
    /dnatools/gatk/gatk --java-options "-Xmx55g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
    ApplyBQSR \
    -R /dnatools/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta \
    --bqsr-recal-file recal_data.table \
    -I jjs_alt_merged_postalt_sorted_dup.bam \
    -O jjs_alt_merged_postalt_sorted_dup_recal.bam

Page 3 of 5 FirstFirst 12345 LastLast

Similar Threads

  1. Dante Labs (WGS)
    By MacUalraig in forum Other
    Replies: 636
    Last Post: 04-17-2019, 10:21 PM
  2. What are the chemicals used in DNA sequencing
    By Anabasis in forum Inquiries Corner
    Replies: 2
    Last Post: 12-17-2015, 02:05 AM
  3. Replies: 10
    Last Post: 07-01-2015, 03:51 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •