Page 5 of 5
Results 41 to 50 of 50

Thread: Dante Labs WES/WGS Sequencing Technical

  1. #41
    Registered Users
    Posts
    27
    Sex
    Location
    Elx/Elche, Valencian State
    Ethnicity
    Iberian
    Nationality
    Valencian
    Y-DNA
    R-L21/M529 Z253+
    mtDNA
    U3a1c1

    Catalonia Andorra Gibraltar
    Finally, I merged my 22 autosomal WES VCFs with "bcftools merge". It is not a good option, because bcftools generates a new column for every VCF file, named after the sorted.bam in the header, and where a column is empty it fills in GT 0/0 by default. Even though GEDmatch Genesis accepted this merged VCF, it can only read the variants; the new kit matches neither the Dante Labs WES.snp.VCF nor my 23andMe v4 raw data (nor my Gencove-imputed VCF from 23andme.txt). I generated a multi-FASTA GRCh37.p13 (UCSC), but when I tried to index it with BWA the run stopped at the fa.sa step (I would need double the RAM for that!). It is very easy to index chromosome by chromosome and align with BWA MEM (I did it with all my chromosomes in GRCh37 and some in GRCh38) against the paired-end FASTQ files sent by Dante Labs, then samtools view (SAM > BAM), samtools sort -o output.sorted.bam, bcftools mpileup -f chr.fa -o output.bcf, bcftools call -m -Oz -o output.vcf.gz, and finally bcftools view -v snps,indels -o output.vcf.
    If you want to merge all the gzipped VCFs, you must first run "tabix" on each chr.vcf.gz before running "bcftools merge". After merging, I also ran "bcftools isec" against my 23andMe raw data to extract the common SNPs and indels shared with my new merged VCF (only 11,000 SNPs) and with the filtered snp.VCF that Dante Labs sent me (only 71,000 SNPs! Why can't BWA MEM align all the reads from my fq files covering the positions the chip detected?). I have shared this file publicly (altruist) in the Sequencing.com database and in my GenomeSpace shared folder; you are free to check it.
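    The chromosome-by-chromosome workflow above can be condensed into a small Bash sketch. This is my reading of the steps, not an exact transcript: all file names (chr.fa, the read files, output prefixes) are placeholders, and I use the standard mpileup | call route.

```shell
# Per-chromosome align-and-call pipeline, as described above.
# Assumes bwa, samtools, bcftools and tabix are on PATH; all file
# names are placeholders - adjust to your own data.
call_chromosome() {
  local CHR_FA=$1 R1=$2 R2=$3 OUT=$4
  bwa index "$CHR_FA"                        # a per-chromosome index fits in RAM
  bwa mem "$CHR_FA" "$R1" "$R2" \
    | samtools sort -o "$OUT.sorted.bam" -   # SAM -> sorted BAM in one pipe
  bcftools mpileup -f "$CHR_FA" "$OUT.sorted.bam" \
    | bcftools call -m -v -Oz -o "$OUT.vcf.gz"   # -v: variant sites only
  tabix -p vcf "$OUT.vcf.gz"                 # index required by bcftools merge
}

# Merge all per-chromosome VCFs, then intersect with 23andMe raw data:
merge_and_isec() {
  bcftools merge chr*.vcf.gz -Oz -o merged.vcf.gz
  tabix -p vcf merged.vcf.gz
  bcftools isec -p isec_out merged.vcf.gz 23andme.vcf.gz
}
```

    Run call_chromosome once per chromosome (e.g. call_chromosome chr1.fa reads_1.fq reads_2.fq chr1), then merge_and_isec at the end.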
    Here is some data from running "bcftools stats -v":
    ID 0 mergedWES37.vcf.gz
    # SN, Summary numbers:
    # SN [2]id [3]key [4]value
    SN 0 number of samples: 22
    SN 0 number of records: 22157494
    SN 0 number of no-ALTs: 17002553
    SN 0 number of SNPs: 5037615
    SN 0 number of MNPs: 0
    SN 0 number of indels: 117326
    SN 0 number of others: 0
    SN 0 number of multiallelic sites: 21801
    SN 0 number of multiallelic SNP sites: 21033
    # TSTV, transitions/transversions:
    # TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
    TSTV 0 3258007 1800641 1.81 3254038 1783577 1.82
    # SiS, Singleton stats:
    # SiS [2]id [3]allele count [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
    SiS 0 1 2031249 1391036 640213 30948 0 0 30948
    # AF, Stats by non-reference allele frequency:
    # AF [2]id [3]allele frequency [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
    AF 0 0.000000 2031249 1391036 640213 30948 0 0 30948
    AF 0 0.040000 3027399 1866971 1160428 87146 0 0 87146

  2. The Following User Says Thank You to Miqui Rumba For This Useful Post:

     pmokeefe (04-03-2019)


  4. #43
    Registered Users
    Posts
    27
    Sex
    Location
    Elx/Elche, Valencian State
    Ethnicity
    Iberian
    Nationality
    Valencian
    Y-DNA
    R-L21/M529 Z253+
    mtDNA
    U3a1c1

    Catalonia Andorra Gibraltar
    Sequencing.com has changed some rules, and Dante Labs WES FASTQ files (forward.fq and reverse.fq) now seem to be accepted for upload, although only the first FASTQ (forward.fq) can be processed by the free EvE (Bowtie2 aligner, UCSC GRCh38.86 FASTA). The results are far better than the SNP and INDEL VCFs that Dante Labs sent me last year (abnormal quality, 25% of variants unread). Getting all exome variant sites in hg38 is a big improvement. I made my EvE VCF results altruist in the Sequencing database for everybody who wants to check them: CL100040257_L01_501_1.vcf.gz
    Here is the Genome Overview analysis:
    Warnings: 150,681
    Errors: 256
    Number of lines (input file): 201,252
    Number of variants (before filter): 202,799
    Number of non-variants (i.e. reference equals alternative): 0
    Number of variants processed (i.e. after filter and non-variants): 202,799
    Number of known variants (i.e. non-empty ID): 0 (0%)
    Number of multi-allelic VCF entries (i.e. more than two alleles): 1,179
    Number of effects: 1,025,109
    Genome total length: 47,871,129,608
    Genome effective length: 3,095,807,492
    Variant rate: 1 variant every 15,265 bases
    Variants rate details
    Chromosome, Length, Variants, Variant rate (1 variant every N bases)
    1 248,956,422 17,567 14,171
    2 242,193,529 15,715 15,411
    3 198,295,559 12,487 15,880
    4 190,214,555 11,182 17,010
    5 181,538,259 10,619 17,095
    6 170,805,979 10,509 16,253
    7 159,345,973 10,660 14,948
    8 145,138,636 8,043 18,045
    9 138,394,717 8,445 16,387
    10 133,797,422 9,922 13,484
    11 135,086,622 11,501 11,745
    12 133,275,309 10,314 12,921
    13 114,364,328 5,940 19,253
    14 107,043,718 6,446 16,606
    15 101,991,189 6,681 15,265
    16 90,338,345 7,538 11,984
    17 83,257,441 8,568 9,717
    18 80,373,285 4,564 17,610
    19 58,617,616 9,289 6,310
    20 64,444,167 5,446 11,833
    21 46,709,983 2,803 16,664
    22 50,818,468 4,180 12,157
    GL000008.2 209,709 9 23,301
    GL000009.2 201,709 5 40,341
    GL000194.1 191,469 1 191,469
    GL000205.2 185,591 60 3,093
    GL000208.1 92,689 4 23,172
    GL000214.1 137,718 2 68,859
    GL000216.2 176,608 21 8,409
    GL000218.1 161,147 6 26,857
    GL000219.1 179,198 33 5,430
    GL000220.1 161,802 1 161,802
    GL000221.1 155,397 3 51,799
    GL000224.1 179,693 6 29,948
    GL000225.1 211,173 320 659
    GL000251.2 1 0 0
    GL000252.2 1 0 0
    GL000253.2 1 0 0
    GL000254.2 1 0 0
    GL000255.2 1 0 0
    GL000257.2 1 0 0
    GL339449.2 1 0 0
    GL383533.1 1 0 0
    GL383545.1 1 0 0
    GL383557.1 1 0 0
    GL383567.1 1 0 0
    GL383573.1 1 0 0
    GL383580.2 1 0 0
    GL383581.2 1 0 0
    GL582966.2 1 0 0
    JH159136.1 1 0 0
    JH159137.1 1 0 0
    JH159146.1 1 0 0
    KI270438.1 112,505 55 2,045
    KI270442.1 392,061 68 5,765
    KI270467.1 3,920 3 1,306
    KI270517.1 3,253 1 3,253
    KI270538.1 91,309 3 30,436
    KI270589.1 44,474 4 11,118
    KI270706.1 175,055 1 175,055
    KI270707.1 32,032 1 32,032
    KI270708.1 127,682 1 127,682
    KI270711.1 42,210 6 7,035
    KI270713.1 40,745 6 6,790
    KI270716.1 153,799 1 153,799
    KI270719.1 176,845 3 58,948
    KI270722.1 194,050 24 8,085
    KI270723.1 38,115 3 12,705
    KI270725.1 172,810 3 57,603
    KI270726.1 43,739 3 14,579
    KI270728.1 1,872,759 1 1,872,759
    KI270729.1 280,839 28 10,029
    KI270732.1 41,543 1 41,543
    KI270733.1 179,772 13 13,828
    KI270735.1 42,811 1 42,811
    KI270736.1 181,920 6 30,320
    KI270738.1 99,375 1 99,375
    KI270744.1 168,472 13 12,959
    KI270746.1 66,486 26 2,557
    KI270749.1 158,759 2 79,379
    KI270750.1 148,850 1 148,850
    KI270751.1 150,742 2 75,371
    KI270754.1 40,191 7 5,741
    KI270758.1 1 0 0
    KI270779.1 1 0 0
    KI270783.1 1 0 0
    KI270791.1 1 0 0
    KI270803.1 1 0 0
    KI270817.1 1 0 0
    KI270819.1 1 0 0
    KI270830.1 1 0 0
    KI270846.1 1 0 0
    KI270847.1 1 0 0
    KI270850.1 1 0 0
    KI270853.1 1 0 0
    KI270854.1 1 0 0
    KI270855.1 1 0 0
    KI270860.1 1 0 0
    KI270872.1 1 0 0
    KI270875.1 1 0 0
    KI270878.1 1 0 0
    KI270902.1 1 0 0
    KI270904.1 1 0 0
    KI270908.1 1 0 0
    KI270927.1 1 0 0
    KI270938.1 1 0 0
    KN196479.1 1 0 0
    KN196484.1 1 0 0
    KN196487.1 1 0 0
    KN538361.1 1 0 0
    KN538362.1 1 0 0
    KN538363.1 1 0 0
    KN538369.1 1 0 0
    KN538372.1 1 0 0
    KQ031388.1 1 0 0
    KQ090016.1 1 0 0
    KQ090026.1 1 0 0
    KQ458383.1 1 0 0
    KQ759759.1 1 0 0
    KQ983258.1 1 0 0
    KV575245.1 1 0 0
    KV766195.1 1 0 0
    KV766196.1 1 0 0
    KV766198.1 1 0 0
    KZ208908.1 1 0 0
    KZ208915.1 1 0 0
    KZ208916.1 1 0 0
    KZ559106.1 1 0 0
    KZ559107.1 1 0 0
    KZ559109.1 1 0 0
    MT 16,569 30 552
    X 156,040,895 3,263 47,821
    Y 57,227,415 72 794,825
    Total 3,095,807,492 202,799 15,265
    Number of variants by type
    Type Total
    SNP 184,222
    MNP 0
    INS 9,616
    DEL 8,961
    MIXED 0
    INV 0
    DUP 0
    BND 0
    INTERVAL 0
    Total 202,799
    Number of effects by impact
    Type (alphabetical order) Count Percent
    HIGH 4,340 0.423%
    LOW 46,067 4.495%
    MODERATE 27,181 2.652%
    MODIFIER 947,265 92.429%
    Number of effects by functional class
    Type (alphabetical order) Count Percent
    MISSENSE 25,612 45.06%
    NONSENSE 240 0.422%
    SILENT 30,988 54.518%

    Missense / Silent ratio: 0.8265

  5. The Following User Says Thank You to Miqui Rumba For This Useful Post:

     MacUalraig (09-14-2018)

  6. #44
    Registered Users
    Posts
    174
    Sex

    GRCh38 reference genome Patch 13 hit the public just this month (well, this month in most of the world anyway), so I took the opportunity to check & upgrade my processing tools. As usual these come with no warranty, explicit or implied, and blah blah blah; they're just simple tooling scripts I've used for my own analysis & research, published because people keep asking how to do it. They use Bash, which in theory allows use on Linux or Linux-like environments, and easy use & modification without needing to install and learn additional complex frameworks. If you need full GATK Best Practices, or cluster support, look elsewhere. I've tested these with (old) BigY, Dante Labs and Genos Research raw data with some special cases; I don't have access to FGC data to test.

    https://github.com/Donwulff/bio-tools includes https://github.com/Donwulff/bio-tool...8_bwa_index.sh for creating my research BWA MEM reference index. This uses UCSC IDs for the part that UCSC has officially published (up to Patch 11), with the official hs38d1 decoy sequences; this allows one to simply cut the BAM at a suitable point for compatibility. NCBI naming is used for Patch 12 and Patch 13, which are added sequentially at the end. In addition it pulls whatever are the latest HLA reference sequences (these change frequently). In the previous version the path to one of the patches was incorrect; apologies for that.

    The decoy sequence is NOT updated to match the new contigs (which means that the decoy may inadvertently attract reads intended for the new alternate contigs), though that's on my TODO list. The BWA mapper is alternate-contig aware, which means the reads should preferentially map to the primary assembly instead of the alternate contigs, but the alternates can be useful for experimenting with what's in the genome.
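    As a note on the alt-aware point: bwa mem only becomes alt-aware when a ".alt" file sits next to the index files, so a minimal mapping wrapper (a sketch of my own, with placeholder file names) might guard for it:

```shell
# Minimal alt-aware mapping wrapper. bwa mem picks up "$REF.alt"
# automatically when it is present next to the index, and then prefers
# primary-assembly placements over alternate contigs.
map_alt_aware() {
  local REF=$1 R1=$2 R2=$3 OUT=$4
  [ -f "$REF.alt" ] \
    || echo "warning: $REF.alt missing - mapping will NOT be alt-aware" >&2
  bwa mem -t 4 "$REF" "$R1" "$R2" \
    | samtools sort -o "$OUT.srt.bam" -
  samtools index "$OUT.srt.bam"
}
```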

    https://github.com/Donwulff/bio-tool.../revert-bam.sh generates a uBAM (unmapped BAM file) from a source BAM for use according to the GATK Best Practices workflow. I've included simple tooling to drive BWA MEM for standard mapping and duplicate marking against a given reference index, streamlined from GATK Best Practices. New in this version: if the BAM file's ReadGroup is missing the PLatform tag (as the BigY's are), it's added during mapping to avoid problems in downstream analysis. I've also added, in comments, a tested workflow (using an undocumented BWA MEM feature) for combining BigY hs37- and GRCh38-mapped BAMs into a single BAM, which recovers a couple of reads that were filtered from one or the other. Since they don't map against the reference they're most likely garbage, but I thought, why not.

    https://github.com/Donwulff/bio-tool...apping/BQSR.sh does Base Quality Score Recalibration. The basic idea is BQSR looks for differences from the reference which aren't known SNP's/variants. Some of these will be genuine novel variants, but the majority of them are actually sequencing errors. Importantly, the genuine novel variants are randomly distributed in the reads and shouldn't depend on the state of the sequencing instrument. This allows building a statistical model of where the sequencing instrument will most likely make errors, and take that into account.

    I've now experimentally added YBrowse's list of known Y-chromosome variants into the list of known (ignored) variants; hg38/GRCh38 only, sorry; I might add liftover, but almost everybody is using GRCh38 now. In theory, BigY is targeted to the Y chromosome and filtered to only the Y-chromosome region, so the off-target reads might be garbage, but I've decided to include the whole primary assembly in the BQSR analysis now. With my own DNA, including the off-target reads produces 11 different calls, which interestingly match my WGS results rather than BigY. However, some regions of the Y chromosome sequence very poorly, or get interference from elsewhere in the genome in WGS. I'm trying to find out whether BQSR on Y only or genome-wide helps more with BigY results, but there's really no "ground truth" to compare against.

    Snapshot 152 of the dbSNP variant database should hit the public next week, at which point I'll update the BQSR script to use it for whole-genome analysis. There also seems to be a lack of analysis on how the dbSNP snapshot version affects the results. Many people use an old version either because their software pipelines haven't been updated in years, or because they want to restrict the excluded variants to those mostly known before high-throughput sequencing, on the theory that many of the newer ones are themselves sequencing artifacts. On the mathematical theory of BQSR, and a hunch, I feel it's more important to exclude as many genuine variants as possible, and the older databases are seriously lacking in non-white/non-European samples, which would further bias against "ethnic" variants, so I'm personally using the latest variant database.

    The scripts do not currently include variant calling & genotyping, in part because there are different schools of thought on that. For testing purposes my present approach follows GATK Best Practices, producing an allele-specific genomic VCF for joint genotyping (SNP names are prepared in the BQSR script above, although the naming requires that all alt alleles match, so many are missed).
    gatk-4.1.1.0/gatk --java-options -Xms30G HaplotypeCaller \
      -R /mnt/GenomicData/GRCh38/hg38-p13-all.fa \
      -I sample1BigYmerge.qsort.srt.bam \
      -D /mnt/GenomicData/snps_hg38.vcf.gz \
      -ERC GVCF --ploidy 1 \
      -O sample1BigYmerge.mapsort.chrY.ploidy1.AS.g.vcf.gz \
      -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
      -L chrY

    You may want to parallelize over different ranges/chromosomes, and set the ploidy according to the chromosome (standard diploid genotyping may be of interest as well for palindromic regions, or simply for detecting dubious calls, but haploid matches commercial Y analysis best).
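    A sketch of that parallelization, as a hypothetical helper of my own (the chromosome list, ploidy choices and file names are assumptions, not taken from the scripts):

```shell
# Emit one HaplotypeCaller command line per chromosome, using ploidy 1
# for the haploid contigs (chrY, chrM) and ploidy 2 elsewhere, then run
# several in parallel with xargs. Paths and sample names are placeholders.
hc_commands() {
  local REF=$1 BAM=$2
  local CHR PLOIDY
  for CHR in chr{1..22} chrX chrY chrM; do
    case $CHR in chrY|chrM) PLOIDY=1 ;; *) PLOIDY=2 ;; esac
    echo gatk HaplotypeCaller -R "$REF" -I "$BAM" \
      -ERC GVCF --ploidy $PLOIDY -L $CHR -O "$BAM.$CHR.g.vcf.gz"
  done
}
# Run four chromosomes at a time:
# hc_commands hg38-p13-all.fa sample.bam | xargs -P4 -I{} bash -c '{}'
```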

  7. The Following 4 Users Say Thank You to Donwulff For This Useful Post:

     karwiso (04-06-2019),  MacUalraig (04-02-2019),  pmokeefe (04-03-2019),  traject (04-03-2019)

  8. #45
    Registered Users
    Posts
    174
    Sex

    The human variant rs-ID database at https://www.ncbi.nlm.nih.gov/snp states on its front page: "April 2, 2019: Starting next week dbSNP will release the new Entrez updates that will include the latest dbSNP build 152 data for user testing." Regardless, looking at the data & updates page it's not looking like the old-style VCF files of the database are coming back, and even if they do, they're lagging several months behind the dbSNP 152 snapshot. (I'll check what they say about this on the GATK forums in time.)

    Therefore I figured out a procedure to convert the NCBI sequence-ID based dbSNP VCF to the more familiar UCSC contig IDs used by GATK. It's now included in the BQSR.sh script. The conversion took half an hour for me to run (at maximum compression, because the result file will be kept around for a while), although it will also need to download the 14-gigabyte dbSNP dump into the working directory the first time.
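    One way to do such a conversion (not necessarily how BQSR.sh does it internally) is bcftools annotate with a rename map; the map below covers only chr1, for illustration, and a real one needs every contig:

```shell
# Build a two-column "old new" contig name map (NCBI -> UCSC), then
# rewrite the VCF contig names with it. Only chr1 is shown here.
make_rename_map() {
  cat > rename.txt <<'EOF'
NC_000001.11 chr1
EOF
}

rename_contigs() {
  local IN=$1 OUT=$2
  bcftools annotate --rename-chrs rename.txt "$IN" -Oz -o "$OUT"  # -Oz: bgzipped VCF
  tabix -p vcf "$OUT"
}
```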

    Contrary to the "official" GATK dbSNP dumps of the past, the file created by this script includes every contig with a UCSC ID listed in the Human Genome Reference Consortium data at https://www.ncbi.nlm.nih.gov/grc/human. I'm teetering on this a bit, because at least on a BigY BAM (which is targeted AND filtered), most non-Y-chromosome hits are likely to be random chance rather than actual hits, and it's a significant departure from GATK Best Practices. However, comparing the results on my own BigY, the changes in calls seem altogether reasonable, so I've included this change for experimenting.

    I'm considering validating this "simple" pipeline against HG001/NA12878 to see how far I've diverged from what's considered canonical. However, I'd also like to stress that, contrary to what many people may have heard, whole genome sequencing runs rarely achieve much better than about 99% concordance with each other or with the "expected" results. Some of the regions on the Y chromosome are even harder, and when you get differences there you don't really know which result is better, or whether it's just noise. But at least I can employ the philosophy of "use the latest data, use ALL the data".

  9. The Following User Says Thank You to Donwulff For This Useful Post:

     pmokeefe (04-03-2019)

  10. #46
    Registered Users
    Posts
    1,122
    Sex
    Location
    Glasgow, Scotland
    Ethnicity
    Pictland/Deira
    Y-DNA
    R1b-M222-FGC5864
    mtDNA
    H5r*

    Quote Originally Posted by Donwulff View Post
    The conversion took half a hour for me to run (maximum compression because the result file will be saved for a while), although it will also need to download the 14 gigabyte dbSNP dump into working directory the first time.
    My monthly limit is 15 GB. I have been toying with upgrading to the 'unlimited' (ahem) level, but often I don't need it, so it's a tricky choice.
    YSEQ:#37; YFull: YF01405 (Y Elite 2013)
    GEDMatch: A828783 (autosomal DNA), 9427684 (GEDCOM) for segment matching DO NOT POST ADMIXTURE REPORTS USING MY KIT
    WGS (Full Genomes Nov 2015, YSEQ Feb 2019, Dante Mar 2019) - further WGS tests pending ;-)
    Ancestry GCs: Scots in central Scotland & Ulster, Ireland; English in Yorkshire & Pennines

  11. #47
    Registered Users
    Posts
    174
    Sex

    Ah yeah, limited mobile plans are interesting too... I'd say bioinformatics probably requires an unlimited plan; the dbSNP dump isn't the only big file needed. On the other hand, to make this more accessible there are things like UseGalaxy, and I was toying with the idea of making a pre-built Amazon virtual machine image for some of these analyses.

    And finally, the reference genome patches & dbSNP dumps are by nature such that most of the existing data remains unchanged, so it would be possible to download just the changes and merge them into the previous version. Unfortunately, nobody's really thinking of metered connections in bioinformatics. Heck, it's a royal pain even with unmetered connections; the BLAST refseq-genomic database is 350 GB, and every update requires a complete re-download.

    The dbSNP re-organization seems to mean that rather than just new SNPs being added, all the row formats have changed too, so this isn't even a "just get around to doing it" thing this time.

    /mnt/GenomicData$ zgrep -m1 rs775809821 All_20180418_GRCh38p7.vcf.gz GCF_000001405.38.bgz
    All_20180418_GRCh38p7.vcf.gz:1 10019 rs775809821 TA T . . RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
    GCF_000001405.38.bgz:NC_000001.11 10019 rs775809821 TA T . . RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL

    Though bioinformatics workflows in general don't even need that INFO string; you just need "1 10019 rs775809821 TA T", which gives the chromosome, the coordinate on that chromosome (in that build), the SNP rs-ID, and the reference and alt alleles. (Strictly, BQSR doesn't require more than contig & coordinate to exclude a site, but I'm assuming that at some point you'll want to annotate the rs-IDs on the correct variants for easy lookup, and check that the ref and alt match.) I can imagine that if this were wrapped into some simple "internet appliance" type image or app, you'd want to minimize the downloads.

  12. The Following 2 Users Say Thank You to Donwulff For This Useful Post:

     MacUalraig (04-04-2019),  pmokeefe (04-03-2019)

  13. #48
    Registered Users
    Posts
    174
    Sex

    Double post

  14. #49
    Registered Users
    Posts
    174
    Sex

    Original size of GCF_000001405.38.bgz (GRCh38 dbSNP build 152):
    14,038,308,213 bytes (14 gigabytes)

    Maximum bgzip compression with GATK contig names:
    13,581,306,990

    INFO column removed (Loses gene annotation, but there are other resources for this):
    6,705,966,745

    This is already significant; cutting the file size to less than half should actually speed up the BQSR workflow somewhat, because it's half as much IO and decompression for that file.
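    The INFO-stripping step can be reproduced with bcftools (a sketch with placeholder file names; bgzip -l 9 asks for maximum compression, matching the sizes quoted above):

```shell
# Drop the INFO column from a dbSNP VCF and recompress at maximum level.
# Halving the file size roughly halves the IO and decompression cost
# for this file in the BQSR workflow.
strip_info() {
  local IN=$1 OUT=$2
  bcftools annotate -x INFO "$IN" -Ov \
    | bgzip -l 9 > "$OUT"
  tabix -p vcf "$OUT"
}
```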

    I had kind of forgotten how annoying bioinformatics can be, when I decided to look at producing dbSNP build differentials. The dbSNP dumps are sorted only by genomic site; the ordering of different variants within the same genomic site varies. This means a simple comparison between dbSNP versions reports almost all lines as changed. I think I knew this before, but it took a while to find out that Picard Tools VcfSort is the only method that actually sorts the rsIDs within genomic sites as desired.

    The next problem is that Linux diff wants to load the WHOLE files, plus more, into memory to do a whole-genome diff. One could probably pull this off on a cloud machine with 256 GB of memory or something. This gave me the idea that, knowing the files are already sorted by genomic site, a script could produce a smaller delta between revisions with a lot fewer resources, but this eventually turned out to be quite a brainteaser to write. I had to stick to testing with a single chromosome; a Unix patch file for the changes in dbSNP 151 => 152, xz-compressed, takes about 6% of the original bgzipped chr1 dbSNP size, so a patch for the whole file should be about 1 gigabyte.

    In addition, it would STILL be necessary to download the original dbSNP dump to apply the delta to. Using xz compression instead of bgzip seems to reduce the file size to about 77% of maximum bgzip compression, so the dbSNP dump without the INFO column should be about 5 gigabytes, but it would need to be extracted and re-compressed with bgzip for use. I'm also going to have to see if I can host these files off Sequencing.com; otherwise I'm not sure how it would be possible to offer them. However, it's worth noting that either way, processing the xz-compressed dbSNP dump ready for use, to say nothing of the actual bioinformatics workflow, will require very significant computational resources & electricity, so I'm not really sure the effort was worth it.

  15. The Following User Says Thank You to Donwulff For This Useful Post:

     pmokeefe (04-05-2019)

  16. #50
    Registered Users
    Posts
    174
    Sex

    My bad: Picard Tools VcfSort doesn't actually sort variants at the same genomic site (chromosome coordinates); it just preserves the existing ordering, and GCF_000001405.38.vcf.bgz happened to be in the order I wanted. I made my dbSNP comparison script sort them into the order I want before comparing, but this will make it slightly harder to apply the delta/diff between versions. The dbSNP 151 to 152 change list came out to 222 megabytes xz-compressed.
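    For a deterministic within-site order, plain coreutils will do: sort the VCF body on contig, position and ID. A sketch (file names are placeholders); note this imposes lexicographic contig order (chr10 before chr2), which is fine when both versions get the same treatment before diffing, but the result would need a bcftools sort pass before use with tabix/GATK:

```shell
# Re-sort a VCF so that records at the same contig+position are ordered
# by rsID (column 3), giving a stable ordering for diffing two builds.
# -s keeps the sort stable; the header passes through untouched.
sort_within_sites() {
  local IN=$1 OUT=$2
  ( zcat "$IN" | grep '^#'
    zcat "$IN" | grep -v '^#' \
      | sort -s -k1,1 -k2,2n -k3,3 ) \
    | bgzip > "$OUT"
}
```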

    Just tested and updated my https://github.com/Donwulff/bio-tool...apping/BQSR.sh script to use a pre-prepared dbSNP 152 GRCh38 VCF file hosted off Sequencing.com (well, if the download succeeds). It came out to 6.3 gigabytes in the end, compared to the 14 GB original.

    Unfortunately, Sequencing.com is pretty picky about what kinds of files it will store, and xz-compressed VCF isn't one of them. This is kind of a shame, because xz-compressed the file size comes to 4.7 gigabytes, although it would be necessary to uncompress & recompress that one with bgzip to make it usable with GATK etc. The dbSNP delta couldn't be served off Sequencing.com either; one possibility might be to make the format conform to VCF, although that would make it larger, so I'm thinking of just using BitTorrent if there's ever demand for these files.

    Worth noting that the BQSR.sh script will *also* pull all the GRCh37 resources, so people on metered connections should check what it does. Most of the script is just fetching & preparing the known-sites lists for BQSR from various sources; the *actual* BQSR is basically just:

    tabix -l ${DBSNP} \
      | xargs -IZ -P${CORES} gatk-${GATK}/gatk --java-options -Xms4G BaseRecalibrator -R ${REF} \
          --known-sites ${DBSNP} \
          --known-sites ${INDEL1} \
          --known-sites ${INDEL2} \
          --known-sites ${YBROWSE} \
          -I ${SAMPLE} -O ${SAMPLE}.Z.recal -L Z
    gatk-${GATK}/gatk GatherBQSRReports -I chr.list -O ${SAMPLE}.recal
    gatk-${GATK}/gatk ApplyBQSR -R ${REF} --bqsr ${SAMPLE}.recal -I ${SAMPLE} -O ${SAMPLE%%.srt.bam}.bqsr.bam

    But you need to know which known-sites lists to use, where to get them, and how to prepare them for use, which the script hopefully helps with. The scripts are intended for people who have access to a high-performance Linux system (cloud servers will do just fine) and want to play with re-processing their Dante Labs or BigY BAMs (other providers will likely work just as well). Ideally, people would experiment with different genomic references, known-sites lists and contigs to include, for example in the BQSR statistical model, and co-operatively find the best solution.

  17. The Following User Says Thank You to Donwulff For This Useful Post:

     pmokeefe (04-06-2019)
