Page 2 of 6 FirstFirst 1234 ... LastLast
Results 11 to 20 of 51

Thread: Dante Labs WES/WGS Sequencing Technical

  1. #11
    Registered Users
    Posts
    436
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    It is interesting that sometimes the uBAM file is smaller than the original BAM file and sometimes it is bigger:

    Test bam hg19 u.bam J Kane hg38*
    Dante WGS 83133 88048
    FGC Elite 1.0 9339 6216 12962
    FGC WGS 15x 26744 30102 74318
    Big Y (2014) 829 1178 2462
    Big Y (2014) 825 1129 2607
    Big Y (2014) 831 1133 2415
    Big Y (2014) 901 1226 2517
    Big Y (2015) 812 1184 2449
    Big Y (2015) 868 1174 2597
    Big Y (2015) 807 1236 2580
    Big Y (2015) 670 917 2122
    Big Y (2016) 490 830** 1284
    Genos WES 6396 5480 11153

    File sizes in MiB.
    * - the third column are BAM files created by http://www.it2kane.org/2016/10/y-dna...rkflow-pt-1-1/
    ** - the uBAM file cannot be created from the original BAM file so it was created from hg38 file created by the J. Kane workflow.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  2. #12
    Registered Users
    Posts
    257
    Sex

    It's unclear what are from ValidateSam and which are from the script. The script works for my BigY done in 2015, soon after they started filtering out non-Y reads. The script copies the original read group header, and for some reason FTDNA didn't put platform there. All that is annoying, because some tools expect it, I do wonder if I should make the script fill it in if missing.

    _@ASPIRE-V3:~$ bio-tools/mapping/revert-bam.sh
    Reverting sample1.bam into sample1.unmapped.bam with Picard Tools
    14:03:25.386 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/_/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Thu Apr 12 14:03:30 GMT 2018] RevertSam INPUT=sample1.bam OUTPUT=sample1.unmapped.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] MAX_RECORDS_IN_RAM=2500000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Thu Apr 12 14:03:30 GMT 2018] Executing as _@ASPIRE-V3 on Linux 4.4.0-17133-Microsoft amd64; OpenJDK 64-Bit Server VM 9-internal+0-2016-04-14-195246.buildd.src; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
    INFO 2018-04-12 14:03:35 RevertSam Reverted 1,000,000 records. Elapsed time: 00:00:05s. Time for last 1,000,000: 5s. Last read position: chrY:8,476,786
    INFO 2018-04-12 14:03:41 RevertSam Reverted 2,000,000 records. Elapsed time: 00:00:10s. Time for last 1,000,000: 5s. Last read position: chrY:13,708,840
    INFO 2018-04-12 14:04:05 RevertSam Reverted 3,000,000 records. Elapsed time: 00:00:35s. Time for last 1,000,000: 24s. Last read position: chrY:17,177,919
    INFO 2018-04-12 14:04:11 RevertSam Reverted 4,000,000 records. Elapsed time: 00:00:40s. Time for last 1,000,000: 5s. Last read position: chrY:21,174,636
    INFO 2018-04-12 14:04:15 RevertSam Reverted 5,000,000 records. Elapsed time: 00:00:45s. Time for last 1,000,000: 4s. Last read position: chrY:24,115,859
    INFO 2018-04-12 14:04:38 RevertSam Reverted 6,000,000 records. Elapsed time: 00:01:07s. Time for last 1,000,000: 22s. Last read position: chrY:26,139,146
    INFO 2018-04-12 14:04:40 RevertSam Detected quality format for _: Standard
    INFO 2018-04-12 14:05:04 RevertSam Sanitized 1,000,000 records. Elapsed time: 00:00:23s. Time for last 1,000,000: 14s. Last read position: */*
    INFO 2018-04-12 14:05:17 RevertSam Sanitized 2,000,000 records. Elapsed time: 00:00:37s. Time for last 1,000,000: 13s. Last read position: */*
    INFO 2018-04-12 14:05:30 RevertSam Sanitized 3,000,000 records. Elapsed time: 00:00:50s. Time for last 1,000,000: 13s. Last read position: */*
    INFO 2018-04-12 14:05:44 RevertSam Sanitized 4,000,000 records. Elapsed time: 00:01:03s. Time for last 1,000,000: 13s. Last read position: */*
    INFO 2018-04-12 14:05:57 RevertSam Sanitized 5,000,000 records. Elapsed time: 00:01:16s. Time for last 1,000,000: 13s. Last read position: */*
    INFO 2018-04-12 14:06:10 RevertSam Sanitized 6,000,000 records. Elapsed time: 00:01:30s. Time for last 1,000,000: 13s. Last read position: */*
    INFO 2018-04-12 14:06:16 RevertSam Discarded 39868 out of 6464708 (0.617%) reads in order to sanitize output.
    [Thu Apr 12 14:06:16 GMT 2018] picard.sam.RevertSam done. Elapsed time: 2.85 minutes.
    Runtime.totalMemory()=10068426752
    To get help, see http://broadinstitute.github.io/pica...ml#GettingHelp
    Exception in thread "main" picard.PicardException: Discarded 0.617% which is above MAX_DISCARD_FRACTION of 0.500%
    at picard.sam.RevertSam.doWork(RevertSam.java:336)
    at picard.cmdline.CommandLineProgram.instanceMain(Com mandLineProgram.java:282)
    at picard.cmdline.PicardCommandLine.instanceMain(Pica rdCommandLine.java:98)
    at picard.cmdline.PicardCommandLine.main(PicardComman dLine.java:108)

    That error is expected, since the BAM file contains unpaired reads, which need to be discarded or paired end mapping won't be possible. The unpaired reads can be mapped separately, although it's possible they don't belong on Y chromosome because their other end mapped somewhere else. With FTDNA cutting the pairs out, it's hard to tell. The script could leave that error out, but I personally think it's good to warn about possible loss of data.

    The "WARNING:MISSING_TAG_NM" warnings are expected, because the result is unmapped BAM, which doesn't include mapping information.

    2:N:0:GGACTCCT+GTAAGGAG is from Illumina Casava 1.8 FASTQ flags, since the script in question works on BAM's I'm not sure where those are coming from.

    The latest version on GitHub now includes script to call bwa (if the reference index is available) to show one way to map the unmapped bam file, though it doesn't exactly follow the GATK best practices (I'm using samtools for sorting because it's faster, and gluing the unmapped BAM header with the RG lines on the mapped BAM instead of using MergeBamAlignment for same reason). There's no base recalibration or variant calling, basically the BAM is left in same state as the raw BAM's delivered by sequencing companies. I'll try to write user-friendly BQSR & calling script when I have time, though it's somewhat re-inventing the wheel. Technically one can use the GATK pipelines on the unmapped BAM, though they have more dependencies/prerequisites.

  3. #13
    Registered Users
    Posts
    436
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    Quote Originally Posted by Donwulff View Post
    In the case of Dante Labs provided BAM, I found out I had to add XS and XA to tags to clear.
    I noticed that FTDNA Big Y BAM files contain some additional flags in lower case: qf, sc, sp, ac, nh.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  4. #14
    Registered Users
    Posts
    257
    Sex

    Nowadays, bioinformatics pipelines are usually written in some high level languages or frameworks. For example, Broad Institute's GATK has moved into using their own Workflow Definition Language WDL https://software.broadinstitute.org/wdl/ - of course, this means lots more dependencies, learning and added complexity. There may still be some spot for easily put together and hopefully understandable scripts to do the same.

    That said, as GATK now mentions on their Best Practices front page:
    "If someone hands you a script and tells you "this runs the GATK Best Practices", start by asking what version of GATK it uses, when it was written, and what are the key steps that it includes. Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. And if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single parameter in the commands. If anything is unfamiliar to you, you should find out what it does. If you can't find it in the documentation, ask us in the forum. It's one or two hours of your life that can save you days of troubleshooting on the tail end of the pipeline, so please protect yourself by being thorough." https://software.broadinstitute.org/...est-practices/

    Which, basically, translates to ONLY use Broad Institute provided official pipelines. I mean, they don't even provide the commands on their best practices pages anymore, it's all in the pipeline code. Another thing is computing has moved ahead, and like https://software.broadinstitute.org/gatk/blog?id=11415 details, there's great advantages to be had from using specific cloud computing configurations. On the other hand, one may not have the knowledge, transfer bandwidth or desire to send their data to large cloud computing companies where it could be intercepted. It's also possible people may want to learn & tweak the process themselves.

    I do wonder a bit where GATK Best Practices is headed. They've been pushing the Unmapped BAM format for a while now, and when nobody wanted to use it, now they seem to be forcing it down people's throat. Me, personally I think the uBAM is actually a great idea, and it would be nice if bioinformatics tools had better support for it, but still GATK's current approach seems bit heavy-handed. Now they're also talking of doing away with BWA, for performance reasons of course, while their Java-based GATK & Picard are the slowest tools out there. But for example their MarkDuplicates is considered not just the best but practically only way to properly do duplicate marking. There are faster tools, like samblaster that do streaming/piped duplicate marking and sorting, which saves you from the processing power and IO to compress and store temporary files, but the results aren't identical with Picard MarkDuplicates. So to some degree you're stuck with that.

    On the other hand, they seem to now be headed to do *everything* by themselves, their own way. I'm wondering if they will try to close it up as proprietary at some point, or if another, open standard will emerge. Some of the challenges FDA has ran seem to indicate that GATK Best Practices do produce some of the highest quality results, though: https://precision.fda.gov/challenges/truth/results And perhaps more importantly, the pipelines and results are reproducible, and comparable to each other. (The caveats about what is "true" for genomics apply though; the expected/correct answers are pretty much defined by GATK-like workflows right now). On the other hand, if you just want to get your hands dirty and experiment, knock yourself out.

  5. #15
    Registered Users
    Posts
    436
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    This is my unsuccessful run:

    [Thu Apr 12 17:35:06 CEST 2018] RevertSam INPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.bam OUTPUT=/home/petr/Storage/f348784/f348784_Autrata_CZE_BigY.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=7000000 OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Thu Apr 12 17:35:06 CEST 2018] Executing as petr@D152 on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
    INFO 2018-04-12 17:35:11 RevertSam Reverted 1,000,000 records. Elapsed time: 00:00:04s. Time for last 1,000,000: 4s. Last read position: chrY:7,983,283
    INFO 2018-04-12 17:35:16 RevertSam Reverted 2,000,000 records. Elapsed time: 00:00:09s. Time for last 1,000,000: 4s. Last read position: chrY:13,451,192
    INFO 2018-04-12 17:35:22 RevertSam Reverted 3,000,000 records. Elapsed time: 00:00:14s. Time for last 1,000,000: 5s. Last read position: chrY:15,783,059
    INFO 2018-04-12 17:35:28 RevertSam Reverted 4,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 5s. Last read position: chrY:19,677,785
    INFO 2018-04-12 17:35:32 RevertSam Reverted 5,000,000 records. Elapsed time: 00:00:24s. Time for last 1,000,000: 4s. Last read position: chrY:22,893,813
    INFO 2018-04-12 17:35:38 RevertSam Reverted 6,000,000 records. Elapsed time: 00:00:31s. Time for last 1,000,000: 6s. Last read position: chrY:25,050,194
    INFO 2018-04-12 17:35:47 RevertSam Reverted 7,000,000 records. Elapsed time: 00:00:39s. Time for last 1,000,000: 8s. Last read position: chrY:28,818,028
    INFO 2018-04-12 17:36:43 RevertSam Detected quality format for GRC14388618_S53_L00: Standard
    INFO 2018-04-12 17:36:53 RevertSam Discarded 7027084 out of 7027084 (100.000%) reads in order to sanitize output.
    [Thu Apr 12 17:36:53 CEST 2018] picard.sam.RevertSam done. Elapsed time: 1.78 minutes.
    Runtime.totalMemory()=12125732864
    To get help, see http://broadinstitute.github.io/pica...ml#GettingHelp
    Exception in thread "main" picard.PicardException: Discarded 100.000% which is above MAX_DISCARD_FRACTION of 0.500%
    at picard.sam.RevertSam.doWork(RevertSam.java:336)
    at picard.cmdline.CommandLineProgram.instanceMain(Com mandLineProgram.java:282)
    at picard.cmdline.PicardCommandLine.instanceMain(Pica rdCommandLine.java:98)
    at picard.cmdline.PicardCommandLine.main(PicardComman dLine.java:108)
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  6. #16
    Registered Users
    Posts
    257
    Sex

    Also, the new version in GitHub I changed the extension to .unmapped.bam because that's what Broad Institute used. If you re-name the old .u.bam to .unmapped.bam, the script won't need to re-make it for mapping.

    I'm not seeing all the tags in my BigY file. Most of the BAM file tags won't matter in current use, because "samtools fastq -t" only outputs RG, BC and QT tags, which I think I mentioned before. I'll see about adding them to be stripped though; I left the FTDNA proprietary ones because they might hold additional metadata not related to mapping, though it's doubtful.

  7. The Following User Says Thank You to Donwulff For This Useful Post:

     Petr (04-12-2018)

  8. #17
    Registered Users
    Posts
    257
    Sex

    Your 2016 BigY BAM looks "broken", it has the Illumina Casava 1.8 headers included in the read/querynames. I think it would be possible to fix that, though I don't have file to exactly test it on. Maybe I should add that to the script too if it's more common...

    (samtools view -H sample1.bam; samtools view sample1.bam | awk '{ OFS="\t"; split($1,q," "); $1=q[1]; print }') | samtools view -b -o out.bam

    Did they return the non-Y chromosome reads in those, or what's that?

    BAM files can use different levels of compression, though most tools don't let you choose the level of compression. I'm thinking of changing the script to COMPRESSION_LEVEL=9 actually, the Broad Institute GATK page doesn't tell to set that, but if you're going to save the unmapped BAM it might be worth it to use maximum compression. In practice, that's likely going to make very little difference usually.

    Another thing is that when a BAM file is sorted in chromosome order, it will compress a lot more due to the redudant/repetitive content in overlapping reads. A sorted BAM can therefore be smaller than unmapped bam. An unmapped BAM in chromosome sort order would be the smallest, but the queryname order is required by full MarkDuplicates, and provides for better insert length determination on mapping.
    Last edited by Donwulff; 04-12-2018 at 05:43 PM.

  9. The Following User Says Thank You to Donwulff For This Useful Post:

     Petr (04-12-2018)

  10. #18
    Registered Users
    Posts
    436
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    Thank you Don, I think there is a dash sign for stdin missing, but still it gives error:

    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    [main_samview] fail to read the header from "-".

    Edit: this error was probably caused by using the old version of samtools.

    With the samtools 1.8, there is another error:

    [W::sam_read1] Parse error at line 96
    [main_samview] truncated file.

    If I understand correctly, the awk script changed space delimiter to tab delimiter.

    For some reason this was not acceptable as input format for samtools.

    So I tried to use sed -e s/\ [^\\t]*\\t/\\t/ instead of awk to remove the Illumina Casava 1.8 headers and now the result of RevertSam is very similar to your result:

    Code:
    [Thu Apr 12 22:01:15 CEST 2018] RevertSam INPUT=out.bam OUTPUT=out.u.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP, XS, XA, BD, BI] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/tmp] VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=7000000    OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic KEEP_FIRST_DUPLICATE=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Thu Apr 12 22:01:15 CEST 2018] Executing as petr@D152 on Linux 4.13.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.2-SNAPSHOT
    INFO    2018-04-12 22:01:19     RevertSam       Reverted     1,000,000 records.  Elapsed time: 00:00:04s.  Time for last 1,000,000:    4s.  Last read position: chrY:7,983,283
    INFO    2018-04-12 22:01:24     RevertSam       Reverted     2,000,000 records.  Elapsed time: 00:00:09s.  Time for last 1,000,000:    5s.  Last read position: chrY:13,451,192
    INFO    2018-04-12 22:01:32     RevertSam       Reverted     3,000,000 records.  Elapsed time: 00:00:17s.  Time for last 1,000,000:    7s.  Last read position: chrY:15,783,059
    INFO    2018-04-12 22:01:36     RevertSam       Reverted     4,000,000 records.  Elapsed time: 00:00:20s.  Time for last 1,000,000:    3s.  Last read position: chrY:19,677,785
    INFO    2018-04-12 22:01:42     RevertSam       Reverted     5,000,000 records.  Elapsed time: 00:00:26s.  Time for last 1,000,000:    5s.  Last read position: chrY:22,893,813
    INFO    2018-04-12 22:01:46     RevertSam       Reverted     6,000,000 records.  Elapsed time: 00:00:31s.  Time for last 1,000,000:    4s.  Last read position: chrY:25,050,194
    INFO    2018-04-12 22:01:53     RevertSam       Reverted     7,000,000 records.  Elapsed time: 00:00:38s.  Time for last 1,000,000:    6s.  Last read position: chrY:28,818,028
    INFO    2018-04-12 22:02:44     RevertSam       Detected quality format for GRC14388618_S53_L00: Standard
    INFO    2018-04-12 22:02:56     RevertSam       Sanitized     1,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:   11s.  Last read position: */*
    INFO    2018-04-12 22:03:07     RevertSam       Sanitized     2,000,000 records.  Elapsed time: 00:00:23s.  Time for last 1,000,000:   11s.  Last read position: */*
    INFO    2018-04-12 22:03:18     RevertSam       Sanitized     3,000,000 records.  Elapsed time: 00:00:34s.  Time for last 1,000,000:   11s.  Last read position: */*
    INFO    2018-04-12 22:03:30     RevertSam       Sanitized     4,000,000 records.  Elapsed time: 00:00:46s.  Time for last 1,000,000:   11s.  Last read position: */*
    INFO    2018-04-12 22:03:42     RevertSam       Sanitized     5,000,000 records.  Elapsed time: 00:00:57s.  Time for last 1,000,000:   11s.  Last read position: */*
    INFO    2018-04-12 22:03:54     RevertSam       Sanitized     6,000,000 records.  Elapsed time: 00:01:09s.  Time for last 1,000,000:   11s.  Last read position: */*
    INFO    2018-04-12 22:04:05     RevertSam       Discarded 61588 out of 7027084 (0.876%) reads in order to sanitize output.
    [Thu Apr 12 22:04:05 CEST 2018] picard.sam.RevertSam done. Elapsed time: 2.84 minutes.
    Runtime.totalMemory()=10565976064
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" picard.PicardException: Discarded 0.876% which is above MAX_DISCARD_FRACTION of 0.500%
            at picard.sam.RevertSam.doWork(RevertSam.java:336)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
    This problematic Big Y file is dated January 19th 2016.
    The latest "good old" Big Y file I have is dated March 28th 2015.
    Last edited by Petr; 04-12-2018 at 08:19 PM.
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

  11. #19
    Registered Users
    Posts
    257
    Sex

    Ah yes, the dash not needed in current version so easy to forget

    samtools fastq actually works right on this file, returning correct Illumina Casava headered FASTQ, so that's an interesting case for the Unmapped BAM's.

    gawk actually by default breaks input fields at any whitespace by default, which I missed because I didn't have anything to test it with.
    (samtools view -H sample1.bam; samtools view sample1.bam | awk 'BEGIN { FS="\t"; OFS="\t"; } { split($1,q," "); $1=q[1]; print }') | samtools view -b -o out.bam -

    The first columns of headers like @HD and @RG don't have spaces in them, so it would probably be okay to pipe both header and mappings through the gawk script, but better safe and sorry, the above filters only the mappings through it. Columns (fields) are split at tabulator, first column split at space, storing the part before the space back into first column. Columns are printed out tab-separated, and the resulting text-format (SAM) file is piped through samtools to store it into compressed BAM.

    I don't get why "samtools view" doesn't support higher compression level, though it would save very little space. Testing & thinking on that a bit, for archival one might want to use xz compression. xz actually supports block-based parallel compression, though it would currently need a bit of tweaking to get xz to work with highly parallelized pipelines where you'd split the input bam into multiple sections. xz compression level 3 in blocks is still faster than bgzip level 9, and 13% smaller. For a mapped BAM, CRAM is best, but got me into thinking what would be best option for unmapped BAM's, especially if you were going to re-process them 10+ times with different pipeline settings.

  12. The Following User Says Thank You to Donwulff For This Useful Post:

     Petr (04-13-2018)

  13. #20
    Registered Users
    Posts
    436
    Sex
    Location
    Praha, Czech Republic
    Ethnicity
    Czech
    Nationality
    Czech
    Y-DNA
    R-Y14088
    mtDNA
    J1c1i

    Czech Republic Austria Austrian Empire Bohemia Carinthia
    I'm not sure what is the purpose of READ_NAME_REGEX, i see in the log READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values>

    If I check the read group naming for various tests:

    FGC Elite 1.0
    FCC63CUACXX:4:2206:18663:2479#
    FGC WGS 15x
    ST-E00126:132:HJHMLCCXX:7:2208:17695:46947
    Dante
    CL100054478L1C001R004_35916
    FTDNA Big Y
    SN7001368:163:H8TUEADXX:1:1110:12426:93783
    Genos
    SN7001218:858:HJLM5BCXY:1:1213:9857:47107

    So if I understand correctly, the default behavior will take the last three ":" separated fields, i.e., in this example, sure correct for FTDNA, Genos and FGC WGS. For Dante there is your expression. I'm not sure how it is with FGC Elite 1.0, because the third (last) field is 2479#, I don't know it it will be evaluated as numeric value? The BAM file contains READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9][0-9]+)[0-9]+)[0-9]+).*
    Y-DNA: R-Y14088 (ISOGG: R1b1a1a2a1a2b1c2b1a1a)
    mtDNA: J1c1i (J1c1 + 7735G and 8848C) Extras: 198T 12007A 16422C 16431A

Page 2 of 6 FirstFirst 1234 ... LastLast

Similar Threads

  1. Dante Labs (WGS)
    By MacUalraig in forum Other
    Replies: 796
    Last Post: Yesterday, 09:42 AM
  2. What are the chemicals used in DNA sequencing
    By Anabasis in forum Inquiries Corner
    Replies: 2
    Last Post: 12-17-2015, 02:05 AM
  3. Replies: 10
    Last Post: 07-01-2015, 03:51 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •