I landed the promised short-read alignment script using MarkDuplicatesSpark in my GitHub account at https://raw.githubusercontent.com/Do.../revert-bam.sh

I have to say that MarkDuplicatesSpark is probably the most frustratingly annoying bioinformatics tool I've come across. My first attempt failed while creating the BAM index because it ran out of memory with the default Java memory allocation; the second failed somewhere near the end because it throws temporary files all over the place and there's no way to continue; the third failed because at the end of the process it complained about a leftover directory from the previous run; the fourth failed midway because it realized there was a single-end read in the input... well, you start getting the idea.
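Two of those failure modes (the default Java heap and the scattered temporary files) can at least be contained up front. A minimal sketch of the invocation, assuming a hypothetical input/output naming and an SSD-backed scratch directory; the `--java-options`, `--tmp-dir`, and `--spark-master` flags are standard GATK options, but verify them against your GATK version. The command is echoed rather than executed so the sketch stands alone:

```shell
#!/bin/sh
# Sketch: give the JVM an explicit heap so the BAM index step doesn't die
# with the default allocation, and confine Spark's temp files to one
# directory that can be wiped before a retry. Sizes/paths are placeholders.
HEAP="-Xmx16g"                        # assumption: enough for the index step
TMP="/ssd/tmp/mdspark.$$"             # hypothetical SSD-backed scratch dir
CMD="gatk --java-options $HEAP MarkDuplicatesSpark \
  -I aligned.bam -O marked.bam -M dup-metrics.txt \
  --tmp-dir $TMP --spark-master local[4]"
echo "$CMD"   # printed for inspection; drop the echo to actually run it
```

Wiping `$TMP` between attempts also sidesteps the "leftover directory from the previous run" complaint.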

On try n+1 it... just crashed while creating the BAM index for no apparent reason that I could find, essentially giving the same result as try #1. Okay, so I just generate the BAM index with samtools instead and call it a day. Or rather, the 48 hours it took for the test runs. Most of these issues wouldn't have surfaced on a smaller test input, but ironically, on a four-core AMD the MarkDuplicatesSpark flow actually takes much longer than the previous MarkDuplicates workflow. This is contrary to what the Broad Institute says about its performance, although that might be in part because my MarkDuplicates version was already optimized with a FIFO buffer, samtools for sorting and indexing, and so on. On the bright side, MarkDuplicatesSpark manages to keep the cores busy most of the time, so it SHOULD be faster on 8 or 16 execution threads. Unfortunately, you can't make the BAM slices it processes larger than they are, so I imagine it wastes a whole lot of work juggling the Apache Spark/Hadoop blocks.
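The samtools workaround sketched: turn off GATK's own index creation and index the output afterwards. `--create-output-bam-index` is GATK's standard output option and `-@` is samtools' thread-count flag, but the file names here are placeholders and the command is echoed rather than run:

```shell
#!/bin/sh
# Sketch: skip the in-JVM BAM index step that kept crashing, and let
# samtools build the .bai instead. Placeholder file name, 4 threads assumed.
BAM=marked.bam
# ... gatk MarkDuplicatesSpark ... --create-output-bam-index false ...
INDEX_CMD="samtools index -@ 4 $BAM"
echo "$INDEX_CMD"   # drop the echo to actually run it
```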

That being said, MarkDuplicatesSpark works on a single compute node with no additional requirements besides Java and GATK, so I made it the new default workflow, because people running it probably have 8 cores or more and are in a hurry, right? I still preserved the old version, which can be used by commenting out the GATK_SPARK definition. Maybe I should choose between them automatically depending on the number of processor threads. But since MarkDuplicatesSpark seems to have a tendency to blow up, I added an extra define, KEEP_TEMPORARY, which defaults to preserving the raw bwa mem output from before MarkDuplicatesSpark in case one has to go back to it. I would imagine it's really, really slow without an SSD for temp files, since it uses literally tens of thousands of fragments, but I'm not crazy enough to try.
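The automatic selection floated above could look roughly like this. The 8-thread cutoff is an assumption rather than a benchmarked threshold, and the variable names merely mirror the script's defines:

```shell
#!/bin/sh
# Sketch: pick plain MarkDuplicates on few-core machines and the Spark
# version otherwise; keep the raw bwa mem output unless told not to.
THREADS=$(nproc 2>/dev/null || echo 1)   # fall back to 1 if nproc is absent
KEEP_TEMPORARY=1                          # default: keep the pre-dedup BAM
if [ "$THREADS" -ge 8 ]; then             # assumed cutoff, not benchmarked
    TOOL=MarkDuplicatesSpark
else
    TOOL=MarkDuplicates
fi
echo "would run gatk $TOOL on $THREADS threads"
# only delete the intermediate when explicitly allowed to
[ "$KEEP_TEMPORARY" = 1 ] || rm -f raw.bwa.bam
```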

I would appreciate hearing about any unexpected effects should people try the script out, although in the end one is always responsible for checking the script oneself to make sure it does what one needs it to do. To that end, there's a bunch of links and references added in the comments.