Results 1 to 2 of 2

Thread: Sequencing & mitochondrial heteroplasmy

  1. #1
    Registered Users

    Sequencing & mitochondrial heteroplasmy

    Mitochondrial heteroplasmy came up on another sequencing thread. For those who might have been asleep through their biology classes, mitochondria is the powerhouse of the cell... And actually, there's anywhere from none (red blood cells) to thousands (liver cells) of mitochondria in cells of human tissue, and they each have their own mtDNA. Because of this, sequencing samples generally have tremendously more copies of the mitochondrial DNA than the chromosomes itself.

    This has given rise to extracting mtDNA sequence straight off whole genome or exome sequences, or even Y-chromosome targeted sequencing if mtDNA is not filtered out. This also means it is possible for different copies of the mtDNA to have different genetic variants, known as heteroplasmy. In practical terms, mtDNA variants are often talked about as fraction of the mtDNA that have specific variant, rather than a simple "is/is not" or the paired approach of autosomal chromosomes, albeit there is a tendency for mtDNA variants to settle to one or the other over generations.

    Anyway, a newly published research note "Assessing mitochondrial heteroplasmy using next generation sequencing: A note of caution" gives a cautionary note about medical interpretation of mtDNA heteroplasmy just as the title implies. This is because DNA occasionally breaks apart and attaches itself where it doesn't belong, and due to the large number of copies of mtDNA in typical cell, this has happened especially often with DNA. This could also be one source of "ghost haplotypes" sometimes seen in mtDNA data on microarrays:

    I'll add another cautionary note, though, which is that most medical mtDNA sequences select for the mtDNA specifically and do not include (as much) of the nuclear DNA chromosomes, which the above study assumes. I'm not privy to the details, but I suspect FTDNA's mtDNA products do the same, possibly even using Sanger sequencing where the reads could be more confidently identified as coming from mtDNA itself, meaning more specifity but less sensitivity due to lower read depth. This is one reason to prefer mtDNA-targeted sequencing, however it's undeniable for consumer & genealogical uses getting the data out of whole genome sequence is attractive option.

    This recent research note got me curious how much this happens in our samples. Perhaps I'll get around to drawing nice histograms with R, but in the meanwhile I played around just a bit doing exploratory stuff with basic bioinformatics & shell tools. This monstrosity could be repeated with Linux, or even Windows 10 Subsystem for Linux, but setting that up is beyond the scope of this post. The columns indicate number of short reads sequenced (Here BGISEQ-500/Dante Labs short read sequencing, 100bp paired-end sequencing to be specific), location on mtDNA (100bp resolution) and location on nuclear DNA (1000bp resolution) of reads which don't map to mtDNA alone.

    A Personal Genomes Project participant sample:
    samtools view -F 0xF00 PGP-DanteLabs.bam chrM | gawk -F"[ \t]" '{ if($7!="=") { ci=$6; mc=""; if(substr($12,1,5)=="MC:Z:") mc=$12; if(substr($13,1,5)=="MC:Z:") mc=$13; print "chrM",int($4/100)/10"Kb","to",$7,int($8/1000)/1000"Mb" } }' | sort | uniq -c | sort -nr | head -n25
    476 chrM 16.3Kb to chr17 22.02Mb
    192 chrM 16.4Kb to chr17 22.02Mb
    111 chrM 16.2Kb to chr17 22.02Mb
    35 chrM 0.3Kb to chr2 33.141Mb
    16 chrM 16Kb to chr11 49.883Mb
    16 chrM 11.5Kb to chr11 100.015Mb
    15 chrM 16.1Kb to chr11 49.883Mb
    13 chrM 13.1Kb to chr4 190.598Mb
    11 chrM 16.5Kb to chr17 22.02Mb
    11 chrM 13Kb to chr4 190.598Mb
    10 chrM 0.4Kb to chr2 33.141Mb
    9 chrM 12.3Kb to chr4 190.598Mb
    8 chrM 16.4Kb to chr11 4.656Mb
    7 chrM 16.1Kb to chr17 22.02Mb
    6 chrM 16.4Kb to chr11 49.883Mb
    5 chrM 0.2Kb to chr2 33.141Mb
    4 chrM 6.8Kb to chr11 33.919Mb
    3 chrM 12.9Kb to chr4 190.598Mb
    3 chrM 11Kb to chr2 33.141Mb
    3 chrM 0Kb to chr11 49.883Mb
    3 chrM 0.1Kb to chr17 22.02Mb
    2 chrM 9.8Kb to chrX 16.275Mb
    2 chrM 9.8Kb to chr8 121.547Mb
    2 chrM 9.5Kb to chr16 88.341Mb
    2 chrM 9.3Kb to chr1 0.569Mb

    My own sample:
    443 chrM 16.3Kb to chr17 22.02Mb
    276 chrM 16.4Kb to chr17 22.02Mb
    62 chrM 0.3Kb to chr2 33.141Mb
    48 chrM 16.2Kb to chr17 22.02Mb
    48 chrM 0.4Kb to chr2 33.141Mb
    17 chrM 16.1Kb to chr11 49.883Mb
    17 chrM 12.8Kb to chr4 29.441Mb
    14 chrM 0.5Kb to chr2 33.141Mb
    13 chrM 14.8Kb to chr2 33.892Mb
    12 chrM 16.4Kb to chr11 4.656Mb
    9 chrM 16.1Kb to chr17 22.02Mb
    8 chrM 16.4Kb to chr11 49.883Mb
    8 chrM 14.9Kb to chr2 33.892Mb
    8 chrM 12.9Kb to chr4 29.441Mb
    8 chrM 12.3Kb to chr4 190.598Mb
    6 chrM 16Kb to chr11 49.883Mb
    6 chrM 13.1Kb to chr4 190.598Mb
    5 chrM 2.4Kb to chr5 8.255Mb
    5 chrM 16.5Kb to chr17 22.02Mb
    4 chrM 6.5Kb to chr1 0.567Mb
    4 chrM 6.5Kb to chr1 0.566Mb
    4 chrM 6.1Kb to chr4 87.059Mb
    4 chrM 5Kb to chr1 0.565Mb
    4 chrM 5.7Kb to chr21 23.08Mb
    4 chrM 4Kb to chr1 0.565Mb

    We can see these discordant mappings have fairly similar pattern between the two samples, which is as expected. Indeed, since this exploratory analysis is working off reads mapped to a specific reference, it will actually *only* reveal nuclear mitochondrial segments that already exist in the reference sequence. As the research note referenced points out, individual genomic sequences are expected to also have nuclear mitochondrial segments which are not included in the reference genome. On the other hand, we can note that with a mitochondrial read depth upwards to 5000, only the mtDNA 16200-16500 range ( ) appear to pose significant overlap.

    On the two samples at hand, this range does indeed appear to exhibit low level of heteroplasmy, though in these samples they stay at under 1%, ie. 0.01 allele fraction (As you would expect if the mitochondrial read depth is 4000 and nuclear DNA read depth 30-40). As it was noted to me FTDNA apparently won't call heteroplasmy unless it's over 10% (This could be due to using Sanger sequencing with lower read depth though), and in any event they are presumably using some kind of targeted mtDNA sequencing. But for people looking for lower levels of heteroplasmy in WGS data, this provides empirical validation that false heteroplasmy up to at least around 1% can be shown in known and presumably unknown, individual specific locations of the mtDNA as well.

    for b in `cat samples`; do bcftools query -f 'AF:[ %AF]\n' -i 'POS>16200&&AF!="."' $b; echo "---"; done
    AF: 0.992,0
    AF: 0.996,0.0008597,0
    AF: 1,0
    AF: 0.997,0
    AF: 0.999,0
    AF: 0.994,0
    AF: 0.993,0

    There are some very interesting & involved papers on the subject, even from quite far back like but for most practices and purposes ignoring low levels of apparent heteroplasmy is good idea.

  2. The Following 2 Users Say Thank You to Donwulff For This Useful Post:

     DMXX (05-07-2019),  spruithean (05-07-2019)

  3. #2
    Registered Users

    Incidentally, there was just another paper on Numts: Duan, M. et al. 2019. Evaluating heteroplasmic variations of the mitochondrial genome from whole genome sequencing data. Gene. 699, (May 2019), 145–154.
    Naturally, I got curious. I was not fully impressed by the methodology in that paper, but in it they mention "The first alignment step aimed to maximally obtain the reads that are most similar to the mitochondrial genome. The alignment parameters were modified based on the default ‘--very-sensitive-local’ parameters. The number of failed extends in a row before giving up extending was upgraded to 20 (default 15). We also tolerated one mistake in seed alignment whose default value is zero tolerance. The length of seed substrings was shortened to 16 to improve the sensitivity of the alignment." Bwa mem and bowtie2 are both well maintained state-of-the-art aligners with similar memory profile, with --very-sensitive-local approximating bwa mem's accuracy, but bwa mem is generally considered the better of the two on human genomes.

    Consequently I used my "bwa mem" based pipeline described elsewhere to map my own whole genome against just my own mitochondria reference with minimum seed length 16. While that's clever for extracting mitochondrial only reads, in retrospect for Numts extraction one might want to use Yoruban YRI reference or perhaps even some primate mitochondria due to reproducibility and more ancient Numts being more common. The pipeline is detailed elsewhere, but roughly that equates to:
    samtools fastq -t sample1-YXXXXX.unmapped.bam | fastp -QLGp --stdin --stdout --interleaved_in | bwa mem -k16 -p -t 8 -M -C -H sample1-YXXXXX.unmapped.bam.hdr /mnt/GenomicData/GRCh38/YFXXXXX.fasta | grep -EP "(^#|\tchrM\t)" | samtools view -b -o sample1-YFXXXXX.chrM.bam
    This takes a LONG time because it processes and tries to map whole sequence, and filters out reads whose either end doesn't match my mitochondria sequence for at least 16 consequent nucleotides. And yes, there are certainly faster methods of doing something this simple, in fact there are methods that try to identify mitochondria-like DNA with just their k-mer base-content, but anyway... The "postp" part is very important, as it filters out adapter-readthroughs from the reads.

    My ultimate goal is to look for Numts that aren't included in the GRCh38 reference build, so I can't use a fully reference based approach, and have to ditch the one in the paper. I decided to use Fermi2 genome-assembler to make a de novo assembly of the "mt-dna like" reads extracted in previous step. Let's take a wild guess that the "genome size" is 16k (16000) here:

    samtools fastq sample1-YFXXXXX.chrM.bam > sample1-YFXXXXX.chrM.fq
    ./ unitig -s16k -t8 -l100 -p sample1-YFXXXXX.chrM.tmp sample1-YFXXXXX.chrM.fq > sample1-YFXXXXX.chrM.mak
    make -f sample1-YFXXXXX.chrM.mak
    Okay, that led to bunch of weird results (-nan etc.) in the log files. '
    zgrep "^[GACT]*$" sample1-YFXXXXX.chrM.mag.gz | tr -d "\n" | wc
    ' shows we have maybe 500.000 nucleotides worth of unique genomic data. Actually running that with the Fermi2 assembler says "approximate genome size: 1682799 (inaccurate!)" so I compromised and made a final run with a round 1 million nucleotide expected genome size. Incidentally, "Human Molecular Genetics" by Tom Strachan, Andrew Read says "The transferred mtDNA sequences have acquired inactivating mutations over time and so are sometimes described as nuclear mitochondrial pseudogenes, but they are more generally known as nuclear mitochoondria DNA sequences (NUMTs). In total they account for over 627 kb of nuclear genome; see Table 9.3 for five prominent examples." so we're right in the ballpark. Since this assembly will capture also some false positives as well as up to (100-16)=84 nucleotides flanking sequences around the Numts, I think 1M is a good rough estimate. Written out, without the makefile generated by the perl script:

    ./bfc -s 1M -t 8 <(samtools fastq sample1-YFXXXXX.chrM.bam) <(samtools fastq sample1-YFXXXXX.chrM.bam) 2> sample1-YFXXXXX.chrM.fq.log | bgzip > sample1-YFXXXXX.chrM.fq.gz
    ./bfc -1s 1M -k 61 -t 8 sample1-YFXXXXX.chrM.fq.gz 2> sample1-YFXXXXX.chrM.flt.fq.log | bgzip > sample1-YFXXXXX.chrM.flt.fq.gz
    ./ropebwt2 -dNCr sample1-YFXXXXX.chrM.flt.fq.gz > sample1-YFXXXXX.chrM.flt.fmd 2> sample1-YFXXXXX.chrM.flt.fmd.log
    ./fermi2 assemble -l 51 -m 76 -t 8 sample1-YFXXXXX.chrM.flt.fmd 2> sample1-YFXXXXX.chrM.pre.log | bgzip > sample1-YFXXXXX.chrM.pre.gz
    ./fermi2 simplify -CSo 56 -m 76 -T 51 sample1-YFXXXXX.chrM.pre.gz 2> sample1-YFXXXXX.chrM.mag.log | bgzip > sample1-YFXXXXX.chrM.mag.gz
    Now the most obvious thing we can do with these de novo assembled mitochondria-like sequences is map them against the GRCh38 genome build and see where they will map:
    bwa mem -M -k15 /mnt/GenomicData/GRCh38/hg38DH-p13.fa sample1-YFXXXXX.chrM.mag.gz | samtools sort -o sample1-YFXXXXX.chrM.mag.sort.bam -
    And one of those headache-to-do analysis scripts to print out a bed file of genomic reqions where they map:
    samtools depth sample1-YFXXXXX.chrM.mag.sort.bam | gawk 'BEGIN { min_depth = 1; max_gap = 250 } { if( $3 >= min_depth ) { if( !start ) { printf("%s\t%i\t",$1,$2-1); start=$2 } else if( ( ( chr == $1 ) && ( pos + max_gap < $2 ) ) || ( start && chr != $1 ) ) { print pos-1, pos-start+1; start=0; } chr = $1; pos = $2 } } END { if( start ) { print pos-1, pos-start; start=0; } }' | less
    I couldn't find a list of known Numts on GRCh38 build, there exists some for the previous builds. I could lift over them to GRCh38 coordinates, but due to changes in the reference genome they might no longer be exactly valid. However, recalling Table 9.3 mentioned above, it's in GRCh38 coordinates:
    Chr17 Chr5 Chr1 Chr5 Chr7
    22,521,401 100,055,045 629,084 134,928,526 57,185,765
    22,532,515 100,045,983 634,924 134,923,309 57,198,375
    11,115 9,108 5,841 5,219 12,611
    85.1% sequence identity 89.2% sequence identity 98.6% sequence identity 94.1% sequence identity 79.3% sequence identity

    And what do we have? I'm restricting this list to segments over 5000 nucleotides to keep it shorter, but Chr17 segment has ~200 base gap.

    chr1 628976 634986 6011
    chr5 100045781 100055197 9417
    chr5 134923135 134928707 5573
    chr11 103405734 103411096 5363
    chr17 22519085 22532829 13745
    chrM 1 16568 16568

    Ha! Not bad for a first try. Except for the table formatting, that's bad.

    Note that my regions include the flanking sequences. Several of my regions have higher "read depth" than 2, meaning that more than two de novo assembled haplotypes mapped to the same region, suggesting that they may in fact come from separate regions of the genome which don't exist in the reference build. There's also 11 shorter segments that don't map anywhere; that's curious as you'd think they'd map to at least chrM if nothing else. Next step might be mapping the 100bp PE whole genome sequence back against those de novo assembled segments to see what's the real read depth and variants, and of course serve as sink for Numts, although the Oxford Nanopore long-read sequence would probably be a back-stop to this with a good chance of being able to really resolve the mitochondria-like sequences.

  4. The Following User Says Thank You to Donwulff For This Useful Post:

     spruithean (06-14-2019)

Similar Threads

  1. Replies: 2
    Last Post: 04-17-2017, 08:59 PM
  2. Why Do We Inherit Mitochondrial DNA Only From Our Mothers?
    By rock hunter in forum Inquiries Corner
    Replies: 2
    Last Post: 07-30-2016, 02:38 AM
  3. Replies: 10
    Last Post: 07-01-2015, 03:51 PM
  4. Mitochondrial DNA as a factor in osteoarthritis
    By apophis99942 in forum Medical Genetics
    Replies: 1
    Last Post: 05-21-2015, 06:26 PM
  5. Another Neandertal mitochondrial genome
    By Jean M in forum Human Evolution
    Replies: 0
    Last Post: 02-01-2014, 11:48 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts