
Thread: New paper: Systematic analysis of dark and camouflaged genes [...]

  1. #1
    Registered Users


    Interesting timing: the lack of information on reference confidence, confidently called regions and no-calls in some sequencing data, together with the newly available DTC Oxford Nanopore Technologies long-read sequencing, has raised related issues and questions on many threads. Just recently, an open-access paper came out discussing the regions that are generally hard to sequence, either due to low coverage or due to similarity to other genomic regions, and in particular comparing different long-read sequencing technologies. The paper deals with the implications of this mainly from the perspective of health-relevant variation. For genetic genealogy, it should be relatively easy to mask out the most common "dark regions" as "no-call" and assume the rest match the reference even if not called in the output. Of particular interest among the paper's findings: "Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively."
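
    To make the masking idea concrete, here is a minimal sketch of how it could look for a single chromosome. Everything in it is my own assumption rather than anything from the paper: the DarkMask class, the genotype_at helper, the half-open 0-based intervals (as in a BED file) and the choice to let the mask override even an explicit call.

```python
from bisect import bisect_right


class DarkMask:
    """Hypothetical helper: merged dark intervals for ONE chromosome.

    Intervals are half-open (start, end) with 0-based coordinates, as in BED.
    """

    def __init__(self, regions):
        merged = []
        for start, end in sorted(regions):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching interval: extend the previous one.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        self.starts = [s for s, _ in merged]
        self.ends = [e for _, e in merged]

    def contains(self, pos):
        """True if pos falls inside any dark interval."""
        i = bisect_right(self.starts, pos) - 1
        return i >= 0 and pos < self.ends[i]


def genotype_at(pos, ref_base, calls, mask):
    """Resolve one position from sparse variant calls plus a dark-region mask.

    Assumption: a masked position is reported as "no-call" even when the
    caller emitted a variant there, because that call is not trustworthy.
    """
    if mask.contains(pos):
        return "no-call"
    # Outside the mask, an uncalled position is assumed to match the reference.
    return calls.get(pos, ref_base)


mask = DarkMask([(100, 200), (150, 250), (400, 410)])
calls = {120: "G", 300: "T"}
for pos, ref in [(120, "A"), (300, "C"), (301, "C")]:
    print(pos, genotype_at(pos, ref, calls, mask))
```

    Whether a variant called inside a dark region should be kept or discarded is a judgment call; the sketch takes the conservative route.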

    "Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight"

    I'll reproduce the beginning of the introduction in the paper because it lays out the basic challenges really well:

    Researchers have known for years that large, complex genomes, including the human genome, contain “dark” regions—regions where standard high-throughput short-read sequencing technologies cannot be adequately assembled or aligned—thus preventing our ability to identify mutations within these regions that may be relevant to human health and disease. Some dark regions are what we term “dark by depth” (few or no mappable reads), while others are what we term “dark by mapping quality” (reads aligned to the region, but with a low mapping quality). Regions that are dark by depth may arise because the region is inherently difficult to sequence at the chemistry level (e.g., high GC content [1, 2]), essentially eliminating sequencing reads from that region altogether.

    Other dark regions arise, not because the sequencing is inherently problematic, but because of bioinformatic challenges. Specifically, many dark regions arise from duplicated genomic regions, where confidently aligning short reads to a unique location is not possible; we term these regions as “camouflaged”. These camouflaged regions are generally either large contiguous tandem repeats (e.g., centromeres, telomeres, and other short tandem repeats), or a larger specific DNA region that has been duplicated (e.g., a gene duplication) either in tandem or in a more distal genome region.

    In fact, many genes in the human genome were duplicated over evolutionary time and are still transcriptionally and translationally active (e.g., heat-shock proteins) [3, 4, 5, 6, 7, 8, 9], while others have been duplicated, but are considered inactive (i.e., pseudogenes). Regardless of whether the duplication is active, however, any genomic region that has been nearly identically duplicated and is large enough to prevent sequencing reads from aligning unambiguously will be “dark”, because the aligner cannot determine which genomic region the read originated from.
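
    The two definitions quoted above ("dark by depth" and "dark by mapping quality") can be sketched as a per-position classifier. The function name and the three cut-offs below are placeholder assumptions for illustration; check the paper's methods section for the thresholds the authors actually used.

```python
def classify_position(mapqs, max_dark_depth=5, low_mapq=10, low_mapq_fraction=0.9):
    """Classify one genomic position from the MAPQ values of the reads covering it.

    mapqs             -- list of mapping qualities, one per covering read
    max_dark_depth    -- depth at or below which the position counts as dark by depth
    low_mapq          -- MAPQ below this value counts as a low-quality alignment
    low_mapq_fraction -- share of low-quality reads that makes the position dark

    All three cut-offs are illustrative defaults, not values confirmed here.
    """
    depth = len(mapqs)
    if depth <= max_dark_depth:
        # Few or no mappable reads at all.
        return "dark_by_depth"
    low = sum(1 for q in mapqs if q < low_mapq)
    if low / depth >= low_mapq_fraction:
        # Reads are present, but the aligner could not place them confidently.
        return "dark_by_mapq"
    return "ok"


print(classify_position([60] * 3))             # very shallow coverage
print(classify_position([0] * 28 + [60] * 2))  # deep, but ambiguously mapped
print(classify_position([60] * 30))            # deep and confidently mapped
```

    In practice the MAPQ lists would come from a pileup over a BAM file; the sketch leaves that part out so it does not depend on any particular library.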

  2. The Following 4 Users Say Thank You to Donwulff For This Useful Post:

     JMcB (05-29-2019),  Megalophias (05-24-2019),  Piquerobi (05-24-2019),  spruithean (05-27-2019)

  3. #2
    Registered Users

    The data is interesting and the analysis is good, but I want to point out that comparing the different reference genomes the way the paper appears to do ("We also found that aligning GRCh38+alt increased the number of dark nucleotides > 3 times compared to GRCh37.") makes little sense.

    First, when using an alt-aware read aligner, the resulting primary assembly alignment should more or less resemble one without the alt contigs, which suggests that they weren't running alt-aware mapping (a quick read of the methods section didn't clarify that, but I believe so from the results). And if you do include the alt contigs (or indeed use a more complete reference genome like GRCh38), then some genomes will map to one version of a contig and other genomes will map to an alternate version of it, so a more complete reference genome will always mean more "dark by depth" regions.

    Where a sequencing read can map to multiple locations on the genome, the apparent variants could also come from any of those locations, so you would want to know that the placement is not unique. Therefore, between references, more "dark regions" is better for most practical purposes, because the reference captures more variation between humans and lets you know when a variant is likely to be a mis-call from a similar region of the genome; it's just not something where comparison is meaningful. Within the same reference, though, a lower amount of "dark regions" is generally better.
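
    A toy illustration of the multi-mapping point, assuming exact string matching in place of a real aligner (which would score mismatches and gaps and report a mapping quality). The sequences and function names are made up for the example.

```python
def find_placements(reference, read):
    """Return every 0-based offset where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Continue the search one base further so overlapping hits are found too.
        start = reference.find(read, start + 1)
    return hits


def is_ambiguous(reference, read):
    """A read with more than one possible placement cannot be assigned uniquely."""
    return len(find_placements(reference, read)) > 1


# Made-up reference: the same 12-base segment appears twice around a unique stretch.
duplicated = "GATTACAGATTC"
reference = "CCCC" + duplicated + "ACGTTGCA" + duplicated + "GGGG"

for read in ("TTACAGAT", "ACGTTGCA"):
    print(read, find_placements(reference, read), is_ambiguous(reference, read))
```

    A variant seen in the first read could belong to either copy, which is exactly the case where you want the region flagged rather than silently called.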

  4. #3
    Registered Users
    Colonial British/German

    Pat here. I am new to this forum, but not to some other forums. Some people may recognize my Ysearcher online name from other forums.

    Thank you for this very interesting post. It is right on target with what's been going through my mind over the past couple of months. My first DNA test (Y chromosome 12 STR test from FTDNA) was in 2004, and my most recent test was the Nebula (Gencove) 0.4x whole genome sequence. In the interim, I have had many, many Y chromosome STR and haplogroup tests from many companies, full sequence mtDNA from FTDNA, autosomal (genotype) testing from many companies, an early whole genome 30x sequence from CGI (Complete Genomics Inc, Mountain View, CA) that used very short read technology (35 bp if I'm correct), about 13 immediate and extended family members tested by 23andMe (not counting 2nd/3rd cousins or more distant), whole exomes (Illumina HiSeq, Nextera enriched whole exome) for one son and my ex-wife, and miscellaneous other targeted Sanger sequencing, etc.

    So now, 15 years later, after spending literally thousands of hours reviewing all of this data, I have come to the realization that my hodgepodge collection of data has more holes than a screen door, to the point that it is almost useless for clinical purposes. My very short read whole genome sequence (CGI, PGP-188) has endless strings of no-calls in many genes related to genetic and complex diseases. The two whole exomes are extremely suspect, giving high call scores for old variants and very low call scores for novel and rare variants. The 23andMe genotype data is very difficult to compare between family members, because the testing chips have been continuously revised over the years, with different sets of variants tested in each version. It's enough to make me pull my hair out and scream.

    So we are finally getting to the point where useful and (more) complete test packages are becoming available, at about the same time that my working years have come to a close, and I no longer have the income to pay for these products, which are still (relatively) expensive. What started out as curiosity 15 years ago has evolved into essential information as my family has aged, and a number of family members have been diagnosed with complex genetic disorders. It's not just a compulsive hobby any longer; it is a necessary pursuit.

    So I'm at the end of my rope, trying to get high quality, reliable whole genome sequencing (NGS), with well annotated VCF files and BAM files, as well as long read sequencing (Pacific Biosciences and/or Oxford Nanopore Technologies) and/or optical mapping (Bionano Genomics). One son has been on a ventilator for more than 16 months now (MND/ALS), and he is still lucid, but of course can't last much longer. I don't live anywhere near him (he lives in Canada), so as far as I know, the Dante Labs whole genome collection kit that was delivered to him last November is still at his place, unopened and unused, but he continues to post updates on clinical trials to Facebook. You can lead a horse to water, but you can't make him drink.

    Any suggestions appreciated.

    EDIT - My son's neurologist(s) have never demonstrated any serious interest in genomic testing. It seems that most neurologists specifically do not encourage genomic testing for sporadic ALS, most likely because there is such a dismal track record in finding and proving pathogenic variants for ALS. So he has had zero clinical testing, in spite of my best efforts to accomplish that. Canadian healthcare pretty much writes off molecular diagnosis of sporadic ALS as a hopeless cause, not worth the time and expense.

    Another interesting research publication along the same lines, but pertaining mostly to structural variation, was published in April -
    Multi-platform discovery of haplotype-resolved structural variation in human genomes
    Last edited by Ysearcher; 05-29-2019 at 07:13 PM.

