View Full Version : What haplogroup is the Genome Reference Consortium Human Genome Reference hg19/GRCh38

11-02-2017, 09:04 PM
Should be pretty easy to check from the reference data, but Googling didn't turn up the answer instantly, so I'm wondering if anybody knows it already.

A related thought: as I've got FTDNA Big Y and Dante Labs WGS sequences for myself, there's already some data for looking at structural differences in Y chromosomes. This isn't great, as there are no long-range reads/haplotypes and WGS doesn't separate the Y-chromosome reads, but Big Y's up to 165bp read length and Y-capture give some hope of recovering more structural variation from the WGS. The R and I paternal haplogroups (as well as their sister clades), for example, are thought to have diverged up to 50,000 years ago, making for 100,000 years of divergent evolution.

In the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4817776/ researchers looked at differences between mammalian Y chromosomes, finding significant differences between human and chimpanzee Y chromosomes, for example. More interesting is the methodology: the SPAdes de novo assembler was used for building the references because it's coverage-agnostic, which is important since Big Y is targeted and there are separate sequencing libraries. They used SSPACE for scaffolding, but I'm not certain an alternate scaffolder would add much, given that SPAdes already ships with its own.

For my own workflow, it seems pertinent to also cut out the Nextera sequencing adapters, using trim_galore --paired with -q0 to skip quality trimming, because SPAdes can benefit from the low-quality data and runs BayesHammer error correction on the reads anyway. At present I'm proceeding by first using SPAdes to construct core contigs from the adapter-trimmed Big Y reads, and then running SPAdes on the adapter-trimmed WGS data, using the earlier Big Y contigs as high-quality (trusted) contigs and the GRCh38 human Y reference as a low-quality (untrusted) gap-filler.
Of course, since running SPAdes on the WHOLE genome is quite resource-intensive, the best course of action is probably to use only the unmapped and Y-mapped reads from the WGS, though I'm still evaluating the best strategy. I'm using GRCh38 as the most complete starting point.
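A sketch of those two preparation steps, assuming a GRCh38-aligned WGS BAM; all file names are illustrative and the chrY contig name should match your reference:

```shell
# 1) Trim Nextera adapters from the Big Y reads; -q 0 skips quality trimming
#    so SPAdes/BayesHammer still sees the low-quality bases.
trim_galore --paired --nextera -q 0 bigy_R1.fastq.gz bigy_R2.fastq.gz

# 2) From the WGS BAM, keep only chrY-mapped and unmapped reads to make
#    the SPAdes run tractable (BAM must be coordinate-sorted and indexed).
samtools view -b wgs.GRCh38.bam chrY > wgs.chrY.bam
samtools view -b -f 4 wgs.GRCh38.bam > wgs.unmapped.bam
samtools merge -f wgs.y_plus_unmapped.bam wgs.chrY.bam wgs.unmapped.bam
```

Note that this simple extraction drops mates that mapped to other chromosomes; the fixmate step below is what repairs the resulting pairing metadata.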

By the way, when a mapped genome is filtered to specific chromosomes/contigs, as with the Big Y BAM, or if you do "samtools view input.bam chrY", some unpaired reads will result. To fix this it's necessary to run a workflow equivalent to "samtools sort -n input.bam | samtools fixmate | samtools sort". Then paired and unpaired reads can be separated with -f65/-f129/-F1. I can stick a complete script somewhere, but I'm curious to hear if anybody has done anything similar, or has suggestions.
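Spelled out as explicit steps (file names illustrative), the repair-and-split workflow looks like this:

```shell
# Name-sort so mates are adjacent, then repair pairing flags after filtering.
samtools sort -n -o namesorted.bam filtered.bam
samtools fixmate namesorted.bam fixed.bam
samtools sort -o fixed.sorted.bam fixed.bam

# Split by pairing flags:
samtools view -b -f 65  fixed.sorted.bam > r1.bam        # paired, first in pair  (1+64)
samtools view -b -f 129 fixed.sorted.bam > r2.bam        # paired, second in pair (1+128)
samtools view -b -F 1   fixed.sorted.bam > unpaired.bam  # not flagged as paired
```

From there, `samtools fastq` can convert each subset back to FASTQ for the assembler.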

11-05-2017, 05:33 PM
I believe it is a mix of humans and therefore haplogroups but you can probably assign something like R1b>P312>U152>L2>L20 to it if you look at some of the calls.

11-05-2017, 06:03 PM
Thanks. And yes, it's a mix; I've read more details on it before, but couldn't find them this time around. The closest I found now is http://genome.cshlp.org/content/23/2/388/F1.expansion.html which says "(Yellow bar) Segment of the reference sequence derived from haplogroup G (14,328,588–15,370,586); (blue bar) Haplogroup R1b regions."

As an aside on the technical side, I'd forgotten that the BayesHammer error-correction stage of SPAdes takes huge amounts of memory: 42GB in the case of the Y+unmapped (GRCh38 + all alts, decoys etc.) de novo assembly. The home computer I'm using for testing and development isn't server-class and only has 32GB of memory; zram-config makes that part feasible, though it really brings my hybrid-SSD drives to their knees. A cloud node with SSD and 64GB of memory would breeze through this, though. Afterwards, I realized that SPAdes can probably process both datasets in a single run, because it's sequencing-library aware (i.e. it builds separate statistical models for different libraries).
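A single combined run along the lines described above might look like the following sketch, per the SPAdes manual's multi-library and trusted/untrusted-contigs options; file names are illustrative:

```shell
# Library 1: adapter-trimmed Big Y reads; library 2: Y+unmapped WGS reads.
# SPAdes models each paired-end library separately.
spades.py \
  --pe1-1 bigy_trimmed_R1.fq.gz  --pe1-2 bigy_trimmed_R2.fq.gz \
  --pe2-1 wgs_y_unmapped_R1.fq.gz --pe2-2 wgs_y_unmapped_R2.fq.gz \
  --trusted-contigs bigy_contigs.fasta \
  --untrusted-contigs GRCh38_chrY.fasta \
  -m 64 -t 8 \
  -o spades_y_denovo
```

The -m 64 memory cap matches the observation that 42GB was needed for BayesHammer; on a 32GB machine the limit would need lowering (with swap/zram taking the overflow).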

Coincidentally, a paper on the RecoverY utility mentioned in the aforementioned de novo sequencing paper was released just this summer: https://www.biorxiv.org/content/early/2017/06/14/148114 - I have not used it because, due to the targeted capture, it probably wouldn't work on Big Y, and since the samples come from saliva, I believe the metagenome would throw off any Y-chromosome sequencing-depth estimates. I also don't mind getting metagenome contigs from the same analysis, though separating the Y contigs from the metagenome ones could be a pain. Earlier, when I did the same work with just Big Y, I BLASTed the contigs against the nucleotide archive to identify them, but I suspect with WGS that could become unwieldy.
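For reference, the contig-identification step can be done against NCBI's nt database without a local copy by using BLAST+'s remote mode; a sketch, with illustrative file names (for WGS-scale contig sets a local database and taxonomy-aware output would scale better):

```shell
# Identify assembled contigs by best nucleotide matches in nt.
# Tabular output: query id, subject id, % identity, alignment length,
# e-value, and subject title (useful for spotting metagenome hits).
blastn -query scaffolds.fasta -db nt -remote \
  -outfmt '6 qseqid sseqid pident length evalue stitle' \
  -max_target_seqs 5 \
  -out scaffolds.blast.tsv
```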

11-25-2017, 01:21 AM
Some intermediate results from my "citizen scientist" Y-chromosome de novo sequencing effort (I should probably post this under another topic). Alignment of the SPAdes Genome Assembler-generated scaffolds against GRCh38(p11) validates the overall approach of filtering out reads that map to GRCh38's other chromosomes and alt contigs, and combining the remaining sequences from Big Y and Dante Labs WGS. As can be seen, the coverage (green and red areas) of the de novo assembly improves when data from both sequences is used. Red bars mean the sequence is in a different order from the reference; this could be either a mis-assembly or a real change in the chromosome. The scaffolds are constructed from multiple contiguous segments where the paired-end reads can be used to determine where they belong, and the "broken" versions are ones where those contiguous segments are separated. The assembly is still fragmented, as can be expected from consumer-grade sequencing without deep coverage, long "jumping library" insert sizes or nanopore long reads, and I've not yet attempted to gap-fill from the human reference.


However, I had to backtrack and restart the analysis after noticing some issues with my adapter trimming. Most significantly, I've been unable to identify the actual adapter sequence used in the Dante Labs sequence, and no adapter-trimming utility with stock adapters seems to work perfectly. (First time through I used bwa-kit's trimadap, which did nothing useful, then trim_galore, which auto-identified the adapters as Nextera, and picard MarkIlluminaAdapters.) From k-mer frequencies, FastQC suggests it's something like "AAGTCGGATCGTAGCCATGTCGTTCCTTAGGAA", but no over-represented sequences are reported. I had the best luck with Trimmomatic's palindromic mode (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true seemed to trim the most, though it's not an exact match), but I'm still trying to find the best adapter sequence to use. I intend to use that, and then try to identify the most common read-through in the trimmed reads. The upshot is that I can use Trimmomatic for all samples and robustly identify even single-nucleotide adapter read-through without missing any, so the overall assembly quality should improve a bit.

I also recalled there are two 1000 Genomes Project samples that YFull dates to a common ancestor 1750 ybp, i.e. within 3500 years of divergence, so it's worth trying to use those to improve the read depth for purposes of de novo assembly only. I'm also changing the methodology for extracting Y-chromosome reads from the original alignment, to either include or exclude the paired ends of reads which mapped to a different chromosome. I think including them will be best, as they should get rejected if they don't fit the assembly, but I'll probably have to check whether there's any difference. And I think FTDNA's GRCh38 BAM won't make it for this try either ;)
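The Trimmomatic palindrome-mode invocation referred to above would look roughly like this (file names illustrative; the ILLUMINACLIP fields are seed mismatches, palindrome threshold, simple threshold, minimum adapter length, and keep-both-reads):

```shell
# Palindrome-mode adapter clipping: detects adapter read-through by aligning
# R1 against R2, so even very short read-through fragments are caught.
# keepBothReads=true retains the reverse read instead of dropping it.
java -jar trimmomatic.jar PE -phred33 \
  wgs_R1.fastq.gz wgs_R2.fastq.gz \
  wgs_R1.paired.fq.gz wgs_R1.unpaired.fq.gz \
  wgs_R2.paired.fq.gz wgs_R2.unpaired.fq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true
```

No quality-trimming steps (SLIDINGWINDOW, TRAILING, etc.) are appended, consistent with leaving low-quality bases for BayesHammer.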

According to YFull's calculation, the MRCA of the G, R1b and I1 clades lived 48,500 years ago, i.e. 97,000 years of divergent evolution between I1 and those branches. There have been some academic efforts to de novo sequence local reference genomes, for example Skov, L. and Schierup, M.H. 2017. Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLoS Genetics 13, 8 (Aug. 2017), e1006834, and Maretty, L. et al. 2017. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature 548, 7665 (2017), 87–91, but to my knowledge none have published the assemblies, due to patient confidentiality. Though without long-range mapping data I don't think my assembly will be exactly publication quality, I'm still weighing the pros and cons of making it public. Since even structural variation is phylogenetically bound, revealing variants or even the de novo assembly doesn't seem to reveal significantly more than the terminal haplogroup and novel mutations, and improving the assembly with 1000 Genomes sequences means even the novel mutations might not be specific to me. On the other hand, it would reveal "personal information" not only about myself, but about anybody on the same general branch. But then again, sequences of Y chromosomes on the same branch, such as the 1000 Genomes ones, are already public. In any case, I'm hoping it'll illuminate Y-chromosome structural variation and give some insight into Y-chromosome de novo assembly with current consumer offerings.