# Thread: Big Y - different reads for hg19 and hg38

1. ## Big Y - different reads for hg19 and hg38

Does anybody have an idea what is behind the big differences in read counts between hg19 and hg38 at some positions?

One example: position 22463781 (hg19) = 20301895 (hg38)

Kit # 195364 hg19: 24T, 97G - hg38: 12G
Kit # 348782 hg19: 25T, 70G - hg38: 13G

Where did those 109 and 82 reads, respectively, disappear to?

And where does this difference originate? In the FASTQ -> BAM process, or in the processing of the BAM file?

If the hg19 BAM were converted directly to an hg38 BAM, would the result be the same?

FGC Y Elite 1.0 shows 3A 116T 553G at the same position in hg19, and WGS 15x shows 5T 6G in hg19. It would be interesting to see the result in hg38; I have already ordered the analysis, but it will take some time.

3. The position you cite is probably in a palindromic region (most positions starting with 224 are). In theory both arms of a palindrome are identical, so reads can get mapped to two positions (one on the forward strand, one on the reverse strand). If you BLAST the sequence around the position, you should find the corresponding position on the second arm of the palindrome. But sometimes an SNP happens on only one of the arms. As palindromes can recombine, it's possible for such an SNP to be either erased or copied to the second arm, but generally it will remain as it is. The hg19 (GRCh37) results reflect this: reads from both arms get mapped to the same position, hence both T and G, with a preference for G (this might be due to strand bias).
What is surprising, however, is that in hg38 there is just G, and fewer of them. While it might well be that G is the correct read, I do not understand how the switch to hg38 would make for more accurate mapping in the palindromic area. Alternatively, it's possible that they also upgraded their mapping tools, creating better results. Of course there is still the problem of the disappearing G reads.
I tried to BLAST the sequence at that position to find the corresponding position on the other arm of the palindrome (if it is in fact in a palindrome), but NCBI's BLAST tool doesn't find anything (it should at least find the original position). At first glance the region doesn't seem too repetitive. I'll try to see if I can find the second position later. It would be interesting if FTDNA works better with palindromes now.
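
To illustrate why a read from a palindrome can land in two places, here is a minimal Python sketch. The sequences are made up (not real chrY data), and `reverse_complement` and `find_both_strands` are hypothetical helper names. The second arm is built as the reverse complement of the first, so a read taken from one arm also matches the other arm on the opposite strand.

```python
# Toy illustration only: the sequences below are invented, not real chrY data.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(reference, read):
    """List every 0-based exact match of the read, on either strand."""
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        start = reference.find(query)
        while start != -1:
            hits.append((start, strand))
            start = reference.find(query, start + 1)
    return sorted(hits)


# Hypothetical palindrome: left arm, short spacer, reverse complement of the left arm.
arm = "ACGTTGCAAGGCTA"
reference = arm + "TTTT" + reverse_complement(arm)

read = "TTGCAAGG"  # taken from the left arm
hits = find_both_strands(reference, read)
print(hits)  # one hit per arm, on opposite strands
```

If an SNP is then introduced on only one arm, reads carrying it match one arm exactly and the other with a single mismatch, which is the situation described above where both T and G show up at one position.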

4. ## The Following 3 Users Say Thank You to rafc For This Useful Post:

Celt_?? (11-05-2017), Petr (11-04-2017), Robert1 (11-04-2017)

5. Reads are placed onto a reference using edit distances. Modifications to the reference can cause reads to shift around substantially, if there are new areas where the 100 bases fit better.

To see what is going on, spot-check some of the individual read sequences through a BLAT search using GRCh38/hg38: http://genome.ucsc.edu/cgi-bin/hgBlat
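
A rough sketch of the edit-distance point, in Python. Real aligners use seeding, scoring matrices and other heuristics rather than a plain Levenshtein distance, so treat this only as an illustration. The locus names and sequences are invented, and `edit_distance` and `best_placement` are hypothetical helpers.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def best_placement(read, regions):
    """Return (region_name, distance) for the candidate region closest to the read."""
    scored = ((name, edit_distance(read, seq)) for name, seq in regions.items())
    return min(scored, key=lambda item: item[1])


read = "ACGTACGTTA"

# Made-up "old" reference: the only similar locus differs by one base.
old_reference = {"locus_A": "ACGTACGATA"}

# Made-up "new" reference: a newly added locus fits the read perfectly.
new_reference = {"locus_A": "ACGTACGATA", "locus_B": "ACGTACGTTA"}

print(best_placement(read, old_reference))
print(best_placement(read, new_reference))
```

With the old reference the read has nowhere better to go than `locus_A`; once `locus_B` exists, the same read moves there, so the count at `locus_A` drops without any change in the raw data.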

6. Thank you. Unfortunately I have no idea how to use BLAT. How can I obtain the sequence to test?

7. I use IGV when actually looking at BAMs, so these instructions are specific to that tool.

1) Right click a read to reveal the context menu
2) Select 'copy read sequence'
3) Visit the URL in my previous post and paste the string into the text box
4) Ensure you have GRCh38/hg38 selected and submit the search.

The results will tell you which segments are most similar to the read, and aligners will typically concur with the top result from BLAT. You want to see whether the read is being placed somewhere completely different, or whether there are several regions of high similarity.

I should note IGV actually has a menu item that makes that process one step when your reads are aligned to the same reference. It won't help here since the BAM you have access to is hg19.
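
If you copy several read sequences, it may be convenient to paste them as one FASTA-formatted block, where each sequence is preceded by a `>name` header line (as far as I know, the BLAT text box accepts multiple sequences in this form). A small Python sketch, with made-up read names and sequences and a hypothetical helper `reads_to_fasta`:

```python
def reads_to_fasta(reads, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines = []
    for name, seq in reads:
        lines.append(f">{name}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines)


# Placeholder reads: replace with sequences copied from IGV.
reads = [
    ("read_1", "ACGT" * 20),
    ("read_2", "TTGCA"),
]

fasta = reads_to_fasta(reads)
print(fasta)
```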

9. Nevermind

10. On second look, the location you cite is in the DYZ19 area. This is known to be a very complex and highly variable region. It seems there are some differences in DYZ19 between hg19 (GRCh37) and hg38, so that might explain the new results. Let's hope it's an improvement.

11. I have checked 8 kits I manage and FTDNA found the following new variants in the DYZ19 area:

hg38 position: hg19 reads -- hg38 reads
20067651: 199A 17T -- 13T
20067680: 181A 20C -- 9C
20067682: 182A 21T -- 9T
20067692: 175C 20G -- 9G
20076829: 45A 3T -- 23A
20101608: 26G -- 21G
20101641: 29T -- 12T
20101641: 34T -- 16T
20111610: 23A 1T -- 16A
20125974: 49C 57G -- 48C 8G
20136397: 94T 71G -- 61T 3G
20136526: 86A 44G -- 25G
20136543: 71A 28C -- 10C
20136544: 70A 28T -- 10T
20136545: 28T 70G -- 10T
20136552: 62A 28G -- 10G
20147942: 63A 83C -- 34A 45C
20278796: 1A 1C 79G -- 1C 34G
20284034: 126A 8G -- 71A
20284466: 49T 122C -- 37T 4C
20296178: 83A 38C -- 11A
20296405: 54C -- 54C
20296736: 7A 115T -- 73T
20296965: 145T 2C 122G -- 139T 7G
20300432: 86A 19T -- 82A
20301875: 43A 76T -- 35T
20301895: 25T 70G -- 13G
20301895: 24T 97G -- 12G
20302968: 70G -- 40G
20305178: 100A 119C -- 64A 1C
20308040: 22A 33G -- 12A
20308041: 22T 33C -- 12T
20308044: 22T 33C -- 12T
20308053: 32A 31G -- 16A
20311965: 8A 83T -- 74T
20314557: 25A 48T -- 12A
20314561: 48T 25G -- 12G
20314765: 66A 1T 94C 6G -- 5A 72C
20314848: 69G -- 9G
20316024: 134A 26C -- 133A
20324088: 51T 167C -- 20C
20324129: 109A 45G -- 9A
20325095: 6A 15T 126C 434G 3DEL -- 16C
20344583: 11A 1T -- 10A
20345643: 59T 10G -- 18T
20345643: 77T 11G -- 35T 1G
20346468: 8A 101C 386G -- 52C
20348281: 96A 595G -- 32A 2G
20348281: 118A 1T 589G -- 33A
20348281: 84A 434G -- 16A 1G
20349897: 155T 217G -- 59T 2G

So a really big difference.

Now the question is what causes this change: hg38 itself, or a different alignment process used by FTDNA?

It looks like some variants are specific to certain haplogroups: 20348281 G->A R1a, 20345643 G->T R-Z280.
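
To put a number on the difference, the rows of the list above can be parsed and the fraction of reads that survive in hg38 computed per position. This is a minimal Python sketch that assumes the exact `position: hg19 reads -- hg38 reads` layout used in this post; `parse_counts`, `parse_row` and `retained_fraction` are hypothetical helper names.

```python
import re

# A count followed by an allele label, e.g. "97G" or "3DEL".
ALLELE = re.compile(r"(\d+)([ACGT]|DEL)")


def parse_counts(text):
    """Turn '24T 97G' into {'T': 24, 'G': 97}."""
    return {allele: int(n) for n, allele in ALLELE.findall(text)}


def parse_row(row):
    """Split one row into (position, hg19 counts, hg38 counts)."""
    position, rest = row.split(":", 1)
    hg19_text, hg38_text = rest.split("--")
    return int(position), parse_counts(hg19_text), parse_counts(hg38_text)


def retained_fraction(row):
    """Fraction of the hg19 reads that are still reported in hg38."""
    _, hg19, hg38 = parse_row(row)
    return sum(hg38.values()) / sum(hg19.values())


# Three rows copied from the list above.
rows = [
    "20301895: 24T 97G -- 12G",
    "20325095: 6A 15T 126C 434G 3DEL -- 16C",
    "20296405: 54C -- 54C",
]

for row in rows:
    position, hg19, hg38 = parse_row(row)
    print(position, hg19, hg38, f"{retained_fraction(row):.1%}")
```

For 20301895 this gives 12 of 121 reads retained (about 10%), while 20296405 keeps all 54.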

12. They are probably getting rid of reads with really bad mapping quality.

Look in the VCF for the total number of reads and the number of mapping quality zero reads.

Discarding ALL reads of mapping quality zero results in enormous increases in reliability.
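
A toy Python sketch of what discarding mapping quality zero reads does to the allele counts at one position. The pileup here is invented purely for illustration (it is not taken from any kit), and `allele_counts` and `min_mapq` are hypothetical names.

```python
from collections import Counter


def allele_counts(pileup, min_mapq=0):
    """Count bases at one position, keeping reads with MAPQ >= min_mapq."""
    return Counter(base for base, mapq in pileup if mapq >= min_mapq)


# Invented pileup as (base, mapping quality) pairs.
# MAPQ 0 means the aligner could not choose between equally good placements.
pileup = (
    [("T", 0)] * 5     # ambiguous reads showing T
    + [("G", 0)] * 3   # ambiguous reads showing G
    + [("G", 40)] * 4  # confidently placed reads showing G
)

unfiltered = allele_counts(pileup)
filtered = allele_counts(pileup, min_mapq=1)  # drop all MAPQ 0 reads

print(dict(unfiltered))
print(dict(filtered))
```

In this made-up case the apparent T/G mix disappears once the ambiguously placed reads are removed, leaving a smaller but cleaner count.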

14. Do you mean
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
?

20301895: 24T 97G -- 12G -- DP=12 -- MQ0=0

chrY 20301895 . T G 69.9194 PASS BQ=25.343;GC=0.458725;HL=2;HR=2;IndelCnt=0;MQ=46.4013;MQ0=0;MismatchCnt=0 GT:AD:DP:GQ:PL:AB:SR:BQ:LowMQ:ClipCnt:ReadOffset:RAD:AS 1/1:0,12:12:19:194,194,0:1:0.25:25:0,4:0,1:0,95.3333:0,9:0,11.9645
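
For checking many positions, the relevant fields can be pulled out of such a record with a few lines of Python. This is a sketch only: it works on a cleaned-up copy of the record above with whitespace-separated columns and a single sample, and `parse_vcf_record` is a hypothetical helper, not part of any VCF library.

```python
def parse_vcf_record(line):
    """Extract depth and mapping-quality fields from one single-sample VCF line."""
    fields = line.split()
    chrom, pos, _id, ref, alt, _qual, _filter, info, fmt, sample = fields[:10]

    # INFO is a ';'-separated list of KEY=VALUE pairs.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # FORMAT names the ':'-separated values in the sample column.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "depth": int(sample_map["DP"]),
        "allele_depths": [int(x) for x in sample_map["AD"].split(",")],
        "mean_mq": float(info_map["MQ"]),
        "mq0": int(info_map["MQ0"]),
    }


record = (
    "chrY 20301895 . T G 69.9194 PASS "
    "BQ=25.343;GC=0.458725;HL=2;HR=2;IndelCnt=0;MQ=46.4013;MQ0=0;MismatchCnt=0 "
    "GT:AD:DP:GQ:PL:AB:SR:BQ:LowMQ:ClipCnt:ReadOffset:RAD:AS "
    "1/1:0,12:12:19:194,194,0:1:0.25:25:0,4:0,1:0,95.3333:0,9:0,11.9645"
)

rec = parse_vcf_record(record)
print(rec)
```

For this record it reports DP=12, AD=0,12 and MQ0=0, i.e. none of the 12 reads counted in hg38 is a mapping quality zero read.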
