View Full Version : Shared SNP's within different haplogroups



vettor
11-08-2016, 05:20 AM
How useful are SNPs that are shared by other haplogroups?

Afshar
11-08-2016, 07:21 AM
They developed independently from each other, so I would think they are useful in that sense, unless it's a site where mutation occurs frequently or at random.

RobertCasey
11-08-2016, 02:58 PM
I think the saga of L159 and L69 demonstrates the politics and usefulness of YSNPs that have multiple mutations of the same YSNP. First, these two YSNPs are dependent on each other since both are very near a YSTR structure. Today, YSEQ will not even add any YSNP mutations that are located near YSTR structures due to their unstable nature. FTDNA does allow some of these YSNPs into their haplotree but will generally remove them pretty quickly if they start finding inconsistent testing results (multiple mutations).

When first discovered, L159 and L69 were part of the L21 haplotree. L159 mutates much less than L69 but both were added to the ISOGG haplotree and were tested a lot (before NGS testing became available). However, after a second mutation of L69 under L21 and the fifth known mutation, L69 was purged from the ISOGG haplotree as being too inconsistent to be reliable. There was a secondary push to remove L159 as well but L159.2 was a pretty major branch with a very active group of testers that successfully lobbied to keep L159.2 as part of the haplotree.

L69 is still a viable YSNP mutation, but two mutations were very close to each other on the haplotree, which makes this YSNP less reliable in that area of the haplotree. I think the consensus today is to just eliminate these unstable types of YSNPs since there are just too many private YSNPs that could potentially replace these mutations. FTDNA continues to promote YSNP mutations that are in unstable areas as new branches. The vast majority of these have consistent testing results in the haplotree, but there is a constant need to remove these types of YSNPs when inconsistent testing results are found. Many of these inconsistent results could be due to the error-prone nature of the very economical Mass Array SNP packs that FTDNA is rolling out. Some of these YSNP branches could later be declared valid branches.

I think FTDNA's approach is much better than YSEQ's approach. There are five significant L226 branches that have consistent testing results today in unstable areas. FTDNA includes most of these branches in their haplotree, yet YSEQ takes a more purist view and will not test any YSNPs found in unstable areas. FTDNA will correct these inconsistent tests, but it takes a lot of effort on all parties to monitor and make adjustments. These YSNPs cause a lot of confusion to all parties but can be extremely useful as well. ISOGG has also started to exclude many of these YSNP mutations outright. I used to avoid these unstable YSNPs in my haplotree of R-L226, but have changed my mind about their usefulness. It is similar in genealogy to finding evidence of relatedness via land records that are not as reliable but add to the overall information associated with a family grouping.

Most genealogists like having as much information as possible, as brick walls can sometimes be removed with many less reliable sources implying a relationship. In this case, FTDNA has the right approach - but this approach creates a lot of errors in their haplotree, which comes along with using YSNPs in unstable areas. It also puts a lot more burden on the admins and FTDNA to resolve conflicts. Around 10 % of the L226 branches are declared unstable by the experts but have yielded consistent testing results to date. Z17669, DC33, BY4103, DC24 and DC40 are good solid branches under L226 that YSEQ refuses to test (include in their individual YSNP tests or YSNP panel tests), but most of these are included in the FTDNA L226 SNP pack.

vettor
11-08-2016, 05:15 PM
Thanks

I asked because recently I was given a new SNP (Z19945, and a new branch in YFull). There is an equivalent SNP to Z19945 called CTS1848 (under the T haplogroup). Myself and the only other T person positive for Z19945 both have a CTS1848 result, with me showing negative and the other person positive. This SNP, CTS1848, is also under the R haplogroup.

Another SNP is L25, shared between the T and J haplogroups. I am negative for L25.

Cofgene
11-09-2016, 01:42 AM
Basic problem with the statements concerning stability - THEY HAVE NOT BEEN DEFINED!!! Can you explain the stability issue in terms of time or generations, and relationship of one occurrence relative to another? What is an "unstable area" on the Y? Are there mutations present every 5 generations, or what? Let's stop using the historically inaccurate concepts to describe situations related to stability and mutation usefulness and get around to using terminology based upon solid facts of occurrence rates.

ArmandoR1b
11-09-2016, 02:14 AM
You have already sent your BAM to YFull and they have already sent you the results of their analysis?

RobertCasey
11-09-2016, 04:51 PM
The old methodology was to remove YSNPs that have mutated three or four times. The old methodology did allow quite a few YSNPs in unstable areas to be part of the haplotree but forced painful removals of those that had inconsistent testing results. For $1 per YSNP, YSEQ will respond to you and inform you why they think these are in unstable areas. There are at least a dozen scenarios: 1) located too close to a YSTR structure; 2) found in the palindromic region where mass crossover can happen; 3) located near the recombinational area of the XY pairing; 4) since the Y chromosome originally came from the X chromosome and some areas are very static, it is hard to tell apart the X and Y in certain regions; 5) ISOGG states that inserts and deletes are unstable (really ??); 6) YSNPs that are very close to each other tend to track each other and ISOGG will not allow these potentially branching YSNPs. This is only half of the reasons that YDNA mutations are considered to be in unstable areas.
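As a rough illustration of the "mutated three or four times" rule above, here is a minimal sketch that counts how many separate places on a haplotree a YSNP shows up as positive. The tree, the branch names, the function names and the threshold of 3 are all made up for the example, and it assumes every branch in the toy tree has been tested; it is not how any vendor or ISOGG actually evaluates YSNPs.

```python
# Hypothetical toy haplotree: branch name -> parent branch name.
PARENT_OF = {"A": "root", "B": "root", "A1": "A", "A2": "A", "B1": "B"}


def independent_origins(parent_of, positive_branches):
    """Count positive branches whose parent branch is not positive.

    Each such branch is treated as a separate occurrence of the mutation.
    Back-mutations and untested branches are not modeled.
    """
    positives = set(positive_branches)
    return sum(1 for branch in positives if parent_of.get(branch) not in positives)


def looks_recurrent(parent_of, positive_branches, threshold=3):
    """Flag a YSNP once it appears in `threshold` or more separate places."""
    return independent_origins(parent_of, positive_branches) >= threshold


# Positive on one branch and everything below it: a single origin.
consistent = independent_origins(PARENT_OF, {"A", "A1", "A2"})
# Positive on two unrelated tips: two origins.
scattered = independent_origins(PARENT_OF, {"A1", "B1"})
print(consistent, scattered)
print(looks_recurrent(PARENT_OF, {"A1", "A2", "B1"}))
```

A YSNP with a single origin behaves like a normal branch marker; the more separate origins it has, the less a positive result says about where a tester sits on the tree.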

The new methodology is still somewhat in flux. However, consistent results from YSNPs in unstable areas are blessed by FTDNA (not totally, but they are making a major effort on this front) and promoted by Alex Williamson's BigTree as well. It is ironic though that YSEQ is now the holdout for ignoring any YSNP in unstable areas. This is bad for their business model. Many admins do allow YSNP branches in unstable areas - and that trend seems to be growing as well. I used to ignore these branches, but Dennis Wright convinced me that there are too many that have consistent testing results. Like tracking YSEQ YSNP results, removing YSNPs in unstable areas with inconsistent testing results will remain a challenge for all to monitor and keep the haplotrees up to date. FTDNA does introduce a lot of these YSNPs as valid branches and gets it wrong up front - but they seem willing to make corrections when presented with the supporting documentation.

FTDNA remains monopoly-like by not acknowledging valid branches and equivalents discovered by their competitors. A6097 remains a major branch under L226 that they not only will not add to the haplotree - they also refuse our requests to add these kinds of branches to their individual YSNP tests or include them in their SNP pack updates. They have added YSNPs of the leadership (FGC5647 and FGC5639, DC69, FGC122XX) but fail to add those branches where individuals go out on their own and discover these new branches like A6097. This attitude just ends up with unnecessarily higher costs for the genetic genealogy community.

Again, discovery of new branches via individual testing at YSEQ is running very consistently at 10 % of the cost per branch of Big Y testing. Of course, no new private YSNPs are discovered, but 400 private and equivalent YSNPs are being ignored as well. Fortunately, the new L226 YSNP pack has revealed six new branches (not fully vetted though) at about 66 % of the cost of Big Y. Not that impressive from a cost point of view. But these SNP packs have greatly increased testing coverage, which has allowed prediction of branches below L226 to increase from 50 % to 70 %. There are just under two predicted submissions for every submission that is thoroughly YSNP tested. As the haplotree gets more tested, the ratio of predicted to tested will drop of course, but SNP pack testing will continue to chip away at the outliers that have not been thoroughly YSNP tested. It remains to be seen just how much prediction via signatures will continue to create a robust descendant chart where analysis can finally be done to make much better testing recommendations.

However, charting of L226 with around 25 % YSNP tested allows for 70 % to be charted (via prediction via signatures of tested submissions). These charts make it painfully obvious to those that should test Big Y (a lot), those that should take the SNP pack test (a lot), those that should test individual YSNPs at YSEQ (not that many as expected) and those few that need the resolution of Y Elite 2.1 (bottlenecks in the trunk of the L226 haplotree where there is a significant lack of divergence of YSTRs) as well as the obvious testing for genealogical YSNPs under 1,000 years old (a lot more is needed here).
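The "just under two predicted for every tested" figure can be checked against the percentages quoted above. This is only a back-of-the-envelope sketch, and it assumes the 70 % charted figure includes the 25 % that are YSNP tested (the post does not spell that out); the variable names are made up.

```python
tested_share = 0.25   # about 25 % of submissions YSNP tested (from the post)
charted_share = 0.70  # about 70 % of submissions charted (from the post)

# Assumption: charted = tested + predicted, so the rest of the chart is predicted.
predicted_share = charted_share - tested_share
predicted_per_tested = predicted_share / tested_share

print(round(predicted_per_tested, 2))
```

Under that assumption the ratio works out to about 1.8 predicted submissions per tested submission, which fits the "just under two" description.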

Cofgene
11-10-2016, 02:33 AM
You continue to use the term "unstable area" without defining it. I think what you could be referencing are those poorly assembled regions in the sequence. These regions have the same stability as other parts of the Y or other chromosomes. What you term "unstable" represents low quality assembly data regions specific to the protocol and assembly parameters utilized to create the sequence. Let's get away from using the inappropriate terminology and move towards one which reflects what is present in the specific referenced bit of data. Different sequencing technologies, from Sanger to NGS gen 1, gen 2, gen 3 and gen 4, will have different regions where they have their issues.

In terms of Elite and WGS results they are just as critical for use on older branches as for the newer ones. Wherever you have more than 2 descendant branches you most likely are missing an intermediate level. The higher resolution tests increase the likelihood of identifying and placing intermediate levels that low resolution tests such as Big-Y miss. Within the R-Z326 region there are two levels within a 6 level segment that were identified from WGS results. We had to wait for SNP pack results to properly align 5 brother branches that Big-Y was incorrectly positioning.

RobertCasey
11-10-2016, 04:08 AM
The first paragraph gives six examples that are considered "unstable" areas by the various experts. "Unstable" areas mean parts of the Y chromosome where consistent testing results are not always possible due to the nature of how DNA works. For instance, any YSNP used near a YSTR structure is not really a random mutation but is a result of the YSTR expanding and contracting and affecting neighboring YSNPs that are very close to the YSTR structure. You also cannot depend on YSNPs in the palindromic region, as very large strands of YDNA get replaced wholesale, resulting in what appears to be dozens of random YSNP mutations that are actually only one massive replacement. Parts of the YDNA are very similar to XDNA since the YDNA strand started out as an XDNA pair. Some areas have so much in common between XDNA and YDNA that the YDNA enrichment process (which separates YDNA from all other chromosomes) cannot tell XDNA and YDNA apart. This is partially because current technology cuts up the DNA into small strands where it is impossible to tell XDNA from YDNA.

I think that there are two very special scenarios where getting extra resolution really helps: 1) if you are trying to discover YSNPs in the genealogical time frame; 2) if your NGS testing candidates are found in the trunk of the haplotree where very little YSTR genetic diversity happens (YSTR provides very little genetic information to work with). For other scenarios, extra resolution always helps some - but you can also have 30 % more Big Ys vs. 30 % more coverage per NGS test. So there is less impact on haplotree development, since 30 % more tests are somewhat equivalent to 30 % more coverage - both at the same cost and both with pretty similar overall discovery capabilities. For the two special scenarios above, having 30 % more resolution is really needed.

I definitely agree that different technologies have different reliability, which many people do not understand. The Array chip tests - Nat Geo, Chromo2, Living DNA - have the lowest quality, and many YSNPs just do not have reliable output. Also, Array chip sets have significant setup costs and cannot be updated frequently due to the costs of developing the Array tests for YSNP testing. FTDNA's usage of Mass Array is producing inconsistent results as well, since it is very difficult to set up each SNP pack. However, this test allows a lot of branches or even private YSNPs to be included and can be updated frequently, which offsets the reliability issues. Even NGS tests have major differences due to read length. YElite 2.1 will more reliably read the same areas since longer read lengths reduce the error rates associated with NGS tests.

LorenAmelang
01-09-2017, 08:10 AM
For instance, any YSNP used near a YSTR structure is not really a random mutation but is a result of the YSTR expanding and contracting and affecting neighboring YSNPs that are very close to the YSTR structure.

I fear this is a hopeless newbie question, but I've been looking for an answer for weeks and finding no clues. If SNPs can be located by a position which is the number of bases from the marker to the end of the chromosome, and those numbers are consistent between different tests and test systems (at least within a particular reference file version), how do STRs fit into this location system? Your post is the first I've seen that explicitly mentions an effect on SNPs when nearby STRs expand or contract. Something has to give... How does that work? Is there some technical term I should be searching for?

MacUalraig
01-09-2017, 08:39 AM
If you view the data in a genome browser you will see the difference in tandem repeats (STRs) marked as insertions or deletions, the same as it does if one person has an INDEL. But you are right in a sense, alignment is at the heart of genome analysis, except that each person's genome is a different length.

RobertCasey
01-09-2017, 06:39 PM
YSTR strings are constantly expanding and contracting (that is what the number associated with each YSTR value represents). When 460 goes from a string of 11 to a string of 12 repeated sequences, it is expanding. These expansions and contractions affect neighboring YSNPs - flipping values, deletions or additions. Flipping values, deletions and additions are considered YSNP mutations (technically additions and deletions are not YSNPs, but we tend to label all mutations as YSNP mutations). YSNP mutations located very near YSTRs tend to have multiple mutations, since the mechanism causing the mutations is that the whole area is unstable due to YSTR expansions and contractions.

The chromosomes expand and contract in length extremely often. Genes expand and contract the most since their structures are huge (sometimes over 1,000,000 base pairs). All YSTRs change in value, with some very volatile YSTRs changing every few generations (getting longer or shorter). Sometimes the entire YSTR structure is lost for many generations (reported as null values). Also, FTDNA uses the more simplistic repeat numbers for YSTRs. In addition to the number of sequences increasing and decreasing, there are also partial strings that are added and deleted, mixed in with the primary repeat pattern. Other vendors used to report these partial values as 12.1 or 12.2 (where 1 or 2 partial strings were found).
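To make the "string of 11 to a string of 12" idea concrete, here is a minimal sketch that counts back-to-back copies of a repeat motif in a toy sequence. The motif, the flanking bases and the function name are invented for illustration; this is not real marker sequence, and it does not handle the partial repeats (12.1, 12.2) mentioned above.

```python
def longest_repeat_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    size = len(motif)
    for start in range(len(sequence)):
        run = 0
        # Keep stepping one motif length at a time while the motif repeats.
        while sequence[start + run * size:start + (run + 1) * size] == motif:
            run += 1
        best = max(best, run)
    return best


# Invented flanks and motif; only the repeat count differs between the two.
before = "CCTG" + "GATA" * 11 + "TTCA"
after = "CCTG" + "GATA" * 12 + "TTCA"

print(longest_repeat_run(before, "GATA"), longest_repeat_run(after, "GATA"))
# Within each person's own sequence, everything after the repeat moves along.
print(after.index("TTCA") - before.index("TTCA"))
```

The second print shows the length change itself: in the expanded sequence the downstream flank sits one motif length (4 bases) further along than before.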

An extreme example of an expansion is how DNA was discovered via electron microscope and known to track only one genetic line of a French Canadian line that started in the mid 1800s. This was an extra attachment to the YDNA that increased the length of the YDNA by 30 or 40 %. Male descendants of this genetic line were tested and examined under the microscope, and this is how scientists first learned that a chromosome that could be visibly seen via a microscope was being passed down through the generations. FTDNA had a male customer submit DNA for a YDNA test, and it turned out that even though he was male and was married, he had no Y chromosome (which is extremely rare and explained why he and his wife had no children). DNA is not static; it changes length constantly, which is why you have to look for patterns to know where you are on the chromosomes.

Cofgene
01-09-2017, 10:11 PM
Copy number variation (https://en.wikipedia.org/wiki/Copy-number_variation) would be a more accurate way of explaining one source for variation in length. CNVs represent one of the untouched areas for genealogical/ancestral work.

LorenAmelang
01-17-2017, 06:35 AM
I'm still struggling with this... Appreciate the clues, and totally believe there are lots of length changes happening - some much longer than the ones popularly used by the genealogy companies. I finally found a way to locate those STRs in the same position systems used for SNPs:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chrY%3A14512000%2D14712320&hgsid=575549523_cnRhCxt2kYAXDgfFqKN6CLhHxtuP

They show:
STS Marker DYS389
Chromosome: chrY
Start: 14612000
End: 14612320
Band: Yq11.21
Other names: CHLC.GATA30F10, GATA-DYS389, CHLC.GATA30F10.P14912, G00-365-241
UCSC STS id: 96958
UniSTS id: 29347
Genbank: G09600
GDB: GDB:365241

Those locations are apparently between two of my tested SNPs (all in GRCh37/hg19):
Y 14577177 rs9786712 A G . PASS . GT 1
Y 14640715 rs35960273 A G . PASS . GT 0

But it seems like the locations in my test results are standardized. I did this exploring before I found the UCSC browser, so it shows random locations, but for my sister, me, and a random Chinese site, the same rs numbers show the same positions:

Karen
rs11240777 1 798959 GG
rs12043282 1 249179856 TT
rs5939319 X 2700157 AG
rs669237 X 154916845 GG

Loren
1 798959 rs11240777 G A . PASS . GT 0/0
1 249179856 rs12043282 T C . PASS . GT 0/0
X 2700157 rs5939319 G A . PASS . GT 1
X 154916845 rs669237 G T . PASS . GT 0

<http://rv.psych.ac.cn/variant.do>
rs11240777 chr1:798959-798960
rs12043282 chr1:249179856-249179857
rs5939319 chrX:2700157-2700158
rs669237 chrX:154916845-154916846 A/C

So is there some magic that forces other CNVs between 14577177 and 14640715 to exactly counteract length changes in the STR at 14612000 to 14612320?

Or are those precise-looking, standardized location numbers not precisely consecutive? Maybe they just arbitrarily reset the odometer when they pass defined mileposts?

Not sure why I'm so stuck on understanding this. Would be more fun to get on with trying to fit my Genes for Good Y-SNP results into the FTDNA haplotree...

Loren

Cofgene
01-17-2017, 12:38 PM
The genome reference sequence GRCh37 (hg19) provides the position numbering that new results are compared against. It remains the same for that reference build. Your DNA is matched down on top of that road map. Where you have an extra STR repeat unit, that represents an insertion, or extra bases, in your result. The bases in the inserted STR won't have their own position numbers but are referenced as a group by the location on the reference genome where they were added. For a deletion, where you are missing some sequence, that just represents a hole in your sequence. In that case the reference numbering would be, say, ...14566, 14567, 14601, 14602... because you have no matching sequence in the hole from 14568 to 14600. Don't get caught up in the concept of chromosomes having different lengths due to the variability of some of the contents. Everything is relative to the consensus reference model. hg38 is the current reference genome and it will have some different position numbers for the Y due to corrections, additions and deletions based upon recent findings. To see the numbering differences, look at the same rsID between the two builds in the UCSC browser.
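A minimal sketch of the numbering idea described above, using invented alignment operations and positions (the function name and the "M"/"I"/"D" labels are assumptions made for the example, loosely modeled on how aligners describe matches, insertions and deletions). Each base in a sample is labeled with the reference position it lines up with, so inserted bases get no number of their own and a deletion just leaves a gap in the numbering.

```python
def reference_positions(ops, start):
    """Label each base of a sample with the reference position it aligns to.

    ops is a list of (kind, length) pairs, where kind is
    "M" (bases that line up with the reference),
    "I" (extra bases in the sample, e.g. one more STR repeat unit), or
    "D" (reference bases missing from the sample).
    Inserted bases are labeled None because they have no reference position.
    """
    labels = []
    ref_pos = start
    for kind, length in ops:
        if kind == "M":
            labels.extend(range(ref_pos, ref_pos + length))
            ref_pos += length
        elif kind == "I":
            labels.extend([None] * length)  # numbering does not advance
        elif kind == "D":
            ref_pos += length               # numbering skips ahead
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return labels


# Toy sample with one extra 4-base repeat unit after three matching bases.
with_insertion = reference_positions([("M", 3), ("I", 4), ("M", 3)], start=100)
# Toy sample missing reference positions 14568 through 14600 (33 bases).
with_deletion = reference_positions([("M", 2), ("D", 33), ("M", 2)], start=14566)

print(with_insertion)
print(with_deletion)
```

In both cases the bases downstream of the change keep the same reference numbers they would have had anyway, which is why two people with different STR lengths still report a given SNP at the same position.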

LorenAmelang
01-19-2017, 04:53 AM
Cofgene,

So in my .vcf files, all those "position" numbers that were explained as "the position or number of bases from this marker to the end of the chromosome" in the FAQ, are positions in the "reference=GRCh37" sequence. They are not counting along my particular chromosome, or along any individual's, but along the reference. In effect, resetting my odometer whenever my bases have gotten out of sync and then start matching the reference again. Right? So of course every GRCh37 file shows the same rs numbers at the same locations.

Looking for lines that aren't just one base pair, I find things like:

X 34962597 rs199682461 C T . PASS . GT 0
X 34962704 X:34962704_CTTTATGGACCAATTGC/- CTTTATGGACCAATTGC - . PASS . GT 0
X 34962909 rs6632131 A C . PASS . GT 0

Doesn't look like anything repeats. Is that an insertion? That's the only kind of unusual line I see. The filename says "filtered" - maybe someone has filtered out the deletions?

Appreciate the clues! I certainly jumped to a wrong impression at first. Still, I'm amazed that it is so impossible to find an explanation of this on the web. I guess people who grew up with genetics absorbed it as the process developed.

Cofgene
01-19-2017, 11:51 AM
For understanding the VCF file, look at http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it or some other web resources. Your results are aligned to the reference as that is the only way they can identify and position the differences between the two. There is more to some of the lines than just PASS. The quality score is important, and then the position of the call. Is it in an appropriately long BED region [the assembly is good]? Don't get hung up on items in the VCF. What matters is the comparison of VCF vs VCF to identify your shared variants and your new unshared variants. If there is something of interest that needs to be checked, one will go back to the raw data, the BAM file, to see what it shows.
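The VCF vs VCF comparison can be sketched in a few lines, assuming whitespace-separated records in the column order seen earlier in the thread (chromosome, position, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, one sample column). The first person's two records are copied from the thread; the second person's genotypes are invented for the example, and the function name is made up. Real work would normally use a proper VCF library and also look at quality scores.

```python
def called_variants(vcf_text):
    """Return the set of (chrom, pos, ref, alt) where the sample carries ALT.

    Only PASS records are considered, and only the first sample column is read.
    """
    found = set()
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        if filt != "PASS":
            continue
        genotype = fields[9].split(":")[0]
        alleles = genotype.replace("|", "/").split("/")
        if any(allele not in ("0", ".") for allele in alleles):
            found.add((chrom, int(pos), ref, alt))
    return found


# Records as posted in the thread.
person_a = """\
Y 14577177 rs9786712 A G . PASS . GT 1
Y 14640715 rs35960273 A G . PASS . GT 0
"""
# Same sites with made-up genotypes for a hypothetical second tester.
person_b = """\
Y 14577177 rs9786712 A G . PASS . GT 1
Y 14640715 rs35960273 A G . PASS . GT 1
"""

a = called_variants(person_a)
b = called_variants(person_b)
shared = a & b
only_b = b - a
print(sorted(shared))
print(sorted(only_b))
```

One caveat with this kind of set comparison: a variant missing from one person's set may mean a negative call, a no-call, or a site that was never covered, which is where going back to the BAM file comes in.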