
Thread: Rating and classifying Y SNPs for subclade identification

  1. #1
    Registered Users
    Posts
    3,908
    Sex
    Y-DNA (P)
    R1b
    mtDNA (M)
    H

    Rating and classifying Y SNPs for subclade identification

    The process of reviewing comparisons of NGS results involves "weeding" out SNPs. The challenge is generally not a shortage of SNPs, but identifying the ones that appear to be shared, stable and easily testable.

    In some cases, we may be weeding a little too much, but for early passes that is probably okay.

    I'd like to classify SNPs in a more disciplined fashion for my recordkeeping and share the classifications so they are open for review and criticism.

    Phylogenetically, this is not hard for a person like me, because I'm just observing comparative results. From a superficial phylogenetic perspective, I don't care much about anything else concerning a marker. I only care about who has and who hasn't particular markers. Ken Nordtvedt has long said that all markers are subject to uncertainty and therefore there is no such thing as a UEP (Unique Event Polymorphism), or at least it can't be proven with absolute certainty.

    From this perspective, I'm currently using the following coding:
    Single individual
    Single family/surname but multiple individuals
    Multiple family/surnames but unsure if public
    Public consistent
    Public semi-consistent (known recurrent but useful would fit here)
    Public inconsistent

    I don't see an SNP as necessarily being one of the above from the get-go. I just see it as a process of evaluation and status change as the comparative database grows.
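As a rough illustration of the coding above, here is a minimal Python sketch that assigns one of the six statuses from the people observed to carry an SNP. The names `PhyloStatus` and `classify`, and the thresholds `public_min_families` and `tolerated_conflicts`, are assumptions made up for this example; the post does not define numeric cut-offs.

```python
from enum import Enum


class PhyloStatus(Enum):
    SINGLE_INDIVIDUAL = "Single individual"
    SINGLE_FAMILY = "Single family/surname but multiple individuals"
    MULTI_FAMILY_UNSURE = "Multiple family/surnames but unsure if public"
    PUBLIC_CONSISTENT = "Public consistent"
    PUBLIC_SEMI_CONSISTENT = "Public semi-consistent"
    PUBLIC_INCONSISTENT = "Public inconsistent"


def classify(carriers, conflicts=0, public_min_families=3, tolerated_conflicts=1):
    """Return a PhyloStatus for one SNP.

    carriers: list of (person_id, surname) pairs observed with the SNP.
    conflicts: count of observations that contradict the tree placement.
    Both thresholds are hypothetical placeholders, not community standards.
    """
    if not carriers:
        raise ValueError("at least one carrier is required")
    people = {person for person, _ in carriers}
    families = {surname for _, surname in carriers}
    if len(people) == 1:
        return PhyloStatus.SINGLE_INDIVIDUAL
    if len(families) == 1:
        return PhyloStatus.SINGLE_FAMILY
    if len(families) < public_min_families:
        return PhyloStatus.MULTI_FAMILY_UNSURE
    if conflicts == 0:
        return PhyloStatus.PUBLIC_CONSISTENT
    if conflicts <= tolerated_conflicts:
        return PhyloStatus.PUBLIC_SEMI_CONSISTENT
    return PhyloStatus.PUBLIC_INCONSISTENT
```

Re-running `classify` as new results arrive reflects the idea that a status is re-evaluated as the comparative database grows rather than fixed at discovery.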

    However, I think there is another dimension that many of us understand poorly but that folks like Thomas Krahn are able to evaluate. I'm getting the hang of this a little because of the Ybrowse tool, but I'm still trying to understand this set of properties that relate to biology and equipment/test reliability.

    As near as I can tell, these are classifications I hear of:

    X Chromosome similarities - possibly subject to recombination events so not necessarily paternally inherited
    125bp region - hard to read consistently by the testing methods
    Palindrome region - subject to recLOH's
    Embedded in an STR - subject to disappearance or shifting due to STR mutations
    Insertion/deletion event - as opposed to point SNPs, clearly are more sporadic in most cases
    Test edge read - on the first location of a tested region and more likely to have a bad read
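One way to keep these technical properties separate from the phylogenetic status is to record them as combinable flags, since a single SNP can fall into several categories at once. This is only a sketch; the flag names and the `describe` helper are invented for illustration and mirror the six items listed above.

```python
from enum import Flag, auto


class TechFlag(Flag):
    NONE = 0
    X_HOMOLOGY = auto()    # similar to X chromosome sequence
    REGION_125BP = auto()  # the hard-to-read "125bp region" item
    PALINDROME = auto()    # subject to recLOH
    IN_STR = auto()        # embedded in an STR
    INDEL = auto()         # insertion/deletion rather than a point SNP
    EDGE_READ = auto()     # first position of a tested region


def describe(flags):
    """List the names of the individual flags set, in definition order."""
    return [f.name for f in TechFlag if f.value != 0 and f in flags]


# Example: an indel that also sits in a palindromic region.
example = TechFlag.PALINDROME | TechFlag.INDEL
print(describe(example))
```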

    What am I missing and what don't I understand?
    Last edited by Mikewww; 05-16-2014 at 10:41 PM.

  3. #2
    Registered Users
    Posts
    4,825
    Sex
    Location
    Australia
    Ethnicity
    Italian Alpine
    Nationality
    Australian and Italian
    Y-DNA (P)
    T1a2b- Z19945
    mtDNA (M)
    H95a1

    Quote Originally Posted by Mikewww View Post
    The process of reviewing comparisons of NGS results involves "weeding" out SNPs. The challenge is generally not a shortage of SNPs, but identifying the ones that appear to be shared, stable and easily testable.
    ...
    I'm confused!

    Can you give me an idea of how this applies to the scenario below? It's mine.

    We are both in the T group (both in the L446 group), and both of us have tested with natgen2.

    The only differences:
    he has basal M184 and I do not; I have basal M272
    he has an extra SNP (L455), which I do not have

    I have excluded the SNPs that tested negative.

    So, with so many matching SNPs, how does the method above compare?

    [[[Mikewww/moderator: Please post this in the haplogroup T sub-forum. Likely, a T expert will be able to help. Some SNPs are phylogenetically inconsistent, rendering them not useful. I'm not sure what the case is here, though.]]]
    Last edited by Mikewww; 05-18-2014 at 02:32 PM.


    My Path = ( K-M9+, TL-P326+, T-M184+, L490+, M70+, PF5664+, L131+, L446+, CTS933+, CTS3767+, CTS8862+, Z19945+, Y70078+ )

  4. #3
    Gold Class Member
    Posts
    896
    Sex
    Location
    California
    Y-DNA (P)
    R1b-Z2103>Y14416

    There does seem to be an "art" to judging the suitability of SNPs for testing.
    These two examples may be covered in the list you compiled. Just in case they are not, here are two quotes from Thomas Krahn. Both these SNPs were successfully tested.

    R1b-Z2106 -
    The region is a little bit "primer unfriendly", because of a huge GAAA repeat. It looks like a STR of almost 600 bases....
    R1b-Z2104 -
    You see that exactly at the SNP position (marked with blue background) we have a double peak red and blue (= T/C) which means that we have two sections of your genome amplified simultaneously.

    The region around the SNP is 100% identical to several other chromosomes and we try to filter out the Y chromosome specific sequence with selective primers, but so far only with limited success. We are currently trying to use a technology called "Nested PCR" in order to do this filtering in two rounds, but if that doesn't work, we'll likely have to give up.
    The problem with this SNP is that you can never tell if you're really sequencing the Y chromosome because highly homologous regions can recombine with each other so that characteristic markers may end up at a completely different chromosome.
    Last edited by Joe B; 05-17-2014 at 04:31 AM.
    YFull R1b-M269>L23>Z2103>Z2106>Z2108>Y14512>Y20971>Y22199, ISOGG R1b1a1a2a2c1b Y14416, FTDNA R-M64

  5. #4
    Registered Users
    Posts
    3,908
    Sex
    Y-DNA (P)
    R1b
    mtDNA (M)
    H

    Quote Originally Posted by Joe B View Post
    There does seem to be an "art" to judging the suitability of SNPs for testing. ...
    I agree and I don't like it. That's not a personal comment. I'm much appreciative of the people who have great knowledge and are able and willing to apply suitability judgements.

    I think we just need to bring more discipline and definition as a genetic genealogy community.

  7. #5
    Registered Users
    Posts
    3,908
    Sex
    Y-DNA (P)
    R1b
    mtDNA (M)
    H

    Quote Originally Posted by Mikewww View Post
    As near as I can tell, these are classifications I hear of:

    X Chromosome similarities - possibly subject to recombination events so not necessarily paternally inherited
    125bp region - hard to read consistently by the testing methods
    Palindrome region - subject to recLOH's
    Embedded in an STR - subject to disappearance or shifting due to STR mutations
    Insertion/deletion event - as opposed to point SNPs, clearly are more sporadic in most cases
    Test edge read - on the first location of a tested region and more likely to have a bad read
    ...
    I should add something of this nature too -
    Region stability rating, or perhaps just a list of problematic regions.

    I'm not sure of the exact position range, but Thomas has said the region around 22.2 to 22.6 Mb (hg19) is problematic. He's also told me that 47z is unstable. I'm not sure which is worse or how bad the problem really is. That's the point: we need some kind of rating system for this.
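A list of problematic regions could start as a simple lookup table. The sketch below assumes hg19 coordinates and uses only the approximate 22.2 to 22.6 Mb range mentioned above; the bounds are not verified, and `PROBLEM_REGIONS_HG19` and `region_notes` are names made up for this example.

```python
# Each entry: (start, end, note). Bounds are approximate and unverified;
# they are taken from the post, which says the exact range is uncertain.
PROBLEM_REGIONS_HG19 = [
    (22_200_000, 22_600_000, "reported as problematic (approx. 22.2-22.6 Mb)"),
]


def region_notes(position, regions=PROBLEM_REGIONS_HG19):
    """Return the notes for every listed region containing the position."""
    return [note for start, end, note in regions if start <= position <= end]


print(region_notes(22_400_000))  # inside the listed range
print(region_notes(10_000_000))  # outside: empty list
```

A numeric severity could later be added as a fourth field once there is agreement on how to rate "how bad" a region is.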
    Last edited by Mikewww; 05-19-2014 at 12:30 AM. Reason: correct grammar

  8. #6
    Registered Users
    Posts
    332
    Sex
    Location
    United States
    Nationality
    US
    Y-DNA (P)
    R1b U106 Z326 Z81
    mtDNA (M)
    K1a1b1a

    Quote Originally Posted by Mikewww View Post
    I agree and I don't like it. That's not a personal comment. I'm much appreciative of the people who have great knowledge and are able and willing to apply suitability judgements.

    I think we just need to bring more discipline and definition as a genetic genealogy community.
    FYI:
    http://www.biomedcentral.com/1471-2105/14/274

    Our preliminary work has shown that for sequencing datasets that have high coverage and are of high quality, SNP calling programs can perform similarly [35]. However, when the coverage level is low in a sequencing dataset, it is challenging to accurately call SNVs [36]. Moreover, commonly used SNP calling programs (e.g., SOAPsnp [19], Atlas-SNP2 [20], SAMtools [37], and GATK [27,38]) all include different metrics for each potential SNP in their output files. These metrics are highly correlated in complex patterns, which make it challenging to select SNPs that are used for further experimental validations.
    Therefore, in areas of lower coverage there is also this issue:
    when the coverage level is low in a sequencing dataset, it is challenging to accurately call SNVs
    This may apply to areas with fewer reads, in which judgement calls must be made about SNP calls.
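To make the judgement call explicit, a depth rule can be written down and applied uniformly. The following is a minimal sketch, not the method used by any of the programs named in the quoted paper; the function name `call_snp` and the thresholds `min_depth` and `min_fraction` are arbitrary assumptions.

```python
def call_snp(ref_reads, alt_reads, min_depth=10, min_fraction=0.9):
    """Classify one position from its read counts.

    ref_reads: reads matching the reference base.
    alt_reads: reads showing the variant base.
    Thresholds are illustrative placeholders only.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "no-call"      # too few reads to judge either way
    if alt_reads / depth >= min_fraction:
        return "variant"
    if ref_reads / depth >= min_fraction:
        return "reference"
    return "mixed"            # both bases well represented


print(call_snp(1, 2))     # low coverage
print(call_snp(0, 20))    # clean variant call
print(call_snp(10, 10))   # mixed reads
```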
    Last edited by warwick; 05-18-2014 at 06:21 PM.

  10. #7
    Registered Users
    Posts
    376
    Sex
    Location
    USA
    Ethnicity
    Northern Europe
    Nationality
    USA
    Y-DNA (P)
    R-FGC5301 or R-A197
    mtDNA (M)
    T1a1

    Interesting discussion.

    In my own notes I use the short expression "non-amenable" [to Sanger sequencing] when expert advice is that it would be difficult to obtain results, while also indicating that it might be possible.

    I get confused when the word "unstable" is used, in that it is the process of verification, not the SNP, that is unstable.

    When I look at the many heterozygous-read SNPs among the "shared" SNPs, I wonder whether, if future technology allows much longer read lengths, these may either go away or, for some, become actual reliable SNPs.

    Then there are NGS SNPs which may never be Sanger sequenced at least with current methods.

    It does interest me that there may be more SNPs hidden in the high-quality NGS BGI data among the non-reliable or non-reported low-read SNPs. So far, I have found two.
    Last edited by haleaton; 05-18-2014 at 08:32 PM.

  12. #8
    Gold Class Member
    Posts
    1,769
    Sex
    Location
    Virginia, USA
    Y-DNA (P)
    DF27, FGC15733
    mtDNA (M)
    T2f3

    Quote Originally Posted by haleaton View Post
    I get confused when the word "unstable" is used, in that it is the process of verification, not the SNP, that is unstable.
    I don't think I'm exactly confused, but at least annoyed, by another use of "unstable," to indicate a locus that can mutate frequently (like every few hundred years, instead of every 5-10 millennia). However, having mutated, it's as stable as it was before. So the alleged "ancestral" form might just as well be called unstable; the thing toggles back and forth occasionally, and the reference sequence happened to be caught in step A instead of step B, big deal. I have this problem with my L484 group under CTS4065. It's been stable for about 1800 years, for my subclade; but it's found in several other haplogroups, and even elsewhere within DF27, so clearly it's "unstable" -- if that's how one chooses to use that word.

  14. #9
    Registered Users
    Posts
    376
    Sex
    Location
    USA
    Ethnicity
    Northern Europe
    Nationality
    USA
    Y-DNA (P)
    R-FGC5301 or R-A197
    mtDNA (M)
    T1a1

    If FTDNA really did add positive SNP data from some 40K Geno 2.0 SNP chip samples to their new tree, I wonder how many of the positives in differing clades are simply statistically probable independent mutations, given the huge increase in sample size for the SNPs that the chip measures.
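The effect of sample size can be sketched with a simple Poisson model. Assume a site mutates at a per-generation rate `mu`, and let `total_generations` be the summed branch length of the tree connecting all tested men, which grows as samples are added. The expected number of mutation events at that site is then `mu * total_generations`, and the chance of two or more independent events follows from the Poisson distribution. The function name and the numbers in the usage lines are placeholders, not measured rates.

```python
import math


def prob_recurrent(mu, total_generations):
    """Probability of at least two independent mutation events at one site.

    Poisson model with mean lam = mu * total_generations:
    P(N >= 2) = 1 - exp(-lam) * (1 + lam).
    """
    lam = mu * total_generations
    return 1.0 - math.exp(-lam) * (1.0 + lam)


# Hypothetical inputs chosen only to show the trend:
# the same site, with the tree's total branch length growing tenfold.
for generations in (1e5, 1e6):
    print(generations, prob_recurrent(1e-8, generations))
```

Under this model, more samples mean a longer total tree, so recurrence at any given site becomes more likely even when the per-generation rate is unchanged.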

  15. #10
    Registered Users
    Posts
    332
    Sex
    Location
    United States
    Nationality
    US
    Y-DNA (P)
    R1b U106 Z326 Z81
    mtDNA (M)
    K1a1b1a

    Quote Originally Posted by haleaton View Post
    ... I wonder whether, if future technology allows much longer read lengths, these may either go away or, for some, become actual reliable SNPs.
    Read lengths of 10,000 bases are possible with technologies in development.
