Best way to intermix 67 marker and 111 marker signatures for charting

12-11-2017, 04:51 AM
With so many embracing 111 markers and the upcoming FTDNA release of many more YSTRs with Big Y testers, I really feel that adding 111 markers to my 67 marker charts would increase the number of signatures and therefore increase the accuracy of charting. I know that many others have mixed resolutions and would like your input on the best approaches. There are two major different approaches:

1) Start charting in the first round with only 111 marker signatures - assuming that the increased number of markers is more important than the significant decrease in sample size.

2) Start charting in the first round with only 67 marker signatures - assuming that the much larger sample size will reveal better signatures than the much smaller 111 marker sample size.

Assuming one starts with 111 marker testers as first described in 1), there are two options and a third option that could be added:

1-a) Force 67 marker testers to fit into 111 marker YSTR branches (assuming missing data will match 111 markers).

1-b) Leave 67 marker testers above, since there is no evidence that they belong with 111 marker YSTR branches (assuming missing data may not match 111 markers).

1-a1 or 1-b1) Use tie breaker criteria at times to choose based on rarer marker events. This list would include: very rare marker values, very slow mutating markers, null markers, extra multi-copy markers, dependent mutations (two mutations of the same marker along the same path) and maybe multi-step mutations. This list would not include: average and faster mutating markers (the vast majority of mutations).

Assuming one starts with 67 marker testers first as described in 2), there are two options and a third option that could be added:

2-a) Only allow 111 marker YSTR branches to be placed between existing 67 marker signatures (do not assume missing markers with 67 marker signatures will be forced into 111 marker YSTR branches).

2-b) Force 67 marker testers to fit into 111 marker signatures (assuming missing data matches 111 markers).

2-a1 or 2-b1) Use tie breaker criteria at times to choose based on rarer marker events. This list would include: very rare marker values, very slow mutating markers, null markers, extra multi-copy markers, dependent mutations (two mutations of the same marker along the same path) and multi-step mutations. This list would not include: average and faster mutating markers (the vast majority of mutations).

I have already spent a significant amount of time fine tuning a 67 marker chart for over 500 testers for L226, so it would be easier to just start with the 67 marker only chart and add the 111 marker results. It just seems that starting with one resolution vs. another would significantly affect accuracy. In practice, this choice will only really be available with automated charting software, since any manual method is biased by not starting over every time new results come in. It has also become apparent that you should have some pretty rigid rules for the missing data of 67 marker testers. And lastly, I think that the tie breaking characteristics of 67 marker mutations could be used to determine what to assume about missing marker values.

Would appreciate any input on the above - things that are missing, one choice over another, etc.

12-11-2017, 03:53 PM
Great question... I’ll describe how SAPP handles it and perhaps others have a different approach. SAPP basically takes your approach 1a (or 2a) but with some nuances.

I’ll talk about 67 markers vs. 111 but this same description is true for any variability in the number of markers among kits.

Say you have a complete group of 67 marker data with some number of the kits having tested out to 111. There are then two attributes of the data that are important for branching purposes:

A. The available marker data in the 68-111 range only help define what I’ll call the “coalescence subtree” of the kits that have tested out to 111 markers. For instance, if you have a group of N kits that have all tested to 67 and,

- if there is only 1 kit that has tested to 111 markers, obviously that 68-111 range data is useless for branching purposes,
- if there are 2 kits that have tested to 111 markers, then that 68-111 range data is only useful for branching within the subtree defined by the common MRCA of those two kits,
- if there are 3 kits that have tested to 111 markers, then that 68-111 range data is only useful for branching within the subtree defined by the (up to) 2 MRCAs that the 3 kits have in common,
- in general as the number of 111 marker kits approaches N, the 68-111 range data will be useful for branching everywhere except subtrees where all kits have only 67 markers.
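The bullets above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not SAPP's code: the `Node` class, `count_extended` and `coalescence_points` are hypothetical names, and a "coalescence point" is taken to be any branch point with 111 marker kits under at least two of its child branches.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    markers: int = 0  # markers tested; only meaningful for leaf kits
    children: List["Node"] = field(default_factory=list)


def count_extended(node: Node) -> int:
    """Number of 111 marker kits at or below this node."""
    if not node.children:
        return 1 if node.markers >= 111 else 0
    return sum(count_extended(child) for child in node.children)


def coalescence_points(root: Node) -> List[str]:
    """Branch points where two or more child branches contain a 111 marker kit.

    Only at these points can the 68-111 range say anything about branching.
    """
    points = []

    def walk(node: Node) -> None:
        if not node.children:
            return
        branches = sum(1 for child in node.children if count_extended(child) > 0)
        if branches >= 2:
            points.append(node.name)
        for child in node.children:
            walk(child)

    walk(root)
    return points
```

With three 111 marker kits in a binary tree this returns at most two branch points, and with a single 111 marker kit it returns none, which matches the cases listed above.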

B. For branching purposes, markers that don’t mutate are equivalent to having no data for that marker. In other words, no matter how many 111 markers you have in the input set, if there were NO differences in the 68-111 range then there are no signatures or anything to help determine branching and it’s functionally equivalent to not having any data in that range at all. In terms of your original post, this means that when there are no off-modals for a marker from the group’s modal haplotype, your scenario 1a is functionally equivalent to your 1b (and 2a to 2b).
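As a rough sketch of attribute B (hypothetical helper names, with `None` standing in for an untested marker, and not taken from SAPP itself), you can list which marker positions carry any branching information at all:

```python
from collections import Counter


def modal_haplotype(kits):
    """Most common tested value at each marker position (None if nobody tested it)."""
    modal = []
    for column in zip(*kits):
        tested = [value for value in column if value is not None]
        modal.append(Counter(tested).most_common(1)[0][0] if tested else None)
    return modal


def informative_markers(kits):
    """Positions where at least one tested kit is off-modal."""
    modal = modal_haplotype(kits)
    return [
        i
        for i, column in enumerate(zip(*kits))
        if any(value is not None and value != modal[i] for value in column)
    ]


# Four kits padded to equal length; the last position is untested in two of them.
kits = [
    [13, 24, 14, None],
    [13, 24, 15, None],
    [13, 24, 14, 11],
    [13, 24, 14, 11],
]
print(informative_markers(kits))
```

In this made-up example only the third marker is informative. The fourth marker has data for two kits but no off-modal values, so for branching it behaves exactly as if it had not been tested.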

You can apply these two attributes to sort through the data and only apply the extra markers when you need them. In a bottom-up algorithm like SAPP this is very easy by applying two rules:

I. When comparing two nodes (kits or branch points) of unequal number of markers, fill in the kit with the smaller number of markers with the values from the larger kit’s range, and if they belong under a common MRCA, use those markers for the common MRCA’s haplotype as well. This is essentially your 1a.

II. When comparing two nodes of equal length (whether 67 or 111), the resulting branching point has the same number of markers.

So if you have one 111 marker test under a common subtree with 4 other 67 marker tests, this will add the same 68-111 range to all 4 other kits. By attribute B this doesn’t change the analysis for that subtree, but it will matter as soon as the next coalescence point with another 111 marker test is reached in the tree.
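A minimal sketch of rules I and II, assuming each haplotype is a plain list of marker values in a fixed order. The function names are made up for illustration and this is not the actual SAPP code:

```python
def compare_nodes(a, b):
    """Bring two haplotypes to a common length before comparing them.

    Rule I: the shorter haplotype borrows the longer one's extra range.
    Rule II: haplotypes of equal length are left as they are.
    """
    a, b = list(a), list(b)
    if len(a) < len(b):
        a += b[len(a):]
    elif len(b) < len(a):
        b += a[len(b):]
    return a, b


def differences(a, b):
    """Marker positions that differ once the shorter haplotype has been filled in."""
    a, b = compare_nodes(a, b)
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


short_kit = [13, 24, 14]          # stands in for a 67 marker kit
long_kit = [13, 25, 14, 11, 12]   # stands in for a 111 marker kit
print(compare_nodes(short_kit, long_kit)[0])
print(differences(short_kit, long_kit))
```

Because the borrowed range is copied from the longer kit, it can never add a difference between the two nodes. It only matters later, when the filled-in branch point is compared against another node that has real data in that range.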

Obviously the analysis would be improved if you had the actual 68-111 range data for the 67 marker kits and you’d probably discover some mutations and perhaps even signatures that weren’t visible before. But you can only do the best you can with the data you DO have. Once other kits have upgraded to 111 markers the branching may change.

In a top down approach you have to go by the available signatures first. So in a top-down approach you compare either the complete range of two kits of equal length or the shorter range of two unequal kits to determine likely common groups first. That way again you’re using the 68-111 range where it’s applicable to find the “coalescence subtree” for the 111 marker kits, otherwise you’re just using the first 67 markers.
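For the top-down comparison, a sketch could look like the following (again with an assumed list-of-values layout and a hypothetical function name). Unlike the bottom-up fill-in, nothing is borrowed here; the extra range of the longer kit is simply ignored:

```python
def shared_range_differences(a, b):
    """Compare two kits over the range both have tested.

    Returns the differing marker positions and how many markers were compared.
    """
    n = min(len(a), len(b))
    diffs = [i for i in range(n) if a[i] != b[i]]
    return diffs, n


# Unequal kits: only the first three markers are compared.
print(shared_range_differences([13, 24, 14], [13, 25, 14, 11, 12]))
# Equal kits: the complete range is compared.
print(shared_range_differences([13, 24, 14, 11], [13, 24, 14, 12]))
```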

SAPP does both - it first applies a top-down signature methodology, and then using those signature groups for weighting it applies the bottom-up methodology.

12-12-2017, 12:40 AM
Thanks Dave - I guess I was expecting this to be more complicated than it really has to be.

First - you really start with 111 markers first - since they could produce larger signatures. Eventually, some 67 marker signatures would be larger than any found in 111 marker tests. I always start with the largest signatures first as they always have the highest probability of being correct. Signatures of four or more markers have been unique to date, and only a couple of three marker signatures required some tie breaking, which was pretty obvious. For two marker signatures, 90 % are surprisingly unique but the last ten percent are hard to analyze and resolve.

Second - I like 1a) like you do too. This is more of a style choice but I do not like the thought of artificially breaking up a branch due to missing data. Of course, the missing data may later prove not to be matching - requiring changes in the chart at a later date when some testers upgrade to 111 markers. It just seems that there should be a higher probability of matching vs. not matching on the missing markers.

This is similar to moving a signature up one branch level if part of the YSTR branch later tests negative for the lower branch. In this case, I always leave the untested YSNP YSTR branches at the lower level until they are later tested and are required to be moved to the higher level. This is a huge issue for L226. DC36 is full of O'Briens who want to be connected to King Brian Boru. Recently, one YSTR signature tested DC36 negative and moved up a level. If I move up all other YSTR branches that have not been YSNP tested, this would imply that I support that this is part of the royal O'Brien line, which I do not want to publish without more YSNP testing.

We are also in sync with sections A and B which define signatures and some rules on how they work. However, I am different from what you are doing in a significant way. I always start with the largest signatures first as they always have the highest probability of accuracy. The only weighting that I do is what I call tie breaking choices when you have two equal signatures to choose from (almost always two marker signatures). I used to discard signatures when two different ones were found but now I use rare marker characteristics to favor one over the other. I still discard signatures where there are three possible choices. This tie breaking methodology is sometimes used for three marker signatures (once or twice) and only used for two marker signatures five or six times. It is probably more effort than it is worth to actually program this tie breaking methodology (a lot of code for very little improved charting).
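For what it is worth, the selection rule itself is short when written out. The sketch below is only an illustration: the dictionary layout is made up, and counting "rare events" as a single number per signature is my simplifying assumption, since the real judgment of what counts as rare is done by hand.

```python
def resolve(candidates):
    """Pick one signature from competing candidates, or None to discard.

    Each candidate is a dict with 'name', 'size' (markers in the signature)
    and 'rare_events' (assumed count of rare marker characteristics).
    """
    if not candidates:
        return None
    largest = max(c["size"] for c in candidates)
    best = [c for c in candidates if c["size"] == largest]
    if len(best) == 1:
        return best[0]["name"]  # largest signature wins outright
    if len(best) == 2:
        a, b = best
        if a["rare_events"] == b["rare_events"]:
            return None  # tie breaker cannot separate them
        return (a if a["rare_events"] > b["rare_events"] else b)["name"]
    return None  # three or more equal choices: discard


print(resolve([
    {"name": "sig-1", "size": 2, "rare_events": 0},
    {"name": "sig-2", "size": 2, "rare_events": 1},
]))
```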

Unlike a program, manual analysis is biased toward using the existing chart - which in my case is 616 67 marker testers. So it will be trickier to maintain a chart manually since it would be so time consuming to start the analysis completely over again with each new test result. However, I have not found this to be a major issue (other than occasionally catching a missed signature that is obviously better).

On a side note, I keep finding parallel mutations to the same marker value in the same part of the chart on a regular basis. But I spent way too much time attempting to eliminate these parallel mutations, as any changes just create equal probability parallel mutations that have the same odds. But it does bother me that I find ten to twenty of these that just cannot be resolved. Obviously, more markers will continue to resolve these unusually higher than expected occurrences.

I would like to hear more about your weighting of signature groups. I assume that this is a tie breaking methodology to determine which two marker signatures are higher probability? I find that out of 616 67 marker testers, I have fewer than ten tie breakers for two and three marker signatures. Or is this primarily used for one marker signatures - I do not chart these due to having ten to twenty possible places to place them, and this would be in the five to ten percent accuracy range. The probability is probably a lot higher as many of the matches are deep in the progression of YSTR mutations and could be eliminated. But assigning one marker signatures would always be below 50 % accuracy.
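The five to ten percent figure is just one over the number of candidate locations, if you assume every location is equally likely (my simplifying assumption; the function name is made up):

```python
def placement_accuracy(candidate_locations, eliminated=0):
    """Chance of a correct placement when every remaining location is equally likely."""
    remaining = candidate_locations - eliminated
    if remaining < 1:
        raise ValueError("at least one candidate location must remain")
    return 1.0 / remaining


print(placement_accuracy(10))                 # ten possible places
print(placement_accuracy(20))                 # twenty possible places
print(placement_accuracy(20, eliminated=16))  # after ruling out sixteen of them
```

Eliminating the candidates that sit deep in the progression of mutations raises the odds, but under this assumption at least two plausible locations still remain in most cases.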

01-02-2018, 08:11 PM
I'm afraid I am not going to help you much as I create STR signatures by hand but I applaud your work. I think the days of Y67 as base are about done. It's on to Y111 as the base with the tsunami of Big Y extracted STRs coming.

Y500 or whatever they are will pose an interesting consideration vis a vis autosomal DNA testing as a supplement. I think in the past Y DNA barely encroached into the last couple of hundred years which is where the Family Finder types of tests play. They run out of gas at about 200 years.

There may be an important role for Y500 in concert with atDNA. The Y has a strict father-son inheritance property so it can be firm on that side of things, but since STRs can back-mutate that takes away some of that value. However, if you can get a multiple STR signature pattern, that increases confidence again, particularly for a short duration like the last 200 years.

01-02-2018, 10:04 PM
I have confirmed that I've seen Y STR panels come with 4-5 week turnaround. I think the additional equipment capacity they have brought in will allow them to re-emphasize STRs.