Checking my math... matching phased data: What to consider beyond simple probability?



therrien.joel
12-19-2017, 09:21 PM
I've been sitting on this data for a while but finally had a moment to work out the rather simple math for this scenario. I just want to see if there is anything else that I should take into consideration.

I have a likely distant match, and I had originally assessed the probability of us being related via triangulation. The results were pretty good; however, because we are talking about very small segments here (3-4 cM range), I of course had some healthy skepticism about whether I could trust what I was seeing. As it happens, this person also has phased data, so I checked myself against that. Some of the matching segments that had been there with the unphased data disappeared, but some remained. Nice! The trouble was that I was unsure what the probability of a false match was in this case. This is where I would love to hear some comments on whether I am doing this right and whether there are other considerations I should take into account.

Basically, I approached this by calculating the probability of a false match per matching SNP. If the phased data is an 'A', for example, then I can consider all the combinations for the other person that would include an 'A' as well. There are sixteen possible combinations total, of which seven contain an 'A'. Four of those would represent true matches, where the 'A' shows up on the strand that has the same sequence. The other three contain an 'A' but not on the matching strand. This means that out of the seven combinations that yield a match, three are false positives, so there is a 42.9% (3/7) chance that a matching SNP is a false positive. Now, if I consider a contiguous run of SNPs that make up a matching segment, I would take that probability and raise it to the power of the number of SNPs in the segment to find the probability that the entire segment is a false match. If I do that, I end up with astronomically small values, which makes me think I am missing something here.
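For anyone who wants to check the counting, here is a minimal Python sketch of the calculation as described above. It is only a restatement of that reasoning, not a validated model: it assumes all sixteen ordered allele pairs are equally likely and that SNPs are independent of each other, both of which are simplifications. The function names and the choice to treat the first position of each pair as the "matching strand" are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

BASES = "ACGT"


def per_snp_false_positive(phased_allele="A"):
    """Fraction of apparent matches that are false under the simple counting model.

    Assumption: each of the 16 ordered (strand1, strand2) pairs is equally
    likely, and strand1 is the strand that would carry a true match.
    """
    combos = list(product(BASES, repeat=2))                      # 16 ordered pairs
    containing = [c for c in combos if phased_allele in c]       # 7 pairs contain the allele
    false_hits = [c for c in containing if c[0] != phased_allele]  # 3 pairs: wrong strand only
    return Fraction(len(false_hits), len(containing))


def segment_false_positive(n_snps, p=None):
    """Probability that every SNP in a run is a false match.

    Assumption: SNPs are independent, so the per-SNP value is simply
    raised to the power of the number of SNPs.
    """
    if p is None:
        p = per_snp_false_positive()
    return float(p) ** n_snps


if __name__ == "__main__":
    print("per SNP:", per_snp_false_positive())
    for n in (10, 100, 500):
        print(n, "SNPs:", segment_false_positive(n))
```

Running this shows how quickly the product collapses toward zero as the SNP count grows, which is exactly the "astronomically small" behavior described above. Since the result depends entirely on the two assumptions, it says more about the model than about the segment.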

BTW, this is for data from the same chip type, so I can thankfully be less worried about GEDmatch's tendency to count no-calls as matching SNPs.

And, if I do get the math correct at some point, does anyone have thoughts on how far back, generation-wise, a segment in the range of 3 cM could survive? I have no illusions that this will be discoverable by a paper search, but this particular match would confirm something for me even without a paper trail. That is, unless a segment like this could survive for thousands of years! BTW, there are multiple matching segments, if that matters to this question.