STR matching for R people



Eochaidh
12-11-2015, 10:33 PM
I'm encouraging people to upgrade to 111 Y STRs while they are on sale. Here is more background on the project; and the haplotype and gateway services provided.

Is there any chance that FTDNA will relax their maximum GD of 10, so that people who have long ago upgraded to 111 STRs can see some results?

Rick
12-28-2015, 02:33 PM
Is there any chance that FTDNA will relax their maximum GD of 10, so that people who have long ago upgraded to 111 STRs can see some results?
This would be a great feature. My recently discovered little cluster/variety consisting so far of 3 men shares about 10 off modal markers at 111. I match one man at 101 markers and the other at 100. My 111 match results only show the former. FTDNA is already using drop down menus for match distances. It should not be too difficult to expand the range it would seem. Ysearch is of course more flexible, but seems to have fallen into relative disuse since the explosion of fine resolution SNP testing.

Mikewww
12-28-2015, 07:30 PM
This would be a great feature. My recently discovered little cluster/variety consisting so far of 3 men shares about 10 off modal markers at 111. I match one man at 101 markers and the other at 100. My 111 match results only show the former. FTDNA is already using drop down menus for match distances. It should not be too difficult to expand the range it would seem. Ysearch is of course more flexible, but seems to have fallen into relative disuse since the explosion of fine resolution SNP testing.
I don't think it is a technical issue. I think there is some kind of privacy related policy in play. They might need to change a policy statement and/or ask another question in the profile that allows appearance on the new "extended" matching. It's a good idea, but my guess is the implementation would not go the way we'd like. They seem prone to err on the side of privacy.

miiser
12-28-2015, 08:11 PM
The GD limit is lower than it used to be. It was reduced around the time that there was a concern by FTDNA over data harvesting by Semargl.me. So I believe the restriction is intended to make it more difficult for other organizations to harvest FTDNA's data.

Whether this is a sensible or effective response by FTDNA, or a decision in the best interest of the customer, is a question that I won't attempt to answer here.

Mikewww
12-28-2015, 11:33 PM
The GD limit is lower than it used to be. It was reduced around the time that there was a concern by FTDNA over data harvesting by Semargl.me. So I believe the restriction is intended to make it more difficult for other organizations to harvest FTDNA's data.

Whether this is a sensible or effective response by FTDNA, or a decision in the best interest of the customer, is a question that I won't attempt to answer here.

Do you have any evidence for what you are saying? I haven't kept record of things like this but my memory is good enough to know my highest GD at 67 STRs ever displayed on my FTDNA matches screen was 7. That's probably from about 5 or 6 years ago. It's still 7 today. It's never been more than 7 (and I do have a good sized group in my cluster/project). If what you are saying is true, at least as far as 67 STRs, it should be 6 or less as the maximum match.

I don't have time to research everyone's matches but my maximum at 67 of 7 amounts to a 10.45% diversity. My maximum at 37 is a GD=4 and I believe (not sure) that has been consistent over the years too. A GD of 4 at 37 amounts to a 10.81% diversity. [[[EDIT: added - I'm not sure how valuable the matches are at 37, but at 12 or 25 they probably should be messing around with the thresholds. A lot of people see a lot of junk matches at 12 and 25.]]]

As far as I can tell the FTDNA threshold is 11%. Any GD of 11% or more is not displayed on the matching screen.
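
As a rough check of the percentages quoted above, here is a minimal sketch in Python. The function names `gd_percent` and `max_gd_below` are made up for illustration, and the 11% cutoff is only the guess made in this post, not a documented FTDNA rule.

```python
def gd_percent(gd, markers):
    """Genetic distance expressed as a percentage of the markers compared."""
    return 100.0 * gd / markers


def max_gd_below(threshold_percent, markers):
    """Largest whole-number GD whose percentage stays strictly below the threshold."""
    gd = 0
    while gd_percent(gd + 1, markers) < threshold_percent:
        gd += 1
    return gd


# Assumed cutoff of 11%, applied to the usual panel sizes.
for panel in (12, 25, 37, 67, 111):
    limit = max_gd_below(11.0, panel)
    print(panel, limit, round(gd_percent(limit, panel), 2))
```

Under that assumed cutoff the sketch gives a maximum GD of 4 at 37 markers and 7 at 67, which agrees with the figures above. It would also allow a GD of up to 12 at 111 markers, which does not line up with the limit of 10 mentioned at the start of the thread, so a flat 11% may only describe the 37 and 67 marker panels.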

I do remember FTDNA updating their TIP calculator when they came out with 111 Y STRs and they said at that time, I think, they were going to more of an infinite allele method. That should actually have opened things up a bit. I don't know.

Either way, if you wish to denigrate FTDNA on a general topic please start a new thread somewhere else.

miiser
12-28-2015, 11:46 PM
Do you have any evidence for what you are saying? I haven't kept record of things like this but my memory is good enough to know my highest GD at 67 STRs ever displayed on my FTDNA matches screen was 7. That's probably from about 5 or 6 years ago. It's still 7 today. If what you are saying is true, at least as far as 67 STRs, it should be 6 or less as the maximum match.

I don't have time to research everyone's matches but my maximum at 67 of 7 amounts to a 10.45% diversity. My maximum at 37 is a GD=4 and I believe (not sure) that has been consistent over the years too. A GD of 4 at 37 amounts to a 10.81% diversity.

As far as I can tell the FTDNA threshold is 11%. Any GD of 11% or more is not displayed on the matching screen.

I do remember FTDNA updating their TIP calculator when they came out with 111 Y STRs and they said at that time, I think, they were going to more of an infinite allele method. That should actually have opened things up a bit. I don't know.

Either way, if you wish to denigrate FTDNA on a general topic please start a new thread somewhere else.

I can't travel through time, and I don't have any screen shots from the old website, so no, I don't have any "evidence". I have a distinct memory from this time period of being annoyed that the GD limit had been reduced, and at this same time some other actions were taken by FTDNA to discourage data mining in unison with a public announcement addressing Semargl's behavior. I suppose it is possible that the GD limit was reduced only for some number of markers (for example, 12 marker matches were restricted but 111 marker matches were unchanged). I don't specifically remember the details of the before and after GD limits.

At any rate, the current limit of ~10% is not even enough to encompass surname lineages, which frequently extend to 15% GD.

You started the off topic speculation as to FTDNA's reason for limiting the GD, and my own comment was only a response to yours. And my own comment denigrates FTDNA no more than your own does, but simply describes a possible motivation for the GD limitation. If you wish to move both my and your own comment regarding such speculation to a separate thread, this would be a rationally defensible moderator action, though certainly not a necessary or respectable one.

Dubhthach
12-28-2015, 11:54 PM
If I recall, if someone had tested to 111 markers you could look at the 67 STR matches with GD up to 11. You could even use this trick at lower levels (37, 25, etc.). I reckon it was a "bug" (*cough* Feature *cough*) of their web interface which they subsequently "fixed".

VinceT
12-29-2015, 01:05 AM
If I recall, if someone had tested to 111 markers you could look at the 67 STR matches with GD up to 11. You could even use this trick at lower levels (37, 25, etc.). I reckon it was a "bug" (*cough* Feature *cough*) of their web interface which they subsequently "fixed".
Yup. That short-lived "bug" had allowed many, including myself, to find haplotype matches critical and advantageous to advancing their research.

Mikewww
12-29-2015, 01:07 AM
This has come up on other threads off-topic so I'll move it here.

Mikewww
12-29-2015, 01:13 AM
...
You started the off topic speculation as to FTDNA's reason for limiting the GD, and my own comment was only a response to yours...
The off-topic tangent was not started by me. It was an innocent question in the flow of normal conversation. See reply #1. I attempted to answer it. That's what I get for allowing any leeway.

miiser
12-29-2015, 03:25 AM
The off-topic tangent was not started by me. It was an innocent question in the flow of normal conversation. See reply #1. I attempted to answer it. That's what I get for allowing any leeway.

My point was that you didn't seem to be irritated by the off topic leeway with regards to your own comment. It was only when someone disagreed with you that the leeway became a "problem". I'll let it drop there and say no more.

Mikewww
12-29-2015, 02:43 PM
My point was that you didn't seem to be irritated by the off topic leeway with regards to your own comment. It was only when someone disagreed with you that the leeway became a "problem". I'll let it drop there and say no more.
I thought your point was "You started the off topic speculation" directed towards me, but in fact I didn't start the off-topic speculation. I guess one point begets another, which is how spiraling out of control happens.

I'm just trying to keep things on topic per the thread while allowing reasonable and innocent off-tangent questions in the normal flow of the conversation. It does take a little more time to struggle through these kinds of things from a moderator perspective. I try to keep to the thought that a one-step off-tangent for a brief reply is okay, but when it goes beyond that into additional steps, as it did in this case, it is time to clean up and stick to the original purpose of the thread. I probably should use "delete" more often as it is easier from a moderator perspective but I just don't want to suppress posting. I just think posts belong in the right places.

George Chandler
12-29-2015, 03:57 PM
I doubt FTDNA's reason would be to prevent data harvest because why would they encourage people to upload their results to YSearch? It's probably more to do with people coming to incorrect conclusions about how close the relationship actually is depending on the STR's that are off. I have one in my Chandler group that we differ by 10 and yet the MRCA is 1595. I have others who have a similar STR pattern and a couple of positions off in places like YCAII and any MRCA is likely beyond 5,000 years.

George

miiser
12-29-2015, 09:55 PM
I doubt FTDNA's reason would be to prevent data harvest because why would they encourage people to upload their results to YSearch? It's probably more to do with people coming to incorrect conclusions about how close the relationship actually is depending on the STR's that are off. I have one in my Chandler group that we differ by 10 and yet the MRCA is 1595. I have others who have a similar STR pattern and a couple of positions off in places like YCAII and any MRCA is likely beyond 5,000 years.

George

Whether the GD limit is related to data mining cannot be known for certain without insider knowledge. But there can be no doubt that FTDNA is concerned about data mining, as changes were made to both the FTDNA website and the YSearch website in response to the Semargl crisis. Some of those changes were even publicly announced. Added security measures were introduced to YSearch to prevent rapidly repeated automated searches by bots. FTDNA customers were informed that they were not permitted to join numerous projects without reason. Project admins affiliated with Semargl were removed from projects. In at least one of the larger projects with multiple admins, the project was taken over by an FTDNA rep and the other admins, including myself, were interrogated regarding their involvement in the project and the purpose of the project. FTDNA's terms of use were revised to make it clear that any behavior in support of data mining is forbidden.

There should be no question that FTDNA is concerned about data mining. The existence of YSearch predates the Semargl crisis, and although I applaud FTDNA for still permitting its use in the wake of this crisis, their willingness to allow it is not proof that there is no concern. And the Big Y matching algorithm seems to go against your suggestion that FTDNA is worried about confusing people with misleading "matches".

I don't know for certain that the matching limit is associated with data mining concerns. But I know that getting a long list of matches with a single search makes it easier to data mine, and a short list makes it more difficult to data mine. I know that changes were made to FTDNA's website and the YSearch website in response to data mining. And I know the Y-DNA matching GD limit changed at around the same time. The larger GD being permitted in the first place may have been only a software bug, as suggested by others. But if this is the case, the timing of events strongly suggests that the bug was brought to FTDNA's attention, flagged as a concern, and "corrected" as a direct response to the data mining crisis. The concurrence of events is strongly suggestive.

George Chandler
12-30-2015, 03:51 PM
Most companies on the web are concerned about data mining issues in one form or another. I personally don't see the connection for this topic though - I have no connection to FTDNA or any other company other than being a customer. Most people post their STRs publicly within groups, although I understand it would be frustrating for a company to see another copy their results. I have no idea what happened with Semargl or how it was a crisis, but it's good that FTDNA is taking steps to protect people's privacy, and if that involves you being asked about your association with a group where data was possibly being pirated then so be it.

The Big Y matching issue is a separate topic probably best suited under the gripes thread.

I'm not really sure how getting a long list of potential STR matches from within the personal page makes it easier to data mine?? There is nothing to indicate which STRs mismatch? What data is being data mined when it says you are off another surname by 10 mutations at 111 markers? The ability to send a message to that recipient?

George

miiser
12-30-2015, 11:12 PM
I'm not really sure how getting a long list of potential STR matches from within the personal page makes it easier to data mine?? There is nothing to indicate which STRs mismatch? What data is being data mined when it says you are off another surname by 10 mutations at 111 markers? The ability to send a message to that recipient?

George

I assumed someone would eventually ask this question, so I had the answer ready. The values can easily be determined by triangulating the distance to a few known kits. This would give a data miner the STR values and the surname, which is all Semargl cares about. For example, suppose we have access to three kits with known STR values, and known GD to an unknown kit:

13 24 14 10 11 15 12 12 12 13 13 29 GD=1
13 25 14 10 11 15 12 12 12 13 13 29 GD=2
13 25 14 10 11 15 12 12 12 14 13 29 GD=1

Unknown kit = ?

Can the values of the unknown kit be deduced from the GD to each known kit? I can do it. Pretty sure Semargl and other data miners can as well. It is fairly easy to write a computer program that can do this automatically by triangulating from a handful of known kits.
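
For what it's worth, the worked example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: GD is taken to be a plain stepwise count (the sum of absolute per-marker differences), which may not be how FTDNA actually computes it, and the names `genetic_distance`, `offsets` and `triangulate` are invented here.

```python
def genetic_distance(a, b):
    """Assumed metric: sum of absolute per-marker differences (stepwise GD)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def offsets(n_markers, budget):
    """Yield every integer offset vector whose absolute values sum to `budget`."""
    if n_markers == 0:
        if budget == 0:
            yield ()
        return
    for step in range(-budget, budget + 1):
        for rest in offsets(n_markers - 1, budget - abs(step)):
            yield (step,) + rest


def triangulate(known):
    """Return every haplotype consistent with all (kit, GD) pairs in `known`."""
    (first_kit, first_gd), others = known[0], known[1:]
    matches = []
    # Only haplotypes at exactly `first_gd` from the first kit can qualify,
    # so enumerate those and filter them against the remaining kits.
    for delta in offsets(len(first_kit), first_gd):
        candidate = tuple(v + d for v, d in zip(first_kit, delta))
        if all(genetic_distance(candidate, kit) == gd for kit, gd in others):
            matches.append(candidate)
    return matches


# The three known kits from the example, each paired with its GD to the unknown kit.
known_kits = [
    ((13, 24, 14, 10, 11, 15, 12, 12, 12, 13, 13, 29), 1),
    ((13, 25, 14, 10, 11, 15, 12, 12, 12, 13, 13, 29), 2),
    ((13, 25, 14, 10, 11, 15, 12, 12, 12, 14, 13, 29), 1),
]
print(triangulate(known_kits))
```

With the three kits listed above, this search leaves a single candidate: the second marker is 24 and the tenth marker is 14, with every other marker unchanged.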

If the Y-DNA matching limit were to loosen up and permit me to see the GD to, let's say, 5% of the FTDNA database with each search, then I'd be able to deduce the STR values of most kits in the database using a pretty small number of searches.

For reference, here's a blog post that will give you some background regarding the Semargl crisis: http://dna-explained.com/category/yfull-company/

FTDNA certainly responded as if it was a crisis. It was a pretty big to-do at the time.

As a side note, I think it's strange that certain people on this forum always assume that I'm anti-FTDNA. Recognizing that FTDNA has motives as a business, and that their motives may not always align with mine, does not mean I am anti-FTDNA. It just means I'm unbiased enough to see things from both sides and smart enough to imagine the motives of people other than myself.

Data mining prevention is a valid, rational reason for FTDNA to limit the GD of matching. I don't believe it's what most customers would prefer, but it's a valid concern and I don't hold it against them for being concerned about it. FTDNA sometimes does things for the benefit of their profits rather than the benefit of the customer, as most businesses do. Recognizing this reality does not make me some kind of covert rebel agitator. But I think refusing to acknowledge that FTDNA has their own motives and agenda which don't always align with the customer's makes one either a chump or a corporate shill. When FTDNA takes actions that are not in the best interest of the customer, I think it's important that project administrators call them out on it and apply pressure to fix it. Recognizing the reason for FTDNA's behavior is a necessary prerequisite to developing a strategy to modify it.

Eochaidh
01-01-2016, 09:54 PM
The point of my original question was that as of now, the algorithm gives me access to 0% of the FTDNA database. A simple cut-off depending only upon GD which returns no data would get me fired from my job as an Oracle Developer. You always return data, but the quantity of that data may be limited by business rules.
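
To make the contrast concrete, here is a minimal sketch in Python of a hard GD cut-off next to a "closest N" rule that always returns something. It is not FTDNA's code: the function names, the toy kits and the stepwise distance are all assumptions made for illustration.

```python
import heapq


def genetic_distance(a, b):
    """Assumed metric: sum of absolute per-marker differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def matches_with_cutoff(query, kits, max_gd):
    """Hard cut-off: kits beyond max_gd are dropped, so the result can be empty."""
    scored = ((genetic_distance(query, values), name) for name, values in kits.items())
    return sorted(pair for pair in scored if pair[0] <= max_gd)


def closest_matches(query, kits, limit):
    """Business-rule limit: always return the `limit` nearest kits, whatever their GD."""
    scored = ((genetic_distance(query, values), name) for name, values in kits.items())
    return heapq.nsmallest(limit, scored)


# Hypothetical three-marker kits, purely for demonstration.
toy_kits = {"A": (13, 24, 14), "B": (13, 26, 15), "C": (15, 27, 16)}
toy_query = (13, 24, 17)
print(matches_with_cutoff(toy_query, toy_kits, max_gd=2))
print(closest_matches(toy_query, toy_kits, limit=2))
```

In this toy data the cut-off version returns nothing, while the "closest N" version still returns the two nearest kits along with their distances.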

miiser
01-01-2016, 10:35 PM
The point of my original question was that as of now, the algorithm gives me access to 0% of the FTDNA database. A simple cut-off depending only upon GD which returns no data would get me fired from my job as an Oracle Developer. You always return data, but the quantity of that data may be limited by business rules.

I agree with you, and I think quite a few FTDNA customers would say that FTDNA's IT people need to be fired and replaced.

The point of my previous posts was to suggest a possible reason for FTDNA to decide to limit the GD so arbitrarily and at such a low value. I don't think it's for customer privacy reasons, because there are already privacy options that can enable or disable one's appearance in matching. Increasing the GD limit won't affect this. I don't think it's a technology limitation, because increasing the GD limit would be a trivially easy software change. Also, the limit was higher in the past, so we already know it can be done. I don't think it's to avoid confusing customers with distant matches, because there is already ample evidence that this is not a priority for FTDNA.

I believe FTDNA could easily increase the GD limit, but have deliberately chosen not to. I join you in requesting that they increase the GD limit in order to increase the usefulness of their Y-DNA matching.

The limited matching is one of the big reasons that most of the data analysis has moved outside of the FTDNA sphere into the hands of project admins and independents. All the data is now collected, organized, analyzed, and presented outside of the FTDNA system, because the FTDNA system is insufficient. It is in FTDNA's own interest to remedy this situation, because it tends to improve the long term viability of FTDNA's competitors by making it desirable to take one's data elsewhere and unnecessary to remain within the FTDNA ecosystem.