-
Registered Users
Automated SNP naming - a question
Apologies in advance if this has been discussed before.
Y chromosome SNPs seem to get named AFTER they are found in an individual, and often even then only when they have been found in more than one individual.
I am not really sure why SNP naming began in the first place. I have assumed it was as some sort of recollection and discussion aid. Everyone can easily recall what we mean by L21 rather than saying 15654428C>G.
We all know the existing problem with the growing number of synonyms for each SNP adding to confusion. A newer problem for me is the length of the names. DF27 was easy to remember, FGC17112 not so much, and FGC100001 cannot be far off. For people who get a little muddled at times, it's getting to the point where remembering 15654428C>G is not that much harder than remembering something like Z123454 that may be along sooner or later.
So I was wondering if there had been much or any discussion about naming all the SNPs right now rather than waiting until they are discovered in someone. That would allow a logical and simple naming pattern to be deployed that avoids duplication and synonyms and keeps the number of characters down to 5 or fewer, so that SNP names don't become so long they are forgettable.
My reasoning behind naming all SNPs in advance is this. There are over 7 billion people in the world today and roughly 60 million base pairs on the Y chromosome. That suggests that most of those 60 million positions exist as SNPs in at least one of those 7 billion people, and those that don't are being quickly filled up by ongoing human reproduction. Perhaps one of them was while I was writing this.
So if most exist, and those that don't yet will exist at some point in the future, why not just name everything now in advance?
I was thinking of something along the lines of...
60 million positions is nicely close to the 60 million+ names (36^5 = 60,466,176) we can get from strings of 5 characters drawn from the 36 basic alphanumeric characters, so all SNP names would be a maximum of 5 characters long.
So position 1 on the Y chromosome is named A, position 2 is named A1, position 3 is named A2, and so on until A0; the next position is A11, and so on until A9999, then AA, then AA1, and so on.
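To make the counting concrete, here is a minimal sketch of one way a position could be turned into a short name. It is not the exact A, A1, A2 ordering described above; it swaps in plain base-36 numbering (digits 0-9 followed by A-Z), which is the simplest scheme that gives every position from 1 to 36^5 a unique name of at most 5 characters. The function names `position_to_name` and `name_to_position` are made up for illustration.

```python
import string

# 36 symbols: the digits 0-9 followed by the letters A-Z.
ALPHABET = string.digits + string.ascii_uppercase
MAX_LEN = 5
CAPACITY = len(ALPHABET) ** MAX_LEN  # 36**5 = 60,466,176 distinct names


def position_to_name(position: int) -> str:
    """Encode a 1-based position as a base-36 name of at most 5 characters."""
    index = position - 1  # work with a 0-based index internally
    if not 0 <= index < CAPACITY:
        raise ValueError(f"position must be between 1 and {CAPACITY}")
    if index == 0:
        return ALPHABET[0]
    digits = []
    while index:
        index, remainder = divmod(index, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def name_to_position(name: str) -> int:
    """Decode a base-36 name back to its 1-based position."""
    return int(name, len(ALPHABET)) + 1


if __name__ == "__main__":
    for pos in (1, 36, 37, 15654428, CAPACITY):
        print(pos, position_to_name(pos))
```

Because the mapping is plain arithmetic, nobody would need a lookup table to go from a name back to a position, although that only holds for as long as the position itself stays fixed.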
Then when a new SNP is discovered in someone, they don't have to wait for it to be named, as it's all already taken care of; it does not later inherit a bunch of other names and is guaranteed to be short. The downside to limiting it to 5 characters is that you get some pretty unmemorable names like 1A2BB, but is that worse than saying and remembering YSC0000191, which sits above P312?
The other issue is that, obviously, different mutations can occur at the same position. Where this happens you could employ an _1, _2 system or _G, _A or something similar, and this is not much different from how recurrent mutations are named now.
So in any SNP database you then have the position and mutation, the 5-character name, and some flags so you can exclude unreliable regions and currently unmapped regions from display as needed.
So after all that rambling: is this something that has been discussed before, and is anyone aware of someone having already implemented a SNP naming system like this somewhere?
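As a rough illustration of what one row of such a database might look like, here is a small sketch. The class name `SnpRecord`, the field names, and the placeholder name "ABCDE" are all assumptions made up for this example; only the position and mutation (15654428C>G, known today as L21) come from the post above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpRecord:
    """One hypothetical database row: position, mutation, short name, flags."""
    position: int                    # coordinate on the Y chromosome
    ref: str                         # reference base
    alt: str                         # mutated base
    name: str                        # pre-assigned name of at most 5 characters
    unreliable_region: bool = False  # flag: position lies in an unreliable region
    unmapped: bool = False           # flag: position is currently unmapped

    @property
    def full_name(self) -> str:
        # The _G / _A style suffix tells apart different mutations
        # that occur at the same position.
        return f"{self.name}_{self.alt}"


def displayable(records):
    """Keep only the records that are not flagged for exclusion."""
    return [r for r in records if not (r.unreliable_region or r.unmapped)]


if __name__ == "__main__":
    records = [
        SnpRecord(15654428, "C", "G", "ABCDE"),  # "ABCDE" is a placeholder name
        SnpRecord(15654428, "C", "A", "ABCDE"),  # same position, different mutation
        SnpRecord(123, "A", "T", "3F", unreliable_region=True),  # invented row
    ]
    for record in displayable(records):
        print(record.full_name, record.position, f"{record.ref}>{record.alt}")
```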
Earl.
-
-
Gold Class Member
The position cannot be considered constant over time. It may change from build to build of the reference genome. More specifically, any nomenclature system like the one you propose needs to handle insertions, deletions, and translocations. For palindromic regions, how would you track the translocations?
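A toy sketch of the simplest case, an insertion, shows why a name derived purely from position would not be stable. The sequences and the helper `shift_position` are invented for illustration and have nothing to do with real reference builds.

```python
def shift_position(position: int, insertion_at: int, insertion_length: int) -> int:
    """Return the new 1-based coordinate of a base after `insertion_length`
    bases are inserted immediately before coordinate `insertion_at`."""
    if position >= insertion_at:
        return position + insertion_length
    return position


if __name__ == "__main__":
    old_build = "ACGTTGCA"                           # toy "reference", 8 bases
    new_build = old_build[:3] + "GG" + old_build[3:]  # 2 bases inserted before position 4

    old_pos = 5
    new_pos = shift_position(old_pos, insertion_at=4, insertion_length=2)

    # The same physical base now sits at a different coordinate, so any
    # name computed from the coordinate would change with the build.
    print(old_pos, old_build[old_pos - 1])
    print(new_pos, new_build[new_pos - 1])
```

Deletions would shift coordinates the other way, and translocations cannot be described by a single offset like this at all.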
-
The Following 2 Users Say Thank You to Cofgene For This Useful Post:
Earl Davis (09-15-2015), MJost (09-15-2015)
-
Registered Users
Thanks. I had not considered the position would change from build to build of the reference genome. Oh well.
Earl.
-
-
Registered Users