Automated SNP naming - a question

Earl Davis
09-14-2015, 10:16 PM
Apologies in advance if this has been discussed before.

Y chromosome snips seem to get named AFTER they are found in an individual, and often even then only when they have been found in more than one individual.

I am not really sure why SNP naming began in the first place. I have assumed it was as some sort of recollection and discussion aid. Everyone can easily recall what we mean by L21 rather than saying 15654428C>G.

We all know the existing problem with the growing number of synonyms for each SNP adding to confusion. A newer problem for me is the length of the names. DF27 was easy to remember, FGC17112 not so much, and FGC100001 cannot be far off. For people who get a little muddled at times, it's getting to the point where remembering 15654428C>G is not that much harder than remembering something like Z123454 that may be along sooner or later.

So I was wondering if there had been much or any discussion about naming all the snips right now rather than waiting until they are discovered in someone. That would allow a logical and simple naming pattern to be deployed that avoids duplication and synonyms and keeps the number of characters down to 5 or fewer, so that SNP names don't become so long they are forgettable.

My reasoning behind naming all snips in advance is thus. There are over 7 billion people in the world today and roughly 60 million base pairs on the Y chromosome. That suggests that most of those 60 million positions exist as snips in at least one of those 7 billion people, and those that don't are being quickly filled up by ongoing human reproduction. Perhaps one of them was filled whilst I was writing this.

So if most exist and those that don't now will exist at some point in the future why not just name everything now in advance?

I was thinking of something along the lines of...

60 million positions is nicely close to the 60 million+ permutations we can get with the 36 basic alphanumeric characters, so all snip names would be a maximum of 5 characters long.
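As a quick check of that arithmetic (a sketch, not part of the original proposal): with 36 characters, names of exactly 5 characters give 36^5 combinations, and allowing names of 1 to 5 characters gives slightly more.

```python
# 26 letters + 10 digits, as described in the post.
ALPHABET_SIZE = 36

# Names that are exactly 5 characters long.
exactly_five = ALPHABET_SIZE ** 5

# Names that are 1 to 5 characters long.
up_to_five = sum(ALPHABET_SIZE ** k for k in range(1, 6))

print(exactly_five)  # 60466176
print(up_to_five)    # 62193780
```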

So position 1 on the Y chromosome is named A, position 2 is named A1, position 3 is named A2 and so on until A0, the next position A11 and so on until A9999, then AA, then AA1 and so on.
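The exact ordering above is only loosely specified, so here is one possible reading as a minimal sketch: treat the name as the position written in base 36. The alphabet order (digits first, then letters) and the helper names position_to_name and name_to_position are assumptions for illustration, not the scheme from the post.

```python
import string

# Assumed symbol order: 0-9 followed by A-Z (36 symbols in total).
ALPHABET = string.digits + string.ascii_uppercase


def position_to_name(position: int) -> str:
    """Encode a 1-based chromosome position as a base-36 name."""
    if position < 1:
        raise ValueError("position must be >= 1")
    n = position - 1  # shift so that position 1 maps to "0"
    chars = []
    while True:
        n, rem = divmod(n, 36)
        chars.append(ALPHABET[rem])
        if n == 0:
            break
    return "".join(reversed(chars))


def name_to_position(name: str) -> int:
    """Decode a base-36 name back to its 1-based position."""
    return int(name, 36) + 1


print(position_to_name(1))         # 0
print(position_to_name(37))        # 10
print(position_to_name(60466176))  # ZZZZZ
```

With this reading, every position up to 36^5 gets a name of at most 5 characters, and the name can be decoded back to the position without a lookup table.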

Then when a new SNP is discovered in someone, they don't have to wait for it to be named, as it's all already taken care of. It does not later inherit a bunch of other names, and it is guaranteed to be short. The downside to limiting it to 5 characters is that you get some pretty unmemorable names like 1A2BB, but is that worse than saying and remembering YSC0000191 that sits above P312?

The other issue is that, obviously, different mutations can occur at the same position. Where this happens you could employ an _1, _2 system, or _G, _A, or something similar, and this is not much different from how recurrent naming is done now.

So in any SNP database you then have the position and mutation, the 5 character name, and some flags so you can exclude unreliable regions and currently unmapped regions from display as needed.
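A database row along those lines might look like the following sketch. The class name SnpRecord, its field names, and the displayable helper are all hypothetical, the _G style suffix is just one of the options mentioned above, and the second record is made-up example data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpRecord:
    position: int                    # position on the Y chromosome
    ref: str                         # reference base
    alt: str                         # mutated base
    short_name: str                  # pre-assigned name, 5 characters or fewer
    unreliable_region: bool = False  # flag: sits in an unreliable region
    unmapped: bool = False           # flag: region not currently mapped

    @property
    def full_name(self) -> str:
        """Short name plus an allele suffix, e.g. 1A2BB_G."""
        return f"{self.short_name}_{self.alt}"


def displayable(records):
    """Keep only records that are not flagged for exclusion."""
    return [r for r in records if not (r.unreliable_region or r.unmapped)]


# 15654428C>G is the example from the first post; 1A2BB is a placeholder name.
first = SnpRecord(15654428, "C", "G", "1A2BB")
second = SnpRecord(100, "A", "T", "0002S", unmapped=True)  # made-up record

print(first.full_name)                    # 1A2BB_G
print(len(displayable([first, second])))  # 1
```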

So, after all that rambling, is this something that has been discussed before, and is anyone aware of someone having already implemented a SNP naming system like this somewhere?


09-15-2015, 12:00 AM
The position cannot be considered constant over time. It may change from build to build of the reference genome. More specifically, any nomenclature system like the one you propose needs to handle insertions, deletions, and translocations. For palindromic regions, how would you track the translocations?

Earl Davis
09-15-2015, 07:40 AM
Thanks. I had not considered that the position would change from build to build of the reference genome. Oh well.


09-15-2015, 08:48 AM
One of several ideas I've had about this issue:

Develop a standardized format, i.e.:

TaxID=9606; // = Homo sapiens

This is the generic definition for rs2032658 (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2032658) (= M207/Page37/UTY2). A checker program would find the current reference build position for the variant (such as NC_000024.10:g.13470103G>A in HGVS (http://www.hgvs.org/mutnomen/recs-DNA.html) notation), and check to see if that mapping exists in its database.

If so, the ID corresponding to that mapping is returned.

If not, the record is assigned a 6-character ID, similar to Ysearch/Mitosearch IDs or shortened URLs.

Why 6 characters? Use a set of 32 digits {ABCDEFGHJKLMNPQRSTUVWXYZ23456789} (0, 1, I, O omitted). 32^6 = 1,073,741,824 possible combinations, enough for all possible locations (say 25 million known possible positions, with room for multiple mutation states per position, plus collision space, plus any additional desired formatting constraints such as alpha prefixes). 5 characters (32^5 = 33,554,432) wouldn't be quite enough.
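A minimal sketch of that ID space, assuming the alphabet is A-Z and 0-9 with 0, 1, I and O removed (32 symbols). Drawing IDs at random and retrying on collision is my assumption; the post does not say how IDs would be assigned, and new_id is a hypothetical helper.

```python
import secrets

# 24 letters (no I, O) + 8 digits (no 0, 1) = 32 symbols.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
ID_LENGTH = 6

capacity_6 = len(ALPHABET) ** 6  # 1,073,741,824
capacity_5 = len(ALPHABET) ** 5  # 33,554,432


def new_id(existing: set) -> str:
    """Draw random 6-character IDs until one is unused, then reserve it."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))
        if candidate not in existing:
            existing.add(candidate)
            return candidate


used = set()
example_id = new_id(used)
print(len(example_id))  # 6
```

With about a billion possible IDs and tens of millions of variants, random draws would rarely collide, which is one way to read the "collision space" mentioned above.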

The 6-character ID is then mapped to the database entry containing the generic description and build position. Any lab-specific IDs (such as "M207") are also mapped to this ID.

You now have an independent variant registry system that links variants with build locations, with generic descriptions, with lab IDs, and with any other meta-data you need.

Note that translocations (i.e. palindromic markers such as ZZ12) may be rejected by the checker as they wouldn't map to a unique position.
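The registry flow described above could be sketched roughly as follows. Everything here is hypothetical: the class VariantRegistry, its methods, and the choice to hand out IDs sequentially are illustration only, and the step that finds the current build position for a variant is assumed to happen elsewhere and to pass its results in as a list of HGVS strings.

```python
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"  # 32 symbols, no 0, 1, I, O


def encode_id(n: int, length: int = 6) -> str:
    """Write n as a fixed-length string over the 32-symbol alphabet."""
    chars = []
    for _ in range(length):
        n, rem = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[rem])
    if n:
        raise OverflowError("ID space exhausted")
    return "".join(reversed(chars))


class VariantRegistry:
    def __init__(self):
        self._id_by_mapping = {}  # HGVS string -> registry ID
        self._id_by_alias = {}    # lab-specific name -> registry ID

    def register(self, mappings, aliases=()):
        """Return the ID for a variant, assigning a new one if needed.

        mappings is the list of build positions the checker found.
        """
        if len(mappings) != 1:
            # e.g. palindromic markers that do not map to a unique position
            raise ValueError("variant does not map to a unique position")
        hgvs = mappings[0]
        variant_id = self._id_by_mapping.get(hgvs)
        if variant_id is None:
            variant_id = encode_id(len(self._id_by_mapping))
            self._id_by_mapping[hgvs] = variant_id
        for alias in aliases:
            self._id_by_alias[alias] = variant_id
        return variant_id

    def lookup_alias(self, alias):
        return self._id_by_alias.get(alias)


registry = VariantRegistry()
m207 = registry.register(
    ["NC_000024.10:g.13470103G>A"], aliases=("M207", "Page37", "UTY2")
)
print(m207)                            # AAAAAA
print(registry.lookup_alias("UTY2"))   # AAAAAA
```

Registering the same mapping a second time returns the existing ID, so synonyms from different labs end up attached to one record rather than creating new ones.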