Are all these Y tests reliable?



Rathna
01-03-2014, 09:32 AM
These answers from Thomas Krahn to Peter MacDonald on the R-L21 Yahoo group are extremely interesting. I hope the delay in my Chromo2 isn't due to my bad spit, but this could also clarify all the mistakes about M269 (and others) I spoke about:

"Peter MacDonald wrote:
Thomas (please correct me if I am mistaken or not accurate as I am hoping to understand the process at a basic level) from looking over the link to the comparison of platforms and your comments the following is true:
1. The max read length of the MiSeq is anywhere from double to triple that of the various HiSeq protocols.
Correct. Illumina is trying to increase the read length of the HiSeq platforms, but they're not there yet. Ion Torrent, however, already has sequencing kits for 400 bases and they're already working on 500 base reads.
2. The Paired-End Reads (Max.) and Single-End Reads (Max.) of the HiSeq protocols are far greater than those of the MiSeq, however these facts are not relevant in getting high quality SNP data for genealogical purposes (as there is a higher chance for randomness).
No. The greater throughput of the HiSeq is of course important, but note that the same sequencing run will be shared with multiple samples. While it still makes sense to run one single sample on a MiSeq, it's not economically feasible to run a single BigY sample on the HiSeq. Therefore you must mix the DNA libraries from multiple persons on a single run. In order to sort the sequencing traces out to the individual samples, there is a technology called "barcoding" where synthetic DNA molecules with a characteristic key code in their sequence are attached to each sequencing read. According to this key code it is possible to assign the sequences to the individual samples after the sequencing run is finished.
The problem with mixing samples on a single run is finding the correct balance between the different samples. Not everybody swabs his cheeks with the same intensity and some samples are older or more degraded than others. Therefore a very strong sample may take away the majority of the reads from the other weaker samples, and even if you try your best to balance the DNA concentrations you still need a lot more reserve so that the weak samples still get enough (and equal) coverage through their whole stretch of Y-DNA.
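The barcoding and balance points above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the 4-base barcodes, the sample names, and the idea that the barcode sits at the start of each read (real platforms and kits handle the index sequence in their own way).

```python
# Toy demultiplexing sketch: assign pooled reads to samples by barcode.
# BARCODES, BARCODE_LEN and the sample names are hypothetical.
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}
BARCODE_LEN = 4


def demultiplex(reads):
    """Group reads by their leading barcode and strip the barcode off."""
    by_sample = {}
    for read in reads:
        sample = BARCODES.get(read[:BARCODE_LEN], "unassigned")
        by_sample.setdefault(sample, []).append(read[BARCODE_LEN:])
    return by_sample


# A pooled run in which the "strong" sample_A contributes most of the reads.
pooled_reads = [
    "ACGTGGGATC",
    "ACGTTTAGGC",
    "ACGTCCATGA",
    "TGCAGGGATC",
    "NNNNGATTAC",  # unreadable barcode, cannot be assigned
]

groups = demultiplex(pooled_reads)
read_counts = {sample: len(reads) for sample, reads in groups.items()}
print(read_counts)
```

In this made-up run sample_A receives three reads and sample_B only one, which is the imbalance described above: the weaker sample ends up with less coverage unless the run has enough reserve.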
The random false matching is only a result of short reads. A shorter stretch of DNA sequence is more likely than a long one to map to two regions of the genome. Note that a large region of the Yp arm is almost identical with the X chromosome (and I'm not only talking about the pseudoautosomal region, which is by definition identical with its counterpart on the X). There are stretches of DNA where you can be lucky to find one base difference every 500 bases between the X and the Y. Of course a 100 base read can then be mapped to either location. Those regions may not even be properly mapped with any NextGen sequencing technology, even with 300 base reads. Even Sanger sequencing is challenging for this, but we have invested quite a lot of research and experience to resolve these regions with clever primer design. In general, long high quality reads are the best and most informative for any sequencing technology.
Paired-end sequencing is a technology that somewhat improves this mapping problem by separating the forward and reverse read with a mostly constant spacer in between. Because you know that your forward and reverse reads are separated by a roughly constant distance (e.g. ~500 bases), it is easier to make de-novo assemblies of unknown organisms (where you don't have a reference sequence). Paired-end sequencing is a bit pointless when you have a pretty good reference sequence; that's why nobody will invest money in preparing libraries with distinct spacing for human sequencing anymore (except if they want to prove an unusual chromosomal arrangement). Some labs may still talk about "paired end" reads because they will do a forward and a reverse read on the same segment. However, this is not a real paired-end read if the forward and reverse sequences mostly overlap (because this doesn't improve the mapping).
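The effect of read length on ambiguous mapping can be shown with a small simulation. This is only a sketch under stated assumptions: the two 1000-base regions are random toy sequences (not real X or Y sequence), they differ at exactly two positions (one per 500 bases), and "mapping" means a plain exact substring match rather than what a real aligner does.

```python
import random

rng = random.Random(0)  # fixed seed so the toy sequences are reproducible

# Hypothetical X region: 1000 random bases.
x_region = "".join(rng.choices("ACGT", k=1000))

# Hypothetical Y region: identical except for one base every 500 bases.
SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}
DIFF_POSITIONS = (250, 750)
y_bases = list(x_region)
for pos in DIFF_POSITIONS:
    y_bases[pos] = SWAP[y_bases[pos]]
y_region = "".join(y_bases)

regions = {"X": x_region, "Y": y_region}


def candidate_locations(read, regions):
    """Return the names of all regions that contain the read exactly."""
    return [name for name, seq in regions.items() if read in seq]


# All three reads truly come from the Y region, starting at position 300.
for length in (100, 300, 500):
    read = y_region[300:300 + length]
    print(length, candidate_locations(read, regions))
```

The 100 base and 300 base reads fall between the two differing positions, so they match both regions and cannot be placed. Only the 500 base read spans a differing base (position 750) and matches the Y region alone.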
Question: what role does "Bases above Q30" play within SNP testing (mainly pertaining to the Big Y Test)?
This is a measurement of the quality of each base in a sequencing trace. Explaining the details about this will go beyond the scope of this Yahoo group, but it can be easily found on the web. Essentially you want all the bases that you use for the evaluation of a SNP to be above Q30. But be careful: different sequencing technologies use different definitions of their Q score, therefore you can't compare a Q score from a MiSeq with a Q score from a Sanger sequencer.
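For readers who want the arithmetic, here is a minimal sketch using the common Phred convention, Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. Under that convention Q30 means 1 error in 1000. The quality string and the offset of 33 (the usual FASTQ text encoding) are assumptions for illustration, and as noted above each platform calibrates its scores in its own way.

```python
def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def fraction_at_or_above(quality_string, threshold=30, offset=33):
    """Fraction of bases whose decoded score is >= threshold.

    Assumes each character encodes one score as chr(score + offset).
    """
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up five-base quality string: scores 40, 40, 20, 30, 2.
example_quality = "II5?#"

print(phred_to_error_prob(30))
print(fraction_at_or_above(example_quality))
```

Three of the five made-up bases reach Q30, so the fraction is 0.6. The function counts scores at or above the threshold; use a strict comparison if "above Q30" should exclude exactly 30.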
I hope this helps,
Thomas".