Long Read Chromium Whole Genome Product



FGC Corp
09-05-2018, 02:21 AM
The Long Read Chromium test yields phased data and includes structural variants. The short-read test (i.e. 30x whole genome) does not yield phased data. The long-read test is useful if you are doing special medical testing or want the best coverage of the Y chromosome for ancestry. Approximate Y coverage:
20 Mb Chromium
14 Mb Y Elite
9.3 Mb Big Y

See article:
"Standard short read sequencing provides accurate base level sequence to provide short range information, but struggles to provide long range information. This means that standard sequencing and analysis approaches typically do well at calling single nucleotide variants (SNVs) but fail to robustly identify the full spectrum of structural variation seen in an individual genome. A novel data type known as Linked-Reads utilizes molecular barcodes to tag reads that come from the same long DNA fragment.



Linked-Reads provide the long range information missing from standard approaches. By adding a unique barcode to every short read generated from an individual molecule, you can link the short reads together."

https://community.10xgenomics.com/t5/10x-Blog/A-basic-introduction-to-linked-reads/ba-p/95

FGC Corp
09-05-2018, 03:32 AM
From our analyst, Greg Magoon:


The raw reads from the Chromium sequencing are still only 2x151 bp (Illumina platform), but the barcoding used by the approach enables linking reads from much larger parent DNA fragments, around say 40 kbp. So the linked reads (sometimes called "read clouds") enable significant improvements to mapping, phasing, and structural variant identification. One of the biggest advantages of this approach over current long-read platforms (e.g. Oxford Nanopore and Pacific Biosciences) is that it doesn't suffer from the high error rate associated with those technologies.
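
To make the "read cloud" idea concrete, here is a minimal sketch in Python. It is not the 10x Long Ranger algorithm: the barcodes, positions, and the 50 kbp `max_gap` cutoff are made-up assumptions for illustration. It only shows the principle that short reads sharing a barcode and lying near each other can be treated as coming from one long parent fragment.

```python
from collections import defaultdict

def group_read_clouds(reads, max_gap=50_000):
    """Group (barcode, position) pairs into clouds.

    Reads join the same cloud when they share a barcode and each read lies
    within max_gap bases of the previous one. max_gap is an illustrative
    assumption, not a value taken from the 10x software.
    """
    by_barcode = defaultdict(list)
    for barcode, pos in reads:
        by_barcode[barcode].append(pos)

    clouds = []
    for barcode, positions in by_barcode.items():
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] > max_gap:
                # Too far apart: probably a different parent fragment.
                clouds.append((barcode, current))
                current = [pos]
            else:
                current.append(pos)
        clouds.append((barcode, current))
    return clouds

# Hypothetical reads: (barcode, mapped start position)
reads = [("AAAC", 1_000), ("AAAC", 12_000), ("GGTT", 5_000),
         ("AAAC", 38_000), ("AAAC", 900_000)]
clouds = group_read_clouds(reads)
for barcode, positions in clouds:
    print(barcode, positions)
```

The distance cutoff matters because the same barcode can be shared by more than one parent fragment, so barcode identity alone is not enough to link reads.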

FGC Corp
09-13-2018, 09:30 PM
https://www.10xgenomics.com/solutions/genome/?gclid=EAIaIQobChMI3K342fO43QIVyB6GCh1_rAemEAAYBCAAEgJbIvD_BwE


Call and phase major classes of structural variants (SVs) like deletions, inversions, and translocations, even in genes inaccessible to short-read sequencers
Phase SNVs, indels and SVs across >10 Mb haplotype blocks

From the manufacturer, 10x Genomics

FGC Corp
09-13-2018, 09:35 PM
[Attachment 25892]

Long read statistics vs 30x whole genome

FGC Corp
09-14-2018, 08:50 PM
Another long read Chromium result:

Y coverage for a Long Read Chromium
20,578,879 loci

In comparison:
Y Elite: 14,000,000 loci
Big Y: 9,300,000 loci

pmokeefe
12-19-2018, 02:27 AM
I ordered my FGC Long Read Chromium test in March 2018 and received the results at the end of July. My test results are now available to the public. This Google Drive folder contains test results from various labs; the FGC Long Read results are in subfolder 9LSKM.
https://drive.google.com/drive/folders/1_x7ZtSenJNUyb9nsq0hcNbyizx1a39F3?usp=sharing
Included are the VCF and FASTQ files; at the moment there is a BAM file for the Y chromosome only. Please feel free to take a look at those if you're considering taking the test or are waiting for your results and are wondering what you might get back - or for any other purpose.

MacUalraig
12-19-2018, 08:50 AM
Thanks, are you able to offer a comparison between Y variant detection on this compared with any prior tests either with FGC or elsewhere? Anything found eg new types of variant that would require the LR technology to detect? Extra SNPs or STRs that other tests didn't read?

JamesKane
12-20-2018, 03:24 AM
9LKSM is the same donor as 15001710504417A on haplogroup-r.org. The much longer kit# is Dante Labs 30x WGS. The tree entries show 7 unshared mutations in each test, which I haven't had time to chase down to see whether they are just alignment issues in the Dante test. The close proximity of half of them strongly suggests that, though.

Many of his FLR-named variants have been seen in other non-Big Y tests, but I'm also sure my tool chain scrapped quite a few in 9LKSM and the 5 other Long Read tests. For RAxML to play well with data sets, all the locations not sampled in at least 90% of the cohort with at least 13 million base pairs called must be excluded. On the positive side, it's hinting at a new layer of branching at FGC40956 under R-Z16252 that the O'Keefes share with the O'Moriarty and O'Donoghue lines from southern Ireland.

pmokeefe
12-20-2018, 04:35 AM
Thanks, are you able to offer a comparison between Y variant detection on this compared with any prior tests either with FGC or elsewhere? Anything found eg new types of variant that would require the LR technology to detect? Extra SNPs or STRs that other tests didn't read?
I currently don't have much information beyond what James Kane just provided. The kit number for my Dante Labs test on Yfull is YF14620, but I have not submitted my FGC Long Read results there. I believe FGC can also provide Y analysis, but I haven't received those yet (due to circumstances, not at all their fault). I just requested that from FGC and will post those when I receive them.

FGC Corp
01-01-2019, 09:08 PM
At 39 minutes, 57 seconds there is a discussion of FGC's long read test in this presentation:
https://www.youtube.com/watch?v=SJnI8I9WitE&t=2814s

There is one point that is worth mentioning. The test yields up to 40% more data than Y Elite, sometimes more. Also, the yield will improve even more with new analytical approaches.

Petr
01-03-2019, 02:03 AM
How do you calculate this? My Elite 1.0 has length coverage 89.28% and Long Read test 89.09%? And the median depth coverage is 33x for Elite 1.0 and 7x for LR?

JamesKane
01-03-2019, 12:28 PM
I believe the 40% more figure is coming from the callable loci statistic. You can indeed see a marked improvement over tests without the linked-read technology. Most of the men with Chromium results in my chart on the Haplogroup-R Kits page (https://haplogroup-r.org/kits.html) also have a Y-Elite or WGS.

9T7D6 is the same donor as huA2692E. He gained 36% more callable loci.
9LKSM is the same donor as 15001710504417A. He gained 30% more callable loci.

Of course there isn't always a gain...

RWTZZ is the same donor as 2TA5B. He actually achieved a 2% smaller callable loci score than with the Y Elite 2. That's somewhat misleading, because 99.2% of the known chromosome does have reads in the long read test; the Y Elite had only 82.9%. The real issue appears to be that a larger percentage of his positions than the other men's averaged fewer than 4 reads, possibly due to sample quality, or maybe the software doing the alignments isn't 100% there yet.

These are all very interesting tests. I'd be considering one myself were I not waiting to see what develops on the nanopore front over the next couple years.

dtvmcdonald
01-03-2019, 06:17 PM
Just looking at the data, my Y in the FGC LR run has about 22X coverage. They claimed 19,877,589 callable loci.

They found something like 30 new SNPs (which they called FGCLR5xx) that were not found in BigY. HOWEVER ... I had previously found many of those in the BigY BAM file. Some SNPs I previously found in BigY were not found by them, because of low read count.

The LR software is being very conservative about the number of good reads needed to get a call, and overly enthusiastic in using the linked-read data to raise mapping quality, especially in difficult areas. Linked reads cannot overcome the fact that the actual reads are still short in repetitive areas.

So far in my investigations, only PacBio is a cure for long repetitive areas ... and with the older technology used in the data I compared my FGC-LR to, the error rate was a big problem. They now claim vastly improved base-calling accuracy.

Petr
01-03-2019, 09:15 PM
My kits show for chrY:

Kit               Callable    Low Coverage   Poor Alignment   Y-DNA %
FGC Elite 1.0     14268906    394594         8918643          99.8 %
FGC WGS 15x       11379880    9198175        2608248          98.7 %
FGC LR Chromium   16428317    5821364        1283413          99.7 %
Dante WGS 30x     14214830    907373         8474661          99.9 %

So LR Chromium has 15 % more callable loci (with 4 or more reads) than Elite 1.0 or Dante WGS 30x.
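
For anyone who wants to reproduce the percentage, a small Python sketch of the arithmetic follows. The callable-loci counts are the ones posted above; the function and variable names are just illustrative.

```python
# Callable loci on chrY, as posted in the table above.
callable_loci = {
    "FGC Elite 1.0": 14_268_906,
    "FGC WGS 15x": 11_379_880,
    "FGC LR Chromium": 16_428_317,
    "Dante WGS 30x": 14_214_830,
}

def percent_gain(new, old):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old

lr = callable_loci["FGC LR Chromium"]
gains = {
    kit: percent_gain(lr, count)
    for kit, count in callable_loci.items()
    if kit != "FGC LR Chromium"
}
for kit, gain in gains.items():
    print(f"LR Chromium vs {kit}: {gain:+.1f}%")
```

The same function applied to a 20 million vs 14 million comparison gives roughly 43%, which shows how much the headline figure depends on which kit's counts are used.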

FGC Corp
01-03-2019, 10:24 PM
That's not accurate. Long read data yields up to 20 million callable loci vs 14 million for Y Elite. James Kane can validate that statement.

It is true that saliva samples vary in the quality of DNA. However, that depends on the quality of the DNA in the sample provided by the customer (as in this example), rather than the potential of the technology. Blood samples are the best.

FGC Corp
01-03-2019, 10:31 PM
Just looking at the data, my Y in the FGC LR run has about 22X coverage. They claimed
19,877,589 callable loci.

They found something like 30 new SNPs (that they called FGCLR5xx) that were
not found in BigY. HOWEVER ... I had previously found many of those in the BigY bam file.
Some SNPs I previously found in BigY were not found by them, because of low read count.

Big Y is obsolete. It was an obsolete approach when it was introduced. Many people have switched to WGS or Y Elite and don't even consider the Big Y for their projects.

Big Y was designed for profit margin considerations (i.e. lower coverage yields a better profit margin, because the design is cheaper). You can ask the designer, Thomas Krahn, to confirm that.


RWTZZ is the same donor as 2TA5B. He actually achieved a 2% smaller callable loci score than the Y Elite 2. That's somewhat misleading because the 99.2% of the known chromosome does have reads in the long read test. The Y Elite only had 82.9%. The real issue appears to be he just had less than 4 reads average in a larger percentage than the other men. Possibly due to sample quality

RWTZZ was a lower quality sample. That accounts for the coverage difference. 9LSKM was a higher quality sample (less bacterial contamination).


9T7D6 is the same donor as huA2692E. He gained 36% more callable loci.

9T7D6 was a blood sample. The data was very high quality.

Petr
01-03-2019, 10:53 PM
That's not accurate. Long read data yields up to 20 million callable loci vs 14 million for Y Elite. James Kane can validate that statement.

I already posted the BAM files to him. If you check the table at https://haplogroup-r.org/kits.html you can see only 13,865,029 callable loci for one Chromium LR test (kit RWTZZ). Probably not every Chromium LR test is the same.

FGC Corp
01-03-2019, 10:54 PM
That's because the DNA sample provided by the customer was low quality. The quality of the DNA in a saliva sample is, unfortunately, beyond our control.

However, the customer did have some new findings in his results and improved his position in ytree.net.


My present block has changed to FGC81931 from BY41286 thks to u and even though I am quite alone may be some day new genetic cousins will appear.


I feel vy happy with the results got. And if I should have to recommend anyone to take the best possible Y dna test the answer is clear. What's a pity is that u are a few years ahead of ur competitors as the tools / approach they are using to do the analysis are not the right ones as u already pointed in one of ur previous e-mails.

Overall, he was quite happy with his results, so the inference that he was not is misleading and not true.

FGC Corp
01-03-2019, 11:07 PM
Just looking at the data, my Y in the FGC LR run has about 22X coverage. They claimed
19,877,589 callable loci.

They found something like 30 new SNPs (that they called FGCLR5xx) that were
not found in BigY. HOWEVER ... I had previously found many of those in the BigY bam file.
Some SNPs I previously found in BigY were not found by them, because of low read count.

The LR software is being very conservative about the number of good reads needed to get a call,
and being quite overly enthusiastic in using the linked read data to raise mapping quality, especially
in difficult areas. The idea of linked read cannot overcome using short actual reads in repetitive areas.

I found in my investigations that so far only Pacbio is a cure for long repetitive areas ... and with the older technology
used in the data I compared my FGC-LR to, the error rate was a big problem. They now claim vastly improved
base calling accuracy.

We found a deletion in J1 that could not be found in Big Y or Y Elite. So, the potential for new discovery is there as the analysis gets better.

For the time being, the other 3rd party companies don't have capability to do this type of analysis, so their feedback is not correct.

Doing the analysis properly requires a computer with at least 128GB of RAM, as well as the use of the appropriate 10x Chromium software (Long Ranger and Loupe). There's also an opportunity to refine the analysis with other custom approaches.

For example, other groups aren't considering indels or structural variants in their Y trees. That yields an incomplete result which doesn't incorporate useful findings.

FGC Corp
01-03-2019, 11:22 PM
How do you calculate this? My Elite 1.0 has length coverage 89.28% and Long Read test 89.09%? And the median depth coverage is 33x for Elite 1.0 and 7x for LR?


My Elite 1.0 has length coverage 89.28%

That's also misleading because that group is including regions that aren't useful for phylogeny. So, the coverage of the useful regions, in our view is higher than 89.28%.

People can disagree about the criteria for SNP calling but that group largely excludes a variety of useful data that is used by Alex Williamson's group. The automatic assumption that their analysis is the best is unsupported by the facts. The automatic assumption that this group's assessment of Y chromosome coverage is the most accurate is just not correct. Criticism can lead to improvement, and we've made suggestions to a variety of groups.

We've also been working on improvements to our analysis. There is significant room for improvement and we're doing R&D to improve the analysis of the LR data.

FGC Corp
01-03-2019, 11:35 PM
Just looking at the data, my Y in the FGC LR run has about 22X coverage. They claimed
19,877,589 callable loci.

In the past, a number of analysts have viewed your approaches as too conservative and that your approach misses a lot of useful data. Of course, a more conservative approach to the analysis has been adopted by a number of people.

To each his own.

FGC Corp
01-03-2019, 11:50 PM
My overall assessment here is that there are a number of different approaches to the analysis, which include software approaches that we haven't used yet.

Accordingly, the yield should improve over our previous analysis.

In fairness, then, it is worth noting that FGC believes that our approach can also be improved and we are working on the R&D to make that happen.

pmokeefe
01-04-2019, 09:33 PM
I just uploaded FGC's analysis of my Long Read test (vcf files, Y chromosome variants etc.) in the following Google Drive folder: https://drive.google.com/drive/folders/1ex5S-HdILlDiOVNdjOUdD7rRl0Rye6HP?usp=sharing
For comparison, there are other test results (Ancestry, 23andMe, DanteLabs WGS) in this Google Drive Folder:
https://drive.google.com/drive/folders/1_x7ZtSenJNUyb9nsq0hcNbyizx1a39F3?usp=sharing
Please feel free to use and share those results as you see fit. I hope it might be of interest to those considering an FGC Long Read test.

Petr
01-15-2019, 01:48 PM
That's not accurate. Long read data yields up to 20 million callable loci vs 14 million for Y Elite. James Kane can validate that statement.

It is true that saliva samples vary in the quality of DNA. However, that depends on the quality of the DNA in the sample provided by the customer (as in this example), rather than the potential of the technology. Blood samples are the best.

Screenshot from https://haplogroup-r.org/kits.html

http://i68.tinypic.com/207p6qe.png

pmokeefe
01-18-2019, 08:20 PM
Both these kits are from the same person (me).

[Attachment 28484]

FGC Corp
01-19-2019, 03:46 AM
Your sample was a high quality DNA sample.

Magovalle
01-29-2019, 11:45 PM
Hi there, James. We are gonna try again with a re-run of the LR Chromium test. The sample was degraded due to bacterial contamination. I will let u know after a few months. Will be in touch

JamesKane
02-16-2019, 12:34 AM
In case anyone is wondering how this test compares with FGC's other products in a more visual manner, I have created histograms:

Chromium LR:
https://haplogroup-r.org/data/histograms/FGC-Chromium%20LR.png

FGC's 30x WGS:
https://haplogroup-r.org/data/histograms/FGC-30x%20WGS.png

Y Elite 2:
https://haplogroup-r.org/data/histograms/FGC-Y%20Elite%202.png

Regions in Green correspond with Callable regions in the BED file. Regions in Red represent Poor Mapping Quality Regions.

I hot-linked the images, so they should update as new samples are collected.

FGC Corp
02-16-2019, 06:56 PM
I'm assuming that's derived from the averages found in the haplogroup-r statistics table?

As you know, the average is skewed by a few low quality DNA samples (one of which is being re-run).

JamesKane
02-16-2019, 07:35 PM
Correct, the outlier that's being rerun is represented here still. The next update run will exclude him until the new result is back. At the moment I'm working on incorporating scale and landmarks, so folks know the white gaps are limitations of the reference sequence not the tests.

Edit to add:

You can also drill into individual results on the haplogroup-r kits page. Click on the test type links to see the individual histogram. The depth bars represent the average coverage over 3,000 bases in that context.
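
A rough Python sketch of that kind of binning, for anyone who wants to build similar plots from their own per-base depth values. This is not the code behind the site's histograms; the input list and function name are hypothetical, and only the 3,000-base window comes from the description above.

```python
def window_means(depths, window=3000):
    """Average per-base depth over consecutive, non-overlapping windows.

    The final window may be shorter than `window` and is averaged over
    the bases it actually contains.
    """
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means

# Hypothetical per-base depths: two full windows and one partial window.
depths = [10] * 3000 + [2] * 3000 + [30] * 1500
print(window_means(depths))
```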

MacUalraig
02-17-2019, 02:40 PM
As I understand it the 'Callable Loci' is based on the arbitrary number of 4 reads min that the GATK people plucked out of the air - I have yet to see any science behind it. If there is a foundation for it can you show us the link to it please?

Nice pics though.

JamesKane
02-17-2019, 03:12 PM
It's not exactly pulled out of the air. Having four reads is a heuristic quality check based on the likelihood of all the reads being aligned incorrectly. The combined likelihood in the simplest case is 1/10 * 1/10 * 1/10 * 1/10 = 0.0001, i.e. a 0.01% chance of all four reads being aligned incorrectly. For completeness you should also verify that the alleles agree in all the reads at the site, but that leaves the realm of quick quality checks and enters genotyping.
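
The arithmetic behind that heuristic can be sketched in a few lines of Python. It assumes, as the simplest case above does, that each read is misaligned independently with probability 1/10; real reads are not fully independent, so treat the numbers as a rough guide only.

```python
def prob_all_misaligned(n_reads, per_read_error=0.1):
    """Chance that every one of n_reads is misaligned.

    Assumes independent reads with the same per-read error probability
    (an idealisation used for illustration).
    """
    return per_read_error ** n_reads

for n in range(1, 5):
    p = prob_all_misaligned(n)
    print(f"{n} read(s): {p:.4f} ({p:.2%})")
```

Each extra read required cuts the chance of a position being supported only by misaligned reads by a factor of ten, at the cost of excluding more low-coverage positions.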

One can use fewer reads, if you are willing to accept the larger probability of admitting false positive/negative calls. The classic sensitivity versus specificity debate when it comes to experiment design.

This is also why they advocate joint genotyping for comparison of samples. You can use the reads below threshold in some of the samples when their matches have much better coverage in the vicinity.

pmokeefe
02-17-2019, 03:25 PM
De novo diploid genome assembly for genome-wide structural variant detection (https://www.biorxiv.org/content/10.1101/552430v1): authors from Stanford University. Nice to see more research based on the 10X Genomics platform.
Abstract

Structural variants (SVs) in a personal genome are important but, for all practical purposes, impossible to detect comprehensively by standard short-fragment sequencing.
...
Interestingly, we uncover 214 SVs that may have been maintained as polymorphisms in the human lineage since before our divergence from chimp. Overall, we show that de novo assembly of 10x linked-read data can achieve cost-effective SV detection for personal genomes.

FGC Corp
02-23-2019, 07:02 PM
There was a point raised about the variability in data quality. Unfortunately, in certain cases, a few customers may not have observed the proper saliva collection protocols (i.e. not eating beforehand, and rinsing the mouth). Not following them increases the amount of bacteria and decreases the number of Y chromosome reads.

neanderling
02-23-2019, 07:24 PM
What evidence do you have that the customers themselves are to blame in such instances, as opposed to that the test just fails for some people? For a test at this price, I suspect compliance with collection instructions is very high.

FGC Corp
02-23-2019, 07:26 PM
The mapping ratio (percentage of reads mapped to human reference) is low in cases of bacterial contamination. That's the evidence. The other cause is lower quality DNA in the saliva sample.

If the quality of the DNA sample is high, in my experience, we always get good results.
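
As a sketch of the mapping ratio described above (reads mapped to the human reference divided by total reads), in Python. The read counts are invented for illustration and no pass/fail threshold is implied.

```python
def mapping_ratio(mapped_reads, total_reads):
    """Fraction of reads that mapped to the human reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads

# Hypothetical read counts (mapped, total), for illustration only.
samples = {
    "sample_A": (950_000, 1_000_000),
    "sample_B": (600_000, 1_000_000),
}
ratios = {name: mapping_ratio(m, t) for name, (m, t) in samples.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of reads mapped")
```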

neanderling
02-23-2019, 08:00 PM
That does not, however, prove that customers did not follow the collection instructions.

FGC Corp
02-23-2019, 08:05 PM
The overall point here is that the quality of the DNA sample determines the quality of the result. We get high quality results from high quality DNA samples. It is true that the best approach is to use blood samples, however that is not always practical.

We don't have control over the quality of the sample that the customer submits, which is affected by bacterial contamination or the overall quality of the DNA in their saliva.

In a blood sample, for example, we can get fragment lengths of 60,000 bases. In saliva samples fragment lengths average about 25,000 bases. It is worth pointing out that we did achieve Callable Loci of 20,000,000 bases in both saliva (high quality) and blood samples.

9LSKM (mentioned earlier in the thread) is an example of a high quality saliva sample:

9LKSM 19,477,445 Callable Loci

Blood sample example:
9T7D6 21,066,571 Callable Loci

neanderling
02-23-2019, 08:14 PM
My point being that the customer may also not have control over the quality of the submitted sample to the extent that following the instructions may not assure a quality result.

MacUalraig
02-23-2019, 08:15 PM
There is a short article from DNA Genotek looking at the impact here:

https://blog.dnagenotek.com/blogdnagenotekcom/bid/96963/The-impact-of-bacterial-DNA-in-saliva-on-whole-genome-sequencing

(bit dated I know). Based on the figures it doesn't sound like it should be a showstopper - even their worst cases 40% contamination 'only' led to 9.9% reads unmapped and the average was 5% for comparison?

MacUalraig
02-23-2019, 08:17 PM
My point being that the customer may also not have control over the quality of the submitted sample to the extent that following the instructions may not assure a quality result.

I assume issues like general oral hygiene have an impact eg you might have a bit of gum disease/plaque/not flossed recently/chipped a filling and so on (not a dentist so I may be wrong).

FGC Corp
02-23-2019, 08:19 PM
There is a short article from DNA Genotek looking at the impact here:

https://blog.dnagenotek.com/blogdnagenotekcom/bid/96963/The-impact-of-bacterial-DNA-in-saliva-on-whole-genome-sequencing

(bit dated I know). Based on the figures it doesn't sound like it should be a showstopper - even their worst cases 40% contamination 'only' led to 9.9% reads unmapped and the average was 5% for comparison?

That omits cases that we've seen from our work.

FGC Corp
02-23-2019, 08:21 PM
My point being that the customer may also not have control over the quality of the submitted sample to the extent that following the instructions may not assure a quality result.

There's also another factor, which is that older men have "loss of Y chromosome." Aging yields loss of Y in DNA [in some cases of older men, not all]. That also reduces the quality of results in all of the Y tests (not just LR).

Without identifying the customer, that happened recently with a sample from a customer over age 60.

FGC Corp
02-23-2019, 08:37 PM
Incidentally, LOY (Loss of Chromosome Y) is associated with increased risk of cancer.

This is an example that demonstrates that the Y does have health implications.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5418310/


Recent discoveries have shown that harboring cells without the Y chromosome in the peripheral blood is associated with increased risk for all-cause mortality and disease such as different forms of cancer, Alzheimer’s disease, as well as other conditions in aging men. In the entire world, the life expectancy of men is shorter compared to women, a sex difference that has been known for centuries, but the underlying mechanism(s) are not well understood. As a male-specific genetic risk factor, an increased risk for pathology and mortality associated with mosaic loss of chromosome Y (LOY) in blood cells could help to explain that men on average live shorter lives compared to women. This review primarily focuses on observed associations between LOY in blood and various diseases in aging men. Other topics covered are known risk factors for LOY, methods to detect LOY, and a discussion regarding mechanisms such as immunosurveillance, that could possibly explain how an acquired mutation in blood cells can be associated with disease processes in other organs.

FGC Corp
02-24-2019, 02:23 AM
Another way to look at the data:

Average Callable Loci: (top 80% of the results, excluding bottom 20%)
17,249,165

Average Callable Loci: (top 90%, excluding bottom 10%)
16,901,529

Also:

The best (top 10% of the distribution):
Average: 19,823,723 Callable Loci
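
For readers who want to apply the same kind of summary to their own list of kits, here is a minimal Python sketch of a "top X%" average. The input values are callable-loci figures quoted by posters earlier in this thread, not FGC's full data set, so the output will not match the averages above.

```python
def top_fraction_mean(values, keep=0.8):
    """Mean of the highest `keep` fraction of values (at least one value)."""
    ranked = sorted(values, reverse=True)
    n_keep = max(1, round(len(ranked) * keep))
    top = ranked[:n_keep]
    return sum(top) / len(top)

# Figures quoted in this thread (illustrative sample only).
thread_values = [21_066_571, 20_578_879, 19_877_589,
                 19_477_445, 16_428_317, 13_865_029]

print(f"all kits: {top_fraction_mean(thread_values, keep=1.0):,.0f}")
print(f"top 80%:  {top_fraction_mean(thread_values, keep=0.8):,.0f}")
```

Dropping the bottom of the distribution always raises the average, so figures like these are best read next to the full range rather than instead of it.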

FGC Corp
02-24-2019, 04:54 AM
The main question for most customers is the SNP payoff in the LR Chromium data. Even in the lower quality LR samples, there is going to be an opportunity for greater SNP discovery because the greater fragment length should increase SNP discovery even at low coverage depth.

Plus, imputation yields should be better.
We shall see.

FGC Corp
02-24-2019, 08:52 PM
Example of QC protocol:
#1 passed, #2 did not.
High intensity band was 20,000bp plus on #1:

[Attachment 29043]

FGC Corp
02-25-2019, 03:26 AM
We are investigating a technique to purify the highest size DNA, which would improve the results.

Petr
02-25-2019, 08:22 PM
There was a point raised about the variability in data quality. Unfortunately, in certain cases, a few customers may not have observed the proper saliva collection protocols (i.e. not eating and washing the mouth). That increases the amount of bacteria and decreases the number of Y chromosome reads.

So you think that my 16 449 147 callable loci (the lowest decile) are caused by not observing the proper saliva collection protocols? I believe I observed everything really carefully. And I'm not over 60. So what may be the reason?

I remember that in the past the first step was to determine the sample quality, and if it was not sufficient, you required a new saliva sample and repeated the test. Is this no longer your standard?

FGC Corp
02-25-2019, 09:17 PM
The LR Chromium requires higher quality DNA than the Y Elite test. That's why new procedures may be needed for new sample collection.

FGC Corp
02-26-2019, 02:20 AM
There's also an opportunity for additional SNP and indel discovery by manual analysis. We found a number of new SNPs and indels in a long read kit using that approach.

Petr
02-26-2019, 07:04 PM
Does this "Long Read" mean that it should be possible to determine all Y-STRs, including 232 bp long Y-GGAAT-1B07 (ChrY:10687489..10687720)?

And what about the MAOA "Warrior Gene"? It is defined as number of repeats on chromosome X - https://www.familytreedna.com/landing/warrior-gene.aspx ? Since proxy via rs6323 and rs6909525 does not work, it could be interesting to determine it this way.

And can the 32 bp long deletion (for CCR5) be detected? https://www.familytreedna.com/learn/test-types/ccr5-test-genealogical-test/
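
To put sizes on these questions, a small Python sketch comparing the features above with the 2x151 bp raw read length mentioned earlier in the thread. It only does the arithmetic; whether the linked-read barcoding helps with a repeat longer than a single read is exactly the open question, and the code does not answer it.

```python
def locus_length(start, end):
    """Length in bases of a closed (inclusive) coordinate interval."""
    return end - start + 1

READ_LENGTH = 151  # raw read length quoted earlier in the thread (2x151 bp)

# Coordinates of Y-GGAAT-1B07 as given above.
str_len = locus_length(10_687_489, 10_687_720)
ccr5_deletion = 32  # bp, as stated above

print(f"Y-GGAAT-1B07: {str_len} bp, longer than one read: {str_len > READ_LENGTH}")
print(f"CCR5 deletion: {ccr5_deletion} bp, shorter than one read: {ccr5_deletion < READ_LENGTH}")
```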

FGC Corp
02-26-2019, 09:31 PM
We could investigate.

FGC Corp
02-27-2019, 10:42 PM
It's worth pointing out that, in my view, certain reviews reflect a status-quo bias. Without pioneering and pilot projects, the community would wait for years before new technology becomes available (just as other companies wait 5-6 years to upgrade their offerings).

All the other Y and WGS offerings use modified versions of 10-year-old short read next generation sequencing tech, for example. The Y chromosome coverage is essentially equivalent to the Y sequencing we did starting in the 2012 pilot test (of the standard offerings available at a few companies).

FGC Corp
02-28-2019, 01:15 AM
It's worth pointing out that, in my view, certain reviews reflect a status-quo bias. Without pioneering and pilot projects, the community would wait for years before new technology becomes available (just as other companies wait 5-6 years to upgrade their offerings).

All the other Y and WGS offerings use modified versions of 10-year-old short read next generation sequencing tech, for example. The Y chromosome coverage is essentially equivalent to the Y sequencing we did starting in the 2012 pilot test (of the standard offerings available at a few companies).

Otherwise, you can look forward to a closed loop in which the consumer doesn't know what they're missing, and what reports are omitted, and an unrealized potential in genetics that can be achieved by pilot programs.

MacUalraig
02-28-2019, 04:35 PM
I completely agree. The current fuss in another thread basically boils down to a rival having just (possibly) caught up with Y Elite 1.0.

I'd like to be on the next development. Whether it's Chromium, only time will tell.

JamesKane
02-28-2019, 06:06 PM
Just curious: would a targeted version of Chromium at a price point similar to the original Y Elite runs be possible? I saw a mention of targeted sequencing on the 10x Genomics site while skimming it the other day.

The autosomal data is nice to have, but I expect years will pass before we use it for much more than medical purposes.

FGC Corp
02-28-2019, 06:08 PM
Just curious: would a targeted version of Chromium at a price point similar to the original Y Elite runs be possible? I saw a mention of targeted sequencing on the 10x Genomics site while skimming it the other day.

The autosomal data is nice to have, but I expect years will pass before we use it for much more than medical purposes.

We've been investigating that option recently.

FGC Corp
02-28-2019, 08:15 PM
I completely agree. The current fuss in another thread basically boils down to a rival having just (possibly) caught up with Y Elite 1.0.

I'd like to be on the next development. Whether it's Chromium, only time will tell.

The best coverage with short read Y designs is with the 250bp Y design. However, the current costs don't justify the small additional payoff, in our view. There's a small variation with the short read designs (5%-7%) and not much room for improvement. That's why we've been working on the Chromium approach.

FGC Corp
03-01-2019, 08:45 PM
Another LR batch has undergone QC and is being sequenced.

FGC Corp
03-06-2019, 02:43 AM
For those who are interested, here is a QC example for one of the LR tests:

[Attachment 29196]

FGC Corp
03-08-2019, 07:16 AM
For those who are interested, here is a QC example for one of the LR tests:

[Attachment 29196]

Thus far, the QC is better than in the previous batch. We're using new equipment.

FGC Corp
03-22-2019, 02:59 AM
Results are being returned for the latest batch. Analysis is underway.

MacUalraig
04-16-2019, 01:03 PM
Hybrid assembly (non-human) using a combination of PacBio and Chromium 10X:

"We first generated contigs based on PacBio sequencing libraries, which were then merged
with linked-read 10x Chromium data followed by scaffolding using a BioNano optical genome map and a Hi-C
chromatin interaction map, complemented by a genetic linkage map."

[Attachment 29835]

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5642-0

MacUalraig
04-22-2019, 07:28 AM
Results are being returned for the latest batch. Analysis is underway.

Is there an update? You said it would take 'up to 30 days'.

FGC Corp
04-22-2019, 09:57 PM
Is there an update? You said it would take 'up to 30 days'.

I sent you an email. The analysis of the 10x Chromium data is particularly time consuming.

MacUalraig
04-30-2019, 06:51 AM
Big thank you to Greg, Leon and Justin for working hard to get my 10X reports ready for this morning as I head off to the 10X Genomics User Day at the Roslin Institute near Edinburgh!

https://www.eventbrite.com/e/10x-genomics-user-day-edinburgh-registration-59995283508

"Resolving Biology with 10x Genomics
Whether you want to dissect cell-type differences, investigate the adaptive immune system, or discover copy number variation and genomic heterogeneity on a cell-by-cell basis, the Chromium System from 10x Genomics is the answer. "

dtvmcdonald
05-02-2019, 08:04 PM
I got the autosomal data from my last 1st cousin yesterday and processed it to get what
is probably my final take on phased data from my LR results. I got about 90% of the
length of the euchromatic regions covered, the rest is simply not covered by any match that
I can coax into giving me raw data. The LR data proved invaluable, especially using the
last cousin, who unfortunately got his sample in one week too late to get MyHeritage's
old chip. There are still some random missing SNPs here and there, caused by blocks
unphased by LR and/or no cousin data for those small blocks.

One comment: the LR phasing regions are not very reliable. I also generated (usually smaller)
phase blocks using SNPs the LR indicated were unphased as breakpoints. It was clear
in many places that the cousin data is saying these break in correct places where the
LR breaks are clearly missing. On the other hand, the reverse is also true in places.
I had to trust the cousin data to decide which was better in each case. The results
show essentially perfect phasing for all the cousins as well as random matches (in
the phased regions of course) with very very very few misattributed SNPs.

MacUalraig
05-16-2019, 03:00 PM
Got the disk with my BAM and FASTQs today; the BAM is around 55.2 GB. If you view it in IGV, you can use its linked-read features (as long as you have a recent version) to display reads linked either by barcode (BX tag) or by molecule (MI tag). I'll report in more detail on my website once I've had a closer look.
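
For anyone who prefers to inspect the tags outside IGV, here is a minimal Python sketch that groups read names by tag value from SAM-format text (for example, the output of `samtools view`). It assumes the barcode is stored as a `BX:Z:` optional field and the molecule ID as `MI:i:`; the function name and the sample records are made up for illustration and are not real data.

```python
from collections import defaultdict

def group_reads_by_tag(sam_lines, tag="BX"):
    """Map each tag value (e.g. a barcode) to the list of read names carrying it."""
    groups = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):   # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        read_name = fields[0]
        # Optional fields start at column 12 and look like TAG:TYPE:VALUE
        for opt in fields[11:]:
            parts = opt.split(":", 2)
            if len(parts) == 3 and parts[0] == tag:
                groups[parts[2]].append(read_name)
                break
    return dict(groups)

# Hypothetical records (assumed layout, invented barcodes and positions)
example = [
    "@HD\tVN:1.6",
    "r1\t99\tchrY\t2781500\t60\t151M\t=\t2781800\t451\t*\t*\tBX:Z:AAACACCAGCGATATA-1\tMI:i:7",
    "r2\t99\tchrY\t2790000\t60\t151M\t=\t2790300\t451\t*\t*\tBX:Z:AAACACCAGCGATATA-1\tMI:i:7",
    "r3\t99\tchrY\t6000000\t60\t151M\t=\t6000300\t451\t*\t*\tBX:Z:TTTGTCATCTTGACGA-1\tMI:i:9",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",               # unmapped read without tags
]
by_barcode = group_reads_by_tag(example, "BX")
by_molecule = group_reads_by_tag(example, "MI")
```

Reads without the requested tag are simply left out, so the counts per barcode can be compared directly with what IGV shows when grouping by BX or MI.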

pmokeefe
06-05-2019, 06:16 PM
Aquila: diploid personal genome assembly and comprehensive variant detection based on linked reads (https://www.biorxiv.org/content/10.1101/660605v1)
Abstract
Variant discovery in personal, whole genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Aquila achieves contiguity for both haplotypes on a genome-wide scale, and its phasing nature guarantees a real haplotype-resolved assembly instead of a haploid consensus assembly. Over 98% of a human Aquila-assembled genome is diploid, facilitating detection of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.

...

Our method, Aquila, makes use of the reference human genome as a reliable scaffold, and
performs de novo local assembly in small chunks separately for each haplotype, yielding a truly
diploid whole genome sequence. It then discovers the most important types of variation on the
basis of pairwise alignment to the reference, and infers phasing for all types of assembled
variants through previous long-range phasing information. We tested its performance with six
libraries of 10x linked-reads data for NA12878 and NA24385 individuals. It offers excellent small
indel and SV detection at virtually no compromise for SNP detection, as well as highly accurate
phasing of the vast majority of heterozygous variants, at reasonable reagent and computational
costs.

pmokeefe
08-08-2019, 04:26 PM
Assessment of human diploid genome assembly with 10x Linked-Reads data (https://doi.org/10.1101/729608)

Lu Zhang, Xin Zhou, Ziming Weng, Arend Sidow

Abstract
Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries. Findings: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332X and 823X and assembly quality worsened if it increased to greater than 1,000X for a given C. Long DNA fragments could significantly extend phase blocks, but decreased contig contiguity. The optimal length-weighted fragment length (Mu_FL) was around 50 to 150kb. When broadly optimal parameters were used for library preparation and sequencing, ca. 80% of the genome was assembled in a diploid state. Conclusion: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.

Erik_Maher
11-29-2020, 12:23 PM
Hi, I ordered FGC Chromium linked read sequencing in Nov 2019, intending to submit my data to Harvard's Personal Genome Project and NIH's All Of Us Research Program, and maybe discover some new Y-SNPs as the icing on the cake. For optimal results, I chose to submit a blood specimen rather than cheek swab or saliva sample. After my doctor declined to participate in "experimental procedures" and wouldn't write me a prescription for the testing, and after several unsuccessful attempts at urgent care centers including BetterMed, MedExpress, Patient First, and Quest Diagnostics who all refused to draw blood for "non-medical" reasons, I finally found a cooperative company called Any Lab Test Now in Newtown PA and submitted my blood sample to FGC's lab in late Jan 2020. Then the Covid lockdowns hit just as my specimen passed QC.

I came home for Thanksgiving and while catching up with personal emails, noticed that my Chromium sequencing finally finished on 26 Oct 2020. I see there are two files; a large 5.2 GB ZIP file and a gigantic 81.4 GB BAM file. Looking in the SNP folder of the 5.2 GB file, there are two files that seem the most useful, called B3BNQ.haplogroupCompare.20201124.named.tab and B3BNQ.variantCompare.20191121.named.tab. I'm struggling a bit to interpret the results and was hoping someone could help me.

Firstly, I was curious if FGC discovered any novel "Long Read" SNPs when analyzing my kit so I searched the file for "FGCLR" and found 36 entries. All of them are listed after the Private and new-to-GRCh38 SNPs heading on page 54 of my 112-page haplogroupCompare file. Likewise, all of them are listed after the Private SNPs or new-to-GRCh38 (612 total variants; 93 "high reliability" variants) heading on page 1137 of my 1445-page variantCompare file. However, most of them seem to be rather low numbers (already discovered?):

From FGCLR24 = 7059 G to T, up to FGCLR40 = 35472 C to G; not including FGCLR31 and FGCLR32.
From FGCLR90 = 10928917 TC to TCC, up to FGCLR97 = 12666523 CTTTTTT to CTTTTT.
From FGCLR155 = 16060 G to A, up to FGCLR159 = 24880 A to C.
FGCLR206 = 11010357 TCCC to TCCCC.
FGCLR211 = 14529 T to A.
From FGCLR223 = 10962596 CAAA to CAAAA, up to FGCLR226 = 8292 T to G; not including FGCLR224.
FGCLR369 = 56683000 G to C.
FGCLR421 = 20269759 A to G.
FGCLR889, site of FT92274 = 11032415 GCCC to GCCCC.

Most of them are derived (+) for every near-L1336 kit. There are a few such as FGCLR223 that show some kits as ancestral (-). My foremost question is why are there gaps between the SNPs? If they are new to my kit, shouldn't they be in a continuous sequence? Are they new but not new to my kit; that is, only discovered a few weeks or months ago, still considered new to science, and I'm simply joining the prior discoverer as someone who's also derived for that SNP?
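
To see the gaps at a glance, the SNP names can be collapsed into runs of consecutive numbers. Below is a minimal Python sketch, assuming the names have already been pulled out of the .tab file as plain strings (the file layout is not described here, so the parsing step is left out). The helper name is hypothetical, and the sample list only reuses names quoted above.

```python
import re

def consecutive_runs(names, prefix="FGCLR"):
    """Collapse names like FGCLR24, FGCLR25, ... into (first, last) runs of consecutive numbers."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbers = set()
    for name in names:
        match = pattern.match(name)
        if match:                                  # names with another prefix are ignored
            numbers.add(int(match.group(1)))
    runs = []
    for num in sorted(numbers):
        if runs and num == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], num)          # extend the current run
        else:
            runs.append((num, num))                # start a new run
    return runs

# FGCLR24 to FGCLR40 without 31 and 32, plus a few of the single entries
names = [f"FGCLR{i}" for i in range(24, 41) if i not in (31, 32)]
names += ["FGCLR206", "FGCLR211", "FGCLR369", "FGC76930"]
runs = consecutive_runs(names)
print(runs)   # [(24, 30), (33, 40), (206, 206), (211, 211), (369, 369)]
```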

I'm also a little confused with the normal FGC SNPs. Below the Private or new-to-GRCh38 heading in each file, there are many normal (non-LR) FGC SNPs listed, but they are scattered with large gaps between many of them. They start at FGC76930 = 28212 A to G and end at FGC89307 (rs140514458) = 11557624 G to A. I'm accustomed to seeing novel SNPs clustered together as a continuous series with no gaps when they're first discovered. For example, my first FGC analysis several years ago, back around 2013, clustered the newly-discovered SNPs as a continuous sequence from FGC11316 to FGC11366 (of which FGC11336 and FGC11353 were key to splitting at least two of us from the rest of L1336). Why are the ones in Chromium so far apart from each other? The vast majority of them have a long string of pure plus signs following them. I only found a few, such as PF5374 (rs751426793) = 14020608 G to T and FGC82879 = 14416305 A to C, that are not 100% derived for everyone analyzed. Could it be that Chromium LR discovered few or no new SNPs that are unique to me, or unique to me and one or two other people, when analyzing my kit? I am not at all disappointed; I'm just trying to decipher what I'm seeing into plain English.

I'm curious why the ISOGG SNP Index has no FGCLR SNPs. At least, I couldn't find any when I downloaded the current XLSX and searched for "FGCLR".

Finally, why are so many SNPs missing from Ybrowse? If I take any random SNP, whether normal FGC, or FGCLR-type, and search on Ybrowse, there are many that I can't find.

Thank you.

dtvmcdonald
12-09-2020, 03:04 AM
Using this data is an extreme pain. It took me months. I had to write computer programs in C just to
split the overly large ones into usable pieces. I needed the BAM. I have used the whole genome except the X.
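
For text reports, the splitting step can be sketched in a few lines of Python. This is only an illustration of the idea, not the C programs mentioned above: the function name, piece naming, and line-count criterion are all assumptions. It also applies only to plain-text files such as the .tab reports; a BAM is compressed binary and would need a tool such as samtools to be cut up by region.

```python
import os
import tempfile

def split_by_lines(path, out_dir, lines_per_piece):
    """Stream a large text file into numbered pieces of at most lines_per_piece lines each."""
    os.makedirs(out_dir, exist_ok=True)
    pieces = []
    out = None
    with open(path, "r") as src:
        for i, line in enumerate(src):
            if i % lines_per_piece == 0:           # time to start a new piece
                if out is not None:
                    out.close()
                piece_path = os.path.join(out_dir, f"piece_{len(pieces):04d}.txt")
                out = open(piece_path, "w")
                pieces.append(piece_path)
            out.write(line)
    if out is not None:
        out.close()
    return pieces

# Small demonstration on a throwaway file: 10 lines, 4 lines per piece
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big.tab")
    with open(big, "w") as fh:
        fh.writelines(f"row{i}\n" for i in range(10))
    pieces = split_by_lines(big, os.path.join(tmp, "pieces"), 4)
    piece_sizes = []
    for p in pieces:
        with open(p) as fh:
            piece_sizes.append(sum(1 for _ in fh))
```

Because the file is read line by line, memory use stays small no matter how large the input is.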

They did discover numerous FGCLR SNPs in me, and they were in sequence. I however was the
first R1a they did and as far as I know the first R1a done by Chromium. Some, for instance FGCLR579
which is critical for my genealogy, do show up on Ybrowse.

I did two things: first, and why I paid the money, was to verify a mistake I suspected in the
standard human genome. I verified that what look in HUGO like three large deletes totalling 10 kilobases
was in fact one contiguous delete. The FGCLR data showed that HUGO was wrong because
the "reads" from one snippet were scattered over what in HUGO appeared an impossibly large
spread (like on average 250 kilobases). To show what was right required realigning my raw reads
comparing them not to HUGO but to a "de novo" assembly of another man from PacBio data.
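
The kind of check described above can be sketched as follows: for each barcode, take the leftmost and rightmost mapped read positions and flag barcodes whose reads span far more than a plausible parent fragment. This is a Python illustration under assumptions: the 100 kb threshold, the function name, and the sample positions are made up, and a real analysis would also have to allow for several molecules sharing one barcode.

```python
def barcode_spans(reads, max_span=100_000):
    """reads: iterable of (barcode, chrom, pos).

    Returns {(barcode, chrom): span} for every barcode whose reads on one
    chromosome are spread over more than max_span bases.
    """
    extremes = {}
    for barcode, chrom, pos in reads:
        key = (barcode, chrom)
        lo, hi = extremes.get(key, (pos, pos))
        extremes[key] = (min(lo, pos), max(hi, pos))
    return {key: hi - lo for key, (lo, hi) in extremes.items() if hi - lo > max_span}

# Invented example positions
reads = [
    ("BC1", "chrY", 1_000_000), ("BC1", "chrY", 1_030_000),   # 30 kb span: plausible
    ("BC2", "chrY", 2_000_000), ("BC2", "chrY", 2_250_000),   # 250 kb span: suspicious
]
suspicious = barcode_spans(reads)
```

A cluster of flagged barcodes over the same region would hint that the reference, rather than the sample, is the odd one out there.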

The "structural variants" in the long read Y data were hilariously wrong: apparently the result of
expecting a diploid genome.

Second, I used their "phased" autosomal data to phase myself. The long read data was
of course not capable by itself of telling which was paternal and which maternal.
Since my parents are long dead, I had to use an aunt and numerous cousins. This took months
but worked. There are no "canned" computer tools for doing this. The results were uploaded,
with extreme difficulty, to FTDNA, MyHeritage, and Gedmatch. They have proven invaluable in genealogy!

I should add that there WERE mistakes in the Chromium phasing: some of the longer purportedly
uniformly phased pieces in fact had switches in them. The shorter ones were OK.
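
One way to locate such switches is to compare the LR phase assignment with an independently derived one (for example from relatives) at the heterozygous sites of a single phase block, and look for places where the agreement flips. A minimal Python sketch with a hypothetical input encoding: each list holds 0 or 1 per site, saying which haplotype carries the alternate allele.

```python
def switch_points(lr_phase, truth_phase):
    """Return indices i where the LR phasing flips relative to the truth between site i-1 and i."""
    if len(lr_phase) != len(truth_phase):
        raise ValueError("both phasings must cover the same sites")
    # 0 where the two phasings agree, 1 where they disagree. The haplotype labels
    # are arbitrary, so only changes in this relationship along the block count.
    relation = [a ^ b for a, b in zip(lr_phase, truth_phase)]
    return [i for i in range(1, len(relation)) if relation[i] != relation[i - 1]]

# Invented example: the two phasings agree for four sites, then are inverted
lr    = [0, 1, 1, 0, 0, 1, 0, 1]
truth = [0, 1, 1, 0, 1, 0, 1, 0]
switches = switch_points(lr, truth)
```

A block that is inverted from start to finish yields no switch points, which is the desired behaviour: the LR data cannot say which haplotype is paternal, only which alleles travel together.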

I should add that I did not use a blood sample. The saliva sample worked just fine.

MacUalraig
12-12-2020, 01:37 PM
I re-ran the LongRanger analysis myself; it took about 3 days of run time on an 8-core i9. There was not much new in the way of Y-SNVs, and I already had a Y Elite anyway, so I was more interested in some of the longer SVs.