YFull vs. FGC



jbarry6899
07-13-2015, 08:15 PM
Has anyone had a BigY BAM file analyzed by both YFull and Full Genomes Corp.? If so, did you find significant differences in the results? Was there value in having both companies do the analysis?

Thanks,

Jim

lgmayka
07-13-2015, 08:39 PM
Yes, I have done that myself with three BAM files, and I was given the dual results in one more case.

- FGC gives only a downloadable bundle of text files, whereas YFull offers an online account with a friendly interactive interface (although major results can also be downloaded as CSV files).

- FGC does not, as far as I know, automatically update comparative results when new genetic neighbors arrive. (I think you can specifically ask FGC to re-run comparisons.) YFull continually updates its comparative results (e.g., shared SNPs vs. unshared SNPs) as new BAM files are analyzed.

- YFull presents a public current haplotree, with estimated TMRCAs, that is updated at least once a month. FGC does not.

- YFull offers access to the raw data directly. So for example, if someone asks for the precise results of YF03788 at position 19736631 (ZZ12_1), I can simply copy and paste this:


Sample: #YF03788 (R-P312)
ChrY position: 19736631 (+strand)
Reads: 164
Position data: 97T 67C
Weight for T: 0.581543624161
Weight for C: 0.418456375839
Probability of error: 0.115320099214 (0<->1)
Sample allele: Y (C or T)
Reference (hg19) allele: T
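
As a rough sketch of how the "Position data" line can be read, the snippet below turns a string such as "97T 67C" into plain read fractions. The function name `read_fractions` is hypothetical and this is not YFull's method. The plain fraction for T (97/164, about 0.591) differs from the reported weight for T (0.5815), so the weights presumably account for something beyond raw counts, such as base quality; that is an assumption.

```python
from collections import Counter


def read_fractions(position_data):
    """Parse a string like '97T 67C' into per-allele read fractions.

    Each whitespace-separated token is assumed to be <count><allele>,
    where the allele is the final single character.
    """
    counts = Counter()
    for token in position_data.split():
        count, allele = int(token[:-1]), token[-1]
        counts[allele] += count
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


fractions = read_fractions("97T 67C")
for allele, share in fractions.items():
    print(allele, round(share, 4))
```

Comparing these plain fractions with the reported weights is a quick way to see how far the reads are from a clean single-allele call.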


The primary reason for submitting a BAM file to both companies would be to update both databases, to maximize the visibility of a particular case (e.g., a member of a rare subclade). So for example, I did this for a member of N-Y7310 (http://yfull.com/tree/N-Y7310/).

jbarry6899
07-13-2015, 08:58 PM
Thanks very much. I have six kits of interest and have YFull analysis for all of them. I was wondering whether FGC would add any value. Looks like not.

MitchellSince1893
07-13-2015, 09:14 PM
Has anyone had a BigY BAM file analyzed by both YFull and Full Genomes Corp.? If so, did you find significant differences in the results? Was there value in having both companies do the analysis?

Thanks,

Jim

Yes I did both and think there is value in doing both. I concur with lgmayka's statements above.

YFull ranked 13 novel SNPs as "Best" and 10 novel SNPs as "Acceptable".
FGC ranked 19 at 99% and 5 at 95%.

This screen shot from a spreadsheet I made illustrates why I think doing both may be beneficial based on analysis of my BigY BAM file.

Notice that there are SNPs marked "Unreliable" or not even listed by YFull that received the highest ratings from FGC, and SNPs marked "Acceptable" by YFull that are ranked 40% and 10% by FGC.

Of the 27 shown, 20 SNPs receive top ratings from both FGC and YFull. Thus over 25% of these SNPs rated highly (top two categories) by one company weren't given one of the top two ratings by the other.
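
The "over 25%" figure can be checked with a few lines of arithmetic. This is only a sketch using the two counts given above (27 SNPs shown, 20 rated in the top two categories by both companies); the variable names are made up for illustration.

```python
shown = 27         # SNPs listed in the spreadsheet
top_by_both = 20   # rated in the top two categories by both FGC and YFull

# SNPs where the two companies did not both give a top-two rating
disagreements = shown - top_by_both
share = disagreements / shown

print(f"{disagreements} of {shown} = {share:.1%}")  # prints "7 of 27 = 25.9%"
```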

jbarry6899
07-13-2015, 10:37 PM
Very interesting--I just got some sample reports from FGC. I'll study them and see if they seem likely to help.

Cofgene
07-13-2015, 10:54 PM
FGC will place the new novel named SNPs into ybrowse. Initially YFULL did not do this. YFULL may be fudging the reported STRs by using the results presented at FTDNA. Very curious how YFULL can report a null425 event.

jbarry6899
07-13-2015, 11:45 PM
FGC will place the new novel named SNPs into ybrowse. Initially YFULL did not do this.

That could be a real advantage. Thank you.

VinceT
07-13-2015, 11:57 PM
FGC will place the new novel named SNPs into ybrowse. Initially YFULL did not do this. YFULL may be fudging the reported STRs by using the results presented at FTDNA. Very curious how YFULL can report a null425 event.
It is also curious how YFull is able to report Y-GGAAT-1B07, which was never mapped to GRCh37. It has since been mapped to GRCh38 at Y:10687543-10687730.

haleaton
07-14-2015, 12:05 AM
Both the FGC and YFull analyses were highly useful to me in understanding what FTDNA Big Y was doing and not doing as far as analysis goes. FGC is explicit in showing how it removes the "widely shared" SNPs from the delta to the reference sequence based on other data sets, which I found useful to know exactly.

The YFull interactive features to examine the raw data and also that of others who join a specific group is a big plus with YFull.

FGC also provides data on private INDELs, of which I have 6 that are higher reliability, two of which were verifiable by YSEQ Sanger. YFull does not report INDELs, which could turn out to be useful.

I also had, more importantly, a NGC Full Y at the BGI lab, and had YFull also analyze that. The value of the higher coverage and quality helps resolve the ambiguous cases from FTDNA Big Y data.

I had 4 SNPs that were not rated highly reliable in the FGC analysis that I was able to verify by Sanger at YSEQ. YFull identified 3 as reliable and 1 as unreliable. Interestingly, FTDNA ranked two of them as medium quality.

I am waiting on getting FGC data (from the US lab) for a paper 12th cousin who is an STR match, having failed twice--probably have to get a new sample. This will be an interesting case.

jbarry6899
07-14-2015, 01:20 AM
I ordered an FGC analysis and if it proves to be useful will order for other project members. Thanks.

lgmayka
07-14-2015, 01:40 AM
YFULL may be fudging the reported STRs by using the results presented at FTDNA.
On what evidence do you make this serious accusation? And how does your "conspiracy theory" explain YFull's ability to provide 100 of FTDNA's 111 Y-STRs to customers who never tested more than 12 markers at FTDNA?

Very curious how YFULL can report a null425 event.
Why are you surprised at this? Do you understand what a so-called Null at DYS425 really is? Here is what the Null DYS425 Project writes (https://www.familytreedna.com/groups/null-425/about/goals):
---
This is what FamilyTreeDNA.com says about Nulls at DYS425 and its relationship to the advanced marker DYF371: DYF371 is usually a four copy marker like DYS464. Its alleles are located on the palindromes P1 and P5 on the Y chromosome. (See Y Chromosome Palindromic Map.) One of the copies on P5 can carry a mutation in the flanking region from C to T. This T-type allele was discovered as an independent marker and was called DYS425. Our lab has developed a test to detect the C- and T-type alleles for all DYF371 STR alleles simultaneously. This test is called DYF371X. We can see the T-type alleles and the C-type alleles in different fluorescent colors, so we can label each allele with c or t. A normal person without a NULL at DYS425 would look like 10c-12t-13c-14c for example. You will notice that in a DYS425 NULL result you don't have a T-type allele. This is because a C-type allele has overwritten the T-type allele during the RecLOH event.
---

YFull also provides all the DYF371 values labeled as such.
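
To make the rule in the quoted description concrete, here is a minimal sketch that checks a DYF371X-style result for a missing T-type allele. The function name `has_null_425` is hypothetical, the input format is assumed to follow the "10c-12t-13c-14c" example quoted above, and the all-c value in the second call is a made-up illustration, not anyone's actual result.

```python
def has_null_425(dyf371x):
    """Return True if no allele in a DYF371X-style result is T-type.

    The result is assumed to look like '10c-12t-13c-14c': repeat counts
    joined by hyphens, each suffixed with 'c' or 't'.
    """
    alleles = [a for a in dyf371x.strip().lower().split("-") if a]
    types = [a[-1] for a in alleles]
    if not types or any(t not in ("c", "t") for t in types):
        raise ValueError(f"unexpected DYF371X value: {dyf371x!r}")
    return "t" not in types


print(has_null_425("10c-12t-13c-14c"))  # normal example from the quote: False
print(has_null_425("10c-12c-13c-14c"))  # hypothetical all-c result: True
```

The point is only that a "null" here is a property that can be read from the sequence itself (which copies carry C or T in the flanking region), not something that has to be copied from an FTDNA report.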

lgmayka
07-14-2015, 01:42 AM
It is also curious how YFull is able to report Y-GGAAT-1B07, which was never mapped to GRCh37. It has since been mapped to GRCh38 at Y:10687543-10687730
You apparently haven't read YFull's explanation. It's on the "STR results" page.
---
YFull Y-GGAAT-1B07 does not coincide with the FTDNA nomenclature!

Y-GGAAT-1B07 is kind of an oddball. The SMGF laboratory has developed primers for this STR on the AC019099 sequence of the RP11-428D10 clone of the RP11 donor. This sequence doesn’t appear on the human reference genome though. Not even on GRCh37/hg18. We know for sure that the sequence exists in (almost) every male and that it is inherited in a direct male line. Therefore it must be somewhere on the Y chromosome. The 5 bp long GGAAT repeat pattern is very common in the centromeric region of the Y chromosome, but also it is the general structure motif in the Yq12 region. Both of these regions represent huge gaps where we haven't got enough sequencing data in order to create a proper assembly and a consensus sequence from it. Therefore the exact position of the Y-GGAAT-1B07 marker will remain a mystery until we'll have better sequencing technologies that are able to sequence several 10 thousand bases in a row.
Thomas Krahn // http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2013-01/1357339085
---

In short, YFull's so-called Y-GGAAT-1B07 may not correspond to FTDNA's, and should not be used for comparisons with FTDNA results.

VinceT
07-14-2015, 02:18 AM
You apparently haven't read YFull's explanation. It's on the "STR results" page.
---
YFull Y-GGAAT-1B07 does not coincide with the FTDNA nomenclature!

Y-GGAAT-1B07 is kind of an oddball. The SMGF laboratory has developed primers for this STR on the AC019099 sequence of the RP11-428D10 clone of the RP11 donor. This sequence doesn’t appear on the human reference genome though. Not even on GRCh37/hg18. We know for sure that the sequence exists in (almost) every male and that it is inherited in a direct male line. Therefore it must be somewhere on the Y chromosome. The 5 bp long GGAAT repeat pattern is very common in the centromeric region of the Y chromosome, but also it is the general structure motif in the Yq12 region. Both of these regions represent huge gaps where we haven't got enough sequencing data in order to create a proper assembly and a consensus sequence from it. Therefore the exact position of the Y-GGAAT-1B07 marker will remain a mystery until we'll have better sequencing technologies that are able to sequence several 10 thousand bases in a row.
Thomas Krahn // http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2013-01/1357339085
---

In short, YFull's so-called Y-GGAAT-1B07 may not correspond to FTDNA's, and should not be used for comparisons with FTDNA results.

That is a quizzical defense. As per Thomas' posting you quoted, the SMGF reference sequence for Y-GGAAT-1B07 does not exist on the GRCh37 reference sequence. Consequently, any reads matching it would have been included in one of several dozen unmapped assemblies. I simply find it fascinating that YFull had the insight to (1) analyze those unmapped assemblies for STR motifs, and (2) to assign one of them with the specific Y-GGAAT-1B07 designation. Notwithstanding, their analysis of several FGC BAM files matched FTDNA's scores perfectly, a fact that I find amazing, if not incredible.

lgmayka
07-14-2015, 10:07 AM
I simply find it fascinating that YFull had the insight to (1) analyze those unmapped assemblies for STR motifs, and (2) to assign one of them with the specific Y-GGAAT-1B07 designation.
I would simply call it an educated guess as to what FTDNA is measuring.

Notwithstanding, their analysis of several FGC BAM files matched FTDNA's scores perfectly, a fact that I find amazing, if not incredible.
YFull's explanation is specifically telling you that their guess may be good but not perfect. When I look at the entire R1a group on YFull, with over 300 participants, I see these values of Y-GGAAT-1B07:
3 instances of 9
3 instances of 11
9 instances of blank (i.e., not readable)
the rest are 10

Many of these people have never tested 111 markers at FTDNA (and hence have no recorded Y-GGAAT-1B07 value), but I have no reasonable way to check even those who have--nor does FTDNA, since YFull does not ask customers for their FTDNA kit number. The "conspiracy theory" running in this thread imagines that YFull spends a lot of time guessing the customer's FTDNA kit number, merely in order to align some of its 400 Y-STR readings with FTDNA's. I see no evidence of such a thing. As YFull correctly says, its customers consider Y-STRs a lower priority than Y-SNPs, and therefore so does YFull--the Y-STR readings are delivered somewhat later than the rest of the analysis.

lgmayka
07-14-2015, 10:55 AM
The "conspiracy theory" running in this thread imagines that YFull spends a lot of time guessing the customer's FTDNA kit number, merely in order to align some of its 400 Y-STR readings with FTDNA's.
We know that YFull does not play such a guessing game for its bread-and-butter business, SNPs:
- DF27 is often a no-call, so unless a reliable downstream SNP is available, YFull must place some customers in R-P312 (without the asterisk) (http://yfull.com/tree/R-P312/), a kind of holding area--even when the customer has tested DF27+ at FTDNA.
- Similarly, frequent no-calls on Z49 result in a holding area at R-L2 (without asterisk) (http://yfull.com/tree/R-L2/) even when the customer's FTDNA account shows Z49+. (Again, only if no reliable downstream SNP is found.)

On the bright side, I was able to provide to YFull sufficient evidence that BY653 is downstream of DF27, so members of R-BY653 should move to their own new subclade downstream from DF27 in the next version of YFull's haplotree.

MitchellSince1893
07-14-2015, 11:46 AM
There is a conspiracy that Yfull is looking at FTDNA online to fill in STR values and thus that they are always identical?

It's not the case for me as there are a few values that are different.

Yfull has my CDY.2 = 37. FTDNA =36. I had a spreadsheet showing additional differences between my FTDNA 111 markers and Yfull's but can't find it at the moment.

TigerMW
07-14-2015, 12:08 PM
We know that YFull does not play such a guessing game for its bread-and-butter business, SNPs.
This doesn't make me feel any better. How do we "know" they never guess in their "bread-and-butter" business? Hopefully, they clearly state where they are playing a guessing game. If not, it is very hard for someone to ferret this out.

If there is any guessing involved anywhere along the way in calling and reporting results, it would be nice to see a probability or confidence percentage assigned along with the result.

I'm always amused when people say "let me speak frankly" or "let me tell you the truth", as if they weren't already.

My reply is not intended as an opinion of YFull. I don't know about YFull. They could be completely clear and straightforward. I know they are questioned by some, and they can be a bit of a black box. I do appreciate their publishing of a paper related to their SNP counting methods. I would like to see them publish that in a more recognized international science journal.

lgmayka
07-14-2015, 01:20 PM
This doesn't make me feel any better. How do we "know" they never guess in their "bread-and-butter" business?
Please understand that we are dealing in this thread with a "conspiracy theory" for which absolutely zero evidence has been shown, and about which the alleged "culprit" has not even been asked.

If you actually worry that YFull may be filling in Y-STR values by guessing customers' FTDNA kit numbers, ask them.

lgmayka
07-14-2015, 01:26 PM
If there is any guessing involved anywhere along the way in calling and reporting results, it would be nice to see a probability or confidence percentage assigned along with the result.
I repeat: YFull provides an interactive interface so that you can determine exactly what your BAM file says about any location. As an example, YF02497 is actually a Z49+ singleton, but is listed by YFull as R-L2 (YF02497) (http://yfull.com/tree/R-L2/) because of this no-call:


Sample: #YF02497 (R-L2)
ChrY position: 28462237 (+strand)
Sample allele: no call position
Reference (hg19) allele: G
Known SNPs at this position: Z49 (G->T) Rating for known SNP


The obvious question here is: Are the "conspiracy theorists" applying their same suspicions to FGC? As far as I know, FGC's BAM file analysis presents its interpretation but does not provide any reasonable way to see the raw data on which that interpretation is based. Is that correct?

To be fair, I must admit that new FTDNA project members sometimes ask whether FTDNA is really reading Y-STRs at all. They suspect that FTDNA is simply making up numbers out of thin air.

VinceT
07-14-2015, 05:08 PM
I would simply call it an educated guess as to what FTDNA is measuring.

YFull's explanation is specifically telling you that their guess may be good but not perfect. When I look at the entire R1a group on YFull, with over 300 participants, I see these values of Y-GGAAT-1B07:
3 instances of 9
3 instances of 11
9 instances of blank (i.e., not readable)
the rest are 10

Many of these people have never tested 111 markers at FTDNA (and hence have no recorded Y-GGAAT-1B07 value), but I have no reasonable way to check even those who have--nor does FTDNA, since YFull does not ask customers for their FTDNA kit number. The "conspiracy theory" running in this thread imagines that YFull spends a lot of time guessing the customer's FTDNA kit number, merely in order to align some of its 400 Y-STR readings with FTDNA's. I see no evidence of such a thing. As YFull correctly says, its customers consider Y-STRs a lower priority than Y-SNPs, and therefore so does YFull--the Y-STR readings are delivered somewhat later than the rest of the analysis.

One of the founders of YFull also runs the STR matching website semargl.me. This website did, on occasion, scrape Ysearch for STR haplotypes (but no longer), and still does for many FTDNA projects. It must be said that this is a website that I value greatly. But having a direct owner-relationship between the two services does leave some room for this particular conspiracy theory.

VinceT
07-14-2015, 05:30 PM
The obvious question here is: Are the "conspiracy theorists" applying their same suspicions to FGC? As far as I know, FGC's BAM file analysis presents its interpretation but does not provide any reasonable way to see the raw data on which that interpretation is based. Is that correct?

To be fair, I must admit that new FTDNA project members sometimes ask whether FTDNA is really reading Y-STRs at all. They suspect that FTDNA is simply making up numbers out of thin air.

The raw data is in the BAM file. All one needs to do is download one of several available tools to view BAM file alignments. The Broad Institute's IGV (https://www.broadinstitute.org/igv/) is one, NCBI's Genome Workbench (http://www.ncbi.nlm.nih.gov/tools/gbench/) is another, Artemis from the Sanger Institute (https://www.sanger.ac.uk/resources/software/artemis/) is another. There are many more. Granted, figuring out how to use them with the proper reference sequence can be tricky, and the applications that run locally typically require a computer with sufficient memory.

YFull supplies their own SNP viewer integrated into their website. It is lovely, and I really appreciate it. But I also appreciate seeing the aligned reads in context. No, FGC does not have their own viewer; they only supply the plain text reports.

MitchellSince1893
07-15-2015, 02:38 AM
There is a conspiracy that Yfull is looking at FTDNA online to fill in STR values and thus that they are always identical?

It's not the case for me as there are a few values that are different.

Yfull has my CDY.2 = 37. FTDNA =36. I had a spreadsheet showing additional differences between my FTDNA 111 markers and Yfull's but can't find it at the moment.

Found it. Actually there were just 2 differences between the YFull and FTDNA STR values:
the previously mentioned CDY.2 = 37 on YFull and 36 on FTDNA,
and DYS710 = 39 on FTDNA and 37 on YFull.

Also Yfull had no reads at DYS455, DYS454, DYS447, YCAII.1, YCAII.2, DYS413.1, DYS413.2, DYS490, DYS617, DYS568, DYS487, DYS572, DYS640, DYS492, DYS565, DYS716, DYS717, DYS452, DYS445, Y-GATA-A10, and DYS463.

Petr
07-15-2015, 07:36 AM
I have 8 BigY results analyzed by YFull and 6 of them by FGC too. And one FGC Elite result analyzed by YFull.

There are differences in the interpretation: the view on quality is different, and sometimes a highest-quality SNP at FGC is low quality at YFull and vice versa.

The STR results look much better at YFull: many more markers, and the results are closer to FTDNA standard tests.

YFull ignores some SNPs, like S12993, with this explanation:
We do not analyze region 1002xxxx. There are so many bad-quality SNPs.

For some kits I got an explanation why YFull results are not the same as FTDNA results:
This is all to be expected, but his DYS464 and DYS714 are displayed incorrectly at FTDNA.

DYS464 actually has a value of 13-15-15-15. Outside the repeats, there is an independent deletion of 13 nucleotides. PCR in this case will show -3 repeats.
Thus at FTDNA this STR has the value 12-13-15-15. That is not true. :)

DYS714 actually has a value of 27.c. In this marker an insertion of CTTCT occurred. PCR in this case will show +1 repeat.
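
The arithmetic behind that explanation can be sketched as follows. A fragment-length (PCR) test measures total length, so an insertion or deletion outside the repeats shifts the apparent repeat count by roughly the indel length divided by the motif length. The function name `repeat_shift` is hypothetical, the 4 bp motif for DYS464 and the 5 bp motif for DYS714 are assumptions (the latter inferred only from the 5 bp CTTCT insertion counting as +1), and real allele calling is more involved than simple rounding.

```python
def repeat_shift(indel_bp, motif_len):
    """Approximate change in apparent repeat count seen by a length-based test.

    indel_bp is the net length change outside the repeats
    (negative for a deletion, positive for an insertion).
    """
    return round(indel_bp / motif_len)


# DYS464: 13-nucleotide deletion, motif length assumed to be 4 bp
dys464_shift = repeat_shift(-13, 4)
print(dys464_shift)        # -3
print(15 + dys464_shift)   # a true 15 would read as 12

# DYS714: 5-nucleotide insertion (CTTCT), motif length assumed to be 5 bp
dys714_shift = repeat_shift(5, 5)
print(dys714_shift)        # 1
```

Under these assumptions, one copy with a true value of 15 reads as 12, which is consistent with 13-15-15-15 being displayed as 12-13-15-15.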

As for CDY, for all 8 kits the values given by FTDNA and YFull are the same.

Other differences are:
DYS464: 13-13-15-15 / 13-13-16-16
DYS710: 34 / 33 (but it was marked as "Uncertain")
Y-GGAAT-1B07 8 / 10
and
DYS413: FTDNA: 23-23, FGC Elite, interpreted both by FGC and YFull: 22-23

For DYS464, YFull displays just 4 alleles.