View Full Version : WHY HASN'T NATIONAL GEOGRAPHIC/HELIX DONE ANYTHING ABOUT RELEASING RAW DATA FILES ?



WilliamBruce
08-19-2017, 06:25 PM
WHY HASN'T NATIONAL GEOGRAPHIC/HELIX DONE ANYTHING ABOUT RELEASING RAW DATA FILES ?
It was first represented in December 2016 that they were working on a way to release the raw data files. Any new news?
It looks like once the Genographic Project was completed and Dr. Wells left, they couldn't care less about it.

chipmunk226
08-20-2017, 03:53 AM
I have also been wondering the same thing. I inquired and got a response on July 31, 2017 that they were in the testing stage and hoped that it would be available within the next month and I would get an email when the option was available.

WilliamBruce
08-20-2017, 06:46 PM
Thanks. I inquired twice and all they could tell me was that they were currently working on it. It was the same response I received before I purchased Geno 2.0 in February.

wombatofthenorth
08-24-2017, 08:26 PM
I wonder what happened to the promised expanded-regions update for non-Helix users. People who bought the non-Helix Geno 2.0 NG were promised an update with expanded regions, yet only the Helix users have gotten it.

MacUalraig
11-15-2017, 12:09 PM
They seem to have made a somewhat vague claim that it's coming 'later this year':

"Downloads will be available later this year. Exact cost and timing will be announced in the coming weeks. If you'd like to be added to a list for notification when this feature becomes available, let us know at https://support.helix.com/hc/en-us/requests/new"

https://twitter.com/my_helix/status/928679680219136000

WilliamBruce
11-24-2017, 07:36 PM
Geno 2.0 has finally released downloadable files. However, they are incomplete and do not work on GEDmatch or GEDmatch Genesis.

Ann Turner
11-24-2017, 11:53 PM
Could you describe the content a little more? Or if you don't mind sending me the raw data, I could take a look at the file format. I'm [email protected]

wombatofthenorth
11-27-2017, 02:33 AM
Is it really true that the raw download includes full mtDNA??? On another forum someone was claiming it gave 16,000+ data points for mtDNA (7% no-calls), so is it really like a poor man's full mtDNA? Or did they make a mistake and just see a position numbered 16,000-something, without noticing tons of spots missing along the way from 0 to 16,000+?

Ann Turner
11-27-2017, 07:41 PM
I'm just now in the process of analyzing the files William Bruce graciously sent to me. It does have slots for all 16569 positions but with a very high no-call rate, which seems to run for rather long stretches (1000+ bases). If anyone has another mtDNA file to send me, I could check to see if the no-call positions are consistent. [email protected]

Ann Turner
11-28-2017, 03:04 PM
Thanks to William Bruce for sending me his raw data from the Helix version of the Genographic Project.

He was unable to upload the file to GEDmatch because of the file format. It is sorted by rsid instead of chromosome and position, it lacks a column for position, and it puts the two alleles in separate columns. I used Excel's VLOOKUP function against a 23andMe v3 template to identify the position, concatenated allele1 and allele2 to make a genotype column, rearranged columns, and sorted by chromosome and position. This was sufficient for GEDmatch to recognize the file format (23andMe). There is no X data at all, and GEDmatch made note of that but still accepted the file.
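
The Excel steps above can also be scripted. What follows is a minimal Python sketch, not the procedure actually used here. It assumes the Helix rows have already been read as (rsid, allele1, allele2) tuples and that a tab-separated 23andMe-style template (rsid, chromosome, position, genotype) is available as text. The function names and the chromosome sort order are illustrative assumptions.

```python
# Assumed chromosome sort order for a 23andMe-style file.
CHROM_ORDER = {str(n): n for n in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "MT": 25})


def load_template_positions(template_text):
    """Map rsid -> (chromosome, position) from a 23andMe-style template."""
    positions = {}
    for line in template_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        rsid, chrom, pos, _genotype = line.split("\t")
        positions[rsid] = (chrom, int(pos))
    return positions


def convert_rows(helix_rows, positions):
    """Look up positions, join the two alleles, and sort by chromosome/position.

    helix_rows: iterable of (rsid, allele1, allele2); this layout is an assumption.
    Rows whose rsid is not in the template are dropped, much as a failed
    VLOOKUP would leave them without a position.
    """
    out = []
    for rsid, allele1, allele2 in helix_rows:
        if rsid not in positions:
            continue
        chrom, pos = positions[rsid]
        out.append((rsid, chrom, pos, allele1 + allele2))
    out.sort(key=lambda row: (CHROM_ORDER.get(row[1], 99), row[2]))
    return out


def to_23andme_text(rows):
    """Render converted rows as tab-separated text with a header comment."""
    lines = ["# rsid\tchromosome\tposition\tgenotype"]
    lines += [f"{rsid}\t{chrom}\t{pos}\t{gt}" for rsid, chrom, pos, gt in rows]
    return "\n".join(lines) + "\n"
```

A dash in each allele column becomes "--" after concatenation, which is how 23andMe files mark a no-call.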

General observations:

The genotype distribution is unremarkable. SNPs where the two alleles are also complementary base pairs are avoided. Transitions ( A<->G or C <-> T ) are more common than transversions.
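
A note on why complementary pairs are avoided: an A/T SNP read from the opposite strand is reported as T/A, and likewise C/G becomes G/C, so the genotype alone cannot tell you which strand was used. The sketch below sorts an allele pair into those categories. The function name and labels are my own, for illustration only.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}      # purine<->purine, pyrimidine<->pyrimidine
COMPLEMENTARY = {frozenset("AT"), frozenset("CG")}    # pairs that look the same on either strand


def classify_snp(allele_a, allele_b):
    """Classify a biallelic SNP by its two alleles."""
    pair = frozenset((allele_a.upper(), allele_b.upper()))
    if len(pair) != 2 or not pair <= set("ACGT"):
        raise ValueError("expected two different bases from A, C, G, T")
    if pair in TRANSITIONS:
        return "transition"
    if pair in COMPLEMENTARY:
        return "transversion (strand-ambiguous)"
    return "transversion"
```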

The no-call rate is quite high at 13%. Genealogy companies aim for no more than 3% and often achieve much better than that. It would be useful to know if the no-calls have a consistent pattern.

The overall homozygosity is 60.2%, lower than found in the SNP selection for 23andMe v3 (70.5%) or LivingDNA (83%). Homozygosity increases when more SNPs with rare alleles are added, since most people will share the more common allele. The Genographic Project may have looked for SNPs that are somewhat common globally but have different distributions in various parts of the world.
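
For anyone who wants to compute the same two figures on their own file, here is a rough sketch. It assumes genotypes are available as (allele1, allele2) pairs with "-" marking a no-call, and it measures homozygosity among called genotypes only. That denominator is my assumption, since the post does not say which one was used.

```python
def genotype_summary(genotypes, no_call="-"):
    """Return the no-call rate and the homozygosity among called genotypes."""
    total = called = homozygous = 0
    for allele1, allele2 in genotypes:
        total += 1
        if allele1 == no_call or allele2 == no_call:
            continue  # count as a no-call; excluded from homozygosity
        called += 1
        if allele1 == allele2:
            homozygous += 1
    return {
        "no_call_rate": (total - called) / total if total else 0.0,
        "homozygosity": homozygous / called if called else 0.0,
    }
```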

[Attachment 20032]

William has a file from LivingDNA for comparison. GEDmatch Genesis shows that 99.5% of the calls in the Helix file match the calls in the LivingDNA file (about 237 differences out of the 47423 SNPs available for comparison). This sounds like a decent concordance rate, but it is lower than achieved by chip technology. My son's LivingDNA file compared to his 23andMe file shows 54 differences out of 183,824 SNPs (99.97%).
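
The concordance figures are simply one minus differences divided by SNPs compared. A minimal sketch of that comparison follows. It assumes each file has been loaded as a dict of rsid to genotype string and that "--" marks a no-call. It treats "AG" and "GA" as the same call and does not attempt strand flipping, so it is not necessarily what GEDmatch Genesis does internally.

```python
def concordance(calls_a, calls_b, no_call="--"):
    """Compare genotype calls at rsids present and called in both files.

    Returns (snps_compared, differences, concordance_rate).
    """
    compared = differences = 0
    for rsid, genotype_a in calls_a.items():
        genotype_b = calls_b.get(rsid)
        if genotype_b is None or no_call in (genotype_a, genotype_b):
            continue  # not shared, or a no-call in either file
        compared += 1
        if sorted(genotype_a) != sorted(genotype_b):
            differences += 1
    rate = 1 - differences / compared if compared else float("nan")
    return compared, differences, rate
```

Plugging in the counts above, 1 - 237/47423 is about 0.995 and 1 - 54/183824 is about 0.9997.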

The SNP overlap with different platforms is important for GEDmatch. These stats are comparisons of specific files, so they would vary slightly depending on no-calls in the other files.

[Attachment 20033]

With the current GEDmatch algorithm, William's LivingDNA and Helix files show lots of gaps in sections where the SNP overlap falls below their threshold. He shows only a 76.3% similarity with himself.

[Attachment 20034]

The Helix file fared better with a comparison with a parent/child kit from 23andMe v4, where there are more SNPs in common. There were still some small breaks: there were 66 segments vs the expected 22. However, the total cM added up to 3455, just shy of the 3565 I see for a 23andMe v3 kit. GEDmatch also introduces breaks for a v3 kit in regions where the SNP density is low, so 45 segments were reported for a self-to-self comparison.

MacUalraig
12-01-2017, 04:46 PM
Ann, thanks.

What about Y SNPs?

Ann Turner
12-01-2017, 06:43 PM
There are 4746 rows of data, with 998 no-calls (21%). The SNPs are given names like V193 or CTS72. I'm not set up to analyze the phylogenetic coverage, but I expect William Bruce would be happy to share that portion of his results.

Ann Turner
12-01-2017, 11:08 PM
I have looked at the Helix mtDNA data provided by William Bruce. There are rows of data numbered 1 to 16569 (the number of bases in the whole mtDNA molecule). Of those, 870 are no-calls (5.3%). Many of the no-calls come in stretches, as shown in the attached table. I'd be very curious to know if the gaps apply across the board. Has anyone else received raw data from the Helix version of the Genographic Project?
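
To check for the same stretches in another file, something like the sketch below would list each run of consecutive no-calls. It assumes the calls are held in a string or list ordered by position, with "-" as the no-call symbol, and it is only an illustration of the idea.

```python
def no_call_runs(bases, no_call="-"):
    """Return (start, end, length) for each run of no-calls; positions are 1-based."""
    runs = []
    start = None
    for pos, base in enumerate(bases, start=1):
        if base == no_call:
            if start is None:
                start = pos  # a new run begins here
        elif start is not None:
            runs.append((start, pos - 1, pos - start))
            start = None
    if start is not None:  # the sequence ended inside a run
        end = len(bases)
        runs.append((start, end, end - start + 1))
    return runs
```

Sorting the result by length, longest first, would surface the 1000+ base stretches quickly.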

I transformed the data into a FASTA file by putting the 16569 bases into a text file with rows of 60 bases each. That number is not essential, but it's commonly used to make the data more readable.

I replaced the no-calls with N, the convention for missing data in mtDNA. When I ran the FASTA file through James Lick's utility http://dna.jameslick.com/mthap/ it came up with a haplogroup of T2a1a, the same as reported by Geno. There were six missing SNPs for T2a1a, but the overall pattern was enough to identify the haplogroup.
[Attachment 20144]
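
The FASTA step described above can be sketched in a few lines of Python. This assumes the mtDNA calls are already in position order as single characters with "-" for a no-call. The header text and function name are placeholders of my own.

```python
def to_fasta(bases, header="helix_mtdna", width=60, no_call="-"):
    """Build FASTA text: no-calls become N and the sequence wraps at `width` bases."""
    sequence = "".join("N" if base == no_call else base.upper() for base in bases)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

The width of 60 is only a readability convention, as noted above, so any other value works as well.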

Ann Turner
12-03-2017, 06:14 PM
Correction: the raw data file has 16559 rows of data, missing the last 10 bases. These would probably not be informative anyway -- in my personal database of 37,000 GenBank records, no variants were reported for 16560-16569.

Simplification: the data doesn't really need to be reformatted. You can copy the Allele1 column to a text file and replace the dash "-" with an N. James Lick's utility will handle this just fine.

Update: I've now seen another Helix file. It shares many of the gaps found in the first file I looked at. I also have a niggling feeling that some of the calls might not be right, but I would need to compare with a full mitochondrial sequence to be sure.
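
One way to put a number on "shares many of the gaps" is to compare the sets of no-call positions in the two files. The sketch below is a hypothetical helper, assuming both files list calls in the same position order with "-" for a no-call. It reports the shared count, the combined count, and their ratio.

```python
def no_call_positions(bases, no_call="-"):
    """Return the set of 1-based positions that are no-calls."""
    return {pos for pos, base in enumerate(bases, start=1) if base == no_call}


def gap_overlap(bases_a, bases_b, no_call="-"):
    """Return (shared, combined, shared/combined) for the no-calls of two files."""
    gaps_a = no_call_positions(bases_a, no_call)
    gaps_b = no_call_positions(bases_b, no_call)
    shared = gaps_a & gaps_b
    combined = gaps_a | gaps_b
    ratio = len(shared) / len(combined) if combined else 1.0
    return len(shared), len(combined), ratio
```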

wombatofthenorth
12-09-2017, 05:08 AM
so it could be like 90-95% of full mtDNA?

Ann Turner
12-09-2017, 09:57 AM
Right, the coverage is about 95%, based on a sample of two.

The person who sent me the second file happens to have a full mitochondrial sequence from FTDNA. There were three discrepancies, which are almost certainly errors on the Helix side.
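
A comparison like this can be sketched as a position-by-position check that skips the Helix no-calls. The code assumes both sequences are already in the same numbering with no insertions or deletions between them, which is a simplification: a full sequence reported as a list of differences from a reference would first have to be rebuilt into a complete string.

```python
def discrepancies(helix_bases, full_sequence, no_call="-"):
    """List (position, helix_base, full_base) where a called Helix base disagrees."""
    found = []
    for pos, (helix, full) in enumerate(zip(helix_bases, full_sequence), start=1):
        if helix == no_call:
            continue  # nothing to compare at a no-call
        if helix.upper() != full.upper():
            found.append((pos, helix, full))
    return found
```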

wombatofthenorth
12-09-2017, 10:34 PM
that is pretty impressive, I don't think the old Geno, 23 or LivingDNA come close.

Dumidre
12-11-2017, 11:47 PM
Hi Ann,
May I send you mine also? I managed to obtain the Helix Raw Data.

Ann Turner
12-12-2017, 12:06 AM
Yes, please. I'm mostly interested in the mtDNA part. If you've had any other mtDNA testing, I would appreciate those results for comparison. My email is [email protected]

wombatofthenorth
12-12-2017, 12:28 AM
It would be odd to go for 95% and not just 100% for that part. Maybe they aimed for 99% (everything except the last few positions at the end, which don't vary) and ended up with 5% or so where the chip fails and gives no-calls. Anyway, it seems possibly far better than 23andMe or LivingDNA.
It's a shame that more people across the world don't take Geno, since this could have been a way to build up mtDNA studies in a huge way.

Ann Turner
12-12-2017, 01:02 AM
I agree that it seems odd, but it's sequencing, not a chip process. So far the gaps have been in similar places, so perhaps the problem is aligning the short reads and assigning them to the right place. Pure speculation on my part.

aosamores
08-15-2018, 09:36 AM
I need help to convert my new Geno 2.0 file to GEDmatch. Tx.

Carl45
07-18-2019, 03:29 AM
Would someone be kind enough to tell us step by step how to upload the data to GEDmatch? Everyone should be aware that Geno 2.0, from both FTDNA and Helix, often gives customers the wrong haplogroup. Apparently the cause is National Geographic's bad algorithm. They have terrible customer service. I'm also disappointed at the rate of no-calls in Helix compared to FTDNA. I can't wait to see the haplogroup they identify once it is corrected.

Ann Turner
07-18-2019, 07:11 PM
Is this a Helix kit? I outlined the process in a message upthread, but if you have autosomal data from another company, I don't think it's worth the effort.