I need help with Paabo's Vindija Neanderthal DNA files

10-02-2020, 03:18 AM
I asked Svante Paabo where I could find the DNA sequence for the Vindija Neanderthal whose DNA on chromosome 3 matches a human sequence that is highly associated with severe covid.

This is what he sent me.
The coordinates are in the paper (chr3: 45,859,651-45,909,024, hg19).

The Neandertal genomes are available at http://cDNA.eva.mpg.de/neandertal/VINDIJA/VCF.

So you can get the sequence there.

We're talking about chromosome 3, positions 45,859,651 to 45,909,024. Whatever hg19 is. The 13 SNPs associated with covid severity form a haplotype that is almost always inherited as a block, and they span 49,000 or so base pairs across three genes.
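The "49,000 or so" figure can be checked straight from the coordinates in the email. This is only a small sketch; the function name parse_region is made up for illustration.

```python
def parse_region(text):
    """Parse a region string such as 'chr3: 45,859,651-45,909,024'."""
    chrom, span = text.split(":")
    start_text, end_text = span.split("-")
    start = int(start_text.replace(",", "").strip())
    end = int(end_text.replace(",", "").strip())
    return chrom.strip(), start, end


chrom, start, end = parse_region("chr3: 45,859,651-45,909,024")
length = end - start + 1  # count both endpoints
print(chrom, start, end, length)  # chr3 45859651 45909024 49374
```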

The link doesn't work. So I backed up to /neandertal, and then chose the folder Vindija, and then chose the folder VCF.

It's organized by chromosome. So,
chr3_mq25_mapab100.vcf.gz.tbi and chr3_mq25_mapab100.vcf.gz .

I am running the Ubuntu operating system and my spreadsheet software is LibreOffice Calc.

The Readme file contains NO help.

I suppose one is supposed to unzip the gz file and then open it?

Well, it doesn't seem to even finish unzipping the file, but it does leave a big file.
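A .gz file does not have to be unzipped to disk before you can look inside it. As a rough sketch, Python's standard gzip module can read the first few lines directly from the compressed file. The helper name peek is an assumption of this example, and it assumes the download is an ordinary gzip-compatible text file.

```python
import gzip
import os
from itertools import islice


def peek(path, n=20):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Only runs if the downloaded file is in the current folder.
if os.path.exists("chr3_mq25_mapab100.vcf.gz"):
    for line in peek("chr3_mq25_mapab100.vcf.gz"):
        print(line)
```

Because only the requested lines are decompressed, this should return almost at once even for a multi-gigabyte file.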

The format is .vcf

Online I learned that this is a contact database, ALWAYS. This format is supposedly not even used for anything else. Supposedly you just open it in your address book and it displays the addresses it is supposed to contain. OR, you can allegedly open it in Excel. Every time I try to open it in Calc, my entire computer hangs, and it's an i5 computer with 32 GB of RAM.

How the bleep do I open it?!!!

WHY is it not in Genbank? Where they deal in plain old fashioned text files?!!!!

Or is it? If it is there, how do I find it? I gather no one else here has found that a straightforward matter.

If anyone wanted to extract the portion of chromosome 3 that I need from the Vindija person and zip it and send it to me, I would be eternally grateful. It's probably 49,000 lines, albeit short ones, so probably one does have to zip it.

I guess I'm interested in the positive strand read forward or whatever.

Or, there are 20 or 30 SNPs I actually need - the ones that are variously at Ancestry, 23andMe, and Family Tree DNA (which contains a subset of the SNPs at 23andMe). I listed for each which SNPs are included.

Understand, this block of DNA got incorporated into the modern human genome and is inherited as a block, so many humans have the same block of genetic code. 16% of Europeans have at least one copy of it. Then there are other sequences that didn't come from the Vindija Neanderthal. (The two Neanderthals in central Siberia only had some of the changes.)

So, I've got one copy of two of the covid-associated Neanderthal alleles that are in Ancestry's data, and almost certainly the entire block, and what I want to do is figure out which ancestor they came from. I'd like to know what the Vindija alleles for these SNPs are.
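One possible way to read off the alleles at a handful of SNPs is to match on the hg19 position rather than the rs number, since the Neanderthal file may not carry rs numbers at all (that is an assumption here, not something checked). The sketch below also assumes a single sample column with the genotype (GT) as its first entry. The function names and the two data lines are invented for illustration and are not real Vindija data.

```python
import re


def genotype_alleles(ref, alt, gt):
    """Translate a GT value such as '0/1' into bases, e.g. ['G', 'A']."""
    choices = [ref] + alt.split(",")
    alleles = []
    for index in re.split(r"[/|]", gt):
        alleles.append("." if index == "." else choices[int(index)])
    return alleles


def alleles_at(vcf_lines, wanted_positions):
    """Return {position: [allele, allele]} for the positions asked for."""
    wanted = set(wanted_positions)
    found = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if pos in wanted:
            gt = fields[9].split(":")[0]  # assumes GT is the first FORMAT key
            found[pos] = genotype_alleles(fields[3], fields[4], gt)
    return found


# Invented example lines, NOT real data from the Vindija file.
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "3\t1000\t.\tG\tA\t50\t.\t.\tGT\t1/1",
    "3\t2000\t.\tC\t.\t50\t.\t.\tGT\t0/0",
]
print(alleles_at(demo, [1000, 2000]))
```

The positions to pass in would be the hg19 coordinates of the SNPs reported by Ancestry, 23andMe, or Family Tree DNA, looked up separately.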


Dora Smith

10-02-2020, 08:09 PM
I found the solution to this.

A VCF file is just a text file whose data is formatted in a specific way, just like a GEDCOM file for genealogy. If you rename it as a text file it still works, though the data might not make sense at first glance.
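To make the "specific way" concrete: in the genetics flavor of VCF, each data line is tab-separated, and the first eight columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, followed by a FORMAT column and one column per sample. Lines starting with # are headers. Here is a minimal sketch of reading one line; the function name and the example line are made up for illustration and are not real Vindija data.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dictionary keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # Column 9 names the per-sample entries; column 10 holds the first sample.
    if len(fields) > 9:
        keys = fields[8].split(":")
        record["SAMPLE"] = dict(zip(keys, fields[9].split(":")))
    return record


# Invented example line, NOT real data.
example = "3\t45859651\t.\tG\tA\t50\t.\t.\tGT\t1/1"
record = parse_vcf_line(example)
print(record["POS"], record["REF"], record["ALT"], record["SAMPLE"]["GT"])
```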

I renamed the VCF file I had managed to unzip as .txt. The unzipped file (assuming it all unzipped) was 7 GB in size.

Then I used the Linux split command in Terminal to split the file into thousands of half-megabyte pieces. (To do the same thing in Windows you have to find and install a utility.)

Then I used the fact that LibreOffice Calc previews the contents of a file when it asks how you want to open it to search for the segment starting with 45,859,651 and ending with 45,909,024. Then I moved the pieces containing that segment to a different folder. These can be opened in Calc (or Excel), saved in Calc or Excel format, and renamed. Then I deleted the other pieces and emptied my trash.
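For anyone repeating this, a different route from the split-and-preview steps above (not what I did) would be to filter the compressed file by position and write out only the region of interest. The following is a sketch under a few assumptions: the file holds only chromosome 3, its records are sorted by position (the .tbi index next to it suggests so), and the function name and output file name are made up.

```python
import gzip
import os


def extract_region(vcf_gz_path, out_path, start, end):
    """Copy header lines and records with start <= POS <= end.

    Returns the number of data records written.
    """
    kept = 0
    with gzip.open(vcf_gz_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # keep the header so columns stay labeled
                continue
            pos = int(line.split("\t", 2)[1])  # POS is the second column
            if pos > end:
                break  # assumes records are sorted by position
            if pos >= start:
                dst.write(line)
                kept += 1
    return kept


# Only runs if the downloaded file is in the current folder.
if os.path.exists("chr3_mq25_mapab100.vcf.gz"):
    count = extract_region("chr3_mq25_mapab100.vcf.gz",
                           "vindija_chr3_region.vcf", 45859651, 45909024)
    print(count, "records written")
```

The resulting file is tab-separated text, so Calc should be able to open it if Tab is chosen as the separator in the import dialog.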

It is not possible or not easy to make sense of the data without opening it in spreadsheet software.

I now have the sequence, if someone else wants it. It consists of 9 text files, most of them just over half a MB in size.

Dora Smith