
Little script (Windows + R): converting DNA.land data to 23andMe format



anglesqueville
08-15-2017, 08:35 PM
For those whom it might interest, I wrote a small script (nothing really great, it's really very basic) which, within its limits, can replace the professional tools and convert DNA.land data directly to the 23andMe format. It's in R, of course. Limitation:
Without installing a tricky extra package (bigmemory), R cannot handle really big data. An uncompressed imputed.vcf file weighs ca. 2.5 GB, so processing it in its entirety is out of the question. In any case that would be useless, since the resulting txt file would be unusable. My working assumption is therefore that you want to convert extracts from your genome.
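
As an aside (my addition, not part of the method described next): R can also cut a slice out of the uncompressed file by itself, through a plain connection, so the whole 2.5 GB never sits in memory. The file name and line counts below are assumptions to adjust to your case:

con<-file("imputed.vcf",open="r")
skip<-readLines(con,n=100000)  # read and discard the first 100,000 lines
chunk<-readLines(con,n=500000)  # keep the next 500,000 lines
close(con)
chunk<-chunk[substr(chunk,1,1)!="#"]  # drop any VCF header lines
writeLines(chunk,"myfile.csv")  # the extract the script below expects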
Script: I'll put it at the end of this post; you'll just have to copy and paste it into Notepad and save it under any name with the extension .r.
Usage: once you have downloaded the file imputed.vcf.gz (it is the 3rd one; for the wired townspeople the download takes a few minutes, for rural people like me who live at the bottom of a hole, count 20 to 40 minutes), you unzip it and open it with glogg. You select with the mouse the part you want to convert; this is the painful part. You copy and paste that selection into a text editor (slow if you have taken many lines, f***g Windows) and save it under the name myfile.csv. You put this file and the script in a single folder, open an R session, source the script, and it runs. If your file is a bit large there is some latency (R reading your file), then it runs very fast. I advise against more than 2,500,000 lines; above all, Windows doing everything wrong, handling files in Notepad (or another editor) becomes painful as soon as the number of lines gets a bit big.

As output you will get in your folder a file named myconvertedfile.txt, which you can open with a text editor, a spreadsheet, or glogg if it is big. Since I wanted a 23andMe file, I did not keep the allele probabilities, but that is not a problem (except that it would slow things down, since there is a calculation to be made from the genotype likelihoods). If someone asks me, I will post an augmented script.
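
For readers completely new to R, a typical session boils down to two lines (the path is a placeholder, and I assume you saved the script as convert.r):

setwd("C:/path/to/your/folder")  # the folder holding convert.r and myfile.csv
source("convert.r")  # runs the script; myconvertedfile.txt appears in the folder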

Script:


# Read the VCF extract (tab-separated, no header row).
# Columns: 1=CHROM, 2=POS, 3=ID (rs number), 4=REF, 5=ALT, 10=genotype field
tab<-read.csv("myfile.csv",sep='\t',header=F,stringsAsFactors=F)
# New table from the first four columns; column 4 will receive the genotype
Ntab<-tab[,1:4]
Ntab[,4]<-as.character(Ntab[,4])
# 0/0 = homozygous reference -> REF twice
I<-which(substr(tab[,10],1,3)=="0/0")
Ntab[I,4]<-paste(substr(tab[I,4],1,1),substr(tab[I,4],1,1),sep="")
# 0/1 = heterozygous -> REF then ALT
I<-which(substr(tab[,10],1,3)=="0/1")
Ntab[I,4]<-paste(substr(tab[I,4],1,1),substr(tab[I,5],1,1),sep="")
# 1/1 = homozygous alternate -> ALT twice
I<-which(substr(tab[,10],1,3)=="1/1")
Ntab[I,4]<-paste(substr(tab[I,5],1,1),substr(tab[I,5],1,1),sep="")
# Reorder to the 23andMe layout: rsid, chromosome, position, genotype
Ntab[,1]<-tab[,3]
Ntab[,2]<-tab[,1]
Ntab[,3]<-tab[,2]
write.table(Ntab,"myconvertedfile.txt",sep='\t',row.names=FALSE,col.names=FALSE,quote=F)

Warning: of course this elementary script does not correct the damned flipped SNPs!
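
For the curious, a rough sketch of what keeping the likelihoods could look like; this is only my guess, not the promised augmented script, and it assumes the FORMAT column (column 9) of the DNA.land VCF names a GL subfield, e.g. "GT:GL". These lines would go just before the write.table line:

fmt<-strsplit(as.character(tab[1,9]),":")[[1]]  # subfield names, e.g. "GT" "GL"
gl.pos<-match("GL",fmt)  # position of GL among the subfields (NA if absent)
if(!is.na(gl.pos)){
  parts<-strsplit(as.character(tab[,10]),":")
  Ntab[,5]<-sapply(parts,"[",gl.pos)  # keep the raw GL values as a 5th column
}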

Theconqueror
08-16-2017, 11:42 AM
Thanks for this, Angles. bigmemory is easily installed in RStudio. What is the tricky part with bigmemory? Is it how it can be used in conjunction with your script?

anglesqueville
08-16-2017, 11:56 AM
There is nothing really tricky with bigmemory, in fact, but I wanted to write a script for people with very elementary knowledge of R (just open and source). That said, honestly, I never manage my own genetic files in R... Well, I just ran chromosomes 1 and 2 of my mom through this little script. For chromosome 2: 2 minutes. Perhaps it's only a fun little thing, perhaps it's useful, I don't know. In any case it wasn't that much work...
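
A side note from me rather than from the thread: if you want to go well past 2,500,000 lines without bigmemory, the data.table package reads big tab-separated files much faster than read.csv; a one-line swap for the first line of the script:

library(data.table)  # install.packages("data.table") if needed
tab<-fread("myfile.csv",sep="\t",header=FALSE,data.table=FALSE)  # same shape as the read.csv result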