How to add RSIDs to a .BAM or VCF file?



Smashorpass
09-22-2018, 12:47 AM
Does anyone know how to add RSIDs to a BAM or FASTQ or even VCF file? I tested with a company called Seeq a long time ago and I have a BAM and FASTQ and VCF but they are basically irrelevant. Is there a way to add these?

MacUalraig
09-22-2018, 03:04 PM
BAM and FASTQ files are collections of reads, so they are not structured that way. Annotating a VCF with rsIDs can be done, e.g. with GATK's VariantAnnotator:

Annotate a VCF with dbSNP IDs and depth of coverage for each sample

https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php

but I've not done one myself :-)
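
To make concrete what "annotating a VCF with rsIDs" means, here is a minimal sketch that uses plain awk rather than GATK. It fills the ID column (column 3) of a VCF from a lookup table keyed on chromosome and position. The file names lookup.tsv, input.vcf and annotated.vcf, and all positions and rs numbers, are made up for illustration. A real lookup table would come from dbSNP on the same genome build, with matching chromosome names ("1" vs "chr1"). Matching on position alone ignores alleles, so treat this as an approximation of what a proper annotator does.

```shell
#!/bin/bash
# Toy illustration only: file names, positions and rs numbers are invented.
set -euo pipefail
cd "$(mktemp -d)"

# Hypothetical lookup table: chromosome, position, rsID (tab-separated).
printf '1\t1000\trs111\n1\t2000\trs222\n' > lookup.tsv

# Minimal VCF whose ID fields are empty (".").
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t1000\t.\tA\tG\t50\tPASS\t.\n'
  printf '1\t1500\t.\tC\tT\t50\tPASS\t.\n'
  printf '1\t2000\t.\tG\tA\t50\tPASS\t.\n'
} > input.vcf

# First file: remember the rsID for each (chromosome, position).
# Second file: pass header lines through untouched, and overwrite
# column 3 (ID) only when the position is in the lookup table.
awk 'BEGIN {FS = OFS = "\t"}
     NR == FNR {id[$1,$2] = $3; next}
     /^#/ {print; next}
     ($1,$2) in id {$3 = id[$1,$2]}
     {print}' lookup.tsv input.vcf > annotated.vcf

cat annotated.vcf

# Collect the ID column of the data lines for a quick check.
ids=$(awk -F'\t' '!/^#/ {print $3}' annotated.vcf | paste -sd, -)
echo "IDs: $ids"
```

In this toy run the records at positions 1000 and 2000 pick up rs111 and rs222, while the record at 1500 keeps "." because it has no entry in the lookup table.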

Kurd
09-23-2018, 04:21 PM
You're sending the guy on a wild goose chase with GATK tools. They are absolutely not easy to figure out and use; I'm pretty sure that I'm the only user here who has figured out how to use them.

In addition to needing a Linux environment on a powerful computer and familiarity with the Linux command line, the poor guy will have to figure out all the dependencies, the correct formats for human reference genomes, the types of headers to add to BAMs and how to add them so that GATK will work (GATK is extremely picky), and a host of other issues. Quite a commitment, to be sure.

Since I'm feeling generous today, here is a script I have put together using Linux bash and AWK that simplifies matters a lot.

1 - All you will need is a Linux environment and to install AWK with this command:


sudo apt-get install gawk

2 - Next, create your shell script by saving the following AWK command in a text file called, say, RS_Converter.sh (the .sh extension denotes a shell script):


#!/bin/bash
# Look up rs numbers in Data.bim by chromosome and position and insert them into A.bim.
# Rows of A.bim with no matching position in Data.bim are left unchanged.
# Argument order: lookup file (Data.bim), file to annotate (A.bim), then > output.

awk 'NR == FNR {REP[$1,$4] = $2; next} ($1,$4) in REP {$2 = REP[$1,$4]} 1' OFS="\t" Data.bim A.bim > A_rs.bim

Data.bim is a PLINK bim file that already contains rs numbers. Get one with over 1 million genotyped positions if possible, and make sure it uses the same reference genome build as your own data, since the lookup matches on chromosome and position.

A.bim is your PLINK bim file that does not contain rs numbers. You can easily convert your VCF to PLINK bed/bim/fam format using the plink tools.

A_rs.bim will be the output bim file with rs numbers added.
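
To see what the lookup does before running it on real data, here is a small worked example on toy files. The positions, placeholder IDs and rs numbers are invented for illustration. The bim layout assumed is the usual six tab-separated columns: chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2.

```shell
#!/bin/bash
# Toy data only: the positions and rs numbers below are invented.
set -euo pipefail
cd "$(mktemp -d)"

# Data.bim: reference file that already has rs numbers in column 2.
printf '1\trs111\t0\t1000\tA\tG\n1\trs222\t0\t2000\tC\tT\n2\trs333\t0\t1000\tG\tA\n' > Data.bim

# A.bim: file to annotate; column 2 holds placeholder IDs.
printf '1\t1:1000\t0\t1000\tA\tG\n1\t1:1500\t0\t1500\tC\tT\n2\t2:1000\t0\t1000\tG\tA\n' > A.bim

# Same one-liner as in the script: build a (chromosome, position) -> ID
# table from Data.bim, then replace column 2 of A.bim where a match exists.
awk 'NR == FNR {REP[$1,$4] = $2; next} ($1,$4) in REP {$2 = REP[$1,$4]} 1' OFS="\t" Data.bim A.bim > A_rs.bim

cat A_rs.bim

# Column 2 of the result, joined with commas for a quick look.
ids=$(cut -f2 A_rs.bim | paste -sd, -)
echo "IDs after lookup: $ids"
```

Here the variants at 1:1000 and 2:1000 are renamed to rs111 and rs333, while 1:1500 keeps its placeholder ID because Data.bim has no entry at that position.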


3 - Next you need to make the script executable. First make the directory that holds RS_Converter.sh and your bim files your working directory with the Linux cd command. Once you have set your working directory, make RS_Converter.sh executable by typing the following command:


chmod +x RS_Converter.sh

If successful, RS_Converter.sh will typically appear in a different color (often green) in a colored ls listing, indicating that the file is executable.


To run the script, point your terminal at that directory and type ./RS_Converter.sh (the leading ./ tells the shell to look in the current directory).
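
Because the lookup leaves unmatched positions unchanged, it can be worth counting how many variants actually ended up with an rs number once the script has run. A minimal sketch follows. The printf line only creates a toy A_rs.bim with invented values so the example runs on its own. With real data you would skip it and run the two awk commands in your working directory.

```shell
#!/bin/bash
set -euo pipefail
cd "$(mktemp -d)"

# Toy stand-in for the script's output (invented values).
printf '1\trs111\t0\t1000\tA\tG\n1\t1:1500\t0\t1500\tC\tT\n2\trs333\t0\t1000\tG\tA\n' > A_rs.bim

# Total number of variants in the file.
total=$(awk 'END {print NR}' A_rs.bim)

# Variants whose ID (column 2) looks like an rs number; n + 0 prints 0 when none match.
named=$(awk '$2 ~ /^rs[0-9]+$/ {n++} END {print n + 0}' A_rs.bim)

echo "$named of $total variants carry an rs number"
```

A low count on real data usually points to a mismatch between the two files, for example different genome builds or different chromosome naming.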


You can always change your plink bed bim fam files back to 23andMe text format by using the plink command:


./plink --bfile A_rs --recode 23 --out A_rs

plink appends the .txt extension itself, so this writes A_rs.txt.
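
One detail worth spelling out: --bfile A_rs expects A_rs.bed, A_rs.bim and A_rs.fam side by side, while the awk step only produces A_rs.bim. A minimal sketch of one way to complete the set is below, assuming your converted files are named A.bed and A.fam. The empty placeholder files exist only so the sketch runs anywhere. With real data you would skip them and copy your actual files.

```shell
#!/bin/bash
# Sketch: give the A_rs prefix a full .bed/.bim/.fam set before calling plink.
set -euo pipefail
cd "$(mktemp -d)"

# Empty placeholders for illustration; in real use these are the files from
# your VCF conversion (A.bed, A.fam) and from the awk step (A_rs.bim).
: > A.bed
: > A.fam
: > A_rs.bim

# The genotype (.bed) and sample (.fam) files are not changed by the rs
# lookup, so they can simply be copied under the new prefix.
cp A.bed A_rs.bed
cp A.fam A_rs.fam

# Confirm that all three files for the A_rs prefix are present.
missing=0
for ext in bed bim fam; do
  if [ ! -f "A_rs.$ext" ]; then
    echo "A_rs.$ext is missing"
    missing=$((missing + 1))
  fi
done
echo "missing files: $missing"

# With real data in place, the plink --bfile A_rs command can find all three files.
```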


Good luck....

Danzo
09-23-2018, 07:48 PM
Wow, I never knew about --recode 23, thanks for that. I was manually compiling the ped/map into 23andMe format this entire time (using Linux commands, of course).