How to call BAM files?



Kale
09-07-2018, 05:44 PM
I've found this wonderful documentation...
https://gaworkshop.readthedocs.io/en/latest/contents/04_genotyping/genotyping.html

I found the program referenced (I think)
https://github.com/stschiff/sequenceTools/blob/master/src-simpleBamCaller/simpleBamCaller.hs
Does this need to be compiled or anything?

But I can't locate a certain few things referenced in this documentation that are necessary to actually do it...
/projects1/users/schiffels/PublicData/HumanOriginsData.backup/EuropeData.positions.txt
/projects1/Reference_Genomes
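
The positions file in that path is local to the workshop's own cluster, but samtools only needs a plain list of "chromosome, position" pairs, and that can be derived from any EIGENSTRAT .snp file for the panel you care about. A minimal sketch, assuming the standard six-column .snp layout; the function name, the `prefix` parameter and the chromosome renaming table are illustrative and must be matched to the contig names in your own reference FASTA:

```python
# Assumption: EIGENSTRAT codes 23/24/90 stand for X/Y/MT. Many references use
# other spellings (e.g. "chrM"), so edit this table to match your FASTA.
CHROM_NAMES = {"23": "X", "24": "Y", "90": "MT"}


def snp_to_positions(snp_path, out_path, prefix=""):
    """Write a 'chrom<TAB>pos' list from an EIGENSTRAT .snp file.

    .snp columns (whitespace separated): SNP id, chromosome, genetic
    position, physical position, reference allele, alternative allele.
    Returns the number of positions written.
    """
    count = 0
    with open(snp_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom = CHROM_NAMES.get(fields[1], fields[1])
            dst.write(f"{prefix}{chrom}\t{fields[3]}\n")
            count += 1
    return count
```

Pass `prefix="chr"` if your reference names its contigs chr1, chr2 and so on. The reference genome itself has to be downloaded separately and must be the same build the BAM was mapped against.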

In addition, it seems simpleBamCaller does one chromosome at a time. How do I merge the chromosomes into one individual?
The .ind and .snp files are easy, but the documentation describes .geno as a simple text file like the others, and that is not the case.
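
In the unpacked (text) EIGENSTRAT format, a .geno file has one line per SNP and one character per individual (0, 1, 2, or 9 for missing), so per-chromosome outputs for the same individuals can simply be concatenated. A .geno that looks like binary is probably in the packed format, which EIGENSOFT's convertf can convert to text first; that step is outside this sketch. The function name and checks below are illustrative, not part of any existing tool:

```python
def merge_chromosomes(prefixes, out_prefix):
    """Concatenate per-chromosome text EIGENSTRAT filesets into one.

    prefixes: file prefixes (without extension), already in chromosome order.
    Every fileset must list the same individuals in the same order.
    Returns the total number of SNPs written.
    """
    if not prefixes:
        raise ValueError("no input prefixes given")
    individuals = None
    total = 0
    with open(out_prefix + ".snp", "w") as snp_out, \
         open(out_prefix + ".geno", "w") as geno_out:
        for prefix in prefixes:
            with open(prefix + ".ind") as fh:
                ind = [line.split() for line in fh if line.strip()]
            if individuals is None:
                individuals = ind
            elif ind != individuals:
                raise ValueError(f"{prefix}.ind lists different individuals")
            with open(prefix + ".snp") as fh:
                snps = [line.rstrip("\n") for line in fh if line.strip()]
            with open(prefix + ".geno") as fh:
                genos = [line.strip() for line in fh if line.strip()]
            if len(snps) != len(genos):
                raise ValueError(f"{prefix}: .snp and .geno line counts differ")
            for row in genos:
                # One character per individual, only 0/1/2/9 allowed.
                if len(row) != len(individuals) or set(row) - set("0129"):
                    raise ValueError(f"{prefix}.geno is not text EIGENSTRAT")
            snp_out.write("".join(s + "\n" for s in snps))
            geno_out.write("".join(g + "\n" for g in genos))
            total += len(snps)
    with open(out_prefix + ".ind", "w") as ind_out:
        for fields in individuals:
            ind_out.write("\t".join(fields) + "\n")
    return total
```

A call would look like `merge_chromosomes([f"sample.chr{c}" for c in range(1, 23)], "sample.all")`, where the `sample.chrN` prefixes are hypothetical names for the per-chromosome outputs.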

I'm only looking to do this for a few samples at the moment, namely Tianyuan, Sunghirs, KEB-IAM-TOR, and Baikal_N and Chan_Meso/Canes (if they are available, idk)
But the recent ISBA abstracts look to have a lot of interesting stuff coming up...

For anyone that plays the game RuneScape, I will heftily reward any help, anyone else will have to settle for my undying gratitude :P

kawhi
09-07-2018, 09:05 PM
I would like to know how to do this as well.

I tried installing the sequencetools package on Windows (I've done it successfully before) & ran into many problems and wasn't able to compile the program.

However, I was able to locate the executables from the earlier install. Unfortunately, the simpleBamCaller executable was never installed (I don't know why).

Anyways. I tried using pileupCaller and failed miserably. I don't know if my samtools is messed up, if I'm using the samtools mpileup incorrectly, or if my pileupCaller executable is messed up.

It would be nice if someone who has had success and experience with either program (simplebamcaller or pileupcaller) could write up a step-by-step tutorial.
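
As a starting point for the samtools half of that step, here is a minimal sketch that only builds the `samtools mpileup` command, so it can be inspected before anything is run. The flags are standard samtools options; the quality thresholds of 30 and all file names are assumptions for illustration. The output is meant to be piped into pileupCaller, whose own options are not reproduced here because they have changed between versions and should be taken from its README.

```python
import shlex


def mpileup_command(bam_paths, reference_fasta, positions_file,
                    min_mapq=30, min_baseq=30):
    """Return the argv list for a samtools mpileup call restricted to a SNP list."""
    if not bam_paths:
        raise ValueError("at least one BAM file is required")
    return [
        "samtools", "mpileup",
        "-R",                  # ignore read groups: one sample per BAM
        "-B",                  # disable BAQ recalculation
        "-q", str(min_mapq),   # minimum mapping quality
        "-Q", str(min_baseq),  # minimum base quality
        "-l", positions_file,  # only report these positions
        "-f", reference_fasta, # reference the BAMs were mapped against
        *bam_paths,
    ]


def as_shell(cmd):
    """Render an argv list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(as_shell(mpileup_command(["sample.bam"], "ref.fa", "panel.positions.txt")))
```

Checking that this command prints sensible pileup lines on its own is a quick way to tell whether a failure comes from samtools or from the caller downstream.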

Cofgene
09-07-2018, 09:29 PM
Follow the instructions here: http://www.it2kane.org/2016/10/y-dna-variant-discovery-workflow-pt-1-1/ This represents the industry standard way of getting to the information. For STR calls you can use HipSTR or lobSTR on the reassembled BAM file.

Note that the steps are I/O intensive, so running from an SSD or M.2 SSD drive is recommended. For specific steps, the more threads you have the faster it will process, up to the point that you saturate the disk I/O. Yes, one CAN saturate the I/O on an M.2 SSD based system doing this. If you don't have a Linux system, you can set up the pipeline to run under Windows 10 WSL with Ubuntu installed.

Kale
09-08-2018, 02:04 AM
Is this implying a prefab computer from 2007 won't be up to the task?

Also, I should have noted in the OP, I'm not too interested in Y-calls. I want autosomal data on the 1240k ancients panel.

anglesqueville
09-08-2018, 07:24 AM
Kale, as I'm lazy, for BAMs I always use Felix's http://www.y-str.org/2014/04/bam-analysis-kit.html , which is mainly a bundle of classic tools, including BAMtools. I've been told that the Y calls it provides are sometimes dubious, but I have never had problems with the autosomes and I don't see why there would be any. If you really want to use the pipeline of Schmutzi Workshop (amazing site btw), you'll have to download the references yourself; in particular, Lazaridis 2014 is in the Reich repository at Harvard: https://reich.hms.harvard.edu/datasets (but of course that has nothing to do with the BAM work itself). To stitch the chromosomes together, simply use the merging flags in PLINK (after converting to bed/bim/fam, of course).
Note: my computer is 6 years old, with 32 GB of RAM and a big CPU, and working with BAMs is very slow and painful on it. If yours is from 2007, I'm not too optimistic for you.

Cofgene
09-08-2018, 10:43 AM
You can update some of the tools and all of the reference sequences used by the bam-analysis-kit. It's a pain and runs really slow compared to the standard multithreaded options due to Cygwin limitations.

For the extraction of a set of autosomal calls from a BAM see https://github.com/tkrahn/extract23

anglesqueville
09-08-2018, 03:15 PM
Yes, I'm aware of it, but amateurs like me rarely have to analyse BAMs, so it's not a big deal. Furthermore, my Linux runs in a virtual box, and I'm sure you see what that means for speed, even with multithreaded processes :\ . Thanks for extract23.

Kale
09-08-2018, 04:19 PM
It's unfortunate that such analysis is so CPU-intensive. Another thing I've noticed is that every published paper does its work in ADMIXTOOLS, meaning the authors have already done what I'm looking to do. What are the odds of e-mailing the contact for a paper and getting a copy of the dataset they ran in ADMIXTOOLS? What should I say, if that's an option? So far I'm 0 for 1 in that regard (ignored, no response).

Generalissimo
09-09-2018, 12:31 AM
Felix's genotyping kit produces biased calls, and this is especially a problem for low coverage samples.

So you should definitely not use the output in formal stats analyses, and I would even discourage people from uploading such samples to GEDmatch.

If genotype data aren't available from the authors of the relevant papers, then the only samples that should be used are those processed with Schiffels' genotyping pipeline.
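
One general reason diploid calls from low coverage go wrong (not necessarily the specific bias meant above) is that a true heterozygote often shows only one of its two alleles in the few reads that cover the site, so a naive caller reports it as homozygous. A minimal sketch of the arithmetic, assuming each read samples either allele with probability 1/2, independently, and ignoring sequencing error and reference bias; the function name is illustrative:

```python
def prob_het_looks_homozygous(depth):
    """Probability that all reads at a true heterozygous site show the same allele.

    Either allele can be the one seen, so P = 2 * (1/2)**depth = (1/2)**(depth - 1).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 0.5 ** (depth - 1)


if __name__ == "__main__":
    for depth in range(1, 9):
        print(f"{depth} reads: {prob_het_looks_homozygous(depth):.4f}")
```

Under these assumptions a heterozygous site covered by a single read always looks homozygous, and at 5 reads it still does so about 6% of the time.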

kawhi
09-09-2018, 12:37 AM
Have you ever tried the latter? Did you have success with it? Was it a simple or awkward process?

Generalissimo
09-09-2018, 12:48 AM
I haven't done it. I don't have the bandwidth for downloading a lot of BAMs. But I know a couple of people who set it up without much of a problem, except, from memory, an e-mail or two to the author of the pipeline to sort out some issues.

From some of the experiments that we ran, the results were definitely more sound than anything else, including my own efforts with Samtools, in which I was trying to be extra cautious.

It would definitely be useful for several people on this board to set up the pipeline on their computers, and then work as a team whenever ancient data pop up at the ENA and the genotypes aren't available from the authors.

Kurd
09-09-2018, 01:41 AM
I have a lot of experience with this sort of thing, as I have not only called genotypes using various pipelines but have also assembled and mapped raw sequences from FASTA and FASTQ files at ENA and SRA over the past couple of years. There are many pitfalls and challenges to doing this accurately, and I have outlined some at http://www.eurasiandna.com/2017/10/02/diploid-genotyping-low-medium-coverage-ancient-dna/

I strongly recommend reading the article to anyone contemplating getting into this. There I also discuss diploid genotyping of steppe aDNA that was published by Harvard as pseudo-haploid.

If you already have the assembled and mapped BAM files, a good pipeline would be ATLAS, which also deals with aDNA damage. It can be used for pseudo-haploid as well as diploid genotyping of aDNA. You can find a link to it in the references of the article.
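
For anyone unfamiliar with the term, a pseudo-haploid call keeps one randomly chosen read per site and records it as if the individual were homozygous for that allele. The sketch below illustrates the idea on the bases column of a samtools pileup line. It is not the code of ATLAS or of any other tool named in this thread, it ignores base qualities and aDNA damage entirely, and the function names are made up for illustration:

```python
import random


def read_bases(ref, pileup_bases):
    """List the base each read shows at a site, from a pileup bases column."""
    bases = []
    s = pileup_bases
    i = 0
    while i < len(s):
        c = s[i]
        if c == "^":                      # read start, followed by a MAPQ char
            i += 2
        elif c in "+-":                   # indel: sign, length, inserted/deleted bases
            j = i + 1
            while j < len(s) and s[j].isdigit():
                j += 1
            i = j + int(s[i + 1:j]) if j > i + 1 else i + 1
        elif c in ".,":                   # match to the reference base
            bases.append(ref.upper())
            i += 1
        elif c.upper() in ("A", "C", "G", "T"):
            bases.append(c.upper())       # mismatch on either strand
            i += 1
        else:                             # '$', '*', 'N' etc.: nothing usable
            i += 1
    return bases


def pseudo_haploid_call(ref, alt, pileup_bases, rng=random):
    """Return an EIGENSTRAT genotype character from one randomly drawn read.

    '2' = drew the reference allele, '0' = drew the alternative allele,
    '9' = no read shows either allele (missing).
    """
    usable = [b for b in read_bases(ref, pileup_bases) if b in (ref.upper(), alt.upper())]
    if not usable:
        return "9"
    return "2" if rng.choice(usable) == ref.upper() else "0"
```

Because only one read is used per site, the call never produces a heterozygote (no '1'), which is why pseudo-haploid data need to be treated differently from true diploid genotypes downstream.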

anglesqueville
09-09-2018, 08:18 AM
I never had enough confidence in the output of "my" (if I can put it that way) BAM analyses (only a handful, as I said) to merge them with established data and use them in formal stats. On the other hand, to tell the truth, when I compared the output of Felix's bundle (for the Mycenaeans, if I recall) against the output of Schiffels' pipeline, I was unable to see any difference (autosomes only). As for GEDmatch, I'm not concerned, simply because I never upload anything to GEDmatch. Generalissimo, off topic, but since you are around: yesterday I posted a question about imputation in the inquiries thread, and I would be happy to get your advice. Do you impute genetic data, and if so, which pipeline do you use?

Generalissimo
09-09-2018, 08:26 AM
I still use BEAGLE 3.3.2. It's very easy to use and accurate, as long as you use a lot of reference samples from a wide range of geographic regions. So, say, a couple thousand from basically every part of the world, including as many as you can relevant to the sample being imputed.

http://faculty.washington.edu/browning/beagle/b3.html

By the way, yeah, ATLAS could be a good option for genotyping BAM files.

https://bitbucket.org/phaentu/atlas/wiki/Home

anglesqueville
09-09-2018, 08:32 AM
^^ Generalissimo: thanks, I was looking for something "lighter" than BEAGLE, but your advice is in line with all the others I got: everybody seemingly still uses BEAGLE. Among the 3 professionals I met only one uses the version 5, but I guess that the 2 others are simply too lazy to update :).