Thread: Few questions on getting started with analysis

  1. #1
    Registered Users
    Posts
    963
    Sex
    Omitted

    Few questions on getting started with analysis

    Hi there, I'm looking to get started doing things like admixture, d-stats, etc. and I have a few questions.
    1) Where to get datasets on modern human populations?
    - I found one, George Busby's something-or-other, 163 world populations, but that leads me to question 2.
    2) How do I view them in a text-editor friendly format?
    - The .bim file in that dataset looks a lot like 23andme's raw data (it's the first I've seen so that's my frame of reference), and outlines the SNPs used in the dataset presumably.
    - The .bed file (which contains the actual calls for populations), from online search is supposed to be text-editor friendly, but it's not, .bed + npp = gibberish.
    3) How can I prune multiple datasets to get only the snps shared between them
    - I can automate this in openoffice calc pretty easily, but calc is sllooooow at doing mass comparisons, there has to be a faster way...

    Thanks in advance for any help

  2. #2
    Moderator
    Posts
    4,592
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA
    U152>L2>Z367
    mtDNA
    H5a1

    Normandie Netherlands Friesland Finland
    Last edited by anglesqueville; 03-14-2018 at 08:07 AM.
    En North alom, de North venom
    En North fum naiz, en North manom

    (Roman de Rou, Wace, 1160-1170)

  3. The Following User Says Thank You to anglesqueville For This Useful Post:

     Kale (03-15-2018)

  4. #3
    Registered Users
    Posts
    352
    Sex
    Ethnicity
    Turkish
    Nationality
    German
    Y-DNA
    J2b2a1-L283>Z1297
    mtDNA
    HV4

    Regarding datasets, the most widely used one is probably the Human Origins dataset from the Reich Lab. It comes in geno format, which can be easily converted to Plink format with the convertf program.
    Then there is the Estonian Biocentre, which hosts a lot of datasets in Plink format. The largest of them is their own: the Estonian Genome Diversity Panel (EGDP) dataset.
    There is also the Simons Genome Diversity Project (SGDP) dataset. This is whole genome data, so quite large. Available in different formats.
    And then there is of course the old 1000 genomes dataset.

    There are probably many more sources that provide smaller datasets from papers, like datadryad, but these are the big ones that I know of.

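    To view large text files I would recommend glogg. But Plink .bed files are binary, so they aren't readable in a text editor (the text-based BED format that turns up in online searches is an unrelated genome-annotation format). Trying to read geno files doesn't make much sense either.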
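    To make the "not readable by humans" point concrete, here is a minimal Python sketch of what a decoder has to do. It assumes the standard PLINK 1 variant-major layout (three magic bytes, then two bits per sample, four samples per byte); the function name decode_bed and the choice to return allele counts are illustrative, not part of any PLINK tool.

```python
import math

# Two-bit genotype codes in a PLINK 1 .bed file, read as (byte >> 2*i) & 3.
# Values are copies of the first allele listed in the .bim file; None = missing.
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}

# Magic number; the third byte 0x01 marks the variant-major layout.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def decode_bed(raw, n_samples):
    """Decode the raw bytes of a variant-major .bed file.

    Returns one list per variant, each holding n_samples genotype values.
    n_samples is the number of lines in the matching .fam file.
    """
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    bytes_per_variant = math.ceil(n_samples / 4)
    body = raw[3:]
    if len(body) % bytes_per_variant != 0:
        raise ValueError("file size does not match the sample count")

    variants = []
    for start in range(0, len(body), bytes_per_variant):
        block = body[start:start + bytes_per_variant]
        genotypes = []
        for i in range(n_samples):
            # Sample i sits in byte i // 4, lowest-order bit pair first.
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            genotypes.append(CODE_TO_A1_COUNT[code])
        variants.append(genotypes)
    return variants
```

    With a real dataset you would pass the contents of the file opened in binary mode, for example open("data.bed", "rb").read(), where data.bed is a placeholder name.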
    Combining and editing datasets is best done in Plink, it's quite powerful.
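    For the shared-SNP question, a rough Python sketch that avoids a spreadsheet: it reads the variant IDs (second column) from each dataset's .bim file and intersects them. It assumes the datasets name their SNPs the same way (for example rsIDs on the same genome build) and it does not check alleles or strand; the function names are made up for illustration.

```python
def bim_variant_ids(lines):
    """Collect variant IDs (column 2) from the lines of a .bim file."""
    ids = set()
    for line in lines:
        fields = line.split()  # .bim columns are whitespace-separated
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def shared_variant_ids(*bim_line_sources):
    """Return the sorted variant IDs present in every .bim source given."""
    id_sets = [bim_variant_ids(source) for source in bim_line_sources]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))
```

    An open file object works as a line source, so the result for two .bim files can be written out one ID per line. Such a list can then be given to Plink's --extract option to keep only those SNPs in each dataset before merging with --bmerge.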

  5. The Following User Says Thank You to Sangarius For This Useful Post:

     Kale (03-15-2018)

  6. #4
    Moderator
    Posts
    4,592
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA
    U152>L2>Z367
    mtDNA
    H5a1

    Quote Originally Posted by Kale View Post
    Hi there, I'm looking to get started doing things like admixture, d-stats, etc. and I have a few questions.
    1) Where to get datasets on modern human populations?
    - I found one, George Busby's something-or-other, 163 world populations, but that leads me to question 2.
    2) How do I view them in a text-editor friendly format?
    - The .bim file in that dataset looks a lot like 23andme's raw data (it's the first I've seen so that's my frame of reference), and outlines the SNPs used in the dataset presumably.
    - The .bed file (which contains the actual calls for populations), from online search is supposed to be text-editor friendly, but it's not, .bed + npp = gibberish.
    3) How can I prune multiple datasets to get only the snps shared between them
    - I can automate this in openoffice calc pretty easily, but calc is sllooooow at doing mass comparisons, there has to be a faster way...

    Thanks in advance for any help
    Out of curiosity: over the last few months you have published many D-stats, including double-outgroup analyses, for ancient and modern data. I have some difficulty understanding how you could do that without being reasonably familiar with PLINK, admixtools, eigensoft, data management, etc. Am I missing something there?

  7. The Following User Says Thank You to anglesqueville For This Useful Post:

     Saetro (03-14-2018)

  8. #5
    Registered Users
    Posts
    963
    Sex
    Omitted

    Quote Originally Posted by anglesqueville View Post
    Out of curiosity: over the last few months you have published many D-stats, including double-outgroup analyses, for ancient and modern data. I have some difficulty understanding how you could do that without being reasonably familiar with PLINK, admixtools, eigensoft, data management, etc. Am I missing something there?
    Someone else made the D-stats, and I'm moderately handy with Excel. I don't need no nmonte, I can just eyeball it.

  9. #6
    Moderator
    Posts
    4,592
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA
    U152>L2>Z367
    mtDNA
    H5a1

    Quote Originally Posted by Kale View Post
    Someone else made the D-stats, and I'm moderately handy with Excel. I don't need no nmonte, I can just eyeball it.
    OK. I'm not a professional, but I've worked a lot on D-stats, including their mathematical background (mathematics is my job, even if statistics and probability theory are far from being "my" maths). I couldn't tell you how many hours I've spent on Patterson's fundamental text. I draw three conclusions from this work:
    1) The best way to use D-stats is direct use.
    2) The only sensible way to combine formal stats is qpAdm. The so-called "2 outgroups" method is nothing serious, to say the least.
    3) There is a large empirical part in the use of D-stats, which often means you'll have to run many trials and look inside the population clusters, down to individuals. Therefore you must do the work yourself, no escape. Learn PLINK, learn EIGENSOFT, learn ADMIXTOOLS. I would like to be able to add "learn BEAGLE/FastIBD", but honestly my own trials with FastIBD so far have been just pathetic ...
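    For anyone following along, the quantity itself is small enough to sketch in Python. This assumes the allele-frequency form of D(W, X; Y, Z) as given in Patterson's ADMIXTOOLS paper, with one frequency per SNP per population; the function name is illustrative, and the sketch leaves out the block-jackknife standard error and Z-score that make a D-stat interpretable in practice.

```python
def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP allele frequencies in four populations.

    Each argument is a sequence of frequencies (0.0 to 1.0) for the same
    allele at the same SNPs, in the same order.
    """
    numerator = 0.0
    denominator = 0.0
    for pw, px, py, pz in zip(w, x, y, z):
        # Numerator: covariance-like term between the W-X and Y-Z differences.
        numerator += (pw - px) * (py - pz)
        # Denominator: normalisation that keeps D between -1 and 1.
        denominator += (pw + px - 2 * pw * px) * (py + pz - 2 * py * pz)
    if denominator == 0:
        raise ValueError("no informative SNPs: denominator is zero")
    return numerator / denominator
```

    A value near zero is what you expect when W and X are symmetrically related to Y and Z; whether a non-zero value means anything depends on the standard error, which this sketch does not compute.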

  10. #7
    Registered Users
    Posts
    963
    Sex
    Omitted

    Quote Originally Posted by anglesqueville View Post
    OK. I'm not a professional, but I've worked a lot on D-stats, including their mathematical background (mathematics is my job, even if statistics and probability theory are far from being "my" maths). I couldn't tell you how many hours I've spent on Patterson's fundamental text. I draw three conclusions from this work:
    1) The best way to use D-stats is direct use.
    2) The only sensible way to combine formal stats is qpAdm. The so-called "2 outgroups" method is nothing serious, to say the least.
    3) There is a large empirical part in the use of D-stats, which often means you'll have to run many trials and look inside the population clusters, down to individuals. Therefore you must do the work yourself, no escape. Learn PLINK, learn EIGENSOFT, learn ADMIXTOOLS. I would like to be able to add "learn BEAGLE/FastIBD", but honestly my own trials with FastIBD so far have been just pathetic ...
    1) Unquestionably.
    2) Perhaps so, but if it can produce logical results which correspond with other formal methods (including those published in literature) then there has to be some validity.
    3) I shall.

  11. #8
    Registered Users
    Posts
    197
    Sex

    Ireland England Ireland Munster European Union
    About FastIBD, how easy is it to run? Does it require LD (linkage disequilibrium) pruning to be done first? I've tried IBD/IBS stuff in PLINK using the --genome flag, but I've heard some people say PLINK isn't great for IBD/IBS. My results with --genome were a bit weird, but that was probably because I didn't do any LD pruning beforehand.

  12. #9
    Moderator
    Posts
    4,592
    Sex
    Location
    Normandy
    Ethnicity
    northwesterner
    Y-DNA
    U152>L2>Z367
    mtDNA
    H5a1

    Quote Originally Posted by Bas View Post
    About FastIBD, how easy is it to run? Does it require LD (linkage disequilibrium) pruning to be done first? I've tried IBD/IBS stuff in PLINK using the --genome flag, but I've heard some people say PLINK isn't great for IBD/IBS. My results with --genome were a bit weird, but that was probably because I didn't do any LD pruning beforehand.
    The only version of BEAGLE I know is the latest one (v4.1). That one is self-contained: everything (imputation, phasing, IBD detection) can be done without leaving it. I tried it only on my dad's genome (already imputed). Difficult, at least for me, and very slow on my too-old computer. I've never tried IBD detection in PLINK.

  13. The Following User Says Thank You to anglesqueville For This Useful Post:

     Bas (03-15-2018)

  14. #10
    Registered Users
    Posts
    963
    Sex
    Omitted

    I've just read a lot of the information on the PLINK website...I'm running into 2 obstacles. 1) I'm not a programmer. 2) It looks like that requires learning a lot before one can do anything, and I'm only looking to do a few things (hell I'd be content with pruning non-overlapping snps between genomes and d-stats, and nothing else). I'm calculating it would be quicker (if possible to convert genomes to plain text) to just do the pruning in excel rather than learn what looks to me like Yoda playing Sumerian charades.

    EDIT: Actually Bas gave me a great guide on how to do d-stats, thanks again Bas! So I guess really I just need to know how to merge datasets and prune non-overlapping snps? Actually, I probably don't even need to prune non-overlapping snps, that's of secondary importance if anything. So how would I merge datasets?
    Last edited by Kale; 03-16-2018 at 05:22 AM.

  15. The Following 2 Users Say Thank You to Kale For This Useful Post:

     Bas (03-17-2018), MindHive (04-05-2018)
