Tuesday, December 22, 2015

Multi-marker tests of association using ensemble learning

The following R program can be used for testing association of a group of markers with continuous and case-control traits. (For methodological details please refer to: Padhukasahasram B, Reddy CK, Levin AM, Burchard EG, Williams LK (2015) Powerful Tests for Multi-Marker Association Analysis Using Ensemble Learning. PLoS ONE 10(11): e0143489. doi:10.1371/journal.pone.0143489)



The following are the command line parameters to be specified in this same order:

-snps Number of markers.

-features Number of markers to use in predictive model. Must be <= snps.

-samples Number of samples.

-trait 1 for continuous traits and 2 for case-control data.

-covs Number of covariates to adjust for.

-snpfile Filename storing SNP genotype data (e.g. 0, 1, or 2). Numbers must be separated by spaces and rows represent samples while columns represent markers.

-phenofile Filename storing phenotype information. Each line represents the phenotype value for samples in the same order as in snpfile.

-covfile Filename storing information about covariates. Each row should have values of covariates for a sample separated by spaces. Samples should be in the same order as in snpfile.

For example:

Rscript multimarker-association-version2.0-faster.R -snps 12 -features 3 -samples 2790 -trait 1 -covs 3 -snpfile snps -phenofile pheno -covfile covariates



The following R program can be used to conduct a multi-marker test for interactions.
https://sourceforge.net/projects/multi-marker-association-tests/files/multimarker-interactions-test.R/download


Usage is same as before. For example:
Rscript multimarker-interactions-test.R -snps 12 -features 3 -samples 2790 -trait 1 -covs 3 -snpfile snps -phenofile pheno -covfile covariates


Saturday, September 20, 2014

Bayesian recombination inference program

InferRho is a MCMC based program for jointly estimating the crossing-over and gene-conversion parameters from population genomic datasets under a Bayesian framework. It uses a full-likelihood method to infer the posterior distribution of recombination rates along the sequence under a variable recombination rate model assuming that the ratio of gene- conversion to crossing-over rate ( f ) is uniform along the sequence. The program assumes a prior distribution for background recombination rates and hotspots. The starts, ends and intensities of hotspots are also output by the program in the posterior distribution generated by the MCMC chain.
The source code is available at:

References


Wang, Y., and B. Rannala, 2008 Bayesian inference of fine-scale recombination rates using population genomic data. Philos. Trans. R. Soc. B
Wang, Y., and B. Rannala, 2009 Population genomic inference of recombination rates and hotspots. Proc. Natl. Acad. Sci. USA
Badri Padhukasahasram and Bruce Rannala 2011 Bayesian population genomic inference of crossing-over and gene-conversion. Genetics
Input File Format
1
h

10
15
A T T C C G A C G G A G A C A
T A G C G G C C G C A A A C T
A T T C G G A T G C A A A C A
A T G C C G C C G C A A A C A
A A G C C G C T C C A G C C A
A T T C C G A T G C C G C A T
A A G C C G C T C C A G C C A
A T G C G A A T G C C G C A T
T A G C G G C C G C A G C C T
A A G C G G C C G C A G C C T
1530 1607 2123 2695 2937 3859 5148 6017 6335 6568 6693 6705 7769 8206 9285

The input file has to be in the following format:
Line 1: Should always be 1.
Line 2: Character ‘h’ or ‘g’ to denote haplotype or genotype data respectively.
Line 3: Blank.
Line 4: Number of Individuals in the sample.
Lines 5: Number of Markers in the sample.
For haplotype data, each subsequent line represents the haplotype of an individual in the sample.
For genotype data, consecutive 2 lines represent a single individual with each base of a genotyped marker in each line corresponding to the marker location.
The bases at different marker positions should be separated by spaces. “N” must be used for denoting missing data at a base position.
Last line: Positions of markers in the sample separated by spaces.
We also suggest omitting SNPs with minor allele frequency < 10% from your data and applying the program to short windows in your dataset (e.g. 5-7 kb).
Command Line Options
IRv1 – Inference of Recombination Rates and Hotspots version: 1.0.0
Type Make to compile the program from the Makefile.
Usage: ./IRv1   inputfile    [options]
Options:
-burnin n Number of burnin iterations. Default is 20000.
-iter n Number of iterations. Default is 100000.
-thin n Thinning iterations. Default is 10. (This means we sample after every 10 iterations of the chains).
-dTheta x Mixing parameter for the mutation parameter theta. Default is 0.2.
-dRho x Mixing parameter for background recombination rate. Default is 100.
-dScale x Mixing parameter for background recombination rate. Default is 0.1.
-nChains n Number of chains. Default is 5.
-temp x Temperature of chains. Default is 1.05.
-seed x Seed. Default seed is set based on system time.
For example you can type on the command line:
./IRv1 input.txt -burnin 20000 -iter 100000 -thin 10 -dTheta 0.2 -dRho 100 -dScale 0.1 -nChains 5 -temp 1.05 -seed 10
Output Files
There are many output files generated by the program.
*.log: Contains the parameters with which the chain was run.
*.rho: Contains the values of f and {crossing-over rate, gene-conversion rate, mean tract length, theta} between adjacent markers along the chain.
*.hotsp: Contains the starts, ends and the intensity of hotspots along the chain.
*.rhoBackg: Contains the value of background crossing-over rates between markers at different steps in the chain.

Sunday, December 27, 2009

Natural Selection Simulator

Catalogued on GSR
All these programs can be run if you have the gcc compiler (freely available at: http://gcc.gnu.org) installed in your computer. Please report any questions or concerns to: pkbadri at yahoo.com.

The program given below is based on the algorithm described in Padhukasahasram et al. 2008.




Download the 3 program files in the links above. newsel.c can simulate a Wright-Fisher population of constant size undergoing mutation, recombination and natural selection at multiple sites. First create a file named newselinput.txt in the same directory in which you want to run this program. Fix the number and positions of selected sites and selection coefficients in newselinput.txt. This file shud have the following:


1st line: Number of sites under selection.
2nd line: Positions of selected sites in ascending order separated by spaces. Positions vary from 0 to len - 1, where len is the length of the DNA sequence in base pairs.
3rd line: Respective selection coefficients for the above positions for the genotype heterozygous for alleles (10 or 01). These should be separated by spaces.
4th line: Respective selection coefficients for the above positions for the genotype homozygous for one allele (11). These should be separated by spaces.
5th line: Respective selection coefficients for the above positions for the genotype homozygous for other allele (00). These should be separated by spaces.

For example:
5
0 100 200 300 499
0.5 0.5 0.5 0.5 0.5
0 0 0 0 0
0 0 0 0 0

The fitness for each genotype is equal to 1 + selection coefficient. Fitness effects are either added or multiplied across the sites. Compile using g++ -O3 -o sel newsel.c mtrand.cpp -Wno-deprecated. There are 10 different command line parameters:

-samples (Number of samples to output from final population. Choose < pop/2)
-pop (Total number of chromsomes in the diploid population. Should be a even number.)
-len (Length of the DNA sequence in basepairs. Choose larger than del*u*pop)
-r (Per-generation per-sequence rate of recombination)
-u (Per-generation per-sequence rate of mutation)
-s (Probability of self-fertilization)
-gen (Total number of generations)
-del (Generations after which fixed mutations are removed)
-reps (Number of replicates)
-fitness (1 to add fitness effects. 2 to multiply effects)

To run type ./sel along with all the command line parameters in the same order. For example: ./sel -samples 100 -pop 1000 -len 1000000 -r 0.50 -u 0.50 -s 0.0 -gen 10000 -del 500 -reps 10 -fitness 2. Outputs positions of biallelic polymorphisms and haplotypes in 0-1 format for samples collected from the final population.

Forwsim

Catalogued on GSR

The program given below is based on the algorithm described in Padhukasahasram et al. 2008.

Forward Wright-Fisher Simulator

Download the files in the 3 links above. forwsim.c is a program that can efficiently simulate the evolution of a diploid Wright-Fisher population of constant size with uniform crossing-over rate, uniform mutation rate and with or without self-fertilization, forward in time. To compile type: g++ -O3 -o forwsim forwsim.c mtrand.cpp -Wno-deprecated. There are 8 different command line parameters which are:

-samples (Number of samples to output from final population.)

-pop (Total number of chromsomes in the diploid population. Should be a even number.)

-len (Length of the DNA sequence. Choose larger than del*u*pop)

-r (Per-generation per-sequence rate of recombination)

-u (Per-generation per-sequence rate of mutation)

-s (Probability of self-fertilization)

-gen (Total number of generations)

-del (Generations after which fixed mutations are removed)


To run type ./forwsim along with all the command line parameters in the same order. For example: ./forwsim -samples 100 -pop 1000 -len 1000000 -r 0.50 -u 0.50 -s 0.0 -gen 10000 -del 500. The output file is called finalpopulation.txt. Each line in this file corresponds to a chromosome in the population and the numbers represent positions of mutations in it. Consecutive pairs of lines represent homologous chromosomes in the diploid population.

Saturday, July 11, 2009

Standard Coalescent Simulators

Standard Coalescent Simulator

Download the 3 files in the links above. arg.c is a C++ program that can efficiently simulate datasets under the standard coalescent with uniform mutation rate, constant population size and with either non-uniform or uniform crossing-over and conversion rates. Parameters can be changed at the top of the code. Can be compiled using: g++ -O3 -o arg arg.c mtrand.cpp -Wno-deprecated and run by typing ./arg. Outputs SNP count and SNP locations with the alleles at those locations.


Fixed Segregating Sites Simulator

Download the 3 files in the links above. fixedseg.c allows the user to fix the positions of segregating sites in the sequence. Parameters can be changed at the top of the code. Can be compiled by typing: g++ -O3 -o fixedseg fixedseg.c mtrand.cpp -Wno-deprecated and run by typing ./fixedseg. Outputs SNP count, their positions and the haplotypes in 0-1 format where 0 denotes ancestral allele and 1 denotes derived allele.

Friday, July 10, 2009

Recombination estimation program

Crossing over and Gene conversion Estimation
http://www.sourceforge.net/projects/forwsim/files/fixeds.c/download
http://www.sourceforge.net/projects/forwsim/files/mtrand.h/download
http://www.sourceforge.net/projects/forwsim/files/mtrand.cpp/download
The C++ program fixeds.c outputs 20 different summary statistics (described in Padhukasahasram et al. 2006 and references there) under models with constant population size and uniform or nonuniform crossing-over and gene-conversion rates. The positions of segregating sites, population crossing-over and gene-conversion rates can be specified by the user at the top of the file fixeds.c. Program can be compiled using: g++ -O3 -o fixeds fixeds.c mtrand.cpp -Wno-deprecated and run by typing ./fixeds. The summaries that are output are respectively:
1. Frequency of D' < 1.0 (30% acceptance error)
2. Frequency of D' < 0.75
3. Frequency of D' < 0.50
4. Average number of distinct haplotypes for 5 kb sliding windows with 4 kb overlap. (15% acceptance error)
5. Average number of distinct haplotypes for 10 kb sliding windows with 9 kb overlap. (15% acceptance error)
6. Total number of distinct haplotypes H (15% acceptance error).
7. Frequency of D'(AB) < D'(AC) and D'(BC) < D'(AC) for 5 kb range. (30% acceptance error)
8. Frequency of D'(AB) < D'(AC) or D'(BC) < D'(AC) for 5 kb range. (30% acceptance error)
9. Frequency of D'(AB) < D'(AC) and D'(BC) < D'(AC) for 10 kb range. (30% acceptance error)
10. Frequency of D'(AB) < D'(AC) or D'(BC) < D'(AC) for 10 kb range. (30% acceptance error)
11. Frequency of D'(AB) < 1.00 and D'(BC) < 1.00 for 50 kb range.
12. Frequency of D'(AB) < 0.75 and D'(BC) < 0.75 for 50 kb range.
13. Frequency of D'(AB) < 0.50 and D'(BC) < 0.50 for 50 kb range. (30% acceptance error)
14. Frequency of D'(AB) < 0.25 and D'(BC) < 0.25 for 50 kb range.
15. Frequency of D'(AB) < 0.10 and D'(BC) < 0.10 for 50 kb range.
16. Frequency of D'(AB) < 0.50 and D'(BC) < 0.50 for 5 kb range.
17. Frequency of D'(AB) < 0.25 and D'(BC) < 0.25 for 5 kb range.
18. Frequency of D'(AB) < 0.50 and D'(BC) < 0.50 for 10 kb range.
19. Frequency of D'(AB) < 0.25 and D'(BC) < 0.25 for 10 kb range.
20. Frequency of D'(AB) < 0.10 and D'(BC) < 0.10 for 10 kb range.

Summaries that are most accurate for joint crossing-over and gene-conversion estimation are 1, 6, 13 for long-range data and 4, 5, 7, 8, 9, 10 for short-range. Suggested acceptance errors are given in brackets. These choices are slightly different from those used in Padhukasahasram et al 2006 and work better under fixed segregating sites model. Including the bounds on the minimum number of recombination events (Myers and Griffiths 2003, Song, Wu and Gusfield 2005 or Bafna and Bansal 2005) in the rejection scheme and smoothing the likelihood surfaces can result in further improvements in accuracy. For sequence length different from 50 kb, it may be necessary to change the distance cutoffs i.e. choose values other than 5 kb, 10 kb and 50 kb. Distance cutoffs can be changed at the end of the code.