NAME
split-ppred-ali.pl - Split ALI files into subsets of sites based on ppred data
VERSION
version 0.181120
SYNOPSIS
$ split-ppred-ali.pl cpVITRELLA-80x8363.puz --phylip \
    --sim-files=`ls cpNOVITRELLA-79x8363-CATGTRG-PP_sample_*.ali` \
    --sim-seq-list=sim.idl --obs-seq-list=obs.idl \
    --bin-number=10 --percentile --out=-ppred
$ cat sim.idl
Karlodiniu
$ cat obs.idl
Vitrella_b
# for testing
$ perl -Ilib bin/split-ppred-ali.pl test/for-ppred.phy --phylip \
    --sim-files=`ls test/ppred-*.phy` \
    --sim-seq-list=test/sim.idl --obs-seq-list=test/obs.idl \
    --bin-number=10 --percentile --out=-ppred --only-mask
$ perl -Ilib bin/split-ppred-ali.pl test/for-ppred.phy --phylip \
    --sim-files=`ls test/ppred-*.phy` --by-seq --only-dump-freqs
At each site, a profile is computed from the simulated primary sequences for ids listed in --sim-seq-list and compared to the character state observed in the sequences listed in --obs-seq-list. A mask corresponding to the simulated frequencies for the observed states is built and sites are ranked according to these descending frequencies, which means that the highest bins include sites where the observed state is rarely (or never) found in simulations. Sites where the state is a gap or is missing always get the maximum frequency and thus fall in the lowest bins.
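The per-site computation described above can be sketched as follows. This is an illustrative Python sketch, not the script's actual Perl code: the function names, the dict-based alignment representation and the set of symbols treated as gap or missing are all assumptions made for the example.

```python
from collections import Counter

# Assumption: symbols treated as gap/missing states (the real tool may differ).
GAP_OR_MISSING = set("-*?Xx ")

def site_profiles(sim_alignments, sim_ids=None):
    """Average state frequencies per site over simulated alignments.

    sim_alignments: list of {seq_id: sequence} dicts of equal width.
    sim_ids: optional set of ids to restrict to (None means all seqs).
    """
    n_sites = len(next(iter(sim_alignments[0].values())))
    counts = [Counter() for _ in range(n_sites)]
    for ali in sim_alignments:
        for seq_id, seq in ali.items():
            if sim_ids is not None and seq_id not in sim_ids:
                continue
            for pos, state in enumerate(seq):
                if state not in GAP_OR_MISSING:
                    counts[pos][state] += 1
    profiles = []
    for c in counts:
        total = sum(c.values())
        profiles.append({s: n / total for s, n in c.items()} if total else {})
    return profiles

def observed_freqs(profiles, obs_seq):
    """Simulated frequency of each observed state; gap/missing get the maximum (1.0)."""
    return [1.0 if state in GAP_OR_MISSING else prof.get(state, 0.0)
            for prof, state in zip(profiles, obs_seq)]

def rank_sites(freqs):
    """Site indices ordered by descending frequency (ties keep alignment order)."""
    return sorted(range(len(freqs)), key=lambda i: -freqs[i])

# Four simulated replicates of one sequence, three sites wide.
sims = [{"Karlodiniu": "AAC"}, {"Karlodiniu": "AGC"},
        {"Karlodiniu": "AGT"}, {"Karlodiniu": "ATC"}]
profiles = site_profiles(sims, sim_ids={"Karlodiniu"})
freqs = observed_freqs(profiles, "CGT")
```

With these toy inputs, the observed C at the first site never occurs in the simulations, so that site ranks last and would end up in the highest bin.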
USAGE
split-ppred-ali.pl <infiles> --sim-files=<files>... [optional arguments]
REQUIRED ARGUMENTS
- <infiles>
-
Path to input ALI files [repeatable argument]. If infiles are not in ALI but in PHYLIP format, use the --phylip option below.
- --sim-files=<files>...
-
List of paths to simulated input files. These files are assumed to be in PHYLIP format as they result from PhyloBayes' ppred.
OPTIONAL ARGUMENTS
- --out[-suffix]=<suffix>
-
Suffix to append to (possibly stripped) infile basenames for deriving outfile names [default: none]. When not specified, outfile names are taken from infiles and the original infiles are preserved under the same names with a .bak suffix appended.
- --sim-seq-list=<file>
-
Path to IDL file listing the ids of the sequences from which site profiles will be computed after acquiring simulated input files [default: all seqs].
- --obs-seq-list=<file>
-
Path to IDL file listing the ids of the sequences that will give observed frequencies and thus govern site trimming in infiles [default: all seqs].
- --by-seq
-
Enable seq-specific simulated site profiles [default: no]. When not specified, average site profiles are computed from simulated input files.
- --from-scafos
-
Consider the input ALI file as generated by SCaFoS [default: no]. Currently, specifying this option turns all ambiguous and missing character states into gaps.
- --del-const
-
Delete constant sites just as the -dc option of PhyloBayes [default: no].
- --phylip
-
Assume infiles and outfiles are in PHYLIP format (instead of ALI format) [default: no].
- --bin-number=<n>
-
Number of bins to define [default: 10].
- --percentile
-
Define bins containing an equal number of sites rather than bins of equal width in terms of observed state frequencies [default: no].
- --cumulative
-
Define bins including all previous bins [default: no]. This leads to ALI outfiles of increasing width where the sequences listed in --obs-seq-list include ever more character states rarely observed in simulated primary sequences for ids listed in --sim-seq-list.
- --only-mask
-
Mask rarely observed states in sequences listed in --obs-seq-list instead of removing the corresponding sites from the alignment [default: no].
- --only-dump-freqs
-
Output simulated and observed state frequencies instead of producing regular output files [default: no]. When specified, this option supersedes all those pertaining to site binning.
- --reorder
-
Reorder sequences by descending observed frequencies [default: no]. This option only applies when --only-dump-freqs is specified.
- --version
- --usage
- --help
- --man
-
Print the usual program information.
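To make the difference between the binning options concrete, here is a small Python sketch of equal-width bins (the default), equal-count bins (--percentile) and cumulative subsets (--cumulative). It is an illustration under stated assumptions rather than the script's implementation: the function names are hypothetical, equal-width bins are assumed to span the whole [0, 1] frequency range, and tied frequencies are assumed to keep alignment order.

```python
def equal_width_bins(freqs, n_bins):
    """Bin index per site; bin 0 covers the highest frequencies.

    Assumption: bins split the [0, 1] range into n_bins equal intervals.
    """
    return [min(int((1.0 - f) * n_bins), n_bins - 1) for f in freqs]

def percentile_bins(freqs, n_bins):
    """Bin index per site so that bins hold (nearly) equal numbers of sites."""
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    bins = [0] * len(freqs)
    for rank, site in enumerate(order):
        bins[site] = rank * n_bins // len(freqs)
    return bins

def sites_per_bin(bins, n_bins, cumulative=False):
    """List of site indices for each bin; cumulative bins include all previous bins."""
    return [[i for i, b in enumerate(bins)
             if (b <= k if cumulative else b == k)]
            for k in range(n_bins)]

# Simulated frequencies of the observed states at eight sites (toy values).
freqs = [1.0, 0.0, 0.5, 0.75, 1.0, 0.25, 1.0, 1.0]

width = equal_width_bins(freqs, 4)
pct = percentile_bins(freqs, 4)
subsets = sites_per_bin(pct, 4)
growing = sites_per_bin(pct, 4, cumulative=True)
```

In this toy case the four sites with frequency 1.0 all share the lowest equal-width bin, whereas the percentile variant spreads them over two bins so that each bin holds two sites. The cumulative subsets grow until the last one covers the full alignment.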
AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.