NAME
alncut - filter sites in alignments based on variation and gap-content
SYNOPSIS
alncut [options] [MULTIFASTA-FILE...]
DESCRIPTION
alncut takes alignment data in multifasta format as input and returns that data filtered for sites with various properties. By default, only invariant sites (sites with no variation) are returned. When the -f option is used, sites are returned that are invariant up to a specified cut-off. More precisely, a site is returned if the complement of the largest frequency component of that site (that is, the combined frequency of all characters other than the most common one) is less than or equal to the cut-off.
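The cut-off rule above can be sketched in a few lines of Python. This is only an illustration of the criterion as described, not alncut's actual implementation: the function names are hypothetical, site indices are 0-based, and the gap character is treated like any other character, which is an assumption.

```python
from collections import Counter

def minor_variant_count(column):
    """Number of sequences that do NOT carry the most common character."""
    counts = Counter(column)
    return len(column) - max(counts.values())

def passes_cutoff(column, cutoff=0):
    """Integer cut-off: compare counts. Float cut-off: compare relative frequency."""
    minor = minor_variant_count(column)
    if isinstance(cutoff, float):
        return minor / len(column) <= cutoff
    return minor <= cutoff

def select_sites(alignment, cutoff=0):
    """Return 0-based indices of columns that satisfy the cut-off (assumes equal-length rows)."""
    columns = zip(*alignment)
    return [i for i, col in enumerate(columns) if passes_cutoff(col, cutoff)]

alignment = ["ACGTA", "ACGAA", "ACTTC", "ACGTC"]
print(select_sites(alignment))        # invariant sites only: [0, 1]
print(select_sites(alignment, 1))     # at most one minor variant: [0, 1, 2, 3]
print(select_sites(alignment, 0.5))   # minor frequency <= 0.5: [0, 1, 2, 3, 4]
```

With a cut-off of 0, the default, the rule reduces to selecting invariant sites.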
alncut may also be used to degap alignments. Gap-free sites may be selected with the -g option. When combined with the -f option, sites will be returned that are gap-free up to a cut-off, i.e. in which the gap-frequency is less than or equal to the cut-off.
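As a rough sketch of the gap-frequency rule, the following Python counts gaps per column and applies an integer or fractional cut-off. The names are hypothetical, the gap character is assumed to be "-", and indices are 0-based; alncut itself may differ on all of these points.

```python
GAP = "-"  # assumed gap character

def gap_count(column):
    """Number of sequences with a gap at this site."""
    return sum(ch == GAP for ch in column)

def gapfree_sites(alignment, cutoff=0):
    """0-based indices of sites whose gap count (int) or gap frequency (float) is <= cutoff."""
    selected = []
    for i, column in enumerate(zip(*alignment)):
        gaps = gap_count(column)
        value = gaps / len(column) if isinstance(cutoff, float) else gaps
        if value <= cutoff:
            selected.append(i)
    return selected

alignment = ["AC-T-", "A--TA", "AC-TA"]
print(gapfree_sites(alignment))        # strictly gap-free: [0, 3]
print(gapfree_sites(alignment, 1))     # gaps in at most one sequence: [0, 1, 3, 4]
print(gapfree_sites(alignment, 0.5))   # gap frequency <= 0.5: [0, 1, 3, 4]
```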
With the --allgap or -a option, alncut returns sites that contain only gaps; in this mode the -f option is ignored. In all of its uses, the -v option causes alncut to output the set-complement of the sites it has selected. Therefore, to print all sites that are not all-gap, combine the -a and -v options.
Parsimoniously informative sites are variable sites in which at least two different site-characters or states are each represented in at least two different sequences. alncut will return parsimoniously informative sites with the --parsinf or -p option.
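The combination of all-gap selection and set-complement can be illustrated as follows. This is a minimal Python sketch under stated assumptions ("-" as the gap character, hypothetical function names, 0-based indices); it mirrors the behaviour described for combining -a and -v, not the tool's code.

```python
GAP = "-"  # assumed gap character

def allgap_sites(alignment):
    """0-based indices of sites that contain only gaps."""
    return [i for i, col in enumerate(zip(*alignment))
            if all(ch == GAP for ch in col)]

def complement(selected, length):
    """Set-complement of the selected site indices, in alignment order."""
    chosen = set(selected)
    return [i for i in range(length) if i not in chosen]

def keep_sites(alignment, sites):
    """Rebuild each sequence from the retained sites only."""
    return ["".join(seq[i] for i in sites) for seq in alignment]

alignment = ["AC-T-", "A--TA", "AC-TA"]
selected = allgap_sites(alignment)                # [2]
kept = complement(selected, len(alignment[0]))    # [0, 1, 3, 4]
print(keep_sites(alignment, kept))                # ['ACT-', 'A-TA', 'ACTA']
```

Only the column that is a gap in every sequence is dropped; sites with gaps in some sequences are retained.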
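The definition of a parsimoniously informative site translates directly into a count over character states. The sketch below is illustrative only: the function names are hypothetical, indices are 0-based, and gaps are treated like any other character, which is a simplifying assumption that may not match how alncut handles them.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two different states each occur in at least two sequences."""
    counts = Counter(column)
    states_seen_twice = sum(1 for n in counts.values() if n >= 2)
    return states_seen_twice >= 2

def parsinf_sites(alignment):
    """0-based indices of parsimoniously informative sites."""
    return [i for i, col in enumerate(zip(*alignment))
            if is_parsimony_informative(col)]

alignment = ["AAGT", "AAGC", "ACTC", "ACTA"]
print(parsinf_sites(alignment))   # [1, 2]
```

Site 0 is invariant, and site 3 is variable but has only one state (C) that occurs twice, so neither qualifies.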
Options specific to alncut:
  -g, --gapfree             print gap-free sites
  -a, --allgap              print all-gap sites
  -p, --parsinf             print parsimoniously informative sites
  -v, --negate              print set-complement of selected sites
  -f, --frequency=<int>     print sites with max <int> minor variants or gaps
  -f, --frequency=<float>   print sites with max <float> minor variants or gaps
  -V, --verbose             report number and indices of selected sites to STDERR
Options general to FAST:
  -h, --help                        print a brief help message
  --man                             print full documentation
  --version                         print version
  -l, --log                         create/append to logfile
  -L, --logname=<string>            use logfile name <string>
  -C, --comment=<string>            save comment <string> to log
  --format=<format>                 use alternative format for input
  -m, --moltype=<[dna|rna|protein]> specify input sequence type
INPUT AND OUTPUT
alncut is part of FAST, the FAST Analysis of Sequences Toolbox, based on Bioperl. Most core FAST utilities expect input and return output in multifasta format. Input can be read from one or more files or from STDIN. Output is written to STDOUT. The FAST utility fasconvert can convert other formats to and from multifasta.
OPTIONS
- -g, --gapfree
Print only sites that contain no gaps.
- -a, --allgap
Print only sites that contain exclusively gaps.
- -p, --parsinf
Print only sites that are parsimoniously informative. Parsimoniously informative sites are variable sites in which at least two different site-characters or states are each represented in at least two different sequences.
- -v, --negate
Print the set-complement of the sites otherwise selected. Used as the sole option, it prints only variable sites.
- -f [int], --frequency=[int]
Print sites that contain gaps or minor variants in at most [int] sequences.
- -f [float], --frequency=[float]
Print sites that contain gaps or minor variants up to a maximum relative frequency of [float].
- -V, --verbose
Print the number and indices of the sites selected by the criteria to STDERR.
- -h, --help
Print a brief help message and exit.
- --man
Print the manual page and exit.
- --version
Print version information and exit.
- -l, --log
Create, or append to, a generic FAST logfile in the current working directory. The logfile records the date and time of execution, the full command with options and arguments, and an optional comment.
- -L [string], --logname=[string]
Use [string] as the name of the logfile. Default is "FAST.log.txt".
- -C [string], --comment=[string]
Include comment [string] in the logfile. No comment is saved by default.
- --format=[format]
Use an alternative format for input. See the man page for "fasconvert" for allowed formats. This is for convenience; the FAST tools are designed to exchange data in Fasta format, and "fasta" is the default format for this tool.
- -m [dna|rna|protein], --moltype=[dna|rna|protein]
Specify the type of sequence on input (should not be needed in most cases, but sometimes Bioperl cannot guess and complains when processing data).
EXAMPLES
Print sites that are not all gap:
alncut -av data.fas
Print sites with gaps in at most 2 sequences:
alncut -gf 2 data.fas
Print sites in which the frequency of minor variants is at most 15 percent:
alncut -f 0.15 data.fas
Print variable sites:
alncut -v data.fas
SEE ALSO
To degap each sequence on input individually, see
fastr --degap
man perlre
perldoc perlre
Documentation on Perl regular expressions.
man FAST
perldoc FAST
Introduction and cookbook for FAST.
- The FAST Home Page
CITING
If you use FAST, please cite Ardell (2013), FAST: FAST Analysis of Sequences Toolbox, Bioinformatics; and Bioperl, Stajich et al.