NAME
bioaln - Alignment manipulations based on BioPerl
SYNOPSIS
bioaln [options] <alignment file>
bioaln [-h | --help | -V | --version | --man]
Alignment descriptors
bioaln -l aln_file # [l]ength of an alignment
bioaln -L aln_file # [L]ist sequence IDs
bioaln -n aln_file # [n]umber of aligned sequences
bioaln -a aln_file # [a]verage percent identity
bioaln -w '30' aln_file # average identities for sliding [w]indows of 30
Alignment viewers
bioaln -c aln_file # [c]odon view (groups of 3 nts)
bioaln -m aln_file # [m]atch view (show variable sites)
Alignment filters (output a new alignment)
bioaln -d 'Seq1,Seq2' aln_file # [d]elete sequences
bioaln -p 'Seq1,Seq2' aln_file # [p]ick sequences
bioaln -i 'fasta' fasta_aln_file # [i]nput FASTA alignment (CLUSTALW is default)
bioaln -o 'fasta' aln_file # [o]utput a FASTA alignment (CLUSTALW is default)
bioaln -g aln_file # remove [g]apped sites
bioaln -r 'seq_id' aln_file # change [r]eference (1st) sequence
bioaln -s '10,20' aln_file # [s]lice alignment from 10-20
bioaln -u aln_file # get [u]nique sequences
bioaln -v aln_file # show only [v]ariable sites
bioaln --pep2dna 'cds.fas' pep.aln # Back-align CDS seqs according to protein alignment
bioaln --dna2pep cds.aln # DNA alignment => protein alignment
Evolutionary analysis
bioaln --concat *.aln # concatenate aln files (throw error if IDs don't match)
bioaln --con-blocks aln_file # extract conserved blocks
bioaln --shuffle-sites aln_file # shuffle sites (for testing conserved blocks)
bioaln --resample '10' aln_file # [R]esample an alignment to 10 random sequences
bioaln --boot aln_file # bootstrap an alignment (for testing branch stability)
bioaln --permute-states aln_file # permute at each site (for testing tree-ness)
bioaln --remove-third aln_file # remove [T]hird site (assume coding sequences)
Change alignment format
bioaln -i 'fasta' -o 'phylip' fasta_aln_file # FASTA => PHYLIP
bioaln -i 'fasta' -o 'paml' fasta_aln_file # FASTA => PAML
Chaining with pipes
bioaln -i'fasta' fasta.aln | bioaln -s'10,20' | bioaln -a # read, slice & identity
bioaln -o 'fasta' cds.aln | bioseq -t1 | bioaln -i 'fasta' # chain with bioseq: CDS => protein alignment
DESCRIPTION
bioaln performs common, routine manipulations of sequence alignments based on BioPerl modules including Bio::AlignIO, Bio::SimpleAlign and Bio::Align::Utilities. By default, bioaln assumes that both the input and the output files are in CLUSTALW format so that multiple bioaln runs can be chained with UNIX pipes.
Users are encouraged to use bioseq for sequence manipulations not depending on alignment (e.g., deletion, picking sequences), by transforming an alignment into FASTA format.
OPTIONS
- --aln-index, -I <seq_id,position>
-
Return aligned position of a residue of a sequence based on its unaligned (gap-free) position.
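The mapping from an unaligned position to an alignment column can be illustrated with a minimal Python sketch. The function name `aligned_position`, the 1-based coordinates, and the set of gap characters are assumptions made for illustration; this is not bioaln's own code.

```python
def aligned_position(aligned_seq: str, ungapped_pos: int, gap_chars: str = "-.") -> int:
    """Return the 1-based alignment column holding the residue found at the
    1-based position `ungapped_pos` of the gap-free sequence (hypothetical helper)."""
    seen = 0  # number of non-gap residues passed so far
    for column, char in enumerate(aligned_seq, start=1):
        if char not in gap_chars:
            seen += 1
            if seen == ungapped_pos:
                return column
    raise ValueError("position exceeds the length of the gap-free sequence")


# Residue 4 of "ATGAAT" sits in column 6 once two gap columns are inserted.
print(aligned_position("ATG--AAT", 4))  # prints 6
```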
- --avg-pid, -a
-
Return average percent identity of an alignment.
- --binary
-
Transform sequences into binary (0/1) strings for e.g., PHYLIP suites (see below)
- --bin-inform
-
Print only binary informative sites (e.g., for parsimony analysis). Example PHYLIP application:
bioaln --bin-inform --binary -o'phylip' foo.aln > foo.phy
- --boot, -b
-
Produce a bootstrapped alignment. To produce multiple bootstrapped alignments, use a BASH loop, e.g.:
for i in {1..10}; do bioaln -b foo.aln > foo.boot-$i.aln; done
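Bootstrapping an alignment means drawing columns at random with replacement until the original length is reached. The following Python sketch shows the idea under stated assumptions: the alignment is a plain dict of equal-length strings, and `bootstrap_columns` is a hypothetical name, not a bioaln or BioPerl function.

```python
import random


def bootstrap_columns(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement (illustrative sketch)."""
    length = len(next(iter(alignment.values())))
    # Draw `length` column indices; the same column may be drawn repeatedly.
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


example = {"Seq1": "ATGAAT", "Seq2": "ATGACT", "Seq3": "ACGAAT"}
replicate = bootstrap_columns(example, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Every column of the replicate is an intact column of the input, so within-column patterns are preserved while column frequencies vary between replicates.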
- --codon-view, -c 'num-of-codon-per-line' (default 20 codons per line)
-
Print a CLUSTALW-like alignment, but separated by codons. Intended for use with in-frame DNA sequences. Block-final position numbers are printed at the end of every alignment block at the point of wrapping, and block-initial counts appear over first nucleotide in a block.
If invoked as
--codon-view=n
where n is some number, bioaln will print n codons per line. Other normally stackable options, such as --match, can be used alongside it. If piping through bioaln, ensure codon-view is used in the last invocation.
For
bioaln -c input_DNA.aln
when input_DNA.aln contains:
    Seq1  ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAAATAAGC
    Seq2  ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAAATAAGC
    Seq3  ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAAATAAGT
    Seq4  ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAAATAAGT
    Seq5  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAATAAGC
    Seq6  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAATAAGT
          ******** ** * *** *************** **** ********
you get:
                                                                        4
          1                                                             8
    Seq1  ATG AAT AAA AAG ATA TAC AGC ATA GAA GAA TTA ATA GAT AAA ATA AGC
    Seq2  ATG AAT AAT AAA ATA TAC AGC ATA GAA GAA TTA ATA GAT AAA ATA AGC
    Seq3  ATG AAT AAA AAG ATA TAT AGC ATA GAA GAA TTA GTA GAT AAA ATA AGT
    Seq4  ATG AAT AAA AAA ACA TAT AGC ATA GAA GAA TTA ATA GAT AAA ATA AGT
    Seq5  ATG AAT AAA AAA ATA TAT AGC ATA GAA GAA TTA ATA GAC AAA ATA AGC
    Seq6  ATG AAT AAA AAA ATA TAT AGC ATA GAA GAA TTA ATA GAC AAA ATA AGT
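The grouping behind the codon view can be sketched in a few lines of Python. This is only an illustration of splitting an in-frame sequence into triplets and wrapping at a fixed number of codons; `codon_view` is a hypothetical helper and it does not reproduce the position numbers that bioaln prints.

```python
def codon_view(seq: str, codons_per_line: int = 20) -> list:
    """Split an in-frame sequence into space-separated codons, wrapped into lines."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return [
        " ".join(codons[i:i + codons_per_line])
        for i in range(0, len(codons), codons_per_line)
    ]


seq1 = "ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAAATAAGC"
for line in codon_view(seq1):
    print("Seq1 ", line)
```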
- --con-blocks, -B 'block-length' (default length 6)
-
Extract perfectly conserved blocks (PCBs, gap excluded) from an alignment, each to a new clustalw file. This may be used to e.g., identify conserved intergenic sequences.
With
bioaln --con-blocks input.aln
where input.aln is:
    Seq1  ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
    Seq2  ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
    Seq3  ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
    Seq4  ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
    Seq5  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
    Seq6  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
          ******** ** * *** *************** **** ***  *****
you get:
nuc.aln.slice-1.aln (file contents below; site positions are indicated after the '/'):
    Seq1/1-8  ATGAATAA
    Seq2/1-8  ATGAATAA
    Seq3/1-8  ATGAATAA
    Seq4/1-8  ATGAATAA
    Seq5/1-8  ATGAATAA
    Seq6/1-8  ATGAATAA
              ********
nuc.aln.slice-19.aln:
    Seq1/19-33  AGCATAGAAGAATTA
    Seq2/19-33  AGCATAGAAGAATTA
    Seq3/19-33  AGCATAGAAGAATTA
    Seq4/19-33  AGCATAGAAGAATTA
    Seq5/19-33  AGCATAGAAGAATTA
    Seq6/19-33  AGCATAGAAGAATTA
                ***************
nuc.aln.slice-40.aln:
    Seq1/40-47  AAAATAAG
    Seq2/40-47  AAAATAAG
    Seq3/40-47  AAAATAAG
    Seq4/40-47  AAAATAAG
    Seq5/40-47  AAAATAAG
    Seq6/40-47  AAAATAAG
                ********
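One way to find perfectly conserved blocks is to drop every column that contains a gap and then scan for runs of identical columns that reach the minimum length. The Python sketch below follows that reading of the option; the function name `conserved_blocks` and the dict-based alignment are assumptions, and bioaln's own implementation may differ in detail.

```python
def conserved_blocks(alignment: dict, min_length: int = 6, gap: str = "-") -> list:
    """Return (start, end, block) tuples, 1-based in gap-free coordinates."""
    # Keep only columns in which no sequence has a gap.
    columns = [col for col in zip(*alignment.values()) if gap not in col]
    blocks = []
    start = None
    # A trailing None acts as a sentinel that closes any open run.
    for i, col in enumerate(columns + [None]):
        conserved = col is not None and len(set(col)) == 1
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                blocks.append((start + 1, i, "".join(c[0] for c in columns[start:i])))
            start = None
    return blocks


example = {
    "Seq1": "ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT",
    "Seq2": "ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC",
    "Seq3": "ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC",
    "Seq4": "ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT",
    "Seq5": "ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC",
    "Seq6": "ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT",
}
for start, end, block in conserved_blocks(example):
    print(f"{start}-{end} {block}")
```

On the example alignment this reports blocks 1-8, 19-33 and 40-47, matching the slice files listed for this option.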
- --concat, -A
-
Concatenate multiple alignments sharing the same set of IDs. This is normally used for concatenating individual gene alignments of the same set of samples into a single alignment (a "supermatrix") for phylogenetic analysis.
bioaln --concat gene1.aln gene2.aln gene3.aln gene4.aln
or using wildcard to specify multiple files:
bioaln --concat gene*.aln
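Conceptually, concatenation joins the rows that share an ID and refuses to proceed when the ID sets differ. A minimal Python sketch of that check and join follows; `concat_alignments` is a hypothetical name and the error message is illustrative, not the text bioaln prints.

```python
def concat_alignments(alignments: list) -> dict:
    """Join alignments (dicts of id -> aligned sequence) end to end by sequence id."""
    ids = list(alignments[0])
    for number, aln in enumerate(alignments[1:], start=2):
        if set(aln) != set(ids):
            raise ValueError(f"sequence IDs of alignment {number} do not match the first")
    # Keep the ID order of the first alignment.
    return {seq_id: "".join(aln[seq_id] for aln in alignments) for seq_id in ids}


gene1 = {"A": "ATG-", "B": "ATGC"}
gene2 = {"B": "TT", "A": "TA"}
print(concat_alignments([gene1, gene2]))  # prints {'A': 'ATG-TA', 'B': 'ATGCTT'}
```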
- --consensus, -C 'percent' (default 50)
-
Add a consensus sequence, computed at the given percent threshold and named Consensus_<percent>, to the end of the alignment.
- --delete, -d 'seq_id1,seq_id2,etc'
-
Delete sequences based on their ids. Option takes a comma-separated list of ids.
- --dna2pep, -D
-
Turn an in-frame protein-coding sequence alignment into the corresponding protein alignment.
- --gap-states
-
Prints one alignment gap per line, including its start, end, whether in-frame, whether on-edge, how many copies, and alignment length. (The context this was originally developed for is no longer remembered; most users can ignore this option.)
- --gap-states2
-
Prints one alignment gap per column, including its start-end as column heading and presence/absence (1/0) in each sequence.
- --input, -i 'format'
-
Specify input file format. Common ones include 'clustalw' (default), 'fasta' and 'phylip'. See Bio::AlignIO for supported formats.
In addition, bioaln reads NCBI BLAST outputs. The preferred BLAST output format is XML (-outfmt 5), read as 'blastxml', e.g., bioaln -i'blastxml' blast.out.
- --length, -l
-
Print alignment length.
- --listids, -L
-
List all sequence ids.
- --match, -m
-
Go through all columns and change residues identical to the reference sequence to be the match character, '.'.
For input:
    Seq1  ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
    Seq2  ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
    Seq3  ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
    Seq4  ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
    Seq5  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
    Seq6  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
          ******** ** * *** *************** **** ***  *****
bioaln -m input.aln
gives:
    Seq1  ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
    Seq2  .................C...............A........CG.....C
    Seq3  ........T..A.....C...............A...............C
    Seq4  ...........A.C...................A................
    Seq5  ...........A.....................A....C...C......C
    Seq6  ...........A.....................A....C...........
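The match view amounts to a per-column comparison against the reference (first) sequence. The Python sketch below shows that comparison; `match_view` is a hypothetical helper operating on a dict whose first entry is taken as the reference, which is an assumption of this illustration.

```python
def match_view(alignment: dict, match_char: str = ".") -> dict:
    """Replace residues identical to the reference (first sequence) with match_char."""
    names = list(alignment)
    reference = alignment[names[0]]
    result = {names[0]: reference}  # the reference itself is shown unchanged
    for name in names[1:]:
        result[name] = "".join(
            match_char if residue == ref_residue else residue
            for residue, ref_residue in zip(alignment[name], reference)
        )
    return result


example = {
    "Seq1": "ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT",
    "Seq2": "ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC",
}
for name, row in match_view(example).items():
    print(name, row)
```

Note that a gap shared with the reference is also shown as the match character, as in the Seq3 row of the example output.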
- --no-flat, -F
-
By default, sequence names do not contain 'begin-end'. This option turns ON 'begin-end' naming.
- --no-gaps, -g
-
Remove gapped sites and return a de-gapped alignment.
- --num-seq, -n
-
Print number of sequences in alignment.
- --output, -o 'format'
-
Output file format. Common ones include 'clustalw' (default), 'fasta' and 'phylip'. See Bio::AlignIO for supported formats. An additional format 'paml' is supported.
- --pep2dna, -P 'unaligned-cds-file' <protein_alignment>
-
Produce an in-frame codon alignment by aligning CDS sequences according to their corresponding protein alignment. Throws an error if names in the two files do not match exactly.
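Back-aligning a CDS onto its aligned protein can be pictured as walking along the protein row and emitting one codon per residue and three gap characters per protein gap. The sketch below illustrates this for a single sequence; `back_align` is a hypothetical name, and it assumes the CDS is in frame and has at least one codon per residue.

```python
def back_align(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned, in-frame CDS onto an aligned protein sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for char in aligned_protein if char != gap)
    if residues > len(codons):
        raise ValueError("CDS is too short for the protein sequence")
    pieces = []
    next_codon = 0
    for char in aligned_protein:
        if char == gap:
            pieces.append(gap * 3)  # one protein gap spans three nucleotide columns
        else:
            pieces.append(codons[next_codon])
            next_codon += 1
    return "".join(pieces)


print(back_align("M-K", "ATGAAA"))  # prints ATG---AAA
```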
- --permute-states, -M
-
Generate an alignment with randomly permuted residues at each site. This operation removes phylogenetic signal among aligned sequences, if there is any in the original alignment. This is the basis of the Permutation Tail Probability (PTP) test of the tree-ness of an alignment (total tree length should increase after permutation). Note that this is different from bootstrap, which leaves individual alignment columns intact.
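The difference from bootstrapping is easiest to see in code: here each column keeps its residue counts, but the residues are reassigned to sequences at random. The following Python sketch is an illustration under the assumption of a dict-based alignment; `permute_states` is a hypothetical name.

```python
import random


def permute_states(alignment: dict, rng: random.Random) -> dict:
    """Shuffle residues among sequences independently within every column."""
    names = list(alignment)
    columns = [list(col) for col in zip(*alignment.values())]
    for col in columns:
        rng.shuffle(col)  # column composition is unchanged, row assignment is random
    rows = zip(*columns)
    return {name: "".join(row) for name, row in zip(names, rows)}


example = {"Seq1": "AACT", "Seq2": "ATCT", "Seq3": "GTCA"}
for name, seq in permute_states(example, random.Random(3)).items():
    print(name, seq)
```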
- --phy-nonint
-
Generate non-interleaved PHYLIP output (e.g., for clique program; should be wrapped into --output).
- --pick, -p 'seq1,seq2,etc'
-
Pick sequences based on their id. Option takes a comma-separated list of ids.
- --random-slice 'length'
-
Get a random alignment slice of the given length (the original use case is no longer remembered).
- --ref-seq, -r 'seqid'
-
Change the reference sequence to be seq_id.
- --remove-third
-
Remove third-codon positions (the least phylogenetically informative sites) from an in-frame codon alignment. Also see --select-third below.
- --resample, -R 'num'
-
Picks num random sequences from the input alignment and produces a new alignment consisting of those sequences. If num is not given, the default is the number of sequences in the alignment divided by 2, rounded down.
This functionality uses an implementation of Reservoir Sampling, based on the algorithm found here: http://blogs.msdn.com/b/spt/archive/2008/02/05/reservoir-sampling.aspx
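For readers unfamiliar with reservoir sampling, here is a minimal Python sketch of the classic single-pass variant (often called Algorithm R). It illustrates the general technique only; the function name is hypothetical and bioaln's own implementation may differ.

```python
import random


def reservoir_sample(items, k: int, rng: random.Random) -> list:
    """Pick k items uniformly at random from an iterable in a single pass."""
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            # Item number index+1 replaces a random slot with probability k/(index+1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


ids = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5", "Seq6"]
print(reservoir_sample(ids, len(ids) // 2, random.Random(42)))
```

The usage line mirrors the documented default of half the number of sequences, rounded down.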
- --rm-col, -E 'seq_id'
-
Remove columns with gap in designated sequence.
For
bioaln --rm-col 'Seq5' input.aln
where input.aln contains:
    Seq1  ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
    Seq2  ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
    Seq3  ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
    Seq4  ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
    Seq5  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
    Seq6  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
          ******** ** * *** *************** **** ***  *****
you get output:
    Seq1  ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA-ATAAGT
    Seq2  ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACATAAGC
    Seq3  ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA-ATAAGC
    Seq4  ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA-ATAAGT
    Seq5  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAACATAAGC
    Seq6  ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA-ATAAGT
          ******** ** * *** *************** **** *** *****
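The column removal can be sketched as follows: find the columns where the designated sequence has a gap, then drop those columns from every row. `remove_gap_columns` is a hypothetical helper written for illustration, using a dict-based alignment.

```python
def remove_gap_columns(alignment: dict, seq_id: str, gap: str = "-") -> dict:
    """Drop every column in which the sequence named seq_id has a gap."""
    keep = [i for i, char in enumerate(alignment[seq_id]) if char != gap]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


example = {
    "Seq1": "ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT",
    "Seq2": "ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC",
    "Seq5": "ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC",
}
for name, seq in remove_gap_columns(example, "Seq5").items():
    print(name, seq)
```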
- --select-third
-
Generate an alignment of every-third (mostly synonymous) bases (assuming a CDS alignment).
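Both third-position options reduce to simple index arithmetic on an in-frame sequence. The Python sketch below shows the two complementary selections; the helper names are hypothetical and the code assumes the first column is the first base of a codon.

```python
def select_third(seq: str) -> str:
    """Keep only third codon positions (columns 3, 6, 9, ...)."""
    return seq[2::3]


def remove_third(seq: str) -> str:
    """Keep first and second codon positions, dropping every third column."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


print(select_third("ATGAATAAG"))  # prints GTG
print(remove_third("ATGAATAAG"))  # prints ATAAAA
```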
- --shuffle-sites, -S
-
Make a shuffled (not bootstrapped, which is sampling with replacement) alignment. This operation permutes alignment columns. It is used for testing the significance of long-runs of conserved sites in an alignment (e.g., conserved intergenic spacer sequences).
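Shuffling sites is a permutation of whole columns, so every original column appears exactly once in the result. A minimal Python sketch, with `shuffle_sites` as a hypothetical name and a dict-based alignment as an assumption:

```python
import random


def shuffle_sites(alignment: dict, rng: random.Random) -> dict:
    """Reorder alignment columns at random, without replacement."""
    length = len(next(iter(alignment.values())))
    order = list(range(length))
    rng.shuffle(order)  # a permutation: each column index is used exactly once
    return {name: "".join(seq[i] for i in order) for name, seq in alignment.items()}


example = {"Seq1": "ATGAAT", "Seq2": "ATGACT", "Seq3": "ACGAAT"}
for name, seq in shuffle_sites(example, random.Random(11)).items():
    print(name, seq)
```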
- --slice, -s 'start|-,end|-'
-
Get a slice of the alignment.
Using a '-' character in the first or second position defaults to the beginning or end, respectively. Therefore specifying --slice='-,-' is the same as grabbing the whole alignment.
--slice'20,80' or --slice '20,80' or -s='20,80' or --slice='20,80': Slice from position 20 to 80, inclusive.
--slice='-,80': Slice from the beginning up to, and including, position 80.
--slice'20,-': Slice from position 20 up to, and including, the end of the alignment.
NOTE: --slice'-,x' (where x is '-' or a position) does NOT work. Use --slice='-,x' (or a space in place of =) instead.
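The 'start|-,end|-' semantics can be summarized in a short Python sketch: positions are 1-based and inclusive, and '-' stands for the first or last column. `slice_alignment` is a hypothetical helper that only illustrates how the value is interpreted; it says nothing about how bioaln parses the command line.

```python
def slice_alignment(alignment: dict, spec: str) -> dict:
    """Slice columns given a 'start,end' spec; '-' means alignment start or end."""
    start_text, end_text = spec.split(",")
    length = len(next(iter(alignment.values())))
    start = 1 if start_text == "-" else int(start_text)
    end = length if end_text == "-" else int(end_text)
    # Convert the 1-based inclusive range to a Python slice.
    return {name: seq[start - 1:end] for name, seq in alignment.items()}


example = {"Seq1": "ATGAATAAAA", "Seq2": "ATGACTAAAT"}
print(slice_alignment(example, "3,5"))  # prints {'Seq1': 'GAA', 'Seq2': 'GAC'}
print(slice_alignment(example, "-,2"))  # prints {'Seq1': 'AT', 'Seq2': 'AT'}
```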
- --split-cdhit <cdhit clrs file>
-
Generate alignment for each CDHIT family (based on .clrs file). Ignore if you don't use cdhit for family clustering.
- --trim-ends
-
Remove 5'- and 3'-gapped columns.
- --uniq, -u
-
Extract the alignment of unique sequences.
- --upper
-
Make an uppercase alignment.
- --varsites, -v
-
Extract variable sites. When used in conjunction with -g, sites with a gap in any sequence are not shown.
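Extracting variable sites is a column filter: keep a column only if it holds more than one distinct character. The sketch below also shows the effect of excluding gapped columns first. The name `variable_sites`, the `skip_gapped` flag, and the choice to count a gap as a distinct state when gapped columns are kept are all assumptions of this illustration.

```python
def variable_sites(alignment: dict, skip_gapped: bool = False, gap: str = "-") -> dict:
    """Keep only columns that contain more than one distinct character."""
    names = list(alignment)
    kept = []
    for col in zip(*alignment.values()):
        if skip_gapped and gap in col:
            continue  # drop columns with a gap in any sequence
        if len(set(col)) > 1:
            kept.append(col)
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


example = {"Seq1": "AC-T", "Seq2": "AG-T", "Seq3": "ACAT"}
print(variable_sites(example))                    # prints {'Seq1': 'C-', 'Seq2': 'G-', 'Seq3': 'CA'}
print(variable_sites(example, skip_gapped=True))  # prints {'Seq1': 'C', 'Seq2': 'G', 'Seq3': 'C'}
```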
- --window, -w 'size|30'
-
Calculate pairwise average sequence difference by windows (overlapping windows with fixed step of 1). Default value for window_size is 30.
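A sliding-window scan with a step of 1 can be sketched as below. This illustration reports the mean pairwise percent identity per window (percent difference is 100 minus that value) and treats a gap like any other character; both choices, along with the name `window_identity`, are assumptions and may not match how bioaln or BioPerl computes its figures.

```python
from itertools import combinations


def window_identity(alignment: dict, size: int = 30) -> list:
    """Return (window_start, mean pairwise percent identity), start being 1-based."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    results = []
    for start in range(length - size + 1):  # overlapping windows, step of 1
        total = 0.0
        for a, b in pairs:
            matches = sum(x == y for x, y in zip(a[start:start + size], b[start:start + size]))
            total += matches / size
        results.append((start + 1, 100.0 * total / len(pairs)))
    return results


example = {"Seq1": "AAAA", "Seq2": "AAAT"}
print(window_identity(example, size=2))  # prints [(1, 100.0), (2, 100.0), (3, 50.0)]
```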
Common Options
- --help, -h
-
Print a brief help message and exit.
- --man
-
Print the manual page and exit.
- --version, -V
-
Print current release version and exit.
SEE ALSO
Bio::BPWrapper::AlnManipulations, the underlying Perl module
CONTRIBUTORS
Yözen Hernández <yzhernand at gmail dot com> (initial design and implementation)
William McCaig <wmccaig at gmail dot com> (developer)
Girish Ramrattan <gramratt at gmail dot com> (developer, documentation)
Che Martin <che dot l dot martin at gmail dot com> (developer)
Levy Vargas <levy dot vargas at gmail dot com> (developer)
Rocky Bernstein (testing and release)
Weigang Qiu <weigang@genectr.hunter.cuny.edu> (maintainer)
TO DO
Add Align::Statistics methods, especially genetic distances
TO CITE
Hernandez, Bernstein, Qiu, et al (2017). "BpWrappers: Command-line utilities for manipulation of sequences, alignments, and phylogenetic trees based on BioPerl". (In prep).
Stajich et al (2002). "The BioPerl Toolkit: Perl Modules for the Life Sciences". Genome Research 12(10):1611-1618.