NAME
SYNOPSIS
# Alignment descriptors:
bioaln -l aln_file # [l]ength of an alignment
bioaln -n aln_file # [n]umber of aligned sequences
bioaln -L aln_file # [L]ist all sequence IDs
bioaln -a input.aln # [a]verage percent identity
bioaln -w '30' aln_file # average identifies for sliding [w]indows of 30
bioaln -I 'B31,100' aln_file # get aln column [I]ndex of seq 'B31' residue 100
bioaln --gapstates aln_file # identify unique gaps (start, end, frequency, whether internal)
# Alignment viewers:
bioaln -m input.aln # [m]atch view (show variable sites)
bioaln -c aln_file # [c]odon view (groups of 3 nts)
# Alignment filters (produce a new alignment):
bioaln -i 'fasta' fasta_aln_file # [i]nput is a FASTA alignment (CLUSTALW is dafault)
bioaln -o 'fasta' aln_file # [o]utput a FASTA alignment (CLUSTALW is dafault)
bioaln -g aln_file # remove [g]apped sites
bioaln -s '10,20' # alignment [s]lice from 10-20
bioaln -r 'seq_id' aln_file # change [r]eference (1st) sequence
bioaln -d 'Seq1,Seq2' aln_input # [d]elete sequences
bioaln -p 'Seq1, Seq2' aln_input # [p]ick sequences
bioaln -u aln_file # [u]nique-fy sequences (remove redundant seqs)
bioaln -v aln_file # show only variable sites
bioaln -C '90' aln_file # add a 90% [C]onsensus sequence
bioaln -D cds.aln # D(na) alignment => protein alignment
bioaln -E 'Seq5' input.aln # [E]rase sites gapped at Seq5
bioaln -P 'cds.fas' pep.aln # Back-align CDS seqs according to [P]rotein alignment
bioaln -U aln_file # turn into [U]pper-case
# Evolutionary analysis:
bioaln -A *.aln # conc(A)tenate aln files
bioaln -B aln_file # extract conserved [B]locks
bioaln -S aln_file # [S]huffle sites (for testing conserved blocks)
bioaln -R '10' aln_file # [R]e-sampled an alignment of 10 sequences
bioaln -b aln_file # [b]ootstrap an alignment (for testing branch stability)
bioaln -M aln_file # per[m]ute at each site (for testing tree-ness)
bioaln -T aln_file # extract [T]hird site (assume coding sequences)
bioaln --remove-third aln_file # remove [T]hird site (assume coding sequences)
# change alignment format:
bioaln -i 'fasta' -o 'phylip' # FASTA => PHYLIP
bioaln -i 'fasta' -o 'pmal' # FASTA => PAML
# Chaining with pipes:
# Read a FASTA alignment, slice it, and calcualte percent identity:
bioaln -i'fasta' fasta.aln | bioaln -s'10,20' | bioaln -a
# Chaining with bioseq:
# Turn a coding-sequence alignment into a protein alignemnt (an alternative to -D):
bioaln -o'fasta' cds.aln | bioseq -t1 | bioaln -i'fasta'
DESCRIPTION
bioaln performs common, routine manipulations of sequence alignments. By default, bioaln assumes that both the input and the output files are in CLUSTALW format so that multiple bioaln runs can be chained with UNIX pipes. Upper-case options are less commonly used.
OPTIONS and EXAMPLES
--help, -h
Print a brief help message and exit.
--avpid, -a
Calculate the average percent identity of an alignment. Returns the value alone. Wrap Bio::SimpleAlign->average_percentage_identity().
Usage: bioaln -a <alignment_file>
--bootstrap, -b
Produced a bootstrapped alignment. Wrap Bio::Align::Utilities->bootstrap(). Note that only a single alignment is produced. To produce multiple bootstraped alignment, use BASH loop (see below)
Usage: bioaln -b <alignment_file> (single bootstrapped alignment) Usage: for i in {1..10}; do bioaln -b foo.aln > foo.boot-$i.aln; done (10 bootstrapped alignments)
--codon-view, -c
Print a CLUSTALW-like alignment, but separated by codons. Intended for use with DNA sequences. Block-final position numbers are printed at the end of every alignment block at the point of wrapping, and block-initial counts appear over first nucleotide in a block.
If invoked as --codon-view=n where n is some number, will print n codons per line. Other normally stackable options, such as -m, can be used alongside it. If piping through bioaln, ensure codon-view is used in the last invocation.
Usage: bioaln -c <alignment_file>
EXAMPLE: bioaln -c input_DNA.aln
INPUT:
Seq1 ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAAATAAGC
Seq2 ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAAATAAGC
Seq3 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAAATAAGT
Seq4 ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAAATAAGT
Seq5 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAATAAGC
Seq6 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAATAAGT
******** ** * *** *************** **** ********
OUTPUT: 4 1 8 Seq1 ATG AAT AAA AAG ATA TAC AGC ATA GAA GAA TTA ATA GAT AAA ATA AGC Seq2 ATG AAT AAT AAA ATA TAC AGC ATA GAA GAA TTA ATA GAT AAA ATA AGC Seq3 ATG AAT AAA AAG ATA TAT AGC ATA GAA GAA TTA GTA GAT AAA ATA AGT Seq4 ATG AAT AAA AAA ACA TAT AGC ATA GAA GAA TTA ATA GAT AAA ATA AGT Seq5 ATG AAT AAA AAA ATA TAT AGC ATA GAA GAA TTA ATA GAC AAA ATA AGC Seq6 ATG AAT AAA AAA ATA TAT AGC ATA GAA GAA TTA ATA GAC AAA ATA AGT
--delete, -d
Delete sequences based on their id. Option takes a comma-separated list of ids.
Usage: bioaln -d 'seq_id_1, seq_id_2, ... , seq_id_n' <alignment_file>
--nogaps, -g
Remove gaps (and returns an de-gapped alignment). Wrap Bio::SimpleAlign->remove_gaps();
Usage: bioaln -g <alignment_file>
--input, -i
Input file format (see Bio::AlignIO for supported formats). By default, this is 'clustalw'. Wrap Bio::AlignIO->new().
Usage: bioaln -i 'format'
--length, -l.
Print alignment length. Wrap Bio::SimpleAlign->length().
Usage: bioaln -l <alignment_file>
--match, -m.
Go through all columns and change residues identical to the reference sequence to be the match character, '.' Wrap Bio::SimpleAlign->match().
Usage: bioaln -m <alignment_file>
EXAMPLE: bioaln -m input.aln
INPUT:
Seq1 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
Seq2 ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
Seq3 ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
Seq4 ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
Seq5 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
Seq6 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
******** ** * *** *************** **** *** *****
OUTPUT:
Seq1 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
Seq2 .................C...............A........CG.....C
Seq3 ........T..A.....C...............A...............C
Seq4 ...........A.C...................A................
Seq5 ...........A.....................A....C...C......C
Seq6 ...........A.....................A....C...........
--numseq, -n
Get number of sequences in alignment.
Usage: bioaln -n <alignment_file>
--output, -o
Output file format (see Bio::AlignIO for supported formats). By default, this is 'clustalw'. Wrap Bio::AlignIO->new().
Usage: bioaln -o 'format'
--pick, -p
Pick sequences based on their id. Option takes a comma-separated list of ids.
Usage: bioaln -p 'seq_id_1, seq_id_2, ... , seq_id_n' <alignment_file>
--refseq, -r
Change the reference sequence to be seq_id. Wrap Bio::SimpleAlign->set_new_reference().
Usage: bioaln -r 'seq_id' <alignment_file>
--slice, -s
Get a slice of the alignment. Wrap Bio::SimpleAlign->slice() with improvements.
Using a '-' character in the first or second position defaults to the beginning or end, respectively. Therefore specifying -s'-,-' is the same as grabbing the whole alignment.
Usage: bioaln -s 'min,max' <alignment_file>
-s'20,80' or --slice'20,80' or -s='20,80' or --slice='20,80' Slice from position 20 to 80, inclusive. -s'-,80' Slice from beginning up to, and including position 80 -s'20,-' Slice from position 20 up to, and including, the end of the alignment
NOTE: --slice'-,x' (where x is '-' or a position) does NOT work. Use --slice='-,x' instead.
--uniq, -u.
Extract the alignment of unique sequences. Wrap Bio::SimpleAlign->uniq_seq().
Usage: bioaln -u <alignment_file>
EXAMPLE: bioaln -u input.aln
INPUT:
seq1 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
seq11 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
seq2 ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
seq3 ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
seq4 ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
seq5 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
seq7 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
seq6 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
******** ** * *** *************** **** *** *****
OUTPUT:
seq1 ST1
seq11 ST1
seq2 ST2
seq3 ST3
seq4 ST4
seq5 ST5
seq7 ST5
seq6 ST6
ST1 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
ST2 ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
ST3 ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
ST4 ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
ST5 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
ST6 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
******** ** * *** *************** **** *** *****
--varsites, -v
Extracts variable sites. Used in conjunction with -g: do not show sites with gaps in any sequence.
Usage: bioaln -v <alignment_file>
--window, -w
Calculate pairwise average sequence difference by windows (overlapping windows with fixed step of 1). Default value for window_size is 30.
Usage: bioaln -w 'window_size'[default is 30] <alignment_file>
--concat, -A
Concatenate multiple alignments sharing the same set of unique IDs. This is normally used for concatenating individual gene alignments of the same set of samples to a single one for making a "supertree". Wrap Bio::Align::Utilities->cat().
Usage: bioaln -A gene1.aln gene2.aln gene3.aln gene4.aln,
Or: using wildcard to specify files:
Usage: bioaln -A gene*.aln
--conblocks, -B
Extract perfectly conserved blocks (PCBs, gap excluded) from an alignment, each to a new clustalw file. Default minimum length of PCB is 6 sites.
Usage: bioaln -B <alignment_file>
EXAMPLE: bioaln -B input.aln
INPUT:
Seq1 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
Seq2 ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
Seq3 ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
Seq4 ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
Seq5 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
Seq6 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
******** ** * *** *************** **** *** *****
OUTPUT: files containing perfectly conserved blocks
nuc.aln.slice-1.aln : file contents below. Site positions indicated after the '/'
Seq1/1-8 ATGAATAA
Seq2/1-8 ATGAATAA
Seq3/1-8 ATGAATAA
Seq4/1-8 ATGAATAA
Seq5/1-8 ATGAATAA
Seq6/1-8 ATGAATAA
********
nuc.aln.slice-19.aln:
Seq1/19-33 AGCATAGAAGAATTA
Seq2/19-33 AGCATAGAAGAATTA
Seq3/19-33 AGCATAGAAGAATTA
Seq4/19-33 AGCATAGAAGAATTA
Seq5/19-33 AGCATAGAAGAATTA
Seq6/19-33 AGCATAGAAGAATTA
***************
nuc.aln.slice-40.aln
Seq1/40-47 AAAATAAG
Seq2/40-47 AAAATAAG
Seq3/40-47 AAAATAAG
Seq4/40-47 AAAATAAG
Seq5/40-47 AAAATAAG
Seq6/40-47 AAAATAAG
********
--consensus, -C
Add a consensus sequence to the end of the alignment with a certain threshold percent and id Consensus_<percent>. By default percent is 50. Wrap Bio::SimpleAlign->consensus_string().
Usage: bioaln -C 'percent' <alignment_file>
EXAMPLE: bioaln -C '90' input.aln
INPUT:
Seq1 MNNKIYSIEELIDKISMPVVAYAGEAKSFLREALEYAKNK
Seq2 MNKKTYSIEELIDKISMPVVAYSGEAKSFLREALEHAKNK
Seq3 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq4 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq5 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq6 MNKKIYSIEELVDKISMPVVAYSGEAKSFLREALEYAKNK
**:* ******:**********:************:****
OUTPUT:
Seq1 MNNKIYSIEELIDKISMPVVAYAGEAKSFLREALEYAKNK
Seq2 MNKKTYSIEELIDKISMPVVAYSGEAKSFLREALEHAKNK
Seq3 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq4 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq5 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq6 MNKKIYSIEELVDKISMPVVAYSGEAKSFLREALEYAKNK
Consensus_90 MN?K?YSIEEL?DKISMPVVAY?GEAKSFLREALE?AKNK
** * ****** ********** ************ ****
--dna2pep, -D
Turn a CDS alignment to a corresponding protein alignment. Wrap Bio::Align::Utilities->dna_to_aa_aln().
Usage: bioaln -D <cds_alignment>
--erasecol, -E
Remove columns with gap in designated sequence.
Usage: bioaln -E 'seq_id' <alignment_file>
EXAMPLE: bioaln -E 'Seq5' input.aln
INPUT:
Seq1 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA--ATAAGT
Seq2 ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACGATAAGC
Seq3 ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA--ATAAGC
Seq4 ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA--ATAAGT
Seq5 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAAC-ATAAGC
Seq6 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA--ATAAGT
******** ** * *** *************** **** *** *****
OUTPUT:
Seq1 ATGAATAAAAAGATATATAGCATAGAAGAATTAGTAGATAAA-ATAAGT
Seq2 ATGAATAAAAAGATATACAGCATAGAAGAATTAATAGATAAACATAAGC
Seq3 ATGAATAATAAAATATACAGCATAGAAGAATTAATAGATAAA-ATAAGC
Seq4 ATGAATAAAAAAACATATAGCATAGAAGAATTAATAGATAAA-ATAAGT
Seq5 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAACATAAGC
Seq6 ATGAATAAAAAAATATATAGCATAGAAGAATTAATAGACAAA-ATAAGT
******** ** * *** *************** **** *** *****
--noflatname, -F
By default, sequence names do not contain 'begin-end'. This option turns ON 'begin-end' naming. Wrap Bio::SimpleAlign->set_displayname_flat().
Usage: bioaln -F <alignment_file>
--aln-index, -I
Identify aligned position of a residue. Wrap Bio::SimpleAlign->column_from_residue_number().
Usage: bioaln -I 'seq_id,position' <alignment_file>
--listids, -L
List all sequence ids.
Usage: bioaln -L <alignment_file>
--permute-states, -M
Generate an alignment with randomly permuted residues at each site. This operation removes phylogenetic signal among aligned sequences, if there is any in the original alignment. This is the basis of the Permutation Trail Prob (PTP) test of tree-ness of an alignment.
Usage: bioaln -M <alignment_file>
--pep2dna, -P
Align CDS sequences according to their corresponding protein alignment. Wrap Bio::Align::Utilities->aa_to_dna_aln().
Usage: bioaln -P 'unaligned_cds.fas' <protein_alignment>
--resample, -R
Picks n random sequences from input alignment and produces a new alignment consisting of those sequences.
If n is not given, default is the number of sequences in alignment divided by 2, rounded down.
This functionality uses an implementation of Reservoir Sampling, based on the algorithm found here: http://blogs.msdn.com/b/spt/archive/2008/02/05/reservoir-sampling.aspx
Usage: bioaln -R 'n' <alignment_file>
EXAMPLE: bioaln -R4 input.aln
INPUT:
Seq1 MNNKIYSIEELIDKISMPVVAYAGEAKSFLREALEYAKNK
Seq2 MNKKTYSIEELIDKISMPVVAYSGEAKSFLREALEHAKNK
Seq3 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq4 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq5 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq6 MNKKIYSIEELVDKISMPVVAYSGEAKSFLREALEYAKNK
**:* ******:**********:************:****
OUTPUT:
Seq2 MNKKTYSIEELIDKISMPVVAYSGEAKSFLREALEHAKNK
Seq3 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq4 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq6 MNKKIYSIEELVDKISMPVVAYSGEAKSFLREALEYAKNK
**** ******:***********************:****
--shuffle-sites, -S
Make a shuffled (not bootstraped) alignment. This operation randomizes alignment columns. It is used for testing the signficance of long-runs of conserved sites in an alignment (e.g., conserved intergenic spacers [IGSs]).
Usage: bioaln -S <alignment_file>
EXAMPLE: bioaln -S input.aln
INPUT:
Seq1 MNNKIYSIEELIDKISMPVVAYAGEAKSFLREALEYAKNK
Seq2 MNKKTYSIEELIDKISMPVVAYSGEAKSFLREALEHAKNK
Seq3 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq4 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq5 MNKKIYSIEELIDKISMPVVAYSGEAKSFLREALEYAKNK
Seq6 MNKKIYSIEELVDKISMPVVAYSGEAKSFLREALEYAKNK
**:* ******:**********:************:****
OUTPUT:
Seq1 VIEELKYEKAAEAFIPNISEDASGKIRLKLMSVNAYKMYN
Seq2 VIEELKYEKAAEAFIPNISEDASGKTRLKLMSVKSYKMHN
Seq3 VIEELKYEKAAEAFIPNISEDASGKIRLKLMSVKSYKMYN
Seq4 VIEELKYEKAAEAFIPNISEDASGKIRLKLMSVKSYKMYN
Seq5 VIEELKYEKAAEAFIPNISEDASGKIRLKLMSVKSYKMYN
Seq6 VIEELKYEKAAEAFVPNISEDASGKIRLKLMSVKSYKMYN
**************:********** *******::***:*
--third-sites, -T
Generate an alignment of every-third (mostly synonymous) bases (assuming a CDS alignment).
Usage: bioaln -T <alignment_file>
--uppercase, -U
Make an uppercase alignment.
Usage: bioaln -U <alignment_file>
--version, -V
Print current release version of bp-utils.
Usage: bioseq -V
REQUIRES
Perl 5.10 or greater, BioPerl
FEEDBACK
Support
Please see project wiki page https://github.com/bioperl/bp-utils/wiki for documentation and use cases.
Report bugs & Contribute
Please email weigang@genectr.hunter.cuny.edu.
CONTRIBUTORS
William McCaig <wmccaig at gmail dot com>
Che Martin <che dot l dot martin at gmail dot com>
Yoezen Hernandez <yzhernand at gmail dot com>
Levy Vargas <levy dot vargas at gmail dot com>
Weigang Qiu <weigang at genectr dot hunter dot cuny.edu> (Maintainer)