NAME
bioseq - FASTA sequence utility based on Bio::Perl.
SYNOPSIS
bioseq [-h | --help | -V | --version | --man]
bioseq options file
bioseq is a command-line utility for common, routine sequence manipulations. Most methods are wrappers for Bio::Perl modules: Bio::Seq, Bio::SeqIO, Bio::SeqUtils, and Bio::Tools::SeqStats.
DESCRIPTION
By default, bioseq assumes that both the input and the output files are in FASTA format, to facilitate the chaining (by UNIX pipes) of multiple bioseq runs.
Methods that are currently not wrappers should ideally be factored into individual Bio::Perl modules, which are better tested and handle exceptions better than stand-alone code in the Bio::BPWrapper package. As a design principle, command-line scripts here should consist of only wrapper calls.
Options
- --help, -h
-
Print a brief help message and exit.
- --man (but not "-m")
-
Print the manual page and exit.
- --composition, -c <input_file>
-
Print base or amino-acid composition. A wrapper for Bio::Tools::SeqStats#count_monomers.
- --delete, -d 'tag:value' <input_file>
-
Delete a sequence or a comma-separated list of sequences, e.g.,
-d 'id:foo'      # by id
-d 'order:2'     # by order
-d 'length:n'    # by min length, where 'n' is length
-d 'ambig:x'     # by min % ambiguous base/aa, where 'x' is the %
-d 'id:foo,bar'  # list by id
-d 're:REGEX'    # using a regular expression (only one regex is expected)
- --fetch, -f <genbank_accession>
-
Retrieve a sequence from GenBank using the provided accession number. A wrapper for Bio::DB::GenBank#get_Seq_by_acc.
- --nogaps, -g <input_file>
-
Remove gaps
- --input, -i 'format' <input_file>
-
Input file format. By default, this is 'fasta'. For GenBank format, use 'genbank'. For EMBL format, use 'embl'. Wraps Bio::SeqIO.
- --length, -l <input_file>
-
Print all sequence lengths. Wraps Bio::Seq#length.
- --numseq, -n <input_file>
-
Print number of sequences.
- --output, -o 'format' <input_file>
-
Output file format. By default, this is 'fasta'. For GenBank format, use 'genbank'. For EMBL format, use 'embl'. Wraps Bio::SeqIO.
- --pick, -p
-
Select a single sequence:
--pick 'id:foo'      by id
--pick 'order:2'     by order
--pick 're:REGEX'    using a regular expression
Select a list of sequences:
--pick 'id:foo,bar'  list by id
--pick 'order:2,3'   list by order
--pick 'order:2-10'  list by range
Usage: bioseq -p 'tag:value' <input_file>
- --revcom | -r <input_file>
-
Reverse complement. Wraps Bio::Seq#revcom.
- --subseq | -s 'beginning_index, ending_index' <input_file>
-
Select substring (of the 1st sequence). Wraps Bio::Seq#subseq. For example:
bioseq -s'20,80' <input_file> (or -s='20,80')
- --translate | -t [1|3|6] <input_file>
-
Translate in 1, 3, or 6 frames, e.g., -t1, -t3, or -t6. Wraps Bio::Seq#translate, Bio::SeqUtils#translate_3frames, and Bio::SeqUtils#translate_6frames.
- --restrict | -x 'RE' <dna_fasta_file>
-
Predict fragments from digestion by a specified restriction enzyme. An input file with a single sequence is expected. A wrapper of Bio::Restriction::Analysis#cut.
- --anonymize | -A 'number' <input_file>
-
Replace sequence IDs with serial IDs 'n' characters long, including a leading 'S' (e.g., -A'5' gives S0001). Produces a sed script file with a '.sed' suffix that may be used with sed's '-f' argument. If the filename is '-', the sed file is named STDOUT.sed instead. The sed filename is reported on STDERR.
- --break | -B <input_file>
-
Break into individual sequences, one sequence per file
- --count-codons | -C <input_file>
-
Count codons for coding sequences (e.g., a genome file consisting of CDS sequences). A wrapper of Bio::Tools::SeqStats#count_codons.
- --feat2fasta | -F
-
Extract gene sequences in FASTA format from a GenBank file of a bacterial genome. This does not work for eukaryotic GenBank files. For example:
bioseq -i'genbank' -F <genbank_file>
- --leadgaps | -G <input_file>
-
Count and return the number of leading gaps in each sequence.
- --hydroB, -H
-
Return the mean Kyte-Doolittle hydropathicity for protein sequences. A wrapper of Bio::Tools::SeqStats#hydrophobicity.
- --linearize, -L <input_file>
-
Linearize FASTA, one sequence per line.
- --reloop, -R
-
Re-circularize a bacterial genome by starting at a specified position. For example, for the sequence "ABCDE",
bioseq -R'2' ..
would generate "BCDEA".
bioseq -R 'number' <input_file>
- --version, -V
-
Print current release version of this command and Bio::BPWrapper
Usage: bioseq -V
- --removestop, -X
-
Remove stop codons (e.g., to prepare PAML input).
bioseq -X <input_file>
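The 'tag:value' selectors accepted by --pick and --delete can be illustrated with a small Python sketch. This is not bioseq's own code (which is Perl); the function name parse_selector is hypothetical, only the 'id', 'order', and 're' tags are covered, and treating 'order:2-10' as an inclusive range is an assumption.

```python
def parse_selector(spec):
    """Split a 'tag:value' selector and expand its value (hypothetical helper)."""
    # Split on the first colon only, so a regex may itself contain ':'.
    tag, sep, value = spec.partition(":")
    if not sep or not value:
        raise ValueError(f"expected 'tag:value', got {spec!r}")

    if tag == "id":
        # 'id:foo,bar' -> list of IDs
        return tag, value.split(",")

    if tag == "order":
        # 'order:2', 'order:2,3', or 'order:2-10' -> list of 1-based positions
        positions = []
        for part in value.split(","):
            first, dash, last = part.partition("-")
            if dash:
                # Assumption: both ends of the range are included.
                positions.extend(range(int(first), int(last) + 1))
            else:
                positions.append(int(first))
        return tag, positions

    if tag == "re":
        # Only one regex is expected, so the value is kept whole.
        return tag, value

    raise ValueError(f"tag not covered by this sketch: {tag!r}")


print(parse_selector("id:foo,bar"))
print(parse_selector("order:2-10"))
print(parse_selector("re:B31"))
```

Splitting on the first colon only keeps a pattern such as 're:a:b' intact as the single regex 'a:b'.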
EXAMPLES
FASTA descriptors
bioseq -l fasta_file # [l]engths of sequences
bioseq -n fasta_file # [n]umber of sequences
bioseq -c fasta_file # base or aa [c]omposition of sequences
FASTA filters
These take a FASTA-format file as input and output one or more FASTA-format files.
Multiple FASTA-file output
bioseq -r fasta_file # [r]everse-complement sequences
bioseq -p'order:3' fasta_file # [p]ick the 3rd sequence
bioseq -p're:B31' fasta_file # [p]ick sequences with regex
bioseq -d'order:3' fasta_file # [d]elete the 3rd sequence
bioseq -d're:B31' fasta_file # [d]elete sequences with regex
bioseq -t1 dna_fasta # [t]ranslate in 1st reading frame
bioseq -t3 dna_fasta # [t]ranslate in 3 reading frames
bioseq -t6 dna_fasta # [t]ranslate in 6 reading frames
bioseq -g fasta_file # remove [g]aps
bioseq -A fasta_file # [A]nonymize sequence IDs
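As an illustration of the serial IDs that --anonymize describes ('n' characters long, including a leading 'S'), here is a minimal Python sketch. It is not bioseq's implementation; the function name serial_ids is hypothetical, numbering from 1 is inferred from the S0001 example, and the format of the generated sed script is not reproduced here.

```python
def serial_ids(original_ids, n):
    """Map serial IDs of total length n (leading 'S' included) to original IDs."""
    width = n - 1  # digits left after the leading 'S'
    if width < 1:
        raise ValueError("n must be at least 2")
    if len(original_ids) > 10 ** width - 1:
        raise ValueError("too many sequences for IDs of this length")
    # Assumption: numbering starts at 1, as in the S0001 example.
    return {f"S{i:0{width}d}": orig
            for i, orig in enumerate(original_ids, start=1)}


mapping = serial_ids(["B31_ospA", "N40_ospA"], 5)
for new_id, old_id in mapping.items():
    print(new_id, old_id)
```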
Single FASTA-file output
bioseq -s'1,10' fasta_file # [s]ubsequence from positions 1-10
bioseq -R'10' bac_genome_fasta # [R]e-circularize a genome at position 10
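To make the coordinates used by -s and -R concrete, the following Python sketch mimics the two operations on plain strings. It is an illustration only, not bioseq's code: the function names are hypothetical, the 1-based inclusive indexing follows Bio::Seq#subseq, and the rotation follows the "ABCDE" to "BCDEA" example given for --reloop.

```python
def subseq(seq, start, end):
    """Return positions start..end of seq, counting from 1, both ends included."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"invalid range {start},{end} for length {len(seq)}")
    return seq[start - 1:end]


def reloop(seq, start):
    """Rotate a circular sequence so that 1-based position start comes first."""
    if not 1 <= start <= len(seq):
        raise ValueError(f"invalid start {start} for length {len(seq)}")
    return seq[start - 1:] + seq[:start - 1]


print(subseq("ABCDEFGHIJKL", 1, 10))  # ABCDEFGHIJ
print(reloop("ABCDE", 2))             # BCDEA
```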
# Retrieve sequence from database
bioseq -f 'X83553' -o 'genbank' # [f]etch a genbank file by accession
bioseq -f 'X83553' -o 'fasta' # [f]etch a genbank file in FASTA
# Less common usages (options in CAPs)
bioseq -L fasta_file # [L]inearize FASTA: one sequence per line
bioseq -B fasta_file # [B]reak into single-seq files
bioseq -C cds_fasta # [C]odon counts (for coding sequences)
bioseq -H pep_fasta # [H]ydrophobicity score (for protein seq)
bioseq -i'genbank' -F file.gb # extract genbank [F]eatures to FASTA
bioseq -x 'EcoRI' dna_fasta # Fragments from restriction digest
Examples involving Unix pipes:
bioseq -p'id:B31' dna_fasta | bioseq -g | bioseq -t1 # pick a seq, remove gaps, & translate
bioseq -p'order:2' dna_fasta | bioseq -r | bioseq -s'10,20' # pick the 2nd seq, rev-com it, & subseq
SEE ALSO
CONTRIBUTORS
Yozen Hernandez <yzhernand at gmail dot com>
Girish Ramrattan <gramratt at gmail dot com>
Levy Vargas <levy dot vargas at gmail dot com>
Weigang Qiu (Maintainer)
Rocky Bernstein