NAME
bioseq - FASTA sequence utility based on Bio::Perl
SYNOPSIS
bioseq options [input_file]
bioseq [-h | --help | --v | --version | --man]
bioseq is a command-line utility for common, routine sequence manipulations. Most methods are wrappers for Bio::Perl modules: Bio::Seq, Bio::SeqIO, Bio::SeqUtils, and Bio::Tools::SeqStats.
By default, bioseq assumes that both the input and the output files are in FASTA format, to facilitate chaining multiple bioseq runs with UNIX pipes.
Methods that are currently not wrappers should ideally be factored into individual Bio::Perl modules, which are better tested and handle exceptions better than the stand-alone code in the Bio::BPWrapper package. As a design principle, command-line scripts here should consist only of wrapper calls.
Options
- --composition, -c <input_file>
-
Base or AA composition. A wrapper for Bio::Tools::SeqStats->count_monomers
- --delete, -d 'tag:value' <input_file>
-
Delete a sequence or a comma-separated list of sequences, e.g.,
--delete id:foo # by id
--delete order:2 # by order
--delete length:n # by min length, where 'n' is length
--delete ambig:x # by min % ambiguous base/aa, where 'x' is the %
--delete id:foo,bar # list by id
--delete re:REGEX # using a regular expression (only one regex is expected)
- --fetch, -f <genbank_accession>
-
Retrieves a sequence from GenBank using the provided accession number. A wrapper for Bio::DB::GenBank->get_Seq_by_acc.
- --nogaps, -g <input_file>
-
Remove gaps
- --input, -i 'format' <input_file>
-
Input file format. By default, this is 'fasta'. For GenBank format, use 'genbank'. For EMBL format, use 'embl'. Wraps Bio::SeqIO.
- --length, -l <input_file>
-
Print all sequence lengths. Wraps Bio::Seq->length.
- --numseq, -n <input_file>
-
Print number of sequences.
- --output, -o 'format' <input_file>
-
Output file format. By default, this is 'fasta'. For GenBank format, use 'genbank'. For EMBL format, use 'embl'. Wraps Bio::SeqIO.
- --pick, -p
-
Select a single sequence:
--pick 'id:foo' # by id
--pick 'order:2' # by order
--pick 're:REGEX' # using a regular expression
Select a list of sequences:
--pick 'id:foo,bar' # list by id
--pick 'order:2,3' # list by order
--pick 'order:2-10' # list by range
Usage: bioseq -p 'tag:value' <input_file>
- --revcom | -r <input_file>
-
Reverse complement. Wraps Bio::Seq->revcom().
- --subseq | -s 'beginning_index, ending_index' <input_file>
-
Select substring (of the 1st sequence). Wraps Bio::Seq->subseq(). For example:
bioseq --subseq 20,80 <input_file>
- --translate | -t [1|3|6] <input_file>
-
Translate in 1, 3, or 6 frames, e.g., -t1, -t3, or -t6. Wraps Bio::Seq->translate(), Bio::SeqUtils->translate_3frames(), and Bio::SeqUtils->translate_6frames().
- --restrict | -x 'RE' <dna_fasta_file>
-
Predicted fragments from digestion by a specified restriction enzyme. An input file with a single sequence is expected. Wraps Bio::Restriction::Analysis->cut().
- --anonymize | -A 'number' <input_file>
-
Replace sequence IDs with serial IDs n characters long. Each ID is prefixed with a leading 'S'. For example, using option --anonymize '5', the first ID will be S0001. A sed script file with a .sed suffix is also written; it may be used with sed's -f argument. If the filename is -, the sed file is named STDOUT.sed instead. A message containing the sed filename is written to STDERR.
- --break | -B <input_file>
-
Break into individual sequences, writing a FASTA file for each sequence.
- --count-codons | -C <input_file>
-
Count codons for coding sequences (e.g., a genome file consisting of CDS sequences). Wraps Bio::Tools::SeqStats->count_codons().
- --feat2fas | -F
-
Extract gene sequences in FASTA format from a GenBank file of a bacterial genome. This will not work for a eukaryotic GenBank file. For example:
bioseq --input genbank --feat2fas <genbank_file>
- --leadgaps | -G <input_file>
-
Count and return the number of leading gaps in each sequence.
- --hydroB, -H
-
Return the mean Kyte-Doolittle hydropathicity for protein sequences. Wraps Bio::Tools::SeqStats->hydropathicity().
- --linearize, -L <input_file>
-
Linearize FASTA, one sequence per line.
- --reloop, -R
-
Re-circularize a bacterial genome by starting at a specified position. For example, for sequence "ABCDE",
bioseq -R'2' ..
would generate "BCDEA".
bioseq --reloop 'number' <input_file>
- --removestop, -X
-
Remove stop codons (e.g., for PAML input).
bioseq --removestop <input_file>
- --split-cdhit
Common Options
- --help, -h
-
Print a brief help message and exit.
- --version, -V
-
Print current release version of this command and exit.
- --man (but not "-m")
-
Print the manual page and exit.
EXAMPLES
FASTA descriptors
bioseq --length fasta_file # lengths of sequences
bioseq --numseq fasta_file # number of sequences
bioseq --composition fasta_file # base or aa composition of sequences
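To make the three descriptors concrete, the following is a minimal Python sketch of what they report. It is not bioseq's implementation (bioseq wraps Bio::SeqIO and Bio::Tools::SeqStats); the helper name read_fasta and the two demo records are assumptions made for illustration only.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (id, sequence) pairs from a FASTA-formatted text handle.

    Hypothetical helper for illustration; not part of bioseq.
    """
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is taken as the first word after '>'.
            words = line[1:].split()
            seq_id, chunks = (words[0] if words else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Two made-up records; the first spans two lines.
example = StringIO(">seq1 demo\nACGT\nAC\n>seq2\nGGA\n")
records = list(read_fasta(example))

lengths = {seq_id: len(seq) for seq_id, seq in records}              # cf. --length
num_seqs = len(records)                                              # cf. --numseq
composition = {seq_id: dict(Counter(seq)) for seq_id, seq in records}  # cf. --composition

print(lengths)
print(num_seqs)
print(composition)
```

The exact output layout of bioseq itself is not reproduced here; only the quantities are.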
FASTA filters
These take a FASTA-format file as input and output one or more FASTA-format files.
Multiple FASTA-file output
bioseq --revcom fasta_file # reverse-complement sequences
bioseq --pick 'order:3' fasta_file # pick the 3rd sequence
bioseq --pick 're:B31' fasta_file # pick sequences with regex
bioseq --delete order:3 fasta_file # delete the 3rd sequence
bioseq --delete re:B31 fasta_file # delete sequences with regex
bioseq --translate 1 dna_fasta # translate in 1st reading frame
bioseq --translate 3 dna_fasta # translate in 3 reading frames
bioseq --translate 6 dna_fasta # translate in 6 reading frames
bioseq --nogaps fasta_file # remove gaps
bioseq --anonymize fasta_file # Anonymize sequence IDs
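The 'tag:value' selectors accepted by --pick and --delete can be read as follows. This Python sketch is only an interpretation of the documented forms (id, order, re, with comma lists and order ranges); the function name select is hypothetical, and matching the regular expression against the sequence ID alone is an assumption, since bioseq's actual matching target is not stated here.

```python
import re


def select(records, selector):
    """Return the (id, seq) records matched by a 'tag:value' selector.

    Hypothetical sketch: supports id, order (1-based, lists and ranges), and re.
    """
    tag, _, value = selector.partition(":")
    if tag == "id":
        wanted = set(value.split(","))
        return [rec for rec in records if rec[0] in wanted]
    if tag == "order":
        positions = set()
        for part in value.split(","):
            if "-" in part:
                start, end = (int(x) for x in part.split("-"))
                positions.update(range(start, end + 1))  # range is inclusive
            else:
                positions.add(int(part))
        return [rec for i, rec in enumerate(records, start=1) if i in positions]
    if tag == "re":
        pattern = re.compile(value)
        # Assumption: the regex is tested against the sequence ID only.
        return [rec for rec in records if pattern.search(rec[0])]
    raise ValueError(f"unsupported tag: {tag!r}")


def delete(records, selector):
    """Complement of select(): keep everything the selector does not match."""
    matched = select(records, selector)
    return [rec for rec in records if rec not in matched]


# Made-up records for demonstration.
records = [("B31_a", "ACGT"), ("N40", "GGCC"), ("B31_b", "TTAA"), ("JD1", "CCGG")]
print([r[0] for r in select(records, "order:2-3")])
print([r[0] for r in select(records, "re:B31")])
print([r[0] for r in delete(records, "id:N40,JD1")])
```

The length and ambig tags of --delete are left out of the sketch because their exact thresholds are not specified beyond "min length" and "min %".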
Single FASTA-file output
bioseq --subseq 1,10 fasta_file # subsequence from positions 1-10
bioseq --reloop 10 bac_genome_fasta # re-circularize a genome at position 10
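Both commands above use 1-based positions. The Python sketch below illustrates the coordinate conventions: an inclusive 1-based range for the subsequence (as in Bio::Seq->subseq), and a rotation that makes the given position the new start, consistent with the "ABCDE" to "BCDEA" example in the option description. The function names are hypothetical and this is not bioseq's code.

```python
def subseq(seq, start, end):
    """Return seq[start..end] using 1-based, inclusive coordinates."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates out of range")
    return seq[start - 1:end]


def reloop(seq, start):
    """Rotate a circular sequence so that 1-based position `start` comes first."""
    if not 1 <= start <= len(seq):
        raise ValueError("start position out of range")
    return seq[start - 1:] + seq[:start - 1]


print(subseq("ABCDEFGHIJKL", 1, 10))  # first ten characters
print(reloop("ABCDE", 2))             # rotation starting at position 2
```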
# Retrieve sequence from database
bioseq --fetch X83553 --output genbank # fetch a genbank file by accession
bioseq --fetch X83553 --output fasta # fetch a GenBank record in FASTA format
# Less common usages
bioseq --linearize fasta_file # Linearize FASTA: one sequence per line
bioseq --break fasta_file # Break into single-seq files
bioseq --count-codons cds_fasta # Codon counts (for coding sequences)
bioseq --hydroB pep_fasta # Hydrophobicity score (for protein seq)
bioseq --input genbank --feat2fas file.gb # extract genbank features to FASTA
bioseq --restrict EcoRI dna_fasta # Fragments from restriction digest
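As an illustration of the score reported by --hydroB, the following Python sketch computes the mean Kyte-Doolittle hydropathy of a peptide. It uses the published Kyte-Doolittle scale; the function name mean_hydropathy and the decision to reject non-standard residues are assumptions of this sketch, and Bio::Tools::SeqStats may treat such residues differently.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide):
    """Return the mean Kyte-Doolittle hydropathy of a peptide.

    Assumption of this sketch: any residue outside the 20 standard
    amino acids raises ValueError.
    """
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("empty sequence")
    unknown = set(peptide) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


print(mean_hydropathy("AI"))    # hydrophobic dipeptide, positive score
print(mean_hydropathy("KDE"))   # charged tripeptide, negative score
```

Positive means indicate an overall hydrophobic sequence and negative means a hydrophilic one.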
Examples involving Unix pipes
bioseq --pick id:B31 dna_fasta | bioseq --nogaps | bioseq --translate 1 # pick a seq, remove gaps, & translate
bioseq --pick order:2 dna_fasta | bioseq -r | bioseq --subseq 10,20 # pick the 2nd seq, rev-com it, & subseq
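The second pipeline composes three simple steps, each consuming the output of the previous one. The Python sketch below mirrors that composition on in-memory data to show what each stage contributes; the record contents are made up, the function names are hypothetical, and only unambiguous DNA bases (plus N) are handled, whereas Bio::Seq->revcom also handles IUPAC ambiguity codes.

```python
# Complement table for unambiguous DNA bases and N (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcom(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def pick_by_order(records, n):
    """Return the n-th record (1-based)."""
    return records[n - 1]


def subseq(seq, start, end):
    """1-based, inclusive substring."""
    return seq[start - 1:end]


# Made-up input standing in for dna_fasta.
records = [
    ("first", "ACGTACGTACGTACGTACGTACGT"),
    ("second", "AAAACCCCGGGGTTTTAAAACCCC"),
]

seq_id, seq = pick_by_order(records, 2)   # bioseq --pick order:2
seq = revcom(seq)                         # bioseq -r
fragment = subseq(seq, 10, 20)            # bioseq --subseq 10,20

print(f">{seq_id}\n{fragment}")
```

Because every stage reads and writes the same record format, the stages can be reordered or extended freely, which is the reason bioseq defaults to FASTA for both input and output.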
SEE ALSO
Bio::BPWrapper::SeqManipulations, the underlying Perl Module
bioaln: a wrapper of Bio::SimpleAlign and additional methods
CONTRIBUTORS
Yözen Hernández <yzhernand at gmail dot com>
Girish Ramrattan <gramratt at gmail dot com>
Levy Vargas <levy dot vargas at gmail dot com>
Weigang Qiu (Maintainer)
Rocky Bernstein