NAME
bioseq - Manipulation of FASTA sequences based on BioPerl
SYNOPSIS
bioseq [options] input_file
bioseq [-h | --help | -V | --version | --man]
FASTA descriptors
bioseq -l fasta_file # [l]engths of sequences
bioseq -n fasta_file # [n]umber of sequences
bioseq -c fasta_file # base or aa [c]omposition
FASTA filters
Multiple FASTA-file output
bioseq -r fasta_file # [r]everse-complement sequences
bioseq -p 'order:3' fasta_file # pick the 3rd sequence
bioseq -p 're:B31' fasta_file # pick sequences with regex
bioseq -d 'order:3' fasta_file # delete the 3rd sequence
bioseq -d 're:B31' fasta_file # delete sequences with regex
bioseq -t 1 dna_fasta # translate in 1st reading frame
bioseq -t 3 dna_fasta # translate in 3 reading frames
bioseq -t 6 dna_fasta # translate in 6 reading frames
bioseq -g fasta_file # remove gaps
Single FASTA-file output
bioseq -s '1,10' fasta_file # sub-sequence from positions 1-10
bioseq --reloop '10' contig_fasta # re-circularize a genome at position 10
Less common usages
bioseq --linearize fasta_file # Linearize FASTA: one sequence per line
bioseq --break fasta_file # Break into single-seq files
bioseq --count-codons cds_fasta # Codon counts (for coding sequences)
bioseq --hydroB pep_fasta # Hydrophobicity score (for protein seq)
bioseq --input 'genbank' --feat2fas file.gb # extract genbank features to FASTA
bioseq --restrict 'EcoRI' dna_fasta # Fragments from restriction digest
Serialize with pipes
bioseq -p 'id:B31' dna_fasta | bioseq -g | bioseq -t1 # pick a seq, remove gaps & translate
bioseq -p 'order:2' dna_fasta | bioseq -r | bioseq -s '10,20' # pick the 2nd seq, rev-com & subseq
DESCRIPTION
bioseq is a command-line utility for common, routine sequence manipulations based on BioPerl modules including Bio::Seq, Bio::SeqIO, Bio::SeqUtils, and Bio::Tools::SeqStats.
By default, bioseq assumes that both the input and the output files are in FASTA format, to facilitate the chaining (by UNIX pipes) of serial bioseq runs.
Methods that are currently not wrappers should ideally be factored into individual Bio::Perl modules, which are better tested and handle exceptions better than the stand-alone code in the Bio::BPWrapper package. As a design principle, command-line scripts here should consist only of wrapper calls.
OPTIONS
- --anonymize, -A 'number'
-
This option was designed for legacy programs (e.g., the PHYLIP suite) that accept only sequence IDs of up to 10 characters.
Replace sequence IDs with serial IDs n characters long. Each serial ID starts with a leading 'S'. For example, with the option --anonymize '5', the first ID will be S0001. A sed script file with a .sed suffix is also written; it may be used with sed's -f argument. If the filename is -, the sed file is named STDOUT.sed instead. A message containing the sed filename is written to STDERR.
- --break, -B
-
Break into individual sequences, writing a FASTA file for each sequence.
- --composition, -c
-
Base or AA composition.
- --count-codons, -C
-
Count codons for coding sequences (e.g., a genome file consisting of CDS sequences).
- --delete, -d 'tag:value'
-
Delete a sequence or a comma-separated list of sequences, e.g.,
--delete id:foo # by id
--delete order:2 # by order
--delete length:n # by min length, where 'n' is length
--delete ambig:x # by min % ambiguous base/aa, where 'x' is the %
--delete id:foo,bar # list by id
--delete re:REGEX # using a regular expression (only one regex is expected)
- --feat2fas | -F
-
Extract gene sequences in FASTA format from a GenBank file of a bacterial genome. This does not work for eukaryotic GenBank files. For example:
bioseq -i 'genbank' -F <genbank_file>
- --fetch, -f <genbank_accession>
-
Do not use! This method is broken due to an NCBI protocol change. It used to retrieve a sequence from GenBank using the provided accession number.
- --hydroB, -H
-
Return the mean Kyte-Doolittle hydropathicity for protein sequences.
- --input, -i
-
Input file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'.
- --lead-gaps | -G
-
Count and return the number of leading gaps in each sequence.
- --length, -l
-
Print all sequence lengths.
- --linearize, -L
-
Linearize FASTA, one sequence per line.
- --no-gaps, -g
-
Remove gaps.
- --num-seq, -n
-
Print number of sequences.
- --output, -o 'format'
-
Output file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'.
- --pick, -p 'tag:value'
-
Select a single sequence:
--pick 'id:foo' # by id
--pick 'order:2' # by order
--pick 're:REGEX' # using a regular expression
Select a list of sequences:
--pick 'id:foo,bar' # list by id
--pick 'order:2,3' # list by order
--pick 'order:2-10' # list by range
- --reloop, -R 'number'
-
Re-circularize a bacterial genome by starting at a specified position. For example, for sequence "ABCDE", bioseq -R'2' would generate "BCDEA".
- --remove-stop, -X
-
Remove stop codons (e.g., for PAML input).
- --restrict, -x 'RE'
-
Predicted fragments from digestion by a specified restriction enzyme.
- --restrict-coord 'RE'
-
Predicted fragments from digestion by a specified restriction enzyme. Outputs coordinates of overhangs in BED format.
- --revcom | -r
-
Reverse complement.
- --split-cdhit 'cdhit .clstr file'
-
Parse cdhit output .clstr file and generate a FASTA file for each CDHIT family.
- --subseq | -s 'beginning_index,ending_index'
-
Select substring (of the 1st sequence).
- --translate | -t [1|3|6]
-
Translate in 1, 3, or 6 frames, e.g., -t1, -t3, or -t6.
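The ID and position arithmetic behind --anonymize, --reloop, and --subseq can be illustrated with a short Python sketch. This is not bioseq's implementation (bioseq is written in Perl on top of BioPerl); the function names are hypothetical, and the sketch assumes that positions are 1-based and that sub-sequence ranges are inclusive, as the examples in this page suggest.

```python
# Illustrative sketch only; function names are hypothetical, not part of bioseq.

def serial_id(index: int, width: int) -> str:
    """Return the index-th serial ID: 'S' plus a zero-padded counter,
    `width` characters long in total (assumes numbering starts at 1)."""
    return "S" + str(index).zfill(width - 1)


def reloop(seq: str, start: int) -> str:
    """Rotate a circular sequence so that it begins at 1-based position `start`."""
    cut = start - 1
    return seq[cut:] + seq[:cut]


def subseq(seq: str, begin: int, end: int) -> str:
    """Return positions begin..end (assumed 1-based and inclusive)."""
    return seq[begin - 1:end]


if __name__ == "__main__":
    print(serial_id(1, 5))                 # S0001, as in the --anonymize '5' example
    print(reloop("ABCDE", 2))              # BCDEA, as in the --reloop example
    print(subseq("ABCDEFGHIJKL", 1, 10))   # ABCDEFGHIJ
```

The first two printed values reproduce the examples given for --anonymize and --reloop; the third only shows what "positions 1-10" means under the inclusive-range assumption.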
Common Options
- --help, -h
-
Print a brief help message and exit.
- --man (but not "-m")
-
Print the manual page and exit.
- --version, -V
-
Print current release version of this command and exit.
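The 'tag:value' selectors accepted by --pick and --delete (see those entries above) can be read as a small grammar: id takes one ID or a comma-separated list, order takes a position, a list, or a range, and re takes a single regular expression. The Python sketch below shows one way such a selector could be interpreted. It is not bioseq's code; the function name `select` is hypothetical, and the sketch assumes that positions are 1-based, that ranges are inclusive, and that the regular expression is matched against the sequence ID.

```python
# Illustrative sketch only; `select` is a hypothetical helper, not part of bioseq.
import re
from typing import List


def select(ids: List[str], selector: str) -> List[int]:
    """Return the 0-based indices of the records matched by a 'tag:value' selector."""
    # Split at the first colon only, so a regex value may itself contain ':'.
    tag, _, value = selector.partition(":")
    if tag == "id":
        wanted = set(value.split(","))
        return [i for i, seq_id in enumerate(ids) if seq_id in wanted]
    if tag == "order":
        positions = set()
        for part in value.split(","):
            if "-" in part:                      # range such as 2-10 (assumed inclusive)
                low, high = map(int, part.split("-"))
                positions.update(range(low, high + 1))
            else:                                # single 1-based position
                positions.add(int(part))
        return [i for i in range(len(ids)) if i + 1 in positions]
    if tag == "re":
        pattern = re.compile(value)
        return [i for i, seq_id in enumerate(ids) if pattern.search(seq_id)]
    raise ValueError(f"unsupported tag: {tag!r}")


if __name__ == "__main__":
    ids = ["B31_ospA", "N40_ospA", "B31_ospC", "JD1_ospC"]  # made-up IDs
    print(select(ids, "order:2-3"))   # [1, 2]
    print(select(ids, "re:B31"))      # [0, 2]
```

Under this reading, --pick would keep the records at the returned indices and --delete would keep all the others. The length and ambig tags of --delete are left out of the sketch.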
SEE ALSO
Bio::BPWrapper::SeqManipulations, the underlying Perl Module
Lawrence et al (2015). FAST: FAST analysis of sequences toolbox. Front. Genet. 6:172.
CONTRIBUTORS
Yözen Hernández <yzhernand at gmail dot com> (Initial design and implementation)
Girish Ramrattan <gramratt at gmail dot com> (developer)
Levy Vargas <levy dot vargas at gmail dot com> (developer)
Weigang Qiu (Maintainer)
Rocky Bernstein (testing and release)
Filipe G. Vieira (developer of the --restrict and --restrict-coord methods)
TO DO
Add bioperl scripts ("bp_xxx.pl") functions?
TO CITE
Hernandez, Bernstein, Qiu, et al (2017). "BpWrappers: Command-line utilities for manipulation of sequences, alignments, and phylogenetic trees based on BioPerl". (In prep).
Stajich et al (2002). "The BioPerl Toolkit: Perl Modules for the Life Sciences". Genome Research 12(10):1611-1618.