NAME
Bio::Seq - bioperl sequence object
SYNOPSIS
Object Creation
$seq = Bio::Seq->new;
$seq = Bio::Seq->new(-seq=>'ACTGTGGCGTCAACTG');
$seq = Bio::Seq->new(-seq=>$sequence_string);
$seq = Bio::Seq->new(-seq=>@character_list);
$seq = Bio::Seq->new($file,$seq,$id,$desc,$names,
$numbering,$type,$ffmt,$descffmt);
Object Creation from files
There are two ways to create Bio::Seq objects from files. One is using internal Sequence reading routines in this object, which can handle a few formats. The second is to use the newer SeqIO system, which can handle slightly more formats, can handle multiple sequences in one file, and can be easily extended to new formats.
Try to use the new style. It does give you more flexibility and stability.
# old-style and deprecated,
$seq = Bio::Seq->new($filename); # guesses Fasta format
$seq = Bio::Seq->new(-file=>'seqfile.aa',
-desc=>'Sample Bio::Seq sequence',
-start=>'1',
-ffmt=> 'Fasta',
-type=>'Amino',
);
# new style, better, but somewhat more wordy
# notice this loops over multiple sequences
$stream = Bio::SeqIO->new(-file => 'myfile', -format => 'Fasta');
while ( $seq = $stream->next_seq() ) {
# $seq is a Bio::Seq object
}
Object Manipulation
$seq->[METHOD];
$result = $seq->[METHOD];
Accessors
--------------------------------------------------------
There are a wide variety of methods designed to give easy
and flexible access to the contents of sequence objects.
The following accessors can be invoked upon a sequence object:
ary() - access sequence (or slice of sequence) as an array
str() - access sequence (or slice of sequence) as a string
getseq() - access sequence (or slice) as string or array
seq_len() - access sequence length
id() - access/change object id
desc() - access/change object description
names() - access/change object names
start() - access/change start point of the sequence (see note below)
end() - access/change end point of the sequence (see note below)
numbering() - access/change sequence numbering offset (deprecated)
origin() - access/change sequence origin
type() - access/change sequence type
setseq() - change sequence
Deprecated format changes.
ffmt() - access/change default output format
descffmt() - access/change description format
Methods
--------------------------------------------------------
The following methods can be invoked upon a sequence object
copy() - returns an exact copy of an object
alphabet_ok() - check sequence against genetic alphabet
alphabet() - returns the genetic alphabet currently in use
layout() - sequence formatter for output
revcom() - reverse complement of sequence
complement() - complement of sequence
reverse() - reverse of sequence
Dna_to_Rna() - translate Dna seq to Rna
Rna_to_Dna() - translate Rna seq to Dna
translate() - protein translation of Dna/Rna sequence
copy, revcom and translate all return new Bio::Seq objects. This
makes it easy to use these objects in other Bioperl modules and/or
to use the new SeqIO system for format dumping.
complement, reverse, Dna_to_Rna and Rna_to_Dna all return strings,
as it is less likely that you want these things as real Seq objects.
OBJECT IN TRANSITION
The Bio::Seq object is by far the oldest object in the bioperl set of modules, and it shows: four or five people have developed its methods, and much of the documentation focuses on general bioperl issues. The bioperl core group is committed to eventually rewriting the Bio::Seq object along more sensible design principles, but this rewrite will
a) be heavily tested against old uses of the code
b) aim to be as backwardly compatible as possible
c) be well signposted when it is occurring.
For more information read the bioperl web page, projects, sequence object,
http://bio.perl.org/Projects/Sequence/
INSTALLATION
This module is included with the central Bioperl distribution:
http://bio.perl.org/Core/Latest
ftp://bio.perl.org/pub/DIST
Follow the installation instructions included in the README file.
DESCRIPTION
This module is the generic sequence object which lies at the core of the bioperl project. It stores Dna, Rna, or Protein sequence information and annotation. It has associated methods to perform various manipulations of sequences and support for reading and writing sequence data in a variety of file formats.
Bio::Seq has completely superseded Bio::PreSeq.pm.
The older PreSeq.pm code can be found at Chris Dagdigian's site: http://www.sonsorol.org/dag/bioperl/top.html
Sequence Types
Currently the following sequence types are recognized:
Dna
Rna
Amino
Alphabets
This module uses the standard extended single-letter genetic alphabets to represent nucleotide and amino acid sequences.
In addition to the standard alphabet, the following symbols are also acceptable in a biosequence:
? (a missing nucleotide or amino acid)
- (gap in sequence)
Extended Dna / Rna alphabet
(includes symbols for nucleotide ambiguity)
------------------------------------------
Symbol    Meaning               Nucleic Acid
------------------------------------------
A         A                     Adenine
C         C                     Cytosine
G         G                     Guanine
T         T                     Thymine
U         U                     Uracil
M         A or C
R         A or G
W         A or T
S         C or G
Y         C or T
K         G or T
V         A or C or G
H         A or C or T
D         A or G or T
B         C or G or T
X         G or A or T or C
N         G or A or T or C
IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE:
Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.
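The ambiguity symbols in the table above can be treated as a simple lookup from symbol to the set of bases it may stand for. The following is a minimal sketch in plain Perl; the hash %IUPAC_BASES and the function expand_symbol() are hypothetical names used only for illustration and are not part of Bio::Seq.

```perl
use strict;
use warnings;

# Hypothetical lookup table built from the alphabet listed above;
# an illustration only, not part of Bio::Seq.
my %IUPAC_BASES = (
    A => 'A',    C => 'C',    G => 'G',    T => 'T',    U => 'U',
    M => 'AC',   R => 'AG',   W => 'AT',   S => 'CG',   Y => 'CT',
    K => 'GT',   V => 'ACG',  H => 'ACT',  D => 'AGT',  B => 'CGT',
    X => 'ACGT', N => 'ACGT',
);

# Return the list of bases a single symbol can stand for
# (an empty list for symbols outside the table, such as '?' or '-').
sub expand_symbol {
    my ($symbol) = @_;
    my $bases = $IUPAC_BASES{ uc $symbol };
    return defined($bases) ? split(//, $bases) : ();
}

print join(',', expand_symbol('r')), "\n";   # prints "A,G"
```

The lookup is case-insensitive here by choice; whether lower-case symbols should be accepted is an assumption of the sketch.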
Amino Acid alphabet
------------------------------------------
Symbol    Meaning
------------------------------------------
A         Alanine
B         Aspartic Acid, Asparagine
C         Cysteine
D         Aspartic Acid
E         Glutamic Acid
F         Phenylalanine
G         Glycine
H         Histidine
I         Isoleucine
K         Lysine
L         Leucine
M         Methionine
N         Asparagine
P         Proline
Q         Glutamine
R         Arginine
S         Serine
T         Threonine
V         Valine
W         Tryptophan
X         Unknown
Y         Tyrosine
Z         Glutamic Acid, Glutamine
*         Terminator
IUPAC-IUB AMINO ACID SYMBOLS:
Biochem J. 1984 Apr 15; 219(2): 345-373
Eur J Biochem. 1993 Apr 1; 213(1): 2
Sequence IO Formats
You are encouraged to use the SeqIO system of IO, which in essence looks like:
use Bio::SeqIO;
$instream = Bio::SeqIO->new( -file => 'my.file', -format => 'Fasta' );
$outstream = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Raw' );
while ( $seq = $instream->next_seq ) {
$outstream->write_seq($seq);
}
The available formats can be found by listing the SeqIO directory
in the distribution that this comes with (as new SeqIO formats are
very easy to add, it is better to go to the directory, not try to list them
here).
Notice that the SeqIO system will only convert information which the Seq object stores. The Seq object is a lightweight object, and does not contain annotation or feature table information. This information is stored in a development object, called AnnSeq, which will be available in the 0.06 releases and later.
USAGE
Using Bio::Seq in your perl programs
Seq.pm is invoked via the perl 'use' command
use Bio::Seq;
Creating a biosequence object
The "constructor" method in Bio::Seq.pm is the new() function.
The proper syntax for accessing the new() function in Seq.pm is as follows:
$myseq = Bio::Seq->new;
Of course, objects are only useful if they have something in them, so you would probably want to pass along some additional information or arguments to the constructor. The foundation of any biosequence object is of course the sequence itself.
You can address new() with a sequence directly:
$myseq = Bio::Seq->new(-seq=>'AACTGGCGTTCGTG');
Or you can pass in a string or a list:
$myseq = Bio::Seq->new(-seq=>$sequence_string);
$myseq = Bio::Seq->new(-seq=>@sequence_list);
It is also possible to create a new sequence object based on a sequence contained in a file. You can tell the constructor where to find the sequence file by passing in the 'file' parameter:
$myseq = Bio::Seq->new(-file=>'seqfile.gcg');
Because there are so many different conventions or formats for storing sequence information in files, it would be polite (although not absolutely necessary) to tell the constructor what format the sequence file is in. We can provide that information via the file-format or 'ffmt' field. To create a sequence object based upon a GCG-formatted sequence file:
$myseq = Bio::Seq->new(-file=>'seqfile.gcg',-ffmt=>'GCG');
We've already introduced three different object attributes or arguments that can be passed to the new() object constructor ('seq','file' and 'ffmt') so now would be a good time to introduce them all:
BioSeq Constructor Arguments
file: The "file" argument should be a string value containing path and filename information for a sequence file that is to be read into an object.
seq: The "seq" argument is for passing in sequence directly instead of reading in a sequence file. The sequence should consist of RAW info (no whitespace, newlines or formatting) and can be passed in as either an array/list or string.
id: The "id" argument should be a ONE-WORD string value giving a short name for the sequence.
desc: The "desc" argument should be a string containing a description of the sequence. This field is not limited to one word.
names: The "names" argument should be a hash or reference to a hash that contains any number of user generated key-value pairs. Various bits of identifying information can be stored here including name(s), database locations, accession numbers, URL's, etc.
type: The "type" argument should be a string value describing the sequence type, e.g. "Dna", "Rna" or "Amino".
origin: The "origin" argument should be a string value describing sequence origin info
start: The start point, in biological coordinates of the sequence
end: The end point, in biological coordinates of the last residue in the sequence
start/end attributes are not strongly tied to what is actually in the sequence (i.e., $seq->start()+length($seq->getseq()) doesn't necessarily equal $seq->end()+1, although most of the time it should).
This is to allow some oddities to be stored in the Seq object sensibly.
The numbering convention is 'biological' coordinates, i.e. the sequence ATG would start at 1 (A) and finish at 3 (G). (NB - this is different from how perl represents ranges in sequences, which count from 0.)
numbering() is the older equivalent of start() and accesses the same attribute. Eventually it will be removed.
numbering: (Deprecated) The "numbering" argument should be an integer value containing the sequence numbering offset value. By default all sequences are numbered starting with 1.
ffmt:
This documentation describes the old format system: you are encouraged to use the newer SeqIO system described separately in the SeqIO documentation.
The "ffmt" argument should be a string describing sequence file-format. If a sequence is being read from a file via the "file" argument, "ffmt" is used to invoke the proper parsing code. "ffmt" is also the default format for sequence output when the layout method is called. See elsewhere in this documentation for info regarding recognized sequence file-formats.
If most of these arguments were used at once to create a sequence object, it would look something like this:
#Set up the name hash
%names = (
'CloneID','DB1',
'Isolate','5',
'Tissue','Xenopus',
'Location','/usr2/users/dag/bioperl/sample.tfa'
);
$name_ref = \%names;
#Create the object
$myseq = Bio::Seq->new(-file=>'sample.tfa',
-names=>$name_ref,
-type=>'Dna',
-origin=>'Xenopus mesoderm',
-start=>'1',
-desc=>'Sample Bio::Seq sequence',
-ffmt=>'Fasta');
Methods
Once an object has been created, there are defined ways to go about accessing the information. Users are encouraged to poke around "under the hood" of Seq.pm to see what is going on, but it is considered bad form to bypass the defined accessor methods and mess around with the internal code. Bypassing the defined methods "voids the warranty" of the module and can lead to problems down the road. The implied agreement between module creators and users is that the creators will strive to keep the interface standard and backwards-compatible while the users will avoid becoming dependent on bits of internal code that may change or disappear in future revisions.
Detailed information about each method described here can be found in the Appendix.
Accessing information
For each defined way to access information from a biosequence object, there is a corresponding "method" that is invoked. What follows is a brief description of each accessor method. For more detailed information see the individual annotations for each method near the end of this document.
Sequence
The sequence can be accessed in several ways via the getseq() method. Depending on how it is invoked, it can return either a string or a list value.
Both examples are appropriate:
@sequence_list = $myseq->getseq;
$sequence_string = $myseq->getseq;
Sequence "slices" can be accessed by passing start and stop integer position arguments to getseq():
@slice = $myseq->getseq($start,$stop);
@slice = $myseq->getseq(1,50);
@slice = $myseq->getseq(100);
If no stop value is passed in, getseq() will return a slice from the start position to the end of the sequence. Slices are returned in the context of the object "start" attribute, not absolute position, so be aware of the object's numbering scheme.
Sequences can also be accessed with the ary() and str() methods. The ary() method will always return a list value and str() will always return a string. Otherwise they are functionally identical to the getseq() method.
$sequence = $myseq->str;
@sequence = $myseq->ary;
@slice = $myseq->ary($start,$stop);
$slice = $myseq->str($start,$stop);
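The difference between getseq() on one hand and ary() and str() on the other comes down to calling context. A minimal sketch of a context-sensitive accessor in plain Perl follows; sequence_in_context() is a hypothetical stand-in and does not reproduce the module's internals.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a context-sensitive accessor such as getseq():
# returns a list of characters in list context and a string otherwise.
sub sequence_in_context {
    my ($sequence) = @_;
    return wantarray ? split(//, $sequence) : $sequence;
}

my @as_list   = sequence_in_context('ACTG');   # list context
my $as_string = sequence_in_context('ACTG');   # scalar context
print scalar(@as_list), " residues, string '$as_string'\n";
# prints "4 residues, string 'ACTG'"
```

An accessor that always returns one form, as ary() and str() are described as doing, avoids surprises when the call appears inside a larger expression whose context is not obvious.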
Sequence length
The sequence length can be accessed using the seq_len() method
$len = $myseq->seq_len;
Sequence ID
The ID field can be accessed using the id() method
$ID = $myseq->id;
Description
The object description field can be accessed using the desc() method
$description = $myseq->desc;
Names
The associative array (hash) that contains flexible information regarding alternative sequence names, database locations, accession numbers, etc. can be accessed by
%name_hash = $myseq->names;
Sequence start
The biological position of the first residue in the sequence can be accessed via start()
$start = $myseq->start;
Sequence end
The biological position of the last residue in the sequence can be accessed via end()
$end = $myseq->end;
Sequence Origin
The object origin (source organism) field can be accessed via origin()
$seq_origin = $myseq->origin;
File input format / default output format
The object format field can be accessed using the ffmt() method
$format = $myseq->ffmt;
Changing Information in Sequence Objects
In the previous section it was shown how object attributes and values could be retrieved from a sequence object by calling upon various methods. Many of the above methods will also allow the user to CHANGE object attributes by passing in additional arguments. Detailed information on each method can be found in the Appendix.
Changing the sequence
The sequence information for an object can be changed by passing a string or list value to the setseq() method. Here are some ways that sequence information can be changed:
$myseq->setseq($new_sequence_string);
$myseq->setseq(@new_sequence_list);
$myseq->setseq("aaccttgcctgc");
The setseq() method checks sequence elements and warns if it finds non-standard characters. Because of this, arbitrary sequence compositions are not supported at this time. This method is considered slightly 'insecure' because the 'id','desc' and 'type' fields are not updated along with the sequence. If necessary, the user must make the appropriate changes to these fields whenever sequence information is updated or changed.
Changing the sequence ID
The ID field can be changed by passing in a new ID argument to id()
$myseq->id($new_id);
Changing the object description
The object description field can be changed by passing in a new argument to desc()
$myseq->desc($new_desc);
Changing the object names hash
The associative array (hash) that contains flexible information regarding alternative sequence names, database locations, accession numbers, etc. can be changed by passing in a reference to a new hash to names()
$hash_ref = \%name_hash;
$myseq->names($hash_ref);
Changing the sequence start or end
The default numbering offset for the sequence can be changed by passing in a new value to start() or end()
$myseq->start(1);
$myseq->start($new_value);
Sequence Origin
The object origin field can be changed by passing in a new string value to origin()
$myseq->origin("mitochondrial");
$myseq->origin($origin_string);
File input format / default output format
The object format field can be changed by passing in a new value to ffmt()
$myseq->ffmt("GCG");
Manipulating sequences
Creating, accessing and changing biosequence objects and fields is all well and good, but eventually you are going to want to actually do some work.
Included with Seq.pm are some commonly used utility methods for manipulating sequence data. So far Seq.pm contains methods for:
Copying a biosequence object
using copy()
# NB - new_obj is a Bio::Seq object
$new_obj = $myseq->copy;
Reversing a sequence
using reverse()
$reversed_seq = $myseq->reverse;
Complementing a sequence
The 2nd strand, or "complement" of a biosequence can be obtained by calling upon the complement() method.
$comp_seq = $myseq->complement;
Reverse complementing a sequence
using revcom()
# NB - rev_comp is a Bio::Seq object
$rev_comp = $myseq->revcom;
Translating Dna to Rna
using Dna_to_Rna()
$rna_seq = $myseq->Dna_to_Rna;
Translating Rna to Dna
using Rna_to_Dna()
$dna_seq = $myseq->Rna_to_Dna;
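Converting between Dna and Rna strings amounts to swapping T and U. The sketch below shows one way to do this in plain Perl; dna_to_rna() and rna_to_dna() are hypothetical helper names, and preserving letter case is an assumption of the sketch rather than documented behaviour of the module.

```perl
use strict;
use warnings;

# Hypothetical helpers, not the Bio::Seq methods themselves.
# Each returns a new string and leaves its argument untouched.
sub dna_to_rna {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;   # thymine becomes uracil
    return $rna;
}

sub rna_to_dna {
    my ($rna) = @_;
    (my $dna = $rna) =~ tr/Uu/Tt/;   # uracil becomes thymine
    return $dna;
}

print dna_to_rna('ATGTTT'), "\n";   # prints "AUGUUU"
print rna_to_dna('AUGUUU'), "\n";   # prints "ATGTTT"
```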
Translating Dna or Rna to protein
using translate()
# NB - peptide_seq is a Bio::Seq object
$peptide_seq = $myseq->translate;
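To make the idea of protein translation concrete, here is a minimal sketch that translates the first reading frame of a nucleotide string using the standard genetic code. It is plain Perl and independent of Bio::Seq; translate_frame1() is a hypothetical name, and the choices to emit 'X' for codons containing ambiguity symbols and to ignore a trailing partial codon are assumptions of the sketch.

```perl
use strict;
use warnings;

# Standard genetic code packed as one string, with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... (bases in the order T, C, A, G).
my $AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %BASE_INDEX  = (T => 0, C => 1, A => 2, G => 3);

# Hypothetical helper, not the Bio::Seq implementation: translate frame 1
# of a Dna or Rna string; codons with other symbols become 'X'.
sub translate_frame1 {
    my ($nucleotides) = @_;
    (my $dna = uc $nucleotides) =~ tr/U/T/;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my @idx = map { $BASE_INDEX{$_} } split(//, substr($dna, $i, 3));
        if (grep { !defined $_ } @idx) {
            $protein .= 'X';
            next;
        }
        $protein .= substr($AMINO_ACIDS, 16 * $idx[0] + 4 * $idx[1] + $idx[2], 1);
    }
    return $protein;
}

print translate_frame1('ATGGCCTAA'), "\n";   # prints "MA*"
```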
Checking the sequence alphabet
To check if any nonstandard characters are present in a biosequence, an alphabet_ok() method is provided. The method returns "1" if everything is OK, otherwise it returns a "0".
if($myseq->alphabet_ok) { print "OK!!\n"; }
else { print "Not OK! \n"; }
To get the alphabet itself, use the alphabet() method, which will return a string containing all characters in the current alphabet.
$alph = $myseq->alphabet;
To use restrictive alphabets that do not permit ambiguity codes, include '-strict => 1' in the parameters sent to new(). Or, for any existing sequence object, try:
$myseq->strict(1);
$myseq->alphabet_ok() or die "alphabet not okay.\n";
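The effect of a strict versus an extended alphabet can be illustrated with a small stand-alone check. This is a sketch under stated assumptions: the two character sets are taken from the Dna alphabet table in this document, the strict set is assumed to be just A, C, G and T, and dna_alphabet_ok() is a hypothetical function rather than the module's own code.

```perl
use strict;
use warnings;

# Illustrative character sets; the variable names are hypothetical and the
# strict set (no ambiguity codes, no '?' or '-') is an assumption.
my $STRICT_DNA   = 'ACGT';
my $EXTENDED_DNA = 'ACGTUMRWSYKVHDBXN?-';

# Return 1 if every character of $sequence is in the chosen alphabet, else 0.
sub dna_alphabet_ok {
    my ($sequence, $strict) = @_;
    my $allowed = $strict ? $STRICT_DNA : $EXTENDED_DNA;
    return $sequence =~ /\A[\Q$allowed\E]*\z/i ? 1 : 0;
}

print dna_alphabet_ok('ACGTN-', 0), "\n";   # prints "1"
print dna_alphabet_ok('ACGTN-', 1), "\n";   # prints "0"
```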
FEEDBACK
Mailing Lists
User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.
vsns-bcd-perl@lists.uni-bielefeld.de - General discussion
vsns-bcd-perl-guts@lists.uni-bielefeld.de - Technically-oriented discussion
http://bio.perl.org/MailList.html - About the mailing lists
Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via email or the web:
bioperl-bugs@bio.perl.org
http://bio.perl.org/bioperl-bugs/
ACKNOWLEDGEMENTS
Some pieces of the code were contributed by Steven E. Brenner, Steve Chervitz, Ewan Birney, Tim Dudgeon, David Curiel, and other Bioperlers. Thanks !!!!
REFERENCES
BioPerl Project Page http://bio.perl.org/
VERSION
Bio::Seq.pm, beta 0.051
COPYRIGHT
Copyright (c) 1996-1998 Chris Dagdigian, Georg Fuellen, Richard
Resnick, and others All Rights Reserved. This module is free
software; you can redistribute it and/or modify it under the same
terms as Perl itself.
Appendix
The following documentation describes the various functions contained in this module. Some functions are for internal use and are not meant to be called by the user; they are preceded by an underscore ("_").
new
Title : new
Usage : $mySeq = Bio::Seq->new($file,$seq,$id,$desc,$names,
$start,$end,$type,$ffmt,$descffmt);
: - or -
: $mySeq = Bio::Seq->new(-file=>$file,
-seq=>$seq,
-id=>$id,
-desc=>$desc,
-names=>$names,
-start=>$start,
-end=>$end,
-type=>$type,
-origin=>$origin,
-ffmt=>$ffmt,
-descffmt=>$descffmt);
Function : The constructor for this class, returns a new object.
Example : See usage
Returns : Bio::Seq object
Argument : $file: file from which the sequence data can be read; all
the other arguments will overwrite the data read in.
"_nofile" is recommended if no file is given.
$seq: String or array of characters
$id: String describing the ID the user wishes to assign.
$desc: String giving a description of the sequence
$names: A reference to a hash which stores {loc,name}
pairs of other database locations and corresponding names
where the sequence is located.
$start: The offset of the sequence, as an integer
$end: The end point of the sequence, as an integer
$type: The type of the sequence, see type()
$origin: The sequence origin
$ffmt: Sequence format, see ffmt()
$descffmt: format of $desc, see descffmt()
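The usage above shows that new() accepts either a positional list or named, dash-prefixed parameters. One common way to support both styles is to inspect the first argument. The sketch below illustrates that heuristic only; read_constructor_args() and the shortened @FIELDS order are hypothetical, and the real constructor may well distinguish the two forms differently.

```perl
use strict;
use warnings;

# Hypothetical positional order, shortened for illustration.
my @FIELDS = qw(file seq id desc);

# Accept either (-seq => 'ACTG', -id => 'x') or ('_nofile', 'ACTG', 'x')
# and return a hash keyed by field name.
sub read_constructor_args {
    my @args = @_;
    my %param;
    if (@args && defined $args[0] && $args[0] =~ /^-/) {
        # Named form: strip the leading dash and lower-case each key.
        my %named = @args;
        for my $key (keys %named) {
            (my $field = lc $key) =~ s/^-//;
            $param{$field} = $named{$key};
        }
    }
    else {
        # Positional form: pair each value with the field at the same index.
        my $last = $#args < $#FIELDS ? $#args : $#FIELDS;
        @param{ @FIELDS[0 .. $last] } = @args[0 .. $last];
    }
    return %param;
}

my %by_name     = read_constructor_args(-seq => 'ACTG', -id => 'demo');
my %by_position = read_constructor_args('_nofile', 'ACTG', 'demo');
print "$by_name{seq} $by_position{seq}\n";   # prints "ACTG ACTG"
```

A weakness of this heuristic is that a positional first argument that happens to begin with a dash would be misread as a named parameter, which is one reason to prefer the named form.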
## Internal methods ##
_initialize
Title : _initialize
Usage : n/a (internal function)
Function : Assigns initial parameters to a blessed object.
Example :
Returns :
Argument : As Bio::Seq->new, allows for named or listed parameters.
See ->new for the legal types of these values.
_seq
Title : _seq()
Usage : n/a, internal function
Function : called by new() to set sequence field. Checks
: alphabet before setting.
:
Returns : n/a
Argument : sequence string
_monomer
Title : _monomer()
Usage : n/a, internal function
Function : Returns the internal monomer that represents
: sequence type.
:
: Sequence type is treated internally as a monomer
: defined by the %SeqAlph hash. The type field
: is a list of format [monomer,origin]. For any
: output outside the module, the monomer is resolved
: back into string form via the %TypeSeq hash.
:
Returns : original type setting [as monomer]
Argument : none
_file_read
Title : _file_read()
Usage : n/a (Internal Function)
Function : _file_read is called whenever the constructor is called
: with the name of a sequence to be read from disk.
:
: This function is now DEPRECATED. you should use the SeqIO
: system
:
Example : n/a, only called upon by _initialize()
Returns :
Argument :
## ACCESSORS ##
seq_len
Title : seq_len()
Usage : $len = $myseq->seq_len;
Function : Returns a value representing the sequence
: length
:
Example : see above
Arguments : none
Returns : integer
ary
Title : ary
Usage : ary([$start,[$end]])
Function : Returns the sequence of the object as an array, or a substring
of the sequence if $start/$end are defined. If $start is
defined and $end isn't, the substring is from $start to the
end of the sequence.
Example : @slice = $myObject->ary(3,9);
Returns : array of characters
Argument : $start,$end (both integers). They are interpreted w.r.t. the
specific numeration of the sequence!! ($self->{start})
str
Title : str
Usage : str([$start,[$end]])
Function : Returns the sequence of the object as a string, or a slice
of the sequence if $start/$end are defined. If $start is
defined and $end isn't, the slice is from $start to the
end of the sequence.
Example : $slice = $myObject->str(3,9);
Returns : string scalar
Argument : $start,$end (both integers). They are interpreted w.r.t. the
specific numeration of the sequence!! ($self->{start})
seq
Title : seq
Usage : seq([$start,[$end]])
Function : Returns the sequence of the object as an array or a char
string, depending on the value of wantarray. Will return a slice
of the sequence if $start/$end are defined. If $start is
defined and $end isn't, the slice is from $start to the
end of the sequence.
Example : @slice = $myObject->seq(3,9);
Returns : regular array of characters, or a scalar string
Argument : $start,$end (both integers). They are interpreted w.r.t. the
specific numeration of the sequence!! ($self->{start})
Comments :
getseq
Title : getseq
Usage : getseq([$start,[$end]])
Function : Returns the sequence of the object as an array or a char
string, depending on the value of wantarray. Will return a slice
of the sequence if $start/$end are defined. If $start is
defined and $end isn't, the slice is from $start to the
end of the sequence.
Example : @slice = $myObject->getseq(3,9);
Returns : regular array of characters, or a scalar string
Throws : Warning about deprecated method.
Argument : $start,$end (both integers). They are interpreted w.r.t. the
specific numeration of the sequence!! ($self->{start})
id
Title : id()
Usage : $seq_id = $myseq->id;
: $myseq->id($id_string);
:
Function : Sets field if an ID argument string is
: passed in. If no arguments, returns ID value for
: object.
:
Returns : original ID value
Argument : ID string
desc
Title : desc()
Usage : $description = $myseq->desc;
: $myseq->desc($desc_string);
:
Function : Sets field if an argument string is
: passed in. If no arguments, returns original value for
: object description field.
:
Returns : original value for description
Argument : description string
names
Title : names()
Usage : %names = $myseq->names;
: $myseq->names($hash_ref);
:
Function : Sets field if a name hash reference is
: passed in. If no arguments, returns original
: names hash.
:
Returns : hash reference (associative array)
Argument : reference to a hash (associative array)
numbering
Title : numbering()
Usage : $num_start = $myseq->numbering;
: $myseq->numbering($value);
:
Function : Sets field if an argument is
: passed in. If no arguments, returns original value.
:
: (Deprecated - should switch to start())
Returns : original value
Argument : new value
start
Title : start
Usage : $start = $myseq->start(); #get
: $myseq->start($value); #set
Function : the set/get for the start position
Example :
Returns : start value
Arguments : new value
end
Title : end
Usage : $end = $myseq->end(); #get
: $myseq->end($value); #set
Function : The set/get for the end position
Example :
Returns : end value
Arguments : new value
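The start() and end() entries describe combined get/set accessors: called without an argument they return the stored value, and called with one they store it. A minimal sketch of that pattern follows. Sketch::Region is a hypothetical class with no relation to the Bio::Seq internals, and returning the current value after a set is an assumption of the sketch.

```perl
package Sketch::Region;   # hypothetical class, for illustration only
use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    return bless { start => $args{start}, end => $args{end} }, $class;
}

# Combined getter/setter: store the argument if one was given,
# then return the current value.
sub start {
    my $self = shift;
    $self->{start} = shift if @_;
    return $self->{start};
}

sub end {
    my $self = shift;
    $self->{end} = shift if @_;
    return $self->{end};
}

package main;

my $region = Sketch::Region->new(start => 1, end => 3);
$region->end(10);                                        # set
printf "%d-%d\n", $region->start, $region->end;          # prints "1-10"
```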
get_nse
Title : get_nse
Usage : $tag = $myseq->get_nse() #
Function : gets a string like "name/start-end". This is likely
: to be unique in an alignment/database
: Used a lot by SimpleAlign
Example :
Returns : A string
Arguments: Two optional arguments - first being the name/ separator, second the
start-end separator
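The "name/start-end" string described above is straightforward to assemble. The following sketch shows the idea with the two optional separators; name_start_end() is a hypothetical free-standing function, and the default separators '/' and '-' are inferred from the format quoted in the Function text.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the described output of get_nse().
# $name_sep and $range_sep are optional and default to '/' and '-'.
sub name_start_end {
    my ($name, $start, $end, $name_sep, $range_sep) = @_;
    $name_sep  = '/' unless defined $name_sep;
    $range_sep = '-' unless defined $range_sep;
    return $name . $name_sep . $start . $range_sep . $end;
}

print name_start_end('seq1', 1, 50), "\n";             # prints "seq1/1-50"
print name_start_end('seq1', 1, 50, ':', '..'), "\n";  # prints "seq1:1..50"
```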
origin
Title : origin()
Usage : $myseq->origin($value)
Function : Sets the origin field which is actually the second
: field of the Type list. The {type} field is a 2 value list
: with a format of ["Monomer","Origin"]
:
Returns : Original value
Argument : string
Comments : SAC: Consider renaming this method to "organism()" or "species()".
: "origin" is ambiguous and can be easily confused with
: a coordinate data (0,0).
type
Title : type()
Usage : $myseq->type($value)
Function : Sets the type field which is the first
: field of the Type list. The {type} field is a 2 value list
: with a format of ["Monomer","Origin"]
:
Returns : String containing one of the recognized sequence types:
: 'unknown', 'dna', 'rna', 'amino', 'otherseq', 'aligned'
: See the %Seq::SeqAlph hash for the current types.
Argument : string containing a valid sequence type
: SAC: case of user-supplied argument does not matter
ffmt
Title : ffmt()
Usage : $format = $myseq->ffmt;
: $myseq->ffmt("Fasta");
:
Function : The file format field is used by the internal
: sequence parsing code when trying to read
: in a sequence file. It is also what is used
: as a default output format if the layout
: method is called without an argument.
:
: If a sequence object is created without
: reading in a file, or if the file is read
: in with the use of the ReadSeq package then
: the ffmt field can be set to indicate any default
: output-format preference.
:
: If a sequence is read from a file and parsed
: by internal code (ReadSeq not used) then the ffmt
: field should describe the format of the sequence
: file. The ffmt field is used to send the sequence
: to the correct internal parsing code.
:
Returns : original ffmt value
Argument : recognized ffmt string value (see list of recognized
: formats) # SAC: What are they?! This list should be obvious.
: Valid strings:
: RAW, FASTA, GCG, IG, GENBANK, NBRF, EMBL,
: MSF, PIR, GCG_SEQ, GCG_REF, STRIDER, ZUKER,
: SAC: case of user-supplied argument does not matter
descffmt
Title : descffmt()
Usage : $desc = $myseq->descffmt;
: $myseq->descffmt($new_value);
Function :
:
Returns : original value
Argument : $new_value (one of the formats as defined in $SeqForm).
: SAC: case of $new_value argument does not matter.
setseq
Title : setseq()
Usage : $self->setseq($new_sequence);
Function : Changes the sequence inside a bioseq object
:
Returns : sequence string
Argument : sequence string
parse
Title : parse
Usage : parse($ent,[$ffmt]);
Function : Invokes the proper parsing code depending on
: the value of the object 'ffmt' field.
Example : $self->parse;
Returns : n/a
Argument : the prospective sequence to be parsed,
: and optionally its format so that it doesn't need to
: be estimated
: SAC: case of $ffmt argument does not matter.
parse_raw
Title : parse_raw
Usage : parse_raw;
Function : parses $ent into the $self->{"seq"} field, using Raw
: file format.
Example : $self->parse_raw;
Returns : n/a
Argument : n/a
parse_genbank
Title : parse_genbank
sub parse_genbank {
    my ($self, $ent) = @_;
    my $seqstart = 0;    # true while inside the ORIGIN (sequence) block
    my $defstart = 0;    # true while reading DEFINITION continuation lines

    for ( split("\n", $ent) ) {
        m/LOCUS\s*(\S+)/ and $self->{"id"} = $1;
        if ( m/DEFINITION\s*(.+)/ ) {
            $self->{"desc"} = $1;
            $defstart = 1;
            next;
        }
        if ( $defstart ) {
            # continuation lines are indented; anything else ends the definition
            if ( m/^ {11}( .+)/ ) { $self->{"desc"} .= $1; next; }
            $defstart = 0;
        }
        if ( m/ORIGIN/ ) { $seqstart = 1; next; }
        m!//! and $seqstart = 0;
        if ( $seqstart ) {
            s/[\s\d]//g;    # drop position numbers and spacing
            $self->{"seq"} .= $_;
        }
    }
    return 1;
}
parse_fasta
Title : parse_fasta
Usage : parse_fasta;
Function : parses $ent into the "seq" field, using Fasta
: file format.
:
To-do : use benchmark module to find best/fastest parse
: method
:
Example : $self->parse_fasta;
Returns : n/a
Argument : n/a
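For readers unfamiliar with the Fasta layout, the sketch below shows one way a single Fasta entry held in a string could be split into an id, a description and the raw sequence. It is not the module's parse_fasta code; parse_fasta_entry() is a hypothetical function, and treating the first word of the header as the id is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one Fasta entry held in a string and return
# a hash reference with id, desc and seq keys.
sub parse_fasta_entry {
    my ($entry) = @_;
    my ($header, @lines) = split(/\n/, $entry);
    die "empty entry\n" unless defined $header;

    # Header line: '>' then the id (first word), then a free-text description.
    my ($id, $desc) = $header =~ /^>\s*(\S+)\s*(.*)$/
        or die "not a Fasta entry\n";

    my $seq = join('', @lines);
    $seq =~ s/\s+//g;    # remove any remaining whitespace
    return { id => $id, desc => $desc, seq => $seq };
}

my $parsed = parse_fasta_entry(">seq1 sample entry\nACTG\nTTGA\n");
print "$parsed->{id}: $parsed->{seq}\n";   # prints "seq1: ACTGTTGA"
```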
parse_gcg
Title : parse_gcg
Usage : used by internal code
Function : Parses the sequence out of a gcg-format string and
: sets the object sequence field accordingly. This is
: a simple, inefficient method for grabbing JUST the
: sequence.
:
To-do : - parse out more info than just sequence
: - implement alphabet checking
: - better regular expressions/efficiency
: - carp on unexpected / wrong-format situations
:
Version : .01 / 16 Jan 1997
Returns : 1
Argument : gcg-formatted sequence string
## METHODS FOR FILE FORMAT AND OUTPUT ##
layout
Title : layout()
Usage : layout([$format]);
Function : Returns the sequence in whichever format the user specifies,
or in the "ffmt" field if the user does not specify a format.
Example : $fastaFormattedSeq = $myObj->layout("Fasta");
Returns : varies
Argument : $format (one of the formats as defined in $SeqForm).
: SAC: case of $ffmt argument does not matter.
out_raw
Title : out_raw
Usage : out_raw;
Function : Returns the sequence in Raw format.
Example : $self->out_raw;
Returns : string sequence, in raw format
Argument : n/a
out_fasta
Title : out_fasta
Usage : out_fasta;
Function : Returns the sequence as a string in FASTA format.
Example : $self->out_fasta;
:
To-do : benchmark code / find fastest method
:
Returns : string sequence in Fasta format
Argument : n/a
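As an illustration of what producing a Fasta-format string involves, the sketch below builds a header line from an id and description and wraps the sequence into fixed-width lines. format_fasta() is a hypothetical function, and the default width of 60 characters is an assumption; the width actually used by out_fasta is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Fasta-format string.
# $width is optional; 60 is an assumed default.
sub format_fasta {
    my ($id, $desc, $seq, $width) = @_;
    $width ||= 60;

    my $header = '>' . $id;
    $header .= ' ' . $desc if defined($desc) && length($desc);

    # Emit the sequence in chunks of at most $width characters.
    my $body = '';
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        $body .= substr($seq, $pos, $width) . "\n";
    }
    return $header . "\n" . $body;
}

print format_fasta('seq1', 'sample entry', 'ACTGTTGACC', 4);
# prints:
# >seq1 sample entry
# ACTG
# TTGA
# CC
```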
alphabet_ok
Title : alphabet_ok
Usage : $myseq->alphabet_ok;
Function : Checks the sequence for presence of any characters
: that are not considered valid members of the genetic
: alphabet. In addition to the standard genetic alphabet
: (see documentation), "?" and "-" characters are
: considered valid.
:
Example : if($myseq->alphabet_ok) { print "OK!!\n"; }
: else { print "Not OK! \n"; }
:
Note : Does not handle '\' characters in sequence robustly
:
Returns : 1 if OK / 0 if not OK
Argument : none
alphabet
Title : alphabet
Usage : $myseq->alphabet;
Function : Returns the characters in the alphabet in use for the sequence.
Example : print "Alphabet: ".$myseq->alphabet;
Returns : string containing alphabet characters
Argument : none
GCG_checksum
Title : GCG_checksum
Usage : $myseq->GCG_checksum;
Function : returns a gcg checksum for the sequence
Example :
Returns :
Argument : none
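For orientation, the checksum as it is commonly described for GCG files can be sketched as follows. This is a standalone illustration that has not been checked against this module's code; gcg_checksum_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# GCG-style checksum: weight each upper-cased character by a position
# counter that cycles 1..57, sum the products, and keep the sum mod 10000.
sub gcg_checksum_sketch {
    my ($seq) = @_;
    my ($sum, $weight) = (0, 0);
    for my $char (split //, uc $seq) {
        $weight++;
        $sum += $weight * ord($char);
        $weight = 0 if $weight == 57;
    }
    return $sum % 10000;
}

print gcg_checksum_sketch('ACTG'), "\n";
```

Because the sequence is upper-cased first, the result does not depend on the capitalization chosen for output.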
trunc
Title : trunc
Usage : $trunc_seq = $mySeq->trunc(12,20);
Function : Returns a truncated part of the sequence. The truncation is
performed by the ->str() call, so this is just a convenience
method for this object.
Returns : Bio::Seq object ref.
Argument : start point, end point in biological coordinates
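Biological coordinates count from 1 and include both end points, which differs from Perl's zero-based substr. A minimal standalone sketch of the conversion follows; trunc_sketch is a hypothetical helper that works on a plain string and assumes the sequence starts at coordinate 1.

```perl
use strict;
use warnings;

# Extract residues $start..$end, counting from 1 and including both ends.
# Assumes the sequence starts at coordinate 1 (no numbering offset).
sub trunc_sketch {
    my ($seq, $start, $end) = @_;
    die "Bad range $start..$end\n"
        if $start < 1 || $end < $start || $end > length($seq);
    return substr($seq, $start - 1, $end - $start + 1);
}

print trunc_sketch('ACTGTGGCGTCAACTGACTGACTG', 12, 20), "\n";
```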
copy
Title : copy
Usage : $copyOfObj = $mySeq->copy;
Function : Returns an identical copy of the object.
Example :
Returns : Bio::Seq object ref.
Argument : n/a
revcom
Title : revcom
Usage : $reverse_complemented_seq = $mySeq->revcom;
Function : Returns a Bio::Seq object with the reverse
: complement of a nucleotide object sequence
Example : $reverse_complemented_seq = $mySeq->revcom;
Source : Guts from Jong's <jong@mrc-lmb.cam.ac.uk>
: library of molbio perl routines
Note :
: The letter codes and complement translations
: are those proposed by IUB (Nomenclature Committee,
: 1985, Eur. J. Biochem. 150; 1-5) and are also
: used by the GCG package. The IUB/GCG letter codes
: for nucleotide ambiguity are compatible with
: EMBL, GenBank and PIR database formats but are
: *NOT* compatible with Staden/Sanger ambiguity
: symbols. Staden/Sanger use different symbols to
: represent uncertainty and frame ambiguity.
:
: Currently Staden/Sanger are not recognized
: sequence types.
:
: GCG Documentation on sequence symbols:
URL : http://www.neb.com/gcgdoc/GCGdoc/Appendices/appendix_iii.html
:
Translation :
: GCG/IUB Meaning Complement
: ------------------------------------
: A A T
: C C G
: G G C
: T T A
: U U A
: M A or C K
: R A or G Y
: W A or T W
: S C or G S
: Y C or T R
: K G or T M
: V A or C or G B
: H A or C or T D
: D A or G or T H
: B C or G or T V
: X G or A or T or C X
: N G or A or T or C N
:--------------------------------------
Revision : 0.01 / 3 Jun 1997
Returns : A new sequence object
to get the actual sequence string, use
$actual_reversed_sequence = $seq->revcom()->str()
Argument : n/a
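The translation table above maps directly onto Perl's tr/// operator. The following standalone sketch works on plain strings rather than Bio::Seq objects; complement_sketch and revcom_sketch are hypothetical names, and preserving lower case is an assumption of the example.

```perl
use strict;
use warnings;

# Complement using the IUB/GCG table (U is treated like T).
sub complement_sketch {
    my ($seq) = @_;
    (my $comp = $seq) =~
        tr/ACGTUMRWSYKVHDBXNacgtumrwsykvhdbxn/TGCAAKYWSRMBDHVXNtgcaakywsrmbdhvxn/;
    return $comp;
}

# Reverse complement: reverse the string, then complement each base.
sub revcom_sketch {
    my ($seq) = @_;
    return complement_sketch(scalar reverse $seq);
}

print revcom_sketch('ATGC'), "\n";
```

Note that U complements to A but A always complements to T, so the reverse complement of an RNA string comes back in the DNA alphabet.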
complement
Title : complement
Usage : $complemented_seq = $mySeq->complement;
Function : Returns a char string containing
: the complementary sequence (eg; other strand)
: of the original sequence. The translation method
: is identical to revcom() but the nucleotide order
: is not reversed.
:
: To be honest *most* of the time you will want
: to use revcom not this. Be careful!
:
Example : $complemented_seq = $mySeq->complement;
:
Source : Guts from Jong's <jong@mrc-lmb.cam.ac.uk>
: library of molbio perl routines
Note :
: The letter codes and complement translations
: are those proposed by IUB (Nomenclature Committee,
: 1985, Eur. J. Biochem. 150; 1-5) and are also
: used by the GCG package. The IUB/GCG letter codes
: for nucleotide ambiguity are compatible with
: EMBL, GenBank and PIR database formats but are
: *NOT* compatible with Staden/Sanger ambiguity
: symbols. Staden/Sanger use different symbols to
: represent uncertainty and frame ambiguity.
:
: Currently Staden/Sanger are not recognized
: sequence types.
:
: GCG Documentation on sequence symbols:
URL : http://www.neb.com/gcgdoc/GCGdoc/Appendices/appendix_iii.html
:
Translation :
: GCG/IUB Meaning Complement
: ------------------------------------
: A A T
: C C G
: G G C
: T T A
: U U A
: M A or C K
: R A or G Y
: W A or T W
: S C or G S
: Y C or T R
: K G or T M
: V A or C or G B
: H A or C or T D
: D A or G or T H
: B C or G or T V
: X G or A or T or C X
: N G or A or T or C N
:--------------------------------------
:
Revision : 0.01 / 6 Dec 1996
Returns : char string
Argument : n/a
reverse
Title : reverse
Usage : $reversed_seq = $mySeq->reverse;
Function : Returns a char string containing the
: reverse of the object sequence
:
: Does *NOT* complement it. If you want
: the other strand, use $mySeq->revcom()
:
Example : $reversed_seq = $mySeq->reverse;
:
Revision : 0.01 / 6 Dec 1996
Returns : char string
Argument : n/a
Dna_to_Rna
Title : Dna_to_Rna
Usage : $translated_seq = $mySeq->Dna_to_Rna;
Function : Returns a char string containing the
: Rna translation of the Dna nucleotide sequence
: (Replaces T with U)
:
Example : $translated_seq = $mySeq->Dna_to_Rna;
:
Source : modified from Jong's <jong@mrc-lmb.cam.ac.uk>
: library of molbio perl routines
:
Revision : 0.01 / 6 Dec 1996
Returns : char string
Argument : n/a
Rna_to_Dna
Title : Rna_to_Dna
Usage : $translated_seq = $mySeq->Rna_to_Dna;
Function : Returns a char string containing the
: Dna translation of the Rna nucleotide sequence
: (Replaces U with T)
:
Example : $translated_seq = $mySeq->Rna_to_Dna;
:
Revision : 0.01 / 16 MAR 1997
Returns : char string
Argument : n/a
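Both conversions are single-character substitutions. A standalone sketch on plain strings follows; the helper names are hypothetical and preserving lower case is an assumption.

```perl
use strict;
use warnings;

# DNA to RNA: replace every T with U, preserving case.
sub dna_to_rna_sketch {
    my ($seq) = @_;
    (my $rna = $seq) =~ tr/Tt/Uu/;
    return $rna;
}

# RNA to DNA: replace every U with T, preserving case.
sub rna_to_dna_sketch {
    my ($seq) = @_;
    (my $dna = $seq) =~ tr/Uu/Tt/;
    return $dna;
}

print dna_to_rna_sketch('ACTGtt'), "\n";
```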
translate
Title : translate
Usage : $translation = $mySeq->translate;
Function : Returns a new Bio::Seq object with the protein
: translation from this sequence
:
: "*" is the default symbol for a stop codon
: "X" is the default symbol for an unknown codon
:
Example : $translation = $mySeq->translate;
: -or- with user defined stop/unknown codon symbols:
: $translation = $mySeq->translate($stop_symbol,$unknown_symbol);
:
Source : modified from Jong's <jong@mrc-lmb.cam.ac.uk>
: library of molbio perl routines
:
To-do : - allow named parameters (just like new and out_GCG )
: - allow "frame" parameter to pick translation frame
:
Revision : 0.01 / 6 Dec 1996
Returns : new Sequence object. Its id is the original id.trans
Argument : optional stop codon symbol and unknown codon symbol (see Example)
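To make the role of the stop and unknown symbols concrete, here is a standalone sketch of a frame-1 translation with the standard genetic code. It is not the module's code: translate_sketch is a hypothetical name, it works on plain strings, and ignoring a trailing partial codon is an assumption.

```perl
use strict;
use warnings;

# Standard genetic code, amino acids listed for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...).
my $AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %BASE  = (T => 0, C => 1, A => 2, G => 3);

sub translate_sketch {
    my ($seq, $stop, $unknown) = @_;
    $stop    = '*' unless defined $stop;
    $unknown = 'X' unless defined $unknown;
    (my $dna = uc $seq) =~ tr/U/T/;           # accept RNA input too
    my $protein = '';
    # Walk complete codons in frame 1; a trailing partial codon is ignored.
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my @idx = map { $BASE{$_} } split //, substr($dna, $i, 3);
        if (grep { !defined $_ } @idx) {
            $protein .= $unknown;             # ambiguous or invalid base
            next;
        }
        my $aa = substr($AMINO, 16 * $idx[0] + 4 * $idx[1] + $idx[2], 1);
        $protein .= $aa eq '*' ? $stop : $aa;
    }
    return $protein;
}

print translate_sketch('ATGGCCTAA'), "\n";
print translate_sketch('ATGTAANNN', '-', '?'), "\n";
```

Encoding the table as one 64-character string keeps the sketch short; a hash keyed by codon would be easier to read and to swap for a different genetic code.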
dump
Title : dump
Usage : @results = $mySeq->dump; -or-
: $results = $mySeq->dump;
:
Function : Returns a formatted array or string (depending on how it
: is invoked) containing the contents of a
: Bio::Seq object. Useful for debugging
:
: ***This is used by Chris Dagdigian for debugging ***
: ***Probably should be removed before distribution***
:
Example : @results = $mySeq->dump;
: foreach(@results){print;}
: -or-
: print $myseq->dump;
:
Returns : Array or string depending on value of wantarray
Argument : n/a
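Returning an array or a string "depending on how it is invoked" relies on Perl's wantarray. A minimal standalone sketch follows; dump_sketch and the line layout are assumptions, and the object is a plain hash reference.

```perl
use strict;
use warnings;

# Return the report as a list of lines in list context,
# or as one joined string in scalar context.
sub dump_sketch {
    my ($obj) = @_;
    my @lines = map { sprintf("%-8s %s\n", $_, $obj->{$_}) } sort keys %$obj;
    return wantarray ? @lines : join('', @lines);
}

my $obj     = { id => 'sample', seq => 'ACTG' };
my @results = dump_sketch($obj);   # list context: one element per field
my $results = dump_sketch($obj);   # scalar context: a single string
print @results;
```

The same idiom applies to the out_* methods below that are documented as returning a string or a list.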
out_bad
Title : out_bad()
Usage : out_bad;
Function : Throws a fatal error if we don't know the output format.
Example : $self->out_bad;
Returns : n/a
Argument : n/a
out_primer
Title : out_primer()
Usage : $formatted_seq = $myseq->out_primer;
: @formatted_seq = $myseq->out_primer;
:
: print $myseq->out_primer(-id=>'New ID',
: -header=>'This is my header');
:
Function : outputs a sequence in primer format
:
Note : Not a supported output type - (cannot be invoked via layout)
: Use at your own risk :)
:
Example : see usage
:
Revision : 0.01 / 20 Dec 1996
Returns : string or list, depending on how it is invoked
Argument : named list parameters for "id" and "header" are allowed
out_pir
Title : out_pir()
Usage : $formatted_seq = $myseq->layout("PIR");
: $formatted_seq = $myseq->out_pir;
: @formatted_seq = $myseq->out_pir;
:
: print $myseq->out_pir(-title=>'New TITLE',
: -entry=>'New ENTRY',
: -acc=>'User defined accession',
: -date=>'User defined date',
: -reference=>'User defined ref info');
:
Function : Returns a string or an array depending on how it
: is invoked. Can be easily accessed via the layout()
: method, or if more output control is desired it can
: be called directly with the following named parameters:
:
: -entry PIR entry
: -title PIR title
: -acc user defined accession number
: -reference user defined reference
: -date user defined date/time info
:
: All named parameters will take precedence over any
: default behavior. When there are no user arguments,
: the default output is as follows:
:
: PIR 'ENTRY' = sequence object "id" field
: PIR 'TITLE' = sequence object "desc" field
: PIR 'DATE' = current date/time
: PIR 'ACC' = not used in default output
: PIR 'REFERENCE' = not used in default output
:
Note : Not tested stringently.
:
WARNING : Does not deal with numbering issue
:
To-do : - Allow user to pass in hash of additional fields/values
: - Deal with numbering issue
:
Example : see usage
:
Revision : 0.02 / 12 Jan 1997
Returns : string or list, depending on how it is invoked
Argument : named list parameters are allowed, see above
out_genbank
Title : out_genbank()
Usage : $formatted_seq = $myseq->out_genbank;
: @formatted_seq = $myseq->out_genbank;
: print $myseq->out_genbank(-id=>'New ID',
: -def=>'User defined definition',
: -acc=>'User defined accession',
: -origin=>'User defined origin info',
: -spacing=>'single',
: -caps=>'up',
: -date=>'DATE GOES HERE',
: -type=>'mRna');
:
Function : Returns a GenBank formatted sequence array or string
: depending on the value of wantarray when invoked via layout().
: If more control is desired over output format, out_genbank()
: can be addressed directly with the following named parameters:
:
: def - Sequence definition information
: acc - Sequence accession number
: origin - Sequence origin information
: id - short name
: date - new date info
: type - sequence type (Dna, mRna, Amino, etc.)
: spacing - "single" or "double" sequence line spacing
: caps - "up" or "down" sequence capitalization
:
: When invoked via layout() or called directly with no
: arguments, the following default behaviours apply:
: DATE = Current date and time
: DEFINITION = object's description field
: ID = object's ID field
: SPACING = single
:
: All named parameters must be strings. Passed in parameters will
: always take precedence over any fields with default settings.
:
Note : Format not stringently tested for accuracy. Sequence is numbered
: according to the integer specified in the object 'start' field
: but the implementation has not been robustly tested.
:
To-do : - allow user hash reference for additional format fields
:
Example : see usage
:
Revision : 0.02 / 12 Jan 1997
Returns : string or list, depending on how it is invoked
Argument : named list parameters are allowed, see above
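The rule that passed-in parameters override the defaults can be sketched as a hash merge. This standalone example is not the out_genbank code: genbank_fields_sketch is a hypothetical helper that only resolves the field values, using the defaults listed above.

```perl
use strict;
use warnings;

# Merge caller-supplied "-name => value" pairs over defaults taken from
# the object; later keys win, so user values take precedence.
sub genbank_fields_sketch {
    my ($obj, %args) = @_;
    my %user = map { my $k = lc $_; $k =~ s/^-//; ($k => $args{$_}) } keys %args;
    my %field = (
        id      => $obj->{id},
        def     => $obj->{desc},
        date    => scalar(localtime),
        spacing => 'single',
        %user,
    );
    return \%field;
}

my $obj    = { id => 'sample', desc => 'Sample Bio::Seq sequence' };
my $fields = genbank_fields_sketch($obj, -id => 'New ID', -caps => 'up');
print "$_ = $fields->{$_}\n" for sort keys %$fields;
```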
out_GCG
Title : out_GCG
Usage : $formatted_seq = $mySeq->layout("GCG");
: @formatted_seq = $mySeq->layout("GCG");
:
: print $myseq->out_GCG(-id=>'New ID',
: -spacing=>'single',
: -caps=>'up',
: -date=>'DATE GOES HERE',
: -header=>'This is a user submitted header',
: -type=>'n');
:
Function : Returns a GCG formatted sequence array or string
: depending on the value of wantarray when invoked via layout().
: If more control is desired over output format, out_GCG()
: can be addressed directly with the following named parameters:
:
: header - first line(s) of formatted sequence
: id - short name that appears before 'Length:' field
: date - overwrite default date info
: type - can be "N" or "P", for nucleotide/protein
: spacing - "single" or "double" sequence line spacing
: caps - "up" or "down" sequence capitalization
:
: When invoked via layout() or called directly with no
: arguments, the following default behaviours apply:
: DATE = Current date and time
: DEFINITION = object's description field
: ID = object's ID field
: SPACING = single
:
: All named parameters must be strings. Passed in parameters will
: always take precedence over any fields with default settings.
:
Example :
Output :
:Sample Bio::Seq sequence
: sample Length: 240 Wed Nov 27 13:24:28 EST 1996 Type: N Check: 5371 ..
:
: 1 aaaacctatg gggtgggctc tcaagctgag accctgtgtg cacagccctc
: 51 tggctggtgg cagtggagac gggatnnnat gacaagcctg ggggacatga
: 101 ccccagagaa ggaacgggaa caggatgagt gagaggaggt tctaaattat
: 151 ccattagcac aggctgccag tggtccttgc ataaatgtat agagcacaca
: 201 ggtgggggga aagggagaga gagaagaagc cagggtataa
:
:
Note : GCG formatted sequences contain a "Type:" field.
: If Type cannot be internally determined and no
: Type name-parameter is passed in then the Type:
: field is not printed.
:
Warning : Unconventional numbering offsets may not
: be robustly handled
:
Revision : 0.06 / 12 Jan 1997
Source : Found guts of this code on bionet.gcg, unknown author
Returns : Array or String
Argument : named list parameters are allowed, see above
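The numbered sequence lines in the sample output (a residue number followed by blocks of ten, fifty residues per line) can be produced along these lines. This is a standalone sketch with assumed column widths, since the exact spacing is not recoverable from the sample; gcg_lines_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Split a sequence into lines of 50 residues, each shown as blocks of 10
# and prefixed with the number of its first residue.
# The "%8d  " prefix width is an assumption.
sub gcg_lines_sketch {
    my ($seq, $start) = @_;
    $start = 1 unless defined $start;
    my @lines;
    for (my $pos = 0; $pos < length($seq); $pos += 50) {
        my @blocks = unpack('(A10)*', substr($seq, $pos, 50));
        push @lines, sprintf("%8d  %s", $start + $pos, join(' ', @blocks));
    }
    return @lines;
}

print "$_\n" for gcg_lines_sketch('a' x 120);
```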
out_nbrf
Title : out_nbrf()
Usage : $self->layout("NBRF") or $self->out_nbrf
:
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
:
: If the ReadSeq wrapper Parse.pm appears
: to be configured properly it is used
: to generate the output.
:
: If Parse.pm cannot be used then this code
: carps out with an error message.
:
To-do : write internal output code
:
Version : 1.0 / 16 MAR 1997
Example : see Usage
Returns : FORMATTED STRING (wantarray is not used here!)
Argument :
out_gcgseq
Title : out_gcgseq
Usage : out_gcgseq;
Function : Returns the sequence as a string in GCG_SEQ format.
Example : $self->out_gcgseq;
:
Returns : string sequence in GCG_SEQ format
Argument : n/a
Comments : SAC: Derived from out_fasta().
: GCG_SEQ is a format that looks a lot like Fasta and is used
: for building GCG sequence datasets (.seq files).
: It also has some similarities to NBRF format.
out_gcgref
Title : out_gcgref
Usage : out_gcgref;
Function : Returns the sequence as a string in GCG_REF format.
Example : $self->out_gcgref;
:
Returns : string sequence in GCG_REF format
Argument : n/a
Comments : SAC: Derived from out_gcgseq().
: GCG_REF is a companion format for GCG_SEQ that is used
: for building GCG sequence datasets (.ref files).
: The .ref file is identical to the .seq file but without the sequence.
out_ig
Title : out_ig()
Usage : $self->layout("IG") or $self->out_ig
:
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
:
: If the ReadSeq wrapper Parse.pm appears
: to be configured properly it is used
: to generate the output.
:
: If Parse.pm cannot be used then this code
: carps out with an error message.
:
To-do : write internal output code
:
Version : 1.0 / 16 MAR 1997
Example : see Usage
Returns : FORMATTED STRING (wantarray is not used here!)
Argument :
out_strider
Title : out_strider()
Usage : $self->layout("Strider") or $self->out_strider
:
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
:
: If the ReadSeq wrapper Parse.pm appears
: to be configured properly it is used
: to generate the output.
:
: If Parse.pm cannot be used then this code
: carps out with an error message.
:
To-do : write internal output code
:
Version : 1.0 / 16 MAR 1997
Example : see Usage
Returns : FORMATTED STRING (wantarray is not used here!)
Argument :
out_zuker
Title : out_zuker()
Usage : $self->layout("Zuker") or $self->out_zuker
:
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
:
: If the ReadSeq wrapper Parse.pm appears
: to be configured properly it is used
: to generate the output.
:
: If Parse.pm cannot be used then this code
: carps out with an error message.
:
To-do : write internal output code
:
Version : 1.0 / 16 MAR 1997
Example : see Usage
Returns : FORMATTED STRING (wantarray is not used here!)
Argument :
out_msf
Title : out_msf()
Usage : $self->layout("MSF") or $self->out_msf
:
Function : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
:
: If the ReadSeq wrapper Parse.pm appears
: to be configured properly it is used
: to generate the output.
:
: If Parse.pm cannot be used then this code
: carps out with an error message.
:
To-do : write internal output code
:
Version : 1.0 / 16 MAR 1997
Example : see Usage
Returns : FORMATTED STRING (wantarray is not used here!)
Argument :
parse_unknown
Title : parse_unknown
Usage : parse_unknown($ent);
Function : tries to figure out the format of $ent and then
: calls the appropriate function to parse it into $self->{"seq"}.
Example : $self->parse_unknown;
Returns : n/a
Argument : $ent : the rough multi-line string to be parsed
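Format guessing of this kind usually comes down to a few pattern tests on the raw text. The sketch below is standalone and speculative: the patterns and the guess_format_sketch name are assumptions chosen for illustration and are not the heuristics used by parse_unknown.

```perl
use strict;
use warnings;

# Guess a format name from the raw entry text using simple patterns.
# Every pattern here is an assumed heuristic, checked in order.
sub guess_format_sketch {
    my ($ent) = @_;
    return 'Fasta'   if $ent =~ /^\s*>/;                  # ">" header line
    return 'GenBank' if $ent =~ /^LOCUS\s/m;              # LOCUS record line
    return 'GCG'     if $ent =~ /Length:\s*\d+.*Check:\s*\d+\s*\.\./;
    return 'Raw'     if $ent =~ /^[A-Za-z\s*?-]+$/;       # residues only
    return 'unknown';                                     # compare parse_bad
}

print guess_format_sketch(">sample test\nACTG\n"), "\n";
print guess_format_sketch("ACTGACTG\nACTG\n"), "\n";
```

Ordering matters in such a chain: the most specific signatures are tested first, and the permissive raw-sequence pattern comes last.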
parse_bad
Title : parse_bad
Usage : parse_bad;
Function : complains of un-parsable sequence, last-ditch attempt via
: Parse.pm if sequence is being read from a file.
:
Example : $self->parse_bad;
Returns : n/a
Argument : n/a
version
Title : version();
Usage : $myseq->version;
Function : prints the current Bio::Seq version number
Bio::Seq Guts
Sequence Object
The sequence object is merely a reference to a hash containing
all or some of the following fields...
Field Value
--------------------------------------------------------------
seq the sequence
id a short identifier for the sequence
desc a description of the sequence, in descffmt file-format
names a hash of identifiers that relate to the sequence..
these could be database IDs, accession numbers, URLs,
pathnames, etc. Currently there is no set format
for the names hash and no formal definition of databases
or names
start start in bio-coords of the first residue of the sequence
end end in bio-coords of the last residue of the sequence
type the sequence type. Is actually a 2 value list of format
["monomer","origin"] where monomer is one of the
recognized sequence types and origin is a string
description of the sequence's origin (mitochondrial, etc)
ffmt file-format for the sequence
descffmt file-format of the description string
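A hash-based object with these fields can be sketched as follows. This is a standalone illustration, not the Bio::Seq constructor: the My::SeqSketch class name, the default values, and the way "end" is derived from "start" and the sequence length are all assumptions.

```perl
use strict;
use warnings;

{
    package My::SeqSketch;   # hypothetical class name

    # Build a blessed hash holding the documented fields.
    sub new {
        my ($class, %args) = @_;
        my $self = {
            seq   => '',  id    => 'unknown', desc => '',
            names => {},  start => 1,
            type  => ['Dna', ''],             # ["monomer", "origin"]
            ffmt  => 'Raw', descffmt => 'Raw',
        };
        # Accept "-name => value" pairs, as in the SYNOPSIS.
        for my $key (keys %args) {
            (my $field = lc $key) =~ s/^-//;
            $self->{$field} = $args{$key};
        }
        # Assumed: end = start + length - 1 unless given explicitly.
        $self->{end} = $self->{start} + length($self->{seq}) - 1
            unless defined $self->{end};
        return bless $self, $class;
    }

    # Combined get/set accessor in the style of id().
    sub id {
        my $self = shift;
        $self->{id} = shift if @_;
        return $self->{id};
    }
}

my $seq = My::SeqSketch->new(-seq => 'ACTGTGGCGTCAACTG', -id => 'sample');
print $seq->id, " covers ", $seq->{start}, "..", $seq->{end}, "\n";
```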