NAME
Bio::ToolBox::SeqFeature - Fast, simple SeqFeature implementation
SYNOPSIS
# create a transcript
my $transcript = Bio::ToolBox::SeqFeature->new(
-seq_id => 'chr1',
-start => 1001,
-stop => 1500,
-strand => '+',
);
$transcript->primary_tag('mRNA'); # set parameters individually
# create an exon
my $exon = Bio::ToolBox::SeqFeature->new(
-start => 1001,
-end => 1200,
-type => 'exon',
);
# associate exon with transcript
$transcript->add_SeqFeature($exon);
my $exon_strand = $exon->strand; # inherits from parent
# walk through subfeatures
foreach my $f ($transcript->get_all_SeqFeatures) {
printf "%s is a %s\n", $f->display_name, $f->type;
}
# add attribute
$transcript->add_tag_value('Status', $status);
# get attribute
my $value = $transcript->get_tag_values($key);
DESCRIPTION
SeqFeature objects represent functional elements on a genomic or chromosomal sequence, such as genes, transcripts, exons, etc. In many cases, especially genes, they have a hierarchical structure, typically in this order:
gene
  mRNA or transcript
    exon
    CDS
SeqFeature objects have, at a minimum, coordinate information (chromosome, start, stop, and strand) and a name or unique identifier. They often also have type or source information, which usually follows Sequence Ontology keywords.
This is a fast, efficient, simplified SeqFeature implementation that mostly implements the Bio::SeqFeatureI API and could be substituted for other implementations, such as Bio::SeqFeature::Lite and Bio::SeqFeature::Generic. Unlike the others, however, it inherits from no other classes and uses an unorthodox blessed array to store feature attributes, decreasing memory requirements and complexity.
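The blessed-array idea can be illustrated with a small stand-alone sketch. This is not the module's actual source: the package name My::ArrayFeature, the slot constants, and the three accessors are all hypothetical, chosen only to show how an array-backed object with named slots might look.

```perl
use strict;
use warnings;

# Hypothetical illustration only; not the code of Bio::ToolBox::SeqFeature.
package My::ArrayFeature;

# Named constants label each array slot so the code stays readable
use constant {
    SEQID => 0,
    START => 1,
    STOP  => 2,
};

sub new {
    my ( $class, %args ) = @_;
    my $self = bless [], $class;    # an array reference, not a hash
    $self->[SEQID] = $args{-seq_id};
    $self->[START] = $args{-start};
    $self->[STOP]  = $args{-end};
    return $self;
}

# Combined getter/setter: pass a value to set it; the value is always returned
sub seq_id {
    my $self = shift;
    $self->[SEQID] = shift if @_;
    return $self->[SEQID];
}

sub start {
    my $self = shift;
    $self->[START] = shift if @_;
    return $self->[START];
}

sub end {
    my $self = shift;
    $self->[STOP] = shift if @_;
    return $self->[STOP];
}

package main;

my $f = My::ArrayFeature->new( -seq_id => 'chr1', -start => 1001, -end => 1500 );
printf "%s:%d-%d\n", $f->seq_id, $f->start, $f->end;    # prints "chr1:1001-1500"
```

An array avoids storing a key string per field in every object, which is where the memory saving comes from; the trade-off is that subclasses cannot safely add fields without knowing the slot layout.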
METHODS
Refer to the Bio::SeqFeatureI documentation for the general API and the ideas behind it, which this module tries to follow.
Creating new SeqFeature objects
New, empty SeqFeature objects can be generated, but in general they should be generated with location and other attributes. Pass an array of key => value pairs. Most of the accession methods may be used as key tags to the new method. The following attribute keys are accepted.
- -seq_id
- -start
- -end
- -stop
- -strand
- -name
- -display_name
- -id
- -primary_id
- -type
- -source
- -source_tag
- -primary_tag
- -score
- -phase
- -attributes
-
Provide an anonymous array of key value pairs representing attribute keys and their values.
- -segments
-
Provide an anonymous array of SeqFeature objects to add as child objects.
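As a minimal sketch of the constructor keys listed above, assuming Bio::ToolBox::SeqFeature is installed: the names, identifiers, source, and coordinates below are invented for illustration. Two exons are built first and then handed to the parent through -segments.

```perl
use strict;
use warnings;
use Bio::ToolBox::SeqFeature;

# Child features, created with full coordinates
my $exon1 = Bio::ToolBox::SeqFeature->new(
    -seq_id      => 'chr1',
    -start       => 1001,
    -end         => 1200,
    -strand      => '+',
    -primary_tag => 'exon',
);
my $exon2 = Bio::ToolBox::SeqFeature->new(
    -seq_id      => 'chr1',
    -start       => 1301,
    -end         => 1500,
    -strand      => '+',
    -primary_tag => 'exon',
);

# Parent feature; -segments takes an anonymous array of SeqFeature objects
my $transcript = Bio::ToolBox::SeqFeature->new(
    -seq_id      => 'chr1',
    -start       => 1001,
    -end         => 1500,
    -strand      => '+',
    -name        => 'example_transcript',    # hypothetical name
    -primary_id  => 'tx0001',                # hypothetical identifier
    -primary_tag => 'mRNA',
    -source      => 'example',               # hypothetical source
    -segments    => [ $exon1, $exon2 ],
);

printf "%s (%s) at %s:%d-%d\n",
    $transcript->display_name, $transcript->primary_id,
    $transcript->seq_id, $transcript->start, $transcript->end;
```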
Accession methods
These are methods to set and/or retrieve attribute values. Pass a single value to set the attribute. The attribute value is always returned.
- seq_id
-
The name of the chromosome or reference sequence. If you are generating a new child object, e.g. exon, then seq_id does not need to be provided. In these cases, the parent's seq_id is inherited.
- start
-
The start coordinate of the feature. SeqFeature objects use the 1-base numbering system, following BioPerl convention. Start coordinates are always less than the stop coordinate.
- end
- stop
-
The end coordinate of the feature. Stop coordinates are always greater than the start coordinate.
- strand
-
The strand that the feature is on. The default value is always unstranded, or 0. Any of the following may be supplied: "1 0 -1 + . -". Numeric integers 1, 0, and -1 are always returned.
- source
- source_tag
-
A text string representing the source of the feature. This corresponds to the second field in a GFF file. The source tag is optional. If not supplied, the source tag can be inherited from a parent feature if present.
- primary_tag
-
The type of feature. These usually follow Sequence Ontology terms, but are not required. The default value is the generic term "region". Examples include gene, mRNA, transcript, exon, CDS, etc. This corresponds to the third field in a GFF file.
- type
-
A shortcut method which can represent either "primary_tag:source_tag" or, if no source_tag is defined, simply "primary_tag".
- name
- display_name
-
A text string representing the name of the feature. The name is not required to be a unique value, but generally is. The default name, if none is provided, is the primary_id.
- id
- primary_id
-
A text string representing a unique identifier of the feature. If not explicitly defined, a unique ID is automatically generated.
- score
-
A numeric (integer or floating point) value representing the score of the feature.
- phase
-
An integer (0,1,2) representing the coding frame. Only required for CDS features.
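The accessors above can be exercised as in the following sketch, which assumes Bio::ToolBox::SeqFeature is installed and uses made-up coordinates and a made-up source. It relies only on the behavior described in this section: the documented defaults, strand normalization to integers, and the combined "primary_tag:source_tag" form returned by type.

```perl
use strict;
use warnings;
use Bio::ToolBox::SeqFeature;

my $gene = Bio::ToolBox::SeqFeature->new(
    -seq_id => 'chr2',
    -start  => 5000,
    -end    => 7500,
);

# Documented defaults when nothing has been set
my $default_tag    = $gene->primary_tag;    # generic term "region"
my $default_strand = $gene->strand;         # unstranded, 0

# Strand may be supplied symbolically; an integer is returned
$gene->strand('-');
my $strand = $gene->strand;

# Passing a value sets the attribute; the value is returned either way
$gene->primary_tag('gene');
$gene->source_tag('example');               # hypothetical source
my $type = $gene->type;                     # "primary_tag:source_tag"

print "default tag: $default_tag, default strand: $default_strand\n";
print "strand: $strand, type: $type\n";
```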
Special Attributes
Special attributes are key value pairs that do not fall under the above conventions. It is possible to have more than one value assigned to a given key. In a GFF file, this corresponds to the attributes in the 9th field, with the exception of special reserved attributes such as Name, ID, and Parent.
- add_tag_value($key, $value)
-
Sets the special attribute $key to $value. If you have more than one value, $value should be an anonymous array of text values. Following GFF convention, $key should not contain special characters, including ";,= ".
- all_tags
- get_all_tags
-
Returns an array of all attribute keys.
- has_tag($key)
-
Returns a boolean value indicating whether the SeqFeature object contains the attribute.
- get_tag_values($key)
- each_tag_value($key)
-
Returns the value for attribute $key. If multiple values are present, it may return an array or array reference.
- attributes
-
Returns an array or reference to the key value hash.
- remove_tag($key)
-
Deletes the indicated attribute.
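A short sketch of the attribute methods, assuming Bio::ToolBox::SeqFeature is installed; the keys and values are hypothetical. Because the return form for multiple values may be an array or an array reference, the example normalizes both cases rather than assuming one.

```perl
use strict;
use warnings;
use Bio::ToolBox::SeqFeature;

my $transcript = Bio::ToolBox::SeqFeature->new(
    -seq_id      => 'chr1',
    -start       => 1001,
    -end         => 1500,
    -primary_tag => 'mRNA',
);

# A single value
$transcript->add_tag_value( 'Status', 'validated' );    # hypothetical value

# Several values for one key go in as an anonymous array
$transcript->add_tag_value( 'Alias', [ 'aliasA', 'aliasB' ] );

# List context works whether one value or several come back
my @status = $transcript->get_tag_values('Status');

# Normalize in case an array reference is returned instead of a list
my @alias = $transcript->get_tag_values('Alias');
@alias = @{ $alias[0] } if @alias == 1 and ref $alias[0] eq 'ARRAY';

print "Status: @status\n";
print "Alias: @alias\n";

# Test for and remove an attribute
my $had_status = $transcript->has_tag('Status') ? 1 : 0;
$transcript->remove_tag('Status');
my $has_status = $transcript->has_tag('Status') ? 1 : 0;
print "Status present before: $had_status, after: $has_status\n";
```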
Subfeature Hierarchy
Following Sequence Ontology and GFF conventions, SeqFeatures can have subfeatures (children) representing a hierarchical structure, for example genes beget transcripts which beget exons.
Child SeqFeature objects may have more than one parent, for example, exons shared between alternate transcripts. In that case, only one exon SeqFeature object is generated, and it is added to both transcript objects.
- add_SeqFeature($feature1, ...)
- add_segment($feature1, ...)
-
Pass one or more SeqFeature objects to be associated as children.
- get_SeqFeatures
- get_all_SeqFeatures
- segments
-
Returns an array of all sub SeqFeature objects.
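The shared-exon case described above might look like the following sketch, assuming Bio::ToolBox::SeqFeature is installed; all names and coordinates are invented. The shared exon is created without a seq_id or strand so that it picks them up from the first parent it is added to, as described for child features.

```perl
use strict;
use warnings;
use Bio::ToolBox::SeqFeature;

# Two alternate transcripts on the same strand
my $tx1 = Bio::ToolBox::SeqFeature->new(
    -seq_id => 'chr1', -start => 1001, -end => 2000,
    -strand => '-', -primary_tag => 'mRNA', -name => 'tx1',
);
my $tx2 = Bio::ToolBox::SeqFeature->new(
    -seq_id => 'chr1', -start => 1001, -end => 1200,
    -strand => '-', -primary_tag => 'mRNA', -name => 'tx2',
);

# Exons created without seq_id or strand; these are inherited from the parent
my $shared = Bio::ToolBox::SeqFeature->new(
    -start => 1001, -end => 1200, -primary_tag => 'exon',
);
my $unique = Bio::ToolBox::SeqFeature->new(
    -start => 1801, -end => 2000, -primary_tag => 'exon',
);

# One exon object, two parents
$tx1->add_SeqFeature( $shared, $unique );
$tx2->add_segment($shared);

foreach my $tx ( $tx1, $tx2 ) {
    my @exons = $tx->get_SeqFeatures;
    printf "%s has %d exon(s)\n", $tx->display_name, scalar @exons;
    foreach my $e (@exons) {
        printf "  %s %s:%d-%d strand %d\n",
            $e->primary_tag, $e->seq_id, $e->start, $e->end, $e->strand;
    }
}
```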
Range Methods
These are range methods for comparing one SeqFeature object to another. They are analogous to Bio::RangeI methods.
They currently do not support strand checks or strand options.
- length
-
Returns the length of the SeqFeature object.
- overlaps($other)
-
Returns a boolean value indicating whether the $other SeqFeature object overlaps with the self object.
- contains($other)
-
Returns a boolean value indicating whether the self object completely contains the $other SeqFeature object.
- equals($other)
-
Returns a boolean value indicating whether the self object coordinates are equivalent to the $other SeqFeature object.
- intersection($other)
-
Returns a new SeqFeature object representing the intersection or overlap area between the $self object and the $other SeqFeature object.
- union($other)
-
Returns a new SeqFeature object representing the merged interval between the $self and $other SeqFeature objects.
- subtract($other)
-
Returns a new SeqFeature object representing the interval of the $self object after subtracting the $other SeqFeature object.
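The range methods can be tried on a few overlapping intervals, as in this sketch. It assumes Bio::ToolBox::SeqFeature is installed and that all features share one seq_id; the coordinates are arbitrary. With 1-based inclusive coordinates, a feature spanning 1001 to 1500 covers 500 bases.

```perl
use strict;
use warnings;
use Bio::ToolBox::SeqFeature;

my $first = Bio::ToolBox::SeqFeature->new(
    -seq_id => 'chr1', -start => 1001, -end => 1500,
);
my $second = Bio::ToolBox::SeqFeature->new(
    -seq_id => 'chr1', -start => 1401, -end => 1600,
);
my $inner = Bio::ToolBox::SeqFeature->new(
    -seq_id => 'chr1', -start => 1101, -end => 1200,
);

printf "length of first: %d\n", $first->length;
print "first overlaps second\n" if $first->overlaps($second);
print "first contains inner\n"  if $first->contains($inner);
print "first does not equal second\n" unless $first->equals($second);

# Both of these return new SeqFeature objects
my $common = $first->intersection($second);
my $merged = $first->union($second);

printf "intersection: %d-%d\n", $common->start, $common->end;
printf "union: %d-%d\n",        $merged->start, $merged->end;
```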
Export Strings
These methods export the SeqFeature object as a text string in the specified format. New line characters are included.
- gff_string($recurse)
-
Exports the SeqFeature object as a GFF3 formatted string. Pass a boolean value if you wish to recurse through the hierarchy and print subfeatures as a multi-line string. Child-Parent ID attributes are smartly handled, including multiple parentage.
Currently no support is available for other GFF formats.
- bed_string
-
Exports the SeqFeature object as a BED6 formatted string. Currently no support is available for recursive printing or BED12 formats.
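A brief sketch of both export methods, assuming Bio::ToolBox::SeqFeature is installed; the feature names, identifier, and coordinates are hypothetical. Since the returned strings already include new line characters, they are printed as-is.

```perl
use strict;
use warnings;
use Bio::ToolBox::SeqFeature;

my $transcript = Bio::ToolBox::SeqFeature->new(
    -seq_id      => 'chr1',
    -start       => 1001,
    -end         => 1500,
    -strand      => '+',
    -name        => 'example_transcript',    # hypothetical name
    -primary_id  => 'tx0001',                # hypothetical identifier
    -primary_tag => 'mRNA',
);
my $exon = Bio::ToolBox::SeqFeature->new(
    -start       => 1001,
    -end         => 1200,
    -primary_tag => 'exon',
);
$transcript->add_SeqFeature($exon);

# GFF3, recursing into subfeatures to produce a multi-line string
my $gff = $transcript->gff_string(1);
print $gff;

# BED6, parent feature only
my $bed = $transcript->bed_string;
print $bed;
```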
LIMITATIONS
Because of their underlying array structure, Bio::ToolBox::SeqFeature objects should generally not be used as a base class (unless you know the ramifications of doing so). The following Bio classes and Interfaces are similar and their API was used as a model. However, in most cases they are not likely to work with this module because of object structure incompatibility, although this has not been explicitly tested.
- Bio::AnnotationI
- Bio::LocationI
- Bio::RangeI
- Bio::SeqI
- Bio::Tools::GFF
- Bio::DB::SeqFeature::Store
- Bio::Graphics
AUTHOR
Timothy J. Parnell, PhD
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.