NAME
CracTools::Annotator - Generic annotation based on CracTools::GFF::Query::File
VERSION
version 1.2
SYNOPSIS
# Construct the annotator object that will index the GFF file in
# a genomic interval-tree based structure
my $annotator = CracTools::Annotator->new("annotation.gff");
# Query the annotator object for overlapping annotations
my ($annot) = $annotator->getBestAnnotationCandidate("chr1",12345,12380);
if(defined $annot->{exon}) {
print STDERR "Found overlapping exon\n";
} else {
# If no overlapping exons have been found, we check for the closest gene
# in the downstream direction
my $closest_annot = $annotator->getAnnotationNearestDownCandidates("chr1",12345)->[0];
if(defined $closest_annot && defined $closest_annot->{gene}) {
print STDERR "Closest gene annotation is ".(12345 - $closest_annot->{gene}->end)."bp away\n";
}
}
DESCRIPTION
This module is based on CracTools::Interval::Query::File and provides powerful methods to query annotation files and prioritize hits to fit specific application needs.
The annotator works with a 0-based coordinate system and closed [a,b] intervals.
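As a small illustration of this convention, the sketch below shifts a 1-based closed interval (as found in a GFF file) to the 0-based closed form and tests two closed intervals for overlap. The helper names gff_to_zero_based and overlaps_closed are hypothetical and not part of CracTools::Annotator; the conversion is assumed to be a plain shift by one.

```perl
use strict;
use warnings;

# Hypothetical helper: shift a 1-based closed GFF interval to 0-based closed.
sub gff_to_zero_based {
    my ($start, $end) = @_;
    return ($start - 1, $end - 1);
}

# Two closed intervals [a1,b1] and [a2,b2] overlap when each one starts
# no later than the other one ends.
sub overlaps_closed {
    my ($a1, $b1, $a2, $b2) = @_;
    return ($a1 <= $b2 && $a2 <= $b1) ? 1 : 0;
}

my ($s, $e) = gff_to_zero_based(101, 200);    # (100, 199)

# With closed intervals, sharing the single position 199 counts as overlap.
print "overlap\n"    if overlaps_closed($s, $e, 199, 250);
print "no overlap\n" unless overlaps_closed($s, $e, 200, 250);
```

Because both bounds are inclusive, an interval of length one is written [a,a], and two intervals that only share an end position still overlap.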
The principle behind CracTools::Annotator is to build a genomic interval tree that holds the annotations. The user can then query this data structure to retrieve annotations. In order to organize the retrieved annotations, we build candidate hashes, each of which is a branch of the annotation tree. For a classic GFF annotation file, if the queried interval overlaps an exon, the branch of the annotation tree goes from the exon leaf up to the gene root, passing through an mRNA internal node.
Candidate structure
An annotation candidate is a hash data structure, where keys are GFF features (exon, gene, mRNA) and values are CracTools::GFF::Annotation objects (parsed GFF lines).
It also contains an entry parent_feature that holds the parenting links between features, and an entry leaf_feature that holds the feature name of the leaf ("exon" for example).
my $candidate = {
"exon" => CracTools::GFF::Annotation,
"gene" => CracTools::GFF::Annotation,
"feature" => CracTools::GFF::Annotation, ...,
parent_feature => {exon => mRNA, featureA => featureB, ...},
leaf_feature => "exon",
};
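To make the role of parent_feature and leaf_feature concrete, here is a minimal sketch that walks a candidate from its leaf up to its root. Plain strings stand in for the CracTools::GFF::Annotation objects, and the helper feature_path is hypothetical, not a method of the module.

```perl
use strict;
use warnings;

# Plain strings stand in for CracTools::GFF::Annotation objects here.
my $candidate = {
    exon           => "exon annotation",
    mRNA           => "mRNA annotation",
    gene           => "gene annotation",
    parent_feature => { exon => "mRNA", mRNA => "gene" },
    leaf_feature   => "exon",
};

# Hypothetical helper: list the feature names from the leaf up to the root
# by following the parent_feature links until no parent is recorded.
sub feature_path {
    my ($cand) = @_;
    my @path;
    my $feature = $cand->{leaf_feature};
    while (defined $feature) {
        push @path, $feature;
        $feature = $cand->{parent_feature}{$feature};
    }
    return @path;
}

print join(" -> ", feature_path($candidate)), "\n";    # exon -> mRNA -> gene
```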
Priority methods
Each annotation query can be parametrized with prioritization methods that choose the set of "best" annotation(s) to be returned to the user. This module provides default prioritization methods, but you can create your own to fit your application needs.
There are two kinds of prioritization method: prioritySub and compareSub.
Priority subroutine
The priority subroutine (by default "getCandidatePriorityDefault") receives as input the queried interval (start and end positions) and an annotation candidate. As output, the subroutine must return a priority level (the lower, the more important) and a string that is a literal version of the priority level.
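A custom priority subroutine following this contract could look like the sketch below. The name my_priority, the ranking rule and the type labels ("EXON", "INTRON", "UNKNOWN") are assumptions made for illustration; they are not the behaviour of getCandidatePriorityDefault.

```perl
use strict;
use warnings;

# Hypothetical priority subroutine: input is (pos_start, pos_end, candidate),
# output is (priority, type). Lower priorities are more important.
sub my_priority {
    my ($pos_start, $pos_end, $candidate) = @_;
    return (0, "EXON")   if defined $candidate->{exon};
    return (1, "INTRON") if defined $candidate->{gene};
    # -1 marks a candidate that should be avoided.
    return (-1, "UNKNOWN");
}

# Plain values stand in for CracTools::GFF::Annotation objects.
my ($priority, $type) = my_priority(12345, 12380, { gene => 1 });
print "priority=$priority type=$type\n";    # priority=1 type=INTRON
```

Such a subroutine would be passed by reference (\&my_priority) as the optional priority argument of getBestAnnotationCandidate or getBestAnnotationCandidates.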
Compare subroutine
The compare subroutine (by default "compareTwoCandidatesDefault") receives as input two annotation candidates and the queried interval. As output, the subroutine must return the better of the two candidates, or undef if it cannot decide between them.
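The sketch below shows one possible compare subroutine that prefers the candidate with the shorter gene. Mock::Annotation is a hypothetical stand-in for CracTools::GFF::Annotation, its start accessor is an assumption, and the tie-breaking rule is only an example, not the logic of compareTwoCandidatesDefault.

```perl
use strict;
use warnings;

# Minimal stand-in for an annotation object (start/end accessors only).
package Mock::Annotation;
sub new   { my ($class, %args) = @_; return bless {%args}, $class; }
sub start { return $_[0]{start}; }
sub end   { return $_[0]{end}; }

package main;

# Hypothetical compare subroutine: input is (candidate1, candidate2,
# pos_start, pos_end); output is the preferred candidate, or undef
# when no decision can be made.
sub prefer_shorter_gene {
    my ($candidate1, $candidate2, $pos_start, $pos_end) = @_;
    my ($gene1, $gene2) = ($candidate1->{gene}, $candidate2->{gene});
    return undef unless defined $gene1 && defined $gene2;
    # Closed intervals: length is end - start + 1.
    my $len1 = $gene1->end() - $gene1->start() + 1;
    my $len2 = $gene2->end() - $gene2->start() + 1;
    return $candidate1 if $len1 < $len2;
    return $candidate2 if $len2 < $len1;
    return undef;
}

my $c1 = { gene => Mock::Annotation->new(start => 100, end => 500) };
my $c2 = { gene => Mock::Annotation->new(start => 150, end => 300) };
my $best = prefer_shorter_gene($c1, $c2, 160, 180);
print "second candidate preferred\n" if defined $best && $best == $c2;
```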
METHODS
new
Arg [1] : String - $gff_file
GFF file used to perform annotation
Arg [2] : String - $mode
Execution mode : "fast" or "light" ("light" by default)
Example : my $annotator = CracTools::Annotator->new($gff_file);
Description : Create a new CracTools::Annotator object based on the
provided GFF file. In "light" mode, CracTools::Annotator
uses less memory at the cost of slower execution.
ReturnType : CracTools::Annotator
mode
Description : Return the mode used to create the annotator
ReturnType : string ("light" or "fast")
foundAnnotation
Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand
Description : Return true if any overlapping annotation has been found
ReturnType : Boolean
foundGene
Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand
Description : Return true if an overlapping gene annotation has been found
ReturnType : Boolean
foundSameGene
Arg [1] : String - chr
Arg [2] : String - pos_start1
Arg [3] : String - pos_end1
Arg [4] : String - pos_start2
Arg [5] : String - pos_end2
Arg [6] : String - strand
Description : Return true if the same gene overlaps both intervals.
ReturnType : Boolean
getBestAnnotationCandidate
Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand
Arg [5] : (Optional) Subroutine - see "getCandidatePriorityDefault" for more details
Arg [6] : (Optional) Subroutine - see "compareTwoCandidatesDefault" for more details
Description : Return best annotation candidate according to the priorities given
by the subroutine(s) in argument.
ReturnType : AnnotationCandidate, Int(priority), String(type)
getBestAnnotationCandidates
Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand
Arg [5] : (Optional) Subroutine - see "getCandidatePriorityDefault" for more details
Arg [6] : (Optional) Subroutine - see "compareTwoCandidatesDefault" for more details
Description : Return best annotation candidates according to the priorities given
by the subroutine(s) in argument.
ReturnType : ArrayRef of AnnotationCandidates, Int(priority), String(type)
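To illustrate how a priority subroutine can drive the selection, here is a rough sketch of a loop that keeps every candidate sharing the lowest non-negative priority. It is not the module's actual implementation: select_best and the example priority rule are hypothetical, plain hashes stand in for candidates, and the compare subroutine is left out.

```perl
use strict;
use warnings;

# Illustrative selection loop, NOT the module's implementation.
# $priority_sub returns (priority, type); negative priorities are skipped.
sub select_best {
    my ($candidates, $pos_start, $pos_end, $priority_sub) = @_;
    my (@best, $best_priority, $best_type);
    foreach my $candidate (@{$candidates}) {
        my ($priority, $type) = $priority_sub->($pos_start, $pos_end, $candidate);
        next if $priority < 0;
        if (!defined $best_priority || $priority < $best_priority) {
            # Strictly better priority: restart the list of best candidates.
            @best = ($candidate);
            ($best_priority, $best_type) = ($priority, $type);
        } elsif ($priority == $best_priority) {
            push @best, $candidate;
        }
    }
    return (\@best, $best_priority, $best_type);
}

# Example rule (an assumption): candidates with an exon rank first.
my $priority_sub = sub {
    my ($pos_start, $pos_end, $candidate) = @_;
    return defined $candidate->{exon} ? (0, "EXON") : (1, "OTHER");
};

my @candidates = ({ exon => 1, gene => 1 }, { gene => 1 }, { exon => 1 });
my ($best, $priority, $type) = select_best(\@candidates, 10, 20, $priority_sub);
print scalar(@{$best}), " candidate(s) with priority $priority ($type)\n";
```

The return value mirrors the documented shape: an array reference of candidates, followed by the priority and its type string.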
getAnnotationCandidates
Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand
Description : Return an array with all annotation candidates overlapping the
chromosomal region.
ReturnType : ArrayRef of AnnotationCandidate
getAnnotationNearestDownCandidates
Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - strand
Description : Return an array with the closest annotation candidates downstream
of the query position (without overlap).
ReturnType : ArrayRef of AnnotationCandidate
getAnnotationNearestUpCandidates
Arg [1] : String - chr
Arg [2] : String - pos_end
Arg [3] : String - strand
Description : Return an array with the closest annotation candidates upstream
of the query position (without overlap).
ReturnType : ArrayRef of AnnotationCandidate
getCandidatePriorityDefault
Arg [1] : String - pos_start
Arg [2] : String - pos_end
Arg [3] : hash - candidate
Description : Default method used to give a priority to a candidate.
You can create your own priority method to fit your specific need
for selecting the best annotation.
The best priority is 0. A priority of -1 means that this candidate
should be avoided.
ReturnType : Array($priority,$type) where $priority is an integer and $type a string
compareTwoCandidatesDefault
Arg [1] : hash - candidate1
Arg [2] : hash - candidate2
Arg [3] : pos_start (position start that has been queried)
Arg [4] : pos_end (position end that has been queried)
Description : Default method used to choose the best candidate when priorities are equal.
You can create your own compare method to fit your specific need
for selecting the best candidate.
ReturnType : AnnotationCandidate - best candidate, or undef if we cannot decide which candidate is the best
PRIVATE METHODS
_init
Description : init method, load GFF annotation into a
CracTools::GFF::Query object.
_constructCandidates
Arg [1] : String - annot_id
Arg [2] : Hash ref - candidate
Since this method is recursive, this is the object that
we are constructing
Arg [3] : Hash ref - annot_hash
annot_hash is a hash reference where keys are annotation IDs
and values are CracTools::GFF::Annotation objects.
Description : _constructCandidates is a recursive method that builds a
candidate hash. A candidate is defined as a path in the annotation
(multi-rooted) tree from a leaf (e.g. an exon) to a root (e.g. a gene).
ReturnType : Candidate Hash ref where keys are GFF features and
values are CracTools::GFF::Annotation objects :
{ "exon" => CracTools::GFF::Annotation,
"gene" => CracTools::GFF::Annotation,
feature => CracTools::GFF::Annotation, ...,
parent_feature => {featureA => featureB},
leaf_feature => "exon",
}
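The recursion can be pictured with the simplified sketch below. It handles only the single-parent case, uses plain hashes with assumed "feature" and "parent" keys in place of CracTools::GFF::Annotation objects, and the name build_candidate is hypothetical; the real method may differ, in particular for features with several parents.

```perl
use strict;
use warnings;

# Mock annotations keyed by ID; the "feature" and "parent" keys are
# assumptions made for this sketch.
my %annot_hash = (
    gene1 => { feature => "gene", parent => undef },
    mRNA1 => { feature => "mRNA", parent => "gene1" },
    exon1 => { feature => "exon", parent => "mRNA1" },
);

# Hypothetical single-parent recursion: record the current annotation,
# link its feature to the parent's feature, then climb to the parent.
sub build_candidate {
    my ($annot_id, $candidate, $annot_hash) = @_;
    my $annot = $annot_hash->{$annot_id};
    return $candidate unless defined $annot;
    # The first annotation visited is the leaf of the branch.
    $candidate->{leaf_feature} = $annot->{feature}
        unless defined $candidate->{leaf_feature};
    $candidate->{ $annot->{feature} } = $annot;
    my $parent_id = $annot->{parent};
    if (defined $parent_id && defined $annot_hash->{$parent_id}) {
        $candidate->{parent_feature}{ $annot->{feature} } =
            $annot_hash->{$parent_id}{feature};
        return build_candidate($parent_id, $candidate, $annot_hash);
    }
    return $candidate;
}

my $candidate = build_candidate("exon1", {}, \%annot_hash);
print "leaf: $candidate->{leaf_feature}, ",
      "exon parent: $candidate->{parent_feature}{exon}\n";
```

A feature with several parents would need one candidate per leaf-to-root path, which this sketch deliberately ignores.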
_constructCandidatesFromAnnotation
Arg [1] : Hash ref - annotations
annotations is a hash reference where keys are coordinates
given by CracTools::Interval::Query::File objects.
Description : Build the candidate hashes for the given annotations using
the recursive method _constructCandidates.
ReturnType : Candidate array ref of all candidates built by _constructCandidates
AUTHORS
Nicolas PHILIPPE <nphilippe.research@gmail.com>
Jérôme AUDOUX <jaudoux@cpan.org>
Sacha BEAUMEUNIER <sacha.beaumeunier@gmail.com>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2015 by IRMB/INSERM (Institute for Regenerative Medecine and Biotherapy / Institut National de la Santé et de la Recherche Médicale) and AxLR/SATT (Lanquedoc Roussilon / Societe d'Acceleration de Transfert de Technologie).
This is free software, licensed under:
The GNU Affero General Public License, Version 3, November 2007