NAME

CracTools::Annotator - Generic annotation based on CracTools::GFF::Query::File

VERSION

version 1.2

SYNOPSIS

# Construct the annotator object that will index the GFF file in
# a genomic interval-tree based structure
my $annotator = CracTools::Annotator->new("annotation.gff");

# Query the annotator object for overlapping annotations
my $annot = $annotator->getBestAnnotationCandidate("chr1",12345,12380);

if(defined $annot->{exon}) {
  print STDERR "Found overlapping exon\n";
} else {
  # If no overlapping exons have been found, we check for the closest gene
  # in the downstream direction
  my $closest_annot = $annotator->getAnnotationNearestDownCandidates("chr1",12345)->[0];
  if(defined $closest_annot && defined $closest_annot->{gene}) {
    print STDERR "Closest gene annotation is ".(12345 - $closest_annot->{gene}->end)." bp away\n";
  }
}

DESCRIPTION

This module is based on CracTools::Interval::Query::File and provides powerful methods to query annotation files and prioritize hits to fit specific application needs.

Annotator works with a 0-based coordinate system and closed [a,b] intervals.
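
With closed intervals, both end positions belong to the interval, so two intervals that merely touch at one position still overlap. A minimal sketch of that test in plain Perl follows; the helper name closed_overlap is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the closed intervals [a1,b1] and [a2,b2]
# share at least one position.
sub closed_overlap {
    my ($a1, $b1, $a2, $b2) = @_;
    return ($a1 <= $b2 && $a2 <= $b1) ? 1 : 0;
}

print closed_overlap(10, 20, 20, 30), "\n";  # 1: position 20 is in both
print closed_overlap(10, 20, 21, 30), "\n";  # 0: no shared position
```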

The principle behind CracTools::Annotator is to build a genomic interval tree that holds the annotations. The user can then query this data structure to retrieve annotations. To organize the retrieved annotations, we build candidate hashes, each of which is a branch of the annotation tree. For a classic GFF annotation file, if the queried interval overlaps an exon, the branch of the annotation tree goes from an exon leaf up to the gene root, passing through an mRNA internal node.

Candidate structure

An annotation candidate is a hash data structure where keys are GFF features (exon, gene, mRNA) and values are CracTools::GFF::Annotation objects (parsed GFF lines).

It also contains an entry parent_feature that holds the parenting links between features, and an entry leaf_feature that holds the feature name of the leaf ("exon" for example).

my $candidate = {
  "exon" => CracTools::GFF::Annotation, 
  "gene" => CracTools::GFF::Annotation,
  "feature" => CracTools::GFF::Annotation, ..., 
  parent_feature => {exon => mRNA, featureA => featureB, ...},
  leaf_feature => "exon",
};
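
To illustrate how parent_feature and leaf_feature tie a candidate together, here is a minimal sketch that walks from the leaf up to the root. The candidate is mocked with plain strings instead of CracTools::GFF::Annotation objects, and the helper name feature_path is hypothetical.

```perl
use strict;
use warnings;

# Mock candidate: plain strings stand in for CracTools::GFF::Annotation objects.
my $candidate = {
    exon           => "exon annotation",
    mRNA           => "mRNA annotation",
    gene           => "gene annotation",
    parent_feature => { exon => "mRNA", mRNA => "gene" },
    leaf_feature   => "exon",
};

# Hypothetical helper: list the feature names from the leaf up to the root.
sub feature_path {
    my ($cand) = @_;
    my (@path, %seen);
    my $feature = $cand->{leaf_feature};
    # Stop when there is no parent left (or if a link loops back).
    while (defined $feature && !$seen{$feature}++) {
        push @path, $feature;
        $feature = $cand->{parent_feature}{$feature};
    }
    return @path;
}

print join(" -> ", feature_path($candidate)), "\n";  # exon -> mRNA -> gene
```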

Priority methods

Each annotation query can be parameterized with prioritization methods that choose the set of "best" annotation(s) returned to the user. This module provides default prioritization methods, but you can write your own to fit your application's needs.

There are two kinds of prioritization methods: prioritySub and comparSub.

Priority subroutine

The priority subroutine (by default "getCandidatePriorityDefault") receives as input the queried interval (start and end positions) and an annotation candidate. It must return a priority level (lower values are more important) and a string that names the priority level.
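
As an illustration of this contract, the sketch below shows what a custom priority subroutine might look like. The name my_priority and the type labels ("EXON", "GENE", "OTHER") are assumptions made for the example; the candidate is a plain hash rather than one built by the module.

```perl
use strict;
use warnings;

# Hypothetical priority subroutine.
# Input : queried start, queried end, candidate hash.
# Output: (priority, type); lower is better, -1 means "avoid this candidate".
sub my_priority {
    my ($pos_start, $pos_end, $candidate) = @_;
    return (0, "EXON") if defined $candidate->{exon};
    return (1, "GENE") if defined $candidate->{gene};
    return (-1, "OTHER");
}

my ($priority, $type) = my_priority(12345, 12380, { gene => "gene annotation" });
print "$priority $type\n";  # 1 GENE
```

Such a subroutine would presumably be handed to the query methods as a code reference (for example \&my_priority) in the optional priority argument.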

Compare subroutine

The compare subroutine (by default "compareTwoCandidatesDefault") receives as input two annotation candidates and the queried interval. It must return the better of the two candidates, or undef if it cannot decide between them.
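
A custom compare subroutine could follow the same pattern. In this sketch the name my_compare and the rule it applies (prefer the candidate that carries an mRNA entry) are assumptions chosen for illustration, and the candidates are plain hashes.

```perl
use strict;
use warnings;

# Hypothetical compare subroutine.
# Input : two candidates and the queried interval.
# Output: the preferred candidate, or undef when neither is preferred.
sub my_compare {
    my ($candidate1, $candidate2, $pos_start, $pos_end) = @_;
    my $has_mrna1 = defined $candidate1->{mRNA} ? 1 : 0;
    my $has_mrna2 = defined $candidate2->{mRNA} ? 1 : 0;
    return $candidate1 if $has_mrna1 > $has_mrna2;
    return $candidate2 if $has_mrna2 > $has_mrna1;
    return;  # tie: cannot decide (undef in scalar context)
}

my $with_mrna    = { gene => "g1", mRNA => "m1" };
my $without_mrna = { gene => "g2" };
my $best = my_compare($with_mrna, $without_mrna, 12345, 12380);
print $best->{gene}, "\n";  # g1
```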

METHODS

new

Arg [1] : String - $gff_file
          GFF file used to perform annotation
Arg [2] : String - $mode
          Execution mode : "fast" or "light" ("light" by default)

Example     : my $annotator = CracTools::Annotator->new($gff_file);
Description : Create a new CracTools::Annotator object based on the
              provided GFF file. If "light" mode is specified, CracTools::Annotator
              will consume less memory at the cost of longer execution time.
ReturnType  : CracTools::Annotator

mode

Description : Return the mode used to create the annotator
ReturnType  : string ("light" or "fast")

foundAnnotation

Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand

Description : Return true if any overlapping annotation has been found
ReturnType  : Boolean

foundGene

Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand

Description : Return true if an overlapping gene annotation has been found
ReturnType  : Boolean

foundSameGene

Arg [1] : String - chr
Arg [2] : String - pos_start1
Arg [3] : String - pos_end1
Arg [4] : String - pos_start2
Arg [5] : String - pos_end2
Arg [6] : String - strand

Description : Return true if the same gene overlaps both intervals.
ReturnType  : Boolean

getBestAnnotationCandidate

Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand
Arg [5] : (Optional) Subroutine - see "getCandidatePriorityDefault" for more details
Arg [6] : (Optional) Subroutine - see "compareTwoCandidatesDefault" for more details

Description : Return best annotation candidate according to the priorities given
              by the subroutine(s) in argument.
ReturnType  : AnnotationCandidate, Int(priority), String(type)

getBestAnnotationCandidates

Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand
Arg [5] : (Optional) Subroutine - see "getCandidatePriorityDefault" for more details
Arg [6] : (Optional) Subroutine - see "compareTwoCandidatesDefault" for more details

Description : Return best annotation candidates according to the priorities given
              by the subroutine(s) in argument.
ReturnType  : ArrayRef of AnnotationCandidates, Int(priority), String(type)
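
One way to picture how a priority subroutine and a compare subroutine could work together is sketched below. This is an illustration under stated assumptions, not the module's actual implementation: candidates are plain hashes, select_best and the two toy subroutines are hypothetical, candidates with priority -1 are assumed to be skipped, and an undecided comparison is assumed to keep both candidates.

```perl
use strict;
use warnings;

# Toy priority: exon hits first, then gene hits, everything else avoided.
my $priority_sub = sub {
    my ($start, $end, $cand) = @_;
    return (0, "EXON") if defined $cand->{exon};
    return (1, "GENE") if defined $cand->{gene};
    return (-1, "OTHER");
};

# Toy compare: prefer the candidate that has an mRNA entry; undef on a tie.
my $compare_sub = sub {
    my ($c1, $c2, $start, $end) = @_;
    my $m1 = defined $c1->{mRNA} ? 1 : 0;
    my $m2 = defined $c2->{mRNA} ? 1 : 0;
    return $m1 > $m2 ? $c1 : $m2 > $m1 ? $c2 : undef;
};

# Hypothetical selection loop: keep the candidates with the lowest
# non-negative priority and use the compare sub to break ties.
sub select_best {
    my ($candidates, $start, $end, $prio, $cmp) = @_;
    my (@best, $best_priority, $best_type);
    for my $cand (@$candidates) {
        my ($p, $type) = $prio->($start, $end, $cand);
        next if $p < 0;                          # assumed: -1 is skipped
        if (!defined $best_priority || $p < $best_priority) {
            @best = ($cand);
            ($best_priority, $best_type) = ($p, $type);
        }
        elsif ($p == $best_priority) {
            my $winner = $cmp->($best[0], $cand, $start, $end);
            if    (!defined $winner) { push @best, $cand }  # undecided: keep both
            elsif ($winner == $cand) { @best = ($cand) }    # new candidate wins
        }
    }
    return (\@best, $best_priority, $best_type);
}

my @candidates = (
    { gene => "g1" },
    { gene => "g2", mRNA => "m2", exon => "e2" },
    { gene => "g3", exon => "e3" },
);
my ($best, $priority, $type) =
    select_best(\@candidates, 12345, 12380, $priority_sub, $compare_sub);
print scalar(@$best), " candidate(s), priority $priority ($type)\n";
# prints: 1 candidate(s), priority 0 (EXON)
```

The return shape mirrors the documented one: an array reference of candidates, followed by the priority and its type label.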

getAnnotationCandidates

Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - pos_end
Arg [4] : String - strand

Description : Return an array with all annotation candidates overlapping the
              genomic region.
ReturnType  : ArrayRef of AnnotationCandidate

getAnnotationNearestDownCandidates

Arg [1] : String - chr
Arg [2] : String - pos_start
Arg [3] : String - strand

Description : Return an array with the nearest annotation candidates
              downstream of the query position (without overlap).
ReturnType  : ArrayRef of AnnotationCandidate

getAnnotationNearestUpCandidates

Arg [1] : String - chr
Arg [2] : String - pos_end
Arg [3] : String - strand

Description : Return an array with the nearest annotation candidates
              upstream of the query position (without overlap).
ReturnType  : ArrayRef of AnnotationCandidate

getCandidatePriorityDefault

Arg [1] : String - pos_start
Arg [2] : String - pos_end
Arg [3] : hash - candidate

Description : Default method used to give a priority to a candidate.
              You can create your own priority method to fit your specific need
              for selecting the best annotation.
              The best priority is 0. A priority of -1 means that this candidate
              should be avoided.
ReturnType  : Array($priority,$type) where $priority is an integer and $type a string

compareTwoCandidatesDefault

Arg [1] : hash - candidate1
Arg [2] : hash - candidate2
Arg [3] : pos_start (position start that has been queried)
Arg [4] : pos_end (position end that has been queried)

Description : Default method used to choose the best candidate when priorities are equal.
              You can create your own compare method to fit your specific need
              for selecting the best candidate.
ReturnType  : AnnotationCandidate - best candidate or undef if we cannot decide which candidate is the best

PRIVATE METHODS

_init

Description : init method, load GFF annotation into a
              CracTools::GFF::Query object.

_constructCandidates

Arg [1] : String - annot_id
Arg [2] : Hash ref - candidate
          Since this method is recursive, this is the object that
          we are constructing
Arg [3] : Hash ref - annot_hash
          annot_hash is a hash reference where keys are annotation IDs
          and values are CracTools::GFF::Annotation objects.

Description : _constructCandidates is a recursive method that builds a
              candidate hash. A candidate is defined as a path in the annotation
              (multi-rooted) tree from a leaf (e.g. an exon) to a root (e.g. a gene).
ReturnType  : Candidate Hash ref where keys are GFF features and
              values are CracTools::GFF::Annotation objects :
              { "exon" => CracTools::GFF::Annotation, 
                "gene" => CracTools::GFF::Annotation,
                feature => CracTools::GFF::Annotation, ..., 
                parent_feature => {featureA => featureB},
                leaf_feature => "exon",
              }
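
The recursive idea can be sketched as follows. This is a simplified illustration and not the module's code: annotations are mocked as plain hashes with assumed feature and parents fields, candidate values are annotation IDs rather than CracTools::GFF::Annotation objects, and build_candidates is a hypothetical name. Because the tree is multi-rooted and a node may have several parents, one leaf can yield several candidates.

```perl
use strict;
use warnings;

# Mock annotations keyed by ID (field names are assumptions).
my $annot_hash = {
    gene1 => { feature => "gene", parents => [] },
    mrna1 => { feature => "mRNA", parents => ["gene1"] },
    mrna2 => { feature => "mRNA", parents => ["gene1"] },
    exon1 => { feature => "exon", parents => ["mrna1", "mrna2"] },
};

# Hypothetical recursive builder: returns one candidate per leaf-to-root path.
sub build_candidates {
    my ($annot_id, $candidate, $annot_hash) = @_;
    my $annot = $annot_hash->{$annot_id} or return [];
    my $feature = $annot->{feature};

    # Copy the partial candidate so that sibling paths do not share state.
    my %cand = (%$candidate, $feature => $annot_id);
    $cand{parent_feature} = { %{ $candidate->{parent_feature} || {} } };
    $cand{leaf_feature} //= $feature;

    # A root has no parents: the path is complete.
    return [ \%cand ] unless @{ $annot->{parents} };

    my @candidates;
    for my $parent_id (@{ $annot->{parents} }) {
        my $parent = $annot_hash->{$parent_id} or next;
        # Record the link from this feature to its parent feature.
        my %next = (
            %cand,
            parent_feature => { %{ $cand{parent_feature} }, $feature => $parent->{feature} },
        );
        push @candidates, @{ build_candidates($parent_id, \%next, $annot_hash) };
    }
    return \@candidates;
}

my $candidates = build_candidates("exon1", {}, $annot_hash);
print scalar(@$candidates), "\n";  # 2: one path per mRNA parent
```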

_constructCandidatesFromAnnotation

Arg [1] : Hash ref - annotations
          annotations is a hash reference where keys are coordinates
          given by CracTools::Interval::Query::File objects.
Description : Build the candidate hashes for the given annotations by
              calling _constructCandidates.
ReturnType  : Array ref of all candidates built by _constructCandidates

AUTHORS

  • Nicolas PHILIPPE <nphilippe.research@gmail.com>

  • Jérôme AUDOUX <jaudoux@cpan.org>

  • Sacha BEAUMEUNIER <sacha.beaumeunier@gmail.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2015 by IRMB/INSERM (Institute for Regenerative Medicine and Biotherapy / Institut National de la Santé et de la Recherche Médicale) and AxLR/SATT (Languedoc-Roussillon / Societe d'Acceleration de Transfert de Technologie).

This is free software, licensed under:

The GNU Affero General Public License, Version 3, November 2007