NAME

Bio::ToolBox::gff3_parser

DESCRIPTION

This module parses a GFF3 file into SeqFeature objects. Child features are associated with parents as sub SeqFeature objects, assuming the Parent tag is included and correctly identifies the unique ID tag of the parent. Any feature without a Parent tag is assumed to be a parent. Child features referencing a parent feature that has not been loaded may be lost. Multiple parentage, for example exons shared between different transcripts of the same gene, is now supported.

Embedded FASTA sequences are ignored, as are most comment and pragma lines.

Close directives (###) in the GFF3 file are highly encouraged to limit parsing; otherwise, the entire file will be slurped into memory. This happens regardless of whether you retrieve top features one at a time or all at once. Refer to the GFF3 definition at http://www.sequenceontology.org for details on the close directive.
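
The effect of the close directive on buffering can be sketched in plain Perl. This is only an illustration and not the module's own parsing code: the feature lines, IDs, and coordinates are invented, and the buffering logic is an assumption about how a block-wise reader could work.

```perl
use strict;
use warnings;

# Hypothetical GFF3 content; IDs, source, and coordinates are invented.
my @lines = (
    '##gff-version 3',
    join("\t", 'chr1', 'example', 'gene', 1000, 2000, '.', '+', '.', 'ID=gene1'),
    join("\t", 'chr1', 'example', 'mRNA', 1000, 2000, '.', '+', '.', 'ID=mrna1;Parent=gene1'),
    '###',
    join("\t", 'chr1', 'example', 'gene', 5000, 6000, '.', '-', '.', 'ID=gene2'),
    join("\t", 'chr1', 'example', 'mRNA', 5000, 6000, '.', '-', '.', 'ID=mrna2;Parent=gene2'),
    join("\t", 'chr1', 'example', 'exon', 5000, 6000, '.', '-', '.', 'ID=exon1;Parent=mrna2'),
    '###',
);

my @buffer;        # feature lines waiting for their block to close
my @block_sizes;   # number of lines held when each block closed
for my $line (@lines) {
    if ($line =~ /^###\s*$/) {
        # Close directive: every feature buffered so far is complete.
        push @block_sizes, scalar @buffer;
        @buffer = ();
        next;
    }
    next if $line =~ /^#/;    # other pragmas and comments
    push @buffer, $line;
}
# Lines after the last directive are only complete at end of file.
push @block_sizes, scalar @buffer if @buffer;

print "features per block: @block_sizes\n";
```

With the directives in place, the buffer never holds more than one block of features; without them, it would grow to the size of the whole file.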

The SeqFeature objects that are returned are Bio::SeqFeature::Lite objects. Refer to that documentation for more information.

SYNOPSIS

  use Bio::ToolBox::gff3_parser;
  my $filename = 'file.gff3';

  my $parser = Bio::ToolBox::gff3_parser->new($filename) or
      die "unable to open gff file!\n";

  while (my $feature = $parser->next_top_feature() ) {
      # each $feature is a Bio::SeqFeature::Lite object
      my @children = $feature->get_SeqFeatures();
  }

METHODS

new()
new($file)

Initialize a new gff3_parser object.

Optionally pass the name of a GFF3 file, which is opened automatically by calling open_file(). The object can do little until a file is opened.

open_file($file)

Pass the name of a GFF3 file to be parsed. The file must have a .gff or .gff3 extension, and may optionally be gzipped (.gz extension).

fh()
fh($filehandle)

This method returns the IO::File object of the opened GFF file. A new file may be parsed by passing an opened IO::File or other object that inherits IO::Handle methods.

next_feature()

This method will return a Bio::SeqFeature::Lite object representation of the next feature in the file. Parent-child relationships are NOT assembled. This is best used with simple GFF files with no hierarchies present. It may be called in a while loop until the end of the file is reached. Pragmas are ignored, and comment lines and embedded sequence are skipped automatically.

next_top_feature()

This method will return a top-level parent Bio::SeqFeature::Lite object assembled with child features as subfeatures, for example a gene object with mRNA subfeatures, which in turn may have exon and/or CDS subfeatures. Child features are assembled based on the existence of proper Parent attributes in child features. If no Parent attributes are included in the GFF file, then this will behave as next_feature().

Unless close directives (###) are included in the file, all features in the file must be loaded into memory to resolve parent-child relationships. Child features (those containing a Parent attribute) are associated with their parent feature. Shared subfeatures, for example exons common to multiple transcripts, are associated properly with each parent. A warning is issued about lost children (orphans), which may be rescued using the orphans() method.

Note that subfeatures may not necessarily be in ascending genomic order when associated with the feature, depending on their order in the GFF3 file and whether shared subfeatures are present. When retrieving subfeatures in your program, you may want to sort them by start position. For example:

  my @subfeatures = map { $_->[0] }
                    sort { $a->[1] <=> $b->[1] }
                    map { [$_, $_->start] }
                    $parent->get_SeqFeatures;

When close directives are present in the file, call this method repeatedly to finish parsing the remainder of the file.
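
The assembly of parents, shared subfeatures, and orphans described above can be sketched in plain Perl. This is a minimal illustration and not the module's implementation: the records are invented, plain hashes stand in for Bio::SeqFeature::Lite objects, and the two-pass index is an assumption chosen for clarity.

```perl
use strict;
use warnings;

# Hypothetical records: [ ID, type, start, Parent IDs ].
my @records = (
    [ 'gene1', 'gene', 100, [] ],
    [ 'mrna1', 'mRNA', 100, ['gene1'] ],
    [ 'mrna2', 'mRNA', 150, ['gene1'] ],
    [ 'exon1', 'exon', 300, [ 'mrna1', 'mrna2' ] ],   # shared exon
    [ 'exon2', 'exon', 100, ['mrna1'] ],
    [ 'exon9', 'exon', 900, ['mrna9'] ],              # parent never defined
);

my (%by_id, @top, @orphans);

# First pass: index every feature by its ID.
for my $r (@records) {
    my ($id, $type, $start, $parents) = @$r;
    $by_id{$id} = { id => $id, type => $type, start => $start,
                    parents => $parents, children => [] };
}

# Second pass: attach each child to every parent it names.
for my $r (@records) {
    my $feature = $by_id{ $r->[0] };
    my @parents = @{ $feature->{parents} };
    if (!@parents) {
        push @top, $feature;      # no Parent tag: a top feature
        next;
    }
    my $found = 0;
    for my $pid (@parents) {
        next unless exists $by_id{$pid};
        push @{ $by_id{$pid}{children} }, $feature;
        $found++;
    }
    push @orphans, $feature unless $found;
}

# Children arrive in file order, so sort by start when order matters.
my @exons = sort { $a->{start} <=> $b->{start} } @{ $by_id{mrna1}{children} };

print "top: @{[ map { $_->{id} } @top ]}\n";
print "mrna1 exons: @{[ map { $_->{id} } @exons ]}\n";
print "orphans: @{[ map { $_->{id} } @orphans ]}\n";
```

Here exon1 is attached to both mrna1 and mrna2, while exon9 names a parent that never appears and is set aside as an orphan.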

top_features()

This method will return an array of the top (parent) features defined in the GFF3 file. This is similar to the next_top_feature() method except that all features are returned at once.

Note that unless close pragmas are present in the file, requesting all top features at once or one at a time will not save on memory; the entire file will still be parsed into memory.

orphans()

This method will return an array of orphan SeqFeature objects, that is, features that named a parent that could not be found. Typically, this is an indication of an incomplete or malformed GFF3 file. Nevertheless, it is a good idea to check this after retrieving all top features.

from_gff3_string($string)

This method will parse a GFF3 formatted string or line of text and return a Bio::SeqFeature::Lite object.
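
To show what parsing a single GFF3 line involves, here is a minimal plain-Perl sketch. The helper parse_gff3_line() is hypothetical and is not part of this module; it returns a plain hash reference rather than a Bio::SeqFeature::Lite object, and it does not decode escaped characters.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::ToolBox::gff3_parser.
sub parse_gff3_line {
    my ($line) = @_;
    chomp $line;
    my @col = split /\t/, $line;
    return unless @col == 9;    # GFF3 defines exactly nine columns

    my %attr;
    for my $pair (split /;/, $col[8]) {
        my ($tag, $value) = split /=/, $pair, 2;
        next unless defined $value;
        # Multiple values for one tag are comma-separated.
        $attr{$tag} = [ split /,/, $value ];
    }
    return {
        seq_id => $col[0], source => $col[1], type  => $col[2],
        start  => $col[3], end    => $col[4], score => $col[5],
        strand => $col[6], phase  => $col[7], attributes => \%attr,
    };
}

# Invented example line with two parents.
my $line = join "\t", 'chr1', 'example', 'exon', 300, 450, '.', '+', '.',
    'ID=exon1;Parent=mrna1,mrna2';
my $f = parse_gff3_line($line);
print "$f->{type} $f->{seq_id}:$f->{start}-$f->{end} ",
      "parents: @{ $f->{attributes}{Parent} }\n";
```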

unescape($text)

This method will unescape special characters in a text string. Certain characters, including ";" and "=", are reserved for GFF3 formatting and must be escaped when they appear in values.
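
GFF3 escapes reserved characters as URL-style percent codes (for example, %3B for ";"). A minimal sketch of the decoding step follows; decode_gff3_value() is a hypothetical helper, and the module's own unescape() may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper; the module's own unescape() may differ in detail.
sub decode_gff3_value {
    my ($text) = @_;
    # Replace each %XX hex escape with the character it encodes.
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $text;
}

# Invented attribute value containing escaped ";", "," and "=".
my $raw = 'kinase%3B alpha subunit%2C isoform%3D2';
print decode_gff3_value($raw), "\n";
```

Decoding should happen only after the attribute column has been split on ";", "=" and ",", so that escaped characters are not mistaken for separators.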

AUTHOR

Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.