NAME

Bio::ToolBox::db_helper::gff3_parser

DESCRIPTION

This module parses a GFF3 file into SeqFeature objects. Child features are associated with their parents as sub SeqFeature objects, provided the Parent tag is included and correctly identifies the unique ID tag of the parent. Any feature without a Parent tag is assumed to be a parent. Child features that reference a parent feature that has not been loaded may be lost. Multiple parentage, for example exons shared between different transcripts of the same gene, is supported.

Embedded Fasta sequences are ignored, as are most comment and pragma lines.

Close directives (###) in the GFF3 file are highly encouraged because they limit how much must be parsed at once; without them, the entire file will be slurped into memory. Refer to the GFF3 definition at http://www.sequenceontology.org for more details.
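To picture why close directives bound the amount of work, here is a minimal stand-alone sketch that splits GFF3 text into independent blocks at each ### line. The function name split_at_close_directives is hypothetical and this is not the module's parser; it only illustrates that everything before a ### can be finished and released before reading further.

```perl
use strict;
use warnings;

# Illustrative only, not part of this module: group GFF3 feature lines
# into blocks, ending a block whenever a close directive (###) is seen.
sub split_at_close_directives {
    my ($fh) = @_;
    my ( @blocks, @current );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^###\s*$/ ) {
            # all features seen so far are complete; close this block
            push @blocks, [@current] if @current;
            @current = ();
            next;
        }
        # skip other pragmas, comments, and blank lines
        next if $line =~ /^#/ || $line !~ /\S/;
        push @current, $line;
    }
    # a trailing block with no final ### still counts
    push @blocks, [@current] if @current;
    return @blocks;
}
```

With no ### lines at all, the function returns a single block holding every feature line, which mirrors the whole-file behavior described above.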

The SeqFeature objects that are returned are Bio::SeqFeature::Lite objects. Refer to that documentation for more information.

SYNOPSIS

  use Bio::ToolBox::db_helper::gff3_parser;
  my $filename = 'file.gff3';
  
  my $parser = Bio::ToolBox::db_helper::gff3_parser->new($filename) or 
        die "unable to open gff file!\n";
  
  while (my @top_features = $parser->top_features() ) {
        while (@top_features) {
                my $feature = shift @top_features;
                # each $feature is a Bio::SeqFeature::Lite object
                my @children = $feature->get_SeqFeatures();
        }
  }

METHODS

new()
new($file)

Initialize a new gff3_parser object.

Optionally pass the name of the GFF3 file, and it will be automatically opened by calling parse_file().

parse_file($file)

Pass the name of a GFF3 file to be parsed. The file must have a .gff or .gff3 extension, and may optionally be gzipped (.gz extension).

fh()
fh($filehandle)

This method returns the IO::File object of the opened GFF file. A new file may be parsed by passing an opened IO::File or other object that inherits IO::Handle methods.

next_feature()

This method will return a Bio::SeqFeature::Lite object representation of the next feature in the file. Parent-child relationships are NOT assembled, so it is best used with simple GFF files that have no hierarchies. It may be called in a while loop until the end of the file is reached. Pragmas are ignored, and comment lines and embedded sequence are automatically skipped.
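As a minimal sketch of the while-loop usage, the following helper tallies features by type. The name count_feature_types is hypothetical and not part of this module. It assumes only that next_feature() returns a false value once the file is exhausted and that each feature offers the Bio::SeqFeature::Lite primary_tag() method.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of this module: count features by type.
# $parser is any object providing next_feature(), such as a gff3_parser.
sub count_feature_types {
    my ($parser) = @_;
    my %count;
    # assumes next_feature() returns a false value at the end of the file
    while ( my $feature = $parser->next_feature ) {
        $count{ $feature->primary_tag }++;
    }
    return \%count;
}
```

It could be called as count_feature_types( Bio::ToolBox::db_helper::gff3_parser->new('file.gff3') ), which returns a hash reference keyed by feature type.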

top_features()

This method will return an array of the top (parent) features defined in the GFF3 file. The file will be progressively parsed from the beginning until either a close features pragma (###) or the end of the file is reached. Child features (those containing a Parent attribute) are associated with the parent feature. A warning will be issued about lost children (orphans). Shared subfeatures, for example exons common to multiple transcripts, are associated properly with each parent.
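The association described above can be sketched in plain Perl. This is an illustration under stated assumptions, not the module's implementation: the function name link_children and the simplified hash records (with hypothetical keys id, parents, and children) stand in for real SeqFeature objects.

```perl
use strict;
use warnings;

# Illustrative only: attach child records to their parents by ID.
# Each record is a hash ref: { id => 'x', parents => ['p1', 'p2'] }.
sub link_children {
    my (@records) = @_;
    my %by_id = map { $_->{id} => $_ } grep { defined $_->{id} } @records;
    my ( @top, @orphans );
    for my $rec (@records) {
        my @parents = @{ $rec->{parents} || [] };
        if ( !@parents ) {
            push @top, $rec;    # no Parent tag, so treat as a top feature
            next;
        }
        my $found = 0;
        for my $pid (@parents) {
            my $parent = $by_id{$pid} or next;
            # a shared child is attached to every parent it names
            push @{ $parent->{children} }, $rec;
            $found++;
        }
        push @orphans, $rec unless $found;
    }
    warn scalar(@orphans) . " orphan feature(s) could not be placed\n"
        if @orphans;
    return @top;
}
```

Because children are appended in file order, a parent's children list here is not guaranteed to be in ascending genomic order.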

Note that subfeatures are not necessarily in ascending genomic order when associated with the feature; this depends on their order in the GFF3 file and on whether shared subfeatures are present. When retrieving subfeatures in your program, you should sort them. For example:

  my @subfeatures = map { $_->[0] }
                    sort { $a->[1] <=> $b->[1] }
                    map { [$_, $_->start] }
                    $parent->get_SeqFeatures;

When close pragmas are present in the file, call this method repeatedly to finish parsing the remainder of the file.

from_gff3_string($string)

This method will parse a GFF3 formatted string or line of text and return a Bio::SeqFeature::Lite object.

unescape($text)

This method will unescape special characters in a text string. Certain characters, including ";" and "=", are reserved for GFF3 formatting and may not appear literally in a value, so they must be escaped.
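The GFF3 specification escapes reserved characters with URL-style percent encoding, for example %3B for ";" and %3D for "=". A minimal stand-alone sketch of the decoding step follows. The name gff3_unescape is hypothetical, and this is not necessarily how the module's own unescape() is written.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper, not the module's own unescape():
# decode every %XX hexadecimal escape back to its literal character.
sub gff3_unescape {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $text;
}
```

For instance, gff3_unescape('a%3Bb%3Dc') yields the string a;b=c.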

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.