NAME

Bio::ToolBox::Parser::gff - parse GFF3, GTF, and generic GFF files

SYNOPSIS

  use Bio::ToolBox::Parser;
  my $filename = 'file.gff3';
  
  my $Parser = Bio::ToolBox::Parser->new(
  	file    => $filename,
  	do_gene => 1,
  	do_exon => 1,
  ) or die "unable to open gff file!\n";
  # the Parser will taste the file and open the appropriate 
  # subclass parser, gff in this case
  
  while ( my $feature = $Parser->next_top_feature() ) {
  	# each $feature is a parent SeqFeature object, usually a gene
  	printf "%s:%d-%d\n", $feature->seq_id, $feature->start, $feature->end;
  	
  	# subfeatures such as transcripts, exons, etc. are nested within
  	my @children = $feature->get_SeqFeatures();
  }

DESCRIPTION

This is the GFF-specific parser subclass of the Bio::ToolBox::Parser object, and as such it inherits generic methods from the parent.

This module parses a GFF file into SeqFeature objects. It natively handles GFF3, GTF, and general GFF files.

For both GFF3 and GTF files, fully nested gene models, typically gene => transcript => (exon, CDS, etc), may be built using the appropriate attribute tags. For GFF3 files, these include ID and Parent tags; for GTF these include gene_id and transcript_id tags.
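To illustrate which tags carry the nesting information, here is a minimal, standalone sketch of pulling them out of the attribute column (column 9). The subroutine names parse_gff3_attributes and parse_gtf_ids are hypothetical and are not part of this module; the parser's own handling may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: split a GFF3 column 9 string such as
# "ID=exon1;Parent=mRNA1,mRNA2" into a hash of tag => [values].
sub parse_gff3_attributes {
    my ($column9) = @_;
    my %attr;
    foreach my $pair ( split /;/, $column9 ) {
        $pair =~ s/^\s+|\s+$//g;    # trim stray whitespace
        my ( $tag, $value ) = split /=/, $pair, 2;
        next unless defined $tag and defined $value;
        $attr{$tag} = [ split /,/, $value ];    # comma separates multiple values
    }
    return \%attr;
}

# Hypothetical helper: extract gene_id and transcript_id from a GTF
# column 9 string such as 'gene_id "G1"; transcript_id "T1";'.
sub parse_gtf_ids {
    my ($column9) = @_;
    my %id;
    while ( $column9 =~ /\b(gene_id|transcript_id) \s+ "([^"]*)"/gx ) {
        $id{$1} = $2;
    }
    return \%id;
}

my $gff3 = parse_gff3_attributes('ID=exon1;Parent=mRNA1,mRNA2');
my $gtf  = parse_gtf_ids('gene_id "G1"; transcript_id "T1";');
printf "exon1 parents: %s\n", join( ', ', @{ $gff3->{Parent} } );
printf "GTF gene %s, transcript %s\n", $gtf->{gene_id}, $gtf->{transcript_id};
```

The two Parent values in the GFF3 example correspond to a single exon shared between two transcripts.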

For GFF3 files, any feature without a Parent tag is assumed to be a parent. Child features that reference a parent feature that has not yet been loaded are considered orphans. After the file is completely parsed, the parser attempts to re-associate orphans with their missing parents. Any orphans still left may be collected. Files with orphans are considered poorly formatted or incomplete and should be fixed. Multiple parentage, for example exons shared between different transcripts of the same gene, is fully supported.
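The orphan handling described above can be pictured with a small, standalone sketch that uses plain hashes in place of SeqFeature objects. It only illustrates the general two-pass idea under that assumption; it is not the module's actual implementation, and the record structure is made up for the example.

```perl
use strict;
use warnings;

# Made-up records in file order; the child "exon1" appears before its parent.
my @records = (
    { id => 'exon1', parent => 'mRNA1' },
    { id => 'mRNA1', parent => undef },
    { id => 'exon2', parent => 'mRNA9' },    # parent never appears in the file
);

my ( %loaded, @top, @orphans );
foreach my $rec (@records) {
    my $feature = { id => $rec->{id}, parent => $rec->{parent}, children => [] };
    $loaded{ $rec->{id} } = $feature;
    if ( not defined $rec->{parent} ) {
        push @top, $feature;    # no Parent tag, so treat as a top feature
    }
    elsif ( exists $loaded{ $rec->{parent} } ) {
        push @{ $loaded{ $rec->{parent} }{children} }, $feature;
    }
    else {
        push @orphans, $feature;    # parent not seen yet
    }
}

# Second pass, once everything is loaded: retry each orphan.
my @remaining;
foreach my $orphan (@orphans) {
    my $parent = $loaded{ $orphan->{parent} };
    if ($parent) {
        push @{ $parent->{children} }, $orphan;
    }
    else {
        push @remaining, $orphan;    # still no parent; a true orphan
    }
}
@orphans = @remaining;

printf "%d top feature(s), %d orphan(s) left\n", scalar @top, scalar @orphans;
```

Here exon1 is reunited with mRNA1 in the second pass, while exon2 remains an orphan because mRNA9 is never defined.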

Embedded Fasta sequences are ignored.

The SeqFeature objects that are returned are Bio::ToolBox::SeqFeature objects (default SeqFeature class). Refer to that documentation for more information.

METHODS

Initialize and modify the parser.

In most cases, users should initialize an object using the generic Bio::ToolBox::Parser object.

These methods initialize the parser with an annotation file and modify the parsing behavior. Most parameters can be set either upon initialization or by calling the corresponding method on the object. Unpredictable behavior may occur if you change these in the midst of parsing a file.

Do not open subsequent files with the same object. Always create a new object to parse a new file.

new
my $parser = Bio::ToolBox::Parser::gff->new($filename);
my $parser = Bio::ToolBox::Parser::gff->new(
    file    => 'file.gtf.gz',
    do_gene => 1,
    do_utr  => 1,
);

Initialize a new gff parser object. Pass a single value (a GFF file name) to automatically open, but not yet parse, a file. Alternatively, pass a list of key/value pairs to control how the file is parsed. Options include the following.

file

Provide a GFF file name to be parsed. It should have a gff, gtf, or gff3 file extension. The file may be gzip compressed.

do_gene

Pass a boolean (1 or 0) value to combine multiple transcripts with the same gene name under a single gene object. Default is true.

do_cds
do_exon
do_utr
do_codon

Pass a boolean (1 or 0) value to parse certain subfeatures. Exon subfeatures are always parsed, but CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features may be optionally parsed. Default is false.

class

Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature, which is lighter-weight and consumes less memory. A suitable BioPerl alternative is Bio::SeqFeature::Lite.

simplify

Pass a boolean true value to simplify the attributes of GFF3 and GTF files that may have considerable numbers of tags, e.g. Ensembl files. Only essential information, including name, ID, and parentage, is retained. Useful if you're trying to quickly parse annotation files for basic information.

Access methods

See Bio::ToolBox::Parser for generic methods for accessing the features. Below are specific methods to this subclass.

open_file

Opens a new annotation file in a new object. Normally, this is automatically done when a Parser object is instantiated with the new method and a file path was provided. Do not attempt to open a subsequent file with a pre-existing Parser object; it will fail.

Pass the path to a GFF annotation file. It may be compressed with gzip. Success returns 1.

next_feature

This will parse the opened file one line at a time, returning the SeqFeature object for each line of the GFF file. Parent->child relationships are not built. This should only be used for simple files or when Parent->child relationships are not needed.
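As a rough illustration of line-by-line use, the sketch below tallies feature types without building any hierarchy. The subroutine count_primary_tags is a hypothetical helper, not part of this module. It assumes that next_feature returns a false value at the end of the file and that each returned object provides primary_tag, as Bio::SeqFeatureI compliant classes do.

```perl
use strict;
use warnings;

# Hypothetical helper: stream features from any parser object providing
# next_feature() and count how often each primary tag occurs.
sub count_primary_tags {
    my ($parser) = @_;
    my %count;
    while ( my $feature = $parser->next_feature ) {
        $count{ $feature->primary_tag }++;
    }
    return \%count;
}

# Possible usage, assuming Bio::ToolBox is installed and file.gff3 exists:
#   use Bio::ToolBox::Parser::gff;
#   my $parser = Bio::ToolBox::Parser::gff->new( file => 'file.gff3' );
#   my $counts = count_primary_tags($parser);
#   printf "%s\t%d\n", $_, $counts->{$_} foreach sort keys %{$counts};
```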

parse_file
parse_table

This will parse the opened file entirely into memory, parsing each feature into SeqFeature objects and assembling them into parent->child features. NOTE that for vertebrate genome annotation, this may consume a considerable amount of memory and take a while.

The check_orphanage method is automatically run upon parsing the entire file.

orphans
my @orphans = $parser->orphans;
printf "we have %d orphans left over!\n", scalar @orphans;

This method will return an array of orphan SeqFeature objects that indicated they had a parent but said parent could not be found. Typically, this is an indication of an incomplete or malformed GFF3 file. Nevertheless, it might be a good idea to check this after retrieving all top features.
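One way to act on leftover orphans is to list them so the source file can be fixed. The sketch below uses a hypothetical helper, describe_orphans, which is not part of this module. It relies only on the orphans method described here and on the standard primary_tag, seq_id, start, and end accessors of the returned SeqFeature objects.

```perl
use strict;
use warnings;

# Hypothetical helper: return one descriptive line per orphan feature
# from any parser object that provides orphans().
sub describe_orphans {
    my ($parser) = @_;
    return map {
        sprintf "%s %s:%d-%d", $_->primary_tag, $_->seq_id, $_->start, $_->end
    } $parser->orphans;
}

# Possible usage, assuming $parser has already parsed its file:
#   my @lines = describe_orphans($parser);
#   warn "orphan feature: $_\n" foreach @lines;
```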

check_orphanage

This method goes through the list of orphan subfeatures (if any) and attempts to reunite them with their indicated Parent object if it has been loaded into memory. This is automatically run when parse_file is called and all features have been loaded.

typelist

Returns a comma-delimited string of the GFF primary tags (column 3) observed in the first 1000 lines of an opened file. Useful for checking what is in the GFF file. See "sample_gff_type_list" in Bio::ToolBox::Data::file.

unescape

This method will unescape special characters in a text string. Certain characters, including ";" and "=", are reserved for GFF3 formatting and must be escaped wherever they appear within a value.
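The GFF3 specification uses URL-style percent encoding for reserved characters, for example %3B for ";" and %3D for "=". The standalone sketch below shows that decoding step; the subroutine unescape_gff3 is a hypothetical illustration and is not claimed to match this method's exact implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each %XX hexadecimal escape with the
# character it encodes, leaving all other text untouched.
sub unescape_gff3 {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $text;
}

print unescape_gff3('kinase%3B putative%2C partial'), "\n";
```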

SEE ALSO

Bio::ToolBox::Parser, Bio::ToolBox::SeqFeature, Bio::Tools::GFF

AUTHOR

Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.