NAME
Bio::ToolBox::parser::gff - parse GFF3, GTF, and GFF files
SYNOPSIS
use Bio::ToolBox::parser::gff;
my $filename = 'file.gff3';
my $parser = Bio::ToolBox::parser::gff->new(
    file    => $filename,
    do_gene => 1,
    do_exon => 1,
) or die "unable to open gff file!\n";
while ( my $feature = $parser->next_top_feature() ) {
    # each $feature is a SeqFeature object
    my @children = $feature->get_SeqFeatures();
}
DESCRIPTION
This module parses a GFF file into SeqFeature objects. It natively handles GFF3, GTF, and general GFF files.
For both GFF3 and GTF files, fully nested gene models, typically gene => transcript => (exon, CDS, etc), may be built using the appropriate attribute tags. For GFF3 files, these include ID and Parent tags; for GTF these include gene_id and transcript_id tags.
For GFF3 files, any feature without a Parent tag is assumed to be a parent. Child features referencing a parent feature that has not been loaded are considered orphans. After the file is completely parsed, the parser attempts to re-associate orphans with their missing parents. Any orphans that remain may be collected. Files with orphans are considered poorly formatted or incomplete and should be fixed. Multiple parentage, for example exons shared between different transcripts of the same gene, is fully supported.
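The parent and orphan handling described above can be illustrated with plain hashes. This is only a minimal sketch and not the module's own implementation: the record layout (`id`, `parent`, `children`) is hypothetical, and it defers all linking until every record has been read, which gives the same end result as retrying orphans after the file is parsed.

```perl
use strict;
use warnings;

# Hypothetical, simplified records: only the ID and Parent attributes are kept.
my @records = (
    { id => 'exon1', parent => [ 'mRNA1', 'mRNA2' ] },  # listed before its parents
    { id => 'gene1', parent => [] },                    # no Parent tag: a top feature
    { id => 'mRNA1', parent => ['gene1'] },
    { id => 'mRNA2', parent => ['gene1'] },
    { id => 'exon9', parent => ['mRNA9'] },             # parent never appears
);

# First pass: index every record by ID and separate top features from children.
my ( %by_id, @top, @pending );
for my $rec (@records) {
    $rec->{children} = [];
    $by_id{ $rec->{id} } = $rec;
    if   ( @{ $rec->{parent} } ) { push @pending, $rec }
    else                         { push @top,     $rec }
}

# Second pass: attach each child to every parent it names.
# A child with no parent found at all is an orphan.
my @orphans;
for my $rec (@pending) {
    my $found = 0;
    for my $parent_id ( @{ $rec->{parent} } ) {
        next unless exists $by_id{$parent_id};
        push @{ $by_id{$parent_id}{children} }, $rec;
        $found++;
    }
    push @orphans, $rec unless $found;
}

printf "%d top feature(s), %d orphan(s)\n", scalar @top, scalar @orphans;
```

Here the shared `exon1` record ends up in the child list of both transcripts, while `exon9` is left over as an orphan.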
Embedded Fasta sequences are ignored, as are most comment and pragma lines.
The SeqFeature objects that are returned are Bio::ToolBox::SeqFeature objects. Refer to that documentation for more information.
METHODS
Initialize and modify the parser.
These methods initialize the parser with an annotation file and modify the parsing behavior. Most parameters can be set either upon initialization or afterwards by calling the corresponding method on the object. Unpredictable behavior may occur if you change these in the midst of parsing a file.
Do not open subsequent files with the same object. Always create a new object to parse a new file.
- new
-
my $parser = Bio::ToolBox::parser::gff->new($filename);
my $parser = Bio::ToolBox::parser::gff->new(
    file    => 'file.gtf.gz',
    do_gene => 1,
    do_utr  => 1,
);
Initialize a new gff parser object. Pass a single value (a GFF file name) to open a file. Alternatively, pass an array of key value pairs to control how the file is parsed. Options include the following.
- file
-
Provide a GFF file name to be parsed. It should have a gff, gtf, or gff3 file extension. The file may be gzip compressed.
- version
-
Specify the version. Normally this is not needed, as version can be determined either from the file extension (in the case of gtf and gff3) or from the ##gff-version pragma at the top of the file. Acceptable values include 1, 2, 2.5 (gtf), or 3.
- class
-
Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature.
- simplify
-
Pass a boolean value to simplify the SeqFeature objects parsed from the GFF file and ignore extraneous attributes.
- do_gene
-
Pass a boolean (1 or 0) value to combine multiple transcripts with the same gene name under a single gene object. Default is true.
- do_cds
- do_exon
- do_utr
- do_codon
-
Pass a boolean (1 or 0) value to parse certain subfeatures. Exon subfeatures are always parsed, but CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features may be optionally parsed. Default is false.
- open_file
-
$parser->open_file($file) or die "unable to open $file!";
Pass the name of a GFF file to be parsed. The file may optionally be gzipped (.gz extension). Do not open a new file when the object has already opened one. Create a new object for a new file, or concatenate the GFF files.
- version
-
Set or get the GFF version of the current file. Acceptable values include 1, 2, 2.5 (gtf), or 3. Normally this is determined by file extension or gff-version pragma on the first line, and should not need to be set by the user in most circumstances.
- simplify
-
Pass a boolean true value to simplify the attributes of GFF3 and GTF files that may have considerable numbers of tags, e.g. Ensembl files. Only essential information, including name, ID, and parentage, is retained. Useful if you're trying to quickly parse annotation files for basic information.
Feature retrieval
The following methods parse the GFF file lines into SeqFeature objects. It is best if these methods are not mixed; unexpected results may occur.
- next_top_feature
-
This method will return a top level parent SeqFeature object assembled with child features as sub-features. For example, a gene object with mRNA subfeatures, which in turn may have exon and/or CDS subfeatures. Child features are assembled based on the existence of proper Parent attributes in child features. If no Parent attributes are included in the GFF file, then this will behave as "next_feature".
Child features (those containing a Parent attribute) are associated with the parent feature. A warning will be issued about lost children (orphans). Shared subfeatures, for example exons common to multiple transcripts, are associated properly with each parent. An opportunity to rescue orphans is available using the "orphans" method.
Note that subfeatures may not necessarily be in ascending genomic order when associated with the feature, depending on their order in the GFF3 file and whether shared subfeatures are present or not. When calling subfeatures in your program, you may want to sort the subfeatures. For example:
my @subfeatures =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [ $_, $_->start ] }
    $parent->get_SeqFeatures;
- top_features
-
This method will return an array of the top (parent) features defined in the GFF file. This is similar to the "next_top_feature" method except that all features are returned at once.
- next_feature
-
This method will return a SeqFeature object representation of the next feature (line) in the file. Parent - child relationships are NOT assembled; however, undefined parents in a GTF file may still be generated, just not returned.
This method is best used with simple GFF files with no hierarchies present. This may be used in a while loop until the end of the file is reached. Pragmas are ignored and comment lines and sequence are automatically skipped.
Other methods
Additional methods for working with the parser object and the parsed SeqFeature objects.
- fh
-
This method returns the IO::File object of the opened GFF file.
- parse_file
-
Parses the entire file into memory. This is automatically called when either "top_features" or "next_top_feature" is called.
- find_gene
-
my $gene = $parser->find_gene(
    name => $display_name,
    id   => $primary_id,
) or warn "gene $display_name can not be found!";
Pass a gene name, or an array of key => values (name, display_name, ID, primary_ID, and/or coordinate information), that can be used to find a gene already loaded into memory. Only useful after "parse_file" is called. Genes with a matching name are confirmed by a matching ID or overlapping coordinates, if available. Otherwise the first match is returned.
- orphans
-
my @orphans = $parser->orphans;
printf "we have %d orphans left over!", scalar @orphans;
This method will return an array of orphan SeqFeature objects that indicated they had a parent but said parent could not be found. Typically, this is an indication of an incomplete or malformed GFF3 file. Nevertheless, it might be a good idea to check this after retrieving all top features.
- comments
-
This method will return an array of the comment or pragma lines that may have been in the parsed file. These may or may not be useful.
- typelist
-
Returns a comma-delimited string of the GFF primary tags (column 3) observed in the first 1000 lines of an opened file. Useful for checking what is in the GFF file. See "sample_gff_type_list" in Bio::ToolBox::Data::file.
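The idea behind this summary can be sketched in a few lines of plain Perl. This is not the module's code: the sample lines are made up, and the choice to report types in first-seen order is an assumption.

```perl
use strict;
use warnings;

# Made-up GFF lines standing in for the start of a file.
my @gff_lines = (
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=t1",
    "chr1\tsrc\texon\t600\t900\t.\t+\t.\tParent=t1",
);

my $limit = 1000;    # inspect at most this many lines
my ( %seen, @types );
my $count = 0;
for my $line (@gff_lines) {
    last if ++$count > $limit;
    next if $line =~ /^#/;                  # skip comments and pragmas
    my $type = ( split /\t/, $line )[2];    # column 3 is the primary tag
    next unless defined $type;
    push @types, $type unless $seen{$type}++;
}

my $type_list = join ',', @types;
print "$type_list\n";
```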
- from_gff_string
-
my $seqfeature = $parser->from_gff_string($string);
This method will parse a single GFF, GTF, or GFF3 formatted string or line of text and return a SeqFeature object.
- unescape
-
This method will unescape special characters in a text string. Certain characters, including ";" and "=", are reserved for GFF3 formatting and must therefore be escaped when they occur within a value.
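GFF3 escapes reserved characters as a percent sign followed by two hexadecimal digits, so ";" becomes %3B and "=" becomes %3D. The following is a minimal sketch of that decoding; `unescape_gff3` is a hypothetical helper written for illustration and is not the module's method, and the attribute string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: decode %XX escapes into the characters they stand for.
sub unescape_gff3 {
    my $text = shift;
    $text =~ s/%([0-9A-Fa-f]{2})/chr( hex($1) )/ge;
    return $text;
}

# A made-up attribute column whose Note value contains an escaped ";" and "=".
my $column9 = 'ID=gene1;Note=kinase%3B pH%3D7';

# Split on the reserved characters first, then unescape each value.
my %attr;
for my $pair ( split /;/, $column9 ) {
    my ( $key, $value ) = split /=/, $pair, 2;
    $attr{$key} = unescape_gff3($value);
}

print "$attr{Note}\n";
```

Unescaping only after the column has been split is what keeps an escaped ";" or "=" from being mistaken for a delimiter.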
- seq_ids
-
Returns an array or array reference of the names of the chromosomes or reference sequences present in the file. These may be defined by GFF3 sequence-region pragmas or inferred from the features.
- seq_id_lengths
-
my $seq2len = $parser->seq_id_lengths;
foreach ( keys %$seq2len ) {
    printf "chromosome %s is %d bp long\n", $_, $seq2len->{$_};
}
Returns a hash reference to the chromosomes or reference sequences and their corresponding lengths. In this case, the length is either defined by the sequence-region pragma or inferred by the greatest end position of the top features.
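The two sources of length information can be sketched as follows. This is an illustration rather than the module's code: the input lines are made up, and letting a declared sequence-region length take precedence over an inferred one is an assumption.

```perl
use strict;
use warnings;

# Made-up input: one sequence-region pragma and three top-level features.
my @gff_lines = (
    "##sequence-region chr1 1 5000",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr2\tsrc\tgene\t50\t1200\t.\t-\t.\tID=g2",
    "chr2\tsrc\tgene\t2000\t3500\t.\t+\t.\tID=g3",
);

my ( %declared, %observed );
for my $line (@gff_lines) {
    # Pragma form: ##sequence-region seqid start end
    if ( $line =~ /^##sequence-region\s+(\S+)\s+(\d+)\s+(\d+)/ ) {
        $declared{$1} = $3;
        next;
    }
    next if $line =~ /^#/;    # other comments and pragmas

    # Columns 1 and 5 hold the sequence ID and the feature end position.
    my ( $seq_id, $end ) = ( split /\t/, $line )[ 0, 4 ];
    $observed{$seq_id} = $end
        if !exists $observed{$seq_id} || $end > $observed{$seq_id};
}

# Assumption: a declared length overrides an inferred one.
my %length = ( %observed, %declared );

printf "%s\t%d\n", $_, $length{$_} for sort keys %length;
```

An inferred length is only a lower bound on the true sequence length, since nothing guarantees that a feature reaches the end of the chromosome.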
SEE ALSO
Bio::ToolBox::SeqFeature, Bio::ToolBox::parser::ucsc, Bio::Tools::GFF
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.