
NAME

Bio::ToolBox::Parser - generic parsing tool for GFF, UCSC, BED

SYNOPSIS

  # obtain an annotation file
  use Bio::ToolBox::Parser;
  my $filename = shift @ARGV; # could be any annotation format
  
  # open in parser
  my $parser = Bio::ToolBox::Parser->new(
        file    => $filename,
  ) or die "unable to open $filename!\n";
  # file is tasted and appropriate parser automatically selected
  # returns parser object if recognized
  # could be one of Bio::ToolBox::Parser::bed, 
  # Bio::ToolBox::Parser::gff, or Bio::ToolBox::Parser::ucsc
  
  # do something with parser
  while (my $feature = $parser->next_top_feature() ) {
        # each $feature is a SeqFeature object
        printf "%s:%d-%d\n", $feature->seq_id, $feature->start, $feature->end;
        my @children = $feature->get_SeqFeatures();
  }

DESCRIPTION

This module is a generic wrapper around the three main annotation file parsers. It will taste test the file and choose the appropriate parser and open it automatically. These parsers include the following.

Bio::ToolBox::Parser::bed

Parses most Bed file formats, including 3-12 column Bed formats, and some specific Encode formats, including narrowPeak, broadPeak, and gappedPeak.

Bio::ToolBox::Parser::gff

Parses any GFF flavor, including GTF and GFF3.

Bio::ToolBox::Parser::ucsc

Parses some of the common UCSC annotation table formats, including refFlat, genePred, genePredExt, and knownGene. Support for some additional UCSC metadata tables is available.

Files are parsed entirely into memory, assembling gene components (transcripts, exon, CDS, UTR, etc) into hierarchical, top-level SeqFeature objects as appropriate. These SeqFeature objects can then be iterated through in a loop, acting on each one as appropriate. The default SeqFeature class is Bio::ToolBox::SeqFeature, an efficient Bio::SeqFeatureI compliant object class.

METHODS

Each parser subclass has its own documentation, but for the most part they all behave similarly and share the same methods.

Initiate new parser

new

Initiate a new parser object. Since this is a wrapper around a specific parser subclass, it is best used when the user doesn't know a priori which class to invoke. In other words, if you have a file but don't know what to open it with, use the generic Parser and let it pick for you.

    my $file; # obtained from the user, unknown format
    my $parser = Bio::ToolBox::Parser->new($file);
    

Pass either a single value being the name of a file, or a series of key value pairs to inform how to parse the file. The following parameters are allowed:

file

Provide the file name to be parsed. The file may be gzip compressed. It will be automatically tasted to determine the file format. See "taste_file" in Bio::ToolBox::Data::file.

flavor
filetype

If the file has already been tasted using "taste_file" in Bio::ToolBox::Data::file, then pass the flavor and filetype values to the constructor. This bypasses the need to re-taste the file a second time.

do_gene

Pass a boolean (1 or 0) value to combine multiple transcripts with the same gene name under a single gene object. Default is true for those parsers expecting gene annotation (GFF and UCSC).

do_cds
do_exon
do_utr
do_codon

Pass a boolean (1 or 0) value to parse certain subfeatures. Exon subfeatures are always parsed, but CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features may be optionally parsed. Default is false.

source

Provide a string value to be used as the source value when constructing SeqFeature objects that don't have an inherent source value, namely BED and UCSC.

simplify

Pass a boolean value to simplify the SeqFeature objects parsed from the GFF file and ignore extraneous attributes. Ignored for other parsers.

refseqsum
refseqstat
kgxref
ensembltogene
ensemblsource

Pass the appropriate supplementary file names for UCSC-formatted files. Ignored by other parser subclasses.

class

Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature, which is lighter-weight and consumes less memory. A suitable BioPerl alternative is Bio::SeqFeature::Lite.
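
As a rough sketch of the key-value form, the options above might be collected and passed to the constructor as shown here. The helper names coding_gene_options and open_coding_parser are hypothetical and exist only for illustration; only the parameter names come from the list above. The module is loaded with require inside the helper so that the option-building part can be exercised on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: collect constructor options for parsing
# coding genes. Only the keys are taken from the parameter list.
sub coding_gene_options {
    my ($file) = @_;
    return (
        file    => $file,
        do_gene => 1,    # group transcripts under gene objects
        do_cds  => 1,    # also parse CDS subfeatures
        do_utr  => 1,    # also parse UTR subfeatures
        class   => 'Bio::ToolBox::SeqFeature',    # the documented default
    );
}

# Hypothetical helper: open a parser with those options.
sub open_coding_parser {
    my ($file) = @_;
    require Bio::ToolBox::Parser;    # loaded lazily, only when needed
    my $parser = Bio::ToolBox::Parser->new( coding_gene_options($file) )
        or die "unable to open $file!\n";
    return $parser;
}
```

Options that a given parser subclass does not use, such as do_gene for a BED file, are expected to be ignored as described above.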

Modifying parser behavior

These methods can be used to get or set values that modify the parser behavior. They are Boolean methods: each one sets and returns either 1 or 0. Not every subclass uses every method.

do_gene
do_exon
do_cds
do_utr
do_codon
do_name
do_share
simplify
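
A minimal sketch of using these accessors, assuming they are called on an existing parser object before the file is parsed. Whether a given subclass honors each flag is not guaranteed, as noted above. The helper name enable_coding_subfeatures is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: switch on CDS, UTR, and codon parsing and
# report the value each accessor returns afterwards.
sub enable_coding_subfeatures {
    my ($parser) = @_;
    my %state;
    for my $flag (qw(do_cds do_utr do_codon)) {
        # each accessor sets the value and returns 1 or 0
        $state{$flag} = $parser->$flag(1);
    }
    return \%state;
}
```

Calling an accessor with no argument should simply return the current value, for example $parser->do_cds.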

General Parser functions

These are general methods about the parser or the file being parsed.

file

The filename of the file being parsed.

fh

The IO::File file object handle.

filetype

Returns a string representing the file format being parsed, as determined by tasting the file. Values include, but are not limited to, gff3, gtf, gff, bed6, bed12, bedgraph, narrowPeak, broadPeak, gappedPeak, genePred, refFlat, and knownGene.

number_loaded

Returns the number of top features parsed and loaded into memory. Does not include subfeatures.

comments

Returns an array of the comment lines in the parsed file.

seq_ids

Returns an array or array reference of the sequence or chromosome names observed in the parsed file.

seq_id_lengths

Returns a HASH reference of sequence identifiers (keys) and the observed sequence length (values). In most cases and file formats, the length is merely the last observed position of a feature on that chromosome, and should not be taken as absolute truth. Some GFF3 files do include sequence information, and in such cases, could be used as absolute truth values.
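
The general methods above can be combined into a short report, sketched below. This assumes the file has already been parsed (for example with parse_file) so that the counts and lengths are populated. The helper name parser_summary is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build a short text report about a parser.
sub parser_summary {
    my ($parser) = @_;
    my @lines = (
        sprintf( "format: %s",       $parser->filetype ),
        sprintf( "top features: %d", $parser->number_loaded ),
    );

    # seq_id_lengths returns a hash reference of name => observed length
    my $lengths = $parser->seq_id_lengths;
    for my $seq_id ( sort keys %{$lengths} ) {
        push @lines, sprintf( "%s\t%d", $seq_id, $lengths->{$seq_id} );
    }
    return join( "\n", @lines ) . "\n";
}
```

Keep in mind that the reported lengths are usually only the last observed feature position on each sequence, not the true chromosome length.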

Feature retrieval

The following methods parse the lines of the annotation file into SeqFeature objects. It is best not to mix these methods; unexpected results may occur.

parse_file

Parses the entire file into memory. This is automatically called when either "top_features" or "next_top_feature" is called.

next_top_feature

This method will return a top level parent SeqFeature object assembled with child features as sub-features. For example, a gene object with mRNA subfeatures, which in turn may have exon and/or CDS subfeatures. Child features are assembled based on the existence of proper Parent attributes in child features. If no Parent attributes are included in the GFF file, then this will behave as "next_feature".

Child features (those containing a Parent attribute) are associated with the parent feature. A warning will be issued about lost children (orphans). Shared subfeatures, for example exons common to multiple transcripts, are associated properly with each parent. An opportunity to rescue orphans is available using the "orphans" method.

Note that subfeatures may not necessarily be in ascending genomic order when associated with the feature, depending on their order in the GFF3 file and whether shared subfeatures are present or not. When calling subfeatures in your program, you may want to sort the subfeatures. For example

  my @subfeatures = map { $_->[0] }
                    sort { $a->[1] <=> $b->[1] }
                    map { [$_, $_->start] }
                    $parent->get_SeqFeatures;

top_features

This method will return an array of the top (parent) features defined in the GFF file. This is similar to the "next_top_feature" method except that all features are returned at once.
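
As an illustration of walking the assembled hierarchy, the sketch below prints each top feature and its subfeatures as an indented outline, sorting children by start position at every level. The helper name feature_outline is hypothetical. It relies on primary_tag, which is part of the Bio::SeqFeatureI interface rather than something documented on this page.

```perl
use strict;
use warnings;

# Hypothetical helper: return indented lines describing a feature
# and all of its subfeatures, with children sorted by start position.
sub feature_outline {
    my ( $feature, $depth ) = @_;
    $depth //= 0;
    my @lines = sprintf( "%s%s %s:%d-%d",
        '  ' x $depth, $feature->primary_tag,
        $feature->seq_id, $feature->start, $feature->end );
    my @children = sort { $a->start <=> $b->start } $feature->get_SeqFeatures;
    push @lines, feature_outline( $_, $depth + 1 ) for @children;
    return @lines;
}

# Usage, assuming $parser is an open parser object:
#   print "$_\n" for map { feature_outline($_) } $parser->top_features;
```

Because top_features returns every feature at once, this form suits files that fit comfortably in memory, which is already a requirement of these parsers.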

next_feature

This method will return a SeqFeature object representation of the next feature (line) in the file. Parent - child relationships are NOT assembled; however, undefined parents in a GTF file may still be generated, just not returned.

This method is best used with simple annotation files where no hierarchies are expected, such as BED files. It may be called in a while loop until the end of the file is reached.
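
A minimal sketch of such a loop, tallying features per sequence ID. It assumes next_feature returns a false value once the file is exhausted, which is what the while-loop usage described above implies. The helper name count_by_seq_id is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: tally features per sequence ID by reading
# one feature at a time until next_feature returns nothing.
sub count_by_seq_id {
    my ($parser) = @_;
    my %count;
    while ( my $feature = $parser->next_feature ) {
        $count{ $feature->seq_id }++;
    }
    return \%count;
}
```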

fetch

  my $gene = $parser->fetch($primary_id) or 
     warn "gene $primary_id can not be found!";

Fetch a loaded top feature from memory using the primary_id tag, which should be unique. Returns the SeqFeature object, or undef if not present. Only useful after "parse_file" is called.
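
When several identifiers need to be looked up, the found and missing ones can be separated in a single pass, as sketched below. The helper name fetch_many is hypothetical, and it assumes the parser has already loaded the file.

```perl
use strict;
use warnings;

# Hypothetical helper: look up several primary IDs on a parser that
# has already loaded its file. Returns a hash reference of found
# features and an array reference of the IDs that were not found.
sub fetch_many {
    my ( $parser, @ids ) = @_;
    my ( %found, @missing );
    for my $id (@ids) {
        my $feature = $parser->fetch($id);
        if ( defined $feature ) { $found{$id} = $feature }
        else                    { push @missing, $id }
    }
    return ( \%found, \@missing );
}

# Usage, assuming $parser is an open parser object:
#   $parser->parse_file;
#   my ( $found, $missing ) = fetch_many( $parser, @ids );
#   warn "not found: @{$missing}\n" if @{$missing};
```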

SEE ALSO

Bio::ToolBox::Parser::gff, Bio::ToolBox::Parser::ucsc, Bio::ToolBox::Parser::bed, Bio::ToolBox::SeqFeature

AUTHOR

 Timothy J. Parnell, PhD
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.