NAME

Bio::ToolBox::parser::bed - Parser for BED-style formats

SYNOPSIS

 use Bio::ToolBox::parser::bed;
 
 ### Quick-n-easy bed parser
 my $bed = Bio::ToolBox::parser::bed->new('file.bed');
 
 ### Full-powered bed parser, mostly for bed12 functionality
 my $bed = Bio::ToolBox::parser::bed->new(
       file      => 'regions.bed',
       do_exon   => 1,
       do_cds    => 1,
       do_codon  => 1,
 );
 
 # what type of bed file is being parsed, determined when opening file
 my $type = $bed->version; # returns narrowPeak, bedGraph, bed12, bed6, etc
 
 # Retrieve one feature or line at a time
 my $feature = $bed->next_feature;

 # Retrieve array of all features
 my @genes = $bed->top_features;
 
 # each returned feature is a SeqFeature object
 while (my $f = $bed->next_top_feature) {
 	printf "%s:%d-%d\n", $f->seq_id, $f->start, $f->end;
 }

DESCRIPTION

This is a parser for converting BED-style and related formats into SeqFeature objects. File formats include the following.

Bed

Bed files may have 3-12 columns, where the first 3-6 columns are basic information about the feature itself, and columns 7-12 are usually for defining subfeatures of a transcript model, including exons, UTRs (thin portions), and CDS (thick portions) subfeatures. This parser will parse these extra fields as appropriate into subfeature SeqFeature objects. Bed files are recognized with the file extension .bed.

BedGraph

BedGraph files are a type of wiggle format in Bed format, where the 4th column is a score instead of a name. BedGraph files are recognized by the file extension .bedgraph or .bdg.

narrowPeak

narrowPeak files are a specialized Encode variant of bed files with 10 columns (typically denoted as bed6+4), where the extra 4 fields represent score attributes of a narrow ChIP-Seq peak. These files are parsed as a typical bed6 file, and the extra four fields are assigned to SeqFeature attribute tags signalValue, pValue, qValue, and peak, respectively. NarrowPeak files are recognized by the file extension .narrowPeak.

broadPeak

broadPeak files, like narrowPeak, are an Encode variant with 9 columns (bed6+3) representing a broad or extended interval of ChIP enrichment without a single "peak". The extra three fields are assigned to SeqFeature attribute tags signalValue, pValue, and qValue, respectively. BroadPeak files are recognized by the file extension .broadPeak.

Track and Browser lines are generally ignored, although a track definition line containing a type key will be interpreted if it matches one of the above file types.
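The column-to-attribute mapping described for narrowPeak and broadPeak files can be sketched in plain Perl. This is only an illustration of the layout, not the module's own code: the helper parse_peak_line and the hash it returns are hypothetical, and only the tag names are taken from the descriptions above.

```perl
use strict;
use warnings;

# Attribute tag names for the extra columns, as listed in the text above.
my %extra_tags = (
    narrowPeak => [qw(signalValue pValue qValue peak)],
    broadPeak  => [qw(signalValue pValue qValue)],
);

# Hypothetical helper: split one peak line into bed6 fields plus attributes.
sub parse_peak_line {
    my ( $line, $format ) = @_;
    my $tags = $extra_tags{$format}
        or die "unknown peak format '$format'\n";
    chomp $line;
    my @col      = split /\t/, $line;
    my $expected = 6 + @$tags;
    die "expected $expected columns, found " . scalar(@col) . "\n"
        unless @col == $expected;
    my %feature = (
        seq_id => $col[0],
        start  => $col[1] + 1,    # 0-based bed start to 1-based
        end    => $col[2],
        name   => $col[3],
        score  => $col[4],
        strand => $col[5],
    );

    # assign the remaining columns to their attribute tags, in order
    @{ $feature{attributes} }{@$tags} = @col[ 6 .. $#col ];
    return \%feature;
}

my $peak = parse_peak_line(
    join( "\t", qw(chr1 999 1500 peak1 850 . 12.5 8.2 6.1 230) ),
    'narrowPeak' );
printf "%s:%d-%d signalValue=%s peak=%s\n",
    $peak->{seq_id}, $peak->{start}, $peak->{end},
    $peak->{attributes}{signalValue}, $peak->{attributes}{peak};
```

A broadPeak line goes through the same helper with 'broadPeak' as the second argument and simply lacks the peak tag.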

SeqFeature default values

The SeqFeature objects built from the bed file intervals will have some inferred defaults.

Coordinate system

SeqFeature objects use the 1-based coordinate system, per the specification of Bio::SeqFeatureI, so the 0-based start coordinates of bed files will always be parsed into 1-based coordinates.

display_name

SeqFeature objects use the name field (4th column in bed files), if present, as the display_name. If no name is provided, the display_name defaults to the primary_id.

primary_id

The primary_id is a concatenation of the sequence ID, start (original 0-based), and stop coordinates, for example 'chr1:0-100'.

primary_tag

Bed files don't have a concept of feature type (they're all the same type), so a default primary_tag of 'region' is set. For bed12 files with transcript models, the transcripts will be set to either 'mRNA' or 'ncRNA', depending on the presence of interpreted CDS start and stop (thick coordinates).

source_tag

Bed files don't have a concept of a source. The basename of the provided file is therefore used to set the source_tag.

attribute tags

Extra columns in the narrowPeak and broadPeak formats are assigned to attribute tags as described above. The rgb values set in bed12 files are also set to an attribute tag.
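To make the defaults above concrete, here is a small standalone sketch that derives them for a single bed line. It does not use or reproduce the module's internals: the helper bed_defaults is hypothetical, the test for a coding transcript (thickEnd greater than thickStart) is an assumption, and the sketch keeps the file extension in the source_tag, which the text above does not specify.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper illustrating the documented defaults for one bed line.
sub bed_defaults {
    my ( $line, $file ) = @_;
    chomp $line;
    my @col = split /\t/, $line;
    my ( $chrom, $start0, $end ) = @col[ 0 .. 2 ];

    # primary_id keeps the original 0-based start, e.g. 'chr1:0-100'
    my $primary_id = sprintf "%s:%d-%d", $chrom, $start0, $end;

    # bed12 thickStart and thickEnd are columns 7 and 8; treating a
    # non-empty thick interval as coding is an assumption of this sketch
    my $tag = 'region';
    if ( @col >= 12 ) {
        $tag = $col[7] > $col[6] ? 'mRNA' : 'ncRNA';
    }

    my $has_name = defined $col[3] && length $col[3];
    return {
        seq_id       => $chrom,
        start        => $start0 + 1,    # 1-based start
        end          => $end,
        primary_id   => $primary_id,
        display_name => $has_name ? $col[3] : $primary_id,
        primary_tag  => $tag,
        source_tag   => basename($file),
    };
}

my $f = bed_defaults( "chr1\t0\t100", '/data/regions.bed' );
printf "%s %s %d-%d %s %s\n",
    @{$f}{qw(primary_id display_name start end primary_tag source_tag)};
```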

METHODS

Initializing the parser object

new

Initiate a new Bed file parser object. Pass a single value (the bed file name) to open the file for parsing. Alternatively, pass an array of key value pairs to control how the table is parsed. These options are primarily for parsing bed12 files with subfeatures. Options include the following.

file

Provide the path and file name for a Bed file. The file may be gzip compressed.

source

Pass a string to be added as the source tag value of the SeqFeature objects. The default value is the basename of the file to be parsed.

do_exon
do_cds
do_utr
do_codon

Pass a boolean (1 or 0) value to parse certain subfeatures, including exon, CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features. Default is false.

class

Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature.
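As background for the subfeature options, the exon intervals of a bed12 line come from the blockCount, blockSizes, and blockStarts columns (columns 10 to 12 in the UCSC bed format). The following standalone sketch shows that arithmetic with 1-based output coordinates. The helper bed12_exons is hypothetical and is not how the parser itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: convert bed12 block columns into 1-based exon intervals.
sub bed12_exons {
    my ($line) = @_;
    chomp $line;
    my @col = split /\t/, $line;
    die "not a bed12 line\n" unless @col == 12;
    my ( $chrom_start, $count ) = @col[ 1, 9 ];

    # split drops the empty field left by an optional trailing comma
    my @sizes  = split /,/, $col[10];
    my @starts = split /,/, $col[11];
    die "block count mismatch\n"
        unless @sizes >= $count and @starts >= $count;

    my @exons;
    for my $i ( 0 .. $count - 1 ) {
        my $start0 = $chrom_start + $starts[$i];    # 0-based block start
        push @exons, [ $start0 + 1, $start0 + $sizes[$i] ];
    }
    return @exons;
}

my @exons = bed12_exons(
    join "\t", qw(chr1 1000 2000 tx1 0 + 1200 1800 0 2 300,400 0,600) );
printf "exon %d-%d\n", @$_ for @exons;
```

CDS and UTR intervals would follow by intersecting these exons with the thick coordinates (columns 7 and 8), which this sketch leaves out.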

Modify the parser object

These methods set or retrieve parameters that modify parser functionality.

source
do_exon
do_cds
do_utr
do_codon

These methods retrieve or set parameters for the parsing engine, the same as the corresponding options to the new method.

open_file

Pass the name of a file to parse. This function is called automatically by the "new" method if a filename was passed. This will open the file, check its format, and set the parsers appropriately.

Parser or file attributes

These retrieve attributes for the parser or file.

version

This returns a string representation of the opened bed file format. For standard bed files, it returns 'bed' followed by the number of columns, e.g. bed4 or bed12. For recognized special bed variants, it will return narrowPeak, broadPeak, or bedGraph.

fh

Retrieves the file handle of the current file. This module uses IO::Handle objects. Be careful manipulating file handles of open files!

typelist

Returns a comma-delimited string of the SeqFeature types expected in the file. Currently this returns generic strings, 'mRNA,ncRNA,exon,CDS' for bed12 and 'region' for everything else.

Feature retrieval

The following methods parse the table lines into SeqFeature objects. It is best if methods are not mixed; unexpected results may occur.

For bed12 files, these methods return a transcript model SeqFeature with appropriate subfeatures.

next_feature

This will read the next line of the table, parse it into a feature object, and immediately return it.

next_top_feature

This method will first parse the entire file into memory. It will then return each feature one at a time. Call this method repeatedly using a while loop to get all features.

top_features

This method is similar to "next_top_feature", but instead returns an array of all the top features.

Other methods

Additional methods for working with the parser object and the parsed SeqFeature objects.

parse_file

Parses the entire file into memory without returning any objects.

find_gene
 my $gene = $bed->find_gene(
 	display_name => 'ABC1',
 	primary_id   => 'chr1:123-456',
 );

Pass a feature name, or an array of key => values (name, display_name, ID, primary_ID, and/or coordinate information), that can be used to find a feature already loaded into memory. Only really successful if the entire table is loaded into memory. Features with a matching name are confirmed by a matching ID or overlapping coordinates, if available. Otherwise the first match is returned.

comments

This method will return an array of the comment, track, or browser lines that may have been in the parsed file. These may or may not be useful.

seq_ids

Returns an array or array reference of the names of the chromosomes or reference sequences present in the file. Must parse the entire file before using.

seq_id_lengths

Returns a hash reference to the chromosomes or reference sequences and their corresponding lengths. In this case, the length is inferred by the greatest feature end position. Must parse the entire file before using.
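The length inference described for seq_id_lengths amounts to keeping the greatest end coordinate seen for each sequence. A minimal standalone sketch of that idea follows. The helper infer_seq_lengths is hypothetical and works on raw lines rather than on a parser object.

```perl
use strict;
use warnings;

# Hypothetical helper: infer a length per sequence from raw bed lines.
sub infer_seq_lengths {
    my @lines = @_;
    my %length;
    for my $line (@lines) {

        # skip comment, track, and browser lines
        next if $line =~ /^(?:#|track\b|browser\b)/;
        chomp( my $copy = $line );
        my ( $chrom, undef, $end ) = split /\t/, $copy;
        next unless defined $end;

        # keep the greatest end position seen for this sequence
        $length{$chrom} = $end
            if not exists $length{$chrom} or $end > $length{$chrom};
    }
    return \%length;
}

my $lengths = infer_seq_lengths(
    "track name=example\n", "chr1\t0\t100\n",
    "chr1\t250\t900\n",     "chr2\t10\t50\n"
);
printf "%s\t%d\n", $_, $lengths->{$_} for sort keys %$lengths;
```

Because the value is only the greatest feature end, it is a lower bound on the true sequence length, not the length itself.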

SEE ALSO

Bio::ToolBox::SeqFeature, Bio::ToolBox::parser::gff, Bio::ToolBox::parser::ucsc

AUTHOR

Timothy J. Parnell, PhD
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.