NAME
Bio::ToolBox::parser::bed - Parser for BED-style formats
SYNOPSIS
use Bio::ToolBox::parser::bed;
### Quick-n-easy bed parser
my $bed = Bio::ToolBox::parser::bed->new('file.bed');
### Full-powered bed parser, mostly for bed12 functionality
my $bed = Bio::ToolBox::parser::bed->new(
    file     => 'regions.bed',
    do_exon  => 1,
    do_cds   => 1,
    do_codon => 1,
);
# what type of bed file is being parsed, determined when opening file
my $type = $bed->version; # returns narrowPeak, bedGraph, bed12, bed6, etc
# Retrieve one feature or line at a time
my $feature = $bed->next_feature;
# Retrieve array of all features
my @genes = $bed->top_features;
# Or retrieve top features one at a time; each is a SeqFeature object
while (my $f = $bed->next_top_feature) {
    printf "%s:%d-%d\n", $f->seq_id, $f->start, $f->end;
}
DESCRIPTION
This is a parser for converting BED-style and related formats into SeqFeature objects. File formats include the following.
- Bed
-
Bed files may have 3-12 columns, where the first 3-6 columns are basic information about the feature itself, and columns 7-12 are usually for defining subfeatures of a transcript model, including exons, UTRs (thin portions), and CDS (thick portions) subfeatures. This parser will parse these extra fields as appropriate into subfeature SeqFeature objects. Bed files are recognized with the file extension .bed.
- Bedgraph
-
BedGraph files are a type of wiggle format in Bed format, where the 4th column is a score instead of a name. BedGraph files are recognized by the file extension .bedgraph or .bdg.
- narrowPeak
-
narrowPeak files are a specialized Encode variant of bed files with 10 columns (typically denoted as bed6+4), where the extra 4 fields represent score attributes to a narrow ChIPSeq peak. These files are parsed as a typical bed6 file, and the extra four fields are assigned to SeqFeature attribute tags signalValue, pValue, qValue, and peak, respectively. NarrowPeak files are recognized by the file extension .narrowPeak.
- broadPeak
-
broadPeak files, like narrowPeak, are an Encode variant with 9 columns (bed6+3) representing a broad or extended interval of ChIP enrichment without a single "peak". The extra three fields are assigned to SeqFeature attribute tags signalValue, pValue, and qValue, respectively. BroadPeak files are recognized by the file extension .broadPeak.
Track and Browser lines are generally ignored, although a track definition line containing a type key will be interpreted if it matches one of the above file types.
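As a rough illustration of the narrowPeak layout described above, the following sketch splits one bed6+4 line and assigns the four extra columns to the named attribute tags. The helper parse_narrowpeak_line and its plain hash structure are hypothetical and are not part of this module, which builds SeqFeature objects instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): split one narrowPeak
# line (bed6+4) into bed6 fields plus the four extra attribute tags.
sub parse_narrowpeak_line {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    die "expected 10 columns, found " . scalar(@f) . "\n" unless @f == 10;

    my %feature = (
        seq_id => $f[0],
        start  => $f[1] + 1,    # 0-based bed start becomes 1-based
        end    => $f[2],
        name   => $f[3],
        score  => $f[4],
        strand => $f[5],
    );

    # columns 7-10 map, in order, to the attribute tags named in the text
    my %attributes;
    @attributes{qw(signalValue pValue qValue peak)} = @f[ 6 .. 9 ];
    $feature{attributes} = \%attributes;
    return \%feature;
}

my $peak = parse_narrowpeak_line(
    "chr1\t100\t250\tpeak1\t500\t.\t12.5\t8.2\t6.1\t75\n"
);
printf "%s:%d-%d signalValue=%s peak=%s\n",
    @{$peak}{qw(seq_id start end)},
    $peak->{attributes}{signalValue},
    $peak->{attributes}{peak};
```

A broadPeak line could be handled the same way with 9 columns and only the signalValue, pValue, and qValue tags.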
SeqFeature default values
The SeqFeature objects built from the bed file intervals will have some inferred defaults.
- Coordinate system
-
SeqFeature objects use the 1-based coordinate system, per the specification of Bio::SeqFeatureI, so the 0-based start coordinates of bed files will always be parsed into 1-based coordinates.
- display_name
-
SeqFeature objects will use the name field (4th column in bed files), if present, as the display_name. The SeqFeature object should default to the primary_id if a name was not provided.
- primary_id
-
It will use a concatenation of the sequence ID, start (original 0-based), and stop coordinates as the primary_id, for example 'chr1:0-100'.
- primary_tag
-
Bed files don't have a concept of feature type (they're all the same type), so a default primary_tag of 'region' is set. For bed12 files with transcript models, the transcripts will be set to either 'mRNA' or 'ncRNA', depending on the presence of interpreted CDS start and stop (thick coordinates).
- source_tag
-
Bed files don't have a concept of a source. The basename of the provided file is therefore used to set the source_tag.
-
Extra columns in the narrowPeak and broadPeak formats are assigned to attribute tags as described above. The rgb values set in bed12 files are also set to an attribute tag.
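The defaults listed above can be sketched in plain Perl as follows. The helper bed_defaults and its returned hash are hypothetical, meant only to show how the values relate to one bed line. Whether the real parser keeps the file extension in the source_tag is not stated here, so this sketch assumes it is kept.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper (not part of Bio::ToolBox): derive the inferred
# default values from a single bed line and the name of its file.
sub bed_defaults {
    my ($line, $file) = @_;
    chomp $line;
    my ($chrom, $start0, $end, $name) = split /\t/, $line;

    # primary_id keeps the original 0-based start, as in 'chr1:0-100'
    my $primary_id = sprintf "%s:%d-%d", $chrom, $start0, $end;

    return {
        seq_id       => $chrom,
        start        => $start0 + 1,    # converted to 1-based
        end          => $end,
        primary_id   => $primary_id,
        # fall back to the primary_id when no name column is present
        display_name => ( defined $name && length $name ) ? $name : $primary_id,
        primary_tag  => 'region',
        source_tag   => basename($file),    # assumption: extension kept
    };
}

my $d = bed_defaults( "chr1\t0\t100\n", 'data/regions.bed' );
printf "%s %s:%d-%d id=%s source=%s\n",
    $d->{primary_tag}, $d->{seq_id}, $d->{start}, $d->{end},
    $d->{primary_id}, $d->{source_tag};
```

Note that the interval chr1 0 100 in the file is reported as 1-100 by the feature, while the primary_id still reads chr1:0-100.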
METHODS
Initializing the parser object
- new
-
Initiate a new Bed file parser object. Pass a single value (the bed file name) to open the file for parsing. Alternatively, pass a list of key => value pairs to control how the file is parsed. These options are primarily for parsing bed12 files with subfeatures. Options include the following.
- file
-
Provide the path and file name for a Bed file. The file may be gzip compressed.
- source
-
Pass a string to be added as the source tag value of the SeqFeature objects. The default value is the basename of the file to be parsed.
- do_exon
- do_cds
- do_utr
- do_codon
-
Pass a boolean (1 or 0) value to parse certain subfeatures, including exon, CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features. Default is false.
- class
-
Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature.
Modify the parser object
These methods set or retrieve parameters that modify parser functionality.
- source
- do_exon
- do_cds
- do_utr
- do_codon
-
These methods retrieve or set parameters to the parsing engine, same as the options to the new method.
- open_file
-
Pass the name of a file to parse. This function is called automatically by the "new" method if a filename was passed. This will open the file, check its format, and set the parsers appropriately.
Parser or file attributes
These retrieve attributes for the parser or file.
- version
-
This returns a string representation of the opened bed file format. For standard bed files, it returns 'bed' followed by the number of columns, e.g. bed4 or bed12. For recognized special bed variants, it will return narrowPeak, broadPeak, or bedGraph.
- fh
-
Retrieves the file handle of the current file. This module uses IO::Handle objects. Be careful manipulating file handles of open files!
- typelist
-
Returns a comma-delimited string of the SeqFeature types expected in the file. Currently this returns generic strings, 'mRNA,ncRNA,exon,CDS' for bed12 and 'region' for everything else.
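To make the version strings concrete, here is a minimal sketch of how a format name could be guessed from a file extension and a column count. The function guess_bed_version is hypothetical. Case-insensitive matching and stripping a trailing .gz are assumptions of this sketch, and the real parser may also inspect track lines and file content.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): guess a version string
# from the file name and the number of columns in a data line.
sub guess_bed_version {
    my ($filename, $columns) = @_;
    ( my $name = $filename ) =~ s/\.gz$//i;    # assumption: ignore gzip suffix
    return 'narrowPeak' if $name =~ /\.narrowpeak$/i;
    return 'broadPeak'  if $name =~ /\.broadpeak$/i;
    return 'bedGraph'   if $name =~ /\.(?:bedgraph|bdg)$/i;
    return "bed$columns";    # standard bed: 'bed' plus the column count
}

for my $case ( [ 'genes.bed', 12 ], [ 'signal.bdg.gz', 4 ], [ 'chip.narrowPeak', 10 ] ) {
    printf "%s => %s\n", $case->[0], guess_bed_version(@$case);
}
```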
Feature retrieval
The following methods parse the table lines into SeqFeature objects. It is best not to mix these methods on the same parser object; unexpected results may occur.
For bed12 files, each method returns transcript model SeqFeature objects with appropriate subfeatures.
- next_feature
-
This will read the next line of the table, parse it into a feature object, and immediately return it.
- next_top_feature
-
This method will first parse the entire file into memory. It will then return each feature one at a time. Call this method repeatedly using a while loop to get all features.
- top_features
-
This method is similar to "next_top_feature", but instead returns an array of all the top features.
Other methods
Additional methods for working with the parser object and the parsed SeqFeature objects.
- parse_file
-
Parses the entire file into memory without returning any objects.
- find_gene
-
my $gene = $bed->find_gene(
    display_name => 'ABC1',
    primary_id   => 'chr1:123-456',
);
Pass a feature name, or an array of key => values (name, display_name, ID, primary_ID, and/or coordinate information), that can be used to find a feature already loaded into memory. Only really successful if the entire table is loaded into memory. Features with a matching name are confirmed by a matching ID or overlapping coordinates, if available. Otherwise the first match is returned.
- comments
-
This method will return an array of the comment, track, or browser lines that may have been in the parsed file. These may or may not be useful.
- seq_ids
-
Returns an array or array reference of the names of the chromosomes or reference sequences present in the file. Must parse the entire file before using.
- seq_id_lengths
-
Returns a hash reference to the chromosomes or reference sequences and their corresponding lengths. In this case, the length is inferred by the greatest feature end position. Must parse the entire file before using.
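The length inference described for seq_id_lengths can be illustrated with a short standalone sketch. The function infer_seq_lengths is hypothetical and works on raw bed lines rather than on a parser object. It only shows the idea of taking the greatest feature end per sequence ID.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): infer a length for each
# sequence ID as the greatest feature end position seen in the bed lines.
sub infer_seq_lengths {
    my @lines = @_;    # copy, so chomp below never touches the caller's data
    my %length;
    foreach my $line (@lines) {
        next if $line =~ /^(?:#|track|browser)/i;    # skip non-data lines
        chomp $line;
        my ( $chrom, undef, $end ) = split /\t/, $line;
        next unless defined $end;
        $length{$chrom} = $end
            if !exists $length{$chrom} || $end > $length{$chrom};
    }
    return \%length;
}

my $lengths = infer_seq_lengths(
    "track name=example\n",
    "chr1\t0\t100\n",
    "chr1\t50\t400\n",
    "chr2\t10\t20\n",
);
printf "%s\t%d\n", $_, $lengths->{$_} for sort keys %$lengths;
```

Because the value is only the largest end coordinate observed, it is a lower bound on the true chromosome length, not a replacement for a genome index.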
SEE ALSO
Bio::ToolBox::SeqFeature, Bio::ToolBox::parser::gff, Bio::ToolBox::parser::ucsc
AUTHOR
Timothy J. Parnell, PhD
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.