NAME

Bio::ToolBox::db_helper::big

DESCRIPTION

This module provides support for binary BigWig and BigBed files to the Bio::ToolBox package. It also provides minimal support for a directory of one or more bigWig files as a combined database, known as a BigWigSet.

USAGE

The module requires Bio::DB::Big to be installed, which in turn requires the libBigWig C library. libBigWig is a simpler and easier-to-install library than the UCSC Kent C libraries.

In general, this module should not be used directly. Use the methods available in Bio::ToolBox::Data or Bio::ToolBox::db_helper.

All subroutines are exported by default.

Available subroutines

collect_bigwig_score()

This subroutine will collect a single value from a binary bigWig file. It uses the low-level summary method to collect the statistical information and is therefore significantly faster than the other methods, which rely upon parsing individual data points across the region.

The subroutine is passed a parameter array reference. See below for details.

The subroutine will return either a valid score or a null value.

collect_bigwigset_score()

Similar to collect_bigwig_score() but using a BigWigSet database of BigWig files. Unlike individual BigWig files, BigWigSet features support stranded data collection if a strand attribute is defined in the metadata file.

The subroutine is passed a parameter array reference. See below for details.

collect_bigwig_scores()

This subroutine will collect only the score values from a binary BigWig file for the specified database region. The positional information of the scores is not retained.

The subroutine is passed a parameter array reference. See below for details.

The subroutine returns an array or array reference of the requested dataset values found within the region of interest.

collect_bigwigset_scores()

Similar to collect_bigwig_scores() but using a BigWigSet database of BigWig files. Unlike individual BigWig files, BigWigSet features support stranded data collection if a strand attribute is defined in the metadata file.

The subroutine is passed a parameter array reference. See below for details.

collect_bigwig_position_scores()

This subroutine will collect the score values from a binary BigWig file for the specified database region keyed by position.

The subroutine is passed a parameter array reference. See below for details.

The subroutine returns a hash of the defined dataset values found within the region of interest keyed by position. Note that only one value is returned per position, regardless of the number of dataset features passed. Usually this isn't a problem as only one dataset is examined at a time.
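As a minimal sketch of working with a position-keyed result, the snippet below walks a hash in ascending coordinate order. The hash %pos2score and its values are mock data standing in for a real return value; nothing here calls the module itself.

```perl
use strict;
use warnings;

# Mock data (hypothetical): position => score, as a stand-in for the
# hash described above.
my %pos2score = ( 1050 => 2.5, 1010 => 1.0, 1030 => 4.0 );

# Hash keys are unordered, so sort positions numerically before use.
my @ordered =
    map  { [ $_, $pos2score{$_} ] }
    sort { $a <=> $b } keys %pos2score;

# Print one "position<TAB>score" line per data point.
for my $pair (@ordered) {
    printf "%d\t%s\n", @{$pair};
}
```

Because a hash holds one value per key, a second score at the same position cannot be stored alongside the first, which is the limitation noted above.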

collect_bigwigset_position_scores()

Similar to collect_bigwig_position_scores() but using a BigWigSet database of BigWig files. Unlike individual BigWig files, BigWigSet features support stranded data collection if a strand attribute is defined in the metadata file.

The subroutine is passed a parameter array reference. See below for details.

open_bigwig_db()

This subroutine will open a BigWig database connection. Pass either the local path to a bigWig file (.bw extension) or the URL of a remote bigWig file. It will return the opened database object.

open_bigwigset_db()

This subroutine will open a BigWigSet database connection using a directory of BigWig files and one metadata index file, as described in Bio::DB::BigWigSet. Essentially, this treats a directory of BigWig files as a single database with each BigWig file representing a different feature with unique attributes (type, source, strand, etc).

Pass the subroutine a scalar value representing the local path to the directory. It presumes a feature_type of 'region', as expected by the other Bio::ToolBox::db_helper subroutines and modules. It will return the opened database object.

Data Collection Parameters Reference

The data collection subroutines are passed an array reference of parameters. The recommended method for data collection is to use the get_segment_score() method from Bio::ToolBox::db_helper.

The parameters array reference includes these items:

1. The chromosome or seq_id
2. The start position of the segment to collect
3. The stop or end position of the segment to collect
4. The strand of the segment to collect

Should be standard BioPerl representation: -1, 0, or 1.

5. The strandedness of the data to collect

A scalar value representing the desired strandedness of the data to be collected. Acceptable values include "sense", "antisense", or "all". Only those scores which match the indicated strandedness are collected.

6. The method for combining scores.

Acceptable values include mean, min, and max when collecting a single score over a genomic segment. This uses the built-in statistic zoom levels of the bigWig.

7. A database object.

Pass the opened Bio::DB::BigWigSet database object when working with BigWigSets.

8 and higher. BigWig file names or BigWigSet database types.

Opened BigWig objects are cached. Both local and remote BigWig files are supported.
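The parameter layout listed above can be sketched as a plain Perl array reference. The IDX_* constant names, the build_params() helper, and the file path are hypothetical and exist only for illustration; leaving the database slot undefined for a plain bigWig file is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical index names; item 1 in the list above is index 0, etc.
use constant {
    IDX_CHROMO       => 0,
    IDX_START        => 1,
    IDX_STOP         => 2,
    IDX_STRAND       => 3,
    IDX_STRANDEDNESS => 4,
    IDX_METHOD       => 5,
    IDX_DATABASE     => 6,
    IDX_DATA         => 7,
};

# Assemble the parameter array reference for one genomic segment.
sub build_params {
    my (%arg) = @_;
    my @params;
    $params[IDX_CHROMO]       = $arg{chromo};
    $params[IDX_START]        = $arg{start};
    $params[IDX_STOP]         = $arg{stop};
    $params[IDX_STRAND]       = $arg{strand};          # -1, 0, or 1
    $params[IDX_STRANDEDNESS] = $arg{strandedness};    # sense, antisense, all
    $params[IDX_METHOD]       = $arg{method};          # mean, min, max
    $params[IDX_DATABASE]     = $arg{db};              # BigWigSet object or undef
    push @params, @{ $arg{datasets} || [] };           # items 8 and higher
    return \@params;
}

my $params = build_params(
    chromo       => 'chr1',
    start        => 1000,
    stop         => 2000,
    strand       => 1,
    strandedness => 'all',
    method       => 'mean',
    db           => undef,                      # assumption for a plain bigWig
    datasets     => ['/path/to/example.bw'],    # hypothetical file
);

# With the module installed, the reference would then be passed on, e.g.
#   my $score = collect_bigwig_score($params);
```

In practice, get_segment_score() from Bio::ToolBox::db_helper builds and dispatches this structure, so constructing it by hand is rarely needed.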

SUPPORT MODULES

This includes two additional object-oriented modules for supporting BigWigSets and bigBed SeqFeature iteration.

Bio::ToolBox::db_helper::big::BigWigSet

This provides support for a BigWigSet, which is not natively supported by the Bio::DB::Big adapter, and is based on the concepts from the Bio::DB::BigWigSet adapter. However, it is NOT a drop-in replacement: only a few methods are provided, and only some of these are similar to the original adapter.

This adapter will still read an INI-style metadata.txt file, as described in Bio::DB::BigWigSet, for metadata. Briefly, the file format is similar to the following:

[file1.bw]
name = mydata
type = ChIPSeq

[file2.bw]
name = mydata2
type = ChIPSeq

Each bigWig file in the directory should have a stanza entry with the path and file name in the stanza header in square brackets. Metadata is included as simple key = value pairs, where keys can be typical SeqFeature attributes, including display_name or name, primary_tag or type, and strand.
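To make the stanza layout concrete, here is a minimal sketch of reading such a file into a hash of hashes. The parse_metadata() helper is hypothetical and is not the adapter's own parser; skipping lines that begin with # as comments is an assumption.

```perl
use strict;
use warnings;

# Parse INI-style metadata text into a hash reference:
#   stanza header (file name) => { key => value, ... }
sub parse_metadata {
    my ($text) = @_;
    my %metadata;
    my $current;
    for my $line ( split /\n/, $text ) {
        $line =~ s/^\s+|\s+$//g;                  # trim whitespace
        next if $line eq '' or $line =~ /^#/;     # blanks; comments (assumed)
        if ( $line =~ /^\[(.+)\]$/ ) {
            $current = $1;                        # start of a new stanza
            $metadata{$current} ||= {};
        }
        elsif ( defined $current and $line =~ /^([^=]+?)\s*=\s*(.*)$/ ) {
            $metadata{$current}{$1} = $2;         # key = value pair
        }
    }
    return \%metadata;
}

my $text = <<'END';
[file1.bw]
name = mydata
type = ChIPSeq

[file2.bw]
name = mydata2
type = ChIPSeq
END

my $meta = parse_metadata($text);
print "$_: $meta->{$_}{name}\n" for sort keys %{$meta};
```

The resulting structure has the same shape as the one described for the metadata method: file names as keys, each pointing to a hash of key = value pairs.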

NOTE: Metadata text files are ideal, but not required. If a metadata file is not present, appropriate metadata will be determined from the bigWig file names, using the basename as the metadata name and possibly extracting the strand from the end of the filename, if it ends in a _f or _r.
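The file name fallback can be sketched roughly as follows. The metadata_from_filename() helper is hypothetical; keeping the _f or _r suffix in the name, and reporting strand as 1, -1, or 0, are assumptions made for the example and may differ from what the module does.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Derive minimal metadata from a bigWig file name alone.
sub metadata_from_filename {
    my ($path) = @_;

    # Strip the directory and the .bw extension to get the basename.
    my ($name) = fileparse( $path, qr/\.bw/ );

    # A trailing _f or _r is taken to indicate the strand.
    my $strand = 0;
    if    ( $name =~ /_f$/ ) { $strand = 1 }
    elsif ( $name =~ /_r$/ ) { $strand = -1 }

    return { name => $name, strand => $strand };
}

for my $file (qw(/data/set/sample_f.bw /data/set/sample_r.bw /data/set/sample.bw)) {
    my $md = metadata_from_filename($file);
    print "$md->{name}\t$md->{strand}\n";
}
```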

The following methods are available.

new

Generate a new BigWigSet object. The path of the directory must be passed as an argument. The contents of the directory will be read, bigWig files located, metadata files (if any) read and processed. Fasta sequences are not supported.

bigwig_names

Returns an array or array reference of the bigWig file names. These are just the file names, without the path.

bigwigs

Returns an array or array reference of the full path for the bigWig files. This is identical to the Bio::DB::BigWigSet method.

get_bigwig

Given a bigwig name, this will return an opened bigwig database Bio::DB::Big object. This is identical to the Bio::DB::BigWigSet method.

get_bigwig_path

Given a bigwig name, this will return the full path to the corresponding bigWig file.

metadata

This will return a hash reference pointing to the metadata hash structure, the keys of which are bigWig names, and the values of which are hash references holding the metadata key = value pairs. This is identical to the Bio::DB::BigWigSet method.

filter_bigwigs

Provide one or more names, primary tags, or types to filter the bigWig files in the Set. For the purposes of this simple method, no distinction is made as to whether a filtering criterion is a display_name, primary_tag, or type. The provided text strings will be used to search all the metadata values, and the names of files with exact matches are returned. For purposes of filtering, the following metadata keys are searched in the following order: type, name, display_name, and primary_tag, and the first match is kept.
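The search order described above can be illustrated with a small stand-alone sketch. The filter_by_metadata() helper and the mock metadata are hypothetical, and reading "the first match is kept" as "stop checking further keys for that file" is an assumption; this is not the module's own implementation.

```perl
use strict;
use warnings;

# Return the names of entries where type, name, display_name, or
# primary_tag (checked in that order) exactly equals a query string.
sub filter_by_metadata {
    my ( $metadata, @queries ) = @_;
    my %wanted = map { $_ => 1 } @queries;
    my @hits;
    for my $file ( sort keys %{$metadata} ) {
        for my $key (qw(type name display_name primary_tag)) {
            my $value = $metadata->{$file}{$key};
            next unless defined $value;
            if ( $wanted{$value} ) {
                push @hits, $file;
                last;    # first match is kept; skip the remaining keys
            }
        }
    }
    return @hits;
}

# Mock metadata (hypothetical) in the shape returned by metadata().
my %metadata = (
    'file1.bw' => { name => 'mydata',  type => 'ChIPSeq' },
    'file2.bw' => { name => 'mydata2', type => 'RNASeq' },
);

my @by_type = filter_by_metadata( \%metadata, 'ChIPSeq' );    # ('file1.bw')
my @by_name = filter_by_metadata( \%metadata, 'mydata2' );    # ('file2.bw')
print "@by_type\n@by_name\n";
```

Note that matching is exact, so a query of 'mydata' does not also select 'mydata2'.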

filter_bigwigs_by_strand

Pass first the strand, and then optionally a list of bigWig names, perhaps the results from filter_bigwigs(). If no names were passed, all the names in the BigWigSet will be considered. The names of files whose strand matches the given strand will be returned.

Bio::ToolBox::db_helper::big::BedIteratorWrapper

This is an object wrapper around a bigBed database for retrieving items and returning them as convenient SeqFeature objects. Only the first 3 to 6 standard BED columns are supported: seq_id, start, stop, name, score, and strand.

The following methods are provided.

new

Pass the new method the following items: opened Bio::DB::Big bigBed object, chromosome, start, and end coordinates.

next_seq

This will return the next available feature in the established search interval as a Bio::ToolBox::SeqFeature object. The method name is consistent with other BioPerl-compatible objects and iterators. Yeah, it sucks, and it's not very apropos to the actual function. Oh well.

AUTHOR

Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.