NAME
Bio::DB::USeq - Read USeq archive database files
SYNOPSIS
use Bio::DB::USeq;
my $useq = Bio::DB::USeq->new('file.useq') or
die "unable to open file.useq!\n";
# sequence IDs
my @seq_ids = $useq->seq_ids;
my $length = $useq->length($chr); # approximate, not exact
### Retrieving features
# all features or observations across chromosome I
my @features = $useq->features(
-seq_id => 'chrI',
-type => 'region',
);
# same thing using a simple form
# use an array of (chromosome, start, stop, strand)
my @features = $useq->features('chrI');
# same thing with a memory efficient iterator
my $iterator = $useq->get_seq_stream(
-seq_id => 'chrI',
);
while (my $f = $iterator->next_seq) {
# each feature $f supports most SeqFeatureI methods
}
### Retrieving simple scores
my @scores = $useq->scores(
-seq_id => 'chrI',
-start => 1000,
-end => 2000
);
### Same methods as above after defining an interval first
my $segment = $useq->segment(
-seq_id => 'chrI',
-start => 1000,
-end => 2000,
);
my @scores = $segment->scores;
my @features = $segment->features;
my $iterator = $segment->get_seq_stream;
### Efficient retrieval of positioned scores in 100 bins
# compatible with Bio::Graphics
my ($wig) = $useq->features(
# assuming unstranded data here, otherwise two wiggle objects
# would be returned, one for each strand
-seq_id => 'chrI',
-start => 1000,
-end => 2000,
-type => 'wiggle:100',
);
my @bins = $wig->wiggle;
my @bins = $wig->coverage; # same thing
my ($bins) = $wig->get_tag_values('coverage'); # same thing
### Statistical summaries of intervals
# compatible with Bio::Graphics
my ($summary) = $useq->features(
# assuming unstranded data here, otherwise two summaries
# would be returned, one for each strand
-seq_id => 'chrI',
-start => 1000,
-end => 2000,
-type => 'summary',
);
my $stats = $summary->statistical_summary(10);
foreach (@$stats) {
my $max = $_->{maxVal};
my $mean = Bio::DB::USeq::binMean($_);
}
### Stranded data using an iterator
# could be used with either wiggle or summary features
my $stream = $useq->get_seq_stream(
-seq_id => 'chrI',
-start => 1000,
-end => 2000,
-type => 'wiggle:100',
);
my ($forward, $reverse);
if ($useq->stranded) {
$forward = $stream->next_seq;
$reverse = $stream->next_seq;
}
else {
$forward = $stream->next_seq;
}
DESCRIPTION
Bio::DB::USeq is a BioPerl style adaptor for reading USeq files. USeq files are compressed, indexed data files supporting modern bioinformatic datasets, including genomic points, scores, and intervals. More information about the USeq software package can be found at https://github.com/HuntsmanCancerInstitute/USeq, including a description of the USeq data archive format.
USeq files are typically half the size of corresponding bigBed and bigWig files, due to a compact internal format and lack of internal zoom data. This adaptor, however, can still return statistics across different zoom levels in the same manner as big files, albeit at a cost of calculating these in realtime.
Generating useq files
USeq files may be generated using tools from the USeq package, available at https://github.com/HuntsmanCancerInstitute/USeq. They may be generated from native Bar files, text Wig files, text Bed files, and UCSC bigWig and bigBed file formats.
Compatibility
The adaptor follows most conventions of other BioPerl-style Bio::DB adaptors. Observations or features in the useq file archive are returned as SeqFeatureI compatible objects.
Coordinates consumed and returned by the adaptor are 1-based, consistent with BioPerl convention. This is not true of the useq file itself, which uses the interbase coordinate system.
Unlike wig and bigWig files, useq file archives support stranded data, which can make data collection much simpler for complex experiments.
See below for GBrowse compatibility.
Limitations
This adaptor is read only. USeq files, in general, are not modified or written with data. The exceptions are chromosome or global statistics are written to the archiveReadMe.txt file inside the zip archive to cache for future queries.
No support for genomic sequence is included. Users who need access to genomic sequence should seek an alternative BioPerl adaptor, such as Bio::DB::Fasta.
Useq files do not have a native concept of type, primary_tag, or source attributes, as expected with GFF-based database adaptors. The features method does support special types (see below).
Requirements
The adaptor is a Perl-only implementation. It only requires the Archive::Zip module for opening and reading the useq file archive. BioPerl is required for working with SeqFeatures objects generated from useq file observations.
METHODS
Initializing the Bio::DB::USeq object
These are class methods for creating and working with the Bio::DB::USeq object itself.
- new
- new($file)
- new(-useq => $file)
-
This will create a new Bio::DB::USeq object. Optionally pass the path to a useq file archive to open immediately. There is not much to do unless you open a file.
Named arguments may be used to specify the file. Either -useq or -file may be used.
- open($file)
-
Open a useq file archive. Useq files typically use the .useq file extension. Returns true if successful.
DO NOT open a subsequent useq file archive when one has already been opened. Please create a new object to open a new file.
- clone
-
Force the object to re-open the useq archive file under a new file handle to make it clone safe. Do this in the child process before collecting data.
- zip
-
Return the Archive::Zip object representing the useq file archive. Generally not recommended unless you know what you are doing.
General information about the useq file
These are class methods for obtaining general information or metadata attributes regarding the contents of the file.
- stranded
-
Return true or false (1 or 0) indicating whether the contents of the useq file archive are recorded in a stranded fashion.
- attributes
-
Return an array of available attribute keys that were recorded in the useq file archiveReadMe.txt member. These are key = value pairs representing metadata for the useq file.
- attribute($key)
-
Return the metadata attribute value for the specified key. These are recorded in the useq file archiveReadMe.txt member. Returns undefined if the key does not exist.
- type
-
Return the useq file metadata
dataType
value. - genome
-
Return the useq file metadata
versionedGenome
value. - version
-
Return the useq file metadata
useqArchiveVersion
value.
Chromosome Information
These are methods to obtain information about the chromosomes or reference sequences represented in the file archive.
Note that generating score statistics across one or more chromosomes may be computationally expensive. Therefore, chromosome statistics, once calculated, are cached in the useq file metadata for future reference. This necessitates writing to the useq zip archive. This is currently the only supported method for modifying the useq zip archive.
- seq_ids
-
Return an array of the chromosome or sequence identifiers represented in the useq file archive. The names are sorted ASCIIbetically before they are returned.
- length($seq_id)
-
Return the length of the provided chromosome or sequence identifier. Note that this may not be the actual length in the reference genome, but rather the last recorded position of an observation for that chromosome. Hence, it should be used only as an approximation.
- chr_mean($seq_id)
-
Return the mean score value across the entire chromosome.
- chr_stdev($seq_id)
-
Return the score standard deviation across the entire chromosome.
- chr_stats($seq_id)
-
Return a statistical summary across the entire chromosome. This is an anonymous hash with five keys: validCount, sumData, sumSquares, minVal, and maxVal. These are equivalent to the statistical summaries generated by the Bio::DB::BigWig adaptor.
- global_mean
-
Return the mean score value across all chromosomes.
- global_stdev
-
Return the mean score value across all chromosomes.
- global_stats
-
Return a statistical summary across all chromosomes.
Data accession
These are the primary methods for working with data contained in useq file archive. These should be familiar to most BioPerl users.
- features
-
Returns an array or array reference of SeqFeatureI compatible objects overlapping a given genomic interval.
Coordinates of the interrogated regions must be supplied. At a minimum, the seq_id must be supplied. A start of 1 and an end corresponding to the length of the seq_id is used if not directly provided. Coordinates may be specified in two different manners, either as a list of (seq_id, start, end, strand) or as one or more keyed values.
my @features = $useq->features($seq_id, $start, $end); @features = $useq->features( -seq_id => $seq_id, -start => $start, -end => $end, );
If the -iterator argument is supplied with a true value, then an iterator object is returned. See get_seq_stream() for details.
Bio::DB::USeq supports four different feature types. Feature types may be specified using the -type argument.
- region
- interval
- observation
-
The default feature type if the type argument is not specified. These are SeqFeatureI compatible objects representing observations in the useq file archive. These are compatible with the iterator. SeqFeature observations are returned in genomic order.
- chromosome
-
Returns SeqFeatureI compatible objects representing the reference sequences (chromosomes) listed in the useq file archive. These are not compatibile with the iterator.
- wiggle
- wiggle:$bins
- coverage
- coverage:$bins
-
Returns an array of SeqFeatureI compatible objects for each strand requested. If the useq file contains stranded data, and no strand is requested, then two objects will be returned representing each strand.
Each object contains an array representing scores across the requested coordinates. This object is designed to be backwards compatible with coverage features from the Bio::DB::Sam adaptor for use with Bio::Graphics and GBrowse. Note that only scores are returned, not a true depth coverage in the sense of the Bio::DB::Sam coverage types.
By default, the wiggle or coverage array is provided at 1 bp resolution. To improve efficiency with large regions, the wiggle array may be limited by using a bin option, where the interrogated interval is divided into the number of bins requested.
To retrieve the scores, call the wiggle() or coverage() method.
For example, to request wiggle scores in 100 equal bins across the interval, see the following example. The wiggle and coverage types are synonymous.
my ($wiggle) = $useq->features( -seq_id => $chromosome, -start => $start, -end => $end, -type => 'wiggle:100', ); my @scores = $wiggle->wiggle; @scores = $wiggle->coverage;
Wiggle objects may also be obtained with a get_seq_stream() iterator objects.
- summary
- summary:$bins
-
Returns an array of SeqFeatureI compatibile Summary objects for each strand requested. If the useq file contains stranded data, and no strand is requested, then two objects will be returned representing each strand.
Each Summary object can then be used to call statistical summaries for one or more bins across the interval. Each statistical summary is an anonymous hash with five keys: validCount, sumData, sumSquares, minVal, and maxVal. From these values, a mean and standard deviation may also be calculated.
my ($summary) = $useq->features( -seq_id => $chromosome, -start => $start, -end => $end, -type => 'summary', ); my @stats = $summary->statistical_summary(100); foreach my $stat (@stats) { my $count = $stat->{validCount}; my $sum = $stat->{sumData}; my $mean = $sum / $count; }
Statistical summaries are equivalent to those generated by the Bio::DB::BigWig adaptor and may be used interchangeably. They are compatible with the Bio::Graphics modules.
Summary objects may also be obtained with a get_seq_stream() iterator object.
- get_seq_stream
-
This is a memory efficient data accessor. An iterator object is returned for an interval specified using coordinate values in the same manner as features(). Call the method next_seq() on the iterator object to retrieve the observation SeqFeature objects one at a time. The iterator is compatible with region, wiggle, and summary feature types.
# establish the iterator object my $iterator = $useq->get_seq_stream( -seq_id => $seq_id, -start => $start, -end => $end, -type => 'region', ); # retrieve the features one at a time while (my $f = $iterator->next_seq) { # each feature $f is either a # Bio::DB::USeq::Feature, # a Bio::DB::USeq::Wiggle, or # a Bio::DB::USeq::Summary object }
- scores
-
This is a simplified data accessor that only returns the score values overlapping an interrogated interval, rather than assembling each observation into a SeqFeature object. The scores are not associated with genomic coordinates, and are not guaranteed to be returned in original genomic order.
Provide the interval coordinates in the same manner as the features() method. Stranded data collection is supported.
my @scores = $useq->scores( -seq_id => $seq_id, -start => $start, -end => $end, );
- observations
-
This is a low-level data accessor, similar to features() but returning an array of references to the original USeq observations. Each USeq observation is an anonymous array reference of 2-4 elements: start, stop, score, and text, depending on the stored data type. All coordinates are in 0-based coordinates, unlike high level accessors. Note that while strand is supported, it is not reported for each observation. If observations exist on both strands, then each strand should be searched separately. Observations are not guaranteed to be returned in genomic order.
my @observations = $useq->observations( -seq_id => $seq_id, -start => $start, -end => $end, ); foreach my $obs (@observations) { my $start = $obs->[0] + 1; # convert to 1-based coordinate my $stop = $obs->[1]; my $score = $obs->[2]; # may not be present my $text = $obs->[3]; # may not be present }
- segment
-
This returns a Bio::DB::USeq::Segment object, which is a SeqFeatureI compatible segment object corresponding to the specified coordinates. From this object, one can call the features(), scores(), or get_seq_stream() methods directly. Keyed options or location information need not be provided. Stranded segments are supported.
my $segment = $useq->segment( -seq_id => $seq_id, -start => $start, -end => $end, ); my @scores = $segment->scores; my @features = $segment->features('wiggle:100'); my $iterator = $segment->get_seq_stream('region');
- get_features_by_location
-
Convenience method for returning features restricted by location.
- get_feature_by_id
- get_feature_by_name
-
Compatibility methods for returning a specific feature or observation in the USeq file. Text fields, if present, are not indexed in the USeq file, preventing efficient searching of names. As a workaround, an ID or name comprised of "$seq_id:$start:$end" may be used, although a direct search of coordinates would be more efficient.
my $feature = $useq->get_feature_by_id("$seq_id:$start:$end");
ADDITIONAL CLASSES
These are additional class object that may be returned by various methods above.
Bio::DB::USeq::Feature objects
These are SeqFeatureI compliant objects returned by the features() or next_seq() methods. They support the following methods.
seq_id
start
end
strand
score
type
source (returns the useq archive filename)
name (chromosome:start..stop)
Bio::RangeI methods
Additionally, chromosome and global statistics are also available from any feature, as well as from Segment, Wiggle, Iterator, and Summary objects. See the corresponding USeq methods for details.
Bio::DB::USeq::Segment objects
This is a SeqFeatureI compliant object representing a genomic segment or interval. These support the following methods.
- features
- features($type)
- get_seq_stream
- get_seq_stream($type)
- scores
- observations
-
Direct methods for returning features or scores. Coordinate information need not be provided. See the corresponding Bio::DB::USeq methods for more information.
- wiggle
- wiggle($bins)
- coverage
- coverage($bins)
- statistical_summary
- statistical_summary($bins)
-
Convenience methods for returning wiggle (coverage) or summary features over the segment. If desired, the number of bins may be specified. See the features() method for more information.
- slices
-
Returns an array of splice member names that overlap this segment. See "USEQ SLICES" for more information.
Bio::DB::USeq::Iterator objects
This is an iterator object for retrieving useq observation SeqFeatureI objects in a memory efficient manner by returning one feature at a time.
- next_seq
- next_feature
-
Returns the next feature present in the interrogated interval. Features are generally returned in ascending coordinate order. Returns undefined when no features are remaining in the interval. Features may include either region or wiggle types, depending on how the iterator object was established. See features() and get_seq_stream() methods for more information.
Bio::DB::USeq::Wiggle objects
These are SeqFeatureI compliant objects for backwards compatibility with Bio::Graphics and GBrowse. They support the wiggle() and coverage() methods, which returns an array of scores over the interrogated region. By default, the array is equal to the length of the region (1 bp resolution), or may be limited to a specified number of bins for efficiency. See the features() method for more information.
- wiggle
- coverage
-
The scores are stored as an array in the coverage attribute. For convenience, the wiggle() and coverage() methods may be used to retrieve the array or array reference of scores.
- statistical_summary
-
Generate a statistical summary hash for the collected wiggle scores (not the original data). This method is not entirely that useful; best to use the summary feature type in the first place.
- chromosome and global statistics
-
Chromosome and global statistics, including mean and standard deviation, are available from wiggle objects. See the corresponding USeq methods for details.
Bio::DB::USeq::Summary objects
These are SeqFeatureI compliant Summary objects, similar to those generated by the Bio::DB::BigWig database adaptor. As such, they are compatible with Bio::Graphics and GBrowse.
Summary objects can generate statistical summaries over a specified number of bins (default is 1 bin, or the entire requested region). Each statistical summary is an anonymous hash consisting of five keys: validCount, sumData, sumSquares, minVal, and maxVal. From these values, a mean and standard deviation may be calculated.
For convenience, three exported functions are available for calculating the mean and standard deviation from a statistical summary hash. See "EXPORTED FUNCTIONS" for more information.
Use statistical summaries in the following manner.
my $stats = $summary->statistical_summary(10);
my $stat = shift @$stats;
my $min = $stat->{minVal};
my $max = $stat->{maxVal};
my $mean = $stat->{sumData} / $stat->{validCount};
- statistical_summary
- statistical_summary($bins)
-
Generate a statistical summary hash for one or more bins across the interrogated region. Provide the number of bins desired. If a feature type of "summary:$bins" is requested through the features() or get_seq_stream() method, then $bins number of bins will be used. The default number of bins is 1.
- score
-
Generate a single statistical summary over the entire region.
- chromosome and global statistics
-
Chromosome and global statistics, including mean and standard deviation, are available from summary objects. See the corresponding USeq methods for details.
EXPORTED FUNCTIONS
Three subroutine functions are available for export to assist in calculating the mean, variance, and standard deviation from statistical summaries. These functions are identical to those from the Bio::DB::BigWig adaptor and may be used interchangeably.
They are not exported by default; they must explicitly listed.
use Bio::DB::USeq qw(binMean binStdev);
my $stats = $summary->statistical_summary(10);
my $stat = shift @$stats;
my $mean = binMean($stat);
my $stdev = binStdev($stat);
- binMean( $stat )
-
Calculate the mean from a statistical summary anonymous hash.
- binVariance( $stat )
-
Calculate the variance from a statistical summary anonymous hash.
- binStdev( $stat )
-
Calculate the standard deviation from a statistical summary anonymous hash.
USEQ SLICES
Genomic observations are recorded in groups, called slices, of usually 10000 observations at a time. Each slice is a separate zip file member in the useq file archive. These methods are for accessing information about each slice. In general, accessing data through slices is a lower level operation. Users should preferentially use the main data accessors.
The following are Bio::DB::USeq methods available for working with slices.
- slices
-
Returns an array of all the slice member names present in the useq archive file.
- slice_feature($slice)
-
Return a Bio::DB::USeq::Segment object representing the slice interval. The features(), get_seq_stream(), and scores() methods are supported.
- slice_seq_id($slice)
-
Return the chromosome or sequence identifier associated with a slice.
- slice_start($slice)
-
Return the start position of the slice interval.
- slice_end($slice)
-
Return the end position of the slice interval.
- slice_strand($slice)
-
Return the strand of the slice interval.
- slice_type($slice)
-
Return the file type of the slice member. This corresponds to the file extension of the slice zip member and indicates how to parse the binary member. Each letter in the type corresponds to a data type, be it integer, short, floating-point, or text. See the useq file documentation for more details.
- slice_obs_number($slice)
-
Return the number of observations recorded in the slice interval.
GBROWSE COMPATIBILITY
The USeq adaptor has support for Bio::Graphics and GBrowse. It will work with the segments glyph for intervals, the wiggle_xyplot glyph for displaying mean scores, and the wiggle_whiskers glyph for displaying detailed statistics.
Initialize the USeq database adaptor.
[data1:database]
db_adaptor = Bio::DB::USeq
db_args = -file /path/to/data1.useq
Displaying simple intervals with the segments glyph.
[data1_segments]
database = data1
feature = region
glyph = segments
stranded = 1
Displaying scores using the wiggle_xyplot glyph. You may set the bins to whatever number is appropriate (in this example, 1000), or leave blank (not recommended, defaults to 1 bp resolution).
[data1_xyplot]
database = data1
feature = wiggle:1000
glyph = wiggle_xyplot
graph_type = histogram
autoscale = chromosome
Displaying scores using the wiggle_whiskers glyph. Note that generating statistical summaries are computationally more expensive than simple bins of mean values as with the wiggle feature type.
[data1_whiskers]
database = data1
feature = summary
glyph = wiggle_whiskers
graph_type = histogram
autoscale = chromosome
PERFORMANCE
Because the Bio::DB::USeq is implemented as a Perl-only module, performance is subject to the limitations of Perl execution itself and the size of the data that needs to be parsed. In general when collecting score data, requesting scores is the fastest mode of operation, followed by wiggle feature types, and finally summary feature types.
In comparison to UCSC bigWig files, the USeq format is typically much faster when viewing intervals where the entire interval is represented by one or a few internal slices. This is especially true for repeated queries over the same or neighboring intervals, as the slice contents are retained in memory. As the number of internal slices that must be loaded into memory increases, for example querying intervals of hundreds of kilobases in size, performance will begin to lag as each internal slice must be parsed into memory. This is where the UCSC bigWig file format with internal zoom levels of summary statistics can outperform, at the cost of file complexity and size.
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.