NAME
Bio::ToolBox::db_helper
DESCRIPTION
These are helper subroutines to work with relevant databases that can be accessed through BioPerl modules. These include the Bio::DB::SeqFeature::Store relational database as well as Bio::DB::Fasta, Bio::DB::Bam, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed databases. The primary functions included opening connections to these databases, verifying features found within the database, generating lists of features or genomic intervals for subsequent analysis, and collecting features and/or scores within regions of interest. Collected scores may be summarized using a variety of statistical methods.
When collecting scores or features, the data may hosted in a variety of formats and locations. These include:
- SeqFeature::Store Database
-
Full features may be stored, including genes, transcripts, exons, etc. Simple datasets, such as from microarray, may also be stored as the score value in the source GFF file.
References to local, binary, indexed files may also be included as attributes to features stored in the database. Supported files include binary wig files (.wib, see Bio::Graphics::Wiggle) using the 'wigfile' attribute, or bigWig files using the 'wigfile' or 'bigwigfile' attribute. The attribute values must be full paths.
SeqFeature::Store databases are usually hosted by a relational database server (MySQL or PostGreSQL), SQLite file, or an in-memory database (for small GFF3 files only).
- BigWig files
-
BigWig files are compressed, binary, indexed versions of text wig files and may be accessed either locally or remotely. They support extremely fast score retrieval from any genomic location of any size without sacrificing resolution (spatial and numeric).
- Directory of BigWig files
-
A directory containing one or more BigWig files is assembled into a BigWigSet, allowing for metadata, such as strand, to be associated with BigWig files.
- BigBed files
-
BigBed files are compressed, binary, indexed versions of text BED files and may be accessed either locally or remotely. They support extremely fast score and feature retrieval from any genomic location.
- Bam files
-
Bam files are compressed, binary, indexed versions of the text SAM file, or sequence alignment map. They are used with next generation sequencing technologies. They support individual alignment retrieval as well as read depth coverage.
- USeq files
-
USeq files are compressed, binary, indexed files that support BED type annotations or wig type scores distributed across the genome. They support rapid, random access across the genome and are comparable to both BigWig and BigBed files.
While the functions within this library may appear to be simply a rehashing of the methods and functions in Bio::DB::SeqFeature::Store and other modules, they either provide a simpler function to often used database methodologies or are designed to work intimately with the BioToolBox data format file and data structures (see Bio::ToolBox::file_helper
). One key advantage to these functions is the ability to work with datasets that are stranded (transcriptome data, for example).
Historically, this module was initially written to use Bio::DB::GFF for database usage. It has since been re-written to use Bio::DB::SeqFeature::Store database schema. Additional functionality has also been added for other databases and file formats.
Complete usage and examples for the functions are provided below.
USAGE
Call the module at the beginning of your perl script and include the module names to export. None are exported by default.
use Bio::ToolBox::db_helper qw(
get_new_feature_list
get_feature
);
This will export the indicated subroutine names into the current namespace. Their usage is detailed below.
EXPORTED SUBROUTINES
- open_db_connection($database, $no_cache)
-
This module will open a connection to a BioPerl style database. It returns an object that represents the connection. Several different types of databases are supported.
- Bio::DB::SeqFeature::Store database
-
These may be represented by a relational database (e.g. MySQL database), a SQLite database file (file.sqlite or file.db), or a single GFF3 file (file.gff) that can be loaded into an in-memory database. In-memory databases should only be used with small files as they demand a lot of memory.
Parameters for connecting to a relational database are stored in the BioToolBox configuration file,
.biotoolbox.cfg
. These include database adaptors, user name, password, etc. Information regarding the configuration file may be found within the file itself. - Bio::DB::Sam database
-
A self-contained database represented by a sorted, indexed Bam file (file.bam). See http://samtools.sourceforge.net for more details. Files may be either local or remote (prefixed with http:// or ftp://).
- Bio::DB::BigWig database
-
A self-contained database of scores represented by a BigWig (file.bw). See http://genome.ucsc.edu/goldenPath/help/bigWig.html for more information. Files may be either local or remote (prefixed with http:// or ftp://).
- Bio::DB::BigWigSet database
-
A local or remote directory of one or more BigWig files that can treated collectively as a database. A special text file may be included in the directory to assign metadata to each BigWig file, including attributes such as type, source, display name, strand, etc. See Bio::DB::BigWigSet for more information on the formatting of the metadata file.
- Bio::DB::BigBed database
-
A self-contained database of regions represented by a BigBed (file.bb). See http://genome.ucsc.edu/goldenPath/help/bigBed.html for more information. Files may be either local or remote (prefixed with http:// or ftp://).
- Bio::DB::USeq database
-
A self-contained database file of indexed regions or scores represented by a useq archive file (file.useq). See http://useq.sourceforge.net/useqArchiveFormat.html for more information. Files may only be local.
- Bio::DB::Fasta Database
-
A database of fasta sequences. A single multi-fasta or a directory of fasta files may be specified. The directory or parent directory must be writeable by the user to write a small index file.
Pass the name of a relational database or the path of the database file to the subroutine. The opened database object is returned. If it fails, then an error message should be generated and nothing is returned.
Important! If you are forking your perl process, always re-open your database objects in the child process, and pass a second true value to avoid using the cached database object. By default, opened databases are cached to improve efficiency, but this will be disastrous when crossing forks.
Example:
my $db_name = 'cerevisiae'; my $db = open_db_connection($db_name); my $file = '/path/to/file.bam'; my $db = open_db_connection($file); # within a forked child process # reusing the same variable to simplify code # pass second true value to avoid cache $db = open_db_connection($file, 1);
- get_dataset_list
-
This subroutine will retrieve a list of the available features stored in the database and returns an array of the features' types.
For Bio::DB::SeqFeature::Store databases, the type is represented as "type:source", corresponding to the third and second GFF columns, respectively. The types are sorted alphabetically first by source, then by method.
For Bio::DB::BigWigSet databases, the type, primary_tag, method, or display_name attribute may be used, in that respective order of availability. The list is sorted alphabetically.
Pass either the name of the database or an established database object. Supported databases include both Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases.
Example:
my $db_name = 'cerevisiae'; my @types = get_dataset_list($db_name);
- verify_or_request_feature_types()
-
This subroutine will process a list of feature types or data sources to be used for data or feature collection. There are two modes of action. If a list was provided, it will process and verify the list. If no list was provided, then the user will be prompted to select from a list of available feature types in the provided database.
The provided list may be a list of feature types or a list of single file data sources, including Bam, bigWig, or bigBed files. If the list includes feature types, they are verified as existing in the provided database. If the list includes file names, the local files are checked and the names prefixed with "file:" for use downstream.
If no list was provided, then a list of available feature types in the provided database will be presented to the user, and the user prompted to make a selection. One or more types may be selected, and a single item may be enforced if requested. The response is filtered through the parse_list() method from Bio::ToolBox::utility, so a mix of single numbers or a range of numbers may be accepted. The responses are then validated.
Two types of databases are supported: Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases. For SeqFeature Store databases, the type is comprised of "method:source", corresponding to the third and second GFF columns. For BigWigSet databases, types correspond to either the type, method, primary_tag, or display_name attributes, in that order of availability.
For feature types or files that pass validation, they are returned as a list. Types of files that do not pass validation are printed to STDOUT.
To use this subroutine, pass an array with the following keys and values. Not every key is required.
db => The name of the database or a reference to an established BioPerl database object. A Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database are typically used. Only required when no features are provided. feature => Pass either a single dataset name as a scalar or an anonymous array reference of a list of dataset names. These may have been provided as a command line option and need to be verified. If nothing is passed, then a list of possible datasets will be presented to the user to be chosen. prompt => Provide a phrase to be prompted to the user to help in selecting datasets from the list. If none is provided, a generic prompt will be used. single => A Boolean value (1 or 0) indicating whether only a single dataset is allowed when selecting datasets from a presented list. If true, only one dataset choice is accepted. If false, one or more dataset choices are allowed. Default is false. limit => Optionally provide a word or regular expression that matches the feature type (primary_tag only; source_tag, if present, is ignored). For example, provide "gene|mrna" to only present gene and mRNA features to the user. This is only applicable when a user must select from a database list. The default is to list all available feature types.
The subroutine will return a list of the accepted datasets. It will print bad dataset names to standard out.
Example:
# verify a dataset, either file or database type my $db_name = 'cerevisiae'; my $dataset = 'microaray_data'; my @features = verify_or_request_feature_types( db => $db_name, feature => $dataset, ); unless (@features) { die " no valid feature provided!\n"; } # select a list of features my $db_name = 'cerevisiae'; my @types = (); # user not provided @types = verify_or_request_feature_types( db => $db_name, feature => $types, ); # user will be promoted to select from database list if (@types) { print "using types " . join(", ", @types); } else { die " no valid features selected!\n"; }
- check_dataset_for_rpm_support()
-
This subroutine will check a dataset for RPM, or Reads Per Million mapped, support. Only two types of database files support this, Bam files and BigBed files. If the dataset is either one of these, or the name of a database feature which points to one of these files, then it will calculate the total number of mapped alignments (Bam file) or features (BigBed file). It will return this total number. If the dataset does not support RPM (because it is not a Bam or BigBed file, for example), then it will return undefined.
Pass this subroutine one or two values. The first is the name of the dataset. Ideally it should be validated using verify_or_request_feature_types() and have an appropriate prefix (file, http, or ftp). If it does not have a prefix, then it is assumed to be a database feature. The second passed feature is the name of a BioPerl database or an opened database object. This database will be checked for the indicated dataset, and the first returned feature checked for an attribute pointing to a supported file.
For multi-threaded execution pass a third value, the number of CPU cores available to count Bam files. The default is 2 for environments where Parallel::ForkManager is installed, or 1 where it is not.
- get_new_feature_list
-
This subroutine will generate a new feature list collected from the database. Once the list of genomic features is generated, then data may be collected for each item in the list.
The subroutine will generate and return a data hash as described in
Bio::ToolBox::file_helper
. The data table will have two or three columns. The feature name and type:source are listed in columns one and two, respectively. If the features have an Alias tag, then a third column is included with a comma delimited list of the feature aliases.The subroutine is passed an array containing the arguments. The keys include
Required: db => The name of the database or a reference to an established database object. features => A scalar value containing a name representing the type(s) of feature(s) to collect. This name will be parsed into an actual list with the internal subroutine _features_to_classes(). Refer to that documentation for a list of appropriate features.
The subroutine will return a reference to the data hash. It will print status messages to STDOUT.
Example
my $db_name = 'cerevisiae'; my %data = get_new_feature_list( 'db' => $db_name, 'features' => 'genes', );
- get_new_genome_list
-
This subroutine will generate a new list of genomic windows. The genome is split into intervals of a specified size that is moved along the genome in specified step sizes.
The subroutine will generate and return a data hash as described in
Bio::ToolBox::file_helper
. The data table will have 3 columns, including Chromosome, Start, and Stop.The subroutine is passed an array containing the arguments. The keys include
Required: db => The name of the database or a reference to an established database object. Optional: win => A scalar value containing an integer representing the size of the window in basepairs. The default value is defined in biotoolbox.cfg file. step => A scalar value containing an integer representing the step size for advancing the window across the genome. The default is the window size.
The subroutine will return a reference to the data hash. It will print status messages to STDOUT.
Example
my $db_name = 'cerevisiae'; my $window_size = 500; my $step_size = 250; my %data = get_new_genome_list( 'db' => $db_name, 'win' => $window_size, 'step' => $step_size, );
- validate_included_feature
-
This subroutine will validate a database feature to make sure it is useable. It will check feature attributes and compare them against a list of attributes and values to be avoided. The list of unwanted attributes and values is stored in the BioToolBox configuration file biotoolbox.cfg.
Pass the subroutine a Bio::DB::SeqFeature::Store database feature. It will return true (1) if the feature passes validation and false (undefined) if it contains an excluded attribute and value.
- get_feature
-
This subroutine will retrieve a specific feature from a Bio::DB::SeqFeature::Store database for subsequent analysis, manipulation, and/or score retrieval using the get_chromo_region_score() or get_region_dataset_hash() methods. It relies upon unique information to pull out a single, unique feature.
Several attributes may be used to pull out the feature, including the feature's unique database primary ID, name and/or aliases, and GFF type (primary_tag and/or source). The get_new_feature_list() subroutine will generate a list of features with their unique identifiers.
The primary_id attribute is preferentially used as it provides the best performance. However, it is not portable between databases or even re-loading. In that case, the display_name and type are used to identify potential features. Note that the display_name may not be unique in the database. In this case, the addition of aliases may help. If all else fails, a new feature list should be generated.
To get a feature, pass an array of arguments. The keys include
Required: db => The name of the Bio::DB::SeqFeature::Store database or a reference to an established database object. id => Provide the primary_id tag. In the Bio::DB::SeqFeature::Store database schema this is a (usually) non-portable, unique identifier specific to a database. It provides the fastest lookup. name => A scalar value representing the feature display_name. Aliases may be appended with semicolon delimiters. type => Provide the feature type, which is typically expressed as primary_tag:source. Alternatively, provide just the primary_tag only.
While it is possible to identify features with any two attributes (or possibly just name or ID), the best performance is obtained with all three together.
The first SeqFeature object is returned if found.
- get_chromo_region_score
-
This subroutine will retrieve a dataset value for a single specified region in the genome. The region is specified with chromosomal coordinates: chromosome name, start, and stop. It will collect all dataset values within the window, combine them with the specified method, and return the single value.
The subroutine is passed an array containing the arguments. The keys include
Required: db => The name of the database or a reference to an established database object. Optional if an indexed dataset file (Bam, BigWig, etc.) is provided. dataset => The name of the dataset in the database to be collected. The name should correspond to a feature type in the database, either as type or type:source. The name should be verified using the subroutine verify_or_request_feature_types() prior to passing. Multiple datasets may be given, joined by '&', with no spaces. Alternatively, specify a data file name. A local file should be prefixed with 'file:', while a remote file should be prefixed with the transfer protocol (ftp: or http:). method => The method used to combine the dataset values found in the defined region. Acceptable values include sum, mean, median, range, stddev, min, max, rpm, rpkm, and scores. See _get_segment_score() documentation for more info. chromo => The name of the chromosome (reference sequence) start => The start position of the region on the chromosome stop => The stop position of the region on the chromosome end => Alias for stop Optional: strand => The strand of the region (-1, 0, or 1) on the chromosome. The default is 0, or unstranded. value => Specify the type of value to collect. Acceptable values include score, count, or length. The default value type is score. log => Boolean value (1 or 0) indicating whether the dataset values are in log2 space or not. If undefined, the dataset name will be checked for the presence of the phrase 'log2' and set accordingly. This argument is critical for accurately mathematically combining dataset values in the region. stranded => Indicate whether the dataset values from a specific strand relative to the feature should be collected. Acceptable values include sense, antisense, or all. Default is 'all'. rpm_sum => When collecting rpm or rpkm values, the total number of alignments may be provided here. Especially useful when collecting via parallel forked processes, otherwise each fork will sum the number of alignments independently, an expensive proposition.
The subroutine will return the region score if successful.
Examples
my $db = open_db_connection('cerevisiae'); my $score = get_chromo_region_score( 'db' => $db, 'method' => 'mean', 'dataset' => $dataset, 'chr' => $chromo, 'start' => $startposition, 'stop' => $stopposition, 'log' => 1, );
- get_region_dataset_hash
-
This subroutine will retrieve dataset values or feature attributes from features located within a defined region and return them as a hash. The (start) positions will be the hash keys, and the corresponding dataset values or attributes will be the hash values. The region is defined based on a genomic feature in the database. The region corresponding to the entire feature is selected by default. Alternatively, it may be adjusted by passing appropriate arguments.
Different dataset values may be collected. The default is to collect score values of the dataset features found within the region (e.g. microarray values). Alternatively, a count of found dataset features may be returned, or the lengths of the found dataset features. When lengths are used, the midpoint position of the feature is used in the returned hash rather than the start position.
The returned hash is keyed by relative coordinates and their scores. For example, requesting a region from -200 to +200 of a feature (using the start and stop options, below) will return a hash whose keys are relative to the feature start position, i.e. the keys will >= -200 and <= 200. Absolute coordinates relative to the reference sequence or chromosome may be optionally returned instead.
The subroutine is passed an array containing the arguments. The keys include
Required: db => The name of the annotation database or a reference to an established database object. dataset => The name of the dataset in the database to be collected. The name should correspond to a feature type in the database, either as type or type:source. The name should be verified using the subroutine verify_or_request_feature_types() prior to passing. Multiple datasets may be given, joined by '&', with no spaces. Alternatively, specify a data file name. A local file should be prefixed with 'file:', while a remote file should be prefixed with the transfer protocol (ftp: or http:). Required for database features: id => The Primary ID of the genomic feature. This is database specific and may be used alone to identify a genomic feature, or in conjunction with name and type. name => The name of the genomic feature. Required if the Primary ID is not provided. type => The type of the genomic feature. Required if the Primary ID is not provided. Required for coordinate positions: chromo => The chromosome or sequence name (seq_id). This may be used instead of name and type to specify a genomic segment. This must be used with start and stop options, and optionally strand options. Optional: start => Indicate an integer value representing the start position of the region relative to the feature start. Use a negative value to begin upstream of the feature. Must be combined with "stop". stop|end => Indicate an integer value representing the stop position of the region relative to the feature start. Use a negative value to begin upstream of the feature. Must be combined with "start". ddb => The name of the data-specific database or a reference to an established database. Use when the data and annotation are in separate databases. extend => Indicate an integer value representing the number of bp the feature's region should be extended on both sides. position => Indicate the relative position of the feature from which the "start" and "stop" positions are calculated. Three values are accepted: "5", which denotes the 5' end of the feature, "3" which denotes the 3' end of the feature, or "4" which denotes the middle of the feature. This option is only used in conjunction with "start" and "stop" options. The default value is "5". strand => For those features or regions that are NOT inherently stranded (strand 0), artificially set the strand. Three values are accepted: -1, 0, 1. This will overwrite any pre-existing value (it will not, however, migrate back to the database). stranded => Indicate whether the dataset values from a specific strand relative to the feature should be collected. Acceptable values include sense, antisense, or all. Default is all. value => Indicate which attribute will be returned. Acceptable values include "score", "count", or "length". The default behavior will be to return the score values. avoid => Boolean value to indicate that other features of the same type should be avoided. This only works if name and type was provided. Any positioned scores which overlap the other feature(s) are not returned. The default is false (return all values). absolute => Boolean value to indicate that absolute coordinates should be returned, instead of transforming to relative coordinates, which is the default.
The subroutine will return the hash if successful.
Example
my $db = open_db_connection('cerevisiae'); my %region_scores = get_region_dataset_hash( 'db' => $db, 'dataset' => $dataset, 'name' => $name, 'type' => $type, );
- get_chromosome_list
-
This subroutine will collect a list of chromosomes or reference sequences in a Bio::DB database and return the list along with their sizes in bp. Many BioPerl-based databases are supported, including Bio::DB::SeqFeature::Store, Bio::DB::Fasta, Bio::DB::Sam, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed, or any others that support the "seq_ids" method. See the open_db_connection subroutine for more information.
Pass the subroutine either the name of the database, the path to the database file, or an opened database object.
Optionally pass a second value, a boolean argument to limit and exclude unwanted chromosomes as defined by the "chromosome_exclude" option in the BioToolBox configuration file,
biotoolbox.cfg
. A true value limits chromosomes, and false includes all chromosomes. The default is to return all chromosomes. Sometimes some sequences are simply not wanted in analysis, like the mitochondrial chromosome or unmapped contigs.The subroutine will return an array, with each element representing each chromosome or reference sequence in the database. Each element is an anonymous array of two elements, the chromosome name and length in bp.
Example
my $db = open_db_connection('cerevisiae'); # get all chromosomes in the database my @chromosomes = get_chromosome_list($db); foreach (@chromosomes) { my $name = $_->[0]; my $length = $_->[1]; print "chromosome $name is $length bp\n"; }
INTERNAL SUBROUTINES
These are not intended for normal consumption but are documented here so that we know what is going on.
- _features_to_classes
-
This internal subroutine provides a conveniant look up and conversion of a single-word description of a category of features into a list of actual feature classes in the database. For example, the word 'gene' may include all ORFs, snRNAs, snoRNAs, and ncRNAs.
Pass the subroutine the feature category name as a scalar value. The actual list of feature types will be collected and returned as an array. Multiple values may be passed as a comma-delimited string (no spaces).
The aliases and feature lists are specified in the Bio::ToolBox::db_helper configuration file, biotoolbox.cfg. Additional lists and aliases may be placed there. The lists are database specific, or they can be added to the default database.
If the passed category name does not match an alias in the config file, then it is assumed to be a feature in the database. No further validation will be done (if it's not valid, simply no features would be returned from the database).
Also, feature types may be passed as the GFF's method:source, in which case they are assumed to be valid and not checked.
- _get_segment_score
-
This internal subroutine is used to collect the dataset scores for an established genomic region. It works with a variety of data sources.
First, the data may be stored directly in the Bio::DB::SeqFeature::Store (using the original GFF score value). Second, the feature may reference a data file as an attribute (e.g., wigfile=/path/to/file.wib). Finally, the name(s) of a data file may be passed from which to collect the data. Supported data files include BigWig (.bw), BigBed (.bb), USeq (.useq), and Bam (.bam). A Bio::DB::BigWigSet database is also supported.
The subroutine is passed an array of ten specific values, all of which must be defined and presented in this order. These values include
[0] The opened database object that may be used to generate the the segment and contain the data to collect. If the dataset request is from a big file (bigWig, bigBed, Bam), then this database will likely not be used. Otherwise a Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database object should be passed. [1] The chromosome or seq_id of the segment to examine [2] The start position of the segment [3] The stop or end position of the segment [4] The strand of the region (or original feature) -1, 0, or 1. [5] The dataset name for filename. Multiple datasets may be included, delimited with an ampersand (&). Multiple datasets are merged into one, unless excluded by strand. Local data source files should be prepended with 'file:', while remote data source files should be prepended with the transfer protocol (http: or ftp:). [6] The data type to be collecting. In most cases, the score value is used, but others may be collected. Accepted values include score count length [7] The method of combining all of the dataset values found in the segment into a single value. Accepted values include sum mean median min max range (returns difference between max and min) stddev (returns the standard deviation of a population) indexed (returns hash of postion => score) rpm (returns reads per million mapped, only valid with bam and bigbed databases) rpkm (same as rpm but normalized for length in kb) scores (returns an array reference of all the raw scores) [8] The strandedness of acceptable data. Genomic segments established from an inherently stranded database feature (e.g. ORF) have a specific strandedness. If the dataset strand does not match the specified criteria for strandedness, then it is ignored. If the dataset does not have a specified strand, then it is used regardless of the specified criteria for strandedness. Currently, only transcription data is stranded. Accepted values include sense take only values on the same strand antisense take only values on the opposite strand all take all values [9] The log status of the dataset. Many microarray datasets are in log2 format, and must be de-logged to perform accurate mathematical calculations, such as mean or median. Supply a boolean (0 or 1) value.
The subroutine returns a single numeric value if appropriate dataset values are identified in the genomic segment. If not, then a non-value (.) is returned. Use of a period as a non-value avoids un-defined errors in some subsequent programmatic manipulations and provides a convenient human visual marker.
- _return_null
-
Internal method for returning a 0 or internal null '.' character based on the method being used.
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.