NAME

Bio::ToolBox::db_helper - helper interface to various database formats

DESCRIPTION

In most cases, this module does not need to be used directly. The methods available to Bio::ToolBox::Data and Bio::ToolBox::Data::Feature provide convenient access to the methods described here.

These are helper subroutines to work with relevant databases that can be accessed through BioPerl modules. These include the Bio::DB::SeqFeature::Store relational database as well as Bio::DB::Fasta, Bio::DB::Sam, Bio::DB::HTS, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed databases. The primary functions included opening connections to these databases, verifying features found within the database, generating lists of features or genomic intervals for subsequent analysis, and collecting features and/or scores within regions of interest. Collected scores may be summarized using a variety of statistical methods.

When collecting scores or features, the data may hosted in a variety of formats and locations. These include:

SeqFeature database

Genomic feature annotation, including genes, transcripts, exons, etc, may be stored in a Bio::DB::SeqFeature::Store database. These databases are backed by either a relational database, including MySQL, PostGreSQL, or SQLite. Small GFF3 files may also be loaded in an in-memory database. These are typically loaded from GFF3 files; see the adapter documentation for more information. Dataset scores, such as from microarray, may also be stored as the score value in the source GFF file.

References to local, binary, indexed files may also be included as attributes to features stored in the database. This is legacy support from the GBrowse genome browser, and should not be used in preference to other methods. Supported files include the legacy binary wig files (.wib, see Bio::Graphics::Wiggle) using the wigfile attribute, or bigWig files using the wigfile or bigwigfile attribute. The attribute values must be full paths.

BigWig files

BigWig files (.bw and .bigwig) are compressed, binary, indexed versions of text wig files and may be accessed either locally or remotely. They support extremely fast score retrieval from any genomic location of any size without sacrificing resolution (spatial and numeric). See http://genome.ucsc.edu/goldenPath/help/bigWig.html for more information. BigWig files are supported by the Bio::DB::BigWig adapter, based on the UCSC library; see the documentation for more information.

Directory of BigWig files

A directory containing two or more BigWig files is assembled into a BigWigSet, allowing for metadata, such as strand, to be associated with BigWig files. Additional metadata beyond the filename may be included in text file metadata.txt within the directory. See the Bio::DB::BigWigSet adapter documentation for more information.

BigBed files

BigBed files are compressed, binary, indexed versions of text BED files. See http://genome.ucsc.edu/goldenPath/help/bigBed.html for more information. Both local and remote files may be accessed. BigBed files are supported by Bio::DB::BigBed adapter, based on the UCSC library; see the adapter documentation for more information.

Bam files

Bam files (.bam) are compressed, binary, indexed versions of the text SAM file, or sequence alignment map. They are used with next generation sequencing technologies. They support individual alignment retrieval as well as read depth coverage. Two different Bam file adapters are supported.

The Bio::DB::HTS adapter is an interface to the Bam (or Cram) file. This is based on the modern HTSlib C library (version >= 1.0), successor to the original samtools library. See http://samtools.github.io for more information.

The Bio::DB::Sam adapter is an older interface to the Bam alignment file. This is based on the samtools C library version <= 0.1.19. While this adapter is still supported, the above should be used for new installs.

USeq files

USeq files (.useq) are compressed, binary, indexed files that support multiple information types, including region intervals, BED type annotations, or wig type scores distributed across the genome. They support rapid, random access across the genome and are comparable to both BigWig and BigBed files. See http://useq.sourceforge.net/useqArchiveFormat.html for more information. Files may only be local. These files are supported by the Bio::DB::USeq adapter.

Fasta files

Fasta files (.fa or .fasta) may be opened. Fasta files are indexed to rapidly and randomly fetch genomic sequence. Three different adapters are available for indexing the fasta file.

The Bio::DB::HTS::Faidx adapter is preferentially used, and requires a .fai index file.

The Bio::DB::Sam::Fai adapter may alternatively used, and also requires a .fai index file.

The older style Bio::DB::Fasta adapter is much slower, but will index either a single genomic fasta file or a directory of individual chromosome or contig fasta files.

Additionally, genomic sequence stored in a Bio::DB::SeqFeature::Store annotation database may also be used.

While the functions within this library may appear to be simply a rehashing of the methods and functions in Bio::DB::SeqFeature::Store and other modules, they either provide a simpler function to often used database methodologies or are designed to work intimately with the BioToolBox data format file and data structures (see Bio::ToolBox::file_helper). One key advantage to these functions is the ability to work with datasets that are stranded (transcriptome data, for example).

Historically, this module was initially written to use Bio::DB::GFF for database usage. It has since been re-written to use Bio::DB::SeqFeature::Store database schema. Additional functionality has also been added for other databases and file formats.

Complete usage and examples for the functions are provided below.

USAGE

Call the module at the beginning of your perl script and include the module names to export. None are exported by default.

  use Bio::ToolBox::db_helper qw(
	  get_new_feature_list 
	  get_db_feature
  );

This will export the indicated subroutine names into the current namespace. Their usage is detailed below.

EXPORTED SUBROUTINES

open_db_connection
my $db_name = 'cerevisiae';
my $db = open_db_connection($db_name);

my $file = '/path/to/file.bam';
my $db = open_db_connection($file);

# within a forked child process
# reusing the same variable to simplify code
# pass second true value to avoid cache
$db = open_db_connection($file, 1); 

This is a simplified, generalized method that will open a connection to a database or indexed file using any one of a number of different BioPerl-style adapters. For local and remote files, the appropriate adapter is determined by the file extension. Relational database connection and authentication information is checked in a configuration file. Once identified, the appropriate Perl module adapter is loaded during run time automatically, saving the user from knowing in advance to use the appropriate module in the script.

Pass the name of a relational database or the path of the database file to the subroutine. The opened database object is returned. If it fails, then an error message should be generated and nothing is returned.

Important! If you are forking your perl process, always re-open your database objects in the child process, and pass a second true value to avoid using the cached database object. By default, opened databases are cached to improve efficiency, but this will be disastrous when crossing forks.

SeqFeature Store database

Provide the name of the relational database to open. Parameters for connecting to a relational database are stored in the BioToolBox configuration file, .biotoolbox.cfg. These include database adaptors, user name, password, etc. Further information regarding the configuration file may be found in Bio::ToolBox::db_helper::config.

For SQLite databases, provide the path to the .sqlite or .db file.

For in-memory databases, provide the path to the .gff or .gff3 file. The larger the file, the more memory and processing time is consumed as the file is parsed into memory. This may be fine for small model organisms like yeast, but not for vertebrate annotation.

Bam file database

Provide the path to the .bam file. This may be a local file, or a remote URL (generally supported by the adapter). If a local .bam.bai index file is not present, it will automatically be built prior to opening; this may fail if the bam file is not sorted. Remote files are not indexed.

BigWig file database

Provide the path to a local .bw or .bigwig file, or URL to a remote file.

BigWigSet database

A local or remote directory of one or more BigWig files that can treated collectively as a database. A special text file may be included in the directory to assign metadata to each BigWig file, including attributes such as type, source, display name, strand, etc. See Bio::DB::BigWigSet for more information on the formatting of the metadata file. If a metadata file is not included (it's not necessary), then the filenames are used as the name and type. Strand may be parsed from the filenames if the basename ends in .f and .r.

BigBed database

Provide the path to a local .bb or .bigbed file, or URL to a remote file.

USeq database

Provide the path to a local .bb or .bigbed file, or URL to a remote file.

Fasta database

Provide the path to a local fasta file for rapidly fetching genomic sequence. If the fasta file is not indexed, then it will be indexed automatically upon opening.

get_db_name
my $name = get_db_name($db);

This method will attempt to get the name of an opened Database object if, for some reason, it's unknown, e.g. the user only provided an opened db object. This only works for some databases, at least those I've bothered to track down and find a usable API call to use.

get_dataset_list
my $db_name = 'cerevisiae';
my @types = get_dataset_list($db_name);

This method will retrieve a list of the available features stored in the database and returns an array of the features' types.

Pass either the name of the database or an established database object. Supported databases include both Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases.

For Bio::DB::SeqFeature::Store databases, the type is represented as "type:source", corresponding to the third and second GFF columns, respectively. The types are sorted alphabetically first by source, then by method.

For Bio::DB::BigWigSet databases, the type, primary_tag, method, or display_name attribute may be used, in that respective order of availability. The list is sorted alphabetically.

verify_or_request_feature_types
# verify a dataset, either file or database type
my $db_name = 'cerevisiae';
my $dataset = 'microaray_data';
my @features = verify_or_request_feature_types(
	db      => $db_name,
	feature => $dataset,
);
unless (@features) {
	die " no valid feature provided!\n";
}

# select a list of features
my $db_name = 'cerevisiae';
my @types = (); # user not provided
@types = verify_or_request_feature_types(
	db      => $db_name,
	feature => $types,
);
# user will be promoted to select from database list
if (@types) {
	print "using types " . join(", ", @types);
}
else {
	die " no valid features selected!\n";
}

This subroutine will process a list of feature types or data sources to be used for data or feature collection. There are two modes of action. If a list was provided, it will process and verify the list. If no list was provided, then the user will be prompted to select from a list of available feature types in the provided database.

The provided list may be a list of feature types or a list of single file data sources, including Bam, bigWig, or bigBed files. If the list includes feature types, they are verified as existing in the provided database. If the list includes file names, the local files are checked and the names prefixed with "file:" for use downstream.

If no list was provided, then a list of available feature types in the provided database will be presented to the user, and the user prompted to make a selection. One or more types may be selected, and a single item may be enforced if requested. The response is filtered through the parse_list method from Bio::ToolBox::utility, so a mix of single numbers or a range of numbers may be accepted. The responses are then validated.

Two types of databases are supported: Bio::DB::SeqFeature::Store and Bio::DB::BigWigSet databases. For SeqFeature Store databases, the type is comprised of "method:source", corresponding to the third and second GFF columns. For BigWigSet databases, types correspond to either the type, method, primary_tag, or display_name attributes, in that order of availability.

For feature types or files that pass validation, they are returned as a list. Types of files that do not pass validation are printed to STDOUT.

To use this subroutine, pass an array with the following keys and values. Not every key is required.

db

The name of the database or a reference to an established BioPerl database object. A Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database are typically used. Only required when no features are provided.

feature

Pass either a single dataset name as a scalar or an anonymous array reference of a list of dataset names. These may have been provided as a command line option and need to be verified. If nothing is passed, then a list of possible datasets will be presented to the user to be chosen.

prompt

Provide a phrase to be prompted to the user to help in selecting datasets from the list. If none is provided, a generic prompt will be used.

single

A Boolean value (1 or 0) indicating whether only a single dataset is allowed when selecting datasets from a presented list. If true, only one dataset choice is accepted. If false, one or more dataset choices are allowed. Default is false.

limit

Optionally provide a word or regular expression that matches the feature type (primary_tag only; source_tag, if present, is ignored). For example, provide "gene|mrna" to only present gene and mRNA features to the user. This is only applicable when a user must select from a database list. The default is to list all available feature types.

check_dataset_for_rpm_support($dataset, [$cpu])
# count the total number of alignments
my $dataset = '/path/to/file.bam';
my $total_count = check_dataset_for_rpm_support($dataset);

# use multithreading
my $total_count = check_dataset_for_rpm_support($dataset, $cpu);

This subroutine will check a dataset for RPM, or Reads Per Million mapped, support. Only two types of database files support this, Bam files and BigBed files. If the dataset is either one of these, or the name of a database feature which points to one of these files, then it will calculate the total number of mapped alignments (Bam file) or features (BigBed file). It will return this total number. If the dataset does not support RPM (because it is not a Bam or BigBed file, for example), then it will return undefined.

Pass this subroutine one or two values. The first is the name of the dataset. Ideally it should be validated using verify_or_request_feature_types() and have an appropriate prefix (file, http, or ftp).

For multi-threaded execution pass a second value, the number of CPU cores available to count Bam files. This will speed up counting bam files considerably. The default is 2 for environments where Parallel::ForkManager is installed, or 1 where it is not.

get_new_feature_list
my $Data = Bio::ToolBox::Data->new;
my $db_name = 'cerevisiae';
my %data = get_new_feature_list(
	data      => $Data,
	db        => $db_name,
	features  => 'genes',
);

This subroutine will generate a new feature list collected from an annotation database, such as a Bio::DB::SeqFeature::Store database. It will populate an empty Bio:ToolBox::Data object data table with a list of the collected features. Columns include the Primary ID, Name, and feature Type.

The subroutine is passed an array containing three required arguments. The keys include

data

An empty Bio::ToolBox::Data object. This is required.

db

The name of the database or a reference to an established database object. Currently, this must be a Bio::DB::SeqFeature::Store database. This is required.

features

A scalar value containing a name representing the feature type(s) of feature(s) to collect. The type may be a primary_tag or primary_tag:source_tag. It may be a single element or a comma-delimited list. This is required.

chrskip

Provide a regular-expression compatible string for skipping features from specific chromosomes, for example mitochondrial or unmapped contigs.

Status messages, including the number of features found and kept, are always printed to STDOUT.

get_new_genome_list
my $Data = Bio::ToolBox::Data->new();
my $db_name = 'cerevisiae';
my $window_size = 500;
my $step_size = 250;
my %data = get_new_genome_list(
	data      => $Data,
	db        => $db_name,
	win       => $window_size,
	step      => $step_size,
);

This subroutine will generate a new list of genomic windows or intervals. The genome is split into intervals of a specified size with the specified step size.

The subroutine is passed an array containing the required arguments.

data

An empty Bio::ToolBox::Data object. This is required.

db

The name of the database or a reference to an established database object. Any BioPerl adapter, including those described in the "DESCRIPTION", are supported. Required.

win

A scalar value containing an integer representing the size of the window in basepairs. The default value is 500 bp.

step

A scalar value containing an integer representing the step size for advancing the window across the genome. The default is the window size.

chrskip
skip
exclude

Provide a regular expression compatible string for excluding specific chromosomes or classes of chromosomes, such as the mitochondrial or unmapped contigs.

Status messages are printed to STDOUT.

get_db_feature
my $db = open_db_connection('annotation.db');
my $seqfeature = get_db_feature(
    db      => $db,
    type    => 'gene',
    name    => 'ABC1',
);

This subroutine will retrieve a specific feature from a Bio::DB::SeqFeature::Store database for subsequent analysis, manipulation, and/or score retrieval using the "get_segment_score" methods. It relies upon unique information to pull out a single, unique feature.

Several attributes may be used to pull out the feature, including the feature's unique database primary ID, name and/or aliases, and GFF type (primary_tag and/or source). The "get_new_feature_list" subroutine will generate a list of features with their unique identifiers.

The primary_id attribute is preferentially used as it provides the best performance. However, it is not always portable between databases or even a database re-load. In that case, the display_name and type are used to identify potential features. Note that the display_name may not always be unique in the database. In this case, the addition of aliases may help. If all else fails, a new feature list should be generated.

To get a feature, pass an array of arguments.

db

The name of the Bio::DB::SeqFeature::Store database or a reference to an established database object.

id

Provide the primary_id tag. In the Bio::DB::SeqFeature::Store database schema this is a (usually) non-portable, unique identifier specific to a database. It provides the fastest lookup.

name

A scalar value representing the feature display_name. Aliases may be appended with semicolon delimiters.

type

Provide the feature type, which is typically expressed as primary_tag:source. Alternatively, provide just the primary_tag only.

While it is possible to identify features with any two attributes (or possibly just name or ID), the best performance is obtained with all three together. The first SeqFeature object is returned if multiple are found.

get_segment_score
my $score = get_segment_score(
	$seqfeature->seq_id,    # chromosome
	$seqfeature->start,     # start
	$seqfeature->end,       # end
	$seqfeature->strand,    # strand
	'sense',                # strandedness
	'mean',                 # method
	0,                      # return type
	$db,                    # database
	'scores.bw'             # datasets
);

Generic method for collecting score(s) from a database. This is the primary interface to all of the database handling packages, including Bam, bigWig, bigBed, and USeq files, as well as SeqFeature databases. By passing common parameters here, the appropriate database or file handling packages are loaded during run time and the appropriate methods are called. Essentially, this is where the magic happens.

Pass the method an array of nine (or more) parameters. The parameters are not keyed as a hash, and the order is explicit. The order is as follows.

  • 0 Chromosome

    String representing chromosome or sequence ID.

  • 1 Start

    Start coordinate (integer, 1-based)

  • 2 Stop

    Stop or end coordinate (integer, 1-based)

  • 3 Strand

    BioPerl style strand value, including -1, 0, and 1.

  • 4 Strandedness

    A string indicating strandedness: sense, antisense, all.

  • 5 Method

    A string representing the method of data collection.

    Current acceptable methods include mean, median, sum, min, max, stddev, count, ncount (named count), and pcount (precise count). See the individual db_helper sub classes for more details. Not all adapters support all methods.

  • 6 Return type

    An integer (0, 1, 2) representing the type of value to return.

    A return type of 0 indicates a single score should be returned, calculated using the specified method. A return type of 1 is an array or array reference of all scores found. A return type of 2 is a hash or hash reference of coordinate => score.

  • 7 Dataset database

    This is either a Bio::DB::SeqFeature::Store or a Bio::DB::BigWigSet database containing multiple score dataset features. If your dataset does not contain multiple different feature types, pass undefined.

    Pass either a string containing the name of database (which will be opened) or an already opened database object.

  • 8 Dataset

    This is the verified dataset from which the scores will be passed. The dataset should be verified using the "verify_or_request_feature_types" subroutine. For indexed file datasets, e.g. .bam or .bw files, this will the verified local path or URL. For a database, this will be a database type, sucha as primary_tag or primary_tag:source. or the path to a data file, e.g. Bam, bigWig, bigBed, or USeq file.

    If multiple datasets are to be combined, simply append them to the array (index 9, 10, etc).

The returned item is dependent on the value of the return type code.

calculate_score
my $scores = get_segment_score($chr,$start,$stop,$strand,
	'sense','ncount',1,undef,'file:/path/to/dataset.bam');
my $score = calculate_score('ncount', $scores);

This subroutine will perform the math to calculate a single score from an array of values. Current acceptable methods include mean, median, sum, min, max, stddev, count, ncount, and pcount.

Pass the subroutine two items: the name of the mathematical method enumerated above, and an array reference of the values to work on. A single score will be returned.

get_chromosome_list
my $db = open_db_connection('cerevisiae');
# get all chromosomes in the database
my @chromosomes = get_chromosome_list($db);
foreach (@chromosomes) {
	my $name = $_->[0];
	my $length = $_->[1];
	print "chromosome $name is $length bp\n";
}

This subroutine will collect a list of chromosomes or reference sequences in a Bio::DB database and return the list along with their sizes in bp. Many BioPerl-based databases are supported, including Bio::DB::SeqFeature::Store, Bio::DB::Fasta, Bio::DB::Sam, Bio::DB::HTS, Bio::DB::BigWig, Bio::DB::BigWigSet, and Bio::DB::BigBed, or any others that support the seq_ids method. See the "open_db_connection" subroutine for more information.

Pass the subroutine either the name of the database, the path to the database file, or an opened database object.

Optionally pass a second value, a regular expression compatible string for skipping specific chromosomes or chromosome classes, such as mitochondrial or unmapped contigs. The default is to return all chromosomes.

The subroutine will return an array, with each element representing each chromosome or reference sequence in the database. Each element is an anonymous array of two elements, the chromosome name and length in bp.

low_level_bam_coverage
my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop);

This is a convenience method for running the low level bam coverage method. Since both Bio::DB::Sam and Bio::DB::HTS bam file adapters are supported, and each has slight variation in the API syntax, this method helps to abstract the actual method and use the appropriate syntax depending on which adapter is loaded. It is best if the $sam object was opened using the "open_db_connection" method, or that $BAM_ADAPTER is set.

NOTE that this is the low level coverage method based on the index object, and not the similarly named high level API method. Read the adapter documentation for proper usage.

low_level_bam_fetch
my $success = low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data);

This is a convenience method for running the low level bam fetch method. Since both Bio::DB::Sam and Bio::DB::HTS bam file adapters are supported, and each has slight variation in the API syntax, this method helps to abstract the actual method and use the appropriate syntax depending on which adapter is loaded. It is best if the $sam object was opened using the "open_db_connection" method, or that $BAM_ADAPTER is set.

NOTE that this is the low level fetch method based on the index object, and not the similarly named high level API method. Read the adapter documentation for proper usage.

get_genomic_sequence
my $sequence = get_genomic_sequence($db, $chrom, $start, $stop);

This is a convenience wrapper function for fetching genomic sequence from three different supported fasta database adapters, which, of course, all have different APIs, including the Bio::DB::Fasta, Bio::DB::HTS::Faidx, and Bio::DB::Sam::Fai adapters, as well as Bio::DB::SeqFeature::Store and possibly other BioPerl databases.

Pass the opened database object, chromosome, start, and stop coordinates. This assumes BioPerl standard 1-base coordinates. Only the forward sequence is retrieved. The sequence is returned as a simple string.

INTERNAL SUBROUTINES

These are not intended for normal general consumption but are documented here so that at least some of us know what is going on.

_lookup_db_method

Internal method for determining which database adapter and method call should be used given a set of parameters. This result is cached for future data queries (likely hundreds or thousands of repeated calls). The method is stored in the global hash %DB_METHODS;

Pass the parameter array reference from "get_segment_score". This should load the appropriate db_helper sub module during run time.

_load_helper_module

Subroutine to load and import a db_helper module during run time. Sets the appropriate global variable if successful.

_load_bam_helper_module

Subroutine to determine which Bam adapter db_helper module to load, either the older samtools-based adapter or the newer HTSlib-based adapter. Uses the global and exportable variable $BAM_ADAPTER to set a preference for which adapter to use. Use 'sam' or 'hts' or some string containing these names. Do NOT change this variable after loading the helper module; it will not do what you think it will do. If both are available and a preference is not set then the hts helper is the default.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.