NAME

Bio::ToolBox::db_helper::bam

DESCRIPTION

This module collects dataset scores from a binary bam file (.bam) of alignments. Bam files may be local or remote, and are usually prefixed with 'file:', 'http://', or 'ftp://'.

Collected data values may be restricted to strand by specifying the desired strandedness (sense, antisense, or all), depending on the method of data collection. Collecting scores, that is, the basepair coverage of alignments over the region of interest, does not currently support stranded data collection. However, enumerating alignments (the count method) and collecting alignment lengths (the length method) do support stranded data collection. Alignments are checked to see whether their midpoint lies within the search interval before they are counted or their lengths are collected.
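The midpoint test and strand filter described above can be sketched in plain Perl. This is an illustration only, not the module's internal code: the helper names midpoint_in_interval() and strand_matches() are hypothetical, the midpoint is assumed to be rounded down to an integer, and an unstranded feature (strand 0) is assumed to be treated as forward.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the alignment midpoint lies inside the
# search interval. Coordinates are 1-based and inclusive.
sub midpoint_in_interval {
    my ($aln_start, $aln_end, $start, $stop) = @_;
    my $mid = int( ($aln_start + $aln_end) / 2 );    # assumption: round down
    return ($mid >= $start && $mid <= $stop) ? 1 : 0;
}

# Hypothetical helper: true when an alignment on $aln_strand (1 or -1)
# satisfies the requested strandedness relative to the feature strand.
sub strand_matches {
    my ($feature_strand, $aln_strand, $strandedness) = @_;
    return 1 if $strandedness eq 'all';
    my $ref  = $feature_strand < 0 ? -1 : 1;    # assumption: 0 counts as forward
    my $same = $aln_strand == $ref;
    return $same ? 1 : 0 if $strandedness eq 'sense';
    return $same ? 0 : 1 if $strandedness eq 'antisense';
    die "unknown strandedness '$strandedness'\n";
}

# An alignment spanning 95..105 has midpoint 100, the first base of 100..200.
print midpoint_in_interval(95, 105, 100, 200), "\n";
# A reverse-strand alignment is antisense to a forward-strand feature.
print strand_matches(1, -1, 'antisense'), "\n";
```

An alignment that overlaps the interval but whose midpoint falls outside it is skipped under this rule, so each alignment is attributed to at most one of several adjacent, non-overlapping intervals.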

Currently, paired-end bam files are treated as single-end files. Several operations do not work well with paired-end alignments (search, strand, length, etc.). If paired-end alignments are to be analyzed, they should first be converted to another format (BigWig or BigBed). See the biotoolbox scripts 'bam2gff_bed.pl' or 'bam2wig.pl' for solutions.

To speed up the program and avoid repetitive opening and closing of the files, the opened bam file object is stored in a global hash in case it is needed again.
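The caching idea can be sketched as follows. This is a minimal illustration, not the module's actual code: the hash name %OPENED_BAM and the subroutines expensive_open() and get_bam_object() are hypothetical, and a plain hash reference stands in for the opened bam file object.

```perl
use strict;
use warnings;

my %OPENED_BAM;        # path or URL => opened object
my $open_calls = 0;    # counts how often the expensive open actually runs

# Hypothetical stand-in for the real, expensive open of a Bam file.
sub expensive_open {
    my $path = shift;
    $open_calls++;
    return { path => $path };
}

# Return the cached object when available, otherwise open and remember it.
sub get_bam_object {
    my $path = shift;
    $OPENED_BAM{$path} //= expensive_open($path);
    return $OPENED_BAM{$path};
}

my $first  = get_bam_object('example.bam');
my $second = get_bam_object('example.bam');
print "opened $open_calls time(s)\n";    # the second call reuses the cache
```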

USAGE

The module requires Lincoln Stein's Bio::DB::Sam to be installed.

Load the module at the beginning of your program.

use Bio::ToolBox::db_helper::bam;

It will automatically export the names of its subroutines.

open_bam_db()

This subroutine will open a Bam database connection. Pass either the local path to a Bam file (.bam extension) or the URL of a remote Bam file. A remote bam file must be indexed. A local bam file may be automatically indexed upon opening if the user has write permissions in the parent directory.

The opened Bio::DB::Sam object will be cached for later use. If you do not want this to happen (in the case of forks, for example), pass a second true argument.

It will return the opened database object.

collect_bam_scores()

This subroutine will collect only the data values from a binary bam file for the specified database region. The positional information of the scores is not retained, and the values are best further processed through some statistical method (mean, median, etc.).

The subroutine is passed seven or more arguments in the following order:

1) The chromosome or seq_id
2) The start position of the segment to collect 
3) The stop or end position of the segment to collect 
4) The strand of the original feature (or region), -1, 0, or 1.
5) A scalar value representing the desired strandedness of the data 
   to be collected. Acceptable values include "sense", "antisense", 
   or "all". Only those scores which match the indicated 
   strandedness are collected.
6) The type of data collected. 
   Acceptable values include 'score' (returns the basepair coverage
   of alignments over the region of interest), 'count' (returns the 
   number of alignments found at each base position in the region, 
   recorded at the alignment's midpoint), or 'length' (returns the 
   mean lengths of the alignments found at each base position in 
   the region, recorded at the alignment's midpoint). 
7) The paths, either local or remote, to one or more Bam files.

The subroutine returns an array of the defined dataset values found within the region of interest.
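A call might look like the sketch below, which follows the argument order listed above and reduces the returned values to a mean. The region (chr1:1000-2000) is hypothetical, the Bam file is taken from the command line, and the mean() helper is not part of the module. The sketch assumes the subroutine returns a list in list context, as described above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Mean of a list of values, or undef when the list is empty.
sub mean {
    my @values = @_;
    return @values ? sum(@values) / @values : undef;
}

# Run only when a Bam file path or URL is given on the command line.
if (my $bam = shift @ARGV) {
    # Loaded at run time so the helper above works without the module.
    require Bio::ToolBox::db_helper::bam;
    Bio::ToolBox::db_helper::bam->import;

    # Hypothetical region on the forward strand.
    my @scores = collect_bam_scores(
        'chr1', 1000, 2000,    # 1-3) seq_id, start, stop
        1,                     # 4) strand of the feature
        'all',                 # 5) strandedness
        'score',               # 6) basepair coverage
        $bam,                  # 7) one or more Bam files
    );
    my $mean = mean(@scores);
    print defined $mean ? "mean coverage: $mean\n" : "no scores found\n";
}
```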

collect_bam_position_scores()

This subroutine will collect the score values from a binary bam file for the specified database region keyed by position.

The subroutine is passed the same arguments as collect_bam_scores().

The subroutine returns a hash of the defined dataset values found within the region of interest keyed by position. The feature midpoint is used as the key position. When multiple features are found at the same position, a simple mean (for length data methods) or sum (for count methods) is returned.
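How several alignments sharing one midpoint collapse into a single value can be sketched in plain Perl. The helper summarize_by_position() and the sample observations are hypothetical and only illustrate the sum-for-count and mean-for-length rule stated above; they are not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: combine per-alignment observations into one value
# per position. Each observation is [midpoint, alignment_length].
sub summarize_by_position {
    my ($method, @observations) = @_;
    my %collected;
    push @{ $collected{ $_->[0] } }, $_->[1] for @observations;

    my %pos2value;
    for my $pos (keys %collected) {
        my @lengths = @{ $collected{$pos} };
        if ($method eq 'count') {
            # sum: each alignment contributes one count
            $pos2value{$pos} = scalar @lengths;
        }
        elsif ($method eq 'length') {
            # simple mean of the alignment lengths
            my $total = 0;
            $total += $_ for @lengths;
            $pos2value{$pos} = $total / @lengths;
        }
        else {
            die "unsupported method '$method'\n";
        }
    }
    return %pos2value;
}

my @obs    = ([150, 36], [150, 50], [175, 40]);
my %counts = summarize_by_position('count', @obs);
print join(', ', map {"$_ => $counts{$_}"} sort { $a <=> $b } keys %counts), "\n";
```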

sum_total_bam_alignments()

This subroutine will sum the total number of properly mapped alignments in a bam file. Pass the subroutine one to four arguments.

1) The name of the Bam file which should be counted. Alternatively,  
   an opened Bio::DB::Sam object may also be given. Required.
2) Optionally pass the minimum mapping quality of the reads to be 
   counted. The default is 0, where all alignments are counted.
3) Optionally pass a boolean value (1 or 0) indicating whether 
   the Bam file represents paired-end alignments. Only proper 
   alignment pairs are counted. The default is to treat all 
   alignments as single-end.
4) Optionally pass the number of parallel processes to execute 
   when counting alignments. Walking through a Bam file is 
   time consuming but can be easily parallelized. The module 
   Parallel::ForkManager is required, and the default is a 
   conservative two processes when it is installed.
   

The subroutine will return the number of alignments.
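A call with all four arguments might look like the sketch below. The %opt hash and its key names are hypothetical conveniences holding the defaults described above; the Bam file is taken from the command line. Passing an explicit process count assumes Parallel::ForkManager is installed.

```perl
use strict;
use warnings;

# Hypothetical option hash; values mirror the documented defaults.
my %opt = (
    min_mapq => 0,    # 2) count all alignments
    paired   => 0,    # 3) treat alignments as single-end
    cpu      => 2,    # 4) number of parallel processes
);

# Run only when a Bam file path or URL is given on the command line.
if (my $bam = shift @ARGV) {
    # Loaded at run time so the script compiles without the module.
    require Bio::ToolBox::db_helper::bam;
    Bio::ToolBox::db_helper::bam->import;

    my $total = sum_total_bam_alignments($bam, @opt{qw(min_mapq paired cpu)});
    print "$bam: $total alignments\n";
}
```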

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0.