NAME

GH::MspTools - a perl package for doing "amazing" tricks with MSPs.

SYNOPSIS

# this example (and this man page) may be out of date.  Think twice.

use GH::Msp;
use GH::MspTools qw(getMSPs findBestOverlap findBestInclusion);

# in reality, these would be real sequences.
$s1 = "acgcttac";
$s2 = "ttacgcactatcct";

$arrayRef = getMSPs($s1, $s2);
if (defined($arrayRef)) {
  @sortedmsps = sort {$a->getLen() <=> $b->getLen()} @{$arrayRef};
  foreach $msp (@sortedmsps) {
    print $msp->dump(), "\n";
  }
}

$bestOverlapRef = findBestOverlap($s1, $s2);
if (defined($bestOverlapRef)) {
  ($cost, $leftStart, $leftEnd, $rightStart, $rightEnd) = 
       @{$bestOverlapRef};
}

$bestInclusionRef = findBestInclusion($s1, $s2);
if (defined($bestInclusionRef)) {
  ($cost, $leftStart, $leftEnd, $rightStart, $rightEnd) = 
       @{$bestOverlapRef};
}

# tuning/twiddling/finger-poken knobs, and their default
# values.
$GH::MspTools::tableSize = 32767;
$GH::MspTools::wordSize = 12;
$GH::MspTools::extensionThreshold = 12;
$GH::MspTools::mspThreshold = 12;
$GH::MspTools::matchScore = 1;
$GH::MspTools::mismatchScore = -5;
$GH::MspTools::ovFudge = 30;

DESCRIPTION

GH::MspTools supplies a set of routines that find simliar regions in DNA sequences.

getMSPs() simply finds a set of maximal segment pairs for two sequences and returns a reference to the array containing them. See GH::Msp for more information on what the msps contain. Changing the order of the arguments will interchange the pos1 and pos2 values in the msps, but the set will be essentially the same.

findBestOverlap($seq1, $seq2) finds the best overlap between the first sequence and the second. It assumes that seq1 is on the left and that seq2 is on the right. It does not reverse complement either sequence. A complete search to see if a pair of sequences overlap might look like:

$aRef1 = findBestOverlap($s1, $s2);
$aRef2 = findBestOverlap($s1, $s2rev);
$aRef3 = findBestOverlap($s1rev, $s2);
$aRef4 = findBestOverlap($s1rev, $s2rev);

findBestInclusion($seq1, $seq2) find the best way to include the second sequence in the first sequence. In otherwords, is seq2 a subsequence of seq1. To check if seq1 is a subsequence of seq2, you need to interchange the arguments findBestInclusion($seq2, $seq1).

CONFIGURATION VARIABLES.

There are several knobs and sliders that are available for finger-poken. This section describes them, at least from a 35,000 foot level.

$GH::MspTools::tableSize

This variable set's the size of the hash table that the msp package uses. 32kb seems to be a good starting point.

$GH::MspTools::wordSize

This variable sets the size of the string that the hash table uses. The hashing function is length dependent, 12 to 15 seem to be a useful range.

$GH::MspTools::extensionThreshold

This variable controls whether or not a proto-MSP is extended or cut-off. It's value is intimately tied to matchScore and misMatchScore.

$GH::MspTools::matchScore

This value is the score that the MSP searching algorithm gives to a pair of characters that match. In a simple dynamic programming algorithm, this would be the score for a "match".

$GH::MspTools::mismatchScore

This value is the score that the MSP searching algorithm gives to a pair of characters that do not match. In a simple dynamic programming algorithm, this would be the score for a "mismatch".

$GH::MspTools::mspThreshold

This is the threshold for deciding whether a proto-MSP is accepted or rejected. Scores must be above this threshold.

$GH::MspTools::ovFudge

This is the amount of "slop" that the various routines which build paths from sets of MSPS (e.g. overlap finding, inclusion finding) will allow and still be willing to "join" a pair of MSPs.

EXPORT

None by default.

EXPORT_OK

getMSP findBestOverlap findBestInclusion

BUGS

The parameters for finding and stringing together msps should be documented and made tunable.

AUTHOR

George Hartzell, hartzell@cs.berkeley.edu

SEE ALSO

GH::Msp

perl(1)