NAME

Text::Median - Perl extension for determining the set median of a set of strings

SYNOPSIS

use Text::Median;
my $medianobj = Text::Median->new(module => "StringDistanceModule", method => "distancemethod");
$medianobj->add_data(\@data);
print $medianobj->find_median();

DESCRIPTION

The median of a set of strings is defined as the string that minimizes the sum of the distances between itself and every string in the set. The true median is not necessarily a member of the set. Finding the median of a set of strings has been shown to be an NP-complete problem; see "Topology of Strings: Median String is NP-Complete", C. de la Higuera and F. Casacuberta, Theoretical Computer Science, Vol. 230, Issue 1-2, January 2000.

This module is concerned with calculating the set median, which is the member of the set that minimizes the sum of distances to the other members. There are many string distance algorithms, including the string edit distance, the keyboard distance and the algorithm used by String::Similarity. This module is designed to let the programmer choose the distance algorithm. The method used to calculate the distance must take two arguments (the two strings to compare). The module assumes that you supply an appropriate method and does not check that the method exists.

The algorithm used in this module is O(N**2) in the number of strings N, since it compares every pair of strings.
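As an illustration of the definition above, the following is a minimal, self-contained sketch of a set median calculation with a pluggable two-argument distance function. It is not the code of Text::Median itself: the names edit_distance and set_median are hypothetical helpers introduced for this example, and a plain Levenshtein distance stands in for whichever distance module you would actually use.

```perl
use strict;
use warnings;

# Plain Levenshtein edit distance. Hypothetical helper standing in for
# any two-argument distance method; not part of Text::Median.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                       # substitution
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;      # deletion
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;   # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return the member of @$strings with the smallest summed distance to
# all other members. Ties go to the earliest string. Every pair is
# visited, hence the O(N**2) number of distance calls.
sub set_median {
    my ($strings, $dist) = @_;
    my ($best, $best_sum);
    for my $i (0 .. $#$strings) {
        my $sum = 0;
        for my $j (0 .. $#$strings) {
            next if $i == $j;
            $sum += $dist->($strings->[$i], $strings->[$j]);
        }
        if (!defined $best_sum || $sum < $best_sum) {
            ($best, $best_sum) = ($strings->[$i], $sum);
        }
    }
    return $best;
}

my @data = qw(cart car cat bat);
print set_median(\@data, \&edit_distance), "\n";   # prints "cat"
```

Here "cat" is one edit away from each of the other three strings (sum 3), while every other member has a larger sum, so it is the set median.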

Methods

new(module=>'module name', method=>'method', max=>0);

Creates a new Text::Median object. The module and method arguments are required. The module argument names the distance module that the Text::Median object will use, and the method argument names the method within that module that calculates the distance. The module must be a valid module.

The max argument is slightly different. Most string distance modules return a larger number for a larger distance. However, String::Similarity (and potentially other modules) calculates the similarity of two strings, and the higher the result, the more similar the strings are. In that case the set median is the string with the largest sum of similarities rather than the string with the smallest sum of distances. If you are going to use String::Similarity (or similar modules) you must set the max argument in order to derive the set median.
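The effect of switching from smallest sum to largest sum can be sketched as follows. This is an illustrative example only, not the module's code: similarity is a toy, hypothetical scoring function (it is not the String::Similarity algorithm), and the $max flag of this set_median helper is an assumption that mirrors the idea behind the max argument.

```perl
use strict;
use warnings;

# Toy similarity in [0, 1]: fraction of positions holding the same
# character, relative to the longer string. Hypothetical stand-in for a
# similarity method; higher means more alike.
sub similarity {
    my ($s, $t) = @_;
    my $longer  = length($s) > length($t) ? length($s) : length($t);
    my $shorter = length($s) < length($t) ? length($s) : length($t);
    return 1 if $longer == 0;
    my $same = grep { substr($s, $_, 1) eq substr($t, $_, 1) } 0 .. $shorter - 1;
    return $same / $longer;
}

# Pick the set median by summed score. With $max false the smallest sum
# wins (distances); with $max true the largest sum wins (similarities).
sub set_median {
    my ($strings, $score, $max) = @_;
    my ($best, $best_sum);
    for my $i (0 .. $#$strings) {
        my $sum = 0;
        for my $j (0 .. $#$strings) {
            next if $i == $j;
            $sum += $score->($strings->[$i], $strings->[$j]);
        }
        my $better = !defined $best_sum ? 1
                   : $max               ? $sum > $best_sum
                   :                      $sum < $best_sum;
        ($best, $best_sum) = ($strings->[$i], $sum) if $better;
    }
    return $best;
}

my @data = qw(cart car cat bat);
print set_median(\@data, \&similarity, 1), "\n";   # largest sum: prints "cat"
print set_median(\@data, \&similarity, 0), "\n";   # smallest sum: prints "bat"
```

With a similarity score, minimizing the sum selects the string that is least like the others ("bat" here), which is why the direction of the comparison has to be reversed.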

add_data(@data)

Adds a set of data to the object. If the object already holds a set of data, the new data is appended to the existing set.

find_median()

Determines the set median of the given set of strings and returns it. This is where the main calculation occurs, so it may take some time depending on the size of the data set. Note that the distance matrix required for the calculation is held in memory, so when additional data is added to the set, a subsequent calculation is faster than starting from scratch. Because the matrix is held in memory, it can also have a large memory footprint.
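One way such an in-memory distance matrix could behave is sketched below. This is an assumption about the general technique, not a description of how Text::Median is implemented: the %dist cache, the $calls counter and length_distance are all hypothetical, and the sketch assumes a symmetric distance so that each pair needs to be stored only once.

```perl
use strict;
use warnings;

my @strings;      # all strings added so far
my %dist;         # cache: "i,j" (with i < j) => distance
my $calls = 0;    # number of real distance computations

# Hypothetical stand-in distance: absolute difference in length.
sub length_distance {
    my ($s, $t) = @_;
    $calls++;
    return abs(length($s) - length($t));
}

sub add_data { push @strings, @_ }

sub find_median {
    my ($best, $best_sum);
    for my $i (0 .. $#strings) {
        my $sum = 0;
        for my $j (0 .. $#strings) {
            next if $i == $j;
            my $key = $i < $j ? "$i,$j" : "$j,$i";
            # Compute each pair once; later calls reuse the stored value.
            $dist{$key} = length_distance($strings[$i], $strings[$j])
                unless exists $dist{$key};
            $sum += $dist{$key};
        }
        ($best, $best_sum) = ($strings[$i], $sum)
            if !defined $best_sum || $sum < $best_sum;
    }
    return $best;
}

add_data(qw(a bbb ccccc));
find_median();              # computes 3 pairwise distances
add_data('dd');
find_median();              # computes only the 3 pairs involving 'dd'
print "distance computations: $calls\n";   # prints "distance computations: 6"
```

The cache holds N*(N-1)/2 entries for N strings, which illustrates both the speed-up after adding data and the memory cost of keeping the matrix around.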

EXPORT

None by default.

SEE ALSO

Any Perl modules relating to string distance, including those implementing the Levenshtein distance, String::Similarity, and String::KeyboardDistance.

AUTHOR

Leigh Metcalf, <leigh@fprime.net>

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Leigh Metcalf

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.9 or, at your option, any later version of Perl 5 you may have available.