NAME

Lingua::TokenParse - Parse a word into scored fragment combinations

SYNOPSIS

use Lingua::TokenParse;
use Data::Dumper;  # provides Dumper(), used below

my $obj = Lingua::TokenParse->new(
    word => 'antidisthoughtlessfulneodeoxyribonucleicfoo'
);
$obj->lexicon({
    'a'    => 'not',
    'anti' => 'opposite',
    'di'   => 'two',
    'dis'  => 'away',
    'eo'   => 'hmmmmm',
    'ful'  => 'with',
    'les'  => 'without',
    # etc...
});
$obj->constraints([ qr/eo./ ]);
$obj->parse;
print Dumper($obj->knowns);

DESCRIPTION

A Lingua::TokenParse object parses a given word into familiar combinations, based on a lexicon of known word parts.

Words like "partition" and "automobile" are composed of different word parts. Given a lexicon of known fragments, one can partition a word into a list of its (possibly overlapping) fragment combinations.
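As a rough illustration of what "partitioning a word" can mean, the sketch below enumerates every way to cut a string into contiguous pieces by treating each gap between characters as one bit of a counter. The all_partitions helper and the '.' used to join pieces are assumptions made for this example; this is not the module's own implementation, which works from the lexicon.

```perl
use strict;
use warnings;

# Hypothetical helper: return every way to cut $word into contiguous pieces.
# Bit ($i - 1) of $mask decides whether to cut just before character $i.
sub all_partitions {
    my ($word) = @_;
    my @chars = split //, $word;
    return () unless @chars;
    my @results;
    for my $mask (0 .. 2 ** $#chars - 1) {
        my @pieces = ($chars[0]);
        for my $i (1 .. $#chars) {
            if ($mask & (1 << ($i - 1))) {
                push @pieces, $chars[$i];     # start a new piece here
            }
            else {
                $pieces[-1] .= $chars[$i];    # extend the current piece
            }
        }
        push @results, join '.', @pieces;     # '.' is only for display
    }
    return @results;
}

print "$_\n" for all_partitions('abc');
```

A word of n characters has 2**(n-1) such partitions, which is why long words need pruning by a lexicon and constraints.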

Each of these combinations can be given a score, which represents a measure of word familiarity. This measure is a set of ratios of known to unknown parts.
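To make the idea of a familiarity ratio concrete, here is a minimal sketch that computes two ratios for one combination: the fraction of fragments found in the lexicon, and the fraction of characters those fragments cover. The familiarity helper and this particular pair of ratios are assumptions for illustration only; the module's actual scoring may differ.

```perl
use strict;
use warnings;

# Hypothetical scoring helper, not the module's formula: returns the fraction
# of fragments, and the fraction of characters, that the lexicon knows.
sub familiarity {
    my ($lexicon, @fragments) = @_;
    my ($known_frags, $known_chars, $total_chars) = (0, 0, 0);
    for my $fragment (@fragments) {
        $total_chars += length $fragment;
        next unless defined $lexicon->{$fragment};    # undefined means unknown
        $known_frags++;
        $known_chars += length $fragment;
    }
    return (0, 0) unless $total_chars;                # nothing to score
    return ($known_frags / scalar(@fragments), $known_chars / $total_chars);
}

my %lexicon = (anti => 'opposite', dis => 'away');
my ($by_fragment, $by_char) = familiarity(\%lexicon, qw(anti dis foo));
printf "fragments: %.2f, characters: %.2f\n", $by_fragment, $by_char;
```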

The lexicon is a simple fragment => definition list and must have a definition for each entry. This definition can be an empty string (i.e. ''), but if it is undefined, the fragment is considered unknown.
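The difference between an empty and an undefined definition can be seen in plain Perl. In this sketch the is_known helper is a hypothetical stand-in for whatever check the module performs internally; it only shows that '' is a defined value while undef (or a missing key) is not.

```perl
use strict;
use warnings;

my %lexicon = (
    anti => 'opposite',
    s    => '',       # known fragment with an empty definition
    foo  => undef,    # undefined definition: treated as unknown
);

# Hypothetical helper: a fragment is known only if its definition is defined.
sub is_known {
    my ($lexicon, $fragment) = @_;
    return defined $lexicon->{$fragment} ? 1 : 0;
}

for my $fragment (qw(anti s foo bar)) {
    printf "%-4s => %s\n", $fragment,
        is_known(\%lexicon, $fragment) ? 'known' : 'unknown';
}
```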

Please see the sample code in the distribution eg/ directory for examples of how this module can be used.

METHODS

new

$obj = Lingua::TokenParse->new(
    word => $word,
    lexicon => \%lexicon,
);

Return a new Lingua::TokenParse object.

This method will automatically call the partition methods (detailed below) if a word and lexicon are provided.

The word can be any string. However, make sure that it does not include the characters you use for the separator, not_defined and unknown strings (described below).

The lexicon must be a hash reference with word fragments as keys and definitions as their respective values. Definitions must be defined in order for the trim_knowns method to work properly.

parse

$obj->parse;
$obj->parse($word);

This method resets the partition lists and then calls all the individual parsing methods that are detailed below.

If a string is provided the word to parse is first set to that.

build_parts

$obj->build_parts;

Construct an array of the word partitions, accessed via the parts method.

build_combinations

$obj->build_combinations;

Compute the array of all possible word part combinations, excluding those that match a constraint. The result is accessed via the combinations method.

build_knowns

$obj->build_knowns;

Compute the familiar word part combinations, accessed via the knowns method.

This method handles word parts containing prefix and suffix hyphens. These hyphens encode information about which word combinations are syntactically illegal, and that information can be used to score (or even throw out) bogus combinations.

build_definitions

$obj->build_definitions;

Construct a hash of the definitions of the word parts in each combination in the keys of the knowns hash. This hash is accessed via the definitions method.

trim_knowns

$obj->trim_knowns;

Trim the hash of known combinations by concatenating adjacent unknown fragments and throwing out combinations with a score of zero.
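The merging of adjacent unknown fragments can be sketched as follows. The merge_unknowns helper is hypothetical and operates on a plain list of fragments rather than on the module's internal hash of knowns; it only illustrates the merging step, not the zero-score filtering.

```perl
use strict;
use warnings;

# Hypothetical helper: merge runs of adjacent unknown fragments into one.
sub merge_unknowns {
    my ($lexicon, @fragments) = @_;
    my @merged;
    my $previous_unknown = 0;
    for my $fragment (@fragments) {
        my $unknown = defined $lexicon->{$fragment} ? 0 : 1;
        if ($unknown && $previous_unknown) {
            $merged[-1] .= $fragment;    # extend the current unknown run
        }
        else {
            push @merged, $fragment;     # known fragment, or start of a run
        }
        $previous_unknown = $unknown;
    }
    return @merged;
}

my %lexicon = (anti => 'opposite', ful => 'with');
print join('.', merge_unknowns(\%lexicon, qw(anti x y ful z))), "\n";
```

Here the unknown fragments x and y sit next to each other and become the single unknown fragment xy, while z stays separate because a known fragment comes between them.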

CONVENIENCE METHOD

output_knowns

@knowns = $obj->output_knowns;
print Dumper \@knowns;

# Look at the "even friendlier output."
print scalar $obj->output_knowns(
    separator   => $separator,
    not_defined => $not_defined,
    unknown     => $unknown,
);

This method returns the familiar word part combinations in a couple of "human accessible" formats. Each has familiarity scores rounded to two decimals and fragment definitions shown in a readable layout.

separator

The string used between fragment definitions. Default is a plus symbol surrounded by single spaces: ' + '.

not_defined

Indicates a known fragment that has no definition. Default is a single period: '.'.

unknown

Indicates an unknown fragment. The default is the question mark: '?'.

ACCESSORS

word

$word = $obj->word;
$obj->word($word);

The word to partition, which can be any string.

lexicon

$lexicon = $obj->lexicon;
$obj->lexicon(\%lexicon);

The lexicon is a hash reference with word fragments as keys and definitions as their respective values.

parts

$parts = $obj->parts;

The array reference of all possible word partitions.

combinations

$combinations = $obj->combinations;

The array reference of all possible word part combinations.

knowns

$knowns = $obj->knowns;

The hash reference of known (non-zero scored) combinations with their familiarity values.

definitions

$definitions = $obj->definitions;

The hash reference of the definitions provided for each fragment of the combinations, with the values of unknown fragments set to undef.

constraints

$constraints = $obj->constraints;
$obj->constraints(\@regexps);

An optional, user-defined array reference of regular expressions to apply to the list of known combinations. This acts as a negative pruning device. That is, if a match is successful, the entry is excluded from the list.
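Negative pruning of this kind amounts to keeping only the entries that match none of the regular expressions. The sketch below shows that filter in plain Perl; the prune helper and the sample entry strings are made up for illustration and are not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only entries that match none of the constraints.
sub prune {
    my ($constraints, @entries) = @_;
    return grep {
        my $entry = $_;
        !grep { $entry =~ $_ } @$constraints;    # true when nothing matches
    } @entries;
}

my @constraints = (qr/eo./);                     # "eo" followed by any character
my @entries     = qw(n.eo.de neo.de ne.o.de);    # made-up sample entries
print "$_\n" for prune(\@constraints, @entries);
```

Only the entry without an "eo" followed by another character survives the filter.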

TO DO

Compute the time required for a given parse.

Make a method to request definitions for unknown fragments and call it... learn().

Use traditional stemming to trim down the common knowns and see if the score is the same...

Synthesize a term list based on a thesaurus of word-part definitions. That is, go in reverse. Non-trivial!

SEE ALSO

Math::BaseCalc

DEDICATION

For my Grandmother and English teacher Frances Jones.

THANK YOU

Thank you to Luc St-Louis for helping me increase the speed while eliminating the exponential memory footprint. I wish I knew your email address so I could tell you. :-) lucs++

AUTHOR

Gene Boggs <gene@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2003-2004 by Gene Boggs

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.