NAME
Lingua::TokenParse - Parse a word into scored fragment combinations
SYNOPSIS
use Data::Dumper;  # provides Dumper(), used below
use Lingua::TokenParse;
my $obj = Lingua::TokenParse->new(
word => 'antidisthoughtlessfulneodeoxyribonucleicfoo'
);
$obj->lexicon({
'a' => 'not',
'anti' => 'opposite',
'di' => 'two',
'dis' => 'away',
'eo' => 'hmmmmm',
'ful' => 'with',
'les' => 'without',
# etc...
});
$obj->constraints([ qr/eo./ ]);
$obj->parse;
print Dumper($obj->knowns);
DESCRIPTION
This class represents a Lingua::TokenParse object and contains methods to parse a given word into familiar combinations based on a lexicon of known word parts.
Words like "partition" and "automobile" are composed of different word parts. Given a lexicon of known fragments, one can partition a word into a list of its (possibly overlapping) fragment combinations.
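To make the idea of partitioning concrete, here is a minimal sketch that enumerates every way to cut a short word into contiguous fragments. The splits function is a hypothetical helper written for this illustration. It is not part of Lingua::TokenParse, and the module's own partitioning may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's algorithm.
# Return every way to cut $word into contiguous, non-empty fragments.
# Each result is an array reference of fragments, in order.
sub splits {
    my ($word) = @_;
    my $n = length $word;
    return ([]) if $n == 0;    # one way to split nothing: no fragments
    my @results;
    for my $i (1 .. $n) {
        my $head = substr $word, 0, $i;
        my $tail = substr $word, $i;
        # Prepend this head to every split of the remaining tail.
        push @results, [ $head, @$_ ] for splits($tail);
    }
    return @results;
}

# Print each split with its fragments joined by a period.
print join('.', @$_), "\n" for splits('part');
```

A word of n characters has 2**(n-1) such splits, so exhaustive enumeration is only practical for short words.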
Each of these combinations can be given a score, which represents a measure of word familiarity. This measure is a set of ratios of known to unknown parts.
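As a rough illustration of such a measure, the sketch below computes two ratios for one combination: the fraction of fragments that are known and the fraction of characters they cover. Both the familiarity function and these particular ratios are assumptions made for this example. The formula the module actually uses is not spelled out here.

```perl
use strict;
use warnings;

# Hypothetical scoring helper: an assumed formula, not the module's.
# Returns (fraction of fragments known, fraction of characters known).
sub familiarity {
    my ($lexicon, @fragments) = @_;
    my ($known_frags, $known_chars, $total_chars) = (0, 0, 0);
    for my $frag (@fragments) {
        $total_chars += length $frag;
        next unless defined $lexicon->{$frag};    # undefined means unknown
        $known_frags++;
        $known_chars += length $frag;
    }
    return (0, 0) unless @fragments && $total_chars;
    return ($known_frags / @fragments, $known_chars / $total_chars);
}

my %lexicon = (anti => 'opposite', dis => 'away');
my ($by_frag, $by_char) = familiarity(\%lexicon, qw(anti dis foo));
printf "%.2f of fragments known, %.2f of characters known\n",
    $by_frag, $by_char;
```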
The lexicon is a simple fragment => definition list and must have a definition for each entry. This definition can be an empty string (i.e. ''), but if it is undefined the fragment is considered an unknown.
Please see the sample code in the distribution eg/ directory for examples of how this module can be used.
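The difference between an empty and an undefined definition can be shown with a small, self-contained check. The is_known function is a hypothetical helper for this illustration only and does not come from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: a fragment counts as known only when its
# definition is defined. An empty string is still a definition.
sub is_known {
    my ($lexicon, $fragment) = @_;
    return defined $lexicon->{$fragment} ? 1 : 0;
}

my %lexicon = (
    anti => 'opposite',    # known, with a definition
    ful  => '',            # known, with an empty definition
    eo   => undef,         # listed, but treated as unknown
);

for my $fragment (qw(anti ful eo xyz)) {
    printf "%-4s %s\n", $fragment,
        is_known(\%lexicon, $fragment) ? 'known' : 'unknown';
}
```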
METHODS
new
$obj = Lingua::TokenParse->new(
word => $word,
lexicon => \%lexicon,
);
Return a new Lingua::TokenParse object.
This method will automatically call the partition methods (detailed below) if a word and lexicon are provided.
The word can be any string; however, make sure that it does not include the characters you use for the separator, not_defined and unknown strings (described below).
The lexicon must be a hash reference with word fragments as keys and definitions as their respective values. Definitions must be defined in order for the trim_knowns method to work properly.
parse
$obj->parse;
$obj->parse($word);
This method resets the partition lists and then calls all the individual parsing methods that are detailed below.
If a string is provided the word to parse is first set to that.
build_parts
$obj->build_parts;
Construct an array of the word partitions, accessed via the parts method.
build_combinations
$obj->build_combinations;
Compute the array of all possible word part combinations, excluding any that match a constraint. The result is accessed via the combinations method.
build_knowns
$obj->build_knowns;
Compute the familiar word part combinations, accessed via the knowns method.
This method handles word parts that have a prefix or suffix hyphen. The hyphens encode which word part combinations are syntactically illegal, and this information can be used to score, or even throw out, bogus combinations.
build_definitions
$obj->build_definitions;
Construct a hash of the definitions of the word parts in each combination in the keys of the knowns hash. This hash is accessed via the definitions method.
trim_knowns
$obj->trim_knowns;
Trim the hash of known combinations by concatenating adjacent unknown fragments and throwing out combinations with a score of zero.
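The concatenation step can be sketched as follows. The merge_unknowns function is a hypothetical stand-in written for this illustration, not the module's implementation, and it leaves out the removal of zero-scored combinations.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
# Join runs of adjacent unknown fragments into a single fragment.
sub merge_unknowns {
    my ($lexicon, @fragments) = @_;
    my @merged;
    my $prev_unknown = 0;
    for my $frag (@fragments) {
        my $unknown = defined $lexicon->{$frag} ? 0 : 1;
        if ($unknown && $prev_unknown) {
            $merged[-1] .= $frag;    # extend the current unknown run
        }
        else {
            push @merged, $frag;
        }
        $prev_unknown = $unknown;
    }
    return @merged;
}

my %lexicon = (anti => 'opposite', ful => 'with');
print join('.', merge_unknowns(\%lexicon, qw(anti x y ful z))), "\n";
```

Merging keeps a single unknown stretch from being counted as several separate unknown fragments.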
CONVENIENCE METHOD
output_knowns
@knowns = $obj->output_knowns;
print Dumper \@knowns;
# Look at the "even friendlier output."
print scalar $obj->output_knowns(
separator => $separator,
not_defined => $not_defined,
unknown => $unknown,
);
This method returns the familiar word part combinations in a couple of "human accessible" formats. Each has familiarity scores rounded to two decimals and fragment definitions shown in a readable layout.
- separator
The string used between fragment definitions. Default is a plus symbol surrounded by single spaces: ' + '.
- not_defined
Indicates a known fragment that has no definition. Default is a single period: '.'.
- unknown
Indicates an unknown fragment. The default is the question mark: '?'.
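To show how these three strings could combine, here is a small sketch that renders one combination as a line of definitions. The describe function and its output layout are assumptions made for this illustration. The exact text that output_knowns produces is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
# Render one combination's definitions using the documented defaults.
sub describe {
    my ($lexicon, $fragments, %opt) = @_;
    my $separator   = $opt{separator}   // ' + ';
    my $not_defined = $opt{not_defined} // '.';
    my $unknown     = $opt{unknown}     // '?';

    my @shown;
    for my $fragment (@$fragments) {
        my $definition = $lexicon->{$fragment};
        if    (!defined $definition) { push @shown, $unknown }
        elsif ($definition eq '')    { push @shown, $not_defined }
        else                         { push @shown, $definition }
    }
    return join $separator, @shown;
}

my %lexicon = (anti => 'opposite', ful => '');
print describe(\%lexicon, [qw(anti ful zz)]), "\n";
print describe(\%lexicon, [qw(anti ful zz)], separator => ' | '), "\n";
```

This also shows why the word should not contain these strings: a '?' or '.' inside a fragment would be hard to tell apart from the markers.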
ACCESSORS
word
$word = $obj->word;
$obj->word($word);
The word to partition, which can be any string.
lexicon
$lexicon = $obj->lexicon;
$obj->lexicon(\%lexicon);
The lexicon is a hash reference with word fragments as keys and definitions their respective values.
parts
$parts = $obj->parts;
The array reference of all possible word partitions.
combinations
$combinations = $obj->combinations;
The array reference of all possible word part combinations.
knowns
$knowns = $obj->knowns;
The hash reference of known (non-zero scored) combinations with their familiarity values.
definitions
$definitions = $obj->definitions;
The hash reference of the definitions provided for each fragment of the combinations with the values of unknown fragments set to undef.
constraints
$constraints = $obj->constraints;
$obj->constraints(\@regexps);
An optional, user-defined array reference of regular expressions to apply to the list of known combinations. This acts as a negative pruning device. That is, if a match is successful, the entry is excluded from the list.
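Negative pruning of this kind can be sketched with a nested grep. The prune function is a hypothetical helper, and the sketch assumes that each combination is a string with its fragments joined by a period. Neither is taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
# Keep only the combinations that match none of the constraints.
sub prune {
    my ($constraints, @combinations) = @_;
    return grep {
        my $combo = $_;
        # The inner grep counts matching constraints; keep on zero.
        !grep { $combo =~ $_ } @$constraints;
    } @combinations;
}

# Assumed representation: fragments joined by a period.
my @combinations = ('n.eo.de', 'neo.de', 'ne.o.de', 'anti.dis');
print "$_\n" for prune([ qr/eo./ ], @combinations);
```

Because the period in qr/eo./ matches any character, both 'n.eo.de' and 'neo.de' are dropped here. A constraint meant to match a literal separator would need it escaped.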
TO DO
Compute the time required for a given parse.
Make a method to request definitions for unknown fragments and call it... learn().
Use traditional stemming to trim down the common knowns and see if the score is the same...
Synthesize a term list based on a thesaurus of word-part definitions. That is, go in reverse. Non-trivial!
SEE ALSO
DEDICATION
For my Grandmother and English teacher Frances Jones.
THANK YOU
Thank you to Luc St-Louis for helping me increase the speed while eliminating the exponential memory footprint. I wish I knew your email address so I could tell you. :-) lucs++
AUTHOR
Gene Boggs <gene@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2003-2004 by Gene Boggs
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.