NAME
Lingua::TokenParse - Parse a word into scored fragment combinations
SYNOPSIS
use Lingua::TokenParse;
my $word = 'partition';
my %lexicon;
@lexicon{qw(art ion ti)} = qw(foo bar baz);
my $obj = Lingua::TokenParse->new(
    word    => $word,
    lexicon => \%lexicon,
);
print scalar $obj->output_knowns;
# Okay. Now, let's parse a new word.
$obj->word('metaphysical');
$obj->lexicon({
    'meta-' => 'more comprehensive',
    'ta'    => '',
    'phys'  => 'natural science, singular',
    '-ic'   => 'being, containing',
    '-al'   => 'relating to, characterized by',
});
$obj->rules([ qr/^me\./ ]); # Remove combos that start with "me."
$obj->parse;
my @knowns = $obj->output_knowns;
DESCRIPTION
This class provides methods to parse a given word into familiar combinations based on a lexicon of known word parts.
A word like "partition" is actually composed of a few different word parts. Given a lexicon of known fragments, it is possible to partition this word into possibly overlapping fragment combinations.
Each of these combinations can be given a score, which represents a measure of word familiarity. This measure is a set of simple ratios of known to unknown parts.
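As a rough illustration of such ratios, here is a minimal sketch. It is not the module's implementation: the familiarity helper is hypothetical, and it assumes that each score is the known share of the total (known fragments over all fragments, known characters over all characters) and that fragments are joined with a period, as the rule in the SYNOPSIS suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: score one combination (an array reference of
# fragments) against a lexicon. Assumption: each ratio is known / total,
# and the fragment list is not empty.
sub familiarity {
    my ($fragments, $lexicon) = @_;
    my ($known_frags, $known_chars, $total_chars) = (0, 0, 0);
    for my $frag (@$fragments) {
        $total_chars += length $frag;
        next unless defined $lexicon->{$frag};   # undefined means unknown
        $known_frags++;
        $known_chars += length $frag;
    }
    return (
        sprintf('%.2f', $known_frags / @$fragments),
        sprintf('%.2f', $known_chars / $total_chars),
    );
}

my %lexicon = (art => 'foo', ion => 'bar', ti => 'baz');
my ($frag_score, $char_score) = familiarity([qw(p art it ion)], \%lexicon);
print "p.art.it.ion [$frag_score, $char_score]\n";
```

Here two of the four fragments and six of the nine characters are known, so the sketch prints the scores 0.50 and 0.67.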
The lexicon must have a definition for each entry in order for the current trim_knowns method to do the right thing. The definition can be an empty string (i.e. ''). However, if the definition is undefined, the fragment is considered an unknown.
Please see the sample code in the distribution eg/ directory for examples of how this module can be used.
METHODS
new(%arguments)
$obj = Lingua::TokenParse->new(
    word        => $word,
    lexicon     => \%lexicon,
    separator   => $separator,
    not_defined => $not_defined,
    unknown     => $unknown,
);
Return a new Lingua::TokenParse object.
This method will automatically call the partition methods (detailed below) if a word and lexicon are provided.
The word can be any string. However, make sure that it does not include the same characters you use for the separator, not_defined and unknown strings (described below).
The lexicon must be a hash reference with word fragments as keys and definitions as their respective values. Definitions must be defined in order for the trim_knowns method to work properly.
The separator is the string used to separate fragment definitions in the output_knowns method. The default is a plus symbol surrounded by single spaces (' + ').
The not_defined argument is the string used by the output_knowns method to indicate a known fragment that has no definition. The default is a period (.).
The unknown argument is the string used by the output_knowns method to indicate an unknown fragment. The default is the question mark (?).
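To show how these three strings could interact, here is a minimal sketch that builds one definition line by hand. The describe_fragment helper and the sample fragment list are hypothetical, and the code only mirrors the behavior described above. It does not call the module.

```perl
use strict;
use warnings;

# The documented defaults for the three output strings.
my ($separator, $not_defined, $unknown) = (' + ', '.', '?');

my %lexicon = (
    ta   => '',                           # known, but no definition
    phys => 'natural science, singular',  # known and defined
);

# Hypothetical helper: pick the text shown for a single fragment.
sub describe_fragment {
    my ($frag) = @_;
    return $unknown     unless defined $lexicon{$frag};
    return $not_defined if $lexicon{$frag} eq '';
    return $lexicon{$frag};
}

my @fragments = qw(me ta phys ical);
my $line = join $separator, map { describe_fragment($_) } @fragments;
print "$line\n";
```

Because the definition of phys itself contains a comma and spaces, this also hints at why the word and the three marker strings should not share characters.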
parse()
$obj->parse;
This method resets the partition lists and then calls all the individual parsing methods that are detailed below.
Call this method after resetting the object with a new word and, optionally, a new lexicon.
build_parts()
$obj->build_parts;
Construct an array of the word partitions, accessed via the parts method.
build_combinations()
$obj->build_combinations;
Compute the array of all possible word part combinations, accessed via the combinations method.
build_knowns()
$obj->build_knowns;
Compute the familiar word part combinations, accessed via the knowns method.
This method handles word parts containing prefix and suffix hyphens. These hyphens encode information about which word combinations are syntactically illegal, and that information can be used to score (or even throw out) bogus combinations.
build_definitions()
$obj->build_definitions;
Construct a hash of the definitions of the word parts for each combination that is a key of the knowns hash. This hash is accessed via the definitions method.
trim_knowns()
$obj->trim_knowns;
Trim the hash of known combinations by concatenating adjacent unknown fragments and throwing out combinations with a score of zero.
The end of this method is where user-defined rules are processed.
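The concatenation of adjacent unknown fragments can be pictured with the following minimal sketch. The merge_unknowns helper is hypothetical and is not taken from the module. It assumes that a fragment is unknown exactly when its lexicon value is undefined.

```perl
use strict;
use warnings;

# Hypothetical helper: glue runs of adjacent unknown fragments together.
sub merge_unknowns {
    my ($fragments, $lexicon) = @_;
    my @merged;
    my $prev_unknown = 0;
    for my $frag (@$fragments) {
        my $is_unknown = defined $lexicon->{$frag} ? 0 : 1;
        if ($is_unknown && $prev_unknown) {
            $merged[-1] .= $frag;     # extend the previous unknown run
        }
        else {
            push @merged, $frag;      # start a new entry
        }
        $prev_unknown = $is_unknown;
    }
    return @merged;
}

my %lexicon = (ti => 'baz', ion => 'bar');
print join('.', merge_unknowns([qw(p a r ti t ion)], \%lexicon)), "\n";
```

With this lexicon the fragments p, a and r collapse into the single unknown fragment par, giving par.ti.t.ion.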
output_knowns()
print scalar $obj->output_knowns;
@knowns = $obj->output_knowns;
Convenience method to return the familiar word part combinations with their familiarity scores (rounded to two decimals) and fragment definitions.
In scalar context, a single newline-separated string is returned. In list context, each scored combination, with its fragment definitions, is returned as a separate list element.
Here is the format of the output:
Combination [fragment familiarity, character familiarity]
Fragment definitions (with the defined fragment and unknown
separator).
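The context-dependent return value described above can be sketched with Perl's wantarray built-in. The scored_lines function and its placeholder strings are hypothetical. This shows the general idiom, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a method that returns either a list of
# entries or a single newline-separated string, depending on context.
sub scored_lines {
    my @lines = ('first entry', 'second entry');   # placeholder data
    return wantarray ? @lines : join("\n", @lines);
}

my @as_list   = scored_lines();          # list context: two elements
my $as_string = scalar scored_lines();   # scalar context: one string

print scalar(@as_list), " entries\n";
print "$as_string\n";
```

This is why the examples above write print scalar $obj->output_knowns: print supplies list context, so scalar is needed to request the single string.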
ACCESSORS
These accessors both get and set their respective values. Note that, if you set the word, lexicon or rules after construction, you must manually initialize the parse lists and run the partition methods (via the parse method).
Also, note that it is useless to set the parts, combinations and knowns lists, since they are computed by the partition methods.
word()
$word = $obj->word($word);
The actual word to partition.
The word can be any string. However, make sure that it does not include the same characters you use for the separator, not_defined and unknown strings.
lexicon()
$lexicon = $obj->lexicon(\%lexicon);
The lexicon is a hash reference with word fragments as keys and definitions as their respective values. Definitions must be defined in order for the trim_knowns method to work properly.
parts()
$parts = $obj->parts;
The array reference of word partitions.
This method is only useful for fetching, since the parts are computed by the build_parts method.
combinations()
$combinations = $obj->combinations;
The array reference of all possible word part combinations.
This method is only useful for fetching, since the combinations are computed by the build_combinations method.
knowns()
$knowns = $obj->knowns;
The hash reference of known combinations (keys) with their familiarity scores (values). Note that only the non-zero scored combinations are kept.
This method is only useful for fetching, since the knowns are computed by the build_knowns method.
definitions()
$definitions = $obj->definitions;
The hash reference of the definitions provided for each fragment of the combinations, with the values of unknown fragments set to undef.
rules()
$rules = $obj->rules(\@rules);
An optional, user-defined array reference of regular expressions to apply to the list of known combinations. If a match is successful, the entry is removed from the list.
To reiterate, this is a negative pruning device that is used in the trim_knowns method.
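The pruning effect of a rule can be sketched as follows. The %knowns hash and its scores are made-up sample data, and the loop is only an illustration of matching each combination against each rule. It is not the module's code.

```perl
use strict;
use warnings;

# Made-up sample data: combination => familiarity score.
my %knowns = (
    'me.ta.phys.ic.al' => 0.8,
    'meta.phys.ic.al'  => 1.0,
    'met.aphys.ical'   => 0.2,
);

# The rule from the SYNOPSIS: drop combinations that start with "me."
my @rules = ( qr/^me\./ );

# Remove every combination that matches at least one rule.
for my $combo (keys %knowns) {
    delete $knowns{$combo} if grep { $combo =~ $_ } @rules;
}

print "$_\n" for sort keys %knowns;
```

Only me.ta.phys.ic.al is removed. The other two combinations start with "met", not "me.", so the escaped period in the pattern keeps them.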
separator()
$separator = $obj->separator($separator);
The separator is the string used to separate fragment definitions in the output_knowns method. The default is a plus symbol surrounded by single spaces (' + ').
not_defined()
$not_defined = $obj->not_defined($not_defined);
The not_defined argument is the string used by the output_knowns method to indicate a known fragment that has no definition. The default is a period (.).
unknown()
$unknown = $obj->unknown($unknown);
The unknown argument is the string used by the output_knowns method to indicate an unknown fragment. The default is the question mark (?).
TO DO
Compute the time required for a given parse.
Use traditional stemming to trim down the common knowns.
Make a method to request definitions for unknown fragments and call it learn.
Make the output_knowns() method suck less.
Synthesize a term list based on word part (thesaurus) definitions. That is, go in reverse. Non-trivial!
SEE ALSO
DEDICATION
For my Grandmother and English teacher
Frances Jones <frances@theletterlink.com>
THANK YOU
Thank you to Luc St-Louis for helping me increase the speed while eliminating the exponential memory footprint. lucs++
AUTHOR
Gene Boggs <gene@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2003-2004 by Gene Boggs
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.