NAME
Lingua::TokenParse - Parse a word into scored fragment combinations
SYNOPSIS
use Lingua::TokenParse;
use Data::Dumper;    # exports Dumper()
my $p = Lingua::TokenParse->new(
    word => 'antidisthoughtlessfulneodeoxyribonucleicfoo',
    lexicon => {
        a    => 'not',
        anti => 'opposite',
        di   => 'two',
        dis  => 'separation',
        eo   => 'hmmmmm', # etc.
    },
    constraints => [ qr/eo(?:\.|$)/ ], # no parts ending in eo allowed
);
print Dumper($p->knowns);
DESCRIPTION
This class provides methods to parse a given word into familiar combinations, based on a lexicon of known word parts. The lexicon is a simple fragment => definition list.
Words like "automobile" and "deoxyribonucleic" are composed of different roots, prefixes, suffixes, etc. With a lexicon of known fragments, a word can be partitioned into a list of its (possibly overlapping) known and unknown fragment combinations.
These combinations can be given a score, which represents a measure of word familiarity. This measure is a set of ratios of known to unknown fragments and letters.
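As a rough illustration of the idea above, the following sketch enumerates every way to cut a word into contiguous fragments and scores one such combination by the ratio of known fragments and known letters. It is not the module's actual algorithm; the helper names splits and score are hypothetical, and the exact ratios the module computes may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: return every way to cut $word into contiguous fragments.
sub splits {
    my ($word) = @_;
    return ([]) if $word eq '';
    my @out;
    for my $len (1 .. length $word) {
        my $head = substr($word, 0, $len);
        my $rest = substr($word, $len);
        # Prepend the head fragment to every split of the remainder.
        push @out, [ $head, @$_ ] for splits($rest);
    }
    return @out;
}

# Hypothetical helper: (known fragments / fragments, known letters / letters).
sub score {
    my ($lexicon, @frags) = @_;
    my @known = grep { exists $lexicon->{$_} } @frags;
    my ($letters, $known_letters) = (0, 0);
    $letters       += length($_) for @frags;
    $known_letters += length($_) for @known;
    return (scalar(@known) / scalar(@frags), $known_letters / $letters);
}

my %lexicon = (anti => 'opposite', dis => 'separation');
for my $combo (splits('antidis')) {
    my ($frag_ratio, $letter_ratio) = score(\%lexicon, @$combo);
    next unless $frag_ratio == 1;    # show only fully known combinations
    printf "%s %.2f %.2f\n", join('.', @$combo), $frag_ratio, $letter_ratio;
}
```

A word of n letters has 2**(n-1) such splits, so this naive enumeration is only practical for short words.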
METHODS
new
$p = Lingua::TokenParse->new(
    verbose      => 0,
    word         => $word,
    lexicon      => \%lexicon,
    lexicon_file => $lexicon_file,
    constraints  => \@constraints,
);
Return a new Lingua::TokenParse object.
This method will automatically call the partition methods (detailed below) if a word and lexicon are provided.
The word can be any string. However, make sure that it does not include the same characters you use for the separator, not_defined and unknown strings (described below).
The lexicon must be a hash reference with word fragments as keys and definitions as their respective values.
parse
$p->parse;
$p->parse($string);
This method clears the partition lists and then calls all the individual parsing methods that are detailed below. If a string is provided, the object's word attribute is first reset to that string.
build_parts
$parts = $p->build_parts;
Construct an array reference of the word partitions.
build_definitions
$known_definitions = $p->build_definitions;
Construct a table of the definitions of the word parts.
build_combinations
$combos = $p->build_combinations;
Compute the array reference of all possible word part combinations.
build_knowns
$raw_knowns = $p->build_knowns;
Compute the familiar word part combinations.
This method also handles word parts that carry prefix or suffix hyphens. The hyphens encode which word combinations are syntactically illegal, and this information can be used to score (or even throw out) bogus combinations.
lexicon_cache
$p->lexicon_cache;
$p->lexicon_cache( $lexicon_file );
$p->lexicon_cache( lexicon_file => $lexicon_file );
Back up and retrieve the hash reference of token entries.
If this method is called with no arguments, the object's lexicon_file is used. If it is called with a single argument, the object's lexicon_file attribute is temporarily overridden. If it is called with two arguments and the first is the string "lexicon_file", that attribute is set before proceeding.
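One plausible way to back up and retrieve a lexicon hash reference is Perl's core Storable module. The sketch below assumes that approach; it is not taken from the module's source, and the cache file is a temporary placeholder.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

my %lexicon = (anti => 'opposite', dis => 'separation');

# Hypothetical cache file; a real program would use a stable path.
my ($fh, $lexicon_file) = tempfile(UNLINK => 1);
close $fh;

# Back up the lexicon to disk...
store(\%lexicon, $lexicon_file);

# ...and read it back as a hash reference.
my $restored = retrieve($lexicon_file);
print "$_ => $restored->{$_}\n" for sort keys %$restored;
```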
CONVENIENCE METHOD
output_knowns
@known_list = $p->output_knowns;
print Dumper \@known_list;
# Look at the "even friendlier output."
print scalar $p->output_knowns(
    separator   => $separator,
    not_defined => $not_defined,
    unknown     => $unknown,
);
This method returns the familiar word part combinations in a couple of "human accessible" formats. Each has familiarity scores rounded to two decimals and fragment definitions shown in a readable layout. It accepts the following optional arguments:
- separator
The string used between fragment definitions. Default is a plus symbol surrounded by single spaces: ' + '.
- not_defined
Indicates a known fragment that has no definition. Default is a single period: '.'.
- unknown
Indicates an unknown fragment. Default is the question mark: '?'.
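To show how these three strings might interact, here is a minimal sketch of a formatter that joins fragment definitions into a single line. The function render is hypothetical and is not the module's output_knowns. It assumes that unknown fragments have an undef definition (as described under the definitions accessor) and that a known fragment without a definition has an empty string.

```perl
use strict;
use warnings;

# Hypothetical formatter: join fragment definitions into one readable line.
sub render {
    my ($frags, $definitions, %opt) = @_;
    my $separator   = defined $opt{separator}   ? $opt{separator}   : ' + ';
    my $not_defined = defined $opt{not_defined} ? $opt{not_defined} : '.';
    my $unknown     = defined $opt{unknown}     ? $opt{unknown}     : '?';
    my @shown;
    for my $frag (@$frags) {
        my $def = $definitions->{$frag};
        if    (!defined($def)) { push @shown, $unknown }        # not in the lexicon
        elsif ($def eq '')     { push @shown, $not_defined }    # known, no definition
        else                   { push @shown, $def }
    }
    return join $separator, @shown;
}

my %definitions = (anti => 'opposite', dis => '', foo => undef);
print render([qw(anti dis foo)], \%definitions), "\n";
# prints "opposite + . + ?"
print render([qw(anti dis foo)], \%definitions, separator => ' | '), "\n";
# prints "opposite | . | ?"
```

This also shows why the word itself should not contain the separator, not_defined or unknown characters: the output would become ambiguous.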
ACCESSORS
word
$p->word($word);
$word = $p->word;
The word to partition, which can be any string.
lexicon
$p->lexicon(%lexicon);
$p->lexicon(\%lexicon);
$lexicon = $p->lexicon;
The lexicon is a hash reference with word fragments as keys and definitions as their respective values. It can be set with either a hash or a hash reference.
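An accessor that takes either a hash or a hash reference could be written roughly as follows. This is a sketch only: the class My::Sketch is hypothetical and stands in for the real object, whose implementation may differ.

```perl
use strict;
use warnings;

package My::Sketch;    # hypothetical class, for illustration only

sub new { return bless { lexicon => {} }, shift }

# Accept either a single hash reference or a flat key/value list.
sub lexicon {
    my $self = shift;
    if (@_) {
        $self->{lexicon} =
            (@_ == 1 && ref($_[0]) eq 'HASH') ? $_[0] : +{ @_ };
    }
    return $self->{lexicon};
}

package main;

my $p = My::Sketch->new;
$p->lexicon(anti => 'opposite');                      # set with a hash
print join(',', sort keys %{ $p->lexicon }), "\n";
$p->lexicon({ dis => 'separation', di => 'two' });    # set with a hash reference
print join(',', sort keys %{ $p->lexicon }), "\n";
```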
parts
$parts = $p->parts;
The computed array reference of all possible word partitions.
combinations
$combinations = $p->combinations;
The computed array reference of all possible word part combinations.
knowns
$knowns = $p->knowns;
The computed hash reference of known (non-zero scored) combinations with their familiarity values.
definitions
$definitions = $p->definitions;
The hash reference of the definitions provided for each fragment of the combinations with the values of unknown fragments set to undef.
constraints
$constraints = $p->constraints;
$p->constraints(\@regexps);
An optional, user-defined array reference of regular expressions to apply when constructing the lists of parts and combinations. This acts as a negative pruning device: if any expression matches an entry, that entry is excluded from the list.
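The negative pruning described above amounts to a filter that keeps an entry only when no constraint matches it. A minimal sketch, using a made-up list of parts and the constraint from the SYNOPSIS (the module's internal data layout is not assumed here):

```perl
use strict;
use warnings;

my @parts       = qw(n ne neo eo o);        # hypothetical candidate parts
my @constraints = ( qr/eo(?:\.|$)/ );       # no parts ending in eo allowed

# Keep a part only if none of the constraint patterns match it.
my @kept;
for my $part (@parts) {
    my $excluded = grep { $part =~ $_ } @constraints;
    push @kept, $part unless $excluded;
}

print "@kept\n";    # prints "n ne o"
```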
EXAMPLES
Example code can be found in the distribution's eg/ directory.
TO DO
Turn the lame output_knowns method into a sensible XML serializer (of optionally everything).
Compute the time required for a given parse.
Make a method to add definitions for unknown fragments and call it... learn().
Use traditional stemming to trim down the common knowns and see if the score is the same...
Synthesize a term list based on a thesaurus of word-part definitions. That is, go in reverse. Non-trivial!
SEE ALSO
DEDICATION
For my Grandmother and English teacher Frances Jones.
THANK YOU
Thank you to Luc St-Louis for helping me increase the speed while eliminating the exponential memory footprint. I wish I knew your email address so I could tell you. lucs++
AUTHOR
Gene Boggs <gene@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2003-2004 by Gene Boggs
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.