NAME

Lingua::TokenParse - Parse a word into scored fragment combinations

SYNOPSIS

use Lingua::TokenParse;

my $word = 'partition';
my %lexicon;
@lexicon{qw(art ion ti)} = qw(foo bar baz);
my $obj = Lingua::TokenParse->new(
    word    => $word,
    lexicon => \%lexicon,
);
print scalar $obj->output_knowns;

# Okay.  Now, let's parse a new word.
$obj->word('metaphysical');
$obj->lexicon({
    'meta-' => 'more comprehensive',
    'ta'    => 'foo',
    'phys'  => 'natural science, singular',
    '-ic'   => 'being, containing',
    '-al'   => 'relating to, characterized by',
});
$obj->rules([ qr/^me\./ ]);
$obj->parse;
my @knowns = $obj->output_knowns;

ABSTRACT

This class represents a Lingua::TokenParse object and contains methods to parse a given word into familiar combinations based on a lexicon of known word parts.

DESCRIPTION

A word like "partition" is actually composed of a few different word parts. Given a lexicon of known fragments, it is possible to partition this word into combinations of these (possibly overlapping) parts. Each of these combinations can be given a score, which represents a measure of familiarity.

Currently, this familiarity measure is a simple ratio of known to unknown parts.
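
As a rough illustration of such a score, the sketch below computes the fraction of fragments in one combination that appear in the lexicon. The helper name familiarity() is hypothetical, and dividing by the total number of fragments (rather than by the unknown ones) is an assumption made to keep the value between 0 and 1; the module's own calculation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::TokenParse.
# Assumption: score = known fragments / all fragments.
sub familiarity {
    my ($fragments, $lexicon) = @_;
    return 0 unless @$fragments;
    # grep in scalar context counts the fragments found in the lexicon.
    my $known = grep { exists $lexicon->{$_} } @$fragments;
    return $known / @$fragments;
}

my %lexicon = (art => 'foo', ion => 'bar', ti => 'baz');

# "partition" cut as p.art.it.ion: "art" and "ion" are known, 2 of 4.
printf "%.2f\n", familiarity([qw(p art it ion)], \%lexicon);   # prints 0.50
```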

Note that every lexicon entry must have a definition for the current trim_knowns() method to work correctly.

See the sample code in the distribution's eg/ directory for examples of how this module can be used.

METHODS

new()

$obj = Lingua::TokenParse->new(
    word    => $word,
    lexicon => \%lexicon,
);

Return a new Lingua::TokenParse object.

This method will automatically call the partition methods (detailed below) if a word and lexicon are provided.

parse()

$obj->parse();

This method resets the partition lists and then calls all the individual parsing methods that are detailed below.

Call this method after resetting the object with a new word and, optionally, a new lexicon.

build_parts()

$obj->build_parts();

Construct an array of the word partitions, accessed via the parts() method.

build_combinations()

$obj->build_combinations();

Recursively compute the array of all possible word part combinations, accessed via the combinations() method.
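
The recursion can be pictured with a minimal sketch that enumerates every way to cut a word into contiguous fragments. The helper name all_combinations() and the '.' joiner are assumptions for illustration (the joiner is suggested by the rule in the SYNOPSIS); this is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::TokenParse.
# Returns every cut of $word into contiguous fragments, joined with '.'.
sub all_combinations {
    my ($word) = @_;
    my @results = ($word);                       # the uncut word itself
    for my $i (1 .. length($word) - 1) {
        my $head = substr($word, 0, $i);
        # Recurse on the remainder and prefix each result with $head.
        push @results, map { "$head.$_" } all_combinations(substr($word, $i));
    }
    return @results;
}

my @combinations = all_combinations('partition');
print scalar(@combinations), "\n";               # prints 256
```

Under this scheme a word of n letters yields 2**(n-1) combinations, so the count doubles with every added letter.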

build_knowns()

$obj->build_knowns();

Compute the familiar word part combinations, accessed via the knowns() method.

This method handles word parts containing prefix and suffix hyphens. These hyphens encode information about which word combinations are syntactically illegal, and that information can be used to score (or even throw out) bogus combinations.
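
One way to read the hyphen convention is sketched below. It assumes that a key with a trailing hyphen (such as 'meta-') is a prefix that needs a following fragment, and that a key with a leading hyphen (such as '-al') is a suffix that needs a preceding fragment. Both the helper lookup_fragment() and these rules are assumptions for illustration; the module's actual checks may be different.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::TokenParse.
# Returns the lexicon key matching fragment $i of @$frags, or nothing.
sub lookup_fragment {
    my ($frags, $i, $lexicon) = @_;
    my $frag = $frags->[$i];
    return $frag    if exists $lexicon->{$frag};
    # Assumed rule: a 'meta-' style key may not be the last fragment.
    return "$frag-" if exists $lexicon->{"$frag-"} && $i < $#$frags;
    # Assumed rule: an '-al' style key may not be the first fragment.
    return "-$frag" if exists $lexicon->{"-$frag"} && $i > 0;
    return;
}

my %lexicon = ('meta-' => 'more comprehensive', '-al' => 'relating to');
my @frags   = qw(meta physic al);
for my $i (0 .. $#frags) {
    my $key = lookup_fragment(\@frags, $i, \%lexicon);
    printf "%s => %s\n", $frags[$i], defined $key ? $key : '?';
}
```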

build_definitions()

$obj->build_definitions();

Construct a hash of the definitions of the word parts in each combination that is a key of the knowns hash. This hash is accessed via the definitions() method.

trim_knowns()

$obj->trim_knowns();

Trim the hash of known combinations by concatenating adjacent unknown fragments and throwing out combinations with a score of zero.
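
The concatenation step might look like the following sketch, which glues each run of unknown fragments in a '.'-joined combination into a single fragment. The helper name merge_unknowns() and the '.' joiner are assumptions, and the sketch does not re-check whether a merged run happens to match a lexicon entry.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::TokenParse.
sub merge_unknowns {
    my ($combination, $lexicon) = @_;
    my @merged;
    my $prev_unknown = 0;
    for my $frag (split /\./, $combination) {
        my $unknown = !exists $lexicon->{$frag};
        if ($unknown && $prev_unknown) {
            $merged[-1] .= $frag;      # extend the current unknown run
        }
        else {
            push @merged, $frag;       # start a new fragment
        }
        $prev_unknown = $unknown;
    }
    return join '.', @merged;
}

my %lexicon = (art => 'foo', ion => 'bar');
print merge_unknowns('p.art.i.t.ion', \%lexicon), "\n";   # prints p.art.it.ion
```

Several raw combinations can collapse to the same trimmed key this way, which shrinks the list of knowns.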

output_knowns()

print scalar $obj->output_knowns();

@knowns = $obj->output_knowns();

Convenience method to return the familiar word part combinations with their familiarity scores (rounded to two decimals) and fragment definitions.

In scalar context, a single newline-separated string is returned. In array context, each scored combination, together with its fragment definitions, is a separate entry in an array.

Here is the format of the output:

Combination [fragment familiarity, character familiarity]
Fragment definitions (with the defined fragment separator and a ?
character for unknowns).
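
To make the layout concrete, here is a small sketch that builds one entry in this format from values supplied by hand. The helper format_entry() and the sample scores are made up for illustration; real entries come from output_knowns().

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::TokenParse.
sub format_entry {
    my ($combination, $scores, $definitions, $separator) = @_;
    my @defs;
    for my $d (@$definitions) {
        # Unknown fragments (empty definitions) are shown as '?'.
        push @defs, (defined $d && length $d) ? $d : '?';
    }
    return sprintf "%s [%.2f, %.2f]\n%s",
        $combination, @$scores, join($separator, @defs);
}

# Illustrative values: 2 of 4 fragments and 6 of 9 characters known.
print format_entry('p.art.it.ion', [2 / 4, 6 / 9], ['', 'foo', '', 'bar'], ' + '), "\n";
# prints:
# p.art.it.ion [0.50, 0.67]
# ? + foo + ? + bar
```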

ACCESSORS

These accessors both get and set their respective values. Note that if you set the word, lexicon or rules after construction, you must call the parse() method yourself to reset the partition lists and rerun the partition methods.

Also, note that it is useless to set the parts, combinations and knowns lists, since they are computed by the partition methods.

word()

$word = $obj->word($word);

The actual word to partition.

lexicon()

$lexicon = $obj->lexicon(\%lexicon);

The hash reference of word parts (keys) with their (optional) definitions (values).

parts()

$parts = $obj->parts();

The array reference of word partitions.

Note that this method is only useful for fetching, since the parts are computed by the build_parts() method.

combinations()

$combinations = $obj->combinations();

The array reference of all possible word part combinations.

Note that this method is only useful for fetching, since the combinations are computed by the build_combinations() method.

knowns()

$knowns = $obj->knowns();

The hash reference of known combinations (keys) with their familiarity scores (values). Note that only the non-zero scored combinations are kept.

Note that this method is only useful for fetching, since the knowns are computed by the build_knowns() method.

definitions()

$definitions = $obj->definitions();

The hash reference of the definitions provided for each fragment of the combinations in the knowns hash. Note that the unknown fragments are defined as an empty string.

separator()

$separator = $obj->separator($separator);

The character (or characters) separating the fragment definitions that are produced by the output_knowns() method.

The default is ' + ' (a plus symbol surrounded by single spaces).

rules()

$rules = $obj->rules($rules);

An optional, user-defined array of regular expressions to apply to the list of known combinations. Any combination that matches one of the expressions is removed from the list.
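
The effect of a rule list can be sketched as a simple filter over the keys of a knowns-style hash. The helper apply_rules() and the sample scores are hypothetical; the rule itself is the one from the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::TokenParse.
# Deletes every combination that matches at least one rule.
sub apply_rules {
    my ($knowns, $rules) = @_;
    for my $combination (keys %$knowns) {
        delete $knowns->{$combination}
            if grep { $combination =~ $_ } @$rules;
    }
    return $knowns;
}

# Scores here are placeholders, not real module output.
my %knowns = (
    'me.ta.phys.ic.al' => 0.6,
    'meta.phys.ic.al'  => 1,
);
apply_rules(\%knowns, [ qr/^me\./ ]);
print join(', ', sort keys %knowns), "\n";   # prints meta.phys.ic.al
```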

DEPENDENCIES

None

DISCLAIMER

This module uses some clunky, inefficient algorithms. For instance, a 50 letter word (like a medical term) just might take until the end of time to parse and possibly longer. Please write to me with much needed improvements!

TO DO

Add user-defined callbacks for trimming known combinations by rule.

Compute the time required for a given parse.

Synthesize a term list based on word part (thesaurus) definitions. (That is, go in reverse. Non-trivial!)

DEDICATION

My Grandmother and English teacher - Frances Jones

AUTHOR

Gene Boggs <cpan@ology.net>

COPYRIGHT AND LICENSE

Copyright 2003 by Gene Boggs

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.