NAME

Lingua::TokenParse - Parse a word into scored fragment combinations

SYNOPSIS

use Lingua::TokenParse;
use Data::Dumper;
my $p = Lingua::TokenParse->new(
  word => 'antidisthoughtlessfulneodeoxyribonucleicfoo',
  lexicon => {
      a    => 'not',
      anti => 'opposite',
      di   => 'two',
      dis  => 'separation',
      eo   => 'hmmmmm',  # etc.
  },
  constraints => [ qr/eo(?:\.|$)/ ], # no parts ending in eo allowed
);
print Dumper($p->knowns);

DESCRIPTION

This class provides methods to parse a given word into familiar combinations based on a lexicon of known word parts. The lexicon is a simple hash of fragment => definition pairs.

Words like "automobile" and "deoxyribonucleic" are composed of different roots, prefixes, suffixes, etc. With a lexicon of known fragments, a word can be partitioned into a list of its (possibly overlapping) known and unknown fragment combinations.
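As a rough illustration of what partitioning means, the sketch below lists every way to cut a short word into contiguous fragments and marks the fragments that are missing from a small lexicon. It is not the module's implementation: the all_splits helper and the parenthesis marker for unknown fragments are hypothetical, and it only covers simple non-overlapping splits.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::TokenParse): return every way
# to split $word into contiguous, non-overlapping fragments.
sub all_splits {
    my ($word) = @_;
    my $n = length $word;
    return ([]) if $n == 0;

    my @splits;
    # Each bit of $mask decides whether to cut at one of the n-1 gaps
    # between adjacent letters, so there are 2**(n-1) splits in total.
    for my $mask (0 .. 2 ** ($n - 1) - 1) {
        my @parts;
        my $start = 0;
        for my $gap (0 .. $n - 2) {
            if ($mask & (1 << $gap)) {
                push @parts, substr($word, $start, $gap + 1 - $start);
                $start = $gap + 1;
            }
        }
        push @parts, substr($word, $start);
        push @splits, \@parts;
    }
    return @splits;
}

my %lexicon = (a => 'not', di => 'two', dis => 'separation');

for my $split (all_splits('adis')) {
    # Wrap fragments that are not in the lexicon in parentheses.
    my @tagged = map { exists $lexicon{$_} ? $_ : "($_)" } @$split;
    print join('.', @tagged), "\n";
}
```

The number of splits doubles with each added letter, so this brute-force approach is only practical for short words.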

These combinations can be given a score, which represents a measure of word familiarity. This measure is a set of ratios of known to unknown fragments and letters.
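The exact formula is not given here, so the following is only a sketch under an assumption: each ratio is taken as the known count divided by the total count, once for fragments and once for letters. The familiarity helper is hypothetical and the module may compute its score differently.

```perl
use strict;
use warnings;

# Hypothetical scoring helper (an assumption, not the module's own code).
# Returns the share of known fragments and the share of known letters.
sub familiarity {
    my ($lexicon, @fragments) = @_;
    my @known = grep { exists $lexicon->{$_} } @fragments;

    my $letters = 0;
    $letters += length($_) for @fragments;
    my $known_letters = 0;
    $known_letters += length($_) for @known;

    return (
        fragments => scalar(@fragments) ? scalar(@known) / scalar(@fragments) : 0,
        letters   => $letters ? $known_letters / $letters : 0,
    );
}

my %lexicon = (anti => 'opposite', dis => 'separation');
my %score   = familiarity(\%lexicon, qw(anti dis foo));

# Two of three fragments and seven of ten letters are known.
printf "fragments: %.2f, letters: %.2f\n", $score{fragments}, $score{letters};
# prints "fragments: 0.67, letters: 0.70"
```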

METHODS

new

$p = Lingua::TokenParse->new(
    verbose => 0,
    word => $word,
    lexicon => \%lexicon,
    lexicon_file => $lexicon_file,
    constraints => \@constraints,
);

Return a new Lingua::TokenParse object.

This method will automatically call the partition methods (detailed below) if a word and lexicon are provided.

The word can be any string. However, make sure that it does not include the characters you use for the separator, not_defined and unknown strings (described below).

The lexicon must be a hash reference with word fragments as keys and definitions as their respective values.

parse

$p->parse;
$p->parse($string);

This method clears the partition lists and then calls all the individual parsing methods that are detailed below. If a string is provided, the object's word attribute is first reset to that string.

build_parts

$parts = $p->build_parts;

Construct an array of the word partitions.

build_definitions

$known_definitions = $p->build_definitions;

Construct a table of the definitions of the word parts.

build_combinations

$combos = $p->build_combinations;

Compute the array of all possible word part combinations.

build_knowns

$raw_knowns = $p->build_knowns;

Compute the familiar word part combinations.

This method handles word parts containing prefix and suffix hyphens. These hyphens encode information about which word combinations are syntactically illegal, which can be used to score (or even throw out) bogus combinations.

lexicon_cache

$p->lexicon_cache;
$p->lexicon_cache( $lexicon_file );
$p->lexicon_cache( lexicon_file => $lexicon_file );

Back up and retrieve the hash reference of token entries.

If this method is called with no arguments, the object's lexicon_file is used. If the method is called with a single argument, the object's lexicon_file attribute is temporarily overridden. If the method is called with two arguments and the first is the string "lexicon_file" then that attribute is set before proceeding.
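Storable is listed under SEE ALSO, so a cache of this kind could plausibly be built on it. The sketch below is only an assumption about the general technique, not a description of what lexicon_cache does internally: it writes a lexicon hash reference to a temporary file and reads it back.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

my %lexicon = (anti => 'opposite', dis => 'separation');

# A temporary file stands in for the object's lexicon_file here.
my ($fh, $lexicon_file) = tempfile(UNLINK => 1);
close $fh;

# Back up the lexicon, then load it again as a fresh hash reference.
store(\%lexicon, $lexicon_file);
my $restored = retrieve($lexicon_file);

print "$_ => $restored->{$_}\n" for sort keys %$restored;
```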

CONVENIENCE METHOD

output_knowns

@known_list = $p->output_knowns;
print Dumper \@known_list;

# Look at the "even friendlier output."
print scalar $p->output_knowns(
    separator   => $separator,
    not_defined => $not_defined,
    unknown     => $unknown,
);

This method returns the familiar word part combinations in a couple of "human accessible" formats. Each has familiarity scores rounded to two decimals and fragment definitions shown in a readable layout.

separator

The string used between fragment definitions. Default is a plus symbol surrounded by single spaces: ' + '.

not_defined

Indicates a known fragment that has no definition. Default is a single period: '.'.

unknown

Indicates an unknown fragment. The default is the question mark: '?'.
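To show how the three strings fit together, here is a minimal sketch that formats one combination using the defaults described above. The format_definitions helper is hypothetical, and treating an undef or empty lexicon value as "no definition" is an assumption. The real output_knowns layout also includes scores and may differ.

```perl
use strict;
use warnings;

# Hypothetical formatter, not part of Lingua::TokenParse.
sub format_definitions {
    my ($fragments, $lexicon, %opt) = @_;
    my $separator   = exists $opt{separator}   ? $opt{separator}   : ' + ';
    my $not_defined = exists $opt{not_defined} ? $opt{not_defined} : '.';
    my $unknown     = exists $opt{unknown}     ? $opt{unknown}     : '?';

    my @shown;
    for my $fragment (@$fragments) {
        if (!exists $lexicon->{$fragment}) {
            # Fragment is not in the lexicon at all.
            push @shown, $unknown;
        }
        elsif (!defined $lexicon->{$fragment} || $lexicon->{$fragment} eq '') {
            # Fragment is known but has no definition (assumed: undef or '').
            push @shown, $not_defined;
        }
        else {
            push @shown, $lexicon->{$fragment};
        }
    }
    return join $separator, @shown;
}

my %lexicon = (anti => 'opposite', dis => '');

print format_definitions([qw(anti dis foo)], \%lexicon), "\n";
# prints "opposite + . + ?"
print format_definitions([qw(anti dis foo)], \%lexicon, separator => ' | '), "\n";
# prints "opposite | . | ?"
```

This also shows why the word itself should not contain the separator, not_defined or unknown characters: the formatted line would become ambiguous.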

ACCESSORS

word

$p->word($word);
$word = $p->word;

The actual word to partition, which can be any string.

lexicon

$p->lexicon(%lexicon);
$p->lexicon(\%lexicon);
$lexicon = $p->lexicon;

The lexicon is a hash reference with word fragments as keys and definitions as their respective values. It can be set with either a hash or a hash reference.

If an argument is supplied but is neither a hash nor a hash reference, the lexicon is cleared (reset to {}, an empty hash reference).

parts

$parts = $p->parts;

The computed array reference of all possible word partitions.

combinations

$combinations = $p->combinations;

The computed array reference of all possible word part combinations.

knowns

$knowns = $p->knowns;

The computed hash reference of known (non-zero scored) combinations with their familiarity values.

definitions

$definitions = $p->definitions;

The hash reference of the definitions provided for each fragment of the combinations, with the values of unknown fragments set to undef.

constraints

$constraints = $p->constraints;
$p->constraints(\@regexps);

An optional, user-defined array reference of regular expressions to apply to the list of known combinations. This acts as a negative pruning device, meaning that if a match is successful, the entry is excluded from the list.
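The pruning rule can be sketched in a few lines of plain Perl. This example assumes that combinations are strings of fragments joined by periods, as the constraint in the SYNOPSIS suggests. The combination strings are made up for illustration and the code is not the module's own.

```perl
use strict;
use warnings;

# The constraint from the SYNOPSIS: no parts ending in "eo".
my @constraints = ( qr/eo(?:\.|$)/ );

# Made-up combinations, assumed to be period-joined fragments.
my @combinations = qw(
    neo.de.oxy
    n.eo.de.oxy
    ne.o.de.oxy
    ne.ode.oxy
);

# Keep a combination only if no constraint matches it.
my @kept = grep {
    my $combo = $_;
    !grep { $combo =~ $_ } @constraints;
} @combinations;

print "$_\n" for @kept;
# prints "ne.o.de.oxy" and "ne.ode.oxy", one per line
```

The first two combinations are dropped because they contain a part ending in "eo" ("neo" and "eo").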

EXAMPLES

Example code can be found in the distribution eg/ directory.

TO DO

Turn the lame output_knowns method into a sensible XML serializer (of optionally everything).

Compute the time required for a given parse.

Make a method to add definitions for unknown fragments and call it... learn().

Use traditional stemming to trim down the common knowns and see if the score is the same...

Synthesize a term list based on a thesaurus of word-part definitions. That is, go in reverse. Non-trivial!

SEE ALSO

Storable

Math::BaseCalc

DEDICATION

For my Grandmother and English teacher Frances Jones.

THANK YOU

Thank you to Luc St-Louis for helping me increase the speed while eliminating the exponential memory footprint. I wish I knew your email address so I could tell you. lucs++

AUTHOR

Gene Boggs <gene@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2003-2004 by Gene Boggs

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.