NAME

Lingua::Word::Parser - Parse a word into known and unknown parts

VERSION

version 0.07

SYNOPSIS

use Data::Dumper;
use Lingua::Word::Parser;
my $p = Lingua::Word::Parser->new(
   word => 'abioticaly',
   file => 'eg/lexicon.dat',
);

# Or with a database source:
$p = Lingua::Word::Parser->new(
   word   => 'abioticaly',
   dbname => 'fragments',
   dbuser => 'akbar',
   dbpass => 's3kr1+',
);

my $known  = $p->knowns;
my $combos = $p->power;
my $scored = $p->score_parts;

# The best guess is the last sorted scored set:
print Dumper $scored->{ [ sort keys %$scored ]->[-1] };

DESCRIPTION

A Lingua::Word::Parser breaks a word into known affixes.

A lexicon file maps word-part regular expressions to definitions and must have lines of the form:

a(?=\w)        opposite
ab(?=\w)       away
(?<=\w)o(?=\w) combining
(?<=\w)tic     possessing

Please see the included eg/lexicon.dat file.
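As an illustration of the line format above, here is a minimal sketch that reads such lines into a hash and reports which expressions match a word. The parse_lexicon helper is hypothetical and is not part of this module; it assumes the regular expression and its definition are separated by whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Word::Parser): split each lexicon
# line into a regular expression and its definition, assuming whitespace
# separates the two fields.
sub parse_lexicon {
    my (@lines) = @_;
    my %lexicon;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        next unless length $line;         # skip blank lines
        my ($re, $defn) = split /\s+/, $line, 2;
        $lexicon{$re} = $defn;
    }
    return \%lexicon;
}

my $lexicon = parse_lexicon(
    'a(?=\w)        opposite',
    'ab(?=\w)       away',
    '(?<=\w)o(?=\w) combining',
    '(?<=\w)tic     possessing',
);

# Report every lexicon entry whose expression matches the word.
for my $re (sort keys %$lexicon) {
    print "$re => $lexicon->{$re}\n" if 'abioticaly' =~ /$re/;
}
```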

A database lexicon must have records of the form:

       affix     definition
-----------------------------
       a(?=\w)   opposite
       ab(?=\w)  away
(?<=\w)o(?=\w)   combining
(?<=\w)tic       possessing

Please see the included eg/word_part.sql file.

METHODS

new()

$x = Lingua::Word::Parser->new(%arguments);

Create a new Lingua::Word::Parser object.

Arguments and defaults:

word:   undef
dbuser: undef
dbpass: undef
dbname: undef
dbtype: mysql
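dbhost: localhost

A file argument naming a lexicon file, as shown in the SYNOPSIS, may be given instead of the database arguments.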

knowns()

my $known = $p->knowns;

Find the known word parts and their bitstring masks.
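To make the idea of a bitstring mask concrete, the following sketch builds one mask per match: a string as long as the word, with 1 at each matched character position and 0 elsewhere. The masks_for helper is hypothetical and only illustrates the concept; the data structure actually returned by knowns() may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation): for every match of
# $re in $word, build a bitstring the same length as the word, with 1s on
# the matched characters and 0s everywhere else.
sub masks_for {
    my ($word, $re) = @_;
    my @masks;
    while ($word =~ /$re/g) {
        my ($start, $end) = ($-[0], $+[0]);   # offsets of the whole match
        next if $end == $start;               # ignore empty matches
        push @masks,
              ('0' x $start)
            . ('1' x ($end - $start))
            . ('0' x (length($word) - $end));
    }
    return @masks;
}

print join(' ', masks_for('abioticaly', '(?<=\w)tic')), "\n";
```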

power()

my $combos = $p->power();

Find the set of non-overlapping known word parts by considering the power set of all masks.
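The power-set idea can be sketched as follows: enumerate every non-empty subset of the masks and keep only those subsets whose members share no character position. The non_overlapping helper is hypothetical, not the module's code, and it assumes words short enough for each mask to fit in a native integer.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation): return every
# non-empty subset of @masks in which no two masks cover the same position.
# Assumes each mask fits in a native integer.
sub non_overlapping {
    my (@masks) = @_;
    my @ints = map { oct("0b$_") } @masks;    # bitstring -> integer
    my @combos;
    # Each value of $bits selects one subset of the masks.
    for my $bits (1 .. 2 ** scalar(@masks) - 1) {
        my @idx = grep { $bits & (1 << $_) } 0 .. $#masks;
        my ($union, $ok) = (0, 1);
        for my $i (@idx) {
            # A shared 1 bit means two parts claim the same character.
            if ($union & $ints[$i]) { $ok = 0; last }
            $union |= $ints[$i];
        }
        push @combos, [ @masks[@idx] ] if $ok;
    }
    return @combos;
}

my @combos = non_overlapping('1000000000', '1100000000', '0000111000');
print join('+', @$_), "\n" for @combos;
```

Enumerating the full power set grows exponentially with the number of masks, so a sketch like this is only practical for a small number of known parts.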

score()

$score = $p->score();
$score = $p->score( $open_separator, $close_separator );

Score the known versus unknown word part combinations as ratios of characters and of chunks (spans of adjacent characters), returned as a collection of strings.

If not given, the $open_separator and $close_separator are '<' and '>' by default.
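As a rough illustration of the separators and ratios described above, this sketch wraps the known spans of a word in separators and computes a character ratio and a chunk ratio from a combined mask. The score_mask helper and its return keys are hypothetical; the strings and structure that score() actually returns are not reproduced here. It assumes a non-empty word and a mask of the same length.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation): given a word and a
# combined bitstring mask, wrap each known span (run of 1s) in separators
# and compute known/total ratios for characters and for chunks.
# Assumes a non-empty word and a mask of the same length.
sub score_mask {
    my ($word, $mask, $open, $close) = @_;
    $open  = '<' unless defined $open;
    $close = '>' unless defined $close;

    my ($rendered, $known_chunks, $total_chunks) = ('', 0, 0);
    # Walk the mask one run of identical bits at a time.
    while ($mask =~ /((.)\2*)/g) {
        my $chunk = substr $word, $-[1], length $1;
        $total_chunks++;
        if ($2 eq '1') {
            $known_chunks++;
            $rendered .= $open . $chunk . $close;
        }
        else {
            $rendered .= $chunk;
        }
    }
    my $known_chars = ($mask =~ tr/1//);      # count the 1s without changing $mask
    return {
        rendered    => $rendered,
        char_ratio  => $known_chars / length($word),
        chunk_ratio => $known_chunks / $total_chunks,
    };
}

my $s = score_mask('abioticaly', '1100111000');
print "$s->{rendered} chars=$s->{char_ratio} chunks=$s->{chunk_ratio}\n";
```

Note that in this sketch two known parts that sit next to each other merge into a single run of 1s and are therefore counted as one chunk.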

score_parts()

$score_parts = $p->score_parts();
$score_parts = $p->score_parts( $open_separator, $close_separator );
$score_parts = $p->score_parts( $open_separator, $close_separator, $line_terminator );

Score the known versus unknown word part combinations as ratios of characters and of chunks (spans of adjacent characters).

If not given, the $open_separator and $close_separator are '<' and '>' by default.

The line terminator can be any string, such as a newline (\n) or an HTML line break, but is the empty string ('') by default.

SEE ALSO

Lingua::TokenParse - The predecessor of this module.

http://en.wikipedia.org/wiki/Affix is the tip of the iceberg...

https://github.com/ology/Word-Part - A friendly Dancer user interface.

AUTHOR

Gene Boggs <gene@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Gene Boggs.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.