NAME
Lingua::Word::Parser - Parse a word into scored known and unknown parts
VERSION
version 0.0803
SYNOPSIS
use Lingua::Word::Parser;
use Data::Dumper;

# With a file source:
my $p = Lingua::Word::Parser->new(
    word => 'abioticaly',
    file => 'eg/lexicon.dat',
);
# Or with a database source:
$p = Lingua::Word::Parser->new(
    word   => 'abioticaly',
    dbname => 'fragments',
    dbuser => 'akbar',
    dbpass => 's3kr1+',
);
my $known  = $p->knowns;
my $combos = $p->power;
my $scored = $p->score_parts;
# The best guess is the last sorted scored set:
print Dumper $scored->{ [ sort keys %$scored ]->[-1] };
DESCRIPTION
A Lingua::Word::Parser breaks a word into known affixes.
A lexicon file pairs each word-part regular expression with its definition, one pair per line, in the form:
a(?=\w) opposite
ab(?=\w) away
(?<=\w)o(?=\w) combining
(?<=\w)tic possessing
Please see the included eg/lexicon.dat file.
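The lexicon format above can be explored without the module. The following is a minimal sketch, not the module's own loader: parse_lexicon() and find_parts() are hypothetical helpers, and it assumes that the first run of whitespace on a line separates the regular expression from its definition.

```perl
use strict;
use warnings;

# Hypothetical helper: turn lexicon lines into [ regex, definition ] pairs.
# Assumes the first run of whitespace separates the two fields.
sub parse_lexicon {
    my @entries;
    for my $line (@_) {
        # Lines that do not look like "regex definition" are skipped.
        my ($affix, $definition) = $line =~ /^\s*(\S+)\s+(.*?)\s*$/
            or next;
        push @entries, [ qr/$affix/, $definition ];
    }
    return @entries;
}

# Hypothetical helper: list every [ part, start offset, definition ] in $word.
sub find_parts {
    my ($word, @entries) = @_;
    my @found;
    for my $entry (@entries) {
        my ($re, $definition) = @$entry;
        while ($word =~ /$re/g) {
            # $-[0] and $+[0] are the start and end offsets of the last match.
            push @found, [ substr($word, $-[0], $+[0] - $-[0]), $-[0], $definition ];
        }
    }
    return @found;
}

my @lexicon = parse_lexicon(
    'a(?=\w) opposite',
    'ab(?=\w) away',
    '(?<=\w)o(?=\w) combining',
    '(?<=\w)tic possessing',
);

printf "%-4s at offset %d: %s\n", @$_ for find_parts('abioticaly', @lexicon);
```

The lookahead and lookbehind assertions keep a prefix such as "a" from matching at the very end of a word and a suffix such as "tic" from matching at the very start.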
A database lexicon must have records of the form:
affix              definition
-----------------------------
a(?=\w)            opposite
ab(?=\w)           away
(?<=\w)o(?=\w)     combining
(?<=\w)tic         possessing
Please see the included eg/word_part.sql file.
METHODS
new()
$x = Lingua::Word::Parser->new(%arguments);
Create a new Lingua::Word::Parser object.
Arguments and defaults:
word: undef
file: undef
dbuser: undef
dbpass: undef
dbname: undef
dbtype: mysql
dbhost: localhost
knowns()
my $known = $p->knowns;
Find the known word parts and their bitstring masks.
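To illustrate what a bitstring mask can look like, here is a minimal sketch. The masks_for() helper is hypothetical, and the mask layout (one '0' or '1' character per letter of the word, with '1' marking the matched letters) is an assumption rather than the module's documented representation.

```perl
use strict;
use warnings;

# Hypothetical helper: one '0'/'1' string per match of $re in $word,
# with '1' marking the characters covered by that match.
sub masks_for {
    my ($word, $re) = @_;
    my @masks;
    while ($word =~ /$re/g) {
        my ($start, $end) = ($-[0], $+[0]);
        push @masks,
              ('0' x $start)
            . ('1' x ($end - $start))
            . ('0' x (length($word) - $end));
    }
    return @masks;
}

my $word = 'abioticaly';
print "$word\n";
print "$_\n" for masks_for($word, qr/ab(?=\w)/), masks_for($word, qr/(?<=\w)tic/);
```

Printed under the word, each mask lines up with the letters it covers, which makes overlapping parts easy to spot by eye.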
power()
my $combos = $p->power();
Find the set of non-overlapping known word parts by considering the power set of all masks.
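The power-set idea can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: non_overlapping_combos() is a hypothetical helper, masks are assumed to be '0'/'1' strings, and converting them to integers limits the sketch to words no longer than the platform's integer width.

```perl
use strict;
use warnings;

# Hypothetical helper: return every non-empty subset of @masks whose
# members share no character position.
sub non_overlapping_combos {
    my @masks = @_;
    my @bits  = map { oct("0b$_") } @masks;    # '0110' becomes 6
    my @combos;
    SUBSET:
    for my $subset (1 .. (1 << scalar @masks) - 1) {
        my $union = 0;
        my @chosen;
        for my $i (0 .. $#masks) {
            next unless $subset & (1 << $i);       # mask $i is not in this subset
            next SUBSET if $union & $bits[$i];     # overlaps a chosen mask: discard
            $union |= $bits[$i];
            push @chosen, $masks[$i];
        }
        push @combos, \@chosen;
    }
    return @combos;
}

# '1100' and '1000' overlap, so no combination contains both.
print join(' + ', @$_), "\n" for non_overlapping_combos('1100', '1000', '0011');
```

Enumerating all subsets grows as 2**n in the number of masks, so this direct approach is only practical for words with a modest number of known parts.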
score()
$score = $p->score();
$score = $p->score( $open_separator, $close_separator );
Score the known vs. unknown word part combinations as ratios of characters and of chunks (parts, or "spans of adjacent characters"), expressed as a collection of strings.
This method sets the score member to a list of hashrefs with keys:
partition
definition
score
familiarity
The $open_separator and $close_separator default to '<' and '>'.
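As a rough illustration of scoring one combination, the sketch below marks the known parts with the separators and counts characters and chunks. The score_combo() helper, the "known:unknown" ratio strings, and the chars and chunks keys are all assumptions made for this example; they are not the module's actual output format. The masks are assumed to be non-overlapping '0'/'1' strings with one run of '1's each.

```perl
use strict;
use warnings;

# Hypothetical helper: score one combination of non-overlapping masks.
sub score_combo {
    my ($word, $open, $close, @masks) = @_;
    $open  = '<' unless defined $open;
    $close = '>' unless defined $close;

    my %len_at;    # start offset => length of the known part starting there
    for my $mask (@masks) {
        $mask =~ /1+/ or next;
        $len_at{ $-[0] } = $+[0] - $-[0];
    }

    my ($partition, $pos, $known_chars, $unknown_chunks) = ('', 0, 0, 0);
    while ($pos < length($word)) {
        if (my $len = $len_at{$pos}) {
            # A known part: wrap it in the separators.
            $partition   .= $open . substr($word, $pos, $len) . $close;
            $known_chars += $len;
            $pos         += $len;
        }
        else {
            # An unknown chunk: consume up to the next known part.
            my $end = $pos;
            $end++ while $end < length($word) && !$len_at{$end};
            $partition .= substr($word, $pos, $end - $pos);
            $unknown_chunks++;
            $pos = $end;
        }
    }

    my $unknown_chars = length($word) - $known_chars;
    return {
        partition => $partition,
        chars     => "$known_chars:$unknown_chars",
        chunks    => scalar(keys %len_at) . ":$unknown_chunks",
    };
}

my $result = score_combo('abioticaly', undef, undef,
    '1100000000', '0001000000', '0000111000');
print "$result->{partition} chars=$result->{chars} chunks=$result->{chunks}\n";
```

Counting chunks as well as characters distinguishes a combination built from a few long parts from one built from many short ones, even when both cover the same number of letters.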
score_parts()
$score_parts = $p->score_parts();
$score_parts = $p->score_parts( $open_separator, $close_separator );
$score_parts = $p->score_parts( $open_separator, $close_separator, $line_terminator );
Score the known vs unknown word part combinations into ratios of characters and chunks (spans of adjacent characters).
The $open_separator and $close_separator default to '<' and '>'.
The $line_terminator can be any string, like a newline (\n or an HTML line-break), but is the empty string ('') by default.
SEE ALSO
Lingua::TokenParse - The predecessor of this module.
http://en.wikipedia.org/wiki/Affix is the tip of the iceberg...
https://github.com/ology/Word-Part - A friendly Dancer user interface.
The t/* and eg/* files in this distribution!
AUTHOR
Gene Boggs <gene@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2015 by Gene Boggs.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.