NAME
Text::AI::CRM114 - Perl interface for CRM114
SYNOPSIS
use feature 'say';   # say() is not enabled by default
use Text::AI::CRM114;
# create a datablock with 8 MB for learned data and two classes
my $db = Text::AI::CRM114->new(
    Text::AI::CRM114::OSB_BAYES,
    8*1024*1024, ["Alice", "Macbeth"]);
$db->learn("Alice", "Alice was beginning to ...");
$db->learn("Macbeth", "When shall we three meet again ...");
my @ret = $db->classify("The Mole had been working very hard all the morning ...");
# $ret[0] is the status code, $ret[1] the name of the best matching class
say "Best classification is $ret[1]" if $ret[0] == Text::AI::CRM114::OK;
DESCRIPTION
This module provides a simple Perl interface to libcrm114, a library that implements several text classification algorithms.
CONSTANTS
libcrm114 uses several constants as status return values and to set the classification algorithm of a new datablock. These constants are accessible in this module's namespace, for example Text::AI::CRM114::OK and Text::AI::CRM114::OSB_WINNOW.
METHODS
- Text::AI::CRM114->new($flags, $datasize, $classref)
-
Creates a new instance.
- $flags
-
sets the classification algorithm. Recommended values are Text::AI::CRM114::OSB_BAYES (default), Text::AI::CRM114::OSB_WINNOW, or Text::AI::CRM114::HYPERSPACE. libcrm114 includes some more algorithms (SVM, PCA, FSCM), which may or may not be production ready.
- $datasize
-
the memory size of learned data (default is 4 MB). Note that some algorithms have to grow the datasize when learning.
- $classref
-
a list of classes passed by reference (default: ['A', 'B']).
- Text::AI::CRM114->readfile($filename)
-
Creates a new instance by reading a previously saved CRM114 DB from $filename.
- $db->getclasses()
-
Returns a hash reference to the DB's classes. This hash associates the class names (keys) with the internal integer index (values).
- $db->writefile($filename)
-
Writes the DB into a (binary) file.
- $db->learn($class, $text)
-
Learn some text of a given class.
- $db->classify($text, $verbatim)
-
Classify the text.
The normal working mode without the optional $verbatim flag adjusts the return values to be more useful with two classes (e.g. spam/ham). If the flag is given, the values are passed unchanged as they come from libcrm114. In practice this is only relevant if you use more than two classes. (Then you have to consider the success/non-success classes and probably want to add a method to retrieve the single per-class results.)
Returns a list of five scalar values:
- $err
-
A numeric error code, should be Text::AI::CRM114::OK.
- $class
-
The name of the best matching class.
- $prob
-
The success probability. Normally the probability of the matching class (with 0.5 <= $prob <= 1).
With $verbatim this is the success probability, i.e. with two classes the probability of the first class and with multiple classes the sum of probabilities for all successful classes (with 0 <= $prob <= 1).
- $pR
-
The logarithmic probability ratio, i.e. log10($prob) - log10(1-$prob) (the theoretical range is 0 <= $pR <= 340, limited by floating point precision; but in practice p = .99 yields a pR of about 2, so high values are rather unusual).
With $verbatim this is the ratio between all success and all non-success probabilities, so for a non-successful result the value can also be negative (range -340 <= $pR <= 340).
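The relation between $prob and $pR given above can be checked with a few lines of plain Perl. This is only a sketch of the formula as stated in this document, not code taken from the module or from libcrm114; the helper names prob_to_pr and pr_to_prob are made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(log10);

# pR as defined above: log10($prob) - log10(1 - $prob).
# Hypothetical helper, not part of Text::AI::CRM114.
sub prob_to_pr {
    my ($prob) = @_;
    die "prob must be strictly between 0 and 1\n"
        unless $prob > 0 && $prob < 1;
    return log10($prob) - log10(1 - $prob);
}

# Inverse of the formula above: $prob = 1 / (1 + 10 ** -$pR).
# Hypothetical helper, not part of Text::AI::CRM114.
sub pr_to_prob {
    my ($pr) = @_;
    return 1 / (1 + 10 ** -$pr);
}

printf "p = %.2f -> pR = %.3f\n", $_, prob_to_pr($_) for 0.5, 0.9, 0.99;
printf "pR = 2 -> p = %.4f\n", pr_to_prob(2);
```

A caller who wants a stricter decision than "best class wins" could compare $pR against a threshold of their own choosing; which threshold is sensible depends on the data and is not specified here.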
ISSUES
This is my first attempt to write a Perl module, so all hints and improvements are appreciated.
I wonder if we should ensure Text::AI::CRM114::OK maps to 0, as this makes the caller's return value checking easier. Currently this is trivial because it already is 0 in libcrm114. If that should change we would have to insert a rewrite into every XS call to a C function (ugly, but maybe worth it).
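From the caller's side, the safest check is probably an explicit comparison with the constant, which keeps working whatever its value is. The following is a minimal sketch under assumptions: the local constant OK stands in for Text::AI::CRM114::OK so that the code runs without libcrm114, class_or_die is a hypothetical helper, and the sample return lists are made up.

```perl
use strict;
use warnings;

# Stand-in for Text::AI::CRM114::OK (assumption: 0, as reported above
# for the current libcrm114).
use constant OK => 0;

# Hypothetical wrapper: takes the list returned by $db->classify(...)
# and returns the class name, or dies if the status code is not OK.
sub class_or_die {
    my ($err, $class) = @_;
    # Explicit comparison instead of "die if $err": this does not
    # depend on OK being 0.
    die "classification failed with status $err\n" unless $err == OK;
    return $class;
}

# Made-up return lists, not real classifier output.
print class_or_die(0, 'Alice', 0.93, 1.12), "\n";

my $class = eval { class_or_die(3, 'Alice', 0.5, 0) };
print defined $class ? "$class\n" : "error: $@";
```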
I am still not sure if the C memory management works correctly.
Another issue is Unicode support, which is missing in libcrm114, so it might be a good thing to convert Unicode strings into some 8-bit encoding. As long as no string contains \0-values nothing bad[tm] will happen, but I assume that Unicode strings will internally cause wrong tokenization (this should be checked in libtre).
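One way to do such a conversion on the caller's side is the core Encode module. The sketch below is an illustration only: to_octets is a hypothetical helper that is not part of this module, the choice of ISO-8859-1 as the target encoding is an assumption, and whether the resulting bytes are tokenized sensibly by libcrm114 is untested.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Text::AI::CRM114): turn a Perl
# character string into a byte string without \0 values.
sub to_octets {
    my ($text, $encoding) = @_;
    $encoding //= 'iso-8859-1';   # assumption: Latin-1 is acceptable

    # Without a CHECK argument, characters that the target encoding
    # cannot represent are replaced by a substitution character.
    my $octets = encode($encoding, $text);

    # Drop \0 values, the case singled out above as problematic.
    $octets =~ tr/\0//d;
    return $octets;
}

my $bytes = to_octets("Caf\x{e9} \x{263a} na\x{ef}ve\0");
printf "%d bytes, character string flag: %s\n",
    length($bytes), utf8::is_utf8($bytes) ? "on" : "off";
```

The result could then be passed to $db->learn or $db->classify instead of the original string. Note that the substitution loses information for every character outside the chosen encoding.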
SEE ALSO
CRM114 homepage: http://crm114.sourceforge.net/
AI::CRM114, a module using the crm language interpreter: https://metacpan.org/module/AI::CRM114
HISTORY
v0.04 remove crm114_strerror, which is not in libcrm114 tarball
v0.03 initial CPAN release
v0.02 initial push to github
AUTHOR
Martin Schuette, <info@mschuette.name>
COPYRIGHT AND LICENSE
Perl module: Copyright (C) 2012 by Martin Schuette
libcrm114: Copyright (C) 2009-2010 by William S. Yerazunis
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License version 3.