NAME
AI::Categorizer::Document - Embodies a document
SYNOPSIS
  use AI::Categorizer::Document;
  use AI::Categorizer::Learner::NaiveBayes;

  # Simplest way to create a document:
  my $d = AI::Categorizer::Document->new(name    => $string,
                                         content => $string);

  # Other parameters are accepted:
  my $d = AI::Categorizer::Document->new(name => $string,
                                         categories => \@category_objects,
                                         content => { subject => $string,
                                                      body => $string2, ... },
                                         content_weights => { subject => 3,
                                                              body => 1, ... },
                                         stopwords => \%skip_these_words,
                                         stemming => $string,
                                         term_weighting => $string,
                                         front_bias => $float,
                                         use_features => $feature_vector,
                                        );

  # Specify explicit feature vector:
  my $d = AI::Categorizer::Document->new(name => $string);
  $d->features( $feature_vector );

  # Now pass the document to a categorization algorithm:
  my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
  my $hypothesis = $learner->categorize($d);
DESCRIPTION
The Document class embodies the data in a single document, and contains methods for turning this data into a FeatureVector. Usually documents are plain text, but subclasses of the Document class may handle any kind of data.
METHODS
- new(%parameters)
-
Creates a new Document object. Accepts the following parameters:
- name
-
A string that identifies this document. Required.
- content
-
The raw content of this document. May be specified as either a string or as a hash reference, allowing structured document types.
- content_weights
-
A hash reference indicating the weights that should be assigned to features in different sections of a structured document when creating its feature vector. The weight is a multiplier of the feature vector values. For instance, if a subject section has a weight of 3 and a body section has a weight of 1, and word counts are used as feature vector values, then it will be as if all words appearing in the subject appeared 3 times.
If no weights are specified, all weights are set to 1.
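The weighting rule above can be illustrated with a minimal sketch. The helper weighted_counts() is hypothetical and is not part of AI::Categorizer; it assumes plain whitespace tokenization and raw word counts as feature values, and it only mirrors the behavior described in this item.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): counts whitespace-
# separated words in each section and scales them by the section weight.
sub weighted_counts {
    my ($content, $weights) = @_;
    my %features;
    for my $section (keys %$content) {
        # Sections without an explicit weight default to 1.
        my $w = exists $weights->{$section} ? $weights->{$section} : 1;
        $features{ lc($_) } += $w for split ' ', $content->{$section};
    }
    return \%features;
}

my $features = weighted_counts(
    { subject => 'perl release', body => 'new perl release notes' },
    { subject => 3, body => 1 },
);

# "perl" and "release" each score 3 (subject) + 1 (body) = 4;
# "new" and "notes" appear only in the body and score 1.
printf "%s => %d\n", $_, $features->{$_} for sort keys %$features;
```

Because the weight is a simple multiplier, a word in a section with weight 3 contributes exactly as much as three occurrences in a section with weight 1.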
- front_bias
-
Allows smooth bias of the weights of words in a document according to their position. The value should be a number between -1 and 1. Positive numbers indicate that words toward the beginning of the document should have higher weight than words toward the end of the document. Negative numbers indicate the opposite. A bias of 0 indicates that no biasing should be done.
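As an illustration only, the sketch below shows one way a position bias could behave. The linear scheme and the helper position_weights() are assumptions made for this example; the text above does not state the formula the module actually uses, and it may well differ.

```perl
use strict;
use warnings;

# Hypothetical linear scheme (an assumption, not the module's formula):
# weights run from 1+bias at the first token down to 1-bias at the last.
sub position_weights {
    my ($n, $bias) = @_;
    die "front_bias must be between -1 and 1\n" if $bias < -1 || $bias > 1;
    # A bias of 0 (or a single token) means no biasing at all.
    return (1) x $n if $n < 2 || $bias == 0;
    return map { 1 + $bias * (1 - 2 * $_ / ($n - 1)) } 0 .. $n - 1;
}

my @tokens  = qw(alpha beta gamma delta epsilon);
my @weights = position_weights(scalar @tokens, 0.5);
printf "%-8s %.2f\n", $tokens[$_], $weights[$_] for 0 .. $#tokens;
```

With a bias of 0.5 the first token gets weight 1.50 and the last 0.50; a negative bias reverses the slope, and the average weight stays at 1 in both cases.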
- term_weighting
-
Specifies how word counts should be converted to feature vector values. If term_weighting is set to "natural", the word counts themselves will be used as the values. "boolean" indicates that each positive word count will be converted to 1 (or whatever the content_weights value for this section is). "log" indicates that the values will be set to 1+log(count).
- categories
-
A reference to an array of Category objects that this document belongs to. Optional.
- stopwords
-
A list/hash of features (words) that should be ignored when parsing document content. A hash reference is preferred, with the features as the keys. If you pass an array reference containing the features, it will be converted to a hash reference internally.
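The array-to-hash conversion described above might look roughly like the following sketch. The helper normalize_stopwords() is hypothetical and only shows why a hash reference is the convenient form: each lookup during parsing becomes a single exists test.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): accept either form
# and always hand back a hash reference keyed by the words to skip.
sub normalize_stopwords {
    my ($stopwords) = @_;
    return $stopwords if ref($stopwords) eq 'HASH';
    return +{ map { $_ => 1 } @$stopwords };
}

my $skip = normalize_stopwords([qw(the a of)]);

# Drop every token that appears as a key in the stopword hash.
my @kept = grep { !exists $skip->{$_} } qw(the history of perl);
print "@kept\n";    # prints "history perl"
```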
- use_features
-
A Feature Vector specifying the only features that should be considered when parsing this document. This is an alternative to using stopwords.
- stemming
-
Indicates the linguistic procedure that should be used to convert tokens in the document to features. Possible values are "none", which indicates that the tokens should be used without change, or "porter", indicating that the Porter stemming algorithm should be applied to each token. This requires the Lingua::Stem module from CPAN.
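To make the three term_weighting modes described earlier concrete, here is a small sketch. The helper term_value() is hypothetical; it assumes the natural logarithm for the "log" mode and takes the section weight as an optional third argument that only the "boolean" mode uses.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): convert one word count
# into a feature value. $weight stands in for the section's content_weights
# entry and defaults to 1.
sub term_value {
    my ($mode, $count, $weight) = @_;
    $weight = 1 unless defined $weight;
    return 0 if $count <= 0;                  # absent words contribute nothing
    return $count          if $mode eq 'natural';
    return $weight         if $mode eq 'boolean';
    return 1 + log($count) if $mode eq 'log';  # Perl's log() is the natural log
    die "Unknown term_weighting mode '$mode'\n";
}

# Compare the three modes for a word that occurs four times.
printf "%-8s %.3f\n", $_, term_value($_, 4) for qw(natural boolean log);
```

The "log" mode dampens the influence of very frequent words while still ranking them above rare ones, whereas "boolean" records only presence.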
- read( path => $path )
-
An alternative constructor method which reads a file on disk and returns a document with that file's contents.
- name()
-
Returns this document's name property as specified when the document was created.
- features()
-
Returns the Feature Vector associated with this document.
- categories()
-
In a list context, returns a list of Category objects to which this document belongs. In a scalar context, returns the number of such categories.
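The context-sensitive return value can be sketched with Perl's built-in wantarray. The My::Doc class below is a hypothetical stand-in used only for illustration; it uses plain strings where the real method returns Category objects, and it is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, used only to illustrate context sensitivity.
package My::Doc;

sub new {
    my ($class, %args) = @_;
    return bless { categories => $args{categories} || [] }, $class;
}

sub categories {
    my $self = shift;
    my @cats = @{ $self->{categories} };
    # List context: the categories themselves. Scalar context: their number.
    return wantarray ? @cats : scalar @cats;
}

package main;

my $doc   = My::Doc->new(categories => [qw(sports politics)]);
my @list  = $doc->categories;    # list context
my $count = $doc->categories;    # scalar context
print "$count categories: @list\n";    # prints "2 categories: sports politics"
```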
- create_feature_vector()
-
Creates this document's Feature Vector by parsing its content. You won't call this method directly; it's called by new().
AUTHOR
Ken Williams <kenw@ee.usyd.edu.au>
COPYRIGHT
This distribution is free software; you can redistribute it and/or modify it under the same terms as Perl itself. These terms apply to every file in the distribution - if you have questions, please contact the author.