NAME
AI::Categorizer::Document - Embodies a document
SYNOPSIS
use AI::Categorizer::Document;
# Simplest way to create a document:
my $d = new AI::Categorizer::Document(name => $string,
content => $string);
# Other parameters are accepted:
my $d = new AI::Categorizer::Document(name => $string,
categories => \@category_objects,
content => { subject => $string,
body => $string2, ... },
content_weights => { subject => 3,
body => 1, ... },
stopwords => \%skip_these_words,
stemming => $string,
term_weighting => $string,
front_bias => $float,
use_features => $feature_vector,
);
# Specify explicit feature vector:
my $d = new AI::Categorizer::Document(name => $string);
$d->features( $feature_vector );
# Now pass the document to a categorization algorithm:
use AI::Categorizer::Learner::NaiveBayes;
my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
my $hypothesis = $learner->categorize($d);
DESCRIPTION
The Document class embodies the data in a single document, and contains methods for turning this data into a FeatureVector. Usually documents are plain text, but subclasses of the Document class may handle any kind of data.
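As a rough illustration of what "turning this data into a FeatureVector" can mean for plain text, here is a minimal sketch that maps a string to a hash of word counts. The count_terms helper and its tokenization rule (lowercase, split on non-word characters) are assumptions made for this example; they are not the module's actual parsing code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of AI::Categorizer: lowercases the text,
# splits it on runs of non-word characters, and counts each token.
sub count_terms {
    my ($text) = @_;
    my %counts;
    $counts{$_}++ for grep { length } split /\W+/, lc $text;
    return \%counts;
}

my $features = count_terms("The cat sat. The cat slept.");
print "$_ => $features->{$_}\n" for sort keys %$features;
```

A hash keyed by feature name is only a convenient stand-in here for a feature vector object.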
METHODS
- new(%parameters)
-
Creates a new Document object. Accepts the following parameters:
- name
-
A string that identifies this document. Required.
- content
-
The raw content of this document. May be specified as either a string or as a hash reference, allowing structured document types.
- content_weights
-
A hash reference indicating the weights that should be assigned to features in different sections of a structured document when creating its feature vector. The weight is a multiplier of the feature vector values. For instance, if a subject section has a weight of 3 and a body section has a weight of 1, and word counts are used as feature vector values, then it will be as if all words appearing in the subject appeared 3 times. If no weights are specified, all weights are set to 1.
- front_bias
-
Allows smooth bias of the weights of words in a document according to their position. The value should be a number between -1 and 1. Positive numbers indicate that words toward the beginning of the document should have higher weight than words toward the end of the document. Negative numbers indicate the opposite. A bias of 0 indicates that no biasing should be done.
- term_weighting
-
Specifies how word counts should be converted to feature vector values. If term_weighting is set to natural, the word counts themselves will be used as the values. boolean indicates that each positive word count will be converted to 1 (or whatever the content_weights value for this section is). log indicates that the values will be set to 1+log(count).
- categories
-
A reference to an array of Category objects that this document belongs to. Optional.
- stopwords
-
A list/hash of features (words) that should be ignored when parsing document content. A hash reference is preferred, with the features as the keys. If you pass an array reference containing the features, it will be converted to a hash reference internally.
- use_features
-
A Feature Vector specifying the only features that should be considered when parsing this document. This is an alternative to using stopwords.
- stemming
-
Indicates the linguistic procedure that should be used to convert tokens in the document to features. Possible values are none, which indicates that the tokens should be used without change, or porter, indicating that the Porter stemming algorithm should be applied to each token. This requires the Lingua::Stem module from CPAN.
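The interaction between the content_weights and term_weighting parameters described above can be sketched in plain Perl. The helpers weight_count and weighted_features are hypothetical names invented for this example, not part of AI::Categorizer. The sketch also assumes that the section multiplier is applied after the term weighting; the order the module itself uses is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: converts one raw word count to a feature value
# under the given term_weighting scheme.
sub weight_count {
    my ($count, $scheme) = @_;
    return 0 unless $count > 0;
    return $count          if $scheme eq 'natural';
    return 1               if $scheme eq 'boolean';
    return 1 + log($count) if $scheme eq 'log';    # natural logarithm
    die "Unknown term_weighting '$scheme'";
}

# Hypothetical helper: builds a feature hash from per-section word counts,
# multiplying each section's values by its weight (1 when unspecified).
sub weighted_features {
    my ($counts_by_section, $content_weights, $scheme) = @_;
    my %features;
    for my $section (keys %$counts_by_section) {
        my $multiplier = $content_weights->{$section} // 1;
        my $counts     = $counts_by_section->{$section};
        for my $word (keys %$counts) {
            $features{$word} += $multiplier * weight_count($counts->{$word}, $scheme);
        }
    }
    return \%features;
}

my $counts = {
    subject => { perl => 1 },
    body    => { perl => 2, module => 1 },
};
my $weights = { subject => 3, body => 1 };

my $natural = weighted_features($counts, $weights, 'natural');
my $boolean = weighted_features($counts, $weights, 'boolean');
printf "natural: perl=%s module=%s\n", $natural->{perl}, $natural->{module};
printf "boolean: perl=%s module=%s\n", $boolean->{perl}, $boolean->{module};
```

With natural weighting, the single occurrence of "perl" in the subject counts three times and is added to its two occurrences in the body, which matches the "as if all words appearing in the subject appeared 3 times" description.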
- read( path => $path )
-
An alternative constructor method which reads a file on disk and returns a document with that file's contents.
- name()
-
Returns this document's name property as specified when the document was created.
- features()
-
Returns the Feature Vector associated with this document.
- categories()
-
In a list context, returns a list of Category objects to which this document belongs. In a scalar context, returns the number of such categories.
- create_feature_vector()
-
Creates this document's Feature Vector by parsing its content. You won't call this method directly; it is called by new().
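The list-versus-scalar behavior described for categories() follows a common Perl idiom built on wantarray. The sketch below shows that idiom with a hypothetical My::Doc class and plain strings standing in for Category objects; it is not the module's own implementation.

```perl
use strict;
use warnings;

package My::Doc;    # hypothetical stand-in, not AI::Categorizer::Document

sub new {
    my ($class, %args) = @_;
    return bless { categories => $args{categories} || [] }, $class;
}

# Returns the categories in list context, their number in scalar context.
sub categories {
    my $self = shift;
    return wantarray
        ? @{ $self->{categories} }
        : scalar @{ $self->{categories} };
}

package main;

my $doc   = My::Doc->new(categories => ['sports', 'politics']);
my @list  = $doc->categories;    # list context: the categories themselves
my $count = $doc->categories;    # scalar context: how many there are
print "categories: @list\n";
print "count: $count\n";
```

Because of this, a test such as "if ($doc->categories) { ... }" evaluates the method in scalar context and is true only when the document belongs to at least one category.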
AUTHOR
Ken Williams <kenw@ee.usyd.edu.au>
COPYRIGHT
This distribution is free software; you can redistribute it and/or modify it under the same terms as Perl itself. These terms apply to every file in the distribution - if you have questions, please contact the author.