NAME
AI::NaiveBayes1 - Bayesian prediction of categories
SYNOPSIS
use AI::NaiveBayes1;
my $nb = AI::NaiveBayes1->new;
$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=Y',cases=>30);
$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=N',cases=>10);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=Y',cases=>18);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=N',cases=>16);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=Y',cases=>22);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=N',cases=>14);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=Y',cases=> 6);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=N',cases=>84);
$nb->train;
print "Model:\n" . $nb->print_model;
# Find results for unseen instances
my $result = $nb->predict
(attributes => {model=>'T', place=>'N'});
foreach my $k (keys(%{ $result })) {
print "for label $k P = " . $result->{$k} . "\n";
}
# export the model into a string
my $string = $nb->export_to_YAML();
# create the same model from the string
my $nb1 = AI::NaiveBayes1->import_from_YAML($string);
# write the model to a file (shorter than model->string->file)
$nb->export_to_YAML_file('t/tmp1');
# read the model from a file (shorter than file->string->model)
my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');
See Examples for more examples.
DESCRIPTION
This module implements the classic "Naive Bayes" machine learning algorithm.
METHODS
Constructor Methods
- new()
-
Creates a new
AI::NaiveBayes1
object and returns it. - set_real(list_of_attributes)
-
Delares a list of attributes to be real-valued. During training, their conditional probabilities will be modeled with Gaussian (normal) distributions.
- import_from_YAML($string)
-
Creates a new
AI::NaiveBayes1
object from a string where it is represented inYAML
. Requires YAML module. - import_from_YAML_file($file_name)
-
Creates a new
AI::NaiveBayes1
object from a file where it is represented inYAML
. Requires YAML module.
Methods
add_instance(attributes=>HASH,label=>STRING|ARRAY)
-
Adds a training instance to the categorizer.
add_instances(attributes=>HASH,label=>STRING|ARRAY,cases=>NUMBER)
-
Adds a number of identical instances to the categorizer.
- export_to_YAML()
-
Returns a
YAML
string representation of anAI::NaiveBayes1
object. Requires YAML module. export_to_YAML_file( $file_name )
-
Writes a
YAML
string representation of anAI::NaiveBayes1
object to a file. Requires YAML module. - print_model()
-
Returns a string, human-friendly representation of the model. The model is supposed to be trained before calling this method.
- train()
-
Calculates the probabilities that will be necessary for categorization using the
predict()
method. predict( attributes => HASH )
-
Use this method to predict the label of an unknown instance. The attributes should be of the same format as you passed to
add_instance()
.predict()
returns a hash reference whose keys are the names of labels, and whose values are corresponding probabilities. labels
-
Returns a list of all the labels the object knows about (in no particular order), or the number of labels if called in a scalar context.
THEORY
Bayes' Theorem is a way of inverting a conditional probability. It states:
P(y|x) P(x)
P(x|y) = -------------
P(y)
and so on...
This is a pretty standard algorithm explained in many machine learning textbooks (e.g., "Data Mining" by Witten and Eibe).
The algorithm relies on estimating P(A|C), where A is an arbitrary attribute, and C is the class attribute. If A is not real-valued, then this conditional probability is estimated using a table of all possible values for A and C.
If A is real-valued, then the distribution P(A|C) is modeled as a Gaussian (normal) distribution for each possible value of C=c, Hence, for each C=c we collect the mean value (m) and standard deviation (s) for A during training. During classification, P(A=a|C=c) is estimated using Gaussian distribution, i.e., in the following way:
1 (a-m)^2
P(A=a|C=c) = ------------ * exp( - ------- )
sqrt(2*Pi)*s 2*s^2
this boils down to the following lines of code:
$scores{$label} *=
0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev}*
exp( -0.5 *
( ( $newattrs->{$att} -
$m->{real_stat}{$att}{$label}{mean})
/ $m->{real_stat}{$att}{$label}{stddev}
) ** 2
);
i.e.,
P(A=a|C=c) = 0.398942280401433 / s *
exp( -0.5 * ( ( a-m ) / s ) ** 2 );
EXAMPLES
Example with a real-valued attribute modeled by a Gaussian distribution (from Witten I. and Frank E. book "Data Mining" (the WEKA book), page 86):
# @relation weather
#
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
#
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no
$nb->set_real('temperature', 'humidity');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');
$nb->train;
my $printedmodel = "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});
YAML::DumpFile('file', $p);
die unless (abs($p->{'play=no'} - 0.792) < 0.001);
die unless(abs($p->{'play=yes'} - 0.208) < 0.001);
HISTORY
Algorithms::NaiveBayes by Ken Williams was not what I needed so I wrote this one. Algorithms::NaiveBayes is oriented towards text categorization, it includes smoothing, and log probabilities. This module is a generic, basic Naive Bayes algorithm.
THANKS
I would like to thank Yung-chung Lin (xern@ cpan. org) for his implementation of the Gaussian model for continuous variables, and the following people for bug reports, support, and comments (in chronological order):
Tom Dyson
Dan Von Kohorn
CPAN-testers (jlatour, Jost.Krieger)
Craig Talbert
and Andrew Brian Clegg.
AUTHOR
Copyright 2003-2005 Vlado Keselj http://www.cs.dal.ca/~vlado. In 2004 Yung-chung Lin provided implementation of the Gaussian model for continous variables.
This script is provided "as is" without expressed or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The module is available on CPAN (http://search.cpan.org/~vlado), and http://www.cs.dal.ca/~vlado/srcperl/. The latter site is updated more frequently.
SEE ALSO
Algorithms::NaiveBayes, perl.