AI::NaiveBayes1 - Bayesian prediction of categories
use AI::NaiveBayes1;
my $nb = AI::NaiveBayes1->new;
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=Y',cases=> 6);
# compute the probabilities before printing the model or predicting
$nb->train;
print "Model:\n" . $nb->print_model;
# Find results for unseen instances
my $result = $nb->predict
    (attributes => {model=>'T', place=>'N'});
foreach my $k (keys(%{ $result })) {
    print "for label $k P = " . $result->{$k} . "\n";
}
# export the model into a string
my $string = $nb->export_to_YAML();
# create the same model from the string
my $nb1 = AI::NaiveBayes1->import_from_YAML($string);
# write the model to a file (shorter than model->string->file)
$nb->export_to_YAML_file('t/tmp1');
# read the model from a file (shorter than file->string->model)
my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');
See Examples for more examples.
This module implements the classic "Naive Bayes" machine learning algorithm.
Constructor Methods
- new()
Creates a new AI::NaiveBayes1 object and returns it.
- set_real(list_of_attributes)
Declares a list of attributes to be real-valued. During training, their conditional probabilities will be modeled with Gaussian (normal) distributions.
- import_from_YAML($string)
Creates a new AI::NaiveBayes1 object from a string where it is represented in YAML. Requires the YAML module.
- import_from_YAML_file($file_name)
Creates a new AI::NaiveBayes1 object from a file where it is represented in YAML. Requires the YAML module.
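- add_instance( attributes => HASH, label => STRING )
Adds a training instance to the categorizer.
- add_instances( attributes => HASH, label => STRING, cases => NUMBER )
Adds a number of identical instances to the categorizer.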
- export_to_YAML()
Returns a YAML string representation of an AI::NaiveBayes1 object. Requires the YAML module.
- export_to_YAML_file( $file_name )
Writes a YAML string representation of an AI::NaiveBayes1 object to a file. Requires the YAML module.
- print_model()
Returns a human-friendly string representation of the model. The model must be trained before calling this method.
- train()
Calculates the probabilities that will be necessary for categorization using the predict() method.
- predict( attributes => HASH )
Use this method to predict the label of an unknown instance. The attributes should be of the same format as you passed to add_instance(). predict() returns a hash reference whose keys are the names of labels, and whose values are the corresponding probabilities.
- labels
Returns a list of all the labels the object knows about (in no particular order), or the number of labels if called in a scalar context.
Bayes' Theorem is a way of inverting a conditional probability. It states:
            P(y|x) P(x)
  P(x|y) = -------------
               P(y)
and so on...
This is a pretty standard algorithm explained in many machine learning textbooks (e.g., "Data Mining" by Witten and Frank).
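Bayes' Theorem as stated above can be checked on a small numeric case. The sketch below is an illustration only and does not use the module: it takes counts read off the weather data in the example section (x is the value of play, y is the event "outlook is sunny") and inverts P(y|x) into P(x|y). The variable names are hypothetical.

```perl
use strict;
use warnings;

# Counts read off the weather data in the example section:
# 14 rows, 9 with play=yes and 5 with play=no; 5 rows have outlook=sunny,
# of which 2 are play=yes and 3 are play=no.
my %prior      = (yes => 9 / 14, no => 5 / 14);   # P(x)
my %likelihood = (yes => 2 / 9,  no => 3 / 5);    # P(y|x), y = "outlook is sunny"

# P(y) by the law of total probability: sum over x of P(y|x) P(x)
my $evidence = 0;
$evidence += $likelihood{$_} * $prior{$_} for keys %prior;

# Bayes' Theorem: P(x|y) = P(y|x) P(x) / P(y)
my %posterior = map { ($_ => $likelihood{$_} * $prior{$_} / $evidence) } keys %prior;

printf "P(play=%s | outlook=sunny) = %.3f\n", $_, $posterior{$_}
    for sort keys %posterior;
```

Here P(y) comes out as 5/14, which is simply the fraction of sunny rows, and the two posteriors sum to 1.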
The algorithm relies on estimating P(A|C), where A is an arbitrary attribute, and C is the class attribute. If A is not real-valued, then this conditional probability is estimated using a table of all possible values for A and C.
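The table-based estimate for a non-real-valued attribute can be sketched as plain counting. The code below is a minimal illustration and not the module's implementation: it tabulates the outlook attribute against the play class from the weather data in the example section and estimates P(A=a|C=c) as a relative frequency. The helper cond_prob and the hash names are hypothetical, and no smoothing is applied, so an unseen combination gets probability 0.

```perl
use strict;
use warnings;

# (outlook, play) pairs taken from the weather data in the example section
my @rows = (
    [sunny => 'no'],     [sunny => 'no'],  [overcast => 'yes'], [rainy => 'yes'],
    [rainy => 'yes'],    [rainy => 'no'],  [overcast => 'yes'], [sunny => 'no'],
    [sunny => 'yes'],    [rainy => 'yes'], [sunny => 'yes'],    [overcast => 'yes'],
    [overcast => 'yes'], [rainy => 'no'],
);

my (%count, %total);
for my $row (@rows) {
    my ($value, $label) = @$row;
    $count{$label}{$value}++;   # joint count of (A=value, C=label)
    $total{$label}++;           # count of C=label
}

# Table lookup: relative frequency of A=a among the instances with C=c
sub cond_prob {
    my ($value, $label) = @_;
    return 0 unless $total{$label};
    return ($count{$label}{$value} // 0) / $total{$label};
}

printf "P(outlook=%s | play=%s) = %.3f\n", $_->[0], $_->[1], cond_prob(@$_)
    for [sunny => 'yes'], [sunny => 'no'], [overcast => 'no'];
```

The last line of output is 0.000 because no overcast row has play=no, which is the kind of case that smoothing would address.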
If A is real-valued, then the distribution P(A|C) is modeled as a Gaussian (normal) distribution for each possible value of C=c. Hence, for each C=c we collect the mean value (m) and standard deviation (s) for A during training. During classification, P(A=a|C=c) is estimated using the Gaussian distribution, i.e., in the following way:
                    1               (a-m)^2
 P(A=a|C=c) = ------------ * exp( - ------- )
              sqrt(2*Pi)*s           2*s^2
This boils down to the following lines of code:
$scores{$label} *=
  0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev} *
  exp( -0.5 *
       ( ( $newattrs->{$att} -
           $m->{real_stat}{$att}{$label}{mean} )
         / $m->{real_stat}{$att}{$label}{stddev}
       ) ** 2
     );
i.e.,
P(A=a|C=c) = 0.398942280401433 / s *
  exp( -0.5 * ( ( a-m ) / s ) ** 2 );
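As a quick check on the expression above, the following standalone sketch evaluates the same density with the constant written out as 1/sqrt(2*Pi). The function name gaussian_pdf and the sample inputs (value 66, mean 73, standard deviation 6.2) are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;

my $PI = 4 * atan2(1, 1);

# Normal density with mean $m and standard deviation $s, evaluated at $x
sub gaussian_pdf {
    my ($x, $m, $s) = @_;
    return exp(-0.5 * (($x - $m) / $s) ** 2) / (sqrt(2 * $PI) * $s);
}

# The hard-coded constant 0.398942280401433 is 1/sqrt(2*Pi)
printf "1/sqrt(2*Pi) = %.15f\n", 1 / sqrt(2 * $PI);

# Illustrative inputs: a = 66, m = 73, s = 6.2
printf "density      = %.4f\n", gaussian_pdf(66, 73, 6.2);
```

Note that the result is a probability density rather than a probability, so it can exceed 1 when s is small; only the normalized product across labels is returned as a probability.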
Example with a real-valued attribute modeled by a Gaussian distribution (from the book "Data Mining" by I. Witten and E. Frank, also known as the WEKA book, page 86):
# @relation weather
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no
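use AI::NaiveBayes1;
use YAML;    # provides YAML::DumpFile

my $nb = AI::NaiveBayes1->new;
$nb->set_real('temperature', 'humidity');
# Add each data row listed above as a training instance here, then train.
$nb->train;
my $printedmodel = "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});
YAML::DumpFile('file', $p);
die unless (abs($p->{'play=no'} - 0.792) < 0.001);
die unless (abs($p->{'play=yes'} - 0.208) < 0.001);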
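The two probabilities checked above can be reproduced by hand, which may help when verifying a model. The sketch below is an independent cross-check in plain Perl, not the module's code. It assumes the standard deviation is the sample estimate (dividing by n-1), an assumption chosen because it reproduces the documented values. All variable names are hypothetical.

```perl
use strict;
use warnings;

# Weather data from the listing above: outlook, temperature, humidity, windy, play
my @data = map { [ split /,/, $_ ] } split /\n/, <<'END';
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
END

my @query   = ('sunny', 66, 90, 'TRUE');   # the instance passed to predict above
my %is_real = (1 => 1, 2 => 1);            # columns 1 and 2 are real-valued

my %rows_by_label;
push @{ $rows_by_label{ $_->[4] } }, $_ for @data;

my %score;
for my $label (keys %rows_by_label) {
    my @rows = @{ $rows_by_label{$label} };
    my $n    = scalar @rows;
    $score{$label} = $n / scalar @data;    # prior P(C=c)
    for my $col (0 .. 3) {
        my @vals = map { $_->[$col] } @rows;
        if ($is_real{$col}) {
            my $mean = 0;
            $mean += $_ / $n for @vals;
            my $var = 0;
            $var += ($_ - $mean) ** 2 / ($n - 1) for @vals;   # sample variance (assumed)
            my $sd = sqrt $var;
            $score{$label} *= 0.398942280401433 / $sd
                * exp(-0.5 * (($query[$col] - $mean) / $sd) ** 2);
        }
        else {
            my $matches = grep { $_ eq $query[$col] } @vals;  # table estimate
            $score{$label} *= $matches / $n;
        }
    }
}

# Normalize the per-label scores so that they sum to 1
my $sum = 0;
$sum += $_ for values %score;
my %p = map { ("play=$_" => $score{$_} / $sum) } keys %score;
printf "%s: %.3f\n", $_, $p{$_} for sort keys %p;
```

Under these assumptions the script prints 0.792 for play=no and 0.208 for play=yes, matching the values asserted in the example.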
Algorithm::NaiveBayes by Ken Williams was not what I needed, so I wrote this one. Algorithm::NaiveBayes is oriented towards text categorization and includes smoothing and log probabilities. This module is a generic, basic Naive Bayes algorithm.
I would like to thank Yung-chung Lin (xern@cpan.org) for his implementation of the Gaussian model for continuous variables, and the following people for bug reports, support, and comments (in chronological order):
Tom Dyson
Dan Von Kohorn
CPAN-testers (jlatour, Jost.Krieger)
Craig Talbert
and Andrew Brian Clegg.
Copyright 2003-2005 Vlado Keselj. In 2004 Yung-chung Lin provided an implementation of the Gaussian model for continuous variables.
This script is provided "as is" without expressed or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The module is available on CPAN (, and The latter site is updated more frequently.
Algorithm::NaiveBayes, perl.