NAME

AI::NaiveBayes1 - Bayesian prediction of categories

SYNOPSIS

use AI::NaiveBayes1;
my $nb = AI::NaiveBayes1->new;

$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=Y',cases=>30);
$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=N',cases=>10);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=Y',cases=>18);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=N',cases=>16);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=Y',cases=>22);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=N',cases=>14);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=Y',cases=> 6);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=N',cases=>84);

$nb->train;

print "Model:\n" . $nb->print_model;

# Find results for unseen instances
my $result = $nb->predict
   (attributes => {model=>'T', place=>'N'});

foreach my $k (keys(%{ $result })) {
    print "for label $k P = " . $result->{$k} . "\n";
}

# export the model into a string
my $string = $nb->export_to_YAML();

# create the same model from the string
my $nb1 = AI::NaiveBayes1->import_from_YAML($string);

# write the model to a file (shorter than model->string->file)
$nb->export_to_YAML_file('t/tmp1');

# read the model from a file (shorter than file->string->model)
my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');

See the EXAMPLES section below for more examples.

DESCRIPTION

This module implements the classic "Naive Bayes" machine learning algorithm.

METHODS

Constructor Methods

new()

Creates a new AI::NaiveBayes1 object and returns it.

set_real(list_of_attributes)

Declares a list of attributes to be real-valued. During training, their conditional probabilities will be modeled with Gaussian (normal) distributions.

import_from_YAML($string)

Creates a new AI::NaiveBayes1 object from a string containing its YAML representation. Requires the YAML module.

import_from_YAML_file($file_name)

Creates a new AI::NaiveBayes1 object from a file containing its YAML representation. Requires the YAML module.

Methods

add_instance(attributes=>HASH,label=>STRING|ARRAY)

Adds a training instance to the categorizer.

add_instances(attributes=>HASH,label=>STRING|ARRAY,cases=>NUMBER)

Adds a number of identical instances to the categorizer.

export_to_YAML()

Returns a YAML string representation of an AI::NaiveBayes1 object. Requires the YAML module.

export_to_YAML_file( $file_name )

Writes a YAML string representation of an AI::NaiveBayes1 object to a file. Requires the YAML module.

print_model()

Returns a human-readable string representation of the model. The model must be trained before calling this method.

train()

Calculates the probabilities that will be necessary for categorization using the predict() method.

predict( attributes => HASH )

Use this method to predict the label of an unknown instance. The attributes should be in the same format as those passed to add_instance(). predict() returns a hash reference whose keys are the names of labels and whose values are the corresponding probabilities.

labels

Returns a list of all the labels the object knows about (in no particular order), or the number of labels if called in a scalar context.

THEORY

Bayes' Theorem is a way of inverting a conditional probability. It states:

          P(y|x) P(x)
P(x|y) = -------------
             P(y)

and so on...
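As a quick sanity check of the theorem, the following minimal sketch plugs in counts from the SYNOPSIS data, taking x to be "repairs=Y" and y to be "model=T". The variable names are illustrative only and are not part of the module.

```perl
use strict;
use warnings;

# Counts taken from the SYNOPSIS data: x is "repairs=Y", y is "model=T".
my $total     = 200;    # all cases
my $n_x       = 76;     # cases with repairs=Y (30 + 18 + 22 + 6)
my $n_y       = 126;    # cases with model=T   (22 + 14 + 6 + 84)
my $n_x_and_y = 28;     # cases with both      (22 + 6)

my $p_x         = $n_x / $total;
my $p_y         = $n_y / $total;
my $p_y_given_x = $n_x_and_y / $n_x;

# Bayes' Theorem: P(x|y) = P(y|x) P(x) / P(y)
my $p_x_given_y = $p_y_given_x * $p_x / $p_y;

# Direct estimate from the counts, for comparison.
my $direct = $n_x_and_y / $n_y;

printf "P(repairs=Y | model=T) = %.4f (Bayes), %.4f (direct)\n",
    $p_x_given_y, $direct;
```

Both numbers agree, since the inversion only rearranges the same counts.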

This is a pretty standard algorithm explained in many machine learning textbooks (e.g., "Data Mining" by Witten and Frank).

The algorithm relies on estimating P(A|C), where A is an arbitrary attribute, and C is the class attribute. If A is not real-valued, then this conditional probability is estimated using a table of all possible values for A and C.
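The table-based estimate can be sketched in plain Perl as follows. This is an illustration of the idea using the SYNOPSIS data, not the module's internal code: the hash layout and the helper name predict_sketch are assumptions, and plain relative frequencies without smoothing are assumed.

```perl
use strict;
use warnings;

# Training data from the SYNOPSIS: [model, place, label, cases].
my @data = (
    [ 'H', 'B', 'repairs=Y', 30 ], [ 'H', 'B', 'repairs=N', 10 ],
    [ 'H', 'N', 'repairs=Y', 18 ], [ 'H', 'N', 'repairs=N', 16 ],
    [ 'T', 'B', 'repairs=Y', 22 ], [ 'T', 'B', 'repairs=N', 14 ],
    [ 'T', 'N', 'repairs=Y',  6 ], [ 'T', 'N', 'repairs=N', 84 ],
);

# Build the count tables: one entry per (attribute, value, label).
my (%label_count, %table);
my $total = 0;
for my $row (@data) {
    my ($model, $place, $label, $cases) = @$row;
    $total                           += $cases;
    $label_count{$label}             += $cases;
    $table{model}{$model}{$label}    += $cases;
    $table{place}{$place}{$label}    += $cases;
}

# Hypothetical helper: score each label as P(C=c) times the product of
# P(A=a|C=c) over the given attributes, then normalize the scores.
sub predict_sketch {
    my (%attrs) = @_;
    my %score;
    for my $label (keys %label_count) {
        my $s = $label_count{$label} / $total;
        for my $att (keys %attrs) {
            my $n = $table{$att}{ $attrs{$att} }{$label} || 0;
            $s *= $n / $label_count{$label};
        }
        $score{$label} = $s;
    }
    my $sum = 0;
    $sum += $_ for values %score;
    $score{$_} /= $sum for keys %score;
    return \%score;
}

my $p = predict_sketch(model => 'T', place => 'N');
printf "%s: %.4f\n", $_, $p->{$_} for sort keys %$p;
```

Under these assumptions the sketch gives roughly 0.10 for repairs=Y and 0.90 for repairs=N.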

If A is real-valued, then the distribution P(A|C) is modeled as a Gaussian (normal) distribution for each possible value of C=c. Hence, for each C=c we collect the mean value (m) and standard deviation (s) of A during training. During classification, P(A=a|C=c) is estimated using the Gaussian density, i.e., in the following way:

                   1               (a-m)^2
P(A=a|C=c) = ------------ * exp( - ------- )
             sqrt(2*Pi)*s           2*s^2
                                                                                                                             

This boils down to the following lines of code:

    $scores{$label} *=
        0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev} *
        exp( -0.5 *
             ( ( $newattrs->{$att} -
                 $m->{real_stat}{$att}{$label}{mean} )
               / $m->{real_stat}{$att}{$label}{stddev}
             ) ** 2
           );
                                                                                                                              

i.e.,

P(A=a|C=c) = 0.398942280401433 / s *
  exp( -0.5 * ( ( a-m ) / s ) ** 2 );
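The two forms of the formula can be checked against each other with a small standalone sketch. The function names gaussian_textbook and gaussian_code_form are made up for this illustration. The sample values (a=66, m=73, s=sqrt(38)) are the temperature statistics for play=yes in the weather data of the EXAMPLES section, assuming the sample standard deviation with an n-1 denominator.

```perl
use strict;
use warnings;

# Textbook form: 1 / (sqrt(2*Pi)*s) * exp( -(a-m)^2 / (2*s^2) )
sub gaussian_textbook {
    my ($x, $m, $s) = @_;
    my $pi = 4 * atan2(1, 1);
    return 1 / (sqrt(2 * $pi) * $s) * exp( -($x - $m) ** 2 / (2 * $s ** 2) );
}

# Form used in the code: the constant 0.398942280401433 is 1/sqrt(2*Pi).
sub gaussian_code_form {
    my ($x, $m, $s) = @_;
    return 0.398942280401433 / $s * exp( -0.5 * ( ($x - $m) / $s ) ** 2 );
}

# Temperature given play=yes: mean 73, sample variance 38 (assumed n-1).
my ($x, $m, $s) = (66, 73, sqrt(38));
my $v1 = gaussian_textbook($x, $m, $s);
my $v2 = gaussian_code_form($x, $m, $s);

printf "textbook form: %.6f\ncode form:     %.6f\n", $v1, $v2;
```

The two values agree to within floating-point precision, and both are close to 0.034.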

EXAMPLES

Example with real-valued attributes modeled by Gaussian distributions (from the book "Data Mining" by I. Witten and E. Frank, also known as the WEKA book, page 86):

# @relation weather
# 
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
# 
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no

$nb->set_real('temperature', 'humidity');

$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');

$nb->train;

my $printedmodel = "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});

YAML::DumpFile('file', $p);
die unless (abs($p->{'play=no'}  - 0.792) < 0.001);
die unless (abs($p->{'play=yes'} - 0.208) < 0.001);
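For readers who want to see where the two probabilities come from, here is a standalone sketch that recomputes them by hand, without using the module. It follows the description in the THEORY section; the data layout and the helper names mean_stddev and gaussian are assumptions made for this illustration, and the standard deviation is assumed to be the sample one (n-1 denominator), as in the textbook.

```perl
use strict;
use warnings;

# Weather data: outlook, temperature, humidity, windy, play.
my @rows = map { [ split /,/ ] } split /\n/, <<'END';
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
END

my @names   = qw(outlook temperature humidity windy);
my %is_real = (temperature => 1, humidity => 1);

# Collect label counts, value counts for nominal attributes,
# and raw values for real-valued attributes.
my (%n, %count, %values);
for my $r (@rows) {
    my $label = $r->[4];
    $n{$label}++;
    for my $i (0 .. $#names) {
        my $att = $names[$i];
        if ($is_real{$att}) { push @{ $values{$att}{$label} }, $r->[$i] }
        else                { $count{$att}{ $r->[$i] }{$label}++ }
    }
}

# Mean and sample standard deviation (n-1 denominator; an assumption).
sub mean_stddev {
    my @v = @_;
    my ($mean, $var) = (0, 0);
    $mean += $_ / @v for @v;
    $var  += ($_ - $mean) ** 2 / (@v - 1) for @v;
    return ($mean, sqrt $var);
}

# Gaussian density in the same form as shown in the THEORY section.
sub gaussian {
    my ($x, $m, $s) = @_;
    return 0.398942280401433 / $s * exp( -0.5 * ( ($x - $m) / $s ) ** 2 );
}

my %query = (outlook => 'sunny', temperature => 66,
             humidity => 90,     windy => 'TRUE');

# Score each label, then normalize so the scores sum to 1.
my %score;
my $sum = 0;
for my $label (keys %n) {
    my $s = $n{$label} / @rows;
    for my $att (@names) {
        if ($is_real{$att}) {
            $s *= gaussian($query{$att}, mean_stddev(@{ $values{$att}{$label} }));
        }
        else {
            $s *= ($count{$att}{ $query{$att} }{$label} || 0) / $n{$label};
        }
    }
    $score{$label} = $s;
    $sum += $s;
}
printf "play=%s: %.3f\n", $_, $score{$_} / $sum for sort keys %score;
```

Under these assumptions the normalized scores land within the same 0.001 tolerance of 0.792 and 0.208 that the checks above use.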

HISTORY

Algorithm::NaiveBayes by Ken Williams was not what I needed, so I wrote this one. Algorithm::NaiveBayes is oriented towards text categorization and includes smoothing and log probabilities. This module is a generic, basic Naive Bayes algorithm.

THANKS

I would like to thank Yung-chung Lin (xern@cpan.org) for his implementation of the Gaussian model for continuous variables, and the following people for bug reports, support, and comments (in chronological order):

Tom Dyson

Dan Von Kohorn

CPAN-testers (jlatour, Jost.Krieger)

Craig Talbert

and Andrew Brian Clegg.

AUTHOR

Copyright 2003-2005 Vlado Keselj http://www.cs.dal.ca/~vlado. In 2004 Yung-chung Lin provided the implementation of the Gaussian model for continuous variables.

This script is provided "as is" without express or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The module is available on CPAN (http://search.cpan.org/~vlado), and http://www.cs.dal.ca/~vlado/srcperl/. The latter site is updated more frequently.

SEE ALSO

Algorithm::NaiveBayes, perl.