NAME

AI::NaiveBayes1 - Bayesian prediction of categories

SYNOPSIS

use AI::NaiveBayes1;
my $nb = AI::NaiveBayes1->new;

$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=Y',cases=>30);
$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=N',cases=>10);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=Y',cases=>18);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=N',cases=>16);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=Y',cases=>22);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=N',cases=>14);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=Y',cases=> 6);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=N',cases=>84);

$nb->train;

print "Model:\n" . $nb->print_model;

# Find results for unseen instances
my $result = $nb->predict
   (attributes => {model=>'T', place=>'N'});

foreach my $k (keys(%{ $result })) {
    print "for label $k P = " . $result->{$k} . "\n";
}

# export the model into a string
my $string = $nb->export_to_YAML();

# create the same model from the string
my $nb1 = AI::NaiveBayes1->import_from_YAML($string);

# write the model to a file (shorter than model->string->file)
$nb->export_to_YAML_file('t/tmp1');

# read the model from a file (shorter than file->string->model)
my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');

See the EXAMPLES section for more examples.

DESCRIPTION

This module implements the classic "Naive Bayes" machine learning algorithm.

METHODS

Constructor Methods

new(): Creates a new AI::NaiveBayes1 object and returns it.
set_real(list_of_attributes): Declares a list of attributes to be real-valued. During training, their conditional probabilities will be modeled with Gaussian (normal) distributions.
import_from_YAML($string): Creates a new AI::NaiveBayes1 object from a string where it is represented in YAML. Requires YAML module.
import_from_YAML_file($file_name): Creates a new AI::NaiveBayes1 object from a file where it is represented in YAML. Requires YAML module.

Methods

add_instance(attributes=>HASH,label=>STRING|ARRAY): Adds a training instance to the categorizer.
add_instances(attributes=>HASH,label=>STRING|ARRAY,cases=>NUMBER): Adds a number of identical instances to the categorizer.
export_to_YAML(): Returns a YAML string representation of an AI::NaiveBayes1 object. Requires YAML module.
export_to_YAML_file( $file_name ): Writes a YAML string representation of an AI::NaiveBayes1 object to a file. Requires YAML module.
print_model(): Returns a human-readable string representation of the model. The model must be trained before calling this method.
train(): Calculates the probabilities that will be necessary for categorization using the predict() method.
predict( attributes => HASH ): Use this method to predict the label of an unknown instance. The attributes should be in the same format as those passed to add_instance(). predict() returns a hash reference whose keys are the names of labels, and whose values are the corresponding probabilities.
labels(): Returns a list of all the labels the object knows about (in no particular order), or the number of labels if called in a scalar context.
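
Since predict() returns a hash reference of label => probability, a common follow-up step is picking the most probable label. The following is a minimal sketch of that step only; best_label() is a hypothetical helper (not part of this module), and the hash below is a hand-written stand-in for a predict() result rather than output produced by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the label with the highest probability
# from a predict()-style hash reference (label => probability).
# Ties are broken alphabetically so the result is deterministic.
sub best_label {
    my ($result) = @_;
    my ($best) = sort { $result->{$b} <=> $result->{$a} || $a cmp $b }
                 keys %$result;
    return $best;
}

# Stand-in for the hash reference returned by $nb->predict(...).
my $result = { 'play=no' => 0.792, 'play=yes' => 0.208 };

my $best = best_label($result);
printf "%s (P = %.3f)\n", $best, $result->{$best};
```

In real use, $result would come from $nb->predict(attributes => {...}) after $nb->train has been called.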

THEORY

Bayes' Theorem is a way of inverting a conditional probability. It states:

          P(y|x) P(x)
P(x|y) = -------------
             P(y)

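Naive Bayes uses this theorem to compute the probability of each label given the observed attribute values, under the "naive" assumption that the attributes are conditionally independent given the label.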
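
As a small worked illustration of the inversion, the sketch below applies the formula to counts taken from the SYNOPSIS data, with x standing for repairs=Y and y for model=T. The function name bayes_invert() and the variable names are made up for this example; the code does not use the module itself.

```perl
use strict;
use warnings;

# Bayes' Theorem: P(x|y) = P(y|x) * P(x) / P(y)
sub bayes_invert {
    my ($p_y_given_x, $p_x, $p_y) = @_;
    return $p_y_given_x * $p_x / $p_y;
}

# Counts taken from the SYNOPSIS data (200 cases in total).
my $total     = 200;
my $repairs_y = 30 + 18 + 22 + 6;     # 76 cases with repairs=Y
my $model_t   = 22 + 14 + 6 + 84;     # 126 cases with model=T
my $t_and_y   = 22 + 6;               # 28 cases with model=T and repairs=Y

my $p_y_given_t = bayes_invert(
    $t_and_y / $repairs_y,    # P(model=T | repairs=Y)
    $repairs_y / $total,      # P(repairs=Y)
    $model_t / $total,        # P(model=T)
);

printf "P(repairs=Y | model=T) = %.4f\n", $p_y_given_t;
```

The result equals 28/126, which is what counting the model=T cases directly would give.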

This is a standard algorithm explained in many machine learning textbooks (e.g., "Data Mining" by Witten and Frank).

The algorithm relies on estimating P(A|C), where A is an arbitrary attribute, and C is the class attribute. If A is not real-valued, then this conditional probability is estimated using a table of all possible values for A and C.
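
The table-based estimate for nominal attributes can be sketched in plain Perl as follows. This is an independent illustration using the counts from the SYNOPSIS, not the module's internal code; it assumes no smoothing, and the names @data, %att_count and table_predict() are invented for the example. The module's own output for the same query may differ in detail.

```perl
use strict;
use warnings;

# Training counts copied from the SYNOPSIS: [model, place, label, cases].
my @data = (
    ['H', 'B', 'Y', 30], ['H', 'B', 'N', 10],
    ['H', 'N', 'Y', 18], ['H', 'N', 'N', 16],
    ['T', 'B', 'Y', 22], ['T', 'B', 'N', 14],
    ['T', 'N', 'Y',  6], ['T', 'N', 'N', 84],
);

# Build the count tables: cases per label, and cases per
# (attribute, value, label) combination.
my (%label_count, %att_count);
my $total = 0;
for my $row (@data) {
    my ($model, $place, $label, $cases) = @$row;
    $total                              += $cases;
    $label_count{$label}                += $cases;
    $att_count{model}{$model}{$label}   += $cases;
    $att_count{place}{$place}{$label}   += $cases;
}

# Score each label as P(C=c) times the product of P(A=a|C=c) over the
# given attributes, then normalize the scores so they sum to 1.
sub table_predict {
    my (%attrs) = @_;
    my %score;
    for my $label (keys %label_count) {
        $score{$label} = $label_count{$label} / $total;
        for my $att (keys %attrs) {
            my $n = $att_count{$att}{ $attrs{$att} }{$label} // 0;
            $score{$label} *= $n / $label_count{$label};
        }
    }
    my $sum = 0;
    $sum += $_ for values %score;
    $score{$_} /= $sum for keys %score;
    return \%score;
}

my $p = table_predict(model => 'T', place => 'N');
printf "repairs=%s P = %.4f\n", $_, $p->{$_} for sort keys %$p;
```

The sketch does not guard against the case where every label scores zero (for example, an attribute value never seen in training), which would make the normalization divide by zero.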

If A is real-valued, then the distribution P(A|C) is modeled as a Gaussian (normal) distribution for each possible value C=c. Hence, for each C=c we collect the mean (m) and standard deviation (s) of A during training. During classification, P(A=a|C=c) is estimated using the Gaussian density, i.e., in the following way:

                   1               (a-m)^2
P(A=a|C=c) = ------------ * exp( - ------- )
             sqrt(2*Pi)*s           2*s^2

This boils down to the following lines of code:

    $scores{$label} *=
      0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev} *
      exp( -0.5 *
           ( ( $newattrs->{$att} -
               $m->{real_stat}{$att}{$label}{mean} )
             / $m->{real_stat}{$att}{$label}{stddev}
           ) ** 2
         );

i.e.,

P(A=a|C=c) = 0.398942280401433 / s *
  exp( -0.5 * ( ( a-m ) / s ) ** 2 );
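
For illustration, the two steps (collecting m and s, then evaluating the density) can be written as a standalone sketch. The temperature values are those of the play=yes rows in the weather data of the EXAMPLES section. The helpers mean_stddev() and gaussian_density() are hypothetical names, and the use of the sample standard deviation (n-1 denominator) is an assumption of this sketch, not a statement about the module's internals.

```perl
use strict;
use warnings;

# Mean and sample standard deviation of a list of numbers.
# Assumption: n-1 denominator (sample standard deviation).
sub mean_stddev {
    my @values = @_;
    my $n      = @values;
    my $sum    = 0;
    $sum += $_ for @values;
    my $mean = $sum / $n;
    my $ss   = 0;
    $ss += ($_ - $mean) ** 2 for @values;
    return ($mean, sqrt($ss / ($n - 1)));
}

# Gaussian density at $x, using 0.398942280401433 = 1/sqrt(2*Pi).
sub gaussian_density {
    my ($x, $mean, $stddev) = @_;
    my $z = ($x - $mean) / $stddev;
    return 0.398942280401433 / $stddev * exp(-0.5 * $z ** 2);
}

# Temperatures of the play=yes instances in the weather data.
my @temp_yes = (83, 70, 68, 64, 69, 75, 75, 72, 81);

my ($m, $s) = mean_stddev(@temp_yes);
my $density = gaussian_density(66, $m, $s);

printf "mean = %.2f, stddev = %.2f\n", $m, $s;
printf "P(temperature=66 | play=yes) ~ %.4f\n", $density;
```

Note that the value is a probability density rather than a probability, so it is only meaningful relative to the densities computed for the other labels.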

EXAMPLES

Example with a real-valued attribute modeled by a Gaussian distribution (from Witten I. and Frank E. book "Data Mining" (the WEKA book), page 86):

# @relation weather
# 
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
# 
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no

use AI::NaiveBayes1;
use YAML;    # provides YAML::DumpFile

my $nb = AI::NaiveBayes1->new;
$nb->set_real('temperature', 'humidity');

$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');

$nb->train;

my $printedmodel =  "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});

YAML::DumpFile('file', $p);
die unless (abs($p->{'play=no'}  - 0.792) < 0.001);
die unless (abs($p->{'play=yes'} - 0.208) < 0.001);

HISTORY

Algorithm::NaiveBayes by Ken Williams was not what I needed, so I wrote this one. Algorithm::NaiveBayes is oriented towards text categorization and includes smoothing and log probabilities. This module is a generic, basic Naive Bayes algorithm.

THANKS

I would like to thank Yung-chung Lin (xern@cpan.org) for his implementation of the Gaussian model for continuous variables, and the following people for bug reports, support, and comments (in chronological order):

Tom Dyson

Dan Von Kohorn

CPAN-testers (jlatour, Jost.Krieger)

Craig Talbert

and Andrew Brian Clegg.

AUTHOR

This script is provided "as is" without expressed or implied warranty. This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The module is available on CPAN (http://search.cpan.org/~vlado), and http://www.cs.dal.ca/~vlado/srcperl/. The latter site is updated more frequently.
