NAME

Lingua::Stem::En - Porter's stemming algorithm for 'generic' English

SYNOPSIS

use Lingua::Stem::En;
my $stems   = Lingua::Stem::En::stem({ -words => $word_list_reference,
                                    -locale => 'en',
                                -exceptions => $exceptions_hash,
                                 });
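
A fuller version of the call above, as a minimal sketch rather than a reference usage. The word list and the exceptions entry are illustrative. It assumes that a word found in the exceptions hash is replaced by the supplied value instead of being stemmed, and that the result holds one stem per input word in input order.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Words to stem; the list itself is illustrative.
my @words = qw(running cats ponies relational);

# Assumed semantics: a word found in this hash is replaced by the
# given value instead of being run through the Porter rules.
my %exceptions = ( ponies => 'pony' );

my $stems = Lingua::Stem::En::stem({
    -words      => \@words,
    -locale     => 'en',
    -exceptions => \%exceptions,
});

# Print each input word next to its stem (assumes matching order).
for my $i (0 .. $#words) {
    printf "%-12s => %s\n", $words[$i], $stems->[$i];
}
```

The parameters are passed as a single anonymous hash, so the order of the named keys does not matter.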

DESCRIPTION

This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words.

It is derived from the C program "stemmer.c" as found in freewais and elsewhere, which contains these notes:

Purpose:    Implementation of the Porter stemming algorithm documented 
            in: Porter, M.F., "An Algorithm For Suffix Stripping," 
            Program 14 (3), July 1980, pp. 130-137.
Provenance: Written by B. Frakes and C. Cox, 1986.

I have re-interpreted areas that use Frakes and Cox's "WordSize" function. My version may misbehave on short words starting with "y", but I can't think of any examples.

The step numbers correspond to Frakes and Cox, and are probably in Porter's article (which I've not seen). Porter's algorithm still has rough spots (e.g., current/currency, -ings words), which I've not attempted to cure, although I have added support for the British -ise suffix.

CHANGES

1999.06.15 - Changed to '.pm' module, moved into Lingua::Stem namespace,
             optionalized the export of the 'stem' routine
             into the caller's namespace, added named parameters

1999.06.24 - Switch core implementation of the Porter stemmer to
             the one written by Jim Richardson <jimr@maths.usyd.edu.au>

2000.08.25 - 2.11 Added stemming cache

METHODS

stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions });

Stems a list of passed words using the rules of US English. Returns an anonymous array reference to the stemmed words.

Example:

my $stemmed_words = Lingua::Stem::En::stem({ -words => \@words,
                                            -locale => 'en',
                                        -exceptions => \%exceptions,
                        });

stem_caching({ -level => 0|1|2 });

Sets the level of stem caching.

'0' means 'no caching'. This is the default level.

'1' means 'cache per run'. This caches stemming results during a single call to 'stem'.

'2' means 'cache indefinitely'. This caches stemming results until either the process exits or the 'clear_stem_cache' method is called.

clear_stem_cache;

Clears the cache of stemmed words.
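
The caching calls might be combined as in the following sketch. The word batches are illustrative, and the example assumes both routines can be called as plain package functions in the same way as 'stem'.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Level 2: keep stemming results across separate calls to stem().
Lingua::Stem::En::stem_caching({ -level => 2 });

# Two illustrative batches that share the word 'connected'.
my @batches = ( [qw(connect connected)], [qw(connected connection)] );

my @results;
for my $batch (@batches) {
    push @results, Lingua::Stem::En::stem({
        -words  => $batch,
        -locale => 'en',
    });
}

# Drop the cached stems, then return to the default level (no caching).
Lingua::Stem::En::clear_stem_cache();
Lingua::Stem::En::stem_caching({ -level => 0 });
```

Level 2 trades memory for speed in a long-running process, which is why the sketch clears the cache explicitly once the batches are done.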

NOTES

This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson.

SEE ALSO

Lingua::Stem

AUTHOR

Jim Richardson, University of Sydney
jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html

Integration in Lingua::Stem by 
Benjamin Franz, FreeRun Technologies,
snowhare@nihongo.org or http://www.nihongo.org/snowhare/

COPYRIGHT

Jim Richardson, University of Sydney
Benjamin Franz, FreeRun Technologies

This code is freely available under the same terms as Perl.

BUGS

TODO