NAME

WAIT::Filter - Perl extension providing the basic freeWAIS-sf reduction functions

SYNOPSIS

use WAIT::Filter qw(Stem Soundex Phonix isolc disolc isouc disouc
                    isotr disotr stop grundform utf8iso);

$stem   = Stem($word);
$scode  = Soundex($word);
$pcode  = Phonix($word);
$lword  = isolc($word);
disolc($word);
$uword  = isouc($word);
disouc($word);
$trword = isotr($word);
disotr($word);
$word   = stop($word);
$word   = grundform($word);

@words = WAIT::Filter::split($word);
@words = WAIT::Filter::split2($word);
@words = WAIT::Filter::split3($word);
@words = WAIT::Filter::split4($word); # arbitrary numbers allowed

DESCRIPTION

This tiny module gives access to the basic reduction functions built into freeWAIS-sf.

Stem(word)

reduces word using the well-known Porter algorithm.

AU: Porter, M.F.
TI: An Algorithm for Suffix Stripping
JT: Program
VO: 14
PP: 130-137
PY: 1980
PM: JUL

Soundex(word)

computes the 4 byte Soundex code for word.

AU: Gadd, T.N.
TI: 'Fisching for Werds'. Phonetic Retrieval of written text in
    Information Retrieval Systems
JT: Program
VO: 22
NO: 3
PP: 222-237
PY: 1988

Phonix(word)

computes the 8 byte Phonix code for word.

AU: Gadd, T.N.
TI: PHONIX: The Algorithm
JT: Program
VO: 24
NO: 4
PP: 363-366
PY: 1990
PM: OCT

ISO character case functions

There are some additional functions which convert some/most ISO Latin-1 characters to upper and lower case. For maximum speed there are also destructive versions, which change the argument in place instead of allocating a copy to return. For convenience, the destructive versions also return the argument. So all of the following are valid, and $word will contain the lowercased string.

$word = isolc($word);
$word = disolc($word);
disolc($word);
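
The difference between the copying and the destructive variants can be illustrated in plain Perl. The following is only a sketch: my_lc and my_dlc are hypothetical stand-ins, not the module's functions, and they handle ASCII letters only. It relies on the fact that the elements of @_ are aliases for the caller's arguments.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for isolc/disolc; ASCII-only for brevity.

# Copying version: works on a private copy and returns it.
sub my_lc {
    my ($word) = @_;          # copies the argument
    $word =~ tr/A-Z/a-z/;
    return $word;
}

# Destructive version: $_[0] is an alias for the caller's variable,
# so the change is visible to the caller. The value is also returned.
sub my_dlc {
    $_[0] =~ tr/A-Z/a-z/;
    return $_[0];
}

my $word = 'Hello';
my $copy = my_lc($word);      # $word is still 'Hello' here
my_dlc($word);                # $word has now been changed in place
print "$copy $word\n";
```

The destructive form only makes sense when it is called with a variable, since there is nothing useful to modify in a temporary value.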

Here are the hardcoded characters which are recognized:

abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß
ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß

$new = isolc($word)
disolc($word)

transposes to lower case.

$new = isouc($word)
disouc($word)

transposes to upper case.

$new = isotr($word)
disotr($word)

Removes non-letters according to the above table.

$new = stop($word)

Returns an empty string if $word is a stopword.
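
Because a stopword comes back as an empty string, a caller can drop stopwords simply by filtering on length. A minimal sketch of that convention, assuming a made-up stopword list and a hypothetical my_stop function (the module's real list and matching rules are not shown here):

```perl
use strict;
use warnings;

# Hypothetical stopword list; the module's real list is not shown here.
my %is_stopword = map { $_ => 1 } qw(a an and of the);

# Return '' for stopwords and the word itself otherwise.
sub my_stop {
    my ($word) = @_;
    return $is_stopword{$word} ? '' : $word;
}

# Empty strings have length 0, so grep { length } removes the stopwords.
my @kept = grep { length } map { my_stop($_) } qw(the art of indexing);
print "@kept\n";
```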

$new = grundform($word)

Calls Text::German::reduce

$new = utf8iso($word)

Converts UTF-8 encoded strings to ISO-8859-1. WAIT is currently based internally on the Latin-1 character set, so if you process text in any other encoding, you should convert it to Latin-1 in the first filter.
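
The same kind of conversion can be sketched with the core Encode module. This is not how utf8iso itself is implemented; my_utf8iso is a hypothetical helper that only shows what the conversion does at the byte level.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode UTF-8 bytes as ISO-8859-1 bytes.
sub my_utf8iso {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> Perl characters
    return encode('ISO-8859-1', $chars);   # characters -> Latin-1 bytes
}

my $utf8   = "Gr\xC3\xBC\xC3\x9Fe";        # "Grüße" as UTF-8 bytes
my $latin1 = my_utf8iso($utf8);            # two-byte sequences become one byte each
printf "%d bytes in, %d bytes out\n", length($utf8), length($latin1);
```

Characters that have no Latin-1 equivalent cannot survive such a conversion, so input outside that repertoire loses information.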

split, split2, split3, ...

The splitN functions all take a scalar as input and return a list of words. split acts just like Perl's split(' '). split2 eliminates all words from the list that are shorter than 2 characters (bytes), split3 eliminates those shorter than 3 characters (bytes), and so on.
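
The behaviour described above can be modelled in a few lines of plain Perl. split_n below is a hypothetical helper that takes the minimum length as a parameter; it is a sketch of the described semantics, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper modelling splitN: split on whitespace, then drop
# words shorter than $min. With $min = 1 it behaves like split(' ').
# length() counts bytes here because the input is a plain byte string.
sub split_n {
    my ($text, $min) = @_;
    return grep { length($_) >= $min } split ' ', $text;
}

my $text = '  a to the index  ';
my @all  = split_n($text, 1);   # every whitespace-separated word
my @s3   = split_n($text, 3);   # only words of 3 or more bytes
print scalar(@all), ' ', scalar(@s3), "\n";
```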

AUTHOR

Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>

SEE ALSO

perl(1).
