NAME

WAIT::Filter - Perl extension providing the basic freeWAIS-sf reduction functions

SYNOPSIS

use WAIT::Filter qw(Stem Soundex Phonix isolc disolc isouc disouc
                    isotr disotr stop grundform utf8iso);

$stem   = Stem($word);
$scode  = Soundex($word);
$pcode  = Phonix($word);
$lword  = isolc($word);
disolc($word);
$uword  = isouc($word);
disouc($word);
$trword = isotr($word);
disotr($word);
$word   = stop($word);
$word   = grundform($word);

@words = WAIT::Filter::split($word);
@words = WAIT::Filter::split2($word);
@words = WAIT::Filter::split3($word);
@words = WAIT::Filter::split4($word); # arbitrary numbers allowed

DESCRIPTION

This tiny module gives access to the basic reduction functions built into freeWAIS-sf.

Stem(word)

Reduces word using the well-known Porter algorithm.

AU: Porter, M.F.
TI: An Algorithm for Suffix Stripping
JT: Program
VO: 14
PP: 130-137
PY: 1980
PM: JUL

Soundex(word)

Computes the 4-byte Soundex code for word.

AU: Gadd, T.N.
TI: 'Fisching for Werds'. Phonetic Retrieval of written text in
    Information Retrieval Systems
JT: Program
VO: 22
NO: 3
PP: 222-237
PY: 1988

Phonix(word)

Computes the 8-byte Phonix code for word.

AU: Gadd, T.N.
TI: PHONIX: The Algorithm
JT: Program
VO: 24
NO: 4
PP: 363-366
PY: 1990
PM: OCT

ISO character case functions

There are some additional functions which transpose some/most ISO Latin-1 characters to upper and lower case. For maximum speed there are also destructive versions, which change the argument in place instead of allocating and returning a copy. For convenience, the destructive versions also return the argument, so all of the following are valid and leave the lowercased string in $word.

$word = isolc($word);
$word = disolc($word);
disolc($word);
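
The difference between the copying and the destructive calling style can be sketched in plain Perl. The names my_lc and my_dlc below are hypothetical stand-ins, not part of WAIT::Filter, and they handle ASCII letters only; the sketch merely illustrates how a function can modify its argument in place through the @_ aliases and still return it.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration only (ASCII letters only);
# this is not how WAIT::Filter implements isolc/disolc.

# Copying version: works on a private copy and returns it.
sub my_lc {
    my ($word) = @_;
    $word =~ tr/A-Z/a-z/;
    return $word;
}

# Destructive version: the elements of @_ alias the caller's
# variables, so changing $_[0] changes the caller's scalar.
sub my_dlc {
    $_[0] =~ tr/A-Z/a-z/;
    return $_[0];
}

my $word   = 'WAIS';
my $copy   = my_lc($word);   # $word itself is left untouched
my $before = $word;          # remember the value before the in-place call
my_dlc($word);               # $word is rewritten in place

print "copy=$copy before=$before after=$word\n";
```

Because the destructive form writes through its argument, it needs a modifiable scalar; passing a string literal to such a function would fail at run time.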

Here are the hardcoded characters which are recognized:

abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß
ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß

$new = isolc($word)
disolc($word)

Transposes to lower case.

$new = isouc($word)
disouc($word)

Transposes to upper case.

$new = isotr($word)
disotr($word)

Removes non-letters according to the above table.

$new = stop($word)

Returns an empty string if $word is a stopword.

$new = grundform($word)

Calls Text::German::reduce.

$new = utf8iso($word)

Converts UTF-8 encoded strings to ISO-8859-1. WAIT is currently based on the Latin-1 character set internally, so if you process anything in a different encoding, you should convert to Latin-1 as the first filter.

split, split2, split3, ...

The splitN functions all take a scalar as input and return a list of words. split acts just like the Perl split(' '). split2 eliminates all words from the list that are shorter than 2 characters (bytes), split3 eliminates those shorter than 3 characters (bytes), and so on.
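
The behaviour described for the splitN functions can be approximated in plain Perl as follows. The helper split_min and its second parameter are assumptions made for this sketch; the module itself exposes split, split2, split3 and so on rather than a function taking the minimum length as an argument.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WAIT::Filter: split on whitespace
# like split(' ') and keep only words of at least $n characters.
# For byte strings such as Latin-1 text, length() counts bytes.
sub split_min {
    my ($text, $n) = @_;
    return grep { length($_) >= $n } split ' ', $text;
}

my $text  = ' a to the WAIS index ';
my @all   = split_min($text, 1);   # comparable to split
my @two   = split_min($text, 2);   # comparable to split2
my @three = split_min($text, 3);   # comparable to split3

print join('|', @all),   "\n";
print join('|', @two),   "\n";
print join('|', @three), "\n";
```

As with split(' '), leading and trailing whitespace does not produce empty words in this sketch.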

AUTHOR

Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>
