NAME
WAIT::Filter - Perl extension providing the basic freeWAIS-sf reduction functions
SYNOPSIS
use WAIT::Filter qw(Stem Soundex Phonix isolc disolc isouc disouc
isotr disotr stop grundform utf8iso);
$stem = Stem($word);
$scode = Soundex($word);
$pcode = Phonix($word);
$lword = isolc($word);
disolc($word);
$uword = isouc($word);
disouc($word);
$trword = isotr($word);
disotr($word);
$word = stop($word);
$word = grundform($word);
@words = WAIT::Filter::split($word);
@words = WAIT::Filter::split2($word);
@words = WAIT::Filter::split3($word);
@words = WAIT::Filter::split4($word); # arbitrary numbers allowed
DESCRIPTION
This tiny module gives access to the basic reduction functions built into freeWAIS-sf.
- Stem(word)
-
Reduces word using the well-known Porter algorithm.
AU: Porter, M.F. TI: An Algorithm for Suffix Stripping JT: Program VO: 14 PP: 130-137 PY: 1980 PM: JUL
- Soundex(word)
-
Computes the 4-byte Soundex code for word.
AU: Gadd, T.N. TI: 'Fisching for Werds'. Phonetic Retrieval of written text in Information Retrieval Systems JT: Program VO: 22 NO: 3 PP: 222-237 PY: 1988
- Phonix(word)
-
Computes the 8-byte Phonix code for word.
AU: Gadd, T.N. TI: PHONIX: The Algorithm JT: Program VO: 24 NO: 4 PP: 363-366 PY: 1990 PM: OCT
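To give a feel for what a fixed-width phonetic code looks like, here is a minimal pure-Perl sketch of the classic American Soundex scheme. The function name soundex_sketch is hypothetical and is not part of WAIT::Filter. The Soundex routine in freeWAIS-sf may differ in its details, so treat the codes below as illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: classic American Soundex, NOT the freeWAIS-sf routine.
sub soundex_sketch {
    my ($word) = @_;
    (my $w = uc $word) =~ tr/A-Z//cd;            # keep ASCII letters only
    return '' unless length $w;

    # Vowels and Y map to 0, H and W to 9, consonants to their class digit.
    (my $codes = $w) =~ tr/AEIOUYHWBFPVCGJKQSXZDTLMNR/00000099111122222222334556/;

    my $first = substr($codes, 0, 1);
    (my $rest = substr($codes, 1)) =~ tr/9//d;   # H and W do not separate equal codes
    my $digits = $first . $rest;
    $digits =~ tr/0-6//s;                        # collapse runs of the same code
    $digits = substr($digits, 1);                # the first letter is kept as a letter
    $digits =~ tr/0//d;                          # vowels only acted as separators

    # First letter plus up to three digits, zero-padded to 4 bytes.
    return substr(substr($w, 0, 1) . $digits . '000', 0, 4);
}

print soundex_sketch('Robert'), "\n";   # R163
print soundex_sketch('Rupert'), "\n";   # R163
```

Words that sound alike, such as the two names above, collapse to the same code, which is what makes such codes useful as a reduction step before indexing.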
ISO character case functions
There are some additional functions which convert some/most ISO Latin-1 characters to upper and lower case. For maximum speed there are also destructive versions, which change the argument in place instead of allocating and returning a copy. For convenience, the destructive versions also return the argument, so all of the following are valid and $word will contain the lowercased string.
$word = isolc($word);
$word = disolc($word);
disolc($word);
Here are the hardcoded characters which are recognized:
abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß
ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß
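The difference between the copying and the destructive variants can be sketched in pure Perl as follows. The names lc_copy and lc_inplace are hypothetical stand-ins, not functions of WAIT::Filter, and the transliteration ranges are an assumption derived from the table above (written as hex escapes so the example does not depend on the encoding of the source file).

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-ins; not part of WAIT::Filter.

# Copying version: works on a private copy and returns it.
sub lc_copy {
    my ($word) = @_;
    $word =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $word;
}

# Destructive version: $_[0] aliases the caller's variable, so the
# transliteration changes it in place; the result is also returned.
sub lc_inplace {
    $_[0] =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $_[0];
}

my $word  = "\xC4RGER";        # "ÄRGER" as ISO-8859-1 bytes
my $lower = lc_copy($word);    # $word is left untouched
my $same  = lc_inplace($word); # $word itself is now lowercased
```

Bytes that do not appear in the table pass through unchanged in this sketch.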
- $new = isolc($word)
- disolc($word)
-
Converts to lower case.
- $new = isouc($word)
- disouc($word)
-
Converts to upper case.
- $new = isotr($word)
- disotr($word)
-
Removes non-letters according to the table above.
- $new = stop($word)
-
Returns an empty string if $word is a stopword.
- $new = grundform($word)
-
Calls Text::German::reduce.
- $new = utf8iso($word)
-
Converts UTF-8 encoded strings to ISO-8859-1. WAIT is currently based internally on the Latin1 character set, so if you process text in a different encoding, you should convert it to Latin1 in the first filter.
- split, split2, split3, ...
-
The splitN functions all take a scalar as input and return a list of words. split acts just like Perl's split(' '). split2 eliminates all words from the list that are shorter than 2 characters (bytes), split3 eliminates those shorter than 3 characters (bytes), and so on.
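Because the splitN length limit counts bytes, it matters whether the text has already been converted to Latin1. The following sketch illustrates this with two hypothetical pure-Perl stand-ins, to_latin1 and split_min, which only approximate the documented behaviour of utf8iso and splitN and are not part of WAIT::Filter. The conversion relies on the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for utf8iso: UTF-8 octets in, ISO-8859-1 octets out.
sub to_latin1 {
    my ($octets) = @_;
    return encode('ISO-8859-1', decode('UTF-8', $octets));
}

# Hypothetical stand-in for splitN: split on whitespace like split(' '),
# then drop every word shorter than $min bytes.
sub split_min {
    my ($min, $text) = @_;
    return grep { length($_) >= $min } split ' ', $text;
}

my $utf8 = "\xC3\xA4 ab \xC3\xB6l Haus";      # "ä ab öl Haus" as UTF-8 octets

# Without conversion, "ä" occupies 2 bytes and passes the limit of 2.
my @raw = split_min(2, $utf8);

# After conversion, "ä" is a single byte and is dropped.
my @clean = split_min(2, to_latin1($utf8));

printf "%d words before conversion, %d after\n", scalar @raw, scalar @clean;
```

This is one reason to run the encoding conversion before any length-based splitting.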
AUTHOR
Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>
SEE ALSO
perl(1).