NAME
Lingua::Stem::Any - Unified interface to any stemmer on CPAN
VERSION
This document describes Lingua::Stem::Any v0.01.
SYNOPSIS
use Lingua::Stem::Any;
# create German stemmer using the default source module
$stemmer = Lingua::Stem::Any->new(language => 'de');
# create German stemmer explicitly using Lingua::Stem::Snowball
$stemmer = Lingua::Stem::Any->new(
    language => 'de',
    source   => 'Lingua::Stem::Snowball',
);
# get stem for word
$stem = $stemmer->stem($word);
# get list of stems for list of words
@stems = $stemmer->stem(@words);
DESCRIPTION
This module aims to provide a simple unified interface to any stemmer on CPAN. When a language is requested without a source, it selects a default source module from those available.
Attributes
- language
The following language codes are currently supported.
┌────────────┬────┐
│ Bulgarian  │ bg │
│ Czech      │ cs │
│ Danish     │ da │
│ Dutch      │ nl │
│ English    │ en │
│ Finnish    │ fi │
│ French     │ fr │
│ Galician   │ gl │
│ German     │ de │
│ Hungarian  │ hu │
│ Italian    │ it │
│ Latin      │ la │
│ Norwegian  │ no │
│ Persian    │ fa │
│ Portuguese │ pt │
│ Romanian   │ ro │
│ Russian    │ ru │
│ Spanish    │ es │
│ Swedish    │ sv │
│ Turkish    │ tr │
└────────────┴────┘
They are in the two-letter ISO 639-1 format and are case-insensitive but are always returned in lowercase when requested.
# instantiate a stemmer object
$stemmer = Lingua::Stem::Any->new(language => $language);
# get current language
$language = $stemmer->language;
# change language
$stemmer->language($language);
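The handling of language codes described above can be pictured with a small standalone sketch. It is not the module's implementation: `normalize_language` is a hypothetical helper, and the set of codes is simply copied from the table above. It lowercases a code and accepts it only if it is one of the supported two-letter codes.

```perl
use strict;
use warnings;

# Language codes copied from the table above.
my %SUPPORTED = map { $_ => 1 } qw(
    bg cs da nl en fi fr gl de hu it la no fa pt ro ru es sv tr
);

# Hypothetical helper (not part of Lingua::Stem::Any): returns the
# lowercase code if it is supported, or undef otherwise.
sub normalize_language {
    my ($code) = @_;
    return unless defined $code;
    my $lc = lc $code;
    return $SUPPORTED{$lc} ? $lc : undef;
}

# Codes are matched case-insensitively and returned in lowercase.
print normalize_language('DE') // 'undef', "\n";     # prints "de"
# Anything outside the table is rejected by this sketch.
print normalize_language('pt-BR') // 'undef', "\n";  # prints "undef"
```

How the real module reports an unsupported code (for example, by raising an exception) is not stated here, so the sketch only returns undef.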
Country codes such as cz for the Czech Republic are not supported, nor are IETF language tags such as pt-PT or pt-BR.
- source
The following source modules are currently supported.
┌────────────────────────┬──────────────────────────────────────────────┐
│ Module                 │ Languages                                    │
├────────────────────────┼──────────────────────────────────────────────┤
│ Lingua::Stem::Snowball │ da nl en fi fr de hu it no pt ro ru es sv tr │
│ Lingua::Stem::UniNE    │ bg cs fa                                     │
│ Lingua::Stem           │ da de en fr gl it no pt ru sv                │
└────────────────────────┴──────────────────────────────────────────────┘
A module name is used to specify the source. If no source is specified, the first available source in the above list with support for the current language is used.
# get current source
$source = $stemmer->source;
# change source
$stemmer->source('Lingua::Stem::UniNE');
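The default-source rule above (first available source in table order that supports the current language) can be sketched as follows. This is only an illustration under stated assumptions: `default_source` and `is_installed` are hypothetical names, "available" is taken to mean "can be loaded with require", and the data is copied from the table above.

```perl
use strict;
use warnings;

# Source modules and their languages, in the order of the table above.
my @SOURCES = (
    [ 'Lingua::Stem::Snowball' => [qw( da nl en fi fr de hu it no pt ro ru es sv tr )] ],
    [ 'Lingua::Stem::UniNE'    => [qw( bg cs fa )] ],
    [ 'Lingua::Stem'           => [qw( da de en fr gl it no pt ru sv )] ],
);

# Assumption: a module counts as available if it can be loaded.
sub is_installed {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Hypothetical helper: returns the first available source that supports
# $language, or nothing if there is none. $available is an optional
# callback so the availability check can be replaced.
sub default_source {
    my ($language, $available) = @_;
    $available ||= \&is_installed;
    for my $entry (@SOURCES) {
        my ($module, $languages) = @$entry;
        next unless grep { $_ eq $language } @$languages;
        return $module if $available->($module);
    }
    return;
}

# Pretend every source is installed: the first match in table order wins.
print default_source('de', sub { 1 }), "\n";  # prints "Lingua::Stem::Snowball"
```

If Lingua::Stem::Snowball were missing, the same call would fall through to Lingua::Stem, the next row that lists de.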
- casefold
Boolean value specifying whether to apply Unicode casefolding to words before stemming them. This is enabled by default and is performed before normalization when also enabled.
- normalize
Boolean value specifying whether to apply Unicode NFC normalization to words before stemming them. This is enabled by default and is performed after casefolding when also enabled.
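The ordering of the two attributes above (casefold first, then NFC) can be reproduced with core Perl alone. The following is a minimal sketch, not the module's code: `prepare_word` is a hypothetical helper, and it assumes Perl 5.16 or later for the `fc` function.

```perl
use strict;
use warnings;
use feature 'fc';                  # Unicode casefolding, Perl 5.16+
use Unicode::Normalize qw( NFC );  # core module

# Hypothetical helper: applies casefolding, then NFC normalization.
# Both steps default to enabled, mirroring the attribute defaults.
sub prepare_word {
    my ($word, %opt) = @_;
    my $casefold  = exists $opt{casefold}  ? $opt{casefold}  : 1;
    my $normalize = exists $opt{normalize} ? $opt{normalize} : 1;
    $word = fc($word)  if $casefold;
    $word = NFC($word) if $normalize;
    return $word;
}

# "E" + combining acute accent + "COLE" becomes U+00E9 followed by "cole".
my $word = prepare_word("E\x{301}COLE");

# Casefolding maps U+00DF (sharp s) to "ss", so both spellings agree.
my $same = prepare_word("STRASSE") eq prepare_word("Stra\x{DF}e");
```

Casefolding is used rather than plain lowercasing because it also equates forms such as sharp s and "ss", which `lc` would leave distinct.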
Methods
- stem
Accepts a list of strings, stems each string, and returns a list of stems. The list returned will always have the same number of elements in the same order as the list provided.
@stems = $stemmer->stem(@words);
# get the stem for a single word
$stem = $stemmer->stem($word);
The words should be provided as character strings and the stems are returned as character strings. Byte strings in arbitrary character encodings are not supported.
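When input arrives as encoded bytes, one way to satisfy the character-string requirement is to decode before stemming and encode afterwards. The sketch below assumes UTF-8 input and uses the core Encode module; `stem_utf8_bytes` is a hypothetical wrapper, and its `$stem_cb` argument stands in for a call such as `sub { $stemmer->stem(@_) }`.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical wrapper: decodes UTF-8 byte strings to character strings,
# passes them to the stemming callback, and encodes the results again.
sub stem_utf8_bytes {
    my ($stem_cb, @byte_strings) = @_;
    my @words = map { decode('UTF-8', $_) } @byte_strings;
    my @stems = $stem_cb->(@words);
    return map { encode('UTF-8', $_) } @stems;
}

# As bytes, "caf" followed by UTF-8 encoded U+00E9 is five bytes long;
# after decoding, the callback receives a four-character string.
my @lengths;
my @out = stem_utf8_bytes(
    sub { @lengths = map { length } @_; return @_ },  # identity stand-in
    "caf\xC3\xA9",
);
print "$lengths[0]\n";  # prints "4"
```

In a real program the decoding is often better done once at the I/O boundary (for example, with an encoding layer on the filehandle) rather than around every call.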
- stem_in_place
Accepts an array reference, stems each element, and replaces them with the resulting stems.
$stemmer->stem_in_place(\@words);
This method is provided for potential optimization when a large array of words is to be stemmed. The return value is not defined.
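The difference between returning a new list and modifying an array in place can be shown with a toy stand-in for a stemmer. Everything here is hypothetical and for illustration only: `toy_stem` merely strips a final "s" and is not a real stemming algorithm, and the two wrappers only mirror the calling conventions of stem and stem_in_place.

```perl
use strict;
use warnings;

# Toy stand-in for a real stemmer: strips a single trailing "s".
sub toy_stem {
    my ($word) = @_;
    $word =~ s/s\z//;
    return $word;
}

# Like stem (in list context): returns a new list, input left untouched.
sub stem_list {
    my @stems = map { toy_stem($_) } @_;
    return @stems;
}

# Like stem_in_place: overwrites each element of the referenced array,
# so no second list is built. The return value carries no meaning.
sub stem_list_in_place {
    my ($words) = @_;
    $_ = toy_stem($_) for @$words;
    return;
}

my @words = qw( cats dogs fish );
my @stems = stem_list(@words);  # @words is unchanged here
stem_list_in_place(\@words);    # @words now holds the stems
print "@words\n";               # prints "cat dog fish"
```

For very large arrays the in-place form avoids holding both the words and the stems in memory at once, which is the kind of optimization the method is meant to allow.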
- languages
Returns a list of supported two-letter language codes using lowercase letters.
# all languages
@languages = $stemmer->languages;
# languages supported by Lingua::Stem::Snowball
@languages = $stemmer->languages('Lingua::Stem::Snowball');
- sources
Returns a list of supported source module names.
# all sources
@sources = $stemmer->sources;
# sources that support English
@sources = $stemmer->sources('en');
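The optional filter arguments of languages and sources amount to lookups in the source table shown earlier. The following standalone sketch illustrates that idea only: `languages_sketch` and `sources_sketch` are hypothetical names, the data is copied from the source table, and the alphabetical ordering is an assumption, since the order returned by the real methods is not specified here.

```perl
use strict;
use warnings;

# Data copied from the source table. Latin (la) appears in the language
# table but in none of these rows, so it is absent from this sketch.
my %LANGUAGES_FOR = (
    'Lingua::Stem::Snowball' => [qw( da nl en fi fr de hu it no pt ro ru es sv tr )],
    'Lingua::Stem::UniNE'    => [qw( bg cs fa )],
    'Lingua::Stem'           => [qw( da de en fr gl it no pt ru sv )],
);

# Hypothetical: language codes for one source, or for all sources.
sub languages_sketch {
    my ($source) = @_;
    my @modules = defined $source ? ($source) : keys %LANGUAGES_FOR;
    my %seen;
    my @codes = grep { !$seen{$_}++ }
                map  { @{ $LANGUAGES_FOR{$_} || [] } } @modules;
    return sort @codes;
}

# Hypothetical: source modules, optionally only those supporting a language.
sub sources_sketch {
    my ($language) = @_;
    my @modules = sort keys %LANGUAGES_FOR;
    return @modules unless defined $language;
    my $code = lc $language;
    return grep {
        my $module = $_;
        grep { $_ eq $code } @{ $LANGUAGES_FOR{$module} };
    } @modules;
}

print join(' ', languages_sketch('Lingua::Stem::UniNE')), "\n";  # prints "bg cs fa"
print join(', ', sources_sketch('en')), "\n";  # prints "Lingua::Stem, Lingua::Stem::Snowball"
```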
TODO
optional stem caching
custom stemming exceptions
SEE ALSO
Lingua::Stem::Snowball, Lingua::Stem::UniNE, Lingua::Stem
ACKNOWLEDGEMENTS
This module is brought to you by Shutterstock (@ShutterTech). Additional open source projects from Shutterstock can be found at code.shutterstock.com.
AUTHOR
Nick Patch <patch@cpan.org>
COPYRIGHT AND LICENSE
© 2013 Nick Patch
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.