NAME
Lingua::TFIDF::WordSegmenter::SplitBySpace - Simple word segmenter suitable for most European languages
VERSION
version 0.01
SYNOPSIS
use Lingua::TFIDF::WordSegmenter::SplitBySpace;
my $segmenter = Lingua::TFIDF::WordSegmenter::SplitBySpace->new(
    lower_case          => 1,
    remove_punctuations => 1,
    stop_words          => [qw/i you he she it they a the am are is was were/],
);
my $iter = $segmenter->segment('Humpty Dumpty sat on wall, ...');
while (defined(my $word = $iter->())) { ... }
DESCRIPTION
This class is a simple word segmenter. Like Text::TFIDF, it segments a sentence into words by splitting on spaces.
METHODS
new([ lower_case => 0 ] [, remove_punctuations => 0 ] [, stop_words => [] ])
Constructor. Takes some optional parameters:
- lower_case
Off by default. Converts all words to lower case.
- remove_punctuations
Off by default. Removes punctuation characters (e.g., commas, periods, quotes, question marks and exclamation marks) from the head and tail of each segmented word. Note that punctuation inside a word (e.g., the apostrophe in "King's") remains unchanged.
- stop_words
Specifies words to exclude from the segmented words. This is useful for removing function words.
Note that stop word filtering is performed after the lower_case and remove_punctuations options are applied. So, for example, if you enable the lower_case option and want to exclude "I" from the result, you should supply the stop word list as ['i'].
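The order of operations described above can be illustrated with a minimal sketch in plain Perl. This is not the module's implementation: normalize_words is a hypothetical helper, the use of the POSIX [[:punct:]] class is an assumption about which characters count as punctuation, and dropping words that become empty is a choice made only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Lingua::TFIDF. It mimics the documented
# order: lower-casing and punctuation stripping first, stop words last.
sub normalize_words {
    my ($words, %opt) = @_;
    my %stop = map { $_ => 1 } @{ $opt{stop_words} || [] };
    my @out;
    for my $original (@$words) {
        my $word = $original;    # copy so the caller's list is not modified
        $word = lc $word if $opt{lower_case};
        if ($opt{remove_punctuations}) {
            # Strip punctuation at the head and tail only (assumed character
            # class); the apostrophe inside "King's" is left alone.
            $word =~ s/\A[[:punct:]]+//;
            $word =~ s/[[:punct:]]+\z//;
        }
        next if $word eq '';     # sketch-only choice: skip emptied words
        next if $stop{$word};    # filter runs on the already normalized word
        push @out, $word;
    }
    return @out;
}

my @words = normalize_words(
    [ split ' ', "I saw the King's horses, twice!" ],
    lower_case          => 1,
    remove_punctuations => 1,
    stop_words          => [qw/i the/],
);
print join(' ', @words), "\n";    # prints "saw king's horses twice"
```

Because the filter sees the normalized word, the lower-case entry 'i' is what removes "I" here; a stop word list of ['I'] would leave it in place.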
segment($document | \$document)
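Executes word segmentation on the given $document, passed either as a string or as a reference to a string, and returns a word iterator. As the SYNOPSIS shows, the iterator is a code reference: call it repeatedly until it returns undef.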
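To show how such an iterator is typically consumed, here is a minimal, self-contained sketch. make_word_iterator is a hypothetical stand-in for segment(), written in plain Perl so the example runs without the module installed; splitting eagerly with split ' ' is an assumption of this sketch, and the real method may work differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for segment(), NOT the module's code. It accepts a
# string or a reference to a string and returns a closure that yields one
# word per call, then undef once the words are exhausted.
sub make_word_iterator {
    my ($document) = @_;
    my $text_ref = ref $document ? $document : \$document;
    my @words = split ' ', $$text_ref;    # split on runs of whitespace
    return sub { return shift @words };   # shift returns undef when empty
}

# Count term occurrences by draining the iterator, as in the SYNOPSIS loop.
my $iter = make_word_iterator('Humpty Dumpty sat on wall');
my %count;
while (defined(my $word = $iter->())) {
    $count{$word}++;
}
printf "%s: %d\n", $_, $count{$_} for sort keys %count;
```

Testing with defined() rather than plain truthiness matters here: a word such as "0" is false in Perl and would otherwise end the loop early.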
AUTHOR
Koichi SATOH <sekia@cpan.org>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2014 by Koichi SATOH.
This is free software, licensed under:
The MIT (X11) License