NAME

Lingua::TFIDF::WordSegmenter::SplitBySpace - Simple word segmenter suitable for most European languages

VERSION

version 0.01

SYNOPSIS

use Lingua::TFIDF::WordSegmenter::SplitBySpace;

my $segmenter = Lingua::TFIDF::WordSegmenter::SplitBySpace->new(
  lower_case => 1,
  remove_punctuations => 1,
  stop_words => [qw/i you he she it they a the am are is was were/],
);
my $iter = $segmenter->segment('Humpty Dumpty sat on wall, ...');
while (defined(my $word = $iter->())) { ... }

DESCRIPTION

This class is a simple word segmenter. Like Text::TFIDF, it segments a sentence into words by splitting on spaces.

METHODS

new([ lower_case => 0 ] [, remove_punctuations => 0 ] [, stop_words => [] ])

Constructor. Takes some optional parameters:

lower_case

Off by default. Converts all words to lower case.

remove_punctuations

Off by default. Removes punctuation characters (e.g., commas, periods, quotes, question marks and exclamation marks) from the head and tail of each segmented word. Note that punctuation inside a word (e.g., the apostrophe in "King's") remains unchanged.

stop_words

Specifies words to exclude from the segmented words. This is useful for removing function words.

Note that stop word filtering is performed after the lower_case and remove_punctuations options have been applied. So, for example, if you enable the lower_case option and want to exclude "I" from the result, you should supply the stop word list as ['i'].
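The ordering described above can be illustrated with a minimal sketch. It is not the module's implementation: filter_words is a hypothetical helper, the use of the POSIX [[:punct:]] class is an assumption about what counts as punctuation, and dropping words that become empty is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::TFIDF. It mimics the documented
# order: lower-case first, then strip punctuation, then apply stop words.
sub filter_words {
  my ($text, %opt) = @_;
  my %stop = map { $_ => 1 } @{ $opt{stop_words} || [] };
  my @words;
  for my $word (split ' ', $text) {
    $word = lc $word if $opt{lower_case};
    if ($opt{remove_punctuations}) {
      # Strip from both ends only, so "King's" keeps its apostrophe.
      $word =~ s/\A[[:punct:]]+//;
      $word =~ s/[[:punct:]]+\z//;
    }
    # Assumption: words left empty after stripping are dropped.
    next if $word eq '' or $stop{$word};
    push @words, $word;
  }
  return @words;
}

my @kept = filter_words(
  "I saw the King's horses, I think.",
  lower_case          => 1,
  remove_punctuations => 1,
  stop_words          => [qw/i the/],
);
print join(' ', @kept), "\n";
```

Because the stop word comparison happens last in this sketch, a list of ['I'] combined with lower_case would never match: by the time the comparison runs, the word has already become "i".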

segment($document | \$document)

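Executes word segmentation on the given $document, passed either as a string or as a reference to a string, and returns a word iterator. As in the SYNOPSIS, the iterator is called as a code reference, and the loop ends when it returns an undefined value.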
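The calling convention can be sketched in plain Perl as follows. make_word_iterator is a hypothetical stand-in for segment(), written only to show the string-or-reference argument and the closure-style iterator. It performs no filtering and makes no claim about how the module works internally.

```perl
use strict;
use warnings;

# Hypothetical stand-in for segment(): accepts a string or a reference to
# a string and returns a closure that yields one word per call, then undef.
sub make_word_iterator {
  my ($document) = @_;
  my $text_ref = ref $document ? $document : \$document;
  my @words = split ' ', $$text_ref;
  # shift on an empty array returns undef, which ends the caller's loop.
  return sub { return shift @words };
}

my $text = 'Humpty Dumpty sat on wall';
my $iter = make_word_iterator(\$text);
while (defined(my $word = $iter->())) {
  print "$word\n";
}
```

Passing a reference avoids copying a large document at the call site, which is presumably why segment() accepts both forms.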

AUTHOR

Koichi SATOH <sekia@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2014 by Koichi SATOH.

This is free software, licensed under:

The MIT (X11) License