NAME
Grapheme::Ngram - n-grams of Unicode Extended Grapheme Clusters
SYNOPSIS
use Grapheme::Ngram;
my $class = 'Grapheme::Ngram';
my @ngrams = $class->ngram($string,$width);
DESCRIPTION
For many applications it is better to work on graphemes (Unicode extended grapheme clusters) than on individual code points.
Building n-grams is one of them.
METHODS
new
$object = Grapheme::Ngram->new();
ngram
my $array_ref = $object->ngram($string, $width);
$string - string of characters
$width - length of the resulting tokens, in graphemes. Default is 1.
$array_ref - reference to an array of n-gram tokens
Returns one token containing the unmodified $string if the number of graphemes in $string is lower than $width. Returns an empty $array_ref if $string is empty or undef. NOTE: this behavior may change in a future version. $width falls back to 1 if it is not an integer greater than 0.
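The rules above can be illustrated with a small standalone sketch. It does not call Grapheme::Ngram and is not the module's source; `ngram_sketch` is a hypothetical function that only mirrors the documented behavior, assuming graphemes are split with the `\X` regex escape.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the documented behavior; not the module's code.
sub ngram_sketch {
    my ($string, $width) = @_;

    # Fall back to a width of 1 unless $width is a positive integer.
    $width = 1 unless defined $width && $width =~ /\A[1-9][0-9]*\z/;

    # Empty or undefined input yields an empty array reference.
    return [] unless defined $string && length $string;

    # \X matches one extended grapheme cluster.
    my @graphemes = $string =~ /\X/g;

    # Fewer graphemes than $width: one token holding the unmodified string.
    return [$string] if @graphemes < $width;

    # Slide a window of $width graphemes across the string.
    return [ map { join '', @graphemes[ $_ .. $_ + $width - 1 ] }
             0 .. @graphemes - $width ];
}

my $bigrams = ngram_sketch("abcd", 2);   # 'ab', 'bc', 'cd'
my $short   = ngram_sketch("ab", 3);     # 'ab' (string shorter than width)
my $empty   = ngram_sketch("", 2);       # empty array reference
```

Because the window moves over graphemes rather than code points, a base letter and its combining accent always end up in the same token.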
from_tokens
my @ngram = $object->from_tokens(\@tokens, $width);
Same as ngram, but takes a reference to an array of tokens instead of a string. This method is used by ngram.
This allows the use of a custom tokenizer, e.g. one that also treats 'sh' as a single grapheme:
my @tokens = $string =~ m/(Sh|sh|\X)/g;
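To show what such a tokenizer does to the resulting n-grams, here is a standalone sketch. `from_tokens_sketch` is a hypothetical helper written for this example, not the module's from_tokens; how it handles a token list shorter than $width (joining all tokens into one) is an assumption derived from the description of ngram.

```perl
use strict;
use warnings;

# Hypothetical helper: joins every run of $width adjacent tokens.
sub from_tokens_sketch {
    my ($tokens, $width) = @_;

    # Fall back to a width of 1 unless $width is a positive integer.
    $width = 1 unless defined $width && $width =~ /\A[1-9][0-9]*\z/;

    return () unless @$tokens;

    # Assumption: too few tokens gives one token with everything joined.
    return ( join '', @$tokens ) if @$tokens < $width;

    return map { join '', @{$tokens}[ $_ .. $_ + $width - 1 ] }
           0 .. @$tokens - $width;
}

my $string  = 'Shush';
my @tokens  = $string =~ m/(Sh|sh|\X)/g;          # 'Sh', 'u', 'sh'
my @bigrams = from_tokens_sketch(\@tokens, 2);    # 'Shu', 'ush'
```

With plain `\X` tokenization the same string would give the four bigrams 'Sh', 'hu', 'us' and 'sh'; the custom pattern keeps each 'sh' together as one unit.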
_tokenize
my @graphemes = $object->_tokenize($string);
This internal method splits $string into a list of graphemes.
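The difference between graphemes and code points can be seen in a short sketch. It assumes the split is done with the `\X` regex escape (the same escape used in the custom tokenizer example above); the internals of _tokenize are not shown in this document.

```perl
use strict;
use warnings;

# "eleve" with a combining acute on the first e and a combining grave on the second.
my $string = "e\x{301}le\x{300}ve";

my @code_points = split //, $string;     # one element per code point
my @graphemes   = $string =~ /\X/g;      # one element per grapheme cluster

printf "%d code points, %d graphemes\n",
    scalar @code_points, scalar @graphemes;   # 7 code points, 5 graphemes
```

Each combining accent stays attached to its base letter in @graphemes, so n-grams built from that list never separate the two.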
SOURCE REPOSITORY
http://github.com/wollmers/Grapheme-Ngram
AUTHOR
Helmut Wollmersdorfer, <helmut.wollmersdorfer@gmail.com>
COPYRIGHT AND LICENSE
Copyright (C) 2014 by Helmut Wollmersdorfer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.