NAME

Lingua::EN::Tokenizer::Offsets - Finds word (token) boundaries, and returns their offsets.

VERSION

version 0.03

SYNOPSIS

use Lingua::EN::Tokenizer::Offsets qw/token_offsets get_tokens/;
 
my $str = <<END;
Hey! Mr. Tambourine Man, play a song for me.
I'm not sleepy and there is no place I’m going to.
END

my $offsets = token_offsets($str);     ## Get the offsets.
foreach my $o (@$offsets) {
    my $start  = $o->[0];
    my $length = $o->[1]-$o->[0];

    my $token = substr($str, $start, $length);  ## Get a token.
    # ...
}

### or

my $tokens = get_tokens($str);     
foreach my $token (@$tokens) {
    ## do something with $token
}

METHODS

tokenize($text)

Returns a tokenized version of $text (space-separated tokens).

$text can be a scalar or a scalar reference.

get_offsets($text)

Returns a reference to an array containing pairs of character offsets, corresponding to the start and end positions of each token in $text.

$text can be a scalar or a scalar reference.
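
To make the shape of the return value concrete, here is a minimal sketch that builds the same kind of structure (an array reference of [start, end] pairs) with a plain whitespace scan. The helper name naive_offsets is hypothetical and this is not the module's algorithm; a real tokenizer would, for instance, also separate punctuation. It assumes end offsets are exclusive, as the length computation in the SYNOPSIS suggests.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: records [start, end] for
# every run of non-whitespace characters. End offsets are exclusive.
sub naive_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /\S+/g) {
        # $-[0] and $+[0] hold the start and end of the last successful match
        push @offsets, [ $-[0], $+[0] ];
    }
    return \@offsets;
}

my $offsets = naive_offsets("Hey! Mr. Tambourine Man");
print join(' ', map { sprintf('[%d,%d]', @$_) } @$offsets), "\n";
# prints: [0,4] [5,8] [9,19] [20,23]
```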

get_tokens($text)

Splits $text into tokens, returning an array reference.

$text can be a scalar or a scalar reference.

adjust_offsets($text,$offsets)

Makes minor adjustments to the offsets (leading/trailing whitespace, etc.).

$text can be a scalar or a scalar reference.
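
As an illustration of the kind of adjustment described, the following sketch shrinks each span so that it neither starts nor ends on whitespace. The helper trim_offsets is hypothetical and only covers the whitespace case; whatever else the module adjusts is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's adjust_offsets: shrinks each
# [start, end) span so it neither begins nor ends on whitespace.
sub trim_offsets {
    my ($text, $offsets) = @_;
    my @trimmed;
    for my $o (@$offsets) {
        my ($start, $end) = @$o;
        $start++ while $start < $end && substr($text, $start, 1) =~ /\s/;
        $end--   while $end > $start && substr($text, $end - 1, 1) =~ /\s/;
        push @trimmed, [ $start, $end ] if $start < $end;   # drop empty spans
    }
    return \@trimmed;
}

my $text    = "Hey!  Mr. ";
my $trimmed = trim_offsets($text, [ [0, 5], [5, 10], [10, 10] ]);
print join(' ', map { sprintf('[%d,%d]', @$_) } @$trimmed), "\n";
# prints: [0,4] [6,9]
```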

initial_offsets($text)

Performs a first, naive delimitation of tokens.

$text can be a scalar or a scalar reference.

offsets2tokens($text,$offsets)

Given a text and a list of token boundary offsets, returns an array with the text split into tokens.

$text can be a scalar or a scalar reference.
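
The mapping from offsets to tokens can be sketched with substr, in the same way as the loop in the SYNOPSIS. The helper spans_to_tokens below is hypothetical, the offsets are hand-written for illustration, and the scalar-or-reference handling is one plausible way to accept both argument forms rather than the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's offsets2tokens.
sub spans_to_tokens {
    my ($text, $offsets) = @_;
    # Accept either a plain scalar or a reference to one
    my $ref = ref($text) ? $text : \$text;
    return [ map { substr($$ref, $_->[0], $_->[1] - $_->[0]) } @$offsets ];
}

my $text = "Hey! Mr. Tambourine Man";
# Hand-written offsets; the module's real output may differ
my $offsets = [ [0, 3], [3, 4], [5, 8], [9, 19], [20, 23] ];

my $tokens = spans_to_tokens(\$text, $offsets);
print join('|', @$tokens), "\n";
# prints: Hey|!|Mr.|Tambourine|Man
```

Passing a reference avoids copying a large text on every call, which is presumably why the functions accept both forms.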

ACKNOWLEDGEMENTS

Based on the original tokenizer written by Josh Schroeder and provided by Europarl http://www.statmt.org/europarl/.

AUTHOR

André Santos <andrefs@cpan.org>

COPYRIGHT AND LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
