NAME
Algorithm::NGram
SYNOPSIS
use Algorithm::NGram;
my $ng = Algorithm::NGram->new(ngram_width => 3); # use trigrams
# feed in text
$ng->add_text($text1); # analyze $text1
$ng->add_text($text2); # analyze $text2
# feed in arbitrary sequence of tokens
$ng->add_start_token;
$ng->add_tokens(qw/token1 token2 token3/);
$ng->add_end_token;
my $output = $ng->generate_text;
DESCRIPTION
This is a module for analyzing token sequences with n-grams. You can use it to parse a block of text, or feed in your own tokens. It can generate new sequences of tokens from what has been fed in.
EXPORT
None.
METHODS
- new
-
Create a new n-gram analyzer instance.
Options:
- ngram_width
-
This is the "window size": the number of tokens the analyzer keeps track of at a time. An ngram_width of two produces bigrams, an ngram_width of three produces trigrams, and so on.
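To make the window size concrete, here is a minimal sketch in plain Perl that lists the windows a given width produces over a token list. ngram_windows() is a hypothetical helper written for this illustration; it is not part of Algorithm::NGram and says nothing about how the module stores its data.

```perl
use strict;
use warnings;

# Return every run of $width consecutive tokens as an array reference.
# ngram_windows() is a hypothetical helper, not part of Algorithm::NGram.
sub ngram_windows {
    my ($width, @tokens) = @_;
    my @windows;
    for my $start (0 .. @tokens - $width) {
        push @windows, [ @tokens[$start .. $start + $width - 1] ];
    }
    return @windows;
}

my @tokens = qw/the quick brown fox/;

# Width 2 (bigrams): "the quick", "quick brown", "brown fox"
print join(' ', @$_), "\n" for ngram_windows(2, @tokens);

# Width 3 (trigrams): "the quick brown", "quick brown fox"
print join(' ', @$_), "\n" for ngram_windows(3, @tokens);
```

A wider window captures more context per n-gram but yields fewer windows from the same input.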
- ngram_width
-
Returns the token window size (i.e. the "n" in n-gram).
- token_table
-
Returns the n-gram table.
- add_text
-
Splits a block of text up by whitespace and processes each word as a token. Automatically calls add_start_token() at the beginning of the text and add_end_token() at the end.
- add_tokens
-
Adds an arbitrary list of tokens.
- add_start_token
-
Adds the "start token." Marking the beginning and end of each token sequence lets the generator know which tokens can start a sequence and when to stop.
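The effect of the start and end markers can be sketched in plain Perl as follows. The marker strings and the mark_sequence() helper are placeholders invented for this example; the values the module uses internally are not stated here.

```perl
use strict;
use warnings;

# Placeholder marker strings chosen for this sketch only.
my $START = '<start>';
my $END   = '<end>';

# Split text on whitespace and wrap the words in start/end markers,
# mirroring the behaviour described for add_text().
# mark_sequence() is a hypothetical helper.
sub mark_sequence {
    my ($text) = @_;
    my @words = split ' ', $text;
    return ($START, @words, $END);
}

my @stream;
push @stream, mark_sequence($_) for 'the cat sat', 'the dog ran';

print join(' ', @stream), "\n";
# prints: <start> the cat sat <end> <start> the dog ran <end>
```

With the markers in place, "the" is recorded as a token that can open a sequence, and "sat" and "ran" as tokens that can close one.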
- add_end_token
-
Adds the "end token." See add_start_token().
- analyze
-
Generates an n-gram frequency table. Returns a hashref of N => tokens => count, where N is the number of tokens in the n-gram (from 2 to ngram_width). You do not normally need to call this unless you want the frequency table itself.
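The N => tokens => count shape can be illustrated with a small stand-alone sketch. count_ngrams() and the space-joined key are assumptions made for this example; the module's real key format (see token_key) may differ.

```perl
use strict;
use warnings;

# Count every run of N consecutive tokens for N from 2 up to $width.
# Result shape: { N => { "tok1 tok2 ..." => count } }.
# count_ngrams() and the space-joined key are assumptions of this sketch.
sub count_ngrams {
    my ($width, @tokens) = @_;
    my %table;
    for my $n (2 .. $width) {
        for my $start (0 .. @tokens - $n) {
            my $key = join ' ', @tokens[$start .. $start + $n - 1];
            $table{$n}{$key}++;
        }
    }
    return \%table;
}

my $table = count_ngrams(3, qw/a b a b c/);

for my $n (sort { $a <=> $b } keys %$table) {
    for my $key (sort keys %{ $table->{$n} }) {
        print "$n: '$key' => $table->{$n}{$key}\n";
    }
}
```

For this input the bigram "a b" is counted twice, and every other bigram and trigram once.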
- generate_text
-
After feeding in text tokens, this will return a new block of text based on whatever text was added.
- generate
-
Generates a new sequence of tokens based on whatever tokens have previously been fed in.
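As a rough illustration of the idea, the following sketch generates a sequence from bigrams only: it starts at the start marker and repeatedly picks a token that was seen after the current one, until it reaches the end marker. The marker strings, build_followers() and generate_sequence() are all hypothetical names for this example, and the module's own algorithm may differ.

```perl
use strict;
use warnings;

my ($START, $END) = ('<start>', '<end>');   # placeholder markers

# Record, for each token, every token observed directly after it.
# Repeats are kept, so a uniform pick is weighted by frequency.
sub build_followers {
    my (@tokens) = @_;
    my %followers;
    push @{ $followers{ $tokens[$_] } }, $tokens[$_ + 1] for 0 .. $#tokens - 1;
    return \%followers;
}

# Walk from the start marker, picking a random follower at each step,
# until the end marker (or a safety limit) is reached.
sub generate_sequence {
    my ($followers, $max_len) = @_;
    my ($current, @out) = ($START);
    while (@out < $max_len) {
        my $choices = $followers->{$current} or last;
        $current = $choices->[ int rand @$choices ];
        last if $current eq $END;
        push @out, $current;
    }
    return @out;
}

my @tokens = ($START, qw/the cat sat/, $END, $START, qw/the dog sat/, $END);
my $followers = build_followers(@tokens);
print join(' ', generate_sequence($followers, 20)), "\n";
```

With this input the walk can only produce "the cat sat" or "the dog sat", since those are the only paths from the start marker to the end marker.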
- next_tok
-
Given a list of tokens, picks a possible token to come next.
- token_lookup
-
Returns a hashref of the counts of tokens that follow a sequence of tokens.
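The relationship between follower counts and picking a next token can be sketched like this. follower_counts() and weighted_pick() are hypothetical stand-ins written for illustration; they are not the module's token_lookup() or next_tok(), whose exact arguments and selection strategy are not specified here. The sketch assumes the pick is weighted by count.

```perl
use strict;
use warnings;

# Count which tokens follow a given context (a run of one or more tokens).
# follower_counts() is a hypothetical helper for this sketch.
sub follower_counts {
    my ($context, @tokens) = @_;
    my $len    = @$context;
    my $wanted = join ' ', @$context;
    my %counts;
    for my $i (0 .. @tokens - $len - 1) {
        my $window = join ' ', @tokens[$i .. $i + $len - 1];
        $counts{ $tokens[$i + $len] }++ if $window eq $wanted;
    }
    return \%counts;
}

# Pick one key at random, with probability proportional to its count.
sub weighted_pick {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    return unless $total;
    my $roll = rand $total;
    for my $token (sort keys %$counts) {
        $roll -= $counts->{$token};
        return $token if $roll < 0;
    }
    return;   # not reached when all counts are positive
}

my @tokens = qw/a b c a b d a b c/;
my $counts = follower_counts([qw/a b/], @tokens);   # { c => 2, d => 1 }
print "next: ", weighted_pick($counts), "\n";
```

Here "c" follows "a b" twice and "d" once, so a count-weighted pick returns "c" about two times out of three.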
- token_key
-
Serializes a sequence of tokens for use as a key into the n-gram table. You will not normally need to call this.
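Since Perl hash keys are strings, a token sequence has to be flattened before it can index a table. One possible approach is shown below; sequence_key() and the "\x1F" separator are assumptions of this sketch, not the module's actual key format.

```perl
use strict;
use warnings;

# Join tokens with a separator that is unlikely to occur inside a token.
# sequence_key() and the choice of "\x1F" (ASCII unit separator) are
# assumptions of this sketch.
sub sequence_key {
    my (@tokens) = @_;
    return join "\x1F", @tokens;
}

my %table;
$table{ sequence_key(qw/new york/) }{city}++;
$table{ sequence_key(qw/new york/) }{times}++;
$table{ sequence_key(qw/new york/) }{city}++;

my $followers = $table{ sequence_key(qw/new york/) };
print "$_ => $followers->{$_}\n" for sort keys %$followers;
```

Using a separator that cannot appear in whitespace-split tokens keeps distinct sequences such as ("a b", "c") and ("a", "b c") from collapsing into the same key.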
- serialize
-
Returns the tokens and the n-gram table (if one has been generated) as a string.
- deserialize($string)
-
Deserializes a string and returns an Algorithm::NGram instance.
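The general round trip of turning analyzer state into a string and back can be sketched with Perl's core Storable module. This is only an illustration of the concept: the encoding that serialize and deserialize use is not described here, and the %state layout below is invented for the example.

```perl
use strict;
use warnings;
use Storable qw/nfreeze thaw/;

# Invented stand-in for analyzer state; not the module's real layout.
my %state = (
    ngram_width => 3,
    tokens      => [qw/the cat sat/],
);

my $string = nfreeze(\%state);   # data structure -> byte string
my $copy   = thaw($string);      # byte string -> new data structure

print "width: $copy->{ngram_width}\n";
print "tokens: @{ $copy->{tokens} }\n";
```

A string produced this way can be written to a file or a cache and restored later, which avoids re-analyzing the original text.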
SEE ALSO
AUTHOR
Mischa Spiegelmock, <mspiegelmock@gmail.com>
COPYRIGHT AND LICENSE
Copyright 2007 by Mischa Spiegelmock
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.