NAME
String::Markov - A Moo-based, text-oriented Markov Chain module
VERSION
version 0.006
SYNOPSIS
my $mc = String::Markov->new();
$mc->add_files(@ARGV);
print $mc->generate_sample . "\n" for (1..20);
my $mc = String::Markov->new(order => 1, sep => ' ');
for my $stanza (@The_Rime_of_the_Ancient_Mariner) {
$mc->add_sample($stanza);
}
print $mc->generate_sample;
DESCRIPTION
String::Markov is a Moo-based Markov Chain module, designed to easily consume and produce text.
ATTRIBUTES
order
The order of the chain, i.e. how much past state is used to determine the next state. The default of 2 is for reasonable for constructing new names/words when splitting into characters, or for long-ish works when splitting into words.
split_sep
How states are split. This value (or sep; see "new()") is passed directly as the first argument of "split" in perlfunc, so using ' ' has special semantics. Regular expressions will work as well, but be aware that any matched characters are discarded.
join_sep
How to re-join states. This value (or sep; see "new()") is passed directly as the first argument of "join" in perlfunc. In addition, it is used to build keys for internal hashes. This can cause problems in cases where split_sep() produces sequences like 'ae', 'io'
, 'a', 'ei', 'o'
, or 'ae', 'i', 'o'
, which will all turn into 'aeio'
with the default if ''
. If join_sep is '*'
instead, then three unique keys result: 'ae*io'
, 'a*ei*o'
, and 'ae*i*o'
. See "add_sample()".
null
What is used to track the beginning and end of a sample. The default of "\0"
should work for UTF-8 text, but may cause problems with UTF-16 or other encodings.
normalize
Whether to normalize Unicode strings. This value, if true, is passed as the first argument to Unicode::Normalize::normalize. The default 'C'
should do what most people expect, but it may be the case that 'D'
is what you want. If you're not using Unicode, set this to undef.
do_chomp
Whether to "chomp" in perlfunc lines when reading files. See "add_files()".
METHODS
new()
# Defaults
my $mc = String::Markov->new(
order => 2,
sep => '',
split_sep => undef,
join_sep => undef,
null => "\0",
normalize => 'C',
do_chomp => 1,
);
The sep argument doesn't correlate to an attribute, but is used to initialize split_sep or join_sep if either is undefined.
See "ATTRIBUTES".
split_line()
This is the method "add_sample()" calls when it is passed a non-ref argument. It returns an array of states (usually individual characters or words) that are used to build the Markov Chain model.
The default implementation is equivalent to:
sub split_line {
my ($self, $sample) = @_;
$sample = normalize($self->normalize, $sample) if $self->normalize;
return split($self->split_sep, $sample);
}
This method can be overridden to deal with unusual data.
add_sample()
This method adds samples to build the Markov Chain model. It takes a single argument, which can be either a string or an array reference. If the argument is an array reference, its elements are directly used to update the Markov Chain. If it is a string, add_sample() uses the split_line() method to create an array of states, and then updates the Markov Chain.
Note that this function generates hash keys for the transition matrix. The keys are built according to the order, null, and join_sep attributes, so if an instance is created with:
my $mc = String::Markov->new(null => '!', order => 2, join_sep => '*');
$mc->add_sample($_) for (@sample_lines);
Then the internal transition matrix might look like:
{
'!*!' => { 'A' => 5, 'B' => 7, ... }, # Initial state
'!*A' => { ... },
'!*B' => { ... },
...
'x*y' => { '!' => 4 }, # always end after 'xy'
'y*z' => { '!' => 3, 'q' => 2 }, # sometimes end after 'yz'
...
}
add_files()
This is a simple convenience method, designed to replace code like:
while(<>) { chomp; $mc->add_sample($_) }
It takes a list of file names as arguments, and adds them line-by-line.
generate_sample()
This method returns a sequence of states, generated from the Markov Chain using the Monte Carlo method.
If called in scalar context, the states are joined with join_sep before being returned.
SEE ALSO
AUTHOR
Grant Mathews <gmathews@cpan.org>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2014 by Grant Mathews.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)