NAME
Unicode::SetAutomaton - UTF-8 based DFAs and Regexps from Unicode sets
SYNOPSIS
use Unicode::SetAutomaton;
use Set::IntSpan;
my $set = Set::IntSpan->new([[0x000000, 0x10FFFF]]);
my $dfa = Unicode::SetAutomaton->new(classes => [$set]);
print $dfa->as_expressions;
DESCRIPTION
This module takes sets of Unicode characters and turns them into UTF-8 based minimal deterministic finite automata and regular expressions. Applications include making byte-oriented regular expression and finite automata tools compatible with Unicode input, and possibly performance optimizations.
The author's motivation in writing this module was a search for a fast method to associate characters in UTF-8 encoded strings with character classes. Ignoring memory access performance, automata produced by this module offer a simple, general, near-optimal solution to this problem.
METHODS
- Unicode::SetAutomaton->new(classes => [$set1, $set2, ...])
-
Creates a new automaton from the
Set::IntSpan
objects. Classes must by in the Unicode range U+0000 .. U+10FFFF. Surrogate code points may be included but have no effect on the output as the definition of UTF-8 disallows their encoding. Sets should not be empty.The data of the automaton can be accessed through the returned hash reference. The following keys are available:
- start_state
-
The automaton's start state.
- transitions
-
An array of four-tuples representing a transition. The tuples are encoded as array reference. The items are: source state, first byte, last byte, destination state. The automaton moves from the source state to the destination state iff $input_byte >= $first and $input_byte <= $last.
The set of transitions omits illegal moves, if a character is not in any of the input sets, or if the input is not a valid UTF-8 sequence, then that is indicated by the absence of corresponding transitions.
Accepting states do not have outgoing transitions, meaning the automaton will never accept more than a single code point. Copying the transitions of the start state to each accepting state yields an automaton for strings.
For instance, the transitions for the automaton that accepts any valid single-character UTF-8 sequence, as generated by the code in the synopsis, might look like this:
[ 0, 0x00, 0x7F, 1 ], [ 0, 0xC2, 0xDF, 2 ], [ 0, 0xE0, 0xE0, 3 ], [ 0, 0xE1, 0xEC, 4 ], [ 0, 0xED, 0xED, 5 ], [ 0, 0xEE, 0xEF, 4 ], [ 0, 0xF0, 0xF0, 6 ], [ 0, 0xF1, 0xF3, 7 ], [ 0, 0xF4, 0xF4, 8 ], [ 2, 0x80, 0xBF, 1 ], [ 3, 0xA0, 0xBF, 2 ], [ 4, 0x80, 0xBF, 2 ], [ 5, 0x80, 0x9F, 2 ], [ 6, 0x90, 0xBF, 4 ], [ 7, 0x80, 0xBF, 4 ], [ 8, 0x80, 0x8F, 4 ],
- input_classes
-
An array reference of
Set::IntSpan
objects, exactly as they have been passed to the constructor. - disjoint_classes
-
An array reference of
Set::IntSpan
objects. This is the smallest set of classes such that each character in the input classes belongs to exactly one class. Each object is a subset of an input class. - state_to_disjoint
-
A hash reference that maps a state to one of the disjoint character classes. If there is such mapping, the state is an accepting state, otherwise it is not in an accepting state.
- disjoint_to_state
-
An array reference that maps a disjoint class to a state. Mapping between disjoint classes and states is bijective, so this holds the same information as the state_to_disjoint hash reference.
- disjoint_to_input
-
An array reference that maps a disjoint class to one or more input classes. For example:
if (exists $dfa->{state_to_disjoint}->{$state}) { my $dset = $dfa->{state_to_disjoint}->{$state}; my $sets = $dfa->{disjoint_to_input}->[$dset]; print "The matched code point is in these sets: ", join ',', @$sets; }
Practical applications will likely have to encode the automaton in a different form. For example, using a binary search as transition function the list of transitions would have to be sorted by source state and range, and it may be beneficial to rename accepting states so that the state numbers corresponds to the number of the disjoint classes.
Applications that encode the transition information in arrays should note that if the input is known to be valid UTF-8, perhaps by using a separate DFA that ensures just that in parrallel, only the start state has transitions from most bytes. Other states move only over bytes in the range 0x80 .. 0xBF. So to save space, two arrays could be used.
- as_expressions()
-
Returns a list of regular expressions, one for each disjoint class, in the order of the disjoint classes. For instance, the code in the synopsis might print (wrapped to meet length restrictions):
[\x00-\x7f]|([\xc2-\xdf]|\xed[\x80-\x9f]| ([\xe1-\xec\xee-\xef]|\xf4[\x80-\x8f]| [\xf1-\xf3][\x80-\xbf]|\xf0[\x90-\xbf] )[\x80-\xbf]|\xe0[\xa0-\xbf])[\x80-\xbf]
AUTHOR / COPYRIGHT / LICENSE
Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
This module is licensed under the same terms as Perl itself.