NAME
Regexp::ERE - extended regular expressions and finite automata
SYNOPSIS
use Regexp::ERE qw(
&ere_to_nfa
&nfa_inter
&nfa_to_regex
&nfa_to_input_constraints
&nfa_to_dfa
&dfa_to_min_dfa
);
# condition 1: begins with abc or def
my $nfa1 = ere_to_nfa('^(abc|def)');
# condition 2: ends with 123 or 456
my $nfa2 = ere_to_nfa('(123|456)$');
# condition 1 and condition 2
my $inter_nfa = nfa_inter($nfa1, $nfa2);
# compute extended regular expression (string)
my $ere = nfa_to_regex($inter_nfa);
# compute perl regular expression
my $perlre = nfa_to_regex($inter_nfa, 1);
# compute weaker input constraints suitable for widgets
my ($input_constraints, $split_perlre)
= nfa_to_input_constraints($inter_nfa);
# minimal dfa (simpler regular expression happens to result)
my $nfa3 = ere_to_nfa('^(a|ab|b)*$');
my $dfa3 = nfa_to_dfa($nfa3);
my $min_dfa3 = dfa_to_min_dfa($dfa3);
my $ere3 = nfa_to_regex($min_dfa3);
DESCRIPTION
Pure-perl module for:
Parsing POSIX Extended Regular Expressions (
$ere
) into Non-Deterministic Finite Automata ($nfa
)Manipulating
$nfa
s (concatenating, or-ing, and-ing)Computing Deterministic Finite Automata (
$dfa
s) from$nfa
s (powerset construction)Computing minimal
$dfa
s from$dfa
s (Hopcroft's algorithm)Computing
$ere
s or Perl Regular Expressions from$nfa
or$dfa
(Warshall algorithm)Heuristically deriving (possibly weaker) constraints from a
$nfa
or$dfa
suitable for display in a graphical user interface, i.e. a sequence of widgets of type 'free text' and 'drop down';Example:
'^(abc|def)'
=>$nfa
=>[['abc', 'def'], 'free text']
GLOSSARY AND CONVERSIONS OVERVIEW
Conversions overview
$ere -> $nfa -> $tree -> $regex ($ere or $perlre)
-> $input_constraints
The second argument of -> $regex conversions is an optional boolean,
true : conversion to a compiled perl regular expression
false: conversion to an ere string
The -> $input_constraints conversions return a pair (
$input_constraints: aref as described at tree_to_input_constraints()
$split_perlre : a compiled perl regular expression
)
Glossary
- $char_class
-
A set of unicode characters.
- $ere
-
Extended regular expression (string). See
ere_to_nfa($ere)
for the exact syntax. - $perlre
-
Perl regular expression
- $nfa
-
Non-deterministic finite automaton
- $dfa
-
Deterministic finite automaton (special case of
$nfa
) - $tree
-
Intermediate hierarchical representation of a regular expression (which still can be manipulated before stringification), similar to a parse tree (but used for generating, not for parsing).
- $input_constraints
-
Ad-hoc data structure representing a list of gui-widgets (free text fields and drop-down lists), a helper for entering inputs conforming to a given
$nfa
.
DATA STRUCTURES AND SUBROUTINES
Each of the documented subroutines can be imported, for instance use ERE qw(&ere_to_nfa &nfa_match);
.
Character class
WARNING: $char_class
es must be created exclusively by char_to_cc()
or interval_list_to_cc()
for equivalent character classes to be always the same array reference. For the same reason, $char_class
es must never be mutated.
In this implementation, the state transitions of a $nfa
are based upon character classes (not single characters). A character class is an ordered list of disjoint, non-mergeable intervals (over unicode code points, i.e. positive integers).
$char_class = [
[ $low_0, $high_0 ] # $interval_0
, [ $low_1, $high_1 ] # $interval_1
, ...
]
Constraints:
1: 0 <= $$char_class[$i][0] (0 <= low)
2: $$char_class[$i][1] <= MAX_CHAR (high <= MAX_CHAR)
3: $$char_class[$i][0] <= $$char_class[$i][1] (low <= high)
4: $$char_class[$i][1] + 1 < $$char_class[$i+1][0] (non mergeable)
Exceptions (anchors used only in the parsing phase only):
begin : [ -2, -1 ]
end : [ -3, -2 ]
begin or end : [ -3, -1 ]
Immediately after parsing, such pseudo-character classes are removed by nfa_resolve_anchors()
(internal subroutine).
- char_to_cc($c)
-
Returns the unique
$char_class
equivalent to[[ord($c), ord($c)]]
. - interval_list_to_cc($interval_list)
-
$interval_list
is an arbitrary list of intervals. Returns the unique$char_class
whose reunion of intervals is the same set as the reunion of the intervals of$interval_list
.Example:
interval_list_to_cc([[102, 112], [65, 90], [97, 102], [113, 122]]) returns [[65, 90], [97, 122]] (i.e [f-p]|[A-Z]|[a-f]|[q-z] => [A-Z]|[a-z])
Note that both
$interval_list
and$char_class
are lists of intervals, but only$char_class
obeys the constraints above, while$interval_list
does not.Remark also that
interval_list_to_cc($char_class)
is the identity (returns the same reference as given) on$char_class
es returned by eitherinterval_list_to_cc()
orchar_to_cc()
. - cc_union(@char_classes)
-
Returns the unique
$char_class
containing all characters of all given@char_classes
.
Nfa
WARNING: nfa_xxx()
routines are destructive, the $nfa
references given as arguments will not be valid $nfa
any more. Furthermore, the same $nfa
reference must be used only once as argument. For instance, for concatenating a $nfa
with itself, nfa_concat(nfa, nfa)
does not work; instead, nfa_concat($nfa, nfa_clone($nfa))
must be used; or even nfa_concat(nfa_clone($nfa), nfa_clone($nfa)
if the original $nfa
is to be used further.
$nfa = [ $state_0, $state_1, ... ]
$state = [
$accepting
, $transitions
]
$transitions = [
[ $char_class_0 => $state_ind_0 ]
, [ $char_class_1 => $state_ind_1 ]
, ...
]
In the same $transition
, $state_ind_i
are pairwise different and are valid indexes of @$nfa
. There is exactly one initial state at index 0.
nfa_clone(@nfas)
-
Maps each of the given
@nfas
to a clone. nfa_quant($in_nfa, $min, $max, $prev_has_suffix, $next_has_prefix)
-
Precondition:
0 <= $min && ( $max eq '' || $min <= $max)
Returns
$out_nfa
, a$nfa
computed from$in_nfa
.Let L be the language accepted by
$in_nfa
and M the language accepted by$out_nfa
. Then a word m belongs to M if and only if and ordered list (l_1, ..., l_r) of words belonging to L exists such that:$min <= r and ($max eq '' or r <= $max) and m is the concatenation of (l_1, ..., l_r)
Examples with
$in_nfa
being a$nfa
accepting'^a$'
:nfa_quant($in_nfa, 2, 4 ) accepts '^a{2,4}$' nfa_quant($in_nfa, 0, '') accepts '^a{0,}$' (i.e. '^a*$')
$pref_has_prefix
and$next_has_prefix
are hints for dispatching$min
, for example:'a+' => 'a*a' (!$prev_has_suffix && $next_has_prefix) 'a+' => 'aa*' ( $prev_has_suffix && !$next_has_prefix) 'a{2,}' => 'aa*a' ( $prev_has_suffix && $next_has_prefix)
nfa_concat(@in_nfas)
-
Returns
$out_nfa
, a$nfa
computed from@in_nfas
.Let r be the number of given
@in_nfas
, L_i the language accepted by$in_nfas[$i]
and M the language accepted by$out_nfa
. Then a word m belongs to M if and only if an ordered list (l_1, ..., l_r) of words exists, l_i belonging to L_i, such that m is the concatenation of (l_1, ..., l_r). nfa_union(@in_nfas)
-
Returns
$out_nfa
, a$nfa
computed from@in_nfas
.$out_nfa
accepts a word w if and only if at least one of@in_nfas
accepts w. nfa_inter(@in_nfas)
-
Returns
$out_nfa
, a $$nfa
computed from@in_nfas
.$out_nfa
accepts a word w if and only if each of@in_nfas
accepts w. nfa_match($in_nfa, $str)
-
Returns true if and only if
$in_nfa
accepts$str
. nfa_isomorph($nfa1, $nfa2)
-
Returns true if and only if the labeled graphs represented by
$nfa1
and$nfa2
are isomorph. While isomorph$nfa
s accept the same language, the converse is not true. nfa_to_dfa($in_nfa)
-
Compute a deterministic finite automaton from
$in_nfa
(powerset construction).The data structure of a deterministic finite automaton (dfa) is the same as that of a non-deterministic one, but it is further constrained: For each state and each unicode character there exist exactly one transition (i.e. a pair
($char_class, $state_index)
) matching this character.Note that the following constraint hold for both a
$dfa
and a$nfa
: For each pair of state p1 and p2, there exists at most one transition from p1 to p2 (artefact of this implementation). dfa_to_min_dfa($in_dfa)
-
Computes a minimal deterministic
$dfa
from the given$in_dfa
(Hopcroft's algorithm).Note that the given
$in_dfa
must be a$dfa
, as returned fromnfa_to_dfa()
, and not a mere$nfa
.Myhill-Nerode theorem: two minimal dfa accepting the same language are isomorph (i.e.
nfa_isomorph()
returns true).
Tree
$tree = [ $star, [ $alt_0, $alt_1, ... ] ]
or $char_class # ref($char_class) eq CHAR_CLASS
or undef # accepting nothing
$alt = [ $tree_0, $tree_1, ... ]
A $tree
is a hierarchical data structure used as intermediate form for regular expression generation routines.
Similar to a parse tree, except that the $tree
s described here are not the direct result of the parsing routines ere_to_xxx()
; indeed, the parsing routines generate a $nfa
, which then can be converted to a $tree
.
A string is spanned by $tree = [$star, [ $alt_0, $alt_1, ... ] ]
if it is spanned by one of the $alt_i
(if $star
is false) of a repetition thereof (if $star
is true).
A string is spanned by $alt = [ $tree_0, $tree_1, ...]
if it is the concatenation of @substrings
, each $substrings[$i]
being spanned by $$alt[$i]
.
nfa_to_tree($nfa)
-
Converts a
$nfa
to a$tree
. Returnsundef
if the$nfa
accepts nothing (not even the empty string). tree_to_regex($tree, $to_perlre)
-
Converts a
$tree
to an$ere
(if$to_perlre
is false) or to a$perlre
(if$to_perlre
is true).
Input constraints
$input_constraints = [ $input_constraint_0, $input_constraint_1, ... ]
$input_constraint = [ 'word_0', 'word_1', ... ] (drop down)
or 'free_text' (free text)
tree_to_input_constraints($tree)
-
Converts a
$tree
to a pair($input_constraints, $split_str)
.$split_perlre
is a compiled perl regular expression splitting a string accordingly to$input_constraints
. This$perlre
matches if and only if each drop down can be assigned a value; then$str =~ $perlre
in list context returns as many values as@$input_constraints
.
Ere
An $ere
is a perl string.
The syntax an $ere
is assumed to follow is based on POSIX ERE (else the ere_to_xxx()
routines will die()
).
Unsupported POSIX features: back-references, equivalence classes [[=a=]]
, character class [[:digit:]]
, collating symbols [[.ch.]]
.
)
is always a special character. POSIX says that )
is a normal character if there is no matching (
.
There is no escape sequences such as \t
for tab or \n
for line feed. POSIX does not specify such escape sequences neither.
\
before a non-special character is ignored (except in bracket expressions). POSIX does not allow it.
The empty string is legal in alternations ((|a)
is equivalent to (a?)
). POSIX does not allow it. The (|a)
form is generated by the xxx_to_ere()
routines (avoiding quantifiers other than *
).
[a-l-z]
is interpreted as ([a-l] | - | z)
(but it is discouraged to rely upon this implementation artefact). POSIX says that the interpretation of this construct is undefined.
In bracket expressions, \
is a normal character, thus ]
as character must occur first, or second after a ^
(POSIX compliant, but possibly surprising for perl programmers).
All unicode characters supported by perl are allowed as literal characters.
ere_to_nfa($ere)
-
Parses an
$ere
to a$nfa
.WARNING: the parsing routines, in particular
ere_to_xxx()
,die()
on syntax errors; thus the caller may want to eval-trap such errors. - quote($string)
-
Returns $string with escaped special characters.
Shorthands
ere_to_tree($ere)
:=nfa_to_tree(ere_to_nfa($ere))
ere_to_regex($ere, $to_perlre)
:=tree_to_regex(ere_to_tree($ere), $to_perlre)
nfa_to_regex($nfa, $to_perlre)
:=tree_to_regex(nfa_to_tree($nfa), $to_perlre)
ere_to_input_constraints($ere)
:=tree_to_input_constraints(ere_to_tree($ere))
nfa_to_input_constraints($nfa)
:=tree_to_input_constraints(nfa_to_tree($nfa))
nfa_to_min_dfa($nfa)
:=dfa_to_min_dfa(nfa_to_dfa($nfa))
AUTHOR
Loïc Jonas Etienne <loic.etienne@tech.swisssign.com>
COPYRIGHT and LICENSE
Artistic License 2.0 http://www.perlfoundation.org/artistic_license_2_0