NAME
Regexp::ERE - extended regular expressions and finite automata
SYNOPSIS
use Regexp::ERE qw(
&ere_to_nfa
&nfa_inter
&nfa_to_regex
&nfa_to_input_constraints
&nfa_to_dfa
&dfa_to_min_dfa
);
# condition 1: begins with abc or def
my $nfa1 = ere_to_nfa('^(abc|def)');
# condition 2: ends with 123 or 456
my $nfa2 = ere_to_nfa('(123|456)$');
# condition 1 and condition 2
my $inter_nfa = nfa_inter($nfa1, $nfa2);
# compute extended regular expression (string)
my $ere = nfa_to_regex($inter_nfa);
# compute perl regular expression
my $perlre = nfa_to_regex($inter_nfa, 1);
# compute weaker input constraints suitable for widgets
my ($input_constraints, $split_perlre)
= nfa_to_input_constraints($inter_nfa);
# minimal dfa (simpler regular expression happens to result)
my $nfa3 = ere_to_nfa('^(a|ab|b)*$');
my $dfa3 = nfa_to_dfa($nfa3);
my $min_dfa3 = dfa_to_min_dfa($dfa3);
my $ere3 = nfa_to_regex($min_dfa3);
DESCRIPTION
Pure-perl module for:
Parsing POSIX Extended Regular Expressions (
$ere) into Non-Deterministic Finite Automata ($nfa)Manipulating
$nfas (concatenating, or-ing, and-ing)Computing Deterministic Finite Automata (
$dfas) from$nfas (powerset construction)Computing minimal
$dfas from$dfas (Hopcroft's algorithm)Computing
$eres or Perl Regular Expressions from$nfaor$dfa(Warshall algorithm)Heuristically deriving (possibly weaker) constraints from a
$nfaor$dfasuitable for display in a graphical user interface, i.e. a sequence of widgets of type 'free text' and 'drop down';Example:
'^(abc|def)'=>$nfa=>[['abc', 'def'], 'free text']
GLOSSARY AND CONVERSIONS OVERVIEW
Conversions overview
$ere -> $nfa -> $tree -> $regex ($ere or $perlre)
-> $input_constraints
The second argument of -> $regex conversions is an optional boolean,
true : conversion to a compiled perl regular expression
false: conversion to an ere string
The -> $input_constraints conversions return a pair (
$input_constraints: aref as described at tree_to_input_constraints()
$split_perlre : a compiled perl regular expression
)
Glossary
- $char_class
-
A set of unicode characters.
- $ere
-
Extended regular expression (string). See
ere_to_nfa($ere)for the exact syntax. - $perlre
-
Perl regular expression
- $nfa
-
Non-deterministic finite automaton
- $dfa
-
Deterministic finite automaton (special case of
$nfa) - $tree
-
Intermediate hierarchical representation of a regular expression (which still can be manipulated before stringification), similar to a parse tree (but used for generating, not for parsing).
- $input_constraints
-
Ad-hoc data structure representing a list of gui-widgets (free text fields and drop-down lists), a helper for entering inputs conforming to a given
$nfa.
DATA STRUCTURES AND SUBROUTINES
Each of the documented subroutines can be imported, for instance use ERE qw(&ere_to_nfa &nfa_match);.
Character class
WARNING: $char_classes must be created exclusively by char_to_cc() or interval_list_to_cc() for equivalent character classes to be always the same array reference. For the same reason, $char_classes must never be mutated.
In this implementation, the state transitions of a $nfa are based upon character classes (not single characters). A character class is an ordered list of disjoint, non-mergeable intervals (over unicode code points, i.e. positive integers).
$char_class = [
[ $low_0, $high_0 ] # $interval_0
, [ $low_1, $high_1 ] # $interval_1
, ...
]
Constraints:
1: 0 <= $$char_class[$i][0] (0 <= low)
2: $$char_class[$i][1] <= MAX_CHAR (high <= MAX_CHAR)
3: $$char_class[$i][0] <= $$char_class[$i][1] (low <= high)
4: $$char_class[$i][1] + 1 < $$char_class[$i+1][0] (non mergeable)
Exceptions (anchors used only in the parsing phase only):
begin : [ -2, -1 ]
end : [ -3, -2 ]
begin or end : [ -3, -1 ]
Immediately after parsing, such pseudo-character classes are removed by nfa_resolve_anchors() (internal subroutine).
- char_to_cc($c)
-
Returns the unique
$char_classequivalent to[[ord($c), ord($c)]]. - interval_list_to_cc($interval_list)
-
$interval_listis an arbitrary list of intervals. Returns the unique$char_classwhose reunion of intervals is the same set as the reunion of the intervals of$interval_list.Example:
interval_list_to_cc([[102, 112], [65, 90], [97, 102], [113, 122]]) returns [[65, 90], [97, 122]] (i.e [f-p]|[A-Z]|[a-f]|[q-z] => [A-Z]|[a-z])Note that both
$interval_listand$char_classare lists of intervals, but only$char_classobeys the constraints above, while$interval_listdoes not.Remark also that
interval_list_to_cc($char_class)is the identity (returns the same reference as given) on$char_classes returned by eitherinterval_list_to_cc()orchar_to_cc(). - cc_union(@char_classes)
-
Returns the unique
$char_classcontaining all characters of all given@char_classes.
Nfa
WARNING: nfa_xxx() routines are destructive, the $nfa references given as arguments will not be valid $nfa any more. Furthermore, the same $nfa reference must be used only once as argument. For instance, for concatenating a $nfa with itself, nfa_concat(nfa, nfa) does not work; instead, nfa_concat($nfa, nfa_clone($nfa)) must be used; or even nfa_concat(nfa_clone($nfa), nfa_clone($nfa) if the original $nfa is to be used further.
$nfa = [ $state_0, $state_1, ... ]
$state = [
$accepting
, $transitions
]
$transitions = [
[ $char_class_0 => $state_ind_0 ]
, [ $char_class_1 => $state_ind_1 ]
, ...
]
In the same $transition, $state_ind_i are pairwise different and are valid indexes of @$nfa. There is exactly one initial state at index 0.
nfa_clone(@nfas)-
Maps each of the given
@nfasto a clone. nfa_quant($in_nfa, $min, $max, $prev_has_suffix, $next_has_prefix)-
Precondition:
0 <= $min && ( $max eq '' || $min <= $max)Returns
$out_nfa, a$nfacomputed from$in_nfa.Let L be the language accepted by
$in_nfaand M the language accepted by$out_nfa. Then a word m belongs to M if and only if and ordered list (l_1, ..., l_r) of words belonging to L exists such that:$min <= r and ($max eq '' or r <= $max) and m is the concatenation of (l_1, ..., l_r)Examples with
$in_nfabeing a$nfaaccepting'^a$':nfa_quant($in_nfa, 2, 4 ) accepts '^a{2,4}$' nfa_quant($in_nfa, 0, '') accepts '^a{0,}$' (i.e. '^a*$')$pref_has_prefixand$next_has_prefixare hints for dispatching$min, for example:'a+' => 'a*a' (!$prev_has_suffix && $next_has_prefix) 'a+' => 'aa*' ( $prev_has_suffix && !$next_has_prefix) 'a{2,}' => 'aa*a' ( $prev_has_suffix && $next_has_prefix) nfa_concat(@in_nfas)-
Returns
$out_nfa, a$nfacomputed from@in_nfas.Let r be the number of given
@in_nfas, L_i the language accepted by$in_nfas[$i]and M the language accepted by$out_nfa. Then a word m belongs to M if and only if an ordered list (l_1, ..., l_r) of words exists, l_i belonging to L_i, such that m is the concatenation of (l_1, ..., l_r). nfa_union(@in_nfas)-
Returns
$out_nfa, a$nfacomputed from@in_nfas.$out_nfaaccepts a word w if and only if at least one of@in_nfasaccepts w. nfa_inter(@in_nfas)-
Returns
$out_nfa, a $$nfacomputed from@in_nfas.$out_nfaaccepts a word w if and only if each of@in_nfasaccepts w. nfa_match($in_nfa, $str)-
Returns true if and only if
$in_nfaaccepts$str. nfa_isomorph($nfa1, $nfa2)-
Returns true if and only if the labeled graphs represented by
$nfa1and$nfa2are isomorph. While isomorph$nfas accept the same language, the converse is not true. nfa_to_dfa($in_nfa)-
Compute a deterministic finite automaton from
$in_nfa(powerset construction).The data structure of a deterministic finite automaton (dfa) is the same as that of a non-deterministic one, but it is further constrained: For each state and each unicode character there exist exactly one transition (i.e. a pair
($char_class, $state_index)) matching this character.Note that the following constraint hold for both a
$dfaand a$nfa: For each pair of state p1 and p2, there exists at most one transition from p1 to p2 (artefact of this implementation). dfa_to_min_dfa($in_dfa)-
Computes a minimal deterministic
$dfafrom the given$in_dfa(Hopcroft's algorithm).Note that the given
$in_dfamust be a$dfa, as returned fromnfa_to_dfa(), and not a mere$nfa.Myhill-Nerode theorem: two minimal dfa accepting the same language are isomorph (i.e.
nfa_isomorph()returns true).
Tree
$tree = [ $star, [ $alt_0, $alt_1, ... ] ]
or $char_class # ref($char_class) eq CHAR_CLASS
or undef # accepting nothing
$alt = [ $tree_0, $tree_1, ... ]
A $tree is a hierarchical data structure used as intermediate form for regular expression generation routines.
Similar to a parse tree, except that the $trees described here are not the direct result of the parsing routines ere_to_xxx(); indeed, the parsing routines generate a $nfa, which then can be converted to a $tree.
A string is spanned by $tree = [$star, [ $alt_0, $alt_1, ... ] ] if it is spanned by one of the $alt_i (if $star is false) of a repetition thereof (if $star is true).
A string is spanned by $alt = [ $tree_0, $tree_1, ...] if it is the concatenation of @substrings, each $substrings[$i] being spanned by $$alt[$i].
nfa_to_tree($nfa)-
Converts a
$nfato a$tree. Returnsundefif the$nfaaccepts nothing (not even the empty string). tree_to_regex($tree, $to_perlre)-
Converts a
$treeto an$ere(if$to_perlreis false) or to a$perlre(if$to_perlreis true).
Input constraints
$input_constraints = [ $input_constraint_0, $input_constraint_1, ... ]
$input_constraint = [ 'word_0', 'word_1', ... ] (drop down)
or 'free_text' (free text)
tree_to_input_constraints($tree)-
Converts a
$treeto a pair($input_constraints, $split_str).$split_perlreis a compiled perl regular expression splitting a string accordingly to$input_constraints. This$perlrematches if and only if each drop down can be assigned a value; then$str =~ $perlrein list context returns as many values as@$input_constraints.
Ere
An $ere is a perl string.
The syntax an $ere is assumed to follow is based on POSIX ERE (else the ere_to_xxx() routines will die()).
Unsupported POSIX features: back-references, equivalence classes [[=a=]], character class [[:digit:]], collating symbols [[.ch.]].
) is always a special character. POSIX says that ) is a normal character if there is no matching (.
There is no escape sequences such as \t for tab or \n for line feed. POSIX does not specify such escape sequences neither.
\ before a non-special character is ignored (except in bracket expressions). POSIX does not allow it.
The empty string is legal in alternations ((|a) is equivalent to (a?)). POSIX does not allow it. The (|a) form is generated by the xxx_to_ere() routines (avoiding quantifiers other than *).
[a-l-z] is interpreted as ([a-l] | - | z) (but it is discouraged to rely upon this implementation artefact). POSIX says that the interpretation of this construct is undefined.
In bracket expressions, \ is a normal character, thus ] as character must occur first, or second after a ^ (POSIX compliant, but possibly surprising for perl programmers).
All unicode characters supported by perl are allowed as literal characters.
ere_to_nfa($ere)-
Parses an
$ereto a$nfa.WARNING: the parsing routines, in particular
ere_to_xxx(),die()on syntax errors; thus the caller may want to eval-trap such errors. - quote($string)
-
Returns $string with escaped special characters.
Shorthands
ere_to_tree($ere):=nfa_to_tree(ere_to_nfa($ere))ere_to_regex($ere, $to_perlre):=tree_to_regex(ere_to_tree($ere), $to_perlre)nfa_to_regex($nfa, $to_perlre):=tree_to_regex(nfa_to_tree($nfa), $to_perlre)ere_to_input_constraints($ere):=tree_to_input_constraints(ere_to_tree($ere))nfa_to_input_constraints($nfa):=tree_to_input_constraints(nfa_to_tree($nfa))nfa_to_min_dfa($nfa):=dfa_to_min_dfa(nfa_to_dfa($nfa))
AUTHOR
Loïc Jonas Etienne <loic.etienne@tech.swisssign.com>
COPYRIGHT and LICENSE
Artistic License 2.0 http://www.perlfoundation.org/artistic_license_2_0