NAME

RE - A regular expression base class

SYNOPSIS

use RE;
use DFA;
my $re = RE->new();
$re->set_re('a|b|(hi)*');
my $dfa = $re->to_dfa();
print $dfa->info(); # see stuff on DFA

DESCRIPTION

This module implements a regular expression parser, and supports the conversion of a RE to a deterministic finite automata. A homegrown recursive descent parser is used to build the parse tree, and the method used to conver the regular expression to a DFA uses no intermediate NFA.

Recursive Descent-safe Regex Grammar:

R  -> O

O  -> CO'

O' -> '|' CO' | epsilon

C  -> SC'

C' -> .SC' | epsilon

S  -> LS'

S' -> *S' | epsilon

L  -> a | b | c |..| 0 | 1 | 2 |..| (R) | epsilon

Terminal symbols: a,b,c,..,z,0,1,2,..,9,|,*,(,)

NOTE: Concatenation operator, '.', is not a terminal symbol
and should not be included in the regex

FAQ:
  Q: Does this support Perl regular expressions?
  A: No, just the regular expression using the terminal symbols
     listed above.

Valid terminal characters include:

a b c d e f g h i j k l m n o p q r s t u v w x y z

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

0 1 2 3 4 5 6 7 8 9 + - = , ? & [ ] { } . ~ ^ @ % $

: ; < >

Methods

new

Create a brand spaking new RE object; does not accept regex here

set_epsilon

Not really for public consumption, but could be in the future. Defines how epsilon (i.e., the null string) is represented in the parse tree.

get_epsilon_symbol

Returns the string representation of the null string

set_re

Defines a regular expression for RE to parse through

get_re

Returns the regular expression set by set_re as a string

set_current

Meant for private consumption. Initializes the stack used to store what part of the regex has yet to be parsed

reset_current

Meant for private consumption. Initializes the stack used to store what part of the regex has yet to be parsed

get_current

Returns what is in the _CURRENT_STR stack

minimize

Minimizes the regular expression optimally

shrink

Shrinks RE using hueristics to create a 'small enough' equivalent expression

to_nfa

Not implemented, but will create an NFA using Thompson's Method; from there one could do a NFA->to_dfa, and compare the resulting dfa to the one from RE->to_dfa.

thompson

Guts of RE->to_nfa; uses depth first parse tree traversal

to_dfa

Currently BREAKS on a*(b|cd)m!!!!!!!!!!!!!!

Main driver which is used to convert the regular expression to a DFA; calls RE->parse() internally if _PARSE_TREE is !defined, so no need to call before this function.

get_transitions_on

Provides a way to get the transitions using the contents of the _FOLLOW_POS table

move

Called by RE->to_re to get the sub set of new states for each sub set of states during the sub state construction process of building the DFA

firstpos

Determines firt positions for all nodes in the a parse tree

lastpost

Determines the last postition for all nodes in the parse tree

followpos

Determines the first postition for all nodes in the parse tree

followpos

Returns hash containing follow position table

parse

Parses regular expressin set by RE->set_re, and stores the parse tree in _PARSE_TREE

cat_endmarker

Adds '#', or the end of regex marker to the parse tree; not for public consumption

match

Matches current terminal symbol with terminal character, and loads up the next lookahead character

lookahead

Returns value of current lookahead

nexttoken

Sets next token as lookahead

R

R -> O

O

O -> CO'

O_prime

O' -> '|'CO' | epsilon

C

C -> SC'

C_prime

C' -> .SC' | epsilon

S

S -> LS'

S_prime

S' -> *S' | epsilon

L

L -> a | b | c |..| 0 | 1 | 2 |..| (R)

get_next_pos

Returns the next position, used in creating leaf nodes for terminal symbols (minus null string)

get_curr_pos

Returns the current count of terminal symbols (minus null string)

set_parse_tree

Set parse tree

get_parse_tree

Return parse tree

get_terminals

Returns array of terminal symbols

is_terminal

Checks to see if given character is a terminal symbol

is_member

General subroutine used to test if an element is already in an array

get_symbols

Returns array of all symbols used in current regex

trace_on

Turns on tracing - allows to trace the recursive descent parsing

trace_off

Turns tracing off

trace

Returns value of _TRACE

toggle_cat_state

Toggles cat state instead of cat'ing a '.' to everything

get_cat_state

Returns $self->{_CAT_STATE} (1|0)

set_error

Increments error count for regex parsing

get_error

Returns error count

set_done

Sets done flag

done

Returns if done or not

DESTROY

Called automatically when object is destroyed either explicitly or automatically when references go to 0 or when the main program is finished

AUTHOR

Brett D. Estrade - <estrabd AT mailcan DOT com>

CAVEATS

Currently, all states are stored as labels. There is also no integrity checking for consistency among the start, final, and set of all states.

BUGS

Not saying it is bug free, just saying I haven't hit any yet :)

AVAILABILITY

Anonymous CVS Checkout at http://www.brettsbsd.net/cgi-bin/viewcvs.cgi/

ACKNOWLEDGEMENTS

This suite of modules started off as a homework assignment for a compiler class I took for my MS in computer science at the University of Southern Mississippi.

COPYRIGHT

This code is released under the same terms as Perl.