NAME
Regexp::Assemble - Assemble multiple Regular Expressions into one RE
VERSION
This document describes version 0.05 of Regexp::Assemble, released 2004-12-10.
SYNOPSIS
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+\\d*\\s+c' );
$ra->add( 'a\\w+\\d+' );
$ra->add( 'a\\d+' );
print $ra->re; # prints (?:a(?:b+(?:\d*\s+)?c|(?:\w+)?\d+))
DESCRIPTION
Regexp::Assemble takes a number of regular expressions and assembles them into a single regular expression (or RE) that matches all that the individual REs match, only what they match and nothing else.
The assembled RE is more sophisticated than a brute force join( '|', @list)
concatenation. Common subexpressions are shared; alternations are introduced only when patterns diverge. As a result, backtracking is kept to a minimum. If a given path fails... there are no other paths to try and so the expression fails quickly. If no wildcards like .*
appear, no backtracking will be performed. In very large expressions, this can provide a large speed boost.
As a result, instead of having a large list of expressions to loop over, the string only needs to be tested against one expression. This is especially interesting when on average the expression fails to match most of the time.
This is useful when you have a large number of patterns that you want to apply against a string. If you only that one of the patterns matched then this module is for you. Nonetheless, large numbers of alternations are dealt with in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie.
Some more examples of usage appear in the accompanying README.
METHODS
- new
-
Creates a new Regexp::Assemble object. A set of key/value parameters can be supplied to control the finer details of the object's behaviour.
chomp, controls whether the pattern should be chomped before being lexed. Handy if you are reading lines from a file. By default, no chomping is performed.
filter, allows you to add a callback to enable sanity checks on the pattern being loaded.
mutable, controls whether new patterns can be added to the object after the RE is generated.
reduce, controls whether tail reduction occurs or not. If set, patterns like
a(?:bc+d|ec+d)
will be reduced toa[be]c+d
. That is, the end of the pattern in each part of the b... and d... alternations is identical, and hence is hoisted out of the alternation and placed after it.lex, specifies the pattern used to lex the input lines into tokens. You could replace the default pattern by a more sophisticated version that matches arbitrarily nested parentheses, for example.
debug, controls whether copious amounts of output is produced during the loading stage or the reducing stage of assembly.
my $ra = Regexp::Assemble->new; my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );
A more detailed explanation of these attributes follows.
- clone
-
Clones the contents of a Regexp::Assemble object and creates a new object (in other words it performs a deep copy).
my $copy = $ra->clone();
If the Storable module is installed, its dclone method will be used, otherwise the cloning will be performed using a pure perl approach.
- add(LIST)
-
Takes a string, breaks it apart into a set of tokens (respecting meta characters) and inserts the resulting list into the
R::A
object. It uses a naive regular expression to lex the string that may be fooled complex expressions (specifically, it will fail to lex nested parenthetical expressions such asab(?cd(?:ef)?gh)
correctly). If this is the case, the end of the string will not be tokenised correctly and returned as one long string.On the one hand, this may indicate that the patterns you are trying to feed the
R::A
object are too complex. Simpler patterns might allow the algorithm to work more effectively and spot reductions in the resulting pattern.On the other hand, you can supply your own pattern to perform the lexing if you need, and if that is not enough, you can provide a code block for the ultimate in lexing flexibility.
A list of strings may be supplied, thus you can pass it a file handle of a file opened for reading:
$re->add( '\d+-\d+-\d+-\d+\.example\.com' ); $re->add( <IN> );
You probably need to set the
chomp
attribute on the object to get files to work correctly:my $re = Regexp::Assemble->new( chomp=>1 )->add( <IN> ); # or my $re = Regexp::Assemble->new; $re->chomp(1)->add( <IN> );
This method is chainable.
- insert(LIST)
-
Takes a list of tokens representing a regular expression and stores them in the object. Note: you should not pass it a bare regular expression, such as
ab+c?d*e
. You must pass it as a list of tokens, e.g.('a', 'b+', 'c?', 'd*', 'e')
.This method is chainable, e.g.:
my $ra = Regexp::Assemble->new ->insert( qw[ a b+ c? d* e ] ) ->insert( qw[ a c+ d+ e* f ] );
The
add
method callsinsert
internally. - as_string
-
Assemble the expression and return it as a string. You may want to do this if you are writing the pattern to a file. The following arguments can be passed to control the aspect of the resulting pattern:
indent, the number of spaces used to indent nested grouping of a pattern. Use this to produce a pretty-printed pattern (for some definition of "pretty"). The resulting output is rather verbose. The reason is to ensure that the metacharacters
(?:
and)
always occur on otherwise empty lines. This allows you grep the result for an even more synthetic view of the pattern:egrep -v '^ *[()]' <regexp.file>
The result of the above is quite readable. Remember to backslash the spaces appearing in your own patterns if you wish to use an indented pattern in an
m/.../x
construct.If you have not set the mutable attribute on the object, calling this method will drain the internal data structure. Large numbers of patterns can eat a significant amount of memory, and this lets perl recover the memory used. Mutable means that you want to keep the internal data structure lying around in order to add additional patterns to the object after an assembled pattern has been produced.
- re
-
Assembles the pattern and return it as a compiled RE, using the
qr//
operator. Note that there is no way of specifying flags to the qr operator. If you need to do something fancy, then you should retrieve the string using the as_string operator and apply the requiredqr//imx
-type flags yourself.The indent attribute, documented in the
as_string
method can be used here.As with
as_string
, calling this method will reset the internal data structures to free the memory used in assembling the RE. This behaviour can be controlled with themutable
attribute.With method chaining, it is possible to produce a RE without having a temporary
Regexp::Assemble
object lying around, e.g.:my $re = Regexp::Assemble->new ->add( q[ab+cd+e] ) ->add( q[ac\\d+e] ) ->add( q[c\\d+e] ) ->re;
The
$re
variable now contains a Regexp object that can be used directly:while( <> ) { /$re/ and print "Something in [$_] matched\n"; )
Note that when a
Regexp::Assemble
object is used in string context (hence, within anm//
operator), there
method is called, so you don't even need to save the RE in a separate variable. The following will work as expected:my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] ); while( <IN> ) { if( /($re)/ ) { print "Here be giants: $1\n"; } }
- debug(NUMBER)
-
Turns debugging on or off. Statements are printed to the currently selected file handle (STDOUT by default). If you are already using this handle, you will have to arrange to select an output handle to a file of your own choosing, before call the
add
,as_string
orre
) functions, otherwise it will scribble all over your carefully formatted output.0 turns debugging off, 1 emits debug traces dealing with adding new patterns to the object, 2 emits debug traces dealing with the reduction of the assembled pattern.
Values can be added (or or'ed together) to trace everything
$r->debug(1)->add( '\\d+abc' );
- chomp(0|1)
-
Turns chomping on or off. When loading an object from a file the lines will contain their line separator token. This may produce undesired results. In this case, call chomp with a value of 1 to enable autochomping. Chomping is off by default.
$re->chomp( 1 ); $re->add( <DATA> );
- filter(CODE)
-
Allows you to install a callback to check that the pattern being loaded contains valid input. It receives a list on input. It is expected to return 0 to indicate that the pattern should not be added, anything else indicates that the contents are fine.
If you know that all patterns you expect to assemble contain a restricted set of of tokens (e.g. no spaces), you could do the following:
$ra->filter(sub { not grep { / / } @_ });
or
sub only_spaces_and_digits { not grep { ![\d ] } @_ } $ra->filter( \&only_spaces_and_digits );
These two examples will silently ignore faulty patterns, If you want the user to be made aware of the problem you should raise an error (via
warn
ordie
), log an error message, whatever is best. If you want to remove a filter, passundef
as a parameter.$ra->filter(undef);
This method is chainable.
- lex(SCALAR)
-
Change the pattern used to break a string apart into tokens.
- reduce(0|1)
-
Turns pattern reduction on or off. A reduced pattern may be considerably shorter than an unreduced pattern. Consider
/sl(?:ip|op|ap)/
versus/sl[aio]p/
. An unreduced pattern will be very similar to those produced byRegexp::Optimizer
. Reduction is on by default. Turning it off makes assembly much faster. - mutable(0|1)
-
When the
re
oras_string
methods are called the reduction algoritm kicks in and takes the current data structure and fold the common portions of the patterns that have been stored in the object. Once this occurs, it is no longer possible to add any more patterns.In fact, the internal structures are release to free up memory. If you have a programa that adds additional patterns to an object over a long period, you can set the mutable attribute. This will stop the internal structure from being drained and you can continue to add patterns.
The main consequence is that it the assembled pattern will not undergo any reduction (as the internal data structure undergoes such a transformation as that it becomes very difficult to cope with the change that adding a new pattern would bring about. If this is a problem, the solution is to create a non-mutable object, continue adding to it as long as needed, and each time a new assembled regular expression is required, clone the object, turn the mutable attribute off and proceed as usual.
By default the mutable attribute defaults to zero. The method can be chained.
$r->add( 'abcdef\\d+' )->mutable(0);
- reset
-
Resets the internal state of the object, as if no
add
orinsert
methods had been called. Does not modify the state of controller attributes such asdebug
,lex
,reduce
and so forth. - Default_Lexer
-
Warning: the
Default_Lexer
function is a class method, not an object method. Do not call it on an object, bad things will happen.Regexp::Assemble::Default_Lexer( '.|\\d+' );
The
Default_Lexer
method lets you replace the default pattern used for all subsequently created Regexp::Assemble objects. It will not have any effect on existing objects. (It is also possible to override the lexer pattern used on a per-object basis).The parameter should be an ordinary scalar, not a compiled pattern. The pattern should be capable of matching all parts of the string, i.e., the following constraint should hold true:
$_ = 'this will be matched by a pattern'; my $pattern = qr/./; my @token = m/($pattern)/g; $_ eq join( '' => @token ) or die;
If no parameter is supplied, the current default pattern in use will be returned.
DIAGNOSTICS
"don't pass a refname
to Default_Lexer"
You tried to replace the default lexer pattern with an object instead of a scalar. Solution: You probably tried to call $obj->Default_Lexer. Call qualified class method instead Regexp::Assemble::Default_Lexer
.
"filter method not passed a coderef"
A reference to a subroutine (anonymous or otherwise) was expected. Solution: read the documentation for the filter
method.
"lexed a zero-length token, an infinite loop would ensue"
The add
method eats the input string one token at a time. If you supply your own lexer pattern, it may match the null character, in which case the loop that processes the string will loop an infinite number of times. Solution: Fix your pattern.
"was not passed an ARRAY ref"
A coding error. Can only occur when debugging is active. Solution: turn off debugging (it's off by default).
"was not passed a HASH ref"
A coding error. Can only occur when debugging is active. Solution: turn off debugging (it's off by default).
NOTES
This module has been tested successfully with a range of versions of perl, from 5.005_03 to 5.8.6.
The expressions produced by this module can be used with the PCRE library.
Where possible, feed R::A the simplest tokens possible. Don't add a(?-\d+){2})b
when a-\d+-\d+b
. The reason is that if you also add a\d+c
the resulting REs change dramatically: a(?:(?:-\d+){2}b|-\d+c)
versus a-\d+(?:-\d+b|c)
. Since R::A doesn't analyse tokens, it doesn't know how to "unroll" the {2}
quantifier, and will fail to notice the divergence after the first -d\d+
.
What is more, when the string 'a-123000z' is matched against the first pattern, the regexp engine will have to backtrack over each alternation before determining that there is no match. No such backtracking occurs in the second pattern: The engine scans up to the 'z' and then fails immediately, since neither of the alternations start with 'z'.
R::A does, however, understand character classes. Given a-b
, axb
and a\db
, it will assemble these into a[-\dx]b
. When - appears as a candidate for a character class it will be the first character in the class. When ^ appears as a candidate for a character class it will be the last character in the class. R::A will also replace all the digits 0..9 appearing in a character class by \d
. I'd do it for letters as well, but thinking about accented characters and other glyphs hurts my head.
SEE ALSO
- Regex::PreSuf
-
Regex::PreSuf
takes a string and chops it itself into tokens of length 1. Since it can't deal with tokens of more than one character, it can't deal with meta-characters and thus no regular expressions. Which is the main reason why I wrote this module. - Regexp::Optimizer
-
Regexp::Optimizer
produces regular expressions that are similar to those produced by R::A with reductions switched off. - Text::Trie
-
Text::Trie
is well worth investigating. Tries can outperform very bushy (read: many alternations) patterns.
LIMITATIONS
The current version does not let you know which initial regular expression, that forms part of the assembled RE, was the one that caused the match. This may be addressed in a future version. (The solution probably lies in using the (?{code})
contruct to assist in tagging the original expressions).
Regexp::Assemble
does not attempt to find common substrings. For instance, it will not collapse /aabababc/
down to /a(?:ab}{3}c/
. If there's a module out there that performs this sort of string analysis I'd like to know about it. But keep in mind that the algorithms that do this are very expensive: quadratic or worse.
Regexp::Assemble
does not emit lookahead or lookbehind assertions in the expressions it produces. Nor is the wonderful and frightening (?
...)> construct used. These and other optimisations may be introduced in further releases if there is a demand for them, if I figure out how to use them in this module, or if someone sends patches.
You can't remove a pattern that has been added to an object. You'll just have to start over again. Adding a pattern is difficult enough, I'd need a solid argument to convince me to add a remove
method. If you need to do this you should read the documentation on the mutable
and clone
methods.
BUGS
The module does not produce POSIX-style regular expressions.
The algorithm used to assemble the regular expressions makes extensive use of mutually-recursive functions (i.e.: A calls B, B calls A, ...) For deeply similar expressions, it may be possible to provoke "Deep recursion" warnings.
Regexp::Assemble
does not assign meaning to meta-characters. For instance, if the following two pattern lists are given: a\d
and a\d+
, it will not determine that \d
can be matched by \d+
. Instead, it will produce a(?:\d|\d+)
. Along a similar line of reasoning, it will not determine that a
and a\d+
is equivalent to a\d*
(It will produce a(?:\d+)?
instead).
When a string is tested against an assembled RE, it may match when it shouldn't, or else not match when it should. The module has been tested extensively, and has an extensive test suite, but you never know... In either case, it is a bug, and I want to know about it.
A temporary work-around is to disable reductions:
my $pattern = $assembler->reduce(0)->re;
A discussion about implementation details and where I think bugs might lie appears in the accompanying README file. If this file is not available locally, you should be able to find a copy on the Web at your nearest CPAN mirror.
Please report all bugs at http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble|rt.cpan.org
Make sure you include the output from the following two commands:
perl -MRegexp::Assemble -le 'print Regexp::Assemble::VERSION'
perl -V
If you are feeling brave, extensive debugging traces are available.
Note that perl 5.6.0 has a bug that re-escapes backslashes that occur in qw// quoted word lists. You will come to grief if you try to build patterns statically in this way, should they contain metacharacters like \d
. If you read patterns in from a file you should have no problems. The real solution is to upgrade to at least 5.6.1, if not 5.6.2 or more recent. Or downgrade. Anything but 5.6.0.
ACKNOWLEDGEMENTS
This module grew out of work I did building access maps for Postfix, a modern SMTP mail transfer agent. See http://www.postfix.org/ for more information. I used Perl to build large regular expressions for blocking dynamic/residential IP addresses to cut down on spam and viruses. Once I had the code running for this, it was easy to start adding stuff to block really blatant spam subject lines, bogus HELO strings, spammer mailer-ids and more...
I presented the work at the French Perl Workshop in 2004, and the thing most people asked was whether the underlying mechanism for assembling the REs was available as a module. At that time it was nothing more that a twisty maze of scripts, all different. The interest shown indicated that a module was called for. I'd like to thank the people who showed interest. Hey, it's going to make my messy scripts smaller, in any case.
Thanks go to Jean Forget and Philippe Blayo who looked over an early version, H. Merijn Brandt, who stopped over in Paris one evening, and discussed things over a few beers. Nicholas Clark pointed out that while what this module does (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he encouraged to me continue with the development of this module. In any event, this module allows one to gauge the difficulty of undertaking the endeavour in C. I'd rather gouge my eyes out with a blunt pencil.
A tip 'o the hat to Paul Johnson who settled the question as to whether this module should live in the Regex:: namespace, or Regexp:: namespace. If you're not convinced, try running the following one-liner:
perl -le 'print ref qr//'
Thanks also to broquaint on Perlmonks, who answered a question pertaining to traversing an early version of the underlying data structure used by the module (Sadly, that code is no more). bart and ysth also provided a couple of tips pertaining to the Correct Use of overloading.
AUTHOR
David Landgren, david@landgren.net
Copyright (C) 2004
http://www.landgren.net/perl/
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.