NAME
Regexp::Assemble - Assemble multiple Regular Expressions into one RE
VERSION
This document describes version 0.10 of Regexp::Assemble, released 2005-03-29.
SYNOPSIS
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+\\d*\\s+c' );
$ra->add( 'a\\w+\\d+' );
$ra->add( 'a\\d+' );
print $ra->re; # prints (?:a(?:b+(?:\d*\s+)?c|(?:\w+)?\d+))
DESCRIPTION
Regexp::Assemble allows you to take a number of regular expressions and assemble them into a single regular expression (or RE) that will match everything that any of the individual REs match, only what they match and nothing else.
The assembled RE is more sophisticated than a brute force join( '|', @list)
concatenation. Common subexpressions are shared; alternations are introduced only when patterns diverge. As a result, backtracking is kept to a minimum. If a given path fails... there are no other paths to try and so the expression fails quickly. If no wildcards like .*
appear, no backtracking will be performed. In very large expressions, this can provide a large speed boost.
As a result, instead of having a large list of expressions to loop over, the string only needs to be tested against one expression. This is especially interesting when on average the expression fails to match most of the time.
This is useful when you have a large number of patterns that you want to apply against a string. It is also possible to track the orginal patterns, so that you can determine which, among the source patterns that form the assembled pattern, was the one that caused the match to occur.
You should realise that large numbers of alternations are processed in perl's regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie.
Some more examples of usage appear in the accompanying README.
METHODS
- new
-
Creates a new Regexp::Assemble object. A set of key/value parameters can be supplied to control the finer details of the object's behaviour.
flags, sets the flags
imsx
that should be applied to the resulting pattern. Warning: no error checking is done, you should ensure that the flags you pass are understood by the version of Perl in use.chomp, controls whether the pattern should be chomped before being lexed. Handy if you are reading lines from a file. By default, no chomping is performed.
track, controls whether you want know which of the initial patterns was the one that matched. See the
matched
method for more details. Note that in this mode of operation THERE ARE SECURITY IMPLICATIONS OF WHICH YOU YOU SHOULD BE AWARE.pre_filter, allows you to add a callback to enable sanity checks on the pattern being loaded. This callback is triggered before the pattern is split apart by the lexer. In other words, it operates on the entire pattern. If you are loading patterns from a file, this would be an appropriate place to remove comments.
filter, allows you to add a callback to enable sanity checks on the pattern being loaded. This callback is triggered after the pattern has been split apart by the lexer.
mutable, controls whether new patterns can be added to the object after the RE is generated.
reduce, controls whether tail reduction occurs or not. If set, patterns like
a(?:bc+d|ec+d)
will be reduced toa[be]c+d
. That is, the end of the pattern in each part of the b... and d... alternations is identical, and hence is hoisted out of the alternation and placed after it.lex, specifies the pattern used to lex the input lines into tokens. You could replace the default pattern by a more sophisticated version that matches arbitrarily nested parentheses, for example.
debug, controls whether copious amounts of output is produced during the loading stage or the reducing stage of assembly.
my $ra = Regexp::Assemble->new; my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );
A more detailed explanation of these attributes follows.
- clone
-
Clones the contents of a Regexp::Assemble object and creates a new object (in other words it performs a deep copy).
my $copy = $ra->clone();
If the Storable module is installed, its dclone method will be used, otherwise the cloning will be performed using a pure perl approach.
- add(LIST)
-
Takes a string, breaks it apart into a set of tokens (respecting meta characters) and inserts the resulting list into the
R::A
object. It uses a naive regular expression to lex the string that may be fooled complex expressions (specifically, it will fail to lex nested parenthetical expressions such asab(?cd(?:ef)?gh)
correctly). If this is the case, the end of the string will not be tokenised correctly and returned as one long string.On the one hand, this may indicate that the patterns you are trying to feed the
R::A
object are too complex. Simpler patterns might allow the algorithm to work more effectively and spot reductions in the resulting pattern.On the other hand, you can supply your own pattern to perform the lexing if you need, and if that is not enough, you can provide a code block for the ultimate in lexing flexibility.
A list of strings may be supplied, thus you can pass it a file handle of a file opened for reading:
$re->add( '\d+-\d+-\d+-\d+\.example\.com' ); $re->add( <IN> );
You probably need to set the
chomp
attribute on the object to get files to work correctly:my $re = Regexp::Assemble->new( chomp=>1 )->add( <IN> ); # or my $re = Regexp::Assemble->new; $re->chomp(1)->add( <IN> );
The pre_filter method provides allows you to filter input on a line-by-line basis.
This method is chainable.
- insert(LIST)
-
Takes a list of tokens representing a regular expression and stores them in the object. Note: you should not pass it a bare regular expression, such as
ab+c?d*e
. You must pass it as a list of tokens, e.g.('a', 'b+', 'c?', 'd*', 'e')
.This method is chainable, e.g.:
my $ra = Regexp::Assemble->new ->insert( qw[ a b+ c? d* e ] ) ->insert( qw[ a c+ d+ e* f ] );
The
add
method callsinsert
internally.The
filter
method allows you to accept or reject the addition of the entire pattern on a token-by-token basis. - as_string
-
Assemble the expression and return it as a string. You may want to do this if you are writing the pattern to a file. The following arguments can be passed to control the aspect of the resulting pattern:
indent, the number of spaces used to indent nested grouping of a pattern. Use this to produce a pretty-printed pattern (for some definition of "pretty"). The resulting output is rather verbose. The reason is to ensure that the metacharacters
(?:
and)
always occur on otherwise empty lines. This allows you grep the result for an even more synthetic view of the pattern:egrep -v '^ *[()]' <regexp.file>
The result of the above is quite readable. Remember to backslash the spaces appearing in your own patterns if you wish to use an indented pattern in an
m/.../x
construct. Indenting is ignored if tracking is enabled.If you have not set the mutable attribute on the object, calling this method will drain the internal data structure. Large numbers of patterns can eat a significant amount of memory, and this lets perl recover the memory used. Mutable means that you want to keep the internal data structure lying around in order to add additional patterns to the object after an assembled pattern has been produced.
- re
-
Assembles the pattern and return it as a compiled RE, using the
qr//
operator.As with
as_string
, calling this method will reset the internal data structures to free the memory used in assembling the RE. This behaviour can be controlled with themutable
attribute.The indent attribute, documented in the
as_string
method, can be used here (it will be ignored if tracking is enabled).With method chaining, it is possible to produce a RE without having a temporary
Regexp::Assemble
object lying around, e.g.:my $re = Regexp::Assemble->new ->add( q[ab+cd+e] ) ->add( q[ac\\d+e] ) ->add( q[c\\d+e] ) ->re;
The
$re
variable now contains a Regexp object that can be used directly:while( <> ) { /$re/ and print "Something in [$_] matched\n"; )
The
re
method is called when the object is used in string context (hence, withing anm//
operator), so by and large you do not even need to save the RE in a separate variable. The following will work as expected:my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] ); while( <IN> ) { if( /($re)/ ) { print "Here be giants: $1\n"; } }
This approach does not work with tracked patterns. The
match
andmatched
methods must be used instead, see below. - match(SCALAR)
-
If pattern tracking is in use, you must
use re 'eval'
in order to make things work correctly. At a minimum, this will make your code look like this:my $did_match = do { use re 'eval'; $target =~ /$ra/ } if( $did_match ) { print "matched ", $ra->matched, "\n"; }
(The main reason is that the
$^R
variable is currently broken and an ugly workaround is required. See Perl bug #32840 for more information if you are curious. The README also contains more information).The important thing to note is that with
use re 'eval'
, THERE ARE SECURITY IMPLICATIONS WHICH YOU IGNORE AT YOUR PERIL. The problem is this: if you do not have strict control over the patterns being fed toRegexp::Assemble
when tracking is enabled, and someone slips you a pattern such as/^(?{system 'rm -rf /'})/
and you attempt to match a string against the resulting pattern, you will know Fear and Loathing.What is more, the
$^R
workaround means that that tracking does not work if you perform a bare/$re/
pattern match as shown above. You have to instead call thematch
method, in order to supply the necessary context to take care of the tracking housekeeping details.if( defined( my $match = $ra->match($_)) ) { print " $_ matched by $match\n"; }
In the case of a successul match, the original matched pattern is returned directly. The matched pattern will also be available through the
matched
method.(Except that the above is not true for 5.6.0: the
match
method returns true or undef, and thematched
method always returns undef).If you are capturing parts of the pattern e.g.
foo(bar)rat
you will want to get at the captures. See thembegin
,mend
andmvar
methods. If you are not using captures then you may safely ignore this section. - mbegin
-
This method returns a copy of
@-
at the moment of the last match. You should ordinarily not need to bother with this,mvar
should be able to supply all your needs. - mend
-
This method returns a copy of
@+
at the moment of the last match. - mvar(NUMBER)
-
The
mvar
method returns the captures of the last match.mvar(1)
corresponds to $1,mvar(2)
to $2, and so on.mvar(0)
happens to return the target string matched, as a byproduct of walking down the@-
and@+
arrays after the match.If called without a parameter,
mvar
will return a reference to an array containing all captures. - matched
-
If pattern tracking has been set, via the
track
attribute, or through thetrack
method, this method will return the original pattern of the last successful match. Returns undef match has yet been performed, or tracking has not been enabled.See below in the NOTES section for additional subtleties of which you should be aware of when tracking patterns.
Note that this method is not available in 5.6.0, due to limitations in the implementation of
(?{...})
at the time. - debug(NUMBER)
-
Turns debugging on or off. Statements are printed to the currently selected file handle (STDOUT by default). If you are already using this handle, you will have to arrange to select an output handle to a file of your own choosing, before call the
add
,as_string
orre
) functions, otherwise it will scribble all over your carefully formatted output.0 turns debugging off, 1 emits debug traces dealing with adding new patterns to the object, 2 emits debug traces dealing with the reduction of the assembled pattern.
Values can be added (or or'ed together) to trace everything
$r->debug(1)->add( '\\d+abc' );
- chomp(0|1)
-
Turns chomping on or off. When loading an object from a file the lines will contain their line separator token. This may produce undesired results. In this case, call chomp with a value of 1 to enable autochomping. Chomping is off by default.
$re->chomp( 1 ); $re->add( <DATA> );
- flags(STRING)
-
Sets the flags that govern how the pattern behaves (for versions of Perl up to 5.9 or so, these are
imsx
). By default no flags are enabled. - track(0|1)
-
Turns tracking on or off. When this attribute is enabled, additional housekeeping information is inserted into the assembled expression using
({...}
embedded code constructs. This provides the necessary information to determine which, of the original patterns added, was the one that caused the match.$re->track( 1 ); if( $target =~ /$re/ ) { print "$target matched by ", $re->matched, "\n"; }
Note that when this functionality is enabled, no reduction is performed and no character classes are generated. In other words,
brag|tag
is not reduced down to(?:br|t)ag
anddig|dim
is not reduced todi[gm]
. - pre_filter(CODE)
-
Allows you to install a callback to check that the pattern being loaded contains valid input. It receives the pattern as a whole to be added, before it been tokenised by the lexer. It may to return 0 or undef to indicate that the pattern should not be added, any true value indicates that the contents are fine.
TODO: how to remove comments
If you want to remove the filter, pass
undef
as a parameter.$ra->pre_filter(undef);
This method is chainable.
- filter(CODE)
-
Allows you to install a callback to check that the pattern being loaded contains valid input. It receives a list on input, after it has been tokenised by the lexer. It may to return 0 or undef to indicate that the pattern should not be added, any true value indicates that the contents are fine.
If you know that all patterns you expect to assemble contain a restricted set of of tokens (e.g. no spaces), you could do the following:
$ra->filter(sub { not grep { / / } @_ });
or
sub only_spaces_and_digits { not grep { ![\d ] } @_ } $ra->filter( \&only_spaces_and_digits );
These two examples will silently ignore faulty patterns, If you want the user to be made aware of the problem you should raise an error (via
warn
ordie
), log an error message, whatever is best. If you want to remove a filter, passundef
as a parameter.$ra->filter(undef);
This method is chainable.
- lex(SCALAR)
-
Change the pattern used to break a string apart into tokens. You can study the
eg/naive
script as a starting point. - reduce(0|1)
-
Turns pattern reduction on or off. A reduced pattern may be considerably shorter than an unreduced pattern. Consider
/sl(?:ip|op|ap)/
versus/sl[aio]p/
. An unreduced pattern will be very similar to those produced byRegexp::Optimizer
. Reduction is on by default. Turning it off makes assembly much faster. - mutable(0|1)
-
When the
re
oras_string
methods are called the reduction algoritm kicks in and takes the current data structure and fold the common portions of the patterns that have been stored in the object. Once this occurs, it is no longer possible to add any more patterns.In fact, the internal structures are release to free up memory. If you have a programa that adds additional patterns to an object over a long period, you can set the mutable attribute. This will stop the internal structure from being drained and you can continue to add patterns.
The main consequence is that it the assembled pattern will not undergo any reduction (as the internal data structure undergoes such a transformation as that it becomes very difficult to cope with the change that adding a new pattern would bring about. If this is a problem, the solution is to create a non-mutable object, continue adding to it as long as needed, and each time a new assembled regular expression is required, clone the object, turn the mutable attribute off and proceed as usual.
By default the mutable attribute defaults to zero. The method can be chained.
$r->add( 'abcdef\\d+' )->mutable(0);
- reset
-
Resets the internal state of the object, as if no
add
orinsert
methods had been called. Does not modify the state of controller attributes such asdebug
,lex
,reduce
and the like. - Default_Lexer
-
Warning: the
Default_Lexer
function is a class method, not an object method. Do not call it on an object, bad things will happen.Regexp::Assemble::Default_Lexer( '.|\\d+' );
The
Default_Lexer
method lets you replace the default pattern used for all subsequently created Regexp::Assemble objects. It will not have any effect on existing objects. (It is also possible to override the lexer pattern used on a per-object basis).The parameter should be an ordinary scalar, not a compiled pattern. The pattern should be capable of matching all parts of the string, i.e., the following constraint should hold true:
$_ = 'this will be matched by a pattern'; my $pattern = qr/./; my @token = m/($pattern)/g; $_ eq join( '' => @token ) or die;
If no parameter is supplied, the current default pattern in use will be returned.
DIAGNOSTICS
"don't pass a refname
to Default_Lexer"
You tried to replace the default lexer pattern with an object instead of a scalar. Solution: You probably tried to call $obj->Default_Lexer. Call qualified class method instead Regexp::Assemble::Default_Lexer
.
"filter method not passed a coderef"
A reference to a subroutine (anonymous or otherwise) was expected. Solution: read the documentation for the filter
method.
"lexed a zero-length token, an infinite loop would ensue"
The add
method eats the input string one token at a time. If you supply your own lexer pattern, it may match the null character, in which case the loop that processes the string will loop an infinite number of times. Solution: Fix your pattern.
"was not passed an ARRAY ref"
A coding error. Can only occur when debugging is active. Solution: turn off debugging (it's off by default).
"was not passed a HASH ref"
A coding error. Can only occur when debugging is active. Solution: turn off debugging (it's off by default).
NOTES
This module has been tested successfully with a range of versions of perl, from 5.005_03 to 5.8.6. Use of 5.6.0 is not recommended.
The expressions produced by this module can be used with the PCRE library.
Where possible, feed R::A the simplest tokens possible. Don't add a(?-\d+){2})b
when a-\d+-\d+b
. The reason is that if you also add a\d+c
the resulting REs change dramatically: a(?:(?:-\d+){2}b|-\d+c)
versus a-\d+(?:-\d+b|c)
. Since R::A doesn't analyse tokens, it doesn't know how to "unroll" the {2}
quantifier, and will fail to notice the divergence after the first -d\d+
.
What is more, when the string 'a-123000z' is matched against the first pattern, the regexp engine will have to backtrack over each alternation before determining that there is no match. No such backtracking occurs in the second pattern: The engine scans up to the 'z' and then fails immediately, since neither of the alternations start with 'z'.
Regexp::Assemble does, however, understand character classes. Given a-b
, axb
and a\db
, it will assemble these into a[-\dx]b
. When - appears as a candidate for a character class it will be the first character in the class. When ^ appears as a candidate for a character class it will be the last character in the class.
It also knows about meta-characters than can "absorb" regular characters. For instance, given a\d
and a5
, it knows that <5> can be represented by \d
and so the assembly is just a\d
. The "absorbent" meta-characters it deals with are .
, \d
, \s
and \W
and their complements. It also knows that \d
and \D
can be replaced by .
.
Regexp::Assemble will also replace all the digits 0..9 appearing in a character class by \d
. I'd do it for letters as well, but thinking about accented characters and other glyphs hurts my head.
When tracking is in use, no reduction is performed. Furthermore, no character classes are formed. The reason is that it becomes just too difficult to determine the original pattern. Consider the the two patterns pale
and palm
. These would be reduced to (?:pal[em]
. The final character matches one of two possibilities. To resolve whether it matched an 'e'
or 'm'
would require a whole lot more housekeeping. Without character classes it becomes much easier.
Similarly, dogfood
and seafood
would form (?:dog|sea)food
. When the pattern is being assembled, the tracking decision needs to be made at the end of the grouping, but the tail of the pattern has not yet been visited. Deferring things to make this work correctly is a vast hassle.
Hence tracked patterns will hence be bulky, much more bulkier than simple patterns.
SEE ALSO
- perlre
-
General information about Perl's regular expressions.
- re
-
Specific information about
use re 'eval'
. - Regex::PreSuf
-
Regex::PreSuf
takes a string and chops it itself into tokens of length 1. Since it can't deal with tokens of more than one character, it can't deal with meta-characters and thus no regular expressions. Which is the main reason why I wrote this module. - Regexp::Optimizer
-
Regexp::Optimizer
produces regular expressions that are similar to those produced by R::A with reductions switched off. - Text::Trie
-
Text::Trie
is well worth investigating. Tries can outperform very bushy (read: many alternations) patterns.
LIMITATIONS
Regexp::Assemble
does not attempt to find common substrings. For instance, it will not collapse /aabababc/
down to /a(?:ab}{3}c/
. If there's a module out there that performs this sort of string analysis I'd like to know about it. But keep in mind that the algorithms that do this are very expensive: quadratic or worse.
Regexp::Assemble
does not emit lookahead or lookbehind assertions in the expressions it produces. Nor is the wonderful and frightening (?
...)> construct used. These and other optimisations may be introduced in further releases if there is a demand for them, if I figure out how to use them in this module, or if someone sends patches. There is an example in the README that shows a technique for contructing patterns with lookbehind assertions, using a two-stage approach.
Regexp::Assemble
does not attempt to interpret meta-character modifiers. For instance, if the following two pattern lists are given: a\d
and a\d+
, it will not determine that \d
can be matched by \d+
. Instead, it will produce a(?:\d|\d+)
. Along a similar line of reasoning, it will not determine that a
and a\d+
is equivalent to a\d*
(It will produce a(?:\d+)?
instead).
You can't remove a pattern that has been added to an object. You'll just have to start over again. Adding a pattern is difficult enough, I'd need a solid argument to convince me to add a remove
method. If you need to do this you should read the documentation on the mutable
and clone
methods.
Tracking doesn't really work at all with 5.6.0. It works better in subsequent 5.6 releases. For maximum reliability, the use of a 5.8 release is strongly recommended.
The module does not produce POSIX-style regular expressions.
BUGS
The algorithm used to assemble the regular expressions makes extensive use of mutually-recursive functions (i.e.: A calls B, B calls A, ...) For deeply similar expressions, it may be possible to provoke "Deep recursion" warnings.
When a string is tested against an assembled RE, it may match when it shouldn't, or else not match when it should. The module has been tested extensively, and has an extensive test suite, but you never know... In either case, it is a bug, and I want to know about it.
A temporary work-around is to disable reductions:
my $pattern = $assembler->reduce(0)->re;
A discussion about implementation details and where bugs might lurk appears in the README file. If this file is not available locally, you should be able to find a copy on the Web at your nearest CPAN mirror.
If you are feeling brave, extensive debugging traces are available.
Please report all bugs at http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble|rt.cpan.org
Make sure you include the output from the following two commands:
perl -MRegexp::Assemble -le 'print Regexp::Assemble::VERSION'
perl -V
ACKNOWLEDGEMENTS
This module grew out of work I did building access maps for Postfix, a modern SMTP mail transfer agent. See http://www.postfix.org/ for more information. I used Perl to build large regular expressions for blocking dynamic/residential IP addresses to cut down on spam and viruses. Once I had the code running for this, it was easy to start adding stuff to block really blatant spam subject lines, bogus HELO strings, spammer mailer-ids and more...
I presented the work at the French Perl Workshop in 2004, and the thing most people asked was whether the underlying mechanism for assembling the REs was available as a module. At that time it was nothing more that a twisty maze of scripts, all different. The interest shown indicated that a module was called for. I'd like to thank the people who showed interest. Hey, it's going to make my messy scripts smaller, in any case.
Thomas Drugeon has been a valuable sounding board for trying out new ideas. Jean Forget and Philippe Blayo looked over an early version. H. Merijn Brandt stopped over in Paris one evening, and discussed things over a few beers.
Nicholas Clark pointed out that while what this module does (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he encouraged to me continue with the development of this module. In any event, this module allows one to gauge the difficulty of undertaking the endeavour in C. I'd rather gouge my eyes out with a blunt pencil.
Paul Johnson settled the question as to whether this module should live in the Regex:: namespace, or Regexp:: namespace. If you're not convinced, try running the following one-liner:
perl -le 'print ref qr//'
Thanks also to broquaint on Perlmonks, who answered a question pertaining to traversing an early version of the underlying data structure used by the module (Sadly, that code is no more). bart and ysth also provided a couple of tips pertaining to the Correct Use of overloading.
AUTHOR
David Landgren, david@landgren.net
Copyright (C) 2004-2005. All rights reserved.
http://www.landgren.net/perl/
If you use this module, I'd love to hear about what you're using it for. If you want to be informed of updates, send me a note.
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.