NAME
Search::Tools::RegExp - build regular expressions from search queries
SYNOPSIS
    use Search::Tools::RegExp;

    my $regexp = Search::Tools::RegExp->new();
    my $kw     = $regexp->build('the quick brown fox');

    for my $w ($kw->keywords)
    {
        my $r = $kw->re( $w );

        # the word itself
        printf("the word is %s\n", $r->word);

        # is it flagged as a phrase?
        print "the word is a phrase\n" if $r->phrase;

        # each of these is a regular expression
        print $r->plain, "\n";
        print $r->html, "\n";
    }
DESCRIPTION
Build regular expressions for a string of text.
All text is converted to UTF-8 automatically, if it isn't already, via the Search::Tools::Keywords module.
VARIABLES
The following package variables are defined:
- UTF8Char
  Regexp defining a valid UTF-8 word character. Default is \w.
- WordChar
  Default word_characters regexp. Defaults to UTF8Char plus ', . and -.
- IgnFirst
  Default ignore_first_char regexp. Defaults to ' and -.
- IgnLast
  Default ignore_last_char regexp. Defaults to ', . and -.
- PhraseDelim
  Phrase delimiter character. Default is double-quote '"'.
- Wildcard
  Character to use as a wildcard. Default is asterisk '*'.
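To make the interplay of these defaults concrete, here is a minimal sketch of how a word-character class and the ignore-first/ignore-last classes might be combined when pulling words out of text. It does not use this module's internals: the lexical variables and the extract_words() helper are hypothetical, and the character sets are assumptions that simply mirror the defaults listed above.

```perl
use strict;
use warnings;

# Assumed values mirroring the documented defaults (not read from the module).
my $utf8_char = '\w';
my $word_char = $utf8_char . quotemeta(q{'.-});
my $ign_first = quotemeta(q{'-});
my $ign_last  = quotemeta(q{'.-});

# Hypothetical helper: find runs of word characters, then trim the
# characters that should not start or end a word.
sub extract_words {
    my ($text) = @_;
    my @words;
    for my $tok ( $text =~ m/([$word_char]+)/g ) {
        $tok =~ s/^[$ign_first]+//;    # strip leading ' and -
        $tok =~ s/[$ign_last]+$//;     # strip trailing ', . and -
        push @words, $tok if length $tok;
    }
    return @words;
}

print join( '|', extract_words(q{the fox's well-known 'den'.}) ), "\n";
# prints: the|fox's|well-known|den
```

Note how the apostrophe and hyphen survive inside a word but are dropped at its edges, which is the purpose of keeping the ignore classes separate from the word class.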
METHODS
new
Create new object. The following parameters are also accessors:
- kw
  A Search::Tools::Keywords object, if you want to pass in one instead of having one made for you.
- wildcard
  The wildcard character. Default is $Wildcard.
- word_characters
  Regexp for what characters constitute a 'word'. Default is $WordChar.
- ignore_first_char
  Default is $IgnFirst.
- ignore_last_char
  Default is $IgnLast.
- stemmer
  Stemming code ref passed through to the default Search::Tools::Keywords object.
- phrase_delim
  Phrase delimiter. Defaults to $PhraseDelim.
- stopwords
  Words to be ignored.
- debug
  Turn on helpful info on stderr.
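As a rough illustration of passing these parameters, the sketch below assumes that new() accepts them as key/value pairs, which is the usual Perl convention but is not spelled out above. The values shown are just the documented defaults, and the query string is an arbitrary example. The require is guarded so the script still runs where the module is not installed.

```perl
use strict;
use warnings;

# Documented parameter names, set to their documented default values.
my %args = (
    wildcard     => '*',
    phrase_delim => '"',
    debug        => 0,
);

# Only construct the object if the module can be loaded.
if ( eval { require Search::Tools::RegExp; 1 } ) {
    # Assumption: new() takes key/value pairs.
    my $regexp = Search::Tools::RegExp->new(%args);
    my $kw     = $regexp->build('"quick brown" fox*');
    print "$_\n" for $kw->keywords;
}
else {
    print "Search::Tools::RegExp is not installed\n";
}
```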
isHTML( str )
Returns true if str contains anything that looks like HTML markup: <, > or an entity matching &[#\w]+;
This is a naive check but useful for internal purposes.
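The check described above can be approximated with a single regular expression. The following is a sketch only: looks_like_html() is a hypothetical stand-in written for illustration, and the pattern actually used by isHTML() may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for isHTML(): true if the string contains
# <, > or something shaped like an entity (&name; or &#123;).
sub looks_like_html {
    my ($str) = @_;
    return $str =~ m/[<>]|&[#\w]+;/ ? 1 : 0;
}

print looks_like_html('<b>bold</b>'),      "\n";    # 1
print looks_like_html('fish &amp; chips'), "\n";    # 1
print looks_like_html('fish & chips'),     "\n";    # 0
print looks_like_html('3 > 2'),            "\n";    # 1
```

The last case shows why the check is naive: a bare > in ordinary text is enough to trigger it.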
build( str )
Returns a Search::Tools::RegExp::Keywords object.
BUGS and LIMITATIONS
The special HTML characters &, < and > can pose problems in regexps against markup, so they are ignored if you include them in word_characters in new().
AUTHOR
Peter Karman perl@peknet.com
Based on the HTML::HiLiter regular expression building code, originally by the same author, copyright 2004 by Cray Inc.
Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of this module.
COPYRIGHT
Copyright 2006 by Peter Karman. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
HTML::HiLiter, Search::Tools, Search::Tools::RegExp::Keywords, Search::Tools::RegExp::Keyword