NAME

CharClass::Matcher -- Generate C macros that match character classes efficiently

SYNOPSIS

perl regen/regcharclass.pl

DESCRIPTION

Dynamically generates macros for detecting special charclasses in latin-1, utf8, and codepoint forms. Macros can be set to return the length (in bytes) of the matched codepoint, and/or the codepoint itself.

To regenerate regcharclass.h, run this script from perl-root. No arguments are necessary.

Using WHATEVER as an example, the following macros can be produced, depending on the input parameters (how to get each is described by internal comments at the __DATA__ line):

is_WHATEVER(s,is_utf8)
is_WHATEVER_safe(s,e,is_utf8)

Do a lookup as appropriate based on the is_utf8 flag. When possible, comparisons involving octet<128 are done before checking the is_utf8 flag, hopefully saving time.

The version without the _safe suffix should be used only when the input is known to be well-formed.
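As a rough illustration of the flag-based form, the following hand-written sketch covers a hypothetical class containing only U+0020 and U+00A0. The name is_EXAMPLE_SP and the U8 typedef are assumptions made for this example; this is not output of the script, and the real generated macros may be shaped differently. It returns the matched length in bytes (or 0), and tests the ASCII octet before consulting is_utf8.

```c
typedef unsigned char U8;   /* stand-in for perl's U8 in this sketch */

/* Hypothetical class { U+0020, U+00A0 }; hand-written, not generated.
 * Returns the length in bytes of the matched character, or 0.
 * The octet < 128 case is tested first, so is_utf8 is only consulted
 * when the first octet is not an ASCII match.
 * Non-_safe form: assumes the input is well-formed. */
#define is_EXAMPLE_SP(s,is_utf8)                                   \
    ( ((const U8*)(s))[0] == 0x20 ? 1                              \
    : (is_utf8)                                                    \
      ? ( ( ((const U8*)(s))[0] == 0xC2 &&                         \
            ((const U8*)(s))[1] == 0xA0 ) ? 2 : 0 )                \
      : ( ((const U8*)(s))[0] == 0xA0 ? 1 : 0 ) )
```

With this sketch, a buffer holding the two octets C2 A0 matches with length 2 when is_utf8 is true, and does not match when it is false.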

is_WHATEVER_utf8(s)
is_WHATEVER_utf8_safe(s,e)

Do a lookup assuming the string is encoded in (normalized) UTF8.

The version without the _safe suffix should be used only when the input is known to be well-formed.
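The difference between the plain and _safe UTF-8 forms can be sketched as follows, again for a hypothetical class of U+0020 and U+00A0. The macro names and the U8 typedef are invented for illustration, and the bounds checks shown are only one plausible way to honour the end pointer e; the generated code may differ.

```c
typedef unsigned char U8;   /* stand-in for perl's U8 in this sketch */

/* Hypothetical class { U+0020, U+00A0 } in UTF-8; hand-written sketch.
 * Plain form: assumes s points at a complete, well-formed character. */
#define is_EXAMPLE_SP_utf8(s)                                      \
    ( ((const U8*)(s))[0] == 0x20 ? 1                              \
    : ( ((const U8*)(s))[0] == 0xC2 &&                             \
        ((const U8*)(s))[1] == 0xA0 ) ? 2 : 0 )

/* _safe form: e points one past the last readable byte, and no byte
 * at or beyond e is examined. */
#define is_EXAMPLE_SP_utf8_safe(s,e)                               \
    ( ((e) - (s)) < 1 ? 0                                          \
    : ((const U8*)(s))[0] == 0x20 ? 1                              \
    : ( ((e) - (s)) >= 2 &&                                        \
        ((const U8*)(s))[0] == 0xC2 &&                             \
        ((const U8*)(s))[1] == 0xA0 ) ? 2 : 0 )
```

In this sketch the _safe form returns 0 for a buffer that ends after the C2 octet, where the plain form would read one byte past the end.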

is_WHATEVER_latin1(s)
is_WHATEVER_latin1_safe(s,e)

Do a lookup assuming the string is encoded in latin-1 (aka plain octets).

The version without the _safe suffix should be used only when it is known that s contains at least one character.

is_WHATEVER_cp(cp)

Check whether the given codepoint (hypothetically a U32) matches. The condition is constructed so as to "break out" as early as possible if the codepoint is out of range of the condition.

In other words:

(cp==X || (cp>X && (cp==Y || (cp>Y && ...))))

Thus if the character is X+1 only two comparisons will be done. This makes matching lookups slower, but non-matching lookups faster.
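To make the nested shape concrete, here is a hand-written sketch for a hypothetical class containing only the codepoints 0x0A, 0x0D and 0x85. The name is_EXAMPLE_NL_cp is invented for this example and is not taken from regcharclass.h; the macro the script actually generates for any given class may differ.

```c
/* Hypothetical class { 0x0A, 0x0D, 0x85 }, written by hand to mirror
 * the pattern (cp==X || (cp>X && (cp==Y || (cp>Y && ...)))).
 * A codepoint below 0x0A fails both cp==0x0A and cp>0x0A, so the
 * remaining tests are never evaluated. */
#define is_EXAMPLE_NL_cp(cp)                                       \
    ( (cp) == 0x0A || ( (cp) > 0x0A &&                             \
    ( (cp) == 0x0D || ( (cp) > 0x0D &&                             \
      (cp) == 0x85 ) ) ) )
```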

what_len_WHATEVER_FOO(arg1, ..., len)

A variant form of each of the macro types described above can be generated, in which the code point is returned by the macro, and an extra parameter (in the final position) is added, which is a pointer for the macro to set the byte length of the returned code point.

These forms all have a what_len prefix instead of the is_, for example what_len_WHATEVER_safe(s,e,is_utf8,len) and what_len_WHATEVER_utf8(s,len).

These forms should not be used except on small sets of mostly widely separated code points; otherwise the code generated is inefficient. For these cases, it is best to use the is_ forms, and then find the code point with utf8_to_uvchr_buf(). This program can fail with a "deep recursion" message on the worst of the inappropriate sets. Examine the generated macro to see if it is acceptable.
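A minimal sketch of the what_len calling convention, for a hypothetical UTF-8 class of U+0020 and U+00A0, might look like the following. The macro name, the U8 typedef, and the choice to store 0 through len on a failed match are all assumptions of this example rather than documented behaviour of the generated macros.

```c
#include <stddef.h>

typedef unsigned char U8;   /* stand-in for perl's U8 in this sketch */

/* Hypothetical class { U+0020, U+00A0 } in UTF-8; hand-written sketch.
 * Evaluates to the matched code point and stores its byte length
 * through the pointer len.  On no match it evaluates to 0 and stores 0
 * (the stored 0 is an assumption of this sketch). */
#define what_len_EXAMPLE_SP_utf8(s,len)                            \
    ( ((const U8*)(s))[0] == 0x20 ? (*(len) = 1, 0x20)             \
    : ( ((const U8*)(s))[0] == 0xC2 &&                             \
        ((const U8*)(s))[1] == 0xA0 ) ? (*(len) = 2, 0xA0)         \
    : (*(len) = 0, 0) )
```

A caller would pass the address of a length variable, for example what_len_EXAMPLE_SP_utf8(s, &len), and then advance s by len bytes after a match.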

what_WHATEVER_FOO(arg1, ...)

A variant form of each of the is_ macro types described above can be generated, in which the code point and not the length is returned by the macro. These have the same caveat as "what_len_WHATEVER_FOO(arg1, ..., len)", plus they should not be used where the set contains a NUL (code point 0), as 0 is returned for two different cases: a) the set doesn't include the input code point; b) the set does include it, and it is a NUL.
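The ambiguity with code point 0 can be seen in a small hand-written sketch. The class (U+0000 and U+000A, in latin-1), the macro name, and the U8 typedef are all hypothetical and chosen only to show the problem.

```c
typedef unsigned char U8;   /* stand-in for perl's U8 in this sketch */

/* Hypothetical class { U+0000, U+000A } in latin-1; hand-written sketch.
 * Evaluates to the matched code point, or 0 when there is no match,
 * so a matched code point 0 cannot be told apart from a failure. */
#define what_EXAMPLE_NUL_NL_latin1(s)                              \
    ( ((const U8*)(s))[0] == 0x00 ? 0x00                           \
    : ((const U8*)(s))[0] == 0x0A ? 0x0A                           \
    : 0 )
```

Here an input octet of 0x00 (in the set) and an input octet of 0x61 (not in the set) both yield 0, which is why an is_ form is the better choice for such a set.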

The above isn't quite complete, as for specialized purposes one can get a macro like is_WHATEVER_utf8_no_length_checks(s), which assumes that it is already known that there is enough space to hold the character starting at s, but otherwise checks that it is well-formed. In other words, the amount of checking it does is intermediate between that of is_WHATEVER_utf8(s) and is_WHATEVER_utf8_safe(s,e).

CODE FORMAT

perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"

AUTHOR

Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.

BUGS

No tests directly here (although the regex engine will fail tests if this code is broken). Insufficient documentation and no Getopts handler for using the module as a script.

LICENSE

You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the README file.