NAME
Devel::Tokenizer::C - Generate C source for fast keyword tokenizer
SYNOPSIS
use Devel::Tokenizer::C;
my $t = Devel::Tokenizer::C->new(TokenFunc => sub { "return \U$_[0];\n" });
$t->add_tokens(qw( bar baz ))->add_tokens(['for']);
$t->add_tokens([qw( foo )], 'defined DIRECTIVE');
print $t->generate;
DESCRIPTION
The Devel::Tokenizer::C module provides a small class for creating the essential ANSI C source code for a fast keyword tokenizer.
The generated code is optimized for speed. On the ANSI C keyword set, it's 2-3 times faster than equivalent code generated with the gperf utility.
The above example would print the following C source code:
switch (tokstr[0])
{
  case 'b':
    switch (tokstr[1])
    {
      case 'a':
        switch (tokstr[2])
        {
          case 'r':
            if (tokstr[3] == '\0')
            { /* bar */
              return BAR;
            }
            goto unknown;
          case 'z':
            if (tokstr[3] == '\0')
            { /* baz */
              return BAZ;
            }
            goto unknown;
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
  case 'f':
    switch (tokstr[1])
    {
      case 'o':
        switch (tokstr[2])
        {
#if defined DIRECTIVE
          case 'o':
            if (tokstr[3] == '\0')
            { /* foo */
              return FOO;
            }
            goto unknown;
#endif /* defined DIRECTIVE */
          case 'r':
            if (tokstr[3] == '\0')
            { /* for */
              return FOR;
            }
            goto unknown;
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
  default:
    goto unknown;
}
So the generated code only includes the main switch statement for the tokenizer. You can configure most of the generated code to fit your application.
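Since only the switch statement is generated, the surrounding function, the token constants and the unknown label have to be supplied by the application. The following is a minimal sketch of how the output of the example above might be embedded. The enum, the TOK_UNKNOWN constant, the DIRECTIVE definition and the function name classify are assumptions made for illustration; they are not provided by the module.

```c
/* Hypothetical token codes. The generated code only requires that the
 * identifiers emitted by TokenFunc (BAR, BAZ, FOO, FOR) exist. */
enum token { TOK_UNKNOWN = 0, BAR, BAZ, FOO, FOR };

/* Enables the branch guarded by '#if defined DIRECTIVE'. */
#define DIRECTIVE 1

/* classify() is an assumed wrapper name. The parameter must be called
 * tokstr (the default TokenString) and must be null-terminated (the
 * default TokenEnd). */
enum token classify(const char *tokstr)
{
  /* begin generated code */
  switch (tokstr[0])
  {
    case 'b':
      switch (tokstr[1])
      {
        case 'a':
          switch (tokstr[2])
          {
            case 'r':
              if (tokstr[3] == '\0')
              { /* bar */
                return BAR;
              }
              goto unknown;
            case 'z':
              if (tokstr[3] == '\0')
              { /* baz */
                return BAZ;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    case 'f':
      switch (tokstr[1])
      {
        case 'o':
          switch (tokstr[2])
          {
#if defined DIRECTIVE
            case 'o':
              if (tokstr[3] == '\0')
              { /* foo */
                return FOO;
              }
              goto unknown;
#endif /* defined DIRECTIVE */
            case 'r':
              if (tokstr[3] == '\0')
              { /* for */
                return FOR;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }
  /* end generated code */

unknown: /* default UnknownLabel: reached whenever no keyword matches */
  return TOK_UNKNOWN;
}
```

With this wrapper, classify("for") yields FOR, while a prefix such as "fo" or an overlong string such as "bart" ends up at the unknown label.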
CONFIGURATION
CaseSensitive => 0 | 1
Boolean defining whether the generated tokenizer should be case-sensitive. This only affects the letters A-Z. The default is 1, so the generated tokenizer is case-sensitive.
Indent => STRING
String to be used for one level of indentation. The default is two space characters.
MergeSwitches => 0 | 1
Boolean defining whether nested switch statements containing only a single case should be merged into a single if statement. This is usually only done at the end of a branch. With MergeSwitches, merging will also be done in the middle of a branch. E.g. the code
$t = Devel::Tokenizer::C->new(
  TokenFunc     => sub { "return \U$_[0];\n" },
  MergeSwitches => 1,
);
$t->add_tokens(qw( carport carpet muppet ));
print $t->generate;
would output this switch statement:
switch (tokstr[0])
{
  case 'c':
    if (tokstr[1] == 'a' &&
        tokstr[2] == 'r' &&
        tokstr[3] == 'p')
    {
      switch (tokstr[4])
      {
        case 'e':
          if (tokstr[5] == 't' &&
              tokstr[6] == '\0')
          { /* carpet */
            return CARPET;
          }
          goto unknown;
        case 'o':
          if (tokstr[5] == 'r' &&
              tokstr[6] == 't' &&
              tokstr[7] == '\0')
          { /* carport */
            return CARPORT;
          }
          goto unknown;
        default:
          goto unknown;
      }
    }
    goto unknown;
  case 'm':
    if (tokstr[1] == 'u' &&
        tokstr[2] == 'p' &&
        tokstr[3] == 'p' &&
        tokstr[4] == 'e' &&
        tokstr[5] == 't' &&
        tokstr[6] == '\0')
    { /* muppet */
      return MUPPET;
    }
    goto unknown;
  default:
    goto unknown;
}
Strategy => 'ordered' | 'narrow' | 'wide'
The strategy to be used for sorting character positions. ordered will leave the characters in their normal order. narrow will sort the character positions so that the positions with the least character variation are checked first. wide will do exactly the opposite. (If you're confused now, just try it. ;-)
The default is ordered. You can only use narrow and wide together with StringLength.
The code
$t = Devel::Tokenizer::C->new(
  TokenFunc    => sub { "return \U$_[0];\n" },
  StringLength => 'len',
  Strategy     => 'ordered',
);
$t->add_tokens(qw( mhj xho mhx ));
print $t->generate;
would output this switch statement:
switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[0])
    {
      case 'm':
        switch (tokstr[1])
        {
          case 'h':
            switch (tokstr[2])
            {
              case 'j':
                { /* mhj */
                  return MHJ;
                }
                goto unknown;
              case 'x':
                { /* mhx */
                  return MHX;
                }
                goto unknown;
              default:
                goto unknown;
            }
          default:
            goto unknown;
        }
      case 'x':
        if (tokstr[1] == 'h' &&
            tokstr[2] == 'o')
        { /* xho */
          return XHO;
        }
        goto unknown;
      default:
        goto unknown;
    }
  default:
    goto unknown;
}
Using the narrow strategy, the switch statement would be:
switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[1])
    {
      case 'h':
        switch (tokstr[0])
        {
          case 'm':
            switch (tokstr[2])
            {
              case 'j':
                { /* mhj */
                  return MHJ;
                }
                goto unknown;
              case 'x':
                { /* mhx */
                  return MHX;
                }
                goto unknown;
              default:
                goto unknown;
            }
          case 'x':
            if (tokstr[2] == 'o')
            { /* xho */
              return XHO;
            }
            goto unknown;
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
  default:
    goto unknown;
}
Using the wide strategy, the switch statement would be:
switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[2])
    {
      case 'j':
        if (tokstr[0] == 'm' &&
            tokstr[1] == 'h')
        { /* mhj */
          return MHJ;
        }
        goto unknown;
      case 'o':
        if (tokstr[0] == 'x' &&
            tokstr[1] == 'h')
        { /* xho */
          return XHO;
        }
        goto unknown;
      case 'x':
        if (tokstr[0] == 'm' &&
            tokstr[1] == 'h')
        { /* mhx */
          return MHX;
        }
        goto unknown;
      default:
        goto unknown;
    }
  default:
    goto unknown;
}
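The order in which the three outputs above test the character positions can be reproduced by counting how many different characters occur at each position. For mhj, xho and mhx, position 0 has two variants (m, x), position 1 has one (h) and position 2 has three (j, o, x). The sketch below only illustrates that counting idea. It is not the module's implementation, and the function names distinct_at and order_positions are made up for this example.

```c
#include <stddef.h>

/* Count how many different characters occur at index pos across n tokens.
 * Every token must be longer than pos. */
size_t distinct_at(const char *const *tokens, size_t n, size_t pos)
{
  unsigned char seen[256] = {0};
  size_t count = 0;

  for (size_t i = 0; i < n; i++) {
    unsigned char c = (unsigned char) tokens[i][pos];
    if (!seen[c]) {
      seen[c] = 1;
      count++;
    }
  }
  return count;
}

/* Fill order[0..len-1] with the positions 0..len-1, sorted by variation:
 * fewest variants first if wide == 0 ("narrow"), most variants first
 * otherwise ("wide"). A stable insertion sort keeps ties in their
 * original order. All n tokens must have exactly len characters. */
void order_positions(const char *const *tokens, size_t n, size_t len,
                     size_t *order, int wide)
{
  for (size_t p = 0; p < len; p++) {
    size_t key = distinct_at(tokens, n, p);
    size_t j = p;

    /* Shift earlier entries that have to come after position p. */
    while (j > 0) {
      size_t prev = distinct_at(tokens, n, order[j - 1]);
      int prev_goes_after = wide ? (prev < key) : (prev > key);
      if (!prev_goes_after)
        break;
      order[j] = order[j - 1];
      j--;
    }
    order[j] = p;
  }
}
```

For the three example tokens this gives the order 1, 0, 2 for narrow and 2, 0, 1 for wide, which matches the order of the tokstr indices in the switch statements shown above.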
StringLength => STRING
Identifier of the C variable that contains the length of the string, when available. If the string length is known, switching can be done more efficiently. That doesn't mean that it pays off to compute the string length first just for the tokenizer. If you don't already know the string length, just don't use this option. By default, the option is not set.
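When the length is passed in, the generated code shown for the ordered strategy never looks at a terminating character, so the token can be a slice of a larger buffer. The following sketch wraps that output in a function. The enum, the LEN_UNKNOWN constant and the function name classify_n are assumptions for illustration; only the parameter names tokstr and len are dictated by the configuration used in the example.

```c
#include <stddef.h>

/* Hypothetical token codes; only the names MHJ, XHO and MHX come from
 * the generated code. */
enum len_token { LEN_UNKNOWN = 0, MHJ, XHO, MHX };

/* classify_n() is an assumed wrapper name. 'len' corresponds to
 * StringLength => 'len'; tokstr does not need to be null-terminated. */
enum len_token classify_n(const char *tokstr, size_t len)
{
  /* begin generated code (Strategy => 'ordered') */
  switch (len)
  {
    case 3: /* 3 tokens of length 3 */
      switch (tokstr[0])
      {
        case 'm':
          switch (tokstr[1])
          {
            case 'h':
              switch (tokstr[2])
              {
                case 'j':
                  { /* mhj */
                    return MHJ;
                  }
                  goto unknown;
                case 'x':
                  { /* mhx */
                    return MHX;
                  }
                  goto unknown;
                default:
                  goto unknown;
              }
            default:
              goto unknown;
          }
        case 'x':
          if (tokstr[1] == 'h' &&
              tokstr[2] == 'o')
          { /* xho */
            return XHO;
          }
          goto unknown;
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }
  /* end generated code */

unknown:
  return LEN_UNKNOWN;
}
```

A caller that has already scanned a token, for instance the first three bytes of the buffer "mhx+1", can call classify_n(buffer, 3) without copying or terminating the token first.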
TokenEnd => STRING
Character that defines the end of each token. The default is the null character '\0'. Can also be undef if tokens don't end with a special character.
TokenFunc => SUBROUTINE
A reference to the subroutine that returns the code for each token match. The only parameter to the subroutine is the token string.
This is the default subroutine:
TokenFunc => sub { "return $_[0];\n" }
TokenString => STRING
Identifier of the C character array that contains the token string. The default is tokstr.
UnknownLabel => STRING
Label that should be jumped to via goto if there's no keyword matching the token. The default is unknown.
ADDING TOKENS
You can add tokens using the add_tokens method.
The method takes either a list of token strings or a reference to an array of token strings, which can optionally be followed by a preprocessor directive string.
Calls to add_tokens can be chained together, as the method returns a reference to its object.
GENERATING THE CODE
The generate method will return a string with the tokenizer switch statement. If no tokens were added, it will return an empty string.
You can optionally pass an Indent option to the generate method to specify a string used for indenting the whole switch statement, e.g.:
print $t->generate(Indent => "\t");
This is completely independent of the Indent option passed to the constructor.
AUTHOR
Marcus Holland-Moritz <mhx@cpan.org>
BUGS
I hope none, since the code is pretty short. Perhaps lack of functionality ;-)
COPYRIGHT
Copyright (c) 2002-2005, Marcus Holland-Moritz. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.