NAME
Tokenizer - Generate C source for fast keyword tokenizer
SYNOPSIS
use Tokenizer;
my $t = Tokenizer->new( tokfnc => sub { "return \U$_[0];\n" } );
$t->addtokens( '', qw( bar baz for ) );
$t->addtokens( 'DIRECTIVE', qw( foo ) );
print $t->makeswitch;
DESCRIPTION
The Tokenizer module provides a small class for creating the essential ANSI C source code for a fast keyword tokenizer.
The generated code is optimized for speed. On the ANSI C keyword set, it is 2-3 times faster than equivalent code generated with the gperf utility.
The above example would print the following C source code:
switch( tokstr[0] )
{
  case 'b':
    switch( tokstr[1] )
    {
      case 'a':
        switch( tokstr[2] )
        {
          case 'r':
            if( tokstr[3] == '\0' )
            { /* bar */
              return BAR;
            }
            goto unknown;
          case 'z':
            if( tokstr[3] == '\0' )
            { /* baz */
              return BAZ;
            }
            goto unknown;
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
  case 'f':
    switch( tokstr[1] )
    {
      case 'o':
        switch( tokstr[2] )
        {
#if defined DIRECTIVE
          case 'o':
            if( tokstr[3] == '\0' )
            { /* foo */
              return FOO;
            }
            goto unknown;
#endif /* defined DIRECTIVE */
          case 'r':
            if( tokstr[3] == '\0' )
            { /* for */
              return FOR;
            }
            goto unknown;
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
  default:
    goto unknown;
}
Note that the generated code contains only the main switch statement of the tokenizer. Most of the generated code can be configured to fit your application.
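Because the output is only a fragment, it has to be placed inside a function that supplies the tokstr array and the unknown label. The following is a minimal sketch of such a wrapper, not code produced by the module: the enum token type, its values, and the function name classify_keyword are hypothetical, and only the 'b' branch of the example output is reproduced to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; the module does not define these. */
enum token { UNKNOWN = 0, BAR, BAZ };

/* Hypothetical wrapper that supplies the `tokstr` array and the
   `unknown` label that the generated switch refers to. */
static enum token classify_keyword(const char *tokstr)
{
  assert(tokstr != NULL);

  /* begin generated fragment (only the 'b' branch is shown) */
  switch( tokstr[0] )
  {
    case 'b':
      switch( tokstr[1] )
      {
        case 'a':
          switch( tokstr[2] )
          {
            case 'r':
              if( tokstr[3] == '\0' )
              { /* bar */
                return BAR;
              }
              goto unknown;
            case 'z':
              if( tokstr[3] == '\0' )
              { /* baz */
                return BAZ;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }
  /* end generated fragment */

unknown:
  return UNKNOWN;
}
```

Each nested switch inspects one character and stops at the first mismatch, so the code never reads past the terminating null character of a short string such as "ba".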
CONFIGURATION
tokfnc => SUBROUTINE
A reference to the subroutine that returns the code for each token match. The only parameter to the subroutine is the token string.
This is the default subroutine:
tokfnc => sub { "return $_[0];\n" }
tokstr => STRING
Identifier of the C character array that contains the token string. The default is tokstr.
ulabel => STRING
Label that should be jumped to via goto if there's no keyword matching the token. The default is unknown.
endtok => STRING
Character that defines the end of each token. The default is the null character '\0'.
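To make the effect of these options concrete, here is a hand-written C sketch of what the switch might look like for a single keyword "if" with tokstr => 'name', ulabel => 'not_keyword', endtok => "' '", and a tokfnc that upper-cases the token as in the synopsis. It assumes that the option values are pasted verbatim into the output, which the defaults suggest but this document does not state outright, so treat it as an illustration rather than actual module output. The enum token type and the function match_keyword are hypothetical scaffolding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; not defined by the module. */
enum token { NOT_KEYWORD = 0, IF };

/* Hypothetical wrapper around the assumed output. */
static enum token match_keyword(const char *name)
{
  assert(name != NULL);

  switch( name[0] )              /* tokstr => 'name' */
  {
    case 'i':
      switch( name[1] )
      {
        case 'f':
          if( name[2] == ' ' )   /* endtok => "' '" */
          { /* if */
            return IF;           /* text returned by tokfnc */
          }
          goto not_keyword;      /* ulabel => 'not_keyword' */
        default:
          goto not_keyword;
      }
    default:
      goto not_keyword;
  }

not_keyword:
  return NOT_KEYWORD;
}
```

With a space as the end-of-token character, the keyword matches only when it is followed by a space; a bare "if" at the end of the string falls through to the not_keyword label.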
ADDING KEYWORDS
You can add tokens using the addtokens method. The first parameter is the name of a preprocessor define; the code generated for the tokens that follow is then compiled only if that define is set. If you don't want that dependency, pass an empty string. The remaining parameters are the keyword tokens.
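The effect of the define can be seen in the 'f' branch of the example output, where "foo" was added under DIRECTIVE and "for" unconditionally. The sketch below wraps that branch in a hypothetical function (the enum token type and the name classify_keyword are assumptions, not part of the module) and defines DIRECTIVE in the source for demonstration; in a real build the define would more likely come from a compiler option such as -DDIRECTIVE.

```c
#include <assert.h>
#include <stddef.h>

/* Enables the guarded keyword "foo"; remove this line and "foo"
   is no longer recognized, while "for" still is. */
#define DIRECTIVE 1

/* Hypothetical token codes; not defined by the module. */
enum token { UNKNOWN = 0, FOO, FOR };

static enum token classify_keyword(const char *tokstr)
{
  assert(tokstr != NULL);

  switch( tokstr[0] )
  {
    case 'f':
      switch( tokstr[1] )
      {
        case 'o':
          switch( tokstr[2] )
          {
#if defined DIRECTIVE
            case 'o':
              if( tokstr[3] == '\0' )
              { /* foo */
                return FOO;
              }
              goto unknown;
#endif /* defined DIRECTIVE */
            case 'r':
              if( tokstr[3] == '\0' )
              { /* for */
                return FOR;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }

unknown:
  return UNKNOWN;
}
```

Only the case for the guarded keyword sits inside the #if block, so disabling the define removes that one keyword without affecting the rest of the switch.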
GENERATING THE CODE
The makeswitch method will return a string with the tokenizer switch statement.
AUTHOR
Marcus Holland-Moritz <mhx@cpan.org>
BUGS
I hope none, since the code is pretty short. Perhaps lack of functionality ;-)
COPYRIGHT
Copyright (c) 2002-2003, Marcus Holland-Moritz. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.