NAME

Devel::Tokenizer::C - Generate C source for fast keyword tokenizer

SYNOPSIS

use Devel::Tokenizer::C;

my $t = Devel::Tokenizer::C->new( TokenFunc => sub { "return \U$_[0];\n" } );

$t->add_tokens( qw( bar baz ) )->add_tokens( ['for'] );
$t->add_tokens( [qw( foo )], 'defined DIRECTIVE' );

print $t->generate;

DESCRIPTION

The Devel::Tokenizer::C module provides a small class for creating the essential ANSI C source code for a fast keyword tokenizer.

The generated code is optimized for speed. On the ANSI-C keyword set, it's 2-3 times faster than equivalent code generated with the gperf utility.

The above example would print the following C source code:

switch( tokstr[0] )
{
  case 'b':
    switch( tokstr[1] )
    {
      case 'a':
        switch( tokstr[2] )
        {
          case 'r':
            if( tokstr[3] == '\0' )
            {                                     /* bar        */
              return BAR;
            }

            goto unknown;

          case 'z':
            if( tokstr[3] == '\0' )
            {                                     /* baz        */
              return BAZ;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  case 'f':
    switch( tokstr[1] )
    {
      case 'o':
        switch( tokstr[2] )
        {
#if defined DIRECTIVE
          case 'o':
            if( tokstr[3] == '\0' )
            {                                     /* foo        */
              return FOO;
            }

            goto unknown;
#endif /* defined DIRECTIVE */

          case 'r':
            if( tokstr[3] == '\0' )
            {                                     /* for        */
              return FOR;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

The generated code contains only the main switch statement of the tokenizer. The enclosing function, the token constants and the unknown label are left to you. You can configure most of the generated code to fit your application.
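Because only the switch statement is generated, it has to be placed inside a function that you write yourself. The following is a minimal sketch of how that might look, not a prescribed layout. The function name lookup_keyword, the enum and the TOK_UNKNOWN value are assumptions made for illustration; the parameter name tokstr and the label unknown simply match the defaults, so the generated switch can be pasted in unchanged (blank lines and comments are trimmed here).

```c
/* Token codes are hypothetical; the names match what the SYNOPSIS
   TokenFunc emits ("return BAR;" and so on). */
enum token { TOK_UNKNOWN = 0, BAR, BAZ, FOO, FOR };

/* Hypothetical wrapper function around the generated switch. */
enum token lookup_keyword(const char *tokstr)
{
  /* begin generated code */
  switch( tokstr[0] )
  {
    case 'b':
      switch( tokstr[1] )
      {
        case 'a':
          switch( tokstr[2] )
          {
            case 'r':
              if( tokstr[3] == '\0' )
              {
                return BAR;
              }
              goto unknown;
            case 'z':
              if( tokstr[3] == '\0' )
              {
                return BAZ;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    case 'f':
      switch( tokstr[1] )
      {
        case 'o':
          switch( tokstr[2] )
          {
#if defined DIRECTIVE
            case 'o':
              if( tokstr[3] == '\0' )
              {
                return FOO;
              }
              goto unknown;
#endif /* defined DIRECTIVE */
            case 'r':
              if( tokstr[3] == '\0' )
              {
                return FOR;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }
  /* end generated code */

unknown:
  /* Target of every "goto unknown" in the generated code. */
  return TOK_UNKNOWN;
}
```

With this wrapper, lookup_keyword("bar") yields BAR, while a string such as "barn" or "ba" ends up at the unknown label. The keyword foo is only recognized when DIRECTIVE is defined at compile time.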

CONFIGURATION

TokenFunc => SUBROUTINE

A reference to the subroutine that returns the code for each token match. The only parameter to the subroutine is the token string.

This is the default subroutine:

TokenFunc => sub { "return $_[0];\n" }

TokenString => STRING

Identifier of the C character array that contains the token string. The default is tokstr.

UnknownLabel => STRING

Label that should be jumped to via goto if there's no keyword matching the token. The default is unknown.

TokenEnd => STRING

Character that defines the end of each token. The default is the null character '\0'.
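To illustrate what a different end character means for the matching logic, here is a hand-written sketch in the style of the generated code. It is not actual module output. It assumes the end-of-token test compares against ':' instead of '\0'; the function name match_bar_colon and the KEY_ constants are hypothetical.

```c
/* Hypothetical token codes for illustration. */
enum { KEY_UNKNOWN = 0, KEY_BAR };

/* Hand-written sketch: matches the keyword "bar" at the start of buf
   when it is terminated by ':' rather than by '\0'. */
int match_bar_colon(const char *buf)
{
  switch( buf[0] )
  {
    case 'b':
      switch( buf[1] )
      {
        case 'a':
          switch( buf[2] )
          {
            case 'r':
              if( buf[3] == ':' )   /* end-of-token test uses ':' */
              {
                return KEY_BAR;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }

unknown:
  return KEY_UNKNOWN;
}
```

A buffer such as "bar: 1" matches here without having to copy the keyword into its own null-terminated string, whereas a plain "bar" no longer matches.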

CaseSensitive => 0 | 1

Boolean defining whether the generated tokenizer should be case sensitive or not. This will only affect the letters A-Z. The default is 1, so the generated tokenizer is case sensitive.
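One plausible way a switch-based tokenizer can ignore case is to give each letter two case labels. The sketch below is hand-written to show the idea for the single keyword bar; the module's actual output for CaseSensitive => 0 may be structured differently, and the names match_bar_nocase and KW_BAR are hypothetical.

```c
/* Hypothetical token codes for illustration. */
enum { KW_UNKNOWN = 0, KW_BAR };

/* Hand-written sketch: matches "bar" regardless of letter case. */
int match_bar_nocase(const char *tokstr)
{
  switch( tokstr[0] )
  {
    case 'B':
    case 'b':
      switch( tokstr[1] )
      {
        case 'A':
        case 'a':
          switch( tokstr[2] )
          {
            case 'R':
            case 'r':
              if( tokstr[3] == '\0' )
              {
                return KW_BAR;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }

unknown:
  return KW_UNKNOWN;
}
```

Inputs such as "bar", "BAR" and "bAr" all reach the same return statement. Characters other than letters have a single label, which is in line with the note that only A-Z are affected.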

ADDING TOKENS

You can add tokens using the add_tokens method.

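The method takes either a list of token strings or a reference to an array of token strings. An array reference can optionally be followed by a preprocessor directive string. The code for those tokens is then enclosed in a matching #if ... #endif block, as with 'defined DIRECTIVE' in the example above.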

Calls to add_tokens can be chained together, as the method returns a reference to its object.

GENERATING THE CODE

The generate method will return a string with the tokenizer switch statement. If no tokens were added, it will return an empty string.

AUTHOR

Marcus Holland-Moritz <mhx@cpan.org>

BUGS

I hope none, since the code is pretty short. Perhaps lack of functionality ;-)

COPYRIGHT

Copyright (c) 2003, Marcus Holland-Moritz. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.