NAME

Tokenizer - Generate C source for fast keyword tokenizer

SYNOPSIS

use Tokenizer;

my $t = Tokenizer->new( tokfnc => sub { "return \U$_[0];\n" } );

$t->addtokens( '', qw( bar baz for ) );
$t->addtokens( 'DIRECTIVE', qw( foo ) );

print $t->makeswitch;

DESCRIPTION

The Tokenizer module provides a small class that generates the core ANSI C source code for a fast keyword tokenizer.

The example above prints the following C source code:

switch( tokstr[0] )
{
  case 'b':
    switch( tokstr[1] )
    {
      case 'a':
        switch( tokstr[2] )
        {
          case 'r':
            if( tokstr[3] == '\0' )
            {                                     /* bar        */
              return BAR;
            }

            goto unknown;

          case 'z':
            if( tokstr[3] == '\0' )
            {                                     /* baz        */
              return BAZ;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  case 'f':
    switch( tokstr[1] )
    {
      case 'o':
        switch( tokstr[2] )
        {
#if defined DIRECTIVE
          case 'o':
            if( tokstr[3] == '\0' )
            {                                     /* foo        */
              return FOO;
            }

            goto unknown;
#endif /* defined DIRECTIVE */

          case 'r':
            if( tokstr[3] == '\0' )
            {                                     /* for        */
              return FOR;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

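The generated code consists only of the main switch statement for the tokenizer. The enclosing function, the tokstr character array, and the unknown label must be supplied by your own code. You can configure most of the generated code to fit your application.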
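The following is a minimal sketch of how such a switch might be embedded in a C function. The token codes (UNKNOWN, BAR, BAZ) and the wrapper name lookup_keyword are assumptions made for illustration and are not provided by the module. The switch is abbreviated to the bar and baz branch of the output shown above.

```c
/* Hypothetical token codes; the module does not define these. */
enum token { UNKNOWN = 0, BAR, BAZ };

/* Hypothetical wrapper: it supplies the tokstr array and the
   unknown label that the generated switch refers to. */
enum token lookup_keyword(const char *tokstr)
{
  /* Switch in the shape of the generated code (bar and baz only). */
  switch( tokstr[0] )
  {
    case 'b':
      switch( tokstr[1] )
      {
        case 'a':
          switch( tokstr[2] )
          {
            case 'r':
              if( tokstr[3] == '\0' )
              {                                     /* bar        */
                return BAR;
              }

              goto unknown;

            case 'z':
              if( tokstr[3] == '\0' )
              {                                     /* baz        */
                return BAZ;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }

unknown:
  /* Reached whenever no keyword matches the token. */
  return UNKNOWN;
}
```

Because every level falls through to the default branch as soon as it meets the terminating null character, the switch never reads past the end of a null-terminated string, even for tokens shorter than any keyword.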

CONFIGURATION

tokfnc => SUBROUTINE

A reference to the subroutine that returns the code for each token match. The only parameter to the subroutine is the token string.

This is the default subroutine:

tokfnc => sub { "return $_[0];\n" }

tokstr => STRING

Identifier of the C character array that contains the token string. The default is tokstr.

ulabel => STRING

Label that the generated code jumps to via goto when no keyword matches the token. The default is unknown.

endtok => STRING

Character that defines the end of each token. The default is the null character '\0'.
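To illustrate what these options control, here is a hand-written sketch in the shape of the generated code. It was not produced by running the module. It assumes tokstr is set to opt, ulabel to no_match, and endtok to the character '=', so that a keyword is recognized when it is followed by an equals sign. The token codes and the function name match_option are hypothetical.

```c
/* Hypothetical token codes for this sketch. */
enum option { OPT_NONE = 0, OPT_BAR };

/* Hand-written in the shape of the generated code, assuming
   tokstr is "opt", ulabel is "no_match" and endtok is '='. */
enum option match_option(const char *opt)
{
  switch( opt[0] )
  {
    case 'b':
      switch( opt[1] )
      {
        case 'a':
          switch( opt[2] )
          {
            case 'r':
              if( opt[3] == '=' )
              {                                     /* bar        */
                return OPT_BAR;
              }

              goto no_match;

            default:
              goto no_match;
          }

        default:
          goto no_match;
      }

    default:
      goto no_match;
  }

no_match:
  /* No keyword followed by '=' was found. */
  return OPT_NONE;
}
```

With these assumed settings, the string "bar=1" would match the keyword bar, while a plain "bar" would not, because its fourth character is the null character rather than '='.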

ADDING KEYWORDS

You can add tokens using the addtokens method. The first parameter is the name of a preprocessor define. If it is given, the code generated for the tokens that follow is wrapped in a matching #if defined block, so those tokens are recognized only when the define is set at compile time. If you don't want that dependency, pass an empty string. The remaining parameters are the keyword tokens.
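The effect of the define can be seen in a small sketch that embeds the foo and for branch of the output shown in the DESCRIPTION. The token codes (NOT_KEYWORD, FOO, FOR) and the function name classify are assumptions made for illustration.

```c
/* Hypothetical token codes; the module does not define these. */
enum keyword { NOT_KEYWORD = 0, FOO, FOR };

/* Hypothetical wrapper around the foo/for branch of the switch. */
enum keyword classify(const char *tokstr)
{
  switch( tokstr[0] )
  {
    case 'f':
      switch( tokstr[1] )
      {
        case 'o':
          switch( tokstr[2] )
          {
#if defined DIRECTIVE
            /* Compiled only when DIRECTIVE is defined. */
            case 'o':
              if( tokstr[3] == '\0' )
              {                                     /* foo        */
                return FOO;
              }

              goto unknown;
#endif /* defined DIRECTIVE */

            case 'r':
              if( tokstr[3] == '\0' )
              {                                     /* for        */
                return FOR;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }

unknown:
  return NOT_KEYWORD;
}
```

When this is compiled without DIRECTIVE, the string "foo" reaches the default branch and is reported as NOT_KEYWORD. When DIRECTIVE is defined (for example with a -DDIRECTIVE compiler option), the same string yields FOO. The keyword for is recognized in both cases.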

GENERATING THE CODE

The makeswitch method returns a string containing the tokenizer switch statement.

AUTHOR

Marcus Holland-Moritz <mhx@cpan.org>

BUGS

I hope none, since the code is pretty short. Perhaps lack of functionality ;-)

COPYRIGHT

Copyright (c) 2002, Marcus Holland-Moritz. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.