NAME

Devel::Tokenizer::C - Generate C source for fast keyword tokenizer

SYNOPSIS

use Devel::Tokenizer::C;

$t = new Devel::Tokenizer::C TokenFunc => sub { "return \U$_[0];\n" };

$t->add_tokens(qw( bar baz ))->add_tokens(['for']);
$t->add_tokens([qw( foo )], 'defined DIRECTIVE');

print $t->generate;

DESCRIPTION

The Devel::Tokenizer::C module provides a small class for creating the essential ANSI C source code for a fast keyword tokenizer.

The generated code is optimized for speed. On the ANSI-C keyword set, it's 2-3 times faster than equivalent code generated with the gperf utility.

The above example would print the following C source code:

switch (tokstr[0])
{
  case 'b':
    switch (tokstr[1])
    {
      case 'a':
        switch (tokstr[2])
        {
          case 'r':
            if (tokstr[3] == '\0')
            {                                     /* bar        */
              return BAR;
            }

            goto unknown;

          case 'z':
            if (tokstr[3] == '\0')
            {                                     /* baz        */
              return BAZ;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  case 'f':
    switch (tokstr[1])
    {
      case 'o':
        switch (tokstr[2])
        {
#if defined DIRECTIVE
          case 'o':
            if (tokstr[3] == '\0')
            {                                     /* foo        */
              return FOO;
            }

            goto unknown;
#endif /* defined DIRECTIVE */

          case 'r':
            if (tokstr[3] == '\0')
            {                                     /* for        */
              return FOR;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

The generated code thus consists only of the tokenizer's main switch statement. Most aspects of it can be configured to fit your application.
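
Since only the switch statement is generated, the enclosing function, the token constants and the unknown label have to come from your own code. The following is a minimal sketch of one way to do that, with the SYNOPSIS output pasted in by hand. The names token_id, TOK_UNKNOWN and tokenize are made up for this illustration and are not part of the module.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token constants. BAR, BAZ, FOO and FOR are the uppercased
   token names that the SYNOPSIS TokenFunc returns. */
enum token_id { TOK_UNKNOWN, BAR, BAZ, FOO, FOR };

#define DIRECTIVE  /* enables the conditional "foo" branch */

enum token_id tokenize(const char *tokstr)
{
  assert(tokstr != NULL);

  /* ---- generated switch statement starts here ---- */
  switch (tokstr[0])
  {
    case 'b':
      switch (tokstr[1])
      {
        case 'a':
          switch (tokstr[2])
          {
            case 'r':
              if (tokstr[3] == '\0')
              {                                     /* bar        */
                return BAR;
              }

              goto unknown;

            case 'z':
              if (tokstr[3] == '\0')
              {                                     /* baz        */
                return BAZ;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    case 'f':
      switch (tokstr[1])
      {
        case 'o':
          switch (tokstr[2])
          {
#if defined DIRECTIVE
            case 'o':
              if (tokstr[3] == '\0')
              {                                     /* foo        */
                return FOO;
              }

              goto unknown;
#endif /* defined DIRECTIVE */

            case 'r':
              if (tokstr[3] == '\0')
              {                                     /* for        */
                return FOR;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }
  /* ---- generated switch statement ends here ---- */

unknown:  /* target of every "goto unknown" above */
  return TOK_UNKNOWN;
}
```

With this wrapper, tokenize("for") yields FOR and tokenize("fob") yields TOK_UNKNOWN. Removing the #define DIRECTIVE line compiles the foo branch out, so "foo" would then be reported as unknown too.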

CONFIGURATION

CaseSensitive => 0 | 1

Boolean defining whether the generated tokenizer should be case sensitive or not. This will only affect the letters A-Z. The default is 1, so the generated tokenizer is case sensitive.

Indent => STRING

String to be used for one level of indentation. The default is two space characters.

MergeSwitches => 0 | 1

Boolean defining whether nested switch statements containing only a single case should be merged into a single if statement. Without this option, merging is only done at the end of a branch. With MergeSwitches enabled, merging is also done in the middle of a branch. For example, the code

$t = new Devel::Tokenizer::C
         TokenFunc     => sub { "return \U$_[0];\n" },
         MergeSwitches => 1;

$t->add_tokens(qw( carport carpet muppet ));

print $t->generate;

would output this switch statement:

switch (tokstr[0])
{
  case 'c':
    if (tokstr[1] == 'a' &&
        tokstr[2] == 'r' &&
        tokstr[3] == 'p')
    {
      switch (tokstr[4])
      {
        case 'e':
          if (tokstr[5] == 't' &&
              tokstr[6] == '\0')
          {                                       /* carpet     */
            return CARPET;
          }

          goto unknown;

        case 'o':
          if (tokstr[5] == 'r' &&
              tokstr[6] == 't' &&
              tokstr[7] == '\0')
          {                                       /* carport    */
            return CARPORT;
          }

          goto unknown;

        default:
          goto unknown;
      }
    }

    goto unknown;

  case 'm':
    if (tokstr[1] == 'u' &&
        tokstr[2] == 'p' &&
        tokstr[3] == 'p' &&
        tokstr[4] == 'e' &&
        tokstr[5] == 't' &&
        tokstr[6] == '\0')
    {                                             /* muppet     */
      return MUPPET;
    }

    goto unknown;

  default:
    goto unknown;
}

Strategy => 'ordered' | 'narrow' | 'wide'

The strategy used for ordering the character positions that are checked. 'ordered' leaves the positions in their normal left-to-right order. 'narrow' sorts the character positions so that the positions with the least character variation are checked first. 'wide' does exactly the opposite. (If you're confused now, just try it. ;-)

The default is ordered. You can only use narrow and wide together with StringLength.

The code

$t = new Devel::Tokenizer::C
         TokenFunc     => sub { "return \U$_[0];\n" },
         StringLength  => 'len',
         Strategy      => 'ordered';

$t->add_tokens(qw( mhj xho mhx ));

print $t->generate;

would output this switch statement:

switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[0])
    {
      case 'm':
        switch (tokstr[1])
        {
          case 'h':
            switch (tokstr[2])
            {
              case 'j':
                {                                 /* mhj        */
                  return MHJ;
                }

                goto unknown;

              case 'x':
                {                                 /* mhx        */
                  return MHX;
                }

                goto unknown;

              default:
                goto unknown;
            }

          default:
            goto unknown;
        }

      case 'x':
        if (tokstr[1] == 'h' &&
            tokstr[2] == 'o')
        {                                         /* xho        */
          return XHO;
        }

        goto unknown;

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

Using the narrow strategy, the switch statement would be:

switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[1])
    {
      case 'h':
        switch (tokstr[0])
        {
          case 'm':
            switch (tokstr[2])
            {
              case 'j':
                {                                 /* mhj        */
                  return MHJ;
                }

                goto unknown;

              case 'x':
                {                                 /* mhx        */
                  return MHX;
                }

                goto unknown;

              default:
                goto unknown;
            }

          case 'x':
            if (tokstr[2] == 'o')
            {                                     /* xho        */
              return XHO;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

Using the wide strategy, the switch statement would be:

switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[2])
    {
      case 'j':
        if (tokstr[0] == 'm' &&
            tokstr[1] == 'h')
        {                                         /* mhj        */
          return MHJ;
        }

        goto unknown;

      case 'o':
        if (tokstr[0] == 'x' &&
            tokstr[1] == 'h')
        {                                         /* xho        */
          return XHO;
        }

        goto unknown;

      case 'x':
        if (tokstr[0] == 'm' &&
            tokstr[1] == 'h')
        {                                         /* mhx        */
          return MHX;
        }

        goto unknown;

      default:
        goto unknown;
    }

  default:
    goto unknown;
}
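
The position orders in the three listings above can be reproduced by counting how many distinct characters occur at each position. For mhj, xho and mhx, position 0 has two (m, x), position 1 has one (h) and position 2 has three (j, o, x). The sketch below computes such an ordering. It is not the module's implementation: the function names, the MAX_LEN limit and the tie-breaking rule (ties keep left-to-right order) are assumptions made for this illustration, and the module may well recompute the ordering inside each branch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEN 64  /* arbitrary limit for this sketch */

/* Count the distinct characters found at position pos across n tokens.
   Every token must be at least pos + 1 characters long. */
static size_t distinct_at(const char *const *tokens, size_t n, size_t pos)
{
  unsigned char seen[256] = {0};
  size_t count = 0;
  size_t i;

  for (i = 0; i < n; i++) {
    unsigned char c = (unsigned char) tokens[i][pos];
    if (!seen[c]) {
      seen[c] = 1;
      count++;
    }
  }

  return count;
}

/* Fill order[0 .. len-1] with the character positions sorted by variation:
   least variation first if widest_first is 0 ("narrow"), most variation
   first otherwise ("wide"). All n tokens must have length len. */
void order_positions(const char *const *tokens, size_t n, size_t len,
                     int widest_first, size_t *order)
{
  size_t variation[MAX_LEN];
  size_t i, j;

  assert(tokens != NULL && order != NULL && len <= MAX_LEN);

  for (i = 0; i < len; i++) {
    variation[i] = distinct_at(tokens, n, i);
    order[i] = i;
  }

  /* Stable insertion sort, so equal counts keep left-to-right order. */
  for (i = 1; i < len; i++) {
    size_t pos = order[i];

    j = i;
    while (j > 0 && (widest_first
                       ? variation[order[j - 1]] < variation[pos]
                       : variation[order[j - 1]] > variation[pos])) {
      order[j] = order[j - 1];
      j--;
    }
    order[j] = pos;
  }
}
```

For the three tokens above, this gives the order 1, 0, 2 for narrow and 2, 0, 1 for wide, which matches the order in which tokstr[...] is examined in the corresponding listings.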

StringLength => STRING

Identifier of the C variable that contains the length of the string, when available. If the string length is known, switching can be done more efficiently. That doesn't mean that it is more efficient to compute the string length first. If you don't know the string length, just don't use this option. Not using it is also the default.
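
A typical case where the length is already known is a scanner that keeps a start pointer and a length for each word. The sketch below wraps the wide listing from the Strategy section (generated with StringLength => 'len') in a function. That listing contains no comparison against an end-of-token character, so the sketch assumes the token does not need to be null-terminated. The names kw_id, KW_UNKNOWN and tokenize_len are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token constants for the tokens mhj, mhx and xho. */
enum kw_id { KW_UNKNOWN, MHJ, MHX, XHO };

/* tokstr points at the first character of the token, len is its length. */
enum kw_id tokenize_len(const char *tokstr, size_t len)
{
  assert(tokstr != NULL);

  /* ---- generated switch statement starts here ---- */
  switch (len)
  {
    case 3: /* 3 tokens of length 3 */
      switch (tokstr[2])
      {
        case 'j':
          if (tokstr[0] == 'm' &&
              tokstr[1] == 'h')
          {                                         /* mhj        */
            return MHJ;
          }

          goto unknown;

        case 'o':
          if (tokstr[0] == 'x' &&
              tokstr[1] == 'h')
          {                                         /* xho        */
            return XHO;
          }

          goto unknown;

        case 'x':
          if (tokstr[0] == 'm' &&
              tokstr[1] == 'h')
          {                                         /* mhx        */
            return MHX;
          }

          goto unknown;

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }
  /* ---- generated switch statement ends here ---- */

unknown:
  return KW_UNKNOWN;
}
```

Given the buffer "xho mhx", calling tokenize_len(buf, 3) yields XHO and tokenize_len(buf + 4, 3) yields MHX, without the buffer ever being split into separate strings.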

TokenEnd => STRING

Character that defines the end of each token. The default is the null character '\0'. Can also be undef if tokens don't end with a special character.

TokenFunc => SUBROUTINE

A reference to the subroutine that returns the code for each token match. The only parameter to the subroutine is the token string.

This is the default subroutine:

TokenFunc => sub { "return $_[0];\n" }

TokenString => STRING

Identifier of the C character array that contains the token string. The default is tokstr.

UnknownLabel => STRING

Label that should be jumped to via goto if there's no keyword matching the token. The default is unknown.

ADDING TOKENS

You can add tokens using the add_tokens method.

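The method takes either a list of token strings or a reference to an array of token strings. The array reference can optionally be followed by a preprocessor directive string. As the SYNOPSIS output shows, the code for those tokens is then enclosed in a matching #if and #endif pair.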

Calls to add_tokens can be chained together, as the method returns a reference to its object.

GENERATING THE CODE

The generate method will return a string with the tokenizer switch statement. If no tokens were added, it will return an empty string.

You can optionally pass an Indent option to the generate method to specify a string used for indenting the whole switch statement, e.g.:

print $t->generate(Indent => "\t");

This is completely independent from the Indent option passed to the constructor.

AUTHOR

Marcus Holland-Moritz <mhx@cpan.org>

BUGS

I hope none, since the code is pretty short. Perhaps lack of functionality ;-)

COPYRIGHT

