NAME

C::Tokenize - reduce a C file to a series of tokens

SYNOPSIS

# Remove all C preprocessor instructions from a C program:
my $c = <<EOF;
#define X Y
#ifdef X
int X;
#endif
EOF
use C::Tokenize '$cpp_re';
$c =~ s/$cpp_re//g;
print "$c\n";

produces output

int X;

(This example is included as synopsis-cpp.pl in the distribution.)

# Print all the comments in a C program:
my $c = <<EOF;
/* This is the main program. */
int main ()
{
    int i;
    /* Increment i by 1. */
    i++;
    // Now exit with zero status.
    return 0;
}
EOF
use C::Tokenize '$comment_re';
while ($c =~ /($comment_re)/g) {
    print "$1\n";
}

produces output

/* This is the main program. */
/* Increment i by 1. */
// Now exit with zero status.

(This example is included as synopsis-comment.pl in the distribution.)

VERSION

This documents version 0.12 of C::Tokenize corresponding to git commit ae2d55568809a6b5cc23d66948cdd87cf6d0f98f released on Wed Dec 7 09:47:21 2016 +0900.

DESCRIPTION

This module provides a tokenizer, "tokenize", which breaks C source code into its smallest meaningful components, and the regular expressions which match each of these components. For example, the module supplies a regular expression "$comment_re" which matches a C comment.

It also supplies extra regular expressions, such as "$include_local" for local include statements and "$cvar_re" for C variables, as well as extra functions, such as "decomment" for removing traditional C comment marks.

REGULAR EXPRESSIONS

The following regular expressions can be imported from this module using, for example,

use C::Tokenize '$cpp_re'

to import $cpp_re.

Most of the following regular expressions do not do any capturing, except where noted. If you want to capture, add your own parentheses around the regular expression.
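As a minimal sketch of adding your own parentheses, the following wraps two of the imported regular expressions in a capturing group and collects what they match. The C fragment and the array names @trad and @cxx are made up for illustration.

```perl
use strict;
use warnings;
use C::Tokenize qw($trad_comment_re $cxx_comment_re);

# Hypothetical C fragment, used only for illustration.
my $c = <<'EOF';
/* Set up. */
int i = 0; // counter
/* Tear down. */
EOF

# The imported regexes do not capture, so wrap each one in parentheses.
my (@trad, @cxx);
while ($c =~ /($trad_comment_re)/g) {
    push @trad, $1;
}
while ($c =~ /($cxx_comment_re)/g) {
    push @cxx, $1;
}
printf "%d traditional, %d C++-style\n", scalar @trad, scalar @cxx;
```

Collecting the matches in a while loop, as the synopsis does, keeps $1 tied to the outer parentheses regardless of how the imported regular expression is written internally.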

$trad_comment_re

Match /* */ comments.

$cxx_comment_re

Match // comments.

$comment_re

Match both /* */ and // comments.

$cpp_re

Match a C preprocessor instruction.

$char_const_re

Match a character constant, such as 'a' or '\-'.

$operator_re

Match an operator such as + or --.

$number_re

Match a number, either integer, floating point, or hexadecimal. Octal numbers are not yet supported.

$word_re

Match a word, such as a function or variable name or a keyword of the language.

$grammar_re

Match other syntactic characters such as { or [.

$single_string_re

Match a single C string constant such as "this".

$string_re

Match a full-blown C string constant, including compound strings "like" "this".

$reserved_re

Match a C reserved word like auto or goto.

$include_local

Match an include statement which uses double quotes, like #include "some.c".

This captures the entire statement in $1 and the file name in $2.

This was added in version 0.11 of C::Tokenize.
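A short sketch of one possible use, listing only the headers which a file includes with double quotes. The source text and the array name @local_headers are hypothetical; the use of $2 relies on the capture behaviour described above.

```perl
use strict;
use warnings;
use C::Tokenize '$include_local';

# Hypothetical source text, used only for illustration.
my $c = <<'EOF';
#include <stdio.h>
#include "config.h"
#include "util/list.h"
EOF

# $2 holds the file name of each double-quoted include statement.
my @local_headers;
while ($c =~ /$include_local/g) {
    push @local_headers, $2;
}
print "$_\n" for @local_headers;
```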

$include

Match any include statement, like #include <stdio.h>.

This captures the entire statement in $1 and the file name in $2.

use C::Tokenize '$include';
my $c = <<EOF;
#include <this.h>
#include "that.h"
EOF
while ($c =~ /$include/g) {
    print "Include statement $1 includes file $2.\n";
}

produces output

Include statement #include <this.h> includes file this.h.
Include statement #include "that.h" includes file that.h.

(This example is included as includes.pl in the distribution.)

This was added in version 0.12 of C::Tokenize.

$cvar_re

This matches a C variable, for example anything which may be an lvalue or a function argument.

use C::Tokenize '$cvar_re';
my $c = 'func (x->y, & z, ** a, & q);';
while ($c =~ /[,\(]\s*($cvar_re)/g) {
    print "$1 is a C variable.\n";
}

produces output

x->y is a C variable.
& z is a C variable.
** a is a C variable.
& q is a C variable.

(This example is included as cvar.pl in the distribution.)

This was added in version 0.11 of C::Tokenize.

VARIABLES

@fields

@fields contains a list of all the fields which are extracted by "tokenize".

FUNCTIONS

decomment

my $out = decomment ('/* comment */');
# $out = " comment ";

Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
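Since the string has to begin and end with comment marks, "decomment" pairs naturally with "$trad_comment_re". The following is a sketch under that assumption; the C fragment, the array name @texts and the whitespace trimming are illustrative additions, not part of the module.

```perl
use strict;
use warnings;
use C::Tokenize qw(decomment $trad_comment_re);

# Hypothetical C fragment, used only for illustration.
my $c = <<'EOF';
/* This is the main program. */
int main ()
{
    return 0; /* Exit with zero status. */
}
EOF

my @texts;
while ($c =~ /($trad_comment_re)/g) {
    # decomment leaves the spaces inside the marks, so trim them afterwards.
    my $text = decomment ($1);
    $text =~ s/^\s+|\s+$//g;
    push @texts, $text;
}
print "$_\n" for @texts;
```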

tokenize

my $tokens = tokenize ($file);

Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:

leading

Any whitespace which comes before the token (called "leading whitespace").

type

The type of the token, which may be

comment

A comment, like

/* This */

or

// this.

cpp

A C preprocessor instruction like

#define THIS 1

or

#include "That.h".

char_const

A character constant, like '\0' or 'a'.

grammar

A piece of C "grammar", like { or ] or ->.

number

A number such as 42.

word

A word, which may be a variable name or a function.

string

A string, like "this", or even "like" "this".

reserved

A C reserved word, like auto or goto.

All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:

use C::Tokenize '@fields';

$name

The value of the token, stored under a key named after its type. For example, if $token->{type} equals 'comment', then the value is in $token->{comment}.

if ($token->{type} eq 'string') {
    my $c_string = $token->{string};
}

line

The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
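The keys above can be put together in a loop over the return value. This is a sketch only, and it rests on two assumptions which are marked in the comments: that "tokenize" is given the C source text itself as a string, and that the "type" key names the key which holds the token's text.

```perl
use strict;
use warnings;
use C::Tokenize 'tokenize';

# Assumption: tokenize is given the C source text itself as a string.
my $c = "int x = 42; /* answer */\n";
my $tokens = tokenize ($c);

for my $token (@$tokens) {
    # Assumption: "type" names the key which holds the token's text.
    my $type = $token->{type};
    printf "line %d: %-10s %s\n", $token->{line}, $type, $token->{$type};
}
```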

EXPORTS

use C::Tokenize ':all';

exports all the regular expressions and functions from the module.

BUGS

Octal not parsed: It does not parse octal expressions.
No trigraphs: No handling of trigraphs.
Requires Perl 5.10: This module uses named captures in regular expressions, so it requires Perl 5.10 or later.
No line directives: The line numbers provided by "tokenize" do not respect C line directives.
Insufficient tests: The module has been used somewhat, but the included tests do not exercise many of the features of C.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.

To install C::Tokenize, copy and paste the appropriate command into your terminal.

cpanm

cpanm C::Tokenize

CPAN shell

perl -MCPAN -e shell
install C::Tokenize

For more information on module installation, please visit the detailed CPAN module installation guide.
