NAME
C::Tokenize - reduce a C file to a series of tokens
SYNOPSIS
    # Remove all C preprocessor instructions from a C program:
    use C::Tokenize '$cpp_re';
    $c =~ s/$cpp_re//g;

    # Print all the comments in a C program:
    use C::Tokenize '$comment_re';
    while ($c =~ /($comment_re)/g) {
        print "$1\n";
    }
DESCRIPTION
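This module provides regular expressions which match the parts of a C program (see "REGULAR EXPRESSIONS"), and functions which remove comments and split C into tokens (see "FUNCTIONS").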
REGULAR EXPRESSIONS
The regular expressions can be imported using, for example,
    use C::Tokenize '$cpp_re';
to import $cpp_re.
None of the regular expressions does any capturing. If you want to capture, add your own parentheses around the regular expression.
- $trad_comment_re
    Match /* */ comments.
- $cxx_comment_re
    Match // comments.
- $comment_re
    Match both /* */ and // comments.
- $cpp_re
    Match a C preprocessor instruction.
- $char_const_re
    Match a character constant, such as 'a' or '\-'.
- $operator_re
    Match an operator such as + or --.
- $number_re
    Match a number, either integer, floating point, or hexadecimal. Does not do octal yet.
- $word_re
    Match a word, such as a function or variable name or a keyword of the language.
- $grammar_re
    Match other syntactic characters such as { or [.
- $single_string_re
    Match a single C string constant such as "this".
- $string_re
    Match a full-blown C string constant, including compound strings "like" "this".
- $reserved_re
    Match a C reserved word like auto or goto.
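As a rough sketch of adding your own capturing parentheses, the following collects the string constants in a fragment of C using $single_string_re. The C fragment and the variable names $c and @strings are made up for illustration, and the example assumes that C::Tokenize is installed.

```perl
use strict;
use warnings;
use C::Tokenize '$single_string_re';

# Illustrative C source; not taken from the module.
my $c = <<'EOF';
puts ("hello");
puts ("world");
EOF

my @strings;
# The parentheses around $single_string_re are ours, since the regular
# expression itself does not capture. The /g flag makes each iteration
# resume after the previous match.
while ($c =~ /($single_string_re)/g) {
    push @strings, $1;
}
print "$_\n" for @strings;
```

The same pattern should work with any of the other regular expressions listed above.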
VARIABLES
@fields
@fields contains a list of all the fields which are extracted by "tokenize".
FUNCTIONS
decomment
    my $out = decomment ('/* comment */');
    # $out = " comment ";
Remove the comment markers from a string, leaving the text of the comment, as in the example above.
tokenize
    my $tokens = tokenize ($file);
Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:
- leading
    Any whitespace which comes before the token (called "leading whitespace").
- name
    The name of the token, which may be
    - comment
        A comment, like /* This */ or // this.
    - cpp
        A C preprocessor instruction like #define THIS 1 or #include "That.h".
    - char_const
        A character constant, like '\0' or 'a'.
    - grammar
        A piece of C "grammar", like { or ] or ->.
    - number
        A number such as 42.
    - word
        A word, which may be a variable name or a function name.
    - string
        A string, like "this", or even "like" "this".
    - reserved
        A C reserved word, like auto or goto.
    All of the fields which may be captured are available in the variable "@fields".
- $name
    The text of the token, stored under the key given by the token's name. For example, if $token->{name} equals 'comment', then the text of the comment is in $token->{comment}.
        if ($token->{name} eq 'string') {
            my $c_string = $token->{string};
        }
- line
    The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number always refers to the final line.
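To illustrate how the keys fit together, here is a small sketch which walks over a token list. It does not call "tokenize": the array reference $tokens is built by hand to follow the structure described above, so the values in it, the way each token is classified, and the helper token_text are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hand-built stand-in for the return value of tokenize, following the
# documented keys. The contents are illustrative only.
my $tokens = [
    {leading => '',   name => 'comment',  comment  => '/* Answer */', line => 1},
    {leading => "\n", name => 'reserved', reserved => 'int',          line => 2},
    {leading => ' ',  name => 'word',     word     => 'x',            line => 2},
    {leading => ' ',  name => 'grammar',  grammar  => '=',            line => 2},
    {leading => ' ',  name => 'number',   number   => '42',           line => 2},
    {leading => '',   name => 'grammar',  grammar  => ';',            line => 2},
];

# Hypothetical helper: the text of a token is stored under the key
# whose name is the value of $token->{name}.
sub token_text
{
    my ($token) = @_;
    return $token->{$token->{name}};
}

my %count;
for my $token (@$tokens) {
    # Count the tokens of each kind and list them with their line numbers.
    $count{$token->{name}}++;
    printf "%d: %s: %s\n", $token->{line}, $token->{name}, token_text ($token);
}
for my $name (sort keys %count) {
    print "$name: $count{$name}\n";
}
```

Looking the text up through $token->{name} avoids a chain of tests against each possible token name.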
BUGS
- Octal not parsed
    It does not parse octal expressions.
- No trigraphs
    No handling of trigraphs.
- Requires Perl 5.10
    This module uses named captures in regular expressions, so it requires Perl 5.10 or later.
- No line directives
    The line numbers provided by "tokenize" do not respect C line directives.
- Insufficient tests
    The module has been used somewhat, but the included tests do not exercise many of the features of C.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.