NAME
C::Tokenize - reduce a C file to a series of tokens
SYNOPSIS
# Remove all C preprocessor instructions from a C program:
my $c = <<EOF;
#define X Y
#ifdef X
int X;
#endif
EOF
use C::Tokenize '$cpp_re';
$c =~ s/$cpp_re//g;
print "$c\n";
produces output
int X;
(This example is included as synopsis-cpp.pl in the distribution.)
# Print all the comments in a C program:
my $c = <<EOF;
/* This is the main program. */
int main ()
{
int i;
/* Increment i by 1. */
i++;
// Now exit with zero status.
return 0;
}
EOF
use C::Tokenize '$comment_re';
while ($c =~ /($comment_re)/g) {
print "$1\n";
}
produces output
/* This is the main program. */
/* Increment i by 1. */
// Now exit with zero status.
(This example is included as synopsis-comment.pl in the distribution.)
VERSION
This documents version 0.11 of C::Tokenize corresponding to git commit 98692825df8419440be4fa40fd09df0690472299 released on Tue Sep 6 10:43:09 2016 +0900.
DESCRIPTION
This module provides a tokenizer, "tokenize", which breaks C source code into its smallest meaningful components, and the regular expressions which match each of these components. For example, the module supplies a regular expression "$comment_re" which matches a C comment line.
It also supplies some extra regular expressions, such as "$include_local" for local include statements and "$cvar_re" for C variables, as well as an extra function, "decomment", for removing the comment marks from traditional C comments.
REGULAR EXPRESSIONS
The following regular expressions can be imported from this module using, for example,
use C::Tokenize '$cpp_re'
to import $cpp_re.
Most of the following regular expressions do not do any capturing, except where noted. If you want to capture, add your own parentheses around the regular expression.
- $trad_comment_re
-
Match /* */ comments.
- $cxx_comment_re
-
Match // comments.
- $comment_re
-
Match both /* */ and // comments.
- $cpp_re
-
Match a C preprocessor instruction.
- $char_const_re
-
Match a character constant, such as 'a' or '\-'.
- $operator_re
-
Match an operator such as + or --.
- $number_re
-
Match a number, either integer, floating point, or hexadecimal. Does not do octal yet.
- $word_re
-
Match a word, such as a function or variable name or a keyword of the language.
- $grammar_re
-
Match other syntactic characters such as { or [.
- $single_string_re
-
Match a single C string constant such as "this".
- $string_re
-
Match a full-blown C string constant, including compound strings "like" "this".
- $reserved_re
-
Match a C reserved word like auto or goto.
- $include_local
-
Match an include statement which uses double quotes, like #include "some.c". This captures the entire statement in $1 and the file name in $2.
- $cvar_re
-
This matches a C variable, for example anything which may be an lvalue or a function argument.
use C::Tokenize '$cvar_re';
my $c = 'func (x->y, & z, ** a, & q);';
while ($c =~ /[,\(]\s*($cvar_re)/g) {
    print "$1 is a C variable.\n";
}
produces output
x->y is a C variable.
& z is a C variable.
** a is a C variable.
& q is a C variable.
(This example is included as cvar.pl in the distribution.)
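The captures of "$include_local" described above can be used to list the local files which a C program includes. The following is a minimal sketch, not an example from the distribution; the C text and the variable name @files are made up for illustration.

```perl
use strict;
use warnings;
use C::Tokenize '$include_local';

# Hypothetical C source with one system include and two local includes.
my $c = <<'EOF';
#include <stdio.h>
#include "some.h"
#include "other.c"
EOF

my @files;
# $1 holds the whole include statement and $2 holds the file name.
while ($c =~ /$include_local/g) {
    print "Statement: $1\n";
    push @files, $2;
}
print "Local file: $_\n" for @files;
```

The system include in angle brackets is not expected to match, since "$include_local" is documented as matching only the double-quoted form.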
VARIABLES
@fields
@fields contains a list of all the fields which are extracted by "tokenize".
FUNCTIONS
decomment
my $out = decomment ('/* comment */');
# $out = " comment ";
Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
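Since the string has to begin and end with comment marks, one way to use "decomment" on a whole program is to pull out each comment with "$trad_comment_re" first. This is a rough sketch under that assumption; the C text and the variable name @texts are invented for the example, and the whitespace trimming is an extra step, not something "decomment" does.

```perl
use strict;
use warnings;
use C::Tokenize ':all';

# Hypothetical C fragment containing two traditional comments.
my $c = <<'EOF';
/* Increment i by 1. */
i++;
/* Return to the caller. */
return;
EOF

my @texts;
while ($c =~ /($trad_comment_re)/g) {
    # $1 begins with /* and ends with */, as decomment requires.
    my $text = decomment ($1);
    # decomment keeps the spaces inside the marks, so trim them here.
    $text =~ s/^\s+|\s+$//g;
    push @texts, $text;
}
print "$_\n" for @texts;
```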
tokenize
my $tokens = tokenize ($file);
Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:
- leading
-
Any whitespace which comes before the token (called "leading whitespace").
- type
-
The type of the token, which may be
- comment
-
A comment, like /* This */ or // this.
- cpp
-
A C preprocessor instruction like #define THIS 1 or #include "That.h".
- char_const
-
A character constant, like '\0' or 'a'.
- grammar
-
A piece of C "grammar", like { or ] or ->.
- number
-
A number such as 42.
- word
-
A word, which may be a variable name or a function.
- string
-
A string, like "this", or even "like" "this".
- reserved
-
A C reserved word, like auto or goto.
All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:
use C::Tokenize '@fields';
- $name
-
The value of the type. For example, if $token->{name} equals 'comment', then the value of the type is in $token->{comment}.
if ($token->{name} eq 'string') {
    my $c_string = $token->{string};
}
- line
-
The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
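The keys above might be used as in the following sketch, which walks over the tokens and prints whichever field from "@fields" was captured for each one. It assumes that "tokenize" is given the text of the C program as a string, which the description above does not state explicitly; the C text is made up for the example.

```perl
use strict;
use warnings;
use C::Tokenize ':all';
use C::Tokenize '@fields';

# Assumption: tokenize takes the C source text, not a file name.
my $c = <<'EOF';
/* Set x. */
int x = 42;
EOF

my $tokens = tokenize ($c);
for my $token (@$tokens) {
    # Find the field which holds this token's value.
    for my $field (@fields) {
        if (defined $token->{$field}) {
            print "$token->{line}: $field: $token->{$field}\n";
        }
    }
}
```

Looping over "@fields" avoids hard-coding the list of token types in the calling code.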
EXPORTS
use C::Tokenize ':all';
exports all the regular expressions and functions from the module.
SEE ALSO
The regular expressions contained in this module are shown at this web page: http://www.lemoda.net/c/c-regex/index.html.
BUGS
- Octal not parsed
-
It does not parse octal expressions.
- No trigraphs
-
No handling of trigraphs.
- Requires Perl 5.10
-
This module uses named captures in regular expressions, so it requires Perl 5.10 or later.
- No line directives
-
The line numbers provided by "tokenize" do not respect C line directives.
- Insufficient tests
-
The module has been used somewhat, but the included tests do not exercise many of the features of C.
AUTHOR
Ben Bullock, <bkb@cpan.org>
Request
If you'd like to see this module continued, let me know that you're using it. For example, send an email, write a bug report, star the project's github repository, add a patch, add a ++ on Metacpan.org, or write a rating at CPAN ratings. It really does make a difference. Thanks.
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012-2016 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.