NAME
C::Tokenize - reduce a C file to a series of tokens
SYNOPSIS
# Remove all C preprocessor instructions from a C program:
my $c = <<EOF;
#define X Y
#ifdef X
int X;
#endif
EOF
use C::Tokenize '$cpp_re';
$c =~ s/$cpp_re//g;
print "$c\n";
produces output
int X;
(This example is included as synopsis-cpp.pl in the distribution.)
# Print all the comments in a C program:
my $c = <<EOF;
/* This is the main program. */
int main ()
{
    int i;
    /* Increment i by 1. */
    i++;
    // Now exit with zero status.
    return 0;
}
EOF
use C::Tokenize '$comment_re';
while ($c =~ /($comment_re)/g) {
    print "$1\n";
}
produces output
/* This is the main program. */
/* Increment i by 1. */
// Now exit with zero status.
(This example is included as synopsis-comment.pl in the distribution.)
VERSION
This documents version 0.12 of C::Tokenize corresponding to git commit ae2d55568809a6b5cc23d66948cdd87cf6d0f98f released on Wed Dec 7 09:47:21 2016 +0900.
DESCRIPTION
This module provides a tokenizer, "tokenize", which breaks C source code into its smallest meaningful components, and the regular expressions which match each of these components. For example, the module supplies a regular expression "$comment_re" which matches a C comment line.
It also supplies some extra regular expressions, such as "$include_local" for local include statements and "$cvar_re" for C variables, as well as extra functions, such as "decomment" for stripping the comment marks from traditional C comments.
REGULAR EXPRESSIONS
The following regular expressions can be imported from this module using, for example,
use C::Tokenize '$cpp_re'
to import $cpp_re.
Most of the following regular expressions do not do any capturing, except where noted. If you want to capture, add your own parentheses around the regular expression.
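As a minimal sketch of adding your own parentheses, the following wraps "$word_re" in a capturing group so that each match is available in $1. The input string and the variable names $c and @words are made up for illustration.

```perl
use strict;
use warnings;
use C::Tokenize '$word_re';

# A small piece of C text to search (hypothetical example input).
my $c = 'total = count;';

# $word_re is documented as non-capturing, so wrap it in parentheses
# to get each match in $1.
my @words;
while ($c =~ /($word_re)/g) {
    push @words, $1;
}
print "Found word: $_\n" for @words;
```

The same pattern should work for any of the non-capturing regular expressions listed below.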
- $trad_comment_re
-
Match /* */ comments.
- $cxx_comment_re
-
Match // comments.
- $comment_re
-
Match both /* */ and // comments.
- $cpp_re
-
Match a C preprocessor instruction.
- $char_const_re
-
Match a character constant, such as 'a' or '\-'.
- $operator_re
-
Match an operator such as + or --.
- $number_re
-
Match a number, either integer, floating point, or hexadecimal. Does not do octal yet.
- $word_re
-
Match a word, such as a function or variable name or a keyword of the language.
- $grammar_re
-
Match other syntactic characters such as { or [.
- $single_string_re
-
Match a single C string constant such as "this".
- $string_re
-
Match a full-blown C string constant, including compound strings "like" "this".
- $reserved_re
-
Match a C reserved word like auto or goto.
- $include_local
-
Match an include statement which uses double quotes, like #include "some.c".
This captures the entire statement in $1 and the file name in $2.
This was added in version 0.11 of C::Tokenize.
- $include
-
Match any include statement, like #include <stdio.h>.
This captures the entire statement in $1 and the file name in $2.
use C::Tokenize '$include';
my $c = <<EOF;
#include <this.h>
#include "that.h"
EOF
while ($c =~ /$include/g) {
    print "Include statement $1 includes file $2.\n";
}
produces output
Include statement #include <this.h> includes file this.h.
Include statement #include "that.h" includes file that.h.
(This example is included as includes.pl in the distribution.)
This was added in version 0.12 of C::Tokenize.
- $cvar_re
-
This matches a C variable, for example anything which may be an lvalue or a function argument.
use C::Tokenize '$cvar_re';
my $c = 'func (x->y, & z, ** a, & q);';
while ($c =~ /[,\(]\s*($cvar_re)/g) {
    print "$1 is a C variable.\n";
}
produces output
x->y is a C variable.
& z is a C variable.
** a is a C variable.
& q is a C variable.
(This example is included as cvar.pl in the distribution.)
This was added in version 0.11 of C::Tokenize.
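The list above gives an example for "$include" but not for "$include_local". A minimal sketch of the latter, relying only on the documented behaviour that the file name is captured in $2, might look like this. The header names and the variable @local are made up for illustration.

```perl
use strict;
use warnings;
use C::Tokenize '$include_local';

# Hypothetical input with one system include and one local include.
my $c = <<'EOF';
#include <stdio.h>
#include "local.h"
EOF

# Collect only the files included with double quotes.
my @local;
while ($c =~ /$include_local/g) {
    push @local, $2;
}
print "Local include: $_\n" for @local;
```

Only the double-quoted include should be reported, since the angle-bracket form is left to "$include".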
VARIABLES
@fields
@fields contains a list of all the fields which are extracted by "tokenize".
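One possible use of @fields, sketched here, is to build a lookup table for checking whether a string names a token type. The hash name %is_field is hypothetical.

```perl
use strict;
use warnings;
use C::Tokenize '@fields';

# Build a lookup table of the token types that "tokenize" can report.
my %is_field = map { $_ => 1 } @fields;

print "Token types: @fields\n";
print "'comment' is a token type.\n" if $is_field{comment};
```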
FUNCTIONS
decomment
my $out = decomment ('/* comment */');
# $out = " comment ";
Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
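Because the string has to begin and end with comment marks, "decomment" pairs naturally with "$trad_comment_re". The following is a sketch of that combination, with a made-up input and a hypothetical array @contents.

```perl
use strict;
use warnings;
use C::Tokenize qw/decomment $trad_comment_re/;

# Hypothetical input containing one traditional comment.
my $c = <<'EOF';
/* This is the main program. */
int main ()
{
    return 0;
}
EOF

# Capture each traditional comment, then strip its comment marks.
my @contents;
while ($c =~ /($trad_comment_re)/g) {
    push @contents, decomment ($1);
}
print "Comment text: '$_'\n" for @contents;
```

The whitespace inside the comment marks is kept, as in the " comment " example above.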
tokenize
my $tokens = tokenize ($file);
Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:
- leading
-
Any whitespace which comes before the token (called "leading whitespace").
- type
-
The type of the token, which may be
    - comment
    -
    A comment, like /* This */ or // this.
    - cpp
    -
    A C preprocessor instruction like #define THIS 1 or #include "That.h".
    - char_const
    -
    A character constant, like '\0' or 'a'.
    - grammar
    -
    A piece of C "grammar", like { or ] or ->.
    - number
    -
    A number such as 42.
    - word
    -
    A word, which may be a variable name or a function.
    - string
    -
    A string, like "this", or even "like" "this".
    - reserved
    -
    A C reserved word, like auto or goto.
All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:
use C::Tokenize '@fields';
- $name
-
The value of the token, stored under the key named by its type. For example, if $token->{type} equals 'comment', then the text of the comment is in $token->{comment}.
if ($token->{type} eq 'string') {
    my $c_string = $token->{string};
}
- line
-
The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
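The keys described above can be read in a simple loop. This is a sketch only: it assumes that "tokenize" is given the text of the C program rather than the name of a file, which the name $file leaves unclear, and the input and the variables $type and $text are made up for illustration.

```perl
use strict;
use warnings;
use C::Tokenize 'tokenize';

# Hypothetical input. This sketch assumes "tokenize" is given the text
# of the C program rather than the name of a file.
my $c = <<'EOF';
/* Entry point. */
int main ()
{
    return 0;
}
EOF

my $tokens = tokenize ($c);
for my $token (@$tokens) {
    # The "type" key names the kind of token; the key with that name
    # holds the token's text.
    my $type = $token->{type};
    my $text = $token->{$type};
    print "line $token->{line}: $type: $text\n";
}
```

Looking up the text through the type name avoids a separate branch for each kind of token.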
EXPORTS
use C::Tokenize ':all';
exports all the regular expressions and functions from the module.
SEE ALSO
The regular expressions contained in this module are shown at this web page.
There is also an example of using C::Tokenize (version 0.12) to remove unnecessary header inclusions from C files.
There is a C to HTML converter in the examples subdirectory of the distribution called c2html.pl.
BUGS
- Octal not parsed
-
It does not parse octal expressions.
- No trigraphs
-
No handling of trigraphs.
- Requires Perl 5.10
-
This module uses named captures in regular expressions, so it requires Perl 5.10 or later.
- No line directives
-
The line numbers provided by "tokenize" do not respect C line directives.
- Insufficient tests
-
The module has been used somewhat, but the included tests do not exercise many of the features of C.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012-2016 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.