NAME
C::Tokenize - reduce a C file to a series of tokens
SYNOPSIS
# Remove all C preprocessor instructions from a C program:
use C::Tokenize '$cpp_re';
$c =~ s/$cpp_re//g;
# Print all the comments in a C program:
use C::Tokenize '$comment_re';
while ($c =~ /($comment_re)/g) {
    print "$1\n";
}
DESCRIPTION
This module provides a tokenizer which breaks C source code into its smallest meaningful components, and the regular expressions which match each of these components. For example, the module supplies a regular expression "$comment_re" which matches a C comment.
REGULAR EXPRESSIONS
The following regular expressions can be imported from this module using, for example,
use C::Tokenize '$cpp_re';
to import $cpp_re.
None of the following regular expressions does any capturing. If you want to capture, add your own parentheses around the regular expression.
- $trad_comment_re
  Match /* */ comments.
- $cxx_comment_re
  Match // comments.
- $comment_re
  Match both /* */ and // comments.
- $cpp_re
  Match a C preprocessor instruction.
- $char_const_re
  Match a character constant, such as 'a' or '\-'.
- $operator_re
  Match an operator such as + or --.
- $number_re
  Match a number, either integer, floating point, or hexadecimal. Does not do octal yet.
- $word_re
  Match a word, such as a function or variable name or a keyword of the language.
- $grammar_re
  Match other syntactic characters such as { or [.
- $single_string_re
  Match a single C string constant such as "this".
- $string_re
  Match a full-blown C string constant, including compound strings "like" "this".
- $reserved_re
  Match a C reserved word like auto or goto.
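Because the regular expressions do not capture, a match has to be wrapped in parentheses before its text can be retrieved. The following is a minimal sketch of that, assuming the module is installed; the C snippet and the variable names @numbers and @strings are invented for illustration.

```perl
use strict;
use warnings;
use C::Tokenize qw($number_re $string_re);

# A small piece of C source, made up for this example.
my $c = <<'EOF';
int answer = 42;
const char *greeting = "hello" " world";
EOF

# Add our own parentheses around each regex to capture the match.
my @numbers;
while ($c =~ /($number_re)/g) {
    push @numbers, $1;
}

my @strings;
while ($c =~ /($string_re)/g) {
    push @strings, $1;
}

print "number: $_\n" for @numbers;
print "string: $_\n" for @strings;
```

Reading $1 inside a while loop with the /g flag visits each match in turn, in the same way as the comment-printing loop in the SYNOPSIS.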
VARIABLES
@fields
@fields contains a list of all the fields which are extracted by "tokenize".
FUNCTIONS
decomment
my $out = decomment ('/* comment */');
# $out = " comment ";
Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
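Since "decomment" expects a string which begins and ends with comment marks, it pairs naturally with $trad_comment_re, which matches exactly such strings. The following is a sketch of that combination, assuming that decomment can be imported by name in the same way as the regular expressions; the C snippet and the name @texts are invented for illustration.

```perl
use strict;
use warnings;
use C::Tokenize qw(decomment $trad_comment_re);

# A small piece of C source, made up for this example.
my $c = <<'EOF';
int count; /* number of items */
int limit; /* upper bound */
EOF

# Each captured match begins with "/*" and ends with "*/",
# which is the form of input decomment requires.
my @texts;
while ($c =~ /($trad_comment_re)/g) {
    push @texts, decomment ($1);
}

print "comment text: '$_'\n" for @texts;
```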
tokenize
my $tokens = tokenize ($file);
Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:
- leading
  Any whitespace which comes before the token (called "leading whitespace").
- type
  The type of the token, which may be
  - comment
    A comment, like /* This */ or // this.
  - cpp
    A C preprocessor instruction like #define THIS 1 or #include "That.h".
  - char_const
    A character constant, like '\0' or 'a'.
  - grammar
    A piece of C "grammar", like { or ] or ->.
  - number
    A number such as 42.
  - word
    A word, which may be a variable name or a function.
  - string
    A string, like "this", or even "like" "this".
  - reserved
    A C reserved word, like auto or goto.
All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:
use C::Tokenize '@fields';
- $name
  The value of the type, stored under a key which has the same name as the type. For example, if $token->{type} equals 'comment', then the value of the type is in $token->{comment}.
      if ($token->{type} eq 'string') {
          my $c_string = $token->{string};
      }
- line
  The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
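The keys described above can be read in a simple loop over the return value. The following is a sketch only, under two assumptions which the text above does not confirm: that "tokenize" can be imported by name, and that it accepts the C source text itself as a string. The C snippet is invented for illustration.

```perl
use strict;
use warnings;
use C::Tokenize 'tokenize';

# Assumption: tokenize takes the C source text, not a file name.
my $c = <<'EOF';
/* Add one to x. */
int inc (int x) { return x + 1; }
EOF

my $tokens = tokenize ($c);

for my $token (@$tokens) {
    my $type = $token->{type};
    # The text of the token is stored under the key named after its type.
    my $value = $token->{$type};
    printf "%-10s line %d: %s\n", $type, $token->{line}, $value;
}
```

Looking the value up through $token->{$type} avoids writing a separate branch for each of the possible types.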
EXPORTS
use C::Tokenize ':all';
exports all the regular expressions from the module.
SEE ALSO
The regular expressions contained in this module are shown at this web page: http://www.lemoda.net/c/c-regex/index.html.
BUGS
- Octal not parsed
  It does not parse octal expressions.
- No trigraphs
  No handling of trigraphs.
- Requires Perl 5.10
  This module uses named captures in regular expressions, so it requires Perl 5.10 or later.
- No line directives
  The line numbers provided by "tokenize" do not respect C line directives.
- Insufficient tests
  The module has been used somewhat, but the included tests do not exercise many of the features of C.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012-2014 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.