NAME

PPR::X - Pattern-based Perl Recognizer

VERSION

This document describes PPR::X version 0.001009

SYNOPSIS

use PPR::X;

# Define a regex that will match an entire Perl document...
my $perl_document = qr{

    # What to match            # Install the (?&PerlEntireDocument) rule
    (?&PerlEntireDocument)     $PPR::X::GRAMMAR

}x;


# Define a regex that will match a single Perl block...
my $perl_block = qr{

    # What to match...         # Install the (?&PerlBlock) rule...
    (?&PerlBlock)              $PPR::X::GRAMMAR
}x;


# Define a regex that will match a simple Perl extension...
my $perl_coroutine = qr{

    # What to match...
    coro                                           (?&PerlOWS)
    (?<coro_name>  (?&PerlQualifiedIdentifier)  )  (?&PerlOWS)
    (?<coro_code>  (?&PerlBlock)                )

    # Install the necessary subrules...
    $PPR::X::GRAMMAR
}x;


# Define a regex that will match an integrated Perl extension...
my $perl_with_classes = qr{

    # What to match...
    \A
        (?&PerlOWS)       # Optional whitespace (including comments)
        (?&PerlDocument)  # A full Perl document
        (?&PerlOWS)       # More optional whitespace
    \Z

    # Add a 'class' keyword into the syntax that PPR::X understands...
    (?(DEFINE)
        (?<PerlKeyword>

                class                              (?&PerlOWS)
                (?&PerlQualifiedIdentifier)        (?&PerlOWS)
            (?: is (?&PerlNWS) (?&PerlIdentifier)  (?&PerlOWS) )*+
                (?&PerlBlock)
        )

        (?<kw_balanced_parens>
            \( (?: [^()]++ | (?&kw_balanced_parens) )*+ \)
        )
    )

    # Install the necessary standard subrules...
    $PPR::X::GRAMMAR
}x;

DESCRIPTION

The PPR::X module provides a single regular expression that defines a set of independent subpatterns suitable for matching entire Perl documents, as well as a wide range of individual syntactic components of Perl (statements, expressions, control blocks, variables, etc.).

The regex does not "parse" Perl (that is, it does not build a syntax tree, like the PPI module does). Instead it simply "recognizes" standard Perl constructs, or new syntaxes composed from Perl constructs.

Its features and capabilities therefore complement those of the PPI module, rather than replacing them. See "Comparison with PPI".

INTERFACE

Importing and using the Perl grammar regex

The PPR::X module exports no subroutines or variables, and provides no methods. Instead, it defines a single package variable, $PPR::X::GRAMMAR, which can be interpolated into regexes to add rules that permit Perl constructs to be parsed:

$source_code =~ m{ (?&PerlEntireDocument)  $PPR::X::GRAMMAR }x;

Note that all the examples shown so far have interpolated this "grammar variable" at the end of the regular expression. This placement is desirable, but not necessary. Both of the following work identically:

$source_code =~ m{ (?&PerlEntireDocument)   $PPR::X::GRAMMAR }x;

$source_code =~ m{ $PPR::X::GRAMMAR   (?&PerlEntireDocument) }x;

However, if the grammar is to be extended, then the extensions must be specified before the base grammar (i.e. before the interpolation of $PPR::X::GRAMMAR). Placing the grammar variable at the end of a regex ensures that will be the case, and has the added advantage of "front-loading" the regex with the most important information: what is actually going to be matched.

Note too that, because the PPR::X grammar internally uses capture groups, placing $PPR::X::GRAMMAR anywhere other than the very end of your regex may change the numbering of any explicit capture groups in your regex. For complete safety, regexes that use the PPR::X grammar should probably use named captures, instead of numbered captures.
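
The renumbering can be reproduced without PPR::X itself. The following is only a sketch, and it assumes a tiny stand-in grammar: $MINI_GRAMMAR and its (?&Word) rule are hypothetical names invented for this illustration, not part of the module. It shows a numbered capture moving from $1 to $2 when the grammar is interpolated first, while a named capture is unaffected:

```perl
use strict;
use warnings;

# Hypothetical stand-in for $PPR::X::GRAMMAR: its single rule is itself
# a (named) capture group, so it occupies a group number in any regex
# into which it is interpolated...
my $MINI_GRAMMAR = qr{ (?(DEFINE) (?<Word> \w++ ) ) }x;

my %captured;

# Grammar at the end: the explicit capture is group 1...
if ('foo' =~ m{ ( (?&Word) )  $MINI_GRAMMAR }x) {
    $captured{grammar_last} = $1;
}

# Grammar at the start: the grammar's own group is now group 1,
# so the explicit capture has been pushed along to group 2...
if ('foo' =~ m{ $MINI_GRAMMAR  ( (?&Word) ) }x) {
    $captured{grammar_first} = $2;
}

# A named capture is found by name, wherever the grammar is placed...
if ('foo' =~ m{ $MINI_GRAMMAR  (?<word> (?&Word) ) }x) {
    $captured{named} = $+{word};
}

print "$_: $captured{$_}\n" for sort keys %captured;
```

All three entries in %captured hold the same text, but only the named capture could be written without knowing where the grammar sits in the regex.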

Error reporting

Regex-based parsing is all-or-nothing: either your regex matches (and returns any captures you requested), or it fails to match (and returns nothing).

This can make it difficult to detect why a PPR::X-based match failed; to work out what the "bad source code" was that prevented your regex from matching.

So the module provides a special variable that attempts to detect the source code that prevented any call to the (?&PerlStatement) subpattern from matching. That variable is: $PPR::X::ERROR

$PPR::X::ERROR is only set if it is undefined at the point where an error is detected, and will only be set to the first such error that is encountered during parsing.

Note that errors are only detected when matching context-sensitive components (for example, in the middle of a (?&PerlStatement), as part of a (?&PerlContextualRegex), or at the end of a (?&PerlEntireDocument)). Errors, especially errors at the end of otherwise valid code, will often not be detected in context-free components (for example, at the end of a (?&PerlStatementSequence), as part of a (?&PerlRegex), or at the end of a (?&PerlDocument)).

A common mistake in this area is to attempt to match an entire Perl document using:

m{ \A (?&PerlDocument) \Z   $PPR::X::GRAMMAR }x

instead of:

m{ (?&PerlEntireDocument)   $PPR::X::GRAMMAR }x

Only the second approach will be able to successfully detect an unclosed curly bracket at the end of the document.

PPR::X::ERROR interface

If it is set, $PPR::X::ERROR will contain an object of type PPR::X::ERROR, with the following methods:

$PPR::X::ERROR->origin($line, $file)

Returns a clone of the PPR::X::ERROR object that now believes that the source code parsing failure it is reporting occurred in a code fragment starting at the specified line and file. If the second argument is omitted, the file name is not reported in any diagnostic.

$PPR::X::ERROR->source()

Returns a string containing the specific source code that could not be parsed as a Perl statement.

$PPR::X::ERROR->prefix()

Returns a string containing all the source code preceding the code that could not be parsed. That is: the valid code that is the preceding context of the unparsable code.

$PPR::X::ERROR->line( $opt_offset )

Returns an integer which is the line number at which the unparsable code was encountered. If the optional "offset" argument is provided, it will be added to the line number returned. Note that the offset is ignored if the PPR::X::ERROR object originates from a prior call to $PPR::X::ERROR->origin (because in that case you will have already specified the correct offset).

$PPR::X::ERROR->diagnostic()

Returns a string containing the diagnostic that would be returned by perl -c if the source code were compiled.

Warning: The diagnostic is obtained by partially eval'ing the source code. This means that run-time code will not be executed, but BEGIN and CHECK blocks will run. Do not call this method if the source code that created this error might also have non-trivial compile-time side-effects.

A typical use might therefore be:

# Make sure it's undefined, and will only be locally modified...
local $PPR::X::ERROR;

# Process the matched block...
if ($source_code =~ m{ (?<Block> (?&PerlBlock) )  $PPR::X::GRAMMAR }x) {
    process( $+{Block} );
}

# Or report the offending code that stopped it being a valid block...
else {
    die "Invalid Perl block: " . $PPR::X::ERROR->source . "\n",
        $PPR::X::ERROR->origin($linenum, $filename)->diagnostic . "\n";
}

Decommenting code with PPR::X::decomment()

The module provides (but does not export) a decomment() subroutine that can remove any comments and/or POD from source code.

It takes a single argument: a string containing the source code. It returns a single value: a string containing the decommented source code.

For example:

$decommented_code = PPR::X::decomment( $commented_code );

The subroutine will fail if the argument wasn't valid Perl code, in which case it returns undef and sets $PPR::X::ERROR to indicate where the invalid source code was encountered.

Note that, due to separate bugs in the regex engine in Perl 5.14 and 5.20, the decomment() subroutine is not available when running under these releases.

Examples

Note: In each of the following examples, the subroutine slurp() is used to acquire the source code from a file whose name is passed as its argument. The slurp() subroutine is just:

sub slurp { local (*ARGV, $/); @ARGV = shift; readline; }

or, for the less twisty-minded:

sub slurp {
    my ($filename) = @_;
    open my $filehandle, '<', $filename or die $!;
    local $/;
    return readline($filehandle);
}

Validating source code

# "Valid" if source code matches a Perl document under the Perl grammar
printf(
    "$filename %s a valid Perl file\n",
    slurp($filename) =~ m{ (?&PerlEntireDocument)  $PPR::X::GRAMMAR }x
        ? "is"
        : "is not"
);

Counting statements

printf(                                        # Output
    "$filename contains %d statements\n",      # a report of
    scalar                                     # the count of
        grep {defined}                         # defined matches
            slurp($filename)                   # from the source code,
                =~ m{
                      \G (?&PerlOWS)           # skipping whitespace
                         ((?&PerlStatement))   # and keeping statements,
                      $PPR::X::GRAMMAR         # using the Perl grammar
                    }gcx                       # incrementally
);

Stripping comments and POD from source code

my $source = slurp($filename);                    # Get the source
$source =~ s{ (?&PerlNWS)  $PPR::X::GRAMMAR }{ }gx;  # Compact whitespace
print $source;                                    # Print the result

Stripping comments and POD from source code (in Perl v5.14 or later)

# Print  the source code,  having compacted whitespace...
  print  slurp($filename)  =~ s{ (?&PerlNWS)  $PPR::X::GRAMMAR }{ }gxr;

Stripping everything except comments and POD from source code

say                                         # Output
    grep {defined}                          # defined matches
        slurp($filename)                    # from the source code,
            =~ m{ \G ((?&PerlOWS))          # keeping whitespace,
                     (?&PerlStatement)?     # skipping statements,
                  $PPR::X::GRAMMAR          # using the Perl grammar
                }gcx;                       # incrementally

Available rules

Interpolating $PPR::X::GRAMMAR in a regex makes all of the following rules available within that regex.

Note that other rules not listed here may also be added, but these are all considered strictly internal to the PPR::X module and are not guaranteed to continue to exist in future releases. All such "internal-use-only" rules have names that start with PPR_X_...

(?&PerlDocument)

Matches a valid Perl document, including leading or trailing whitespace, comments, and any final __DATA__ or __END__ section.

This rule is context-free, so it can be embedded in a larger regex. For example, to match an embedded chunk of Perl code, delimited by <<<...>>>:

$src =~ m{ <<< (?&PerlDocument) >>>   $PPR::X::GRAMMAR }x;

(?&PerlEntireDocument)

Matches an entire valid Perl document, including leading or trailing whitespace, comments, and any final __DATA__ or __END__ section.

This rule is not context-free. It has an internal \A at the beginning and \Z at the end, so a regex containing (?&PerlEntireDocument) will only match if:

(a)

the (?&PerlEntireDocument) is the sole top-level element of the regex (or at least the sole element of a single top-level |-branch of the regex),

and

(b)

the entire string being matched contains only a single valid Perl document.

In general, if you want to check that a string consists entirely of a single valid sequence of Perl code, use:

$str =~ m{ (?&PerlEntireDocument)  $PPR::X::GRAMMAR }x

If you want to check that a string contains at least one valid sequence of Perl code at some point, possibly embedded in other text, use:

$str =~ m{ (?&PerlDocument)  $PPR::X::GRAMMAR }x

(?&PerlStatementSequence)

Matches zero-or-more valid Perl statements, separated by optional POD sequences.

(?&PerlStatement)

Matches a single valid Perl statement, including: control structures; BEGIN, CHECK, UNITCHECK, INIT, END, DESTROY, or AUTOLOAD blocks; variable declarations, use statements, etc.

(?&PerlExpression)

Matches a single valid Perl expression involving operators of any precedence, but not any kind of block (i.e. not control structures, BEGIN blocks, etc.) nor any trailing statement modifier (e.g. not a postfix if, while, or for).

(?&PerlLowPrecedenceNotExpression)

Matches an expression at the precedence of the not operator. That is, a single valid Perl expression that involves operators above the precedence of and.

(?&PerlAssignment)

Matches an assignment expression. That is, a single valid Perl expression involving operators above the precedence of comma (, or =>).

(?&PerlConditionalExpression) or (?&PerlScalarExpression)

Matches a conditional expression that uses the ?...: ternary operator. That is, a single valid Perl expression involving operators above the precedence of assignment.

The alternative name comes from the fact that anything matching this rule is what most people think of as a single element of a comma-separated list.

(?&PerlBinaryExpression)

Matches an expression that uses any high-precedence binary operators. That is, a single valid Perl expression involving operators above the precedence of the ternary operator.

(?&PerlPrefixPostfixTerm)

Matches a term with optional prefix and/or postfix unary operators and/or a trailing sequence of -> dereferences. That is, a single valid Perl expression involving operators above the precedence of exponentiation (**).

(?&PerlTerm)

Matches a simple high-precedence term within a Perl expression. That is: a subroutine or builtin function call; a variable declaration; a variable or typeglob lookup; an anonymous array, hash, or subroutine constructor; a quotelike or numeric literal; a regex match; a substitution; a transliteration; a do or eval block; or any other expression in surrounding parentheses.

(?&PerlTermPostfixDereference)

Matches a sequence of array- or hash-lookup brackets, or subroutine call parentheses, or a postfix dereferencer (e.g. ->$*), with explicit or implicit intervening ->, such as might appear after a term.

(?&PerlLvalue)

Matches any variable or parenthesized list of variables that could be assigned to.

(?&PerlPackageDeclaration)

Matches the declaration of any package (with or without a defining block).

(?&PerlSubroutineDeclaration)

Matches the declaration of any named subroutine (with or without a defining block).

(?&PerlUseStatement)

Matches a use <module name> ...; or use <version number>; statement.

(?&PerlReturnStatement)

Matches a return <expression>; or return; statement.

(?&PerlReturnExpression)

Matches a return <expression> as an expression without trailing end-of-statement markers.

(?&PerlControlBlock)

Matches an if, unless, while, until, for, or foreach statement, including its block.

(?&PerlDoBlock)

Matches a do-block expression.

(?&PerlEvalBlock)

Matches an eval-block expression.

(?&PerlTryCatchFinallyBlock)

Matches a try block, followed by an optional catch block, followed by an optional finally block, using the built-in syntax introduced in Perl v5.34 and v5.36.

Note that if your code uses one of the many CPAN modules (such as Try::Tiny or TryCatch) that provided try/catch behaviours prior to Perl v5.34, then you will most likely need to override this subrule to match the alternate try/catch syntax provided by your preferred module.

For example, if your code uses the TryCatch module, you would need to alter the PPR::X parser by explicitly redefining the subrule for try blocks, with something like:

my $MATCH_A_PERL_DOCUMENT = qr{

    \A (?&PerlEntireDocument) \Z

    (?(DEFINE)
        # Redefine this subrule to match TryCatch syntax...
        (?<PerlTryCatchFinallyBlock>
                try                                  (?>(?&PerlOWS))
                (?>(?&PerlBlock))
            (?:                                      (?>(?&PerlOWS))
                catch                                (?>(?&PerlOWS))
            (?: \( (?>(?&PPR_X_balanced_parens)) \)    (?>(?&PerlOWS))  )?+
                (?>(?&PerlBlock))
            )*+
        )
    )

    $PPR::X::GRAMMAR
}xms;

Note that the popular Try::Tiny module actually implements try/catch as a normally parsed Perl subroutine call expression, rather than a statement. This means that the unmodified PPR::X grammar can successfully parse all the module's constructs.

However, the unmodified PPR::X grammar may misclassify some Try::Tiny usages as being built-in Perl v5.36 try blocks followed by an unrelated call to the catch subroutine, rather than identifying the try and catch as a single expression containing two subroutine calls.

If that difference in interpretation matters to you, you can deactivate the built-in Perl v5.36 try/catch syntax entirely, like so:

my $MATCH_A_PERL_DOCUMENT = qr{
    \A (?&PerlEntireDocument) \Z

    (?(DEFINE)
        # Turn off built-in try/catch syntax...
        (?<PerlTryCatchFinallyBlock>   (?!)  )

        # Decanonize 'try' and 'catch' as reserved words ineligible for sub names...
        (?<PPR_X_non_reserved_identifier>
            (?! (?> for(?:each)?+ | while   | if    | unless | until | given | when   | default
                |   sub | format  | use     | no    | my     | our   | state  | defer | finally
                # Note: Removed 'try' and 'catch' which appear here in the original subrule
                |   (?&PPR_X_named_op)
                |   [msy] | q[wrxq]?+ | tr
                |   __ (?> END | DATA ) __
                )
                \b
            )
            (?>(?&PerlQualifiedIdentifier))
            (?! :: )
        )

    )

    $PPR::X::GRAMMAR
}xms;

For more details and options for modifying PPR::X grammars in this way, see "Extending the Perl syntax in other ways".

(?&PerlStatementModifier)

Matches an if, unless, while, until, for, or foreach modifier that could appear after a statement. Only matches the modifier, not the preceding statement.

(?&PerlFormat)

Matches a format declaration, including its terminating "dot".

(?&PerlBlock)

Matches a {...}-delimited block containing zero-or-more statements.

(?&PerlCall)

Matches a call to a subroutine or built-in function. Accepts all valid call syntaxes, via either a literal name or a reference, with or without a leading &, with or without arguments, with or without parentheses on any argument list.

(?&PerlAttributes)

Matches a list of colon-preceded attributes, such as might be specified on the declaration of a subroutine or a variable.

(?&PerlCommaList)

Matches a list of zero-or-more comma-separated subexpressions. That is, a single valid Perl expression that involves operators above the precedence of not.

(?&PerlParenthesesList)

Matches a list of zero-or-more comma-separated subexpressions inside a set of parentheses.

(?&PerlList)

Matches either a parenthesized or unparenthesized list of comma-separated subexpressions. That is, matches anything that either of the two preceding rules would match.

(?&PerlAnonymousArray)

Matches an anonymous array constructor. That is: a list of zero-or-more subexpressions inside square brackets.

(?&PerlAnonymousHash)

Matches an anonymous hash constructor. That is: a list of zero-or-more subexpressions inside curly brackets.

(?&PerlArrayIndexer)

Matches a valid indexer that could be applied to look up elements of an array. That is: a list of one-or-more subexpressions inside square brackets.

(?&PerlHashIndexer)

Matches a valid indexer that could be applied to look up entries of a hash. That is: a list of one-or-more subexpressions inside curly brackets, or a simple bareword identifier inside curly brackets.

(?&PerlDiamondOperator)

Matches anything in angle brackets. That is: any "diamond" readline (e.g. <$filehandle>) or file-glob operation (e.g. <*.pl>).

(?&PerlComma)

Matches a short (,) or long (=>) comma.

(?&PerlPrefixUnaryOperator)

Matches any high-precedence prefix unary operator.

(?&PerlPostfixUnaryOperator)

Matches any high-precedence postfix unary operator.

(?&PerlInfixBinaryOperator)

Matches any infix binary operator whose precedence is between .. and **.

(?&PerlAssignmentOperator)

Matches any assignment operator, including all op= variants.

(?&PerlLowPrecedenceInfixOperator)

Matches and, or, or xor.

(?&PerlAnonymousSubroutine)

Matches an anonymous subroutine.

(?&PerlVariable)

Matches any type of access on any scalar, array, or hash variable.

(?&PerlVariableScalar)

Matches any scalar variable, including fully qualified package variables, punctuation variables, scalar dereferences, and the $#array syntax.

(?&PerlVariableArray)

Matches any array variable, including fully qualified package variables, punctuation variables, and array dereferences.

(?&PerlVariableHash)

Matches any hash variable, including fully qualified package variables, punctuation variables, and hash dereferences.

(?&PerlTypeglob)

Matches a typeglob.

(?&PerlScalarAccess)

Matches any kind of variable access beginning with a $, including fully qualified package variables, punctuation variables, scalar dereferences, the $#array syntax, and single-value array or hash look-ups.

(?&PerlScalarAccessNoSpace)

Matches any kind of variable access beginning with a $, including fully qualified package variables, punctuation variables, scalar dereferences, the $#array syntax, and single-value array or hash look-ups. But does not allow spaces between the components of the variable access (i.e. imposes the same constraint as within an interpolating quotelike).

(?&PerlScalarAccessNoSpaceNoArrow)

Matches any kind of variable access beginning with a $, including fully qualified package variables, punctuation variables, scalar dereferences, the $#array syntax, and single-value array or hash look-ups. But does not allow spaces or arrows between the components of the variable access (i.e. imposes the same constraint as within a <...>-delimited interpolating quotelike).

(?&PerlArrayAccess)

Matches any kind of variable access beginning with a @, including arrays, array dereferences, and list slices of arrays or hashes.

(?&PerlArrayAccessNoSpace)

Matches any kind of variable access beginning with a @, including arrays, array dereferences, and list slices of arrays or hashes. But does not allow spaces between the components of the variable access (i.e. imposes the same constraint as within an interpolating quotelike).

(?&PerlArrayAccessNoSpaceNoArrow)

Matches any kind of variable access beginning with a @, including arrays, array dereferences, and list slices of arrays or hashes. But does not allow spaces or arrows between the components of the variable access (i.e. imposes the same constraint as within a <...>-delimited interpolating quotelike).

(?&PerlHashAccess)

Matches any kind of variable access beginning with a %, including hashes, hash dereferences, and kv-slices of hashes or arrays.

(?&PerlLabel)

Matches a colon-terminated label.

(?&PerlLiteral)

Matches a literal value. That is: a number, a qr or qw quotelike, a string, or a bareword.

(?&PerlString)

Matches a string literal. That is: a single- or double-quoted string, a q or qq string, a heredoc, or a version string.

(?&PerlQuotelike)

Matches any form of quotelike operator. That is: a single- or double-quoted string, a q or qq string, a heredoc, a version string, a qr, a qw, a qx, a /.../ or m/.../ regex, a substitution, or a transliteration.

(?&PerlHeredoc)

Matches a heredoc specifier. That is: just the initial <<TERMINATOR component, not the actual contents of the heredoc on the subsequent lines.

This rule only matches a heredoc specifier if that specifier is correctly followed on the next line by any heredoc contents and then the correct terminator.

However, if the heredoc specifier is correctly matched, subsequent calls to either of the whitespace-matching rules ((?&PerlOWS) or (?&PerlNWS)) will also consume the trailing heredoc contents and the terminator.

So, for example, to correctly match a heredoc plus its contents you could use something like:

m/ (?&PerlHeredoc) (?&PerlOWS)  $PPR::X::GRAMMAR /x

or, if there may be trailing items on the same line as the heredoc specifier:

m/ (?&PerlHeredoc)
   (?<trailing_items> [^\n]* )
   (?&PerlOWS)

   $PPR::X::GRAMMAR
/x

Note that the same limitations apply to other constructs that match heredocs, such as (?&PerlQuotelike) or (?&PerlString).

(?&PerlQuotelikeQ)

Matches a single-quoted string, either a '...' or a q/.../ (with any valid delimiters).

(?&PerlQuotelikeQQ)

Matches a double-quoted string, either a "..." or a qq/.../ (with any valid delimiters).

(?&PerlQuotelikeQW)

Matches a "quotewords" list. That is a qw/ list of words / (with any valid delimiters).

(?&PerlQuotelikeQX)

Matches a qx system call, either a `...` or a qx/.../ (with any valid delimiters).

(?&PerlQuotelikeS) or (?&PerlSubstitution)

Matches a substitution operation. That is: s/.../.../ (with any valid delimiters and any valid trailing modifiers).

(?&PerlQuotelikeTR) or (?&PerlTransliteration)

Matches a transliteration operation. That is: tr/.../.../ or y/.../.../ (with any valid delimiters and any valid trailing modifiers).

(?&PerlContextualQuotelikeM) or (?&PerlContextualMatch)

Matches a regex-match operation in any context where it would be allowed in valid Perl. That is: /.../ or m/.../ (with any valid delimiters and any valid trailing modifiers).

(?&PerlQuotelikeM) or (?&PerlMatch)

Matches a regex-match operation. That is: /.../ or m/.../ (with any valid delimiters and any valid trailing modifiers) in any context (i.e. even in places where it would not normally be allowed within a valid piece of Perl code).

(?&PerlQuotelikeQR)

Matches a qr regex constructor (with any valid delimiters and any valid trailing modifiers).

(?&PerlContextualRegex)

Matches a qr regex constructor or a /.../ or m/.../ regex-match operation (with any valid delimiters and any valid trailing modifiers) anywhere where either would be allowed in valid Perl.

In other words: anything capable of matching within valid Perl code.

(?&PerlRegex)

Matches a qr regex constructor or a /.../ or m/.../ regex-match operation in any context (i.e. even in places where it would not normally be allowed within a valid piece of Perl code).

In other words: anything capable of matching.

(?&PerlBuiltinFunction)

Matches the name of any builtin function.

To match an actual call to a built-in function, use:

m/
    (?= (?&PerlBuiltinFunction) )
    (?&PerlCall)
/x

(?&PerlNullaryBuiltinFunction)

Matches the name of any builtin function that never takes arguments.

To match an actual call to a built-in function that never takes arguments, use:

m/
    (?= (?&PerlNullaryBuiltinFunction) )
    (?&PerlCall)
/x

(?&PerlVersionNumber)

Matches any number or version-string that can be used as a version number within a use, no, or package statement.

(?&PerlVString)

Matches a version-string (a.k.a. v-string).

(?&PerlNumber)

Matches a valid number, including binary, octal, decimal and hexadecimal integers, and floating-point numbers with or without an exponent.

(?&PerlIdentifier)

Matches a simple, unqualified identifier.

(?&PerlQualifiedIdentifier)

Matches a qualified or unqualified identifier, which may use either :: or ' as internal separators, but only :: as initial or terminal separators.

(?&PerlOldQualifiedIdentifier)

Matches a qualified or unqualified identifier, which may use either :: or ' as both internal and external separators.

(?&PerlBareword)

Matches a valid bareword.

Note that this is not the same as a simple identifier, nor the same as a qualified identifier.

(?&PerlPod)

Matches a single POD section containing any contiguous set of POD directives, up to the first =cut or end-of-file.

(?&PerlPodSequence)

Matches any sequence of POD sections, separated and/or surrounded by optional whitespace.

(?&PerlNWS)

Matches one-or-more characters of necessary whitespace, including spaces, tabs, newlines, comments, and POD.

(?&PerlOWS)

Matches zero-or-more characters of optional whitespace, including spaces, tabs, newlines, comments, and POD.

(?&PerlOWSOrEND)

Matches zero-or-more characters of optional whitespace, including spaces, tabs, newlines, comments, POD, and any trailing __END__ or __DATA__ section.

(?&PerlEndOfLine)

Matches a single newline (\n) character.

This is provided mainly to allow newlines to be "hooked" by redefining (?<PerlEndOfLine>) (for example, to count lines during a parse).

(?&PerlKeyword)

Matches a pluggable keyword.

Note that there are no pluggable keywords in the default PPR::X regex; they must be added by the end-user. See the following section for details.

Extending the Perl syntax with keywords

In Perl 5.12 and later, it's possible to add new types of statements to the language using a mechanism called "pluggable keywords".

This mechanism (best accessed via CPAN modules such as Keyword::Simple or Keyword::Declare) acts like a limited macro facility. It detects when a statement begins with a particular, pre-specified keyword, passes the trailing text to an associated keyword handler, and replaces the trailing source code with whatever the keyword handler produces.

For example, the Dios module uses this mechanism to add keywords such as class, method, and has to Perl 5, providing a declarative OO syntax. And the Object::Result module uses pluggable keywords to add a result statement that simplifies returning an ad hoc object from a subroutine.

Unfortunately, because such modules effectively extend the standard Perl syntax, by default PPR::X has no way of successfully parsing them.

However, when setting up a regex using $PPR::X::GRAMMAR it is possible to extend that grammar to deal with new keywords...by defining a rule named (?<PerlKeyword>...).

This rule is always tested as the first option within the standard (?&PerlStatement) rule, so any syntax declared within effectively becomes a new kind of statement. Note that each alternative within the rule must begin with a valid "keyword" (that is: a simple identifier of some kind).

For example, to support the three keywords from Dios:

$Dios::GRAMMAR = qr{

    # Add a keyword rule to support Dios...
    (?(DEFINE)
        (?<PerlKeyword>

                class                              (?&PerlOWS)
                (?&PerlQualifiedIdentifier)        (?&PerlOWS)
            (?: is (?&PerlNWS) (?&PerlIdentifier)  (?&PerlOWS) )*+
                (?&PerlBlock)
        |
                method                             (?&PerlOWS)
                (?&PerlIdentifier)                 (?&PerlOWS)
            (?: (?&kw_balanced_parens)             (?&PerlOWS) )?+
            (?: (?&PerlAttributes)                 (?&PerlOWS) )?+
                (?&PerlBlock)
        |
                has                                (?&PerlOWS)
            (?: (?&PerlQualifiedIdentifier)        (?&PerlOWS) )?+
                [\@\$%][.!]?(?&PerlIdentifier)     (?&PerlOWS)
            (?: (?&PerlAttributes)                 (?&PerlOWS) )?+
            (?: (?: // )?+ =                       (?&PerlOWS)
                (?&PerlExpression)                 (?&PerlOWS) )?+
            (?> ; | (?= \} ) | \z )
        )

        (?<kw_balanced_parens>
            \( (?: [^()]++ | (?&kw_balanced_parens) )*+ \)
        )
    )

    # Add all the standard PPR::X rules...
    $PPR::X::GRAMMAR
}x;

# Then parse with it...

$source_code =~ m{ \A (?&PerlDocument) \Z  $Dios::GRAMMAR }x;

Or, to support the result statement from Object::Result:

my $ORK_GRAMMAR = qr{

    # Add a keyword rule to support Object::Result...
    (?(DEFINE)
        (?<PerlKeyword>
            result                        (?&PerlOWS)
            \{                            (?&PerlOWS)
            (?: (?> (?&PerlIdentifier)
                |   < [[:upper:]]++ >
                )                         (?&PerlOWS)
                (?&PerlParenthesesList)?+      (?&PerlOWS)
                (?&PerlBlock)             (?&PerlOWS)
            )*+
            \}
        )
    )

    # Add all the standard PPR::X rules...
    $PPR::X::GRAMMAR
}x;

# Then parse with it...

$source_code =~ m{ \A (?&PerlDocument) \Z  $ORK_GRAMMAR }x;

Note that, although pluggable keywords are only available from Perl 5.12 onwards, PPR::X will still accept (?&PerlKeyword) extensions under Perl 5.10.

Extending the Perl syntax in other ways

Other modules (such as Devel::Declare and Filter::Simple) make it possible to extend Perl syntax in even more flexible ways. The PPR::X module provides support for syntactic extensions more general than pluggable keywords.

PPR::X allows any of its public rules to be redefined in a particular regex. For example, to create a regex that matches standard Perl syntax, but which allows the keyword fun as a synonym for sub:

my $FUN_GRAMMAR = qr{

    # Extend the subroutine-matching rules...
    (?(DEFINE)
        (?<PerlStatement>
            # Try the standard syntax...
            (?&PerlStdStatement)
        |
            # Try the new syntax...
            fun                               (?&PerlOWS)
            (?&PerlOldQualifiedIdentifier)    (?&PerlOWS)
            (?: \( [^)]*+ \) )?+              (?&PerlOWS)
            (?: (?&PerlAttributes)            (?&PerlOWS) )?+
            (?> ; | (?&PerlBlock) )
        )

        (?<PerlAnonymousSubroutine>
            # Try the standard syntax
            (?&PerlStdAnonymousSubroutine)
        |
            # Try the new syntax
            fun                               (?&PerlOWS)
            (?: \( [^)]*+ \) )?+              (?&PerlOWS)
            (?: (?&PerlAttributes)            (?&PerlOWS) )?+
            (?> ; | (?&PerlBlock) )
        )
    )

    $PPR::X::GRAMMAR
}x;

Note first that any redefinitions of the various rules have to be specified before the interpolation of the standard rules (so that the new rules take syntactic precedence over the originals).

The structure of each redefinition is essentially identical. First try the original rule, which is still accessible as (?&PerlStd...) (instead of (?&Perl...)). Otherwise, try the new alternative, which may be constructed out of other rules.
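The ordering requirement can be seen in miniature without PPR::X at all. The following is only a sketch: $TOY_GRAMMAR and its Greeting and StdGreeting rules are hypothetical stand-ins for $PPR::X::GRAMMAR and its rules. It relies on the behaviour documented in perlre that (?&NAME) recurses to the leftmost group with that name, which is presumably why a redefinition placed before the interpolated grammar takes precedence.

```perl
use strict;
use warnings;

# Toy stand-in for $PPR::X::GRAMMAR (an assumption, for illustration only):
# the public rule simply delegates to its "Std" counterpart.
my $TOY_GRAMMAR = qr{
    (?(DEFINE)
        (?<Greeting>    (?&StdGreeting) )
        (?<StdGreeting> hello           )
    )
}x;

# Extend the public rule: try the standard rule first, then the new syntax.
# The redefinition comes BEFORE the interpolated grammar, so it is leftmost.
my $EXTENDED = qr{
    (?(DEFINE)
        (?<Greeting>
            (?&StdGreeting)
        |
            howdy
        )
    )
    $TOY_GRAMMAR
}x;

for my $word (qw( hello howdy )) {
    my $std = $word =~ m{ \A (?&Greeting) \z  $TOY_GRAMMAR }x ? 'yes' : 'no';
    my $ext = $word =~ m{ \A (?&Greeting) \z  $EXTENDED    }x ? 'yes' : 'no';
    print "$word: standard=$std extended=$ext\n";
}
```

Only the extended grammar accepts "howdy", while both accept "hello", because the extended rule still falls back on the original via (?&StdGreeting).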

There is no absolute requirement to try the original rule as part of the new rule, but if you don't then you are replacing the rule, rather than extending it. For example, to replace the low-precedence boolean operators (and, or, xor, and not) with their Latin equivalents:

my $GRAMMATICA = qr{

    # Verbum sapienti satis est...
    (?(DEFINE)

        # Iunctiones...
        (?<PerlLowPrecedenceInfixOperator>
            atque | vel | aut
        )

        # Contradicetur...
        (?<PerlLowPrecedenceNotExpression>
            (?: non  (?&PerlOWS) )*+  (?&PerlCommaList)
        )
    )

    $PPR::X::GRAMMAR
}x;

Or to maintain a line count within the parse:

my $COUNTED_GRAMMAR = qr{

    (?(DEFINE)

        (?<PerlEndOfLine>
            # Try the standard syntax
            (?&PerlStdEndOfLine)

            # Then count the line (must localize, to handle backtracking)...
            (?{ local $linenum = $linenum + 1; })
        )
    )

    $PPR::X::GRAMMAR
}x;
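A localized counter is restored to its previous value whenever the regex backtracks, and also once the match finishes, so the final count normally has to be copied into a non-localized variable from within the regex. The following is a minimal sketch of that mechanism using only core Perl, not PPR::X itself; the variable names and the "__END__" marker are illustrative assumptions.

```perl
use strict;
use warnings;

# Package variables: local() cannot be applied to lexical (my) variables.
our $linenum;
our $lines_before_marker;

my $text = "first\nsecond\n__END__\n";

$text =~ m{
    \A
    (?{ $linenum = 0 })                       # Reset the counter

    (?:
        [^\n]*+ \n                            # One complete line...
        (?{ local $linenum = $linenum + 1 })  # ...counted, backtracking-safe
    )*

    __END__ \n \z                             # Forces the loop to give back a line

    (?{ $lines_before_marker = $linenum })    # Copy the count out on success
}x or die "No match\n";

print "Lines before marker: $lines_before_marker\n";
```

The loop first consumes all three lines (including the marker line), then has to backtrack by one line so that the marker can match. Because the increment was localized, the count is unwound along with the backtracking and the value copied out is 2 rather than 3.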

Comparison with PPI

The PPI and PPR::X modules can both identify valid Perl code, but they do so in very different ways, and are optimal for different purposes.

PPI scans an entire Perl document and builds a hierarchical representation of the various components. It is therefore suitable for recognition, validation, partial extraction, and in-place transformation of Perl code.

PPR::X matches only as much of a Perl document as specified by the regex you create, and does not build any hierarchical representation of the various components it matches. It is therefore suitable for recognition and validation of Perl code. However, unless great care is taken, PPR::X is not as reliable as PPI for extractions or transformations of components smaller than a single statement.

On the other hand, PPI always has to parse its entire input, and build a complete non-trivial nested data structure for it, before it can be used to recognize or validate any component. So it is almost always significantly slower and more complicated than PPR::X for those kinds of tasks.

For example, to determine whether an input string begins with a valid Perl block, PPI requires something like:

if (my $document = PPI::Document->new(\$input_string) ) {
    my $block = $document->schild(0)->schild(0);
    if ($block->isa('PPI::Structure::Block')) {
        $block->remove;
        process_block($block);
        process_extra($document);
    }
}

whereas PPR::X needs just:

if ($input_string =~ m{ \A (?&PerlOWS) ((?&PerlBlock)) (.*)  $PPR::X::GRAMMAR }xs) {
    process_block($1);
    process_extra($2);
}

Moreover, the PPR::X version will be at least twice as fast at recognizing that leading block (and usually four to seven times faster)...mainly because it doesn't have to parse the trailing code at all, nor build any representation of its hierarchical structure.

As a simple rule of thumb, when you only need to quickly detect, identify, or confirm valid Perl (or just a single valid Perl component), use PPR::X. When you need to examine, traverse, or manipulate the internal structure or component relationships within an entire Perl document, use PPI.

DIAGNOSTICS

Warning: This program is running under Perl 5.20...

Due to an unsolved issue with that particular release of Perl, the single regex in the PPR::X module takes a ridiculously long time to compile under Perl 5.20 (i.e. minutes, not milliseconds).

The code will work correctly when it eventually does compile, but the start-up delay is so extreme that the module issues this warning, to reassure users that something is actually happening, and to explain why it's happening so slowly.

The only remedy at present is to use an older or newer version of Perl.

For all the gory details, see:

https://rt.perl.org/Public/Bug/Display.html?id=122283
https://rt.perl.org/Public/Bug/Display.html?id=122890
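A program that embeds the grammar may want to detect the affected release itself, for example to choose a different code path or to tell its own users what to expect. The following is a sketch only: is_perl_5_20() is a hypothetical helper, not part of PPR::X, and it assumes that only the 5.20.x series is affected.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PPR::X): true if the given numeric
# Perl version (in the format of $], e.g. 5.020003) is in the 5.20 series.
sub is_perl_5_20 {
    my ($version) = @_;
    return ($version >= 5.020 && $version < 5.021) ? 1 : 0;
}

# $] holds the version of the running interpreter as a decimal number
if (is_perl_5_20($])) {
    warn "Perl 5.20 detected: expect a long pause while the grammar compiles\n";
}
```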

PPR::X::decomment() does not work under Perl 5.14

There is a separate bug in the Perl 5.14 regex engine that prevents the decomment() subroutine from correctly detecting the location of comments.

The subroutine throws an exception if you attempt to call it when running under Perl 5.14 specifically.
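Because the failure is reported as an exception, code that must also run under Perl 5.14 can trap it and fall back on the unmodified source. The following is a sketch under stated assumptions: decomment_or_original() is a hypothetical wrapper, and it assumes the decommenting routine takes the source string as its single argument and returns the decommented string.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of PPR::X). $decomment is a reference to
# the real routine, e.g. \&PPR::X::decomment once PPR::X has been loaded.
sub decomment_or_original {
    my ($decomment, $source) = @_;

    my $stripped;
    my $ok = eval { $stripped = $decomment->($source); 1 };

    # If the routine threw an exception, return the source unchanged
    return $ok ? $stripped : $source;
}
```

With PPR::X loaded, a call might look like decomment_or_original(\&PPR::X::decomment, $source_code).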

The module has no other diagnostics, apart from those Perl provides for all regular expressions.

The commonest error is to forget to add $PPR::X::GRAMMAR to a regex, in which case you will get a standard Perl error message such as:

Reference to nonexistent named group in regex;
marked by <-- HERE in m/

    (?&PerlDocument <-- HERE )

/ at example.pl line 42.

Adding $PPR::X::GRAMMAR at the end of the regex solves the problem.
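When a pattern is assembled at run time, the same error can be trapped with eval. The following sketch uses only core Perl so that it runs without the module: $toy_grammar is a hypothetical one-rule stand-in for $PPR::X::GRAMMAR, not the real grammar.

```perl
use strict;
use warnings;

# Build the pattern from a string, so that it is compiled at run time,
# where eval can trap the error.
my $body = '\A (?&PerlDocument) \z';

# Without any rule definitions, compilation fails...
my $broken = eval { qr{$body}x };
my $error  = $@;
print "Compile failed: $error" if !defined $broken;

# Stand-in for $PPR::X::GRAMMAR (assumption: a toy rule, NOT the real grammar)
my $toy_grammar = '(?(DEFINE) (?<PerlDocument> .*+ ) )';

# With the definitions appended, the same pattern compiles
my $fixed = eval { qr{$body $toy_grammar}xs };
print "Compiled successfully\n" if defined $fixed;
```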

CONFIGURATION AND ENVIRONMENT

PPR::X requires no configuration files or environment variables.

DEPENDENCIES

Requires Perl 5.10 or later.

INCOMPATIBILITIES

None reported.

LIMITATIONS

This module works under all versions of Perl from 5.10 onwards.

However, the latest release of Perl 5.20 seems to have significant difficulties compiling large regular expressions, and typically requires over a minute to build any regex that incorporates the $PPR::X::GRAMMAR rule definitions.

The problem does not occur in Perl 5.10 to 5.18, nor in Perl 5.22 or later, though the parser is still measurably slower in all Perl versions greater than 5.20 (presumably because most regexes are measurably slower in more modern versions of Perl; such is the price of full re-entrancy and safe lexical scoping).

The decomment() subroutine trips a separate regex engine bug in Perl 5.14 only and will not run under that version.

There was a lingering bug in regex re-interpolation between Perl 5.18 and 5.28, which means that, under those versions, interpolating a PPR::X grammar (or any other precompiled regex that uses the (??{...}) construct) into another regex sometimes does not work. In these cases, the spurious error message generated is usually: Sequence (?_...) not recognized. This problem is unlikely ever to be resolved, as those versions of Perl are no longer being maintained. The only known workaround is to upgrade to Perl 5.30 or later.

There are also constructs in Perl 5 which cannot be parsed without actually executing some code...which the regex does not attempt to do, for obvious reasons.

BUGS

No bugs have been reported.

Please report any bugs or feature requests to bug-ppr@rt.cpan.org, or through the web interface at http://rt.cpan.org.

AUTHOR

Damian Conway <DCONWAY@CPAN.org>

LICENCE AND COPYRIGHT

Copyright (c) 2017, Damian Conway <DCONWAY@CPAN.org>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.