NAME
PPI - Parse, Analyze and Manipulate Perl (without perl) - RELEASE CANDIDATE 2
SYNOPSIS
    use PPI;

    # Create a new empty document
    my $Document = PPI::Document->new;

    # Create a document from source
    $Document = PPI::Document->new('print "Hello World!\n"');

    # Load a Document from a file
    $Document = PPI::Document->load('Module.pm');

    # Does it contain any POD?
    if ( $Document->find_any('PPI::Token::Pod') ) {
        print "Module contains POD\n";
    }

    # Get the name of the main package
    my $pkg = $Document->find_first('PPI::Statement::Package')->namespace;

    # Remove all that nasty documentation
    $Document->prune('PPI::Token::Pod');
    $Document->prune('PPI::Token::Comment');

    # Save the file
    $Document->save('Module.pm.stripped');
STATUS
As of version 0.995, PPI is at release candidate status.
The entire PPI feature set is now implemented. The API supports all of the major language structures and will handle the entire perl syntax.
Source filters are not, will not, and cannot be supported.
The class structure of the PDOM (Perl Document Object Model) is complete, frozen and documented. Most of the analysis methods within the PDOM that are documented can also be considered frozen. Some small changes may be made down the track, but everything is now considered "done".
DESCRIPTION
About this Document
This is the PPI manual. It describes PPI, its reason for existing, its structure, its use, an overview of the API, and provides implementation samples.
Background
Reading and manipulating Perl programmatically, other than with the perl executable, has caused difficulty for a long time.
The root cause of this problem is perl's dynamic grammar. Although the grammar of most code does not typically differ much, some things cause large problems.
One example is function signatures, as demonstrated by the following.
    @result = (dothis $foo, $bar);

    # Which of the following is it equivalent to?
    @result = (dothis($foo), $bar);
    @result = dothis($foo, $bar);
This code can be interpreted in two different ways, depending on whether the &dothis function is expecting one argument, or two, or several.
To restate, a true or "real" parser needs information that cannot be found in the immediate vicinity. In fact, this information might not even be in the same file. The parser might also be unable to determine it without first executing a BEGIN {} block. In other words, to parse perl, you must also execute it, or if not it, everything that it depends on for its grammar.
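To make the ambiguity concrete, here is a small sketch in plain Perl. The subs one_arg and any_args are hypothetical stand-ins for &dothis; the only difference between them is a ($) prototype, yet the same call syntax is parsed differently for each.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for &dothis. Both are declared before the
# calls below, because perl needs the prototype at compile time.
sub one_arg ($) { return "one_arg(@_)" }    # prototype: exactly one scalar
sub any_args    { return "any_args(@_)" }   # no prototype: takes a list

my ( $foo, $bar ) = ( 'foo', 'bar' );

# Identical call syntax, different parse.
my @first  = ( one_arg $foo, $bar );     # parsed as ( one_arg($foo), $bar )
my @second = ( any_args $foo, $bar );    # parsed as any_args( $foo, $bar )

print scalar(@first),  " element(s): @first\n";
print scalar(@second), " element(s): @second\n";
```

The first list ends up with two elements and the second with one, even though the two call sites look the same. A parser that cannot see the sub declarations has no way to tell which case applies.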
This, while possibly feasible in some circumstances, is not a valid solution (at least, so far as this module is concerned). Imagine trying to parse some code that had a dependency on the Win32::* modules from a Unix machine, or trying to parse some code with a dependency on another module that had not even been written yet...
For more information on why it is impossible to parse perl, see:
http://www.perlmonks.org/index.pl?node_id=44722
Why "Isolated"?
Originally, PPI was short for Parse::Perl::Isolated. In acknowledgement that someone may some day come up with a valid solution for the grammar problem, it was decided to leave the Parse::Perl namespace free.
The purpose of this parser is not to parse Perl code, but to parse Perl documents. In most cases, a single file is valid as both. By treating the problem this way, we can parse a single file containing Perl source isolated from any other resources, such as the libraries upon which the code may depend, and without needing to run an instance of perl alongside or inside the parser (a possible solution for Parse::Perl that is investigated from time to time).
Why do we want to parse?
Once we accept that we will probably never be able to parse perl well enough to execute it, it is worth re-examining WHY we wanted to "parse" perl in the first place. What uses would we put such a parser to?
- Documentation
  Analyze the contents of a Perl document to automatically generate documentation, in parallel to, or as a replacement for, POD documentation.
- Structural and Quality Analysis
  Determine quality or other metrics across a body of code, and identify situations relating to particular phrases, techniques or locations.
- Refactoring
  Make structural, syntax, or other changes to code in an automated manner, independently, or in assistance to an editor. This list includes backporting, forward porting, partial evaluation, "improving" code, or whatever.
- Layout
  Change the layout of code without changing its meaning. This includes techniques such as tidying (like perltidy), obfuscation, compression, or implementing formatting preferences or policies.
- Presentation
  Improve the presentation of a Perl document without changing the text of the code, for example by syntax colouring.
With these goals identified, as long as the above tasks can be achieved, with some sort of reasonable guarantee that the code will not be damaged in the process, then PPI can be considered to be a success.
Good Enough(TM)
With these tasks in mind, PPI seeks to be good enough to achieve them, or to provide a sufficiently good API on which others can implement modules in these and related areas.
However, there are going to be limits to this process. Because PPI cannot adapt to changing grammars, any code written using code filters should not be assumed to be parsable. At one extreme, this includes anything munged by Acme::Bleach, as well as (arguably) more common cases like Switch.pm and Exception.pm. We do not pretend to be able to parse code using these modules, although someone may be able to extend PPI to handle them.
UPDATE: The ability to extend PPI to handle lexical additions to the language, which means handling filters that LOOK like they should be perl, but aren't, is on the drawing board to be done some time post-1.0.
The goal for success is thus to be able to successfully parse 99% of all Perl documents contained in CPAN. This means the entire file in each case.
IMPLEMENTATION
General Layout
PPI is built upon two primary "parsing" components, PPI::Tokenizer and PPI::Lexer, and a large tree of nearly 50 classes which implement the various objects within the Perl Document Object Model (PDOM).
The Perl Document Object Model is somewhat similar in style and intent to the regular DOM, but contains many differences to handle perl-specific cases.
On top of the Tokenizer and Lexer, and the classes of the PDOM, sit a number of classes intended to make life a little easier when dealing with PDOM object trees.
Both the major parsing components were implemented from scratch with just plain Perl code. There are no grammar rules, no YACC or LEX style tools, just code. This is primarily because of the sheer volume of accumulated cruft that exists in perl. Not even perl itself is capable of parsing perl documents (remember, it just parses and executes it as code) so PPI needs to be even cruftier than perl itself. Yes, eewww...
The Tokenizer
The Tokenizer is considered complete and of release candidate quality. Not quite fully "stable", but close.
The Tokenizer takes source code and converts it into a series of tokens. It does this using a slow but thorough character-by-character manual process, rather than using complex regexes. Well, that's actually a lie: it has a lot of support regexes throughout, and it's not truly character by character. The Tokenizer is increasingly "skipping ahead" when it can find shortcuts, so the "current character" cursor tends to jump a bit wildly. Remember that cruft I was mentioning? Right, well, the Tokenizer is full of it. In reality, the number of times the Tokenizer will ACTUALLY move the character cursor itself is only about 5% - 10% higher than the number of tokens in the file.
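As a rough sketch of what "a series of tokens" looks like in practice, the following pulls tokens from a Tokenizer one at a time. It assumes that the constructor accepts a reference to the source text and that get_token returns false once the input is exhausted, as later PPI releases do; check both against the PPI::Tokenizer documentation for the version you have installed.

```perl
use strict;
use warnings;
use PPI;

my $source = qq{print "Hello World!\\n";\n};

# Assumption: the constructor takes a reference to the source string.
my $Tokenizer = PPI::Tokenizer->new( \$source )
    or die "Failed to create Tokenizer";

# Assumption: get_token returns the next token, or false at the end.
my @tokens;
while ( my $Token = $Tokenizer->get_token ) {
    push @tokens, $Token;

    # Show newlines as \n so each token stays on one output line
    ( my $text = $Token->content ) =~ s/\n/\\n/g;
    printf "%-28s '%s'\n", ref($Token), $text;
}
```

Each line of output pairs a token class with the exact text it covers, whitespace included.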
Currently, these speed issues mean that PPI is not of great use for highly interactive tasks, such as an editor which checks and formats code on the fly. This situation is improving somewhat with multi-gigahertz processors, but can still be painful at times.
How slow? As an example, tokenizing CPAN.pm, a 7112 line, 40,000 token file takes about 5 seconds on my little Duron 800 test server. So you should expect the tokenizer to work at a rate of about 1700 lines of code per gigacycle. The code gets tweaked and improved all the time, and there is a fair amount of scope left for speed improvements, but it is painstaking work, and fairly slow going.
The target rate is about 5000 lines per gigacycle.
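The "lines per gigacycle" figure can be rechecked with a few lines of arithmetic. This sketch assumes the Duron 800 runs at 0.8 GHz, which the text implies but does not state.

```perl
use strict;
use warnings;

# Figures quoted above. The 0.8 GHz clock speed is an assumption
# based on the "Duron 800" name.
my $lines     = 7112;
my $seconds   = 5;
my $clock_ghz = 0.8;
my $target    = 5000;    # target lines per gigacycle

my $gigacycles          = $seconds * $clock_ghz;    # 4 gigacycles
my $lines_per_gigacycle = $lines / $gigacycles;     # 1778

printf "measured: %.0f lines per gigacycle\n", $lines_per_gigacycle;
printf "target:   %.1fx faster than measured\n",
    $target / $lines_per_gigacycle;
```

Under that assumption the measured rate comes to 1778 lines per gigacycle, in line with the rough figure quoted above, and the target is a little under three times faster.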
The main avenue for making it to this speed has now become PPI::XS, a drop-in XS accelerator for miscellaneous parts of PPI.
Since PPI::XS has only just gotten off the ground and is currently only at proof-of-concept stage, this may take a little while.
The Lexer
The Lexer is considered complete, but subject to minor changes. Early beta quality.
The Lexer takes a token stream, and converts it to a lexical tree. Again, remember we are parsing Perl documents here, not code, so this includes whitespace, comments, and all manner of weird things that have no relevance when code is actually executed.
An instantiated PPI::Lexer object consumes PPI::Tokenizer objects, or things that can be converted into one, and produces PPI::Document objects.
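A minimal sketch of driving the Lexer directly follows. It assumes the lex_source method from the PPI::Lexer documentation, which takes a plain string of source and returns a PPI::Document; that method name is not mentioned in this manual.

```perl
use strict;
use warnings;
use PPI;

my $Lexer = PPI::Lexer->new;

# Assumption: lex_source takes a string of Perl source and returns
# a PPI::Document, or false on failure.
my $Document = $Lexer->lex_source(qq{print "Hello World!\\n";\n})
    or die "Failed to lex first source";

# The same Lexer object can be reused for another document.
my $Other = $Lexer->lex_source("exit();\n")
    or die "Failed to lex second source";

print ref($Document), "\n";
print ref($Other),    "\n";
```

In day-to-day use you would normally let PPI::Document create the Lexer for you; working with the Lexer directly is mostly useful when you want to see the parsing stages separately.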
THE PERL DOCUMENT OBJECT MODEL
The PDOM is a structured collection of data classes that together provide a correct and scalable model for documents that follow the standard Perl syntax.
Although this is a basic overview and doesn't cover the PDOM classes in order or in detail, the following is a rough inheritance layout of the main core classes.
    PPI::Element
       PPI::Token
          PPI::Token::*
       PPI::Node
          PPI::Statement
             PPI::Statement::*
          PPI::Structure
             PPI::Structure::*
          PPI::Document
To summarize the above layout, all PDOM objects inherit from the PPI::Element class.
Under this are PPI::Token, strings of content with a known type, and PPI::Node, containers that hold other Elements.
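The layout can be checked against an installed copy of PPI with the standard isa method. This is only a sketch: the %parent_of table is a hypothetical helper that restates the tree above, and the script assumes that loading PPI also loads these core classes.

```perl
use strict;
use warnings;
use PPI;

# Hypothetical helper table restating the inheritance layout above.
my %parent_of = (
    'PPI::Token'     => 'PPI::Element',
    'PPI::Node'      => 'PPI::Element',
    'PPI::Statement' => 'PPI::Node',
    'PPI::Structure' => 'PPI::Node',
    'PPI::Document'  => 'PPI::Node',
);

for my $class ( sort keys %parent_of ) {
    my $parent = $parent_of{$class};
    printf "%-15s isa %-13s %s\n",
        $class, $parent, $class->isa($parent) ? 'yes' : 'no';
}
```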
The first PDOM element you are likely to encounter is the PPI::Document object.
The Document
At the top of all complete PDOM trees is a PPI::Document object. Each Document will contain a number of Statements, Structures and Tokens.
A PPI::Statement is any series of Tokens and Structures that are treated as a single contiguous statement by perl itself. You should note that a Statement is as close as PPI can get to "parsing" the code in the sense that perl itself parses Perl code when it is building the op-tree. PPI cannot tell you, for example, which tokens are subroutine names, or arguments to a sub call, or what have you.
At a fundamental level, it only knows that this series of elements represents a single Statement. For specific Statement types however, the PDOM is able to derive additional useful information.
A PPI::Structure is any series of tokens contained within matching braces. This includes things like code blocks, conditions, function argument braces, anonymous array constructors, lists, scoping braces et al. Each Structure contains none, one, or many Tokens and Structures (the rules for which vary for the different Structure subclasses).
The PDOM at Work
To demonstrate, let's start with an example showing how the PDOM tree might look for the following chunk of simple Perl code.

    #!/usr/bin/perl

    print( "Hello World!" );

    exit();

This is not all that complicated. Very very simple in fact. Translated into a PDOM tree it would have the following structure.
    PPI::Document
      PPI::Token::Comment                '#!/usr/bin/perl\n'
      PPI::Token::Whitespace             '\n'
      PPI::Statement
        PPI::Token::Bareword             'print'
        PPI::Structure::List             ( ... )
          PPI::Token::Whitespace         ' '
          PPI::Statement::Expression
            PPI::Token::Quote::Double    '"Hello World!"'
          PPI::Token::Whitespace         ' '
        PPI::Token::Structure            ';'
      PPI::Token::Whitespace             '\n'
      PPI::Token::Whitespace             '\n'
      PPI::Statement
        PPI::Token::Bareword             'exit'
        PPI::Structure::List             ( ... )
        PPI::Token::Structure            ';'
      PPI::Token::Whitespace             '\n'
Please note that in this example, strings are only listed for the ACTUAL element that contains the string. Also, Structures are listed with the brace characters noted.
The PPI::Dumper module can be used to generate similar trees yourself.
Notice how PPI builds EVERYTHING into the model, including whitespace. This is needed in order to make the Document fully "round trip" compliant. That is, if you stringify the Document you get the same file you started with.
The one exception is that if the newlines for your file are wrong, PPI will probably have localised them for you.
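The dump and the round-trip property described above can be tried with a short script. This is a sketch that assumes the lex_source method of PPI::Lexer and the content method of the Document. Neither name appears in this manual, so check them against the version you have installed.

```perl
use strict;
use warnings;
use PPI;
use PPI::Dumper;

my $source = <<'END_PERL';
#!/usr/bin/perl

print( "Hello World!" );

exit();
END_PERL

# Assumption: lex_source takes a plain string of Perl source.
my $Document = PPI::Lexer->new->lex_source($source)
    or die "Failed to lex source";

# Print a tree of the PDOM, one element per line
PPI::Dumper->new($Document)->print;

# Assumption: content returns the Document as a single string.
my $round_trip = $Document->content;
print $round_trip eq $source
    ? "Round trip OK\n"
    : "Round trip changed the document\n";
```

The exact class names in the dump depend on the PPI version, so they may not match the tree shown above line for line.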
We can make that PDOM dump a little easier to read if we strip out all the whitespace. Here it is again, sans the distracting whitespace tokens.
    PPI::Document
      PPI::Token::Comment                '#!/usr/bin/perl\n'
      PPI::Statement
        PPI::Token::Bareword             'print'
        PPI::Structure::List             ( ... )
          PPI::Statement::Expression
            PPI::Token::Quote::Double    '"Hello World!"'
        PPI::Token::Structure            ';'
      PPI::Statement
        PPI::Token::Bareword             'exit'
        PPI::Structure::List             ( ... )
        PPI::Token::Structure            ';'
As you can see, the tree can get fairly deep at times, especially when every isolated token in a bracket becomes its own statement. This is needed to allow anything inside the tree the ability to grow. It also makes the search and analysis algorithms much more flexible.
Because of the depth and complexity of PDOM trees, a vast number of very easy to use methods have been added wherever possible to help people working with PDOM trees do normal tasks relatively quickly and efficiently.
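As a small illustration of those convenience methods, the sketch below searches a Document by class name rather than walking the tree by hand. find_first is used in the SYNOPSIS; find (which returns a reference to an array of matches, or false) and lex_source are assumptions taken from the PPI::Node and PPI::Lexer documentation.

```perl
use strict;
use warnings;
use PPI;

my $source = qq{print( "Hello World!" );\n\nexit();\n};

# Assumption: lex_source takes a plain string of Perl source.
my $Document = PPI::Lexer->new->lex_source($source)
    or die "Failed to lex source";

# The first element of the given class, however deep it is nested
my $Quote = $Document->find_first('PPI::Token::Quote');
print "First quote: ", $Quote->content, "\n" if $Quote;

# Assumption: find returns an array reference of every match, or
# false when nothing matches. Subclasses of the class match too.
my $statements = $Document->find('PPI::Statement') || [];
print "Statements found: ", scalar(@$statements), "\n";
```

Because matching is by class, a search for PPI::Statement also returns nested statements such as the expression inside the print list.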
CLASSES
This section has two parts.
The first is a large tree of all the classes contained in the PPI core. They are listed only by name, with no description.
The second is a shorter list with descriptions of the primary classes in the core PPI distribution.
Perl Document Object Model Classes
    PPI::Element
       PPI::Node
          PPI::Document
             PPI::Document::Fragment
          PPI::Statement
             PPI::Statement::Scheduled
             PPI::Statement::Package
             PPI::Statement::Include
             PPI::Statement::Sub
             PPI::Statement::Compound
             PPI::Statement::Break
             PPI::Statement::Data
             PPI::Statement::End
             PPI::Statement::Expression
                PPI::Statement::Variable
             PPI::Statement::Null
             PPI::Statement::UnmatchedBrace
             PPI::Statement::Unknown
          PPI::Structure
             PPI::Structure::Block
             PPI::Structure::Subscript
             PPI::Structure::Constructor
             PPI::Structure::Condition
             PPI::Structure::List
             PPI::Structure::ForLoop
             PPI::Structure::Unknown
       PPI::Token
          PPI::Token::Whitespace
          PPI::Token::Comment
          PPI::Token::Pod
          PPI::Token::Number
          PPI::Token::Word
          PPI::Token::DashedWord
          PPI::Token::Symbol
             PPI::Token::Magic
          PPI::Token::ArrayIndex
          PPI::Token::Operator
          PPI::Token::Quote
             PPI::Token::Quote::Single
             PPI::Token::Quote::Double
             PPI::Token::Quote::Literal
             PPI::Token::Quote::Interpolate
          PPI::Token::QuoteLike
             PPI::Token::QuoteLike::Backtick
             PPI::Token::QuoteLike::Command
             PPI::Token::QuoteLike::Regexp
             PPI::Token::QuoteLike::Words
             PPI::Token::QuoteLike::Readline
          PPI::Token::Regexp
             PPI::Token::Regexp::Match
             PPI::Token::Regexp::Substitute
             PPI::Token::Regexp::Transliterate
          PPI::Token::HereDoc
          PPI::Token::Cast
          PPI::Token::Structure
          PPI::Token::Label
          PPI::Token::Separator
          PPI::Token::Data
          PPI::Token::End
          PPI::Token::Prototype
          PPI::Token::Attribute
          PPI::Token::Unknown
Primary Class Overview
- PPI::Tokenizer
  The PPI Tokenizer consumes chunks of text and provides access to a stream of PPI::Token objects. The Tokenizer is very very complicated, to the point where even the author treads a bit carefully when working with it.
  Most of the complication is the result of optimizations which have tripled the tokenization speed, at the expense of maintainability. We cope with the spaghetti by heavily commenting everything.
  Because the Tokenizer holds the array of Tokens internally, providing cursor-based access to it, an instantiated Tokenizer object can only be used once, unlike the Lexer which just spits out single PPI::Document objects and can be reused as needed.
- PPI::Lexer
  The PPI Lexer. Converts Token streams into PDOM trees.
- PPI::Dumper
  A simple class for dumping readable debugging versions of PDOM structures.
- PPI::Token::_QuoteEngine
  The PPI::Token::Quote and PPI::Token::QuoteLike classes provide abstract base classes for the many and varied types of quote and quote-like things in perl. However, much of the actual quote logic is implemented in a separate quote engine, based at PPI::Token::_QuoteEngine.
  Classes that inherit from PPI::Token::Quote, PPI::Token::QuoteLike and the base Regexp class PPI::Token::Regexp are generally parsed only by the Quote Engine.
- PPI::Document
  The Document object, the top of the PDOM.
- PPI::Document::Fragment
  A cohesive fragment of a larger Document. Although not of any current use, it is planned for use in certain internal tree manipulation algorithms, that is, for doing things like cut/paste/insert. Very similar to PPI::Document, but has some additional methods, and does not represent a lexical scope boundary.
- PPI::Element
  The Element class is the abstract base class for all objects within the PDOM.
- PPI::Node
  The Node object, the abstract base class for all PDOM objects that can contain other Elements, such as the Document, Statement and Structure objects.
- PPI::Statement
  The base class for all Perl statements. Generic "evaluate for side-effects" statements are of this actual type. Other more interesting statement types belong to one of its children.
  See the PPI::Statement documentation for a longer description and list of all of the different statement types and subclasses.
- PPI::Structure
  The abstract base class for all structures. A Structure is a language construct consisting of matching braces containing a set of other elements.
  See the PPI::Structure documentation for a description and list of all of the different structure types/classes.
- PPI::Token
  A token is the basic unit of content. At its most basic, a Token is just a string tagged with metadata (its class, some additional flags in some cases).
INSTALLING
The core PPI distribution is pure perl and has been kept as tight as possible and with as few dependencies as possible.
It should download and install normally on any platform from within the CPAN and CPANPLUS applications, or directly using the distribution tarball.
There are no special install instructions for PPI.
EXTENDING
For the time being, the PPI namespace is reserved for the sole use of the modules under the umbrella of the Parse::Perl project.
We recommend using the PPIx:: namespace for PPI-specific modifications or prototypes thereof, or Perl:: for modules which provide general Perl language-related functions.
TO DO
- More analysis methods for PDOM classes (post 1.000)
- Creation of a PPI tutorial (OSCON)
- We can _always_ write more tests
SUPPORT
Bugs should be reported via the following URI:
http://rt.cpan.org/NoAuth/ReportBug.html?Queue=PPI
For other issues, or commercial enhancement or support, contact the author.
AUTHOR
Adam Kennedy, http://ali.as/, cpan@ali.as
ACKNOWLEDGMENTS
A huge thank you to Phase N (http://phase-n.com/) for permitting the open sourcing and release of this distribution from commercial work.
Completion funding provided by The Perl Foundation (http://www.perlfoundation.org/).
COPYRIGHT
Copyright (c) 2004 - 2005 Adam Kennedy. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.