PDD 26: Compiler Tools - Abstract Syntax Tree

Abstract

This PDD describes the node types and semantics of the Parrot Abstract Syntax Tree representation.

Description

The Parrot Abstract Syntax Tree (PAST) is a set of Parrot classes useful for generating an abstract semantic representation of a program written in a high level language such as Perl or Python. Once a program has been translated into its equivalent PAST representation, the Parrot Compiler Toolkit can be used to generate executable PIR or bytecode from the abstract syntax tree representation.

PAST is designed to provide the common operations and semantics needed by many of the high level languages targeted by Parrot, while also being extensible to support the unique needs of specific languages.

Implementation

PAST::Node

PAST::Node is the base class for all nodes in an abstract syntax tree. Each PAST::Node element has an array of children nodes (which may be empty), as well as a number of attributes and accessor methods for the node.

Every PAST node has name, source, and pos attributes. The name attribute is the node's name, if any, while source and pos are used to identify the location of the node in the original source code.

init([child1, child2, ..., ] [attr1=>val1, attr2=>val2, ...])

Initialize a PAST node with the given children and attributes. Adds each child to the node (using the push method, below) and calls the appropriate accessor method for each attribute.

new([child1, child2, ..., ] [attr1=>val1, attr2=>val2, ...])

Create a new PAST node with the same type as the invocant and initialized with the given children and attributes. Returns the newly created node.

push(child)

Add child to the end of the node's array of children.

pop()

Remove the last child from the node's array of children. Returns the child.

unshift(child)

Add child to the beginning of the node's array of children.

shift()

Remove the first child from the node's array of children. Returns the child.

iterator( )

Return a newly initialized Iterator for the node's list of children.

node(val)

Set the invocant's source and pos attributes to be the same as val. If val is another PAST node, then source and pos are simply copied from that node. Otherwise, val is assumed to be a Match object (e.g., from PGE) and the source and position information are obtained from that.

name([value])

Accessor method -- sets/returns the name attribute of the node.

named([value])

Accessor method -- for nodes that are being used as named arguments, sets/returns the name to be associated with the argument.

flat([value])

Accessor method -- sets/returns the "flatten" flag on nodes being used as arguments.

attr(STR attrname, PMC value, INT has_value)

Helper method for writing accessors -- if has_value is true then set the node's value of attrname to value. Returns the (resulting) value of attrname for the invocant.

PAST::Stmts

PAST::Stmts is simply a PAST::Node used to contain a sequence of other PAST nodes to be evaluated. It has no specific methods or attributes.

PAST::Val

PAST::Val nodes represent constant values in the abstract syntax tree. The value attribute represents the value of the node itself.

value([value])

Accessor method to get/set the constant value for this node.

returns([typename])

Get/set the type of value to be generated by this node. If not specified, the type is taken from the type of the value attribute given above.

PAST::Var

PAST::Var nodes represent variable-like items within the abstract syntax tree. The name attribute (inherited from PAST::Node) identifies the name of the variable as given by the high level language program. The other attributes for PAST::Var nodes provide the details of how the variable is defined and accessed.

scope([value])

A variable node's "scope" indicates how the variable is to be defined or accessed within the program. The available scopes include:

"lexical"

Lexical variables are scoped to the PAST::Block in which they are declared, including any nested blocks. If the node's isdecl attribute is true, then this node defines a new lexical variable within the current block. Otherwise, the node refers to a lexical variable already declared in the current or outer block.

"package"

Package variables represent global or namespace-managed variables in the program. The node's namespace attribute may be used to explicitly identify the namespace of the variable, otherwise it is assumed to be within the namespace of the PAST::Block containing the node.

"parameter"

Parameter variables are the formal arguments to subroutine and methods, typically represented as PAST::Block nodes. The parameter's lexical name is given by the node's name attribute. Positional parameters are defined by the order in which they appear in the PAST::Block node, named parameters have values for the named attribute. Optional parameters are identified via the viviself attribute (see below) indicating how the parameter is to be initialized if not supplied by the caller. Slurpy parameters are indicated via the node's slurpy attribute.

"keyed"

Keyed variables represent the elements of aggregates such as arrays and hashes. Nodes representing keyed elements have two children; the first child is the PAST representation of the aggregate, and the second child is the PAST representation of the key or index. The vivibase attribute below may be used to specify how to generate the aggregate if it doesn't already exist.

"attribute"

Attribute variables represent object attributes (in some languages they're called "member variables"). The attribute's name is given by the node's name attribute. Nodes representing attribute variables have an optional child, representing the object to which the attribute belongs. If this child is not present, the attribute is assumed to belong to the current invocant, indicated by the special variable self (which is implicitly passed to all subs that are flagged as a :method or :vtable).

"register"

Register variables are limited in scope to the PAST::Block node in which they are declared. This is different from the lexical scope, which includes any nested PAST::Block nodes. If the node's isdecl attribute is true, then this node defines a new register variable within the current block. Register variables are mapped to Parrot registers, and are useful for handling the PIR pseudo-variable self and storing intermediate results. Names given to the name attribute must conform to rules for PIR identifiers. If no name attribute is set, Parrot registers are used. In this case, setting the isdecl does not have any effect.

If scope is not explicitly provided in the node, then PAST will look at the local symbol tables of any outer PAST::Block nodes to try to determine the scope of the named variable. If this still does not result in a scope, then 'lexical' is assumed.

viviself([value])

Accessor method for the viviself attribute, which specifies how uninitialized variables are to be initialized when first used. The viviself attribute may be either a string or array representing a type (e.g., Integer), or it may be a PAST representation for generating the variable's initial value.

vivibase([value])

Accessor method for the vivibase attribute, which specifies how the base portion of aggregate variables (see "keyed scope" above) is to be created if it doesn't already exist. Vivibase may either be a string or array representing the type of the newly created aggregate, or it may be a PAST representation for generating the aggregate.

isdecl([flag])

Accessor method for the isdecl attribute. A true value of isdecl indicates that the variable given by this node is being created at this point within the current lexical scope. Otherwise, the node refers to a pre-existing variable (possibly from an outer scope).

lvalue([flag])

Accessor method for the lvalue attribute, which indicates whether this variable is being used in an lvalue context.

slurpy([flag])

Accessor method for the slurpy attribute of parameter variables. A true value of slurpy indicates that the parameter variable given by this node is to be created as a slurpy parameter (consuming all remaining arguments passed in). Named slurpy parameters are indicated by having a non-empty named attribute and a true value of slurpy.

PAST::Op

PAST::Op nodes represent the operations in an abstract syntax tree. The primary function of the node is given by its pasttype attribute, secondary functions may be indicated by the node's name, pirop, or other attributes as given below.

pasttype([value])

Accessor method for the node's pasttype attribute. The pasttype is the primary indicator of the type of operation to be performed, the operands are typically given as the children of the node. Defined values of pasttype are:

chain

A short-circuiting chain of operations. In a sequence of nodes with pasttype 'chain', the right operand of a node serves as the left operand of its parent. Each node is evaluated only once, and the first false result short-circuits the chain. In other words, $x < $y < $z is true only if $x < $y and $y < $z, but $y only gets evaluated once.

copy

Copy the value of the node's second child into the variable expression given by its first child.

bind

Bind the variable given by the node's first child to the value given by its second child.

if

The first, second, and third children represent the "condition", "then", and "else" parts of a conditional expression. The first child is evaluated; if the result is true then the second child is evaluated and returned as the result of the PAST::Op node, otherwise the third child is evaluated and returned as the result. This implements the standard if-then-else logic needed by most higher level languages, and can also be used for implementing the ternary operator.

If the node is missing its second ("then") or third ("else") child, then the result of the condition is used as the corresponding result of the operation. This makes it easy to implement the "short-circuit and" operator commonly used in many high level languages. For example, the standard && operator may be implemented using an "if" node, where the left operand is the first (condition) child, the right operand is the second (then) child, and the third child is left as null or uninitialized.

It's also possible to build a "short-circuit or" (||) operator using this pasttype, by setting the left operand to || as the first child and the right operand as the third child (leaving the second child as null). However, it's probably simpler to use the "unless" type as described below.

unless

Same as if above, except that the second child is evaluated if the first child evaluates to false and the third child is evaluated if the first child evaluates to true.

The unless type can be used to implement "short-circuit or" semantics; simply set the first child to the left operand and the second child to the right operand, leaving the third child empty or uninitialized. If the first child evaluates to true it is returned as the result of the operation, otherwise the second child is evaluated and returned as the result.

while

Evaluate the first child (condition), if the result is true then evaluate the second child (body) and repeat.

until

Evaluate the first child (condition), if the result is false then evaluate the second child (body) and repeat.

repeat_while, repeat_until

Same as while and until above, except the second child is evaluated before the conditional first child is evaluated for continuation of the loop.

for

Iterate over the first child in groups of elements given by arity (default 1). For each iteration, invoke the second child, passing the elements as parameters.

call

Call the subroutine given by the name attribute, passing the results of any child nodes as arguments. If the node has no name attribute, then the first child is assumed to evaluate to a callable subroutine, and any remaining children are used as arguments.

callmethod

Invoke the method given by name on the first child, passing the results of any child nodes as arguments. If the node has no name attribute, then the first child is evaluated as a method to be called, the second child is the invocant, and any remaining children are arguments to the method call.

pirop

Execute the PIR opcode given by the pirop attribute. See the pirop method below for details. This is also the default behavior when a pirop attribute is set and pasttype is not.

inline

Execute the sequence of PIR statements given by the node's inline attribute (a string). See the inline method below for details.

return

Generate a return exception, using the first child (if any) as a return value.

try

Evaluates the first child, if any exceptions occur then they are handled by the code given by the second child (if any).

xor

Evaluate the child nodes looking for exactly one true result to be returned. If two true results are encountered, the operation immediately short-circuits and returns false. Otherwise, after all children have been evaluated the result of any true child is used as the result of the operation, or the result of the last child if none of the children evaluated to true.

pirop([opcode])

Get/set the PIR opcode to be executed for this node. Internally the PAST and POST (Parrot Opcode Syntax Tree) implementations understand the register types available for common PIR operations and will handle any needed register or constant conversion of operands automatically. Note that except for the assign opcode, any destination is typically generated automatically and should not be explicitly given as a child operand to the node. The table of PIR opcodes that PAST "knows" about is given in compilers/pct/src/PAST/Compiler.pir .

lvalue([flag])

Get/set whether this node is an lvalue, or treats its first child as an lvalue (e.g., for assignment).

inline([STRING code])

Get/set the code to be used for inline PIR when pasttype is set to "inline". The code argument is PIR text to be inserted in the final generated code sequence. Sequences of "%0", "%1", "%2", ... "%9" in code are replaced with the evaluated results of the first, second, third, ..., tenth children nodes. (If you need more than ten arguments to your inline PIR, consider making it a subroutine call instead.)

The register to hold the result of the inline PIR operation is given by "%r", "%t", or "%u" in the code string:

%r   - Generate a unique PMC register for the result.
%t   - Generate a unique PMC register for the result,
       and initialize it with an object of type C<returns>
       {{Pm: or possibly C<viviself> }}
       before the execution of the inline PIR.
%u   - Re-use the first child's PMC (%0) if it's a temporary
       result, otherwise same as %t above.

If none of %r, %t, or %u appear in code, then the first child's (%0) is used as the result of this operation.

PAST::Block

PAST::Block nodes represent lexical scopes within an abstract syntax tree, and roughly translate to individual Parrot subroutines. A PAST::Block node nested within another PAST::Block node acts like a nested lexical scope.

If the block has a name attribute, that becomes the name of the resulting Parrot sub. Otherwise, a unique name is automatically generated for the block.

Each PAST::Block node can maintain its own local symbol table, see the symbol method below.

blocktype([type])

Get/set the type of the block to type. The currently understood values are 'declaration', 'immediate', and 'method'. 'Declaration' indicates that a block is simply being defined at this point, the result of the node is a reference to the block. A blocktype of 'immediate' indicates a block that is to be immediately executed when it is evaluated in the AST, and the result of the last operation is used as the return value for the block.

closure([value])

Get/set the closure flag for the block to value. If the closure flag on a (non-immediate) block is true, then the node returns a reference to a clone of the block that has captured the current lexical context.

namespace([value])

Get/set the namespace for this particular block (and any nested blocks that do not explicitly provide a namespace). Value may either be a simple string or an array of strings representing a nested namespace.

hll([value])

Get/set the HLL namespace for this block (and any nested blocks that do not explicitly provide a hll).

symbol(name, [attr1 => val1, attr2 => val2, ...])

Each PAST::Block node can use the symbol method to maintain its own compile-time notion of a local symbol table. Each value of name is given its own hash to hold information about that symbol for the block (i.e., the symbol table acts like a hash of hashes indexed by symbol name). If the symbol method is called with named arguments, then the method sets the entries in the hash corresponding to name in the block's symbol table. If symbol is called with just a single name argument, then the current hash for local symbol name is returned.

HLLs are free to place any values in the symbol hashes that may be useful. However, the scope entry for a symbol is typically used to provide the scope attribute for any nested PAST::Var nodes that do not provide an explicit scope attribute.

symbol_defaults([attr1 => val1, attr2 => val2, ...])

Sets default attributes for symbols that aren't explicitly given by the symbol method above. For example, an abstract syntax tree can use this method on a top-level PAST::Block to specify the scope attribute to be used for all variable nodes that don't otherwise provide one.

subid([value])

Get/set the unique subid to be used for this block. If no subid is explicitly set, a unique subid is randomly generated for the block.

pirflags([value])

Get/set any PIR flags to be used when generating the block.

compiler([compiler_name])

Specify that the children nodes of this block are to be compiled using compiler_name instead of being treated as standard PAST nodes. This is useful when a program may contain components of code written in other HLLs. For example, the perl6 compiler uses this feature to use PGE to compile pre-parsed Perl 6 regular expressions, rather than trying to represent the semantics of those expressions in PAST itself.

When code is generated from a PAST::Block node having a compiler attribute, the compiler is invoked with name and outer arguments so that any generated subs can have names and lexical scoping appropriate to the block (it's the responsibility of the external compiler to support this).

PAST::CompUnit

PAST::CompUnit nodes are used for the abstract representation of compilation units. Specific attributes and semantics are yet to be determined.

References

None.