NAME

YAX::Parser - fast pure Perl tree and stream parser

SYNOPSIS

my $xml_str = <<XML;
<?xml version="1.0" ?>
<doc>
<content id="42"><![CDATA[
This is a cdata section, so >>anything goes!<<
]]>
</content>
<!-- comments are nodes too -->
</doc>
XML
# tree parse - the common case
my $xml_doc = YAX::Parser->parse( $xml_str );
$xml_doc = YAX::Parser->parse_file( $path );
# shallow parse
my @tokens = YAX::Parser->tokenize( $xml_str );
# stream parse
YAX::Parser->stream( $xml_str, $state, %handlers );
YAX::Parser->stream_file( '/some/file.xml', $state, %handlers );

DESCRIPTION

This module implements a fast DOM and stream parser based on Robert D. Cameron's regular expression shallow parsing grammar and technique. It doesn't implement the full W3C DOM API by design. Instead, it takes a more pragmatic approach. DOM trees are constructed with everything being an object except for attributes, which are stored as a hash reference.
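
To give a feel for what regular expression shallow parsing means, here is a minimal sketch. It is not the grammar YAX::Parser uses: the function name shallow_tokenize and the pattern are illustrative assumptions, and the pattern is deliberately naive (for instance, it does not handle a '>' inside an attribute value).

```perl
use strict;
use warnings;

# Simplified illustration of shallow parsing. This is NOT the grammar
# YAX::Parser uses: one alternation splits the input into markup and text.
sub shallow_tokenize {
    my ($xml) = @_;
    my @tokens = $xml =~ m{
        ( <!--.*?-->             # comment
        | <!\[CDATA\[.*?\]\]>    # CDATA section
        | <\?.*?\?>              # processing instruction or XML declaration
        | <[^>]*>                # any other tag, naively matched
        | [^<]+                  # run of character data
        )
    }gsx;
    return @tokens;
}

my @tokens = shallow_tokenize('<doc><!-- note --><a id="1">hi</a></doc>');
print "$_\n" for @tokens;
```

The more specific alternatives come first so that comments and CDATA sections, which may contain '>', are not cut short by the generic tag pattern.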

We also borrow some ideas from browser implementations, in particular, nodes are keyed in a table in the document on their id attributes (if present) so you can say:

my $found = $xml_doc->get( $node_id );

Parsing is usually done by calling class methods on YAX::Parser. When invoked as a tree parser, it returns an instance of YAX::Document:

my $xml_doc = YAX::Parser->parse( $xml_str );

METHODS

See the "SYNOPSIS" for usage examples; here is just the list for now:

parse( $xml_str )

Parse $xml_str and return a YAX::Document object.

parse_file( $path )

Same as above, but reads the file at $path for the input.

stream( $xml_str, $state, %handlers )

Although not its main focus, YAX::Parser also provides for stream parsing. It tries to be a bit more sane than Expat, in that it allows you to specify a state holder which can be anything and is passed as the first argument to the handler functions. A typical case is to use a hash reference with a stack (for tracking nesting):

my $state = { stack => [ ] };

All handler functions are optional. The full list is:

my %handlers = (
text => \&handle_text, # called for text nodes
elmt => \&handle_element_open, # called for open tags
elcl => \&handle_element_close, # called for tag close
decl => \&handle_declaration, # called for declarations
proc => \&handle_proc_inst, # called for processing instructions
pass => \&handle_passthrough, # called when no handlers match
);

An element handler is passed the state, the tag name, and the attributes hash:

sub handle_element_open {
my ( $state, $name, %attributes ) = @_;
if ( $name eq 'a' and $attributes{href} ) {
...
}
}

Element close handlers take two arguments, the state and the tag name:

sub handle_element_close {
my ( $state, $name ) = @_;
die "not well formed" unless pop @{ $state->{stack} } eq $name;
}
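
The close handler above only works if the open handler pushes each tag name onto the stack. A minimal sketch of the pair follows. The handlers are called by hand here so the example runs without the parser; the sequence of calls is an assumption about the events a stream parse of a small document would produce, not captured output.

```perl
use strict;
use warnings;

# Open handler: remember the tag so the close handler can check nesting.
sub handle_element_open {
    my ( $state, $name, %attributes ) = @_;
    push @{ $state->{stack} }, $name;
}

# Close handler: the tag being closed must match the most recent open tag.
sub handle_element_close {
    my ( $state, $name ) = @_;
    my $open = pop @{ $state->{stack} };
    die "not well formed: unexpected close of '$name'\n"
        unless defined $open and $open eq $name;
}

# Drive the handlers by hand, standing in for the events assumed for
# '<doc><content id="42"></content></doc>'.
my $state = { stack => [] };
handle_element_open( $state, 'doc' );
handle_element_open( $state, 'content', id => '42' );
handle_element_close( $state, 'content' );
handle_element_close( $state, 'doc' );
print "balanced\n" unless @{ $state->{stack} };
```

Checking that the popped value is defined avoids an "uninitialized value" warning when a close tag arrives with an empty stack.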

All other handlers take the state and the entire matched token:

sub handle_proc_inst {
my ( $state, $token ) = @_;
# list assignment leaves $instr undefined if the token does not match
my ( $instr ) = $token =~ /^<\?(.*?)\?>$/s;
...
}
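
As a slightly fuller sketch of such a handler, the token can be split into the instruction's target and its data. The helper name split_proc_inst and the instructions key in the state are hypothetical; the handler is called directly here rather than through the parser.

```perl
use strict;
use warnings;

# Split a processing-instruction token into its target and data.
# Returns an empty list if the token is not shaped like <?target data?>.
sub split_proc_inst {
    my ($token) = @_;
    my ( $target, $data ) = $token =~ /^<\?(\S+)\s*(.*?)\s*\?>$/s
        or return;
    return ( $target, $data );
}

# Handler with the same signature as the one above: state, then the token.
sub handle_proc_inst {
    my ( $state, $token ) = @_;
    my ( $target, $data ) = split_proc_inst($token) or return;
    push @{ $state->{instructions} }, [ $target, $data ];
}

my $state = {};
handle_proc_inst( $state, '<?xml-stylesheet href="style.xsl"?>' );
print join( ' => ', @{ $state->{instructions}[0] } ), "\n";
```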

stream_file( $path, $state, %handlers )

Same as stream, but reads the file at $path for the input.

tokenize( $xml_str )

Useful for quick and dirty tokenizing of $xml_str. Returns a list of tokens.
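
One thing a token list is handy for is a rough classification pass. The sketch below assumes the tokens are plain strings of raw markup and text, which this document does not spell out. The function classify_token and its labels are hypothetical, and the token list is hard-coded so the example runs on its own.

```perl
use strict;
use warnings;

# Label a raw token by looking only at how it starts.
# Assumes each token is a plain string of markup or text.
sub classify_token {
    my ($token) = @_;
    return 'comment' if $token =~ /^<!--/;
    return 'cdata'   if $token =~ /^<!\[CDATA\[/;
    return 'proc'    if $token =~ /^<\?/;
    return 'decl'    if $token =~ /^<!/;
    return 'close'   if $token =~ m{^</};
    return 'open'    if $token =~ /^</;
    return 'text';
}

# Hard-coded stand-in for a token list.
my @tokens = ( '<doc>', 'hello', '<!-- note -->', '</doc>' );
printf "%-8s %s\n", classify_token($_), $_ for @tokens;
```

The order of the checks matters: comments and CDATA sections must be tested before the generic declaration check, since all three start with "<!".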

SEE ALSO

YAX::Document, YAX::Node

LICENSE

This program is free software and may be modified and distributed under the same terms as Perl itself.

AUTHOR

Richard Hundt