NAME

XML::Reader - Reading XML and providing path information based on a pull-parser.

SYNOPSIS

use XML::Reader;

my $text = '<root>stu<test param="v">w</test>xyz</root>';
my $rdr = XML::Reader->new(\$text) or die "Error: $!";

while ($rdr->iterate) {
    print "Path = ", $rdr->path, ", Value = ", $rdr->value, "\n";
}

DESCRIPTION

XML::Reader provides a simple, easy-to-use interface for sequentially parsing XML files (so-called "pull-mode" parsing) while keeping track of the complete XML-path.

It was developed as a thin wrapper on top of XML::TokeParser. XML::TokeParser allows pull-mode parsing, but does not keep track of the complete XML-path. Also, the interface to XML::TokeParser (see $t->is_start_tag, $t->is_end_tag, $t->is_text) requires you to distinguish between start-tags, end-tags and text, which, in my view, complicates the interface.

There is also XML::TiePYX, which lets you pull-mode parse XML files (see http://www.xml.com/pub/a/2000/03/15/feature/index.html for an introduction to PYX). But with XML::TiePYX you still need to account for start-tags, end-tags and text, and it does not provide the full XML-path.

By contrast, XML::Reader translates start-tags, end-tags and text into XPath-like expressions. You don't need to worry about tags: you just get a path and a value.

For example, the following XML...

<data>
  <item>abc</item>
  <item>
    <dummy/>
    fgh
    <inner name="ttt" id="fff">
      ooo <!-- comment --> ppp
    </inner>
  </item>
</data>

...corresponds to a sequence of path/value pairs.

You can also keep track of the start- and end-tags. The method is_start returns 1 or 0, depending on whether the XML-file had a start tag at the current position. The method is_end does the same for end tags. Remember that these two methods only make sense if filter is switched off (otherwise they always return 0). Finally, the method tag gives you the current tag-name (or attribute-name).

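Here is the sequence of path/value pairs, including is_start, is_end and tag (the comment line and the non-zero is_start/is_end values imply the options {comment => 1, filter => 0}):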

path = '/data'                  value = ''        is_start = 1 is_end = 0 tag = 'data'
path = '/data/item'             value = 'abc'     is_start = 1 is_end = 1 tag = 'item'
path = '/data'                  value = ''        is_start = 0 is_end = 0 tag = 'data'
path = '/data/item'             value = ''        is_start = 1 is_end = 0 tag = 'item'
path = '/data/item/dummy'       value = ''        is_start = 1 is_end = 1 tag = 'dummy'
path = '/data/item'             value = 'fgh'     is_start = 0 is_end = 0 tag = 'item'
path = '/data/item/inner'       value = ''        is_start = 1 is_end = 0 tag = 'inner'
path = '/data/item/inner/@id'   value = 'fff'     is_start = 0 is_end = 0 tag = 'id'
path = '/data/item/inner/@name' value = 'ttt'     is_start = 0 is_end = 0 tag = 'name'
path = '/data/item/inner'       value = 'ooo'     is_start = 0 is_end = 0 tag = 'inner'
path = '/data/item/inner/#'     value = 'comment' is_start = 0 is_end = 0 tag = ''
path = '/data/item/inner'       value = 'ppp'     is_start = 0 is_end = 1 tag = 'inner'
path = '/data/item'             value = ''        is_start = 0 is_end = 1 tag = 'item'
path = '/data'                  value = ''        is_start = 0 is_end = 1 tag = 'data'
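A loop along the following lines should produce a listing like the one above. This is only a sketch: it assumes the version of XML::Reader documented here, and it assumes the listing was generated with comments enabled and filtering disabled. The helper quoted() and the column widths are illustrative choices, not part of the module.

```perl
use strict;
use warnings;
use XML::Reader;

my $xml = <<'EOT';
<data>
  <item>abc</item>
  <item>
    <dummy/>
    fgh
    <inner name="ttt" id="fff">
      ooo <!-- comment --> ppp
    </inner>
  </item>
</data>
EOT

# Assumption: comments on, filter off, so that is_start/is_end are meaningful
my $rdr = XML::Reader->new(\$xml, {comment => 1, filter => 0})
  or die "Error: $!";

# Hypothetical helper: wrap a value in single quotes, treating undef as empty
sub quoted { my $v = shift; return defined $v ? "'$v'" : "''" }

my @rows;
while ($rdr->iterate) {
    push @rows, sprintf(
        "path = %-24s value = %-10s is_start = %s is_end = %s tag = %s",
        quoted($rdr->path), quoted($rdr->value),
        $rdr->is_start, $rdr->is_end, quoted($rdr->tag),
    );
}

print "$_\n" for @rows;
```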

INTERFACE

Object creation

To create an XML::Reader object, the following syntax is used:

my $rdr = XML::Reader->new($data, {comment => 0, strip => 1, filter => 1})
  or die "Error: $!";

The argument $data (which is mandatory) is either the name of an XML-file or a reference to a string, in which case the content of that string is taken as the text of the XML.

Here is an example to create an XML::Reader object with a file-name:

my $rdr = XML::Reader->new('input.xml') or die "Error: $!";

Here is another example to create an XML::Reader object with a reference to a string:

my $rdr = XML::Reader->new(\'<data>abc</data>') or die "Error: $!";

One or more of the following options can be added as a hash-reference:

option {comment => 0}

The option {comment => 1} allows comments to be passed through. The option {comment => 0} disables comments. The default is {comment => 0}.

option {strip => 1}

The option {strip => 1} strips all leading and trailing spaces from text and comments (attributes are never stripped). The default is {strip => 1}.

option {filter => 1}

The option {filter => 1} removes all empty text lines. If you want to use the is_start and is_end methods, you have to set option {filter => 0}. The default is {filter => 1}.

Methods

A successfully created object of type XML::Reader provides the following methods:

iterate

Reads one single XML-value. It returns 1 after a successful read, or undef when it hits end-of-file.

path

Provides the complete path of the currently selected value. Attributes are represented by a leading '@'-sign, comments by a '#'-symbol.

value

Provides the actual value (i.e. text, attribute or comment).

type

Provides the type of the value: 'T' for text, '@' for attributes, '#' for comments.

tag

Provides the current tag-name (or attribute-name).

is_start

Returns 1 or 0, depending on whether the XML-file had a start tag at the current position. Be careful, this method only makes sense if filter is switched off (otherwise it always returns 0).

is_end

Returns 1 or 0, depending on whether the XML-file had an end tag at the current position. Be careful, this method only makes sense if filter is switched off (otherwise it always returns 0).

level

Indicates the nesting level of the XPath expression (numeric value greater than zero).
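As an illustration of how the methods combine, the following sketch uses type to separate attributes from text and level to indent the output. It relies only on the methods described above and on the default options; the variable names %attr and @text are arbitrary, and no particular output is guaranteed here.

```perl
use strict;
use warnings;
use XML::Reader;

my $text = '<root>stu<test param="v">w</test>xyz</root>';
my $rdr  = XML::Reader->new(\$text) or die "Error: $!";

my (%attr, @text);
while ($rdr->iterate) {
    if ($rdr->type eq '@') {
        # Attribute: remember its value, keyed by the full path
        $attr{$rdr->path} = $rdr->value;
    }
    elsif ($rdr->type eq 'T') {
        # Text: indent by nesting depth (level is documented as > 0)
        my $indent = '  ' x ($rdr->level - 1);
        push @text, $indent . $rdr->path . ' = ' . $rdr->value;
    }
}

print "$_\n" for @text;
print "$_ = $attr{$_}\n" for sort keys %attr;
```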

OTHER CONSIDERATIONS

Memory leak in XML::TokeParser

The XML::TokeParser object has a circular reference, see subroutine XML::TokeParser::new:

$self->{parser} = $parser->parse_start( TokeParser => $self )

This line of code generates a circular reference as follows:

$self->{parser}{TokeParser} == $self

XML::TokeParser attempts to resolve this circular reference during object destruction: its DESTROY subroutine undefines the first element in the chain of that circular reference.

package XML::TokeParser;

sub DESTROY {
    my $self = shift;
    ...
    $self->{parser} = undef;
}

Unfortunately, this approach does not work, as the XML::TokeParser object is part of the circular reference itself and therefore XML::TokeParser::DESTROY will not be called until the circular reference is cleaned up during global destruction.

The solution is to resolve the circular reference in the DESTROY subroutine of a package which is not part of the circular reference itself. The obvious solution is the correct one: We resolve the circular reference in the DESTROY subroutine of this package XML::Reader.

One thing to remember here is that we are now dealing with an additional level of indirection, i.e. instead of the following statement in XML::TokeParser::DESTROY...

$self->{parser} = undef;

...we now have to say (in XML::Reader::DESTROY)...

$self->{parser}{parser} = undef;

...This should work, but tests with XML::Reader have shown that it does not. You may ask why. I don't know. What does work, however, is the following instruction in XML::Reader::DESTROY...

$self->{parser}{parser}{TokeParser} = undef;
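The general mechanism can be reproduced in plain Perl without any XML modules. The sketch below uses two hypothetical packages, Inner (standing in for XML::TokeParser) and Outer (standing in for XML::Reader), and a hash key named owner in place of TokeParser. It illustrates the reference-counting behaviour only and is not the actual code of either module.

```perl
use strict;
use warnings;

our @log;    # records the order in which destructors run

package Inner;    # hypothetical stand-in for XML::TokeParser

sub new {
    my $class = shift;
    my $self  = bless {}, $class;
    # Build the cycle: $self->{parser}{owner} == $self
    $self->{parser} = { owner => $self };
    return $self;
}

sub DESTROY { push @main::log, 'Inner::DESTROY' }

package Outer;    # hypothetical stand-in for XML::Reader

sub new {
    my $class = shift;
    return bless { parser => Inner->new }, $class;
}

sub DESTROY {
    my $self = shift;
    push @main::log, 'Outer::DESTROY';
    # Outer is not part of the cycle, so this destructor runs as soon as
    # the last reference to the Outer object goes away. Break the cycle here.
    $self->{parser}{parser}{owner} = undef;
}

package main;

{ my $lonely = Inner->new; }     # cycle keeps the object alive: no DESTROY yet
push @log, 'after bare Inner';

{ my $wrapped = Outer->new; }    # Outer::DESTROY breaks the cycle, Inner follows
push @log, 'after Outer';

print "$_\n" for @log;
```

The bare Inner object is not destroyed when it goes out of scope, because the cycle keeps its reference count above zero. The wrapped one is released immediately, because the wrapper's destructor cuts the back-reference first.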

AUTHOR

Klaus Eichner, March 2009

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Klaus Eichner

All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

RELATED MODULES

If you also want to write XML, have a look at XML::Writer. This module provides a simple interface for writing XML. (If you are writing non-mixed content XML, consider setting DATA_MODE=>1 and DATA_INDENT=>2, which allows for proper indentation in your XML output file.)

SEE ALSO

XML::TokeParser, XML::Parser, XML::TiePYX, XML::Writer.