NAME

XML::Reader - Reading XML and providing path information based on a pull-parser.

SYNOPSIS

use XML::Reader;

my $text = q{<init><page node="400">m <!-- remark --> r</page></init>};

my $rdr = XML::Reader->new(\$text, {filter => 2}) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-19s, Value: %s\n", $rdr->path, $rdr->value;
}

This program produces the following output:

Path: /init              , Value:
Path: /init/page/@node   , Value: 400
Path: /init/page         , Value: m r
Path: /init              , Value:

DESCRIPTION

XML::Reader provides a simple, easy-to-use interface for sequentially parsing XML files (so-called "pull-mode" parsing) while keeping track of the complete XML-path.

It was developed as a wrapper on top of XML::Parser (some basic functions have also been copied from XML::TokeParser). Both XML::Parser and XML::TokeParser allow pull-mode parsing, but do not keep track of the complete XML-path. Also, the interfaces to XML::Parser and XML::TokeParser require you to distinguish between start-tags, end-tags and text, which, in my view, complicates the interface.

There is also XML::TiePYX, which lets you pull-mode parse XML-Files (see http://www.xml.com/pub/a/2000/03/15/feature/index.html for an introduction to PYX). But still, with XML::TiePYX you need to account for start-tags, end-tags and text, and it does not provide the full XML-path.

By contrast, XML::Reader translates start-tags, end-tags and text into XPath-like expressions. So you don't need to worry about tags, you just get a path and a value, and that's it.

For example, the following XML in variable '$line1'...

my $line1 = q{
  <data>
    <item>abc</item>
    <item><!-- c1 -->
      <dummy/>
      fgh
      <inner name="ttt" id="fff">
        ooo <!-- c2 --> ppp
      </inner>
    </item>
  </data>
};

...can be parsed with XML::Reader using the method iterate to step one-by-one through the XML-data, and the methods path and value to extract the current XML-path and its value.

You can also keep track of the start- and end-tags: There is a method is_start, which returns 1 or 0, depending on whether the XML-file had a start tag at the current position. There is also the equivalent method is_end. If you want to know whether you have encountered a fresh sequence of attributes, you can use the method is_init_attr.

There are also the methods comment, tag, attr, type and level. comment returns the comment, if any. tag gives you the current tag-name, attr returns the attribute-name, type returns either 'T' for text or '@' for attributes and level indicates the current nesting-level (a number >= 0).

Here is a sample program which parses the XML in '$line1' from above to demonstrate the principle...

use XML::Reader;

my $rdr = XML::Reader->new(\$line1, {filter => 2}) or die "Error: $!";
my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. pat=%-22s, val=%-9s, s=%-1s, i=%-1s, e=%-1s, tag=%-6s, atr=%-6s, t=%-1s, lvl=%2d, c=%s\n",
     $i, $rdr->path, $rdr->value, $rdr->is_start, $rdr->is_init_attr,
     $rdr->is_end, $rdr->tag, $rdr->attr, $rdr->type, $rdr->level, $rdr->comment;
}

...and here is the output:

 1. pat=/data                 , val=         , s=1, i=0, e=0, tag=data  , atr=      , t=T, lvl= 1, c=
 2. pat=/data/item            , val=abc      , s=1, i=0, e=1, tag=item  , atr=      , t=T, lvl= 2, c=
 3. pat=/data                 , val=         , s=0, i=0, e=0, tag=data  , atr=      , t=T, lvl= 1, c=
 4. pat=/data/item            , val=         , s=1, i=0, e=0, tag=item  , atr=      , t=T, lvl= 2, c=c1
 5. pat=/data/item/dummy      , val=         , s=1, i=0, e=1, tag=dummy , atr=      , t=T, lvl= 3, c=
 6. pat=/data/item            , val=fgh      , s=0, i=0, e=0, tag=item  , atr=      , t=T, lvl= 2, c=
 7. pat=/data/item/inner/@id  , val=fff      , s=0, i=1, e=0, tag=@id   , atr=id    , t=@, lvl= 4, c=
 8. pat=/data/item/inner/@name, val=ttt      , s=0, i=0, e=0, tag=@name , atr=name  , t=@, lvl= 4, c=
 9. pat=/data/item/inner      , val=ooo ppp  , s=1, i=0, e=1, tag=inner , atr=      , t=T, lvl= 3, c=c2
10. pat=/data/item            , val=         , s=0, i=0, e=1, tag=item  , atr=      , t=T, lvl= 2, c=
11. pat=/data                 , val=         , s=0, i=0, e=1, tag=data  , atr=      , t=T, lvl= 1, c=

If you want, you can set option {filter => 1} to select only those lines that have a value.

use XML::Reader;

my $rdr = XML::Reader->new(\$line1, {filter => 1}) or die "Error: $!";
my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. pat=%-22s, val=%-9s, tag=%-6s, atr=%-6s, t=%-1s, lvl=%2d\n",
     $i, $rdr->path, $rdr->value, $rdr->tag, $rdr->attr, $rdr->type, $rdr->level;
}

With option {filter => 1}, the methods $rdr->is_start, $rdr->is_init_attr, $rdr->is_end and $rdr->comment return undef, so be careful not to interpret them. The output is then as follows:

1. pat=/data/item            , val=abc      , tag=item  , atr=      , t=T, lvl= 2
2. pat=/data/item            , val=fgh      , tag=item  , atr=      , t=T, lvl= 2
3. pat=/data/item/inner/@id  , val=fff      , tag=@id   , atr=id    , t=@, lvl= 4
4. pat=/data/item/inner/@name, val=ttt      , tag=@name , atr=name  , t=@, lvl= 4
5. pat=/data/item/inner      , val=ooo ppp  , tag=inner , atr=      , t=T, lvl= 3

INTERFACE

Object creation

To create an XML::Reader object, the following syntax is used:

my $rdr = XML::Reader->new($data,
  {strip => 1, filter => 2, using => ['/path1', '/path2']})
  or die "Error: $!";

The element $data (which is mandatory) is either the name of the XML-file, or a reference to a string, in which case the content of that string is taken as the text of the XML.

Here is an example to create an XML::Reader object with a file-name:

my $rdr = XML::Reader->new('input.xml') or die "Error: $!";

Here is another example to create an XML::Reader object with a reference:

my $rdr = XML::Reader->new(\'<data>abc</data>') or die "Error: $!";

One or more of the following options can be added as a hash-reference:

option {strip => 0|1}

The option {strip => 1} strips all leading and trailing spaces from text and comments (attributes are never stripped).

The default is {strip => 1}.

option {filter => 0|1|2}

Option {filter => 0} produces the maximum number of output lines. Option {filter => 2} removes only those empty lines that are immediately followed by their own attribute line. Option {filter => 1} produces the minimum number of output lines, i.e. only lines that have a value. Be careful if you want to use one of the four methods is_start, is_init_attr, is_end or comment: with option {filter => 1}, those four methods return undef.

The default is {filter => 0}.

option {using => ['/path1/path2/path3', '/path4/path5/path6']}

This option removes all lines whose path does not start with '/path1/path2/path3' (or with '/path4/path5/path6'), leaving only the lines that do. For the remaining lines, the matching prefix ('/path1/path2/path3' or '/path4/path5/path6') is removed from the path, so the path becomes shorter. The removed prefix is still available through the prefix method.

The paths '/path1/path2/path3' and '/path4/path5/path6' are supposed to be absolute and complete. Absolute means that they have to start with a '/'-character. Complete means that the last item in the path ('path3' or 'path6') will be completed internally by a trailing '/'-character.
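The "absolute and complete" rule can be sketched in plain Perl. This is only an illustration of one reading of the rule, namely that the trailing '/'-character prevents a prefix from matching part of a longer item name. It is not the code used inside XML::Reader, and the helper split_using is a hypothetical name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Reader): applies a list of
# 'using' prefixes to one path. Returns (prefix, shortened path) on a
# match, or an empty list if the path would be removed.
sub split_using {
    my ($path, @using) = @_;
    for my $prefix (@using) {
        # Completing both sides with a trailing '/' keeps
        # '/data/supplier' from matching '/data/supplier2'.
        if (index("$path/", "$prefix/") == 0) {
            my $rest = substr($path, length $prefix);
            return ($prefix, $rest eq '' ? '/' : $rest);
        }
    }
    return;
}

my @using = ('/data/order/database/customer', '/data/supplier');

for my $path ('/data/supplier',
              '/data/order/database/customer/@name',
              '/data/supplier2',
              '/data/dummy') {
    my ($prefix, $rest) = split_using($path, @using);
    printf "%-36s => %s\n", $path,
        defined $prefix ? "prefix=$prefix, path=$rest" : 'removed';
}
```

Under this reading, '/data/supplier' is kept with the shortened path '/', while '/data/supplier2' and '/data/dummy' are removed.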

Methods

A successfully created object of type XML::Reader provides the following methods:

iterate

Reads one single XML-value. It returns 1 after a successful read, or undef when it hits end-of-file.

path

Provides the complete path of the currently selected value. Attributes are represented by leading '@'-signs, comments are represented by a '#'-symbol.

value

Provides the actual value (i.e. the value of the current text, attribute or comment).

comment

Provides the comments of the XML.

type

Provides the type of the value: 'T' for text or '@' for attributes.

tag

Provides the current tag-name.

attr

Provides the current attribute (returns the empty string for non-attribute lines).

is_start

Returns 1 or 0, depending on whether the XML-file had a start tag at the current position. Be careful, this method only makes sense for option {filter => 0} or {filter => 2} (otherwise, in case of {filter => 1}, the method is_start returns undef).

is_init_attr

Returns 1 or 0, depending on whether a new sequence of attributes is initiated. Be careful, this method only makes sense for option {filter => 0} or {filter => 2} (otherwise, in case of {filter => 1}, the method is_init_attr returns undef).

is_end

Returns 1 or 0, depending on whether the XML-file had an end tag at the current position. Be careful, this method only makes sense for option {filter => 0} or {filter => 2} (otherwise, in case of {filter => 1}, the method is_end returns undef).

level

Indicates the nesting level of the XPath expression (numeric value greater than or equal to zero).

prefix

Shows the prefix which has been removed in option {using => ...}. Returns the empty string if option {using => ...} has not been specified.

OPTION USING

Here is a sample piece of XML (in variable '$line2'):

my $line2 = q{
<data>
  <order>
    <database>
      <customer name="aaa" />
      <customer name="bbb" />
      <customer name="ccc" />
      <customer name="ddd" />
    </database>
  </order>
  <dummy value="ttt">test</dummy>
  <supplier>hhh</supplier>
  <supplier>iii</supplier>
  <supplier>jjj</supplier>
</data>
};

An example with option 'using'

The following program takes this XML and parses it with XML::Reader, including the option 'using' to target specific elements:

use XML::Reader;

my $rdr = XML::Reader->new(\$line2, {filter => 2,
  using => ['/data/order/database/customer', '/data/supplier']});

my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. prf=%-29s, pat=%-7s, val=%-3s, tag=%-6s, t=%-1s, lvl=%2d\n",
      $i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->tag, $rdr->type, $rdr->level;
}

This is the output of that program:

 1. prf=/data/order/database/customer, pat=/@name , val=aaa, tag=@name , t=@, lvl= 1
 2. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 3. prf=/data/order/database/customer, pat=/@name , val=bbb, tag=@name , t=@, lvl= 1
 4. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 5. prf=/data/order/database/customer, pat=/@name , val=ccc, tag=@name , t=@, lvl= 1
 6. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 7. prf=/data/order/database/customer, pat=/@name , val=ddd, tag=@name , t=@, lvl= 1
 8. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 9. prf=/data/supplier               , pat=/      , val=hhh, tag=      , t=T, lvl= 0
10. prf=/data/supplier               , pat=/      , val=iii, tag=      , t=T, lvl= 0
11. prf=/data/supplier               , pat=/      , val=jjj, tag=      , t=T, lvl= 0
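As a small follow-up, the prefix and path methods can be combined to sort the selected values into separate lists. The following is a sketch only; the array names @customers and @suppliers are made up for this illustration.

```perl
use strict;
use warnings;
use XML::Reader;

my $line2 = q{
<data>
  <order>
    <database>
      <customer name="aaa" />
      <customer name="bbb" />
      <customer name="ccc" />
      <customer name="ddd" />
    </database>
  </order>
  <dummy value="ttt">test</dummy>
  <supplier>hhh</supplier>
  <supplier>iii</supplier>
  <supplier>jjj</supplier>
</data>
};

my $rdr = XML::Reader->new(\$line2, {filter => 2,
  using => ['/data/order/database/customer', '/data/supplier']})
  or die "Error: $!";

# Illustrative containers, one per 'using' prefix.
my (@customers, @suppliers);

while ($rdr->iterate) {
    # The prefix tells us which 'using' entry matched; the path is
    # relative to that prefix.
    if ($rdr->prefix eq '/data/order/database/customer'
        and $rdr->path eq '/@name') {
        push @customers, $rdr->value;
    }
    elsif ($rdr->prefix eq '/data/supplier' and $rdr->path eq '/') {
        push @suppliers, $rdr->value;
    }
}

print "customers: @customers\n";
print "suppliers: @suppliers\n";
```

Going by the output listed above, this should print the four customer names followed by the three supplier values.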

An example without option 'using'

The following program takes the same XML and parses it with XML::Reader, but without the option 'using'.

use XML::Reader;

my $rdr = XML::Reader->new(\$line2, {filter => 2});
my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. prf=%-1s, pat=%-37s, val=%-6s, tag=%-11s, t=%-1s, lvl=%2d\n",
     $i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->tag, $rdr->type, $rdr->level;
}

As you can see in the following output, there are many more lines written, the prefix is empty and the path is much longer than in the previous program:

 1. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
 2. prf= , pat=/data/order                          , val=      , tag=order      , t=T, lvl= 2
 3. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
 4. prf= , pat=/data/order/database/customer/@name  , val=aaa   , tag=@name      , t=@, lvl= 5
 5. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
 6. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
 7. prf= , pat=/data/order/database/customer/@name  , val=bbb   , tag=@name      , t=@, lvl= 5
 8. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
 9. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
10. prf= , pat=/data/order/database/customer/@name  , val=ccc   , tag=@name      , t=@, lvl= 5
11. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
12. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
13. prf= , pat=/data/order/database/customer/@name  , val=ddd   , tag=@name      , t=@, lvl= 5
14. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
15. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
16. prf= , pat=/data/order                          , val=      , tag=order      , t=T, lvl= 2
17. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
18. prf= , pat=/data/dummy/@value                   , val=ttt   , tag=@value     , t=@, lvl= 3
19. prf= , pat=/data/dummy                          , val=test  , tag=dummy      , t=T, lvl= 2
20. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
21. prf= , pat=/data/supplier                       , val=hhh   , tag=supplier   , t=T, lvl= 2
22. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
23. prf= , pat=/data/supplier                       , val=iii   , tag=supplier   , t=T, lvl= 2
24. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
25. prf= , pat=/data/supplier                       , val=jjj   , tag=supplier   , t=T, lvl= 2
26. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1

OPTION FILTER

Option {filter => 0}

Option {filter => 0} produces the maximum number of output lines. Here is a sample program to demonstrate the option {filter => 0}.

use XML::Reader;

my $text = q{<root><test param="v">e<data id="z">g</data>f</test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->new(\$text, {filter => 0}) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-19s, Value: %s\n", $rdr->path, $rdr->value;
}

This program produces the following output:

Path: /root              , Value:
Path: /root/test         , Value:
Path: /root/test/@param  , Value: v
Path: /root/test         , Value: e
Path: /root/test/data    , Value:
Path: /root/test/data/@id, Value: z
Path: /root/test/data    , Value: g
Path: /root/test         , Value: f
Path: /root              , Value: x yz

Option {filter => 2}

The above example shows lines with empty values, which could be considered as redundant. In particular the second line ("Path: /root/test, Value:") is not needed, as it is immediately followed by its own attribute line ("Path: /root/test/@param, Value: v").

The same goes for line five ("Path: /root/test/data, Value:") which is also unnecessary, as it is immediately followed by its own attribute line ("Path: /root/test/data/@id, Value: z").

In order to remove those two redundant lines (lines two and five), we can employ the option {filter => 2}.

use XML::Reader;

my $text = q{<root><test param="v">e<data id="z">g</data>f</test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->new(\$text, {filter => 2}) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-19s, Value: %s\n", $rdr->path, $rdr->value;
}

The program with option {filter => 2} produces the following output:

Path: /root              , Value:
Path: /root/test/@param  , Value: v
Path: /root/test         , Value: e
Path: /root/test/data/@id, Value: z
Path: /root/test/data    , Value: g
Path: /root/test         , Value: f
Path: /root              , Value: x yz

This looks better now: the redundant lines are gone. Please note that the first line ("Path: /root, Value:") is also empty, but has not been removed by {filter => 2}, because it is not followed by its own attribute. (It is followed by an attribute, but one with a different path, which is why we can't easily take it out.)

In fact, the first line is necessary for the structure of the XML.

Anyway, let us now look at the same example (with option {filter => 2}), but with an additional algorithm to reconstruct the original XML:

use XML::Reader;

my $text = q{<root><test param="v">e<data id="z">g</data>f</test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->new(\$text, {filter => 2}) or die "Error: $!";

my %at  = ();

while ($rdr->iterate) {
    my $indentation = '  ' x $rdr->level;

    if ($rdr->is_init_attr) { %at  = (); }
    if ($rdr->type eq '@')  { $at{$rdr->attr} = $rdr->value; }

    if ($rdr->is_start) {
        print $indentation, '<', $rdr->tag;
        if (%at) {
            my @a = map{" $_='$at{$_}'"} sort keys %at;
            print "@a";
        }
        print '>', "\n";
    }

    if ($rdr->type eq 'T' and $rdr->value ne '') {
        print $indentation, '  ', $rdr->value, "\n";
    }

    if ($rdr->is_end) {
        print $indentation, '</', $rdr->tag, '>', "\n";
    }
}

...and here is the output:

<root>
  <test param='v'>
    e
    <data id='z'>
      g
    </data>
    f
  </test>
  x yz
</root>

Option {filter => 1}

Now that we have seen that option {filter => 2} allows us to reconstruct the XML, we might want to remove empty lines altogether. That's what option {filter => 1} is all about. With option {filter => 1} we lose the ability to reconstruct the XML, but simple data processing is easier.

Here is a program:

use XML::Reader;

my $text = q{<root><test param="v">e<data id="z">g</data>f</test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->new(\$text, {filter => 1}) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-19s, Value: %s\n", $rdr->path, $rdr->value;
}

...and here is the output:

Path: /root/test/@param  , Value: v
Path: /root/test         , Value: e
Path: /root/test/data/@id, Value: z
Path: /root/test/data    , Value: g
Path: /root/test         , Value: f
Path: /root              , Value: x yz

Please be aware that with option {filter => 1}, the methods comment(), is_start(), is_init_attr() and is_end() are all out of service, i.e. they return undef.
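To illustrate what "simple data processing" can look like with option {filter => 1}, here is a minimal sketch that collects all values per path into a hash of lists. The hash name %values is chosen for this illustration only.

```perl
use strict;
use warnings;
use XML::Reader;

my $text = q{<root><test param="v">e<data id="z">g</data>f</test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->new(\$text, {filter => 1}) or die "Error: $!";

# Illustrative structure: path => list of values seen at that path.
my %values;

while ($rdr->iterate) {
    # With {filter => 1} every line carries a value, so no check for
    # empty values is needed here.
    push @{ $values{$rdr->path} }, $rdr->value;
}

for my $path (sort keys %values) {
    printf "%-19s => %s\n", $path, join(', ', @{ $values{$path} });
}
```

Only the methods path and value are used here, so the restriction on comment, is_start, is_init_attr and is_end does not matter for this kind of processing.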

AUTHOR

Klaus Eichner, March 2009

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Klaus Eichner

All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

RELATED MODULES

If you also want to write XML, have a look at XML::Writer. This module provides a simple interface for writing XML. (If you are writing non-mixed content XML, consider setting DATA_MODE=>1 and DATA_INDENT=>2, which allows for proper indentation in your XML output file.)
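The following is a minimal sketch of XML::Writer with the two settings mentioned above. The element names and values are made up for this illustration, and the output is collected in an in-memory file handle so that it can be printed at the end; writing to a real file works the same way.

```perl
use strict;
use warnings;
use IO::Handle;
use XML::Writer;

# In-memory file handle: everything written to $out ends up in $xml.
open my $out, '>', \my $xml or die "Cannot open in-memory file: $!";

my $writer = XML::Writer->new(
    OUTPUT      => $out,
    DATA_MODE   => 1,    # put each element on a line of its own
    DATA_INDENT => 2,    # indent nested elements by two spaces
);

$writer->startTag('data');
$writer->dataElement('supplier', $_) for qw(hhh iii jjj);
$writer->endTag('data');
$writer->end;

close $out;
print $xml;
```

DATA_MODE is only suitable for non-mixed content, because the inserted newlines and indentation would otherwise become part of the text.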

SEE ALSO

XML::TokeParser, XML::Parser, XML::Parser::Expat, XML::TiePYX, XML::Writer.