NAME

XML::Reader - Reading XML and providing path information based on a pull-parser.

SYNOPSIS

use XML::Reader;

my $text = q{<init><page node="400">m <!-- remark --> r</page></init>};

my $rdr = XML::Reader->newhd(\$text) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-19s, Value: %s\n", $rdr->path, $rdr->value;
}

This program produces the following output:

Path: /init              , Value:
Path: /init/page/@node   , Value: 400
Path: /init/page         , Value: m r
Path: /init              , Value:

DESCRIPTION

XML::Reader provides a simple, easy-to-use interface for parsing XML files sequentially (so-called "pull-mode" parsing) while keeping track of the complete XML path.

It was developed as a wrapper on top of XML::Parser (some basic functions have also been copied from XML::TokeParser). Both XML::Parser and XML::TokeParser allow pull-mode parsing, but do not keep track of the complete XML path. Also, the interfaces to XML::Parser and XML::TokeParser require you to distinguish between start-tags, end-tags and text, which, in my view, complicates the interface.

There is also XML::TiePYX, which lets you pull-mode parse XML files (see http://www.xml.com/pub/a/2000/03/15/feature/index.html for an introduction to PYX). But still, with XML::TiePYX you need to account for start-tags, end-tags and text, and it does not provide the full XML path.

By contrast, XML::Reader translates start-tags, end-tags and text into XPath-like expressions. You don't need to worry about tags: you just get a path and a value, and that's it.

For example, the following XML in variable '$line1'...

my $line1 = q{
  <data>
    <item>abc</item>
    <item><!-- c1 -->
      <dummy/>
      fgh
      <inner name="ttt" id="fff">
        ooo <!-- c2 --> ppp
      </inner>
    </item>
  </data>
};

...can be parsed with XML::Reader using the method iterate to step through the XML data one item at a time, and the methods path and value to extract the current XML path and its value.

You can also keep track of the start- and end-tags: There is a method is_start, which returns 1 or 0, depending on whether the XML-file had a start tag at the current position. There is also the equivalent method is_end. If you want to know whether you have encountered a fresh sequence of attributes, you can use the method is_init_attr.

There are also the methods comment, tag, attr, type and level. comment returns the comment, if any. tag gives you the current tag-name, attr returns the attribute-name, type returns either 'T' for text or '@' for attributes and level indicates the current nesting-level (a number >= 0).

Here is a sample program which parses the XML in '$line1' from above to demonstrate the principle...

use XML::Reader;

my $rdr = XML::Reader->newhd(\$line1) or die "Error: $!";
my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. pat=%-22s, val=%-9s, s=%-1s, i=%-1s, e=%-1s, tag=%-6s, atr=%-6s, t=%-1s, lvl=%2d, c=%s\n",
     $i, $rdr->path, $rdr->value, $rdr->is_start, $rdr->is_init_attr,
     $rdr->is_end, $rdr->tag, $rdr->attr, $rdr->type, $rdr->level, $rdr->comment;
}

...and here is the output:

 1. pat=/data                 , val=         , s=1, i=1, e=0, tag=data  , atr=      , t=T, lvl= 1, c=
 2. pat=/data/item            , val=abc      , s=1, i=1, e=1, tag=item  , atr=      , t=T, lvl= 2, c=
 3. pat=/data                 , val=         , s=0, i=1, e=0, tag=data  , atr=      , t=T, lvl= 1, c=
 4. pat=/data/item            , val=         , s=1, i=1, e=0, tag=item  , atr=      , t=T, lvl= 2, c=c1
 5. pat=/data/item/dummy      , val=         , s=1, i=1, e=1, tag=dummy , atr=      , t=T, lvl= 3, c=
 6. pat=/data/item            , val=fgh      , s=0, i=1, e=0, tag=item  , atr=      , t=T, lvl= 2, c=
 7. pat=/data/item/inner/@id  , val=fff      , s=0, i=1, e=0, tag=@id   , atr=id    , t=@, lvl= 4, c=
 8. pat=/data/item/inner/@name, val=ttt      , s=0, i=0, e=0, tag=@name , atr=name  , t=@, lvl= 4, c=
 9. pat=/data/item/inner      , val=ooo ppp  , s=1, i=0, e=1, tag=inner , atr=      , t=T, lvl= 3, c=c2
10. pat=/data/item            , val=         , s=0, i=1, e=1, tag=item  , atr=      , t=T, lvl= 2, c=
11. pat=/data                 , val=         , s=0, i=1, e=1, tag=data  , atr=      , t=T, lvl= 1, c=

If you want, you can set option {filter => 1} to select only those lines that have a value.

use XML::Reader;

my $rdr = XML::Reader->newhd(\$line1, {filter => 1}) or die "Error: $!";
my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. pat=%-22s, val=%-9s, tag=%-6s, atr=%-6s, t=%-1s, lvl=%2d\n",
     $i, $rdr->path, $rdr->value, $rdr->tag, $rdr->attr, $rdr->type, $rdr->level;
}

Be careful not to interpret the methods $rdr->is_start, $rdr->is_init_attr, $rdr->is_end or $rdr->comment when option {filter => 1} is set; those methods will, in fact, return undef. In this case the output will be as follows:

1. pat=/data/item            , val=abc      , tag=item  , atr=      , t=T, lvl= 2
2. pat=/data/item            , val=fgh      , tag=item  , atr=      , t=T, lvl= 2
3. pat=/data/item/inner/@id  , val=fff      , tag=@id   , atr=id    , t=@, lvl= 4
4. pat=/data/item/inner/@name, val=ttt      , tag=@name , atr=name  , t=@, lvl= 4
5. pat=/data/item/inner      , val=ooo ppp  , tag=inner , atr=      , t=T, lvl= 3

INTERFACE

Object creation

To create an XML::Reader object, the following syntax is used:

my $rdr = XML::Reader->newhd($data,
  {strip => 1, filter => 2, using => ['/path1', '/path2']})
  or die "Error: $!";

The argument $data (which is mandatory) is either the name of an XML file or a reference to a string, in which case the content of that string is taken as the text of the XML.

Here is an example to create an XML::Reader object with a file-name:

my $rdr = XML::Reader->newhd('input.xml') or die "Error: $!";

Here is another example to create an XML::Reader object with a reference:

my $rdr = XML::Reader->newhd(\'<data>abc</data>') or die "Error: $!";

One or more of the following options can be added as a hash-reference:

option {using => }

Option Using allows for selecting a sub-tree of the XML.

The syntax is {using => ['/path1/path2/path3', '/path4/path5/path6']}

option {filter => }

Option {filter => 1} activates a filter to remove lines with an empty value.

Option {filter => 2} deactivates the filter, so all lines are shown, even lines with an empty value.

Option {filter => 3} also deactivates the filter, but removes attribute lines (i.e. it removes lines where $rdr->type eq '@'). Instead, it returns the attributes as a hash reference via $rdr->att_hash.

The syntax is {filter => 1|2|3}, default is {filter => 2}

option {strip => }

Option {strip => 1} strips all leading and trailing spaces from text and comments (attributes are never stripped). Option {strip => 0} leaves text and comments unmodified.

The syntax is {strip => 0|1}, default is {strip => 1}
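
The effect of option strip can be observed by reading the same text once with each setting. The following is a minimal sketch that uses only the constructor and methods described in this document; the sample XML and the helper name values_for are made up for illustration.

```perl
use strict;
use warnings;
use XML::Reader;

# Hypothetical sample: text surrounded by spaces.
my $text = q{<data><item>  abc  </item></data>};

# Hypothetical helper: collect all non-empty values for a given strip setting.
sub values_for {
    my ($strip) = @_;
    my $rdr = XML::Reader->newhd(\$text, {strip => $strip, filter => 1})
      or die "Error: $!";
    my @values;
    while ($rdr->iterate) {
        push @values, $rdr->value;
    }
    return @values;
}

# Brackets make leading and trailing spaces visible in the printed values.
for my $strip (0, 1) {
    print "strip => $strip: ", join(', ', map { "[$_]" } values_for($strip)), "\n";
}
```

Printing the values in brackets is only a convenience here, so that any surrounding spaces kept by {strip => 0} are easy to see.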

Methods

A successfully created object of type XML::Reader provides the following methods:

iterate

Reads one single XML-value. It returns 1 after a successful read, or undef when it hits end-of-file.

path

Provides the complete path of the currently selected value. Attributes are represented by a leading '@' sign.

value

Provides the actual value (i.e. the value of the current text, attribute or comment).

comment

Provides the comments of the XML. Be careful, this method does not make sense for option {filter => 1}, where the method comment returns undef.

type

Provides the type of the value: 'T' for text or '@' for attributes.

tag

Provides the current tag-name.

attr

Provides the current attribute (returns the empty string for non-attribute lines).

is_start

Returns 1 or 0, depending on whether the XML file had a start tag at the current position. Be careful, this method does not make sense for option {filter => 1}, where the method is_start returns undef.

is_init_attr

Returns 1 or 0, depending on whether a new sequence of attributes is initiated. Be careful, this method does not make sense for option {filter => 1}, where the method is_init_attr returns undef.

is_end

Returns 1 or 0, depending on whether the XML file had an end tag at the current position. Be careful, this method does not make sense for option {filter => 1}, where the method is_end returns undef.

level

Indicates the nesting level of the XPath expression (a numeric value greater than or equal to zero).

prefix

Shows the prefix which has been removed in option {using => ...}. Returns the empty string if option {using => ...} has not been specified.

OPTION USING

Option Using allows for selecting a sub-tree of the XML.

Here is how it works in detail...

Option {using => ['/path1/path2/path3', '/path4/path5/path6']} removes all lines whose path does not start with '/path1/path2/path3' or with '/path4/path5/path6'.

The lines that remain have a shorter path, because the prefix '/path1/path2/path3' (or '/path4/path5/path6') is removed from the path. The removed prefix, however, is available through the method prefix.

The paths '/path1/path2/path3' and '/path4/path5/path6' must be absolute and complete. Absolute means that they have to start with a '/' character. Complete means that the last item in the path ('path3' or 'path6') is completed internally by a trailing '/' character.
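
The matching rule described above can be sketched in plain Perl. This is only an illustration of the rule as stated, not the module's actual implementation; the function name split_using is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the rule: a path is kept only if it equals
# one of the 'using' prefixes or continues it after a '/'. It returns the
# prefix and the shortened path, or an empty list if the path is removed.
sub split_using {
    my ($path, @using) = @_;
    for my $prefix (@using) {
        # Complete both sides with a trailing '/', so that a prefix such as
        # '/data/supp' does not match the path '/data/supplier'.
        if (index("$path/", "$prefix/") == 0) {
            my $rest = substr($path, length $prefix);
            return ($prefix, $rest eq '' ? '/' : $rest);
        }
    }
    return;
}

my @using = ('/data/order/database/customer', '/data/supplier');
for my $path ('/data/order/database/customer/@name', '/data/supplier', '/data/dummy') {
    my ($prefix, $rest) = split_using($path, @using);
    printf "%-36s => %s\n", $path,
      defined $prefix ? "prefix=$prefix, path=$rest" : 'removed';
}
```

Appending the '/' to both strings before comparing is one simple way to express "complete" paths: only whole path items can match.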

An example with option 'using'

The following program takes this XML and parses it with XML::Reader, including the option 'using' to target specific elements:

use XML::Reader;

my $line2 = q{
<data>
  <order>
    <database>
      <customer name="aaa" />
      <customer name="bbb" />
      <customer name="ccc" />
      <customer name="ddd" />
    </database>
  </order>
  <dummy value="ttt">test</dummy>
  <supplier>hhh</supplier>
  <supplier>iii</supplier>
  <supplier>jjj</supplier>
</data>
};

my $rdr = XML::Reader->newhd(\$line2,
  {using => ['/data/order/database/customer', '/data/supplier']});

my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. prf=%-29s, pat=%-7s, val=%-3s, tag=%-6s, t=%-1s, lvl=%2d\n",
      $i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->tag, $rdr->type, $rdr->level;
}

This is the output of that program:

 1. prf=/data/order/database/customer, pat=/@name , val=aaa, tag=@name , t=@, lvl= 1
 2. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 3. prf=/data/order/database/customer, pat=/@name , val=bbb, tag=@name , t=@, lvl= 1
 4. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 5. prf=/data/order/database/customer, pat=/@name , val=ccc, tag=@name , t=@, lvl= 1
 6. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 7. prf=/data/order/database/customer, pat=/@name , val=ddd, tag=@name , t=@, lvl= 1
 8. prf=/data/order/database/customer, pat=/      , val=   , tag=      , t=T, lvl= 0
 9. prf=/data/supplier               , pat=/      , val=hhh, tag=      , t=T, lvl= 0
10. prf=/data/supplier               , pat=/      , val=iii, tag=      , t=T, lvl= 0
11. prf=/data/supplier               , pat=/      , val=jjj, tag=      , t=T, lvl= 0

An example without option 'using'

The following program takes the same XML and parses it with XML::Reader, but without the option 'using'.

use XML::Reader;

my $rdr = XML::Reader->newhd(\$line2);
my $i = 0;
while ($rdr->iterate) { $i++;
    printf "%3d. prf=%-1s, pat=%-37s, val=%-6s, tag=%-11s, t=%-1s, lvl=%2d\n",
     $i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->tag, $rdr->type, $rdr->level;
}

As you can see in the following output, there are many more lines written, the prefix is empty and the path is much longer than in the previous program:

 1. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
 2. prf= , pat=/data/order                          , val=      , tag=order      , t=T, lvl= 2
 3. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
 4. prf= , pat=/data/order/database/customer/@name  , val=aaa   , tag=@name      , t=@, lvl= 5
 5. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
 6. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
 7. prf= , pat=/data/order/database/customer/@name  , val=bbb   , tag=@name      , t=@, lvl= 5
 8. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
 9. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
10. prf= , pat=/data/order/database/customer/@name  , val=ccc   , tag=@name      , t=@, lvl= 5
11. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
12. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
13. prf= , pat=/data/order/database/customer/@name  , val=ddd   , tag=@name      , t=@, lvl= 5
14. prf= , pat=/data/order/database/customer        , val=      , tag=customer   , t=T, lvl= 4
15. prf= , pat=/data/order/database                 , val=      , tag=database   , t=T, lvl= 3
16. prf= , pat=/data/order                          , val=      , tag=order      , t=T, lvl= 2
17. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
18. prf= , pat=/data/dummy/@value                   , val=ttt   , tag=@value     , t=@, lvl= 3
19. prf= , pat=/data/dummy                          , val=test  , tag=dummy      , t=T, lvl= 2
20. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
21. prf= , pat=/data/supplier                       , val=hhh   , tag=supplier   , t=T, lvl= 2
22. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
23. prf= , pat=/data/supplier                       , val=iii   , tag=supplier   , t=T, lvl= 2
24. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1
25. prf= , pat=/data/supplier                       , val=jjj   , tag=supplier   , t=T, lvl= 2
26. prf= , pat=/data                                , val=      , tag=data       , t=T, lvl= 1

OPTION FILTER

Option Filter allows you to switch a filter for empty lines on ({filter => 1}) or off ({filter => 2}).

Option {filter => 2}

With option {filter => 2}, that is, with the filter for empty lines switched off, XML::Reader produces one line for each start-tag and one line for each end-tag (consecutive start- and end-tags can be combined into one single line). Also, attribute lines are added via the special '/@...' syntax.

Option {filter => 2} is the default.

Here is an example...

use XML::Reader;

my $text = q{<root><test param="v"><a><b>e<data id="z">g</data>f</b></a></test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->newhd(\$text) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-24s, Value: %s\n", $rdr->path, $rdr->value;
}

This program (with implicit option {filter => 2} as default) produces the following output:

Path: /root                   , Value:
Path: /root/test/@param       , Value: v
Path: /root/test              , Value:
Path: /root/test/a            , Value:
Path: /root/test/a/b          , Value: e
Path: /root/test/a/b/data/@id , Value: z
Path: /root/test/a/b/data     , Value: g
Path: /root/test/a/b          , Value: f
Path: /root/test/a            , Value:
Path: /root/test              , Value:
Path: /root                   , Value: x yz

Option {filter => 2} also allows you to rebuild the structure of the XML with the help of the methods is_start, is_init_attr and is_end. Please note that the first line ("Path: /root, Value:") has an empty value, but is important for the structure of the XML. Therefore we can't ignore it.

Let us now look at the same example (with option {filter => 2}), but with an additional algorithm to reconstruct the original XML:

use XML::Reader;

my $text = q{<root><test param="v"><a><b>e<data id="z">g</data>f</b></a></test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->newhd(\$text) or die "Error: $!";

my %at;

while ($rdr->iterate) {
    my $indentation = '  ' x $rdr->level;

    if ($rdr->is_init_attr) { %at  = (); }
    if ($rdr->type eq '@')  { $at{$rdr->attr} = $rdr->value; }

    if ($rdr->is_start) {
        print $indentation, '<', $rdr->tag, join('', map{" $_='$at{$_}'"} sort keys %at), '>', "\n";
    }

    if ($rdr->type eq 'T' and $rdr->value ne '') {
        print $indentation, '  ', $rdr->value, "\n";
    }

    if ($rdr->is_end) {
        print $indentation, '</', $rdr->tag, '>', "\n";
    }
}

...and here is the output:

<root>
  <test param='v'>
    <a>
      <b>
        e
        <data id='z'>
          g
        </data>
        f
      </b>
    </a>
  </test>
  x yz
</root>

...this is proof that the original structure of the XML is not lost.

Option {filter => 3}

Option {filter => 3} works very much like {filter => 2}.

The difference, though, is that with option {filter => 3} all attribute-lines are filtered out and instead, the attributes are presented for each start-line in a hash $rdr->att_hash().

This makes it possible to dispense with the global %at variable of the previous algorithm and to use a local %at variable instead:

my %at = %{$rdr->att_hash};

Here is the new algorithm for {filter => 3}. We don't need to worry about attributes (that is, we don't need to check for $rdr->type eq '@') and, as already mentioned, the %at variable is now local:

use XML::Reader;

my $text = q{<root><test param="v"><a><b>e<data id="z">g</data>f</b></a></test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->newhd(\$text, {filter => 3}) or die "Error: $!";

while ($rdr->iterate) {
    my $indentation = '  ' x $rdr->level;

    if ($rdr->is_start) {
        my %at = %{$rdr->att_hash};
        print $indentation, '<', $rdr->tag, join('', map{" $_='$at{$_}'"} sort keys %at), '>', "\n";
    }

    if ($rdr->type eq 'T' and $rdr->value ne '') {
        print $indentation, '  ', $rdr->value, "\n";
    }

    if ($rdr->is_end) {
        print $indentation, '</', $rdr->tag, '>', "\n";
    }
}

...the output for {filter => 3} is identical to the output for {filter => 2}:

<root>
  <test param='v'>
    <a>
      <b>
        e
        <data id='z'>
          g
        </data>
        f
      </b>
    </a>
  </test>
  x yz
</root>

Option {filter => 1}

Option {filter => 1} reduces the number of output lines (i.e. it removes all lines that don't have a value).

Be careful if you want to use one of the four methods is_start, is_init_attr, is_end or comment. In fact, if you have option {filter => 1}, then those four methods will return undef.

With option {filter => 1} we lose the ability to reconstruct the XML, but simple data processing is easier.

Here is a sample program:

use XML::Reader;

my $text = q{<root><test param="v"><a><b>e<data id="z">g</data>f</b></a></test>x <!-- remark --> yz</root>};

my $rdr = XML::Reader->newhd(\$text, {filter => 1}) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-24s, Value: %s\n", $rdr->path, $rdr->value;
}

...and here is the output:

Path: /root/test/@param       , Value: v
Path: /root/test/a/b          , Value: e
Path: /root/test/a/b/data/@id , Value: z
Path: /root/test/a/b/data     , Value: g
Path: /root/test/a/b          , Value: f
Path: /root                   , Value: x yz
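
As a further sketch of the simple data processing that {filter => 1} makes easier, the following program groups all values by their path. The sample XML and the variable name %values_by_path are made up for illustration; only the constructor and the methods iterate, path and value from this document are used.

```perl
use strict;
use warnings;
use XML::Reader;

# Hypothetical sample data.
my $xml = q{<data><supplier>hhh</supplier><supplier>iii</supplier><dummy value="ttt">test</dummy></data>};

my $rdr = XML::Reader->newhd(\$xml, {filter => 1}) or die "Error: $!";

# Group every value under its path. With {filter => 1} there are no
# empty lines that would have to be skipped first.
my %values_by_path;
while ($rdr->iterate) {
    push @{ $values_by_path{$rdr->path} }, $rdr->value;
}

for my $path (sort keys %values_by_path) {
    print "$path: ", join(', ', @{ $values_by_path{$path} }), "\n";
}
```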

AUTHOR

Klaus Eichner, March 2009

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Klaus Eichner

All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

RELATED MODULES

If you also want to write XML, have a look at XML::Writer. This module provides a simple interface for writing XML. (If you are writing non-mixed content XML, consider setting DATA_MODE => 1 and DATA_INDENT => 2, which allows for proper indentation in your XML output file.)
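
A minimal sketch of those two XML::Writer settings might look as follows. The element names are made up for illustration, and the output is collected in a string here (via a scalar reference passed as OUTPUT) rather than written to a file.

```perl
use strict;
use warnings;
use XML::Writer;

my $output = '';
my $writer = XML::Writer->new(
    OUTPUT      => \$output,  # collect the XML in a string
    DATA_MODE   => 1,         # insert newlines around elements
    DATA_INDENT => 2,         # indentation step for nested elements
);

# Hypothetical non-mixed content: elements contain either text or elements.
$writer->startTag('data');
$writer->dataElement('supplier', $_) for qw(hhh iii jjj);
$writer->emptyTag('customer', name => 'aaa');
$writer->endTag('data');
$writer->end;

print $output;
```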

SEE ALSO

XML::TokeParser, XML::Parser, XML::Parser::Expat, XML::TiePYX, XML::Writer.