NAME
XML::Reader - Reading XML and providing path information based on a pull-parser.
SYNOPSIS
use XML::Reader;
my $text = q{<root>stu<test param="v">w</test>x <!-- remark --> yz</root>};
my $rdr = XML::Reader->new(\$text) or die "Error: $!";
while ($rdr->iterate) {
    printf "Path: %-18s, Value: %s\n", $rdr->path, $rdr->value;
}
The above program produces the following output:
Path: /root , Value: stu
Path: /root/test , Value:
Path: /root/test/@param , Value: v
Path: /root/test , Value: w
Path: /root , Value: x yz
DESCRIPTION
XML::Reader provides a simple, easy-to-use interface for sequentially parsing XML files (so-called "pull-mode" parsing) while keeping track of the complete XML-path.
It was developed as a wrapper on top of XML::Parser (some basic functions have also been copied from XML::TokeParser). Both XML::Parser and XML::TokeParser allow pull-mode parsing, but do not keep track of the complete XML-path. Also, the interfaces to XML::Parser and XML::TokeParser require you to distinguish between start-tags, end-tags and text, which, in my view, complicates the interface.
There is also XML::TiePYX, which lets you pull-mode parse XML-Files (see http://www.xml.com/pub/a/2000/03/15/feature/index.html for an introduction to PYX). But still, with XML::TiePYX you need to account for start-tags, end-tags and text, and it does not provide the full XML-path.
By contrast, XML::Reader translates start-tags, end-tags and text into XPath-like expressions. So you don't need to worry about tags, you just get a path and a value, and that's it.
For example, the following XML in variable '$line1'...
my $line1 = q{
<data>
<item>abc</item>
<item><!-- c1 -->
<dummy/>
fgh
<inner name="ttt" id="fff">
ooo <!-- c2 --> ppp
</inner>
</item>
</data>
};
...can be parsed with XML::Reader using the method iterate to step through the XML-data one item at a time, and the methods path and value to extract the current XML-path and its value.
You can also keep track of the start- and end-tags: the method is_start returns 1 or 0, depending on whether the XML-file had a start tag at the current position. There is also the equivalent method is_end. Just remember that those two methods only make sense if filter is switched off (otherwise they return constant 0).
There are also the methods comment, tag, type and level. comment returns the comment, if any; tag gives you the current tag-name (or attribute-name); type returns either 'T' for text or '@' for attributes; and level indicates the current nesting-level, a number >= 0.
Here is a sample program which parses the XML in '$line1' from above to demonstrate the principle...
use XML::Reader;
my $rdr = XML::Reader->new(\$line1) or die "Error: $!";
my $i = 0;
while ($rdr->iterate) {
    $i++;
    printf "%3d. pat=%-22s, val=%-9s, s=%-1s, e=%-1s, tag=%-6s, t=%-1s, lvl=%2d, c=%s\n",
        $i, $rdr->path, $rdr->value, $rdr->is_start,
        $rdr->is_end, $rdr->tag, $rdr->type, $rdr->level, $rdr->comment;
}
...and here is the output:
1. pat=/data , val= , s=1, e=0, tag=data , t=T, lvl= 1, c=
2. pat=/data/item , val=abc , s=1, e=1, tag=item , t=T, lvl= 2, c=
3. pat=/data , val= , s=0, e=0, tag=data , t=T, lvl= 1, c=
4. pat=/data/item , val= , s=1, e=0, tag=item , t=T, lvl= 2, c=c1
5. pat=/data/item/dummy , val= , s=1, e=1, tag=dummy , t=T, lvl= 3, c=
6. pat=/data/item , val=fgh , s=0, e=0, tag=item , t=T, lvl= 2, c=
7. pat=/data/item/inner , val= , s=1, e=0, tag=inner , t=T, lvl= 3, c=
8. pat=/data/item/inner/@id , val=fff , s=0, e=0, tag=@id , t=@, lvl= 4, c=
9. pat=/data/item/inner/@name, val=ttt , s=0, e=0, tag=@name , t=@, lvl= 4, c=
10. pat=/data/item/inner , val=ooo ppp , s=0, e=1, tag=inner , t=T, lvl= 3, c=c2
11. pat=/data/item , val= , s=0, e=1, tag=item , t=T, lvl= 2, c=
12. pat=/data , val= , s=0, e=1, tag=data , t=T, lvl= 1, c=
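In the output above, the lvl column equals the number of components in the path ('/data' is 1, '/data/item/inner/@id' is 4). The following is a minimal sketch of that relationship only; path_level is a hypothetical helper written for this illustration, not a function of XML::Reader, and it does not claim to show how the module computes level internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Reader): count the components
# of a path string such as '/data/item/inner/@id'.
sub path_level {
    my ($path) = @_;
    # Drop empty fields, so the leading '/' (and a bare '/') contribute nothing.
    my @parts = grep { length } split m{/}, $path;
    return scalar @parts;
}

for my $path ('/data', '/data/item', '/data/item/inner/@id', '/') {
    printf "%-22s lvl=%d\n", $path, path_level($path);
}
```

A bare '/' yields 0, which corresponds to the lines with lvl= 0 that appear when the option 'using' strips the whole prefix from a path.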
If you want, you can set a filter to select only those lines that have a value.
my $rdr = XML::Reader->new(\$line1, {filter => 1}) or die "Error: $!";
Then the output will be as follows. Be careful not to interpret $rdr->is_start or $rdr->is_end when the filter is active: both return constant 0. Also be careful if you are looking for comments: with the filter on, comments are only returned if the tag is not empty, which is why comment 'c1' does not appear in the output below.
1. pat=/data/item , val=abc , s=0, e=0, tag=item , t=T, lvl= 2, c=
2. pat=/data/item , val=fgh , s=0, e=0, tag=item , t=T, lvl= 2, c=
3. pat=/data/item/inner/@id , val=fff , s=0, e=0, tag=@id , t=@, lvl= 4, c=
4. pat=/data/item/inner/@name, val=ttt , s=0, e=0, tag=@name , t=@, lvl= 4, c=
5. pat=/data/item/inner , val=ooo ppp , s=0, e=0, tag=inner , t=T, lvl= 3, c=c2
INTERFACE
Object creation
To create an XML::Reader object, the following syntax is used:
my $rdr = XML::Reader->new($data,
    {strip => 1, filter => 0, using => ['/path1', '/path2']})
    or die "Error: $!";
The argument $data (which is mandatory) is either the name of the XML-file, or a reference to a string, in which case the content of that string is taken as the text of the XML.
Here is an example to create an XML::Reader object with a file-name:
my $rdr = XML::Reader->new('input.xml') or die "Error: $!";
Here is another example to create an XML::Reader object with a reference:
my $rdr = XML::Reader->new(\'<data>abc</data>') or die "Error: $!";
One or more of the following options can be added as a hash-reference:
- option {strip => 1}
-
The option {strip => 1} strips all leading and trailing spaces from text and comments (attributes are never stripped). The default is {strip => 1}.
- option {filter => 0}
-
The option {filter => 1} removes all empty text lines. Be careful if you want to use the is_start and is_end methods, in which case you have to set option {filter => 0}. The default is {filter => 0}.
- option {using => ['/path1/path2/path3', '/path4/path5/path6']}
-
This option removes all lines whose path does not start with '/path1/path2/path3' or '/path4/path5/path6'. The remaining lines get a shorter path, because the matching prefix ('/path1/path2/path3' or '/path4/path5/path6') is removed from the path. The removed prefix is available through the prefix method.
The paths given in 'using' are supposed to be absolute and complete: absolute means they have to start with a '/'-character, and complete means that the last item in the path ('path3' or 'path6') will be completed internally by a trailing '/'-character.
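One way to read the prefix rule above is sketched below in plain Perl. split_using is a hypothetical helper invented for this illustration; it is not part of XML::Reader and is not taken from the module's source. It assumes that "completed by a trailing '/'" means a prefix only matches whole path components, and that a path equal to the prefix is reported as '/', as in the example output in the EXAMPLES section.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Reader): split a full path into
# (prefix, remaining path) according to a list of 'using' prefixes.
# Returns an empty list if no prefix matches, i.e. the line would be removed.
sub split_using {
    my ($path, @using) = @_;
    for my $prefix (@using) {
        # Complete both sides with a trailing '/', so that '/data/supplier'
        # matches '/data/supplier/...' but not '/data/suppliers'.
        my $full = $path . '/';
        my $want = $prefix . '/';
        next unless index($full, $want) == 0;
        my $rest = substr($path, length($prefix));
        return ($prefix, $rest eq '' ? '/' : $rest);
    }
    return;
}

my @using = ('/data/order/database/customer', '/data/supplier');
for my $path ('/data/order/database/customer/@name', '/data/supplier', '/data/dummy') {
    my ($prefix, $rest) = split_using($path, @using);
    if (defined $prefix) {
        printf "%s: prefix=%s, path=%s\n", $path, $prefix, $rest;
    }
    else {
        printf "%s: removed\n", $path;
    }
}
```

The first matching prefix wins in this sketch; how the module itself resolves overlapping prefixes is not stated in this document.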
Methods
A successfully created object of type XML::Reader provides the following methods:
- iterate
-
Reads one single XML-value. It returns 1 after a successful read, or undef when it hits end-of-file.
- path
-
Provides the complete path of the currently selected value. Attributes are represented by leading '@'-signs, comments are represented by a '#'-symbol.
- value
-
Provides the actual value (i.e. the value of the current text, attribute or comment).
- comment
-
Provides the comments of the XML.
- type
-
Provides the type of the value: 'T' for text or '@' for attributes.
- tag
-
Provides the current tag-name (or attribute-name).
- is_start
-
Returns 1 or 0, depending on whether the XML-file had a start tag at the current position. Be careful: this method only makes sense if filter is switched off (with filter switched on, constant 0 is returned).
- is_end
-
Returns 1 or 0, depending on whether the XML-file had an end tag at the current position. Be careful: this method only makes sense if filter is switched off (with filter switched on, constant 0 is returned).
- level
-
Indicates the nesting level of the XPath expression (numeric value greater than or equal to zero).
- prefix
-
Shows the prefix which has been removed in option {using => ...}. Returns the empty string if option {using => ...} has not been specified.
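As a small illustration of how these methods combine, the following sketch collects all attribute values of a document into a hash keyed by path. It is written against the interface described in this document (new, iterate, type, path, value); other releases of the module may differ. The sample XML and the variable names are made up for the example.

```perl
use strict;
use warnings;
use XML::Reader;

# Sample input, invented for this sketch.
my $xml = q{<data><item>abc</item><item><inner name="ttt" id="fff">ooo</inner></item></data>};

# Constructor call as documented above, with default options.
my $rdr = XML::Reader->new(\$xml) or die "Error: $!";

my %attr;    # full path => attribute value
while ($rdr->iterate) {
    # Keep only attribute lines; text lines have type 'T'.
    next unless $rdr->type eq '@';
    $attr{ $rdr->path } = $rdr->value;
}

for my $path (sort keys %attr) {
    printf "%-24s = %s\n", $path, $attr{$path};
}
```

Testing type inside the loop keeps the sketch independent of the filter option, so is_start and is_end remain usable in the same loop if needed.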
EXAMPLES
Here is a sample piece of XML (in variable '$line2'):
my $line2 = q{
<data>
<order>
<database>
<customer name="aaa" />
<customer name="bbb" />
<customer name="ccc" />
<customer name="ddd" />
</database>
</order>
<dummy value="ttt">test</dummy>
<supplier>hhh</supplier>
<supplier>iii</supplier>
<supplier>jjj</supplier>
</data>
};
An example with option 'using'
The following program takes this XML and parses it with XML::Reader, including the option 'using' to target specific elements:
use XML::Reader;
my $rdr = XML::Reader->new(\$line2, {filter => 0,
    using => ['/data/order/database/customer', '/data/supplier']});
my $i = 0;
while ($rdr->iterate) {
    $i++;
    printf "%3d. prf=%-29s, pat=%-7s, val=%-3s, s=%-1s, e=%-1s, tag=%-6s, t=%-1s, lvl=%2d\n",
        $i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->is_start,
        $rdr->is_end, $rdr->tag, $rdr->type, $rdr->level;
}
This is the output of that program:
1. prf=/data/order/database/customer, pat=/ , val= , s=1, e=0, tag= , t=T, lvl= 0
2. prf=/data/order/database/customer, pat=/@name , val=aaa, s=0, e=0, tag=@name , t=@, lvl= 1
3. prf=/data/order/database/customer, pat=/ , val= , s=0, e=1, tag= , t=T, lvl= 0
4. prf=/data/order/database/customer, pat=/ , val= , s=1, e=0, tag= , t=T, lvl= 0
5. prf=/data/order/database/customer, pat=/@name , val=bbb, s=0, e=0, tag=@name , t=@, lvl= 1
6. prf=/data/order/database/customer, pat=/ , val= , s=0, e=1, tag= , t=T, lvl= 0
7. prf=/data/order/database/customer, pat=/ , val= , s=1, e=0, tag= , t=T, lvl= 0
8. prf=/data/order/database/customer, pat=/@name , val=ccc, s=0, e=0, tag=@name , t=@, lvl= 1
9. prf=/data/order/database/customer, pat=/ , val= , s=0, e=1, tag= , t=T, lvl= 0
10. prf=/data/order/database/customer, pat=/ , val= , s=1, e=0, tag= , t=T, lvl= 0
11. prf=/data/order/database/customer, pat=/@name , val=ddd, s=0, e=0, tag=@name , t=@, lvl= 1
12. prf=/data/order/database/customer, pat=/ , val= , s=0, e=1, tag= , t=T, lvl= 0
13. prf=/data/supplier , pat=/ , val=hhh, s=1, e=1, tag= , t=T, lvl= 0
14. prf=/data/supplier , pat=/ , val=iii, s=1, e=1, tag= , t=T, lvl= 0
15. prf=/data/supplier , pat=/ , val=jjj, s=1, e=1, tag= , t=T, lvl= 0
An example without option 'using'
The following program takes the same XML and parses it with XML::Reader, but without the option 'using'.
use XML::Reader;
my $rdr = XML::Reader->new(\$line2, {filter => 0});
my $i = 0;
while ($rdr->iterate) {
    $i++;
    printf "%3d. prf=%-1s, pat=%-37s, val=%-6s, s=%-1s, e=%-1s, tag=%-11s, t=%-1s, lvl=%2d\n",
        $i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->is_start,
        $rdr->is_end, $rdr->tag, $rdr->type, $rdr->level;
}
As you can see in the following output, many more lines are written, the prefix is empty and the path is much longer than in the previous program:
1. prf= , pat=/data , val= , s=1, e=0, tag=data , t=T, lvl= 1
2. prf= , pat=/data/order , val= , s=1, e=0, tag=order , t=T, lvl= 2
3. prf= , pat=/data/order/database , val= , s=1, e=0, tag=database , t=T, lvl= 3
4. prf= , pat=/data/order/database/customer , val= , s=1, e=0, tag=customer , t=T, lvl= 4
5. prf= , pat=/data/order/database/customer/@name , val=aaa , s=0, e=0, tag=@name , t=@, lvl= 5
6. prf= , pat=/data/order/database/customer , val= , s=0, e=1, tag=customer , t=T, lvl= 4
7. prf= , pat=/data/order/database , val= , s=0, e=0, tag=database , t=T, lvl= 3
8. prf= , pat=/data/order/database/customer , val= , s=1, e=0, tag=customer , t=T, lvl= 4
9. prf= , pat=/data/order/database/customer/@name , val=bbb , s=0, e=0, tag=@name , t=@, lvl= 5
10. prf= , pat=/data/order/database/customer , val= , s=0, e=1, tag=customer , t=T, lvl= 4
11. prf= , pat=/data/order/database , val= , s=0, e=0, tag=database , t=T, lvl= 3
12. prf= , pat=/data/order/database/customer , val= , s=1, e=0, tag=customer , t=T, lvl= 4
13. prf= , pat=/data/order/database/customer/@name , val=ccc , s=0, e=0, tag=@name , t=@, lvl= 5
14. prf= , pat=/data/order/database/customer , val= , s=0, e=1, tag=customer , t=T, lvl= 4
15. prf= , pat=/data/order/database , val= , s=0, e=0, tag=database , t=T, lvl= 3
16. prf= , pat=/data/order/database/customer , val= , s=1, e=0, tag=customer , t=T, lvl= 4
17. prf= , pat=/data/order/database/customer/@name , val=ddd , s=0, e=0, tag=@name , t=@, lvl= 5
18. prf= , pat=/data/order/database/customer , val= , s=0, e=1, tag=customer , t=T, lvl= 4
19. prf= , pat=/data/order/database , val= , s=0, e=1, tag=database , t=T, lvl= 3
20. prf= , pat=/data/order , val= , s=0, e=1, tag=order , t=T, lvl= 2
21. prf= , pat=/data , val= , s=0, e=0, tag=data , t=T, lvl= 1
22. prf= , pat=/data/dummy , val= , s=1, e=0, tag=dummy , t=T, lvl= 2
23. prf= , pat=/data/dummy/@value , val=ttt , s=0, e=0, tag=@value , t=@, lvl= 3
24. prf= , pat=/data/dummy , val=test , s=0, e=1, tag=dummy , t=T, lvl= 2
25. prf= , pat=/data , val= , s=0, e=0, tag=data , t=T, lvl= 1
26. prf= , pat=/data/supplier , val=hhh , s=1, e=1, tag=supplier , t=T, lvl= 2
27. prf= , pat=/data , val= , s=0, e=0, tag=data , t=T, lvl= 1
28. prf= , pat=/data/supplier , val=iii , s=1, e=1, tag=supplier , t=T, lvl= 2
29. prf= , pat=/data , val= , s=0, e=0, tag=data , t=T, lvl= 1
30. prf= , pat=/data/supplier , val=jjj , s=1, e=1, tag=supplier , t=T, lvl= 2
31. prf= , pat=/data , val= , s=0, e=1, tag=data , t=T, lvl= 1
AUTHOR
Klaus Eichner, March 2009
COPYRIGHT AND LICENSE
Copyright (C) 2009 by Klaus Eichner
All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
RELATED MODULES
If you also want to write XML, have a look at XML::Writer. This module provides a simple interface for writing XML. (If you are writing non-mixed content XML, consider setting DATA_MODE=>1 and DATA_INDENT=>2, which allows for proper indentation in your XML output file.)
SEE ALSO
XML::TokeParser, XML::Parser, XML::Parser::Expat, XML::TiePYX, XML::Writer.