NAME
XML::Reader - Reading XML and providing path information based on a pull-parser.
SYNOPSIS
use XML::Reader;
my $text = q{<init>n <?test pi?> t<page node="400">m <!-- remark --> r</page></init>};
my $rdr = XML::Reader->newhd(\$text) or die "Error: $!";
while ($rdr->iterate) {
printf "Path: %-19s, Value: %s\n", $rdr->path, $rdr->value;
}
This program produces the following output:
Path: /init , Value: n t
Path: /init/page/@node , Value: 400
Path: /init/page , Value: m r
Path: /init , Value:
DESCRIPTION
XML::Reader provides a simple, easy-to-use interface for sequentially parsing XML files (so-called "pull-mode" parsing) while at the same time keeping track of the complete XML path.
It was developed as a wrapper on top of XML::Parser (while, at the same time, some basic functions have been copied from XML::TokeParser). Both XML::Parser and XML::TokeParser allow pull-mode parsing, but do not keep track of the complete XML path. Also, the interfaces to XML::Parser and XML::TokeParser require you to distinguish between start-tags, end-tags and text on separate lines, which, in my view, complicates the interface (although XML::Reader offers option {filter => 4}, which emulates start-tags, end-tags and text on separate lines, if that's what you want).
There is also XML::TiePYX, which lets you pull-mode parse XML-Files (see http://www.xml.com/pub/a/2000/03/15/feature/index.html for an introduction to PYX). But still, with XML::TiePYX you need to account for start-tags, end-tags and text, and it does not provide the full XML-path.
By contrast, XML::Reader translates start-tags, end-tags and text into XPath-like expressions. So you don't need to worry about tags, you just get a path and a value, and that's it. (However, should you wish to operate XML::Reader in a PYX compatible mode, there is option {filter => 4}, as mentioned above, which allows you to parse XML in that way).
But going back to the normal mode of operation, here is an example XML in variable '$line1':
my $line1 =
q{<?xml version="1.0" encoding="iso-8859-1"?>
<data>
<item>abc</item>
<item><!-- c1 -->
<dummy/>
fgh
<inner name="ttt" id="fff">
ooo <!-- c2 --> ppp
</inner>
</item>
</data>
};
This example can be parsed with XML::Reader using the method iterate to step through the XML data one item at a time, and the methods path and value to extract the current XML path and its value.
You can also keep track of the start- and end-tags: the method is_start returns 1 or 0, depending on whether the XML file had a start tag at the current position. There is also the equivalent method is_end.
There are also the methods tag, attr, type and level. tag gives you the current tag-name, attr returns the attribute-name, type returns either 'T' for text or '@' for attributes and level indicates the current nesting-level (a number >= 0).
Here is a sample program which parses the XML in '$line1' from above to demonstrate the principle...
use XML::Reader;
my $rdr = XML::Reader->newhd(\$line1) or die "Error: $!";
my $i = 0;
while ($rdr->iterate) { $i++;
printf "%3d. pat=%-22s, val=%-9s, s=%-1s, e=%-1s, tag=%-6s, atr=%-6s, t=%-1s, lvl=%2d\n", $i,
$rdr->path, $rdr->value, $rdr->is_start, $rdr->is_end, $rdr->tag, $rdr->attr, $rdr->type, $rdr->level;
}
...and here is the output:
1. pat=/data , val= , s=1, e=0, tag=data , atr= , t=T, lvl= 1
2. pat=/data/item , val=abc , s=1, e=1, tag=item , atr= , t=T, lvl= 2
3. pat=/data , val= , s=0, e=0, tag=data , atr= , t=T, lvl= 1
4. pat=/data/item , val= , s=1, e=0, tag=item , atr= , t=T, lvl= 2
5. pat=/data/item/dummy , val= , s=1, e=1, tag=dummy , atr= , t=T, lvl= 3
6. pat=/data/item , val=fgh , s=0, e=0, tag=item , atr= , t=T, lvl= 2
7. pat=/data/item/inner/@id , val=fff , s=0, e=0, tag=@id , atr=id , t=@, lvl= 4
8. pat=/data/item/inner/@name, val=ttt , s=0, e=0, tag=@name , atr=name , t=@, lvl= 4
9. pat=/data/item/inner , val=ooo ppp , s=1, e=1, tag=inner , atr= , t=T, lvl= 3
10. pat=/data/item , val= , s=0, e=1, tag=item , atr= , t=T, lvl= 2
11. pat=/data , val= , s=0, e=1, tag=data , atr= , t=T, lvl= 1
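The way tag, attr, type and level relate to path in the output above can be reproduced with a few lines of plain Perl. This is only a sketch inferred from the sample output, not code taken from XML::Reader; the helper name describe_path is hypothetical, and the type it reports covers only the 'T' and '@' cases of the default mode.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Reader): derive tag, attr, type and
# level from a path string, following the pattern in the sample output.
sub describe_path {
    my ($path) = @_;
    my @parts = grep { length } split m{/}, $path;   # '/data/item' -> ('data', 'item')
    my $tag   = @parts ? $parts[-1] : '';            # last component of the path
    my $attr  = $tag =~ m{\A \@ (.+) \z}xms ? $1 : ''; # name without the leading '@'
    my $type  = $attr ne '' ? '@' : 'T';             # assumption: only 'T' and '@' occur
    return { tag => $tag, attr => $attr, type => $type, level => scalar @parts };
}

for my $path ('/data', '/data/item/inner', '/data/item/inner/@id') {
    my $d = describe_path($path);
    printf "pat=%-22s tag=%-6s atr=%-3s t=%s lvl=%d\n",
        $path, $d->{tag}, $d->{attr}, $d->{type}, $d->{level};
}
```

The level is simply the number of path components, which is why attribute lines sit one level deeper than the element they belong to.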
INTERFACE
Object creation
To create an XML::Reader object, the following syntax is used:
my $rdr = XML::Reader->newhd($data,
{strip => 1, filter => 2, using => ['/path1', '/path2']})
or die "Error: $!";
The argument $data (which is mandatory) is either the name of an XML file or a reference to a string, in which case the content of that string is taken as the text of the XML.
Alternatively, $data can also be a previously opened filehandle, such as \*STDIN, in which case that filehandle is used to read the XML.
Here is an example to create an XML::Reader object with a file-name:
my $rdr = XML::Reader->newhd('input.xml') or die "Error: $!";
Here is another example to create an XML::Reader object with a reference:
my $rdr = XML::Reader->newhd(\'<data>abc</data>') or die "Error: $!";
Here is an example to create an XML::Reader object with an open filehandle:
open my $fh, '<', 'input.xml' or die "Error: $!";
my $rdr = XML::Reader->newhd($fh);
Here is an example to create an XML::Reader object with \*STDIN:
my $rdr = XML::Reader->newhd(\*STDIN);
One or more of the following options can be added as a hash-reference:
- option {parse_ct => }
Option {parse_ct => 1} allows for comments to be parsed. The default is {parse_ct => 0}.
- option {parse_pi => }
Option {parse_pi => 1} allows for processing-instructions and XML-Declarations to be parsed. The default is {parse_pi => 0}.
- option {using => }
Option {using => } allows for selecting a sub-tree of the XML.
The syntax is {using => ['/path1/path2/path3', '/path4/path5/path6']}
- option {filter => }
Option {filter => 2} shows all lines, including attributes.
Option {filter => 3} removes attribute lines (i.e. it removes lines where $rdr->type eq '@'). Instead, it returns the attributes in a hash $rdr->att_hash.
Option {filter => 4} breaks down each line into its individual start-tags, end-tags, attributes, comments and processing-instructions. This allows the parsing of XML into pyx-formatted lines.
The syntax is {filter => 2|3|4}, default is {filter => 2}
- option {strip => }
Option {strip => 1} strips all leading and trailing spaces from text and comments (attributes are never stripped). {strip => 0} leaves text and comments unmodified.
The syntax is {strip => 0|1}, default is {strip => 1}
Methods
A successfully created object of type XML::Reader provides the following methods:
- iterate
Reads one single XML-value. It returns 1 after a successful read, or undef when it hits end-of-file.
- path
Provides the complete path of the currently selected value; attributes are represented by a leading '@'-sign.
- value
Provides the actual value (i.e. the value of the current text or attribute).
Note that with {filter => 2} or {filter => 3}, the value is empty when the current item is an XML declaration (i.e. $rdr->is_decl == 1), so you will usually want to suppress it. A typical code fragment would be:
print $rdr->value, "\n" unless $rdr->is_decl;
The above code does *not* apply for {filter => 4}, in which case a simple "print $rdr->value;" suffices:
print $rdr->value, "\n";
- comment
Provides the comment of the XML. You should check if $rdr->is_comment is true before accessing the comment.
- type
Provides the type of the value: 'T' for text, '@' for attributes.
If option {filter => 4} is in effect, then the type can be: 'T' for text, '@' for attributes, 'S' for start tags, 'E' for end-tags, '#' for comments, 'D' for the XML Declaration, '?' for processing-instructions.
- tag
Provides the current tag-name.
- attr
Provides the current attribute name (returns the empty string for non-attribute lines).
- level
Indicates the nesting level of the XPath expression (numeric value greater than or equal to zero).
- prefix
Shows the prefix which has been removed in option {using => ...}. Returns the empty string if option {using => ...} has not been specified.
- att_hash
Returns a reference to a hash with the current attributes of a start-tag (or empty hash if it is not a start-tag).
- dec_hash
Returns a reference to a hash with the current attributes of an XML-Declaration (or empty hash if it is not an XML-Declaration).
- proc_tgt
Returns the target (i.e. the first part) of a processing-instruction (or an empty string if the current event is not a processing-instruction).
- proc_data
Returns the data (i.e. the second part) of a processing-instruction (or an empty string if the current event is not a processing-instruction).
- pyx
Returns the pyx string of the current XML-event.
The pyx string is a string that starts with a specific first character. That first character of each line of PYX tells you what type of event you are dealing with: if the first character is '(', then you are dealing with a start event. If it's a ')', then you are dealing with an end event. If it's an 'A', then you are dealing with attributes. If it's '-', then you are dealing with text. If it's '?', then you are dealing with processing-instructions. (See http://www.xml.com/pub/a/2000/03/15/feature/index.html for an introduction to PYX.)
The method pyx makes sense only if option {filter => 4} is selected; for any filter other than 4, undef is returned.
- is_start
Returns 1 if the XML-file had a start tag at the current position, otherwise 0 is returned.
- is_end
Returns 1 if the XML-file had an end tag at the current position, otherwise 0 is returned.
- is_decl
Returns 1 if the XML-file had an XML-Declaration at the current position, otherwise 0 is returned.
- is_proc
Returns 1 if the XML-file had a processing-instruction at the current position, otherwise 0 is returned.
- is_comment
Returns 1 if the XML-file had a comment at the current position, otherwise 0 is returned.
- is_text
Returns 1 if the XML-file had text at the current position, otherwise 0 is returned.
- is_attr
Returns 1 if the XML-file had an attribute at the current position, otherwise 0 is returned.
- is_value
Returns 1 if the XML-file has either a text or an attribute at the current position, otherwise 0 is returned. This is mostly useful in mode {filter => 4} to see whether the method value() can be used.
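The first-character convention described for the pyx method can be illustrated without XML::Reader at all. The following is a minimal sketch in plain Perl; the helper name classify_pyx and the labels in the lookup table are made up for this illustration, and only the five leading characters listed in the description of pyx are covered.

```perl
use strict;
use warnings;

# First character of a pyx line => kind of event, as listed for the pyx method.
my %PYX_EVENT = (
    '(' => 'start',
    ')' => 'end',
    'A' => 'attribute',
    '-' => 'text',
    '?' => 'processing-instruction',
);

# Hypothetical helper: split one pyx line into (event, payload).
sub classify_pyx {
    my ($line) = @_;
    return ('empty', '') unless defined $line and length $line;
    my $first = substr $line, 0, 1;   # the character that identifies the event
    my $rest  = substr $line, 1;      # everything after it
    return ($PYX_EVENT{$first} // 'unknown', $rest);
}

for my $line ('(delta', 'Aalter 511', '-car', '?tt dat', ')delta') {
    my ($event, $payload) = classify_pyx($line);
    printf "%-22s %s\n", $event, $payload;
}
```

Any other leading character is reported as 'unknown' here, so a caller can decide how to treat lines that fall outside the five listed cases.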
OPTION USING
Option {using => ...} allows for selecting a sub-tree of the XML.
Here is how it works in detail...
option {using => ['/path1/path2/path3', '/path4/path5/path6']} eliminates all lines which do not start with '/path1/path2/path3' (or with '/path4/path5/path6', for that matter). This effectively leaves only lines starting with '/path1/path2/path3' or '/path4/path5/path6'.
Those lines (which are not eliminated) will have a shorter path by effectively removing the prefix '/path1/path2/path3' (or '/path4/path5/path6') from the path. The removed prefix, however, shows up in the prefix-method.
The paths '/path1/path2/path3' (or '/path4/path5/path6') must be absolute and complete. Absolute means that they have to start with a '/'-character. Complete means that the last item in the path ('path3' or 'path6') is terminated internally by a trailing '/'-character, so that, for example, 'path3' does not also match an element named 'path3x'.
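The matching rule described above (keep only lines that start with one of the given paths, strip that prefix, and treat the last path item as complete) can be sketched in plain Perl. This is an illustration of the documented behaviour, not the module's actual implementation; the helper name apply_using is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (prefix, shortened path) for the first entry
# of @$using that matches $path, or an empty list if the line is eliminated.
sub apply_using {
    my ($path, $using) = @_;
    for my $prefix (@$using) {
        # Compare with a trailing '/' on both sides, so that '/data/supplier'
        # matches '/data/supplier/@id' but not '/data/supplierx'.
        next unless index("$path/", "$prefix/") == 0;
        my $rest = substr $path, length $prefix;
        return ($prefix, $rest eq '' ? '/' : $rest);
    }
    return;
}

my @using = ('/data/order/database/customer', '/data/supplier');
for my $path ('/data/order/database/customer/@name', '/data/supplier', '/data/dummy') {
    my ($prefix, $rest) = apply_using($path, \@using);
    if (defined $prefix) { printf "prf=%-29s pat=%s\n", $prefix, $rest; }
    else                 { print  "eliminated: $path\n"; }
}
```

A path that is exactly equal to one of the prefixes is shortened to '/', which corresponds to the "pat=/" lines in the example output of the next section.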
An example with option 'using'
The following program takes this XML and parses it with XML::Reader, including the option 'using' to target specific elements:
use XML::Reader;
my $line2 = q{
<data>
<order>
<database>
<customer name="aaa" />
<customer name="bbb" />
<customer name="ccc" />
<customer name="ddd" />
</database>
</order>
<dummy value="ttt">test</dummy>
<supplier>hhh</supplier>
<supplier>iii</supplier>
<supplier>jjj</supplier>
</data>
};
my $rdr = XML::Reader->newhd(\$line2,
{using => ['/data/order/database/customer', '/data/supplier']});
my $i = 0;
while ($rdr->iterate) { $i++;
printf "%3d. prf=%-29s, pat=%-7s, val=%-3s, tag=%-6s, t=%-1s, lvl=%2d\n",
$i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->tag, $rdr->type, $rdr->level;
}
This is the output of that program:
1. prf=/data/order/database/customer, pat=/@name , val=aaa, tag=@name , t=@, lvl= 1
2. prf=/data/order/database/customer, pat=/ , val= , tag= , t=T, lvl= 0
3. prf=/data/order/database/customer, pat=/@name , val=bbb, tag=@name , t=@, lvl= 1
4. prf=/data/order/database/customer, pat=/ , val= , tag= , t=T, lvl= 0
5. prf=/data/order/database/customer, pat=/@name , val=ccc, tag=@name , t=@, lvl= 1
6. prf=/data/order/database/customer, pat=/ , val= , tag= , t=T, lvl= 0
7. prf=/data/order/database/customer, pat=/@name , val=ddd, tag=@name , t=@, lvl= 1
8. prf=/data/order/database/customer, pat=/ , val= , tag= , t=T, lvl= 0
9. prf=/data/supplier , pat=/ , val=hhh, tag= , t=T, lvl= 0
10. prf=/data/supplier , pat=/ , val=iii, tag= , t=T, lvl= 0
11. prf=/data/supplier , pat=/ , val=jjj, tag= , t=T, lvl= 0
An example without option 'using'
The following program takes the same XML and parses it with XML::Reader, but without the option 'using'.
use XML::Reader;
my $rdr = XML::Reader->newhd(\$line2);
my $i = 0;
while ($rdr->iterate) { $i++;
printf "%3d. prf=%-1s, pat=%-37s, val=%-6s, tag=%-11s, t=%-1s, lvl=%2d\n",
$i, $rdr->prefix, $rdr->path, $rdr->value, $rdr->tag, $rdr->type, $rdr->level;
}
As you can see in the following output, there are many more lines written, the prefix is empty and the path is much longer than in the previous program:
1. prf= , pat=/data , val= , tag=data , t=T, lvl= 1
2. prf= , pat=/data/order , val= , tag=order , t=T, lvl= 2
3. prf= , pat=/data/order/database , val= , tag=database , t=T, lvl= 3
4. prf= , pat=/data/order/database/customer/@name , val=aaa , tag=@name , t=@, lvl= 5
5. prf= , pat=/data/order/database/customer , val= , tag=customer , t=T, lvl= 4
6. prf= , pat=/data/order/database , val= , tag=database , t=T, lvl= 3
7. prf= , pat=/data/order/database/customer/@name , val=bbb , tag=@name , t=@, lvl= 5
8. prf= , pat=/data/order/database/customer , val= , tag=customer , t=T, lvl= 4
9. prf= , pat=/data/order/database , val= , tag=database , t=T, lvl= 3
10. prf= , pat=/data/order/database/customer/@name , val=ccc , tag=@name , t=@, lvl= 5
11. prf= , pat=/data/order/database/customer , val= , tag=customer , t=T, lvl= 4
12. prf= , pat=/data/order/database , val= , tag=database , t=T, lvl= 3
13. prf= , pat=/data/order/database/customer/@name , val=ddd , tag=@name , t=@, lvl= 5
14. prf= , pat=/data/order/database/customer , val= , tag=customer , t=T, lvl= 4
15. prf= , pat=/data/order/database , val= , tag=database , t=T, lvl= 3
16. prf= , pat=/data/order , val= , tag=order , t=T, lvl= 2
17. prf= , pat=/data , val= , tag=data , t=T, lvl= 1
18. prf= , pat=/data/dummy/@value , val=ttt , tag=@value , t=@, lvl= 3
19. prf= , pat=/data/dummy , val=test , tag=dummy , t=T, lvl= 2
20. prf= , pat=/data , val= , tag=data , t=T, lvl= 1
21. prf= , pat=/data/supplier , val=hhh , tag=supplier , t=T, lvl= 2
22. prf= , pat=/data , val= , tag=data , t=T, lvl= 1
23. prf= , pat=/data/supplier , val=iii , tag=supplier , t=T, lvl= 2
24. prf= , pat=/data , val= , tag=data , t=T, lvl= 1
25. prf= , pat=/data/supplier , val=jjj , tag=supplier , t=T, lvl= 2
26. prf= , pat=/data , val= , tag=data , t=T, lvl= 1
OPTION PARSE_CT
Option {parse_ct => 1} allows for comments to be parsed (usually, comments are ignored by XML::Reader, that is, {parse_ct => 0} is the default).
Here is an example where comments are ignored by default:
use XML::Reader;
my $text = q{<?xml version="1.0"?><dummy>xyz <!-- remark --> stu <?ab cde?> test</dummy>};
my $rdr = XML::Reader->newhd(\$text) or die "Error: $!";
while ($rdr->iterate) {
if ($rdr->is_decl) { my %h = %{$rdr->dec_hash};
print "Found decl ", join('', map{" $_='$h{$_}'"} sort keys %h), "\n"; }
if ($rdr->is_proc) { print "Found proc ", "t=", $rdr->proc_tgt, ", d=", $rdr->proc_data, "\n"; }
if ($rdr->is_comment) { print "Found comment ", $rdr->comment, "\n"; }
print "Text '", $rdr->value, "'\n" unless $rdr->is_decl;
}
Here is the output:
Text 'xyz stu test'
Now, the very same XML data, and the same algorithm, except for the option {parse_ct => 1}, which is now activated:
use XML::Reader;
my $text = q{<?xml version="1.0"?><dummy>xyz <!-- remark --> stu <?ab cde?> test</dummy>};
my $rdr = XML::Reader->newhd(\$text, {parse_ct => 1}) or die "Error: $!";
while ($rdr->iterate) {
if ($rdr->is_decl) { my %h = %{$rdr->dec_hash};
print "Found decl ", join('', map{" $_='$h{$_}'"} sort keys %h), "\n"; }
if ($rdr->is_proc) { print "Found proc ", "t=", $rdr->proc_tgt, ", d=", $rdr->proc_data, "\n"; }
if ($rdr->is_comment) { print "Found comment ", $rdr->comment, "\n"; }
print "Text '", $rdr->value, "'\n" unless $rdr->is_decl;
}
Here is the output:
Text 'xyz'
Found comment remark
Text 'stu test'
OPTION PARSE_PI
Option {parse_pi => 1} allows for processing-instructions and XML-Declarations to be parsed (usually, processing-instructions and XML-Declarations are ignored by XML::Reader, that is, {parse_pi => 0} is the default).
As an example, we use the very same XML data, and the same algorithm from the above paragraph, except for the option {parse_pi => 1}, which is now activated (together with option {parse_ct => 1}):
use XML::Reader;
my $text = q{<?xml version="1.0"?><dummy>xyz <!-- remark --> stu <?ab cde?> test</dummy>};
my $rdr = XML::Reader->newhd(\$text, {parse_ct => 1, parse_pi => 1}) or die "Error: $!";
while ($rdr->iterate) {
if ($rdr->is_decl) { my %h = %{$rdr->dec_hash};
print "Found decl ", join('', map{" $_='$h{$_}'"} sort keys %h), "\n"; }
if ($rdr->is_proc) { print "Found proc ", "t=", $rdr->proc_tgt, ", d=", $rdr->proc_data, "\n"; }
if ($rdr->is_comment) { print "Found comment ", $rdr->comment, "\n"; }
print "Text '", $rdr->value, "'\n" unless $rdr->is_decl;
}
Note the "unless $rdr->is_decl" in the above code. This is to avoid outputting any value after the XML declaration (which would be empty anyway).
Here is the output:
Found decl version='1.0'
Text 'xyz'
Found comment remark
Text 'stu'
Found proc t=ab, d=cde
Text 'test'
OPTION FILTER
Option {filter => } selects the operation mode used when processing the XML data.
Option {filter => 2}
With option {filter => 2}, XML::Reader produces one line for each character event. If that line is preceded by a start-tag, method is_start returns 1; if it is followed by an end-tag, method is_end returns 1. Likewise, a preceding comment causes is_comment to return 1, a preceding XML-declaration causes is_decl to return 1, and a preceding processing-instruction causes is_proc to return 1.
Also, attribute lines are added via the special '/@...' syntax.
Option {filter => 2} is the default.
Here is an example...
use XML::Reader;
my $text = q{<root><test param="v"><a><b>e<data id="z">g</data>f</b></a></test>x <!-- remark --> yz</root>};
my $rdr = XML::Reader->newhd(\$text) or die "Error: $!";
while ($rdr->iterate) {
printf "Path: %-24s, Value: %s\n", $rdr->path, $rdr->value;
}
This program (with implicit option {filter => 2} as default) produces the following output:
Path: /root , Value:
Path: /root/test/@param , Value: v
Path: /root/test , Value:
Path: /root/test/a , Value:
Path: /root/test/a/b , Value: e
Path: /root/test/a/b/data/@id , Value: z
Path: /root/test/a/b/data , Value: g
Path: /root/test/a/b , Value: f
Path: /root/test/a , Value:
Path: /root/test , Value:
Path: /root , Value: x yz
The same {filter => 2} also allows the structure of the XML to be rebuilt with the help of the methods is_start and is_end. Please note that in the above output, the first line ("Path: /root, Value:") has an empty value, but it is important for the structure of the XML. Therefore we can't ignore it.
Let us now look at the same example (with option {filter => 2}), but with an additional algorithm to reconstruct the original XML:
use XML::Reader;
my $text = q{<root><test param="v"><a><b>e<data id="z">g</data>f</b></a></test>x <!-- remark --> yz</root>};
my $rdr = XML::Reader->newhd(\$text) or die "Error: $!";
my %at;
while ($rdr->iterate) {
my $indentation = ' ' x ($rdr->level - 1);
if ($rdr->type eq '@') { $at{$rdr->attr} = $rdr->value; }
if ($rdr->is_start) {
print $indentation, '<', $rdr->tag, join('', map{" $_='$at{$_}'"} sort keys %at), '>', "\n";
}
unless ($rdr->type eq '@') { %at = (); }
if ($rdr->type eq 'T' and $rdr->value ne '') {
print $indentation, ' ', $rdr->value, "\n";
}
if ($rdr->is_end) {
print $indentation, '</', $rdr->tag, '>', "\n";
}
}
...and here is the output:
<root>
<test param='v'>
<a>
<b>
e
<data id='z'>
g
</data>
f
</b>
</a>
</test>
x yz
</root>
...this is proof that the original structure of the XML is not lost.
Option {filter => 3}
Option {filter => 3} works very much like {filter => 2}.
The difference, though, is that with option {filter => 3} all attribute-lines are suppressed and instead, the attributes are presented for each start-line in the hash $rdr->att_hash().
This makes it possible to dispense with the global %at variable of the previous algorithm and to use %{$rdr->att_hash} instead.
Here is the new algorithm for {filter => 3}. We don't need to worry about attributes (that is, we don't need to check for $rdr->type eq '@') and, as already mentioned, the %at variable is replaced by %{$rdr->att_hash}:
use XML::Reader;
my $text = q{<root><test param="v"><a><b>e<data id="z">g</data>f</b></a></test>x <!-- remark --> yz</root>};
my $rdr = XML::Reader->newhd(\$text, {filter => 3}) or die "Error: $!";
while ($rdr->iterate) {
my $indentation = ' ' x ($rdr->level - 1);
if ($rdr->is_start) {
print $indentation, '<', $rdr->tag,
join('', map{" $_='".$rdr->att_hash->{$_}."'"} sort keys %{$rdr->att_hash}),
'>', "\n";
}
if ($rdr->type eq 'T' and $rdr->value ne '') {
print $indentation, ' ', $rdr->value, "\n";
}
if ($rdr->is_end) {
print $indentation, '</', $rdr->tag, '>', "\n";
}
}
...the output for {filter => 3} is identical to the output for {filter => 2}:
<root>
<test param='v'>
<a>
<b>
e
<data id='z'>
g
</data>
f
</b>
</a>
</test>
x yz
</root>
Option {filter => 4}
Although this is not the main purpose of XML::Reader, option {filter => 4} can generate individual lines for start-tags, end-tags, comments, processing-instructions and XML-Declarations. Its aim is to generate a pyx string for further processing and analysis.
Here is an example:
use XML::Reader;
my $text = q{<?xml version="1.0" encoding="iso-8859-1"?>
<delta>
<dim alter="511">
<gamma />
<beta>
car <?tt dat?>
</beta>
</dim>
dskjfh <!-- remark --> uuu
</delta>};
my $rdr = XML::Reader->newhd(\$text, {filter => 4, parse_pi => 1}) or die "Error: $!";
while ($rdr->iterate) {
printf "Type = %1s, pyx = %s\n", $rdr->type, $rdr->pyx;
}
And here is the output:
Type = D, pyx = ?xml version='1.0' encoding='iso-8859-1'
Type = S, pyx = (delta
Type = S, pyx = (dim
Type = @, pyx = Aalter 511
Type = S, pyx = (gamma
Type = E, pyx = )gamma
Type = S, pyx = (beta
Type = T, pyx = -car
Type = ?, pyx = ?tt dat
Type = E, pyx = )beta
Type = E, pyx = )dim
Type = T, pyx = -dskjfh uuu
Type = E, pyx = )delta
Be aware that comments can be produced by pyx in a non-standard way if requested by {parse_ct => 1}. In fact, comments are produced with a leading hash symbol, which is not part of the pyx specification, as can be seen in the following example:
use XML::Reader;
my $text = q{
<delta>
<!-- remark -->
</delta>};
my $rdr = XML::Reader->newhd(\$text, {filter => 4, parse_ct => 1}) or die "Error: $!";
while ($rdr->iterate) {
printf "Type = %1s, pyx = %s\n", $rdr->type, $rdr->pyx;
}
Here is the output:
Type = S, pyx = (delta
Type = #, pyx = #remark
Type = E, pyx = )delta
Finally, when operating with {filter => 4}, the usual methods (value, attr, path, is_start, is_end, is_decl, is_proc, is_comment, is_attr, is_text, is_value, comment, proc_tgt, proc_data, dec_hash or att_hash) remain operational. Here is an example:
use XML::Reader;
my $text = q{<?xml version="1.0"?>
<parent abc="def"> <?pt hmf?>
dskjfh <!-- remark -->
<child>ghi</child>
</parent>};
my $rdr = XML::Reader->newhd(\$text, {filter => 4, parse_pi => 1, parse_ct => 1}) or die "Error: $!";
while ($rdr->iterate) {
printf "Path %-15s v=%s ", $rdr->path, $rdr->is_value;
if ($rdr->is_start) { print "Found start tag ", $rdr->tag, "\n"; }
elsif ($rdr->is_end) { print "Found end tag ", $rdr->tag, "\n"; }
elsif ($rdr->is_decl) { my %h = %{$rdr->dec_hash};
print "Found decl ", join('', map{" $_='$h{$_}'"} sort keys %h), "\n"; }
elsif ($rdr->is_proc) { print "Found proc ", "t=", $rdr->proc_tgt, ", d=", $rdr->proc_data, "\n"; }
elsif ($rdr->is_comment) { print "Found comment ", $rdr->comment, "\n"; }
elsif ($rdr->is_attr) { print "Found attribute ", $rdr->attr, "='", $rdr->value, "'\n"; }
elsif ($rdr->is_text) { print "Found text ", $rdr->value, "\n"; }
}
Here is the output:
Path / v=0 Found decl version='1.0'
Path /parent v=0 Found start tag parent
Path /parent/@abc v=1 Found attribute abc='def'
Path /parent v=0 Found proc t=pt, d=hmf
Path /parent v=1 Found text dskjfh
Path /parent v=0 Found comment remark
Path /parent/child v=0 Found start tag child
Path /parent/child v=1 Found text ghi
Path /parent/child v=0 Found end tag child
Path /parent v=0 Found end tag parent
Note that v=1 (i.e. $rdr->is_value == 1) for all text and all attributes.
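For reference, the type codes that can occur with {filter => 4}, and the rule that only text and attributes carry a value, can be written down as a small lookup in plain Perl. This is merely a sketch that restates the documentation; the names %TYPE_NAME and carries_value are hypothetical and not part of XML::Reader.

```perl
use strict;
use warnings;

# Type codes listed for the type method under {filter => 4}.
my %TYPE_NAME = (
    'T' => 'text',
    '@' => 'attribute',
    'S' => 'start-tag',
    'E' => 'end-tag',
    '#' => 'comment',
    'D' => 'XML declaration',
    '?' => 'processing-instruction',
);

# Hypothetical helper: 1 for the two type codes that carry a value
# (text and attributes), mirroring the v=1 lines in the output above.
sub carries_value {
    my ($type) = @_;
    return ($type eq 'T' || $type eq '@') ? 1 : 0;
}

for my $type ('D', 'S', '@', '?', 'T', '#', 'E') {
    printf "%s %-22s v=%d\n", $type, $TYPE_NAME{$type}, carries_value($type);
}
```

In a real loop over $rdr->iterate, the method is_value already answers the same question, so a table like this is mainly useful for logging or debugging output.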
EXAMPLES
Let's look at the following piece of XML from which we want to extract the values in <item> (by that I mean only the first 'start...'-value, not the 'end...'-value), plus the attributes "p1" and "p3". The item-tag must be exactly in the /start/param/data range (and *not* in the /start/param/dataz range).
my $text = q{
<start>
<param>
<data>
<item p1="a" p2="b" p3="c">start1 <inner p1="p">i1</inner> end1</item>
<item p1="d" p2="e" p3="f">start2 <inner p1="q">i2</inner> end2</item>
<item p1="g" p2="h" p3="i">start3 <inner p1="r">i3</inner> end3</item>
</data>
<dataz>
<item p1="j" p2="k" p3="l">start9 <inner p1="s">i9</inner> end9</item>
</dataz>
<data>
<item p1="m" p2="n" p3="o">start4 <inner p1="t">i4</inner> end4</item>
</data>
</param>
</start>};
We expect exactly 4 output-lines from our parse (i.e. we don't expect the 'dataz' part - 'start9' - in the output):
item = 'start1', p1 = 'a', p3 = 'c'
item = 'start2', p1 = 'd', p3 = 'f'
item = 'start3', p1 = 'g', p3 = 'i'
item = 'start4', p1 = 'm', p3 = 'o'
Parsing XML with {filter => 2}
Here is a sample program to parse that XML with {filter => 2}. (Note how the prefix '/start/param/data/item' is located in the {using =>} option of newhd). We need two scalars ('$p1' and '$p3') to hold the parameters in '/@p1' and in '/@p3' and carry them over to the $rdr->is_start section, where they can be printed.
my $rdr = XML::Reader->newhd(\$text,
{filter => 2, using => '/start/param/data/item'}) or die "Error: $!";
my ($p1, $p3);
while ($rdr->iterate) {
if ($rdr->path eq '/@p1') { $p1 = $rdr->value; }
elsif ($rdr->path eq '/@p3') { $p3 = $rdr->value; }
elsif ($rdr->path eq '/' and $rdr->is_start) {
printf "item = '%s', p1 = '%s', p3 = '%s'\n",
$rdr->value, $p1, $p3;
}
unless ($rdr->is_attr) { $p1 = undef; $p3 = undef; }
}
Parsing XML with {filter => 3}
With {filter => 3} we can dispense with the two scalars '$p1' and '$p3'; the code becomes very simple:
my $rdr = XML::Reader->newhd(\$text,
{filter => 3, using => '/start/param/data/item'}) or die "Error: $!";
while ($rdr->iterate) {
if ($rdr->path eq '/' and $rdr->is_start) {
printf "item = '%s', p1 = '%s', p3 = '%s'\n",
$rdr->value, $rdr->att_hash->{p1}, $rdr->att_hash->{p3};
}
}
Parsing XML with {filter => 4}
With {filter => 4}, however, the code becomes slightly more complicated again: As already shown for {filter => 2}, we need again two scalars ('$p1' and '$p3') to hold the parameters in '/@p1' and in '/@p3' and carry them over. In addition to that, we also need a way to count text-values (see scalar '$count'), so that we can distinguish between the first value 'start...' (that we want to print) and the second value 'end...' (that we do not want to print).
my $rdr = XML::Reader->newhd(\$text,
{filter => 4, using => '/start/param/data/item'}) or die "Error: $!";
my ($count, $p1, $p3);
while ($rdr->iterate) {
if ($rdr->path eq '/@p1') { $p1 = $rdr->value; }
elsif ($rdr->path eq '/@p3') { $p3 = $rdr->value; }
elsif ($rdr->path eq '/') {
if ($rdr->is_start) { $count = 0; $p1 = undef; $p3 = undef; }
elsif ($rdr->is_text) {
$count++;
if ($count == 1) {
printf "item = '%s', p1 = '%s', p3 = '%s'\n",
$rdr->value, $p1, $p3;
}
}
}
}
FUNCTIONS
Function slurp_xml
The function slurp_xml reads an XML file and slurps it into an array-ref. Here is an example where we want to slurp the name, the street and the city of all customers in the path '/data/order/database/customer':
use XML::Reader qw(slurp_xml);
my $line2 = q{
<data>
<order>
<database>
<customer name="smith" id="652">
<street>high street</street>
<city>boston</city>
</customer>
<customer name="jones" id="184">
<street>maple street</street>
<city>new york</city>
</customer>
<customer name="stewart" id="520">
<street>ring road</street>
<city>dallas</city>
</customer>
</database>
</order>
<dummy value="ttt">test</dummy>
<supplier>hhh</supplier>
<supplier>iii</supplier>
<supplier>jjj</supplier>
</data>
};
my $aref = slurp_xml(\$line2, '/data/order/database/customer',
['/@name', '/street', '/city']);
for (@$aref) {
printf "Name = %-7s Street = %-12s City = %s\n", $_->[0], $_->[1], $_->[2];
}
The first parameter to slurp_xml is the filename (or scalar reference, or open filehandle) of the XML that will be slurped. In this case we read from a scalar ref \$line2. The second parameter is the root of the sub-tree that we want to slurp (in our case that's '/data/order/database/customer'). Finally we supply a list of the elements that we want to slurp, relative to the sub-tree. In this case it is ['/@name', '/street', '/city'].
Here is the output:
Name = smith Street = high street City = boston
Name = jones Street = maple street City = new york
Name = stewart Street = ring road City = dallas
slurp_xml works similarly to XML::Simple, in that it reads all required information in one go into an in-memory data structure. The difference, however, is that slurp_xml lets you be specific about what you actually want before you do the slurping, so that your in-memory data structure is smaller and less complicated.
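Judging from the loop in the example above, each element of the array-ref returned by slurp_xml is itself an array-ref with one entry per requested element, in the order in which the elements were requested. Assuming that shape, the rows can be converted into hashes for more readable access. The sketch below uses a hand-written stand-in for the returned data, taken from the example output, so that it runs without XML::Reader; the field names are chosen for this illustration only.

```perl
use strict;
use warnings;

# Stand-in for the value returned by slurp_xml in the example above: one
# inner array-ref per customer, in the order '/@name', '/street', '/city'.
my $aref = [
    ['smith',   'high street',  'boston'],
    ['jones',   'maple street', 'new york'],
    ['stewart', 'ring road',    'dallas'],
];

# Turn the positional rows into hashes keyed by field name.
my @fields    = qw(name street city);
my @customers = map {
    my $row = $_;
    +{ map { ($fields[$_] => $row->[$_]) } 0 .. $#fields };
} @$aref;

# Index the customers by name for direct lookup.
my %by_name = map { ($_->{name} => $_) } @customers;

print "$by_name{jones}{street}, $by_name{jones}{city}\n";
```

Keeping the list of field names next to the list of paths passed to slurp_xml makes it harder for the two to drift apart when a path is added or removed.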
AUTHOR
Klaus Eichner, March 2009
COPYRIGHT AND LICENSE
Copyright (C) 2009 by Klaus Eichner
All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the artistic license, see http://www.opensource.org/licenses/artistic-license-1.0.php
RELATED MODULES
If you also want to write XML, have a look at XML::Writer. This module provides a simple interface for writing XML. (If you are writing non-mixed content XML, consider setting DATA_MODE=>1 and DATA_INDENT=>2, which allows for proper indentation in your XML-Output file)
SEE ALSO
XML::TokeParser, XML::Simple, XML::Parser, XML::Parser::Expat, XML::TiePYX, XML::Writer.