NAME

XML::TreePP::XMLPath - Something similar to XPath, allowing definition of paths to XML subtrees

SYNOPSIS

use XML::TreePP;
use XML::TreePP::XMLPath;

my $tpp = XML::TreePP->new();
my $tppx = XML::TreePP::XMLPath->new();

my $tree = { rss => { channel => { item => [ {
    title   => "The Perl Directory",
    link    => "http://www.perl.org/",
}, {
    title   => "The Comprehensive Perl Archive Network",
    link    => "http://cpan.perl.org/",
} ] } } };
my $xml = $tpp->write( $tree );

Get a subtree of the XMLTree:

my $xmlsub = $tppx->getSubtree( $tree , q{rss/channel/item[title="The Comprehensive Perl Archive Network"]} );
print $xmlsub->{'link'};

Iterate through all attributes and elements of each <item> XML element:

my $xmlsub = $tppx->getSubtree( $tree , q{rss/channel/item} );
my $h_attr = $tppx->getAttributes( $xmlsub );
my $h_elem = $tppx->getElements( $xmlsub );
foreach my $attrHash ( @{ $h_attr } ) {
    while ( my ( $attrKey, $attrVal ) = each %{$attrHash} ) {
        ...
    }
}
foreach my $elemHash ( @{ $h_elem } ) {
    while ( my ( $elemName, $elemVal ) = each %{$elemHash} ) {
        ...
    }
}

DESCRIPTION

A pure-Perl extension to the pure-Perl XML::TreePP module that supports paths to XML subtree content. This may seem similar to XPath, but it is not XPath.

REQUIREMENTS

This module depends on the following Perl modules:

  • XML::TreePP

  • Params::Validate

IMPORTABLE METHODS

When the calling application invokes this module in a use clause, the following methods can be imported into its space.

  • getAttributes

  • getElements

  • getSubtree

  • parseXMLPath

Example:

use XML::TreePP::XMLPath qw(getAttributes getElements getSubtree parseXMLPath);

XMLPath PHILOSOPHY

Consider the following XML data:

<paragraph>
    <sentence language="english">
        <words>Do red cats eat yellow food</words>
        <punctuation>?</punctuation>
    </sentence>
    <sentence language="english">
        <words>Brown cows eat green grass</words>
        <punctuation>.</punctuation>
    </sentence>
</paragraph>

Where the path "paragraph/sentence[@language=english]/words" matches "Do red cats eat yellow food"

And the path "paragraph/sentence[punctuation=.]/words" matches "Brown cows eat green grass"

So that "@attr=val" is identified as an attribute inside the "<tag attr="val"></tag>"

And "attr=val" is identified as a nested attribute inside the "<tag><attr>val</attr></tag>"

After XML::TreePP parses the above XML, it looks like this:

{
  paragraph => {
        sentence => [
              {
                "-language" => "english",
                punctuation => "?",
                words => "Do red cats eat yellow food",
              },
              {
                "-language" => "english",
                punctuation => ".",
                words => "Brown cows eat green grass",
              },
            ],
      },
}
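
As a rough illustration of what the two paths above select, the following sketch works directly on the parsed structure with plain Perl and does not call this module. The variable names are chosen for the example only.

```perl
use strict;
use warnings;

# The parsed structure shown above, written out as a Perl literal.
my $tree = {
    paragraph => {
        sentence => [
            {   '-language' => 'english',
                punctuation => '?',
                words       => 'Do red cats eat yellow food',
            },
            {   '-language' => 'english',
                punctuation => '.',
                words       => 'Brown cows eat green grass',
            },
        ],
    },
};

my @sentences = @{ $tree->{paragraph}{sentence} };

# paragraph/sentence[@language=english]/words
# The attribute is stored under the key "-language"; take the first match.
my ($by_attr) = grep { $_->{'-language'} eq 'english' } @sentences;

# paragraph/sentence[punctuation=.]/words
# The child element is stored under its plain name.
my ($by_child) = grep { $_->{punctuation} eq '.' } @sentences;

print $by_attr->{words},  "\n";    # Do red cats eat yellow food
print $by_child->{words}, "\n";    # Brown cows eat green grass
```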

Things To Note

Note that attributes are specified in the XMLPath as @attribute_name, but after XML::TreePP parses the XML document, the attribute name is identified as -attribute_name in the resulting parsed document.

XMLPath requires attributes to be specified as @attribute_name and takes care of the conversion from @ to - behind the scenes when accessing the XML::TreePP parsed XML document.

Child elements one level below a parent element are addressed like attributes, as attribute_name. This is the same format as @attribute_name except without the @ symbol. Omitting the @ symbol identifies the name as a child element of the parent element being evaluated.

Child element values are only accessible as CDATA. That is, when the element being evaluated is animal, the attribute (or child element) is cat, and the value of the attribute is tiger, the XML looks like this:
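
The naming rule can be sketched in a few lines of Perl. The helper treepp_key below is hypothetical and is not part of this module; it only assumes XML::TreePP's default attribute prefix of "-".

```perl
use strict;
use warnings;

# Hypothetical helper: map a name as written in an XMLPath filter to the
# hash key used in the parsed document (default attr_prefix of "-").
sub treepp_key {
    my ($name) = @_;
    ( my $key = $name ) =~ s/^\@/-/;    # "@attr" becomes "-attr"
    return $key;                        # child element names are unchanged
}

my @keys = map { treepp_key($_) } qw( @language punctuation );
print "@keys\n";    # -language punctuation
```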

<jungle>
    <animal>
        <cat>tiger</cat>
    </animal>
</jungle>

The XMLPath used to access the key=value pair of cat=tiger for element animal would be as follows:

jungle/animal[cat=tiger]

Note, however, that the above XMLPath is invalid for the following document. Because cat now carries an attribute, XML::TreePP parses it into a hash with a #text key rather than a plain string:

<jungle>
    <animal>
        <cat color="black">tiger</cat>
    </animal>
</jungle>

In this case, the following two XMLPaths are valid:

jungle/animal/cat[#text=tiger]
jungle/animal/cat[@color="black"][#text=tiger]

Note that in the second case the element being evaluated is cat, not animal as in the first case. This is undesirable if you want to evaluate animal.

As such, this is a current limitation of XMLPath. XPath query methods would resolve this limitation; however, XMLPath is not XPath.
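
To make the limitation concrete, here is a small sketch that compares the two parsed forms. The hash literals are written by hand to mirror what XML::TreePP produces with its default options (attributes under "-name", text under "#text"); no parsing is done in the example.

```perl
use strict;
use warnings;

# Parsed form of <cat>tiger</cat>: the value is a plain string.
my $plain = { jungle => { animal => { cat => 'tiger' } } };

# Parsed form of <cat color="black">tiger</cat>: the text moves under "#text".
my $with_attr = {
    jungle => { animal => { cat => { '-color' => 'black', '#text' => 'tiger' } } },
};

for my $tree ( $plain, $with_attr ) {
    my $cat = $tree->{jungle}{animal}{cat};
    if ( ref($cat) eq 'HASH' ) {
        # animal[cat=tiger] cannot match here: cat is a hash, not a string.
        print 'cat is a hash; its text is ' . $cat->{'#text'} . "\n";
    }
    else {
        print "cat is the string $cat\n";
    }
}
```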

METHODS

new

Create a new object instance of this module.

  • returns

    An object instance of this module.

$tppx = new XML::TreePP::XMLPath();

charlexsplit

An analysis method for single-character boundary and start/stop tokens.

  • string

    The string to analyse

  • boundry_start

    The single character starting boundary separating wanted elements

  • boundry_stop

    The single character stopping boundary separating wanted elements

  • tokens

    An array reference of start/stop token pairs, in the order start_char, stop_char, as in [qw( [ ] ' ' )]. The characters in string contained within a start_char and stop_char are not evaluated to match boundaries.

  • boundry_begin

    Provide "1" if the beginning of the string should be treated as a boundry_start character.

  • boundry_end

    Provide "1" if the ending of the string should be treated as a boundry_stop character.

  • returns

    An array reference of elements

$elements = charlexsplit (
                    string         => $string,
                    boundry_start  => $charA,   boundry_stop   => $charB,
                    tokens         => \@tokens,
                    boundry_begin  => $char1,   boundry_end    => $char2 );

parseXMLPath

Parse a string that represents the path to an XML element in an XML document. The XML Path is something like XPath, but it is not XPath.

  • XMLPath

    The XML path to be parsed.

  • returns

    An array reference containing one entry for each element of the XMLPath. Note that XML attributes, written as "@attr", are transformed into "-attr". The leading "-" (minus) in place of the "@" (at) is the recognized format of attributes in the XML::TreePP module.

    Because this is intended to be a submodule of XML::TreePP, the format of '@attr' is converted to '-attr' to conform with how XML::TreePP handles attributes.

    See: XML::TreePP->set( attr_prefix => '@' ); for more information. This module only supports the default format, '-attr', of attributes at this time.

$parsedXMLPath = parseXMLPath( $XMLPath );

validateAttrValue

Validate that a subtree of a parsed XML document satisfies a parameter set of attribute/value pairs.

  • XMLSubTree

    The XML tree, or subtree, (element) to validate. This is an XML document parsed by the XML::TreePP->parse() method.

    The XMLSubTree can be an ARRAY of multiple elements to evaluate. The XMLSubTree would be validated as follows:

    $subtree[item]->{'attribute'} eq "value"
    $subtree[item]->{'attribute'}->{'value'} exists
    returning: $subtree[item] if valid (returns the first valid [item])

    Or the XMLSubTree can be a HASH which would be a single element to evaluate. The XMLSubTree would be validated as follows:

    $subtree{'attribute'} eq "value"
    $subtree{'attribute'}->{'value'} exists
    returning: $subtree if valid

  • params

    Validate the element having an attribute matching value in this current XMLSubTree position

    This is an array reference of [["attr1","val"],["attr2","val"]], as in:

    my $params = [[ "MyKeyName" , "Value_to_match_for_KeyName" ]];

  • returns

    The subtree that is validated, or undef if not validated

$validatedXMLTree = validateAttrValue( $XMLTree , \@params );
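
The validation rules listed above can be restated as a short sketch. This is not the module's own code: node_matches and first_valid are hypothetical names, and the sketch assumes that every [ attribute, value ] pair must match for a node to count as valid.

```perl
use strict;
use warnings;

# Hypothetical restatement of the rules above.
# Assumption: every [ attribute, value ] pair must match.
sub node_matches {
    my ( $node, $params ) = @_;
    return 0 unless ref($node) eq 'HASH';
    for my $pair (@$params) {
        my ( $attr, $value ) = @$pair;
        return 0 unless exists $node->{$attr};
        my $found = $node->{$attr};
        my $ok =
            ref($found) eq 'HASH'
          ? exists $found->{$value}                             # {attribute}->{value} exists
          : ( defined($found) && !ref($found) && $found eq $value );  # {attribute} eq value
        return 0 unless $ok;
    }
    return 1;
}

# Accept either a single element (HASH) or several (ARRAY) and return the
# first valid one, or undef in scalar context if none is valid.
sub first_valid {
    my ( $subtree, $params ) = @_;
    my @candidates = ref($subtree) eq 'ARRAY' ? @$subtree : ($subtree);
    for my $node (@candidates) {
        return $node if node_matches( $node, $params );
    }
    return;
}

my $sentences = [
    { '-language' => 'english', punctuation => '?' },
    { '-language' => 'english', punctuation => '.' },
];
my $hit  = first_valid( $sentences, [ [ '-language', 'english' ], [ 'punctuation', '.' ] ] );
my $miss = first_valid( $sentences, [ [ '-language', 'german' ] ] );

print $hit->{punctuation}, "\n";                      # .
print defined($miss) ? "match" : "no match", "\n";    # no match
```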

getSubtree

Return a subtree of an XML tree from a given XMLPath. See parseXMLPath() for the format of a XMLPath.

  • XMLTree

    An XML::TreePP parsed XML document.

  • XMLPath

    The path within the XML Tree to retrieve. See parseXMLPath()

  • returns

    A subtree of a XML::TreePP parsed XMLTree found at the XMLPath.

$XMLSubTree = getSubtree ( $XMLTree , $XMLPath );
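
For a path without [filters], retrieving a subtree amounts to following one hash key per path step. The sketch below shows that idea only; descend is a hypothetical helper, not the module's implementation, and it does not handle filters or multi-same-name elements.

```perl
use strict;
use warnings;

# Hypothetical helper: follow a filter-free path through nested hashes,
# one element name per "/" separated step.
sub descend {
    my ( $tree, $path ) = @_;
    my $node = $tree;
    for my $name ( split m{/}, $path ) {
        # Stop as soon as a step cannot be followed.
        return unless ref($node) eq 'HASH' && exists $node->{$name};
        $node = $node->{$name};
    }
    return $node;
}

my $tree = { level1 => { level2 => { level3 => { attr3 => 'val3' } } } };
my $sub  = descend( $tree, 'level1/level2/level3' );
my $none = descend( $tree, 'level1/missing' );

print $sub->{attr3}, "\n";                            # val3
print defined($none) ? "found" : "not found", "\n";   # not found
```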

getAttributes

Gets the attributes of the element found at a specified XMLPath

  • XMLTree

    An XML::TreePP parsed XML document.

  • XMLPath

    The path within the XML Tree to retrieve. See parseXMLPath()

  • returns

    An array reference of [{attribute=>value}], or undef if none found

    In the case where the XML Path points at a multi-same-name element, the return value is an array reference of hash references, one hash reference for each element.

    Example Returned Data:

    XML Path points at a single named element
    [ {attr1=>val,attr2=>val} ]
    
    XML Path points at a multi-same-name element
    [ {attr1A=>val,attr1B=>val}, {attr2A=>val,attr2B=>val} ]

$attributes = getAttributes ( $XMLTree , $XMLPath );

getElements

Gets the child elements found at a specified XMLPath

  • XMLTree

    An XML::TreePP parsed XML document.

  • XMLPath

    The path within the XML Tree to retrieve. See parseXMLPath()

  • returns

    An array reference of [{element=>value}], or undef if none found

    Each hash reference holds the elements (not attributes) and each element's XMLSubTree. If the XMLPath points at a multi-valued element, then the subelements of each element at the XMLPath are returned as separate hash references in the returning array reference.

    The format of the returning data is the same as the getAttributes() method.

    The XMLSubTree is fetched based on the provided XMLPath. Then all elements found under that XMLPath are placed into a referenced hash table to be returned. If an element found has additional XML data under it, it is all returned just as it was provided.

    Simply, this strips all XML attributes found at the XMLPath, returning the remaining elements found at that path.

    If the XMLPath has no elements under it, then undef is returned instead.

$elements = getElements ( $XMLTree , $XMLPath );
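
The difference between the data returned by getAttributes() and getElements() can be pictured as a split of one parsed node by key prefix. The sketch below is only an illustration: split_node is a hypothetical helper, and unlike the module it returns empty hashes rather than undef when nothing is found.

```perl
use strict;
use warnings;

# Hypothetical helper: separate one parsed node into its attributes (keys
# with the "-" prefix, returned without it) and its child elements.
sub split_node {
    my ($node) = @_;
    my ( %attributes, %elements );
    for my $key ( keys %$node ) {
        if ( $key =~ /^-(.+)/ ) { $attributes{$1}  = $node->{$key} }
        else                    { $elements{$key} = $node->{$key} }
    }
    return ( \%attributes, \%elements );
}

my $node = {
    '-attr1' => 'val1',
    '-attr2' => 'val2',
    attr3    => 'val3',
    attr4    => undef,
    attrX    => [ 'one', 'two', 'three' ],
};
my ( $attributes, $elements ) = split_node($node);

print join( ',', sort keys %$attributes ), "\n";    # attr1,attr2
print join( ',', sort keys %$elements ),   "\n";    # attr3,attr4,attrX
```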

EXAMPLES

Method: new

It is not necessary to create an object of this module. However, if you choose to do so anyway, here is how you do it.

my $obj = new XML::TreePP::XMLPath;

This module supports being called by two methods.

1. By importing the functions you wish to use, as in:
use XML::TreePP::XMLPath qw( function1 function2 );
function1( args )

See IMPORTABLE METHODS section for methods available for import

2. Or by calling the functions in an object oriented manner, as in:
my $tppx = new XML::TreePP::XMLPath;
$tppx->function1( args )

Using either method works the same and returns the same output.

Method: charlexsplit

Here are three steps that can be used to parse values out of a string:

Step 1:

First, parse the entire string delimited by the / character.

my $el = charlexsplit   (
    string        => q{abcdefg/xyz/path[@key='val'][@key2='val2']/last},
    boundry_start => '/',
    boundry_stop  => '/',
    tokens        => [qw( [ ] ' ' " " )],
    boundry_begin => 1,
    boundry_end   => 1
    );
dump( $el );

Output:

["abcdefg", "xyz", "path[\@key='val'][\@key2='val2']", "last"],

Step 2:

Second, parse the elements from step 1 that have key/val pairs, such that each single key/val is contained by the [ and ] characters

my $el = charlexsplit (
    string        => q( path[@key='val'][@key2='val2'] ),
    boundry_start => '[',
    boundry_stop  => ']',
    tokens        => [qw( ' ' " " )],
    boundry_begin => 0,
    boundry_end   => 0
    );
dump( $el );

Output:

["\@key='val'", "\@key2='val2'"]

Step 3:

Third, parse each element from step 2 that is a single key/val pair; the key and val are delimited by the = character.

my $el = charlexsplit (
    string        => q{ @key='val' },
    boundry_start => '=',
    boundry_stop  => '=',
    tokens        => [qw( ' ' " " )],
    boundry_begin => 1,
    boundry_end   => 1
    );
dump( $el );

Output:

["\@key", "'val'"]

Note that in each example the tokens represent a group of escaped characters which, when analysed, will be collected as part of an element, but will not be allowed to match any starting or stopping boundary.
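
The effect of tokens can be sketched with a simplified splitter. This is not charlexsplit itself: split_outside_tokens is a hypothetical function with a different argument list, it uses a single separator character, and it does not support nested tokens. It only shows that a separator between a start token and its stop token is kept as ordinary text.

```perl
use strict;
use warnings;

# Hypothetical helper: split $string on $sep, ignoring any $sep that falls
# between a start token and its stop token. Tokens do not nest.
sub split_outside_tokens {
    my ( $string, $sep, %tokens ) = @_;
    my ( @parts, $stop );
    my $current = '';
    for my $char ( split //, $string ) {
        if ( defined $stop ) {
            # Inside a token pair: copy characters until the stop token.
            $current .= $char;
            undef $stop if $char eq $stop;
        }
        elsif ( exists $tokens{$char} ) {
            # A start token opens a protected region.
            $current .= $char;
            $stop = $tokens{$char};
        }
        elsif ( $char eq $sep ) {
            push @parts, $current;
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    push @parts, $current;
    return \@parts;
}

my $parts = split_outside_tokens(
    q{abcdefg/xyz/path[@key='a/b'][@key2='val2']/last},
    '/',
    '[' => ']', q{'} => q{'}, q{"} => q{"},
);
print "$_\n" for @$parts;
# abcdefg
# xyz
# path[@key='a/b'][@key2='val2']
# last
```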

If a start token has no matching stop token, you will get undesired results. This example demonstrates this data error.

my $el = charlexsplit   (
    string        => q{ path[@key='val'][@key2=val2'] },
    boundry_start => '[',
    boundry_stop  => ']',
    tokens        => [qw( ' ' " " )],
    boundry_begin => 0,
    boundry_end   => 0
    );
dump( $el );

Undesired output:

["\@key='val'"]

In this example of bad data being parsed, the boundry_stop character ] was never matched for the key2=val2 element.

And there is no error message. The charlexsplit method throws away the second element silently due to the token start and stop mismatch.
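
Since the mismatch is silent, a caller may want to check the input first. The following is one possible pre-check, offered as a sketch only: quotes_balanced is a hypothetical helper, and it relies on the simple observation that a quote character used as both start and stop token is balanced only when it occurs an even number of times.

```perl
use strict;
use warnings;

# Hypothetical pre-check: count each quote character and require an even
# number of occurrences.
sub quotes_balanced {
    my ($string) = @_;
    my $single = ( $string =~ tr/'// );    # tr with an empty list only counts
    my $double = ( $string =~ tr/"// );
    return ( $single % 2 == 0 && $double % 2 == 0 ) ? 1 : 0;
}

for my $path ( q{path[@key='val'][@key2='val2']}, q{path[@key='val'][@key2=val2']} ) {
    my $status = quotes_balanced($path) ? 'ok' : 'suspect';
    printf "%-8s%s\n", $status, $path;
}
```

The first path is reported as ok and the second, which has three single quotes, as suspect.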

Method: parseXMLPath

use XML::TreePP::XMLPath qw(parseXMLPath);
use Data::Dump qw(dump);

my $parsedPath = parseXMLPath(
                              q{abcdefg/xyz/path[@key1='val1'][key2='val2']/last}
                              );
dump ( $parsedPath );

Output:

[
  ["abcdefg", undef],
  ["xyz", undef],
  ["path", [["-key1", "val1"], ["key2", "val2"]]],
  ["last", undef],
]
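
The returned structure can be walked with ordinary Perl. The sketch below copies the output above into a literal, so it runs without this module, and prints each path step with its filters. The wording of the printed lines is chosen for the example only.

```perl
use strict;
use warnings;

# The structure shown in the output above, written out as a Perl literal.
my $parsed = [
    [ 'abcdefg', undef ],
    [ 'xyz',     undef ],
    [ 'path', [ [ '-key1', 'val1' ], [ 'key2', 'val2' ] ] ],
    [ 'last', undef ],
];

my @lines;
for my $step (@$parsed) {
    my ( $element, $filters ) = @$step;
    my @conditions = map {
        my ( $key, $value ) = @$_;
        # A leading "-" marks an attribute; anything else is a child element.
        my $kind = $key =~ /^-/ ? 'attribute' : 'child element';
        "$kind $key = $value";
    } @{ $filters || [] };
    push @lines,
      @conditions ? "$element (" . join( '; ', @conditions ) . ')' : $element;
}
print "$_\n" for @lines;
# abcdefg
# xyz
# path (attribute -key1 = val1; child element key2 = val2)
# last
```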

Method: validateAttrValue

#!/usr/bin/perl
use XML::TreePP;
use XML::TreePP::XMLPath qw(getSubtree);
use Data::Dump qw(dump);
#
# The XML document data
my $xmldata=<<XMLEND;
    <paragraph>
        <sentence language="english">
            <words>Do red cats eat yellow food</words>
            <punctuation>?</punctuation>
        </sentence>
        <sentence language="english">
            <words>Brown cows eat green grass</words>
            <punctuation>.</punctuation>
        </sentence>
    </paragraph>
XMLEND
#
# Parse the XML document.
my $tpp = new XML::TreePP;
my $xmldoc = $tpp->parse($xmldata);
print "Output Test #1\n";
dump( $xmldoc );
#
# Retrieve the sub tree of the XML document at path "paragraph/sentence"
my $xmlSubTree = getSubtree($xmldoc, "paragraph/sentence");
print "Output Test #2\n";
dump( $xmlSubTree );
#
my (@params, $validatedSubTree);
#
# validateAttrValue is not listed as an importable method, so call it on an object.
my $tppx = new XML::TreePP::XMLPath;
#
# Test the XML Sub Tree to have an attribute "-language" with value "german"
@params = (['-language', 'german']);
$validatedSubTree = $tppx->validateAttrValue($xmlSubTree, \@params);
print "Output Test #3\n";
dump( $validatedSubTree );
#
# Test the XML Sub Tree to have an attribute "-language" with value "english"
@params = (['-language', 'english']);
$validatedSubTree = $tppx->validateAttrValue($xmlSubTree, \@params);
print "Output Test #4\n";
dump( $validatedSubTree );

Output:

Output Test #1
{
  paragraph => {
        sentence => [
              {
                "-language" => "english",
                punctuation => "?",
                words => "Do red cats eat yellow food",
              },
              {
                "-language" => "english",
                punctuation => ".",
                words => "Brown cows eat green grass",
              },
            ],
      },
}
Output Test #2
[
  {
    "-language" => "english",
    punctuation => "?",
    words => "Do red cats eat yellow food",
  },
  {
    "-language" => "english",
    punctuation => ".",
    words => "Brown cows eat green grass",
  },
]
Output Test #3
undef
Output Test #4
{
  "-language" => "english",
  punctuation => "?",
  words => "Do red cats eat yellow food",
}

Method: getSubtree

#!/usr/bin/perl
use XML::TreePP;
use XML::TreePP::XMLPath qw(getSubtree);
use Data::Dump qw(dump);
#
# The XML document data
my $xmldata=<<XMLEND;
    <level1>
        <level2>
            <level3 attr1="val1" attr2="val2">
                <attr3>val3</attr3>
                <attr4/>
                <attrX>one</attrX>
                <attrX>two</attrX>
                <attrX>three</attrX>
            </level3>
            <level3 attr1="valOne"/>
        </level2>
    </level1>
XMLEND
#
# Parse the XML document.
my $tpp = new XML::TreePP;
my $xmldoc = $tpp->parse($xmldata);
print "Output Test #1\n";
dump( $xmldoc );
#
# Retrieve the sub tree of the XML document at path "level1/level2"
my $xmlSubTree = getSubtree($xmldoc, 'level1/level2');
print "Output Test #2\n";
dump( $xmlSubTree );
#
# Retrieve the sub tree of the XML document at path "level1/level2/level3[@attr1='val1']"
# ($xmlSubTree is already declared above, so it is reused without "my")
$xmlSubTree = getSubtree($xmldoc, 'level1/level2/level3[@attr1="val1"]');
print "Output Test #3\n";
dump( $xmlSubTree );

Output:

Output Test #1
{
  level1 => {
        level2 => {
              level3 => [
                    {
                      "-attr1" => "val1",
                      "-attr2" => "val2",
                      attr3    => "val3",
                      attr4    => undef,
                      attrX    => ["one", "two", "three"],
                    },
                    { "-attr1" => "valOne" },
                  ],
            },
      },
}
Output Test #2
{
  level3 => [
        {
          "-attr1" => "val1",
          "-attr2" => "val2",
          attr3    => "val3",
          attr4    => undef,
          attrX    => ["one", "two", "three"],
        },
        { "-attr1" => "valOne" },
      ],
}
Output Test #3
{
  "-attr1" => "val1",
  "-attr2" => "val2",
  attr3    => "val3",
  attr4    => undef,
  attrX    => ["one", "two", "three"],
}

See validateAttrValue() EXAMPLES section for more usage examples.

Method: getAttributes

#!/usr/bin/perl
#
use XML::TreePP;
use XML::TreePP::XMLPath qw(getAttributes);
use Data::Dump qw(dump);
#
# The XML document data
my $xmldata=<<XMLEND;
    <level1>
        <level2>
            <level3 attr1="val1" attr2="val2">
                <attr3>val3</attr3>
                <attr4/>
                <attrX>one</attrX>
                <attrX>two</attrX>
                <attrX>three</attrX>
            </level3>
            <level3 attr1="valOne"/>
        </level2>
    </level1>
XMLEND
#
# Parse the XML document.
my $tpp = new XML::TreePP;
my $xmldoc = $tpp->parse($xmldata);
print "Output Test #1\n";
dump( $xmldoc );
#
# Retrieve the attributes of the elements at path "level1/level2/level3"
my $attributes = getAttributes($xmldoc, 'level1/level2/level3');
print "Output Test #2\n";
dump( $attributes );
#
# Retrieve the attributes of the element at path 'level1/level2/level3[attr3="val3"]'
# ($attributes is already declared above, so it is reused without "my")
$attributes = getAttributes($xmldoc, 'level1/level2/level3[attr3="val3"]');
print "Output Test #3\n";
dump( $attributes );

Output:

Output Test #1
{
  level1 => {
        level2 => {
              level3 => [
                    {
                      "-attr1" => "val1",
                      "-attr2" => "val2",
                      attr3    => "val3",
                      attr4    => undef,
                      attrX    => ["one", "two", "three"],
                    },
                    { "-attr1" => "valOne" },
                  ],
            },
      },
}
Output Test #2
[{ attr1 => "val1", attr2 => "val2" }, { attr1 => "valOne" }]
Output Test #3
[{ attr1 => "val1", attr2 => "val2" }]

Method: getElements

#!/usr/bin/perl
#
use XML::TreePP;
use XML::TreePP::XMLPath qw(getElements);
use Data::Dump qw(dump);
#
# The XML document data
my $xmldata=<<XMLEND;
    <level1>
        <level2>
            <level3 attr1="val1" attr2="val2">
                <attr3>val3</attr3>
                <attr4/>
                <attrX>one</attrX>
                <attrX>two</attrX>
                <attrX>three</attrX>
            </level3>
            <level3 attr1="valOne"/>
        </level2>
    </level1>
XMLEND
#
# Parse the XML document.
my $tpp = new XML::TreePP;
my $xmldoc = $tpp->parse($xmldata);
print "Output Test #1\n";
dump( $xmldoc );
#
# Retrieve the multiple same-name elements of the XML document at path "level1/level2/level3"
my $elements = getElements($xmldoc, 'level1/level2/level3');
print "Output Test #2\n";
dump( $elements );
#
# Retrieve the elements of the XML document at path 'level1/level2/level3[attr3="val3"]'
# ($elements is already declared above, so it is reused without "my")
$elements = getElements($xmldoc, 'level1/level2/level3[attr3="val3"]');
print "Output Test #3\n";
dump( $elements );

Output:

Output Test #1
{
  level1 => {
        level2 => {
              level3 => [
                    {
                      "-attr1" => "val1",
                      "-attr2" => "val2",
                      attr3    => "val3",
                      attr4    => undef,
                      attrX    => ["one", "two", "three"],
                    },
                    { "-attr1" => "valOne" },
                  ],
            },
      },
}
Output Test #2
[
  { attr3 => "val3", attr4 => undef, attrX => ["one", "two", "three"] },
  undef,
]
Output Test #3
[
  { attr3 => "val3", attr4 => undef, attrX => ["one", "two", "three"] },
]

AUTHOR

Russell E Glaue, http://russ.glaue.org

SEE ALSO

XML::TreePP

COPYRIGHT AND LICENSE

Copyright (c) 2008 Center for the Application of Information Technologies. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.