NAME

XML::Twig - A Perl module for processing huge XML documents in tree mode.

SYNOPSIS

single-tree mode    
    my $t= new XML::Twig();
    $t->parse( '<doc><para>para1</para></doc>');
    $t->print;

chunk mode 
    my $t= new XML::Twig( TwigHandlers => { section => \&flush});
    $t->parsefile( 'doc.xml');
    $t->flush;
    sub flush { $_[0]->flush; }

    my $t= new XML::Twig( TwigHandlers => { 'section/title' => \&print_elt_text});
    $t->parsefile( 'doc.xml');
    sub print_elt_text 
      { my( $t, $elt)= @_;
        print $elt->text; 
      }

roots mode
    my $t= new XML::Twig( 
             TwigRoots    => { 'section/title' => 1 },
             TwigHandlers => { 'section/title' => \&print_elt_text}
                        );
    $t->parsefile( 'doc.xml');
    sub print_elt_text 
      { my( $t, $elt)= @_;
        print $elt->text; 
      }

    my $t= new XML::Twig( 
             TwigRoots    => { 'section/title' => \&print_elt_text}
                        );
    $t->parsefile( 'doc.xml');
    sub print_elt_text 
      { my( $t, $elt)= @_;
        print $elt->text; 
      }

DESCRIPTION

This module provides a way to process XML documents. It is built on top of XML::Parser.

The module offers a tree interface to the document, while allowing you to output the parts of it that have been completely processed.

It allows minimal resource (CPU and memory) usage by building the tree only for the parts of the document that need actual processing, through the use of the TwigRoots and TwigPrintOutsideRoots options. The finish and finish_print methods also help to improve performance.

XML::Twig tries to make simple things easy, so it does its best to take care of a lot of the (usually) annoying (but sometimes necessary) features that come with XML and XML::Parser.

Whitespaces

Whitespace that looks non-significant is discarded. This behaviour can be controlled using the KeepSpaces, KeepSpacesIn and DiscardSpacesIn options.

Encoding

You can specify that you want the output in the same encoding as the input (provided you have valid XML, which means you have to specify the encoding either in the document or when you create the Twig object) using the KeepEncoding option.

METHODS

Twig

A twig is a subclass of XML::Parser, so all XML::Parser methods can be called on a twig object, including parse and parsefile. setHandlers, on the other hand, cannot be used; see "BUGS".

new

This is a class method, the constructor for XML::Twig. Options are passed as keyword-value pairs. Recognized options are the same as for XML::Parser, plus some XML::Twig specifics:

TwigHandlers

This argument replaces the corresponding XML::Parser argument. It consists of a hash { gi_or_path => \&handler}

A gi (generic identifier) is just a tag name.

A path is a poor approximation of an XPath expression. It looks like '/doc/section/chapter/title' or 'chapter/title'.

A special gi _all_ is used to call a function for each element. The special gi _default_ is used to call a handler for each element that does NOT have a specific handler.

The order of precedence to trigger a handler is: full path, longer path, shorter path, gi, _default_

When an element is CLOSED the corresponding handler is called, with 2 arguments, the twig and the "Element". The twig includes the document tree that has been built so far, the element is the complete sub-tree for the element. Text is stored in elements where gi is #PCDATA (due to mixed content, text and sub-element in an element there is no way to store the text as just an attribute of the enclosing element).

Warning: if you have used purge or flush on the twig the element might not be complete, some of its children might have been entirely flushed or purged, and the start tag might even have been printed (by flush) already, so changing its gi might not give the expected result.
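
As an illustration of the calling convention described above, here is a minimal sketch of a handler set on a path. The sample document, the @titles array and the store_title sub are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document: one title at the top level, one per section
my $xml= '<doc><title>Doc</title>'
       . '<section><title>First</title><para>p1</para></section>'
       . '<section><title>Second</title><para>p2</para></section>'
       . '</doc>';

my @titles;    # filled by the handler

my $t= XML::Twig->new(
         TwigHandlers => { 'section/title' => \&store_title }
                     );
$t->parse( $xml);

# called each time a title inside a section is CLOSED
sub store_title
  { my( $twig, $elt)= @_;    # the twig and the complete element
    push @titles, $elt->text;
  }

print join( ', ', @titles), "\n";
```

Only the titles found directly inside a section are collected here; the top-level title does not match the 'section/title' path.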

TwigRoots

This argument lets you build the tree only for those elements you are interested in.

Example: my $t= new XML::Twig( TwigRoots => { title => 1, subtitle => 1});
         $t->parsefile( 'doc.xml');

returns a twig containing a document including only title and subtitle elements, as children of the root element.

You can also use a path to trigger the building of the twig:

         my $t= new XML::Twig( TwigRoots => { 'section/title' => 1});
         $t->parsefile( 'doc.xml');

WARNING: paths are checked against the document. Even if the TwigRoots option is used they will be checked against the full document tree, not the virtual tree created by XML::Twig.

WARNING: TwigRoots elements should NOT be nested, that would hopelessly confuse XML::Twig ;--(

Note: you can set handlers using TwigRoots.

Example: my $t= new XML::Twig( TwigRoots => { title    => sub { $_[1]->print; },
                                              subtitle => \&process_subtitle });
         $t->parsefile( 'doc.xml');
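
To make the effect of TwigRoots more concrete, the following sketch parses a small made-up document from a string and lists what ends up in the tree. The document and the variable names are assumptions for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document
my $xml= '<doc><section><title>First</title><para>p1</para></section>'
       . '<section><title>Second</title><para>p2</para></section></doc>';

# build the tree only for the title elements
my $t= XML::Twig->new( TwigRoots => { title => 1 });
$t->parse( $xml);

# the title elements are attached directly to the root,
# the section and para elements are never built
my @built= map { $_->gi }   $t->root->children;
my @texts= map { $_->text } $t->root->children( 'title');

print "built: @built\n";
print "texts: @texts\n";
```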

TwigPrintOutsideRoots

To be used in conjunction with the TwigRoots argument. When set to a true value this will print the document outside of the TwigRoots elements.

Example: my $t= new XML::Twig( TwigRoots => { title => \&number_title },
                               TwigPrintOutsideRoots => 1,
                              );
          $t->parsefile( 'doc.xml');
          { my $nb;                          # title counter, private to number_title
            sub number_title
              { my( $twig, $title)= @_;
                $nb++;
                $title->prefix( "$nb ");
                $title->print;
              }
          }

This example prints the document outside of the title element, calls number_title for each title element, prints it, and then resumes printing the document. The twig is built only for the title elements.

LoadDTD

If this argument is set to a true value, parse or parsefile on the twig will load the DTD information. This information can then be accessed through the twig, in a DTDHandler for example. This will load even an external DTD.

See "DTD Handling" for more information.

DTDHandler

Sets a handler that will be called once the doctype (and the DTD) have been loaded, with 2 arguments, the twig and the DTD.

StartTagHandlers

A hash { gi => \&handler}. Sets element handlers that are called when the element is opened (at the end of the XML::Parser Start handler). The handlers are called with 2 params: the twig and the element. The element is empty at that point, but its attributes are already created.

WARNING: StartTag handlers are NOT called outside of TwigRoots if this argument is used.

The special gi's _all_ and _default_ are used to call a function respectively for each tag, and for each tag that does not have a StartTag handler.

The main use for those handlers is probably to create temporary attributes that will be used when processing sub-elements with TwigHandlers.

You should also use them to change tags if you use flush. If you change the tag in a regular TwigHandler then the start tag might already have been flushed.

By the way there is no EndTagHandlers option as this would be exactly the same as the TwigHandlers option.

CharHandler

A reference to a subroutine that will be called every time PCDATA is found.

KeepEncoding

This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting KeepEncoding will use the Expat original_string method for character data, thus keeping the original encoding, as well as the original entities in the strings.

WARNING: attribute values will NOT keep their encoding (they will be converted to UTF-8).

WARNING: this option is NOT used when parsing with the non-blocking parser (parse_start, parse_more, parse_done methods).

Id

This optional argument gives the name of an attribute that can be used as an ID in the document. Elements whose ID is known can be accessed through the elt_id method. Id defaults to 'id'. See "BUGS".

DiscardSpaces

If this optional argument is set to a true value then spaces are discarded when they look non-significant: strings containing only spaces are discarded. This argument is set to true by default.

KeepSpaces

If this optional argument is set to a true value then all spaces in the document are kept, and stored as PCDATA. KeepSpaces and DiscardSpaces cannot both be set.

DiscardSpacesIn

This argument sets KeepSpaces to true but will cause the twig builder to discard spaces in the elements listed. The syntax for using this argument is: new XML::Twig( DiscardSpacesIn => [ 'elt1', 'elt2']);

KeepSpacesIn

This argument sets DiscardSpaces to true but will cause the twig builder to keep spaces in the elements listed. The syntax for using this argument is: new XML::Twig( KeepSpacesIn => [ 'elt1', 'elt2']);
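
The difference between the space options can be seen on a small indented document. This is only a sketch: the sample string and the variable names are made up, and the lists in the comments are what the option descriptions above lead one to expect.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical document, indented with newlines and spaces
my $xml= "<doc>\n  <para>text</para>\n</doc>";

# default behaviour: space-only strings are discarded
my $discard= XML::Twig->new();
$discard->parse( $xml);
my @discard_gis= map { $_->gi } $discard->root->children;

# KeepSpaces: the indentation is stored as PCDATA children
my $keep= XML::Twig->new( KeepSpaces => 1);
$keep->parse( $xml);
my @keep_gis= map { $_->gi } $keep->root->children;

print "default:    @discard_gis\n";   # only the para element
print "KeepSpaces: @keep_gis\n";      # para surrounded by #PCDATA elements
```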

setTwigHandlers ($handlers)

Set the Twig handlers. $handlers is a reference to a hash similar to the one in the TwigHandlers option of new. All previous handlers are unset. The method returns the reference to the previous handlers.

setTwigHandler ($gi, $handler)

Set a single Twig handler for the $gi element. $handler is a reference to a subroutine. If the handler was previously set then the reference to the previous handler is returned.

setStartTagHandlers ($handlers)

Set the StartTag handlers. $handlers is a reference to a hash similar to the one in the StartTagHandlers option of new. All previous handlers are unset. The method returns the reference to the previous handlers.

setStartTagHandler ($gi, $handler)

Set a single StartTag handler for the $gi element. $handler is a reference to a subroutine. If the handler was previously set then the reference to the previous handler is returned.

dtd

Returns the dtd (an XML::Twig::DTD object) of a twig

root

Returns the root element of a twig

elt_id ($id)

Returns the element whose id attribute is $id
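
A short sketch of elt_id, relying on the default Id attribute name ('id'). The document and the id values are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();     # no Id option: the 'id' attribute is used
$t->parse( '<doc><section id="s1"><title>One</title></section>'
         . '<section id="s2"><title>Two</title></section></doc>');

# direct access to an element through its id
my $section= $t->elt_id( 's2');
my $title=   $section->first_child( 'title')->text;

print "title of s2: $title\n";
```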

entity_list

Returns the entity list of a twig

change_gi ($old_gi, $new_gi)

Performs a (very fast) global change. All elements old_gi are now new_gi. See "BUGS".

flush ($optional_filehandle, %options)

Flushes a twig up to (and including) the current element, then deletes all unnecessary elements from the tree that's kept in memory. flush keeps track of which elements need to be open/closed, so if you flush from handlers you don't have to worry about anything. Just keep flushing the twig every time you're done with a sub-tree and it will come out well-formed. After the whole parsing don't forget to flush one more time to print the end of the document. The doctype and entity declarations are also printed.

flush takes an optional filehandle as an argument.

options: use the Update_DTD option if you have updated the (internal) DTD and/or the entity list and you want the updated DTD to be output.

Example: $t->flush( Update_DTD => 1);
         $t->flush( \*FILE, Update_DTD => 1);
         $t->flush( \*FILE);

flush_up_to ($elt, $optional_filehandle, %options)

Flushes up to the $elt element. This allows you to keep part of the tree in memory when you flush.

options: see flush.

purge

Does the same as a flush except it does not print the twig. It just deletes all elements that have been completely parsed so far.
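
A typical use of purge is to process records one at a time when nothing needs to be output. The sketch below assumes a made-up document made of record elements; the variable names are illustrative only.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical document with 5 small records
my $xml= '<doc>'
       . join( '', map { "<record><name>n$_</name></record>" } 1..5)
       . '</doc>';

my $count= 0;
my @names;

my $t= XML::Twig->new(
         TwigHandlers => { record => sub { my( $twig, $record)= @_;
                                           $count++;
                                           push @names, $record->first_child( 'name')->text;
                                           $twig->purge;   # drop what has been parsed so far
                                         }
                         }
                     );
$t->parse( $xml);

print "$count records: @names\n";
```

Everything the handler needs is copied out of the element before the call to purge, as the element should not be relied upon afterwards.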

purge_up_to ($elt)

Purges up to the $elt element. This allows you to keep part of the tree in memory when you purge.

print ($optional_filehandle, %options)

Prints the whole document associated with the twig. To be used only AFTER the parse.

options: see flush.

sprint

Returns the text of the whole document associated with the twig. To be used only AFTER the parse.

options: see flush.

print_prolog ($optional_filehandle, %options)

Prints the prolog (XML declaration + DTD + entity declarations) of a document.

options: see flush.

prolog ($optional_filehandle, %options)

Returns the prolog (XML declaration + DTD + entity declarations) of a document.

options: see flush.

finish

Calls Expat's finish method. Unsets all handlers (including internal ones that set context), but expat continues parsing to the end of the document or until it finds an error. It should finish up a lot faster than with the handlers set.

finish_print

Stop twig processing, flush the twig and proceed to finish printing the document as fast as possible. Use this method when modifying a document and the modification is done.

depth

Calls Expat's depth method, which returns the depth in the tree during the parsing. This is useful when using the TwigRoots option to still get info on the actual document.

in_element ($gi)

Calls Expat's in_element method. Returns true if $gi is equal to the name of the innermost currently opened element. If namespace processing is being used and you want to check against a name that may be in a namespace, then use the generate_ns_name method to create the $gi argument. Useful when using the TwigRoots option.

within_element ($gi)

Calls Expat's within_element method. Returns the number of times the given name appears in the context list. If namespace processing is being used and you want to check against a name that may be in a namespace, then use the generate_ns_name method to create the $gi argument. Useful when using the TwigRoots option.

context

Returns a list of element names that represent open elements, with the last one being the innermost. Inside start and end tag handlers, this will be the tag of the parent element.

path ($gi)

Returns the element context in a form similar to XPath's short form: '/root/gi1/../gi'.

parse(SOURCE [, OPT => OPT_VALUE [...]])

This method is inherited from XML::Parser. The SOURCE parameter should either be a string containing the whole XML document, or it should be an open IO::Handle. Constructor options to XML::Parser::Expat given as keyword-value pairs may follow the SOURCE parameter. These override, for this call, any options or attributes passed through from the XML::Parser instance.

A die call is thrown if a parse error occurs. Otherwise it will return 1 or whatever is returned from the Final handler, if one is installed. In other words, what parse may return depends on the style.

parsestring

This is just an alias for parse for backwards compatibility.

parsefile(FILE [, OPT => OPT_VALUE [...]])

This method is inherited from XML::Parser. Open FILE for reading, then call parse with the open handle. The file is closed no matter how parse returns. Returns what parse returns.

Elt

new ($gi, @content)

The gi is optional (but then you cannot give any content). The content can be a single string or a list of strings and elements.

Examples: my $elt1= new XML::Twig::Elt();
          my $elt2= new XML::Twig::Elt( 'para');  
          my $elt3= new XML::Twig::Elt( 'para', 'this is a para');  
          my $elt4= new XML::Twig::Elt( 'para', $elt3, 'another para'); 

The strings are not parsed, the element is not attached to any twig.

parse ($string, %args)

Creates an element from an XML string. The string is actually parsed as a new twig, then the root of that twig is returned. The arguments in %args are passed to the twig. As always if the parse fails the parser will die, so use an eval if you want to trap syntax errors.
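
The eval advice above can be sketched as follows. The two strings are made up; the second one is deliberately not well-formed.

```perl
use strict;
use warnings;
use XML::Twig;

# a well-formed string gives an element, not attached to any twig
my $elt= eval { XML::Twig::Elt->parse( '<para>some <b>bold</b> text</para>') };
my $gi=   $elt->gi;
my $text= $elt->text;

# a string that is not well-formed makes the parser die: trap it
my $bad=   eval { XML::Twig::Elt->parse( '<para>unclosed') };
my $error= $@;

print "parsed a $gi element: $text\n";
print "second parse failed\n" unless defined $bad;
```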

set_gi ($gi)

Sets the gi of an element

gi

Returns the gi of the element

closed

Returns true if the element has been closed. Might be useful if you are somewhere in the tree, during the parse, and have no idea whether a parent element is completely loaded or not.

is_pcdata

Returns 1 if the element is a #PCDATA element, returns 0 otherwise.

pcdata

Returns the text of a PCDATA element or undef if the element is not PCDATA.

set_pcdata ($text)

Sets the text of a PCDATA element.

append_pcdata ($text)

Add the text at the end of a #PCDATA element.

is_cdata

Returns 1 if the element is a #CDATA element, returns 0 otherwise.

cdata

Returns the text of a CDATA element or undef if the element is not CDATA.

set_cdata ($text)

Sets the text of a CDATA element.

append_cdata ($text)

Add the text at the end of a #CDATA element.

root

Returns the root of the twig in which the element is contained.

twig

Returns the twig containing the element.

parent ($optional_gi)

Returns the parent of the element, or the first ancestor whose gi is $gi.

first_child ($optional_gi)

Returns the first child of the element, or the first child whose gi is $gi. (ie the first of the element children whose gi matches).

last_child ($optional_gi)

Returns the last child of the element, or the last child whose gi is $gi. (ie the last of the element children whose gi matches).

prev_sibling ($optional_gi)

Returns the previous sibling of the element, or the first one whose gi is $gi.

next_sibling ($optional_gi)

Returns the next sibling of the element, or the first one whose gi is $gi.
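
The navigation methods above can be combined as in this sketch. The document is made up for the example, and the comments describe which element each call is expected to return.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
$t->parse( '<doc><title>T</title><para>p1</para><note>n</note><para>p2</para></doc>');
my $root= $t->root;

my $first=      $root->first_child;                  # the title
my $first_para= $root->first_child( 'para');         # p1
my $last_para=  $root->last_child( 'para');          # p2
my $next_para=  $first_para->next_sibling( 'para');  # skips the note, gives p2
my $prev=       $last_para->prev_sibling;            # the note
my $doc=        $first_para->parent( 'doc');         # the root

print join( ' ', $first->gi, $first_para->text, $last_para->text,
                 $next_para->text, $prev->gi, $doc->gi), "\n";
```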

atts

Returns a hash ref containing the element attributes

set_atts ({att1=>$att1_val, att2=> $att2_val... })

Sets the element attributes with the hash ref supplied as the argument

del_atts

Deletes all the element attributes.

set_att ($att, $att_value)

Sets the attribute of the element to the given value

att ($att)

Returns the attribute value

del_att ($att)

Delete the attribute for the element
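
A small sketch of the attribute methods, on a made-up element with two attributes:

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
$t->parse( '<doc><para class="intro" lang="en">text</para></doc>');
my $para= $t->root->first_child( 'para');

my $old_class= $para->att( 'class');        # value found in the document

$para->set_att( class => 'summary');        # change one attribute
$para->del_att( 'lang');                    # remove another one
my @names= sort keys %{ $para->atts };      # attributes that are left

# replace the whole set of attributes at once
$para->set_atts( { class => 'note', type => 'aside' });
my @new_names= sort keys %{ $para->atts };

print "was: $old_class, left: @names, now: @new_names\n";
```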

set_id ($id)

Sets the id attribute of the element to the value. See the Id option of new to change the id attribute name.

id

Gets the id attribute value

del_id ($id)

Deletes the id attribute of the element and removes it from the id list for the document.

children ($optional_gi)

Returns the list of children (optionally whose gi is $gi) of the element

ancestors ($optional_gi)

Returns the list of ancestors (optionally whose gi is $gi) of the element.

NOTE: the element itself is not part of the list, in order to include it you will have to write:

    my @array= ($elt, $elt->ancestors);

next_elt ($optional_elt, $optional_gi)

Returns the next elt (optionally whose gi is $gi) of the element. This is defined as the next element which opens after the current element opens. Which usually means the first child of the element. Counter-intuitive as it might look this allows you to loop through the whole document by starting from the root.

The $optional_elt is the root of a subtree. When the next_elt is out of the subtree then the method returns undef. You can then walk a sub tree with:

    my $elt= $subtree_root;
    while( $elt= $elt->next_elt( $subtree_root))
      { # insert processing code here
      }

prev_elt ($optional_gi)

Returns the previous elt (optionally whose gi is $gi) of the element. This is the first element which opens before the current one. It is usually either the last descendant of the previous sibling or simply the parent

level ($optional_gi)

Returns the depth of the element in the twig (root is 0). If the optional gi is given then only ancestors of the given type are counted.

WARNING: in a tree created using the TwigRoots option this will not return the level in the document tree, level 0 will be the document root, level 1 will be the TwigRoots elements. During the parsing (in a TwigHandler) you can use the depth method on the twig object to get the real parsing depth.

in ($potential_parent)

Returns true if the element is in the potential_parent

in_context ($gi, $optional_level)

Returns true if the element is included in an element whose gi is $gi, optionally within $optional_level levels. The returned value is the including element.

cut

Cuts the element from the tree.

paste ($optional_position, $ref)

Pastes a (previously cut) element. The optional position element can be:

first_child (default)

The element is pasted as the first child of the element object this method is called on.

last_child

The element is pasted as the last child of the element object this method is called on.

before

The element is pasted before the element object, as its previous sibling.

after

The element is pasted after the element object, as its next sibling.
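
The positions listed above can be tried out with a sketch like this one. The document and the element names are invented; new elements are created with XML::Twig::Elt's new, so they are not attached to any tree before being pasted.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
$t->parse( '<doc><para>first</para><para>second</para><note>a note</note></doc>');
my $root= $t->root;

# move the note to the start of the document: cut it, then paste it
my $note= $root->first_child( 'note');
$note->cut;
$note->paste( 'first_child', $root);

# new, unattached elements can be pasted directly
my $para= XML::Twig::Elt->new( 'para', 'third');
$para->paste( 'last_child', $root);

my $title= XML::Twig::Elt->new( 'title', 'A title');
$title->paste( 'before', $note);

my @order= map { $_->gi } $root->children;
print "@order\n";
```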

move ($optional_position, $ref)

Move an element in the tree. This is just a cut then a paste. The syntax is the same as paste.

replace ($ref)

Replaces an element in the tree. Sometimes it is just not possible to cut an element then paste another in its place, so replace comes in handy.

prefix ($text)

Add a prefix to an element. If the element is a PCDATA element the text is added to the pcdata, if the element's first_child is a PCDATA then the text is added to its pcdata, otherwise a new PCDATA element is created and pasted as the first child of the element.

suffix ($text)

Add a suffix to an element. If the element is a PCDATA element the text is added to the pcdata, if the element's last_child is a PCDATA then the text is added to its pcdata, otherwise a new PCDATA element is created and pasted as the last child of the element.
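
The two cases described for prefix and suffix (existing PCDATA child, or new PCDATA element) can be sketched on a made-up document:

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
$t->parse( '<doc><title>Introduction</title><para><b>bold</b> start</para></doc>');

# the title holds a single PCDATA child: the text is added to it
my $title= $t->root->first_child( 'title');
$title->prefix( '1. ');
$title->suffix( ' (draft)');

# the first child of the para is a b element:
# a new PCDATA element is created in front of it
my $para= $t->root->first_child( 'para');
$para->prefix( 'Note: ');

my $title_text= $title->text;
my $para_text=  $para->text;
print "$title_text\n$para_text\n";
```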

erase

Erases the element: the element is deleted and all of its children are pasted in its place.

delete

Cuts the element and frees the memory.

DESTROY

Frees the element from memory.

start_tag

Returns the string for the start tag for the element, including the /> at the end of an empty element tag

end_tag

Returns the string for the end tag of an element. For an empty element, this returns the empty string ('').

print ($optional_filehandle)

Prints an entire element, including the tags, optionally to a FILEHANDLE.

sprint ($elt, $optional_no_enclosing_tag)

Returns the string for an entire element, including the tags. To be used with caution! If the optional second argument is true then only the string inside the element is returned (the start and end tag for $elt are not).

text

Returns a string consisting of all the PCDATA and CDATA in an element, without any tags.

set_text ($string)

Sets the text for the element: if the element is a PCDATA, just set its text, otherwise cut all the children of the element and create a single PCDATA child for it, which holds the text.

set_content (@list_of_elt_and_strings)

Sets the content for the element, from a list of strings and elements. Cuts all the element children, then pastes the list elements as the children. This method will create a PCDATA element for any strings in the list.

insert (@gi)

For each gi in the list inserts an element $gi as the only child of the element. All children of the element are set as children of the new element. The upper level element is returned.

$p->insert( 'table', 'tr', 'td') puts the content of $p in a table with a single tr and a single td, and returns the table element.

wrap_in (@gi)

Wraps elements $gi as the successive ancestors of the element, returns the new element. $elt->wrap_in( 'td', 'tr', 'table') wraps the element as a single cell in a table for example.
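
The difference between insert (new elements go inside the element) and wrap_in (new elements go around it) is easier to see side by side. This sketch uses two made-up documents and checks the resulting structure by walking the tree rather than by relying on the return values.

```perl
use strict;
use warnings;
use XML::Twig;

# wrap_in: the p ends up INSIDE td, tr and table
my $t1= XML::Twig->new();
$t1->parse( '<doc><p>cell content</p></doc>');
my $p1= $t1->root->first_child( 'p');
$p1->wrap_in( 'td', 'tr', 'table');
my @around= map { $_->gi } $p1->ancestors;    # from the parent up to the root

# insert: table, tr and td end up INSIDE the p, around its content
my $t2= XML::Twig->new();
$t2->parse( '<doc><p>inner</p></doc>');
my $p2= $t2->root->first_child( 'p');
$p2->insert( 'table', 'tr', 'td');
my $td= $p2->first_child( 'table')->first_child( 'tr')->first_child( 'td');

print "wrap_in: @around\n";
print "insert:  ", $td->text, "\n";
```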

cmp ($elt)

Compare the order of the 2 elements in a twig.

$a is the <A>..</A> element, $b is the <B>...</B> element

    document                        $a->cmp( $b)
    <A> ... </A> ... <B>  ... </B>     -1
    <A> ... <B>  ... </B> ... </A>     -1
    <B> ... </B> ... <A>  ... </A>      1
    <B> ... <A>  ... </A> ... </B>      1
     $a == $b                           0
     $a and $b not in the same tree   undef

before ($elt)

Returns 1 if the element starts before $elt, 0 otherwise. If the 2 elements are not in the same twig then return undef.

    if( $a->cmp( $b) == -1) { return 1; } else { return 0; }

after ($elt)

Returns 1 if the element starts after $elt, 0 otherwise. If the 2 elements are not in the same twig then return undef.

    if( $a->cmp( $b) == 1) { return 1; } else { return 0; }

path

Returns the element context in a form similar to XPath's short form: '/root/gi1/../gi'
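
One possible use of path is to tell apart elements that share a gi. The sketch below counts title elements per path in a made-up document; the %seen hash is illustrative only.

```perl
use strict;
use warnings;
use XML::Twig;

my %seen;    # path => number of title elements found there

my $t= XML::Twig->new(
         TwigHandlers => { title => sub { my( $twig, $elt)= @_;
                                          $seen{ $elt->path }++;
                                        }
                         }
                     );
$t->parse( '<doc><title>Doc</title>'
         . '<section><title>S1</title></section>'
         . '<section><title>S2</title></section></doc>');

print "$_: $seen{$_}\n" foreach sort keys %seen;
```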

private methods
set_parent ($parent)
set_first_child ($first_child)
set_last_child ($last_child)
set_prev_sibling ($prev_sibling)
set_next_sibling ($next_sibling)
set_twig_current
del_twig_current
twig_current
flushed

This method should NOT be used, always flush the twig, not an element.

set_flushed
del_flushed
flush

Those methods should not be used, unless of course you find some creative and interesting, not to mention useful, ways to do it.

Entity_list

new

Creates an entity list.

add ($ent)

Adds an entity to an entity list.

delete ($ent or $gi)

Deletes an entity (defined by its name or by the Entity object) from the list.

print

Prints the entity list.

Entity

new ($name, $val, $sysid, $pubid, $ndata)

Same arguments as the Entity handler for XML::Parser.

print

Prints an entity declaration.

text

Returns the entity declaration text.

EXAMPLES

See the test files in t/test[1-n].t. Additional examples can be found at http://standards.ieee.org/resources/spasystem/twig/

To figure out what flush does, call the following script with an XML file and an element name as arguments:

use XML::Twig;

my ($file, $elt)= @ARGV;
my $t= new XML::Twig( TwigHandlers => 
    { $elt => sub {$_[0]->flush; print "\n[flushed here]\n";} });
$t->parsefile( $file, ErrorContext => 2);
$t->flush;
print "\n";

NOTES

DTD Handling

There are 3 possibilities here. They are:

No DTD

No doctype, no DTD information, no entity information, the world is simple...

Internal DTD

The XML document includes an internal DTD, and maybe entity declarations.

If you use the LoadDTD option when creating the twig the DTD information and the entity declarations can be accessed.

The DTD and the entity declarations will be flush'ed (or print'ed) either as is (if they have not been modified) or as reconstructed (poorly, comments are lost, order is not kept, due to its content this DTD should not be viewed by anyone) if they have been modified. You can also modify them directly by changing the $twig->{twig_doctype}->{internal} field (straight from XML::Parser, see the Doctype handler doc).

External DTD

The XML document includes a reference to an external DTD, and maybe entity declarations.

If you use the LoadDTD option when creating the twig the DTD information and the entity declarations can be accessed. The entity declarations will be flush'ed (or print'ed) either as is (if they have not been modified) or as reconstructed (badly, comments are lost, order is not kept).

You can change the doctype through the $twig->set_doctype method and print the dtd through the $twig->dtd_text or $twig->dtd_print methods.

If you need to modify the entity list this is probably the easiest way to do it.

Flush

If you set handlers and use flush, do not forget to flush the twig one last time AFTER the parsing, or you might be missing the end of the document.

Remember that element handlers are called when the element is CLOSED, so if you have handlers for nested elements the inner handlers will be called first. This makes it trickier than it would seem to number nested clauses, for example.

BUGS

ID list

The ID list is NOT updated when IDs are modified or elements cut or deleted.

change_gi

This method will not function properly if you do:

    $twig->change_gi( $old1, $new);
    $twig->change_gi( $old2, $new);
    $twig->change_gi( $new, $even_newer);

sanity check on XML::Parser method calls

XML::Twig should really prevent calls to some XML::Parser methods, especially the setHandlers method.

TODO

multiple twigs are not well supported

A number of twig features are just global at the moment. These include the ID list and the "gi pool" (if you use change_gi then you change the gi for ALL twigs).

The next version will try to support this while trying not to be too hard on performance (at least when a single twig is used!).

XML::Parser-like handlers

Sometimes it would be nice to be able to use both XML::Twig handlers and XML::Parser handlers, for example to perform generic tasks on all open tags, like adding an ID, or taking care of the autonumbering.

Next version...

BENCHMARKS

You can use the benchmark_twig file to do additional benchmarks. Please send me benchmark information for additional systems.

AUTHOR

Michel Rodriguez <m.v.rodriguez@ieee.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Bug reports and comments to m.v.rodriguez@ieee.org. The XML::Twig page is at http://standards.ieee.org/resources/spasystem/twig/

SEE ALSO

XML::Parser