NAME
XML::XPathScript::Stylesheet - XPathScript's Stylesheet Writer Guide
STYLESHEET SYNTAX
An XPathScript stylesheet is written in an ASP-like format; everything that is not enclosed within special delimiters is printed verbatim.
Delimiters
- <% %>
-
Evaluates the code enclosed without printing anything.
Example:
<% $template->set( 'foo' => { pre => 'bar' } ); %>
- <%= %>
-
Evaluates the code enclosed and prints out its result.
Example:
Author: <%= findvalue( '/doc/author/@name' ) %>
- <%# %>
-
Comments out the code enclosed. The code will not be executed, nor will it appear in the transformed document.
- <%~ %>
-
A shorthand for <%= apply_templates( ) %>
Example:
Author: <%~ /doc/author %>
- <%- -%>, <%-= -%>, <%-~ -%>, <%-# -%>
-
If a dash is added to a delimiter, all whitespace (including carriage returns) preceding or following the delimiter is removed from the transformed document. This is useful to keep a stylesheet readable without generating a transformed document with many whitespace gaps. The dash can be added independently to the left and right delimiters.
Example:
<h1> <%-~ /doc/title -%> </h1>
- <!--#include file="/path/to/file" -->
-
Insert the content of the file into the stylesheet. The path is relative to the stylesheet, not the processed document.
PRE-DEFINED VARIABLES
This section describes pre-defined variables accessible from within an XPathScript stylesheet.
- $template, $t, $XML::XPathScript::trans
-
All three variables point to the stylesheet's template. See section "TRANSFORMATION TEMPLATE".
- $XML::XPathScript::xp
-
The DOM of the XML document to which the stylesheet is applied.
- $XML::XPathScript::current
-
The XML::XPathScript object from which the stylesheet has been invoked. See the XML::XPathScript manpage for a list of utility methods that can be called from within the stylesheet.
TRANSFORMATION TEMPLATE
The transformation template defines the modifications that are automatically applied to document elements when 'apply_templates' is called.
See the XML::XPathScript::Template manpage for details on how to configure the template.
Special tags
In addition to regular tag names, three special tags can be used in the template: text() and comment(), which match the corresponding nodes in the document, and '*', a catch-all tag.
- text(), #text
-
Match text nodes.
Note that text nodes can be assigned a special action. See section "action" of this manpage.
Example:
<% $template->set( 'text()' => { pre => '\begin{comment}', post => '\end{comment}' } ); %>
- comment()
-
Match comment nodes.
- '*'
-
Will match any regular tag (that is, neither a comment nor text) that isn't explicitly matched.
Tag Attributes
The tags' attributes define how the associated nodes are transformed by the template.
- pre, intro, prechildren, prechild, postchild, postchildren, extro, post
-
Define the text to be printed around a node. All defined attributes are output in the following order:
    pre
    <tag>                 # if showtag == 1
    intro
    prechildren           # if <tag> has children
    prechild              # for each child
    [ child node ]
    postchild             # for each child
    postchildren          # if <tag> has children
    extro
    </tag>                # if showtag == 1
    post
If interpolation is enabled, XPath expressions delimited by curly braces can be embedded in any of these attributes.
$template->set( 'movie' => { pre => 'title: {./@title}, year: {./year}' } );
Interpolation is enabled via the XML::XPathScript object's method interpolation.
The expressions' delimiter can be modified via the XML::XPathScript object's method interpolation_regex.
- showtag
-
If set to true, the original tag is printed out.
- action
-
Dictates how the node and its children are processed. The allowed values are:
- DO_SELF_AND_KIDS
-
Process the current node and its children.
- DO_SELF_ONLY
-
Process the current node, but not its children.
- DO_NOT_PROCESS
-
Do not process either the current node or any of its children.
- DO_TEXT_AS_CHILD
-
Only meaningful for text nodes. When this value is given, the processor pretends that the text is a child of the node, which means that $t->{pre} and $t->{post} will frame the text instead of replacing it.
Example:
    $template->set( 'text()' => { pre => 'replacement text' } );
    # will transform <foo>blah</foo> into <foo>replacement text</foo>
    $template->set( 'text()' => { action => DO_TEXT_AS_CHILD, pre => 'text: ' } );
    # will transform <foo>blah</foo> into <foo>text: blah</foo>
- xpath expression
-
Process the current node and all its children that match the xpath expression. The XPath expression is anchored on the current node.
Example:
    # only do the children of 'foo' having their attribute 'process'
    # set to 'yes'
    $template->set( 'foo' => { action => './*[@process = "yes"]' } );
- testcode
-
A reference to a subroutine that will be executed upon visiting the tag. When invoked, the subroutine is passed two parameters: the current node's object and a tag object holding all the attributes of the visited tag. Modifications to the tag object only affect the transformation of the current node. To change the transformation of all subsequent tags of the same type, use the stylesheet $template instead.
Also, the return value of the subroutine overrides the value of the 'action' attribute.
Example:
    <%
        $template->set( '*' => { testcode => \&uppercase_tag } );

        sub uppercase_tag {
            my( $n, $tag ) = @_;
            my $name = uc $n->getName;
            $tag->set({
                pre  => "<$name>",
                post => "</$name>",
            });
            return DO_SELF_AND_KIDS;
        }
    %>
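The output order listed under "pre, intro, prechildren, ..." above can be illustrated with a small stand-alone sketch. The render_node() function below is hypothetical and written only for this illustration; it is not XML::XPathScript's implementation, and it assumes the child nodes have already been rendered to strings.

```perl
use strict;
use warnings;

# Hypothetical renderer, for illustration only. $tag is a hash of template
# attributes, @children holds the already rendered child nodes.
sub render_node {
    my ( $name, $tag, @children ) = @_;

    # Undefined attributes contribute nothing to the output.
    my $attr = sub { defined $tag->{ $_[0] } ? $tag->{ $_[0] } : '' };

    my $out = $attr->('pre');
    $out .= "<$name>" if $tag->{showtag};
    $out .= $attr->('intro');
    if (@children) {
        $out .= $attr->('prechildren');
        $out .= $attr->('prechild') . $_ . $attr->('postchild') for @children;
        $out .= $attr->('postchildren');
    }
    $out .= $attr->('extro');
    $out .= "</$name>" if $tag->{showtag};
    $out .= $attr->('post');
    return $out;
}

my %tag = (
    showtag      => 1,
    pre          => '[pre]',
    intro        => '[intro]',
    prechildren  => '[prechildren]',
    prechild     => '[prechild]',
    postchild    => '[postchild]',
    postchildren => '[postchildren]',
    extro        => '[extro]',
    post         => '[post]',
);

my $rendered = render_node( 'list', \%tag, 'A', 'B' );
print "$rendered\n";
```

Each attribute is given a bracketed marker of its own name, so the printed string shows directly where every attribute lands relative to the tag and its two children.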
STYLESHEET WRITING GUIDELINES
Here are a few things to watch out for when coding stylesheets.
XPath scalar return values considered harmful
XML::XPath calls such as findvalue() return objects of a class designed to map one of the types mandated by the XPath specification (see XML::XPath for details). This is often not what a Perl programmer expects (e.g. strings and numbers cannot be treated the same). XML::XPath has some work-arounds built in, using operator overloading: when those objects are used as strings (by concatenating them, using them in regular expressions, etc.), they become strings through a transparent call to one of their methods, such as ->value(). However, we do not support this, for a variety of reasons (from limitations in "overload" to stylesheet compatibility between XML::XPath and XML::LibXML to Unicode considerations), and that is why our "findvalue" and friends return a real Perl scalar, in violation of the XPath specification.
On the other hand, "findnodes" does return a list of objects in list context, and an XML::XPath::NodeSet or XML::LibXML::NodeList instance in scalar context, obeying the XPath specification in full. Therefore you most likely do not want to call findnodes() in scalar context, ever: replace
my $attrnode = findnodes('@url',$xrefnode); # WRONG!
with
my ($attrnode) = findnodes('@url',$xrefnode);
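The difference between the two calls comes down to Perl's calling context. The sketch below reproduces it with a hypothetical stand-in, fake_findnodes(), and a made-up FakeNodeSet class, so that it runs without any XML module installed; the real findnodes() returns node objects and an XML::XPath::NodeSet or XML::LibXML::NodeList instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for findnodes(): a list of nodes in list context,
# a single container object in scalar context. Nodes are plain strings here.
sub fake_findnodes {
    my @nodes = ('url-attribute-node');
    return wantarray ? @nodes : bless( [@nodes], 'FakeNodeSet' );
}

my $wrong   = fake_findnodes();    # scalar context: the container object
my ($right) = fake_findnodes();    # list context: the first matching node

print ref($wrong), "\n";    # prints "FakeNodeSet"
print "$right\n";           # prints "url-attribute-node"
```

The parentheses around the variable on the left-hand side are what force list context.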
Do not use DOM method calls, for they make stylesheets non-portable
The findvalue() and similar functions described in XML::XPathScript::Processor are not the only way of extracting bits from the XML document. Objects passed as the first argument to the testcode tag attribute and returned by findnodes() in array context are of one of the XML::XPath::Node::* or XML::LibXML::* classes, and they feature some data extraction methods by themselves, conforming to the DOM specification.
However, the names of those methods are not standardized even among DOM parsers (the accessor to the childNodes property, for example, is named childNodes() in XML::LibXML and getChildNodes() in XML::XPath!). In order to write a stylesheet that is portable between XML::LibXML and XML::XPath used as back-ends to XML::XPathScript, one should refrain from calling them. The exact same data is available through appropriate XPath formulae, albeit more slowly, and there are also type-checking accessors such as is_element_node() in XML::XPathScript::Processor.
THE UNICODE MESS
Unicode is a balucitherian character numbering standard that strives to be a superset of all character sets currently in use by humans and computers. Going Unicode is therefore the way of the future, as it will guarantee compatibility of your applications with every character set on planet Earth: for this reason, all XML-compliant APIs (XML::XPathScript being no exception) should return Unicode strings in all their calls, regardless of the charset used to encode the XML document to begin with.
The gotcha is, the brave Unicode world sells itself in much the same way as XML when it promises that you'll still be able to read your data back in 30 years: that will probably turn out to be true, but until then, you can't :-)
Therefore, you as a stylesheet author will more likely than not need to do some wrestling with Unicode in Perl, XML::XPathScript or not. Here is a primer on how.
Unicode, UTF-8 and Perl
Unicode is not a text file format: UTF-8 is. Perl, when doing Unicode, prefers to use UTF-8 internally.
Unicode is a character numbering standard: that is, an abstract registry that associates unique integer numbers to a cast of thousands of characters. For example the "smiling face" is character number 0x263a, and the thin space is 0x2009 (there is a URL to a Unicode character table in "SEE ALSO"). Of course, this means that the 8-bits- (or even, Heaven forbid, 7-bits-?)-per-character idea goes out the window this instant. Coding every character on 16 bits in memory is an option (called UCS-2), but not as simple an idea as it sounds: one would have to rewrite nearly every piece of C code for starters, and even then the Chinese aren't quite happy with "only" 65536 character code points.
Introducing UTF-8, which is a way of encoding Unicode character numbers (of any size) in an ASCII- and C-friendly way: all 128 ASCII characters (such as "A" or "/" or ".", but not the ISO-8859-1 8-bit extensions) have the same encoding in both ASCII and UTF-8, including the null character (which is good for strcpy() and friends). Of course, this means that the other characters are rendered using several bytes: for example, "é" is encoded as the two bytes 0xC3 0xA9, which show up as "Ã©" when misread as ISO-8859-1. The result is therefore vaguely intelligible for a Western reader.
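To make the byte-level picture concrete, here is a minimal sketch using Perl's core Encode module. It is independent of XML::XPathScript, and the variable names are chosen for this illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{e9}";                    # "e acute", character number 0xE9
my $bytes = encode( 'UTF-8', $char );    # its UTF-8 encoded form, as bytes

# A non-ASCII character needs several bytes.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
print "$hex\n";                          # prints "c3 a9"

# An ASCII character keeps its single-byte encoding.
my $ascii_hex = join ' ',
    map { sprintf '%02x', ord($_) } split //, encode( 'UTF-8', 'A' );
print "$ascii_hex\n";                    # prints "41"

# Misreading the two bytes as ISO-8859-1 yields two separate characters.
my $misread = decode( 'ISO-8859-1', $bytes );
my $points  = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
print "$points\n";                       # prints "U+00C3 U+00A9"
```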
Output to UTF-8 with XPathScript
The programmer- and C-friendly characteristics of UTF-8 have made it the choice for dealing with Unicode in Perl. The interpreter maintains a "UTF8-tainted" bit on every string scalar it handles (much like what perlsec does for untrusted data). Every function in XML::XPathScript returns a string with this bit set to true: therefore, producing UTF-8 output is straightforward and one does not have to take any special precautions in XPathScript.
Output to a non-UTF-8 character set with XPathScript
When "binmode" is invoked from the stylesheet body, it signals that the stylesheet output should not be UTF-8, but instead some user-chosen character encoding that XML::XPathScript cannot and will not know or care about. Calling XML::XPathScript->current()->binmode() has the following consequences:
- The presence of the "UTF-8 taint" in the stylesheet output is now a fatal error. That is, whenever the result of a template evaluation is marked internally in Perl with the "this string is UTF-8" flag (as opposed to being treated by Perl as binary data without character meaning, see "perlunicode"), "translate_node" in XML::XPathScript::Processor will croak.
- The stylesheet therefore needs to build a "Unicode firewall". That is, testcode blocks have to take input in UTF-8 (as per the XML standard, UTF-8 indeed is what will be returned by "findvalue" in XML::XPathScript::Processor and such) and provide output in binary (in whatever character set is intended for the output), lest translate_node() croak as explained above. The Unicode::String module comes in handy to the stylesheet writer to cast from UTF-8 to an 8-bit-per-character charset such as ISO 8859-1, while laundering Perl's internal UTF-8-string bit at the same time.
- The appropriate voodoo is performed on the output filehandle(s) so that a spurious, final charset conversion will not happen at print() time under any locales, versions of Perl, or phases of moon.
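As a rough sketch of the conversion step inside such a firewall, the snippet below uses the core Encode module in place of the Unicode::String module named above; that substitution is an assumption made to keep the example free of non-core dependencies. The to_latin1() helper is hypothetical, and the input string stands in for a value returned by findvalue().

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: convert a Perl character string into ISO-8859-1
# bytes. The result of encode() does not carry Perl's internal UTF-8 flag.
sub to_latin1 {
    my ($text) = @_;
    # With Encode's default settings, characters that have no ISO-8859-1
    # equivalent are replaced by '?'.
    return encode( 'ISO-8859-1', $text );
}

# Stand-in for a findvalue() result: "cafe" with an acute accent,
# a space, and the smiling face (0x263a).
my $input  = "caf\x{e9} \x{263a}";
my $output = to_latin1($input);

print utf8::is_utf8($input)  ? "input: flagged\n"  : "input: not flagged\n";
print utf8::is_utf8($output) ? "output: flagged\n" : "output: not flagged\n";
print length($output), " bytes\n";
```

In a real stylesheet, a call like this would be applied to every string a testcode block hands back, so that nothing flagged as UTF-8 reaches translate_node().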
AUTHORS
Yanick Champoux <yanick@cpan.org> and Dominique Quatravaux <dom@idealx.com>