NAME
XHTML::Util - (alpha software) powerful utilities for common but difficult to nail HTML munging.
VERSION
0.99_02
SYNOPSIS
use strict;
use warnings;
use XHTML::Util;
my $xu = XHTML::Util
->new(\"This is naked\n\ntext for making into paragraphs.");
print $xu->enpara, $/;
# <p>This is naked</p>
#
# <p>text for making into paragraphs.</p>
$xu = XHTML::Util
->new(\"<blockquote>Quotes should probably have paras.</blockquote>");
print $xu->enpara("blockquote");
# <blockquote>
# <p>Quotes should probably have paras.</p>
# </blockquote>
$xu = XHTML::Util
->new(\'<i><a href="#"><b>Something</b></a>.</i>');
print $xu->strip_tags('a');
# <i><b>Something</b>.</i>
DESCRIPTION
Most of the methods accept CSS selector expressions. E.g., to enpara only the contents of div tags with a class of "enpara", <div class="enpara"/>, you could do this:
print $xu->enpara("div.enpara");
To enpara the contents of all blockquotes and divs:
print $xu->enpara("div, blockquote");
Alterations to the XHTML in the object are persistent.
my $xu = XHTML::Util
->new(\'<script>alert("OH HAI")</script>');
$xu->strip_tags('script');
This removes the script tags, though not the script content, so the next time you call anything that returns the stringified object the changes will remain:
print $xu->as_string, $/;
# alert("OH HAI")
Well... really you'll get <![CDATA[alert("OH HAI")]]>.
METHODS
new
Creates a new XHTML::Util object. As in the synopsis, the content is passed as a reference to a string.
strip_tags
Why you might need this:
my $post_title = q{I <3 <a href="http://icanhascheezburger.com/">kittehs</a>};
my $blog_link = some_link_maker($post_title);
print $blog_link;
# <a href="/oh-noes">I <3 <a href="http://icanhascheezburger.com/">kittehs</a></a>
Nested anchors aren't legal, so there's no definition for what browsers should do with them. Some sort of tolerate them, some don't. It's never going to be a good user experience.
What you can do is something like this:
my $post_title = q{I <3 <a href="http://icanhascheezburger.com/">kittehs</a>};
my $safe_title = $xu->strip_tags($post_title, ["a"]);
# Menu link should only go to the single post page.
my $menu_view_title = some_link_maker($safe_title);
# No need to link back to the page you're viewing already.
my $single_view_title = $post_title;
remove
Takes a CSS selector string. Completely removes the matched nodes, including their content. This differs from "strip_tags" which retains the child nodes intact and only removes the tags proper.
# Remove <center/> tags and external images.
my $cleaned = $xu->remove("center, img[src^='http']");
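To make the difference between "remove" and "strip_tags" concrete, here is a minimal sketch written directly against XML::LibXML. It is not the module's implementation: strip_matching and remove_matching are hypothetical helper names, and XPath is used in place of CSS selectors to keep the example self-contained. It also assumes the match is never the root element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: drop the matched tags but keep their children in place.
sub strip_matching {
    my ( $xml, $xpath ) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    for my $node ( $doc->findnodes($xpath) ) {
        my $parent = $node->parentNode;
        # Move each child in front of the tag, then drop the now-empty tag.
        $parent->insertBefore( $_, $node ) for $node->childNodes;
        $parent->removeChild($node);
    }
    return $doc->documentElement->toString;
}

# Hypothetical helper: drop the matched tags together with their content.
sub remove_matching {
    my ( $xml, $xpath ) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    $_->unbindNode for $doc->findnodes($xpath);
    return $doc->documentElement->toString;
}

my $xml = '<div><i><a href="#"><b>Something</b></a>.</i></div>';
print strip_matching( $xml, '//a' ), "\n";
# <div><i><b>Something</b>.</i></div>
print remove_matching( $xml, '//a' ), "\n";
# <div><i>.</i></div>
```

Stripping keeps the bold text and only loses the link; removing loses everything inside the link as well.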
traverse
[Not implemented.] Walks the given nodes and executes the given callback.
translate_tags
[Not implemented.] Translates one tag to another.
remove_style
[Not implemented.] Removes styles from matched nodes. To remove all style from a fragment:
$xu->remove_style("*");
(Should also remove style sheets, yes?)
inline_stylesheets
[Not implemented.] Moves all linked stylesheet information into inline style attributes. This is useful, for example, when distributing a document fragment like an RSS/Atom feed and having it match its online appearance.
sanitize
[Not implemented.] Upgrades old or broken HTML to valid XHTML.
fix
[Partially implemented.] Attempts to make many known problems go away. E.g., entity escaping, missing alt attributes of images, etc.
validate
Validates a given document or fragment (a fragment is actually wrapped in a full document) against a DTD provided by name; if none is provided, it validates against xhtml1-transitional. Uses XML::LibXML's validate under the covers.
is_valid
A non-fatal version of "validate". Returns true on success, false on failure.
enpara
Adds paragraph markup to naked text. There are many, many implementations of this basic idea out there, as well as many, like Markdown, which do much more. While some are decent, none is really meant to sling arbitrary HTML and get DWIM behavior from places where it's left out; every implementation I've seen either has rigid syntax or has beaucoup failure-prone edge cases. Consider these:
Is this a paragraph
or two?
<p>I can do HTML when I'm paying attention.</p>
<p style="color:#a00">Or I need to for some reason.</p>
Oh, I stopped paying attention... What happens here? Or <i>here</i>?
I'd like to see this in a paragraph so it's legal markup.
<pre>
now
this
should
not be touched!
</pre>I meant to do that.
With XHTML::Util->enpara you will get:
<p>Is this a paragraph<br/>
or two?</p>
<p>I can do HTML when I'm paying attention.</p>
<p style="color:#a00">Or I need to for some reason.</p>
<p>Oh, I stopped paying attention... What happens here? Or <i>here</i>?</p>
<p>I'd like to see this in a paragraph so it's legal markup.</p>
<pre>
now
this
should
not be touched!
</pre>
<p>I meant to do that.</p>
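For contrast, the naive approach that many implementations start from can be sketched in a few lines of plain Perl. This is an illustration of the basic idea only, not how XHTML::Util works, and naive_enpara is a hypothetical name. It copes with the first block of the sample above, but it has no notion of existing markup, so it would also wrap the <p> and <pre> blocks in new paragraphs, which is the kind of edge case enpara is meant to handle.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XHTML::Util: split on blank lines,
# wrap each chunk in <p>, and turn remaining newlines into <br/>.
sub naive_enpara {
    my ($text) = @_;
    my @blocks = grep { /\S/ } split /\n\s*\n/, $text;
    for (@blocks) {
        s/\A\s+|\s+\z//g;      # trim leading and trailing whitespace
        s/\n/<br\/>\n/g;       # single newlines become line breaks
    }
    return join "\n\n", map { "<p>$_</p>" } @blocks;
}

print naive_enpara("Is this a paragraph\nor two?"), "\n";
# <p>Is this a paragraph<br/>
# or two?</p>
```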
parser
The XML::LibXML parser object used to parse (X)HTML.
doc
The XML::LibXML::Document object created from input.
root
The documentElement of the XML::LibXML::Document object.
selector_to_xpath
This wraps "selector_to_xpath" in HTML::Selector::XPath. Not really meant to be used but exposed in case you want it.
print $xu->selector_to_xpath("form[name='register'] input[type='password']");
# //form[@name='register']//input[@type='password']
TO DO
I think the default doc should be \"". There is no reason to jump through that hoop if wanting to build up something from scratch.
Finish spec and tests. Get it running solid enough to remove alpha label. Generalize the argument handling. Provide optional setting or methods for returning nodes instead of serialized content. Improve document/head related handling/options.
I can see this being easier to use functionally. I haven't decided on the argspec or method-->sub approach for that yet. I think it's a good idea.
BUGS AND LIMITATIONS
All input should be UTF-8 or at least safe to run Encode::decode_utf8 on. Regular Latin character sets, I suspect, will be fine but extended sets will probably give garbage or unpredictable results; guessing.
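One way to check input up front is to decode it strictly and treat failure as a sign that the bytes are not UTF-8. The following is a minimal sketch using the core Encode module. decode_or_undef is a hypothetical helper, and strict decoding with FB_CROAK is an assumption made for illustration; decode_utf8 called with no CHECK argument substitutes a replacement character for malformed data instead of failing.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return decoded text, or undef if the bytes
# are not well-formed UTF-8.
sub decode_or_undef {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode may modify its argument when CHECK is set
    my $text = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ) };
    return $text;
}

my $good = decode_or_undef("caf\xC3\xA9");    # UTF-8 bytes for "café"
my $bad  = decode_or_undef("caf\xE9");        # Latin-1 bytes, not UTF-8

print defined $good ? "decoded ok\n" : "not UTF-8\n";
print defined $bad  ? "decoded ok\n" : "not UTF-8\n";
```

Text that decodes cleanly could then be handed to the constructor as in the synopsis, e.g. XHTML::Util->new(\$good).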
This will wreck XML and probably XHTML with a custom DTD too. It uses HTML::Tagset's conception of what valid tags are. This is not optimal but it is easier than DTD handling. It might improve to more automatic detection in the future.
I have used many of these methods and snippets in many projects and I'm tired of recycling them. Some are extremely useful and, at least in the case of "enpara", better than any other implementation I've been able to find in any language.
That said, a lot of the code herein is not well tested or at least not well tested in this incarnation. Bug reports and good feedback are adored.
SEE ALSO
XML::LibXML, HTML::Tagset, HTML::Entities, HTML::Selector::XPath, HTML::TokeParser::Simple, CSS::Tiny.
CSS W3Schools, Learning CSS at W3C.
AUTHOR
Ashley Pond V, ashley at cpan.org.
COPYRIGHT & LICENSE
Copyright (©) 2006-2009.
This program is free software; you can redistribute it or modify it or both under the same terms as Perl itself.
DISCLAIMER OF WARRANTY
Because this software is licensed free of charge, there is no warranty for the software, to the extent permitted by applicable law. Except when otherwise stated in writing the copyright holders or other parties provide the software "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of the software is with you. Should the software prove defective, you assume the cost of all necessary servicing, repair, or correction.
In no event unless required by applicable law or agreed to in writing will any copyright holder, or any other party who may modify and/or redistribute the software as permitted by the above licence, be liable to you for damages, including any general, special, incidental, or consequential damages arising out of the use or inability to use the software (including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the software to operate with any other software), even if such holder or other party has been advised of the possibility of such damages.