NAME
HTML::Tidy::libXML - Tidy HTML via XML::LibXML
VERSION
$Id: libXML.pm,v 0.2 2009/02/21 11:47:58 dankogai Exp dankogai $
SYNOPSIS
    use HTML::Tidy::libXML;
    my $tidy  = HTML::Tidy::libXML->new();
    my $xml   = $tidy->clean($html, $encoding);     # clean enough as xml
    my $xhtml = $tidy->clean($html, $encoding, 1);  # clean enough for browsers
EXPORT
none.
Functions
new
Creates an object.
    my $tidy = HTML::Tidy::libXML->new();
html2dom
    my $dom = $tidy->html2dom($string, $encoding);
This is analogous to:
    my $lx = XML::LibXML->new;
    $lx->recover_silently(1);
    my $dom = $lx->parse_html_string($string);
There is one major difference, however. HTML::Tidy::libXML does not trust <meta http-equiv="content-type" content="text/html; charset=foo"> while XML::LibXML tries to use it. Consider handing a URI straight to XML::LibXML. This more or less works, since XML::LibXML is capable of fetching the document directly, but XML::LibXML does not honor the HTTP header. Here is the better practice:
    require LWP::UserAgent;
    require HTTP::Response::Encoding;
    my $uri = shift || die;
    my $res = LWP::UserAgent->new->get($uri);
    die $res->status_line unless $res->is_success;
    my $dom = $tidy->html2dom($res->content, $res->encoding);
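The idea of taking the encoding from the HTTP header rather than from the document can be sketched with core modules only. This is a minimal illustration, not part of this module: charset_from_content_type is a hypothetical helper that extracts the charset parameter from a Content-Type header value and validates it with Encode. In the example above, $res->encoding plays this role.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of HTML::Tidy::libXML): pull the charset
# parameter out of a Content-Type header value.
# Returns the canonical Encode name, or undef if absent or unknown.
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;

    # Accept both charset=foo and charset="foo", case-insensitively.
    my ($charset) = $content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i
        or return;

    # Only report encodings that Encode actually knows about.
    my $enc = Encode::find_encoding($charset) or return;
    return $enc->name;
}

my $encoding = charset_from_content_type('text/html; charset=ISO-8859-1');
print defined $encoding ? $encoding : 'undef', "\n";
```

A value obtained this way could be passed as the $encoding argument of html2dom, so that a wrong or missing meta tag inside the document does not affect decoding.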
dom2xml
    my $xml = $tidy->dom2xml($dom, $level);
Tidies $dom, which is an XML::LibXML::Document object, and returns an XML string. If the level is omitted, the resulting XML is good enough as XML: valid, but not very browser compliant (like <br clear="">, <a name="here" />). Set the level to 1 or above for tidier, browser-compliant XHTML.
html2xml
    my $xml = $tidy->html2xml($html, $encoding, $level);
This is shorthand for:
    my $dom = $tidy->html2dom($html, $encoding);
    my $xml = $tidy->dom2xml($dom, $level);
clean
An alias for html2xml.
BENCHMARK
This is what happened trying to tidy http://www.perl.com/ on my PowerBook Pro. See t/bench.pl for details.
                      Rate            H::T H::T::LibXML(1) H::T::LibXML(0)
    H::T            96.2/s              --            -25%            -49%
    H::T::LibXML(1)  128/s             33%              --            -31%
    H::T::LibXML(0)  187/s             95%             46%              --
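The layout of this table matches what cmpthese from the core Benchmark module prints: each cell gives the row's rate relative to the column's, so 95% in the last row means H::T::LibXML(0) ran 95% faster than H::T. The following is a minimal sketch of producing such a table. It is not the actual t/bench.pl; the two workloads and the compare wrapper are stand-ins invented for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in workloads (assumptions, not the real tidying calls): two ways
# of building a 200-character string, one deliberately slower.
my %subs = (
    slow => sub { my $s = join '', map { chr(65 + $_ % 26) } 1 .. 200; $s },
    fast => sub { my $s = 'A' x 200; $s },
);

# Hypothetical wrapper around cmpthese. A negative count means "run each
# sub for at least that many CPU seconds". The return value is a
# reference to the rows of the table, header row first.
sub compare {
    my ($count, $style) = @_;
    return cmpthese($count, \%subs, defined $style ? $style : 'auto');
}

# Prints a Rate table with one row and one column per entry in %subs.
compare(-1);
```

Passing 'none' as the style suppresses the printed output, which is convenient when only the returned rows are needed.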
AUTHOR
Dan Kogai, <dankogai at dan.co.jp>
BUGS
Please report any bugs or feature requests to bug-html-tidy-libxml at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Tidy-libXML. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
    perldoc HTML::Tidy::libXML
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
COPYRIGHT & LICENSE
Copyright 2009 Dan Kogai, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.