NAME

HTML::HTML5::Sanity - Perl extension to make HTML5 DOM trees less insane.

VERSION

0.02

SYNOPSIS

use HTML::HTML5::Parser;
use HTML::HTML5::Sanity;

my $parser    = HTML::HTML5::Parser->new;
my $html5_dom = $parser->parse_file('http://example.com/');
my $sane_dom  = fix_document($html5_dom);

print document_to_clarkml($sane_dom);

DESCRIPTION

The Document Object Model (DOM) generated by HTML::HTML5::Parser meets the requirements of the HTML5 spec, but will probably catch a lot of people by surprise.

The main oddity is that elements and attributes which appear to be namespaced are not really. For example, the following element:

<div xml:lang="fr">...</div>

This looks like it should be parsed so that it has an attribute "lang" in the XML namespace. Not so. It is actually parsed as having an attribute named "xml:lang" in the null namespace.
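As a rough illustration of the mapping a reader might expect, the following sketch splits such an attribute name into a namespace and a local name. It is not the module's implementation: split_qname and %NS_FOR_PREFIX are hypothetical names invented for this example, and only the two prefixes reserved by the XML namespaces specification are handled.

```perl
use strict;
use warnings;

# Hypothetical prefix table for this sketch; only the two prefixes reserved
# by the XML namespaces specification are listed.
my %NS_FOR_PREFIX = (
    xml   => 'http://www.w3.org/XML/1998/namespace',
    xmlns => 'http://www.w3.org/2000/xmlns/',
);

# Hypothetical helper: given an attribute name as stored in the HTML5 DOM
# (null namespace, colon included), return the (namespace, local name) pair
# that a namespace-aware DOM would be expected to hold.
sub split_qname {
    my ($qname) = @_;
    if ($qname =~ /\A([^:]+):([^:]+)\z/ and exists $NS_FOR_PREFIX{$1}) {
        return ($NS_FOR_PREFIX{$1}, $2);
    }
    return (undef, $qname);    # no known prefix: stay in the null namespace
}

my ($ns, $local) = split_qname('xml:lang');
print "namespace: $ns, local name: $local\n";
```

An unprefixed name such as "class" comes back unchanged with an undefined namespace.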

fix_document
$sane_dom = fix_document($html5_dom);

Returns a modified copy of the DOM, leaving the original DOM unmodified.

document_to_clarkml, element_to_clarkml, attribute_to_clarkml
$string = document_to_clarkml($document);
$string = element_to_clarkml($element);
$string = attribute_to_clarkml($attribute);

Each returns a Clark-notation-like string useful for debugging. Only the first function, which takes an XML::LibXML::Document, is exported by default; an export list of ":all" or ":debug" exports the others too.
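For readers unfamiliar with Clark notation, it writes a namespaced name as the namespace URI in braces followed by the local name. The sketch below shows that general convention only; clark_name is a hypothetical helper, and the exact strings produced by the functions above may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: format a name in Clark notation, "{namespace-uri}local".
# Names in the null namespace are written without the braces.
sub clark_name {
    my ($ns, $local) = @_;
    return $local unless defined $ns && length $ns;
    return sprintf '{%s}%s', $ns, $local;
}

print clark_name('http://www.w3.org/1999/xhtml', 'div'), "\n";
print clark_name('http://www.w3.org/XML/1998/namespace', 'lang'), "\n";
print clark_name(undef, 'class'), "\n";
```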

document_to_hashref, element_to_hashref, attribute_to_hashref
$data = document_to_hashref($document);
$data = element_to_hashref($element);
$data = attribute_to_hashref($attribute);

Each returns a hashref useful for debugging. Only the first function, which takes an XML::LibXML::Document, is exported by default; an export list of ":all" or ":debug" exports the others too.

$HTML::HTML5::Sanity::FIX_LANG_ATTRIBUTES
$HTML::HTML5::Sanity::FIX_LANG_ATTRIBUTES = 2;
$sane_dom = fix_document($html5_dom);

If set to 1 (the default), the package checks the values of @lang and @xml:lang and removes any attribute whose value is invalid. If set to 2, it also attempts to canonicalise the value (e.g. 'EN_GB' will be converted to 'en-GB'). If set to 0, the values of language attributes are not checked.
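To give a feel for what canonicalising a language value can involve, here is a minimal sketch. It is an assumption-laden illustration, not the package's own code: canonicalise_lang is a hypothetical helper, its validity check is only a loose pattern match, and its casing rules (lower-case language, title-case four-letter script, upper-case two-letter region) are a simplification of the usual language tag conventions.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::HTML5::Sanity. Returns a canonicalised
# tag, or undef when the value does not look like a language tag at all.
sub canonicalise_lang {
    my ($value) = @_;
    (my $tag = $value) =~ tr/_/-/;    # accept 'EN_GB' style separators
    return unless $tag =~ /\A[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*\z/;

    my ($language, @rest) = split /-/, $tag;
    my @out = (lc $language);
    for my $subtag (@rest) {
        if ($subtag =~ /\A[A-Za-z]{2}\z/) {
            push @out, uc $subtag;            # treated as a region, e.g. GB
        }
        elsif ($subtag =~ /\A[A-Za-z]{4}\z/) {
            push @out, ucfirst lc $subtag;    # treated as a script, e.g. Hant
        }
        else {
            push @out, lc $subtag;            # anything else is lower-cased
        }
    }
    return join '-', @out;
}

for my $value ('EN_GB', 'zh-hant-tw', 'not a tag!') {
    my $canonical = canonicalise_lang($value);
    printf "%-12s => %s\n", $value, defined $canonical ? $canonical : '(invalid)';
}
```

In terms of the settings above, a value for which this sketch returns undef corresponds to an attribute that would be removed at levels 1 and 2, while the rewritten string corresponds to the extra step taken only at level 2.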

BUGS

Please report any bugs to http://rt.cpan.org/.

SEE ALSO

HTML::HTML5::Parser, XML::LibXML.

AUTHOR

Toby Inkster <tobyink@cpan.org>.

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Toby Inkster

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8 or, at your option, any later version of Perl 5 you may have available.