NAME
HTML::Normalize - HTML light weight cleanup
VERSION
Version 1.0003
SYNOPSIS
use HTML::Normalize;
my $norm = HTML::Normalize->new ();
my $cleanHtml = $norm->cleanup (-html => $dirtyHtml);
DESCRIPTION
HTML::Normalize uses HTML::TreeBuilder to parse an HTML string then processes the resultant tree to clean up various structural issues in the original HTML. The result is then rendered using HTML::Element's as_HTML member.
Key structural clean ups fix tag soup (<b><i>foo</b></i> becomes <b><i>foo</i></b>) and inline/block element nesting (<span><p>foo</p></span> becomes <p><span>foo</span></p>). <br> tags at the start or end of a link element are migrated out of the element.
Note that HTML::Normalize's approach to cleaning up tag soup is different from that used by HTML::Tidy. HTML::Tidy tends to enforce nesting and swaps end tags to achieve it. HTML::Normalize inserts extra tags to allow correctly tagged overlapped markup.
HTML::Normalize can also remove attributes set to default values and empty elements. For example, if Verdana size 1 is set as the default font, a <font face="Verdana" size="1" color="#FF0000"> element would become <font color="#FF0000"> and a <font face="Verdana" size="1"> element would be removed.
Methods
new
Creates an HTML::Normalize instance and performs parameter validation.
cleanup
Validates any further parameters and checks parameter consistency, then parses the HTML to generate the internal representation. It then edits the internal representation and renders the result back into HTML.
Note that cleanup may be called multiple times with different HTML strings to process.
Generally errors are handled by carping and may be detected in both new and cleanup.
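Because errors are reported by carping rather than by throwing exceptions, a caller that wants to detect them has to intercept warnings. The following is a minimal sketch of one way to do that, using Perl's standard $SIG{__WARN__} hook; the @problems array and the sample fragment are illustrative and not part of the module. It also assumes that a failed cleanup returns an undefined value, which the documentation does not state explicitly.

```perl
use strict;
use warnings;
use HTML::Normalize;

# Collect anything reported through carp/warn instead of letting it
# go straight to STDERR. (@problems is a hypothetical name.)
my @problems;
local $SIG{__WARN__} = sub { push @problems, $_[0] };

my $norm  = HTML::Normalize->new ();
my $clean = $norm->cleanup (-html => '<b><i>foo</b></i>');

if (@problems) {
    # Errors may have been raised by either new or cleanup.
    print "HTML::Normalize reported: $_" for @problems;
}
elsif (defined $clean) {
    print $clean, "\n";
}
```

Checking both the collected warnings and the return value keeps the caller independent of which of the two calls detected the problem.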
new
Create a new HTML::Normalize instance.
my $norm = HTML::Normalize->new ();
- -compact: optional
-
Setting -compact => 1 suppresses generation of 'optional' close tags. This reduces the size of the output slightly at the expense of breaking any hope of XHTML compliance.
- -default: optional - multiple
-
Define a default attribute for an element. Default attributes are removed if the attribute value has not been overridden in a parent node. For elements such as 'font' this may result in the element being removed if no attributes remain.
-default takes a string of the form 'tag attribute=value' as an argument. For example:
-default => 'font face="Verdana"'
would specify that the face "Verdana" is the default face attribute for font elements.
value may be a constant or a regular expression. A regular expression matches:
/(~|qr)\s*(.).*\2\s*$/
except that the paired delimiters [], {}, () and <> are also accepted as pattern delimiters.
Literal match values should not encode entities, but remember that quotes around attribute values are optional for some values, so the outer pair of quote characters will be removed if present. The match value extends to the end of the line and is not bounded by quote characters (except as noted earlier), so no quoting of "special" characters is required - there are no special characters.
Multiple default attributes may be provided but only one default value is allowed for any one tag/attribute pair.
Default values are case sensitive. However, you can use the regular expression form to overcome this limitation.
- -distribute: optional - default true
-
Distribute inline elements over children if the children are block level elements. For example:
<span boo="foo"><p>foo</p><p>bar</p></span>
becomes:
<p><span boo="foo">foo</span></p><p><span boo="foo">bar</span></p>
This action is only taken if all the child elements are block level elements.
- -expelbr: optional - default true
-
If -expelbr is true (the default), break elements at the edges of link elements are expelled from the link element. Thus:
<a href="linkto"><br>link text<br></a>
becomes:
<br><a href="linkto">link text</a><br>
- -html: required
-
The HTML string to clean.
- -indent: optional - default ' '
-
String used to indent formatted output. Ignored if -unformatted is true.
- -keepimplicit: optional
-
as_HTML adds various HTML required sections such as head and body elements. By default HTML::Normalize removes these elements so that it is suitable for processing HTML fragments. Set -keepimplicit => 1 to render the implicit elements. Note that if this option is true, the extra nodes will be generated regardless of their presence in the original HTML.
- -maxlinelen: optional - default 80
-
Notional maximum line length if -selfrender is true. The line length may be exceeded if no suitable break position is found. Note that the current indent is included in the line length.
- -selfrender: optional
-
Use the experimental HTML::Normalize code to render HTML rather than using HTML::Element's renderer. This code has not been tested against a wide range of HTML and may be unreliable. Its advantage is that it produces (in the author's opinion) prettier output than HTML::Element's as_HTML member.
- -unformatted: optional
-
Suppress output formatting. By default as_HTML is called as as_HTML (undef, ' ', {}), which wraps and indents elements. Setting -unformatted => 1 suppresses generation of line breaks and indentation, reducing the size of the output slightly.
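The options above can be tried side by side with a small script. This is only a sketch: the labels and HTML fragments are made up for illustration, and the regular expression value uses an inline (?i:...) group on the assumption that the documented pattern leaves no room for trailing modifiers after the closing delimiter. No expected output is shown because the exact rendering depends on the installed HTML::Element version.

```perl
use strict;
use warnings;
use HTML::Normalize;

# Each case: a label, constructor options, and a dirty fragment.
# All fragments and labels are hypothetical examples.
my @cases = (
    ['-default, constant value',
        [-default => 'font face="Verdana"'],
        '<font face="Verdana">plain</font> <font face="Arial">other</font>'],
    ['-default, regular expression value (case insensitive)',
        [-default => 'font face=qr/^(?i:verdana)$/'],
        '<font face="VERDANA">plain</font>'],
    ['-distribute disabled',
        [-distribute => 0],
        '<span boo="foo"><p>foo</p><p>bar</p></span>'],
    ['-expelbr disabled',
        [-expelbr => 0],
        '<a href="linkto"><br>link text<br></a>'],
    ['-unformatted with -compact',
        [-unformatted => 1, -compact => 1],
        '<p>one<p>two'],
);

for my $case (@cases) {
    my ($label, $options, $html) = @$case;

    # Options go to new; the HTML goes to cleanup, as in the SYNOPSIS.
    my $norm  = HTML::Normalize->new (@$options);
    my $clean = $norm->cleanup (-html => $html);

    print "$label:\n";
    print defined $clean ? $clean : '(cleanup failed)', "\n\n";
}
```

Comparing each result with a run that omits the option in question is a quick way to see what that option changes.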
cleanup
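cleanup returns the cleaned up version of the HTML. No parameters are needed if they were all supplied to new; any further parameters, such as -html in the SYNOPSIS, may be passed here.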
my $cleanHtml = $norm->cleanup ();
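Since cleanup may be called repeatedly with different HTML strings, one instance can be configured once and reused. A minimal sketch, assuming -html is passed to cleanup as in the SYNOPSIS; the @fragments list is illustrative only:

```perl
use strict;
use warnings;
use HTML::Normalize;

# Configure once, then reuse the same instance for every fragment.
my $norm = HTML::Normalize->new (-unformatted => 1);

# Hypothetical inputs taken from the examples in the DESCRIPTION.
my @fragments = (
    '<b><i>foo</b></i>',
    '<span><p>foo</p></span>',
);

for my $dirty (@fragments) {
    my $clean = $norm->cleanup (-html => $dirty);
    next unless defined $clean;    # skip fragments that could not be cleaned
    print "$dirty\n  => $clean\n";
}
```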
elements
elements takes no parameters and returns the list of HTML::Element instances generated by cleanup. elements should only be called after cleanup. It will return undef if cleanup was not called or failed.
$norm->cleanup ();
my @elements = $norm->elements();
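The returned objects are ordinary HTML::Element instances, so the usual HTML::Element accessors such as tag and as_text can be applied to them. A hedged sketch follows; the sample HTML is made up, and the guard handles both an empty list and a single undef because the documentation does not say which form the failure value takes in list context.

```perl
use strict;
use warnings;
use HTML::Normalize;

my $norm  = HTML::Normalize->new ();
my $clean = $norm->cleanup (-html => '<p>one</p><p>two <b>bold</b></p>');

my @elements = $norm->elements ();

# elements is documented to return undef when cleanup was not called
# or failed, so check before using the result.
if (@elements && defined $elements[0]) {
    for my $element (@elements) {
        next unless ref $element;    # defensive: ignore anything that is not an object
        printf "%-6s %s\n", $element->tag (), $element->as_text ();
    }
}
else {
    print "No elements available\n";
}
```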
BUGS
p/div/p parsing issue
HTML::TreeBuilder 3.23 and earlier misparses:
<p><div><p>foo</p></div></p>
as:
<p><div></div></p> <p>foo</p>
A workaround in HTML::Normalize turns that into
<p><div><p>foo</p></div></p>
which is probably still incorrect - div elements should not nest within p elements. A better fix for the problem requires HTML::TreeBuilder to be fixed.
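To see how the installed HTML::TreeBuilder parses this case, the fragment can be fed to the parser directly. This is a diagnostic sketch only, not part of HTML::Normalize; it prints the parser version and the body it builds, using the same as_HTML arguments mentioned under -unformatted.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

print "HTML::TreeBuilder version: ", HTML::TreeBuilder->VERSION, "\n";

# Parse the problematic nesting and show what the parser made of it.
my $tree = HTML::TreeBuilder->new_from_content ('<p><div><p>foo</p></div></p>');
my $body = $tree->find_by_tag_name ('body');

print $body->as_HTML (undef, ' ', {}), "\n" if $body;

# HTML::TreeBuilder trees should be deleted explicitly when finished with.
$tree->delete ();
```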
Bug reports and feature requests
Please report any other bugs or feature requests to bug-html-normalize at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Normalize. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
This module is supported by the author through CPAN. The following links may be of assistance:
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
RT: CPAN's request tracker
Search CPAN
ACKNOWLEDGEMENTS
This module was inspired by Bart Lateur's PerlMonks node 'Cleaning up HTML' (http://perlmonks.org/?node_id=658103) and is a collaboration between Bart and the author.
AUTHOR
Peter Jaquiery
CPAN ID: GRANDPA
grandpa@cpan.org
COPYRIGHT & LICENSE
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.