NAME

HTML::Normalize - lightweight HTML cleanup

VERSION

Version 1.0003

SYNOPSIS

my $norm = HTML::Normalize->new ();
my $cleanHtml = $norm->cleanup (-html => $dirtyHtml);

DESCRIPTION

HTML::Normalize uses HTML::TreeBuilder to parse an HTML string, then processes the resulting tree to clean up various structural issues in the original HTML. The result is then rendered using HTML::Element's as_HTML member.

Key structural cleanups fix tag soup (<b><i>foo</b></i> becomes <b><i>foo</i></b>) and inline/block element nesting (<span><p>foo</p></span> becomes <p><span>foo</span></p>). <br> tags at the start or end of a link element are migrated out of the element.

Note that HTML::Normalize's approach to cleaning up tag soup is different from that used by HTML::Tidy. HTML::Tidy tends to enforce nesting and swaps end tags to achieve that. HTML::Normalize inserts extra tags to allow correctly tagged overlapped markup.

HTML::Normalize can also remove attributes set to default values and empty elements. For example, a <font face="Verdana" size="1" color="#FF0000"> element would become <font color="#FF0000">, and <font face="Verdana" size="1"> would be removed, if Verdana size 1 is set as the default font.

Methods

new creates an HTML::Normalize instance and performs parameter validation.

cleanup validates any further parameters and checks parameter consistency, then parses the HTML to generate the internal representation. It then edits the internal representation and renders the result back into HTML.

Note that cleanup may be called multiple times with different HTML strings to process.
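
As a minimal sketch of that reuse, assuming the calling style shown in the SYNOPSIS (with -html passed to cleanup), a single instance can process several fragments in turn. The fragment strings and variable names are illustrative only:

```perl
use strict;
use warnings;
use HTML::Normalize;

# One instance, created once, then reused for each HTML string.
my $norm = HTML::Normalize->new ();

# Sample inputs taken from the DESCRIPTION above.
my @fragments = ('<b><i>foo</b></i>', '<span><p>foo</p></span>');

# scalar forces one result per fragment even if cleanup returns nothing.
my @cleaned = map { scalar $norm->cleanup (-html => $_) } @fragments;

for my $index (0 .. $#fragments) {
    my $result = $cleaned[$index] // '(cleanup failed)';
    print "$fragments[$index]\n  => $result\n";
}
```

The exact whitespace of each result depends on the formatting options in effect, so no output is shown here.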

Generally errors are handled by carping and may be detected in both new and cleanup.

new

Create a new HTML::Normalize instance.

my $norm = HTML::Normalize->new ();

-compact: optional

Setting -compact => 1 suppresses generation of 'optional' close tags. This reduces the size of the output slightly at the expense of breaking any hope of XHTML compliance.

-default: optional - multiple

Define a default attribute for an element. Default attributes are removed if the attribute value has not been overridden in a parent node. For elements such as 'font' this may result in the element being removed if no attributes remain.

-default takes a string of the form 'tag attribute=value' as an argument. For example:

-default => 'font face="Verdana"'

would specify that the face "Verdana" is the default face attribute for font elements.

value may be a constant or a regular expression. A regular expression matches:

/(~|qr)\s*(.).*\2\s*$/

except that the paired delimiters [], {}, () and <> are also accepted as pattern delimiters.

Literal match values should not encode entities, but remember that quotes around attribute values are optional for some values, so the outer pair of quote characters will be removed if present. The match value extends to the end of the line and is not bounded by quote characters (except as noted earlier), so no quoting of "special" characters is required - there are no special characters.

Multiple default attributes may be provided but only one default value is allowed for any one tag/attribute pair.

Default values are case sensitive. However, you can use the regular expression form to overcome this limitation.
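
The 'tag attribute=value' syntax described above can be sketched in plain Perl. This is only an illustration of the documented rules, not the module's own parser: parse_default_spec and the hash keys it returns are hypothetical names, and whether the module anchors its regular expression matches is not stated here, so the sketch leaves them unanchored.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of HTML::Normalize.
# Splits a -default string into tag, attribute and match value.
sub parse_default_spec {
    my ($spec) = @_;

    # The value runs to the end of the string (trailing whitespace trimmed).
    my ($tag, $attr, $value) = $spec =~ /^\s*(\w+)\s+([\w-]+)\s*=\s*(.*?)\s*$/s
        or return;

    # Regular expression form: ~ or qr, then a delimiter pair.
    my %closer = ('[' => ']', '{' => '}', '(' => ')', '<' => '>');
    if ($value =~ /^(?:~|qr)\s*(.)(.*)(.)$/s) {
        my ($open, $body, $close) = ($1, $2, $3);
        my $expected = $closer{$open} // $open;
        return {tag => $tag, attr => $attr, match => qr/$body/}
            if $close eq $expected;
    }

    # Literal form: remove one outer pair of quote characters if present.
    $value =~ s/^(["'])(.*)\1$/$2/s;
    return {tag => $tag, attr => $attr, match => $value};
}

my $literal = parse_default_spec ('font face="Verdana"');
print "$literal->{tag} $literal->{attr} = $literal->{match}\n";
# prints: font face = Verdana

# An inline (?i) makes the match case insensitive.
my $pattern = parse_default_spec ('font face=~/(?i)verdana/');
print "regex form matches VERDANA\n" if 'VERDANA' =~ $pattern->{match};
```

Using an inline (?i) modifier is one way to get the case insensitive matching mentioned above, since the documented pattern leaves no room for trailing flags.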

-distribute: optional - default true

Distribute inline elements over children if the children are block level elements. For example:

<span boo="foo"><p>foo</p><p>bar</p></span>

becomes:

<p><span boo="foo">foo</span></p><p><span boo="foo">bar</span></p>

This action is only taken if all the child elements are block level elements.
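
The distribution rule can be sketched on a toy tree. This is a simplified model under stated assumptions, not the module's code: nodes are plain array references rather than HTML::Element objects, and the %is_block table, distribute and render are hypothetical names introduced for the example.

```perl
use strict;
use warnings;

# Toy tree: an element is [$tag, \%attrs, @children]; text is a plain string.
# The block-level list is abbreviated for the example.
my %is_block = map { $_ => 1 } qw(p div ul ol li table blockquote h1 h2 h3);

# Returns the node (or nodes) that replace $node after distribution.
sub distribute {
    my ($node) = @_;
    return $node unless ref $node;                 # text node: unchanged

    my ($tag, $attrs, @kids) = @$node;
    @kids = map { distribute ($_) } @kids;         # work bottom up

    # True only if there are children and every one is a block element.
    my $all_block = @kids && !grep { !ref $_ || !$is_block{$_->[0]} } @kids;
    return [$tag, $attrs, @kids] if $is_block{$tag} || !$all_block;

    # Push a copy of the inline element inside each block child.
    return map {
        my ($kid_tag, $kid_attrs, @kid_kids) = @$_;
        [$kid_tag, $kid_attrs, [$tag, {%$attrs}, @kid_kids]];
    } @kids;
}

sub render {
    my ($node) = @_;
    return $node unless ref $node;
    my ($tag, $attrs, @kids) = @$node;
    my $attr_text = join '', map { qq( $_="$attrs->{$_}") } sort keys %$attrs;
    return "<$tag$attr_text>" . join ('', map { render ($_) } @kids) . "</$tag>";
}

my $tree = ['span', {boo => 'foo'}, ['p', {}, 'foo'], ['p', {}, 'bar']];
my $html = join '', map { render ($_) } distribute ($tree);
print "$html\n";
# prints: <p><span boo="foo">foo</span></p><p><span boo="foo">bar</span></p>
```

A span containing both text and a block child is left untouched by this sketch, mirroring the "all children must be block level" condition.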

-expelbr: optional - default true

If -expelbr is true (the default), break elements at the edges of link elements are expelled from the link element. Thus:

<a href="linkto"><br>link text<br></a>

becomes

<br><a href="linkto">link text</a><br>
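
The effect of this rewrite can be illustrated at the string level. HTML::Normalize works on the parsed tree, so the regex-based helper below is only a sketch of the transformation, not the module's method; expel_edge_breaks is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of HTML::Normalize.
sub expel_edge_breaks {
    my ($html) = @_;

    # Move any run of <br> tags that directly follows <a ...> to before it.
    $html =~ s{(<a\b[^>]*>)((?:\s*<br\s*/?>)+)}{$2$1}gi;

    # Move any run of <br> tags that directly precedes </a> to after it.
    $html =~ s{((?:<br\s*/?>\s*)+)(</a>)}{$2$1}gi;

    return $html;
}

my $out = expel_edge_breaks ('<a href="linkto"><br>link text<br></a>');
print "$out\n";
# prints: <br><a href="linkto">link text</a><br>
```

Breaks in the middle of the link text are left where they are, since only breaks at the edges of the link are moved.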

-html: required

The HTML string to clean.

-indent: optional - default ' '

String used to indent formatted output. Ignored if -unformatted is true.

-keepimplicit: optional

as_HTML adds various HTML required sections such as head and body elements. By default HTML::Normalize removes these elements so that it is suitable for processing HTML fragments. Set -keepimplicit => 1 to render the implicit elements.

Note that if this option is true, the extra nodes will be generated regardless of their presence in the original HTML.

-maxlinelen: optional - default 80

Notional maximum line length if -selfrender is true. The line length may be exceeded if no suitable break position is found. Note that the current indent is included in the line length.

-selfrender: optional

Use the experimental HTML::Normalize code to render HTML rather than using HTML::Element's renderer. This code has not been tested against a wide range of HTML and may be unreliable. Its advantage is that it produces (in the author's opinion) prettier output than HTML::Element's as_HTML member.

-unformatted: optional

Suppress output formatting. By default as_HTML is called as

as_HTML (undef, '   ', {})

which wraps and indents elements. Setting -unformatted => 1 suppresses generation of line breaks and indentation, reducing the size of the output slightly.

cleanup

cleanup takes no parameters and returns the cleaned up version of the HTML.

my $cleanHtml = $norm->cleanup ();

elements

elements takes no parameters and returns the list of HTML::Element instances generated by cleanup. elements should only be called after cleanup. It will return undef if cleanup was not called or failed.

$norm->cleanup ();
my @elements = $norm->elements();
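
A slightly fuller sketch of the same sequence, assuming -html is supplied to new as listed in its options above. The sample HTML is illustrative, and the filter guards against the undef that elements returns when cleanup was not called or failed:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use HTML::Normalize;

my $norm = HTML::Normalize->new (-html => '<p>one</p><p>two</p>');
my $cleanHtml = $norm->cleanup ();

# Keep only genuine HTML::Element objects before calling methods on them.
my @elements =
    grep { blessed ($_) && $_->isa ('HTML::Element') } $norm->elements ();

print scalar (@elements), " element(s) returned\n";
for my $element (@elements) {
    my $tag = $element->tag ();    # tag is a standard HTML::Element accessor
    print "  <$tag>\n";
}
```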

BUGS

p/div/p parsing issue

HTML::TreeBuilder 3.23 and earlier misparses:

<p><div><p>foo</p></div></p>

as:

<p><div></div></p> <p>foo</p>

A workaround in HTML::Normalize turns that into

<p><div><p>foo</p></div></p>

which is probably still incorrect - div elements should not nest within p elements. A better fix for the problem requires HTML::TreeBuilder to be fixed.

Bug reports and feature requests

Please report any other bugs or feature requests to bug-html-normalize at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Normalize. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

This module is supported by the author through CPAN.

ACKNOWLEDGEMENTS

This module was inspired by Bart Lateur's PerlMonks node 'Cleaning up HTML' (http://perlmonks.org/?node_id=658103) and is a collaboration between Bart and the author.

AUTHOR

Peter Jaquiery
CPAN ID: GRANDPA
grandpa@cpan.org

COPYRIGHT & LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.