NAME
Formatter::HTML::HTML - Formatter to clean existing HTML
SYNOPSIS
use Formatter::HTML::HTML;
my $formatter = Formatter::HTML::HTML->format($data);
print $formatter->document;
print $formatter->title;
my $links = $formatter->links;
print ${$links}[0]->{uri};
DESCRIPTION
This module will clean the document using HTML::Tidy. It also inherits from that module, so you can use methods of that class. It can also parse and return links and the title (using HTML::TokeParser).
METHODS
This module conforms with the Formatter API specification, version 0.93:
format($string)
-
Call this function to initialise the formatter. It takes the plain text as a string argument and returns an object of this class.
document([$charset])
-
Will return a full, cleaned and valid HTML document. You may specify an optional $charset parameter. This will include an HTML meta element with the chosen character set. It will still be your responsibility to ensure that the document served is encoded with this character set.
fragment
-
This will return only the contents of the body element.
links
-
Will return all links found in the input plain text string as an arrayref. Each element of the arrayref contains a key uri with the address and a key title with the link text.
title
-
Will return the title of the document as seen in the HTML title element, or undef if none can be found.
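As a rough illustration of how the methods above fit together, the following sketch calls each of them on one formatter object. The sample input and the 'utf-8' charset value are assumptions chosen for the example, not taken from the module's own documentation.

```perl
use strict;
use warnings;
use Formatter::HTML::HTML;

# Illustrative input (an assumption); any HTML string would do.
my $data = <<'HTML';
<html><head><title>Example page</title></head>
<body><p>See <a href="http://example.org/">the example site</a>.</p></body></html>
HTML

my $formatter = Formatter::HTML::HTML->format($data);

# Full document, with a meta element declaring the character set.
# The caller must still serve the document in that encoding.
my $document = $formatter->document('utf-8');

# Only the contents of the body element.
my $fragment = $formatter->fragment;

# title returns undef when no title element is found.
my $title = $formatter->title;
print defined $title ? "Title: $title\n" : "No title found\n";

# links returns an arrayref; each element has uri and title keys.
for my $link (@{ $formatter->links || [] }) {
    print "$link->{uri} ($link->{title})\n";
}
```

The check with defined guards against the undef that title returns for documents without a title element.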
SEE ALSO
Formatter, HTML::Tidy, HTML::TokeParser
TODO
Both the fragment and document methods use naive regular expressions to strip off elements and add a meta element respectively. This is clearly not very reliable, and should be done with a proper parser.
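One possible direction for the fragment case, sketched here with HTML::TokeParser since the module already uses it for links and titles. The body_contents helper is hypothetical and not part of Formatter::HTML::HTML; it only shows how the contents of the body element might be collected from parser tokens rather than with a regular expression.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper, not part of Formatter::HTML::HTML.
# Returns the source text between <body> and </body>,
# or undef if the document has no body start tag.
sub body_contents {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    # Skip everything up to and including the body start tag.
    $parser->get_tag('body') or return;

    my $out = '';
    while (my $token = $parser->get_token) {
        my $type = $token->[0];

        # Stop at the body end tag.
        last if $type eq 'E' && $token->[1] eq 'body';

        # Text tokens carry their source text at index 1;
        # every other token type carries it as the last element.
        $out .= $type eq 'T' ? $token->[1] : $token->[-1];
    }
    return $out;
}
```

Because the tokens keep their original source text, markup inside the body is passed through unchanged and entities are not decoded.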
AUTHOR
Kjetil Kjernsmo, <kjetilk@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2005 by Kjetil Kjernsmo
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.