NAME
Alvis::HTML - Perl extension for converting documents in dirty HTML into "clean" HTML suitable for Alvis purposes
SYNOPSIS
    use Alvis::HTML;

    # Create a new instance and specify that we want to remove uninteresting
    # HTML tags, keep and fix tags of interest to Alvis::Convert, and
    # convert both symbolic and numerical character entities
    # to UTF-8 characters.
    #
    my $C = Alvis::HTML->new(alvisKeep       => 0,
                             alvisRemove     => 1,
                             obsolete        => 1,
                             proprietary     => 1,
                             xhtml           => 1,
                             wml             => 1,
                             keepAll         => 1,
                             assertHTML      => 0,
                             convertCharEnts => 1,
                             convertNumEnts  => 1,
                             cleanWhitespace => 0);
    my ($txt, $header) = $C->clean($html,
                                   {title   => 1,
                                    baseURL => 1});
    if (!defined($txt))
    {
        die "Cleaning the HTML failed: " . $C->errmsg();
    }
    #
    # Remove all HTML tags from the document. Assert that the document actually
    # is HTML. The HTML is in 'iso-8859-1' (output is always in UTF-8).
    # Assert that the source assumptions (UTF-8, no '\0') hold before
    # trying to convert.
    #
    $C = Alvis::HTML->new(alvisKeep               => 1,
                          alvisRemove             => 1,
                          obsolete                => 1,
                          proprietary             => 1,
                          xhtml                   => 1,
                          wml                     => 1,
                          keepAll                 => 1,
                          assertHTML              => 1,
                          convertCharEnts         => 1,
                          convertNumEnts          => 1,
                          sourceEncoding          => 'iso-8859-1',
                          assertSourceAssumptions => 1);
DESCRIPTION
By default, assumes that the input is in UTF-8 and does NOT contain null bytes ('\0'), or rather that any null bytes carry no meaning and can be removed.
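As an illustration of these assumptions, the following is a minimal sketch of how a caller could check them before handing text to the module. It uses only the core Encode module. The helper name source_assumptions_hold is hypothetical and is not part of Alvis::HTML; how the module itself performs the check is not documented here.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Alvis::HTML): returns 1 if the byte
# string is well-formed UTF-8 and contains no null bytes, 0 otherwise.
sub source_assumptions_hold {
    my ($bytes) = @_;

    # Null bytes violate the stated assumption.
    return 0 if index($bytes, "\0") >= 0;

    # decode() may modify its input when a CHECK value is given,
    # so work on a copy.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

# Usage: a UTF-8 encoded string passes, a Latin-1 encoded one does not.
my $utf8_ok   = source_assumptions_hold("caf\xC3\xA9 au lait");
my $latin1_ok = source_assumptions_hold("caf\xE9 au lait");
```

A check of this kind is cheap compared to a full conversion, which is one reason to run it up front.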
METHODS
new()
Options available:

    assertHTML               if 1, try to check whether the source really is
                             in one of the recognized dialects.
    keepAll                  if 1, pass all documents on regardless of their
                             HTMLness. Non-HTML goes forward as '\n'.

Options to specify the HTML subsets whose tags to remove (set to a defined value):

    alvisKeep                W3C HTML 4.01 tags Alvis::Convert is interested in
    alvisRemove              HTML 4.01 tags Alvis::Convert is NOT interested in
    obsolete                 tags from HTML versions before 4.01
    proprietary              proprietary tags (Netscape, Internet Explorer, ...)
    xhtml                    XHTML 1.1
    wml                      WML

    Note: alvisKeep + alvisRemove == remove all HTML 4.01 tags

    convertCharEnts          convert symbolic character entities to UTF-8
                             characters.
    convertNumEnts           convert numerical character entities to UTF-8
                             characters.
    sourceEncoding           encoding of the source HTML text
                             (default: 'utf-8').
                             If not 'utf-8', the HTML is converted to UTF-8.
                             If undefined, the encoding is guessed first.
    assertSourceAssumptions  make sure that, before any operations, the source
                             is in UTF-8 and contains no null bytes.
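To make the effect of convertNumEnts concrete, here is a minimal sketch of numeric character entity conversion in plain Perl. It is an assumption about what such a conversion amounts to, not the module's actual implementation. The helper name convert_num_ents is hypothetical, and the result is a Perl character string that would still need to be encoded as UTF-8 on output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::HTML): replace decimal (&#233;)
# and hexadecimal (&#xE9;) numeric character references with the
# characters they denote.
sub convert_num_ents {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9A-Fa-f]+);/chr(hex($1))/ge;   # hexadecimal form
    $text =~ s/&#([0-9]+);/chr($1)/ge;                  # decimal form
    return $text;
}

# Usage: both forms yield the same character.
my $converted = convert_num_ents("caf&#233; and caf&#xE9;");
```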
clean(html,options)
Removes unwanted tags from $html (text). $options is a hash reference used to request the title and base URL of the document and to set call-specific parameters.
If extraction of the title and base URL is desired, set the fields 'title' and 'baseURL' to a defined value, e.g.

    my ($txt, $header) = $C->clean($html,
                                   {title   => 1,
                                    baseURL => 1});
In $options you can also set the source and target encodings (sourceEncoding, targetEncoding). The following call will convert from 'iso-8859-1' to the default output encoding (UTF-8):

    my ($txt, $header) = $C->clean($html,
                                   {title          => 1,
                                    baseURL        => 1,
                                    sourceEncoding => 'iso-8859-1'});

This will guess the encoding first:

    my ($txt, $header) = $C->clean($html,
                                   {title          => 1,
                                    baseURL        => 1,
                                    sourceEncoding => undef});
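For readers unfamiliar with what a conversion from 'iso-8859-1' to UTF-8 involves, the sketch below shows the equivalent operation with the core Encode module. It is an illustration only and makes no claim about how Alvis::HTML performs the conversion internally. The helper name to_utf8_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Alvis::HTML): re-encode a byte string
# from $source_encoding into UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes, $source_encoding) = @_;
    my $chars = Encode::decode($source_encoding, $bytes);  # bytes -> characters
    return Encode::encode('UTF-8', $chars);                # characters -> UTF-8 bytes
}

# Usage: the single Latin-1 byte 0xE9 becomes the two UTF-8 bytes 0xC3 0xA9.
my $latin1 = "caf\xE9";
my $utf8   = to_utf8_bytes($latin1, 'iso-8859-1');
```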
errmsg()
Returns a stack of error messages, if any. Empty string otherwise.
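A typical use of errmsg() is to report why a call failed. The sketch below assumes, as the SYNOPSIS suggests, that clean() returns an undefined text on failure. The wrapper name clean_or_die is hypothetical and works with any object that provides clean() and errmsg().

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of Alvis::HTML): clean $html and die with
# the accumulated error messages if no text comes back.
sub clean_or_die {
    my ($cleaner, $html) = @_;
    my ($txt, $header) = $cleaner->clean($html, { title => 1, baseURL => 1 });
    die "Cleaning failed: " . $cleaner->errmsg() . "\n" unless defined $txt;
    return ($txt, $header);
}
```

With an Alvis::HTML instance in $C, this would be called as my ($txt, $header) = clean_or_die($C, $html);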
SEE ALSO
Alvis::Canonical
AUTHOR
Kimmo Valtonen, <kimmo.valtonen@hiit.fi>
COPYRIGHT AND LICENSE
Copyright (C) 2006 by Kimmo Valtonen
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.