NAME
Alvis::Canonical - Perl extension for converting documents in various formats into the Alvis canonical document format
SYNOPSIS
use Alvis::Canonical;
# Create a new instance, specify the conversion of both numeric and
# symbolic character entities to Unicode characters
my $C=Alvis::Canonical->new(convertCharEnts=>1,
                            convertNumEnts=>1);
if (!defined($C))
{
die("Unable to instantiate Alvis::Canonical.");
}
# Convert an HTML document text in UTF-8 to the canonical format.
# Specify that you want the title and baseURL as well, if any can be
# determined.
my ($txt,$header)=$C->HTML($html,
                           {title=>1,
                            baseURL=>1});
if (!defined($txt))
{
die $C->errmsg();
}
DESCRIPTION
Assumes that the input is UTF-8 encoded and that any '\0' characters in it carry no meaning and can safely be removed.
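The two input assumptions above can be checked before a document is handed to the converter. The following is a minimal sketch, not part of Alvis::Canonical: prepare_html() is a hypothetical helper name, and it assumes the document is available as a string of octets. It relies only on the core Encode module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Alvis::Canonical.
# Dies if $octets is not well-formed UTF-8; otherwise returns a copy
# of the octets with all NUL ('\0') bytes removed.
sub prepare_html
{
    my ($octets)=@_;

    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $octets itself unmodified.
    Encode::decode('UTF-8',$octets,
                   Encode::FB_CROAK|Encode::LEAVE_SRC);

    # A NUL byte never occurs inside a multi-byte UTF-8 sequence,
    # so deleting NULs at the octet level is safe.
    (my $clean=$octets)=~tr/\0//d;

    return $clean;
}
```

Wrapping the call in eval lets the caller skip or log documents that fail the UTF-8 check instead of aborting the whole run.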
METHODS
new()
Available options:
warnings         Issue warnings about badly broken original HTML where
                 we have to resort to a heuristic solution. Each
                 warning is written to STDERR and documents the error
                 and the solution. Default: no.
convertCharEnts  Convert HTML symbolic character entities to UTF-8
                 characters? Default: yes.
convertNumEnts   Convert HTML numerical character entities to UTF-8
                 characters? Default: yes.
sourceEncoding   The encoding of the source documents. Default: undef,
                 which means it is guessed.
my $C=Alvis::Canonical->new(convertCharEnts=>1,
                            convertNumEnts=>1);
if (!defined($C))
{
die("Unable to instantiate Alvis::Canonical.");
}
HTML($html,$options)
Converts dirty HTML to a valid Alvis canonicalDocument. $options is a hash reference through which you request the title and base URL of the document: to have them extracted, set the fields 'title' and 'baseURL' to a defined value. If you know the encoding of the source document, set the option 'sourceEncoding', e.g.
my ($txt,$header)=$C->HTML($html,
                           {title=>1,
                            baseURL=>1,
                            sourceEncoding=>'iso-8859-2'});
errmsg()
Returns the stack of error messages, if any; otherwise returns an empty string.
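HTML() and errmsg() are typically used together: an undefined text result signals failure, and errmsg() then supplies the reason. The sketch below wraps that pattern in one place. convert_or_die() is a hypothetical helper name, not part of Alvis::Canonical; it assumes $C is an Alvis::Canonical instance and that a failed conversion leaves the returned text undefined, as in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical wrapper, not part of Alvis::Canonical.
# $C is expected to be an Alvis::Canonical instance and $html the
# document text. Returns ($txt,$header) or dies with the error stack.
sub convert_or_die
{
    my ($C,$html)=@_;

    my ($txt,$header)=$C->HTML($html,
                               {title=>1,
                                baseURL=>1});
    if (!defined($txt))
    {
        # errmsg() returns the accumulated messages, or an empty
        # string if there are none.
        my $msg=$C->errmsg();
        if ($msg eq '')
        {
            $msg='no error message available';
        }
        die("Conversion failed: $msg\n");
    }

    return ($txt,$header);
}
```

With an instance created as in the new() example, the call would be my ($txt,$header)=convert_or_die($C,$html); the fallback text covers the case where errmsg() has nothing to report.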
SEE ALSO
Alvis::Convert
AUTHOR
Kimmo Valtonen, <kimmo.valtonen@hiit.fi>
COPYRIGHT AND LICENSE
Copyright (C) 2006 by Kimmo Valtonen
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.