NAME

Alvis::HTML - Perl extension for converting documents in dirty HTML into "clean" HTML suitable for Alvis purposes

SYNOPSIS

 use Alvis::HTML;

 # Create a new instance and specify that we want to remove uninteresting
 # HTML tags, keep and fix tags of interest to Alvis::Convert and
 # convert both symbolic and numerical character entities
 # to UTF-8 characters.
 #
 my $C=Alvis::HTML->new(alvisKeep=>0,
                        alvisRemove=>1,
                        obsolete=>1,
                        proprietary=>1,
                        xhtml=>1,
                        wml=>1,
                        keepAll=>1,
                        assertHTML=>0,
                        convertCharEnts=>1,
                        convertNumEnts=>1,
                        cleanWhitespace=>0
                        );

 my ($txt,$header)=$C->clean($html,
                             {title=>1,
                              baseURL=>1});
 if (!defined($txt))
 {
     die "Cleaning the HTML failed: " . $C->errmsg();
 }

 #
 # Remove all HTML tags from the document. Assert that the document actually
 # is HTML. The HTML is in 'iso-8859-1' (output is always in UTF-8).
 # Assert that the source assumptions (UTF-8, no '\0') hold before
 # trying to convert.
 #
 $C=Alvis::HTML->new(alvisKeep=>1,
                     alvisRemove=>1,
                     obsolete=>1,
                     proprietary=>1,
                     xhtml=>1,
                     wml=>1,
                     keepAll=>1,
                     assertHTML=>1,
                     convertCharEnts=>1,
                     convertNumEnts=>1,
                     sourceEncoding=>'iso-8859-1',
                     assertSourceAssumptions=>1
                    );

DESCRIPTION

By default, the input is assumed to be in UTF-8 and NOT to contain '\0's (or rather, any that occur are assumed to carry no meaning and to be removable).
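The two assumptions above can be pictured with a small check built on the core Encode module. This is only a sketch of what "UTF-8 and no '\0'" means in practice, not the code Alvis::HTML runs internally; the helper name source_assumptions_hold is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Alvis::HTML: returns 1 if $bytes is
# well-formed UTF-8 and contains no null bytes, 0 otherwise.
sub source_assumptions_hold
{
    my ($bytes)=@_;

    # A null byte is valid UTF-8, so it has to be checked separately.
    return 0 if index($bytes,"\0")>=0;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes unmodified.
    my $ok=eval
    {
        Encode::decode('UTF-8',$bytes,
                       Encode::FB_CROAK()|Encode::LEAVE_SRC());
        1;
    };

    return $ok ? 1 : 0;
}
```

A caller could run such a check on raw input before handing it to clean(), and fall back to an explicit sourceEncoding when it fails.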

METHODS

new()

Options available:

    assertHTML         if 1, try to check whether the source really is in
                       any of the recognized dialects.
    keepAll            if 1, pass all documents on regardless of their
                       HTMLness. Non-HTML goes forward as '\n'.

 Options specifying the HTML subsets whose tags are to be removed (set to a defined value):

    alvisKeep          W3's HTML 4.01 tags Alvis::Convert
                       is interested in
    alvisRemove        4.01 tags Alvis::Convert is NOT interested in
    obsolete           HTML <4.01
    proprietary        browser-specific tags (Netscape, Internet Explorer,...)
    xhtml              XHTML 1.1
    wml                WML

     Note: alvisKeep + alvisRemove == remove all HTML 4.01 tags

    convertCharEnts    convert symbolic character entities to UTF-8 characters.
    convertNumEnts     convert numerical character entities to UTF-8 
                       characters.  

    sourceEncoding     encoding of the source HTML text (default: 'utf-8')
                       If not 'utf-8', HTML is converted to UTF-8.
                       If undefined, the encoding is guessed first.

    assertSourceAssumptions
                       make sure, before any operations, that the source is
                       in UTF-8 and contains no null bytes.
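To illustrate what convertNumEnts is described as doing, here is a minimal sketch of numeric character reference conversion in plain Perl. It is not the module's implementation; the helper name convert_num_ents is hypothetical, and the sketch assumes its input is a decoded character string (or plain ASCII) rather than already UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Alvis::HTML: replaces decimal (&#233;)
# and hexadecimal (&#xE9;) character references with the characters
# they denote and returns the result as UTF-8 encoded bytes.
sub convert_num_ents
{
    my ($text)=@_;

    # A single pass handles both forms, so that text produced by one
    # replacement is never re-interpreted as another reference.
    $text=~s/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/chr(defined($1) ? hex($1) : $2)/ge;

    return Encode::encode('UTF-8',$text);
}
```

Symbolic entities such as &eacute; need a name-to-code-point table and are left out of this sketch.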

clean(html,options)

Removes unwanted tags from $html (text). $options serves two purposes: requesting the title and base URL of the document, and setting call-specific parameters.

To have the title and base URL extracted, set the fields 'title' and 'baseURL' to a defined value, e.g.:

  my ($txt,$header)=$C->clean($html,
                              {title=>1,
                               baseURL=>1});

In $options you can also set the source and target encodings (sourceEncoding, targetEncoding). The following call will convert from 'iso-8859-1' to the default output encoding (UTF-8):

   my ($txt,$header)=$C->clean($html,
                               {title=>1,
                                baseURL=>1,
                                sourceEncoding=>'iso-8859-1'});

This will guess the encoding first:

   my ($txt,$header)=$C->clean($html,
                               {title=>1,
                                baseURL=>1,
                                sourceEncoding=>undef});
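For the explicit sourceEncoding case, the conversion to UTF-8 can be pictured with the core Encode module. This is a sketch under the assumption that the conversion amounts to a decode followed by an encode; it is not taken from the Alvis::HTML source, and the helper name to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Alvis::HTML: re-encodes $bytes from
# $source_encoding to UTF-8. Dies if $source_encoding is not known to
# Encode or if the input is malformed for that encoding.
sub to_utf8
{
    my ($bytes,$source_encoding)=@_;

    my $chars=Encode::decode($source_encoding,$bytes,Encode::FB_CROAK());
    return Encode::encode('UTF-8',$chars);
}

my $latin1="caf\xe9";                      # 'cafe' with e-acute, iso-8859-1
my $utf8=to_utf8($latin1,'iso-8859-1');    # same text as UTF-8 bytes
```

The single byte 0xE9 in the input becomes the two-byte sequence 0xC3 0xA9 in the output, which is why passing the wrong sourceEncoding tends to produce garbled non-ASCII characters.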

errmsg()

Returns the stack of error messages, if any; otherwise returns an empty string.

SEE ALSO

Alvis::Canonical

AUTHOR

Kimmo Valtonen, <kimmo.valtonen@hiit.fi>

COPYRIGHT AND LICENSE

Copyright (C) 2006 by Kimmo Valtonen

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.