NAME

Alvis::HTML - Perl extension for converting documents in dirty HTML into "clean" HTML suitable for Alvis purposes

SYNOPSIS

use Alvis::HTML;

# Create a new instance and specify that we want to remove uninteresting
# HTML tags, keep and fix tags of interest to Alvis::Convert, and
# convert both symbolic and numerical character entities
# to UTF-8 characters.
#
my $C=Alvis::HTML->new(alvisKeep=>0,
                       alvisRemove=>1,
                       obsolete=>1,
                       proprietary=>1,
                       xhtml=>1,
                       wml=>1,
                       keepAll=>1,
                       assertHTML=>0,
                       convertCharEnts=>1,
                       convertNumEnts=>1,
                       cleanWhitespace=>0
                      );
if (!defined($C))
{
    die "Instantiating Alvis::HTML failed.";
}
my ($txt,$header)=$C->clean($html,
                            {title=>1,
                             baseURL=>1});
if (!defined($txt))
{
    die "Cleaning the HTML failed: " . $C->errmsg();
}
#
# Remove all HTML tags from the document. Assert that the document actually
# is HTML. The HTML is in 'iso-8859-1' (output is always in UTF-8).
# Assert that the source assumptions (UTF-8, no '\0') hold before
# trying to convert.
#
$C=Alvis::HTML->new(alvisKeep=>1,
                    alvisRemove=>1,
                    obsolete=>1,
                    proprietary=>1,
                    xhtml=>1,
                    wml=>1,
                    keepAll=>1,
                    assertHTML=>1,
                    convertCharEnts=>1,
                    convertNumEnts=>1,
                    sourceEncoding=>'iso-8859-1',
                    assertSourceAssumptions=>1
                   );

DESCRIPTION

By default, the input is assumed to be in UTF-8 and NOT to contain null bytes ('\0'), or rather, any null bytes present are assumed to carry no meaning and to be removable.
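
The assumptions above can be checked by hand before a document is passed on. The following is a minimal sketch using only the core Encode module; the helper names meets_source_assumptions and strip_nulls are hypothetical and not part of Alvis::HTML, and this is not necessarily how the module performs its own check.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of Alvis::HTML): returns 1 if $bytes is
# well-formed UTF-8 and contains no null bytes, 0 otherwise.
sub meets_source_assumptions {
    my ($bytes) = @_;
    return 0 if index($bytes, "\0") >= 0;    # a null byte is present
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps decode() from modifying $bytes.
        decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: drop null bytes, which are assumed to be meaningless.
sub strip_nulls {
    my ($bytes) = @_;
    $bytes =~ tr/\0//d;
    return $bytes;
}
```

A caller could run strip_nulls first and then reject the document if meets_source_assumptions still returns 0.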

METHODS

new()

Options available:

   assertHTML         if 1, try to check whether the source really is in any
                      of the recognized dialects.
   keepAll            if 1, pass all documents on regardless of their
                      HTMLness. Non-HTML goes forward as '\n'.

Options to specify HTML subsets whose tags to remove (set to a defined value):

   alvisKeep          W3's HTML 4.01 tags Alvis::Convert
                      is interested in
   alvisRemove        HTML 4.01 tags Alvis::Convert is NOT interested in
   obsolete           HTML <4.01
   proprietary        Net-escape,Exploder,...
   xhtml              XHTML 1.1
   wml                WML

   Note: alvisKeep + alvisRemove == remove all HTML 4.01 tags

   convertCharEnts    convert symbolic character entities to UTF-8 characters.
   convertNumEnts     convert numerical character entities to UTF-8
                      characters.
   sourceEncoding     encoding of the source HTML text (default: 'utf-8').
                      If not 'utf-8', HTML is converted to UTF-8.
                      If undefined, the encoding is guessed first.
   assertSourceAssumptions
                      make sure that before any operations the source is
                      in UTF-8 and contains no null bytes.
clean(html,options)

Remove unwanted tags from $html (text). $options is a mechanism for returning the title and base URL of the document and setting call-specific parameters.

To extract the title and base URL, set the fields 'title' and 'baseURL' to a defined value, e.g.:

my ($txt,$header)=$C->clean($html,
                            {title=>1,
                             baseURL=>1});

In $options you can also set the source and target encodings (sourceEncoding, targetEncoding). For example,

my ($txt,$header)=$C->clean($html,
                            {title=>1,
                             baseURL=>1,
                             sourceEncoding=>'iso-8859-1'});

will convert from 'iso-8859-1' to the default output encoding (UTF-8).

This will guess the encoding first:

my ($txt,$header)=$C->clean($html,
                            {title=>1,
                             baseURL=>1,
                             sourceEncoding=>undef});
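
For readers unfamiliar with what the 'iso-8859-1' to UTF-8 conversion involves at the byte level, the following sketch shows the equivalent step with the core Encode module. The helper name to_utf8 is hypothetical, and whether Alvis::HTML itself uses Encode internally is an assumption, not something this documentation states.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode $bytes from $source_encoding into UTF-8.
sub to_utf8 {
    my ($bytes, $source_encoding) = @_;
    my $chars = decode($source_encoding, $bytes);    # bytes -> Perl characters
    return encode('UTF-8', $chars);                  # characters -> UTF-8 bytes
}

# In ISO-8859-1 the letter e-acute is the single byte 0xE9;
# in UTF-8 it becomes the two bytes 0xC3 0xA9.
my $utf8 = to_utf8("caf\xE9", 'iso-8859-1');
```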

errmsg()

Returns a stack of error messages, if any. Empty string otherwise.

AUTHOR

Kimmo Valtonen, <kimmo.valtonen@hiit.fi>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

