NAME

HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS

use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
<p>Baz</br>Quux.
EOT

my $fdoc   = $parser->parse_file( $html_file_name );
my $fhdoc  = $parser->parse_fh( $html_file_handle );

DESCRIPTION

This library is substantially the same as the non-CPAN module Whatpm::HTML. Changes include:

Provides an XML::LibXML-like DOM interface. If you usually use XML::LibXML's DOM parser, this should be a drop-in solution for tag soup HTML.
Constructs an XML::LibXML::Document as the result of parsing.
Removes external dependencies on non-CPAN packages, via bundling and modifications.
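
Because the result is an ordinary XML::LibXML::Document, the usual XML::LibXML DOM calls should work on it. The following is a minimal sketch, assuming HTML::HTML5::Parser and XML::LibXML are installed; the markup is the tag soup from the synopsis.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
<p>Baz</br>Quux.
EOT

# The class of the returned object.
print ref($doc), "\n";

# Standard XML::LibXML DOM methods can walk the repaired tree.
my @paragraphs = $doc->getElementsByTagName('p');
print scalar(@paragraphs), " paragraph(s) found\n";
print $_->textContent, "\n" for @paragraphs;

# Serialise the repaired markup as XML.
print $doc->toString, "\n";
```

Looking elements up by name avoids having to register a namespace prefix, which an XPath query against the parsed tree may require.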

Constructor

new

$parser = HTML::HTML5::Parser->new;

The constructor does not do anything interesting.

XML::LibXML-Compatible Methods

parse_file, parse_html_file

$doc = $parser->parse_file( $html_file_name [,\%opts] );

This function parses an HTML document from a file or the network; $html_file_name can be either a filename or a URL.

Options include 'encoding' to indicate file encoding (e.g. 'utf-8') and 'user_agent' which should be a blessed LWP::UserAgent object to be used when retrieving URLs.

If requesting a URL and the response Content-Type header indicates an XML-based media type (such as XHTML), XML::LibXML::Parser will be used automatically (instead of the tag soup parser). The XML parser can be told to use a DTD catalogue by setting the option 'xml_catalogue' to the filename of the catalogue.

HTML (tag soup) parsing can be forced using the option 'force_html', even when an XML media type is returned. If an options hashref was passed, parse_file will set $options->{'parser_used'} to the name of the class used to parse the URL, to allow the calling code to double-check which parser was used afterwards.

If an options hashref was passed, parse_file will set $options->{'response'} to the HTTP::Response object obtained by retrieving the URI.
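
The option handling described above might be exercised as follows. This is a sketch only: it parses a temporary local file so that it does not depend on the network, and the file contents are invented for illustration. Whether 'parser_used' is populated for local files as well as URLs is an assumption, so the code checks before printing it. For a URL, one would additionally pass 'user_agent' with an LWP::UserAgent object, as described above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Write a small HTML file to parse (illustrative content only).
my ($fh, $filename) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} "<!doctype html>\n<title>Demo</title>\n<p>Hello\n";
close $fh;

# Options are passed as a hashref, which the parser may also write back to.
my %opts = ( encoding => 'utf-8' );

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_file( $filename, \%opts );

# Inspect what the parser reported back, if anything.
my $used = defined $opts{parser_used} ? $opts{parser_used} : '(not set)';
print "Parser used: $used\n";

my ($title) = $doc->getElementsByTagName('title');
print "Title: ", $title->textContent, "\n" if $title;
```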

parse_fh, parse_html_fh

$doc = $parser->parse_fh( $io_fh [,\%opts] );

parse_fh() parses HTML read from an IOREF or from an instance of a subclass of IO::Handle.

Options include 'encoding' to indicate file encoding (e.g. 'utf-8').
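
As a rough illustration, any readable handle should do. The sketch below opens an in-memory filehandle on a string (a core Perl feature) so that it needs no file on disk; the markup is made up for the example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

# Illustrative markup held in a scalar.
my $html = "<!doctype html>\n<title>From a handle</title>\n<p>Hello\n";

# Open a read-only, in-memory filehandle on the scalar.
open my $io_fh, '<', \$html or die "Cannot open in-memory handle: $!";

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_fh( $io_fh, { encoding => 'utf-8' } );
close $io_fh;

# The missing <html>, <head> and <body> elements are supplied by the parser.
print $doc->documentElement->nodeName, "\n";
```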

parse_string, parse_html_string

$doc = $parser->parse_string( $html_string [,\%opts] );

This function is similar to parse_fh(), but it parses an HTML document that is available as a single string in memory.

Options include 'encoding' to indicate the encoding of the string (e.g. 'utf-8').

The push parser and SAX-based parser are not supported. Trying to change an option (such as recover_silently) will make HTML::HTML5::Parser carp a warning. (But you can inspect the options.)

Additional Methods

The module provides a few additional methods for obtaining non-DOM data from DOM nodes.

compat_mode

$mode = $parser->compat_mode( $doc );

Returns 'quirks', 'limited quirks' or undef (standards mode).
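
A short sketch of how the return value might be used. It assumes that a document with the HTML5 doctype is parsed in standards mode and that one with no doctype at all is parsed in quirks mode, as the HTML5 parsing algorithm specifies.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# One document with the HTML5 doctype, one with no doctype at all.
my $standards_doc = $parser->parse_string("<!doctype html>\n<title>A</title>\n<p>Text\n");
my $quirks_doc    = $parser->parse_string("<title>B</title>\n<p>Text\n");

for my $doc ( $standards_doc, $quirks_doc ) {
    my $mode = $parser->compat_mode( $doc );

    # undef signals standards mode, so test definedness before printing.
    print defined $mode ? "Mode: $mode\n" : "Mode: standards\n";
}

my $standards_mode = $parser->compat_mode( $standards_doc );
my $quirks_mode    = $parser->compat_mode( $quirks_doc );
```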

dtd_public_id

$pubid = $parser->dtd_public_id( $doc );

For an XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will tell you the Public Identifier of the DTD used (if any).

dtd_system_id

$sysid = $parser->dtd_system_id( $doc );

For an XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will tell you the System Identifier of the DTD used (if any).
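
Both identifier methods can be tried on a document that declares a full DOCTYPE. The sketch below uses the well-known HTML 4.01 Strict declaration; the rest of the markup is illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<title>Doctype demo</title>
<p>Hello
EOT

my $pubid = $parser->dtd_public_id( $doc );
my $sysid = $parser->dtd_system_id( $doc );

# Either identifier may be undefined if the DOCTYPE did not supply it.
print "Public ID: ", ( defined $pubid ? $pubid : '(none)' ), "\n";
print "System ID: ", ( defined $sysid ? $sysid : '(none)' ), "\n";
```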

source_line

($line, $col) = $parser->source_line( $node );
$line = $parser->source_line( $node );

In scalar context, source_line returns the line number of the source code that started a particular node (element, attribute or comment).

In list context, returns a line/column pair. (Tab characters count as one column, not eight.)
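
For instance, source positions could be reported for each paragraph of a parsed string. This is a sketch under the assumption that positions are recorded for every element the parser creates from an explicit start tag; the markup is illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<!doctype html>
<title>Positions</title>
<p>One
<p>Two
EOT

my @paragraphs = $doc->getElementsByTagName('p');

for my $p (@paragraphs) {
    # List context: line and column of the tag that started the node.
    my ($line, $col) = $parser->source_line( $p );

    my $text = $p->textContent;
    $text =~ s/\s+\z//;    # trim the trailing newline captured in the text
    print "'$text' starts at line $line, column $col\n";
}

# Scalar context: line number only.
my $first_line  = $parser->source_line( $paragraphs[0] );
my $second_line = $parser->source_line( $paragraphs[1] );
```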

AUTHOR

Toby Inkster, <tobyink@cpan.org>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.1 or, at your option, any later version of Perl 5 you may have available.
