NAME
HTML::Parser::Simple
- Parse nice HTML files without needing a compiler
Synopsis
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser::Simple;
# -------------------------
# Method 1:
my($p) = HTML::Parser::Simple -> new
(
	{
		input_dir  => '/source/dir',
		output_dir => '/dest/dir',
	}
);
$p -> parse_file('in.html', 'out.html');
# Method 2 (re-uses $p, since declaring it twice would trigger a warning):
$p = HTML::Parser::Simple -> new();
$p -> parse('<html>...</html>');
$p -> traverse($p -> get_root() );
print $p -> result();
Description
HTML::Parser::Simple is a pure Perl module.
It parses HTML V 4 files, and generates a tree of nodes, one per HTML tag.
The data associated with each node is documented in the FAQ.
Distributions
This module is available as a Unix-style distro (*.tgz).
See http://savage.net.au/Perl-modules.html for details.
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.
Constructor and initialization
new(...) returns an object of type HTML::Parser::Simple.
This is the class's constructor.
Usage: HTML::Parser::Simple -> new().
This method takes a hashref of options.
Call new() as new({option_1 => value_1, option_2 => value_2, ...}).
Available options:
- input_dir
-
This takes the path where the input file is to be read from.
The default value is '' (the empty string).
- output_dir
-
This takes the path where the output file is to be written.
The default value is '' (the empty string).
- verbose
-
This takes either a 0 or a 1.
Write more or less progress messages to STDERR.
The default value is 0.
Note: Currently, setting verbose does nothing.
- xhtml
-
This takes either a 0 or a 1.
0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?> at the start of the input file, and some other XHTML features.
1 means accept it.
The default value is 0.
Warning: The only XHTML changes to this code, so far, are:
Method: get_current_node()
Returns the Tree::Simple object which the parser calls the current node.
Method: get_depth()
Returns the nesting depth of the current tag.
It's just there in case you need it.
Method: get_input_dir()
Returns the input_dir parameter, as passed in to new().
Method: get_output_dir()
Returns the output_dir parameter, as passed in to new().
Method: get_node_type()
Returns the type of the most recently created node, 'global', 'head', or 'body'.
See the first question in the FAQ for details.
Method: get_root()
Returns the node which the parser calls the root of the tree of nodes.
Method: get_verbose()
Returns the verbose parameter, as passed in to new().
Method: get_xhtml()
Returns the xhtml parameter, as passed in to new().
Method: log($msg)
Print $msg to STDERR if new() was called as new({verbose => 1}), or if $p -> set_verbose(1) was called.
Otherwise, print nothing.
Method: parse($html)
Parses the string of HTML in $html, and builds a tree of nodes.
After calling $p -> parse(), you must call $p -> traverse($p -> get_root() ) before calling $p -> result().
Alternatively, call $p -> parse_file(), which calls all these methods for you.
Note: parse() may be called directly or via parse_file().
Method: parse_file($input_file_name, $output_file_name)
Parses the HTML in the input file, and writes the result to the output file.
Method: result()
Returns the result so far of the parse.
Method: set_current_node($node)
Sets the node which the parser calls the current node.
Returns undef.
Method: set_depth($depth)
Sets the nesting depth of the current node.
Returns undef.
It's just there in case you need it.
Method: set_input_dir($dir_name)
Sets the input_dir parameter, as though it was passed in to new().
Returns undef.
Method: set_output_dir($dir_name)
Sets the output_dir parameter, as though it was passed in to new().
Returns undef.
Method: set_node_type($node_type)
Sets the type of the next node to be created, 'global', 'head', or 'body'.
See the first question in the FAQ for details.
Returns undef.
Method: set_root($node)
Sets the node which the parser calls the root of the tree of nodes.
Returns undef.
Method: set_verbose($Boolean)
Sets the verbose parameter, as though it was passed in to new().
Returns undef.
Method: set_xhtml($Boolean)
Sets the xhtml parameter, as though it was passed in to new().
Returns undef.
FAQ
- What is the format of the data stored in each node of the tree?
-
The data of each node is a hash ref. The keys/values of this hash ref are:
- attributes
-
This is the string of HTML attributes associated with the HTML tag.
So, <table align = 'center' bgColor = '#80c0ff' summary = 'Body'> will have an attributes string of " align = 'center' bgColor = '#80c0ff' summary = 'Body'".
Note the leading space.
- content
-
This is an array ref of bits and pieces of content.
Consider this fragment of HTML:
<p>I did <i>not</i> say I <i>liked</i> debugging.</p>
When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.
So, 'I did ' is stored in the 0th element of the array ref.
Likewise, 'not' is stored in the 0th element of the array ref belonging to the node 'i'.
Next, ' say I ' is stored in the 1st element of the array ref, because it follows the 1st child node (<i>).
Likewise, ' debugging.' is stored in the 2nd element, because it follows the 2nd child node.
This way, the input string can be reproduced by successively outputting the elements of the array ref of content interspersed with the contents of the child nodes (processed recursively).
Note: If you are processing this tree, never forget that there can be content after the last child node has been closed, but before the current node is closed.
Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.
- depth
-
The nesting depth of the tag within the document.
The root is at depth 0, '<html>' is at depth 1, '<head>' and '<body>' are at depth 2, and so on.
It's just there in case you need it.
- name
-
The name of the HTML tag.
So, the tag '<html>' will mean the name is 'html'.
The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.
The root has the node 'html' as the only child, of course.
- node_type
-
This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.
It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.
It's just there in case you need it.
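The interleaving of content and child nodes described above can be sketched in plain Perl. This is only an illustration under stated assumptions: each node is modelled as an ordinary hash with the documented 'name', 'attributes' and 'content' keys plus a hypothetical 'children' array ref, and rebuild() is a made-up helper name. The module itself stores its data in Tree::Simple nodes, and this sketch neither uses that API nor handles tags without a closing tag.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tree node: a plain hash, not a Tree::Simple object.
# content[$i] holds the text which precedes child $i; any element after the
# last child is text found before the node's closing tag.
sub rebuild
{
	my($node)     = @_;
	my(@content)  = @{$node->{content} };
	my(@children) = @{$node->{children} };
	my($html)     = "<$node->{name}$node->{attributes}>";

	for my $i (0 .. $#children)
	{
		$html .= $content[$i] if (defined $content[$i]);
		$html .= rebuild($children[$i]);
	}

	# Content after the last child node, but before the closing tag.
	for my $i (scalar(@children) .. $#content)
	{
		$html .= $content[$i] if (defined $content[$i]);
	}

	return $html . "</$node->{name}>";
}

my($i_1)  = {name => 'i', attributes => '', content => ['not'],   children => []};
my($i_2)  = {name => 'i', attributes => '', content => ['liked'], children => []};
my($para) =
{
	name       => 'p',
	attributes => '',
	content    => ['I did ', ' say I ', ' debugging.'],
	children   => [$i_1, $i_2],
};

print rebuild($para), "\n";
```

Because the attributes string keeps its leading space, it can be appended directly after the tag name without any extra separator.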
- How are HTML comments handled?
-
They are treated as content. This includes the prefix '<!--' and the suffix '-->'.
- How is DOCTYPE handled?
-
It is treated as content belonging to the root of the tree.
- How is the XML declaration handled?
-
It is treated as content belonging to the root of the tree.
- Does this module handle all HTML pages?
-
No, never.
- Which versions of HTML does this module handle?
-
Up to V 4.
- What do I do if this module does not handle my HTML page?
-
Make yourself a nice cup of tea, and then fix your page.
- Does this validate the HTML input?
-
No.
For example, if you feed in an HTML page without the title tag, this module does not care.
- How do I view the output HTML?
-
By installing HTML::Revelation, of course!
Sample output:
http://savage.net.au/Perl-modules/html/CreateTable.html
- How do I test this module (or my file)?
-
Suggested steps:
Note: There are quite a few files involved. Proceed with caution.
- Select an HTML file to test
-
Call this input.html.
- Run input.html thru reveal.pl
-
Reveal.pl ships with HTML::Revelation.
Call the output file output.1.html.
- Run input.html thru parse.html.pl
-
Parse.html.pl ships with HTML::Parser::Simple.
Call the output file parsed.html.
- Run parsed.html thru reveal.pl
-
Call the output file output.2.html.
- Compare output.1.html and output.2.html
-
If they match, or even if they don't match, you're finished.
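The final comparison step could be scripted along these lines. This is a minimal sketch using the core module File::Compare; report() is a hypothetical helper name, and the file names are the ones suggested in the steps above. How reveal.pl and parse.html.pl are invoked is not shown, since their command lines are not documented here.

```perl
use strict;
use warnings;

use File::Compare;

# compare() returns 0 if the files are equal, 1 if they differ, and -1 on error
# (for instance, when one of the files cannot be opened).
sub report
{
	my($file_1, $file_2) = @_;
	my($status)          = compare($file_1, $file_2);

	return $status == 0 ? 'match' : $status == 1 ? 'differ' : 'error';
}

print report('output.1.html', 'output.2.html'), "\n";
```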
- Will you implement a 'quirks' mode to handle my special HTML file?
-
No, never.
Help with quirks:
http://www.quirksmode.org/sitemap.html
- Is there anything I should be aware of?
-
Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.
In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.
The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.
I don't define 'a' to be inline, others do, e.g. http://www.w3.org/TR/html401/ and hence HTML::Tagset.
Inline means:
<a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>
will not be parsed as an 'a' containing a 'div'.
The 'a' tag will be closed before the 'div' is opened. So, the result will look like:
<a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>
To achieve what was presumably intended, use 'span':
<a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>
Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.
Of course, this is just one of a vast set of possible problems.
You have been warned.
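The re-nesting described above can be imitated with a short, self-contained sketch. It is not the module's code: renest() is a hypothetical helper, and the two tag tables are deliberately tiny assumptions (the module takes its definitions from HTML::Tagset). The rule shown is the one stated above: opening a block-level tag first closes any inline tags which are still open.

```perl
use strict;
use warnings;

# Assumed, deliberately tiny tag tables, for illustration only.
my(%block)  = map{($_ => 1)} qw(div p table);
my(%inline) = map{($_ => 1)} qw(a i span);

sub renest
{
	my($html) = @_;
	my($out)  = '';
	my(@stack);

	# Split into tags and text. The capture group keeps the tags as tokens.
	for my $token (grep{length $_} split(/(<[^>]+>)/, $html) )
	{
		if ($token =~ m|^</(\w+)|)
		{
			my($name) = lc $1;

			# Skip a close tag whose element has already been closed.
			next if (! grep{$_ eq $name} @stack);

			# Close everything down to and including the matching open tag.
			while (@stack)
			{
				my($top) = pop @stack;
				$out    .= "</$top>";

				last if ($top eq $name);
			}
		}
		elsif ($token =~ m|^<(\w+)|)
		{
			my($name) = lc $1;

			if ($block{$name})
			{
				# A block-level tag closes any inline tags still open.
				$out .= '</' . pop(@stack) . '>' while (@stack && $inline{$stack[-1]});
			}

			push @stack, $name;

			$out .= $token;
		}
		else
		{
			$out .= $token;
		}
	}

	return $out;
}

print renest(q|<a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>|), "\n";
```

Feeding the 'span' version of the same fragment through renest() leaves it unchanged, since 'span' is in the inline table and so never forces the 'a' to close.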
- Why did you use Tree::Simple but not Tree or Tree::Fast or Tree::DAG_Node?
-
During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.
Late news: Tree does not cope with an array ref stored in the metadata, so I've switched to Tree::DAG_Node.
Stop press: As an experiment I switched to Tree::Simple. Since it also works I'll just keep using it.
- Why isn't this module called HTML::Parser::PurePerl?
- How do I output my own stuff while traversing the tree?
- Is the code on github?
-
Yes. See: git://github.com/ronsavage/html--parser--simple.git
- How is the source formatted?
-
I edit with Emacs, using the default formatting for Perl.
That means, in general, leading 4-space tabs. Hashrefs use a leading tab and then a space.
All vertical alignment within lines is done manually with spaces.
Perl::Critic is off the agenda.
Credits
This Perl HTML parser has been converted from a JavaScript one written by John Resig.
http://ejohn.org/files/htmlparser.js
Well done John!
Note also the comments published here:
http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58
Author
HTML::Parser::Simple was written by Ron Savage <ron@savage.net.au> in 2009.
Home page: http://savage.net.au/index.html
Copyright
Australian copyright (c) 2009 Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License, a copy of which is available at:
http://www.opensource.org/licenses/index.html