NAME
HTML::DOM - A Perl implementation of the HTML Document Object Model
VERSION
Version 0.019 (alpha)
WARNING: This module is still at an experimental stage. The API is subject to change without notice.
SYNOPSIS
use HTML::DOM;
my $dom_tree = new HTML::DOM; # empty tree
$dom_tree->write($source_code);
$dom_tree->close;
my $other_dom_tree = new HTML::DOM;
$other_dom_tree->parse_file($filename);
$dom_tree->getElementsByTagName('body')->[0]->appendChild(
$dom_tree->createElement('input')
);
print $dom_tree->innerHTML, "\n";
my $text = $dom_tree->createTextNode('text');
$text->data; # get attribute
$text->data('new value'); # set attribute
DESCRIPTION
This module implements the HTML Document Object Model by extending the HTML::Tree modules. The HTML::DOM class serves both as an HTML parser and as the document class.
The following DOM modules are currently supported:
Feature Version (aka level)
------- -------------------
HTML 2.0
Core 2.0
Events 2.0
UIEvents 2.0
MouseEvents 2.0
MutationEvents 2.0 (partially)
StyleSheets 2.0
CSS 2.0 (partially)
CSS2 2.0
Views 2.0
StyleSheets, CSS and CSS2 are actually provided by CSS::DOM. This list corresponds to CSS::DOM versions 0.02 to 0.05.
METHODS
Construction and Parsing
- $tree = new HTML::DOM %options;
-
This class method constructs and returns a new HTML::DOM object. The %options, which are all optional, are as follows:
- url
-
The value that the URL method will return. This value is also used by the domain method.
- referrer
-
The value that the referrer method will return.
- response
-
An HTTP::Response object. This will be used for information needed for writing cookies. It is expected to have a reference to a request object (accessible via its request method; see HTTP::Response). Passing a parameter to the cookie method will be a no-op without this.
-
An HTTP::Cookies object. As with response, if you omit this, arguments passed to the cookie method will be ignored.
- charset
-
The original character set of the document. This does not affect parsing via the write method (which always assumes Unicode). parse_file will use this, if specified, or HTML::Encoding otherwise. HTML::DOM::Form's make_request method uses this to encode form data unless the form has a valid 'accept-charset' attribute.
If referrer and url are omitted, they can be inferred from response.
- $tree = new_from_file HTML::DOM
- $tree = new_from_content HTML::DOM
-
Not yet implemented. (I'm minded never to implement these, for the simple reason that you won't get all the options that the constructor provides.)
- $tree->elem_handler($elem_name => sub { ... })
-
If you call this method first, then, when the DOM tree is in the process of being built (as a result of a call to write or parse_file), the subroutine will be called after each $elem_name element is added to the tree. If you give '*' as the element name, the subroutine will be called for each element that does not have a handler. The subroutine's two arguments will be the tree itself and the element in question. The subroutine can call the DOM object's write method to insert HTML code into the source after the element.
Here is a lame example (which does not take Content-Script-Type headers or security into account):
$tree->elem_handler(script => sub {
    my($document,$elem) = @_;
    return unless $elem->attr('type') eq 'application/x-perl';
    eval($elem->firstChild->data);
});
$tree->write(
    '<p>The time is
         <script type="application/x-perl">
              $document->write(scalar localtime)
         </script>
         precisely.
     </p>'
);
$tree->close;
print $tree->documentElement->as_text, "\n";
(Note: HTML::DOM::Element's content_offset method might come in handy for reporting line numbers for script errors.)
- $tree->write(...) (DOM method)
-
This parses the HTML code passed to it, adding it to the end of the document. It assumes that its input is a normal Perl Unicode string. Like HTML::TreeBuilder's parse method, it can take a coderef.
When it is called from an element handler (see elem_handler, above), the value passed to it will be inserted into the HTML code after the current element when the element handler returns. (In this case a coderef won't do; maybe that will be added later.)
If the close method has been called, write will call open before parsing the HTML code passed to it.
- $tree->writeln(...) (DOM method)
-
Just like write except that it appends "\n" to its argument and does not work with code refs. (Rather pointless, if you ask me. :-)
- $tree->close() (DOM method)
-
Call this method to signal to the parser that the end of the HTML code has been reached. It will then parse any residual HTML that happens to be buffered. It also makes the next write call open.
- $tree->open (DOM method)
-
Deletes the HTML tree, resetting it so that it has just an <html> element, and a parser hungry for HTML code.
- $tree->parse_file($file)
-
This method takes a file name or handle and parses the content, (effectively) calling close afterwards. In the former case (a file name), HTML::Encoding will be used to detect the encoding. In the latter (a file handle), you'll have to binmode it yourself. This could be considered a bug. If you have a solution to this (how to make HTML::Encoding detect an encoding from a file handle), please let me know.
As of version 0.12, this method returns true upon success, or undef/empty list on failure.
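A minimal sketch of the file-handle case, assuming a UTF-8 encoded file: the handle is given its encoding layer with binmode before being passed to parse_file. The temporary file and its contents are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::DOM;

# Create a small UTF-8 encoded HTML file to parse (illustrative content).
my ($out, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "<title>Caf\x{e9}</title><p>Body text</p>";
close $out or die "Cannot close $filename: $!";

# With a file handle, decoding is the caller's job.
open my $in, '<', $filename or die "Cannot open $filename: $!";
binmode $in, ':encoding(UTF-8)';

my $tree = HTML::DOM->new;
$tree->parse_file($in);    # parses, then (effectively) closes the document
close $in;

# The title is a decoded Perl string, so this counts characters, not bytes.
print length($tree->title), "\n";
```

Passing a file name instead would leave encoding detection to HTML::Encoding, unless a charset was given to new.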
- $tree->charset
-
This method returns the name of the character set that was passed to new, or, if that was not given, that which parse_file used.
It returns undef if new was not given a charset and parse_file either was not used or was passed a file handle.
You can also set the charset by passing an argument, in which case the old value is returned.
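The construction and parsing methods above might be combined as in the following sketch. The URL and charset are placeholder values, and the markup is deliberately split across two write calls.

```perl
use strict;
use warnings;
use HTML::DOM;

# Placeholder option values; all constructor options are optional.
my $tree = HTML::DOM->new(
    url     => 'http://www.example.com/dir/page.html',
    charset => 'iso-8859-1',
);

# write() may be called repeatedly; markup can be split across calls.
$tree->write('<title>Demo</title><p>First ');
$tree->write('paragraph</p>');
$tree->close;    # parse any buffered HTML; the next write() will call open()

print $tree->title,   "\n";
print $tree->URL,     "\n";
print $tree->charset, "\n";

# Passing an argument sets the charset and returns the previous value.
my $old = $tree->charset('utf-8');
print "charset was $old\n";
```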
Other DOM Methods
- doctype
-
Returns nothing
- implementation
-
Returns the HTML::DOM::Implementation object.
- documentElement
-
Returns the <html> element.
- createElement ( $tag )
- createDocumentFragment
- createTextNode ( $text )
- createComment ( $text )
- createAttribute ( $name )
-
Each of these creates a node of the appropriate type.
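As a rough illustration of the node-creating methods, the sketch below builds a paragraph and a comment and attaches them to the body. The markup and attribute values are invented for the example; appendChild, setAttribute and getAttribute are the standard DOM Node and Element methods.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<body></body>');
$doc->close;

# Newly created nodes belong to $doc but are not in the tree yet.
my $para    = $doc->createElement('p');
my $text    = $doc->createTextNode('Hello, world');
my $comment = $doc->createComment(' generated ');

$para->appendChild($text);
$para->setAttribute(class => 'greeting');
$doc->body->appendChild($para);
$doc->body->appendChild($comment);

# createEntityReference (like createProcessingInstruction) throws an
# exception, so guard the call with eval.
my $supported = eval { $doc->createEntityReference('amp'); 1 };
print "entity references are not supported\n" unless $supported;
```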
- createProcessingInstruction
- createEntityReference
-
These two throw an exception.
- getElementsByTagName ( $name )
-
$name can be the name of the tag, or '*', to match all tag names. This returns a node list object in scalar context, or a list in list context.
- importNode ( $node, $deep )
-
Clones the $node, setting its ownerDocument attribute to the document with which this method is called. If $deep is true, the $node will be cloned recursively.
- alinkColor
- background
- bgColor
- fgColor
- linkColor
- vlinkColor
-
These six methods return (optionally set) the corresponding attributes of the body element. Note that most of the names do not map directly to the names of the attributes. fgColor refers to the text attribute. Those that end with 'linkColor' refer to the attributes of the same name but without the 'Color' on the end.
- title
-
Returns (or optionally sets) the title of the page.
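The name mapping described for the body-attribute methods, together with the title accessor, can be sketched as follows. The document content and colour values are arbitrary.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<title>Old title</title><body><p>Text</p></body>');
$doc->close;

$doc->title('New title');
$doc->fgColor('black');     # stored in the body's 'text' attribute
$doc->alinkColor('red');    # stored in the body's 'alink' attribute

my $body = $doc->body;
print $doc->title, "\n";
print $body->getAttribute('text'),  "\n";
print $body->getAttribute('alink'), "\n";
```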
- referrer
-
Returns the page's referrer.
- domain
-
Returns the domain name portion of the document's URL.
- URL
-
Returns the document's URL.
- body
-
Returns the body element, or the outermost frame set if the document has frames. You can set the body by passing an element as an argument, in which case the old body element is returned.
- images
- applets
- links
- forms
- anchors
-
These five methods each return a list of the appropriate elements in list context, or an HTML::DOM::Collection object in scalar context. In this latter case, the object will update automatically when the document is modified.
In the case of forms you can access those by using the HTML::DOM object itself as a hash. I.e., you can write $doc->{f} instead of $doc->forms->{f}.
- cookie
-
This returns a string containing the document's cookies (the format may still change). If you pass an argument, it will set a cookie as well. Both Netscape-style and RFC2965-style cookie headers are supported.
- getElementById
- getElementsByName
-
These two do what their names imply. The latter will return a list in list context, or a node list object in scalar context. Calling it in list context is probably more efficient.
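The lookup methods and their context-dependent return values might be exercised like this. The markup, id and name values are invented for the example.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write(
    '<p id="intro">One</p><p>Two</p><input name="q"><input name="q">'
);
$doc->close;

my $list  = $doc->getElementsByTagName('p');    # node list object
my @paras = $doc->getElementsByTagName('p');    # plain list
print $list->length, ' ', scalar(@paras), "\n";

# A node list object can also be used as an array ref.
print $list->[0]->firstChild->data, "\n";

my $intro = $doc->getElementById('intro');
my @named = $doc->getElementsByName('q');       # list context
print scalar(@named), "\n";
```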
- createEvent ( $category )
-
Creates a new event object, believe it or not.
The $category is the DOM event category, which determines what type of event object will be returned. The currently supported event categories are MouseEvents and UIEvents.
You can omit the $category to create an instance of the event base class (not officially part of the DOM).
- defaultView
-
Returns the HTML::DOM::View object associated with the document.
Note: Currently this has an object there by default, but this may change, since the object is fairly useless.
Although it is supposed to be read-only according to the DOM, you can set this attribute by passing an argument to it, in which case you should not count on any meaningful return value. It is still marked as read-only in %HTML::DOM::Interface.
If you do set it, it is recommended that the object be a subclass of HTML::DOM::View.
- styleSheets
-
Returns a CSS::DOM::StyleSheetList of the document's style sheets, or a simple list in list context.
- innerHTML
-
Serialises and returns the HTML document. If you pass an argument, it will set the contents of the document via open, write and close, returning a serialisation of the old contents.
- location
- set_location_object
-
location returns the location object, if you've put one there with set_location_object. HTML::DOM doesn't actually implement such an object itself, but provides the appropriate magic to make $doc->location($foo) translate into $doc->location->href($foo).
BTW, the location object had better be true when used as a boolean, or HTML::DOM will think it doesn't exist.
Other (Non-DOM) Methods
- $tree->event_attr_handler
- $tree->default_event_handler
- $tree->default_event_handler_for
- $tree->error_handler
-
See "EVENT HANDLING", below.
- $tree->base
-
Returns the base URL of the page; either from a <base href=...> tag or the URL passed to new.
HASH ACCESS
You can use an HTML::DOM object as a hash ref to access its form elements by name. So $doc->{yayaya} is short for $doc->forms->{yayaya}.
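A short sketch of the two equivalent spellings, assuming a document with a single form named "login" (the form and field names are hypothetical):

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write(
    '<form name="login" action="/in"><input name="user" value="me"></form>'
);
$doc->close;

my $via_hash  = $doc->{login};           # document used as a hash ref
my $via_forms = $doc->forms->{login};    # the longer spelling

print $via_hash->getAttribute('action'),  "\n";
print $via_forms->getAttribute('action'), "\n";
```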
EVENT HANDLING
HTML::DOM supports both the DOM Level 2 event model and the HTML 4 event model (at least in part; the HTMLEvent interface is not yet implemented; MutationEvents are not always triggered when they should be [see "BUGS", below]).
An event listener (aka handler) is a coderef, an object with a handleEvent method or an object with &{} overloading. HTML::DOM does not implement any classes that provide a handleEvent method, but will support any object that has one.
Default Actions
Default actions that HTML::DOM is capable of handling internally (such as triggering a DOMActivate event when an element is clicked, and triggering a form's submit event when the submit button is activated) are dealt with automatically. You don't have to worry about those. For others, read on....
To specify the default actions associated with an event, provide a subroutine (in this case, it not being part of the DOM, you can't use an object with a handleEvent method) via the default_event_handler_for and default_event_handler methods.
With the former, you can specify the default action to be taken when a particular type of event occurs. The currently supported types are:
submit when a form is submitted
link called when a link is activated (DOMActivate event)
Pass the type of event as the first argument and a code ref as the second argument. When the code ref is called, its sole argument will be the event object. For instance:
$dom_tree->default_event_handler_for( link => sub {
my $event = shift;
go_to( $event->target->href );
});
sub go_to { ... }
default_event_handler_for with just one argument returns the currently assigned coderef. With two arguments it returns the old one after assigning the new one.
Use default_event_handler (without the _for) to specify a fallback subroutine that will be used for events not in the list above, and for events in the list above that do not have subroutines assigned to them. Without any arguments it will return the currently assigned coderef. With an argument it will return the old one after assigning the new one.
Dispatching Events
HTML::DOM::Node's dispatchEvent method triggers the appropriate event listeners, but does not call any default actions associated with it. The return value is a boolean that indicates whether the default action should be taken.
HTML::DOM::Node's trigger_event method will trigger the event for real. It will call dispatchEvent and, provided it returns true, will call the default event handler.
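The difference between the two methods could be sketched as below. This assumes that trigger_event accepts an event name, and uses a made-up event type, "ping", so that the fallback default_event_handler applies; addEventListener and initEvent are the standard DOM Level 2 methods.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<p id="target">Text</p>');
$doc->close;
my $para = $doc->getElementById('target');

my @log;
$para->addEventListener(ping => sub { push @log, 'listener' }, 0);
$doc->default_event_handler(sub { push @log, 'default' });

# dispatchEvent runs the listeners only and reports whether the
# default action should still be taken.
my $event = $doc->createEvent;      # category omitted: event base class
$event->initEvent('ping', 1, 1);    # type, bubbles, cancelable
my $proceed = $para->dispatchEvent($event);

# trigger_event runs the listeners and then, unless a listener
# prevented it, the default handler.
$para->trigger_event('ping');

print join(', ', @log), "\n";
```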
HTML Event Attributes
The event_attr_handler can be used to assign a coderef that will turn text assigned to an event attribute (e.g., onclick) into a listener. The arguments to the routine will be (0) the element, (1) the name (aka type) of the event (without the initial 'on'), (2) the value of the attribute and (3) the offset within the source of the attribute's value. (Actually, if the value is within quotes, it is the offset of the first quotation mark. Also, it will be undef for generated HTML [source code passed to the write method by an element handler].) As with default_event_handler, you can replace an existing handler with a new one, in which case the old handler is returned. If you call this method without arguments, it returns the current handler. Here is an example of its use, that assumes that handlers are Perl code:
$dom_tree->event_attr_handler(sub {
my($elem, $name, $code, $offset) = @_;
my $sub = eval "sub { $code }";
return sub {
my($event) = @_;
local *_ = \$elem;
my $ret = &$sub;
defined $ret and !$ret and
$event->preventDefault;
};
});
The event attribute handler will be called whenever an element attribute whose name begins with 'on' (case-tolerant) is modified. (For efficiency's sake, I may change it to call the event attribute handler only when the event is triggered, so it is not called unnecessarily.)
When an Event Handler Dies
Use error_handler to assign a coderef that will be called whenever an event listener raises an error. The error will be contained in $@.
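A minimal sketch of collecting listener errors, assuming that trigger_event accepts an event name; the "ping" event type and the error text are invented for the example.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<p id="target">Text</p>');
$doc->close;

# Copy $@ straight away, since later evals may overwrite it.
my @errors;
$doc->error_handler(sub { push @errors, "$@" });

my $para = $doc->getElementById('target');
$para->addEventListener(ping => sub { die "listener failed\n" }, 0);
$para->trigger_event('ping');

print "caught: $_" for @errors;
```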
CLASSES AND DOM INTERFACES
Here is the inheritance hierarchy of HTML::DOM's various classes, together with the DOM interfaces those classes implement. The classes in the left column all begin with 'HTML::', which is omitted for brevity. Items in brackets have not yet been implemented. (See also HTML::DOM::Interface for a machine-readable list of standard methods.)
Class Inheritance Hierarchy Interfaces
--------------------------- ----------
DOM::Exception DOMException, EventException
DOM::Implementation DOMImplementation, [DOMImplementationCSS]
Element
DOM::Node Node, EventTarget
DOM::DocumentFragment DocumentFragment
DOM Document, HTMLDocument, DocumentEvent, DocumentView, DocumentStyle, [DocumentCSS]
DOM::CharacterData CharacterData
DOM::Text Text
DOM::Comment Comment
DOM::Element Element, HTMLElement, ElementCSSInlineStyle
DOM::Element::HTML HTMLHtmlElement
DOM::Element::Head HTMLHeadElement
DOM::Element::Link HTMLLinkElement, LinkStyle
DOM::Element::Title HTMLTitleElement
DOM::Element::Meta HTMLMetaElement
DOM::Element::Base HTMLBaseElement
DOM::Element::IsIndex HTMLIsIndexElement
DOM::Element::Style HTMLStyleElement, LinkStyle
DOM::Element::Body HTMLBodyElement
DOM::Element::Form HTMLFormElement
DOM::Element::Select HTMLSelectElement
DOM::Element::OptGroup HTMLOptGroupElement
DOM::Element::Option HTMLOptionElement
DOM::Element::Input HTMLInputElement
DOM::Element::TextArea HTMLTextAreaElement
DOM::Element::Button HTMLButtonElement
DOM::Element::Label HTMLLabelElement
DOM::Element::FieldSet HTMLFieldSetElement
DOM::Element::Legend HTMLLegendElement
DOM::Element::UL HTMLUListElement
DOM::Element::OL HTMLOListElement
DOM::Element::DL HTMLDListElement
DOM::Element::Dir HTMLDirectoryElement
DOM::Element::Menu HTMLMenuElement
DOM::Element::LI HTMLLIElement
DOM::Element::Div HTMLDivElement
DOM::Element::P HTMLParagraphElement
DOM::Element::Heading HTMLHeadingElement
DOM::Element::Quote HTMLQuoteElement
DOM::Element::Pre HTMLPreElement
DOM::Element::Br HTMLBRElement
DOM::Element::BaseFont HTMLBaseFontElement
DOM::Element::Font HTMLFontElement
DOM::Element::HR HTMLHRElement
DOM::Element::Mod HTMLModElement
DOM::Element::A HTMLAnchorElement
DOM::Element::Img HTMLImageElement
DOM::Element::Object HTMLObjectElement
DOM::Element::Param HTMLParamElement
DOM::Element::Applet HTMLAppletElement
DOM::Element::Map HTMLMapElement
DOM::Element::Area HTMLAreaElement
DOM::Element::Script HTMLScriptElement
DOM::Element::Table HTMLTableElement
DOM::Element::Caption HTMLTableCaptionElement
DOM::Element::TableColumn HTMLTableColElement
DOM::Element::TableSection HTMLTableSectionElement
DOM::Element::TR HTMLTableRowElement
DOM::Element::TableCell HTMLTableCellElement
DOM::Element::FrameSet HTMLFrameSetElement
DOM::Element::Frame HTMLFrameElement
DOM::Element::IFrame HTMLIFrameElement
DOM::NodeList NodeList
DOM::NodeList::Radio
DOM::NodeList::Magic NodeList
DOM::NamedNodeMap NamedNodeMap
DOM::Attr Node, Attr, EventTarget
DOM::Collection HTMLCollection
DOM::Collection::Elements
DOM::Collection::Options
DOM::Event Event
DOM::Event::UI UIEvent
DOM::Event::Mouse MouseEvent
DOM::Event::Mutation MutationEvent
DOM::View AbstractView, [ViewCSS]
The EventListener interface is not implemented by HTML::DOM, but is supported. See "EVENT HANDLING", above.
Not listed above is HTML::DOM::EventTarget, which is a base class both for HTML::DOM::Node and HTML::DOM::Attr. The format I'm using above doesn't allow for multiple inheritance, so I probably need to redo it.
Although HTML::DOM::Node inherits from HTML::Element, the interface is not entirely compatible. In particular:
Any methods that expect text nodes to be just strings are unreliable. See the note under "objectify_text" in HTML::Element.
HTML::Element's tree-manipulation methods don't trigger mutation events.
HTML::Element's delete method is not necessary, because HTML::DOM uses weak references (for 'upward' references in the object tree).
IMPLEMENTATION NOTES
Objects' attributes are accessed via methods of the same name. When the method is invoked, the current value is returned. If an argument is supplied, the attribute is set (unless it is read-only) and its old value returned.
Where the DOM spec. says to use null, undef or an empty list is used.
Instead of UTF-16 strings, HTML::DOM uses Perl's Unicode strings (which happen to be stored as UTF-8 internally). The only significant difference this makes is to length, substringData and other methods of Text and Comment nodes. These methods behave in a Perlish way (i.e., the offsets and lengths are specified in Unicode characters, not in UTF-16 code units). The alternate methods length16, substringData16 et al. use UTF-16 for offsets and are standards-compliant in that regard (but the string returned by substringData is still a regular Perl string).
Each method that returns a NodeList will return a NodeList object in scalar context, or a simple list in list context. You can use the object as an array ref in addition to calling its item and length methods.
In cases where a method is supposed to return something implementing the DOMTimeStamp interface, a simple Perl scalar is returned, containing the time as returned by Perl’s built-in time function.
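Two of the conventions above, accessor methods that return the old value when setting and the two ways of counting length, can be illustrated with a text node. The string is arbitrary; it was chosen because U+1D11E lies outside the Basic Multilingual Plane and so occupies two UTF-16 code units.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc  = HTML::DOM->new;
my $text = $doc->createTextNode("\x{1D11E}abc");

print $text->length,   "\n";    # counted in Unicode characters
print $text->length16, "\n";    # counted in UTF-16 code units

# Supplying an argument sets the attribute and returns the old value.
my $old = $text->data('new value');
print length($old), "\n";
print $text->data,  "\n";
```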
PREREQUISITES
perl 5.8.2 or later
Exporter 5.57 or later
HTML::TreeBuilder and HTML::Element (both part of the HTML::Tree distribution) (tested with 3.23)
URI.pm (tested with 1.35)
LWP 5.13 or later (for the cookie method and a form's make_request method to work)
CSS::DOM 0.05 or later is required if you use any of the style sheet features.
Scalar::Util 1.14 or later
HTML::Encoding is required if a file name is passed to parse_file.
BUGS
-
Setting a boolean attribute to true through the DOM Level 0 interface is supposed to set the attribute's value to the attribute's name. Right now it sets the value to whatever true value you pass it.
To report bugs, please e-mail the author.
AUTHOR, COPYRIGHT & LICENSE
Copyright (C) 2007-8 Father Chrysostomos
$text = new HTML::DOM ->createTextNode('sprout');
$text->appendData('@');
$text->appendData('cpan.org');
print $text->data, "\n";
This program is free software; you may redistribute it and/or modify it under the same terms as perl.
SEE ALSO
HTML::DOM::Exception, HTML::DOM::Node, HTML::DOM::Event, HTML::DOM::Interface
HTML::Tree, HTML::TreeBuilder, HTML::Element, HTML::Parser, LWP, WWW::Mechanize, HTTP::Cookies, WWW::Mechanize::Plugin::JavaScript, HTML::Form, HTML::Encoding
The DOM Level 1 specification at http://www.w3.org/TR/REC-DOM-Level-1
The DOM Level 2 Core specification at http://www.w3.org/TR/DOM-Level-2-Core
The DOM Level 2 Events specification at http://www.w3.org/TR/DOM-Level-2-Events
etc.