NAME

HTML::Element - Class for objects that represent HTML elements

SYNOPSIS

use HTML::Element;
$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$a->push_content("The Perl Homepage");

$tag = $a->tag;
print "$tag starts out as:",  $a->starttag, "\n";
print "$tag ends as:",  $a->endtag, "\n";
print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

$links_r = $a->extract_links();
print "Hey, I found ", scalar(@$links_r), " links.\n";

print "And that, as HTML, is: ", $a->as_HTML, "\n";
$a = $a->delete;

DESCRIPTION

Objects of the HTML::Element class can be used to represent elements of HTML. These objects have attributes, notably attributes that designate the element's parent and content. The content is an array of text segments and other HTML::Element objects. A tree with HTML::Element objects as nodes can represent the syntax tree for an HTML document.

HOW WE REPRESENT TREES

It may occur to you to wonder what exactly a "tree" is, and how it's represented in memory. Consider this HTML document:

<html lang='en-US'>
  <head>
    <title>Stuff</title>
    <meta name='author' content='Jojo'>
  </head>
  <body>
   <h1>I like potatoes</h1>
  </body>
</html>

Building a syntax tree out of it makes a tree-structure in memory that could be diagrammed as:

              html (lang='en-US')
               / \
             /     \
           /         \
         head        body
        /\               \
      /    \               \
    /        \               \
  title     meta              h1
   |       (name='author',     |
"Stuff"    content='Jojo')    "I like potatoes"

This is the traditional way to diagram a tree, with the "root" at the top, and it's this kind of diagram that people have in mind when they say, for example, that "the meta element is under the head element instead of under the body element". (The same is also said with "inside" instead of "under" -- the use of "inside" makes more sense when you're looking at the HTML source.)

Another way to represent the above tree is with indenting:

html (attributes: lang='en-US')
  head
    title
      "Stuff"
    meta (attributes: name='author' content='Jojo')
  body
    h1
      "I like potatoes"

Incidentally, diagramming with indenting works much better for very large trees, and is easier for a program to generate. The $tree->dump method uses indentation just that way.

However you diagram the tree, it's stored the same in memory -- it's a network of objects, each of which has attributes like so:

element #1:  _tag: 'html'
             _parent: none
             _content: [element #2, element #5]
             lang: 'en-US'

element #2:  _tag: 'head'
             _parent: element #1
             _content: [element #3, element #4]

element #3:  _tag: 'title'
             _parent: element #2
             _content: [text segment "Stuff"]

element #4:  _tag: 'meta'
             _parent: element #2
             _content: none
             name: 'author'
             content: 'Jojo'

element #5:  _tag: 'body'
             _parent: element #1
             _content: [element #6]

element #6:  _tag: 'h1'
             _parent: element #5
             _content: [text segment "I like potatoes"]

The "treeness" of the tree-structure that these elements comprise is not an aspect of any particular object, but is emergent from the relatedness attributes (_parent and _content) of these element-objects and from how you use them to get from element to element.
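
As a rough illustration of that point, the sketch below builds the example tree by hand and then walks it using nothing but $h->tag, $h->content_list, and $h->parent. The outline() subroutine is a hypothetical helper written for this example; it is not part of HTML::Element.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example tree one element at a time.
my $html  = HTML::Element->new('html', lang => 'en-US');
my $head  = HTML::Element->new('head');
my $title = HTML::Element->new('title');
my $meta  = HTML::Element->new('meta', name => 'author', content => 'Jojo');
my $body  = HTML::Element->new('body');
my $h1    = HTML::Element->new('h1');

$title->push_content('Stuff');
$h1->push_content('I like potatoes');
$head->push_content($title, $meta);
$body->push_content($h1);
$html->push_content($head, $body);

# outline() is a hypothetical helper: it returns an indented outline of
# the subtree at $node, following only the content lists.
sub outline {
    my ($node, $depth) = @_;
    my $pad = '  ' x $depth;
    return qq{$pad"$node"\n} unless ref $node;    # a text segment
    return $pad . $node->tag . "\n"
         . join('', map { outline($_, $depth + 1) } $node->content_list);
}

print outline($html, 0);

# The same links can be followed upward.
print "h1's parent is ", $h1->parent->tag, "\n";
print "html is the root\n" unless $html->parent;
```

Nothing in this sketch stores "the tree" as such; the outline comes entirely from following each element's content list.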

While you could access the content of a tree by writing code that says "access the 'src' attribute of the root's first child's seventh child's third child", you're more likely to have to scan the contents of a tree, looking for whatever nodes, or kinds of nodes, you want to do something with. The most straightforward way to look over a tree is to "traverse" it; an HTML::Element method ($h->traverse) is provided for this purpose; and several other HTML::Element methods are based on it.

(For everything you ever wanted to know about trees, and then some, see Donald Knuth's The Art of Computer Programming, Volume 1.)

BASIC METHODS

$h = HTML::Element->new('tag', 'attrname' => 'value', ... )

This constructor method returns a new HTML::Element object. The tag name is a required argument; it will be forced to lowercase. Optionally, you can specify other initial attributes at object creation time.

$h->attr('attr') or $h->attr('attr', 'value')

Returns (optionally sets) the value of the given attribute of $h. The attribute name (but not the value, if provided) is forced to lowercase. If setting a new value, the old value of that attribute is returned. If methods are provided for accessing an attribute (like $h->tag, $h->content_list, etc. below), use those instead of calling $h->attr, whether for reading or setting.

Note that setting an attribute to undef (as opposed to "", the empty string) actually deletes the attribute.
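
A minimal sketch of those rules follows. The URLs and the 'title' attribute are arbitrary values chosen for illustration, and $h->all_external_attr (described later in this section) is used only to check whether the attribute is still present.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('A', href => 'http://www.perl.com/');
print $link->tag, "\n";    # the tag name was forced to lowercase

# Setting returns the previous value; the attribute name is lowercased.
my $old = $link->attr('HREF', 'http://www.cpan.org/');
print "was $old, now ", $link->attr('href'), "\n";

# An empty string is still a value, so the attribute stays present...
$link->attr('title', '');
my %with_title = $link->all_external_attr;
print "title is present\n" if exists $with_title{title};

# ...whereas undef removes the attribute altogether.
$link->attr('title', undef);
my %without_title = $link->all_external_attr;
print "title is gone\n" unless exists $without_title{title};
```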

$h->tag() or $h->tag('tagname')

Returns (optionally sets) the tag name (also known as the generic identifier) for the element $h. In setting, the tag name is always converted to lower case.

$h->parent() or $h->parent($new_parent)

Returns (optionally sets) the parent for this element. The parent should either be undef, or should be another element.

You should not use this to directly set the parent of an element. Instead use any of the other methods under "Structure-Modifying Methods", below.

Note that not($h->parent) is a simple test for whether $h is the root of its subtree.

$h->content_list()

Returns a list representing the content of this element -- i.e., what nodes (elements or text segments) are inside/under this element. (Note that this may be an empty list.)

In a scalar context, this returns the count of the items, as you may expect.

$h->content()

This somewhat deprecated method returns the content of this element; but unlike content_list, this returns either undef (which you should understand to mean no content), or a reference to the array of content items, each of which is either a text segment (a string, i.e., a defined non-reference scalar value), or an HTML::Element object. Note that even if an arrayref is returned, it may be a reference to an empty array.

While older code should feel free to continue to use $h->content, new code should use $h->content_list in almost all conceivable cases. It is my experience that in most cases this leads to simpler code anyway, since it means one can say:

@children = $h->content_list;

instead of the inelegant:

@children = @{$h->content || []};

If you do use $h->content, you should not use the reference returned by it (assuming it returned a reference, and not undef) to directly set or change the content of an element! Instead use any of the other methods under "Structure-Modifying Methods", below.

$h->implicit() or $h->implicit($bool)

Returns (optionally sets) the "_implicit" attribute. This attribute is a flag that's used to indicate that the element was not originally present in the source, but was added to the parse tree (by HTML::TreeBuilder, for example) in order to conform to the rules of HTML structure.

$h->pos() or $h->pos($element)

Returns (and optionally sets) the "_pos" (for "current position") pointer of $h. This attribute is a pointer used during some parsing operations, whose value is whatever HTML::Element element at or under $h is currently "open", where $h->insert_element(NEW) will actually insert a new element.

(This has nothing to do with the Perl function called "pos", for controlling where regular expression matching starts.)

If you set $h->pos($element), be sure that $element is either $h, or an element under $h.

If you've been modifying the tree under $h and are no longer sure $h->pos is valid, you can enforce validity with:

$h->pos(undef) unless $h->pos->is_inside($h);

$h->all_attr()

Returns all this element's attributes and values, as key-value pairs. This will include some "internal" attributes (i.e., ones not present in the original element, and which will not be represented if/when you call $h->as_HTML). Internal attributes are distinguished by the fact that the first character of their key (not value, key!) is an underscore ("_").

$h->all_external_attr()

Like all_attr, except that internal attributes are not present.

STRUCTURE-MODIFYING METHODS

These methods are provided for modifying the content of trees by adding or changing nodes as parents or children of other nodes.

$h->push_content($element_or_text, ...)

Adds the specified items to the end of the content list of the element $h. The items of content to be added should each be either a text segment (a string) or an HTML::Element object.

The push_content method will try to consolidate adjacent text segments while adding to the content list. That's to say, if $h's content_list is

('foo bar ', $some_node, 'baz!')

and you call

$h->push_content('quack?');

then the resulting content list will be this:

('foo bar ', $some_node, 'baz!quack?')

and not this:

('foo bar ', $some_node, 'baz!', 'quack?')

If that latter is what you want, you'll have to override the feature of consolidating text by using splice_content, as in:

$h->splice_content(scalar($h->content_list),0,'quack?');

Similarly, if you wanted to add 'Skronk' to the beginning of the content list, calling this:

$h->unshift_content('Skronk');

would make the resulting content list this:

('Skronkfoo bar ', $some_node, 'baz!')

and not this:

('Skronk', 'foo bar ', $some_node, 'baz!')

What you'd do to get the latter is:

$h->splice_content(0,0,'Skronk');

$h->unshift_content($element_or_text, ...)

Adds the specified items to the beginning of the content list of the element $h. The items of content to be added should each be either a text segment (a string) or an HTML::Element object.

The unshift_content method will try to consolidate adjacent text segments while adding to the content list. See above for a discussion of this.

$h->splice_content($offset, $length, $element_or_text, ...)

Detaches the items from $h's list of content-nodes, starting at $offset and continuing for $length items, replacing them with the items of the following list, if any. Returns the items (if any) removed from the content-list. If $offset is negative, then it starts that far from the end of the array, just like Perl's normal splice function. If $length and the following list are omitted, removes everything from $offset onward.

The items of content to be added (if any) should each be either a text segment (a string), or an HTML::Element object that's not already a child of $h.
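
Here is a small sketch of splice_content used as a replacement operation. It assumes, as the push_content discussion above implies, that splice_content leaves adjacent text segments unmerged; $h->normalize_content (described below) is then used to merge them. The element and text values are arbitrary.

```perl
use strict;
use warnings;
use HTML::Element;

my $em = HTML::Element->new('em');
$em->push_content('bar');

my $p = HTML::Element->new('p');
$p->push_content('foo ', $em, ' baz');

# Replace one item, starting at offset 1, with a plain text segment.
my @removed = $p->splice_content(1, 1, 'BAR');

my $count_before = $p->content_list;    # scalar context: item count
$p->normalize_content;                  # merge adjacent text segments
my $count_after  = $p->content_list;

print "$count_before items before normalizing, $count_after after\n";
print "removed <", $removed[0]->tag, ">, which is now parentless\n"
    unless $removed[0]->parent;
```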

$h->detach()

This unlinks $h from its parent, by setting its '_parent' attribute to undef, and by removing it from the content list of its parent (if it had one). The return value is the parent that was detached from (or undef, if $h had no parent to start with). Note that neither $h nor its parent is explicitly destroyed.

$h->detach_content()

This unlinks all of $h's children from $h, and returns them. Note that these are not explicitly destroyed; for that, you can just use $h->delete_content.

$h->replace_with( $element_or_text, ... )

This replaces $h in its parent's content list with the nodes specified. The element $h (which by then may have no parent) is returned. This causes a fatal error if $h has no parent. The list of nodes to insert may contain $h, but at most once. Aside from that possible exception, the nodes to insert should not already be children of $h's parent.

Also, note that this method does not destroy $h -- use $h->replace_with(...)->delete if you need that.

$h->preinsert($element_or_text, ...)

Inserts the given nodes right BEFORE $h in $h's parent's content list. This causes a fatal error if $h has no parent. None of the given nodes should be $h or other children of $h's parent. Returns $h.

$h->postinsert($element_or_text, ...)

Inserts the given nodes right AFTER $h in $h's parent's content list. This causes a fatal error if $h has no parent. None of the given nodes should be $h or other children of $h's parent. Returns $h.

$h->replace_with_content()

This replaces $h in its parent's content list with its own content. The element $h (which by then has no parent or content of its own) is returned. This causes a fatal error if $h has no parent. Also, note that this does not destroy $h -- use $h->replace_with_content->delete if you need that.
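
For instance, "unwrapping" an element, so that its text survives but its markup does not, might look like the sketch below. The variable names and text are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $bold = HTML::Element->new('b');
$bold->push_content('big');

my $p = HTML::Element->new('p');
$p->push_content('Hello ', $bold, ' world');

# Put the b element's content where the b element was, then destroy
# the now-empty, parentless b element.
$bold->replace_with_content->delete;

print $p->as_text, "\n";
```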

$h->delete_content()

Clears the content of $h, calling $i->delete for each content element. Compare with $h->detach_content.

Returns $h.

$h->delete()

Detaches this element from its parent (if it has one) and explicitly destroys the element and all its descendants. The return value is undef.

Perl uses garbage collection based on reference counting; when no references to a data structure exist, it's implicitly destroyed -- i.e., when no value anywhere points to a given object anymore, Perl knows it can free up the memory that the now-unused object occupies.

But this fails with HTML::Element trees, because a parent element always holds references to its children, and its children elements hold references to the parent, so no element ever looks like it's not in use. So, to destroy those elements, you need to call $h->delete on the parent.

$h->clone()

Returns a copy of the element (whose children are clones (recursively) of the original's children, if any).

The returned element is parentless. Any '_pos' attributes present in the source element/tree will be absent in the copy. For that and other reasons, the clone of an HTML::TreeBuilder object that's in mid-parse (i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot (currently) be used to continue the parse.

You are free to clone HTML::TreeBuilder trees, just as long as: 1) they're done being parsed, or 2) you don't expect to resume parsing into the clone. (You can continue parsing into the original; it is never affected.)

HTML::Element->clone_list(...nodes...)
or: ref($h)->clone_list(...nodes...)

Returns a list consisting of a copy of each node given. Text segments are simply copied; elements are cloned by calling $it->clone on each of them.

$h->normalize_content

Normalizes the content of $h -- i.e., concatenates any adjacent text nodes. (Any undefined text segments are turned into empty-strings.) Note that this does not recurse into $h's descendants.

$h->insert_element($element, $implicit)

Inserts (via push_content) a new element under the element at $h->pos(). Then updates $h->pos() to point to the inserted element, unless $element is a prototypically empty element like "br", "hr", "img", etc. The new $h->pos() is returned. This method is useful only if your particular tree task involves setting $h->pos.

DUMPING METHODS

$h->dump()

Prints the element and all its children to STDOUT, in a format useful only for debugging. The structure of the document is shown by indentation (no end tags).

$h->as_HTML() or $h->as_HTML($entities)
or $h->as_HTML($entities, $indent_char)

Returns a string representing in HTML the element and its children. The optional argument $entities specifies a string of the entities to encode. For compatibility with previous versions, specify '<>&' here. If omitted or undef, all unsafe characters are encoded as HTML entities. See HTML::Entities for details.

If $indent_char is specified and defined, the HTML to be output is indented, using the string you specify (which you probably should set to "\t", or some number of spaces, if you specify it). This feature is currently somewhat experimental. But try it, and feel free to email me any bug reports. (Note that output, although indented, is not wrapped. Patches welcome.)

$h->as_text()
$h->as_text(skip_dels => 1)

Returns a string that represents only the text parts of the element's descendants. Entities are decoded to corresponding ISO-8859-1 (Latin-1) characters. See HTML::Entities for more information.

If skip_dels is true, then text content under "del" nodes is not included in what's returned.

$h->starttag() or $h->starttag($entities)

Returns a string representing the complete start tag for the element. I.e., leading "<", tag name, attributes, and trailing ">". Attribute values that don't consist entirely of digits are surrounded with double-quotes, and appropriate characters are encoded. If $entities is omitted or undef, all unsafe characters are encoded as HTML entities. See HTML::Entities for details. If you specify some value for $entities, remember to include the double-quote character in it. (Previous versions of this module would basically behave as if '&">' were specified for $entities.)

$h->endtag()

Returns a string representing the complete end tag for this element. I.e., "</", tag name, and ">".

THE TRAVERSER METHOD

The traverse() method is the most important general method for accessing the information in a tree. It accepts the following syntaxes:

$h->traverse(\&callback)
or $h->traverse(\&callback, $ignore_text)
or $h->traverse([\&pre_callback,\&post_callback], $ignore_text)

These all mean to traverse the element and all of its children. That is, this method starts at node $h, "pre-order visits" $h, traverses its children, and then will "post-order visit" $h. "Visiting" means that the callback routine is called, with these arguments:

$_[0] : the node (element or text segment),
$_[1] : a startflag, and
$_[2] : the depth

If the $ignore_text parameter is given and true, then the pre-order call will not happen for text content.

The startflag is 1 when we enter a node (i.e., in pre-order calls) and 0 when we leave the node (in post-order calls).

Note, however, that post-order calls don't happen for nodes that are text segments or are elements that are prototypically empty (like "br", "hr", etc.).

If we visit text nodes (i.e., unless $ignore_text is given and true), then when text nodes are visited, we will also pass two extra arguments to the callback:

$_[3] : the element that's the parent
         of this text node
$_[4] : the index of this text node
         in its parent's content list

Note that you can specify that the pre-order routine can be a different routine from the post-order one:

$h->traverse([\&pre_callback,\&post_callback], ...);

You can also specify that no post-order calls are to be made, by providing a false value as the post-order routine:

$h->traverse([ \&pre_callback,0 ], ...);

And similarly for suppressing pre-order callbacks:

$h->traverse([ 0,\&post_callback ], ...);

Note that these two syntaxes specify the same operation:

$h->traverse([\&foo,\&foo], ...);
$h->traverse( \&foo       , ...);

The return values from calls to your pre- or post-order routines are significant, and are used to control recursion into the tree.

These are the values you can return, listed in descending order of my estimation of their usefulness:

HTML::Element::OK, 1, or any other true value

...to keep on traversing.

Note that HTML::Element::OK et al are constants. So if you're running under use strict (as I hope you are), and you say: return HTML::Element::PRUEN the compiler will flag this as an error (an unallowable bareword, in fact), whereas if you spell PRUNE correctly, the compiler will not complain.

undef, 0, '0', '', or HTML::Element::PRUNE

...to block traversing under the current element's content. (This is ignored if received from a post-order callback, since by then the recursion has already happened.) If this is returned by a pre-order callback, no post-order callback for the current node will happen.

HTML::Element::ABORT

...to abort the whole traversal immediately. This is often useful when you're looking for just the first node in the tree that meets some criterion of yours.

HTML::Element::PRUNE_UP

...to abort continued traversal into this node and its parent node. No post-order callback for the current or parent node will happen.

HTML::Element::PRUNE_SOFTLY

Like PRUNE, except that the post-order call for the current node is not blocked.

Almost every task to do with extracting information from a tree can be expressed in terms of traverse operations (usually in only one pass, and usually paying attention to only pre-order, or to only post-order), or operations based on traversing. (In fact, many of the other methods in this class are basically calls to traverse() with particular arguments.)

The source code for HTML::Element and HTML::TreeBuilder contains many examples of the use of the "traverse" method to gather information about the content of trees and subtrees.

(Note: you should not change the structure of a tree while you are traversing it.)
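
As a short sketch of the return-value protocol described above, the example below runs two pre-order-only traversals over a small tree: one that prunes the "head" element, and one that aborts at the first element carrying a "name" attribute. The tree is built with new_from_lol (described later) purely for brevity, and the choice of what to prune and what to look for is arbitrary.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
  ['html', {lang => 'en-US'},
    ['head',
      ['title', 'Stuff'],
      ['meta', {name => 'author', content => 'Jojo'}],
    ],
    ['body',
      ['h1', 'I like potatoes'],
    ],
  ]
);

# Record each element's tag, but do not descend into the head element.
my @seen;
$tree->traverse(
  [ sub {
      my ($node, $startflag, $depth) = @_;
      push @seen, $node->tag;
      return HTML::Element::PRUNE if $node->tag eq 'head';
      return HTML::Element::OK;
    },
    undef,    # a false value here suppresses post-order calls
  ],
  1,          # $ignore_text: callbacks see elements only
);
print "visited: @seen\n";

# Stop the whole traversal at the first element with a 'name' attribute.
my $found;
$tree->traverse(
  [ sub {
      my ($node) = @_;
      return HTML::Element::OK unless defined $node->attr('name');
      $found = $node;
      return HTML::Element::ABORT;
    },
    undef,
  ],
  1,
);
print "first element with a name attribute: ", $found->tag, "\n";
```

Because $ignore_text is true in both calls, the callbacks can safely call $node->tag without first checking whether $node is a text segment.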

SECONDARY STRUCTURAL METHODS

These methods all involve some structural aspect of the tree; either they report some aspect of the tree's structure, or they involve traversal down the tree, or walking up the tree.

$h->is_inside('tag', ...) or $h->is_inside($element, ...)

Returns true if $h is, or is contained anywhere inside, an element that is one of those listed or whose tag name is one of the tag names listed.

$h->is_empty()

Returns true if $h has no content, i.e., has no elements or text segments under it. In other words, this returns true if $h is a leaf node, AKA a terminal node. Do not confuse this sense of "empty" with another sense that it can have in SGML/HTML/XML terminology, which means that the element in question is of the type (like HTML's "hr", "br", "img", etc.) that can't have any content.

That is, a particular "p" element may happen to have no content, so $that_p_element->is_empty will be true -- even though the prototypical "p" element isn't "empty" (not in the way that the prototypical "hr" element is).

If you think this might make for potentially confusing code, consider simply using the clearer exact equivalent: not($h->content_list)

$h->pindex()

Returns the index of the element in its parent's contents array, such that $h would equal

$h->parent->content->[$h->pindex]
or
($h->parent->content_list)[$h->pindex]

assuming $h isn't root. If the element $h is root, then $h->pindex returns undef.

$h->address()

Returns a string representing the location of this node in the tree. The address consists of numbers joined by a '.', starting with '0', and followed by the pindexes of the nodes in the tree that are ancestors of $h, starting from the top.

So if the way to get to a node starting at the root is to go to child 2 of the root, then child 10 of that, and then child 0 of that, and then you're there -- then that node's address is "0.2.10.0".

As a bit of a special case, the address of the root is simply "0".

I foresee this being used mainly for debugging.

$h->address($address)

This returns the node (whether element or text-segment) at the given address in the tree that $h is a part of. (That is, the address is resolved starting from $h->root.)

If there is no node at the given address, this returns undef.

$h->depth()

Returns a number expressing $h's depth within its tree, i.e., how many steps away it is from the root. If $h has no parent (i.e., is root), its depth is 0.

$h->root()

Returns the element that's the top of $h's tree. If $h is root, this just returns $h. (If you want to test whether $h is the root, instead of asking what its root is, just test not($h->parent).)

$h->lineage()

Returns the list of $h's ancestors, starting with its parent, and then that parent's parent, and so on, up to the root. If $h is root, this returns an empty list.

If you simply want a count of the number of elements in $h's lineage, use $h->depth.

$h->lineage_tag_names()

Returns the list of the tag names of $h's ancestors, starting with its parent, and that parent's parent, and so on, up to the root. If $h is root, this returns an empty list. Example output: ('em', 'td', 'tr', 'table', 'body', 'html')
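
The position-reporting methods above can be tried together on a small tree. This is only a sketch: the tree is the one from "HOW WE REPRESENT TREES", built here with new_from_lol (described below), and the "h1" element is located with find_by_tag_name (also described below).

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
  ['html', {lang => 'en-US'},
    ['head',
      ['title', 'Stuff'],
      ['meta', {name => 'author', content => 'Jojo'}],
    ],
    ['body',
      ['h1', 'I like potatoes'],
    ],
  ]
);

my $h1 = $tree->find_by_tag_name('h1');    # scalar context: first match

print "pindex:  ", $h1->pindex,  "\n";     # index within body's content
print "address: ", $h1->address, "\n";     # root, then child 1, then child 0
print "depth:   ", $h1->depth,   "\n";
print "lineage: ", join(', ', $h1->lineage_tag_names), "\n";

# An address can be resolved back into the node it names.
print "address round-trips\n" if $tree->address($h1->address) == $h1;
print "root is the html element\n" if $h1->root == $tree;
```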

$h->descendants()

In list context, returns the list of all $h's descendant elements, listed in pre-order (i.e., an element appears before its content-elements). Text segments DO NOT appear in the list. In scalar context, returns a count of all such elements.

$h->find_by_tag_name('tag', ...)

In list context, returns a list of elements at or under $h that have any of the specified tag names. In scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.

$h->find_by_attribute('attribute', 'value')

In a list context, returns a list of elements at or under $h that have the specified attribute, and have the given value for that attribute. In a scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.
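
The difference between list and scalar context in these search methods is easy to overlook, so here is a small sketch. The tree, tag names, and "class" values are invented for this example.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
  ['html',
    ['body',
      ['h1', {class => 'top'}, 'Heading'],
      ['p',  {class => 'note'}, 'First ', ['em', 'note'], '.'],
      ['p',  'Second paragraph.'],
    ],
  ]
);

my @paras = $tree->find_by_tag_name('p');          # list context: every match
my $first = $tree->find_by_tag_name('p', 'h1');    # scalar: first in pre-order
my $note  = $tree->find_by_attribute('class', 'note');
my $count = $tree->descendants;                    # scalar: number of elements

print scalar(@paras), " p elements\n";
print "first p-or-h1 is a ", $first->tag, "\n";
print "the note says: ", $note->as_text, "\n";
print "$count descendant elements\n";
```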

$h->attr_get_i('attribute')

In list context, returns a list consisting of the values of the given attribute for $self and for all its ancestors starting from $self and working its way up. Nodes with no such attribute are skipped. ("attr_get_i" stands for "attribute get, with inheritance".) In scalar context, returns the first such value, or undef if none.

Consider a document consisting of:

<html lang='i-klingon'>
  <head><title>Pati Pata</title></head>
  <body>
    <h1 lang='la'>Stuff</h1>
    <p lang='es-MX' align='center'>
      Foo bar baz <cite>Quux</cite>.
    </p>
    <p>Hooboy.</p>
  </body>
</html>

If $h is the "cite" element, $h->attr_get_i("lang") in list context will return the list ('es-MX', 'i-klingon'). In scalar context, it will return the value 'es-MX'.

If you call with multiple attribute names...

$h->attr_get_i('a1', 'a2', 'a3')

...in list context, this will return a list consisting of the values of these attributes which exist in $self and its ancestors. In scalar context, this returns the first value (i.e., the value of the first existing attribute from the first element that has any of the attributes listed). So, in the above example,

$h->attr_get_i('lang', 'align');

will return:

 ('es-MX', 'center', 'i-klingon') # in list context
or
 'es-MX' # in scalar context.

But note that this:

$h->attr_get_i('align', 'lang');

will return:

 ('center', 'es-MX', 'i-klingon') # in list context
or
 'center' # in scalar context.
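
The attr_get_i results above can be reproduced with a runnable sketch like the following. It rebuilds the example document with new_from_lol (described below), leaving out the whitespace-only text between tags, and finds the "cite" element with find_by_tag_name.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
  ['html', {lang => 'i-klingon'},
    ['head', ['title', 'Pati Pata']],
    ['body',
      ['h1', {lang => 'la'}, 'Stuff'],
      ['p',  {lang => 'es-MX', align => 'center'},
        'Foo bar baz ', ['cite', 'Quux'], '.'],
      ['p',  'Hooboy.'],
    ],
  ]
);

my $cite = $tree->find_by_tag_name('cite');

my @langs   = $cite->attr_get_i('lang');             # list context
my $nearest = $cite->attr_get_i('lang');             # scalar context
my @both    = $cite->attr_get_i('lang', 'align');    # several names at once

print "langs:   @langs\n";
print "nearest: $nearest\n";
print "both:    @both\n";
```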

$h->extract_links() or $h->extract_links(@wantedTypes)

Returns links found by traversing the element and all of its children and looking for attributes (like "href" in an "a" element, or "src" in an "img" element) whose values represent links. The return value is a reference to an array. Each element of the array is a reference to an array with two items: the link-value and the element that has the attribute with that link-value. You may or may not end up using the element itself -- for some purposes, you may use only the link value.

You might specify that you want to extract links from just some kinds of elements (instead of the default, which is to extract links from all the kinds of elements known to have attributes whose values represent links). For instance, if you want to extract links from only "a" and "img" elements, you could code it like this:

for (@{  $e->extract_links('a', 'img')  }) {
    my($link, $element) = @$_;
    print
      "Hey, there's a ", $element->tag,
      " that links to $link\n";
}

$h->same_as($i)

Returns true if $h and $i are both elements representing the same tree of elements, each with the same tag name, with the same explicit attributes (i.e., not counting attributes whose names start with "_"), and with the same content (textual, comments, etc.).

Sameness of descendant elements is tested, recursively, with $child1->same_as($child_2), and sameness of text segments is tested with $segment1 eq $segment2.
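
One way to see same_as at work is to compare an element with its own clone before and after changing the clone. This is a sketch only; the tree and the attribute values are arbitrary.

```perl
use strict;
use warnings;
use HTML::Element;

my $original = HTML::Element->new_from_lol(
  ['p', {class => 'note'},
    'See ',
    ['a', {href => 'http://www.perl.com/'}, 'the Perl homepage'],
    '.',
  ]
);

my $copy = $original->clone;
print "same right after cloning\n" if $original->same_as($copy);

# Changing an explicit attribute on the copy makes the two differ,
# and leaves the original untouched.
$copy->attr('class', 'warning');
print "different after editing the copy\n" unless $original->same_as($copy);
print "original class is still ", $original->attr('class'), "\n";
```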

$h = HTML::Element->new_from_lol(ARRAYREF)

Recursively constructs a tree of nodes, based on the (non-cyclic) data structure represented by ARRAYREF, where that is a reference to an array of arrays (of arrays (of arrays (etc.))). In each arrayref in that structure: arrayrefs are considered to designate a sub-tree representing children for the node constructed from the current arrayref; hashrefs are considered to contain attribute-value pairs to add to the element to be constructed from the current arrayref; text segments at the start of any arrayref will be considered to specify the name of the element to be constructed from the current arrayref; all other text segments will be considered to specify text segments as children for the current arrayref.

An example will hopefully make this more obvious:

my $h = HTML::Element->new_from_lol(
  ['html',
    ['head',
      [ 'title', 'I like stuff!' ],
    ],
    ['body',
      {'lang', 'en-JP', _implicit => 1},
      'stuff',
      ['p', 'um, p < 4!', {'class' => 'par123'}],
      ['div', {foo => 'bar'}, '123'],
    ]
  ]
);
$h->dump;

Will print this:

<html> @0
  <head> @0.0
    <title> @0.0.0
      "I like stuff!"
  <body lang="en-JP"> @0.1 (IMPLICIT)
    "stuff"
    <p class="par123"> @0.1.1
      "um, p < 4!"
    <div foo="bar"> @0.1.2
      "123"

And printing $h->as_HTML will give something like:

<html><head><title>I like stuff!</title></head>
<body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
<div foo="bar">123</div></body></html>

$h->has_insane_linkage

This method is for testing whether this element or the elements under it have linkage attributes (_parent and _content) whose values are deeply aberrant: if there are undefs in a content list; if an element appears in the content lists of more than one element; if the _parent attribute of an element doesn't match its actual parent; or if an element appears as its own descendant (i.e., if there is a cyclicity in the tree).

This returns empty list (or false, in scalar context) if the subtree's linkage methods are sane; otherwise it returns two items (or true, in scalar context): the element where the error occurred, and a string describing the error.

This method is provided mainly for debugging and troubleshooting -- it should be quite impossible for any document constructed via HTML::TreeBuilder to parse into a non-sane tree (since it's not the content of the tree per se that's in question, but whether the tree in memory was properly constructed); and it should be impossible for you to produce an insane tree just thru reasonable use of normal documented structure-modifying methods. But if you're constructing your own trees, and your program is going into infinite loops during calls to traverse() or any of the secondary structural methods, then as part of debugging, consider calling has_insane_linkage on the tree.

BUGS

* If you want to free the memory associated with a tree built of HTML::Element nodes, then you will have to delete it explicitly. See the $h->delete method, above.

* There's almost nothing to stop you from making a "tree" with cyclicities (loops) in it, which could, for example, make the traverse method go into an infinite loop. So don't make cyclicities! (If all you're doing is parsing HTML files, and looking at the resulting trees, this will never be a problem for you.)

* There's no way to represent comments or processing directives in a tree with HTML::Elements. Not yet, at least.

* There's (currently) nothing to stop you from using an undefined value as a text segment. If you're running under perl -w, however, this may make HTML::Element's code produce a slew of warnings.

NOTES ON SUBCLASSING

You are welcome to derive subclasses from HTML::Element, but you should be aware that the code in HTML::Element makes certain assumptions about elements (and I'm using "element" to mean ONLY an object of class HTML::Element, or of a subclass of HTML::Element):

* The value of an element's _parent attribute must either be undef or otherwise false, or must be an element.

* The value of an element's _content attribute must either be undef or otherwise false, or a reference to an (unblessed) array. The array may be empty; but if it has items, they must ALL be either mere strings (text segments), or elements.

* The value of an element's _tag attribute should, at least, be a string of printable characters.

Moreover, bear these rules in mind:

* Do not break encapsulation on objects. That is, access their contents only thru $obj->attr or more specific methods.

* You should think twice before completely overriding any of the methods that HTML::Element provides. (Overriding with a method that calls the superclass method is not so bad, tho.)

SEE ALSO

HTML::AsSubs, HTML::TreeBuilder

COPYRIGHT

Copyright 1995-1998 Gisle Aas, 1999-2000 Sean M. Burke.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Original author Gisle Aas <gisle@aas.no>; current maintainer Sean M. Burke, <sburke@netadventure.net>