NAME

HTML::Object::Element - HTML Element Object

SYNOPSIS

use HTML::Object::Element;
my $this = HTML::Object::Element->new || die( HTML::Object::Element->error, "\n" );

VERSION

v0.2.6

DESCRIPTION

This interface implements a core element for the HTML::Object parser. An element can be one or more spaces, a text, a tag, a comment, or a document; all of these inherit from this core interface.

For a more elaborate interface and a close implementation of the Web Document Object Model (a.k.a. DOM), see HTML::Object::DOM::Element and the DOM parser.

METHODS

address

This method is purely for compatibility with "address" in HTML::Element. Please refer to its documentation for its use.

all_attr

Returns a hash (not a hash reference) of the element's attributes as key-value pairs.

This is provided for compatibility with HTML::Element

my %attributes = $e->all_attr;

all_attr_names

Returns a list of all the element's attributes in no particular order.

my @attributes = $e->all_attr_names;

as_html

This is an alias for "as_string"

as_string

Returns a string representation of the current element and its underlying descendants.

If a cached version of that string exists, it is returned instead.

as_text

Returns a string representation of the text content of the current element and its descendants.

If a cached version of that string exists, it is returned instead.

as_trimmed_text

Returns the value returned by "as_text", with its leading and trailing spaces, if any, trimmed.

as_xml

This is merely an alias for as_string

attr

Provided with an attribute name, this returns its value. If an attribute value is also provided, it sets or replaces the attribute value accordingly. If the attribute value provided is undef, this removes the attribute altogether.
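
The get, set and remove behaviour described above can be sketched as follows. This is only an illustration and not the module's code: a plain hash stands in for the element, attr_sketch is a hypothetical helper, and its return value when setting is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an element: a plain hash holding attributes.
# This is NOT the HTML::Object::Element implementation.
sub attr_sketch
{
    my $elem = shift;
    my $name = shift;
    # Getter: only the attribute name was given
    return( $elem->{attr}->{ $name } ) unless( @_ );
    my $value = shift;
    if( defined( $value ) )
    {
        # Setter: set or replace the attribute value
        $elem->{attr}->{ $name } = $value;
    }
    else
    {
        # An explicit undef removes the attribute altogether
        delete( $elem->{attr}->{ $name } );
    }
    # Assumption: the sketch simply returns the value it was given
    return( $value );
}

my $elem = { attr => { id => 'main' } };
my $id = attr_sketch( $elem, 'id' );        # get
attr_sketch( $elem, 'class', 'active' );    # set
attr_sketch( $elem, 'id', undef );          # remove
```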

attributes

Returns a hash object of all the attribute key-value pairs.

Be careful: this is a 'live' object, and if you change it directly, you could damage the hierarchy or introduce errors.

attributes_sequence

Returns an array object containing the attribute names in their order of appearance.

checksum

Returns the element checksum, used to determine if any change was made.

children

Returns an array object containing all the element's children.

class

Returns this element class, e.g. HTML::Object::Element or HTML::Object::Document

clone

Returns a copy of the current element and, recursively, of all its descendants.

The cloned element, that is returned, has no parent.

clone_list

Clone all the element children and return a new array object of the cloned children.

This is quite different from the HTML::Element equivalent, which is called as a class method and takes an arbitrary list of elements.

close

Closes the current tag, if necessary. It returns the current object upon success, or, upon error, returns undef and sets an error.

close_tag

Set or get a closing element object that is used to close the current element.

column

Returns the column at which this element was found in the original HTML text string, by the parser.

content

This is an alias for "children". It returns an array object of the current element's children objects.

content_array_ref

This is an alias for "children". It returns an array object of the current element's children objects.

This is provided in compatibility with HTML::Element

content_list

In list context, this returns the list of the current element's children, if any, and in scalar context, this returns the number of child elements it contains.

This is provided in compatibility with HTML::Element
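
As a rough illustration of the list versus scalar context behaviour, here is a minimal sketch using Perl's wantarray. The element is a plain hash and content_list_sketch is a hypothetical helper, so this only mirrors the documented behaviour and is not the module's code.

```perl
use strict;
use warnings;

# Hypothetical stand-in; a plain hash, not the module's object
sub content_list_sketch
{
    my $elem = shift;
    my @children = @{ $elem->{children} };
    # wantarray is true in list context and false in scalar context
    return( wantarray() ? @children : scalar( @children ) );
}

my $elem  = { children => [ 'a', 'b', 'c' ] };
my @kids  = content_list_sketch( $elem );   # list context: the children
my $count = content_list_sketch( $elem );   # scalar context: how many
```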

delete

Remove all of its content by calling "delete_content", detach the current object, and destroy the object.

delete_content

Remove the content, i.e. all the children, of the current element, effectively calling "delete" on each one of them.

It returns the current element.

delete_ignorable_whitespace

Does not do anything by design. There is not much value in this method under HTML::Object in the first place.

depth

Returns an integer representing the depth level of the current element in the hierarchy.

descendants

Returns an array object of all the element's descendants throughout its hierarchy.

destroy

An alias for "delete"

destroy_content

An alias for "delete_content"

detach

This method takes no parameter. It removes the current element from its parent's list of child elements, and unsets its parent object value.

It returns the element parent object.

detach_content

This method takes no argument and will remove the parent value for each of its children, set the children list for the current element to an empty list and return the list of those children elements thus removed.

my @removed = $e->detach_content;

This is provided in compatibility with HTML::Element

dump

Prints to STDOUT a representation of the hierarchy of element objects.

eid

Returns the element unique id, which is automatically generated for any element. This is actually a uuid. For example:

my $eid = $e->eid; # e.g.: 971ef725-e99b-4869-b6ac-b245794e84e2

end

Returns the current object.

Actually, I am not sure this should be here, and rather it should be in HTML::Object::XQuery since it simulates jQuery.

extract_links

Returns links found by traversing the element and all of its children and looking for attributes (like href in an <a> element, or src in an <img> element) whose values represent links.

You may specify that you want to extract links from just some kinds of elements (instead of the default, which is to extract links from all the kinds of elements known to have attributes whose values represent links). For instance, if you want to extract links from only <a> and <img> elements, you could code it like this:

my $links = $elem->extract_links( qw( a img ) ) ||
    die( $elem->error );
foreach( @$links )
{
    say "Hey, there is a ", $_->{tag}, " that links to ", $_->{value}, " in its ", $_->{attribute}, " attribute, at ", $_->{element}->address;
}

The dictionary definition hash reference of all tags and their attributes containing potential links is available as $HTML::Object::LINK_ELEMENTS

This method returns an array object containing hash objects, for each attribute of an element containing a link, with the following properties:

  • attribute

    The attribute containing the link

  • element

    The element object

  • tag

    The element tag name.

  • value

    The attribute value, which would typically contain the link value.

Nota bene: this method has been implemented to provide an API similar to that of HTML::Element, and the first two paragraphs of this method description are taken from that module.

find_by_attribute

Returns an array object of all the elements (including potentially the current element itself) in the element's hierarchy that have an attribute matching the given attribute name.

my $list = $e->find_by_attribute( 'data-dob' );

find_by_tag_name

Returns an array object of all the elements (including potentially the current element itself) in the element's hierarchy that match any of the specified tag names. Tag names can be provided in any case, as they are case insensitive.

my $list = $e->find_by_tag_name( qw( div p span ) );

has_children

Returns true if the current element has children, i.e. it contains other elements within itself.

id

Set or get the id HTML attribute of the element.

insert_element

Provided with an element object and this will add it to the current element's children.

It returns the current element object.

internal

Returns the internal hash of key-value pairs used internally by this package. This is primarily used to handle the data-* special attributes.

is_closed

Returns true if the current element has a closing tag that is accessible with "close_tag"

is_empty

Returns true if this is an element that, by HTML standard, does not contain any other elements, and false otherwise.

To check if the element has children, use "has_children"

is_inside

Provided with a list of tag names or element objects, this will check whether the current element is contained in any of the element objects, or in any element whose tag name is provided. It returns true if it is contained, or false otherwise.

Example:

say $e->is_inside( qw( span div ), $elem1, 'p', $elem2 ) ? 'yes' : 'no';

is_valid_attribute

Provided with an attribute name, this returns true if it is valid, or false otherwise.

is_void

Returns true if, by standard, this tag is void, meaning it does not contain any children. For example: <br />, <link />, or <input />

left

Returns an array object of all the sibling objects before the current element.

line

Returns the line at which this element was found in the original HTML text string, by the parser.

lineage

Returns an array object of the current element's parent and parent's parent up to the root of the hierarchy
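
The idea of walking up the parent links can be sketched as below. Plain hashes with a parent key stand in for element objects, and lineage_sketch is a hypothetical helper that returns a plain array reference rather than an array object.

```perl
use strict;
use warnings;

# Collect the parent, the parent's parent, and so on up to the root.
# Hypothetical stand-in using plain hashes, not the module's objects.
sub lineage_sketch
{
    my $elem = shift;
    my @ancestors;
    my $parent = $elem->{parent};
    while( defined( $parent ) )
    {
        push( @ancestors, $parent );
        $parent = $parent->{parent};
    }
    return( \@ancestors );
}

my $root = { tag => 'html' };
my $body = { tag => 'body', parent => $root };
my $div  = { tag => 'div',  parent => $body };

my $lineage = lineage_sketch( $div );
# Closest ancestor first, root last
my @tags = map( $_->{tag}, @$lineage );
```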

lineage_tag_names

Returns an array object of the current element's parent tag name and parent's parent tag name up to the root of the hierarchy

This is equivalent to:

my $list = $self->lineage->map(sub{ $_->tag });

look

This is the method that does the heavy work for "look_down" and "look_up"

look_down

Provided with some criteria, and an optional hash reference of options, this will crawl down the current element hierarchy to find any matching element.

my $list = $e->look_down( _tag => 'div' ); # returns an Module::Generic::Array object
my $list = $e->look_down( class => qr/\bclass_name\b/, { max_level => 3, max_match => 1 });

The options you can specify are:

max_level

Takes an integer that sets the maximum lower or upper level beyond which this will stop searching.

max_match

Takes an integer that sets the maximum number of matches after which, this will stop recurring and return the result.

There are three kinds of criteria you can specify:

1. attr_name, attr_value

This is used when you are looking for an element with a particular attribute name and value. For example:

my $list = $e->look_down( id => 'hello' );

This will look for any element whose attribute id has a value of hello

If you want to search for an attribute that does not exist, set the attribute value being searched to undef

To search for a tag, use the special attribute _tag. For example:

my $list = $e->look_down( _tag => 'div' );

This will return an array object of all the div elements.

2. attr_name, qr//

Same as above, except the attribute value of the element being checked will be evaluated against this regular expression and if true will be added into the resulting array object.

For example:

my $list = $e->look_down( 'data-dob' => qr/^\d{4}-\d{2}-\d{2}$/ );

This will search for all elements that have an attribute data-dob whose value looks like a date.

3. \&my_check or sub{ # some code here }

Provided with a code reference (i.e. a reference to an existing subroutine, or an anonymous one), it will be evaluated for each element found. If it returns undef, look_down will interrupt its crawling, and if it returns true, the element is added to the resulting array object of elements.

For example:

my $list = $e->look_down(
    _tag => 'img',
    class => qr/\bactive\b/,
    sub
    {
        return( $_->attr( 'width' ) > 350 ? 1 : 0 );
    }
);

When executing the code, the current element being evaluated will be made available via $_

Those criteria are called and evaluated in the order they are provided. Thus, if you specify, for example:

my $list = $e->look_down(
    _tag => 'img',
    class => qr/\bactive\b/,
    sub
    {
        return( $_->attr( 'width' ) > 350 ? 1 : 0 );
    }
);

Each element is first checked to see whether its tag is img, and is discarded if it is not. Then its class attribute is checked against the regular expression provided, and the element is discarded if it does not match. Finally, the code is evaluated.

Thus, the order of the criteria is important.

It returns an array object of all the elements found.

This is provided for compatibility with HTML::Element
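
To make the ordered evaluation of criteria more concrete, here is a minimal, self-contained sketch. It is not the module's implementation: elements are plain hashes, matches_sketch and look_down_sketch are hypothetical helpers, options such as max_level and max_match are left out, and starting the search at the current element itself is an assumption.

```perl
use strict;
use warnings;

# Check one element against the criteria, in the order given.
# Returns 1 (match), 0 (no match) or undef (a callback asked to stop).
sub matches_sketch
{
    my( $elem, @criteria ) = @_;
    while( @criteria )
    {
        my $crit = shift( @criteria );
        if( ref( $crit ) eq 'CODE' )
        {
            # The element is made available to the callback as $_
            local $_ = $elem;
            my $rv = $crit->( $elem );
            return if( !defined( $rv ) );
            return( 0 ) if( !$rv );
            next;
        }
        # Otherwise this is an attribute name followed by a value or regexp
        my $want = shift( @criteria );
        my $have = $crit eq '_tag' ? $elem->{tag} : $elem->{attr}->{ $crit };
        if( !defined( $want ) )
        {
            # An undef value means the attribute must not exist
            return( 0 ) if( defined( $have ) );
        }
        elsif( !defined( $have ) )
        {
            return( 0 );
        }
        elsif( ref( $want ) eq 'Regexp' )
        {
            return( 0 ) if( $have !~ $want );
        }
        else
        {
            return( 0 ) if( $have ne $want );
        }
    }
    return( 1 );
}

# Depth-first walk collecting every matching element
sub look_down_sketch
{
    my( $elem, @criteria ) = @_;
    my @found;
    my @queue = ( $elem );
    while( my $e = shift( @queue ) )
    {
        my $rv = matches_sketch( $e, @criteria );
        last if( !defined( $rv ) );
        push( @found, $e ) if( $rv );
        unshift( @queue, @{ $e->{children} || [] } );
    }
    return( \@found );
}

my $tree = {
    tag => 'div', attr => {},
    children => [
        { tag => 'img', attr => { class => 'active',   width => 400 } },
        { tag => 'img', attr => { class => 'active',   width => 100 } },
        { tag => 'img', attr => { class => 'inactive', width => 500 } },
    ],
};
my $list = look_down_sketch( $tree,
    _tag  => 'img',
    class => qr/\bactive\b/,
    sub{ return( $_->{attr}->{width} > 350 ? 1 : 0 ); },
);
```

Because the _tag criterion comes first, the callback is never called for the div, which has no width attribute to compare.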

look_up

Provided with some criteria, and an optional hash reference of options, this will crawl up the current element's ascendants, starting with its parent, to find any matching element.

The options that can be used are the same as for "look_down", i.e. max_level and max_match

It returns an array object of all the elements found.

This is provided for compatibility with HTML::Element

looks_like_html

Provided with a string and this returns true if the string starts with an HTML tag, or false otherwise.

looks_like_it_has_html

Provided with a string and this returns true if the string contains HTML tags, or false otherwise.

modified

Set or get a boolean of whether the element was modified. Actually this is not used.

This returns a DateTime object.

new_attribute

This creates a new HTML::Object::Attribute object passing it any arguments provided, and returns the object thus created, or undef if an error occurred.

new_closing

This creates a new HTML::Object::Closing object passing it any arguments provided, and returns the object thus created, or undef if an error occurred.

new_document

Instantiate a new HTML document, passing it whatever argument was provided, and return the resulting object.

new_element

Instantiate a new element, passing it whatever argument was provided, and return the resulting object.

new_from_lol

This is a legacy from HTML::Element, but is not actually used.

This recursively constructs a tree of nodes.

It returns an array object of elements.

new_parser

Instantiate a new parser object, passing it whatever argument was provided, and return the resulting object.

new_text

Instantiate a new text object, passing it whatever argument was provided, and return the resulting object.

normalize_content

Checks each of the current element's child elements and concatenates any adjacent text or space elements.

It returns the current object.

offset

Returns the offset value, i.e. the byte position, at which the tag was found in the original HTML data.

original

Returns the original raw string data as it was captured initially by the parser.

This is an important feature of HTML::Object since, if nothing was changed, HTML::Object will return the element objects in their original text version.

Other HTML parsers, by contrast, decode all the HTML elements parsed and rebuild them, often badly, even when they have not been changed, which of course incurs a heavy speed penalty.

parent

Returns the current element's parent element, if any. The value returned could very well be empty if, for example, it is the top element or if the element was created independently of any parsing.

pindex

This is an alias for "pos"

pos

Read-only.

Returns the position integer of the current element among its parent's children elements.

It returns a smart undef if the element has no parent.

If the current element, somehow, could not be found among its parent, this would return undef

Contrary to the HTML::Element equivalent, you cannot manually change this value.
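
A minimal sketch of how such a position could be derived, assuming plain hashes for elements and a zero-based index. Both are assumptions, pos_sketch is a hypothetical helper, and it returns a plain undef rather than a smart undef.

```perl
use strict;
use warnings;
use Scalar::Util qw( refaddr );

# Find the index of an element among its parent's children by comparing
# reference addresses. Hypothetical stand-in, not the module's code.
sub pos_sketch
{
    my $elem = shift;
    my $parent = $elem->{parent};
    # No parent: nothing to look into
    return if( !defined( $parent ) );
    my $kids = $parent->{children};
    for my $i ( 0 .. $#$kids )
    {
        return( $i ) if( refaddr( $kids->[ $i ] ) == refaddr( $elem ) );
    }
    # Not found among the parent's children
    return;
}

my $parent = { tag => 'ul', children => [] };
my $first  = { tag => 'li', parent => $parent };
my $second = { tag => 'li', parent => $parent };
push( @{ $parent->{children} }, $first, $second );
my $orphan = { tag => 'li' };

my $index = pos_sketch( $second );
my $none  = pos_sketch( $orphan );
```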

postinsert

Provided with a list of elements and this will add them right after the current element in its parent's children.

It returns the current element object for chaining upon success, and upon error, it returns undef and sets an error

preinsert

Provided with a list of elements and this will add them right before the current element in its parent's children.

It returns the current element object for chaining upon success, and upon error, it returns undef and sets an error

push_content

Provided with a list of elements and this will add them as children to the current element.

Contrary to the HTML::Element equivalent, this requires that only objects be provided, which is easy to do anyhow.

If consecutive text or space objects are provided, they are automatically merged with the text or space object immediately preceding them, if any.

For example:

$e->push_content( $elem1, HTML::Object::Text->new( value => q{some text} ), $elem2 );

And if two consecutive text objects were provided the second one would have its value merged with the previous one.

It returns the current element object for chaining.
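
The merging of consecutive text objects can be pictured with the following sketch. It uses plain hashes with the _text pseudo tag, handles only text nodes (not space nodes), and push_content_sketch is a hypothetical helper, so it should be read as an approximation of the documented behaviour only.

```perl
use strict;
use warnings;

# Append nodes to an element's children, merging a text node into the
# previous child when that child is also a text node.
# Hypothetical stand-in using plain hashes, not the module's code.
sub push_content_sketch
{
    my( $elem, @new ) = @_;
    foreach my $node ( @new )
    {
        my $last = $elem->{children}->[-1];
        if( defined( $last ) &&
            $last->{tag} eq '_text' &&
            $node->{tag} eq '_text' )
        {
            # Merge the value instead of adding a second text node
            $last->{value} .= $node->{value};
            next;
        }
        push( @{ $elem->{children} }, $node );
    }
    # Return the element itself to allow chaining
    return( $elem );
}

my $p = { tag => 'p', children => [] };
push_content_sketch( $p,
    { tag => '_text', value => 'Hello ' },
    { tag => '_text', value => 'world' },
    { tag => 'br' },
);
```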

replace_with

Provided with a list of element objects and this will replace the current element in its parent's children with the element objects provided.

This will return an error if the current element has no parent, or if the current element cannot be found among its parent's children elements.

Also, this method will filter out any duplicate objects, and return an error if the element being replaced is also among the objects provided for replacement or if the current element's parent is among the replacement objects.

Each replacement object is detached from its previous parent and re-attached to the current element's parent before being added to its children.

It returns the current element object.

replace_with_content

Replaces the current element in its parent's children with its own child elements. In other words, the current element's children are moved up and take the place of the current element itself.

It returns the current element object, which will then no longer have a parent.

reset

Enable the reset flag for this element, which has the effect of instructing "as_string" to not use its cache.

right

Returns an array object of all the sibling objects after the current element.

root

Returns the top most element in the hierarchy, which usually is HTML::Object::Document

same_as

This method will check that 2 element objects are similar, in the sense that they can have different "eid", but have identical structure.

If you want to check whether 2 element objects are actually the same, by comparing their eid, you can use the comparison operators that have been overloaded. For example:

say $a eq $b ? 'same' : 'nope';

set_checksum

Calculates and returns the md5 checksum of the current element based on all its attributes.
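
As an illustration only, a checksum over the attributes could be computed along these lines with the core Digest::MD5 module. The exact data the module feeds into the digest is not documented here, so the serialisation below (sorted name=value pairs joined by semicolons) is purely an assumption, and checksum_sketch is a hypothetical helper.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Hypothetical stand-in: serialise the attributes in a stable order
# and return the md5 digest of that string as hexadecimal.
sub checksum_sketch
{
    my $elem = shift;
    my $attr = $elem->{attr};
    # Sorting the keys keeps the result independent of hash ordering
    my $data = join( ';', map( "$_=$attr->{$_}", sort( keys( %$attr ) ) ) );
    return( md5_hex( $data ) );
}

my $elem   = { attr => { id => 'main', class => 'active' } };
my $before = checksum_sketch( $elem );
$elem->{attr}->{class} = 'hidden';
my $after  = checksum_sketch( $elem );
# A different checksum signals that the attributes were changed
my $changed = $before ne $after ? 1 : 0;
```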

splice_content

Provided with an offset, a length, and a list of element objects, this will replace length children of the element, starting at position offset, with the list of objects supplied.

If consecutive text elements or space elements are provided, they will be merged with their immediate previous sibling of the same type.

For example:

$e->splice_content( 3, 2, $elem1, $elem2, HTML::Object::Text->new( value => 'Hello world' ) );

It returns an error if the offset or length provided is not a valid integer.

Upon success, it returns the current object for chaining.
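
The offset and length semantics follow those of Perl's own splice. A rough sketch, with strings standing in for element objects and a hypothetical splice_content_sketch helper that returns undef on invalid input (the real method also sets an error, and merges adjacent text or space elements, neither of which is shown here):

```perl
use strict;
use warnings;

# Hypothetical stand-in: validate the offset and length, then delegate
# to Perl's splice on the list of children.
sub splice_content_sketch
{
    my( $elem, $offset, $length, @new ) = @_;
    foreach my $n ( $offset, $length )
    {
        # Reject anything that is not an integer
        return if( !defined( $n ) || $n !~ /^-?\d+$/ );
    }
    splice( @{ $elem->{children} }, $offset, $length, @new );
    # Return the element itself to allow chaining
    return( $elem );
}

my $e = { children => [ qw( a b c d e f ) ] };
# Replace 2 children starting at offset 3 (d and e) with 3 new ones
splice_content_sketch( $e, 3, 2, 'X', 'Y', 'Z' );
my $result = join( ' ', @{ $e->{children} } );
# An invalid offset is rejected and leaves the children untouched
my $err = splice_content_sketch( $e, 'three', 2 );
```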

tag

Returns the tag name of the current element as a scalar object. Be careful with any change you make to it, as it would directly change the element tag name.

Non-element nodes, such as text or space, have a pseudo tag starting with an underscore ("_"), such as _text and _space

traverse

Provided with a reference to an existing subroutine, or an anonymous one, and this will crawl through every element of the descending hierarchy and call the callback code, passing it the element object being evaluated. The local variable $_ is also made available and set to the element being evaluated.
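
A minimal sketch of such a traversal, assuming plain hashes for elements and a hypothetical traverse_sketch helper. Whether the real method also calls the callback on the starting element is not stated above, so doing so here is an assumption.

```perl
use strict;
use warnings;

# Depth-first traversal calling the callback for each element, with
# the element passed as argument and also available as $_
# Hypothetical stand-in, not the module's code.
sub traverse_sketch
{
    my( $elem, $code ) = @_;
    {
        local $_ = $elem;
        $code->( $elem );
    }
    foreach my $child ( @{ $elem->{children} || [] } )
    {
        traverse_sketch( $child, $code );
    }
    return( $elem );
}

my $tree = {
    tag => 'html',
    children => [
        {
            tag => 'body',
            children => [
                { tag => 'p' },
                { tag => 'div', children => [ { tag => 'span' } ] },
            ],
        },
    ],
};
my @seen;
traverse_sketch( $tree, sub{ push( @seen, $_->{tag} ); } );
```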

unshift_content

This acts like "push_content", except that instead of appending the elements, this prepends the given elements to the element's children.

It returns the current element.

AUTHOR

Jacques Deguest <jack@deguest.jp>

SEE ALSO

HTML::Object, HTML::Object::Attribute, HTML::Object::Boolean, HTML::Object::Closing, HTML::Object::Collection, HTML::Object::Comment, HTML::Object::Declaration, HTML::Object::Document, HTML::Object::Element, HTML::Object::Exception, HTML::Object::Literal, HTML::Object::Number, HTML::Object::Root, HTML::Object::Space, HTML::Object::Text, HTML::Object::XQuery

Mozilla Element documentation

COPYRIGHT & LICENSE

Copyright (c) 2021 DEGUEST Pte. Ltd.

All rights reserved

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.