NAME

WWW::Mechanize::Chrome::DOMops - Operations on the DOM loaded in Chrome

VERSION

Version 0.13

SYNOPSIS

This module provides a set of tools to operate on the DOM loaded onto the provided WWW::Mechanize::Chrome object after fetching a URL.

Operating on the DOM is powerful, but there are security risks involved if the browser and profile used for loading this DOM are your everyday browser and profile.

Please read "SECURITY WARNING" before continuing on to the main course.

Currently, WWW::Mechanize::Chrome::DOMops provides these tools:

Both domops_find() and domops_zap() return some information from each match and its descendants (like tag, id etc.). This information can be tweaked by the caller. domops_find() and domops_zap() can optionally execute javascript code on each match and its descendants and return data back to the caller's perl code.

The selection of the HTML elements in the DOM can be done in various ways:

There is more information about this in section "ELEMENT SELECTORS".

Here are some usage scenarios:

  use WWW::Mechanize::Chrome::DOMops qw/
      domops_zap domops_find
      $domops_VERBOSITY
      $domops_LOGGER
  /;

  # adjust verbosity: 0, 1, 2, 3
  $WWW::Mechanize::Chrome::DOMops::domops_VERBOSITY = 3;
  # optionally set our own logger instead of using default (to STDOUT/ERR)
  $WWW::Mechanize::Chrome::DOMops::domops_LOGGER = Mojo::Log->new(path=>'xx.log');

  # First, create a mech object and load a URL on it
  # Note: you need google-chrome binary installed in your system!
  # See section CREATING THE MECH OBJECT for creating the mech
  # and how to redirect its javascript console to perl's output
  my $mechobj = WWW::Mechanize::Chrome->new();
  # fetch a page which will setup a DOM on which to operate:
  $mechobj->get('https://www.bbbbbbbbb.com');

  # find elements in the DOM, select by CSS selector,
  # XPath selector, id, tag or name:
  my $ret = domops_find({
     'mech-obj' => $mechobj,
     # find elements whose class is in the provided
     # scalar class name or array of class names
     'element-class' => ['slanted-paragraph', 'class2', 'class3'],
     # *OR* their tag is this:
     'element-tag' => 'p',
     # *OR* their name is this:
     'element-name' => ['aname', 'name2'],
     # *OR* their id is this:
     'element-id' => ['id1', 'id2'],
     # *OR* just provide a CSS selector
     'element-cssselector' => 'a-css-selector',
     # *OR* just provide a XPath selector
     'element-xpathselector' => 'a-xpath-selector',
     # specifies that we should use the union of the above sets
     # hence the *OR* in above comment
     '||' => 1,
     # this says to find all elements whose class
     # is such-and-such AND element tag is such-and-such
     # '&&' => 1 means to calculate the INTERSECTION of all
     # individual matches.

     # build the information sent back from each match
     'element-information-from-matched' => <<'EOJ',
  // begin JS code to extract information from each match and return it
  // back as a hash
  const r = htmlElement.hasAttribute("role")
    ? htmlElement.getAttribute("role") : "" ;
  return {"tag" : htmlElement.tagName, "id" : htmlElement.id, "role" : r};
  EOJ
     # optionally run javascript code on all those elements matched
     'find-cb-on-matched' => [
       {
         'code' =><<'EOJS',
  // the element to operate on is 'htmlElement'
  console.log("operating on this element "+htmlElement.tagName);
  // this is returned back in the results of domops_find() under
  // key "cb-results"->"find-cb-on-matched"
  return 1;
  EOJS
         'name' => 'func1'
       }, {...}
     ],
     # optionally run javascript code on all those elements
     # matched AND THEIR CHILDREN too!
     'find-cb-on-matched-and-their-children' => [
       {
         'code' =><<'EOJS',
  // the element to operate on is 'htmlElement'
  console.log("operating on this element "+htmlElement.tagName);
  // this is returned back in the results of domops_find() under
  // key "cb-results"->"find-cb-on-matched-and-their-children"
  // notice the complex data
  return {"abc":"123","def":{"xyz":[1,2,3]}};
  EOJS
         'name' => 'func2'
       }
     ],
     # optionally ask it to create a valid id for any HTML
     # element returned which does not have an id.
     # The text provided will be postfixed with a unique
     # incrementing counter value
     'insert-id-if-none' => '_prefix_id',
     # or ask it to randomise that id a bit to avoid collisions
     'insert-id-if-none-random' => '_prefix_id',

     # optionally, also output the javascript code to a file for debugging
     'js-outfile' => 'output.js',
  });


  # Delete an element from the DOM
  $ret = domops_zap({
     'mech-obj' => $mechobj,
     'element-id' => 'paragraph-123'
  });

  # Mass murder:
  $ret = domops_zap({
     'mech-obj' => $mechobj,
     'element-tag' => ['div', 'span', 'p'],
     '||' => 1, # the union of all those matched with above criteria
  });

  # error handling
  if( $ret->{'status'} < 0 ){ die "error: ".$ret->{'message'} }
  # status of -3 indicates parameter errors,
  # -2 indicates that eval of javascript code inside the mech object
  # has failed (syntax errors perhaps, which could have been introduced
  # by a user-specified callback),
  # -1 indicates that javascript code executed correctly but
  # failed somewhere in its logic.

  print "Found " . $ret->{'status'} . " matches which are: ";
  # ... results are in $ret->{'found'}->{'first-level'}
  # ... and also in $ret->{'found'}->{'all-levels'}
  # the latter contains a recursive list of those
  # found AND ALL their children

  # wait for the page to load by catching the Page.loadEventFired event
  if( 0 == domops_wait_for_page_to_load() ){ print "page loaded\n" }
  else { die "page did not load within the default timeout" }

  domops_wait_for_page_to_load({
    'timeout' => 50.5, # fractional seconds
    'sleep' => 1.5, # fractional seconds to sleep between polling
  });

  # this waits for Page.loadEventFired AND for ALL
  # DOM elements specified with the XPath selectors:
  domops_wait_for_page_to_load({
    'elements-must-be-present' => [
      'div[@id="anid1"]',
      'span[@id="anid2"]',
    ],
    'elements-must-be-present-op' => '&&'
  });

EXPORT

SUBROUTINES/METHODS

domops_find($params)

It finds HTML elements in the DOM currently loaded on the WWW::Mechanize::Chrome object specified in the parameters. The parameters are:

JAVASCRIPT HELPERS

There is one javascript function available to all user-specified callbacks:

RETURN VALUE:

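The returned value is a hashref with at least a status key. On success, status is greater than or equal to zero and denotes the number of matched HTML elements. On error, it is negative: -3 indicates parameter errors, -2 indicates that the eval of the javascript code inside the mech object has failed (syntax errors perhaps, which could have been introduced by a user-specified callback) and -1 indicates that the javascript code executed correctly but failed somewhere in its logic.

If status is not negative, the call succeeded and its value is the number of matched HTML elements, which can be zero or more. In this case the returned hash also contains the following: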

"found" => {
  "first-level" => [
    {
      "tag" => "NAV",
      "id" => "nav-id-1"
    }
  ],
  "all-levels" => [
    {
      "tag" => "NAV",
      "id" => "nav-id-1"
    },
    {
      "id" => "li-id-2",
      "tag" => "LI"
    },
  ]
}

Key first-level contains those items matched directly while key all-levels contains those matched directly as well as those matched because they are descendants (direct or indirect) of each matched element.

By default, each item representing a matched HTML element has two fields: tag and id. Beware of missing ids, or use insert-id-if-none or insert-id-if-none-random to fill in the missing ids.
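A minimal sketch of how a caller might inspect the returned hashref. The data in $ret below is made up and only mirrors the structure shown above, and the helper name summarise_result() is hypothetical, it is not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): turns the hashref
# returned by domops_find() or domops_zap() into a one-line summary.
sub summarise_result {
    my ($ret) = @_;
    my $status = $ret->{'status'};
    if( $status < 0 ){
        # a negative status signals an error
        return "error ($status): " . ($ret->{'message'} // 'no message');
    }
    # otherwise status is the number of matches, which can be zero
    my @direct = @{ $ret->{'found'}->{'first-level'} // [] };
    my @all    = @{ $ret->{'found'}->{'all-levels'}  // [] };
    # the id can be empty unless insert-id-if-none was used
    my @labels = map {
        $_->{'tag'} . '#' . (length($_->{'id'} // '') ? $_->{'id'} : '<no id>')
    } @direct;
    return "$status direct match(es) [" . join(', ', @labels) . "], "
         . scalar(@all) . " element(s) including descendants";
}

# Mock data standing in for a real domops_find() result
my $ret = {
    'status' => 1,
    'found'  => {
        'first-level' => [ { 'tag' => 'NAV', 'id' => 'nav-id-1' } ],
        'all-levels'  => [
            { 'tag' => 'NAV', 'id' => 'nav-id-1' },
            { 'tag' => 'LI',  'id' => 'li-id-2'  },
        ],
    },
};
print summarise_result($ret), "\n";
print summarise_result({ 'status' => -3, 'message' => 'bad parameters' }), "\n";
```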

If find-cb-on-matched or find-cb-on-matched-and-their-children were specified, then the returned result contains this additional data:

"cb-results" => {
   "find-cb-on-matched" => [
     [
       {
         "name" => "func1",
         "result" => {
           "a" => 1,
           "b" => 2
         }
       }
     ],
     [
       {
         "result" => 1,
         "name" => "func2"
       }
     ]
   ],
   "find-cb-on-matched-and-their-children" => ...
 },

find-cb-on-matched and/or find-cb-on-matched-and-their-children will be present depending on whether the corresponding value was specified in the input parameters. Each of these contains the results returned by running the callbacks on each HTML element, in the same order as the elements are returned under key found.
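The following sketch pairs callback results with the elements they were run on. It relies on the ordering stated above, and additionally assumes that the results of find-cb-on-matched line up with the first-level list. The data is made up and results_by_element() is a hypothetical helper, not something this module provides.

```perl
use strict;
use warnings;

# Mock data shaped like the "found" and "cb-results" examples above
my $ret = {
    'status' => 2,
    'found'  => { 'first-level' => [
        { 'tag' => 'NAV', 'id' => 'nav-id-1' },
        { 'tag' => 'LI',  'id' => 'li-id-2'  },
    ] },
    'cb-results' => { 'find-cb-on-matched' => [
        [ { 'name' => 'func1', 'result' => { 'a' => 1, 'b' => 2 } } ],
        [ { 'name' => 'func1', 'result' => 1 } ],
    ] },
};

# Hypothetical helper: returns a hashref keyed on element id,
# then on callback name, holding each callback's result.
sub results_by_element {
    my ($ret, $cbkey, $level) = @_;
    my $found = $ret->{'found'}->{$level}      // [];
    my $cbres = $ret->{'cb-results'}->{$cbkey} // [];
    my %out;
    for my $i (0 .. $#$found){
        my $id = $found->[$i]->{'id'};
        for my $cb (@{ $cbres->[$i] // [] }){
            $out{$id}->{ $cb->{'name'} } = $cb->{'result'};
        }
    }
    return \%out;
}

my $by_id = results_by_element($ret, 'find-cb-on-matched', 'first-level');
print "func1 on nav-id-1 returned a=",
      $by_id->{'nav-id-1'}->{'func1'}->{'a'}, "\n";
```

Keying on the id only works when every matched element has one, which is one more reason to consider insert-id-if-none.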

HTML elements may lack an id. So field id can be empty unless the caller sets the insert-id-if-none input parameter, which creates a unique id for each matched HTML element that has none. These changes will be saved in the DOM. When this parameter is specified, the returned HTML elements will be checked for duplicates because now all of them have an id field. Therefore, if you did not specify this parameter, results may contain duplicate items and items with an empty id field. If you did specify this parameter, then some elements of the DOM (those matched by our selectors) will have their missing id created and saved in the DOM.

Another implication of this parameter is that running it twice or more with the same value can produce the same ids. So, always supply a different value to this parameter if it is run more than once on the same DOM.

domops_zap($params)

It removes HTML element(s) from the DOM currently loaded on the WWW::Mechanize::Chrome object specified in the parameters. The params are exactly the same as with "domops_find($params)" except that insert-id-if-none is ignored.

domops_zap() is implemented as a domops_find() with an additional callback for all elements matched in the first level (not their children) as:

'find-cb-on-matched' => [
  {
    'code' => 'htmlElement.parentNode.removeChild(htmlElement); return 1;',
    'name' => '_thezapper'
  }
],

RETURN VALUE:

Return value is exactly the same as with "domops_find($params)"

domops_wait_for_page_to_load($params)

It waits for the page to load by detecting the Page.loadEventFired event. However, because the DOM may be altered at any time, even after said event has fired, there is provision to wait for specific DOM elements as well via the elements-must-be-present input parameter. This can be a scalar or an ARRAY_REF containing XPath selectors for the DOM elements whose appearance on the page is awaited. If it contains more than one selector (i.e. it is an ARRAY_REF), then input parameter elements-must-be-present-op can be set to && or || to denote how to combine them, i.e. wait for all (&&) or wait for any (||).
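The "wait for all" versus "wait for any" logic can be sketched as a plain polling loop. This is not the module's implementation: wait_for_elements() and the $is_present callback are hypothetical, and the convention of returning 0 on success is only borrowed from the synopsis. A mock checker stands in for the browser so the sketch runs on its own.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical polling helper, NOT the module's implementation.
# $is_present is a caller-supplied coderef which takes a selector and
# returns true if a matching element is currently in the DOM.
# Returns 0 when the condition is met and 1 on timeout.
sub wait_for_elements {
    my ($selectors, $op, $timeout, $sleep, $is_present) = @_;
    my $deadline = time() + $timeout;
    while(1){
        my $hits = grep { $is_present->($_) } @$selectors;
        return 0 if $op eq '&&' && $hits == @$selectors; # all present
        return 0 if $op eq '||' && $hits > 0;            # any present
        return 1 if time() >= $deadline;                 # timed out
        sleep($sleep);                                   # fractional seconds
    }
}

# Mock checker: the first element is always there,
# the second one "appears" the third time it is checked.
my $calls = 0;
my $checker = sub {
    my ($selector) = @_;
    return 1 if $selector eq 'div[@id="anid1"]';
    return ++$calls >= 3;
};
my @selectors = ('div[@id="anid1"]', 'span[@id="anid2"]');

my $rc_any = wait_for_elements(\@selectors, '||', 1.0, 0.01, $checker);
my $rc_all = wait_for_elements(\@selectors, '&&', 1.0, 0.01, $checker);
print "any: $rc_any, all: $rc_all\n";
```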

INPUT PARAMETERS:

As a HASH_REF:

RETURN VALUE:

domops_read_dom_element_selectors_from_JSON_file($filename)

It reads DOM element selectors, in their various forms as documented at "ELEMENT SELECTORS", from specified filename and returns these as a Perl data structure which can then be passed on to "domops_find($params)" and "domops_zap($params)".

RETURN VALUE:

domops_read_dom_element_selectors_from_JSON_string($string)

It reads DOM element selectors, in their various forms as documented at "ELEMENT SELECTORS", from specified string and returns these as a Perl data structure which can then be passed on to "domops_find($params)" and "domops_zap($params)".

RETURN VALUE:

$WWW::Mechanize::Chrome::DOMops::domops_VERBOSITY

Set this upon loading the module to 0, 1, 2, 3 to adjust verbosity. 0 implies no verbosity and it is the default.

$WWW::Mechanize::Chrome::DOMops::domops_LOGGER

Set this upon loading the module to your own logger object which must implement 3 methods: error(), warn(), info(). Mojo::Log does implement the above 3 methods. But if you have a different object whose class implements these methods then pass it on! Even if your preferred logger object does not support these 3 methods, you can create a wrapper class around your preferred logger object to implement the 3 methods.

Default is a Mojo::Log object which logs to STDOUT/ERR, so you do not need to be concerned with this option at all.

You may want to use your own logger if this module is called from another module which has its own logger and you want all logging to go to the same place. Or, you want to log to a file, e.g. $WWW::Mechanize::Chrome::DOMops::domops_LOGGER = Mojo::Log->new(path=>'xx.log');
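A wrapper class of the kind mentioned above could look like the sketch below. The class name My::LoggerWrapper and its coderef "sink" are assumptions made for illustration. The only requirement taken from this document is that the object implements error(), warn() and info().

```perl
use strict;
use warnings;

# Hypothetical wrapper class giving any "sink" the three methods
# the module expects: error(), warn() and info().
package My::LoggerWrapper;

sub new {
    my ($class, $sink) = @_;
    # $sink is a coderef receiving ($level, $message)
    return bless { sink => $sink }, $class;
}
sub error { my $self = shift; $self->{sink}->('error', join('', @_)); return $self }
sub warn  { my $self = shift; $self->{sink}->('warn',  join('', @_)); return $self }
sub info  { my $self = shift; $self->{sink}->('info',  join('', @_)); return $self }

package main;

# Here the sink just collects lines; it could equally forward them
# to your preferred logger object.
my @lines;
my $logger = My::LoggerWrapper->new(sub {
    my ($level, $msg) = @_;
    push @lines, "[$level] $msg";
});

# With the module loaded, the wrapper would be installed like this:
# $WWW::Mechanize::Chrome::DOMops::domops_LOGGER = $logger;

$logger->info("starting");
$logger->warn("something looks odd");
$logger->error("something failed");
print "$_\n" for @lines;
```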

ELEMENT SELECTORS

Element selectors are how one selects HTML elements from the DOM. There are six ways to select HTML elements: by class (element-class), tag (element-tag), id (element-id), name (element-name), a CSS selector (element-cssselector) or an XPath selector (element-xpathselector).

Multiple selectors can be specified by combining the various selector types, above. For example, one can select by element-class and element-tag (and ...). In this selection mode, the matched elements from each selector type (e.g. set A contains the HTML elements matched via element-class and set B contains the HTML elements matched via element-tag) must be combined by means of either the UNION (||) or INTERSECTION (&&) of the two sets A and B.

Each selector can take one or more values. If you want to select by just one class then provide that one class as a string scalar. If you want to select HTML elements which may belong to two classes, then provide the two class names as an array.
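The difference between the two combination modes can be shown with plain Perl lists. This only illustrates the set semantics of || and &&, it does not claim to show how the module combines matches internally. The element ids and the helper names union_of() and intersection_of() are made up.

```perl
use strict;
use warnings;

# Set A: ids matched via element-class, set B: ids matched via element-tag
# (made-up ids, for illustration only)
my @A = qw(p1 p2 div7);
my @B = qw(p2 p3 div7);

# UNION, corresponding to '||' => 1 : in A or in B, without duplicates
sub union_of {
    my %seen;
    return grep { !$seen{$_}++ } map { @$_ } @_;
}

# INTERSECTION, corresponding to '&&' => 1 : in A and in B
sub intersection_of {
    my ($first, @rest) = @_;
    my @common = @$first;
    for my $set (@rest){
        my %in = map { $_ => 1 } @$set;
        @common = grep { $in{$_} } @common;
    }
    return @common;
}

print "union:        @{[ union_of(\@A, \@B) ]}\n";        # p1 p2 div7 p3
print "intersection: @{[ intersection_of(\@A, \@B) ]}\n"; # p2 div7
```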

These are the valid selectors:

And one of these two must be used to combine the results into a final list:

CREATING THE MECH OBJECT

The mech (WWW::Mechanize::Chrome) object must be supplied to the functions in this module. It must be created by the caller. This is how I do it:

use WWW::Mechanize::Chrome;
use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init($ERROR);

my %default_mech_params = (
    headless => 1,
#   log => $mylogger,
    launch_arg => [
            '--window-size=600x800',
            '--password-store=basic', # do not ask me for stupid chrome account password
#           '--remote-debugging-port=9223',
#           '--enable-logging', # see also log above
            '--disable-gpu',
            '--no-sandbox',
            '--ignore-certificate-errors',
            '--disable-background-networking',
            '--disable-client-side-phishing-detection',
            '--disable-component-update',
            '--disable-hang-monitor',
            '--disable-save-password-bubble',
            '--disable-default-apps',
            '--disable-infobars',
            '--disable-popup-blocking',
    ],
);

my $mech_obj = eval {
    WWW::Mechanize::Chrome->new(%default_mech_params)
};
die $@ if $@;

# This transfers all javascript code's console.log(...)
# messages to perl's warn()
# we need to keep $console var in scope!
my $console = $mech_obj->add_listener('Runtime.consoleAPICalled', sub {
      warn
          "js console: "
        . join ", ",
          map { $_->{value} // $_->{description} }
          @{ $_[0]->{params}->{args} };
    })
;

# and now fetch a page
my $URL = '...';
my $retmech = $mech_obj->get($URL);
die "failed to fetch $URL" unless defined $retmech;
$mech_obj->sleep(1); # let it settle
# now the mech object has loaded the URL and has a DOM hopefully.
# You can pass it on to domops_find() or domops_zap() to operate on the DOM.

SECURITY WARNING

WWW::Mechanize::Chrome invokes the google-chrome executable on behalf of the current user. Headless or not, google-chrome is invoked. Depending on the launch parameters, either a fresh, new browser session will be created or the session of the current user with their profile, data, cookies, passwords, history, etc. will be used. The latter case is very dangerous.

This behaviour is controlled by WWW::Mechanize::Chrome's constructor parameters which, in turn, are used for launching the google-chrome executable. Specifically, see WWW::Mechanize::Chrome#separate_session, WWW::Mechanize::Chrome#data_directory and WWW::Mechanize::Chrome#incognito.

Unless you really need to mechsurf with your current session, aim to launch the browser with a fresh new session. This is the safest option.

Do not rely on default behaviour as this may change over time. Be explicit.
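Being explicit could look like the sketch below, which prepares constructor parameters pointing Chrome at a throw-away profile directory. It uses the data_directory and incognito parameters named above; how they interact with each other and with separate_session should be checked against the WWW::Mechanize::Chrome documentation. The directory name template is arbitrary, and the constructor call is left as a comment so that the sketch runs without Chrome installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A throw-away profile directory, removed when the program exits
my $profile_dir = tempdir('domops-profile-XXXXXX', TMPDIR => 1, CLEANUP => 1);

my %mech_params = (
    headless       => 1,
    # start Chrome with a profile directory of its own ...
    data_directory => $profile_dir,
    # ... and in incognito mode
    incognito      => 1,
);

print "Chrome would be launched with profile directory $profile_dir\n";

# With WWW::Mechanize::Chrome installed, the object would be created as:
# my $mech_obj = WWW::Mechanize::Chrome->new(%mech_params);
```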

Also, be warned that WWW::Mechanize::Chrome::DOMops executes javascript code on that google-chrome instance. This is done internally with javascript code hardcoded into WWW::Mechanize::Chrome::DOMops's package files.

On top of that WWW::Mechanize::Chrome::DOMops allows for user-specified javascript code to be executed on that google-chrome instance. For example the callbacks on each element found, etc.

This is an example of what can go wrong if you are not using a fresh google-chrome session:

You have just used google-chrome to access your yahoo webmail and you did not log out. So, there will be an access cookie in google-chrome when you later invoke it via WWW::Mechanize::Chrome (remember you have not told it to use a fresh session).

If you allow unchecked user-specified (or copy-pasted from ChatGPT) javascript code in WWW::Mechanize::Chrome::DOMops's domops_find(), domops_zap(), etc. then it is, theoretically, possible for this javascript code to initiate an XHR to yahoo, fetch your emails and pass them on to your perl code.

But there is another problem: the integrity of the javascript code embedded in WWW::Mechanize::Chrome::DOMops may have been compromised in order to exploit your current session.

This is very likely with a Windows installation where, being the security swiss cheese it is, anyone can compromise your module's code. It is less likely in Linux, if your modules are installed by root and are read-only for normal users. But, still, they can be compromised (by root).

Another issue is with the saved passwords and the browser's auto-fill when landing on a login form.

Therefore, for all these reasons, it is advised not to invoke (via WWW::Mechanize::Chrome) google-chrome with your current/usual/everyday/email-access/bank-access identity so that it does not have access to your cookies, passwords, history etc.

It is better to create a fresh google-chrome identity/profile and use that for your WWW::Mechanize::Chrome::DOMops needs.

No matter what identity you use, you may want to erase the cookies and history of google-chrome upon its exit. That's a good practice.

It is also advised to review the javascript code you provide via WWW::Mechanize::Chrome::DOMops callbacks if it is taken from a 3rd party, human or not, e.g. ChatGPT.

Additionally, make sure that the current installation of WWW::Mechanize::Chrome::DOMops in your system is not compromised with malicious javascript code injected into it. For this you can check its MD5 hash.
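One way to obtain such a hash is sketched below with the core Digest::MD5 module. It looks for the installed module file in @INC and prints its digest. The helper name md5_of_file() is hypothetical, and the reference digest to compare against must come from a source you trust; none is given here.

```perl
use strict;
use warnings;
use Digest::MD5;

# Returns the hex MD5 digest of a file's contents
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open '$path': $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Locate the installed module file, if it is installed at all
my $relpath = 'WWW/Mechanize/Chrome/DOMops.pm';
my ($installed) = grep { -f } map { "$_/$relpath" } grep { !ref } @INC;

if( defined $installed ){
    print "$installed\n  MD5: ", md5_of_file($installed), "\n";
    # compare this digest against one obtained from a source you trust
} else {
    print "$relpath not found in \@INC\n";
}
```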

DEPENDENCIES

This module depends on WWW::Mechanize::Chrome which, in turn, depends on the google-chrome executable being installed on the host computer. See WWW::Mechanize::Chrome::Install on how to install the executable.

Test scripts (which create their own mech object) will detect the absence of the google-chrome binary and exit gracefully, meaning the test passes, but with a message on STDERR. The user will hopefully notice it and proceed to install google-chrome. In any event, this module will be installed with or without google-chrome.

AUTHOR

Andreas Hadjiprocopis, <bliako at cpan.org>

CODING CONDITIONS

This code was written under extreme climate conditions of 44 Celsius. Keep packaging those vegs in kilos of plastic wrappers, keep obsoleting our perfectly good hardware, keep inventing new consumer needs and brainwash them down our throats, in short Crack Deep the Roof Beam, Capitalism.

BUGS

Please report any bugs or feature requests to bug-www-mechanize-chrome-domops at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-Mechanize-Chrome-DOMops. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc WWW::Mechanize::Chrome::DOMops

You can also look for information at:

DEDICATIONS

Almaz

ACKNOWLEDGEMENTS

CORION for publishing WWW::Mechanize::Chrome and all its contributors.

LICENSE AND COPYRIGHT

Copyright 2019 Andreas Hadjiprocopis.

This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.