NAME
WWW::SuperAgent - An enhanced UserAgent
VERSION
Version 0.03
SYNOPSIS
WWW::SuperAgent is an anonymising user agent for getting web pages. By default, SuperAgent presents one of 10 user agent strings when requesting a web page, selected at random from combinations of operating systems (Windows, OSX and Linux) and browsers (IE, Chrome, Firefox and Safari). SuperAgent can also limit the number of requests per IP address and domain. This is helpful when a domain limits how many requests it will accept from a single IP address: once the limit is reached, SuperAgent returns an empty string and carps a warning instead of requesting the page.
SuperAgent also maintains a request history, which it can dump and restore. This is useful when tracking a large scraping operation that may run over several disconnected sessions.
SuperAgent encodes all web pages as UTF-8 and is based on LWP::UserAgent.
use WWW::SuperAgent;
my $ip = '127.0.0.1'; # insert your actual ip here
my $sa = WWW::SuperAgent->new($ip);
my $html = $sa->get_url('http://google.com');
...
SUBROUTINES/METHODS
new ($ip_address)
Instantiates a new SuperAgent object. If an IP address is provided, SuperAgent stores it and tracks requests against it. Otherwise, SuperAgent uses localhost (127.0.0.1).
get_url ($url)
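Requests the URL over HTTP and returns the content encoded as UTF-8. If the request limit for the domain and IP address has been reached, returns an empty string and carps a warning instead of requesting the page.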
get_domain_count ($url)
Returns the number of requests to the URL's domain in the history.
get_domain_ip_count ($domain, $ip)
Returns the count of requests to the domain and ip in the history.
get_url_count ($url)
Returns the count of requests to the url in the history.
get_ip_count ($ip)
Returns the count of requests to the ip in the history.
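The four counting methods above all answer the same kind of question: how many history records match a given URL, domain, IP address, or domain and IP pair. As a rough illustration only, the idea can be sketched in plain Perl without the module. The record layout, the `domain_of` and `count_matching` helpers, and the regular expression used to find the domain are all assumptions made for this sketch, not SuperAgent's actual internals.

```perl
use strict;
use warnings;

# Hypothetical in-memory history: one hash per request.
my @history = (
    { ip => '127.0.0.1', url => 'http://example.com/a' },
    { ip => '127.0.0.1', url => 'http://example.com/b' },
    { ip => '10.0.0.2',  url => 'http://example.org/'  },
);

# Extract the host part of an http(s) URL with a simple pattern.
# Assumption: the real module may derive the domain differently.
sub domain_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc $1 : '';
}

# Count the history records for which every requested field matches.
sub count_matching {
    my ($history, %want) = @_;
    my $count = 0;
    RECORD: for my $rec (@$history) {
        # Add a derived "domain" field alongside the stored fields.
        my %have = (%$rec, domain => domain_of($rec->{url}));
        for my $field (keys %want) {
            next RECORD if $have{$field} ne $want{$field};
        }
        $count++;
    }
    return $count;
}

print count_matching(\@history, domain => 'example.com'), "\n";
print count_matching(\@history, domain => 'example.com', ip => '127.0.0.1'), "\n";
print count_matching(\@history, ip => '10.0.0.2'), "\n";
```

Passing the fields as key/value pairs lets one helper cover the URL, domain, IP and combined cases in this sketch.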
clear_history
get_history
print_history ($filepath)
This method requires a file path as a parameter and appends the history to the file at that path.
load_history ($filepath)
Loads a SuperAgent browsing history from a tab-delimited file with the fields: ip, url, agent, response_code.
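To make the file format concrete, here is a minimal sketch of how a single line in that format could be split into named fields. The `parse_history_line` helper and the sample values are hypothetical and only illustrate the documented field order; the module's own loader may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-delimited history line into named fields.
# Field order follows the documented format: ip, url, agent, response_code.
sub parse_history_line {
    my ($line) = @_;
    chomp $line;
    my ($ip, $url, $agent, $response_code) = split /\t/, $line, 4;
    return {
        ip            => $ip,
        url           => $url,
        agent         => $agent,
        response_code => $response_code,
    };
}

# Sample line with placeholder values.
my $record = parse_history_line(
    "127.0.0.1\thttp://example.com/\tExampleBrowser/1.0 (placeholder)\t200\n"
);
printf "%s requested %s and got %s\n", @{$record}{qw(ip url response_code)};
```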
set_alias_mode_on
Turns alias mode on, which shuffles the user agent header for every subsequent request. Alias mode is on by default.
set_alias_mode_off
Turns alias mode off, which retains the current user agent header for every subsequent request.
get_alias
Returns the current agent string for the request header.
get_domain_ip_limit
Returns the current limit per unique ip address and domain.
set_domain_ip_limit ($limit)
Sets the number of times a request can be made per unique IP address and domain.
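The limit behaviour can be pictured as a counter per domain and IP pair that is checked before each request. The following is a standalone sketch of that idea, not the module's code: the `%request_count` hash, the `allow_request` function and the warning text are all invented for illustration.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Hypothetical state: requests already made, keyed by "domain|ip".
my %request_count;

# Return true and record the request if it is under the limit; otherwise
# carp a warning and return false so the caller can skip the request.
sub allow_request {
    my ($domain, $ip, $limit) = @_;
    my $key = "$domain|$ip";
    if (($request_count{$key} // 0) >= $limit) {
        carp "Limit of $limit requests reached for $domain from $ip";
        return 0;
    }
    $request_count{$key}++;
    return 1;
}

# With a limit of 2, the third attempt is refused.
for my $attempt (1 .. 3) {
    my $ok = allow_request('example.com', '127.0.0.1', 2);
    print "attempt $attempt: ", ($ok ? 'allowed' : 'skipped'), "\n";
}
```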
INTERNAL SUBROUTINES/METHODS
_log_request
Logs requests into the history.
_set_alias ($alias)
Sets the agent string for the request header.
_get_random_alias
Returns a random alias from the alias array.
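Picking a random element from an array is a one-line idiom in Perl. The sketch below shows it with placeholder agent strings; the `@aliases` contents and the `random_alias` name are assumptions for illustration and are not the strings or the routine the module ships with.

```perl
use strict;
use warnings;

# Placeholder agent strings, invented for this sketch.
my @aliases = (
    'ExampleBrowser/1.0 (Windows)',
    'ExampleBrowser/1.0 (Macintosh)',
    'ExampleBrowser/1.0 (Linux)',
);

# Pick one element at random: rand(@list) yields a number from 0 up to,
# but not including, the number of elements, and int truncates it to an index.
sub random_alias {
    my (@list) = @_;
    return $list[ int rand @list ];
}

print random_alias(@aliases), "\n";
```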
_check_domain_limit
Checks that a request does not breach the domain limit.
AUTHOR
David Farrell, <davidnmfarrell at gmail.com>
BUGS
Please report any bugs or feature requests to bug-www-superagent at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-SuperAgent. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc WWW::SuperAgent
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2013 David Farrell.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.