NAME

WWW::SuperAgent - An enhanced UserAgent

VERSION

Version 0.03

SYNOPSIS

WWW::SuperAgent is an anonymising user agent for fetching web pages. By default, SuperAgent presents one of 10 user agent strings when requesting a page, chosen at random from combinations of operating systems (Windows, OSX and Linux) and browsers (IE, Chrome, Firefox and Safari). SuperAgent can also limit the number of requests per IP address and domain. This is helpful when a domain limits how many requests it will accept from a unique IP address: once the limit is reached, SuperAgent returns an empty string and carps a warning instead of requesting the page.

SuperAgent also maintains a request history, which it can dump to a file and restore later. This is useful when tracking a large scraping operation that runs over several disconnected sessions.

SuperAgent encodes all web pages as UTF-8 and is based on LWP::UserAgent.

use WWW::SuperAgent;
my $ip = '127.0.0.1'; # insert your actual ip here
my $sa = WWW::SuperAgent->new($ip);
my $html = $sa->get_url('http://google.com');
...

SUBROUTINES/METHODS

new ($ip_address)

Instantiates a new SuperAgent object. If an IP address is provided, SuperAgent stores it and tracks requests against it. Otherwise, SuperAgent uses localhost (127.0.0.1).

get_url ($url)

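Requests the URL over HTTP and returns the response content encoded as UTF-8. If the request limit for the IP address and domain has been reached, returns an empty string and carps a warning instead of making the request.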

get_domain_count ($url)

Returns the count of requests to the domain in the history.

get_domain_ip_count ($domain, $ip)

Returns the count of requests in the history for the given domain and IP address.

get_url_count ($url)

Returns the count of requests to the URL in the history.

get_ip_count ($ip)

Returns the count of requests recorded against the IP address in the history.
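
The four counting methods above can be pictured as filters over a list of history records. The following is a minimal sketch under stated assumptions, not the module's implementation: the in-memory record layout, the helper names (domain_of, count_domain, count_domain_ip, count_url, count_ip) and the regex used to extract the domain are all hypothetical. The field names are borrowed from the history file format (ip, url, agent, response_code).

```perl
use strict;
use warnings;

# Hypothetical in-memory history: one hash per request. The URLs and
# agent names are placeholder sample data.
my @history = (
    { ip => '127.0.0.1', url => 'http://example.com/a', agent => 'agent-1', response_code => 200 },
    { ip => '127.0.0.1', url => 'http://example.com/b', agent => 'agent-2', response_code => 200 },
    { ip => '10.0.0.5',  url => 'http://example.org/',  agent => 'agent-1', response_code => 404 },
);

# Extract the host part of a URL with a simple pattern. This is an
# assumption for the sketch; a real implementation would more likely
# rely on a URL parsing module.
sub domain_of {
    my ($url) = @_;
    return $url =~ m{^[a-z][a-z0-9+.-]*://([^/:?#]+)}i ? lc($1) : '';
}

# Count requests whose URL belongs to the given domain.
sub count_domain {
    my ($history, $domain) = @_;
    return scalar grep { domain_of($_->{url}) eq $domain } @$history;
}

# Count requests to the given domain recorded against the given IP.
sub count_domain_ip {
    my ($history, $domain, $ip) = @_;
    return scalar grep { $_->{ip} eq $ip && domain_of($_->{url}) eq $domain } @$history;
}

# Count requests to exactly this URL.
sub count_url {
    my ($history, $url) = @_;
    return scalar grep { $_->{url} eq $url } @$history;
}

# Count requests recorded against the given IP.
sub count_ip {
    my ($history, $ip) = @_;
    return scalar grep { $_->{ip} eq $ip } @$history;
}

printf "example.com: %d request(s)\n", count_domain(\@history, 'example.com');
```

Keeping each counter as a plain grep over the history makes the relationship between the four methods easy to see, at the cost of a linear scan per call.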

clear_history

Clears the history.

get_history

Requires a file path as a parameter and writes the history to that file, appending to any existing contents.
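
Writing the history in append mode could look roughly like the sketch below. It assumes that the file uses the same tab-delimited field order that load_history reads (ip, url, agent, response_code), which this document does not confirm. The helper name append_history and the sample records are hypothetical and not part of WWW::SuperAgent.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Append history records to a file as tab-delimited lines.
# Hypothetical helper; the field order is an assumption.
sub append_history {
    my ($path, @records) = @_;
    open my $fh, '>>:encoding(UTF-8)', $path
        or die "Cannot open $path for appending: $!";
    for my $r (@records) {
        # Hash slice keeps the columns in a fixed order.
        print {$fh} join("\t", @{$r}{qw(ip url agent response_code)}), "\n";
    }
    close $fh or die "Cannot close $path: $!";
    return scalar @records;
}

# Demonstration against a temporary file: two separate calls, so the
# second one appends to what the first one wrote.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;
append_history($path, { ip => '127.0.0.1', url => 'http://example.com/a', agent => 'agent-1', response_code => 200 });
append_history($path, { ip => '127.0.0.1', url => 'http://example.com/b', agent => 'agent-2', response_code => 404 });
```

Opening with '>>' is what lets a scraping run that spans several sessions accumulate into a single file rather than overwriting earlier entries.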

load_history ($filepath)

Loads a SuperAgent browsing history from a tab-delimited file. Each line holds four fields in this order: ip, url, agent, response_code.
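
To illustrate the file format, here is a minimal sketch of parsing such lines with core Perl only. The helper names parse_history_line and load_history_from are hypothetical, and skipping malformed lines is an assumption; the module may handle them differently.

```perl
use strict;
use warnings;

# Parse one tab-delimited history line into a hash reference.
# Returns undef (in scalar context) if the line does not have exactly
# four fields.
sub parse_history_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    return unless @fields == 4;
    my %record;
    @record{qw(ip url agent response_code)} = @fields;
    return \%record;
}

# Read every line from a filehandle, keeping only well-formed records.
sub load_history_from {
    my ($fh) = @_;
    my @history;
    while (my $line = <$fh>) {
        my $record = parse_history_line($line);
        push @history, $record if $record;
    }
    return \@history;
}

# Demonstration using an in-memory filehandle with placeholder data.
my $sample = "127.0.0.1\thttp://example.com/\tagent-1\t200\n"
           . "not a valid line\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $history = load_history_from($fh);
close $fh;
printf "%d record(s) loaded\n", scalar @$history;
```

Splitting on tabs rather than whitespace matters here because user agent strings normally contain spaces.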

set_alias_mode_on

Turns alias mode on, so that the user agent header is shuffled for every subsequent request. Alias mode is on by default.

set_alias_mode_off

Turns alias mode off, so that the current user agent header is retained for every subsequent request.
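
The difference between the two modes can be sketched as follows. This is an illustration only: the alias strings are placeholders rather than the module's real user agent strings, and the state hash and the helpers new_state, random_alias and alias_for_next_request are hypothetical names.

```perl
use strict;
use warnings;

# Placeholder alias strings; the module's actual strings are not
# listed in this document.
my @aliases = (
    'ExampleBrowser/1.0 (Windows)',
    'ExampleBrowser/2.0 (OSX)',
    'ExampleBrowser/3.0 (Linux)',
);

# Alias mode starts on, matching the documented default.
sub new_state {
    return { alias_mode => 1, alias => $aliases[0] };
}

# Pick one alias uniformly at random.
sub random_alias {
    return $aliases[ int(rand(@aliases)) ];
}

# With alias mode on, choose a fresh alias for each request;
# with alias mode off, keep whatever alias is currently set.
sub alias_for_next_request {
    my ($state) = @_;
    $state->{alias} = random_alias() if $state->{alias_mode};
    return $state->{alias};
}

my $state = new_state();
$state->{alias_mode} = 0;                     # comparable to set_alias_mode_off
print alias_for_next_request($state), "\n";   # same alias on every call
```

Turning the mode off is useful when a site ties a session to a single user agent string and a mid-session change would look suspicious.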

get_alias

Returns the current agent string for the request header.

get_domain_ip_limit

Returns the current request limit per unique IP address and domain.

set_domain_ip_limit ($limit)

Sets the maximum number of requests that can be made per domain from a unique IP address.
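
A per-domain, per-IP limit of this kind can be sketched with a counter keyed on both values. The closure-based make_limiter below is a hypothetical helper, not the module's code; it assumes that every allowed request counts towards the limit and, like get_url as described in the synopsis, it returns an empty string and carps once the limit is reached.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Build a limiter that allows at most $limit requests per IP and domain.
# The returned code reference yields 1 when a request may proceed, or an
# empty string (with a carped warning) once the limit has been reached.
sub make_limiter {
    my ($limit) = @_;
    my %count;    # keyed by "ip<TAB>domain"
    return sub {
        my ($ip, $domain) = @_;
        my $key = join "\t", $ip, $domain;
        if (($count{$key} // 0) >= $limit) {
            carp "Request limit of $limit reached for $domain from $ip";
            return '';
        }
        $count{$key}++;
        return 1;
    };
}

# Demonstration: with a limit of 2, the third attempt is refused and
# a warning is written to STDERR.
my $allow = make_limiter(2);
for my $attempt (1 .. 3) {
    my $ok = $allow->('127.0.0.1', 'example.com');
    print "attempt $attempt: ", ($ok ? 'allowed' : 'refused'), "\n";
}
```

Because the key combines the IP address and the domain, reaching the limit for one domain does not block requests to another.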

INTERNAL SUBROUTINES/METHODS

_log_request

Logs requests into the history.

_set_alias ($alias)

Sets the agent string for the request header.

_get_random_alias

Returns a random alias from the alias array.

_check_domain_limit

Checks that a request does not breach the domain limit.

AUTHOR

David Farrell, <davidnmfarrell at gmail.com>

BUGS

Please report any bugs or feature requests to bug-www-superagent at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-SuperAgent. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc WWW::SuperAgent

You can also look for information at:

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2013 David Farrell.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.