NAME
WWW::Mechanize::Cookbook - Recipes for using WWW::Mechanize
INTRODUCTION
First, please note that many of these are possible just using LWP::UserAgent. Since WWW::Mechanize is a subclass of LWP::UserAgent, whatever works on LWP::UserAgent should work on WWW::Mechanize. See the lwpcook man page included with the libwww-perl distribution.
BASICS
Create a mech
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
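The autocheck => 1 option tells Mechanize to die if any request fails, so you don't have to check manually. It's easier that way. If you want to do your own error checking, pass autocheck => 0 instead; recent releases turn autocheck on by default, so simply leaving it out no longer disables it.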
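If you do want to handle errors yourself, a minimal sketch might look like the following. The helper name fetch_or_warn and the 10-second timeout are assumptions made for illustration; success() and status() are the Mechanize methods that report on the last request.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 makes a failed request return normally instead of dying.
# The timeout value is arbitrary, chosen only to keep the sketch responsive.
my $mech = WWW::Mechanize->new( autocheck => 0, timeout => 10 );

# Hypothetical helper: fetch a URL and report any failure ourselves.
sub fetch_or_warn {
    my ( $mech, $url ) = @_;
    $mech->get( $url );
    return 1 if $mech->success;
    warn "Could not fetch $url: HTTP status ", $mech->status, "\n";
    return 0;
}

if ( fetch_or_warn( $mech, 'http://search.cpan.org' ) ) {
    print "Fetched ", length( $mech->content ), " characters\n";
}
```

With autocheck off, every get() needs a check like this; forgetting one means silently working with an error page.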
Fetch a page
$mech->get( "http://search.cpan.org" );
print $mech->content;
$mech->content contains the raw HTML from the web page. It is not parsed or handled in any way, at least through the content method.
Fetch a page into a file
Sometimes you want to dump your results directly into a file. For example, there's no reason to read a JPEG into memory if you're only going to write it out immediately. This can also help with memory issues on large files.
$mech->get( "http://www.cpan.org/src/stable.tar.gz",
    ":content_file" => "stable.tar.gz" );
Fetch a page with credentials
Mech's credentials() method comes straight from LWP::UserAgent, since WWW::Mechanize is a subclass. It takes four arguments: the base (the host:port of the server), realm, user, and password.
my $url = 'http://10.11.12.13/password.html';
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->credentials(
    '10.11.12.13:80',
    'ABCDEF',
    'admin' => 'password'
);
$mech->get( $url );
print $mech->content();
"Normal" browsers do this in two steps: they try to fetch the resource and get a 401 response that contains the challenge (which has the realm information). They ask the user for the name and password, which they then use to request the resource again. They save that information for future accesses to the same realm. The realm helps the user agent keep things straight on the user side, and the server doesn't really care about it otherwise. Sometimes you'll need to pull the auth method from the challenge, but usually it's just "Basic".
If you want to do it in one step (so you don't get the 401 at all), you stick the "Authorization" header in the initial request. It makes no difference to the server that you don't know the realm name. With HTTP::Request you just add the header to the request. In WWW::Mechanize, you can use the add_header() method. The HTTP specification (RFC 2616) defines the Authorization header itself; the format of Basic credentials is specified in RFC 2617, since superseded by RFC 7617.
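As a rough sketch of both approaches, the snippet below first pulls the scheme and realm out of a challenge, then builds the header for a one-step request. Several things here are assumptions: the 401 response is built by hand to stand in for a real server reply, the regex expects realm to be the first parameter of the challenge, the scheme is taken to be Basic, and the credentials are the placeholder values from the example above.

```perl
use strict;
use warnings;
use HTTP::Response;
use MIME::Base64 qw( encode_base64 );
use WWW::Mechanize;

# Two-step flow: a hand-built 401 stands in for the server's challenge.
my $res = HTTP::Response->new(
    401, 'Unauthorized',
    [ 'WWW-Authenticate' => 'Basic realm="ABCDEF"' ],
);
my $challenge = $res->header( 'WWW-Authenticate' );

# Simplistic parse: assumes realm is the first parameter after the scheme.
my ( $scheme, $realm ) = $challenge =~ /^(\S+)\s+realm="([^"]*)"/;
print "Scheme: $scheme, realm: $realm\n";

# One-step flow: Basic credentials are the base64 encoding of "user:password".
# The empty second argument stops encode_base64 from appending a newline.
my ( $user, $password ) = ( 'admin', 'password' );   # placeholder credentials
my $auth = 'Basic ' . encode_base64( "$user:$password", '' );

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->add_header( Authorization => $auth );

# Later requests from this mech carry the header, for example:
# $mech->get( 'http://10.11.12.13/password.html' );
```

Because add_header() applies to every subsequent request from that mech, this sends the credentials to any site you visit with it; use a separate mech for unrelated hosts.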
LINKS
Find all image links
Find all links that point to a JPEG, GIF or PNG.
my @links = $mech->find_all_links(
    tag => "a", url_regex => qr/\.(jpe?g|gif|png)$/i );
Find all download links
Find all links that have the word "download" in their link text.
my @links = $mech->find_all_links(
    tag => "a", text_regex => qr/\bdownload\b/i );
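Each element returned by find_all_links() is a WWW::Mechanize::Link object. The sketch below shows one way to use the two searches above and read the results. To run without a network it loads a made-up sample page from a temporary file through a file: URL; the page contents and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, written to a temp file so the sketch runs offline.
my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} <<'HTML';
<html><body>
<a href="photo.JPG">A photo</a>
<a href="files/app.zip">Download the app</a>
<a href="about.html">About</a>
</body></html>
HTML
close $fh;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( URI::file->new_abs( $path )->as_string );

# Match on the URL for images, and on the link text for downloads.
my @images = $mech->find_all_links(
    tag => 'a', url_regex => qr/\.(jpe?g|gif|png)$/i );
my @downloads = $mech->find_all_links(
    tag => 'a', text_regex => qr/\bdownload\b/i );

# text() is the link text; url_abs() is the URL resolved against the page.
for my $link ( @images, @downloads ) {
    printf "%s => %s\n", $link->text, $link->url_abs;
}
```

Use url() instead of url_abs() if you want the href exactly as it appears in the page.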
APPLICATIONS
Check all pages on a web site
Use Abe Timmerman's WWW::CheckSite http://search.cpan.org/dist/WWW-CheckSite/
AUTHOR
Copyright 2005 Andy Lester <andy@petdance.com>