NAME
WWW::CheckSite::Validator - A spider that assesses 'kwalitee' for a site
SYNOPSIS
    use WWW::CheckSite::Validator;

    my $wcv = WWW::CheckSite::Validator->new(
        uri => 'http://www.test-smoke.org'
    );

    while ( my $info = $wcv->get_page ) {
        # handle the info
    }
DESCRIPTION
This is a subclass of WWW::CheckSite::Spider.
WWW::CheckSite::Validator starts its work after the spider has fetched the page. It will check these things:
links
All links on the page (<a href>, <area href>, <frame src>) are checked for availability.
images
All images on the page (<img src>, <input type=image>) are checked for availability.
stylesheets
All stylesheets on the page (<link rel=stylesheet type=text/css>) are checked for availability.
W3 HTML validation
The contents of the page are sent to http://validator.w3.org for validation.
METHODS
WWW::CheckSite::Validator->new( %args )
Extends WWW::CheckSite::Spider->new to check for Image::Info, so we can do a basic check on the images.
On top of the attributes used by WWW::CheckSite::Spider, this class uses:
html_by => by_uri|by_upload|by_none
html_validator => <uri>
css_by => by_uri|by_upload|by_none
css_validator => <uri>
NOTE: the validate attribute has been removed.
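As a rough sketch of how these attributes might be passed, the snippet below builds an argument list and checks html_by and css_by against the three documented values before they reach the constructor. The helper name check_by_args is hypothetical and not part of the module; the constructor call is left as a comment so the snippet runs without the module installed.

```perl
use strict;
use warnings;

# The three values documented for html_by and css_by.
my %ALLOWED_BY = map { $_ => 1 } qw( by_uri by_upload by_none );

# Hypothetical helper (not part of WWW::CheckSite::Validator):
# returns a list of complaints about html_by/css_by, empty if all is well.
sub check_by_args {
    my %args = @_;
    my @errors;
    for my $key (qw( html_by css_by )) {
        next unless exists $args{$key};
        my $value = defined $args{$key} ? $args{$key} : '<undef>';
        push @errors, "$key: unknown value '$value'"
            unless $ALLOWED_BY{$value};
    }
    return @errors;
}

my %args = (
    uri     => 'http://www.test-smoke.org',
    html_by => 'by_uri',
    css_by  => 'by_none',
);

my @errors = check_by_args(%args);
die join( "\n", @errors ), "\n" if @errors;

# With the module installed, the checked arguments would be passed on:
# my $wcv = WWW::CheckSite::Validator->new(%args);
```

Checking the values up front is only a convenience for the caller; how the module itself reacts to an unknown value is not described here.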
$wcs->process_page
This method overrides the WWW::CheckSite::Spider::process_page() method to check on the availability of links, images and stylesheets. When specified it will also send the page for validation by W3.ORG.
On top of the standard information it returns more:
links       a list of links on the page, with some extra info
links_cnt   the number of links on the page
links_ok    the number of links that returned STATUS==200
images      a list of images on the page, with some extra info
images_cnt  the number of images on the page
images_ok   the number of images that returned STATUS==200
styles      a list of stylesheets on the page, with some extra info
styles_cnt  the number of stylesheets on the page
styles_ok   the number of stylesheets that returned STATUS==200
valid       the result of validation at W3.ORG
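The counters listed above lend themselves to a one-line report per page. The following is a minimal sketch, assuming the page information is a hash reference with the documented *_cnt and *_ok keys; the helper summarize_page and the sample numbers are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the counters returned for a page into one
# report line. Only the documented *_cnt and *_ok keys are used.
sub summarize_page {
    my ($info) = @_;
    my @parts;
    for my $kind (qw( links images styles )) {
        my $cnt = $info->{"${kind}_cnt"} || 0;
        my $ok  = $info->{"${kind}_ok"}  || 0;
        push @parts, sprintf( "%s %d/%d ok", $kind, $ok, $cnt );
    }
    return join ', ', @parts;
}

# Made-up numbers standing in for what get_page() might return.
my $info = {
    links_cnt  => 12, links_ok  => 11,
    images_cnt => 4,  images_ok => 4,
    styles_cnt => 1,  styles_ok => 0,
};

print summarize_page($info), "\n";
```

In the loop from the SYNOPSIS, the same helper could be called on each $info as it is returned.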
$wcs->check_links( $stats )
The check_links() method gets information about the links on this page. If a link has no return status yet, it will HEAD the uri and update the cached status for this link, so each uri is requested only once.
NOTE: This method does not respect the exclusion rules, and it respects robot-rules only when strictrules is enabled!
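The idea of caching the status to avoid repeated HEAD requests can be sketched as follows. This is not the module's own code: the helper cached_head_status, the plain hash used as a cache, and the stand-in for the HEAD request are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: look up the status for $uri in %$cache and only
# call $head (a code reference that performs the HEAD request) on a miss.
sub cached_head_status {
    my ( $uri, $cache, $head ) = @_;
    $cache->{$uri} = $head->($uri) unless defined $cache->{$uri};
    return $cache->{$uri};
}

# A stand-in for the real HEAD request that counts how often it is called.
my $requests  = 0;
my $fake_head = sub { $requests++; return 200 };

my %cache;
cached_head_status( 'http://www.test-smoke.org/', \%cache, $fake_head )
    for 1 .. 3;

print "HEAD requests made: $requests\n";
```

Passing the request in as a code reference keeps the sketch free of network access; a real user agent call would take its place.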
The structure for links:
link    as set in the a/area tag
uri     as returned after the HEAD request
tag     set to 'A' or 'AREA'
text    set to the text in the link
status  the return status from the HEAD request
depth   the depth in the "browse-tree"
action  explanation of the action taken on this uri
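A typical use of this structure is to pick out the links that did not come back with status 200. The sketch below assumes the links arrive as a reference to an array of hash references with the keys listed above; the helper broken_links and the sample records are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the link records whose status is not 200
# (records without any status are reported as well).
sub broken_links {
    my ($links) = @_;
    return grep { !defined $_->{status} || $_->{status} != 200 } @$links;
}

# Made-up records using the documented keys.
my @links = (
    {
        link => '/',            uri   => 'http://www.test-smoke.org/',
        tag  => 'A',            text  => 'Home',
        status => 200,          depth => 0,
    },
    {
        link => '/gone.html',   uri   => 'http://www.test-smoke.org/gone.html',
        tag  => 'A',            text  => 'Old page',
        status => 404,          depth => 1,
    },
);

for my $l ( broken_links( \@links ) ) {
    my $status = defined $l->{status} ? $l->{status} : 'no status';
    printf "%s (%s): %s\n", $l->{uri}, $l->{text}, $status;
}
```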
$wcs->check_images( $stats )
The check_images() method gets information about the images on the page. The list comes from the images() method of the mechanize object. It will only HEAD the uri.
The structure for images:
link    as set in the img/input tag
uri     as returned after the HEAD request
tag     set to 'ALT'
text    set to the text of the ALT attribute
status  the return status from the HEAD request
ct      the 'Content-Type' returned by the HEAD request
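Because text holds the ALT attribute, the image records can also be used to spot images without alternative text. A minimal sketch, assuming an array reference of hash references with the keys above; the helper images_missing_alt and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the image records whose ALT text is
# missing or consists of whitespace only.
sub images_missing_alt {
    my ($images) = @_;
    return grep { !defined $_->{text} || $_->{text} !~ /\S/ } @$images;
}

# Made-up records using the documented keys.
my @images = (
    { link => '/logo.png',   text => 'Test::Smoke logo', status => 200, ct => 'image/png' },
    { link => '/spacer.gif', text => '',                 status => 200, ct => 'image/gif' },
);

print "No ALT text: $_->{link}\n" for images_missing_alt( \@images );
```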
$wcs->check_styles( $stats )
The check_styles() method checks the validity of stylesheets used in the page. We check for <link rel="stylesheet" type="text/css"> tags.
The structure for stylesheets:
link    as set in the link tag
uri     as returned after the HEAD request
tag     set to 'link'
text    set to empty for compatibility with links and images
status  the return status from the HEAD request
ct      the 'Content-Type' returned by the HEAD request
$wcs->validate
The validate() method sends the url/contents off to W3.org for validation.
$wcs->validate_by_none
The fallback do-not-validate method.
$wcs->validate_by_uri
Sends only the uri to W3.ORG and gets the validation result.
$wcs->validate_by_upload( $stats )
Creates a temporary file (with File::Temp) from $agent->content, calls the validator with that temporary file and saves the result (as a boolean) in $stats->{validate}.
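The temporary-file step can be sketched with File::Temp as below. This only illustrates writing page content to a file that could then be uploaded; the helper content_to_tempfile, the .html suffix and the sample content are assumptions, and the upload to the validator is not shown.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper: write $content to a temporary file and return
# the file name. UNLINK => 1 removes the file when the program exits.
sub content_to_tempfile {
    my ($content) = @_;
    my ( $fh, $filename ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
    binmode $fh;
    print {$fh} $content;
    close $fh or die "Cannot close $filename: $!";
    return $filename;
}

# Made-up content standing in for $agent->content.
my $file = content_to_tempfile(
    '<html><head><title>t</title></head><body></body></html>');
print "Saved ", ( -s $file ), " bytes to $file\n";
```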
$wcs->validate_by_xmllint( $stats )
Use the xmllint(1) program to validate the (X)HTML.
$wcs->validate_style( $ua )
Dispatch the validation to the right method.
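One way such a dispatch could work is to derive the method name from the css_by attribute and fall back to style_by_none when no matching method exists. The toy class below is a sketch under that assumption; My::ToyValidator and its return values are invented, and the module's actual dispatch may differ.

```perl
use strict;
use warnings;

package My::ToyValidator;    # hypothetical stand-in class

sub new {
    my ( $class, %args ) = @_;
    my $self = { css_by => 'by_none', %args };
    return bless $self, $class;
}

sub style_by_none   { return 'skipped' }
sub style_by_uri    { return 'checked by uri' }
sub style_by_upload { return 'checked by upload' }

# Pick the style_* method named by the css_by attribute,
# falling back to style_by_none for unknown values.
sub validate_style {
    my ( $self, @args ) = @_;
    my $method = $self->can( 'style_' . $self->{css_by} )
        || $self->can('style_by_none');
    return $self->$method(@args);
}

package main;

print My::ToyValidator->new( css_by => 'by_uri' )->validate_style, "\n";
```

Using can() keeps an unexpected attribute value from turning into a call to a method that does not exist.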
$wcs->style_by_none
The fallback do-not-validate-stylesheet method.
$wcs->style_by_uri( $ua )
Sends only the uri to JIGSAW.W3.ORG and gets the validation result.
$wcs->style_by_upload( $ua )
Creates a temporary file (with File::Temp) from $ua->content, calls the validator with that temporary file and returns the result.
$wcs->validate_inline_style( $style )
Creates a new user-agent, and calls validate_upload_style().
$wcs->validate_upload_style( $ua, $style )
Saves $style to a temporary file and uploads it to the css-validator.
$wcs->_extract_inline_styles
Uses HTML::TokeParser to extract inline styles from a document and returns a reference to an array with the contents of the inline style.
$wcs->validate_image( $ua )
This is more of a basic consistency check that uses Image::Info::image_info().
$wcs->ct_can_validate( $ua )
Check if the content-type is "validatable".
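Which content types count as "validatable" is not spelled out here. The sketch below assumes text/html and application/xhtml+xml and works on a Content-Type header string instead of a user-agent object; the helper looks_validatable and that list of types are assumptions, not the module's definition.

```perl
use strict;
use warnings;

# ASSUMPTION: the set of media types treated as validatable.
my %VALIDATABLE = map { $_ => 1 } qw( text/html application/xhtml+xml );

# Hypothetical helper: true if the media type in a Content-Type header
# value (parameters such as charset are ignored) is in %VALIDATABLE.
sub looks_validatable {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    my ($type) = split /;/, $content_type;    # drop "; charset=..." etc.
    $type = '' unless defined $type;
    $type =~ s/^\s+|\s+$//g;                  # trim surrounding whitespace
    return $VALIDATABLE{ lc $type } ? 1 : 0;
}

for my $ct ( 'text/html; charset=UTF-8', 'image/png' ) {
    printf "%-26s %s\n", $ct, looks_validatable($ct) ? 'validate' : 'skip';
}
```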
$wcs->set_action
Why?
SEE ALSO
WWW::CheckSite::Spider, WWW::CheckSite
AUTHOR
Abe Timmerman, <abeltje@cpan.org>
BUGS
Please report any bugs or feature requests to bug-WWW-CheckSite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
COPYRIGHT & LICENSE
Copyright MMV Abe Timmerman, All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.