NAME
Plack::Middleware::DetectRobots - Automatically set a flag in the environment if a robot client is detected
VERSION
version 0.03
SYNOPSIS
use Plack::Builder;
my $app = sub { ... };  # as usual
builder {
enable 'DetectRobots';
# or: enable 'DetectRobots', env_key => 'psgix.robot_client';
# or: enable 'DetectRobots', extended_check => 1, generic_check => 1;
$app;
};
# ... and later ...
if ( $env->{'robot_client'} ) {
# ... do something ...
}
DESCRIPTION
This Plack middleware uses the list of robots that is part of the AWStats log analyzer software package to analyse the User-Agent
HTTP header and to set an environment flag to either a true or false value depending on the detection of a robot client.
Once activated it checks the User-Agent HTTP header against a basic list of patterns for common bots.
If you activate the appropriate options, it can also use an extended list for the detection of less common bots (cf. extended_check) and/or a list of quite generic patterns to detect unknown bots (cf. generic_check).
You may also pass in your own regular expression for further checks (cf. local_regexp).
The checks are executed in this order:
1. Local regular expression
2. Basic check
3. Extended check
4. Generic check
If a check yields a positive result (i.e. detects a bot), the remaining checks are skipped.
Depending on the check which detected a bot, the environment flag is set to one of these values: LOCAL, BASIC, EXTENDED, or GENERIC.
If no bot is detected, the flag is set to 0.
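The order of the checks and the resulting flag values can be illustrated with a minimal sketch. It is not the module's actual implementation: the function name classify_user_agent and the three short patterns are hypothetical stand-ins for the much larger lists derived from AWStats, and the defaults are assumed from the option descriptions in CONFIGURATION.

```perl
use strict;
use warnings;

# Hypothetical stand-in patterns; the real lists are far longer.
my %pattern = (
    BASIC    => qr/googlebot|msnbot/i,
    EXTENDED => qr/bingbot/i,
    GENERIC  => qr/spider|crawler/i,
);

# Returns LOCAL, BASIC, EXTENDED, GENERIC, or 0. The checks run in the
# documented order and stop at the first match.
sub classify_user_agent {
    my ( $user_agent, %args ) = @_;
    $user_agent = '' unless defined $user_agent;

    # Assumed defaults: only the basic check is active.
    my %opt = ( basic_check => 1, %args );

    my @checks = (
        [ LOCAL    => $opt{local_regexp} ],
        [ BASIC    => $opt{basic_check}    ? $pattern{BASIC}    : undef ],
        [ EXTENDED => $opt{extended_check} ? $pattern{EXTENDED} : undef ],
        [ GENERIC  => $opt{generic_check}  ? $pattern{GENERIC}  : undef ],
    );

    for my $check (@checks) {
        my ( $label, $regexp ) = @$check;
        next unless defined $regexp;    # check is disabled or not configured
        return $label if $user_agent =~ $regexp;
    }
    return 0;
}
```

Because the local pattern is tried first, a User-Agent that matches both a local pattern and the basic list is reported as LOCAL.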
The default name of the flag in the environment is robot_client, but this can be customized by setting the env_key option when enabling this middleware.
It might make sense to use psgix.robot_client by default instead, but the PSGI spec states that the "'psgix.' prefix is reserved for officially blessed extensions" - which does not apply to this module. You may, however, set the key to psgix.robot_client yourself by using the env_key option mentioned before.
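An application downstream of the middleware has to read the flag under whatever key was configured. The following sketch shows one way to keep the key in a single place; the helper make_app and the response texts are hypothetical and only serve as illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a PSGI app that reads the robot flag from
# the given environment key (robot_client if none is given).
sub make_app {
    my ($env_key) = @_;
    $env_key = 'robot_client' unless defined $env_key;

    return sub {
        my ($env) = @_;
        my $flag = $env->{$env_key};

        # The flag is 0 for regular clients; otherwise it names the check
        # that matched (LOCAL, BASIC, EXTENDED, or GENERIC).
        my $body = $flag
            ? "Hello, robot (matched by $flag check)\n"
            : "Hello, human\n";

        return [ 200, [ 'Content-Type' => 'text/plain' ], [$body] ];
    };
}
```

The same key must then be passed as env_key when enabling the middleware, e.g. make_app('psgix.robot_client') together with env_key => 'psgix.robot_client'.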
WARNING
This software is currently considered BETA and still needs to be seriously tested!
ROBOTS LIST
Based on Revision 2d289e, 2014-11-20 of http://sourceforge.net/p/awstats/code/ci/develop/tree/wwwroot/cgi-bin/lib/robots.pm.
Note: that list might be somewhat dated, as I did not find bingbot in the list of common bots (only in the extended list) while its predecessor msnbot was considered common.
CONFIGURATION
You may specify the following options when enabling the middleware:
env_key
-
Set the name of the entry in the environment hash.
basic_check
-
You may deactivate the standard checks by setting this option to a false value, e.g. if you are only interested in obscure bots or in your local pattern checks.
By setting this option to a false value while simultaneously passing a regular expression to local_regexp one can imitate the behaviour of Plack::Middleware::BotDetector.
extended_check
-
Determines if an extended list of less often seen robots is also checked for. By default, only common robots are checked for, because the extended check requires a rather large and complex regular expression. Set this param to a true value to change the default behaviour.
generic_check
-
Determines if the User-Agent string is also analysed to determine if it contains certain strings that generically identify the client as a bot, e.g. "spider" or "crawler". By default, this check is not performed, even though it uses only a relatively short and simple regex. Set this param to a true value to change the default behaviour.
local_regexp
-
You may optionally pass in your own regular expression (as a Regexp object using qr//) to check for additional patterns in the User-Agent string.
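As an illustration of how these options combine, the following app.psgi sketch disables the bundled basic list and relies on a local pattern only, which is the setup described under basic_check. The pattern and the response texts are made-up examples, not part of this module.

```perl
use strict;
use warnings;
use Plack::Builder;

my $app = sub {
    my ($env) = @_;

    # With this configuration the flag is either LOCAL or 0.
    my $body = $env->{'robot_client'} ? "bot\n" : "not a bot\n";
    return [ 200, [ 'Content-Type' => 'text/plain' ], [$body] ];
};

builder {
    # Skip the bundled basic list and match only against a local pattern.
    # The pattern below is a hypothetical example.
    enable 'DetectRobots',
        basic_check  => 0,
        local_regexp => qr/examplebot|internal-monitor/i;
    $app;
};
```

Leaving out basic_check => 0 would keep the bundled basic list active in addition to the local pattern.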
SEE ALSO
Plack, Plack::Middleware, Plack::Middleware::BotDetector, http://awstats.org/
The functionality provided by Plack::Middleware::BotDetector is basically the same as that of this module, but it requires you to pass in your own regular expression and does not include a default list of known bots.
AUTHOR
Heiko Jansen <hjansen@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2015 by Heiko Jansen.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.