NAME

URI::Sequin - Extract information from the URLs of Search-Engines

SYNOPSIS

use URI::Sequin qw/se_extract key_extract log_extract %log_types/;

$url = &log_extract($line_from_log_file, 'NCSA');

$log_types{'MyLogType'} = '^(.+?) -> .+$';
$url = &log_extract($line_from_log_file, 'MyLogType');

$keyword_string = &key_extract($url);

($search_engine_name, $search_engine_url) = @{&se_extract($url)};

DESCRIPTION

This module provides three tools to aid people trying to analyse Search-Engine URLs. It’s meant mainly for those who want to analyse referrer logs and pick out key information about site visitors, such as which Search-Engine and keywords they used to find the site.

The functions and globals provided by this module (and exported by default) are:

log_extract($log_line, 'Type')

This will pick out the referring URL from a line of a logfile. The second argument, the log type, can be one of the built-in types or a user-created one. For more information, see %log_types below. This subroutine accepts two scalars (the log line and the name of the log type) and returns a scalar.

key_extract($url)

This will try to determine the keywords used in $url. It accepts a scalar and returns a scalar. Should nothing be found, it returns an undefined value.

se_extract($url)

This will try to determine the name of the Search-Engine used and its URL. It accepts a scalar and returns a reference to an array containing firstly the Search-Engine’s name and secondly the Search-Engine’s URL. Should the URL appear not to be from a search query, it returns a reference to an empty array.

%log_types

There are five built-in logfile types already in this hash. They are:

  • IIS1 - Microsoft IIS 2.0 and 3.0

  • IIS2 - Microsoft IIS 4.0 (W3SVC format)

  • NCSA - For Apache, Netscape and any other NCSA format logs

  • ORW - O'Reilly WebSite format

  • General - A generalised one that will work with most logfiles

It’s easy to add another one. Simply add a key to the hash, with a value that is a regex. Parenthesise the part that is the referring URL, as the module uses $1 to obtain the URL (see the 'MyLogType' example in the Synopsis section).
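To see how the capturing parentheses behave, here is a minimal standalone sketch that does not load the module. The hash %my_log_types and the helper extract_referrer are hypothetical stand-ins invented for illustration, not the module's actual implementation, and the log line is made up. Returning undef when nothing matches is an assumption of this sketch, not documented behaviour of log_extract.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %log_types: each value is a regex string
# whose first capture group is the referring URL.
my %my_log_types = (
    MyLogType => '^(.+?) -> .+$',    # pattern taken from the Synopsis
);

# Hypothetical helper: apply the pattern registered for $type and
# return whatever $1 captured, or undef if the type is unknown or the
# line does not match (an assumption of this sketch).
sub extract_referrer {
    my ($line, $type) = @_;
    my $pattern = $my_log_types{$type};
    return undef unless defined $pattern;
    return $line =~ /$pattern/ ? $1 : undef;
}

# Made-up log line in a "referrer -> requested page" layout.
my $line = 'http://www.example.com/search?q=perl -> /index.html';
my $url  = extract_referrer($line, 'MyLogType');
print defined $url ? "$url\n" : "no referrer found\n";
```

Because the group is non-greedy, it stops at the first " -> " in the line, so only the text before that separator is treated as the referring URL.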

I have only one request for people who use this module. *Please* tell me where and how you've used it, and if you have any thoughts or suggestions on it, tell me!

BUGS

Doesn't like the Amnesi Search Engine. But then, neither do I. Also, the 'General' log type needs to be used with discretion: be sure that none of the URLs contains a literal double-quote character (") if you use it.

AUTHOR

Peter Sergeant <pete@grou.ch>

COPYRIGHT

Copyright 2001 Peter Sergeant.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
