NAME

CGI::Application::Search - Base class for CGI::App Swish-e site engines

SYNOPSIS

	package My::Search;
	use base 'CGI::Application::Search';

	sub cgiapp_init {
	  my $self = shift;
	  $self->param(
	    'SWISHE_INDEX' => 'my-swishe.index',
	    'TEMPLATE'     => 'search_results.tmpl',
	  );
	}

	# let the user turn context highlighting off
	sub cgiapp_prerun {
	  my $self = shift;
	  $self->param('HIGHLIGHT_CONTEXT' => 0)
	    if $self->query->param('highlight_off');
	}

	1;

DESCRIPTION

A CGI::Application based control module that uses the Swish-e Perl API (http://swish-e.org) to perform searches on a Swish-e index of documents. By default it uses HTML::Template to display the search form and the results. You may customize this template to alter the look and feel of the generated search interface.

TUTORIAL

You can skip this section if you're a Swish-e veteran. Otherwise, read on for a step-by-step guide to adding a search interface to your site using CGI::Application::Search.

Step 1: Install Swish-e

The first thing you need to do is install Swish-e. Download it from the Swish-e site:

http://swish-e.org

Then unpack it, cd into the directory, build and install:

tar zxf swish-e-2.4.3.tar.gz
cd swish-e-2.4.3
./configure
make
make install

You'll also need to build the Perl module, SWISH::API, which this module uses:

cd perl
perl Makefile.PL
make
make install

Step 2: Setup a Config File

The next step in setting up a Swish-e search engine is writing a config file. Swish-e supports a smorgasbord of configuration options, but just a few will get you started.

# index all HTML files in /path/to/index
IndexDir /path/to/index
IndexOnly .html .htm
IndexContents HTML2 .html .htm

# C::A::Search needs a description, use the first 1,500 characters
# of the body
StoreDescription HTML2 <body> 1500

# remove doc-root path so links will work on the results page
ReplaceRules remove /path/to/index

Put the above in a file called swish-e.conf.

Step 3: Run the Indexer

Now that you've got a configuration file you can index your site. The basic command is:

$ swish-e -v 1 -c swish-e.conf -f /path/to/swishe-index

The last part is the place where Swish-e will write its index. It should be the name of a file in a directory writable by you and readable by your CGI scripts.

Later you'll need to set up the indexer to run from cron, but for now just run it once.

Step 4: Test Your Index

Swish-e has a command-line interface for running searches, which you can use to confirm that your index is working. For example, to search for "foo":

$ swish-e -w foo -f /path/to/swishe-index

If that works you should see some hits (assuming your site contains "foo").

Step 5: Setup an Instance Script

Like all CGI::Application modules, CGI::Application::Search requires an instance script. Create a file called 'search.pl' or 'search.cgi' in a place where your web server will execute it. Put this in it:

#!/usr/bin/perl -w
use strict;
use CGI::Application::Search;
my $app = CGI::Application::Search->new(
            PARAMS => { SWISHE_INDEX => '/path/to/swishe-index' });
$app->run();

Now make it executable:

$ chmod +x search.pl

Step 6: Test Your Instance Script

First, test it on the command-line:

$ ./search.pl

That should show you the HTML for the search form with no results. Now try it in your browser:

http://yoursite.example.com/search.pl

If that doesn't work, check your error log. Do not email me or the CGI::Application mailing list until you check your error log. Yes, I mean you. Thanks.

Step 7: Rejoice

You've just completed the world's easiest search system setup! Now go set up that indexing cron job.

RUN_MODES

This controller has two run modes. The start_mode is show_search.

  • show_search

    This run mode shows the simple search form. If there are any results, they are displayed as well. This is the default run mode, and it is also called after a search is performed to display the results.

  • perform_search

    This run mode uses the SWISH::API module to perform the search on a given index. If the HIGHLIGHT_CONTEXT option is set, it then uses Text::Context to obtain a suitable context of the search content for each result returned and highlights the text according to the HIGHLIGHT_START and HIGHLIGHT_STOP options.

OTHER METHODS

Most of the time you will not need to call the methods that are implemented in this module. But in cases where more customization is required than can be done in the templates, it might be prudent to override or extend these methods in your derived class.

new()

We simply override and extend the CGI::Application new() constructor to also set up our callbacks.

setup()

A simple no-op sub that you are free to override to run at setup time.

generate_search_query($keywords)

This method is used to generate the query for swish-e from the $keywords (by default the 'keywords' CGI parameter), as well as any EXTRA_PROPERTIES that are present.

If you wish to generate your own search query, override this method. This is common when access or authorization control must be taken into account for your search (e.g., anything under /protected can't be seen by someone who is not logged in).

Please see the swish-e documentation on the exact syntax for the query.
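
As a rough illustration of the kind of string an overriding generate_search_query might return, here is a minimal sketch. The helper name build_query and its arguments are hypothetical and not part of this module; it assumes the method returns a plain Swish-e query string and that any metaname used (such as 'category') is defined in your index. It does not escape or validate user input.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CGI::Application::Search): compose a
# Swish-e query string from the user's keywords plus optional metaname
# restrictions given as a hash ref of name => value.
sub build_query {
    my ($keywords, $extra) = @_;
    my @clauses;

    # Group the keywords so that operators inside them stay together.
    push @clauses, "($keywords)" if defined $keywords && length $keywords;

    # Add one metaname=(value) clause per non-empty extra property.
    for my $name (sort keys %{ $extra || {} }) {
        my $value = $extra->{$name};
        next unless defined $value && length $value;
        push @clauses, "$name=($value)";
    }

    return join ' AND ', @clauses;
}

# prints "(perl cgi) AND category=(tutorials)"
print build_query('perl cgi', { category => 'tutorials' }), "\n";
```

An overriding method in your subclass could delegate to a helper like this and append whatever extra restriction your access rules require.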

CONFIGURATION

There are several configuration parameters that you can set at any time (in your cgiapp_init or cgiapp_prerun, or PARAMS hash in new()) before the run mode is called that will affect the search and display of the results. They are:

SWISHE_INDEX

This is the Swish-e index used for the searches. The default is 'data/swish-e.index'. You will probably override this every time.

TEMPLATE

The name of the search interface template. A default template is included within the module which will be used if you don't specify one. A more elaborate example is included in the distribution under the tmpl/ directory.

The following parameters are passed to your template regardless of its TEMPLATE_TYPE:

searched
elapsed_time
keywords
hits
hit_title
hit_path
hit_last_modified
hit_size
hit_description
first_page
last_page
prev_page
next_page
pages
current
page_num
start_num
stop_num
total_entries

TEMPLATE_TYPE

This module uses CGI::Application::Plugin::AnyTemplate to allow flexibility in choosing which templating system to use for your search. This works especially well when you are trying to integrate the Search into an existing app with an existing templating structure.

This value is passed to the $self->template->config() method as the default_type. By default it is 'HTMLTemplate'. Please see CGI::Application::Plugin::AnyTemplate for more options.

If you want more control over the template configuration, it is probably best to subclass CGI::Application::Search and pass your desired params to $self->template->config.

PER_PAGE

How many search result items to display per page. The default is 10.

HIGHLIGHT_CONTEXT

Boolean indicating whether or not we should highlight the context. The default is true.

HIGHLIGHT_START

The text to be prepended to a word being highlighted. If this value is false and HIGHLIGHT_CONTEXT is true, the default provided by Text::Context is used. The default text is <strong>.

HIGHLIGHT_STOP

The text to be appended to a word being highlighted. If this value is false and HIGHLIGHT_CONTEXT is true, the default provided by Text::Context is used. The default text is </strong>.
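
To make the effect of the start and stop markers concrete, here is a small sketch. It is not how this module works internally: it does not use Text::Context, and the highlight helper is hypothetical. It only shows each matching word being wrapped in the two marker strings.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT Text::Context: wrap each whole-word occurrence
# of the given keywords in start/stop markers, ignoring case.
sub highlight {
    my ($text, $keywords, $start, $stop) = @_;

    # Escape regex metacharacters in every keyword before joining them.
    my $alternation = join '|', map { quotemeta($_) } @$keywords;
    return $text unless length $alternation;

    $text =~ s/\b($alternation)\b/$start$1$stop/gi;
    return $text;
}

# prints "Swish-e indexes your <strong>site</strong>."
print highlight('Swish-e indexes your site.', ['site'],
                '<strong>', '</strong>'), "\n";
```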

EXTRA_PROPERTIES

This is an array ref of extra properties used in the search. By default, the module will only use the value of the 'keywords' parameter coming in the CGI query. If anything is provided as an extra property then it will be added to the query used in the search.

An example: you have some of your pages grouped into categories, and you want users to be able to narrow their results by category. Add the word 'category' to the 'EXTRA_PROPERTIES' list, then add an optional 'category' form element to your search form. If the user gives that element a value, it will be seen and applied to the search. This only works if you have the 'category' element defined for your documents (see "SWISH-E Configuration" and 'MetaNames' in the swish-e.org SWISH-CONF documentation).

The default is an empty list.

CONTEXT_LENGTH

This is the maximum length for the context (in chars) that is displayed for each search result. The default is 250 characters.

show_search()

This method will load the template pointed to by the TEMPLATE param (falling back on a default internal template if none is configured) and display it to the user. It will 'associate' this template with $self so that any parameters in $self->param() are also accessible to the template. It will also use HTML::FillInForm to fill in the search form with the previously selected parameters.

perform_search()

This is where the meat of the searching is performed. We create a SWISH::API object on the SWISHE_INDEX and build the query for the search from the value of the 'keywords' CGI parameter and any EXTRA_PROPERTIES. The search is then executed. If HIGHLIGHT_CONTEXT is true, we use Text::Context to highlight the results. The results data is then formatted, showing only PER_PAGE elements per page (if PER_PAGE is true), along with a list of pages that can be selected for navigating through the results. Finally, we return to the show_search() method for display.

TEMPLATES

A default template is provided inside the module which will be used if you don't specify a template. This is useful for testing out the module and may also serve as a base for your template development.

Two more elaborate templates are provided in the tmpl/ directory as examples of how to use this module. Please feel free to copy and change them in whatever way you see fit. To give you more information to display (or not display, depending on your preference), the following variables are available for your templates:

Global Tmpl Vars

These variables are available throughout the templates and contain information related to the search as a whole:

  • searched

    A boolean indicating whether or not a search was performed.

  • keywords

    The exact string that was submitted to the server from the input named 'keywords'.

  • elapsed_time

    A string representing the number of seconds that the search took. This will be a floating point number with a precision of 3.

  • hits

    This is the TMPL_LOOP that contains the actual results from the search.

  • pages

    This is the TMPL_LOOP that contains paging information for the results.

  • first_page

    A boolean indicating whether this page of the results is the first.

  • last_page

    A boolean indicating whether this page of the results is the last.

  • start_num

    This is the number of the first result on the current page.

  • stop_num

    This is the number of the last result on the current page.

  • total_entries

    The total number of results for the search, not the number shown on the current page.
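
The relationship between the paging variables can be sketched with a little arithmetic. This is an illustration only, not the module's own code: the page_info helper is hypothetical, and it assumes that page numbers and result numbers both start at 1.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: derive the paging values for one page of results.
# Assumes pages and result numbers are both counted from 1.
sub page_info {
    my ($total_entries, $per_page, $page_num) = @_;

    # Number of the last page; an empty result set still has one page.
    my $last = $total_entries ? ceil($total_entries / $per_page) : 1;

    my $start = $total_entries ? ($page_num - 1) * $per_page + 1 : 0;
    my $stop  = $page_num * $per_page;
    $stop = $total_entries if $stop > $total_entries;

    return {
        start_num     => $start,
        stop_num      => $stop,
        total_entries => $total_entries,
        first_page    => $page_num == 1     ? 1 : 0,
        last_page     => $page_num == $last ? 1 : 0,
    };
}

# With 25 results and PER_PAGE of 10, page 3 holds results 21 to 25.
my $info = page_info(25, 10, 3);
printf "Results %d-%d of %d\n",
    @{$info}{qw(start_num stop_num total_entries)};
```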

HITS TMPL_LOOP Vars

These variables are available only inside of the TMPL_LOOP named "HITS".

  • hit_reccount

    The swishreccount property of the results as indexed by SWISH-E

  • hit_rank

    The rank of the result as given by SWISH-E (the swishrank property)

  • hit_title

    The swishtitle property of the results as indexed by SWISH-E

  • hit_path

    The swishdocpath property of the results as indexed by SWISH-E

  • hit_size

    The swishdocsize property of the results as indexed by SWISH-E and then formatted with Number::Format::format_bytes

  • hit_description

    The swishdescription property of the results as indexed by SWISH-E. If HIGHLIGHT_CONTEXT is true, then this description will also have search terms highlighted and will only be, at most, CONTEXT_LENGTH characters long.

  • hit_last_modified

    The swishlastmodified property of the results as indexed by SWISH-E and then formatted using Time::Piece::strftime with a format string of %B %d, %Y.
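
For reference, this is roughly what the %B %d, %Y format produces. The sketch assumes the swishlastmodified value is a Unix epoch timestamp, and the format_last_modified helper is hypothetical. It uses gmtime to keep the example independent of the local time zone; the module itself may use local time.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: format an epoch timestamp the way the
# hit_last_modified variable is described (full month name, day, year).
sub format_last_modified {
    my ($epoch) = @_;
    # Time::Piece overrides gmtime to return an object with strftime.
    return gmtime($epoch)->strftime('%B %d, %Y');
}

# The epoch itself (0) falls on the first day of January 1970.
print format_last_modified(0), "\n";
```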

OTHER NOTES

  • If at any time prior to the execution of the 'perform_search' run mode you set the $self->param('results') parameter, a search will not be performed; instead, an empty set of results is returned. This is helpful when you decide in either cgiapp_init or cgiapp_prerun that this user does not have permission to perform the desired search.

  • You must use the StoreDescription setting in your Swish-e configuration file. If you don't, you'll get an error when C::A::Search tries to retrieve a description for each hit.

AUTHOR

Michael Peters <mpeters@plusthree.com>

Thanks to Plus Three, LP (http://www.plusthree.com) for sponsoring my work on this module.

CONTRIBUTORS

Sam Tregar <sam@tregar.com>