NAME
KinoSearch::Docs::Tutorial::Simple - Bare-bones search app.
Setup
First, move (or copy) the directory containing the HTML presentation of the US Constitution from the sample
directory of the KinoSearch distribution to the base level of your web server's htdocs
directory.
$ mv sample/us_constitution /usr/local/apache2/htdocs/
Next, create a configuration file, conf.pl, which will be shared by both our indexing and search apps.
# conf.pl -- Configuration file shared by invindexer.pl and search.cgi.
{
    # Path to the index on the file system.
    path_to_invindex => '/path/to/uscon_invindex',
    # Path to the directory which holds the US Constitution html files.
    uscon_source => '/usr/local/apache2/htdocs/us_constitution',
};
Change the values in conf.pl as needed.
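Because conf.pl is just a Perl file whose last expression is an anonymous hash, it can be loaded with do FILE. The following is a minimal sketch of that mechanism, not part of the tutorial scripts: it writes a throwaway conf.pl to a temporary directory (an assumption made only so the example is self-contained), loads it, and checks that both keys are present.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use File::Spec::Functions qw( catfile );

# Write a throwaway conf.pl so the example is self-contained.
# Both paths are placeholders, not real locations.
my $dir       = tempdir( CLEANUP => 1 );
my $conf_path = catfile( $dir, 'conf.pl' );
open( my $out, '>', $conf_path ) or die "Can't write '$conf_path': $!";
print $out <<'END_CONF';
{
    path_to_invindex => '/path/to/uscon_invindex',
    uscon_source     => '/usr/local/apache2/htdocs/us_constitution',
};
END_CONF
close $out or die "Can't close '$conf_path': $!";

# do FILE evaluates the file and returns its last expression,
# here a reference to the anonymous hash.
my $conf = do $conf_path;
die "Can't parse '$conf_path': $@" if $@;
die "Can't read '$conf_path': $!" unless defined $conf;
die "'$conf_path' must return a hashref" unless ref($conf) eq 'HASH';

# Fail early if a key the scripts rely on is missing.
for my $key (qw( path_to_invindex uscon_source )) {
    die "conf.pl lacks '$key'" unless defined $conf->{$key};
}
print "$_ => $conf->{$_}\n" for sort keys %$conf;
```

Checking $@ and $! separately distinguishes a syntax error in the file from a file that could not be read at all.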
Indexing: invindexer.pl
Our first task will be to create an application called invindexer.pl which builds a searchable "inverted index" from a collection of documents.
After we load the configuration file and all necessary modules...
#!/usr/bin/perl
use strict;
use warnings;
# Load configuration file. (Note: change conf.pl location as needed.)
my $conf;
BEGIN { $conf = do "./conf.pl" or die "Can't load conf.pl: " . ( $@ || $! ); }
use KinoSearch::Simple;
use File::Spec::Functions qw( catfile );
... we'll start by creating a KinoSearch::Simple object, telling it where we'd like the index to be located and the language of the source material.
my $simple = KinoSearch::Simple->new(
    path     => $conf->{path_to_invindex},
    language => 'en',
);
Next, we'll add a subroutine which reads in and extracts plain text from an HTML source file. KinoSearch::Simple won't be of any help with this task, because it's not equipped to deal with source files directly. As a matter of principle, KinoSearch remains deliberately ignorant of the vast subject of file formats, preferring to focus instead on its core competencies of indexing and search.
There are many excellent dedicated parsing modules available on CPAN, and ordinarily we'd be calling on HTML::Parser or the like... however, today we're going to use quick-and-dirty regular expressions for the sake of simplicity. Parsing HTML using regexes is generally an awful idea, but we can guarantee that the following fragile-but-easy-to-grok parsing sub will work because the source docs are 100% controlled by us and we can ensure that they are well-formed.
# Parse an HTML file from our US Constitution collection and return a
# hashref with three keys: title, content, and url.
sub slurp_and_parse_file {
    my $filename = shift;
    my $filepath = catfile( $conf->{uscon_source}, $filename );
    open( my $fh, '<', $filepath )
        or die "Can't open '$filepath': $!";
    my $raw = do { local $/; <$fh> };    # slurp!
    # build up a document hash
    my %doc = ( url => "/us_constitution/$filename" );
    $raw =~ m#<title>(.*?)</title>#s
        or die "couldn't isolate title in '$filepath'";
    $doc{title} = $1;
    $raw =~ m#<div id="bodytext">(.*?)</div><!--bodytext-->#s
        or die "couldn't isolate bodytext in '$filepath'";
    $doc{content} = $1;
    $doc{content} =~ s/<.*?>/ /gsm;    # quick and dirty tag stripping
    return \%doc;
}
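To see what those regexes actually pull out, here is a small sketch that runs the same patterns against an in-memory string instead of a file. The sample markup is hypothetical, merely shaped like the us_constitution pages (a title element plus a bodytext div closed by a marker comment); the whitespace cleanup at the end is for display only and is not something the indexing script does.

```perl
use strict;
use warnings;

# Hypothetical miniature page, shaped like the us_constitution files.
my $raw = <<'END_HTML';
<html><head><title>Amendment 1</title></head>
<body>
<div id="bodytext">
<h1>Amendment 1</h1>
<p>Congress shall make <em>no law</em> respecting an establishment of religion.</p>
</div><!--bodytext-->
</body></html>
END_HTML

# Same patterns as the parsing sub; list assignment captures $1 directly.
my ($title) = $raw =~ m#<title>(.*?)</title>#s
    or die "couldn't isolate title";
my ($content) = $raw =~ m#<div id="bodytext">(.*?)</div><!--bodytext-->#s
    or die "couldn't isolate bodytext";

$content =~ s/<.*?>/ /gs;      # replace each tag with a space
$content =~ s/\s+/ /g;         # collapse whitespace (display only)
$content =~ s/^\s+|\s+$//g;    # trim leading and trailing space

print "title:   $title\n";
print "content: $content\n";
```

Replacing each tag with a space rather than deleting it keeps words on either side of a tag from running together, which matters once the text is tokenized for indexing.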
Add some elementary directory reading code...
# Collect names of source html files.
opendir( my $source_dh, $conf->{uscon_source} )
    or die "Couldn't opendir '$conf->{uscon_source}': $!";
my @filenames;
for my $filename ( readdir $source_dh ) {
    next unless $filename =~ /\.html$/;    # anchored: extension only
    next if $filename eq 'index.html';
    push @filenames, $filename;
}
closedir $source_dh
    or die "Couldn't closedir '$conf->{uscon_source}': $!";
... and now we're ready for the meat of invindexer.pl:
foreach my $filename (@filenames) {
    my $doc = slurp_and_parse_file($filename);
    $simple->add_doc($doc);    # ta-da!
}
That's all there is to it.
Search: search.cgi
As with our indexing app, the bulk of the code in our search script won't be KinoSearch-specific.
The beginning is dedicated to CGI processing and configuration.
#!/usr/bin/perl -T
use strict;
use warnings;
# Load configuration file. (Note: change conf.pl location as needed.)
my $conf;
BEGIN { $conf = do "./conf.pl" or die "Can't load conf.pl: " . ( $@ || $! ); }
use CGI;
use Data::Pageset;
use HTML::Entities qw( encode_entities );
use KinoSearch::Simple;
my $cgi = CGI->new;
my $q = $cgi->param('q') || '';
my $offset = $cgi->param('offset') || 0;
my $hits_per_page = 10;
Once that's out of the way, we create our KinoSearch::Simple object and feed it a query string.
my $simple = KinoSearch::Simple->new(
    path     => $conf->{path_to_invindex},
    language => 'en',
);
my $hit_count = $simple->search(
    query      => $q,
    offset     => $offset,
    num_wanted => $hits_per_page,
);
The value returned by search() is the total number of documents in the collection which matched the query. We'll show this hit count to the user, and also use it along with the parameters offset and num_wanted to break up results into "pages" of manageable size.
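The arithmetic behind that paging is simple enough to sketch without any search library at all. The helper below, page_info, is hypothetical and is not part of KinoSearch or of search.cgi (which delegates this work to Data::Pageset); it only illustrates how a hit count, an offset, and a page size combine.

```perl
use strict;
use warnings;

# Hypothetical helper: derive display numbers from the three values the
# search script already has on hand.
sub page_info {
    my ( $hit_count, $offset, $hits_per_page ) = @_;

    # Pages are numbered from 1; int() guards against odd offsets.
    my $current_page = int( $offset / $hits_per_page ) + 1;

    # Integer ceiling of hit_count / hits_per_page, at least one page.
    my $last_page
        = int( ( $hit_count + $hits_per_page - 1 ) / $hits_per_page ) || 1;

    # Result numbers shown to the user are 1-based.
    my $first_result = $hit_count ? $offset + 1 : 0;
    my $last_result  = $offset + $hits_per_page;
    $last_result = $hit_count if $last_result > $hit_count;

    return {
        current_page => $current_page,
        last_page    => $last_page,
        first_result => $first_result,
        last_result  => $last_result,
    };
}

my $info = page_info( 37, 20, 10 );
printf "Results %d-%d of %d (page %d of %d)\n",
    $info->{first_result}, $info->{last_result}, 37,
    $info->{current_page}, $info->{last_page};
```

With 37 hits, an offset of 20, and 10 hits per page, this prints "Results 21-30 of 37 (page 3 of 4)".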
Calling search() on our Simple object turns it into an iterator. Invoking fetch_hit_hashref() now returns our stored documents (augmented with a score), starting with the most relevant.
# create result list
my $report = '';
while ( my $hit = $simple->fetch_hit_hashref ) {
    my $score = sprintf( "%0.3f", $hit->{score} );
    my $title = encode_entities( $hit->{title} );
    $report .= qq|
        <p>
            <a href="$hit->{url}"><strong>$title</strong></a>
            <em>$score</em>
            <br>
            <span class="excerptURL">$hit->{url}</span>
        </p>
        |;
}
The rest of the script is just text wrangling. Notable aspects include the use of Data::Pageset to create paging links, and the encode_entities function to guard against cross-site scripting attacks.
#---------------------------------------------------------------#
# No tutorial material below this point - just html generation. #
#---------------------------------------------------------------#
# Generate paging links and hit count, print and exit.
my $paging_links = generate_paging_info( $q, $hit_count );
blast_out_content( $q, $report, $paging_links );
# Create html fragment with links for paging through results n-at-a-time.
sub generate_paging_info {
    my ( $query_string, $total_hits ) = @_;
    $query_string = encode_entities($query_string);
    my $paging_info;
    if ( !length $query_string ) {
        # No query? No display.
        $paging_info = '';
    }
    elsif ( $total_hits == 0 ) {
        # Alert the user that their search failed.
        $paging_info
            = qq|<p>No matches for <strong>$query_string</strong></p>|;
    }
    else {
        # Round down so that an offset which isn't a multiple of
        # $hits_per_page still yields a whole page number.
        my $current_page = int( $offset / $hits_per_page ) + 1;
        my $pager = Data::Pageset->new(
            {   total_entries    => $total_hits,
                entries_per_page => $hits_per_page,
                current_page     => $current_page,
                pages_per_set    => 10,
                mode             => 'slide',
            }
        );
        my $last_result  = $pager->last;
        my $first_result = $pager->first;
        # Display the result nums, start paging info.
        $paging_info = qq|
            <p>
                Results <strong>$first_result-$last_result</strong>
                of <strong>$total_hits</strong>
                for <strong>$query_string</strong>.
            </p>
            <p>
                Results Page:
            |;
        # Create a url for use in paging links.
        my $href = $cgi->url( -relative => 1 ) . "?" . $cgi->query_string;
        $href .= ";offset=0" unless $href =~ /offset=/;
        # Generate the "Prev" link.
        if ( $current_page > 1 ) {
            my $new_offset = ( $current_page - 2 ) * $hits_per_page;
            $href =~ s/(?<=offset=)\d+/$new_offset/;
            $paging_info .= qq|<a href="$href">&lt;= Prev</a>\n|;
        }
        # Generate paging links.
        for my $page_num ( @{ $pager->pages_in_set } ) {
            if ( $page_num == $current_page ) {
                $paging_info .= qq|$page_num \n|;
            }
            else {
                my $new_offset = ( $page_num - 1 ) * $hits_per_page;
                $href =~ s/(?<=offset=)\d+/$new_offset/;
                $paging_info .= qq|<a href="$href">$page_num</a>\n|;
            }
        }
        # Generate the "Next" link.
        if ( $current_page != $pager->last_page ) {
            my $new_offset = $current_page * $hits_per_page;
            $href =~ s/(?<=offset=)\d+/$new_offset/;
            $paging_info .= qq|<a href="$href">Next =&gt;</a>\n|;
        }
        # Close tag.
        $paging_info .= "</p>\n";
    }
    return $paging_info;
}
# Print content to output.
sub blast_out_content {
my ( $query_string, $hit_list, $paging_info ) = @_;
$query_string = encode_entities($query_string);
print "Content-type: text/html\n\n";
print qq|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type"
content="text/html;charset=ISO-8859-1">
<link rel="stylesheet" type="text/css"
href="/us_constitution/uscon.css">
<title>KinoSearch: $query_string</title>
</head>
<body>
<div id="navigation">
<form id="usconSearch" action="">
<strong>
Search the
<a href="/us_constitution/index.html">US Constitution</a>:
</strong>
<input type="text" name="q" id="q" value="$query_string">
<input type="submit" value="=>">
<input type="hidden" name="offset" value="0">
</form>
</div><!--navigation-->
<div id="bodytext">
$hit_list
$paging_info
<p style="font-size: smaller; color: #666">
<em>
Powered by
<a href="http://www.rectangular.com/kinosearch/">KinoSearch</a>
</em>
</p>
</div><!--bodytext-->
</body>
</html>
|;
}
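The reason the script runs the query string through encode_entities before echoing it back can be shown with a few lines of plain Perl. The escape_html sub below is a hypothetical, deliberately minimal stand-in used only so the sketch has no non-core dependencies; it covers just five characters, whereas search.cgi itself relies on HTML::Entities.

```perl
use strict;
use warnings;

# Hypothetical stand-in for encode_entities: escapes only the five
# characters that matter for HTML text and quoted attribute values.
my %ENTITY_FOR = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

sub escape_html {
    my $text = shift;
    $text =~ s/([&<>"'])/$ENTITY_FOR{$1}/g;
    return $text;
}

# A query crafted to break out of the value="..." attribute.
my $q    = q|"><script>alert(1)</script>|;
my $safe = escape_html($q);

# Echoed unescaped, $q would close the attribute and inject a script
# element; escaped, it is displayed as inert text.
print qq|<input type="text" name="q" value="$safe">\n|;
```

Escaping at the point of output, as both generate_paging_info and blast_out_content do, keeps the raw query available for the search itself.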
OK... now what?
KinoSearch::Simple is perfectly adequate for some tasks, but it's not very flexible. Many people will find that it doesn't do at least one or two things they can't live without.
In our next tutorial chapter, BeyondSimple, we'll rewrite our indexing and search scripts using the classes that KinoSearch::Simple hides from view, opening up the possibilities for expansion; then, we'll spend the rest of the tutorial chapters exploring these possibilities.
COPYRIGHT
Copyright 2005-2007 Marvin Humphrey
LICENSE, DISCLAIMER, BUGS, etc.
See KinoSearch version 0.20.