NAME
KSx::Highlight::Summarizer - KinoSearch Highlighter subclass that provides more comprehensive summaries
VERSION
0.04 (beta)
SYNOPSIS
    use KSx::Highlight::Summarizer;

    my $summarizer = new KSx::Highlight::Summarizer
        searchable => $searcher,
        query      => 'foo bar',
        field      => 'content',

        # optional:
        pre_tag    => '<b>',
        post_tag   => '</b>',
        encoder    => sub {
            my $str = shift; $str =~ s/([&'"<])/'&#'.ord($1).';'/eg; $str
        },
        page_handler   => sub { "<h3>Page $_[1]:</h3>" },
        ellipsis       => "\x{2026}", # default: ' ... '
        excerpt_length => 150,        # default: 200
        summary_length => 400,
    ;

    my $excerpt = $summarizer->create_excerpt( $hit );
DESCRIPTION
This module extends KinoSearch::Highlight::Highlighter (which provides an excerpt for a search result, with search words highlighted) to provide various customisations, especially summaries, i.e., multiple excerpts joined together with ellipses.
The superclass finds the best location within the text of a search result, takes a single piece of text surrounding it, and then formats it, highlighting words as appropriate. This module will also take the second best location and create an excerpt for that (removing overlap), and so on until the summary_length is reached or exceeded.
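The selection loop described above can be sketched in plain Perl. This is a simplified illustration and not the module's actual code: the pick_excerpts function, the pre-scored list of locations, and the fixed-size windows are all assumptions made for the example. In particular, it skips an overlapping candidate outright, whereas the module is described as removing the overlap.

```perl
use strict;
use warnings;

# Hypothetical helper: choose excerpt windows greedily, best score first.
# $locations is a list of [offset, score] pairs (assumed already scored).
sub pick_excerpts {
    my ($locations, $excerpt_length, $summary_length) = @_;
    my @chosen;
    my $total = 0;
    for my $loc (sort { $b->[1] <=> $a->[1] } @$locations) {
        # Stop once the untrimmed excerpt lengths reach summary_length.
        last if $total >= $summary_length;
        my ($start, $end) = ($loc->[0], $loc->[0] + $excerpt_length);
        # Simplification: skip a window that overlaps one already chosen.
        next if grep { $start < $_->[1] && $end > $_->[0] } @chosen;
        push @chosen, [$start, $end];
        $total += $excerpt_length;
    }
    # Return the windows in document order.
    return sort { $a->[0] <=> $b->[0] } @chosen;
}

my @windows = pick_excerpts(
    [ [500, 3.0], [520, 2.5], [40, 2.0], [900, 1.0] ],
    150,    # excerpt_length
    300,    # summary_length
);
print join(', ', map { "$_->[0]-$_->[1]" } @windows), "\n";
# prints: 40-190, 500-650
```

Here the candidate at offset 520 is dropped because it overlaps the best one, and the candidate at 900 is never considered because two excerpts of 150 characters already meet the requested summary length of 300.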
METHODS
new
This is the constructor. It takes hash-style arguments, as shown in the "SYNOPSIS". The various arguments are as follows:
- searchable
  A reference to an object that isa KinoSearch::Search::Searchable (e.g., a KinoSearch::Searcher).
- query
  A query string or object.
- field
  The name of the field for which to make a summary.
- pre_tag, post_tag
  These two are strings of text to be inserted around highlighted words, such as HTML tags. The defaults are '<strong>' and '</strong>'.
- encoder
  A code ref that is expected to encode the text fed to it, e.g., with HTML entities.
- page_handler
  A code ref. If this is provided, it will be called for every page break (form feed; ASCII character 12) in the summary, and its return value substituted for that page break. The arguments will be (0) the hit (a KinoSearch::Doc::HitDoc object) and (1) the page number.
- ellipsis
  The ellipsis mark to use. The default is three ASCII dots surrounded by spaces: ' ... '
- excerpt_length
  The length of each excerpt (default is 200), not including ellipses. An excerpt may end up being shorter than this, because the start is trimmed to the nearest sentence boundary or page break, and the end is trimmed to the nearest word boundary.
- summary_length
  The approximate length of the summary, not including ellipses. Excerpts are collected together until the lengths of the excerpts (before trimming) equal or exceed the number passed to this argument. If this is omitted, only one excerpt will be made.
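To illustrate the two callback arguments, the following sketch applies an encoder and a page_handler to a sample string by hand. The render helper and its page-numbering rule (the text after a form feed belongs to the next page, counting up from a given starting page) are assumptions made for the example. The module calls these code refs internally, and its exact numbering may differ.

```perl
use strict;
use warnings;

# Encoder as in the SYNOPSIS: replace HTML-significant characters
# with numeric character references.
my $encoder = sub {
    my $str = shift;
    $str =~ s/([&'"<])/'&#' . ord($1) . ';'/eg;
    return $str;
};

# Page handler: receives the hit and the page number.
my $page_handler = sub {
    my ($hit, $page) = @_;
    return "<h3>Page $page:</h3>";
};

# Hypothetical stand-in for the processing of one excerpt:
# encode ordinary text, and replace each form feed with the
# return value of the page handler.
sub render {
    my ($text, $hit, $first_page) = @_;
    my $page = $first_page;
    return join '', map {
        $_ eq "\f" ? $page_handler->($hit, ++$page) : $encoder->($_)
    } split /(\f)/, $text;
}

# undef stands in for a KinoSearch::Doc::HitDoc object here.
print render("Fish & chips\fTea <hot>", undef, 1), "\n";
# prints: Fish &#38; chips<h3>Page 2:</h3>Tea &#60;hot>
```

Note that the handler's return value is inserted as is, without passing through the encoder in this sketch, so it can safely contain markup.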
create_excerpt
This requires a KinoSearch::Doc::HitDoc object as its sole argument. It creates and returns a summary.
BUGS
A very long custom ellipsis, or two page breaks a few characters apart, can break the page-counting algorithm.
SINE QUIBUS NON
This module requires perl and the following modules, which are available from the CPAN:
The development version of KinoSearch available at http://www.rectangular.com/svn/kinosearch/trunk, revision 3118 or later. It has only been tested with revision 3122.
AUTHOR & COPYRIGHT
Copyright (C) 2008 Father Chrysostomos <sprout at, um, cpan.org>
This program is free software; you may redistribute or modify it (or both) under the same terms as perl.
ACKNOWLEDGEMENTS
Much of the code in this module is based on Marvin Humphrey's KinoSearch::Highlight::Highlighter, of which this is a subclass.