NAME
Search::Tools::Snipper - extract terms in context
SYNOPSIS
use Search::Tools;
my $query = 'quick dog';
my $text  = 'the quick brown fox jumped over the lazy dog';
my $s = Search::Tools->snipper(
    occur     => 3,
    context   => 8,
    word_len  => 5,
    max_chars => 300,
    query     => $query,
);
print $s->snip( $text );
DESCRIPTION
Search::Tools::Snipper extracts terms and their context from a larger block of text. The larger block may be plain text or HTML/XML.
METHODS
new( query => query )
Instantiate a new object. query must be either a scalar string or a Search::Tools::Query object.
Many of the following methods are also available as key/value pairs to new().
BUILD
Called internally by new().
as_sentences
Experimental feature. Attempt to extract a snippet that starts at a sentence boundary.
occur
The number of snippets that should be returned by snip().
Available via new().
context
The number of context words to include in the snippet.
Available via new().
max_chars
The maximum number of characters (not bytes, under Perl >= 5.8) to return in a snippet. NOTE: This is only used to test whether the text is worth snipping at all, or if no terms are found.
See also show() and ignore_length().
Available via new().
word_len
The estimated average word length used in combination with context(). You can usually ignore this value.
Available via new().
show
Boolean flag indicating whether snip() should always return something, or give up when no snippets are found. Default is 1 (true).
If no matches are found, the first max_chars characters of the text are returned.
Available via new().
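A minimal sketch of how show might be used, assuming the constructor options described in this document. The sample text and the query term are made up, and the exact value returned when show is false is not documented here, so the example only tests it for truth:

```perl
use strict;
use warnings;
use Search::Tools;

# Hypothetical text that does not contain the query term.
my $text = 'the quick brown fox jumped over the lazy dog and kept on running';

# show => 1 (the default): fall back to the start of the text.
my $lenient = Search::Tools->snipper(
    query     => 'zebra',
    max_chars => 20,
    show      => 1,
);
my $fallback = $lenient->snip($text);
print "fallback: $fallback\n";

# show => 0: give up when nothing matches.
my $picky = Search::Tools->snipper(
    query     => 'zebra',
    max_chars => 20,
    show      => 0,
);
my $nothing = $picky->snip($text);
print "nothing was returned\n" unless $nothing;
```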
escape
Boolean flag indicating whether snip() should escape any HTML/XML markup in the resulting snippet or not. Default is 0 (false).
Available via new().
strip_markup
Boolean flag indicating whether snip() should attempt to remove any HTML/XML markup in the original text before snipping is applied. Default is 0 (false).
Available via new().
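The escape and strip_markup flags address different stages: strip_markup acts on the input before snipping, while escape acts on the resulting snippet. A short sketch, assuming the options described above and a made-up HTML fragment:

```perl
use strict;
use warnings;
use Search::Tools;

# Hypothetical HTML input.
my $html = '<p>The <b>quick</b> brown fox jumped over the lazy dog.</p>';

# Remove markup from the original text before snipping.
my $stripper = Search::Tools->snipper(
    query        => 'quick',
    strip_markup => 1,
);
my $stripped = $stripper->snip($html);
print "stripped: $stripped\n";

# Leave the input alone, but escape markup in the returned snippet.
my $escaper = Search::Tools->snipper(
    query  => 'quick',
    escape => 1,
);
my $escaped = $escaper->snip($html);
print "escaped:  $escaped\n";
```

Which flag to choose depends on whether the snippet will be embedded in an HTML page (where escaping is usually wanted) or shown as plain text.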
snipper
The CODE ref used by the snip() method for actually extracting snippets. You can use your own snipper function if you want (though if you have a better snipper algorithm than the ones in this module, why not share it?). If you go this route, have a look at the source code for snip() to see how snipper() is used.
Available via new().
type
There are different algorithms used internally for snipping text. They are, in order of speed:
- dumb
Just grabs the first max_chars characters and returns them, doing a little clean-up to prevent partial words from ending the snippet and (optionally) escaping the text.
- loop
Fastest for single-word queries.
- token
Most accurate, for both single-word and phrase queries, although it relies on a HeatMap in order to locate phrases.
See also the use_pp feature.
- offset (default)
Same as re but optimized slightly to look at a substr of text.
- re
The regular expression algorithm. Will match phrases exactly.
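To make the dumb strategy concrete, here is a simplified stand-in written in plain Perl. It is not the module's implementation: the function name dumb_snip is hypothetical, and the sketch covers only the truncation and partial-word clean-up described above, not escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: return at most $max_chars characters of $text,
# without ending the snippet on a partial word.
sub dumb_snip {
    my ( $text, $max_chars ) = @_;
    return $text if length($text) <= $max_chars;

    my $snip = substr( $text, 0, $max_chars );

    # If the character just past the cut is not whitespace, the cut
    # landed inside or directly before a word, so drop the trailing fragment.
    if ( substr( $text, $max_chars, 1 ) =~ /\S/ ) {
        $snip =~ s/\s+\S*\z//;
    }

    $snip =~ s/\s+\z//;    # trim any trailing whitespace
    return $snip;
}

print dumb_snip( 'the quick brown fox', 7 ), "\n";
```

A single word longer than max_chars is returned truncated in this sketch, since there is no earlier word boundary to fall back to.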
type_used
The name of the internal snipper function used. In case you're curious.
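A sketch of requesting a specific algorithm and then checking which one was used. It assumes that type can be passed to the constructor like the other options (this document does not say so explicitly for type), and the sample text is made up:

```perl
use strict;
use warnings;
use Search::Tools;

my $text = 'the quick brown fox jumped over the lazy dog';

# Assumption: type is accepted as a key/value pair by the constructor.
my $s = Search::Tools->snipper(
    query => 'quick dog',
    type  => 're',
);

my $snippet = $s->snip($text);
print "snippet: $snippet\n";

# type_used reports the internal snipper function that was chosen.
my $used = $s->type_used;
print "algorithm used: $used\n" if defined $used;
```

Checking type_used after snip() is a quick way to confirm whether the requested algorithm was honoured or another one was chosen based on the query.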
force
Boolean flag indicating whether the snipper() value should always be used, regardless of the type of query keyword. Default is 0 (false).
Available via new().
count
The number of snips made by the Snipper object.
collapse_whitespace
Boolean flag indicating whether multiple whitespace characters should be collapsed into a single space. A whitespace character is defined as anything that Perl's \s pattern matches, plus the nobreak space (\xa0). Default is 1 (true).
Available via new().
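The effect of whitespace collapsing can be illustrated with a plain Perl substitution. This is a sketch of the definition given above, not the module's own code, and the sample string is made up:

```perl
use strict;
use warnings;

# Hypothetical input with doubled spaces, tabs, nobreak spaces and newlines.
my $text = "the  quick\t\tbrown\x{a0}\x{a0}fox\n\njumped";

# Treat anything \s matches, plus the nobreak space, as whitespace,
# and replace each run with a single space.
( my $collapsed = $text ) =~ s/[\s\x{a0}]+/ /g;

print "$collapsed\n";
```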
use_pp( n )
Set to a true value to use Tokenizer->tokenize_pp() and TokenListPP and TokenPP instead of the XS versions of the same. XS is the default and is much faster, but harder to modify or subclass.
Available via new().
ignore_length
Boolean flag. If set to false (default) then max_chars is respected. If set to true, max_chars is ignored.
Available via new().
treat_phrases_as_singles
Boolean flag. If set to true (default), individual terms within a phrase are considered a match. If false, only match if individual terms have a proximity distance of 1.
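A sketch comparing the two settings on a phrase query. It rests on two assumptions not confirmed in this document: that a double-quoted string inside the query is parsed as a phrase, and that treat_phrases_as_singles can be passed to the constructor. The sample text is made up.

```perl
use strict;
use warnings;
use Search::Tools;

my $text = 'the quick brown fox jumped over the lazy brown dog';

# Default: individual terms of the phrase may each count as a match.
my $loose = Search::Tools->snipper(
    query   => '"quick brown"',
    context => 4,
);

# Stricter: terms should match only when they are adjacent.
my $strict = Search::Tools->snipper(
    query                    => '"quick brown"',
    context                  => 4,
    treat_phrases_as_singles => 0,
);

my $loose_snip  = $loose->snip($text);
my $strict_snip = $strict->snip($text);

print "loose:  $loose_snip\n";
print "strict: $strict_snip\n";
```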
snip( text )
Return a snippet of text that matches query, plus context() words of surrounding context. Matches are case-insensitive.
The snippet returned will be in UTF-8 encoding, regardless of the encoding of text.
AUTHOR
Peter Karman <karman at cpan dot org>
ACKNOWLEDGEMENTS
Based on the HTML::HiLiter regular expression building code, originally by the same author, copyright 2004 by Cray Inc.
Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of this module.
COPYRIGHT
Copyright 2006 by Peter Karman.
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
SWISH::HiLiter