NAME

HTML::HiLiter - highlight words in an HTML document just like a felt-tip HiLiter

VERSION

0.13

DESCRIPTION

HTML::HiLiter is designed to make highlighting search queries in HTML easy and accurate. It was originally written for CrayDoc 4, the Cray documentation server, with SWISH::API users in mind, but it can be used within any Perl program.

SYNOPSIS

use HTML::HiLiter;

my $hiliter = HTML::HiLiter->new(
	WordCharacters  => '\w\-\.',
	BeginCharacters => '\w',
	EndCharacters   => '\w',
	HiTag           => 'span',
	Colors          => [ '#FFFF33', 'yellow', 'pink' ],
	Links           => 1,
	TagFilter       => \&yourtagcode,	# a CODE reference, not a call
	TextFilter      => \&yourtextcode,
	Force           => 1,
	SWISH           => $swish_api_object
);

$hiliter->Queries( 'foo bar or "some phrase"' );

$hiliter->CSS;

$hiliter->Run('some_file_or_URL');

REQUIREMENTS

The following are absolutely required:

  • Perl version 5.6.1 or later.

  • Text::ParseWords

Required if using with SWISH::HiLiter or the SWISH param in new():

  • SWISH::API version 0.03 or later

Required if running with Parser=>1 (default):

  • HTML::Parser

  • HTML::Entities

  • HTML::Tagset

Required to use the HTTP option in the Run() method:

  • HTTP::Request

  • LWP::UserAgent

The Debug feature requires Data::Dumper when you run Report().

FEATURES

A cornucopia of features.

  • With HTML::Parser enabled (default), HTML::HiLiter evaluates HTML chunk by chunk, buffering all text within an HTML block element before evaluating the buffer for highlighting. If no matches to the queries are found, the HTML is immediately printed (default) or cached and returned at the end of all evaluation (Print=>0).

    You can direct the print() to a filehandle with the standard select() function in your script. Or use Print=>0 to return the highlighted HTML as a scalar string.

  • Turn highlighting off on a per-tagset basis with the custom HTML "nohiliter" attribute. Set the attribute to a TRUE value (like 1) to turn off highlighting for the duration of that tag.

  • Ample debugging. Set the $HTML::HiLiter::Debug variable to a level between 1 and 3, and lots of debugging info will be printed within HTML comments <!-- -->.

  • Will highlight link text (the stuff within an <a href> tagset) if the HREF value is a valid match. See the Links option.

  • Smart context. Won't highlight across an HTML block element like a <p></p> tagset or a <div></div> tagset. (IMHO, your indexing software shouldn't consider matches for phrases that span across those tags either.)

  • Rotating colors. Each query gets a unique color. The default is four different colors, which will repeat if you have more than four queries in a single document. You can define more colors in the new() object call.

  • Cascading Style Sheets. Will add a <style> tagset in CSS to the <head> of an HTML document if you use the CSS() method. If you use the Inline() method, the style attribute will be used instead. The added <style> set will be placed immediately after the opening <head> tag, so that any subsequent CSS defined in the document will override the added <style>. This allows you to re-define the highlighting appearance in one of your own CSS files.
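
The color rotation described in the list above can be pictured with a short standalone sketch. It does not call HTML::HiLiter; the variable names and the inline style string are illustrative assumptions, and the three colors are simply the ones from the SYNOPSIS rather than the module's defaults.

```perl
use strict;
use warnings;

# Illustrative only: colors from the SYNOPSIS, queries made up.
my @colors  = ('#FFFF33', 'yellow', 'pink');
my @queries = ('foo', 'bar', 'some phrase', 'baz');

# Give each query the next color, wrapping around when the colors run out.
my %style_for;
for my $i (0 .. $#queries) {
    my $color = $colors[ $i % @colors ];
    $style_for{ $queries[$i] } = "background-color:$color";
}

print "$_ => $style_for{$_}\n" for @queries;
```

With three colors and four queries, the fourth query reuses the first color.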

VARIABLES

The following variables may be redefined by your script.

  • $HTML::HiLiter::Delim

    The phrase delimiter. Default is double quotation marks (").

  • $HTML::HiLiter::Debug

    Debugging info prints on STDOUT inside <!-- --> comments. Default is 0. Set it to 1 - 3 to enable debugging. Use the 'debug' param in new() to set this as well.

  • $HTML::HiLiter::White_Space

    Regular expression of what constitutes HTML white space. Redefine at your own risk.

  • $HTML::HiLiter::CSS_Class

    The class attribute value used by the CSS() method. Default is 'hilite'.

METHODS

new()

Create a HiLiter object handle.

Many of the following parameters take values that can be made into a regexp class. If you are using SWISH-E, for example, you will want to set these parameters equal to the equivalent SWISH-E configuration values. Otherwise, the defaults should work for most cases.

WordCharacters

Characters that constitute a word.

BeginCharacters

Characters that may begin a word.

EndCharacters

Characters that may end a word.

StartBound

Characters that may not begin a word. If not specified, this is derived automatically from [^BeginCharacters] plus some regexp niceties.

EndBound

Characters that may not end a word. If not specified, this is derived automatically from [^EndCharacters] plus some regexp niceties.
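
To see how character-class strings like these can combine into a word pattern, here is a minimal standalone sketch. It is not the regexp that HTML::HiLiter builds internally; the variable names and the pattern are assumptions made for illustration, using the values from the SYNOPSIS.

```perl
use strict;
use warnings;

# Same values as in the SYNOPSIS.
my $wc = '\w\-\.';   # WordCharacters
my $bc = '\w';       # BeginCharacters
my $ec = '\w';       # EndCharacters

# A word: one begin character, optionally followed by a run of word
# characters that finishes on an end character.
my $word_re = qr/[${bc}](?:[${wc}]*[${ec}])?/;

my $text  = 'Install foo-bar v1.2. Done.';
my @words = $text =~ /($word_re)/g;

print join('|', @words), "\n";
```

Because '.' and '-' are word characters but not end characters, "foo-bar" and "v1.2" stay whole while the trailing periods are left out.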

HiTag

The HTML tag to use to wrap highlighted words. Default: span

HiClass

The HTML class attribute used within the HiTag. If not specified, the class name will be generated automatically from the Colors array.

Colors

A reference to an array of HTML colors.

Links

A boolean (1 or 0). If set to 1, consider <a href="foo">my link</a> a valid match for 'foo' and highlight the visible text within the <a> tagset ('my link'). Default: 0.

TagFilter

A CODE reference of your choosing for filtering HTML tags as they pass through the HTML::Parser. See FILTERS.

TextFilter

A CODE reference of your choosing for filtering HTML text as it passes through the HTML::Parser. See FILTERS.

BufferLim

When the number of characters in the HTML buffer exceeds the value of BufferLim, the buffer is printed without highlighting being attempted. The default is 100000 characters. Make this higher at your peril. Most HTML will not exceed 100,000 characters in a <p> tagset, for example. (At least, most legible HTML will not...)

Print

Print highlighted HTML as the HTML::Parser encounters it. If TRUE (the default), use a select() in your script to print somewhere besides the perl default of STDOUT.

NOTE: set this to 0 (FALSE) only if you are highlighting small chunks of HTML (i.e., smaller than BufferLim). See Run().
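
The select() approach mentioned above works like this in plain Perl. This is a standalone sketch: the print statement stands in for whatever Run() would print, and the in-memory filehandle is just one possible destination (a regular file opened for writing behaves the same way).

```perl
use strict;
use warnings;

# Open an in-memory filehandle; a regular file would work the same way.
my $captured = '';
open my $fh, '>', \$captured or die "Cannot open in-memory handle: $!";

my $previous = select($fh);    # print() now defaults to $fh
print "<p>highlighted output would appear here</p>\n";   # stand-in for $hiliter->Run(...)
select($previous);             # restore the original default handle
close $fh;

print "captured ", length($captured), " characters\n";
```

Restoring the previous handle matters; otherwise every later print() in the script goes to the same place.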

Force

Will force Run() to wrap a <p> tagset around the text you pass. This will force the highlighting of plain text if using HTML::Parser (which depends on at least one tag to activate highlighting). Use this only with Inline().

noplain

Starting with version 0.11, a plaintext() method helps optimize performance by using a simpler algorithm for highlighting plain (nonHTML) text. If noplain=>1, the optimization will not be used and htmltext() will be called every time.

SWISH

For SWISH::API compatibility. See the SWISH::API documentation and the EXAMPLES section later in this document.

Parser

If set to 0 (FALSE), then the HTML::Parser module will not be loaded. This allows you to use the regexp methods without the overhead of loading the parser. The default is to load the parser.

Queries( )

Queries( query )

Queries( \@queries )

Queries( \@queries, \@metanames, \@stopwords )

Queries( \@queries, \@metanames, stopwords )

Parse the queries you want to highlight, and create the corresponding regular expressions in the object. This method must be called prior to Run(), but need only be done once for a query or queries. You may Run() multiple times with only one Queries() setup.

Queries() requires a single parameter: either a query text string, or a reference to an array of words or phrases. Phrases should be delimited with a double quotation mark (or as redefined in $HTML::HiLiter::Delim ).
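
As a rough picture of how a query string breaks into words and quoted phrases, the following sketch uses Text::ParseWords, which is listed under REQUIREMENTS. How HTML::HiLiter itself calls that module is not stated here, so treat this as an assumption-laden illustration rather than the module's own parsing.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $query = 'foo bar or "some phrase"';

# Split on whitespace, keeping double-quoted phrases together.
# The second argument (0) asks quotewords() to drop the quote characters.
my @tokens = quotewords('\s+', 0, $query);

print "[$_]\n" for @tokens;
```

Note that quotewords() also treats single quotes and backslashes specially, so queries containing apostrophes would need extra care.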

If using SWISH-E, Queries() takes several factors into account.

MetaNames

A reference to an array of MetaNames may be passed to Queries(). If the MetaNames appear in the query, they will be removed from the regexp used for highlighting. MetaNames used in queries are expected to be of the form:

meta=value

in which case the 'meta=' would be removed. There may be space before or after the '='.

NOTE: If using the SWISH feature, MetaNames are automatically retrieved, as are StopWords.
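
The 'meta=' removal described above can be approximated with a single substitution. This standalone sketch uses hypothetical MetaNames and is not the module's actual code.

```perl
use strict;
use warnings;

my @metanames = ('title', 'author');          # hypothetical MetaNames
my $query     = 'title = foo author=bar baz';

# Build an alternation of the names and remove each "name =" prefix,
# allowing optional whitespace on either side of the equals sign.
my $names = join '|', map { quotemeta($_) } @metanames;
(my $stripped = $query) =~ s/\b(?:$names)\s*=\s*//g;

print "$stripped\n";
```

Only the names and the equals signs are removed; the values remain as ordinary words to highlight.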

Ignore*Char

Any characters defined in IgnoreFirstChar or IgnoreLastChar will be stripped from the query. This assumes that your search would ignore these characters anyway.

StopWords

Either a scalar string or an array ref of StopWords. If using the SWISH feature, StopWords are retrieved automatically.

FuzzyMode

a.k.a. stemming. New in version 0.11 is support for the SWISH FuzzyMode option. If FuzzyMode was used in the SWISH::API object passed in new(), then Queries() will take the stemmed version of the word into account.

In scalar context, Queries() returns a hash ref of the queries, with key = query and value = regexp. In array context, returns an array of queries (keys of the hash ref).

With no arguments, returns the regular expression hash ref currently in use.
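
Context-sensitive return values like these are usually written with Perl's wantarray. The sketch below shows the general technique with a hypothetical function and a made-up table of regexps; it is not taken from the HTML::HiLiter source.

```perl
use strict;
use warnings;

# Hypothetical query => regexp table standing in for the object's state.
my %regexp_for = (
    'foo'         => qr/\bfoo\b/,
    'some phrase' => qr/\bsome\s+phrase\b/,
);

# Return the keys in list context, the hash ref in scalar context.
sub queries_demo {
    return wantarray ? (sort keys %regexp_for) : \%regexp_for;
}

my $by_query = queries_demo();    # scalar context: hash ref
my @queries  = queries_demo();    # list context: query strings

print ref($by_query), "\n";
print join(', ', @queries), "\n";
```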

Inline

Create the inline style attributes for highlighting without CSS. Use this method when you want to Run() a piece of HTML text.

CSS

Create a CSS <style> tagset for the <head> of your output. Use this if you intend to pass Run() a file name, filehandle or a URL.

Run( file_or_url )

Run() takes either a file name, a URL (indicated by a leading 'http://'), or a scalar reference to a string of HTML text.

The HTML::Parser must be used with this method.

htmltext( html )

Same as calling hilite().

plaintext( text )

If you want to highlight plain, nonHTML text, you can use plaintext(). It uses a simpler regexp to match your query and should be slightly faster than htmltext(). plaintext() is called automatically by hilite() if the text you pass does not contain any <> characters.
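
The automatic choice between the two paths amounts to a check for angle brackets. A minimal sketch of that dispatch, with hypothetical stand-in functions in place of the real plaintext() and htmltext() methods:

```perl
use strict;
use warnings;

# Stand-ins for the two highlighting paths; both are hypothetical.
sub plain_path { return "plain: $_[0]" }
sub html_path  { return "html: $_[0]" }

# Choose the simpler path only when the text has no angle brackets.
sub dispatch {
    my ($text) = @_;
    return $text =~ /[<>]/ ? html_path($text) : plain_path($text);
}

print dispatch('just some words'), "\n";
print dispatch('<b>bold</b> words'), "\n";
```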

hilite( html )

Usually accessed via Run() but documented here in case you want to run without the HTML::Parser. Returns the text, highlighted. Note that CSS() will probably not work for you here; use Inline() prior to calling this method, so that the object has the styles defined.

See also SWISH::HiLiter which uses this method.

Example:

my $hilited_text = $hiliter->hilite('some text');

build_regexp( words_to_highlight )

Returns the regular expression for a string of word(s). Usually called by Queries() but you might use directly if you are running without the HTML::Parser.

my $pattern = $hiliter->build_regexp( 'foo or bar' );

This is the heart of the HiLiter. We leverage the speed of Perl's regexp engine against the complication of a regexp that matches inline tags, entities, and combinations of both.

NOTE: $pattern is an array ref of two regexps: the first is a complex one for HTML, the second is a simpler one for plain text. Access them like:

my $complex = $pattern->[0];
my $simple = $pattern->[1];
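
To give a feel for what a regexp that matches inline tags and entities involves, here is a deliberately simplified standalone sketch. The function name and the pattern are assumptions made for illustration; the pattern that build_regexp() really produces handles many more cases. It only finds the match and does not insert any highlighting tags.

```perl
use strict;
use warnings;

# Build a pattern that tolerates tags inside words and entities between them.
sub loose_pattern {
    my ($phrase) = @_;
    my $tags  = '(?:<[^>]*>)*';            # any run of tags
    my $space = '(?:\s|&nbsp;|&#160;)+';   # literal or entity white space

    my @words;
    for my $word (split ' ', $phrase) {
        # Allow tags between every pair of characters in the word.
        push @words, join $tags, map { quotemeta($_) } split //, $word;
    }

    my $re = join "$tags$space$tags", @words;
    return qr/$re/i;
}

my $html  = '<p>some phrase <b>with <i>b</i>old</b> in&nbsp;the middle</p>';
my $re    = loose_pattern('bold in the middle');
my ($hit) = $html =~ /($re)/;

print(defined($hit) ? "matched: $hit\n" : "no match\n");
```

A plain substring search for 'bold in the middle' finds nothing in this HTML, while the loose pattern matches across the closing tags and the &nbsp; entity.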

prep_queries

prep_queries() takes the same arguments as Queries() (which actually uses prep_queries() internally).

Parse a list of query strings and return them as individual word/phrase tokens. Removes stopwords and metanames from queries. Stemming is also supported, though it may behave unpredictably in the resulting regexps from Queries().

my @q = $hiliter->prep_queries( ['foo', 'bar', 'baz'] );

The reason we support multiple @query instead of $query is to allow for compounded searches.

Don't worry about nots since those aren't going to be in the results anyway. Just let the highlight fail.

Report

Return a summary of how many instances of each query were found, how many highlighted, and how many missed.

FILTERS

TextFilter and TagFilter are two optional parameters that allow you to filter the contents of your HTML beyond normal highlighting. Each parameter takes a CODE reference.

TextFilter should expect these parameters in this order:

parserobj, dtext, text, offset, length

TagFilter should expect these parameters in this order:

parserobj, tag, tagname, offset, length, offset_end, attr, text

Both should return a scalar string of text. TagFilter should return a set of attributes. TextFilter may return whatever you want. See EXAMPLES and the HTML::Parser documentation for what these parameters mean and for more about writing filters.
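
The parameter order above translates into subroutines shaped like the following sketch. The subroutine names and the typo-fixing edit are hypothetical, and because the expected return value of TagFilter is not fully clear from the description above, the pass-through return is an assumption. The filters are called directly here, without a parser, only to show that they run.

```perl
use strict;
use warnings;

# TextFilter: receives parserobj, dtext, text, offset, length.
sub my_text_filter {
    my ($parser, $dtext, $text, $offset, $length) = @_;
    $text =~ s/\bteh\b/the/g;    # example edit: fix a common typo
    return $text;
}

# TagFilter: receives parserobj, tag, tagname, offset, length,
# offset_end, attr, text.
sub my_tag_filter {
    my ($parser, $tag, $tagname, $offset, $length, $offset_end, $attr, $text) = @_;
    return $text;                # assumption: pass the tag through unchanged
}

# Direct calls with made-up arguments, in place of a real parser run.
print my_text_filter(undef, 'teh end', 'teh end', 0, 7), "\n";
print my_tag_filter(undef, 'p', 'p', 0, 3, 3, {}, '<p>'), "\n";
```

In new(), such filters would be passed as CODE references, for example TextFilter => \&my_text_filter.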

EXAMPLES

See examples/ directory in source distribution.

HISTORY

Yet another highlighting module?

My goal was complete, exhaustive, tear-your-hair-out efforts to highlight HTML. No other modules I found on the web supported nested tags within words and phrases, or character entities. Cray uses the standard DocBook stylesheets from Norm Walsh et al, to generate HTML. These stylesheets produce valid HTML but often fool the other highlighters I found.

The problem became most evident when we started using SWISH-E. SWISH-E does such a good job at converting entities and doing phrase matching that we found ourselves in a dilemma: SWISH-E often gave valid search results that mere mortal highlighters could not match in the source HTML -- not even the SWISH::*Highlight modules.

I assume ISO-8859-1 Latin1 encoding. Unicode is beyond me at this point, though I suspect you could make it work fairly easily with newer Perl versions (>= 5.8) and the 'use locale' and 'use encoding' pragmas. Thus regex matching would work with things like \w and [^\w] since perl interprets the \w for you.

With the exception of the 'nohiliter' attribute, I think I follow the W3C HTML 4.01 specification. Please prove me wrong.

Prime Example of where this module overcomes other attempts by other modules.

The query 'bold in the middle' should match this HTML:

<p>some phrase <b>with <i>b</i>old</b> in&nbsp;the middle</p>

GOOD highlighting:

<p>some phrase <b>with <i><span>b</span></i><span>old</span></b><span>
in&nbsp;the middle</span></p>

BAD highlighting:

<p>some phrase <b>with <span><i>b</i>old</b> in&nbsp;the middle</span></p>

No module I tried in my tests could even find that as a match (let alone perform bad highlighting on it), even though indexing programs like SWISH-E would consider a document with that HTML a valid match.

Should you use this module?

I would suggest not using HTML::HiLiter if your HTML is fairly simple, since in HTML::HiLiter, speed has been sacrificed for accuracy and rich features. Check out HTML::Highlight instead.

Unlike other highlighting code I've found, HTML::HiLiter supports nested tags and character entities, such as might be found in technical documentation or HTML generated from some other source (like DocBook SGML or XML).

To speed up runtime, try using the Parser=>0 feature (though that doesn't support all the features, like Links, TagFilter, TextFilter, smart context, etc.). Parser=>0 has the advantage of not requiring the HTML::Parser (and associated modules), but it makes the highlighting rather 'blind'.

The goal is server-side highlighting that looks as if you used a felt-tip marker on the HTML page. You shouldn't need to know what the underlying tags and entities and encodings are: you just want to easily highlight some text as your browser presents it.

TODO

  • Better approach to stopwords in prep_queries().

  • Highlight IMG tags where ALT attribute matches query??

KNOWN BUGS AND LIMITATIONS

Report() may be inaccurate when Links flag is on. Report() may be inaccurate if the moon is full. Report() may just be inaccurate, plain and simple. Improvements welcome.

Will not highlight literal parentheses ().

Phrases that contain stopwords may not highlight correctly. It's more a problem of *which* stopword the original doc used and is not an intrinsic problem with the HiLiter, but noted here for completeness' sake.

If using the SWISH param in new(), only the first index's Char* settings are considered.

Stemming support "works" but feels to the author like a crude hack. YMMV.

Locale

NOTE: locale settings will affect what [\w] will match in regular expressions. Here's a little test program to determine how \w will work on your system. By default, no locale is set in HTML::HiLiter, so \w should default to the locale with which your perl was compiled.

This test program was copied verbatim from http://rf.net/~james/perli18n.html#Q3

I find it very helpful.

#!/usr/bin/perl -w
use strict;
use diagnostics;

use locale;
use POSIX qw (locale_h);

my @lang = ('default','en_US', 'es_ES', 'fr_CA', 'C', 'en_us', 'POSIX');

foreach my $lang (@lang) {
 if ($lang eq 'default') {
    $lang = setlocale(LC_CTYPE);
 }
 else {
    setlocale(LC_CTYPE, $lang)
 }
 print "$lang:\n";
 print +(sort grep /\w/, map { chr() } 0..255), "\n";
 print "\n";
}

AUTHOR

Peter Karman, karpet@peknet.com

Thanks to the Swish-e developers, in particular Bill Moseley, for graciously sharing time, advice and code examples.

Thanks to Cray for allowing this module to be publicly released as part of CrayDoc and allowing me to continue supporting it.

Comments and suggestions are welcome.

COPYRIGHT

###############################################################################
#    CrayDoc 4
#    Copyright (C) 2004 Cray Inc swpubs@cray.com
#
#    This program is free software; you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation; either version 2 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program; if not, write to the Free Software
#    Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
###############################################################################

SUPPORT

Send email to swpubs@cray.com.

SEE ALSO

SWISH::HiLiter, SWISH::API, HTML::Parser, HTML::Tagset, HTML::Entities, Text::ParseWords, LWP::UserAgent, HTTP::Request, Data::Dumper