NAME

HTML::HiLiter - highlight words in an HTML document just like a felt-tip HiLiter

DESCRIPTION

HTML::HiLiter is designed to make highlighting search queries in HTML easy and accurate. HTML::HiLiter was designed for CrayDoc 4, the Cray documentation server. It has been written with SWISH::API users in mind, but can be used within any Perl program.

Should you use this module?

I would suggest not using HTML::HiLiter if your HTML is fairly simple, since in HTML::HiLiter, speed has been sacrificed for accuracy. Check out HTML::Highlighter instead.

Unlike other highlighting code I've found, this one supports nested tags and character entities, such as might be found in technical documentation or HTML generated from some other source (like DocBook SGML or XML).

You might try using the Parser=>0 feature, which should speed up the run time but doesn't support all the features (like Links, TagFilter, TextFilter, smart context, etc.). Parser=>0 has the advantage of not requiring the HTML::Parser (and associated modules), but it makes the highlighting rather 'blind'.

The goal is server-side highlighting that looks as if you used a felt-tip marker on the HTML page. You shouldn't need to know what the underlying tags and entities and encodings are: you just want to easily highlight some text as your browser presents it.

SYNOPSIS

use HTML::HiLiter;

my $hiliter = new HTML::HiLiter;

$hiliter->Queries([
		'foo',
		'bar',
		'"some phrase"'
		]
		);

$hiliter->CSS;

$hiliter->Run('some_file_or_URL');

REQUIREMENTS

Perl version 5.6.1 or later.

Requires the following modules:

HTML::Parser (but optional with Parser=>0 -- see new() )
HTML::Entities (but optional with Parser=>0 -- see new() )
HTML::Tagset (but optional with Parser=>0 -- see new() )
Text::ParseWords
HTTP::Request (only if fetching HTML via http)
LWP::UserAgent (only if fetching HTML via http)

FEATURES

  • With HTML::Parser enabled (default), HTML::HiLiter evaluates the HTML chunk by chunk, buffering all text within an HTML block element before evaluating the buffer for highlighting. If no matches to the queries are found, the HTML is immediately printed (default) or cached and returned at the end of all evaluation. Otherwise, the HTML is highlighted and then printed (or cached). The buffer is flushed after each print/cache.

    You can direct the print() to a FILEHANDLE with the standard select() function in your script. Or use Print=>0 to return the highlighted HTML as a scalar string.

  • Ample debugging. Set the $HTML::HiLiter::debug variable to 1, and lots of debugging info will be printed within HTML comments <!-- -->.

  • Will highlight link text (the stuff within an <a href> tagset) if the HREF value is a valid match. See the Links option.

  • Smart context. Won't highlight across an HTML block element like a <p></p> tagset or a <div></div> tagset. (Your indexing software shouldn't consider matches for phrases that span across those tags either. But of course, that's probably just my opinion...)

  • Rotating colors. Each query gets a unique color. The default is four different colors, which will repeat if you have more than four queries in a single document. You can define more colors in the new() object call.

  • Cascading Style Sheets. Will add a <style> tagset in CSS to the <head> of an HTML document if you use the CSS() method. If you use the Inline() method, the style attribute will be used instead. The added <style> set will be placed immediately after the opening <head> tag, so that any subsequent CSS defined in the document will override the added <style>. This allows you to re-define the highlighting appearance in one of your own CSS files.
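
The select() redirection described in the first feature can be sketched with core Perl alone. In this minimal sketch, emit_html() is a hypothetical stand-in for any code that print()s to the currently selected handle (as Run() does by default), and the in-memory filehandle assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical stand-in for code that print()s to the selected handle.
sub emit_html { print "<p>highlighted output</p>\n" }

# Run $code with print() redirected into a string, then restore the old handle.
sub capture_output {
    my ($code) = @_;
    my $captured = '';
    open my $fh, '>', \$captured or die "cannot open in-memory handle: $!";
    my $previous = select($fh);    # print() now writes to $fh
    $code->();
    select($previous);             # restore the previously selected handle
    close $fh;
    return $captured;
}

my $html = capture_output( \&emit_html );
print "captured: $html";
```

Redirecting to a real file works the same way: open the file for writing and pass that handle to select().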

Object Oriented Interface

The following parameters take values that can be made into a regexp class. If you are using SWISH-E, for example, you will want to set these parameters equal to the equivalent SWISH-E configuration values. Otherwise, the defaults should work for most cases.

Example:

my $hiliter = new HTML::HiLiter(

			WordCharacters 	=>	'\w\-\.',
			BeginCharacters =>	'\w',
			EndCharacters	=>	'\w',
			HiTag =>	'span',
			Colors =>	[ '#FFFF33', 'yellow', 'pink' ],
			Links =>	1,
			TagFilter =>	\&yourcode,	# not yet implemented
			TextFilter =>	\&yourcode,	# not yet implemented
			Force	=>	1,
			SWISHE	=>	$swish_api_object
				);

WordCharacters

Characters that constitute a word.

BeginCharacters

Characters that may begin a word.

EndCharacters

Characters that may end a word.

StartBound

Characters that may not begin a word. If not specified, it is derived automatically from [^BeginCharacters] plus some regexp niceties.

EndBound

Characters that may not end a word. If not specified, it is derived automatically from [^EndCharacters] plus some regexp niceties.
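
As a rough sketch of how such boundary classes might be derived from BeginCharacters and EndCharacters: the variable names and the word_regexp() helper below are hypothetical, and the module's real regexps are more elaborate than this.

```perl
use strict;
use warnings;

# Hypothetical values, in the same form as the constructor parameters.
my $begin_chars = '\w';
my $end_chars   = '\w';

# Assumed derivation: a boundary is any character NOT allowed to begin (or end)
# a word, or the start (or end) of the string.
my $start_bound = "(?:^|[^$begin_chars])";
my $end_bound   = "(?:\$|[^$end_chars])";

# Build a regexp that matches $word only when it stands between two boundaries.
sub word_regexp {
    my ($word) = @_;
    return qr/$start_bound(\Q$word\E)(?=$end_bound)/i;
}

my $re = word_regexp('foo');
for my $text ( 'a foo here', 'food fight', '(foo)' ) {
    printf "%-12s %s\n", $text, ( $text =~ $re ? 'match' : 'no match' );
}
```

The second string does not match because the 'd' after 'foo' is a word character, so no end boundary is found there.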

HiTag

The HTML tag to use to wrap highlighted words. Default: span

Colors

A reference to an array of HTML colors. Default is: '#FFFF33', '#99FFFF', '#66FFFF', '#99FF99'
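
The rotation of colors across queries can be pictured as a simple modulo over the color list. This is an illustrative sketch only; assign_colors() is a hypothetical helper, not a method of the module.

```perl
use strict;
use warnings;

# Map each query to a color, wrapping around when queries outnumber colors.
sub assign_colors {
    my ( $queries, $colors ) = @_;
    my %color_for;
    for my $i ( 0 .. $#{$queries} ) {
        $color_for{ $queries->[$i] } = $colors->[ $i % @{$colors} ];
    }
    return \%color_for;
}

# The documented default colors.
my @colors  = ( '#FFFF33', '#99FFFF', '#66FFFF', '#99FF99' );
my @queries = qw(foo bar baz quux zot);

my $map = assign_colors( \@queries, \@colors );
print "$_ => $map->{$_}\n" for @queries;
```

With five queries and four colors, the fifth query reuses the first color.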

Links

A boolean (1 or 0). If set to '1', consider <a href="foo">my link</a> a valid match for 'foo' and highlight the visible text within the <a> tagset ('my link'). Default Links flag is '0'.

TagFilter

Not yet implemented.

TextFilter

Not yet implemented.

BufferLim

When the number of characters in the HTML buffer exceeds the value of BufferLim, the buffer is printed without highlighting being attempted. The default is 100000 characters. Make this higher at your peril. Most HTML will not exceed 100,000 characters in a <p> tagset, for example. (At least, most legible HTML will not...)
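
The effect of the limit can be sketched as follows. Both process_buffer() and the toy highlighter are hypothetical and serve only to illustrate the rule that an oversized buffer passes through untouched.

```perl
use strict;
use warnings;

# Return the buffer unchanged when it is over the limit; otherwise highlight it.
sub process_buffer {
    my ( $buffer, $limit, $highlight ) = @_;
    return $buffer if length($buffer) > $limit;
    return $highlight->($buffer);
}

# Toy highlighter for the demonstration: wraps the word "foo" in a span.
my $toy = sub { ( my $t = shift ) =~ s{\b(foo)\b}{<span>$1</span>}g; $t };

print process_buffer( 'a foo here', 100_000, $toy ), "\n";    # under the limit
print process_buffer( 'a foo here', 5,       $toy ), "\n";    # over the limit
```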

Print

Print highlighted HTML as the HTML::Parser encounters it. If TRUE (the default), use a select() in your script to print somewhere besides the perl default of STDOUT.

NOTE: set this to 0 (FALSE) only if you are highlighting small chunks of HTML (i.e., smaller than BufferLim). See Run().

Force

Automatically wrap <p> tagset around HTML passed in Run(). This will force the highlighting of plain text. Use this only with Inline().

SWISHE

For SWISH::API compatibility. See the SWISH::API documentation and the EXAMPLES section later in this document.

Parser

If set to 0 (FALSE), then the HTML::Parser module will not be loaded. This allows you to use the regexp methods without the overhead of loading the parser. The default is to load the parser.

Variables

The following variables may be redefined by your script.

$HTML::HiLiter::Delim

The phrase delimiter. Default is double quotation marks (").

$HTML::HiLiter::debug

Debugging info prints on STDOUT inside <!-- --> comments. Default is 0. Set it to 1 to enable debugging.

$HTML::HiLiter::White_Space

Regular expression of what constitutes HTML white space. Redefine at your own risk.

$HTML::HiLiter::CSS_Class

The class attribute value used by the CSS() method. Default is 'hilite'.

Methods

Queries( \@queries, [ \@metanames ] )

Parse the queries you want to highlight, and create the corresponding regular expressions in the object. This method must be called prior to Run(), but need only be done once for a set of queries. You may Run() multiple times with only one Queries() setup.

Queries() requires a single parameter: a reference to an array of words or phrases. Phrases should be delimited with a double quotation mark (or as redefined in $HTML::HiLiter::Delim ).

If using SWISH-E, Queries() takes a second parameter: a reference to an array of metanames. If the metanames are used as part of the query, they will be removed from the regexp used for highlighting.

NOTE to SWISH::API users: be wary of using the SWISH::API ParsedWords() method with Queries() as SWISH::API will lowercase all your queries. This will result in the highlighted words being lowercased as well, which may not be what you want.

Returns a hash ref of the queries, with key = query and value = regexp.

Inline

Create the inline style attributes for highlighting without CSS. Use this method when you want to Run() a piece of HTML text.

CSS

Create a CSS <style> tagset for the <head> of your output. Use this if you intend to pass Run() a file name, filehandle or a URL.

Run( file_or_url )

Run() takes either a file name, a URL (indicated by a leading 'http://'), or a scalar reference to a string of HTML text.

The HTML::Parser must be used with this method.
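
One way to picture the three kinds of argument is a small classifier. This is a sketch of the distinction described above, not the module's own dispatch code, and classify_source() is a hypothetical name.

```perl
use strict;
use warnings;

# Decide how an argument in the style of Run() would be interpreted.
sub classify_source {
    my ($source) = @_;
    return 'string' if ref($source) eq 'SCALAR';    # reference to HTML text
    return 'url'    if $source =~ m{^http://};      # leading 'http://'
    return 'file';                                  # anything else: a file name
}

my $html = '<p>some text</p>';
for my $source ( 'docs/page.html', 'http://example.com/page.html', \$html ) {
    print classify_source($source), "\n";
}
```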

hilite( text, links )

Usually accessed via Run() but documented here in case you want to run without the HTML::Parser. Returns the text, highlighted. Note that CSS() will probably not work for you here; use Inline() prior to calling this method, so that the object has the styles defined.

See SWISH::API in EXAMPLES.

NOTE that the second param 'links' is an array ref and only works if you are using the HTML::Parser and have set the Links param in the new() method -- in which case, use Run() instead.

Example:

my $hilited_text = $hiliter->hilite('some text');

build_regexp( words_to_highlight )

Returns the regular expression for a string of word(s). Usually called by Queries() but you might use directly if you are running without the HTML::Parser.

my $pattern = $hiliter->build_regexp( 'foo or bar' );

This is the heart of the HiLiter. We leverage the speed of Perl's regexp engine against the complication of a regexp that matches inline tags, entities, and combinations of both.
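
To give a feel for that kind of pattern, here is a much simplified sketch, not the module's actual regexp. It lets inline tags appear between any two characters of a word and accepts a non-breaking-space entity as whitespace; loose_regexp() is a hypothetical helper, and entities inside words are not handled.

```perl
use strict;
use warnings;

my $tags  = '(?:<[^>]*>)*';             # any run of inline tags
my $space = '(?:\s|&nbsp;|&#160;)+';    # whitespace, literal or as an entity

# Build a tag-tolerant regexp for a word or phrase.
sub loose_regexp {
    my ($phrase) = @_;
    my @word_patterns;
    for my $word ( split ' ', $phrase ) {
        # allow tags between any two characters of the word
        push @word_patterns, join $tags, map { quotemeta($_) } split //, $word;
    }
    # allow tags and whitespace (or its entity) between words
    my $pattern = join "$tags$space$tags", @word_patterns;
    return qr/$pattern/i;
}

my $html = '<p>some phrase <b>with <i>b</i>old</b> in&nbsp;the middle</p>';
my $re   = loose_regexp('bold in the middle');
print $html =~ $re ? "match\n" : "no match\n";
```

A plain substring search for 'bold in the middle' would fail on this HTML, since the phrase is interrupted by tags and an entity.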

prep_queries( \@queries [, \@metanames, \@stopwords ] )

Parse a list of query strings and return them as individual word/phrase tokens. Removes stopwords and metanames from queries.

my @q = $hiliter->prep_queries( ['foo', 'bar', 'baz'] );
	

The reason we support multiple @query instead of $query is to allow for compounded searches.

Don't worry about 'not's since those aren't going to be in the results anyway. Just let the highlight fail.

NOTE that you do not need to pass @stopwords when you used the SWISHE option in the new() call, since the HiLiter object will contain a StopWords parameter.
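
The tokenizing step can be approximated with Text::ParseWords, which the module has used in prep_queries() since 0.06. The sketch below is an approximation under stated assumptions: tokenize_queries() and the stopword list are hypothetical, metanames are not handled, and shellwords() also treats single quotes as delimiters.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Split query strings into words and quoted phrases, then drop stopwords.
sub tokenize_queries {
    my ( $queries, $stopwords ) = @_;
    my %skip = map { ( lc($_) => 1 ) } @{ $stopwords || [] };
    my @tokens = map { shellwords($_) } @{$queries};    # a quoted phrase stays one token
    return grep { !$skip{ lc $_ } } @tokens;
}

my @tokens = tokenize_queries( [ 'foo "some phrase"', 'the bar' ], [qw(the of and)] );
print join( ' | ', @tokens ), "\n";
```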

Report

Return a summary of how many instances of each query were found, how many highlighted, and how many missed.

EXAMPLES

Filesystem

A very simple example for highlighting a document from the filesystem.

use HTML::HiLiter;

my $hiliter = new HTML::HiLiter;

#$HTML::HiLiter::debug=1;	# uncomment for oodles of debugging info

my $file = shift || die "$0 file.html expr\n";

# you should do some error checks on $file for security and sanity
# same with ARGV
my @q = @ARGV;

$hiliter->Queries(\@q);

select(STDOUT);

$hiliter->CSS;

$hiliter->Run($file);

# if you wanted to know how accurate you were.
warn $hiliter->Report;	

SWISH::API

An example for SWISH::API users (SWISH-E 2.4 and later).

	#!/usr/bin/perl

	# highlight swishdescription text in search results.
	# use as CGI script.
	# NOTE this is not a pretty output -- dress it up as you will
	
	use CGI;
	my $cgi = new CGI;
	$| = 1;
	
	print $cgi->header;
	
	print "<pre>";
	
	use SWISH::API;
	
	my $index = 'index.swish-e';
	
	my @metanames = qw/ swishtitle swishdefault swishdocpath /;
	
	my $swish = SWISH::API->new( $index );
	
	use HTML::HiLiter;
	
	#$HTML::HiLiter::debug = 1;
	my $hiliter = new HTML::HiLiter(
				#Force => 1,  # because swishdescription
					     # is not stored as HTML
				SWISHE => $swish,
				Parser=> 0,  # don't load HTML::Parser
				);
	


        $swish->AbortLastError
               if $swish->Error;

        my $search = $swish->New_Search_Object;

	my @query = $cgi->param('q');
	
	@query || die "$0 'words to query'\n";

        my $results = $search->Execute( join(' ', @query) );

        $swish->AbortLastError
               if $swish->Error;
	       
        my $hits = $results->Hits;
        if ( !$hits ) {
               print "No Results\n";
               exit;
        }

        print "Found ", $results->Hits, " hits\n";

	my $query_str = join(' ', @query );
	my @parsed_query = keys %{ $hiliter->Queries(
					[ $query_str ],
					[ @metanames ]
					)
				};
	$hiliter->Inline;

        # highlight the queries in each file description
	
	# NOTE that this will print ALL results
	# so in a real SWISH application, you'd likely
	# quit after N number of results.
	
	# NOTE too that swishdescription does NOT store
	# HTML text per se, just tagless characters as parsed
	# by the indexer. But since SWISH-E is often used
	# via CGI, this lets the output from your CGI
	# script show highlighted.
	
	# and finally, NOTE that swishdescription is,
	# by default, pretty long (> 100 chars), so
	# we do a test and a little substr magic to avoid
	# printing everything.	
	
	while ( my $result = $results->NextResult ) {
          
	  print "Rank: ", $result->Property( 'swishrank' ), "\n";
	  print "Title: ", $result->Property( 'swishtitle' ), "\n";
	  print "Path: ", $result->Property( 'swishdocpath' ), "\n";

	  my $snippet = get_snippet ( $result->Property( "swishdescription" ) );
	
	  print $hiliter->hilite( $snippet );
	  
	  # warn $hiliter->Report if $hiliter->Report;
	  # comment in for some debugging.
	  
	  print "\n<hr />\n";
	
	}
	
	print "\n";
	
	print "</pre>";
	
	sub get_snippet
	{
		my $context_chars = 100;
	
		my %char = (
		'>' 	=> '&gt;',
		'<' 	=> '&lt;',
		'&' 	=> '&amp;',
		"\xa0" 	=> '&nbsp;',	# double quotes, so the key is the character itself
		'"'	=> '&quot;'
		);

		my $desc = shift || return '';
		# test if $desc contains any of our query words
	  	my @snips;
	  	Q: for my $q (@parsed_query) {
	  	  if ($desc =~ m/(.*?)(\Q$q\E)(.*)/si) {
			my $bef = $1;
			my $qm = $2;
			my $af = $3;
			$bef = substr $bef, -$context_chars;
			$af = substr $af, 0, $context_chars;
			
			# no partial words...
			$af =~ s,^\S+\s+|\s+\S+$,,gs;
			$bef =~ s,^\S+\s+|\s+\S+$,,gs;

			push(@snips, "$bef $qm $af");
		  }
	  	}
	  	my $ellip = ' ... ';
	  	my $snippet = $ellip. join($ellip, @snips) . $ellip;
	  
	  	# convert special HTML characters
 	  	$snippet =~ s/([<>&"\xa0])/$char{$1}/g;
		
		return $snippet;
		
	}

A simple CGI script.

#!/usr/bin/perl -T
#
# usage: hilight.cgi?f='somefile_or_url';q='some words to highlight'

use CGI qw(:standard);
use CGI::Carp qw(fatalsToBrowser);

print header();

my $f = param('f');
my (@q) = param('q');

use HTML::HiLiter;

my $hl = new HTML::HiLiter;

$hl->Queries([ @q ]);

$hl->CSS;

$hl->Run($f);

print "<p><pre>". $hl->Report . "</pre></p>";

BACKGROUND

Why one more highlighting module? My goal was complete, exhaustive, tear-your-hair-out efforts to highlight HTML. No other modules I found on the web supported nested tags within words and phrases, or character entities.

I assume ISO-8859-1 Latin1 encoding. Unicode is beyond me at this point, though I suspect you could make it work fairly easily with newer Perl versions (>= 5.8) and the 'use locale' and 'use encoding' pragmas. Thus regex matching would work with things like \w and [^\w] since perl interprets the \w for you.

I think I follow the W3C HTML 4.01 specification. Please prove me wrong.

Prime Example of where this module overcomes other attempts by other modules.

The query 'bold in the middle' should match this HTML:

<p>some phrase <b>with <i>b</i>old</b> in&nbsp;the middle</p>

GOOD highlighting:

<p>some phrase <b>with <i><span>b</span></i><span>old</span></b><span>
in&nbsp;the middle</span></p>

BAD highlighting:

<p>some phrase <b>with <span><i>b</i>old</b> in&nbsp;the middle</span></p>

No module I tried in my tests could even find that as a match (let alone perform bad highlighting on it), even though indexing programs like SWISH-E would consider a document with that HTML a valid match.

LOCALE

NOTE: locale settings will affect what [\w] will match in regular expressions. Here's a little test program to determine how \w will work on your system. By default, no locale is set in HTML::HiLiter, so \w should default to the locale with which your perl was compiled.

This test program was copied verbatim from http://rf.net/~james/perli18n.html#Q3

I find it very helpful.

Testing locale

#!/usr/bin/perl -w
use strict;
use diagnostics;

use locale;
use POSIX qw (locale_h);

my @lang = ('default','en_US', 'es_ES', 'fr_CA', 'C', 'en_us', 'POSIX');

foreach my $lang (@lang) {
 if ($lang eq 'default') {
    $lang = setlocale(LC_CTYPE);
 }
 else {
    setlocale(LC_CTYPE, $lang)
 }
 print "$lang:\n";
 print +(sort grep /\w/, map { chr() } 0..255), "\n";
 print "\n";
}

TODO

  • Better approach to stopwords in prep_queries().

  • Highlight IMG tags where ALT attribute matches query??

  • Support the TagFilter and TextFilter parameters. This will extend the use of HiLiter as an HTML filter. For example, you might want every link in your highlit HTML to point back at your CGI script, so that every link target gets highlighted as well.

HISTORY

 * 0.05
	first CPAN release

 * 0.06
	use Text::ParseWords instead of original clumsy regexps in prep_queries()
	add support for 8211 (ndash) and 8212 (mdash) entities
	tweaked StartBound and EndBound to not match within a word
	fixed doc to reflect that debugging prints on STDOUT, not STDERR

 * 0.07
	made HTML::Parser optional to allow for more flexibility with using methods
	added perldoc for previously undocumented methods
	corrected perldoc for Queries() to refer to metanames as second param
	updated SWISH::API example to avoid using HTML::Parser
	added unicode entity -> ascii equivs for better DocBook support
	  (NOTE: this expands the ndash/mdash feature from 0.06)
	misc cleanup
	
 * 0.08
 	fixed bug in SWISH::API example with ParsedWords and updated Queries()
	  perldoc to reflect the change.
	removed dependency on HTML::Entities by hardcoding all relevant entities.
	  (HTML::Entities does a 'require HTML::Parser' which made the parser=>0
	  feature break.)
	  
 * 0.09
 	added Print feature to new() to allow Run() to return highlighted text instead
	  of automatically printing in a streaming fashion. Set Print=>0 to turn off print().
	Run() now returns highlighted text if Print=>0.
	changed parser=>0 to Parser=>0.
	the ParsedWords bug reported in 0.08 was really with my example in get_snippet().
	  so rather than blame someone else's code, I fixed mine... :)
	fixed bug with count of real HTML matches that was most evident with running hilite()
	added test2.t test to test the Parser=>0 feature
	
 * 0.10
 	fixed prep_queries() perldoc head
	Queries() now returns hash ref of q => regexp
	fixed SWISH::API example to use new Queries()
	fixed Queries() perldoc
	added StopWords note to prep_queries()
	fixed regexp that caused make test to fail in perl < 5.8.1 (thanks to m@perlmeister.com)
	added note to hilite() perldoc to always use Inline()
	
	

KNOWN BUGS

Report() may be inaccurate when Links flag is on. Report() may be inaccurate if the moon is full. Report() may just be inaccurate, plain and simple. Improvements welcome.

HiLiter will not highlight literal parentheses ().

Phrases that contain stopwords may not highlight correctly. It's more a problem of *which* stopword the original doc used and is not an intrinsic problem with the HiLiter, but noted here for completeness' sake.

AUTHOR

Peter Karman, karman@cray.com

Thanks to the SWISH-E developers, in particular Bill Moseley for graciously sharing time, advice and code examples.

Comments and suggestions are welcome.

COPYRIGHT

###############################################################################
#    CrayDoc 4
#    Copyright (C) 2004 Cray Inc swpubs@cray.com
#
#    This program is free software; you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation; either version 2 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program; if not, write to the Free Software
#    Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
###############################################################################

SUPPORT

Send email to swpubs@cray.com.