NAME
HTML::HiLiter - highlight words in an HTML document just like a felt-tip HiLiter
DESCRIPTION
HTML::HiLiter is designed to make highlighting search queries in HTML easy and accurate. HTML::HiLiter was designed for CrayDoc 4, the Cray Inc documentation server. It has been written with SWISH::API users in mind, but can be used within any Perl program.
Unlike other highlighting code I've found, this one supports nested tags and character entities, such as might be found in technical documentation or HTML generated from some other source (like DocBook SGML or XML). I would suggest not using HTML::HiLiter if your HTML is fairly simple, since in HTML::HiLiter, speed has been sacrificed for accuracy.
The goal is server-side highlighting that looks as if you used a felt-tip marker on the HTML page. You shouldn't need to know what the underlying tags and entities and encodings are: you just want to easily highlight some text as your browser presents it.
SYNOPSIS
use HTML::HiLiter;
my $hiliter = new HTML::HiLiter;
$hiliter->Queries([
'foo',
'bar',
'"some phrase"'
]
);
$hiliter->CSS;
$hiliter->Run('some_file_or_URL');
REQUIREMENTS
Perl version 5.6.1 or later.
Requires the following modules:
- HTML::Parser
- HTML::Entities
- HTML::Tagset
- Text::ParseWords
- HTTP::Request (only if fetching HTML via http)
- LWP::UserAgent (only if fetching HTML via http)
FEATURES
HTML::HiLiter prints highlighted HTML chunk by chunk, buffering all text within an HTML block element before evaluating the buffer for highlighting. If no matches to the queries are found, the HTML is immediately printed. Otherwise, the HTML is highlighted and then printed. The buffer is flushed after each print.
You can direct the print() to a FILEHANDLE with the standard select() function in your script.
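A minimal sketch of that redirection, using only core Perl. The print statement is a stand-in for $hiliter->Run(), which is assumed to print to the currently selected handle as described above. The in-memory filehandle (which needs Perl 5.8 or later) and the variable names are illustrative, not part of the module.

```perl
use strict;
use warnings;

# Collect everything printed to the selected handle in $captured.
my $captured = '';
open( my $out, '>', \$captured ) or die "cannot open in-memory handle: $!";

my $previous = select($out);    # select() returns the handle it replaces
print "<p>highlighted HTML would go here</p>\n";    # stand-in for $hiliter->Run($file)
select($previous);              # restore the original default handle
close($out);

print "captured ", length($captured), " characters\n";
```

Restoring the previous handle matters: anything printed afterwards would otherwise keep going to the redirected handle.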
Ample debugging. Set the $HTML::HiLiter::debug variable to something true, and lots of debugging info will be printed within HTML comments <!-- -->.
Link highlighting. When the Links parameter is set, will highlight link text (the stuff within an <a href> tagset) if the HREF value is a valid match.
Smart context. Won't highlight across an HTML block element like a <p></p> tagset or a <div></div> tagset. (Your indexing software shouldn't consider matches for phrases that span across those tags either. But of course, that's probably just my opinion...)
Rotating colors. Each query gets a unique color. The default is four different colors, which will repeat if you have more than four queries in a single document. You can define more colors in the new() object call.
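The rotation can be pictured with a short sketch. This is an illustration of the wrap-around idea only, not the module's own code; the palette is the default listed under the Colors parameter, and the query list and the %color_for hash are made up for the example.

```perl
use strict;
use warnings;

# Default palette as listed under the Colors parameter.
my @colors  = ( '#FFFF33', '#99FFFF', '#66FFFF', '#99FF99' );
my @queries = ( 'foo', 'bar', '"some phrase"', 'baz', 'quux' );

# Assign colors in order, wrapping around when the palette runs out.
my %color_for;
for my $i ( 0 .. $#queries ) {
    $color_for{ $queries[$i] } = $colors[ $i % scalar @colors ];
}

printf "%-15s %s\n", $_, $color_for{$_} for @queries;
```

With five queries and four colors, the fifth query reuses the first color.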
Cascading Style Sheets. The CSS() method adds a <style> tagset to the <head> of the HTML document; the Inline() method uses the style attribute on each highlighting tag instead. The <style> tagset is placed immediately after the opening <head> tag, so that any CSS defined later in the document overrides it. This lets you redefine the highlighting appearance in one of your own CSS files.
Object Oriented Interface
The following parameters take values that can be made into a regexp class. If you are using SWISH-E, for example, you will want to set these parameters equal to the equivalent SWISH-E configuration values. Otherwise, the defaults should work for most cases.
Example:
my $hiliter = new HTML::HiLiter(
WordCharacters => '\w\-\.',
BeginCharacters => '\w',
EndCharacters => '\w',
HiTag => 'span',
Colors => [ '#FFFF33', 'yellow', 'pink' ],
Links => 1,
TagFilter => \&yourcode, # not yet implemented
TextFilter => \&yourcode, # not yet implemented
Force => 1,
SWISHE => $swish_api_object
);
- WordCharacters
-
Characters that constitute a word.
- BeginCharacters
-
Characters that may begin a word.
- EndCharacters
-
Characters that may end a word.
- StartBound
-
Characters that may not begin a word. If not specified, it is derived automatically from [^BeginCharacters] plus some regexp niceties.
- EndBound
-
Characters that may not end a word. If not specified, it is derived automatically from [^EndCharacters] plus some regexp niceties.
- HiTag
-
The HTML tag to use to wrap highlighted words. Default: span
- Colors
-
A reference to an array of HTML colors. Default is: '#FFFF33', '#99FFFF', '#66FFFF', '#99FF99'
- Links
-
A boolean (1 or 0). If set to '1', consider <a href="foo"> a valid match for 'foo' and hilite the visible text within the <a> tagset. Default Links flag is '0'.
- TagFilter
-
Not yet implemented.
- TextFilter
-
Not yet implemented.
- BufferLim
-
When the number of characters in the HTML buffer exceeds the value of BufferLim, the buffer is printed without any attempt at highlighting. The default is 100000 characters. Make this higher at your peril. Most HTML will not contain more than 100,000 characters in a single <p> tagset, for example. (At least, most legible HTML will not...)
- Force
-
Automatically wrap a <p> tagset around the HTML passed to Run(). This forces the highlighting of plain text. Use this only with Inline().
- SWISHE
-
For SWISH::API compatibility. See the SWISH::API documentation and the EXAMPLES section later in this document.
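To make the character-class parameters more concrete, here is a rough sketch of how BeginCharacters and EndCharacters could bound a query in a regular expression. It is an assumption about the general technique, not the module's actual expressions (the "regexp niceties" are not reproduced), and the bounded_re() helper is hypothetical.

```perl
use strict;
use warnings;

# Values as they would be passed to new(); these are the example defaults.
my $begin = '\w';    # BeginCharacters
my $end   = '\w';    # EndCharacters

# Match $query only when it is not preceded by a character that could
# begin a word and not followed by a character that could end one.
sub bounded_re {
    my ($query) = @_;
    return qr/(?<![$begin])\Q$query\E(?![$end])/i;
}

for my $text ( 'a foo b', '(foo)', 'food', 'xfoo' ) {
    my $hit = $text =~ bounded_re('foo') ? 'match' : 'no match';
    print "$text: $hit\n";
}
```

The lookarounds play the role of StartBound and EndBound: they test the neighboring characters without consuming them, so the start and end of the string also count as boundaries.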
Variables
The following variables may be redefined by your script.
- $HTML::HiLiter::Delim
-
The phrase delimiter. Default is double quotation marks (").
- $HTML::HiLiter::debug
-
Debugging info prints on STDOUT inside <!-- --> comments. Default is 0. Set it to 1 to enable debugging.
- $HTML::HiLiter::White_Space
-
Regular expression of what constitutes HTML white space. Redefine at your own risk.
- $HTML::HiLiter::CSS_Class
-
The class attribute value used by the CSS() method. Default is 'hilite'.
Methods
Queries
Parse the queries you want to highlight, and create the corresponding regular expressions in the object. This method must be called prior to Run(), but need only be done once for a set of queries. You may Run() multiple times with only one Queries() setup.
Queries() takes a single parameter: a reference to an array of words or phrases. Phrases should be delimited with a double quotation mark (or as redefined in $HTML::HiLiter::Delim ).
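If your queries arrive as one raw search string, a sketch like the following could split it into the array that Queries() expects. It uses parse_line() from Text::ParseWords (the module named in REQUIREMENTS); the $raw string is made up, and whether Queries() would also split such a string for you is not assumed here.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Split on whitespace while keeping quoted phrases together. The second
# argument (keep = 1) preserves the quotation marks, so each phrase is
# still delimited when it is handed to Queries().
my $raw     = 'foo bar "some phrase"';
my @queries = parse_line( '\s+', 1, $raw );

print "$_\n" for @queries;

# Then, with a real object: $hiliter->Queries( \@queries );
```

Note that parse_line() also treats single quotes as delimiters, so input containing a stray apostrophe may need cleaning first.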
Inline
Create the inline style attributes for highlighting without CSS. Use this method when you want to Run() a piece of HTML text.
CSS
Create a CSS <style> tagset for the <head> of your output. Use this if you intend to pass Run() a file name, filehandle or a URL.
Run
Run() takes either a file name, a URL (indicated by a leading 'http://'), or a scalar reference to a string of HTML text.
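The three kinds of argument can be told apart with a few lines of Perl. This is only a sketch of the distinction described above; the input_type() helper is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Classify an argument the way the description of Run() suggests.
sub input_type {
    my ($in) = @_;
    return 'text' if ref($in) eq 'SCALAR';    # \$html, a reference to a string
    return 'url'  if $in =~ m{^http://};      # indicated by a leading 'http://'
    return 'file';                            # anything else is a file name
}

my $html = '<p>some text</p>';
print input_type( \$html ), "\n";
print input_type('http://www.example.com/doc.html'), "\n";
print input_type('doc.html'), "\n";
```

In practice this means a string of HTML must always be passed by reference; passed directly, it would be taken for a file name.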
Report
Return a summary of how many instances of each query were found, how many highlighted, and how many missed.
EXAMPLES
Filesystem
A very simple example for highlighting a document from the filesystem.
use HTML::HiLiter;
my $hiliter = new HTML::HiLiter;
#$HTML::HiLiter::debug=1; # uncomment for oodles of debugging info
my $file = shift || die "$0 file.html expr\n";
# you should do some error checks on $file for security and sanity
# same with ARGV
my @q = @ARGV;
$hiliter->Queries(\@q);
select(STDOUT);
$hiliter->CSS;
$hiliter->Run($file);
# if you wanted to know how accurate you were.
warn $hiliter->Report;
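The comment in the example asks for error checks on $file. One possible sketch of such checks follows; the checked_html_file() helper and the specific rules (a .htm or .html extension, a restricted character set, no '..' path segments) are assumptions to adapt to your own policy, not requirements of the module.

```perl
use strict;
use warnings;

# Return the file name if it passes some basic sanity checks; die otherwise.
sub checked_html_file {
    my ($file) = @_;
    die "no file given\n" unless defined $file && length $file;
    die "unexpected file name: $file\n"
        unless $file =~ m{\A[\w./-]+\.html?\z};
    die "'..' is not allowed: $file\n"
        if $file =~ m{(?:\A|/)\.\.(?:/|\z)};
    die "not a readable file: $file\n" unless -f $file && -r _;
    return $file;
}

# usage: my $file = checked_html_file( shift @ARGV );
```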
SWISH::API
An example for SWISH-E users (SWISH-E 2.4 and later).
#!/usr/bin/perl
# highlight swishdescription text in search results.
# use from command line or something similar from
# a CGI script
use SWISH::API;
my $index = 'index.swish-e';
my @metanames = qw/ swishtitle swishdefault swishdocpath /;
my $swish = SWISH::API->new( $index );
use HTML::HiLiter;
my $hiliter = new HTML::HiLiter(
Force => 1, # because swishdescription
# is not stored as HTML
SWISHE => $swish
);
$swish->AbortLastError
if $swish->Error;
my $search = $swish->New_Search_Object;
my @query = @ARGV;
@query || die "$0 'words to query'\n";
my $results = $search->Execute( join(' ', @query) );
$swish->AbortLastError
if $swish->Error;
my $hits = $results->Hits;
if ( !$hits ) {
print "No Results\n";
exit;
}
print "Found ", $results->Hits, " hits\n";
$hiliter->Queries(
[ join(' ', $results->ParsedWords( $index ) ) ],
[ @metanames ]
);
$hiliter->Inline;
# highlight the queries in each file description
# NOTE that this will print ALL results
# so in a real SWISH application, you'd likely
# quit after N number of results.
# NOTE too that swishdescription does NOT store
# HTML text per se, just tagless characters as parsed
# by the indexer. But since SWISH-E is often used
# via CGI, this lets the output from your CGI
# script show highlighted.
# and finally, NOTE that swishdescription is,
# by default, pretty long (> 100 chars), so
# we do a test and a little substr magic to avoid
# printing everything.
while ( my $result = $results->NextResult ) {
print "Rank: ", $result->Property( 'swishrank' ), "\n";
print "Title: ", $result->Property( 'swishtitle' ), "\n";
print "Path: ", $result->Property( 'swishdocpath' ), "\n";
my $snippet = get_snippet ( $result->Property( "swishdescription" ) );
$hiliter->Run(\$snippet);
# warn $hiliter->Report if $hiliter->Report;
# comment in for some debugging.
print "\n<hr />\n";
}
print "\n";
sub get_snippet
{
my $context_chars = 100;
my %char = (
'>' => '&gt;',
'<' => '&lt;',
'&' => '&amp;',
"\xa0" => '&nbsp;', # double-quoted so the key is the actual character
'"' => '&quot;'
);
my $desc = shift || return '';
# test if $desc contains any of our query words
my @snips;
Q: for my $q (keys %{ $hiliter->{Queries} }) {
if ($desc =~ m/(.*?)\Q$q\E(.*)/si) {
my $bef = $1;
my $af = $2;
$bef = substr $bef, -$context_chars;
$af = substr $af, 0, $context_chars;
# no partial words...
$af =~ s,^\S+\s+|\s+\S+$,,gs;
$bef =~ s,^\S+\s+|\s+\S+$,,gs;
push(@snips, "$bef $q $af");
}
}
my $ellip = '...';
my $snippet = $ellip. join($ellip, @snips) . $ellip;
# convert special HTML characters
$snippet =~ s/([<>&"\xa0])/$char{$1}/g;
return $snippet;
}
A simple CGI script.
#!/usr/bin/perl -T
#
# usage: hilight.cgi?f='somefile_or_url';q='some words to highlight'
use CGI qw(:standard);
use CGI::Carp qw(fatalsToBrowser);
print header();
my $f = param('f');
my (@q) = param('q');
use lib qw(/Users/karpet/perl_mods);
use HTML::HiLiter;
my $hl = new HTML::HiLiter;
$hl->Queries([ @q ]);
$hl->CSS;
$hl->Run($f);
print "<p><pre>". $hl->Report . "</pre></p>";
BACKGROUND
Why one more highlighting module? My goal was complete, exhaustive, tear-your-hair-out efforts to highlight HTML. No other modules I found on the web supported nested tags within words and phrases, or character entities.
I assume ISO-8859-1 Latin1 encoding. Unicode is beyond me at this point, though I suspect you could make it work fairly easily with newer Perl versions (>= 5.8) and the 'use locale' and 'use encoding' pragmas. Thus regex matching would work with things like \w and [^\w] since perl interprets the \w for you.
I think I follow the W3C HTML 4.01 specification. Please prove me wrong.
A prime example of where this module succeeds and other modules fail:
The query 'bold in the middle' should match this HTML:
<p>some phrase <b>with <i>b</i>old</b> in the middle</p>
GOOD highlighting:
<p>some phrase <b>with <i><span>b</span></i><span>old</span></b><span>
in the middle</span></p>
BAD highlighting:
<p>some phrase <b>with <span><i>b</i>old</b> in the middle</span></p>
No module I tried in my tests could even find that as a match (let alone perform bad highlighting on it), even though indexing programs like SWISH-E would consider a document with that HTML a valid match.
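To see why a naive match misses this case, consider the sketch below. It builds a regular expression that tolerates tags between the characters of a phrase. This is a simplified illustration and not the algorithm HTML::HiLiter uses: the phrase_regex() helper is hypothetical, and it ignores character entities, block-element boundaries and the word-boundary parameters.

```perl
use strict;
use warnings;

# Build a regex for a phrase that allows tags between any two characters.
sub phrase_regex {
    my ($phrase) = @_;
    my $tags = '(?:<[^>]*>)*';    # zero or more tags
    my @words;
    for my $word ( split ' ', $phrase ) {
        push @words, join( $tags, map { quotemeta($_) } split( //, $word ) );
    }
    my $re = join( "$tags\\s+$tags", @words );
    return qr/($re)/i;
}

my $html = '<p>some phrase <b>with <i>b</i>old</b> in the middle</p>';

my $plain = $html =~ /bold in the middle/ ? 'match' : 'no match';
print "plain regex: $plain\n";

if ( $html =~ phrase_regex('bold in the middle') ) {
    print "tag-aware:   matched '$1'\n";
}
```

Finding the match is only half the job. The matched span contains unbalanced tags, which is why the GOOD highlighting above wraps each text run in its own <span> rather than wrapping the whole match once.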
LOCALE
NOTE: locale settings will affect what [\w] will match in regular expressions. Here's a little test program to determine how \w will work on your system. By default, no locale is set in HTML::HiLiter, so \w should default to the locale with which your perl was compiled.
This test program was copied verbatim from http://rf.net/~james/perli18n.html#Q3
I find it very helpful.
Testing locale
#!/usr/bin/perl -w
use strict;
use diagnostics;
use locale;
use POSIX qw (locale_h);
my @lang = ('default','en_US', 'es_ES', 'fr_CA', 'C', 'en_us', 'POSIX');
foreach my $lang (@lang) {
if ($lang eq 'default') {
$lang = setlocale(LC_CTYPE);
}
else {
setlocale(LC_CTYPE, $lang)
}
print "$lang:\n";
print +(sort grep /\w/, map { chr() } 0..255), "\n";
print "\n";
}
TODO
Better approach to stopwords in prep_queries().
When using the get_url() routine, check for framesets with relative objects, so we can get those too. Or are we trying too much??
Highlight IMG tags where ALT attribute matches query??
Support the TagFilter and TextFilter parameters. This will extend the use of HiLiter as an HTML filter. For example, you might want every link in your highlit HTML to point back at your CGI script, so that every link target gets highlighted as well.
HISTORY
* 0.05
first CPAN release
* 0.06
use Text::ParseWords instead of original clumsy regexps in prep_queries()
add support for 8211 (ndash) and 8212 (mdash) entities
tweaked StartBound and EndBound to not match within a word
fixed doc to reflect that debugging prints on STDOUT, not STDERR
KNOWN BUGS
Report() may be inaccurate when Links flag is on. Report() may be inaccurate if the moon is full. Report() may just be inaccurate, plain and simple. Improvements welcome.
HiLiter will not highlight literal parentheses ().
AUTHOR
Peter Karman, karman@cray.com
Thanks to the SWISH-E developers, in particular Bill Moseley for graciously sharing time, advice and code examples.
Comments and suggestions are welcome.
COPYRIGHT
###############################################################################
# CrayDoc 4
# Copyright (C) 2004 Cray Inc swpubs@cray.com
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
###############################################################################
SUPPORT
Send email to swpubs@cray.com.