NAME
Dancer::SearchApp::HTMLSnippet - HTML snippet extractor
SYNOPSIS
my @document_snippets = Dancer::SearchApp::HTMLSnippet->extract_highlights(
html => $html,
hl_tag => '<em>',
hl_end => '</em>',
snippet_length => 150,
max_snippets => 8,
);
METHODS
Dancer::SearchApp::HTMLSnippet->extract_highlights
my @document_snippets = Dancer::SearchApp::HTMLSnippet->extract_highlights(
html => $html,
hl_tag => '<em>',
hl_end => '</em>',
snippet_length => 150,
max_snippets => 8,
);
This extracts the highlight snippets and metadata from the HTML as prepared by Tika and highlighted by Elasticsearch. It returns a list of hash references, each containing a well-formed HTML snippet with the highlights and a page entry giving the original page number if the snippet originated within, or crosses, a <p class="page\d+"> section:
{
html => 'this is a <b>result</b> you searched for',
page => 42,
}
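As a minimal sketch of consuming such a list, the following renders each entry as one display line and copes with entries that have no page number. The helper name format_snippet and the sample data are assumptions made for illustration; they are not part of this module, and in real use the list would come from extract_highlights.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Dancer::SearchApp::HTMLSnippet:
# renders one snippet hash reference as a single display line.
sub format_snippet {
    my ($snippet) = @_;

    # 'page' is only present when the snippet came from a page section
    my $where = defined $snippet->{page}
              ? "p. $snippet->{page}"
              : "no page";

    return "[$where] $snippet->{html}";
}

# Sample data in the documented shape (assumed values for illustration)
my @document_snippets = (
    { html => 'this is a <b>result</b> you searched for', page => 42 },
    { html => 'another <b>result</b> without page information' },
);

print format_snippet($_), "\n" for @document_snippets;
```

Checking the page entry with defined keeps the loop free of "uninitialized value" warnings for snippets that did not come from a page section.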
Dancer::SearchApp::HTMLSnippet->extract_highlights
my @hits = Dancer::SearchApp::HTMLSnippet->extract_highlights(
html => $html,
max_length => 300,
);
for my $entry (@hits) {
print "Match: $entry->{start} ($entry->{length} bytes)\n";
}
Dancer::SearchApp::HTMLSnippet->cleanup_tika
my $content = Dancer::SearchApp::HTMLSnippet->cleanup_tika( $html );
Cleans up HTML output from Apache Tika.
BUG TRACKER
Please report bugs in this module via the RT CPAN bug queue at https://rt.cpan.org/Public/Dist/Display.html?Name=Dancer-SearchApp or via mail to dancer-searchapp-Bugs@rt.cpan.org.
AUTHOR
Max Maischein corion@cpan.org
COPYRIGHT (c)
Copyright 2014-2016 by Max Maischein corion@cpan.org.
LICENSE
This module is released under the same terms as Perl itself.