NAME

Dancer::SearchApp::HTMLSnippet - HTML snippet extractor

SYNOPSIS

my @document_snippets = Dancer::SearchApp::HTMLSnippet->extract_highlights(
    html => $html,
    hl_tag => '<em>',
    hl_end => '</em>',
    snippet_length => 150,
    max_snippets => 8,
);

METHODS

Dancer::SearchApp::HTMLSnippet->extract_highlights

my @document_snippets = Dancer::SearchApp::HTMLSnippet->extract_highlights(
    html => $html,
    hl_tag => '<em>',
    hl_end => '</em>',
    snippet_length => 150,
    max_snippets => 8,
);

This extracts the highlight snippets and metadata from the HTML as prepared by Tika and highlighted by Elasticsearch. It returns a list of hash references. Each one contains a (well-formed) HTML snippet with the highlights, plus a page entry noting the original page number if the snippet originated within a <p class="page\d+"> section (or crosses into one).

{
    html => 'this is a <b>result</b> you searched for',
    page => 42,
}
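
As a rough illustration of consuming that return value, the sketch below formats each snippet hash as a list item. The sample data and the render_snippet helper are hypothetical and not part of this module. It also assumes that the page entry is simply missing or undefined when a snippet does not come from a page section, which the description above suggests but does not spell out.

```perl
use strict;
use warnings;

# Hypothetical sample data shaped like the documented return value;
# in real use this list would come from extract_highlights().
my @document_snippets = (
    { html => 'this is a <em>result</em> you searched for', page => 42 },
    { html => 'another <em>result</em> without page information' },
);

# Hypothetical helper (not part of the module): format one snippet hash.
# Assumes 'page' is absent or undef when no page number is known.
sub render_snippet {
    my ($snippet) = @_;
    my $prefix = defined $snippet->{page}
               ? "Page $snippet->{page}: "
               : '';
    return "<li>$prefix$snippet->{html}</li>";
}

print render_snippet($_), "\n" for @document_snippets;
```

The html value is emitted unescaped here because the module describes it as a well-formed HTML snippet; whether that is safe for a given application depends on how far the indexed documents are trusted.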

Dancer::SearchApp::HTMLSnippet->extract_highlights

my @hits = Dancer::SearchApp::HTMLSnippet->extract_highlights(
    html => $html,
    max_length => 300,
);

for my $entry (@hits) {
  print "Match: $entry->{start} ($entry->{length} bytes)\n";
};

Dancer::SearchApp::HTMLSnippet->cleanup_tika

my $content = Dancer::SearchApp::HTMLSnippet->cleanup_tika( $html );

Cleans up HTML output from Apache Tika.

BUG TRACKER

Please report bugs in this module via the RT CPAN bug queue at https://rt.cpan.org/Public/Dist/Display.html?Name=Dancer-SearchApp or via mail to dancer-searchapp-Bugs@rt.cpan.org.

AUTHOR

Max Maischein <corion@cpan.org>

COPYRIGHT (c)

Copyright 2014-2016 by Max Maischein <corion@cpan.org>.

LICENSE

This module is released under the same terms as Perl itself.