NAME
HTML::Content::Extractor - Extracts the main text of a publication from an HTML page, along with the main media content bound to that text
SYNOPSIS
    use HTML::Content::Extractor;

    my $obj = HTML::Content::Extractor->new();
    $obj->analyze($html);

    my $main_text    = $obj->get_main_text();
    my $main_images  = $obj->get_main_images();
    my $raw_text     = $obj->get_raw_text();
    my $main_text_we = $obj->get_main_text_with_elements(1, ["a", "b", "br", "strike", ...]);

    print $main_text, "\n\n";

    print "Images:\n";
    foreach my $url (@$main_images) {
        print $url, "\n";
    }
DESCRIPTION
This module analyzes an HTML document and extracts its main text (for example, the body of an article on a news site) together with all images related to that text.
METHODS
new
    my $obj = HTML::Content::Extractor->new();
Creates the object and prepares the internal structures for subsequent HTML parsing and analysis.
analyze
    $obj->analyze($html);
Creates an HTML document tree and analyzes it.
get_main_text
    # UTF-8
    my $main_text = $obj->get_main_text(1);
    # or not
    my $main_text = $obj->get_main_text(0);
    # default UTF-8 is on
Returns the main text as plain text.
get_raw_text
    # UTF-8
    my $raw_text = $obj->get_raw_text(1);
    # or not
    my $raw_text = $obj->get_raw_text(0);
    # default UTF-8 is on
Returns the main text without post-processing, keeping all HTML tags.
get_main_text_with_elements
    # UTF-8
    my $main_text_we = $obj->get_main_text_with_elements(1, ["span", ...]);
    # or not
    my $main_text_we = $obj->get_main_text_with_elements(0, ["span", ...]);
    # default UTF-8 is on
Returns the main text while preserving the selected HTML tags. Post-processing is skipped.
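A minimal sketch of how the two arguments might be used together, based on the calls shown above: the first argument is the UTF-8 flag and the second is an array reference of tag names to preserve. The helper name text_with_inline_markup and the particular tag list are hypothetical, and $extractor is assumed to be an HTML::Content::Extractor object on which analyze() has already been called.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the main text with a fixed set of inline tags
# preserved. $extractor is assumed to be an HTML::Content::Extractor object
# that has already analyzed a document.
sub text_with_inline_markup {
    my ($extractor, $utf8) = @_;
    $utf8 = 1 unless defined $utf8;          # mirror the documented default (UTF-8 on)

    my @keep = ("a", "b", "br", "strike");   # illustrative tag list, not a recommendation
    return $extractor->get_main_text_with_elements($utf8, \@keep);
}

# Possible usage, assuming the module is installed and $html holds a page:
#   my $obj = HTML::Content::Extractor->new();
#   $obj->analyze($html);
#   print text_with_inline_markup($obj), "\n";
```

Keeping the tag list in a named array makes it easy to reuse the same set of tags across several documents.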
get_main_images
    # UTF-8
    my $main_images = $obj->get_main_images(1);
    # or not
    my $main_images = $obj->get_main_images(0);
    # default UTF-8 is on
Returns an ARRAY reference containing the image URLs.
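Because the return value is an array reference, it has to be dereferenced before it can be iterated or filtered. The following sketch collects the URLs into a de-duplicated list that keeps their original order. The helper name unique_image_urls is hypothetical, and whether the module itself ever returns duplicate or undefined entries is an assumption, not something this documentation states.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the image URLs as a plain list with repeats
# removed. $extractor is assumed to be an HTML::Content::Extractor object
# that has already analyzed a document.
sub unique_image_urls {
    my ($extractor) = @_;

    # Fall back to an empty list if nothing is returned (defensive assumption).
    my $images = $extractor->get_main_images() || [];

    my %seen;
    return grep { defined($_) && !$seen{$_}++ } @$images;
}

# Possible usage, assuming $obj has already analyzed a page:
#   print "$_\n" for unique_image_urls($obj);
```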
DESTROY
    undef $obj;
Cleans up all internal structures (the HTML tree and others).
AUTHOR
Alexander Borisov <lex.borisov@gmail.com>
COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by Alexander Borisov.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.