NAME

Text::Corpus::CNN::Document - Parse a CNN article for research.

SYNOPSIS

use Cwd;
use File::Spec;
use Text::Corpus::CNN;
use Data::Dump qw(dump);
use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init ($INFO);
# Keep the corpus in the directory 'corpus_cnn' under the current directory.
my $corpusDirectory = File::Spec->catfile (getcwd(), 'corpus_cnn');
my $corpus = Text::Corpus::CNN->new (corpusDirectory => $corpusDirectory);
# Update the corpus before reading from it.
$corpus->update (verbose => 1);
# Fetch the first document and dump each portion of it.
my $document = $corpus->getDocument (index => 0);
dump $document->getBody;
dump $document->getCategories;
dump $document->getContent;
dump $document->getDate;
dump $document->getDescription;
dump $document->getHighlights;
dump $document->getTitle;
dump $document->getUri;

DESCRIPTION

Text::Corpus::CNN::Document provides methods for accessing specific portions of CNN news articles for personal research and testing of information processing methods.

Read the CNN Interactive Service Agreement and make sure your use of this module complies with it.

CONSTRUCTOR

`new`

The constructor new creates an instance of the Text::Corpus::CNN::Document class with the following parameters:

htmlContent

htmlContent => '...'

htmlContent must be a string containing the HTML of the document to be parsed.

uri

uri => '...'

uri must be a string containing the URL of the document provided by htmlContent; it is also returned as the document's unique identifier with getUri.
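
The constructor can also be called directly rather than through getDocument of Text::Corpus::CNN. The following is a minimal sketch, assuming the HTML of an article has already been saved to a local file; the helper names slurpFile and documentFromFile are hypothetical and not part of the module, and reading the file as UTF-8 is an assumption about how the page was saved.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into one string, decoding it as
# UTF-8 (an assumption about how the page was saved).
sub slurpFile {
    my ($file) = @_;
    open (my $fh, '<:encoding(UTF-8)', $file) or die "cannot open $file: $!";
    local $/;    # undefined record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper: build a document from a saved HTML file and the URL
# it was fetched from. The module is loaded only when this is called.
sub documentFromFile {
    my ($file, $uri) = @_;
    require Text::Corpus::CNN::Document;
    return Text::Corpus::CNN::Document->new (htmlContent => slurpFile ($file), uri => $uri);
}
```

Calling documentFromFile with the path of a saved article and its original URL should return an object on which getUri gives back that same URL.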

METHODS

`getBody`

getBody ()

getBody returns an array reference of strings of sentences that are the body of the document.

`getCategories`

getCategories ()

getCategories returns an array reference of strings of categories assigned to the document. They are the phrases and words extracted from the /html/head/meta[@name="KEYWORDS"] field in the HTML of the document, from the 'RELATED TOPICS' section of the document, and from the URL of the document.
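
Since the categories come back as an array reference, they are easy to tally across several documents. The sketch below is one way to do that; countCategories is a hypothetical helper, not part of the module, and it counts a category once per document even if it appears more than once in the returned list.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many of the given documents carry each
# category. Accepts any objects that provide getCategories.
sub countCategories {
    my @documents = @_;
    my %count;
    foreach my $document (@documents) {
        my %seen;    # categories already counted for this document
        foreach my $category (@{ $document->getCategories }) {
            $count{$category}++ unless $seen{$category}++;
        }
    }
    return \%count;
}
```

The result is a hash reference that maps each category string to the number of documents it was assigned to.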

`getContent`

getContent ()

getContent returns an array reference of strings of sentences that form the content of the document, which are the title and body of the document.
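
Because getContent returns an array reference rather than a single string, it has to be dereferenced before further text processing. A minimal sketch, where documentAsText is a hypothetical helper and placing one sentence per line is only one possible layout:

```perl
use strict;
use warnings;

# Hypothetical helper: flatten the content of a document into one string
# with one sentence per line.
sub documentAsText {
    my ($document) = @_;
    return join ("\n", @{ $document->getContent });
}
```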

`getDate`

getDate (format => '%g')

getDate returns the date and time of the article in the format specified by format, which uses the printf directives of Date::Manip::Date. The default is to return the date and time in RFC 2822 format.
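
The format parameter makes it possible to request the same date in several layouts. The sketch below is illustrative only: dateFields is a hypothetical helper, and the directives shown (%Y, %m, %d for the calendar date and %s for seconds since the epoch) are taken from the Date::Manip::Date printf documentation rather than from this module.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the article date in a few common layouts.
sub dateFields {
    my ($document) = @_;
    return {
        rfc2822 => $document->getDate,                           # default format
        iso     => $document->getDate (format => '%Y-%m-%d'),    # calendar date
        epoch   => $document->getDate (format => '%s'),          # seconds since the epoch
    };
}
```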

`getDescription`

getDescription ()

getDescription returns an array reference of strings of sentences, usually one, that describes the document content. It is from the /html/head/meta[@name="description"] field in the HTML of the document.

`getHighlights`

getHighlights ()

getHighlights returns an array reference of the highlights of the document.

`getTitle`

getTitle ()

getTitle returns an array reference of strings, usually one, of the title of the document.

`getUri`

getUri ()

getUri returns the URL of the document.

INSTALLATION

For installation instructions see Text::Corpus::CNN.

AUTHOR

Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

cnn, cable news network, english corpus, information processing
