NAME
Text::Corpus::NewYorkTimes::Document
- Parse NYT article for research.
SYNOPSIS
use Text::Corpus::NewYorkTimes;
use Data::Dump qw(dump);
use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init ($INFO);
my $corpus = Text::Corpus::NewYorkTimes->new (fileList => $fileList, corpusDirectory => $corpusDirectory);
my $document = $corpus->getDocument (index => 0);
dump $document->getBody;
dump $document->getCategories;
dump $document->getContent;
dump $document->getDate;
dump $document->getDescription;
dump $document->getTitle;
dump $document->getUri;
DESCRIPTION
Text::Corpus::NewYorkTimes::Document
provides methods for accessing specific portions of news articles from the New York Times corpus.
CONSTRUCTOR
new
The constructor new
creates an instance of the Text::Corpus::NewYorkTimes::Document
class with the following parameters:
filename
-
filename => '...'
filename
is the path name to the XML document that is to be parsed. dtdname
-
dtdname => '...'
dtdname
is the path name to the data type definition file provided with the corpus; it is usually something like.../nyt_corpus/dtd/nitf-3-3.dtd
. If not defined an attempt is made to located it using the path provided byfilename
.
METHODS
getBody
getBody ()
getBody
returns an array reference of strings of sentences that are the body of the article.
getCategories
getCategories (type => 'all')
The method getCategories
returns the categories of type
assigned to the document, where type
must be 'all'
, 'controlled'
, or 'uncontrolled'
. The 'uncontrolled'
categories are those assigned to the document by an editor without machine assistance, the 'controlled'
categories are those assigned with machine assistance. The type 'all'
returns the union of the categories from 'controlled'
and 'uncontrolled'
. The default is 'all'
.
getContent
getContent ()
The method getContent
returns the content of the document as an array reference of the text where each item in the array is a sentence, with the first sentence being the headline or title of the article. If the lead sentence equals the headline of the article, then the headline is not prefixed to the list.
getDate
getDate (format => '%g')
getDate
returns the date and time of the article in the format speficied by format
that uses the print directives of Date::Manip::Date. The default is to return the date and time in RFC2822 format.
getDescription
getDescription ()
getDescription
returns an array reference of strings of sentences that describe the articles content.
getTitle
getTitle ()
getTitle
returns an array reference of strings, usually one, of the title of the article.
getUri
getUri (type => 'file')
getUri
returns the URI of the document where type
must be 'file'
or 'url'
. If type
is 'file'
, the file path of the document is returned; otherwise the URL of the document is returned. The default is 'file'
.
INSTALLATION
For installation instructions see Text::Corpus::NewYorkTimes.
AUTHOR
Jeff Kubina<jeff.kubina@gmail.com>
COPYRIGHT
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
KEYWORDS
nyt, new york times, english corpus, information processing
SEE ALSO
This module requires the The New York Times Annotated Corpus from the Linguistic Data Consortium; discussions about the corpus are moderated at the Google Group called The New York Times Annotated Corpus Community.Log::Log4perl, Text::Corpus::NewYorkTimes, XML::LibXML, XML::LibXML::XPathContext