NAME
Text::StemTagPOS
- Computes stemmed/POS tagged lists of text.
SYNOPSIS
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'The first sentence. Sentence number two.';
dump $stemTagger->getStemmedAndTaggedText ($text);
DESCRIPTION
Text::StemTagPOS uses the modules Lingua::Stem::Snowball and Lingua::EN::Tagger to do part-of-speech tagging and stemming of English text. It was developed to pre-process text for other modules. All text should be encoded in Perl's internal format; see Encode for converting text from various encodings to a Perl string.
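As a minimal sketch of the conversion mentioned above, the snippet below decodes raw bytes into a Perl string with the core Encode module before the text would be handed to the tagger. It assumes the input bytes are UTF-8; the sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as they might arrive from a file or a socket.
my $bytes = "caf\xC3\xA9 society.";

# Decode the bytes into a Perl character string before tagging.
my $text = decode('UTF-8', $bytes);

# The two-byte sequence \xC3\xA9 is now the single character U+00E9.
printf "%d bytes, %d characters\n", length($bytes), length($text);
# prints: 14 bytes, 13 characters
```

The decoded $text, not $bytes, is what would be passed to getStemmedAndTaggedText, so that character positions and lengths count characters rather than bytes.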
CONSTRUCTOR
new
The method new creates an instance of the Text::StemTagPOS class with the following parameters:
isoLangCode
-
isoLangCode => 'en'
isoLangCode is the ISO language code of the language that will be tagged and stemmed by the object. It must be 'en', which is the default; other languages may be added when POS taggers for them are added to CPAN.
endingSentenceTag
-
endingSentenceTag => 'PP'
endingSentenceTag is the part-of-speech tag from Lingua::EN::Tagger that is used to indicate the end of a sentence. The default is 'PP'. The value of endingSentenceTag must be a tag generated by the module Lingua::EN::Tagger; see the method getListOfPartOfSpeechTags for all the possible tags, which are based on the Penn Treebank tagset.
listOfPOSTypesToKeep and/or listOfPOSTagsToKeep
-
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
The method getTaggedTextToKeep uses listOfPOSTypesToKeep and listOfPOSTagsToKeep to build the default list of the parts-of-speech to be retained when filtering previously tagged text. The default list is [qw(TEXTRANK_WORDS)], which is all the nouns and adjectives in the text, as used in the textrank algorithm. Permitted types for listOfPOSTypesToKeep are 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. listOfPOSTagsToKeep provides finer control over the parts-of-speech to be retained. For a list of all the possible tags see the method getListOfPartOfSpeechTags.
METHODS
getStemmedAndTaggedText
getStemmedAndTaggedText (@Text, $Text, \@Text)
The method getStemmedAndTaggedText returns a hierarchy of array references containing, for each word, the stemmed word, the original word, its part-of-speech tag, its word index within the original text, and its character position and length. The hierarchy is of the form
[
[ # sentence level: first sentence.
[ # word level: first word.
stemmed word, original word, part-of-speech tag, word index, word position, word length
]
[ # word level: second word.
stemmed word, original word, part-of-speech tag, word index, word position, word length
]
...
]
[ # sentence level: second sentence.
[ # word level: first word.
stemmed word, original word, part-of-speech tag, word index, word position, word length
]
[ # word level: second word.
stemmed word, original word, part-of-speech tag, word index, word position, word length
]
...
]
]
Its only parameters are any combination of strings of text as scalars, references to scalars, arrays of strings of text, or references to arrays of strings of text. The examples below show the various ways to call the method; note that the constants Text::StemTagPOS::WORD_STEMMED, Text::StemTagPOS::WORD_ORIGINAL, Text::StemTagPOS::WORD_POSTAG, Text::StemTagPOS::WORD_INDEX, Text::StemTagPOS::WORD_CHAR_POSITION, and Text::StemTagPOS::WORD_CHAR_LENGTH are used to access the information about each word.
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'The first sentence. Sentence number two.';
my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
dump $listOfStemmedTaggedSentences;
# dumps:
# [
# [
# ["the", "The", "/DET", 0, 0, 3],
# ["first", "first", "/JJ", 1, 4, 5],
# ["sentenc", "sentence", "/NN", 2, 10, 8],
# [".", ".", "/PP", 3, 18, 1],
# ],
# [
# ["sentenc", "Sentence", "/NN", 4, 20, 8],
# ["number", "number", "/NN", 5, 29, 6],
# ["two", "two", "/CD", 6, 36, 3],
# [".", ".", "/PP", 7, 39, 1],
# ],
# ]
my $word = $listOfStemmedTaggedSentences->[0][0];
print
'WORD_STEMMED: ' .
"'" . $word->[Text::StemTagPOS::WORD_STEMMED] . "'\n" .
'WORD_ORIGINAL: ' .
"'" . $word->[Text::StemTagPOS::WORD_ORIGINAL] . "'\n" .
'WORD_POSTAG: ' .
"'" . $word->[Text::StemTagPOS::WORD_POSTAG] . "'\n" .
'WORD_INDEX: ' .
$word->[Text::StemTagPOS::WORD_INDEX] . "\n" .
'WORD_CHAR_POSITION: ' .
$word->[Text::StemTagPOS::WORD_CHAR_POSITION] . "\n" .
'WORD_CHAR_LENGTH: ' .
$word->[Text::StemTagPOS::WORD_CHAR_LENGTH] . "\n";
# prints:
# WORD_STEMMED: 'the'
# WORD_ORIGINAL: 'The'
# WORD_POSTAG: '/DET'
# WORD_INDEX: 0
# WORD_CHAR_POSITION: 0
# WORD_CHAR_LENGTH: 3
The following example shows the various ways the text can be passed to the method:
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'This is a sentence with seven words.';
dump $stemTagger->getStemmedAndTaggedText ($text,
[$text, \$text], ($text, \$text));
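The character position and length fields can be checked against the original text with substr. The sketch below does not call the module: the word entries are copied by hand from the dump shown earlier in this section, and the local constant names are stand-ins for Text::StemTagPOS::WORD_ORIGINAL, WORD_CHAR_POSITION, and WORD_CHAR_LENGTH, assumed to match the field order in that dump.

```perl
use strict;
use warnings;

# Field offsets within each word entry, mirroring the order in the dump.
# These local names are illustrative stand-ins for the module's constants.
use constant { ORIGINAL => 1, CHAR_POSITION => 4, CHAR_LENGTH => 5 };

my $text = 'The first sentence. Sentence number two.';

# Hand-copied from the documented dump for $text.
my $sentences = [
  [ ['the', 'The', '/DET', 0, 0, 3], ['first', 'first', '/JJ', 1, 4, 5],
    ['sentenc', 'sentence', '/NN', 2, 10, 8], ['.', '.', '/PP', 3, 18, 1] ],
  [ ['sentenc', 'Sentence', '/NN', 4, 20, 8], ['number', 'number', '/NN', 5, 29, 6],
    ['two', 'two', '/CD', 6, 36, 3], ['.', '.', '/PP', 7, 39, 1] ],
];

# Slice each word out of the text and compare it with the stored original.
my $mismatches = 0;
for my $sentence (@$sentences) {
  for my $word (@$sentence) {
    my $slice = substr $text, $word->[CHAR_POSITION], $word->[CHAR_LENGTH];
    $mismatches++ if $slice ne $word->[ORIGINAL];
  }
}
print "mismatches: $mismatches\n";
# prints: mismatches: 0
```

This is useful when the tagged words need to be highlighted or annotated in the untouched source text.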
getTaggedTextToKeep
getTaggedTextToKeep (listOfStemmedTaggedSentences => [...],
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]);
The method getTaggedTextToKeep returns all the array references of the words that have a part-of-speech tag of a type specified by listOfPOSTypesToKeep or listOfPOSTagsToKeep. The word lists returned have the same hierarchical sentence structure used by listOfStemmedTaggedSentences. Note that listOfPOSTypesToKeep and listOfPOSTagsToKeep are optional parameters; if neither is defined, the values used when the object was instantiated are used. If one of them is defined, its values override the default values.
listOfStemmedTaggedSentences
-
listOfStemmedTaggedSentences => [...]
listOfStemmedTaggedSentences is the array reference returned by getStemmedAndTaggedText or a previous call to getTaggedTextToKeep.
listOfPOSTypesToKeep and/or listOfPOSTagsToKeep
-
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
listOfPOSTypesToKeep and listOfPOSTagsToKeep define the list of parts-of-speech types to be retained when filtering previously tagged text. Permitted values for listOfPOSTypesToKeep are 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values of listOfPOSTagsToKeep see the method getListOfPartOfSpeechTags.
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'This is the first sentence. This is the last sentence.';
my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
dump $stemTagger->getTaggedTextToKeep (
listOfStemmedTaggedSentences => $listOfStemmedTaggedSentences);
# dumps:
# [
# [
# ["first", "first", "/JJ", 3, 12, 5],
# ["sentenc", "sentence", "/NN", 4, 18, 8],
# ],
# [
# ["last", "last", "/JJ", 9, 40, 4],
# ["sentenc", "sentence", "/NN", 10, 45, 8],
# ],
# ]
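To make the shape of the filtering concrete, here is a minimal sketch that keeps only words whose tag is in a chosen set while preserving the sentence grouping. It is not the module's implementation: the data is hand-copied from the dump documented for getStemmedAndTaggedText, and the set of tags ('/JJ' and '/NN') is picked by hand, so the module's own mapping from types such as TEXTRANK_WORDS to tags may differ.

```perl
use strict;
use warnings;

# Hand-copied from the documented dump of
# 'The first sentence. Sentence number two.'
my $sentences = [
  [ ['the', 'The', '/DET', 0, 0, 3], ['first', 'first', '/JJ', 1, 4, 5],
    ['sentenc', 'sentence', '/NN', 2, 10, 8], ['.', '.', '/PP', 3, 18, 1] ],
  [ ['sentenc', 'Sentence', '/NN', 4, 20, 8], ['number', 'number', '/NN', 5, 29, 6],
    ['two', 'two', '/CD', 6, 36, 3], ['.', '.', '/PP', 7, 39, 1] ],
];

# Tags to retain, chosen by hand for this sketch.
my %keep = map { $_ => 1 } qw(/JJ /NN);

# Filter each sentence separately so the sentence grouping is preserved.
my @filtered;
for my $sentence (@$sentences) {
  push @filtered, [ grep { $keep{ $_->[2] } } @$sentence ];
}

# Show the original form of the surviving words, one sentence per line.
print join(' ', map { $_->[1] } @$_), "\n" for @filtered;
# prints:
# first sentence
# Sentence number
```

Because the kept entries are the same array references, their word indices and character positions still refer to the unfiltered text.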
getWordsPhrasesInTaggedText
getWordsPhrasesInTaggedText (listOfStemmedTaggedSentences => ...,
listOfPhrasesToFind => [...], listOfPOSTypesToKeep => [...],
listOfPOSTagsToKeep => [...]);
The method getWordsPhrasesInTaggedText returns a reference to an array in which each entry corresponds, in order, to a word or phrase in listOfPhrasesToFind. The value of each entry is a list of the locations where that word or phrase was found. Each location is an integer pair of the form [first-word-index, last-word-index], where first-word-index is the index of the first word of the phrase and last-word-index is the index of the last word. The index values are those assigned to the stemmed and tagged words in listOfStemmedTaggedSentences.
[
[ # first phrase locations
[first word index, last word index],
[first word index, last word index], ...
],
[ # second phrase locations
[first word index, last word index],
[first word index, last word index], ...
],
...
]
listOfStemmedTaggedSentences
-
listOfStemmedTaggedSentences => [...]
listOfStemmedTaggedSentences is the array reference returned by getStemmedAndTaggedText or getTaggedTextToKeep.
listOfPhrasesToFind
-
listOfPhrasesToFind => [...]
listOfPhrasesToFind is an array reference containing a list of strings of text, each either a single word or a phrase, that are to be located in the text provided by listOfStemmedTaggedSentences. Before the words or phrases are located they are filtered using listOfPOSTypesToKeep or listOfPOSTagsToKeep.
listOfPOSTypesToKeep and/or listOfPOSTagsToKeep
-
listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]
listOfPOSTypesToKeep and listOfPOSTagsToKeep define the list of parts-of-speech types to be retained when filtering previously tagged text. Permitted values for listOfPOSTypesToKeep are 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values of listOfPOSTagsToKeep see the method getListOfPartOfSpeechTags. Note that listOfPOSTypesToKeep and listOfPOSTagsToKeep are optional parameters; if neither is defined, the values used when the object was instantiated are used. If one of them is defined, its values override the default values.
The code below illustrates the output format:
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'This is the first sentence. This is the last sentence.';
my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
dump $listOfStemmedTaggedSentences;
my $listOfWordsOrPhrasesToFind = ['first sentence','this is',
'third sentence', 'sentence'];
my $phraseLocations = $stemTagger->getWordsPhrasesInTaggedText (
listOfPOSTypesToKeep => [qw(ALL)],
listOfStemmedTaggedSentences => $listOfStemmedTaggedSentences,
listOfWordsOrPhrasesToFind => $listOfWordsOrPhrasesToFind);
dump $phraseLocations;
# [
# [[3, 4]], # 'first sentence'
# [[0, 1], [6, 7]], # 'this is': note period in text has index 5.
# [], # 'third sentence'
# [[4, 4], [10, 10]] # 'sentence'
# ]
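The [first-word-index, last-word-index] pairs can be turned back into character spans of the original text by combining them with the position and length fields of the tagged words. The sketch below assumes the field order documented for getStemmedAndTaggedText; the tagged data is hand-copied from that section's dump, and the index pairs are written by hand in the documented form rather than produced by the module.

```perl
use strict;
use warnings;

my $text = 'The first sentence. Sentence number two.';

# Hand-copied from the documented dump for $text.
my $sentences = [
  [ ['the', 'The', '/DET', 0, 0, 3], ['first', 'first', '/JJ', 1, 4, 5],
    ['sentenc', 'sentence', '/NN', 2, 10, 8], ['.', '.', '/PP', 3, 18, 1] ],
  [ ['sentenc', 'Sentence', '/NN', 4, 20, 8], ['number', 'number', '/NN', 5, 29, 6],
    ['two', 'two', '/CD', 6, 36, 3], ['.', '.', '/PP', 7, 39, 1] ],
];

# Look up every word entry by its word index (field 3).
my %wordByIndex;
for my $sentence (@$sentences) {
  $wordByIndex{ $_->[3] } = $_ for @$sentence;
}

# Hand-written pairs in the [first-word-index, last-word-index] form.
my @pairs = ([1, 2], [2, 2], [4, 4]);

my @phrases;
for my $pair (@pairs) {
  my ($first, $last) = @wordByIndex{@$pair};
  my $start = $first->[4];              # character position of the first word
  my $end   = $last->[4] + $last->[5];  # one past the last word's final character
  push @phrases, substr($text, $start, $end - $start);
}
print "$_\n" for @phrases;
# prints:
# first sentence
# sentence
# Sentence
```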
getListOfPartOfSpeechTags
The method getListOfPartOfSpeechTags takes no parameters. It returns an array reference where each item in the list is of the form [part of speech tag, description, examples]. It is meant for getting the part-of-speech tags that can be used to populate listOfPOSTagsToKeep.
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
dump $stemTagger->getListOfPartOfSpeechTags;
getListOfStemmedWordsInText
The method getListOfStemmedWordsInText returns an array reference of the sorted stemmed words in the text given by listOfStemmedTaggedSentences.
listOfStemmedTaggedSentences
-
listOfStemmedTaggedSentences => [...]
listOfStemmedTaggedSentences is the array reference returned by getStemmedAndTaggedText or getTaggedTextToKeep for the text.
use Text::StemTagPOS;
use Data::Dump qw(dump);
my $stemTagger = Text::StemTagPOS->new;
my $text = 'The first sentence. Sentence number two.';
my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
dump $stemTagger->getListOfStemmedWordsInText (
listOfStemmedTaggedSentences => $listOfStemmedTaggedSentences);
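For orientation, a sorted list of stems can be collected from the tagged structure as sketched below. This is an illustration rather than the module's code: the data is hand-copied from the dump documented for getStemmedAndTaggedText, and both the removal of duplicates and the inclusion of punctuation are assumptions of this sketch.

```perl
use strict;
use warnings;

# Hand-copied from the documented dump of
# 'The first sentence. Sentence number two.'
my $sentences = [
  [ ['the', 'The', '/DET', 0, 0, 3], ['first', 'first', '/JJ', 1, 4, 5],
    ['sentenc', 'sentence', '/NN', 2, 10, 8], ['.', '.', '/PP', 3, 18, 1] ],
  [ ['sentenc', 'Sentence', '/NN', 4, 20, 8], ['number', 'number', '/NN', 5, 29, 6],
    ['two', 'two', '/CD', 6, 36, 3], ['.', '.', '/PP', 7, 39, 1] ],
];

# Record each stem (field 0) once; duplicates collapse in the hash.
my %seen;
for my $sentence (@$sentences) {
  $seen{ $_->[0] } = 1 for @$sentence;
}

# Sort the distinct stems in default string order.
my $stems = [ sort keys %seen ];
print join(' ', @$stems), "\n";
# prints: . first number sentenc the two
```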
getListOfStemmedWordsInAllDocuments
The method getListOfStemmedWordsInAllDocuments returns an array reference of the sorted stemmed words that occur in every document given by listOfStemmedTaggedDocuments, that is, the intersection of the documents' stemmed words.
listOfStemmedTaggedDocuments
-
listOfStemmedTaggedDocuments => [...]
listOfStemmedTaggedDocuments is a list of document references returned by getStemmedAndTaggedText or getTaggedTextToKeep.
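The intersection can be pictured with the following sketch, which counts in how many documents each stem occurs and keeps the stems found in all of them. It is an assumption-laden illustration, not the module's code: each of the two small documents is built by hand from one sentence of the dump documented for getStemmedAndTaggedText.

```perl
use strict;
use warnings;

# Two hand-built documents; each is a list of sentences, as the
# documented structure describes.
my @documents = (
  [ [ ['the', 'The', '/DET', 0, 0, 3], ['first', 'first', '/JJ', 1, 4, 5],
      ['sentenc', 'sentence', '/NN', 2, 10, 8], ['.', '.', '/PP', 3, 18, 1] ] ],
  [ [ ['sentenc', 'Sentence', '/NN', 0, 0, 8], ['number', 'number', '/NN', 1, 9, 6],
      ['two', 'two', '/CD', 2, 16, 3], ['.', '.', '/PP', 3, 19, 1] ] ],
);

# Count the number of documents in which each stem (field 0) appears.
my %docCount;
for my $document (@documents) {
  my %inDoc;
  for my $sentence (@$document) {
    $inDoc{ $_->[0] } = 1 for @$sentence;
  }
  $docCount{$_}++ for keys %inDoc;
}

# Keep the stems present in every document, then sort them.
my @inAll  = grep { $docCount{$_} == @documents } keys %docCount;
my $common = [ sort @inAll ];
print join(' ', @$common), "\n";
# prints: . sentenc
```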
INSTALLATION
To install the module run the following commands:
perl Makefile.PL
make
make test
make install
If you are on a Windows machine you should use 'nmake' rather than 'make'.
AUTHOR
Jeff Kubina <jeff.kubina@gmail.com>
BUGS
Please email bug reports or feature requests to bug-text-stemtagpos@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-StemTagPOS. The author will be notified and you can be automatically notified of progress on the bug fix or feature request.
COPYRIGHT
Copyright (c) 2010 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
KEYWORDS
natural language processing, NLP, part of speech tagging, POS, stemming
SEE ALSO
Encode, Lingua::Stem::Snowball, Lingua::EN::Tagger, perlunicode, Text::Iconv, utf8
See the Lingua::EN::Tagger README file for a list of the part-of-speech tags.