NAME

Lingua::EN::Tagger - Part of speech tagger for English natural language processing.

SYNOPSIS

use Lingua::EN::Tagger;

# Create a tagger object
my $p = Lingua::EN::Tagger->new;

# Add part of speech tags to a text
my $tagged_text = $p->add_tags( $text );

...

# Get a list of all nouns and noun phrases with occurrence counts
my %word_list = $p->get_words( $text );

...

# Get a readable version of the tagged text
my $readable_text = $p->get_readable( $text );

DESCRIPTION

The module is a probability-based, corpus-trained tagger that assigns POS tags to English text using a lookup dictionary and probability values. The tagger chooses tags by conditional probability: it looks at the preceding tag to decide which tag is most appropriate for the current word. Unknown words can be treated as nouns or other parts of speech.
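The tag selection described above can be pictured with a small, self-contained sketch. It is not the module's own code: the lexicon, the transition table, every probability value, and the helper names pick_tag and tag_words are invented for illustration. It assumes each word's tag is chosen by multiplying a per-word probability by a probability conditioned on the preceding tag, starting from a sentence-ender tag.

```perl
use strict;
use warnings;

# Toy lexicon: P(tag | word). All values are invented for illustration.
my %lexicon = (
    the   => { det => 1.0 },
    dog   => { nn  => 0.9, vb  => 0.1 },
    barks => { vbz => 0.6, nns => 0.4 },
);

# Toy transitions: P(tag | preceding tag). Also invented.
my %transition = (
    pp  => { det => 0.5, nn  => 0.3,  vb => 0.2 },    # pp = sentence ender
    det => { nn  => 0.8, nns => 0.15, vb => 0.05 },
    nn  => { vbz => 0.6, nn  => 0.3,  nns => 0.1 },
);

# Pick the tag that maximizes P(tag | word) * P(tag | preceding tag).
# Words missing from the lexicon get the caller-supplied fallback tag.
sub pick_tag {
    my ( $prev_tag, $word, $unknown_tag ) = @_;
    my $candidates = $lexicon{ lc $word } or return $unknown_tag;
    my ( $best, $best_score ) = ( $unknown_tag, 0 );
    for my $tag ( sort keys %$candidates ) {
        # Unseen transitions get a small floor value instead of zero.
        my $score = $candidates->{$tag} * ( $transition{$prev_tag}{$tag} // 0.0001 );
        ( $best, $best_score ) = ( $tag, $score ) if $score > $best_score;
    }
    return $best;
}

# Tag a sentence left to right, carrying the previous tag forward.
sub tag_words {
    my @words = @_;
    my $prev  = 'pp';    # start as if a sentence has just ended
    my @out;
    for my $word (@words) {
        my $tag = pick_tag( $prev, $word, 'nn' );
        push @out, "$word/$tag";
        $prev = $tag;
    }
    return @out;
}

print join( ' ', tag_words(qw(The dog barks)) ), "\n";
```

With these invented tables, "barks" is tagged as a verb rather than a plural noun because the preceding tag is a noun. The context is what breaks the tie.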

The tagger also recursively extracts as many nouns and noun phrases as it can, using a set of regular expressions.

CLASS METHODS

initialize: Downloads some corpus data and saves it in a stored hash on the local filesystem. This is called automatically the first time you run the tagger.

METHODS

new %PARAMS

Class constructor. Takes a hash with the following parameters (shown with default values):

unknown_word_tag => '': Tag to assign to unknown words
debug => 0: Print some debugging info.
stem => 1: Stem single words using Lingua::Stem::EN
weight_noun_phrases => 1: When returning occurrence counts for a noun phrase, multiply the value by the number of words in the NP.
longest_noun_phrase => 100: Will ignore noun phrases longer than this threshold
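The effect of the two noun-phrase parameters can be sketched in plain Perl. The adjust_counts helper and the raw counts below are hypothetical and are not part of the module. The sketch only assumes what the parameter descriptions say: phrases longer than the threshold are dropped, and weighted counts are multiplied by the phrase's word count.

```perl
use strict;
use warnings;

# Hypothetical helper: adjust raw phrase counts the way the two
# noun-phrase parameters are described. Not the module's own code.
sub adjust_counts {
    my ( $counts, %opt ) = @_;
    my %out;
    for my $phrase ( keys %$counts ) {
        # Count whitespace-separated words in the phrase.
        my $n_words = () = $phrase =~ /\S+/g;
        next if $n_words > $opt{longest_noun_phrase};
        $out{$phrase} = $opt{weight_noun_phrases}
            ? $counts->{$phrase} * $n_words
            : $counts->{$phrase};
    }
    return %out;
}

# Invented raw counts for illustration.
my %raw = ( 'dog' => 3, 'part of speech tagger' => 2 );
my %adjusted = adjust_counts(
    \%raw,
    weight_noun_phrases => 1,
    longest_noun_phrase => 100,
);
print "$_ => $adjusted{$_}\n" for sort keys %adjusted;
```

Weighting favors longer phrases, which tend to be more specific than single nouns.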

get_words DOC or TEXT

Given a text string, return as many nouns and noun phrases as possible. Applies add_tags and involves three stages:

Tag the text
Extract all the maximal noun phrases
Recursively extract all noun phrases from the MNPs
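The second and third stages can be sketched on a hand-tagged string. This is an illustration under stated assumptions, not the module's implementation. The tag names (jj for adjectives, nn for nouns), the simplified definition of a maximal noun phrase, and the helper names max_noun_phrases and sub_phrases are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hand-tagged input standing in for stage 1. Tag names are assumed.
my $tagged = '<det>the</det> <jj>natural</jj> <nn>language</nn> '
           . '<nn>processing</nn> <vbz>helps</vbz> <nn>people</nn>';

# Stage 2: in this sketch a maximal noun phrase is the longest run of
# adjectives and nouns that ends in a noun.
sub max_noun_phrases {
    my ($text) = @_;
    my @phrases;
    while ( $text =~ m{((?:<(?:jj|nn)>[^<]+</(?:jj|nn)>\s*)*<nn>[^<]+</nn>)}g ) {
        ( my $plain = $1 ) =~ s/<[^>]+>//g;    # drop the tags
        push @phrases, $plain;
    }
    return @phrases;
}

# Stage 3: recursively peel off the leading word, so that
# "a b c" yields "a b c", "b c" and "c".
sub sub_phrases {
    my ($phrase) = @_;
    my @words = split ' ', $phrase;
    return $phrase if @words == 1;
    shift @words;
    return ( $phrase, sub_phrases("@words") );
}

my %count;
$count{$_}++ for map { sub_phrases($_) } max_noun_phrases($tagged);
print "$_ => $count{$_}\n" for sort keys %count;
```

Peeling words only from the left keeps the sketch short. A fuller extractor would also consider sub-phrases that drop trailing words.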

get_readable TEXT

Return an easy-on-the-eyes tagged version of a text string. Applies add_tags and reformats to be easier to read.
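As a rough picture of the reformatting step, the sketch below rewrites XML-style tags as word/TAG pairs. The exact output format of get_readable is not specified in this section, so the word/TAG layout, the tag names, and the to_readable helper are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn <tag>word</tag> into word/TAG.
sub to_readable {
    my ($tagged) = @_;
    ( my $out = $tagged ) =~ s{<(\w+)>([^<]+)</\1>}{$2/\U$1\E}g;
    return $out;
}

print to_readable('<det>The</det> <nn>dog</nn> <vbz>barks</vbz>'), "\n";
```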

add_tags TEXT

Examine the string provided and return it fully tagged (XML style).

strip_tags TEXT

Return a text string with the XML-style part-of-speech tags removed.
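Removing XML-style tags can be sketched with a single substitution. The strip_pos_tags helper below is a stand-in written for this example. It assumes tags never contain a ">" character and that collapsing runs of whitespace is acceptable.

```perl
use strict;
use warnings;

# Hypothetical stand-in for tag removal; not the module's own code.
sub strip_pos_tags {
    my ($tagged) = @_;
    ( my $text = $tagged ) =~ s/<[^>]+>//g;    # delete every <...> tag
    $text =~ s/\s+/ /g;                        # collapse whitespace runs
    $text =~ s/^\s+|\s+$//g;                   # trim both ends
    return $text;
}

print strip_pos_tags('<det>The</det> <nn>dog</nn>  <vbz>barks</vbz> '), "\n";
```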

clean_text TEXT

Strip the provided text of punctuation and HTML-style tags in preparation for tagging.

_split_punct TEXT

Separate punctuation from words, where appropriate. This leaves trailing periods in place to be dealt with later. Called by clean_text.

choose_tag WORD

Select an appropriate tag for the word provided. This subroutine is context-sensitive: it remembers the tag assigned to the immediately preceding word, and uses it to calculate the probabilities of the possible POS assignments for this word.

_assign_tag TAG, WORD ( memoized )

Given a preceding tag TAG, assign a tag to WORD. Called by choose_tag.

reset

This subroutine resets the preceding tag to a sentence ender ( PP ). This prepares the first word of a new sentence to be tagged correctly.

_clean_word WORD

This subroutine determines whether a word should be considered in its lower or upper case form. This is useful in considering proper nouns and words that begin sentences. Called by choose_tag.

_classify_unknown_word WORD

This changes any word not appearing in the lexicon to identifiable classes of words handled by a simple unknown word classification metric. Called by _clean_word.

stem WORD ( memoized )

Returns the word stem as given by Lingua::Stem::EN. This can be turned off with the class parameter 'stem' => 0.
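Memoization means that repeated calls with the same word return a cached result instead of recomputing it. One common way to get this in Perl is the core Memoize module, shown below on a deliberately crude stand-in stemmer. The crude_stem function is invented for this example and is far simpler than Lingua::Stem::EN. This section does not say how the tagger itself implements its caching.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;    # counts how often the real work is done

# Invented stand-in stemmer: lowercases and strips a few common suffixes.
sub crude_stem {
    my ($word) = @_;
    $calls++;
    ( my $stem = lc $word ) =~ s/(?:ing|ed|s)$//;
    return $stem;
}

# Wrap the function so results are cached per argument.
memoize('crude_stem');

my @stems = map { scalar crude_stem($_) } qw(walking Walking walking walked);
print "@stems (computed $calls times)\n";
```

The cache key is the argument string exactly as passed, so "walking" and "Walking" are cached separately even though they produce the same stem.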

_get_max_noun_regex

This returns a compiled regex for extracting maximal noun phrases from a POS-tagged text.

_get_sentence_regex

This returns a compiled regex for extracting individual sentences from a POS-tagged text. This is used by get_style_data.

_get_noun_regex

This returns a compiled regex for extracting single nouns from a POS-tagged text. This is used by get_nouns.

get_style_data TAGGED_STRING

This subroutine extracts style data statistics for each sentence of the tagged string input. The return value is an array of hash references, where each array element corresponds to a sentence. Each hash reference contains the following data:

$ref->{sentence}: The tagged sentence itself
$ref->{words}: The number of words in the sentence
$ref->{phrases}: The average length of maximal noun phrases
$ref->{nouns}: The frequency of nouns in the sentence
$ref->{adj}: The frequency of adjectives in the sentence
$ref->{prep}: The frequency of prepositions in the sentence
$ref->{verbs}: The frequency of verbs in the sentence
$ref->{adv}: The frequency of adverbs in the sentence
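A caller would typically loop over the returned array and read fields from each hash reference. The sketch below does this on a hand-written stand-in for the return value. The sentences, tag names, and all numbers are invented, and mean_of is a hypothetical helper rather than part of the module.

```perl
use strict;
use warnings;

# Hand-written stand-in for the return value; every number is invented.
my @style = (
    {
        sentence => '<det>The</det> <nn>dog</nn> <vbz>barks</vbz>',
        words => 3, phrases => 1, nouns => 1, adj => 0,
        prep  => 0, verbs   => 1, adv   => 0,
    },
    {
        sentence => '<det>The</det> <jj>old</jj> <nn>cat</nn> <vbz>sleeps</vbz> <rb>soundly</rb>',
        words => 5, phrases => 2, nouns => 1, adj => 1,
        prep  => 0, verbs   => 1, adv   => 1,
    },
);

# Hypothetical helper: average one field over all sentence records.
sub mean_of {
    my ( $key, @records ) = @_;
    return 0 unless @records;
    my $sum = 0;
    $sum += $_->{$key} for @records;
    return $sum / @records;
}

printf "sentences: %d\n", scalar @style;
printf "mean words per sentence: %.1f\n", mean_of( 'words', @style );
printf "mean adverbs per sentence: %.1f\n", mean_of( 'adv', @style );
```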

get_nouns TAGGED_TEXT

Given a POS-tagged text, this method returns all nouns and their occurrence frequencies.

get_max_noun_phrases TAGGED_TEXT

Given a POS-tagged text, this method returns only the maximal noun phrases. May be called directly, but is also used by get_noun_phrases.

get_noun_phrases TAGGED_TEXT

Similar to get_words, but requires a POS-tagged text as an argument.

_get_sub_phrases TAGGED_TEXT

Used by get_nouns, this extracts the nested noun phrases from a maximal noun phrase.

HISTORY

0.01: Created 10/02 by Aaron Coburn as LSI::Parser::POS. Moved to Lingua::EN::Tagger 2/03 by Maciej Ceglowski.

AUTHORS

Maciej Ceglowski <developer@ceglowski.com>
Aaron Coburn <acoburn@middlebury.edu>

This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.


