NAME

Lingua::TreeTagger - Using TreeTagger from Perl

VERSION

This documentation refers to Lingua::TreeTagger version 0.04.

SYNOPSIS

use Lingua::TreeTagger;

# Create a Tagger object.
my $tagger = Lingua::TreeTagger->new(
    'language' => 'english',
    'options'  => [ qw( -token -lemma -no-unknown ) ],
);

# The tagger's input text can be stored in a string (passed by reference)...
my $text_to_tag = 'Yet another sample text.';
my $tagged_text = $tagger->tag_text( \$text_to_tag );

# ... or in a file.
my $file_path = 'path/to/some/file.txt';
$tagged_text = $tagger->tag_file( $file_path );

# Both methods return a Lingua::TreeTagger::TaggedText object, i.e. a
# sequence of Lingua::TreeTagger::Token objects, which can be stringified
# as raw text...
print $tagged_text->as_text();

# ... or in XML format.
print $tagged_text->as_XML();

# Token objects may be accessed directly for more specific purposes.
foreach my $token ( @{ $tagged_text->sequence() } ) {

    # A token may contain a single SGML tag...
    if ( $token->is_SGML_tag() ) {
        print 'An SGML tag: ', $token->tag(), "\n";
    }

    # ... or a part-of-speech tag.
    else {
        print 'A part-of-speech tag: ', $token->tag(), "\n";

        # In the latter case, the token may also have attributes specifying
        # the original string...
        if ( defined $token->original() ) {
            print '  token: ', $token->original(), "\n";
        }

        # ... or the corresponding lemma.
        if ( defined $token->lemma() ) {
            print '  lemma: ', $token->lemma(), "\n";
        }
    }
}

DESCRIPTION

This Perl module provides a simple object-oriented interface to the TreeTagger part-of-speech tagger created by Helmut Schmid. See also Lingua::TreeTagger::TaggedText and Lingua::TreeTagger::Token.

METHODS

new()

Creates a new Tagger object. One named parameter is required:

language

A (lowercase) string specifying the language of the text that the tagger object will process (e.g. 'english', 'french', 'german', and so on). Note that the corresponding TreeTagger parameter files have to be installed by the user.

Three optional named parameters may be passed to the constructor:

use_utf8

A boolean flag indicating that the utf-8 version of the parameter file should be used. This also enables use of Unicode strings internally, use of the utf8 tokenizer by default, and use of the utf8 abbreviations file (if present) for tokenization.

options

A reference to a list of options to be passed to TreeTagger. Note that this module supports only those options that work as flags (e.g. '-token' or '-lemma'), and it excludes some flags that are used for purposes other than part-of-speech tagging (e.g. '-proto' or '-print-prob-tree').

At present, the full list of supported options is the following (see the documentation of TreeTagger for details):

-token              -lemma              -sgml               -ignore-prefix
-no-unknown         -cap-heuristics     -hyphen-heuristics  -pt-with-lemma
-pt-with-prob       -base

The list of options defaults to '-token' and '-lemma'.
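
Since only the flags listed above are supported, it can be convenient to check a list of options before passing it to the constructor. The following is a minimal sketch of such a check; the helper name unsupported_options is hypothetical and not part of this module, and how the constructor itself reacts to an unsupported option is not covered here.

```perl
use strict;
use warnings;

# Options listed as supported in this documentation.
my %is_supported = map { $_ => 1 } qw(
    -token        -lemma          -sgml              -ignore-prefix
    -no-unknown   -cap-heuristics -hyphen-heuristics -pt-with-lemma
    -pt-with-prob -base
);

# Hypothetical helper (not part of Lingua::TreeTagger): returns the
# options that do not appear in the list above.
sub unsupported_options {
    my ( $options_ref ) = @_;
    return grep { !$is_supported{$_} } @{$options_ref};
}

my @options  = qw( -token -lemma -proto );
my @rejected = unsupported_options( \@options );
print "Unsupported: @rejected\n" if @rejected;
```

A list that passes this check can then be given to new() as the value of the 'options' parameter.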

tokenizer

A reference to a subroutine for tokenizing the input text. This subroutine must take a reference to a string as argument and return a reference to the tokenized string, where each line contains a distinct token. Here is a simple example of such a subroutine:

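sub my_tokenizer {
    my ( $original_text_ref ) = @_;

    # Splitting on ' ' (rather than /\s+/) skips leading whitespace, so
    # that no empty token is produced at the start of the text.
    my @tokens = split ' ', ${$original_text_ref};

    # One token per line.
    my $tokenized_text = join "\n", @tokens;
    return \$tokenized_text;
}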
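
As a further illustration of the same interface, the following sketch also detaches common punctuation marks from the words they are attached to. The name punctuation_tokenizer and the chosen set of punctuation characters are assumptions made for this example; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: splits on whitespace and detaches punctuation.
sub punctuation_tokenizer {
    my ( $original_text_ref ) = @_;

    # Work on a copy so that the caller's string is left untouched.
    my $text = ${$original_text_ref};

    # Surround each punctuation mark with spaces so that it becomes a
    # token of its own (the character set is an arbitrary choice).
    $text =~ s/([.,;:!?()"])/ $1 /g;

    # Splitting on ' ' ignores leading whitespace and collapses runs of
    # whitespace.
    my @tokens = split ' ', $text;

    # One token per line, as required by the 'tokenizer' parameter.
    my $tokenized_text = join "\n", @tokens;
    return \$tokenized_text;
}

# The subroutine would then be passed to the constructor by reference:
#
#     my $tagger = Lingua::TreeTagger->new(
#         'language'  => 'english',
#         'tokenizer' => \&punctuation_tokenizer,
#     );
```

Note that such a naive rule also splits abbreviations and decimal numbers (e.g. '3.14'), so it is only suitable for simple input.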

tag_file()

Tokenizes and tags the textual content of a file. It requires only one argument, namely the path to the file, e.g. a string such as 'path/to/some/file.txt'. The method returns a Lingua::TreeTagger::TaggedText object.

tag_text()

Tokenizes and tags the text contained in a string. It requires only one argument, namely a reference to the string to be tagged. The method returns a Lingua::TreeTagger::TaggedText object.

ACCESSORS

language()

Read-only accessor for the 'language' attribute of a TreeTagger object.

options()

Read-only accessor for the 'options' attribute of a TreeTagger object, i.e. a reference to the list of options it uses.

tokenizer()

Read-only accessor for the 'tokenizer' attribute of a TreeTagger object, i.e. a reference to the custom tokenizer subroutine it uses (if any).

DIAGNOSTICS

There is no parameter file for language ...

This exception is raised by the class constructor when attempting to create a new TreeTagger object with a 'language' attribute for which no parameter file is installed in TreeTagger's /lib directory.

Method tag_file requires a path argument

This exception is raised when method tag_file() is called without specifying a path argument.

Method tag_text requires a string reference as argument

This exception is raised when method tag_text() is called without providing a string reference as argument.

Couldn't fork: ...

This exception is raised by methods tag_file() and tag_text() when they fail to create a child process for executing the TreeTagger program.

File ... not found

This exception is raised when method tag_file() is called with a path argument corresponding to a file that does not exist.
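
Since these conditions are reported as exceptions, a caller that wants to continue after a failure can trap them with eval. The following is only a sketch under that assumption; the wrapper name try_tagging is hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of Lingua::TreeTagger): runs a tagging
# operation given as a code reference and returns its result, or undef
# (after emitting a warning) if the operation raised an exception or
# returned nothing.
sub try_tagging {
    my ( $operation ) = @_;

    my $result = eval { $operation->() };
    if ( !defined $result ) {
        my $error = $@ || 'unknown error';
        warn "Tagging failed: $error";
        return;
    }
    return $result;
}

# Intended usage, assuming $tagger and $file_path are defined as in the
# SYNOPSIS:
#
#     my $tagged_text = try_tagging(
#         sub { $tagger->tag_file( $file_path ) }
#     );
```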

CONFIGURATION AND ENVIRONMENT

Installing and using this module requires a working version of TreeTagger (available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger). Windows users are advised to follow the installation instructions given at http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm. There is also a Lingua::TreeTagger::Installer module created by Alberto Simões (that distribution is not directly related to the present one).

The particular set of TreeTagger parameter files installed on the user's machine determines the set of languages that can be used by this module. Note that the parameter file for English must be installed for the successful execution of the distribution tests.

During the installation procedure, the user is prompted for the path to TreeTagger's base directory (e.g. C:\Program Files\TreeTagger), which is used for testing and saved for later use in module Lingua::TreeTagger::ConfigData.

DEPENDENCIES

This is the base module of the Lingua::TreeTagger distribution. It uses modules Lingua::TreeTagger::TaggedText (version 0.01), Lingua::TreeTagger::Token (version 0.01), and Lingua::TreeTagger::ConfigData (automatically generated during the installation procedure).

This module requires module Moose and was developed using version 1.09. Please report incompatibilities with earlier versions to the author.

Also required are modules File::Temp (version 0.19 or later) and Path::Class (version 0.19 was used for development; please report incompatibilities with earlier versions).

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

In the current version, the options() accessor is read-only, which implies that a new TreeTagger object must be created whenever a change in the set of options is needed (see "BUGS AND LIMITATIONS" in Lingua::TreeTagger::TaggedText). This can be expected to change in a future version.

This module attempts to provide a user-friendly object-oriented interface to TreeTagger, but it is seriously limited from the point of view of performance. Each call to tag_text() or tag_file() launches a new execution of the TreeTagger program, and a considerable share of the running time is most probably spent on the program's initialization.

If performance is critical, there are essentially three available options: (i) reduce the number of calls to tag_text() and tag_file() by buffering a larger amount of text to tag, (ii) try the Alvis::TreeTagger module (which does not seem to work on Windows), or (iii) help the author find out how to use a module such as IPC::Open2 to open a permanent two-way communication channel between this module and the TreeTagger executable.
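
Option (i) can be sketched as follows: short texts are grouped into larger batches, and each batch is then tagged with a single call. The helper name batch_texts and the size limit are assumptions made for this example, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::TreeTagger): groups texts
# into newline-separated batches of at most $max_chars characters each.
# A single text longer than the limit gets a batch of its own.
sub batch_texts {
    my ( $texts_ref, $max_chars ) = @_;

    my @batches;
    my $current = q{};

    foreach my $text ( @{$texts_ref} ) {

        # Start a new batch if adding this text (plus a separating
        # newline) would exceed the limit.
        if ( length($current)
            && length($current) + 1 + length($text) > $max_chars )
        {
            push @batches, $current;
            $current = q{};
        }
        $current .= "\n" if length($current);
        $current .= $text;
    }
    push @batches, $current if length($current);

    return \@batches;
}

# Intended usage, assuming $tagger is defined as in the SYNOPSIS:
#
#     foreach my $batch ( @{ batch_texts( \@texts, 100_000 ) } ) {
#         my $tagged_text = $tagger->tag_text( \$batch );
#     }
```

With this approach the boundaries between the original texts are not preserved in the tagged output, so it is best suited to cases where only the overall sequence of tokens matters.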

ACKNOWLEDGEMENTS

The author is grateful to Alberto Simões, Christelle Cocco, Yannis Haralambous, and Andrew Zappella for their useful feedback.

Also a warm thank you to Tara Andrews, who provided a patch for adding Unicode support to the module, as well as Zoffix Znet and Hiroyuki Yamanaka, who provided patches for fixing a bug related to a modification of the Moose dependency.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2010-2017 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::TreeTagger::TaggedText, Lingua::TreeTagger::Token