NAME
Lingua::CollinsParser - Head-driven syntactic sentence parser
SYNOPSIS
use Lingua::CollinsParser;
my $p = Lingua::CollinsParser->new();
my $cp_home = '/path/to/COLLINS-PARSER';
$p->load_grammar("$cp_home/models/model1/grammar");
$p->load_events( "$cp_home/models/model1/events");
my @words = qw(The bird flies);
my @tags = qw(DT NN VBZ);
my $tree = $p->parse_sentence(\@words, \@tags);
DESCRIPTION
Syntactic parsing is the act of constructing a phrase-structure tree (or several alternative trees) from a natural-language sentence.
There are many different ways to do this, producing different styles of output and consuming varying amounts of space and time. One of the most successful recent methods was developed by Michael Collins as part of his 1999 Ph.D. work at the University of Pennsylvania. It uses "head-driven" statistical models, in which one word from each subtree is designated as the "head" of that subtree. The head words can be very useful when analyzing the tree output.
This module, Lingua::CollinsParser, is a Perl wrapper around Collins' parser. The parser itself is written in C.
CONCURRENCY
Because the internal C code of the parser uses lots of global variables to maintain state, it is currently impossible to have more than one parser instance at the same time. Therefore, the class behaves in a "Singleton" manner, i.e. repeated calls to new() return the same parser rather than new ones.
However, if a cleanup effort is undertaken in the parser's C code in the future, it may be possible to remove its reliance on global variables, and the new() method could start returning new instances with each call. Therefore, please don't rely on future versions of Lingua::CollinsParser behaving as singletons.
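To check which behaviour an installed version has, a minimal sketch along the following lines may help. The same_instance() helper is hypothetical and not part of Lingua::CollinsParser; it only compares reference addresses. The check is guarded so the script still runs where the module is not installed.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical helper (not part of Lingua::CollinsParser):
# true when both arguments are the very same reference.
sub same_instance {
    my ($x, $y) = @_;
    return refaddr($x) == refaddr($y);
}

# Only attempt the check if the module can be loaded.
if (eval { require Lingua::CollinsParser; 1 }) {
    my $first  = Lingua::CollinsParser->new();
    my $second = Lingua::CollinsParser->new();
    print same_instance($first, $second)
        ? "new() returned the same parser\n"
        : "new() returned distinct parsers\n";
}
```

Code that creates one parser and passes it explicitly to whatever needs it should keep working under either behaviour.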
METHODS
The following methods are available in the Lingua::CollinsParser class:
- new(...)
-
Creates a new Lingua::CollinsParser object and returns it. For initialization, new() accepts a list of key-value pairs corresponding to the five accessor methods below (beamsize, punc_flag, distaflag, distvflag, npflag). If a key is present, the corresponding accessor is called with its value.
- beamsize( [value] )
-
A real number specifying the size of the "beam", which controls how aggressively the parser prunes unlikely candidates during its search. Default value is 10000. Smaller numbers like 1000 may be used to increase speed at a slight cost in accuracy.
- punc_flag( [value] )
-
A boolean flag indicating whether to use the "punctuation constraint". A description of this constraint comes from Collins' Ph.D. thesis:
    If for any constituent Z in the chart Z -> <..X Y..> two of its children X and Y are separated by a comma, then the last word in Y must be directly followed by a comma, or must be the last word in the sentence. In training data 96% of commas follow this rule. The rule also has the benefit of improving efficiency by reducing the number of constituents in the chart.
The default is true, i.e. to use the constraint.
- distaflag( [value] )
-
A boolean flag indicating whether the "adjacency condition" in the distance measure should be used. This is explained somewhere in Collins' Ph.D. thesis, though I couldn't quite figure out where. Default is true.
- distvflag( [value] )
-
A boolean flag indicating whether the "verb condition" in the distance measure should be used. This is explained somewhere in Collins' Ph.D. thesis, though I couldn't quite figure out where. Default is true.
- npflag( [value] )
-
A boolean flag indicating whether noun phrases should always include both NP and NPB levels, or whether the extra NP level may be omitted when superfluous. The default is to omit it, i.e. the flag is true by default. For example, with npflag=1 you may get the following structure:
    (TOP (S (NPB the man) (VP saw (NPB the dog))))
whereas with npflag=0 you might get the following:
    (TOP (S (NP (NPB the man)) (VP saw (NP (NPB the dog)))))
(This example comes from the README in Collins' parser distribution.)
- load_grammar($file)
-
Loads a grammar file (a few sample grammar files ship with Collins' parser distribution) into the parser. This must be done before calling parse_sentence().
- load_events($file)
-
Loads an events file (a few sample events files ship with Collins' parser distribution) into the parser. Either this or undump_events_hash() must be done before calling parse_sentence().
- parse_sentence(\@words, \@tags)
-
Invokes the parser on the given sentence. The first argument must be an array reference containing the words of the sentence. The second argument must be an array reference containing those words' corresponding part-of-speech tags. A Lingua::CollinsParser::Node object is returned, representing a syntax tree for the sentence.
To generate the array of part-of-speech tags, you may be interested in Lingua::BrillTagger, InXight (http://www.inxight.com/), or GATE (http://gate.ac.uk/).
- dump_events_hash($file)
-
Calling load_events() takes a really long time, so this method is provided to "freeze" the loaded events hash to a file, from which it can be "thawed" later with undump_events_hash(). Thawing is much faster than loading. For instance, if you run the regression tests twice in a row during installation, you'll notice that the second run is much faster, because the first run dumped the hash information.
- undump_events_hash($file)
-
Loads an events hash from a file that was previously created using dump_events_hash().
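The methods above can be combined so that the slow load_events() call happens only once. The following is a minimal sketch, not the module's own recipe: the load_events_cached() helper, the events.dump file name, and the chosen option values are assumptions made for illustration, while every parser method it calls is one documented above. The usage part only runs if the module and the (placeholder) parser directory are present.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::CollinsParser): thaw the events
# hash from $dump_file if it exists; otherwise load it the slow way and
# freeze it for next time. Returns a string saying which path was taken.
sub load_events_cached {
    my ($parser, $events_file, $dump_file) = @_;
    if (-e $dump_file) {
        $parser->undump_events_hash($dump_file);
        return 'undumped';
    }
    $parser->load_events($events_file);
    $parser->dump_events_hash($dump_file);
    return 'loaded';
}

my $cp_home = '/path/to/COLLINS-PARSER';   # placeholder path, as in the SYNOPSIS
my $dump    = 'events.dump';               # assumed cache file name

# Guarded so the script still runs where the module or models are absent.
if (-d $cp_home && eval { require Lingua::CollinsParser; 1 }) {
    # Options are passed to new() as key-value pairs named after the accessors.
    my $p = Lingua::CollinsParser->new(beamsize => 1000, npflag => 0);
    $p->load_grammar("$cp_home/models/model1/grammar");
    load_events_cached($p, "$cp_home/models/model1/events", $dump);

    my @words = qw(The bird flies);
    my @tags  = qw(DT NN VBZ);
    my $tree  = $p->parse_sentence(\@words, \@tags);
}
```

Keeping the cache logic in a small function that takes the parser as an argument means it does not depend on new() returning a singleton.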
AUTHOR
Ken Williams, ken.williams@thomson.com
COPYRIGHT
The Lingua::CollinsParser perl interface is copyright (C) 2004 Thomson Legal & Regulatory, and written by Ken Williams. It is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The Collins Parser is copyright (C) 1999 by Michael Collins - you will find full copyright and license information in its distribution. The Parser.patch file distributed here is granted under the same license terms as the parser code itself.
SEE ALSO
Lingua::CollinsParser::Node
Lingua::BrillTagger
http://www.ai.mit.edu/people/mcollins/code.html (The Collins Parser)