NAME
CWB::CEQL - The Common Elementary Query Language for CQP front-ends
SYNOPSIS
use CWB::CEQL;
our $CEQL = new CWB::CEQL;
$CEQL->SetParam("pos_attribute", "tags"); # **TODO: parameters**
$cqp_query = $CEQL->Parse($ceql_query);
if (not defined $cqp_query) {
@error_msg = $CEQL->ErrorMessage;
$html_msg = $CEQL->HtmlErrorMessage;
}
## extend or modify standard CEQL grammar by subclassing
package BNCWEB::CEQL;
use base 'CWB::CEQL';
sub lemma {
my ($self, $string) = @_; # grammar rules are methods that receive the input string
## overwrite 'lemma' rule here (e.g. to allow for BNCweb's ``{bucket/N}'' notation)
my $orig_result = $self->SUPER::lemma($string); # call original rule if needed
}
## you can now use BNCWEB::CEQL in the same way as CWB::CEQL
DESCRIPTION
** TODO **
METHODS
The most important user-level methods are inherited from CWB::CEQL::Parser.
- $CEQL = new CWB::CEQL;
-
Create parser object for CEQL queries. Use the Parse method of $CEQL to translate a CEQL query into CQP code.
- $cqp_query = $CEQL->Parse($simple_query);
-
Parses a simple query in CEQL syntax and returns the equivalent CQP code. If there is a syntax error in $simple_query or parsing fails for some other reason, an undefined value is returned.
- @text_lines = $CEQL->ErrorMessage;
- $html_code = $CEQL->HtmlErrorMessage;
-
If the last CEQL query failed to parse, these methods return an error message either as a list of text lines (ErrorMessage) or as pre-formatted HTML code that can be used directly by a Web interface (HtmlErrorMessage). The error message includes a backtrace of the internal call stack in order to help users identify the precise location of the problem.
- $CEQL->SetParam($name, $value);
-
Change parameters of the CEQL grammar. Currently, the following parameters are available:
pos_attribute
-
The p-attribute used to store part-of-speech tags in the CWB corpus (default: pos). CEQL queries should not be used for corpora without POS tagging, which we consider to be a minimal level of annotation.
lemma_attribute
-
The p-attribute used to store lemmata (base forms) in the CWB corpus (default: lemma). Set to undef if the corpus has not been lemmatised.
simple_pos
-
Lookup table for simple part-of-speech tags (in CEQL constructions like run_{N}). Must be a hashref with simple POS tags as keys and CQP regular expressions matching an appropriate set of standard POS tags as the corresponding values. The default value is undef, indicating that no simple POS tags have been defined. A very basic setup for the Penn Treebank tag set might look like this:
$CEQL->SetParam("simple_pos", {
    "N" => "NN.*", # common nouns
    "V" => "V.*",  # any verb forms
    "A" => "JJ.*", # adjectives
});
simple_pos_attribute
-
Simple POS tags may use a different p-attribute than standard POS tags, specified by the simple_pos_attribute parameter. If it is set to undef (default), the pos_attribute will be used for simplified POS tags as well.
s_attributes
-
Lookup table indicating which s-attributes in the CWB corpus may be accessed in CEQL queries (using the XML tag notation, e.g. <s> or </s>, or as a distance operator in proximity queries, e.g. <<s>>). The main purpose of this table is to keep the CEQL parser from passing arbitrary tags through to the CQP code, which might generate confusing error messages. Must be a hashref with the names of valid s-attributes as keys mapped to TRUE values. The default setting only allows sentences (s-units), which should be annotated in every corpus:
$CEQL->SetParam("s_attributes", { "s" => 1 });
default_ignore_case
-
Indicates whether CEQL queries should perform case-insensitive matching for word forms and lemmas (:c modifier), which can be overridden with an explicit :C modifier. By default, case-insensitive matching is activated, i.e. default_ignore_case is set to 1.
default_ignore_diac
-
Indicates whether CEQL queries should ignore accents (diacritics) for word forms and lemmas (:d modifier), which can be overridden with an explicit :D modifier. By default, matching does not ignore accents, i.e. default_ignore_diac is set to 0.
See the CWB::CEQL::Parser manpage for more detailed information and further methods.
CEQL SYNTAX
** TODO **
EXTENDING CEQL
** TODO **: How to extend the standard CEQL grammar by subclassing. Note that the grammar is split into many small rules, so it is easy to modify by overriding individual rules completely (without having to call the original rule in between or having to replicate complicated functionality).
See CWB::CEQL::Parser for details on how to write grammar rules. You should always have a copy of the CWB::CEQL source code file at hand when writing your extensions. All rules of the standard CEQL grammar are listed below with short descriptions of their function and purpose.
STANDARD CEQL RULES
ceql_query
default
-
The default rule of CWB::CEQL is ceql_query. After sanitising whitespace, it uses a heuristic to determine whether the input string is a phrase query or a proximity query and delegates parsing to the appropriate rule (phrase_query or proximity_query).
Phrase Query
phrase_query
-
A phrase query is the standard form of CEQL syntax. It matches a single token described by constraints on word form, lemma and/or part-of-speech tag, a sequence of such tokens, or a complex lexico-grammatical pattern. The phrase_query rule splits its input into whitespace-separated token expressions, XML tags and metacharacters such as (, ) and |. Then it applies the phrase_element rule to each item in turn, and concatenates the results into the complete CQP query.
phrase_element
-
A phrase element is either a token expression (delegated to rule token_expression), an XML tag for matching structure boundaries (delegated to rule xml_tag), a sequence of arbitrary (+) or skipped (*) tokens, or a phrase-level metacharacter (the latter two are handled by the phrase_element rule itself). Proper nesting of parenthesised groups is automatically ensured by the parser.
xml_tag
-
A start or end tag matching the boundary of an s-attribute region. The xml_tag rule only performs validation, in particular ensuring that the region name is listed as an allowed s-attribute in the parameter s_attributes, then passes the tag through to the CQP query.
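The whitelist check described for xml_tag can be pictured with a small standalone sketch. It does not use CWB::CEQL; the function name check_xml_tag, the regular expression for region names and the error wording are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CWB::CEQL): validate a start or end tag
# against a whitelist hash shaped like the "s_attributes" parameter.
sub check_xml_tag {
    my ($tag, $allowed) = @_;
    # accept <name> or </name>; anything else is rejected (assumed syntax)
    my ($name) = $tag =~ m{^</?([A-Za-z_][A-Za-z0-9_-]*)>\z}
        or die "'$tag' is not a valid XML tag\n";
    die "s-attribute <$name> is not allowed in this corpus\n"
        unless $allowed->{$name};
    return $tag;    # the tag itself is passed through unchanged
}

my %s_attributes = ( s => 1 );    # same shape as the documented default
print check_xml_tag("<s>",  \%s_attributes), "\n";
print check_xml_tag("</s>", \%s_attributes), "\n";

# a tag that is not whitelisted raises an error instead of reaching CQP
my $ok = eval { check_xml_tag("<p>", \%s_attributes); 1 };
print "rejected: $@" unless $ok;
```

Rejecting unknown region names at this stage is what keeps arbitrary tags from producing confusing CQP error messages later on.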
Proximity Query
proximity_query
-
A proximity query searches for combinations of words within a certain distance of each other, specified either as a number of tokens (numeric distance) or as co-occurrence within an s-attribute region (structural distance). The proximity_query rule splits its input into a sequence of token patterns, distance operators and parentheses used for grouping. Shorthand notation for word sequences is expanded (e.g. as long as into as >>1>> long >>2>> as), and then the proximity_expression rule is applied to each item in turn. A shift-reduce algorithm in proximity_expression reduces the resulting list into a single CQP query (using the undocumented "MU" notation).
proximity_expression
-
A proximity expression is either a token expression (delegated to token_expression), a distance operator (delegated to distance_operator) or a parenthesis for grouping subexpressions (handled directly). At each step, the current result list is examined to check whether the respective type of proximity expression is valid here. When 3 elements have been collected in the result list (term, operator, term), they are reduced to a single term. This ensures that the Apply method in proximity_query returns only a single string containing the (almost) complete CQP query.
distance_operator
-
A distance operator specifies the allowed distance between two tokens or subexpressions in a proximity query. Numeric distances are given as a number of tokens and can be two-sided (<<n>>) or one-sided (<<n<< to find the second term to the left of the first, or >>n>> to find it to the right). Structural distances are always two-sided and specify an s-attribute region in which both items must co-occur (e.g. <<s>>).
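Two steps from this section lend themselves to a short standalone sketch: expanding a plain word sequence into explicit distances, and recognising the four operator forms. The code below does not use CWB::CEQL. The function names, the returned hash layout and the pattern for region names are hypothetical, and the expansion simply follows the single documented example (distances counted from the first word).

```perl
use strict;
use warnings;

# Hypothetical helper: turn a word sequence into explicit one-sided
# distances, each measured from the first word of the sequence.
sub expand_sequence {
    my @words = @_;
    return "" unless @words;
    my @out  = (shift @words);
    my $dist = 0;
    for my $w (@words) {
        $dist++;
        push @out, ">>$dist>>", $w;
    }
    return join " ", @out;
}

# Hypothetical helper: classify a distance operator string.
sub classify_distance {
    my ($op) = @_;
    return +{ type => "numeric", side => "both",  n => $1 } if $op =~ /^<<(\d+)>>$/;
    return +{ type => "numeric", side => "left",  n => $1 } if $op =~ /^<<(\d+)<<$/;
    return +{ type => "numeric", side => "right", n => $1 } if $op =~ /^>>(\d+)>>$/;
    # structural distance: assumed to name a region such as "s"
    return +{ type => "structural", side => "both", region => $1 }
        if $op =~ /^<<([A-Za-z_][A-Za-z0-9_-]*)>>$/;
    die "invalid distance operator '$op'\n";
}

print expand_sequence(qw(as long as)), "\n";    # as >>1>> long >>2>> as
for my $op ("<<3>>", "<<2<<", ">>1>>", "<<s>>") {
    my $d = classify_distance($op);
    printf "%s: %s, %s\n", $op, $d->{type}, $d->{side};
}
```

The sketch stops at classification; building the final CQP query from terms and operators is the job of the shift-reduce step described above and is not reproduced here.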
Token Expression
token_expression
-
Evaluate a complete token expression with a word form (or lemma) constraint and/or a part-of-speech (or simple POS) constraint. The two parts of the token expression are passed on to word_or_lemma_constraint and pos_constraint, respectively. This rule returns a CQP token expression enclosed in square brackets.
Word Form / Lemma
word_or_lemma_constraint
-
Evaluate a complete word form or lemma constraint, including case/diacritics flags, and return suitable CQP code to be included in a token expression.
word_or_lemma
-
Evaluate word form (without curly braces) or lemma constraint (with curly braces) and return a single CQP constraint, to which %c and %d flags can then be added.
wordform_pattern
-
Translate wildcard pattern for word form into CQP constraint (using the default word attribute).
lemma_pattern
-
Translate wildcard pattern for lemma into CQP constraint, using the appropriate p-attribute for base forms (given by the parameter lemma_attribute).
Parts of Speech
pos_constraint
-
Evaluate a part-of-speech constraint (either a pos_tag or simple_pos), returning suitable CQP code to be included in a token expression.
pos_tag
-
Translate wildcard pattern for part-of-speech tag into CQP constraint, using the appropriate p-attribute for POS tags (given by the parameter pos_attribute).
simple_pos
-
Translate simple part-of-speech tag into CQP constraint. The specified tag is looked up in the hash provided by the simple_pos parameter, and replaced by the regular expression listed there. If the tag cannot be found, or if no simple tags have been defined, a helpful error message is generated.
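The lookup performed by the simple_pos rule can be illustrated with a standalone sketch. It does not call CWB::CEQL; the function name lookup_simple_pos, the error texts and the exact shape of the returned constraint are assumptions, chosen only to show the two failure cases next to a successful lookup.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lookup step of the simple_pos rule.
#   $tag       - simple POS tag from the query, e.g. "N"
#   $table     - hashref shaped like the "simple_pos" parameter (or undef)
#   $attribute - p-attribute to match against, e.g. "pos"
sub lookup_simple_pos {
    my ($tag, $table, $attribute) = @_;
    die "no simple part-of-speech tags are available for this corpus\n"
        unless defined $table;
    my $regexp = $table->{$tag};
    die "'$tag' is not a valid simple part-of-speech tag (available: "
        . join(", ", sort keys %$table) . ")\n"
        unless defined $regexp;
    return qq{$attribute="$regexp"};    # assumed form of the CQP constraint
}

# same table as the Penn Treebank example for the simple_pos parameter
my %simple_pos = ( N => "NN.*", V => "V.*", A => "JJ.*" );
print lookup_simple_pos("N", \%simple_pos, "pos"), "\n";    # pos="NN.*"

my $ok = eval { lookup_simple_pos("X", \%simple_pos, "pos"); 1 };
print "error: $@" unless $ok;
```

Listing the available tags in the error message is one way to make the message helpful to end users; the wording used by the module itself may differ.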
Wildcard Patterns
wildcard_pattern
-
Translate string containing wildcards into regular expression, which is enclosed in double quotes so it can directly be interpolated into a CQP query.
Internally, the input string is split into wildcards and literal substrings, which are then processed one item at a time with the wildcard_item rule.
wildcard_item
-
Process an item of a wildcard pattern, which is either some metacharacter (handled directly) or a literal substring (delegated to the literal_string rule). Proper nesting of alternatives is ensured using the shift-reduce parsing mechanism (with BeginGroup and EndGroup calls).
literal_string
-
Translate literal string into regular expression, escaping all metacharacters with backslashes (backslashes in the input string are removed first).
Note that escaping of ^ and " isn't fully reliable because CQP might interpret the resulting escape sequences as LaTeX-style accents if they are followed by certain letters. Future versions of CQP should provide a safer escaping mechanism and/or allow interpretation of LaTeX-style accents to be turned off.
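The split-and-translate approach of the wildcard rules can be sketched in a few lines of standalone Perl. This is a simplified illustration and not the module's code: it assumes only three wildcards (? for one character, * for any string, + for a non-empty string), ignores alternatives and escaped wildcards, and uses hypothetical function names and an assumed set of metacharacters.

```perl
use strict;
use warnings;

# Hypothetical counterpart of literal_string: drop backslashes from the
# input, then protect regular expression metacharacters with backslashes.
sub escape_literal {
    my ($s) = @_;
    $s =~ s/\\//g;
    $s =~ s/([.?*+|(){}\[\]\^\$"])/\\$1/g;    # assumed metacharacter set
    return $s;
}

# Hypothetical counterpart of wildcard_pattern for three assumed wildcards.
sub wildcard_to_regexp {
    my ($pattern) = @_;
    my %wildcard = ( "?" => ".", "*" => ".*", "+" => ".+" );
    # split into wildcard characters and literal substrings, keeping both
    my @items = grep { length } split /([?*+])/, $pattern;
    my $regexp = join "",
        map { exists $wildcard{$_} ? $wildcard{$_} : escape_literal($_) } @items;
    return qq{"$regexp"};    # double quotes so it can be placed in a CQP query
}

print wildcard_to_regexp("super*"), "\n";    # "super.*"
print wildcard_to_regexp("?at"), "\n";       # ".at"
print wildcard_to_regexp("Mr."), "\n";       # "Mr\."
```

Processing literal substrings separately from wildcards means a full stop typed by the user is matched literally instead of acting as a regular expression metacharacter.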
Internal Subroutines
- ($has_empty_alt, @tokens) = $self->_remove_empty_alternatives(@tokens);
-
This internal method identifies and removes empty alternatives from a tokenised group of alternatives (@tokens), with alternatives separated by | tokens. In particular, leading and trailing separator tokens are removed, and multiple consecutive separators are collapsed to a single |. The first return value ($has_empty_alt) indicates whether one or more empty alternatives were found; it is followed by the sanitised list of tokens.
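The behaviour described for _remove_empty_alternatives can be reproduced in a standalone sketch. The version below is a plain function rather than a method and is written from the description alone, so it should be read as an approximation and not as the module's source code.

```perl
use strict;
use warnings;

# Sketch of the described behaviour: strip leading and trailing "|" tokens,
# collapse runs of "|" into one, and report whether anything was removed.
sub remove_empty_alternatives {
    my @tokens = @_;
    my @clean;
    my $has_empty_alt = 0;
    my $prev_was_sep  = 1;    # treat the start of the group like a separator
    for my $tok (@tokens) {
        if ($tok eq "|") {
            if ($prev_was_sep) { $has_empty_alt = 1 }    # leading or repeated separator
            else               { push @clean, $tok }
            $prev_was_sep = 1;
        }
        else {
            push @clean, $tok;
            $prev_was_sep = 0;
        }
    }
    if (@clean and $clean[-1] eq "|") {    # trailing separator
        pop @clean;
        $has_empty_alt = 1;
    }
    return ($has_empty_alt, @clean);
}

my ($has_empty_alt, @tokens) =
    remove_empty_alternatives("|", "a", "|", "|", "b", "|");
print "$has_empty_alt: @tokens\n";    # 1: a | b
```

A caller can use the flag to decide, for example, whether the whole group should be made optional in the generated expression; how CWB::CEQL uses it is not stated in this section.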
COPYRIGHT
Copyright (C) 1999-2010 Stefan Evert [http://purl.org/stefan.evert]
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.