NAME

Lingua::EN::Grammarian - Detect grammatical problems in text

VERSION

This document describes Lingua::EN::Grammarian version 0.000005

SYNOPSIS

use Lingua::EN::Grammarian ':all';

# Create a list of issues...
my @caution_objs = extract_cautions_from( $text );
my @error_objs   = extract_errors_from(   $text );

# Identify a single issue at a known 2D position...
my $caution_obj  = get_caution_at($text, $line, $col);
my $error_obj    =   get_error_at($text, $line, $col);

# Identify a single issue at a known index...
my $caution_obj  = get_caution_at($text, $index);
my $error_obj    =   get_error_at($text, $index);

# Extract information on each issue...
for my $problem (@cautions_or_errors) {
    my $actual_word_or_phrase  = $problem->match;
    my $start_location_in_text = $problem->from;
    my $end_location_in_text   = $problem->to;
    my $description_of_problem = $problem->explanation;
    my @suggested_corrections  = $problem->suggestions;
    ...
}

DESCRIPTION

This module provides a data-driven grammar checker for English text.

It builds a list of potential grammar problems from templates specified in two files (grammarian_errors and grammarian_cautions) and locates any such problems in a text string. Because the module is data-driven, it is easy for even non-technical users to refine or augment the rules used to identify grammatical problems.

Each problem discovered is reported as an object, whose methods can then be called to retrieve the corresponding substring of the original text, its location within the original text, a description of the problem, and a suggested remediation for it.

The module classifies grammatical problems as either errors or cautions. Errors are grammatical usages that are unequivocally wrong, such as "I gone home", "principle ingredient", "laying low", "their are it's collar", "its comprised from", "she do try and learns the words", "them have be quite unique", etc.

Cautions are words or phrases that are not inherently wrong, but which are commonly confused or misapplied. For example: "affect" vs "effect", "infer" vs "imply", "beg the question" vs "raise the question", "indict" vs "indite", "less" vs "fewer", "disburse" vs "disperse", etc.

Note that Lingua::EN::Grammarian is not a spell-checker. Neither errors nor cautions contain words that have been misspelt; they are composed of words that have been misused.

INTERFACE

Exportable subroutines

By default, the module exports only two subroutines:

extract_errors_from()

extract_errors_from() expects a single argument: a string in which it is to locate and identify grammatical errors, according to the rules in your grammarian_errors file(s).

It returns a list of objects, each of which represents a single error. The objects are returned in the order in which the errors were encountered in the text.

See "Methods of caution and error objects" for details of how these objects may be used.

extract_cautions_from()

Behaves exactly the same as extract_errors_from(), except that it locates and identifies grammatical cautions (according to your grammarian_cautions file), not errors.

You can also request five further subroutines by passing their names when the module is loaded. For example:

use Lingua::EN::Grammarian  'get_error_at', 'get_caution_at';

Note that, if this feature is used, only the explicitly named subroutines are exported (hence you may also need to add 'extract_cautions_from' or 'extract_errors_from', if you need either of those subs as well).

You can also request all available subroutines be exported, with:

use Lingua::EN::Grammarian  ':all';

The various exportable-by-request subroutines are:

get_error_at()

This subroutine expects two or three arguments: a string in which to search, followed by a location at which to search. The location may be either a single integer (which is taken as a zero-based index into the string), or two integers (which are treated as 1-based line and column specifiers):

my $error_obj = get_error_at($text, $index);

my $error_obj = get_error_at($text, $line, $column);

In either case, the sub returns an object representing the error occurring at that position in the text. If there is nothing wrong at that position, the sub returns undef instead.

The idea is that the index or line/column specification represents the location of a cursor or mouse over the text. Then get_error_at() can be used to determine whether or not that marker is hovering over a grammatical error and, if so, the nature of this error.

get_caution_at()

Works exactly the same as get_error_at(), with exactly the same interface, but returns an object only if there is a grammatical caution (not an error) at the specified location in the text.

get_coverage_stats()

Returns a hash whose entries indicate how many error and caution rules the module can currently identify. These numbers are, of course, determined by the configurations in your grammarian_errors and grammarian_cautions files.

get_vim_error_regexes()
get_vim_caution_regexes()

Each of these subroutines returns a list of strings that represent regexes (in the regex notation used by the Vim editor). Collectively these strings will match all the error and caution entries the module can recognize.

A list of strings is returned, rather than a single string, because Vim seems to impose a limit of about 32000 characters for a single regex pattern.

Typically, these patterns are passed to Vim's matchadd() function, to highlight grammatical problems in a buffer.

Methods of caution and error objects

Each object returned by the various subroutines of Lingua::EN::Grammarian encapsulates information regarding a single error or caution located in a particular text.

These objects provide the following methods for querying that information. None of them takes an argument.

match()

Returns the substring (of the original text) that was identified as a problem.

from()

Returns a hash representing the location in the original text at which the start of the problematic usage was detected. The keys of this hash are:

'index'

The zero-based offset in the string at which the start of the usage was detected.

'line'

The 1-based line-number of the line within the string in which the usage was detected. In other words: one more than the number of newlines between the start of the string and the start of the problem.

'column'

The 1-based column-number of the column within the string at which the start of the usage was detected. In other words: one more than the number of characters between the preceding newline (or the start of the string) and the start of the problem.

to()

Returns a hash representing the location in the original text at which the end of the problematic usage was detected. The keys of this hash are identical in name and nature to those of the from() method.
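
The relationship between the three keys can be illustrated with a minimal sketch. The helper index_to_line_col() below is hypothetical (it is not part of the module's interface) and simply derives a 1-based line and column from a zero-based index, following the definitions given above:

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by the module): convert a zero-based
# string index into the 1-based line and column described above.
sub index_to_line_col {
    my ($text, $index) = @_;

    my $prefix = substr($text, 0, $index);

    # Line: one more than the number of newlines before the index
    my $line = 1 + ($prefix =~ tr/\n//);

    # Column: distance from the preceding newline. rindex() returns -1
    # when there is no newline, which gives column = index + 1 on line 1.
    my $column = $index - rindex($prefix, "\n");

    return { index => $index, line => $line, column => $column };
}

my $text = "I gone home.\nTheir are it's collar.\n";
my $loc  = index_to_line_col($text, index($text, 'Their'));
printf "line %d, column %d\n", $loc->{line}, $loc->{column};
```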

explanation()

Returns a single string describing the problem that was detected.

This description will be taken from the relevant comments in grammarian_errors or from the appropriate definitions in grammarian_cautions. It may consist of multiple lines (i.e. the string may contain embedded newlines).

explanation_hash()

For cautions, returns a reference to a hash containing each alternative as a key, with that alternative's definition as the corresponding value. This is precisely the same information as returned by the explanation() method, but in a more structured form. For example, if $caution->explanation() returns:

"adverse  :  hostile or difficult
 averse   :  disinclined"

then $caution->explanation_hash() will return:

{
 'adverse' => 'hostile or difficult',
 'averse'  => 'disinclined',
}

For errors (whose explanations are not alternatives), this method returns a reference to an empty hash.

suggestions()

Returns a list of strings representing possible alternatives to the problematical usage.

For errors, this list is constructed from the replacement(s) specified after --> arrows in grammarian_errors.

For cautions, this list is constructed from the other terms specified in the same paragraph in grammarian_cautions.

The list is sorted "most-likely-replacement-first".

CONFIGURATION

Lingua::EN::Grammarian's grammar checking is configured via two files: grammarian_errors and grammarian_cautions. These files may be placed in any one or more of the following locations:

/usr/local/share/grammarian/
~/   (i.e. your home directory)
./   (i.e. the current directory)

whence they are read (in that order) and their contents concatenated into a single specification.

The two filenames may also be prefixed with a . (to render them invisible in directory listings). A given directory may contain both a visible and an invisible configuration file, in which case the invisible file is concatenated before--and hence is overridden by--the visible file in the same directory.
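
As a rough illustration of that read order, the following sketch lists the candidate paths for one configuration file. The helper candidate_config_files() is hypothetical and is not how the module itself locates its files; it only enumerates paths and does not check whether they exist. The home directory is passed in explicitly as an assumption of the sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: list the paths that would be consulted for one
# configuration file, in the read order described above.
sub candidate_config_files {
    my ($basename, $home) = @_;

    my @dirs = ('/usr/local/share/grammarian', $home, '.');

    # Within each directory, the hidden file is read before the visible one
    return map {
        my $dir = $_;
        map { join '/', $dir, $_ } (".$basename", $basename);
    } @dirs;
}

print "$_\n" for candidate_config_files('grammarian_errors', $ENV{HOME} // '~');
```

Later entries in the list correspond to files whose rules are concatenated later, and which therefore take precedence.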

The configuration formats for the two files are different, but both formats ignore blank lines and both use a leading # to specify comments. Unlike Perl comments, however, comments in these two configuration files can only be specified at the start of a line.

The grammarian_errors file

This file specifies the rules for detecting grammatical errors using several different formats.

Simple error specifications

Each line of the file specifies an erroneous pattern of text, followed by one or more possible corrections. Error and correction(s) are separated by a -->. Case is always ignored.

For example:

reply back      --> reply
koala bear      --> koala
could care less --> couldn't care less

can't never     -->  can't ever  --> can never
more optimal    -->  optimal     --> more optimized  -->  better
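
To make the structure of such a line concrete, here is a minimal sketch of how it might be split into an error and its corrections. The function parse_error_rule() is hypothetical and is not the module's own parser:

```perl
use strict;
use warnings;

# Hypothetical parser for one simple rule line (not the module's own code).
# Returns nothing for comment lines and for lines without an arrow.
sub parse_error_rule {
    my ($line) = @_;
    return if $line =~ /^#/ || $line !~ /-->/;

    # The first field is the erroneous usage; the rest are corrections
    my ($error, @corrections) = split /-->/, $line;
    s/^\s+//, s/\s+$// for $error, @corrections;

    return { error => $error, corrections => \@corrections };
}

my $rule = parse_error_rule("can't never     -->  can't ever  --> can never");
print "$rule->{error}: ", join(' | ', @{ $rule->{corrections} }), "\n";
```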

In addition to the problem and suggested replacement(s), you can also specify a description of the specific problem (or the general class of problem) in a preceding line that starts and ends with ===. The explanation itself then starts after the first whitespace gap, and ends at the last whitespace gap. For example:

===  incorrect use of preposition after "comprise"  ===

is comprised of -->  comprises


====[ A koala is a marsupial, not a bear ]====
koala bear  --> koala

A single explanation can apply to two or more errors...

=====/ Unnecessary extra word \========================

actual fact               --> fact
and plus                  --> and
because of the fact that  --> because

====={  Incorrect participle  }============================
has did   -->  has done
has have  -->  has had
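
The "first whitespace gap to last whitespace gap" convention for explanation lines can be sketched with a single regex. The function explanation_from_header() is a hypothetical illustration, assuming that any non-whitespace decoration may appear between the === markers and the explanation text:

```perl
use strict;
use warnings;

# Hypothetical helper: extract the explanation from a line that starts
# and ends with ===, taking the text between the first and the last
# whitespace gap. Returns undef if the line is not such a header.
sub explanation_from_header {
    my ($line) = @_;
    return $line =~ /^===\S*\s+(.*\S)\s+\S*===\s*$/ ? $1 : undef;
}

for my $header (
    '===  incorrect use of preposition after "comprise"  ===',
    '====[ A koala is a marsupial, not a bear ]====',
    '=====/ Unnecessary extra word \========================',
) {
    print explanation_from_header($header), "\n";
}
```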

Parallel error specifications

If either usage contains a (set,of,comma-separated,words,like,this), the list is expanded into separate rules for each alternative. For example:

(I,you,we,they) sees  -->  (I,you,we,they) see

(can't,won't) never   -->  (can't,won't) ever  -->  (can,will) never

(very,totally) unique -->  unique

are shorthands for:

       I sees  -->     I see
     you sees  -->   you see
      we sees  -->    we see
    they sees  -->  they see

  can't never  -->  can't ever  -->  can never
  won't never  -->  won't ever  -->  will never

   very unique -->  unique
totally unique -->  unique

Note, however, that you can currently only specify one list of alternatives in any given rule. For example, the following construction does not (yet) work:

(to,at,from,with) (I,we,they) --> (to,at,from,with) (me,us,them)
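
The expansion of a single list per usage can be sketched as follows. This is an illustrative approximation rather than the module's implementation; expand_parallel() and alternatives_in() are hypothetical names, and the sketch assumes that every list in a rule has the same number of alternatives:

```perl
use strict;
use warnings;

# Return the alternatives in the first parenthesized list, or undef
sub alternatives_in {
    my ($part) = @_;
    return $part =~ /\(([^()]*)\)/ ? [ split /\s*,\s*/, $1 ] : undef;
}

# Hypothetical expansion of one parallel rule into its separate rules.
# The Nth alternative of every list is used to build the Nth rule.
sub expand_parallel {
    my ($rule) = @_;

    my @parts = split /\s*-->\s*/, $rule;
    s/^\s+//, s/\s+$// for @parts;

    my @lists   = map { alternatives_in($_) } @parts;
    my ($first) = grep { defined } @lists;
    return join(' --> ', @parts) if !$first;    # nothing to expand

    my @rules;
    for my $i (0 .. $#$first) {
        my @expanded;
        for my $p (0 .. $#parts) {
            my $text = $parts[$p];
            $text =~ s/\([^()]*\)/$lists[$p][$i]/ if $lists[$p];
            push @expanded, $text;
        }
        push @rules, join ' --> ', @expanded;
    }
    return @rules;
}

print "$_\n" for expand_parallel("(can't,won't) never --> (can't,won't) ever --> (can,will) never");
```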

Shortcuts for parallel specifications

Transformations involving pronouns can be tedious to write (and read), both because of the large number of alternatives often required on each side and because of the frequent repetition of the same sets of pronouns:

about (she,he)  --> about (her,him)

ring (my,your,her,his,its,our,their) neck --> wring (my,your,her,his,its,our,their) neck

(she,he,it) have --> (she,he,it) has
(she,he,it) do   --> (she,he,it) does
(she,he,it) are  --> (she,he,it) is

So a variety of shortcuts are provided. You can specify complete sets of pronouns and possessive adjectives more succinctly with:

Shortcut     Is expanded to
========     ===================================

<I>          (I,you,she,he,it,we,they)
<me>         (me,you,her,him,it,us,them)
<my>         (my,your,her,his,its,our,their)
<mine>       (mine,yours,hers,his,its,ours,theirs)

For only the gendered 3rd person pronouns and possessive adjectives:

<she>        (she,he)
<he>         (he,she)
<her>        (her,him)
<him>        (him,her)
<his>        (his,her)
<hers>       (hers,his)

For only the plural pronouns and possessive adjectives:

<we>         (we,you,they)
<us>         (us,you,them)
<our>        (our,your,their)
<ours>       (ours,yours,theirs)

Note that the abbreviation in angles is always the first alternative of the corresponding expansion.

This means that the following are exactly the same as the earlier parenthesized examples:

   about <she>  -->  about <her>

ring <my> neck  -->  wring <my> neck

(<he>,it) have  -->  (<he>,it) has

As the last example implies, if one of these shortcuts is placed inside a set of parentheses, it expands to just the list of pronouns. Everywhere else, each abbreviation expands to the appropriate list of pronouns surrounded by parentheses.
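
The difference between the two contexts can be sketched like this. The code is a hypothetical illustration only: expand_shortcuts() is not part of the module, and %PRONOUNS deliberately holds just a subset of the shortcuts tabulated above:

```perl
use strict;
use warnings;

# A subset of the shortcut table above (illustrative only)
my %PRONOUNS = (
    I   => 'I,you,she,he,it,we,they',
    me  => 'me,you,her,him,it,us,them',
    she => 'she,he',
    he  => 'he,she',
    her => 'her,him',
    him => 'him,her',
);

# Inside an existing list: expand to the bare pronouns
sub _expand_inside {
    my ($group) = @_;
    $group =~ s/<(\w+)>/ exists $PRONOUNS{$1} ? $PRONOUNS{$1} : "<$1>" /ge;
    return $group;
}

# Anywhere else: expand to a parenthesized list
sub _expand_outside {
    my ($name) = @_;
    return exists $PRONOUNS{$name} ? "($PRONOUNS{$name})" : "<$name>";
}

# Hypothetical expansion of all pronoun shortcuts in one rule.
# Unknown markers (such as <verb>) are left untouched.
sub expand_shortcuts {
    my ($rule) = @_;
    $rule =~ s{ ( \( [^()]* \) ) | < (\w+) > }
              { defined $1 ? _expand_inside($1) : _expand_outside($2) }gex;
    return $rule;
}

print expand_shortcuts('about <she>  -->  about <her>'), "\n";
print expand_shortcuts('(<he>,it) have  -->  (<he>,it) has'), "\n";
```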

Verb conjugation errors

A line beginning with the marker <verb> specifies the inflection of a verb as follows:

<verb>   [present]  [3rd person]  [past simple]  [past participle]

For example:

<verb>      see        sees           saw             seen

Each line in this format is used to generate a large number of standard error rules involving the specified verb. For example, the previous specification for "see" produces the following rules:

              (<she>,it) see    -->               (<she>,it) sees
         (I,you,we,they) sees   -->          (I,you,we,they) see
                     <I> seen   -->   <I> saw  -->  <I> have seen

(be,being,been,was,were) see    --> (be,being,been,was,were) seen
(be,being,been,was,were) saw    --> (be,being,been,was,were) seen
   (has,had,have,having) see    -->    (has,had,have,having) seen
   (has,had,have,having) saw    -->    (has,had,have,having) seen
                   being seeing -->                    being seen

           to (sees, seen, saw) -->                        to see

                 try and see    -->                    try to see
          tried (and,to) seen   -->                    try to see
          tried (and,to) saw    -->                    try to see
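
The way a <verb> line fans out into rules can be sketched for a small subset of the rules listed above. The function verb_rules() is hypothetical, covers only four of the generated rules, and makes no claim about how the module builds them internally:

```perl
use strict;
use warnings;

# Hypothetical generator for a few of the rules derived from a <verb>
# line. Only a subset of the rules shown above is produced here.
sub verb_rules {
    my ($line) = @_;

    my ($present, $third, $past, $participle)
        = $line =~ /^<verb>\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$/
            or return;

    return (
        "(<she>,it) $present --> (<she>,it) $third",
        "(I,you,we,they) $third --> (I,you,we,they) $present",
        "<I> $participle --> <I> $past --> <I> have $participle",
        "try and $present --> try to $present",
    );
}

print "$_\n" for verb_rules('<verb>   go   goes   went   gone');
```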

Errors with absolutes

If a line begins with the marker "<absolute>", the format is either

<absolute>             [adjective]

or:

<absolute: [modifier]> [adjective]

A line in the first format, such as:

<absolute>  unique

produces the following set of standard rules:

                    (more,most) unique  --> unique
           (somewhat,extremely) unique  --> unique
     (quite,rather,very,highly) unique  --> unique
(totally,completely,absolutely) unique  --> unique

A line in the second format, such as:

<absolute: often>  fatal

produces the following rules:

    (somewhat,highly,extremely) fatal  --> fatal
(totally,completely,absolutely) fatal  --> fatal

 (more,most) fatal  -->  fatal  -->  (more,most) often fatal
(quite,very) fatal  -->  fatal  --> (quite,very) often fatal
      rather fatal  -->  fatal  -->       rather often fatal
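
Both formats can be sketched with one small generator that reproduces the rule sets listed above. The function absolute_rules() is a hypothetical illustration, with the intensifier groups copied from those listings:

```perl
use strict;
use warnings;

# Hypothetical generator for the rules derived from an <absolute> line.
# The intensifier groups are copied from the listings above.
sub absolute_rules {
    my ($line) = @_;

    my ($modifier, $adj)
        = $line =~ /^<absolute(?::\s*(\S+))?>\s+(\S+)\s*$/
            or return;

    # First format: every intensifier is simply dropped
    if (!defined $modifier) {
        return map { "$_ $adj  --> $adj" }
            '(more,most)', '(somewhat,extremely)',
            '(quite,rather,very,highly)', '(totally,completely,absolutely)';
    }

    # Second format: some intensifiers may instead apply to the modifier
    return (
        (map { "$_ $adj  --> $adj" }
            '(somewhat,highly,extremely)', '(totally,completely,absolutely)'),
        (map { "$_ $adj  --> $adj  --> $_ $modifier $adj" }
            '(more,most)', '(quite,very)', 'rather'),
    );
}

print "$_\n" for absolute_rules('<absolute: often>  fatal');
```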

The grammarian_cautions file

This file specifies the rules for detecting grammatical cautions using a single format.

The file should consist of one or more blank-delimited paragraphs. Each paragraph should contain one or more lines of the form:

<word or phrase>  :  <description of word or phrase>

Each paragraph represents two or more words or phrases that are frequently confused or misused. For example:

adverse   :  hostile or difficult
averse    :  disinclined

beg the question    :  to use a circular argument
raise the question  :  to call for an answer

council   :  a group that governs, deliberates, or advises
counsel   :  an individual who advises
consul    :  an individual who represents a foreign government
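
The paragraph structure maps naturally onto a list of hashes, one per paragraph, much like the structure returned by explanation_hash(). The following is a minimal sketch of such a parser; parse_cautions() is hypothetical and ignores the shortcuts and prefixes described in the following sections:

```perl
use strict;
use warnings;

# Hypothetical parser: turn the text of a cautions file into a list of
# hash references, one per paragraph, mapping each term to its definition.
sub parse_cautions {
    my ($config) = @_;
    my @paragraphs;

    # Paragraphs are separated by one or more blank lines
    for my $chunk (split /\n(?:[ \t]*\n)+/, $config) {
        my %definition_of;
        for my $line (split /\n/, $chunk) {
            next if $line =~ /^#/ || $line !~ /\S/;
            my ($term, $definition)
                = $line =~ /^\s*(.*?)\s*:\s*(.*?)\s*$/
                    or next;
            $definition_of{$term} = $definition;
        }
        push @paragraphs, \%definition_of if %definition_of;
    }
    return @paragraphs;
}

my @paragraphs = parse_cautions(<<'END');
adverse   :  hostile or difficult
averse    :  disinclined

council   :  a group that governs, deliberates, or advises
counsel   :  an individual who advises
END

for my $paragraph (@paragraphs) {
    print join(' vs ', sort keys %$paragraph), "\n";
}
```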

"Invisible" cautions

Any of these specifications may also be prefixed with a -, to indicate that the particular word or phrase is never to be searched for; that it appears only to provide contrast to--and an alternative suggestion for--other words or phrases in the same paragraph.

For example, you may wish to be warned about "wont" (which is very possibly a typo), but not about "won't" (which is most likely correct). However, when being warned about "wont" you'd still like to be offered "won't" as an alternative. That's achieved with:

  wont   :  a habitual custom
- won't  :  will not
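
One way to picture the effect of the prefix is to record, for each entry, whether it should be searched for. The sketch below is hypothetical (parse_caution_line() and the 'searchable' key are illustrative names, not the module's own):

```perl
use strict;
use warnings;

# Hypothetical parser for one caution line, noting whether the term
# carries the leading "-" that excludes it from searching.
sub parse_caution_line {
    my ($line) = @_;

    my ($hidden, $term, $definition)
        = $line =~ /^\s*(-?)\s*(.*?)\s*:\s*(.*?)\s*$/
            or return;

    return {
        term       => $term,
        definition => $definition,
        searchable => ($hidden eq '' ? 1 : 0),
    };
}

my @entries = map { parse_caution_line($_) }
    "  wont   :  a habitual custom",
    "- won't  :  will not";

# Terms that would be searched for, and terms offered as alternatives
my @searched = map { $_->{term} } grep { $_->{searchable} } @entries;
my @offered  = map { $_->{term} } @entries;

print "searched: @searched\n";
print "offered:  @offered\n";
```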

Parallel caution specifications

As with errors in grammarian_errors, you can specify multiple cautions in a single line, by using a parenthesized list of alternatives. For example:

straight(en,ened)  :  in line
strait(en,ened)    :  tight or narrow or difficult

Note that, if such a list of alternatives is part of a larger word, it is expanded into each of the alternatives, plus the bare root word. So the previous example is equivalent to:

straight      :  in line
straighten    :  in line
straightened  :  in line
strait        :  tight or narrow or difficult
straiten      :  tight or narrow or difficult
straitened    :  tight or narrow or difficult
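
The "alternatives plus bare root" behaviour can be sketched as follows. The function expand_caution_term() is a hypothetical illustration; it treats a list as "part of a larger word" when a non-space character directly adjoins the parentheses:

```perl
use strict;
use warnings;

# Hypothetical expansion of one caution term containing a list.
# If the list adjoins other letters, the bare root word is included too.
sub expand_caution_term {
    my ($term) = @_;

    my ($before, $list, $after)
        = $term =~ /^(.*?)\(([^()]*)\)(.*)$/
            or return ($term);

    my @alternatives = split /\s*,\s*/, $list;

    # An empty alternative yields the root word on its own
    unshift @alternatives, ''
        if $before =~ /\S$/ || $after =~ /^\S/;

    return map { "$before$_$after" } @alternatives;
}

print join(', ', expand_caution_term('straight(en,ened)')), "\n";
```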

Shortcuts for parallel caution specifications

The most common use of parallel specifications in grammarian_cautions is to list all likely inflections of a verb. For example:

flaunt(s,ed,ing)  :  to show off
flout(s,ed,ing)   :  to ignore or show contempt for

So there is a shortcut for this:

flaunt*  :  to show off
flout*   :  to ignore or show contempt for

This shortcut is smarter than a mere substitution, as it has a partial understanding of the rules of English inflection:

Ending                 Plural   Past    Continuous
===============        ======   ====    ==========

-e*                     -es     -ed        -ing

-<consonant>y*          -ies    -ied       -ying

-ch*                    -ches   -ched      -ching

-<anything else>*       -s      -ed        -ing

This means that rules like:

indite* :  to write down
indict* :  to charge with a crime

behave correctly (i.e. you get "inditing", not "inditeing").

A second shortcut (**) is available to handle the case where a terminal consonant must be doubled when forming participles.

For example, a single * would create errors here:

rebut*   :  to argue against a proposition

because it would expand to:

rebut    :  to argue against a proposition
rebuts   :  to argue against a proposition
rebuted  :  to argue against a proposition
rebuting :  to argue against a proposition

In contrast:

rebut**   :  to argue against a proposition

would correctly expand to:

rebut     :  to argue against a proposition
rebuts    :  to argue against a proposition
rebutted  :  to argue against a proposition
rebutting :  to argue against a proposition
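
The inflection table and the consonant-doubling variant can be approximated in a few lines. The function inflections() below is a hypothetical sketch based only on the table above, not the module's actual inflection logic:

```perl
use strict;
use warnings;

# Hypothetical expansion of a term ending in * or ** into the base
# word and the three inflected forms from the table above.
sub inflections {
    my ($spec) = @_;

    my ($word, $stars) = $spec =~ /^(\w+)(\*\*?)$/
        or return ($spec);

    # Double star: double the final consonant before -ed and -ing
    if ($stars eq '**') {
        my $doubled = $word . substr($word, -1);
        return ($word, "${word}s", "${doubled}ed", "${doubled}ing");
    }

    # Final -e: add -s and -d, and drop the e before -ing
    if ($word =~ /e$/) {
        (my $stem = $word) =~ s/e$//;
        return ($word, "${word}s", "${word}d", "${stem}ing");
    }

    # Consonant followed by -y: the y becomes -ies or -ied
    if ($word =~ /[^aeiou]y$/) {
        (my $stem = $word) =~ s/y$//;
        return ($word, "${stem}ies", "${stem}ied", "${word}ing");
    }

    # Final -ch takes -es
    return ($word, "${word}es", "${word}ed", "${word}ing")
        if $word =~ /ch$/;

    # Anything else
    return ($word, "${word}s", "${word}ed", "${word}ing");
}

print join(', ', inflections($_)), "\n" for 'indite*', 'indict*', 'rebut**';
```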

For less regular words or phrases, you can either list all the alternatives in a single set of parentheses:

(partake,partakes,partaken,partaking,partook) :  to consume
participate*                                  :  to take part in

or list the irregular forms within the same paragraph, but on separate lines and without descriptions:

partake(s,n)         :  to consume
partaking
partook
participate*         :  to take part in

DIAGNOSTICS

Invalid entry in grammarian_cautions: %s

The module found a non-blank line in one of your grammarian_cautions files in which there was no term defined. For example, the third line of this entry would generate this diagnostic, because it lacks a term before the colon:

# Out-of-control vehicles do both...
career  :  to move quickly and out of control, in a specific direction
        :  a long-term occupation

Lingua::EN::Grammarian does not provide %s

You loaded the module and passed a string naming a particular subroutine to be exported, but the module does not export that subroutine. Did you perhaps misspell the subroutine name?

DEPENDENCIES

This module requires Perl 5.10 or later.

It also requires the "Method::Signatures" and "Hash::Util::FieldHash" modules.

INCOMPATIBILITIES

None reported.

BUGS AND LIMITATIONS

The module will not identify overlapping errors or cautions. For example:

"...and then he he go home..."

Only the first error (e.g. the doubled word: "he he") will be reported; any overlapping errors (e.g. the incorrect conjugation: "he go") will be ignored.
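
The effect is similar to a "first match wins" filter over candidate spans. The sketch below is only an illustration of that behaviour, not the module's mechanism; drop_overlaps() is a hypothetical name, and the 'from' and 'to' values are assumed to be zero-based indices with 'to' marking the position just past the match:

```perl
use strict;
use warnings;

# Hypothetical filter: keep each span only if it starts at or after the
# end of the previously kept span, so that earlier matches win.
sub drop_overlaps {
    my @spans = sort { $a->{from} <=> $b->{from} } @_;

    my @kept;
    my $next_free = 0;
    for my $span (@spans) {
        next if $span->{from} < $next_free;    # overlaps the last kept span
        push @kept, $span;
        $next_free = $span->{to};
    }
    return @kept;
}

# Candidate spans within "and then he he go home"
my @reported = drop_overlaps(
    { match => 'he he', from => 9,  to => 14 },
    { match => 'he go', from => 12, to => 17 },
);

print "$_->{match}\n" for @reported;
```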

No bugs have been reported.

Please report any bugs or feature requests to bug-lingua-en-grammarian@rt.cpan.org, or through the web interface at http://rt.cpan.org.

AUTHOR

Damian Conway <DCONWAY@CPAN.org>

LICENCE AND COPYRIGHT

Copyright (c) 2013, Damian Conway <DCONWAY@CPAN.org>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.