NAME
Lingua::EN::CommonMistakes - map of common English spelling errors
SYNOPSIS
    use Lingua::EN::CommonMistakes qw(%MISTAKES);
    foreach my $word (split /\b/, $text) {
        if (my $correction = $MISTAKES{lc $word}) {
            warn "Likely spelling error: $word (-> $correction)\n";
        }
    }
    # or use a different flavor of English
    use Lingua::EN::CommonMistakes qw(:no-punct :british %MISTAKES);
    ...
DESCRIPTION
This module provides a customizable map of common English spelling errors and their respective corrections.
USAGE
The behavior of this package is customized at import time.
By default, importing this package creates a hash named %MISTAKES in the calling package. It contains the common corrections, including those which introduce punctuation, but neither the American English nor the British English corrections.
This behavior may be customized by providing the following parameters when importing:
%NAME [default: %MISTAKES]
    The map will be imported with the given name.
:common, :no-common [default: :common]
    If enabled, include the base set of corrections common among all English variants. This is the largest set of corrections.
:american, :no-american [default: :no-american]
    If enabled, American English is desirable; include corrections from British English to American English. For example, "colour" should be replaced with "color".
:british, :no-british [default: :no-british]
    If enabled, British English is desirable; include corrections from American English to British English. For example, "recognized" should be replaced with "recognised".
:punct, :no-punct [default: :punct]
    If enabled, include corrections which introduce punctuation characters; for example, "dont" should be replaced with "don't". :no-punct is often useful when scanning input text where punctuation characters have special meaning, such as in most programming languages.
:no-defaults
    If set, the corrections map only includes sets which have been explicitly enabled.
It's possible to use the package several times if multiple mappings are needed, as in the following example:
    # one map for common mistakes, another for british->american only
    use Lingua::EN::CommonMistakes qw(%MISTAKES_COMMON);
    use Lingua::EN::CommonMistakes qw(:no-defaults :american %MISTAKES_GB_TO_US);
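Once two maps exist, a checker can consult them in turn and report which map produced each hit. The following is only a sketch of one way to do that. So that it runs without the module installed, the two hashes are small stand-ins declared by hand: the "recieve" entry is hypothetical, and check_line is a helper name invented for this illustration. In real use the hashes would come from the two import lines above.

```perl
use strict;
use warnings;

# Stand-in data (assumption): in real use these hashes are created by
# importing Lingua::EN::CommonMistakes twice, as shown above.
my %MISTAKES_COMMON   = ( recieve => 'receive' );   # hypothetical entry
my %MISTAKES_GB_TO_US = ( colour  => 'color' );     # example pair from this document

# Hypothetical helper: return [word, correction, label] for each word in
# $line that appears in one of the maps. The common map is consulted first.
sub check_line {
    my ($line) = @_;
    my @maps = (
        [ 'common'   => \%MISTAKES_COMMON ],
        [ 'gb-to-us' => \%MISTAKES_GB_TO_US ],
    );
    my @found;
    for my $word ($line =~ /([A-Za-z']+)/g) {
        my $key = lc $word;
        for my $entry (@maps) {
            my ($label, $table) = @$entry;
            next unless exists $table->{$key};
            push @found, [ $word, $table->{$key}, $label ];
            last;    # report each word once, from the first map that knows it
        }
    }
    return @found;
}

for my $hit (check_line("Please recieve the Colour chart")) {
    my ($word, $fix, $label) = @$hit;
    print "$word -> $fix ($label)\n";
}
```

Keeping the maps separate, rather than importing one combined map, lets a caller treat the two kinds of hit differently, for instance reporting regional spellings at a lower severity than outright mistakes.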
WHY?
One might justifiably wonder why it would make sense to use a list of mistakes rather than a full dictionary when spell checking.
Spell checking typically uses a whitelist approach: all words are considered incorrect unless they can be found in the whitelist (dictionary). This module instead facilitates a blacklist approach: words are considered correct unless they can be found in the blacklist (map of mistakes).
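The difference between the two approaches can be sketched in a few lines. Everything in this example is a stand-in chosen for illustration: the tiny dictionary, the two-entry map of mistakes (neither entry is taken from the module's word list), and the helper names whitelist_flags and blacklist_flags.

```perl
use strict;
use warnings;

# Stand-in data (assumption): a toy dictionary and a toy map of mistakes.
my %DICTIONARY = map { ($_ => 1) } qw(the function returns an of values);
my %MISTAKES   = ( teh => 'the', recieve => 'receive' );

# Whitelist approach: flag every word that is NOT in the dictionary.
sub whitelist_flags {
    my (@words) = @_;
    return grep { !exists $DICTIONARY{lc $_} } @words;
}

# Blacklist approach: flag only words that ARE in the map of known mistakes.
sub blacklist_flags {
    my (@words) = @_;
    return grep { exists $MISTAKES{lc $_} } @words;
}

my @words = qw(teh function returns an int of values);
print "whitelist flags: @{[ whitelist_flags(@words) ]}\n";
print "blacklist flags: @{[ blacklist_flags(@words) ]}\n";
```

With this input the whitelist check flags both "teh" and "int", while the blacklist check flags only "teh"; the abbreviation "int" is left alone because nothing lists it as a mistake.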
A blacklist approach to spell-checking is often more suitable than a whitelist approach when scanning text which is partly but not entirely English.
Computer programs are a prime example of semi-English documents: comments and identifiers may be written in English, subject to additional restrictions (such as no punctuation characters permitted in identifiers), and they often contain words which are intentionally not spelled correctly (abbreviations or corruptions of valid English words, e.g. "int" for "integer").
Other examples include mixed-language documents and documents which are ostensibly English but contain a lot of domain-specific jargon unlikely to be found in an English dictionary.
Despite the fact that such bodies of text are only partly English, any occurrences of words in the blacklist are likely to be genuine errors.
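As an illustration of scanning program text, the sketch below breaks identifiers into words and looks each word up in a map of mistakes. It is a rough sketch under stated assumptions: the map is a hand-written stand-in whose entries are hypothetical (a map imported with :no-punct would play this role in practice), and identifier_words and check_identifier are helper names invented here. The splitting rule handles only underscores, digits, and lower-to-upper case transitions.

```perl
use strict;
use warnings;

# Stand-in map (assumption); both entries are hypothetical and are not
# taken from the module's word list.
my %MISTAKES = ( recieve => 'receive', lenght => 'length' );

# Split an identifier such as "recieveBuffer_lenght" into lowercase words,
# breaking on non-letters and on lower-to-upper case transitions.
sub identifier_words {
    my ($identifier) = @_;
    (my $spaced = $identifier) =~ s/([a-z])([A-Z])/$1 $2/g;
    return map { lc $_ } grep { length $_ } split /[^A-Za-z]+/, $spaced;
}

# Return one report string per word of the identifier found in the map.
sub check_identifier {
    my ($identifier) = @_;
    my @reports;
    for my $word (identifier_words($identifier)) {
        next unless exists $MISTAKES{$word};
        push @reports, "$identifier: $word -> $MISTAKES{$word}";
    }
    return @reports;
}

for my $identifier (qw(recieveBuffer_lenght int strncpy)) {
    print "$_\n" for check_identifier($identifier);
}
```

Identifiers such as "int" and "strncpy" produce no reports, since neither is a known mistake, whereas a dictionary-based check would need an exception list to stay quiet about them.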
A blacklist approach also makes sense when it is more important to have a low rate of false positives than it is to find every error (for example, in an automated system which risks being ignored if it generates too many reports of dubious value).
AUTHOR
Rohan McGovern, rohan@mcgovern.id.au
BUGS
Please view and report any bugs here: http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-EN-CommonMistakes
ACKNOWLEDGEMENTS
Most of the word list has been sourced from other projects, including:
krazy code checker tool, written for KDE: http://gitorious.org/krazy/krazy/blobs/master/plugins/general/spelling
lintian package checker tool, written for Debian: http://anonscm.debian.org/gitweb/?p=lintian/lintian.git;a=blob;f=data/spelling/corrections
LICENSE AND COPYRIGHT
Copyright 2012 Rohan McGovern.
Incorporated word lists may be Copyright their respective authors.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 dated June, 1991 or at your option any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A copy of the GNU General Public License is available in the source tree; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.