NAME
Lingua::EN::StopWordList - A sorted list of English stop words
Synopsis
use Lingua::EN::StopWordList;
my($ara_ref) = Lingua::EN::StopWordList -> new -> words;
Here's a complete program:
use strict;
use warnings;
use Lingua::EN::StopWordList;
my($count) = 0;
print map{"@{[++$count]}: $_\n"} @{Lingua::EN::StopWordList -> new -> words};
Description
Lingua::EN::StopWordList is a pure Perl module.
It returns a sorted arrayref of 659 English stop words.
Constructor and initialization
new(...) returns an object of type Lingua::EN::StopWordList.
This is the class's constructor.
Usage: Lingua::EN::StopWordList -> new.
Distributions
This module is available as a Unix-style distro (*.tgz).
Install Lingua::EN::StopWordList as you would for any Perl module:
Run:
cpanm Lingua::EN::StopWordList
or run:
sudo cpan Lingua::EN::StopWordList
or unpack the distro, and then run one of:
perl Build.PL
./Build
./Build test
./Build install
or
perl Makefile.PL
make (or dmake)
make test
make install
See http://savage.net.au/Perl-modules.html for details.
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.
Methods
new()
See "Constructor and initialization".
words()
Returns the sorted arrayref of English stop words.
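Since words() returns an arrayref, a common next step is to load it into a hash so that each membership test is a single lookup. The following is only a sketch: the helper names make_stop_lookup() and remove_stop_words() are hypothetical and not part of this module, and both the lower-casing and the tokenizing regexp are assumptions about how you might want to match words.

```perl
use strict;
use warnings;
use Lingua::EN::StopWordList;

# Hypothetical helper: turn any arrayref of words into a hashref
# for fast membership tests. Lower-casing the keys is an assumption
# made here, not something the module documents.

sub make_stop_lookup
{
	my($words)  = @_;
	my(%lookup) = map{(lc($_) => 1)} @$words;

	return \%lookup;
}

# Hypothetical helper: split text on anything that is not a word
# character or an apostrophe, then drop tokens found in the lookup.

sub remove_stop_words
{
	my($text, $lookup) = @_;

	return grep{! $$lookup{lc $_} } grep{length} split(/[^\w']+/, $text);
}

my($lookup) = make_stop_lookup(Lingua::EN::StopWordList -> new -> words);

print join(' ', remove_stop_words('The quick brown fox jumps over the lazy dog', $lookup) ), "\n";
```

Which words survive depends on the contents of the list, so no expected output is shown. Because make_stop_lookup() accepts any arrayref, the same code can be used with a list from one of the other modules discussed in the FAQ.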
FAQ
Is there a definitive list of stop words?
No, there is no such thing as a definitive list. For a useful discussion, which also covers 'phrase search', see the Wikipedia discussion of word lists.
Where does the list come from?
I downloaded it from the bottom of this page: http://www.translatum.gr/forum/index.php?topic=2476.0. It contains 659 words.
Are there other lists available?
Sure. Try http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop. This list contains 570 words.
Another good place to look is http://www.ranks.nl/resources/stopwords.html, but its English list only contains 174 words. Since Lingua::StopWords (below) also has 174 words in its English list, perhaps this is where that module got its words from. Lastly, that site has stop word lists for a whole range of languages.
Alternatively, just Google for references to various lists. Note, however, that these lists are normally very short.
Why another Perl module for stop words?
Lingua::StopWords only has a short list of words (174). And its bug list goes back 3 years.
Lingua::EN::StopWords only has a short list of words (227). Also, this module is part of Lingua::EN::Segmenter, whose documentation is poor. Even the exact basis of how it splits text is not documented. Lastly, its bug list goes back 6 years.
I could have offered to take over maintenance of either or both of those modules, but there are problems:
o Lingua::StopWords
It ships with a set of sub-modules, with names like Lingua::StopWords::EN, but I'm not in a position to support its other languages if I put my module's English list into it.
Nevertheless, the fact that Lingua::StopWords supports 13 languages is definitely something in its favour.
o Lingua::EN::StopWords
This is part of a text processing distribution which I don't want to get involved with. Also, it has a long list of pre-reqs (not listed on MetaCPAN until you view the makefile), which may well suit the purposes of Lingua::EN::Segmenter, but is overkill for just a stop word list.
Several other Perl modules, written for various purposes, either use one of the above, or have their own very short (as always) lists.
How can I help?
If you translate the list of stop words in this module into your favourite language and email it to me, I will include your words in the next release.
It all depends on whether you think this new list is somehow 'better' than the lists in pre-existing modules. I cannot make that decision on your behalf.
See Also
Benchmark::Featureset::StopwordLists.
This module includes a comparison of various stopword list modules.
See http://savage.net.au/Perl-modules/html/stopwordlists.report.html.
Support
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=Lingua::EN::StopWordList.
Repository
https://github.com/ronsavage/Lingua-EN-StopWordList.git.
Author
Lingua::EN::StopWordList was written by Ron Savage <ron@savage.net.au> in 2012.
Homepage: http://savage.net.au/index.html.
Copyright
Australian copyright (c) 2012 Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License, a copy of which is available at:
http://www.opensource.org/licenses/index.html