NAME
WordList::ID::Common::Wikipedia1000 - Top 1000 words from Wikipedia Indonesia pages
VERSION
This document describes version 0.006 of WordList::ID::Common::Wikipedia1000 (from Perl distribution WordLists-ID-Common), released on 2020-10-11.
SYNOPSIS
use WordList::ID::Common::Wikipedia1000;
my $wl = WordList::ID::Common::Wikipedia1000->new;
# Pick a (or several) random word(s) from the list
my $word = $wl->pick;
my @words = $wl->pick(3);
# Check if a word exists in the list
if ($wl->word_exists('foo')) { ... }
# Call a callback for each word
$wl->each_word(sub { my $word = shift; ... });
# Iterate
my $first_word = $wl->first_word;
while (defined(my $word = $wl->next_word)) { ... }
# Get all the words
my @all_words = $wl->all_words;
DESCRIPTION
This module contains the 1000 most frequently used Indonesian words in Indonesian Wikipedia pages.
Here is how the list was produced. First, the Indonesian Wikipedia XML.bz2 dump [1] was downloaded (last downloaded: Dec 30, 2017). Then a couple of ad-hoc, rather simplistic Perl scripts were used to process this large file: one script to split the file into individual pages, and another to strip the Wikimedia markup. All-lowercase words were then extracted from these files and merged into a single file. Finally, the list was curated (false positives and misspellings removed) to produce the final top {1000,2500,5000} word lists.
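The extraction and ranking step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual scripts from the repository: the helper names count_lowercase_words and top_words are hypothetical, the input is assumed to be text with markup already stripped, and a "word" is assumed to be a run of the ASCII letters a-z.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the distribution): count all-lowercase
# words in text whose markup has already been stripped.
sub count_lowercase_words {
    my ($text, $counts) = @_;
    $counts //= {};
    # \b on both sides rejects capitalized or mixed tokens such as
    # "Lagu" or "abc123", since no boundary exists inside them.
    $counts->{$1}++ while $text =~ /\b([a-z]+)\b/g;
    return $counts;
}

# Hypothetical helper: return the $n most frequent words, breaking ties
# by plain string comparison so the result is deterministic.
sub top_words {
    my ($counts, $n) = @_;
    my @ranked = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
                 keys %$counts;
    $n = @ranked if $n > @ranked;
    return @ranked[0 .. $n - 1];
}

my $counts = count_lowercase_words(
    "Lagu ini adalah lagu yang populer dan lagu yang baru");
my @top = top_words($counts, 2);
print join(",", @top), "\n";    # prints "lagu,yang"
```

The manual curation step (removing false positives and misspellings) is not covered by this sketch.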
Note that Wikipedia article pages are not representative of general Indonesian text: some words are overrepresented, e.g. "lagu" (in articles about particular songs) or "filum".
Some words are derived forms (not root words), e.g. "makanannya" or "berdasarkan".
The order of the words in this wordlist is asciibetical, as required by the WordList convention. If you want to know the ranks of words by frequency, as well as the scripts used to generate the result, see the devscripts/ and work/ directories in the Git repository.
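As a small illustration of what asciibetical order means in practice: Perl's cmp operator, without "use locale" in effect, compares strings by character code, so uppercase ASCII letters sort before lowercase ones. The sample words below are made up for the example and are not taken from the list.

```perl
use strict;
use warnings;

# Made-up sample words; "Zebra" is included only to show that uppercase
# letters sort before lowercase ones in ASCII order.
my @words  = ("yang", "Zebra", "dan", "ada");
my @sorted = sort { $a cmp $b } @words;

print join(" ", @sorted), "\n";    # prints "Zebra ada dan yang"
```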
[1] https://id.wikipedia.org/wiki/Wikipedia:Wikipedia_bahasa_Indonesia_versi_luring
WORDLIST STATISTICS
+----------------------------------+-------+
| key | value |
+----------------------------------+-------+
| avg_word_len | 6.585 |
| longest_word_len | 15 |
| num_words | 1000 |
| num_words_contain_nonword_chars | 0 |
| num_words_contain_unicode | 0 |
| num_words_contain_whitespace | 0 |
| num_words_contains_nonword_chars | 0 |
| num_words_contains_unicode | 0 |
| num_words_contains_whitespace | 0 |
| shortest_word_len | 2 |
+----------------------------------+-------+
The statistics are available in the %STATS package variable.
HOMEPAGE
Please visit the project's homepage at https://metacpan.org/release/WordLists-ID-Common.
SOURCE
Source repository is at https://github.com/perlancar/perl-WordLists-ID-Common.
BUGS
Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=WordLists-ID-Common
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
perlancar <perlancar@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2020, 2018, 2017 by perlancar@cpan.org.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.