#!/bin/sh -xe
# README.linux.words - file used to create linux.words
# Created: Wed Mar 10 09:12:49 1993 by faith@cs.unc.edu (Rik Faith)
# Revised: Sat Mar 13 17:02:08 1993 by faith@cs.unc.edu
#
# Care was taken to be sure that the linux.words list was free of
# copyright. This makes linux.words a suitable /usr/dict/words
# replacement for the Linux community.
#
# Since the majority of the words are from Tanenbaum's minix.dict file,
# the notice from Barry Brachman, included below, should accompany any
# redistribution of this list.
#
# Here is a detailed explanation of how I created the linux.words file.
#
# This README.linux.words file is actually a shell script that you can use to
# recreate the linux.words file from original sources.
#
# First, I started with minix.dict
# from cs.ubc.ca:/pub/local/src/sp-1.5/wordlists-1.0.tar.Z
#
# The following is from the NOTES file in wordlists-1.0.tar.Z:
# NOTES> These word lists were collected by Barry Brachman
# NOTES> <brachman@cs.ubc.ca> at the University of British Columbia. They
# NOTES> may be freely distributed as long as this notice accompanies them.
# NOTES>
# NOTES> ==================================================================
# NOTES> Info for minix.dict:
# NOTES>
# NOTES> Article 1997 of comp.os.minix:
# NOTES> From: ast@botter.UUCP
# NOTES> Subject: A spelling checker for MINIX
# NOTES> Date: 6 Jan 88 22:28:22 GMT
# NOTES> Reply-To: ast@cs.vu.nl (Andy Tanenbaum)
# NOTES> Organization: VU Informatica, Amsterdam
# NOTES>
# NOTES> This dictionary is NOT based on the UNIX dictionary so it is free
# NOTES> of AT&T copyright. I built the dictionary from three sources.
# NOTES> First, I started by sorting and uniq'ing some public domain
# NOTES> dictionaries. Second, as some of you probably know, I have
# NOTES> written somewhere between 3 and 6 books (depending on precisely
# NOTES> what you count) and an additional 50 published papers on operating
# NOTES> systems, networks, compilers, languages, etc. This data base,
# NOTES> which is online, is nonnegligible :-) Finally, I added a number of
# NOTES> words that I thought ought to be in the dictionary including all
# NOTES> the U.S. states, all the European and some other major countries,
# NOTES> principal U.S. and world cities, and a bunch of technical terms.
# NOTES> I don't want my spelling checker to barf on arpanet, diskless,
# NOTES> modem, login, internetwork, subdirectory, superuser, vlsi, or
# NOTES> winchester just because Webster wouldn't approve of them. All in
# NOTES> all, the dictionary is over 40,000 words. If you have any
# NOTES> suggestions for additions or deletions, please post them. But
# NOTES> please be sure you are not infringing on anyone's copyright in
# NOTES> doing so.
# NOTES>
# NOTES> Andy Tanenbaum (ast@cs.vu.nl)
#
# The main problem with minix.dict is that many proper names are not
# capitalized. So, I got english.tar.Z from ftp.uu.net:/doc/dictionaries,
# which is a mirror of nic.funet.fi:/pub/unix/security/dictionaries.
#
# Here is part of the README file for english.tar.Z:
# README>
# README> FILE: english.words
# README> VERSION: DEC-SRC-92-04-05
# README>
# README> EDITOR
# README>
# README> Jorge Stolfi <stolfi@src.dec.com>
# README> DEC Systems Research Center
# README>
# README> AUTHORS OF ORIGINAL WORDLISTS
# README>
# README> Andy Tanenbaum <ast@cs.vu.nl>
# README> Barry Brachman <brachman@cs.ubc.ca>
# README> Geoff Kuenning <geoff@itcorp.com>
# README> Henk Smit <henk@cs.vu.nl>
# README> Walt Buehring <buehring%ti-csl@csnet-relay>
#
# [stuff deleted]
#
# README> AUXILIARY LISTS
# README>
# README> In the same directory as english.words there are a few
# README> complementary word lists, all derived from the same sources
# README> [1--8] as the main list:
# README>
# README> english.names
# README>
# README> A list of common English proper names and their derivatives.
# README> The list includes: person names ("John", "Abigail",
# README> "Barrymore"); countries, nations, and cities ("Germany",
# README> "Gypsies", "Moscow"); historical, biblical and mythological
# README> figures ("Columbus", "Isaiah", "Ulysses"); important
# README> trademarked products ("Xerox", "Teflon"); biological genera
# README> ("Aerobacter"); and some of their derivatives ("Germans",
# README> "Xeroxed", "Newtonian").
# README>
# README> misc.names
# README>
# README> A list of foreign-sounding names of persons and places
# README> ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted
# README> from the lists [1--8]. (The distinction between
# README> "English-sounding" and "foreign-sounding" is of course rather
# README> arbitrary).
# README>
# README> org.names
# README>
# README> A short list of names of corporations and other institutions
# README> ("Pepsico", "Amtrak", "Medicare"), and a few derivatives.
# README>
# README> The file also includes some initialisms --- acronyms and
# README> abbreviations that are generally pronounced as words rather
# README> than spelled out ("NASA", "UNESCO").
# README>
# README> english.abbrs
# README>
# README> A list of common abbreviations ("etc.", "Dr.", "Wed."),
# README> acronyms ("A&M", "CPU", "IEEE"), and measurement symbols
# README> ("ft", "cm", "ns", "kHz").
# README>
# README> english.trash
# README>
# README> A list of words from the original wordlists
# README> that I decided were either wrong or unsuitable for inclusion
# README> in the file english.words or any of the other auxiliary
# README> lists. It includes
# README>
# README> typos ("accupy", "aquariia", "automatontons")
# README> spelling errors ("abcissa", "alleviater", "analagous")
# README> bogus derived forms ("homeown", "unfavorablies", "catched")
# README> uncapitalized proper names ("afghanistan",
# README> "algol", "decnet")
# README> uncapitalized acronyms ("apl", "ccw", "ibm")
# README> unpunctuated abbreviations ("amp", "approx", "etc")
# README> British spellings ("advertize", "archaeology")
# README> archaic words ("bedight")
# README> rare variants ("babirousa")
# README> unassimilated foreign words ("bambino", "oui", "caballero")
# README> mis-hyphenated compounds ("babylike", "backarrows")
# README> computer keywords and slang ("lconvert", "noecho", "prog")
# README>
# README> (I apologize for excluding British spellings. I should have
# README> split the list in three sublists--- common English, British,
# README> American---as ispell does. But there are only so many hours
# README> in a day...)
# README>
# README> english.maybe
# README>
# README> A list of about 5,000 lowercase words from the "mts.dict"
# README> wordlist [6] that weren't included in english.words.
# README>
# README> This list seems to include lots of "trash", like
# README> uncapitalized proper names and weird words. It would
# README> take me several days to sort this mess, so I decided to
# README> leave it as a separate file. Use at your own risk...
#
# [stuff deleted]
#
# README> (NON-)COPYRIGHT STATUS
# README>
# README> To the best of my knowledge, all the files I used to build these
# README> wordlists were available for public distribution and use, at least
# README> for non-commercial purposes. I have confirmed this assumption with
# README> the authors of the lists, whenever they were known.
# README>
# README> Therefore, it is safe to assume that the wordlists in this
# README> package can also be freely copied, distributed, modified, and
# README> used for personal, educational, and research purposes. (Use of
# README> these files in commercial products may require written
# README> permission from DEC and/or the authors of the original lists.)
# README>
# README> Whenever you distribute any of these wordlists, please distribute
# README> also the accompanying README file. If you distribute a modified
# README> copy of one of these wordlists, please include the original README
# README> file with a note explaining your modifications. Your users will
# README> surely appreciate that.
# README>
# README> (NO-)WARRANTY DISCLAIMER
# README>
# README> These files, like the original wordlists on which they are
# README> based, are still very incomplete, uneven, and inconsistent, and
# README> probably contain many errors. They are offered "as is" without
# README> any warranty of correctness or fitness for any particular
# README> purpose. Neither I nor my employer can be held responsible for
# README> any losses or damages that may result from their use.
#
# The commands below rebuild linux.words from these sources.
#
# subtract english.trash
cat minix.dict english.trash english.trash | sort | uniq -u > dict.1
# subtract english.maybe
cat dict.1 english.maybe english.maybe | sort | uniq -u > dict.2
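# The two steps above use a set-difference idiom. Listing the unwanted
# list twice guarantees that each of its words occurs at least twice in
# the sorted stream, so "uniq -u" (print only unrepeated lines) drops
# them. Words found only in the first list occur once and survive. This
# assumes the first list contains no duplicates of its own.
#
# A minimal sketch of the idiom as a helper. The name subtract_words is
# hypothetical; it is not part of the original script and is never called
# here, so the build is unaffected.
subtract_words() {
    # $1 = base word list, $2 = list of words to remove; result on stdout
    cat "$1" "$2" "$2" | sort | uniq -u
}
# Possible use: subtract_words minix.dict english.trash > dict.1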
# build subtraction list of proper names and abbreviations
cat english.names misc.names org.names computer.names english.abbrs > sub.1
tr 'A-Z' 'a-z' < sub.1 | sort | uniq -u > sub.2
# subtract proper names with incorrect capitalization
cat dict.2 sub.2 sub.2 | sort | uniq -u > dict.3
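# How the steps above fit together: the name lists hold the capitalized
# spellings (such as "Germany"), so folding them to lowercase yields the
# miscapitalized spellings that may be present in the word list. Those
# are subtracted here, and the capitalized forms are added back later.
#
# A minimal sketch of that idea as one helper. The name
# drop_miscapitalized is hypothetical; it is not part of the original
# script and is never called here. Unlike the sub.2 step, it does not
# apply "uniq -u" to the folded names, so a name listed more than once
# is still subtracted.
drop_miscapitalized() {
    # $1 = word list, $2 = list of capitalized names; result on stdout
    # The folded names are emitted twice so that "uniq -u" removes them.
    { cat "$1"; tr 'A-Z' 'a-z' < "$2"; tr 'A-Z' 'a-z' < "$2"; } |
        sort | uniq -u
}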
# build proper name list without possessives
cat english.names misc.names org.names computer.names | grep -F -v "'s" > names.1
# add in proper names (use sort twice to get uppercase before lowercase)
cat dict.3 names.1 | sort | sort -df | uniq > linux.words
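# Why two sorts: "sort -df" compares in dictionary order with case
# folded, so "Bill" and "bill" compare equal and their relative order is
# left to how sort breaks ties. The plain "sort" that runs first
# presumably exists to put the capitalized form ahead. Whether that
# order survives the second sort depends on the sort implementation and
# on the locale.
#
# A minimal sketch of the merge step as a helper. The name merge_names is
# hypothetical; it is not part of the original script and is never called
# here. It pins LC_ALL=C, an assumption the original script does not
# make, so that plain byte order (uppercase before lowercase) applies.
merge_names() {
    # $1 = word list, $2 = proper-name list; merged list on stdout
    cat "$1" "$2" | LC_ALL=C sort | LC_ALL=C sort -df | uniq
}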
# clean up
rm dict.[123] sub.[12] names.1