Sources of documents

If you have only a limited number of documents, the easiest way to obtain larger word lists is to download documents or word lists from the Internet. Some useful sites are:

S1) http://www.comp.lancs.ac.uk/computing/research/stemming/Links/resources.htm

Home of the Lancaster Stemmer (Paice-Husk)

Grouped Word List A - 9722 words (sortedtest.txt)

Grouped Word List B - 17369 words (commonwords.txt)

These lists are grouped by concepts; the Stemming Tester will recognise the boundaries between the groups and give a true word count that enables each word to be identified in the results log file.

sortedtest.txt
The sortedtest.txt file is composed of words from the titles and abstracts of the CISI test collection. It contains many isolated words that have no other word forms in their 'groups', such as fizika, fron, gardin, garvey, gdr, gesamtkatalog, gilyarevskii, handbuch, harrassowitz, isbd, kindergartener, krupskaya, witnessed.

commonwords.txt
Despite the file name, the commonwords.txt file contains such words as axolotls, babbitt, gewgaw, worrywart, worstings, xebecs, yessing but lacks words such as wit, witness, generous, general, generic and in fact any other words starting with gen. The word list is purported to have been constructed from various Scrabble® dictionaries. [Ref: Project Report (PDF) 2005 by Rob Hooper at http://www.comp.lancs.ac.uk/computing/research/stemming/Links/program.htm].

Unix Spelling Dictionary (UnixSD.txt 25143 lines)

Three Hundred Thousand Word List (300twl.zip)

This is a large file (3.3 MB) and is too large to be read in directly by Stemming Tester 1.4. It contains many duplicates, one- and two-letter words, and words starting with digits (12th, 1st etc.), none of which are useful in evaluating the dtSearch stemming. The word list also contains many archaic and obscure words that you will not find in many dictionaries (e.g. Mabinogian), but it can nevertheless be useful once edited. We suggest that you use dtSearch to index this file; you can then produce sorted and deduplicated lists by using the List Words in Index option in dtSearch Desktop, with suitable filters, to create smaller files each starting with a different letter of the alphabet.
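If dtSearch is not to hand, the same clean-up can be approximated with a short script. The following Python sketch is only an illustration and is not part of the Stemming Tester or dtSearch: it assumes the list holds one word per line, and the function names and the three-letter minimum are our own choices based on the problems described above.

```python
from collections import defaultdict


def clean_words(lines, min_length=3):
    """Return a sorted, deduplicated list of usable words.

    Assumptions (ours, not dtSearch's): one word per line, words shorter
    than min_length are dropped, and words starting with a digit are dropped.
    """
    seen = set()
    for line in lines:
        word = line.strip()
        if len(word) < min_length:
            continue  # drops blank lines and one- and two-letter words
        if word[0].isdigit():
            continue  # drops entries such as 12th and 1st
        seen.add(word)  # the set removes exact duplicates
    return sorted(seen)


def split_by_initial(words):
    """Group words by their first letter (lower-cased), one list per letter."""
    groups = defaultdict(list)
    for word in words:
        groups[word[0].lower()].append(word)
    return dict(groups)
```

Each list returned by split_by_initial could then be written to its own file, one per letter, so that every file is small enough to load.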

 

S2) http://www.gutenberg.org

Project Gutenberg offers over 36,000 free e-books to download. Plain text format is sufficient for the purpose of word gathering. Literature is a useful source for stemming testing; choose at least 10 books by a range of authors, from different periods and genres, to obtain a sufficient variety of word forms.

This website also houses the Moby word lists, which have been used in evaluating information retrieval software; see also the link at Sheffield University below.
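As a rough illustration of word gathering from downloaded plain text files, the Python sketch below collects the distinct words from a set of texts. It is a minimal sketch under stated assumptions: the definition of a word (a run of ASCII letters, optionally with internal apostrophes), the function names and the folder of .txt files are all hypothetical choices of ours, not anything prescribed by the Stemming Tester.

```python
import re
from pathlib import Path

# Assumption: a word is a run of ASCII letters with optional internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def gather_words(texts):
    """Return the sorted, distinct, lower-cased words found in the given strings."""
    words = set()
    for text in texts:
        words.update(match.lower() for match in WORD_RE.findall(text))
    return sorted(words)


def gather_from_folder(folder):
    """Gather words from every .txt file in a folder (hypothetical layout)."""
    paths = sorted(Path(folder).glob("*.txt"))
    return gather_words(
        path.read_text(encoding="utf-8", errors="replace") for path in paths
    )
```

Note that Project Gutenberg files carry a licence header and footer; this sketch does not strip them, so their words will appear in the result unless the files are trimmed first.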

S3) Sheffield University, The Institute for Language Speech and Hearing.
Grady Ward's Moby project download.

S4) http://ir.dcs.gla.ac.uk/resources/

Glasgow University, Department of Computer Science, Information retrieval resources.

You will find the CISI test collection here in cisi.tar.gz format. The CISI.ALL file contains words joined by underscores, such as anglo_american and ben_ami. It also contains numbers, words containing digits (e.g. chem7071, e611, Pz3, Pz4, x1) and some dates in which the leading character is actually a lower case L instead of the digit 1 (e.g. l967, l970, l972). Although dtSearch will remove words starting with a digit (assuming the option to Index Numbers is unselected), it will not remove words containing digits. These words will therefore need to be edited out, either manually or by using the Word List Cleaner on the Tools menu of the Stemming Tester, before the file is used as a word list.
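For readers who prefer to script this step, the following Python sketch separates out the kinds of entry described above. It is not the Word List Cleaner and makes no claim to match its behaviour; the function names are hypothetical, and the rule (reject any word containing a digit or an underscore) is simply our reading of the problems listed for CISI.ALL.

```python
def has_digit_or_underscore(word):
    """True if the word contains any digit or an underscore."""
    return any(ch.isdigit() or ch == "_" for ch in word)


def filter_words(words):
    """Split words into (kept, rejected), preserving the original order.

    Rejected words are returned as well so that they can be reviewed by hand,
    for example to correct a date such as l967 rather than discard it.
    """
    kept, rejected = [], []
    for word in words:
        if has_digit_or_underscore(word):
            rejected.append(word)
        else:
            kept.append(word)
    return kept, rejected
```

Mistyped dates such as l967 are caught because they still contain digits, even though they begin with a letter.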

 

S5) http://tartarus.org/martin/PorterStemmer/

Official page for the Porter stemmer. The page contains a sample vocabulary (voc.txt, 190 kB, 23,531 words) and the corresponding output file (output.txt) from the stemmer. This word list is not grouped and contains many archaic and obscure words (e.g. zounds, youtli, zanies, yokefellow, yoketh, yon, yond, yonder, yongrey, behoof, behooffull, anatomiz, amaz, addeth, baptiz, witnesseth). Although it is useful, from an academic viewpoint, for making a comparison with the results from the Porter stemmer, it is not otherwise useful for obtaining results that are representative of the document collections used by contemporary professionals in the general commercial or legal world.

 

S6) ftp://ftp.ox.ac.uk/pub/wordlists/

Oxford University. Compressed word lists in several languages.

 

S7) http://en.wikipedia.org/wiki/Wikipedia_database

Dumps of Wikipedia databases.

 

S8) http://www.nottingham.ac.uk/alzsh3/acvocab/wordlists.htm

Nottingham University. This website contains a copy of The General Service List (GSL) and The Academic Word List (AWL).

The GSL is a list of 2284 words of the basic vocabulary of English in order of frequency. See http://jbauman.com/aboutgsl.html

S9) http://www.victoria.ac.nz/lals/resources/academicwordlist/information

Victoria University of Wellington, New Zealand. Original source of the AWL. The AWL has 570 word families. For example, the word family analyse includes the regular inflections of the verb (analysed, analysing, analyses), the derivations of the word (analysis, analyst, analysts, analytical, analytically etc.) and the American spellings (analyze, analyzed, analyzes, analyzing). You can download the Most Frequent Words in the AWL document either as a Rich Text Format (.rtf) file or as an Acrobat (.pdf) file from http://www.victoria.ac.nz/lals/resources/academicwordlist/most-frequent
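A word family gives a convenient check on a stemmer: ideally, every member of the family is reduced to the same stem. The Python sketch below shows one way of counting the distinct stems a family produces. It is illustrative only: the truncate function is a placeholder that keeps the first few letters of a word and is not the dtSearch, Porter or Lancaster algorithm, and all of the names are our own.

```python
from collections import defaultdict

# The members of the "analyse" family as listed in the text above.
ANALYSE_FAMILY = [
    "analyse", "analysed", "analysing", "analyses",
    "analysis", "analyst", "analysts", "analytical", "analytically",
    "analyze", "analyzed", "analyzes", "analyzing",
]


def stems_for_family(family, stem):
    """Map each stem produced by the stem function to the words that gave it."""
    groups = defaultdict(list)
    for word in family:
        groups[stem(word)].append(word)
    return dict(groups)


def truncate(length):
    """Placeholder 'stemmer' that keeps the first `length` letters of a word."""
    return lambda word: word[:length]


if __name__ == "__main__":
    for length in (5, 6):
        groups = stems_for_family(ANALYSE_FAMILY, truncate(length))
        print(length, sorted(groups))
```

With the placeholder, keeping five letters conflates the whole family into a single stem, whereas keeping six letters splits it into three. A real stemmer can be substituted by passing in any function that takes a word and returns its stem.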

S10) Unix Spelling Dictionary

http://code.google.com/p/unix-spell/