Using dtSearch Desktop to prepare word lists

To prepare a list of words, you first need a suitable collection of documents (a corpus) covering the subject domains and languages you require. If your organisation already has many documents, take care to exclude from the index any documents that are not in the target language. Before building the index, clear all the option boxes (e.g. Index Numbers) in the Options|Preferences|Indexing Options dialog.

You can exclude particular folders as well as files in dtSearch by entering a file pattern in the exclude text box. For example, if you add *\My Documents\NotThisFolder\* to your Exclude Filters, dtSearch will skip anything under C:\Documents and Settings\User\My Documents\NotThisFolder\
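To check which paths a filter such as the one above is likely to catch before you build the index, a rough sketch in Python can help. It assumes the filter behaves like a simple case-insensitive wildcard match over the full path; dtSearch's own matching rules may differ in detail, and the function name is_excluded is purely illustrative.

```python
from fnmatch import fnmatchcase

def is_excluded(path, exclude_filters):
    """Return True if the path matches any filter.

    Assumption: filters are plain wildcard patterns compared
    case-insensitively against the full path. This approximates,
    and is not taken from, dtSearch's own filter logic.
    """
    lowered = path.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in exclude_filters)

filters = [r"*\My Documents\NotThisFolder\*"]

# A file inside the excluded folder (hypothetical file names).
skipped = is_excluded(
    r"C:\Documents and Settings\User\My Documents\NotThisFolder\notes.doc", filters
)
# A file in a sibling folder that should still be indexed.
kept = is_excluded(
    r"C:\Documents and Settings\User\My Documents\Reports\notes.doc", filters
)
print(skipped, kept)
```

Because * also matches backslashes in this sketch, files in sub-folders of NotThisFolder are treated as excluded too.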

After you have created the index, select List Index Contents... from the dtSearch Desktop Index menu, select the index, and then enter a suitable pattern to match. For example, the pattern a* will provide you with a sorted and de-duplicated list of all words starting with the letter 'a'.
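The effect of a pattern such as a* can be pictured with a small Python sketch. This is only an illustration of "sorted and de-duplicated"; it is not how dtSearch produces the list, and lower-casing the words before comparison is an assumption made here for simplicity.

```python
from fnmatch import fnmatchcase

def list_matching_words(words, pattern):
    """Return the sorted, de-duplicated words that match a wildcard pattern."""
    unique = {w.lower() for w in words}  # assumption: case is ignored
    return sorted(w for w in unique if fnmatchcase(w, pattern.lower()))

# Hypothetical sample of indexed words.
sample = ["apple", "Anchor", "banana", "apple", "axis"]
result = list_matching_words(sample, "a*")
print(result)
```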

After you have listed the words, Save... them as a .txt file. You will find that dtSearch has added two additional words, xfirstword and xlastword; open the file in a suitable text editor* and delete these extra words. You can also add comments at the end of the word list by adding a line starting with \\.

* Stemming Tester 1.3 and later includes a Word List Cleaner on the Tools menu, which can automatically remove the extra xfirstword and xlastword entries that dtSearch adds. It can also remove words that contain digits (e.g. gr8t, X1, P45, e611, b4), which dtSearch is unable to do, and can remove duplicates, etc.
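If you prefer to script the clean-up yourself, the steps described above could be sketched roughly as follows in Python. This is not the Word List Cleaner's implementation. It assumes the saved file holds one word per line, and the function name clean_word_list is made up for this example.

```python
import re

SENTINELS = {"xfirstword", "xlastword"}  # extra words added by dtSearch
COMMENT_PREFIX = "\\\\"                   # a comment line starts with two backslashes

def clean_word_list(lines):
    """Drop sentinel words, words containing digits, blank lines and
    duplicates, while keeping comment lines unchanged.

    Assumption: the input has one word (or one comment) per line.
    """
    seen = set()
    cleaned = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(COMMENT_PREFIX):
            cleaned.append(line)  # keep comments as they are
            continue
        if line.lower() in SENTINELS or re.search(r"\d", line):
            continue
        if line in seen:
            continue
        seen.add(line)
        cleaned.append(line)
    return cleaned

# Hypothetical contents of a saved word list.
sample = ["xfirstword", "able", "b4", "able", "gr8t", "zebra",
          "xlastword", "\\\\ English word list"]
cleaned = clean_word_list(sample)
print(cleaned)
```

To apply it to a real file, read the lines with open(...).read().splitlines() and write the returned list back out, one entry per line.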


2 Hooper R P.  The Design and Optimisation of Stemming Algorithms. Dept of Computing, Lancaster Univ. UK. 2005