Word List Cleaner

To clean a word list, first open it from the Options|Use List of Words... menu, then click the Tools|Word List Cleaner... menu.

Output - Word Length.

Adjust the settings to control the lengths of the words in the output list. You can set a range such as 8 to 63, or set both values to the same number, e.g. 5 to 5, to produce a list containing only five-letter words. For testing dtSearch stemming, a range of 3 to 32 is suitable.
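
As a rough illustration of the length range, the following Python sketch keeps only the words whose length falls within a minimum and a maximum. The function name and its parameters are hypothetical and are not part of the tool. The sketch also assumes that length is counted in characters and that both ends of the range are inclusive.

```python
def filter_by_length(words, min_len, max_len):
    """Keep words whose character count is within min_len..max_len (assumed inclusive)."""
    return [w for w in words if min_len <= len(w) <= max_len]


words = ["to", "stem", "apple", "stemming", "grape"]

# Same number at both ends: only five-letter words are kept.
five_letter = filter_by_length(words, 5, 5)

# The range suggested above for testing stemming.
testing_range = filter_by_length(words, 3, 32)
```

With the sample list above, the 5 to 5 setting keeps apple and grape, and the 3 to 32 setting drops only the two-letter word.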

The cleaned word list is saved in Unicode format in the same folder as the word list that you opened. The output filename depends on the options that were selected: the Keyboard Shortcut Access (Underlined) Keys of the selected options are included in the filename. For example, if your word list was named mywordlist.txt, it will be saved as mywordlist_cleaned_F.txt if the option to remove the Final 's or ' from words was selected. If the option to remove Duplicate words was also selected, the output file will be named mywordlist_cleaned_FD.txt. If all the options are selected, the filename will include FDNXW.
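
The naming pattern can be sketched in Python as follows. This is only an illustration of the pattern shown in the examples above. The function cleaned_filename is hypothetical, and the caller is assumed to pass the option keys already in the order in which they should appear (for example "FD"). What the tool does when no options are selected is not covered here.

```python
from pathlib import Path


def cleaned_filename(input_path, option_keys):
    """Build <name>_cleaned_<keys><extension> in the same folder as the input file."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}_cleaned_{option_keys}{p.suffix}")


name_f = cleaned_filename("mywordlist.txt", "F").name
name_fd = cleaned_filename("mywordlist.txt", "FD").name
```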

Remove... Final 's or ' from words

This is intended to clean up test collections such as the CISI word list, or stemmed or truncated outputs. For example, it will change Zipf's to Zipf.
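
A minimal Python sketch of this kind of clean-up is shown below. The function name is hypothetical, and the sketch assumes that only the straight apostrophe (') is handled and that at most one ending is removed per word. The tool itself may treat other apostrophe characters differently.

```python
def strip_final_possessive(word):
    """Remove a single trailing 's or ' from a word (straight apostrophe only)."""
    if word.endswith("'s"):
        return word[:-2]
    if word.endswith("'"):
        return word[:-1]
    return word


examples = ["Zipf's", "boys'", "apple"]
stripped = [strip_final_possessive(w) for w in examples]
```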

Remove... Duplicate words (Except Barriers)

This will remove any word occurring more than once in the list. It will not remove barriers such as ==== or ---- that are used in word lists intended for measuring error counts, as used by the List Analyser. The comparison is case-sensitive: Apple and apple are considered separate words, and neither will be removed as a duplicate of the other.
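
The behaviour described above can be sketched in Python as follows. Several points are assumptions made for the illustration and are not confirmed by the tool: a barrier is taken to be a line made up only of = characters or only of - characters, the first occurrence of each word is the one that is kept, and the original order of the list is preserved. The function names are hypothetical.

```python
def is_barrier(line):
    """Assumed definition: a non-empty line made up only of '=' or only of '-'."""
    return bool(line) and (set(line) == {"="} or set(line) == {"-"})


def remove_duplicates(lines):
    """Drop repeated words (case-sensitive) but always keep barrier lines."""
    seen = set()
    result = []
    for line in lines:
        if is_barrier(line) or line not in seen:
            result.append(line)
        seen.add(line)
    return result


lines = ["Apple", "apple", "====", "apple", "====", "Apple"]
deduplicated = remove_duplicates(lines)
```

In this example both Apple and apple survive once each, and both ==== barriers are kept.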

Remove... Words containing upper-case characters

This is for cleaning word lists such as the Moby Common Dictionary (COMMON.TXT) file to remove proper nouns, abbreviations, etc. For example, it will remove Brown but not brown.
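
As an illustration only, the test for upper-case characters could look like the Python sketch below. The function name is hypothetical. The sketch assumes that a single upper-case character anywhere in the word is enough to remove it, so a word such as iPhone would also be removed.

```python
def has_upper(word):
    """True if any character in the word is upper-case."""
    return any(ch.isupper() for ch in word)


words = ["Brown", "brown", "NATO", "iPhone"]
kept = [w for w in words if not has_upper(w)]
```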

Remove... Words containing non-letter characters

Caution: This option will remove barriers such as ==== or ---- which you may need for use with the List Analyser.

This is for cleaning lists such as the CISI.ALL file, which contains words that include digits, e.g. X1, and words that appear to be numbers but actually start with a letter, which dtSearch cannot remove, e.g. l974 (which begins with a lower-case letter L rather than the digit 1).

This option keeps only words that consist entirely of ‘letters’ (including accented letters and Asian text such as Chinese, Bengali, Korean, etc.). It will remove any word that contains a digit or other non-letter character. For example, it will remove any of these words:

“hello”

help!

pardon?

post-code

_value

4en6

gr8t

If you want to retain words that contain punctuation such as ! “ ” or ?, it is better to first clean the word list by indexing it with dtSearch, so that you can remove the punctuation without removing the words from the list. (To retain 'barriers' you will need to edit the dtSearch Alphabet file so that the characters - and = are treated as Letters.)
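
The letters-only test can be approximated in Python with the standard unicodedata module, as in the sketch below. This is not the tool's own implementation. The function name is hypothetical, and the sketch assumes that a 'letter' means any character in a Unicode letter category (L) or combining mark category (M). Marks are allowed here because scripts such as Bengali write vowel signs as combining marks, which Python's simpler str.isalpha() check would reject.

```python
import unicodedata


def is_letters_only(word):
    """True if every character is a Unicode letter (L*) or combining mark (M*)."""
    return bool(word) and all(
        unicodedata.category(ch)[0] in ("L", "M") for ch in word
    )


samples = [
    "“hello”", "help!", "pardon?", "post-code", "_value", "4en6", "gr8t",
    "====",    # barriers are removed too, as the caution above warns
    "naïve", "中文", "한국어",
]
kept = [w for w in samples if is_letters_only(w)]
```

Only the last three samples are kept. Every word from the list of examples above is removed, and so is the ==== barrier.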

Remove... xfirstword xlastword

This is for cleaning word lists that have been prepared using the List Words in Index feature of dtSearch Desktop.

Remove... Lines containing multiple words

This will remove any line containing a phrase such as White House.

Note: The removal rules are applied in the same order that they are listed.
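
The order can affect the result. The Python sketch below illustrates this with four of the rules, applied first in the listed order and then with the first two swapped. All function names are hypothetical, the rules are simplified approximations rather than the tool's own logic, and the xfirstword and xlastword entries are assumed to match exactly and case-sensitively.

```python
def is_barrier(line):
    # Assumed definition: a line made up only of '=' or only of '-'.
    return bool(line) and (set(line) == {"="} or set(line) == {"-"})


def strip_final(lines):
    # Remove one trailing 's or ' from each word.
    out = []
    for w in lines:
        if w.endswith("'s"):
            w = w[:-2]
        elif w.endswith("'"):
            w = w[:-1]
        out.append(w)
    return out


def drop_duplicates(lines):
    # Keep the first occurrence of each word; always keep barriers.
    seen, out = set(), []
    for w in lines:
        if is_barrier(w) or w not in seen:
            out.append(w)
        seen.add(w)
    return out


def drop_markers(lines):
    # Remove the xfirstword and xlastword entries.
    return [w for w in lines if w not in ("xfirstword", "xlastword")]


def drop_multiword(lines):
    # Remove lines that contain more than one whitespace-separated word.
    return [w for w in lines if len(w.split()) <= 1]


# A subset of the rules, in the same relative order as they are listed.
RULES = [strip_final, drop_duplicates, drop_markers, drop_multiword]


def clean(lines, rules=RULES):
    for rule in rules:
        lines = rule(lines)
    return lines


raw = ["xfirstword", "Zipf's", "Zipf", "White House", "law", "xlastword"]
cleaned = clean(raw)
swapped = clean(raw, [drop_duplicates, strip_final, drop_markers, drop_multiword])
```

In the listed order, Zipf's becomes Zipf before duplicates are removed, so only one Zipf remains. If duplicates were removed first, Zipf's and Zipf would both survive that step and the output would contain Zipf twice.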