Noise Word lists

In dtSearch the default Noise Word list is in a file named noise.dat.

The purpose of excluding words from the index is to reduce the size of the index. dtSearch will work without a noise word list and does not create a default list.

Where casual users carry out natural language searching, it is preferable to keep the number of noise words to a minimum. Also, if searches are typically on book or play titles, or everyday phrases - perhaps from a film script - then it may be preferable to work without a noise word list.

The Language Extension Pack includes noise word lists for many languages, as well as an empty file called None.dat.

The noise word lists are MODIFIABLE and REDISTRIBUTABLE.

IMPORTANT. dtSearch stores its own copy of the noise word list within each index. After you add or delete words in the noise word file, the changes are reflected in indexes that you create afterwards, but they do not affect existing indexes.

Adding words to the list

If you are sure that you never want to search for certain words in the files to be indexed, add them to the noise word list. You can easily modify the noise word files with a text editor. To add words, insert them in alphabetical order in the existing list. The alphabetical order is only a practical way to manage the list. The indexing program ignores the order of the words.
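The same edit can be scripted instead of done by hand. The following is a minimal Python sketch, not part of dtSearch. It assumes the noise word file is plain text with words separated by whitespace, that words can be treated as lower case, and that the file is UTF-8 encoded; check these assumptions against your own file first. The names merge_noise_words and update_noise_file are hypothetical.

```python
from pathlib import Path


def merge_noise_words(existing_text, new_words):
    """Return the list text with new_words merged in.

    Assumption: the list is plain text, words separated by whitespace.
    Words are lower-cased, duplicates are dropped, and the result is
    sorted alphabetically, one word per line, for easier maintenance.
    """
    words = {w.lower() for w in existing_text.split()}
    words.update(w.strip().lower() for w in new_words if w.strip())
    return "\n".join(sorted(words)) + "\n"


def update_noise_file(path, new_words, encoding="utf-8"):
    """Rewrite the file at path with new_words added (encoding is an assumption)."""
    p = Path(path)
    merged = merge_noise_words(p.read_text(encoding=encoding), new_words)
    p.write_text(merged, encoding=encoding)
```

For example, update_noise_file("noise.dat", ["hereby", "thereof"]) would add two words to a list in the current directory. It is safer to run this on a copy of the file and compare the result before replacing the original.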

Choosing Noise Words

Choose as noise words those words that are unlikely to be useful in a search. These generally include indefinite and definite articles such as 'a' and 'the' (note: some languages, such as Finnish and Russian, do not have articles), conjunctions such as and, or, but, that, and prepositions such as with, in, to, at. You may need to refine your choice depending on how frequently words actually occur in the indexed document collection.
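One way to see which words are most frequent in a collection is to count them in a sample of the documents. The sketch below is an illustration in Python and does not use dtSearch. The function name candidate_noise_words and the simple letters-only tokenizer are assumptions, and the tokenizer will not match dtSearch's own word-breaking rules exactly.

```python
import re
from collections import Counter


def candidate_noise_words(texts, top_n=20):
    """Return the top_n most frequent words across the sample texts.

    Assumption: a word is a run of letters; digits, punctuation and
    apostrophes act as separators. The result is only a list of
    candidates and still needs to be reviewed by hand.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]
```

A frequent word is not automatically a noise word. A term that is common in the collection may still be one that users need to search for, so review each candidate before adding it to the list.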

Creating noise word lists in other languages

To create a noise word list for another language, it is incorrect to simply translate word for word from one language to another. The noise words should reflect how commonly the words occur in typical written texts of that language. A word that is frequent in one language may be less frequent in another, or may have no equivalent at all. Sometimes a single word in one language can only be represented by several words in the other language. A dictionary is not adequate for this work. In each case a grammar text should be consulted, and sample text in each language should be used for testing.
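A simple test against sample text is to measure what share of the words in the sample a draft list would exclude. The following Python sketch is illustrative only. The function name noise_coverage and the letters-only tokenizer are assumptions, and no particular percentage is being recommended.

```python
import re


def noise_coverage(sample_text, noise_words):
    """Return the fraction of words in sample_text that are in noise_words.

    Assumption: a word is a run of letters, compared in lower case.
    Returns 0.0 for a sample that contains no words.
    """
    noise = {w.lower() for w in noise_words}
    tokens = re.findall(r"[^\W\d_]+", sample_text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in noise)
    return hits / len(tokens)
```

A very low result for a list translated from another language suggests that the translated words are not the frequent ones in the target language, which is the problem described above.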