Noise Word file format

Unicode

The noise word files included in the Language Extension Pack LEP500 use Unicode encoding. Some of the languages contain words with 'accented' characters (e.g. après).

Earlier versions of dtSearch used ANSI encoding. If Unicode files are used with dtSearch Desktop/Network 5.25 or earlier, noise words with accented characters will appear as 'garbage' characters in the Search Dialog scrolling word list.

In dtSearch Desktop 6.0 and later, there is a facility for modifying the noise word list. From the Options|Preferences menu, on the Indexing Options tab, under the Letters and Words section, click the 'Edit' button alongside the noise word edit box.


Accent sensitive index

If you are using an accent sensitive index and need to make a word with an accented character a noise word, the word MUST be entered into the noise word list with the accented character.

The file must be saved in Unicode format. On systems that support Unicode (Windows XP, Vista, Windows 7), edit the file with Notepad, choose Save As..., then set Encoding to Unicode.
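
If the noise word list is produced by a script rather than edited by hand, the same result can be approximated in Python. This is a minimal sketch, not a dtSearch API: it assumes the file holds one word per line, and it writes UTF-16 little-endian with a byte order mark, which is what Notepad's "Unicode" encoding option produces. The function names and the file path passed in are hypothetical.

```python
import codecs
from pathlib import Path


def save_noise_words(path, words):
    """Write words one per line as UTF-16 LE with a BOM.

    Assumption: one word per line with Windows line endings; this layout
    is illustrative and is not taken from dtSearch documentation.
    """
    text = "\r\n".join(words) + "\r\n"
    # Write the BOM explicitly so the byte order does not depend on the
    # machine running the script.
    data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    Path(path).write_bytes(data)


def load_noise_words(path):
    """Read the words back; the 'utf-16' codec consumes the BOM."""
    return Path(path).read_bytes().decode("utf-16").split()
```

Writing the byte order mark explicitly keeps the output identical on any platform, and reading the file back with load_noise_words is a quick way to confirm that accented words survived the round trip.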

Accent insensitive index (default)

If you need to have an accented word as a noise word, the noise word must be entered in the noise word list WITHOUT any accented characters.

e.g. If you need to make après a noise word, enter apres in the noise word list. If you enter après it will not prevent après from being indexed.

What this means is that if you need an accented word like après to be a noise word in both an accent insensitive ('normal') index and an accent sensitive index, the word has to be entered in the noise word list twice, once with and once without the accent, and the file must be saved in Unicode format.
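
Entering every accented word twice by hand is easy to get wrong, so the doubling can be scripted. The sketch below is illustrative only: it strips accents by Unicode decomposition, which is an assumption about what "without the accent" means and may not match dtSearch's own accent folding for every character. The function names are hypothetical.

```python
import unicodedata


def strip_accents(word):
    """Return word with combining accent marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def with_unaccented_variants(words):
    """Return each word followed by its unaccented form, without duplicates."""
    result = []
    seen = set()
    for word in words:
        # Normalize to composed form so the same word typed two ways
        # is not listed twice.
        word = unicodedata.normalize("NFC", word)
        for form in (word, strip_accents(word)):
            if form not in seen:
                seen.add(form)
                result.append(form)
    return result
```

For example, with_unaccented_variants(["après", "le"]) returns ["après", "apres", "le"]: the accented word gains an unaccented twin, and a word with no accents is listed once.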