Methods to improve the performance of stemming in dtSearch

Two well-known methods for reducing the under-stemming and over-stemming errors of stemming algorithms are the use of a lexicon and the use of exception lists. Both methods can be applied to dtSearch.

Use of a Lexicon

A common cause of under-stemming in English is the existence of irregular verbs (e.g. go, went, gone), strong verbs (e.g. throw, threw, thrown), and irregular nouns (e.g. foot, feet; mouse, mice). You should not try to create stemming rules to cater for these words. Instead, to compensate for under-stemming of words that represent the same concept but do not follow regular morphological rules, enter each group of words in the dtSearch user thesaurus as a 'synonym group'.
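The effect of a synonym group can be illustrated with a short Python sketch. It is a conceptual model only: the data structure and the function names (build_lookup, expand_term) are hypothetical and do not represent the dtSearch user thesaurus file format or the dtSearch API. It simply shows how a search term that belongs to a group is expanded to every member of that group, so that irregular forms match without any stemming rule.

```python
# Conceptual sketch only: NOT the dtSearch thesaurus format or API.
# Each set is one 'synonym group' of irregular forms taken from the text above.
SYNONYM_GROUPS = [
    {"go", "went", "gone"},
    {"throw", "threw", "thrown"},
    {"foot", "feet"},
    {"mouse", "mice"},
]


def build_lookup(groups):
    """Map every word to the synonym group it belongs to (hypothetical helper)."""
    lookup = {}
    for group in groups:
        for word in group:
            lookup[word] = group
    return lookup


def expand_term(term, lookup):
    """Return the term plus every other member of its group, sorted.

    A term that is in no group is returned unchanged (as a one-item list).
    """
    key = term.lower()
    return sorted(lookup.get(key, {key}))


LOOKUP = build_lookup(SYNONYM_GROUPS)

print(expand_term("went", LOOKUP))   # ['go', 'gone', 'went']
print(expand_term("table", LOOKUP))  # ['table']
```

A search for went is therefore treated as a search for go, gone or went, which is the behaviour a stemming rule cannot reasonably provide for irregular forms.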

The User Thesaurus Plus add-on is supplied with thesaurus files that overcome under-stemming of all common English irregular verbs and irregular nouns, as well as samples of irregular verbs for other languages. User Thesaurus Plus also contains many other examples for cross-lingual search and specialised vocabularies. See http://www.dtsearch.co.uk/User-Thesaurus-Plus.htm

Exception lists

Over-stemming errors can sometimes be overcome by making a rule less 'aggressive' (e.g. changing a rule like 4 + es -> to 5 + es -> so that shorter words are not stemmed), by removing the rule, or by adding an 'exception rule' at the start of a stemming rule file. An example of an exception rule is news -> news: adding this rule to the start of the English stemming rule file prevents the word news from being stemmed to the word new.
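The following Python sketch models why both fixes work. It rests on assumptions that are not confirmed by the text above: that rules are tried in file order and the first matching rule wins, and that the number in a rule such as 4 + es -> is the minimum number of characters that must precede the suffix. The Rule type, the stem function and the 3 + s -> rule are hypothetical illustrations, not the dtSearch implementation or its shipped rule file.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    """One stemming rule in this simplified, hypothetical model."""
    min_prefix: Optional[int]  # None marks a whole-word exception rule
    pattern: str               # suffix to strip, or the whole word for an exception
    replacement: str           # text that replaces the suffix (or the whole word)


def stem(word, rules):
    """Apply the first matching rule (assumed first-match-wins ordering)."""
    for rule in rules:
        if rule.min_prefix is None:
            # Exception rule: matches only the complete word.
            if word == rule.pattern:
                return rule.replacement
        elif word.endswith(rule.pattern):
            prefix = word[: -len(rule.pattern)]
            # The suffix is stripped only if enough characters precede it.
            if len(prefix) >= rule.min_prefix:
                return prefix + rule.replacement
    return word  # no rule matched, so the word is left unchanged


# Hypothetical rule set: '4 + es ->' followed by '3 + s ->'.
BASE_RULES = [Rule(4, "es", ""), Rule(3, "s", "")]

# The same rules with the exception rule 'news -> news' placed first.
WITH_EXCEPTION = [Rule(None, "news", "news")] + BASE_RULES

print(stem("news", BASE_RULES))               # new
print(stem("news", WITH_EXCEPTION))           # news
print(stem("series", [Rule(4, "es", "")]))    # seri
print(stem("series", [Rule(5, "es", "")]))    # series
```

Under these assumptions, the exception rule works because it is reached before any suffix rule, and raising the threshold from 4 to 5 leaves a word with only four characters before the suffix untouched.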