Evaluation of stemming performance

Historically, the most common method of evaluating the performance of information retrieval systems has been to measure average Recall and average Precision on experimental test collections of documents, using search queries for which the 'relevance' of each document has been predetermined. As noted by others (Hull, 1995 Ref W6), "these experiments can be successfully run without the researcher reading a single word of text in either a query or a document".
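
As an illustration of the measurement described above, the following is a minimal Python sketch of how Recall and Precision might be averaged over a set of queries. The function names and the document identifiers are hypothetical, and the sketch assumes that the relevant documents for each query are already known.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the search
    relevant:  document ids judged relevant in advance
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision_recall(runs):
    """Average precision and recall over (retrieved, relevant) pairs."""
    if not runs:
        raise ValueError("at least one query is required")
    pairs = [precision_recall(ret, rel) for ret, rel in runs]
    avg_p = sum(p for p, _ in pairs) / len(pairs)
    avg_r = sum(r for _, r in pairs) / len(pairs)
    return avg_p, avg_r


# Hypothetical results for two queries.
runs = [
    (["d1", "d2"], ["d1"]),        # one of two retrieved is relevant
    (["d3"], ["d3", "d4"]),        # one of two relevant is retrieved
]
print(average_precision_recall(runs))
```

Note that the figures are computed purely from document identifiers, which is why such experiments say nothing about which words caused a miss or a false match.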

While useful, this method gives no insight into the specific causes of errors. When developing and optimising stemming rules, it is necessary to look at the details of which documents are being missed, or which documents are being incorrectly retrieved, because of stemming errors.

Stemming errors are of two types, under-stemming and over-stemming. Under-stemming will result in documents being missed, while over-stemming will result in possibly non-relevant documents being retrieved.


Under-stemming can be caused by an incomplete dictionary in stemmers that use a dictionary (lemmatizers), or by a word of irregular form that is not handled by any rule in a purely algorithmic stemmer. In English, doing/done, being/been, foot/feet and throw/threw are examples of common word pairs that suffix-removal stemming algorithms usually cannot handle.
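
The following Python sketch illustrates why such pairs are missed. It is a toy suffix-removal stemmer, not the dtSearch stemmer or any published algorithm; the suffix list, the function names and the assumed minimum stem length of 3 letters are all hypothetical choices made for the example.

```python
# Toy suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def same_stem(a, b):
    """True if the toy stemmer conflates the two words."""
    return toy_stem(a) == toy_stem(b)


# Regular forms are conflated by suffix removal.
print(same_stem("walking", "walked"))

# Irregular pairs are left apart, which is under-stemming.
for a, b in [("doing", "done"), ("being", "been"),
             ("foot", "feet"), ("throw", "threw")]:
    print(a, b, same_stem(a, b))
```

In this sketch, "doing" is left unchanged because removing "ing" would leave a stem of only two letters, and the vowel changes in foot/feet and throw/threw cannot be reached by removing a suffix at all.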

We recommend that you handle these types of 'under-stemming errors' by creating word groups in User Thesaurus Plus (if they are not already handled by the WordNet thesaurus built into dtSearch) rather than attempting to add rules which would result in a stem of less than 3 letters.

The User Thesaurus Plus add-on is supplied with thesaurus files to overcome under-stemming of all common English irregular verbs and irregular nouns, as well as samples of irregular verbs for other languages. User Thesaurus Plus also contains many other examples for cross-lingual search and specialised vocabularies; see: http://www.dtsearch.co.uk/User-Thesaurus-Plus.htm

Although some may take the view that under-stemming is not a serious error because it does not reduce recall or precision relative to a search without stemming (i.e. the same search results would have been obtained without stemming), in practice the human and legal consequences of missing a document that could prove a man's innocence, a report on a drug trial or an existing patent could be very serious.


Over-stemming is where unrelated words are conflated to the same stem; for example, the Porter stemmer will return documents containing the unrelated words wit and witness for a search query containing either one of the words (Ref W7).
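
The effect can be sketched in Python by grouping words according to the stem they produce. The single rule used here (removing a trailing "ness") is a hypothetical stand-in chosen only to reproduce the wit/witness conflation; it is not the Porter algorithm, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def strip_ness(word):
    """Toy rule: remove a trailing 'ness' if at least 3 letters remain."""
    if word.endswith("ness") and len(word) - 4 >= 3:
        return word[:-4]
    return word


def conflation_classes(words, stem):
    """Group words by the stem that the given stemmer function produces."""
    classes = defaultdict(set)
    for word in words:
        classes[stem(word)].add(word)
    return dict(classes)


classes = conflation_classes(["kind", "kindness", "wit", "witness"], strip_ness)
for stem, members in sorted(classes.items()):
    print(stem, sorted(members))
```

The same rule that correctly joins kind and kindness also joins wit and witness, so a search for either of the latter would retrieve documents containing the other.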

The test files in LEP500 contain groups of related words in each language, with 'barriers' between each group. They are designed primarily to show how words in certain grammatical forms (e.g. verb conjugations) are handled by the dtSearch stemmer. You should usually expect all the words in each group to produce the same stem as output; however, some languages have exceptions that are marked or listed in separate groups at the end of the list. The test files are not intended to give an accurate value of the Under-stemming Index (UI) or Over-stemming Index (OI) (Paice 1994), but they can be used as a starting point to which you can add many more groups. A total of around 10,000 words is usually sufficient to obtain reasonably accurate OI and UI values.
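
As a rough guide to how UI and OI can be obtained from grouped word lists, the following Python sketch follows the merge-total definitions as we understand them from Paice (1994): UI is the proportion of word pairs within groups that the stemmer fails to merge, and OI is the proportion of word pairs from different groups that it wrongly merges. The function name, the in-memory list-of-groups input and the lookup-table stemmer are assumptions made for the example; the sketch does not read the LEP500 file format, and it assumes each word appears only once.

```python
from collections import Counter, defaultdict


def paice_indices(groups, stem):
    """Return (UI, OI) for a list of word groups and a stemmer function."""
    total_words = sum(len(g) for g in groups)
    gdmt = gumt = gdnt = 0.0
    by_stem = defaultdict(Counter)   # stem -> {group index: word count}

    for gi, group in enumerate(groups):
        n = len(group)
        gdmt += 0.5 * n * (n - 1)              # pairs that should merge
        gdnt += 0.5 * n * (total_words - n)    # pairs that should not merge
        counts = Counter(stem(w) for w in group)
        # Pairs inside the group that ended up with different stems.
        gumt += 0.5 * sum(u * (n - u) for u in counts.values())
        for s, u in counts.items():
            by_stem[s][gi] += u

    gwmt = 0.0
    for per_group in by_stem.values():
        ns = sum(per_group.values())
        # Pairs sharing this stem that come from different groups.
        gwmt += 0.5 * sum(v * (ns - v) for v in per_group.values())

    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    return ui, oi


# Hypothetical stemmer output, given as a lookup table for the example.
stems = {"walk": "walk", "walks": "walk", "walking": "walk",
         "foot": "foot", "feet": "feet", "wit": "wit", "witness": "wit"}
groups = [["walk", "walks", "walking"], ["foot", "feet"], ["wit"], ["witness"]]
ui, oi = paice_indices(groups, stems.get)
print(ui, oi)   # UI = 1/4 (foot/feet missed), OI = 1/17 (wit/witness merged)
```

With only seven words the figures are dominated by single errors, which is why a much larger list is needed before the values become meaningful.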