Comparison of the dtSearch and Porter stemmers


The dtSearch stemmer is a proprietary, iterative, longest-match stemmer that works by suffix removal and replacement. It was designed for fast operation, with relatively simple conditions for removing or changing a word suffix, and it reads language-specific rules from plain text files. As a result, dtSearch UK has developed stemming rules for over 25 languages since 2001.
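
The general technique described above can be illustrated with a minimal Python sketch. It is not dtSearch's implementation: the rule file layout (suffix, replacement, minimum stem length), the sample rules, and the names parse_rules, stem and max_passes are all assumptions made up for this example. It only shows how rules read from plain text can drive a stemmer that, on each pass, applies the longest suffix rule whose condition holds and repeats until no rule applies.

```python
# Hypothetical rule format, invented for this sketch (NOT dtSearch's file format):
#   suffix  replacement  min_stem_length
# A replacement of "-" means "remove the suffix without replacing it".
RULES_TEXT = """
# suffix  replacement  min_stem
ational   ate          2
ization   ize          2
ies       y            2
ing       -            3
ed        -            3
s         -            3
"""


def parse_rules(text):
    """Parse plain-text rules into (suffix, replacement, min_stem) tuples."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        suffix, replacement, min_stem = line.split()
        if replacement == "-":
            replacement = ""
        rules.append((suffix, replacement, int(min_stem)))
    # Longest suffix first, so the first rule that applies is the longest match.
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    return rules


def stem(word, rules, max_passes=10):
    """Iteratively apply the longest applicable suffix rule until none applies."""
    word = word.lower()
    for _ in range(max_passes):
        for suffix, replacement, min_stem in rules:
            remainder = word[:-len(suffix)]
            # Simple condition: the remaining stem must be long enough.
            if word.endswith(suffix) and len(remainder) >= min_stem:
                word = remainder + replacement
                break
        else:
            return word  # no rule applied on this pass, so stop
    return word


RULES = parse_rules(RULES_TEXT)
print(stem("organizations", RULES))
print(stem("ponies", RULES))
```

With these sample rules, "organizations" takes two passes ("s" is removed, then "ization" becomes "ize"), which is the iterative part of the design. Supporting another language would, under the same assumptions, mean supplying a different rule text rather than changing the code.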

In comparison, the Porter stemmer is relatively complex and harder to understand, with a five-stage algorithm (reduced to four in the 2002 ‘Porter 2’ version). It was developed in 1980 for English and released into the public domain. After 30 years it has been adapted for only about a dozen languages by various authors, and very few of these adaptations have had extensive test results published.

It should be pointed out that the Porter (1980) stemmer has been frozen in its design so that it can be used as a benchmark. Similarly, the dtSearch English stemmer has remained unchanged since 1992. The advantage to users, particularly in the legal profession, is that the results of a search using dtSearch stemming will be consistent between different versions of the software.