Customising dtSearch Stemming Rules

dtSearch uses an iterative longest-match stemming algorithm. It is important to appreciate that dtSearch does not simply read through a list of rules from beginning to end, removing word endings as it goes. Instead, it reads through the rule list looking for a rule that matches the search query word. Once a match is found, it modifies the word according to that rule, then returns to the start of the rule list and repeats the process until the word no longer changes.
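The loop described above can be sketched in a few lines of Python. This is only an illustration of the restart-until-unchanged behaviour, not dtSearch's own code: the (minimum letters, suffix, replacement) tuple, the stem function and the max_passes safety limit are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical in-memory form of a rule: (minimum letters before the suffix,
# suffix to match, replacement text). This is not dtSearch's file format.
Rule = Tuple[int, str, str]


def stem(word: str, rules: List[Rule], max_passes: int = 50) -> str:
    """Apply the first matching rule, restart from the top, stop when stable."""
    for _ in range(max_passes):  # assumed guard against rules that cycle
        for min_len, suffix, replacement in rules:
            stem_len = len(word) - len(suffix)
            if word.endswith(suffix) and stem_len >= min_len:
                new_word = word[:stem_len] + replacement
                break  # first match wins; go back to the start of the list
        else:
            return word  # no rule matched at all
        if new_word == word:
            return word  # the rule left the word unchanged, so stop
        word = new_word
    return word


RULES: List[Rule] = [(3, "ies", "y"), (3, "ed", ""), (3, "s", "")]

print(stem("ponies", RULES))    # pony
print(stem("hundreds", RULES))  # hundr (two passes: "s" removed, then "ed")
```

The second call shows why the restart matters: a rule that could not match the original word can match the result of an earlier pass.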

There are three types of rule:

Suffix removal rules

Example: 3 + ed -> will remove the suffix 'ed' from any word where that suffix is preceded by 3 letters or more.

Suffix substitution rules

Example: 3 + ies -> y will remove the suffix 'ies' and replace it with the suffix 'y', where the 'ies' is preceded by 3 letters or more.

Exception rules

Examples: ss -> ss, news -> news

The first rule could be followed by a rule 3 + s -> so that the letter s is removed from the end of a word unless the word ends in ss. Because the ss -> ss rule matches first and leaves the word unchanged, processing stops there, and a word ending in ss never reaches the 3 + s -> rule.

The second rule, news -> news, if placed at the top of the list of stemming rules, ensures that a search for news will not find documents containing the word new.
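As a rough illustration of the three rule types and of how exception rules protect words, the sketch below parses rules written in the notation used in this article and applies them. The regular expression, the parse_rule and stem helpers, and the choice to treat an exception rule as a suffix match with no minimum length are assumptions made for this example; they are not taken from dtSearch itself.

```python
import re
from typing import List, Tuple

Rule = Tuple[int, str, str]  # (minimum letters before suffix, suffix, replacement)

# Assumed grammar: optional "N +", a suffix, "->", optional replacement.
RULE_RE = re.compile(r"^\s*(?:(\d+)\s*\+\s*)?([a-z]+)\s*->\s*([a-z]*)\s*$")


def parse_rule(text: str) -> Rule:
    match = RULE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised rule: {text!r}")
    min_len = int(match.group(1)) if match.group(1) else 0
    return (min_len, match.group(2), match.group(3))


def stem(word: str, rules: List[Rule], max_passes: int = 50) -> str:
    for _ in range(max_passes):  # assumed guard against cyclic rules
        for min_len, suffix, replacement in rules:
            stem_len = len(word) - len(suffix)
            if word.endswith(suffix) and stem_len >= min_len:
                new_word = word[:stem_len] + replacement
                break
        else:
            return word
        if new_word == word:  # an exception rule leaves the word as it is
            return word
        word = new_word
    return word


with_exceptions = [parse_rule(r) for r in ("news -> news", "ss -> ss", "3 + s ->")]
without_exceptions = [parse_rule("3 + s ->")]

for w in ("news", "glass", "books"):
    print(w, stem(w, with_exceptions), stem(w, without_exceptions))
```

With the exception rules in place, news and glass come back unchanged while books still becomes book; with only the 3 + s -> rule, news is reduced to new.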

Care needs to be taken in preparing the sequence of rules to ensure that longer matches are placed earlier in the list. For example, suppose you have a rule 3 + e -> to remove the letter e from the end of a word, and you place it before a rule like 3 + de -> that needs to remove the suffix 'de'. The second rule will then never work, because every word ending in 'de' also ends in 'e' and is matched by the first rule before it can reach the second.
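An ordering problem of this kind can be checked mechanically. The sketch below flags any rule that can never fire because an earlier rule matches every word it would match. It assumes the first-match scanning described in this article and reuses a hypothetical (minimum letters, suffix, replacement) tuple; find_shadowed_rules is an illustrative helper, not a dtSearch feature.

```python
from typing import List, Tuple

Rule = Tuple[int, str, str]  # (minimum letters before suffix, suffix, replacement)


def find_shadowed_rules(rules: List[Rule]) -> List[Tuple[Rule, Rule]]:
    """Return (earlier, later) pairs where 'later' can never be reached."""
    shadowed = []
    for later_idx, later in enumerate(rules):
        later_min, later_suffix, _ = later
        for earlier in rules[:later_idx]:
            earlier_min, earlier_suffix, _ = earlier
            # Letters of the later suffix that count as stem for the earlier one.
            extra = len(later_suffix) - len(earlier_suffix)
            if (later_suffix.endswith(earlier_suffix)
                    and later_min + extra >= earlier_min):
                shadowed.append((earlier, later))
                break
    return shadowed


bad_order = [(3, "e", ""), (3, "de", "")]
good_order = [(3, "de", ""), (3, "e", "")]

print(find_shadowed_rules(bad_order))   # reports the 'de' rule as unreachable
print(find_shadowed_rules(good_order))  # []
```

An exception rule such as ss -> ss placed before 3 + s -> is not reported, because the shorter suffix comes later in the list and can still match words that do not end in ss.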

The rules are different for each language, and languages differ in how effective stemming can be. The stemming characteristics needed also depend on the application. In general, where the consequences of not finding a document can be serious, such as in criminal investigations (eForensics), patent research, or corporate investigations (eDiscovery), high search recall is needed, and this requires a stronger stemmer.

In general, a stronger stemmer will have a low understemming index (UI) but may have a high overstemming index (OI). To tune the stemming and optimise the balance of OI and UI, we recommend measuring these metrics using the List Analyser in conjunction with a word list of around 10,000 words, with the words grouped into concepts according to the method described in Paice Ref W8.
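To make the two metrics concrete, here is a minimal pairwise sketch. It assumes the understemming index is the fraction of same-concept word pairs that the stemmer fails to merge, and the overstemming index is the fraction of different-concept word pairs that it wrongly merges. It does not reproduce the List Analyser; stemming_indices and toy_stem are hypothetical names, and the tiny word list is invented for the example.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def stemming_indices(groups: List[List[str]],
                     stem: Callable[[str], str]) -> Tuple[float, float]:
    """Return (understemming index, overstemming index) for grouped words."""
    labelled = [(stem(word), idx) for idx, group in enumerate(groups)
                for word in group]
    same_pairs = missed = diff_pairs = wrong = 0
    for (stem_a, group_a), (stem_b, group_b) in combinations(labelled, 2):
        if group_a == group_b:
            same_pairs += 1
            if stem_a != stem_b:
                missed += 1   # same concept, but not merged
        else:
            diff_pairs += 1
            if stem_a == stem_b:
                wrong += 1    # different concepts, but merged
    ui = missed / same_pairs if same_pairs else 0.0
    oi = wrong / diff_pairs if diff_pairs else 0.0
    return ui, oi


def toy_stem(word: str) -> str:
    # Deliberately crude stand-in stemmer: drop a final "s" after 3+ letters.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


concept_groups = [["run", "runs", "running"], ["new"], ["news"]]
ui, oi = stemming_indices(concept_groups, toy_stem)
print(f"UI = {ui:.3f}, OI = {oi:.3f}")
```

Here the toy stemmer misses running (understemming) and merges news with new (overstemming). Comparing every pair is quadratic in the number of words, so a list of around 10,000 words would call for grouped counting rather than this brute-force loop.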