Overview

List Analyser is designed for use with Stemming Tester 1.4 for the purpose of tuning stemming rules. It can be used with plain text word lists in which the words are either ungrouped or arranged in concept groups separated by barriers, according to the method described by Chris Paice of Lancaster University (Ref W8). It will display the Under Stemming Error count, Over Stemming Error count and Stemmer Weight.

Stemming characteristics should ideally be optimised for each application and for each document collection to be searched. A common mistake is to see stemming as the solution for finding all variants of a word, even where the stem is quite different (e.g. think, thought; sing, sung). The correct solution in these cases when using dtSearch is to add the variant terms to the thesaur.xml file, which is edited either with the built-in User Thesaurus editor in dtSearch Desktop or with the add-on product User Thesaurus Plus.

List Analyser is designed to compare two word lists of the same length. It was specifically designed to analyse outputs from Stemming Tester, but may also find applications in comparing the input and output word lists of other stemmers or word processing algorithms such as translation, encryption, etc.

List Analyser displays descriptive statistics of stemmer strength and similarity, together with stemming error counts as listed below:


Word Count
Unique Words
Mean Word Length

Stemmer Strength
List Analyser displays the following six metrics, as described by Frakes & Fox (Ref W1):
1 - Mean Conflation Class Size
2 - Compression Factor
3 - Number of Words and Stems that Differ
4 - Mean Characters Removed
5 & 6 - Median and mean modified Hamming distance (MHD) between words and their stems
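
The six strength metrics can be illustrated with a short Python sketch. This is not List Analyser's own code. It assumes the usual reading of the Frakes & Fox definitions: conflation class size and compression factor are computed from the counts of unique words and unique stems, and the modified Hamming distance is the number of mismatched characters over the shared positions plus the difference in length. The function names and the sample word list are hypothetical.

```python
from statistics import mean, median


def modified_hamming(a: str, b: str) -> int:
    """Mismatches over the shared positions plus the length difference."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def strength_metrics(words, stems):
    """Compare a word list with its stemmed counterpart (same length, same order)."""
    if len(words) != len(stems):
        raise ValueError("word and stem lists must be the same length")
    pairs = list(zip(words, stems))
    unique_words = len(set(words))
    unique_stems = len(set(stems))
    distances = [modified_hamming(w, s) for w, s in pairs]
    return {
        # Assumed definition: unique words per unique stem
        "mean_conflation_class_size": unique_words / unique_stems,
        # Assumed definition: fractional reduction in unique terms
        "compression_factor": (unique_words - unique_stems) / unique_words,
        "words_and_stems_that_differ": sum(1 for w, s in pairs if w != s),
        "mean_characters_removed": mean(len(w) - len(s) for w, s in pairs),
        "median_mhd": median(distances),
        "mean_mhd": mean(distances),
    }


# Hypothetical input: four words that a stemmer conflates to one stem
words = ["connect", "connected", "connecting", "connection"]
stems = ["connect", "connect", "connect", "connect"]
metrics = strength_metrics(words, stems)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

A stronger stemmer gives larger values for each metric on the same word list, which is what makes them useful when comparing rule sets.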

Stemmer Similarity
Inverse mean MHD
SSM*

Stemming Error Counts
Under-Stemming Index
Over-Stemming Index

Stemmer Weight
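
As a rough illustration of error counting on grouped word lists, the Python sketch below counts word pairs. It assumes one common formulation of Paice's method: the under-stemming index is the fraction of same-group pairs that the stemmer fails to merge, the over-stemming index is the fraction of different-group pairs that it wrongly merges, and the stemmer weight is the ratio of over-stemming to under-stemming. List Analyser's exact calculation may differ; the function name, the toy stemmer and the sample groups are all hypothetical.

```python
from itertools import combinations


def paice_indices(groups, stem):
    """groups: list of concept groups (lists of words); stem: function word -> stem.

    Returns (under_stemming_index, over_stemming_index, stemmer_weight).
    """
    tagged = [(word, gid) for gid, group in enumerate(groups) for word in group]
    desired_merges = unachieved_merges = 0
    desired_nonmerges = wrong_merges = 0
    for (w1, g1), (w2, g2) in combinations(tagged, 2):
        same_stem = stem(w1) == stem(w2)
        if g1 == g2:
            # Pair belongs to one concept group: the stems should match
            desired_merges += 1
            if not same_stem:
                unachieved_merges += 1
        else:
            # Pair spans two concept groups: the stems should differ
            desired_nonmerges += 1
            if same_stem:
                wrong_merges += 1
    ui = unachieved_merges / desired_merges if desired_merges else 0.0
    oi = wrong_merges / desired_nonmerges if desired_nonmerges else 0.0
    # Assumed definition of stemmer weight: OI / UI (undefined when UI is zero)
    sw = oi / ui if ui else None
    return ui, oi, sw


def truncate_stem(word):
    """Toy stemmer for illustration only: keep the first three characters."""
    return word[:3]


# Hypothetical concept groups; in a word list file these would be separated by barriers
groups = [["run", "running", "ran"], ["rung", "rungs"]]
ui, oi, sw = paice_indices(groups, truncate_stem)
print(f"UI={ui:.3f} OI={oi:.3f} SW={sw:.3f}")
```

In this sketch a heavy stemmer pushes the over-stemming index up and the under-stemming index down, so the stemmer weight rises as the rules become more aggressive.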

List View Columns
MHD
Levenshtein Distance
Relative MHD
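
For reference, the Levenshtein distance between a word and its stem can be computed with the standard dynamic-programming recurrence. The sketch below is a minimal Python version and is not taken from List Analyser; the function name is hypothetical. It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("connected", "connect"))
print(levenshtein("sung", "sing"))
```

Because Levenshtein distance allows characters to shift position, it stays small when a prefix is removed (for example "unhappy" and "happy"), whereas a position-by-position comparison does not.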