Results Using Moby Common Dictionary combined with UNIX Spelling Dictionary Word List (Frakes & Fox)

Frakes & Fox in their paper "Strength and Similarity of Affix Removal Stemming Algorithms" (Ref W1) compared a number of stemmers using various metrics and recommended the Mean Modified Hamming Distance (Mean MHD) for comparing the relative strengths of stemmers.

 

Their measurements were made with a word list consisting of the Moby Common Dictionary word list of 74,550 words combined with a 20,046 word UNIX spelling dictionary. They then removed all entries not consisting entirely of lower case letters (eliminating proper names, abbreviations, hyphenated terms, etc.). The resulting word list was 49,656 English words with an average length of 8.07 letters and a standard deviation of 2.53 letters.

 

Their results are shown in the following table (from Table 2: Modified Hamming Distance Descriptive Statistics, in Ref W1):

 

Stemmer    Mean MHD  Mean Chars Removed  Compression Factor  Mean Conflation Class Size  Word & Stem Different
Lovins     1.72      1.67                0.29                1.42                        34437 (69.4%)
Paice      1.98      1.94                0.33                1.49                        34533 (69.5%)
Porter     1.16      1.08                0.17                1.20                        27897 (56.2%)
S-Removal  0.03      0.03                0.01                1.01                        1636 (3.3%)
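The column headings above can be read as the following computations. This is a minimal sketch, not the code used by Frakes and Fox or by Stemming Tester. It assumes the modified Hamming distance is the number of mismatched positions over the length of the shorter string plus the difference in lengths, and that the compression factor is (N - S) / N for N distinct words and S distinct stems. The latter is consistent with the table, where a mean conflation class size of 1.20 corresponds to a compression factor of about 0.17. The names `modified_hamming`, `strength_metrics` and `toy_s_stemmer` are hypothetical, and the toy stemmer is only a stand-in for a real one.

```python
from statistics import mean


def modified_hamming(word, stem):
    """Mismatches over the shared prefix length plus the length difference (assumed definition)."""
    mismatches = sum(1 for a, b in zip(word, stem) if a != b)
    return mismatches + abs(len(word) - len(stem))


def toy_s_stemmer(word):
    """Hypothetical stand-in stemmer: strip one trailing 's' (not 'ss') from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def strength_metrics(words, stem):
    """Compute the strength metrics named in the table for a word list and a stemmer function."""
    words = sorted(set(words))
    stems = [stem(w) for w in words]
    n_words = len(words)
    n_stems = len(set(stems))
    changed = sum(1 for w, s in zip(words, stems) if w != s)
    return {
        "mean_mhd": mean(modified_hamming(w, s) for w, s in zip(words, stems)),
        "mean_chars_removed": mean(len(w) - len(s) for w, s in zip(words, stems)),
        "compression_factor": (n_words - n_stems) / n_words,
        "mean_conflation_class_size": n_words / n_stems,
        "word_and_stem_different": changed,
        "word_and_stem_different_pct": 100.0 * changed / n_words,
    }


sample = ["cat", "cats", "dog", "dogs", "glass", "engineering"]
metrics = strength_metrics(sample, toy_s_stemmer)
for name, value in metrics.items():
    print(name, round(value, 3))
```

On this six-word sample, two words change and four distinct stems remain, so the mean conflation class size is 1.5 and the compression factor is one third.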

 

 

A similar word list was prepared by combining the Moby COMMON.TXT file (from Ref W11) with the original UNIX spelling dictionary (from Ref W2a). This spelling dictionary has 24,001 words, compared with the 20,046 words reported by Frakes and Fox. The Word List Cleaner in Stemming Tester 1.4 was then used to remove duplicates and all words containing capital letters, non-letter characters, or a trailing 's. The resulting word list was 49,373 words (283 fewer than reported in the Frakes and Fox paper, Ref W1) with an average word length of 8.064 letters.
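The cleaning step described above can be sketched as follows. This is an illustration under stated assumptions, not the actual Word List Cleaner: the function name `clean_word_list` is hypothetical, "letters" is taken to mean ASCII a to z, and surrounding whitespace is stripped before checking each entry. Keeping only entries made entirely of lower case letters drops capitalised words, hyphenated terms, and words with a trailing 's in one test.

```python
import re
from statistics import mean

# Assumption: a valid entry consists only of the ASCII letters a-z.
LOWERCASE_ONLY = re.compile(r"[a-z]+")


def clean_word_list(*word_lists):
    """Merge several word lists, keeping unique all-lower-case entries in sorted order."""
    seen = set()
    for words in word_lists:
        for raw in words:
            word = raw.strip()
            if LOWERCASE_ONLY.fullmatch(word):
                seen.add(word)
    return sorted(seen)


# Tiny made-up samples standing in for the two source files.
moby_sample = ["apple", "Boston", "co-op", "zebra"]
unix_sample = ["apple", "dog's", " cat\n", "x2"]

cleaned = clean_word_list(moby_sample, unix_sample)
print(cleaned)  # ['apple', 'cat', 'zebra']
print(round(mean(len(w) for w in cleaned), 3))  # average word length
```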

 

The word list was then used as input to Stemming Tester 1.4, and stemmed files were obtained using the Porter option and the dtSearch default English stemmer. The 'Word & Stem Different' figure is obtained by setting the Levenshtein distance setting to the range 1 to 32. Note that List Analyser displays the inverse mean MHD, so take its reciprocal (divide 1 by the displayed value) to obtain the Mean MHD figure used by Frakes and Fox.
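The reasoning behind the distance range can be illustrated with a short sketch. A word and its stem differ exactly when their Levenshtein distance is at least 1, so counting pairs in the range 1 to 32 counts every changed word, provided no pair is more than 32 edits apart. The functions below are hypothetical helpers written for this illustration and do not reproduce how Stemming Tester or List Analyser work internally. The word and stem pairs are made up.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def count_in_distance_range(pairs, low=1, high=32):
    """Count (word, stem) pairs whose edit distance lies in [low, high]."""
    return sum(1 for word, stem in pairs if low <= levenshtein(word, stem) <= high)


def mean_mhd_from_inverse(inverse_mean_mhd):
    """Convert a displayed inverse mean MHD back to a mean MHD."""
    return 1.0 / inverse_mean_mhd


pairs = [("running", "run"), ("cats", "cat"), ("dog", "dog"), ("happiness", "happi")]
different = count_in_distance_range(pairs)
print(different)  # 3: every pair except ("dog", "dog")
```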

 

Stemmer    Mean MHD  Mean Chars Removed  Compression Factor  Mean Conflation Class Size  Word & Stem Different
Porter     1.1521    1.071               0.165               1.198                       27577 (55.854%)
dtSearch   0.8045    0.783               0.12                1.136                       21637 (43.824%)

 

It can be seen from the table above that the results for the Porter stemmer are very close to those reported by Frakes and Fox. All these metrics show that the dtSearch default English stemmer is a lighter stemmer than the Porter stemmer. However, these crude metrics give no indication of how accurate a stemmer is. The fact that a stemmer removes more characters does not mean that the resulting stem will only match words with the same concept; it is more likely that the stemmer will produce more over-stemming errors.

 

The Error Counting method of Paice (Ref W8) gives a more useful measurement of how well a stemmer is optimised for producing higher recall without significant loss of precision.