Comparison of similarity of dtSearch and Porter stemmer

Similarity

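Another metric for comparing stemmers is their similarity. A crude method is to count how many of the output words from the two stemmers are the same. A comparison using the same List Analyser software as before gives the results in the table below, which groups the word pairs by the Levenshtein distance between the two outputs (a distance of 0 means the two stems are identical).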

 

Levenshtein Distance    Word Count
0                       16933
1                       3183
2                       2197
3                       742
4                       273
5                       128
6                       9
7                       6
Total                   23531

Table 3. Levenshtein distance between the dtSearch and Porter stemmer outputs, showing that 71.96% (16933/23531) of the output words were the same.
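As an illustration of how a table like this could be produced, the sketch below computes the Levenshtein distance for each pair of stems and tallies the pairs by distance. It is not the List Analyser implementation; the function names and the sample word pairs are hypothetical and chosen only for the example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_histogram(pairs):
    """Count how many (stem_a, stem_b) pairs fall at each distance."""
    return Counter(levenshtein(x, y) for x, y in pairs)


def percent_identical(histogram):
    """Share of pairs at distance 0, as a percentage."""
    total = sum(histogram.values())
    return 100.0 * histogram[0] / total if total else 0.0


# Hypothetical stemmer outputs for three input words (not real data).
sample_pairs = [("connect", "connect"), ("relat", "relate"), ("poni", "pony")]
histogram = distance_histogram(sample_pairs)
for distance in sorted(histogram):
    print(distance, histogram[distance])
```

Applied to the figures in Table 3, percent_identical reduces to 100 * 16933 / 23531, which rounds to 71.96.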

Frakes and Fox, in "Strength & Similarity of Affix Removal Stemming Algorithms" [Ref: W1], used the Inverse Mean Modified Hamming Distance (Inverse MHD) as their similarity measure. To compare stemmers they used the Moby Common Word list combined with a Unix spelling dictionary of 20,046 words [Refs: W2, W3]. They reported a resulting total of 49,656 words with an average length of 8.07 letters.
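A minimal sketch of such a measure follows. It assumes that the modified Hamming distance counts the positions at which two stems differ over their shared length and then adds the difference in their lengths, and that the inverse measure is the reciprocal of the mean distance over all word pairs. These definitions and the function names are assumptions made for illustration; check them against the paper [Ref: W1] before relying on them.

```python
def modified_hamming(a: str, b: str) -> int:
    """Assumed definition: mismatched positions over the shared length,
    plus the difference in length between the two strings."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)  # zip stops at the shorter string
    return mismatches + abs(len(a) - len(b))


def inverse_mean_mhd(pairs) -> float:
    """Reciprocal of the mean modified Hamming distance over all pairs.

    A larger value means the two stemmers produce more similar output.
    Returns infinity when every pair is identical (mean distance of 0).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair of stems is required")
    mean = sum(modified_hamming(x, y) for x, y in pairs) / len(pairs)
    return float("inf") if mean == 0 else 1.0 / mean


# Hypothetical stem pairs, for illustration only.
print(modified_hamming("tri", "try"))
print(inverse_mean_mhd([("tri", "try"), ("connect", "connect")]))
```

Unlike the Levenshtein distance, this measure compares characters position by position, so a single inserted letter near the start of a stem can make every later position count as a mismatch.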

For more details see the List Analyser Web Help.