Comparison of the strength of the dtSearch and Porter stemmers


Method 1 - Using Stemming Tester 1.4

Stemmer strength

A number of methods have been suggested for measuring the strength of a stemmer.

The mean number of characters removed in forming stems is "a gross measure of the stemmer strength; stronger stemmers remove more characters from words to make stems." (Frakes and Fox 1999).  
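The measure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the calculation performed by Stemming Tester: the function name mean_chars_removed and the sample word/stem pairs are assumptions made for the example, and the two lists are assumed to be aligned so that the n-th stem belongs to the n-th word.

```python
def mean_chars_removed(words, stems):
    """Mean of len(word) - len(stem) over aligned word/stem pairs."""
    if len(words) != len(stems):
        raise ValueError("words and stems must be aligned one-to-one")
    if not words:
        raise ValueError("the word list is empty")
    removed = sum(len(w) - len(s) for w, s in zip(words, stems))
    return removed / len(words)


# Hypothetical word/stem pairs, for illustration only.
words = ["engineer", "engineering", "stemmers", "removed"]
stems = ["engineer", "engineer", "stemmer", "remov"]

print(mean_chars_removed(words, stems))  # 1.5
```

Because the lists are aligned, this value is the same as the mean word length of the input list minus the mean word length of the stemmed list, which is the comparison made in the steps below.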

To measure the mean number of characters removed, download the sample vocabulary (voc.txt) and associated output file (output.txt) from the official Porter stemmer website.

1) In Stemming Tester 1.4, select Use List of Words... from the Options menu and browse to the voc.txt file.

2) From the Tools menu select Show List of Words Analysis. The result should be as in Fig 1a below.

3) From the Options menu select Use List of Words... and browse to the output.txt file.

4) From the Tools menu select Show List of Words Analysis. The result should be as in Fig 1b below.

5) From the File menu select Open Stemming File... and browse to the dtSearch stemming.dat file, which will normally be in your Program Files\dtSearch\bin\ folder. Press the Stem button and wait for ==End== to appear in the Result text box.

6) From the Options menu select Use List of Words..., browse to the Documents\StemmingTester folder and open the file voc_ST_English.txt.

7) From the Tools menu select Show List of Words Analysis. The result should be as in Fig 1c below.

 

Fig 1a. voc.txt input file
Fig 1b. output.txt from Porter
Fig 1c. Output from Stemming Tester 1.4

Note that dtSearch does not stem words of three letters or fewer, whereas the Porter stemmer has stemmed many of the three-letter words and, as a result, has added a further 121 two-letter words to output.txt.

The results show that both stemmers reduce the mean word length of the voc.txt file by 1.14785 ± 0.000915 characters (0.797%), with dtSearch being very slightly stronger.

This measurement of stemmer strength is very crude and applies only to affix-removal stemmers: it ignores cases where endings have been modified instead of being removed. Frakes and Fox state that another measurement of stemmer strength is the number of words and stems that differ: "stemmers often leave words unchanged. For example, a stemmer might not alter 'engineer' because it is already a root word. Stronger stemmers will change words more often than weaker stemmers." [Frakes and Fox Ref: W1].
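As a rough illustration of this second measure, the share of words that a stemmer changes can be computed as below. The function name fraction_changed and the sample pairs are hypothetical, chosen only for this sketch, and the word and stem lists are again assumed to be aligned one-to-one.

```python
def fraction_changed(words, stems):
    """Share of words whose stem differs from the original word."""
    if len(words) != len(stems):
        raise ValueError("words and stems must be aligned one-to-one")
    if not words:
        raise ValueError("the word list is empty")
    changed = sum(1 for w, s in zip(words, stems) if w != s)
    return changed / len(words)


# Hypothetical word/stem pairs, for illustration only.
words = ["engineer", "engineering", "stemmers", "removed"]
stems = ["engineer", "engineer", "stemmer", "remov"]

print(fraction_changed(words, stems))  # 0.75
```

Unlike the mean number of characters removed, this count also registers a word whose ending was modified without becoming shorter.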

Method 2 - Using List Analyser 1.0

Using the output.txt file and voc_ST_English.txt as inputs to the List Analyser gives the results shown in Fig 2a and 2b.

 

Levenshtein Distance    Word Count
0                       9820
1                       5557
2                       3773
3                       3025
4                       1018
5                       236
6                       7
7                       27
8                       1
9                       1

Fig 2a. Results for the dtSearch English stemmer

A Levenshtein distance of 0 means that the two words being compared were identical. [Ref: W4]
Fig 2a shows that 41.73% (9820/23531) of the words from the voc.txt file were unchanged.
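For readers who want to reproduce this kind of table outside List Analyser, the following is a minimal sketch of a Levenshtein distance histogram. It is not List Analyser's own code: the function names levenshtein and distance_histogram and the sample pairs are assumptions for illustration, and the distance is the standard insert/delete/substitute edit distance with unit costs.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_histogram(words, stems):
    """Count aligned word/stem pairs at each Levenshtein distance."""
    return Counter(levenshtein(w, s) for w, s in zip(words, stems))


# Hypothetical word/stem pairs, for illustration only.
words = ["engineer", "engineering", "stemmers", "removed"]
stems = ["engineer", "engineer", "stemmer", "remov"]

hist = distance_histogram(words, stems)
for distance in sorted(hist):
    print(distance, hist[distance])

# Share of words left unchanged (distance 0), as a percentage.
print(round(100 * hist[0] / len(words), 2))
```

The count at distance 0 divided by the total number of words gives the unchanged percentage quoted for the figure.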

 

Levenshtein Distance    Word Count
0                       9129
1                       6910
2                       3257
3                       2582
4                       1104
5                       435
6                       91
7                       21
8                       1
9                       1

Fig 2b. Results for the Porter stemmer

Fig 2b shows that 38.80% (9129/23531) of the words from the voc.txt file were unchanged.

These results indicate that dtSearch is a slightly weaker stemmer, which contradicts the earlier result.

The problem with these crude measurements is that they are highly dependent on the input word list. For example, since the dtSearch stemmer does not change words of three letters or fewer, the results obviously depend on the percentage of such short words in the list. These methods also do not take into account whether the differences in the outputs are due to over-stemming or under-stemming errors.
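The dependence on the word list can be seen with a small sketch that recomputes the unchanged share after excluding short words. The function name unchanged_share, its min_length parameter and the sample pairs are all hypothetical. The pairs were made up to mimic a stemmer that leaves words of three letters or fewer untouched, and are not actual dtSearch output.

```python
def unchanged_share(pairs, min_length=0):
    """Share of (word, stem) pairs left unchanged, counting only
    words with at least min_length letters."""
    kept = [(w, s) for w, s in pairs if len(w) >= min_length]
    if not kept:
        raise ValueError("no words of the requested length")
    return sum(1 for w, s in kept if w == s) / len(kept)


# Hypothetical (word, stem) pairs, for illustration only.
pairs = [
    ("the", "the"),
    ("was", "was"),
    ("cats", "cat"),
    ("running", "run"),
    ("engineer", "engineer"),
]

print(unchanged_share(pairs))                # all words: 0.6
print(unchanged_share(pairs, min_length=4))  # words of 4+ letters only
```

In this toy list the unchanged share falls from three in five to one in three once the short words are excluded, so the same stemmer would appear stronger or weaker depending on how many short words the list happens to contain.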

For further information on more advanced methods for calculating stemmer strength see the List Analyser WebHelp.