Results Using Word List A (Lancs)

Table 1 below reproduces the results for 'tight grouping' reported in [Chris Paice Ref: W8], which were obtained using a CISI sample file containing 9757 words.

 

Stemmer       UI      OI         SW*         ERRT
trunc4        0.062   0.000814   0.013127*   -
trunc5        0.176   0.000262   0.001487    -
trunc6        0.337   0.000073   0.000218    -
trunc7        0.527   0.000028   0.000054    -
trunc8        0.700   0.000012   0.000017    -

Lovins        0.326   0.000063   0.000193    0.92
Porter        0.374   0.000028   0.000074    0.76
Paice/Huck    0.121   0.000118   0.000978    0.55

 

*SW = OI/UI. Note that the SW values shown in Ref: W8 differ from those obtained by dividing, by hand, the OI value by the UI value shown in the table. This is because the tabulated OI and UI values are themselves rounded. For example, the Calculator in Windows 7 gives SW for trunc4 = 0.0131290322580645 (0.013129), compared with 0.013127 shown in the table above. The List Analyser also calculates SW internally to a high precision, but presents OI rounded to 8 decimal places and UI rounded to 6 decimal places.
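The rounding effect described above can be checked with a short calculation. The following is only a sketch: the helper name sw_bounds is hypothetical, and it assumes the published UI and OI were rounded to the nearest value at 3 and 6 decimal places respectively. It recomputes SW from the rounded trunc4 figures and then shows the range of SW values that unrounded inputs could have produced.

```python
def sw_bounds(ui, oi, ui_dp, oi_dp):
    """Range of OI/UI consistent with UI and OI having been rounded
    to ui_dp and oi_dp decimal places (assumes round-to-nearest)."""
    ui_half = 0.5 * 10 ** -ui_dp
    oi_half = 0.5 * 10 ** -oi_dp
    low = (oi - oi_half) / (ui + ui_half)
    high = (oi + oi_half) / (ui - ui_half)
    return low, high


# trunc4 values as published in Table 1 (already rounded).
ui, oi = 0.062, 0.000814

sw_by_hand = oi / ui
print(f"SW from rounded values: {sw_by_hand:.6f}")

low, high = sw_bounds(ui, oi, ui_dp=3, oi_dp=6)
print(f"SW range consistent with rounding: {low:.6f} to {high:.6f}")
```

The published value 0.013127 falls inside the printed range, which is consistent with SW having been computed from unrounded OI and UI.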

Word List A (sorted.txt) from the Lancs University website [Ref: W12] contains 9722 words. It was used as the input to Stemming Tester 1.4 and also as File A for the List Analyser. The stemmed file was used as File B for the List Analyser. The results are shown in Table 2. The dtSearch (default) stemmer is the 27-rule English stemmer supplied as standard with dtSearch Desktop/Network.

 

Stemmer              UI         OI           SW         ERRT
trunc4               0.068165   0.00081146   0.011904   -
trunc5               0.184311   0.00025832   0.001402   -
trunc6               0.349101   0.00007142   0.000205   -
trunc7               0.539128   0.00002733   0.000051   -
trunc8               0.706874   0.00001168   0.000017   -

Porter               0.432692   0.00002620   0.000061
dtSearch (default)   0.575495   0.00000887   0.000015

 

Word List A contains 33 words ending in 's and one ending in 't. It also contains two French words containing an apostrophe, d'etie and d'info. The Lovins stemmer removes 's by default, whereas other stemmers normally leave such normalisation to other parts of the information retrieval system.

 

The List Analyser treats Zipf and Zipf' as two separate words because of the trailing apostrophe. To avoid introducing errors into the comparisons between stemmers, and because the truncators built in to Stemming Tester 1.4 do not remove the trailing apostrophe, Word List A was cleaned using the Word List Cleaner in Stemming Tester 1.4, with the options set to remove a final 's or ' and to remove the duplicates that result.
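The cleaning step can be illustrated with a minimal sketch. This is not the Word List Cleaner's actual code, whose internals are not described here; the function name clean_word_list is hypothetical, and the sketch assumes that only a final 's or a final ' is stripped and that the first occurrence of each word is kept.

```python
def clean_word_list(words):
    """Strip a final 's or ' from each word, then drop duplicates,
    keeping the first occurrence of each word in its original order."""
    seen = set()
    cleaned = []
    for word in words:
        if word.endswith("'s"):
            word = word[:-2]
        elif word.endswith("'"):
            word = word[:-1]
        # Skip empty strings and words already emitted.
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


sample = ["Zipf", "Zipf'", "user", "user's", "d'info"]
print(clean_word_list(sample))  # ['Zipf', 'user', "d'info"]
```

Words with an internal apostrophe, such as d'info, pass through unchanged, as do words ending in 't.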

 

Table 3 shows the results obtained using sortedtest_cleaned_FD.txt (9583 words in total) as input to the Stemming Tester, with the same word list (as File A) and the resulting stemmed/truncated files (as File B) then supplied to the List Analyser.

 

 

Stemmer             UI         OI           SW         ERRT
trunc4              0.067904   0.00082268   0.012115   -
trunc5              0.183361   0.00026187   0.001428   -
trunc6              0.348456   0.00007181   0.000206   -
trunc7              0.537476   0.00002732   0.000051   -
trunc8              0.707138   0.00001170   0.000017   -

Porter              0.393825   0.00002697   0.000068
dtSearch(default)   0.541547   0.00000917   0.000017

 

 

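Table 3 shows that the standard dtSearch English stemmer is a very light stemmer with very few over-stemming errors: its OI (0.00000917) is roughly a third of Porter's (0.00002697), while its UI (0.541547) is higher than Porter's (0.393825).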
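As a rough cross-check, the SW column of Table 3 can be recomputed from the tabulated UI and OI values and used to order the stemmers. The sketch below assumes that SW = OI/UI is read as a stemming weight in which a smaller value indicates a lighter stemmer; the names TABLE3 and stemming_weights are illustrative only.

```python
# (UI, OI) pairs copied from Table 3.
TABLE3 = {
    "trunc4": (0.067904, 0.00082268),
    "trunc5": (0.183361, 0.00026187),
    "trunc6": (0.348456, 0.00007181),
    "trunc7": (0.537476, 0.00002732),
    "trunc8": (0.707138, 0.00001170),
    "Porter": (0.393825, 0.00002697),
    "dtSearch(default)": (0.541547, 0.00000917),
}


def stemming_weights(table):
    """Return SW = OI / UI for every entry in the table."""
    return {name: oi / ui for name, (ui, oi) in table.items()}


weights = stemming_weights(TABLE3)

# Order from lightest (smallest SW) to heaviest (largest SW).
lightest_first = sorted(weights, key=weights.get)
for name in lightest_first:
    print(f"{name:<18} SW = {weights[name]:.6f}")
```

On these figures dtSearch(default) sits next to trunc8 at the light end of the ordering, both rounding to 0.000017, with Porter several times heavier.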