Error Counting

List Analyser uses the method described by Chris Paice in "Method for the Evaluation of Stemming Algorithms Based on Error Counting" [Ref: W8].

 

UI = Under-Stemming Index

 

This is given by UI = 1 - CI, where CI is the Conflation Index: the proportion of equivalent word pairs which were successfully reduced to the same stem.

 

OI = Over-Stemming Index

 

This is given by OI = 1 - DI, where DI is the Distinctness Index: the proportion of non-equivalent word pairs which remained distinct after stemming.

 

SW = Stemmer Weight = OI/UI
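
The three definitions above can be sketched in Python as follows. This is an illustration only, not List Analyser's own code: the function name error_indices, its arguments, and the choice to return None for SW when UI is 0 are assumptions made for the example. It also assumes each word appears in only one group. Pairs are counted in both directions, which does not change the ratios.

```python
from itertools import permutations

def error_indices(groups, stems):
    """Return (UI, OI, SW) for grouped words and a word -> stem mapping.

    groups: list of lists of words; words in one list are equivalent.
    stems:  dict mapping every word to the stem it was reduced to.
    """
    group_of = {w: i for i, g in enumerate(groups) for w in g}
    same_total = same_merged = 0      # equivalent pairs / those sharing a stem
    diff_total = diff_distinct = 0    # non-equivalent pairs / those kept apart
    for a, b in permutations(group_of, 2):   # ordered pairs, a != b
        if group_of[a] == group_of[b]:
            same_total += 1
            same_merged += stems[a] == stems[b]
        else:
            diff_total += 1
            diff_distinct += stems[a] != stems[b]
    # Assumption: with no pairs to count, treat the index as perfect (1.0).
    ci = same_merged / same_total if same_total else 1.0
    di = diff_distinct / diff_total if diff_total else 1.0
    ui, oi = 1 - ci, 1 - di
    sw = oi / ui if ui else None      # assumption: SW is undefined when UI is 0
    return ui, oi, sw

# Example: the sample words, stemmed by truncation to 5 characters.
groups = [["divide", "dividing", "divided", "division", "divisor"],
          ["divine", "divination"]]
stems = {w: w[:5] for g in groups for w in g}
ui, oi, sw = error_indices(groups, stems)
print(round(ui, 3), oi, sw)   # prints: 0.545 0.0 0.0
```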

 

Verification:

Open the sample file English2Grouped.txt as File A and English2Grouped_Trunc5.txt as File B.

Set the Levenshtein Range to 0 to 32 and click the Calculate button.

The results shown in the Error Count group box should be UI = 0.545 and OI = 0.

 

These results are taken from the examples at: www.comp.lancs.ac.uk/computing/research/stemming/Links/error.htm

 

English2Grouped.txt      English2Grouped_Trunc5.txt
divide                   divid
dividing                 divid
divided                  divid
division                 divis
divisor                  divis
====                     ====
divine                   divin
divination               divin
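
Judging from the listing above, a grouped file holds one word per line, with a line of ==== between groups. A minimal Python sketch of reading such text into groups, assuming exactly that layout (the function name parse_groups is hypothetical and is not part of List Analyser):

```python
def parse_groups(text, separator="===="):
    """Split grouped-file text into a list of word groups."""
    groups, current = [], []
    for line in text.splitlines():
        word = line.strip()
        if not word:
            continue                  # ignore blank lines
        if word == separator:
            if current:               # close the group collected so far
                groups.append(current)
            current = []
        else:
            current.append(word)
    if current:                       # last group has no trailing separator
        groups.append(current)
    return groups

sample_a = """divide
dividing
divided
division
divisor
====
divine
divination"""

print(parse_groups(sample_a))
```

Stems in File B are kept as a list rather than a set, so that each stem stays lined up with the word at the same position in File A.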

 

UI can be calculated by setting out the results in a table, with every word compared against every other word:

 

               
            divide     dividing   divided    division   divisor    divine     divination
divide                 1          1          0          0
dividing    1                     1          0          0
divided     1          1                     0          0
division    0          0          0                     1
divisor     0          0          0          1
divine                                                                        1
divination                                                         1

 

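1 = identical stems for two words from the same input group, i.e. a successful conflation.

0 = different stems for two words from the same input group, i.e. an under-stemming error.

Blank cells are pairs that are not counted: a word paired with itself, or two words from different input groups.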

 

The total number of possible matches (all cells containing 0 or 1) is 22.

The number of cells containing 1 is 10.

 

Hence UI = 1 - (10/22) = 0.545

 

 

OI can be calculated by setting out the results in a table in the same way:

 

               
            divide     dividing   divided    division   divisor    divine     divination
divide                 x          x          x          x          1          1
dividing    x                     x          x          x          1          1
divided     x          x                     x          x          1          1
division    x          x          x                     x          1          1
divisor     x          x          x          x                     1          1
divine      1          1          1          1          1                     x
divination  1          1          1          1          1          x

 

 

1 = different stems for two words from different input groups, i.e. a successful stemming.

0 = the same stem for two words from different input groups, i.e. an over-stemming error.

x = a pair of words from the same input group; these pairs are not counted here.

 

The total number of possible matches (all cells containing 0 or 1) is 20.

The number of cells containing 1 is 20.

 

Hence OI = 1 - (20/20) = 0
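
The cell counts in both tables can be checked with a few lines of Python. This is a sketch for checking the arithmetic only: the stems are typed in by hand from English2Grouped_Trunc5.txt, and the variable names are arbitrary.

```python
from itertools import permutations

# Stems from File B, one list per input group, in file order.
file_b = [["divid", "divid", "divid", "divis", "divis"],
          ["divin", "divin"]]

# (group number, stem) for every word position
entries = [(g, stem) for g, stems in enumerate(file_b) for stem in stems]

# Ordered pairs of distinct positions, as in the tables (each pair appears twice).
within = [(a, b) for a, b in permutations(entries, 2) if a[0] == b[0]]
across = [(a, b) for a, b in permutations(entries, 2) if a[0] != b[0]]

ui_ones = sum(a[1] == b[1] for a, b in within)   # same group, same stem
oi_ones = sum(a[1] != b[1] for a, b in across)   # different group, different stem

print(len(within), ui_ones, round(1 - ui_ones / len(within), 3))  # 22 10 0.545
print(len(across), oi_ones, 1 - oi_ones / len(across))            # 20 20 0.0
```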