List Analyser uses the method described by Chris Paice in "Method for Evaluation of Stemming Algorithms Based on Error Counting" [Ref: W8].

UI = Under-Stemming Index

This is given by UI = 1 - CI, where CI is the Conflation Index, the proportion of equivalent word pairs which were successfully grouped to the same stem.

OI = Over-Stemming Index

This is given by OI = 1 - DI, where DI is the Distinctness Index, the proportion of non-equivalent word pairs which remained distinct after stemming.

SW = Stemmer Weight = OI/UI
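The three definitions above can be sketched in a few lines of Python. This is only an illustration of the definitions, not List Analyser's own code: the function name paice_indices, the parallel-list input and the choice to return None for SW when UI is zero are all assumptions made for the example, and the toy data is invented.

```python
from itertools import permutations

def paice_indices(groups, stems):
    """Return (UI, OI, SW) for parallel lists of group ids and stems.

    Hypothetical helper: every ordered pair of words is counted once.
    """
    same_total = same_merged = 0      # equivalent pairs / those given one stem
    diff_total = diff_distinct = 0    # non-equivalent pairs / those kept apart
    for i, j in permutations(range(len(groups)), 2):
        if groups[i] == groups[j]:
            same_total += 1
            same_merged += stems[i] == stems[j]
        else:
            diff_total += 1
            diff_distinct += stems[i] != stems[j]
    ci = same_merged / same_total if same_total else 1.0    # Conflation Index
    di = diff_distinct / diff_total if diff_total else 1.0  # Distinctness Index
    ui, oi = 1 - ci, 1 - di
    sw = oi / ui if ui else None  # assumption: SW is left undefined when UI = 0
    return ui, oi, sw

# Invented toy data: three words in group "a", one word in group "b".
ui, oi, sw = paice_indices(["a", "a", "a", "b"], ["x", "x", "y", "y"])
print(round(ui, 3), round(oi, 3), round(sw, 3))  # 0.667 0.333 0.5
```

Counting ordered pairs (each pair in both directions) does not change the ratios, and it matches the way the verification tables in this section are totalled.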

Verification:

Open the sample file English2Grouped.txt as File A and English2Grouped_Trunc5.txt as File B.

Set the Levenshtein Range at 0 to 32 and click on the Calculate button.

The results shown in the Error Count group box should be UI = 0.545 and OI = 0.

These results are taken from the examples at: www.comp.lancs.ac.uk/computing/research/stemming/Links/error.htm

English2Grouped.txt | English2Grouped_Trunc5.txt
divide dividing divided division divisor ==== divine divination | divid divid divid divis divis ==== divin divin
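As a rough illustration of how the two sample lists line up, the sketch below splits each list into groups and pairs every word with its stem. It assumes that tokens are separated by whitespace, that ==== marks the boundary between groups, and that File B lists the stems in the same order as the words in File A. The function name parse_groups is hypothetical, and the real files may be laid out differently (for example, one word per line).

```python
def parse_groups(text, separator="===="):
    """Split whitespace-separated tokens into groups at each separator token."""
    groups, current = [], []
    for token in text.split():
        if token == separator:
            if current:
                groups.append(current)
            current = []
        else:
            current.append(token)
    if current:
        groups.append(current)
    return groups

file_a = "divide dividing divided division divisor ==== divine divination"
file_b = "divid divid divid divis divis ==== divin divin"

# Pair each word with its stem and tag both with the index of their group.
triples = [
    (group_id, word, stem)
    for group_id, (words, stems) in enumerate(
        zip(parse_groups(file_a), parse_groups(file_b))
    )
    for word, stem in zip(words, stems)
]
for triple in triples:
    print(triple)
```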

UI can be calculated by plotting the results in a table thus:

           | divide | dividing | divided | division | divisor | divine | divination
divide     |        | 1        | 1       | 0        | 0       |        |
dividing   | 1      |          | 1       | 0        | 0       |        |
divided    | 1      | 1        |         | 0        | 0       |        |
division   | 0      | 0        | 0       |          | 1       |        |
divisor    | 0      | 0        | 0       | 1        |         |        |
divine     |        |          |         |          |         |        | 1
divination |        |          |         |          |         | 1      |

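1 = identical stems from the same input group, i.e. a successful conflation.

0 = different stems from the same input group, i.e. an under-stemming error.

Blank = a pair that is not counted here (a word paired with itself, or two words from different input groups).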

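The total number of possible matches (all 0 and 1 results) is 22. Each pair is counted in both directions, giving 5 × 4 = 20 for the first group plus 2 × 1 = 2 for the second.

The total with result 1 is 10.

Hence UI = 1 - (10/22) = 0.545 (to three decimal places).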
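The count can be checked with a short script. This is a sketch rather than the program's own routine: the sample words and stems are typed in by hand, and the helper name ui_cell is hypothetical.

```python
words = ["divide", "dividing", "divided", "division", "divisor", "divine", "divination"]
group = [0, 0, 0, 0, 0, 1, 1]  # input group of each word in File A
stems = ["divid", "divid", "divid", "divis", "divis", "divin", "divin"]  # File B

def ui_cell(i, j):
    """Return 1 or 0 for a same-group pair, or None for a cell that is not counted."""
    if i == j or group[i] != group[j]:
        return None
    return int(stems[i] == stems[j])

cells = [ui_cell(i, j) for i in range(len(words)) for j in range(len(words))]
counted = [c for c in cells if c is not None]
total, ones = len(counted), sum(counted)
ui = 1 - ones / total
print(total, ones, round(ui, 3))  # 22 10 0.545
```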

OI can be calculated by plotting the results in a table thus:

           | divide | dividing | divided | division | divisor | divine | divination
divide     |        | x        | x       | x        | x       | 1      | 1
dividing   | x      |          | x       | x        | x       | 1      | 1
divided    | x      | x        |         | x        | x       | 1      | 1
division   | x      | x        | x       |          | x       | 1      | 1
divisor    | x      | x        | x       | x        |         | 1      | 1
divine     | 1      | 1        | 1       | 1        | 1       |        | x
divination | 1      | 1        | 1       | 1        | 1       | x      |

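1 = different stems from different input groups, i.e. a successful stemming.

0 = a pair of words from different input groups with the same stem, i.e. an over-stemming error.

x = a pair of words from the same input group, which is not counted here. Blank = a word paired with itself.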

The total number of possible matches (all 0 and 1 results) is 20. Each of the 5 × 2 = 10 cross-group pairs is counted in both directions.

The total with result 1 is 20.

Hence OI = 1 - (20/20) = 0.
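The same check can be scripted for OI. The sketch below is illustrative only: the function name over_stemming_index is hypothetical, and the stems are produced by cutting each word to its first five letters, which reproduces the File B stems listed above. For contrast it also tries a four-letter cut. That second case is an invented example, not one of the sample files, and shows what a table full of 0 results would look like.

```python
words = ["divide", "dividing", "divided", "division", "divisor", "divine", "divination"]
group = [0, 0, 0, 0, 0, 1, 1]  # input group of each word in File A

def over_stemming_index(stems):
    """Return (total cross-group pairs, pairs kept distinct, OI)."""
    total = distinct = 0
    for i in range(len(words)):
        for j in range(len(words)):
            if group[i] != group[j]:  # only pairs from different groups count
                total += 1
                distinct += stems[i] != stems[j]
    return total, distinct, 1 - distinct / total

trunc5 = [w[:5] for w in words]  # divid, divid, divid, divis, divis, divin, divin
trunc4 = [w[:4] for w in words]  # hypothetical harsher cut: every stem is "divi"

print(over_stemming_index(trunc5))  # (20, 20, 0.0)
print(over_stemming_index(trunc4))  # (20, 0, 1.0)
```

With the four-letter cut every cross-group pair shares a stem, so every cell would be 0 and OI rises to 1.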