Inverse Mean MHD (Inverse of the mean Modified Hamming Distance)

"The stemmer similarity metric M for a pair of stemming algorithms A1 and A2, given a word list W, is the inverse of the mean modified Hamming distance (d) for all words in the word list, M(A1,A2,W) = n/Σ d(xi,yi), for i ranging from 1 to n, where n is the size of W and for all words wi in W, xi is the result of the application of A1 to wi and yi is the is the result of the application of A2 to wi. The inverse of the mean is used so that more similar algorithms will have higher values of M.

For example, suppose W = {brittle, engineered, fairies} and that stemming algorithm A1 produces the stems {brit, engineer, fairy} from W, and stemming algorithm A2 produces {britt, engineered, fairi} from W. Then M(A1,A2,W) is computed by dividing the wordlist size (3) by the sum of the Modified Hamming Distances between the stems produced by each stemmer for each word (for example, the Modified Hamming Distance between brit and britt is 1). The result is 3/(1+2+1) = 0.75 as our measure of the similarity of algorithms A1 and A2." Frakes and Fox [Ref: W1]

Test files SSM1 and SSM2 are based on the above and will give an Inverse Mean MHD of 0.75

Improved SSM (SSM*)

The Inverse Mean MHD method suggested above leaves the following problems:

- It would be better if the result could be normalised to give a value between 1 and 0,
- The above metric fails to take into
consideration the overall lengths of the two strings. Thus, removing
‘s' from the word ‘reds' to give ‘red', is a removal of 1/4 which
is 25% of the word, and would have a Hamming distance of 1. Removing
the suffix ‘s' from the word ‘methylenedioxymethamphetamins' to give
‘methylenedioxymethamphetamin' is a removal of 1/29 which is 3.4%
of the word, yet this also gives a Modified Hamming Distance (MHD)
of 1.

A new metric was therefore developed
which, though still based on the Hamming distance, also takes into account
the lengths of the words being compared. It uses the idea that there is
a maximum and minimum Modified Hamming Distance that can be obtained between
two strings. The minimum distance is zero, when both strings are identical,
whereas the maximum distance *MD *is
the length of the longer string. The relative distance *RD *is
obtained from the Modified Hamming Distance *MHD *thus:

RD is labelled Rel MHD in List Analyser.

Using test files SSM3 and SSM4 will give:

SSM3 words | Length | MHD | Rel MHD | |

reds | 4 | 1 | 0.25 | |

methylenedioxymethamphetamins | 29 | 1 | 0.0345 | |

engineered | 10 | 0 | 0 |

Therefore, by calculating the sum of the relative distances (Rel MHD) for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:

where

U & V are the Stemmers being compared,

N = Number of words in the sample

MHD = Modified Hamming Distance

MD = Maximum Distance

For the test files SSM3 and SSM4 the value of SSM* will therefore be:

100 * ((0.25 + 0.0345 + 0)/3 ) - 1) = 90.517%

*Source: http://www.comp.lancs.ac.uk/computing/research/stemming/Links/similarity.htm