Stemmer Similarity Metrics


Inverse Mean MHD (Inverse of the mean Modified Hamming Distance)


"The stemmer similarity metric M for a pair of stemming algorithms A1 and A2, given a word list W, is the inverse of the mean modified Hamming distance (d) for all words in the word list, M(A1,A2,W) = n/Σ d(xi,yi), for i ranging from 1 to n, where n is the size of W and for all words wi in W, xi is the result of the application of A1 to wi and yi is the is the result of the application of A2 to wi. The inverse of the mean is used so that more similar algorithms will have higher values of M.


For example, suppose W = {brittle, engineered, fairies} and that stemming algorithm A1 produces the stems {brit, engineer, fairy} from W, and stemming algorithm A2 produces {britt, engineered, fairi} from W. Then M(A1,A2,W) is computed by dividing the wordlist size (3) by the sum of the Modified Hamming Distances between the stems produced by each stemmer for each word (for example, the Modified Hamming Distance between brit and britt is 1). The result is 3/(1+2+1) = 0.75 as our measure of the similarity of algorithms A1 and A2." Frakes and Fox [Ref: W1]


Test files SSM1 and SSM2 are based on the above and will give an Inverse Mean MHD of 0.75


Improved SSM (SSM*)


The Inverse Mean MHD method suggested above leaves the following problems:

A new metric was therefore developed which, though still based on the Hamming distance, also takes into account the lengths of the words being compared. It uses the idea that there is a maximum and minimum Modified Hamming Distance that can be obtained between two strings. The minimum distance is zero, when both strings are identical, whereas the maximum distance MD is the length of the longer string. The relative distance RD is obtained from the Modified Hamming Distance MHD thus:



RD is labelled Rel MHD in List Analyser.


Using test files SSM3 and SSM4 will give:

SSM3 words Length MHD Rel MHD  
reds 4 1 0.25  
methylenedioxymethamphetamins 29 1 0.0345  
engineered 10 0 0  


Therefore, by calculating the sum of the relative distances (Rel MHD) for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:



U & V are the Stemmers being compared,

N = Number of words in the sample

MHD = Modified Hamming Distance

MD = Maximum Distance


For the test files SSM3 and SSM4 the value of SSM* will therefore be:


100 * ((0.25 + 0.0345 + 0)/3 ) - 1) = 90.517%