The median and mean Modified Hamming Distance (MHD)

"The median and mean modified Hamming distance between words and their stems—The Hamming distance between two strings of equal length is defined as the number of characters in the two strings that are different at the same position. For strings of unequal length we add the difference in length to the Hamming distance to give a modified Hamming distance function d. This measure takes into account transformations of stem endings. For example, a stemming algorithm might reduce the corpus { try, tried, trying } to the stem tri. The mean modified Hamming distance between the original words and the stem is (1+2+4)/3 = 2.33 characters, and the median is 2. " Frakes & Fox [Ref: W1].




"Suppose the string lengths are P and Q, where P< Q, we use the formula MHD = HD(1,P) + (Q-P) where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer, suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing "parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every word in the original sample."
Source: Lancs University Website:
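
The formula MHD = HD(1,P) + (Q-P) might be sketched in Python as follows, keeping the P and Q naming from the passage. The function name mhd is an assumption, and the word-to-stem sample is hypothetical: only the "parties" to "party" pair comes from the text, and the other two pairs are added purely to illustrate averaging over a sample.

```python
def mhd(s: str, t: str) -> int:
    """Sketch of MHD = HD(1,P) + (Q-P); the name mhd is an assumption."""
    # Order the two strings so that P is the shorter length and Q the longer.
    short, long_ = sorted((s, t), key=len)
    p, q = len(short), len(long_)
    # HD(1,P): mismatches over the first P characters of both strings.
    hd_1_p = sum(1 for i in range(p) if short[i] != long_[i])
    return hd_1_p + (q - p)


# Hypothetical sample of word -> stem pairs. Only "parties" -> "party"
# appears in the text; the others are illustrative assumptions.
sample = {
    "parties": "party",
    "connected": "connect",
    "connection": "connect",
}

per_word = {word: mhd(word, stem) for word, stem in sample.items()}
average = sum(per_word.values()) / len(per_word)

print(per_word["parties"])  # 3
print(round(average, 2))    # 2.67
```

When the two strings have equal length, Q-P is zero and the result reduces to the ordinary Hamming distance.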