Test Files - Detail

 

 

Mean Characters Removed

File A = SS1.txt    File B = SS1_trunc8.txt
engineer            engineer
engineered          engineer
engineering         engineer
engineers           engineer

The above output removes an average of (0+2+3+1)/4 = 1.5 characters per word. A weakness of this metric is that it counts only the difference in length and does not measure transformations of stem endings; the Mean MHD method below overcomes this problem.
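The calculation above can be reproduced with a short Python sketch. The function name `mean_characters_removed` is hypothetical and is not part of List Analyser; it simply averages the length difference between each word and its stem.

```python
def mean_characters_removed(pairs):
    """Average of len(word) - len(stem) over (word, stem) pairs."""
    removed = [len(word) - len(stem) for word, stem in pairs]
    return sum(removed) / len(removed)


# Word/stem pairs taken from SS1.txt and SS1_trunc8.txt
pairs = [
    ("engineer", "engineer"),
    ("engineered", "engineer"),
    ("engineering", "engineer"),
    ("engineers", "engineer"),
]

print(mean_characters_removed(pairs))  # 1.5
```

A pair such as try/tri scores 0 under this metric even though the ending has changed, which is the weakness described above.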

 

Mean MHD

File A = SS2.txt    File B = SS2_stem3.txt
try                 tri
tried               tri
trying              tri

The mean Modified Hamming Distance (Mean MHD) between the original words and their stems is (1+2+4)/3 = 2.33 characters, and the median is 2.
List Analyser displays only the inverse mean MHD, which is 1/2.33 = 0.4292.
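As a rough illustration, the figures above can be checked in Python using the definition MHD = HD(1,P) + (Q-P) given under Stemmer Strength. The helper `mhd` is a hypothetical name, and the sketch assumes P is the length of the shorter string and Q the length of the longer one. Without intermediate rounding the inverse is 3/7, about 0.4286; the 0.4292 quoted above comes from inverting the rounded mean 2.33.

```python
from statistics import median


def mhd(a, b):
    """Modified Hamming distance: mismatches over the first P characters
    plus the length difference Q - P (P = shorter length, Q = longer)."""
    p, q = sorted((len(a), len(b)))
    # zip stops at the shorter string, so only the first P positions are compared
    return sum(x != y for x, y in zip(a, b)) + (q - p)


# Word/stem pairs taken from SS2.txt and SS2_stem3.txt
pairs = [("try", "tri"), ("tried", "tri"), ("trying", "tri")]
distances = [mhd(word, stem) for word, stem in pairs]
mean_mhd = sum(distances) / len(distances)

print(distances)               # [1, 2, 4]
print(median(distances))       # 2
print(round(mean_mhd, 2))      # 2.33
print(round(1 / mean_mhd, 4))  # 0.4286
```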

 

 

Stemming Similarity Metric (Inverse Mean MHD)

File A = SSM1.txt    File B = SSM2.txt
brittle              britt
engineered           engineered
fairies              fairi

Inverse mean MHD = 1/((2+0+2)/3) = 0.75

 

Stemming Similarity Metric (SSM*)

File A = SSM3.txt                File B = SSM4.txt
reds                             red
engineered                       engineered
methylenedioxymethamphetamins    methylenedioxymethamphetamin

Inverse mean MHD = 1/((1+1+0)/3) = 1.500

SSM* = (0.03 + 0.25 + 0)/3 = 0.28

 

Stemmer Strength

File A = react.txt    File B = react_trunc5.txt    MHD
react                 react                        0
reacts                react                        1
reacting              react                        3
reacted               react                        2
reaction              react                        3
reactions             react                        4
reactive              react                        3
reactivity            react                        5
reactivities          react                        7

 

Mean Conflation Class size = 9/1 = 9 (9 distinct words conflate to 1 distinct stem)

Compression Factor = (9 - 1)/9 = 0.889, rounded to 3 decimal places.

Mean Characters Removed = (0 + 1 + 3 + 2 + 3 + 4 + 3 + 5 + 7)/9 = 3.111

MHD = HD(1,P) + (Q-P), where P is the length of the shorter string, Q is the length of the longer string, and HD(1,P) is the Hamming distance over the first P characters of both strings.

Mean MHD = the average MHD value over every word in the original sample = 3.111
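The two strength figures that depend on class counts can be sketched as follows. The function name `stemmer_strength` is hypothetical, and the sketch assumes that both measures are computed from the number of distinct words and the number of distinct stems, as the arithmetic above suggests.

```python
def stemmer_strength(words, stems):
    """Return (mean conflation class size, compression factor)."""
    n_words = len(set(words))  # distinct original words
    n_stems = len(set(stems))  # distinct stems after stemming
    return n_words / n_stems, (n_words - n_stems) / n_words


words = ["react", "reacts", "reacting", "reacted", "reaction",
         "reactions", "reactive", "reactivity", "reactivities"]
# Truncate to 5 characters, matching the contents of react_trunc5.txt
stems = [word[:5] for word in words]

size, factor = stemmer_strength(words, stems)
print(size)              # 9.0
print(round(factor, 3))  # 0.889
```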

 

Error counting

Source: Lancs University. UI (understemming index) = 0.545, OI (overstemming index) = 0

 

English2Grouped.txt    English2Grouped_trunc5.txt
divide                 divid
dividing               divid
divided                divid
division               divis
divisor                divis
====                   ====
divine                 divin
divination             divin
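
The UI and OI values quoted for this example can be reproduced with the sketch below. It assumes Paice-style error counting: UI is the share of desired within-group merges that the stemmer fails to make, and OI is the share of possible cross-group merges that it wrongly makes. The function name `paice_indices` and the choice of normalisation for OI are assumptions for illustration, and List Analyser may compute these differently.

```python
from collections import Counter


def paice_indices(groups, stem):
    """Return (UI, OI) for a list of concept groups and a stemming function."""
    total = sum(len(group) for group in groups)

    desired_merges = unachieved = desired_non_merges = 0.0
    for group in groups:
        n = len(group)
        desired_merges += 0.5 * n * (n - 1)          # pairs that should share a stem
        desired_non_merges += 0.5 * n * (total - n)  # pairs that should stay apart
        counts = Counter(stem(word) for word in group)
        unachieved += 0.5 * sum(u * (n - u) for u in counts.values())

    # Count pairs that share a stem but come from different groups
    by_stem = {}
    for index, group in enumerate(groups):
        for word in group:
            by_stem.setdefault(stem(word), Counter())[index] += 1
    wrong_merges = 0.0
    for counts in by_stem.values():
        size = sum(counts.values())
        wrong_merges += 0.5 * sum(v * (size - v) for v in counts.values())

    ui = unachieved / desired_merges if desired_merges else 0.0
    oi = wrong_merges / desired_non_merges if desired_non_merges else 0.0
    return ui, oi


# Groups from English2Grouped.txt; "====" separates the groups in the file
groups = [
    ["divide", "dividing", "divided", "division", "divisor"],
    ["divine", "divination"],
]
ui, oi = paice_indices(groups, lambda word: word[:5])  # truncate to 5 characters
print(round(ui, 3), oi)  # 0.545 0.0
```

In the first group, 6 of the 10 desired merges are missed because the stems divid and divis differ; the second group is fully merged, so UI = 6/11 = 0.545. No stem is shared between the two groups, so OI = 0.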