Testing - Overview

A number of test files are supplied with List Analyser. Their purpose is to verify that the descriptive statistics produced by List Analyser agree with the figures published in the academic papers listed in the Reference section.


The test files can be downloaded from www.dtsearch.co.uk/support/resources/LA100_test.zip



Test File Name Related To



Compression Factor

Mean Conflation Class Size

Use alpha2.txt for File A and alpha2_trunc1.txt for File B for the strongest possible results. alpha2.txt contains 52 words, two beginning with each letter of the alphabet, so alpha2_trunc1.txt yields 26 distinct stems. The Compression Factor is therefore (52 - 26)/52 = 1/2 = 50% and the Mean Conflation Class Size is 52/26 = 2.
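The arithmetic above can be checked with a minimal sketch. It assumes both statistics are computed from counts of distinct words and distinct stems. The function names and the synthetic word list are hypothetical stand-ins, not List Analyser code or the real contents of alpha2.txt.

```python
import string


def compression_factor(words, stems):
    """(distinct words - distinct stems) / distinct words."""
    n_words = len(set(words))
    n_stems = len(set(stems))
    return (n_words - n_stems) / n_words


def mean_conflation_class_size(words, stems):
    """Distinct words divided by distinct stems."""
    return len(set(words)) / len(set(stems))


# Hypothetical stand-in for alpha2.txt: two distinct words per letter (52 words).
words = [letter + suffix
         for letter in string.ascii_lowercase
         for suffix in ("xa", "xb")]
# Hypothetical stand-in for alpha2_trunc1.txt: keep the first character only.
stems = [word[:1] for word in words]

print(compression_factor(words, stems))          # 0.5
print(mean_conflation_class_size(words, stems))  # 2.0
```

Passing the same list as both arguments gives a Compression Factor of 0 and a Mean Conflation Class Size of 1, which matches the weakest-case figures.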


Use alpha2.txt for both File A and File B for the weakest possible results: the Mean Conflation Class Size will be 1, and the Compression Factor, the number of words and stems that differ, the mean characters removed, and the mean and median modified Hamming distance between word and stem will all be 0.

Mean Characters Removed

Use SS1.txt for File A and SS1_trunc8.txt for File B. The result should be 1.5; see Mean Characters Removed.

Mean Modified Hamming Distance (Mean MHD)

The mean modified Hamming distance (Mean MHD) between the original words and their stems is (1 + 2 + 4)/3 = 2.33 characters, and the median is 2. See Median and Mean Modified Hamming Distance.
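As a quick check of the mean and median quoted above, the three per-word distances can be fed to Python's standard statistics module. This is only an arithmetic sketch. The variable names are illustrative, and the list holds the values from the text rather than anything read from the test files.

```python
from statistics import mean, median

# Per-word modified Hamming distances quoted in the text.
distances = [1, 2, 4]

mean_mhd = mean(distances)
median_mhd = median(distances)

print(round(mean_mhd, 2))  # 2.33
print(median_mhd)          # 2
```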

Inverse mean MHD

Use SSM1.txt for File A and SSM2.txt for File B. The result should be 0.75; see String Similarity Metric.

English2Grouped.txt English2Grouped_trunc5.txt

Over Stemming Index
Under Stemming Index
Stemmer Weight

Use English2Grouped.txt as File A and English2Grouped_trunc5.txt as File B.

Set the Levenshtein Range to 0 to 32 and click the Calculate button.

The results shown in the Error Count group box should be UI = 0.545 and OI = 0.
See Error Counting.



Mean Conflation Class Size
Compression Factor
Mean Characters Removed
Mean MHD

Mean Conflation Class Size = 9/1 = 9

Compression Factor = (9 - 1)/9 = 0.889 (rounded to 3 decimal places)

Mean Characters Removed = (0 + 1 + 3 + 2 + 3 + 4 + 3 + 5 + 7)/9 = 3.111
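The Mean Characters Removed figure can be reproduced with a short sketch, assuming the statistic is the average of (word length - stem length) over all word/stem pairs. The function name and the placeholder strings are hypothetical. Only the per-word removal counts come from the text.

```python
def mean_characters_removed(pairs):
    """Average number of characters dropped when each word becomes its stem."""
    return sum(len(word) - len(stem) for word, stem in pairs) / len(pairs)


# Removal counts listed in the text, one per word.
removed = [0, 1, 3, 2, 3, 4, 3, 5, 7]

# Placeholder words: each is a 5-character stem plus the removed characters.
pairs = [("x" * (5 + r), "x" * 5) for r in removed]

print(round(mean_characters_removed(pairs), 3))  # 3.111
```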

MHD = HD(1,P) + (Q - P), where P is the length of the shorter string, Q is the length of the longer string, and HD(1,P) is the Hamming distance over the first P characters of both strings.

Mean MHD = the average MHD value for every word in the original sample = 3.111
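The MHD formula can be sketched as follows, assuming P is the length of the shorter string and Q the length of the longer one. The function name and the sample word/stem pairs are hypothetical and are not taken from the test files.

```python
def modified_hamming_distance(a, b):
    """HD over the first P characters plus the length difference (Q - P)."""
    shorter, longer = sorted((a, b), key=len)
    p, q = len(shorter), len(longer)
    # zip stops at the shorter string, so only the first P positions are compared.
    hd = sum(1 for x, y in zip(shorter, longer) if x != y)
    return hd + (q - p)


# Hypothetical word/stem pairs for illustration.
print(modified_hamming_distance("tries", "tri"))    # 2
print(modified_hamming_distance("flies", "fly"))    # 3
print(modified_hamming_distance("running", "run"))  # 4
```

When the stem is a prefix of the word, as with simple truncation, the Hamming term is 0 and the MHD equals the number of characters removed. That would be consistent with Mean MHD and Mean Characters Removed both coming out at 3.111 here.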

Inverse mean MHD

Improved SSM
Use SSM3.txt for File A and SSM4.txt for File B. The Inverse mean MHD should be 1.0; see String Similarity Metric.