Introduction

 

Kenza is an editing tool for Microsoft Search (SharePoint) thesaurus files, as used in SharePoint® 2010 and 2007, Microsoft® Search Service 2010, Microsoft Search Server 2010 Express, Microsoft Search Server 2008 Express and SQL iFTS (SQL Server™ Integrated Full-Text Search 2008).

 

Thesaurus files expand searches so that results include not only documents whose words exactly match those in the user's search query, but also documents containing synonyms, spelling variants or acronyms of those words; the end result is improved search recall. Apart from expanding search queries, the thesaurus files in SharePoint can also substitute alternate search queries: if the search query contains certain words or phrases (e.g. profanities or requests for sensitive information), the user can be directed to company policy documents or approved responses instead.
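The difference between expansion and replacement can be sketched in a few lines of Python. This is only an illustration, not how the search engine itself works: the file layout follows the thesaurus format documented for SQL Server full-text search, the entries are invented, and the helper names load_sets and rewrite_query are hypothetical. The sketch also treats the whole query as a single phrase, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Illustrative thesaurus content. The layout follows the SQL Server
# full-text thesaurus format; the entries themselves are made up.
SAMPLE = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <expansion>
      <sub>car</sub>
      <sub>automobile</sub>
    </expansion>
    <replacement>
      <pat>salary list</pat>
      <sub>pay policy</sub>
    </replacement>
  </thesaurus>
</XML>"""

NS = "{x-schema:tsSchema.xml}"  # namespace declared on <thesaurus>


def load_sets(xml_text):
    """Return (expansion_sets, replacements) parsed from thesaurus XML."""
    root = ET.fromstring(xml_text)
    expansions = [
        [sub.text.strip().lower() for sub in exp.findall(NS + "sub")]
        for exp in root.iter(NS + "expansion")
    ]
    replacements = {}
    for rep in root.iter(NS + "replacement"):
        subs = [s.text.strip().lower() for s in rep.findall(NS + "sub")]
        for pat in rep.findall(NS + "pat"):
            replacements[pat.text.strip().lower()] = subs
    return expansions, replacements


def rewrite_query(query, expansions, replacements):
    """Return the phrases a search for `query` would cover (simplified)."""
    q = query.strip().lower()
    if q in replacements:
        # Replacement set: the original phrase is dropped entirely.
        return list(replacements[q])
    for group in expansions:
        if q in group:
            # Expansion set: the original phrase plus its synonyms.
            return [q] + [term for term in group if term != q]
    return [q]
```

With this sample, a query for "car" would also cover "automobile", while a query for "salary list" would be swapped for "pay policy" and the original phrase would not be searched at all.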

 

The Problem

The thesaurus files are in XML format, and Microsoft state that only system administrators can update, modify, or delete the files. Editing the files in Windows Notepad or an XML editor is slow and error-prone, and any error in the file format or encoding can cause search to work incorrectly. Furthermore, it is not easy to share the work with other key staff such as Business Domain Experts, Information Architects, Librarians, Linguistic Experts, Business Intelligence Staff or non-technical users of the search system.

 

The Solution

Kenza is designed for ease of use. It can be run directly on the SharePoint server by the SharePoint Administrator, or on any PC running Windows 7, Vista or XP by non-technical staff. Kenza can create multiple thesaurus XML files and includes a merge feature so that files created by different people can easily be combined. Users can type entries or copy and paste them from other documents or web pages for fast, error-free production.

 

Kenza ensures that the file is in the correct format and automatically takes care of empty entries, unwanted whitespace, punctuation and other special characters. Automatic de-duplication, text cleanup and highlighting of noise words help the user produce correctly formatted* files in record time.

 

* Microsoft state that thesaurus files must be saved in Unicode format, and Byte Order Marks must be specified. Thesaurus entries cannot be empty or word-break to an empty string, and phrases must be no longer than 512 characters. A thesaurus must not contain any duplicate entries among the <sub> entries of expansion sets and the <pat> elements of replacement sets. Microsoft recommend that entries in the thesaurus file contain no special characters and that <sub> entries contain no stop words.
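The rules in this footnote lend themselves to a simple automated check. The Python sketch below is illustrative only and is not Kenza's implementation: it assumes "Unicode format" means UTF-16, assumes the namespace used in the SQL Server sample files, compares duplicates case-insensitively (an assumption), and does not attempt the word-breaker, special-character or stop-word checks. The function names has_utf16_bom and find_problems are hypothetical.

```python
import codecs
import xml.etree.ElementTree as ET

NS = "{x-schema:tsSchema.xml}"  # assumed namespace, as in SQL Server samples
MAX_PHRASE_LEN = 512            # phrase length limit quoted in the footnote


def has_utf16_bom(raw):
    """True if the raw file bytes start with a UTF-16 byte order mark."""
    return raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


def find_problems(xml_text):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    seen = set()  # <sub> of expansion sets and <pat> of replacement sets
    root = ET.fromstring(xml_text)
    for set_name, unique_tag in (("expansion", "sub"), ("replacement", "pat")):
        for group in root.iter(NS + set_name):
            for entry in group:
                tag = entry.tag.replace(NS, "")
                text = (entry.text or "").strip()
                if not text:
                    problems.append(f"empty <{tag}> in {set_name} set")
                    continue
                if len(text) > MAX_PHRASE_LEN:
                    problems.append(
                        f"<{tag}> longer than {MAX_PHRASE_LEN} characters"
                    )
                if tag == unique_tag:
                    key = text.lower()  # assumption: case-insensitive match
                    if key in seen:
                        problems.append(f"duplicate entry: {text}")
                    seen.add(key)
    return problems
```

In use, has_utf16_bom would be applied to the raw bytes of the file and find_problems to its decoded text; an empty list from find_problems means none of the checked rules were violated.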

 

References:

How to: Edit a Thesaurus File (Full-Text Search) - SQL Server 2008 R2 http://msdn.microsoft.com/en-us/library/ms345187.aspx

Thesaurus Configuration - SQL Server 2008 R2 http://msdn.microsoft.com/en-us/library/ms142491.aspx