Sample Thesaurus files

 

Months

 

This file demonstrates a typical cross-lingual information retrieval (CLIR) application. The sample file has a group name for each month in English, arranged in chronological order. Because you are unlikely to want the month names sorted alphabetically, the group names are prefixed with the characters 1-9 and A-C, so that if the sort button is clicked, accidentally or deliberately, the groups sort into chronological order. When you use this file with dtSearch, you can search for a month in any of over 30 languages* and find documents referring to that month in any of those languages.
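
The effect of the prefixes can be sketched in a few lines of Python. This is only an illustration: the exact group-name format ("1 January", "A October") is an assumption, and the actual labels in the sample file may differ.

```python
# Assumed group-name format: one prefix character, a space, the English month.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
PREFIXES = "123456789ABC"  # 1-9 for January-September, A-C for October-December


def prefixed_group_names(months=MONTHS, prefixes=PREFIXES):
    """Return one prefixed group name per month, in chronological order."""
    return [f"{prefix} {month}" for prefix, month in zip(prefixes, months)]


names = prefixed_group_names()

# A plain character sort keeps the prefixed names in chronological order,
# because the digits 1-9 sort before the letters A-C.
print(sorted(names) == names)

# Without prefixes, the same sort would be alphabetical instead.
print(sorted(MONTHS)[:3])
```

A single character per month is enough here because the sort compares the first character before anything else.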

 

A problem to be aware of: if you search for month 11 (November in English), returned documents containing listopad will be correct for Polish, Czech, and archaic Slovene (the Cyrillic equivalents are лістапада in Belarusian and листопаді in Ukrainian), but in Croatian listopad is the month of October. A similar problem exists with month 12 (December): the words prosinec (Czech) and prosinac (Croatian) refer to December, but prosinec may refer to January in some archaic Slovene documents.

 

Finally, note the similar words sierpień (Polish) and srpen (Czech), which are month 8 (August), whereas srpanj is July in Croatian, and in archaic Slovene veliki srpan (“large sickle”) is August and mali srpan (“small sickle”) is July.
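
One way to keep track of these collisions is a small lookup table, sketched here in Python. The table holds only the words and month numbers mentioned above. The structure and the function name are hypothetical and are not part of the thesaurus file format.

```python
# Month number per language for each word, taken from the notes above.
MONTH_BY_LANGUAGE = {
    "listopad": {"Polish": 11, "Czech": 11, "archaic Slovene": 11, "Croatian": 10},
    "prosinec": {"Czech": 12, "archaic Slovene": 1},
    "prosinac": {"Croatian": 12},
    "sierpień": {"Polish": 8},
    "srpen": {"Czech": 8},
    "srpanj": {"Croatian": 7},
}


def ambiguous_words(table=MONTH_BY_LANGUAGE):
    """Return the words that name different months in different languages."""
    return sorted(
        word
        for word, months in table.items()
        if len(set(months.values())) > 1
    )


for word in ambiguous_words():
    print(word, MONTH_BY_LANGUAGE[word])
```

Hits on any word this check reports are worth reviewing by language before being treated as a match for a single month.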

 

References:

  1    http://en.wikipedia.org/wiki/Slovene_months

  2    https://digital.lib.washington.edu/ojs/index.php/ssj/article/viewFile/4179/3518

 

Days

 

Days of the week in over 20 languages**. As in the Months file, the days of the week have been prefixed with numbers so that they will be sorted correctly. They have also been prefixed with a zero so that they appear in order if you merge the file with the Months sample file. The order follows the international standard ISO 8601, with Monday as the first day of the week; however, calendars in many countries, such as the USA, still show Sunday as the first day of the week. You can rename the list to suit the order and language you prefer.
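
The reason for the leading zero can be illustrated with a short Python sketch. The group-name formats ("01 Monday", "1 January") are assumptions made for the example and may not match the sample files exactly.

```python
# ISO 8601 order: Monday is day 1.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Assumed format: a zero, the day number, a space, the English day name.
day_groups = [f"0{number} {day}" for number, day in enumerate(DAYS, start=1)]

# A few assumed group names from the Months file, for illustration only.
month_groups = ["1 January", "2 February", "A October", "C December"]

# After merging, a plain sort keeps all the days together, in weekday order,
# ahead of the months, because "0" sorts before "1".
merged = sorted(month_groups + day_groups)
for name in merged:
    print(name)
```

To put Sunday first, only the prefixes need renaming; the synonym lists themselves stay unchanged.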

 

Currencies

 

ISO 4217 currency codes (current and some no longer current) and some additional commercially used codes, listed against currency name and symbol, so that a search for 100 EUR will find 100 euro or 100 €. Note that by default dtSearch does not index currency symbols; to index and search them, you must edit the alphabet file. The editor built into dtSearch Desktop controls the processing of characters in the range 33 to 127 only. The dollar sign ($, character code 36) can be made searchable by selecting the Character type ‘Letter’ instead of the default ‘Space’. Characters above 127 are processed according to the Unicode specification, and dtSearch does not treat other symbols as searchable characters; however, you can manually edit the alphabet file (default.abc) to make additional Unicode characters searchable.

 

User Thesaurus Plus has a built-in dtSearch alphabet editor on the Tools menu that can make other currency symbols with character codes above 127 searchable.
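
To see which currency symbols fall inside the 33 to 127 range handled by the dtSearch Desktop editor, the character codes can be checked with a short Python sketch. The choice of symbols and the function name are illustrative assumptions. The sketch only inspects character codes; it does not read or modify default.abc.

```python
# A few common currency symbols, chosen for illustration.
SYMBOLS = {"dollar": "$", "euro": "€", "pound": "£", "yen": "¥"}


def in_editor_range(char, low=33, high=127):
    """True if the character code lies in the range the built-in editor covers."""
    return low <= ord(char) <= high


for name, char in SYMBOLS.items():
    where = "built-in editor range" if in_editor_range(char) else "above 127"
    print(f"{name}: {char} code {ord(char)} ({where})")
```

Of these four symbols, only the dollar sign lies within the range; the others need the alphabet file to be edited by one of the other routes described above.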

 

Note that the WordNet thesaurus built into dtSearch Desktop has no entry for euro. It does have entries for dollar, pound, drachma, dirham and others, but it contains no symbols or ISO codes.

 

According to the European Union's Publications Office, in English, Irish, Latvian and Maltese texts the ISO 4217 code is followed by a fixed space and the amount:  a sum of EUR 30

 

In Bulgarian, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Lithuanian, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish the order is reversed; the amount is followed by a fixed space and the ISO 4217 code:  une somme de 30 EUR

 

The above should be taken with a huge pinch of salt, since individuals in each country do not necessarily write in that format. To ensure that a search for EUR 100 also finds 100 EUR, search using EUR w/1 100 (i.e. “EUR within one word of 100”).
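
The idea behind the w/1 search can be sketched in Python as a simple word-distance check that ignores order. This is a simplified emulation written for illustration, not dtSearch's implementation. The function name and the tokenising rule are assumptions, and dtSearch may count words differently.

```python
import re


def within(text, first, second, distance=1):
    """True if the two terms occur within `distance` words of each other,
    in either order (a simplified stand-in for a proximity search)."""
    tokens = re.findall(r"\w+", text.lower())
    first_positions = [i for i, token in enumerate(tokens) if token == first.lower()]
    second_positions = [i for i, token in enumerate(tokens) if token == second.lower()]
    return any(
        0 < abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )


print(within("a sum of EUR 100", "EUR", "100"))         # code before amount
print(within("une somme de 100 EUR", "EUR", "100"))     # amount before code
print(within("EUR was worth about 100", "EUR", "100"))  # too far apart
```

Because the check compares positions without regard to order, a single request covers both writing conventions.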

 

Reference

http://en.wikipedia.org/wiki/ISO_4217

 

Irregular verbs in French, German, English, Dutch, Spanish, Italian

 

Handles verb forms (irregular verbs, strong verbs, stem-changing verbs) that stemming may not, for example in English go, went, gone; run, ran; speak, spoke; drink, drank. Not all conjugations of each verb are necessarily listed: if a verb follows regular patterns in most of its forms, the dtSearch stemmer may handle those forms correctly without their being included in the thesaurus, and listing every conjugation may slow searches.
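
How a synonym group fills the gap left by stemming can be sketched in Python. The groups below use only the English examples given above. The in-memory structure and the function name are hypothetical and do not represent the thesaurus file format or the dtSearch stemmer.

```python
# Each set is one synonym group of irregular forms.
IRREGULAR_GROUPS = [
    {"go", "went", "gone"},
    {"run", "ran"},
    {"speak", "spoke"},
    {"drink", "drank"},
]


def expand(term, groups=IRREGULAR_GROUPS):
    """Return every form in the term's group, or the term alone if it has none."""
    for group in groups:
        if term in group:
            return sorted(group)
    return [term]


print(expand("went"))    # all listed forms of "go"
print(expand("walked"))  # not listed: left for the stemmer to handle
```

Keeping the groups limited to forms the stemmer cannot derive keeps each expansion short, in line with the note above about search speed.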

 

Irregular nouns in English

 

Handles plurals that stemming rules may not, for example woman - women, foot - feet, goose - geese, child - children.

 

Trade

 

This file is another CLIR example. It has the English word Invoice with equivalents in over 20 languages.

 

Fashion

 

This file is an example of using brand names as synonyms for items of clothing.

 

Geographic

 

Small sample from: http://www.alexandria.ucsb.edu/gazetteer/FeatureTypes

 

Legal & Medical

 

These files are examples of using legal/medical terms (in Latin) with common English synonyms. The medical file also includes an example of using synonyms that include trade names, chemical names, and 'street names' for drugs.

 

Names – Genealogy – Cross-lingual – political

 

Examples of various methods of name searching: nicknames, maiden/married names, diminutives, transliterations, and political office.

 

Names Russian Male and Female (2 separate files)

 

Examples of name searching - diminutives, transliterations, similar names.

 

Prenoms francaise masculin

 

Examples of name searching - French male forenames and diminutives.

 

* Languages: Arabic, Belarusian, Bosnian, Bulgarian, Croatian, Czech, Chinese, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Uzbek, Welsh. (Note: for Chinese and Japanese the month name is a 'numbered month' in traditional Chinese; for Arabic there are variants for names as used in Algeria, Egypt, Iran, Iraq, Jordan, Lebanon, Palestine, Sudan, Syria).

 

** Languages: Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek.