Setting up a new dictionary for DictionaryForMIDs
See important change notes from past releases here!
Setting up a dictionary is purely a matter of configuration; you need neither
programming knowledge nor a development environment. If you run into any
problem while setting up a dictionary for DictionaryForMIDs, just
contact us and we will assist you.
Setting up a dictionary for DictionaryForMIDs involves the following 3 steps:
- Configuring the properties of the file DictionaryForMIDs.properties
- Generating the files for DictionaryForMIDs
- Putting the generated files in DictionaryForMIDs.jar
1. Configuring the properties of the file DictionaryForMIDs.properties
DictionaryForMIDs is customized via properties in DictionaryForMIDs.properties. Each of the properties must be provided unless noted otherwise.
Here is the list of properties:
- infoText
Text that is shown at the top of the info dialog. Provide
information about the dictionary here. It is obligatory to include contact
information for someone who can be contacted concerning the dictionary.
That may be you (the person who set up this dictionary for
DictionaryForMIDs) and/or the maintainer of the dictionary itself.
Please include an email address and/or homepage.
It is also obligatory to include a copyright notice for the dictionary.
- dictionaryAbbreviation
Short abbreviation for identifying the origin of the dictionary. This
is an abbreviation for the name of the organization or project where the
dictionary comes from, e.g. freedict for the dictionaries from
freedict.org. Preferably only a few characters long. The JarCreator tool
uses this property to form the application name.
- numberOfAvailableLanguages
Defines how many languages are in the dictionary. For many
dictionaries this will be 2. For each language the languageX-properties
need to be defined as described below (X is a number starting from 1 to
numberOfAvailableLanguages).
- languageXDisplayText
Text that is used on the user interface to identify the language. X
needs to be replaced with the number of the column for the language. For
example:
language1DisplayText: English
language2DisplayText: Portuguese
- languageXFilePostfix
Text that is used in file names to identify searchfile and index
files for a language. Typically a 3-letter text, such as Eng for
English; as defined in the ISO 3-letter codes at
http://etext.lib.virginia.edu/tei/iso639.html.
- languageXIsSearchable
A boolean property with either the value true or the value false. Set
to true when it is allowed to search for translations for that language.
Normally this property is set to true for bi-directional translation
dictionaries. For lookup dictionaries, e.g. for an acronym dictionary,
where it is only possible to search from the acronym to the explanation,
this value is set false for the explanation language/column. For an
example see the elements dictionary from
the download section.
Also for a unidirectional dictionary, which for example only translates
English to Portuguese (but not Portuguese to English), you have to set
languageXIsSearchable to false for Portuguese.
This property is optional, the default value is true.
- languageXGenerateIndex
A boolean property with either the value true or the value false.
Tells DictionaryGeneration whether to generate an index for this
language.
This property is optional, the default value is true.
Normally this property has the same value as languageXIsSearchable.
- languageXHasSeparateDictionaryFile
A boolean property with either the value true or the value false. Set
to true, when there is a separate dictionaryXXX.csv file for this
language (for an explanation about the files, see section
Files generated by the DictionaryGeneration tool). Normally all
languages use the same dictionaryXXX.csv files, namely for those
dictionaries where expression ABC translates to XYZ and this means that
XYZ translates back to ABC. For dictionaries where ABC translates to
XYZ, however XYZ translates to DEF, this property is set to true. For an
example, see the German-French freedict dictionary from
the download section.
For documentation, see
here.
This property is optional, the default value is false.
- dictionaryGenerationSeparatorCharacter
Separation character for the input dictionary file that is read by
DictionaryGeneration. This character needs to be put in apostrophes, e.g.
','. Can also be '\t' (backslash plus t) for a tab-character.
This property is optional, the default value is \t (tab-character).
- indexFileSeparationCharacter/searchListFileSeparationCharacter/dictionaryFileSeparationCharacter
Separation character for the output csv files that are generated by
DictionaryGeneration. This character needs to be put in apostrophes, e.g.
','. Can also be '\t' (backslash plus t) for a tab-character. Typically
these properties are set to the same value as
dictionaryGenerationSeparatorCharacter.
- dictionaryGenerationLanguageXExpressionSplitString
Used by DictionaryGeneration: when for a language a read expression
actually contains several expressions, then this property is set to the
string that separates the expressions.
Example: the expression "to choose, to select, to pick" contains not one
but three expressions: (1) "to choose", (2) "to select" and (3) "to
pick". By setting dictionaryGenerationLanguageXExpressionSplitString to
, for this language, DictionaryGeneration will extract these 3
expressions. This is done for language2 with the following line:
dictionaryGenerationLanguage2ExpressionSplitString=,
This property is optional, the default value is 'property not set'.
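The effect of dictionaryGenerationLanguageXExpressionSplitString can be pictured with the following minimal Java sketch. It is only an illustration and not the DictionaryGeneration source code: the class name ExpressionSplitSketch, the method split, and the trimming of surrounding whitespace are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not the DictionaryGeneration source code.
final class ExpressionSplitSketch {

    // Splits a raw expression at every occurrence of splitString.
    // Assumption: surrounding whitespace is trimmed and empty pieces are dropped.
    static List<String> split(String rawExpression, String splitString) {
        List<String> expressions = new ArrayList<>();
        // Pattern.quote makes the split string a literal, not a regular expression.
        for (String part : rawExpression.split(Pattern.quote(splitString))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                expressions.add(trimmed);
            }
        }
        return expressions;
    }

    public static void main(String[] args) {
        // prints [to choose, to select, to pick]
        System.out.println(split("to choose, to select, to pick", ","));
    }
}
```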
- dictionaryGenerationInputCharEncoding
Character set encoding for the input dictionary file that is read by DictionaryGeneration.
Supported character set encodings are:
UTF-8
ISO-8859-1
US-ASCII
This property is optional, the default value is ISO-8859-1.
- indexCharEncoding/searchListCharEncoding/dictionaryCharEncoding
These 3 properties define the character set encoding that is used for the output index files/searchlist file/dictionary files.
Supported character set encodings are:
UTF-8
ISO-8859-1
US-ASCII
Note: on very old mobiles/PDA devices UTF-8 may not yet be supported.
We would expect that each model that was released recently supports
UTF-8.
- languageXDictionaryUpdateClassName
This property defines for the DictionaryGeneration-tool a "DictionaryUpdate"-class that is used for a language. DictionaryUpdateClass changes the way entries are stored when the tool converts an input dictionary file into the DictionaryForMIDs generated files. For details on the files created when a dictionary is generated, see Generating the files for DictionaryForMIDs.
For example: DictionaryUpdateEngDef removes unneeded words such as "the", "a", and "at" from the indexes. These words are not needed in the indexes and add unnecessarily to the file size. When a user performs a search, these words are still displayed in the definition, however.
The property languageXDictionaryUpdateClassName is optional. Use this
property only if you really need it, otherwise remove any
languageXDictionaryUpdateClassName-lines in the property file !
- languageXNormationClassName
This is the name of a Java class that is used to 'normate' words. Whereas DictionaryUpdateClass changes dictionary files only when the dictionary is generated, NormationClass affects the words that the user enters when searching.
For example: NormationGer parses the nonNormatedWord for the German 'Umlauts' (ä, ö, ü) and returns the word with the Umlaut-paraphrasing (ae, oe, ue). So the user can search for "Mädchen" or "Maedchen" and the translation will be found in both cases.
These changes in the dictionary files are done in 2 steps. First the DictionaryGeneration-tool calls the NormationClass to change the indexes to incorporate the phonetic changes (ä is changed to ae, for example). Then when the user searches, the NormationClass is called again to make the phonetic changes to match the changes that were made earlier with the DictionaryGeneration-tool.
Via Normation-classes it is possible to provide language-specific
search features and phonetic search. A lot of power lies in these
Normation classes !!
For documentation of NormationClass, see
here.
The property languageXNormationClassName is optional.
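The two-step use of a Normation class (once when the index is generated, once when the user searches) can be sketched in Java as follows. This is a hedged illustration only: UmlautNormationSketch and its method normate are hypothetical names, and the real NormationGer class may handle more cases (upper-case Umlauts, for instance, are ignored here).

```java
// Hypothetical sketch; the real NormationGer class may behave differently.
final class UmlautNormationSketch {

    // Replaces the lower-case German Umlauts by their paraphrasing.
    // Unicode escapes are used so the source compiles with any file encoding.
    static String normate(String nonNormatedWord) {
        return nonNormatedWord
                .replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue");  // u-umlaut
    }

    public static void main(String[] args) {
        // Step 1 (generation time): the keyword is stored in normated form.
        String indexKeyword = normate("M\u00e4dchen");
        // Step 2 (search time): the user input is normated the same way.
        String userInput = normate("Maedchen");
        // Both spellings lead to the same index keyword: prints true
        System.out.println(indexKeyword.equals(userInput));
    }
}
```

Because the same function is applied on both sides, the index and the search term always meet in the normated form, whichever spelling the user types.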
- languageXContentNN properties
With the languageXContentNN properties you can specify the content of
your dictionary. For example you can specify that there is a
pronunciation part, an explanation part, etc.
For more information on the languageXContentNN-properties, see
here.
The languageXContentNN properties are optional.
- searchListFileMaxSize/indexFileMaxSize/dictionaryFileMaxSize
Defines the size in bytes of the biggest searchlist file/index
file/dictionary file, as generated by DictionaryGeneration. These
properties are automatically determined and set by DictionaryGeneration,
normally there is no longer a need to set these properties manually.
However these properties need to be manually defined when a dictionary
is merged from two (or more) different source dictionaries and
DictionaryGeneration is run once for each of these source dictionaries
(the values from the first run would be overwritten by the second run).
When these properties are already manually set when DictionaryGeneration
is run, then no automatic generation for these properties is done. For
the manual values you must ensure that no searchlist file/index
file/dictionary file is bigger than the property value, otherwise some
translations are not found. There is no problem if the value of these
properties is bigger than the actual maximum file size. For example if
you set the dictionaryFileMaxSize to 50000 even if the biggest
dictionary file is only 35000 bytes everything will work correctly.
However DictionaryForMIDs will allocate 50000 bytes of heap memory, and
keep in mind that specifically for older devices heap memory is scarce.
- dictionaryGenerationMinNumberOfEntriesPerDictionaryFile/dictionaryGenerationMinNumberOfEntriesPerIndexFile
Defines for DictionaryGeneration the number of entries (= lines) per
dictionary file and per index file.
These properties are optional, the default value for
dictionaryGenerationMinNumberOfEntriesPerDictionaryFile is 200 and the
default value for dictionaryGenerationMinNumberOfEntriesPerIndexFile is
500.
As a general hint, you could try to set these values so that the size of
a single directory file and the size of a single index file do not
exceed 100 kB (size defined by properties
searchListFileMaxSize/indexFileMaxSize/dictionaryFileMaxSize).
If you set up a small dictionary that should support very old devices
with very little heap memory, then set these values low enough, for
example so that the biggest file does not exceed 10 kB.
- languageXIndexNumberOfSourceEntries
This property is automatically generated by DictionaryGeneration. The
value contains the number of 'begin of expression'-index entries. This
value gives the number of words/expressions that the dictionary contains
for languageX. The values of languageXIndexNumberOfSourceEntries are
shown in the Info-Dialog (will be implemented in a future version).
Note that when you merge a dictionary from two (or more) source
dictionaries and DictionaryGeneration is run once for each of these
source dictionaries, then you need to manually copy the entries for
languageXIndexNumberOfSourceEntries into the final
DictionaryForMIDs.properties file.
- logLevel
Note: the logLevel is set in the file DictionaryForMIDs.jad (not
DictionaryForMIDs.properties).
Allows you to switch on some debugging output. A logLevel of 0 switches
debugging output off, a logLevel of 3 switches all debugging output on,
and the levels 1 and 2 switch some debugging output on. A higher
logLevel means more debugging output.
Here is a sample DictionaryForMIDs.properties file:
infoText: English-Portuguese dictionary from IDP: http://www.june29.com/IDP
dictionaryAbbreviation: IDP
numberOfAvailableLanguages: 2
language1DisplayText: English
language2DisplayText: Portuguese
language1FilePostfix: Eng
language2FilePostfix: Por
dictionaryGenerationSeparatorCharacter: '\t'
indexFileSeparationCharacter: '\t'
searchListFileSeparationCharacter: '\t'
dictionaryFileSeparationCharacter: '\t'
dictionaryGenerationInputCharEncoding: ISO-8859-1
indexCharEncoding: ISO-8859-1
searchListCharEncoding: ISO-8859-1
dictionaryCharEncoding: ISO-8859-1
language1DictionaryUpdateClassName: de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateIDP
language2DictionaryUpdateClassName: de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateIDPSpa
language1NormationClassName: de.kugihan.dictionaryformids.translation.normation.NormationEng
language2NormationClassName: de.kugihan.dictionaryformids.translation.normation.NormationLat
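If you want to check a properties file like the sample above before running DictionaryGeneration, a small Java sketch such as the following could be used. It assumes that the file can be read with the standard java.util.Properties class (which accepts the "key: value" form); whether DictionaryForMIDs itself reads the file this way is not stated here. The class PropertiesSketch and its methods are hypothetical names for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical consistency check; not part of the DictionaryForMIDs tools.
final class PropertiesSketch {

    // Parses properties text in "key: value" form.
    static Properties parse(String content) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Collects languageXDisplayText for X = 1..numberOfAvailableLanguages
    // and fails if one of them is missing.
    static List<String> displayTexts(Properties properties) {
        int count = Integer.parseInt(
                properties.getProperty("numberOfAvailableLanguages").trim());
        List<String> texts = new ArrayList<>();
        for (int x = 1; x <= count; x++) {
            String key = "language" + x + "DisplayText";
            String text = properties.getProperty(key);
            if (text == null) {
                throw new IllegalArgumentException("missing property " + key);
            }
            texts.add(text.trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String sample = "numberOfAvailableLanguages: 2\n"
                + "language1DisplayText: English\n"
                + "language2DisplayText: Portuguese\n";
        // prints [English, Portuguese]
        System.out.println(displayTexts(parse(sample)));
    }
}
```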
2. Generating the files for DictionaryForMIDs
Downloading
Download the latest version of the DictionaryGeneration tool:
DictionaryForMIDs_DictionaryGeneration_3.1.0.zip (373 kB) (this is
the right version for DictionaryForMIDs >= 2.5.0)
DictionaryGeneration requires a J2SE runtime on your PC. If you are
not sure whether you have the J2SE runtime installed, check whether the
command "java" exists at the command prompt. If you do not have a
J2SE runtime installed, you can download it from
http://java.com/en/download/download_the_latest.jsp (this is > 10
MB).
Using the DictionaryGeneration tool
DictionaryGeneration is a command line tool. You start the DictionaryGeneration tool as follows:
java -jar DictionaryGeneration.jar inputdictionaryfile outputdirectory propertydirectory
inputdictionaryfile: file from which the dictionary is read
outputdirectory: pathname where the generated directory files are written to (must end with "dictionary" !)
propertydirectory: directory where the file DictionaryForMIDs.properties is located
inputdictionaryfile:
The first parameter is the dictionary file that you want to set up on
DictionaryForMIDs. Here is a sample dictionary file from the IDP:
PortugueseNoHeader.txt (38 kB).
This file is a 'Comma Separated Value' file (CSV-file), although you can use
any separation character instead of a comma. The separation
character is specified by the property
dictionaryGenerationSeparatorCharacter (see section
Configuring the properties of the file
DictionaryForMIDs.properties).
In the inputdictionaryfile for each language there is a column. Most
often you will have two languages (property numberOfAvailableLanguages
set to 2) and two columns.
If the dictionary that you want to set up is not yet in a CSV-format,
you need to convert it into such a format first.
outputdirectory:
The second parameter specifies the directory path to which the generated files are written. This directory path must end in "dictionary" ! DictionaryGeneration writes to this directory the following files: searchlistxxx.csv, indexxxn.csv, dictionaryn (xxx is a placeholder for the value specified by the property languageXFilePostfix and n is a sequence number).
propertydirectory:
The third parameter is a directory path where the configured file
DictionaryForMIDs.properties is found.
Here is an example for starting DictionaryGeneration:
java -jar DictionaryGeneration.jar dictionaries\IDP\Por\PortugueseNoHeader.txt output\dictionary dictionaries\IDP\Por
Customization of DictionaryGeneration with DictionaryUpdate classes
The DictionaryGeneration tool can be customized by DictionaryUpdate
classes. Read here for a
description of DictionaryUpdate classes.
Files generated by the DictionaryGeneration tool
DictionaryGeneration generates searchfiles, indexfiles and dictionaryfiles. In addition to the generation of these files, DictionaryGeneration copies the file DictionaryForMIDs.properties to the outputdirectory.
searchfileXXX
For each language one searchfile is generated. XXX is defined by the
property languageXFilePostfix.
A searchfile contains one entry per line in the following format:
keyword<searchListFileSeparationCharacter>indexfilenumber
The searchListFileSeparationCharacter-property is typically set to a
tab-character.
The keywords are the first keyword of the indexfile with the given
indexfilenumber. So for example the line
monument 18
indicates that the keyword monument is found at the beginning of
indexfile 18.
The entries in the searchfiles are sorted alphabetically according to
the keyword.
The keywords are normated.
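Since each searchfile entry names the first keyword of an indexfile and the entries are sorted, the indexfile for a given keyword can be found by looking for the last entry that is not greater than that keyword. The Java sketch below illustrates this idea under some assumptions: it is not taken from the DictionaryForMIDs source code, it uses plain String ordering as "alphabetical" order, and apart from "monument 18" the sample entries are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not taken from the DictionaryForMIDs source code.
final class SearchListSketch {

    // Parses searchfile lines of the form keyword<separator>indexfilenumber.
    static TreeMap<String, Integer> parse(List<String> lines, String separator) {
        TreeMap<String, Integer> firstKeywordToIndexFile = new TreeMap<>();
        for (String line : lines) {
            int pos = line.indexOf(separator);
            String keyword = line.substring(0, pos);
            int indexFileNumber =
                    Integer.parseInt(line.substring(pos + separator.length()).trim());
            firstKeywordToIndexFile.put(keyword, indexFileNumber);
        }
        return firstKeywordToIndexFile;
    }

    // Returns the number of the indexfile whose first keyword is the greatest
    // one that is <= the (already normated) keyword, or -1 if there is none.
    static int indexFileFor(TreeMap<String, Integer> searchList, String normatedKeyword) {
        Map.Entry<String, Integer> entry = searchList.floorEntry(normatedKeyword);
        return entry == null ? -1 : entry.getValue();
    }

    public static void main(String[] args) {
        // "house" and "night" are invented sample entries.
        List<String> lines = Arrays.asList("house\t17", "monument\t18", "night\t19");
        TreeMap<String, Integer> searchList = parse(lines, "\t");
        // "moon" sorts after "monument" and before "night": prints 18
        System.out.println(indexFileFor(searchList, "moon"));
    }
}
```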
indexfileXXXN
For each language several indexfiles are generated. XXX is defined by the property languageXFilePostfix; N is a sequence number.
An indexfile contains one entry per line in the following format:
keyword<indexFileSeparationCharacter>dictionaryfilenumber-charpos-searchindicator[,...]
The indexFileSeparationCharacter-property is typically set to a
tab-character.
For example the line
monument 29-383-B
indicates that the word monument together with its translation is
found in the dictionaryfile 29 at the byte position 383 and that
monument occurs at the beginning of the expression. The searchindicator is
either B for 'begin of expression' or S for 'substring in expression'.
For example for the expression "give up": "give up" will have a
searchindicator of B and "up" will have a searchindicator of S.
For one keyword there may be several references to different locations
in dictionaryfiles; the references are separated by commas.
The indexfiles contain one line per word to be translated. The entries
in the indexfiles are sorted alphabetically according to the keyword.
The keywords are normated.
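To make the indexfile format more concrete, here is a minimal Java sketch that takes one indexfile line apart. It is an illustration under assumptions, not the parser used by DictionaryForMIDs: the names IndexLineSketch, Reference and parseReferences are hypothetical, and the second reference in the sample line (31-12-S) is invented to show the comma-separated form.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the parser used by DictionaryForMIDs.
final class IndexLineSketch {

    // One reference of the form dictionaryfilenumber-charpos-searchindicator.
    static final class Reference {
        final int dictionaryFileNumber;
        final int position;
        final char searchIndicator; // 'B' = begin of expression, 'S' = substring

        Reference(int dictionaryFileNumber, int position, char searchIndicator) {
            this.dictionaryFileNumber = dictionaryFileNumber;
            this.position = position;
            this.searchIndicator = searchIndicator;
        }
    }

    // Returns the keyword in front of the separator.
    static String parseKeyword(String line, String separator) {
        return line.substring(0, line.indexOf(separator));
    }

    // Returns all references that follow the separator.
    static List<Reference> parseReferences(String line, String separator) {
        String references = line.substring(line.indexOf(separator) + separator.length());
        List<Reference> result = new ArrayList<>();
        for (String reference : references.split(",")) {
            String[] parts = reference.trim().split("-");
            result.add(new Reference(
                    Integer.parseInt(parts[0]),
                    Integer.parseInt(parts[1]),
                    parts[2].charAt(0)));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "monument\t29-383-B,31-12-S"; // second reference is invented
        for (Reference r : parseReferences(line, "\t")) {
            System.out.println(parseKeyword(line, "\t") + ": file " + r.dictionaryFileNumber
                    + ", position " + r.position + ", indicator " + r.searchIndicator);
        }
    }
}
```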
directoryXXXN
For each language several dictionaryfiles are generated. XXX is
defined by the property languageXFilePostfix; N is a sequence number.
Note: the number of dictionaryfiles is not the same as the number of
indexfiles (typically there are more dictionaryfiles).
A dictionaryfile contains one entry per line in the following format:
expression-from<dictionaryFileSeparationCharacter>expression-to
The dictionaryFileSeparationCharacter-property is typically set to a
tab-character.
For example the line
monument Denkmal (n)
translates the English "monument" to the German "Denkmal (n)".
The dictionaryfiles are non-sorted (well, if the inputdictionaryfile is
sorted, then also the generated dictionaryfiles are sorted; but there is
no need for the dictionaryfiles to be sorted).
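The following Java sketch shows how an entry could be read from a dictionaryfile once the byte position is known from an indexfile. It rests on assumptions that are not confirmed here: that the byte position points at the first byte of the entry's line, and that entries end with a line break. DictionaryEntrySketch and entryAt are hypothetical names, and the "house" line in the sample content is invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical illustration; not the DictionaryForMIDs implementation.
final class DictionaryEntrySketch {

    // Reads the line that starts at bytePosition and splits it into
    // expression-from and expression-to at the first separator.
    static String[] entryAt(byte[] fileContent, int bytePosition,
                            Charset charset, String separator) {
        int end = bytePosition;
        while (end < fileContent.length
                && fileContent[end] != '\n' && fileContent[end] != '\r') {
            end++;
        }
        String line = new String(fileContent, bytePosition, end - bytePosition, charset);
        return line.split(Pattern.quote(separator), 2);
    }

    public static void main(String[] args) {
        Charset charset = StandardCharsets.ISO_8859_1;
        String content = "house\tHaus (n)\nmonument\tDenkmal (n)\n";
        byte[] bytes = content.getBytes(charset);
        // In ISO-8859-1 every character is one byte, so the character index
        // of the second line equals its byte position.
        int position = content.indexOf("monument");
        String[] entry = entryAt(bytes, position, charset, "\t");
        System.out.println(entry[0] + " -> " + entry[1]);
    }
}
```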
Creating a bitmap font
If you wish to include a bitmap font with the dictionary, please see the instructions for help creating one. The bitmap font generator will create the file 'font.bmf' which contains all of the font data.
3. Putting the generated files in DictionaryForMIDs.jar
Download the empty DictionaryForMIDs ('empty' means that there is no
dictionary included):
DictionaryForMIDs_3.2.0_empty.zip
(227 kB).
You need to extract the files DictionaryForMIDs.jar and
DictionaryForMIDs.jad.
Next you need to include the dictionary-files that were generated
with DictionaryGeneration in the JAR-file. This can be done conveniently
with the JarCreator tool:
DictionaryForMIDs_JarCreator_3.1.2.zip (352 kB) (this is the
right version for DictionaryForMIDs >= 3.1.2)
Updating DictionaryForMIDs.jar with JarCreator
Here is how to use JarCreator:
java -jar JarCreator.jar dictionarydirectory emptyjar outputdirectory
dictionarydirectory:
The first parameter is the directory where the generated files are located. This is the same path as outputdirectory for DictionaryGeneration. Also this directory path must end in "dictionary". If you are also using a bitmap font with this dictionary, font.bmf must be located in this directory. JarCreator will automatically move the font file from this directory to its correct location in the JAR package.
emptyjar:
The second parameter is the directory where the 'empty' files DictionaryForMIDs.jar and DictionaryForMIDs.jad are found.
outputdirectory:
The third parameter is the directory where JarCreator stores the
completed DictionaryForMIDs_xxx.jar and DictionaryForMIDs_xxx.jad (xxx
is filled in by JarCreator with languageXFilePostfix and
dictionaryAbbreviation). This output contains the dictionary files from
the dictionarydirectory.
Updating DictionaryForMIDs.jar manually (if you are not using JarCreator)
(if you are using JarCreator, continue with
Sample DictionaryForMIDs.jar)
As an alternative to JarCreator, you can also update the file
DictionaryForMIDs.jar manually. Use a ZIP-utility to do so, such as the
free info-zip (Windows
version or
command line version) or WinZip.
In the file DictionaryForMIDs.jar add the directory "dictionary" including all the generated files. Important: in the JAR-file the generated files _must_ be in the directory dictionary, otherwise you will receive an error message when you translate from DictionaryForMIDs. Depending on the ZIP-utility that you are using, adding the files in the directory dictionary can be a little bit tricky.
If you are including a bitmap font with the dictionary, font.bmf must be moved out of the 'dictionary' directory, and into a new directory called 'fonts'. Bitmap font support will not be available in DictionaryForMIDs unless the font file is in the correct directory. The 'fonts' folder should be at the same level as the 'dictionary' directory in the directory tree.
After adding the dictionary files, you need to adjust the property MIDlet-Jar-Size in the file DictionaryForMIDs.jad. Put the exact size in bytes of the file DictionaryForMIDs.jar after this property, for example:
MIDlet-Jar-Size: 58037
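If you prefer not to look up the file size by hand, the line for the JAD file can be produced with a few lines of Java. This is just a convenience sketch and not part of the DictionaryForMIDs tools; the class name JarSizeSketch and the method jarSizeLine are hypothetical, and the path of the JAR file is taken from the first command line argument (defaulting to DictionaryForMIDs.jar in the current directory).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of the DictionaryForMIDs tools.
final class JarSizeSketch {

    // Builds the JAD line from the exact size of the JAR file in bytes.
    static String jarSizeLine(Path jarFile) throws IOException {
        return "MIDlet-Jar-Size: " + Files.size(jarFile);
    }

    public static void main(String[] args) throws IOException {
        Path jar = Paths.get(args.length > 0 ? args[0] : "DictionaryForMIDs.jar");
        if (!Files.isRegularFile(jar)) {
            System.out.println("File not found: " + jar);
            return;
        }
        // Copy the printed line into DictionaryForMIDs.jad.
        System.out.println(jarSizeLine(jar));
    }
}
```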
Sample DictionaryForMIDs.jar
Here is an example for DictionaryForMIDs set up with a dictionary.
You can see there that the generated files are in the directory
"dictionary":
DictionaryForMIDs_2.4.0_EngPor_IDP_dev.zip (66 kB).
Packaging into a ZIP file
For packaging put the 4 files (1) DictionaryForMIDs_xxx.jar (2) DictionaryForMIDs_xxx.jad (3) README and (4) COPYING into a ZIP file. You should use this file naming convention:
DictionaryForMIDs_VVVVV_XXXYYY_ZZZ.zip
VVVVV: version of DictionaryForMIDs, for example "3.0.0"
XXX: language1FilePostfix, for example "Eng"
YYY: language2FilePostfix, for example "Por"
ZZZ: info on the origin of the dictionary (can be longer than 3
characters), for example "IDP" or "freedict"; should be the same as
defined in the property dictionaryAbbreviation.
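The naming convention can be expressed as a small Java sketch. The class ZipNameSketch and the method zipName are hypothetical names used only for this illustration; the values are the examples given above.

```java
// Hypothetical illustration of the ZIP file naming convention.
final class ZipNameSketch {

    // Builds DictionaryForMIDs_VVVVV_XXXYYY_ZZZ.zip from its parts.
    static String zipName(String version, String language1FilePostfix,
                          String language2FilePostfix, String dictionaryAbbreviation) {
        return "DictionaryForMIDs_" + version + "_"
                + language1FilePostfix + language2FilePostfix + "_"
                + dictionaryAbbreviation + ".zip";
    }

    public static void main(String[] args) {
        // prints DictionaryForMIDs_3.0.0_EngPor_IDP.zip
        System.out.println(zipName("3.0.0", "Eng", "Por", "IDP"));
    }
}
```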
If you have any problem with setting up a new dictionary, just contact us and we will try to help you !