INTRODUCTION
The main purpose of the encoding registry is to allow encoding conversion information to be installed once and used anywhere. The encoding registry consists of an XML file providing:
- Byte Encoding
-
A byte encoding is a non-Unicode encoding which is mappable to or from Unicode. It contains a reference to a defining mapping that relates that encoding to Unicode.
- Unicode Encoding
-
A natural understanding of Unicode is that there is only one Unicode encoding: Unicode. But Unicode is a very large encoding, and it is common to have processes that work on subsets of Unicode. It is these subsets that we define as Unicode encodings. Thus, for example, an IPA subset of Unicode would be a Unicode encoding, as would lower ASCII. Note that these subsets may overlap.
A Unicode encoding is defined either as an explicit set of characters or as the set of characters covered by a specified mapping.
- Mappings
-
Mappings can be used for processes other than simply converting between bytes and Unicode. Other examples of mapping processes are:
- transcription
-
Transcription is the process of writing the sounds of words from one language in the script of another. This is the process used for multi-script languages.
- transliteration
-
Transliteration is the process of re-spelling words written in one script using the letters of another. For example, Hebrew is often transliterated into Roman script so that each character is uniquely identifiable in the Roman script rendering.
Each mapping may be implemented in multiple ways. For example, conversion from SIL IPA to Unicode may be achieved using a TECkit binary mapping, a UTR22C XML mapping or a TECkit source language mapping. Someone may have written a Python transducer for the process, ICU may be able to do the conversion, or there may be any number of other implementations. One of the aims of the encoding registry is to keep track of all these different forms, allowing an application to use whichever is most suitable to it.
- Fonts
-
A side issue to data conversion is identifying which encoding a particular font implies. While this is not always possible, in some cases it is. For example, it is unlikely that text in SILDoulos IPA93 is in any encoding other than SilIPA93.
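To make the idea of Unicode encodings as overlapping subsets concrete, here is a minimal Python sketch. It is an illustration only, not how the registry stores encodings: the names LOWER_ASCII, IPA_SUBSET and covers are hypothetical, and the IPA subset is assumed, for the sake of the example, to be the IPA Extensions block (U+0250 to U+02AF) plus the lowercase Latin letters.

```python
# Illustration only: a "Unicode encoding" modelled as a set of code points.
# The subset definitions below are assumptions made for this example.

# Lower ASCII: code points U+0000 to U+007F.
LOWER_ASCII = frozenset(range(0x00, 0x80))

# Assumed IPA subset: the IPA Extensions block plus lowercase Latin letters.
IPA_EXTENSIONS = frozenset(range(0x0250, 0x02B0))
IPA_SUBSET = IPA_EXTENSIONS | frozenset(ord(c) for c in "abcdefghijklmnopqrstuvwxyz")


def covers(encoding, text):
    """Return True if every character of text belongs to the encoding."""
    return all(ord(ch) in encoding for ch in text)


# The two subsets overlap: the lowercase Latin letters belong to both.
overlap = LOWER_ASCII & IPA_SUBSET

if __name__ == "__main__":
    print(covers(LOWER_ASCII, "ship"))       # plain ASCII text
    print(covers(LOWER_ASCII, "\u0283ip"))   # starts with U+0283 (esh)
    print(covers(IPA_SUBSET, "\u0283ip"))
    print(len(overlap))
```

A process that only handles one subset can use a check like covers to decide whether a given piece of text falls within the encoding it understands.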
ENCODING REGISTRY
The encoding registry is a single XML file holding all of the encoding, mapping and font information described above. Keeping it in one XML file allows different applications to make use of and manipulate the information in a cross-platform way.
Locating the Registry
On Windows the encoding registry may be found at:
HKLM\SOFTWARE\SIL\EncodingConverterRepository\Registry
which is a textual key containing the path and filename of the registry file. On Linux the default locations are:
~/.SIL/Converters/registry.xml
/etc/SIL/Converters/registry.xml
All these locations may be overridden using the environment variable MAPPINGPATH.
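The lookup described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the module's own logic: it assumes MAPPINGPATH holds the full path to the registry file, that the per-user location is tried before the system-wide one, and it leaves out the Windows registry lookup entirely. The function names candidate_registry_paths and locate_registry are hypothetical.

```python
import os
from pathlib import Path


def candidate_registry_paths(env=None, home=None):
    """List the places to look for the registry file, in search order.

    Assumptions: MAPPINGPATH names the registry file itself, and the
    per-user location takes precedence over the system-wide one.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    override = env.get("MAPPINGPATH")
    if override:
        # The environment variable overrides every default location.
        return [Path(override)]

    return [
        home / ".SIL" / "Converters" / "registry.xml",
        Path("/etc/SIL/Converters/registry.xml"),
    ]


def locate_registry(env=None, home=None):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidate_registry_paths(env, home):
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    print(locate_registry())
```

Passing env and home explicitly makes the search order easy to check without touching the real environment.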
XML Format
The module contains both a DTD and an XSD definition of the XML file format, but the aim is that users do not need to know anything about the details of the file format.
Using the Registry
The Encode::Registry module is designed to make use of the encoding registry. Other programs may use the registry however they like, but this module also includes a textual tool, encrem, which allows relatively easy interaction with the registry.
Encrem allows the addition, removal and listing of mappings, encodings and font information. Since it uses a simple interface it can be scripted or used from the command line.
Example Session
Here we examine a sample session with encrem:
encrem -o registry.xml
Runs encrem, which will write the resulting XML to registry.xml.
encrem: help                    lists known commands
encrem: create                  creates an empty database
encrem: register                registers the file in the registry on Windows
encrem: help add-encoding       gets help on the add-encoding command
encrem: add-encoding silipa93 silipa93.tec
This last command creates a new byte encoding called silipa93, a corresponding Unicode encoding with the same coverage, and a mapping that converts between the two, which is implemented by silipa93.tec. Names can be specified or else the program will come up with its own.
encrem: add-alias sil_ipa93 silipa93    adds some aliases for this encoding
encrem: add-alias sil-ipa93 silipa93