Localizing software to regional preferences is a pressing requirement
today. Defining the ISCII standards was the first step taken to promote
Indian language scripts. The Unicode standard adopted part of the ISCII-1988
specification when defining its encoding for Indian scripts.
These standards have been instrumental in addressing many
localization issues. However, using Unicode alone does not serve the purpose
completely. Unicode provides specifications and standards for languages
across the world, and it refers to Indian language scripts as Indic scripts.
The Unicode Consortium publishes code charts that define the code points
of each script. Code charts are provided for the following Indian scripts:
- Devanagari
- Bengali
- Gurmukhi
- Gujarati
- Oriya
- Tamil
- Telugu
- Kannada
- Malayalam
Some Unicode code charts assigned to Indic scripts make no distinction
between the languages that share the script.
For example, the Devanagari script is the base for the following languages:
- Hindi
- Marathi
- Sanskrit
- Konkani
- Nepali
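As a minimal sketch of what this means in practice, the Python snippet below identifies the script of a character from its code point. The block ranges are those of the Unicode code charts for the nine scripts listed above; the names INDIC_BLOCKS, script_of and scripts_in are illustrative only and are not part of any existing library. The point it illustrates is that code points reveal the script but not the language.

```python
# Code point ranges of the Unicode blocks for the nine scripts listed above.
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch):
    """Return the name of the Indic block containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in INDIC_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def scripts_in(text):
    """Return the set of Indic scripts that occur in text."""
    return {script_of(ch) for ch in text} - {None}


# The words for "Hindi" and "Marathi", each written in its own language,
# both report the same script; the language cannot be recovered.
print(scripts_in("हिंदी"))   # {'Devanagari'}
print(scripts_in("मराठी"))  # {'Devanagari'}
```

Any language-specific behaviour therefore has to come from information outside the encoded text, such as a parameter supplied by the user.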
Most localization software uses Unicode encoding schemes for data
representation and storage. However, merely storing data and displaying
it on screen does not serve local users, who also need to
manipulate that data in daily use. Text manipulation raises
several points that must be considered:
- Unlike English, Indian languages are phonetic in
nature.
- Languages that use the same script follow different rules
for manipulating text.
- Even within one language, different people follow different
rules for manipulating text, so users must be given flexibility in
choosing them.
- The Unicode character encoding order does not match the collation
order of Indian languages.
- In Indian languages, unlike English, a single code point
cannot always represent a character. A character is a
combination of one or more code points (consonants,
modifiers, matras, the virama sign, etc.) and is called a 'syllable'.
- Multiple code points must therefore often be treated as a
single element.
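The last two points can be sketched in Python as follows. This is a simplified illustration for Devanagari only, not the TME algorithm: it assumes a syllable is either an independent vowel or a consonant cluster joined by viramas, optionally followed by a matra (or a final virama) and modifier signs. The character classes cover only the core of the Devanagari block, and the function name syllables is hypothetical.

```python
import re

# Simplified Devanagari character classes (core of the block only).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant + optional nukta
VIRAMA = "\u094D"
MATRA = "[\u093E-\u094C\u0962\u0963]"              # dependent vowel signs
MODIFIER = "[\u0900-\u0903]"                       # candrabindu, anusvara, visarga
VOWEL = "[\u0904-\u0914\u0960\u0961]"              # independent vowels

# A syllable is a consonant cluster or an independent vowel, plus any
# trailing modifiers; any other character is returned on its own.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}(?:{VIRAMA}{CONSONANT})*(?:{MATRA}|{VIRAMA})?|{VOWEL})"
    f"{MODIFIER}*|.",
    re.DOTALL,
)


def syllables(text):
    """Split text into syllables under the simplified rules above."""
    return SYLLABLE.findall(text)


word = "\u0928\u092E\u0938\u094D\u0924\u0947"  # नमस्ते
print(len(word))              # 6 code points
print(len(syllables(word)))   # 3 syllables: न, म, स्ते
```

A production implementation would need per-script tables and would have to handle the cases this sketch ignores, such as Vedic signs and joiner characters.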
We address the lack of text manipulation support in the Unicode
standard, taking into account the rules
followed by the different Indian scripts and their corresponding languages.
The main point of focus is that many Indian languages share a
script, yet follow different rules for text manipulation even though
the script is the same.
The project provides a Text Manipulation Engine (TME) for all Indian
languages.
Some of the services that our engine shall provide are:
- Sorting: As stated earlier, most Indian languages, even when based on the
same script, have different ordering behaviour, and no single
standard collation order is defined for Indian languages. To give users
flexibility, we will provide parameter-based sorting.
We will treat the ‘syllable’ (a combination of one or more code points) as
the sorting element so that results match the user’s perception. Sorting
will also support case-insensitive and intelligent-guess modes.
- Searching: We will provide both sequential and indexed searching
algorithms, including case-insensitive searching.
- Soundex: Soundex algorithms will be provided using a rule base.
- Length: Length operations will count syllables as elements, so that the
reported length matches the user’s perception.
- Substring: Substring operations will treat the syllable as the element.
- Concatenation (known as 'Sandhi' in Sanskrit and Hindi): This is the most
complicated part of the project, because concatenating two words yields
an independent, meaningful word determined by the 'Sandhi' rules, which
in turn depend on the syllables involved.
TME shall be integrated with PostgreSQL to
provide the INDCHAR data type. TME services shall be used to
perform operations on information stored as INDCHAR.
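The syllable-based length, substring and parameter-based sorting services described above might look roughly like the following Python sketch. It is not the TME interface: all function names are hypothetical, the syllable rules are a simplified Devanagari-only approximation, and the trailing_conjuncts parameter is an invented example of a user-supplied ordering rule (here, treating a conjunct such as क्ष as a separate letter that sorts after all ordinary letters, a convention that some orderings are described as following).

```python
import re

# Simplified Devanagari syllable pattern (an assumption, see lead-in).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"
VIRAMA = "\u094D"
MATRA = "[\u093E-\u094C\u0962\u0963]"
MODIFIER = "[\u0900-\u0903]"
VOWEL = "[\u0904-\u0914\u0960\u0961]"
SYLLABLE = re.compile(
    f"(?:{CONSONANT}(?:{VIRAMA}{CONSONANT})*(?:{MATRA}|{VIRAMA})?|{VOWEL})"
    f"{MODIFIER}*|.",
    re.DOTALL,
)


def syllables(text):
    return SYLLABLE.findall(text)


def syllable_length(text):
    """Length counted in syllables rather than code points."""
    return len(syllables(text))


def syllable_substring(text, start, end):
    """Substring whose indices count syllables, so none is ever split."""
    return "".join(syllables(text)[start:end])


def collation_key(word, trailing_conjuncts=()):
    """Build a sort key with one entry per syllable.

    Syllables starting with one of trailing_conjuncts sort after all
    other syllables, in the order the conjuncts are listed.
    """
    key = []
    for syl in syllables(word):
        for rank, conjunct in enumerate(trailing_conjuncts):
            if syl.startswith(conjunct):
                key.append((1, rank, syl[len(conjunct):]))
                break
        else:
            key.append((0, 0, syl))
    return key


def sort_words(words, trailing_conjuncts=()):
    return sorted(words, key=lambda w: collation_key(w, trailing_conjuncts))


namaste = "\u0928\u092E\u0938\u094D\u0924\u0947"   # नमस्ते
print(syllable_length(namaste))                    # 3
print(syllable_substring(namaste, 1, 3))           # मस्ते

kshama = "\u0915\u094D\u0937\u092E\u093E"          # क्षमा
khag = "\u0916\u0917"                              # खग
hal = "\u0939\u0932"                               # हल
ksha = "\u0915\u094D\u0937"                        # the conjunct क्ष
print(sort_words([hal, kshama, khag]))             # क्षमा, खग, हल
print(sort_words([hal, kshama, khag], (ksha,)))    # खग, हल, क्षमा
```

With no parameters the words fall back to code point order within each syllable, while the same word list sorts differently once the caller supplies a rule. That is the kind of flexibility the parameter-based design is meant to allow.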