Text Manipulation Engine
[ Main ] [ Features ] [ CVS ] [ Download ] [ Developers ] [ Documentation ]
Localizing software to regional preferences is a present-day requirement. The ISCII standards were the first step taken to promote Indian language scripts, and Unicode adopted part of the ISCII-1988 specification when defining its encoding for Indian languages. These standards have been instrumental in addressing many localization issues. However, using Unicode alone does not serve the purpose completely.

Unicode provides specifications and standards for languages across the world. Indian language scripts are called Indic scripts. The Unicode Consortium publishes code charts that define the code points of each script. Code charts are provided for the following Indian scripts: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam.

The Unicode code charts assigned to Indic scripts make no distinction between the languages written in a script. Devanagari, for example, is the base script for Hindi, Marathi, Sanskrit, Konkani and Nepali.

Most localized software uses Unicode encoding schemes for data representation and storage. However, storing data and displaying it on screen is not enough for local users, who also need to manipulate that data in daily use. Text manipulation raises several points to consider:

- Unlike English, Indian languages are phonetic in nature.
- Languages that use the same script follow different rules when manipulating text.
- Within the same language, different people follow different rules for manipulating text, so users must be given flexibility in choosing them.
- The Unicode character encoding order does not match the collation order of Indian languages.
- In Indian languages, unlike English, a single code point cannot always represent a character. A character is a combination of one or more code points (consonants, modifiers, Matra, Viram sign etc.) and is called a 'syllable'. Multiple code points therefore often have to be treated as a single element.
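The idea of treating several code points as one 'syllable' can be sketched in a few lines of Python. This is only an illustrative approximation for Devanagari, not the TME implementation: the names `VIRAMA` and `syllables` are hypothetical, and the rule used (attach combining marks to the preceding letter, and attach a consonant to a preceding Viram sign) is an assumed simplification that ignores many script-specific cases.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari Viram sign (halant); hypothetical constant name


def syllables(text):
    """Split text into approximate syllables (simplified, assumed rule).

    A combining mark (Matra, modifier, Viram sign) is attached to the
    preceding cluster. A letter that follows a Viram sign is also attached,
    so that conjunct consonants stay together.
    """
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and category == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The word "Hindi" in Devanagari: HA, vowel sign I, NA, Viram, DA, vowel sign II
word = "\u0939\u093F\u0928\u094D\u0926\u0940"
print(len(word))             # 6 code points
print(len(syllables(word)))  # 2 syllables
```

Six code points collapse into two elements here, which is the unit a reader of the script would perceive.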
We would like to address the lack of text manipulation support in the Unicode standard, taking into account the rules followed by the different Indian scripts and their corresponding languages. The main point of focus is that many Indian languages share a script and yet follow different rules for text manipulation. The project provides a Text Manipulation Engine (TME) for all Indian languages. Some of the services that the engine shall provide are:

Sorting: As stated earlier, most Indian languages, though based on the same script, have different ordering behaviour, and no single standard defines the collation order for Indian languages. To give users flexibility, sorting will be parameter-based. The 'syllable' (a combination of one or more code points) will be the sorting element, so that results match the user's perception. Sorting will also include case-insensitive and intelligent-guess sorting.

Searching: Both sequential and indexed searching algorithms will be provided, including case-insensitive searching.

Soundex: Soundex algorithms will be provided using a rule base.

Length: Length operations will count syllables as elements, so that the reported length matches the user's perception.

Substring: Substring operations will treat the syllable as the element.

Concatenation (known as 'Sandhi' in Sanskrit and Hindi): This is the most complicated part of the project, because concatenating two words yields an independent, meaningful word according to the 'Sandhi' rules, which in turn depend on the syllables involved.

TME shall be integrated with PostgreSQL to provide the INDCHAR data type. TME services shall be used to perform operations on information stored as INDCHAR.
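A minimal Python sketch of the syllable-based length, substring and parameter-based sorting services might look like the following. It is not the TME API: every name (`syllables`, `syl_length`, `syl_substring`, `sort_key`, `late_conjuncts`) is hypothetical, the syllable splitter is a simplified Devanagari-only approximation, and the `late_conjuncts` parameter is merely one assumed way to express a language-specific ordering preference (here, placing the conjunct KSSA after all other letters, used purely as an illustration).

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari Viram sign (halant)


def syllables(text):
    """Approximate syllable split: marks and post-Viram letters join the cluster."""
    clusters = []
    for ch in text:
        cat = unicodedata.category(ch)
        joins = cat in ("Mn", "Mc") or (
            cat == "Lo" and bool(clusters) and clusters[-1].endswith(VIRAMA)
        )
        if clusters and joins:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def syl_length(text):
    """Length counted in syllables rather than code points."""
    return len(syllables(text))


def syl_substring(text, start, count):
    """Substring addressed by syllable index, never cutting a syllable apart."""
    return "".join(syllables(text)[start:start + count])


def sort_key(word, late_conjuncts=()):
    """Build a per-syllable sort key.

    late_conjuncts is a hypothetical parameter: syllables starting with one
    of these strings sort after all other syllables, in the order given.
    Everything else falls back to plain code point order.
    """
    key = []
    for syl in syllables(word):
        for rank, conjunct in enumerate(late_conjuncts):
            if syl.startswith(conjunct):
                key.append((1, rank, syl))
                break
        else:
            key.append((0, 0, syl))
    return key


hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"  # "Hindi"
kshama = "\u0915\u094D\u0937\u092E\u093E"       # "kshama"
hava = "\u0939\u0935\u093E"                     # "hava"
kamal = "\u0915\u092E\u0932"                    # "kamal"
KSSA = "\u0915\u094D\u0937"                     # conjunct KA + Viram + SSA

print(syl_length(hindi))  # 2, although the string holds 6 code points
print(syl_substring(hindi, 1, 1) == "\u0928\u094D\u0926\u0940")  # True

words = [kshama, hava, kamal]
default_order = sorted(words, key=sort_key)
custom_order = sorted(words, key=lambda w: sort_key(w, late_conjuncts=(KSSA,)))
print(default_order == [kamal, kshama, hava])  # True
print(custom_order == [kamal, hava, kshama])   # True
```

Passing the ordering rule as a parameter keeps the same engine usable for several languages that share a script, which is the flexibility the sorting service aims for. A real collation table would need to rank every letter, not just a few exceptions.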
Contact us: tme_general@lists.sourceforge.net