ISO/IEC JTC1/SC2/WG2 N_______
Date: 1997-06-13
This is an unofficial HTML version of a document submitted to WG2.

Title: Mapping of Sinhala between ISO/IEC 10646 and SLS 1134

Source: Michael Everson, Everson Gunn Teoranta (IE)
Status: Expert Contribution
Action: For consideration by JTC1/SC2/WG2

Since Sinhala will be encoded according to the Brahmic harmonization in ISO/IEC 10646, it is important that mappings between UCS Sinhala and 7- and 8-bit Sinhala be made, so that roundtrip integrity in text transfer for data encoded in the two standards can be achieved. This paper is a contribution toward a mapping for data encoded in ISO/IEC 10646 and data encoded in SLS 1134.

In addition to the basic mapping table, information is given on the encoding of conjuncts in Sinhala and Pali with ISO/IEC 10646.

Equivalence table

ISO/IEC 10646                                 SLS 1134
000  0D80  (This position shall not be used)  --
001  0D81  (This position shall not be used)  --
002  0D82  SINHALA SIGN ANUSVARA              82
003  0D83  SINHALA SIGN VISARGA               83
004  0D84  (This position shall not be used)  --
005  0D85  SINHALA LETTER A                   85
006  0D86  SINHALA LETTER AA                  85+CF
007  0D87  SINHALA LETTER I                   89
008  0D88  SINHALA LETTER II                  8A
009  0D89  SINHALA LETTER U                   8B
010  0D8A  SINHALA LETTER UU                  8B+DF
011  0D8B  SINHALA LETTER VOCALIC R           8D
012  0D8C  SINHALA LETTER VOCALIC L           8F
013  0D8D  (This position shall not be used)  --
014  0D8E  SINHALA LETTER E                   91
015  0D8F  SINHALA LETTER EE                  91+CA
016  0D90  SINHALA LETTER AI                  91+D9
017  0D91  (This position shall not be used)  --
018  0D92  SINHALA LETTER O                   94
019  0D93  SINHALA LETTER OO                  94+D3
020  0D94  SINHALA LETTER AU                  94+DF
021  0D95  SINHALA LETTER KA                  9A
022  0D96  SINHALA LETTER KHA                 9B
023  0D97  SINHALA LETTER GA                  9C
024  0D98  SINHALA LETTER GHA                 9D
025  0D99  SINHALA LETTER NGA                 9E
026  0D9A  SINHALA LETTER CA                  A0
027  0D9B  SINHALA LETTER CHA                 A1
028  0D9C  SINHALA LETTER JA                  A2
029  0D9D  SINHALA LETTER JHA                 A3
030  0D9E  SINHALA LETTER NYA                 A4
031  0D9F  SINHALA LETTER TTA                 A7
032  0DA0  SINHALA LETTER TTHA                A8
033  0DA1  SINHALA LETTER DDA                 A9
034  0DA2  SINHALA LETTER DDHA                AA
035  0DA3  SINHALA LETTER NNA                 AB
036  0DA4  SINHALA LETTER TA                  AD
037  0DA5  SINHALA LETTER THA                 AE
038  0DA6  SINHALA LETTER DA                  AF
039  0DA7  SINHALA LETTER DHA                 B0
040  0DA8  SINHALA LETTER NA                  B1
041  0DA9  SINHALA LETTER NNNA                B2
042  0DAA  SINHALA LETTER PA                  B4
043  0DAB  SINHALA LETTER PHA                 B5
044  0DAC  SINHALA LETTER BA                  B6
045  0DAD  SINHALA LETTER BHA                 B7
046  0DAE  SINHALA LETTER MA                  B8
047  0DAF  SINHALA LETTER YA                  BA
048  0DB0  SINHALA LETTER RA                  BB
049  0DB1  SINHALA LETTER RRA                 BC
050  0DB2  SINHALA LETTER LA                  BD
051  0DB3  SINHALA LETTER LLA                 C5
052  0DB4  SINHALA LETTER LLLA                BE
053  0DB5  SINHALA LETTER VA                  C0
054  0DB6  SINHALA LETTER SHA                 C1
055  0DB7  SINHALA LETTER SSA                 C2
056  0DB8  SINHALA LETTER SA                  C3
057  0DB9  SINHALA LETTER HA                  C4
058  0DBA  (This position shall not be used)  --
059  0DBB  (This position shall not be used)  --
060  0DBC  (This position shall not be used)  --
061  0DBD  (This position shall not be used)  --
062  0DBE  SINHALA VOWEL SIGN AA              CF
063  0DBF  SINHALA VOWEL SIGN I               D2
064  0DC0  SINHALA VOWEL SIGN II              D3
065  0DC1  SINHALA VOWEL SIGN U               D4
066  0DC2  SINHALA VOWEL SIGN UU              D6
067  0DC3  SINHALA VOWEL SIGN VOCALIC R       D8
068  0DC4  SINHALA VOWEL SIGN VOCALIC RR      D8+D8
069  0DC5  (This position shall not be used)  --
070  0DC6  SINHALA VOWEL SIGN E               D9
071  0DC7  SINHALA VOWEL SIGN EE              DA
072  0DC8  SINHALA VOWEL SIGN AI              DB
073  0DC9  (This position shall not be used)  --
074  0DCA  SINHALA VOWEL SIGN O               DC
075  0DCB  SINHALA VOWEL SIGN OO              DD
076  0DCC  SINHALA VOWEL SIGN AU              DE
077  0DCD  SINHALA SIGN VIRAMA                CA
078  0DCE  (This position shall not be used)  --
079  0DCF  (This position shall not be used)  --
080  0DD0  SINHALA LETTER AE                  85+D0
081  0DD1  SINHALA LETTER AAE                 85+D1
082  0DD2  SINHALA VOWEL SIGN AE              D0
083  0DD3  SINHALA VOWEL SIGN AAE             D1
084  0DD4  (This position shall not be used)  --
085  0DD5  (This position shall not be used)  --
086  0DD6  (This position shall not be used)  --
087  0DD7  (This position shall not be used)  --
088  0DD8  SINHALA LETTER NYGA                9F
089  0DD9  SINHALA LETTER JNYA                A5
090  0DDA  SINHALA LETTER NYJA                A6
091  0DDB  SINHALA LETTER NNDDA               AC
092  0DDC  SINHALA LETTER NDA                 B3
093  0DDD  SINHALA LETTER MBA                 B9
094  0DDE  SINHALA LETTER FA                  C6
095  0DDF  (This position shall not be used)  --
096  0DE0  SINHALA LETTER VOCALIC RR          8D+D8
097  0DE1  SINHALA LETTER VOCALIC LL          8F+F3
098  0DE2  SINHALA VOWEL SIGN VOCALIC L       DF
099  0DE3  SINHALA VOWEL SIGN VOCALIC LL      F3
100  0DE4  (This position shall not be used)  --
101  0DE5  (This position shall not be used)  --
102  0DE6  (This position shall not be used)  --
103  0DE7  SINHALA DIGIT ONE                  --
104  0DE8  SINHALA DIGIT TWO                  --
105  0DE9  SINHALA DIGIT THREE                --
106  0DEA  SINHALA DIGIT FOUR                 --
107  0DEB  SINHALA DIGIT FIVE                 --
108  0DEC  SINHALA DIGIT SIX                  --
109  0DED  SINHALA DIGIT SEVEN                --
110  0DEE  SINHALA DIGIT EIGHT                --
111  0DEF  SINHALA DIGIT NINE                 --
112  0DF0  SINHALA NUMBER TEN                 --
113  0DF1  SINHALA NUMBER TWENTY              --
114  0DF2  SINHALA NUMBER THIRTY              --
115  0DF3  SINHALA NUMBER FORTY               --
116  0DF4  SINHALA NUMBER FIFTY               --
117  0DF5  SINHALA NUMBER SIXTY               --
118  0DF6  SINHALA NUMBER SEVENTY             --
119  0DF7  SINHALA NUMBER EIGHTY              --
120  0DF8  SINHALA NUMBER NINETY              --
121  0DF9  SINHALA NUMBER ONE HUNDRED         --
122  0DFA  SINHALA NUMBER ONE THOUSAND        --
123  0DFB  (This position shall not be used)  --
124  0DFC  (This position shall not be used)  --
125  0DFD  (This position shall not be used)  --
126  0DFE  (This position shall not be used)  --
127  0DFF  SINHALA SIGN KUNDALIYA             F4

ISO/IEC 10646                                 SLS 1134
00B0+0DCD  SINHALA REPAYA                     BB+CD
0020+ZWNJ  INVISIBLE CODE                     84
0D80+ZWNJ  SINHALA AL-LAKUNA                  CA
0D80+ZWJ   LINK CODE                          CC
0D80       SHORT CODE                         CD

Archaic Sinhala numerals in UCS data cannot be encoded in SLS 1134 (see clause 4.11). The SHORT CODE is also used to generate secondary forms of the vowel symbols U and UU. I am not sure if this is necessary in a 10646 context, if selection of the long or short form is based on the preceding consonant.
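
The equivalence table can be applied mechanically to SLS 1134 data that follows the phonetic order. What follows is a minimal sketch in Python, not a complete converter: it copies only a handful of rows from the table, the names SLS_TO_UCS and sls_to_ucs are hypothetical, and the UCS values are the positions proposed in this paper. It assumes that composed vowels such as 85+CF are stored as consecutive bytes, and it tries the longest byte sequence first so that 85+CF maps to LETTER AA rather than to LETTER A followed by VOWEL SIGN AA.

```python
# Partial mapping copied from the equivalence table in this paper.
# Keys are SLS 1134 byte sequences; values are the proposed UCS positions.
SLS_TO_UCS = {
    b"\x85": "\u0D85",      # SINHALA LETTER A
    b"\x85\xCF": "\u0D86",  # SINHALA LETTER AA (85+CF)
    b"\x89": "\u0D87",      # SINHALA LETTER I
    b"\x9A": "\u0D95",      # SINHALA LETTER KA
    b"\xB6": "\u0DAC",      # SINHALA LETTER BA
    b"\xCF": "\u0DBE",      # SINHALA VOWEL SIGN AA
    b"\xD4": "\u0DC1",      # SINHALA VOWEL SIGN U
}

MAX_KEY_LEN = max(len(key) for key in SLS_TO_UCS)


def sls_to_ucs(data: bytes) -> str:
    """Convert SLS 1134 bytes to UCS text, longest match first."""
    out = []
    i = 0
    while i < len(data):
        for n in range(MAX_KEY_LEN, 0, -1):
            chunk = data[i:i + n]
            if chunk in SLS_TO_UCS:
                out.append(SLS_TO_UCS[chunk])
                i += len(chunk)
                break
        else:
            # No table row matched at this offset.
            raise ValueError(f"unmapped SLS 1134 byte 0x{data[i]:02X} at offset {i}")
    return "".join(out)
```

Visually ordered SLS 1134 data (for example E + KA + AA for KO) would need reordering before a lookup of this kind could be used.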


Coding of Sinhala syllables and conjuncts

In my reading of SLS 1134 I do not find explicit information as to how the vowels which that standard composes of multiple units are to be coded in the text stream. In Appendix A.1 of that standard it is stated, a bit ambiguously:
"The documentation system used for transliterating throughout this standard is a linear method and all entries will display on the screen as they are entered. But in the other method which could be developed on the phonetic basis of the characters, some entries carrying combinations will be stored, if necessary modified and displayed as they are to be documented, for example E + KA + AA (= KO) and KA + O (= KO)."
SLS 1134 therefore appears to encode vowels by decomposing them into their constituent parts. See Appendix A.3:
"Some vowels which can be constructed using relevant consonant modifiers are not included. But for users who are interested in having the complete set of vowels, spaces are still reserved to include those characters, in accordance with the alphabetical order."
Thus SLS 1134 appears to support both phonetic encoding and visual encoding, which will mean that some kind of mode identification will need to be made in order to properly migrate 1134 data to 10646 data.

ISO/IEC 10646-encoded Sinhala is based on the Brahmic harmonization, which means that the underlying encoding is phonetically based. The advantages of Brahmic harmonization are many: improved facility for transliterating Pali to and from Sinhala script, greater ease of portability for software (providing a wider market for Sri Lankan programmers), simplification of string comparisons for historical comparative linguistics, etc. The mapping table presented here will work easily if an SLS 1134 text has been encoded according to this phonetic principle. If it has been encoded according to the visual principle, reliable transfer can also be achieved, though the mapping table will need to be considerably larger. Further information on coding of SLS 1134 texts will be welcome.

Of course, although Sinhala is coded according to the phonetic principle in ISO/IEC 10646, input of characters could still be achieved according to the visual principle if Sri Lankan input preferences require it (the same can be said of Burmese).

Correct formation of conjuncts is somewhat more complex in Sinhala than in other Brahmic scripts, but it is very important because conjuncts are formed differently depending on whether a word is Sinhala or Pali. From Gunasekare:

"It is important to observe that letters in Elu [Sinhala] words are generally written separately, while the rule is reversed in Pali and Sanskrit. It is therefore incorrect to write the Elu words AT-TA atta, 'branch,' VIS-SA vissa, 'twenty,' as A-TTA, VI-SSA; nor is it thought proper to write the Sanskrit words SA-RVA sarva, 'all,' VII-RYYA víryya, 'strength,' A-NTA-RAA-YA antaráya 'danger,' as SAR-VA, VIIR-YA, AN-TA-RAA-YA." [§19]
Conjunct formation affects syllable boundaries and thereby word-breaking and sorting. For example, the word BUDDHO is realized BUD-DHO in Sinhala but BU-DDHO in Pali. In ISO/IEC 10646, this word would be coded BA + U + DA + VIRAMA + ZWNJ + DHA + O in Sinhala and BA + U + DA + VIRAMA + DHA + O in Pali (in linguistic comparison operations the viramas and joiners can simply be ignored). An additional complexity is the difference between Pali "kerned" conjuncts (as in the BU-DDHO example here) and those with special ligating forms. It should be noted that the kerned conjuncts are more frequent in Sinhala texts than the special conjuncts.
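
The two codings of BUDDHO, and the remark that viramas and joiners can be ignored in linguistic comparison, can be illustrated with a short Python sketch. The code positions for the Sinhala characters are those proposed in the equivalence table of this paper; ZWNJ and ZWJ are U+200C and U+200D. The names comparison_key and IGNORABLE are hypothetical and chosen for illustration only.

```python
# Sinhala positions as proposed in this paper's equivalence table.
BA, DA, DHA = "\u0DAC", "\u0DA6", "\u0DA7"
SIGN_U, SIGN_O = "\u0DC1", "\u0DCA"
VIRAMA = "\u0DCD"
ZWNJ, ZWJ = "\u200C", "\u200D"  # ZERO WIDTH NON-JOINER / JOINER

# BUD-DHO (Sinhala): the ZWNJ after VIRAMA keeps DA and DHA separate.
buddho_sinhala = BA + SIGN_U + DA + VIRAMA + ZWNJ + DHA + SIGN_O
# BU-DDHO (Pali): VIRAMA alone lets DA and DHA form a conjunct.
buddho_pali = BA + SIGN_U + DA + VIRAMA + DHA + SIGN_O

# Characters dropped before a linguistic comparison.
IGNORABLE = {VIRAMA, ZWNJ, ZWJ}


def comparison_key(text: str) -> str:
    """Strip viramas and joiners so both spellings compare equal."""
    return "".join(ch for ch in text if ch not in IGNORABLE)
```

The two strings differ as stored, so rendering and syllable breaking can distinguish them, while comparison_key returns the same value for both.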

In SLS 1134, BUD-DHO is coded BA + U + DA + LINK + E + DHA + AA (or BA + U + DA + LINK + DHA + O). In clause 1.3 it appears to be claimed that BU-DDHA cannot be coded with SLS 1134, but this seems to be contradicted by clause 4.6. I suspect my reading of the standard; clause 1.3 states that it "does not make provisions to produce double linked characters". It seems to me that the three examples here can be encoded thus:

The glyphs for the Pali special conjuncts [CC]A and [BB]A happen to look identical to those for the base letters DDA and NNYA; they should always be encoded with their constituent consonants and VIRAMA or with SHORT.

Sorting

It needs to be said once again that the Sri Lankan requirement for sorting Sinhala correctly has been understood. However, it must also be recognized that we have not been successful in convincing the experts in SLSI that the positions of the characters in the code table are of no relevance to sorting. So here is another attempt. It is true that binary sorting is easy to achieve in 7-bit and 8-bit code tables by following the hexadecimal values of the characters, and so their arrangement is of relevance in such codings (it is also true that correct sorting can be achieved in other ways in 7 bits and 8 bits). But ISO/IEC 10646 sorting cannot be based on binary order of the characters. There are tens of thousands of characters in that standard, and all sorting implementations will have to be based on special tables. ISO/IEC JTC1/SC22/WG20 is the committee responsible for specifying the 10646 sorting and will ensure that Sinhala is correctly specified in those tables with the help of Sri Lankan experts.
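
The difference between binary order and table-driven order can be shown with a small Python sketch. The weights below are hypothetical and are not taken from any WG20 document; they assume only that AE and AAE are alphabetized directly after AA, as the SLS 1134 codes 85+D0 and 85+D1 suggest. The code positions are those of this paper's equivalence table, in which LETTER AE sits at 0DD0, after the consonants.

```python
# Hypothetical collation weights, for illustration only.
WEIGHTS = {
    "\u0D85": 1,  # SINHALA LETTER A
    "\u0D86": 2,  # SINHALA LETTER AA
    "\u0DD0": 3,  # SINHALA LETTER AE
    "\u0DD1": 4,  # SINHALA LETTER AAE
    "\u0D87": 5,  # SINHALA LETTER I
    "\u0D95": 6,  # SINHALA LETTER KA
}


def sort_key(word: str) -> list:
    """Map each character to its table weight (KeyError if unlisted)."""
    return [WEIGHTS[ch] for ch in word]


words = ["\u0D95", "\u0DD0", "\u0D87", "\u0D85"]  # KA, AE, I, A

binary_order = sorted(words)               # by code position
table_order = sorted(words, key=sort_key)  # by collation table
```

Here binary_order places AE last, after KA, while table_order gives A, AE, I, KA. The position of a character in the code table does not constrain the order produced by the sorting table.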

Further information from SLSI and other Sri Lankan experts on the sorting of texts with both Sinhala and Pali or Sanskrit words (vis à vis the conjuncts discussed in this paper and with reference to clause 1 note 2 of SLS 1134) will be gratefully received.


Michael Everson, Dublin, 1997-06-28