ISO/IEC JTC1/SC2/WG2 N_______
Date: 1997-06-13
This is an unofficial HTML version of a document submitted to WG2.
In addition to the basic mapping table, information is given on the encoding of conjuncts in Sinhala and Pali with ISO/IEC 10646.
Position | ISO/IEC 10646 | Character name | SLS 1134 |
000 | 0D80 | (This position shall not be used) | -- |
001 | 0D81 | (This position shall not be used) | -- |
002 | 0D82 | SINHALA SIGN ANUSVARA | 82 |
003 | 0D83 | SINHALA SIGN VISARGA | 83 |
004 | 0D84 | (This position shall not be used) | -- |
005 | 0D85 | SINHALA LETTER A | 85 |
006 | 0D86 | SINHALA LETTER AA | 85+CF |
007 | 0D87 | SINHALA LETTER I | 89 |
008 | 0D88 | SINHALA LETTER II | 8A |
009 | 0D89 | SINHALA LETTER U | 8B |
010 | 0D8A | SINHALA LETTER UU | 8B+DF |
011 | 0D8B | SINHALA LETTER VOCALIC R | 8D |
012 | 0D8C | SINHALA LETTER VOCALIC L | 8F |
013 | 0D8D | (This position shall not be used) | -- |
014 | 0D8E | SINHALA LETTER E | 91 |
015 | 0D8F | SINHALA LETTER EE | 91+CA |
016 | 0D90 | SINHALA LETTER AI | 91+D9 |
017 | 0D91 | (This position shall not be used) | -- |
018 | 0D92 | SINHALA LETTER O | 94 |
019 | 0D93 | SINHALA LETTER OO | 94+D3 |
020 | 0D94 | SINHALA LETTER AU | 94+DF |
021 | 0D95 | SINHALA LETTER KA | 9A |
022 | 0D96 | SINHALA LETTER KHA | 9B |
023 | 0D97 | SINHALA LETTER GA | 9C |
024 | 0D98 | SINHALA LETTER GHA | 9D |
025 | 0D99 | SINHALA LETTER NGA | 9E |
026 | 0D9A | SINHALA LETTER CA | A0 |
027 | 0D9B | SINHALA LETTER CHA | A1 |
028 | 0D9C | SINHALA LETTER JA | A2 |
029 | 0D9D | SINHALA LETTER JHA | A3 |
030 | 0D9E | SINHALA LETTER NYA | A4 |
031 | 0D9F | SINHALA LETTER TTA | A7 |
032 | 0DA0 | SINHALA LETTER TTHA | A8 |
033 | 0DA1 | SINHALA LETTER DDA | A9 |
034 | 0DA2 | SINHALA LETTER DDHA | AA |
035 | 0DA3 | SINHALA LETTER NNA | AB |
036 | 0DA4 | SINHALA LETTER TA | AD |
037 | 0DA5 | SINHALA LETTER THA | AE |
038 | 0DA6 | SINHALA LETTER DA | AF |
039 | 0DA7 | SINHALA LETTER DHA | B0 |
040 | 0DA8 | SINHALA LETTER NA | B1 |
041 | 0DA9 | SINHALA LETTER NNNA | B2 |
042 | 0DAA | SINHALA LETTER PA | B4 |
043 | 0DAB | SINHALA LETTER PHA | B5 |
044 | 0DAC | SINHALA LETTER BA | B6 |
045 | 0DAD | SINHALA LETTER BHA | B7 |
046 | 0DAE | SINHALA LETTER MA | B8 |
047 | 0DAF | SINHALA LETTER YA | BA |
048 | 0DB0 | SINHALA LETTER RA | BB |
049 | 0DB1 | SINHALA LETTER RRA | BC |
050 | 0DB2 | SINHALA LETTER LA | BD |
051 | 0DB3 | SINHALA LETTER LLA | C5 |
052 | 0DB4 | SINHALA LETTER LLLA | BE |
053 | 0DB5 | SINHALA LETTER VA | C0 |
054 | 0DB6 | SINHALA LETTER SHA | C1 |
055 | 0DB7 | SINHALA LETTER SSA | C2 |
056 | 0DB8 | SINHALA LETTER SA | C3 |
057 | 0DB9 | SINHALA LETTER HA | C4 |
058 | 0DBA | (This position shall not be used) | -- |
059 | 0DBB | (This position shall not be used) | -- |
060 | 0DBC | (This position shall not be used) | -- |
061 | 0DBD | (This position shall not be used) | -- |
062 | 0DBE | SINHALA VOWEL SIGN AA | CF |
063 | 0DBF | SINHALA VOWEL SIGN I | D2 |
064 | 0DC0 | SINHALA VOWEL SIGN II | D3 |
065 | 0DC1 | SINHALA VOWEL SIGN U | D4 |
066 | 0DC2 | SINHALA VOWEL SIGN UU | D6 |
067 | 0DC3 | SINHALA VOWEL SIGN VOCALIC R | D8 |
068 | 0DC4 | SINHALA VOWEL SIGN VOCALIC RR | D8+D8 |
069 | 0DC5 | (This position shall not be used) | -- |
070 | 0DC6 | SINHALA VOWEL SIGN E | D9 |
071 | 0DC7 | SINHALA VOWEL SIGN EE | DA |
072 | 0DC8 | SINHALA VOWEL SIGN AI | DB |
073 | 0DC9 | (This position shall not be used) | -- |
074 | 0DCA | SINHALA VOWEL SIGN O | DC |
075 | 0DCB | SINHALA VOWEL SIGN OO | DD |
076 | 0DCC | SINHALA VOWEL SIGN AU | DE |
077 | 0DCD | SINHALA SIGN VIRAMA | CA |
078 | 0DCE | (This position shall not be used) | -- |
079 | 0DCF | (This position shall not be used) | -- |
080 | 0DD0 | SINHALA LETTER AE | 85+D0 |
081 | 0DD1 | SINHALA LETTER AAE | 85+D1 |
082 | 0DD2 | SINHALA VOWEL SIGN AE | D0 |
083 | 0DD3 | SINHALA VOWEL SIGN AAE | D1 |
084 | 0DD4 | (This position shall not be used) | -- |
085 | 0DD5 | (This position shall not be used) | -- |
086 | 0DD6 | (This position shall not be used) | -- |
087 | 0DD7 | (This position shall not be used) | -- |
088 | 0DD8 | SINHALA LETTER NYGA | 9F |
089 | 0DD9 | SINHALA LETTER JNYA | A5 |
090 | 0DDA | SINHALA LETTER NYJA | A6 |
091 | 0DDB | SINHALA LETTER NNDDA | AC |
092 | 0DDC | SINHALA LETTER NDA | B3 |
093 | 0DDD | SINHALA LETTER MBA | B9 |
094 | 0DDE | SINHALA LETTER FA | C6 |
095 | 0DDF | (This position shall not be used) | -- |
096 | 0DE0 | SINHALA LETTER VOCALIC RR | 8D+D8 |
097 | 0DE1 | SINHALA LETTER VOCALIC LL | 8F+F3 |
098 | 0DE2 | SINHALA VOWEL SIGN VOCALIC L | DF |
099 | 0DE3 | SINHALA VOWEL SIGN VOCALIC LL | F3 |
100 | 0DE4 | (This position shall not be used) | -- |
101 | 0DE5 | (This position shall not be used) | -- |
102 | 0DE6 | (This position shall not be used) | -- |
103 | 0DE7 | SINHALA DIGIT ONE | -- |
104 | 0DE8 | SINHALA DIGIT TWO | -- |
105 | 0DE9 | SINHALA DIGIT THREE | -- |
106 | 0DEA | SINHALA DIGIT FOUR | -- |
107 | 0DEB | SINHALA DIGIT FIVE | -- |
108 | 0DEC | SINHALA DIGIT SIX | -- |
109 | 0DED | SINHALA DIGIT SEVEN | -- |
110 | 0DEE | SINHALA DIGIT EIGHT | -- |
111 | 0DEF | SINHALA DIGIT NINE | -- |
112 | 0DF0 | SINHALA NUMBER TEN | -- |
113 | 0DF1 | SINHALA NUMBER TWENTY | -- |
114 | 0DF2 | SINHALA NUMBER THIRTY | -- |
115 | 0DF3 | SINHALA NUMBER FORTY | -- |
116 | 0DF4 | SINHALA NUMBER FIFTY | -- |
117 | 0DF5 | SINHALA NUMBER SIXTY | -- |
118 | 0DF6 | SINHALA NUMBER SEVENTY | -- |
119 | 0DF7 | SINHALA NUMBER EIGHTY | -- |
120 | 0DF8 | SINHALA NUMBER NINETY | -- |
121 | 0DF9 | SINHALA NUMBER ONE HUNDRED | -- |
122 | 0DFA | SINHALA NUMBER ONE THOUSAND | -- |
123 | 0DFB | (This position shall not be used) | -- |
124 | 0DFC | (This position shall not be used) | -- |
125 | 0DFD | (This position shall not be used) | -- |
126 | 0DFE | (This position shall not be used) | -- |
127 | 0DFF | SINHALA SIGN KUNDALIYA | F4 |
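The table above could be applied mechanically with a greedy longest-match lookup, since several 10646 positions correspond to a sequence of two SLS 1134 codes (for example 85+CF for SINHALA LETTER AA). The following Python sketch shows the idea on a handful of rows only. The names `SLS_TO_UCS` and `sls_to_ucs` are hypothetical, the code points are the positions proposed in this table, and the sketch assumes the SLS 1134 text is phonetically encoded.

```python
# Excerpt of the mapping table above: SLS 1134 code sequence -> proposed
# ISO/IEC 10646 position. Only six rows are reproduced for illustration;
# a real converter would carry the whole table.
SLS_TO_UCS = {
    (0x85,): 0x0D85,       # SINHALA LETTER A
    (0x85, 0xCF): 0x0D86,  # SINHALA LETTER AA (two SLS codes, one position)
    (0x9A,): 0x0D95,       # SINHALA LETTER KA
    (0xB8,): 0x0DAE,       # SINHALA LETTER MA
    (0xCF,): 0x0DBE,       # SINHALA VOWEL SIGN AA
    (0xD2,): 0x0DBF,       # SINHALA VOWEL SIGN I
}

MAX_KEY_LEN = max(len(key) for key in SLS_TO_UCS)


def sls_to_ucs(data):
    """Convert a bytes object of SLS 1134 codes to a list of code points.

    At each offset the longest matching sequence wins, so 85+CF becomes
    LETTER AA rather than LETTER A followed by VOWEL SIGN AA.
    """
    out = []
    i = 0
    while i < len(data):
        # Never look past the end of the input.
        longest = min(MAX_KEY_LEN, len(data) - i)
        for n in range(longest, 0, -1):
            key = tuple(data[i:i + n])
            if key in SLS_TO_UCS:
                out.append(SLS_TO_UCS[key])
                i += n
                break
        else:
            raise ValueError(
                f"no mapping for SLS 1134 code {data[i]:02X} at offset {i}"
            )
    return out
```

Calling `sls_to_ucs(bytes([0x85, 0xCF, 0xB8, 0xCF]))` would consume 85+CF as a single unit before mapping B8 and the trailing CF individually. Codes absent from the excerpt raise `ValueError` rather than being silently dropped.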
ISO/IEC 10646 | Character name | SLS 1134 |
0DB0+0DCD | SINHALA REPAYA | BB+CD |
0020+ZWNJ | INVISIBLE CODE | 84 |
0D80+ZWNJ | SINHALA AL-LAKUNA | CA |
0D80+ZWJ | LINK CODE | CC |
0D80 | SHORT CODE | CD |
Archaic Sinhala numerals in UCS data cannot be encoded in SLS 1134 (see clause 4.11). The SHORT CODE is also used to generate secondary forms of the vowel symbols U and UU. I am not sure whether this is necessary in a 10646 context if selection of the long or short form is based on the preceding consonant.
The documentation system used for transliterating throughout this standard is a linear method and all entries will display on the screen as they are entered. But in the other method which could be developed on the phonetic basis of the characters, some entries carrying combinations will be stored, if necessary modified and displayed as they are to be documented, for example E + KA + AA (= KO) and KA + O (= KO).
SLS 1134 therefore appears to encode vowels by decomposing them into their constituent parts. See Appendix A.3:
Some vowels which can be constructed using relevant consonant modifiers are not included. But for users who are interested in having the complete set of vowels, spaces are still reserved to include those characters, in accordance with the alphabetical order.
Thus SLS 1134 appears to support both phonetic encoding and visual encoding, which means that some kind of mode identification will be needed in order to properly migrate 1134 data to 10646 data.
ISO/IEC 10646-encoded Sinhala is based on the Brahmic harmonization, which means that the underlying encoding is phonetically-based. The advantages of Brahmic harmonization are many: improved facility for transliterating Pali to and from Sinhala script, greater ease of portability for software (providing a wider market for Sri Lankan programmers), simplification of string comparisons for historical comparative linguistics, etc. The mapping table presented here will work easily if an SLS 1134 text has been encoded according to this phonetic principle. If it has been encoded according to the visual principle, reliable transfer can also be achieved, though the mapping table will need to be considerably larger. Further information on coding of SLS 1134 texts will be welcome.
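To make the difference between the two principles concrete, the following Python sketch reorders one visually encoded pattern, E + KA + AA, into its phonetic equivalent KA + O before any table lookup. It is an illustration under stated assumptions, not a complete converter: the function names are hypothetical, only the vowel sign E is handled, and the consonant range 9A to C6 is inferred from the SLS 1134 column of the mapping table.

```python
# SLS 1134 codes taken from the mapping table.
SIGN_E = 0xD9   # SINHALA VOWEL SIGN E (written to the left of the consonant)
SIGN_AA = 0xCF  # SINHALA VOWEL SIGN AA
SIGN_O = 0xDC   # SINHALA VOWEL SIGN O


def is_consonant(code):
    # Assumption: consonant letters occupy SLS 1134 codes 9A..C6,
    # as the mapping table suggests.
    return 0x9A <= code <= 0xC6


def visual_to_phonetic(codes):
    """Rewrite visually ordered SLS 1134 codes into phonetic order.

    Only two hypothetical rules are implemented:
      E + consonant + AA  ->  consonant + O
      E + consonant       ->  consonant + E
    """
    out = []
    i = 0
    while i < len(codes):
        starts_visual_unit = (
            codes[i] == SIGN_E
            and i + 1 < len(codes)
            and is_consonant(codes[i + 1])
        )
        if starts_visual_unit:
            consonant = codes[i + 1]
            if i + 2 < len(codes) and codes[i + 2] == SIGN_AA:
                out += [consonant, SIGN_O]  # E + KA + AA -> KA + O
                i += 3
            else:
                out += [consonant, SIGN_E]  # E + KA -> KA + E
                i += 2
        else:
            out.append(codes[i])
            i += 1
    return out
```

The sketch must only be applied to text already known to be in visual order. Run on phonetic data, a sequence such as KA + E + MA + AA would be misread as KA followed by a visual MO, which is one reason some form of mode identification is needed.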
Of course, although Sinhala is coded according to the phonetic principle in ISO/IEC 10646, input of characters could still follow the visual principle if Sri Lankan input preferences require it (the same can be said of Burmese).
Correct formation of conjuncts is somewhat more complex in Sinhala than in other Brahmic scripts, but it is very important because conjuncts are formed differently depending on whether a word is Sinhala or Pali. From Gunasekare:
It is important to observe that letters in Elu [Sinhala] words are generally written separately, while the rule is reversed in Pali and Sanskrit. It is therefore incorrect to write the Elu words AT-TA atta, 'branch,' VIS-SA vissa, 'twenty,' as A-TTA, VI-SSA; nor is it thought proper to write the Sanskrit words SA-RVA sarva, 'all,' VII-RYYA víryya, 'strength,' A-NTA-RAA-YA antaráya 'danger,' as SAR-VA, VIIR-YA, AN-TA-RAA-YA. [§19]
Conjunct formation affects syllable boundaries and thereby word-breaking and sorting. For example, the word BUDDHO is realized BUD-DHO in Sinhala but BU-DDHO in Pali.
In ISO/IEC 10646, this word would be coded BA + U + DA + VIRAMA + ZWNJ + DHA + O in Sinhala and BA + U + DA + VIRAMA + DHA + O in Pali (in linguistic comparison operations the viramas and joiners can simply be ignored). An additional complexity is the difference between Pali "kerned" conjuncts (as in the BU-DDHO example here) and those with special ligating forms.
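The remark that viramas and joiners can be ignored in linguistic comparison can be sketched in Python as follows. The code points are the positions proposed in the mapping table of this paper, with U and O taken to be the vowel signs, together with the existing ZWNJ (200C) and ZWJ (200D). The names `IGNORABLE` and `comparison_key` are hypothetical.

```python
# Positions as proposed in the mapping table of this paper.
BA, DA, DHA = 0x0DAC, 0x0DA6, 0x0DA7
SIGN_U, SIGN_O = 0x0DC1, 0x0DCA  # vowel signs U and O
VIRAMA = 0x0DCD
ZWNJ, ZWJ = 0x200C, 0x200D       # zero width non-joiner / joiner

# Characters that affect conjunct formation only, not the letters of the word.
IGNORABLE = {VIRAMA, ZWNJ, ZWJ}


def comparison_key(code_points):
    """Return the sequence with viramas and joiners removed."""
    return tuple(cp for cp in code_points if cp not in IGNORABLE)


# BUD-DHO (Sinhala): ZWNJ after the virama keeps the letters separate.
buddho_sinhala = [BA, SIGN_U, DA, VIRAMA, ZWNJ, DHA, SIGN_O]
# BU-DDHO (Pali): no joiner, so the conjunct is formed.
buddho_pali = [BA, SIGN_U, DA, VIRAMA, DHA, SIGN_O]
```

The two coded sequences differ, but `comparison_key(buddho_sinhala)` and `comparison_key(buddho_pali)` both reduce to BA, U, DA, DHA, O, so the Sinhala and Pali spellings compare as equal.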
In SLS 1134, BUD-DHO is coded BA + U + DA + LINK + E + DHA + AA (or BA + U + DA + LINK + DHA + O). In clause 1.3 it appears to be claimed that BU-DDHA cannot be coded with SLS 1134, but this seems to be contradicted by clause 4.6. I suspect my reading of the standard; clause 1.3 states that it "does not make provisions to produce double linked characters". It seems to me that the three examples here can be encoded thus:
The glyphs for the Pali special conjuncts [CC]A and [BB]A happen to look identical to those for the base letters DDA and NNYA; they should always be encoded with their constituent consonants and VIRAMA or with SHORT.
Further information from SLSI and other Sri Lankan experts on the sorting of texts with both Sinhala and Pali or Sanskrit words (vis à vis the conjuncts discussed in this paper and with reference to clause 1 note 2 of SLS 1134) will be gratefully received.