
Armenian encoding on the Macintosh:
WorldScript, Unicode, and ArmSCII

Michael Everson

The author presents some of the background of 8-bit encoded Armenian on the Macintosh, and discusses some of the technical problems involved in making Macintosh Armenian compatible with worldwide and Armenian encoding standards.

This article first appeared in the Inaugural Issue of the Journal of the Association of Armenian Information Professionals, May 1994.

Introduction

Since the beginning of the use of computers in text processing, languages with particular character needs have been ill-served by the predominance of English as a development medium. Armenian is no exception. This is surprising, because since 1984, with the advent of the first popular computer with a friendly graphical user interface – the Apple Macintosh – even languages with script requirements far more taxing than Armenian’s have had at least the potential for convenient access to their requisite characters. Early in 1994 I undertook a project to create an Apple WorldScript system for Armenian, because I sought to explore some of WorldScript’s capabilities and because I was interested in the Armenian script. WorldScript introduces a new way of handling 8-bit character sets, and enables a user to switch scripts, fonts, and keyboard layouts quickly and easily.

Character sets

Most computers nowadays use 8-bit character sets. What this means to the user is that his or her keyboard can generate 256 different characters. Usually about 32 of these are reserved for special computer functions, such as “delete”, “return” and so forth. Therefore only 224 characters can be used to display and print letters, numbers, and other symbols and signs. ASCII is a 7-bit character set, having only 128 characters; many of today’s 8-bit character sets retain ASCII for their first 128 characters, and so when discussing the differences between them, it’s usually only the last 128 characters which have to be worried about. Unicode, or the Basic Multilingual Plane of ISO 10646, is a 16-bit character set, recently established but not yet widely implemented. It has 65,536 characters, and when it is universally available, many 8-bit issues will become moot.
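The arithmetic in this section can be checked in a few lines. The sketch below is illustrative only. It uses Python's built-in codecs `mac_roman`, `cp850`, and `cp1252` as stand-ins for Macintosh, DOS, and Windows codepages; those codec names are an assumption of the example and are not mentioned elsewhere in this article.

```python
# Sizes of the character sets discussed above.
SEVEN_BIT = 2 ** 7                      # ASCII: 128 code positions
EIGHT_BIT = 2 ** 8                      # 8-bit codepages: 256 code positions
CONTROL_CODES = 32                      # roughly the number reserved for control functions
PRINTABLE = EIGHT_BIT - CONTROL_CODES   # positions left for visible characters
SIXTEEN_BIT = 2 ** 16                   # a 16-bit set such as the Basic Multilingual Plane

# Stand-in codecs (assumption): Python's names for three common 8-bit codepages.
CODEPAGES = ["mac_roman", "cp850", "cp1252"]

lower_half = bytes(range(SEVEN_BIT))
ascii_text = lower_half.decode("ascii")

# For each codepage, record whether its first 128 positions are plain ASCII.
retains_ascii = {cp: lower_half.decode(cp) == ascii_text for cp in CODEPAGES}

print(PRINTABLE, SIXTEEN_BIT, retains_ascii)
```

Because the lower halves agree, any comparison between such codepages only needs to look at positions 80 to FF.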

Interchange between character sets

We are currently dealing with an 8-bit situation for encoding Armenian texts, whether on the Mac or on the PC. Different computer systems have used different 8-bit codepages for various historical reasons, mostly because developers neither communicated with one another nor agreed to use international standards for text interchange. Since many of the first codepages were pioneering implementations, it’s easy to understand how this could have happened. On a Macintosh, LATIN CAPITAL LETTER U WITH ACUTE lives at F2, while on a DOS 850 codepage it lives at E9, and on Windows at DA. In order to transfer a document from one computer to another, special translation utilities must be used which move E9 or DA to F2, and which likewise juggle the other characters in the second half of the codepage, so that the document can be read on the new computer. Fortunately many such utilities are available.
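What such a translation utility does can be sketched as a 256-entry lookup table. This is a minimal illustration, not a description of any of the utilities mentioned in this article. The function names `build_table` and `convert` are hypothetical, and Python's `cp850`, `cp1252`, and `mac_roman` codecs are assumed to correspond to the DOS 850, Windows, and Macintosh codepages.

```python
def build_table(source: str, target: str, fallback: int = ord("?")) -> bytes:
    """Build a 256-byte table mapping each byte of `source` to `target`.

    Bytes with no equivalent in the target codepage map to `fallback`.
    """
    table = bytearray(256)
    for code in range(256):
        try:
            char = bytes([code]).decode(source)   # byte -> character
            table[code] = char.encode(target)[0]  # character -> byte in target
        except UnicodeError:
            # Undefined in the source, or missing from the target codepage.
            table[code] = fallback
    return bytes(table)


def convert(data: bytes, table: bytes) -> bytes:
    """Re-encode `data` byte by byte using a table from build_table()."""
    return data.translate(table)


DOS_TO_MAC = build_table("cp850", "mac_roman")
WIN_TO_MAC = build_table("cp1252", "mac_roman")

# Where does U WITH ACUTE end up when converted to the Macintosh codepage?
print(hex(DOS_TO_MAC[0xE9]), hex(WIN_TO_MAC[0xDA]))
```

A table-driven design keeps the conversion itself trivial; all the knowledge about the two codepages lives in the table, which is why one utility can serve many codepage pairs.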

Armenian Macs

Armenian has had the usual history of a non-Roman script on the Mac: early on, bitmaps were created, and Armenian characters mapped to the keys in a way that made sense to the font developer. This might have been phonetic, so ARMENIAN LETTER AYB was put on LATIN LETTER A, or it may have been according to a typewriter layout. Later, laser fonts were encoded in the same way. Lola Koundakjian (1991) wrote an article describing some of the problems arising from a lack of keyboard/codepage standardization. Mapping characters to the codepage in order to achieve a particular keyboard configuration creates new codepages. A text written in Raffi Kojian’s Ararat, which has a Papazian (Babazean) phonetic key layout, looks like gibberish if transported to Dirk Van Damme’s Armenian, which has an Olympia typewriter key layout. The way to correct this is to adopt a single codepage, accessed by different keyboard layouts.

WorldScript requirements

In order to create a WorldScript system for Armenian, the following things are required:
  1. the character repertoire (alphabet and punctuation) must be defined;
  2. those characters must be arranged on an Apple Macintosh codepage;
  3. bitmapped system fonts must be created for menus and filenames;
  4. keyboard layout resources for inputting must be designed;
  5. date, time, number, and currency resources must be defined;
  6. sorting routines and word-selection tables must be defined;
  7. PostScript or TrueType fonts must be created or remapped to the new codepage; and
  8. conversion utilities for PC and Mac codepages must be developed.
Following are some of the findings I made during the course of the development of Armenian WorldScript utilities to beta-stage.

Alphabet

Everyone agrees that the standard alphabet,

Aa Bb Gg Dd Ee Zz E/e/ A/a/ T/t/ Z/z/ Ii Ll Xx Cc Kk Hh Jj L/l/ Qq Mm Yy Nn S/s/ Oo Q/q/ Pp J/j/ R/r/ Ss Vv Tt Rr C/c/ Ww P/p/ K/k/ O/o/ Ff (Aa Bb Gg Dd Ee Zz Ēē Əə Tʻtʻ Žž Ii Ll Xx Cc Kk Hh Jj Łł Qq Mm Yy Nn Šš Oo Čʻčʻ Pp J̌ǰ Ṙṙ Ss Vv Tt Rr Cʻcʻ Ww Pʻpʻ Kʻkʻ Ōō Ff)
must be included. Unicode 1.1 and ISO 10646 contain the ligature ev as well as some other presentation forms (ligatures with MEN and NOW). Ligature ev does not appear, however, in DIS 10585, a Draft International Standard on Armenian encoding. There are technical and cultural reasons for the disagreement with regard to the inclusion of the digraphs Uu and Evev. Technical reasons for not including these characters in a standard codepage are:

  1. they are digraphs, not ligatures, and can be composed of other existing characters; and
  2. they give rise to ambiguous capitalizations (glyph forms): OW? Ow? EV? Ev? EW? Ew? Ev?
Technical reasons for including these characters are:
  1. they form part of one standard used currently in Armenia, called ArmSCII (ArmSCII Standard 1990); and
  2. they have been proposed (for the same reason) to be included in Unicode/ISO 10646.
I allocated space for the ArmSCII ligatures in the Mac codepage.
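Both technical objections can be observed in the character data shipped with a present-day Python interpreter. The sketch below is an illustration under that assumption: it inspects the ev ligature as encoded in current Unicode (U+0587), which is not necessarily how Unicode 1.1 or the draft Mac codepage treats it.

```python
import unicodedata

EV = "\u0587"  # the ev ligature as encoded in present-day Unicode

# Objection 1: the ligature can be composed of other existing characters.
name = unicodedata.name(EV)
parts = unicodedata.normalize("NFKD", EV)
part_names = [unicodedata.name(ch) for ch in parts]

# Objection 2: capitalization has to pick one of several candidate forms.
upper = EV.upper()
upper_names = [unicodedata.name(ch) for ch in upper]

print(name)
print(part_names)
print(upper_names)
```

The uppercase form reported here is a single choice made by the character database; whether it is the right one for a given text is exactly the ambiguity listed above.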

Punctuation

Most of the characters are well-defined and there is no problem determining what they correspond to in Unicode. There are some questionable ones, however. In ArmSCII and in DIS 10585, three kinds of dash are defined: “hyphen sign” (miowt/yan gic), “direct speech sign” (anq/atman gic), and “ligature” (ent/amna). Apparently the consensus is that ent/amna functions as a hyphen and should be unified with hyphen; this however introduces an imperfect compatibility with ArmSCII and DIS 10585, so I have unified ent/amna with EN DASH to ensure that it has a unique encoding. Characters which do not seem to have much currency in Armenian, such as ke/t (MIDDLE DOT), and dubious characters (such as ARMENIAN MODIFIER LETTER LEFT HALF RING and ARMENIAN APOSTROPHE found in ISO 10646) I did not include in the Macintosh codepage. Other characters are unified with their Roman equivalents. (The punctuation repertoire: apat/arc/ (apostrophe), p/avagcer (parentheses), storake/t (comma), mij/ake/t (full stop), q/akertner (double angle quotation marks), s/es/t (Armenian emphasis mark (not = spacing acute)), bac/akanq/akan ns/an (Armenian exclamation mark (not = spacing tilde)), bowt/ (Armenian comma (not = spacing grave)), harc/akan ns/an (Armenian question mark), patiw (Armenian abbreviation mark), verj/ake/t (Armenian full stop (not = colon)), ent/amna (en dash), anq/atman gic (em dash), basmake/t (ellipsis).) Some rare characters used in dialectology, such as t/aw, sosk, and sul/, are identical with existing characters in ISO 10646; I have not burdened the 8-bit Mac codepage with them (for examples, see Darbinyan 1965:42).

WorldScript codepage

Creation of an Apple WorldScript means that Apple’s rules must be followed. One such rule is that applications should be compatible with all scripts, and therefore can assume “that certain character codes (other than the control codes below $20) are never used”, and “that certain nonlinguistic symbols, such as numerals and punctuation marks, are always located at the same code positions” (Apple 1992:301). Examples of these characters are 7F (delete), CA (nonbreaking space), and A8 (registered sign). Because of this limitation, it is impossible to make the Macintosh codepage identical with most DOS codepages (and there is more than one). To do so would be to violate Apple’s policy and practice, and the result would be a non-conformant WorldScript. This means that an Apple Armenian codepage won’t be identical with an Armenian national standard set for DOS. It is, however, quite simple to convert texts written in one codepage to another. I have tested some of these conversion utilities, such as AIEA’s Transliterate and Jon Wind’s Add/Strip, and found that they work quite well. The advantage of having an Apple Armenian codepage is that it’s easier to translate from one Mac standard to ISO 10646 or ArmSCII than it is to translate from ten. And no inter-Mac translation will be necessary.
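The constraint can be pictured as a simple check that a draft codepage never assigns a letter to a reserved position. The sketch below is hypothetical throughout: the `RESERVED` set holds only the positions named above (the full list is in Apple 1992 and is not reproduced here), and the `draft` assignments are invented for illustration and are not the actual draft Apple Armenian codepage.

```python
from typing import Dict, List

# Positions named in the text above: control codes below $20, plus
# 7F (delete), A8 (registered sign), and CA (nonbreaking space).
# This is an incomplete, illustrative set.
RESERVED = set(range(0x20)) | {0x7F, 0xA8, 0xCA}


def conflicts(codepage: Dict[int, str]) -> List[int]:
    """Return the code positions in `codepage` that collide with RESERVED."""
    return sorted(pos for pos in codepage if pos in RESERVED)


# Hypothetical assignments, invented for this example only.
draft = {
    0x80: "ARMENIAN CAPITAL LETTER AYB",
    0x81: "ARMENIAN CAPITAL LETTER BEN",
    0xA8: "ARMENIAN CAPITAL LETTER GIM",  # deliberately placed on a reserved slot
}

print([hex(pos) for pos in conflicts(draft)])
```

A DOS codepage that places a letter at one of these positions cannot be copied unchanged to the Macintosh, which is why conversion between the two remains necessary.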

Arrangement of Armenian characters on a Mac codepage

There exist at least ten Macintosh codepages for Armenian already. Most of them employ a remapping of Armenian characters according to various phonetic transliterations or keyboard layouts. As such, it can hardly be said that there is a standard at all on the Macintosh. Many existing Mac fonts fail to conform to ArmSCII with regard to the ligatures. The draft Apple Armenian codepage I have developed is, I believe, conformant to ArmSCII, to ISO 10646, and to Apple’s standard encoding practice.

Keyboard layouts

There are several types of key layouts in existence already: “Phonetic Key Layouts”, based either on Eastern or Western pronunciation and arranged according to one or another romanization of Armenian on a standard QWERTY keyboard; and “Ergonomic Key Layouts”, based presumably on some sort of assessment of the most common characters in Armenian. Some of the keyboards in use are patterned on other keyboards; for instance, the original Olympia typewriter key layout was modified by several developers, some keeping closer to the original typewriter keyboard (by using ARMENIAN CAPITAL LETTER JA (ja) for 2, ARMENIAN CAPITAL LETTER YI (yi) for 3, and ARMENIAN CAPITAL LETTER OH (oh) for 0), while others kept numerals in their standard positions and used OPTION- or ALT-keys to access these and other characters.

Apple keyboard requirements

For the purposes of an Apple Macintosh WorldScript package, economy of keyboards is desirable. The chief reason for this is that each additional keyboard in use adds to the amount of memory in the system heap. I have suggested that something like four keyboards be employed based on the Olympia (Eastern phonetic), Papazian (Western phonetic), and Royal (ergonomic) typewriter layouts, and one based on the Hübschmann-Meillet (linguistic transliteration) layout. However, I have seen three ergonomic key layouts, and if these are widely used in Armenia or in the Diaspora, perhaps more than one ergonomic keyboard should be supported.

Keyboard advantages for Armenian

Since an Armenian WorldScript uses only one codepage, users anywhere will, in principle, be able to transfer texts from one Mac to the next without having to change fonts or reencode files with transliteration tools as they do now. By supporting multiple keyboard layouts, the WorldScript allows users to choose their preferred method of input. A Royal typist and an Olympia typist can work on the same document, in the same font on the same machine without any difficulty – all one needs is to select the appropriate keyboard layout, which is as easy as typing OPTION-COMMAND-SPACEBAR! To support Armenian WorldScript, once it has been finished and released, one thing will remain to be done: all existing Macintosh fonts will need to be remapped to the Apple Armenian codepage. But this is the whole point of the exercise: the Macintosh allows you to have multiple key layouts accessing a single codepage. Using the ten key layouts I have seen to date, I’ll give an example of how to type the word Hayastan with each of these key layouts, all accessing the same codepage. I’ll assume we’re typing on a standard Mac keyboard. In the example below, the letters I give are not any kind of transliteration, but simply the keys on the keyboard which you press to get the Armenian letters.

Hübschmann-Meillet H-a-y-a-s-t-a-n
Olympia F-a-3-a-s-t-a-n
Van Damme F-a-z-a-s-t-a-n
Kalantarian Phonetic H-a-j-a-s-t-a-n
Papazian H-a-3-a-s-d-a-n
Royal L-g-3-g-r-k-g-h
Kalantarian Ergonomic L-g-v-g-r-k-g-h
Ani3 L-g-7-g-r-k-g-h
Hieren Y-f-\-f-l-[-f-n
RFE/RL S-h-u-h-t-n-h-j
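The table above can be modelled as several key-to-character maps that all point into one shared encoding. The sketch below is a partial illustration: it covers four of the ten layouts and only the keys needed for this one word, taken from the table. Present-day Unicode code points stand in for the Apple Armenian codepage, whose actual code positions are not given here; that substitution, and the names `LAYOUTS`, `KEYSTROKES`, and `type_keys`, are assumptions of the example.

```python
# The six distinct letters of "Hayastan" (Unicode stand-ins for the codepage).
HO, AYB, YI = "\u0540", "\u0561", "\u0575"    # capital ho, small ayb, small yi
SEH, TIWN, NOW = "\u057d", "\u057f", "\u0576"  # small seh, small tiwn, small now

HAYASTAN = HO + AYB + YI + AYB + SEH + TIWN + AYB + NOW

# Partial layouts: key pressed -> character produced.
LAYOUTS = {
    "Hübschmann-Meillet": {"H": HO, "a": AYB, "y": YI, "s": SEH, "t": TIWN, "n": NOW},
    "Olympia": {"F": HO, "a": AYB, "3": YI, "s": SEH, "t": TIWN, "n": NOW},
    "Papazian": {"H": HO, "a": AYB, "3": YI, "s": SEH, "d": TIWN, "n": NOW},
    "Royal": {"L": HO, "g": AYB, "3": YI, "r": SEH, "k": TIWN, "h": NOW},
}

# Keys pressed in each layout, as listed in the table.
KEYSTROKES = {
    "Hübschmann-Meillet": "Hayastan",
    "Olympia": "Fa3astan",
    "Papazian": "Ha3asdan",
    "Royal": "Lg3grkgh",
}


def type_keys(layout: str, keys: str) -> str:
    """Translate a sequence of key presses into text via the given layout."""
    return "".join(LAYOUTS[layout][key] for key in keys)


results = {name: type_keys(name, keys) for name, keys in KEYSTROKES.items()}
same_text = len(set(results.values())) == 1
print(same_text)
```

The keystrokes differ from layout to layout, but the stored text does not, because the layout is applied only at input time and never leaks into the document.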

Now, if (and only if) each of these key layouts references the same codepage, then it doesn’t matter which key layout anyone types in: the text remains the same. Thus someone who prefers the Papazian layout and another who prefers the Olympia or Hübschmann-Meillet layout can all use the same computer, font, and document, simply by switching the keyboard layout and typing away. Then the only reason to use text conversion utilities will be when switching from platform to platform (Mac to DOS, for instance). It means, for instance, that one doesn’t need to have more than one version of Raffi’s Ararat font on a Mac at any given time, saving valuable disk space and conversion time. And, most importantly, a single codepage standard would facilitate the development of a spell-check dictionary for the Macintosh, or provide a convenient platform for porting an existing PC dictionary to the Mac. Below I give a sample of the four keyboards I have proposed, in Armenian and Roman transliteration.

Continued development

Work continues apace on this project. I am interested in comments of any kind, particularly on the key layout issue. Persons interested in beta-testing can contact me via e-mail.

References

Apple Computer, Inc. 1992. Guide to Macintosh software localization. Reading, MA: Addison Wesley.

ArmSCII Standard. 1990. Published in Annual of Armenian Linguistics, from the “Armenian Standard of Information Exchange Codes”, Informac/iayi kodi haykakan himnorinak (Informacʻiayi kodi haykakan himnorinak). Erevan. (I have not seen the actual standard, only a photocopy of part of the AAL article.)

Darbinyan, T. 1965. Katalog Haykakan SSR/ tparannerum gorcacvol/ tar/atesakneri (Katalog Haykakan SSṘ tparannerum gorcacvoł taṙatesakneri). Erevan.

ISO 10646. Information technology: universal multiple octet coded character set (UCS). ISO/IEC 10646-1, 11 March 1993.

ISO/DIS 10585. Information and documentation: Armenian alphabet coded character set for bibliographic information interchange. Draft International Standard, 1992. (There are some serious incompatibilities between this, ISO 10646, and ArmSCII.)

Kalantarian, Andrey. 1990. VGA Armenian DOS standard, Version 3.1 (2 March 1990). Keyboard and font software. Armenian Academy of Sciences. Erevan.

Koundakjian, R. H. Lola. 1991. “In search of a standard Armenian keyboard”, Armenian International Magazine (A.I.M.), March 1991, pp. 32-33.

Unicode Consortium. 1991. The Unicode standard: worldwide character encoding. Version 1.0, volume 1. Reading, MA: Addison Wesley Publishing Company.

Thanks to: Bedo Agopian, Raffi Kojian, Lola Koundakjian, Rick McGowan, Michael Stone, Dirk Van Damme, Jos Weitenberg, and Richard Youatt. E-thanks to the other members of aiea@telf.com, hayastan@usc.edu, and hye-font@sain.org as well. All responsibility for errors or infelicities is that of the author and the author alone.
 
HTML Michael Everson, Evertype, 48B Gleann na Carraige, Cill Fhionntain, Baile Átha Cliath 13, Éire, 2003-01-06

Copyright © 1993-2002 Evertype. All Rights Reserved