Uyghur language processing on the Web
Dr. Waris Abdukerim Janbaz, Prof. Imad Saleh
Paragraphe Laboratory, University of Paris VIII, France
warisabdukerim@yahoo.com, isaleh@wanadoo.fr
http://paragraphe.univ-paris8.fr
Abstract
In this paper, we discuss some important issues related to the web processing of an agglutinative Turkic language, Uyghur. In particular, we discuss the advent of grassroots efforts in Uyghur Unicode font development, Uyghur character display, font embedding and Uyghur character input methods in environments without Uyghur support. We also introduce a multiscript conversion application to further the use of the Unicode standard for Uyghur language processing.

Keywords: Unicode, Font, Turkic Language, multiscript, transliteration, Arabic-Script Uyghur, Cyrillic-Script Uyghur, Latin-Script Uyghur.

1. Introduction
The Uyghurs are a Turkic-speaking ethnic group, officially about nine million, living in Central Asia, including today’s Xinjiang Uyghur Autonomous Region (hereafter: XUAR, also called Chinese Turkistan) as well as parts of Kazakhstan and urban regions in the Ferghana valley. The official writing system of the XUAR Uyghurs is Arabic-Script Uyghur1 (hereafter: ASU), whereas Cyrillic-Script Uyghur2 (hereafter: CSU) is still in use among the Uyghurs of the republics of the former Soviet Union (USSR). The newly introduced transliteration3, Latin-Script Uyghur4 (hereafter: LSU), has become widely accepted among Uyghurs and Uyghurologists and is a commonly used standard for the transliteration of both ASU and CSU.
The influence of web publishing started appearing in Uyghur society in the last 10 years. Since the existing platforms supply neither a Uyghur input method nor any fonts that include all the glyphs of the ASU alphabet, inputting Uyghur text into interactive web pages (in browsers) and correctly displaying Uyghur characters presented huge difficulties. In spite of the fairly passive attitude of Government authorities to the development of Uyghur information technology, many individuals started creating Uyghur websites using the three above-mentioned scripts. ASU, used by the most populous segment of XUAR Uyghurs, caused special coding problems given that it uses a non-standard set of Arabic-based glyphs.

1 See annex 2
2 See annex 1
3 Using one writing system to represent words in another is called transliteration.
4 Called Uyghur Kompyutér Yéziqi (UKY) or Uyghur Latin Yéziqi (ULY) in Uyghur, meaning “Uyghur Computer Writing” or “Latin-Script Uyghur”. See http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYeziq).htm

2. Background
For ASU, before 2002, two methods were very common in Uyghur web publishing: 1) font downloading; and/or 2) image format. There is no need to explain the inconvenience of the second method. More interesting but complex problems occurred in the case of the first one. The major problem came from the fact that every web site owner created and named his/her own fonts, and users/visitors had to download a specific font (or different fonts) for almost every single website. No one accepted the font name and coding of the other, and no common standard was created. Most of the fonts created during this period either replaced the ASCII characters or replaced the Unicode Arabic characters (0x600-0x6FF) with Uyghur characters, without any agreed replacement scheme. Since the number of Arabic letters in the code range 0x600-0x6FF is larger than the number of ASU letters, people made different choices as they replaced some Arabic characters with ASU characters. Therefore, the multiplication of font names and the growth of coding differences (for the same glyphs) among the fonts became an obstacle to the development of ASU computer processing and web publishing. A large number of issues regarding non-standard fonts and their use were addressed in many different ways by individual computer scientists. Meanwhile, many of these problems were circumvented by using methods unrelated to the Unicode standard. As a result, web site creators eventually expressed their strong desire to further use the Unicode standard for Uyghur language processing.
In June 2002, the author developed the first Uyghur Unicode font and implemented both system-level and browser-level Input Method Editors for Windows. It
became a revolutionary accomplishment, owing mostly to the new method and applications that are fully Unicode-compliant (as opposed to occasionally compatible). Hence, a campaign was launched to popularize and adapt the Unicode standard for Uyghur fonts. In this paper, we present the entire process that we have been following and developing for three years. The following subsections will cover four major parts of the entire implementation procedure.

3. Uyghur Unicode font developing
Uyghur (ASU) letters were developed on the basis of the Arabic alphabet. The ASU alphabet has 8 vowels5 and 24 consonants (see annex 1). Uyghur, just like Arabic, is written from right to left, each letter having different shapes depending on its position in a word. The Uyghur letters have initial, median, final and isolated forms; some letters have conjunct forms6. In total, the Uyghur alphabet has 126 different glyphs. The 108 basic glyphs7 of the Uyghur letters have already been accepted by the Unicode Consortium/ISO, and 18 glyphs8 out of the 20 glyphs for composed forms were added in 1998. Unfortunately, two conjunct median forms of the Uyghur letters ﺋﯥ and ﺋﻰ, namely ﺌﯧ 9 and ﺌﯩ 10, are still absent11 from the Unicode Standard’s table12, Arabic Presentation Forms-A. This gap leaves the Unicode/ISO repertoire incomplete as it stands, and it has forced people to supplement it by borrowing from FBD1 and FBD2 the “supported hamze”, which is then combined with the median′ form of ﺋﯥ and ﺋﻰ to generate two synthetic combined letters.
The 20 conjunct glyphs can also be expressed as a sequence of two existing Unicode glyphs (as is the case now for the two missing conjunct glyphs). But this kind of usage may cause problems such as reducing text inputting speed, increasing data storage redundancy and complicating data sorting operations.
The creation of a Unicode-based Uyghur font has become a necessity for the progress of Uyghur information processing, since the existing platforms do not include (supply) any Uyghur font. Existing fonts (both Arabic fonts and other fonts which include Arabic letters) do not include all the necessary shapes of Uyghur letters (see annex 2), and therefore some substituted sequences lead to display problems. For example:
1. ﺋﺎﻟەﻣﺪىﻜﻰ هەﻣﻤە ﺋىﻨﺴﺎن ﻗەﺑىﻪ ﺋەﻣەس
2. ﺋﺎﻟﻪﻣﺪﯨﻜﻰ ھﻪﻣﻤﻪ ﺋﯩﻨﺴﺎﻥ ﻗﻪﺑﯩﻬ ﺋﻪﻣﻪﺱ
(Not all human beings in the world are evil)
The first sentence above is considered an illegal character combination when it is rendered with existing fonts (e.g. Times New Roman, Traditional Arabic), because the cursive shapes of ﺋﻪ , ھ , ﻯ are not correct according to the ASU alphabet (see annex 2). It should appear as in sentence 2, in which the letters use a specific font, UKIJ Tuz Tom. In order to create the right cursive connection forms for Uyghur, it was necessary to take special measures for three problem-letters ﺋﻪ , ھ , ﻯ and two “glottal stop signs” ﺌ , ﺉ (supported hamze) during the creation of Uyghur fonts. The absence of such measures would make it impossible to display the cursive forms of the three letters correctly in browsers and other application software.
ﻯ 13: Uyghur letter i as in ishik (ﺋﯩﺸﯩﻚ, door). The 8 different forms are listed in table 1 below. For the initial′ and median′ forms (ﯨ , ﯩ) of this letter we use the initial and median forms of the Arabic letter ى (0649); for the final′ and isolated′ forms (ﻯ , ﻰ) we use the final and isolated forms of the Farsi letter ی (06CC), respectively.
ﺋﻪ 14: Uyghur letter e as in eyneklerde (ﺋﻪﻳﻨﻪﻛﻠﻪﺭﺩە, in the mirrors). This letter uses the final and isolated glyphs (ﻩ , ﻪ) of the Arabic letter ه (0647, h), in the same way as Persian does. This causes a special problem due to the fact that the glyphs of Arabic ه 15 (0647, h) in the initial and median positions (ھ , ﻬ) correspond to those of Uyghur ھ (h as in ھﯧﻠﯩﻬﻪﻡ hélihem, even now; ﮔﯘﻧﺎھ gunah, sin or offense; ﻗﻪﺑﯩﻬ qebih, odious), which, in turn, has different final and isolated glyphs (ھ , ﻬ). In order to deal with this inconsistency, we have chosen to use 06D5 for the Uyghur letter ﺋﻪ and 06BE for the Uyghur letter ھ.

5 The Arabic alphabet has only 3 letters for vowels, ﺍ ﻭ ﻱ, which it uses for long vowels. The others are not noted in normal writing. Given its phonetic characteristics, Uyghur notes down all vowels: ﺋﺎ، ﺋﻪ، ﺋﻮ، ﺋﯘ، ﺋﯚ، ﺋﯜ، ﺋﯥ، ﺋﻰ using derivatives of traditional Arabic letters.
6 The initial form and, under some circumstances, the median form of all vowels is preceded by one “glottal stop sign” ﺉ or ﺌ (supported hamze) with which they form a common letter (treated by Uyghur as a single letter, see annex 2). ﻝ followed by ﺍ forms ﻼ or ﻻ depending on their position.
7 See http://www.oyghan.com/images/UyghurUnicodeTable.gif
8 See Arabic Presentation Forms-A, glyph code range: FBEA – FBFB. See also table 1.
9 Character name for the Unicode Standard: ARABIC LIGATURE YEH WITH HAMZA ABOVE WITH E MEDIAN FORM. Ex: ﺑﺎﻏﺌﯧﺮﯨﻖ (Baghériq).
10 Character name for the Unicode Standard: ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA MEDIAN FORM. Ex: ﻗﻪﺗﺌﯩﻲ (certainly, doubtlessly)
11 The XUAR’s delegation members, Prof. Hoshur Islam and Yasin Imin, who submitted the proposition, also admit this fault. See also Arabic Presentation Forms-A (code range: FBEA – FBFB). See http://www.unicode.org/standard/where/
12 http://www.unicode.org/charts/PDF/UFB50.pdf
13 Character name for the Unicode Standard: ARABIC LETTER UIGHUR KAZAKH KIRGHIZ ALEF MAKSURA (represents YEH-shaped letter with no dots in any positional form), 0649.
14 Character name for the Unicode Standard: ARABIC LETTER AE (Uighur, Kazakh, Kirghiz), 06D5 (isolated form is ە).
15 Variant shapes of the Arabic character heh.

iso.′ fin.′ med.′ ini.′ iso. fin. med. ini.
ﺍ ﺎ ﯫ ﯪ
ﻩ ﻪ ﯭ ﯬ
ﻭ ﻮ ﯯ ﯮ
ﯗ ﯘ ﯱ ﯰ
ﯙ ﯚ ﯳ ﯲ
ﯛ ﯜ ﯵ ﯴ
ې ﯥ ﯧ ﯦ ﯶ ﯷ ﺌﯧ ﯸ
ﻯ ﻰ ﯩ ﯨ ﯹ ﯺ ﺌﯩ ﯻ
ھ ﻬ ﻬ ھ
Table 1. Uyghur vowels and the three problem-letters (the one Arabic character ه heh has four different basic shapes, which correspond to the four shapes of two different letters in Uyghur).

ﺉ and ﺌ 16: the glottal stop. This is a phoneme which is not listed separately in the ASU alphabet but is still covered by its spelling rules. In Uyghur words, the glottal stop is not as strongly pronounced as it is in Semitic languages or in Uzbek, for example, and it has weakened to become no more than a hiatus. Marked in ASU by a hamza on top of a “tooth”, it usually appears in words of Arabic origin and replaces an original ‘ain (ع) or a hamza (ء) in a median or final position (e.g. ﺋﺎﻟﻪﻡ from Arabic عالَم, ﺳﺎﺋﻪﺕ from Arabic ساعَة, ﺧﺎﺋﯩﻦ from Arabic خائِن, ﺳﻮﺋﺎﻝ from Arabic سُؤال). In initial position, the same sign is considered part of the initial form of a vowel and does not have any phonetic value17. The two signs correspond to the initial and median forms of the Arabic letter ئ (0626). These Arabic glyphs are not considered as different shapes of any independent letter in the Uyghur alphabet (cf. annex 2). Since one glyph of each of the two letters ﺋﯥ and ﺋﻰ (shown in light red in the table above) is still missing in Unicode, we can use a sequence of either of these glyphs (ﺉ or ﺌ) followed by the final, isolated, median′ or final′ forms of the vowels ﺋﯥ and ﺋﻰ (shown in blue in the table above). More precisely, the other conjunct forms can be obtained by combining the Arabic letter ئ (0626) with the respective vowel.
In spite of the above-mentioned limitations (two glyphs instead of one conjunct glyph for ﺋﯥ and ﺋﻰ), the above-mentioned conventions have now been widely accepted by the Uyghur Computer Science Association (UCSA18) and, at a later date, by the Xinjiang University branch of the 863 Research Group19.
Once the specificities of those letters are understood, it is easy to create Uyghur fonts using existing font creating software. The inclusion of zero-width formatting characters, such as ZWJ (zero width joiner; 200D), ZWNJ (zero width non-joiner; 200C), LRM (left-to-right mark; 200E) and RLM (right-to-left mark; 200F), is also recommended in any Uyghur font. The rest of the time-consuming, repetitive font developing task is exactly the same as when creating an Arabic script font20. Some Uyghur Unicode fonts are available for free at the UCSA website. Our recommended font creating tools are Font Creator21 and Fontographer22. Glyph substitutions, positioning lookups, shaping features and OpenType tables of Arabic fonts can be added with the help of software like Microsoft VOLT.

4. Font embedding and character displaying
Web pages can be rendered without downloading or installing any specific fonts if: 1) the fonts used in the pages are available on the user’s computer, and 2) the browsers provide native support for the fonts and languages used. The second condition has already been met, but unfortunately the first one has not, as no Uyghur fonts are installed on users’ computers by the existing platforms. Therefore, to ensure that Uyghur texts are displayed correctly in web browsers, users must find a way to install on their computers the fonts that are used in the web pages. The same holds true for all the other “forgotten languages” on different platforms. The font installation requirement either causes difficulties for people who don’t have much technical experience, or discourages others from attempting to read the text.
These difficulties can be overcome by embedding fonts into the web pages. When a page is downloaded into a browser via the Hypertext Transfer Protocol, any embedded fonts in the page are also downloaded without any need for the user to intervene. The Microsoft Web Embedding Fonts Tool (WEFT)23 makes it possible to create embedded font objects that can be linked to web pages. The following steps let web page developers create embedded fonts and link them to a web page:
• Create embedded fonts using Microsoft WEFT
• Prepare the web page using any fonts that are installed on the platform, and
• Link the embedded fonts to the web page.
Microsoft WEFT generates 1) embedded fonts for every web site, with a different extension (.EOT), and 2) a script that links an embedded font to a web page. The disadvantage of the WEFT-generated embedded fonts is that they are compatible only with Internet Explorer. This makes it highly desirable for more effort to be invested in providing cross-platform functionality for this kind of software.

16 Character name for the Unicode Standard: ARABIC LETTER YEH WITH HAMZA ABOVE <initial> and <median> 0626.
17 It is often said that the decision of Uyghur linguists to add this sign as part of the initial form of letters is a link with the old Uyghur writing system, in which all initial vowels were preceded by a tooth. The Arabic alphabet has 3 letters, ا, و and ي, which can be used to indicate long vowels. Short vowels can be indicated through the use of vowel marks above or under the consonants, but these are dispensed with in normal writing. Given its phonetic characteristics, Uyghur notes down all vowels: ﺋﺎ، ﺋﻪ، ﺋﻮ، ﺋﯘ، ﺋﯚ، ﺋﯜ، ﺋﯥ، ﺋﻰ using derivatives of traditional Arabic letters.
18 UCSA – The Uyghur Computer Science Association (or UKIJ – Uyghur Kompyutér Ilimi Jem’iyiti in Uyghur) is a non-profit association, founded by the author in Jan 2004. Web site: http://www.ukij.org
19 A National High-Tech Research Group, financed by the PRC government. The XJU branch is specialized in multilingual software development.
20 See http://www.microsoft.com/typography/OpenType%20Dev/arabic/intro.mspx for more information about developing OpenType Fonts for Arabic Script
21 http://www.high-logic.com/fontcreator.html
22 http://www.fontlab.com/Font-tools/Fontographer
23 Free software at http://www.microsoft.com/typography/web/embedding/default.htm
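As a way of double-checking the code points quoted in section 3, the following minimal Python sketch looks each of them up in the Unicode character database shipped with Python. The dictionary labels and the helper names `describe` and `with_hamza` are hypothetical and only serve this illustration; the sketch is not part of the font tooling described in this paper.

```python
import unicodedata

# Code points quoted in section 3, keyed by the role they play for ASU.
# The labels are illustrative only.
CODE_POINTS = {
    "Uyghur e": 0x06D5,
    "Uyghur h": 0x06BE,
    "i, initial and median forms": 0x0649,
    "i, final and isolated forms": 0x06CC,
    "supported hamze": 0x0626,
    "zero width non-joiner": 0x200C,
    "zero width joiner": 0x200D,
    "left-to-right mark": 0x200E,
    "right-to-left mark": 0x200F,
}


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME' using the official Unicode character name."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


def with_hamza(vowel: str) -> str:
    """Build the two-character sequence: supported hamze, then the vowel."""
    return chr(0x0626) + vowel


if __name__ == "__main__":
    for role, cp in CODE_POINTS.items():
        print(f"{role:30} {describe(cp)}")
    # The sequence used in place of a single conjunct glyph, here for e.
    print([f"U+{ord(c):04X}" for c in with_hamza(chr(0x06D5))])
```

Looking names up this way makes it easy to confirm which of 200C and 200D is the joiner and which is the non-joiner before they are added to a font.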
5. Creation of a browser-level virtual input method
As mentioned in the introduction, the existing platforms do not supply any system-level Uyghur language inputting service. Late in 2003, the first system-level Uyghur Unicode IME for Windows was developed by the author and distributed free of charge24. Six months later, the Xinjiang University branch of the 863 Research Group and some individuals started joining the Uyghur Unicode Popularization campaign by distributing their own Unicode-supported IMEs. Nevertheless, it still cannot be said that all or even most Uyghur internet users are equipped with Uyghur inputting tools. Therefore, the browser-level inputting method still fills a great need, since it enables people to input Uyghur letters into any text-inputting field on a web page without having to install a system-level Uyghur IME. The basic structure of the browser-level Uyghur text inputting tool is represented in figure 1:

[Flowchart boxes, in order: Keyboard and mouse events; Input Uyghur? (yes/no); Capture K.&M. Events; Code – Char. Mapping; Dispatch Events; Switch Lang.? (yes/no); Release K.&M. Events]
Figure 1. workflow of the browser-level inputting method

As we can see from the workflow above, once the user selects the Uyghur Inputting option, the “capture keyboard and mouse events” module creates a hook to monitor the keyboard and mouse activities. The “code-char. mapping” module creates a keycode-to-Uyghur-character matrix to get the right Uyghur character that corresponds to the key code (ex: 109 → ﻡ). The “dispatch events” module sends Uyghur characters from the map to the active text inputting field on a web page. This process repeats itself until the “release keyboard and mouse events” module frees the hook immediately after the user decides to switch the inputting language to another one. This method has been implemented using the JavaScript and VBScript languages, tested on different browsers and is commonly used on some Uyghur web sites25.

6. Multiscript converting
Due to the co-existence of different writing systems (Arabic-Script Uyghur, Cyrillic-Script Uyghur and Latin-Script Uyghur) for the Uyghur language, research on a conversion tool with which people can toggle between the three scripts is needed for future information sharing. The fact that there is a one-to-one correspondence26 between the letters of these three writing systems is certainly a major helping factor. For better understanding, we take the example of the Uyghur proverb “working for free is better than doing nothing” in the three scripts:
ﺑﯩﻜﺎﺭ ﻳﯜﺭﮔﯩﭽﻪ ﺑﯩﻜﺎﺭ ﺋﯩﺸﻠﻪ
бикар йүргичә бикар ишлә
bikar yürgiche bikar ishle
The following workflow explains the basic conversion process:

[Flowchart boxes, in order: Source text in source script; Pre-processing; Character mapping; Character converting; Disambiguation; Conversion end.? (yes/no); Result in destination script]
Figure 2. script converting

The functionalities of each module may require some clarification:
Pre-processing: this is an important step in converting. It involves preserving elements that should remain unchanged27 after the conversion. For example, when converting the LSU text “Men Photoshop ni yaxshi körimen” (I love Photoshop) into ASU, we should be able to obtain “ﻣﻪﻥ Photoshop ﻧﻰ ﻳﺎﺧﺸﻰ ﻛﯚﺭﯨﻤﻪﻥ” and vice-versa.

24 More than 200,000 downloads counted since Dec 2003 from www.oyghan.com and www.bizuyghur.com/oyghan .
25 See www.ukij.org , www.biliwal.com, www.oyghan.com, www.uyghurdictionary.org etc.
26 The only exception is j (as in jurnal) in LSU
27 This is the case of hypertext links, HTML tags and proper names.
Character mapping: creates an “A_is_B” matrix for every script pair, or three matrices in total.
Character converting: uses the three matrices in order to convert between the different scripts.
Disambiguation: this module is necessary when converting from LSU to ASU and/or CSU, because of spelling mistakes or, more importantly, because of the difficulty of typing the LSU diacritical marks on many keyboards: very commonly, the letters Ö, Ü, É, ö, ü and é are replaced by O, U, E, o, u and e. This may cause fatal errors. For example: öltürüsh (to kill) becomes olturush (to sit, party), térim yer (cultivable land) becomes terim yer (who eats my sweat), and yétim (orphan) becomes yetim (spelling mistake). Besides, spelling mistakes due to a poor grasp of the LSU rules are a significant problem. All these problems require intensive language processing. This functionality of the multiscript converting tool28 that we have released on the internet is still under development. The following images show our converting tools, which use the above-mentioned methods.

Image 1. Offline plug-in version for Microsoft Word
Image 2. Online demo version

28 Online demo version is available at http://www.uyghurdictionary.org/tools.asp, offline plug-in version for Microsoft Word is available at http://oyghan.com/OTB/index.html

7. Conclusions and future work
Our work to date has focused mainly on the design and implementation issues related to creating Uyghur Unicode fonts, as well as on a browser-level input method and a multi-script converting application. Judging from user feedback, we feel fairly satisfied with the results of this first ever research on Uyghur language processing. The embeddable web fonts, generated by the third-party software WEFT, are compatible only with Internet Explorer. Therefore, we are truly looking forward to more efforts by the computer software industry to expand compatibility. We expect to improve the pre-processing module of the converting tool to make it more user-friendly. There are undoubtedly other theoretical issues to resolve, especially in the disambiguation of misspelled LSU words.
Another important problem related to Uyghur is the major impediment to developing a spell-check functionality caused by the agglutinative nature of the language, coupled with the associated spelling changes in root words. This work is going to be the focus of our attention in the next stage of development.
Finally, we call on software companies not to omit Uyghur from their supported language lists in the future.

8. References
[1] Waris A. Janbaz, Online Uyghur Unicode processing technique and its implementation (publication in Chinese), Xinjiang University Press, China, 2002.
[2] Abdurehim, Waris A. Janbaz, Orthographic rules of the Latin-Script Uyghur (in Uyghur), 2004, http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYeziq).htm.
[3] The Unicode Consortium, The Unicode Standard, Version 4.0, Addison-Wesley Professional, ISBN: 0321185781, USA, 2003.
[4] Xinjiang University, Proceedings 2000 International Conference on Multilingual Information Processing, Ürümchi (publication in Chinese), China, 2000.
[5] The Unicode Consortium Website, http://www.unicode.org
[6] Reinhard F. Hahn, Spoken Uyghur, Washington: the University of Washington Press, ISBN: 0-295-97015-4, USA, 1991.

Annex 1: Arabic-Script Uyghur, Cyrillic-Script Uyghur and Latin-Script Uyghur Alphabets
LSU      CSU   ASU
a        а     ﺋﺎ
e        ә     ﺋﻪ
b        б     ﺏ
p        п     پ
t        т     ﺕ
j        җ     ﺝ
ch       ч     چ
x        х     ﺥ
d        д     ﺩ
r        р     ﺭ
z        з     ﺯ
j (zh)   ж     ژ
s        с     ﺱ
sh       ш     ﺵ
gh       ғ     ﻍ
f        ф     ﻑ
q        қ     ﻕ
k        к     ﻙ
g        г     گ
ng       ң     ڭ
l        л     ﻝ
m        м     ﻡ
n        н     ﻥ
h        һ     ھ
o        о     ﺋﻮ
u        у     ﺋﯘ
ö        ө     ﺋﯚ
ü        ү     ﺋﯜ
w        в     ۋ
é        е     ﺋﯥ
i        и     ﺋﻰ
y        й     ﻱ
Additional Cyrillic letters: ы ё ц э ю я
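
The letter correspondences of Annex 1, together with the pre-processing and character-converting steps of section 6, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the converting tool released by the authors: the names `LSU_TO_CSU`, `lsu_to_csu` and `protected` are hypothetical, only the LSU to CSU direction is shown, digraphs are resolved by a greedy longest match, and protected elements are limited to HTML tags and an explicit list of words.

```python
import re

# LSU -> CSU letter pairs from Annex 1 ("zh" is the alternative spelling of
# the second value of j). Illustrative table, not the authors' matrix.
LSU_TO_CSU = {
    "a": "а", "e": "ә", "b": "б", "p": "п", "t": "т", "j": "җ",
    "ch": "ч", "x": "х", "d": "д", "r": "р", "z": "з", "zh": "ж",
    "s": "с", "sh": "ш", "gh": "ғ", "f": "ф", "q": "қ", "k": "к",
    "g": "г", "ng": "ң", "l": "л", "m": "м", "n": "н", "h": "һ",
    "o": "о", "u": "у", "ö": "ө", "ü": "ү", "w": "в", "é": "е",
    "i": "и", "y": "й",
}

# Longest keys first, so that digraphs such as "ch" are matched before
# their single-letter parts (an assumption made for this sketch).
_LETTERS = re.compile(
    "|".join(sorted(map(re.escape, LSU_TO_CSU), key=len, reverse=True)),
    re.IGNORECASE,
)


def _convert_word(word: str) -> str:
    """Character converting: replace every LSU letter by its CSU letter."""
    def repl(match):
        src = match.group(0)
        dst = LSU_TO_CSU[src.lower()]
        return dst.upper() if src[0].isupper() else dst
    return _LETTERS.sub(repl, word)


def lsu_to_csu(text: str, protected=frozenset()) -> str:
    """Convert LSU text to CSU, leaving HTML tags and protected words as-is."""
    # Pre-processing: split into HTML tags, whitespace runs and other chunks.
    parts = re.split(r"(<[^>]*>|\s+)", text)
    out = []
    for part in parts:
        if part in protected or part.startswith("<"):
            out.append(part)  # preserved unchanged
        else:
            out.append(_convert_word(part))
    return "".join(out)


if __name__ == "__main__":
    print(lsu_to_csu("bikar yürgiche bikar ishle"))
    print(lsu_to_csu("Men Photoshop ni yaxshi körimen", {"Photoshop"}))
```

Characters without an entry in the table, such as digits and punctuation, pass through unchanged. The greedy digraph rule is the simplest possible choice and will mis-handle words in which, for example, n and g belong to different syllables; that kind of case is what the disambiguation module is meant to address.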