Uyghur language processing on the Web
                                    Dr. Waris Abdukerim Janbaz , Prof. Imad Saleh
                                Paragraphe Laboratory, University of Paris VIII, France
                                   warisabdukerim@yahoo.com, isaleh@wanadoo.fr
                                           http://paragraphe.univ-paris8.fr

Abstract                                                      browsers) and correctly displaying Uyghur characters
In this paper, we discuss some important issues related to    presented huge difficulties. In spite of the fairly passive
web processing of an agglutinative Turkic language –          attitude of Government authorities to the development of
Uyghur. Especially, we will discuss the advent of             Uyghur information technology, many individuals started
grassroots efforts in Uyghur Unicode font development,        creating Uyghur websites using the three above
Uyghur character display, font embedding and                  mentioned scripts. ASU, used by the most populous
Uyghur character input methods in environments                segment of XUAR Uyghurs, caused special coding
without Uyghur support. We will also introduce a              problems given that it uses a non-standard set of Arabic-
multiscript conversion application to further use the         based glyphs.
Unicode standard for Uyghur language processing.
                                                              2. Background
Keywords: Unicode, Font, Turkic Language, multiscript,        For ASU, before 2002, either of the two following
transliteration, Arabic-Script Uyghur, Cyrillic-Script        methods became very common in web publishing in
Uyghur, Latin-Script Uyghur.                                  Uyghur: 1) font downloading; and/or 2) image format.
                                                              There is no need to explain the inconvenience of the
1. Introduction                                               second method. More interesting but complex problems
The Uyghurs are a Turkic-speaking ethnic group,               occurred in the case of the first one. The major problem
officially about nine million, inhabiting Central Asia,       came from the fact that every web site owner created and
including today’s Xinjiang Uyghur Autonomous Region           named his/her own fonts, and users/visitors had to
(hereafter: XUAR, also called Chinese Turkistan) as well      download a specific font (or different fonts) for almost
as parts of Kazakhstan and urban regions in the Ferghana      every single website. No one accepted the font name and
valley. The official writing system of the XUAR Uyghurs       coding of the other, and no common standard was created.
is Arabic-Script Uyghur 1 (hereafter: ASU) whereas the        Most of the fonts created during this period, either
Cyrillic-Script Uyghur2 (hereafter: CSU) is still in use      replaced the ASCII characters or replaced the Unicode
by the Uyghurs of the ex-Soviet Union Republics               Arabic characters (0x600-0x6FF) with Uyghur characters,
(USSR). The newly introduced transliteration 3 – Latin-       without any agreement on the replacements. Since the number of
Script Uyghur 4 (hereafter: LSU), has become widely           Arabic letters in the code range 0x600-0x6FF is larger
accepted among Uyghurs and Uyghurologists and is a            than the number of ASU letters, people made different
commonly used standard for the transliteration of both        choices as they replaced some Arabic characters with
ASU and CSU.                                                  ASU characters. Therefore, multiplication of the font
The influence of web publishing started appearing in          names and the growth of coding differences (for the same
Uyghur society in the last 10 years. Since the existing       glyphs) among the fonts became an obstacle to the
platforms supply neither a Uyghur input method nor any        development of ASU computer processing and web
fonts that include all the glyphs of the ASU alphabet,        publishing. A large number of issues regarding non-
inputting Uyghur text into interactive web pages (in the      standard fonts and their use were addressed in many
                                                              different ways to the individual computer scientists.
                                                              Meanwhile, many of these problems were circumvented
1
  See annex 2                                                 by using methods unrelated to the Unicode standard. As a
2
  See annex 1                                                 result, web site creators eventually expressed their strong
3
  Using one writing system to represent words in another is   desire to further use the Unicode standard for Uyghur
called transliteration.                                       language processing.
4
  called Uyghur Kompyutér Yéziqi (UKY) or Uyghur Latin
Yéziqi (ULY) in Uyghur, meaning “Uyghur Computer Writing”
                                                              In June 2002, the author developed the first Uyghur
or “Latin-Script Uyghur”. See                                 Unicode font and implemented both system-level and
http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYeziq).htm        browser-level Input Method Editors for Windows. It
became a revolutionary accomplishment, owing mostly                 The creation of a Unicode-based Uyghur font has become
to the new method and applications that are fully                   a necessity for the progress of Uyghur information
Unicode-compliant (as opposed to occasionally                       processing since the existing platforms do not include
compatible). Hence, a campaign was launched to                      (supply) any Uyghur font. Existing fonts (both Arabic
popularize and adapt the Unicode standard for Uyghur                fonts and other fonts which include Arabic letters) do not
fonts. In this paper, we present the entire process that we         include all the necessary shapes of Uyghur letters (see
have been following and developing for three years. The             annex 2), and therefore some substitution sequences
following subsections will cover four major parts of the            lead to display problems. For example:
entire implementation procedure.                                                 ‫1. ﺋﺎﻟەﻣﺪىﻜﻰ هەﻣﻤە ﺋىﻨﺴﺎن ﻗەﺑىﻪ ﺋەﻣەس‬
                                                                                ‫2. ﺋﺎﻟﻪﻣﺪﯨﻜﻰ ھﻪﻣﻤﻪ ﺋﯩﻨﺴﺎﻥ ﻗﻪﺑﯩﻬ ﺋﻪﻣﻪﺱ‬
3. Uyghur Unicode font developing                                           (Not all human beings in the world are evil)
Uyghur (ASU) letters have been developed on the basis               The first sentence above is considered an illegal character
of the Arabic alphabet. The ASU alphabet                            combination if it uses existing fonts (e.g. Times New
has 8 vowels5 and 24 consonants (see annex1). Uyghur,               Roman, Traditional Arabic) because the cursive shapes of
just like Arabic, is written from right to left, each letter        ‫ ﺋﻪ ,ھ ,ﻯ‬are not correct according to the ASU alphabet
having different shapes depending on its position in a              (see annex 2). It should appear as in sentence 2 in which
word. The Uyghur letters have initial, median, final and
                                                                    the letters use a specific font — UKIJ Tuz Tom. In order
isolated forms; some letters have conjunct forms6. In total,
                                                                    to create the correct cursive connection forms for Uyghur, it
the Uyghur alphabet has 126 different glyphs. The 108
                                                                    was necessary to take special measures for three
basic glyphs 7 of the Uyghur letters have already been
                                                                    problem-letters‫ ﺋﻪ ,ھ , ﻯ‬and two “glottal stop signs ‫”ﺌ , ﺉ‬
accepted by the Unicode Consortium/ISO, and 18 glyphs8
out of the 20 glyphs for composed forms were added in               (supported hamze), during the creation of Uyghur fonts.
1998. Unfortunately, two conjunct median forms (of the              The absence of such measures would make it impossible
Uyghur letters ‫ ﺋﯥ‬and ‫ 9ﺌﯧ )ﺋﻰ‬and ‫ 01ﺌﯩ‬are still absent11 in        to display the cursive forms of the three letters correctly
                                                                    in browsers and other application software.
the Unicode Standard’s table 12 – Arabic Presentation
                                                                    ‫ : 31 ﻯ‬Uyghur letter i as in ishik (‫ ,ﺋﯩﺸﯩﻚ‬door). The 8
forms-A. This gap leaves the Unicode/ISO standard
as it stands incomplete, and this has forced people to              different forms are listed in Table 1 below. For the
supplement it through borrowing from FBD1 and FBD2                  initial′ and median′ forms (‫ )ﯨ , ﯩ‬of this letter we use the
the “supported hamze” which is then combined with the               initial and median forms of the Arabic letter ‫ ;9460 ﻯ‬for
median′ form of ‫ ﺋﯥ‬and ‫ ﺋﻰ‬to generate two synthetic                 the final′ and isolated′ forms (‫ )ﻯ , ﻰ‬we use the final and
combined letters.                                                   isolated forms of the Farsi letter ‫60 ﻯ‬CC, respectively.
The 20 conjunct glyphs can also be expressed as a
                                                                    ‫ :41ﺋﻪ‬Uyghur letter e as in eyneklerde (‫ ,ﺋﻪﻳﻨﻪﻛﻠﻪﺭﺩە‬in the
sequence of two existing Unicode glyphs (as it is the case
now for the two missing conjunct glyphs). But this kind             mirrors). This letter uses the final and isolated glyph s(‫, ﻩ‬
of usage may cause problems like reducing text inputting            ‫ )ﻪ‬of the Arabic letter ‫(7460 ھ‬h), in the same way as
speed, increasing data storage redundancy, complicating             Persian does. This causes a special problem due to the
data sorting operations etc.                                        fact that the glyphs of Arabic ‫(7460 51ھ‬h) in the initial
                                                                    and median positions(‫ )ھ , ﻬ‬correspond to those of Uyghur
5
  The Arabic alphabet only has 3 letters and for long vowels        ‫( ھ‬h as in ‫ ھﯧﻠﯩﻬﻪﻡ‬hélihem, even now; ‫ ﮔﯘﻧﺎھ‬gunah, sin or
uses ‫ .ﺍ ﻭ ﻱ‬The others are not noted in normal writing. Given its
                                                                    offense; ‫ ﻗﻪﺑﯩﻬ‬qebih, odious), which, in turn, has different
phonetic characteristics, Uyghur notes down all vowels: ،‫ﺋﺎ، ﺋﻪ‬
 ‫ , ﺋﻮ، ﺋﯘ، ﺋﯚ، ﺋﯜ، ﺋﯥ، ﺋﻰ‬using derivates of traditional Arabic
                                                                    final and isolated glyphs(‫ .)ھ , ﻬ‬In order to deal with this
letters.                                                            inconsistency, we have chosen to use 06D5 for the
6
  The initial form and, under some circumstances, the median        Uyghur letter ‫ ﺋﻪ‬and 06BE for the Uyghur letter ‫.ھ‬
form of all vowels is preceded by one “glottal stop sign ‫ ﺉ‬or ‫”ﺌ‬        iso.′ fin.′ med.′ ini.′ iso. fin. med. ini.
(supported hamze) with which they form a common letter                                                  ‫ﺍ‬    ‫ﺎ‬        ‫ﯫ‬    ‫ﯪ‬
(treated by Uyghur as a single letter, see annex 2). ‫ ﻝ‬followed
                                                                                                       ‫ﻩ‬     ‫ﻪ‬        ‫ﯭ‬   ‫ﯬ‬
by ‫ ﺍ‬forms ‫ ﻼ‬or ‫ ﻻ‬depending on their position.
7                                                                                                      ‫ﻭ‬     ‫ﻮ‬        ‫ﯯ‬   ‫ﯮ‬
  See http://www.oyghan.com/images/UyghurUnicodeTable.gif
8
  See Arabic Presentation Forms-A, glyph code range: FBEA –                                            ‫ﯗ‬     ‫ﯘ‬        ‫ﯱ‬   ‫ﯰ‬
FBFB. See also table 1.                                                                                ‫ﯙ‬     ‫ﯚ‬        ‫ﯳ‬   ‫ﯲ‬
9
  Character name for the Unicode Standard: ARABIC
LIGATURE YEH WITH HAMZA ABOVE WITH E
                                                                                                       ‫ﯛ‬     ‫ﯜ‬        ‫ﯵ‬   ‫ﯴ‬
MEDIAN FORM. Ex: ‫( ﺑﺎﻏﺌﯧﺮﯨﻖ‬Baghériq).
10
   Character name for the Unicode Standard: ARABIC
LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA                              13
                                                                       Character name for the Unicode Standard: ARABIC
ABOVE WITH ALEF MAKSURA MEDIAN FORM. Ex:                            LETTER UIGHUR KAZAKH KIRGHIZ ALEF MAKSURA
‫( ﻗﻪﺗﺌﯩﻲ‬certainly, doubtlessly)                                     (represents YEH-shaped letter with no dots in any positional
11
   The XUAR’s delegation members, Prof. Hoshur Islam and            form), 0649.
                                                                    14
Yasin Imin, who have submitted the proposition also admit this         Character name for the Unicode Standard:ARABIC LETTER
fault. See also Arabic Presentation Forms-A (code range: FBEA       AE (Uighur, Kazakh, Kirghiz), 06D5 (isolated form is ‫.)ە‬
                                                                    15
– FBFB).                                                               See http://www.unicode.org/standard/where/ , Variant shapes
12
   http://www.unicode.org/charts/PDF/UFB50.pdf                      of the Arabic character hah.
‫ې‬       ‫ﯥ‬         ‫ﯧ‬        ‫ﯦ‬      ‫ﯶ‬       ‫ﯷ‬        ‫ﺌﯧ‬       ‫ﯸ‬         and RTL (right to left mark; 200F), is also recommended
     ‫ﻯ‬       ‫ﻰ‬         ‫ﯩ‬        ‫ﯨ‬      ‫ﯹ‬       ‫ﯺ‬        ‫ﺌﯩ‬       ‫ﯻ‬         in any Uyghur font. The rest of the time-consuming
                                                                           repetitive font developing task is absolutely the same as
                                       ‫ھ‬       ‫ﻬ‬         ‫ﻬ‬       ‫ھ‬
                                                                           when creating an Arabic script font 20 . Some Uyghur
 Table 1. Uyghur vowels and the three problem-letters (the one Arabic
character ‫ ھ‬hah has four different basic shapes, which correspond to the
                                                                           Unicode fonts are available for free at the UCSA website.
             four shapes of two different letters in Uyghur).
                                                                           Our recommended font creating tools are: Font Creator21
                                                                           and Fontographer 22 . Glyph substitutions, positioning
‫ ﺉ‬and ‫ :61ﺌ‬the glottal stop: this is a phoneme which is not                lookups and shaping features and Open Type tables of
listed separately in the ASU alphabet but still covered by                 Arabic fonts can be added with the help of software like
its spelling rules. In Uyghur words, the glottal stop is not               Microsoft VOLT.
as strongly pronounced as it is in Semitic languages or in
Uzbek, for example, and it has weakened to become no                       4. Font embedding and character displaying
more than a hiatus. Marked in ASU by a hamza on top of                     Web pages can be rendered without downloading or
a “tooth”, it appears usually in words of Arabic origin                    installing any specific fonts if: 1) the fonts used in the
and replaces an original ‘ain (‫ )ع‬or a hamza (‫ )ء‬in a                      pages are available on user’s computer, and 2) if the
median or final position (e.g. ‫ ﺋﺎﻟﻪﻡ‬from Arabic ‫,ﻋﺎﻟﹶﻢ‬                    browsers provide native support for the fonts and
‫ ﺳﺎﺋﻪﺕ‬from Arabic ‫ ﺧﺎﺋﯩﻦ ,ﺳﺎ َﺔ‬from Arabic ‫ﺳﻮﺋﺎﻝ , ﺧﺎﺋِﻦ‬
                      ‫ﻋ‬                                                    languages used. The second condition has already been
from Arabic ‫ .)ﺳ َال‬In initial position, the same sign is
                ‫ُﺆ‬                                                         met but unfortunately the first one has not yet, as there
considered as part of the initial form of a vowel and does                 are no Uyghur fonts available on the existing platforms
not have any phonetic value 17 . They correspond to the                    that are installed on the users’ computers. Therefore, to
initial and median forms of the Arabic letter ‫.6260 ئ‬                      ensure that Uyghur texts are displayed correctly in web
These Arabic glyphs are not considered as different                        browsers, users must find a way to install in their
shapes of any independent letter in the Uyghur alphabet                    computers the fonts that are used in the web pages. The
(cf. annex 2). Since one glyph of each of the two letters                  same holds true for all the other “forgotten languages” on
‫ ﺋﯥ‬and ‫( ﺋﻰ‬shown in light red in the table above) are still                different platforms. The font installation requirement
                                                                           either causes difficulties for people who don’t have much
missing in Unicode, we can use a sequence of either of                     technical experience, or discourages others from
these glyphs ( ‫ ﺉ‬or ‫ )ﺌ‬followed by the final, isolated,                    attempting to read the text.
median′ or final′ forms of vowels ‫ ﺋﯥ‬and ‫( ﺋﻰ‬shown in                      These difficulties can be overcome by embedding fonts
blue in the table above). More precisely, the other                        into the web pages. When a page is downloaded into a
conjunct forms can be obtained combining with the                          browser via the Hypertext Transfer Protocol, any
Arabic letter ‫ 6260 ئ‬and a vowel respectively.                             embedded fonts in the page are also downloaded without
In spite of the above mentioned limitations (two glyphs                    any need for the user to intervene. The Microsoft Web
instead of one conjunct glyph for ‫ ﺋﯥ‬and ‫ )ﺋﻰ‬the above                     Embedding Fonts Tool—WEFT 23 makes it possible to
mentioned conventions have now been widely accepted                        create embedded font objects that can be linked to web
by the Uyghur Computer Science Association(UCSA18),                        pages. The following steps let web pages developers
and at a later date, by the Xinjiang University branch of                  create embedded fonts and link them to a web page:
the 863 Research Group19.                                                       • Create embedded fonts using Microsoft WEFT
 After having learnt the specificities of those letters, it is                  • Prepare the web page using any fonts that are
easy to create Uyghur fonts using existing font creating                             installed on the platform, and
software. The inclusion of zero-width formatting characters,                    • Link the embedded fonts to the web page.
such as ZWJ (zero width joiner; 200D), ZWNJ (zero                          Microsoft WEFT generates 1) embedded fonts for every
width non-joiner; 200C), LTR (left to right mark; 200E),                   web site with a different extension (.EOT), and 2) a script
                                                                           that links an embedding font to a web page. The
16                                                                         disadvantage of the WEFT generated embedded fonts is
   Character name for the Unicode Standard: ARABIC
LETTER YEH WITH HAMZA ABOVE <initial> and                                  that the fonts are compatible only with Internet Explorer.
<median> 0626.                                                             This makes it highly desirable for more efforts to be
17
   It is often said that the decision of Uyghur linguists to add           invested in providing a cross-platform functionality for
this sign as part of the initial form of letters is a link with the        this kind of software.
old Uyghur writing system, in which all initial vowels were
preceded by a tooth. The Arabic alphabet has 3 letters, ‫ و ,ا‬and
‫ ي‬which can be used to indicate long vowels. Short vowels can
be indicated through the use of vowel marks above or under the
consonants but which are dispensed of in normal writing. Given
its phonetic characteristics, Uyghur notes down all vowels: ،‫ﺋﺎ‬
 ‫ ,ﺋﻪ، ﺋﻮ، ﺋﯘ، ﺋﯚ، ﺋﯜ، ﺋﯥ، ﺋﻰ‬using derivates of traditional Arabic         20
                                                                              See
letters.                                                                   http://www.microsoft.com/typography/OpenType%20Dev/arabi
18
   UCSA – The Uyghur Computer Science Association (or                      c/intro.mspx for more information about developing OpenType
UKIJ – Uyghur Kompyutér Ilimi Jem’iyiti in Uyghur) is a non-               Fonts for Arabic Script
                                                                           21
profit association, founded by the author in Jan 2004. Web site:              http://www.high-logic.com/fontcreator.html
                                                                           22
http://www.ukij.org                                                           http://www.fontlab.com/Font-tools/Fontographer
19                                                                         23
   A National High-Tech Research Group, financed by the PRC                   Free software at
government. The XJU branch is specialized in multilingual                  http://www.microsoft.com/typography/web/embedding/default.
software development.                                                      htm
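
The code point conventions discussed in section 3 can be checked against the Unicode Character Database. The following Python sketch is only an illustration and is not part of the authors' tools: the dictionary `UYGHUR_CODE_POINTS`, its informal role labels, and the helper names `describe` and `find_arabic_heh` are assumptions introduced here. It prints the official character name for each code point mentioned in the text and flags any use of ARABIC LETTER HEH (0647), which the convention above avoids in favour of 06D5 and 06BE.

```python
import unicodedata

# Code points mentioned in section 3. The role labels are informal
# summaries written for this sketch, not official terminology.
UYGHUR_CODE_POINTS = {
    0x06D5: "Uyghur letter e",
    0x06BE: "Uyghur letter h",
    0x0649: "source of the initial and median shapes of Uyghur i",
    0x06CC: "source of the final and isolated shapes of Uyghur i",
    0x0626: "glottal stop sign (supported hamze)",
    0x200C: "joining control recommended in Uyghur fonts",
    0x200D: "joining control recommended in Uyghur fonts",
    0x200E: "directional mark recommended in Uyghur fonts",
    0x200F: "directional mark recommended in Uyghur fonts",
}

ARABIC_HEH = 0x0647  # replaced by 06D5 / 06BE under the convention above


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' using the Unicode Character Database."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


def find_arabic_heh(text: str) -> list[int]:
    """Return the positions where ARABIC LETTER HEH (0647) occurs."""
    return [i for i, ch in enumerate(text) if ord(ch) == ARABIC_HEH]


if __name__ == "__main__":
    for cp, role in UYGHUR_CODE_POINTS.items():
        print(f"{describe(cp)}: {role}")
```

Looking the names up at run time, rather than hard-coding them, keeps the sketch tied to the Unicode data shipped with the Python interpreter.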
5. Creation of a browser-level virtual input                    events” module frees the hook immediately after the user
method                                                          decides to switch the inputting language to another one.
As mentioned in the introduction, the existing platforms        This method has been implemented using JavaScript and
do not supply any system-level Uyghur language                  VBScript, tested on different browsers, and is
inputting service. Late in 2003, the first system-level         commonly used on some Uyghur web sites25.
Uyghur Unicode IME for Windows was developed by the
author and distributed free of charge24. Six months later,      6. Multiscript converting
the Xinjiang University branch of the 863 Research              Due to the co-existence of different writing systems
Group and some individuals started joining the Uyghur           (Arabic-Script Uyghur, Cyrillic-Script Uyghur and Latin-
Unicode Popularization campaign by distributing their           Script Uyghur) for the Uyghur language, research on a
Unicode-supported IME. Nevertheless, it still can not be        conversion tool with which people can toggle between
said that all or even most Uyghur internet users are            the three scripts is forthcoming for future information
equipped with Uyghur inputting tools. Therefore, the            sharing. The fact that there is one-to-one
browser-level inputting method still fills a great need         correspondence 26 between the letters of these three
since it enables people to input Uyghur letters into any        writing systems is certainly a major helping factor. For
text-inputting field on a web page without having to            better understanding, we take an example of the Uyghur
install a system-level Uyghur IME. The basic structure of       proverb “working for free is better than doing nothing” in
the browser-level Uyghur text inputting tool is                 three scripts:   ‫ﺑﯩﻜﺎﺭ ﻳﯜﺭﮔﯩﭽﻪ ﺑﯩﻜﺎﺭ ﺋﯩﺸﻠﻪ‬
represented as in figure 1:                                                    бикар йүргичə бикар ишлə
                                                                                bikar yürgiche bikar ishle
                                                                The following basic workflow explains the basic
                                                                conversion process:

Figure 1. Workflow of the browser-level inputting method:
Keyboard and mouse events -> Input Uyghur? (yes/no) -> Capture K.&M. events -> Code-char. mapping -> Dispatch events -> Switch lang.? (no: repeat; yes: continue) -> Release K.&M. events

Figure 2. Script converting:
Source text in source script -> Pre-processing -> Character mapping -> Character converting -> Disambiguation -> Conversion end? (no: repeat; yes: continue) -> Result in destination script

As we can see from the workflow above, once the user            The functionalities of each module may require some
selects the Uyghur Inputting option, the “capture               clarification:
keyboard and mouse events” module creates a hook to             Pre-processing: this is an important step in converting. It
monitor the keyboard and mouse activities. The “code-           involves preserving elements that should remain
char. mapping” module creates a keycode-to-Uyghur-              unchanged27 after the conversion. For example, when
Character matrix to get the right Uyghur character that         converting LSU text “Men Photoshop ni yaxshi körimen”
corresponds to the key code (ex: 109 ‫ .)ﻡ‬The “dispatch          (I love Photoshop) into ASU, we should be able to obtain
events” module sends Uyghur characters from the map to          “‫ ﻧﻰ ﻳﺎﺧﺸﻰ ﻛﯚﺭﯨﻤﻪﻥ‬Photoshop ‫ ”ﻣﻪﻥ‬and vice-versa.
the active text inputting field on a web page. This process
repeats itself until the “release keyboard and mouse            25
                                                                   See www.ukij.org , www.biliwal.com, www.oyghan.com,
                                                                www.uyghurdictionary.org etc.
                                                                26
                                                                   The only exception is j (as in jurnal) in LSU
24                                                              27
 More than 200,000 downloads counted since Dec 2003 from           This is the case of hypertext links, HTML tags and proper
www.oyghan.com and www.bizuyghur.com/oyghan .                   names.
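
The steps of Figure 1 (input-language decision, code-to-character mapping, dispatch to the text field) can be sketched as a small state machine. This is a hypothetical Python illustration, not the authors' JavaScript/VBScript code: the class `BrowserInputSketch`, its method names, and the `buffer` that stands in for a web page text field are all assumptions. Only the pair 109 -> MEEM comes from the text; the other two key codes are assumed examples.

```python
from dataclasses import dataclass, field

# Key code -> Uyghur character. Only 109 -> MEEM is taken from the text;
# the entries for 98 and 116 are assumed examples for the letters b and t.
DEFAULT_KEYMAP = {
    109: "\u0645",  # 'm' -> ARABIC LETTER MEEM (example given in the text)
    98: "\u0628",   # 'b' -> ARABIC LETTER BEH (assumed)
    116: "\u062A",  # 't' -> ARABIC LETTER TEH (assumed)
}


@dataclass
class BrowserInputSketch:
    keymap: dict = field(default_factory=lambda: dict(DEFAULT_KEYMAP))
    uyghur_on: bool = False                     # the "Input Uyghur?" decision
    buffer: list = field(default_factory=list)  # stands in for the text field

    def switch_language(self) -> None:
        """Toggle capture on or off (the 'Switch Lang.?' decision)."""
        self.uyghur_on = not self.uyghur_on

    def on_key(self, key_code: int) -> None:
        """Map the key code if capture is active, then dispatch the character."""
        if self.uyghur_on:
            char = self.keymap.get(key_code, chr(key_code))
        else:
            char = chr(key_code)
        self.buffer.append(char)

    def text(self) -> str:
        return "".join(self.buffer)
```

Key codes without an entry in the map fall through unchanged, so digits and punctuation still work while Uyghur input is active.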
Character mapping: creates an “A_is_B” matrix for               The embeddable web fonts, generated by third-party
every script pair, or three matrices in total.                  software WEFT, are compatible only with Internet
Character converting: uses the three matrices in order to       Explorer. Therefore, we are truly looking forward to
convert between the different scripts.                          more efforts by the computer software industry to expand
Disambiguation: this module is necessary when                   compatibility. We expect to improve the pre-processing
converting from LSU to ASU and/or CSU, because of               module of the converting tool to make it more user-
spelling mistakes or, more importantly, because of the          friendly. There are undoubtedly other theoretical issues to
problems due to the difficulty encountered in typing the        resolve especially in the disambiguating of LSU
LSU diacritical marks on many keyboards: very                   misspelled words.
commonly, the letters Ö, Ü, É, ö, ü and é are replaced by       Another important problem related to Uyghur is the
O, U, E, o, u and e. This may cause fatal errors. For           major impediment to developing a spell-check
example: öltürüsh (to kill) → olturush (to sit, party),         functionality caused by its agglutinative nature,
térim yer (cultivable land) → terim yer (who eats my            coupled with associated spelling changes in root words.
sweat), yétim (orphan) → yetim (spelling mistake).              This work is going to be the focus of our attention in
Besides, spelling mistakes due to a poor grasp of LSU           the next stage of development.
rules are a significant problem. All these problems require     Finally, we call on software companies not to omit the
intensive language processing. This functionality of the        Uyghur from their supported language list in the future.
multiscript converting tool28 that we have released on the
internet is still under development. The following images       8. References
will help you understand our converting tools which use         [1] Waris A. Janbaz, Online Uyghur Unicode processing
above mentioned methods.                                            technique and its implementation (publication in
                                                                    Chinese), Xinjiang University Press, China, 2002.
                                                                [2] Abdurehim, Waris A. Janbaz, Orthographic rules of
                                                                    the Latin-Script Uyghur (in Uyghur) , 2004,
                                                                    http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYe
                                                                    ziq).htm.
                                                                [3] The Unicode Consortium The Unicode Standard,
                                                                    Version 4.0, Addison-Wesley Professional, ISBN:
                                                                    0321185781, USA, 2003.
                                                                [4] Xinjiang University, Proceedings 2000 International
                                                                    Conference on Multilingual Information Processing.
                                                                    Ürümchi (publication in Chinese), China, 2000.
                                                                [5] The Unicode Consortium Website
        Image 1. Offline plug-in version for Microsoft Word         http://www.unicode.org
                                                                [6] Reinhard F. Hahn, Spoken Uyghur. Washington: the
                                                                    University of Washington Press, ISBN: 0-295-
                                                                    97015-4, USA, 1991.

Image 2. Online demo version

7. Conclusions and future work
Our work to date has focused mainly on the design and implementation
issues related to creating Uyghur Unicode fonts, as well as on a
browser-level input method and a multi-script conversion application.
Judging from user feedback, we are fairly satisfied with the results
of this first-ever research on Uyghur language processing.

28 The online demo version is available at
   http://www.uyghurdictionary.org/tools.asp; the offline plug-in
   version for Microsoft Word is available at
   http://oyghan.com/OTB/index.html

Annex 1: Arabic-Script Uyghur, Cyrillic-Script Uyghur and
Latin-Script Uyghur Alphabets

LSU      CSU   ASU
a        а     ﺋﺎ
e        ә     ﺋﻪ
b        б     ﺏ
p        п     پ
t        т     ﺕ
j        җ     ﺝ
ch       ч     چ
x        х     ﺥ
d        д     ﺩ
r        р     ﺭ
z        з     ﺯ
j (zh)   ж     ژ
s        с     ﺱ
sh       ш     ﺵ
gh       ғ     ﻍ
f        ф     ﻑ
q        қ     ﻕ
k        к     ﻙ
g        г     گ
ng       ң     ڭ
l        л     ﻝ
m        м     ﻡ
n        н     ﻥ
h        һ     ھ
o        о     ﺋﻮ
u        у     ﺋﯘ
ö        ө     ﺋﯚ
ü        ү     ﺋﯜ
w        в     ۋ
é        е     ﺋﯥ
i        и     ﺋﻰ
y        й     ﻱ

Additional Cyrillic letters: ы ё ц э ю я
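
The letter correspondences of Annex 1 can be read as a lookup table for script conversion. The following is a minimal sketch in Python of an LSU-to-CSU transliteration built only from that table; it is not the authors' converter, and the names LSU_TO_CSU and lsu_to_csu are hypothetical. It assumes lowercase-insensitive input, always treats ch, sh, gh, ng and zh as digraphs (so a morpheme boundary such as n + g is not detected), and maps a bare j to җ, using zh for ж.

```python
# Illustrative LSU -> CSU table copied from Annex 1 (hypothetical name).
LSU_TO_CSU = {
    "a": "а", "e": "ә", "b": "б", "p": "п", "t": "т", "j": "җ",
    "ch": "ч", "x": "х", "d": "д", "r": "р", "z": "з", "zh": "ж",
    "s": "с", "sh": "ш", "gh": "ғ", "f": "ф", "q": "қ", "k": "к",
    "g": "г", "ng": "ң", "l": "л", "m": "м", "n": "н", "h": "һ",
    "o": "о", "u": "у", "ö": "ө", "ü": "ү", "w": "в", "é": "е",
    "i": "и", "y": "й",
}


def lsu_to_csu(text: str) -> str:
    """Transliterate Latin-Script Uyghur to Cyrillic-Script Uyghur.

    Greedy longest match: a two-letter digraph is tried before a
    single letter. Characters outside the table pass through
    (lowercased) unchanged.
    """
    lowered = text.lower()
    out = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if len(pair) == 2 and pair in LSU_TO_CSU:
            out.append(LSU_TO_CSU[pair])
            i += 2
        elif lowered[i] in LSU_TO_CSU:
            out.append(LSU_TO_CSU[lowered[i]])
            i += 1
        else:
            out.append(lowered[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(lsu_to_csu("uyghur"))  # уйғур
```

A production converter would need extra rules that the table alone does not provide, for example disambiguating digraphs at morpheme boundaries and handling the additional Cyrillic letters ы ё ц э ю я.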
Annex 2: Arabic-Script Uyghur Alphabet with shapes

  • 3. ‫ې‬ ‫ﯥ‬ ‫ﯧ‬ ‫ﯦ‬ ‫ﯶ‬ ‫ﯷ‬ ‫ﺌﯧ‬ ‫ﯸ‬ and RTL (right to left mark; 200F), is also recommended ‫ﻯ‬ ‫ﻰ‬ ‫ﯩ‬ ‫ﯨ‬ ‫ﯹ‬ ‫ﯺ‬ ‫ﺌﯩ‬ ‫ﯻ‬ in any Uyghur font. The rest of the time-consuming repetitive font developing task is absolutely the same as ‫ھ‬ ‫ﻬ‬ ‫ﻬ‬ ‫ھ‬ when creating an Arabic script font 20 . Some Uyghur Table 1. Uyghur vowels and the three problem-letters (the one Arabic character ‫ ھ‬hah has four different basic shapes, which correspond to the Unicode fonts are available for free at the UCSA website. four shapes of two different letters in Uyghur). Our recommended font creating tools are: Font Creator21 and Fontographer 22 . Glyph substitutions, positioning ‫ ﺉ‬and ‫ :61ﺌ‬the glottal stop: this is a phoneme which is not lookups and shaping features and Open Type tables of listed separately in the ASU alphabet but still covered by Arabic fonts can be added with the help of software like its spelling rules. In Uyghur words, the glottal stop is not Microsoft VOLT. as strongly pronounced as it is in Semitic languages or in Uzbek, for example, and it has weakened to become no 4. Font embedding and character displaying more than a hiatus. Marked in ASU by a hamza on top of Web pages can be rendered without downloading or a “tooth”, it appears usually in words of Arabic origin installing any specific fonts if: 1) the fonts used in the and replaces an original ‘ain (‫ )ع‬or a hamza (‫ )ء‬in a pages are available on user’s computer, and 2) if the median or final position (e.g. ‫ ﺋﺎﻟﻪﻡ‬from Arabic ‫,ﻋﺎﻟﹶﻢ‬ browsers provide native support for the fonts and ‫ ﺳﺎﺋﻪﺕ‬from Arabic ‫ ﺧﺎﺋﯩﻦ ,ﺳﺎ َﺔ‬from Arabic ‫ﺳﻮﺋﺎﻝ , ﺧﺎﺋِﻦ‬ ‫ﻋ‬ languages used. The second condition has already been from Arabic ‫ .)ﺳ َال‬In initial position, the same sign is ‫ُﺆ‬ met but unfortunately the first one has not yet, as there considered as part of the initial form of a vowel and does are no Uyghur fonts available on the existing platforms not have any phonetic value 17 . 
They correspond to the that are installed on the users’ computers. Therefore, to initial and median forms of the Arabic letter ‫.6260 ئ‬ ensure that Uyghur texts are displayed correctly in web These Arabic glyphs are not considered as different browsers, users must find a way to install in their shapes of any independent letter in the Uyghur alphabet computers the fonts that are used in the web pages. The (cf. annex 2). Since one glyph of each of the two letters same holds true for all the other “forgotten languages” on ‫ ﺋﯥ‬and ‫( ﺋﻰ‬shown in light red in the table above) are still different platforms. The font installation requirement either causes difficulties for people who don’t have much missing in Unicode, we can use a sequence of either of technical experience, or discourages others from these glyphs ( ‫ ﺉ‬or ‫ )ﺌ‬followed by the final, isolated, attempting to read the text. median′ or final′ forms of vowels ‫ ﺋﯥ‬and ‫( ﺋﻰ‬shown in These difficulties can be overcome by embedding fonts blue in the table above). More precisely, the other into the web pages. When a page is downloaded into a conjunct forms can be obtained combining with the browser via the Hypertext Transfer Protocol, any Arabic letter ‫ 6260 ئ‬and a vowel respectively. embedded fonts in the page are also downloaded without In spite of the above mentioned limitations (two glyphs any need for the user to intervene. The Microsoft Web instead of one conjunct glyph for ‫ ﺋﯥ‬and ‫ )ﺋﻰ‬the above Embedding Fonts Tool—WEFT 23 makes it possible to mentioned conventions have now been widely accepted create embedded font objects that can be linked to web by the Uyghur Computer Science Association(UCSA18), pages. The following steps let web pages developers and at a later date, by the Xinjiang University branch of create embedded fonts and link them to a web page: the 863 Research Group19. 
• Create embedded fonts using Microsoft WEFT After having learnt the specificities of those letters, it is • Prepare the web page using any fonts that are easy to create Uyghur fonts using existing font creating installed on the platform, and software. The inclusion of non-spacing combining marks, • Link the embedded fonts to the web page. such as ZWJ (zero width joiner 200C), ZWNJ (zero Microsoft WEFT generates 1) embedded fonts for every width non-joiner; 200D), LTR (left to right mark; 200E), web site with a different extension (.EOT), and 2) a script that links an embedding font to a web page. The 16 disadvantage of the WEFT generated embedded fonts is Character name for the Unicode Standard: ARABIC LETTER YEH WITH HAMZA ABOVE <initial> and that the fonts are compatible only with Internet Explorer. <median> 0626. This makes it highly desirable for more efforts to be 17 It is often said that the decision of Uyghur linguists to add invested in providing a cross-platform functionality for this sign as part of the initial form of letters is a link with the this kind of software. old Uyghur writing system, in which all initial vowels were preceded by a tooth. The Arabic alphabet has 3 letters, ‫ و ,ا‬and ‫ ي‬which can be used to indicate long vowels. Short vowels can be indicated through the use of vowel marks above or under the consonants but which are dispensed of in normal writing. Given its phonetic characteristics, Uyghur notes down all vowels: ،‫ﺋﺎ‬ ‫ ,ﺋﻪ، ﺋﻮ، ﺋﯘ، ﺋﯚ، ﺋﯜ، ﺋﯥ، ﺋﻰ‬using derivates of traditional Arabic 20 See letters. http://www.microsoft.com/typography/OpenType%20Dev/arabi 18 UCSA – The Uyghur Computer Science Association (or c/intro.mspx for more information about developing OpenType UKIJ – Uyghur Kompyutér Ilimi Jem’iyiti in Uyghur) is a non- Fonts for Arabic Script 21 profit association, founded by the author in Jan 2004. 
Web site: http://www.high-logic.com/fontcreator.html 22 http://www.ukij.org http://www.fontlab.com/Font-tools/Fontographer 19 23 A National High-Tech Research Group, financed by the PRC Free software at government. The XJU branch is specialized in multilingual http://www.microsoft.com/typography/web/embedding/default. software development. htm
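The code point conventions of section 3 can be checked against the Unicode character database. The following is a minimal Python sketch, not part of the original tools. The dictionary labels, the helper `describe` and the constant names are hypothetical. The use of 06D0 (ARABIC LETTER E) for the Uyghur vowel é is an assumption based on the Unicode charts, since the paper itself only names 06D5, 06BE, 0649, 06CC and 0626.

```python
import unicodedata

# Code points named in section 3 (labels are illustrative only).
CONVENTIONS = {
    "Uyghur e": "\u06D5",
    "Uyghur h": "\u06BE",
    "Uyghur i (initial and median shapes)": "\u0649",
    "glottal stop sign (hamza on a tooth)": "\u0626",
}

def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# The two conjunct median forms that have no single presentation-form glyph
# are written as a two-character sequence: hamza seat followed by the vowel.
E_CONJUNCT = "\u0626\u06D0"  # assumption: 06D0 is the code point used for é
I_CONJUNCT = "\u0626\u0649"

if __name__ == "__main__":
    for label, ch in CONVENTIONS.items():
        print(label, "->", describe(ch)[0])
    print(describe(E_CONJUNCT))
    print(describe(I_CONJUNCT))
```

Storing the conjuncts as two characters is what makes the storage and sorting overhead mentioned in section 3 visible: each such vowel occupies two code points instead of one.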
5. Creation of a browser-level virtual input method
As mentioned in the introduction, the existing platforms do not supply any system-level Uyghur language inputting service. Late in 2003, the first system-level Uyghur Unicode IME for Windows was developed by the author and distributed free of charge²⁴. Six months later, the Xinjiang University branch of the 863 Research Group and some individuals started joining the Uyghur Unicode Popularization campaign by distributing their Unicode-supported IME. Nevertheless, it still cannot be said that all or even most Uyghur internet users are equipped with Uyghur inputting tools. Therefore, the browser-level inputting method still fills a great need, since it enables people to input Uyghur letters into any text-inputting field on a web page without having to install a system-level Uyghur IME. The basic structure of the browser-level Uyghur text inputting tool is represented in figure 1:

[Flowchart: Keyboard and mouse events -> Input Uyghur? (no / yes) -> Capture K.&M. Events -> Code-Char. Mapping -> Dispatch Events -> Switch Lang.? (no / yes) -> Release K.&M. Events]
Figure 1. Workflow of the browser-level inputting method

As we can see from the workflow above, once the user selects the Uyghur Inputting option, the "capture keyboard and mouse events" module creates a hook to monitor the keyboard and mouse activities. The "code-char. mapping" module creates a keycode-to-Uyghur-character matrix to get the right Uyghur character that corresponds to the key code (ex: 109 → ﻡ). The "dispatch events" module sends Uyghur characters from the map to the active text inputting field on a web page. This process repeats itself until the "release keyboard and mouse events" module frees the hook immediately after the user decides to switch the inputting language to another one. This method has been implemented using JavaScript and VBScript, tested on different browsers, and is commonly used on some Uyghur web sites²⁵.

6. Multiscript converting
Due to the co-existence of different writing systems (Arabic-Script Uyghur, Cyrillic-Script Uyghur and Latin-Script Uyghur) for the Uyghur language, research on a conversion tool with which people can toggle between the three scripts is needed for future information sharing. The fact that there is a one-to-one correspondence²⁶ between the letters of these three writing systems is certainly a major helping factor. For better understanding, we take as an example the Uyghur proverb "working for free is better than doing nothing" in the three scripts:
ﺑﯩﻜﺎﺭ ﻳﯜﺭﮔﯩﭽﻪ ﺑﯩﻜﺎﺭ ﺋﯩﺸﻠﻪ
бикар йүргичә бикар ишлә
bikar yürgiche bikar ishle
The following workflow explains the basic conversion process:

[Flowchart: Source text in source script -> Pre-processing -> Character mapping -> Character converting -> Disambiguation -> Conversion end.? (no / yes) -> Result in destination script]
Figure 2. Script converting

The functionalities of each module may require some clarification:
Pre-processing: this is an important step in converting. It involves preserving elements that should remain unchanged²⁷ after the conversion. For example, when converting the LSU text "Men Photoshop ni yaxshi körimen" (I love Photoshop) into ASU, we should be able to obtain "ﻣﻪﻥ Photoshop ﻧﻰ ﻳﺎﺧﺸﻰ ﻛﯚﺭﯨﻤﻪﻥ" and vice-versa.
Character mapping: creates an "A_is_B" matrix for every script pair, or three matrices in total.
Character converting: uses the three matrices in order to convert between the different scripts.
Disambiguation: this module is necessary when converting from LSU to ASU and/or CSU, because of spelling mistakes or, more importantly, because of the difficulty of typing the LSU diacritical marks on many keyboards: very commonly, the letters Ö, Ü, É, ö, ü and é are replaced by O, U, E, o, u and e. This may cause fatal errors. For example: öltürüsh (to kill) → olturush (to sit, party), térim yer (cultivable land) → terim yer (who eats my sweat), yétim (orphan) → yetim (spelling mistake). Besides, spelling mistakes due to a poor grasp of LSU rules are a significant problem. All these problems require intensive language processing. This functionality of the multiscript converting tool²⁸ that we have released on the internet is still under development. The following images will help you understand our converting tools, which use the above-mentioned methods.

[Image 1. Offline plug-in version for Microsoft Word]
[Image 2. Online demo version]

24 More than 200,000 downloads counted since Dec 2003 from www.oyghan.com and www.bizuyghur.com/oyghan .
25 See www.ukij.org, www.biliwal.com, www.oyghan.com, www.uyghurdictionary.org etc.
26 The only exception is j (as in jurnal) in LSU.
27 This is the case of hypertext links, HTML tags and proper names.
28 Online demo version is available at http://www.uyghurdictionary.org/tools.asp, offline plug-in version for Microsoft Word is available at http://oyghan.com/OTB/index.html

7. Conclusions and future work
Our work to date has focused mainly on the design and implementation issues related to creating Uyghur Unicode fonts, as well as on the browser-level input method and the multi-script converting application. Judging from user feedback, we feel fairly satisfied with the results of this first ever research on Uyghur language processing. The embeddable web fonts, generated by the third-party software WEFT, are compatible only with Internet Explorer. Therefore, we are truly looking forward to more efforts by the computer software industry to expand compatibility. We expect to improve the pre-processing module of the converting tool to make it more user-friendly. There are undoubtedly other theoretical issues to resolve, especially in the disambiguation of misspelled LSU words.
Another important problem related to Uyghur is the major impediment to developing a spell-check functionality caused by its agglutinative nature, coupled with associated spelling changes in root words. This work is going to be the focus of our attention in the next stage of development.
Finally, we call on software companies not to omit Uyghur from their supported language lists in the future.

8. References
[1] Waris A. Janbaz, Online Uyghur Unicode processing technique and its implementation (publication in Chinese), Xinjiang University Press, China, 2002.
[2] Abdurehim, Waris A. Janbaz, Orthographic rules of the Latin-Script Uyghur (in Uyghur), 2004, http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYeziq).htm.
[3] The Unicode Consortium, The Unicode Standard, Version 4.0, Addison-Wesley Professional, ISBN: 0321185781, USA, 2003.
[4] Xinjiang University, Proceedings 2000 International Conference on Multilingual Information Processing. Ürümchi (publication in Chinese), China, 2000.
[5] The Unicode Consortium Website, http://www.unicode.org
[6] Reinhard F. Hahn, Spoken Uyghur. Washington: the University of Washington Press, ISBN: 0-295-97015-4, USA, 1991.

Annex 1: Arabic-Script Uyghur, Cyrillic-Script Uyghur and Latin-Script Uyghur Alphabets

ASU   ﺥ   چ   ﺝ   ﺕ   پ   ﺏ   ﺋﻪ   ﺋﺎ
LSU   x   ch   j   t   p   b   e   a
CSU   х   ч   җ   т   п   б   ә   а

ASU   ﻑ   ﻍ   ﺵ   ﺱ   ژ   ﺯ   ﺭ   ﺩ
LSU   f   gh   sh   s   j (zh)   z   r   d
CSU   ф   ғ   ш   с   ж   з   р   д

ASU   ھ   ﻥ   ﻡ   ﻝ   ڭ   گ   ﻙ   ﻕ
LSU   h   n   m   l   ng   g   k   q
CSU   һ   н   м   л   ң   г   к   қ

ASU   ﻱ   ﺋﻰ   ﺋﯥ   ۋ   ﺋﯜ   ﺋﯚ   ﺋﯘ   ﺋﻮ
LSU   y   i   é   w   ü   ö   u   o
CSU   й   и   е   в   ү   ө   у   о

Additional Cyrillic letters: ы ё ц э ю я
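The letter correspondences in Annex 1 and the "character mapping" and "character converting" modules of section 6 can be illustrated with a small Python sketch. This is not the authors' converter. The names `LSU_TO_CSU` and `lsu_to_csu` and the `keep` parameter are hypothetical. The sketch lowercases its input, treats "zh" as the spelling of ж, maps a bare "j" to җ (the ambiguity the paper notes for j), and ignores the additional Cyrillic letters such as я and ю.

```python
import unicodedata

# LSU -> CSU letter table transcribed from Annex 1 (32 letters).
LSU_TO_CSU = {
    "a": "а", "e": "ә", "b": "б", "p": "п", "t": "т", "j": "җ", "ch": "ч", "x": "х",
    "d": "д", "r": "р", "z": "з", "zh": "ж", "s": "с", "sh": "ш", "gh": "ғ", "f": "ф",
    "q": "қ", "k": "к", "g": "г", "ng": "ң", "l": "л", "m": "м", "n": "н", "h": "һ",
    "o": "о", "u": "у", "ö": "ө", "ü": "ү", "w": "в", "é": "е", "i": "и", "y": "й",
}

# Try two-letter LSU digraphs before single letters (greedy longest match).
_KEYS = sorted(LSU_TO_CSU, key=len, reverse=True)

def lsu_to_csu(text: str, keep: frozenset[str] = frozenset()) -> str:
    """Convert space-separated LSU text to CSU; words listed in keep pass through."""
    text = unicodedata.normalize("NFC", text)  # make ö, ü, é single code points
    out = []
    for word in text.split(" "):
        if word in keep:  # pre-processing: protected element such as a proper name
            out.append(word)
            continue
        lower = word.lower()
        i, converted = 0, []
        while i < len(lower):
            for key in _KEYS:
                if lower.startswith(key, i):
                    converted.append(LSU_TO_CSU[key])
                    i += len(key)
                    break
            else:
                converted.append(lower[i])  # punctuation and digits stay unchanged
                i += 1
        out.append("".join(converted))
    return " ".join(out)

if __name__ == "__main__":
    print(lsu_to_csu("bikar yürgiche bikar ishle"))
    print(lsu_to_csu("Men Photoshop ni yaxshi körimen", keep=frozenset({"Photoshop"})))
```

A purely table-driven pass like this cannot repair text in which ö, ü and é were typed as o, u and e; that is the job of the disambiguation module described in section 6.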
Annex 2: Arabic-Script Uyghur Alphabet with shapes