[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (RuhrUniversität Bochum) "The Glossarium GraecoArabicum. Linguistic Research and Database Design in Polyalphabetic Environments"
The Glossarium Graeco-Arabicum makes available information in the following fields of research:
• the vocabulary and syntax of Classical and Middle Arabic;
• the development of a scientific and technical vocabulary in Arabic;
• the vocabulary of Classical and Middle Greek;
• the chronology and nature of the translation movement into Arabic;
• the establishment of the texts of Greek works and their Arabic translations.
12. November 2013
Telota
Glossarium Graeco-Arabicum
Languages used within the project:
Language         | Alphabet         | Diacritics              | Direction
Ancient Greek    | Greek alphabet   | 3 layers of diacritics  | LTR (left to right)
Medieval Arabic  | Arabic alphabet  | optional vowel signs    | RTL (right to left)
Modern English   | Latin alphabet   | 1 layer of diacritics   | LTR (left to right)
I.1. LANGUAGES
Unicode Chart                | Range      | Description
C0 Controls and Basic Latin  | 0000-007F  | Latin alphabet
Latin Extended-A             | 0100-017F  | transliteration symbols
Latin Extended Additional    | 1E00-1EFF  | transliteration symbols
Greek and Coptic             | 0370-03FF  | Greek alphabet
Greek Extended               | 1F00-1FFF  | Greek diacritics
Arabic                       | 0600-06FF  | Arabic alphabet
Arabic Supplement            | 0750-077F  | Arabic alphabet
Spacing Modifier Letters     | 02B0-02FF  | special Arabic characters
→ in total: about 450 different characters from eight different charts
I.2. UNICODE
Requirements:
1. Data input in all three alphabets with all vowels and diacritics
→ How to implement a comfortable interface?
2. Simultaneous display of texts in three alphabets and two directions
→ How to implement concurrent writing directions?
3. Search for terms, insensitive to diacritics or vowels
→ How to implement queries with different collation sets?
I.3. REQUIREMENTS
Character | Code point | Unicode name                             | Use
[ʾ]       | U+02BE     | MODIFIER LETTER RIGHT HALF RING          | transliteration of Arabic hamza
[˒]       | U+02D2     | MODIFIER LETTER CENTRED RIGHT HALF RING  | more rounded articulation
[ʿ]       | U+02BF     | MODIFIER LETTER LEFT HALF RING           | transliteration of Arabic ain
[˓]       | U+02D3     | MODIFIER LETTER CENTRED LEFT HALF RING   | less rounded articulation
I.4.a. DATA INPUT
Problem: Appearance vs. Encoding
Users will normally choose characters …
→ not because of their Unicode description
→ but because of their appearance
How to bring Unicode to the user?
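One way to make the encoding visible next to the appearance is to show the code point and the official Unicode name for each character a user can pick. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `describe` is hypothetical and not part of the project.

```python
import unicodedata

# Four half-ring characters that look alike on screen but are distinct
# code points with distinct meanings.
LOOKALIKES = ["\u02be", "\u02d2", "\u02bf", "\u02d3"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(ch, describe(ch))
```

A label of this kind could be shown as a tooltip on a virtual keyboard key, so that the user picks a character by its function rather than by its shape alone.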
I.4.a. DATA INPUT
Solutions:
– restrict the characters accepted by the database
→ safe, but requires validation methods
– provide a virtual keyboard (onscreen)
→ user-friendly
Alternative methods:
– beta code
→ less advisable from a Unicode point of view
→ but widely used
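The first solution, restricting the accepted characters, can be sketched as a whitelist check run before data is stored. This is an illustration under assumptions, not the project's validation code: it accepts the eight whole Unicode ranges named in the talk, which is coarser than a curated corpus of about 450 characters, and the function names are hypothetical. Input is normalized to NFC first, because the Combining Diacritical Marks block is not among the eight ranges.

```python
import unicodedata

# The eight code point ranges listed in the talk's Unicode chart.
ALLOWED_RANGES = [
    (0x0000, 0x007F),  # C0 Controls and Basic Latin
    (0x0100, 0x017F),  # Latin Extended-A
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x1F00, 0x1FFF),  # Greek Extended
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x02B0, 0x02FF),  # Spacing Modifier Letters
]


def disallowed_characters(text: str) -> list[str]:
    """Return the characters of the NFC-normalized text outside the whitelist."""
    normalized = unicodedata.normalize("NFC", text)
    return [
        ch
        for ch in normalized
        if not any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)
    ]


def is_valid(text: str) -> bool:
    """True if every character falls into one of the allowed ranges."""
    return not disallowed_characters(text)
```

Returning the offending characters, rather than only a yes/no answer, lets an input form tell the user exactly which character was rejected.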
I.4.a. DATA INPUT
Problem: Strong vs. Weak Characters
In Unicode, alphabetic characters are usually
STRONG CHARACTERS
which determine the writing direction,
while punctuation characters are usually
WEAK CHARACTERS
which do not change the writing direction.
→ relevant in:
comma separated lists, bibliographic references,
breadcrumb lines, table alignments …
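The distinction can be inspected directly, since every character carries a bidirectional class in the Unicode Character Database. The sketch below assumes Python's standard `unicodedata` module. Letters report a strong class (`L` for left-to-right, `AL` for Arabic letters), the comma reports the weak class `CS` (common separator), and the asterisk reports `ON` (other neutral).

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


if __name__ == "__main__":
    # Latin a, Greek alpha, Arabic meem, comma, asterisk, space
    for ch in ["a", "\u03b1", "\u0645", ",", "*", " "]:
        print(repr(ch), bidi_class(ch))
```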
I.4.b. WRITING DIRECTIONS
Solution:
Greek: Greek collation
Arabic: Arabic collation
English: Latin collation
Collation Charts: <http://unicode.org/charts/uca/>
Restrictions:
– does not work for mixed texts
→ data needs to be separated
– some environments do not support Arabic vowel collation
→ e.g. MySQL <6.0
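Where the database offers no suitable collation, one possible workaround is to store a second, folded copy of each term and search against that copy. The sketch below is an application-level alternative to database collations, not the approach described in the talk, and `search_key` is a hypothetical name. It decomposes the text (NFD), drops every combining mark, and recomposes. Note that this also removes the hamza from precomposed letters such as U+0623, which may or may not be wanted.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold a term for diacritic- and vowel-insensitive comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    # Greek accents and breathings, Arabic vowel signs and Latin
    # transliteration diacritics all have a non-zero combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()
```

A query would then compare `search_key(user_input)` with the stored folded column, while the original spelling stays untouched for display.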
I.4.c. SEARCH
Phenomenon:
– user searches for Arabic words starting with مل
– truncation symbol (asterisk) appears on the wrong side
*مل
Problem: Neutral Writing Direction
– the standard asterisk is a NEUTRAL CHARACTER
– it adopts the main writing direction
I.4.d. SEARCH TERMS
Challenges for the Developer:
– Unicode does not provide general truncation or wildcard (joker) symbols
– different asterisk and wildcard signs must be processed
– no standard solution available
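Processing different asterisk signs can be sketched as a normalization step that maps every accepted variant to one wildcard before the query is built. The set of variants below is an assumption about what users might type, the target syntax (SQL `LIKE` with `%`) is chosen only for illustration, and `to_like_pattern` is a hypothetical name.

```python
# Asterisk-like characters treated as truncation symbols (assumed set).
ASTERISK_VARIANTS = {
    "*",       # U+002A ASTERISK
    "\u066d",  # ARABIC FIVE POINTED STAR
    "\u2217",  # ASTERISK OPERATOR
    "\ufe61",  # SMALL ASTERISK
    "\uff0a",  # FULLWIDTH ASTERISK
}

# Directional marks, embeddings, overrides and isolates that may have been
# typed or pasted along with the term.
BIDI_CONTROLS = set(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def to_like_pattern(term: str, escape: str = "\\") -> str:
    """Convert a user search term into a SQL LIKE pattern."""
    out = []
    for ch in term:
        if ch in BIDI_CONTROLS:
            continue  # invisible formatting characters carry no content
        if ch in ASTERISK_VARIANTS:
            out.append("%")
        elif ch in ("%", "_", escape):
            out.append(escape + ch)  # keep literal LIKE metacharacters
        else:
            out.append(ch)
    return "".join(out)
```

The resulting pattern should be passed to the database as a bound parameter together with a matching `ESCAPE` clause, never concatenated into the SQL string.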
I.4.d. SEARCH TERMS
Technical Recommendations for Polyalphabetic Environments
– use software components that support Unicode throughout
– compose a project corpus of Unicode characters
– provide input methods to make the characters easily available
– consider Unicode writing directions and collations
– make sure that all characters not only appear correctly but are also encoded correctly
SUMMARY OF I.
1. Corpus
→ How to deal with a database of 70,000+ words?
2. Translation movements
→ How to visualize transformations of language structures?
3. Single Lexemes
→ How to transform the database into a dictionary?
II. SCHOLARLY REQUIREMENTS
How to deal with a database of 70,000+ words?
– search form
→ user needs to know exactly what he/she is looking for
– browsing (e.g. by sources and words in alphabetical order)
→ user needs to know roughly what he/she is looking for
– visualization
→ statistical and/or graphical approach
→ user can explore the corpus
II.1. CORPUS
Compared Parts of Speech
X-Axis: Greek Parts of Speech
Y-Axis: Arabic Parts of Speech
Intersections: Dot size represents number of words transferred from Greek PoS into Arabic PoS
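The data behind such a chart is a cross-tabulation of part-of-speech pairs. The sketch below counts pairs and derives a dot radius; the sample pairs are invented placeholders rather than Glossarium data, and the choice to scale the dot area (not the radius) with the count is an assumption about the chart, not something stated on the slide.

```python
import math
from collections import Counter


def pos_crosstab(pairs):
    """Count how often each (Greek PoS, Arabic PoS) pair occurs."""
    return Counter(pairs)


def dot_radius(count: int, max_count: int, max_radius: float = 20.0) -> float:
    """Radius such that the dot's area is proportional to the count."""
    return max_radius * math.sqrt(count / max_count)


# Invented placeholder pairs, for illustration only.
SAMPLE_PAIRS = [
    ("noun", "noun"),
    ("noun", "noun"),
    ("verb", "noun"),
    ("adjective", "noun"),
    ("verb", "verb"),
]

if __name__ == "__main__":
    counts = pos_crosstab(SAMPLE_PAIRS)
    largest = max(counts.values())
    for (greek_pos, arabic_pos), n in sorted(counts.items()):
        print(greek_pos, arabic_pos, n, round(dot_radius(n, largest), 1))
```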
II.2.b. TRANSLATION MOVEMENTS
How to transform the database into a dictionary?
Experimental preview:
→ collation of all entries of a Greek lexeme
→ ordered by Arabic lexeme
→ output with source and context
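The three steps of the preview can be sketched as a grouping operation over database records. Everything concrete below is hypothetical: the `Entry` fields, the function name, and the sample records, which are invented placeholders and not Glossarium entries. Arabic lexemes are sorted by code point here, which is only a rough stand-in for a proper Arabic collation.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    greek: str    # Greek lexeme
    arabic: str   # Arabic lexeme used in the translation
    source: str   # work in which the pair occurs
    context: str  # quoted passage


def dictionary_article(entries, greek_lexeme: str) -> str:
    """Collate all entries of one Greek lexeme, ordered by Arabic lexeme."""
    by_arabic = defaultdict(list)
    for entry in entries:
        if entry.greek == greek_lexeme:
            by_arabic[entry.arabic].append(entry)

    lines = [greek_lexeme]
    for arabic in sorted(by_arabic):  # naive code point order
        lines.append(f"  {arabic}")
        for entry in by_arabic[arabic]:
            lines.append(f"    {entry.source}: {entry.context}")
    return "\n".join(lines)


# Invented placeholder records, for illustration only.
SAMPLE = [
    Entry("λόγος", "قول", "Source A", "context 1"),
    Entry("λόγος", "عقل", "Source B", "context 2"),
    Entry("λόγος", "قول", "Source C", "context 3"),
    Entry("φύσις", "طبيعة", "Source A", "context 4"),
]

if __name__ == "__main__":
    print(dictionary_article(SAMPLE, "λόγος"))
```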
II.3.a. SINGLE LEXEMES
Recommendations
1. provide multiple access methods
→ support various user scenarios
2. invent statistical and visual evaluation methods
→ profit from electronic data processing
3. provide conventional scholarly formats
→ correspond to the community’s needs
SUMMARY OF II.
Situation: Technical vs. Scholarly Requirements
– which one goes first?
→ technical requirements as necessary basis
→ scholarly requirements as superior objective
– both need attention from scholars
– both need attention from techies
→ mutual understanding
→ team competence
LAST BUT ONE SLIDE