Discourse annotation for arabic 2
1. Survey on Discourse
Annotation for Arabic
A. Algarni, H. Alharbi and N. Almutairy
Supervisor: Dr. A. Alsaif
April 23, 2013
Kingdom of Saudi Arabia
Ministry of Higher Education
Imam Mohammed Ibn Saud Islamic University
College of Computer and Information Sciences
CS465 - Natural Language Processing
2. Outline
Introduction
The Leeds Arabic Discourse Treebank
Discourse Connective Recognition
Discourse Relation Recognition
Semantic-Based Segmentation
Discourse Segmentation Based on Rhetorical
Methods
A Comprehensive Taxonomy of Arabic Discourse
Coherence Relations
3. Introduction
Linguistic annotation covers any descriptive
or analytic notations applied to raw language
data.
Annotated discourse corpora can be very
useful for theoretical studies and
contribute to the development of NLP
applications.
5. Discourse Relations and
Discourse Connectives
Discourse Relation: the way that two
arguments (text segments) are logically connected.
Examples: Temporal, Comparison, Causal, Expansion, etc.
Discourse Connective (DC): a lexical marker
used to link two abstract objects in a text.
Abstract Object (AO): abstract objects in
discourse are things like propositions,
events, facts and opinions.
Argument (Arg): a text span expressing an abstract
object and linked by a DC.
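The definitions above can be pictured as a small data structure. This is only an illustrative sketch: the class and field names (`Argument`, `DiscourseRelation`, `relation`) are assumptions made for this example and are not the format of any treebank; the sentence is the "because" example used later in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    """A text span expressing an abstract object (proposition, event, fact, opinion)."""
    text: str


@dataclass(frozen=True)
class DiscourseRelation:
    """One annotated relation: a discourse connective linking two arguments."""
    arg1: Argument
    connective: str   # the DC, a lexical marker
    arg2: Argument
    relation: str     # e.g. "Temporal", "Comparison", "Causal", "Expansion"


# Hypothetical annotation of an English example sentence.
example = DiscourseRelation(
    arg1=Argument("I am very happy"),
    connective="because",
    arg2=Argument("I got excellent marks in exams"),
    relation="Causal",
)
```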
6. The Leeds Arabic Discourse
Treebank
• First effort towards producing an Arabic
Discourse Treebank was introduced in 2011
by A. Alsaif and K. Markert.
• Collected a large set of Arabic discourse
connectives using text analysis and corpus-based
techniques.
• Final list contains 107 discourse
connectives.
11. Annotation Methodology
1. Measuring whether annotators agree on
the binary decision of whether an item
constitutes a discourse connective in
context.
2. Measuring whether annotators agree on
which discourse relation an identified
connective expresses. Annotators can
assign a set of relations to a connective.
12. Results
Agreement in task 1 (connective identification)
is highly reliable (N=23331): percentage
agreement of 0.95, kappa of 0.88.
Agreement in task 2 (relation assignment)
is relatively low (N=5586): percentage
agreement of 0.66, kappa of 0.57, and alpha
of 0.58.
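As a reminder of what the two agreement figures measure, here is a minimal sketch of percentage agreement and kappa for two annotators. It assumes Cohen's kappa (the slides do not say which kappa variant was used), and the label lists are toy data invented for illustration, not the treebank annotations.

```python
from collections import Counter


def percentage_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percentage_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: both annotators pick the same label independently.
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data for task 1: is the item a discourse connective ("DC") or not?
annotator_1 = ["DC", "DC", "not", "not", "DC", "not", "not", "not"]
annotator_2 = ["DC", "not", "not", "not", "DC", "not", "not", "DC"]

agreement = percentage_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa is lower than raw agreement because part of the raw agreement would be expected by chance, which is why both figures are reported.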
13. Discourse Connective Recognition
Goal: to distinguish between discourse and
non-discourse usage of a connective.
Example: once, while.
A. Alsaif and K. Markert (2011) introduced
a connective identifier for Arabic based on
syntactic features.
14. Discourse Connective Recognition
by A. Alsaif and K. Markert (2011)
Features:
Surface Features (SConn)
Lexical features of surrounding words
(Lex)
Example:
[Children might be tired]Arg1 [and]DC [feel sleepy]Arg2 during school time if they did
not sleep well.
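To make the feature families more concrete, the sketch below extracts a few features for one candidate connective in a tokenized sentence. The slides do not define the exact SConn and Lex features, so everything here is an assumption made for illustration: the function name `connective_features`, the position feature standing in for a surface feature, and the one-word context window standing in for the lexical features.

```python
def connective_features(tokens, i, window=1):
    """Return a feature dict for the candidate connective tokens[i].

    Hypothetical feature set: the connective string itself, a coarse
    sentence-position feature, and the neighbouring words on each side.
    """
    if i == 0:
        position = "first"
    elif i == len(tokens) - 1:
        position = "last"
    else:
        position = "middle"

    feats = {"conn": tokens[i], "sconn_position": position}
    for k in range(1, window + 1):
        # Pad with boundary symbols when the window runs off the sentence.
        feats[f"lex_prev_{k}"] = tokens[i - k] if i - k >= 0 else "<s>"
        feats[f"lex_next_{k}"] = tokens[i + k] if i + k < len(tokens) else "</s>"
    return feats


# English gloss of the slide's example; "and" is the candidate connective.
tokens = "Children might be tired and feel sleepy".split()
features = connective_features(tokens, tokens.index("and"))
```

A classifier would then take such feature dicts for every candidate and predict discourse versus non-discourse usage.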
15. Features:
Discourse Connective Recognition
by A. Alsaif and K. Markert (2011), continued
Part of Speech features (POS)
Syntactic category of related phrases
(Syn), e.g. "the school is
very large and beautiful"
Al-Masdar feature
16. Results
Discourse Connective Recognition
by A. Alsaif and K. Markert (2011), continued
Model     Features                        Accuracy (%)  Kappa
Baseline  not Conn                        68.9          0
M1        Conn only                       75.7          0.48
Tokenization by white space + auto tagger
M2        Conn+SConn+Lex                  85.6          0.62
M3        Conn+SConn+Lex+POS              87.6          0.69
M4        Conn+SConn+Lex+POS+Masdar       88.5          0.70
ATB-based features
M5        Conn+SConn+Lex                  86.2          0.65
M6        Conn+SConn+Lex+Syn/POS          91.2          0.79
M7        Conn+SConn+Lex+Syn/POS+Masdar   92.4          0.82
M8        Conn+SConn+Syn                  91.2          0.79
M9        SConn+Lex+Syn+Masdar            91.2          0.79
17. Discourse Relation Recognition
Goal: to identify the type of relation
that a connective expresses.
A. Alsaif and K. Markert (2011) introduced
the first algorithms to automatically
identify discourse relations for Arabic.
18. Features:
Discourse Relation Recognition
by A. Alsaif and K. Markert (2011)
Connective features
Words and POS of arguments
Masdar
Tense and Negation
Length, Distance and Order Features
Argument Parent
Production Rules
19. Results
All connectives (6039)
Model     Features                                      Accuracy (%)  Kappa
Baseline  CONJUNCTION                                   52.5          0
M1        Conn only (1)                                 77.2          0.60
M2        Conn+Conn f+Arg f (37)                        78.7          0.66
M3        Conn+Conn f+Arg f+Production rules (1237)     78.3          0.65
Excluding wa at BOP (3813)
Model     Features                                      Accuracy (%)  Kappa
Baseline  CONJUNCTION                                   35            0
M1        Conn only (1)                                 74.3          0.65
M2        Conn+Conn f+Arg f (37)                        77.0          0.69
M3        Conn+Conn f+Arg f+Production rules (1237)     76.7          0.69
20. Results
All connectives (6039)
Model     Features                  Accuracy (%)  Kappa
Baseline  EXPANSION                 62.4          0
M1        Conn only (1)             88.7          0.78
M2        Conn+Conn f+Arg f (37)    88.7          0.78
Excluding wa at BOP (3813)
Model     Features                  Accuracy (%)  Kappa
Baseline  EXPANSION                 41.8          0
M1        Conn only (1)             82.7          0.74
M2        Conn+Conn f+Arg f (37)    83.5          0.75
21. Semantic-Based Segmentation of
Arabic Texts
Corpus Analysis
Definition: Let L be a list of candidate
segment connectors. Each element c in L is
classified, based on its effect on text
segmentation, as either active or passive.
22. Segmentation Process
Identifying the connectors that indicate
complete segments.
Locating the active connectors.
Resolving the case where adjacent active
connectors exist.
Setting the segment boundaries.
Creating the final list of segments.
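The five steps above can be sketched roughly as follows. This is a toy illustration under several assumptions that do not come from the slides: the connector lists are hypothetical English stand-ins, sentence-final punctuation is assumed to be what marks a complete segment, an active connector is assumed to open a new segment, and adjacent active connectors are resolved by letting only the first one open a segment.

```python
# Hypothetical English stand-ins; the original work targets Arabic connectors.
ACTIVE = {"because", "but", "then"}   # assumed to open a new segment
PASSIVE = {"and", "or"}               # assumed to leave segmentation unchanged
SEGMENT_END = {".", "!", "?"}         # assumed markers of a complete segment


def segment(tokens):
    """Split a token list into segments using the assumed connector lists."""
    starts = set()  # token indices at which a new segment begins
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word in SEGMENT_END and i + 1 < len(tokens):
            # Step 1: a complete segment ends here, the next token starts one.
            starts.add(i + 1)
        elif word in ACTIVE and i > 0:
            # Steps 2-3: locate active connectors; when two are adjacent,
            # only the first one opens a segment (an assumption).
            if tokens[i - 1].lower() not in ACTIVE:
                starts.add(i)
        # Passive connectors are deliberately ignored.
    # Steps 4-5: turn the boundaries into the final list of segments.
    cuts = [0] + sorted(starts) + [len(tokens)]
    return [tokens[s:e] for s, e in zip(cuts, cuts[1:]) if s < e]


text = "I stayed home because it rained and it was cold . Then I slept"
segments = [" ".join(seg) for seg in segment(text.split())]
```

With these toy lists the sentence splits at "because" and after the full stop, while the passive "and" leaves its segment intact.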
23. Discussion
To evaluate the segmentation process, the
authors collected ten essays.
Each essay ranges between 500 and 700
words.
After running the segmentation process,
they gave the output to judges, who
evaluated it in terms of two factors:
correct hits and incorrect hits.
25. Arabic Discourse Segmentation
Based on Rhetorical Methods
This method depends on the meaning of
the connector "wa" in the Arabic language.
There are six types of "wa", classified into
two classes, "Fasl" and "Wasl":
"Fasl": a segmenting place.
"Wasl": not a segmenting place; it connects
the text.
29. Experiment and Results
They used 1200 instances for training
and 293 instances for testing.
Of the 293 test instances, 290 were
classified correctly and 3 incorrectly.
Results:
Recall: 94.68%
Precision: 96.82%
Accuracy: 98.98%
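The reported accuracy follows directly from the test counts on this slide, as the short check below shows. Precision and recall cannot be recomputed in the same way, because the slide does not give the per-class breakdown of the 3 errors.

```python
# Counts taken from the slide: 293 test instances, 290 correct, 3 incorrect.
correct = 290
incorrect = 3
total = correct + incorrect

# Accuracy is the share of test instances classified correctly.
accuracy = correct / total
accuracy_text = f"{accuracy:.2%}"
print(accuracy_text)  # prints 98.98%
```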
30. A Comprehensive Taxonomy of Arabic
Discourse Coherence Relations
Coherence relations are classified into two
types: explicit relations and implicit
relations.
Coherence relation    Example
Explicit relation     I am very happy because I got excellent marks in exams.
Implicit relation     I am very happy. I got excellent marks in exams.
31. The procedure of creating an Arabic
Taxonomy of Coherence Relations
33. Results
They obtained a set of 47 Arabic coherence
relations.
Coherence relations                               Result
From English coherence relations                  31
Additional Arabic explicit coherence relations    12
Arabic implicit relations                         4
34. Conclusion
Discourse annotation is a very fertile field
with many NLP applications. For
Arabic, challenges remain due to
the lack of annotated corpora and studies.