Discourse annotation
1. A SURVEY OF ARABIC DISCOURSE
ANNOTATION
By:
Abeer Al-Qahtani
Afnan Al-Moadi
Nujoud Al-Ghamdi
2. INTRODUCTION
Discourse annotation and segmentation for Arabic
have become popular areas of research.
The aim of this presentation is to survey and summarize
some techniques used in Arabic discourse annotation and
segmentation and to show their methods and results.
3. CLAUSE-BASED DISCOURSE SEGMENTATION OF
ARABIC TEXTS
Discourse parsing consists of two steps:
1- discourse segmentation, which aims at identifying
Elementary Discourse Units (EDUs).
2- building the discourse structure by linking EDUs using a
set of rhetorical or discursive relations.
Arabic language characteristics:
- It is agglutinative.
- It does not have capital letters.
- Diacritics are often absent from written text.
4. METHODOLOGY
Their analysis was carried out on two different corpus
genres: news articles and elementary school textbooks.
They proposed a three-step segmentation algorithm:
Step1: punctuation marks.
Step2: lexical cues.
Step3: A mix of punctuation marks and lexical cues.
5. METHODOLOGY CONT.
Step1: punctuation marks:
[Dr. Tarak Swiden has treated various diseases.]
Step2: lexical cues:
[They will know when we start][but they don't know when
we finish]
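The first two steps can be sketched as follows. This is a minimal illustration on the English glosses above, not the authors' implementation: the cue list, the abbreviation list, and the function names are assumptions chosen for the example, and the surveyed work applies Arabic cues to Arabic text.

```python
# Hypothetical resources for illustration only (not from the surveyed work).
STRONG_PUNCT = ".!?؟"
ABBREVIATIONS = {"Dr."}
LEXICAL_CUES = {"but"}


def split_on_punctuation(text):
    """Step 1: close a segment after a token ending in strong punctuation."""
    segments, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in STRONG_PUNCT and token not in ABBREVIATIONS:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments


def split_on_cues(segment):
    """Step 2: open a new segment in front of each lexical cue."""
    pieces, current = [], []
    for token in segment.split():
        if token.lower() in LEXICAL_CUES and current:
            pieces.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        pieces.append(" ".join(current))
    return pieces


def segment_text(text):
    """Apply Step 1, then Step 2 inside each Step 1 segment."""
    return [p for s in split_on_punctuation(text) for p in split_on_cues(s)]


if __name__ == "__main__":
    sample = ("Dr. Tarak Swiden has treated various diseases. "
              "They will know when we start but they don't know when we finish")
    for edu in segment_text(sample):
        print(f"[{edu}]")
```

The abbreviation check is there because a plain split on "." would wrongly cut the first example after "Dr.".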
6. METHODOLOGY CONT.
Step3: A mix of punctuation marks and lexical cues:
If a comma is followed by the conjunction "و" (waw) or "ف" (fā)
and then by a preposition of localization,
it indicates the end of a segment.
Example:
[Like Tunisian families, her family left Marsa city,]
[then, they found themselves at the wonderful Marsa’s beach.]
7. METHODOLOGY CONT.
If a comma is followed by the conjunction "و" (waw) or "ف" (fā)
and then by a possessive noun, it indicates the end of a segment.
Example:
[I saw my sister outside,] [with a talking doll]
If a comma is followed by a demonstrative pronoun and then by a word that is
not a verb, there is no segment boundary.
Example:
[Mr. Hamed, our teacher, was standing up, looking at us.]
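The three comma rules of Step 3 can be sketched as a small decision function over a tagged token sequence. The tag names (COMMA, CONJ, PREP_LOC, POSS_NOUN, DEM, VERB) and the function name are assumptions made for this illustration; the sketch also assumes that the conjunction has already been split from the word it attaches to.

```python
def comma_is_boundary(tags, i):
    """Decide whether the comma at index i ends a segment.

    Returns True (boundary), False (no boundary), or None when none of
    the three Step 3 rules applies. The tag set is hypothetical.
    """
    if tags[i] != "COMMA":
        raise ValueError("index i must point at a comma")
    following = tags[i + 1] if i + 1 < len(tags) else None
    after = tags[i + 2] if i + 2 < len(tags) else None

    # Rules 1 and 2: comma + waw/fa + locative preposition or possessive noun.
    if following == "CONJ" and after in ("PREP_LOC", "POSS_NOUN"):
        return True
    # Rule 3: comma + demonstrative pronoun + a word that is not a verb.
    if following == "DEM" and after is not None and after != "VERB":
        return False
    return None


if __name__ == "__main__":
    print(comma_is_boundary(["NOUN", "COMMA", "CONJ", "PREP_LOC", "NOUN"], 1))
    print(comma_is_boundary(["NOUN", "COMMA", "DEM", "NOUN"], 1))
```

Returning None for uncovered cases keeps the sketch from claiming a decision the rules above do not make.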
9. SEMANTIC-BASED SEGMENTATION FOR ARABIC
TEXT
In this approach the aim is to divide the text into
complete, meaningful parts that can stand
independently of the text before or after them.
Connectors Classification:
Active: words that indicate the beginning of a new
segment, the end of a segment, or a complete
segment.
Passive: words that do not indicate a new segment, the end
of a segment, or a complete segment by
themselves, but when they occur with active
elements, they contribute to determining the position of the
start or the end of the segments.
10. METHODOLOGY
Step1: Identifying the connectors that indicate complete segments
(with S instances in the SegBoundary property).
Step2: Locating the active connectors.
Step3: Resolving the case where adjacent active connectors exist.
Step4: Setting the segment boundaries.
Step5: Creating the final list of segments.
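A rough sketch of the boundary-setting part of this pipeline is given below. It is an illustration under stated assumptions, not the surveyed system: the connector lexicon is a made-up English stand-in, adjacent active connectors are resolved by keeping only the first one (the slides do not say how the conflict is resolved), and passive connectors and the SegBoundary property are left out.

```python
# Hypothetical lexicon of active connectors (English stand-ins).
ACTIVE_CONNECTORS = {"and", "then", "however"}


def active_positions(tokens):
    """Locate the active connectors."""
    return [i for i, t in enumerate(tokens) if t.lower() in ACTIVE_CONNECTORS]


def resolve_adjacent(positions):
    """Assumed resolution: in a run of adjacent connectors keep the first."""
    resolved, previous = [], None
    for pos in positions:
        if previous is None or pos != previous + 1:
            resolved.append(pos)
        previous = pos
    return resolved


def segment_by_connectors(tokens):
    """Set boundaries in front of each kept connector and list the segments."""
    kept = resolve_adjacent(active_positions(tokens))
    starts = [0] + [p for p in kept if p != 0]
    ends = starts[1:] + [len(tokens)]
    return [" ".join(tokens[s:e]) for s, e in zip(starts, ends)]


if __name__ == "__main__":
    words = "we arrived late and then we left".split()
    print(segment_by_connectors(words))
```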
12. ARABIC DISCOURSE SEGMENTATION BASED ON
RHETORICAL METHOD
This technique is derived from Arabic rhetoric.
It focuses on the connector Waw “و”.
It categorizes the six known rhetorical types of “و” into two classes:
“Fasl” and “Wasl”.
They use a Support Vector Machine (SVM) classifier.
“Fasl”: Waw types 1, 2 and 3
“Wasl”: Waw types 4, 5 and 6
13. EXAMPLES
Waw1:
[Professors teach students sciences and virtue, I swear to God, they have done a
great mission for their nation]
Waw2:
[Young people are not the only ones who suffer, but their crises are part of the crises
of the whole society and someone may ask: Why focus only on youth
and not on the divisions of the whole society?]
Waw3:
[Adolescents suffer from some psychological problems and there are, in general,
numerous other problems in the society.]
Waw4:
[The teacher came smiling into the classroom.]
Waw5:
[The couple sat together with the light of the moon.]
Waw6:
[The study started and students and teachers enrolled in schools.]
14. METHODOLOGY
Preprocessing
Diacritization
Discriminate the connector “و” from the letter “و”
Feature Extraction
They extract 22 features to distinguish each type of “و”.
Classification
15. FEATURE EXTRACTION
Waw1:
X1= “ ” and X7= genitive mark.
X3=noun, X7= genitive mark and X16=no.
Waw2: “ ”
X1= “ ” and X7= accusative mark.
X3=noun, X5= indefinite, X6≠genitive
mark and X7 = genitive mark.
Waw3: “ ”
X12≠X13.
X14 ≠ X15.
X19 ≠X20.
X21=no and X22=no.
Waw4: “ ”
X16=yes.
X1= “ ”, X10= verb and X11=past tense.
Waw5: “ ”
X3= noun and X7 = accusative mark.
Waw6: “ ”
X2=X3, X6=X7, and (X4=X5 OR X8=X9
OR X17= X18).
X12=X13, X14=X15, X19=X20 and
(X21= yes OR X22= yes)
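The relationship between the Waw3 conditions and the second Waw6 rule can be checked mechanically. The sketch below assumes that the four Waw3 lines are alternatives (any one suffices) and that X21 and X22 take only the values yes and no; neither point is stated on the slide, and the feature values "a" and "b" are placeholders. Under those assumptions the Waw3 conditions are the exact logical negation of the second Waw6 rule (De Morgan's law).

```python
from itertools import product


def waw6_second_rule(x):
    """X12=X13, X14=X15, X19=X20 and (X21=yes OR X22=yes)."""
    return (x["X12"] == x["X13"] and x["X14"] == x["X15"]
            and x["X19"] == x["X20"]
            and (x["X21"] == "yes" or x["X22"] == "yes"))


def waw3_rule(x):
    """Assumed reading: any one of the four listed Waw3 conditions holds."""
    return (x["X12"] != x["X13"] or x["X14"] != x["X15"]
            or x["X19"] != x["X20"]
            or (x["X21"] == "no" and x["X22"] == "no"))


def all_feature_vectors():
    """Enumerate placeholder values for the eight features involved."""
    names = ["X12", "X13", "X14", "X15", "X19", "X20", "X21", "X22"]
    domains = [("a", "b")] * 6 + [("yes", "no")] * 2
    for values in product(*domains):
        yield dict(zip(names, values))


if __name__ == "__main__":
    complementary = all(waw3_rule(x) != waw6_second_rule(x)
                        for x in all_feature_vectors())
    print("Waw3 is the negation of the second Waw6 rule:", complementary)
```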
16. THE RESULT
The Corpus of Arabic Discourse Segmentation was used in this
experiment.
They used 1200 instances for training and 293 for testing.
Class Waw5 did not appear in the training or testing data.
Classes Waw3 and Waw6 appeared most frequently.
Segmentation accuracy = 98.98%
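As a quick arithmetic check, and assuming the accuracy was computed per instance over the 293 test instances (the slide does not say so), the reported figure is consistent with 290 correct decisions. The count of 290 is inferred here, not reported in the source.

```python
test_instances = 293
correct = 290  # assumption: inferred from the reported percentage

accuracy = 100 * correct / test_instances
print(f"{accuracy:.2f}%")  # 98.98%
```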
17. THE LEEDS ARABIC DISCOURSE TREEBANK: ANNOTATING
DISCOURSE CONNECTIVES FOR ARABIC
The first effort toward producing an Arabic Discourse Treebank.
Discourse connectives are defined as lexical expressions that relate two text
segments.
The segments are called arguments.
Discourse relations play an important role in producing a coherent
discourse.
Collecting Arabic Connectives:
They used text analysis and corpus-based techniques.
Connectives were manually extracted from 50 randomly selected texts from the Penn Arabic Treebank (PATB) and from
10 different websites.
The resulting list was manually tested by two native speakers.
The final list contains 107 discourse connectives.
20. METHODOLOGY
Annotation was done by two independent native speakers of Arabic.
Agreement is measured on two tasks:
Task1:
measures whether annotators agree on the binary decision of
whether an item constitutes a discourse connective in context.
Task2:
measures whether annotators agree on which discourse
relation an identified connective expresses.
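Agreement between two annotators on a task like these is commonly summarized with Cohen's kappa, which corrects raw agreement for chance. The slides do not name the measure used, so the sketch below is an assumption about how such agreement could be computed; the labels are invented toy data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy Task 1 decisions: 1 = discourse connective, 0 = not.
    annotator_1 = [1, 1, 0, 0]
    annotator_2 = [1, 1, 0, 1]
    print(cohen_kappa(annotator_1, annotator_2))
```

The same function applies to Task 2 if the binary labels are replaced by relation names.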
21. THE RESULT
Agreement on Task1 (connective identification) is highly reliable.
Agreement on Task2 (relation assignment) is
relatively low.
22. MODELLING DISCOURSE RELATIONS FOR
ARABIC.
Discourse Connective Recognition.
Discourse connective recognition distinguishes between
the discourse usage and non-discourse usage of
potential connectives.
Conjunctions such as /w/ (and) and /¯aw/ (or) can have
a discourse usage or can just conjoin two non-abstract entities,
as in /,mr w s¯arh/ (Omar and Sarah).
23. CONT.
Features:
1. Surface Features (SConn).
2. Part-of-speech features (POS).
3. Lexical features of surrounding words (Lex).
4. Syntactic category of related phrases (Syn).
5. Al-Masdar feature.
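The first three feature families can be illustrated with a small extractor over a tokenized, POS-tagged sentence. The feature names, the tag set, and the function name are hypothetical choices for this sketch; the syntactic and Al-Masdar features are left out because they would need a parser and morphological analysis.

```python
def connective_features(tokens, pos_tags, i):
    """Collect simple features for the candidate connective at index i."""
    has_prev = i - 1 >= 0
    has_next = i + 1 < len(tokens)
    return {
        # Surface features (SConn): the form and where it occurs.
        "conn": tokens[i],
        "conn_is_first": i == 0,
        # Part-of-speech features (POS).
        "conn_pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if has_prev else None,
        "next_pos": pos_tags[i + 1] if has_next else None,
        # Lexical features of surrounding words (Lex).
        "prev_word": tokens[i - 1] if has_prev else None,
        "next_word": tokens[i + 1] if has_next else None,
    }


if __name__ == "__main__":
    # Transliterated toy input based on the "Omar and Sarah" example.
    features = connective_features(["Omar", "w", "Sarah"],
                                   ["NOUN", "CONJ", "NOUN"], 1)
    print(features)
```

A classifier could learn from such features that /w/ between two nouns is likely a non-discourse usage.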
25. Discourse Relation Recognition:
1. Connective features.
2. Words and POS of arguments. E.g. when the
first word of Arg2 is /qd/ (might/may) or /k¯an/ (had), the
relation is likely to be EXPANSION.BACKGROUND or
EXPANSION.CONJUNCTION.
3. Tense and Negation.
4. Masdar.
5. Argument Parent.
6. Production Rules.
26. Performance of different models for identifying fine-grained
discourse relations on two datasets
Performance of different models for identifying
class-level discourse relations on two datasets
27. CONCLUSION
In this survey we presented some techniques for annotating
connectives and for segmenting discourse in
Arabic, which rely on different
corpora and methods. Accordingly, they report
different results.