Syntactic Aggregation in Bengali Text Generation


                  Sumit Das, Anupam Basu, Sudeshna Sarkar
                Department of Computer Science and Engineering,
             Indian Institute of Technology, Kharagpur, India – 721302
   sumit.jucse@gmail.com,{anupam,sudeshna}@cse.iitkgp.ernet.in




Proceedings of ICON-2009: 7th International Conference on Natural Language Processing
Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2009

Abstract
The quality of the sentences generated by a natural language generation system can be evaluated based on their well-formedness (fluency, conciseness and coherence) and faithfulness to the communication intent. In this paper, we explore the prevalent syntactic aggregation constructs in Bengali and present an approach towards generating Bengali compound sentences using the identified constructs. The inputs to our syntactic aggregation method are the constituent simple sentences, the rhetorical relations defined over them and the discourse markers realizing the relations. The paper describes a rule-based approach to forming the compound sentences, by reorganization of components followed by elimination of redundant lexical entities, and presents a user-based evaluation of the results obtained.

1 Introduction
Any Natural Language Generation (NLG) system should have the capability to remove unnecessary repetitions when generating text. Unnecessary repetitions make the text less fluent and less coherent. In NLG, the task of combining constituent simpler text spans by removing repetitions is called aggregation. According to the standard three-stage pipeline NLG architecture proposed by Reiter and Dale (2000), aggregation is a basic task of any NLG system for generating fluent, concise, and coherent text. Dalianis (1993) viewed aggregation mainly as a redundancy elimination problem that should be solved in such a way that the original meaning of the text is preserved and no undesirable implication is produced. For example, the two text spans in (1a), linked by a CONJUNCTION rhetorical relation (Mann and Thompson, 1988), can be combined as in (1b). But (1b) contains unnecessary repetitions, shown by the words in bold. So, these can be aggregated to produce (1c), which is more fluent, concise, and coherent than (1b).

  1. a. * Jack went up the hill.
        * Jill went up the hill.
     b. Jack went up the hill and Jill went up the hill.
     c. Jack and Jill went up the hill.

Syntactic aggregation is the most common form of aggregation observed in any real discourse. Shaw (2002) proposed that in syntactic aggregation simpler linguistic components are combined in accordance with linguistic rules. As it is a language-dependent process, linguistic knowledge, such as preferred word ordering and special verb form usage, is required for combining text spans. For example, in Bengali the two simple text spans in (2a), linked by a SEQUENCE rhetorical relation, can be simply combined using the appropriate discourse marker eba.n as in (2b). But in (2b), the word in bold is redundant. So, applying the conjunction reduction construct, the two text spans can be aggregated to generate (2c). But (2c) can be further aggregated to (2d) by using the non-finite verb giYe.

  2. a. * rAma mAThe giYechhila¹ (Ram went to the playground).
        * rAma phuTabala khelechhila (Ram played football).
     b. rAma mAThe giYechhila eba.n rAma phuTabala khelechhila (Ram went to the playground and Ram played football).
     c. rAma mAThe giYechhila eba.n phuTabala khelechhila (Ram went to the playground and played football).
     d. rAma mAThe giYe phuTabala khelechhila. (Ram went to the playground and played football).

¹ In this paper, Bengali graphemes are written using Roman script in ITRANS notation. They are written in italic font.

Clearly, to syntactically aggregate smaller text spans in Bengali, an NLG system should have knowledge of Bengali grammar.

In this work, we have studied a corpus of Bengali sentences to identify the prevalent syntactic aggregation constructs in Bengali. Then, we have proposed a method to syntactically aggregate two simple clauses using the constructs identified to generate a more fluent, concise and coherent compound sentence. The inputs are two simple clauses, the rhetorical relation between them and the discourse marker realizing that relation.

The rest of this paper is organized as follows: In Section 2, we briefly mention the related work in syntactic aggregation. In Section 3, we present a corpus analysis to identify the prevalent syntactic aggregation constructs in Bengali. The rhetorical relations considered in this work are mentioned in Section 4 and the semantic representation used is described in Section 5. We describe our approach in Section 6 and the evaluation methods in Section 7. In Section 8, concluding remarks and some future scope relevant to this work are provided.

2 Related Work
There is no general consensus regarding the exact definition of aggregation, the types of aggregation or the component of an NLG system where aggregation tasks should be performed. The general approach is to handle the aggregation tasks in a domain- and application-specific way.

Dalianis (1993; 1996) equated aggregation with the process of redundancy elimination. He divided it into four principal categories: syntactic, elision, lexical, and referential aggregation. In syntactic aggregation, repetitions are removed syntactically, leaving at least one item in the text to express the meaning explicitly.

Wilkinson (1995) contradicted Dalianis's view of equating text aggregation with redundancy elimination because in certain contexts it can be done by using suitable referring expression generation. Apart from redundancy elimination, aggregation choices can affect other characteristics of text, such as sentence complexity, focus, emphasis, theme/rheme, prosody etc.

Reape and Mellish (1999) defined aggregation as a process to generate more concise, cohesive, and fluent text by omitting or substituting repeating entities where the reader can infer the deleted entities from the remaining text. Reape and Mellish distinguished among different types of aggregation: conceptual, discourse, semantic, syntactic, lexical, and referential. According to them, syntactic aggregation is the most common and can be stated by some grouping rules, such as subject grouping, predicate grouping etc.

Horacek (1992) has given a more theoretical view of aggregation. He explained it by some grouping phenomena, such as content-based grouping and structurally motivated propositional grouping.

Shaw (2002) categorized aggregation into four types: interpretive, referential, syntactic, and lexical. He focused mainly on syntactic aggregation, which he divided into two types: hypotactic and paratactic. In paratactic aggregation all the constituent text spans are of equal status. On the other hand, in hypotactic aggregation the constituent text spans are related by some subordinate relation.

In the Virtual Storyteller project (Marit Theune and Hendriks, 2006), different conjunctive and elliptical constructs were used to syntactically aggregate simpler text spans to generate more coherent and concise fairy tales.

All the works in the area of text aggregation encountered so far are focused on English and other European languages. In this work, we have proposed methods to perform syntactic aggregation in Bengali text generation.

3 Corpus Analysis
We conducted a corpus analysis to identify the prevalent syntactic aggregation constructs used in Bengali for generating compound sentences. For this we have chosen text of narrative style because narrative texts are mainly activity or event driven. So, it is easier to model the different types of aggregation construct in narrative text. We have a corpus of 600 compound sentences collected from Bengali story books. We have randomly chosen 350 sentences from that corpus for analysis. First the selected compound sentences
were segmented into simple clauses. A simple clause is equivalent to a simple sentence which contains only one finite verb and no coordinating conjunction. For example, the compound sentence rAma eba.n shyAma kAla skule giYechhila (Ram and Shyam went to school yesterday) contains 2 simple clauses: rAma kAla skule giYechhila (Ram went to school yesterday) and shyAma kAla skule giYechhila (Shyam went to school yesterday). By decomposing the 350 compound sentences, we got 868 simple clauses (2.48 simple clauses per sentence). This measure is important to determine the maximum number of simple clauses that can be aggregated in a single sentence. We cannot keep on aggregating an arbitrarily large number of simple clauses even if they are syntactically similar, since it may result in overly complex and less fluent text. From the corpus analysis, we have identified two types of frequently used syntactic aggregation constructs in Bengali, namely the paratactic construct and the elliptic construct.

   • Simple paratactic construction: In this case, the two constituent simple clauses are simply connected by the conjunctive discourse marker and no word deletion is required.
       – rAma ekatA boi paRachhila eba.n shyAma phuTabala khelachhila (Ram was reading a book and Shyam was playing football).
   • Elliptic construction: Ellipsis is defined as the omission of superfluous words from the surface form which are inferable from the entities in the remaining text. The different elliptic constructs observed in Bengali are:
       – Conjunction reduction: In conjunction reduction, the subject of the second simple clause is deleted.
           * rAma khAbAra kheYechhe eba.n bandhudera sAthe sinemA dekhate gechhe (Ram has eaten food and gone to see a movie with friends).
         In the example given above, the subject of the second simple clause, i.e., rAma, is deleted using the conjunction reduction construct.
       – Right node raising (RNR): In RNR, the rightmost portion of the first simple clause is deleted.
           * rAma bhAta eba.n shyAma ruTi khAbe (Ram will eat rice and Shyam will eat roti).
         Here the rightmost portion of the first proposition (khAbe) is deleted.
       – Coordinating one constituent: In this case, one constituent entity from each of the input simple clauses is coordinated by a conjunction. This can happen to any entity of the constituent simple clauses.
           * rAma eba.n shyAma phuTabala khelachhila (Ram and Shyam were playing football).
         The subjects of the two constituent simple clauses in the above example are coordinated.
       – Non-finite verb generation: If both the input simple clauses are about some events or actions performed sequentially or concurrently by the same subject then they are aggregated using the non-finite form of the verb of the first simple clause.
           * rAma bhAta kheYe skule yAbe (Ram will eat rice and go to school).
         In the above example, the two constituent simple clauses are about two actions performed sequentially by the same subject. So, the perfect participle form of the verb khAoYA, i.e., kheYe, is used for aggregation.
     Any combination of the above four types of elliptic constructs is also allowed. For example, in (3) both conjunction reduction and RNR are used and (4) is generated by using both conjunction reduction and non-finite verb.
       3. rAma bhAta eba.n mAchha khAbe (Ram will eat rice and fish).
       4. rAma skule giYe phuTabala khelabe (Ram will go to school and play football).

In summary, though for the corpus study we have considered only narrative Bengali text, it is part of a more general approach. As syntactic aggregation is a language-dependent but domain-independent task (Shaw, 2002), the contributions of this work can be extended to generate aggregated text in Bengali in other domains as well.
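
As an illustration of two of the elliptic constructs described in this section, conjunction reduction and right node raising can be sketched on role-tagged clauses. This is only a minimal sketch under stated assumptions, not the system described in this paper: a clause is assumed to be a list of (role, word) pairs in surface order, the role label "verb" and the function names are hypothetical, and the role names ke, ki and kothAYa are borrowed from Section 6.1.

```python
from typing import List, Tuple

# Assumption: a clause is a list of (role, word) pairs in surface order.
Clause = List[Tuple[str, str]]


def conjunction_reduction(first: Clause, second: Clause) -> Tuple[Clause, Clause]:
    """Delete the subject of the second clause when it repeats the first clause's subject."""
    if first and second and first[0][0] == "ke" and first[0] == second[0]:
        return first, second[1:]
    return first, second


def right_node_raising(first: Clause, second: Clause) -> Tuple[Clause, Clause]:
    """Delete the rightmost portion of the first clause that is repeated in the second."""
    k = 0
    while k < min(len(first), len(second)) and first[-1 - k] == second[-1 - k]:
        k += 1
    return first[:len(first) - k], second


def realize(first: Clause, second: Clause, marker: str = "eba.n") -> str:
    """Join the two (possibly reduced) clauses with the discourse marker."""
    words = [w for _, w in first] + [marker] + [w for _, w in second]
    return " ".join(words)


# Example (2): the repeated subject rAma is dropped from the second clause.
went = [("ke", "rAma"), ("kothAYa", "mAThe"), ("verb", "giYechhila")]
played = [("ke", "rAma"), ("ki", "phuTabala"), ("verb", "khelechhila")]
print(realize(*conjunction_reduction(went, played)))

# RNR example: the repeated verb khAbe is dropped from the first clause.
ram_eats = [("ke", "rAma"), ("ki", "bhAta"), ("verb", "khAbe")]
shyam_eats = [("ke", "shyAma"), ("ki", "ruTi"), ("verb", "khAbe")]
print(realize(*right_node_raising(ram_eats, shyam_eats)))
```

In this sketch both functions compare role and word together, so an entity counts as repeated only when it plays the same role in both clauses.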
4 Rhetorical Relations Considered
From the corpus study, we know that paratactic aggregations are the most common form of syntactic aggregation in Bengali. In paratactic aggregation, the constituent text spans are of equal status and are linked by a multi-nuclear rhetorical relation (Mann and Thompson, 1988). In this work, we have focused on the different paratactic constructs for syntactic aggregation of Bengali text. The multi-nuclear rhetorical relations considered in this paper are CONJUNCTION, DISJUNCTION, CONTRAST, and SEQUENCE as defined by the original Rhetorical Structure Theory (RST). In addition to the said relations, we have considered another multi-nuclear temporal coherence relation, PARALLEL, as defined below:

     Two text spans are said to be related by
     PARALLEL relation if the actions or the
     events in those two text spans are occur-
     ring simultaneously.

For example, the two constituent clauses present in (5) are rAma khAbAra khAchchhila (Ram was eating food) and rAma Tibhi dekhachhila (Ram was watching TV). The actions in these two clauses are concurrent. So, the coherence relation between them is PARALLEL.

  5. rAma khAbAra khete khete Tibhi dekhachhila (Ram was watching TV while eating food).

5 The Semantic Representation
The semantic representation chosen here is a case-frame representation, also called predicate-argument representation. The basic building block in this representation is the sentence. An example of the sentence frame is given in Figure 1. A sentence contains a clause frame and a clause-count which denotes the number of simple clauses present in the sentence. The clause is a recursive structure that can contain clauses inside itself, which makes it capable of representing both simple and composite (compound and complex) sentences. For a simple sentence, the outer clause contains only one inner clause. On the other hand, for a composite sentence the outer clause contains the constituent inner clauses along with the rhetorical relation (rh-rel) connecting them and the discourse marker (dm) realizing that rhetorical relation. A clause frame contains a predicate frame (pre) and a list of argument frames (arg). The pre frame contains verb-related information, such as verb root (v-root), theme, tense, aspect, mood, polarity etc. The arg frame contains the nominal entities along with the thematic role of each entity in that clause. If there is any modifier for the verb or any nominal entity in a clause then the respective modifier frames (v-mod and w-mod frame) are present inside the corresponding pre and arg frames.

Figure 1: Case-frame representation for the sentence “rAma pa.Dachhila eba.n shyAma khelachhila.” (Ram was reading and Shyam was playing).

6 Proposed Approach
In our approach for syntactic aggregation, the inputs are two simple clauses, the rhetorical relation between them, and the discourse marker realizing that relation. To syntactically aggregate the two simple clauses by using the different paratactic constructs identified in Section 3, we propose
the following steps:

   • Step 1: Ordering arguments in the constituent clauses.
   • Step 2: Repeating entity identification.
   • Step 3: Ordering constituent clauses.
   • Step 4: Superfluous words deletion and non-finite verb generation.
   • Step 5: Correct surface form generation.

The above steps are described below.

6.1 Argument Ordering in the Constituent Clauses
Preferred word ordering in a sentence varies with languages and it is very important for syntactic aggregation. Though Bengali is a free-word-order language, the preferred word ordering in a Bengali sentence is subject-object-verb.

In this work, the input simple clauses are taken in their corresponding semantic case-frame representation as shown in Figure 1. The arg frames in the clause are then ordered by using a total order among the roles associated with the arg frames. These roles are neither semantic roles nor Paninian roles. The problem with both semantic and Paninian roles is that neither can be associated with a unique postposition, which is very important for generating sentences in Bengali. So the alternative approach is to design an intermediate representation with sufficient granularity of roles, such that ambiguous assignments of postpositions are not possible. Bengali has a list of postpositions that are used in different contexts to convey different semantics. In this work, roles have been designed at a granularity level where one role is assigned to a semantically unique postposition. For developing the total order of the roles, we have followed an approach taken in the SANYOG system (Bhattacharya, 2004). We have taken the constituent simple clauses of the compound sentences used for corpus analysis. Each simple clause was represented in its case-frame representation and the arg frames inside it were then ordered as they appear in the surface form of the clause. In this way, the ordering among the roles of the arg frames in a clause is known. For example, the role set for (6) is {ke, kothAYa, kakhana}. From (6) we can infer that the preferred order among these roles is ke < kakhana < kothAYa. The role on the left side of < will appear before the role on the right side in the surface form.

  6. Ami AgAmIkAla skule yAba (I shall go to school tomorrow).

Again, in (7) the role set is {ke, kothAYa, kakhana, kAra sAthe}. By using (7) the total order obtained from (6) can be extended to ke < kakhana < kAra sAthe < kothAYa.

  7. Ami AgAmIkAla bAbAra sAthe skule yAba (Tomorrow I shall go to school with my father).

Using the above method for the entire set of simple clauses, we have identified the set of possible roles in Bengali and developed a total order among them. The arg frames in the input simple clauses are ordered using the developed total order.

6.2 Repeating Entity Identification
In our current approach, to remove the redundant entities, we first identify the repeating entities present in both the simple clauses taken as input. We assume that nominal entities are equivalent if they have the same thematic role and root word in the constituent simple clauses. For example, in the simplified semantic representation of the compound sentence shown in Figure 2, the constituent simple clauses have one repeating nominal entity. In both the simple sentences, the thematic role of that entity is ki and its surface form is bhAta. Two verbs are equivalent if they have the same root word and the same other functional parameters, such as tense, aspect, mood, polarity etc. In Figure 2, the verbs are equivalent and thus repeating. Two noun modifiers are equivalent if they have the same root word and are modifying two nominal entities with the same thematic role. Lastly, two verb modifiers are equivalent if they have the same root word. The repeating entities are tagged with the status REPEATING.

Figure 2: Simplified case-frame representation for the sentence “rAma eba.n shyAma bhAta khAbe.” (Ram and Shyam will eat rice). Note: ∼() denotes a frame.

6.3 Ordering Constituent Clauses
All the rhetorical relations considered in this work, mentioned in Section 4, are multi-nuclear relations. So, two simple clauses connected by any of these relations, except the SEQUENCE relation, can be realized in any order. In case of the SEQUENCE relation, an ordering constraint is imposed by the sequence of the input clauses. So, for the SEQUENCE relation the clauses cannot be reordered. For other relations, after identifying the repeating entities, the constituent simple clauses in the resulting compound sentence are reordered on the basis of their chronological order and polarity following the rules mentioned below:

   • Tense: If the two constituent clauses have different tenses then they are ordered chronologically. This improves the fluency of the generated compound sentence. For example, if the two clauses in (8a), linked by a CONJUNCTION relation, are aggregated without chronological ordering then (8b) is generated. But if they are ordered according to their tense and aggregated then (8c) is generated, which is more fluent and coherent than (8b).
       8. a. · Ami bA.Di yAba. (I shall go home).
             · rAma skule gechhe. (Ram has gone to school).
          b. Ami bA.Di yAba eba.n rAma skule gechhe. (I shall go home and Ram has gone to school).
          c. rAma skule gechhe eba.n Ami bA.Di yAba. (Ram has gone to school and I shall go home).
     The chronological ordering is done when the rhetorical relation between the two constituent clauses is CONJUNCTION, DISJUNCTION or CONTRAST. As the constituent simple clauses are concurrent for the PARALLEL relation, this ordering is not required.
   • Polarity: If two simple clauses have the same tense but different polarity for the verb then the clause with negative polarity will come first in the surface form. For example, if the simple clauses in (9a), linked by a CONJUNCTION relation, are aggregated as in (9b) then the negative polarity marker nA affects both the verbs kinabe and khAbe. So, the communicative goal is not preserved. However, if the clauses are reordered and then aggregated, (9c) results, which is grammatically correct, fluent and preserves the meaning.
       9. a. rAma chakaleTa kinabe. rAma chakaleTa khAbe nA. (Ram will buy chocolate. Ram will not eat chocolate).
          b. rAma chakaleTa kinabe eba.n khAbe nA (Ram will buy chocolate and will not eat).
          c. rAma chakaleTa khAbe nA eba.n kinabe (Ram will not eat chocolate and will buy).
     The ordering based on polarity is done when the clauses are linked by either a CONJUNCTION or a DISJUNCTION relation.

6.4 Superfluous Words Identification and Non-finite Verb Generation
After identifying the repeating entities and ordering the constituent clauses, the superfluous words are identified using the following two methods:

   • Forward deletion: If the entities at the beginning of the surface forms of both clauses
     are REPEATING then they are marked as DELETED in the second clause. Surface forms of both the clauses are traversed from left to right and REPEATING entities are marked as DELETED in the second clause until a NON-REPEATING entity is encountered. For example, the two constituent clauses in (10), linked by a CONJUNCTION relation, have REPEATING entities with the roles ke and kakhana and they occur at the beginning of both the clauses. So, the REPEATING entities are marked DELETED in the second clause, indicated by the words in bold face.
     10. rAma gatakAla khAbAra kheYechhila eba.n rAma gatakAla skule giYechhila (Ram ate food yesterday and Ram went to school yesterday).
   • Backward deletion: If the verb and the entities at the end of the surface forms of both clauses are REPEATING then they are marked as DELETED in the first clause. Surface forms of both the clauses are traversed from right to left and the REPEATING verb and entities are marked as DELETED in the first clause until a NON-REPEATING entity is encountered. For example, the two constituent clauses in (11), linked by a CONJUNCTION relation, have a REPEATING verb and a REPEATING entity with the role ki and they occur at the end of both the clauses. So, the REPEATING elements are marked DELETED in the first clause, indicated by the words in bold face.
     11. rAma bhAta khAbe eba.n shyAma bhAta khAbe (Ram will eat rice and Shyam will eat rice).

If the two simple clauses, linked by a CONJUNCTION, DISJUNCTION or CONTRAST relation, have the same role set then the REPEATING entities are forward deleted and backward deleted. For example, in (12) the two simple clauses, connected by a CONJUNCTION relation, have the same set of associated roles. So, bold faced words in the second clause are deleted forward and those in the first clause are deleted backward. However, if the role sets are different then only forward deletion is done. As the two clauses in (13), connected by a CONTRAST relation, have different role sets, only the bold faced words in the second clause are forward deleted.

 12. rAma Aja bhAta khAbe eba.n rAma kAle bhAta khAbe (Ram will eat rice today and Ram will eat rice tomorrow).
 13. rAma Aja bhAta khAbe kintu rAma kAle bAbAra sAthe ruti khAbe (Ram will eat rice today but Ram will eat roti with father tomorrow).

In case of a SEQUENCE or PARALLEL relation, only forward deletion is done. In addition to that, the verb of the first clause is modified to its non-finite form if the subjects of both the clauses are the same. For the SEQUENCE relation, the non-finite form is the perfect participle of the verb and for the PARALLEL relation, it is the progressive participle. For example, in (14a) the two clauses are linked by a SEQUENCE relation. So, first the bold faced words in the second clause are forward deleted and then the perfect participle form of the verb of the first clause is generated. This results in the compound sentence (14b). Similarly, the two clauses in (15a), linked by a PARALLEL relation, are also aggregated to (15b) by using the progressive participle of the root verb paRA.

 14. a. rAma bA.Di yAbe eba.n rAma bhAta khAbe (Ram will go home and Ram will eat rice).
     b. rAma bA.Di giYe bhAta khAbe (Ram will go home and eat rice).
 15. a. rAma bai pa.Dachhila eba.n rAma khAbAra khAchchhila (Ram was reading a book. Ram was eating food).
     b. rAma bai pa.Date pa.Date khAbAra khAchchhila (Ram was eating food while he was reading a book).

6.5 Correct Surface Form Generation
The redundant words are identified in the previous step but the actual deletion is done in this step. While generating the resulting compound sentence, the entities marked as DELETED are not realized, i.e., they are deleted from the surface form.

In case of subject coordinating and RNR constructs, if the subjects of the two input clauses are different then the correct surface form of the common verb should be generated. For example, in (16) the surface form used for the common verb khelA
is khelaba, which is generated by the subject of the first clause, i.e.
Ami.

16. Ami eba.n rAma kAla phuTabala khelaba (I and Ram will play football
    tomorrow).

Here we have given some rules for generating the correct inflectional
form of the common verb for different syntactic aggregation constructs
in Bengali.

   • In case of subject coordinating, if one of the subjects is of first
     person, then the common verb will be inflected by that first person
     subject. As in (17), the common verb inflection yAba is generated
     by the first person subject Ami.

     17. Ami eba.n tumi kAla skule yAba (I and you will go to school
         tomorrow).

   • In case of subject coordinating, if one of the subjects is of
     second person and the other is of either second or third person,
     then the common verb will be inflected by that second person
     subject. As in (18), the common verb inflection yAo is generated by
     the second person subject tumi.

     18. tumi eba.n rAma skule yAo (You and Ram go to school).

   • In case of subject coordinating, if both the subjects are of third
     person, then the subject of the complete clause will inflect the
     common verb. As in (19), both the subjects are of third person and
     the common verb inflection karabena is generated by the subject of
     the complete clause, i.e. tini.

     19. rAma eba.n tini kAjatA karabena (Ram and he will do the work).

   • In case of an RNR construct other than subject coordinating, the
     subject of the complete clause will inflect the common verb. As in
     (20), the common verb inflection khelabe is generated by the
     subject of the complete clause, i.e. se.

     20. Ami krikeTa eba.n se phuTabala khelabe (I shall play cricket
         and he will play football).

So, following the above rules, the correct inflectional form of the
common verb is generated, which increases the fluency and naturalness of
the generated text.

7 Evaluation

We have developed a system which performs syntactic aggregation of two
simple clauses by following the steps mentioned in Section 6. Evaluation
of that system is important to validate our approach. We performed a
user-based evaluation. The system outputs were shown to human
evaluators, who were asked to rate those outputs based on some
parameters. The overall system performance is measured from their
feedback.
   We evaluated the system with three human evaluators, all native
speakers of Bengali. They were given only a brief idea about the
rhetorical relations considered in this work. As mentioned in Section 3,
from a corpus of 600 compound sentences 350 were chosen randomly for
corpus study. The remaining 250 sentences were used as test sentences in
the evaluation. The test sentences were segmented into constituent
simple clauses. The simple clauses, the rhetorical relation connecting
them, and the appropriate discourse marker realizing that relation were
given to the human evaluator as the test inputs. The evaluation is
performed according to the following two criteria:

   • Well-formedness: We define the well-formedness of an output
     sentence by its grammatical correctness and conciseness. The
     grammatical correctness measures the accuracy of the syntax, word
     order and the morphological inflections used.

   • Faithfulness: The faithfulness of an output measures how well the
     communication goal is preserved by the generated output.

   For both the measures, the evaluators were asked to score the outputs
on a scale of 1 to 5, where 1 is the best and 5 is the worst. Each
evaluator scored well-formedness and faithfulness separately so that the
score of one does not influence the score of the other. The results of
each evaluator for well-formedness and faithfulness are shown in Figure
3 and Figure 4 respectively.
   To calculate the overall performance of the system, the scores given
by the individual evaluators were combined as follows: if two or more
evaluators have given a common score to a test sentence, then the
sentence is assigned that common score; if all the evaluators have given
different scores to a test sentence, then it is not considered in the
overall performance calculation. The overall performance of our system
for well-formedness and faithfulness is shown in Figure 5 and Figure 6
respectively.
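
The combination rule described above can be illustrated with a minimal
sketch. It is not the implementation used in this work: it assumes
exactly three evaluators and integer scores from 1 to 5, and the
function names combine_scores and overall_distribution, as well as the
sample ratings, are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence


def combine_scores(scores: Sequence[int]) -> Optional[int]:
    """Return the score given by at least two evaluators, or None.

    Assumes a non-empty sequence of three scores, so at most one
    score can be shared by two or more evaluators.
    """
    score, count = Counter(scores).most_common(1)[0]
    return score if count >= 2 else None


def overall_distribution(all_scores: Iterable[Sequence[int]]) -> Counter:
    """Count test sentences per agreed score.

    Sentences on which all evaluators disagree are left out.
    """
    agreed = (combine_scores(s) for s in all_scores)
    return Counter(score for score in agreed if score is not None)


# Hypothetical ratings: one tuple per test sentence, one score per evaluator.
ratings = [(1, 1, 2), (2, 3, 3), (1, 2, 3), (5, 5, 5)]
print(overall_distribution(ratings))
```

In this sketch the third sentence, rated (1, 2, 3), is excluded because
no two evaluators agree, which mirrors the exclusion rule stated above.
With more than three evaluators two different scores could each be
shared, so the rule would need an explicit tie-breaking policy.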




      Figure 3: Well-formedness Bar Graph

      Figure 4: Faithfulness Bar Graph

      Figure 5: Well-formedness Pie Chart

      Figure 6: Faithfulness Pie Chart

   The inconsistencies with respect to well-formedness of the system
generated output are mainly due to errors in word ordering and
conciseness. For example, the two clauses in (21a) are connected by a
SEQUENCE relation and the system syntactically aggregates them to (21b).
But (21b) is not very good in terms of word ordering and conciseness.

21. a. rahima ekadina rAstAYa bhi.Da dekhechhila. rahimera mAthA ghure
       giYechhila (One day Rahim saw a huge crowd in the street. Rahim
       was moved by that).
    b. rahima ekadina rAstAYa bhi.Da dekhechhila eba.n tAra mAthA ghure
       giYechhila (One day Rahim saw a huge crowd in the street and he
       was moved by that).

The errors regarding the faithfulness measure are due to wrong order of
the constituent clauses and the absence of cues which indicate emphasis
and prosody. For example, the two clauses in (22a), connected by a
CONJUNCTION relation, are aggregated to (22b). But the output is
ambiguous in terms of faithfulness, as both the verbs are now in the
scope of the words bAbAra sAthe.

22. a. rAma bAbAra sAthe khAbAra khAbe. rAma Tibhi dekhabe (Ram will eat
       food with father. Ram will watch TV).
    b. rAma bAbAra sAthe khAbAra khAbe eba.n Tibhi dekhabe (Ram will eat
       food with father and watch TV).

8 Conclusion

In this article, we have shown our methods to generate aggregated and
elliptic sentences in Bengali
from clause-sized semantic representations. The current system can
produce paratactic constructions and use ellipsis to omit repeated
entities. We were able to produce all the desired forms of syntactic
aggregation (see Section 3), though there is scope for improvement.
   Deletion of the repeating words in the generated output sentence
sometimes does not preserve meaning. In that case, to make the text
fluent, anaphoric pronouns need to be used. For example, if the two
clauses in (23a), connected by a CONJUNCTION relation, are aggregated by
removing the repeating words in boldface, then the actual communicative
goal is not preserved. Instead, these two clauses are correctly
aggregated to (23b) by using the anaphoric pronoun tAra.

23.    a. Ami rAmer sAthe phuTabala khelaba eba.n yadu rAmer sAthe
          sinemA dekhabe (I shall play football with Ram and Jadu will
          see a movie with Ram).
       b. Ami rAmer sAthe phuTabala khelaba eba.n yadu tAra sAthe sinemA
          dekhabe (I shall play football with Ram and Jadu will see a
          movie with him).

The current system takes the discourse marker as input for combining
simple clauses. But it can be extended to select the appropriate
discourse marker depending upon the rhetorical relation and other
functional information such as polarity, prosody, emphasis etc.
   The system can be extended to aggregate more than two simple clauses.
In that case the document structure tree (Reiter and Dale, 2000) will be
the input. Clauses can be aggregated according to the specification of
the document structure tree unless the complexity of a single sentence
exceeds a predefined threshold. Depending upon the resulting sentence
complexity and other contextual information, a sentence break may be
declared, resulting in multi-sentential text.
   In future work, we intend to handle the above mentioned limitations
to generate more natural Bengali text.

Acknowledgement

We would like to thank anonymous reviewers for valuable comments. We
would also like to thank Mr. Plaban Kumar Bhowmik and Mr. Sibansu
Mukhopadhyay for their valuable advice and support. This work is
supported by the project Sanyog - Phase II, funded by Media Lab Asia,
and conducted in Communication Empowerment Laboratory, Indian Institute
of Technology.

References

Samit Bhattacharya. 2004. Sanyog: An iconic system for multilingual
  communication for people with speech and motor impairments. M.S.
  Thesis, IIT, Kharagpur. Supervisors: Basu, A. and Sarkar, Sudeshna.

Hercules Dalianis and Eduard H. Hovy. 1993. Aggregation in natural
  language generation. In EWNLG '93, Proceedings of the 4th European
  Workshop on Natural Language Generation, Pisa, Italy.

H. Dalianis. 1996. Aggregation as a subtask of text and sentence
  planning. In J. H. Stewman (ed.), Proceedings of Florida AI Research
  Symposium, FLAIRS-96, pages 1–5, Key West, Florida.

Helmut Horacek. 1992. An integrated view of text planning. In
  Proceedings of the 6th International Workshop on Natural Language
  Generation, pages 29–44, London, UK. Springer-Verlag.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure
  theory: Toward a functional theory of text organization. Text,
  8(3):243–281.

Feikje Hielkema, Marit Theune, and Petra Hendriks. 2006. Performing
  aggregation and ellipsis using discourse structures. Research on
  Language and Computation, 4(4):353–375.

M. Reape and C. Mellish. 1999. Just what is aggregation anyway. In
  Proceedings of the 7th European Workshop on Natural Language
  Generation, pages 20–29, May.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation
  Systems. Cambridge University Press, New York, NY, USA.

James Chi-Kuei Shaw. 2002. Clause aggregation: an approach to generating
  concise text. Ph.D. thesis, New York, NY, USA. Sponsor: McKeown,
  Kathleen R.

John Wilkinson. 1995. Aggregation in natural language generation:
  Another look. Technical report, Computer Science Department,
  University of Waterloo.

were segmented into simple clauses. A simple clause is equivalent to a simple sentence which contains only one finite verb and no coordinating conjunction. For example, the compound sentence rAma eba.n shyAma kAla skule giYechhila (Ram and Shyam went to school yesterday) contains 2 simple clauses: rAma kAla skule giYechhila (Ram went to school yesterday) and shyAma kAla skule giYechhila (Shyam went to school yesterday). By decomposing the 350 compound sentences, we got 868 simple clauses (2.48 simple clauses per sentence). This measure is important to determine the maximum number of simple clauses that can be aggregated in a single sentence. We cannot keep on aggregating an arbitrarily large number of simple clauses even if they are syntactically similar, since it may result in text that is too complex and less fluent. From the corpus analysis, we have identified two types of frequently used syntactic aggregation constructs in Bengali, namely the paratactic construct and the elliptic construct.

• Simple paratactic construction: In this case, the two constituent simple clauses are simply connected by the conjunctive discourse marker and no word deletion is required.
    – rAma ekatA boi paRachhila eba.n shyAma phuTabala khelachhila (Ram was reading a book and Shyam was playing football).

• Elliptic construction: Ellipsis is defined as the omission of superfluous words from the surface form which are inferable from the entities in the remaining text. The different elliptic constructs observed in Bengali are:

    – Conjunction reduction: In conjunction reduction, the subject of the second simple clause is deleted.
        * rAma khAbAra kheYechhe eba.n bandhudera sAthe sinemA dekhate gechhe (Ram has eaten food and gone to see a movie with friends).
      In the example given above, the subject of the second simple clause, i.e., rAma, is deleted using the conjunction reduction construct.

    – Right node raising (RNR): In RNR, the rightmost portion of the first simple clause is deleted.
        * rAma bhAta eba.n shyAma ruTi khAbe (Ram will eat rice and Shyam will eat roti).
      Here the rightmost portion of the first proposition (khAbe) is deleted.

    – Coordinating one constituent: In this case, one constituent entity from each of the input simple clauses is coordinated by a conjunction. This can happen to any entity of the constituent simple clauses.
        * rAma eba.n shyAma phuTbala khelachhila (Ram and Shyam were playing football).
      The subjects of the two constituent simple clauses in the above example are coordinated.

    – Non-finite verb generation: If both the input simple clauses are about some events or actions performed sequentially or concurrently by the same subject, then they are aggregated using the non-finite form of the verb of the first simple clause.
        * rAma bhAta kheYe skule yAbe (Ram will eat rice and go to school).
      In the above example, the two constituent simple clauses are about two actions performed sequentially by the same subject. So, the perfect participle form of the verb khAoYA, i.e. kheYe, is used for aggregation.

  Any combination of the above four types of elliptic constructs is also allowed. For example, in (3) both conjunction reduction and RNR are used, and (4) is generated by using both conjunction reduction and a non-finite verb.

  3. rAma bhAta eba.n mAchha khAbe (Ram will eat rice and fish).
  4. rAma skule giYe phuTabala khelabe (Ram will go to school and play football).

In summary, though we have considered only narrative Bengali text for the corpus study, it is a part of a more general approach. As syntactic aggregation is a language-dependent but domain-independent task (Shaw, 2002), the contributions of this work can be extended to generate aggregated text in Bengali in other domains as well.
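As a rough illustration of three of the elliptic constructs listed above, the sketch below applies conjunction reduction, right node raising, and non-finite verb aggregation to clauses given as lists of ITRANS tokens. It is only a sketch under simplifying assumptions that are not part of the paper: the subject is taken to be the first token and the verb the last token of each clause, and the function names and the one-entry participle lookup (filled from the kheYe example above) are hypothetical stand-ins for real grammatical analysis and morphological generation.

```python
# Illustrative sketch only; clauses are plain token lists in ITRANS notation.
# Assumptions (not from the paper): subject = first token, verb = last token.

# Hypothetical lookup standing in for a morphological generator;
# the single entry comes from the kheYe example in the text.
PERFECT_PARTICIPLE = {"khAbe": "kheYe"}


def conjunction_reduction(first, second, marker="eba.n"):
    """Delete the subject of the second clause when it repeats the first subject."""
    if first[0] == second[0]:
        second = second[1:]
    return first + [marker] + second


def right_node_raising(first, second, marker="eba.n"):
    """Delete the rightmost part of the first clause that the second clause repeats."""
    shared = 0
    limit = min(len(first), len(second))
    while shared < limit and first[-1 - shared] == second[-1 - shared]:
        shared += 1
    return first[:len(first) - shared] + [marker] + second


def non_finite_aggregation(first, second, participles=PERFECT_PARTICIPLE):
    """Same subject: make the first verb non-finite and drop the repeated subject."""
    if first[0] != second[0] or first[-1] not in participles:
        return None  # construct not applicable under these assumptions
    return first[:-1] + [participles[first[-1]]] + second[1:]


if __name__ == "__main__":
    print(" ".join(conjunction_reduction(
        "rAma mAThe giYechhila".split(), "rAma phuTabala khelechhila".split())))
    print(" ".join(right_node_raising(
        "rAma bhAta khAbe".split(), "shyAma ruTi khAbe".split())))
    print(" ".join(non_finite_aggregation(
        "rAma bhAta khAbe".split(), "rAma skule yAbe".split())))
```

On these toy inputs the three calls yield the conjunction-reduced form of (2c), the RNR example, and the non-finite example given in this section. Real clauses would need role and verb information rather than token positions, which is what the case-frame representation of the later sections provides.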
4 Rhetorical Relations Considered

From the corpus study, we know that paratactic aggregations are the most common form of syntactic aggregation in Bengali. In paratactic aggregation, the constituent text spans are of equal status and are linked by a multi-nuclear rhetorical relation (Mann and Thompson, 1988). In this work, we have focused on the different paratactic constructs for syntactic aggregation of Bengali text. The multi-nuclear rhetorical relations considered in this paper are CONJUNCTION, DISJUNCTION, CONTRAST, and SEQUENCE as defined by the original Rhetorical Structure Theory (RST). In addition to these relations, we have considered another multi-nuclear temporal coherence relation, PARALLEL, as defined below:

    Two text spans are said to be related by the PARALLEL relation if the actions or the events in those two text spans occur simultaneously.

For example, the two constituent clauses present in (5) are rAma khAbAra khAchchhila (Ram was eating food) and rAma Tibhi dekhachhila (Ram was watching TV). The actions in these two clauses are concurrent. So, the coherence relation between them is PARALLEL.

  5. rAma khAbAra khete khete Tibhi dekhachhila (Ram was watching TV while eating food).

5 The Semantic Representation

The semantic representation chosen here is a case-frame representation, also called a predicate-argument representation. The basic building block in this representation is the sentence. An example of the sentence frame is given in Figure 1. A sentence contains a clause frame and a clause-count which denotes the number of simple clauses present in the sentence. The clause is a recursive structure that can contain clauses inside itself, which makes it capable of representing both simple and composite (compound and complex) sentences. For a simple sentence, the outer clause contains only one inner clause. On the other hand, for a composite sentence the outer clause contains the constituent inner clauses along with the rhetorical relation (rh-rel) connecting them and the discourse marker (dm) realizing that rhetorical relation. A clause frame contains a predicate frame (pre) and a list of argument frames (arg). The pre frame contains verb-related information, such as verb root (v-root), theme, tense, aspect, mood, polarity etc. The arg frame contains the nominal entities along with the thematic role of each entity in that clause. If there is any modifier for the verb or for any nominal entity in a clause, then the respective modifier frames (v-mod and w-mod frames) are present inside the corresponding pre and arg frames.

Figure 1: Case-frame representation for the sentence "rAma pa.Dachhila eba.n shyAma khelachhila." (Ram was reading and Shyam was playing).

6 Proposed Approach

In our approach for syntactic aggregation, the inputs are two simple clauses, the rhetorical relation between them, and the discourse marker realizing that relation. To syntactically aggregate the two simple clauses by using the different paratactic constructs identified in Section 3, we propose
the following steps:

• Step 1: Ordering arguments in the constituent clauses.
• Step 2: Repeating entity identification.
• Step 3: Ordering constituent clauses.
• Step 4: Superfluous words deletion and non-finite verb generation.
• Step 5: Correct surface form generation.

The above steps are described below.

6.1 Argument Ordering in the Constituent Clauses

Preferred word ordering in a sentence varies with languages and it is very important for syntactic aggregation. Though Bengali is a free-word-order language, the preferred word ordering in a Bengali sentence is subject-object-verb.

In this work, the input simple clauses are taken in their corresponding semantic case-frame representation as shown in Figure 1. The arg frames in the clause are then ordered by using a total order among the roles associated with the arg frames. These roles are neither semantic roles nor Paninian roles. Neither kind of role can be used here because neither can be associated with a unique postposition, which is very important for generating sentences in Bengali. So the alternative approach is to design an intermediate representation in which the roles are fine-grained enough that ambiguous assignments of postpositions are not possible. Bengali has a list of postpositions that are used in different contexts to convey different semantics. In this work, roles have been designed at a granularity level where one role is assigned to a semantically unique postposition. For developing the total order of the roles, we have followed an approach taken in the SANYOG system (Bhattacharya, 2004). We have taken the constituent simple clauses of the compound sentences used for corpus analysis. Each simple clause was represented in its case-frame representation and the arg frames inside it were then ordered as they appear in the surface form of the clause. In this way, the ordering among the roles of the arg frames in a clause is known. For example, the role set for (6) is {ke, kothAYa, kakhana}. From (6) we can infer that the preferred order among these roles is ke < kakhana < kothAYa. The role on the left side of < will appear before the role on the right side in the surface form.

  6. Ami AgAmIkAla skule yAba (I shall go to school tomorrow).

Again, in (7) the role set is {ke, kothAYa, kakhana, kAra sAthe}. By using (7), the total order obtained from (6) can be extended to ke < kakhana < kAra sAthe < kothAYa.

  7. Ami AgAmIkAla bAbAra sAthe skule yAba (Tomorrow I shall go to school with my father).

Using the above method for the entire set of simple clauses, we have identified the set of possible roles in Bengali and developed a total order among them. The arg frames in the input simple clauses are ordered using the developed total order.

6.2 Repeating Entity Identification

In our current approach, to remove the redundant entities we first identify the repeating entities present in both the simple clauses taken as input. We assume that two nominal entities are equivalent if they have the same thematic role and root word in the constituent simple clauses. For example, in the simplified semantic representation of the compound sentence shown in Figure 2, the constituent simple clauses have one repeating nominal entity. In both the simple sentences, the thematic role of that entity is ki and its surface form is bhAta. Two verbs are equivalent if they have the same root word and the same other functional parameters, such as tense, aspect, mood, polarity etc. In Figure 2, the verbs are equivalent and thus repeating. Two noun modifiers are equivalent if they have the same root word and are modifying two nominal entities with the same thematic role. Lastly, two verb modifiers are equivalent if they have the same root word. The repeating entities are tagged with the status REPEATING.

6.3 Ordering Constituent Clauses

All the rhetorical relations considered in this work, mentioned in Section 4, are multi-nuclear relations. So, two simple clauses connected by any of these relations, except the SEQUENCE relation, can be realized in any order. In case of the SEQUENCE relation, an ordering constraint is imposed by the sequence of the input clauses. So, for SEQUENCE
relation the clauses cannot be reordered. For other relations, after identifying the repeating entities, the constituent simple clauses in the resulting compound sentence are reordered on the basis of their chronological order and polarity, following the rules mentioned below:

Figure 2: Simplified case-frame representation for the sentence "rAma eba.n shyAma bhAta khAbe." (Ram and Shyam will eat rice). Note: ∼() denotes a frame.

• Tense: If the two constituent clauses have different tenses then they are ordered chronologically. This improves the fluency of the generated compound sentence. For example, if the two clauses in (8a), linked by the CONJUNCTION relation, are aggregated without chronological ordering then (8b) is generated. But if they are ordered according to their tense and then aggregated, (8c) is generated, which is more fluent and coherent than (8b).

  8. a. · Ami bA.Di yAba. (I shall go home).
        · rAma skule gechhe. (Ram has gone to school).
     b. Ami bA.Di yAba eba.n rAma skule gechhe. (I shall go home and Ram has gone to school).
     c. rAma skule gechhe eba.n Ami bA.Di yAba. (Ram has gone to school and I shall go home).

  The chronological ordering is done when the rhetorical relation between the two constituent clauses is CONJUNCTION, DISJUNCTION or CONTRAST. As the constituent simple clauses are concurrent for the PARALLEL relation, this ordering is not required.

• Polarity: If two simple clauses have the same tense but different polarity for the verb, then the clause with negative polarity will come first in the surface form. For example, if the simple clauses in (9a), linked by the CONJUNCTION relation, are aggregated as in (9b), then the negative polarity marker nA affects both the verbs kinabe and khAbe. So, the communicative goal is not preserved. However, if the clauses are reordered and then aggregated, (9c) results, which is grammatically correct, fluent and preserves the meaning.

  9. a. rAma chakaleTa kinabe. rAma chakaleTa khAbe nA. (Ram will buy chocolate. Ram will not eat chocolate).
     b. rAma chakaleTa kinabe eba.n khAbe nA (Ram will buy chocolate and will not eat).
     c. rAma chakaleTa khAbe nA eba.n kinabe (Ram will not eat chocolate and will buy).

  The ordering based on polarity is done when the clauses are linked by either the CONJUNCTION or the DISJUNCTION relation.

6.4 Superfluous Words Identification and Non-finite Verb Generation

After identifying the repeating entities and ordering the constituent clauses, the superfluous words are identified using the following two methods:

• Forward deletion: If the entities at the beginning of the surface forms of both clauses
are REPEATING then they are marked as DELETED in the second clause. Surface forms of both the clauses are traversed from left to right and REPEATING entities are marked as DELETED in the second clause until a NON-REPEATING entity is encountered. For example, the two constituent clauses in (10), linked by the CONJUNCTION relation, have REPEATING entities with the roles ke and kakhana, and they occur at the beginning of both the clauses. So, the REPEATING entities are marked DELETED in the second clause, indicated by the words in bold face.

  10. rAma gatakAla khAbAra kheYechhila eba.n rAma gatakAla skule giYechhila (Ram ate food yesterday and Ram went to school yesterday).

• Backward deletion: If the verb and the entities at the end of the surface forms of both clauses are REPEATING then they are marked as DELETED in the first clause. Surface forms of both the clauses are traversed from right to left and the REPEATING verb and entities are marked as DELETED in the first clause until a NON-REPEATING entity is encountered. For example, the two constituent clauses in (11), linked by the CONJUNCTION relation, have a REPEATING verb and a REPEATING entity with the role ki, and they occur at the end of both the clauses. So, the REPEATING elements are marked DELETED in the first clause, indicated by the words in bold face.

  11. rAma bhAta khAbe eba.n shyAma bhAta khAbe (Ram will eat rice and Shyam will eat rice).

If the two simple clauses, linked by the CONJUNCTION, DISJUNCTION or CONTRAST relation, have the same role set then the REPEATING entities are both forward deleted and backward deleted. For example, in (12) the two simple clauses, connected by the CONJUNCTION relation, have the same set of associated roles. So, bold faced words in the second clause are deleted forward and those in the first clause are deleted backward. However, if the role sets are different then only forward deletion is done. As the two clauses in (13), connected by a CONTRAST relation, have different role sets, only the bold faced words in the second clause are forward deleted.

  12. rAma Aja bhAta khAbe eba.n rAma kAle bhAta khAbe (Ram will eat rice today and Ram will eat rice tomorrow).
  13. rAma Aja bhAta khAbe kintu rAma kAle bAbAra sAthe ruti khAbe (Ram will eat rice today but Ram will eat roti with father tomorrow).

In case of the SEQUENCE or PARALLEL relation, only forward deletion is done. In addition to that, the verb of the first clause is modified to its non-finite form if the subjects of both the clauses are the same. For the SEQUENCE relation, the non-finite form is the perfect participle of the verb, and for the PARALLEL relation it is the progressive participle. For example, in (14a) the two clauses are linked by the SEQUENCE relation. So, first the bold faced words in the second clause are forward deleted and then the perfect participle form of the verb of the first clause is generated. This results in the compound sentence (14b). Similarly, the two clauses in (15a), linked by the PARALLEL relation, are aggregated to (15b) by using the progressive participle of the root verb paRA.

  14. a. rAma bA.Di yAbe eba.n rAma bhAta khAbe (Ram will go home and Ram will eat rice).
      b. rAma bA.Di giYe bhAta khAbe (Ram will go home and eat rice).

  15. a. rAma bai pa.Dachhila eba.n rAma khAbAra khAchchhila (Ram was reading a book and Ram was eating food).
      b. rAma bai pa.Date pa.Date khAbAra khAchchhila (Ram was eating food while he was reading a book).

6.5 Correct Surface Form Generation

The redundant words are identified in the previous step but the actual deletion is done in this step. While generating the resulting compound sentence, the entities marked as DELETED are not realized, i.e. they are deleted from the surface form.

In case of subject coordinating and RNR constructs, if the subjects of the two input clauses are different then the correct surface form of the common verb should be generated. For example, in (16) the surface form used for the common verb khelA
is khelaba, which is generated by the subject of the first clause, i.e. Ami.

  16. Ami eba.n rAma kAla phuTabala khelaba (I and Ram will play football tomorrow).

Here we give some rules for generating the correct inflectional form of the common verb for different syntactic aggregation constructs in Bengali.

• In case of subject coordinating, if one of the subjects is of first person then the common verb will be inflected by that first person subject. For instance, in (17) the common verb inflection yAba is generated by the first person subject Ami.

  17. Ami eba.n tumi kAla skule yAba (You and I will go to school tomorrow).

• In case of subject coordinating, if one of the subjects is of second person and the other is of either second or third person then the common verb will be inflected by that second person subject. For instance, in (18) the common verb inflection yAo is generated by the second person subject tumi.

  18. tumi eba.n rAma skule yAo (You and Ram go to school).

• In case of subject coordinating, if both the subjects are of third person then the subject of the complete clause will inflect the common verb. For instance, in (19) both the subjects are of third person and the common verb inflection karabena is generated by the subject of the complete clause, i.e. tini.

  19. rAma eba.n tini kAjatA karabena (Ram and he will do the work).

• In case of an RNR construct other than subject coordinating, the subject of the complete clause will inflect the common verb. For instance, in (20) the common verb inflection khelabe is generated by the subject of the complete clause, i.e. se.

  20. Ami krikeTa eba.n se phuTabala khelabe (I shall play cricket and he will play football).

So, following the above rules, the correct inflectional form of the common verb is generated, which increases the fluency and naturalness of the generated text.

7 Evaluation

We have developed a system which performs syntactic aggregation of two simple clauses by following the steps mentioned in Section 6. Evaluation of that system is important to validate our approach. We performed a user-based evaluation. The system outputs were shown to human evaluators, who were asked to rate those outputs based on some parameters. The overall system performance is measured from their feedback.

We evaluated the system with three human evaluators, all native speakers of Bengali. They were only given a brief idea about the rhetorical relations considered in this work. As mentioned in Section 3, from a corpus of 600 compound sentences 350 were chosen randomly for the corpus study. The remaining 250 sentences were used as test sentences in the evaluation. The test sentences were segmented into constituent simple clauses. The simple clauses, the rhetorical relation connecting them, and the appropriate discourse marker realizing that relation were given to the human evaluators as the test inputs. The evaluation is performed according to the following two criteria:

• Well-formedness: We define the well-formedness of an output sentence by its grammatical correctness and conciseness. The grammatical correctness measures the accuracy of the syntax, word order and the morphological inflections used.

• Faithfulness: The faithfulness of an output measures how well the communication goal is preserved by the generated output.

For both the measures, the evaluators were asked to score the outputs on a scale of 1 to 5, where 1 is the best and 5 is the worst. The scoring for well-formedness and faithfulness was done separately by each evaluator so that the score of one does not influence the score of the other. The results of each evaluator for well-formedness and faithfulness are shown in Figure 3 and Figure 4 respectively.

To calculate the overall performance of the system, the scores given by the individual evaluators were combined as follows: if two or more evaluators have given a common score to a test sentence then it is assigned that common score; if all the evaluators have given different scores to a test sentence then it is not considered for the overall performance calculation. The overall performance of our system for well-formedness and faithfulness is shown in Figure 5 and Figure 6 respectively.

Figure 3: Well-formedness Bar Graph

Figure 4: Faithfulness Bar Graph

Figure 5: Well-formedness Pie Chart

Figure 6: Faithfulness Pie Chart

The inconsistencies with respect to well-formedness of the system generated output are mainly due to errors in word ordering and conciseness. For example, the two clauses in (21a) are connected by the SEQUENCE relation and the system syntactically aggregates them to (21b). But (21b) is very good in terms of word ordering and conciseness.

  21. a. rahima ekadina rAstAYa bhi.Da dekhechhila. rahimera mAthA ghure giYechhila (One day Rahim saw a huge crowd in the street. Rahim was moved by that).
      b. rahima ekadina rAstAYa bhi.Da dekhechhila eba.n tAra mAthA ghure giYechhila (One day Rahim saw a huge crowd in the street and he was moved by that).

The errors regarding the faithfulness measure are due to wrong ordering of the constituent clauses and the absence of cues which indicate emphasis and prosody. For example, the two clauses in (22a), connected by the CONJUNCTION relation, are aggregated to (22b). But the output is ambiguous in terms of faithfulness, as both the verbs are now in the scope of the words bAbAra sAthe.

  22. a. rAma bAbAra sAthe khAbAra khAbe. rAma Tibhi dekhabe (Ram will eat food with father. Ram will watch TV).
      b. rAma bAbAra sAthe khAbAra khAbe eba.n Tibhi dekhabe (Ram will eat food with father and watch TV).

8 Conclusion

In this article, we have shown our methods to generate aggregated and elliptic sentences in Bengali
from clause-sized semantic representations. The current system can produce paratactic constructions and use ellipsis to omit repeated entities. We were able to produce all the desired forms of syntactic aggregation (see Section 3), though there is scope for improvement.

Deletion of the repeating words in the generated output sentence sometimes does not preserve meaning. In that case, to make the text fluent, anaphoric pronouns need to be used. For example, if the two clauses in (23a), connected by the CONJUNCTION relation, are aggregated by removing the repeating words in boldface then the actual communicative goal is not preserved. Instead, these two clauses are correctly aggregated to (23b) by using the anaphoric pronoun tAra.

  23. a. Ami rAmer sAthe phuTabala khelaba eba.n yadu rAmer sAthe sinemA dekhabe (I shall play football with Ram and Jadu will see a movie with Ram).
      b. Ami rAmer sAthe phuTabala khelaba eba.n yadu tAra sAthe sinemA dekhabe (I shall play football with Ram and Jadu will see a movie with him).

The current system takes the discourse marker as input for combining simple clauses. But it can be extended to select the appropriate discourse marker depending upon the rhetorical relation and other functional information such as polarity, prosody, emphasis etc.

The system can be extended to aggregate more than two simple clauses. In that case the document structure tree (Reiter and Dale, 2000) will be the input. Clauses can be aggregated according to the specification of the document structure tree unless the complexity of a single sentence exceeds a predefined threshold. Depending upon the resulting sentence complexity and other contextual information, a sentence break may be declared, resulting in multi-sentential text.

In our future work, we intend to handle the above mentioned limitations to generate more natural Bengali text.

Acknowledgement

We would like to thank the anonymous reviewers for valuable comments. We would also like to thank Mr. Plaban Kumar Bhowmik and Mr. Sibansu Mukhopadhyay for their valuable advice and support. This work is supported by the project Sanyog - Phase II, funded by Media Lab Asia, and conducted in the Communication Empowerment Laboratory, Indian Institute of Technology.

References

Samit Bhattacharya. 2004. Sanyog: An iconic system for multilingual communication for people with speech and motor impairments. M.S. Thesis, IIT Kharagpur. Supervisors: A. Basu and Sudeshna Sarkar.

Hercules Dalianis and Eduard H. Hovy. 1993. Aggregation in natural language generation. In EWNLG '93, Proceedings of the 4th European Workshop on Natural Language Generation, Pisa, Italy.

H. Dalianis. 1996. Aggregation as a subtask of text and sentence planning. In J. H. Stewman (ed.), Proceedings of Florida AI Research Symposium, FLAIRS-96, pages 1–5, Key West, Florida.

Helmut Horacek. 1992. An integrated view of text planning. In Proceedings of the 6th International Workshop on Natural Language Generation, pages 29–44, London, UK. Springer-Verlag.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

M. Reape and C. Mellish. 1999. Just what is aggregation anyway. In Proceedings of the 7th European Workshop on Natural Language Generation, pages 20–29, May.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

James Chi-Kuei Shaw. 2002. Clause aggregation: an approach to generating concise text. Ph.D. thesis, New York, NY, USA. Sponsor: Kathleen R. McKeown.

Marit Theune, Feikje Hielkema, and Petra Hendriks. 2006. Performing aggregation and ellipsis using discourse structures. Research on Language and Computation, 4(4):353–375.

John Wilkinson. 1995. Aggregation in natural language generation: Another look. Technical report, Computer Science Department, University of Waterloo.