Non-adjacent linguistic phenomena such as non-contiguous multiwords and other phrasal units containing insertions, i.e., words that are not part of the unit, are difficult to process
and remain a problem for NLP applications. Non-contiguous multiword units are common across languages and constitute some of the most important challenges to high-quality machine
translation. This paper presents an empirical analysis of non-contiguous multiwords and highlights our use of the Logos
Model and the Semtab function to deploy semantic knowledge to align non-contiguous multiword units, with the goal of translating these units with high fidelity. The phrase-level manual
alignments illustrated in the paper were produced with CLUE-Aligner, a Cross-Language Unit Elicitation alignment tool.
2. • Introduction
– Discontinuous Multiword Units (DMWU) in NLP
– Main Current Shortcomings
– Our Goal
• CLUE-Aligner Alignment Tool
• The Logos Model
– Alignment of DMWU Inspired by Logos
• bring [ ] to a conclusion
• set [ ] in motion
• play [ ] role
• take [ ] interest in
• keep [ ] informed about
• Preliminary Results
– Analysis of Preliminary Results
• Advantages of the Logos Model
• Conclusions and Future Directions
• Final Remark
– The eSPERTo Project
Outline
3. • Increasing interest in multiword units (MWU) in the field of NLP
“lexical items that: (a) can be decomposed into multiple lexemes; and (b) display lexical,
syntactic, semantic, pragmatic and/or statistical idiomaticity” (Baldwin and Kim 2010)
• Compositionality property – makes automatic processing of MWU
particularly challenging
– Free combinations
• round table = meeting
– Opaque meanings
• piece of cake = easy to do
• pay a visit = visit
– Cannot be translated word-for-word
• raining cats and dogs
– Allow insertions (= words that are not part of the unit)
• to bring [INSERTION] to a conclusion
I would urge the European Commission to bring the process of adopting the directive on additional pensions to a conclusion
Introduction
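The insertion property above can be illustrated with a minimal sketch in Python. It is not part of the Logos Model or CLUE-Aligner: the helper names `dmwu_pattern` and `find_insertion`, the `[ ]` gap notation, and the `max_gap_words` limit are assumptions made for illustration. The sketch matches surface forms only, so inflected variants such as "brought" would need lemmatization first.

```python
import re


def dmwu_pattern(template, max_gap_words=12):
    """Build a regex for a unit written with "[ ]" marking the insertion slot.

    Hypothetical helper: the gap is limited to max_gap_words tokens and is
    matched lazily, so the shortest insertion that lets the tail match wins.
    """
    head, tail = (part.strip() for part in template.split("[ ]"))
    gap = r"(?P<insertion>(?:\S+\s+){1," + str(max_gap_words) + r"}?)"
    return re.compile(
        r"\b" + re.escape(head) + r"\s+" + gap + re.escape(tail) + r"\b",
        re.IGNORECASE,
    )


def find_insertion(template, sentence):
    """Return the words inserted inside the unit, or None if the unit is absent."""
    match = dmwu_pattern(template).search(sentence)
    return match.group("insertion").strip() if match else None


SENTENCE = (
    "I would urge the European Commission to bring the process of adopting "
    "the directive on additional pensions to a conclusion"
)

print(find_insertion("bring [ ] to a conclusion", SENTENCE))
# prints: the process of adopting the directive on additional pensions
```

A purely string-based matcher like this one only locates the unit and its insertion; it carries no semantic knowledge about which contexts license the unit.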
4. • Non-adjacent linguistic phenomena – remote dependency
• Common across languages
• Difficult to recognize and process
• Remain a problem for NLP applications
• Lack of formalization still triggers problems with the syntactic and
semantic analysis of sentences containing MWU
• Impairment of NLP systems’ performance
• Cause MT to fail in assigning the correct translation
• For SMT systems, DMWU constitute significant challenges to
correct word and phrase alignment (Shen et al. 2009), and
therefore, to high quality MT
Discontinuous Multiword Units in NLP
5. • Linguistic knowledge is still limited in most systems
– Some SMT methodologies rely mostly on statistics to train/evaluate
MT systems, use probabilistic alignments with little or no linguistic
knowledge, and disregard syntactic discontinuity.
– Inability to identify MWU correctly results in translation deficiencies.
• Lack of publicly available manual multilingual datasets, and of
linguistically motivated alignment guidelines
– Publicly available alignments are mostly bilingual, with some
exceptions (Graça et al. 2008)
– Guidelines cover cross-linguistic phenomena superficially, excluding
important alignment challenges presented by DMWU.
• Lack of more robust alignment tools
– Limited support for human annotators in correctly identifying and
aligning DMWU and in producing rules from them.
Main Current Shortcomings
6. • Present an experimental empirical analysis of DMWU
• Stress the relevance of correct (and non-arbitrary) alignment of DMWU
• Highlight an alignment methodology inspired by the Logos Model (Scott,
2003; Barreiro et al., 2011) and the Semtab function to deploy
semantico-syntactic knowledge that allows DMWU to be translated with high fidelity
• Illustrate DMWU manual alignments produced with CLUE-Aligner
(Cross-Language Unit Elicitation), an interactive Web alignment tool (Barreiro,
Raposo, Luís 2016)
*Even though similar in name to the “clue alignment approach” (Tiedemann, 2003; 2004; 2011),
which is mainly devoted to word-level alignment, our approach is theoretically and methodologically
different: it focuses on phrase alignment and covers multiwords and linguistically relevant
phrasal units.
Our Goal
10. • Integrates semantic and contextual knowledge and applies it to the
translation process
• Precision is associated with the application of Semtab semantic and
contextual data-driven pattern-rules, which are deep structure patterns
that match on (apply to) a great variety of surface structures, including
DMWU
– deal(VI) with N(questions) = s’occuper de N
• Alignments that mirror Semtab semantic nuances can help create new MT
systems and improve existing ones
The Logos Model
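To make the idea of a deep-structure pattern-rule more concrete, here is a toy Python sketch modelled on the example deal(VI) with N(questions) = s’occuper de N. It is not the Semtab implementation: the `Token` and `PatternRule` classes, the `apply_rule` function, the part-of-speech tags, and the semantic labels are all assumptions for illustration, and the input is assumed to be already lemmatized and semantically tagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str
    lemma: str
    pos: str       # toy tags: "V", "PREP", "N", "DET", "ADV"
    sem: str = ""  # toy semantic class label (assumption)


@dataclass(frozen=True)
class PatternRule:
    verb_lemma: str  # e.g. "deal"
    prep: str        # e.g. "with"
    noun_sem: str    # semantic class required of the object noun
    target: str      # target-side pattern, with N as the open slot


def apply_rule(rule, tokens):
    """Return (target pattern, matched noun) if the rule applies, else None.

    The verb, the preposition, and the noun may be separated by other words,
    so one rule covers contiguous and discontinuous surface structures.
    """
    verb_idx = next(
        (i for i, t in enumerate(tokens)
         if t.pos == "V" and t.lemma == rule.verb_lemma),
        None,
    )
    if verb_idx is None:
        return None
    prep_idx = next(
        (i for i in range(verb_idx + 1, len(tokens))
         if tokens[i].pos == "PREP" and tokens[i].lemma == rule.prep),
        None,
    )
    if prep_idx is None:
        return None
    # The first noun after the preposition is taken as the object.
    noun = next((t for t in tokens[prep_idx + 1:] if t.pos == "N"), None)
    if noun is None or noun.sem != rule.noun_sem:
        return None
    return rule.target, noun.surface


RULE = PatternRule("deal", "with", "questions", "s'occuper de N")

# "The committee dealt quickly with these issues" (hand-tagged toy input)
SENTENCE = [
    Token("The", "the", "DET"),
    Token("committee", "committee", "N", "organization"),
    Token("dealt", "deal", "V"),
    Token("quickly", "quickly", "ADV"),
    Token("with", "with", "PREP"),
    Token("these", "this", "DET"),
    Token("issues", "issue", "N", "questions"),
]

print(apply_rule(RULE, SENTENCE))
```

The semantic constraint on the noun is what separates this kind of rule from a plain string pattern: the same verb and preposition with an object from a different class would not trigger the rule.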
18. Analysis of Preliminary Results
DMWU (support verb construction) | Google Translate | Correct translation
to bring [this dossier] to a conclusion | trazer a uma conclusão | concluir / terminar [este dossier]
set […] in motion | estabeleceu […] em movimento | iniciou / pôs em marcha […]
play [the] role | jogar [o] papel | desempenhar [o] papel
take [a lukewarm] interest in | *ter um interesse [*morna] em | manifeste / demonstre um interesse [morno/fraco/ténue]
keep [us] informed about | *tem [nos] *manteve informados sobre | nos tem mantido informados sobre / nos tem informado sobre
EN – It is unacceptable for the Commission only to take a lukewarm interest in a country.
PT-GT – É inaceitável que a Comissão só a *ter um interesse morna em um país.
Lexical errors related to DMWU + Structural errors
• Lack of agreement (para nos manter regular e estreitamente
*informado sobre; que o Parlamento *ser bem *informados sobre)
• Incorrect word order (se conseguirmos *a adoptar e defini-lo em
movimento)
• Etc.
19. Advantages of the Logos Model
• Consistent and efficient solution to process DMWU, not consistently
processed in former word or phrase alignment techniques
• Ability to relate constituents that are apart (even very far apart) in the
sentence
• Consistent way to analyze and translate words in context
• Ability to generalize between alternative forms of the same MWU, phrase
or expression (take a walk = walk)
• Semtab offers a robust solution to the problem of open-class items and less
frequent MWU and phrases, which an SMT system cannot learn quickly or
translate correctly and whose errors show up in MT output (and in
non-native speaker usage)
– make a visit or pay a visit?
• MWU are not processed on a word-for-word basis; they represent atomic
semantico-syntactic and translation units
20. • Standard MT systems can benefit from a correct processing of DMWU
• currently not being explored efficiently
• processing, recognition and translation of DMWU are challenging
• Some methodologies are inefficient
• they violate the intrinsic property of the unit as an atomic group of
elements
• elements of the unit cannot be separated or aligned individually
• unit boundaries need to be respected
• Post-editing efforts can be minimized by improving alignment quality
• Even though we analyzed just a few cases of SVC, our findings point to a
general lack of quality in the translation of DMWU (and discontinuous
phrasal expressions)
Conclusions and Future Directions
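The requirement that unit boundaries be respected can be expressed as a simple check on a phrase alignment. The Python sketch below is an illustration only, not CLUE-Aligner's data model: the function name `respects_unit` and the representation of an alignment as a list of (source indices, target indices) pairs are assumptions.

```python
def respects_unit(unit_src_indices, links):
    """Check that a multiword unit is aligned as one atomic block.

    links is a list of (source index set, target index set) pairs.
    The unit is respected when exactly one link touches it and that
    link contains every element of the unit.
    """
    unit = set(unit_src_indices)
    touching = [set(src) for src, _ in links if unit & set(src)]
    return len(touching) == 1 and unit <= touching[0]


# Source tokens: 0 bring, 1 this, 2 dossier, 3 to, 4 a, 5 conclusion
UNIT = {0, 3, 4, 5}  # bring [ ] to a conclusion

# Unit aligned as a whole, e.g. to "concluir" in "concluir este dossier"
ATOMIC = [({0, 3, 4, 5}, {0}), ({1}, {1}), ({2}, {2})]

# Word-for-word links that split the unit, as in "trazer a uma conclusão"
WORD_LEVEL = [({0}, {0}), ({3}, {1}), ({4}, {2}), ({5}, {3})]

print(respects_unit(UNIT, ATOMIC))      # prints: True
print(respects_unit(UNIT, WORD_LEVEL))  # prints: False
```

A check of this kind could be run over manually produced alignments to flag links that break a unit apart before translation pairs are extracted from them.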
21. • Validation
• Broader quantification of phenomena needed to validate exploratory results
• Evaluation
• Evaluation of the performance in hierarchical phrase and syntax-based MT and
neural network translation models (with theoretical capacity to learn DMWU)
• Annotation
• Manual multilingual alignments (gold sets)
• Alignment Guidelines
• Improved and enlarged sets of linguistically-based/motivated alignment
guidelines (gold standards)
• Cross-Linguistic Analysis
• Deep analysis of challenging cross-linguistic phenomena, including DMWU
• Rule / Grammar Construction
• Translation rules extracted from quality manually-annotated corpora
• Tool Enhancement and Automation
• Feed CLUE-Aligner with manual training data and enhance the tool for automatic
alignment and extraction of large amounts of translation pairs for MT case studies
• Translation Applications
• Increase precision and recall in MT systems
• Paraphrases
• Methodology and resources - a valuable asset for applications requiring paraphrases
Conclusions and Future Directions
23. The eSPERTo Project
the man who is American
the man from America
the man with American nationality
…
The American man
https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl
Paraphrases 4 Translation (Human + MT)
24. Thank you!
Acknowledgements
This research work was supported by Fundação para a Ciência e a Tecnologia (FCT), under project
eSPERTo EXPL/MHC-LIN/2260/2013, UID/CEC/50021/2013, and post-doctoral grant
SFRH/BPD/91446/2012.