2. The authoring process – and where it needs support
challenges for correctness
• time pressure
• non-native writing
• not enough capacity for careful proofreading
automatic support possibilities
• spell checking
• grammar checking
3. The authoring process – and where it needs support
challenges for understandability and readability
• authors are experts in the subject and the language – users often are not
automatic support possibilities
• style checking
4. The authoring process – and where it needs support
challenges for consistency and corporate wording
• guidelines for corporate wording exist – in a large document on the shelf
• terminology lists exist – in an Excel sheet somewhere in the file system
• distributed writing
automatic support possibilities
• terminology checking
• sentence clustering
5. The authoring process – and where it needs support
challenges for translatability
• authors write without having the translation process in mind
• lexical, syntactic and semantic ambiguity
• translation costs depend on translation memory matches
automatic support possibilities
• style checking
• terminology checking
7. tokenization
• Close the door of our XYZ car.
  → capital word, lower word, space, dot_EOS
• 花子が本を読んだ。 (Japanese: "Hanako read a book.")
  → 花子 が 本 を 読ん だ 。
  → Kanji, Hiragana, dot_EOS
• based on rules and lists of abbreviations
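The rule-and-list approach above can be sketched in a few lines of Python. The abbreviation list here is illustrative; real tokenizers maintain much larger, language-specific lists.

```python
# A hypothetical abbreviation list; real systems use far larger ones.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr."}

def tokenize(sentence):
    """Rule-based tokenization: split on whitespace, then detach a
    sentence-final dot unless the word is a known abbreviation."""
    tokens = []
    for word in sentence.split():
        if word.endswith(".") and word not in ABBREVIATIONS:
            tokens.append(word[:-1])
            tokens.append(".")  # dot_EOS
        else:
            tokens.append(word)
    return tokens

print(tokenize("Close the door of our XYZ car."))
# ['Close', 'the', 'door', 'of', 'our', 'XYZ', 'car', '.']
```

For languages written without spaces, such as the Japanese example, segmentation instead relies on dictionaries and character-class transitions (e.g. Kanji to Hiragana).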
8. POS tagging
• Close the door of our XYZ car.
  → V DET N PREP PRON NE N
• based on statistical methods and large dictionaries
• output represented as XML and attribute-value structures
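A toy dictionary-lookup tagger illustrates the idea; the lexicon and fallback heuristics below are made up for this sketch, and real taggers add statistical context disambiguation (e.g. "close" can be a verb or an adjective).

```python
# Illustrative lexicon; real taggers use large dictionaries.
LEXICON = {
    "close": "V", "the": "DET", "door": "N", "of": "PREP",
    "our": "PRON", "car": "N",
}

def tag(tokens):
    """Dictionary lookup with two crude fallbacks: capitalized unknown
    words become named entities, other unknowns default to noun."""
    tagged = []
    for tok in tokens:
        if tok.lower() in LEXICON:
            pos = LEXICON[tok.lower()]
        elif tok[0].isupper():
            pos = "NE"
        else:
            pos = "N"
        tagged.append((tok, pos))
    return tagged

print(tag(["Close", "the", "door", "of", "our", "XYZ", "car"]))
```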
9. morphology
• Close the door of our XYZ car.
  close → Lemma: close, Tense: present_imp, Person: third, Number: singular
  car → Lemma: car, Number: singular, Case: nominative_accusative
• based on dictionaries, rules for inflection and derivation
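A minimal suffix-stripping analyzer shows the dictionary-plus-rules approach. The lemma list and the inflection rules here are illustrative; a real morphology component covers irregular forms and derivation as well.

```python
# Illustrative lemma dictionary.
LEMMAS = {"car", "close", "door"}

def analyze(word):
    """Return (lemma, features) by stripping inflectional suffixes
    against the lemma dictionary."""
    w = word.lower()
    if w in LEMMAS:
        return (w, {"form": "base"})
    if w.endswith("s") and w[:-1] in LEMMAS:
        return (w[:-1], {"number": "plural"})
    if w.endswith("ed") and w[:-2] in LEMMAS:
        return (w[:-2], {"tense": "past"})
    if w.endswith("d") and w[:-1] in LEMMAS:
        return (w[:-1], {"tense": "past"})
    return (w, {})  # unknown word: no analysis

print(analyze("cars"))    # ('car', {'number': 'plural'})
print(analyze("closed"))  # ('close', {'tense': 'past'})
```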
10. dictionary
• words unknown to the standard NLP system
• http://wiki.openoffice.org/wiki/Documentation/
11. language analysis vs. error analysis
language analysis:
• words are defined in a dictionary
• anything not in the dictionary is an error
• high recall, low precision (depending on the domain)
error analysis:
• errors are defined
• unknown words that are not defined as errors are term candidates
• based on words and rules
• consider terminology work
• high precision, recall is dependent on data
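The language-analysis approach can be sketched directly; the dictionary here is a tiny stand-in, which is exactly why this approach over-flags legitimate domain terms (high recall, low precision).

```python
# Tiny illustrative dictionary; real systems use full lexicons.
DICTIONARY = {"the", "cat", "sat", "on", "mat"}

def spell_check(tokens):
    """Language-analysis spell checking: every out-of-dictionary token
    is flagged - including correct product names and domain terms,
    which is the source of the false alarms."""
    return [tok for tok in tokens if tok.lower() not in DICTIONARY]

print(spell_check(["The", "catt", "sat", "on", "the", "XYZmat"]))
# ['catt', 'XYZmat']
```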
13. why work on terminology?
• avoid false alarms in spelling
• consistency
• less ambiguity
• translatability
• corporate wording
ultimate goal:
1 term - 1 meaning - 1 translation
14. reality: variants
• web server – web-server
• upload protection – upload-protection
• timeout – time out
• Reset – ReSet
• sub station – sub-station
15. term variants
– orthographic variants
  - hyphen, blank, case: term bank, termbank
– semi-orthographic variants
  - number: 6-digit, six-digit
  - trademark: MyCompany™, MyCompany
– syntactic variants
  - preposition: oil level, level of oil
  - gerund/noun: call center, calling center
– synonyms
  - "classical": vehicle, car
– language-specific variants
  - (e.g. German linking elements, "Fugenelemente"; Japanese Katakana variants)
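Orthographic variants, the first category above, can be detected by normalizing away case, hyphens, and blanks; this sketch covers only that category, not syntactic variants or synonyms.

```python
import re

def normalize(term):
    """Collapse orthographic variation (case, hyphen, blank) so that
    'term bank', 'Term-Bank' and 'termbank' map to one key."""
    return re.sub(r"[-\s]", "", term.lower())

def are_variants(a, b):
    return normalize(a) == normalize(b)

print(are_variants("web server", "Web-Server"))  # True
print(are_variants("timeout", "time out"))       # True
print(are_variants("web server", "file server")) # False
```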
16. how to get consistent terminology
• author/company defines the term bank
• list of deprecated terms
  deprecated term: vehicle → approved term: car
• list of approved terms
• automatic identification of variants
  approved term: SWASSNet User
  deprecated terms: SWASSNet user, SWASS-Net User
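A deprecated-term check against such a term bank might look like this; the term bank contents are the slide's examples, and the matching is deliberately naive (substring-based, no inflection handling).

```python
# Illustrative term bank: lowercased deprecated form -> approved form.
TERM_BANK = {
    "vehicle": "car",
    "swassnet user": "SWASSNet User",  # case variant of approved term
}

def check_terms(text):
    """Return (deprecated, approved) pairs found in the text; a match
    of the exact approved spelling suppresses the flag."""
    hits = []
    for deprecated, approved in TERM_BANK.items():
        if deprecated in text.lower() and approved not in text:
            hits.append((deprecated, approved))
    return hits

print(check_terms("Park the vehicle near the SWASSNet user terminal."))
```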
19. NLP for terminology
• NLP methods for term extraction
– corpus analysis (morphology, POS, NER)
– information extraction (potential product names)
– ontologies (e.g. semantic groups)
• NLP methods for setting up a term database
– morphology (finding the base form)
– POS
• NLP methods for term checking
– variants
– similar words
– inflection
20. approaches to grammar checking
descriptive grammar:
• definition of correct grammar
• e.g. HPSG, LFG, chunk grammars, statistical grammars
• anything that's not analyzable must be a grammar error
• preconditions:
  – grammar with large coverage
  – large dictionaries
  – robust, but not too robust parsing
  – efficient parsing methods
• high recall, low precision
error grammar:
• implementation of grammar errors
• preconditions:
  – work with error corpora
  – error grammar with a high number of error types
• "deepness" of analysis varies with the type of error to be described
• high precision, recall is based on the number of rules
21. grammar rules, examples
• subject-verb agreement:
  – Check if instructions are programmed in such a way that a scan never finish.
  – When the operations is completed, the return to home completes.
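A toy agreement check shows the error-grammar idea on pre-analyzed input; the (token, tag, number) triples and the adjacency heuristic are illustrative, since real checkers work on full parses rather than adjacent word pairs.

```python
def agreement_errors(tagged):
    """Flag adjacent noun-verb pairs whose number features disagree.
    Input: list of (word, pos_tag, number) triples from prior analysis."""
    errors = []
    for (w1, t1, n1), (w2, t2, n2) in zip(tagged, tagged[1:]):
        if t1 == "N" and t2 == "V" and n1 != n2:
            errors.append((w1, w2))
    return errors

# "... a scan never finish": singular subject, plural verb form.
tagged = [("scan", "N", "singular"), ("finish", "V", "plural")]
print(agreement_errors(tagged))  # [('scan', 'finish')]
```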
22. grammar rules, examples
• a/an distinction:
  – a isolating transformer
  – an program
• wrong verb form:
  – it cannot communicates with them
  – IP can be automatically get
23. example grammar rule*
• write_words_together
  @can ::= [ TOK "^(can)$" MORPH.READING.MCAT "^Verb$" ];
  – The application can not start.
  – The application can tomorrow not start.
  TRIGGER(80) == @can^1 [@adv]* 'not'^2
    -> ($can, $not)
    -> { mark: $can, $not;
         suggest: $can -> '', $not -> 'cannot';
       }
  – Branch circuits can not only minimize system damage but can interrupt the flow of fault current
  NEG_EV(40) == $can 'not' 'only' @verbInf []* 'but';
* implemented in Acrolinx
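The trigger/negative-evidence pattern above can be approximated with plain regular expressions; the Acrolinx rule language itself is proprietary, and this sketch crudely approximates the [@adv]* slot with "-ly" words, so it will miss the "can tomorrow not start" case.

```python
import re

# Trigger: "can [adverbs] not" -> suggest "cannot".
TRIGGER = re.compile(r"\bcan\b(?:\s+\w+ly)*\s+not\b", re.IGNORECASE)
# Negative evidence: "can not only ... but" is legitimate, suppress the flag.
NEG_EVIDENCE = re.compile(r"\bcan\s+not\s+only\b.*\bbut\b", re.IGNORECASE)

def check_can_not(sentence):
    if TRIGGER.search(sentence) and not NEG_EVIDENCE.search(sentence):
        return "suggest: cannot"
    return None

print(check_can_not("The application can not start."))
# suggest: cannot
print(check_can_not(
    "Branch circuits can not only minimize system damage "
    "but can interrupt the flow of fault current"))
# None
```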
24. style - controlled language
• controlled languages
• AeroSpace and Defence Industries Association of Europe (ASD)
ASD-STE100 (simplified English)
• Caterpillar Technical English (CTE)
• disadvantages:
  • very restrictive
  • low acceptance by users
25. style – moderately controlled language
• rules define errors (like grammar rules)
• rules (and instructional information) are
defined by authors
• implementation in authoring support systems
• high acceptance
• good usability
26. style guidelines
• different for different usages
– text type
• (e.g., press release – technical documentation)
– domain
• (e.g., software – machines)
– readers
• (e.g., end users – service personnel)
– authors
• (e.g., Germans tend to write long sentences)
29. style rule examples: MT pre-editing
• avoid_nested_sentences
• avoid_ing_words
• keep_two_verb_parts_together
• avoid_parenthetical_expressions
dependent on the MT system and the language pair
30. automatic suggestions for style rules
– replacement of words or phrases
– replacement with the correct uppercase or lowercase writing
– replacement of words using the correct inflection
– generation of whole sentences (e.g. passive → active) requires semantic analysis and generation and is therefore not (yet) possible
31. example style rule*
• avoid_future_tense
  /* Example: ".. It will be necessary .." */
  TRIGGER(80) == @will^1 [-@comma]* @verbInf^2
    -> ($will, $verbInf)
    -> { mark: $will, $verbInf; }
  /* Example: ".. The router services will be offered in the future .." */
  NEG_EV(40) == $will []* @in @det @time;
* implemented in Acrolinx
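As with the grammar rule, a regex sketch can approximate this style rule; the real rule language and its @verbInf / @time categories are Acrolinx-specific, so the patterns below stand in for them with simple word matches.

```python
import re

# Trigger: "will <verb>" is flagged as future tense.
TRIGGER = re.compile(r"\bwill\s+(\w+)\b", re.IGNORECASE)
# Negative evidence: an explicit time expression legitimizes the future.
NEG_EVIDENCE = re.compile(r"\bwill\b.*\bin\s+the\s+future\b", re.IGNORECASE)

def check_future_tense(sentence):
    """Return the marked (will, verb) pair, or None if suppressed."""
    m = TRIGGER.search(sentence)
    if m and not NEG_EVIDENCE.search(sentence):
        return ("will", m.group(1))
    return None

print(check_future_tense("It will be necessary to restart."))
# ('will', 'be')
print(check_future_tense("The router services will be offered in the future."))
# None
```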
32. consistent phrasing
• Use the same phrase for the same meaning.
• Examples:
– Congratulations on acquiring your new wearable digital
audio player
– Congratulations, you have acquired your new wearable
digital audio player!
– Dear Customer, congratulations on purchasing the new
wearable digital audio player!
33. Acrolinx intelligent reuse™
[architecture diagram: an Acrolinx server combining Terminology, Writing Standards, Grammar & Spelling, and Intelligent Reuse components, connected to a content repository and a translation/reuse repository, producing sentence clusters]
micro-clustering
[cluster diagram: near-identical sentences grouped together, e.g. "the cat sat on the mat" / "the cat sat on the carpet" / "The cat slept on the sofa", versus unrelated sentences such as "The dog sat on the rug" / "The elk sat on the moss"]
review and release
[cluster diagram: clustered sentences are reviewed before release, e.g. "Fish swam in the blue water" / "The fish swam in the green water"]
redundancy and quality
[cluster diagram: clusters expose redundant and inconsistent variants, e.g. "the cat sat on the mat" / "the cat sat on the malt" / "The cat ate on the mat"]
filters
[cluster diagram: filters group variants that differ only in case, punctuation, or a single word, e.g. "the cat sat on the mat" / "The cat sat on the mat" / "the cat sat on the mat." / "the cat sat on the doormat"]
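Acrolinx's actual clustering is proprietary; a generic near-duplicate clustering of sentences can nevertheless be sketched with a character-level similarity ratio from the standard library. The threshold and the greedy single-pass strategy are arbitrary choices for this sketch.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Near-duplicate test using a character-level similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def micro_cluster(sentences, threshold=0.85):
    """Greedy single-pass clustering: each sentence joins the first
    cluster whose seed sentence it resembles, else starts a new one."""
    clusters = []
    for s in sentences:
        for cluster in clusters:
            if similar(s, cluster[0], threshold):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

sents = [
    "the cat sat on the mat",
    "The cat sat on the mat",
    "the cat sat on the malt",
    "Fish swam in the blue water",
]
print(micro_cluster(sents))
```

The three "cat" variants land in one cluster and the "fish" sentence in another, mirroring how clustering surfaces redundant near-duplicates for consistent phrasing.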
41. summary
• The authoring process is challenging
– correctness
– consistency
– understandability
– translatability
• It can be effectively supported by NLP-enhanced tools
42. Thank you!
Melanie Siegel
Hochschule Darmstadt – University of Applied Sciences
melanie.siegel@h-da.de