1. A5 - Specialized Dictionaries
Alberto Simões
ambs@ilch.uminho.pt
EMLex 2012/2013
Erlangen
Alberto Simões A5 - Specialized Dictionaries 1/138
2. Part I
Terminology vs Lexicography
3. Overview
1 Term Orientation vs Concept Orientation
2 Classification Systems
What are Classification Systems?
Folksonomies
Taxonomies
Thesauri
Ontologies
3 Further Reading
4. Term vs Concept Orientation
Most dictionaries are organized by terms:
users look up entries by the word;
entries describe all possible senses;
the same explanation can appear for different words
(synonyms);
Most terminologies are organized by concepts:
users look up entries by an instance word;
but concepts exist organized as a single block;
each concept is represented only once;
all synonyms (and antonyms) are presented together;
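The two organizations can be sketched as data structures. This is only an illustration: the words, definitions and field names below are invented and do not come from any real dictionary or terminology.

```python
# Term-oriented: one entry per word; each entry lists all senses of that word.
# Synonyms repeat the same explanation under different headwords.
dictionary = {
    "car": ["a road vehicle with an engine and four wheels"],
    "automobile": ["a road vehicle with an engine and four wheels"],
    "bank": ["a financial institution", "the land alongside a river"],
}

# Concept-oriented: each concept is stored once, with all its terms together.
concepts = [
    {"id": "C1",
     "definition": "a road vehicle with an engine and four wheels",
     "terms": ["car", "automobile"]},
    {"id": "C2", "definition": "a financial institution", "terms": ["bank"]},
    {"id": "C3", "definition": "the land alongside a river", "terms": ["bank"]},
]

# Users still look up by word, so build an index from each term to its concepts.
term_index = {}
for concept in concepts:
    for term in concept["terms"]:
        term_index.setdefault(term, []).append(concept["id"])


def lookup(term):
    """Return the ids of the concepts reachable from a given word."""
    return term_index.get(term, [])
```

In the concept-oriented structure the definition of "car" and "automobile" exists only once, while an ambiguous word such as "bank" leads to two separate concepts.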
5. Term vs Concept Orientation
Term Orientation: Dictionary
Definition from Dictionary.com (May 3rd, 2013)
6. Term vs Concept Orientation
Concept Orientation: Terminology
Entry from DeCS - Health Sciences Descriptors (May 3rd, 2013)
7. Classification Systems
Humans tend to organize;
“disorganization is a kind of organization”
This organization is usually done by classification;
Classification can be as simple as tagging an object;
“this is the pile of important documents, that of the
unimportant ones”
Classification is used everywhere!
8. Where are classification systems used?
Internet Social Networks (tagging);
Libraries (ex. Universal Decimal Classification);
Medicine (ex. Unified Medical Language System)
Chemistry (ex. Periodic Table);
Geography (ex. Geographic Taxonomy);
Biology (ex. Linnaean taxonomy, Protein classification, . . . );
9. Classification Systems Classes
Classification Systems can also be classified;
One way to classify classification systems is by their ability to
include properties and relations between the classified objects;
We will discuss four types of classification systems:
Folksonomies
Taxonomies
Thesauri
Ontologies
11. Folksonomies
A folksonomy is a system of classification derived from
the practice and method of collaboratively creating and
managing tags to annotate and categorize content;
this practice is also known as collaborative tagging, so-
cial classification, social indexing, and social tagging.
Folksonomy, a term coined by Thomas Vander Wal, is
a portmanteau of folk and taxonomy.
Folksonomy (Wikipedia, 2012)
12. Folksonomies: How they work
Other classification techniques often define someone or some
group in charge of creating the classification system structure
(authority);
This group sees the world from a specific point of view,
which others may or may not share;
Folksonomies solve this problem: power to the people;
Instead of partitioning the world according to one particular
view, they let users present facets of objects;
Users assign keywords (or tags, or labels) to objects
(individuals);
These keywords can be searched, indexed, and mathematical
models can be applied to this data.
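A minimal sketch of the tagging mechanism described above, assuming tag assignments are stored as (user, object, tag) triples. The users, objects and tags are made up for illustration.

```python
from collections import Counter, defaultdict

# (user, object, tag) assignments; any user may attach any tag to any object.
assignments = [
    ("ana", "photo1", "beach"),
    ("rui", "photo1", "summer"),
    ("rui", "photo1", "beach"),
    ("ana", "photo2", "summer"),
]

# Inverted index: tag -> set of objects carrying that tag.
index = defaultdict(set)
for _user, obj, tag in assignments:
    index[tag].add(obj)


def search(tag):
    """Objects that at least one user labelled with the tag."""
    return sorted(index.get(tag, set()))


# Tag frequencies: the raw data behind tag clouds and statistical models.
tag_counts = Counter(tag for _user, _obj, tag in assignments)
```

There is no predefined vocabulary here: the set of valid tags is simply whatever users have typed so far.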
13. Folksonomies
An empirical analysis of the complex dynamics of tag-
ging systems, published in 2007, has shown that con-
sensus around stable distributions and shared vocab-
ularies does emerge, even in the absence of a central
controlled vocabulary. For content to be searchable,
it should be categorized and grouped. While this was
believed to require commonly agreed on sets of con-
tent describing tags (much like keywords of a journal
article), recent research has found that, in large folk-
sonomies, common structures also emerge on the level
of categorizations. Accordingly, it is possible to devise
mathematical models that allow for translating from
personal tag vocabularies (personomies) to the vocab-
ulary shared by most users.
Folksonomy (Wikipedia, 2012)
14. Folksonomies: example
Top categories in the Portuguese Wikipedia (single words):
375 Sociologia 383 Ponerinae
395 Afro-brasileiros 404 Drilliidae
413 Filosofia 415 Coleophoridae
424 Psicologia 428 Terebridae
445 Clathurellinae 445 Digimons
445 Teuto-brasileiros 451 Apiaceae
483 Asteroides 486 Luso-brasileiros
492 Acaena 526 Rubiaceae
537 Dolichoderinae 730 Agonoxenidae
735 Acalypha 753 Mangeliinae
762 Crambidae 787 Poaceae
808 Coletâneas 824 Theraphosidae
854 Myrmicinae 962 Fabaceae
974 Formicidae 1065 Agrostis
1096 Formicinae 1177 Aloe
1328 Conus 1338 Ítalo-brasileiros
1395 Asteraceae 1433 Coleophora
1514 Arctiidae 1516 Alchemilla
1689 Turridae 1879 Camponotus
2163 Acer 2744 Acacia
15. Folksonomies: Pros and Cons
Pros:
doesn’t require expert cataloguers, authoritative sources or
expert users;
capability of matching users’ real needs and language:
(inclusive — includes everyone’s words and vocabulary)
controlled vocabularies are not practically and economically
extensible, while folksonomies are;
a low-investment bridge between personal classification and
shared classification;
easy to use and quick to classify big quantities of individuals;
not all the limitations of folksonomies are defects :-)
16. Folksonomies: Pros and Cons
Cons:
by itself, the vocabulary is flat;
(there is no structure, just terms)
not usable for small collections or those with few users;
(statistical methods depend on population size)
without technological support, vocabularies become inexact or
ambiguous;
findability is very low: folksonomies are great for
serendipity and browsing, but not suited to targeted
search;
18. Taxonomies
Taxonomy is the science of identifying and naming
species, and arranging them into a classification. The
field of taxonomy, sometimes referred to as “biological
taxonomy”, revolves around the description and use of
taxonomic units, known as taxa. A resulting taxonomy
is a particular classification, arranged in a hierarchical
structure or classification scheme.
Taxonomy (Wikipedia, 2012)
19. Taxonomies
taxonomy [tækˈsɒnəmɪ]
n.
(Life Sciences & Allied Applications / Biology)
the branch of biology concerned with the
classification of organisms into groups based on
similarities of structure, origin, etc.
the practice of arranging organisms in this way.
the science or practice of classification.
[from French taxonomie, from Greek taxis “order” + –nomy]
Collins English Dictionary – Complete and Unabridged
© HarperCollins Publishers 1991, 1994, 1998, 2000, 2003
20. Taxonomies: How they work
Used to partition the world into disjoint classes or groups;
Each group is, again, partitioned into sub-classes or
sub-groups;
And sub-classes are partitioned, and. . .
Individuals are classified in one leaf category;
(a classification is a path in the tree)
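The idea that a classification is a path in the tree can be sketched with a single-parent table. The class codes are taken from the Universal Decimal Classification excerpts shown later in these slides; the table layout and the function name are assumptions made for the example.

```python
# Each class has exactly one parent (None marks a top-level class).
parent = {
    "0": None,
    "00": "0",
    "004": "00",
    "004.4": "004",
    "004.42": "004.4",
    "004.43": "004.4",
    "004.8": "004",
}


def path(leaf):
    """Walk from a class up to the root and return the path, root first."""
    steps = []
    node = leaf
    while node is not None:
        steps.append(node)
        node = parent[node]
    return list(reversed(steps))
```

Because every class has one parent, an individual placed in a leaf such as 004.42 has exactly one path, and therefore exactly one classification.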
22. Taxonomies: examples used everyday
Main index (top level) of Universal Decimal Classification:
0 Generalities
(now Science and knowledge. Organization. Computer Science.
Information. Documentation. Librarianship. Institutions.
Publications)
1 Philosophy. Psychology
2 Religion. Theology
3 Social Sciences
4 Vacant
5 Mathematics and natural sciences
6 Applied sciences. Medicine. Technology
7 The arts. Recreation. Entertainment. Sport
8 Language. Linguistics. Literature
9 Geography. Biography. History
23. Taxonomies: examples used everyday
8 Language. Linguistics. Literature
80 General questions [. . . ] linguistics and literature. Philology
81 Linguistics and languages
81-11 Schools and trends in linguistics
81-13 Methodology of linguistics. Methods and means
811 Languages
811.1/.2 Indo-European Languages
811.3 Dead languages of unknown affiliation. Caucasian languages
811.4 Afro-Asiatic, Nilo-Saharan, Congo-Kordofanian, Khoisan
languages
811.5 Ural-Altaic, Palaeo-Siberian, Eskimo-Aleut, Dravidian and
Sino-Tibetan languages. Japanese. Korean. . .
811.6 Austro-Asiatic languages. Austronesian languages
811.7 Indo-Pacific (non-Austronesian) languages. Australian
languages
811.8 American indigenous languages
811.9 Artificial languages
82 Literature
24. Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
1 Philosophy. Psychology
2 Religion. Theology
3 Social Sciences
5 Mathematics and natural sciences
6 Applied sciences. Medicine.
Technology
7 The arts. Recreation. Entertainment.
Sport
8 Language. Linguistics. Literature
9 Geography. Biography. History
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
25. Taxonomies: Class Task
5 Mathematics, Natural Sciences
51 Mathematics
519 (no name, virtual class)
519.6 Computational mathematics.
Numerical Analysis
University of Minho Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
26. Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
00 Prolegomena. Fundamentals of
knowledge and culture.
Propaedeutics
004 Computer science and
technology. Computing. Data
processing
004.4 Software
004.42 Computer
programming. Computer
programs
Aveiro University Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
27. Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
00 Prolegomena. Fundamentals of
knowledge and culture.
Propaedeutics
004 Computer science and
technology. Computing. Data
processing
004.4 Software
004.43 Computer Languages
Porto Polytechnic Institute Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
28. Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
00 Prolegomena. Fundamentals of
knowledge and culture.
Propaedeutics
004 Computer science and
technology. Computing. Data
processing
004.8 Artificial intelligence
Algarve’s University Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
29. Taxonomies: Pros and Cons
Pros:
rigid tree, makes it easy to process;
suitable for some areas (like life classification);
the hierarchy helps searching for terms (abstraction);
Cons:
rigid tree, makes it difficult to classify;
(different people classify objects differently)
the structure is defined by some authority group;
(for example, the UDC Consortium)
forces the subdivision of the world;
(categories are single-parental)
as a workaround, people classify in more than one category;
(so the rigid tree, listed above as a pro, turns into a con)
31. Thesauri
A thesaurus is a reference work that lists words
grouped together according to similarity of meaning
(containing synonyms and sometimes antonyms), in
contrast to a dictionary, which contains definitions and
pronunciations.
Thesauri (Wikipedia, 2012)
32. Thesauri
In Information Science, Library Science, and Informa-
tion Technology, specialized thesauri are designed for
information retrieval. They are a type of controlled
vocabulary, for indexing or tagging purposes. Such a
thesaurus can be used as the basis of an index for on-
line material. [. . . ] Unlike a literary thesaurus, these
specialized thesauri typically focus on one discipline,
subject or field of study.
Thesauri (Wikipedia, 2012)
33. Thesauri: How they work!
Thesauri for information retrieval are typically con-
structed by information specialists, and have their own
unique vocabulary defining different kinds of terms and
relationships.
Terms are the basic semantic units for conveying con-
cepts. They are usually single-word nouns, since nouns
are the most concrete part of speech. [. . . ] When a
term is ambiguous, a “scope note” can be added to
ensure consistency, and give direction on how to inter-
pret the term.
“Term relationships” are links between terms. These
relationships can be divided into three types: hierar-
chical, equivalency or associative.
Thesauri (Wikipedia, 2012)
34. Thesauri: How they work!
Hierarchical relationships are used to indicate terms
which are narrower and broader in scope. A “Broader
Term” (BT) or hyperonym is a more general term.
Reciprocally, a “Narrower Term” (NT) or hyponym is
a more specific term.
BT and NT are reciprocals; a broader term necessarily
implies at least one other term which is narrower. BT
and NT are used to indicate class relationships, as well
as part-whole relationships (meronyms and holonyms).
Thesauri (Wikipedia, 2012)
35. Thesauri: How they work!
Example of a thesaurus with hierarchical relations.
Feline
NT Cat
NT Panther
Cat
BT Feline
Panther
BT Feline
NT Pink Panther
Pink Panther
BT Panther
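Since BT and NT are reciprocals, only one direction needs to be stored; the other can be derived. A small sketch using the terms of the example above (the function names are assumptions):

```python
# Broader-term links, taken from the example above.
bt = {
    "Cat": ["Feline"],
    "Panther": ["Feline"],
    "Pink Panther": ["Panther"],
}


def narrower_terms(bt_links):
    """Derive NT links as the reciprocal of BT links."""
    nt = {}
    for term, broader in bt_links.items():
        for b in broader:
            nt.setdefault(b, []).append(term)
    return nt


def all_broader(term, bt_links):
    """Every term above `term` in the hierarchy, nearest first."""
    result = []
    for b in bt_links.get(term, []):
        result.append(b)
        result.extend(all_broader(b, bt_links))
    return result
```

Following BT links upwards is what lets a search for Feline also retrieve material indexed under Pink Panther.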
36. Thesauri: How they work!
The equivalency relationship is used primarily to con-
nect synonyms and near-synonyms. “Use” (USE) and
“Used For” (UF) indicators are used when an autho-
rized term is to be used for another, unauthorized,
term. Unauthorized terms are often called “entry vo-
cabulary”, “entry points”, “lead-in terms”, or “non-
preferred terms”, pointing to the authorized term (also
referred to as the “preferred term” or “descriptor”)
that has been chosen to stand for the concept.
Thesauri (Wikipedia, 2012)
37. Thesauri: How they work!
Example of a thesaurus with equivalency relations.
Parliament
USE European Parliament
Parliament of Europe
USE European Parliament
European Parliament
UF Parliament
UF Parliament of Europe
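The equivalency relation can be sketched as a mapping from non-preferred terms to their descriptor, with UF derived as the reciprocal of USE. The function names are assumptions; the terms come from the example above.

```python
# USE links from the example above: non-preferred term -> preferred term.
use = {
    "Parliament": "European Parliament",
    "Parliament of Europe": "European Parliament",
}


def preferred(term):
    """Map any entry term to its descriptor; descriptors map to themselves."""
    return use.get(term, term)


def used_for(descriptor):
    """Derive the UF list of a descriptor as the reciprocal of USE."""
    return sorted(t for t, d in use.items() if d == descriptor)
```

An indexing or search tool would call something like preferred() on every term typed by the user, so that all documents about the concept end up under one descriptor.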
38. Thesauri: How they work!
Associative relationships are used to connect two
related terms whose relationship is neither hierarchical
nor equivalent. This relationship is described by the
indicator “Related Term” (RT). Associative relation-
ships should be applied with caution, since excessive
use of RTs will reduce specificity in searches.
Consider the following: if the typical user is searching
with term “A”, would they also want resources tagged
with term “B”? If the answer is no, then an associative
relationship should not be established.
Thesauri (Wikipedia, 2012)
39. Thesauri: How they work!
Example of a thesaurus with associative relations.
Douro
BT River
RT Porto
RT Gaia
River
NT Douro
Gaia
BT Portugal
Porto
BT Portugal
Portugal
NT Porto
NT Gaia
City
NT Gaia
NT Porto
Note: RT is not symmetrical: a RT b ⇏ b RT a.
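A short sketch of how one might list the RT links that have no reciprocal, assuming the reading of the example in which Douro lists Porto and Gaia as related terms while Porto and Gaia list no RT of their own. The function names are hypothetical.

```python
# RT links as read from the example above, stored as directed pairs.
rt = {("Douro", "Porto"), ("Douro", "Gaia")}


def missing_reciprocals(links):
    """RT links (a, b) for which (b, a) is not recorded."""
    return sorted((a, b) for a, b in links if (b, a) not in links)


def related(term, links):
    """Terms reachable from `term` through a recorded RT link."""
    return sorted(b for a, b in links if a == term)
```

With directed storage, a search starting at Douro suggests Porto and Gaia, but a search starting at Porto does not suggest Douro.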
40. Thesauri: a simple example
[Figure: graph of the terms Quality, Asia, China, Food Safety, Contamination, Food and Food Contamination, connected by BT, NT, RT and USE links]
Extract of Food Safety relationships in AGROVOC
41. Thesauri: Pros and Cons
Pros:
More flexible than Taxonomies;
(they do not require a tree and work as a graph)
Have other types of relationships than simple hierarchy;
(like the associative relation)
There is an ISO standard that documents their correct use;
Standard defines mathematical properties for relationships;
Cons:
Standardized types of relationships are somewhat limited;
(same relation for hyperonyms and meronyms)
(non-hierarchical relation is too vague: related)
No support for relationships with non-terms (features);
43. Ontologies
Ontology is the philosophical study of the nature of
being, existence, or reality as such, as well as the ba-
sic categories of being and their relations. Tradition-
ally listed as a part of the major branch of philosophy
known as metaphysics, ontology deals with questions
concerning what entities exist or can be said to exist,
and how such entities can be grouped, related within a
hierarchy, and subdivided according to similarities and
differences.
Ontology (Wikipedia, 2012)
44. Ontologies
In computer science and information science, an ontol-
ogy formally represents knowledge as a set of concepts
within a domain, and the relationships between those
concepts. It can be used to reason about the entities
within that domain and may be used to describe the
domain.
Ontology: information science (Wikipedia, 2012)
45. Ontologies
Contemporary ontologies share many structural simi-
larities, regardless of the language in which they are
expressed. Most ontologies describe individuals (in-
stances), classes (concepts), attributes, and relations.
Ontology: information science (Wikipedia, 2012)
46. Ontologies
Individuals are the instances or objects (the basic or
“ground level” objects).
Ontology: information science (Wikipedia, 2012)
Unlike any of the other classification systems, Ontologies clearly
include the individuals (or objects being classified) in the structure.
48. Ontologies
Classes are sets, collections, concepts, [. . . ] or kinds
of things.
Ontology: information science (Wikipedia, 2012)
Classes are the concepts used in Thesauri and Taxonomies. They
can be super-classes, including sub-classes, or can just include
individuals (low-level classes, which would be leaves if we were talking
about taxonomies).
50. Ontologies
Attributes are aspects, properties, features, character-
istics, or parameters that objects (and classes) can
have.
Ontology: information science (Wikipedia, 2012)
Attributes are properties of individuals or classes. If the individual
is a book in a library, a property can be the number of pages, the
title, the author. For a class, like “mammal”, an attribute can be a
reference to its fur. Attributes are usually specified as a pair, the
name of the attribute and its value.
52. Ontologies
Relations are ways in which classes and individuals can
be related to one another.
Ontology: information science (Wikipedia, 2012)
Relations are similar to the relations used in Thesauri, but unlike
them, there isn’t a list of valid relations. They can be the common
hierarchical relations, or the relation “eat” relating animals with
the animals they eat.
54. Ontologies
Function terms: complex structures formed from cer-
tain relations that can be used in place of an individual
term in a statement.
Ontology: information science (Wikipedia, 2012)
Suppose you are adding Portuguese rivers to an Ontology. One can
define a simple macro to add some default relations to the river:
River (name) ∼=
Term → name
Is a → river
Is at → Portugal
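The macro above can be sketched as a function that expands a name into its default relations. The triple representation and the relation spellings "is-a" and "is-at" are assumptions made for the example; Tejo is added only as a second sample river.

```python
def river(name):
    """Expand the macro River(name) into its default relations."""
    return [
        (name, "is-a", "river"),
        (name, "is-at", "Portugal"),
    ]


# The ontology is kept as a plain set of (subject, relation, object) triples.
ontology = set()
for name in ["Douro", "Tejo"]:
    ontology.update(river(name))
```

Adding a new Portuguese river then takes one call instead of one statement per relation.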
56. Ontologies
Restrictions: formally stated descriptions of what must
be true in order for some assertion to be accepted as
input.
Ontology: information science (Wikipedia, 2012)
We can enforce that the capital of a country is a city:
add (X capital-of Y ) iff X is-a City
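A minimal sketch of this restriction, assuming the ontology is a set of (subject, relation, object) triples. The exception class, the sample data and the function signature are hypothetical.

```python
class RestrictionError(ValueError):
    """Raised when an assertion violates a restriction."""


# Known facts; Lisboa is already classified as a City.
triples = {("Lisboa", "is-a", "City")}


def add(subject, relation, obj):
    """Accept `X capital-of Y` only if `X is-a City` is already known."""
    if relation == "capital-of" and (subject, "is-a", "City") not in triples:
        raise RestrictionError(f"{subject} is not known to be a City")
    triples.add((subject, relation, obj))
```

The restriction rejects the input instead of repairing it: the editor has to classify the subject as a City before stating that it is a capital.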
58. Ontologies
Rules: statements in the form of an antecedent-
consequent sentence that describe the logical infer-
ences that can be drawn from an assertion in a partic-
ular form.
Ontology: information science (Wikipedia, 2012)
On the other hand, if we trust whoever is editing the ontology, we can
automatically classify the capital as a city, and its country as a... country:
X capital-of Y ⇒ X is-a City ∧ Y is-a Country
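The rule can be sketched as a function that adds the inferred facts to a set of (subject, relation, object) triples. The representation and the function name are assumptions.

```python
def apply_capital_rule(triples):
    """X capital-of Y  =>  X is-a City and Y is-a Country."""
    inferred = set(triples)
    for subject, relation, obj in triples:
        if relation == "capital-of":
            inferred.add((subject, "is-a", "City"))
            inferred.add((obj, "is-a", "Country"))
    return inferred
```

Unlike a restriction, the rule never rejects input: it accepts the assertion and draws the consequences.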
60. Ontologies
Axioms: assertions (including rules) in a logical form
that together comprise the overall theory that the on-
tology describes in its domain of application.
Ontology: information science (Wikipedia, 2012)
Axioms differ from rules: they are tests that guarantee the ontology
structure, and are not used to infer new relations.
They assert, and can (and should) be used for consistency checking.
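A consistency check in this sense could look as follows. The axiom tested (every capital is a city) is borrowed from the restriction example earlier in these slides; the triple representation and function name are assumptions.

```python
def check_capitals(triples):
    """Return the capital-of assertions whose subject is not a City."""
    return sorted(
        (s, r, o) for s, r, o in triples
        if r == "capital-of" and (s, "is-a", "City") not in triples
    )
```

The check only reports violations; it neither adds nor removes anything from the ontology.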
62. Ontologies
Events: the changing of attributes or relations.
Ontology: information science (Wikipedia, 2012)
Similar to rules, but they react to events. For example, if the user adds
a feature stating that an individual lays eggs, classify it as
oviparous.
Note that the division into Rules, Axioms and Events is not
universal, and depends a lot on the application that is used to
support the ontology.
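The egg-laying example can be sketched with a simple callback mechanism. The class, attribute and handler names are hypothetical and do not correspond to any particular ontology tool.

```python
class Individual:
    def __init__(self, name):
        self.name = name
        self.features = set()
        self.classes = set()
        self.handlers = []  # callbacks run whenever a feature is added

    def add_feature(self, feature):
        """Record the feature, then notify every registered handler."""
        self.features.add(feature)
        for handler in self.handlers:
            handler(self, feature)


def classify_oviparous(individual, feature):
    """Event handler: an individual that lays eggs is oviparous."""
    if feature == "lays eggs":
        individual.classes.add("oviparous")


hen = Individual("hen")
hen.handlers.append(classify_oviparous)
hen.add_feature("lays eggs")
```

The classification happens at the moment the attribute changes, which is what distinguishes an event from a rule applied over the whole ontology.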
66. Ontologies: Pros and Cons
Pros:
More flexible than Thesauri;
(graph with ad-hoc relationships)
Lots of formalisms and standards (OWL, SKOS, . . . );
Lots of tools to edit (like Protégé);
Languages for querying and completion (like SPARQL);
Cons:
As a classification approach, requires an authority for its
definition, just like Taxonomies or Thesauri.
Complexity: not everybody is able to create a detailed
ontology.
67. Further Reading
Folksonomies:
Folksonomy Coinage and Definition
http://vanderwal.net/folksonomy.html
Folksonomies: A User-Driven Approach to Organizing Content
http://www.uie.com/articles/folksonomies/
Folksonomies: power to the people
http://www.iskoi.org/doc/folksonomies.htm
Folksonomies: Tidying up Tags?
http://www.dlib.org/dlib/january06/guy/01guy.html
Folksonomies - Cooperative Classification and Communication
Through Shared Metadata
http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html
68. Further Reading
Taxonomies:
Taxonomy
http://en.wikipedia.org/wiki/Taxonomy
Perspectives on Taxonomy, Classification, Structure and
Find-ability
http://www.serviceinnovation.org/included/docs/kcs_taxonomy.pdf
Universal Decimal Classification
http://www.udcc.org/udcsummary/php/index.php
Thesauri:
Thesaurus
http://en.wikipedia.org/wiki/Thesaurus
Thesaurus principles and practice
http://www.willpowerinfo.co.uk/thesprin.htm
69. Further Reading
Ontologies:
Ontology (information science)
http://en.wikipedia.org/wiki/Ontology_(information_science)
Protégé Ontology Editor
http://protege.stanford.edu/
OWL Web Ontology Language
http://www.w3.org/TR/owl-features/
SPARQL Query Language for RDF
http://www.w3.org/TR/rdf-sparql-query/
71. Overview
4 How translation works
5 The role of Terminology on Translation
6 Translation Software
Standard translation software
Standard terminology management software
73. How Translation Works
Manual Translation
The translator uses resources such as dictionaries and terminologies, but
searches them manually. This was the usual way of translating in the last century.
Computer Assisted Translation
The translator uses tools (CAT tools) that support the translation process. They
help the translator reuse previous translations, integrate with terminologies,
and deal with different file formats.
Exploratory Translation
Using machine translation tools, like Google Translate, to get a quick
translation and understand a text. Not really a professional translation
process.
Machine Translation
Computer systems that translate text using different techniques, from
statistical information to translation rules. Quality has been rising in
recent years, but it is still far from the result of professional translation work.
74. Computer Assisted Translation
CAT tools translation process:
1 The document is opened in the CAT tool;
2 The first sentence is extracted and presented for translation;
3 The sentence is looked up in a database of previously translated
sentences, searching for identical or similar sentences (fuzzy matching);
4 If a match is found, its translation is proposed (exact or fuzzy);
5 A terminology database is queried to check whether the
sentence includes relevant terms to be translated;
6 The translator reviews the translation;
7 The system saves the translation in the database of translations;
8 The system saves the translation in the translated document;
9 The next sentence is extracted, and the process returns to step 3.
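The loop above can be sketched in a few lines. This is not how any specific CAT tool is implemented: fuzzy matching is approximated here with Python's difflib.SequenceMatcher, the 0.75 threshold is an arbitrary assumption, term lookup is a naive substring test, and the human translator is modelled as a `review` callback.

```python
from difflib import SequenceMatcher


def best_match(sentence, memory, threshold=0.75):
    """Return (source, target, score) of the closest stored sentence, or None."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, sentence, source).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


def find_terms(sentence, termbase):
    """Terminology entries whose source term occurs in the sentence (naive)."""
    lowered = sentence.lower()
    return {term: tr for term, tr in termbase.items() if term in lowered}


def translate_document(sentences, memory, termbase, review):
    """Steps 2-9: propose, check terms, let the translator review, store."""
    translated = []
    for sentence in sentences:
        match = best_match(sentence, memory)
        proposal = match[1] if match else ""
        terms = find_terms(sentence, termbase)
        final = review(sentence, proposal, terms)  # the human translator
        memory[sentence] = final      # step 7: update the translation memory
        translated.append(final)      # step 8: update the translated document
    return translated
```

Because every reviewed sentence is written back to the memory, later sentences in the same document can already benefit from earlier ones.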
76. Translation Memories
Databases of translations;
Store sentences in two or more languages;
Grow along with the translator’s work;
Can be shared between translators working on the same project;
Some big companies make their TM available to contracted
translators in order to guarantee homogeneity in their
translations.
77. Terminology and Translation
Translating terminology takes up to 40% of translation
time:
Translators are often unfamiliar with technical areas;
Translators need to understand the term being translated;
Researching a specific area takes time;
Terminologies reduce the time spent researching term translations.
Terminology helps with the comprehension of concepts:
There is no way to translate without understanding;
Terminologies can, and should, include explanations of terms;
Terminology helps with consistency and standardization:
Translate terms the same way throughout the document;
Translate terms the same way across all documents;
Companies, organizations and governmental institutions define
specific terminologies that translators should use;
78. Further Reading
CAT software
Discover the benefits of using a CAT Tool: How can CAT
Tools help you? by Jonathan T. Hine Jr.
http://www.translationzone.com/en/translator-solutions/translation-memory/cat-tools/
What is a translation memory? by SDL Trados.
http://www.translationzone.com/en/translator-solutions/translation-memory/default.asp
What is terminology? by SDL Trados.
http://www.translationzone.com/en/translator-solutions/terminology-management/default.asp
79. Further Reading
Terminology in Translation
Terminology in translation, by Thorsten Trippel (1999)
http://www.spectrum.uni-bielefeld.de/~ttrippel/terminology/node19.html
Terminology Management in Translation, by Gabriele
Sauberer (2009)
http://www.termnet.org/downloads/english/events/itaindia_workshop/GS_Terminology_Management_in_Translation.pdf
The Role of Terminology Management in Localization, by Sue
Ellen Wright (2006)
http://www.translationzone.com/en/images/sue_ellen_slides_tcm18-25819.pdf
Managing Terminology for Translation Using Translation
Environment Tools: Towards a Definition of Best Practices,
by Marta Gómez Palou Allard (2012)
http://www.ruor.uottawa.ca/fr/bitstream/handle/10393/22837/Gomez_Palou_Allard_Marta_2012_thesis.pdf
81. Overview
7 Corpora
Monolingual Corpora
Parallel Corpora
Corpora in the Web
8 The web as Corpora
Do-it-yourself Corpora
Basic Crawling Tools
82. What is a Corpus?
cor·pus /ˈkôrpəs/
Noun
1. A collection of written texts, esp. the entire
works of a particular author or a body of writing
on a particular subject;
2. A collection of written or spoken material in
machine-readable form, assembled for the
purpose of studying linguistic structures,
frequencies, etc.
Corpora is the plural of corpus.
83. Corpora Classification
Corpora are usually classified according to the number of
languages:
Monolingual Corpus:
documents are all written in one language;
(in some cases with more than one variant)
Multilingual Corpus:
documents are written in more than one language;
84. Corpora Classification
There are two especially relevant types of multilingual corpora:
Parallel Corpus:
a text placed alongside its translation or translations. Parallel
text alignment is the identification of corresponding blocks in
both halves of the parallel text.
Comparable Corpus:
is one which selects similar texts in more than one language or
variety. There is as yet no agreement on the nature of the
similarity, because there are very few examples of comparable
corpora
Expert Advisory Group on Language Engineering Standards Guidelines (1996)
85. Monolingual Corpora Examples
British National Corpus (http://www.natcorp.ox.ac.uk/)
The British National Corpus (BNC) is a 100 million word collection of
samples of written and spoken language from a wide range of sources,
designed to represent a wide cross-section of current British English, both
spoken and written.
CETEMPúblico (http://www.linguateca.pt/cetempublico/)
Corpus de Extractos de Textos Electrónicos MCT/Público is a corpus of
approximately 180 million words in European Portuguese. It was created
by the Computational Processing of Portuguese Project after an
agreement between the Ministry of Science and Technology and the
Público newspaper, in April, 2000.
CETENFolha (http://www.linguateca.pt/cetenfolha/)
Corpus de Extractos de Textos Electrónicos NILC/Folha de S. Paulo is a
corpus of approximately 24 million words in Brazilian Portuguese, created
by the Computational Processing of Portuguese Project using texts from
the newspaper Folha de S. Paulo, that are part of the NILC/São Carlos
86. Monolingual Corpora Examples
Russian National Corpus (http://ruscorpora.ru/en/index.html)
RNC is a corpus of the modern Russian language incorporating over 300
million words. The corpus of Russian is a reference system based on a
collection of Russian texts in electronic form.
Croatian National Corpus (http://www.hnk.ffzg.hr/cnc.htm)
HNK is a systematized collection of selected texts mainly written in
contemporary Croatian covering different media, genres, styles, fields and
topics. The Corpus is accompanied by additional linguistic and
non-linguistic data and stored in a database on our server which can be
accessed with the search client program Bonito.
KOTONOHA Corpus (http://www.kotonoha.gr.jp/)
The Balanced Corpus of Contemporary Written Japanese includes text
samples collected to be able to grasp an overall picture of the modern
Japanese written language and includes about 100 million words.
87. Parallel Corpora Examples
Aligned Hansards
(http://isi.edu/natural-language/download/hansard/)
The Aligned Hansards of the 36th Parliament of Canada contain 1.3 million
pairs of aligned text chunks (sentences or smaller fragments).
COMPARA (http://www.linguateca.pt/COMPARA/)
COMPARA is a bidirectional parallel corpus of English and Portuguese. In
other words, it is a type of database with original and translated texts in
these two languages that have been linked together sentence by sentence.
Europarl (http://www.statmt.org/europarl/)
The Europarl parallel corpus is extracted from the proceedings of the
European Parliament. It includes versions in 11 European languages:
Romance (French, Italian, Spanish, Portuguese), Germanic (English,
Dutch, German, Danish, Swedish), Greek and Finnish.
88. Parallel Corpora Examples
JRC-Acquis (http://langtech.jrc.it/JRC-Acquis.html)
The Acquis Communautaire is the total body of European Union (EU)
law applicable in the EU Member States. It is a collection of parallel
texts in 22 languages: Bulgarian, Czech, Danish, German, Greek, English,
Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian,
Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene
and Swedish.
OPUS (http://opus.lingfil.uu.se/)
OPUS is a growing collection of translated texts from the web. In the
OPUS project we try to convert and align free online data, to add
linguistic annotation, and to provide the community with a publicly
available parallel corpus.
Per-Fide (http://per-fide.di.uminho.pt/cquery)
The Per-Fide project aims to develop parallel corpora between
Portuguese and six other languages: English, Russian, French, Italian,
German and Spanish.
89. Querying Corpora
Using http://corpus.leeds.ac.uk/protected/query.html
Concordances of a single word: dog
Concordances for a sequence of words: big bang
Concordances for lemmas: [lemma="have"]
Concordances for part of speech: [pos="NNS"]
Combinations of the above:
[lemma="have"] dog
[lemma="be"] [lemma="have"]
Regular expressions can be used: [pos="N.*"] [pos="V.*"]
Multiple restrictions for same word:
[pos="N.*" & word="d.*"] [pos="V.*"]
Any single token (empty constraint): [pos="N.*"] [] [pos="V.*"]
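To make the query syntax above more concrete, here is a minimal sketch of how such token patterns can be matched against an annotated sentence. It is not how CQP is implemented; the token list, the tags and the function names (`token_matches`, `find_matches`) are assumptions made for illustration only.

```python
import re

# A hand-annotated sentence (hypothetical data): each token carries the
# attributes that a CQP-style query can test.
SENTENCE = [
    {"word": "the", "lemma": "the", "pos": "DT"},
    {"word": "dogs", "lemma": "dog", "pos": "NNS"},
    {"word": "have", "lemma": "have", "pos": "VBP"},
    {"word": "barked", "lemma": "bark", "pos": "VBN"},
]

def token_matches(token, constraint):
    """True if every attribute regex in the constraint matches the whole value.
    An empty constraint ({}) matches any token, like [] in the query language."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())

def find_matches(tokens, pattern):
    """Return the words of every window of tokens that satisfies the pattern."""
    n = len(pattern)
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append([t["word"] for t in window])
    return hits

# [pos="N.*"] [pos="V.*"]
noun_verb = find_matches(SENTENCE, [{"pos": "N.*"}, {"pos": "V.*"}])

# [pos="N.*" & word="d.*"] [] [pos="V.*"]
noun_any_verb = find_matches(
    SENTENCE, [{"pos": "N.*", "word": "d.*"}, {}, {"pos": "V.*"}]
)
```

Each constraint is a set of attribute/regular-expression pairs, so several restrictions on the same word are simply several entries in one dictionary.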
90. The Web as Corpora
To study “purposeful language behavior,” corpus
linguists require collections of authentic texts (spoken
and/or written). It is therefore not surprising that many
(corpus) linguists have recently turned to the World Wide
Web as the richest and most easily accessible source of
language material available. At the same time, for
language technologists, who have been arguing for long
that “more data is better data,” the WWW is a virtually
unlimited source of “more data.”
Wacky!
A Wacky Introduction
Silvia Bernardini, Marco Baroni and Stefan Evert
91. Do-it-yourself Corpora
The WWW has data on virtually any subject;
There is data in almost any language;
Therefore, it is possible to build custom corpora!
Collect text from the web. . .
. . . in a specific language. . .
. . . on the subject you want to study . . .
. . . and retrieve as much text as you need.
92. Basic Crawling Tools
There are standard download tools that follow HTML links,
and are able to download complete websites.
They are known as web spiders, or web robots;
Examples include “wget”, “wGetGUI” or “HTTrack”;
But you need to process the files yourself.
There are some projects that developed tools specifically for
corpus building.
The best known is “BootCaT”.
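As an illustration of what "processing the files yourself" can involve, the sketch below strips the markup from an already downloaded HTML page using only the Python standard library. The class and function names are hypothetical, and real pages usually also need boilerplate removal (menus, adverts) and language filtering, which this sketch does not attempt.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the text content of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# A tiny made-up page standing in for a file saved by a crawler.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Liver fibrosis</h1><p>A <b>short</b> page.</p></body></html>"
)
text = html_to_text(page)
```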
93. Further Reading
Corpora:
Corpus Creation - Handbook of NLP
http://cgi.cse.unsw.edu.au/~handbookofnlp/index.php?n=Chapter7.Chapter7
Building and Using Your Own Corpora
http://www.lancs.ac.uk/fss/courses/ling/corpus/blue/diy_top.htm
CQP Query Language Tutorial
http://cwb.sourceforge.net/files/CQP_Tutorial/
Web as Corpora:
Wacky! Working papers on the Web as Corpus
http://wackybook.sslmit.unibo.it/
Wacky Wiki
http://wacky.sslmit.unibo.it/doku.php
95. Overview
9 Corpora for Terminology Building
10 Obtaining candidate terms from Corpora
N-grams and Frequencies
Lexical Difference
Exploring Mutual Information
Morphology Constraints
11 Exploring a Tool: Term-o-Matic
96. Corpora for Terminology Building
Using one or more domain-specific texts is a relevant way to
understand what the terminology of that domain is;
Words in context give more information than words alone;
There is no automatic method to extract domain-specific
terminology from a domain-specific corpus;
Nevertheless, there are automatic methods to obtain candidate
terms, which can later be analysed and incorporated into a
terminology, or just discarded.
97. Word n-Grams
In the fields of computational linguistics and
probability, an n-gram is a contiguous sequence of n
items from a given sequence of text or speech.
The items in question can be phonemes, syllables,
letters, words or base pairs according to the application.
n-grams are collected automatically from a text or speech
corpus.
98. One-Grams
1-Grams are usually known as words/tokens. :-)
Peter Piper picked a peck
of pickled peppers.
A peck of pickled peppers
Peter Piper picked.
If Peter Piper picked
a peck of pickled peppers,
Where’s the peck of pickled
peppers Peter Piper picked?
peter 4
piper 4
picked 4
a 3
peck 4
of 4
pickled 4
...
...
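A minimal sketch of how such one-gram counts can be obtained. The tokenizer is an assumption: it lowercases the text and keeps runs of letters and apostrophes, which is only one of many possible tokenization choices.

```python
import re
from collections import Counter

RHYME = """Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?"""

def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Counter maps each distinct token to its number of occurrences.
unigram_counts = Counter(tokenize(RHYME))

for word, count in unigram_counts.most_common(5):
    print(word, count)
```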
99. Bigrams
All sequences of two words/tokens found in the text.
Peter Piper picked a peck
of pickled peppers.
A peck of pickled peppers
Peter Piper picked.
If Peter Piper picked
a peck of pickled peppers,
Where’s the peck of pickled
peppers Peter Piper picked?
peter piper 4
piper picked 4
picked a 2
a peck 3
peck of 4
of pickled 4
pickled peppers 4
...
...
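The same idea generalizes to any n. The sketch below uses a hypothetical helper, `ngrams`, and the same simple tokenization assumption (lowercase, letters and apostrophes only); it ignores sentence boundaries, so a few n-grams span two sentences.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over every contiguous sequence of n tokens."""
    return zip(*(tokens[i:] for i in range(n)))

text = (
    "Peter Piper picked a peck of pickled peppers. "
    "A peck of pickled peppers Peter Piper picked. "
    "If Peter Piper picked a peck of pickled peppers, "
    "Where's the peck of pickled peppers Peter Piper picked?"
)
tokens = re.findall(r"[a-z']+", text.lower())

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

for gram, count in bigram_counts.most_common(5):
    print(" ".join(gram), count)
```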
100. Top occurring trigrams for a real corpus
in accordance with 31148
referred to in 27581
the member states 16999
accordance with the 16535
of the european 14772
laid down in 13301
to in article 13211
having regard to 12588
regard to the 11416
member states shall 11392
in order to 10563
in the case 10029
the provisions of 9825
the case of 9575
provided for in 9560
the member state 9360
of the member 8656
the commission shall 8013
of this directive 6679
a member state 6306
on the basis 6292
the european parliament 6274
the basis of 6265
and in particular 6225
down in article 6200
of the community 5958
accordance with article 5758
to in paragraph 5690
opinion of the 5599
the opinion of 5191
the competent authorities 5074
for the purposes 5024
the purposes of 4946
with the procedure 4878
to the commission 4843
the european community 4834
101. n-grams frequency
n-Grams are usually computed together with their occurrence
count, or frequency;
In some situations, like statistical language models, other types
of measures are also computed (probability, i.e. relative
frequency; conditional probability; etc.);
One-gram frequencies don't help much with term candidate
extraction: they just say that a word is more or less
frequent.
n-grams for n ≥ 2 can help find sequences of words that
occur many times.
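The measures mentioned above can be sketched from plain counts. The estimates below are simple maximum-likelihood ratios (an assumption; real language models add smoothing), and the function names are hypothetical.

```python
from collections import Counter

tokens = "a peck of pickled peppers a peck".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def relative_frequency(word):
    """P(w), estimated as count(w) / N."""
    return unigrams[word] / total

def conditional_probability(first, second):
    """P(second | first), estimated as count(first second) / count(first)."""
    return bigrams[(first, second)] / unigrams[first]

p_peck = relative_frequency("peck")                      # 2 of 7 tokens
p_of_given_peck = conditional_probability("peck", "of")  # 1 of 2 occurrences of "peck"
```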
102. Stop Words and Lexical Difference
There are words that rarely occur in terminology;
At least, they rarely occur at the beginning or end of a
multi-word term;
For example, pronouns, articles, prepositions;
These words are usually known as stop words;
Stop-word lists of varying length are easy to find for most
languages;
We can ignore these words when computing n-grams.
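One possible way to apply this, sketched below, is to discard every n-gram that starts or ends with a stop word. The stop-word list is a tiny illustrative one (an assumption), and the counts are a few of the trigram counts shown in this part of the course.

```python
# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"in", "with", "the", "of", "to", "a"}

trigram_counts = {
    ("in", "accordance", "with"): 31148,
    ("the", "member", "states"): 16999,
    ("member", "states", "shall"): 11392,
    ("veterinary", "medicinal", "products"): 955,
    ("maximum", "residue", "limits"): 814,
}

def has_stop_word_at_edge(ngram, stop_words):
    """True if the first or the last word of the n-gram is a stop word."""
    return ngram[0] in stop_words or ngram[-1] in stop_words

candidates = {
    ngram: count
    for ngram, count in trigram_counts.items()
    if not has_stop_word_at_edge(ngram, STOP_WORDS)
}
```

Note that "member states shall" survives only because "shall" is missing from this small list; the quality of the candidates depends directly on the stop-word list used.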
103. Detecting stop-words
in accordance with 31148
referred to in 27581
the member states 16999
accordance with the 16535
of the european 14772
laid down in 13301
to in article 13211
having regard to 12588
regard to the 11416
member states shall 11392
in order to 10563
in the case 10029
the provisions of 9825
the case of 9575
provided for in 9560
the member state 9360
of the member 8656
the commission shall 8013
of this directive 6679
a member state 6306
on the basis 6292
the european parliament 6274
the basis of 6265
and in particular 6225
down in article 6200
of the community 5958
accordance with article 5758
to in paragraph 5690
opinion of the 5599
the opinion of 5191
the competent authorities 5074
for the purposes 5024
the purposes of 4946
with the procedure 4878
to the commission 4843
the european community 4834
104. Replacing stop words by a special token
<tk> member states 32517
member states <tk> 30108
<tk> member state 19345
member state <tk> 17882
council directive <tk> 7869
<tk> council directive 7129
<tk> european parliament 5397
council regulation <tk> 5259
european parliament <tk> 5125
<tk> council regulation 4995
<tk> competent authorities 4964
competent authorities <tk> 4736
procedure laid <tk> 4472
<tk> treaty establishing 4375
treaty establishing <tk> 4373
<tk> competent authority 3694
official journal <tk> 3530
competent authority <tk> 3507
annex ii <tk> 3429
commission regulation <tk> 3171
<tk> commission regulation 2967
commission decision <tk> 2545
<tk> customs authorities 2542
<tk> commission decision 2429
customs authorities <tk> 2410
<tk> european economic 2285
<tk> administrative provisions 2017
<tk> contracting parties 2010
conditions laid <tk> 1998
contracting parties <tk> 1779
commission directive <tk> 1764
detailed rules <tk> 1738
<tk> community industry 1728
<tk> contracting party 1702
105. Trigrams that don't include stop words
member states relating 1523
member state concerned 1200
veterinary medicinal products 955
maximum residue limits 814
physically modified derivatives 700
european economic community 691
community trade mark 538
member states concerned 508
plant protection products 464
home member state 442
host member state 388
council common position 377
community plant variety 368
european atomic energy 346
animal health conditions 342
authorised representative established 327
implementing powers conferred 311
regional economic integration 263
median longitudinal plane 258
plant protection product 249
separate technical unit 246
national regulatory authorities 241
apply mutatis mutandis 241
common technical regulation 229
separate technical units 226
emission limit values 219
technically permissible maximum 215
maximum residue levels 212
retail trade services 200
temporary importation procedure 196
medicinal products intended 195
community transit procedure 195
atomic energy community 193
classical swine fever 189
106. Basic Lexical Difference
What if we remove not just stop words, but common words?
It is unusual to find a word like osteoarthritis in general text.
Therefore, it is probably some kind of domain term.
We can obtain a list of common words from a generic corpus
(say, journalistic text) and subtract that lexicon from the
one-grams we obtained.
The result should include good term candidates!
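The subtraction can be sketched as a simple set difference. The counts below are a few of the one-gram counts from the experiment in this part of the course; the list of common words is an assumption made for illustration (a real list would hold the top occurring words of a generic corpus).

```python
# One-gram counts from a small domain corpus.
domain_counts = {
    "liver": 8, "is": 7, "fibrosis": 6, "myofibroblast": 6,
    "pathway": 5, "expression": 5, "diseases": 3,
}

# Hypothetical excerpt of a generic "common words" lexicon.
common_words = {"is", "expression", "diseases", "the", "of"}

# Keep only the words that are not in the generic lexicon.
candidates = {w: c for w, c in domain_counts.items() if w not in common_words}

# Rank by decreasing count, breaking ties alphabetically.
ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
```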
107. Basic Lexical Difference - Experiment
Two random abstracts from PubMed articles related to cirrhosis;
Top 1 000 occurring words in English;
Compute one-grams on the abstracts;
Subtract the top occurring words.
Before
liver 8
is 7
fibrosis 6
myofibroblast 6
pathway 5
kidney 5
expression 5
interstitial 4
signaling 3
target 3
differentiation 3
diseases 3
medullary 3
antioxidant 3
After
liver 8
myofibroblast 6
fibrosis 6
pathway 5
kidney 5
interstitial 4
β-catenin 3
target 3
signaling 3
genes 3
differentiation 3
medullary 3
renal 3
adult 3
108. Lexical Distribution Difference
The previous example could benefit from a bigger standard lexicon list;
Abstracts are crowded with terminology, with few other words;
But long lists may include words that are considered terminology!
For example, in Informatics, folder or file can be terms.
Instead of considering words as present or not, use their
frequency;
For instance, compute relative frequencies and
compare/subtract;
Use a distribution comparison metric;
e.g., the Kullback-Leibler terms: $P(i) \log \frac{P(i)}{Q(i)}$
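A minimal sketch of the frequency-based comparison, assuming that P is the relative-frequency distribution of the domain corpus and Q that of the generic corpus (the slide does not name them). The counts, the function names and the `floor` value used for words missing from the generic corpus are all hypothetical.

```python
import math

def relative_frequencies(counts):
    """Turn raw counts into a relative-frequency distribution."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def kl_terms(domain_counts, generic_counts, floor=1e-6):
    """Per-word Kullback-Leibler term P(i) * log(P(i) / Q(i)).

    Words absent from the generic corpus get the small probability `floor`
    (a crude smoothing assumption) to avoid dividing by zero.
    """
    p = relative_frequencies(domain_counts)
    q = relative_frequencies(generic_counts)
    return {w: p_w * math.log(p_w / q.get(w, floor)) for w, p_w in p.items()}

# Made-up counts: "file" and "folder" are common in the domain corpus only.
domain = {"file": 30, "folder": 20, "the": 50}
generic = {"file": 2, "folder": 1, "the": 97}

scores = kl_terms(domain, generic)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Words that are relatively more frequent in the domain corpus than in the generic one get a positive score, while words such as "the" get a negative one, so frequent generic words sink in the ranking instead of being removed outright.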
110. Pointwise Mutual Information
The Mutual Information (MI) is a quantity that measures the
mutual dependence of two random variables X and Y .
$$MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
Intuitively, mutual information measures the information that X
and Y share: it measures how much knowing one of these
variables reduces uncertainty about the other.
111. Pointwise Mutual Information
When computing Mutual Information for two specific outcomes,
the Pointwise Mutual Information (PMI) lets us measure their
mutual dependence:
$$PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
Given the number of tokens in the document, N, and the
number of occurrences of x, Oc(x): $P(x) = \frac{Oc(x)}{N}$
Given the number of tokens in the document, N, and the
number of occurrences of the bigram x, y, Oc(x, y):
$P(x, y) = \frac{Oc(x, y)}{N}$
112. Pointwise Mutual Information
Sorted by occurrence count
sonic fabric 14 7.3566
black holes 9 8.0912
black hole 7 8.0912
cassette tape 6 8.4968
build things 4 9.5348
smartphone makers 3 9.0087
alyce santoro 3 8.0912
like scratching 3 9.0087
barnard said 3 8.3042
milky way 3 9.1787
possible black 3 7.6762
neutron star 3 8.8567
just right 3 8.5937
records backwards 3 10.5937
Sorted by PMI
special shuttle 1 12.1787
immediately reminded 1 12.1787
remain aware 1 12.1787
richard branson 1 12.1787
supercooled pods 1 12.1787
richie havens 1 12.1787
auspicious locations 1 12.1787
jimi hendrix 1 12.1787
account settings 1 12.1787
baggage carousel 1 12.1787
buddhist prayer 1 12.1787
reinvents electronics 1 12.1787
melbourne institute 1 12.1787
cow manure 1 12.1787
From a very small corpus constructed with 5 CNN news stories.
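The PMI definition can be sketched directly from counts, with P(x) = Oc(x)/N and P(x, y) = Oc(x, y)/N. The function name and the toy sentence are assumptions; this is not the code that produced the table above.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return PMI(x, y) in bits for every bigram (x, y) found in the tokens."""
    n = len(tokens)
    oc = Counter(tokens)                    # Oc(x)
    oc2 = Counter(zip(tokens, tokens[1:]))  # Oc(x, y)
    return {
        (x, y): math.log2((count / n) / ((oc[x] / n) * (oc[y] / n)))
        for (x, y), count in oc2.items()
    }

tokens = "black hole near black hole and a neutron star".split()
pmi = bigram_pmi(tokens)

# Sort by PMI, highest first.
by_pmi = sorted(pmi.items(), key=lambda item: item[1], reverse=True)
```

In this toy example "neutron star" (seen once, both words seen once) scores log2(9), higher than "black hole" (seen twice). In general a bigram whose two words occur only once, and together, reaches the maximum value log2(N). This is why the list sorted by PMI contains only bigrams with count 1, all with the same score of 12.1787, which suggests a corpus of roughly 4,600 tokens. Combining PMI with a minimum occurrence count is a common way to avoid this effect.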
113. Morphology Patterns
Commonly, terms are nouns or noun phrases;
Sometimes some verbs are also interesting;
Typically the morphological structure of terms is well known;
There is software that computes morphological information
about each word in a sentence;
We can use that information to obtain better term candidates:
specify the part of speech, gender, number, verb tense,
etc. of the terms.
114. Morphological Analysis
How it (usually) works:
1 A sentence splitter and a tokenizer split the text into sentences
and tokens;
(different tools use them in different order, some as a single tool)
2 A morphological analyzer associates all possible analyses with each
word;
(does not cope with ambiguity, just tags all possible analyses)
3 A tagger or parser chooses the most likely analysis;
(uses knowledge from manually annotated corpora, and machine
learning algorithms)
115. Morphological Patterns - Examples
Noun Noun Noun
659 Community trade mark
483 plant protection products
475 EEC component type-approval
448 document number C
320 Community transit procedure
290 plant protection product
288 Community plant variety
257 EC type-examination certificate
214 EC component type-approval
176 EEC pattern approval
157 African swine fever
155 three-wheel motor vehicles
155 foot-and-mouth disease virus
153 conformity assessment procedures
148 emission limit values
Adjective Adjective Noun
912 veterinary medicinal products
453 common agricultural policy
365 separate technical unit
291 separate technical units
265 median longitudinal plane
223 regional economic integration
202 competent national authorities
200 trans-European high-speed rail
199 sound financial management
189 veterinary medicinal product
182 certain agricultural products
176 national regulatory authorities
175 common technical regulation
168 certain third countries
166 other third countries
166 definitive anti-dumping duty
162 certain dangerous substances
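Once each token has a part-of-speech tag, extracting candidates for a pattern such as Adjective Adjective Noun is a sliding-window comparison. In the sketch below the tags are assigned by hand and the tag names (ADJ, NOUN, VERB) are an assumption; in practice they would come from a tagger.

```python
# Hand-tagged tokens (hypothetical tagger output).
tagged = [
    ("veterinary", "ADJ"), ("medicinal", "ADJ"), ("products", "NOUN"),
    ("are", "VERB"),
    ("plant", "NOUN"), ("protection", "NOUN"), ("products", "NOUN"),
]

def match_pos_pattern(tagged_tokens, pattern):
    """Return every word sequence whose tags equal the pattern, in order."""
    n = len(pattern)
    hits = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            hits.append(" ".join(word for word, _ in window))
    return hits

adj_adj_noun = match_pos_pattern(tagged, ("ADJ", "ADJ", "NOUN"))
noun_noun_noun = match_pos_pattern(tagged, ("NOUN", "NOUN", "NOUN"))
```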
117. Term-o-Matic
What it is:
A simple web-application;
Without user control;
Developed specifically for this class;
Implements some of the methods presented before;
What it is not:
Commercial software;
A professional tool;
A tool free of bugs;
A multilingual tool.
120. Term-o-Matic: Add Text
Use the Add Text option to add one-grams, bigrams and trigrams
into the database (English, please!).
121. Term-o-Matic: Add Text feedback
After adding some text, a summary of the amount of data added is
shown.
122. Term-o-Matic: Manage Stopwords
The Stop Words option lets you manage the list of stop words. You
can add words (to add more than one, just separate them with
spaces or punctuation) and delete them.
123. Term-o-Matic: Manage Lexicon
The Standard Lexicon option is very similar to the Stop Words
option, but for generic (common) words.
124. T-o-M: Words, Bigrams and Trigrams
The Study Words, Study Bigrams and Study Trigrams options all work
the same way, showing a list of words/bigrams/trigrams.
125. T-o-M: Words, Bigrams and Trigrams
Note that the PMI column is empty. This measure takes some time
to compute, and therefore should be computed only when needed.
126. T-o-M: Words, Bigrams and Trigrams
To compute PMI, use the Compute bi/trigrams PMI option. After the
software issues an "OK" message, hit the back button in your
browser and refresh.
127. T-o-M: Words, Bigrams and Trigrams
By default the list is sorted by occurrence count. You can change
to PMI order as soon as it is computed.
128. T-o-M: Words, Bigrams and Trigrams
It is possible to remove entries with stop-words or punctuation; or
entries with common words.
129. T-o-M: Filtering by pattern
To filter by a morphological pattern, make sure that you have run
the Compute Morph. Analysis option since the last time you
entered text.
When the software says the process is complete (OK), hit the back
button, and you are ready to use the pattern filtering.
Just choose the categories you are looking for, and search for them.
130. T-o-M: Filtering by Pattern
131. Term-o-Matic: standard operation guide
1 Use the Add Text option to add text.
Use it as many times as you need to create a big enough
corpus;
Do not add too much text at once. Add by blocks.
Be sure to add thematic text;
2 Define a list of stop words (you might already have one).
3 Define a list of common words.
Look for such lists on the web.
4 Compute PMIs and Morphological Analysis
5 Do queries!
132. Evaluation Task
Five students, Five subject areas, Five Term-o-Matic.
Computer Science (http://termomatic.com/termomatic1)
Medicine (http://termomatic.com/termomatic2)
Europe (http://termomatic.com/termomatic3)
Animal Biology (http://termomatic.com/termomatic4)
Sports (http://termomatic.com/termomatic5)
134. Overview
12 Sentence and Word Alignment
13 Parallel Patterns
135. Sentence Alignment
Sentence alignment is the task of detecting translation
relationships between sentences in parallel corpora.
If sα is a sentence in a language Lα and sβ is a sentence in a
language Lβ, the alignment process creates the pair (sα, sβ) if
(there is a high probability that) sβ is a translation of sα.
136. Word Alignment
Word alignment is the task of detecting translation
relationships between words or terms in sentence-aligned parallel
corpora.
There are two approaches to word alignment:
for each aligned sentence pair, create a link between every word
and its translation;
for the complete corpus, obtain a relationship between a
word and a set of probable translations, together with a
confidence measure (a kind of translation probability);
138. Probabilistic Translation Dictionaries
Obtained with one of the word alignment methods;
Define a relationship between a word and a set of probable
translations;
T(europe) =
europa 94.7%
europeus 3.4%
europeu 0.8%
europeia 0.1%
T(stupid) =
estúpido 47.6%
estúpida 11.0%
estúpidos 7.4%
avisada 5.6%
direita 5.6%
impasse 4.5%
ocupado 3.8%
139. Translation Matrix
               discussion  about  alternative  sources  of  financing  for  the  european  radical  alliance   .
discussão              44      0            0        0   0          0    0    0         0        0         0   0
sobre                   0     11            0        0   0          0    0    0         0        0         0   0
fontes                  0      0            0       74   0          0    0    0         0        0         0   0
de                      0      3            0        0  27          0    6    3         0        0         0   0
financiamento           0      0            0        0   0         56    0    0         0        0         0   0
alternativas            0      0           23        0   0          0    0    0         0        0         0   0
para                    0      0            0        0   0          0   28    0         0        0         0   0
a                       0      1            0        0   1          0    4   33         0        0         0   0
aliança                 0      0            0        0   0          0    0    0         0        0        65   0
radical                 0      0            0        0   0          0    0    0         0       80         0   0
europeia                0      0            0        0   0          0    0    0        59        0         0   0
.                       0      0            0        0   0          0    0    0         0        0         0  80
Using the probabilistic translation dictionaries we are able to
construct a translation matrix;
Each cell has a translation probability obtained from the
dictionary;
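A minimal sketch of this construction, assuming the probabilistic translation dictionary is stored as a nested mapping from a source word to its translations and their probabilities. The values are taken from a few cells of the matrix above, read as percentages; the function name is hypothetical.

```python
# Excerpt of a Portuguese-to-English probabilistic translation dictionary.
ptd = {
    "fontes": {"sources": 0.74},
    "de": {"of": 0.27, "for": 0.06},
    "financiamento": {"financing": 0.56},
    "alternativas": {"alternative": 0.23},
}

source = ["fontes", "de", "financiamento", "alternativas"]
target = ["alternative", "sources", "of", "financing"]

def translation_matrix(source_tokens, target_tokens, dictionary):
    """One row per source word, one column per target word.
    Each cell holds the dictionary probability, or 0.0 if there is none."""
    return [
        [dictionary.get(s, {}).get(t, 0.0) for t in target_tokens]
        for s in source_tokens
    ]

matrix = translation_matrix(source, target, ptd)
```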
140. Translation Patterns
Translation changes word order (for some language pairs!);
This change can be foreseen;
This change can be defined formally as a pattern;
These patterns can be used to obtain term candidates.
141. Translation Pattern 1: ABBA
         Jogos  Olímpicos
Olympic             X
Games      X
Formally,
T (A · B) = T (B) · T (A)
Or in the tool syntax:
[ABBA] A B = B A
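The ABBA pattern can be sketched as a search for two adjacent source words whose translations appear adjacent, but swapped, in the target sentence. This is only an illustration of the idea, not the implementation of the tool mentioned above; the dictionary probabilities, the threshold and the function name are hypothetical.

```python
def find_abba(source, target, ptd, threshold=0.1):
    """Find source bigrams A B translated as T(B) T(A) in the target.

    A source word and a target word are considered translations when
    their dictionary probability is at least `threshold`.
    """
    def prob(s, t):
        return ptd.get(s, {}).get(t, 0.0)

    pairs = []
    for i in range(len(source) - 1):
        for j in range(len(target) - 1):
            a_to_second = prob(source[i], target[j + 1])
            b_to_first = prob(source[i + 1], target[j])
            if a_to_second >= threshold and b_to_first >= threshold:
                pairs.append(
                    (" ".join(source[i:i + 2]), " ".join(target[j:j + 2]))
                )
    return pairs

# Made-up dictionary entries for a three-word sentence pair.
ptd = {
    "os": {"the": 0.7},
    "jogos": {"games": 0.9},
    "olímpicos": {"olympic": 0.8},
}
source = ["os", "jogos", "olímpicos"]
target = ["the", "olympic", "games"]

pairs = find_abba(source, target, ptd)
```

Counting how often each extracted pair occurs over a whole aligned corpus gives ranked lists of bilingual term candidates.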
142. Translation Pattern 2: IDH
             índice  de  desenvolvimento  humano
human                                       X
development                     X
index          X
T(I · "de" · D · H) = T(H) · T(D) · T(I)
[IDH] I "de" D H = H D I
143. Translation Pattern 3: FTP
          protocolo  de  transferência  de  ficheiros
file                                            X
transfer                       X
protocol      X
T(P · "de" · T · "de" · F) = T(F) · T(T) · T(P)
[FTP] P "de" T "de" F = F T P
144. Patterns in Translation Matrix
               discussion  about  alternative  sources  of  financing  for  the  european  radical  alliance   .
discussão              44      0            0        0   0          0    0    0         0        0         0   0
sobre                   0     11            0        0   0          0    0    0         0        0         0   0
fontes                  0      0            0       74   0          0    0    0         0        0         0   0
de                      0      3            0        0  27          0    6    3         0        0         0   0
financiamento           0      0            0        0   0         56    0    0         0        0         0   0
alternativas            0      0           23        0   0          0    0    0         0        0         0   0
para                    0      0            0        0   0          0   28    0         0        0         0   0
a                       0      1            0        0   1          0    4   33         0        0         0   0
aliança                 0      0            0        0   0          0    0    0         0        0        65   0
radical                 0      0            0        0   0          0    0    0         0       80         0   0
europeia                0      0            0        0   0          0    0    0        59        0         0   0
.                       0      0            0        0   0          0    0    0         0        0         0  80
The two boxes, around "fontes de financiamento alternativas" /
"alternative sources of financing" and around "aliança radical europeia" /
"european radical alliance", correspond to the following two patterns:
[P1] F "de" N A = A F "of" N
[P2] A B C = C B A
145. Terms extracted using A B = B A
21007 união europeia ⇒ european union
9301 parlamento europeu ⇒ european parliament
4171 direitos humanos ⇒ human rights
3504 estados unidos ⇒ united states
2353 mercado interno ⇒ internal market
1911 posição comum ⇒ common position
1826 países candidatos ⇒ candidate countries
1776 comissão europeia ⇒ european commission
1708 conselho europeu ⇒ european council
1629 saúde pública ⇒ public health
1558 direitos fundamentais ⇒ fundamental rights
1546 nações unidas ⇒ united nations
1337 países terceiros ⇒ third countries
1294 conferência intergovernamental ⇒ intergovernmental conference
1258 fundos estruturais ⇒ structural funds
146. Terms extracted using A "de" B = B A
729 plano de acção ⇒ action plan
722 conselho de segurança ⇒ security council
680 processo de paz ⇒ peace process
582 mercado de trabalho ⇒ labour market
580 pena de morte ⇒ death penalty
492 pacto de estabilidade ⇒ stability pact
431 política de defesa ⇒ defence policy
353 acordo de associação ⇒ association agreement
348 protocolo de quioto ⇒ kyoto protocol
343 programa de acção ⇒ action programme
259 branqueamento de capitais ⇒ money laundering
258 comité de conciliação ⇒ conciliation committee
241 política de concorrência ⇒ competition policy
226 processo de conciliação ⇒ conciliation procedure
217 requerentes de asilo ⇒ asylum seekers
147. Terms extracted using A B C = C B A
531 política agrícola comum ⇒ common agricultural policy
418 banco central europeu ⇒ european central bank
329 tribunal penal internacional ⇒ international criminal court
166 aliança livre europeia ⇒ european free alliance
156 modelo social europeu ⇒ european social model
153 partidos políticos europeus ⇒ european political parties
83 fundo monetário internacional ⇒ international monetary fund
75 política externa comum ⇒ common foreign policy
66 organização marítima internacional ⇒ international maritime organisation
65 própria união europeia ⇒ european union itself
65 fundo social europeu ⇒ european social fund
55 direitos humanos fundamentais ⇒ fundamental human rights
45 relações económicas externas ⇒ external economic relations
45 homens e mulheres ⇒ women and men
45 agência espacial europeia ⇒ european space agency
148. Terms extracted: I "de" D H = H D I
95 mandato de captura europeu ⇒ european arrest warrant
85 fontes de energia renováveis ⇒ renewable energy sources
80 mandado de captura europeu ⇒ european arrest warrant
67 sistemas de segurança social ⇒ social security systems
64 zona de comércio livre ⇒ free trade area
55 força de reacção rápida ⇒ rapid reaction force
54 orientações de política económica ⇒ economic policy guidelines
46 planos de acção nacionais ⇒ national action plans
46 direitos de propriedade intelectual ⇒ intellectual property rights
33 sistema de alerta rápido ⇒ rapid alert system
29 política de defesa comum ⇒ common defence policy
29 método de coordenação aberta ⇒ open coordination method
27 método de coordenação aberto ⇒ open coordination method
27 conselho de empresa europeu ⇒ european works council
25 acordo de comércio livre ⇒ free trade agreement
149. Adding Morphological Constraints
The pattern language supports constraints;
Constraints can be of different types;
The most interesting are the morphological ones:
[ABBA] A B[CAT<-adj] = B[CAT<-adj] A
With this kind of constraint we can force the words in specific
positions to be of a specific morphological category.
150. Further Reading
Alignment tasks
Sentence Alignment Survey
http://www.statmt.org/survey/Topic/SentenceAlignment
An overview of bitext alignment algorithms
http://www.ida.liu.se/~jodfo/gslt/bitext-alignment-jody.pdf
Word Alignment Survey
http://www.statmt.org/survey/Topic/WordAlignment
Terminology from Parallel Corpora
Parallel corpus-based bilingual terminology extraction
http://ambs.perl-hackers.net/publications/tia09.pdf