A5 - Specialized Dictionaries
Alberto Simões
ambs@ilch.uminho.pt
EMLex 2012/2013
Erlangen
Alberto Simões A5 - Specialized Dictionaries 1/138
Part I
Terminology vs Lexicography
Overview
1 Term Orientation vs Concept Orientation
2 Classification Systems
What are Classification Systems?
Folksonomies
Taxonomies
Thesauri
Ontologies
3 Further Reading
Term vs Concept Orientation
Most dictionaries are organized by terms:
users look up entries by the word;
entries describe all possible senses;
the same explanation can appear for different words
(synonyms);
Most terminologies are organized by concepts:
users look up entries by an instance word;
but concepts exist organized as a single block;
each concept is represented only once;
all synonyms (and antonyms) are presented together;
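The two orientations can be sketched as data structures. This is only an illustration: the words, definitions, concept ids (C001, C002) and the helper lookup_concepts are hypothetical and do not come from any of the resources shown in these slides.

```python
# Term-oriented (dictionary): one entry per word, listing all of its senses.
# The same explanation is repeated for synonyms.
term_oriented = {
    "car": ["a road vehicle powered by an engine", "a railway carriage"],
    "automobile": ["a road vehicle powered by an engine"],
}

# Concept-oriented (terminology): one entry per concept, listing all of its terms.
# Each concept is represented only once.
concept_oriented = {
    "C001": {
        "definition": "a road vehicle powered by an engine",
        "terms": ["car", "automobile"],
    },
    "C002": {
        "definition": "a railway carriage",
        "terms": ["car"],
    },
}


def lookup_concepts(word):
    """Return the ids of the concepts whose term list contains `word`."""
    return [cid for cid, concept in concept_oriented.items() if word in concept["terms"]]
```

In the concept-oriented structure the user still starts from a word, but lands on a concept block that holds every synonym.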
Term vs Concept Orientation
Term Orientation: Dictionary
Definition from Dictionary.com (May 3rd, 2013)
Term vs Concept Orientation
Concept Orientation: Terminology
Entry from DeCS - Health Sciences Descriptors (May 3rd, 2013)
Classification Systems
Humans tend to organize;
“disorganization is a kind of organization”
This organization is usually done by classification;
Classification can be as simple as tagging an object;
“this is the pile of important documents, that of the
unimportant ones”
Classification is used everywhere!
Where are classification systems used?
Internet Social Networks (tagging);
Libraries (e.g. Universal Decimal Classification);
Medicine (e.g. Unified Medical Language System);
Chemistry (e.g. Periodic Table);
Geography (e.g. Geographic Taxonomy);
Biology (e.g. Linnaean taxonomy, Protein classification, ...);
Classification Systems Classes
Classification Systems can also be classified;
One way to classify classification systems is by their ability to
include properties and relations between the classified objects;
We will discuss four types of classification systems:
Folksonomies
Taxonomies
Thesauri
Ontologies
Folksonomies
A folksonomy is a system of classification derived from
the practice and method of collaboratively creating and
managing tags to annotate and categorize content;
this practice is also known as collaborative tagging, social
classification, social indexing, and social tagging.
Folksonomy, a term coined by Thomas Vander Wal, is
a portmanteau of folk and taxonomy.
Folksonomy (Wikipedia, 2012)
Folksonomies: How they work
Other classification techniques usually put some person or
group in charge of creating the classification system structure
(authority);
This group sees the world from a specific point of
view, which may or may not be shared by others;
Folksonomies solve this problem: power to the people;
Instead of partitioning the world according to one particular
view, they let users present facets of objects;
Users assign keywords (or tags, or labels) to objects
(individuals);
These keywords can be searched and indexed, and mathematical
models can be applied to this data.
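A minimal sketch of how user-assigned tags can be indexed, searched and counted. The users, objects and tags below are invented for illustration; a real tagging system would hold far more data.

```python
from collections import Counter, defaultdict

# Hypothetical taggings: (user, object, tag)
taggings = [
    ("ana", "photo1", "porto"),
    ("rui", "photo1", "river"),
    ("ana", "photo2", "river"),
    ("eva", "photo2", "douro"),
    ("rui", "photo2", "river"),
]

# Inverted index: tag -> set of objects carrying that tag
index = defaultdict(set)
for _user, obj, tag in taggings:
    index[tag].add(obj)

# Tag frequencies: the raw data that statistical models would work on
tag_counts = Counter(tag for _user, _obj, tag in taggings)


def search(tag):
    """Return the objects tagged with `tag`, sorted by name."""
    return sorted(index.get(tag, set()))
```

Note that the vocabulary is just a flat set of strings: nothing tells the system that "douro" is a "river".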
Folksonomies
An empirical analysis of the complex dynamics of tagging
systems, published in 2007, has shown that consensus
around stable distributions and shared vocabularies
does emerge, even in the absence of a central
controlled vocabulary. For content to be searchable,
it should be categorized and grouped. While this was
believed to require commonly agreed on sets of content
describing tags (much like keywords of a journal
article), recent research has found that, in large
folksonomies, common structures also emerge on the level
of categorizations. Accordingly, it is possible to devise
mathematical models that allow for translating from
personal tag vocabularies (personomies) to the
vocabulary shared by most users.
Folksonomy (Wikipedia, 2012)
Folksonomies: example
Top categories in the Portuguese Wikipedia (single words):
375 Sociologia 383 Ponerinae
395 Afro-brasileiros 404 Drilliidae
413 Filosofia 415 Coleophoridae
424 Psicologia 428 Terebridae
445 Clathurellinae 445 Digimons
445 Teuto-brasileiros 451 Apiaceae
483 Asteroides 486 Luso-brasileiros
492 Acaena 526 Rubiaceae
537 Dolichoderinae 730 Agonoxenidae
735 Acalypha 753 Mangeliinae
762 Crambidae 787 Poaceae
808 Coletâneas 824 Theraphosidae
854 Myrmicinae 962 Fabaceae
974 Formicidae 1065 Agrostis
1096 Formicinae 1177 Aloe
1328 Conus 1338 Ítalo-brasileiros
1395 Asteraceae 1433 Coleophora
1514 Arctiidae 1516 Alchemilla
1689 Turridae 1879 Camponotus
2163 Acer 2744 Acacia
Folksonomies: Pros and Cons
Pros:
doesn’t require expert cataloguers, authoritative sources or
expert users;
capability of matching users’ real needs and language:
(inclusive — includes everyone’s words and vocabulary)
controlled vocabularies are not practically and economically
extensible, while folksonomies are;
a low-investment bridge between personal classification and
shared classification;
easy to use and quick to classify big quantities of individuals;
not all the limitations of folksonomies are defects :-)
Folksonomies: Pros and Cons
Cons:
by itself, the vocabulary is flat;
(there is no structure, just terms)
not usable for small collections or those with few users;
(statistical methods depend on population size)
without some technological help, vocabularies get inexact or
ambiguous;
have a very low findability quotient. They are great for
serendipity and browsing but not aimed at a targeted
approach or search;
Taxonomies
Taxonomy is the science of identifying and naming
species, and arranging them into a classification. The
field of taxonomy, sometimes referred to as “biological
taxonomy”, revolves around the description and use of
taxonomic units, known as taxa. A resulting taxonomy
is a particular classification, arranged in a hierarchical
structure or classification scheme.
Taxonomy (Wikipedia, 2012)
Taxonomies
taxonomy [tækˈsɒnəmɪ]
n.
(Life Sciences & Allied Applications / Biology)
the branch of biology concerned with the
classification of organisms into groups based on
similarities of structure, origin, etc.
the practice of arranging organisms in this way.
the science or practice of classification.
[from French taxonomie, from Greek taxis “order” + –nomy]
Collins English Dictionary – Complete and Unabridged
© HarperCollins Publishers 1991, 1994, 1998, 2000, 2003
Taxonomies: How they work
Used to partition the world into disjoint classes or groups;
Each group is, again, partitioned into sub-classes or
sub-groups;
And sub-classes are partitioned, and. . .
Individuals are classified in one leaf category;
(a classification is a path in the tree)
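A minimal sketch of a taxonomy as a tree in which every class has exactly one parent, so that classifying an individual amounts to choosing a leaf and reading off the path from the root. The toy hierarchy below is an assumption made for illustration (several ranks are omitted).

```python
# Toy taxonomy: each class has exactly one parent (None for the root).
parent = {
    "Animalia": None,
    "Chordata": "Animalia",
    "Mammalia": "Chordata",
    "Carnivora": "Mammalia",
    "Felidae": "Carnivora",
    "Felis": "Felidae",
}


def classification_path(leaf):
    """Return the path from the root of the tree down to `leaf`."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    return list(reversed(path))
```

Because each class stores a single parent, an object cannot sit in two branches at once.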
Taxonomies: The typical example
Taxonomies: examples used everyday
Main index (top level) of Universal Decimal Classification:
0 Generalities
(now Science and knowledge. Organization. Computer Science.
Information. Documentation. Librarianship. Institutions.
Publications)
1 Philosophy. Psychology
2 Religion. Theology
3 Social Sciences
4 Vacant
5 Mathematics and natural sciences
6 Applied sciences. Medicine. Technology
7 The arts. Recreation. Entertainment. Sport
8 Language. Linguistics. Literature
9 Geography. Biography. History
Taxonomies: examples used everyday
8 Language. Linguistics. Literature
80 General questions [. . . ] linguistics and literature. Philology
81 Linguistics and languages
81-11 Schools and trends in linguistics
81-13 Methodology of linguistics. Methods and means
811 Languages
811.1/.2 Indo-European Languages
811.3 Dead languages of unknown affiliation. Caucasian languages
811.4 Afro-Asiatic, Nilo-Saharan, Congo-Kordofanian, Khoisan
languages
811.5 Ural-Altaic, Palaeo-Siberian, Eskimo-Aleut, Dravidian and
Sino-Tibetan languages. Japanese. Korean. . .
811.6 Austro-Asiatic languages. Austronesian languages
811.7 Indo-Pacific (non-Austronesian) languages. Australian
languages
811.8 American indigenous languages
811.9 Artificial languages
82 Literature
Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
1 Philosophy. Psychology
2 Religion. Theology
3 Social Sciences
5 Mathematics and natural sciences
6 Applied sciences. Medicine.
Technology
7 The arts. Recreation. Entertainment.
Sport
8 Language. Linguistics. Literature
9 Geography. Biography. History
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
Taxonomies: Class Task
5 Mathematics, Natural Sciences
51 Mathematics
519 (no name, virtual class)
519.6 Computational mathematics.
Numerical Analysis
University of Minho Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
00 Prolegomena. Fundamentals of
knowledge and culture.
Propaedeutics
004 Computer science and
technology. Computing. Data
processing
004.4 Software
004.42 Computer
programming. Computer
programs
Aveiro University Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
00 Prolegomena. Fundamentals of
knowledge and culture.
Propaedeutics
004 Computer science and
technology. Computing. Data
processing
004.4 Software
004.43 Computer Languages
Porto Polytechnic Institute Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
Taxonomies: Class Task
0 Science and knowledge. Organization.
Computer Science. Information. . .
00 Prolegomena. Fundamentals of
knowledge and culture.
Propaedeutics
004 Computer science and
technology. Computing. Data
processing
004.8 Artificial intelligence
Algarve’s University Library
Prolog Programming for
Artificial Intelligence,
Prof Ivan Bratko
Taxonomies: Pros and Cons
Pros:
rigid tree, makes it easy to process;
suitable for some areas (like life classification);
the hierarchy helps searching for terms (abstraction);
Cons:
rigid tree, makes it difficult to classify;
(different people classify objects differently)
the structure is defined by some authority group;
(for example, the UDC Consortium)
forces the subdivision of the world;
(categories are single-parental)
as a workaround, people classify objects in more than one category;
(so the rigid tree, listed as a Pro, becomes a Con)
Thesauri
A thesaurus is a reference work that lists words
grouped together according to similarity of meaning
(containing synonyms and sometimes antonyms), in
contrast to a dictionary, which contains definitions and
pronunciations.
Thesauri (Wikipedia, 2012)
Thesauri
In Information Science, Library Science, and Information
Technology, specialized thesauri are designed for
information retrieval. They are a type of controlled
vocabulary, for indexing or tagging purposes. Such a
thesaurus can be used as the basis of an index for online
material. [...] Unlike a literary thesaurus, these
specialized thesauri typically focus on one discipline,
subject or field of study.
Thesauri (Wikipedia, 2012)
Thesauri: How they work!
Thesauri for information retrieval are typically constructed
by information specialists, and have their own
unique vocabulary defining different kinds of terms and
relationships.
Terms are the basic semantic units for conveying concepts.
They are usually single-word nouns, since nouns
are the most concrete part of speech. [...] When a
term is ambiguous, a “scope note” can be added to
ensure consistency, and give direction on how to interpret
the term.
“Term relationships” are links between terms. These
relationships can be divided into three types: hierarchical,
equivalency or associative.
Thesauri (Wikipedia, 2012)
Thesauri: How they work!
Hierarchical relationships are used to indicate terms
which are narrower and broader in scope. A “Broader
Term” (BT) or hyperonym is a more general term.
Reciprocally, a “Narrower Term” (NT) or hyponym is
a more specific term.
BT and NT are reciprocals; a broader term necessarily
implies at least one other term which is narrower. BT
and NT are used to indicate class relationships, as well
as part-whole relationships (meronyms and holonyms).
Thesauri (Wikipedia, 2012)
Thesauri: How they work!
Example of a thesaurus with hierarchical relations.
Feline
  NT Cat
  NT Panther
Cat
  BT Feline
Panther
  BT Feline
  NT Pink Panther
Pink Panther
  BT Panther
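Since BT and NT are reciprocals, only one direction needs to be stored. The sketch below keeps the BT links of the Feline example and derives NT from them; the function name all_broader is a hypothetical helper, not part of any thesaurus standard.

```python
from collections import defaultdict

# Broader-term (BT) links of the example: term -> its broader term
broader = {
    "Cat": "Feline",
    "Panther": "Feline",
    "Pink Panther": "Panther",
}

# NT is the reciprocal of BT, so it can be derived instead of stored.
narrower = defaultdict(list)
for term, bt in broader.items():
    narrower[bt].append(term)


def all_broader(term):
    """Return every broader term of `term`, nearest first."""
    result = []
    while term in broader:
        term = broader[term]
        result.append(term)
    return result
```

Following the BT chain upwards is what lets a search for "Feline" also retrieve resources indexed under "Pink Panther".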
Thesauri: How they work!
The equivalency relationship is used primarily to connect
synonyms and near-synonyms. “Use” (USE) and
“Used For” (UF) indicators are used when an authorized
term is to be used for another, unauthorized,
term. Unauthorized terms are often called “entry
vocabulary”, “entry points”, “lead-in terms”, or
“non-preferred terms”, pointing to the authorized term (also
referred to as the “preferred term” or “descriptor”)
that has been chosen to stand for the concept.
Thesauri (Wikipedia, 2012)
Thesauri: How they work!
Example of a thesaurus with equivalency relations.
Parliament
  USE European Parliament
Parliament of Europe
  USE European Parliament
European Parliament
  UF Parliament
  UF Parliament of Europe
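A minimal sketch of how the USE links of this example can be applied at indexing or search time: every entry term is mapped to its descriptor, and the UF list is derived from the same table. The function names preferred and used_for are hypothetical.

```python
# USE links of the example: non-preferred term -> preferred term (descriptor)
use = {
    "Parliament": "European Parliament",
    "Parliament of Europe": "European Parliament",
}


def preferred(term):
    """Map any entry term to its descriptor; descriptors map to themselves."""
    return use.get(term, term)


def used_for(descriptor):
    """Derive the UF list of a descriptor from the USE links."""
    return [term for term, target in use.items() if target == descriptor]
```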
Thesauri: How they work!
Associative relationships are used to connect two
related terms whose relationship is neither hierarchical
nor equivalent. This relationship is described by the
indicator “Related Term” (RT). Associative relationships
should be applied with caution, since excessive
use of RTs will reduce specificity in searches.
Consider the following: if the typical user is searching
with term “A”, would they also want resources tagged
with term “B”? If the answer is no, then an associative
relationship should not be established.
Thesauri (Wikipedia, 2012)
Thesauri: How they work!
Example of a thesaurus with associative relations.
Douro
  BT River
  RT Porto
  RT Gaia
River
  NT Douro
Gaia
  BT Portugal
Porto
  BT Portugal
Portugal
  NT Porto
  NT Gaia
City
  NT Gaia
  NT Porto
Note: RT is not symmetrical. a RT b ⇏ b RT a.
Thesauri: a simple example
[Figure: graph linking the terms Quality, Asia, China, Food Safety, Contamination, Food, and Food Contamination through BT, NT, RT, and USE relations]
Extract of Food Safety relationships in AGROVOC
Thesauri: Pros and Cons
Pros:
More flexible than Taxonomies;
(does not require a tree; works as a graph)
Have other types of relationships than simple hierarchy;
(like the associative relation)
There is an ISO standard that documents their correct use;
Standard defines mathematical properties for relationships;
Cons:
Standardized types of relationships are somewhat limited;
(same relation for hyperonyms and meronyms)
(non-hierarchical relation is too vague: related)
No support for relationships with non-terms (features);
Ontologies
Ontology is the philosophical study of the nature of
being, existence, or reality as such, as well as the basic
categories of being and their relations. Traditionally
listed as a part of the major branch of philosophy
known as metaphysics, ontology deals with questions
concerning what entities exist or can be said to exist,
and how such entities can be grouped, related within a
hierarchy, and subdivided according to similarities and
differences.
Ontology (Wikipedia, 2012)
Ontologies
In computer science and information science, an ontology
formally represents knowledge as a set of concepts
within a domain, and the relationships between those
concepts. It can be used to reason about the entities
within that domain and may be used to describe the
domain.
Ontology: information science (Wikipedia, 2012)
Ontologies
Contemporary ontologies share many structural similarities,
regardless of the language in which they are
expressed. Most ontologies describe individuals (instances),
classes (concepts), attributes, and relations.
Ontology: information science (Wikipedia, 2012)
Ontologies
Individuals are the instances or objects (the basic or
“ground level” objects).
Ontology: information science (Wikipedia, 2012)
Unlike any of the other classification systems, Ontologies clearly
include the individuals (or objects being classified) in the structure.
Ontologies
Classes are sets, collections, concepts, [. . . ] or kinds
of things.
Ontology: information science (Wikipedia, 2012)
Classes are the concepts used in Thesauri and Taxonomies. They
can be super-classes, including sub-classes, or can just include
individuals (low-level classes, leaves if we were talking about
taxonomies).
Ontologies
Attributes are aspects, properties, features, characteristics,
or parameters that objects (and classes) can
have.
Ontology: information science (Wikipedia, 2012)
Attributes are properties of individuals or classes. If the individual
is a book in a library, a property can be the number of pages, the
title, the author. For a class, like “mammal”, an attribute can be a
reference to its fur. Attributes are usually specified as a pair, the
name of the attribute and its value.
Ontologies
Relations are ways in which classes and individuals can
be related to one another.
Ontology: information science (Wikipedia, 2012)
Relations are similar to the relations used in Thesauri, but unlike
them, there isn’t a list of valid relations. They can be the common
hierarchical relations, or the relation “eat” relating animals with
the animals they eat.
Ontologies
Function terms: complex structures formed from certain
relations that can be used in place of an individual
term in a statement.
Ontology: information science (Wikipedia, 2012)
Suppose you are adding Portuguese rivers to an Ontology. One can
define a simple macro to add some default relations to the river:
River(name) ≅ {
  Term → name
  Is a → river
  Is at → Portugal
}
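The River macro can be sketched as a function that expands a name into a set of default relations. This is an illustration only: representing relations as (subject, relation, object) triples and the relation names "is-a" and "is-at" are assumptions, and the two river names are sample data.

```python
def river(name):
    """Expand the River(name) macro into (subject, relation, object) triples."""
    return [
        (name, "is-a", "river"),
        (name, "is-at", "Portugal"),
    ]


# Adding two Portuguese rivers through the macro
ontology = set()
for river_name in ["Douro", "Tejo"]:
    ontology.update(river(river_name))
```

The editor only supplies the name; the default relations come from the macro.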
Ontologies
Restrictions: formally stated descriptions of what must
be true in order for some assertion to be accepted as
input.
Ontology: information science (Wikipedia, 2012)
We can enforce that the capital of a country is a city:
add (X capital-of Y) iff X is-a City
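A minimal sketch of this restriction, assuming the ontology is stored as a set of (subject, relation, object) triples. The storage format, the function name add_capital and the sample facts are hypothetical.

```python
# Facts already in the ontology
facts = {("Lisbon", "is-a", "City"), ("Portugal", "is-a", "Country")}


def add_capital(x, y):
    """Accept (x capital-of y) only if x is already known to be a City."""
    if (x, "is-a", "City") not in facts:
        raise ValueError(f"{x} is not known to be a City")
    facts.add((x, "capital-of", y))
```

The restriction rejects input: add_capital("Douro", "Portugal") raises an error instead of changing the ontology.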
Ontologies
Rules: statements in the form of an antecedent-consequent
sentence that describe the logical inferences
that can be drawn from an assertion in a particular
form.
Ontology: information science (Wikipedia, 2012)
On the other hand, if we trust whoever is editing the ontology, we can
automatically classify the capital as a city, and its country as a... country:
X capital-of Y ⇒ X is-a City ∧ Y is-a Country
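A minimal sketch of this rule as forward inference over (subject, relation, object) triples. The triple representation, the function name apply_capital_rule and the sample fact are assumptions made for illustration.

```python
def apply_capital_rule(facts):
    """Return `facts` plus everything the capital-of rule lets us infer."""
    inferred = set(facts)
    for subject, relation, obj in facts:
        if relation == "capital-of":
            inferred.add((subject, "is-a", "City"))
            inferred.add((obj, "is-a", "Country"))
    return inferred


facts = {("Lisbon", "capital-of", "Portugal")}
closed = apply_capital_rule(facts)
```

Unlike a restriction, the rule never rejects input: it adds the missing classifications.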
Ontologies
Axioms: assertions (including rules) in a logical form
that together comprise the overall theory that the
ontology describes in its domain of application.
Ontology: information science (Wikipedia, 2012)
Axioms differ from Rules: they are tests that guarantee the ontology
structure. They are not used to infer new relations.
They assert, and can/should be used for consistency checking.
Ontologies
Events: the changing of attributes or relations.
Ontology: information science (Wikipedia, 2012)
Similar to rules, but they react to events. For example, if the user adds
a feature stating that an individual lays eggs, classify it as
oviparous.
Note that the division into Rules, Axioms and Events is not
universal, and depends a lot on the application that is used to
support the ontology.
Ontologies: Example 1
Ontologies: Example 2
Ontologies: Pros and Cons
Pros:
More flexible than Thesauri;
(graph with ad-hoc relationships)
Lots of formalisms and standards (OWL, SKOS, . . . );
Lots of tools to edit (like Protégé);
Languages for querying and completion (like SPARQL);
Cons:
As a classification approach, requires an authority for its
definition, just like Taxonomies or Thesauri.
Complexity: not everybody is able to create a detailed
ontology.
Further Reading
Folksonomies:
Folksonomy Coinage and Definition
http://vanderwal.net/folksonomy.html
Folksonomies: A User-Driven Approach to Organizing Content
http://www.uie.com/articles/folksonomies/
Folksonomies: power to the people
http://www.iskoi.org/doc/folksonomies.htm
Folksonomies: Tidying up Tags?
http://www.dlib.org/dlib/january06/guy/01guy.html
Folksonomies - Cooperative Classification and Communication
Through Shared Metadata
http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html
Further Reading
Taxonomies:
Taxonomy
http://en.wikipedia.org/wiki/Taxonomy
Perspectives on Taxonomy, Classification, Structure and
Find-ability
http://www.serviceinnovation.org/included/docs/kcs_taxonomy.pdf
Universal Decimal Classification
http://www.udcc.org/udcsummary/php/index.php
Thesauri:
Thesaurus
http://en.wikipedia.org/wiki/Thesaurus
Thesaurus principles and practice
http://www.willpowerinfo.co.uk/thesprin.htm
Further Reading
Ontologies:
Ontology (information science)
http://en.wikipedia.org/wiki/Ontology_(information_science)
Protégé Ontology Editor
http://protege.stanford.edu/
OWL Web Ontology Language
http://www.w3.org/TR/owl-features/
SPARQL Query Language for RDF
http://www.w3.org/TR/rdf-sparql-query/
Part II
Terminology and Translation
Overview
4 How translation works
5 The Role of Terminology in Translation
6 Translation Software
Standard translation software
Standard terminology management software
How Translation Works
Manual Translation
The translator uses resources such as dictionaries and terminologies, but
searches them manually. This is the type of translation done in the last century.
Computer Assisted Translation
The translator uses tools (CAT tools) that support the translation process. They help
the translator reuse previous translations, integrate with terminologies,
and handle different file formats.
Exploratory Translation
Using machine translation tools, such as Google Translate, to produce a quick
translation and understand a text. Not really a professional translation
process.
Machine Translation
Computer systems that translate text using different techniques, from
statistical information to translation rules. Quality has been rising in recent
years, but is still far from the result of professional translation work.
Computer Assisted Translation
CAT tools translation process:
1 The document is opened in the CAT tool;
2 The first sentence is extracted and presented for translation;
3 The sentence is looked up in a database of previously translated
sentences, searching for identical or similar sentences (fuzzy matching);
4 If a match is found, its translation is proposed (exact or fuzzy translation);
5 A terminology database is queried to check whether the
sentence includes relevant terms to be translated;
6 The translator reviews the translation;
7 The system saves the translation in the database of translations;
8 The system saves the translation in the translated document;
9 The next sentence is extracted, and the process returns to step 3.
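The loop above can be sketched in a few lines. This is not how any particular CAT tool is implemented: the similarity measure (the ratio from Python's difflib), the 0.6 threshold, the sample memory and terminology entries, and the function names are all assumptions made for illustration. The review callback stands for the human translator.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory (source -> target) and terminology database
memory = {"Open the file.": "Abra o ficheiro."}
terminology = {"file": "ficheiro", "folder": "pasta"}


def best_match(sentence, threshold=0.6):
    """Step 3: return (source, target, score) of the closest stored sentence, or None."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, sentence, source).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


def relevant_terms(sentence):
    """Step 5: return the terminology entries whose term occurs in the sentence."""
    words = sentence.lower().rstrip(".").split()
    return {word: terminology[word] for word in words if word in terminology}


def translate(sentences, review):
    """Steps 2 to 9: propose, let the translator review, then store the result."""
    output = []
    for sentence in sentences:
        match = best_match(sentence)
        suggestion = match[1] if match else None          # step 4
        final = review(sentence, suggestion, relevant_terms(sentence))  # step 6
        memory[sentence] = final                          # step 7
        output.append(final)                              # step 8
    return output
```

Every reviewed sentence is written back to the memory, so later sentences in the same document can already reuse it.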
Translation Memories
Databases of translations;
Store sentences in two or more languages;
Grow along with the work of the translator;
Can be shared between translators in the same project;
Some big companies make their TM available to contracted
translators in order to guarantee homogeneity in their
translations.
Terminology and Translation
Translating terminology takes up to 40% of the time in
translation:
Translators are not always familiar with technical areas;
Translators need to understand the term being translated;
Researching a specific area takes time;
Terminology reduces the time spent researching term translations.
Terminology helps the comprehension of concepts:
There is no way to translate without understanding;
Terminology might/should include explanations of terms;
Terminology helps with consistency and standardization:
Translate terms the same way throughout the document;
Translate terms the same way across all documents;
Companies, organizations, and governmental institutions define
specific terminologies that should be used by translators;
Further Reading
CAT software
Discover the benefits of using a CAT Tool: How can CAT
Tools help you? by Jonathan T. Hine Jr.
http://www.translationzone.com/en/translator-solutions/translation-memory/cat-tools/
What is a translation memory? by SDL Trados.
http://www.translationzone.com/en/translator-solutions/translation-memory/default.asp
What is terminology? by SDL Trados.
http://www.translationzone.com/en/translator-solutions/terminology-management/default.asp
Further Reading
Terminology in Translation
Terminology in translation, by Thorsten Trippel (1999)
http://www.spectrum.uni-bielefeld.de/~ttrippel/terminology/node19.html
Terminology Management in Translation, by Gabriele
Sauberer (2009)
http://www.termnet.org/downloads/english/events/itaindia_workshop/GS_Terminology_Management_in_Translation.pdf
The Role of Terminology Management in Localization, by Sue
Ellen Wright (2006)
http://www.translationzone.com/en/images/sue_ellen_slides_tcm18-25819.pdf
Managing Terminology for Translation Using Translation
Environment Tools: Towards a Definition of Best Practices,
by Marta Gómez Palou Allard (2012)
http://www.ruor.uottawa.ca/fr/bitstream/handle/10393/22837/Gomez_Palou_Allard_Marta_2012_thesis.pdf
Part III
Introduction to Corpora
Overview
7 Corpora
Monolingual Corpora
Parallel Corpora
Corpora in the Web
8 The web as Corpora
Do-it-yourself Corpora
Basic Crawling Tools
What is a Corpus?
cor·pus /ˈkôrpəs/
Noun
1. A collection of written texts, esp. the entire
works of a particular author or a body of writing
on a particular subject;
2. A collection of written or spoken material in
machine-readable form, assembled for the
purpose of studying linguistic structures,
frequencies, etc.
The plural of corpus is corpora.
Corpora Classification
Corpora are usually classified according to the number of
languages:
Monolingual Corpus:
documents are all written in one language;
(in some cases with more than one variant)
Multilingual Corpus:
documents are written in more than one language;
Corpora Classification
There are two especially relevant types of multilingual corpora:
Parallel Corpus:
a text placed alongside its translation or translations. Parallel
text alignment is the identification of corresponding blocks in
both halves of the parallel text.
Comparable Corpus:
is one which selects similar texts in more than one language or
variety. There is as yet no agreement on the nature of the
similarity, because there are very few examples of comparable
corpora
Expert Advisory Group on Language Engineering Standards Guidelines (1996)
Monolingual Corpora Examples
British National Corpus (http://www.natcorp.ox.ac.uk/)
The British National Corpus (BNC) is a 100 million word collection of
samples of written and spoken language from a wide range of sources,
designed to represent a wide cross-section of current British English, both
spoken and written.
CETEMPúblico (http://www.linguateca.pt/cetempublico/)
Corpus de Extractos de Textos Electrónicos MCT/Público is a corpus of
approximately 180 million words in European Portuguese. It was created
by the Computational Processing of Portuguese Project after an
agreement between the Ministry of Science and Technology and the
Público newspaper, in April, 2000.
CETENFolha (http://www.linguateca.pt/cetenfolha/)
Corpus de Extractos de Textos Electrónicos NILC/Folha de S. Paulo is a
corpus of approximately 24 million words in Brazilian Portuguese, created
by the Computational Processing of Portuguese Project using texts from
the newspaper Folha de S. Paulo, that are part of the NILC/São Carlos
Monolingual Corpora Examples
Russian National Corpus (http://ruscorpora.ru/en/index.html)
RNC is a corpus of the modern Russian language incorporating over 300
million words. The corpus of Russian is a reference system based on a
collection of Russian texts in electronic form.
Croatian National Corpus (http://www.hnk.ffzg.hr/cnc.htm)
HNK is a systematized collection of selected texts mainly written in
contemporary Croatian covering different media, genres, styles, fields and
topics. The Corpus is accompanied by additional linguistic and
non-linguistic data and stored in a database on our server which can be
accessed with the search client program Bonito.
KOTONOHA Corpus (http://www.kotonoha.gr.jp/)
The Balanced Corpus of Contemporary Written Japanese includes text
samples collected to be able to grasp an overall picture of the modern
Japanese written language and includes about 100 million words.
Parallel Corpora Examples
Aligned Hansards
(http://isi.edu/natural-language/download/hansard/)
Aligned Hansards of the 36th Parliament of Canada, contains 1.3 million
pairs of aligned text chunks (sentences or smaller fragments).
COMPARA ( http://www.linguateca.pt/COMPARA/)
COMPARA is a bidirectional parallel corpus of English and Portuguese. In
other words, it is a type of database with original and translated texts in
these two languages that have been linked together sentence by sentence.
Europarl ( http://www.statmt.org/europarl/)
The Europarl parallel corpus is extracted from the proceedings of the
European Parliament. It includes versions in 11 European languages:
Romanic (French, Italian, Spanish, Portuguese), Germanic (English,
Dutch, German, Danish, Swedish), Greek and Finnish.
Parallel Corpora Examples
JRC-Acquis (http://langtech.jrc.it/JRC-Acquis.html)
The Acquis Communautaire is the total body of European Union (EU)
law applicable in the EU Member States. It is a collection of parallel
texts in 22 languages: Bulgarian, Czech, Danish, German, Greek, English,
Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian,
Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene
and Swedish.
OPUS (http://opus.lingfil.uu.se/)
OPUS is a growing collection of translated texts from the web. In the
OPUS project we try to convert and align free online data, to add
linguistic annotation, and to provide the community with a publicly
available parallel corpus.
Per-Fide (http://per-fide.di.uminho.pt/cquery)
The Per-Fide project aims at developing parallel corpora between
Portuguese and six other languages: English, Russian, French, Italian,
German and Spanish.
Querying Corpora
Using http://corpus.leeds.ac.uk/protected/query.html
Concordances of a single word: dog
Concordances for a sequence of words: big bang
Concordances for lemmas: [lemma="have"]
Concordances for part of speech: [pos="NNS"]
Combinations of the above:
[lemma="have"] dog
[lemma="be"] [lemma="have"]
Regular expressions can be used: [pos="N.*"] [pos="V.*"]
Multiple restrictions for same word:
[pos="N.*" & word="d.*"] [pos="V.*"]
Empty words: [pos="N.*"] [] [pos="V.*"]
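To make the query syntax above more concrete, here is a minimal Python sketch that emulates a small part of it over a hand-tagged sentence. It is not the query engine behind the Leeds interface: the token representation, the function names (token_matches, find_matches) and the sample tags are assumptions made for illustration only.

```python
import re

# Each token is a dict of positional attributes, mirroring the word/lemma/pos
# attributes used in the queries above. The tags here are hand-assigned.
SENTENCE = [
    {"word": "dogs", "lemma": "dog", "pos": "NNS"},
    {"word": "often", "lemma": "often", "pos": "RB"},
    {"word": "bark", "lemma": "bark", "pos": "VBP"},
    {"word": "loudly", "lemma": "loudly", "pos": "RB"},
]


def token_matches(token, constraints):
    """True if every constrained attribute fully matches its regular expression."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraints.items())


def find_matches(tokens, query):
    """Return the word sequences matching a query (one constraint dict per position)."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(token_matches(tok, cons) for tok, cons in zip(window, query)):
            hits.append([tok["word"] for tok in window])
    return hits


# Rough equivalent of: [pos="N.*"] [] [pos="V.*"]  (an empty dict matches any token)
noun_any_verb = [{"pos": "N.*"}, {}, {"pos": "V.*"}]
print(find_matches(SENTENCE, noun_any_verb))

# Rough equivalent of: [pos="N.*" & word="d.*"]
print(find_matches(SENTENCE, [{"pos": "N.*", "word": "d.*"}]))
```

Several constraints in the same dict play the role of the & operator, and an empty dict plays the role of the empty token [].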
The Web as Corpora
To study “purposeful language behavior,” corpus
linguists require collections of authentic texts (spoken
and/or written). It is therefore not surprising that many
(corpus) linguists have recently turned to the World Wide
Web as the richest and most easily accessible source of
language material available. At the same time, for
language technologists, who have been arguing for long
that “more data is better data,” the WWW is a virtually
unlimited source of “more data.”
Wacky!
A Wacky Introduction
Silvia Bernardini, Marco Baroni and Stefan Evert
Do-it-yourself Corpora
The WWW has data from virtually any subject;
There is data in almost any language;
Therefore, it is possible to build custom corpora!
Collect text from the web. . .
. . . in a specific language. . .
. . . on the subject you want to study . . .
. . . and retrieve as much text as you need.
Basic Crawling Tools
There are standard download tools that follow HTML links,
and are able to download complete websites.
They are known as web spiders, or web robots;
Examples include “wget”, “wGetGUI” or “HTTrack”;
But you need to process the files yourself.
Some projects have developed tools specifically for
corpus building.
The best known is “BootCaT”.
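"Processing the files yourself" mostly means turning downloaded HTML into plain text. The following is a minimal sketch of that step using only the Python standard library; the class and function names (TextExtractor, html_to_text) and the sample page are made up for illustration, and real pages usually need extra cleaning (menus, boilerplate, encodings).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><h1>Cirrhosis</h1><p>Liver <b>fibrosis</b> is common.</p></body></html>")
print(html_to_text(sample))
```

The same function can be applied to every file saved by a downloader such as wget or HTTrack before the text is added to the corpus.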
Further Reading
Corpora:
Corpus Creation - Handbook of NLP
http://cgi.cse.unsw.edu.au/~handbookofnlp/index.php?n=Chapter7.Chapter7
Building and Using Your Own Corpora
http://www.lancs.ac.uk/fss/courses/ling/corpus/blue/diy_top.htm
CQP Query Language Tutorial
http://cwb.sourceforge.net/files/CQP_Tutorial/
Web as Corpora:
Wacky! Working papers on the Web as Corpus
http://wackybook.sslmit.unibo.it/
Wacky Wiki
http://wacky.sslmit.unibo.it/doku.php
Part IV
Terminology Extraction
from Monolingual Corpora
Overview
9 Corpora for Terminology Building
10 Obtaining candidate terms from Corpora
N-grams and Frequencies
Lexical Difference
Exploring Mutual Information
Morphology Constraints
11 Exploring a Tool: Term-o-Matic
Corpora for Terminology Building
Domain-specific texts are a relevant resource for
understanding the terminology of that domain;
Words in context give more information than words alone;
There is no automatic method to extract domain-specific
terminology from a domain-specific corpus;
Nevertheless, there are automatic methods to obtain candidate
terms, which can later be analysed and incorporated into a
terminology, or just discarded.
Word n-Grams
In the fields of computational linguistics and
probability, an n-gram is a contiguous sequence of n
items from a given sequence of text or speech.
The items in question can be phonemes, syllables,
letters, words or base pairs according to the application.
n-grams are collected automatically from a text or speech
corpus.
One-Grams
1-Grams are usually known as words/tokens. :-)
Peter Piper picked a peck
of pickled peppers.
A peck of pickled peppers
Peter Piper picked.
If Peter Piper picked
a peck of pickled peppers,
Where’s the peck of pickled
peppers Peter Piper picked?
peter 4
piper 4
picked 4
a 3
peck 4
of 4
pickled 4
...
...
Bigrams
All sequences of two words/tokens found in the text.
Peter Piper picked a peck
of pickled peppers.
A peck of pickled peppers
Peter Piper picked.
If Peter Piper picked
a peck of pickled peppers,
Where’s the peck of pickled
peppers Peter Piper picked?
peter piper 4
piper picked 4
picked a 2
a peck 3
peck of 4
of pickled 4
pickled peppers 4
...
...
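The one-gram and bigram counts above can be reproduced with a few lines of Python. This is a minimal sketch: the tokenizer (lowercasing, alphabetic tokens only) is an assumption, and n-grams are allowed to cross sentence boundaries, which a real tool may prefer to avoid.

```python
import re
from collections import Counter

TEXT = """Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers Peter Piper picked.
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?"""


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (with inner apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize(TEXT)
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

# Show the most frequent bigrams with their occurrence counts.
for gram, count in bigram_counts.most_common(5):
    print(" ".join(gram), count)
```

Calling ngrams(tokens, 3) gives trigrams in the same way; only the value of n changes.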
Top occurring trigrams for a real corpus
in accordance with 31148
referred to in 27581
the member states 16999
accordance with the 16535
of the european 14772
laid down in 13301
to in article 13211
having regard to 12588
regard to the 11416
member states shall 11392
in order to 10563
in the case 10029
the provisions of 9825
the case of 9575
provided for in 9560
the member state 9360
of the member 8656
the commission shall 8013
of this directive 6679
a member state 6306
on the basis 6292
the european parliament 6274
the basis of 6265
and in particular 6225
down in article 6200
of the community 5958
accordance with article 5758
to in paragraph 5690
opinion of the 5599
the opinion of 5191
the competent authorities 5074
for the purposes 5024
the purposes of 4946
with the procedure 4878
to the commission 4843
the european community 4834
n-gram frequency
n-grams are usually computed together with their occurrence
count, or frequency;
In some situations, such as statistical language models, other types
of measures are also computed (probability, i.e. relative
frequency; conditional probability; etc.);
One-gram frequencies do not help much with term candidate
extraction: they only say whether a word is more or less
frequent.
n-grams for n ≥ 2 can help find sequences of words that
occur many times.
Stop Words and Lexical Difference
There are words that rarely occur in terminology;
At least, they rarely occur at the beginning or end of a
multi-word term;
For example, pronouns, articles, prepositions;
These words are usually known as stop words;
Stop word lists of varying size are easy to find for most
languages;
We can ignore these words when computing n-grams.
Detecting stop-words
in accordance with 31148
referred to in 27581
the member states 16999
accordance with the 16535
of the european 14772
laid down in 13301
to in article 13211
having regard to 12588
regard to the 11416
member states shall 11392
in order to 10563
in the case 10029
the provisions of 9825
the case of 9575
provided for in 9560
the member state 9360
of the member 8656
the commission shall 8013
of this directive 6679
a member state 6306
on the basis 6292
the european parliament 6274
the basis of 6265
and in particular 6225
down in article 6200
of the community 5958
accordance with article 5758
to in paragraph 5690
opinion of the 5599
the opinion of 5191
the competent authorities 5074
for the purposes 5024
the purposes of 4946
with the procedure 4878
to the commission 4843
the european community 4834
Replacing stop words by a special token
<tk> member states 32517
member states <tk> 30108
<tk> member state 19345
member state <tk> 17882
council directive <tk> 7869
<tk> council directive 7129
<tk> european parliament 5397
council regulation <tk> 5259
european parliament <tk> 5125
<tk> council regulation 4995
<tk> competent authorities 4964
competent authorities <tk> 4736
procedure laid <tk> 4472
<tk> treaty establishing 4375
treaty establishing <tk> 4373
<tk> competent authority 3694
official journal <tk> 3530
competent authority <tk> 3507
annex ii <tk> 3429
commission regulation <tk> 3171
<tk> commission regulation 2967
commission decision <tk> 2545
<tk> customs authorities 2542
<tk> commission decision 2429
customs authorities <tk> 2410
<tk> european economic 2285
<tk> administrative provisions 2017
<tk> contracting parties 2010
conditions laid <tk> 1998
contracting parties <tk> 1779
commission directive <tk> 1764
detailed rules <tk> 1738
<tk> community industry 1728
<tk> contracting party 1702
Trigrams that do not include stop words
member states relating 1523
member state concerned 1200
veterinary medicinal products 955
maximum residue limits 814
physically modified derivatives 700
european economic community 691
community trade mark 538
member states concerned 508
plant protection products 464
home member state 442
host member state 388
council common position 377
community plant variety 368
european atomic energy 346
animal health conditions 342
authorised representative established 327
implementing powers conferred 311
regional economic integration 263
median longitudinal plane 258
plant protection product 249
separate technical unit 246
national regulatory authorities 241
apply mutatis mutandis 241
common technical regulation 229
separate technical units 226
emission limit values 219
technically permissible maximum 215
maximum residue levels 212
retail trade services 200
temporary importation procedure 196
medicinal products intended 195
community transit procedure 195
atomic energy community 193
classical swine fever 189
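The two listings above (stop words replaced by a special token, and trigrams without any stop word) can be sketched in Python as follows. The stop word list is a tiny illustrative one, and reading the first listing as "n-grams with exactly one placeholder at the first or last position" is an assumption about how it was produced.

```python
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "of", "in", "to", "a", "with", "shall", "by"}
PLACEHOLDER = "<tk>"


def mask_stop_words(tokens):
    """Replace every stop word by the placeholder token."""
    return [PLACEHOLDER if tok in STOP_WORDS else tok for tok in tokens]


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def content_ngrams(tokens, n):
    """Count n-grams of the masked text that contain no placeholder at all."""
    masked = mask_stop_words(tokens)
    return Counter(g for g in ngrams(masked, n) if PLACEHOLDER not in g)


def edge_masked_ngrams(tokens, n):
    """Count n-grams with exactly one placeholder, at the first or last position."""
    masked = mask_stop_words(tokens)
    return Counter(
        g for g in ngrams(masked, n)
        if g.count(PLACEHOLDER) == 1 and PLACEHOLDER in (g[0], g[-1])
    )


tokens = "the member states shall inform the commission of the measures".split()
print(edge_masked_ngrams(tokens, 3))
print(content_ngrams(tokens, 2))
```

Masking instead of deleting stop words keeps the original word positions, so words that were separated by a stop word are never glued into a false n-gram.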
Basic Lexical Difference
What if we remove not just stop words, but also common words?
Osteoarthritis is rarely found in general text.
Therefore, it is likely to be a domain term.
We can obtain a list of common words from a generic corpus
(say, journalistic text) and subtract that lexicon from the
one-grams we obtained.
The result should include good term candidates!
Basic Lexical Difference - Experiment
Two random abstracts from PubMed articles related to cirrhosis;
The 1,000 most frequent words in English;
Compute one-grams on the abstracts;
Subtract the most frequent words.
Before
liver 8
is 7
fibrosis 6
myofibroblast 6
pathway 5
kidney 5
expression 5
interstitial 4
signaling 3
target 3
differentiation 3
diseases 3
medullary 3
antioxidant 3
After
liver 8
myofibroblast 6
fibrosis 6
pathway 5
kidney 5
interstitial 4
β-catenin 3
target 3
signaling 3
genes 3
differentiation 3
medullary 3
renal 3
adult 3
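The experiment above amounts to counting words and then dropping those found in a common-word list. Here is a minimal Python sketch; the sample sentence and the very short COMMON_WORDS set are invented stand-ins for the PubMed abstracts and the list of most frequent English words.

```python
from collections import Counter

# Stand-in for a list of the most frequent words of a general corpus.
COMMON_WORDS = {"is", "the", "of", "in", "and", "a", "expression", "diseases"}


def lexical_difference(tokens, common_words):
    """Count tokens, then drop every word that belongs to the common lexicon."""
    counts = Counter(tokens)
    return Counter({w: c for w, c in counts.items() if w not in common_words})


abstract = ("liver fibrosis is a feature of chronic liver diseases and "
            "the myofibroblast is the key cell in fibrosis").split()
candidates = lexical_difference(abstract, COMMON_WORDS)
print(candidates.most_common(3))
```

Words such as "is" disappear from the list, while domain words such as "liver" or "fibrosis" keep their counts.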
Lexical Distribution Difference
The previous example could benefit from a larger standard lexicon list;
Abstracts are crowded with terminology, and have few other words;
Long lists may include words that are considered terminology!
For example, in Informatics, folder or file can be terms.
Instead of considering words as present or not, use their
frequency;
For instance, compute relative frequencies and
compare/subtract them;
Use a distribution comparison metric;
e.g., the Kullback-Leibler terms: $P(i) \log \frac{P(i)}{Q(i)}$
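As an illustration of the frequency-based comparison, the sketch below computes the per-word Kullback-Leibler term for a domain text (P) against a general text (Q). The two sample texts, the function names and the floor probability used for words missing from the general text are assumptions of this sketch, not part of the method as presented.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_terms(domain_tokens, general_tokens, floor=1e-6):
    """Per-word contribution P(i) * log(P(i) / Q(i)).

    P is estimated on the domain corpus and Q on the general corpus. Words
    missing from the general corpus get the small probability `floor` (an
    assumption of this sketch) so that the logarithm stays defined.
    """
    p = relative_frequencies(domain_tokens)
    q = relative_frequencies(general_tokens)
    return {w: pw * math.log(pw / q.get(w, floor)) for w, pw in p.items()}


domain = "open the file then save the file in the folder".split()
general = "the cat sat on the mat and the dog saw the file".split()

scores = kl_terms(domain, general)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{word}\t{score:.3f}")
```

Words that are relatively more frequent in the domain text than in the general text get a positive score; words that are relatively less frequent, like "the" here, get a negative one.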
Pointwise Mutual Information
The Mutual Information (MI) is a quantity that measures the
mutual dependence of two random variables X and Y .
$$MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
Intuitively, mutual information measures the information that X
and Y share: it measures how much knowing one of these
variables reduces uncertainty about the other.
Pointwise Mutual Information
When computing Mutual Information for two specific outcomes,
the Pointwise Mutual Information (PMI) lets us measure their
mutual dependence:
$$PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
Given the number of tokens in the document, N, and the
number of occurrences of x, Oc(x): $P(x) = \frac{Oc(x)}{N}$
Given the number of tokens in the document, N, and the
number of occurrences of the bigram x, y, Oc(x, y):
$P(x, y) = \frac{Oc(x, y)}{N}$
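The definitions above translate directly into code. The following Python sketch computes PMI from raw counts; the function names and the sample sentence are illustrative assumptions.

```python
import math
from collections import Counter


def pmi(oc_xy, oc_x, oc_y, n):
    """PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with every P estimated as Oc / N."""
    p_xy = oc_xy / n
    p_x = oc_x / n
    p_y = oc_y / n
    return math.log2(p_xy / (p_x * p_y))


def bigram_pmi(tokens):
    """Return {(x, y): PMI} for every bigram in the token list."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): pmi(count, unigrams[x], unigrams[y], n)
        for (x, y), count in bigrams.items()
    }


tokens = "black holes and neutron stars : black holes are dense".split()
scores = bigram_pmi(tokens)
print(round(scores[("black", "holes")], 4))
```

Note that when x, y and the bigram x y each occur exactly once, the formula reduces to log2 N, the maximum possible value for that corpus. This is why sorting by PMI alone tends to put one-off bigrams at the top.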
Pointwise Mutual Information
Sorted by occurrence count (bigram, count, PMI)
sonic fabric 14 7.3566
black holes 9 8.0912
black hole 7 8.0912
cassette tape 6 8.4968
build things 4 9.5348
smartphone makers 3 9.0087
alyce santoro 3 8.0912
like scratching 3 9.0087
barnard said 3 8.3042
milky way 3 9.1787
possible black 3 7.6762
neutron star 3 8.8567
just right 3 8.5937
records backwards 3 10.5937
Sorted by PMI (bigram, count, PMI)
special shuttle 1 12.1787
immediately reminded 1 12.1787
remain aware 1 12.1787
richard branson 1 12.1787
supercooled pods 1 12.1787
richie havens 1 12.1787
auspicious locations 1 12.1787
jimi hendrix 1 12.1787
account settings 1 12.1787
baggage carousel 1 12.1787
buddhist prayer 1 12.1787
reinvents electronics 1 12.1787
melbourne institute 1 12.1787
cow manure 1 12.1787
From a very small corpus constructed with 5 CNN news stories.
Morphology Patterns
Commonly, terms are nouns or noun phrases;
Sometimes verbs are also interesting;
Typically the morphological structure of terms is well known;
There is software that computes morphological information
about each word in a sentence;
We can use that information to obtain better term candidates:
specify the terms' part of speech, gender, number, verb tense,
etc.
Morphological Analysis
How it (usually) works:
1 A sentence splitter and a tokenizer split the text into sentences
and tokens;
(different tools apply them in a different order, some as a single tool)
2 A morphological analyzer associates all possible analyses with each
word;
(it does not resolve ambiguity, it just lists every possible analysis)
3 A tagger or parser chooses the most likely analysis;
(using knowledge from manually annotated corpora and machine
learning algorithms)
Morphological Patterns - Examples
Noun Noun Noun
659 Community trade mark
483 plant protection products
475 EEC component type-approval
448 document number C
320 Community transit procedure
290 plant protection product
288 Community plant variety
257 EC type-examination certificate
214 EC component type-approval
176 EEC pattern approval
157 African swine fever
155 three-wheel motor vehicles
155 foot-and-mouth disease virus
153 conformity assessment procedures
148 emission limit values
Adjective Adjective Noun
912 veterinary medicinal products
453 common agricultural policy
365 separate technical unit
291 separate technical units
265 median longitudinal plane
223 regional economic integration
202 competent national authorities
200 trans-European high-speed rail
199 sound financial management
189 veterinary medicinal product
182 certain agricultural products
176 national regulatory authorities
175 common technical regulation
168 certain third countries
166 other third countries
166 definitive anti-dumping duty
162 certain dangerous substances
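Once every word carries a category, filtering by a pattern such as Noun Noun Noun is a sliding-window comparison. The Python sketch below uses a hand-tagged token list; in practice the tags would come from a tagger, and the category names and the function name match_pattern are assumptions for illustration.

```python
from collections import Counter

# Hand-tagged toy input: (word, coarse category). In practice the tags would
# come from a morphological analyser or tagger.
TAGGED = [
    ("the", "Det"), ("veterinary", "Adj"), ("medicinal", "Adj"),
    ("products", "Noun"), ("and", "Conj"), ("the", "Det"),
    ("plant", "Noun"), ("protection", "Noun"), ("products", "Noun"),
]


def match_pattern(tagged, pattern):
    """Count word sequences whose categories equal the pattern, position by position."""
    size = len(pattern)
    hits = Counter()
    for i in range(len(tagged) - size + 1):
        window = tagged[i:i + size]
        if [cat for _, cat in window] == list(pattern):
            hits[" ".join(word for word, _ in window)] += 1
    return hits


print(match_pattern(TAGGED, ("Noun", "Noun", "Noun")))
print(match_pattern(TAGGED, ("Adj", "Adj", "Noun")))
```

Run over a large corpus, the resulting counters give candidate lists like the two shown above.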
Term-o-Matic
http://www.termomatic.com/
Term-o-Matic
What it is:
A simple web application;
Without user control;
Developed specifically for this class;
Implements some of the methods presented before;
What it is not:
Commercial software;
A professional tool;
A tool free of bugs;
A multilingual tool.
Term-o-Matic: overview
The main screen shows the options and a summary of the available data.
Term-o-Matic: Add Text
Use the Add Text option to add one-grams, bigrams and trigrams
into the database (English, please!).
Term-o-Matic: Add Text feedback
After adding some text, a summary of the amount of data added is
shown.
Term-o-Matic: Manage Stopwords
The Stop Words option lets you manage the list of stop words. Words
can be added (to add more than one, separate them with spaces or
punctuation) and deleted.
Term-o-Matic: Manage Lexicon
The Standard Lexicon option is very similar to the Stop Words
option, but for the generic words.
T-o-M: Words, Bigrams and Trigrams
The Study Words, Study Bigrams and Study Trigrams options all work in
the same way, showing a list of words/bigrams/trigrams.
T-o-M: Words, Bigrams and Trigrams
Note that the PMI column is empty. This measure takes some time
to compute, and therefore should be computed only when needed.
T-o-M: Words, Bigrams and Trigrams
To compute PMI, use the Compute bi/trigrams PMI option. After the
software issues an "OK" message, hit the back button on your
browser and refresh.
T-o-M: Words, Bigrams and Trigrams
By default the list is sorted by occurrence count. You can change
to PMI order as soon as it is computed.
T-o-M: Words, Bigrams and Trigrams
It is possible to remove entries with stop-words or punctuation; or
entries with common words.
T-o-M: Filtering by pattern
To filter by a morphological pattern you must ensure that you run
the Compute Morph. Analysis option after the last time you
entered text.
When the software says the process is complete (OK), hit the back
button, and you are ready to use the pattern filtering.
Just choose the categories you are looking for, and search for them.
T-o-M: Filtering by Pattern
Term-o-Matic: standard operation guide
1 Use the Add Text option to add text.
Use it as many times as you need to create a big enough
corpus;
Do not add too much text at once. Add by blocks.
Be sure to add thematic text;
2 Define a list of stop words (you might already have one).
3 Define a list of common words.
Look for such lists on the web.
4 Compute PMIs and Morphological Analysis.
5 Do queries!
Evaluation Task
Five students, five subject areas, five Term-o-Matic instances.
Computer Science (http://termomatic.com/termomatic1)
Medicine (http://termomatic.com/termomatic2)
Europe (http://termomatic.com/termomatic3)
Animal Biology (http://termomatic.com/termomatic4)
Sports (http://termomatic.com/termomatic5)
Part V
Terminology Extraction
from Multilingual Corpora
Overview
12 Sentence and Word Alignment
13 Parallel Patterns
Sentence Alignment
Sentence alignment is the task of detecting translation
relationships between sentences in parallel corpora.
If $s_\alpha$ is a sentence in a language $L_\alpha$ and $s_\beta$ is a sentence in a
language $L_\beta$, the alignment process creates the pair $(s_\alpha, s_\beta)$ if
(there is a high probability that) $s_\beta$ is a translation of $s_\alpha$.
Word Alignment
Word alignment is the task of detecting translation
relationships between words or terms in sentence-aligned parallel
corpora.
There are two approaches to word alignment:
for each aligned sentence pair, create a link between every word
and its translation;
for the complete corpus, obtain a relationship between a
word and a set of probable translations, together with a
confidence measure (a kind of translation probability);
Probabilistic Translation Dictionaries
Obtained with one of the word alignment methods;
Define a relationship between a word and a set of probable
translations;
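$$T(\text{europe}) = \begin{cases}
\text{europa} & 94.7\% \\
\text{europeus} & 3.4\% \\
\text{europeu} & 0.8\% \\
\text{europeia} & 0.1\%
\end{cases}$$
$$T(\text{stupid}) = \begin{cases}
\text{estúpido} & 47.6\% \\
\text{estúpida} & 11.0\% \\
\text{estúpidos} & 7.4\% \\
\text{avisada} & 5.6\% \\
\text{direita} & 5.6\% \\
\text{impasse} & 4.5\% \\
\text{ocupado} & 3.8\%
\end{cases}$$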
Translation Matrix
              discussion about alternative sources of financing for the european radical alliance  .
discussão             44     0           0       0  0         0   0   0        0       0        0  0
sobre                  0    11           0       0  0         0   0   0        0       0        0  0
fontes                 0     0           0      74  0         0   0   0        0       0        0  0
de                     0     3           0       0 27         0   6   3        0       0        0  0
financiamento          0     0           0       0  0        56   0   0        0       0        0  0
alternativas           0     0          23       0  0         0   0   0        0       0        0  0
para                   0     0           0       0  0         0  28   0        0       0        0  0
a                      0     1           0       0  1         0   4  33        0       0        0  0
aliança                0     0           0       0  0         0   0   0        0       0       65  0
radical                0     0           0       0  0         0   0   0        0      80        0  0
europeia               0     0           0       0  0         0   0   0       59       0        0  0
.                      0     0           0       0  0         0   0   0        0       0        0 80
Using the probabilistic translation dictionaries we are able to
construct a translation matrix;
Each cell has a translation probability obtained from the
dictionary;
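Building such a matrix from a probabilistic translation dictionary is a simple double loop. The Python sketch below uses a toy dictionary; its probabilities are illustrative values (a few matrix cells read as percentages, which is an assumption), and the function name translation_matrix is made up.

```python
# Toy probabilistic translation dictionary: source word -> {target word: probability}.
# The numbers are illustrative only.
T = {
    "alternative": {"alternativas": 0.23},
    "sources": {"fontes": 0.74},
    "of": {"de": 0.27, "a": 0.01},
    "financing": {"financiamento": 0.56},
}


def translation_matrix(source_tokens, target_tokens, dictionary):
    """matrix[i][j] = probability that target_tokens[i] translates source_tokens[j]."""
    return [
        [dictionary.get(src, {}).get(tgt, 0.0) for src in source_tokens]
        for tgt in target_tokens
    ]


source = ["alternative", "sources", "of", "financing"]
target = ["fontes", "de", "financiamento", "alternativas"]

matrix = translation_matrix(source, target, T)
for word, row in zip(target, matrix):
    print(f"{word:15s}", " ".join(f"{cell:4.2f}" for cell in row))
```

Word pairs that are not in the dictionary get probability 0, which is why most cells of a real matrix are empty.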
Translation Patterns
Translation changes word order (for some language pairs!);
This change can be foreseen;
This change can be defined formally as a pattern;
These patterns can be used to obtain term candidates.
Translation Pattern 1: ABBA
          Jogos  Olímpicos
Olympic              X
Games       X
Formally,
T(A · B) = T(B) · T(A)
Or in the tool syntax:
[ABBA] A B = B A
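To show how the ABBA pattern can be searched for, here is a simplified Python sketch. It is not the tool's implementation: it looks for two adjacent source words whose probable translations appear adjacent and swapped in the target sentence. The dictionary values, the threshold and the function names are assumptions made for illustration.

```python
# Toy probabilistic dictionary (Portuguese -> English); values are made up.
T = {
    "jogos": {"games": 0.8, "game": 0.1},
    "olímpicos": {"olympic": 0.9},
    "os": {"the": 0.7},
}


def prob(dictionary, src, tgt):
    """Translation probability of src -> tgt, or 0.0 when the pair is unknown."""
    return dictionary.get(src, {}).get(tgt, 0.0)


def find_abba(source, target, dictionary, threshold=0.5):
    """Find source bigrams A B whose translations appear swapped (B A) in the target."""
    pairs = []
    for i in range(len(source) - 1):
        a, b = source[i], source[i + 1]
        for j in range(len(target) - 1):
            # T(A . B) = T(B) . T(A): B maps to position j, A to position j + 1.
            if (prob(dictionary, b, target[j]) >= threshold
                    and prob(dictionary, a, target[j + 1]) >= threshold):
                pairs.append((f"{a} {b}", f"{target[j]} {target[j + 1]}"))
    return pairs


source = ["os", "jogos", "olímpicos"]
target = ["the", "olympic", "games"]
print(find_abba(source, target, T))
```

Counting the pairs returned over a whole sentence-aligned corpus gives a ranked list of bilingual term candidates.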
Translation Pattern 2: IDH
             índice  de  desenvolvimento  humano
human                                       X
development                     X
index          X
T(I · "de" · D · H) = T(H) · T(D) · T(I)
[IDH] I "de" D H = H D I
Translation Pattern 3: FTP
          protocolo  de  transferência  de  ficheiros
file                                            X
transfer                      X
protocol      X
T(P · "de" · T · "de" · F) = T(F) · T(T) · T(P)
[FTP] P "de" T "de" F = F T P
Patterns in Translation Matrix
              discussion about alternative sources of financing for the european radical alliance  .
discussão             44     0           0       0  0         0   0   0        0       0        0  0
sobre                  0    11           0       0  0         0   0   0        0       0        0  0
fontes                 0     0           0      74  0         0   0   0        0       0        0  0
de                     0     3           0       0 27         0   6   3        0       0        0  0
financiamento          0     0           0       0  0        56   0   0        0       0        0  0
alternativas           0     0          23       0  0         0   0   0        0       0        0  0
para                   0     0           0       0  0         0  28   0        0       0        0  0
a                      0     1           0       0  1         0   4  33        0       0        0  0
aliança                0     0           0       0  0         0   0   0        0       0       65  0
radical                0     0           0       0  0         0   0   0        0      80        0  0
europeia               0     0           0       0  0         0   0   0       59       0        0  0
.                      0     0           0       0  0         0   0   0        0       0        0 80
The two boxes (around "fontes de financiamento alternativas" / "alternative
sources of financing" and around "aliança radical europeia" / "european
radical alliance") correspond to the following two patterns:
[P1] F "de" N A = A F "of" N
[P2] A B C = C B A
Terms extracted using A B = B A
21007 união europeia ⇒ european union
9301 parlamento europeu ⇒ european parliament
4171 direitos humanos ⇒ human rights
3504 estados unidos ⇒ united states
2353 mercado interno ⇒ internal market
1911 posição comum ⇒ common position
1826 países candidatos ⇒ candidate countries
1776 comissão europeia ⇒ european commission
1708 conselho europeu ⇒ european council
1629 saúde pública ⇒ public health
1558 direitos fundamentais ⇒ fundamental rights
1546 nações unidas ⇒ united nations
1337 países terceiros ⇒ third countries
1294 conferência intergovernamental ⇒ intergovernmental conference
1258 fundos estruturais ⇒ structural funds
Terms extracted using A "de" B = B A
729 plano de acção ⇒ action plan
722 conselho de segurança ⇒ security council
680 processo de paz ⇒ peace process
582 mercado de trabalho ⇒ labour market
580 pena de morte ⇒ death penalty
492 pacto de estabilidade ⇒ stability pact
431 política de defesa ⇒ defence policy
353 acordo de associação ⇒ association agreement
348 protocolo de quioto ⇒ kyoto protocol
343 programa de acção ⇒ action programme
259 branqueamento de capitais ⇒ money laundering
258 comité de conciliação ⇒ conciliation committee
241 política de concorrência ⇒ competition policy
226 processo de conciliação ⇒ conciliation procedure
217 requerentes de asilo ⇒ asylum seekers
Terms extracted using A B C = C B A
531 política agrícola comum ⇒ common agricultural policy
418 banco central europeu ⇒ european central bank
329 tribunal penal internacional ⇒ international criminal court
166 aliança livre europeia ⇒ european free alliance
156 modelo social europeu ⇒ european social model
153 partidos políticos europeus ⇒ european political parties
83 fundo monetário internacional ⇒ international monetary fund
75 política externa comum ⇒ common foreign policy
66 organização marítima internacional ⇒ international maritime organisation
65 própria união europeia ⇒ european union itself
65 fundo social europeu ⇒ european social fund
55 direitos humanos fundamentais ⇒ fundamental human rights
45 relações económicas externas ⇒ external economic relations
45 homens e mulheres ⇒ women and men
45 agência espacial europeia ⇒ european space agency
Terms extracted using I "de" D H = H D I
95 mandato de captura europeu ⇒ european arrest warrant
85 fontes de energia renováveis ⇒ renewable energy sources
80 mandado de captura europeu ⇒ european arrest warrant
67 sistemas de segurança social ⇒ social security systems
64 zona de comércio livre ⇒ free trade area
55 força de reacção rápida ⇒ rapid reaction force
54 orientações de política económica ⇒ economic policy guidelines
46 planos de acção nacionais ⇒ national action plans
46 direitos de propriedade intelectual ⇒ intellectual property rights
33 sistema de alerta rápido ⇒ rapid alert system
29 política de defesa comum ⇒ common defence policy
29 método de coordenação aberta ⇒ open coordination method
27 método de coordenação aberto ⇒ open coordination method
27 conselho de empresa europeu ⇒ european works council
25 acordo de comércio livre ⇒ free trade agreement
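A pattern with a quoted word can be read as: the quoted token must occur literally in the source and has no counterpart in the target. The sketch below illustrates that reading under stated assumptions: `TOY_DICT`, `tokenize` and `matches` are hypothetical names, and the quoting convention is inferred from the pattern shown on this slide rather than taken from the tool's documentation.

```python
# Toy word-level translation dictionary (hypothetical, for illustration only).
TOY_DICT = {
    "mandato": {"warrant"},
    "captura": {"arrest"},
    "europeu": {"european"},
}


def tokenize(side):
    """Split one pattern side into (kind, value) pairs.

    Quoted tokens are literals; everything else is a variable.
    """
    return [
        ("lit", tok.strip('"')) if tok.startswith('"') else ("var", tok)
        for tok in side.split()
    ]


def matches(pattern, source, target, dictionary=TOY_DICT):
    """Check a source/target pair against a pattern that may hold literals."""
    left, right = (tokenize(side) for side in pattern.split("="))
    if len(source) != len(left) or len(target) != len(right):
        return False
    binding = {}
    for (kind, value), word in zip(left, source):
        if kind == "lit":
            if word != value:  # the literal must appear verbatim
                return False
        else:
            binding[value] = word
    # Target side: each variable must be translated by the word in its slot.
    return all(
        kind == "var" and word in dictionary.get(binding.get(value), set())
        for (kind, value), word in zip(right, target)
    )


print(matches('I "de" D H = H D I',
              ["mandato", "de", "captura", "europeu"],
              ["european", "arrest", "warrant"]))
```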
25 acordo de com´ercio livre ⇒ free trade agreement
Adding Morphological Constraints
The pattern language supports constraints;
Constraints can be of different types;
The most interesting are the morphological ones:
[ABBA] A B[CAT<-adj] = B[CAT<-adj] A
With this kind of constraint we can force the words in specific
positions to belong to a specific morphological category.
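The effect of such a constraint can be sketched as an extra filter on top of the word-order check. The example assumes that `CAT<-adj` means "the word in this position must be an adjective"; `TOY_DICT`, `TOY_CAT` and `abba` are hypothetical names, and the category labels would normally come from a morphological analyser.

```python
# Toy translation dictionary (hypothetical, for illustration only).
TOY_DICT = {
    "direitos": {"rights"},
    "humanos": {"human"},
    "estados": {"states"},
    "membros": {"member"},
}

# Toy morphological categories for words of both languages (hypothetical).
TOY_CAT = {
    "direitos": "noun", "rights": "noun",
    "humanos": "adj", "human": "adj",
    "estados": "noun", "states": "noun",
    "membros": "noun", "member": "noun",
}


def abba(source, target, require_adj=True):
    """Check a two-word pair against A B = B A.

    With require_adj set, B must be an adjective on both sides.
    """
    if len(source) != 2 or len(target) != 2:
        return False
    a, b = source
    b_tgt, a_tgt = target
    # Word-order check: the target is the source with its words swapped.
    if a_tgt not in TOY_DICT.get(a, set()) or b_tgt not in TOY_DICT.get(b, set()):
        return False
    if require_adj:
        # Morphological constraint on position B in both languages.
        return TOY_CAT.get(b) == "adj" and TOY_CAT.get(b_tgt) == "adj"
    return True


print(abba(["direitos", "humanos"], ["human", "rights"]))
print(abba(["estados", "membros"], ["member", "states"]))
```

In this toy setting the noun-noun pair "estados membros" passes the plain A B = B A check but is filtered out once the adjective constraint is switched on.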
Further Reading
Alignment tasks
Sentence Alignment Survey
http://www.statmt.org/survey/Topic/SentenceAlignment
An overview of bitext alignment algorithms
http://www.ida.liu.se/~jodfo/gslt/bitext-alignment-jody.pdf
Word Alignment Survey
http://www.statmt.org/survey/Topic/WordAlignment
Terminology from Parallel Corpora
Parallel corpus-based bilingual terminology extraction
http://ambs.perl-hackers.net/publications/tia09.pdf

EMLex-A5: Specialized Dictionaries

  • 1. A5 - Specialized Dictionaries Alberto Sim˜oes ambs@ilch.uminho.pt EMLex 2012/2013 Erlangen Alberto Sim˜oes A5 - Specialized Dictionaries 1/138
  • 2. Part I Terminology vs Lexicography Alberto Sim˜oes A5 - Specialized Dictionaries 2/138
  • 3. Overview 1 Term Orientation vs Concept Orientation 2 Classification Systems What are Classification Systems? Folksonomies Taxonomies Thesauri Ontologies 3 Further Reading Alberto Sim˜oes A5 - Specialized Dictionaries 3/138
  • 4. Term vs Concept Orientation Most dictionaries are organized by terms: users look up entries by the word; entries describe all possible senses; the same explanation can appear for different words (synonyms); Most terminologies are organized by concepts: users look up entries by an instance word; but concepts exist organized as a single block; each concept is represented only once; all synonyms (and antonyms) are presented together; Alberto Sim˜oes A5 - Specialized Dictionaries 4/138
  • 5. Term vs Concept Orientation Term Orientation: Dictionary Definition from Dictionary.com (May 3rd, 2013) Alberto Sim˜oes A5 - Specialized Dictionaries 5/138
  • 6. Term vs Concept Orientation Concept Orientation: Terminology Entry from DeCS - Health Sciences Descriptors (May 3rd, 2013) Alberto Sim˜oes A5 - Specialized Dictionaries 6/138
  • 7. Classification Systems Humans tend to organize; “disorganization is a kind of organization” This organization is usually done by classification; Classification can be as simple as tagging an object; “this is the pile of important documents, that of the unimportant ones” Classification is used everywhere! Alberto Sim˜oes A5 - Specialized Dictionaries 7/138
  • 8. Where are classification systems used? Internet Social Networks (tagging); Libraries (ex. Universal Decimal Classification); Medicine (ex. Unified Medical Language System) Chemistry (ex. Periodic Table); Geography (ex. Geographic Taxonomy); Biology (ex. Linnaean taxonomy, Protein classification, . . . ); Alberto Sim˜oes A5 - Specialized Dictionaries 8/138
  • 9. Classification Systems Classes Classification Systems can also be classified; One way to classify classification systems is by their ability to include properties and relations between the classified objects; We will discuss four types of classification systems: Folksonomies Taxonomies Thesauri Ontologies Alberto Sim˜oes A5 - Specialized Dictionaries 9/138
  • 10. Folksonomies Alberto Sim˜oes A5 - Specialized Dictionaries 10/138
  • 11. Folksonomies A folksonomy is a system of classification derived from the practice and method of collaboratively creating and managing tags to annotate and categorize content; this practice is also known as collaborative tagging, so- cial classification, social indexing, and social tagging. Folksonomy, a term coined by Thomas Vander Wal, is a portmanteau of folk and taxonomy. Folksonomy (Wikipedia, 2012) Alberto Sim˜oes A5 - Specialized Dictionaries 11/138
  • 12. Folsksonomies: How they work Other classification techniques often define someone or some group in charge of creating the classification system structure (authority); This group of people see the world from a specific point of view, that can be, or not, shared by others; Folksonomies solve this problem: power to the people; Instead of partitioning the world according to one particular view. They let the user present facets of objects; Users assign keywords (or tags, or labels) to objects (individuals); These keywords can be searched, indexed, and mathematical models can be applied to this data. Alberto Sim˜oes A5 - Specialized Dictionaries 12/138
  • 13. Folksonomies An empirical analysis of the complex dynamics of tag- ging systems, published in 2007, has shown that con- sensus around stable distributions and shared vocab- ularies does emerge, even in the absence of a central controlled vocabulary. For content to be searchable, it should be categorized and grouped. While this was believed to require commonly agreed on sets of con- tent describing tags (much like keywords of a journal article), recent research has found that, in large folk- sonomies, common structures also emerge on the level of categorizations. Accordingly, it is possible to devise mathematical models that allow for translating from personal tag vocabularies (personomies) to the vocab- ulary shared by most users. Folksonomy (Wikipedia, 2012) Alberto Sim˜oes A5 - Specialized Dictionaries 13/138
  • 14. Folksonomies: example Top categories in the Portuguese Wikipedia (single words): 375 Sociologia 383 Ponerinae 395 Afro-brasileiros 404 Drilliidae 413 Filosofia 415 Coleophoridae 424 Psicologia 428 Terebridae 445 Clathurellinae 445 Digimons 445 Teuto-brasileiros 451 Apiaceae 483 Asteroides 486 Luso-brasileiros 492 Acaena 526 Rubiaceae 537 Dolichoderinae 730 Agonoxenidae 735 Acalypha 753 Mangeliinae 762 Crambidae 787 Poaceae 808 Colet^aneas 824 Theraphosidae 854 Myrmicinae 962 Fabaceae 974 Formicidae 1065 Agrostis 1096 Formicinae 1177 Aloe 1328 Conus 1338 ´Italo-brasileiros 1395 Asteraceae 1433 Coleophora 1514 Arctiidae 1516 Alchemilla 1689 Turridae 1879 Camponotus 2163 Acer 2744 Acacia Alberto Sim˜oes A5 - Specialized Dictionaries 14/138
  • 15. Folksonomies: Pros and Cons Pros: doesn’t require expert cataloguers, authoritative sources or expert users; capability of matching users’ real needs and language: (inclusive — includes everyone’s words and vocabulary) controlled vocabularies are not practically and economically extensible, while folksonomies are; a low-investment bridge between personal classification and shared classification; easy to use and quick to classify big quantities of individuals; not all the limitations of folksonomies are defects :-) Alberto Sim˜oes A5 - Specialized Dictionaries 15/138
  • 16. Folksonomies: Pros and Cons Cons: by itself, the vocabulary is flat; (there is no structure, just terms) not usable for small collections or those with few users; (statistical methods are dependent of population size) without some technology help, vocabularies get inexact or ambiguous; have a very low findability quotient. They are great for serendipity and browsing but not aimed at a targeted approach or search; Alberto Sim˜oes A5 - Specialized Dictionaries 16/138
  • 17. Taxonomies Alberto Sim˜oes A5 - Specialized Dictionaries 17/138
  • 18. Taxonomies Taxonomy is the science of identifying and naming species, and arranging them into a classification. The field of taxonomy, sometimes referred to as “biological taxonomy”, revolves around the description and use of taxonomic units, known as taxa. A resulting taxonomy is a particular classification, arranged in a hierarchical structure or classification scheme. Taxonomy (Wikipedia, 2012) Alberto Sim˜oes A5 - Specialized Dictionaries 18/138
  • 19. Taxonomies taxonomy [tæk’s6n@mI] n. (Life Sciences & Allied Applications / Biology) the branch of biology concerned with the classification of organisms into groups based on similarities of structure, origin, etc. the practice of arranging organisms in this way. the science or practice of classification. [from French taxonomie, from Greek taxis “order” + –nomy] Collins English Dictionary – Complete and Unabridged c HarperCollins Publishers 1991, 1994, 1998, 2000, 2003 Alberto Sim˜oes A5 - Specialized Dictionaries 19/138
  • 20. Taxonomies: How they work? Used to partition the world into disjunctive classes or groups; Each group is, again, partitioned into sub-classes or sub-groups; And sub-classes are partitioned, and. . . Individuals are classified in one leaf category; (a classification is a path in the tree) Alberto Sim˜oes A5 - Specialized Dictionaries 20/138
  • 21. Taxonomies: The typical example Alberto Sim˜oes A5 - Specialized Dictionaries 21/138
  • 22. Taxonomies: examples used everyday Main index (top level) of Universal Decimal Classification: 0 Generalities (now Science and knowledge. Organization. Computer Science. Information. Documentation. Librarianship. Institutions. Publications) 1 Philosophy. Psychology 2 Religion. Theology 3 Social Sciences 4 Vacant 5 Mathematics and natural sciences 6 Applied sciences. Medicine. Technology 7 The arts. Recreation. Entertainment. Sport 8 Language. Linguistics. Literature 9 Geography. Biography. History Alberto Sim˜oes A5 - Specialized Dictionaries 22/138
  • 23. Taxonomies: examples used everyday 8 Language. Linguistics. Literature 80 General questions [. . . ] linguistics and literature. Philology 81 Linguistics and languages 81-11 Schools and trends in linguistics 81-13 Methodology of linguistics. Methods and means 811 Languages 811.1/.2 Indo-European Languages 811.3 Dead languages of unknown affiliation. Caucasian languages 811.4 Afro-Asiatic, Nilo-Saharan, Congo-Kordofanian, Khoisan languages 811.5 Ural-Altaic, Palaeo-Siberian, Eskimo-Aleut, Dravidian and Sino-Tibetan languages. Japanese. Korean. . . 811.6 Austro-Asiatic languages. Austronesian languages 811.7 Indo-Pacific (non-Austronesian) languages. Australian languages 811.8 American indigenous languages 811.9 Artificial languages 82 Literature Alberto Sim˜oes A5 - Specialized Dictionaries 23/138
  • 24. Taxonomies: Class Task. Book to classify: Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko. Top-level classes: 0 Science and knowledge. Organization. Computer Science. Information. . .; 1 Philosophy. Psychology; 2 Religion. Theology; 3 Social Sciences; 5 Mathematics and natural sciences; 6 Applied sciences. Medicine. Technology; 7 The arts. Recreation. Entertainment. Sport; 8 Language. Linguistics. Literature; 9 Geography. Biography. History
  • 25. Taxonomies: Class Task. Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko, at the University of Minho Library: 5 Mathematics, Natural Sciences > 51 Mathematics > 519 (no name, virtual class) > 519.6 Computational mathematics. Numerical Analysis
  • 26. Taxonomies: Class Task. Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko, at the Aveiro University Library: 0 Science and knowledge. Organization. Computer Science. Information. . . > 00 Prolegomena. Fundamentals of knowledge and culture. Propaedeutics > 004 Computer science and technology. Computing. Data processing > 004.4 Software > 004.42 Computer programming. Computer programs
  • 27. Taxonomies: Class Task. Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko, at the Porto Polytechnic Institute Library: 0 Science and knowledge. Organization. Computer Science. Information. . . > 00 Prolegomena. Fundamentals of knowledge and culture. Propaedeutics > 004 Computer science and technology. Computing. Data processing > 004.4 Software > 004.43 Computer Languages
  • 28. Taxonomies: Class Task. Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko, at the University of Algarve Library: 0 Science and knowledge. Organization. Computer Science. Information. . . > 00 Prolegomena. Fundamentals of knowledge and culture. Propaedeutics > 004 Computer science and technology. Computing. Data processing > 004.8 Artificial intelligence
  • 29. Taxonomies: Pros and Cons
    Pros: the rigid tree makes it easy to process; suitable for some areas (like the classification of life); the hierarchy helps when searching for terms (abstraction).
    Cons: the rigid tree makes it difficult to classify (different people classify objects differently); the structure is defined by some authority group (for example, the UDC Consortium); it forces a subdivision of the world (each category has a single parent); as a workaround, people classify in more than one category (so the rigid-tree pro turns into a con).
  • 30. Thesauri
  • 31. Thesauri: A thesaurus is a reference work that lists words grouped together according to similarity of meaning (containing synonyms and sometimes antonyms), in contrast to a dictionary, which contains definitions and pronunciations. Thesauri (Wikipedia, 2012)
  • 32. Thesauri: In Information Science, Library Science, and Information Technology, specialized thesauri are designed for information retrieval. They are a type of controlled vocabulary, for indexing or tagging purposes. Such a thesaurus can be used as the basis of an index for online material. [. . . ] Unlike a literary thesaurus, these specialized thesauri typically focus on one discipline, subject or field of study. Thesauri (Wikipedia, 2012)
  • 33. Thesauri: How they work! Thesauri for information retrieval are typically constructed by information specialists, and have their own unique vocabulary defining different kinds of terms and relationships. Terms are the basic semantic units for conveying concepts. They are usually single-word nouns, since nouns are the most concrete part of speech. [. . . ] When a term is ambiguous, a “scope note” can be added to ensure consistency, and give direction on how to interpret the term. “Term relationships” are links between terms. These relationships can be divided into three types: hierarchical, equivalency or associative. Thesauri (Wikipedia, 2012)
  • 34. Thesauri: How they work! Hierarchical relationships are used to indicate terms which are narrower and broader in scope. A “Broader Term” (BT) or hyperonym is a more general term. Reciprocally, a “Narrower Term” (NT) or hyponym is a more specific term. BT and NT are reciprocals; a broader term necessarily implies at least one other term which is narrower. BT and NT are used to indicate class relationships, as well as part-whole relationships (meronyms and holonyms). Thesauri (Wikipedia, 2012)
  • 35. Thesauri: How they work! Example of a thesaurus with hierarchical relations.
    Feline
      NT Cat
      NT Panther
    Cat
      BT Feline
    Panther
      BT Feline
      NT Pink Panther
    Pink Panther
      BT Panther
  • 36. Thesauri: How they work! The equivalency relationship is used primarily to connect synonyms and near-synonyms. “Use” (USE) and “Used For” (UF) indicators are used when an authorized term is to be used for another, unauthorized, term. Unauthorized terms are often called “entry vocabulary”, “entry points”, “lead-in terms”, or “non-preferred terms”, pointing to the authorized term (also referred to as the “preferred term” or “descriptor”) that has been chosen to stand for the concept. Thesauri (Wikipedia, 2012)
  • 37. Thesauri: How they work! Example of a thesaurus with equivalency relations.
    Parliament
      USE European Parliament
    Parliament of Europe
      USE European Parliament
    European Parliament
      UF Parliament
      UF Parliament of Europe
  • 38. Thesauri: How they work! Associative relationships are used to connect two related terms whose relationship is neither hierarchical nor equivalent. This relationship is described by the indicator “Related Term” (RT). Associative relationships should be applied with caution, since excessive use of RTs will reduce specificity in searches. Consider the following: if the typical user is searching with term “A”, would they also want resources tagged with term “B”? If the answer is no, then an associative relationship should not be established. Thesauri (Wikipedia, 2012)
  • 39. Thesauri: How they work! Example of a thesaurus with associative relations.
    Douro
      BT River
      RT Porto
      RT Gaia
    River
      NT Douro
    City
      NT Gaia
      NT Porto
    Porto
      BT Portugal
    Portugal
      NT Porto
      NT Gaia
    Gaia
      BT Portugal
    Note: RT is not symmetrical, that is, a RT b ⇏ b RT a.
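    The three kinds of relationship can be sketched as a small graph structure in Python. This is a minimal illustration, not a standard thesaurus API: the `Thesaurus` class and its method names are assumptions, and RT is stored in one direction only, following the note on the slide.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus: terms linked by BT/NT, USE/UF and RT relations."""

    def __init__(self):
        # term -> relation code -> set of related terms
        self.relations = defaultdict(lambda: defaultdict(set))
        # non-preferred term -> preferred term
        self.use = {}

    def add_bt(self, narrower, broader):
        # BT and NT are reciprocal, so both directions are recorded.
        self.relations[narrower]["BT"].add(broader)
        self.relations[broader]["NT"].add(narrower)

    def add_use(self, non_preferred, preferred):
        # USE and UF are reciprocal as well.
        self.use[non_preferred] = preferred
        self.relations[preferred]["UF"].add(non_preferred)

    def add_rt(self, term, related):
        # Stored one way only (assumption taken from the slide's note).
        self.relations[term]["RT"].add(related)

    def preferred(self, term):
        """Resolve an entry term to its descriptor."""
        return self.use.get(term, term)

    def get(self, term, relation):
        return sorted(self.relations[term][relation])


th = Thesaurus()
th.add_bt("Cat", "Feline")
th.add_bt("Panther", "Feline")
th.add_bt("Pink Panther", "Panther")
th.add_use("Parliament", "European Parliament")
th.add_use("Parliament of Europe", "European Parliament")
th.add_rt("Douro", "Porto")

print(th.get("Feline", "NT"))
print(th.preferred("Parliament"))
```

    Unlike the taxonomy tree, nothing here prevents a term from having several broader terms, which is why a thesaurus works as a graph.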
  • 40. Thesauri: a simple example [Figure: extract of Food Safety relationships in AGROVOC, a graph linking Quality, Asia, China, Food Safety, Contamination, Food and Food Contamination through BT, NT, RT and USE relations.]
  • 41. Thesauri: Pros and Cons
    Pros: more flexible than taxonomies (do not require a tree, work as a graph); have other types of relationships besides simple hierarchy (like the associative relation); there is an ISO standard that documents their correct use; the standard defines mathematical properties for the relationships.
    Cons: the standardized types of relationships are somewhat limited (same relation for hyperonyms and meronyms; the non-hierarchical relation is too vague: “related”); no support for relationships with non-terms (features).
  • 42. Ontologies
  • 43. Ontologies: Ontology is the philosophical study of the nature of being, existence, or reality as such, as well as the basic categories of being and their relations. Traditionally listed as a part of the major branch of philosophy known as metaphysics, ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences. Ontology (Wikipedia, 2012)
  • 44. Ontologies: In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain. Ontology: information science (Wikipedia, 2012)
  • 45. Ontologies: Contemporary ontologies share many structural similarities, regardless of the language in which they are expressed. Most ontologies describe individuals (instances), classes (concepts), attributes, and relations. Ontology: information science (Wikipedia, 2012)
  • 46. Ontologies: Individuals are the instances or objects (the basic or “ground level” objects). Ontology: information science (Wikipedia, 2012) Unlike any of the other classification systems, ontologies explicitly include the individuals (the objects being classified) in the structure.
  • 48. Ontologies: Classes are sets, collections, concepts, [. . . ] or kinds of things. Ontology: information science (Wikipedia, 2012) Classes are the concepts used in thesauri and taxonomies. They can be super-classes, which include sub-classes, or they can include only individuals (low-level classes, which would be the leaves in a taxonomy).
  • 50. Ontologies: Attributes are aspects, properties, features, characteristics, or parameters that objects (and classes) can have. Ontology: information science (Wikipedia, 2012) Attributes are properties of individuals or classes. If the individual is a book in a library, a property can be the number of pages, the title, the author. For a class, like “mammal”, an attribute can be a reference to its fur. Attributes are usually specified as a pair: the name of the attribute and its value.
  • 52. Ontologies: Relations are ways in which classes and individuals can be related to one another. Ontology: information science (Wikipedia, 2012) Relations are similar to the relations used in thesauri but, unlike them, there is no fixed list of valid relations. They can be the common hierarchical relations, or a relation such as “eat”, linking animals to the animals they eat.
  • 54. Ontologies: Function terms: complex structures formed from certain relations that can be used in place of an individual term in a statement. Ontology: information science (Wikipedia, 2012) Suppose you are adding Portuguese rivers to an ontology. One can define a simple macro that adds some default relations to the river:
    River(name) ≅ { Term → name; Is a → river; Is at → Portugal }
  • 56. Ontologies: Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input. Ontology: information science (Wikipedia, 2012) We can enforce that the capital of a country is a city:
    add (X capital-of Y) iff X is-a City
  • 58. Ontologies: Rules: statements in the form of an antecedent-consequent sentence that describe the logical inferences that can be drawn from an assertion in a particular form. Ontology: information science (Wikipedia, 2012) On the other hand, if we trust whoever is editing the ontology, we can automatically classify the capital as a city, and its country as a. . . country:
    X capital-of Y ⇒ X is-a City ∧ Y is-a Country
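    The difference between a restriction and a rule can be sketched over a set of (subject, relation, object) triples. This is an illustration under assumptions: the `TinyOntology` class and its method names are invented for the example and do not correspond to any particular ontology tool.

```python
class TinyOntology:
    """Toy triple store illustrating a restriction and an inference rule."""

    def __init__(self):
        self.facts = set()  # (subject, relation, object) triples

    def add(self, subject, relation, obj):
        self.facts.add((subject, relation, obj))

    def add_capital_restricted(self, city, country):
        # Restriction: accept "X capital-of Y" only if X is already a City.
        if (city, "is-a", "City") not in self.facts:
            raise ValueError(f"{city} is not known to be a City")
        self.add(city, "capital-of", country)

    def apply_capital_rule(self):
        # Rule: X capital-of Y  =>  X is-a City  and  Y is-a Country.
        inferred = set()
        for subject, relation, obj in self.facts:
            if relation == "capital-of":
                inferred.add((subject, "is-a", "City"))
                inferred.add((obj, "is-a", "Country"))
        new_facts = inferred - self.facts
        self.facts |= new_facts
        return new_facts


onto = TinyOntology()

# With the restriction, the assertion is rejected until Lisbon is a City.
try:
    onto.add_capital_restricted("Lisbon", "Portugal")
    rejected = False
except ValueError:
    rejected = True

# With the rule, the assertion is trusted and the classes are inferred.
onto.add("Lisbon", "capital-of", "Portugal")
new_facts = onto.apply_capital_rule()
print(sorted(new_facts))
```

    The restriction guards the input, while the rule adds knowledge after the fact; which one to use depends on how much the editors of the ontology are trusted.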
  • 60. Ontologies: Axioms: assertions (including rules) in a logical form that together comprise the overall theory that the ontology describes in its domain of application. Ontology: information science (Wikipedia, 2012) Axioms differ from rules: they are tests that guarantee the structure of the ontology. They are not used to infer new relations. They assert, and can/should be used for consistency checking.
  • 62. Ontologies: Events: the changing of attributes or relations. Ontology: information science (Wikipedia, 2012) Events are similar to rules, but they react to changes. For example, if the user adds a feature stating that an individual lays eggs, classify it as oviparous. Note that the division into Rules, Axioms and Events is not universal, and depends a lot on the application used to support the ontology.
  • 64. Ontologies: Example 1
  • 65. Ontologies: Example 2
  • 66. Ontologies: Pros and Cons
    Pros: more flexible than thesauri (graph with ad-hoc relationships); lots of formalisms and standards (OWL, SKOS, . . . ); lots of editing tools (like Protégé); languages for querying and completion (like SPARQL).
    Cons: as a classification approach, an ontology requires an authority for its definition, just like taxonomies or thesauri; complexity: not everybody is able to create a detailed ontology.
  • 67. Further Reading: Folksonomies
    Folksonomy Coinage and Definition http://vanderwal.net/folksonomy.html
    Folksonomies: A User-Driven Approach to Organizing Content http://www.uie.com/articles/folksonomies/
    Folksonomies: power to the people http://www.iskoi.org/doc/folksonomies.htm
    Folksonomies: Tidying up Tags? http://www.dlib.org/dlib/january06/guy/01guy.html
    Folksonomies - Cooperative Classification and Communication Through Shared Metadata http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html
  • 68. Further Reading: Taxonomies and Thesauri
    Taxonomy http://en.wikipedia.org/wiki/Taxonomy
    Perspectives on Taxonomy, Classification, Structure and Find-ability http://www.serviceinnovation.org/included/docs/kcs_taxonomy.pdf
    Universal Decimal Classification http://www.udcc.org/udcsummary/php/index.php
    Thesaurus http://en.wikipedia.org/wiki/Thesaurus
    Thesaurus principles and practice http://www.willpowerinfo.co.uk/thesprin.htm
  • 69. Further Reading: Ontologies
    Ontology (information science) http://en.wikipedia.org/wiki/Ontology_(information_science)
    Protégé Ontology Editor http://protege.stanford.edu/
    OWL Web Ontology Language http://www.w3.org/TR/owl-features/
    SPARQL Query Language for RDF http://www.w3.org/TR/rdf-sparql-query/
  • 70. Part II: Terminology and Translation
  • 71. Overview: 4 How translation works; 5 The role of Terminology in Translation; 6 Translation Software (Standard translation software; Standard terminology management software)
  • 72. How Translation Works
  • 73. How Translation Works
    Manual Translation: the translator uses resources such as dictionaries and terminologies, but searches them manually. This is the kind of translation done in the last century.
    Computer Assisted Translation: the translator uses tools (CAT tools) to support the translation process. They help the translator reuse previous translations, integrate with terminologies, and deal with different file formats.
    Exploratory Translation: machine translation tools, such as Google Translate, are used to produce a quick translation and understand a text. Not really a professional translation process.
    Machine Translation: computer systems that translate text using different techniques, from statistical information to translation rules. Quality has risen in recent years, but it is still far from the result of real translation work.
  • 74. Computer Assisted Translation. The translation process in a CAT tool:
    1 The document is opened in the CAT tool;
    2 The first sentence is extracted and presented for translation;
    3 The sentence is looked up in a database of previously translated sentences, searching for similar sentences (fuzzy matching);
    4 If a match is found, its translation (exact or fuzzy) is used;
    5 A terminology database is queried to check whether the sentence includes relevant terms to be translated;
    6 The translator reviews the translation;
    7 The system saves the translation in a database of translations;
    8 The system saves the translation in the translated document;
    9 The next sentence is extracted, and the process returns to step 3.
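    The fuzzy lookup of step 3 can be sketched with Python's standard `difflib` module. This is only an illustration: real CAT tools use their own similarity measures, and the `best_match` function, the 0.7 threshold and the sample memory entry are assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source sentence -> stored translation.
MEMORY = {
    "The file could not be opened.": "O ficheiro não pôde ser aberto.",
}


def best_match(sentence, memory, threshold=0.7):
    """Return (score, source, target) for the most similar stored sentence,
    or None when nothing reaches the threshold."""
    best = None
    for source, target in memory.items():
        # ratio() is 1.0 for identical strings and lower for partial matches.
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


exact = best_match("The file could not be opened.", MEMORY)
fuzzy = best_match("The file could not be saved.", MEMORY)
none = best_match("Hello world", MEMORY)

for result in (exact, fuzzy, none):
    print(result)
```

    A fuzzy hit still has to be reviewed by the translator (step 6), since the stored translation belongs to a similar sentence, not to the same one.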
  • 75. Computer Assisted Translation
  • 76. Translation Memories: databases of translations; store sentences in two or more languages; grow with the translator's work; can be shared between translators working on the same project. Some big companies make their TM available to contracted translators in order to guarantee homogeneity in their translations.
  • 77. Terminology and Translation
    Translating terminology takes up to 40% of the time in translation: translators are not familiar with technical areas; translators need to understand the term being translated; researching a specific area takes time; terminology reduces the time needed to research term translations.
    Terminology helps the comprehension of concepts: there is no way to translate without understanding; terminology might/should include explanations of terms.
    Terminology helps with consistency and standardization: translate terms the same way throughout a document; translate terms the same way across all documents; companies, organizations and governmental institutions define specific terminologies that translators should use.
  • 78. Further Reading: CAT software
    Discover the benefits of using a CAT Tool: How can CAT Tools help you? by Jonathan T. Hine Jr. http://www.translationzone.com/en/translator-solutions/translation-memory/cat-tools/
    What is a translation memory? by SDL Trados. http://www.translationzone.com/en/translator-solutions/translation-memory/default.asp
    What is terminology? by SDL Trados. http://www.translationzone.com/en/translator-solutions/terminology-management/default.asp
  • 79. Further Reading: Terminology in Translation
    Terminology in translation, by Thorsten Trippel (1999) http://www.spectrum.uni-bielefeld.de/~ttrippel/terminology/node19.html
    Terminology Management in Translation, by Gabriele Sauberer (2009) http://www.termnet.org/downloads/english/events/itaindia_workshop/GS_Terminology_Management_in_Translation.pdf
    The Role of Terminology Management in Localization, by Sue Ellen Wright (2006) http://www.translationzone.com/en/images/sue_ellen_slides_tcm18-25819.pdf
    Managing Terminology for Translation Using Translation Environment Tools: Towards a Definition of Best Practices, by Marta Gómez Palou Allard (2012) http://www.ruor.uottawa.ca/fr/bitstream/handle/10393/22837/Gomez_Palou_Allard_Marta_2012_thesis.pdf
  • 80. Part III: Introduction to Corpora
  • 81. Overview: 7 Corpora (Monolingual Corpora; Parallel Corpora; Corpora in the Web); 8 The web as Corpora (Do-it-yourself Corpora; Basic Crawling Tools)
  • 82. What is a Corpus? cor·pus /ˈkôrpəs/ Noun 1. A collection of written texts, esp. the entire works of a particular author or a body of writing on a particular subject; 2. A collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc. Corpora is the plural of corpus.
  • 83. Corpora Classification: Corpora are usually classified according to the number of languages. Monolingual Corpus: documents are all written in one language (in some cases with more than one variant). Multilingual Corpus: documents are written in more than one language.
  • 84. Corpora Classification: There are two especially relevant types of multilingual corpora. Parallel Corpus: a text placed alongside its translation or translations. Parallel text alignment is the identification of corresponding blocks in both halves of the parallel text. Comparable Corpus: one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora. Expert Advisory Group on Language Engineering Standards Guidelines (1996)
  • 85. Monolingual Corpora Examples
    British National Corpus (http://www.natcorp.ox.ac.uk/): The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
    CETEMPúblico (http://www.linguateca.pt/cetempublico/): Corpus de Extractos de Textos Electrónicos MCT/Público is a corpus of approximately 180 million words in European Portuguese. It was created by the Computational Processing of Portuguese Project after an agreement between the Ministry of Science and Technology and the Público newspaper, in April, 2000.
    CETENFolha (http://www.linguateca.pt/cetenfolha/): Corpus de Extractos de Textos Electrónicos NILC/Folha de S. Paulo is a corpus of approximately 24 million words in Brazilian Portuguese, created by the Computational Processing of Portuguese Project using texts from the newspaper Folha de S. Paulo, that are part of the NILC/São Carlos
  • 86. Monolingual Corpora Examples
    Russian National Corpus (http://ruscorpora.ru/en/index.html): RNC is a corpus of the modern Russian language incorporating over 300 million words. The corpus of Russian is a reference system based on a collection of Russian texts in electronic form.
    Croatian National Corpus (http://www.hnk.ffzg.hr/cnc.htm): HNK is a systematized collection of selected texts mainly written in contemporary Croatian covering different media, genres, styles, fields and topics. The Corpus is accompanied by additional linguistic and non-linguistic data and stored in a database on our server which can be accessed with the search client program Bonito.
    KOTONOHA Corpus (http://www.kotonoha.gr.jp/): The Balanced Corpus of Contemporary Written Japanese includes text samples collected to be able to grasp an overall picture of the modern Japanese written language and includes about 100 million words.
  • 87. Parallel Corpora Examples
    Aligned Hansards (http://isi.edu/natural-language/download/hansard/): Aligned Hansards of the 36th Parliament of Canada, containing 1.3 million pairs of aligned text chunks (sentences or smaller fragments).
    COMPARA (http://www.linguateca.pt/COMPARA/): COMPARA is a bidirectional parallel corpus of English and Portuguese. In other words, it is a type of database with original and translated texts in these two languages that have been linked together sentence by sentence.
    Europarl (http://www.statmt.org/europarl/): The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish.
  • 88. Parallel Corpora Examples
    JRC-Acquis (http://langtech.jrc.it/JRC-Acquis.html): The Acquis Communautaire is the total body of European Union (EU) law applicable in the EU Member States. It is a collection of parallel texts in 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.
    OPUS (http://opus.lingfil.uu.se/): OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
    Per-Fide (http://per-fide.di.uminho.pt/cquery): The Per-Fide project aims at developing parallel corpora between Portuguese and six other languages: English, Russian, French, Italian, German and Spanish.
  • 89. Querying Corpora. Using http://corpus.leeds.ac.uk/protected/query.html
    Concordances of a single word: dog
    Concordances for a sequence of words: big bang
    Concordances for lemmas: [lemma="have"]
    Concordances for part of speech: [pos="NNS"]
    Combinations of the above: [lemma="have"] dog [lemma="be"] [lemma="have"]
    Regular expressions can be used: [pos="N.*"] [pos="V.*"]
    Multiple restrictions for the same word: [pos="N.*" & word="d.*"] [pos="V.*"]
    Empty words (match any token): [pos="N.*"] [] [pos="V.*"]
  • 90. The Web as Corpora: To study “purposeful language behavior,” corpus linguists require collections of authentic texts (spoken and/or written). It is therefore not surprising that many (corpus) linguists have recently turned to the World Wide Web as the richest and most easily accessible source of language material available. At the same time, for language technologists, who have been arguing for long that “more data is better data,” the WWW is a virtually unlimited source of “more data.” Wacky! A Wacky Introduction, Silvia Bernardini, Marco Baroni and Stefan Evert
  • 91. Do-it-yourself Corpora: The WWW has data on virtually any subject; there is data in almost any language; therefore, it is possible to build custom corpora! Collect text from the web. . . in a specific language. . . on the subject you want to study. . . and retrieve as much text as you need.
  • 92. Basic Crawling Tools: There are standard download tools that follow HTML links and are able to download complete websites. They are known as web spiders, or web robots; examples include “wget”, “wGetGUI” or “HTTrack”; but you need to process the files yourself. Some projects have developed tools specifically for building corpora. The best known is “BootCaT”.
  • 93. Further Reading
    Corpora:
    Corpus Creation - Handbook of NLP http://cgi.cse.unsw.edu.au/~handbookofnlp/index.php?n=Chapter7.Chapter7
    Building and Using Your Own Corpora http://www.lancs.ac.uk/fss/courses/ling/corpus/blue/diy_top.htm
    CQP Query Language Tutorial http://cwb.sourceforge.net/files/CQP_Tutorial/
    Web as Corpora:
    Wacky! Working papers on the Web as Corpus http://wackybook.sslmit.unibo.it/
    Wacky Wiki http://wacky.sslmit.unibo.it/doku.php
  • 94. Part IV: Terminology Extraction from Monolingual Corpora
  • 95. Overview: 9 Corpora for Terminology Building; 10 Obtaining candidate terms from Corpora (N-grams and Frequencies; Lexical Difference; Exploring Mutual Information; Morphology Constraints); 11 Exploring a Tool: Term-o-Matic
  • 96. Corpora for Terminology Building: Using one or more domain-specific texts is a relevant way to understand the terminology of that domain; words in context give more information than words alone; there is no automatic method to extract domain-specific terminology from a domain-specific corpus; nevertheless, there are automatic methods to obtain candidate terms, which can later be analysed and either incorporated in a terminology or discarded.
  • 97. Words n-Grams In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected automatically from a text or speech corpus. Alberto Sim˜oes A5 - Specialized Dictionaries 88/138
  • 98. One-Grams
    1-grams are usually known as words/tokens. :-)
    Peter Piper picked a peck of pickled peppers.
    A peck of pickled peppers Peter Piper picked.
    If Peter Piper picked a peck of pickled peppers,
    Where’s the peck of pickled peppers Peter Piper picked?
      peter      4
      piper      4
      picked     4
      a          3
      peck       4
      of         4
      pickled    4
      ...      ...
  • 99. Bigrams
    All sequences of two words/tokens found in the text.
    Peter Piper picked a peck of pickled peppers.
    A peck of pickled peppers Peter Piper picked.
    If Peter Piper picked a peck of pickled peppers,
    Where’s the peck of pickled peppers Peter Piper picked?
      peter piper        4
      piper picked       4
      picked a           2
      a peck             3
      peck of            4
      of pickled         4
      pickled peppers    4
      ...              ...
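The counting shown on the one-gram and bigram slides can be sketched in a few lines of Python. This is only an illustration, not the tool used in the course: the tokenizer (lowercasing and keeping runs of letters and apostrophes) and the function names `tokenize` and `ngrams` are assumptions, and n-grams are allowed to span sentence boundaries.

```python
import re
from collections import Counter

TEXT = (
    "Peter Piper picked a peck of pickled peppers. "
    "A peck of pickled peppers Peter Piper picked. "
    "If Peter Piper picked a peck of pickled peppers, "
    "Where's the peck of pickled peppers Peter Piper picked?"
)

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens (an assumed, naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = tokenize(TEXT)
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Print the most frequent bigrams with their occurrence counts.
for gram, count in bigrams.most_common(7):
    print(" ".join(gram), count)
```

Changing the last argument of `ngrams` gives trigrams or longer sequences with the same code.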
  • 100. Top occurring trigrams for a real corpus
      in accordance with         31148
      referred to in             27581
      the member states          16999
      accordance with the        16535
      of the european            14772
      laid down in               13301
      to in article              13211
      having regard to           12588
      regard to the              11416
      member states shall        11392
      in order to                10563
      in the case                10029
      the provisions of           9825
      the case of                 9575
      provided for in             9560
      the member state            9360
      of the member               8656
      the commission shall        8013
      of this directive           6679
      a member state              6306
      on the basis                6292
      the european parliament     6274
      the basis of                6265
      and in particular           6225
      down in article             6200
      of the community            5958
      accordance with article     5758
      to in paragraph             5690
      opinion of the              5599
      the opinion of              5191
      the competent authorities   5074
      for the purposes            5024
      the purposes of             4946
      with the procedure          4878
      to the commission           4843
      the european community      4834
  • 101. n-grams frequency
    n-grams are usually computed together with their occurrence count, or frequency;
    In some situations, such as statistical language models, other types of measures are also computed (probability, i.e. relative frequency; conditional probability; etc.);
    One-gram frequencies do not help much in term candidate extraction: they only say that a word is more or less frequent.
    n-grams for n ≥ 2 can help find sequences of words that occur many times.
  • 102. Stop Words and Lexical Difference
    There are words that rarely occur in terminology;
    At least, they rarely occur at the beginning or end of a multi-word term;
    For example, pronouns, articles, prepositions;
    These words are usually known as stop words;
    It is easy to find longer or shorter lists of stop words for every language;
    We can ignore these words when computing n-grams.
  • 103. Detecting stop-words
      in accordance with         31148
      referred to in             27581
      the member states          16999
      accordance with the        16535
      of the european            14772
      laid down in               13301
      to in article              13211
      having regard to           12588
      regard to the              11416
      member states shall        11392
      in order to                10563
      in the case                10029
      the provisions of           9825
      the case of                 9575
      provided for in             9560
      the member state            9360
      of the member               8656
      the commission shall        8013
      of this directive           6679
      a member state              6306
      on the basis                6292
      the european parliament     6274
      the basis of                6265
      and in particular           6225
      down in article             6200
      of the community            5958
      accordance with article     5758
      to in paragraph             5690
      opinion of the              5599
      the opinion of              5191
      the competent authorities   5074
      for the purposes            5024
      the purposes of             4946
      with the procedure          4878
      to the commission           4843
      the european community      4834
  • 104. Replacing stop words by a special token
      <tk> member states             32517
      member states <tk>             30108
      <tk> member state              19345
      member state <tk>              17882
      council directive <tk>          7869
      <tk> council directive          7129
      <tk> european parliament        5397
      council regulation <tk>         5259
      european parliament <tk>        5125
      <tk> council regulation         4995
      <tk> competent authorities      4964
      competent authorities <tk>      4736
      procedure laid <tk>             4472
      <tk> treaty establishing        4375
      treaty establishing <tk>        4373
      <tk> competent authority        3694
      official journal <tk>           3530
      competent authority <tk>        3507
      annex ii <tk>                   3429
      commission regulation <tk>      3171
      <tk> commission regulation      2967
      commission decision <tk>        2545
      <tk> customs authorities        2542
      <tk> commission decision        2429
      customs authorities <tk>        2410
      <tk> european economic          2285
      <tk> administrative provisions  2017
      <tk> contracting parties        2010
      conditions laid <tk>            1998
      contracting parties <tk>        1779
      commission directive <tk>       1764
      detailed rules <tk>             1738
      <tk> community industry         1728
      <tk> contracting party          1702
  • 105. Trigrams that do not include stop words
      member states relating                1523
      member state concerned                1200
      veterinary medicinal products          955
      maximum residue limits                 814
      physically modified derivatives        700
      european economic community            691
      community trade mark                   538
      member states concerned                508
      plant protection products              464
      home member state                      442
      host member state                      388
      council common position                377
      community plant variety                368
      european atomic energy                 346
      animal health conditions               342
      authorised representative established  327
      implementing powers conferred          311
      regional economic integration          263
      median longitudinal plane              258
      plant protection product               249
      separate technical unit                246
      national regulatory authorities        241
      apply mutatis mutandis                 241
      common technical regulation            229
      separate technical units               226
      emission limit values                  219
      technically permissible maximum        215
      maximum residue levels                 212
      retail trade services                  200
      temporary importation procedure        196
      medicinal products intended            195
      community transit procedure            195
      atomic energy community                193
      classical swine fever                  189
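The two filtering steps above (replacing stop words by a special token, then keeping only n-grams without that token) might look like the following sketch. The stop-word list is a tiny made-up sample, the example sentence is invented, and the helper names are hypothetical; real stop-word lists are much longer.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used for the slides).
STOP_WORDS = {"the", "of", "in", "to", "with", "a", "for", "on", "and"}
TK = "<tk>"

def mask_stop_words(tokens, stop_words=STOP_WORDS):
    """Replace every stop word by the special token."""
    return [TK if t in stop_words else t for t in tokens]

def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def without_stop_words(counts):
    """Keep only the n-grams in which no position was masked."""
    return Counter({g: c for g, c in counts.items() if TK not in g})

# Invented example sentence, already lowercased and split on spaces.
tokens = "the competent authorities of the member states shall inform the commission".split()
masked = mask_stop_words(tokens)
trigrams = count_ngrams(masked, 3)
clean = without_stop_words(trigrams)

for gram, count in clean.items():
    print(" ".join(gram), count)
```

Masking instead of deleting stop words keeps word positions intact, so words that were not adjacent in the text are never glued into a false n-gram.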
  • 106. Basic Lexical Difference
    What if we remove not just stop words, but common words as well?
    It is unusual to find Osteoarthritis in general text. Therefore, it is likely to be a domain term.
    We can obtain a list of common words from a generic corpus (say, journalistic text) and subtract that lexicon from the one-grams we obtained.
    The result should include good term candidates!
  • 107. Basic Lexical Difference - Experiment
    Two random abstracts from PubMed articles related to cirrhosis;
    Top 1 000 occurring words in English;
    Compute one-grams on the abstracts;
    Subtract the top occurring words.
      Before                    After
      liver            8        liver            8
      is               7        myofibroblast    6
      fibrosis         6        fibrosis         6
      myofibroblast    6        pathway          5
      pathway          5        kidney           5
      kidney           5        interstitial     4
      expression       5        β-catenin        3
      interstitial     4        target           3
      signaling        3        signaling        3
      target           3        genes            3
      differentiation  3        differentiation  3
      diseases         3        medullary        3
      medullary        3        renal            3
      antioxidant      3        adult            3
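The subtraction step of the experiment can be sketched as below. The common-word set and the sample sentence are invented for illustration (they are not the PubMed abstracts or the top 1 000 English words used on the slide), and `lexical_difference` is a hypothetical helper name.

```python
from collections import Counter

def lexical_difference(domain_tokens, common_words):
    """Count domain tokens and drop every word found in the common-word list."""
    counts = Counter(t.lower() for t in domain_tokens)
    return Counter({w: c for w, c in counts.items() if w not in common_words})

# Invented stand-ins for the generic lexicon and the domain text.
COMMON = {"is", "the", "of", "a", "in", "and"}
abstract = "Liver fibrosis is a disease of the liver and fibrosis is common".split()

candidates = lexical_difference(abstract, COMMON)
for word, count in candidates.most_common():
    print(word, count)
```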
  • 108. Lexical Distribution Difference
    The previous example could benefit from a larger standard lexicon list;
    Abstracts are crowded with terminology, and contain few other words;
    Long lists may include words that are considered terminology!
      For example, in Informatics, folder or file can be terms.
    Instead of treating words as present or absent, use their frequency;
    For instance, compute relative frequencies and compare/subtract them;
    Use a distribution comparison metric; e.g., the Kullback-Leibler terms:
      $\log\left(\frac{P(i)}{Q(i)}\right) P(i)$
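A minimal sketch of the frequency-based comparison follows. It assumes that P is the relative frequency in the domain corpus and Q the relative frequency in the generic corpus (the slide does not say which is which), uses the natural logarithm, and replaces Q(i) by a small `floor` value for words missing from the generic corpus; that floor is an assumption of this sketch, not part of the slide.

```python
import math
from collections import Counter

def relative_freq(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_terms(domain_tokens, reference_tokens, floor=1e-6):
    """Per-word Kullback-Leibler term P(i) * log(P(i) / Q(i)).

    Words absent from the reference corpus get Q(i) = floor (assumed smoothing).
    """
    p = relative_freq(domain_tokens)
    q = relative_freq(reference_tokens)
    return {w: pw * math.log(pw / q.get(w, floor)) for w, pw in p.items()}

# Invented toy corpora.
domain = ["liver", "liver", "the", "fibrosis"]
reference = ["the", "the", "file", "liver"]

scores = kl_terms(domain, reference)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 4))
```

Words that are relatively more frequent in the domain corpus get positive scores, while words that are more frequent in the generic corpus get negative ones, so sorting by score pushes likely term candidates to the top.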
  • 110. Pointwise Mutual Information
    The Mutual Information (MI) is a quantity that measures the mutual dependence of two random variables X and Y.
      $MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
    Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.
  • 111. Pointwise Mutual Information
    When computing Mutual Information for two specific outcomes, the Pointwise Mutual Information (PMI) lets us measure their mutual dependence:
      $PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
    Given the number of tokens in the document, N, and the number of occurrences of x, Oc(x):
      $P(x) = \frac{Oc(x)}{N}$
    Given the number of tokens in the document, N, and the number of occurrences of the bigram x, y, Oc(x, y):
      $P(x, y) = \frac{Oc(x, y)}{N}$
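The PMI definition above translates almost directly into code. The sketch below follows the slide's estimates, P(x) = Oc(x)/N and P(x, y) = Oc(x, y)/N with N the number of tokens; the function name `pmi_table` and the toy token list are made up for illustration.

```python
import math
from collections import Counter

def pmi_table(tokens):
    """PMI for every bigram, with probabilities estimated as occurrence count / N."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): math.log2((c / n) / ((uni[x] / n) * (uni[y] / n)))
        for (x, y), c in bi.items()
    }

# Invented toy text with 8 tokens.
tokens = "black hole near black hole and milky way".split()
pmi = pmi_table(tokens)

for (x, y), score in sorted(pmi.items(), key=lambda kv: kv[1], reverse=True):
    print(x, y, round(score, 4))
```

With these estimates, a bigram whose two words each occur exactly once always scores log2(N), the maximum possible value, which is why sorting by PMI alone tends to put rare pairs first.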
  • 112. Pointwise Mutual Information
    Sorted by occurrence count (bigram, count, PMI):
      sonic fabric           14   7.3566
      black holes             9   8.0912
      black hole              7   8.0912
      cassette tape           6   8.4968
      build things            4   9.5348
      smartphone makers       3   9.0087
      alyce santoro           3   8.0912
      like scratching         3   9.0087
      barnard said            3   8.3042
      milky way               3   9.1787
      possible black          3   7.6762
      neutron star            3   8.8567
      just right              3   8.5937
      records backwards       3  10.5937
    Sorted by PMI (bigram, count, PMI):
      special shuttle         1  12.1787
      immediately reminded    1  12.1787
      remain aware            1  12.1787
      richard branson         1  12.1787
      supercooled pods        1  12.1787
      richie havens           1  12.1787
      auspicious locations    1  12.1787
      jimi hendrix            1  12.1787
      account settings        1  12.1787
      baggage carousel        1  12.1787
      buddhist prayer         1  12.1787
      reinvents electronics   1  12.1787
      melbourne institute     1  12.1787
      cow manure              1  12.1787
    From a very small corpus constructed with 5 CNN news stories.
  • 113. Morphology Patterns
    Commonly, terms are nouns or noun phrases;
    Sometimes verbs are also interesting;
    Typically, the morphological structure of terms is well known;
    There is software that computes morphological information about each word in a sentence;
    We can use that information to obtain better term candidates:
      specify the terms' part-of-speech, gender, number, verb tenses, etc.
  • 114. Morphological Analysis
    How it (usually) works:
    1 A sentence splitter and a tokenizer split the text into sentences and tokens;
      (different tools apply them in a different order, some as a single tool)
    2 A morphological analyzer associates possible analyses with each word;
      (does not resolve ambiguity, just tags all possible analyses)
    3 A tagger or parser chooses the most likely analysis;
      (uses knowledge from manually annotated corpora, and machine learning algorithms)
  • 115. Morphological Patterns - Examples
    Noun Noun Noun
      659 Community trade mark
      483 plant protection products
      475 EEC component type-approval
      448 document number C
      320 Community transit procedure
      290 plant protection product
      288 Community plant variety
      257 EC type-examination certificate
      214 EC component type-approval
      176 EEC pattern approval
      157 African swine fever
      155 three-wheel motor vehicles
      155 foot-and-mouth disease virus
      153 conformity assessment procedures
      148 emission limit values
    Adjective Adjective Noun
      912 veterinary medicinal products
      453 common agricultural policy
      365 separate technical unit
      291 separate technical units
      265 median longitudinal plane
      223 regional economic integration
      202 competent national authorities
      200 trans-European high-speed rail
      199 sound financial management
      189 veterinary medicinal product
      182 certain agricultural products
      176 national regulatory authorities
      175 common technical regulation
      168 certain third countries
      166 other third countries
      166 definitive anti-dumping duty
      162 certain dangerous substances
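Filtering n-grams by a part-of-speech pattern can be sketched as follows. The input is assumed to be already tagged; here the tags are assigned by hand with illustrative labels (DET, ADJ, NOUN, CONJ), whereas in practice they would come from a tagger. `match_pattern` is a hypothetical helper, not a function of any particular tool.

```python
from collections import Counter

def match_pattern(tagged, pattern):
    """Count token sequences whose tag sequence equals `pattern`."""
    n = len(pattern)
    hits = Counter()
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if [tag for _, tag in window] == list(pattern):
            hits[" ".join(word for word, _ in window)] += 1
    return hits

# Hand-tagged toy sentence (tags are an assumption for illustration).
tagged = [
    ("the", "DET"), ("veterinary", "ADJ"), ("medicinal", "ADJ"), ("products", "NOUN"),
    ("and", "CONJ"), ("plant", "NOUN"), ("protection", "NOUN"), ("products", "NOUN"),
]

adj_adj_noun = match_pattern(tagged, ["ADJ", "ADJ", "NOUN"])
noun_noun_noun = match_pattern(tagged, ["NOUN", "NOUN", "NOUN"])
print(adj_adj_noun)
print(noun_noun_noun)
```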
  • 116. Term-o-Matic http://www.termomatic.com/ Alberto Sim˜oes A5 - Specialized Dictionaries 106/138
  • 117. Term-o-Matic
    What it is:
      A simple web application;
      Without user control;
      Developed specifically for this class;
      An implementation of some of the methods presented before;
    What it is not:
      Commercial software;
      A professional tool;
      A tool free of bugs;
      A multilingual tool.
  • 119. Term-o-Matic: overview Main screen, shows options, and summary on available data. Alberto Sim˜oes A5 - Specialized Dictionaries 108/138
  • 120. Term-o-Matic: Add Text Use the Add Text option to add one-grams, bigrams and trigrams into the database (English, please!). Alberto Sim˜oes A5 - Specialized Dictionaries 109/138
  • 121. Term-o-Matic: Add Text feedback After adding some text, a summary of the amount of data added is shown. Alberto Sim˜oes A5 - Specialized Dictionaries 110/138
  • 122. Term-o-Matic: Manage Stopwords The Stop Words option lets you manage the list of stop words. You can add words (to add more than one, separate them with spaces or punctuation) and delete them.
  • 123. Term-o-Matic: Manage Lexicon The Standard Lexicon option is very similar to the Stop Words option, but for the generic words. Alberto Sim˜oes A5 - Specialized Dictionaries 112/138
  • 124. T-o-M: Words, Bigrams and Trigrams The Study Words, Study Bigrams and Study Trigrams work all in the same way, showing a list of words/bigrams/trigrams. Alberto Sim˜oes A5 - Specialized Dictionaries 113/138
  • 125. T-o-M: Words, Bigrams and Trigrams Note that the PMI column is empty. This measure takes some time to compute, and therefore should be computed only when needed. Alberto Sim˜oes A5 - Specialized Dictionaries 114/138
  • 126. T-o-M: Words, Bigrams and Trigrams To compute PMI, use the Compute bi/trigrams PMI option. After the software issues an "OK" message, hit the back button on your browser and refresh.
  • 127. T-o-M: Words, Bigrams and Trigrams By default the list is sorted by occurrence count. You can change to PMI order as soon as it is computed. Alberto Sim˜oes A5 - Specialized Dictionaries 116/138
  • 128. T-o-M: Words, Bigrams and Trigrams It is possible to remove entries with stop-words or punctuation; or entries with common words. Alberto Sim˜oes A5 - Specialized Dictionaries 117/138
  • 129. T-o-M: Filtering by pattern To filter by a morphological pattern, make sure you have run the Compute Morph. Analysis option after the last time you entered text. When the software says the process is complete (OK), hit the back button, and you are ready to use the pattern filtering. Just choose the categories you are looking for, and search for them.
  • 130. T-o-M: Filtering by Pattern Alberto Sim˜oes A5 - Specialized Dictionaries 119/138
  • 131. Term-o-Matic: standard operation guide
    1 Use the Add Text option to add text.
        Use it as many times as you need to create a big enough corpus;
        Do not add too much text at once. Add it in blocks.
        Be sure to add thematic text;
    2 Define a list of stop words (you might already have one).
    3 Define a list of common words. Look for such lists on the web.
    4 Compute PMIs and Morphological Analysis.
    5 Do queries!
  • 132. Evaluation Task Five students, Five subject areas, Five Term-o-Matic. Computer Science (http://termomatic.com/termomatic1) Medicine (http://termomatic.com/termomatic2) Europe (http://termomatic.com/termomatic3) Animal Biology (http://termomatic.com/termomatic4) Sports (http://termomatic.com/termomatic5) Alberto Sim˜oes A5 - Specialized Dictionaries 121/138
  • 133. Part V Terminology Extraction from Multilingual Corpora Alberto Sim˜oes A5 - Specialized Dictionaries 122/138
  • 134. Overview 12 Sentence and Word Alignment 13 Parallel Patterns Alberto Sim˜oes A5 - Specialized Dictionaries 123/138
  • 135. Sentence Alignment Sentence alignment is the task of detecting translation relationships between sentences in parallel corpora. If sα is a sentence in a language Lα and sβ is a sentence in a language Lβ, the alignment process creates the pair (sα, sβ) if (there is a high probability that) sβ is a translation of sα. Alberto Sim˜oes A5 - Specialized Dictionaries 124/138
  • 136. Word Alignment
    Word alignment is the task of detecting translation relationships between words or terms in sentence-aligned parallel corpora.
    There are two approaches to word alignment:
      for each aligned sentence pair, create a link between every word and its translation;
      for the complete corpus, obtain a relationship between a word and a set of probable translations, together with a confidence measure (a kind of translation probability);
  • 138. Probabilistic Translation Dictionaries
    Obtained with one of the word alignment methods;
    Define a relationship between a word and a set of probable translations;
      T(europe) =
        europa      94.7%
        europeus     3.4%
        europeu      0.8%
        europeia     0.1%
      T(stupid) =
        estúpido    47.6%
        estúpida    11.0%
        estúpidos    7.4%
        avisada      5.6%
        direita      5.6%
        impasse      4.5%
        ocupado      3.8%
  • 139. Translation Matrix
                  discussion about alternative sources of financing for the european radical alliance  .
    discussão             44     0           0       0  0         0   0   0        0       0        0  0
    sobre                  0    11           0       0  0         0   0   0        0       0        0  0
    fontes                 0     0           0      74  0         0   0   0        0       0        0  0
    de                     0     3           0       0 27         0   6   3        0       0        0  0
    financiamento          0     0           0       0  0        56   0   0        0       0        0  0
    alternativas           0     0          23       0  0         0   0   0        0       0        0  0
    para                   0     0           0       0  0         0  28   0        0       0        0  0
    a                      0     1           0       0  1         0   4  33        0       0        0  0
    aliança                0     0           0       0  0         0   0   0        0       0       65  0
    radical                0     0           0       0  0         0   0   0        0      80        0  0
    europeia               0     0           0       0  0         0   0   0       59       0        0  0
    .                      0     0           0       0  0         0   0   0        0       0        0 80
    Using the probabilistic translation dictionaries we are able to construct a translation matrix;
    Each cell has a translation probability obtained from the dictionary;
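Building such a matrix from a probabilistic translation dictionary can be sketched as a simple lookup. The nested-dictionary layout, the direction of the dictionary (Portuguese to English) and the function name `translation_matrix` are assumptions of this sketch; the three probabilities are the values of the corresponding cells in the matrix above, read as percentages.

```python
def translation_matrix(source, target, ptd):
    """One row per source word, one column per target word; missing pairs get 0.0."""
    return [[ptd.get(s, {}).get(t, 0.0) for t in target] for s in source]

# Dictionary fragment: source word -> {translation: probability}.
PTD = {
    "aliança": {"alliance": 0.65},
    "radical": {"radical": 0.80},
    "europeia": {"european": 0.59},
}

source = ["aliança", "radical", "europeia"]
target = ["european", "radical", "alliance"]
matrix = translation_matrix(source, target, PTD)

for word, row in zip(source, matrix):
    print(word, row)
```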
  • 140. Translation Patterns Translation changes word order (for some language pairs!); This change can be foreseen; This change can be defined formally as a pattern; These patterns can be used to obtain term candidates. Alberto Sim˜oes A5 - Specialized Dictionaries 128/138
  • 141. Translation Pattern 1: ABBA
                 Jogos  Olímpicos
      Olympic               X
      Games        X
    Formally, T(A · B) = T(B) · T(A)
    Or in the tool syntax: [ABBA] A B = B A
  • 142. Translation Pattern 2: IDH
                     índice  de  desenvolvimento  humano
      human                                          X
      development                       X
      index            X
    T(I · "de" · D · H) = T(H) · T(D) · T(I)
    [IDH] I "de" D H = H D I
  • 143. Translation Pattern 3: FTP
                  protocolo  de  transferência  de  ficheiros
      file                                              X
      transfer                         X
      protocol        X
    T(P · "de" · T · "de" · F) = T(F) · T(T) · T(P)
    [FTP] P "de" T "de" F = F T P
  • 144. Patterns in Translation Matrix
                  discussion about alternative sources of financing for the european radical alliance  .
    discussão             44     0           0       0  0         0   0   0        0       0        0  0
    sobre                  0    11           0       0  0         0   0   0        0       0        0  0
    fontes                 0     0           0      74  0         0   0   0        0       0        0  0
    de                     0     3           0       0 27         0   6   3        0       0        0  0
    financiamento          0     0           0       0  0        56   0   0        0       0        0  0
    alternativas           0     0          23       0  0         0   0   0        0       0        0  0
    para                   0     0           0       0  0         0  28   0        0       0        0  0
    a                      0     1           0       0  1         0   4  33        0       0        0  0
    aliança                0     0           0       0  0         0   0   0        0       0       65  0
    radical                0     0           0       0  0         0   0   0        0      80        0  0
    europeia               0     0           0       0  0         0   0   0       59       0        0  0
    .                      0     0           0       0  0         0   0   0        0       0        0 80
    The two boxes correspond to the following two patterns:
      [P1] F "de" N A = A F "of" N   (fontes de financiamento alternativas ⇒ alternative sources of financing)
      [P2] A B C = C B A             (aliança radical europeia ⇒ european radical alliance)
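One way to picture how a pattern such as A B C = C B A shows up in the matrix is to look for square blocks whose anti-diagonal cells are all high. The sketch below does that on the non-zero cells of the matrix above, stored sparsely as (row, column) pairs. The threshold of 50 and the function `find_reversals` are assumptions made for illustration; this is not the algorithm of the tool mentioned in the slides.

```python
ROWS = ["discussão", "sobre", "fontes", "de", "financiamento", "alternativas",
        "para", "a", "aliança", "radical", "europeia", "."]
COLS = ["discussion", "about", "alternative", "sources", "of", "financing",
        "for", "the", "european", "radical", "alliance", "."]

# Non-zero cells of the slide's matrix, indexed as (row, column).
CELLS = {
    (0, 0): 44, (1, 1): 11, (2, 3): 74,
    (3, 1): 3, (3, 4): 27, (3, 6): 6, (3, 7): 3,
    (4, 5): 56, (5, 2): 23, (6, 6): 28,
    (7, 1): 1, (7, 4): 1, (7, 6): 4, (7, 7): 33,
    (8, 10): 65, (9, 9): 80, (10, 8): 59, (11, 11): 80,
}

def find_reversals(cells, n_rows, n_cols, size, threshold):
    """Top-left corners of size x size blocks whose anti-diagonal cells all reach the threshold."""
    hits = []
    for i in range(n_rows - size + 1):
        for j in range(n_cols - size + 1):
            if all(cells.get((i + k, j + size - 1 - k), 0) >= threshold
                   for k in range(size)):
                hits.append((i, j))
    return hits

# Assumed threshold of 50 for "confident" cells.
hits = find_reversals(CELLS, len(ROWS), len(COLS), 3, 50)
for i, j in hits:
    print(" ".join(ROWS[i:i + 3]), "=>", " ".join(COLS[j:j + 3]))
```

A pattern with fixed words, such as F "de" N A = A F "of" N, would need a more general matcher than this anti-diagonal check.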
  • 145. Terms extracted using A B = B A
      21007 união europeia ⇒ european union
       9301 parlamento europeu ⇒ european parliament
       4171 direitos humanos ⇒ human rights
       3504 estados unidos ⇒ united states
       2353 mercado interno ⇒ internal market
       1911 posição comum ⇒ common position
       1826 países candidatos ⇒ candidate countries
       1776 comissão europeia ⇒ european commission
       1708 conselho europeu ⇒ european council
       1629 saúde pública ⇒ public health
       1558 direitos fundamentais ⇒ fundamental rights
       1546 nações unidas ⇒ united nations
       1337 países terceiros ⇒ third countries
       1294 conferência intergovernamental ⇒ intergovernmental conference
       1258 fundos estruturais ⇒ structural funds
  • 146. Terms extracted using A "de" B = B A
      729 plano de acção ⇒ action plan
      722 conselho de segurança ⇒ security council
      680 processo de paz ⇒ peace process
      582 mercado de trabalho ⇒ labour market
      580 pena de morte ⇒ death penalty
      492 pacto de estabilidade ⇒ stability pact
      431 política de defesa ⇒ defence policy
      353 acordo de associação ⇒ association agreement
      348 protocolo de quioto ⇒ kyoto protocol
      343 programa de acção ⇒ action programme
      259 branqueamento de capitais ⇒ money laundering
      258 comité de conciliação ⇒ conciliation committee
      241 política de concorrência ⇒ competition policy
      226 processo de conciliação ⇒ conciliation procedure
      217 requerentes de asilo ⇒ asylum seekers
  • 147. Terms extracted using A B C = C B A
      531 política agrícola comum ⇒ common agricultural policy
      418 banco central europeu ⇒ european central bank
      329 tribunal penal internacional ⇒ international criminal court
      166 aliança livre europeia ⇒ european free alliance
      156 modelo social europeu ⇒ european social model
      153 partidos políticos europeus ⇒ european political parties
       83 fundo monetário internacional ⇒ international monetary fund
       75 política externa comum ⇒ common foreign policy
       66 organização marítima internacional ⇒ international maritime organisation
       65 própria união europeia ⇒ european union itself
       65 fundo social europeu ⇒ european social fund
       55 direitos humanos fundamentais ⇒ fundamental human rights
       45 relações económicas externas ⇒ external economic relations
       45 homens e mulheres ⇒ women and men
       45 agência espacial europeia ⇒ european space agency
  • 148. Terms extracted: I "de" D H = H D I
      95 mandato de captura europeu ⇒ european arrest warrant
      85 fontes de energia renováveis ⇒ renewable energy sources
      80 mandado de captura europeu ⇒ european arrest warrant
      67 sistemas de segurança social ⇒ social security systems
      64 zona de comércio livre ⇒ free trade area
      55 força de reacção rápida ⇒ rapid reaction force
      54 orientações de política económica ⇒ economic policy guidelines
      46 planos de acção nacionais ⇒ national action plans
      46 direitos de propriedade intelectual ⇒ intellectual property rights
      33 sistema de alerta rápido ⇒ rapid alert system
      29 política de defesa comum ⇒ common defence policy
      29 método de coordenação aberta ⇒ open coordination method
      27 método de coordenação aberto ⇒ open coordination method
      27 conselho de empresa europeu ⇒ european works council
      25 acordo de comércio livre ⇒ free trade agreement
  • 149. Adding Morphological Constraints
    The pattern language supports constraints;
    Constraints can be of different types;
    The most interesting are the morphological ones:
      [ABBA] A B[CAT<-adj] = B[CAT<-adj] A
    With this kind of constraint we can force the words in specific positions to be of a specific morphological category.
  • 150. Further Reading
    Alignment tasks
      Sentence Alignment Survey
        http://www.statmt.org/survey/Topic/SentenceAlignment
      An overview of bitext alignment algorithms
        http://www.ida.liu.se/~jodfo/gslt/bitext-alignment-jody.pdf
      Word Alignment Survey
        http://www.statmt.org/survey/Topic/WordAlignment
    Terminology from Parallel Corpora
      Parallel corpus-based bilingual terminology extraction
        http://ambs.perl-hackers.net/publications/tia09.pdf