The document discusses DUCKS (Data Unified Concept Knowledge Sets), a tool that aims to build a broad multilingual lexicon by aligning open linguistic data through crowdsourcing. DUCKS allows users to match terms from different languages to concepts, with the goal of creating a complete matrix of human expression across languages over time. Currently, DUCKS contains over 100,000 English concepts from Princeton WordNet aligned to about 60 other languages via Global WordNet. The document provides examples of how specific languages like Fula and Kemedzung have been added to DUCKS from digital lexicons.
Aligning open data through crowdsourcing
1. Martin Benjamin, Sina Mansour, Karl Aberer
DUCKS in a Row:
Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon
Vital Voices: Linking Language and Wellbeing
5th International Conference on Language Documentation and Conservation
Honolulu, Hawaii - March 5, 2017
4. Goal: A complete matrix of human expression across time and space
• As a knowledge resource
• As a data resource
5. In service since 1994 (originally at Yale Council on African Studies)
International NGO since 2009
• Registered non-profit in USA and Switzerland
Academic Home since 2013:
EPFL - Swiss Federal Institute of Technology in Lausanne
LSIR - Distributed Information Systems Laboratory
6. White House Big Data Initiative:
Launch Partner for Building the Data Innovation Ecosystem
Networking and Information Technology R&D Program
Office of Science and Technology Policy
39. equivalence
• Parallel
• Similar
• Explanatory
mkono (Swahili) = hand + arm (English)
?: might be transitive across languages
mkono (Swahili) = lima (Hawaiian)
translations
40. equivalence
• Parallel
• Similar
• Explanatory (Lexical Gaps)
hand (English) = 10.2 cm (most languages)
✗: not transitive across languages
translations
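The transitivity rule on these two slides can be sketched in code. This is a minimal illustration, not the DUCKS implementation: the `Equivalence` enum, the `Link` record, the `infer_translation` helper, and the concept label are all hypothetical. It assumes that parallel links are safe to chain (these slides do not state that), that similar links need human review, and that explanatory links are never chained.

```python
from dataclasses import dataclass
from enum import Enum


class Equivalence(Enum):
    PARALLEL = "parallel"        # assumed safe to chain (not stated on these slides)
    SIMILAR = "similar"          # "might be transitive": needs human review
    EXPLANATORY = "explanatory"  # lexical gap: not transitive


@dataclass(frozen=True)
class Link:
    term: str
    language: str
    concept: str                 # hypothetical concept label
    equivalence: Equivalence


def infer_translation(a: Link, b: Link) -> str:
    """Classify a candidate translation between two terms linked to one concept."""
    if a.concept != b.concept:
        return "unrelated"
    kinds = {a.equivalence, b.equivalence}
    if Equivalence.EXPLANATORY in kinds:
        return "rejected"
    if Equivalence.SIMILAR in kinds:
        return "needs review"
    return "accepted"


mkono = Link("mkono", "Swahili", "hand+arm", Equivalence.SIMILAR)
lima = Link("lima", "Hawaiian", "hand+arm", Equivalence.SIMILAR)
print(infer_translation(mkono, lima))  # needs review
```

Taking the weakest link in the chain as the verdict keeps the sketch conservative: one explanatory link is enough to block an inferred translation.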
42. • over 100,000 English defined concepts from Princeton WordNet
• Heavy Anglo-American bias
• James Cook yes, Kalaniʻōpuʻu-a-Kaiamamao no
• about 60 languages aligned via Global WordNet
• over 800,000 English concepts from Wiktionary (in process)
• Wiktionary translations to many languages (highly problematic)
• other languages (Spanish already) can be pivots when:
• aligned to DUCKS
• entries have definitions
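The two pivot conditions listed above can be expressed as a simple filter. The sketch below is illustrative only: the `Entry` record, the `pivot_entries` helper, the concept identifiers, and the sample definitions are assumptions made for the example, not part of DUCKS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    term: str
    language: str
    definition: str = ""                  # own-language definition, if any
    ducks_concept: Optional[str] = None   # hypothetical DUCKS concept id


def pivot_entries(entries: List[Entry]) -> List[Entry]:
    """Keep entries that meet both conditions: aligned to DUCKS and defined."""
    return [
        e for e in entries
        if e.ducks_concept is not None and e.definition.strip()
    ]


# Illustrative data only; the definition text and concept ids are made up.
entries = [
    Entry("mano", "Spanish", "parte del cuerpo al final del brazo", "concept:hand"),
    Entry("brazo", "Spanish", "", "concept:arm"),                   # aligned, no definition
    Entry("sobremesa", "Spanish", "conversación tras la comida"),   # defined, not aligned
]
print([e.term for e in pivot_entries(entries)])  # ['mano']
```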
43. • Players match a term on the left with a concept on the right
• Multiple matches possible
• Bad definitions can be flagged
• Null matches: on indigenous concepts:
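One way the players' answers could be combined is sketched below. The slides do not describe how votes are aggregated, so the agreement threshold `min_votes`, the `tally_matches` helper, and the sample votes are assumptions for illustration. The sketch keeps every concept that enough players chose, since multiple matches are possible, and records a null match when enough players say that no listed concept fits.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

Vote = Tuple[str, Optional[str]]  # (term, concept id); None records a null match


def tally_matches(
    votes: Iterable[Vote], min_votes: int = 2
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Return (accepted concepts per term, terms judged to have no match)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for term, concept in votes:
        counts[term][concept] += 1

    accepted: Dict[str, Set[str]] = {}
    null_terms: List[str] = []
    for term, counter in counts.items():
        # A term may keep several concepts: multiple matches are possible.
        concepts = {c for c, n in counter.items() if c is not None and n >= min_votes}
        if concepts:
            accepted[term] = concepts
        elif counter[None] >= min_votes:
            null_terms.append(term)
    return accepted, null_terms


# Illustrative votes only.
votes = [
    ("mkono", "hand"), ("mkono", "hand"),
    ("mkono", "arm"), ("mkono", "arm"),
    ("ugali", None), ("ugali", None),
]
accepted, null_terms = tally_matches(votes)
print(accepted)    # mkono keeps both 'hand' and 'arm'
print(null_terms)  # ['ugali']
```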
46. Switch flippable:
• About 60 wordlists and datasets: data prepared and permissions granted
• SignTyp set for ~20 sign languages
• Comparative African Wordlist: pre-aligned, no need for the tool
• Any lexicon in a usable digital format whose copyright permits use
47. What we can work with:
• Word lists
• part of speech is helpful
• Electronic versions of print dictionaries
• parse and play
• Digital dictionaries
• FLEx, etc
49. Fula (West Africa)
ABADA Ar
abada, abadaa, abadan DFZ Z<->
never(F) (with negation); ever(F); long ago
jamais(D) (avec la négation)(Z); jamais; il y a longtemps
Abada mi yahaali. (F): I have never gone. ; Je ne suis jamais allé.
abada pati (F): don't ever ; ne faîtes jamais (qqch)
gila abada (F): since long ago, forever ; depuis longtemps, toujours
11,000 Fula senses
• 332 clearly computable matches
• ~7500 matches for DUCKS
• ~3000 null matches for manual follow-up
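As a quick check on these figures, the three categories can be added up and expressed as shares of the total. The counts come from the slide; the total and the two larger counts are rounded there, so the percentages are approximate. The variable names are illustrative.

```python
total_senses = 11_000      # rounded figure from the slide
computable = 332           # clearly computable matches
ducks_candidates = 7_500   # approximate ("~7500")
null_matches = 3_000       # approximate ("~3000")

accounted = computable + ducks_candidates + null_matches
print(accounted)  # 10832, consistent with a rounded total of 11,000

for label, count in [
    ("computable", computable),
    ("for DUCKS", ducks_candidates),
    ("null", null_matches),
]:
    print(f"{label}: {count / total_senses:.1%}")
# computable: 3.0%
# for DUCKS: 68.2%
# null: 27.3%
```

On these numbers, only about 3% of the Fula senses can be aligned automatically, which is why most of the work goes to the matching tool or to manual follow-up.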