SlideShare uma empresa Scribd logo
1 de 57
Martin Benjamin, Sina Mansour, Karl Aberer
DUCKS in a Row:
Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon
Vital Voices: Linking Language and Wellbeing
5th International Conference on Language Documentation and Conservation
Honolulu, Hawaii - March 5, 2017 1
2
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
kamusi is Swahili for dictionary
3
Goal: A complete matrix of human expression across time and space
• As a knowledge resource
• As a data resource
4
In service since 1994 (originally at Yale Council on African Studies)
International NGO since 2009
• Registered non-profit in USA and Switzerland
Academic Home since 2013:
EPFL - Swiss Federal Institute of Technology in Lausanne
LSIR - Distributed Systems Information Laboratory 5
White House Big Data Initiative:
Launch Partner for Building the Data Innovation Ecosystem
Networking and Information Technology R&D Program
Office of Science and Technology Policy 6
7
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
light
8
light
why multilingual dictionaries were impossible
9
light
lumineux
léger
allégé
léger
why multilingual dictionaries were impossible
10
light
lumineux
léger
allégé
léger
why multilingual dictionaries were impossible
WOLF 02121424-a:
léger
lumière
WOLF 01186408-a:
léger
WOLF 00993117-a:
léger
allégé
lumière
light
WOLF 00269989-a:
lumière
lumineux
clair
PWN (English Wordnet):
light x 47
WOLF (French Wordnet):
light = lumière x 44
light = léger x 37
léger = lumière x 36
11
light
léger
why multilingual dictionaries were impossible
lumineux
allégé
léger
12
why multilingual dictionaries were impossible
13
why multilingual dictionaries were impossible
lumineux
14
15
light
fr: lumineux
fr: léger
fr: allégé
fr: léger
why multilingual dictionaries were impossible
th: ที่แคลอรี่ต่ำ
fi: kaloriton
sw: pungufu
th: เบำ
fi: kevyt
sw: -epesi
th: สว่ำง
fi: valoisasw: -enye mwanga
th: ซึ่งไร้สำระ
fi: tyhjänpäiväinen
sw: -a kuchekesha
16
en: light
fr: lumineux
fr: léger
fr: allégé
fr: léger
why multilingual dictionaries were impossible
th: ที่แคลอรี่ต่ำ
fi: kaloriton
sw: pungufu
th: เบำ
fi: kevyt
sw: -epesi
th: สว่ำง
fi: valoisasw: -enye mwanga
th: ซึ่งไร้สำระ
fi: tyhjänpäiväinen
sw: -a kuchekesha
en: light
en: light
en: light
17
fr: lumineux
fr: léger
fr: allégé
why multilingual dictionaries were impossible
th: ที่แคลอรี่ต่ำ
fi: kaloriton
sw: pungufu
th: เบำ
fi: kevyt
sw: -epesi
th: สว่ำง
fi: valoisasw: -enye mwanga
light
fr: léger
th: ซึ่งไร้สำระ
fi: tyhjänpäiväinen
sw: -a kuchekesha
18
why multilingual dictionaries were impossible
19
light
how Kamusi makes a multilingual dictionary possible
20
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
21
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
fr: lumineux
fr: léger
fr: allégé
fr: léger
22
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
fr: lumineux th: สว่ำงfi: valoisasw: -enye mwanga
23
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
fr: léger th: เบำfi: kevytsw: -epesi
24
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
fr: léger th: ซึ่งไร้สำระfi: tyhjänpäiväinensw: -a kuchekesha
25
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
fr: allégé th: ที่แคลอรี่ต่ำfi: kaloritonsw: pungufu
26
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
fr: allégé th: ที่แคลอรี่ต่ำfi: kaloritonsw: pungufu
fr: léger th: ซึ่งไร้สำระfi: hölynpölysw: -a kuchekesha
fr: léger th: เบำfi: kevytsw: -epesi
fr: lumineux th: สว่ำงfi: valoisasw: -enye mwanga
27
how Kamusi makes a multilingual dictionary possible
light (not heavy) fr: léger th: เบำfi: kevytsw: -epesi
fr: léger (sandy)
fr: léger (low alcohol)
fr: léger (without much luggage)
28
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
29
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
30
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
31
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
32
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
33
light (not serious)
light (not fattening)
light (not heavy)
light (not dark)
how Kamusi makes a multilingual dictionary possible
34
35
Side by Side Vocabulary Translation:
KamusiGOLD vs. Google Translate
36
equivalence
• Parallel
• Similar
• Explanatory
translations 37
equivalence
• Parallel
• Similar
• Explanatory
hand (English) = main (French)
✓: transitive across languages
translations 38
equivalence
• Parallel
• Similar
• Explanatory
mkono (Swahili) = hand + arm (English)
⁇ : might be transitive across languages
mkono (Swahili) = lima (Hawaiian)
translations
difference difference translation
39
equivalence
• Parallel
• Similar
• Explanatory (Lexical Gaps)
hand (English) = 10.2 cm (most languages)
✗: not transitive across languages
translations 40
41
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
• over 100,000 English defined concepts from Princeton Wordnet
• Heavy Anglo-American bias
• James Cook yes, Kalaniʻōpuʻu-a-Kaiamamao no
• about 60 languages aligned via Global Wordnet
• over 800,000 English concepts from Wiktionary (in process)
• Wiktionary translations to many languages (highly problematic)
• other languages (Spanish already) can be pivots when:
• aligned to DUCKS
• entries have definitions
42
• Players match term from the left with
concept on the right
• Multiple matches possible
• Bad definitions can be flagged
• Null matches: on indigenous concepts:
43
Still to come:
•Multiplayer – validation by consensus
•Manual term searching
•DUCKS for FLEx
44
45
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
Switch flippable:
• About 60 wordlists and datasets data prepared and
permissions granted
• SignTyp set for ~20 sign languages
• Comparative African Wordlist – pre-aligned, no need for
the tool
• Any lexicon in a useable digital format that is copyright
available
46
What we can work with:
• Word lists
• part of speech is helpful
• Electronic versions of print dictionaries
• parse and play
• Digital dictionaries
• FLEx, etc
47
48
Kemedzung (Cameroon) from FLEx
49
Fula (West Africa)
ABADA Ar
abada, abadaa, abadan DFZ Z<->
never(F) (with negation); ever(F); long ago
jamais(D) (avec la négation)(Z); jamais; il y a longtemps
Abada mi yahaali. (F): I have never gone. ; Je ne suis jamais allé.
abada pati (F): don't ever ; ne faîtes jamais (qqch)
gila abada (F): since long ago, forever ; depuis longtemps, toujours
11,000 Fula senses
• 332 clearly computable matches
• ~7500 matches for DUCKS
• ~3000 null matches for manual follow-up
50
51
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
52
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
53
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
54
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
55
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
56
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
Martin Benjamin, Sina Mansour, Karl Aberer
DUCKS in a Row:
Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon
Vital Voices: Linking Language and Wellbeing
5th International Conference on Language Documentation and Conservation
Honolulu, Hawaii - March 5, 2017 57

Mais conteúdo relacionado

Semelhante a Aligning open data through crowdsourcing

English intro B1 intensive Feb 2013
English intro B1 intensive Feb 2013English intro B1 intensive Feb 2013
English intro B1 intensive Feb 2013David Nicholson
 
Lexical Resources for Portuguese
Lexical Resources  for PortugueseLexical Resources  for Portuguese
Lexical Resources for PortugueseValeria de Paiva
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technologytechiaith
 
Deep learning for natural language embeddings
Deep learning for natural language embeddingsDeep learning for natural language embeddings
Deep learning for natural language embeddingsRoelof Pieters
 
What is english; what is language (online new)
What is english; what is language (online new)What is english; what is language (online new)
What is english; what is language (online new)Simon Smith
 
History and origin of the english language
History and origin of the english languageHistory and origin of the english language
History and origin of the english languageNathanUrinovsky
 
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...shakimov
 
How to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningHow to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningLena Shakurova
 
Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfDeptii Chaudhari
 
Lit 303 chapter two sounds
Lit 303 chapter two soundsLit 303 chapter two sounds
Lit 303 chapter two soundsMustafa Sherwan
 
Introduction_to_Language_and_Linguistics.pptx
Introduction_to_Language_and_Linguistics.pptxIntroduction_to_Language_and_Linguistics.pptx
Introduction_to_Language_and_Linguistics.pptxValeryRamirezMendez
 
Language as a Virus: Why a Sustainability Story Spreads, or Doesn’t
Language as a Virus: Why a Sustainability Story Spreads, or Doesn’tLanguage as a Virus: Why a Sustainability Story Spreads, or Doesn’t
Language as a Virus: Why a Sustainability Story Spreads, or Doesn’tSustainable Brands
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportAlexandre Rademaker
 
Colours and Accessibility (a11y) - WordCamp Europe 2014 Sofia
Colours and Accessibility (a11y) - WordCamp Europe 2014 SofiaColours and Accessibility (a11y) - WordCamp Europe 2014 Sofia
Colours and Accessibility (a11y) - WordCamp Europe 2014 SofiaBoiteaweb
 
Building a suite of online resources to support academic vocabulary learning
Building a suite of online resources to support academic vocabulary learningBuilding a suite of online resources to support academic vocabulary learning
Building a suite of online resources to support academic vocabulary learningStefania Spina
 
New Words New Toys
New Words New ToysNew Words New Toys
New Words New Toysresec
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
Software languages
Software languagesSoftware languages
Software languagesEelco Visser
 

Semelhante a Aligning open data through crowdsourcing (20)

English intro B1 intensive Feb 2013
English intro B1 intensive Feb 2013English intro B1 intensive Feb 2013
English intro B1 intensive Feb 2013
 
Lexical Resources for Portuguese
Lexical Resources  for PortugueseLexical Resources  for Portuguese
Lexical Resources for Portuguese
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technology
 
Deep learning for natural language embeddings
Deep learning for natural language embeddingsDeep learning for natural language embeddings
Deep learning for natural language embeddings
 
What is english; what is language (online new)
What is english; what is language (online new)What is english; what is language (online new)
What is english; what is language (online new)
 
History and origin of the english language
History and origin of the english languageHistory and origin of the english language
History and origin of the english language
 
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
 
How to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningHow to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learning
 
Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdf
 
Lit 303 chapter two sounds
Lit 303 chapter two soundsLit 303 chapter two sounds
Lit 303 chapter two sounds
 
Introduction_to_Language_and_Linguistics.pptx
Introduction_to_Language_and_Linguistics.pptxIntroduction_to_Language_and_Linguistics.pptx
Introduction_to_Language_and_Linguistics.pptx
 
Language as a Virus: Why a Sustainability Story Spreads, or Doesn’t
Language as a Virus: Why a Sustainability Story Spreads, or Doesn’tLanguage as a Virus: Why a Sustainability Story Spreads, or Doesn’t
Language as a Virus: Why a Sustainability Story Spreads, or Doesn’t
 
eLearning Systems on examples
eLearning Systems on exampleseLearning Systems on examples
eLearning Systems on examples
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project Report
 
Colours and Accessibility (a11y) - WordCamp Europe 2014 Sofia
Colours and Accessibility (a11y) - WordCamp Europe 2014 SofiaColours and Accessibility (a11y) - WordCamp Europe 2014 Sofia
Colours and Accessibility (a11y) - WordCamp Europe 2014 Sofia
 
Building a suite of online resources to support academic vocabulary learning
Building a suite of online resources to support academic vocabulary learningBuilding a suite of online resources to support academic vocabulary learning
Building a suite of online resources to support academic vocabulary learning
 
New Words New Toys
New Words New ToysNew Words New Toys
New Words New Toys
 
MTT-2013
MTT-2013MTT-2013
MTT-2013
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Software languages
Software languagesSoftware languages
Software languages
 

Mais de EPFL (École polytechnique fédérale de Lausanne) (7)

Big Translations in Small Spaces
Big Translations in Small SpacesBig Translations in Small Spaces
Big Translations in Small Spaces
 
Pre:D Say What You Mean
Pre:D Say What You MeanPre:D Say What You Mean
Pre:D Say What You Mean
 
Live Localization with Kamusi Planet K
Live Localization with Kamusi Planet KLive Localization with Kamusi Planet K
Live Localization with Kamusi Planet K
 
Emoji International Name Finder
Emoji International Name FinderEmoji International Name Finder
Emoji International Name Finder
 
Projet Complet: Paramètres Régionaux Pour 100 Langues Africaines
Projet Complet: Paramètres Régionaux Pour 100 Langues AfricainesProjet Complet: Paramètres Régionaux Pour 100 Langues Africaines
Projet Complet: Paramètres Régionaux Pour 100 Langues Africaines
 
Achievement And Lessons Learned By An Loc
Achievement And Lessons Learned By An LocAchievement And Lessons Learned By An Loc
Achievement And Lessons Learned By An Loc
 
Completed Project: 100 African Language Locales
Completed Project: 100 African Language LocalesCompleted Project: 100 African Language Locales
Completed Project: 100 African Language Locales
 

Último

The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 

Último (20)

The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 

Aligning open data through crowdsourcing

  • 1. Martin Benjamin, Sina Mansour, Karl Aberer DUCKS in a Row: Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon Vital Voices: Linking Language and Wellbeing 5th International Conference on Language Documentation and Conservation Honolulu, Hawaii - March 5, 2017 1
  • 2. 2 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 3. kamusi is Swahili for dictionary 3
  • 4. Goal: A complete matrix of human expression across time and space • As a knowledge resource • As a data resource 4
  • 5. In service since 1994 (originally at Yale Council on African Studies) International NGO since 2009 • Registered non-profit in USA and Switzerland Academic Home since 2013: EPFL - Swiss Federal Institute of Technology in Lausanne LSIR - Distributed Systems Information Laboratory 5
  • 6. White House Big Data Initiative: Launch Partner for Building the Data Innovation Ecosystem Networking and Information Technology R&D Program Office of Science and Technology Policy 6
  • 7. 7 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 11. light lumineux léger allégé léger why multilingual dictionaries were impossible WOLF 02121424-a: léger lumière WOLF 01186408-a: léger WOLF 00993117-a: léger allégé lumière light WOLF 00269989-a: lumière lumineux clair PWN (English Wordnet): light x 47 WOLF (French Wordnet): light = lumière x 44 light = léger x 37 léger = lumière x 36 11
  • 12. light léger why multilingual dictionaries were impossible lumineux allégé léger 12
  • 13. why multilingual dictionaries were impossible 13
  • 14. why multilingual dictionaries were impossible lumineux 14
  • 15. 15
  • 16. light fr: lumineux fr: léger fr: allégé fr: léger why multilingual dictionaries were impossible th: ที่แคลอรี่ต่ำ fi: kaloriton sw: pungufu th: เบำ fi: kevyt sw: -epesi th: สว่ำง fi: valoisasw: -enye mwanga th: ซึ่งไร้สำระ fi: tyhjänpäiväinen sw: -a kuchekesha 16
  • 17. en: light fr: lumineux fr: léger fr: allégé fr: léger why multilingual dictionaries were impossible th: ที่แคลอรี่ต่ำ fi: kaloriton sw: pungufu th: เบำ fi: kevyt sw: -epesi th: สว่ำง fi: valoisasw: -enye mwanga th: ซึ่งไร้สำระ fi: tyhjänpäiväinen sw: -a kuchekesha en: light en: light en: light 17
  • 18. fr: lumineux fr: léger fr: allégé why multilingual dictionaries were impossible th: ที่แคลอรี่ต่ำ fi: kaloriton sw: pungufu th: เบำ fi: kevyt sw: -epesi th: สว่ำง fi: valoisasw: -enye mwanga light fr: léger th: ซึ่งไร้สำระ fi: tyhjänpäiväinen sw: -a kuchekesha 18
  • 19. why multilingual dictionaries were impossible 19
  • 20. light how Kamusi makes a multilingual dictionary possible 20
  • 21. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible 21
  • 22. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible fr: lumineux fr: léger fr: allégé fr: léger 22
  • 23. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible fr: lumineux th: สว่ำงfi: valoisasw: -enye mwanga 23
  • 24. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible fr: léger th: เบำfi: kevytsw: -epesi 24
  • 25. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible fr: léger th: ซึ่งไร้สำระfi: tyhjänpäiväinensw: -a kuchekesha 25
  • 26. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible fr: allégé th: ที่แคลอรี่ต่ำfi: kaloritonsw: pungufu 26
  • 27. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible fr: allégé th: ที่แคลอรี่ต่ำfi: kaloritonsw: pungufu fr: léger th: ซึ่งไร้สำระfi: hölynpölysw: -a kuchekesha fr: léger th: เบำfi: kevytsw: -epesi fr: lumineux th: สว่ำงfi: valoisasw: -enye mwanga 27
  • 28. how Kamusi makes a multilingual dictionary possible light (not heavy) fr: léger th: เบำfi: kevytsw: -epesi fr: léger (sandy) fr: léger (low alcohol) fr: léger (without much luggage) 28
  • 29. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible 29
  • 30. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible 30
  • 31. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible 31
  • 32. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible 32
  • 33. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible 33
  • 34. light (not serious) light (not fattening) light (not heavy) light (not dark) how Kamusi makes a multilingual dictionary possible 34
  • 35. 35
  • 36. Side by Side Vocabulary Translation: KamusiGOLD vs. Google Translate 36
  • 37. equivalence • Parallel • Similar • Explanatory translations 37
  • 38. equivalence • Parallel • Similar • Explanatory hand (English) = main (French) ✓: transitive across languages translations 38
  • 39. equivalence • Parallel • Similar • Explanatory mkono (Swahili) = hand + arm (English) ⁇ : might be transitive across languages mkono (Swahili) = lima (Hawaiian) translations difference difference translation 39
  • 40. equivalence • Parallel • Similar • Explanatory (Lexical Gaps) hand (English) = 10.2 cm (most languages) ✗: not transitive across languages translations 40
  • 41. 41 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 42. • over 100,000 English defined concepts from Princeton Wordnet • Heavy Anglo-American bias • James Cook yes, Kalaniʻōpuʻu-a-Kaiamamao no • about 60 languages aligned via Global Wordnet • over 800,000 English concepts from Wiktionary (in process) • Wiktionary translations to many languages (highly problematic) • other languages (Spanish already) can be pivots when: • aligned to DUCKS • entries have definitions 42
  • 43. • Players match term from the left with concept on the right • Multiple matches possible • Bad definitions can be flagged • Null matches: on indigenous concepts: 43
  • 44. Still to come: •Multiplayer – validation by consensus •Manual term searching •DUCKS for FLEx 44
  • 45. 45 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 46. Switch flippable: • About 60 wordlists and datasets data prepared and permissions granted • SignTyp set for ~20 sign languages • Comparative African Wordlist – pre-aligned, no need for the tool • Any lexicon in a useable digital format that is copyright available 46
  • 47. What we can work with: • Word lists • part of speech is helpful • Electronic versions of print dictionaries • parse and play • Digital dictionaries • FLEx, etc 47
  • 49. 49 Fula (West Africa) ABADA Ar abada, abadaa, abadan DFZ Z<-> never(F) (with negation); ever(F); long ago jamais(D) (avec la négation)(Z); jamais; il y a longtemps Abada mi yahaali. (F): I have never gone. ; Je ne suis jamais allé. abada pati (F): don't ever ; ne faîtes jamais (qqch) gila abada (F): since long ago, forever ; depuis longtemps, toujours 11,000 Fula senses • 332 clearly computable matches • ~7500 matches for DUCKS • ~3000 null matches for manual follow-up
  • 50. 50
  • 51. 51 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 52. 52 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 53. 53 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 54. 54 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 55. 55 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 56. 56 1. Multilingual lexicography: problems and principles 2. DUCKS: Data Unified Concept Knowledge Sets a. The tool b. The data 3. Your language in DUCKS
  • 57. Martin Benjamin, Sina Mansour, Karl Aberer DUCKS in a Row: Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon Vital Voices: Linking Language and Wellbeing 5th International Conference on Language Documentation and Conservation Honolulu, Hawaii - March 5, 2017 57