Looking at the Long Tail
2nd Spinoza Workshop
#SpinozaLongTail
It is time to move
from “Big Data”
to “small data”
Language as
a system and
language use
● The world changes rapidly (Brexit)
● The language system changes slowly
(Nexit)
● The relation between the two changes
constantly
Small data example
Imagine you visit a dear friend for a game of chess. During the game you complain
about your white queen being captured too early. While chatting, your friend tells
you that his Ana is now already 13 years old, and is beginning with high school in
September. After the game, you offer to buy him a beer in O’Neil’s, which is just a
2-3 minutes walk.
Small data example
Imagine [2] you visit [11] a dear [8] friend [5] for a game [14] of chess [2]. During
the game [14] you complain [2] about your white [25] queen [12] being captured
[11] too early. While chatting [4], your friend [5] tells [9] you that his Ana [42] is
now already 13 years [4] old [9], and is beginning [11] with [high school [1]] in
September. After the game [14], you offer [16] to buy [6] him a beer [1] in
O’Neil’s [5], which is just a 2-3 minutes [8] walk [17].
2*11*8*5*14*2*14*2*25*12*11*4*5*9*42*4*9*11*1*14*16*6*1*5*8*17
≈ 6.2293E21 possible interpretations, and still missing some
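As a quick check of the arithmetic, the sketch below multiplies the 26 per-word counts exactly as they are listed in the annotated text. The variable names are ours, and the independence assumption (every word can take any of its senses regardless of its neighbours) is the one implied by the product on the slide. With these counts the product comes to roughly 6.2 × 10^21.

```python
import math

# Per-word sense counts, copied in order from the annotated example.
sense_counts = [2, 11, 8, 5, 14, 2, 14, 2, 25, 12, 11, 4, 5,
                9, 42, 4, 9, 11, 1, 14, 16, 6, 1, 5, 8, 17]

# If every word is interpreted independently, the number of possible
# readings of the whole text is the product of the per-word counts.
combinations = math.prod(sense_counts)

print(f"{len(sense_counts)} annotated words")
print(f"{combinations:.4e} possible combinations")
```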
If a machine joined the conversation, what would it
understand?
Probably, it would think that you are talking about: capturing the TV series “The White
Queen”, the Japan-based airline ANA being 13 years old, and the sports
equipment store O’Neill, where you apparently can get a beer (or the recently
retired famous basketball player Shaq O’Neal).
Each of these interpretations may make sense from a Big Data perspective, but together
they do not make any sense as a combination! Where is the coherence?
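The contrast between per-word popularity and overall coherence can be sketched with a toy example. Everything below is hypothetical: the popularity scores are made-up numbers, the topic labels are our own crude stand-in for coherence, and the brute-force search is only feasible because the example is tiny. It is an illustration of the idea, not a description of any existing system.

```python
from itertools import product

# Hypothetical candidate meanings with made-up popularity priors.
candidates = {
    "white queen": {"chess piece": 0.3, "TV series": 0.7},
    "Ana": {"friend's daughter": 0.1, "airline": 0.9},
    "O'Neil's": {"local pub": 0.2, "sports store": 0.8},
}

# Hypothetical topic per meaning, used as a crude coherence signal.
topic = {
    "chess piece": "evening with a friend",
    "friend's daughter": "evening with a friend",
    "local pub": "evening with a friend",
    "TV series": "television",
    "airline": "aviation",
    "sports store": "retail",
}

def popularity_only(cands):
    """Pick the most popular meaning for each expression independently."""
    return {expr: max(m, key=m.get) for expr, m in cands.items()}

def coherence(assignment):
    """Count the pairs of chosen meanings that share a topic."""
    meanings = list(assignment.values())
    return sum(
        topic[a] == topic[b]
        for i, a in enumerate(meanings)
        for b in meanings[i + 1:]
    )

def most_coherent(cands):
    """Try every combination and keep the one with the highest coherence."""
    exprs = list(cands)
    best = max(
        product(*(cands[e] for e in exprs)),
        key=lambda combo: coherence(dict(zip(exprs, combo))),
    )
    return dict(zip(exprs, best))

print(popularity_only(candidates))
print(most_coherent(candidates))
```

In this toy setting the popularity-only reading selects the TV series, the airline and the sports store, which share no topic, while the coherence-driven reading recovers the chess piece, the friend's daughter and the local pub.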
Understanding of language is about the Long Tail
with many, many small data niches
“Today, a 6 year old has seen less data and read less
language than most machines, but still these machines make
mistakes that the 6 year old will never make.”
How to create semantic tasks/challenges:
● that force systems to use more
‘intelligence’,
● to understand small data and its details,
● without knowing in advance what these
details are.
Looking at the Long Tail
Practicalities
Schedule overview
09:30-10:05 Welcome & Introduction
10:05-12:25 Invited talks
12:25-13:25 Lunch
13:25-14:05 Keynote
14:05-17:45 Practical session
17:45-18:00 Wrap-up
18:00- Drinks
Food & Beverages
● Coffee
○ Two official coffee breaks (one in the morning + one in the afternoon)
○ Coffee machines available in the hall at any time
● Lunch
○ Will be brought to the forum
● Drinks
○ After the workshop
○ At a nice place near the VU (follow us :-) )
the Long Tail
Dinos born in different generations live in different worlds, i.e. what they know
about the world depends on the time they live in.
[Figure: World(t), the world at time t, shown along a time axis t]
Every generation of dinos has its own Ronaldo
Datasets as Time-bound World Proxies
Big Data can be a bad Proxy
The datasets fail to represent everything in the world,
and over-represent some things.
The Long Tail Phenomenon of Disambiguation (1)
In theory, any lexical expression can refer to any meaning of the world at that time
(and vice versa).
People may be able to use the full range of expressions and interpretations of a
language, but they are not aware of it and only use a specific set in relation to a
real-life situation.
This balances the trade-off between:
● using many different expressions,
● resolving extreme ambiguity of a small set of expressions, and
● contextual competition (are you stating the obvious?)
The Long Tail Phenomenon of Disambiguation (2)
The distributions of the lexical expressions and of the denoted meanings in evaluation
datasets both follow Zipf’s Law.
But physically there is no Zipfian distribution among the meanings.
Each of these worlds contains a set of unique concepts and instances,
each existing physically exactly once.
At the instance level, any entity, concept, or event is as prominent as any
other.
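To make the contrast concrete, the sketch below compares an idealised Zipfian distribution of mentions with the instance-level view in which each meaning exists exactly once. The exponent, the number of meanings and the size of the "head" are assumptions chosen for illustration; they are not taken from the datasets discussed here.

```python
def zipf_shares(n, s=1.0):
    """Share of mentions per rank under an idealised Zipf law (freq ~ 1/rank**s)."""
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def uniform_shares(n):
    """Instance-level view: each meaning exists once, so all shares are equal."""
    return [1.0 / n] * n

n_meanings = 1000  # hypothetical number of distinct meanings
mentions = zipf_shares(n_meanings)
instances = uniform_shares(n_meanings)

head = 10  # the ten most frequently mentioned meanings
print(f"Top {head} share of mentions:  {sum(mentions[:head]):.1%}")
print(f"Top {head} share of instances: {sum(instances[:head]):.1%}")
```

Under these assumed parameters the ten head meanings receive roughly 39% of all mentions while making up only 1% of the instances.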
Tendencies between the long tails of expressions
and meanings
Dataset Analysis
Our analysis shows that the existing disambiguation datasets exhibit:
● Low ambiguity (~1.0)
● Low variance (~1.0)
● High dominance
● Temporal bias towards data from 1961 or the 1990s
Semantic overfitting: what ‘world’ do we consider when evaluating disambiguation of text? (Under review)
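The slide does not define these properties, so the sketch below uses one plausible reading as an assumption: ambiguity as the average number of distinct meanings observed per expression, variance as the average number of distinct expressions observed per meaning, and dominance as the average share of an expression's occurrences taken by its most frequent meaning. The function name and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict

def dataset_profile(annotations):
    """Profile a gold dataset given as (expression, meaning) pairs.

    The three definitions are assumptions made for this sketch.
    """
    meanings_of = defaultdict(Counter)     # expression -> Counter of meanings
    expressions_of = defaultdict(Counter)  # meaning -> Counter of expressions
    for expression, meaning in annotations:
        meanings_of[expression][meaning] += 1
        expressions_of[meaning][expression] += 1

    # Average number of distinct meanings per expression.
    ambiguity = sum(len(c) for c in meanings_of.values()) / len(meanings_of)
    # Average number of distinct expressions per meaning.
    variance = sum(len(c) for c in expressions_of.values()) / len(expressions_of)
    # Average share of the most frequent meaning per expression.
    dominance = sum(
        max(c.values()) / sum(c.values()) for c in meanings_of.values()
    ) / len(meanings_of)
    return {"ambiguity": ambiguity, "variance": variance, "dominance": dominance}

# Made-up annotations, only to exercise the function.
toy = [
    ("Ronaldo", "Cristiano Ronaldo"),
    ("Ronaldo", "Cristiano Ronaldo"),
    ("Ronaldo", "Cristiano Ronaldo"),
    ("Ronaldo", "Ronaldo Nazário"),
    ("CR7", "Cristiano Ronaldo"),
    ("Amsterdam", "Amsterdam (city)"),
]

profile = dataset_profile(toy)
print(profile)
```

Values of ambiguity and variance close to 1.0, combined with dominance close to 1.0, would mean that a system can score well by memorising a single meaning per expression.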
Datasets: How old is the data?
Semantic overfitting: what ‘world’ do we consider when evaluating disambiguation of text? (Under review)
Datasets: How dominant is the data?
Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo and Joerg Waitelonis
(2016). Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better
Job. In Proceedings of LREC 2016.
Problem Statement: The Long Tail DeTail
Disambiguation datasets contain a lot from the “head”, and only
accidental details from the “tail”.
Hence, our machines are very good at identifying popular objects (the
“head”), whereas their performance is extremely low on unpopular objects
(the “tail”).
This is further complicated by the fact that popularity itself is determined by
context: topic, time, location, community.
Systems: Our analysis
For WSD (word sense disambiguation), we tried to improve the disambiguation of the less frequent senses.
Outcome: better performance on the tail causes worse performance on the head!
Marten Postma, Ruben Izquierdo, Eneko Agirre, German Rigau, Piek Vossen (2016).
Addressing the MFS Bias in WSD systems. In Proceedings of LREC 2016.
Goal(s) of the workshop
We aim to propose this task as a “Long Tail Shared Disambiguation Task” to the
next call for SemEval-2018 tasks, which is expected late 2016/early 2017.
In addition, we plan to propose a workshop for ACL 2017.
We want systems to resolve extreme ambiguity within a specific context:
● Discriminate one context from another
● Use more semantics: coherence, logic, inferencing, comprehensive
● Combine semantic layers and subtasks
● Use the complete document and more (external knowledge)
● Answer more complex questions, e.g. quantification and identity
● Be smarter than a 6-year-old
● Explain why something is an answer
We do not want: yet another domain task
Looking at the Long Tail: What can be done?
● Datasets
● Resources
● Evaluation
● Systems
#1 Datasets
What kind of datasets are needed for the long tail disambiguation task?
● Properties
● Multi-task
● Are current ones sufficient?
● Optimal acquisition methods
#1 Datasets: Property-driven data
MEANTIME -> Multi-task corpus
Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen and Chantal van
Son (2016). MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of LREC 2016.
DutchSemCor -> Balanced WSD corpus
Piek Vossen, Rubén Izquierdo, and Attila Görög (2013). DutchSemCor: in quest of the ideal sense-tagged corpus. In Proceedings of RANLP 2013.
ECB+ -> Increased ambiguity for event coreference
Agata Cybulska and Piek Vossen (2014). Using a sledgehammer to crack a nut? Lexical diversity and event coreference
resolution. In Proceedings of LREC 2014.
#2 Resources
What kind of knowledge is needed for the long tail disambiguation task?
● Long tail knowledge bases
● Contextual knowledge bases
● Locating appropriate knowledge
#3 Evaluation
How should we evaluate the long tail disambiguation task(s)?
● Optimal evaluation metric(s)
● Generalizability over disambiguation tasks
● Incentivizing context- and long tail-aware systems
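One way to see why the choice of metric matters for the tail is to compare micro-averaged accuracy (every occurrence counts equally) with macro-averaged accuracy (every meaning counts equally). The sketch below is an illustration using these two standard averages, not a metric proposed in these slides. The function names and the toy labels are hypothetical.

```python
from collections import defaultdict

def micro_accuracy(gold, predicted):
    """Fraction of occurrences labelled correctly."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def macro_accuracy(gold, predicted):
    """Accuracy computed per gold meaning, then averaged over meanings."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        correct[g] += int(g == p)
    return sum(correct[m] / total[m] for m in total) / len(total)

# Toy data: nine occurrences of a popular meaning, one of a rare meaning.
gold = ["head"] * 9 + ["tail"]
# A baseline that always predicts the popular meaning.
always_head = ["head"] * 10

print(f"micro: {micro_accuracy(gold, always_head):.2f}")
print(f"macro: {macro_accuracy(gold, always_head):.2f}")
```

On this toy data the always-popular baseline scores 0.90 under micro-averaging but only 0.50 under macro-averaging, so a per-meaning average gives no credit for ignoring the tail.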
#4 Systems
What are the requirements for a system to perform well on the long tail
disambiguation task(s)?
● Existing systems
● Multi-task approach
● Long tail performance
● Sustainable systems
Thanks!

2nd Spinoza workshop: Looking at the Long Tail - introductory slides

  • 1. Looking at the Long Tail 2nd Spinoza Workshop #SpinozaLongTail
  • 2. It is time to move from “Big Data” to “small data”
  • 3. Language as a system and language use
  • 4.
  • 5. ● The world changes rapidly (Brexit) ● The language system changes slowly (Nexit) ● The relation between the two changes constantly
  • 6. Small data example Imagine you visit a dear friend for a game of chess. During the game you complain about your white queen being captured too early. While chatting, your friend tells you that his Ana is now already 13 years old, and is beginning with high school in September. After the game, you offer to buy him a beer in O’Neil’s, which is just a 2-3 minutes walk.
  • 7. Small data example Imagine [2] you visit [11] a dear [8] friend [5] for a game [14] of chess [2]. During the game [14] you complain [2] about your white [25] queen [12] being captured [11] too early. While chatting [4], your friend [5] tells [9] you that his Ana [42] is now already 13 years [4] old [9], and is beginning [11] with [high school [1]] in September. After the game [14], you offer [16] to buy [6] him a beer [1] in O’Neil’s[5], which is just a 2-3 minutes [8] walk [17]. 2*11*8*5*14*2*14*2*25*12*11*4*5*9*42*4*9*11*1*14*16*6*1*5*8*17 = 2,1185E36 possible interpretations and still missing some
  • 8. If a machine joined the conversation, what would it understand? Probably, it would think that you talk about: capturing “The White Queen” TV-series, the ANA airways based in Japan being 13 years old, and the sport equipment store O’Neil where you apparently can get a beer (or the recently retired famous basketball player Shaq O’Neil). An interpretation that may make sense from a Big Data perspective but that does not make any sense as a combination! Where is the coherence?
  • 9. Understanding of language is about the Long Tail with many, many small data niches “Today, a 6 year old has seen less data and read less language than most machines, but still these machines make mistakes that the 6 year old will never make.”
  • 10. How to create semantic tasks/challenges: ● that force systems to use more ‘intelligence’, ● to understand small data and its details, ● without knowing in advance what these details are.
  • 11. Looking at the Long Tail
  • 12. Looking at the Long Tail Practicalities
  • 13. Schedule overview 09:30-10:05 Welcome & Introduction 10:05-12:25 Invited talks 12:25-13:25 Lunch 13:25-14:05 Keynote 14:05-17:45 Practical session 17:45-18:00 Wrap-up 18:00- Drinks
  • 14. Food & Beverages ● Coffee ○ Two official coffee breaks (one in the morning + one in the afternoon) ○ Coffee machines available in the hall at any time ● Lunch ○ Will be brought to the forum ● Drinks ○ After the workshop ○ In a nice place nearby the VU (follow us :-) )
  • 17. Dinos born in different generations live in different worlds, i.e. what they know about the world depends on the time they live in. t World(t)
  • 18. Every generation of dinos has its own Ronaldo t
  • 19. Datasets as Time-bound World Proxies t
  • 20. Big Data can be a bad Proxy t The datasets fail to represent everything in the world, and over-represent some things.
  • 21. The Long Tail Phenomenon of Disambiguation (1) In theory, any lexical expression can refer to any meaning of the world at that time (and vice versa). People may be able to use the full range of expressions and interpretation of a language. But, they are not aware of it and only use a specific set in relation to a real life situation. This balances the trade-off between: ● using many different expressions, and ● resolving extreme ambiguity of a small set of expressions ● contextual competition (are you stating the obvious)
  • 22. The Long Tail Phenomenon of Disambiguation (2) The distribution of the lexical expressions and the denoted meanings in evaluation datasets both follow Zipf’s Law. But physically there is no Zipfian distribution between the meanings. Any of these worlds contains a set of unique concepts and instances, each existing physically exactly once. On instance level, any entity, concept, or event is as prominent as any other.
  • 23. Tendencies between the long tails of expressions and meanings
  • 24. Dataset Analysis Our analysis shows that the existing disambiguation datasets exhibit: ● Low ambiguity (~1.0) ● Low variance (~1.0) ● High dominance ● Temporal bias towards data from 1961 or the 1990s. Semantic overfitting: what ‘world’ do we consider when evaluating disambiguation of text? (Under review)
  • 25. Datasets: How old is the data? Semantic overfitting: what ‘world’ do we consider when evaluating disambiguation of text? (Under review)
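To make the ambiguity and dominance properties concrete, here is a minimal sketch of how such statistics could be computed over a list of annotated mentions. The definitions used are assumptions chosen for illustration and may differ from those in the cited analysis: ambiguity is taken as the mean number of distinct meanings per expression, and dominance as the mean share of an expression's mentions that go to its most frequent meaning. The function name and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict


def ambiguity_and_dominance(annotations):
    """Compute (mean ambiguity, mean dominance) over (expression, meaning) pairs.

    Assumed definitions (illustrative only):
      ambiguity = distinct meanings observed per expression, averaged
      dominance = share of the most frequent meaning per expression, averaged
    """
    meanings_per_expr = defaultdict(Counter)
    for expression, meaning in annotations:
        meanings_per_expr[expression][meaning] += 1
    if not meanings_per_expr:
        raise ValueError("annotations must not be empty")

    n = len(meanings_per_expr)
    mean_ambiguity = sum(len(c) for c in meanings_per_expr.values()) / n
    mean_dominance = sum(
        max(c.values()) / sum(c.values()) for c in meanings_per_expr.values()
    ) / n
    return mean_ambiguity, mean_dominance


# Toy annotations: "Ronaldo" has two meanings, "Paris" only one.
toy = [
    ("Ronaldo", "Cristiano_Ronaldo"),
    ("Ronaldo", "Cristiano_Ronaldo"),
    ("Ronaldo", "Cristiano_Ronaldo"),
    ("Ronaldo", "Ronaldo_Nazario"),
    ("Paris", "Paris_France"),
    ("Paris", "Paris_France"),
]
ambiguity, dominance = ambiguity_and_dominance(toy)
print(ambiguity, dominance)
```

With these assumed definitions, a dataset whose ambiguity is close to 1.0 and whose dominance is high rarely requires a system to choose between competing meanings at all.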
  • 26. Datasets: How dominant is the data? Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo and Joerg Waitelonis (2016). Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job. In Proceedings of LREC 2016.
  • 27. Problem Statement: The Long Tail DeTail Disambiguation datasets contain a lot from the “head”, and only accidental details from the “tail”. Hence, our machines are very good at identifying popular objects (the “head”) whereas their performance is extremely low on unpopular objects (the “tail”). This is further complicated by the fact that the popularity is determined by the context: topic, time, location, community.
  • 28. Systems: Our analysis ● For WSD, we tried to improve the disambiguation of the less frequent senses. ● Outcome: better performance on the tail causes worse performance on the head! Marten Postma, Ruben Izquierdo, Eneko Agirre, German Rigau, Piek Vossen (2016). Addressing the MFS Bias in WSD systems. In Proceedings of LREC 2016.
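The head/tail trade-off can be made visible by scoring the two groups separately instead of reporting one overall accuracy. The sketch below is an assumption-laden illustration, not the evaluation used in the cited paper: "head" is taken to mean test instances whose gold sense is the most frequent sense (MFS) of the lemma in the training data, and "tail" means all other instances. The function name, sense labels, and data are hypothetical.

```python
from collections import Counter, defaultdict


def head_tail_accuracy(train, test_items):
    """Return accuracy on head and tail instances separately.

    train:      iterable of (lemma, sense) pairs
    test_items: iterable of (lemma, gold_sense, predicted_sense) triples
    An instance is "head" if its gold sense is the lemma's most frequent
    training sense, otherwise "tail" (assumed definition).
    """
    counts = defaultdict(Counter)
    for lemma, sense in train:
        counts[lemma][sense] += 1
    mfs = {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

    tallies = {"head": [0, 0], "tail": [0, 0]}  # [correct, total]
    for lemma, gold, predicted in test_items:
        bucket = "head" if mfs.get(lemma) == gold else "tail"
        tallies[bucket][0] += int(gold == predicted)
        tallies[bucket][1] += 1
    # None marks a bucket with no test instances.
    return {b: (c / t if t else None) for b, (c, t) in tallies.items()}


# Toy data: a system that always predicts the most frequent sense.
train = [("bank", "bank_finance")] * 3 + [("bank", "bank_river")]
test_items = [
    ("bank", "bank_finance", "bank_finance"),
    ("bank", "bank_finance", "bank_finance"),
    ("bank", "bank_river", "bank_finance"),
    ("bank", "bank_river", "bank_finance"),
]
scores = head_tail_accuracy(train, test_items)
print(scores)
```

In this toy case the overall accuracy is 0.5, which hides the fact that the always-MFS system gets every head instance right and every tail instance wrong.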
  • 29. Goal(s) of the workshop We aim to propose this task as a “Long Tail Shared Disambiguation Task” to the next call for SemEval-2018 tasks, which is expected late 2016/early 2017. In addition, we plan to propose a workshop for ACL 2017.
  • 30. Goal(s) of the workshop We want systems to resolve extreme ambiguity within a specific context: ● Discriminate one context from another ● Use more semantics: coherence, logic, inferencing, comprehensive ● Combine semantic layers and subtasks ● Use the complete document and more (external knowledge) ● Answer more complex questions, e.g. quantification and identity ● Be smarter than a 6 year old ● Explain why something is an answer. We do not want: yet another domain task
  • 31. Looking at the Long Tail: What can be done? ● Systems ● Resources ● Evaluation ● Datasets
  • 32. #1 Datasets What kind of datasets are needed for the long tail disambiguation task? ● Properties ● Multi-task ● Are current ones sufficient? ● Optimal acquisition methods
  • 33. #1 Datasets: Property-driven data ● MEANTIME -> Multi-task corpus. Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen and Chantal van Son (2016). MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of LREC 2016. ● DutchSemCor -> Balanced WSD corpus. Piek Vossen, Rubén Izquierdo, and Attila Görög (2013). DutchSemCor: in quest of the ideal sense-tagged corpus. RANLP. ● ECB+ -> Increased ambiguity for event coreference. Agata Cybulska and Piek Vossen (2014). Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In LREC 2014.
  • 34. #2 Resources What kind of knowledge is needed for the long tail disambiguation task? ● Long tail knowledge bases ● Contextual knowledge bases ● Locating appropriate knowledge
  • 35. #3 Evaluation How should we evaluate the long tail disambiguation task(s)? ● Optimal evaluation metric(s) ● Generalizability over disambiguation tasks ● Incentivizing context- and long tail-aware systems
  • 36. #4 Systems What are the requirements for a system to perform well on the long tail disambiguation task(s)? ● Existing systems ● Multi-task approach ● Long tail performance ● Sustainable systems