DutchSemCor:
in Quest of the Ideal
Sense Tagged Corpus
Piek Vossen piek.vossen@vu.nl
Rubén Izquierdo ruben.izquierdobevia@vu.nl
Attila Görög a.gorog@vu.nl
Outline
 Main goal of our project
 WSD and annotated corpus
 Our approach
 Balanced-sense corpus and evaluation
 Balanced-context corpus and evaluation
 Sense distributions, all words corpus and evaluation
 Numbers…
Main goal of DSC
 Deliver a Dutch corpus enriched with semantic
information:
 Senses of the most frequent and most polysemous words
 Domains
 Named Entities linked with Wikipedia
 1 million sense tagged tokens:
 250K tagged manually by 2 annotators
 750K tagged by 1 annotator / automatically through Active
Learning
Current WSD
 Insights on Word Sense Disambiguation (WSD)
1. Evaluation tasks depend on the corpus / lexicon
 Results seem to depend more on the evaluation data than on the WSD systems
 Are the evaluation corpora diverse enough?
2. The most frequent sense from SemCor is difficult to beat
 Are evaluation tasks neglecting low-frequency senses?
3. Predominant senses in specific domains give the best results
4. Supervised systems beat unsupervised systems
 Which are the best corpora for WSD?
 What should the ideal corpus for WSD look like? (our question)
 Define criteria for the ideal sense-tagged corpus
 Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria (with as little manual effort as possible)
Criteria for a corpus
 A good corpus for WSD should:
 Be balanced for different senses
 Equal number of examples for each meaning
 Be balanced for different contexts
 Different usages of the words
 Provide information on sense frequencies (across domains and genres)
 Frequency of each meaning of a word in a representative corpus
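The first criterion can be checked on a candidate corpus by comparing the per-sense example counts of a lemma with a uniform distribution. The sketch below uses normalized entropy for this. The measure, the function name `sense_balance`, and the toy sense labels are assumptions chosen for illustration; the slides do not prescribe a particular balance metric.

```python
import math
from collections import Counter

def sense_balance(sense_labels):
    """Normalized entropy of the observed sense counts for one lemma.

    1.0 means every observed sense has the same number of examples,
    values near 0.0 mean one sense dominates.
    """
    counts = Counter(sense_labels)
    if len(counts) < 2:
        return 0.0  # a single observed sense cannot be "balanced"
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

# Hypothetical sense labels for one lemma.
balanced = ["bank_1"] * 25 + ["bank_2"] * 25
skewed = ["bank_1"] * 45 + ["bank_2"] * 5

print(round(sense_balance(balanced), 3))
print(round(sense_balance(skewed), 3))
```

Note that this only looks at senses that actually occur in the annotations; a sense listed in the lexicon but missing from the corpus would have to be added with a zero count to be penalized.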
Annotating a corpus
Sequential tagging → all-words corpus
 Whole text
 Reconsider meanings
 Small numbers of texts, genres, domains and senses
 Sense distributions
 SemCor
Targeted tagging → lexical-sample corpus (balanced sense, balanced context)
 KWIC (keyword in context)
 Repeated contexts
 Usually large number of contexts and senses
 Line-hard-serve
 DSO
Annotating a corpus
Corpus type        Sense distribution   Sense coverage   Context diversity
All words          ✔                    ✖                ✖
Balanced-sense     ✖                    ✔                ✖
Balanced-context   ✖                    ✖                ✔
Our main approach
1. Annotated corpus that represents ALL the meanings of an existing
lexicon
 Balanced sense
 Manual
2. Train WSD systems using the annotated corpus
 Will be trained for all the senses
3. Extend this annotated corpus to acquire a wider representation of
contexts
 Balanced-context
 Manual + WSD
4. Annotate the full raw corpus
 Sense distributions
 WSD
5. Evaluation of the annotations for the 3 criteria
Resources
 Cornetto database
 Lexical semantic database for Dutch
 Structure and content of WordNet (WN) + FrameNet-like data
 SoNaR (500M tokens)
 Dutch texts from a wide range of genres and topics
 34 categories (discussion lists, books, chats, autocues…)
 CGN (9M tokens)
 Transcribed spontaneous Dutch adult speech
 Internet
WSD systems
 DSC-timbl
 Memory-based learning classifier
 Supervised k-nearest neighbor
 DSC-SVM
 Linear classifier / Support Vector Machines
 Binary classifiers, one vs. all
 DSC-UKB
 Knowledge-based system
 Personalized PageRank algorithm
 Synsets → nodes, relations → edges
 Context words inject mass into word senses
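The knowledge-based idea (synsets as nodes, relations as edges, context words injecting mass) can be illustrated with a small personalized PageRank by power iteration. This is a minimal sketch, not the DSC-UKB implementation: the toy graph, the node names, the `ctx:` prefix for context-word nodes, and the damping value are all assumptions made for the example.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected graph as node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration where the teleport mass goes only to the seed nodes.

    seeds: node -> weight; every seed is assumed to be a node of the graph.
    """
    nodes = list(graph)
    total = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            share = damping * rank[n] / len(graph[n])
            for m in graph[n]:
                new_rank[m] += share
        rank = new_rank
    return rank

# Hypothetical mini-lexicon: two senses of "bank" and a few related synsets.
edges = [
    ("ctx:loan", "loan_1"), ("loan_1", "bank_finance"), ("loan_1", "money_1"),
    ("bank_finance", "money_1"), ("money_1", "entity_1"), ("entity_1", "water_1"),
    ("water_1", "river_1"), ("river_1", "bank_river"),
]
graph = build_graph(edges)

# The context word "loan" injects all the initial mass.
ranks = personalized_pagerank(graph, seeds={"ctx:loan": 1.0})
best = max(["bank_finance", "bank_river"], key=ranks.get)
print(best)
```

The sense of the target word with the highest rank is chosen; with "loan" as the only context word, the financial sense sits much closer to the injected mass than the river sense.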
Balanced-sense corpus
 2,870 most polysemous and frequent words (11,982 meanings, avg. polysemy 3)
 Annotated by student assistants over 2 years
 SAT tool and Web-snippets tool
 80% agreement, 25 examples per sense
 282,503 tokens double annotated
 80% of senses with more than 25 examples
 90% of lemmas with 25 examples for each sense
 Distribution by source: 67% SoNaR, 5% CGN, 28% web
WSD from balanced sense
 5-fold cross-validation (5-FCV) at sense level, with a focus on nouns
 Optimized for annotating SoNaR
 Specific features (word_id)
 Overall result for nouns → 82.76
 Results used to further annotate weakly performing senses
 Active Learning approach
 Select 82 lemmas performing under 80%
 3 rounds of annotation until reaching 81.62%
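The selection step of the active-learning loop (find the lemmas whose cross-validation accuracy is under 80%) can be sketched as follows. The input format, a list of `(lemma, gold_sense, predicted_sense)` triples collected over all folds, and the function name are assumptions for illustration; the slides do not describe the actual data structures.

```python
from collections import defaultdict

def lemmas_below_threshold(results, threshold=0.80):
    """Per-lemma accuracy from cross-validation output, plus the weak lemmas.

    results: iterable of (lemma, gold_sense, predicted_sense) triples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lemma, gold, predicted in results:
        total[lemma] += 1
        correct[lemma] += int(gold == predicted)
    accuracy = {lemma: correct[lemma] / total[lemma] for lemma in total}
    weak = sorted(lemma for lemma, acc in accuracy.items() if acc < threshold)
    return accuracy, weak

# Hypothetical cross-validation output for two lemmas.
results = [
    ("bank", "bank_1", "bank_1"), ("bank", "bank_2", "bank_1"),
    ("bank", "bank_1", "bank_1"), ("bank", "bank_2", "bank_2"),
    ("huis", "huis_1", "huis_1"), ("huis", "huis_1", "huis_1"),
    ("huis", "huis_2", "huis_2"), ("huis", "huis_1", "huis_1"),
    ("huis", "huis_2", "huis_2"),
]
accuracy, weak = lemmas_below_threshold(results)
print(weak)
```

The lemmas returned in `weak` are the ones that would receive additional manual annotation before the next round of training and evaluation.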
Balanced context
 Annotate the whole corpus → as many contexts as the whole corpus → requires a good WSD system → improve problematic cases
 Select all words performing under 80%
 Annotate the whole corpus with the Timbl-WSD system (optimized)
 50 new tokens, in different contexts, for the senses of words under 80%
 High confidence
 Low distance / High distance to the nearest neighbor
 Manually annotate these 50
 Completely different from the first phase, where annotators could choose
 Lemmatization errors, PoS errors, figurative, idiomatic and unknown senses
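The token selection for this phase can be sketched as a filter on classifier confidence followed by a split on nearest-neighbor distance. Everything specific here is an assumption for illustration: the tuple layout `(token_id, sense, confidence, distance)`, the confidence cut-off of 0.8, and the choice to take the `n` closest and `n` farthest tokens per sense.

```python
from collections import defaultdict

def select_candidates(tagged, n=50, min_confidence=0.8):
    """Pick high-confidence tokens per sense, split into low and high distance.

    tagged: iterable of (token_id, sense, confidence, distance) tuples, where
    distance is the distance to the nearest training neighbour.
    """
    by_sense = defaultdict(list)
    for token_id, sense, confidence, distance in tagged:
        if confidence >= min_confidence:
            by_sense[sense].append((distance, token_id))
    selection = {}
    for sense, items in by_sense.items():
        items.sort()  # ascending distance
        low = [tid for _, tid in items[:n]]
        high = [tid for _, tid in items[-n:] if tid not in low]
        selection[sense] = {"low_distance": low, "high_distance": high}
    return selection

# Hypothetical automatically tagged tokens for one sense.
tagged = [
    ("t1", "bank_1", 0.90, 0.1),
    ("t2", "bank_1", 0.95, 0.7),
    ("t3", "bank_1", 0.50, 0.2),  # dropped: confidence too low
    ("t4", "bank_1", 0.85, 0.4),
    ("t5", "bank_1", 0.90, 0.9),
]
selection = select_candidates(tagged, n=2)
print(selection["bank_1"])
```

Low-distance tokens resemble the existing training examples, while high-distance tokens are the ones most likely to add new contexts.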
Evaluating the balanced-sense and new annotations
Type                              Accuracy   # examples
Balanced Sense (BS)               81.62      8641
BS + LowD                         78.81      13266
BS + LowD_agreed                  85.02      11405
BS + HighD                        76.24      19055
BS + HighD_agreed                 83.77      13359
BS + LowD_agreed + HighD_agreed   85.33      16123
• Timbl-DSC, 5-fold cross-validation (folds incremented with new data), 82 lemmas
• Better results when using agreed data
• High/Low distance does not make a big difference
Evaluation balanced-context
 5-fold cross-validation using the agreed new instances
 Majority voting performs best
System      Nouns   Verbs   Adjs
DSC-timbl   83.97   83.44   78.64
DSC-svm     82.69   84.93   79.03
DSC-ukb     73.04   55.84   56.36
Voting      88.65   87.60   83.06
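Majority voting over the three systems can be sketched as below. The slides do not say how ties are resolved, so the priority order used here when all systems disagree is an assumption, as are the function name and the sense identifiers.

```python
from collections import Counter

def majority_vote(predictions, priority=("DSC-svm", "DSC-timbl", "DSC-ukb")):
    """Return the sense predicted by most systems.

    predictions: system name -> predicted sense for one token.
    Ties are broken by the (assumed) priority order of the systems.
    """
    counts = Counter(predictions.values())
    best = max(counts.values())
    tied = {sense for sense, c in counts.items() if c == best}
    for system in priority:
        if predictions.get(system) in tied:
            return predictions[system]
    return next(iter(tied))

# Two systems agree: the shared sense wins.
agree = majority_vote({"DSC-timbl": "bank_1", "DSC-svm": "bank_2", "DSC-ukb": "bank_1"})
# All three disagree: fall back to the first system in the priority order.
split = majority_vote({"DSC-timbl": "bank_1", "DSC-svm": "bank_2", "DSC-ukb": "bank_3"})
print(agree, split)
```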
Evaluating representativeness
 Our manually annotated corpus is probably skewed towards balanced senses
 We need to test the performance of our WSD systems on the rest of SoNaR
 Random evaluation
 Ranges of accuracy (90-100, 80-90, 70-80, 60-70)
 5 nouns, 5 verbs and 3 adjectives per range → 52 lemmas
 100 tokens for each lemma, automatically tagged and manually validated
Evaluating representativeness
 Results lower than in previous evaluations
 Difference between an approach representing the lexicon (senses) and one representing the corpus
 Results comparable to state-of-the-art English Senseval/SemEval
System      Nouns   Verbs   Adjs
DSC-timbl   54.25   48.25   46.50
DSC-svm     64.10   52.20   52.00
DSC-ukb     49.37   44.15   38.13
Voting      60.70   53.95   50.83
Obtaining sense distributions
 Approach
 Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
 Assume that the automatic annotation still reflects the real distribution
 Evaluate this frequency distribution (Most Frequent Sense, MFS)
 How can this MFS approach be evaluated?
 Manual annotations
 25 examples per sense, no sense distribution
 Random evaluation corpus
 Only a small selection of words (52 lemmas)
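Deriving sense distributions and the most frequent sense from automatic annotations amounts to counting senses per lemma. The sketch below assumes the annotations are available as `(lemma, sense)` pairs; that format, the function names, and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict

def sense_distributions(annotations):
    """Relative frequency of each sense per lemma.

    annotations: iterable of (lemma, sense) pairs from automatic tagging.
    """
    counts = defaultdict(Counter)
    for lemma, sense in annotations:
        counts[lemma][sense] += 1
    return {
        lemma: {sense: c / sum(ctr.values()) for sense, c in ctr.items()}
        for lemma, ctr in counts.items()
    }

def most_frequent_sense(distributions):
    """Map each lemma to its sense with the highest relative frequency."""
    return {lemma: max(dist, key=dist.get) for lemma, dist in distributions.items()}

# Hypothetical automatic annotations.
annotations = [("bank", "bank_1")] * 3 + [("bank", "bank_2")] + [("huis", "huis_1")]
distributions = sense_distributions(annotations)
mfs = most_frequent_sense(distributions)
print(distributions["bank"], mfs)
```

The resulting `mfs` mapping can then be used as a baseline tagger and scored against a manually annotated corpus.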
Obtaining sense distributions
 An all-words corpus was created
 Completely independent texts from Lassy
 Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
 23,907 tokens, covering 1,527 of our set of lemmas (53%)
 Evaluation of
 3 WSD systems
 First-sense baseline according to Cornetto
 Random-sense baseline
 Most frequent sense
 Sense distributions obtained from automatic annotation
Obtaining sense distributions
 MFS in Dutch similar to English MFS
 MFS better than 1st and random sense baselines
 MFS automatically derived is a good predictor
System         Nouns   Verbs   Adjs
1st sense      53.17   32.84   52.17
Random sense   29.52   24.99   32.16
MFS            61.20   50.76   54.62
DSC-timbl      55.76   37.96   49.00
DSC-svm        64.58   45.81   55.70
DSC-ukb        56.81   31.37   35.93
Voting         66.09   45.68   52.24
Numbers of DSC
 Balanced-sense annotated corpus
 274,344 tokens
 2,874 lemmas
 Annotated by 2 annotators, 90% IAA
 Balanced-context annotated corpus
 132,666 tokens
 1,133 lemmas
 Manually annotated by 1 annotator, agreeing with WSD in 44% of cases
 Random evaluation corpus
 5,200 tokens
 52 lemmas
 All-words corpus
 23,907 tokens
 1,527 lemmas
 3 WSD systems for Dutch
 DSC-timbl
 DSC-svm
 DSC-ukb
 Automatic annotations by the 3 WSD systems
 Sense distributions
 48 million tokens with confidence
 … and more…
 800,000 semantic relations between senses extracted from manual annotations
 28,080 sense groups
 Improved version of Cornetto
 SAT annotation tool
 Web search tool
 Statistics on figurative, idiomatic and collocational usage of words
 …
Thanks for your attention

RANLP 2013: DutchSemcor in quest of the ideal corpus

  • 1. DutchSemCor: in Quest of the Ideal Sense Tagged Corpus Piek Vossen piek.vossen@vu.nl Rubén Izquierdo ruben.izquierdobevia@vu.nl Attila Görög a.gorog@vu.nl
  • 2. Outline  Main goal of our project  WSD and annotated corpus  Our approach  Balanced-sense corpus and evaluation  Balanced-context corpus and evaluation  Sense distributions, all words corpus and evaluation  Numbers… 1
  • 3. Main goal of DSC  Deliver a Dutch corpus enriched with semantic information:  Senses of the most frequent and most polysemous words  Domains  Named Entities linked with Wikipedia  1 million sense tagged tokens:  250K tagged manually by 2 annotators  750K tagged by 1 annotator / automatically through Active Learning 2
  • 4. Current WSD  Insights on Word Sense Disambiguation 1. Evaluation tasks depend on the corpus / lexicon  It seems that the results depend more on the evaluation data than on WSD systems  Are the evaluation corpora diverse enough? 2. Most frequent sense from SemCor difficult to beat  Are evaluation tasks neglecting low frequent senses? 3. Predominant senses in specific domains give the best results 4. Supervised systems beat unsupervised systems  Which are the best corpora for WSD  How should be the ideal corpus for WSD? (we)  Define criteria for the ideal sense-tagged corpus  Describe a novel approach for building a large scale sense tagged corpus for meet criteria (with as little manual effort as possible) 3
  • 5. Criteria for a corpus  A good corpus for WSD should:  Be balanced for different senses  Equal number of examples for each meaning  Be balanced for different contexts  Different usages of the words  Provide information on sense frequencies (across domains and genres)  Frequency of the words in a representative meaning 4
  • 6. Annotating a corpus Sequential Tagging All Words corpus Targeted tagging Lexical Sample Corpus Balanced sense Balanced context  Whole text  Reconsider meanings  KWIC  Repeated contexts  Small numbers of texts, genres, domains and senses  Sense distributions  SemCor  Usually large number of contexts and senses  Line-hard-serve  DSO 5
  • 7. Annotating a corpus
      Corpus type        Sense distribution   Sense coverage   Context diversity
      All words                  ✔                  ✖                 ✖
      Balanced-sense             ✖                  ✔                 ✖
      Balanced-context           ✖                  ✖                 ✔
  • 8. Our main approach
      1. Build an annotated corpus that represents ALL the meanings of an existing lexicon
          ▪ Balanced sense
          ▪ Manual
      2. Train WSD systems using the annotated corpus
          ▪ Will be trained for all the senses
      3. Extend this annotated corpus to acquire a wider representation of contexts
          ▪ Balanced context
          ▪ Manual + WSD
      4. Annotate the full raw corpus
          ▪ Sense distributions
          ▪ WSD
      5. Evaluate the annotations against the 3 criteria
  • 9. Resources
      ◦ Cornetto database
          ▪ Lexical semantic database for Dutch
          ▪ Structure and content of WN + FrameNet-like data
      ◦ SoNaR (500M tokens)
          ▪ Dutch texts from a wide range of genres and topics
          ▪ 34 categories (discussion lists, books, chats, autocues…)
      ◦ CGN (9M tokens)
          ▪ Transcribed spontaneous Dutch adult speech
      ◦ Internet
  • 10. WSD systems
      ◦ DSC-timbl
          ▪ Memory-based learning classifier
          ▪ Supervised k-nearest neighbor
      ◦ DSC-SVM
          ▪ Linear classifier / Support Vector Machines
          ▪ Binary classifiers, 1 vs all
      ◦ DSC-UKB
          ▪ Knowledge-based system
          ▪ Personalized PageRank algorithm
          ▪ Synsets → nodes, relations → edges
          ▪ Context words inject mass into word senses
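The graph-based idea behind DSC-UKB (synsets as nodes, relations as edges, context words injecting mass) can be illustrated with a small power-iteration sketch of Personalized PageRank. This is not the DSC-UKB implementation: the toy graph, the sense identifiers such as `bank#financial`, the damping factor and the iteration count are all assumptions made for illustration.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with a personalized teleport vector."""
    # Build an undirected adjacency list from (node, node) pairs.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    # Teleport mass is spread uniformly over the seed nodes (the context words).
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            new_rank[n] = (1 - damping) * teleport[n] + damping * incoming
        rank = new_rank
    return rank


# Hypothetical toy graph: context words link to their senses, senses link to
# related senses. Target word "bank" has two candidate senses.
edges = [
    ("geld", "geld#1"), ("rekening", "rekening#1"),
    ("geld#1", "bank#financial"), ("rekening#1", "bank#financial"),
    ("bank#financial", "instelling#1"), ("instelling#1", "gebouw#1"),
    ("gebouw#1", "meubel#1"), ("meubel#1", "bank#furniture"),
]
rank = personalized_pagerank(edges, seeds={"geld", "rekening"})
best = max(["bank#financial", "bank#furniture"], key=rank.get)
```

With the context words "geld" and "rekening" as seeds, the sense closest to them in the graph collects the most mass and is chosen for the target word.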
  • 11. Balanced-sense corpus
      ◦ 2870 most polysemous and frequent words (11982 meanings, avg polysemy 3)
      ◦ Student assistants, 2 years
      ◦ SAT tool and Web-snippets tool
      ◦ Targets: 80% agreement, 25 examples per sense
      ◦ 282,503 tokens double annotated
      ◦ 80% of senses with more than 25 examples
      ◦ 90% of lemmas with 25 examples for each sense
      ◦ Distribution: 67% SoNaR, 5% CGN, 28% web
  • 14. WSD from balanced sense
      ◦ 5-fold cross-validation (5-FCV) at sense level, with a focus on nouns
      ◦ Optimized for annotating SoNaR
          ▪ Specific features (word_id)
      ◦ Overall result for nouns: 82.76
      ◦ Results used to further annotate weakly performing senses
          ▪ Active Learning approach
          ▪ Select the 82 lemmas performing under 80%
          ▪ 3 rounds of annotation, until reaching 81.62%
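The selection step of this Active Learning loop (pick the lemmas whose cross-validation accuracy falls under 80%) could be sketched as follows. The input format, a list of `(lemma, gold_sense, predicted_sense)` triples, and the function names are assumptions for illustration; the slides do not describe how the project stored its results.

```python
from collections import defaultdict


def accuracy_per_lemma(results):
    """Compute accuracy (in %) per lemma from (lemma, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lemma, gold, predicted in results:
        total[lemma] += 1
        correct[lemma] += int(gold == predicted)
    return {lemma: 100.0 * correct[lemma] / total[lemma] for lemma in total}


def select_weak_lemmas(accuracy_by_lemma, threshold=80.0):
    """Return the lemmas scoring under the threshold, in alphabetical order."""
    return sorted(l for l, acc in accuracy_by_lemma.items() if acc < threshold)


# Hypothetical cross-validation output for two lemmas.
results = [
    ("bank", "bank#1", "bank#1"),
    ("bank", "bank#2", "bank#1"),
    ("huis", "huis#1", "huis#1"),
]
weak = select_weak_lemmas(accuracy_per_lemma(results))
```

The lemmas returned here would be the ones sent back to the annotators for another round.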
  • 17. Balanced context
      ◦ Try to annotate the whole corpus → as many contexts as the whole corpus → have a good WSD → improve problematic cases
      ◦ Select all words performing under 80%
      ◦ Annotate the whole corpus with the Timbl-wsd system (optimized)
      ◦ 50 new tokens, in different contexts, for the senses of words under 80%
          ▪ High confidence
          ▪ Low distance / high distance to the nearest neighbor
      ◦ Manually annotate these 50
          ▪ Completely different from the first phase, where annotators could choose
          ▪ Lemmatization errors, PoS errors, figurative and idiomatic uses, unknown senses
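A minimal sketch of the token selection described on this slide: keep automatically tagged tokens with high confidence, then take the ones closest to and farthest from their nearest neighbor in the training data. The record fields (`sense`, `confidence`, `distance`) and the confidence cut-off of 0.9 are assumptions; the slides do not give the actual thresholds.

```python
def pick_candidates(tagged, sense, n=50, min_confidence=0.9):
    """Return (low_distance, high_distance) candidate lists for one sense."""
    # Keep only confident automatic annotations of the requested sense.
    pool = [t for t in tagged
            if t["sense"] == sense and t["confidence"] >= min_confidence]
    by_distance = sorted(pool, key=lambda t: t["distance"])
    low = by_distance[:n]                         # closest to a known example
    high = by_distance[-n:][::-1] if n else []    # farthest from known examples
    return low, high


# Hypothetical automatically tagged tokens.
tagged = [
    {"id": 1, "sense": "a", "confidence": 0.95, "distance": 0.1},
    {"id": 2, "sense": "a", "confidence": 0.97, "distance": 0.8},
    {"id": 3, "sense": "a", "confidence": 0.50, "distance": 0.01},
    {"id": 4, "sense": "b", "confidence": 0.99, "distance": 0.3},
]
low, high = pick_candidates(tagged, "a", n=1)
```

When fewer than `2 * n` tokens pass the confidence filter, the two lists overlap, so a real pipeline would need to deduplicate them before sending tokens to the annotators.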
  • 18. Evaluating the Balanced-sense and new annotations
      Type                               Accuracy   # examples
      Balanced Sense (BS)                 81.62        8641
      BS + LowD                           78.81       13266
      BS + LowD_agreed                    85.02       11405
      BS + High                           76.24       19055
      BS + HighD_agreed                   83.77       13359
      BS + LowD_agreed + HighD_agreed     85.33       16123
      ◦ Timbl-DSC, 5-FCV (folds incremented with new data), 82 lemmas
      ◦ Better results when using agreed data
      ◦ High/low distance does not make a big difference
  • 19. Evaluation balanced-context
      ◦ 5-FCV using agreed new instances
      ◦ Best is majority voting
      System       Nouns   Verbs   Adjs
      DSC-timbl    83.97   83.44   78.64
      DSC-svm      82.69   84.93   79.03
      DSC-ukb      73.04   55.84   56.36
      Voting       88.65   87.60   83.06
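Majority voting over the three systems can be sketched as below. The slides do not say how disagreements between all three systems were resolved, so the tie-breaking rule here (fall back to a fixed system order) is an assumption, as are the function and key names.

```python
from collections import Counter


def majority_vote(predictions, fallback_order):
    """Pick the sense most systems agree on; break ties by system priority."""
    counts = Counter(predictions.values())
    best = max(counts.values())
    tied = {sense for sense, count in counts.items() if count == best}
    if len(tied) == 1:
        return next(iter(tied))
    # Assumed tie-break: trust the first system in fallback_order
    # whose prediction is among the tied senses.
    for system in fallback_order:
        if predictions.get(system) in tied:
            return predictions[system]
    raise ValueError("no system in fallback_order made a tied prediction")


order = ["svm", "timbl", "ukb"]  # hypothetical priority order
agreed = majority_vote({"timbl": "s1", "svm": "s2", "ukb": "s1"}, order)
split = majority_vote({"timbl": "s1", "svm": "s2", "ukb": "s3"}, order)
```

In the first call two of three systems agree, so their sense wins; in the second all three disagree and the assumed priority order decides.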
  • 20. Evaluating representativeness
      ◦ Our manually annotated corpus is probably skewed towards balanced sense
      ◦ We need to test the performance of our WSD on the rest of SoNaR
      ◦ Random evaluation
          ▪ Ranges of accuracy (90-100, 80-90, 70-80, 60-70)
          ▪ Per range: 5 nouns, 5 verbs and 3 adjectives, giving 52 lemmas in total
          ▪ 100 tokens for each lemma, automatically tagged and manually validated
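The lemma sampling for the random evaluation (4 accuracy ranges × 13 lemmas = 52) could look like the sketch below. It assumes each lemma comes with a part of speech and a previously measured accuracy, and that ranges are half-open with 100 included in the top range; neither detail is stated on the slide.

```python
import random

RANGES = [(90, 100), (80, 90), (70, 80), (60, 70)]
QUOTA = {"noun": 5, "verb": 5, "adj": 3}


def sample_lemmas(lemmas, seed=0):
    """Draw QUOTA lemmas per part of speech from every accuracy range.

    lemmas maps lemma -> (part_of_speech, accuracy in %).
    """
    rng = random.Random(seed)
    chosen = []
    for low, high in RANGES:
        for pos, quota in QUOTA.items():
            # Half-open range [low, high), with 100 counted in the top range.
            pool = sorted(
                lemma for lemma, (p, acc) in lemmas.items()
                if p == pos and (low <= acc < high or acc == high == 100)
            )
            chosen.extend(rng.sample(pool, quota))
    return chosen


# Hypothetical lemma inventory: 6 lemmas per part of speech and range.
lemmas = {
    f"{pos}{low}_{i}": (pos, low + 5)
    for low, _ in RANGES for pos in QUOTA for i in range(6)
}
sample = sample_lemmas(lemmas)
```

`random.Random.sample` raises `ValueError` when a pool holds fewer lemmas than the quota, which makes an under-populated range visible instead of silently shrinking the sample.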
  • 21. Evaluating representativeness
      ◦ Results lower than in previous evaluations
      ◦ Difference between an approach representing the lexicon (sense) and one representing the corpus
      ◦ Results comparable to state-of-the-art English Senseval/SemEval
      System       Nouns   Verbs   Adjs
      DSC-timbl    54.25   48.25   46.50
      DSC-svm      64.10   52.20   52.00
      DSC-ukb      49.37   44.15   38.13
      Voting       60.70   53.95   50.83
  • 22. Obtaining sense distributions
      ◦ Approach
          ▪ Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
          ▪ Assume that automatic annotation still reflects the real distribution
          ▪ Evaluate this frequency distribution (Most Frequent Sense, MFS)
      ◦ How can this MFS approach be evaluated?
          ▪ Manual annotations: 25 examples per sense, no sense distribution
          ▪ Random evaluation corpus: only a small selection of words (52 lemmas)
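Deriving sense distributions and a Most Frequent Sense per lemma from automatic annotations amounts to counting. A minimal sketch, assuming the annotations are available as `(lemma, sense)` pairs (the input format and sense identifiers are hypothetical):

```python
from collections import Counter, defaultdict


def sense_distributions(annotations):
    """Relative frequency of each sense per lemma, from (lemma, sense) pairs."""
    counts = defaultdict(Counter)
    for lemma, sense in annotations:
        counts[lemma][sense] += 1
    return {
        lemma: {s: c / sum(cnt.values()) for s, c in cnt.items()}
        for lemma, cnt in counts.items()
    }


def most_frequent_sense(distributions):
    """Most frequent sense per lemma; ties go to the alphabetically first sense."""
    return {
        lemma: max(sorted(dist), key=dist.get)
        for lemma, dist in distributions.items()
    }


# Hypothetical automatic annotations.
annotations = [
    ("bank", "bank#1"), ("bank", "bank#1"), ("bank", "bank#2"),
    ("huis", "huis#1"),
]
dist = sense_distributions(annotations)
mfs = most_frequent_sense(dist)
```

The resulting MFS table can then be used as a baseline tagger: every token of a lemma simply receives that lemma's most frequent sense.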
  • 23. Obtaining sense distributions
      ◦ An all-words corpus was created
          ▪ Completely independent texts from Lassy
          ▪ Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
          ▪ 23,907 tokens, covering 1,527 of our set of lemmas (53%)
      ◦ Evaluation of
          ▪ 3 WSD systems
          ▪ First-sense baseline according to Cornetto
          ▪ Random-sense baseline
          ▪ Most frequent sense: sense distributions obtained from automatic annotation
  • 24. Obtaining sense distributions
      ◦ MFS in Dutch is similar to English MFS
      ◦ MFS is better than the 1st-sense and random-sense baselines
      ◦ The automatically derived MFS is a good predictor
      System         Nouns   Verbs   Adjs
      1st sense      53.17   32.84   52.17
      Random sense   29.52   24.99   32.16
      MFS            61.20   50.76   54.62
      DSC-timbl      55.76   37.96   49.00
      DSC-svm        64.58   45.81   55.70
      DSC-ukb        56.81   31.37   35.93
      Voting         66.09   45.68   52.24
  • 25. Numbers of DSC
      ◦ Balanced-sense annotated corpus
          ▪ 274,344 tokens
          ▪ 2,874 lemmas
          ▪ Annotated by 2 annotators, 90% IAA
      ◦ Balanced-context annotated corpus
          ▪ 132,666 tokens
          ▪ 1,133 lemmas
          ▪ Manually annotated by 1 annotator, agreeing with WSD in 44% of cases
      ◦ Random evaluation corpus
          ▪ 5,200 tokens
          ▪ 52 lemmas
      ◦ All-words corpus
          ▪ 23,907 tokens
          ▪ 1,527 lemmas
      ◦ 3 WSD systems for Dutch
          ▪ DSC-timbl
          ▪ DSC-svm
          ▪ DSC-ukb
      ◦ Automatic annotations by the 3 WSD systems
          ▪ Sense distributions
          ▪ 48 million tokens with confidence
      ◦ … and more…
          ▪ 800,000 semantic relations between senses extracted from manual annotations
          ▪ 28,080 sense groups
          ▪ Improved version of Cornetto
          ▪ SAT annotation tool
          ▪ Web search tool
          ▪ Statistics on figurative, idiomatic and collocational usage of words
          ▪ …
  • 26. Piek Vossen piek.vossen@vu.nl Rubén Izquierdo ruben.izquierdobevia@vu.nl Attila Görög a.gorog@vu.nl Thanks for your attention