2018-06-08
Towards a quality assessment of web corpora
for language technology applications
Wiktor Strandqvist
RISE SICS East and Linköping University, Linköping,
Sweden
wikst813@student.liu.se
(co-authors: M. Santini, L. Lind, A. Jönsson)
Outline
• Introduction
• Purpose
• Domain-specific web corpora
• Two-step approach:
1. Extraction of term seeds from use cases, personas and scenarios
• Results
2. Bootstrapping and evaluation of domain-specific web corpora
• Results
• Conclusion and Future Work
Introduction
• Research exists to assess the representativeness of general purpose web corpora
by comparing them to traditional corpora.
• Our focus and purpose:
• Creation and evaluation of domain-specific web corpora to be used to build Language
Technology real-world applications.
• We propose a two-step approach:
1. Automatic extraction and evaluation of term seeds from use cases, personas and scenarios;
2. Creation and validation of specialized and domain-specific web corpora bootstrapped with
term seeds automatically extracted in step 1.
Step 1: Term extraction from use cases, personas and scenarios
• Why use use cases, personas and scenarios, when available?
• They are based on numerous interviews and observations of real situations;
• They are checked by domain experts who know how to correctly use terms in their own domain.
• Use cases, personas and scenarios are a good starting point for automating the manual process
(often arbitrary and tedious) of identifying term seeds to bootstrap domain-specific web corpora.
• We focus on the medical terms that occur in use cases, personas and scenarios written in
English for the E-care@home project.
• Challenge: building an accurate term extractor from relatively short texts (a few dozen pages)
Step 2: Corpus bootstrapping and evaluation
• Bootstrap a web corpus using term seeds automatically extracted from use cases,
personas and scenarios.
• Automatically evaluate the “quality” of the bootstrapped domain-specific web
corpora.
Open issues and proposed answers
Q1: What is meant by “quality” of a web corpus?
A1: Here “quality” means a high density of medical terms (lay or specialized) related to certain illnesses.
Q2: How can we assess the quality of a corpus automatically bootstrapped from the web?
A2: By using metrics that are well-established and easily replicable.
Q3: What if a bootstrapped web corpus contains documents that are NOT relevant to the target
domain?
A3: It depends. We can measure the domain-specificity of a corpus and assess whether it is
satisfactorily domain-specific or whether the corpus needs amending before being used.
Q4: Can we measure the domain-specificity of a corpus?
A4: Yes, we use word frequency lists (without stopwords) and apply some statistical measures; see Part
2 of this presentation.
Word frequency lists: a compact corpus representation
• Our assumptions:
• “Words are not selected at random” (Adam Kilgarriff)
• Word frequency lists (aka unigram lists) are a “compact representation of a corpus, lacking
much of the information in the corpus but small and easily tractable” (Adam Kilgarriff)
• We use frequency lists of content words (i.e. after stopword
removal) to evaluate the “quality” of the web corpora.
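A minimal sketch of how such a content-word frequency list could be built. The tokenizer, the tiny stopword set and the sample sentence are assumptions made for illustration; the slides do not say which tools or stopword list were used.

```python
import re
from collections import Counter

# Illustrative stopword set (an assumption; the actual list is not given in the slides).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "with"}

def content_word_frequencies(text, stopwords=STOPWORDS):
    """Return a Counter of lowercased content words, with stopwords removed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Made-up sample text, only to show the shape of the result.
sample = "The patient is treated with insulin, and the insulin dose is adjusted."
freqs = content_word_frequencies(sample)
print(freqs.most_common(3))
```

The resulting unigram list is the compact corpus representation that the measures in Part 2 operate on.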
Part 1:
Term-Extraction from Use Cases, Personas, Scenarios
• Term candidate extraction
• Part-of-speech tagging (Stanford tagger)
• Syntactic patterns
• Term validation
• Partial matching against a medical database (SNOMED CT)
• Ranking the terms based on DF/IDF
• Cutoff
• Seed generation
• Triples sampled from the same context
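The seed generation step could be sketched as follows. This is an illustration under stated assumptions: a "context" is taken to be one source document, the term lists are invented, and `sample_seed_triples` is a hypothetical helper, not the project's code.

```python
import random

def sample_seed_triples(terms_by_context, n_per_context, seed=0):
    """Sample triples of distinct terms; each triple comes from a single context."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    triples = []
    for context, terms in terms_by_context.items():
        unique = sorted(set(terms))
        if len(unique) < 3:
            continue  # not enough terms in this context to form a triple
        for _ in range(n_per_context):
            triples.append(tuple(rng.sample(unique, 3)))
    return triples

# Hypothetical validated terms, grouped by the document they were extracted from.
terms_by_context = {
    "persona_1": ["diabetes", "insulin", "blood pressure", "hypertension"],
    "scenario_1": ["heart failure", "pulse", "oxygen saturation"],
    "use_case_1": ["fall"],
}
triples = sample_seed_triples(terms_by_context, n_per_context=2)
for triple in triples:
    print(" ".join(triple))
```

Each printed line could then serve as one search query when bootstrapping the corpus.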
Part 1:
Term-Extraction results
• Term candidate extraction
• Extraction recall: 81%
• Term validation
• Precision: 34.2%
• Recall: 71%
• F1: 46.2%
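As a quick check, the reported F1 follows from the reported precision and recall, since F1 is their harmonic mean:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the term validation step, as reported above.
f1 = f1_score(0.342, 0.71)
print(f"{f1:.1%}")  # 46.2%
```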
Part 2:
Evaluating domain-specific web corpora
• In this part of the presentation:
• We show that a corpus bootstrapped with term seeds automatically extracted from use
cases, personas and scenarios (Auto corpus) has the same “quality” as a corpus bootstrapped
with hand-picked seeds (Gold corpus).
• We show that the Gold corpus and the Auto corpus have similar domain-specificity
(domainhood), and that neither shares any similarity with a general language web corpus such as
ukWaC.
The Web Corpora used in our experiments
• ukWaCsample (872 565 words): a random subset of ukWaC (general language
corpus)
• Gold (544 677 words): a web corpus collected with hand-picked seeds
• Auto (492 479 words): a web corpus collected with automatically extracted seeds
Plotting normalized frequencies (wpm)
• ukWaCsample (872 565 words), Gold (544 677 words), Auto (492 479 words)
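Because the three corpora differ in size, raw counts are not directly comparable. A small sketch of the words-per-million (wpm) normalization, using the corpus sizes above; the raw count of 500 is a made-up example value.

```python
def words_per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words (wpm)."""
    return count * 1_000_000 / corpus_size

# Corpus sizes as given on the slide.
SIZES = {"ukWaCsample": 872_565, "Gold": 544_677, "Auto": 492_479}

# The same hypothetical raw count yields a different wpm value in each corpus.
for name, size in SIZES.items():
    print(name, round(words_per_million(500, size), 1))
```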
Plotting ranks (top 1000 words)
• The ranks are based on the normalized frequencies (wpm)
Rank Correlation: Kendall
• Non-parametric Kendall Tau
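For reference, a plain-Python sketch of Kendall's tau (the tau-a variant, which ignores ties). It assumes the two corpora have already been reduced to rank lists over the same words; how words missing from one top-1000 list were handled is not stated in the slides. The rank values are invented.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / all pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two rank lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Invented ranks of five shared words in two corpora.
ranks_a = [1, 2, 3, 4, 5]
ranks_b = [1, 3, 2, 4, 5]
tau = kendall_tau_a(ranks_a, ranks_b)
print(tau)
```

A value near 1 means the two corpora rank their words in nearly the same order; a value near 0 means the rankings are unrelated.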
Rank Correlation: Spearman
• Non-parametric Spearman Rho
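Spearman's rho can be sketched in the same spirit. The closed form below is valid only when there are no tied ranks, and it again assumes both rank lists cover the same words; the ranks are invented.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(rank_x) != len(rank_y) or len(rank_x) < 2:
        raise ValueError("need two rank lists of equal length >= 2")
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Invented ranks of five shared words in two corpora.
ranks_a = [1, 2, 3, 4, 5]
ranks_b = [1, 3, 2, 4, 5]
rho = spearman_rho(ranks_a, ranks_b)
print(rho)
```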
Smoothing: 0.01
• We apply smoothing before calculating KL divergence and log-likelihood (LL-G2).
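Smoothing is needed because a word that is absent from one corpus has a zero count, which makes the log ratios in KL divergence and LL-G2 undefined. One plausible reading of "Smoothing: 0.01" is additive smoothing over the union vocabulary, sketched below; the exact scheme is an assumption, as the slide gives only the constant.

```python
def smooth_counts(counts_a, counts_b, alpha=0.01):
    """Add alpha to every count over the union vocabulary of both corpora."""
    vocab = sorted(set(counts_a) | set(counts_b))
    smoothed_a = {w: counts_a.get(w, 0) + alpha for w in vocab}
    smoothed_b = {w: counts_b.get(w, 0) + alpha for w in vocab}
    return smoothed_a, smoothed_b

# Invented counts: each word occurs in only one of the two corpora.
sm_a, sm_b = smooth_counts({"insulin": 3}, {"football": 2})
print(sm_a)
print(sm_b)
```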
KL divergence (aka relative entropy)
• R: entropy package, function KL.empirical()
• KL: ukWaCsample vs Gold = 7.544118
• KL: ukWaCsample vs Auto = 6.519677
• KL: Gold vs Auto = 1.843863
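The values above were computed in R. For readers without R, the following plain-Python sketch shows the same kind of calculation on two aligned lists of (smoothed) counts. It uses the natural logarithm, which is an assumption, since the slides do not state the log base; the counts are invented.

```python
import math

def kl_divergence(counts_p, counts_q):
    """KL(P || Q) in nats, from two aligned lists of positive counts."""
    total_p, total_q = sum(counts_p), sum(counts_q)
    kl = 0.0
    for cp, cq in zip(counts_p, counts_q):
        p, q = cp / total_p, cq / total_q
        if p > 0:  # terms with p == 0 contribute nothing
            kl += p * math.log(p / q)
    return kl

# Invented counts for a two-word vocabulary.
kl_pq = kl_divergence([9, 1], [5, 5])
kl_qp = kl_divergence([5, 5], [9, 1])
print(kl_pq, kl_qp)
```

KL divergence is zero for identical distributions, grows as they diverge, and is not symmetric, so the direction of the comparison matters.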
Log-likelihood (LL-G2)
• Corpus profiling: the larger the LL-G2 scores, the more significant the difference
between two corpora.
• The total LL-G2 scores for the three web corpora (top 1000-ranked words) are:
• LL-G2: ukWaCsample vs Gold = 453 441.6
• LL-G2: ukWaCsample vs Auto = 393 705.9
• LL-G2: Gold vs Auto = 114 694.2
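A sketch of the per-word log-likelihood calculation in its common two-corpus form, where expected frequencies are derived from the pooled frequency of the word. That this exact variant was used, and that the totals above are sums of per-word scores, are assumptions; the frequencies below are invented.

```python
import math

def log_likelihood_g2(a, b, c, d):
    """LL-G2 for one word.

    a, b: frequency of the word in corpus 1 and corpus 2
    c, d: total size of corpus 1 and corpus 2
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Invented frequencies: the word is ten times more frequent in corpus 1.
score = log_likelihood_g2(100, 10, 1000, 1000)
print(round(score, 2))
```

A word with the same relative frequency in both corpora scores 0; the score grows as the relative frequencies diverge.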
List of LL-G2 scores
From left to right: ukWaCsample vs Gold; ukWaCsample vs Auto; Gold vs Auto
For the individual LL scores, a G2 score of 3.8415 or higher is significant at the level
of p < 0.05 and a G2 score of 10.8276 or higher is significant at the level of p < 0.001
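The two thresholds are the critical values of the chi-square distribution with one degree of freedom. A short sketch that turns a G2 score into a p-value using only the standard library; the closed form below holds for one degree of freedom only.

```python
import math

def g2_p_value(g2):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2 / 2))

# The two thresholds quoted above.
for threshold in (3.8415, 10.8276):
    print(threshold, round(g2_p_value(threshold), 4))
```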
Discussion
• These simple measures based on word frequency lists give a clear indication of
the “quality” of a bootstrapped domain-specific web corpus:
• Rank correlation
• KL divergence
• Log-likelihood (LL-G2)
• These measures can be used to assess the corpus quality BEFORE the corpus is
used to build LT applications, thus avoiding bad surprises.
• If the values returned by the metrics are not satisfactory, a corpus can be
amended accordingly.
Conclusion and Future Work
• It is possible to create a fairly accurate term extractor from a relatively short text
written by domain experts.
• It is possible to assess the quality and domain-specificity of web corpora by using
well-established metrics.
• Future work: expanding the word frequency lists (to include bigrams and trigrams) and
identifying more metrics that can help in the evaluation of the quality of the
corpora, such as burstiness and perplexity.
Thank you for your attention!

More Related Content

What's hot

Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Databricks
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad DataSteffen Staab
 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contentsSteffen Staab
 
How to valuate and determine standard essential patents
How to valuate and determine standard essential patentsHow to valuate and determine standard essential patents
How to valuate and determine standard essential patentsMIPLM
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeDr. Haxel Consult
 
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2TigerGraph
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesJason Hattrick-Simpers
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...Angelo Salatino
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Databricks
 
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...Dr. Haxel Consult
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Futuredgarijo
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...MOVING Project
 
EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbenchEuropean Data Forum
 
Graph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community DetectionGraph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community DetectionTigerGraph
 
GLOBE Metadata Analysis
GLOBE Metadata AnalysisGLOBE Metadata Analysis
GLOBE Metadata AnalysisXavier Ochoa
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with ContextSteffen Staab
 
II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN Dr. Haxel Consult
 

What's hot (20)

Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad Data
 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
 
How to valuate and determine standard essential patents
How to valuate and determine standard essential patentsHow to valuate and determine standard essential patents
How to valuate and determine standard essential patents
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent Office
 
ML in materials discovery
ML in materials discovery ML in materials discovery
ML in materials discovery
 
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
Graph Gurus Episode 4: Detecting Fraud and Money Laudering in Real-Time Part 2
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
 
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
II-PIC 22017: IP Landscape Study – Unique Execution Approach – Actionable Int...
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Future
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
 
EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbench
 
Graph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community DetectionGraph Gurus Episode 6: Community Detection
Graph Gurus Episode 6: Community Detection
 
GLOBE Metadata Analysis
GLOBE Metadata AnalysisGLOBE Metadata Analysis
GLOBE Metadata Analysis
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
 
II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN II-PIC 201: Product Presentation CAS / STN
II-PIC 201: Product Presentation CAS / STN
 

Similar to Towards a Quality Assessment of Web Corpora for Language Technology Applications

ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...eswcsummerschool
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchDaniel Schneiter
 
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...lucenerevolution
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformLucidworks (Archived)
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Spark Summit
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Hady Elsahar
 
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...IOSR Journals
 
Production Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldProduction Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldSean Chittenden
 
How experiments drive product growth at Viki
How experiments drive product growth at VikiHow experiments drive product growth at Viki
How experiments drive product growth at Vikiishanagrawal90
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Baden Hughes
 
Incremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software EngineeringIncremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software EngineeringÁkos Horváth
 
Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...HPCC Systems
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014nkabra
 
Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?EDB
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformTrey Grainger
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewLucidworks
 

Similar to Towards a Quality Assessment of Web Corpora for Language Technology Applications (20)

ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
 
Made to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using ElasticsearchMade to Measure: Ranking Evaluation using Elasticsearch
Made to Measure: Ranking Evaluation using Elasticsearch
 
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
 
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
Web-Based System for Software Requirements Quality Analysis Using Case-Based ...
 
Production Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldProduction Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated World
 
Yuvaraj
YuvarajYuvaraj
Yuvaraj
 
How experiments drive product growth at Viki
How experiments drive product growth at VikiHow experiments drive product growth at Viki
How experiments drive product growth at Viki
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
Metadata Quality Evaluation: Experience from the Open Language Archives Commu...
 
Incremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software EngineeringIncremental Model Queries for Model-Dirven Software Engineering
Incremental Model Queries for Model-Dirven Software Engineering
 
Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
 

More from Marina Santini

An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesMarina Santini
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Marina Santini
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role LabelingMarina Santini
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational SemanticsMarina Santini
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Marina Santini
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Marina Santini
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Marina Santini
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part) Marina Santini
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationMarina Santini
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Marina Santini
 

More from Marina Santini (20)

An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Towards a Quality Assessment of Web Corpora for Language Technology Applications

  • 1. 2018-06-08 Towards a quality assessment of web corpora for language technology applications Wiktor Strandqvist RISE SICS East and Linköping University, Linköping, Sweden wikst813@student.liu.se (co-authors: M. Santini, L. Lind, A. Jönsson) Outline • Introduction • Purpose • Domain-specific web corpora • Two-step approach: 1. Extraction of term seeds from use cases, personas and scenarios • Results 2. Bootstrapping and evaluation of domain-specific web corpora • Results • Conclusion and Future Work
  • 2. Introduction • Research exists to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. • Our focus and purpose: • Creation and evaluation of domain-specific web corpora to be used to build real-world Language Technology applications. • We propose a two-step approach: 1. Automatic extraction and evaluation of term seeds from use cases, personas and scenarios; 2. Creation and validation of specialized and domain-specific web corpora bootstrapped with term seeds automatically extracted in step 1. Step 1: Term extraction from use cases, personas and scenarios • Why use use cases, personas and scenarios, when available? • They are based on numerous interviews and observations of real situations; • They are checked by domain experts who know how to correctly use terms in their own domain. • Use cases, personas and scenarios are a good starting point for automating the manual process (often arbitrary and tedious) of identifying term seeds to bootstrap domain-specific web corpora. • We focus on the medical terms that occur in use cases, personas and scenarios written in English for the E-care@home project. • Challenge: building an accurate term extractor from a relatively short text (a few dozen pages)
  • 3. Step 2: Corpus bootstrapping and evaluation • Bootstrap a web corpus using term seeds automatically extracted from use cases, personas and scenarios. • Automatically evaluate the “quality” of the bootstrapped domain-specific web corpora. Open issues and proposed answers Q1: What is meant by “quality” of a web corpus? A1: Here “quality” means high density of medical terms (lay or specialized) related to certain illnesses. Q2: How can we assess the quality of a corpus automatically bootstrapped from the web? A2: By using metrics that are well-established and easily replicable. Q3: What if a bootstrapped web corpus contains documents that are NOT relevant to the target domain? A3: It depends. We can measure the domain-specificity of a corpus and assess whether it is satisfactorily domain-specific or whether the corpus needs amending before being used. Q4: Can we measure the domain-specificity of a corpus? A4: Yes, we use word frequency lists (without stopwords) and apply some statistical measures, see part 2 of this presentation.
  • 4. Word frequency lists: a compact corpus representation • Our assumptions: • “Words are not selected at random” (Adam Kilgarriff) • Word frequency lists (aka unigram lists) are a “compact representation of a corpus, lacking much of the information in the corpus but small and easily tractable” (Adam Kilgarriff) • We use frequency lists of content words (i.e. after having applied stopword removal) to evaluate the “quality” of the web corpora. Part 1: Term extraction from use cases, personas and scenarios • Term candidate extraction • Part-of-speech tagging (Stanford tagger) • Syntactic patterns • Term validation • Partial matching against a medical database (SNOMED CT) • Ranking the terms based on DF/IDF • Cutoff • Seed generation • Triples sampled from the same context
  • 5. Part 1: Term extraction results • Term candidate extraction • Extraction recall: 81% • Term validation • Precision: 34.2% • Recall: 71% • F1: 46.2% Part 2: Evaluating domain-specific web corpora • In this part of the presentation: • We show that a corpus bootstrapped with term seeds automatically extracted from use cases, personas and scenarios (Auto corpus) has the same “quality” as a corpus bootstrapped with hand-picked seeds (Gold corpus). • We show that both the Gold corpus and the Auto corpus have similar domain-specificity (domainhood), and do not share any similarity with a general language web corpus, like ukWaC.
  • 6. The web corpora used in our experiments • ukWaCsample (872 565 words): a random subset of ukWaC (general language corpus) • Gold (544 677 words): a web corpus collected with hand-picked seeds • Auto (492 479 words): a web corpus collected with automatically extracted seeds Plotting normalized frequencies (wpm) • ukWaCsample (872 565 words), Gold (544 677 words), Auto (492 479 words)
  • 7. Plotting ranks (top 1000 words) • The ranks are based on the normalized frequencies (wpm) Rank correlation: Kendall • Non-parametric Kendall Tau
  • 8. Rank correlation: Spearman • Non-parametric Spearman Rho Smoothing: 0.01 • We apply smoothing before calculating KL divergence and log-likelihood (LL-G2).
  • 9. KL divergence (aka relative entropy) • R: entropy package, function KL.empirical() • KL: ukWaCsample vs Gold = 7.544118 • KL: ukWaCsample vs Auto = 6.519677 • KL: Gold vs Auto = 1.843863 Log-likelihood (LL-G2) • Corpus profiling: the larger the LL-G2 scores, the more significant the difference between two corpora. • The total LL-G2 scores for the three web corpora (top 1000-ranked words) are • LL-G2: ukWaCsample vs Gold = 453 441.6 • LL-G2: ukWaCsample vs Auto = 393 705.9 • LL-G2: Gold vs Auto = 114 694.2
  • 10. List of LL-G2 scores From left to right: ukWaCsample vs Gold; ukWaCsample vs Auto; Gold vs Auto. For the individual LL scores, a G2 score of 3.8415 or higher is significant at the level of p < 0.05 and a G2 score of 10.8276 or higher is significant at the level of p < 0.001. Discussion • These simple measures based on word frequency lists give a clear indication of the “quality” of a bootstrapped domain-specific web corpus: • Rank correlation • KL divergence • Log-likelihood (LL-G2) • These measures can be used to assess the corpus quality BEFORE the corpus is used to build LT applications, thus avoiding bad surprises. • If the values returned by the metrics are not satisfactory, a corpus can be amended accordingly.
  • 11. Conclusion and Future Work • It is possible to create a fairly accurate term extractor from a relatively short text written by domain experts. • It is possible to assess the quality and domain-specificity of web corpora by using well-established metrics. • Future work: expanding word frequency lists (including bigrams and trigrams) & identifying more metrics that can help in the evaluation of the quality of the corpora, such as burstiness and perplexity. Thank you for your attention!
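
The measures named in slides 6 to 10 (normalized frequencies in wpm, rank correlation, smoothed KL divergence and LL-G2) can be illustrated with a minimal Python sketch. This is not the authors' code: the slides report using R's entropy package, and the details below are assumptions made for illustration. In particular, the additive form of the 0.01 smoothing, the natural logarithm, the tie-free Kendall tau-a, and the two-corpus expected-frequency form of the log-likelihood statistic are choices of this sketch, and all function names and toy counts are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def wpm(counts):
    """Normalize raw counts to words per million (wpm)."""
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}


def smoothed_probs(counts, vocab, smoothing=0.01):
    """Assumed additive smoothing: add a constant to every count, then normalize."""
    total = sum(counts.get(w, 0) + smoothing for w in vocab)
    return {w: (counts.get(w, 0) + smoothing) / total for w in vocab}


def kl_divergence(p_counts, q_counts, smoothing=0.01):
    """KL(P || Q) in nats over the union vocabulary of both frequency lists."""
    vocab = sorted(set(p_counts) | set(q_counts))
    p = smoothed_probs(p_counts, vocab, smoothing)
    q = smoothed_probs(q_counts, vocab, smoothing)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)


def ll_g2(a, b, size1, size2):
    """Log-likelihood (G2) for one word: a and b are its counts in two corpora
    of size1 and size2 tokens (assumed two-corpus expected-frequency form)."""
    e1 = size1 * (a + b) / (size1 + size2)  # expected count in corpus 1
    e2 = size2 * (a + b) / (size1 + size2)  # expected count in corpus 2
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


def kendall_tau(xs, ys):
    """Kendall tau-a for two equally long rank lists (no correction for ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical toy frequency lists, not data from the presentation.
gold = Counter({"patient": 50, "insulin": 30, "glucose": 20})
general = Counter({"patient": 5, "football": 60, "weather": 35})

print(wpm(gold)["patient"])          # normalized frequency of one word
print(kl_divergence(gold, general))  # larger value = more dissimilar lists
print(ll_g2(100, 50, 1000, 1000))    # roughly 16.99, above the 10.8276 threshold
```

With real data, the counters would be built from the stopword-filtered frequency lists of the corpora, and the per-word LL-G2 scores would be summed over the top-ranked words to obtain a total of the kind reported on slide 9.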