SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Can we Quantify Domainhood?
Exploring Measures to Assess Domain-Specificity in Web Corpora
Marina Santini
marina.santini@ri.se
RISE Research Institutes of Sweden
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden TIR 2018, Regensburg, Germany, 4 Sept. 2018
Acknowledgement and Citation
Acknowledgement
This research was supported by E-care@home, a “SIDUS - Strong Distributed
Research Environment” project, funded by the Swedish Knowledge Foundation.
Project websit: <<http://ecareathome.se/>>.
Cite the paper as:
Santini M., Strandqvist W., Nyström M., Alirezai M., Jönsson A. (2018) Can We
Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web
Corpora. In: Elloumi M. et al. (eds) Database and Expert Systems Applications. DEXA
2018. Communications in Computer and Information Science, vol 903. Springer, Cham
DOI https://doi.org/10.1007/978-3-319-99133-7_17
Presented at : TIR 2018: 15th International Workshop on Technologies for Information
Retrieval. In conjunction with DEXA 2018. Regensburg, Germany, 4 Sept. 2018.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 2TIR 2018, Regensburg, Germany, 4 Sept. 2018
Outline
1. Research questions: evaluating specialized web corpora in
terms of ”domainhood”
2. Case study: a web corpus for eCare
3. Methodology: how to measure domainhood
4. Conclusion & Future work
TIR 2018, Regensburg, Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 3
1. Evaluating specialized
web corpora in terms of
”domainhood”
Introduction and Research Questions
TIR 2018, Regensburg,
Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 4
Web Corpora
• Web corpora are important
• The evaluation of web corpora is important
• The evaluation of general-purpose web corpora is advanced
• The evaluation of specialized web corpora is less advanced
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 5TIR 2018, Regensburg, Germany, 4 Sept. 2018
Quantitative Corpus Evaluation
“When will a grammar based on one corpus be valid for another?
How much will it cost to port a Natural Language Processing (NLP)
application from one domain, with one corpus, to another, with
another?”
Adam Kilgarriff (2001) Comparing corpora. Int. J. Corpus Linguist. 6(1), 97–133
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 6TIR 2018, Regensburg, Germany, 4 Sept. 2018
Definition: domainhood
• Domainhood is the degree of domain representativeness or
domain specificity of a web corpus.
• Ex: a high frequency of medical terms is a sign that the corpus is a
specialized medical corpus
• The importance of domain granularity
• Coarse domains vs fine-grained domains
• Lippincott et al. (2011) “while variation at a coarser domain level such as
between newswire and biomedical text is well-studied and known to affect the
portability of NLP systems, there is a need to develop an awareness of
subdomain variation when considering the practical use of language processing
applications […]”.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 7TIR 2018, Regensburg, Germany, 4 Sept. 2018
Research Questions:
Quantifying domainhood
• ”is it possible to automatically quantify the domainhood of a web
corpus regardless its domain granularity? If so, how?”
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 8TIR 2018, Regensburg, Germany, 4 Sept. 2018
2. Case Study: a Web
Corpus for eCare
TIR 2018, Regensburg,
Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 9
eCare_sv_01
• eCare_sv_01*
• 155 SNOMED CT terms (chronic diseases)
* Santini M., Jönsson A., Nystrom M. and Alirezai M. (2017) "A Web Corpus for eCare: Collection, Lay
Annotation and Learning. First Results". Proceedings of LTA'17, FedCSIS 2017, Prague.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 10TIR 2018, Regensburg, Germany, 4 Sept. 2018
3. Methodology: How to
Measure Domainhood
Which measures?
TIR 2018, Regensburg,
Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 11
SUC & eCare_sv_01
Stockholm-Umeå Corpus (SUC) -> reference corpus (1 million words)
eCare_sv_01: domain-specific corpus (approx. 700 000 words)
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 12TIR 2018, Regensburg, Germany, 4 Sept. 2018
Metrics
1.Mann-Withney-Wilcoxon Test
2.Kendall correlation coefficient (τ )
3.Kullback–Leibler (KL) divergence
4.Log-likelihood
5.Burstiness
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 13TIR 2018, Regensburg, Germany, 4 Sept. 2018
Gold Standard
Tokenized gold standard (165 unigrams)
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 14TIR 2018, Regensburg, Germany, 4 Sept. 2018
Seeds (example):
atrofisk faryngit
atrofisk gastrit
Gold Standard (example)
atrofisk
faryngit
gastrit
Word Frequency Lists
“A word frequency list is a “compact representation of a corpus,
lacking much of the information in the corpus but small and
easily tractable.”
Adam Kilgarriff (2010). Comparable corpora within and across languages, word frequency lists and
the KELLY project. In: Proceedings of the 3rd Workshop on Building and Using Comparable Corpora.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 15TIR 2018, Regensburg, Germany, 4 Sept. 2018
Ranked Word Frequencies: Scatter Plot
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 16TIR 2018, Regensburg, Germany, 4 Sept. 2018
Mann-Withney-Wilcoxon Test: Theory
Non-parametric test:
Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to
follow the normal distribution.
If the two distributions are dissimilar at .05 significance level, we can
conclude that SUC and eCare come from different populations.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 17TIR 2018, Regensburg, Germany, 4 Sept. 2018
Mann-Withney-Wilcoxon Test: Results
• The null hypothesis is that SUC's word frequency list and
eCare_sv_01 word frequency list come from identical populations.
• To test the hypothesis, we apply the wilcox.test() [R function] to
compare the corpora.
• The p-value turns out to be 0.019, and is less than the .05
significance level, we reject the null hypothesis.
• Conclusion: at .05 significance level, we conclude that SUC and eCare
belong to non-identical populations.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 18TIR 2018, Regensburg, Germany, 4 Sept. 2018
Kendall correlation coefficient: Theory
Kendall correlation coefficient (tau) is a non-parametric measure of
correlation between two rankings.
tau is a probability value which indicates the difference between 2
rankings.
(We used the R function “cor.test()” with method=“kendall” to calculate
the test).
Interpretation:
• -1 = strong negative correlation
• 0 = no association
• 1 strong positive correlation
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 19TIR 2018, Regensburg, Germany, 4 Sept. 2018
Kendall correlation coefficient: Results
Null hypothesis: the two rankings are identical
(We used the “cor.test()” R function with method=“kendall”, "two.sided“ a
to calculate the test.
tau -0.1093077;
the p-value of the test is 0.000000003122 (p-value in R: 3.122e−09) which
is less than the significance level p = .05.
We reject the null hypothesis:
If the rankings of SUC and eCare’s word frequency lists are dissimilar at .05
significance level, we can conclude that the content of eCare is different
from SUC.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 20TIR 2018, Regensburg, Germany, 4 Sept. 2018
Kullback–Leibler (KL) Divergence: Theory
(a.k.a. relative entropy)
• KL quantifies how “distant” an estimation of a distribution may be
from the true distribution.
• Interpretation: KL divergence is non-negative and equal to zero if the
two distributions are identical.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 21TIR 2018, Regensburg, Germany, 4 Sept. 2018
Kullback–Leibler (KL) Divergence: Results
• (We do not need a null hypothesis)
• (We used the R function “KL.empirical()”, (log2), package “entropy”
to compute KL divergence).
The KL divergence between SUC and eCare_Sv_01 is 5.80
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 22TIR 2018, Regensburg, Germany, 4 Sept. 2018
Up to now...
• ... the gold standard was not involved
• It is confirmed that two corpora were largely different, but we do not
know whether eCare is representative of the target domain.
TIR 2018, Regensburg, Germany, 4 Sept.
2018
RISE Research Institutes of Sweden, Division ICT - RISE SICS
East, Sweden
23
Log-Likelihood (LL): Theory
(a.k.a. G2)
A reference corpus is needed.
It is a measure based on a contingency table and compares the expected
values in two corpora under observation.
Interpretation: The larger the LL score of a word, the more different its
distribution in the two corpora.
A LL score of 3.8415 or higher is significant at the level of <0.05 and a LL score
of 10.8276 is significant at the level of <0.001 (Desagulier, 2017).
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 24TIR 2018, Regensburg, Germany, 4 Sept. 2018
Log-Likelihood (LL): Results
The intersection between LL scores and the gold standard is 58, i.e. 35.15%.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 25TIR 2018, Regensburg, Germany, 4 Sept. 2018
Burstiness: Theory
Burstiness helps identify words that are frequent in certain
documents, but that are unevenly distributed in the corpus as a
whole.
Implementation in R:
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 26TIR 2018, Regensburg, Germany, 4 Sept. 2018
”Burstiness is like the mean
but it ignores documents
with no intances” (Church
and Gale, 1995) Irvine, A., & Callison-Burch, C. A (2017)
Comprehensive Analysis of Bilingual Lexicon
Induction. Computational Linguistics, 43(2).
Burstiness: Results
Comparison between bursty words
and the chronic diseases’ gold standard
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 27TIR 2018, Regensburg, Germany, 4 Sept. 2018
Discussion
• Both statistical tests confirm that the two corpora are weakly correlated.
No gold standard involved, but based on a Null Hypothesis
• KL divergence returns a large value that indicate that the two corpora are
distant from each other. No gold standard involved.
• LL scores needs a reference corpus. They single out words with different
distributions in two corpora, results are compared against a gold standard,
but it is not clear to which corpus the the words that are singled out belong
to.
• Burstiness can be computed without a reference corpus. Results can be
measured against a gold standard. Provides promising results.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 28TIR 2018, Regensburg, Germany, 4 Sept. 2018
Profiling Bursty Words: Open Issues
• Less empirical cut-off points.
• Is burstiness affected by the size of corpus?
• Evaluation metrics (overlap coefficients and precision@) are not so
indicative. Intersection gives a better idea of the quantification.
• The best way to test the design of gold standards (=target domains)
for this kind of experiments.
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 29TIR 2018, Regensburg, Germany, 4 Sept. 2018
4. Conclusion and
Future Work
What next?
TIR 2018, Regensburg,
Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 30
Conclusion
Mann-Withney-Wilcoxon Test: hypothesis testing on distributions
Kendall correlation coefficient: hypothesis testing on rank correlation
Kullback–Leibler (KL) divergence: requires a reference corpus, cannot be
tested on a gold standard
Log-likelihood: requires a reference corpus, can be tested on a gold standard
Burstiness: does not require a reference corpus and can be tested on a gold
standard
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 31TIR 2018, Regensburg, Germany, 4 Sept. 2018
Future Work
• Implementation of additional burstiness formulas
• Inclusion of multi-words in the frequency lists
• Application of burstiness for domainhood on larger corpora
and other languages
• Investigating the ideal design of a gold standard for
domainhood detection
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 32TIR 2018, Regensburg, Germany, 4 Sept. 2018
Thanks for your attention !
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 33
Any Questions ?
TIR 2018, Regensburg, Germany, 4 Sept.
2018

Mais conteúdo relacionado

Semelhante a Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora

Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Editor IJMTER
 

Semelhante a Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora (20)

Official resume titash_mandal_
Official resume titash_mandal_Official resume titash_mandal_
Official resume titash_mandal_
 
Scientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an OverviewScientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an Overview
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
EGI impact on science and megatrends
EGI impact on science and megatrendsEGI impact on science and megatrends
EGI impact on science and megatrends
 
E1062530
E1062530E1062530
E1062530
 
Cas pratique de la science de la donnée dans le domaine universitaire - Data ...
Cas pratique de la science de la donnée dans le domaine universitaire - Data ...Cas pratique de la science de la donnée dans le domaine universitaire - Data ...
Cas pratique de la science de la donnée dans le domaine universitaire - Data ...
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep Learning
 
Cao report 2007-2012
Cao report 2007-2012Cao report 2007-2012
Cao report 2007-2012
 
20 26 jan17 walter latex
20 26 jan17 walter latex20 26 jan17 walter latex
20 26 jan17 walter latex
 
Big data analysis and modelling
Big data analysis and modellingBig data analysis and modelling
Big data analysis and modelling
 
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 
Air Quality Prediction using Seaborn and TensorFlow
Air Quality Prediction using Seaborn and TensorFlowAir Quality Prediction using Seaborn and TensorFlow
Air Quality Prediction using Seaborn and TensorFlow
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
IRJET- K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
IRJET-  	  K-SVD: Dictionary Developing Algorithms for Sparse Representation ...IRJET-  	  K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
IRJET- K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
 
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
 
Custo sucessivo
Custo sucessivoCusto sucessivo
Custo sucessivo
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
 
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSIONEFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
 

Mais de Marina Santini

Mais de Marina Santini (20)

An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora

  • 1. Can we Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora Marina Santini marina.santini@ri.se RISE Research Institutes of Sweden RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 2. Acknowledgement and Citation Acknowledgement This research was supported by E-care@home, a “SIDUS - Strong Distributed Research Environment” project, funded by the Swedish Knowledge Foundation. Project websit: <<http://ecareathome.se/>>. Cite the paper as: Santini M., Strandqvist W., Nyström M., Alirezai M., Jönsson A. (2018) Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora. In: Elloumi M. et al. (eds) Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham DOI https://doi.org/10.1007/978-3-319-99133-7_17 Presented at : TIR 2018: 15th International Workshop on Technologies for Information Retrieval. In conjunction with DEXA 2018. Regensburg, Germany, 4 Sept. 2018. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 2TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 3. Outline 1. Research questions: evaluating specialized web corpora in terms of ”domainhood” 2. Case study: a web corpus for eCare 3. Methodology: how to measure domainhood 4. Conclusion & Future work TIR 2018, Regensburg, Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 3
  • 4. 1. Evaluating specialized web corpora in terms of ”domainhood” Introduction and Research Questions TIR 2018, Regensburg, Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 4
  • 5. Web Corpora • Web corpora are important • The evaluation of web corpora is important • The evaluation of general-purpose web corpora is advanced • The evaluation of specialized web corpora is less advanced RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 5TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 6. Quantitative Corpus Evaluation “When will a grammar based on one corpus be valid for another? How much will it cost to port a Natural Language Processing (NLP) application from one domain, with one corpus, to another, with another?” Adam Kilgarriff (2001) Comparing corpora. Int. J. Corpus Linguist. 6(1), 97–133 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 6TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 7. Definition: domainhood • Domainhood is the degree of domain representativeness or domain specificity of a web corpus. • Ex: a high frequency of medical terms is a sign that the corpus is a specialized medical corpus • The importance of domain granularity • Coarse domains vs fine-grained domains • Lippincott et al. (2011) “while variation at a coarser domain level such as between newswire and biomedical text is well-studied and known to affect the portability of NLP systems, there is a need to develop an awareness of subdomain variation when considering the practical use of language processing applications […]”. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 7TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 8. Research Questions: Quantifying domainhood • ”is it possible to automatically quantify the domainhood of a web corpus regardless its domain granularity? If so, how?” RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 8TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 9. 2. Case Study: a Web Corpus for eCare TIR 2018, Regensburg, Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 9
  • 10. eCare_sv_01 • eCare_sv_01* • 155 SNOMED CT terms (chronic diseases) * Santini M., Jönsson A., Nystrom M. and Alirezai M. (2017) "A Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results". Proceedings of LTA'17, FedCSIS 2017, Prague. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 10TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 11. 3. Methodology: How to Measure Domainhood Which measures? TIR 2018, Regensburg, Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 11
  • 12. SUC & eCare_sv_01 Stockholm-Umeå Corpus (SUC) -> reference corpus (1 million words) eCare_sv_01: domain-specific corpus (approx. 700 000 words) RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 12TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 13. Metrics 1.Mann-Withney-Wilcoxon Test 2.Kendall correlation coefficient (τ ) 3.Kullback–Leibler (KL) divergence 4.Log-likelihood 5.Burstiness RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 13TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 14. Gold Standard Tokenized gold standard (165 unigrams) RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 14TIR 2018, Regensburg, Germany, 4 Sept. 2018 Seeds (example): atrofisk faryngit atrofisk gastrit Gold Standard (example) atrofisk faryngit gastrit
  • 15. Word Frequency Lists “A word frequency list is a “compact representation of a corpus, lacking much of the information in the corpus but small and easily tractable.” Adam Kilgarriff (2010). Comparable corpora within and across languages, word frequency lists and the KELLY project. In: Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 15TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 16. Ranked Word Frequencies: Scatter Plot RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 16TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 17. Mann-Withney-Wilcoxon Test: Theory Non-parametric test: Using the Mann-Whitney-Wilcoxon Test, we can decide whether the population distributions are identical without assuming them to follow the normal distribution. If the two distributions are dissimilar at .05 significance level, we can conclude that SUC and eCare come from different populations. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 17TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 18. Mann-Withney-Wilcoxon Test: Results • The null hypothesis is that SUC's word frequency list and eCare_sv_01 word frequency list come from identical populations. • To test the hypothesis, we apply the wilcox.test() [R function] to compare the corpora. • The p-value turns out to be 0.019, and is less than the .05 significance level, we reject the null hypothesis. • Conclusion: at .05 significance level, we conclude that SUC and eCare belong to non-identical populations. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 18TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 19. Kendall correlation coefficient: Theory Kendall correlation coefficient (tau) is a non-parametric measure of correlation between two rankings. tau is a probability value which indicates the difference between 2 rankings. (We used the R function “cor.test()” with method=“kendall” to calculate the test). Interpretation: • -1 = strong negative correlation • 0 = no association • 1 strong positive correlation RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 19TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 20. Kendall correlation coefficient: Results Null hypothesis: the two rankings are identical (We used the “cor.test()” R function with method=“kendall”, "two.sided“ a to calculate the test. tau -0.1093077; the p-value of the test is 0.000000003122 (p-value in R: 3.122e−09) which is less than the significance level p = .05. We reject the null hypothesis: If the rankings of SUC and eCare’s word frequency lists are dissimilar at .05 significance level, we can conclude that the content of eCare is different from SUC. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 20TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 21. Kullback–Leibler (KL) Divergence: Theory (a.k.a. relative entropy) • KL quantifies how “distant” an estimation of a distribution may be from the true distribution. • Interpretation: KL divergence is non-negative and equal to zero if the two distributions are identical. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 21TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 22. Kullback–Leibler (KL) Divergence: Results • (We do not need a null hypothesis) • (We used the R function “KL.empirical()”, (log2), package “entropy” to compute KL divergence). The KL divergence between SUC and eCare_Sv_01 is 5.80 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 22TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 23. Up to now... • ... the gold standard was not involved • It is confirmed that two corpora were largely different, but we do not know whether eCare is representative of the target domain. TIR 2018, Regensburg, Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 23
  • 24. Log-Likelihood (LL): Theory (a.k.a. G2) A reference corpus is needed. It is a measure based on a contingency table and compares the expected values in two corpora under observation. Interpretation: The larger the LL score of a word, the more different its distribution in the two corpora. A LL score of 3.8415 or higher is significant at the level of <0.05 and a LL score of 10.8276 is significant at the level of <0.001 (Desagulier, 2017). RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 24TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 25. Log-Likelihood (LL): Results The intersection between LL scores and the gold standard is 58, i.e. 35.15%. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 25TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 26. Burstiness: Theory Burstiness helps identify words that are frequent in certain documents, but that are unevenly distributed in the corpus as a whole. Implementation in R: RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 26TIR 2018, Regensburg, Germany, 4 Sept. 2018 ”Burstiness is like the mean but it ignores documents with no intances” (Church and Gale, 1995) Irvine, A., & Callison-Burch, C. A (2017) Comprehensive Analysis of Bilingual Lexicon Induction. Computational Linguistics, 43(2).
  • 27. Burstiness: Results Comparison between bursty words and the chronic diseases’ gold standard RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 27TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 28. Discussion • Both statistical tests confirm that the two corpora are weakly correlated. No gold standard involved, but based on a Null Hypothesis • KL divergence returns a large value that indicate that the two corpora are distant from each other. No gold standard involved. • LL scores needs a reference corpus. They single out words with different distributions in two corpora, results are compared against a gold standard, but it is not clear to which corpus the the words that are singled out belong to. • Burstiness can be computed without a reference corpus. Results can be measured against a gold standard. Provides promising results. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 28TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 29. Profiling Bursty Words: Open Issues • Less empirical cut-off points. • Is burstiness affected by the size of corpus? • Evaluation metrics (overlap coefficients and precision@) are not so indicative. Intersection gives a better idea of the quantification. • The best way to test the design of gold standards (=target domains) for this kind of experiments. RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 29TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 30. 4. Conclusion and Future Work What next? TIR 2018, Regensburg, Germany, 4 Sept. 2018 RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden 30
  • 31. Conclusion Mann-Withney-Wilcoxon Test: hypothesis testing on distributions Kendall correlation coefficient: hypothesis testing on rank correlation Kullback–Leibler (KL) divergence: requires a reference corpus, cannot be tested on a gold standard Log-likelihood: requires a reference corpus, can be tested on a gold standard Burstiness: does not require a reference corpus and can be tested on a gold standard RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 31TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 32. Future Work • Implementation of additional burstiness formulas • Inclusion of multi-words in the frequency lists • Application of burstiness for domainhood on larger corpora and other languages • Investigating the ideal design of a gold standard for domainhood detection RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 32TIR 2018, Regensburg, Germany, 4 Sept. 2018
  • 33. Thanks for your attention ! RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden Slide: 33 Any Questions ? TIR 2018, Regensburg, Germany, 4 Sept. 2018